The AI Era Needs Data Quality for Unstructured Data, Starting With Documents
More than 80% of enterprise data is unstructured¹ — but traditional Data Quality tools are designed to run checks or queries on structured data in databases and data warehouses.
That means many enterprises aren’t leveraging unstructured data in critical documents that power operations, analytics, compliance, and customer experience, such as:
- Financial reports with updated numbers.
- Product documentation scattered across folders or document repositories.
- Customer support knowledge bases that need constant updating.
That’s where Lightup Data Quality for Unstructured Data comes in.
Why Is Managing Unstructured Data Quality So Challenging?
Documents are a prime example of unstructured data containing enterprise insights and valuable information that can be used to drive business decisions and train AI/LLM models.
Yet documents are typically managed manually, difficult to monitor for quality, and error-prone, which makes them risky inputs for training LLMs or making decisions.
Unstructured data is everywhere, including:
- PDFs of financial reports
- DOCX files for legal, human resources (HR), or product documentation
- Plain text or markdown (.txt, .md) files for notes
- Email archives
- Internal wikis and knowledge bases
- Files stored in cloud repositories like Amazon S3 buckets, Google Drive, OneDrive, or Box
Unstructured data is information that doesn’t conform to a predefined, structured data model or fixed schema and can’t be neatly organized into rows and columns like structured data in databases or data warehouses.
Whereas structured data is easy for machines to read and sort, unstructured data (such as documents, images, text, audio, and video) is designed for people to consume and can’t be directly queried or analyzed using standard methods.
Data Quality for documents is difficult to manage and track at scale due to:
- Structural differences between files and document types.
- Key facts being embedded in free text rather than schema-defined fields.
- Untracked or unmonitored changes within the content.
- Quality issues, like missing information, factual inconsistencies, or exposed sensitive or personally identifiable information (PII).
Yet, unstructured data can hold essential enterprise context and operational knowledge. Financial numbers, compliance clauses, invoices, customer feedback, product information, and more all live in documents. However, regressions or errors of omission in documents often go unnoticed, leading to potential downstream risks and problems.
The Value of Training AI/LLM Models with Unstructured Document Data
As organizations accelerate their AI adoption, they’re realizing that some of the most valuable training data already exists inside their enterprise documents. Unlike data in structured databases, documents often contain rich context, domain-specific information, and operational nuances. That is exactly the kind of company-specific information that AI models and large language models (LLMs) need to be truly useful and relevant for enterprise AI applications.
Enterprise documents:
- Capture the core institutional knowledge with contextual details often missing from databases.
- Include everyday language and terminology used by employees, partners, and customers.
- Are the primary format for business decisions, compliance, and communication.
3 Enterprise Use Cases
- Training Domain-Specific LLMs: Enterprises fine-tune foundational AI models using internal documents, such as technical manuals, customer support FAQs, and policy documents, to improve accuracy in industry-specific tasks. For example, a healthcare provider might use internal protocols and documentation to train a model that assists with medical coding or claims triaging.
- Retrieval-Augmented Generation (RAG) Systems: RAG architectures use enterprise documents as a knowledge base that models can reference at runtime. For example, a model answering a customer question can retrieve content from the latest product documentation or internal wikis to generate a contextually correct, up-to-date response.
- Automated Document Intelligence: AI models are increasingly used to extract structured information from contracts, financial reports, or onboarding documents. By applying natural language processing (NLP) to unstructured content, enterprises can automate workflows like risk scoring, revenue forecasting, and compliance checks at scale.
Simply put, maintaining high-quality enterprise documents becomes a prerequisite for enterprise AI. If your AI is learning from or referencing documents, you need confidence that the information is accurate, complete, and consistent.
Getting Started with Lightup's Unstructured Data Quality
Getting started in Lightup is as simple as connecting an unstructured data source to Lightup:
- Navigate to the Lightup Explorer panel, now with a dedicated tab for unstructured data sources.
- Connect an unstructured data source, such as Amazon S3.
- Enable the Unstructured Data toggle to indicate that folder contents should be profiled.
- Lightup creates a directory tree of files and folders within the S3 bucket.
Supported file types: PDF, .txt, .md (more coming soon)
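As a rough illustration of the directory-tree step, the sketch below nests a list of S3-style object keys into folders and keeps only the supported file types. It is not Lightup's implementation: the function name `build_tree`, the dict-of-dicts layout, and the example keys are all assumptions made for this example, and the keys would in practice come from listing the bucket.

```python
from pathlib import PurePosixPath

# Extensions taken from the supported file types listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md"}


def build_tree(keys):
    """Nest S3-style object keys into a dict-of-dicts directory tree.

    Folders map to nested dicts; supported files map to None.
    Keys with unsupported (or no) extensions are skipped.
    """
    tree = {}
    for key in keys:
        path = PurePosixPath(key)
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue
        node = tree
        # Walk (and create as needed) one nested dict per folder level.
        for folder in path.parts[:-1]:
            node = node.setdefault(folder, {})
        node[path.name] = None
    return tree


# Hypothetical object keys, as they might be returned by listing a bucket.
example_keys = [
    "finance/2024/q1_report.pdf",
    "finance/2024/q2_report.pdf",
    "support/faq.md",
    "support/logo.png",  # unsupported type, skipped
    "notes.txt",
]
tree = build_tree(example_keys)
```

A nested dict keeps the sketch dependency-free; a real profiler would likely also record sizes and modification times per file.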
AI-Powered Document Profiling
After connecting your unstructured data source to Lightup and enabling the Data Profiling toggle for all documents, Lightup will automatically generate an AI-powered summary of facts that includes:
- Document metadata (type, length, creation date)
- Summary of the content
- 5 autogenerated questions and answers (Q&A), highlighting salient document facts
Editable Profiles
Review the data profile for the document, and if it needs adjusting, you can:
- Click “Regenerate” to create a new version of the profile.
- Manually edit the profile to add or delete Q&As based on domain knowledge and context.
- Save updated profiles for monitoring.
Since AI is non-deterministic by nature, each regenerated profile may offer a different perspective.
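One way to picture a document profile and the manual edits described above is as a small record holding metadata, a summary, and a list of Q&A pairs. The sketch below is purely illustrative: the class names, field names, and methods are assumptions for this example and do not reflect Lightup's internal data model.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class DocumentProfile:
    # Metadata fields mirror the profile contents: type, length, creation date.
    doc_type: str
    length_chars: int
    created: date
    summary: str
    qa_pairs: list = field(default_factory=list)

    def add_qa(self, question, answer):
        """Manually add a Q&A pair based on domain knowledge."""
        self.qa_pairs.append(QAPair(question, answer))

    def remove_qa(self, question):
        """Delete every Q&A pair whose question matches exactly."""
        self.qa_pairs = [qa for qa in self.qa_pairs if qa.question != question]


# Hypothetical profile for a quarterly report.
profile = DocumentProfile("pdf", 48210, date(2024, 1, 15), "Quarterly financial report.")
profile.add_qa("What was total revenue?", "$1M")
profile.add_qa("Who signed the report?", "The CFO")
profile.remove_qa("Who signed the report?")
```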
Auto Metrics for Documents and Folders
Once a data profile is activated, Lightup enables document- and folder-level Auto Metrics for continuous observability. Purpose-built for monitoring the quality of documents to understand the accuracy and reliability of the content, Lightup provides out-of-the-box coverage for the four primary dimensions of Data Quality for documents that can degrade or become problematic at scale:
- Inaccuracy
- Inconsistency
- Incompleteness
- Personally Identifiable Information (PII) Contamination
Document-Level Metrics
Lightup’s document-level metrics support rules, monitors, and alerts, keeping your teams notified as soon as anomalies occur. Each metric can be scheduled to run at regular or custom intervals, such as hourly, daily, weekly, or monthly in a specified time zone.
- Inaccuracy: Flags changes in factual data (e.g., revenue changed from $1M to $1.2M).
- Inconsistency: Detects contradictions within the document and presents a side-by-side comparison of conflicting information (e.g., document indicates revenue of $100K multiple times, but also mentions a different revenue figure of $120K).
- Incompleteness: Identifies information from the original Q&As that is now missing, indicating factual gaps or a loss of completeness over time (e.g., net income figures were included in the original report but are missing from subsequent reports).
- PII Contamination: Detects and lists instances of personally identifiable information (PII) and provides the count and examples of detected PII fields (e.g., name and date of birth).
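To make the metrics above more concrete, here is a minimal sketch of how Inaccuracy and Incompleteness could be derived by comparing a baseline set of Q&A answers with answers extracted from a later version of the document, plus a deliberately simple regex-based PII scan. This is an assumption-laden illustration, not Lightup's method: the function names are hypothetical, the answers are assumed to have been extracted already (for example by an LLM), and the two PII patterns cover only email addresses and US-style Social Security numbers.

```python
import re


def compare_answers(baseline, current):
    """Compare baseline Q&A answers with answers from a later document version.

    Both arguments map question -> answer. A question that is absent from
    `current` (or maps to None) counts as missing information.
    Returns (inaccuracies, missing).
    """
    inaccuracies = []
    missing = []
    for question, old in baseline.items():
        new = current.get(question)
        if new is None:
            missing.append(question)
        elif new.strip().lower() != old.strip().lower():
            inaccuracies.append((question, old, new))
    return inaccuracies, missing


# Deliberately narrow patterns; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return {pii_type: [matches]} for every pattern with at least one hit."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits


# Hypothetical baseline profile and a later version of the same report.
baseline = {"What was revenue?": "$1M", "What was net income?": "$200K"}
current = {"What was revenue?": "$1.2M"}
inaccuracies, missing = compare_answers(baseline, current)
pii = find_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Inconsistency is left out of the sketch because detecting contradictions within free text generally needs a language model rather than string comparison.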
Folder-Level Metrics
Custom Metrics
- Navigate to the unstructured data source, then right-click and select Create Metric.
- Define schema (e.g., Value: Income, Type: Number).
- Create a metric using natural language, such as “Extract net income from financial report.”
- Schedule metric collection runs to trigger document scans.
- Activate monitors to track the output, enable Anomaly Detection, and define preferred channels for alerts, while Lightup automatically tracks incidents in dashboards.
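The schema step in the list above can be thought of as a type contract applied to whatever value is extracted from the document. The sketch below assumes the extraction itself has already happened (for example, an LLM answering the natural-language prompt) and only shows the coercion of a raw string into the declared type. The dict keys `value` and `type`, the `String` type, and the function names are hypothetical and not Lightup's API.

```python
import re

# Multipliers for magnitude suffixes commonly seen in financial text.
_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}


def coerce_number(raw):
    """Parse strings such as "$120K" or "1,200,000" into a float.

    Raises ValueError if the text doesn't look like a number.
    """
    match = re.fullmatch(
        r"\s*\$?\s*(-?[\d,]*\.?\d+)\s*([KMB])?\s*", raw, re.IGNORECASE
    )
    if not match:
        raise ValueError(f"not a number: {raw!r}")
    value = float(match.group(1).replace(",", ""))
    suffix = match.group(2)
    return value * _SUFFIXES[suffix.upper()] if suffix else value


def apply_schema(schema, raw_value):
    """Validate one extracted value against a {"value": ..., "type": ...} schema."""
    if schema["type"] == "Number":
        return {schema["value"]: coerce_number(raw_value)}
    if schema["type"] == "String":
        return {schema["value"]: raw_value.strip()}
    raise ValueError(f"unsupported type: {schema['type']}")


# Schema from the example above; the raw value stands in for extracted text.
schema = {"value": "Income", "type": "Number"}
record = apply_schema(schema, "$120K")
```

Rejecting values that fail coercion, rather than storing them as text, keeps the resulting metric series numeric so that monitors can be applied to it.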
Anomaly Detection and Alerting
Anomaly Detection tracks trends over time and alerts you if quality signals change. For example, if you typically see 4 to 5 inaccuracies for a particular document and suddenly get 10, Lightup flags that incident and notifies your team.
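The example above can be approximated with a simple deviation-from-baseline rule. The following sketch uses a z-score style check over past metric values. It is a stand-in technique chosen for illustration and not a description of Lightup's Anomaly Detection, and the function name, threshold, and minimum spread are assumed values.

```python
from statistics import mean, pstdev


def is_anomalous(history, latest, threshold=3.0, min_spread=1.0):
    """Flag `latest` if it deviates strongly from the historical baseline.

    The deviation is measured in units of the historical standard deviation,
    floored at `min_spread` so a very stable history doesn't make every
    small change look anomalous.
    """
    baseline = mean(history)
    spread = max(pstdev(history), min_spread)
    return abs(latest - baseline) / spread > threshold


# Inaccuracy counts from previous runs: typically 4 to 5 per run.
history = [4, 5, 4, 5, 4]
spike = is_anomalous(history, 10)
normal = is_anomalous(history, 5)
```

With this history, a count of 10 is flagged while a count of 5 is not. Production systems usually also account for trend and seasonality, which this rule ignores.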
Role-Based Access Control (RBAC)
- Users see metrics, profiles, or PII Contamination results only if authorized.
- Sensitive documents are monitored securely for compliance.
Role-based access control ensures that users without permissions to view sensitive data, like PII, won’t be able to access profiles.
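A minimal sketch of the access rule described above, assuming a simple role-to-permission mapping. The role names, permission names, and functions are hypothetical and chosen only for illustration; they are not Lightup's RBAC model.

```python
# Hypothetical roles and permission names, chosen only for illustration.
ROLE_PERMISSIONS = {
    "viewer": {"view_metrics"},
    "analyst": {"view_metrics", "view_profiles"},
    "compliance": {"view_metrics", "view_profiles", "view_pii"},
}


def has_permission(role, permission):
    """Unknown roles get no permissions (deny by default)."""
    return permission in ROLE_PERMISSIONS.get(role, set())


def can_view_profile(role, profile_contains_pii):
    """A profile that contains PII requires the PII permission as well."""
    if not has_permission(role, "view_profiles"):
        return False
    return has_permission(role, "view_pii") or not profile_contains_pii


analyst_sees_pii_profile = can_view_profile("analyst", True)
compliance_sees_pii_profile = can_view_profile("compliance", True)
```

Denying by default for unknown roles means a misconfigured user sees nothing rather than everything.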
Explore Our Open Source Project for Unstructured Data Quality
Since we believe AI-ready documents should be accessible to everyone, we’re happy to share a Python library for assessing unstructured Data Quality, available on GitHub as an open source project.*
- Connect an LLM and S3 bucket to your project.
- Use our open source code library to evaluate the accuracy and reliability of documents, including PDFs, text files, and Markdown files.
- Run computational checks: Inconsistency, Incompleteness, Inaccuracy, and PII contamination.
Explore our open source project today, contribute your ideas, and deploy anywhere!
Questions? Reach out at info@lightup.ai
*Unstructured Data Quality with Anomaly Detection, Alerting, RBAC, and other enterprise features is available for customers and trial users only.
1. Mary Shacklett, “Structured vs Unstructured Data: Key Differences,” Datamation, November 3, 2023.