
Myth-Busting Straight Talk About Data Quality Monitoring and Data Observability


What is Data Observability vs. Data Quality Monitoring? What are they used for?
Are they interchangeable? Do they overlap?

 

Whether posed by customers, investors, or reporters, these questions inevitably come up.

Blurred messaging. Inflated positioning. Broad claims. Overstated capabilities. No wonder there’s market confusion over the differences between Data Observability and Data Quality Monitoring.

Is there a straightforward way to classify and understand these products and capabilities? We’re here to set the record straight and explain the key differences and use cases for Data Observability and Data Quality Monitoring.

First, we’ll look at the evolution of data to understand the key changes and challenges that emerged, creating a new need for Data Observability and a modern approach to Data Quality Monitoring. Then, we’ll highlight pertinent use cases for Data Observability and Data Quality Monitoring. Finally, we’ll debunk four popular myths around Data Observability and demystify the relationship between Data Observability and Data Quality Monitoring, once and for all.

Our goal is to clarify the convoluted messaging in the market and help you understand the appropriate and intended use cases for both. We’re not going to tell you if one is better than the other, but hopefully, we’ll shine a light on how these technologies have evolved and why each is useful in their own right. (You may be surprised to learn that these technologies aren’t necessarily an “either/or” decision for modern data stacks — and, in some cases, they even overlap…)

Data Evolution

To better understand the current data technology landscape, let’s review how data and related technologies have evolved over time.

Data evolution resulted in three distinct product categories for checking data:

  1. Traditional Data Quality: Extract and inspect architecture with full table scans
  2. Data Observability: Primarily metadata and log checks, with data scans left to the user or sometimes omitted
  3. Data Quality Monitoring: Pushdown architecture with in-place, incremental table scans and basic metadata checks
Figure: Summary timeline of how data evolved with related emerging technologies, 2001 to 2019.

Traditionally, Data Quality tools have always been about running Deep Data Checks to analyze the health of the data itself. That is, they specialize in performing data scans by pulling data out of the source system, copying it, and running Data Quality Checks on the copy. That worked well when data volumes were lower, before the rise of Big Data around 2001.

However, by 2012, when Big Data went mainstream and Cloud Data Warehouses became the new norm, organizations started running into Data Quality scalability challenges.

As data volumes continuously expanded, the question became, “How do I solve Data Quality at Cloud scale?” In 2017, an alternative solution emerged: Data Observability. Derived from the principles of infrastructure observability, which relies on logs and standardized metrics, Data Observability offered metadata analysis for data pipelines instead of running full table scans.

Using metadata checks, Data Observability platforms enable organizations to observe the health and behavior of data infrastructure and pipelines to understand how data flows through different systems. Tracking these metrics provides observations about data processing speed, latency, throughput, and resource usage, answering questions such as the following (a sketch of one such check appears after the list):

  • Are my jobs running?
  • Are systems responsive?
  • Is data flowing?
  • Are services up and returning healthy status signals?
  • Is my data warehouse processing data?
  • Is my Snowflake cluster up and running, or responsive?
  • Is data coming in from ETL tools into a lakehouse or data warehouse?
  • Are my ELT jobs running successfully and on the right cadence?
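
For illustration, here’s a minimal sketch of the kind of metadata check these platforms automate, assuming a Snowflake-style warehouse where INFORMATION_SCHEMA.TABLES exposes ROW_COUNT and LAST_ALTERED (the database and schema names are hypothetical):

    -- Metadata-only check: is the table being updated, and how many rows does it report?
    -- This reads catalog metadata; it never touches the actual data values in the table.
    SELECT table_name,
           row_count,
           last_altered
    FROM   my_db.information_schema.tables
    WHERE  table_schema = 'SALES'
      AND  last_altered < DATEADD(hour, -1, CURRENT_TIMESTAMP());  -- flag tables with no update in the last hour

Because a query like this touches only the catalog, it runs in near-constant time no matter how large the underlying tables grow.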

Since metadata is, by nature, much lower in volume than the actual data it describes, monitoring metadata is far more scalable. Problem solved, right?

Not really. While Data Observability platforms will tell you if data is flowing, metadata checks aren’t sufficient to monitor the health of actual data flowing in the pipes.

It turns out there’s a large class of Data Quality problems that can only be solved with Deep Data Checks. For instance, let’s say you’re reporting on sales numbers. Your Data Observability tool or Infrastructure Monitoring system may tell you that jobs are running fine, but your data is showing zero sales — affecting everything downstream.

Your Data Quality is clearly broken.

Key Takeaway: Data Observability tools are good at monitoring the status or condition of data pipelines and infrastructure.

In other cases, you need to know whether 100 rows landed in a table. Your metadata check indicates that 100 rows showed up after a job finished processing. All good, right? Sometimes.

Sure, you got 100 rows. But a field in those rows was supposed to contain 1,000 as its value, not 1 or 5. That error won’t show up with a metadata check. A metadata check will only tell you that rows were inserted and that there were 100 of them.

Only Deep Data Checks will tell you if there are errors with the data values in the table fields.
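
To make that concrete, here’s a minimal sketch of a Deep Data Check in SQL, assuming a hypothetical sales_orders table with an amount column; a Data Quality Monitoring platform would generate and schedule checks like this automatically:

    -- Deep Data Check: scan actual column values, not just row counts.
    -- Catches the "100 rows arrived, but the values are wrong" failure mode.
    SELECT COUNT(*)                                      AS total_rows,
           SUM(CASE WHEN amount <= 0 THEN 1 ELSE 0 END)  AS nonpositive_amounts,
           MIN(amount)                                   AS min_amount,
           MAX(amount)                                   AS max_amount
    FROM   sales_orders
    WHERE  order_date = CURRENT_DATE;                    -- scan only today's slice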

While metadata checks offer a promising approach, Data Observability still has shortcomings around Data Quality, because solving essential Data Quality problems is impossible without Deep Data Checks.

For example, when it comes to data freshness, you want to know that data is fresh and delay is low. Metadata checks will tell you that new rows were inserted in a table in the last hour. Your team will assume that, since the table got updated in the last hour, the data is fresh. Not always.

Here’s why.

Sure, you processed data in the last hour that was added to the table. But, your whole pipeline is running 24 hours behind, which means you just processed day-old data.

The only way you can check that is by using a Deep Data Check to scan the actual timestamps. Then, and only then, will you notice that the date of the data inserted within the last hour is a day old.
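
Here’s a minimal sketch of such a freshness check, assuming a hypothetical events table whose event_timestamp column records when each row’s data was actually produced (date functions vary by warehouse):

    -- Freshness via a Deep Data Check: compare the newest business timestamp to now.
    -- A metadata check only sees that rows were inserted; this sees how old those rows are.
    SELECT MAX(event_timestamp)                     AS newest_event,
           TIMESTAMPDIFF(hour, MAX(event_timestamp),
                         CURRENT_TIMESTAMP)         AS hours_behind
    FROM   events;
    -- Alert when hours_behind exceeds the expected delay (say, 2 hours), even if
    -- the table itself was updated within the last hour.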

Key Takeaway: Metadata checks can be unreliable for monitoring Data Quality dimensions like data freshness, but Deep Data Checks will indicate if data is reliable or has been accurately refreshed.

A Closer Look at Deep Data Checks with Data Quality Monitoring vs. Metadata Checks with Data Observability

 

What are Deep Data Checks? Data Quality tools run Deep Data Checks that scan the actual data values in tables at the column and row levels. Data Observability platforms, by contrast, focus on metadata checks — the ability to monitor data about data, providing high-level summaries of data stack performance.

How do they compare? Essentially, Data Observability gives you scalability across high data volumes at the expense of depth, since metadata checks are shallow. Traditional Data Quality gives you depth of checks at the expense of limited scalability. Data Quality Monitoring offers the best of both: all the depth and power of Deep Data Checks with Cloud-native scalability.

Figure: Comparison of the depth of checks and scalability of Traditional Data Quality, Data Quality Monitoring, and Data Observability; Data Quality Monitoring is both highly scalable and very deep.

Some Data Observability vendors offer everything from monitoring data pipelines, to Data Quality, to data lineage, a capability that doesn’t really belong in Data Quality tools. (We believe data lineage belongs in Data Catalog platforms, like Atlan and Alation.)

Admittedly, it’s confusing. Very confusing…

Revisiting the origins of Data Observability, the approach was originally based on metadata. As such, Data Observability vendors have positioned their solutions as a way to monitor metadata in order to understand:

  • Data Assets – tables, files, views
  • Data Pipelines – operating status, problems, effectiveness
  • Data Users – what assets are being used, types of queries running
  • Data Infrastructure – health of infrastructure, costs, performance

Here’s why that matters.

Given: If a data pipeline is broken, the data inside is also broken. But if the data pipeline is healthy (i.e., jobs are processing, running on a cadence, and completing successfully), does that mean the data inside is healthy? Not necessarily.

That’s because “hidden data outages” are hard to detect. Just because jobs are processing or data is flowing, that doesn’t automatically imply that the data itself is good.

Let’s take a common scenario: You want to detect when files land in Amazon S3 before processing them. If your Change Data Capture (CDC) feed delivers empty files, both Data Observability and Data Quality tools can discover that error once the data is available in your Lakehouse or Data Warehouse. On the other hand, if your CDC feed delivers non-empty files that are missing records from the most recent business day and instead contain records from older business days, Data Observability tools can’t detect that type of error. It can only be caught by checking data delay or row counts against a business date (timestamp). Data Observability metadata checks can’t find that type of data discrepancy; only Deep Data Checks can.
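
Here’s a minimal sketch of the business-date check described above, assuming the CDC output lands in a hypothetical cdc_landing table that carries a business_date column:

    -- Row count by business date: did the most recent business day actually arrive?
    SELECT business_date,
           COUNT(*) AS row_count
    FROM   cdc_landing
    GROUP  BY business_date
    ORDER  BY business_date DESC
    LIMIT  5;
    -- If the newest business_date is a day (or more) old, the pipeline delivered stale
    -- records even though files landed, jobs completed, and metadata looked healthy.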

Key Takeaway: Data Observability platforms specialize in metadata monitoring, which is ideal for monitoring data infrastructure and data pipelines.

Debunking Four Myths Around Data Observability vs. Data Quality

Coined in 2018, the term “Data Observability” created a new niche in the market, positioned as the end-all solution for all things observability in the data management world. But has Data Observability lived up to its hype?

Here are four myths, debunked.

Myth 1

Data Observability does everything that Data Quality tools do and a lot more. Data Quality is a component of Data Observability, a type of anomaly that Data Observability tools monitor.

Debunked

Data Quality isn’t an anomaly for Data Observability tools to monitor. Looking “under the hood,” Data Observability tools were originally designed to run metadata and log checks.

Key Takeaway
Data Observability tools specialize in metadata checks, not Deep Data Checks on cell-level table data.

Myth 2

Data Quality problems can be solved with Data Observability metadata checks only.

Debunked

Metadata checks can’t detect values out of expected ranges or accurately test data freshness.

Key Takeaway
Metadata checks have limitations and can’t solve classic Data Quality issues on their own.

Myth 3

Our Data Observability platform enables us to write custom SQL queries to address Data Quality issues.

Debunked

While you can write your own SQL queries (with or without Data Observability tools), implementing Data Quality checks by writing SQL manually typically takes too much effort and doesn’t scale across an enterprise footprint. Moreover, there are limitations around the types of Data Quality checks and custom SQL queries that Data Observability platforms support. For example, complex Data Quality checks that require table comparisons or reconciliation checks across different data sources aren’t supported by Data Observability tools and can’t be accomplished with plain SQL queries alone.
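
For reference, the kind of one-off, single-table check that is practical in hand-written SQL looks like this (the orders table and customer_id column are hypothetical):

    -- A one-off null-rate check on a single table: feasible in plain SQL.
    SELECT COUNT(*)                                             AS total_rows,
           SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_ids
    FROM   orders;
    -- Reconciling this table against its source in a different system (say, an
    -- operational database feeding the warehouse) can't be expressed as a single
    -- query, because one SQL statement can't reach both systems.

Maintaining hundreds of such hand-written checks, each with its own thresholds and schedules, is where the do-it-yourself approach stops scaling.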

Key Takeaway 
While one-off Data Quality checks for single tables can be accomplished with SQL, Data Observability platforms don’t support complex Data Quality checks required to solve unique business problems and common Data Quality use cases, such as comparing multiple tables or creating reconciliation checks across disparate data sources.

Myth 4

Monitoring data stacks at Cloud scale can only be done with Data Observability metadata checks, because traditional Data Quality checks are too expensive to scale.

Debunked

Fact tables almost always have time indexing or time partitioning; without that structure in the data model, traditional Data Quality queries aren’t scalable. Moreover, legacy Data Quality platforms are built on an “extract and inspect” architecture, which creates a never-ending loop of expanding infrastructure costs as data is moved and copied in order to perform full table scans.

Deploying Data Quality checks optimally requires scanning new data with checkpointing, which can’t be done with metadata; it requires a data scan. However, running full table scans every time a Data Quality check is needed — on an hourly or daily basis — isn’t scalable or cost-effective. The right architecture scans data once and only once. With a modern pushdown architecture (like Lightup’s), you can run time-bound, incremental scans. That scales extremely well — without incurring the costs of full table scans.
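
Here’s a minimal sketch of that incremental, time-bound pattern, assuming a time-partitioned fact_sales table and a stored checkpoint (the :last_checkpoint placeholder below is hypothetical) recording where the previous scan stopped:

    -- Incremental scan with checkpointing: each run scans only the new slice of data.
    SELECT COUNT(*)                                      AS new_rows,
           SUM(CASE WHEN amount <= 0 THEN 1 ELSE 0 END)  AS bad_values
    FROM   fact_sales
    WHERE  event_time >  :last_checkpoint                -- resume where the last scan ended
      AND  event_time <= CURRENT_TIMESTAMP;              -- bound the scan window
    -- Persist the window's upper bound as the new checkpoint after the run, so every
    -- row is scanned once and only once; time partitioning prunes the scan to the
    -- newest partitions instead of the whole table.

Pushed down to the warehouse, a query like this touches only the newest partitions, so the cost of each check stays proportional to new data rather than total table size.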

Key Takeaway
If the ultimate goal is to detect bad data, you must scan the actual data, not just metadata. Monitoring the health of data stacks at Cloud scale can be done cost-effectively with the right Data Quality architecture.

The Case for Both?

Are Data Observability and Data Quality Monitoring platforms competitive or complementary technologies? Is there a case for both in modern enterprise data stacks?

Yes, depending on the types of problems you’re trying to solve.

Figure: Monitoring capabilities provided by IT/Application Monitoring, Data Observability, and Data Quality Monitoring for visibility into the health of modern data stacks.

Data Observability emerged as an initial workaround to the problem of scaling Data Quality. And now, even though some Data Observability vendors claim to do everything — including Deep Data Quality Checks — they typically partition their Data Observability and Data Quality capabilities into two separate products, offering only basic Data Quality checks or leaving Data Quality checks to be built with hand-written SQL.

Over the last few years, we’ve seen a convergence of Data Observability and Data Quality Monitoring, where Data Observability vendors offer basic Data Quality Checks and Data Quality vendors offer metadata checks. While the boundaries between product categories and capabilities are gradually blurring over time, ultimately, the technology you choose should be based on the acute problems you’re trying to solve.

Figure: Comparative summary of Data Observability (metadata-first) and Data Quality Monitoring (deep data-first).

If you want a solid solution for Data Infrastructure Monitoring or Monitoring for FinOps, then Data Observability satisfies those use cases with metadata checks. But, if the ultimate business objective is to solve Data Quality scalability and ensure that the content flowing through pipelines is good, then you need a platform with a modern pushdown architecture, like Lightup, to deploy Deep Data Quality Checks.

Bottom line: Data Quality can’t be solved by metadata checks, alone, and using Data Observability tools to solve Data Quality problems is a compromise no one can afford.

TL;DR

Data Observability is metadata-first.

Data Quality is deep data-first.

Data Quality tools are singularly focused on solving the quintessential Data Quality problem. Data Observability tools address a diverse set of use cases, depending on the specific tool. Overall, every tool in the proverbial “Data Quality and Data Observability” category strives to make data and data products more reliable and cost-effective to operate. But, the gap between these tools may shrink over time.

While no two tools are the same, Data Quality tools specialize in deep data checks, and Data Observability tools have metadata at their core. Ultimately, what works best for you depends on your specific needs and the problems you’re trying to solve.

Seeing is believing: Start a free trial today to see the Lightup difference.
