In the ever-evolving realm of Data Quality Monitoring and Data Observability, we’ve witnessed a significant technology transformation over the past few years. Our earlier blog post explored the diverse product philosophies behind these tools, highlighting the differences and similarities between Data Quality Monitoring and Data Observability products.
Now, let’s dive into the underlying architectures of these technologies, explaining their intricacies and implications for business users and vendors.
Our goal is to help you understand the fundamental architectures of Data Quality Monitoring and Data Observability platforms, so you can make the right long-term strategic investment in the right tool.
Legacy Data Quality Architecture
The traditional or legacy approach to Data Quality architecture follows an extract-and-inspect model, where data is pulled from various sources into the Data Quality platform to run checks or repair the data.
This architecture is appealing for its logical simplicity for both users and vendors. Essentially, the Data Quality platform connects to a data source or spreadsheets, pulls the data out, and runs Data Quality checks. Data Quality vendors can add new data connectors relatively quickly, since connectors just need to issue a “SELECT *” query to the data source — an easy implementation for data engineers.
However, despite the appeal, there are inevitable drawbacks. While this architecture works well for spreadsheets or a handful of tables in a MySQL database, deploying Data Quality checks at petabyte scale is a different story. Why? When processing 100x the data volume in modern Data Lakehouses like Databricks and Snowflake, the extract-and-inspect architecture chokes.
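To make the pattern concrete, here’s a minimal sketch of extract-and-inspect, using Python’s built-in sqlite3 as a stand-in data source (the table and checks are hypothetical, not any vendor’s actual implementation):

```python
import sqlite3

# Stand-in source database (hypothetical example, not a real vendor API).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, price REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.99), (2, None), (3, -5.0)])

# Extract-and-inspect: pull ALL rows out of the source...
rows = conn.execute("SELECT * FROM orders").fetchall()

# ...then run the Data Quality checks inside the DQ platform's own process.
null_prices = sum(1 for _, price in rows if price is None)
negative_prices = sum(1 for _, price in rows if price is not None and price < 0)

print(null_prices, negative_prices)
```

The connector really is just a `SELECT *` client, which is why it’s quick for vendors to build. But every check pays for a full extract, and at petabyte scale that extract is exactly what chokes.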
To address the scalability issue, the architecture had to evolve. This meant embedding a Spark engine, or a similar MapReduce-style processing engine, within Data Quality platforms. It seemed like a logical and straightforward solution, since Spark can process data at scale. Plus, it’s open source. Adding it to their products let Data Quality vendors partition the Data Quality workload without overhauling the entire platform.
In theory, that was an “easy” way to enable scalability, while still preserving the legacy user experience.
But, the “easy” way isn’t always the best way. Here’s why. Embedding a Spark engine creates a mountain of DevOps problems for end users, especially when the platforms need to run in users’ compute environments, like virtual private clouds (VPC).
Even if end users are willing to consume the products as pure SaaS services running in the vendors’ Cloud environments, vendors typically struggle with delivering a reliable, operational product due to the complexity of managing Spark clusters at scale.
Arguably, managing Spark at scale is so complex, it’s nearly impossible for a vendor — unless, of course, you’re Databricks.
Ultimately, is it even feasible, sustainable, or cost-effective to maintain such operations?
Data Observability Architecture
Data Observability vendors entered the market to address scalability with a different approach: Leveraging metadata to monitor Data Quality — without scanning the entire dataset. By focusing on metadata — data about data — Data Observability tools aim to circumvent scalability challenges associated with the traditional extract-and-inspect architecture.
The rationale behind using a metadata-first architecture for Data Observability platforms to solve Data Quality Monitoring at scale is logical:
- To monitor Data Quality and get Data Quality health indicators, use precomputed summaries of Data Quality dimensions, if possible.
- These summaries — which are “data about data” or metadata by definition — are often available out-of-the-box, so just reuse them.
The benefit of this approach? No data extracts are required that would otherwise choke system performance, so the scalability challenge disappears.
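As a sketch of what metadata-first monitoring looks like in practice, the check below reads only a precomputed catalog, never the table itself. The catalog here is a mock sqlite3 table standing in for something like Snowflake’s INFORMATION_SCHEMA.TABLES view (which exposes precomputed columns such as ROW_COUNT and LAST_ALTERED); the table names are hypothetical:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
# Mock metadata catalog standing in for a Snowflake-style
# INFORMATION_SCHEMA.TABLES view with precomputed summaries.
conn.execute(
    "CREATE TABLE catalog (table_name TEXT, row_count INTEGER, last_altered TEXT)")
now = datetime.now(timezone.utc)
conn.execute("INSERT INTO catalog VALUES (?, ?, ?)",
             ("orders", 1_000_000, (now - timedelta(hours=2)).isoformat()))

def is_fresh(table: str, max_age_hours: int) -> bool:
    """Freshness check from metadata alone -- no scan of the monitored table."""
    (last_altered,) = conn.execute(
        "SELECT last_altered FROM catalog WHERE table_name = ?", (table,)
    ).fetchone()
    age = datetime.now(timezone.utc) - datetime.fromisoformat(last_altered)
    return age <= timedelta(hours=max_age_hours)

print(is_fresh("orders", 24))
```

Note what this check cannot do: nothing in the catalog says whether any price in `orders` is wrong. That limitation is the crux of what follows.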
However, this approach merely shifts the problem.
How so? Someone or some system still needs to compute metadata from data to meet the functional requirements of monitoring Data Quality. While using precomputed metadata properties avoids full table scans, it limits the depth of Data Quality checks.
The depth of checks is restricted to Data Quality Indicators that are captured by precomputed metadata, which is commonly only freshness and availability of data. How do you check for pricing errors or other business-specific Data Quality Indicators that need customized “metadata” properties?
Unless users are willing to compromise by only getting shallow metadata insights, Data Observability tools won’t provide the deep Data Quality Monitoring coverage required to truly understand the quality of the actual data flowing through data pipelines.
Simply put, Data Observability trades off depth of monitoring Data Quality for scalability.
Lightup’s Modern Data Quality Monitoring Architecture
A progressive solution combines the strengths of both architectures by pushing queries down into the data platform. In this new modern Data Quality Monitoring architecture, a Data Quality Indicator (DQI) is computed as an SQL query directly against the data source, such as Databricks or Snowflake. And, when available, the DQI is derived from precomputed metadata properties without incurring data scans.
One of the biggest benefits of this architecture compared to the legacy approach? Easy scalability without compromising the user experience or the depth of checks. The best part? It won’t become a DevOps maintenance nightmare for end users or vendors.
How’s that possible? Since Data Quality checks are pushed down to the underlying data platform, which is inherently scalable — supporting other at-scale tasks like analytics and Machine Learning (ML) — Data Quality Indicator computation is also supported at scale.
How does this compare to Data Observability architecture? Unlike the metadata-first architecture, the modern pushdown architecture retains the full power to model any Data Quality Indicator, providing flexibility and depth in checks. That means you aren’t restricted to just predefined metadata properties.
But, what’s the trade-off?
- More effort for vendors. Every data source needs a data connector that implements logic to translate the defined DQI into an SQL query in the data source dialect. Sounds simple, but the connector isn’t just a trivial “SELECT *” client like it is in the legacy model. It requires a full-fledged implementation by the vendor.
What about the data source compute load and cost? To avoid less sophisticated implementations that trigger full table scans for every Data Quality check, a well-engineered deployment uses the data model’s time-indexing and time-partitioning structure for incremental computations.
Unfortunately, since this optimization creates a heavier burden for the vendor, it’s not the norm. However, when done right, a pushdown architecture helps optimize compute costs.
Here’s how.
To compute a DQI that cannot be derived from metadata, the system needs to scan the data at least once — and, ideally, only once. When combining stateful, incremental DQI queries with indexing and partitioning, a modern pushdown architecture does exactly that.
The overhead? In typical at-scale deployments, the added computation cost is marginal (~1%). Costs only climb with naive pushdown implementations that trigger full table scans on every run or otherwise scan the same data multiple times. An efficient pushdown architecture scans the data only as much as necessary: once. After that, the cost of compute is negligible.
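The stateful, incremental pattern can be sketched as follows: keep a watermark over the time-partition column and fold only new partitions into running aggregates. The schema, state layout, and function name are hypothetical; sqlite3 stands in for a time-partitioned lakehouse table, where in a real deployment the `day > ?` predicate would prune partitions rather than filter rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, price REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("2024-01-01", 5.0), ("2024-01-01", None),
                  ("2024-01-02", 7.0)])

# DQI state: running counts plus a watermark over the time-partition column.
state = {"nulls": 0, "rows": 0, "watermark": ""}

def update_null_rate(state):
    """Incrementally fold in only the partitions newer than the watermark."""
    new_nulls, new_rows, max_day = conn.execute(
        "SELECT COUNT(*) - COUNT(price), COUNT(*), MAX(day) "
        "FROM events WHERE day > ?", (state["watermark"],)
    ).fetchone()
    if new_rows:
        state["nulls"] += new_nulls
        state["rows"] += new_rows
        state["watermark"] = max_day
    return state["nulls"] / state["rows"]

first = update_null_rate(state)   # scans the existing days once
conn.execute("INSERT INTO events VALUES ('2024-01-03', NULL)")
second = update_null_rate(state)  # touches only the new day's data
print(first, second)
```

Each row is scanned exactly once across all runs, which is why, done right, the marginal compute cost stays small.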
Choosing the Right Architecture for Your Enterprise
Deciding on which architecture to choose for your enterprise depends on your functional and scale requirements.
- For small-scale data volumes, the legacy Data Quality architecture may suffice, offering a wide range of tool options.
- If functional requirements are shallow and limited, then Data Observability’s metadata-first architecture may be sufficient.
- If metadata insights are enough for your purposes and scale isn’t an issue, then any architecture should be suitable.
- For complex business-specific Data Quality checks on large enterprise data volumes, the modern pushdown Data Quality architecture is the optimal choice for both depth and scale — addressing the shortcomings of both legacy and metadata-first architectures.
To simplify the decision-making process, if you’re unsure about your requirements, identify the most functionally complex and at-scale Data Quality checks your enterprise needs. That way, you’ll know the business-critical checks to stress-test when evaluating Data Quality Monitoring or Data Observability solutions.
As Data Quality Monitoring and Data Observability technologies continuously advance, understanding their architectural nuances is crucial for selecting tools that align with your organization’s current needs and future goals.