
Data Quality Coverage and Readiness Part 2

How to Quantify Your Data Quality Coverage

Image by Juliana Castro, CC BY-SA 4.0, via Wikimedia Commons

In the first part of this series, we took a look at the different factors that impact your data quality and the symptoms of “bad data”. We also learned that the data an organization considers business-critical is only a subset of the total amount of data it collects. Because this data is essential for effective decision-making, implementing a system of quality checks applied throughout your data pipelines should be a priority for any data-driven organization that wants to ensure and maintain data quality.

Now that we understand the implications of bad data and how to evaluate quality, how do we quantify your data quality coverage? Luckily, there is a framework we can leverage to quickly calculate your overall data quality coverage as a percentage of your total collected data. For many organizations, this is an eye-opening exercise.

The equation to determine this would be:

Data Quality Coverage (%) = (data quality checks currently running) / (total tables × total columns × number of check types) × 100
 

Here’s an example scenario to put this in perspective: say you have 1,000 tables in your data warehouse and each table has 100 columns. Running all six data checks we identified earlier against your entire data footprint would mean 1,000 tables × 100 columns × 6 data checks = 600,000 checks. Let’s say that in this environment you are only running 2 of those check types on 100 of these tables, and only on 10 columns from each table. In this example, your data quality coverage comes out to:

(2 checks × 100 tables × 10 columns) / (1,000 tables × 100 columns × 6 checks) = 2,000 / 600,000 ≈ 0.33%
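To make the arithmetic concrete, here is a minimal Python sketch of the same calculation; the function and variable names are illustrative only, not part of any Lightup API.

    # A minimal sketch of the coverage calculation above; names are illustrative.
    def coverage_pct(checks_running: int, tables: int, columns: int, check_types: int) -> float:
        """Data quality coverage as a percentage of all possible checks."""
        possible_checks = tables * columns * check_types
        return 100 * checks_running / possible_checks

    # Example from above: 2 check types on 100 tables, 10 columns each,
    # out of 1,000 tables x 100 columns x 6 check types.
    running = 2 * 100 * 10  # 2,000 checks actually running
    print(coverage_pct(running, tables=1000, columns=100, check_types=6))  # ~0.33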
 

Should you go for 100% coverage? Remember, not all tables and columns may be important enough to cover with continuous data quality monitoring, so while you may not want to dedicate the resources to cover 100% of your total data footprint, you will want to get as close to 100% as possible on your critical data footprint. In this case, you would adopt a critical data element (CDE) approach, where you only cover your critical data elements with checks. The denominator is then calculated differently, to include only your critical tables and columns. In the earlier example, if you have 100 critical tables with 20 critical columns each, your CDE data quality coverage will be:

(2 checks × 100 tables × 10 columns) / (100 critical tables × 20 critical columns × 6 checks) = 2,000 / 12,000 ≈ 16.7%
 
Let’s look at another example, where you may be syncing data from source systems into your data lake or warehouse. You might be bringing event data or tables from an OLTP system into the analytics stack using a data integration or ETL pipeline. If you have events with 4,000 fields being synced using Fivetran, Stitch, or another integration pipeline, and you are only monitoring 100 of those fields for data quality, your coverage would look like this:
 
100 monitored fields / 4,000 synced fields = 2.5%
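The same arithmetic covers both variants above; here is a short sketch, assuming the 2,000 running checks from the first example all land on critical tables and columns.

    # CDE coverage: same 2,000 running checks, but the denominator is only the
    # critical footprint (100 critical tables x 20 critical columns x 6 check types).
    cde_coverage = 100 * (2 * 100 * 10) / (100 * 20 * 6)  # ~16.7%

    # Field-level coverage for the synced event stream: 100 monitored fields
    # out of 4,000 fields flowing through the integration pipeline.
    field_coverage = 100 * 100 / 4000  # 2.5%

    print(f"CDE coverage: {cde_coverage:.1f}%, field coverage: {field_coverage:.1f}%")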
 

The numbers presented in these examples are not just illustrative — they are frequently the data quality coverage numbers we actually see organizations operating with, much to their surprise.

How to increase your data quality coverage

 

If you are seeing topline business impact due to bad data, and your data quality coverage stands at 1–5%, the solution is simple: try to get to 100% data quality coverage on your critical data. Here’s a 3-step process to design the data quality blueprint:

1. Identify your critical data elements/entities that need to be covered (you probably already did this as part of assessing your readiness).

2. Identify the data checks that need to be applied (like volume checks or timeliness checks).

The challenge in this step is to arrive at a concrete definition of a Data Quality Indicator (DQI) — a metric that measures a dimension of data quality on a data element.

The DQI definition is as much a business conversation as it is a technical problem. Data engineers, analytics engineers, data analysts, product managers, and other business stakeholders all need to bless the DQI definition and agree that it captures the expectation for the data. For instance, there are multiple ways to measure the data volume brought into a table on every update: you could count all new rows from the last update, or you might count only the rows matching a certain filtering condition, such as a store ID, a restaurant ID, or a user group (see the sketch after this list). In a way, the DQI is the service level indicator (SLI) for the data pipeline.

3. Agree on the success criteria for Data Quality Indicators (DQIs) that capture the expectation of a well-functioning data pipeline with healthy data.

How much data volume is considered normal? Do we use a fixed threshold? Is our expectation that today’s volume matches the volume from yesterday or last week? What does an anomaly look like? This is a calibration question and, more importantly, a statement of an SLA between the data engineers producing the data and everyone else consuming it.
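To illustrate steps 2 and 3, here is a hedged sketch of what a volume DQI and its success criterion might look like once stakeholders agree; the table name, filter, and 20% tolerance are hypothetical, and in practice you would often rely on anomaly detection rather than a fixed tolerance.

    # Hypothetical volume DQI: count the rows brought in by the latest update,
    # restricted to the filtering condition the stakeholders agreed on.
    VOLUME_DQI_SQL = """
        SELECT COUNT(*) AS new_rows
        FROM orders                        -- hypothetical table
        WHERE updated_at >= :last_update   -- only rows from the latest sync
          AND store_id = :store_id         -- the agreed filtering condition
    """

    def volume_within_sla(new_rows: int, baseline_rows: int, tolerance: float = 0.2) -> bool:
        """Success criterion (SLA): today's volume stays within +/-20% of an
        agreed baseline, e.g. the same day last week."""
        if baseline_rows == 0:
            return new_rows == 0
        return abs(new_rows - baseline_rows) / baseline_rows <= tolerance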

This blueprint can be created for all data elements in one go or approached as an incremental exercise. Implementation can go hand in hand with designing the data quality specification for a data element, and in fact the design can evolve over time as a definition is put to use.

Lightup makes it easy to collaboratively design data quality indicators (DQIs) and calibrate expectations, implement monitoring with anomaly detection, and iterate on data quality checks with an operational feedback loop.

Care to see how Lightup can improve your data quality, or simply want to avoid issues altogether? Jump straight to a free trial to see our solution in action.
