
CDOIQ 17th Annual Symposium: Session 21B

Your Data Keeps Breaking Silently. Why Data Quality Is So Hard to Solve.

In every single industry, decision makers need to trust their data to make informed decisions. Yet, data quality issues are causing critical business failures today. Data errors trigger compliance penalties, revenue is lost to pricing errors, and consumer applications fail to behave as expected — all caused by bad data.

But why is data quality so hard to solve? Why are legacy data quality solutions breaking, and why are major Fortune 500 enterprises unable to prevent data outages? Watch this session to learn about the fantastically complex world of data quality, why you don't have it all figured out, and what unknown unknowns still exist in your organization.

Transcript

Welcome back. You already know this is the best track, so I'm definitely happy to keep seeing the same people in the room. And we welcome the people who are joining us virtually, because our virtual audience is usually bigger than the one in the room, yeah, which is great. So let's not forget about the people who are joining us virtually.

I'm gonna start with an introduction.

So Kevin is gonna be helping to manage questions. Please post your questions online. You can also ask in the room, but we are checking both sides, online and in the room. So I would love to introduce Brian. Brian is coming from the company Lightup, which is a four-year-old company, but one I know personally. They do amazing work.

So I think it was day one and day two, we had so many presentations about the evolving role of the CDO. And often it came up that it would be great to have governance, basically, like, a data quality dashboard that you're gonna show to your executives, right? And also make them accountable somehow. Right?

So Brian and his company, they work in the space of data governance and data quality.

Which really helps you to show real metrics.

And they do this by using smart technologies like AI. So it's not just an Excel spreadsheet. Right?

And they also look at the past. But what is interesting is that they can predict and give you some signals about where you might potentially have problems.

And I believe Brian can actually manage this because he has five kids.

So if he can manage five kids, I'm sure he can really manage such a complicated topic as data quality and data governance. So welcome, Brian, and let's go.

Thank you very much. Alright, I can hear myself, so I don't have to ask you all.

And speaking of five kids, I get interrupted all the time. So if you have any questions, interrupt me. I’m totally okay with it. I actually prefer that.

It’s better in the moment because then it’s all contextual rather than, like, waiting to the end and be like, hey, thirty minutes ago, you said this thing. So, yeah. So if you have any questions, please interrupt. That’d be great.

And then we’re gonna use a mic because there’s people online. And if you don’t talk in the mic, they can’t hear you. So we’re gonna use a mic for that one. Alright, I did also have a really fancy demo.

So just imagine the best demo you’ve ever seen, and, that’s what it was like. So we’re just gonna inject that in there.

And then, hopefully, you’re all here for the title, specifically your data keeps breaking silently. Right? This is a really big issue. Because first of all, it’s all retroactive.

And second of all, it’s typically driven, like, how you find out is by your customer. That’s typically how it works. Right? It’s very rare that a data engineer is gonna be like, whoa.

I didn’t realize that this report was broken. Right? That’s typically not how it works. So it’s whoever the consumer is.

Now, the question that they're gonna have: because typically they don't care how the data got there, they just want to make sure it's accurate. Right? So it doesn't matter how complex anything is. It doesn't matter how long it takes.

It doesn't matter what fancy tools you're using, open source or not open source. A lot of times, your consumers literally do not care. They just wanna know why their data isn't working and what needs to happen for it to be fixed. And then if you're on the data engineering or data architecture side, and it's probably good to start with that: like, how many here are more on the data engineering side, data architecture side, pipeline side?

No one? Okay. How many would put yourself in the classification of data consumers? Reports dashboards?

We’ve got a couple there. We got a couple. Okay. Nice. Alright. And then other people are here, wandering around.

So this is great. Okay. So this is what we’re gonna talk about. Like, we’re gonna kind of pull the layers back on why data breaks silently.

So you have a little bit more understanding of what's there and the complexity that comes into play with what that actually means. So we're gonna actually walk through that. So this is me, Brian Filing. I'm the head of field engineering, which basically means I'm on the technical side, cradle to grave, for every customer that we have.

Right? So I don’t go away. I get to be in a long term relationship with all of my customers, which is great.

There's my LinkedIn profile if you want to connect, there's my email if you want to get in touch, and then our website if you want to learn more; you can do the get-in-touch stuff through the email as well. We've also got a couple guys back there from Lightup if you wanna talk to them afterwards. And also, if we can scan your badges, my marketing team would love me. So that'll be good.

Okay. So what is broken data?

What do you guys think broken data is?

Can't get value out of broken data. Okay? That's a good response.

Not fit for consumption. And it’s just like you stole my PowerPoint presentation.

Yes. Broken data is whenever the data output, whatever that output is, whatever level it comes through at, is not actually reflecting the data that was generated, and the way that you know that is it's not fit for consumption. Okay? Now the reason why I specify that is because your output can be at multiple levels depending on what you're wanting to do with your output.

Okay. Is anyone here internally using the medallion layering, raw, bronze, silver, gold, for your data layering? We've got a couple of nods. Okay.

So typically gold is where your people are interacting with the data. Right? But your raw and your bronze in a lot of ways is where your systems are interacting with the data. They’re pulling that out to provide maybe reports.

A lot of times, nowadays, it's modeling, like AI/ML and data science stuff. Like, they're gonna pull it up from, like, that silver layer. Okay? So the reason why I phrase it this way is because the data is generated significantly more upstream than your raw layer.

So if the data is generated up in some type of manufacturing system, or maybe it's generated by an API and put into a Kafka topic or something along those lines, that data is generated way upstream.

If it's broken, that means that it doesn't reflect an accurate output at whatever layer the output is at. Okay. So that's how broken data gets put through. And you'll see this through things like inaccurate reports and dashboards. You'll see it in regulation and compliance penalties. You'll see it with internal and external customer complaints.

So another question for you all, how many of you are at companies that are under strict regulation and compliance requirements?

Lot of hands. Okay. That’s helpful.

Now here's a potentially painful question.

If you had to give a range of how much your company has to pay in regulation and compliance penalties when you fall short in those areas, how many would say it's less than five hundred thousand dollars a year?

Okay. Five hundred thousand to five million a year?

Yeah. Five million above?

Jackpot. Bingo. Alright. Yes. Very common. Right? And the problem is that from a financial perspective, they're seeing the big number.

Right? From a broken data perspective, we see the actual, like, general ledger. We're actually seeing each individual thing. We know how it's calculated up. K. So I'll give you a good example. Working with a health insurance company, they have one specific, unique compliance requirement out of all of them, where every state is its own regulation market.

Some of them are monthly, some of them are quarterly.

If they don't send a letter. So, if you were to go to the hospital and have a surgery for some reason, and your insurance doesn't cover your actual surgery, you can file a grievance and appeal.

They have to send you a letter within twenty four hours of you submitting it. And if they don’t, they get fined a thousand dollars per letter.

Now, in some states you're reporting that back every month, and in other states you're reporting it back every quarter. A lot of those times, the data issue is actually missed until about two days before. They'll have an internal regulation group, and they do a check on the data. And then you have to submit it to an external regulation group, which is usually the state-owned group. And then they do a check on the data, and they literally give you two days to fix all of the data. Okay?

Now additionally, regulation and compliance data is never in one spot.

It is in four spots. It’s across multiple tables. It’s across multiple data sources. And if you can’t link all those together, then you’re gonna actually end up paying significant penalties. And this is the type of complexity that comes into play. This is why data breaks silently because one null in one location breaks your whole system.

This is that type of complexity.

Alright. So here’s what drives data breakage.

Data complexity is the main thing that drives why your data breaks. So the first one is your stack. Right? Now if you look up here, you will see that there are a lot of different things up here.

A lot of different solutions: cloud-based, vendor-based, open source, things like that. You're probably seeing things that you've heard people whisper about in hallways. Well, why do we still have Hive? I don't know.

It’s horrible. Right? Like, things like that.

So if you have all of your data spread throughout all of these different things, you have complexity. Right? And every single one, in networking terms, is a point of failure. Any point data transfers from one point to another, you have a point of failure, every time. Even if your data is transferring within those systems, so Databricks, Snowflake, Redshift, Athena, BigQuery, we're usually doing a lot of transformations of the data inside.

Even that is a point of failure, because at any point it can break.

Then you also have a complexity of your data flow. Data can be generated. It can be dropped in by an SFTP share. It could be brought in through an API.

It could be, a data dump from a completely different business unit or a different company, or maybe you purchased the data. Right? So that’s where the data is generated and brought in, and it slowly flows through your transaction system, and then it flows through your analytic system, and then it flows through how you actually do visualizations. Right?

And, I’m assuming most of us here have become Excel experts as of late. Because that’s the number one way that a lot of people will visualize their data.

Right? Like, sure, we've got Looker. Sure, we've got Power BI, and those are, like, the fancy visualization dashboarding tools that our C-suite likes to see, but, like, real business is still done essentially in Excel.

Like, we're still downloading the CSV file. Right? Most of us have probably written macros to make sure that we're pulling it more and more recently. That's actually what my demo was gonna be. And then you also have different types of outcomes.

Right? Some of these, some of your companies and other companies that we’ve talked to are actually using multiple of these because there are different reasons. There are different business units. There’s different domains. There’s different desired outcomes, things along those lines. So we’ve got Power BI dashboards because some of our data is in Azure, and we’re working with a company that we recently acquired, and they’re only Azure, and they have this big contract in Azure, and everything’s Power BI. But our company that we’ve had for the longest time, they’re all in Tableau.

Well, if data is gonna be split between those two, and someone says, why the heck is my report broken?

What report? Like, where are you getting this? What data did you get this from? Did you how did you snapshot that data?

Right? You're not using a live query. You, you know, clicked the button four days ago, and now it's not working anymore. These types of things actually drive data complexity, which drives broken data.

You have other data complexity, and this actually gets further upstream. You have a process called extract, transform, and load. Right? Now for anybody in here, how many of you have ever been in a data individual contributor role, like data engineer, data architect, anything like that?

We've got a couple hands up. Okay. Awesome. So ETL, right? You're probably very familiar with this.

A lot of people will use vendors for this. dbt is really good for the transform piece, very, very popular nowadays. A lot of companies are moving to that as, like, a standardization.

But then you have other tools out there. Fivetran took the world by storm. Airbyte was like the open source version of Fivetran and was really, really popular. Other companies will use actual code, and they'll just code SQL or just code Python, and that's what they use to actually do this process of extract, transform, and load. But the problem is, because it's done in code, that means someone has to type it. And if someone has to type it, you get fat fingers.

Alright. I'll give you a finserv example.

When you have a company that deals with revenue or any type of financials that crosses a country line, you now have to deal with currency. You have to provide a currency exchange rate. Right? Well, what if you move the decimal just one point to the right or the left?

Right? It’s not that hard to do. I mean, I can’t tell you how often I misspell, like, severity just because, like, I just type the e’s too fast. Right?

So this is where you can get really, really bad broken data. I mean, you can go from saying that you made a million dollars in the last year to a trillion dollars in the last year if you just misplace that data point. Like, it's just super easy.
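To make that concrete, here is a minimal sketch of the kind of sanity check that can catch a misplaced decimal in an exchange rate load. This is not Lightup's implementation, just an illustration in Python; the column names and the threshold are made up.

```python
import pandas as pd

def flag_suspect_rates(rates: pd.DataFrame, max_ratio: float = 3.0) -> pd.DataFrame:
    """Flag exchange rates that jumped by an order of magnitude vs. the prior load.

    Assumes columns: currency, load_date, rate_to_usd. A misplaced decimal shows
    up as a roughly 10x (or 0.1x) change from one load to the next.
    """
    rates = rates.sort_values(["currency", "load_date"]).copy()
    rates["prev_rate"] = rates.groupby("currency")["rate_to_usd"].shift(1)
    ratio = rates["rate_to_usd"] / rates["prev_rate"]
    # Anything that moved more than max_ratio in either direction is suspicious.
    return rates[(ratio > max_ratio) | (ratio < 1 / max_ratio)]
```

A check like this runs on the landed table before anything downstream consumes it, so a fat-fingered rate never makes it into a revenue report.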

The other place where complexity comes in, and this is actually the hardest one to understand is your transformation process.

Okay. So transformation is basically when I’m gonna take data, whether that’s data in a single data set or data in multiple data sets, and I’m going to create a product out of those data sets.

That’s me transforming the data. Most of us know it as things like aggregations.

Right? Or, Anyone here developing a customer three sixty application or customer three sixty process? Yeah. Okay.

Transformation is the bedrock of a customer three sixty setup. Because you have some systems that are generating ten columns worth of data, maybe. You have other systems that are generating forty columns worth of data. There’s a little venn diagram overlapping between those two, but you need all of that data.

Right? So the first thing you wanna do is say, okay. Well, what’s the ultimate product of those two brought together that gives me three sixty view. That’s the first thing you want to do.

And the second thing is if there is an overlap, how do you weight which one of those systems is more important?

Right? If one of them says that the person's name is, you know, John Doe, and the next one says it's John James Doe, do you take the middle name? Do you not take the middle name? Right? What if one of them says Sam and the other one says Samantha?

Which one do you use? Right? And that’s a basic understanding. Like, that’s not even, like, where it gets pretty crazy. Right? You could do that with addresses.

So you have to give it a weighting system. If it comes from system A, that's way better than if it comes from system B, and system B is better than nothing. So customer three sixty, here you go. And it's almost like we'd rather give the customer, like, a bad representation than no representation.

It’s kinda how we think, but we don’t really understand if that’s actually what the customer would rather have. If they’re missing a field, would we rather have them fill it in, or would we rather guess and be way off potentially?
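As a rough sketch of what that source-priority weighting can look like in practice, here is a tiny Python example. The sources, field names, and priority order are invented for illustration; it is not how any particular customer 360 product does it.

```python
from typing import Optional

# Two hypothetical source records for the same customer.
crm_record = {"first_name": "Samantha", "middle_name": None, "email": "sam@example.com"}
web_record = {"first_name": "Sam", "middle_name": "James", "email": None}

# Source priority: the CRM is trusted over the web form, and anything beats nothing.
SOURCE_PRIORITY = [crm_record, web_record]

def resolve_field(field: str) -> Optional[str]:
    """Take the value from the highest-priority source that actually has one."""
    for record in SOURCE_PRIORITY:
        value = record.get(field)
        if value not in (None, ""):
            return value
    return None  # no source had it; leave the gap rather than guess

customer_360 = {f: resolve_field(f) for f in ("first_name", "middle_name", "email")}
# {'first_name': 'Samantha', 'middle_name': 'James', 'email': 'sam@example.com'}
```

The hard part, as the talk points out, is not the code; it is deciding the priority order and whether a guessed value is better for the consumer than a blank one.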

So transformation is actually one of the hardest places to understand broken data, because you can't follow it back.

Right? How did that value get generated?

I don’t know. Like, literally no idea.

Right? I know what mathematical requirement was probably there to generate the data, but what were the original values? Not sure. When was it calculated?

I don’t know. Is it granular to the second or to the hour or to the day or to the week? Like, what is the aggregation set? I don’t know.

Right? And you can't follow that back. There is no lineage tool that will follow that back. So this is another area where data breaks. Right? And this is the one that actually impacts a lot of C-suites the most, because their reports are based off of the outcome of transformations.

They don’t wanna look at raw data. Even though they probably say that they wanna look at raw data, they actually don’t. Right? For those of you, again, who have been ICs, there’s this false perspective that the more data, the better.

Like, you’re gonna make a better decision if you have more data. No. You get more questions if you have more data. Why?

What? That's really weird. Why did you do that? And you're like, why are you even asking me that question?

Like, you don’t care. It has nothing to do with you. Right? So we have to be intelligent about what we present, how we present it to make sure that they’re making the best business decisions.

But if we do that, if we curate that, then they’re gonna have potential broken data. And so all of us have probably experienced the brace for impact. Right? The report gets generated at eight AM, but they don’t get to their office till nine.

Like, so you got, like, an hour and a half of just, like, did it work? Did it work? Did it work? Did it work?

And then nothing happens.

You’re like, That was great.

Okay.

Now I can, you know, the rest of the day is great for the next twenty-four hours. Alright. Seeing some nods, some smiles, so that's happened to you. Okay. I totally get it. Alright.

Okay. So how do we solve broken data? This is the big marketing answer: data management. Boom. There you go. And you can, you know, go home happy. Data management. Now you just gotta Google search data management, and that's all easy. Right?

No. There's lineage, there's catalog, there's observability, there's master data management, there's data governance, there's data quality. Like, you have all these things that are part of this data management, or data governance; it depends on who you talk to, and what you Google search, and what SEO has done. But you really have to figure this out.

But in general, how do you manage your data and how do you govern it? Right? You have all of these different areas, and depending on the vendor you talk to, one of these is obviously more important than the others, as you've probably seen.

And it also depends on who you've hired recently; they will tell you which one of these is more important than the others. Right? Well, I just came over from so-and-so, and, you know, a month ago we were using this catalog tool and it solved all of our problems, right? And now we're all happy.

Nope. That didn’t happen. But, okay, probably still need the tool. Is it the most important?

Hard to tell. Alright. So where I work and what I do and what I consult different companies on is data quality, so I'm gonna focus on data quality. Right?

For me, this is one of the biggest things. I used to be a DBA. I remember taking reports to my direct superior and saying, here's the report for the last quarter. Here's how much we're charging each internal business unit out of their budget so that we can handle this, like, IT process for them. And I spent blood, sweat, tears, hours of my time rebuilding SQL queries to generate an Excel doc to do this.

And I gave it to him. And then within thirty seconds, he's like, can't use that. I was like, but no, it's accurate. Like, the data is good.

It's there. And he's like, you don't understand. We have been doing this process this way for the past three years. It's not that I doubt your SQL or I doubt the data that you're giving me.

The number you gave me is so much lower than the number that we had last quarter or last month that no one’s gonna trust us.

Right?

Now for any of you who have ever written SQL queries before, the solution to your problem is a thousand lines of a WHERE clause.

It just goes and goes and goes. And you're basically excluding or adding things, like, one little tick at a time, lots of percent signs. That's the only way to solve the problem at that point. But see, they had broken data and they had no idea. For years, they had broken data and they had no idea, because we were using old-school systems, like Windows 3.0, like old systems.

Okay. So this is where broken data can actually cause massive problems.

Right? So we’re gonna talk about data quality near and dear to my heart.

So what is data quality?

That’s a tough question.

I did a Google search on this one.

So one vendor came out with these data quality dimensions. And there are six of them. It was pretty cool. Okay. So we've got uniqueness, timeliness, validity, things like that. Another one had six, but they used, see, alliteration.

Pretty cool. Keeps it in your mind, right: chosen, comprehensive, complete, clean, calculable, credible.

Awesome.

Another one only had, like, seven, but they said timeliness was the middle one, which I thought was pretty interesting.

This one had six. Didn’t really give a lot of explanation.

This one here, another six. There's, I mean, a range. And then this one had just a lot. Like, it was just a huge amount. So what I did is I actually brought all these together, and this was the list of what data quality is according to all of these.

So you saw, for all this stuff, you've got to have quality data, which I thought was pretty fun. So, yeah, raise your hand if these are all the dimensions that you track on your data quality and score. No one. Okay. Awesome.

Surprisingly.

Okay. So I tried to boil all this down, if I could, into something pretty minimalistic, at least an achievable dream. Right? Like, this is the battle cry of my data team.

Data quality is the process of guaranteeing that data is fit for its intended uses in operations, decision making, and planning. Thank you very much. Your data must be fit for use in these specific areas.

That is how you can guarantee data quality. And then what you do is you work your way backwards from it. So this is the mission statement for the data team. Right?

Data governance: we need to provide quality data that is trustworthy, that you can give confidence in. This is an important thing. Right?

Because bad data one time doesn’t just mean that they lose confidence in that data that one time.

Right? This is a retroactive ripple effect.

Right? If they see a report and they say the report's bad, and for those of you who get reports, you see a report and you know that the data is bad. Right? You're not just like, oh, everyone makes mistakes.

It was the one time. It's been solid the rest of the time. No. First question: how many other reports have been bad?

Right? How many other bad decisions have I made? That’s the first one. And then the second one is you inherently lose your trust in the data team.

You may not want to. Right? But it's inherent in the fact that they are saying that they're delivering you good quality data that you can make business decisions on. Now you can't, and now you're gonna question it. And then the only reason why you don't question it going forward is because you become callous to the fact that you haven't had to question it, not because it becomes good.

Right? I actually was talking with an airline company recently.

They’ve had so many issues with data quality that they actually put a trust score on their reports.

Like, hey, CFO, you can be guaranteed, you can have, like, a ninety-eight percent confidence rate that this data is good. Maybe we should adopt that. Right? Like, I'm, like, eighty percent sure you can make really big, important decisions based off of this data. I mean, it's not gonna fly; for them, their use case, it does, but that's not a common practice that we can do.

Okay. So how do we do this? Right? So now I’m gonna so we we’ve laid out the problem.

We’ve talked about where the complexity is. We’ve talked about where the issues are. I’m gonna actually talk about solutions now. Right?

So we're actually gonna do that. So the first one is data pipeline testing. Does anyone know what a CYA is?

We’ve got a couple nods. Okay.

It's cover your butt. That's what you do. Right? All of us have been there. When you're not in a leadership role, it's when you CC your manager when you're sending an email to someone else. That's a cover-your-butt moment.

So data pipeline testing is a CYA for data engineers.

Okay. Now, it masquerades as data quality, because we are testing the quality of the data just by inheritance of the fact that it's a dimension.

Right? So the check, the data check that we're running as a part of our pipeline, technically fits in a data quality dimension, therefore it's data quality. That's actually not true.

Okay. That’s correlation without causation. That’s not actually a true thing.

Okay? So here’s what data pipeline testing usually is. What was the starting time and the ending time of a specific task, job, or pipeline? What were those two times?

Okay. And then what we do is we record the amount of time in between, and we record how often they start. And we say, well, the jobs usually run every hour on the hour, and they usually take ten minutes to run. So if at any point it starts more than four or five minutes after the hour,

that's a problem, send an alert. And anytime it takes more than ten minutes to get the job done, that's a problem, send an alert. Okay?

Now that's a timeliness dimension in data quality. But that's not technically data quality. That is testing a pipeline just to make sure that it's operating the way I want it to. Right?

Because if it doesn’t, then there’s actually a problem with my data.

How many rows need to move? Count star, count star, do they match, sweet, it worked.

K? That is more of a validity understanding of your data, but that's not actually the intended purpose. The intended purpose is: did my job finish what it was supposed to do? Okay.
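Here is a minimal sketch of what those two pipeline tests, the timeliness check and the row-count reconciliation, might look like in Python. The run metadata, thresholds, and field names are hypothetical; in practice they would come from your orchestrator or a run-log table.

```python
from datetime import datetime, timedelta

# Hypothetical metadata for one pipeline run.
run = {
    "scheduled_for": datetime(2023, 7, 10, 9, 0),
    "started_at":    datetime(2023, 7, 10, 9, 7),
    "finished_at":   datetime(2023, 7, 10, 9, 21),
    "source_rows":   1_204_331,   # SELECT COUNT(*) at the source
    "dest_rows":     1_204_298,   # SELECT COUNT(*) at the destination
}

alerts = []

# Timeliness: start within 5 minutes of the schedule, finish within 10 minutes.
if run["started_at"] - run["scheduled_for"] > timedelta(minutes=5):
    alerts.append("late start")
if run["finished_at"] - run["started_at"] > timedelta(minutes=10):
    alerts.append("run took too long")

# Row-count reconciliation: did everything that left the source arrive?
if run["source_rows"] != run["dest_rows"]:
    alerts.append(f"row count mismatch: {run['source_rows']} vs {run['dest_rows']}")

print(alerts or "pipeline leg looks healthy")
```

As the talk says, these are CYA checks on the pipeline itself; they tell you the job ran as expected, not that the values inside it make business sense.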

So we track that. How many columns need to move? Okay. So we look at how many columns are in table a.

Look at how many columns are supposed to be in table B, did all of that stuff move over, or was there a break in the schema? If not, we're good. Again, it's a part of your accuracy dimension, but it's not actually part of your data quality because of how it's being applied and what the intended purpose is.

Okay. What is the percent of nulls in each column? Percent of nulls, percent of zeros, percent of blanks. We used to do this really fun thing, and I'm sure no one else does this, but whenever we weren't sure, we would put in nine nine nine nine nine nine nine nine, and that was the first qualifier in my WHERE clause, right: where value was not nine nine nine nine nine. That was, like, the first thing I did, because it was junk data and there's no way it was gonna get that high.

Right? So that’s the type of stuff that we did. So what’s the percent of those? Right?

I don’t want to see a giant jump or a drop. I want to wait. What’s the percent of nulls? Okay.

Now this is where we actually start to bleed into actual data quality checks, because the percent of nulls, the percent of zeros, can actually cause massive problems with our reports, our dashboards, our models that we're gonna have. Right? So SAS did a report about two years ago on the difference between randomized biases inside of a data model versus actual systemic biases in the data model. And what that means is, if you have random nulls showing up in your table, right?

What’s the impact of degradation on your model versus if something is true or false, this column gets a null.

Right? So if the city starts with the letter b, then the revenue value is null.

Right? And so that's what's called a systemic bias, right? Because it's systemic, based off of a predefined notion. That actually has a forty percent degradation effect on a machine learning model.

Right? So calculating the percent of that in each column is actually a really good thing. Doesn’t have to be a hundred percent. Doesn’t necessarily have to be zero percent. Maybe it does on primary keys.

But making sure that that's actually around a low value, and consistently around that low value of percent, that's a good check to have.
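For illustration only, a null-percentage check over a table could be as small as the sketch below. The two percent threshold is invented; in practice you would track the historical percentage per column and alert on a jump rather than on a fixed number.

```python
import pandas as pd

def null_percent_report(df: pd.DataFrame, warn_above: float = 0.02) -> pd.Series:
    """Return the fraction of nulls per column and warn on anything above the baseline."""
    pct_null = df.isna().mean()  # fraction of nulls in each column
    for col, pct in pct_null[pct_null > warn_above].items():
        print(f"WARN: column {col} is {pct:.1%} null")
    return pct_null

# Hypothetical usage:
# df = pd.read_sql("SELECT * FROM silver.orders", conn)
# null_percent_report(df)
```

The same pattern extends to zeros, blanks, or sentinel junk values like 999999999: compute the percentage, compare it to what is normal for that column, and alert on the change.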

Min, max, and median. This actually goes back up to CYA.

Right? Well, what was the minimum for that column? What was the maximum for that column? What was the median for the column?

Run it on the source, do my pipeline job, run it at the destination, is it accurate? Okay. And then, are there duplicates? That's a pretty basic one.

Right? Are there duplicates? That one gets pretty complex because it could be duplicates in a single column, but you can also have a multi column duplicate check.

Right? So a good example is in marketing.

Right? If you handle marketing data or you do anything with marketing data, you know that you can't just say, is my company unique in the column? Never gonna happen, because you're gonna have multiple contacts. So you're gonna do a multi-column check. You're gonna say: company name, first name of contact, last name of contact, and email of contact. If that set of four is unique, that's what we want. Right? It gets pretty complex there.

And then, how many categories are there? So this is gonna be, like, you know, if we only offer our services in three countries, I should only have three countries that show up as categories in my column. If all of a sudden there's a fourth country, that's a problem. If there's only two countries, that's a problem. Right? So I'm gonna look at how many unique values exist in that column; that's what I'm gonna look for. Any questions so far on any of this?

Pretty straightforward. Awesome.
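To make those last two checks concrete, here is a small Python sketch of a multi-column uniqueness check and a category check. The table, the key columns, and the expected country list are all made up for illustration.

```python
import pandas as pd

# Hypothetical marketing contacts table.
contacts = pd.DataFrame({
    "company":    ["Acme", "Acme", "Globex"],
    "first_name": ["Sam", "Sam", "John"],
    "last_name":  ["Doe", "Doe", "Smith"],
    "email":      ["sam@acme.com", "sam@acme.com", "john@globex.com"],
    "country":    ["US", "US", "IT"],
})

# Multi-column duplicate check: the four fields together should be unique.
key = ["company", "first_name", "last_name", "email"]
dupes = contacts[contacts.duplicated(subset=key, keep=False)]
print(f"{len(dupes)} rows violate the contact uniqueness rule")

# Category check: we only operate in three countries, so anything else is suspect.
expected_countries = {"US", "CA", "MX"}
observed = set(contacts["country"].dropna().unique())
print("unexpected categories:", observed - expected_countries)
print("missing categories:", expected_countries - observed)
```

Both are still pipeline-style checks; the point of the next section is that the checks consumers actually care about are defined in business terms, not in column mechanics.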

Okay.

Here’s where data quality comes into play and where most companies are not actually tracking data quality. And this is where you start looking at business and data analytics.

The reason why I say it's most is because, show of hands, how many of you are actually requiring that those who are consuming the data in a report or a dashboard tell you what metrics they're looking at, so you can implement data quality checks on your data as a result?

Got one hand.

Sweet. It's very uncommon, for a reason. Yeah. You stand out. Yes. Yep. Yep. Absolutely.

But this is where data quality comes into play because the dimensions, the outcome is who’s being affected by bad data. Now this is a sales concept. Right? And as sales concept and a value selling concept, you look for not just what the pain is, but who owns the pain? Right? The owner of the pain is the one that is actually feeling the pain themselves and the one that gets the most benefit out of the pain being solved. We can apply that same concept to data governance.

Right? Who feels the most pain when the data is inaccurate?

Right? Well, the one that feels the most pain is the one who gets the most benefit out of the pain being solved. In a lot of cases, it's immediate. You know, I got an email from my CFO last week, and they were like, you know, hair on fire, so-and-so is screaming at me because their report's been bad for the last three weeks.

Like, that type of stuff, they own the pain. Right? A lot of times we get this mixed up and we think that the data engineers own the pain because they’re not able to scale head count enough to cover data quality. They don’t own the pain.

They just deal with it. Right? They're the ones that are putting the ibuprofen on the broken leg. Like, that's what they do.

K. So if we can bring in business and data analytics as requirements for data quality, we're actually gonna match our dimensional requirements for data quality. And we're gonna make sure that the data is actually fit for purpose.

Okay. So how do we do that? How many customers has each location interacted with? So if you handle any type of retail or e-comm or any type of B2C process at all, your data consumers are looking at that type of metric. I wanna make sure that each location, online versus in person, is actually growing right, or operating within the season and the trend that I expected it to operate in.

Your data consumers know this.

This is the reason why they come back and say, my data is broken. My data is off. My data is wrong.

Right? Well, why? Well, you know, usually this number is, like, up here. Right? It's usually in this range during this season. Like, summer's really high. Like, I don't know why this isn't as high as past summers have been.

Right? That’s a data quality metric. That’s a data quality dimension. That’s what we need to be tracking.

What is the distribution of devices across each location? Right? So inevitably, you're gonna have some IoT. You're gonna have some form of user-interactive devices of some kind. If anyone in here is doing anything with energy generation, like solar, oil, wind, hydro, anything along those lines, you've got sensors that are outputting data all the time. Well, how much is it outputting for every location, for every season?

Right? That’s gonna be a really important metric to track for the business. That should be a data quality metric because that is a dimension of our data quality requirements.
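A hedged sketch of that kind of business-facing check: compare this period's metric per location against the same season in the prior year and flag big deviations for a human to look at. The table layout and the thirty percent tolerance are assumptions, not a prescription.

```python
import pandas as pd

def seasonal_deviation(metric: pd.DataFrame, tolerance: float = 0.30) -> pd.Series:
    """Flag locations whose metric moved more than `tolerance` vs. the same season last year.

    Assumes columns: location, season, year, value, with at least two years of history.
    """
    pivot = metric.pivot_table(index=["location", "season"], columns="year", values="value")
    years = sorted(pivot.columns)
    latest, prior = years[-1], years[-2]
    change = (pivot[latest] - pivot[prior]) / pivot[prior]
    return change[change.abs() > tolerance].rename("pct_change_vs_last_year")
```

Notice that the rule is phrased the way the consumer phrases it ("summer is usually higher than this"), which is exactly what makes it a data quality dimension rather than a pipeline test.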

Okay. Does each insurance claim have the correct date, location, and status?

Really important.

What is the day-over-day, week-over-week, quarter-over-quarter revenue per operating country? Another really important question. What's the user adoption rate on the new version of our solution, the new version of our software, the new version of whatever it is, the new version of our offering, compared to the old version? Right?

In system administration, we call this sunsetting. Right? You're gonna sunset an old product. You're gonna bring in a new product.

You should see that you're actually sunsetting that product.

Right? So I can give you a really good example of this. I was working with a marketing company, and basically, companies will pay them to go do marketing ads for them. And they'll do it across social media. They do a lot of digital marketing, things along those lines.

They were about to present a report on the money that their customer gave them: what that did, what that resulted in as a business output.

Right? So this was a really big candy company, huge candy company.

And what was unknown to this ad tech firm was that they were about to show this big candy company that they made zero dollars of revenue during Halloween.

Yeah. They would have lost that customer real fast.

The only reason they found it is because a data scientist just, you know, had that itch. They've gotta do that one last check of the data.

Right? And they literally caught it, like, a day before the actual meeting with this really big candy company. That would have been detrimental to them.

Okay? So day over day, week over week, quarter over quarter, for them, revenue was a huge thing.

And they weren't tracking it.

K? So, this is where I would give a demo flashing lights. This is awesome, sweet.

Okay. So let me kind of, like, present what the demo was gonna do visually through words if I can.

Not through interpretive dance.

So the first one was actually a pie graph, and it was tracking how much revenue per customer type we had. Right? Now, sometimes we dictate customer type internally; we don't associate it, we don't assign a customer type.

So assigning would be something like, if you are an airline rewards member, you go through, like, the bronze, silver, gold, platinum, oneworld, blah blah blah, like, different groups based off how often you fly and where you fly to, and you get different benefits based off of the different tiers you go up. Right? That's an assigned grouping or category of that customer. Sometimes internally, we do that at companies.

We don’t tell them that. Right? And we do this based off of how frequently they interact with us, how much they’re worth. Right?

We assign them big ten, right, or big customer, or, you know, super important, right, the twenty percent. Like, there's different things that we might associate with them. Right? Big fish is another one I've heard.

So internally, we do that. Well, the way that we do that is with certain metrics. And if we're calculating those certain metrics, we're going to have a pie graph that we're going to show to our internal financial team and say, here's how they've been for the last month's worth of revenue or the last month's worth of interaction, here's how each one takes up its slice of the pie. Big numbers, really hard to tell if something's wrong. Okay?

So the next graph was basically saying, what does the members versus non members actually look like over time? That’s another pretty common one that we’re looking at. Probably wanna make sure that we’re pushing more membership, more loyalty. That’s a really big thing, right, customer experience.

And the last one was actually splitting out revenue based off of city.

And what we found in that one is someone apparently thought Tucson was Tuscany.

And so they applied an Italian currency conversion rate to that. Does anyone know what the ITL conversion rate is to USD?

It's about seventeen hundred to one. So you divide by seventeen hundred, and that's how much money they were giving you out of the Italian money. But if you apply the currency wrong and you multiply instead, then every single person from Tucson suddenly was giving you seventeen hundred dollars for every one dollar that they were actually spending. And so as a result, your bar chart that you're going to be looking at as a CFO, your bar chart when it came to Tucson, was actually way lower or way, way, way higher than the other cities, because someone just didn't understand exactly what was supposed to happen.

Right? And that is something that happens upstream. It's not something that necessarily happens downstream. Right? Currency conversion processes happen in a dbt model. They don't happen in Power BI.
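One way to catch this class of mistake before the CFO does is a simple cross-sectional outlier check on the transformed output, something like the sketch below. The z-score approach and the threshold of three are illustrative assumptions; a robust check would use medians or seasonal history.

```python
import pandas as pd

def flag_city_outliers(revenue: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    """Flag cities whose revenue is wildly out of line with their peers.

    Assumes columns: city, revenue_usd. A botched conversion (multiplying by ~1700
    instead of dividing) shows up as an extreme outlier on even a crude z-score.
    """
    mean = revenue["revenue_usd"].mean()
    std = revenue["revenue_usd"].std()
    revenue = revenue.assign(z_score=(revenue["revenue_usd"] - mean) / std)
    return revenue[revenue["z_score"].abs() > z_threshold]
```

Because the check runs on the output of the transformation, it catches exactly the kind of error that no lineage tool and no pipeline test would surface.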

Right? So now we're presenting really bad data. But the point of the demo was, I would show two graphs that looked normal and healthy based off what you saw, and all of a sudden the third graph was, like, way off. And if you didn't know better, you wouldn't know it was broken. And do you know who doesn't know?

Literally, everyone upstream.

All your data engineers, all your data architects, domain stewards, they don't know, because they don't think the way that the business decision makers think. So we have to be able to bring them into the conversation.

That can't be a Jira or ServiceNow ticket queue.

Right? But that's literally how it happens everywhere.

Right? And then you have data engineers that monitor the ticket queue, and then they're looking through it. And then what do they do? Like, they have to set up a call.

Like, okay. Well, you know, you said this, but what about this, and what about this type of thing here? And then they bring in a data engineer buddy and they're like, no, that will never work. You can't do that. No.

And the CFO is like, I don't care. Like, I've literally been writing a report on this for the last three years, and I need to make sure that my data is accurate. Right? See, that's usually how the conversation happens, and you spend eight weeks rebuilding and rebuilding and iterating on this specification of a business-oriented data quality check that you want.

One check.

Okay. Multiply that out. Like, how many checks you want is a multiple of how many customers are potentially impacted by the data that's being presented.

Right? I mean, 4x that. So if you're interacting with a million potential customers, to get full coverage, breadth and depth, you're probably needing maybe ten thousand checks, maybe a hundred thousand checks.

Just to be able to get the coverage to provide confidence that the data is fit for purpose, based off of how it was generated and what you wanna do with it. K. Now, for those of you who have implemented data quality or are implementing data quality, how many of you would say confidently that you are above a thousand data quality checks running at least regularly in your pipeline?

K? North of five hundred?

One? North of two fifty?

North of a hundred?

One?

Somewhere between zero and a hundred?

A lot. Okay.

The reason why that's the case, and this was actually something I was sharing earlier, is this.

Many different groups within companies will look at how other groups have solved problems. And sometimes that’s groups across industries.

What’s happening right now is a lot of data engineers and data architects.

They're looking at what software engineers did. And how did software engineering get through what they did? Well, they went through the entire agile process, they went through the entire DevOps process.

Data engineering needs to do that.

Right? That’s the way that they think. But from the business perspective, they don’t see it that way. They see it from more of an application performance monitoring perspective.

You solve it by democratization, not by coding.

Right? And democratization means you either have to hire more people, which is not really feasible depending on the size of your company, or you have to lower the bar of entry to allow more people that you already have to actually be dedicated to that process, or actually have input on it. So solving your data quality debt: your debt is how much coverage you want in breadth and how much complexity you have in depth.

That's your debt. So to cover that debt, you can either try to code it, which is really difficult and raises the bar of entry, right, to only those who are technical, or you have to democratize it. And democratizing means you have to lower your bar of entry and allow nontechnical people to have an impact on a typically technical process. So that was going to be the demo.

So I actually have a couple questions for you all.

You don’t have to give me answers if you don’t want to. That’s totally fine. But here’s the first one.

When should a data pipeline test, so that's the CYA test, when should a data pipeline test combine with an analytic metric?

Anyone want to give an answer to that? Or we can just ruminate, on a rhetorical question.

Alright. We’ll let that simmer.

Second question, what is your average mean time to production for developing and deploying a production ready analytical metric? That means from the moment that you know you need an analytical metric to when it’s actually in production and working as intended, What’s the mean time to production for that?

A couple of months?

Okay.

Big company could be six months.

Absolutely.

But there is usually a cycle in production Mhmm.

Because of the process. Mhmm.

So they might say, like, we do it, like, once a month.

Yep.

That’s it.

If you miss this day.

Yep. You’re out. Yep. Yeah. It’s almost like you have a sprint, like a mini data quality sprint. Yep.

I will tell you that the average that we’re seeing with all the companies that we’ve talked to is about six weeks.

So it’s about six weeks from when it’s originally declared to when it’s actually in production, that includes all the iteration. That’s per data quality metric.

Okay. So if you need to deploy ten thousand, ten thousand times six weeks is not, not feasible in our lifetime, I guess.

How often does your data model evolve?

Now, a lot of times when we say data model, we assume this means schema. Right? That could mean schema, but it could also mean just the way that you're presenting your data. It could mean an AI/ML model. It could be the fact that your model was developed for purpose for one thing, and the purpose has evolved over time, so now you need to do more.

Right? So if we think about how often the data model evolves, especially now with the whole Gen AI chat GPT thing that everyone’s wanting to do because it’s super easy.

That’s, like, every hour.

Pretty much.

Right? The other thing that makes that evolve a lot, and we don't really consider it as leaders, is the fact that one of the ways we have to get the data to the model to actually have the desired output is to buy the data. We have to go and buy it from a third party.

They're not gonna deliver consistency. It's just not a thing that they do. Right? Anyone in here who ever has a CSV file dumped somewhere, and then you have to do something with that CSV file, there's always a desire to do just a basic check on the file before you try to, like, load it into your system, because it's going to either break the system or it's going to give you a false positive. Sure, it loaded. We have a timestamp from when it loaded; come to find out, it didn't load at all, completely rejected.

We have no idea.

Okay. How many data pipeline tests should be made for each data pipeline leg?

So a leg is every single time data needs to move from something to something, and/or be transformed. That's a leg, or a task, in your data pipeline. How many tests, just data pipeline tests, should we implement per leg? Anyone wanna give a rough estimate?

Two thousand number of columns. Yep.

K?

A dozen or two? Yeah.

Okay.

It's pretty close to that. It's gonna be about thirty.

And that's because on a leg, on a pipeline leg, you have the beginning and the end. And you have to implement checks on both. And then you have to implement checks that cover both together. Right? So it's like reconciliation or referential integrity checks, and people will talk about that. Last one.

How confident are you after this conversation, how confident are you in your data's quality and integrity?

Like, if you look at all the stuff we've talked about, if we look at all the stuff that's internal, and you had to give a percentage range: I am X percent confident in the quality and the integrity of my data. How many of you would actually give, like, above ninety percent? Above seventy-five percent?

Got a couple nods. Above fifty percent? Twenty-five and above?

K, we got a couple. Okay. I think that's everybody else. I don't wanna see anyone, like, crying because it's, like, ten percent confidence.

So these are good questions to kind of think about, take back, maybe even pose to your internal teams who are handling data.

Right? These are really good questions to kinda get that conversation kicked off and make sure that we're actually moving in the right direction. So, yes, take pictures. If you wanna take pictures, I believe this will be recorded, so you can go back.

If you wanna be dragged through the glass again on data quality issues.

Awesome. No more pictures. Alright. So that’s everything. And I believe we are done early. Sweet.

So questions? Like, what questions do you all have?

Got one right here.

Hi. This is Navaron from Chile Bank.

You know, if things were done the right way from a data quality perspective, the rules and checks would have to be built closer to the source. But many times, from a practical consideration perspective, it generally is done further down the value chain, and probably that's not the right, the tactical, thing. Like, what is your recommendation in terms of how we balance that aspect: what is the right point where the quality rule needs to be implemented, while ensuring that the time to market is also not affected as a result of the remediation? Yeah.

That is an extremely difficult question. And I'll tell you why: because you have to balance something called the snowball effect.

For any of you who are Dave Ramsey fans, or financial get-out-of-debt fans, the snowball effect is you pay the minimum on all of your debts and you pay the maximum on your smallest one. Once you pay off that smallest one, you take all of that and just roll it into the next smallest one. So you're snowballing your whole way through until finally you pay off the whole thing. So, crash course in Dave Ramsey. You're welcome.

The snowball effect when it comes to solving data debt is really difficult, because it has to do with both the amount of debt we have and the amount of value that's derived from solving that debt. Right? Because value is what you're gonna be providing to consumers.

Value will drive a couple things for you that will make it easier.

But it won't be seen that way. Like, you know, if you stop the fire when the match is lit versus stopping the fire when it's a thousand acres, stopping it at the match is obviously better. Right? If we can implement all these checks at data generation, great. But it's not gonna drive the most value, because we already have problems.

So the most value is driven by stopping it on the far right-hand side. Right? And then we're gonna slowly address it and shift left. Right?

And we're gonna kinda wipe it out in that direction. The reason why you do that is because it goes back to, like, who owns the pain. Right? Your value is gonna be directly tied to the highest person that owns and feels the pain.

Right? Because once you get them, let's say it's your CFO. Right? They're feeling a lot of pain, they're trying to make business decisions, they're trying to share something with the shareholders, they're trying to do something financially, and they don't trust the data.

Right?

If you solve their problems, the first thing they're gonna do is sponsor you to solve the next person's problems. And when you are at a company that large, if you can get a C-suite to sponsor you, like, to do an executive sponsorship on a process and an actual project that you're moving through, I mean, that's huge. Right? But where you'd like to stop it is far left, really far upstream.

You're not gonna get executive sponsorship there ninety-nine percent of the time. Sure, there's the one percent, and that's usually if your CFO was a practitioner. Not common. Right?

So if you start left, you're actually going to find more blockers. It's gonna take you too long to implement. You're gonna have other teams that are trying to do their own thing. You're gonna have teams that don't wanna share.

Right? I used to work at a really big company. The Oracle team wouldn’t talk to us. Which was really weird.

So we had to trade political favors to try to get their data because it made more sense with our Microsoft SQL data. But if you start that far, you’re gonna encounter cultural issues, political issues, you’re gonna count all that type of stuff. Sure. You’ll stop most problems if you start there, but no one’s gonna realize the value of that for a long time.

So you need to start right and then shift left. Right? That’s gonna be your snowball effect. So but that’s a great question.

We got two questions more than the last guy. Heck yeah.

I think it's very interesting that you mentioned data quality as fit for purpose, or fit for usage. What about the definition and the cataloging piece of things? What does the business define it as?

How does your data quality perspective relate to data cataloging? Because we all spent three days here talking about how data lineage and quality, yep, are all tied together. So, it is. How does your product or your vision line up that way with the definition of catalogs?

Yeah. That is a great question. It kinda ties a little bit back to what was being mentioned earlier: it's who's gonna be delivered the most value. Right? So I like to say catalog, lineage, and quality are kind of like Google Maps or Waze. Lineage is gonna be all of the roads that tell me how to get where I need to go.

Catalog is all of the ratings and all of the information about the stores and the places that I wanna get to.

Right? Like, if I don't have that, I just have a bunch of random roads. If I don't have the roads, I know a lot about what's going on around me, but I don't know how to get there. K?

And quality is actually gonna activate that for you. Right? It's gonna tell you if something's closed or something's open, if there's higher traffic than usual or not on the roads, is there traffic on the road,

is there a police officer on the road, is there a speed zone, you know, like, all that type of stuff.

All of that's gonna be incorporated between those three. Right? Now, ideally, sure, you know, write a blank check, get all three in there at the exact same time, hire a bunch of SMEs. Yeah, that would be awesome.

But who is gonna be delivered the most value from the main value points of those solutions?

Right? So if our biggest problem is the fact that we don't have good data, right, and we're not making good business decisions because we're not getting good data, but we know where the data comes from, then quality would be more beneficial. But if you're getting a lot of complaints where people are like, I just don't know where to get my data, well, lineage and catalog make way more sense.

Right? And then they’re gonna be like, okay. Well, now that I know where to get it and how it gets there, how do I know it’s good? Now quality comes in.

Right. So inevitably, you want catalog and lineage. You want quality. There were some conversations about data dictionaries in the last session and stuff like that, and I, like, winced, because I used to write my own data dictionaries, and they were horrible.

Mainly because of me. But, you know, you need something that's gonna be a lot easier to spread across your complexity.

Right? And so I would just say: who is feeling the most pain, who's gonna get the highest level of value delivered to them? Whatever they're feeling the most pain on, that's what they're going to sponsor you for. Right? And this is gonna bleed into the last answer. Let's say what they feel the most pain about is bad data.

You implement data quality. It works really well. All their problems are solved. Then you go back to them and you’re like, by the way, we think that there’s more data we could deliver to you.

That's better. Right? We can give you more information that's required, but we need something that's gonna be able to tell us where it's all coming from and where it's going. They're gonna sponsor you for that next thing.

Right? So you have to kind of have this, like, really broad understanding, you know, three, four years out, of where you would like to go, and how do you implement that in a way where you actually get that sponsorship to move on.

Yes. Yes. The question was, does Lightup connect to data catalog tools? Yes, we do. I purposely didn't want to demo Lightup specifically today. My demo was actually gonna be in Excel because, you know, I'm a glutton for punishment.

But I wasn't going to specifically demo Lightup today. You can go to our website if you wanna see it. Email me afterwards if you wanna see it, which is fine.

But, yes, we do integrate with those, and we're actually expanding a lot of the roadmap to make sure that that's really bidirectional and a good handshake between the two. So, good question. Yeah.

So we have about three minutes left. We can take other questions or wrap up early.

Awesome. Awesome.

Thank you all for coming.

Yeah.

Really appreciate it.
