CIOReview Recognized Lightup as
Enterprise Data Quality Monitoring Solution Company of the Year
2024

Lightup Fireside Chat with Malcolm Hawker, CDO of Profisee

Transcript

Malcolm, welcome to the show. I read your recent LinkedIn post, and I’m like, I have to come talk to you about this. It’s a spicy hot take, and there’s so much to unpack in what you’re saying there.

I think most people listening to this probably already know you, so you probably don’t need an introduction. But for those of you that don’t, Malcolm, maybe you want to introduce yourself quickly, and I’ll just add that you have been in the industry longer than you probably care to admit.

There’s just so much to draw from your experience. But why don’t we introduce ourselves real quick?

Well, first and foremost, thank you, Manu, for having me and giving me the opportunity to share my perspectives with your audience.

I’m Malcolm. I’m the CDO of Profisee. We make master data management (MDM) software. I’ve been in the data and analytics space for about thirty years, been around the block. I’ve worn a lot of different hats, but I’m very active out on LinkedIn. Part of my mission is to share what I know with the broader data and analytics community and to help other CDOs and other senior data and analytics leaders succeed in their mission to drive transformative value from data. So, thanks for having me today, and I look forward to the chat.

And poke the bear every once in a while. Right?

Oh, I poke the bear almost every day. Yeah, I like poking the bear. And the reason is because we have a lot of data to show that the status quo is just not working that well.

Mhmm.

Right? There’s lots of data to show the status quo is holding us back.

So I think we need more people looking at different and more provocative ways of solving old problems. So that’s a role I hold near and dear: chief bear poker.

Yep. Awesome. And I’m Manu Bansal. I’m the founder and CEO of Lightup Data. We are building a data quality platform that we’re very proud of, working with large enterprises, and really enjoying the journey.

Malcolm, you made a very interesting statement recently, which is challenging the core foundations of data.

And I was just intrigued when you said that, because people don’t normally do that. And I think the way you framed it makes perfect sense to me. You’re talking about how data, especially in the world of LLMs, has really shifted from what it used to look like in the world of analytics to being much more text oriented and unstructured. And I think it’s natural to assume that your data foundations will just carry over, but you seem to be implying they don’t.

And you’re taking a hard stance on that. What exactly are you getting at?

Well, you know, I guess we need to be careful with generalities, right, and with sweeping statements.

Sweeping statements get a lot of clicks and they get a lot of attention, but they are sweeping statements. So I should probably preface by saying that this general shift that we have towards unstructured data, driven by LLMs, is a challenge for most companies, not all. It’s a challenge for most companies. It’s a challenge for most CDOs.

It’s a challenge for most data and analytics foundations, because the foundations we’ve been focused on for the last twenty years are the foundations needed to support analytics: rows and columns, very structured data. Yes, we can have conversations about, you know, ETL versus ELT, lakes versus warehouses. We can have those conversations.

But at a very, very high level, the core foundations that most CDOs and most data leaders have been building, most, not all, but most, are built around structured data needed to optimize analytical processes.

The reality of a GenAI driven world is that LLMs are built on and optimized by text. Right? They were built off the Internet. They’re trained on text data.

They are fine-tuned using text-based data, and they’re optimized by text-based data, meaning the prompts that we type into them. So if you’re a CDO out there and you’re saying, yeah, I want my company to use LLMs and to get value from generative AI, so I’m gonna double down on my foundations: you’re doubling down on something that is probably enabling your analytics platform, your Qlik, your Tableau, your Birst, your Snowflake, you know, a data warehouse, but it’s not enabling a broader use of GenAI in your organization.

Can you give me an example of what that foundation or that element might look like that doesn’t actually carry over?

That doesn’t carry over. Well, let’s take a data quality rule, for example. A basic data quality rule that says this data must be present, or it must conform to this standard, or even the idea of, here’s how I determine whether an address is correct or not. An address is actually one that might carry over, because an address might be useful to an LLM.

But basic data quality rules are all built around looking at an individual field or an individual record. Right? Your data quality rules are typically not looking at full paragraphs of data.

Right? There may be some, for example, maybe in health care or some other uses of data quality, where they are looking at more of a narrative.

Mhmm. Right? But typically, what we’re looking at is individual fields of data, individual attributes, individual records, and applying data quality standards to them, or making data conform to certain standards. So this is just one example.

Okay, I’ve got a foundation built on these data quality rules that is checking to make sure a value is, you know, an integer instead of a varchar. Well, that’s not enough in a world of LLMs. If you’re trying to apply data quality in the world of LLMs, you would necessarily need to look at narratives, stories.

You’d need to look broader. The Air Canada case is a classic example here, where Air Canada got sued because they were putting incorrect information into a chatbot around their bereavement policy, the policy they used to reimburse people for airline tickets bought to go to the funerals of family members.

And that data was incorrect, but it was based on long-form, text-based data stating a given policy. A data quality rule that is built to validate data going into a data warehouse for a downstream analytical process would be ill suited to catch that.
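The kind of field-level rule Malcolm is contrasting with narrative data can be sketched in a few lines. This is a hypothetical illustration (the field names and checks are invented for the example), not any particular product’s rule engine:

```python
# Minimal sketch of classic field-level data quality rules.
# Field names and thresholds are hypothetical, for illustration only.

def check_record(record: dict) -> list[str]:
    """Apply field-level checks: presence, type, and format conformance."""
    errors = []
    # Presence: the field must exist and be non-empty.
    if not record.get("customer_id"):
        errors.append("customer_id missing")
    # Type: expect an integer, not a varchar-style string.
    if not isinstance(record.get("age"), int):
        errors.append("age is not an integer")
    # Format: a crude 5-digit ZIP-code conformance check.
    zip_code = record.get("zip", "")
    if not (len(zip_code) == 5 and zip_code.isdigit()):
        errors.append("zip does not conform to 5-digit format")
    return errors

print(check_record({"customer_id": "C1", "age": "41", "zip": "9410"}))
# → ['age is not an integer', 'zip does not conform to 5-digit format']
```

Checks like these validate individual attributes going into a warehouse; none of them can say whether a paragraph describing, say, a bereavement policy is accurate, which is exactly the gap for LLM-bound narrative text.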

I see, I see. That’s interesting, because I’m trying to put this into the context of what I was hearing at the Gartner Data & Analytics Summit earlier this year. You know, you couldn’t go to a talk that didn’t talk about GenAI, and you couldn’t go to one that didn’t talk about data quality. And it was just alarming to see how those two topics would be part of pretty much every single conversation.

And one thing I felt coming out of that event, and in follow-up conversations, was that from a qualitative, first-principles point of view, it makes perfect sense that GenAI is being held back by data quality. But the moment you try to link the two, it starts to become very vague and murky. Right?

But at the same time, like, the fact remains that that’s what we hear CDOs complaining about the most in holding back those initiatives.

I don’t think you’re suggesting we don’t have data quality as a primary need for LLMs, right? You’re not suggesting that. It’s still the same problem. Right?

No, no, not at all. Data quality is more important than it’s ever been before, for the very reason stated.

But here’s the problem. You cited Gartner. I was there as well.

And the problem is, what I’ve heard is two years of platitudes.

I’ll give you an example.

AI governance.

We need to have ethical practices, and we need to have practices around data governance that limit the presence of bias.

What do they actually mean? What are the rules that I, as a data practitioner, would use to ensure that the data I was using was ethical?

Mhmm.

Nobody is talking about what those actual rules or policies would be. Because if you need to implement them at scale, if you need to automate them, if you need to write some sort of script against the data that says, okay, is this data ethical? then you need to mathematically apply some sort of rules or conditions to the data. And how do you do that when all you’re hearing at these conferences is platitude after platitude? Right? And it’s impossible to argue with: data quality is important for companies to get value out of generative AI.

But what does that actually mean? Well, what it actually means is that the data the model consumes has to be right, whether during training, though ninety-nine percent of companies aren’t gonna be training their own models, so we’re really talking about either fine-tuning or prompts. Right? Is the data that is going into a prompt accurate?

Is the data that is being used for fine tuning accurate? And how do we ensure that? If we’re talking about a paragraph of data that describes, who knows, right, your bereavement policy or maybe your HR policy, what are the processes that you would use to make sure that that data is correct? How are you going to do that?

How are you going to deploy data stewardship resources to make sure that data is correct? Are you gonna be able to automate it? Nobody’s talking about these things.

Not to mention the fact that most of the stuff we govern is actually structured, highly structured. Rows, columns. It’s not the unstructured stuff. Again, there are outliers. There are companies in the health care space that have been doing OCR, optical character recognition.

There are companies in a few spaces that have been applying governance to less structured data. But for the most part, most companies are just completely ignoring unstructured data: SharePoint servers, PDF files sitting out on marketing Google Drives.

Right? That stuff is getting sucked into LLMs left, right, and center with absolutely zero governance. So you go to Gartner, and you hear, oh, well, we need to double down on foundations, we need to focus on data quality to enable, you know, LLMs, when the fact of the matter is nobody is out there talking about how.

How.

Yeah.

And it’s a very interesting take, or angle, that you’re getting into, which is the how part of it. Can we get into the details here a little bit? At some level, you’re saying if you can’t measure it, how do you manage it at all? Right?

And then immediately, you start to talk about what you should measure. So before you even get into how you should measure, what should you be measuring? What does data quality even mean in this context? And as much as I have thought about it, I feel like the answer drastically varies depending on which end of the pipeline you’re looking at, and the two ends are obviously the extreme points.

You talked a little bit about what it looks like going into a prompt, and you want the data to be accurate. And the primary measure, at least if I put this in the context of a human review, what I would tell my expert to review for, is accuracy of the data that’s going into the prompt. Right? And I could imagine accuracy here could also encompass completeness, which then starts to touch on freshness and all those typical attributes of data quality that we talked about.

Right?

If policies are not up to date, then they probably are not going to be accurate. Right?

Do you have any ideas on what it would look like at the head end of the pipeline? We kind of alluded to the store of PDFs sitting on a SharePoint, and no one is governing that, and then you suddenly expect LLMs to do well on them. It’s not going to happen. But even if I forget about the step of feeding this into LLMs, and I’m just saying I want my PDF store to be ready for one day being fed into LLMs, what should I be tracking today if I’m going to start using it in six months?

That’s a great question. I mean, step number one is just discovery.

Right? Just discovery. You gotta figure out what’s out there. To me, that would be step one.

Like, what is the universe of arguably valuable data that you could use to help inform or optimize or potentially fine-tune a language model for a given business problem? Right? So what is the universe of data out there?

What’s out there? Right? I suspect this is a huge challenge. Like, do CDOs even know how much data is sitting in the marketing realm, for example, that could be arguably extremely valuable to build models related to customer preferences or buyer behaviors?

Those align more with traditional ML models. But are there other data sources out there that could be used to create some sort of copilot for a customer service use case, for example? Right? How much data is sitting out there in customer service FAQs and that type of repository that is largely, if not totally, ungoverned from the perspective of the CDO?

Maybe it’s governed by some local process, where the customer service function is managing its own data quality, managing its own rules. Okay, that’s great. But what are those rules?

So step number one would be discovery. Step number two would be to understand what governance is being applied today and who’s responsible for it. What do those processes look like? Right?

How do you ensure things like change management? How do you ensure all the dimensions, Manu, that you just mentioned, whether there’s four, six, or twelve dimensions of data quality, depending on who you ask. Right? Yeah.

So to me, those would be steps number one and two: what’s out there, and what are you doing to ensure some idea of governance? And then number three would be to overlay some requirements for AI to say, okay, what’s unique about AI?

What do we need to solve for?
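As a rough sketch of what the discovery step might look like in practice, here is a toy inventory pass over a shared drive. The extensions, staleness threshold, and directory layout are assumptions for illustration; a real effort would go through SharePoint or Google Drive APIs rather than a filesystem walk:

```python
# Toy sketch of the "discovery" step: inventory unstructured documents
# (PDFs, Word docs, etc.) under a directory tree so a data team can see
# what exists and how stale it is. Extensions and the 180-day staleness
# threshold are hypothetical choices for this example.
import os
import time
from collections import Counter

DOC_EXTENSIONS = {".pdf", ".docx", ".pptx", ".txt", ".md"}

def inventory(root: str) -> dict:
    """Walk `root` and summarize candidate LLM source documents."""
    counts = Counter()
    total_bytes = 0
    stale = 0  # files untouched for more than ~180 days
    cutoff = time.time() - 180 * 24 * 3600
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in DOC_EXTENSIONS:
                continue  # skip non-document files
            path = os.path.join(dirpath, name)
            counts[ext] += 1
            total_bytes += os.path.getsize(path)
            if os.path.getmtime(path) < cutoff:
                stale += 1
    return {"by_type": dict(counts), "total_bytes": total_bytes, "stale": stale}
```

Even a crude pass like this answers Malcolm’s step-one question: how much arguably valuable, currently ungoverned text exists, and how much of it has not been touched in months.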

You know, that in and of itself would be, for most companies, I suspect, a massive lift. And what I’m seeing out there is most data organizations are just kind of not doing it. And the usage of LLMs is happening organically, from the bottom up, within specific business functions, where people are just using off-the-shelf LLMs or copilots, like a GitHub Copilot to optimize engineering processes, or people in marketing using OpenAI to write, you know, customer communications or FAQ statements.

So is that right? Is that wrong? It’s happening. Right? And the CDOs could say, hey,

we gotta stop this ungoverned use of these processes. Well, it’s not gonna stop. It’s only gonna get worse.

And just trying, again, to deconstruct the problem statement here a little bit.

I’m hearing two different takes from the community right now. One take is that data quality for the unstructured data is just so poor that you cannot expect LLMs to do anything useful, at least not at a level you can actually put in production yet, and all the focus should be on improving the quality of this input data source.

The second, more developer-side sentiment seems to be that the majority of the gain right now is in playing with heuristics on the LLM pipeline side itself, whether you’re doing RAG or some variant of it: how you chunk up the data, whether you should chunk more or less, whether you should take neighbors, or multiple PDFs, or just one. It feels like a black art right now, which is not very scientific. But just because the bar is so low, simple heuristics actually tend to give you a lot of gain, and that’s really the limiting step right now, before you start to care about the quality of input data.
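The chunking heuristics Manu mentions can be made concrete with a minimal sketch. The chunk size and overlap values here are hypothetical knobs to tune, not recommendations:

```python
# Minimal sketch of a RAG chunking heuristic: split a long document into
# overlapping chunks so a retriever can index them. Sizes are hypothetical
# tuning knobs, chosen only to illustrate the technique.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split `text` into chunks of ~chunk_size chars, with `overlap` chars
    shared between neighbors so context straddling a boundary is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# "Taking neighbors" is another knob: at query time, also retrieve the
# chunks adjacent to a hit so the LLM sees surrounding context.
```

Varying `chunk_size` and `overlap`, or pulling in neighboring chunks alongside a retrieval hit, are exactly the kinds of low-cost heuristics that tend to move answer quality before the input data itself is ever touched.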

Where do you stand on, like, which is a more important problem today?

Those are astute observations, and completely accurate.

The first observation, this idea that the data quality is just so bad I can’t do anything, would, I think, closely align with CDOs who will most likely be looking for new jobs in the next year.

The fact that you are making this highly deterministic, very binary, it’s-either-good-or-it’s-bad statement when it comes to anything related to AI is a testament to a lack of knowledge around how AI works.

Right? Because AI is not deterministic. It’s probabilistic. And AI is highly, highly subject to context.

Right? So what is true in one context may not necessarily be true in another. It’s all about the context. As data people, we should know this by now.

We should know this by now. But many of us cling to these deterministic mindsets that make us think that data is either all garbage or all good. It’s neither. It may be both at the same time.

Right? It depends entirely on the use case. So if you are a data leader out there taking the approach of, our data quality is just so bad, I don’t know what to do,

I can’t use this. Well, that’s the wrong approach, because it’s inaccurate. I guarantee you there is a business problem out there right now, today, that could be solved using an LLM, even given the present state of, let’s say, your CRM data, which is the one people always pick on. I guarantee you there is business value that can be delivered even using the data as it exists today.

Let’s remember that LLMs are built on the Internet.

The Internet is not necessarily a bastion of data quality. It’s not well known for the accuracy of its data. Yet somehow, these amazing systems that we’re building are still able to provide meaningful value to us. People are still using them.

Kids are still using them to do their homework day in and day out, with varying degrees of success, I would imagine. But the other use case you talked about is a very practical one, which I’m seeing happen in most CTO organizations, because people are getting fed up with the lack of traction related to AI, and they’re going and trying to solve these problems more on the CTO side of the house, where they are deploying complex RAG processes, where they are using vector databases, where they are trying to find ways to add context and insight off of structured datasets using, you know, a graph, and others.

So, yes, that is the more practical way to do it. It is the more outcome-driven way to do it. Today, what you described, at the very least vectors and graphs, is the only way that I know of to add the context necessary to make structured data more consumable by LLMs.

So that’s a more practical approach. The data shows us, and this was published in a NewVantage Partners survey earlier this summer, that only about five percent of companies are taking the approach that you just described. I hope to see it increase in the coming year, and I expect it will, because there’s a ton of value out there when we find a way to operationalize that highly structured data and have it consumed by LLMs, during some sort of fine-tuning process or even within a more complex prompt. So, yeah, that’s the practical way to do it, but not enough companies are focused on it.

Yeah. I mean, there might actually be an irony hiding in trying to really clean up the data feeding into LLMs. Right? The whole promise is that you don’t have to do that. Are you then basically starting to structure your unstructured data, which was supposed to be what you avoid in the first place?

Here’s the paradox. Right? The only way I think we’re gonna be able to get our data into a state where we can operationalize it and get value at scale for AI is to use AI to do it.

To use AI to do some of the data prep. Right? To use AI to get the data into a more unstructured format, right, to create narratives or to create stories that GenAI-based solutions can more readily consume. So there’s a little bit of a paradox there.

There’s a little bit of an irony there. But this does start to get into some of the foundations that you were talking about, which is, can you build processes to start injecting stewardship, or more traditional governance processes, into something like what we just described? Right?

Where you are doing data profiling at scale on structured data, and you are building graphs on it, and making assertions based on those graphs, or on some of the triples in that graph, can you then apply more traditional data stewardship to make sure it all makes sense? Can we improve some of these processes and scale some of these processes? Yes.

I think we can. And we can bring some of those foundations into this new world. But far too many data leaders are taking that first approach, Manu, that you described, which is, well, it’s junk, and I can’t use it. It’s useless.

Yeah. Yeah. Maybe maybe instead of saying data is junk and I can’t make GenAI work, we should be saying data is junk, and the only way I can actually get value out of it is with Gen AI.

And so the question is, can we do that at scale? At scale, the answer is yes.

Because the Pareto principle tells us that eighty percent of our data is probably sitting out there unused, right, unmonetized, unanalyzed, ungoverned. It’s just out there in SharePoint servers and PDFs and image files all over the place.

Sitting in Word docs on hard drives. I mean, there’s a ton of data out there, and intuitively we should know that there’s a ton of value in that data. There’s a ton of risk in that data as well, but we should know there’s a ton of value there. And how do we extract that insight? The only way we’re gonna do it at scale is by using AI, ironically.

Yeah. So it’s like, how do you light up your dark data? And maybe that’s what the LLM is designed to do in the first place. Instead of fighting that dark, ugly data, maybe we should be asking, how do I actually get value out of it? Because that’s what the objective is. That’s what the challenge is.

I love it. And I love you using the phrase dark data, because it has multiple definitions in this case. Right? I think what you were referring to is data that is just sitting there, you know, not generating value.

But there’s another definition of dark data, which says it’s data that is just sitting in a data center on a disk somewhere, consuming scarce energy resources, but will never see the light of day on a report, will never be on a dashboard, and is never used to, you know, do anything of meaning.

And depending on who you ask, the estimates are all over the place on this, but anywhere from fifty to ninety percent of data is sitting dark. Right? All that data is sitting on disk somewhere, consuming energy, needing to be cooled, needing to be powered. Right? To the point where the data center industry, by some estimates, is producing more greenhouse gas than the airline industry and the shipping industry combined.

So maybe that is part of the use case here. Right? Maybe if it’s not getting the value from GenAI, then at the very least, maybe it’s doing well by the planet, because a lot of this data is just sitting out there consuming scarce resources and collecting dust.

Yep. Yep.

So if we could light up that data, it’s a win-win. Right? You can get some value from it. You can mitigate any business risk in it. And, also, by the way, you could probably do some good by the planet.

Yep. Yeah. I mean, I think it’s almost like we’re unlearning the habits that were developed before, and we have to get there before we can start to develop new habits. Right? So if you try to really pin down what those new foundations are going to look like, and what they should look like for the world of AI, we probably don’t know enough yet.

I think you’re on point, right? First, let’s agree that we need to unlearn the way we have been doing data infrastructure. That’s not working. Then we can start to ask what is. Right?

Totally agree. And when it comes to frameworks and foundations, that’s one thing that we do pretty well. Right? When you go to these conferences, you mentioned Gartner before, every single one of them is talking about AI governance, and most of them will show you an AI governance framework.

It’s how to actually operationalize it that’s the problem. Nobody’s talking about how to actually do that. So when you talk about this shift of foundations, yeah, I think figuring out the bits and bobs of the enabling capabilities of each of these frameworks is going to be necessary. I’m not worried about our ability to figure that stuff out. What I’m more worried about is what could more broadly be called a mindset.

The way we think about data, the way we think about our customers, and I use that word intentionally customers. Right? The way we think about our roles.

I think we’ve got a real challenge related to mindset. I gave you one example earlier when I was talking about the very deterministic way that people in the data and analytics space tend to see the world. Right? Data is either all good or all bad. It’s garbage in, which means it’s garbage out. That is a very deterministic, rules-driven approach to the world, when in reality, an AI-driven world is inherently probabilistic.

Mhmm.

It is inherently probabilistic.

Right?

So what is good for one may not be good for another or vice versa.

So that’s just one small example of a different way of thinking about these old problems.

Right? We use the phrase garbage to describe our data all the time. What sort of impact is that having, long term, on how we view our role and how we view the products that we’re building for our customers?

I suspect that it would have a corrosive impact over time. Right? This phrase of garbage in, garbage out. I could keep going on here, but I think the start is to think differently about data, think differently about these problems, think differently about how we approach things, challenge the status quo. Yeah.

Because it’s not working.

Mhmm.

I suspect in time that could lead to better things down the road.

Yep. And it kind of reminds me of how Columbia University built the walkways on their campus. They’re like, yeah, we could get a planner to plan it all out, but why do that?

Let’s just have people go to classes. And over time, people very quickly picked out the fastest paths going from point A to point B, depending on where they needed to go. You just started to see those marks in the grass, and they said, let’s pave these pathways. Right?

Another way of saying that: I’ll be giving a presentation in the first few months of this year related to data governance, saying we need to start over.

Yeah. We need to rethink data governance.

And one of the things that I’m recommending is that we try to make a pivot from rules-based policies to exception-based policies. And that’s basically what you just said: you can solve a problem by throwing rules at it and saying, here are the rules.

Right? Or you can let people walk to class, and the rules will naturally evolve. Right? The paths will naturally evolve. They always do.

So to some degree, not always: there are compliance and audit and regulatory concerns related to data governance that we need to adhere to. That’s the minimum bar. But I would argue that to go farther than that, we need to be more exceptions-driven than rules-driven. And I know that’s really pithy and high level, but I think that’s part of the mindset shift that we need to make here.

My friend Bob Seiner would perhaps call that more of a non-invasive approach to data governance. I think that has some relevancy here. But I would argue that a lot of data governance is happening, you know, kind of naturally within organizations today. And it is.

Right? Those PDFs and SharePoint servers that I talked about, they’re not completely ungoverned. There are controls around who can access them. There are controls around who can update a given marketing document.

So there are controls out there. It’s happening. The pathways are being built.

We just need to figure out who’s doing it. What are the rules used to do it today? Are those meeting the needs? And how do we need to change our approach? I suspect more of an exception-based approach would be the right one. And, honestly, that’s probably largely where we are outside the CDO organization today anyway.

Well, this is basically suggesting: let people build those applications first, then wait and watch what practices or recommendations emerge. Right? We don’t have to force a certain form of data governance or data quality yet. And if we try to, we are probably going to do more damage than good. It’s just much better to let people play with this stuff, build some applications, show some value.

And like you said, right, maybe we’ll make a few mistakes along the way, but those exceptions will teach us more than trying to sit at a drawing board right now, first creating foundations and then hoping for the applications to deliver value. Right? Yeah. Maybe that’s the big takeaway here: let people build, and then we will see what they need as support around it.

Well, because, again, we’ve got twenty years of data here. We know what happens when we take a rules-driven, control-driven approach where we are unable or unwilling to show the value, to quantify the value, and to prove the value of that approach. Right? If we’re saying, you must do this, yet we don’t say, here’s the benefit you’re gonna get from it, and it’s been quantified, it’s been modeled out, it’s been shown, then we’re taking an approach of all stick, no carrot, to use an old metaphor, and people are just gonna go do what they’re gonna do anyway.

They’re gonna do what they’re gonna do anyway. We see this every day. And this is why most data governance programs are struggling: it is all rules, all stick, and no carrot.

Right? And we talk a big game about the value of governance or the importance of governance or how important this is to get the value out of data. But then when our customers ask us, oh, okay. Can you prove it?

Crickets.

So then our customers are right to say, well, I hear you, but at the same time, I’ve got an SLA to meet. I’ve got a product development deadline. I’ve got customers I need to support, and I’m gonna go do it. Right? So they are building those pathways. They’re doing what they need to get done.

So I would say, hey, what have we got to lose in changing the way we think about these problems? I don’t think we have a ton to lose. I think we’ve got a lot to gain.

I think that was a really good deep dive. You had a lot of ideas that you touched on in that post, and I think we were able to get a good, comprehensive view of what it is actually getting at, and what we should be doing as a data community to really make this a productive exercise, as opposed to trying to discipline people when they don’t actually want it. Right? So, great conversation, Malcolm. Great to have you on the show. Thanks for talking to me, and see you around.

Thank you so much. I appreciate it.

Bye now.

Find hidden bad data across the modern data stack.

Get full visibility into enterprise data with Lightup’s modern Data Quality Monitoring solution.
