Understanding Data Quality and Why Teams Struggle with It
Data quality: the catch-all term for business logic, reliability, validity, and consistency
Conversations about data quality can be difficult, especially when the elephant in the room is an underperforming product.
These discussions typically play out among disappointed stakeholders, frustrated product managers, and misunderstood engineers.
Familiar phrases might bounce off the walls, including:
- “fix the data”
- “discrepancy”
- “data validation”
- “trust”
- “data quality”
But there is a force at work preventing individuals from arriving at a common understanding. Words are being spoken, yet for some reason they aren’t landing. Reading between the lines of what each person says, it’s clear multiple definitions of “data quality” are at play.
The meaning and implications behind the words are different for each person. The validity of the team’s collective experience and perspective is undermined as they continue to talk past each other. The clock keeps ticking, and they eventually exit the conversation without a clear resolution.
That’s quite the pickle, and it’s a common theme in data products.
The phrase “data quality” is widely used and can mean different things to different people. Let’s kick things off with a subjective definition of this term from the perspective of the three archetypes introduced above: stakeholders, product managers, and engineers.
Stakeholders: let’s assume these are less technical people who interact with data products (like dashboards) in their day-to-day operations. To them, good data quality means the information accurately reflects the real-world processes they interact with. When they see a dashboard, their first thought is to export the data to a spreadsheet so they can reconcile the numbers against other known quantities they trust.
Product managers: the primary concern here is that the numbers in product X match the numbers in product Y and tell a cohesive story. If the numbers match across the whole product line, they’re doing a decent job of playing the hand they’ve been dealt. Product managers don’t necessarily know how to reconcile against a technical source of truth, but they do expect information to be consistent across all products.
Engineers: they are responsible for architecting data pipelines that support the data products at scale. They also interpret business logic and convert it to technical implementation. If they can verify that output X is correct given input X, then they are maintaining the integrity of the data passing through their pipelines.
These archetype definitions aren’t perfect, but they’ll help paint a picture of how it’s possible for “data quality” to have a completely different meaning depending on who’s talking about it.
Engineers tend to evaluate data quality based on inputs and outputs within self-contained systems, while stakeholders tend to judge data quality by their expectation that the data should directly reflect actual business operations.
When the engineering implementations are flawed, that’s the classic source of “bad data” that almost everyone has encountered at some point. Maybe a query is failing or data is being duplicated, and as a result the numbers are off. This is the easiest problem to understand, and this is often where we look first when something about the data doesn’t smell right.
But there will also be situations where an engineer executes their job flawlessly and still there is a discrepancy between what a stakeholder sees and what they expect to see. This can appear to the stakeholder as a “data quality” issue that engineers need to fix, even though the pipelines are performing exactly as they should. Situations like this often occur when processes, designs, or ingredients upstream from engineering teams (operations, accounting, etc.) are flawed or otherwise incomplete, and they may require a coalition of teams and disciplines to solve.
Engineers and stakeholders see things from different perspectives, they have different priorities, and they also tend to have strong opinions on how to invest time and resources to move things in a positive direction. This sets the foundation for some serious miscommunication.
In this article we’re taking a stab at cutting through the Gordian Knot that is “data quality”. By leaning on the perspectives of our three archetypes (stakeholder, product manager, engineer) we’ll build an intuitive understanding of what data quality means in various contexts.
Interested in tumbling deeper down the data product rabbit hole?
- Read my article on understanding the various layers of data management and how they enable products.
- Follow my Substack for insights on data engineering and product development.
Data quality: nailing down the ingredients and keeping them fresh
We’re talking about data quality, so it should come as no surprise that maintaining the integrity of raw and derived data is critical. But that concept is ambiguous on its own, so let’s flesh it out.
Maintaining the integrity of raw and derived data means the data is:
- Reliable. The pipelines are robust, and downstream processes and users trust that product data will be delivered consistently, on time, and in a predictable state.
- Verifiable. The data is demonstrably correct: it can be verified and validated against a source of truth at every stage of development.
- Accurate. The data correctly reflects the real-world processes it describes.
Reliability of the data pipelines is a core responsibility of the engineering teams. Capturing information as business-relevant events occur (sales, user interactions, etc.) and reliably logging it for future processing and analytics requires flawless execution in order to supply downstream teams and pipelines with high-quality data.

Accepting any margin of error at this stage translates directly into an inability to reliably tell the story of what is happening in the business when it comes time to build dashboards and generate insights from the resulting data.
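To make that concrete, here’s a minimal sketch of a post-load reliability check in Python. The table shape, the `loaded_at` column, and the one-hour freshness SLA are all hypothetical assumptions for illustration, not a prescription:

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Hypothetical contract for a daily sales table: an expected schema
# and a freshness SLA that downstream consumers can count on.
FRESHNESS_SLA = timedelta(hours=1)
EXPECTED_COLUMNS = {"order_id", "sale_amount", "loaded_at"}


def check_reliability(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means the load looks healthy."""
    problems = []

    # Predictable state: the schema promised to downstream consumers is intact.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing expected columns: {sorted(missing)}")
        return problems  # no point checking further without the schema

    # On time: the most recent load landed within the freshness SLA.
    # Assumes loaded_at holds timezone-aware UTC timestamps.
    latest_load = df["loaded_at"].max()
    if datetime.now(timezone.utc) - latest_load > FRESHNESS_SLA:
        problems.append(f"stale data: last load at {latest_load}")

    # Predictable state: no duplicated events slipped into this load.
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id values detected")

    return problems
```

Checks like these don’t make a pipeline reliable by themselves, but failing loudly and early is what lets downstream consumers treat “on time and in a predictable state” as a promise rather than a hope.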
Verifiability adds another layer of confidence to how data and information is being managed as it travels through various systems and processes to its final destination.
Imagine we have a reliable car we’re traveling in, meaning our car does everything right: its wheels are spinning, windshield wipers are wiping, and the cruise control is cruising. But we stop for gas and leave the windows rolled down while we go inside to buy some snacks. Someone snatches our bags from the back seat, and we don’t check on our bags when returning to the car. We drive off, and at the final destination we realize our bags were lost somewhere along the way. The vehicle (pipeline) reliably got us where we wanted to be, but we failed to verify important details along the way and only realized at the end of the trip that we didn’t like the result.
Every time we move data around or transform it, it’s good to verify that the information coming out of the transformation is valid given the information we passed in. Oftentimes, high-level sanity checks like “do total sales before and after the operation still add up to the same number?” go a long way in creating useful checkpoints to verify the data against a relative source of truth. When outputs are out of sync with the inputs, that’s often a red flag that we made a mistake in our pipeline.
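Here’s a minimal sketch of what such a checkpoint might look like in a pandas pipeline. The enrichment step, column names, and tolerance are illustrative assumptions; the pattern is simply “transform, then assert that the invariants held”:

```python
import pandas as pd


def enrich_sales(raw: pd.DataFrame) -> pd.DataFrame:
    """Stand-in transformation: join region metadata onto raw sales."""
    regions = pd.DataFrame({"store_id": [1, 2], "region": ["east", "west"]})
    return raw.merge(regions, on="store_id", how="left")


raw = pd.DataFrame({
    "store_id": [1, 1, 2],
    "sale_amount": [100.0, 250.0, 75.0],
})
enriched = enrich_sales(raw)

# Checkpoint 1: the transformation should neither create nor destroy revenue.
# The tolerance absorbs benign floating-point noise; anything larger is a red flag.
before, after = raw["sale_amount"].sum(), enriched["sale_amount"].sum()
assert abs(before - after) < 1e-6, f"sales totals diverged: {before} vs {after}"

# Checkpoint 2: a one-to-one join should preserve row counts; extra rows
# mean duplicate join keys quietly fanned out (and double-counted) sales.
assert len(enriched) == len(raw), "row count changed during enrichment"
```

The assertions encode invariants the transformation should preserve: joins shouldn’t fan out rows, and enrichment shouldn’t change revenue. When one of them fires, we’ve caught the mistake at the checkpoint instead of in a stakeholder’s dashboard.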
Having good verification processes in place also directly benefits the business by introducing higher levels of explainability to the data driving important decisions. When decision makers trust the data, the data is far more likely to be used and to deliver value to decision-making processes.
Accuracy is where responsibility shifts to the product side, especially in terms of interpreting what matters to the business and proactively ensuring data and product pipelines align with those priorities.

In the context of building data products, accuracy can relate to the literal values within the data, but more importantly it’s about accurately mapping product needs and requests to the metrics that matter. Which numbers and definitions should we be investing time in?
Engineers can guarantee 100% verified data from input to output, but the end result may fail to add any value to the business if product managers and product-minded engineers have not properly defined metrics and enabled the appropriate ingredients to flow into the pipelines.
Data is simply a commodity until good definitions and business logic are applied to it. When this is done well, data becomes less of a commodity and more of an actual product that delivers value to the business.
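One lightweight way to keep those definitions from drifting between products is to centralize them in code that every product imports, rather than re-deriving the logic per dashboard. The KPI below is a hypothetical example, not a recommended definition:

```python
import pandas as pd


def active_customers(orders: pd.DataFrame, window_days: int = 30) -> int:
    """Customers with at least one completed order in the trailing window.

    This is the single, shared definition of the KPI. Every product that
    reports "active customers" imports this function instead of re-deriving
    the logic, so the number cannot quietly mean different things in
    different dashboards.
    """
    cutoff = orders["ordered_at"].max() - pd.Timedelta(days=window_days)
    recent = orders[(orders["ordered_at"] >= cutoff) & (orders["status"] == "completed")]
    return recent["customer_id"].nunique()
```

When product X’s dashboard and product Y’s report both call this one function, “do the numbers match across products?” becomes a property of the architecture rather than a recurring meeting.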
Avoiding miscommunications about data quality
Understanding an individual’s perspective when they initiate a conversation about “data quality” helps us dial into their actual pain point. That lets us steer the conversation away from the ambiguous, subjective anchor of “data quality” and toward the heart of the issue:
- is crucial information missing or mismanaged at the raw data layer?
- are there flaws in the business logic driving ETL in the derived data layer?
- are we failing to correctly model a crucial aspect of our business in our pipelines?
- are there inconsistencies in product KPI definitions, causing data discrepancies across products managed by various teams?
Because “data quality” is so subjective, it’s in everyone’s best interest to move away from that nebulous term and progress towards a common understanding of where the problem lives.
Correctly labeling what the issue is also helps avoid common frustrations:
- Engineers will be less likely to feel that the quality of their work is being questioned. If they have implemented robust processes that verify the data at every stage of development, being roped into a meeting about “bad data quality” will likely put them on the defensive.
- It’s more productive to discuss “why are these two metric definitions not aligned?” than to debate a wide-open question like “why do we have bad data quality?”. Identifying the issue means we can target exactly who should be in the follow-up discussions, resulting in shorter conversations that don’t pull in people who don’t need to be involved.
- Stakeholders are more likely to have their pain points addressed when the scope of the problem is narrowed from “the data is wrong” to “this KPI in product X is not consistent with the same KPI in product Y”.
Navigating data quality conversations: closing thoughts
In a world where everything seems to generate data and every product wants to be data-driven, asking ten people what “data quality” means will probably get you more than ten different answers. There’s hardly a tool or process in the modern business world that isn’t using, storing, or analyzing data.
Teams building data products will inevitably encounter challenges described by stakeholders as “data quality issues”.
When we encounter these challenges, it’s easy to get swept into a debate where one side (often engineering) defends their implementation while the other (stakeholders) catalogs inconsistencies and anecdotal experiences where the data products fall short of expectations.

A useful tactic for avoiding fruitless back-and-forth sparring about data quality is to consider where the complaints are coming from. Doing so limits the frustration of talking past one another and lets us ditch surface-level descriptions to identify exactly where the product is falling short of expectations.

Perceived “data quality” issues can trace back to data engineering, product definitions, user experience, platform reliability, and all sorts of other root causes. Reaching a solution that benefits everyone is often faster when we recognize that data quality is a culmination of every aspect of the business, and that labeling issues with more precise language sets teams up to identify the root cause and develop solutions.