How To Test Data Pipelines — Even When You Don’t Have Any Data

How to create, evaluate and apply test data to data pipeline builds.

Like a yacht, test data is a luxury that not all of us will have access to. Image courtesy of Alexander Mills on Unsplash.

It wasn’t until about a year into my position as a data engineer that I realized that test data was not a requirement. It was a luxury.

Initially, I assumed nearly all projects would be accompanied by some kind of data that I could use to “sanity test” my builds. If I were working with a third-party API, most vendors offered dashboards that helped visualize the output, even before I had data flowing into my pipeline. I could often download these reports as CSVs to perform quality assurance checks during and after a pipeline build.

However, just as you won’t always receive ideal, clean datasets, there are times when components of data engineering will be inconvenient. There may even be times when you are missing the most important component of your data pipeline: the data.

When I was a newer data engineer faced with testing a pipeline with limited, inaccurate or nonexistent test data, I was tempted to just blindly program and hope for the best. If that failed I could always blame an “upstream” issue, right?

But I quickly learned that one of the undervalued but critical data engineering skills is the ability to create a test environment, complete with test data, just as a software engineer would.

For me, 3 specific scenarios come to mind:

The Unstable API
The Lacking CSV
The Not-yet Configured Alerting System

We’ll explore these scenarios and examine potential approaches to each.

The Unstable API

To clarify, when I say “unstable API,” I don’t necessarily mean a broken API.

I’m simply referring to an API that has constraints that make it difficult to reliably extract test data (assuming all infrastructure is sound, of course).

Constraints could include organization-wide rate limits, contractual or cost constraints like the number of API calls allowed in a given timeframe, and/or an API that provides some data but then breaks.

In both my personal and professional projects, I’ve encountered several variations of the unstable API and I’ve been frustrated at its role in preventing me from efficiently and reliably testing my pipelines.

The best approach, if you can manage it, is to get some slice of data (assuming it’s available) in the final form you hope to ingest.

For instance, if you’re using a JSON-based pipeline, you’ll want to write the .json output to a local file (make sure to appropriately encode/decode your files). You can even do this with chunks of streaming data.
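As a minimal sketch of saving a slice of JSON output locally, the snippet below writes a list of records to a file and reads it back with an explicit UTF-8 encoding. The function names, the file name and the sample record are hypothetical placeholders, not part of any particular API.

```python
import json
import tempfile
from pathlib import Path


def save_sample(records, path):
    """Write JSON-serializable records to a local file as UTF-8."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


def load_sample(path):
    """Read the saved records back using the same encoding."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Hypothetical record standing in for one item of an API response
    sample = [{"id": 1, "name": "Zoë", "tags": ["a", "b"]}]
    target = Path(tempfile.gettempdir()) / "sample_response.json"
    saved = save_sample(sample, target)
    print(load_sample(saved) == sample)
```

Once a file like this exists, you can develop against it without spending any more API calls.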

Having even one record in a JSON dictionary can help you design your ingestion process, since you’ll have access to the relevant keys.

This will also help you design an accurate and robust schema.
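To illustrate how a single record can seed a schema, here is a rough sketch that maps each key to the name of its Python type, flattening nested objects into dotted names. The record and the infer_schema helper are made up for illustration; a real schema would still need decisions about nullable and optional fields that one record cannot reveal.

```python
def infer_schema(record, prefix=""):
    """Map each (dotted) key in a JSON record to the name of its Python type."""
    schema = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, producing names like "user.name"
            schema.update(infer_schema(value, prefix=f"{name}."))
        else:
            schema[name] = type(value).__name__
    return schema


# Hypothetical record; in practice, load one from your saved sample file
record = {"id": 7, "user": {"name": "a", "active": True}, "score": 1.5, "tags": ["x"]}
print(infer_schema(record))
# {'id': 'int', 'user.name': 'str', 'user.active': 'bool', 'score': 'float', 'tags': 'list'}
```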

The Lacking CSV

It finally happened.

You received an assignment for work or school. Instructions or requirements were clear. The data made sense. Even the API documentation was legible (a rarity in this industry sometimes).

The person who assigned the work was even nice enough to include a CSV containing test data.

Unfortunately, they were only able to provide fewer than 5 rows.

And each row was missing a different field that should be present in the schema.

On top of that, your organization/school hasn’t finalized the contract with the vendor who owns the API in question, so this CSV was supposed to be a way to test before getting access to the production data.

Big problem.

The good news is that, at this stage, you’re just testing infrastructure, so the content of the data doesn’t really matter.

Plus, since CSVs can’t contain nested columns, you’re likely dealing with flattened data, which is slightly easier to ingest.

In my mind, you have 4 options to make this mystery data appear:

Ask the stakeholder or your technical account manager for better sample data (it can’t hurt to ask, but this might produce mixed results)
Find existing organizational data with a similar structure. Is your data mostly STRINGs? Pull from a report that contains similar data.
Create synthetic data with a tool like Mockaroo (unaffiliated)
Focus on the most complete record you have and design your pipeline and accompanying schema assuming you’ll have complete data.
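If you choose synthetic data but would rather not depend on an external tool, a few lines of standard-library Python can produce complete rows. This is only a sketch: the column names and value generators in SCHEMA are invented, so swap in the fields your real schema expects.

```python
import csv
import io
import random
import string

# Hypothetical schema: column name -> function producing one fake value
SCHEMA = {
    "order_id": lambda rng: rng.randint(1, 10_000),
    "customer": lambda rng: "".join(rng.choices(string.ascii_lowercase, k=8)),
    "amount": lambda rng: round(rng.uniform(1, 500), 2),
    "status": lambda rng: rng.choice(["new", "paid", "refunded"]),
}


def make_rows(n, seed=0):
    """Generate n complete rows; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    return [{col: gen(rng) for col, gen in SCHEMA.items()} for _ in range(n)]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(SCHEMA))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(make_rows(3)))
```

Seeding the generator means every test run sees the same rows, which makes failures easier to reproduce.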

While management will likely understand constraints like a lack of resources, it’s in these kinds of scenarios that you really earn your paycheck as a data engineer, since creating new data sources is the core of the job.

The Not-yet Configured Alerting System

Unlike working around a sputtering API or a paper-thin CSV, configuring an alerting system means relying on data generated at specific time intervals or as a result of certain triggers, like log entries.

In this scenario, creating test data becomes more difficult because you need to also engineer a situation in which that data becomes available.

Let me clarify that with an example from an alerting system I built which uses GCP’s logs.

The goal of the build was to create an alerting product that detected erroring Compute Engine VMs based on log entries (and to fix it after it later malfunctioned).

Thankfully, however, we could go days or weeks without infrastructure failing.

Did that mean that I extended my timeline and waited for something to error to test the alerts?

Nope.

I did the opposite.

I mean I did the literal opposite of good engineering.

I wrote code that ran inside a VM and purposely triggered an error, which I would then attempt to catch with the Python script I was developing.

This was ultimately a helpful approach because:

It proactively tested the script rather than reacting to a condition
It could be run multiple times, on-demand
It allowed me to better understand how errors are generated and how GCP interprets them
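The same idea can be sketched locally with Python’s standard logging module. This is not the GCP setup described above: an in-memory handler stands in for the log store, and trigger_failure and find_errors are hypothetical names for the “break it on purpose” and “detect it” halves.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log records in memory so a test can inspect them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def trigger_failure(logger):
    """Deliberately raise and log an error, standing in for a failing VM."""
    try:
        raise RuntimeError("simulated VM failure")
    except RuntimeError:
        # logger.exception logs at ERROR level and attaches the traceback
        logger.exception("vm-health check failed")


def find_errors(records):
    """The alerting side: return messages for entries at ERROR or above."""
    return [r.getMessage() for r in records if r.levelno >= logging.ERROR]


logger = logging.getLogger("alert_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo's records out of the root logger
handler = ListHandler()
logger.addHandler(handler)

logger.info("routine heartbeat")  # normal entry the detector should ignore
trigger_failure(logger)           # on-demand failure
print(find_errors(handler.records))  # ['vm-health check failed']
```

Because the failure is produced on demand, the detection logic can be exercised as often as needed instead of waiting for real infrastructure to break.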

Knowing the context for how to generate test data is as important as the data you ultimately create.

Takeaway

Testing is an overlooked and undervalued component of the data engineer role.

Frankly, testing can be tedious and time-consuming.

A lack of high-quality, coherent test data can make an already-painful process downright excruciating.

However, these constraints are where you, the data engineer, can be creative in evaluating the data products you’re responsible for creating and delivering.

Using the strategies mentioned above, I’ve moved from a reactive mindset of asking stakeholders for test data to a proactive mindset of nearly always creating my own (within reason).

Understanding how to properly test your work, especially under such constraints, is essential to your technical and professional growth.