Stop The Bleeding: 4 Strategies To Troubleshoot, Triage Data Anomalies
Quickly identify, isolate and fix malfunctioning data pipelines for quality data, happier stakeholders and a stress-free workday.
I need your help. Take a minute to answer a 3-question survey to tell me how I can help you outside this blog. All responses receive a free gift.
The Usual Suspects Cause Anomalies In Your Data
Only in data science does a mistake make the company look better.
Spikes in user activity, millions more rows of mineable data and inflated revenue values.
On the surface, these all sound like good things. In more volatile fields like finance it’s rare but still plausible for an investment banker to approach a manager, say an investment’s returns increased by 5x overnight, and for the manager to not think anything of it.
If you say that to a data-minded executive, in their mind, your voice will fade and the only sound they’ll hear is alarm bells.
This kind of strange, out-of-nowhere variance in data has a name: Anomaly.

Companies with solid data infrastructure incorporate upstream and downstream checks for anomalies to ensure that the data that is delivered is clean, timely and, above all, accurate.
But such detection systems aren’t intelligent enough to just “know” how to spot an anomaly. These systems become more reliable as their underlying models are trained over time and on increasingly vast sources of data.
So if your model is unqualified for the job, who identifies, investigates and troubleshoots anomalies in newer data sources?
You.
For a newer data engineer, getting a message like “I don’t know what’s going on with this data” can be a bit intimidating, even if it’s something you partially or entirely built.
Luckily, as you gain experience investigating data anomalies, you start repeatedly encountering the same “usual suspects.”

Usual Suspect #1: A Broken Pipeline
The good news is a broken pipeline is simultaneously the easiest and hardest suspect to catch.
The tricky part is that when I say “broken” pipeline, I’m not referring to a script that triggers errors.
That’s a different beast.
Remember:
Execution != functionality.
In my mind if a pipeline isn’t reliably fulfilling its intended function, it’s broken.
If it’s a critical bit of infrastructure connected to other downstream processes then this becomes a big problem that infects every component of your data ecosystem.
A few red flags of a broken pipeline include:
- Multiple days of triggering a “no data” exception (you should always implement error handling for this reason)
- Unusual row counts (either too few or too many)
- Bad API requests that are logged but not addressed or trigger alerts
The following is code from a function I use to access my personal Stripe earnings.
Note the exception message. Additionally, notice how the message was repeated on multiple runs. Since this doesn’t run every day, that was an issue.
if len(current_dt_check) < 0:
data = make_request(cfg.base_url, token)
logging.info("Creating data frame...")
df = format_df(data)
if len(df) > 0:
logging.info(f"Uploading to {cfg.dataset}.{cfg.table}")
upload_to_bq(df, cfg.dataset, cfg.table, cfg.schema)
logging.info(f"{cfg.dataset}.{cfg.table} updated with earnings from month: {cfg.current_month}")
else:
logging.info(f"No data available for month: {cfg.current_month}")
If your pipeline is exhibiting any of the above behaviors, you’ll want to pause any trigger/schedule it’s on, take the code out of production and, above all, communicate to stakeholders that data will be unavailable until you’re confident that the pipeline is functional.
Fixes will vary depending on the error source, but the best place to begin is to look at your API request and step through each part of the ETL, EL or ELT process until you find the culprit.
Pardon the shameless plug— To get access to my weekly writing, please feel free to become a Medium member using my referral link (I receive a small commission from each individual who joins).
Usual Suspect #2: Downstream Letdowns
Data doesn’t live in isolation or even just in your data warehouse.
Data, like any resource, is made to be consumed, or, more accurately, devoured.
This is why, if you don’t have a robust alerting infrastructure, often stakeholders will be the first individuals to alert you to anomalous data.
Because they can literally see the problem.

Depending on how your organization or future organization’s data teams are structured, dashboard design might be the domain of analytics engineers or data analysts.
But if that dashboard relies on a source table or view, guess who they’ll be approaching first?
In this scenario we’ll assume that your upstream processes are functioning correctly, or at least as expected.
This is when you need to examine specific components of the loader scripts and views that power the dashboards.
You might check for:
- A malfunctioning join
- A dimension changing types
- Data “freshness” or availability
Having data connected to a dashboard can sometimes make it easier to “see” the problem.
Pardon the interruption: For more Python, SQL and cloud computing walkthroughs, follow Pipeline: Your Data Engineering Resource.
To receive my latest writing, you can follow me as well.
Usual Suspect #3: Crossing a line–Timezones
If you’ve worked in programming, especially data-oriented programming, you’ll share my frustration with timezones.
To understand how truly complex the concept of separate time zones is, I suggest you watch this talk I attended from PyCon 23. The variance in time zones is staggering.







