The Simple Reason Your Data Pipelines Will Never Be 100% Automated

Even AI solutions can’t fix gaps in vendor infrastructure and the ultimate data engineering constraint — time.

Want to create your own portfolio-worthy data automation project? Learn how with my free project guide.

Data Engineering: A Waiting Game

One of the worst waits I’ve endured was on a boat filled to capacity, idling in a monsoon; if that doesn’t sound bad enough: I was the captain. When at the helm of a boat, it’s these moments that make you question why anyone romanticizes boats or bodies of water. Of course, most boats offer an efficient, comfortable way to travel–in ideal conditions. Unlike cars and even most airplanes that can climb and dip below storms, when watercraft and their passengers encounter rough weather all they can do is wait it out.

Unfortunately, no matter how technically perfect and resilient your data pipelines are, there are aspects of ingestion even the best data teams can’t predict or out-engineer. Two of the most significant “out of our control” challenges I’ve encountered are:

Vendor-related or “upstream” failures; all we can do is open a ticket and wait for a response
Unpredictable load times

And while I could write a nice anthology series on the first challenge, I try to be a friend of all data vendors and API maintainers, so we won’t get into that. However, I do want to discuss the second point with an emphasis on receiving data sources at unpredictable intervals or through unconventional channels.

Sometimes data ingestion moves at a sloth’s pace. Photo by Sebastian Molinares on Unsplash.

The Biggest Challenge: No API

Now, for context, data pipeline orchestration can typically be broken down into two broad categories:

Event driven (triggered by an upstream stimulus like a webhook or Airflow task)
Scheduled

The frequency of pipelines depends on your stakeholders’ needs and your technical constraints. Some resource-intensive processes might run once a day, others once a week and some may even run hourly or every 5 minutes.

We can use cron-based tools like Google Cloud Scheduler to establish and maintain these predictable schedules. But then there are the outlier processes. They’re still necessary, but they might not fit neatly into the above definitions. And, unfortunately, you might find yourself at the mercy of your vendor.

For instance, there might be a report that is simply available when it’s available. Google offers several data transfers, like those associated with billing, that fit this description, a challenge I emphasized in the last part of my ETL pipeline walkthrough trilogy.

Write Your First ETL Pipeline (Part III)

Take a SQL script from a SQL environment to Google Cloud Platform by introducing a dynamic data check and upload step.

medium.com

You might even run into a rather unconventional scenario I’ve encountered in which a vendor doesn’t yet have an API but still needs to provide you and your team data.

In this case they might manually send you an email every day.

Aside from causing engineers spoiled by well-documented, intuitive APIs, to wrack their brains creating workaround solutions to ingest data from unconventional channels, these methods also add uncertainty to reporting. Though code-less pipelines and bespoke reports (manually prepared and sent) sound really nice, it’s frustrating when these methods fail or simply don’t arrive as scheduled because there is no solution.

Having to tell any stakeholder their data isn’t ready is a difficult conversation because, while missing ingestion might be a technical error to an engineer, it’s possible that being forced to “fly blind” without data could result in the loss of revenue. And make no mistake, if there’s a revenue issue you will be alerted and you’ll feel that — sometimes in your paycheck if the company’s bottom line tanks.

Instead of making your pipeline more resilient, the best thing you can do in these scenarios is to create a specific, actionable contingency plan. Ensure that when you’re dealing with a vendor that provides a schedule-less delivery you have a personal point of contact and, if at all possible, some way to obtain data as a backup. For instance, I work with a vendor that leverages a third-party querying tool to let us write SQL-like queries to develop reports our webhook-based pipeline (my baby) ingests.

But if there’s an issue with the timing of the data I can still get this data because, each morning, I get emailed a CSV of the results. I can also generate this CSV at any time using that query tool’s UI. So while this doesn’t represent a typical API-based solution to data ingestion, it’s actually quite resilient and protected by redundancy.

Unconventional Problems, Unconventional Solutions

Of course, when configured correctly, emailed reports could still be automated solutions. There are a variety of flows you can create that push a file from an email provider like Gmail or Outlook to a cloud drive.

So, to close, I’ll share a problem I recently solved that, had I not found a bit of a workaround, would be a fully manual task. As I’ve shared, I’ve been working to build my own reporting infrastructure using a combination of data sources and solutions on GCP. Like any good business (though I’m far from it) I needed to get revenue data. Though I don’t do this for the money, as I’m fortunate to have a full-time role, it does serve as a helpful metric for what pieces are “working.”

However, despite trying a few different approaches, I haven’t found a way to manually extract payment data per story. The closest method I’ve found is extracting historic payment data from a JSON-ified version of the Partner Program dashboard, using this story as inspiration:

How to Get Medium Story Stats with 3 Lines of Python Code

A step by step guide in how to extract and read a JSON file containing your stats into a Pandas DatFrame

python.plainenglish.io

At first I would download the file and manually parse the data using Python. Recently, though, I changed my approach to storing the file in Google Cloud Storage. Since I configured my pipeline to grab the most recent version of that file at the beginning of every month, the only work I have to manually do is to ensure that I’ve kept that file updated.

For context, here’s my code:

import pandas as pd 
import requests
import re
import json
from datetime import datetime
from google.cloud import bigquery
from google.cloud import storage
import config as cfg
import logging
import google.cloud.logging

logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%I:%M:%S')

client = google.cloud.logging.Client()
client.setup_logging()

def get_file(file: str):
    
    bucket = storage.Client().get_bucket("medium_info")
    bucket = bucket.blob(file)
    pay = bucket.download_as_string()
    pay = pay.decode("utf-8")

    return pay

def get_paid():
    
    medium_text = get_file(cfg.file)

    logging.info(f"Downloading {cfg.file} from GCS...")
  
    med_df=pd.DataFrame(json.loads(re.sub(r'^.*?{', '{', medium_text)))

    pay_info = pd.json_normalize(med_df.loc['postAmounts', 'payload'], sep= '_')

    logging.info("Creating data frame...")
  
    pay_info['post_firstPublishedAt'] = pd.to_datetime(pay_info['post_firstPublishedAt'], unit='ms')
    pay_info['post_firstPublishedAt'] = pay_info['post_firstPublishedAt'].dt.strftime("%Y-%m-%d")

    final_df = pay_info[["post_title", "post_firstPublishedAt", "post_id", "totalAmountPaidToDate"]]

    final_df.rename(columns={"post_title": "postTitle", "post_id": "postId", "post_firstPublishedAt": "postFirstPublishedAt"}, inplace=True)

    final_df['totalAmountPaidToDate'] = final_df['totalAmountPaidToDate'] / 100
    
    final_df['dtUpdated'] = pd.Timestamp.now("US/Eastern")
    
    return final_df

The catch is that even though the month starts on the 1st, data doesn’t always post until the 2nd or 3rd, once stories have had an opportunity to earn. To account for this, like in my Stripe pipeline, I simply set my schedule a few days into the month (typically the 11th).

Stripe pipeline cron schedule. Screenshot by the author.

Then I use simpler tech like a phone reminder to ensure I’ve completed the download process. Thankfully, as long as the file is present in the bucket, contains data and hasn’t changed format, the pipeline will automatically ingest the file and upload the data to BigQuery.

When I read think pieces about automation in data (I’ve written some) or hear concerns about AI, I feel reasonably confident that unconventional use cases like the ones shared above will be what keeps the human in the loop.

Since not every company scales their data infrastructure at the same rate, there will be vendors who aren’t equipped to offer data in a synchronous, on-demand way, like via an API.

As much as I want to think that I’ve done a lot of work in automating and, by extension, optimizing my workflow, there will always be an element of manual labor to data engineering.

There will always be 1–2% of the job that requires your judgment and brainpower.

Even if you’re just using your brain to compose an email to tell your vendor you haven’t yet received your data.

I need your help. Take a minute to answer a 3-question survey to tell me how I can help you outside this blog. All responses receive a free gift.