Zach Quinn

Summary

The article discusses three practical strategies for overcoming optimization limitations in data engineering processes: rescheduling, refactoring, and rethinking.

Abstract

In the realm of data engineering, optimization is crucial for efficient data processing. However, there comes a point where further performance improvements are constrained by a so-called "optimization ceiling." The article, "3 Data Engineering Quick Fixes When You Hit An Optimization Ceiling," outlines three high-level approaches to break through this barrier. The first strategy, rescheduling, involves adjusting when processes run to avoid conflicts and optimize resource usage. The second, refactoring, recommends breaking down complex processes into simpler, more manageable parts to improve reliability and speed. Lastly, rethinking challenges the necessity of existing processes, advocating for the removal of any that do not add significant value. These strategies emphasize the importance of considering the broader context of data processing tasks and making informed decisions beyond mere code optimization.

Opinions

  • Optimization in data engineering should not be pursued indefinitely, as it can lead to diminishing returns and increased technical debt.
  • Rescheduling jobs can be a simple yet effective solution to improve performance without altering the underlying code.
  • Refactoring is not just about improving code but can also involve reevaluating and restructuring entire processes for better efficiency.
  • There is a recognition that some legacy processes may continue to operate due to a lack of scrutiny rather than actual utility.
  • Domain knowledge is crucial when reevaluating processes, as it enables engineers to ask critical questions and understand the bigger picture.
  • Communication with stakeholders is important to ensure that processes align with business needs and to justify the elimination of unnecessary tasks.
  • Engineers should be proactive in identifying and proposing alternatives to inefficient or obsolete processes, which can lead to organizational benefits.

3 Data Engineering Quick Fixes When You Hit An Optimization Ceiling

Use the 3 Rs of optimization to smash through technical barriers.

Helsinki Library’s dome is prettier than the optimization ceiling. Photo by Jaakko Kemppainen on Unsplash.


Raising The Optimization Ceiling

Optimization is a data engineering buzzword that is misunderstood and, at times, misused.

From SQL queries to Python scripts to Airflow DAGs, it’s not enough for a process to simply be functional.

As a data engineer, nearly everything you build must process as much data as possible in as little time as possible using as few resources as possible.

Despite employer demand for optimized, performant processes, few learning materials focus on the feasibility and limitations of optimizing a process.

Even fewer tell you the harsh truth:

If you continue to chase performance for performance’s sake, you will eventually hit what I’m calling the optimization ceiling.

Like the debt ceiling (for those in the U.S.), the problem of the optimization ceiling is made worse by ignorance.

Ignoring impending limitations to your optimization efforts increases technical and resource debt.

In other words: There will be a point with optimization where you’re just kicking a very expensive and resource-intensive can down the road.

While I can’t anticipate your organizational challenges, I can say that there will come a day when you just won’t be able to improve anymore.

Good management and stakeholders will understand that engineers are bound by resource, time and technological constraints.

The rest will think you aren’t trying hard enough.

If you find the optimization ceiling closing in, I suggest considering the following three quick fixes.

Note: None of these fixes is tied to a particular language or stack; rather, each is intended to help you consider overall pain points in your processes.

1. Rescheduling

Wouldn’t it be great if, instead of laboring over Python code to fix a slow-running process, you could just change one digit in a cron job?

Now, this isn’t always possible or always sensible, but if I encounter a slow-running process, I ask myself three questions:

  • When does this run?
  • What else is running during its execution?
  • Why does it have to run at this time?

I can tell you that, as a less experienced engineer, the timing of a process was the last thing I thought about once I merged code into prod.

Sure, I’d approximate, but I would stop short of asking myself the last two questions.

Then, when a process began running slower than usual, my first reaction was to dive into the code and try to optimize the product without thinking in a larger context.

I’ve been able to improve processes that I’ve already “optimized” (over-engineered) simply by looking at the bigger picture of our load jobs and changing my cron statement.

Next to re-running a process in the UI, rescheduling is one of the easiest fixes you can apply.
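To make the question "What else is running during its execution?" concrete, here is a minimal Python sketch rather than the tooling I actually used. The job names, start times and runtimes are made up for illustration, and the sketch assumes every job starts and finishes on the same day (no runs crossing midnight).

```python
from datetime import datetime, timedelta

# Hypothetical daily schedule: job name -> (start time "HH:MM", typical runtime in minutes)
JOBS = {
    "load_orders": ("02:00", 45),
    "load_events": ("02:30", 60),
    "build_reports": ("04:00", 20),
}

def window(start_hhmm, minutes):
    """Return the (start, end) datetimes of a job's run window."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    return start, start + timedelta(minutes=minutes)

def overlapping_jobs(jobs, target):
    """List the jobs whose run window overlaps the target job's window."""
    t_start, t_end = window(*jobs[target])
    clashes = []
    for name, (start_hhmm, minutes) in jobs.items():
        if name == target:
            continue
        start, end = window(start_hhmm, minutes)
        # Two windows overlap when each one starts before the other ends.
        if start < t_end and t_start < end:
            clashes.append(name)
    return clashes

print(overlapping_jobs(JOBS, "load_orders"))  # ['load_events']
```

If the list comes back non-empty, moving the start time in the cron expression may be worth trying before touching the code.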


2. Refactoring

I recently created four Python scripts to serve as an ETL pipeline for the same data source.

The crazy thing?

They started as one failure-prone function.

Because the process used an open-source package to hit an API, I was getting a ton of bad request errors: my function calls were exceeding rate limits.

When I complained to my senior engineers that the process was functional but failed in prod, they suggested I dismantle the script, create separate functions and stagger the runs.

It has since run flawlessly (and faster).
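The idea of splitting the work and staggering it can be sketched in a few lines of Python. This is not my production code: in my case the staggering happened across separate scripts, while the sketch below does it inside one process. The names `fetch_staggered` and `fetch_one`, the batch size and the pause length are all assumptions; the right values depend on the rate limits of whatever API you call.

```python
import time

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fetch_staggered(ids, fetch_one, batch_size=5, pause_seconds=1.0, sleep=time.sleep):
    """Call `fetch_one` for every id, pausing between batches to stay under a rate limit.

    `fetch_one` is a hypothetical stand-in for the real API call.
    `sleep` is injectable so the pause can be skipped in tests.
    """
    results = []
    batches = list(chunked(ids, batch_size))
    for n, batch in enumerate(batches):
        results.extend(fetch_one(item_id) for item_id in batch)
        # Pause after every batch except the last one.
        if n < len(batches) - 1:
            sleep(pause_seconds)
    return results

# Usage with a fake API call that just doubles the id.
rows = fetch_staggered(list(range(12)), lambda i: i * 2, batch_size=5, pause_seconds=0.1)
print(len(rows))  # 12
```

Keeping the batching, the pause and the API call in separate functions also makes each piece easier to test and to rerun on its own when one of them fails.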

While refactoring typically happens in the code itself, you can also refactor a process.

When you begin your first job, you’ll realize that you’re inheriting a lot of legacy processes that continue to run not because they’re foolproof, but because they haven’t been properly reviewed or scrutinized.

I recently had a Python script that ran in a VM because, at the time, it mirrored another process.

The whole run, on average, would take 2 hours.

After some team discussion, I broke apart the components of the Python script, rewrote an Airflow DAG and refactored some SQL queries.

The new run time?

10 seconds.

Granted, we threw a little more compute power at the problem (a last resort, especially if you’re cost-conscious), but by being skeptical of the process and willing to kill what seemed to be perfectly good code, we broke through the previous optimization ceiling.

3. Rethinking

Like rescheduling, reevaluating or rethinking a process involves little coding, but does require solid domain knowledge and a willingness to ask uncomfortable questions of anyone who owns a particular process.

If you’ve attempted rescheduling and refactoring and still find your processes are inefficient, you need to answer a more existential question:

Why does this process exist in its present state?

Like refactoring, investigating the origins of a pipeline might reveal that it, in fact, serves no tangible need.

In an industry plagued by a divide between technical and non-technical teams, something might exist simply because someone many managers above you willed it into existence.

If a slow process is essential and stakeholders are open to discussing it, you can ask the following questions:

  • Why do we need all of these fields?
  • What might be a more constrained date range we could pull?
  • How flexible are you on the timing and cadence of this data’s delivery?
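If the answers let you trim fields and shorten the date range, the change to the extract itself can be small. Here is a minimal Python sketch under stated assumptions: the function name `build_extract_query`, the `event_date` column and the `%(name)s` placeholder style are all hypothetical (placeholder syntax varies by database driver), and the table and column names are assumed to come from trusted code, not user input.

```python
from datetime import date, timedelta

def build_extract_query(table, columns, days_back, today=None):
    """Build a SELECT that pulls only the agreed columns for a bounded date range.

    Returns the SQL text and a dict of parameters for the driver to bind.
    """
    today = today or date.today()
    start = today - timedelta(days=days_back)
    column_list = ", ".join(columns)  # explicit columns instead of SELECT *
    sql = (
        f"SELECT {column_list} FROM {table} "
        "WHERE event_date >= %(start)s AND event_date < %(end)s"
    )
    return sql, {"start": start, "end": today}

sql, params = build_extract_query("orders", ["order_id", "total"], days_back=7)
print(sql)
print(params)
```

Every field and every day you drop from the pull is data the pipeline no longer has to move, which is often cheaper than any amount of tuning.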

If your management supports cutting bloated processes, then it might be as simple as deleting the pipe (provided it does, in fact, provide no tangible business value).

Being proactive in identifying pointless pipes and being brave enough to brainstorm alternatives can result in a win-win-win for you, your management and the organization.

