Anna Geller

Workflow Orchestration vs. Data Orchestration — Are Those Different?

Let’s disambiguate the terms to understand workflow orchestration better — with a real-life analogy!

Photo by Artem Podrez from Pexels

With the rise of the Modern Data Stack, many tools in the industry started positioning themselves as “data orchestrators” rather than “workflow orchestrators.” This article attempts to disambiguate the terms. I’d argue that the data orchestration moniker is a confusing shorthand term and that workflow orchestration and dataflow automation better represent what orchestration for the Modern Data Stack is about.

Table of contents:
· What is workflow orchestration?
· What do people mean when they say “data orchestration”?
· Workflow orchestration — Explain Like I’m 5
  ∘ Sequential, concurrent, and distributed workflow execution
  ∘ Scale and graceful failure handling
  ∘ Hybrid execution model
· Why “workflow orchestration” is a less confusing term than “data orchestration” 
· Conclusion

What is workflow orchestration?

Workflow orchestration means governing your dataflow in a way that respects your orchestration rules (such as dependencies, schedules, and retries) and your business logic. A workflow orchestration tool allows you to schedule, run, and observe your workflows.

A good workflow orchestration tool will provide you with building blocks to connect to your existing data stack and will allow you to:

  • trigger an ad-hoc parametrized run,
  • assign custom (complex) schedules,
  • alert you when something fails,
  • retry and recover from failures,
  • save the time you would otherwise spend writing defensive code (think: countless try-except blocks anticipating anything that can go wrong) only to ensure that your workflow steps run in the right order and that you find out when a failure takes place (visibility).
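
To make the retry and visibility points more concrete, here is a minimal sketch of the kind of building block an orchestration tool gives you, so that you don't have to scatter try-except blocks through your own code. The decorator name `task` and its parameters `retries` and `retry_delay_seconds` are hypothetical and chosen for illustration only; they are not the API of any particular tool.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orchestrator_sketch")


def task(retries=0, retry_delay_seconds=0.0):
    """Hypothetical decorator: retry a function and log every failure."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # One initial attempt plus up to `retries` further attempts.
            for attempt in range(1, retries + 2):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    logger.warning(
                        "%s failed on attempt %d: %s", fn.__name__, attempt, exc
                    )
                    if attempt > retries:
                        raise  # out of retries: surface the failure
                    time.sleep(retry_delay_seconds)

        return wrapper

    return decorator


attempts = {"count": 0}


@task(retries=2, retry_delay_seconds=0.0)
def flaky_extract(n_rows):
    """Fails twice, then succeeds, to simulate a transient error."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return list(range(n_rows))


result = flaky_extract(3)
```

The business logic in `flaky_extract` stays free of error handling; the retry policy and the failure log live in one place. A real orchestrator would add scheduling, alerting, and persisted state on top of this idea.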

What do people mean when they say “data orchestration”?

They usually refer to the orchestration of workflow nodes that touch data. Any workflow nodes that interact with data, either producing or consuming data, fall into this category.

Following this definition, “data orchestration” is a shorthand term for orchestrating data (or data warehousing) workflows, but it still describes workflow orchestration or dataflow automation.

Workflow orchestration — Explain Like I’m 5

Imagine that the workflow orchestration tool is your personal delivery service:

  • Each order (or shopping cart) reflects your workflow,
  • Each delivery is a workflow run,
  • It’s extremely easy and convenient to put things into a shopping cart — you just add a couple of decorators, and you’re off to the races,
  • Within each order (or shopping cart), you may have many products that get packaged into boxes — your tasks,
  • Products within the boxes may have various flavors, forms, and shapes, and they reflect what you put into your shopping cart — what you wished to be orchestrated and how,
  • Flavors may reflect your data replication jobs (e.g. Airbyte), data transformations (e.g. dbt), data cleaning (e.g. pandas), your ML use cases (e.g. scikit-learn), and so much more.
  • Your boxes may be as small or as big as you wish — it’s your order in the end (your workflow design),
  • Products inside of your boxes may come from various vendors, i.e. your data tools, e.g. dbt, Fivetran, your favorite ML frameworks, your custom data cleaning libraries,
  • The delivery address may either be your home address (your data warehouse), your holiday address (your data lakehouse), or an address of a friend (some external database, data processing service, microservice, or application).

Photo by Norma Mortenson from Pexels

Sequential, concurrent, and distributed workflow execution

Following the delivery service analogy, when orchestrating a workflow, you may:

  • decide whether you want to get your order delivered all at once or sequentially — the order of execution of your tasks,
  • choose your delivery type — you may choose a standard (sequential execution) or an express delivery using parallel execution (imagine multiple delivery trucks), or even speed up the execution within a single thread using concurrency with async (a single but faster and more efficient deliverer who can better context-switch),
  • determine how your order should be (gift) wrapped — you may choose to package it into a subprocess, a Docker container, etc.

The workflow orchestration tool will take care of the delivery, i.e., the execution. It will ensure that your products get packaged as desired and shipped on the right schedule, with the right delivery type for each box: some packages need to be delivered quickly with express delivery, while others can wait and get executed sequentially.
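
The three delivery types can be sketched with nothing but the Python standard library. This is an illustration under simplifying assumptions, not how any specific orchestrator implements execution: the function names `deliver`, `deliver_async`, and `deliver_all` are made up for the analogy, and a thread pool stands in for the "multiple trucks" (for CPU-bound work, processes or a distributed executor would be the more typical choice).

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def deliver(box):
    """A blocking task: simulates waiting on I/O such as an API call."""
    time.sleep(0.05)
    return f"{box} delivered"


async def deliver_async(box):
    """The same task written as a coroutine."""
    await asyncio.sleep(0.05)
    return f"{box} delivered"


boxes = ["box-1", "box-2", "box-3"]

# Standard delivery: one box after another.
sequential = [deliver(b) for b in boxes]

# Express delivery: several "trucks" (threads) working at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    threaded = list(pool.map(deliver, boxes))


# A single deliverer who context-switches while waiting (async).
async def deliver_all(items):
    return await asyncio.gather(*(deliver_async(b) for b in items))


async_results = asyncio.run(deliver_all(boxes))
```

All three variants produce the same deliveries in the same order; what changes is how long the whole order takes and how many workers are involved. Choosing between them per task is exactly the kind of decision you hand over to the orchestration tool.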

Scale and graceful failure handling

A good delivery (orchestration) service scales extremely well. You can have multiple deliveries scheduled for the same time with potentially thousands of trucks (or even cargo ship fleets) and millions of packages. It will still give you fine-grained visibility into the delivery state of every package.

This scheduling service is highly available to guarantee that your order will get shipped even when some suppliers get sick or some trucks break down. And many things can go wrong during the delivery:

  • some boxes may get damaged and may need to be returned, i.e., retried or restarted,
  • the entire delivery may need to be rescheduled because you weren’t at home at that time.

Privacy

A good delivery service also respects your privacy and operates purely on metadata, such as your shipping address, the delivery type, packaging form, etc. It is then responsible for the transport and execution (the dataflow), but it cannot and should not open the box to check what’s inside (your data).
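
Here is a small sketch of what "operating purely on metadata" could look like. It assumes a hypothetical `RunRecord` structure and a hypothetical `run_and_record` helper, neither of which comes from a real tool: the task's result goes back to your code, while the record kept for the orchestrator holds only the task name, the final state, and the duration.

```python
import time
from dataclasses import dataclass


@dataclass
class RunRecord:
    """What the hypothetical orchestrator stores: metadata only."""

    task_name: str
    state: str
    duration_seconds: float


def run_and_record(fn, *args, **kwargs):
    """Run a task and return (result, metadata-only record)."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        state = "Completed"
    except Exception:
        # For this sketch, a failure is only reflected in the state.
        result = None
        state = "Failed"
    duration = time.perf_counter() - start
    return result, RunRecord(fn.__name__, state, duration)


def load_customers():
    """Stands in for a task that touches sensitive data."""
    return [{"name": "Alice", "email": "alice@example.com"}]


data, record = run_and_record(load_customers)
```

The `record` tells you that `load_customers` ran, whether it succeeded, and how long it took, but it contains none of the customer data. That is the "box stays closed" property from the analogy.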

Why “workflow orchestration” is a less confusing term than “data orchestration”

A good workflow orchestration tool (your delivery service) allows you to pick and choose your products, put them into boxes, and customize your order as you wish, but in the end, it’s responsible for the data movement (the delivery, transport, dataflow, the execution), not for the actual products within those boxes (your data). Therefore, the term data orchestration is a confusing shorthand: it is missing the word that describes the actual movement of data as it flows through your system. Dataflow orchestration, workflow orchestration, data movement orchestration, data processing orchestration, and data transport orchestration are all much clearer than the shorthand term data orchestration.

Workflow orchestration is about the dataflow and ensuring that you can rely on its execution through various failure-handling mechanisms. It can give you visibility into how long the delivery took. It can provide you with all shipment updates (your workflow execution logs). It can tell you whether a box was successfully shipped to the end recipient, but it cannot directly open it to check the brand, quality, and origin of the products inside.

Borrowing the analogy from this blog post, orchestration is about the arrows indicating the transitions between various boxes (tasks) as they get executed, not about the contents of the boxes themselves (your data). It should guarantee that your flows and tasks run as intended in the right order, at the right time, and with the right parallelism. It should guard against errors and failures, help you recover from them, and help you correctly interpret the execution states.
