Ramesh Ganesan

Summary

The article discusses using Airflow tags to manage and visualize dependencies between DAGs, enhancing the troubleshooting process for data pipeline support engineers.

Abstract

The article, aimed at Data Engineers using Apache Airflow, explains the use of tags for organizing and filtering DAGs within the Airflow UI. It emphasizes the importance of understanding dependencies, particularly in large-scale environments with numerous jobs. The author introduces the concept of DAG lineage, which is not natively supported by Airflow, and proposes a method to represent these dependencies using tags. By tagging DAGs with 'upstream' (us:) and 'downstream' (ds:) prefixes, engineers can easily trace and manage the impact of failures and fixes without needing to refer to the code base or documentation. This approach simplifies the process of identifying and addressing issues within interdependent data workflows.

Opinions

  • The author believes that remembering dependencies between data pipeline programs is challenging, especially in large organizations.
  • Airflow's tags feature is seen as a powerful tool for filtering and managing DAGs, akin to hashtags for organizing content by topic.
  • The author suggests that Airflow's lack of visual representation for DAG dependencies is a gap that can be filled by using tags strategically.
  • The author posits that the proposed tagging system for DAG lineage is more efficient than traditional methods like checking documents or code bases.
  • There is an open invitation for readers to share alternative methods or improvements to the tagging approach for managing DAG dependencies.

How to use Airflow tags for DAG lineage?

As a Data Engineer, I always need help remembering the dependencies between the programs in our data pipeline environment. It is not easy to keep track of them when you work with multiple teams across an organization that runs hundreds of jobs.

I am writing this purely from the Airflow perspective. We use Apache Airflow to set up and maintain our data workflows. This article will not be useful if you do not use Airflow and are not interested in exploring it for your data infrastructure needs, so please decide whether to read further.

Image by Apache Airflow

Airflow:

If you work in a modern data engineering environment, you are probably aware of Airflow. If not, it is an orchestration platform for programmatically authoring, scheduling, and monitoring workflows.

You can read about Airflow and its capabilities over here: https://airflow.apache.org/docs/apache-airflow/stable/index.html.

This article discusses how to use “tags” for DAG lineage.

TAGs:

The Airflow UI shows a list of DAGs, which may cover the whole organization or several different teams.

To find a specific DAG, you can filter by the DAG name.

To filter the DAGs by functionality or by team, you can use “tags”.

Tags are simply indexed, searchable keyword strings. In simple words, they work like hashtags on social media, which identify posts about a relevant topic.

Airflow introduced “tags” in version 1.10.8.

Lineage:

In Airflow, lineage is used at the task level. Airflow can visually represent the dependencies between tasks as lineage graphs. By following these graphs, we can understand the flow of the data pipeline and troubleshoot failures inside a DAG.

In short, lineage is the ability to trace and understand how something flows through a system.

What about DAG Lineage?

Airflow does not provide a visual representation of DAG dependencies. Airflow DAGs are independent objects, although we can set dependencies between them when required. If your Airflow instance runs a sizeable number of DAGs, tracking what depends on what can be challenging. This is especially true if you are in support and need to fix something in the middle of the night, when referring to the code base is time-consuming and tedious.

How to achieve DAG lineage using tags?

Let’s start with an example. The diagram below shows four DAGs that depend on each other.

Diagram created by Author Ramesh Ganesan

The “example_2” DAG depends on the “example_1” DAG.

The “example_3” and “example_4” DAGs depend on the “example_2” DAG.

It is easy for us to understand from the diagrammatic representation.

The screenshot below shows the list of DAGs in the Airflow UI.

Screen-shot generated by Author Ramesh Ganesan

Let’s get to the real problem.

If “example_2” DAG fails, there might be two potential causes.

  • A validation failure: the “example_1” DAG produced missing or partial data, so it is an upstream issue.
  • A run-time error or bad data in the “example_2” DAG itself, which will also cause the DAGs that depend on it to fail.

We may get a support alert or an email, depending on the importance of the DAG. Usually, most failures are reported by an email triggered on failure.

Based on the above causes:

If it is a validation failure, we must fix the “example_1” DAG and rerun the dependent DAGs.

If it is a run-time error, we must fix the “example_2” DAG and rerun the dependent DAGs.

As a support engineer, how do we know which DAGs are dependent? By referring to the documents or the code base? There is a better way. We can solve this by adding tags that point to each DAG's upstream and downstream DAGs.

Screen-shot generated by Author Ramesh Ganesan

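In the screenshot above, the DAGs are tagged according to their dependencies: a tag prefixed with “us:” names an upstream DAG, and a tag prefixed with “ds:” names a downstream DAG. Now it is easy to search through the tags.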
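As a minimal sketch of how such tags could be produced, the snippet below derives the “us:” and “ds:” tags for a DAG from a list of dependency edges. The `EDGES` list, the `lineage_tags` helper, the `build_dag` function, and the placeholder start date are hypothetical names introduced here for illustration; only the `tags` argument of Airflow's `DAG` class comes from Airflow itself. The exact tags on each DAG are assumed from the searches described in this article.

```python
from datetime import datetime

# Hypothetical edge list for the four-DAG example: (upstream, downstream).
EDGES = [
    ("example_1", "example_2"),
    ("example_2", "example_3"),
    ("example_2", "example_4"),
]


def lineage_tags(dag_id, edges):
    """Return the us:/ds: tags for dag_id, derived from the edge list."""
    upstream = [f"us:{up}" for up, down in edges if down == dag_id]
    downstream = [f"ds:{down}" for up, down in edges if up == dag_id]
    return sorted(upstream) + sorted(downstream)


def build_dag(dag_id, edges=EDGES):
    """Create a DAG that carries its lineage tags (requires Airflow)."""
    # Imported lazily so lineage_tags can be used without Airflow installed.
    from airflow import DAG

    return DAG(
        dag_id=dag_id,
        start_date=datetime(2023, 1, 1),  # placeholder date, an assumption
        tags=lineage_tags(dag_id, edges),
    )


print(lineage_tags("example_2", EDGES))
# ['us:example_1', 'ds:example_3', 'ds:example_4']
```

Keeping the edges in one place means the tags on both ends of a dependency stay consistent, which hand-written tags do not guarantee.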

As soon as we know that the “example_2” DAG has validation errors, we need to check its upstream DAGs for failures. In this case, that is the “example_1” DAG. To find it, we need the DAGs that have “example_2” as a downstream DAG.

So search for the tag “ds:example_2” in the tag search bar, as below.

Screen-shot generated by Author Ramesh Ganesan

If we are fixing a run-time error in the “example_2” DAG, then after the fix we need to make sure the related downstream DAGs are running. In this case, those are the “example_3” and “example_4” DAGs. To find them, we need the DAGs that have “example_2” as an upstream DAG.

So search for the tag “us:example_2” in the tag search bar, as below.

Screen-shot generated by Author Ramesh Ganesan
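The two tag searches can be mimicked in plain Python to make the logic explicit. This is only a sketch: the `DAG_TAGS` mapping is an assumed reconstruction of the tags in this example, and `dags_with_tag` is a hypothetical helper that imitates the tag filter, not an Airflow API.

```python
# Assumed tags per DAG, following the us:/ds: scheme from this article.
DAG_TAGS = {
    "example_1": ["ds:example_2"],
    "example_2": ["us:example_1", "ds:example_3", "ds:example_4"],
    "example_3": ["us:example_2"],
    "example_4": ["us:example_2"],
}


def dags_with_tag(tag, dag_tags=DAG_TAGS):
    """Return the IDs of all DAGs that carry the given tag."""
    return sorted(dag_id for dag_id, tags in dag_tags.items() if tag in tags)


# DAGs that list example_2 as downstream are its upstream DAGs.
print(dags_with_tag("ds:example_2"))  # ['example_1']

# DAGs that list example_2 as upstream are its downstream DAGs.
print(dags_with_tag("us:example_2"))  # ['example_3', 'example_4']
```

Note the inversion: searching for “ds:example_2” returns the upstream DAGs, because those are the DAGs that name “example_2” as their downstream.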

So, using this tagging approach, we can easily traverse the DAG lineage without referring to the documents or the code base. Dependencies look simple to remember in a small example like this one, but remembering them is challenging when you have hundreds of DAGs running.
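Traversing the lineage beyond one hop means repeating the same tag lookup for each DAG found. The sketch below does that with a breadth-first walk over the “ds:” tags. As before, `DAG_TAGS` is an assumed reconstruction of the example's tags and `all_downstream` is a hypothetical helper, not part of Airflow.

```python
from collections import deque

# Assumed tags per DAG, following the us:/ds: scheme from this article.
DAG_TAGS = {
    "example_1": ["ds:example_2"],
    "example_2": ["us:example_1", "ds:example_3", "ds:example_4"],
    "example_3": ["us:example_2"],
    "example_4": ["us:example_2"],
}


def all_downstream(dag_id, dag_tags):
    """Walk ds: tags breadth-first and return every downstream DAG."""
    found = []
    queue = deque([dag_id])
    while queue:
        current = queue.popleft()
        for tag in dag_tags.get(current, []):
            if not tag.startswith("ds:"):
                continue
            child = tag[len("ds:"):]
            # Skip DAGs already visited so a tagging mistake cannot loop forever.
            if child != dag_id and child not in found:
                found.append(child)
                queue.append(child)
    return found


print(all_downstream("example_1", DAG_TAGS))
# ['example_2', 'example_3', 'example_4']
```

This gives the full set of DAGs to check or rerun after an upstream fix, in the order in which they are reached.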

I hope I am making some sense here, but there is always another way to handle this. Please feel free to chip in your ideas if you have something.
