Julien Kervizic

Summary

Airflow is a robust platform for managing and scheduling data workflows, with features that support easy setup, scalable execution, and detailed monitoring of data operations through directed acyclic graphs (DAGs).

Abstract

Airflow is an open-source platform designed to programmatically author, schedule, and monitor data pipelines. It is particularly adept at handling complex data workflows, replacing traditional cron jobs with more sophisticated management capabilities. Airflow's core concept revolves around DAGs, which are collections of tasks organized in a directional manner to manage dependencies and execution order. The platform provides a user-friendly interface to track the status of jobs, offering insights into successes, failures, and ongoing tasks, with alerting mechanisms for failed processes. Setting up Airflow can be simplified using Docker images and Docker Compose, facilitating quick environment configuration for development or production. Airflow supports different executors like Local and Celery, allowing for task distribution across multiple workers for enhanced scalability. Additionally, Airflow can be integrated with cloud services such as AWS and Azure, leveraging their managed services for components like Redis and Postgres. Developing for Airflow involves creating DAGs using operators, sensors, and dependencies, which together define the tasks, their execution logic, and their interrelationships. Operators encapsulate task-specific code, sensors monitor for condition fulfillment, and dependencies ensure task execution adheres to the workflow's logic. Airflow's design philosophy emphasizes code-based pipeline definitions, catering to an engineering approach to data management and enabling the programmatic generation of complex data pipelines.

Opinions

  • The author suggests that Airflow's management interface is a significant advantage, providing detailed job status visibility.
  • Using Docker images from puckel for setting up Airflow is recommended for ease of use.
  • The Celery executor is preferred for scaling tasks across multiple workers, whereas the Local executor is suitable for simpler, integrated solutions.
  • The author advocates for using managed cloud services such as Amazon ElastiCache (Redis) and RDS Postgres over the default Docker images for better reliability and scalability.
  • Developing DAGs in Airflow is described as a task that requires experience and knowledge but is relatively straightforward to learn.
  • The author emphasizes Airflow's engineering mindset, which allows for the handling of data-flows and data processing steps in an integrated and programmatic manner.

Airflow, the easy way

What is Airflow

Airflow is a data orchestration and scheduling platform; in layman's terms, it is a tool to manage your data-flows and data operations. It enables better management of what would otherwise have been handled through a cron job. Airflow revolves around the concept of directed acyclic graphs (DAGs), collections of tasks organized in a directional manner that handles their dependencies.

Airflow offers a management interface showcasing the status of every DAG run: whether it succeeded, failed, is still running, or is stuck in a retry mechanism.

It is possible to deep dive into the status of the different tasks of a DAG; the example above, for instance, shows tasks that pull data on sponsored products from Amazon's Ads API for a few European marketplaces. Each marketplace has its own set of tasks run periodically. Airflow also provides the possibility to get alerted on failures or missed SLAs.

Setting up Airflow

Docker Images

The easiest way to set up Airflow is through one of the Docker images from puckel; you can use one of the provided docker-compose YAML files to set up an environment.

With docker-compose, setting up an environment for development purposes is as easy as running “docker-compose up”.
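
As a rough sketch, assuming the file layout of the puckel/docker-airflow repository (file names may differ between versions), bringing up a development environment could look like:

    # clone puckel's docker-airflow repository (assumed layout)
    git clone https://github.com/puckel/docker-airflow.git
    cd docker-airflow

    # single-instance setup with the Local executor
    docker-compose -f docker-compose-LocalExecutor.yml up -d

    # or the Celery variant, which also brings up Redis, Postgres and separate workers
    docker-compose -f docker-compose-CeleryExecutor.yml up -d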

Executor Choice

Airflow allows for a choice of executor, Local or Celery. There are some limitations in terms of what a Local executor can do.

The Celery executor allows for the dispatch of tasks across multiple “workers”, i.e. instances meant to process the different DAGs and tasks.

The Local executor, on the other hand, provides an integrated solution that lets you run the different Airflow components (front end, scheduler, worker, …) on a single instance.
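
The executor is selected through Airflow's configuration. A minimal sketch using Airflow's environment-variable convention (the broker and result backend values are illustrative, roughly matching the containers in puckel's Celery docker-compose file):

    # pick the executor: LocalExecutor for a single instance, CeleryExecutor to scale out
    AIRFLOW__CORE__EXECUTOR=CeleryExecutor

    # the Celery executor additionally needs a message broker and a result backend
    AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/1
    AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow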

Making it Live

It is possible to set up Airflow on AWS based on the docker-compose files mentioned before and Amazon Elastic Container Service. In essence, what is needed is to run container instances of the puckel Airflow image with different environment variables and different commands for each component.

Rather than using the default Redis and Postgres images provided within puckel's docker-compose file, it is better to use Amazon's managed services: ElastiCache for Redis and RDS for Postgres.
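
As a sketch, pointing the containers at these managed services mostly comes down to overriding the connection settings; the endpoints and credentials below are placeholders:

    # metadata database on RDS Postgres (placeholder endpoint and credentials)
    AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:secret@my-airflow-db.xxxxxx.eu-west-1.rds.amazonaws.com:5432/airflow

    # Celery broker on ElastiCache for Redis (placeholder endpoint)
    AIRFLOW__CELERY__BROKER_URL=redis://my-airflow-cache.xxxxxx.euw1.cache.amazonaws.com:6379/1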

On Azure, it is possible to host your own version of the Airflow container and launch it as a Container Instance, an App Service, or within Azure Kubernetes Service.

Developing for Airflow (DAGs)

What do DAGs look like

The setup of DAGs in Airflow is built around three main concepts: operators, sensors and dependencies. Together they allow for programmatically building sets of tasks along with their relationships and interdependencies.

Operators

Operators are wrappers around the specific code of the tasks you wish to execute. They can be used to wrap plain code in different languages such as Python or PHP, or execution steps such as fetching data from an FTP server or moving files to a data store such as HDFS.
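
As an illustrative sketch (Airflow 1.10-style imports, matching the puckel images; the DAG id, task ids and commands are made up), a DAG wrapping a bit of Python and Bash code in operators could look like:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator


    def pull_from_ftp(**context):
        # placeholder for the task-specific code, e.g. fetching a file from an FTP server
        print("pulling file for execution date", context["ds"])


    dag = DAG(
        dag_id="example_operators",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    fetch_file = PythonOperator(
        task_id="fetch_from_ftp",
        python_callable=pull_from_ftp,
        provide_context=True,  # pass the execution context to the callable (Airflow 1.x)
        dag=dag,
    )

    copy_to_hdfs = BashOperator(
        task_id="copy_to_hdfs",
        bash_command="echo 'hdfs dfs -put local_file /data/'",  # placeholder command
        dag=dag,
    )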

Sensors

Sensors are a specific type of operator whose role is to check whether certain conditions have been met, for example that a file has been placed in an FTP folder or that a partition in a database has been created.
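
As a sketch of the pattern, using the FileSensor shipped in Airflow 1.10's contrib package (the file path and connection id are assumptions), a task waiting for a file to land could look like:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.sensors.file_sensor import FileSensor

    dag = DAG(
        dag_id="example_sensor",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    wait_for_file = FileSensor(
        task_id="wait_for_incoming_file",
        fs_conn_id="fs_default",               # assumed filesystem connection
        filepath="/data/incoming/export.csv",  # placeholder path to watch
        poke_interval=60,                      # re-check every minute
        timeout=6 * 60 * 60,                   # give up after six hours
        dag=dag,
    )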

Dependencies

Airflow allows for defining dependencies between tasks and only executes a task once its upstream dependencies have been met. This is done with the set_upstream and set_downstream functions or through the bitshift operators (<< and >>).

Airflow also allows for managing what should be done if a dependency is not fully met. This is done by setting the trigger rule of a task (all_success, all_failed, all_done, one_success, …) within its operator.
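
A minimal sketch of both mechanisms, using dummy tasks with made-up ids:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.utils.trigger_rule import TriggerRule

    dag = DAG(
        dag_id="example_dependencies",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    extract = DummyOperator(task_id="extract", dag=dag)
    transform = DummyOperator(task_id="transform", dag=dag)
    load = DummyOperator(task_id="load", dag=dag)
    cleanup = DummyOperator(
        task_id="cleanup",
        trigger_rule=TriggerRule.ALL_DONE,  # run whether upstream tasks succeeded or failed
        dag=dag,
    )

    # bitshift syntax: extract runs before transform, which runs before load
    extract >> transform >> load

    # equivalent explicit form: load.set_upstream(transform)

    # cleanup waits for load, but its trigger rule lets it run regardless of the outcome
    load >> cleanup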

Wrapping up

Airflow provides tools that make it easier to manage data-flows and data processing steps in an integrated manner. It is built with a heavy engineering mindset, and pipeline definitions in it are written as code. Airflow makes it possible to handle the generation of pipelines programmatically.

A set of Docker containers exists to make it easy to set up, both as a development environment and in the cloud. The creation of DAGs for data-flow and processing purposes does require some experience and knowledge, but it is fairly easy to pick up and start developing on.
