Sid Anand

Summary

Apache Airflow has been promoted to the Apache Software Foundation's 200th top-level project after a successful incubation period, marking a significant milestone for the workflow scheduling platform.

Abstract

Apache Airflow, a platform for authoring and managing complex data workflows as Directed Acyclic Graphs (DAGs) in Python, has reached a significant milestone by becoming the 200th active Top-level Project (TLP) of the Apache Software Foundation (ASF). This recognition follows a 2.5+ year journey through the Apache Incubator, where it grew a dedicated community of users, contributors, and maintainers. Initially created by Maxime Beauchemin at Airbnb, Airflow addresses the need for a cloud-friendly and developer-friendly workflow solution, and its intuitive UI sets it apart from other frameworks like Luigi, Azkaban, and Apache Oozie. Its evolution into an ASF TLP ensures its sustainability and community support, with a growing number of committers, PMC members, adopting companies, and contributions. Apache Airflow is now a robust ecosystem with numerous hooks and operators, underpinning services like Google's Cloud Composer and serving critical data needs at companies such as PayPal.

Opinions

  • The author views Apache Airflow's "DAG-as-code" paradigm as a significant improvement in workflow management, bringing software development best practices to data engineering.
  • The attractive UI of Apache Airflow is considered a key factor in its popularity and rapid adoption.
  • The author emphasizes the importance of transitioning Airflow to an Apache project to avoid dependency on a single company's resources and to ensure its longevity and widespread impact.
  • The ASF incubation process is seen as beneficial for Airflow, providing a tech brand boost for Airbnb and safeguarding other companies using the tool from personnel changes at Airbnb.
  • The author expresses pride in the growth of the Airflow community, citing the increase in committers, PMC members, contributing companies, and active participation in email lists and Slack channels.
  • The author encourages further usage, feedback, and contributions to Apache Airflow, inviting others to join the movement and contribute to its development.

Apache Airflow Grows Up!

Preamble

Today, the Apache Software Foundation (ASF) welcomed Apache Airflow, a popular open-source workflow scheduling platform, to its ranks as its 200th active TLP (Top-level Project). This caps a 2.5+ year journey through Apache Incubation. This milestone could only have been achieved through the tireless efforts of a community of users, contributors, maintainers, and PMC members dedicated to improving the lives of fellow data scientists, data engineers, and ML/AI engineers who need to manage complex workflows.

Apache Airflow, if you are unfamiliar with it, is a workflow or DAG (Directed Acyclic Graph) orchestration system that allows users to author workflows in Python. This “DAG-as-code” paradigm was first created by Spotify, with the advent of Luigi. The paradigm brings the power and goodness of software development best practices to the world of workflow management (e.g. Version Control Systems, peer-reviewed code, CI/CD).
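To make “DAG-as-code” concrete, here is a minimal sketch in plain Python. It deliberately does not use Airflow's or Luigi's API; the task names (`extract`, `transform`, `load`) and the `run_order` helper are hypothetical, chosen only for illustration. It assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# A workflow expressed as ordinary Python data: each task name maps to
# the set of tasks that must finish before it can start.
# (Hypothetical task names, not taken from any real pipeline.)
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_order(deps):
    """Return one valid execution order for the tasks in `deps`.

    TopologicalSorter raises graphlib.CycleError if the graph contains
    a cycle, i.e. if it is not actually a DAG.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Because the workflow is just code, it can live in version control, go through peer review, and be exercised by CI like any other module, which is the point the paradigm makes.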

Apache Airflow is the brainchild of Maxime Beauchemin, an engineer from Airbnb who now calls Lyft his home. In the summer of 2015, I found myself seated in the audience of Max’s talk at Hadoop Summit. As it turned out, as Agari’s Data Architect, I was in dire need of a cloud-friendly and developer-friendly workflow solution to manage our predictive batch data pipelines. As an ex-LinkedIn engineer, I was familiar with both Azkaban and Apache Oozie. Both of those frameworks, while mature, relied on config files (e.g. XML) to bundle dependent code together. For workflows of reasonable complexity, that approach made managing DAGs very cumbersome.

Luigi, while both mature and supporting “DAGs-as-code”, didn’t offer the attractive UI that Apache Airflow did. Airflow’s beautiful and intuitive UI, an engineer’s first introduction to Apache Airflow, is a key reason for its popularity and rapid adoption.

Airflow’s path to Apache

In the fall of 2015, as more companies adopted Airflow, Maxime found himself burning the candle at both ends to keep up with new bug reports and a growing number of feature requests. It was clear that Max was near burnout. With 30 companies depending on Airflow for critical business needs, it was essential that we scale the project out beyond the resourcing of one company, namely Airbnb.

Airbnb, at the time, was new to the Apache way and had not yet signed over any of its software to the ASF. After a few emails with Max and others at Airbnb, Airbnb was “bought in”. Joining the ASF would be a tech brand boost for Airbnb as it would attract engineers interested in making lasting, widely-impactful software contributions. Additionally, it would safeguard other companies using Airflow from any personnel changes at Airbnb.

Fast forward to March 2016: our incubation proposal was voted in, and the initial committers, with the help of mentors Jakob Homan and Hitesh Shah, were ready to start learning the vast code base and adding more integrity controls, all while supporting a growing user base.

Airflow by the Numbers

Over the past 2.5 years, we have added 9 committers and PMC members to round out our cadre of 17 PMC/committers. We grew from 30 companies at incubation start to 234 companies officially using Airflow today. We added 600+ contributors and merged ~3k Pull Requests (a.k.a. PRs). We have active weekly participation on various email lists and Slack channels to the tune of 800+ people.

Airflow Today!

Today, Apache Airflow has grown in multiple dimensions. It supports 20+ hooks and 30+ operators that bind it to multiple third-party systems. It is the scheduler that underlies Google’s Cloud Composer service. It’s used for critical data movement and ETL needs at various companies, such as PayPal, my current employer. If you haven’t used it yet, we welcome your usage, feedback, and contributions. Come join the movement!
