avatarNajma Bader

Summary

The provided content explains how to schedule DAGs in Apache Airflow, focusing on the schedule_interval parameter and its associated mandatory and optional parameters, as well as various scheduling options like Airflow macros, cron-based intervals, and frequency-based intervals.

Abstract

The content is a detailed guide on scheduling Directed Acyclic Graphs (DAGs) within Apache Airflow, based on the book "Data Pipelines with Apache Airflow" by Bas Harenslak and Julian de Ruiter. It emphasizes the necessity of defining the schedule_interval and start_date for scheduling a DAG, and optionally an end_date. The guide explains that the first DAG execution occurs at the first scheduled interval after the start_date. It outlines common Airflow macros for scheduling, such as @once, @hourly, @daily, @weekly, @monthly, and @yearly. It also delves into more complex cron-based intervals, providing examples and discussing the stateless nature of cron expressions, which leads to limitations in representing dependent job runs. To address this, the author introduces frequency-based intervals using timedelta, which allows for scheduling based on relative time intervals, such as every three days. The article concludes with a recommendation to test cron expressions and document the rationale behind chosen scheduling intervals.

Opinions

  • The author suggests that Airflow's approach to scheduling the first DAG run after the start_date plus the schedule_interval is an important concept to understand.
  • The use of Airflow macros is recommended for commonly used scheduling intervals due to their convenience.
  • Cron-based intervals are presented as powerful for complex scheduling needs, but the author acknowledges they can be confusing and recommends using tools like crontab.guru for assistance.
  • The author points out the limitation of cron expressions in representing dependent job runs and proposes frequency-based intervals as a solution for such cases.
  • The importance of documenting the reasoning behind the chosen cron expression is highlighted to aid code comprehension for others.
  • The author encourages the practice of testing cron expressions to ensure they behave as expected before deploying them in Airflow.

How to define the DAG “schedule_interval” parameter

My personal notes from the book “Data Pipelines with Apache Airflow” by Bas Harenslak and Julian de Ruiter — Chapter 3, Part 2

Data Pipelines with Apache Airflow — Manning Publications

Introduction

This series of posts recaps my learnings from the book by Bas Harenslak and Julian de Ruiter. If you like the content, you can purchase the book on Manning.

📚 Related Posts:

Chapter 2:

  1. Introduction to Airflow
  2. Running Airflow Locally (in a Python Environment)
  3. Running Airflow with Docker
  4. Understanding Airflow User Interface

Chapter 3:

  1. Scheduling DAGs in Airflow
  2. How to define the DAG “schedule_interval” parameter

Mandatory and Optional parameters to schedule a DAG

In order to schedule a DAG you need to define the following mandatory parameters:

  • schedule_interval
  • start_date

Optionally, you can also define an:

  • end_date

If this is not specified, Airflow will keep executing the dag forever.

For example:

dag = DAG(
    dag_id="01_unscheduled",
    start_date=dt.datetime(2019, 1, 1),  # start date of the DAG
    schedule_interval=@daily,  # run every day at midnight
    end_date=dt.datetime(2022, 1, 1)
)

The schedule_interval and the start_date are essential parameters and, based on them, Airflow will schedule the first run.

⚠️ Airflow will schedule the first execution of the DAG to run at the first scheduled interval after the start date (start + interval)

For example, if you set start_date = dt.datetime(2019,1,1) and schedule_interval="@daily" (where @daily is an Airflow macro which runs the DAG every day at midnight), the first execution will happen on start_date + schedule_interval , which means “2019–01–02 00:00:00”.

The second execution is on “2019–01–03 00:00:00”, the third is on “2019–01–04 00:00:00” and so on. In other words, “2019–01–01 00:00:00” is skipped as the first execution is defined as start_date + schedule_interval .

Options for scheduled intervals

1. Airflow Macros In the example above, we’ve used the macro @daily for our scheduled interval.

These macros are shorthand for commonly used scheduling intervals that Airflow provides as default. Here below are the most common ones:

  • @once : schedule once and only once
  • @hourly: run once an hour at the beginning of the hour
  • @daily : run once a day at midnight
  • @weekly : run once a week at midnight on Sunday morning
  • @monthly : run once a month at midnight on the first day of the month
  • @yearly : run once a year at midnight on January 1

2. Cron-based intervals Cron-based intervals allow us to schedule more complex intervals such as “run the DAG at 23:45 every Saturday”.

Cron-based intervals use the same syntax used by cron which is a time-based job scheduler for macOS or Linux.

A cron expression is defined by 5 components:

# ┌─────── minute (0 - 59)
# │ ┌────── hour (0 - 23)
# │ │ ┌───── dayofthemonth(1-31)
# │ │ │ ┌───── month(1-12)
# │ │ │ │ ┌──── dayoftheweek(0-6)(SundaytoSaturday; 
# │ │ │ │ │ 7isalsoSundayonsomesystems) 
# * * * * *

The idea is to run a job (if on macOS/Linux) or a DAG (if used for the schedule_interval parameter, when the time/date of the cron expression matches the system time/date.

When we don’t care about a value of a given field, we just leave the * .

For examples:

  • 0 * * * * : run at minute 0 of every hour/day/month/day of the week (i.e. run hourly)
  • 0 0 * * * : run at minute 0 of hour 0 of every day/month/day of the week (i.e. run every day at midnight)
  • 0 0 * * 0 : run at minute 0 of hour 0 of the day of the week 0 (i.e. Sunday)
  • 0 0 1 * * : run at hour 0 of day 0 of the 1 of the month (i.e. run at midnight on the first day of the month)
  • 45 23 * * SAT : run at minute 45 of hour 23 of Saturday (i.e. run at 23:45 every Saturday)

You can also define a collection of values using a comma , or a dash - . For example:

  • 0 0 * * MON, WED, FRI : run at minute 0 of hour 0 (i.e. at midnight) every Monday, Wednesday, and Friday
  • 0 0 * * MON-FRI : run at midnight every weekday
  • 0 0,12 * * * : run at minute 0 of hour 0 and hour 12 of every day/month/weekday (i.e. run every day at 00:00 and 12:00).

Cron syntax may be confusing at the start but luckily there are a lot of websites that can do this “translation” for us. I personally use this one: https://crontab.guru/. It’s always a good idea to test your expression before using it in Airflow.

Another good practice is to document the reason why you picked a given cron expression. This could help others understand the expression when reading your code.

Finally, cron expressions are said to be “stateless” meaning that a job run doesn’t depend on the previous one. This is because the cron time/date is continuously matched against the current time of the system to determine whether a job should be executed.

The stateless nature of cron expressions brings an important limitation: they are unable to represent job runs that depend on one another (e.g. run DAG every 3 days).

We could write an expression that runs on every first, fourth, seventh, and so on day of the month, but this approach would run into problems at the end of the month as the DAG would run consecutively on both the 31st and the first of the next month. Which is not want we want. A solution to this is provided by “frequency-based intervals” which are explained in the next paragraph.

3. Frequency-based Intervals Frequency-based intervals are based on relative time intervals. In other words, the schedule would check how many “time units” (e.g. days) have passed from the previous run and determine whether to trigger the DAG or not.

In order to create a frequency-based interval, you can use a timedelta instance. For example:

dag = DAG(
    dag_id="dag_running_every_3_days",
    schedule_interval=dt.timedelta(days=3),
    start_date=dt.datetime(year=2019, month=1, day=1),
    end_date=dt.datetime(year=2019, month=1, day=5),
)

This DAG would start running on “2019–01–04 00:00:00”, then on “2019–01–07 00:00:00”, then “2019–01–11 00:00:00” and so on.

Timedelta can be customised as you want to make the DAG run, for instance, every 10 minutes, every 2 hours and so on.

I hope this helps ❤️ See you in the next post!

References

Data Pipelines with Apache Airflow by Bas P. Harenslak and Julian Rutger de Ruiter

Airflow
Data Engineering
Data Science
Data Pipeline
Programming
Recommended from ReadMedium