How to define the DAG “schedule_interval” parameter
My personal notes from the book “Data Pipelines with Apache Airflow” by Bas Harenslak and Julian de Ruiter — Chapter 3, Part 2

Introduction
This series of posts recaps my learnings from the book by Bas Harenslak and Julian de Ruiter. If you like the content, you can purchase the book on Manning.
📚 Related Posts:
Chapter 2:
- Introduction to Airflow
- Running Airflow Locally (in a Python Environment)
- Running Airflow with Docker
- Understanding Airflow User Interface
Chapter 3:
Mandatory and Optional parameters to schedule a DAG
In order to schedule a DAG you need to define the following mandatory parameters:
schedule_intervalstart_date
Optionally, you can also define an:
end_date
If this is not specified, Airflow will keep executing the dag forever.
For example:
dag = DAG(
dag_id="01_unscheduled",
start_date=dt.datetime(2019, 1, 1), # start date of the DAG
schedule_interval=@daily, # run every day at midnight
end_date=dt.datetime(2022, 1, 1)
)The schedule_interval and the start_date are essential parameters and, based on them, Airflow will schedule the first run.
⚠️ Airflow will schedule the first execution of the DAG to run at the first scheduled interval after the start date (start + interval)
For example, if you set start_date = dt.datetime(2019,1,1) and schedule_interval="@daily" (where @daily is an Airflow macro which runs the DAG every day at midnight), the first execution will happen on start_date + schedule_interval , which means “2019–01–02 00:00:00”.
The second execution is on “2019–01–03 00:00:00”, the third is on “2019–01–04 00:00:00” and so on. In other words, “2019–01–01 00:00:00” is skipped as the first execution is defined as start_date + schedule_interval .

Options for scheduled intervals
1. Airflow Macros
In the example above, we’ve used the macro @daily for our scheduled interval.
These macros are shorthand for commonly used scheduling intervals that Airflow provides as default. Here below are the most common ones:
@once: schedule once and only once@hourly: run once an hour at the beginning of the hour@daily: run once a day at midnight@weekly: run once a week at midnight on Sunday morning@monthly: run once a month at midnight on the first day of the month@yearly: run once a year at midnight on January 1
2. Cron-based intervals Cron-based intervals allow us to schedule more complex intervals such as “run the DAG at 23:45 every Saturday”.
Cron-based intervals use the same syntax used by cron which is a time-based job scheduler for macOS or Linux.
A cron expression is defined by 5 components:
# ┌─────── minute (0 - 59)
# │ ┌────── hour (0 - 23)
# │ │ ┌───── dayofthemonth(1-31)
# │ │ │ ┌───── month(1-12)
# │ │ │ │ ┌──── dayoftheweek(0-6)(SundaytoSaturday;
# │ │ │ │ │ 7isalsoSundayonsomesystems)
# * * * * *The idea is to run a job (if on macOS/Linux) or a DAG (if used for the schedule_interval parameter, when the time/date of the cron expression matches the system time/date.
When we don’t care about a value of a given field, we just leave the * .
For examples:
0 * * * *: run at minute 0 of every hour/day/month/day of the week (i.e. run hourly)0 0 * * *: run at minute 0 of hour 0 of every day/month/day of the week (i.e. run every day at midnight)0 0 * * 0: run at minute 0 of hour 0 of the day of the week 0 (i.e. Sunday)0 0 1 * *: run at hour 0 of day 0 of the 1 of the month (i.e. run at midnight on the first day of the month)45 23 * * SAT: run at minute 45 of hour 23 of Saturday (i.e. run at 23:45 every Saturday)
You can also define a collection of values using a comma , or a dash - . For example:
0 0 * * MON, WED, FRI: run at minute 0 of hour 0 (i.e. at midnight) every Monday, Wednesday, and Friday0 0 * * MON-FRI: run at midnight every weekday0 0,12 * * *: run at minute 0 of hour 0 and hour 12 of every day/month/weekday (i.e. run every day at 00:00 and 12:00).
Cron syntax may be confusing at the start but luckily there are a lot of websites that can do this “translation” for us. I personally use this one: https://crontab.guru/. It’s always a good idea to test your expression before using it in Airflow.
Another good practice is to document the reason why you picked a given cron expression. This could help others understand the expression when reading your code.
Finally, cron expressions are said to be “stateless” meaning that a job run doesn’t depend on the previous one. This is because the cron time/date is continuously matched against the current time of the system to determine whether a job should be executed.
The stateless nature of cron expressions brings an important limitation: they are unable to represent job runs that depend on one another (e.g. run DAG every 3 days).
We could write an expression that runs on every first, fourth, seventh, and so on day of the month, but this approach would run into problems at the end of the month as the DAG would run consecutively on both the 31st and the first of the next month. Which is not want we want. A solution to this is provided by “frequency-based intervals” which are explained in the next paragraph.
3. Frequency-based Intervals Frequency-based intervals are based on relative time intervals. In other words, the schedule would check how many “time units” (e.g. days) have passed from the previous run and determine whether to trigger the DAG or not.
In order to create a frequency-based interval, you can use a timedelta instance. For example:
dag = DAG(
dag_id="dag_running_every_3_days",
schedule_interval=dt.timedelta(days=3),
start_date=dt.datetime(year=2019, month=1, day=1),
end_date=dt.datetime(year=2019, month=1, day=5),
)This DAG would start running on “2019–01–04 00:00:00”, then on “2019–01–07 00:00:00”, then “2019–01–11 00:00:00” and so on.
Timedelta can be customised as you want to make the DAG run, for instance, every 10 minutes, every 2 hours and so on.
I hope this helps ❤️ See you in the next post!
References
Data Pipelines with Apache Airflow by Bas P. Harenslak and Julian Rutger de Ruiter






