Scheduling DAGs in Airflow
My personal notes from the book “Data Pipelines with Apache Airflow” by Bas Harenslak and Julian de Ruiter — Chapter 3, Part 1

Introduction
This series of posts recaps my learnings from the book by Bas Harenslak and Julian de Ruiter. If you like the content, you can purchase the book on Manning.
📚 Related Posts:
Chapter 2:
- 1. Introduction to Airflow
- 2. Running Airflow Locally (in a Python Environment)
- 3. Running Airflow with Docker
- 4. Understanding Airflow User Interface
Chapter 3:
- 1. Scheduling DAGs in Airflow
👩‍💻 Practice!
I highly encourage you to follow along with the book examples and to get some practice. To replicate the examples:
First-time setup
1. Go to the GitHub repository: data-pipelines-with-apache-airflow
2. Clone it (either with SSH or HTTPS). I personally use SSH:
$ cd ~
$ mkdir Projects
$ cd Projects
$ git clone git@github.com:BasPH/data-pipelines-with-apache-airflow.git
3. Open Docker Desktop. If you need to download it, you can find it on the official Docker website.
4. Check out the README.md file of the corresponding chapter to see running instructions. However, in most cases, you will do the following:
$ cd data-pipelines-with-apache-airflow
$ cd <chapter_number> # put the number of the current chapter
$ docker-compose up --build
This will take care of spinning up the required resources and starting an Airflow instance for you.
5. Once everything is running, you should be able to run the examples in Airflow at http://localhost:8080/.
6. To stop running the examples, run the following command:
$ docker-compose down -v
De🐛ing
🙋‍♀️ If you open Airflow in your browser and you don’t see any DAGs, most likely you ran docker-compose up --build in the wrong directory. If you check the docker-compose.yaml file, you will see that it mounts the DAGs from the folder ./dags into /opt/airflow/dags. This means that you should run the docker-compose up --build command in the parent directory of the dags folder.

🙋‍♀️ What are the login username and password? The standard practice (for development environments) is to use admin for both the Airflow username and password. In any case, you can check/change them here:

🙋‍♀️ If something doesn’t work, a good idea is to remove all existing containers and start “fresh”: docker rm -f $(docker ps -aq)
⚠️ To run Chapter 03, I had to bump Flask to a newer version in requirements.txt, and consequently Click as well. This is because newer versions of Jinja (a dependency of Flask 1.x) no longer export the escape function that Flask 1.x imports. My new file looks something like this:
click==8.0
faker==4.14.0
Flask==2.1.0
pandas==1.1.3
⚠️ When building the containers, I also got this error message:
Error response from daemon: Ports are not available: exposing port TCP 0.0.0.0:5000 -> 0.0.0.0:0: listen tcp 0.0.0.0:5000: bind: address already in use
So I checked which process was running on port 5000:

It seems that Control Center on macOS Monterey is listening on ports 5000 and 7000. To free up the ports, you can turn off AirPlay Receiver under System Preferences > Sharing.
Running the examples after the first-time setup
- Open Docker Desktop.
- Open the project data-pipelines-with-apache-airflow from an IDE (I personally use PyCharm).
- Go to the chapter of interest: cd <chapter_number>
- Follow the instructions in the README.md file. For example: docker-compose up --build. Wait a few seconds and you should be able to access the examples at http://localhost:8080/ with password=admin and user=admin.
- After you are done, stop the containers with: docker-compose down -v.
Why Schedule a DAG
Scheduling is important for several reasons:
- Automating your job (you don’t want to repeat a manual task such as triggering a DAG when a computer could do it for you).
- Processing data incrementally at regular intervals.
- Loading and reprocessing past data (backfilling).
- Having reliable tasks.
Simple Example: Processing user events
The book starts with an example of a service that tracks which pages users have accessed on a given website. Users are identified by their IP address. We want to know:
- How many different pages do users access?
- How much time do users spend on each visit?
We want to compute these two statistics every day and be able to compare different periods.
In this toy example, data will be stored locally. If you were to replicate this flow in production, you would use cloud storage (e.g. Amazon S3 or Google Cloud Storage), as the raw data might quickly become large. Cloud storage provides high durability at a relatively low cost.
To simulate this scenario, the authors have created a local API to fake the retrieval of user events. The API will be automatically set up for you if you used the docker-compose file of Chapter 3.
The API returns a JSON-encoded list of user events. You can try it by running:
curl http://localhost:5000/events
Let’s start modelling our pipeline
It’s good practice in Airflow to always break up your work into “minimal units”. In this case, we could create a task that fetches the user events and another one that computes the statistics.
Fetching the data takes a single, relatively simple command, so we will use a BashOperator. Computing the statistics is a bit more complex (we are going to create a pandas DataFrame and use groupby and aggregation), so we will use a PythonOperator.
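To make the "groupby and aggregation" part more concrete, here is a minimal sketch of what the statistics step could look like with pandas. It is a simplified stand-in that only counts events per user per day. The sample events, the field names date and user, and the output column n_events are my assumptions for illustration, not necessarily what the API returns.

```python
import pandas as pd

# Hypothetical sample of events; the field names "date" and "user"
# are assumptions made for this sketch.
events = pd.DataFrame(
    [
        {"date": "2019-01-01", "user": "10.0.0.1"},
        {"date": "2019-01-01", "user": "10.0.0.1"},
        {"date": "2019-01-01", "user": "10.0.0.2"},
        {"date": "2019-01-02", "user": "10.0.0.1"},
    ]
)


def calculate_stats(events: pd.DataFrame) -> pd.DataFrame:
    """Count the number of events per user per day."""
    return (
        events.groupby(["date", "user"])  # one group per (day, user) pair
        .size()  # number of rows (events) in each group
        .reset_index(name="n_events")  # back to a flat DataFrame
    )


stats = calculate_stats(events)
print(stats)
```

Keeping the aggregation in a plain function like this also makes it easy to try out in a notebook before wiring it into a PythonOperator.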
The DAG definition could be something like this:







