Audhi Aprilliant

Photo by Joshua Aragon on Unsplash

Tutorial

Introduction to Apache Airflow as a Job Orchestration Tool: A Quick Tutorial for Beginners

Web scraping of COVID-19 data in Indonesia (with notifications via email and Telegram)

Overview

Airflow is a platform that takes cron jobs to the next level, enabling the creation, scheduling, and monitoring of tasks. Airflow represents workflows as directed acyclic graphs (DAGs), which it executes to automate our scripts.

Meanwhile, the COVID-19 pandemic remains a grave concern. The outbreak was first identified in Wuhan, China in December 2019 and quickly spread worldwide within a month. As a result, it is crucial to monitor daily COVID-19 patient data in Indonesia. Kompas News is one of the platforms that provides daily updates on COVID-19 data through a dedicated dashboard. This data will be scraped using Python and scheduled using Apache Airflow as a workflow scheduler.

Prerequisites

Prior to delving into a more in-depth discussion, kindly ensure that you have thoroughly reviewed and properly set up the following tools:

1. Install Apache Airflow (read here)

2. Install the module dependencies

  • requests — simplifies sending HTTP requests and handling responses, making it easier to interact with web services and APIs
  • bs4 — a popular Python library for web scraping and for parsing HTML and XML documents; it provides convenient, Pythonic functions for navigating, searching, and manipulating their content
  • pandas — a popular Python library for data manipulation, analysis, and processing
  • re — a built-in Python library that provides support for regular expressions, powerful and flexible patterns for text manipulation and pattern matching
  • os — a built-in library that provides a wide range of functions for interacting with the operating system
  • datetime — a built-in library for date and time manipulation, formatting, and arithmetic
  • json — a built-in library for reading and writing JSON, a lightweight data-interchange format commonly used for configurations, settings, and data exchanged between systems or platforms

3. Telegram (a chat ID for sending notifications)

4. Email

  • An app password consisting of 16 characters (read here)
  • Set up the airflow.cfg file so Airflow can send messages through our email account

5. Set up files and directories in Airflow

  • Save the DAG Python file in the dags directory
  • Save the Telegram chat ID in the config directory
  • Create a data/covid19 directory in Airflow to store summary_covid19.txt and daily_update_covid.csv.
The recommended directory (Image by Author)

How to Scrape Data from Kompas News

1. Install and import the module dependencies

As mentioned in the prerequisites above, the first step is to install all the module dependencies, so that our Python program has everything it needs to run. We can simply list the third-party modules in a requirements.txt file.

requests
bs4
pandas

Meanwhile, re, os, datetime, and json are built-in Python modules, so there is no need to install them (note that requests is a third-party package, which is why it appears in requirements.txt). After installing the packages with pip install -r requirements.txt, you can import all of the modules as shown below.
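
A minimal import block matching the module list above (assuming all three third-party packages are installed):

# Third-party packages
import requests                   # send HTTP requests to the Kompas page
from bs4 import BeautifulSoup     # parse the returned HTML document
import pandas as pd               # tabulate and save the scraped data

# Built-in modules
import re                         # regular expressions for text clean-up
import os                         # path handling for the output files
import json                       # handle JSON embedded in the page, if any
from datetime import datetime     # work with the update timestamp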

2. Get the URL of the Kompas news page for daily monitoring of COVID-19

This function is designed to read HTML pages in Python, using BeautifulSoup to parse and manipulate the HTML elements.
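
A minimal sketch of such a helper, assuming the imports from step 1; the example URL is illustrative, not the exact dashboard address used in the original script:

import requests
from bs4 import BeautifulSoup

def get_page(url):
    # Fetch a Kompas page and return it as a BeautifulSoup object
    response = requests.get(url, timeout=30)
    response.raise_for_status()                        # fail early on HTTP errors
    return BeautifulSoup(response.text, 'html.parser')

# Illustrative only: substitute the actual Kompas COVID-19 dashboard URL
soup = get_page('https://www.kompas.com/covid-19')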

3. Get the current date

The updated time serves as metadata to track the most recent data for daily monitoring, and it can be stored in our local directory. The time is scraped in the original Western Indonesia Time (WIB), and a few transformations convert it into a standard YYYY/MM/DD format.

HTML elements for date-time data at Kompas News (Image by Author)
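
A sketch of how the timestamp could be extracted and reformatted; the covid__date class, the Indonesian month mapping, and the timestamp layout are assumptions about the page markup:

import re
from datetime import datetime

# Assumption: the page spells month names in Indonesian
MONTHS = {'Januari': 1, 'Februari': 2, 'Maret': 3, 'April': 4, 'Mei': 5, 'Juni': 6,
          'Juli': 7, 'Agustus': 8, 'September': 9, 'Oktober': 10, 'November': 11, 'Desember': 12}

def get_update_date(soup):
    # Parse a timestamp such as '26 Juli 2021, 16.00 WIB' and return 'YYYY/MM/DD';
    # the covid__date class is hypothetical, so inspect the real page markup first
    raw = soup.find('div', class_='covid__date').get_text(strip=True)
    day, month_name, year = re.search(r'(\d{1,2})\s+(\w+)\s+(\d{4})', raw).groups()
    return datetime(int(year), MONTHS[month_name], int(day)).strftime('%Y/%m/%d')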

4. Get the aggregated data

The daily aggregated data comprises four pieces of information: confirmed cases, active cases, deaths, and recoveries.

HTML elements for the aggregated data of Covid19 cases at Kompas News (Image by Author)
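
A sketch of the aggregation step; the covid__box class and the order of the four figures are placeholders for whatever the dashboard actually uses:

import re

def get_aggregated_data(soup):
    # Scrape the four national figures; the covid__box class and the ordering
    # of the boxes are assumptions to be checked against the real markup
    boxes = soup.find_all('div', class_='covid__box')
    values = [int(re.sub(r'[^\d]', '', box.get_text())) for box in boxes[:4]]
    return dict(zip(['confirmed', 'active', 'deaths', 'recovered'], values))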

5. Save the aggregated data

This aggregated data will be saved with a txt extension. Typically, the Government holds a press conference at 16:00 WIB, although it may start earlier or later, and the Kompas news team updates the dashboard as soon as the figures are announced, so the exact update time is not predictable. Therefore, to ensure that we do not store duplicate data, the newly scraped data must be checked against the previous period.
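
One way to implement the duplicate check described above; the semicolon-separated record layout is an assumption, while the file name comes from the prerequisites (the path is relative to the Airflow home directory):

import os

def save_aggregated_data(date, data, path='data/covid19/summary_covid19.txt'):
    # Append today's summary only if a record for this date is not stored yet
    record = '{};{};{};{};{}\n'.format(date, data['confirmed'], data['active'],
                                       data['deaths'], data['recovered'])
    if os.path.exists(path):
        with open(path) as file:
            lines = file.readlines()
        if lines and lines[-1].startswith(date):       # same date as the last record
            return                                     # skip the duplicate
    with open(path, 'a') as file:
        file.write(record)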

6. Get the province's data

This data contains accumulated figures for all provinces in Indonesia, allowing us to perform various analyses for a selected province of interest and compare it with other provinces.

HTML elements for the province's data of Covid19 cases at Kompas news (Image by Author)
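
A sketch for collecting the per-province table; the covid__row class and the cell order are, again, assumptions about the page structure:

import re
import pandas as pd

def get_province_data(soup):
    # Collect the accumulated cases per province into a DataFrame; the
    # covid__row class and the cell order are assumptions about the markup
    records = []
    for row in soup.find_all('div', class_='covid__row'):
        cells = [cell.get_text(strip=True) for cell in row.find_all('span')]
        province, confirmed, recovered, deaths = cells[:4]
        records.append({'province': province,
                        'confirmed': int(re.sub(r'[^\d]', '', confirmed)),
                        'recovered': int(re.sub(r'[^\d]', '', recovered)),
                        'deaths': int(re.sub(r'[^\d]', '', deaths))})
    return pd.DataFrame(records)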

7. Save the province's data

After scraping the data for the desired province with Python, it is essential to handle and store it properly. Saving the province's data keeps it intact for easy retrieval, further processing, and visualization, and ultimately supports data-driven decision making.
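
Persisting the province table could look like the snippet below; the daily_update_covid.csv name comes from the prerequisites, and overwriting the file once per day is an assumption about the original workflow:

def save_province_data(df, date, path='data/covid19/daily_update_covid.csv'):
    # Stamp each row with the update date and write the table to CSV
    df.assign(update_date=date).to_csv(path, index=False)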

8. Define the DAG and set a schedule

Before defining the DAG, it is necessary to set up default arguments for the tasks. The owner field refers to the creator or owner, where we can specify our name or division. The depends_on_past field is set to True if a task should run only when its previous run succeeded, and False otherwise. The start_date field defines the start date of our Airflow job. The email field should be set to our email address so that Airflow can notify us of task failures. The retries field specifies how many times a failed task is retried before an email alert is sent. The retry_delay field is the interval between two retries.

Finally, it is necessary to determine the schedule of our job in UTC, specifying when it will run. If you are unfamiliar with cron syntax, you can refer to the link provided here. It is important to note that the dag_id must be unique in the Airflow database. For more details on our DAG, you can refer to the following sketch.
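
A sketch of the default arguments and DAG definition described above; the owner, email address, start date, and cron expression are placeholders:

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'data-team',                   # our name or division (placeholder)
    'depends_on_past': True,                # run only if the previous run succeeded
    'start_date': datetime(2021, 7, 1),     # placeholder start date
    'email': ['your.name@example.com'],     # placeholder alert recipient
    'email_on_failure': True,
    'retries': 1,                           # retry once before alerting
    'retry_delay': timedelta(minutes=5),    # wait five minutes between retries
}

dag = DAG(
    dag_id='covid19_kompas_scraper',        # must be unique in the Airflow database
    default_args=default_args,
    schedule_interval='30 10 * * *',        # 10:30 UTC, i.e. 17:30 WIB (placeholder cron)
    catchup=False,
)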

9. Create tasks

In this web scraping program, we create six main tasks, each calling one of the functions defined above. The start and end of the job are marked formally with "echo task start" and "echo task end" tasks. We will be using three operators in this process, as shown in the sketch after the list below.

  1. BashOperator: This operator is used to execute tasks using a shell command
  2. PythonOperator: This operator is used to execute tasks using a Python command
  3. EmailOperator: This operator uses the email configuration set in airflow.cfg, allowing us to send data or information to a private email address
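
A sketch of the task declarations using the three operators (Airflow 2 import paths); the python_callable targets are hypothetical wrapper functions around the helpers from steps 2–7:

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

# Markers for the formal start and end of the job
task_start = BashOperator(task_id='task_start', bash_command='echo "task start"', dag=dag)
task_end = BashOperator(task_id='task_end', bash_command='echo "task end"', dag=dag)

# Python tasks wrapping the (hypothetical) scraping helpers sketched earlier
scrape_aggregated = PythonOperator(task_id='scrape_aggregated_data',
                                   python_callable=scrape_and_save_aggregated,
                                   dag=dag)
scrape_province = PythonOperator(task_id='scrape_province_data',
                                 python_callable=scrape_and_save_province,
                                 dag=dag)
send_telegram = PythonOperator(task_id='send_telegram',
                               python_callable=send_telegram_message,
                               dag=dag)

# Email the daily file using the SMTP settings configured in airflow.cfg
send_email = EmailOperator(task_id='send_email',
                           to='your.name@example.com',
                           subject='Daily COVID-19 update',
                           html_content='The COVID-19 data has been scraped successfully.',
                           files=['data/covid19/daily_update_covid.csv'],
                           dag=dag)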

10. Set up task dependencies

Setting up task dependencies in Apache Airflow is a crucial step in creating a reliable and efficient workflow. Task dependencies ensure that tasks are executed in the desired sequence and only after the successful completion of their upstream tasks, which gives us better control over complex workflows.

Once the tasks (the BashOperator, PythonOperator, and EmailOperator instances above) are declared, their dependencies can be wired with Airflow's bitshift operators (>> and <<) or the set_upstream/set_downstream methods. By properly defining task dependencies, we ensure that tasks run in the correct order, with appropriate error handling and retries.
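
With the tasks above, the chaining can be expressed with the bitshift operators; the exact ordering is an assumption about how the original job is wired:

# Run the scrapers after the start marker, send the notifications in parallel,
# and close the job with the end marker
task_start >> scrape_aggregated >> scrape_province >> [send_email, send_telegram] >> task_end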

The task in Apache Airflow webserver (Image by Author)

Summary

To execute and run our web scraping job in Apache Airflow, we need to create a Python file and save it in the “dags” directory, which is the designated location for DAG files in Airflow. Once the file is saved, we can start the Airflow webserver and scheduler to activate the job. This will allow Airflow to automatically schedule and execute the tasks defined in our DAG file based on the specified configuration, such as start date, dependencies, and interval.

Please visit my GitHub repository to view the complete code for this project. The repository contains all the completed code, including the web scraping, data processing, and DAG setup scripts. You can access the repository to review the code and download the files for further reference or use. Thank you for your interest!

Disclaimer and data source

Data Science
Python
Apache Airflow
Covid-19
Web Scraping