Aashish Nair

Summary

The article provides a beginner's guide to building ETL (Extract, Transform, Load) pipelines using Python, including a case study on collecting and processing Covid-19 news articles.

Abstract

The article "Building ETL Pipelines — For Beginners" offers an overview of the ETL process, which is crucial for data scientists to transform raw data into a usable format for analysis. It breaks down the three main components of ETL: extracting data from various sources, transforming it to ensure quality and consistency, and loading it into a database for storage and future use. A practical case study demonstrates how to implement an ETL pipeline in Python to collect Covid-19 news articles using the New York Times Article Search API, clean and filter the data, and store it in a PostgreSQL database. The article emphasizes the importance of ETL in data science and encourages beginners to gain hands-on experience with these tools and techniques.

Opinions

  • The author suggests that the most appealing aspect of the data scientist role is the ability to perform tasks that yield actionable insights, which is contingent upon having usable and accessible data.
  • ETL pipelines are presented as essential for obtaining data that is suitable for analysis or product development.
  • The author expresses that an ETL pipeline's appeal lies in its ability to streamline data collection, processing, and storage efficiently and with minimal friction.
  • The article implies that real-world ETL pipelines may require more sophisticated tools and technologies, such as cloud platforms, big data frameworks, and job schedulers, to handle larger datasets and automate workflows.
  • The author encourages beginners to start building their own ETL pipelines with the tools they are familiar with, suggesting that even those with basic programming knowledge can create effective data pipelines.
  • There is an underlying opinion that hands-on experience is key to becoming comfortable with ETL processes and that beginners should progressively incorporate more advanced tools into their pipelines.

Building ETL Pipelines — For Beginners

Overview and implementation with Python

Photo by Danny Meneses: https://www.pexels.com/photo/photo-of-turned-on-laptop-computer-943096/

The most appealing aspect of the data scientist role is the chance to build predictive models or conduct studies that yield actionable insights.

However, such tasks are impossible to perform without data that is both usable and accessible.

To attain data that can adequately fuel analysis or product development, data scientists often opt to build ETL pipelines.

ETL, short for extract-transform-load, is a series of processes that entails ingesting data, processing it to ensure usability, and storing it in a secure and accessible location.

The appeal of an ETL pipeline is that it facilitates data collection, processing, and storage with maximum efficiency and minimal friction.

Here, we explore the individual constituents of ETL and then demonstrate how one can build a simple ETL pipeline using Python.

Extract

Before conducting any analysis, the relevant data needs to be procured.

The first phase of ETL entails extracting raw data from one or more sources. Such sources can include flat files, databases, and CRMs.

Transform

At this point, all the raw data is collected, but it is unlikely to be fit for use.

Thus, the second phase of ETL entails transforming the data to ensure its usability.

There are many types of transformations one may want to apply to their data.

1. Data cleansing

Any unwanted records or variables should be removed. Data cleansing can come in the form of removing features, missing values, duplicates, or outliers.
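For instance, a minimal pandas sketch of these cleansing steps (the data frame and column names here are purely illustrative):

```python
import pandas as pd

# Illustrative data frame with an unused column, a missing value, and a duplicate
df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", None],
    "revenue": [100, 100, 200, 50],
    "unused_col": ["x", "x", "y", "z"],
})

df = df.drop(columns=["unused_col"])   # remove an unwanted feature
df = df.dropna(subset=["customer"])    # remove records with missing values
df = df.drop_duplicates()              # remove duplicate records
```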

2. Re-formatting

Oftentimes, when data is pulled from multiple sources, re-formatting becomes a necessary step. Even if different sources report the same information, they may do so in their own unique format.

For instance, two data sources could have a date feature, but one source may show dates in the ‘day-month-year’ format, while the other may show dates in the ‘month-day-year’ format. For data to be usable, information from all sources has to adhere to a single format.
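As a small sketch of that scenario in pandas (the column name and values are illustrative), each source is parsed with its own format so both end up as proper datetime values:

```python
import pandas as pd

# Two hypothetical sources reporting the same dates in different formats
source_a = pd.DataFrame({"date": ["25-12-2021", "26-12-2021"]})  # day-month-year
source_b = pd.DataFrame({"date": ["12-25-2021", "12-26-2021"]})  # month-day-year

# Parse each with its own format so the dates share a single representation
source_a["date"] = pd.to_datetime(source_a["date"], format="%d-%m-%Y")
source_b["date"] = pd.to_datetime(source_b["date"], format="%m-%d-%Y")
```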

3. Feature extraction

New features can be created using information from existing features. Examples of this include extracting information from a string variable or extracting the year/month/day components from a date variable.
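A quick illustrative sketch of the date example with pandas (the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"pub_date": pd.to_datetime(["2021-12-25", "2022-01-03"])})

# Derive new features from the existing date column
df["year"] = df["pub_date"].dt.year
df["month"] = df["pub_date"].dt.month
df["day"] = df["pub_date"].dt.day
```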

4. Aggregation

Data can be aggregated to derive the desired metrics (e.g., customer count, revenue, etc.).
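A small illustrative pandas aggregation along those lines (the data is hypothetical):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["A", "A", "B"],
    "revenue": [100, 50, 200],
})

# Aggregate to per-customer metrics
summary = orders.groupby("customer").agg(
    order_count=("revenue", "size"),
    total_revenue=("revenue", "sum"),
)
```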

5. Joins

Data from multiple sources can be merged to create one comprehensive dataset.
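For example, a minimal pandas join on a shared key (the tables here are illustrative):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "revenue": [100, 50, 200]})

# Merge the two sources on their shared key
combined = customers.merge(orders, on="customer_id", how="left")
```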

6. Filtering

Unwanted categories can be omitted from the dataset.
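In pandas, this is typically a boolean mask (the data below is illustrative):

```python
import pandas as pd

articles = pd.DataFrame({
    "headline": ["Vaccine update", "Why I disagree"],
    "material": ["News", "Op-Ed"],
})

# Keep only the categories of interest
news_only = articles[articles["material"] == "News"]
```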

Load

After all the transformations are applied, the data is fit for analysis, but it needs to be stored for any subsequent use.

The third and final phase of ETL entails loading the data in a secure and accessible location.

Here are some options for storing the data.

1. Relational databases

A very popular approach is to store data in a relational database. With this method, users can periodically append or overwrite the data stored in the database with newly procured data.

2. Flat files

Users also have the option to store their data in flat files (e.g., Excel spreadsheets, text files).

Case Study

We can see the ETL process in action by building a simple pipeline with Python.

Suppose we need to obtain data on news articles related to Covid-19 for some type of analysis.

To achieve this goal, we will write a program that can:

  • Collect data on Covid-19 news articles published on the current date
  • Transform the data so that it is fit for use
  • Store the data in a database.

With this pipeline, we can procure information on all relevant news articles for the current date. By running this program every day, we would get a continuous supply of data on Covid-19 news articles.

The modules required for this exercise are shown below:
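The original listing was shown as an image; a plausible set of imports for this kind of pipeline might look like the sketch below (the exact modules are an assumption based on the steps that follow):

```python
import requests                        # calling the NYT Article Search API
import pandas as pd                    # transforming the collected data
from datetime import date              # working with the current date
from sqlalchemy import create_engine   # loading the data into PostgreSQL
```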

Note: This exercise involves extracting data using the New York Times Article Search API. If you are unfamiliar with this API or with using APIs for data collection in general, check out this article:

Step 1: Extract

First, we need to obtain the raw data of news articles on Covid-19 using the New York Times API.

We can start by writing a function that builds the URI needed to query the API for any given search term and date.
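The article's own snippet is not reproduced here; a minimal sketch of such a helper might look like this (the endpoint is the public NYT Article Search URL, while the API key and the exact query parameters are assumptions based on that API's documentation):

```python
from datetime import date

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
API_KEY = "YOUR_NYT_API_KEY"  # placeholder: substitute your own key

def create_uri(query: str, date_str: str, page: int = 0) -> str:
    """Build the request URI for a given query, date (YYYYMMDD), and results page."""
    return (
        f"{BASE_URL}?q={query}"
        f"&begin_date={date_str}&end_date={date_str}"
        f"&page={page}&api-key={API_KEY}"
    )

# Example: articles about Covid-19 published today
uri = create_uri("Covid-19", date.today().strftime("%Y%m%d"))
```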

We can use this function to obtain all news articles pertaining to Covid-19 that were published on the selected date.

Because the API returns only 10 articles per request, we need to make multiple requests until all of the data has been collected, after which it is stored in a data frame.
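A hedged sketch of that collection loop, building on the create_uri helper above (the response fields and the pause between calls are assumptions based on the API's documented behavior; error handling is omitted):

```python
import math
import time
import requests
import pandas as pd

def collect_articles(query: str, date_str: str) -> pd.DataFrame:
    """Page through the API until every matching article has been collected."""
    first = requests.get(create_uri(query, date_str, page=0)).json()
    hits = first["response"]["meta"]["hits"]
    docs = first["response"]["docs"]

    # The API returns 10 articles per page, so keep requesting further pages
    for page in range(1, math.ceil(hits / 10)):
        time.sleep(6)  # pause between calls to stay within the rate limit
        resp = requests.get(create_uri(query, date_str, page=page)).json()
        docs.extend(resp["response"]["docs"])

    # Flatten the nested JSON documents into a data frame
    return pd.json_normalize(docs)
```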

Here is an overview of the dataset’s features.

Code Output (Created By Author)

Step 2: Transform

Now, we need to make sure that the data is processed so that it is usable.

Out of the available features, the only ones required are the news article’s headline, URL, publication date, and author. Moreover, the articles collected have to be unbiased and objective, meaning that op-eds are not desired.

For such a scenario, the ideal transformations to apply are data cleansing and filtering. All records with missing headlines as well as any duplicates need to be removed. Next, as we are looking for objective articles, all op-eds should be filtered out of the dataset. Finally, any irrelevant features should be omitted.

We achieve all of this with the following code.
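A minimal sketch of these transformations, assuming the flattened column names that pd.json_normalize would produce (headline.main, web_url, pub_date, byline.original) and a type_of_material field marking op-eds:

```python
import pandas as pd

def transform_articles(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Data cleansing: drop records with missing headlines, then duplicates
    df = df.dropna(subset=["headline.main"])
    df = df.drop_duplicates(subset=["headline.main", "web_url"])

    # Filtering: keep objective reporting by excluding op-eds
    df = df[df["type_of_material"] != "Op-Ed"]

    # Keep only the features needed downstream and give them friendlier names
    df = df[["headline.main", "web_url", "pub_date", "byline.original"]]
    return df.rename(columns={
        "headline.main": "headline",
        "web_url": "url",
        "pub_date": "publication_date",
        "byline.original": "author",
    })
```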

Here is a preview of the processed dataset.

Code Output (Created By Author)

Step 3: Load

Since the data is now in an acceptable format, it should be stored in a relational database (PostgreSQL in this case) for future use. To achieve this, we will connect to the database using the SQLAlchemy module.

Here, we create an engine that points to an already created database. Using this engine, we can store the data in a table named “news_articles”. If the table already exists, the new data is appended rather than overwriting what is there. With this approach, we can preserve any previously collected data.
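A minimal sketch of that loading step (the connection string is a placeholder for a locally running PostgreSQL database, and clean_df stands in for the transformed data frame from the previous step):

```python
from sqlalchemy import create_engine

# Placeholder connection string for an already created local PostgreSQL database
engine = create_engine("postgresql://user:password@localhost:5432/news_db")

# Append today's articles to "news_articles", creating the table if it does not exist
clean_df.to_sql("news_articles", con=engine, if_exists="append", index=False)
```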

Now that the data has been loaded into the provided location, it can be accessed directly with SQL queries.
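For example, reading the table back with pandas and the same engine:

```python
import pandas as pd

# Read the stored articles back with a plain SQL query
stored = pd.read_sql("SELECT * FROM news_articles", con=engine)
print(stored.head())
```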

Code Output (Created By Author)

Using just a few modules, we were able to build a simple ETL pipeline that collects data from an API, processes it to remove unwanted information, and stores the transformed data in a database for future use, all in one fell swoop.

Altogether, the code will procure all news articles related to Covid-19 published on the current date. Theoretically, if a user ran this program daily for a year, they would acquire all relevant data on Covid-19 news articles for the whole year.

Additional tools to consider

While the demonstration showed how ETL can be carried out with a simple program, data pipelines for real-life business cases are more sophisticated and often require the inclusion of other tools and technologies.

1. Cloud platforms

In the case study, the transformed data is stored in a local machine.

However, when large quantities of data are involved, storing data on-premise is not a feasible approach. Thus, it is quite common to rely on a cloud platform (e.g., AWS, GCP) for data storage.

2. Big data frameworks

If working with big data, ETL pipelines may need to incorporate large-scale data processing frameworks (e.g., Apache Spark) that can expedite operations with parallel processing.

3. Job schedulers

ETL is rarely a one-and-done kind of job. Data may need to be collected periodically so that it remains up-to-date.

Assuming you are not a robot that never gets sick or never skips weekends, you may need to utilize a job scheduler (e.g., Apache Airflow) to automate the ETL workflow.

Conclusion

Photo by Prateek Katyal on Unsplash

When I was a novice, I understood the concept of ETL but was always intimidated by the complex code that would be needed to execute it.

Hopefully, I have shown how ETL pipelines can be created with basic programming knowledge. If you are just starting out, I invite you to get some hands-on experience by building your own pipeline using the tools you are familiar with.

Once you get comfortable with the basics, feel free to add more skills to your arsenal. In no time, you will be able to build sophisticated pipelines that resemble those used in real-life business cases.

I wish you the best of luck in your data science endeavors!
