Building ETL Pipelines — For Beginners
Overview and implementation with Python
The most appealing aspect of the data scientist role is the chance to build predictive models or conduct studies that yield actionable insights.
However, such tasks are impossible to perform without data that is both usable and accessible.
To obtain data that can adequately fuel analysis or product development, data scientists often build ETL pipelines.
ETL, short for extract-transform-load, is a series of processes that entails ingesting data, processing it to ensure usability, and storing it in a secure and accessible location.
The appeal of an ETL pipeline is that it facilitates data collection, processing, and storage with maximum efficiency and minimal friction.
Here, we explore the individual constituents of ETL and then demonstrate how one can build a simple ETL pipeline using Python.
Extract
Before conducting any analysis, the relevant data needs to be procured.
The first phase of ETL entails extracting raw data from one or more sources. Such sources can include flat files, databases, and CRMs.
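As a rough sketch of this step, pandas can read from both kinds of sources; the file name, database, and table below are hypothetical placeholders:

```python
import sqlite3

import pandas as pd

# Extract raw data from a flat file (hypothetical file name)
orders = pd.read_csv("orders.csv")

# Extract raw data from a relational database (hypothetical database and table)
conn = sqlite3.connect("sales.db")
customers = pd.read_sql("SELECT * FROM customers", conn)
conn.close()
```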
Transform
At this point, all the raw data is collected, but it is unlikely to be fit for use.
Thus, the second phase of ETL entails transforming the data to ensure its usability.
There are many types of transformations one may want to apply to their data.
1. Data cleansing
Any unwanted records or variables should be removed. Data cleansing can come in the form of removing features, missing values, duplicates, or outliers.
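In pandas, a minimal version of these steps might look like the following; the columns and values are invented for illustration:

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 41, 41, None, 250],   # a missing value and an implausible outlier
    "legacy_code": ["A", "B", "B", "C", "D"],
})

clean = (
    raw.drop(columns=["legacy_code"])  # remove an unwanted feature
       .drop_duplicates()              # remove duplicate records
       .dropna(subset=["age"])         # remove records with missing values
)
clean = clean[clean["age"] <= 120]     # remove an obvious outlier
```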
2. Re-formatting
Oftentimes, when data is pulled from multiple sources, re-formatting becomes a necessary step. Even if different sources report the same information, they may do so in their own unique format.
For instance, two data sources could have a date feature, but one source may show dates in the ‘day-month-year’ format, while the other may show dates in the ‘month-day-year’ format. For data to be usable, information from all sources has to adhere to a single format.
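With pandas, each source can be parsed with its own format so that both end up in a single consistent datetime representation; the sample dates below are made up:

```python
import pandas as pd

# One source reports dates as 'day-month-year', the other as 'month-day-year'
source_a = pd.DataFrame({"published": ["25-03-2021", "02-04-2021"]})
source_b = pd.DataFrame({"published": ["03-25-2021", "04-02-2021"]})

# Parse each source with its own format, yielding one shared representation
source_a["published"] = pd.to_datetime(source_a["published"], format="%d-%m-%Y")
source_b["published"] = pd.to_datetime(source_b["published"], format="%m-%d-%Y")
```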
3. Feature extraction
New features can be created using information from existing features. Examples of this include extracting information from a string variable or extracting the year/month/day components from a date variable.
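Both cases can be sketched in pandas with hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    "author": ["Jane Doe", "John Smith"],
    "published": pd.to_datetime(["2021-06-01", "2021-07-15"]),
})

# Extract information from a string variable
df["author_last_name"] = df["author"].str.split().str[-1]

# Extract the year/month/day components from a date variable
df["year"] = df["published"].dt.year
df["month"] = df["published"].dt.month
df["day"] = df["published"].dt.day
```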
4. Aggregation
Data can be aggregated to derive the desired metrics (e.g., customer count, revenue, etc.).
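A pandas groupby covers this kind of aggregation; the columns and figures below are invented:

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "customer_id": [1, 2, 3, 3],
    "revenue": [120.0, 80.0, 200.0, 50.0],
})

# Derive customer count and total revenue per region
summary = (
    orders.groupby("region")
          .agg(customer_count=("customer_id", "nunique"),
               revenue=("revenue", "sum"))
          .reset_index()
)
```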
5. Joins
Data from multiple sources can be merged to create one comprehensive dataset.
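In pandas, this is a merge; the keys and columns below are illustrative:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cleo"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [50.0, 20.0, 75.0]})

# Merge the two sources into one comprehensive dataset
combined = customers.merge(orders, on="customer_id", how="left")
```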
6. Filtering
Unwanted categories can be omitted from the dataset.
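As a quick example with a made-up category column:

```python
import pandas as pd

articles = pd.DataFrame({
    "title": ["Vaccine rollout begins", "Weekend sports roundup", "New variant detected"],
    "category": ["health", "sports", "health"],
})

# Keep only the categories of interest
health_articles = articles[articles["category"] == "health"]
```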
Load
After all the transformations are applied, the data is fit for analysis, but it needs to be stored for any subsequent use.
The third and final phase of ETL entails loading the data into a secure and accessible location.
Here are some common options for storing the data.
1. Relational databases
A very popular approach is to store data in a relational database. With this method, users can periodically append or overwrite the data stored in the database with newly procured data.
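As a sketch with pandas and SQLite (the database and table names are hypothetical):

```python
import sqlite3

import pandas as pd

daily_metrics = pd.DataFrame({"customer_id": [1, 2], "revenue": [120.0, 80.0]})

# Append the newly procured data to a table, creating it if it does not exist
conn = sqlite3.connect("warehouse.db")
daily_metrics.to_sql("daily_metrics", conn, if_exists="append", index=False)
conn.close()
```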
2. Flat files
Users also have the option to store their data in flat files (e.g., Excel spreadsheets, text files).
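For example, a DataFrame can be written straight to a CSV file (the file name is hypothetical):

```python
import pandas as pd

daily_metrics = pd.DataFrame({"customer_id": [1, 2], "revenue": [120.0, 80.0]})

# Write the data to a flat file
daily_metrics.to_csv("daily_metrics.csv", index=False)
```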
Case Study
We can see the ETL processes in effect by building a simple pipeline using Python.
Suppose we need to obtain data on news articles related to Covid-19 for some type of analysis.
To achieve this goal, we will write a program that can:
- Collect data on Covid-19 news articles published on the current date
- Transform the data so that it is fit for use
- Store the data in a database
With this pipeline, we can procure information on all relevant news articles for the current date. By running this program every day, we would get a continuous supply of data on Covid-19 news articles.
The modules required for this exercise are shown below:
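Assuming we fetch the articles from a news API over HTTP with requests, transform them with pandas, and store them in a local SQLite database, the imports might look like this (the pipeline could just as well rely on other libraries, such as a dedicated news API client):

```python
import sqlite3             # to load the results into a local database
from datetime import date  # to work with the current date

import pandas as pd        # to transform the extracted data
import requests            # to call a news API over HTTP
```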