avatarZach Quinn

Summary

The website content discusses the optimization of multi-column data transformation operations in Python using config files to streamline code and improve maintainability.

Abstract

The article emphasizes the importance of using Python's config library to manage multi-column transformations efficiently. It illustrates how traditional ETL scripts can become cluttered with transformation code, leading to difficulties in reading, reviewing, and revising. By utilizing a config file, developers can define variables and mappings that reference data frames, reducing redundancy and improving the legibility of the code. The author provides examples of how to use config files for column renaming, variable type conversion, and string replacement, demonstrating the benefits of this approach for data engineering tasks. The article also warns about the need to restart the kernel in notebook environments to ensure config changes take effect and advises on security considerations when storing sensitive information in config files.

Opinions

  • The author prefers using Pandas for data cleaning and transformation but acknowledges the potential for messy and overwhelming code in ETL scripts.
  • Config files are seen as a solution to cluttered scripts, providing a cleaner and more maintainable approach to variable assignment and reference in Python scripts.
  • The author values the security benefits of storing sensitive information like credentials in config files, while also cautioning about the potential security risks if not managed properly.
  • The article suggests that using config files can significantly reduce the amount of code needed for common data transformation tasks, such as renaming multiple columns or converting variable types.
  • The author promotes their data engineering resource, "Pipeline," and invites readers to follow for more insights on Python, SQL, and cloud computing

Multi-Column Transformation Using Config Files in Python

How to optimize multi-column transformation operations using Python’s config library.

Photo by Darryl Low on Unsplash

In building data pipelines, the T of ETL, transformation, can quickly consume lines in a script, making for messy code.

Since I prefer that my data, whenever possible, be in a data frame, I use Pandas to clean and transform the output data. However, if we examine one of my personal ETL scripts, you’ll notice that this can get overwhelming for a developer, reviewer and any other poor unfortunate soul who may have to update or revise the code in the future.

To illustrate my point, here is a small snippet from a project I completed using data from The Onion’s coverage of President Joe Biden.

Aside from being difficult to read, review and revise, this code also contains redundant steps, like wasting lines casting categorical variables (excuse the SQL terminology) individually instead of completing the step all at once.

In addition to examining use cases for a config file to accompany an ETL script, I’ll also walk through how we can eliminate redundant transformations by explicitly referencing a config file that maps variables as a dictionary or list that we can use to optimize Python code.

Why You Should Write A Python Config File

A config file is a Python file that can include basic mappings for any variable you want to assign and later reference in a Python script.

In data engineering use cases, you might store credentials, table names and other information related to pipeline development that has the potential to clutter a script. When supplying an API with credentials and other tokens, there is also a need to keep this sensitive information secure; config files can be handy ways to store these strings.

For your reference, here is a sample config file for a hypothetical simple ETL pipeline. There’s nothing particularly complex about this file. It really only defines variables.

By defining variables within the scope of a config file, we can then reference this information in a Python script.

Note: Config files can be saved as yml, ini and even JSON files. For the purposes of this demonstration, I’m using Python’s config library, which I’ve aliased as ‘cfg’ in my ETL script.

This yields the following output.

Essentially, we can treat each of these variables as attributes of the config/cfg function. Once you understand how these two files communicate, then it becomes much easier to substitute existing variables with a config file, potentially eliminating several lines of redundant and illegible code.

Pardon the interruption: For more Python, SQL and cloud computing walkthroughs, follow Pipeline: Your Data Engineering Resource.

To receive my latest writing, you can follow me as well.

Column Transformations Using a Config File

With the above context in mind, here are a few transformations that you can conduct with a config file instead of hard coding every line in a script.

Column Renaming

If you’ve ever encountered a school, work or personal project situation in which you need to rename multiple columns, you’ll know that it is very easy to wind up with unwieldy dictionaries that can clutter a script and distract a reviewer.

Here is a sample from a small web scraping function I wrote to conduct data analysis for my data-driven story on the disparity in winter vs. summer Olympics wins of the U.S. and other world powers.

I’ve encountered situations in which I’ve had 75–100 columns to rename. This can become tedious and unsightly for anyone who encounters your code. Here is the same operation, but this time we’re using a config file to map the dictionary.

Cfg.medal_cols refers to the medal_cols variable I created within the associated config file:

Variable Type Conversion

Python and Pandas, especially, are particular about how users define and implement variables. Variable conversion is always an essential part of my work, but again, this can consume space within a script.

Imagine we have the same columns from the above excerpt and we’d like to convert them all to strings. Doing so manually would look something like this:

Yes, I’m aware that I can store these columns in a list and reference that. However, I can save space in my main script by moving this step to a config file and referencing only the variable that contains the string columns.

Note that because I’m passing a list, I don’t need the double brackets.

String Replacement

Using a config file for string replacement might be one of my favorite use cases. The steps are: Define a variable that references a list stored in a config file, loop through each item in the list, assign the results to a variable and apply a function like str.replace(). Here is how that looks once implemented.

In this final use case, I’m returning to my Biden Onion example from the introduction. I’d like to eliminate row values that contain brackets. Since this data was originally scraped from Wikipedia, there are several footnotes encased in brackets. This is not clean data.

Note that I only explicitly reference the config file in the first step and then I reference the associated variable. To me, this is much more efficient than embedding a list of columns and potentially writing cluttered code.

A Final Warning

If you are attempting to implement any of these methods in a notebook environment like Jupyter Notebook, be aware that you need to restart the kernel and save your config file before config changes will be visible.

The most common error you’ll see is something along the lines of ‘Attribute Error: config does not have attribute x.’

Nine times out of ten, it’s because the file hasn’t saved and because the kernel hasn’t been restarted.

Also, for those of you concerned with security, be aware that a config file that anyone can access may not be the most secure way to store information like API tokens and sensitive passwords.

Create a job-worthy data portfolio. Learn how with my free project guide.

Data Engineering
Python
Etl
Data Science
Data Analysis
Recommended from ReadMedium