The provided context outlines methods for cleaning and organizing categorical data using the Pandas library in Python, emphasizing the importance of data cleaning in data science projects.
Abstract
The article titled "DATA CLEANING WITH PANDAS" delves into the intricacies of cleaning categorical data, a fundamental yet often disliked task in data science and machine learning. It highlights the significance of data cleaning, stating that data scientists spend a substantial amount of time on this task, which is critical for the success of any data-driven project. The author discusses common issues encountered with categorical data, such as membership constraints, value inconsistencies, and the need to collapse data into meaningful categories. Practical examples and code snippets are provided to illustrate how to handle these issues using Pandas, including setting up the environment, understanding categorical data, and implementing various cleaning techniques such as dropping, remapping, and inferring categories. The article also touches on the use of Pandas functions like cut and replace to create categorical variables and reduce the number of categories, respectively. The author encourages readers to share the article and provides links to related content for further reading.
Opinions
The author acknowledges that data cleaning is a tedious but essential part of data science, often taking up the majority of a data scientist's time.
There is a clear emphasis on the necessity of establishing membership constraints to define the number of categories and their representation in the data.
The article suggests that real-world datasets often contain inconsistencies and errors, such as typos and variations in category labels, which must be addressed to ensure data quality.
The author provides a professional tip to create separate data frames for category mappings, indicating a best practice in data cleaning workflows.
The author expresses a preference for lower-case standardization in categorical data and demonstrates how to handle inconsistencies like whitespace and capitalization errors.
The use of visual aids, such as count plots, is recommended to gain insights from categorical data after cleaning.
The article promotes the idea of collapsing smaller categories into larger, more general ones when it serves the analytical goals of the project.
The author encourages reader engagement by asking for feedback and sharing, indicating the value of community interaction and knowledge dissemination in the field of data science.
“Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis. 57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work and 19% say this about collecting data sets”.
Data cleaning is full of frustration, disgusting surprises that take hours to deal with, always new problems with new datasets, you name it. The process is never enjoyable and always considered as a dirty side of data science.
Even though it is often hated, it might be the most important step before any data project. Without properly addressing issues with your data, you might compromise all the other stages in the data science workflow.
Without much intro in this second part of my Data Cleaning Series, let’s get right to the point. This post is about categorical data cleaning. I will be discussing common and uncommon methods to deal with some intermediate level categorical data problems. Here is the general overview:
You can get the notebook and the data used in this post on this GitHub repo.
Setup
Categorical Data — Understanding, And Examples
The formal definition of categorical data would be:
A predefined set of possible categories or groups an observation can fall into.
You can see examples of categorical data in almost all the datasets you have worked on. Nearly any type of data can be turned into categorical. For example:
Survey responses:
Yes or No
male or female
employed or unemployed
Numeric data:
Annual income in groups: 0-40k, 40-100k, ...
Ages: child, teenager, adult …
As we are learning data cleaning using panads library, it is important to understand that pandas will never import categorical data as categorical. It is mostly imported as strings or integers:
You can see that cut, color and clarity are imported as strings rather than as categorical. We could have used the dtype parameter of read_csv like this:
But with real-world datasets, you often do not have this luxury because the data you will be working on probably will have many categorical variables and often, you will be completely unfamiliar with the data at first.
Right after you identify the categorical variables, there are some checks and cleaning to be done before you convert the columns to categorical.
Dealing With Categorical Data Problems
When you work with real-world data, it will be filled with cleaning problems.
As I wrote in the first part of the series, people collecting data won’t take into account the cleanliness of the data and do what it takes to record the necessary information in an easy manner as possible.
Also, problems will arise because of free text during the collection process which leads to typos, many representations of the same value, etc. It is also possible that errors are introduced because of data parsing errors or bad database design.
For example, consider this worst-case scenario: you are working on a survey data conducted across the USA and there is a state column for the state of each observation in the dataset. There are 50 states in the USA and imagine all the damn variations of state names people can come up with. You are in even bigger problem if data collectors decide to use abbreviations:
CA, ca, Ca, Caliphornia, Californa, Calfornia, calipornia, CAL, CALI, …
Such columns will always be filled with typos, errors, inconsistencies. Never imagine that you will have a smooth one-to-one mapping of categories.
Before moving on to analysis, you have to establish what is called membership constraints which clearly defines the number of categories and how they are represented in a single format.
Membership Constraints
There are 3 ways you can treat categorical data problems:
dropping
remapping categories
inferring categories
First, we will focus on isolating inconsistent observations and dropping them. I have created fake data to illustrate how it is done in code:
You can see the short script I wrote to generate this dataset from the GitHub repo I shared at the beginning of the article
It is often the case that you will have some background information about your data. For example, let’s say you want to check for inconsistencies in the blood_type column of the above data frame. You find out beforehand that blood_type can only have these categories: [A+, A-, B+, B-, O+, O-, AB+, AB-]. So, you have to make sure that the column in the data source only includes these values.
In our case, there are 10k rows and visual search for inconsistencies is not an option, which is also the case for many other real-world data. Here is how can implement the best solution for such problems:
First, you should create a new data frame which holds all the possible values for a categorical column:
PRO TIP: It is a good practice to create such data frames which hold category mappings for each categorical column in the main data.
As we now have the correct categories in a separate data frame, we can use a basic set operation which gives us the difference of unique values in the two columns:
To get the difference between the two sets, we use .difference function. It basically returns all the values from the left set that are not in the right set. Here is a very simple example:
Attentive readers may have noticed that inside the set function, I also called .unique() on the blood_type. From what I have read from one StackOverflow thread, it seems that the time it takes to get the unique values will be much shorter if you use both set and unique for larger datasets.
Now, we will filter our main data for the blood types ‘C+’ and ‘D-’:
Using isin on blood_type will return a boolean series which we can use to index the data frame:
90 rows x 6 columns. Somehow x turned into Russian в🤷♂️
So, there are 90 individuals with incorrect blood types. Since we don’t have any clue of how these errors occurred (I did it😁😁😁), we have to drop them. It can be done in two ways:
As our column is clean now, it is safe to set it as a categorical variable:
Definitely check out the first part of this series. There, I covered basic and common data problems. You will also familiarize yourself with some of the functions I will be using here.
Just like we talked about in the second section, there may be many representations of the same category in the data set. These errors may occur just because of simple typos, random capitalization, you name it. Continuing cleaning our data, the turn has come to the marriage_status column:
Using value_counts on a data frame column returns the count of unique values in that column. If you look at the result, you can immediately see the issues. The values should be either lower-case or upper-case. I prefer lower-case:
Using .str on a data frame column enables us to use all the Python string functions on each value of the column. Here, we are using .lower() which converts strings to lowercase.
value_counts is still returning 6 unique values, why? After paying close attention, you can see that one of the categories has extra leading whitespace. That's why it is being treated as an individual category. The same can be true for one of the unmarried, it might have trailing whitespace. We can use the string strip function to get rid of trailing whitespace from a string:
Now the column is clean. All is left is to convert this column into a category data type too:
Collapsing Data Into Categories
Sometimes, you may want to take already existing data, often numerical, and generate categories from it. This can be useful in a number of cases.
In our demographics dataset, we have an annual income column. It can be useful to cut this column into different income groups because doing so might give some extra insight into the data.
pandas has a perfect function for this: cut. It enables us to cut numerical ranges like data frame columns into bins and give them custom labels. Let's see it in action:
Now, we can use functions like value_counts to get more insight:
You can also plot a count plot:
It is randomly generated data, that’s why the bar heights are almost the same.
Reducing the Number of Categories
Sometimes, there may be unnecessary categories. In such cases, you can collapse smaller categories into general and bigger categories which might suit your needs better. Consider the device column of our data:
It is not much use to compare phone’s operating system to computer’s. What would be better is to reduce the categories into mobileOS and desktopOS. To do this, first, we need to create a dictionary which maps each category to the new one:
Then, we use replace function of pandas which maps out the new categories dynamically:
If you liked the article, please share it and leave feedback. As a writer, your support means the world to me!