Datascience with Python — Data Preprocessing
A crucial step to ensure the quality of your data
Data preprocessing is a crucial step in the data science workflow that ensures the quality and cleanliness of the data used for analysis and modeling. It is often said that 80% of a data scientist’s time is spent on cleaning and preparing data, making data preprocessing a critical aspect of the data science process.
We will explore the common steps involved in data preprocessing in Python and how to perform these steps using popular libraries such as pandas, NumPy, and scikit-learn.
Importing Data
The first step in data preprocessing is to import the data into your Python environment. There are several ways to do this, including importing data from CSV files, Excel spreadsheets, databases, or APIs.
You can import data using pandas or NumPy:
# Load a CSV file into a pandas DataFrame
import pandas as pd
data = pd.read_csv('file.csv')
# or load a delimited text file into a NumPy array
import numpy as np
data = np.loadtxt('file.txt', delimiter=',')
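Pandas also provides readers for the other sources mentioned above; a quick sketch (the file names, URL, and table name are placeholders):
import pandas as pd
import sqlalchemy
# Excel spreadsheet (reading .xlsx files requires the openpyxl package)
data = pd.read_excel('file.xlsx')
# JSON data, e.g. returned by an API endpoint
data = pd.read_json('https://example.com/api/data.json')
# SQL database via a SQLAlchemy engine
engine = sqlalchemy.create_engine('sqlite:///file.db')
data = pd.read_sql('SELECT * FROM my_table', engine)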
Of course, your data has to be formatted consistently for the import to work without problems.
Handling Missing Values
One of the most common issues in real-world data is missing values. Missing values can arise due to various reasons, such as data entry errors, measurement errors, or simply because a value is not available. When building data models, it is important to handle missing values appropriately to ensure the accuracy and reliability of the results.
There are several methods for dealing with missing values, including filling missing values with a placeholder value or imputing them with statistical methods.
Filling missing values with a placeholder value is a simple and straightforward approach. Missing entries are typically represented as np.nan (Not a Number) in NumPy and as None or NaN in pandas, and you can replace them with a fixed value or, for example, with the mean of each column:
import pandas as pd
data = pd.read_csv('file.csv')
# Replace NaN values with the mean of each numeric column
data.fillna(data.mean(numeric_only=True), inplace=True)
Imputing missing values with statistical methods is a more sophisticated approach. For example, you can use the mean, median, or mode of the column to fill missing values. The scikit-learn library provides the SimpleImputer class, which can be used to perform mean imputation:
import numpy as np
from sklearn.impute import SimpleImputer
# Replace NaN values with the mean of each column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
data = imp.fit_transform(data)
Depending on your data, you may choose one method or the other. In general, imputing missing values is a preferred approach over simply filling missing values with a placeholder value, as it provides a more accurate representation of the data.
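If the mean is not a good fit, for example for skewed numeric columns or for categorical columns, SimpleImputer also supports the 'median' and 'most_frequent' strategies. A minimal sketch, where the DataFrame and its 'age' and 'color' columns are hypothetical:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Hypothetical data with missing values in a numeric and a categorical column
data = pd.DataFrame({'age': [25, np.nan, 40], 'color': ['red', np.nan, 'blue']})
# Median imputation is less sensitive to outliers than the mean
num_imp = SimpleImputer(strategy='median')
data[['age']] = num_imp.fit_transform(data[['age']])
# Most-frequent (mode) imputation works for categorical columns
cat_imp = SimpleImputer(strategy='most_frequent')
data[['color']] = cat_imp.fit_transform(data[['color']])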
Encoding Categorical Variables
Categorical variables are variables that can take on a limited number of values, such as gender, color, or country. In order to use these variables in data analysis and modeling, they must be encoded into numerical values.
There are several methods for encoding categorical variables, including one-hot encoding, label encoding, and ordinal encoding.
One-hot encoding is a technique that creates new columns for each unique category in a categorical variable. Each column represents one category, and each row receives a value of 1 or 0 indicating the presence or absence of the category in that row. The pandas library provides the get_dummies() function for one-hot encoding:
import pandas as pd
data = pd.read_csv('file.csv')
# Create a new 0/1 column for each unique value of 'color'
data = pd.get_dummies(data, columns=['color'])
Label encoding is a technique that assigns a unique integer to each category in a categorical variable. The scikit-learn library provides the LabelEncoder class for label encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Map each distinct color to an integer
data['color'] = le.fit_transform(data['color'])
Ordinal encoding is a technique that assigns an ordered integer to each category in a categorical variable. This method is used when the categorical variable has a natural ordering, such as high, medium, and low.
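Scikit-learn's OrdinalEncoder can be used for this; a minimal sketch, where the 'size' column and its values are hypothetical and the category order is passed explicitly:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# Hypothetical column with a natural ordering
data = pd.DataFrame({'size': ['low', 'high', 'medium', 'low']})
# Pass the categories in their natural order so the assigned integers respect it
enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
data[['size']] = enc.fit_transform(data[['size']])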
Scaling Numerical Variables
Numerical variables, also known as continuous variables, can have values with a large range. When building data models, it is often necessary to scale these variables so that they are on the same scale. This is because some machine learning algorithms are sensitive to the scale of the input variables and can produce inaccurate results if the variables are not scaled.
There are several methods for scaling numerical variables, including min-max scaling, standardization, and normalization.
Min-max scaling (sometimes also called normalization) scales the values of a variable to a specific range, typically between 0 and 1. The scikit-learn library provides the MinMaxScaler class for min-max scaling:
from sklearn.preprocessing import MinMaxScaler
# Rescale every feature to the [0, 1] range
scaler = MinMaxScaler()
data = scaler.fit_transform(data)
Standardization scales the values of a variable to have a mean of 0 and a standard deviation of 1. The scikit-learn library provides the StandardScaler class for standardization:
from sklearn.preprocessing import StandardScaler
# Center each feature at 0 and scale it to unit variance
scaler = StandardScaler()
data = scaler.fit_transform(data)
Normalization scales each sample (row) so that its vector of values has a magnitude (norm) of 1. This method is used when the direction of the values is more important than their actual magnitudes.
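A minimal sketch using scikit-learn's Normalizer, which rescales each row to unit L2 norm:
from sklearn.preprocessing import Normalizer
# Rescale each row so that its L2 norm equals 1
normalizer = Normalizer(norm='l2')
data = normalizer.fit_transform(data)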
Splitting the Data
Once the data has been preprocessed, it is common to split the data into two or three parts: a training set, a validation set, and a test set. The training set is used to train the data model, the validation set is used to tune the model’s hyperparameters, and the test set is used to evaluate the model’s performance.
The training set is used to train the data model by providing it with the input and output data. The model uses the training set to learn how to make predictions based on the input data.
The validation set is used to tune the model’s hyperparameters. Hyperparameters are the parameters of the model that are not learned from the data. The validation set is used to determine the optimal values of the hyperparameters.
The test set is used to evaluate the model’s performance. The test set provides the model with new data that it has not seen before, and the model’s predictions are compared to the actual outputs. This provides an estimate of the model’s generalization ability and its ability to make accurate predictions on new, unseen data.
The scikit-learn library provides the train_test_split function for splitting the data into training and test sets:
from sklearn.model_selection import train_test_split
# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
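The function only produces two subsets at a time, so if you also want a validation set, a common approach is to call it twice; a sketch (the 0.25 fraction is just an example that yields a 60/20/20 split):
from sklearn.model_selection import train_test_split
# First split off the test set (20% of the data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then carve a validation set out of the remaining 80%
# (0.25 of 80% = 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)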
Dealing with Outliers
Outliers are values that are significantly different from the majority of the values in the data. Outliers can have a significant impact on the results of a data analysis, so it is important to identify and handle them appropriately.
There are several methods for dealing with outliers, including removing them, imputing them, or transforming them.
Removing outliers involves identifying the values that are significantly different from the majority of the values and removing them from the data. This method is appropriate when the outliers represent errors in the data, such as measurement errors or typos.
import numpy as np
import pandas as pd
from scipy import stats
df = pd.read_csv("data.csv")
# Remove rows with a Z-score above 3 in any column (assumes all columns are numeric)
z = np.abs(stats.zscore(df))
df = df[(z < 3).all(axis=1)]
Imputing outliers involves replacing the values that deviate strongly from the rest of the data with a more appropriate value. This method is appropriate when the outliers are not errors but you still do not want them to distort the analysis.
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
# Replace values more than 3 standard deviations from the column mean
# with that column's median (assumes all columns are numeric)
median = df.median()
df = df.mask(df.sub(df.mean()).div(df.std()).abs().gt(3), median, axis=1)
Transforming outliers involves transforming the values that are significantly different from the majority of the values so that they are more in line with the majority of the values. This method is appropriate when the outliers represent important information in the data, such as extreme values that are meaningful in the context of the data.
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
# Apply a log transform to values above the 99th percentile
# (assumes all columns are numeric and the affected values are positive)
threshold = np.percentile(df, 99)
df[df > threshold] = np.log(df[df > threshold])
Final Note
Data preprocessing is an essential step in the data science process.
By effectively preprocessing the data, you can ensure that the data is accurate, meaningful, and ready for analysis. This is critical for building accurate models and making informed decisions based on the data.