Datascience with Python — Data Preprocessing
A crucial step to ensure the quality of your data
Data preprocessing is a crucial step in the data science workflow that ensures the quality and cleanliness of the data used for analysis and modeling. It is often said that 80% of a data scientist’s time is spent on cleaning and preparing data, making data preprocessing a critical aspect of the data science process.
We will explore the common steps involved in data preprocessing in Python and how to perform these steps using popular libraries such as pandas, NumPy, and scikit-learn.
Importing Data
The first step in data preprocessing is to import the data into your Python environment. There are several ways to do this, including importing data from CSV files, Excel spreadsheets, databases, or APIs.
You can import data using pandas or NumPy:
# Load a CSV file into a pandas DataFrame
import pandas as pd
data = pd.read_csv('file.csv')
# or load a delimited text file into a NumPy array
import numpy as np
data = np.loadtxt('file.txt', delimiter=',')
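Pandas also provides readers for the other sources mentioned above; a quick sketch (the file names, URL, and table name are placeholders):
import pandas as pd
import sqlalchemy
# Excel spreadsheet (reading .xlsx files requires the openpyxl package)
data = pd.read_excel('file.xlsx')
# JSON data, e.g. returned by an API endpoint
data = pd.read_json('https://example.com/api/data.json')
# SQL database via a SQLAlchemy engine
engine = sqlalchemy.create_engine('sqlite:///file.db')
data = pd.read_sql('SELECT * FROM my_table', engine)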
Of course, your data has to be formatted consistently for the import to work without problems.
Handling Missing Values
One of the most common issues in real-world data is missing values. Missing values can arise due to various reasons, such as data entry errors, measurement errors, or simply because a value is not available. When building data models, it is important to handle missing values appropriately to ensure the accuracy and reliability of the results.
There are several methods for dealing with missing values, including filling missing values with a placeholder value or imputing them with statistical methods.
Filling missing values with a placeholder value is a simple and straightforward approach. Missing entries are typically represented as np.nan (Not a Number) in NumPy and as None or NaN in pandas, and you can replace them with a fixed value or, for example, with the mean of each column:
import pandas as pd
data = pd.read_csv('file.csv')
# Replace NaN values with the mean of each numeric column
data.fillna(data.mean(numeric_only=True), inplace=True)
Imputing missing values with statistical methods is a more sophisticated approach. For example, you can use the mean, median, or mode of the column to fill missing values. The scikit-learn library provides the SimpleImputer class, which can be used to perform mean imputation:
import numpy as np
from sklearn.impute import SimpleImputer
# Replace NaN values with the mean of each column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
data = imp.fit_transform(data)
Depending on your data, you may choose one method or the other. In general, imputing missing values is a preferred approach over simply filling missing values with a placeholder value, as it provides a more accurate representation of the data.
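If the mean is not a good fit, for example for skewed numeric columns or for categorical columns, SimpleImputer also supports the 'median' and 'most_frequent' strategies. A minimal sketch, where the DataFrame and its 'age' and 'color' columns are hypothetical:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Hypothetical data with missing values in a numeric and a categorical column
data = pd.DataFrame({'age': [25, np.nan, 40], 'color': ['red', np.nan, 'blue']})
# Median imputation is less sensitive to outliers than the mean
num_imp = SimpleImputer(strategy='median')
data[['age']] = num_imp.fit_transform(data[['age']])
# Most-frequent (mode) imputation works for categorical columns
cat_imp = SimpleImputer(strategy='most_frequent')
data[['color']] = cat_imp.fit_transform(data[['color']])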
Encoding Categorical Variables
Categorical variables are variables that can take on a limited number of values, such as gender, color, or country. In order to use these variables in data analysis and modeling, they must be encoded into numerical values.
There are several methods for encoding categorical variables, including one-hot encoding, label encoding, and ordinal encoding.
One-hot encoding is a technique that creates new columns for each unique category in a categorical variable. Each column represents one category, and each row receives a value of 1 or 0 indicating the presence or absence of the category in that row. The pandas library provides the get_dummies() function for one-hot encoding:
import pandas as pd
data = pd.read_csv('file.csv')
# Create a new 0/1 column for each unique value of 'color'
data = pd.get_dummies(data, columns=['color'])
Label encoding is a technique that assigns a unique integer to each category in a categorical variable. The scikit-learn library provides the LabelEncoder class for label encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Map each distinct color to an integer
data['color'] = le.fit_transform(data['color'])
Ordinal encoding is a technique that assigns an ordered integer to each category in a categorical variable. This method is used when the categorical variable has a natural ordering, such as high, medium, and low.
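Scikit-learn's OrdinalEncoder can be used for this; a minimal sketch, where the 'size' column and its values are hypothetical and the category order is passed explicitly:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# Hypothetical column with a natural ordering
data = pd.DataFrame({'size': ['low', 'high', 'medium', 'low']})
# Pass the categories in their natural order so the assigned integers respect it
enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
data[['size']] = enc.fit_transform(data[['size']])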
Scaling Numerical Variables
Numerical variables, also known as continuous variables, can have values with a large range. When building data models, it is often necessary to scale these variables so that they are on the same scale. This is because some machine learning algorithms are sensitive to the scale of the input variables and can produce inaccurate results if the variables are not scaled.
There are several methods for scaling numerical variables, including min-max scaling, standardization, and normalization.
Min-max scaling (sometimes also called normalization) scales the values of a variable to a specific range, typically between 0 and 1. The scikit-learn library provides the MinMaxScaler class for min-max scaling:
from sklearn.preprocessing import MinMaxScaler
# Rescale every feature to the [0, 1] range
scaler = MinMaxScaler()
data = scaler.fit_transform(data)
Standardization scales the values of a variable to have a mean of 0 and a standard deviation of 1. The scikit-learn library provides the StandardScaler class for standardization:
from sklearn.preprocessing import StandardScaler
# Center each feature at 0 and scale it to unit variance
scaler = StandardScaler()
data = scaler.fit_transform(data)
Normalization scales each sample (row) so that its vector of values has a magnitude (norm) of 1. This method is used when the direction of the values is more important than their actual magnitudes.
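A minimal sketch using scikit-learn's Normalizer, which rescales each row to unit L2 norm:
from sklearn.preprocessing import Normalizer
# Rescale each row so that its L2 norm equals 1
normalizer = Normalizer(norm='l2')
data = normalizer.fit_transform(data)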
Splitting the Data
Once the data has been preprocessed, it is common to split the data into two or three parts: a training set, a validation set, and a test set. The training set is used to train the data model, the validation set is used to tune the model’s hyperparameters, and the test set is used to evaluate the model’s performance.
The training set is used to train the data model by providing it with the input and output data. The model uses the training set to learn how to make predictions based on the input data.
The validation set is used to tune the model’s hyperparameters. Hyperparameters are the parameters of the model that are not learned from the data. The validation set is used to determine the optimal values of the hyperparameters.
The test set is used to evaluate the model’s performance. The test set provides the model with new data that it has not seen before, and the model’s predictions are compared to the actual outputs. This provides an estimate of the model’s generalization ability and its ability to make accurate predictions on new, unseen data.
The scikit-learn library provides the train_test_split function for splitting the data into training and test sets:
from sklearn.model_selection import train_test_split
# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
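The function only produces two subsets at a time, so if you also want a validation set, a common approach is to call it twice; a sketch (the 0.25 fraction is just an example that yields a 60/20/20 split):
from sklearn.model_selection import train_test_split
# First split off the test set (20% of the data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then carve a validation set out of the remaining 80%
# (0.25 of 80% = 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)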
Dealing with Outliers
Outliers are values that are significantly different from the majority of the values in the data. Outliers can have a significant impact on the results of a data analysis, so it is important to identify and handle them appropriately.
There are several methods for dealing with outliers, including removing them, imputing them, or transforming them.
Removing outliers involves identifying the values that are significantly different from the majority of the values and removing them from the data. This method is appropriate when the outliers represent errors in the data, such as measurement errors or typos.
import numpy as np
import pandas as pd
from scipy import stats
df = pd.read_csv("data.csv")
# Remove rows with a Z-score above 3 in any column (assumes all columns are numeric)
z = np.abs(stats.zscore(df))
df = df[(z < 3).all(axis=1)]
Imputing outliers involves replacing the values that deviate strongly from the rest of the data with a more appropriate value. This method is appropriate when the outliers are not errors but you still do not want them to distort the analysis.
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
# Replace values more than 3 standard deviations from the column mean
# with that column's median (assumes all columns are numeric)
median = df.median()
df = df.mask(df.sub(df.mean()).div(df.std()).abs().gt(3), median, axis=1)
Transforming outliers involves transforming the values that are significantly different from the majority of the values so that they are more in line with the majority of the values. This method is appropriate when the outliers represent important information in the data, such as extreme values that are meaningful in the context of the data.
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
# Apply a log transform to values above the 99th percentile
# (assumes all columns are numeric and the affected values are positive)
threshold = np.percentile(df, 99)
df[df > threshold] = np.log(df[df > threshold])
Final Note
Data preprocessing is an essential step in the data science process.
By effectively preprocessing the data, you can ensure that the data is accurate, meaningful, and ready for analysis. This is critical for building accurate models and making informed decisions based on the data.