Machine Learning

An overview of the Scikit-learn Library — Episode 1 Preprocessing

A description in episodes of the well-known Python Library for Machine Learning. The first episode deals with the preprocessing sub-package.

Scikit-learn is a very popular Python library for Machine Learning. Initially developed by David Cournapeau in 2007, it began to grow in 2010, when INRIA, the French Institute for Research in Computer Science and Automation, got involved in the project. In September 2021, version 1.0 of Scikit-learn was released.

Scikit-learn provides all the steps involved in the Machine Learning process, including Data Preprocessing, Feature Extraction, Model Selection, Model Training, Model Evaluation and Model Deployment.

With this article, I start a series of episodes, each describing a single sub-package provided by Scikit-learn. Scikit-learn is organized into a main module, named sklearn, which is split into many submodules. In this article, I focus on Data Preprocessing.

All the classes and functions devoted to Data Preprocessing are contained in the submodule sklearn.preprocessing, which provides the following operations:

Feature Scaling
Feature Binarization
Feature Encoding
Non Linear Transformations
Other

1 Feature Scaling

Feature Scaling involves Data Normalization and Standardization. In this interesting article by Baijayanta Roy, entitled All about Feature Scaling, you can understand the difference between normalization and standardization.

Scikit-learn provides many classes for feature scaling:

MaxAbsScaler() — given a list of feature values, convert every value of the list into a number between -1 and 1. The new value is calculated as the current value divided by the maximum absolute value of the column.
MinMaxScaler() — the new value is calculated as the difference between the current value and the min value, divided by the range of the list of feature values.
RobustScaler() — remove the median and scale the data according to the Interquartile Range. This scaler is robust to outliers.
StandardScaler() — remove the mean and scale to unit variance, i.e. divide by the standard deviation. This scaler corresponds to the classical standardization process.
Normalizer() — rescale each sample (i.e. each row) so that it has unit norm.

To use any scaler, you should first instantiate an object, such as:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

then you must fit it on the data (in practice the training set only, so that no information from the test set leaks into the scaler):

scaler.fit(data)

finally, you can apply the scaler to the data of interest (which may also be the whole dataset):

scaled_data = scaler.transform(subset_of_data)

Alternatively, you could apply fit and transform together, with a single function:

scaled_data = scaler.fit_transform(data)
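The instantiate, fit and transform steps can be sketched end to end as follows. This is only a minimal illustration: the arrays X_train and X_new are made-up toy data, and the manual computation is there just to show what StandardScaler, MinMaxScaler and MaxAbsScaler are expected to compute.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Toy data, invented for illustration: 4 samples, 2 features.
X_train = np.array([[1.0, -2.0],
                    [2.0, 0.0],
                    [3.0, 4.0],
                    [4.0, 8.0]])
X_new = np.array([[2.5, 2.0]])

# Fit on the training data only, then reuse the fitted scaler on new data.
scaler = StandardScaler()
scaler.fit(X_train)
X_new_std = scaler.transform(X_new)

# The same result by hand: (x - mean) / standard deviation, per column.
manual_std = (X_new - X_train.mean(axis=0)) / X_train.std(axis=0)

# MinMaxScaler maps each column to [0, 1];
# MaxAbsScaler divides each column by its maximum absolute value.
X_minmax = MinMaxScaler().fit_transform(X_train)
X_maxabs = MaxAbsScaler().fit_transform(X_train)

print(X_new_std, manual_std)
print(X_minmax[:, 1])  # [0.  0.2 0.6 1. ]
print(X_maxabs[:, 1])  # [-0.25  0.    0.5   1.  ]
```

Note that the second column contains a negative value, so MaxAbsScaler returns a negative number for it, while MinMaxScaler always stays in the range 0 to 1.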

You can read a practical example of MinMaxScaler() and MaxAbsScaler() in my previous article, entitled Data Normalization with Python scikit-learn.

In addition to the described classes, Scikit-learn provides some functions for feature scaling, which can be used directly on a fixed array, without the fitting procedure:

maxabs_scale()
minmax_scale()
normalize()
robust_scale()
scale()
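As a rough sketch of how these functions are called, the snippet below applies three of them to a small invented array X. Since nothing is fitted and stored, this style suits one-off transformations rather than a model that must later process new data.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale, normalize, scale

# Toy array, invented for illustration.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

X_std = scale(X)        # zero mean and unit variance per column
X_01 = minmax_scale(X)  # each column mapped to [0, 1]
X_unit = normalize(X)   # each row rescaled to unit L2 norm

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_01)
print(np.linalg.norm(X_unit, axis=1))
```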

2 Feature Binarization

Feature Binarization thresholds numerical features to get boolean values, stored as 0 or 1.

Scikit-learn provides many classes for binarization:

Binarizer() — set a feature value to 0 or 1, depending on a threshold. A value greater than the threshold is set to 1, a value less than or equal to the threshold is set to 0.
LabelBinarizer() — for each output label, build a vector whose number of elements is equal to the number of unique labels, then set to 1 the element corresponding to that label and to 0 all the others. This class is very similar to OneHotEncoder(), with the difference that LabelBinarizer() is used for output classes, while OneHotEncoder() is used for input features. For more details, you can read this thread on Stack Overflow.
MultiLabelBinarizer() — an extension of LabelBinarizer() that supports multiple labels per sample.

Like scalers, binarizers should first be instantiated, then fitted and finally applied to data:

from sklearn.preprocessing import Binarizer

binarizer = Binarizer()
binarizer.fit(data)
binarized_data = binarizer.transform(new_data)

or alternatively:

binarizer = Binarizer()
data = binarizer.fit_transform(data)
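To make the three binarizer classes more concrete, here is a small sketch on invented data. The threshold of 0.5 and the label names are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, LabelBinarizer, MultiLabelBinarizer

# Binarizer: values strictly greater than the threshold become 1.
X = np.array([[0.2, 1.5],
              [0.5, 0.4]])
X_bin = Binarizer(threshold=0.5).fit_transform(X)
print(X_bin)  # [[0. 1.] [0. 0.]]

# LabelBinarizer: one column per class, classes sorted alphabetically.
lb = LabelBinarizer()
y_bin = lb.fit_transform(["cat", "dog", "bird", "dog"])
print(lb.classes_)  # ['bird' 'cat' 'dog']
print(y_bin)        # [[0 1 0] [0 0 1] [1 0 0] [0 0 1]]

# MultiLabelBinarizer: each sample may carry several labels.
mlb = MultiLabelBinarizer()
y_multi = mlb.fit_transform([["news", "sport"], ["sport"], ["music"]])
print(mlb.classes_)  # ['music' 'news' 'sport']
print(y_multi)       # [[0 1 1] [0 0 1] [1 0 0]]
```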

Scikit-learn also provides functions for binarization, which can be applied directly to a fixed array, without the fitting procedure:

binarize()
label_binarize()
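A minimal sketch of the two functions, on invented data, could look like this. Because label_binarize() does not learn the labels from a fitting step, the full list of classes has to be passed explicitly through the classes argument.

```python
import numpy as np
from sklearn.preprocessing import binarize, label_binarize

# Threshold a numeric array in a single call.
X = np.array([[0.2, 1.5],
              [0.5, 0.4]])
X_bin_fn = binarize(X, threshold=0.5)
print(X_bin_fn)  # [[0. 1.] [0. 0.]]

# The column order follows the order of the classes argument.
y_bin_fn = label_binarize(["cat", "dog", "bird"], classes=["bird", "cat", "dog"])
print(y_bin_fn)  # [[0 1 0] [0 0 1] [1 0 0]]
```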

As with scalers, remember to save the fitted binarizer, because it will be needed during model deployment.

3 Feature Encoding

Categorical features must be transformed to numbers before a model can be fitted and evaluated. Feature encoding is exactly this kind of transformation.

Scikit-learn provides different classes for feature encoding:

LabelEncoder() — encode output labels with a value between 0 and the total number of classes minus one.
OneHotEncoder() — for each input categorical feature, build a vector whose number of elements is equal to the number of unique categories, then set to 1 the element corresponding to the category of the sample and to 0 all the others.
OrdinalEncoder() — each unique category value is assigned an integer value, starting from 0. Then, each input categorical feature value is replaced by the integer of its category. For example, with the default alphabetical ordering of the categories, “apple” is 0, “melon” is 1, and “orange” is 2.

The use of each encoder is quite similar to the previous operations: it is sufficient to instantiate the selected encoder, fit it on the categorical data and then transform the data with it.
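The following sketch shows the three encoders on a tiny invented dataset. The fruit and spam/ham values are only placeholders. The one-hot result is converted with toarray() because OneHotEncoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Input features must be 2D: one row per sample, one column per feature.
fruits = np.array([["apple"], ["orange"], ["melon"], ["apple"]])

# OrdinalEncoder: categories are sorted, so apple=0, melon=1, orange=2.
ordinal = OrdinalEncoder()
fruit_codes = ordinal.fit_transform(fruits)
print(fruit_codes.ravel())  # [0. 2. 1. 0.]

# OneHotEncoder: one column per category (apple, melon, orange).
onehot = OneHotEncoder()
fruit_onehot = onehot.fit_transform(fruits).toarray()
print(fruit_onehot)

# LabelEncoder: meant for the 1D target vector, not for input features.
le = LabelEncoder()
y_codes = le.fit_transform(["spam", "ham", "spam"])
print(y_codes)                           # [1 0 1]
print(le.inverse_transform(y_codes))     # ['spam' 'ham' 'spam']
```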

4 Non Linear Transformations

Scikit-learn also provides some interesting classes for non linear transformations:

PowerTransformer() — apply a power transform to make features more Gaussian-like. This is useful for modelling situations where data normality is desired.
QuantileTransformer() — transforms the features to follow a uniform or a normal distribution. This is done by exploiting quantile information.
SplineTransformer() — generate a new feature matrix, based on univariate B-spline functions.
FunctionTransformer() — apply a custom transformation.

Each transformer must first be instantiated, then fitted with data and finally used through the transform() function.

Transformations can also be applied directly to a fixed array, without any fitting, through the following functions:

quantile_transform()
power_transform()
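As an illustration, the sketch below applies three of the transformer classes to synthetic right-skewed data. The log-normal sample, the choice of n_quantiles=100 and the use of np.log1p as custom function are assumptions made only for this example.

```python
import numpy as np
from sklearn.preprocessing import (
    FunctionTransformer,
    PowerTransformer,
    QuantileTransformer,
)

# Synthetic right-skewed data, generated with a fixed seed.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 1))

# PowerTransformer: Yeo-Johnson by default, output standardized
# to zero mean and unit variance.
pt = PowerTransformer()
X_gauss = pt.fit_transform(X)

# QuantileTransformer: map the values to a uniform distribution on [0, 1].
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
X_uniform = qt.fit_transform(X)

# FunctionTransformer: wrap any callable, here log(1 + x).
log_tf = FunctionTransformer(np.log1p)
X_log = log_tf.fit_transform(X)

print(X_gauss.mean(), X_gauss.std())
print(X_uniform.min(), X_uniform.max())
```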

5 Other

The preprocessing package also includes the following classes:

KBinsDiscretizer() — bin continuous data into intervals.
KernelCenterer() — center a kernel matrix.
PolynomialFeatures() — generate a new feature matrix with all polynomial combinations of the features with degree less than or equal to the specified degree.

and an interesting function:

add_dummy_feature() — augment the dataset with a dummy input feature, i.e. an additional column with a constant value (1 by default).
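A short sketch of KBinsDiscretizer(), PolynomialFeatures() and add_dummy_feature() on invented data may help to see what they return. The number of bins, the encoding and the binning strategy are arbitrary choices for the example.

```python
import numpy as np
from sklearn.preprocessing import (
    KBinsDiscretizer,
    PolynomialFeatures,
    add_dummy_feature,
)

# KBinsDiscretizer: three equal-width bins over the range [0, 5].
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = disc.fit_transform(X)
print(X_binned.ravel())  # [0. 0. 1. 1. 2. 2.]

# PolynomialFeatures: for [a, b] and degree 2 the output is
# [1, a, b, a^2, a*b, b^2].
row = np.array([[2.0, 3.0]])
X_poly = PolynomialFeatures(degree=2).fit_transform(row)
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]

# add_dummy_feature: prepend a constant column (value 1.0 by default).
X_dummy = add_dummy_feature(row)
print(X_dummy)  # [[1. 2. 3.]]
```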

Summary

In this article, I have given an overview of the Scikit-learn preprocessing package. Many operations and transformations can be applied to a dataset, both to input features and output classes, including Feature Scaling, Feature Binarization, Feature Encoding, Non Linear Transformations and other operations. For more information, you can read the official Scikit-learn Documentation on the preprocessing package.

If you want to discover other classes and functions provided by Scikit-learn, you can follow me, subscribe to my mailing list and stay tuned.

If you have read this far, that is already a lot for today. Thanks! You can read more about me in this article.

Bonus

Remember to save the fitted preprocessing object to an external file, because when you deploy the model, you need to apply to new data the same scaler used during training:

import joblib

joblib.dump(prep, 'prep.pkl')

Then, to load the scaler from the file, you can execute the following piece of code:

prep = joblib.load('prep.pkl')
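Put together, a save and load round trip might look like the sketch below. It assumes the standalone joblib package, which is installed together with recent versions of scikit-learn. It also writes to a temporary directory, and the toy arrays are invented for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a preprocessing object on some toy training data.
X_train = np.array([[1.0], [2.0], [3.0]])
prep = StandardScaler().fit(X_train)

# Save it to disk (a temporary directory is used for this example).
path = os.path.join(tempfile.mkdtemp(), "prep.pkl")
joblib.dump(prep, path)

# Later, for example at deployment time, load it back and reuse it.
restored = joblib.load(path)
X_new = np.array([[4.0]])
same = np.allclose(prep.transform(X_new), restored.transform(X_new))
print(same)  # True
```

It is generally safer to load the file with the same scikit-learn version that was used to save it, since the internal format of the objects can change between releases.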
