
Getting Started
How to Master Scikit-learn for Data Science
Here’s the Essential Scikit-learn you Need for Data Science
Scikit-learn is one of many scikits (short for SciPy Toolkits), and it specializes in machine learning. A scikit is a package that is too specialized to be included in SciPy itself and is thus packaged separately. Another popular scikit is scikit-image (a collection of algorithms for image processing).
Scikit-learn is one of the pillars of machine learning in Python: it allows you to build machine learning models and also provides utility functions for data preparation, post-model analysis and evaluation.
In this article, we will explore the bare minimum that you need to know about scikit-learn in order to get started in data science. I try my best to distill the essence of the scikit-learn library through hand-drawn illustrations of key concepts as well as code snippets that you can use for your own projects.
Let’s dive in!
1. Data representation in scikit-learn
Let’s start with the basics and consider the data representation used in scikit-learn, which is essentially a tabular dataset.
At a high level, the tabular dataset for a supervised learning problem consists of both X and y variables, while the dataset for an unsupervised learning problem consists of X variables only.
X variables are also known as independent variables; they can be either quantitative or qualitative descriptions of the samples of interest. The y variable is also known as the dependent variable; it is the target or response variable that predictive models are built to predict.
A cartoon illustration of a typical tabular dataset used in scikit-learn is shown below.

For example, if we’re building a predictive model to predict whether individuals have a disease or not, the disease/non-disease status is the y variable, whereas health indicators obtained from clinical test results are used as X variables.
2. Loading data from CSV files via Pandas
In practice, the contents of a dataset can be stored in a CSV file and read in using the Pandas library via the pd.read_csv() function. The loaded data is held in a data structure known as the Pandas DataFrame.
Let’s see this in action.
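As a minimal sketch of this step, the snippet below reads CSV text into a DataFrame. The column names and values are made up for illustration (they are not the article's dataset), and an in-memory string stands in for a file on disk so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a real file (illustrative values only).
csv_text = """age,blood_pressure,cholesterol,disease
50,130,220,1
35,118,180,0
62,145,260,1
41,122,195,0
"""

# pd.read_csv() accepts a file path or a file-like object.
# With a real file this would look like pd.read_csv("your_data.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(type(df).__name__)  # the loaded object is a pandas DataFrame
print(df.head())
```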

Afterwards, data processing can be performed on the DataFrame using the wide range of Pandas functions for handling missing data (i.e. dropping missing data or filling them in with imputed values), selecting a specific column or a range of columns, performing feature transformations, conditional filtering of data, etc.
In the following example, we will separate the DataFrame as X and y variables, which will be used shortly for model building.
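One common way to do this separation, sketched here on a small made-up DataFrame (the column names, including the target column "disease", are assumptions for illustration), is to drop the target column to obtain X and to select it to obtain y.

```python
import pandas as pd

# Hypothetical dataset; "disease" plays the role of the y variable.
df = pd.DataFrame({
    "age": [50, 35, 62, 41],
    "blood_pressure": [130, 118, 145, 122],
    "cholesterol": [220, 180, 260, 195],
    "disease": [1, 0, 1, 0],
})

# X: every column except the target; y: the target column only.
X = df.drop(columns=["disease"])
y = df["disease"]

print(X.shape, y.shape)
```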

This gives rise to the following X data matrix:

And the following y variable:

For a high-level overview of how to master Pandas for data science also check out a prior blog post that I’ve written.
3. Utility functions from scikit-learn
One of the great things about scikit-learn aside from its machine learning capability is its utility functions.
3.1. Creating artificial datasets
For instance, you can create artificial datasets using scikit-learn (as shown below) that can be used to try out the different machine learning workflows that you may have devised.
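A minimal sketch of generating such a dataset with make_classification() follows. The sample size, number of features and random seed are arbitrary choices for illustration, not values taken from the article.

```python
from sklearn.datasets import make_classification

# Generate a synthetic binary classification dataset.
# All sizes below are arbitrary illustrative choices.
X, y = make_classification(
    n_samples=200,     # number of rows
    n_features=10,     # number of X variables
    n_informative=5,   # features that actually carry signal
    n_classes=2,
    random_state=42,   # fixed seed for reproducibility
)

print(X.shape, y.shape)
```

For regression problems, make_regression() in the same sklearn.datasets module plays the analogous role.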

3.2. Feature scaling
As features may be on heterogeneous scales that differ by several orders of magnitude, it is often essential to perform feature scaling.
Common approaches include normalization (scaling features to a uniform range between 0 and 1) and standardization (centering and scaling features such that every X feature has a mean of 0 and a standard deviation of 1).
In scikit-learn, scaling to the 0 to 1 range can be performed using MinMaxScaler() (note that the normalize() function instead rescales each sample to unit norm), while standardization can be performed via StandardScaler().
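The two approaches can be sketched as follows on a tiny made-up matrix. MinMaxScaler is assumed here for the 0 to 1 rescaling, since it operates per feature, and StandardScaler is used for standardization.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (illustrative values only).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Rescale each feature (column) to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardize each feature to zero mean and unit standard deviation.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

In a real workflow the scaler would normally be fitted on the training set only and then applied to the test set with transform(), so that no information from the test set leaks into training.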
3.3. Feature selection
A common feature selection approach that I like to use is to simply discard features that have low variance as they provide minimal signal (if we think of it in terms of signals and noises).
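A minimal sketch of this idea using VarianceThreshold from sklearn.feature_selection is given below. The data and the threshold of 0.1 are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The first column is constant (zero variance) and carries no signal.
X = np.array([[0, 2.0, 1],
              [0, 1.0, 4],
              [0, 3.0, 1],
              [0, 0.5, 9]])

# Discard features whose variance does not exceed the threshold (arbitrary value).
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # boolean mask of the retained features
print(X_reduced.shape)
```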

3.4. Feature engineering
It is often the case that the provided features are not readily suitable for model building. For instance, categorical features need to be encoded into a form that is compatible with the machine learning algorithms in scikit-learn (i.e. from strings to integers or a binary numerical form).
Two common types of categorical features include:
- Nominal features: the categorical values of the feature have no logical order and are independent of one another. For instance, categorical values pertaining to cities such as Los Angeles, Irvine and Bangkok are nominal.
- Ordinal features: the categorical values of the feature have a logical order and are related to one another. For instance, categorical values that follow a scale such as low, medium and high have a logical order and relationship such that low < medium < high.
Such feature encoding can be performed using native Python (numerical mapping), Pandas (the get_dummies() function and the map() method) as well as from within scikit-learn (OneHotEncoder(), OrdinalEncoder(), LabelBinarizer(), LabelEncoder(), etc.).
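As a rough sketch of the scikit-learn route, the snippet below one-hot encodes a nominal feature and ordinal-encodes an ordered one. The small DataFrame and its column names ("city", "level") are hypothetical and reuse the example values from the text.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: "city" is nominal, "level" is ordinal.
df = pd.DataFrame({
    "city": ["Los Angeles", "Irvine", "Bangkok", "Irvine"],
    "level": ["low", "high", "medium", "low"],
})

# Nominal feature: one binary column per category (categories are sorted).
onehot = OneHotEncoder()
city_encoded = onehot.fit_transform(df[["city"]]).toarray()

# Ordinal feature: integers that respect the stated order low < medium < high.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
level_encoded = ordinal.fit_transform(df[["level"]])

print(onehot.categories_)
print(city_encoded)
print(level_encoded.ravel())
```

Passing the category order explicitly to OrdinalEncoder matters here: left to its default, the encoder would sort the strings alphabetically (high, low, medium) and lose the intended ranking.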
3.5. Imputing missing data
Scikit-learn also supports the imputation of missing values, which is an important part of data pre-processing prior to the construction of machine learning models. Users can use either the univariate or multivariate imputation method via the SimpleImputer() and IterativeImputer() functions from the sklearn.impute sub-module.
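A minimal sketch of univariate imputation with SimpleImputer is shown below, using the column mean as the fill value. The data and the choice of the "mean" strategy are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small matrix with missing entries (illustrative values only).
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Univariate imputation: replace each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)

# Note: IterativeImputer (multivariate) is experimental and must be enabled
# explicitly before it can be imported:
#   from sklearn.experimental import enable_iterative_imputer  # noqa: F401
#   from sklearn.impute import IterativeImputer
```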
3.6. Data splitting
A commonly used utility is data splitting, with which we can separate the given input X and y variables into training and test subsets (X_train, y_train, X_test and y_test).
The code snippet below makes use of train_test_split() to perform the data splitting, where the input arguments are the X and y variables, the size of the test set (set to 0.2, or 20%) and a random seed number (set to 42 so that the code block yields the same data split if it is run multiple times).
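A sketch of such a split, with test_size=0.2 and random_state=42 as described, is given here. The synthetic dataset is an assumption added only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the X and y variables (100 samples, 10 features).
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Hold out 20% of the samples as the test set; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```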

3.7. Creating a workflow using Pipeline
As the name implies, we can make use of the Pipeline() function to create a chain or sequence of tasks that are involved in the construction of machine learning models. For example, this could be a sequence that consists of feature imputation, feature encoding and model training.
We can think of pipelines as the use of a collection of modular Lego-like building blocks for building machine learning workflows.
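To make the building-block idea concrete, here is a minimal sketch of a three-step pipeline: imputation, scaling, then a random forest. The step names, the synthetic data and the use of scaling (rather than feature encoding) as the middle step are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first two features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 5% of the entries to simulate missing values.
X[rng.random(X.shape) < 0.05] = np.nan

# Each step is a (name, transformer/estimator) pair; the last step is the model.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# fit() runs every step in order; predict() reuses the fitted transformers.
pipe.fit(X, y)
preds = pipe.predict(X)

print(preds[:10])
```

Because the whole chain behaves like a single estimator, it can be passed directly to utilities such as cross-validation or grid search.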
For more information on building your own machine learning pipeline using scikit-learn, Jason Brownlee from Machine Learning Mastery provides a detailed account in the following tutorial:
4. High-level overview of using scikit-learn
4.1. Core steps for building and evaluating models
In a nutshell, if I can summarize the core essence of using learning algorithms in scikit-learn it would consist of the following 5 steps:
from sklearn.modulename import EstimatorName # 0. Import
model = EstimatorName() # 1. Instantiate
model.fit(X_train, y_train) # 2. Fit
model.predict(X_test) # 3. Predict
model.score(X_test, y_test) # 4. Score
Translating the above pseudo-code to the construction of an actual model (e.g. classification model) by using the random forest algorithm as an example would yield the following code block:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_features=5, n_estimators=100)
rf.fit(X_train, y_train)
rf.predict(X_test)
rf.score(X_test, y_test)
A cartoon illustration summarizing these core basic steps for using estimators (i.e. the learning algorithm function) in scikit-learn is shown below.

Step 0. Importing the estimator function from a module of scikit-learn. An estimator is used to refer to the learning algorithm such as RandomForestClassifier that is used to estimate the output y values given the input X values.
Simply put, this can be best summarized by the equation y = f(X) where y can be estimated given known values of X.
Step 1. Instantiating the estimator or model. This is done by calling the estimator function and simply assigning it to a variable. In particular, we can name this variable model, clf or rf (i.e. an abbreviation of the learning algorithm used, random forest).
The instantiated model can be thought of as an empty box with no knowledge of the data, as no training has yet occurred.
Step 2. The instantiated model will now be allowed to learn from a training dataset in a process known as model building or model training.
The training is initiated via the fit() function, where the training data is specified as the input arguments, as in rf.fit(X_train, y_train), which literally translates to allowing the instantiated rf estimator to learn the relationship between X_train and y_train. Upon completion of the calculation, the model is trained on the training set.
Step 3. The trained model will now be applied to make predictions on new and unseen data (e.g. X_test) via the predict() function.
As a result, predicted y values are generated (and can be stored in a variable such as y_test_pred that can later be compared against the actual y_test values to compute the model performance).
Step 4. The model performance can now be calculated. The simplest and quickest method is to use the score() function, as in model.score(X_test, y_test).
For a classification model, the score() function returns the accuracy, whereas for a regression model it returns the R² value (coefficient of determination).
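As a quick check of what score() reports for a classifier, the sketch below runs the core steps on a synthetic dataset (an assumption for illustration) and compares score() with accuracy_score() computed from stored predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 1-3: instantiate, fit, predict (random_state added for reproducibility).
rf = RandomForestClassifier(max_features=5, n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_test_pred = rf.predict(X_test)

# Step 4: score() on a classifier is the accuracy on the given data.
acc_from_score = rf.score(X_test, y_test)
acc_from_metric = accuracy_score(y_test, y_test_pred)

print(acc_from_score, acc_from_metric)
```

Storing the predictions in y_test_pred also makes it easy to compute other metrics from sklearn.metrics when accuracy alone is not informative enough.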
For completeness, we can then extend this core workflow to also include other additional steps that could further boost the robustness and usability of constructed models.
I’ll be talking about these additional steps separately in the following sections.
4.2. Model interpretation
A model is only useful if insights can be extracted from it so as to drive the decision-making process.
Continuing with the random forest model built above, the feature importances stored in the trained rf model can be extracted as follows:
# Model interpretation
rf.feature_importances_
The above code produces an array of importance values for the features used in model building:

We can then tidy up the representation by combining it with the feature names to produce a clean DataFrame as follows:
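One way to build such a DataFrame is sketched below. The synthetic data and the generated feature names (feature_0, feature_1, ...) are placeholders; with a real dataset the names would typically come from the columns of X.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each feature name with its importance and sort from most to least important.
importance_df = (
    pd.DataFrame({"feature": feature_names, "importance": rf.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)

print(importance_df)
```

The sorted DataFrame can then be handed to a plotting library to draw a bar chart of the importances.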

Finally, one can take these values to create a feature importance plot as shown below:

In a nutshell, as the name implies, a feature importance plot shows the relative importance of features as judged by an importance value, such as the Gini-based importances produced by the random forest model.
4.3. Hyperparameter tuning
Typically, I use the default hyperparameters when building the first few models. In these first attempts, the goal is to make sure that the entire workflow runs from start to finish without errors.
My go-to machine learning algorithm is random forest and I use it as the baseline model. In many cases it is also selected as the final learning algorithm, as it provides a good balance between robust performance and excellent model interpretability.
Once the workflow is in place, the next goal is to perform hyperparameter tuning in order to achieve the best possible performance.
Although random forest may work quite well straight out of the box, some hyperparameter tuning could yield slightly higher performance. For learning algorithms such as support vector machines, hyperparameter tuning is essential in order to obtain robust performance.
Let’s now perform hyperparameter tuning, which we can do via the GridSearchCV() function.
1. Firstly, we will create an artificial dataset and perform data splitting, which will then serve as the data on which to build subsequent models.

2. Secondly, we will perform the actual hyperparameter tuning.

3. Finally, we can display the results from hyperparameter tuning in a visual representation.
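The three steps above can be sketched together as follows. The dataset sizes, the small parameter grid and the 3-fold cross-validation are assumptions chosen to keep the example fast, not the settings from the article's notebook, and the results are shown as a sorted table rather than a plot.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Create an artificial dataset and split it.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Search over a small, illustrative grid with 3-fold cross-validation.
param_grid = {"max_features": [2, 5], "n_estimators": [20, 50]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)

# 3. Collect the cross-validation results into a DataFrame for inspection.
results = pd.DataFrame(grid.cv_results_)[
    ["param_max_features", "param_n_estimators", "mean_test_score"]
]

print(grid.best_params_, grid.best_score_)
print(results.sort_values("mean_test_score", ascending=False))
```

After fitting, grid.best_estimator_ holds a model refitted on the whole training set with the best parameter combination, so it can be scored on X_test directly.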

You can download the full Jupyter notebook from which the above code snippets were taken. If video is your thing, I’ve also created a YouTube video showing how to perform hyperparameter tuning using scikit-learn.







