Sklearn Tutorial: Module 3
I took the official sklearn MOOC tutorial. Here are my takeaways.
This is the third post in my scikit-learn tutorial series. If you haven't read them yet, I strongly recommend starting with my first two posts; the rest will be much easier to follow:
In this third module, we’ll see what hyperparameters are, and why and how we should optimize them.
What's a hyperparameter?
When setting up our model so far, we only changed either the preprocessing, the kind of model, or both — but we haven’t really played with the model’s hyperparameters.
A model’s hyperparameters are parameters that are set by us, data scientists, when creating our model/pipeline. They are parameters that define the model before it sees any data. You could say that they allow us to define different “variants” of the same pipeline.
Hyperparameters typically influence the model's complexity, and as a consequence, the learning process and the overall model performance. Given a dataset and the problem you want to solve, your job as a data scientist is to find the best "hyperparametrized model" among the infinite space of "hyperparametrized models."
The hyperparameters are not to be confused with the internal parameters that are learned by the model during the learning process — those internal parameters that are learned are also called “coefficients.” For example, in polynomial regression, the hyperparameter (set before learning) is the degree of the regression, while the internal parameters learned using the train set are the polynomial coefficients (the a/b/c in aX² + bX + c). Put another way, you first set the degree (hyperparameter), and then the regression fit is done using the data (internal coefficients are learned) — not the other way around.
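To make this concrete, here is a minimal sketch (with a toy quadratic dataset made up for illustration) where the degree is chosen before fitting, while the coefficients are only available after fitting:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# toy data following y = 2x² - 3x + 1, plus a bit of noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 * X[:, 0] ** 2 - 3 * X[:, 0] + 1 + rng.normal(scale=0.1, size=100)

# the degree is a hyperparameter: we set it before the model sees any data
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

# the coefficients are internal parameters, learned from the data during fit
print(model[-1].coef_, model[-1].intercept_)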
As a consequence, model hyperparameters can be set when model/preprocessors are created. For example, in scikit-learn:
- PolynomialFeatures(degree=degree): the degree of the polynomial features created from each feature
- Ridge(alpha=5): the regularization strength of a linear ridge regression
- SVC(C=1.0, kernel="rbf"): the regularization parameter and kernel of a support vector classifier; depending on the kernel chosen, additional hyperparameters are available
- KNeighborsClassifier(n_neighbors=5): the number of neighbors considered in a k-nearest neighbors classifier
- StandardScaler(with_mean=True, with_std=True): even the standard scaler preprocessor can be tuned through its hyperparameters, namely whether to remove the mean and/or divide by the standard deviation
Those examples show that the available hyperparameters depend on the whole pipeline you use for your model. For example, the following pipeline has the hyperparameters of both the scaler and the regressor:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(
    steps=[
        ("standard_scaler", StandardScaler(with_mean=True)),  # has with_mean/with_std hyperparameters
        ("linear_regression", LinearRegression(fit_intercept=True)),  # has fit_intercept
    ]
)
# This pipeline's hyperparameter set is the union of the hyperparameters of each step of the pipeline
As we will see below, hyperparameters can also be read and set after a pipeline has been created. We will even see that individual steps can be considered as hyperparameters (for example, the “kind” of the scaler preprocessor, with possible values “StandardScaler”, “MinMaxScaler”, etc).
Note that for a given dataset, just as one kind of model can outperform another, one hyperparameter value can outperform another. Put another way, for every dataset there is an optimal hyperparameter set.
So remember:
- Hyperparameters correspond to parameters you set when creating the model, before the model is fed with a dataset.
- They correspond to every parameter you can set when creating a pipeline, given each step in the pipeline.
- The optimal hyperparameter set depends on the goal of the ML exercise and the input dataset.
- Our job is to find the best hyperparameter set.
The rest of this post explains how to access and modify hyperparameters of models, and the different ways to search and optimize such hyperparameters.
How to get/set hyperparameters of a pipeline/model
In sklearn, once a model or pipeline has been created, an API is available to:
- list the hyperparameters available and their respective values
- change their value
For a given model, you can get the list of all the hyperparameters and their values with the .get_params() method:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
pipeline = Pipeline(
steps=[
('preprocessor', StandardScaler()),
('lin_reg', LinearRegression())
]
)
pipeline.get_params()
{
'memory': None,
'steps': [
('preprocessor', StandardScaler()),
('lin_reg', LinearRegression())
],
'verbose': False,
'preprocessor': StandardScaler(),
'lin_reg': LinearRegression(),
'preprocessor__copy': True,
'preprocessor__with_mean': True,
'preprocessor__with_std': True,
'lin_reg__copy_X': True,
'lin_reg__fit_intercept': True,
'lin_reg__n_jobs': None,
'lin_reg__positive': False
}
Several important things are to be noticed:
- .get_params() returns a dict, including a steps entry that contains the list of the steps of the pipeline.
- The names used when creating the pipeline, preprocessor and lin_reg in our case, are reused as keys in this parameter dict.
- As such, each step's hyperparameters are named using the convention <step_name>__<parameter_name>, with a double underscore between the step's name and the parameter's name.
For the API to be consistent, note that all parameters are returned in this dict, including some hyperparameters that have no impact on performance (like lin_reg__n_jobs and preprocessor__copy).
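For example, an individual hyperparameter can be read back through its double-underscore key, and a whole step through its name (a quick sketch reusing the pipeline defined above):
# read a single hyperparameter using the <step_name>__<parameter_name> convention
pipeline.get_params()["preprocessor__with_mean"]  # True

# the step objects themselves are also accessible by their names
pipeline.named_steps["lin_reg"]  # LinearRegression()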
Similarly, we can change the value of any of those parameters with the same consistent API, using set_params(name=value):
# to change 2 parameters at once
pipeline.set_params(lin_reg__fit_intercept=False, preprocessor__with_std=False)

# to change the scaler step completely
from sklearn.preprocessing import MinMaxScaler
pipeline.set_params(preprocessor=MinMaxScaler())
pipeline.get_params()
{
'memory': None,
'steps': [
('preprocessor', MinMaxScaler()),
('lin_reg', LinearRegression())
],
'verbose': False,
'preprocessor': MinMaxScaler(),
'lin_reg': LinearRegression(),
'preprocessor__clip': False,
'preprocessor__copy': True,
'preprocessor__feature_range': (0, 1),
'lin_reg__copy_X': True,
'lin_reg__fit_intercept': True,
'lin_reg__n_jobs': None,
'lin_reg__positive': False
}
As mentioned before, we can even change a step completely using the same API: here we changed from a StandardScaler to a MinMaxScaler preprocessor. Notice the differences in parameters available after the change of preprocessor type (still called ‘preprocessor,’ but the corresponding hyperparameters are those of a MinMaxScaler).
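Since the step is still named 'preprocessor', the hyperparameters of the new MinMaxScaler can be tuned through the same double-underscore convention, for example:
# tune the newly swapped-in MinMaxScaler via the same naming convention
pipeline.set_params(preprocessor__feature_range=(0, 0.5))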
Manual hyperparameter tuning
Now that we know what hyperparameters are, how to get/set them, and why we should optimize them, let’s take a first approach to do such optimization.
As for any optimization problem, we identify:
- The “space” we want to explore: this is all the hyperparameter values we want to try.
- The value we want to optimize: here it corresponds to the model’s performance through its score.
The simplest (and least efficient, least robust) way to do such an optimization is to loop over one hyperparameter and use the score from a single train/test split:
from sklearn.model_selection import train_test_split

pipeline = Pipeline(
    [('preprocessor', StandardScaler()),
     ('lin_reg', LinearRegression())]
)

# a single train/test split: the score depends on this particular split
X_train, X_test, y_train, y_test = train_test_split(X, y)

for with_mean in [True, False]:
    pipeline.set_params(preprocessor__with_mean=with_mean)
    pipeline.fit(X_train, y_train)
    print(f"with_mean={with_mean}: score={pipeline.score(X_test, y_test)}")

# we can then identify the best value for with_mean
So in this first approach, we manually write a loop in which the pipeline is fitted and tested. A first improvement we can make is to use cross-validation in order to compute a more meaningful score:
from sklearn.model_selection import cross_validate

for with_mean in [True, False]:
    pipeline.set_params(preprocessor__with_mean=with_mean)
    cv_results = cross_validate(pipeline, X, y)
    print(f"with_mean={with_mean}: score={cv_results['test_score'].mean()}")

# we can then identify the best value for with_mean, with more certainty about our choice
Using cross-validation, we have a more robust estimation of the model’s performance for each value of the hyperparameter.
Now let's go further and optimize 2 hyperparameters: we have to nest 2 loops, one for each hyperparameter:
for with_mean in [True, False]:
    for with_std in [True, False]:
        pipeline.set_params(preprocessor__with_mean=with_mean, preprocessor__with_std=with_std)
        cv_results = cross_validate(pipeline, X, y)
        print(f"with_mean={with_mean}/with_std={with_std}: score={cv_results['test_score'].mean()}")

# we can then identify the best value for (with_mean, with_std)
Now, what if we want to optimize on 3, 4, 10, or more hyperparameters? And what if we want to try 10 different values for each of those hyperparameters? We’d have to write many, many nested loops and inspect many, many scores.
This is why scikit-learn provides helper functions to automate this hyperparameter search process, like GridSearchCV and RandomizedSearchCV.
Automatic tuning using GridSearch
The first automatic approach provided by sklearn to optimize hyperparameters is GridSearchCV. The idea is to use a dict to specify the candidate values for each hyperparameter; all combinations will then be tested. For example, to reproduce the example above where with_mean can be [True, False] and with_std can be [True, False], we'd use:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "preprocessor__with_mean": [True, False],
    "preprocessor__with_std": [True, False],
}
model_grid_search = GridSearchCV(pipeline, param_grid=param_grid)
This first snippet only creates a model: yes, a new model, which wraps the underlying pipeline. This new grid-search model can be fitted, again using model_grid_search.fit. During this fitting step, all combinations of hyperparameters are tested and the model performance is computed using cross-validation. Once the grid search is fitted, it can be used as any other predictor (by calling predict or score, for example), using the model with the best parameters found during fit:
# fit the gridsearch model
model_grid_search.fit(X_train, y_train)
# use the best model found
model_grid_search.score(X_test, y_test)
model_grid_search.predict(X_new)
# or inspect the results of the grid search
model_grid_search.cv_results_
So in other words, fitting a GridSearch model means trying all combinations and keeping the best one.
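Once fitted, the grid-search object also exposes the winning combination directly through its best_params_, best_score_ and best_estimator_ attributes:
# best hyperparameter combination and its cross-validated score
print(model_grid_search.best_params_)
print(model_grid_search.best_score_)

# the pipeline refitted on the whole train set with those best hyperparameters
model_grid_search.best_estimator_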
An important feature to remember is that we can use a list of dictionaries instead of just a dictionary to specify the combinations we want to try, in order to refine the hyperparameter sets that should be tested. For example:
param_grid = [
    {
        "preprocessor": [StandardScaler()],
        "preprocessor__with_mean": [True, False],
        "preprocessor__with_std": [True, False],
    },
    {
        "preprocessor": [MinMaxScaler()],
        "preprocessor__feature_range": [(0, 1), (0, 0.5), (0.25, 0.75)],
    },
]

# This grid search will try the StandardScaler with all combinations of with_mean/with_std
# AND the MinMaxScaler with 3 different ranges
model_grid_search = GridSearchCV(pipeline, param_grid=param_grid)
Stochastic tuning with RandomizedSearchCV
When the hyperparameters are continuous-valued and span a wide range, and/or the number of hyperparameters to tune is large, and/or the model is computationally expensive, the exhaustive-combination approach of GridSearchCV shows its limitations: the fitting time explodes. There is an obvious tradeoff between the number of hyperparameter sets to test and the total time.
To circumvent these limitations and improve our chances to find a good — if not the best — hyperparameter set, we can use a random approach to sample the hyperparameters space.
The idea is to specify all the possible values for all the hyperparameters and try sets at random.
Using a random approach to optimize numerical problems is a common trick, like for numerical integration or optimization problems.
To do so in sklearn, we use RandomizedSearchCV; its usage is almost identical to GridSearchCV's. For example, say we want to optimize a support vector classifier by tuning its C parameter, which can take any value from 0 to infinity, as well as other hyperparameters like kernel and gamma:
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# candidate values can be given as lists or as scipy distributions to sample from
param_distributions = {
    'C': uniform(0, 1000),
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'gamma': ['scale', 'auto'] + list(uniform(0, 1).rvs(10)),
}
random_search_model = RandomizedSearchCV(SVC(), param_distributions=param_distributions, n_iter=1000)
# fit the randomized-search model
random_search_model.fit(X_train, y_train)

# use the best model found
random_search_model.score(X_test, y_test)
random_search_model.predict(X_new)

# or inspect the results of the random search
random_search_model.cv_results_
Here, we allow the search to try 1000 hyperparameter sets, using n_iter to control the number of tries.
So remember: the randomized approach allows trying hyperparameters randomly and controlling the number of tries using n_iter. This approach is useful when some hyperparameters are continuous-valued and/or may take a wide range of values.
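As a side note, when a hyperparameter like C spans several orders of magnitude, sampling it on a log scale is often preferable to a plain uniform distribution. Here is a small sketch of that idea (the bounds and n_iter below are arbitrary choices for illustration, not values from the tutorial):
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# sample C between 1e-3 and 1e3, uniformly in log space
param_distributions = {
    'C': loguniform(1e-3, 1e3),
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
}
random_search_model = RandomizedSearchCV(SVC(), param_distributions=param_distributions, n_iter=100)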
Nested cross-validation pattern
To train and find the best-hyperparameter model using GridSearchCV/RandomizedSearchCV, we use the train set from the original split. That train set is then split again internally into train/test folds. In other words:
- First split: the original dataset is split into X_train and X_test.
- Then X_train is used to optimize the hyperparameters by training/testing each hyperparameter set using N internal splits (the number of folds): so X_train is itself split N times into internal train/test sets. The models are fitted/tested for each hyperparameter set, and the model performance is evaluated using cross-validation.
- Finally, the best model found is tested and evaluated on the original X_test set.
This means that this approach only provides a single evaluation of the generalization performance, since only the original X_test set is held out and never used in the learning steps (neither fitting nor optimizing). To improve our estimate of the generalization performance, we can use an outer cross-validation loop.
So remember: the outer loop is used to estimate the generalization performance of the overall fitting/optimizing process. In other words, the estimated best model’s performance is evaluated using cross-validation.
# nested cross-validation pattern:
# the inner loop (inside model_grid_search) selects the hyperparameters,
# the outer loop (cross_validate) estimates the generalization performance
cv_results = cross_validate(
    model_grid_search, X, y,
)
This way, we get the best of both worlds, at the price of additional computation.
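The outer test scores can then be summarized to report the generalization performance of the whole fitting/optimizing procedure, for example:
scores = cv_results["test_score"]
print(f"Generalization score: {scores.mean():.3f} +/- {scores.std():.3f}")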
Wrapup
This third module was dedicated to hyperparameters:
- Hyperparameters are parameters that define the way the model works and learns; they define its complexity. They should not be confused with the internal coefficients learned by the model when it is presented with the train set.
- Since these hyperparameters greatly influence the model, they must be optimized to improve the model's performance for the task at hand. The best hyperparameters depend on the input data.
- Optimizing hyperparameters can be done using cross-validated search methods like grid-search and random-search.
- A good practice when optimizing hyperparameters is to use the nested cross-validation pattern to estimate the performance of the best fitted model.
You might like some of my other posts, make sure to check them out: