Hyperparameter Tuning For XGBoost
Grid Search Vs Random Search Vs Bayesian Optimization (Hyperopt)
Grid search, random search, and Bayesian optimization are techniques for machine learning model hyperparameter tuning. This tutorial covers how to tune XGBoost hyperparameters using Python. You will learn
- What are the differences between grid search, random search, and Bayesian optimization?
- How to use grid search cross-validation to tune the hyperparameters for the XGBoost model?
- How to use random search cross-validation to tune the hyperparameters for the XGBoost model?
- How to use Bayesian optimization Hyperopt to tune the hyperparameters for the XGBoost model?
- How to compare the results from grid search, random search, and Bayesian optimization Hyperopt?
Let’s get started!
Step 0: Grid Search Vs. Random Search Vs. Bayesian Optimization
Grid search, random search, and Bayesian optimization have the same goal of choosing the best hyperparameters for a machine learning model. But they have differences in algorithm and implementation. Understanding these differences is essential for deciding which algorithm to use.
- Grid search is an exhaustive way to search hyperparameters. It evaluates every combination of hyperparameters for the model. Therefore, it can take a long time to run when there are a lot of hyperparameter combinations to compare.
- Random search picks a fixed number of hyperparameter combinations at random, so not every combination is evaluated. Therefore, a wider range of values and a longer list of hyperparameters can be assessed within a given time. The downside is that the random selection may miss the top-performing hyperparameter combinations.
- Bayesian optimization utilizes the results from the previous step to decide which hyperparameter combination to evaluate next. The major difference between Bayesian optimization and grid/random search is that grid search and random search consider each hyperparameter combination independently, while Bayesian optimization is dependent on the previous evaluation results.
Step 1: Install And Import Libraries
In the first step, let’s import the Python libraries needed for this tutorial.
For this tutorial, we will need to import datasets to get the breast cancer dataset. pandas and numpy are for data processing. StandardScaler is for standardizing the dataset. train_test_split, XGBClassifier, and precision_recall_fscore_support are for model training and performance evaluation. GridSearchCV, RandomizedSearchCV, and hyperopt are the hyperparameter tuning algorithms. StratifiedKFold and cross_val_score are for the cross-validation.
# Dataset
from sklearn import datasets
# Data processing
import pandas as pd
import numpy as np
# Standardize the data
from sklearn.preprocessing import StandardScaler
# Model and performance evaluation
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import precision_recall_fscore_support as score
# Hyperparameter tuning
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from hyperopt import tpe, STATUS_OK, Trials, hp, fmin, space_eval
Step 2: Read In Data
In the second step, the breast cancer data from the sklearn library is loaded and transformed into a pandas dataframe.
The information summary shows that the dataset has 569 records and 31 columns.
# Load the breast cancer dataset
data = datasets.load_breast_cancer()
# Put the data in pandas dataframe format
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target']=data.target
# Check the data information
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mean radius 569 non-null float64
1 mean texture 569 non-null float64
2 mean perimeter 569 non-null float64
3 mean area 569 non-null float64
4 mean smoothness 569 non-null float64
5 mean compactness 569 non-null float64
6 mean concavity 569 non-null float64
7 mean concave points 569 non-null float64
8 mean symmetry 569 non-null float64
9 mean fractal dimension 569 non-null float64
10 radius error 569 non-null float64
11 texture error 569 non-null float64
12 perimeter error 569 non-null float64
13 area error 569 non-null float64
14 smoothness error 569 non-null float64
15 compactness error 569 non-null float64
16 concavity error 569 non-null float64
17 concave points error 569 non-null float64
18 symmetry error 569 non-null float64
19 fractal dimension error 569 non-null float64
20 worst radius 569 non-null float64
21 worst texture 569 non-null float64
22 worst perimeter 569 non-null float64
23 worst area 569 non-null float64
24 worst smoothness 569 non-null float64
25 worst compactness 569 non-null float64
26 worst concavity 569 non-null float64
27 worst concave points 569 non-null float64
28 worst symmetry 569 non-null float64
29 worst fractal dimension 569 non-null float64
30 target 569 non-null int64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
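The target variable distribution shows 63% ones and 37% zeros in the dataset. In scikit-learn's encoding of this dataset, 0 means the tumor is malignant and 1 means it is benign (see data.target_names), so class 1 is the benign class. The recall values reported in this tutorial are computed for class 1.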
# Check the target value distribution
df['target'].value_counts(normalize=True)
Output
1 0.627417
0 0.372583
Name: target, dtype: float64
Step 3: Train Test Split
In step 3, we split the dataset into 80% training and 20% testing dataset. random_state makes the random split results reproducible.
# Train test split
X_train, X_test, y_train, y_test = train_test_split(df[df.columns.difference(['target'])],
df['target'],
test_size=0.2,
random_state=42)
# Check the number of records in training and testing dataset.
print(f'The training dataset has {len(X_train)} records.')
print(f'The testing dataset has {len(X_test)} records.')
The training dataset has 455 records, and the testing dataset has 114 records.
The training dataset has 455 records.
The testing dataset has 114 records.
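The 455/114 split can be checked by hand. The sketch below assumes that, when only a fractional test_size is given, the test count is rounded up and the training set receives the remainder. split_sizes is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: assumes the test count is rounded up and the
    remaining records go to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(569, 0.2)
print(n_train, n_test)  # 455 114
```

Since 20% of 569 is 113.8, the test set gets 114 records and the training set the other 455.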
Step 4: Standardization
Standardization rescales the features to the same scale. It is calculated by subtracting the mean and dividing by the standard deviation. After standardization, each feature has zero mean and unit standard deviation.
Standardization should be fit on the training dataset only to prevent test dataset information from leaking into the training process. Then, the test dataset is standardized using the fitting results from the training dataset.
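To make the fit-on-train, apply-to-test idea concrete, here is a minimal pure-Python sketch for a single feature. The helper names fit_scaler and transform and the feature values are made up for illustration, and the sketch assumes the population standard deviation is used.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]

# Made-up values for a single feature
train_feature = [10.0, 12.0, 14.0, 16.0, 18.0]
test_feature = [11.0, 20.0]

# Fit on the training data only
mean, std = fit_scaler(train_feature)
train_scaled = transform(train_feature, mean, std)
# Reuse the training statistics for the test data
test_scaled = transform(test_feature, mean, std)

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print(test_scaled)
```

The scaled training feature has zero mean and unit standard deviation by construction. The scaled test feature generally does not, which is expected because its statistics were never used.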
There are different types of scalers. StandardScaler and MinMaxScaler are most commonly used. For a dataset with outliers, we can use RobustScaler.
In this tutorial, we will use StandardScaler.
# Initiate scaler
sc = StandardScaler()
# Standardize the training dataset
X_train_transformed = pd.DataFrame(sc.fit_transform(X_train),index=X_train.index, columns=X_train.columns)
# Standardize the testing dataset
X_test_transformed = pd.DataFrame(sc.transform(X_test),index=X_test.index, columns=X_test.columns)
# Summary statistics after standardization
X_train_transformed.describe().T
We can see that after using StandardScaler, all the features have zero mean and unit standard deviation.
Let’s get the summary statistics for the training data before standardization as well, and we can see that the mean and standard deviation can be very different in scale. For example, the area error has a mean value of 40 and a standard deviation of 47. On the other hand, the compactness error has a mean of about 0.023 and a standard deviation of 0.019.
# Summary statistics before standardization
X_train.describe().T
Step 5: XGBoost Classifier With No Hyperparameter Tuning
In step 5, we will create an XGBoost classification model with default hyperparameters. This serves as a baseline model to compare against.
This is a list of the hyperparameters we can tune. Usually, a subset of essential hyperparameters will be tuned.
- base_score is the starting prediction score for all the instances at the model initiation. This number does not have much impact on the final results when there is a sufficient number of iterations. Therefore, base_score is not a good choice for hyperparameter tuning.
- booster specifies which booster to use for the model. Boosters gbtree and dart use tree-based models, and booster gblinear uses linear functions.
- colsample_bylevel is the subsample ratio of columns for each depth level from the set of columns for the current tree.
- colsample_bynode is the subsample ratio of columns for each node (split) from the set of columns for the current level.
- colsample_bytree is the subsample ratio of columns for each tree from the set of all columns in the training dataset.
- gamma is a value greater than or equal to zero. It is the minimum loss reduction required for a split.
- learning_rate is also called eta. It is a value between 0 and 1. It is the step size shrinkage for the feature weights to make the boosting process more conservative.
- max_delta_step puts an absolute regularization weight capping before applying eta correction. The default value of 0 means that there is no restriction on the maximum value of the weight. A positive number might help for a dataset with highly imbalanced classes. A value between 1 and 10 is usually used, but it can take any value greater than or equal to 0.
- max_depth is the maximum depth of a tree, and it can take the value of any integer greater than or equal to 0. 0 means no limit to the tree depth. A larger value for max_depth builds more complex models and tends to overfit.
- min_child_weight is the minimum sum of instance weight needed in a child for partitioning. It takes a value greater than or equal to 0.
- missing is the value in the input data that needs to be considered as a missing value. The default value is None, meaning that only np.nan is considered to be a missing value.
- n_estimators is the number of gradient boosted trees.
- n_jobs takes in the number of parallel threads for the model. n_jobs=-1 means using all the available cores for parallel processing.
- nthread is the number of parallel threads for running XGBoost.
- 'objective': 'binary:logistic' means that logistic regression for binary classification is used as the learning objective and the model outputs probabilities.
- random_state sets a seed for model reproducibility.
- reg_alpha provides L1 regularization to the weight. Higher values result in more conservative models. The default value of 0 means no L1 regularization.
- reg_lambda provides L2 regularization to the weight. Higher values result in more conservative models. XGBoost applies L2 regularization by default.
- scale_pos_weight controls the balance of positive and negative weights. It's useful for unbalanced classes.
- seed sets a random number seed.
- silent decides whether to print out information during model training.
- subsample is the percentage of randomly sampled training data before growing trees. It happens in every boosting iteration. It is greater than 0 and less than or equal to 1. The default value of 1 means all the data in the training dataset will be used to build trees. A value of less than 1 helps to prevent overfitting.
- verbosity controls how many messages are printed. The valid values are 0 (silent), 1 (warning), 2 (info), and 3 (debug).
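To make the effect of learning_rate more concrete, here is a toy sketch of shrinkage. It is not XGBoost's algorithm: the weak learner is just a constant fitted to the residuals under squared loss, and boost_constant_learner and the target values are made up for illustration.

```python
def boost_constant_learner(targets, learning_rate, n_rounds):
    """Toy boosting with squared loss and a constant weak learner.

    Each round fits the mean of the current residuals and adds it to
    the prediction, scaled by learning_rate.
    """
    prediction = 0.0
    for _ in range(n_rounds):
        residuals = [t - prediction for t in targets]
        step = sum(residuals) / len(residuals)  # best constant fit
        prediction += learning_rate * step      # shrink the update
    return prediction

# Made-up targets whose mean is 4.0
targets = [2.0, 4.0, 6.0]
fast = boost_constant_learner(targets, learning_rate=1.0, n_rounds=10)
slow = boost_constant_learner(targets, learning_rate=0.1, n_rounds=10)
print(fast, slow)
```

With learning_rate=1.0 the prediction reaches the target mean in one round, while learning_rate=0.1 is still well short of it after ten rounds. This is the sense in which a smaller value makes boosting more conservative and usually calls for more trees.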
# Initiate XGBoost Classifier
xgboost = XGBClassifier()
# Print default setting
xgboost.get_params()
Output
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 1,
'colsample_bynode': 1,
'colsample_bytree': 1,
'gamma': 0,
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 3,
'min_child_weight': 1,
'missing': None,
'n_estimators': 100,
'n_jobs': 1,
'nthread': None,
'objective': 'binary:logistic',
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 1,
'verbosity': 1}
When training the model, seed=0 makes sure that we get reproducible results. After running the baseline XGBoost model, we predicted the testing dataset using .predict and calculated the predicted probabilities using .predict_proba.
# Train the model
xgboost = XGBClassifier(seed=0).fit(X_train_transformed,y_train)
# Make prediction
xgboost_predict = xgboost.predict(X_test_transformed)
# Get predicted probability
xgboost_predict_prob = xgboost.predict_proba(X_test_transformed)[:,1]
We want to capture as many actual cancer patients as possible for this particular dataset, so we will use recall as the performance metric to optimize.
# Get performance metrics
precision, recall, fscore, support = score(y_test, xgboost_predict)
# Print result
print(f'The recall value for the baseline xgboost model is {recall[1]:.4f}')
The baseline XGBoost model gave us a recall of 97.18%.
The recall value for the baseline xgboost model is 0.9718
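For reference, recall for a class is the number of correctly predicted members of that class divided by the number of actual members. The sketch below computes it by hand. recall_for_class is a hypothetical helper and the labels are made up, so the numbers are unrelated to the model above.

```python
def recall_for_class(y_true, y_pred, positive_label):
    """Recall = true positives / (true positives + false negatives)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive_label and p == positive_label)
    false_neg = sum(1 for t, p in zip(y_true, y_pred)
                    if t == positive_label and p != positive_label)
    return true_pos / (true_pos + false_neg)

# Made-up labels and predictions for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

recall_1 = recall_for_class(y_true, y_pred, positive_label=1)
recall_0 = recall_for_class(y_true, y_pred, positive_label=0)
print(recall_1, recall_0)
```

In the tutorial code, recall[1] plays the role of recall_1 here: it is the recall for the class labeled 1.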
Step 6: Grid Search for XGBoost
In step 6, we will use grid search to find the best hyperparameter combinations for the XGBoost model. Grid search is an exhaustive hyperparameter search method. It trains models for every combination of specified hyperparameter values. Therefore, it can take a long time to run if we test out more hyperparameters and values.
For this reason, we would like to keep the grid search space relatively small so the process can finish in a reasonable timeframe. The search space consists of the hyperparameters and the candidate values that grid search builds models for. We use three hyperparameters for grid search in this example.
- colsample_bytree is the percentage of columns to be randomly sampled for each tree.
- reg_alpha provides L1 regularization to the weight. Higher values result in more conservative models.
- reg_lambda provides L2 regularization to the weight. Higher values result in more conservative models.
Scoring is the metric to evaluate the cross-validation results for each model. Since recall is the evaluation metric for the model, we set scoring = ['recall']. The scoring option can take more than one metric in the list.
StratifiedKFold is used for the cross-validation. It helps us keep the class ratio in the folds the same as the training dataset. n_splits=3 means we are doing 3-fold cross-validation. shuffle=True means the data are shuffled before splitting. random_state=0 makes the shuffle reproducible.
# Define the search space
param_grid = {
# Percentage of columns to be randomly sampled for each tree.
"colsample_bytree": [ 0.3, 0.5 , 0.8 ],
# reg_alpha provides l1 regularization to the weight, higher values result in more conservative models
"reg_alpha": [0, 0.5, 1, 5],
# reg_lambda provides l2 regularization to the weight, higher values result in more conservative models
"reg_lambda": [0, 0.5, 1, 5]
}
# Set up score
scoring = ['recall']
# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
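The idea behind stratified folds can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's exact algorithm: stratified_folds is a hypothetical helper that shuffles the indices of each class and deals them out to the folds in turn.

```python
import random
from collections import Counter

def stratified_folds(labels, n_splits, seed):
    """Assign each sample index to a fold, class by class.

    Simplified illustration of stratification: shuffle the indices of
    each class, then deal them out to the folds in round-robin order.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Made-up labels with a 2:1 class ratio (12 ones, 6 zeros)
labels = [1] * 12 + [0] * 6
folds = stratified_folds(labels, n_splits=3, seed=0)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Every fold ends up with four ones and two zeros, the same 2:1 ratio as the full label list, and fixing the seed makes the assignment repeatable.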
We specified a few options for GridSearchCV.
- estimator=xgboost means we are using XGBoost as the model.
- param_grid=param_grid takes our pre-defined search space for the grid search.
- scoring=scoring sets the performance evaluation metric. Because we set the scoring to 'recall', the model will use recall as the evaluation metric.
- refit='recall' enables refitting the model with the best parameters on the whole training dataset.
- n_jobs=-1 means parallel processing using all the processors.
- cv=kfold takes the StratifiedKFold we defined.
- verbose controls the number of messages returned by the grid search. The higher the number, the more information is returned. verbose=0 means silent.
The search space has 3 × 4 × 4 = 48 hyperparameter combinations. Since 3-fold cross-validation is used, fitting GridSearchCV on the training dataset trains 144 models in total, plus one final refit of the best combination on the whole training dataset.
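The combination count can be verified by enumerating the grid directly. The sketch below uses the same values as the search space above. The names example_grid, combinations, and n_cv_fits are only for this illustration.

```python
from itertools import product

# Same values as the grid search space in this step
example_grid = {
    "colsample_bytree": [0.3, 0.5, 0.8],
    "reg_alpha": [0, 0.5, 1, 5],
    "reg_lambda": [0, 0.5, 1, 5],
}
n_folds = 3

# Every combination of one value per hyperparameter
names = list(example_grid)
combinations = [dict(zip(names, values))
                for values in product(*example_grid.values())]

n_combinations = len(combinations)
n_cv_fits = n_combinations * n_folds
print(n_combinations, n_cv_fits)  # 48 144
```

Adding one more hyperparameter with four candidate values would multiply both numbers by four, which is why grid search becomes expensive quickly.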
# Define grid search
grid_search = GridSearchCV(estimator=xgboost,
param_grid=param_grid,
scoring=scoring,
refit='recall',
n_jobs=-1,
cv=kfold,
verbose=0)
# Fit grid search
grid_result = grid_search.fit(X_train_transformed, y_train)
# Print grid search summary
grid_result
# Print the best score and the corresponding hyperparameters
print(f'The best score is {grid_result.best_score_:.4f}')
print('The best score standard deviation is', round(grid_result.cv_results_['std_test_recall'][grid_result.best_index_], 4))
print(f'The best hyperparameters are {grid_result.best_params_}')
Output
The best score is 0.9895
The best score standard deviation is 0.0086
The best hyperparameters are {'colsample_bytree': 0.8, 'reg_alpha': 0.5, 'reg_lambda': 0}
The grid search cross-validation results show that sampling 80% of the features for each tree, using L1 regularization with a 0.5 penalty coefficient, and applying no L2 regularization gave us the best results. The best cross-validated recall is 98.95%, and the standard deviation of the score across the folds is 0.86%.
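The best score and its standard deviation summarize the per-fold test scores of the winning combination. A minimal sketch of that summary, assuming the standard deviation is the population one; the three fold scores are hypothetical and are not taken from the run above.

```python
from statistics import fmean, pstdev

# Hypothetical per-fold recall values for one hyperparameter combination
fold_recalls = [0.98, 1.00, 0.99]

# Mean across the folds, the quantity reported as the best score
mean_recall = fmean(fold_recalls)
# Spread across the folds, assuming a population standard deviation
std_recall = pstdev(fold_recalls)

print(round(mean_recall, 4), round(std_recall, 4))
```

A small standard deviation relative to the differences between candidate combinations suggests the ranking is not just fold-to-fold noise.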
# Make prediction using the best model
grid_predict = grid_search.predict(X_test_transformed)
# Get predicted probabilities
grid_predict_prob = grid_search.predict_proba(X_test_transformed)[:,1]
# Get performance metrics
precision, recall, fscore, support = score(y_test, grid_predict)
# Print result
print(f'The recall value for the xgboost grid search is {recall[1]:.4f}')
We can see that the grid search recall value is the same as the baseline XGBoost model at 97.18%.
The recall value for the xgboost grid search is 0.9718
Step 7: Random Search for XGBoost
In step 7, we are using a random search for XGBoost hyperparameter tuning. Since random search randomly picks a fixed number of hyperparameter combinations, we can afford to try more hyperparameters and more values. Therefore, we added three more parameters to the search space.
- learning_rate shrinks the weights to make the boosting process more conservative.
- max_depth is the maximum depth of the tree. Increasing it increases the model complexity.
- gamma specifies the minimum loss reduction required to do a split.
If at least one of the parameters is a distribution, sampling with replacement is used for a random search. If all parameters are provided as a list, sampling without replacement is used. Each list is treated as a uniform distribution.
# Define the search space
param_grid = {
# Learning rate shrinks the weights to make the boosting process more conservative
"learning_rate": [0.0001,0.001, 0.01, 0.1, 1] ,
# Maximum depth of the tree, increasing it increases the model complexity.
"max_depth": range(3,21,3),
# Gamma specifies the minimum loss reduction required to make a split.
"gamma": [i/10.0 for i in range(0,5)],
# Percentage of columns to be randomly sampled for each tree.
"colsample_bytree": [i/10.0 for i in range(3,10)],
# reg_alpha provides l1 regularization to the weight, higher values result in more conservative models
"reg_alpha": [1e-5, 1e-2, 0.1, 1, 10, 100],
# reg_lambda provides l2 regularization to the weight, higher values result in more conservative models
"reg_lambda": [1e-5, 1e-2, 0.1, 1, 10, 100]}
# Set up score
scoring = ['recall']
# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
The same scoring metric and cross-validation values used in grid search are used for the random search. But for a random search, we need to specify a value for n_iter, the number of parameter combinations sampled. So we are randomly testing 48 combinations for this example.
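To get a feel for how small a sample 48 combinations is, the sketch below enumerates the list-only search space from this step and draws 48 distinct combinations from it. It imitates the sampling-without-replacement behavior described above with the standard library only. It is not how RandomizedSearchCV is implemented, and the names search_space, all_combinations, and sampled are made up for this illustration.

```python
import random
from itertools import product

# Same values as the random search space in this step
search_space = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
    "max_depth": list(range(3, 21, 3)),
    "gamma": [i / 10.0 for i in range(0, 5)],
    "colsample_bytree": [i / 10.0 for i in range(3, 10)],
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 10, 100],
    "reg_lambda": [1e-5, 1e-2, 0.1, 1, 10, 100],
}

names = list(search_space)
all_combinations = list(product(*search_space.values()))
print(len(all_combinations))  # 37800 = 5 * 6 * 5 * 7 * 6 * 6

# Draw 48 distinct combinations, i.e. sampling without replacement
rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled = [dict(zip(names, values))
           for values in rng.sample(all_combinations, k=48)]
print(len(sampled), f"{48 / len(all_combinations):.2%} of the space")
```

The 48 sampled combinations cover roughly 0.13% of the full space, where an exhaustive grid search would need 37,800 × 3 model fits. RandomizedSearchCV also accepts a random_state argument, which serves the same purpose as the fixed seed here.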
# Define random search
random_search = RandomizedSearchCV(estimator=xgboost,
param_distributions=param_grid,
n_iter=48,
scoring=scoring,
refit='recall',
n_jobs=-1,
cv=kfold,
verbose=0)
# Fit random search
random_result = random_search.fit(X_train_transformed, y_train)
# Print random search summary
random_result
# Print the best score and the corresponding hyperparameters
print(f'The best score is {random_result.best_score_:.4f}')
print('The best score standard deviation is', round(random_result.cv_results_['std_test_recall'][random_result.best_index_], 4))
print(f'The best hyperparameters are {random_result.best_params_}')
Output
The best score is 0.9895
The best score standard deviation is 0.0086
The best hyperparameters are {'reg_lambda': 0.1, 'reg_alpha': 0.01, 'max_depth': 15, 'learning_rate': 0.1, 'gamma': 0.1, 'colsample_bytree': 0.5}
After finishing the random search cross-validation, we printed out the best score, standard deviation, and the best parameters. Although the best parameters are different from the grid search, the best cross-validation score and its standard deviation are the same to four decimal places.
# Make prediction using the best model
random_predict = random_search.predict(X_test_transformed)
# Get predicted probabilities
random_predict_prob = random_search.predict_proba(X_test_transformed)[:,1]
# Get performance metrics
precision, recall, fscore, support = score(y_test, random_predict)
# Print result
print(f'The recall value for the xgboost random search is {recall[1]:.4f}')
With random search, the recall value on the test dataset increased from 97.18% to 98.59%.
The recall value for the xgboost random search is 0.9859
Step 8: Bayesian Optimization For XGBoost
In step 8, we will apply Hyperopt Bayesian optimization to XGBoost hyperparameter tuning. According to the documentation on the Hyperopt GitHub page, there are four key elements for Hyperopt:
- the space over which to search
- the objective function to minimize
- the database in which to store all the point evaluations of the search
- the search algorithm to use
For the search space, the same space as the random search is used for the Hyperopt Bayesian optimization.
# Space
space = {
'learning_rate': hp.choice('learning_rate', [0.0001,0.001, 0.01, 0.1, 1]),
'max_depth' : hp.choice('max_depth', range(3,21,3)),
'gamma' : hp.choice('gamma', [i/10.0 for i in range(0,5)]),
'colsample_bytree' : hp.choice('colsample_bytree', [i/10.0 for i in range(3,10)]),
'reg_alpha' : hp.choice('reg_alpha', [1e-5, 1e-2, 0.1, 1, 10, 100]),
'reg_lambda' : hp.choice('reg_lambda', [1e-5, 1e-2, 0.1, 1, 10, 100])
}
StratifiedKFold is used to split the training dataset into k folds and keep the ratio between the classes in each fold the same as the training dataset. It is used for the cross-validation.
- n_splits=3 means that the training dataset is split into 3 folds. This is because our dataset is small. For a larger dataset, usually 5 or 10 folds are used.
- shuffle=True means that the dataset will be shuffled before splitting into folds. Note that the samples within each split will not be shuffled.
- random_state=0 makes the split reproducible.
# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
Then an objective function is defined.
- XGBClassifier is used as the model algorithm. seed=0 makes the model results reproducible. **params takes in the hyperparameter values.
- cross_val_score produces k scores, one for each of the k folds. We get the mean of the k scores and output the average value.
- estimator takes the estimator to fit the data.
- X takes the training dataset feature matrix and y takes the target variable for the training dataset.
- cv determines the cross-validation splitting strategy. We set cv=kfold, which is the output from the StratifiedKFold.
- scoring='recall' means that recall is the key metric for the model.
- n_jobs=-1 enables parallel model training.
- Next, loss is defined. Because the model's goal is to maximize recall, it is the same as minimizing negative recall, so we set loss = - score.
- The function returns a dictionary with loss, params, and status.
# Objective function
def objective(params):
    xgboost = XGBClassifier(seed=0, **params)
    score = cross_val_score(estimator=xgboost,
                            X=X_train_transformed,
                            y=y_train,
                            cv=kfold,
                            scoring='recall',
                            n_jobs=-1).mean()
    # Loss is negative score
    loss = - score
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK}
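The reason for negating the score is that the optimizer only knows how to minimize. A tiny sketch with hypothetical recall values shows that picking the smallest loss selects the same candidate as picking the largest recall. The candidate names and numbers are made up.

```python
# Hypothetical mean cross-validated recall for three candidate settings
cv_recall = {"candidate_a": 0.9719, "candidate_b": 0.9860, "candidate_c": 0.9754}

# A minimizer sees the negated recall as its loss
losses = {name: -value for name, value in cv_recall.items()}

best_by_loss = min(losses, key=losses.get)
best_by_recall = max(cv_recall, key=cv_recall.get)
print(best_by_loss, best_by_recall, losses[best_by_loss])
```

This is also why the best loss reported by the optimizer is a negative number: its absolute value is the recall.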
fmin is the function to search the best hyperparameters with the smallest loss value.
- fn takes in the objective function.
- space is for the search space of the hyperparameters.
- algo is for the type of search algorithms. Hyperopt currently has three algorithms: random search, Tree of Parzen Estimators (TPE), and adaptive TPE. We are using TPE as the search algorithm.
- max_evals specifies the maximum number of evaluations.
- trials stores the information for the evaluations.
# Optimize
best = fmin(fn = objective, space = space, algo = tpe.suggest, max_evals = 48, trials = Trials())
Output:
100%|██████████| 48/48 [00:11<00:00, 4.23it/s, best loss: -0.9859649122807017]
After the Bayesian optimization search, we get the best loss of -0.9860, meaning that the best cross-validated recall is about 98.6%.
Because the search space is defined with hp.choice, print(best) shows the index of each selected option rather than its value. To get the values of the best parameters, we can use space_eval and pass in the search space and best.
# Print the index of the best parameters
print(best)
# Print the values of the best parameters
print(space_eval(space, best))
Output
{'colsample_bytree': 1, 'gamma': 4, 'learning_rate': 0, 'max_depth': 5, 'reg_alpha': 0, 'reg_lambda': 1}
{'colsample_bytree': 0.4, 'gamma': 0.4, 'learning_rate': 0.0001, 'max_depth': 18, 'reg_alpha': 1e-05, 'reg_lambda': 0.01}
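The relationship between the two output lines can be reproduced by hand: each number in the first dictionary is a position in the list of options given to hp.choice. The sketch below does that lookup without Hyperopt. The options dictionary repeats the lists from the search space, and decoded is a name made up for this illustration.

```python
# Options in the same order as they were passed to hp.choice
options = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
    "max_depth": list(range(3, 21, 3)),
    "gamma": [i / 10.0 for i in range(0, 5)],
    "colsample_bytree": [i / 10.0 for i in range(3, 10)],
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 10, 100],
    "reg_lambda": [1e-5, 1e-2, 0.1, 1, 10, 100],
}

# Index dictionary copied from the first output line above
best_indices = {"colsample_bytree": 1, "gamma": 4, "learning_rate": 0,
                "max_depth": 5, "reg_alpha": 0, "reg_lambda": 1}

# Look up each index in the corresponding list of options
decoded = {name: options[name][index] for name, index in best_indices.items()}
print(decoded)
```

The decoded values match the second output line, for example index 5 of max_depth is 18. For search spaces that mix hp.choice with other distributions, space_eval remains the safer way to do this conversion.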
Next, we apply the best hyperparameters to the XGBClassifier and make predictions.
# Look up the values of the best parameters once
best_params = space_eval(space, best)
# Train model using the best parameters
xgboost_bo = XGBClassifier(seed=0, **best_params).fit(X_train_transformed, y_train)
# Make prediction using the best model
bayesian_opt_predict = xgboost_bo.predict(X_test_transformed)
# Get predicted probabilities
bayesian_opt_predict_prob = xgboost_bo.predict_proba(X_test_transformed)[:,1]
# Get performance metrics
precision, recall, fscore, support = score(y_test, bayesian_opt_predict)
# Print result
print(f'The recall value for the xgboost Bayesian optimization is {recall[1]:.4f}')
Output:
The recall value for the xgboost Bayesian optimization is 0.9859
The recall value on the test dataset is 98.59%, the same as the random search result.
Summary
In this tutorial, we covered how to tune XGBoost hyperparameters using Python. You learned
- What are the differences between grid search, random search, and Bayesian optimization?
- How to use grid search cross-validation to tune the hyperparameters for the XGBoost model?
- How to use random search cross-validation to tune the hyperparameters for the XGBoost model?
- How to use Bayesian optimization to tune the hyperparameters for the XGBoost model?
- How to compare the results from grid search, random search, and Bayesian optimization?
In practice, random search and Bayesian optimization often find better hyperparameters than grid search within the same number of model fits because they can cover more hyperparameters and wider ranges of values.
More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.