Amy @GrabNGoInfo

Summary

This article offers a comprehensive guide to hyperparameter tuning for XGBoost models using grid search, random search, and Bayesian optimization techniques in Python.

Abstract

The article delves into the intricacies of hyperparameter tuning for XGBoost models, comparing three prevalent methods: grid search, random search, and Bayesian optimization. It begins by introducing the XGBoost algorithm and the importance of hyperparameter tuning for optimizing model performance. The author then systematically explains each method, starting with grid search, which systematically evaluates every combination of hyperparameters. Next, the article covers random search, which selects hyperparameter combinations randomly, allowing for a broader search space. Finally, it explores Bayesian optimization, a more sophisticated approach that uses results from previous evaluations to select the next hyperparameter combination. The tutorial includes practical Python code snippets, data standardization techniques, and a step-by-step application of these methods to a breast cancer dataset. The performance of each tuning method is assessed using recall as the key metric, with the results indicating that random search and Bayesian optimization can yield superior performance compared to grid search.

Opinions

  • The author suggests that grid search, while thorough, may be impractical for large hyperparameter spaces due to its exhaustive nature.
  • Random search is presented as a more efficient alternative to grid search, particularly for datasets with a large number of hyperparameter combinations.
  • Bayesian optimization is highlighted as a powerful method for hyperparameter tuning, with the potential to outperform both grid search and random search by intelligently selecting hyperparameter combinations based on prior results.
  • The article emphasizes the importance of reproducibility in model training by setting random seeds and using stratified cross-validation to maintain class distribution.
  • The author's preference for recall as the evaluation metric indicates a focus on correctly identifying positive cases, which is crucial in imbalanced datasets such as medical diagnostics.
  • The use of Python libraries like sklearn, pandas, numpy, xgboost, and hyperopt is endorsed for their utility in implementing the discussed hyperparameter tuning techniques.
  • The tutorial encourages the use of additional resources, such as the author's YouTube channel and website, for further learning on machine learning topics.

Hyperparameter Tuning For XGBoost

Grid Search Vs Random Search Vs Bayesian Optimization (Hyperopt)

Photo by Ed van duijn on Unsplash

Grid search, random search, and Bayesian optimization are techniques for machine learning model hyperparameter tuning. This tutorial covers how to tune XGBoost hyperparameters using Python. You will learn:

  • What are the differences between grid search, random search, and Bayesian optimization?
  • How to use grid search cross-validation to tune the hyperparameters for the XGBoost model?
  • How to use random search cross-validation to tune the hyperparameters for the XGBoost model?
  • How to use Bayesian optimization Hyperopt to tune the hyperparameters for the XGBoost model?
  • How to compare the results from grid search, random search, and Bayesian optimization Hyperopt?

Let’s get started!

Step 0: Grid Search Vs. Random Search Vs. Bayesian Optimization

Grid search, random search, and Bayesian optimization share the same goal of choosing the best hyperparameters for a machine learning model, but they differ in algorithm and implementation. Understanding these differences is essential for deciding which algorithm to use; the short sketch after the list below illustrates the difference in scale.

  • Grid search is an exhaustive way to search hyperparameters. It evaluates every combination of hyperparameters for the model. Therefore, it can take a long time to run when there are a lot of hyperparameter combinations to compare.
  • Random search picks a fixed number of hyperparameter combinations at random, so not every combination is evaluated. Therefore, a wider range of values and a longer list of hyperparameters can be assessed within a given time. The downside is that the random selection may miss the top-performing hyperparameter combinations.
  • Bayesian optimization uses the results from previous evaluations to decide which hyperparameter combination to evaluate next. The major difference is that grid search and random search evaluate each hyperparameter combination independently, while Bayesian optimization depends on the previous evaluation results.
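To see the difference in scale, here is a small sketch with a hypothetical search space. The hyperparameter values below are made up for illustration and are not the ones tuned later in this tutorial.

# Hypothetical search space for illustration only
search_space = {
    'max_depth': [3, 6, 9, 12],          # 4 values
    'learning_rate': [0.01, 0.1, 0.3],   # 3 values
    'subsample': [0.6, 0.8, 1.0],        # 3 values
}
# Grid search trains one model per combination (times the number of CV folds)
n_grid_models = 1
for values in search_space.values():
    n_grid_models *= len(values)
print(n_grid_models)  # 4 * 3 * 3 = 36 combinations
# Random search only evaluates a fixed budget of combinations, e.g. n_iter=10.
# Bayesian optimization also uses a fixed budget (max_evals), but chooses each new
# combination based on the results of the previous evaluations instead of sampling at random.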

Step 1: Install And Import Libraries

In the first step, let’s import the Python libraries needed for this tutorial.

For this tutorial, we will need to import datasets to get the breast cancer dataset. pandas and numpy are for data processing. StandardScaler is for standardizing the dataset.

train_test_split, XGBClassifier and precision_recall_fscore_support are for model training and performance evaluation.

GridSearchCV, RandomizedSearchCV, and hyperopt are the hyperparameter tuning algorithms. StratifiedKFold and cross_val_score are for the cross-validation.

# Dataset
from sklearn import datasets
# Data processing
import pandas as pd
import numpy as np
# Standardize the data
from sklearn.preprocessing import StandardScaler
# Model and performance evaluation
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import precision_recall_fscore_support as score
# Hyperparameter tuning
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from hyperopt import tpe, STATUS_OK, Trials, hp, fmin, space_eval

Step 2: Read In Data

In the second step, the breast cancer data from sklearn library is loaded and transformed into a pandas dataframe.

The information summary shows that the dataset has 569 records and 31 columns.

# Load the breast cancer dataset
data = datasets.load_breast_cancer()
# Put the data in pandas dataframe format
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target']=data.target
# Check the data information
df.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
 30  target                   569 non-null    int64  
dtypes: float64(30), int64(1)
memory usage: 137.9 KB

The target variable distribution shows 63% ones and 37% zeros in the dataset. In the sklearn breast cancer dataset, 1 represents a benign tumor and 0 represents a malignant tumor.

# Check the target value distribution
df['target'].value_counts(normalize=True)

Output

1    0.627417
0    0.372583
Name: target, dtype: float64

Step 3: Train Test Split

In step 3, we split the dataset into 80% training and 20% testing dataset. random_state makes the random split results reproducible.

# Train test split
X_train, X_test, y_train, y_test = train_test_split(df[df.columns.difference(['target'])], 
                                                    df['target'], 
                                                    test_size=0.2, 
                                                    random_state=42)
# Check the number of records in training and testing dataset.
print(f'The training dataset has {len(X_train)} records.')
print(f'The testing dataset has {len(X_test)} records.')

The training dataset has 455 records, and the testing dataset has 114 records.

The training dataset has 455 records.
The testing dataset has 114 records.

Step 4: Standardization

Standardization rescales the features to the same scale. It is calculated by subtracting the mean and dividing by the standard deviation. After standardization, each feature has zero mean and unit standard deviation.

Standardization should be fit on the training dataset only to prevent test dataset information from leaking into the training process. Then, the test dataset is standardized using the fitting results from the training dataset.

There are different types of scalers. StandardScaler and MinMaxScaler are the most commonly used. For a dataset with outliers, we can use RobustScaler.
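For reference, swapping in a different scaler only changes the class that is instantiated; the fit-and-transform pattern stays the same. A minimal sketch with RobustScaler (not used in the rest of this tutorial):

# RobustScaler is less sensitive to outliers because it uses the median and the interquartile range
from sklearn.preprocessing import RobustScaler
rs = RobustScaler()
X_train_robust = pd.DataFrame(rs.fit_transform(X_train), index=X_train.index, columns=X_train.columns)
X_test_robust = pd.DataFrame(rs.transform(X_test), index=X_test.index, columns=X_test.columns)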

In this tutorial, we will use StandardScaler.

# Initiate scaler
sc = StandardScaler()
# Standardize the training dataset
X_train_transformed = pd.DataFrame(sc.fit_transform(X_train),index=X_train.index, columns=X_train.columns)
# Standardize the testing dataset
X_test_transformed = pd.DataFrame(sc.transform(X_test),index=X_test.index, columns=X_test.columns)
# Summary statistics after standardization
X_train_transformed.describe().T

We can see that after using StandardScaler, all the features have zero mean and unit standard deviation.

XGBoost Model Data Standardization — Image from GrabNGoInfo.com

Let’s get the summary statistics for the training data before standardization as well, and we can see that the mean and standard deviation can be very different in scale. For example, the area error has a mean value of 40 and a standard deviation of 47. On the other hand, the compactness error has a mean of about 0.023 and a standard deviation of 0.019.

# Summary statistics before standardization
X_train.describe().T
XGBoost Model Data before Standardization — Image from GrabNGoInfo.com

Step 5: XGBoost Classifier With No Hyperparameter Tuning

In step 5, we will create an XGBoost classification model with default hyperparameters. This serves as a baseline model to compare against.

This is a list of the hyperparameters we can tune. Usually, only a subset of the essential hyperparameters is tuned. A short example of setting a few of them explicitly follows the list.

  • base_score is the starting prediction score for all the instances at the model initiation. This number does not have much impact on the final results when there is a sufficient number of iterations. Therefore, base_score is not a good choice for hyperparameter tuning.
  • booster specifies which booster to use for the model. Booster gbtree and dart use tree-based models, and booster gblinear uses linear functions.
  • colsample_bylevel is the subsample ratio of columns for each depth level from the set of columns for the current tree.
  • colsample_bynode is the subsample ratio of columns for each node (split) from the set of columns for the current level.
  • colsample_bytree is the subsample ratio of columns for each tree from the set of all columns in the training dataset.
  • gamma is a value greater than or equal to zero. It is the minimum loss reduction required for a split.
  • learning_rate is also called eta. It is a value between 0 and 1. It is the step size shrinkage for the feature weights to make the boosting process more conservative.
  • max_delta_step caps the maximum delta step allowed for each tree's weight estimates. The default value of 0 means that there is no restriction on the maximum value of the weight. A positive number might help for datasets with highly imbalanced classes. A value between 1 and 10 is usually used, but it can take any value greater than or equal to 0.
  • max_depth is the maximum depth of a tree; it can take any integer greater than or equal to 0, where 0 means no limit on the tree depth. A larger max_depth builds more complex models and tends to overfit.
  • min_child_weight is the minimum sum of instance weight needed in a child for partitioning. It takes a value greater than or equal to 0.
  • missing is the value in the input data that should be treated as missing. The default value is None, meaning that only np.nan is treated as missing.
  • n_estimators is the number of gradient boosted trees.
  • n_jobs takes in the number of parallel threads for the model. n_jobs=-1 means using all the available cores for parallel processing.
  • nthread is the number of parallel threads for running XGBoost.
  • 'objective': 'binary:logistic' means that logistic regression for binary classification is used as the learning objective, and the model outputs probabilities.
  • random_state sets a seed for model reproducibility.
  • reg_alpha provides L1 regularization to the weight. Higher values result in more conservative models. The default value of 0 means no L1 regularization.
  • reg_lambda provides L2 regularization to the weight. Higher values result in more conservative models. XGBoost applies L2 regularization by default.
  • scale_pos_weight controls the balance of positive and negative weights. It's useful for unbalanced classes.
  • seed sets a random number seed.
  • silent decides whether to print out information during model training.
  • subsample is the percentage of randomly sampled training data before growing trees. It happens in every boosting iteration. It is greater than 0 and less than or equal to 1. The default value of 1 means all the data in the training dataset will be used to build trees. A value of less than 1 helps to prevent overfitting.
  • verbosity controls how many messages are printed. The valid values are 0 (silent), 1 (warning), 2 (info), and 3 (debug).
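For illustration, a few of the hyperparameters above could be set explicitly when constructing the classifier. The values in this sketch are arbitrary examples, not tuned ones:

# Example: construct an XGBoost classifier with a few hyperparameters set explicitly
xgboost_example = XGBClassifier(
    max_depth=4,            # maximum depth of each tree
    learning_rate=0.1,      # step size shrinkage
    n_estimators=200,       # number of boosted trees
    subsample=0.8,          # row subsampling per boosting iteration
    colsample_bytree=0.8,   # column subsampling per tree
    random_state=0)         # reproducibility
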
# Initiate XGBoost Classifier
xgboost = XGBClassifier()
# Print default setting
xgboost.get_params()

Output

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}

When training the model, seed=0 makes sure that we get reproducible results. After fitting the baseline XGBoost model, we predict on the testing dataset using .predict and get the predicted probabilities using .predict_proba.

# Train the model
xgboost = XGBClassifier(seed=0).fit(X_train_transformed,y_train)
# Make prediction
xgboost_predict = xgboost.predict(X_test_transformed)
# Get predicted probability
xgboost_predict_prob = xgboost.predict_proba(X_test_transformed)[:,1]

We want to capture as many positive cases (label 1) as possible for this dataset, so we will use recall as the performance metric to optimize.

# Get performance metrics
precision, recall, fscore, support = score(y_test, xgboost_predict)
# Print result
print(f'The recall value for the baseline xgboost model is {recall[1]:.4f}')

The baseline XGBoost model gave us a recall of 97.18%.

The recall value for the baseline xgboost model is 0.9718

Step 6: Grid Search for XGBoost

In step 6, we will use grid search to find the best hyperparameter combinations for the XGBoost model. Grid search is an exhaustive hyperparameter search method. It trains models for every combination of specified hyperparameter values. Therefore, it can take a long time to run if we test out more hyperparameters and values.

For this reason, we keep the grid search space relatively small so the process can finish in a reasonable timeframe. The search space specifies the hyperparameters and the values for which grid search builds models. We use three hyperparameters for grid search in this example.

  • colsample_bytree is the percentage of columns to be randomly sampled for each tree.
  • reg_alpha provides l1 regularization to the weight. Higher values result in more conservative models.
  • reg_lambda provides l2 regularization to the weight. Higher values result in more conservative models.

Scoring is the metric to evaluate the cross-validation results for each model. Since recall is the evaluation metric for the model, we set scoring = ['recall']. The scoring option can take more than one metric in the list.
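For example, precision could be tracked alongside recall. When more than one metric is listed, refit must name the single metric used to pick the best model. A small sketch (not used below):

# Example of evaluating more than one metric during cross-validation
scoring_multi = ['recall', 'precision']
# With multiple metrics, GridSearchCV still needs refit='recall' (or another single
# metric name) to decide which metric selects the best hyperparameter combination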

StratifiedKFold is used for the cross-validation. It helps us keep the class ratio in the folds the same as the training dataset. n_splits=3 means we are doing 3-fold cross-validation. shuffle=True means the data are shuffled before splitting. random_state=0 makes the shuffle reproducible.

# Define the search space
param_grid = { 
    # Percentage of columns to be randomly sampled for each tree.
    "colsample_bytree": [ 0.3, 0.5 , 0.8 ],
    # reg_alpha provides l1 regularization to the weight, higher values result in more conservative models
    "reg_alpha": [0, 0.5, 1, 5],
    # reg_lambda provides l2 regularization to the weight, higher values result in more conservative models
    "reg_lambda": [0, 0.5, 1, 5]
    }
# Set up score
scoring = ['recall']
# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

We specified a few options for GridSearchCV.

  • estimator=xgboost means we are using XGBoost as the model.
  • param_grid=param_grid takes our pre-defined search space for the grid search.
  • scoring=scoring sets the performance evaluation metric. Because we set scoring to ['recall'], the model will use recall as the evaluation metric.
  • refit='recall' refits the model on the whole training dataset using the hyperparameters that give the best recall.
  • n_jobs=-1 means parallel processing using all the processors.
  • cv=kfold takes the StratifiedKFold we defined.
  • verbose controls the number of messages returned by the grid search. The higher the number, the more information is returned. verbose=0 means silent.

The grid defines 3 × 4 × 4 = 48 hyperparameter combinations. Since 3-fold cross-validation is used, 144 models are trained in total when GridSearchCV is fit on the training dataset.
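The combination count can be verified directly from the search space with an optional quick check (ParameterGrid is part of sklearn.model_selection):

# Optional check: count the hyperparameter combinations defined by param_grid
from sklearn.model_selection import ParameterGrid
print(len(ParameterGrid(param_grid)))  # 48 combinations; 48 * 3 folds = 144 models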

# Define grid search
grid_search = GridSearchCV(estimator=xgboost, 
                           param_grid=param_grid, 
                           scoring=scoring, 
                           refit='recall', 
                           n_jobs=-1, 
                           cv=kfold, 
                           verbose=0)
# Fit grid search
grid_result = grid_search.fit(X_train_transformed, y_train)
# Print grid search summary
grid_result
# Print the best score and the corresponding hyperparameters
print(f'The best score is {grid_result.best_score_:.4f}')
print('The best score standard deviation is', round(grid_result.cv_results_['std_test_recall'][grid_result.best_index_], 4))
print(f'The best hyperparameters are {grid_result.best_params_}')

Output

The best score is 0.9895
The best score standard deviation is 0.0086
The best hyperparameters are {'colsample_bytree': 0.8, 'reg_alpha': 0.5, 'reg_lambda': 0}

The grid search cross-validation results show that sampling 80% of the columns per tree, using L1 regularization with a penalty coefficient of 0.5, and using no L2 regularization gave the best results. The best recall is 98.95%, and the standard deviation of the score is 0.86%.

# Make prediction using the best model
grid_predict = grid_search.predict(X_test_transformed)
# Get predicted probabilities
grid_predict_prob = grid_search.predict_proba(X_test_transformed)[:,1]
# Get performance metrics
precision, recall, fscore, support = score(y_test, grid_predict)
# Print result
print(f'The recall value for the xgboost grid search is {recall[1]:.4f}')

We can see that the grid search recall value is the same as the baseline XGBoost model at 97.18%.

The recall value for the xgboost grid search is 0.9718

Step 7: Random Search for XGBoost

In step 7, we are using a random search for XGBoost hyperparameter tuning. Since random search randomly picks a fixed number of hyperparameter combinations, we can afford to try more hyperparameters and more values. Therefore, we added three more parameters to the search space.

  • learning_rate shrinks the weights to make the boosting process more conservative.
  • max_depth is the maximum depth of the tree. Increasing it increases the model complexity.
  • gamma specifies the minimum loss reduction required to do a split.

If at least one of the parameters is a distribution, sampling with replacement is used for a random search. If all parameters are provided as a list, sampling without replacement is used. Each list is treated as a uniform distribution.
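For reference, RandomizedSearchCV also accepts scipy.stats distributions in place of fixed lists, which lets it sample values that were not enumerated in advance. A minimal sketch (the tutorial keeps the list-based space below):

# Example of a partly continuous search space for RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions_example = {
    'learning_rate': uniform(loc=0.01, scale=0.99),   # uniform over [0.01, 1.0]
    'max_depth': randint(3, 21),                       # integers from 3 to 20
    'colsample_bytree': uniform(loc=0.3, scale=0.7)    # uniform over [0.3, 1.0]
}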

# Define the search space
param_grid = { 
    # Learning rate shrinks the weights to make the boosting process more conservative
    "learning_rate": [0.0001,0.001, 0.01, 0.1, 1] ,
    # Maximum depth of the tree, increasing it increases the model complexity.
    "max_depth": range(3,21,3),
    # Gamma specifies the minimum loss reduction required to make a split.
    "gamma": [i/10.0 for i in range(0,5)],
    # Percentage of columns to be randomly sampled for each tree.
    "colsample_bytree": [i/10.0 for i in range(3,10)],
    # reg_alpha provides l1 regularization to the weight, higher values result in more conservative models
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 10, 100],
    # reg_lambda provides l2 regularization to the weight, higher values result in more conservative models
    "reg_lambda": [1e-5, 1e-2, 0.1, 1, 10, 100]}
# Set up score
scoring = ['recall']
# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

The same scoring metric and cross-validation setup used in grid search are used for the random search. But for a random search, we also need to specify n_iter, the number of parameter combinations to sample. Here we randomly test 48 combinations, matching the grid search budget.

# Define random search
random_search = RandomizedSearchCV(estimator=xgboost, 
                           param_distributions=param_grid, 
                           n_iter=48,
                           scoring=scoring, 
                           refit='recall', 
                           n_jobs=-1, 
                           cv=kfold, 
                           verbose=0)
# Fit random search
random_result = random_search.fit(X_train_transformed, y_train)
# Print random search summary
random_result
# Print the best score and the corresponding hyperparameters
print(f'The best score is {random_result.best_score_:.4f}')
print('The best score standard deviation is', round(random_result.cv_results_['std_test_recall'][random_result.best_index_], 4))
print(f'The best hyperparameters are {random_result.best_params_}')

Output

The best score is 0.9895
The best score standard deviation is 0.0086
The best hyperparameters are {'reg_lambda': 0.1, 'reg_alpha': 0.01, 'max_depth': 15, 'learning_rate': 0.1, 'gamma': 0.1, 'colsample_bytree': 0.5}

After finishing the random search cross-validation, we printed out the best score, standard deviation, and the best parameters. Although the best parameters are different from the grid search, the best score and standard deviation for the cross-validation are very close.

# Make prediction using the best model
random_predict = random_search.predict(X_test_transformed)
# Get predicted probabilities
random_predict_prob = random_search.predict_proba(X_test_transformed)[:,1]
# Get performance metrics
precision, recall, fscore, support = score(y_test, random_predict)
# Print result
print(f'The recall value for the xgboost random search is {recall[1]:.4f}')

The random search recall value on the test dataset increased from 97.18% to 98.59%.

The recall value for the xgboost random search is 0.9859

Step 8: Bayesian Optimization For XGBoost

In step 8, we will apply Hyperopt Bayesian optimization to XGBoost hyperparameter tuning. According to the documentation on the Hyperopt GitHub page, there are four key elements for Hyperopt:

  • the space over which to search
  • the objective function to minimize
  • the database in which to store all the point evaluations of the search
  • the search algorithm to use

For the search space, the same space as the random search is used for the Hyperopt Bayesian optimization.

# Space
space = {
    'learning_rate': hp.choice('learning_rate', [0.0001,0.001, 0.01, 0.1, 1]),
    'max_depth' : hp.choice('max_depth', range(3,21,3)),
    'gamma' : hp.choice('gamma', [i/10.0 for i in range(0,5)]),
    'colsample_bytree' : hp.choice('colsample_bytree', [i/10.0 for i in range(3,10)]),     
    'reg_alpha' : hp.choice('reg_alpha', [1e-5, 1e-2, 0.1, 1, 10, 100]), 
    'reg_lambda' : hp.choice('reg_lambda', [1e-5, 1e-2, 0.1, 1, 10, 100])
}
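Note that hp.choice samples from a fixed list and reports the index of the chosen value, which is why the raw result printed later contains indices rather than values. Hyperopt also provides continuous expressions such as hp.uniform, hp.loguniform, and hp.quniform. A small sketch of an alternative space (not used below):

# Alternative space using continuous Hyperopt expressions (for illustration only)
space_alternative = {
    'learning_rate': hp.loguniform('learning_rate', np.log(0.0001), np.log(1)),  # log-uniform between 0.0001 and 1
    'gamma': hp.uniform('gamma', 0, 0.5),              # uniform between 0 and 0.5
    'max_depth': hp.quniform('max_depth', 3, 21, 3)    # multiples of 3; returns floats, so cast to int in the objective
}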

StratifiedKFold is used to split the training dataset into k folds and keep the ratio between the classes in each fold the same as the training dataset. It is used for the cross-validation.

  • n_splits=3 means that the training dataset is split into 3 folds. This is because our dataset is small. For a larger dataset, usually 5 or 10 folds are used.
  • shuffle=True means that the dataset will be shuffled before splitting into folds. Note that the samples within each split will not be shuffled.
  • random_state=0 makes the split reproducible.
# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

Then an objective function is defined.

  • XGBClassifier is used as the model algorithm. seed=0 makes the model results reproducible. **params takes in the hyperparameter values.
  • cross_val_score produces k scores, one for each of the k folds. We get the mean of the k scores and output the average value.
  • estimator takes the estimator to fit the data.
  • X takes the training dataset feature matrix and y takes the target variable for the training dataset.
  • cv determines the cross-validation splitting strategy. We set cv=kfold, which is the output from the StratifiedKFold.
  • scoring='recall' means that recall is the key metric for the model.
  • n_jobs=-1 enables parallel model training.
  • Next, loss is defined. Because the goal is to maximize recall, which is the same as minimizing negative recall, we set loss = - score.
  • The function returns a dictionary with loss, params, and status.
# Objective function
def objective(params):
    
    xgboost = XGBClassifier(seed=0, **params)
    score = cross_val_score(estimator=xgboost, 
                            X=X_train_transformed, 
                            y=y_train, 
                            cv=kfold, 
                            scoring='recall', 
                            n_jobs=-1).mean()
    # Loss is negative score
    loss = - score
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK}

fmin is the function to search the best hyperparameters with the smallest loss value.

  • fn takes in the objective function.
  • space is for the search space of the hyperparameters.
  • algo is for the type of search algorithm. Hyperopt currently has three algorithms: random search, Tree of Parzen Estimators (TPE), and adaptive TPE. We are using TPE as the search algorithm.
  • max_evals specifies the maximum number of evaluations.
  • trials stores the information for the evaluations.
# Optimize
best = fmin(fn = objective, space = space, algo = tpe.suggest, max_evals = 48, trials = Trials())

Output:

100%|██████████| 48/48 [00:11<00:00,  4.23it/s, best loss: -0.9859649122807017]

After the Bayesian optimization search, we get a best loss of about -0.986, meaning that the cross-validation recall is about 98.6%.

We can print out the index for the parameters using print(best). To get the values of the best parameters, we can use the space_eval and pass in the search space and best.

# Print the index of the best parameters
print(best)
# Print the values of the best parameters
print(space_eval(space, best))

Output

{'colsample_bytree': 1, 'gamma': 4, 'learning_rate': 0, 'max_depth': 5, 'reg_alpha': 0, 'reg_lambda': 1}
{'colsample_bytree': 0.4, 'gamma': 0.4, 'learning_rate': 0.0001, 'max_depth': 18, 'reg_alpha': 1e-05, 'reg_lambda': 0.01}

Next, we apply the best hyperparameters to the XGBClassifier and make predictions.

# Train model using the best parameters
xgboost_bo = XGBClassifier(seed=0, 
                           colsample_bytree=space_eval(space, best)['colsample_bytree'], 
                           gamma=space_eval(space, best)['gamma'], 
                           learning_rate=space_eval(space, best)['learning_rate'], 
                           max_depth=space_eval(space, best)['max_depth'], 
                           reg_alpha=space_eval(space, best)['reg_alpha'],
                           reg_lambda=space_eval(space, best)['reg_lambda']
                           ).fit(X_train_transformed,y_train)
# Make prediction using the best model
bayesian_opt_predict = xgboost_bo.predict(X_test_transformed)
# Get predicted probabilities
bayesian_opt_predict_prob = xgboost_bo.predict_proba(X_test_transformed)[:,1]
# Get performance metrics
precision, recall, fscore, support = score(y_test, bayesian_opt_predict)
# Print result
print(f'The recall value for the xgboost Bayesian optimization is {recall[1]:.4f}')

Output:

The recall value for the xgboost Bayesian optimization is 0.9859

The recall value on the test dataset is 98.59%, the same as the random search result.
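As a side note, because the keys of the search space match the XGBClassifier argument names, the same model could be built more compactly by unpacking the evaluated parameters. A minimal sketch, equivalent to the code above:

# Equivalent, more compact way to apply the best hyperparameters
best_params = space_eval(space, best)
xgboost_bo = XGBClassifier(seed=0, **best_params).fit(X_train_transformed, y_train)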

Summary

In this tutorial, we covered how to tune XGBoost hyperparameters using Python. You learned

  • What are the differences between grid search, random search, and Bayesian optimization?
  • How to use grid search cross-validation to tune the hyperparameters for the XGBoost model?
  • How to use random search cross-validation to tune the hyperparameters for the XGBoost model?
  • How to use Bayesian optimization to tune the hyperparameters for the XGBoost model?
  • How to compare the results from grid search, random search, and Bayesian optimization?

In practice, random search and Bayesian optimization usually have better performance than the grid search because they can tune more parameters on wider ranges of values.
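For a quick side-by-side view of the test results reported above, the predictions from the four models can be compared in one loop. This sketch reuses the variables created in the earlier steps:

# Compare test recall for the baseline and the three tuned models
for name, pred in [('Baseline', xgboost_predict),
                   ('Grid search', grid_predict),
                   ('Random search', random_predict),
                   ('Bayesian optimization', bayesian_opt_predict)]:
    precision, recall, fscore, support = score(y_test, pred)
    print(f'{name} recall: {recall[1]:.4f}')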

More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.

