avatarOkan Yenigün

Summary

The provided content offers a comprehensive guide to Bayesian Optimization, detailing its advantages over traditional optimization methods, its implementation in Python, and its practical applications, particularly in the context of machine learning hyperparameter tuning.

Abstract

Bayesian Optimization is presented as an efficient strategy for the global optimization of complex, black-box functions that are often encountered in machine learning and other domains where analytical gradients are unavailable or the objective function is noisy. The technique leverages a surrogate probabilistic model, typically a Gaussian Process, to predict function values and quantify uncertainty, thereby balancing exploration and exploitation to find an optimum. The guide walks through the foundational concepts of Bayesian Optimization, including the treatment of objective functions as black boxes, the role of acquisition functions in guiding the optimization process, and the practical considerations when implementing this approach in Python. It also contrasts Bayesian Optimization with classical methods like grid search and random search, emphasizing its sample efficiency and ability to handle high-dimensional search spaces. The article culminates in a hands-on demonstration using Python code snippets to illustrate the optimization process, from initial sampling to the iterative refinement of the surrogate model and the selection of subsequent sample points.

Opinions

  • Bayesian Optimization is superior to traditional optimization methods such as grid search and random search, especially in scenarios involving high-dimensional or noisy functions.
  • The incorporation of prior knowledge into the optimization process, facilitated by the probabilistic surrogate model, is a key strength of Bayesian Optimization.
  • Gaussian Processes are particularly suitable for Bayesian Optimization as they provide a flexible framework for modeling the uncertainty of black-box functions.
  • The iterative nature of Bayesian Optimization, which combines the surrogate model with acquisition functions like Expected Improvement, Upper Confidence Bound, and Probability of Improvement, is highlighted as a mechanism for efficient exploration and exploitation.
  • The use of Bayesian Optimization for hyperparameter tuning in machine learning is emphasized as a practical and impactful application of this methodology.
  • The article suggests that Bayesian Optimization is not just theoretically sound but also practically viable, as demonstrated by the Python implementation using libraries such as scikit-learn and matplotlib.

Step-by-Step Guide to Bayesian Optimization: A Python-based Approach

Building the Foundation: Implementing Bayesian Optimization in Python

Photo by Federico Beccari on Unsplash

Bayesian optimization is a technique used for the global (optimum) optimization of black-box functions.

A black box is a system whose internal workings are unknown to the observer. She can only have access to the inputs and outputs of the system but does not have knowledge of how the system arrives at outputs based on the inputs. In the context of optimization, a black box function refers to an objective function.

Black box. Source

To illustrate, consider a function f that is inaccessible to us. We cannot directly access f or compute its gradients. Our only available information is providing an input x and receiving a noisy estimation (or without any noise) of the true output.

Our objective is to optimize the underlying value within the black box function f according to our purpose, maximizing it for example.

In traditional optimization methods, such as grid search or random search, the objective function is evaluated at a predefined set of points. However, these methods can be computationally expensive and inefficient when dealing with high-dimensional or noisy functions.

Comparison between (a) grid search; and (b) random search for hyper-parameter tuning. Source

Bayesian optimization offers several advantages over traditional methods. Firstly, it efficiently handles expensive and noisy function evaluations by building a probabilistic surrogate model, which captures the uncertainty and guides the search process intelligently.

It incorporates prior knowledge or beliefs about the objective function, enabling more informed decision-making.

It also balances exploration and exploitation. Exploration allows for a broader exploration of the search space, potentially discovering better solutions, while exploitation focuses on exploiting the known promising areas to optimize the current best solution. Balancing these two aspects is crucial to finding better solutions and refining the best solution.

Exploitation vs Exploration. Source

Classical bandit algorithms make use of frequentist bounds to model the unknown function. Bandit problems are a class of sequential decision-making problems that involve selecting actions from a set of options (arms) over a series of steps or time periods. Each arm has an unknown reward distribution associated with it, and the goal is to maximize the cumulative reward obtained by selecting the arms strategically.

In the depicted cube shape, there are three metrics represented along three axes, providing an understanding of the variations in mindsets at a broader level.

Bandits vs Bayesian. Source

The performance metric is about regret setting. Regret setting is a framework used to analyze and quantify the performance of decision-making algorithms or policies in sequential decision problems. It measures the cumulative difference between the performance of a chosen policy and the optimal policy that could have been chosen in hindsight.

Regret. Source
  • Offline Performance Metric: The offline performance metric evaluates the algorithm or policy using hindsight information. It measures how well the algorithm would have performed if it had access to the complete dataset or information from the start. In other words, it measures the performance based on the best possible hindsight knowledge.
  • Online Performance Metric: The online performance metric, on the other hand, evaluates the algorithm or policy in a more realistic scenario where decisions are made sequentially with limited information. It measures performance based on the actual decisions made and their outcomes without access to future information. The online performance metric reflects the real-world performance of the algorithm in a dynamic and uncertain environment.

The X-axis of the cube displays the modeling technique, which can be either Bayesian or frequentist. On the Z-axis, the cube indicates the relationship between variables, specifically whether one variable provides information or insight about another variable or not.

Algorithm

Given: function f(x) and we want to maximize it. And we have some data for f.

Steps:

  1. Initial sample: Begin by selecting a limited set of sample points randomly.
  2. Initialize the model: Utilize these points to calculate a surrogate function.
  3. Iterate:

3.1. Acquisition function: Use the acquisition function and get the next point.

3.2. Update the model: Reevaluate the surrogate function.

3.3. Validation Point: Verify whether the surrogate function remains stable or if the variance falls below a predetermined threshold, or if f is exhausted, depending on your specific design objective.

Given Function

For example, let’s create a black box function that combines the sine and cosine functions to generate an output.

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)

def black_box_function(x):
    y = np.sin(x) + np.cos(2*x)
    return y

# range of x values
x_range = np.linspace(-2*np.pi, 2*np.pi, 100)

# output for each x value
black_box_output = black_box_function(x_range)

# plot
plt.plot(x_range, black_box_output)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Black Box Function Output')
plt.show()

The function values for each x value (x_range) are computed using black_box_function, and then the output is plotted.

Black box function output plot. Image by the author.

Initial Sample

Sample outputs from function f.

# random x values for sampling
num_samples = 10
sample_x = np.random.choice(x_range, size=num_samples)

# output for each sampled x value
sample_y = black_box_function(sample_x)

# plot
plt.plot(x_range, black_box_function(x_range), label='Black Box Function')
plt.scatter(sample_x, sample_y, color='red', label='Samples')
plt.xlabel('x')
plt.ylabel('Black Box Output')
plt.title('Sampled Points')
plt.legend()
plt.show()
Sampling. Image by the author.

Modeling

Surrogate models in Bayesian optimization serve as a proxy for the true objective function. They capture the uncertainty in the function’s behavior and provide predictions or estimates of the function’s values at unobserved points within the search space. By iteratively updating the surrogate model based on observed function evaluations, Bayesian optimization guides the search towards regions of the search space likely to contain the optimal solution.

We aim to develop a model, denoted as M, that possesses the capability to not only provide predictions but also retain a measure of uncertainty associated with those predictions. For this purpose, we will adopt the following predefined interface:

  • observe(M, x, y): record observations -> All
  • predict(M, x): make posterior predictions -> All
  • sample(M, x): sample latent function values -> Thompson
  • getTail(M, v, x): the probability of exceeding x -> PI
  • getImprovement(M, v, x): the integral above v -> EI
  • getQuantile(M, q, x): the qth quantile -> UCB
  • getEntropy(M, x): the predictive entropy -> P(ES)

The ability to execute various procedures indicates the types of approaches and acquisition functions that can be employed within the model class. The first two are applicable in all Bayesian optimization models. Additionally, for example, if you have access to the tail probabilities, you can utilize the probability of improvement (PI) method.

Gaussian Process

Gaussian processes (GPs) are commonly used as surrogate models in Bayesian optimization. A Gaussian process defines a distribution over functions, where any finite set of function values follows a multivariate Gaussian distribution. In the context of Bayesian optimization, a GP is used to model the unknown objective function, and it provides a posterior distribution over the function values given the observed data.

Gaussian processes are specified by the estimation function and the uncertainty function. Source

GP is fully specified by its mean function and covariance function (also called the kernel function). The mean function represents the expected value of the function at each input point, while the covariance function characterizes the relationship between different input points and controls the smoothness and behavior of the generated function.

Given a set of observed data, a GP can be used to make predictions about the function values at unobserved points. These predictions are based on the observed data and the underlying assumptions encoded in mean and covariance functions. Importantly, GPs provide a measure of uncertainty associated with these predictions, allowing for principled handling of uncertainty in predictions.

Illustration of Gaussian process regression in one dimension. Source

In domains with limited prior knowledge, Gaussian processes are particularly useful because they rely on the assumption that similar inputs lead to similar outputs. This makes Gaussian processes well-suited for detecting patterns and trends in the data when not much prior information is available.

When constructing our model, we assume that it exhibits a random process, is stationary, and adheres to a Gaussian distribution. A random process refers to a mathematical model that describes the evolution of a system or phenomenon over time, where the outcomes are subject to uncertainty or randomness. Stationarity, in the context of a random process, implies that the statistical properties of the process remain constant over time. Specifically, it means that the mean, variance, and autocovariance of the process do not change with respect to time or shifts in the time origin.

The principles of stationarity. Source

We possess an array, f, representing the values of a black box function for all its elements. We assume that this array follows a normal distribution.

The probability density function will take the form of a Gaussian density function.

Since we assume stationarity, the joint probability distribution remains constant over time, implying it is time-invariant. The mean value of the function is m. The covariance is expected to exhibit a Gaussian shape and can be approximated using a radial basis function (RBF) kernel.

Next, we require a prediction function that aims to estimate the probability of the function f (x), given the available data we have at hand.

Normal distribution with 0 means and ~C covariance over normal distribution with 0 means and C covariance.

Now, we can simplify it.

k is an array of RBF kernels. C is the covariance matrix.

Using this prediction function, we can infer that, on average, the unknown function will take the value μ at x, accompanied by a variance of σ.

So, the algorithm for the surrogate function is more or less:

  1. Iterate through each sample value of input x for which we desire to perform an evaluation.
  • Build k and f vectors using RBF kernel.
  • Build matrices C and C~.
  • Calculate μ and σ.
  • Append μ to the predictedMu array and σ to predictedSigma arrays.

2. Calculate Omega as the mean of the black box function for sampled points.

3. Calculate Kappa = predictedMu + Omega

4. Return:

  • Kappa, the mean estimate of the surrogate function
  • predictedSigma, the estimate of variance from the surrogate function.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Gaussian process regressor with an RBF kernel
kernel = RBF(length_scale=1.0)
gp_model = GaussianProcessRegressor(kernel=kernel)

# Fit the Gaussian process model to the sampled points
gp_model.fit(sample_x.reshape(-1, 1), sample_y)

# Generate predictions using the Gaussian process model
y_pred, y_std = gp_model.predict(x_range.reshape(-1, 1), return_std=True)

# Plot 
plt.figure(figsize=(10, 6))
plt.plot(x_range, black_box_function(x_range), label='Black Box Function')
plt.scatter(sample_x, sample_y, color='red', label='Samples')
plt.plot(x_range, y_pred, color='blue', label='Gaussian Process')
plt.fill_between(x_range, y_pred - 2*y_std, y_pred + 2*y_std, color='blue', alpha=0.2)
plt.xlabel('x')
plt.ylabel('Black Box Output')
plt.title('Black Box Function with Gaussian Process Surrogate Model')
plt.legend()
plt.show()
Gaussian process plot. Image by the author.

Acquisition Function

How do we choose the next point for the surrogate function?

Acquisition functions determine the next point or set of points to evaluate in the search space. It quantifies the potential utility or desirability of sampling a particular point based on the current state of the optimization process. The purpose of the acquisition function is to balance exploration and exploitation.

It takes into account both the predicted values of the surrogate model and the uncertainty associated with those predictions. It combines these two aspects to identify points that are expected to have high objective function values and (or) high uncertainty, indicating the potential for improvement.

Acquisition function for choosing the next point of evaluation. Source

There are several acquisition functions used in Bayesian optimization:

  • Expected Improvement (EI) selects points that have the potential to improve upon the best-observed value. It quantifies the expected improvement over the current best value and considers both the mean prediction of the surrogate model and its uncertainty.
from scipy.stats import norm

def expected_improvement(x, gp_model, best_y):
    y_pred, y_std = gp_model.predict(x.reshape(-1, 1), return_std=True)
    z = (y_pred - best_y) / y_std
    ei = (y_pred - best_y) * norm.cdf(z) + y_std * norm.pdf(z)
    return ei

# Determine the point with the highest observed function value
best_idx = np.argmax(sample_y)
best_x = sample_x[best_idx]
best_y = sample_y[best_idx]

ei = expected_improvement(x_range, gp_model, best_y)

# Plot the expected improvement
plt.figure(figsize=(10, 6))
plt.plot(x_range, ei, color='green', label='Expected Improvement')
plt.xlabel('x')
plt.ylabel('Expected Improvement')
plt.title('Expected Improvement')
plt.legend()
plt.show()
Expected improvement. Image by the author.
  • Upper Confidence Bound (UCB) trades off exploration and exploitation by balancing the mean prediction of the surrogate model and an exploration term proportional to the uncertainty. It selects points that offer a good balance between predicted high values and exploration of uncertain regions.
def upper_confidence_bound(x, gp_model, beta):
    y_pred, y_std = gp_model.predict(x.reshape(-1, 1), return_std=True)
    ucb = y_pred + beta * y_std
    return ucb

beta = 2.0

# UCB
ucb = upper_confidence_bound(x_range, gp_model, beta)

plt.figure(figsize=(10, 6))
plt.plot(x_range, ucb, color='green', label='UCB')
plt.xlabel('x')
plt.ylabel('UCB')
plt.title('UCB')
plt.legend()
plt.show()
UCB. Image by the author.
  • Probability of Improvement (PI) estimates the probability that a point will improve upon the current best value. It considers the difference between the mean prediction and the current best value, taking into account the uncertainty in the surrogate model.
def probability_of_improvement(x, gp_model, best_y):
    y_pred, y_std = gp_model.predict(x.reshape(-1, 1), return_std=True)
    z = (y_pred - best_y) / y_std
    pi = norm.cdf(z)
    return pi

# Probability of Improvement
pi = probability_of_improvement(x_range, gp_model, best_y)


plt.figure(figsize=(10, 6))
plt.plot(x_range, pi, color='green', label='PI')
plt.xlabel('x')
plt.ylabel('PI')
plt.title('PI')
plt.legend()
plt.show()
PI. Image by the author.

Irrespective of the approach used, we choose the point that has the highest value on the y-axis.

Posterior and acquisition function in Bayesian Optimization. Source
num_iterations = 5

plt.figure(figsize=(10, 6))

for i in range(num_iterations):
    # Fit the Gaussian process model to the sampled points
    gp_model.fit(sample_x.reshape(-1, 1), sample_y)

    # Determine the point with the highest observed function value
    best_idx = np.argmax(sample_y)
    best_x = sample_x[best_idx]
    best_y = sample_y[best_idx]

    # Set the value of beta for the UCB acquisition function
    beta = 2.0

    # Generate the Upper Confidence Bound (UCB) using the Gaussian process model
    ucb = upper_confidence_bound(x_range, gp_model, beta)

    # Plot the black box function, surrogate function, previous points, and new points
    plt.plot(x_range, black_box_function(x_range), color='black', label='Black Box Function')
    plt.plot(x_range, ucb, color='red', linestyle='dashed', label='Surrogate Function')
    plt.scatter(sample_x, sample_y, color='blue', label='Previous Points')
    if i < num_iterations - 1:
        new_x = x_range[np.argmax(ucb)]  # Select the next point based on UCB
        new_y = black_box_function(new_x)
        sample_x = np.append(sample_x, new_x)
        sample_y = np.append(sample_y, new_y)
        plt.scatter(new_x, new_y, color='green', label='New Points')

    plt.xlabel('x')
    plt.ylabel('y')
    plt.title(f"Iteration #{i+1}")
    plt.legend()
    plt.show()
Iteration #1. Image by the author.
Iteration #2. Image by the author.
Iteration #3. Image by the author.
Iteration #4. Image by the author.
Iteration #5. Image by the author.

Cool, right? It keeps learning and improving in every iteration.

Bayesian optimization offers several positive aspects. One of its key advantages is the ability to optimize black-box functions that lack analytical gradients or have noisy evaluations. Employing a surrogate model, such as a Gaussian process, balances exploration and exploitation to efficiently search for the optimal solution. It provides a principled approach for iteratively selecting points that offer a good trade-off between exploring unexplored regions and exploiting promising areas.

It finds applications in various domains. It is particularly well-suited for hyperparameter tuning in machine learning. It can be used for optimizing complex simulations, experimental designs, and other scenarios where the evaluation of the objective function is costly or time-consuming.

Read More

Sources

https://www.researchgate.net/publication/322035981_Per_Instance_Algorithm_Configuration_for_Continuous_Black_Box_Optimization

https://www.researchgate.net/publication/341691661_A_Kernel_Design_Approach_to_Improve_Kernel_Subspace_Identification

https://nesslabs.com/exploration-exploitation-ambidextrous-mindset-success-trap

https://www.youtube.com/watch?v=C5nqEHpdyoE

https://www.researchgate.net/publication/334714068_Low-Cost_Multi-Objective_Optimization_of_Multiparameter_Antenna_Structures_Based_on_the_l1_Optimization_BPNN_Surrogate_Model

https://www.researchgate.net/publication/327613136_Bayesian_optimization_for_likelihood-free_cosmological_inference

Optimization
Machine Learning
Hyperparameter Tuning
Python
Statistics
Recommended from ReadMedium