Multi-armed Bandits Part III: Thompson Sampling based on Bayesian Posterior Probability

Summary

The web content discusses Thompson Sampling for multi-armed bandit problems, a Bayesian approach to balancing exploration and exploitation, which is an improvement over the Upper Confidence Bound (UCB) algorithm.

Abstract

The article "Multi-armed Bandits Part III: Thompson Sampling based on Bayesian Posterior Probability" builds upon the previous discussion of the Upper Confidence Bound (UCB) algorithm by introducing Thompson Sampling as a more refined method for solving multi-armed bandit problems. This approach assumes a Gaussian distribution for the mean rewards of slot machines and sets the upper bound at a 95% confidence interval. Thompson Sampling leverages Bayesian probability to select actions, drawing from the posterior distribution of rewards, which is particularly effective when the reward distribution is known, such as a Bernoulli bandit that follows a Beta distribution. The method is praised for implementing probability matching, where the estimated reward probabilities align with the probability that an action is optimal given the observed history. The article concludes with an invitation to follow the author on Medium for more insights.

Opinions

The author suggests that making assumptions about the reward distribution, such as expecting it to be Gaussian, can lead to better bound estimations in multi-armed bandit problems.
Thompson Sampling is presented as a simple yet powerful Bayesian strategy that can effectively address the exploration-exploitation dilemma inherent in multi-armed bandit scenarios.
The article implies that the Beta distribution is a natural choice for modeling the success probability in a Bernoulli distribution due to its properties that align well with updating beliefs based on observed successes and failures.
The author endorses Thompson Sampling for its implementation of probability matching, which is seen as a desirable characteristic for making decisions that balance exploration and exploitation.
By providing references to foundational texts and lectures, the author positions the content as credible and grounded in established reinforcement learning research.

Thompson sampling

Thompson sampling is using a very simple Bayesian idea but it turns out to be very useful to solve the multi-armed bandit problem. Basically at each step, the strategy is going to select an action with optimal probability

π(a|ht) is the probability of taking action a given the history ht. Here we can use Bayes law to compute posterior distribution P(R | ht).

For the Bernoulli bandit described, it is natural to assume that Q(a) follows a Beta distribution, as Q(a) is essentially the success probability θ in Bernoulli distribution. The value of Beta(α,β) is within the interval [0, 1], and α and β are the counts when we succeeded or failed to get a reward respectively. Because of the beautiful characteristics from Beta distribution, we can update the posterior with the known prior and the likelihood from the sampled data:

From an engineering perspective, Thompson sampling implements probability matching. Because its reward estimations Q̃ are sampled from posterior distributions, each of these probabilities is equivalent to the probability that the corresponding action is optimal, conditioned on observed history.

Multi-armed Bandits Part III: Thompson Sampling based on Bayesian Posterior Probability

Multi-armed Bandits Part II: What is Upper Confidence Bound (UCB) Algorithm?

In Part 1: Multi-armed Bandits: An Example to Introduce Reinforcement Learning we can get some high level understanding…

Thompson sampling

Thanks for Reading!

References