avatarT Z J Y

Summary

The web content discusses Thompson Sampling for multi-armed bandit problems, a Bayesian approach to balancing exploration and exploitation, which is an improvement over the Upper Confidence Bound (UCB) algorithm.

Abstract

The article "Multi-armed Bandits Part III: Thompson Sampling based on Bayesian Posterior Probability" builds upon the previous discussion of the Upper Confidence Bound (UCB) algorithm by introducing Thompson Sampling as a more refined method for solving multi-armed bandit problems. This approach assumes a Gaussian distribution for the mean rewards of slot machines and sets the upper bound at a 95% confidence interval. Thompson Sampling leverages Bayesian probability to select actions, drawing from the posterior distribution of rewards, which is particularly effective when the reward distribution is known, such as a Bernoulli bandit that follows a Beta distribution. The method is praised for implementing probability matching, where the estimated reward probabilities align with the probability that an action is optimal given the observed history. The article concludes with an invitation to follow the author on Medium for more insights.

Opinions

  • The author suggests that making assumptions about the reward distribution, such as expecting it to be Gaussian, can lead to better bound estimations in multi-armed bandit problems.
  • Thompson Sampling is presented as a simple yet powerful Bayesian strategy that can effectively address the exploration-exploitation dilemma inherent in multi-armed bandit scenarios.
  • The article implies that the Beta distribution is a natural choice for modeling the success probability in a Bernoulli distribution due to its properties that align well with updating beliefs based on observed successes and failures.
  • The author endorses Thompson Sampling for its implementation of probability matching, which is seen as a desirable characteristic for making decisions that balance exploration and exploitation.
  • By providing references to foundational texts and lectures, the author positions the content as credible and grounded in established reinforcement learning research.

Multi-armed Bandits Part III: Thompson Sampling based on Bayesian Posterior Probability

In previous section: Multi-armed Bandits: What is Upper Confidence Bound (UCB) Algorithm, we achieve a very general estimation based on Hoeffding’s Inequality. This is primarily because we don’t assume any prior on the reward distribution. If we could make some assumption upfront, we would definitely be able to make better bound estimation.

If we expect the mean reward of every slot machine to be Gaussian and independent, we could then set the upper bound as 95% confidence interval by make Ut(a) as 1.96x standard deviation.

Thompson sampling

Thompson sampling is using a very simple Bayesian idea but it turns out to be very useful to solve the multi-armed bandit problem. Basically at each step, the strategy is going to select an action with optimal probability

π(a|ht) is the probability of taking action a given the history ht. Here we can use Bayes law to compute posterior distribution P(R | ht).

For the Bernoulli bandit described, it is natural to assume that Q(a) follows a Beta distribution, as Q(a) is essentially the success probability θ in Bernoulli distribution. The value of Beta(α,β) is within the interval [0, 1], and α and β are the counts when we succeeded or failed to get a reward respectively. Because of the beautiful characteristics from Beta distribution, we can update the posterior with the known prior and the likelihood from the sampled data:

From an engineering perspective, Thompson sampling implements probability matching. Because its reward estimations Q̃ are sampled from posterior distributions, each of these probabilities is equivalent to the probability that the corresponding action is optimal, conditioned on observed history.

Thanks for Reading!

If you enjoyed it, please follow me on Medium for more. It’s great cardio for your 👏 AND will help other people see the story.

References

Data Science
Machine Learning
Reinforcement Learning
Artificial Intelligence
Statistics
Recommended from ReadMedium