A Baby Robot’s Guide To Reinforcement Learning
A Comparison of Bandit Algorithms
Multi-Armed Bandits: Part 6

Overview
Over the course of this series we’ve looked at the framework and terminology that are used to define Multi-Armed Bandits. We then turned our attention to some of the algorithms that can be used to solve this problem, from the simple Greedy algorithms, up to the complex Bayesian approach of Thompson Sampling. Now there’s only one question left to answer. Which of these approaches is the best at solving the bandit problem?
For those not yet familiar with the Multi-Armed Bandit problem, or wishing to refresh their knowledge of any of the particular areas, the other parts in this series are as follows:
- Part 1: Mathematical Framework and Terminology - all the basic information needed to get started
- Part 2: The Bandit Framework - a description of the code and test framework
- Part 3: Bandit Algorithms - The Greedy Algorithm - The Optimistic-Greedy Algorithm - The Epsilon-Greedy Algorithm (ε-Greedy) - Regret
- Part 4: The Upper Confidence Bound (UCB) Bandit Algorithm
- Part 5: Thompson Sampling - Bernoulli Thompson Sampling - Gaussian Thompson Sampling
- Part 5b: Thompson Sampling using Conjugate Priors
All code for the bandit algorithms and testing framework can be found on github: Multi_Armed_Bandits
Recap
Baby Robot is lost in the mall. Using Reinforcement Learning we want to help him find his way back to his mum. However, before he can even begin looking for her, he needs to recharge from a set of power sockets, each of which gives a slightly different amount of charge.
Using the strategies from the multi-armed bandit problem we need to find the best socket, in the shortest amount of time, to allow Baby Robot to get charged up and on his way.

Baby Robot has entered a charging room containing 5 different power sockets. Each of these sockets returns a slightly different amount of charge. We want to get Baby Robot charged up in the minimum amount of time, so we need to locate the best socket and then use it until charging is complete.
This is identical to the Multi-Armed Bandit problem except that, instead of looking for a slot machine that gives the best payout, we’re looking for a power socket that gives the most charge.

A Comparison of Bandit Algorithms
To answer the question of which is the best bandit algorithm (in terms of the ones we’ve looked at), we can re-frame the question in terms of our own problem: which algorithm will let Baby Robot get fully charged in the shortest amount of time?
To run this experiment we first need to define exactly what we mean by being fully charged. For this we’ll just arbitrarily define that Baby Robot is fully charged when he has enough charge to run for an hour (3600 seconds).
With this definition we can now run each of our bandit algorithms for a maximum of 500 time steps, which yields the following results:

(Note: The full test system used to generate these results is available in the github notebook.)
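As a rough illustration (not the actual notebook code), the comparison boils down to a loop like the sketch below. It assumes each algorithm exposes hypothetical `select_socket()` and `update(socket, reward)` methods, and that each entry in a `sockets` list has a `charge()` method that returns a noisy amount of charge in seconds.

```python
# A minimal sketch of the test loop, under the assumptions described above.
TARGET_CHARGE = 3600   # Baby Robot is "fully charged" after 3600 seconds of charge
MAX_STEPS = 500        # hard limit on the number of time steps

def run_trial(algorithm, sockets):
    """Run one algorithm until the target charge or the step limit is reached."""
    total_charge = 0.0
    history = []
    for step in range(MAX_STEPS):
        socket = algorithm.select_socket()    # choose which socket to try
        reward = sockets[socket].charge()     # seconds of charge received
        algorithm.update(socket, reward)      # update the algorithm's estimates
        total_charge += reward
        history.append(total_charge)
        if total_charge >= TARGET_CHARGE:     # fully charged - stop early
            break
    return history
```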

From the graph above we can see that:
- Optimistic Greedy, UCB and Thompson Sampling all reach the maximum required charge in approximately the same number of time steps (which is why the lines for Optimistic Greedy and Thompson Sampling are obscured by that for UCB).
- In each case, the regret for these algorithms is nearly zero. Each has taken only about 300 time steps to reach the required total of 3600 seconds of charge, which is close to the minimum possible given that the best socket returns an average of 12 seconds of charge per time step (300 × 12 = 3600). So the optimal socket has been located quickly and then exploited to the full.
- Epsilon Greedy on the other hand takes slightly longer to reach maximum charge. Since it continues to explore the set of sockets throughout the run it fails to fully exploit the best socket.
- Worst of all is the Greedy algorithm, which fails to reach the maximum charge in the time available.
So, from the point of view of charging Baby Robot, any of the Optimistic Greedy, UCB or Thompson Sampling algorithms would do the job. However, it should be noted that both Optimistic Greedy and UCB require a parameter to be set (the initial values for Optimistic Greedy and the confidence value for UCB), and a bad choice for these parameters can degrade an algorithm's performance.
Since Thompson Sampling doesn’t require a parameter to be set, this isn’t an issue, and so this may be the deciding factor when choosing which algorithm to use.
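To illustrate why there is nothing to tune, here is a minimal sketch of Gaussian Thompson Sampling socket selection, assuming unit-variance rewards and a standard normal prior over each socket's mean (the implementation in Part 5 may differ in its details, and the class and method names are just illustrative). Exploration comes from the width of each posterior, which shrinks as a socket is tried more often.

```python
import numpy as np

class GaussianThompsonSampler:
    """Simplified Gaussian Thompson Sampling: unit-variance rewards, N(0, 1) prior."""
    def __init__(self, n_sockets):
        self.n = np.zeros(n_sockets)       # number of times each socket has been tried
        self.total = np.zeros(n_sockets)   # total charge received from each socket

    def select_socket(self):
        # posterior over each socket's mean charge is N(total/(n+1), 1/(n+1))
        mean = self.total / (self.n + 1)
        std = 1.0 / np.sqrt(self.n + 1)
        samples = np.random.normal(mean, std)   # one sample per socket
        return int(np.argmax(samples))          # pick the socket with the largest sample

    def update(self, socket, reward):
        self.n[socket] += 1
        self.total[socket] += reward
```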
Increasing the Problem Difficulty
The socket problem we’ve used up to now was deliberately made very simple, to allow the exploration and exploitation mechanisms of each algorithm to be illustrated. However, in terms of finding which algorithm is best, it has a couple of major drawbacks.
- Firstly, it only has five sockets, so even with the random search of the Greedy algorithm there's a 20% chance of finding the best socket. The more advanced search mechanisms can quickly locate and lock onto the best socket, since there are so few to test.
- Secondly, each socket has a very distinct reward. Since we set up the experiment with a spacing of 2 seconds of charge between sockets and a unit variance in their rewards, there is very little overlap between the charge returned by one socket and that of the next best socket. Once the best socket has been tried, it is very unlikely that any of the other sockets will return a greater charge, which makes it very easy to identify which socket is optimal.
To overcome these deficiencies in our original experiment, let's double the number of sockets and reduce the spacing between the amounts of charge they return, from 2 seconds of charge down to 0.2. With these values the output of the sockets now looks as follows:


Now we have 10 sockets, with 0.2 seconds of charge difference between the mean reward of a socket and the next best socket. This means that there’s a much greater overlap in the rewards that are returned.
The socket order [7, 5, 1, 10, 2, 3, 6, 8, 9, 4] gives each socket's rank, in increasing order of quality. So socket number 1 has rank 7 and a mean reward of (7 × 0.2 + 2) = 3.4 seconds of charge (the +2 is just an offset to stop sockets from having negative values). The best socket is socket 4, with a mean reward of (10 × 0.2 + 2) = 4, and the worst is socket 3, with a mean reward of (1 × 0.2 + 2) = 2.2.
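For illustration, these mean values can be reproduced in a couple of lines (the variable names are just for this example):

```python
# Compute each socket's mean charge from its rank in the socket order.
socket_order = [7, 5, 1, 10, 2, 3, 6, 8, 9, 4]
spacing, offset = 0.2, 2.0

socket_means = [rank * spacing + offset for rank in socket_order]
# socket 1 -> 7 * 0.2 + 2 = 3.4
# socket 4 -> 10 * 0.2 + 2 = 4.0 (best)
# socket 3 -> 1 * 0.2 + 2 = 2.2 (worst)
```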
A couple of other points to note are:
- Because we’ve reduced the spread in the mean values of the sockets, the best socket now has a lower output than when we only tested with 5 sockets. As a result, the number of time steps required to reach our defined maximum total reward, of 3600 seconds of charge, will be increased.
- We’ve kept the same parameters as in the 5 socket experiment for the UCB and Optimistic Greedy algorithms, so it’s possible that tuning these values could give better results.
Running with this new setup generates the following results:


Now we can see some separation in the performance of the algorithms:
- As before, the Greedy algorithm performs much worse than all the others. Epsilon Greedy, while being much better than the simple Greedy algorithm, is still worse than the other 3 algorithms.
- Although there’s not a lot in it, Thompson Sampling outperforms the others.
This difference between the algorithms becomes even more distinct as the number of sockets is increased and the spread of mean reward values is decreased, as shown below for 100 sockets with a spread of 0.1 (note that the maximum socket output is once again 12, since 0.1 × 100 + 2 = 12):


With 100 sockets it’s interesting to note how both the UCB and Optimistic Greedy algorithms perform worse than the Epsilon Greedy algorithm. In fact, by the time that Thompson Sampling has reached maximum charge, neither of these other algorithms has yet surpassed the total mean reward of Epsilon Greedy.
This is caused by the priming rounds of the UCB and Optimistic Greedy algorithms, during which each socket is tested exactly once (note the lower gradient up to the 100th time step). So, for the first 100 time steps, UCB and Optimistic Greedy are working their way through each of the 100 sockets, whereas Epsilon Greedy and Thompson Sampling have already started to exploit the best sockets that they've found.
If we left the experiment to run for more time steps, then UCB and Optimistic Greedy would probably exceed the mean total reward of Epsilon Greedy and reach full charge before it. However, if Baby Robot has chosen his sockets using the Thompson Sampling algorithm he'll already be fully charged and on his way.
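To make the priming behaviour concrete, here is a minimal sketch of a UCB-style selector (class, method names and the default confidence value are illustrative): every socket is tried once before the confidence-bound formula takes over, so with 100 sockets the first 100 time steps are spent entirely on this initial sweep.

```python
import numpy as np

class UCBSocketSelector:
    def __init__(self, n_sockets, confidence=2.0):   # confidence value must be chosen
        self.c = confidence
        self.n = np.zeros(n_sockets)                  # trials per socket
        self.Q = np.zeros(n_sockets)                  # estimated mean charge per socket
        self.t = 0                                    # total time steps so far

    def select_socket(self):
        self.t += 1
        untried = np.where(self.n == 0)[0]
        if len(untried) > 0:                          # priming: try each socket once
            return int(untried[0])
        ucb = self.Q + self.c * np.sqrt(np.log(self.t) / self.n)
        return int(np.argmax(ucb))                    # highest upper confidence bound

    def update(self, socket, reward):
        self.n[socket] += 1
        self.Q[socket] += (reward - self.Q[socket]) / self.n[socket]
```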
Contextual Bandits
One major drawback of the basic bandit algorithms we’ve seen is that they take no account of any available context information.
For example, imagine that the power sockets were colour coded, with the colour giving an indication of the amount of charge that would be returned. If, after completing our trials, we found that blue sockets gave a lot of charge and yellow sockets gave very little, then it would make sense the next time we entered a charging station to favour blue sockets over yellow ones.
The basic bandit algorithms simply take an action, collect a reward, and pay no attention to their current state. Potentially helpful information from the current state, which could assist in choosing the best action, is therefore simply ignored.
Contextual bandits³ (also known as “associative bandits”) address this limitation by using information from the current state to help guide their choice of action. As a result they can either be thought of as sophisticated bandit algorithms or as a simplified version of reinforcement learning.
In the full reinforcement learning problem the action that is taken can lead to a change in state and therefore to new contextual information. As a result, the chosen action can have an impact on the future rewards that can be obtained. For example, in a game of chess, a move that looks good and gives a large immediate reward (such as taking the opponent's queen) may in fact lead to you losing the game, and would therefore actually be a bad move. The true reward is delayed and isn't received until the game ends.
In contextual bandits neither of these conditions exist; actions don’t change the state and rewards are instant, not delayed.
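As a rough sketch of the contextual idea, using the hypothetical colour-coded sockets described above, the simplest approach is to keep a separate table of estimates per context and run an ordinary bandit algorithm within each one (the class below is illustrative, not the method of any particular paper):

```python
import numpy as np

class ContextualEpsilonGreedy:
    """Epsilon-Greedy with one table of estimates per context (e.g. socket colour)."""
    def __init__(self, contexts, n_sockets, epsilon=0.1):
        self.epsilon = epsilon
        self.Q = {c: np.zeros(n_sockets) for c in contexts}   # estimates per context
        self.n = {c: np.zeros(n_sockets) for c in contexts}   # counts per context

    def select_socket(self, context):
        if np.random.random() < self.epsilon:                 # explore
            return np.random.randint(len(self.Q[context]))
        return int(np.argmax(self.Q[context]))                # exploit for this context

    def update(self, context, socket, reward):
        self.n[context][socket] += 1
        self.Q[context][socket] += (reward - self.Q[context][socket]) / self.n[context][socket]
```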
Uses of Bandit Algorithms
So, apart from charging Baby Robots or winning your fortune on the slot machines the next time you’re in Vegas¹, what exactly can Multi-Armed Bandit algorithms be used for?
¹Disclaimer: Vegas casino owners are notoriously touchy about people using algorithms in their casinos. I therefore accept no responsibility for any damages, physical or financial, incurred during the use of these algorithms!
There are probably two main areas of use for Multi-Armed Bandits:
The first is how we’ve used them, as a stepping stone to full Reinforcement Learning. Many of the concepts, such as actions and rewards, from Multi-Armed Bandits are directly applicable to the full Reinforcement Learning (RL) problem. Effectively a Multi-Armed Bandit can be thought of as representing a single state in full Reinforcement Learning. Additionally, the greedy selection of actions, although maybe not the best approach to solving the bandit problem, is often used to choose between different actions in RL.
The second main area of use for bandit algorithms is during real world testing. This can be in any field but is particularly prevalent in online commerce, healthcare and finance.
For example, when assessing how changes to a web page affect its performance, measured in ways such as the sales generated from the page or the click-through rate, a standard approach is A/B Testing. This takes two or more variations of the page and typically presents each of them equally to the site's users to see how they perform. After a predefined period of time the statistics from each page are compared and the best-performing one is adopted as the winning page.
There is one major drawback to this form of testing: for the duration of the trial an equal number of customers are being sent to an under-performing web page. If this page is performing particularly badly it could have a large negative impact on your site's overall performance. Because it has a period dedicated solely to exploration, A/B Testing continues to investigate badly performing options.
In contrast, with a Multi-Armed Bandit approach, pages are presented to users according to their measured relative performance. Pages that are performing well are shown increasingly often, while those that are under-performing are shown less and less. In this way changes to a website can be tested so that good features, which improve performance, are promoted immediately, while bad features, which in A/B Testing would continue to have a negative impact for the duration of the test period, are quickly dropped.
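As a sketch of how this might look, the Bernoulli Thompson Sampling approach from Part 5 maps naturally onto page selection: each page variant keeps a Beta posterior over its click-through rate, and the variant with the highest sampled value is the one shown. The class below is illustrative, not production code.

```python
import numpy as np

class PageSelector:
    """Bernoulli Thompson Sampling over page variants, tracking clicks vs no-clicks."""
    def __init__(self, n_pages):
        self.alpha = np.ones(n_pages)   # 1 + number of clicks per page
        self.beta = np.ones(n_pages)    # 1 + number of non-clicks per page

    def choose_page(self):
        samples = np.random.beta(self.alpha, self.beta)   # one CTR sample per page
        return int(np.argmax(samples))                    # show the best-looking page

    def record_result(self, page, clicked):
        if clicked:
            self.alpha[page] += 1
        else:
            self.beta[page] += 1
```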
Multi-Armed Bandits are used in a similar way in clinical trials. Obviously, during a drug trial, it would be a bad idea to continue to administer a drug that is causing patients to exhibit negative side effects. Therefore Multi-Armed Bandit algorithms have been employed to decide on the drugs and dosages to give to patients to maximise positive outcomes.
For a more comprehensive look at the practical uses of Bandit Algorithms I recommend checking out the following paper:
“A Survey on Practical Applications of Multi-Armed and Contextual Bandits”, Djallel Bouneffouf, Irina Rish (2019)
Conclusion
We’ve seen that, when faced with the problem of choosing between various options whose rewards are initially unknown, algorithms such as Thompson Sampling or Upper Confidence Bounds (UCB) perform much better than simply choosing between the options at random. With these algorithms we can minimise the number of times we try bad actions and maximise the number of times we take the best action.
By employing these algorithms in the socket selection problem we were able to quickly locate and exploit the best socket and get Baby Robot charged up and on his way.
In subsequent articles we’ll look at more advanced Reinforcement Learning (RL) techniques, some of which make use of these bandit algorithms. Using these we’ll help Baby Robot to find his way back to his mum!
Note: Although we’ve examined quite a few methods to solve the Multi-Armed Bandit problem, we’ve really just scratched the surface in terms of all available algorithms. Take a look at the Bandit Book if you’d like to see a whole lot more.
References:
[1] “Reinforcement Learning: An Introduction”, Sutton & Barto (2018)
[2] “A Tutorial on Thompson Sampling”, Russo et al., (2017)
[3] “A Contextual Bandit Bake-off”, Bietti et al., (2020)
[4] “A Survey on Practical Applications of Multi-Armed and Contextual Bandits”, Djallel Bouneffouf, Irina Rish (2019)
All code for the bandit algorithms and testing framework can be found on github: Multi_Armed_Bandits
Next in the series…
An Introduction to Reinforcement Learning: Part 1 State Values and Policy Evaluation
