Abstract
The article provides an in-depth comparison between traditional A/B testing and multi-armed bandit experiments, emphasizing the latter's ability to dynamically allocate traffic to better-performing variations. Multi-armed bandit experiments utilize Bayesian updating and Thompson sampling to continuously learn and adjust during the testing process, which can lead to quicker identification of the most effective options. This approach is particularly beneficial when the cost of exposing users to suboptimal experiences is high, as it minimizes the number of users subjected to less successful variations. The article also covers the concept of "value remaining in experiment" as used by Google Analytics, which helps in determining when to terminate an experiment, and the use of Monte Carlo simulations to assess confidence in the observed results. The author concludes by discussing situations where multi-armed bandit tests are preferable, such as for early-stage startups or when testing multiple variants, while also acknowledging the limitations and appropriate use cases for A/B testing.
Opinions
The author views multi-armed bandit experiments as superior to A/B testing in certain contexts, particularly due to their adaptability and efficiency in traffic allocation.
It is the author's opinion that the Bayesian approach, specifically Thompson sampling, is an elegant solution that allows for real-time learning and decision-making in experiments.
The author suggests that the ability to terminate experiments earlier with multi-armed bandit tests is a significant advantage, as it reduces the time and resources spent on testing.
The author believes that Google Analytics' concept of "value remaining in experiment" is a useful criterion for ending experiments, ensuring that decisions are made when there is sufficient confidence in the results.
The author acknowledges that while multi-armed bandit experiments have a higher false positive rate compared to A/B testing, the trade-off is often acceptable given the method's other advantages.
According to the author, multi-armed bandit experiments are particularly well-suited for early-stage startups and scenarios where multiple variants need to be tested simultaneously.
The author notes that A/B testing remains a better option in situations where controlling for false positives is paramount, or when the company has a large user base to support more traditional testing methods.
A study of Google Analytics’ stochastic k-armed bandit test with Thompson sampling and Monte Carlo simulation
A/B Testing Recap
A/B testing relies on classical statistical tests of significance. When we come up with a new product feature, we want to test whether it is useful before launching it to the entire user base. The test involves two groups: a treatment group (with access to the new feature) and a control group. We then measure a key metric for each group: average time on site (social networks), average checkout time (e-commerce), or click-through rate (online ads). The difference between the groups is tested for statistical significance.
Classical statistical tests (z-test, t-test) guarantee that the false positive rate is no more than α, which is often set to 5%. This means that when there is no real difference between the treatment and control groups, the test will still find a statistically significant difference by chance 5% of the time.
A balanced A/B test allocates an equal amount of traffic to each group until a sufficient sample size is reached. However, we cannot adjust the traffic allocation during the test based on what is observed. The disadvantage of A/B testing is therefore obvious: even if the treatment group is clearly superior, we still have to spend a lot of traffic on the control group in order to obtain statistical significance.
Multi-armed Bandit
Whereas A/B testing is a frequentist approach, we can also conduct the test in a Bayesian way. Understandably, once we see that one treatment is clearly better, we want to send more users to that treatment right away. A multi-armed bandit experiment makes this possible in a controlled way.
The foundation of the multi-armed bandit experiment is Bayesian updating. Each treatment (called an "arm", see the class definition below) has an unknown probability of success: each trial is modeled as a Bernoulli process, and the success probability is modeled by a Beta distribution. As the experiment continues, each arm receives user traffic and its Beta distribution is updated accordingly. To visualize the updating process, see my earlier post here.
In this post, I'm using the Google Analytics example of matching users to online advertisements. Suppose there are K arms. Each arm is an ad with a click-through rate (ctr) modeled by a Beta distribution. The goal of the experiment is to find the ad with the highest click-through rate.
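A minimal sketch of such an arm in Python, assuming a uniform Beta(1, 1) prior; the class name Arm and the simulated show_ad step are illustrative choices rather than a definitive implementation:

```python
import numpy as np

class Arm:
    """One ad variant: a Bernoulli click process with a Beta prior on its ctr."""

    def __init__(self, true_ctr, alpha=1.0, beta=1.0):
        self.true_ctr = true_ctr  # ground-truth ctr, known only to the simulator
        self.alpha = alpha        # Beta parameter: 1 + clicks observed so far
        self.beta = beta          # Beta parameter: 1 + non-clicks observed so far

    def sample_ctr(self, rng):
        """Draw one plausible ctr from the current Beta posterior."""
        return rng.beta(self.alpha, self.beta)

    def show_ad(self, rng):
        """Simulate one ad impression and update the posterior with the outcome."""
        clicked = rng.random() < self.true_ctr
        if clicked:
            self.alpha += 1
        else:
            self.beta += 1
        return clicked
```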
Thompson Sampling
In a nutshell, Thompson sampling is a randomized greedy method that always chooses the arm with the highest sampled expected reward. In each iteration of the bandit experiment, Thompson sampling draws a sample ctr from each arm's Beta distribution and assigns the user to the arm with the highest sampled ctr.
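A minimal sketch of this step on top of the Arm class above (the helper name thompson_sample is illustrative):

```python
def thompson_sample(arms, rng):
    """Route the next user to the arm with the highest sampled ctr."""
    sampled_ctrs = [arm.sample_ctr(rng) for arm in arms]
    return int(np.argmax(sampled_ctrs))
```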
The elegant part of the bandit experiment is that Thompson sampling and Bayesian updating work hand in hand. If one of the arms is performing well, its Beta parameters are updated to remember this, and Thompson sampling becomes more likely to draw a high ctr for that arm. Throughout the experiment, high-performing arms are rewarded with more traffic, whereas under-performing arms are punished with less.
Monte Carlo Simulation
While the Beta distribution gives us an estimate of each arm's ctr, we also need to know how confident we are in those estimates. If we are confident enough that the arm with the currently highest ctr is the winner, we can end the experiment.
The Monte Carlo simulation repeatedly draws a sample ctr from each of the K arms and empirically computes how often each arm wins (i.e., has the highest ctr). If the winning arm beats the runner-up by a large enough margin, the experiment terminates.
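A sketch of this winner-frequency computation, reusing the Beta parameters of the arms above (the 10,000-draw default is an illustrative choice):

```python
def win_probabilities(arms, n_draws=10_000, rng=None):
    """Estimate how often each arm has the highest ctr under the current posteriors."""
    rng = rng or np.random.default_rng()
    # One row per Monte Carlo draw, one column per arm.
    samples = np.column_stack([rng.beta(a.alpha, a.beta, size=n_draws) for a in arms])
    winners = samples.argmax(axis=1)
    return np.bincount(winners, minlength=len(arms)) / n_draws
```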
Termination
Google Analytics introduces the concept of "value remaining in experiment" (more details here). In each Monte Carlo draw, the value remaining is computed: roughly, how much better the best sampled arm could still be relative to the current champion. If we choose α = 5%, the experiment terminates when 95% of the draws in a Monte Carlo simulation have a value remaining of less than 1% of the winning arm's value.
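One way to code this criterion as a sketch (the champion here is taken to be the arm with the highest posterior mean; the 1% threshold and draw count are illustrative defaults):

```python
def value_remaining(arms, n_draws=10_000, rng=None):
    """Per-draw potential lift of the best sampled arm over the current champion."""
    rng = rng or np.random.default_rng()
    samples = np.column_stack([rng.beta(a.alpha, a.beta, size=n_draws) for a in arms])
    # Champion: the arm with the highest posterior mean ctr so far.
    champion = int(np.argmax([a.alpha / (a.alpha + a.beta) for a in arms]))
    return (samples.max(axis=1) - samples[:, champion]) / samples[:, champion]

def should_terminate(arms, alpha=0.05, threshold=0.01, n_draws=10_000, rng=None):
    """Stop when the (1 - alpha) quantile of value remaining falls below the threshold."""
    vr = value_remaining(arms, n_draws, rng)
    return np.quantile(vr, 1 - alpha) < threshold
```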
Simulation
Having defined the utility functions above, it is straightforward to put them together. In each iteration, a new user arrives. We apply Thompson sampling to select an arm and observe whether the user clicks. We then update the arm's Beta parameters and check whether we are confident enough about the winning arm to end the experiment.
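A sketch of this loop, assuming the helper functions sketched above; max_users, the burn-in length, and checking termination only every 100 users are illustrative choices:

```python
def run_experiment(true_ctrs, max_users=100_000, burn_in=1_000, alpha=0.05, seed=0):
    """Simulate one bandit experiment; return (winning arm index, users consumed)."""
    rng = np.random.default_rng(seed)
    arms = [Arm(ctr) for ctr in true_ctrs]
    for i in range(max_users):
        choice = thompson_sample(arms, rng)   # route the next user
        arms[choice].show_ad(rng)             # observe click / no click, update the posterior
        # Only consider stopping after the burn-in, and only check every 100 users
        # to keep the Monte Carlo termination check affordable.
        if i >= burn_in and i % 100 == 0 and should_terminate(arms, alpha=alpha, rng=rng):
            break
    winner = int(np.argmax([a.alpha / (a.alpha + a.beta) for a in arms]))
    return winner, i + 1
```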
Note that I introduced a burn-in parameter: the minimum number of iterations that must be run before declaring a winner. The beginning of the experiment is the noisiest period, and any losing arm could get ahead by chance. The burn-in period helps prevent ending the experiment prematurely, before the noise settles down.
In practice, this also helps control for the novelty effect, cold-start issues, and other user-psychology-related confounders. Google Analytics forces all bandit experiments to run for a minimum of two weeks.
Advantage
The main advantage of a bandit experiment is that it terminates earlier than an A/B test because it requires a much smaller sample. In a two-armed experiment with click-through rates of 4% and 5%, traditional A/B testing requires 11,165 users in each group at the 95% significance level. With 100 users a day, the experiment would take 223 days. In the bandit experiment, however, the simulation ended after 31 days under the termination criterion above.
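For reference, a standard normal-approximation sample-size formula, assuming 95% power on top of α = 5%, lands close to the figure quoted above; the helper name ab_sample_size and its defaults are illustrative:

```python
from scipy.stats import norm

def ab_sample_size(p1, p2, alpha=0.05, power=0.95):
    """Per-group sample size for a two-proportion z-test (normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)              # two-sided significance
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numer = (z_a * np.sqrt(2 * p_bar * (1 - p_bar))
             + z_b * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(numer / (p1 - p2) ** 2))

print(ab_sample_size(0.04, 0.05))   # roughly 11,167 per group with this approximation
```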
A second advantage is that the bandit experiment makes fewer mistakes than A/B testing. A balanced A/B test always sends 50% of the traffic to each group. The plot above shows that as the experiment progresses, less and less traffic is sent to the losing arm.
Here is how a 5-armed bandit experiment ran in simulation. The red arm (ctr 4.4%) was mistaken for the winning arm during the first 150 iterations, and as much as 80% of the traffic was diverted to that losing arm. But the blue arm (ctr 4.8%) caught up and emerged as the true winner.
Tradeoff
There is no free lunch, and the convenience of a smaller sample size comes at the cost of a higher false positive rate. Although α is used as the nominal false positive rate in the termination criterion, repeating the simulation many times shows that the actual false positive rate is higher than α.
Empirically, an α of 5% finds the winning arm about 91% of the time, instead of 95%. The smaller we set α, the larger the sample size we need (shown in red), which is consistent with how A/B testing behaves.
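A sketch of how such a repeated check could look, reusing run_experiment from the sketch above (the number of runs and the example ctrs are illustrative):

```python
def winner_recovery_rate(true_ctrs, n_runs=500, **kwargs):
    """Fraction of repeated experiments that declare the truly best arm the winner."""
    best = int(np.argmax(true_ctrs))
    hits = 0
    for seed in range(n_runs):
        winner, _ = run_experiment(true_ctrs, seed=seed, **kwargs)
        hits += int(winner == best)
    return hits / n_runs

# e.g. winner_recovery_rate([0.04, 0.05]) would be expected to land near the ~91% noted above
```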
Conclusion
As it turns out, there is no definitive winner, and it is important for product managers, data scientists, and practitioners to understand the strengths and weaknesses of the two methods before making a choice. A multi-armed bandit test is preferred in the following situations:
When the cost of sending users to a losing arm is high. In this example, matching users with a bad advertisement simply results in less revenue; the loss is affordable. In other cases, such as testing two different account-recovery methods, a failure in either arm means permanently losing a user, and a multi-armed bandit experiment is clearly the better option.
For early-stage startups with limited user traffic, a multi-armed bandit experiment works better because it requires a smaller sample size, terminates earlier, and is more agile than A/B testing.
When there are more than two variants to test, a bandit experiment can help us find the winner quickly. In a bandit experiment, it is common to test 4 to 8 variants at a time, whereas A/B testing is limited to two groups per test.
A clear limitation of the multi-armed bandit test is that each arm must be modeled by a Beta distribution, meaning every trial of an arm results in a binary success or failure. This works well for click-through rate and conversion rate, but if you are testing which checkout process is faster, you have to fall back on a t-test on the difference of means.
On the other hand, A/B testing is a better option when the company has a large enough user base, when it is important to control for Type I errors (false positives), and when there are few enough variants that each one can be tested against the control group one at a time.
Extended Reading
The following posts cover topics relevant to A/B testing and provide a more in-depth review of key concepts mentioned in this article.
Visualizing Beta Distribution and Bayesian Updating [link]