Understanding Power Analysis in AB Testing
A digestible explanation of the stats fundamentals you need to know in AB testing
You have a great idea for your product and you know it’ll increase conversion and bring business value. But can you convince management? And even if you are convinced yourself, can you quantify your confidence in this conviction?
Luckily, with AB testing, you can take out a lot of the guesswork. Or at least you can make a quantifiable guess and determine how ‘guess-able’ your guessing is. In statistical terms: you can test the hypothesis that your idea really is great and attach a confidence level to the results you are seeing.
If you are a product owner, it can be helpful to have some general idea of how AB testing works and understand some common applications and issues. If you are running tests yourself, understanding the underlying statistics and intuition behind AB testing is critical when performing experimentation design, even if you have a fully set up experimentation engine. Experimentation design answers the questions:
- How much data should I be collecting for my test?
- How long should I run my test for?
- My page doesn’t have a lot of visitors — does that matter in running an experiment?
In this post, I am going to cover statistical intuitions behind AB testing, experimentation design and some brief practical applications. Let’s jump right in.
What is AB Testing, and how is it different from Hypothesis Testing?
Statistical hypothesis testing is a procedure to reject or fail to reject the null hypothesis, or H0 for short. The null hypothesis represents an assumption about the population parameter and is considered the default assumption. An example of this: we assume a coin is fair. Can we determine if this assumption is reasonable if we flip the coin 100 times?
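To make that coin example concrete, here is a minimal sketch using an exact binomial test, assuming a reasonably recent scipy is available; the observed number of heads is made up for illustration:

```python
# A small sketch of testing the "fair coin" null hypothesis with an exact
# binomial test. The observed number of heads is made up for illustration.
from scipy.stats import binomtest

heads = 60   # heads observed in 100 flips (illustrative)
flips = 100

result = binomtest(heads, flips, p=0.5, alternative='two-sided')
print(f"p-value = {result.pvalue:.3f}")  # ~0.057: weak evidence against a fair coin
```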
AB testing is taking two randomized samples from a population, a Control and a Variant sample, and determining if the difference between those two samples is statistically significant. To note, there are many forms of experimentation (ABC and multivariate testing), but we are covering just AB testing today.
For the rest of the article, we will be constantly using the terms Control and Variant to get into the mindset of AB testing.
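As a concrete sketch of how a Control vs. Variant comparison is often carried out, below is a two-proportion z-test with made-up conversion counts, assuming statsmodels is available:

```python
# A minimal sketch of comparing Control and Variant conversion rates with a
# two-proportion z-test. All counts are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [300, 360]      # conversions in Control, Variant
visitors = [10_000, 10_000]   # visitors exposed to Control, Variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")

# At the conventional 5% significance level, a p-value below 0.05 is treated
# as evidence that the Variant really differs from the Control.
if p_value < 0.05:
    print("Reject the null hypothesis: the difference is statistically significant.")
else:
    print("Fail to reject the null hypothesis.")
```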
Knowing the Terminology: Null Hypothesis and Types of Errors
Now let’s put it all in the context of testing a new feature that you want to launch on your site. You will have a Control and a Variant version of the test, whereby the Variant sample has the new feature. Below are some key terms:
- Null Hypothesis, H0: your design change will not have an effect on your test variation. If you fail to reject the null hypothesis, you will act as if the null hypothesis is true and you should not launch your new feature.
- Alternative Hypothesis, H1: the alternative to the null hypothesis, whereby the design change will have an effect on your test variation. If you reject the null hypothesis, you accept the alternative hypothesis and you should launch your new feature.
- Type I error: you reject the null when you should not. In the online world, this means that you launch a feature change when it actually makes no positive difference to conversion. Your cost is the cost of development.
- Type II error: you do not reject the null when there is actually a positive difference between Variant and Control; that is, you decide not to launch a new feature when it would actually have made a difference. Assuming that the change is positive, your net cost is the potential lift from launching the feature minus the development cost.
How Are the Binomial and Normal Distributions Involved?
Why do we use the Binomial Distribution? In AB testing, you are trying to determine if the number of successes in the Variant is significantly different from the Control. The number of successes (usually a conversion or no-conversion outcome) over a sequence of trials is well described by the Binomial Distribution, where the X-axis is the number of conversions (or the conversion rate) and the Y-axis is the probability.
To add to that, the Central Limit Theorem says that if the sample size is big enough, the distribution can be approximated by the Normal Distribution.
What is big enough? A common rule of thumb: if the sample size multiplied by the conversion rate, and the sample size multiplied by one minus the conversion rate, are both greater than 5, the Normal approximation holds well.
The distribution looks like the graph below where it can be represented by a mean and a standard deviation.
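As a quick check of that rule of thumb and of how close the Normal approximation gets, here is a minimal sketch with assumed traffic numbers (requires scipy and numpy):

```python
# A small sketch of the binomial-to-normal approximation with assumed numbers.
import numpy as np
from scipy.stats import binom, norm

n = 10_000   # visitors in one variation (illustrative)
p = 0.03     # conversion rate (illustrative)

# Rule of thumb: both n*p and n*(1-p) are comfortably above 5
print(n * p, n * (1 - p))            # 300.0 9700.0

mean = n * p                         # mean of the binomial
sd = np.sqrt(n * p * (1 - p))        # standard deviation of the binomial

# The exact binomial probability and its normal approximation are close,
# e.g. the probability of seeing 280 conversions or fewer:
print(binom.cdf(280, n, p))              # exact binomial
print(norm.cdf(280, loc=mean, scale=sd)) # normal approximation
```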
Now, let’s bring it back to Control and Variant sample tests.
The graph below shows the distributions of the Control and Variant samples. The Control represents the null hypothesis while the Variant represents the alternative hypothesis. The critical value line marks the point at which you reject the null hypothesis. For illustration purposes, we can say that if H1 − H0 >= 0.5, we will reject the null hypothesis (H1 being the alternative hypothesis, and H0 being the null hypothesis). The actual mean of the Variant is 0.7.
If the null hypothesis is true and you reject the null hypothesis, there is a chance that you are making a Type I error. That chance is represented by the shaded area in grey. On the other hand, if the null hypothesis is not true and you fail to reject the null, there is a chance you are making a Type II error. That chance is represented by the shaded area in red.
I hope the illustration above gives a better intuition of how confidently you can make the statement on whether the Variant is actually different from the Control. Let me introduce another illustration to give an intuition about sample size.
The higher the sample size, the more likely the sample is representing the actual population. To put it in statistical terms, your standard error will be smaller and your distribution will be narrower. The graph below has the same distribution means, but with a smaller standard error, or higher sample size. The shaded area representing Type I and Type II error becomes smaller. If you do reject the null hypothesis, you are now more confident about not making a Type I error because your sample sizes are bigger.
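Here is a quick sketch of that effect, using the standard error of a conversion rate, sqrt(p · (1 − p) / n), with made-up numbers:

```python
# A quick illustration (made-up numbers) of how a larger sample shrinks the
# standard error of a conversion rate: se = sqrt(p * (1 - p) / n).
import math

p = 0.03   # conversion rate (illustrative)
for n in (1_000, 10_000, 100_000):
    se = math.sqrt(p * (1 - p) / n)
    print(f"n = {n:>7,}   standard error = {se:.4f}")
```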
Now that we have some intuition on sample size and test distributions, we can have a stronger intuition when talking about statistical factors to consider when designing an experiment.
Experimentation Design and Power Analysis
When you are designing a test, you want to prepare your experiment in a way that you can confidently make statements about the difference (or absence of a difference) in the Variant sample, even if that difference is small. If your site has millions of unique users per day, even a 0.5% difference in conversion can be significant in your top-line. If so, you want to capture that.
There are four statistical factors to consider when designing an experiment:
1. Minimum Detectable Effect Size and Conversion Rate: What is the effect you want to capture to make the experiment worth your while? This question is most easily answered in dollar terms. When you multiply the minimum detectable effect by your N (total number of users), what is the measurable positive lift you will get in dollar value? As an example, how much top-line dollar lift will you achieve if 1,000 additional customers check out with an average basket size of $50? (Here, an extra $50,000.)
If you are a big e-commerce company with 15 million unique visitors to your Item Page each day, 1% will have a huge impact to your top-line. In fact, 1% might be a tall ask given your site has likely been optimized over many product developments to achieve 15 million users each day. On the other hand, if your site has just 1,000 conversions per day, a 1% lift would be an additional 10 counts of conversion which is likely not meaningful to your top-line. At the same time, achieving a 10% lift or more might be very feasible if the feature change is fixing an obvious pain point for users.
Conversion rate matters in your calculation of sample size. If your site’s conversion rate is 30%, you do not require as large a sample size to detect the same relative lift as you would if your site’s conversion rate were 1%.
2. Sample Size: What is the size of the sample you need to collect? Your experiment design could require you to have 1 million visitors, but if you only have 5,000 visitors a day to your Account page, it is time to change your other parameters.
3. Significance: Also known as alpha, this is the probability you are willing to accept of detecting a difference purely by chance when there is no actual difference. It is often set to 5%.
Another way of putting it is: if there is no difference in the tests, you’re willing to make a Type I error 5% of the time.
4. Power: also known as (1 − Beta), this is the probability that your test detects an actual difference in your Variant when one exists. Conversely, Beta is the probability that your test does not reject the null hypothesis when it should actually be rejected. The higher the power, the lower the probability of a Type II error. Experiments are usually set at a power level of 80%, or 20% Beta.
Another way of putting it is: if there is a difference in the test, you’re willing to make a Type II error 20% of the time.
Power analysis means solving for one of these four factors given the values of the other three.
Let’s revisit our familiar null and alternative hypothesis distributions to tie this back to the intuition we’ve built so far. Given a fixed effect size and sample size, alpha and beta have an inverse relationship: the more Power (smaller beta) you set for your experiment, the bigger the alpha area. There is no free lunch.
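To see that trade-off numerically, here is a small sketch (made-up rates, fixed per-variation sample size) that approximates beta for a few choices of alpha using the normal approximation; scipy is assumed to be available:

```python
# A small sketch of the alpha/beta trade-off for a fixed effect size and a
# fixed per-variation sample size. All numbers are made up for illustration.
import math
from scipy.stats import norm

p_a, p_b = 0.10, 0.11    # Control and hoped-for Variant conversion rates
n = 10_000               # visitors per variation

se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)  # SE of the difference
delta = p_b - p_a

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha / 2)       # critical value under the null
    beta = norm.cdf(z_crit - delta / se)   # probability of missing the true effect
    print(f"alpha = {alpha:.2f}   beta = {beta:.2f}   power = {1 - beta:.2f}")
```

As alpha is tightened from 0.10 to 0.01 with everything else held fixed, beta grows and power shrinks, which is exactly the trade-off described above.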
In practice, the Statistical Significance and Power are typically fixed from one experiment to another, based on the nature of the business. Your concern for each new experiment is whether your sample size is sufficient to give your experiment enough Power to detect the effect size you care about. In other words, what is the minimum required sample size for your experiment?
The formula for calculating the minimum required sample size per variation is:

n = 2 · p(1 − p) · (1.96 + |z2|)² / (pB − pA)²

where:
- p = average of the Control and Variant conversion rates
- pA = Control probability or conversion rate
- pB = Variant probability or conversion rate you plan to detect
- |z2| = absolute z-score for the chosen power (≈ 0.84 for 80% power)
- 1.96 = z-score for a 5% significance level (two-tailed test)
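Here is a minimal sketch that plugs illustrative numbers into the formula above and cross-checks the result with statsmodels’ power calculator (which uses Cohen’s h as the effect size, so the two answers are close but not identical). scipy and statsmodels are assumed to be installed, and all rates are made up:

```python
# A minimal sketch of a per-variation sample-size calculation for comparing
# two conversion rates. All rates are made up for illustration.
from scipy.stats import norm
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_a = 0.10      # Control conversion rate (baseline)
p_b = 0.11      # Variant conversion rate you want to be able to detect
alpha = 0.05    # significance level (two-tailed)
power = 0.80    # 1 - beta

# Plug directly into the formula above
p_bar = (p_a + p_b) / 2
z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_power = norm.ppf(power)           # ~0.84, the |z2| term
n_formula = 2 * p_bar * (1 - p_bar) * (z_alpha + z_power) ** 2 / (p_b - p_a) ** 2

# Cross-check with statsmodels' power machinery
effect = proportion_effectsize(p_b, p_a)
n_stats = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                       power=power, ratio=1.0,
                                       alternative='two-sided')

print(f"Per-variation sample size (formula):     {n_formula:,.0f}")
print(f"Per-variation sample size (statsmodels): {n_stats:,.0f}")
```

With these made-up rates, both approaches land at roughly 14,700 to 14,800 visitors per variation.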
In practice, there are plenty of online calculators that will help you estimate the minimum required sample size. This is a good one to check out.
Try playing with the calculator and see how changing the value of each statistical factor affects the minimum required sample size. The relationships follow the summary below:
- A larger minimum detectable effect means a smaller required sample size.
- A higher baseline conversion rate means a smaller required sample size (for the same relative lift).
- A lower significance level (alpha) means a larger required sample size.
- A higher power (lower beta) means a larger required sample size.
Once you know what your minimum required sample size is, you can evaluate if your site has sufficient traffic to reach that sample size and how long you should be running your experiment for.
Minimum Duration
Say you are able to capture 1 million visitors per day to each variation on the page you are testing, and the required sample size is just 3 million. Do you then just need to run your experiment for 3 days and conclude? You can argue for agility and a faster launch, but this is not a good idea. You want your sample to be representative of your population, and it’s fair to assume that weekend behavior is very different from weekday behavior. Heck, Thursday behavior is very different from Friday behavior. Hence, it is recommended to run the experiment for at least 2 weeks, and at the very bare minimum for 1 week.
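A back-of-the-envelope sketch of turning the required sample size into a run duration, with assumed traffic and sample-size numbers:

```python
# A back-of-the-envelope duration check with assumed numbers.
import math

required_per_variation = 3_000_000        # from the power analysis (illustrative)
daily_visitors_per_variation = 1_000_000  # traffic routed to each variation per day

days_for_sample = math.ceil(required_per_variation / daily_visitors_per_variation)
minimum_days = 14   # cover at least two full weekly cycles
print(max(days_for_sample, minimum_days))   # -> 14
```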
Closing
Thanks for reading. My goal for this post is to help a product manager or interviewee (or anyone, really) to talk intelligibly about the intuitions in AB testing. Let me know if there are parts that are unclear.
This post was originally published on my personal blog.