Experimentation and Causal Inference

Multiple Comparison: A Common Pitfall for A/B Testing

Data Scientists are cool!

Introduction

A recent trend in the tech sector is that companies increasingly rely on scientific methods to guide decision-making. In particular, they adopt a rigorous testing strategy for product releases and innovation. The underlying idea is:

Whenever in doubt, A/B test it.

To meet large-scale testing needs, tech companies build in-house experimentation platforms that require engineering excellence. For example, Google builds an overlapping testing infrastructure that supports hundreds, even thousands of concurrent experiments. Microsoft develops an automatic triggering system that alerts experiment owners if anything goes astray. Airbnb creates a metric repository that hosts thousands of metrics that can be easily pulled and monitored in real-time.

These examples show that we need a strong engineering team to build and maintain all of this infrastructure, which leaves the false impression that experimentation can be done solely by Platform Engineers and Data Engineers without help from Data Scientists. To my surprise, this is a rather popular misconception in the industry.

In this blog post, we go through a typical experimentation setup and see how things can turn ugly if no qualified Data Scientist or Statistician is on the team.

Why Hypothesis Testing?

Before we formally get started, let’s try to answer this question:

Why do we need hypothesis testing?

The core idea of Online Experimentation is to infer the population parameters from a finite sample. We conduct statistical tests on the sample data and decide if the difference between the treatment and control groups is an actual improvement or simply due to randomness.

Under any testing scenario, it is impossible to eliminate uncertainty entirely. We can never be 100% sure, because randomness always leaves room for surprise. Instead, we quantify how much uncertainty we are willing to tolerate, which is why Statisticians introduced the alpha level: the False Positive Rate (FPR) we accept when there is no true effect.

For any experimentation platform, it’s critical to accurately assess the FPR. If the FPR is too high, the platform becomes useless: it fires false signals so often that we cannot tell a true signal from a false one.

Why Multiple Testing Is A Problem

Here is a typical use case of A/B testing: a Product Manager wants to understand how a new feature would affect the Overall Evaluation Criterion (e.g., retention) and multiple aspects of User Engagement, including Add-to-Cart Rate, Average Session Duration, Daily Active Users, Number of Returning Customers, and Customer Satisfaction Score. In addition, she is attentive to website performance and chooses bounce rate, session crash rate, website load time, power indicator, and revenue as guardrail metrics. In total, there are 10 metrics in the experiment.

The experimentation team sets up an experiment. After 3 weeks, the Product Manager has collected enough data and feels confident about the new feature since the results include several small p-values. So, let’s roll out the new feature.

Any problems?

Yes, there is a big problem: an inflated FPR. To be specific, the probability of observing at least one false positive across multiple hypothesis tests, known as the Family-Wise Error Rate (FWER), increases significantly with the number of tests.

Here is a back-of-the-envelope calculation, setting the alpha level at 5% and assuming the 10 tests are independent:

False Positive Rate for a single test: 0.05
The probability of not committing a Type I error in a single test: 1 - 0.05 = 0.95
The probability of not committing a Type I error in any of 10 tests: (1 - 0.05)¹⁰ ≈ 0.60
The probability of committing at least one Type I error in 10 tests: 1 - (1 - 0.05)¹⁰ ≈ 0.40, or 40%.

The above calculation shows that we will observe at least one False Positive about 40% of the time, even when there is no actual difference in any metric.
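The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name `family_wise_error_rate` is made up for this example, and the calculation assumes the tests are independent.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# How quickly the FWER grows as more metrics are tested at alpha = 5%
for n_tests in (1, 5, 10, 20):
    fwer = family_wise_error_rate(0.05, n_tests)
    print(f"{n_tests:>2} tests -> FWER = {fwer:.3f}")
```

With 10 metrics the result is roughly 0.40, matching the calculation above, and it keeps climbing as more metrics are added.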

As a first side note, the correct interpretation of a 5% alpha level is this: if we re-ran the experiment many times when there is truly no difference, we would falsely reject the Null Hypothesis about 5% of the time.

As a second side note, an A/A test is an experiment in which we administer the same treatment condition to both groups. As expected, the FPR for any single A/A test should be close to the nominal alpha level (e.g., 5%), and the p-values from many A/A tests should follow a uniform distribution. Check out this post if you want to learn more about A/A tests and their underlying importance.
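To make the A/A idea concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption for illustration: the metric is drawn from a standard normal distribution, each group has 200 users, and a two-sided z-test with known variance stands in for whatever test a real platform would run.

```python
import random
from statistics import NormalDist, mean


def aa_test_p_value(rng: random.Random, n_per_group: int = 200) -> float:
    """Two-sided z-test p-value for one simulated A/A test (known unit variance)."""
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    standard_error = (2.0 / n_per_group) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


rng = random.Random(42)
p_values = [aa_test_p_value(rng) for _ in range(2000)]

# Share of A/A tests that look "significant" at alpha = 5%
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"Share of A/A tests with p < 0.05: {false_positive_rate:.3f}")

# If p-values are uniform, each of the 10 equal-width bins holds about 10% of them
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1
print([round(count / len(p_values), 2) for count in bins])
```

The share of significant A/A tests should land near 5%, and the bin shares should all hover around 0.10. A platform whose A/A results deviate noticeably from this pattern likely has a problem in its assignment or its test statistics.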

By now, you probably agree that multiple hypothesis testing without proper statistical adjustment inflates the FWER and sends false signals far too often.

In the next section, we delve into three popular statistical adjustments.

Solution 1: Bonferroni Correction

Let’s start with the most straightforward method, the Bonferroni Correction (BC). The BC method divides the alpha level by the number of comparisons, i.e., each test is run at α/n. Since the probability of at least one false positive across n tests is at most n × (α/n) = α, the BC method keeps the FWER at or below α.

In our example, we conduct multiple comparisons of 10 metrics. By default, we choose 0.05 as the individual alpha level. Using the BC method, we divide 0.05 by the number of comparisons: 0.05/10 = 0.005. If we want to claim any statistically significant results, the respective p-value has to be smaller than 0.005.

As seen, we have lowered the rejection threshold from 0.05 to 0.005, making it much harder to reject the Null Hypothesis. The method is often criticized for being too conservative and lacking statistical power: it lowers False Positives at the expense of False Negatives.

Of course, the method has several merits. For example, it is generic: it applies regardless of the test statistic used, whether the p-values are independent, or the nature of the null hypotheses (James et al. 2021, An Introduction to Statistical Learning).
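A minimal sketch of the Bonferroni Correction for the 10-metric example might look like the following. The p-values are made up for illustration and do not come from a real experiment; the helper name `bonferroni_reject` is likewise hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one flag per test: True means the Null Hypothesis is rejected."""
    threshold = alpha / len(p_values)  # e.g. 0.05 / 10 = 0.005
    return [p <= threshold for p in p_values]


# Made-up p-values for the 10 metrics, for illustration only
p_values = [0.030, 0.001, 0.55, 0.006, 0.15, 0.004, 0.81, 0.019, 0.32, 0.041]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni_reject(p_values, alpha=0.05)

print("Significant without correction:", sum(uncorrected))
print("Significant with Bonferroni:   ", sum(corrected))
```

On these illustrative numbers, six metrics look significant at the raw 0.05 level, but only the two with p-values at or below 0.005 survive the correction.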

Solution 2: Holm-Bonferroni

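Due to its conservative nature, the BC method is rarely used in practice, and Statisticians have come up with an improved version called the Holm-Bonferroni (H-B) method. It also controls the FWER, but it does so by ranking the p-values and adjusting the rejection criterion for each Null Hypothesis: the p-values are sorted from smallest to largest, the k-th smallest is compared with α/(n - k + 1), and the procedure stops at the first p-value that fails. The biggest difference between the two methods is that the H-B threshold for each test depends on where its p-value ranks among all the others, whereas the BC method applies the same threshold to every test.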

In terms of power, the H-B method is uniformly more powerful than the Bonferroni method: it rejects at least as many Null Hypotheses as the BC. If you use Bonferroni, consider switching to Holm’s procedure, as it is never less powerful.
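Here is a hedged sketch of the Holm-Bonferroni step-down procedure: sort the p-values from smallest to largest, compare the smallest with α/n, the next with α/(n - 1), and so on, stopping at the first one that fails. The function name and the p-values are made up for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: one flag per test, True means reject."""
    n = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[idx] <= alpha / (n - rank):
            reject[idx] = True
        else:
            break  # the first failure stops the procedure
    return reject


# Made-up p-values for the 10 metrics, for illustration only
p_values = [0.030, 0.001, 0.55, 0.006, 0.15, 0.004, 0.81, 0.019, 0.32, 0.041]

bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
holm = holm_reject(p_values, alpha=0.05)

print("Rejected by Bonferroni:", sum(bonferroni))
print("Rejected by Holm:      ", sum(holm))
```

On these numbers Holm rejects three hypotheses where Bonferroni rejects two: the p-value of 0.006 misses the fixed 0.005 bar but clears Holm’s third threshold of 0.05/8 = 0.00625.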

Solution 3: Benjamini-Hochberg

Unlike the Bonferroni family, which controls the FWER, the Benjamini-Hochberg procedure, proposed by Yoav Benjamini and Yosef Hochberg, controls the False Discovery Rate (FDR), defined as the expected proportion of false positives among all discoveries (significant results). For example, if you have 100 significant results and 5 of them are false discoveries (false rejections), the proportion of false discoveries is 5%.

Here is a handy comparison of FDR and FWER:

FWER: the probability of making at least one false positive across all the tests.

FDR: the expected share of false discoveries among all discoveries.
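For readers who prefer notation, the two definitions can be written compactly. The symbols are introduced here for illustration and are not from the original post: V is the number of false positives (true Null Hypotheses that were rejected) and R is the total number of rejections.

```latex
\mathrm{FWER} = \Pr(V \ge 1), \qquad
\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right]
```

The max(R, 1) in the denominator simply sets the ratio to zero when nothing is rejected.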

In production, FDR offers a more intuitive interpretation than FWER: it is the proportion of false findings among all discoveries, which makes intuitive sense to business folks.

Compared with the BC, the Benjamini-Hochberg procedure provides a weaker guarantee against False Positives but significantly reduces the False Negative Rate. It offers a better tradeoff between Type I and Type II Errors in production.
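The post does not spell out the mechanics, so here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k whose p-value is at or below k × q / m, and reject the k smallest. The function name, the target FDR level q, and the p-values are all assumptions for illustration.

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up: one flag per test, True means reject."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for k, idx in enumerate(order, start=1):
        if p_values[idx] <= k * q / m:
            largest_k = k  # keep the largest rank that passes
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Made-up p-values for the 10 metrics, for illustration only
p_values = [0.030, 0.001, 0.55, 0.006, 0.15, 0.004, 0.81, 0.019, 0.32, 0.041]

bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
bh = benjamini_hochberg_reject(p_values, q=0.05)

print("Rejected by Bonferroni:        ", sum(bonferroni))
print("Rejected by Benjamini-Hochberg:", sum(bh))
```

On these illustrative numbers Benjamini-Hochberg keeps four discoveries while Bonferroni keeps two, which is the tradeoff described above: more findings survive, in exchange for controlling the share of false discoveries rather than the chance of any false positive at all.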

If you find my post useful and want to learn more about my other content, please check out the entire content repository here: https://linktr.ee/leihua_ye.

Takeaway

Due to randomness in hypothesis testing, we want to take sampling variance into consideration and assess the probability of observing a result this extreme by chance alone.
A common approach is to control the False Positive Rate.
However, multiple comparisons without proper adjustments inflate the cumulative alpha level, rendering A/B testing useless.
This post argues that the Benjamini-Hochberg procedure is the “best” approach that balances False Positives and False Negatives.
It does not mean the Benjamini-Hochberg procedure is a one-size-fits-all solution. You may want to choose another correction method that fits your specific need.
Finally, Data Scientists are super cool.
