An overview of the Multiple Comparisons problem
This article presents an overview of the multiple comparisons problem: it introduces the problem, describes possible corrections, and walks through a visual example using Python code.
In 2012, the Ig Nobel Prize was awarded to an fMRI study of a dead salmon [1]: after running tests over many voxels, the authors found apparently significant activity in the brain of a dead salmon.
This study is an example of what is known as the multiple comparisons problem, defined in Wikipedia as “the problem that occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values”. In other words, it is a problem that arises when running a large number of statistical tests in the same experiment: the more tests we run, the higher the probability of obtaining at least one statistically significant result purely by chance.
In the dead-salmon study, the authors analysed brain activity across the 130,000 voxels of a typical fMRI volume. With that many tests, obtaining at least one false positive was almost certain (and it happened).
Therefore, when running multiple tests it is important to be aware of this problem. To warn data scientists about it, this article aims to:
- Teach how to calculate the probability of obtaining statistical significance between two groups in terms of α and the number of tests.
- Present multiple comparison corrections.
- Run an experiment and display the results using Python code.
1. Formula to calculate the probability of obtaining statistical significance between two groups in terms of α and the number of tests
This probability is called the family-wise error rate (FWER), and its formula is FWER = 1 − (1 − α)^m, where α is the significance level of a single test and m is the number of tests.
That error rate indicates the probability of making one or more false discoveries when performing multiple hypothesis tests.
If we run a single test (α = 0.05) to assess whether there is a statistically significant difference between two groups, the FWER equals α: FWER = 1 − (1 − 0.05)¹ = 0.05.
However, if we run the same test six times, the FWER is no longer 5%; it increases to 1 − (1 − 0.05)⁶ ≈ 26%.
Figure 1 displays a graph of the FWER in terms of α and the number of tests.
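The growth of the FWER with the number of tests can be checked with a few lines of Python (the function name `fwer` is just an illustrative choice):

```python
def fwer(alpha, m):
    # Probability of at least one false positive across m
    # independent tests, each run at significance level alpha:
    # FWER = 1 - (1 - alpha)^m
    return 1 - (1 - alpha) ** m

print(fwer(0.05, 1))    # a single test: ~0.05
print(fwer(0.05, 6))    # six tests: ~0.26
print(fwer(0.05, 100))  # one hundred tests: ~0.99
```

With 100 tests at α = 0.05, a false positive is all but guaranteed, which is exactly what happened over the 130,000 voxels of the salmon study.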
2. Multiple comparison corrections
There are different methods to mitigate this problem. This article presents three multiple comparison corrections: Bonferroni, Holm-Bonferroni and Šidák.
The Bonferroni correction is the simplest and most conservative approach: it divides the significance level of an individual test by the number of tests performed, α’ = α / m, and applies that corrected level to the entire set of comparisons.
The α’ is the new threshold that a single test’s p-value must fall below to be classified as significant.
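A minimal sketch of the Bonferroni correction; the p-values below are made up for illustration:

```python
def bonferroni_threshold(alpha, m):
    # Corrected per-test threshold: alpha' = alpha / m
    return alpha / m

p_values = [0.001, 0.02, 0.04]  # hypothetical p-values from 3 tests
alpha_corr = bonferroni_threshold(0.05, len(p_values))  # 0.05 / 3 ~ 0.0167
significant = [p < alpha_corr for p in p_values]
print(significant)  # only the first test survives the correction
```

Note that the second and third tests, which would pass at the uncorrected α = 0.05, are rejected once the threshold is divided by the number of tests.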
The cost of the Bonferroni method is that, by protecting against false positives, it increases the risk of failing to reject one or more false null hypotheses. The Holm-Bonferroni method improves on it by sorting the obtained p-values from lowest to highest and comparing them to nominal alpha levels ranging from α/m to α:
lowest p-value < α/m, 2nd-lowest p-value < α/(m−1), …, highest p-value < α
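The step-down procedure above can be sketched as follows (the helper name `holm_significant` and the sample p-values are illustrative):

```python
def holm_significant(p_values, alpha=0.05):
    # Sort p-values ascending and compare the k-th smallest (k = 0, 1, ...)
    # to alpha / (m - k). As soon as one comparison fails, all larger
    # p-values are declared non-significant as well (step-down rule).
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for k, i in enumerate(order):
        if p_values[i] < alpha / (m - k):
            significant[i] = True
        else:
            break
    return significant

print(holm_significant([0.04, 0.001, 0.02]))
```

With these p-values Holm-Bonferroni declares all three tests significant, whereas plain Bonferroni (threshold 0.05/3 ≈ 0.0167 for every test) would keep only the smallest one, illustrating its better power.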
The Šidák correction also defines a new α’ to reach. This threshold is derived from the desired FWER and the number of tests: α’ = 1 − (1 − α)^(1/m), where α is the family-wise error rate we want to maintain.
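The Šidák threshold can be computed directly from the FWER formula (the function name is an illustrative choice):

```python
def sidak_threshold(fwer, m):
    # Per-test level that keeps the family-wise error rate at the
    # desired value for m independent tests:
    # alpha' = 1 - (1 - FWER)^(1/m)
    return 1 - (1 - fwer) ** (1 / m)

print(sidak_threshold(0.05, 6))  # ~0.0085, slightly above 0.05/6 ~ 0.0083
```

Šidák is slightly less conservative than Bonferroni, but it is exact only when the tests are independent.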
3. Multiple comparison problem and correction through an example
This example aims to illustrate all the terms presented above by comparing two normal distributions, rvs1 and rvs2, with different means (5 for rvs1 and 6.5 for rvs2) and standard deviations (10 for rvs1 and 8 for rvs2), against 100 normal distributions with mean and standard deviation similar to those of rvs1.
Common sense says that rvs1 should show no statistically significant difference from these reference distributions, while rvs2 should.
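A minimal sketch of such an experiment using SciPy's two-sample t-test; the sample size (2000), the random seed, and the variable names are assumptions made for illustration, not the article's exact setup:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # assumed seed, for reproducibility

# The two distributions under test (parameters taken from the text).
rvs1 = rng.normal(loc=5, scale=10, size=2000)
rvs2 = rng.normal(loc=6.5, scale=8, size=2000)

# 100 reference distributions with parameters similar to rvs1.
references = [rng.normal(loc=5, scale=10, size=2000) for _ in range(100)]

alpha, m = 0.05, 100
results = {}
for name, sample in (("rvs1", rvs1), ("rvs2", rvs2)):
    p_values = [stats.ttest_ind(sample, ref).pvalue for ref in references]
    raw_hits = sum(p < alpha for p in p_values)         # uncorrected
    bonf_hits = sum(p < alpha / m for p in p_values)    # Bonferroni-corrected
    results[name] = (raw_hits, bonf_hits)
    print(name, "uncorrected:", raw_hits, "Bonferroni:", bonf_hits)
```

Without correction, rvs1 can still collect a handful of spurious "significant" results (roughly α·m of them is expected); after the Bonferroni correction those false positives disappear, while the genuine difference of rvs2 survives.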