Javier Fernandez


An overview of the Multiple Comparison problem

This article presents an overview of the Multiple Comparison problem: it introduces the problem, describes possible corrections, and walks through a visual example using Python code.

Photo by fotografierende on Pexels

In 2012, an Ig Nobel prize was awarded to an fMRI study of a dead salmon [1]: after running multiple tests over voxels, the authors found statistically significant activity in the brain of a dead salmon.

This study is an example of what is known as the Multiple Comparison problem, defined in Wikipedia as “the problem that occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values”. In other words, it is a problem that arises when running a large number of statistical tests in the same experiment: the more tests we run, the higher the probability of obtaining at least one test with statistical significance by chance.

In the dead salmon study, the authors examined brain activity across the 130,000 voxels of a typical fMRI volume. With such a large number of tests, the probability of obtaining at least one false positive was close to 1 (as indeed happened).

Therefore, when running multiple tests it is important to be aware of this problem. To that end, this article aims to:

  1. Teach how to calculate the probability of obtaining statistical significance between two groups in terms of α and the number of tests.
  2. Present multiple comparison corrections.
  3. Run an experiment and display the results using Python code.

1. Formula to calculate the probability of obtaining statistical significance between two groups in terms of α and the number of tests

This probability is called the Family-wise error rate (FWER), and its formula is:

FWER = 1 − (1 − α)^m

where “α” is the alpha level for an individual test (e.g. 0.05) and “m” is the number of tests

That error rate indicates the probability of making one or more false discoveries when performing multiple hypothesis tests.

If we run a single test (α = 0.05) to assess whether there is a statistically significant difference between two groups, the FWER is:

FWER = 1 − (1 − 0.05)^1 = 0.05

However, if we run the same test six times, the FWER is no longer 5%; it increases to:

FWER = 1 − (1 − 0.05)^6 ≈ 0.26
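This calculation takes a couple of lines of Python (the function name `fwer` is my own):

```python
def fwer(alpha: float, m: int) -> float:
    """Family-wise error rate: probability of at least one false
    positive when running m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

print(round(fwer(0.05, 1), 4))  # single test: 0.05
print(round(fwer(0.05, 6), 4))  # six tests: ~0.2649
```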

Figure 1 displays a graph of the FWER in terms of α and the number of tests.

Figure 1. How the FWER (Type-I error rate) increases with the number of tests, for different values of α. Source: Image by author

2. Multiple comparison correction

There are different methods to prevent this problem from happening. This article presents three multiple comparison corrections: Bonferroni, Bonferroni-Holm and Šidák.

The Bonferroni correction is the simplest and most conservative approach: it sets the α value for each individual comparison to the alpha value of a single test divided by the number of tests performed.

α’ = α / m

where “α” is the alpha level for an individual test and “m” is the number of tests performed

The α’ is the new threshold that a single test needs to reach to be classified as significant.

The cost of the Bonferroni method is that, by protecting against false positives, it increases the risk of failing to reject one or more false null hypotheses (false negatives). The Bonferroni-Holm method improves on it by sorting the obtained p-values from lowest to highest and comparing them to nominal alpha levels from α/m to α:

lowest p_value < α/m, 2nd lowest p_value < α/(m−1), …, highest p_value < α

The Šidák correction also defines a new α’ to reach. This threshold is derived from the desired FWER and the number of tests:

α’ = 1 − (1 − FWER)^(1/m)

where “FWER” is the Family-wise error rate and “m” is the number of tests performed

3. Multiple comparison problem and correction through an example

This example aims to illustrate all the terms presented above by comparing two normal distributions (rvs1 and rvs2) with different means (5 for rvs1 and 6.5 for rvs2) and standard deviations (10 for rvs1 and 8 for rvs2) against 100 normal distributions with mean and standard deviation similar to rvs1’s.

Common sense says that there should be no statistical significance for rvs1, but there should be statistical significance for rvs2.

Figure 2. On the left, the plot of the normal distribution rvs1. On the right, the plot of the normal distribution rvs2.

t-test analysis function:
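A minimal sketch of what such a function could look like (the name `t_test_analysis` and the use of `scipy.stats.ttest_ind` are assumptions, not the author's original code):

```python
from scipy import stats

def t_test_analysis(rvs, references):
    """Run an independent two-sample t-test of `rvs` against each
    reference sample and return the list of p-values."""
    return [stats.ttest_ind(rvs, ref).pvalue for ref in references]
```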

Bonferroni correction function:
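A possible sketch of this correction, counting how many p-values fall below the corrected threshold α/m (the function name is an assumption):

```python
def bonferroni(p_values, alpha=0.05):
    """Count the tests that remain significant under the
    Bonferroni-corrected threshold alpha' = alpha / m."""
    m = len(p_values)
    return sum(p < alpha / m for p in p_values)
```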

Bonferroni-Hold correction function:
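A possible sketch of the step-down procedure described in section 2, comparing sorted p-values to α/m, α/(m−1), …, α and stopping at the first failure (the function name is an assumption):

```python
def bonferroni_holm(p_values, alpha=0.05):
    """Count significant tests under the Holm step-down procedure."""
    m = len(p_values)
    significant = 0
    for i, p in enumerate(sorted(p_values)):
        if p >= alpha / (m - i):   # first failure stops the procedure
            break
        significant += 1
    return significant
```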

Šidák correction function:
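A possible sketch using the Šidák threshold α’ = 1 − (1 − α)^(1/m) from section 2 (the function name is an assumption):

```python
def sidak(p_values, alpha=0.05):
    """Count the tests that remain significant under the
    Sidak-corrected threshold 1 - (1 - alpha)**(1/m)."""
    m = len(p_values)
    threshold = 1 - (1 - alpha) ** (1 / m)
    return sum(p < threshold for p in p_values)
```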

Code to run the four functions declared above over rvs1 and rvs2. For this experiment, α is 0.05.
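A compact, self-contained version of the whole experiment might look as follows. The sample size (n = 1000), the seed, and the variable names are my assumptions, so the exact counts will differ from the author's Table 1, but the pattern should match:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)       # seed chosen for reproducibility
n, m, alpha = 1000, 100, 0.05         # n samples per distribution, m tests

rvs1 = rng.normal(loc=5, scale=10, size=n)
rvs2 = rng.normal(loc=6.5, scale=8, size=n)
references = [rng.normal(loc=5, scale=10, size=n) for _ in range(m)]

results = {}
for name, rvs in (("rvs1", rvs1), ("rvs2", rvs2)):
    p_values = sorted(stats.ttest_ind(rvs, ref).pvalue for ref in references)

    # Number of significant tests under each criterion
    uncorrected = sum(p < alpha for p in p_values)
    bonf = sum(p < alpha / m for p in p_values)
    sidak = sum(p < 1 - (1 - alpha) ** (1 / m) for p in p_values)

    holm = 0                          # step-down: stop at the first failure
    for i, p in enumerate(p_values):
        if p >= alpha / (m - i):
            break
        holm += 1

    results[name] = {"t-test": uncorrected, "Bonferroni": bonf,
                     "Holm": holm, "Šidák": sidak}
    print(name, results[name])
```

Because rvs1 is drawn from the same distribution as the references, roughly α·m of its uncorrected tests come out significant by chance, while the corrected counts drop to (near) zero; rvs2 stays significant under all criteria.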

The results are:

Table 1

The numbers in Table 1 indicate the number of tests that were found statistically significant.

As observed, the uncorrected t-test results say that both rvs1 and rvs2 are statistically significant, whereas all the correction methods show that rvs2 is the only distribution that is significant.


References

[1] Craig M. Bennett et al., Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction (2009), Journal of Serendipitous and Unexpected Results.
