model to our data, then use it as an argument for <code>anova_lm</code>.</p>
<figure id="bcfd">
<div>
<div>
<iframe class="gist-iframe" src="/gist/khuyentran1401/81736653d3339e18b5f3987e661e700a.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><figure id="bc8f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*YZunErhRVfFHoINGPXDe2w.png"><figcaption></figcaption></figure><p id="1997">The F-statistic of the model is 14.962217, and its p-value is 8e-06.</p><p id="54fe">Since the p-value is less than the significance level of 0.05, we reject the null hypothesis: there is enough evidence to claim that at least two levels of cotton content have different means.</p><p id="0993">The F-test only tells us that some of the means are different overall. <b>Which specific pairs</b> of cotton content levels are different? That is when Tukey’s HSD (Honest Significant Difference) comes in handy.</p><h1 id="99e7">Compare Each Pair of Means Using Tukey’s HSD</h1><p id="93e8">Tukey’s HSD finds out which specific groups’ means are different. The test compares all possible pairs of means.</p><p id="c8c0">Let’s use <code>MultiComparison</code> and its <code>tukeyhsd()</code> method to test for multiple comparisons.</p>
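<p>A minimal sketch of the pairwise test might look like the following. As before, the data values and the column names <code>cotton</code> and <code>strength</code> are illustrative assumptions rather than the article's table.</p>

```python
import pandas as pd
from statsmodels.stats.multicomp import MultiComparison

# Illustrative values (an assumption, not the article's table).
data = pd.DataFrame({
    "cotton": [15] * 5 + [20] * 5 + [25] * 5 + [30] * 5 + [35] * 5,
    "strength": [7, 7, 15, 11, 9,
                 12, 17, 12, 18, 18,
                 14, 18, 18, 19, 19,
                 19, 25, 22, 19, 23,
                 7, 10, 11, 15, 11],
})

# First argument: the measurements; second argument: the group label of each one.
comparison = MultiComparison(data["strength"], data["cotton"])

# Run Tukey's HSD on every pair of groups at a 5% family-wise error rate.
tukey_result = comparison.tukeyhsd(alpha=0.05)
print(tukey_result.summary())

# reject is a boolean array with one entry per pair, in the same order as the summary.
print(tukey_result.reject)
```

<p>With five groups there are ten pairs, so the summary has ten rows and <code>reject</code> has ten entries.</p>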
<figure id="21ec">
<div>
<div>
<iframe class="gist-iframe" src="/gist/khuyentran1401/e3ae05644f396b82fedf93e7eba2a8eb.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure>
<figure id="2554">
<div>
<div>
<img class="ratio" src="http://placehold.it/16x9">
<iframe class="" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fdatapane.com%2Fu%2Fkhuyentran1401%2Freports%2Fanova%2Fembed%2F%3Fblocksquery%3D%252F%252F%252A%255B%2540id%253D%2527summary%2527%255D&display_name=Datapane&url=https%3A%2F%2Fdatapane.com%2Fu%2Fkhuyentran1401%2Freports%2Fanova%2F%3Fblocksquery%3D%252F%252F%2A%255B%2540id%253D%2527summary%2527%255D&image=https%3A%2F%2Fstorage.googleapis.com%2Fdatapane-files-prod%2Fdp%2Fthumbnails%2Fcab4c180-5b22-4025-9e6f-3a16ccab35c7.png&key=a19fcc184b9711e1b4764040d3dc5c07&type=text%2Fhtml&schema=datapane" allowfullscreen="" frameborder="0" height="625" width="800">
</div>
</div>
</figure></iframe></div></div></figure><p id="f42b">Explanation of the table above:</p><ul><li><code>group1</code> and <code>group2</code> : the two groups being compared</li><li><code>meandiff</code> : the mean difference between <code>group1</code> and <code>group2</code></li><li><code>p-adj</code> : the p-value adjusted for multiple comparisons. A value below 0.05 means the difference between the means of <code>group1</code> and <code>group2</code> is significant at the 5% level.</li><li><code>lower</code> and <code>upper</code> : lower and upper bounds of the confidence interval for the mean difference</li><li><code>reject</code> : If it is <code>True</code>, the null hypothesis is rejected. There is enough evidence to claim that the means of the two levels of cotton being compared are significantly different.</li></ul><p id="0b27">Pairs of levels of cotton content that are statistically different:</p><ul><li>(15, 20)</li><li>(15, 25)</li><li>(15, 30)</li><li>(20, 30)</li><li>(25, 35)</li><li>(30, 35)</li></ul><p id="4ea1">To understand the results better, let’s look at the plots that visualize significant differences with one confidence interval per group.</p>
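<p>One way to draw such a plot is the <code>plot_simultaneous()</code> method of the Tukey result, sketched below. The data values, column names, and axis labels are illustrative assumptions, and the article's interactive report may have been produced differently.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
from statsmodels.stats.multicomp import MultiComparison

# Illustrative values (an assumption, not the article's table).
data = pd.DataFrame({
    "cotton": [15] * 5 + [20] * 5 + [25] * 5 + [30] * 5 + [35] * 5,
    "strength": [7, 7, 15, 11, 9,
                 12, 17, 12, 18, 18,
                 14, 18, 18, 19, 19,
                 19, 25, 22, 19, 23,
                 7, 10, 11, 15, 11],
})

tukey_result = MultiComparison(data["strength"], data["cotton"]).tukeyhsd(alpha=0.05)

# comparison_name selects the reference group: it is drawn in blue, groups whose
# intervals do not overlap it are drawn in red, and the remaining groups in gray.
fig = tukey_result.plot_simultaneous(
    comparison_name=15,
    xlabel="Tensile strength",
    ylabel="Cotton content (%)",
)
```

<p>Changing <code>comparison_name</code> to another level, such as <code>20</code> or <code>30</code>, redraws the plot with that group as the reference.</p>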
<figure id="a657">
<div>
<div>
<iframe class="gist-iframe" src="/gist/khuyentran1401/1b4577d9cd41ba84276f2b4ab19dc34b.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure>
<figure id="8b05">
<div>
<div>
<img class="ratio" src="http://placehold.it/16x9">
<iframe class="" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fdatapane.com%2Fu%2Fkhuyentran1401%2Freports%2Fanova%2Fembed%2F%3Fblocksquery%3D%252F%252F%252A%255B%2540id%253D%2527plots%2527%255D&display_name=Datapane&url=https%3A%2F%2Fdatapane.com%2Fu%2Fkhuyentran1401%2Freports%2Fanova%2F%3Fblocksquery%3D%252F%252F%2A%255B%2540id%253D%2527plots%2527%255D&image=https%3A%2F%2Fstorage.googleapis.com%2Fdatapane-files-prod%2Fdp%2Fthumbnails%2Fcab4c180-5b22-4025-9e6f-3a16ccab35c7.png&key=a19fcc184b9711e1b4764040d3dc5c07&type=text%2Fhtml&schema=datapane" allowfullscreen="" frameborder="0" height="625" width="800">
</div>
</div>
</figure></iframe></div></div></figure><p id="8d1d">The plot above compares the mean of group 15 (fiber with 15% cotton), colored blue, with the means of the other groups.</p><ul><li>Since group 35’s mean is not statistically different from group 15’s mean, group 35 is colored gray.</li><li>Since the means of groups 20, 25, and 30 are significantly different from the mean of group 15, they are colored red.</li></ul><p id="45a7">Select other options in the dropdown bar for other comparisons.</p><h1 id="a0fe">Check Model Assumptions</h1><p id="3cfa">ANOVA assumes that each sample was drawn from a normally distributed population. Before relying on the ANOVA results, we need to check that this assumption is reasonable.</p><p id="a5fb">To check for normality, we will create a Q-Q plot of the residuals. A Q-Q plot plots the quantiles of the residuals against the quantiles of a normal distribution.</p>
<figure id="49b1">
<div>
<div>
<iframe class="gist-iframe" src="/gist/khuyentran1401/78d2835e5cf8c68f09d791f8ac5f35c0.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure>
<figure id="f899">
<div>
<div>
<img class="ratio" src="http://placehold.it/16x9">
<iframe class="" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fdatapane.com%2Fu%2Fkhuyentran1401%2Freports%2Fanova%2Fembed%2F%3Fblocksquery%3D%252F%252F%252A%255B%2540id%253D%2527residual%2527%255D&display_name=Datapane&url=https%3A%2F%2Fdatapane.com%2Fu%2Fkhuyentran1401%2Freports%2Fanova%2F%3Fblocksquery%3D%252F%252F%2A%255B%2540id%253D%2527residual%2527%255D&image=https%3A%2F%2Fstorage.googleapis.com%2Fdatapane-files-prod%2Fdp%2Fthumbnails%2Fcab4c180-5b22-4025-9e6f-3a16ccab35c7.png&key=a19fcc184b9711e1b4764040d3dc5c07&type=text%2Fhtml&schema=datapane" allowfullscreen="" frameborder="0" height="625" width="800">
</div>
</div>
</figure></iframe></div></div></figure><p id="018d">Since the points fall close to the straight diagonal line in the Q-Q plot, the residuals are consistent with a normal distribution. Thus, ANOVA’s assumption of normality appears to be satisfied.</p><h1 id="9de3">Conclusion</h1><p id="59e7">Congratulations! You have just learned how to use one-way ANOVA to compare the means of three or more independent groups. No matter how good your data is, if you don’t have a good testing technique, you won’t be able to extract meaningful insights from your data.</p><p id="144a">With ANOVA, you will be able to determine whether differences in mean values between three or more groups are likely due to chance or are statistically significant. Ultimately, this helps you decide whether it is beneficial to choose one group over the others.</p><p id="b348">The source code of this article can be found here:</p><div id="63cd" class="link-block">
<a href="https://github.com/khuyentran1401/Data-science/blob/master/statistics/ANOVA_examples.ipynb">
<div>
<div>
<h2>khuyentran1401/Data-science</h2>
<div><h3>Collection of useful data science topics along with code and articles - khuyentran1401/Data-science</h3></div>
<div><p>github.com</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*BAwFIGcy_ZxwsCi-)"></div>
</div>
</div>
</a>
</div><h1 id="b177">Reference</h1><p id="705e">“One-Way ANOVA.” <i>One-Way ANOVA — An Introduction to When You Should Run This Test and the Test Hypothesis | Laerd Statistics</i>, statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php.</p></article></body>