Statistics for Data Science: Comparing Two Means

The Student’s t-test

If you work in data analysis or data science, you know that one of the most common problems in this area is comparing two means or two proportions, whether from two different samples (or populations), from the same sample at different times, or from the same sample for different variables. Statistical inference is used to solve this type of problem. In this article, we will cover the parametric test used to compare two means from independent samples.

T-test

The comparison of two means obtained from two different populations or from two samples is called a two-sample problem. Its main objective is to compare a characteristic of two populations using a random sample drawn from each. To use the t-test, the data must satisfy two assumptions: the values in each group are approximately normally distributed, and the samples are independent. Before applying any statistical test it helps to visualize the data; density plots (or histograms) and box plots are well suited to comparing two means.

For this example, I will use the Kaggle dataset ‘Breast Cancer Wisconsin (Diagnostic) Data Set’. The two independent samples we will compare are the cases diagnosed as malignant (M) and the cases diagnosed as benign (B). So let’s get our hands dirty with the data and have fun.

#import the necessary libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import math
import statistics

#import and load dataset:
df = pd.read_csv('/content/data.csv')
df

#Get all column names:
for col_name in df.columns:
    print(col_name)

#Drop column 'Unnamed: 32' (drop returns a new DataFrame, so assign it back):
df = df.drop(['Unnamed: 32'], axis=1)

Now we can build our plots. The first plots we will build are the density plots:

#Density plots comparing every feature according to diagnosis status:

fig, axs = plt.subplots(5, 6, figsize=(40, 25))

sns.kdeplot(data=df, x="radius_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[0, 0])
sns.kdeplot(data=df, x="texture_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[0, 1])
sns.kdeplot(data=df, x="perimeter_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[0, 2])
sns.kdeplot(data=df, x="area_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[0, 3])
sns.kdeplot(data=df, x="smoothness_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[0, 4])
sns.kdeplot(data=df, x="compactness_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[0, 5])

sns.kdeplot(data=df, x="concavity_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[1, 0])
sns.kdeplot(data=df, x="concave points_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[1, 1])
sns.kdeplot(data=df, x="symmetry_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[1, 2])
sns.kdeplot(data=df, x="fractal_dimension_mean", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[1, 3])
sns.kdeplot(data=df, x="radius_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[1, 4])
sns.kdeplot(data=df, x="texture_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[1, 5])

sns.kdeplot(data=df, x="perimeter_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[2, 0])
sns.kdeplot(data=df, x="area_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[2, 1])
sns.kdeplot(data=df, x="smoothness_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[2, 2])
sns.kdeplot(data=df, x="compactness_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[2, 3])
sns.kdeplot(data=df, x="concavity_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[2, 4])
sns.kdeplot(data=df, x="concave points_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[2, 5])

sns.kdeplot(data=df, x="symmetry_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[3, 0])
sns.kdeplot(data=df, x="fractal_dimension_se", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[3, 1])
sns.kdeplot(data=df, x="radius_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[3, 2])
sns.kdeplot(data=df, x="texture_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[3, 3])
sns.kdeplot(data=df, x="perimeter_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[3, 4])
sns.kdeplot(data=df, x="area_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[3, 5])

sns.kdeplot(data=df, x="smoothness_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[4, 0])
sns.kdeplot(data=df, x="compactness_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[4, 1])
sns.kdeplot(data=df, x="concavity_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[4, 2])
sns.kdeplot(data=df, x="concave points_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[4, 3])
sns.kdeplot(data=df, x="symmetry_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[4, 4])
sns.kdeplot(data=df, x="fractal_dimension_worst", hue="diagnosis", fill=True, common_norm=False, alpha=0.4, ax=axs[4, 5])

plt.show()

And the box plots:

#Box plots comparing every feature according to diagnosis status:

fig, axs = plt.subplots(5, 6, figsize=(40, 25))

sns.boxplot(data=df, x="radius_mean", y="diagnosis", ax=axs[0, 0])
sns.boxplot(data=df, x="texture_mean", y="diagnosis", ax=axs[0, 1])
sns.boxplot(data=df, x="perimeter_mean", y="diagnosis", ax=axs[0, 2])
sns.boxplot(data=df, x="area_mean", y="diagnosis", ax=axs[0, 3])
sns.boxplot(data=df, x="smoothness_mean", y="diagnosis", ax=axs[0, 4])
sns.boxplot(data=df, x="compactness_mean", y="diagnosis", ax=axs[0, 5])

sns.boxplot(data=df, x="concavity_mean", y="diagnosis", ax=axs[1, 0])
sns.boxplot(data=df, x="concave points_mean", y="diagnosis", ax=axs[1, 1])
sns.boxplot(data=df, x="symmetry_mean", y="diagnosis", ax=axs[1, 2])
sns.boxplot(data=df, x="fractal_dimension_mean", y="diagnosis", ax=axs[1, 3])
sns.boxplot(data=df, x="radius_se", y="diagnosis", ax=axs[1, 4])
sns.boxplot(data=df, x="texture_se", y="diagnosis", ax=axs[1, 5])

sns.boxplot(data=df, x="perimeter_se", y="diagnosis", ax=axs[2, 0])
sns.boxplot(data=df, x="area_se", y="diagnosis", ax=axs[2, 1])
sns.boxplot(data=df, x="smoothness_se", y="diagnosis", ax=axs[2, 2])
sns.boxplot(data=df, x="compactness_se", y="diagnosis", ax=axs[2, 3])
sns.boxplot(data=df, x="concavity_se", y="diagnosis", ax=axs[2, 4])
sns.boxplot(data=df, x="concave points_se", y="diagnosis", ax=axs[2, 5])

sns.boxplot(data=df, x="symmetry_se", y="diagnosis", ax=axs[3, 0])
sns.boxplot(data=df, x="fractal_dimension_se", y="diagnosis", ax=axs[3, 1])
sns.boxplot(data=df, x="radius_worst", y="diagnosis", ax=axs[3, 2])
sns.boxplot(data=df, x="texture_worst", y="diagnosis", ax=axs[3, 3])
sns.boxplot(data=df, x="perimeter_worst", y="diagnosis", ax=axs[3, 4])
sns.boxplot(data=df, x="area_worst", y="diagnosis", ax=axs[3, 5])

sns.boxplot(data=df, x="smoothness_worst", y="diagnosis", ax=axs[4, 0])
sns.boxplot(data=df, x="compactness_worst", y="diagnosis", ax=axs[4, 1])
sns.boxplot(data=df, x="concavity_worst", y="diagnosis", ax=axs[4, 2])
sns.boxplot(data=df, x="concave points_worst", y="diagnosis", ax=axs[4, 3])
sns.boxplot(data=df, x="symmetry_worst", y="diagnosis", ax=axs[4, 4])
sns.boxplot(data=df, x="fractal_dimension_worst", y="diagnosis", ax=axs[4, 5])

plt.show()
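
The thirty calls in each grid above differ only in the column name and the grid cell, so they can be generated in a loop. Here is a minimal sketch, assuming the column naming used above; `feature_names`, `grid_position` and `plot_grid` are hypothetical helper names of my own, not part of seaborn:

```python
from itertools import product

# The ten measurements and three summary suffixes used in the calls above.
MEASUREMENTS = ["radius", "texture", "perimeter", "area", "smoothness",
                "compactness", "concavity", "concave points", "symmetry",
                "fractal_dimension"]
SUFFIXES = ["mean", "se", "worst"]


def feature_names():
    """Return the 30 feature columns in the same order as the plots above."""
    return [f"{m}_{s}" for s, m in product(SUFFIXES, MEASUREMENTS)]


def grid_position(index, n_cols=6):
    """Map a feature index to its (row, column) cell in the subplot grid."""
    return divmod(index, n_cols)


def plot_grid(df, kind="kde"):
    """Draw one panel per feature, split by diagnosis (hypothetical helper)."""
    # Imported here so the name helpers work without the plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axs = plt.subplots(5, 6, figsize=(40, 25))
    for i, col in enumerate(feature_names()):
        ax = axs[grid_position(i)]
        if kind == "kde":
            sns.kdeplot(data=df, x=col, hue="diagnosis", fill=True,
                        common_norm=False, alpha=0.4, ax=ax)
        else:
            sns.boxplot(data=df, x=col, y="diagnosis", ax=ax)
    plt.show()
```

Calling `plot_grid(df, kind="kde")` should reproduce the density grid and `plot_grid(df, kind="box")` the box-plot grid, with each feature landing in the same cell as in the explicit version.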

Looking at the plots, we can already expect to find significant differences between the means of the two groups for several features. So let’s now learn how to calculate the t-value and its significance.

The t-value is given by the following formula:
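
Written out for our two groups, with the sample mean, sample variance and group size denoted by $\bar{x}$, $s^2$ and $n$ (the symbols and the M/B subscripts are my own choice of notation), the unequal-variance form that matches the description and the code below is:

```latex
t = \frac{\bar{x}_M - \bar{x}_B}
         {\sqrt{\dfrac{s_M^2}{n_M} + \dfrac{s_B^2}{n_B}}}
```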

Looking at the formula, we can see that the t-value is the ratio between the difference of the two group means and the square root of the sum of each group’s variance divided by its number of samples. We can compute the t-value manually with Python:

#create different dataframes for the two groups:
df_M = df.loc[df['diagnosis'] == 'M']
df_B = df.loc[df['diagnosis'] == 'B']

#get the N for both groups:
print('The n for Malignant is: ', len(df_M))
print('The n for Benign is: ', len(df_B))

[OUT] The n for Malignant is:  212
[OUT] The n for Benign is:  357

Now we will find the difference between the two means:

#Find the difference between the two means:
mean_radius_mean_M = statistics.mean(df_M['radius_mean'])
print(mean_radius_mean_M)

[OUT] 17.462830188679245

mean_radius_mean_B = statistics.mean(df_B['radius_mean'])
print(mean_radius_mean_B)

[OUT] 12.14652380952381

mean_diff_radium_mean = mean_radius_mean_M - mean_radius_mean_B
print(mean_diff_radium_mean)

[OUT] 5.316306379155435

And the variance:

#Find the variances:

var_radius_mean_M = statistics.variance(df_M['radius_mean'])
print(var_radius_mean_M)

[OUT] 10.26543081462935

var_radius_mean_B = statistics.variance(df_B['radius_mean'])
print(var_radius_mean_B)

[OUT] 3.1702217220438738
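
A note on the function used here: `statistics.variance` returns the sample variance (dividing by n - 1), which is what the t formula expects, while `statistics.pvariance` returns the population variance (dividing by n). A small sketch with made-up numbers, not taken from the dataset, shows the difference:

```python
import statistics

# Small illustrative sample (made-up numbers, not from the dataset).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_var = statistics.variance(values)       # divides by n - 1
population_var = statistics.pvariance(values)  # divides by n

print(sample_var, population_var)
```

With groups as large as ours (212 and 357 cases) the two values are very close, but the sample variance is the appropriate one for the test.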

Now we are ready to calculate the t-value:

#Applying t-value formula:
t_value = mean_diff_radium_mean/(math.sqrt(((var_radius_mean_M/212)+(var_radius_mean_B/357))))
print('The t-value is: ', t_value)

[OUT] The t-value is:  22.208797758464517
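
The same steps can be wrapped in a small function so they are easy to reuse for other variables. This is only a sketch using the standard library; `welch_t_from_stats` and `welch_t` are hypothetical names of my own:

```python
import math
import statistics


def welch_t_from_stats(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """t-value from each group's mean, sample variance and size."""
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    return (mean_a - mean_b) / standard_error


def welch_t(sample_a, sample_b):
    """t-value computed directly from two sequences of numbers."""
    return welch_t_from_stats(
        statistics.mean(sample_a), statistics.variance(sample_a), len(sample_a),
        statistics.mean(sample_b), statistics.variance(sample_b), len(sample_b),
    )


# Reusing the means, variances and group sizes printed above.
t_radius = welch_t_from_stats(17.462830188679245, 10.26543081462935, 212,
                              12.14652380952381, 3.1702217220438738, 357)
print(t_radius)
```

With the dataframes from before, `welch_t(df_M['radius_mean'], df_B['radius_mean'])` should give the same t-value.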

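The last value we need is the degrees of freedom. Because the formula above does not pool the two variances, we use the Welch-Satterthwaite approximation instead of the simpler n1 + n2 - 2 that applies when equal variances are assumed:

#Degrees of freedom (Welch-Satterthwaite approximation):
se2_M = var_radius_mean_M/212
se2_B = var_radius_mean_B/357
dof = (se2_M + se2_B)**2 / (se2_M**2/(212-1) + se2_B**2/(357-1))
print('Dof: ', round(dof, 1))

[OUT] Dof:  289.7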

Now we just need to look up the p-value for our t-value and degrees of freedom in a t-distribution table. I can tell you that the p-value is lower than 0.001, so the difference between the two means is statistically significant.
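
Instead of a printed table, the p-value can also be computed directly. The sketch below uses only the standard library and relies on two standard results: the Welch-Satterthwaite degrees of freedom (the form that matches a t-value computed without pooling the variances) and the identity that the two-sided p-value equals the regularized incomplete beta function I_x(dof/2, 1/2) with x = dof / (dof + t²). All function names are my own, and in practice `2 * scipy.stats.t.sf(abs(t_value), dof)` does the same job:

```python
import math


def welch_dof(var_a, n_a, var_b, n_b):
    """Welch-Satterthwaite degrees of freedom for unequal variances."""
    a, b = var_a / n_a, var_b / n_b
    return (a + b) ** 2 / (a ** 2 / (n_a - 1) + b ** 2 / (n_b - 1))


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (Lentz's method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Each iteration applies the even and then the odd term of the fraction.
        for numerator in (
            m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1)),
        ):
            d = 1.0 + numerator * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + numerator / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b), the regularized incomplete beta function."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    # Use the symmetry relation where the fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def two_sided_p_value(t_value, dof):
    """P(|T| >= |t|) for a Student t distribution with dof degrees of freedom."""
    x = dof / (dof + t_value ** 2)
    return regularized_incomplete_beta(dof / 2.0, 0.5, x)


# Variances, group sizes and t-value for radius_mean, as printed above.
dof = welch_dof(10.26543081462935, 212, 3.1702217220438738, 357)
p_value = two_sided_p_value(22.208797758464517, dof)
print(dof, p_value)
```

The printed p-value should be far below 0.001, in line with the conclusion above.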

However, we don’t need to do all these calculations by hand every time; I only did them to help you understand how the t-test works. With Python, we can use a pre-defined function for this statistical test, which is easier, faster, and less error-prone. Setting equal_var=False tells SciPy not to assume equal variances (Welch’s t-test), which matches the formula we used above. Let’s see how easy it is:

#Variable radius_mean:
ttest = stats.ttest_ind(a=df_M['radius_mean'],
                        b=df_B['radius_mean'],
                        equal_var=False)
print(ttest)

[OUT] Ttest_indResult(statistic=22.208797758464524,
                      pvalue=1.6844591259582747e-64)

Now that we have the p-value for the t-test, we can see how this value manifests visually if we look again at the density plot and at the box plot:

The density plots have an area of overlap, but the two distributions are clearly centered at different values. On the box plot, the boxes of the two groups (which span the interquartile range around the median, not the mean and standard deviation) do not overlap at all. Now let’s see another example where the p-value is higher than 0.05:

#Variable smoothness_se:
ttest = stats.ttest_ind(a=df_M['smoothness_se'],
                        b=df_B['smoothness_se'],
                        equal_var=False)
print(ttest)

[OUT] Ttest_indResult(statistic=-1.6228692577349724,
                      pvalue=0.10529700302804572)

Compared with the previous example, the plots look very different: the two groups overlap almost everywhere, which is consistent with a p-value above 0.05.
