Why Plotting Your Data is Important: Exploring Anscombe’s Quartet with Python

Previously, I explained how to perform data visualization with Python. This article will illustrate why this is an important step when you have data science tasks to perform.
Indeed, data visualization allows us to uncover patterns, trends, and relationships that may not be immediately apparent in raw data.
One of the most famous examples that demonstrate the importance of plotting data is Anscombe’s Quartet.
Today, we’ll explore this dataset with Python!
Understanding Anscombe’s Quartet
Anscombe’s Quartet is a collection of four datasets that were created to highlight the importance of data visualization in statistical analysis. These datasets were introduced by the statistician Francis Anscombe in 1973 and have since become a classic example in the field of data science.
Let’s take a brief look at each of the four datasets in Anscombe’s Quartet:
Dataset I:
- x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
- y: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
This dataset exhibits a relatively linear relationship between x and y.
Dataset II:
- x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
- y: [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
Despite sharing the same x values as Dataset I, this dataset follows a clear curve: the relationship is quadratic rather than linear.
Dataset III:
- x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
- y: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
Dataset III follows a tight linear relationship, except for a single outlier that significantly affects the linear regression.
Dataset IV:
- x: [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
- y: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
This dataset demonstrates how a single outlier can completely alter the linear regression line.
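If you prefer not to depend on a network download (below we will load a CSV hosted in the seaborn repository), the four datasets are small enough to build directly from the lists above. A minimal sketch, with the values transcribed from those lists:

```python
import pandas as pd

# Anscombe's Quartet, transcribed from the lists above.
# Datasets I-III share the same x values; Dataset IV has its own.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I':   (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
# One row per observation, tagged with its dataset name.
anscombe = pd.concat(
    pd.DataFrame({'dataset': name, 'x': x, 'y': y})
    for name, (x, y) in quartet.items()
).reset_index(drop=True)
print(anscombe.groupby('dataset')['y'].mean().round(2))
```

This produces the same long-format DataFrame (columns `dataset`, `x`, `y`) as the CSV we load next.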
Exploring Anscombe’s Quartet with Python
Firstly, we can load Anscombe’s Quartet in Python with the following code:
import pandas as pd
import matplotlib.pyplot as plt

def get_anscombe_quartet():
    return pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/anscombe.csv')

if __name__ == '__main__':
    anscombe = get_anscombe_quartet()
With this method, all four datasets arrive in a single DataFrame, so after loading Anscombe's Quartet we have to split it into four separate datasets.
dataset_1 = anscombe[anscombe['dataset'] == 'I']
dataset_2 = anscombe[anscombe['dataset'] == 'II']
dataset_3 = anscombe[anscombe['dataset'] == 'III']
dataset_4 = anscombe[anscombe['dataset'] == 'IV']
datasets = [dataset_1, dataset_2, dataset_3, dataset_4]
dataset_names = ['Dataset I', 'Dataset II', 'Dataset III', 'Dataset IV']
We can display them:
for dataset, name in zip(datasets, dataset_names):
    print(name)
    print(dataset)
    print()
Dataset I
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33
5 I 14.0 9.96
6 I 6.0 7.24
7 I 4.0 4.26
8 I 12.0 10.84
9 I 7.0 4.82
10 I 5.0 5.68
Dataset II
dataset x y
11 II 10.0 9.14
12 II 8.0 8.14
13 II 13.0 8.74
14 II 9.0 8.77
15 II 11.0 9.26
16 II 14.0 8.10
17 II 6.0 6.13
18 II 4.0 3.10
19 II 12.0 9.13
20 II 7.0 7.26
21 II 5.0 4.74
Dataset III
dataset x y
22 III 10.0 7.46
23 III 8.0 6.77
24 III 13.0 12.74
25 III 9.0 7.11
26 III 11.0 7.81
27 III 14.0 8.84
28 III 6.0 6.08
29 III 4.0 5.39
30 III 12.0 8.15
31 III 7.0 6.42
32 III 5.0 5.73
Dataset IV
dataset x y
33 IV 8.0 6.58
34 IV 8.0 5.76
35 IV 8.0 7.71
36 IV 8.0 8.84
37 IV 8.0 8.47
38 IV 8.0 7.04
39 IV 8.0 5.25
40 IV 19.0 12.50
41 IV 8.0 5.56
42 IV 8.0 7.91
43 IV 8.0 6.89
Now, let’s describe them, and we’ll see something strange:
for dataset, name in zip(datasets, dataset_names):
    print(name)
    print(dataset.describe())
    print()
Dataset I
x y
count 11.000000 11.000000
mean 9.000000 7.500909
std 3.316625 2.031568
min 4.000000 4.260000
25% 6.500000 6.315000
50% 9.000000 7.580000
75% 11.500000 8.570000
max 14.000000 10.840000
Dataset II
x y
count 11.000000 11.000000
mean 9.000000 7.500909
std 3.316625 2.031657
min 4.000000 3.100000
25% 6.500000 6.695000
50% 9.000000 8.140000
75% 11.500000 8.950000
max 14.000000 9.260000
Dataset III
x y
count 11.000000 11.000000
mean 9.000000 7.500000
std 3.316625 2.030424
min 4.000000 5.390000
25% 6.500000 6.250000
50% 9.000000 7.110000
75% 11.500000 7.980000
max 14.000000 12.740000
Dataset IV
x y
count 11.000000 11.000000
mean 9.000000 7.500909
std 3.316625 2.030579
min 8.000000 5.250000
25% 8.000000 6.170000
50% 8.000000 7.040000
75% 8.000000 8.190000
max 19.000000 12.500000
As you can see, the four datasets have nearly identical means and standard deviations, for both x and y. Now, we can visualize the datasets:
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
axs[0, 0].plot(dataset_1['x'], dataset_1['y'], 'o')
axs[0, 0].set_title('Dataset I')
axs[0, 1].plot(dataset_2['x'], dataset_2['y'], 'o')
axs[0, 1].set_title('Dataset II')
axs[1, 0].plot(dataset_3['x'], dataset_3['y'], 'o')
axs[1, 0].set_title('Dataset III')
axs[1, 1].plot(dataset_4['x'], dataset_4['y'], 'o')
axs[1, 1].set_title('Dataset IV')
plt.show()

What can Anscombe’s Quartet tell us about data visualization?
Anscombe’s Quartet highlights the limitations of relying solely on summary statistics and emphasizes the need for visual exploration and graphical representation of data.
First, it shows that data summary statistics can be misleading. Despite the four datasets in Anscombe’s Quartet having identical mean, variance, correlation, and linear regression line parameters, they have distinct patterns. This illustrates that summary statistics alone cannot capture the full picture of the data.
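We can verify this claim numerically. The sketch below (with the y-values transcribed from the lists earlier in this article) computes the mean, variance, correlation, and least-squares line for each dataset:

```python
import numpy as np

# Values transcribed from the lists earlier in the article;
# datasets I-III share the same x values.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
xs = {'I': x_123, 'II': x_123, 'III': x_123,
      'IV': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]}
ys = {
    'I':   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    'II':  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    'III': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    'IV':  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

stats = {}
for name in ys:
    x, y = np.array(xs[name], dtype=float), np.array(ys[name])
    slope, intercept = np.polyfit(x, y, 1)        # least-squares line
    corr = np.corrcoef(x, y)[0, 1]                # Pearson correlation
    stats[name] = (y.mean(), y.var(ddof=1), corr, slope, intercept)
    print(f"{name}: mean_y={y.mean():.2f}  var_y={y.var(ddof=1):.2f}  "
          f"corr={corr:.3f}  fit: y={slope:.2f}x+{intercept:.2f}")
```

Every dataset comes out with a mean of about 7.50, a sample variance of about 4.12, a correlation of about 0.816, and the same regression line, y = 0.50x + 3.00.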
Then, the graphical representation reveals hidden patterns. When plotted, the four datasets in Anscombe's Quartet exhibit clearly different shapes: a linear trend, a quadratic curve, a linear trend broken by an outlier, and a vertical cluster with a single influential point. Visualizing the data allows us to uncover underlying patterns, trends, and structures that may not be immediately evident from numerical summaries. It helps us grasp the nature of the data and make informed decisions about appropriate statistical analyses.
Finally, outliers and influential points become apparent. Indeed, Anscombe’s Quartet includes datasets where we can clearly identify outliers by visualizing them. These unusual observations stand out, enabling us to assess their impact on summary statistics and regression models.
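Once spotted on a plot, such a point can also be confirmed with a simple residual check. The sketch below fits a line to Dataset III (values transcribed from the lists above) and flags points whose residual is unusually large; the 2-sigma cutoff is an arbitrary illustrative choice, not a universal rule:

```python
import numpy as np

# Dataset III, transcribed from the lists above: a tight linear
# trend plus one outlier.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Flag points lying more than two standard deviations from the fit.
cutoff = 2 * residuals.std()
outliers = np.where(np.abs(residuals) > cutoff)[0]
print("outlier indices:", outliers,
      "-> points:", list(zip(x[outliers], y[outliers])))
```

Only the point (13, 12.74) is flagged, matching what the eye picks out immediately on the scatter plot.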
Let’s finish by calculating the linear regression for each dataset and plotting it:
import numpy as np  # needed for np.polyfit below

fig, axs = plt.subplots(2, 2, figsize=(10, 10))
for dataset, name, ax in zip(datasets, dataset_names, axs.flatten()):
    ax.plot(dataset['x'], dataset['y'], 'o')
    m, b = np.polyfit(dataset['x'], dataset['y'], 1)
    ax.plot(dataset['x'], m * dataset['x'] + b)
    ax.set_title(name)
plt.show()

As you can see, the regression line is nearly the same for each dataset, despite the very different patterns. Knowing what the data looks like allows us to choose an appropriate model to analyze it.
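For instance, Dataset II looks quadratic when plotted, and a degree-2 polynomial fit confirms it. A quick sketch (values transcribed from the lists above) comparing the sum of squared residuals of the two fits:

```python
import numpy as np

# Dataset II, transcribed from the lists above: the points lie on a
# near-perfect curve, so a quadratic fit should beat the linear one.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def sse(coeffs):
    """Sum of squared residuals of a polynomial fit on (x, y)."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

linear_sse = sse(np.polyfit(x, y, 1))
quadratic_sse = sse(np.polyfit(x, y, 2))
print(f"linear SSE: {linear_sse:.3f}  quadratic SSE: {quadratic_sse:.4f}")
```

The quadratic fit is orders of magnitude closer to the data, which the summary statistics alone would never have suggested.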
Final Note
I hope this article has shown you how important it is to visualize your data for your data science tasks.
Even though it can sometimes seem tedious and repetitive, you have to force yourself to do it. It has to become a habit!