Why Plotting Your Data is Important: Exploring Anscombe’s Quartet with Python

Previously, I explained how to perform data visualization with Python. This article illustrates why visualization is an important step in any data science task.

Indeed, data visualization allows us to uncover patterns, trends, and relationships that may not be immediately apparent in raw data.

One of the most famous demonstrations of why plotting data matters is Anscombe’s Quartet.

Today, we’ll explore this dataset with Python!

Understanding Anscombe’s Quartet

Anscombe’s Quartet is a collection of four datasets that were created to highlight the importance of data visualization in statistical analysis. These datasets were introduced by the statistician Francis Anscombe in 1973 and have since become a classic example in the field of data science.

Let’s take a brief look at each of the four datasets in Anscombe’s Quartet:

Dataset I:

x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

This dataset exhibits a relatively linear relationship between x and y.

Dataset II:

x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y: [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

Unlike Dataset I, this dataset is not linear: the points follow a smooth curve that rises, peaks around x = 11, and then falls.

Dataset III:

x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

Dataset III follows an almost perfectly linear trend, except for one outlier (x = 13, y = 12.74) that pulls the regression line away from the other points.

Dataset IV:

x: [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

In this dataset, every x value is 8 except for a single point at x = 19. It demonstrates how one outlier can completely determine the linear regression line.
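To make the last point concrete, here is a minimal sketch that uses only the x values listed for Dataset IV. The variable names (x4, x_counts, x_rest, spread_rest) are illustrative and not part of the original code.

```python
from collections import Counter

# x values of Dataset IV, copied from the listing above
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

# Count how often each x value occurs
x_counts = Counter(x4)
print(x_counts)

# Drop the lone x = 19 observation and measure the remaining spread in x
x_rest = [xi for xi in x4 if xi != 19]
mean_rest = sum(x_rest) / len(x_rest)
spread_rest = sum((xi - mean_rest) ** 2 for xi in x_rest)
print(spread_rest)
```

Once the point at x = 19 is removed, the remaining x values have no spread at all. The least-squares slope divides by that spread, so without this single point no regression line could be fitted.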

Exploring Anscombe’s Quartet with Python

First, we can load Anscombe’s Quartet in Python with the following code:

import pandas as pd
import matplotlib.pyplot as plt


def get_anscombe_quartet():
    return pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/anscombe.csv')


if __name__ == '__main__':
    anscombe = get_anscombe_quartet()

With this method, all four datasets are loaded into a single DataFrame, so after loading Anscombe’s Quartet we have to split it into four separate datasets:

    dataset_1 = anscombe[anscombe['dataset'] == 'I']
    dataset_2 = anscombe[anscombe['dataset'] == 'II']
    dataset_3 = anscombe[anscombe['dataset'] == 'III']
    dataset_4 = anscombe[anscombe['dataset'] == 'IV']

    datasets = [dataset_1, dataset_2, dataset_3, dataset_4]
    dataset_names = ['Dataset I', 'Dataset II', 'Dataset III', 'Dataset IV']

We can display them:

    for dataset, name in zip(datasets, dataset_names):
        print(name)
        print(dataset)
        print()

Dataset I
   dataset     x      y
0        I  10.0   8.04
1        I   8.0   6.95
2        I  13.0   7.58
3        I   9.0   8.81
4        I  11.0   8.33
5        I  14.0   9.96
6        I   6.0   7.24
7        I   4.0   4.26
8        I  12.0  10.84
9        I   7.0   4.82
10       I   5.0   5.68

Dataset II
   dataset     x     y
11      II  10.0  9.14
12      II   8.0  8.14
13      II  13.0  8.74
14      II   9.0  8.77
15      II  11.0  9.26
16      II  14.0  8.10
17      II   6.0  6.13
18      II   4.0  3.10
19      II  12.0  9.13
20      II   7.0  7.26
21      II   5.0  4.74

Dataset III
   dataset     x      y
22     III  10.0   7.46
23     III   8.0   6.77
24     III  13.0  12.74
25     III   9.0   7.11
26     III  11.0   7.81
27     III  14.0   8.84
28     III   6.0   6.08
29     III   4.0   5.39
30     III  12.0   8.15
31     III   7.0   6.42
32     III   5.0   5.73

Dataset IV
   dataset     x      y
33      IV   8.0   6.58
34      IV   8.0   5.76
35      IV   8.0   7.71
36      IV   8.0   8.84
37      IV   8.0   8.47
38      IV   8.0   7.04
39      IV   8.0   5.25
40      IV  19.0  12.50
41      IV   8.0   5.56
42      IV   8.0   7.91
43      IV   8.0   6.89

Now, let’s describe them, and we’ll see something strange:

    for dataset, name in zip(datasets, dataset_names):
        print(name)
        print(dataset.describe())
        print()

Dataset I
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031568
min     4.000000   4.260000
25%     6.500000   6.315000
50%     9.000000   7.580000
75%    11.500000   8.570000
max    14.000000  10.840000

Dataset II
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031657
min     4.000000   3.100000
25%     6.500000   6.695000
50%     9.000000   8.140000
75%    11.500000   8.950000
max    14.000000   9.260000

Dataset III
               x          y
count  11.000000  11.000000
mean    9.000000   7.500000
std     3.316625   2.030424
min     4.000000   5.390000
25%     6.500000   6.250000
50%     9.000000   7.110000
75%    11.500000   7.980000
max    14.000000  12.740000

Dataset IV
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.030579
min     8.000000   5.250000
25%     8.000000   6.170000
50%     8.000000   7.040000
75%     8.000000   8.190000
max    19.000000  12.500000

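As you can see, all four datasets have the same mean and standard deviation for x, and nearly the same mean and standard deviation for y (they agree to two decimal places), even though the minimum, quartiles, and maximum differ. Now, we can visualize the datasets: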

    fig, axs = plt.subplots(2, 2, figsize=(10, 10))
    axs[0, 0].plot(dataset_1['x'], dataset_1['y'], 'o')
    axs[0, 0].set_title('Dataset I')
    axs[0, 1].plot(dataset_2['x'], dataset_2['y'], 'o')
    axs[0, 1].set_title('Dataset II')
    axs[1, 0].plot(dataset_3['x'], dataset_3['y'], 'o')
    axs[1, 0].set_title('Dataset III')
    axs[1, 1].plot(dataset_4['x'], dataset_4['y'], 'o')
    axs[1, 1].set_title('Dataset IV')
    plt.show()

What can Anscombe’s Quartet tell us about data visualization?

Anscombe’s Quartet highlights the limitations of relying solely on summary statistics and emphasizes the need for visual exploration and graphical representation of data.

First, it shows that summary statistics can be misleading. Although the four datasets in Anscombe’s Quartet have nearly identical means, variances, correlations, and linear regression lines, they follow distinct patterns. Summary statistics alone cannot capture the full picture of the data.
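The describe() output shown earlier does not include the correlation or the regression line. As a rough check of the claim above, the sketch below computes them using only the standard library. The data is typed in from the listings at the start of the article; the names QUARTET, X_SHARED, and summarize are illustrative and not part of the original code.

```python
from statistics import mean, variance

# Anscombe's Quartet, typed in from the values listed earlier in the article
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean and variance of y, Pearson's r, and the least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_y": my,
        "var_y": variance(y),           # sample variance (n - 1 denominator)
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation coefficient
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 3) for key, value in s.items()})
```

The four printed rows should differ only in the second or third decimal place, which is exactly why these numbers alone cannot distinguish the datasets.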

Then, the graphical representation reveals hidden patterns. When plotted, the four datasets in Anscombe’s Quartet look completely different: a noisy linear trend, a smooth curve, a tight line with one outlier, and a vertical cluster with one distant point. Visualizing the data allows us to uncover underlying patterns, trends, and structures that may not be immediately evident from numerical summaries. It helps us grasp the nature of the data and make informed decisions about appropriate statistical analyses.

Finally, outliers and influential points become apparent. Indeed, Anscombe’s Quartet includes datasets where we can clearly identify outliers by visualizing them. These unusual observations stand out, enabling us to assess their impact on summary statistics and regression models.
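As an illustration of how much a single observation can matter, the following sketch fits a least-squares line to Dataset III twice: once with all points and once without the suspected outlier. Treating the point with the largest y value as the outlier is an assumption made for this example, and fit_line is a hypothetical helper, not part of the original code.

```python
from statistics import mean

# Dataset III, copied from the listing at the start of the article
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Assumption for this sketch: the point with the largest y is the outlier
outlier_index = y3.index(max(y3))
x_trimmed = [xi for i, xi in enumerate(x3) if i != outlier_index]
y_trimmed = [yi for i, yi in enumerate(y3) if i != outlier_index]

slope_all, intercept_all = fit_line(x3, y3)
slope_trimmed, intercept_trimmed = fit_line(x_trimmed, y_trimmed)

print(f"all points:      y = {slope_all:.3f} * x + {intercept_all:.3f}")
print(f"without outlier: y = {slope_trimmed:.3f} * x + {intercept_trimmed:.3f}")
```

Removing one point out of eleven changes the slope noticeably. That is easy to anticipate once the outlier has been spotted on a plot, and hard to guess from the summary statistics.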

Let’s finish by calculating the linear regression for each dataset and plotting it:

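    import numpy as np  # required for np.polyfit; usually placed with the imports at the top of the script

    fig, axs = plt.subplots(2, 2, figsize=(10, 10))
    for dataset, name, ax in zip(datasets, dataset_names, axs.flatten()):
        # Scatter plot of the raw points
        ax.plot(dataset['x'], dataset['y'], 'o')
        # Fit a degree-1 polynomial (a straight line): m is the slope, b the intercept
        m, b = np.polyfit(dataset['x'], dataset['y'], 1)
        # Draw the fitted regression line over the points
        ax.plot(dataset['x'], m * dataset['x'] + b)
        ax.set_title(name)
    plt.show()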

As you can see, the fitted line is nearly the same for every dataset, despite the very different patterns. Knowing what the data looks like allows us to choose an appropriate model to analyze it.
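One numerical hint that a straight line is the wrong model is a systematic pattern in the residuals. The sketch below computes the residuals of the linear fit for Dataset II with the standard library only. The variable names are illustrative, and this is one possible check rather than a full model-selection procedure.

```python
from statistics import mean

# Dataset II, copied from the listing at the start of the article
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Ordinary least-squares line y = slope * x + intercept
mx, my = mean(x2), mean(y2)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x2, y2))
         / sum((xi - mx) ** 2 for xi in x2))
intercept = my - slope * mx

# Residual = observed y minus the y predicted by the line, ordered by x
residuals = sorted((xi, yi - (slope * xi + intercept)) for xi, yi in zip(x2, y2))
for xi, res in residuals:
    print(f"x={xi:>2}  residual={res:+.2f}")
```

For a well-chosen model, the residuals should scatter around zero with no visible structure. Here they should come out negative at both ends of the x range and positive in the middle, the signature of a curve that a straight line cannot follow.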

Final Note

I hope this article has shown you how important it is to visualize your data for your data science tasks.

Even though it can sometimes seem tedious and repetitive, you have to force yourself to do it. It has to become a habit!

