Esteban Thilliez

Summary

The article discusses the importance of data visualization in data science using Anscombe's Quartet as a case study to demonstrate how visual representation can reveal patterns and insights that summary statistics alone may obscure.

Abstract

Anscombe's Quartet, a set of four datasets with identical summary statistics but distinctly different patterns, serves as a powerful example to illustrate the critical role of data visualization in data science. The article explains that despite having the same mean, variance, and linear regression parameters, each dataset in the quartet exhibits unique characteristics when visualized. This underscores the limitations of relying solely on numerical summaries and the necessity of graphical methods for a comprehensive understanding of data. The author emphasizes that visualizing data helps to uncover underlying relationships, trends, and anomalies that numbers alone cannot convey, which is essential for choosing appropriate statistical models and making informed decisions in data analysis. The article also provides practical guidance on how to explore Anscombe's Quartet using Python, reinforcing the message that visual exploration should be an integral part of the data analysis workflow.

Opinions

  • The author advocates for the habitual use of data visualization as a fundamental step in data science tasks.
  • Visualizing data is seen as crucial for detecting outliers and influential points that could affect statistical analysis and model accuracy.
  • The article suggests that summary statistics can be misleading and should be complemented with visual exploration to capture the full picture of the data.
  • The author implies that the ability to visually identify different data patterns is key to selecting the most suitable model for analysis.
  • There is an underlying message that data scientists should not skip the visualization step, even when it seems repetitive or time-consuming.

Why Plotting Your Data is Important: Exploring Anscombe’s Quartet with Python

Previously, I explained how to perform data visualization with Python. This article will illustrate why this is an important step when you have data science tasks to perform.

Indeed, data visualization allows us to uncover patterns, trends, and relationships that may not be immediately apparent in raw data.

One of the most famous examples that demonstrate the importance of plotting data is Anscombe’s Quartet.

Today, we’ll explore this dataset with Python!

Understanding Anscombe’s Quartet

Anscombe’s Quartet is a collection of four datasets that were created to highlight the importance of data visualization in statistical analysis. These datasets were introduced by the statistician Francis Anscombe in 1973 and have since become a classic example in the field of data science.

Let’s take a brief look at each of the four datasets in Anscombe’s Quartet:

Dataset I:

  • x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
  • y: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

This dataset exhibits a roughly linear relationship between x and y, with some scatter around the trend.

Dataset II:

  • x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
  • y: [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

Unlike Dataset I, this dataset is not linear: the points follow a smooth curve that rises and then falls, so a straight line is a poor description of it.

Dataset III:

  • x: [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
  • y: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

Dataset III is almost perfectly linear, except for one outlier (x = 13, y = 12.74) that pulls the regression line away from the other points.

Dataset IV:

  • x: [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
  • y: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

In this dataset, every point has x = 8 except one at x = 19. It demonstrates how a single high-leverage point can determine the linear regression line on its own.
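
The lists above are short enough to type in by hand, which is handy if you want to follow along without a network connection. Here is a minimal sketch, assuming pandas is installed. The names `X_COMMON`, `QUARTET` and `build_anscombe_frame` are placeholders chosen for illustration, not part of any library. The result has the columns `dataset`, `x` and `y`, like the printed tables further down.

```python
import pandas as pd

# x values shared by Datasets I, II and III (copied from the lists above)
X_COMMON = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# Hypothetical container: dataset name -> (x values, y values)
QUARTET = {
    'I': (X_COMMON, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (X_COMMON, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (X_COMMON, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def build_anscombe_frame():
    """Stack the four datasets into one long DataFrame with a 'dataset' label column."""
    frames = [
        pd.DataFrame({'dataset': name, 'x': x, 'y': y})
        for name, (x, y) in QUARTET.items()
    ]
    return pd.concat(frames, ignore_index=True)


if __name__ == '__main__':
    anscombe = build_anscombe_frame()
    # Quick sanity check: per-dataset means of x and y
    print(anscombe.groupby('dataset')[['x', 'y']].mean().round(2))
```

Note that x is stored as integers here because the lists contain integers. Call `astype(float)` on the column if you need floating-point values.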

Exploring Anscombe’s Quartet with Python

Firstly, we can load Anscombe’s Quartet in Python with the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def get_anscombe_quartet():
    # Load all four datasets as one long DataFrame (columns: dataset, x, y)
    return pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/anscombe.csv')


if __name__ == '__main__':
    anscombe = get_anscombe_quartet()

With this method, all four datasets come back in a single DataFrame, so after loading Anscombe’s Quartet we have to split it into four separate datasets.

    dataset_1 = anscombe[anscombe['dataset'] == 'I']
    dataset_2 = anscombe[anscombe['dataset'] == 'II']
    dataset_3 = anscombe[anscombe['dataset'] == 'III']
    dataset_4 = anscombe[anscombe['dataset'] == 'IV']

    datasets = [dataset_1, dataset_2, dataset_3, dataset_4]
    dataset_names = ['Dataset I', 'Dataset II', 'Dataset III', 'Dataset IV']

We can display them:

    for dataset, name in zip(datasets, dataset_names):
        print(name)
        print(dataset)
        print()

Dataset I
   dataset     x      y
0        I  10.0   8.04
1        I   8.0   6.95
2        I  13.0   7.58
3        I   9.0   8.81
4        I  11.0   8.33
5        I  14.0   9.96
6        I   6.0   7.24
7        I   4.0   4.26
8        I  12.0  10.84
9        I   7.0   4.82
10       I   5.0   5.68

Dataset II
   dataset     x     y
11      II  10.0  9.14
12      II   8.0  8.14
13      II  13.0  8.74
14      II   9.0  8.77
15      II  11.0  9.26
16      II  14.0  8.10
17      II   6.0  6.13
18      II   4.0  3.10
19      II  12.0  9.13
20      II   7.0  7.26
21      II   5.0  4.74

Dataset III
   dataset     x      y
22     III  10.0   7.46
23     III   8.0   6.77
24     III  13.0  12.74
25     III   9.0   7.11
26     III  11.0   7.81
27     III  14.0   8.84
28     III   6.0   6.08
29     III   4.0   5.39
30     III  12.0   8.15
31     III   7.0   6.42
32     III   5.0   5.73

Dataset IV
   dataset     x      y
33      IV   8.0   6.58
34      IV   8.0   5.76
35      IV   8.0   7.71
36      IV   8.0   8.84
37      IV   8.0   8.47
38      IV   8.0   7.04
39      IV   8.0   5.25
40      IV  19.0  12.50
41      IV   8.0   5.56
42      IV   8.0   7.91
43      IV   8.0   6.89

Now, let’s describe them, and we’ll see something strange:

    for dataset, name in zip(datasets, dataset_names):
        print(name)
        print(dataset.describe())
        print()

Dataset I
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031568
min     4.000000   4.260000
25%     6.500000   6.315000
50%     9.000000   7.580000
75%    11.500000   8.570000
max    14.000000  10.840000

Dataset II
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031657
min     4.000000   3.100000
25%     6.500000   6.695000
50%     9.000000   8.140000
75%    11.500000   8.950000
max    14.000000   9.260000

Dataset III
               x          y
count  11.000000  11.000000
mean    9.000000   7.500000
std     3.316625   2.030424
min     4.000000   5.390000
25%     6.500000   6.250000
50%     9.000000   7.110000
75%    11.500000   7.980000
max    14.000000  12.740000

Dataset IV
               x          y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.030579
min     8.000000   5.250000
25%     8.000000   6.170000
50%     8.000000   7.040000
75%     8.000000   8.190000
max    19.000000  12.500000

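As you can see, the datasets have almost exactly the same mean and standard deviation for both x and y (they differ by roughly 0.001 at most), even though the minimums, maximums, and quartiles of y are different. Now, we can visualize the datasets: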

    fig, axs = plt.subplots(2, 2, figsize=(10, 10))
    axs[0, 0].plot(dataset_1['x'], dataset_1['y'], 'o')
    axs[0, 0].set_title('Dataset I')
    axs[0, 1].plot(dataset_2['x'], dataset_2['y'], 'o')
    axs[0, 1].set_title('Dataset II')
    axs[1, 0].plot(dataset_3['x'], dataset_3['y'], 'o')
    axs[1, 0].set_title('Dataset III')
    axs[1, 1].plot(dataset_4['x'], dataset_4['y'], 'o')
    axs[1, 1].set_title('Dataset IV')
    plt.show()

Anscombe’s Quartet visualization

What can Anscombe’s Quartet tell us about data visualization?

Anscombe’s Quartet highlights the limitations of relying solely on summary statistics and emphasizes the need for visual exploration and graphical representation of data.

First, it shows that summary statistics can be misleading. Although the four datasets in Anscombe’s Quartet have nearly identical means, variances, correlations, and linear regression lines, they have distinct patterns. This illustrates that summary statistics alone cannot capture the full picture of the data.
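
The output of `describe()` only covers the mean and standard deviation, so here is a minimal sketch, assuming NumPy is installed, that also computes the correlation and the least-squares line for each dataset. The names `QUARTET` and `summarize` are placeholders chosen for illustration. For all four datasets the correlation comes out at about 0.82 and the fitted line is close to y = 3 + 0.5x.

```python
import numpy as np

# x values shared by Datasets I, II and III
X_COMMON = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# Hypothetical container: dataset name -> (x values, y values)
QUARTET = {
    'I': (X_COMMON, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (X_COMMON, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (X_COMMON, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the statistics that the four datasets (almost) share."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Degree-1 polynomial fit = least-squares line y = slope * x + intercept
    slope, intercept = np.polyfit(x, y, 1)
    return {
        'mean_x': x.mean(),
        'mean_y': y.mean(),
        'var_x': x.var(ddof=1),  # sample variance
        'var_y': y.var(ddof=1),
        'corr': np.corrcoef(x, y)[0, 1],  # Pearson correlation
        'slope': slope,
        'intercept': intercept,
    }


if __name__ == '__main__':
    for name, (x, y) in QUARTET.items():
        stats = summarize(x, y)
        print(name, {key: round(float(value), 2) for key, value in stats.items()})
```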

Then, the graphical representation reveals hidden patterns. When plotted, the four datasets in Anscombe’s Quartet look nothing alike: a roughly linear trend, a smooth curve, a tight line with one outlier, and a vertical stack of points with a single point far to the right. Visualizing the data allows us to uncover underlying patterns, trends, and structures that may not be immediately evident from numerical summaries. It helps us grasp the nature of the data and make informed decisions about appropriate statistical analyses.

Finally, outliers and influential points become apparent. Indeed, Anscombe’s Quartet includes datasets where we can clearly identify outliers by visualizing them. These unusual observations stand out, enabling us to assess their impact on summary statistics and regression models.
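
To get a feel for how much one unusual point can matter, here is a minimal sketch, assuming NumPy is installed, that fits Dataset III twice: once with all the points, and once without the point that has the largest residual. The helper `fit_line` is a hypothetical name, and dropping the largest residual is only an illustration here, not a general recipe for handling outliers.

```python
import numpy as np

# Dataset III, copied from the lists earlier in the article
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])


def fit_line(x, y):
    """Least-squares line: returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return float(slope), float(intercept)


# Fit on every point, then look for the point farthest from that line
slope_all, intercept_all = fit_line(x, y)
residuals = y - (slope_all * x + intercept_all)
outlier_index = int(np.argmax(np.abs(residuals)))

# Refit without that single point
keep = np.arange(len(x)) != outlier_index
slope_trim, intercept_trim = fit_line(x[keep], y[keep])

if __name__ == '__main__':
    print(f'Suspected outlier: x={x[outlier_index]}, y={y[outlier_index]}')
    print(f'All points:      y = {intercept_all:.2f} + {slope_all:.3f} x')
    print(f'Without outlier: y = {intercept_trim:.2f} + {slope_trim:.3f} x')
```

With all eleven points the slope is about 0.5. Without the flagged point it drops to roughly 0.35, and the remaining points sit almost exactly on the new line.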

Let’s finish by calculating the linear regression for each dataset and plotting it:

    fig, axs = plt.subplots(2, 2, figsize=(10, 10))
    for dataset, name, ax in zip(datasets, dataset_names, axs.flatten()):
        ax.plot(dataset['x'], dataset['y'], 'o')
        # Least-squares line: slope m and intercept b
        # (np is NumPy; it needs "import numpy as np" next to the other imports)
        m, b = np.polyfit(dataset['x'], dataset['y'], 1)
        ax.plot(dataset['x'], m * dataset['x'] + b)
        ax.set_title(name)
    plt.show()

As you can see, the fitted line is nearly the same for each dataset (roughly y = 3 + 0.5x), despite the different patterns. Knowing what the data looks like allows us to choose an appropriate model to analyze it.
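
To make the point about model choice concrete, here is a minimal sketch, assuming NumPy is installed, that fits both a straight line and a second-degree polynomial to Dataset II and compares the worst-case error of each. Trying a quadratic is an assumption suggested by the curved shape in the plot. The helper name `max_abs_residual` is a placeholder chosen for illustration.

```python
import numpy as np

# Dataset II, copied from the lists earlier in the article
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Polynomial coefficients, highest degree first
linear = np.polyfit(x, y, 1)     # straight line
quadratic = np.polyfit(x, y, 2)  # parabola


def max_abs_residual(coeffs):
    """Largest absolute gap between the data and the fitted polynomial."""
    return float(np.max(np.abs(y - np.polyval(coeffs, x))))


if __name__ == '__main__':
    print('Linear fit, worst error:   ', round(max_abs_residual(linear), 3))
    print('Quadratic fit, worst error:', round(max_abs_residual(quadratic), 3))
```

The straight line misses some points by almost 2 units, while the parabola passes through every point almost exactly. The summary statistics alone give no hint of this, but the plot does.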

Final Note

I hope this article has shown you how important it is to visualize your data for your data science tasks.

Even though it can sometimes seem tedious and repetitive, you have to force yourself to do it. It has to become a habit!
