Cluster Analysis Use Case: The Wine Dataset

Photo by Klara Kulikova on Unsplash

This article is part of the “Datascience with Python” series.

Cluster analysis is a powerful technique used to identify patterns and group similar objects or data points together. In many fields, from marketing to biology, clustering analysis is used to uncover hidden insights and make better decisions.

Today, we’ll use the wine dataset to practice cluster analysis. The wine dataset is a multivariate dataset that contains the results of a chemical analysis of wines grown in a specific region of Italy. The objective is to identify patterns in wine data.

Strictly speaking, the wine dataset is a classification dataset: the data is labeled, so in theory we already know what we want to find. But we can also ignore the labels and approach it as a clustering problem.

Exploration of the Data

Before applying clustering algorithms to the wine dataset, it’s important to get a sense of the data and identify any trends or patterns that might be present.

First, let’s load the wine dataset into a pandas DataFrame and take a look at the first few rows:

import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
print(df.head())
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  od280/od315_of_diluted_wines  proline
0    14.23        1.71  2.43               15.6      127.0           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92   1065.0
1    13.20        1.78  2.14               11.2      100.0           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40   1050.0
2    13.16        2.36  2.67               18.6      101.0           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17   1185.0
3    14.37        1.95  2.50               16.8      113.0           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45   1480.0
4    13.24        2.59  2.87               21.0      118.0           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93    735.0

We can see that the wine dataset contains 13 features, such as alcohol content, malic acid, and ash. The target variable (wine class) is not included in this DataFrame, but we can access it using the wine.target attribute.
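For context (a small addition to the original code), the labels we are setting aside are available on the Bunch object returned by load_wine():

# Class labels (0, 1, 2) and their names, shown for reference only; we won't use them for clustering
print(wine.target[:5])
print(wine.target_names)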

Next, let’s create some visualizations to better understand the relationships between the features and their distributions. We can use the seaborn library to plot a heatmap of the correlation matrix and a histogram.

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

# Heatmap of the feature correlation matrix
sns.heatmap(df.corr(), ax=ax[0])
# Histogram of the alcohol feature (histplot replaces the deprecated distplot)
sns.histplot(df['alcohol'], kde=True, ax=ax[1])

plt.show()

The first visualization is a heatmap of the correlation matrix of the wine dataset, created with the sns.heatmap() function. Its purpose is to show the strength and direction of the correlations between the different pairs of variables: the color of each cell encodes the correlation value, so strongly correlated pairs stand out at a glance. The diagonal shows the correlation of each variable with itself, which is always 1.0.
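If the heatmap is hard to read at a glance, one option (a sketch added here, not part of the original article) is to list the most strongly correlated feature pairs directly:

import numpy as np

# Rank feature pairs by absolute correlation, keeping only the upper triangle to avoid duplicates
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(5))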

The second visualization is a histogram of the “alcohol” variable, created with the sns.histplot() function. Its purpose is to show the distribution of the variable and to check whether it is roughly normally distributed. This can be useful for selecting an appropriate clustering algorithm or for transforming the data if necessary.

Preprocessing

Preprocessing is an important step in any data analysis pipeline, and cluster analysis is no exception.

The first step is to standardize the data. Standardization is the process of scaling the data so that each feature has zero mean and unit variance. This helps clustering algorithms because all variables then contribute on the same scale, so no single feature dominates the distance computations. We perform standardization using the StandardScaler class from the sklearn.preprocessing module, as shown below:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(df)

The next step is dimensionality reduction. The wine dataset contains 13 variables, which can make clustering algorithms computationally more expensive and their results harder to interpret.

To address this issue, we can use principal component analysis (PCA) to reduce the dimensionality of the dataset. PCA is a technique for reducing the number of variables in a dataset while retaining as much of the original variation as possible. We perform PCA using the PCA class from the sklearn.decomposition module, as shown below:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

In this case, we reduce the dimensionality of the dataset to two dimensions, which can be easily visualized and interpreted.
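As a quick sanity check (an addition to the original code, assuming the pca object fitted above), we can look at how much of the original variance the two components retain:

# Fraction of the total variance explained by each of the two components
print(pca.explained_variance_ratio_)
print('Total variance retained:', pca.explained_variance_ratio_.sum())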

Finally, we remove outliers. They can have a significant impact on the clustering results because they distort the distance measures used by clustering algorithms.

To identify and remove outliers from the wine dataset, we can use the Local Outlier Factor (LOF) algorithm from the sklearn.neighbors module. LOF assigns each data point an outlier score based on its local density relative to its neighbors, which flags points that are significantly different from their surroundings. In scikit-learn this score is exposed (negated) through the negative_outlier_factor_ attribute, so we can, for example, drop any point whose LOF score is greater than 2.5:

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X_pca)  # fits the model; returns -1 for outliers and 1 for inliers
# negative_outlier_factor_ is the negated LOF score, so keeping points above -2.5
# removes those whose LOF score exceeds 2.5
X_pca = X_pca[lof.negative_outlier_factor_ > -2.5]

K-Means Clustering Model

Now, we can develop our clustering model. We’ll use the K-Means algorithm.

K-means clustering is a widely used algorithm for partitioning a dataset into k clusters. The algorithm works by iteratively assigning data points to the nearest cluster centroid, and then updating the centroid based on the new cluster assignment. The algorithm continues until the cluster assignments no longer change, or a maximum number of iterations is reached.
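To make the assignment/update loop concrete, here is an illustrative NumPy sketch of a single iteration (a simplified sketch, not scikit-learn’s implementation):

import numpy as np

def kmeans_step(X, centroids):
    # Assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid becomes the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids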

To apply k-means clustering to the wine dataset, we will use the KMeans class from the sklearn.cluster module. We will initialize the algorithm with k=3, for example. We will also set the random_state parameter to ensure the reproducibility of the results.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_pca)
labels = kmeans.labels_

The fit() method of the KMeans class runs k-means clustering on the preprocessed data; the resulting cluster label of each data point is stored in the labels_ attribute, which we save in the labels variable for later use.

We can then visualize the clustering results. To do this, we can use a scatter plot of the first two principal components of the preprocessed data, with each point colored according to its assigned cluster label. This will allow us to see the separation between the clusters and assess the quality of the clustering.
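A minimal sketch of such a plot, assuming the X_pca, labels, and kmeans objects from the previous steps:

# Scatter plot of the two principal components, colored by cluster assignment
plt.figure(figsize=(6, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=100, label='Centroids')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.show()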

The resulting scatter plot shows three distinct clusters of points.

We can also evaluate the quality of our clusters using the silhouette score:

from sklearn.metrics import silhouette_score

score = silhouette_score(X_pca, labels)
print('Silhouette score:', score)

The silhouette score ranges from -1 to 1, with higher values indicating better-separated clusters. Here we get a score of about 0.56, which indicates a reasonable degree of separation between the clusters.

Fine-Tuning our Clustering Model

Once we have applied the clustering algorithm to our wine dataset and obtained the cluster labels, we can fine-tune our clustering model to improve its performance. Fine-tuning the clustering model involves adjusting the hyperparameters of the clustering algorithm to optimize its performance.

One common approach to fine-tuning the clustering model is to vary the number of clusters in the dataset. The optimal number of clusters can be determined using techniques such as the elbow method and silhouette analysis.

The elbow method involves plotting the within-cluster sum of squares (WSS) against the number of clusters. The WSS measures how compact the clusters are, and it decreases as the number of clusters increases. At some point, however, the rate of decrease slows sharply, forming an elbow shape. The optimal number of clusters is where the elbow occurs, as adding more clusters beyond this point does not significantly improve the clustering.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Compute the within-cluster sum of squares (WSS) for different numbers of clusters
wss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wss.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(range(1, 11), wss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WSS')
plt.show()

Here we can see that 3 clusters seem to be the optimal number.

Then, we can perform silhouette analysis. Silhouette analysis involves computing the silhouette coefficient for each data point, which measures how similar it is to its own cluster compared to other clusters. The optimal number of clusters is where the average silhouette coefficient is the highest.

from sklearn.metrics import silhouette_score

# Compute the average silhouette score for each candidate number of clusters
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    cluster_labels = kmeans.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, cluster_labels))


# Plot the silhouette scores for each number of clusters
plt.plot(range(2, 11), silhouette_scores)
plt.title('Silhouette Analysis')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

The plot confirms that 3 is the optimal number of clusters.

In addition to varying the number of clusters, we can also experiment with different clustering algorithms and distance metrics:

from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering

# Candidate algorithms, all configured to look for 3 clusters
clustering_algorithms = [
    KMeans(n_clusters=3, random_state=42),
    AgglomerativeClustering(n_clusters=3),
    SpectralClustering(n_clusters=3, random_state=42)
]

# Compute the silhouette score of each algorithm on the standardized data
silhouette_scores = []
for algorithm in clustering_algorithms:
    cluster_labels = algorithm.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, cluster_labels))

# Compare the algorithms with a bar chart
plt.bar(range(len(clustering_algorithms)), silhouette_scores)
plt.xticks(range(len(clustering_algorithms)),
           [type(algorithm).__name__ for algorithm in clustering_algorithms], rotation=90)
plt.show()

Based on the silhouette scores, K-Means seems to be the best algorithm for this dataset.

Final Note

As you can see, a cluster analysis is a lot like a classification analysis. The main difference is that clustering is unsupervised, while classification is supervised: here, we never used the labels to fit the model.
