Cluster Analysis Use Case: The Wine Dataset

Photo by Klara Kulikova on Unsplash

This article is part of the “Datascience with Python” series.

Cluster analysis is a powerful technique used to identify patterns and group similar objects or data points together. In many fields, from marketing to biology, clustering analysis is used to uncover hidden insights and make better decisions.

Today, we’ll use the wine dataset to practice cluster analysis. The wine dataset is a multivariate dataset that contains the results of a chemical analysis of wines grown in a specific region of Italy. The objective is to identify patterns in wine data.

Strictly speaking, the wine dataset is a classification dataset: the data is labeled, so in theory we already know what we want to find. But we can also ignore the labels and approach it as a clustering problem.

Exploration of the Data

Before applying clustering algorithms to the wine dataset, it’s important to get a sense of the data and identify any trends or patterns that might be present.

First, let’s load the wine dataset into a pandas DataFrame and take a look at the first few rows:

import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
print(df.head())
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  od280/od315_of_diluted_wines  proline
0    14.23        1.71  2.43               15.6      127.0           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92   1065.0
1    13.20        1.78  2.14               11.2      100.0           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40   1050.0
2    13.16        2.36  2.67               18.6      101.0           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17   1185.0
3    14.37        1.95  2.50               16.8      113.0           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45   1480.0
4    13.24        2.59  2.87               21.0      118.0           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93    735.0

We can see that the wine dataset contains 13 features, such as alcohol content, malic acid, and ash. The target variable (wine class) is not included in this DataFrame, but we can access it using the wine.target attribute.
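For context (a small addition to the original code), the labels we are setting aside are available on the Bunch object returned by load_wine():

# Class labels (0, 1, 2) and their names, shown for reference only; we won't use them for clustering
print(wine.target[:5])
print(wine.target_names)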

Next, let’s create some visualizations to better understand the relationships between the features and their distributions. We can use the seaborn library to plot a heatmap of the correlation matrix and a histogram.

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

# Heatmap of the feature correlation matrix
sns.heatmap(df.corr(), ax=ax[0])
# Histogram of the alcohol feature (histplot replaces the deprecated distplot)
sns.histplot(df['alcohol'], kde=True, ax=ax[1])

plt.show()

The first visualization is a heatmap of the correlation matrix of the wine dataset, created with the sns.heatmap() function. Its purpose is to show the strength and direction of the correlations between the different pairs of variables: the color of each cell encodes the correlation value, so strongly correlated pairs stand out at a glance. The diagonal shows the correlation of each variable with itself, which is always 1.0.
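If the heatmap is hard to read at a glance, one option (a sketch added here, not part of the original article) is to list the most strongly correlated feature pairs directly:

import numpy as np

# Rank feature pairs by absolute correlation, keeping only the upper triangle to avoid duplicates
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(5))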

The second visualization is a histogram of the “alcohol” variable, created with the sns.histplot() function. Its purpose is to show the distribution of the variable and to check whether it is roughly normally distributed. This can be useful for selecting an appropriate clustering algorithm or for transforming the data if necessary.

Preprocessing

Preprocessing is an important step in any data analysis pipeline, and cluster analysis is no exception.

The first step is to standardize the data. Standardization is the process of scaling the data so that each feature has zero mean and unit variance. This helps clustering algorithms because all variables then contribute on the same scale, so no single feature dominates the distance computations. We perform standardization using the StandardScaler class from the sklearn.preprocessing module, as shown below:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(df)

The next step is dimensionality reduction. The wine dataset contains 13 variables, which can make clustering algorithms computationally more expensive and their results harder to interpret.

To address this issue, we can use principal component analysis (PCA) to reduce the dimensionality of the dataset. PCA is a technique for reducing the number of variables in a dataset while retaining as much of the original variation as possible. We perform PCA using the PCA class from the sklearn.decomposition module, as shown below:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

In this case, we reduce the dimensionality of the dataset to two dimensions, which can be easily visualized and interpreted.
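As a quick sanity check (an addition to the original code, assuming the pca object fitted above), we can look at how much of the original variance the two components retain:

# Fraction of the total variance explained by each of the two components
print(pca.explained_variance_ratio_)
print('Total variance retained:', pca.explained_variance_ratio_.sum())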

Finally, we remove outliers. They can have a significant impact on the clustering results because they distort the distance measures used by clustering algorithms.

To identify and remove outliers from the wine dataset, we can use the Local Outlier Factor (LOF) algorithm from the sklearn.neighbors module. LOF assigns each data point an outlier score based on its local density relative to its neighbors, which flags points that are significantly different from their surroundings. In scikit-learn this score is exposed (negated) through the negative_outlier_factor_ attribute, so we can, for example, drop any point whose LOF score is greater than 2.5:

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X_pca)  # fits the model; returns -1 for outliers and 1 for inliers
# negative_outlier_factor_ is the negated LOF score, so keeping points above -2.5
# removes those whose LOF score exceeds 2.5
X_pca = X_pca[lof.negative_outlier_factor_ > -2.5]

K-Means Clustering Model

Now, we can develop our clustering model. We’ll use the K-Means algorithm.

K-means clustering is a widely used algorithm for partitioning a dataset into k clusters. The algorithm works by iteratively assigning data points to the nearest cluster centroid, and then updating the centroid based on the new cluster assignment. The algorithm continues until the cluster assignments no longer change, or a maximum number of iterations is reached.
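To make the assignment/update loop concrete, here is an illustrative NumPy sketch of a single iteration (a simplified sketch, not scikit-learn’s implementation):

import numpy as np

def kmeans_step(X, centroids):
    # Assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid becomes the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids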

To apply k-means clustering to the wine dataset, we will use the KMeans class from the sklearn.cluster module. We will initialize the algorithm with k=3, for example. We will also set the random_state parameter to ensure the reproducibility of the results.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_pca)
labels = kmeans.labels_

The fit() method of the KMeans class runs k-means clustering on the preprocessed data; the resulting cluster label of each data point is stored in the labels_ attribute, which we save in the labels variable for later use.

We can then visualize the clustering results. To do this, we can use a scatter plot of the first two principal components of the preprocessed data, with each point colored according to its assigned cluster label. This will allow us to see the separation between the clusters and assess the quality of the clustering.
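A minimal sketch of such a plot, assuming the X_pca, labels, and kmeans objects from the previous steps:

# Scatter plot of the two principal components, colored by cluster assignment
plt.figure(figsize=(6, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=100, label='Centroids')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.show()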

The resulting scatter plot shows three distinct clusters of points.

We can also evaluate the quality of our clusters using the silhouette score:

from sklearn.metrics import silhouette_score

score = silhouette_score(X_pca, labels)
print('Silhouette score:', score)

The silhouette score ranges from -1 to 1, with higher values indicating better-separated clusters. Here we get a score of about 0.56, which indicates a reasonable degree of separation between the clusters.

Fine-Tuning our Clustering Model

Once we have applied the clustering algorithm to our wine dataset and obtained the cluster labels, we can fine-tune our clustering model to improve its performance. Fine-tuning the clustering model involves adjusting the hyperparameters of the clustering algorithm to optimize its performance.

One common approach to fine-tuning the clustering model is to vary the number of clusters in the dataset. The optimal number of clusters can be determined using techniques such as the elbow method and silhouette analysis.

The elbow method involves plotting the within-cluster sum of squares (WSS) against the number of clusters. The WSS measures how compact the clusters are, and it decreases as the number of clusters increases. At some point, however, the rate of decrease slows sharply, forming an elbow shape. The optimal number of clusters is where the elbow occurs, as adding more clusters beyond this point does not significantly improve the clustering.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Compute the within-cluster sum of squares (WSS) for different numbers of clusters
wss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wss.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(range(1, 11), wss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WSS')
plt.show()

Here we can see that 3 clusters seem to be the optimal number.

Then, we can perform silhouette analysis. Silhouette analysis involves computing the silhouette coefficient for each data point, which measures how similar it is to its own cluster compared to other clusters. The optimal number of clusters is where the average silhouette coefficient is the highest.

from sklearn.metrics import silhouette_score

# Compute the average silhouette score for each candidate number of clusters
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    cluster_labels = kmeans.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, cluster_labels))


# Plot the silhouette scores for each number of clusters
plt.plot(range(2, 11), silhouette_scores)
plt.title('Silhouette Analysis')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

The plot confirms that 3 is the optimal number of clusters.

In addition to varying the number of clusters, we can also experiment with different clustering algorithms and distance metrics:

from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering

# Candidate algorithms, all configured to look for 3 clusters
clustering_algorithms = [
    KMeans(n_clusters=3, random_state=42),
    AgglomerativeClustering(n_clusters=3),
    SpectralClustering(n_clusters=3, random_state=42)
]

# Compute the silhouette score of each algorithm on the standardized data
silhouette_scores = []
for algorithm in clustering_algorithms:
    cluster_labels = algorithm.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, cluster_labels))

# Compare the algorithms with a bar chart
plt.bar(range(len(clustering_algorithms)), silhouette_scores)
plt.xticks(range(len(clustering_algorithms)),
           [type(algorithm).__name__ for algorithm in clustering_algorithms], rotation=90)
plt.show()

Based on the silhouette scores, K-Means seems to be the best algorithm for this dataset.

Final Note

As you can see, a cluster analysis is a lot like a classification analysis. The main difference is that clustering is unsupervised, while classification is supervised: here, we never used the labels to fit the model.
