Esteban Thilliez

Summary

The article outlines a step-by-step approach to classifying the Iris dataset with Python using cluster analysis, with a focus on the k-means clustering algorithm.

Abstract

The article delves into the use of the Iris dataset for cluster analysis, a technique employed to group similar data points. It details the process of preparing the data for analysis, including loading the dataset with scikit-learn, exploring its attributes, and visualizing the data through scatterplots. The author discusses the selection of clustering algorithms, comparing k-means, hierarchical clustering, and DBSCAN, and ultimately chooses k-means for its simplicity and speed. The article also covers data preprocessing using StandardScaler to ensure attributes are on the same scale and employs the elbow method to determine the optimal number of clusters. After applying k-means clustering, the author evaluates the cluster quality using silhouette scores and plots, and suggests fine-tuning the classification based on these metrics. The article concludes by encouraging readers to follow for further exploration into cluster analysis with unlabeled data.

Opinions

  • The author posits that k-means clustering is a suitable choice for the Iris dataset due to its simplicity and the fact that the number of clusters is known in advance.
  • Hierarchical clustering is suggested for more complex datasets where the cluster structure is not predefined.
  • DBSCAN is recommended for datasets with noise or varying densities in clusters.
  • The elbow method is presented as an effective technique for identifying the optimal number of clusters in k-means clustering.
  • The silhouette score and silhouette plot are highlighted as valuable tools for assessing the quality of the resulting clusters.
  • The author expresses satisfaction with the clustering results obtained from the k-means algorithm applied to the Iris dataset.
  • The article implies that cluster analysis can be more challenging with unlabeled data, setting the stage for future discussions on the topic.

Classification Analysis Use Case: The Iris Dataset

Photo by Juliet Sarmiento on Unsplash

This article is part of the “Datascience with Python” series. You can find the other stories of this series below:

Cluster analysis is a powerful technique in data analysis and machine learning that is used to group similar data points together based on their attributes. One of the most famous datasets for practicing cluster analysis is the Iris dataset, which contains measurements of sepal length, sepal width, petal length, and petal width for 150 iris flowers of three different species.

We’ll see how to perform cluster analysis on this famous dataset.

Preparing the Data

Before performing cluster analysis on the Iris dataset, it is important to properly prepare the data. This involves loading the dataset into Python and exploring its attributes, as well as potentially preprocessing the data to improve the accuracy of the clustering.

To load the Iris dataset in Python, we can use the scikit-learn library, which provides a convenient function for loading the dataset:

from sklearn.datasets import load_iris
iris = load_iris()

This will load the dataset into a variable called iris, which is a dictionary-like object containing the data and metadata.

To explore the dataset, we can first examine the shape of the data using the shape attribute:

print(iris.data.shape)

This should output (150, 4), indicating that the dataset contains 150 samples and 4 attributes.

We can also print the names of the attributes using the feature_names attribute:

print(iris.feature_names)

This should output ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], indicating the names of the four attributes in the dataset.
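
The dataset also contains the species label for each flower. We won't use these labels for clustering, but they will be handy for interpreting our clusters later:

print(iris.target_names)

This should output ['setosa' 'versicolor' 'virginica'], the three species in the dataset.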

To visualize the data, we can use scatterplots to plot each pair of attributes against each other. For example, to plot sepal length against sepal width, we can use:

import matplotlib.pyplot as plt

plt.scatter(iris.data[:, 0], iris.data[:, 1])
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.show()

This will produce a scatterplot of the data, with sepal length on the x-axis and sepal width on the y-axis.
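
Since the Iris dataset happens to be labeled, we can also color the points by species to get a feel for how well the classes separate in this projection (optional, and not needed for the clustering itself):

plt.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.show()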

Finally, it may be necessary to preprocess the data before performing cluster analysis. This could involve scaling the data to ensure that all attributes are on the same scale, or normalizing the data to ensure that all attributes have the same range. Preprocessing techniques will depend on the clustering algorithm being used, so we’ll do this later.

Choosing a Clustering Algorithm

There are several clustering algorithms that can be used to cluster the Iris dataset, each with its own strengths and weaknesses. Here, we will discuss three popular algorithms: k-means clustering, hierarchical clustering, and DBSCAN.

K-means clustering is a popular algorithm for clustering datasets because of its simplicity and speed. The algorithm works by dividing the data into k clusters, where k is a user-defined parameter. The algorithm then iteratively assigns data points to clusters based on their distance from the cluster centers, which are updated at each iteration. The final result is a set of k clusters that minimize the sum of squared distances between each data point and its assigned cluster center.

Hierarchical clustering is another popular algorithm that creates a hierarchical structure of clusters. The algorithm works by iteratively merging the two closest clusters until all data points belong to a single cluster. There are two main types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each data point in its own cluster and iteratively merges the closest clusters, while divisive clustering starts with all data points in a single cluster and iteratively divides the cluster into smaller clusters.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together points that are close to each other and separates points that are far apart. The algorithm works by defining a neighborhood around each point and clustering points that have a sufficiently high density within that neighborhood. Points that are not part of any cluster are considered outliers.

When choosing a clustering algorithm for the Iris dataset, it is important to consider the nature of the data and the goals of the analysis. K-means clustering may be a good choice if the dataset is relatively simple and the number of clusters is known in advance. Hierarchical clustering may be a better choice if the dataset is more complex and the structure of the clusters is not known in advance. DBSCAN may be useful if the dataset is noisy and contains outliers, or if the clusters have varying densities.
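
For reference, here is a minimal sketch of how the two alternatives could be applied with scikit-learn (the parameter values are only illustrative, and we won't use these results in the rest of this article):

from sklearn.cluster import AgglomerativeClustering, DBSCAN

# Hierarchical (agglomerative) clustering into three clusters
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(iris.data)

# DBSCAN: eps is the neighborhood radius, min_samples the density threshold
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(iris.data)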

For our dataset, we’ll choose the k-means algorithm.

Data Preprocessing

Before applying the k-means clustering algorithm to the Iris dataset, it is important to preprocess the data to ensure that all attributes are on the same scale. This is because k-means clustering is a distance-based algorithm, which means that it calculates distances between data points to assign them to clusters. If the attributes are on different scales, the algorithm may be biased towards attributes with larger values.

To preprocess the data, we will use the StandardScaler class from the scikit-learn library, which scales the data to have zero mean and unit variance. To use the StandardScaler, we first import it:

from sklearn.preprocessing import StandardScaler

We can then fit the scaler to the data and transform the data using the fit_transform() method:

scaler = StandardScaler()
X = scaler.fit_transform(iris.data)

This will scale the data and store the scaled data in a variable called X.
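
As a quick sanity check, we can verify that every attribute now has (approximately) zero mean and unit variance:

print(X.mean(axis=0))
print(X.std(axis=0))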

To determine the optimal number of clusters for the k-means algorithm, we can use the elbow method, which plots the sum of squared distances between each data point and its assigned cluster center as a function of the number of clusters. The elbow method looks for the “elbow” of the plot, the point where the curve bends and adding more clusters no longer significantly reduces the sum of squared distances.

To plot the elbow curve, we can use the KMeans class from the scikit-learn library to fit the k-means algorithm to the data for different values of k, and calculate the sum of squared distances for each value of k:

from sklearn.cluster import KMeans
import numpy as np

# Calculate sum of squared distances for k values ranging from 1 to 10
ssd = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(X)
    ssd.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K, ssd, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of squared distances')
plt.title('Elbow method for optimal k')
plt.show()

Based on the elbow curve, it appears that the optimal number of clusters for the Iris dataset is three.

Applying the Algorithm

We can now apply the k-means algorithm to the scaled Iris dataset using the KMeans class from the scikit-learn library:

kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
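
The fitted model stores the cluster assignment of each sample in kmeans.labels_ and the cluster centers (in the scaled feature space) in kmeans.cluster_centers_, which we can inspect directly:

print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)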

To visualize the clusters, we can use the first two principal components of the scaled Iris dataset, which explain the majority of the variance in the data. We can calculate the first two principal components using the PCA class from the scikit-learn library:

from sklearn.decomposition import PCA

# Calculate first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
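
Before trusting the two-dimensional picture, we can check how much of the variance these two components actually capture:

print(pca.explained_variance_ratio_)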

We can then plot the clusters using a scatter plot, with the color of each point corresponding to its cluster label:

import matplotlib.pyplot as plt

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-means clustering of Iris dataset')
plt.show()

Our clusters look nice, but maybe we can do better?

Fine-Tuning our Classification

After applying the k-means algorithm to the Iris dataset, we may want to evaluate the quality of the resulting clusters and fine-tune our classification.

One way to do this is to calculate the silhouette score, which measures the similarity of each data point to its assigned cluster compared to other clusters. A silhouette score close to 1 indicates that a data point is well-matched to its assigned cluster, while a score close to -1 indicates that the data point may be better matched to a different cluster.

We can calculate the silhouette score for our clustering using the silhouette_score function from the scikit-learn library:

from sklearn.metrics import silhouette_score

silhouette_avg = silhouette_score(X, kmeans.labels_)
print("Silhouette score:", silhouette_avg)

This will calculate the average silhouette score for our clustering, which we can use to evaluate the quality of the resulting clusters. A higher silhouette score indicates that the clustering is better, while a lower score indicates that the clusters may be poorly separated or overlapping.

We can also visualize the silhouette scores for each individual data point using a silhouette plot, which shows the silhouette coefficient of each data point as a horizontal bar. The vertical position of the bar indicates the cluster to which the data point is assigned, while the length of the bar indicates the silhouette coefficient for that point:

from sklearn.metrics import silhouette_samples
import matplotlib.cm as cm

# Calculate silhouette coefficients for each data point
silhouette_vals = silhouette_samples(X, kmeans.labels_)

# Plot silhouette plot
y_lower, y_upper = 0, 0
for i in range(kmeans.n_clusters):
    cluster_silhouette_vals = silhouette_vals[kmeans.labels_ == i]
    cluster_silhouette_vals.sort()
    y_upper += len(cluster_silhouette_vals)
    plt.barh(range(y_lower, y_upper), cluster_silhouette_vals, height=1)
    plt.text(-0.05, (y_lower + y_upper) / 2, str(i))
    y_lower += len(cluster_silhouette_vals)

plt.xlabel("Silhouette coefficient")
plt.ylabel("Cluster label")
plt.title("Silhouette plot for k-means clustering")
plt.show()

If the silhouette plot shows distinct clusters with high silhouette coefficients, then the clustering may be considered good. On the other hand, if the silhouette plot shows overlapping clusters with low or negative silhouette coefficients, then the clustering may need further fine-tuning.

Here, it looks good.

Then, based on the silhouette score and silhouette plot, we can fine-tune our clustering by adjusting the number of clusters or by trying different clustering algorithms. This iterative process can help us find the best possible clustering for our dataset.
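
For example, a simple sketch of this process is to compute the average silhouette score for several candidate values of k and keep the value that scores highest:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare average silhouette scores for several candidate values of k
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))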

But for now, I’m pretty happy with our clusters. Let’s finish by printing some characteristics of each cluster:

for i in range(kmeans.n_clusters):
    print("Cluster", i)
    print("Number of data points:", len(X[kmeans.labels_ == i]))
    print("Mean:", np.mean(X[kmeans.labels_ == i], axis=0))
    print("Standard deviation:", np.std(X[kmeans.labels_ == i], axis=0))
    print("")

Final Note

Well, organizing data into clusters is not so hard when the data is labeled, as is the case for the Iris dataset. But it can be trickier when the data is unlabeled.

That’s why in the next article, I’ll talk about cluster analysis with unlabeled data. Be sure to follow me if you don’t want to miss it!

To explore the other stories of this series, click below!

To explore more of my Python stories, click here! You can also access all my content by checking this page.

If you want to be notified every time I publish a new story, subscribe to me via email by clicking here!

If you’re not subscribed to Medium yet and wish to support me or get access to all my stories, you can use my link:
