avatarAmy @GrabNGoInfo

Summary

The article outlines five methods for determining the optimal number of clusters in data science and machine learning models: the Elbow method, Silhouette score, Hierarchical graph, AIC & BIC from GMM, and Gap statistics.

Abstract

In the realm of data science and machine learning, the question of how to select the appropriate number of clusters for unsupervised models is pivotal. This article delves into five prevalent techniques used to ascertain the optimal cluster count, emphasizing their respective merits and applications. The Elbow method involves plotting the within-cluster sum of squares to identify the "elbow" point indicating the optimal number of clusters. The Silhouette score assesses the similarity of data points within their assigned clusters compared to neighboring clusters, with values ranging from -1 to +1. The Hierarchical graph, or dendrogram, visually represents the distances between data points, aiding in the identification of distinct clusters. AIC and BIC scores from the Gaussian Mixture Model (GMM) are used to compare models, with lower scores indicating better models. Lastly, Gap statistics compare the within-cluster dispersion to a uniform distribution to determine the number of clusters. The article also provides guidance on selecting the most suitable method for different scenarios and offers resources for further learning, including video tutorials and blog posts.

Opinions

  • The article suggests that the Elbow method is the most widely used technique for choosing the number of clusters.
  • It implies that the Silhouette score is a robust measure for validating the quality of clustering, with values close to 1 indicating well-defined clusters.
  • The Hierarchical graph is presented as a useful tool for visualizing the structure of clusters and determining their number based on the distances between clusters.
  • The use of AIC and BIC from GMM is recommended for model selection, with a preference for simpler models when differences in scores are small.
  • The article posits that Gap statistics can provide a clear indication of the optimal number of clusters by comparing against a null reference distribution.
  • A general guideline is offered to start with the Elbow method and Hierarchical graph, and if the optimal number of clusters is not clear, to use all five methods and choose the number suggested by the majority of methods.
  • The article encourages readers to consult additional tutorials and resources provided for a deeper understanding of the methods discussed.

How to decide the number of clusters | Data Science Interview Questions and Answers

Elbow method, Silhouette score, Hierarchical graph, AIC & BIC from GMM, Gap statistics, and when to use which method

Photo by Kier... in Sight on Unsplash

In data science and machine learning interviews, how to decide the number of clusters for an unsupervised model is one of the most commonly asked questions. In this tutorial, we will talk about five ways for deciding the number of clusters and when to use which. The five methods are:

  • Elbow method
  • Silhouette score
  • Hierarchical graph
  • AIC & BIC from GMM (Gaussian Mixture Model)
  • Gap statistics

Resources for this post:

Let’s get started!

Method 1: Elbow method

The elbow method is the most widely used method for choosing the number of clusters. It runs clustering models for a range of cluster numbers and plots the within-cluster sum of squares. The number of clusters at the elbow of the plot represents the starting point of the diminishing return for additional clusters, so it is the optimal number of clusters for the clustering model.

Example

From the example graph, we can see that the biggest decrease in the within-cluster sum of squares happens at 2 clusters, and there is a relatively large decrease from 2 to 3 clusters. The decreases are small for the models with the number of clusters greater than 3. Therefore, we can conclude that the optimal number of clusters is probably 2 or 3.

Number of clusters using the elbow method — GrabNGoInfo.com

Method 2: Silhouette Score

The silhouette score measures how similar a data point is to its own cluster compared to the closest neighbor cluster. The silhouette value ranges from −1 to +1.

  • The best value is 1, and a value close to 1 means that the sample’s distance to the nearest neighbor cluster is much larger than the intra-cluster distance.
  • 0 indicate overlapping clusters[2].
  • The worst value is -1, and a value close to -1 means that the sample is assigned to the wrong cluster[2].
  • Note that Silhouette Coefficient is only defined if the number of clusters is at least 2 and at most the number of samples minus 1.

The Silhouette score of the model is the average Silhouette value of all samples.

Example

From the example graph, we can see that the model with 2 clusters has the highest value of the Silhouette score, and the model with 3 clusters has the 2nd highest value, so we get the result that there are 2 or 3 clusters.

Number of clusters using the silhouette score — GrabNGoInfo.com

Method 3: Hierarchical Graph

The hierarchical graph represents the distance between data points in a tree structure. The y-axis is the euclidean distance, so the longer the vertical line is, the larger distance between the clusters.

Example

From the graph, we can see that the two clusters connected by the blue line have the largest distance, and the two sub-clusters in red have a relatively large distance too. Therefore, the hierarchical graph suggests 2 or 3 clusters.

Number of clusters using the hierarchical graph — GrabNGoInfo.com

Method 4: AIC and BIC from GMM (Gaussian Mixture Model)

The Gaussian Mixture Model (GMM) AIC and BIC scores can help us decide the optimal number of clusters. Both AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are metrics that measure the relative prediction errors of different models. The lower the value is, the better the model is.

Example

From the example graph, we can see that B I C has the smallest value at 2 clusters, and similar value at 3 clusters. Although the AIC value for 10 clusters is the smallest, the difference between 2 or 3 clusters and 10 clusters is quite small. In this case, we will choose a simpler model with 2 or 3 clusters.

Number of clusters using AIC and BIC — GrabNGoInfo.com

Method 5: Gap Statistics

Gap statistics compares the change in within-cluster dispersion with the uniform distribution[3]. A large gap statistics value means that the clustering is very different from the uniform distribution.

Example

Anaconda.org has a notebook with the implementation of gap statistics[1]. From the example graph, we can see that the Gap statistics value generally increases with the number of clusters. The red dot shows that 8 clusters give us the largest gap value, but we can see from the visualization that after 3 clusters, the graph is relatively flat, showing a diminishing gap increase. Therefore, the optimal number of clusters is 3 if we use the elbow method.

Number of clusters using the gap statistics — GrabNGoInfo.com

Which method to use for the optimal number of clusters?

Now we have learned the 5 different ways of choosing the number of clusters, you might wonder how to choose a method for your project. Here is a general guideline:

  • Start with the elbow method for the within-cluster sum of squares and see if there is a clear elbow.
  • If the elbow is not very clear, use the hierarchical graph to help decide the optimal number.
  • If the optimal number of clusters is still not obvious after using the elbow method and hierarchical graph, use all five methods and choose the optimal number of clusters the majority of methods suggest.

To learn how to write Python code for the five methods, please check out my previous tutorial 5 Ways for Deciding Number of Clusters in a Clustering Model.

More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.

Recommended Tutorials

References

[1] Python implementation of the Gap Statistic

[2] sklearn sihouette score documentation

[3] Gap Statistics paper

Data Science Interview
Clustering
Elbow Method
Silhouette Score
Hierarchical Clustering
Recommended from ReadMedium