Thomas A Dorfer


How to Evaluate Clustering Performance without Ground Truth Labels

An illustrative guide to choosing an appropriate validity metric based on cluster conformation

Photo by steve_j on Unsplash

Introduction

There are plenty of methods available from the supervised learning playbook to evaluate clustering performance for situations where the ground truth labels are known. However, performance evaluation is not as straightforward when the ground truth is not available. In this case, we need to rely on validity metrics that give us an indication of how well clusters are defined. These validity metrics rely heavily on the underlying conformation of the clusters in the data.

For data composed of convex, globular clusters, a variety of validity metrics have been proposed, the most popular being the Silhouette Coefficient. Others include the Calinski-Harabasz Index, the Davies-Bouldin Index, and the Dunn Index. However, not all data are composed of globular clusters. Some clusters can take on arbitrary shapes that are separated by density. In such cases, we can leverage the Density-Based Clustering Validation metric, or DBCV.

Note: The data used in the illustrations below was generated using scikit-learn’s make_blobs, make_moons, and make_circles functions.
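The three generators named in the note can be called roughly as follows. This is a minimal sketch: only the cluster standard deviations (1.2 and 0.4) come from the text, while the sample size, noise levels, `factor`, and `random_state` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs, make_circles, make_moons

SEED = 42  # assumed; the article does not state a seed

# Four globular clusters. cluster_std=1.2 is the overlapping case;
# cluster_std=0.4 would give the well-separated one.
X_blobs, _ = make_blobs(
    n_samples=500, centers=4, cluster_std=1.2, random_state=SEED
)

# Two interleaved crescents (noise level is an assumption).
X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=SEED)

# Two concentric circles (noise and factor are assumptions).
X_circles, _ = make_circles(
    n_samples=500, noise=0.05, factor=0.5, random_state=SEED
)

print(X_blobs.shape, X_moons.shape, X_circles.shape)
```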

Globular Clusters

For clusters with predominantly globular shapes, the Silhouette Coefficient [1] is most commonly used to evaluate performance. It is a measure based on how similar a data point is to its own cluster compared to other clusters. Its value ranges from -1, indicating incorrect clustering, to +1, indicating highly dense clustering and good separation. Values close to 0 are indicative of overlapping clusters. An algorithm that is very well suited for identifying globular clusters is K-Means.
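For a single point, the silhouette is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster; the coefficient is the average of s over all points. The following is a small pure-Python sketch of that definition, using Euclidean distance. The function name and the toy data are hypothetical and serve only to illustrate the formula.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points, using s = (b - a) / max(a, b)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so only the divisor changes)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight pairs of points far apart from each other.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_coefficient(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # pairs split apart
print(f"good labelling: {good:.2f}, bad labelling: {bad:.2f}")
```

The labelling that keeps each pair together scores close to +1, while the labelling that splits the pairs scores below 0, matching the interpretation of the range given above.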

Below is an example of data composed of four globular clusters (Fig. 1). The standard deviation of the clusters here is 1.2, which results in some overlap between clusters. Nevertheless, K-Means still achieves a decent separation, with a Silhouette Coefficient of 0.44.

Fig. 1: Raw data vs. K-Means output. Silhouette Coefficient: 0.44.

If we have extremely well-defined clusters (Fig. 2), we would expect a higher validity score. This is exactly what we obtain in the example below, which achieves a Silhouette Coefficient of 0.79. The standard deviation of the clusters in this example is 0.4, yielding much better separability.

Fig. 2: Raw data vs. K-Means output. Silhouette Coefficient: 0.79.
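The comparison between the two standard deviations can be sketched with scikit-learn's `KMeans` and `silhouette_score`. The cluster centers, sample size, and seed below are assumptions (the article does not list them), so the printed scores will not match 0.44 and 0.79 exactly; the point is only that the tighter clusters score higher.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed centers on a small grid; not taken from the article.
CENTERS = [(0, 0), (0, 4), (4, 0), (4, 4)]


def kmeans_silhouette(cluster_std, seed=0):
    """Generate four blobs, cluster them with K-Means, return the silhouette."""
    X, _ = make_blobs(
        n_samples=600, centers=CENTERS, cluster_std=cluster_std, random_state=seed
    )
    labels = KMeans(n_clusters=4, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)


score_overlapping = kmeans_silhouette(cluster_std=1.2)
score_separated = kmeans_silhouette(cluster_std=0.4)
print(f"std=1.2: {score_overlapping:.2f}, std=0.4: {score_separated:.2f}")
```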

However, if the clusters are not characterized by globular shapes, the Silhouette Coefficient may not be an appropriate validity metric. Moreover, K-Means may not be the appropriate clustering algorithm, because it assigns each point to its nearest centroid and therefore favors convex, globular shapes. In the example below, K-Means clearly misclassifies parts of the crescents, even though the Silhouette Coefficient still yields a misleadingly decent value of 0.49.

Fig. 3: K-Means Clustering on non-globular data.
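The mismatch between a decent silhouette and a visibly wrong partition can be reproduced in a few lines. This is a sketch under assumed settings (sample size, noise level, and seed are not from the article). Because the data is synthetic, the generator's labels are available and are used here only to expose the problem; in a genuinely unlabeled setting this check would not be possible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Synthetic crescents; n_samples, noise, and seed are assumptions.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

# Cluster ids are arbitrary, so take the better of the two possible
# ways of matching K-Means labels to the true crescents.
match = np.mean(labels == y_true)
agreement = max(match, 1.0 - match)

print(f"silhouette: {score:.2f}, agreement with true crescents: {agreement:.2f}")
```

A clearly positive silhouette alongside an agreement well below 1 is the pattern described above: the metric rewards the compact, convex halves that K-Means produces rather than the crescents themselves.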

Arbitrarily Shaped Clusters

For clusters with arbitrary shapes and sizes, the Density-Based Clustering Validation (DBCV) [2] is the preferred evaluation metric. It assesses cluster quality based on the within- and between-cluster density. Good results are achieved by having a high density within clusters and a low density between clusters. Similar to the Silhouette Coefficient, DBCV ranges from -1 to +1, with higher scores indicating better cluster quality.
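As a rough guide to how the score is assembled, the definitions in [2] can be summarized as below (my reading of the paper, with notation lightly simplified). DSC(C_i) is the density sparseness of cluster C_i, the largest internal edge of a minimum spanning tree built on mutual reachability distances within the cluster. DSPC(C_i, C_j) is the density separation of two clusters, the smallest mutual reachability distance between their internal points. |O| is the total number of objects, including noise, and l is the number of clusters.

```latex
V_C(C_i) = \frac{\min_{j \neq i} \mathrm{DSPC}(C_i, C_j) - \mathrm{DSC}(C_i)}
                {\max\!\left(\min_{j \neq i} \mathrm{DSPC}(C_i, C_j),\; \mathrm{DSC}(C_i)\right)}

\mathrm{DBCV}(C) = \sum_{i=1}^{l} \frac{|C_i|}{|O|} \, V_C(C_i)
```

Each V_C(C_i) lies between -1 and +1, and the weights |C_i| / |O| sum to at most 1, which is why the overall index stays in the same range. A cluster scores high when its sparsest internal region is still denser than the densest region separating it from its nearest neighbor.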

There are various commonly used clustering algorithms that are capable of identifying arbitrarily shaped clusters, such as DBSCAN, HDBSCAN, and OPTICS. For the illustrations below, I will use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise).

Let’s see how HDBSCAN performs on the two crescents:

Fig. 4: HDBSCAN on crescent-shaped clusters.

As the two crescents appear to be separable by density, it is no surprise that HDBSCAN separates them quite well with a corresponding DBCV of 0.54. By contrast, the DBCV based on the K-Means labels (Fig. 3) is -0.76, indicating incorrect clustering due to misclassified data points, as expected.

Let’s look at another example of arbitrarily shaped clusters:

Fig. 5: HDBSCAN on circular clusters.

Again, these two circles seem to be separable by density and, consequently, HDBSCAN achieves decent results with a DBCV of 0.61. K-Means, by contrast, struggles to separate them and, as a result, the associated DBCV is -0.8 (Fig. 6).

Fig. 6: K-Means on circular clusters.

Conclusion

If the ground truth is not available, one has to rely on model-based performance evaluation metrics. While the Silhouette Coefficient is a reliable metric for data composed of convex, globular clusters, it often fails on arbitrarily shaped clusters and can yield deceptively high scores despite obvious misclassifications. In such cases, DBCV is the better choice: in the examples above, it scored the K-Means labelings of the crescents and circles at -0.76 and -0.8, whereas the Silhouette Coefficient gave the crescent labeling 0.49.

References

[1] Rousseeuw, Peter J. (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20: 53–65.

[2] Moulavi, Davoud, et al. (2014). “Density-based clustering validation.” Proceedings of the 2014 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics.
