Alan Jeffares

Summary

K-means is an unsupervised machine learning algorithm for clustering data, which involves partitioning a dataset into K distinct clusters based on feature similarity, with the goal of minimizing the sum of squared errors within clusters.

Abstract

K-means is a popular algorithm in data science for organizing unlabelled datasets into a predetermined number of clusters based on the similarity of data points. The algorithm operates by randomly initializing centroids for each cluster and then iteratively refining their positions by alternating between assigning points to the nearest centroid and recalculating the centroids as the mean of their assigned points. The process continues until convergence, resulting in the minimization of the total sum of squared distances from points to their cluster centroids. Choosing the optimal number of clusters, K, is a critical step, often determined using heuristics such as the elbow method, which plots the within-cluster sum of squares against the number of clusters to identify the "elbow" point beyond which additional clusters do not significantly improve the model fit. The algorithm's sensitivity to initial centroids and outliers can be mitigated by running multiple iterations with different starting points and employing variations like k-medoids or k-modes for categorical data. Furthermore, cluster validation techniques and soft clustering methods like fuzzy c-means can provide additional insights into the data's structure and the confidence of cluster assignments.

Opinions

  • The author emphasizes the practicality of k-means in real-world applications, such as targeted online advertising campaigns, where manual data segmentation is infeasible.
  • The k-means algorithm is praised for its ability to find "good enough" solutions in a reasonable amount of time, despite the problem being NP-hard.
  • The author suggests that the k-means algorithm, while effective, may not always find the global optimum, and thus multiple runs with random initializations are recommended for better results.
  • The elbow method is highlighted as a useful heuristic for determining the number of clusters, K, in the absence of prior knowledge.
  • Variations of k-means, such as k-medoids and k-modes, are presented as solutions to overcome the algorithm's sensitivity to outliers and its limitation to numerical data, respectively.
  • The concept of cluster stability is introduced as a way to assess the quality of the clustering solution, particularly when visual inspection is not possible.
  • The author advocates for the use of fuzzy c-means as an alternative to traditional k-means when there is a need to capture the uncertainty of cluster assignments, providing a probabilistic approach to clustering.
  • The inclusion of R implementation code indicates the author's endorsement of practical, hands-on application of the concepts discussed, encouraging readers to engage with the algorithm directly.

K-means: A Complete Introduction

K-means is an unsupervised clustering algorithm designed to partition unlabelled data into a certain number (that’s the “K”) of distinct groupings. In other words, k-means finds observations that share important characteristics and classifies them together into clusters. A good clustering solution is one that finds clusters such that the observations within each cluster are more similar to each other than they are to observations in other clusters.

There are countless examples of where this automated grouping of data can be extremely useful. For example, consider the case of creating an online advertising campaign for a brand new range of products being released to the market. While we could display a single generic advertisement to the entire population, a far better approach would be to divide the population into clusters of people who hold shared characteristics and interests, and display customised advertisements to each group. K-means is an algorithm that finds these groupings in big datasets where it is not feasible to do so by hand.

The intuition behind the algorithm is actually pretty straightforward. To begin, we choose a value for k (the number of clusters) and randomly choose an initial centroid (centre coordinates) for each cluster. We then apply a two-step process:

  1. Assignment step — Assign each observation to its nearest centre.
  2. Update step — Update each centroid to be the centre (mean) of its assigned observations.

We repeat these two steps over and over until there is no further change in the clusters. At this point the algorithm has converged and we may retrieve our final clusterings.

The K-means algorithm in action
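To make this concrete, here is a minimal sketch of running the algorithm in R with the built-in kmeans() function from the stats package. The simulated two-cluster data is purely illustrative:

  # Simulate two well-separated groups of 2-dimensional observations
  set.seed(42)
  X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))

  # Run k-means with k = 2; kmeans() performs the assign/update loop internally
  fit <- kmeans(X, centers = 2)

  fit$cluster   # cluster assignment of each observation
  fit$centers   # final centroid coordinates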

Further detail

Before proceeding any further, it is worth taking a step back to crystallise what we mean by a good clustering solution. What exactly do we define as the best possible clustering solution? Consider the following visualisation of two possible assignments of observations to the same centroid.

Which data points should be assigned to this centroid?

It is pretty obvious that the first assignment is superior to the second, but how can we quantify this for the k-means algorithm? The solution is to consider the dotted red lines that connect each observation to the potential centre. If we were to compare the average length of these lines for the good assignment against the bad assignment, clearly the former would result in a far smaller value. That’s our goal! Of the infinite number of possible cluster centres we could potentially choose, we are looking to find the one that minimises the total sum of the squared lengths of these lines, or more formally the sum of squared error (SSE). In mathematical terms, for a cluster of observations xᵢ ∈ {x₁, x₂, … , xₘ} we wish to find the centroid C that minimises:

SSE = Σᵢ₌₁ᵐ ‖ xᵢ − C ‖²
Where, since we have defined the centroid as simply being the centre of its respective observations, we may calculate:

C = (1/m) Σᵢ₌₁ᵐ xᵢ
This equation gives us the sum of squared error for a single centroid C, but really we wish to minimise the sum of squared error across all centroids cⱼ ∈ {c₁, c₂, … , cₖ} for all observations xᵢ ∈ {x₁, x₂, … , xₙ}. Thus the goal of k-means is to minimise the total sum of squared error (SST) objective function:

SST = Σⱼ₌₁ᵏ Σᵢ₌₁ⁿ wᵢⱼ ‖ xᵢ − cⱼ ‖²,  where wᵢⱼ = 1 if observation xᵢ is assigned to cluster j and wᵢⱼ = 0 otherwise.
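As a quick sanity check of this objective, the sketch below (on purely illustrative simulated data) computes the total within-cluster sum of squares by hand and compares it with the value reported by R’s kmeans():

  set.seed(1)
  X <- matrix(rnorm(200), ncol = 2)
  fit <- kmeans(X, centers = 3)

  # Manually sum squared distances from each observation to its assigned centroid
  sst_manual <- sum((X - fit$centers[fit$cluster, ])^2)

  # kmeans() reports the same quantity as tot.withinss
  all.equal(sst_manual, fit$tot.withinss)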
Checking every possible clustering solution becomes infeasible as the data grows (the problem is in fact NP-hard), but k-means tends to find good enough solutions in a reasonable amount of time in many practical applications.

An issue that we have overlooked up till now is the problem of choosing k, the number of clusters. Occasionally we will know k from the outset (perhaps our online advertising campaign has budgeted for exactly five unique advertisements, requiring exactly five clusters), but more often than not we don’t know the optimal number of clusters in a dataset. There is no perfect solution to choosing k, but one popular heuristic is known as the elbow approach. This involves applying k-means for a range of values of k and plotting each choice of k against the resulting SST in what is known as a scree plot. Notice, however, that as we increase the number of clusters the SST will always decrease (consider what happens to the SST when we set the number of clusters equal to the number of observations), so we are looking to find the point at which adding more clusters no longer provides a significant decrease in SST. In other words, we are looking to find an elbow in the scree plot after which increasing k no longer improves our overall solution.

Finding “the elbow” where adding more clusters no longer improves our solution
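A minimal sketch of the elbow heuristic in R: fit k-means for a range of k values, record the total within-cluster sum of squares, and look for the bend in the resulting scree plot (the data here is again only illustrative):

  set.seed(123)
  X <- matrix(rnorm(300), ncol = 2)

  ks  <- 1:10
  sst <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)

  # Scree plot: SST always decreases with k, so look for the "elbow"
  plot(ks, sst, type = "b", xlab = "Number of clusters k",
       ylab = "Total within-cluster sum of squares")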

One final key aspect of k-means returns to this concept of convergence. We previously mentioned that the k-means algorithm doesn’t necessarily converge to the global minimum and instead may converge to a local minimum (i.e. k-means is not guaranteed to find the best solution). In fact, depending on which values we choose for our initial centroids, we may obtain differing results.

As we are only interested in the best clustering solution for a given choice of k, a common solution to this problem is to run k-means multiple times, each time with different randomised initial centroids, and keep only the best solution. In other words, always run k-means multiple times to ensure we find a solution close to the global minimum.
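In R’s kmeans() this is exactly what the nstart argument does: the algorithm is repeated with different random initial centroids and only the run with the lowest SST is returned. A small sketch on illustrative data:

  set.seed(7)
  X <- matrix(rnorm(300), ncol = 2)

  # A single random initialisation may land in a poor local minimum
  fit_single <- kmeans(X, centers = 3, nstart = 1)

  # 25 random initialisations; kmeans() keeps the run with the smallest tot.withinss
  fit_multi <- kmeans(X, centers = 3, nstart = 25)

  c(single = fit_single$tot.withinss, multi = fit_multi$tot.withinss)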

Further considerations

What has been discussed up to now covers the nuts and bolts of the k-means approach to clustering, leaving this section to introduce some extensions of the original algorithm.

K-medoids — One issue with the k-means algorithm is its sensitivity to outliers. As the centroid is calculated as the mean of the observations in a cluster, extreme values in a dataset can disrupt a clustering solution significantly. K-medoids is a popular approach to overcoming this problem. As the name suggests, this alternative algorithm uses medoids rather than centroids as the centre point of a cluster, which simply means that the centre of a cluster must be one of the observations in that cluster.

In this case, the medoid is a more representative centre than the centroid

This can be likened to taking the median of a set of numbers rather than the mean: just as the median is less sensitive to extreme values, so is the medoid. Using the medoid requires adjusting the update step from the regular k-means algorithm. The update phase now becomes a swap phase in which we greedily consider swapping the current medoid with other observations in the cluster and check whether the swap would improve the overall clustering solution.
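One readily available implementation is the pam() (“partitioning around medoids”) function in the cluster package; a rough sketch, assuming that package and some made-up data with an extreme outlier:

  library(cluster)

  set.seed(42)
  X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             c(50, 50))   # an extreme outlier

  fit <- pam(X, k = 2)

  fit$medoids     # cluster centres, each an actual observation from X
  fit$clustering  # cluster assignment of each observation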

Categorical Data — One limitation of the k-means algorithm we should acknowledge is that it requires the data to be numerical. Categorical variables such as gender are meaningless to k-means. What is the mean of a group of males and females? Can we still cluster data that is partly or even fully categorical?

A solution for fully categorical data is known as k-modes. This approach is very similar to k-means, but takes the mode of a cluster as the centre and then uses a new measure to calculate the distance between each observation and its cluster centre. This distance measure compares the dissimilarity of every attribute between observation X and cluster mode Z and takes a sum of these values. Observations with many attributes that differ from Z will take a higher dissimilarity value than observations that are nearly identical to Z. Specifically, for a single attribute of observation X and cluster mode Z, the dissimilarity is calculated as:

δ(x, z) = 1 if x ≠ z,   and   δ(x, z) = 1 − n⁽ʳ⁾/n if x = z
The dissimilarity measure is weighted by frequency, which attempts to account for imbalances in the distribution of values within attributes: here n is the total number of observations in the cluster, and n⁽ʳ⁾ is the number of observations in the cluster that take the relevant value r. K-modes then proceeds in the same way as k-means, assigning and updating clusters using this dissimilarity as the measure of distance.
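For a hands-on version, the kmodes() function in the klaR package implements this idea (its weighted argument switches on a frequency-weighted dissimilarity). A rough sketch, assuming that package and some made-up categorical data:

  library(klaR)

  # Purely illustrative categorical data
  set.seed(1)
  df <- data.frame(colour = factor(sample(c("red", "blue", "green"), 100, replace = TRUE)),
                   size   = factor(sample(c("S", "M", "L"), 100, replace = TRUE)),
                   shape  = factor(sample(c("circle", "square"), 100, replace = TRUE)))

  fit <- kmodes(df, modes = 3, weighted = TRUE)

  fit$modes    # the mode (most frequent value per attribute) of each cluster
  fit$cluster  # cluster assignment of each observation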

Finally, for data that is a mixture of categorical and numerical attributes, we may apply the k-prototypes algorithm, which is essentially a mixture of k-means and k-modes. Our cluster centres are now called prototypes and, as we might guess at this point, we simply apply the k-means distance measure for numerical attributes and the k-modes dissimilarity measure for categorical attributes, combining the two into an overall measure of dissimilarity. With this set-up we may, again, apply the same old assign-update approach.
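One implementation of this idea is the kproto() function from the clustMixType package; a minimal sketch, assuming that package and some made-up mixed-type data:

  library(clustMixType)

  set.seed(1)
  df <- data.frame(age    = rnorm(100, mean = 40, sd = 10),        # numerical
                   income = rnorm(100, mean = 30000, sd = 5000),   # numerical
                   gender = factor(sample(c("M", "F"), 100, replace = TRUE)))  # categorical

  fit <- kproto(df, k = 3)

  fit$centers  # the prototypes: means for numeric attributes, modes for categorical ones
  fit$cluster  # cluster assignment of each observation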

Cluster Validation — One of the drawbacks of unsupervised approaches such as k-means is that there is no obvious method of assessing the solution we get. Did the solution we obtained capture the cluster structure sufficiently? Are there even any clusters in the data to begin with? Sure, sometimes we can plot the data and eyeball the results, but this only works in two dimensions, and clustering often only becomes interesting in higher dimensions anyway (why bother with k-means if we can already see the clusters in the data?).

One popular approach involves examining the stability of a clustering solution. The rationale behind this approach is that if there really are clusters in the data, then when we compare a number of samples from that data, they should exhibit similar clustering solutions. In the example below we take two datasets, one that clearly contains three clusters and another that is just uniformly randomly placed observations. We take a number of bootstrap samples (samples with replacement) of each and apply k-means to each of these samples. On the structured data, k-means repeatedly finds similar solutions; on the unstructured data, however, the clusterings are far more inconsistent. This difference in stability can be quantified more rigorously by comparing the locations of the cluster means or by using more technical statistics, such as the average silhouette width, across successive samples.
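A rough sketch of this stability check in R, using bootstrap resamples and the average silhouette width from the cluster package as the comparison statistic (the data and number of resamples are purely illustrative):

  library(cluster)

  set.seed(42)
  X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))   # three clear clusters

  avg_sil <- replicate(20, {
    # Bootstrap sample (with replacement) of the rows
    Xb  <- X[sample(nrow(X), replace = TRUE), ]
    fit <- kmeans(Xb, centers = 3, nstart = 10)
    # Average silhouette width of this bootstrap clustering
    mean(silhouette(fit$cluster, dist(Xb))[, "sil_width"])
  })

  # Stable structure should give consistently high values with little spread
  summary(avg_sil)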

Fuzzy C-means — Another limitation of K-means that we have yet to address can be attributed to the difference between hard clustering and soft clustering. K-means is a hard clustering approach, meaning that each observation is partitioned into a single cluster with no information about how confident we are in this assignment. In reality, if an observation is approximately halfway between two centroids, it would be useful to have that uncertainty encoded into the output. Soft clustering solves this problem by giving us an estimate of the probability that an observation belongs to each of the clusters rather than a single classification. Of course we can still assign each observation to whichever centroid offers the highest value, but now we have an estimate of our confidence in this decision, allowing us to differentiate between observations that are right beside a certain cluster’s centroid and those that are a little more ambiguous. Fuzzy C-means takes this soft clustering approach.

So how does Fuzzy C-means differ from the original K-means algorithm? Fortunately, this adjustment doesn’t change our approach very much at all; it just includes what is known as the fuzzifier term wᵢⱼᵐ (with m simply a hyperparameter) to incorporate the weight of our membership certainty of observation xᵢ in cluster j. This means that the Fuzzy C-means objective function we wish to minimise is given by:

J = Σᵢ₌₁ⁿ Σⱼ₌₁ᵏ wᵢⱼᵐ ‖ xᵢ − cⱼ ‖²
where

wᵢⱼ = 1 / Σₗ₌₁ᵏ ( ‖ xᵢ − cⱼ ‖ / ‖ xᵢ − cₗ ‖ )^(2/(m−1))
And applying these new equations we repeat the exact same assignment and update approach taken in the original K-means.
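One readily available implementation is cmeans() in the e1071 package, where the m argument is the fuzzifier discussed above; a small sketch on illustrative data, assuming that package:

  library(e1071)

  set.seed(42)
  X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))

  fit <- cmeans(X, centers = 2, m = 2)

  head(fit$membership)  # each row: estimated degree of membership in each cluster
  fit$cluster           # hard assignment: the cluster with the highest membership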

R implementation
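In the spirit of the hands-on approach, here is one possible bare-bones sketch of the assign/update loop described above, written for clarity rather than efficiency (in practice the built-in kmeans() with a suitable nstart is the sensible choice):

  # A bare-bones k-means: repeat the assignment and update steps until
  # the cluster assignments stop changing.
  simple_kmeans <- function(X, k, max_iter = 100) {
    X <- as.matrix(X)
    # Randomly choose k observations as the initial centroids
    centroids <- X[sample(nrow(X), k), , drop = FALSE]
    assignment <- rep(0L, nrow(X))

    for (iter in seq_len(max_iter)) {
      # Assignment step: each observation goes to its nearest centroid
      dists <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))
      new_assignment <- max.col(-dists)

      # Converged once no assignment changes
      if (all(new_assignment == assignment)) break
      assignment <- new_assignment

      # Update step: each centroid becomes the mean of its assigned observations
      for (j in seq_len(k)) {
        members <- X[assignment == j, , drop = FALSE]
        if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
      }
    }

    list(cluster = assignment,
         centers = centroids,
         sst = sum((X - centroids[assignment, , drop = FALSE])^2))
  }

  # Example usage on simulated data
  set.seed(42)
  X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
  fit <- simple_kmeans(X, k = 2)
  fit$centers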

Well done on making it all the way to the end. Follow me on Twitter for a heads-up on future posts. All visualisations are my own.

Machine Learning
Data Science
Tutorial
Statistics
Unsupervised Learning