Understanding K-Means, K-Means++ and, K-Medoids Clustering Algorithms
An overview of K-means, K-means++ and, K-Medoids clustering algorithms, and their relations. This article also includes its implementation from scratch and using the sklearn library.
Clustering is an unsupervised machine learning technique that divides the population or data points into several groups or clusters such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups.
- Points in the same cluster are closer to each other.
- Points in the different clusters are far apart.

In the above sample 2-dimension dataset, it is visible that the dataset forms 3 clusters that are far apart, and points in the same cluster are close to each other.
What is the Metric for Clustering?
Clustering is considered to be best if it has maximum intercluster distance and minimum intracluster distance.
Notes to avoid any confusion:Intracluster distance: Distance between two point in the same cluster.
Intercluster distance: Distance between two points in the different clusters.
The image above (Image 2), describes what is an intercluster and intracluster distance.
Various evaluation metrics are using which the effectiveness of formed clusters can be measured.
Dunn Index:

The numerator of the above function measures the maximum distance between every two points (x_i, x_j) belonging to two different clusters. This represents the intracluster distance.
The denominator of the above function measures the maximum distance between every two points (y_i, y_j) belonging to the same cluster. This represents the intercluster distance.
Clustering having the max value of Dunn-Index is considered to be best.
Silhouette analysis:

This is used to determine the degree of separation between clusters. For each sample. a_i represents the average distance from all data points in the same cluster. b_i represents the average distance from all data points in the closest cluster.
The coefficient of SA can take values in the interval [-1, 1].
- SA = 0: the sample is very close to the neighboring clusters.
- SA = 1: the sample is far away from the neighboring clusters.
- SA = -1: the sample is assigned to the wrong clusters.
Therefore, we want the coefficients to be as big as possible and close to 1.
There are different types of Clustering technique, we will discuss one of them.
K-Means Clustering:
K-Means algorithm is a centroid based clustering technique. This technique cluster the dataset to k different cluster having an almost equal number of points. Each cluster is k-means clustering algorithm is represented by a centroid point.
What is a centroid point?
The centroid point is the point that represents its cluster. Centroid point is the average of all the points in the set and will change in each step and will be computed by:

For the above equation,
C_i: i'th Centroid
S_i: All points belonging to set_i with centroid as C_i
x_j: j'th point from the set
||S_i||: number of points in set_iThe idea of the K-Means algorithm is to find k-centroid points and every point in the dataset will belong either of k-sets having minimum Euclidean distance.

From the image above (Image 3), the distance of point x_i from all three centroids are d1, d2, d3, x_i point is nearest to centroid_3 with distance d3, so the point x_i will belong to the cluster of centroid_3 and this process will continue for all the points in the dataset.
Cost Function of K-Means:

The idea of the K-Means algorithm is to find k centroid points (C_1, C_1, . . . C_k) by minimizing the sum over each cluster of the sum of the square of the distance between the point and its centroid.
This cost is NP-hard and has exponential time complexity. So we use the idea of approximation using Lloyd’s Algorithm.
Lloyd’s Algorithm:
Lloyd’s algorithm is an approximation iterative algorithm used to cluster points. The steps of the algorithm are as follows:
- Initialization
- Assignment
- Update Centroid
- Repeat Step 2 and 3 until convergence.
Iterative implementation of the K-Means algorithm:

















