Summary

The provided content is a detailed tutorial on implementing the K-Medoid clustering algorithm, specifically the Partition Around Medoids (PAM) method, in Python using the KMedoids class from the scikit-learn-extra library.

Abstract

The web content offers a comprehensive guide to understanding and applying the K-Medoid clustering algorithm, with a focus on the Partition Around Medoids (PAM) approach. It begins by discussing the limitations of the K-Means clustering algorithm, particularly its reliance on the Euclidean distance metric and the lack of interpretability due to non-data point centers. The PAM algorithm is then introduced as an alternative that uses actual data points as medoids and minimizes the dissimilarity within clusters. The tutorial includes a step-by-step explanation of the PAM algorithm, from initializing medoids to iteratively improving cluster assignments and updating medoids to reduce the loss function. A practical example is provided, illustrating the calculation of dissimilarity, the assignment of data points to medoids, and the recalculation of costs to determine the optimal medoids. The tutorial concludes with a Python implementation using scikit-learn-extra, demonstrating how to fit the KMedoids model, assign cluster labels, and visualize the resulting clusters with matplotlib.

Opinions

The author suggests that K-Medoid clustering, particularly the PAM algorithm, is more interpretable than K-Means because it uses actual data points as medoids.
The author implies that the ability to use different distance metrics, such as Manhattan distance, is an advantage of K-Medoid over K-Means, which is limited to Euclidean distance.
The tutorial conveys that the PAM algorithm is computationally efficient, reducing the complexity from O(n^2) to O(n log n) for a sample size n.
The author emphasizes the importance of the loss function in PAM, which aims to minimize the dissimilarity between points in a cluster and the medoid of that cluster.
The use of a practical example and Python code snippets indicates the author's commitment to providing actionable knowledge for readers interested in implementing K-Medoid clustering in real-world scenarios.
The author's inclusion of links to further resources and their presence on professional networks like LinkedIn and Twitter suggests a community-oriented approach to data science education and practice.

K-Medoid Clustering (PAM)Algorithm in Python

A step-by-step tutorial—with a solved example

Image Credit — Prepared by the author using Jupyter Notebook.

1. Introduction

1.1 K-Means clustering and challenges

Clustering of large-scale data is key to implementing segmentation-based algorithms. Segmentation can include identifying customer groups to facilitate targeted marketing, identifying prescriber groups to allow health care players to reach out to them with the right messaging, and identifying patterns or abnormal values in the data. K-Means is the most popular clustering algorithm adopted across different problem areas, mostly owing to its computational efficiency and ease of understanding the algorithm. K-Means relies on identifying cluster centers from the data. It alternates between assigning points to these cluster centers using the Euclidean distance metric and recomputes the cluster centers till a convergence criterion is achieved. K-Means clustering, however, suffers from a series of drawbacks:

K-Means clustering does not use any other distance metric other than Euclidean distance. Different distance metrics like cosine distance is often utilized in a recommendation system that uses sparse data.
In K-Means clustering, the final centers are not necessarily data points from the data and act as a bottleneck to interpretability for different application areas like images in computer vision.

2. Partition Around Medoids (PAM)

PAM stands for “Partition Around Medoids.” PAM converts each step of PAM from a deterministic computational to a statistical estimation problem and reduces the complexity of a sample size n to O(n log n). Medoids are data points chosen as cluster centers. K-Means clustering aims at minimizing the intra-cluster distance (often referred to as the total squared error). In contrast, K-Medoid minimizes dissimilarities between points in a cluster and points considered as centers of that cluster.

Any point in a dataset can be considered as a medoid if and only if its dissimilarity to all other data points in the cluster is minimum.

The fundamental concept of PAM includes:

Find a set of k Medoids (k refers to the number of clusters, and M is a collection of medoids) from the data points of size n (n being the number of records).
Using any distance metric (say d(.), could be euclidean, manhattan, etc.), try and locate Medoids that minimize the overall distance of data points to their closest Medoid.
Finally, swap Medoid and non-Medoid pairs that reduce the loss function L among all possible k(n-k) pairs. The loss function is defined as:

Figure 1. Illustrates the Loss Function used by the PAM algorithm.

Update centroids: In the case of K-Means, we were computing the mean of all points present in the cluster. But for the PAM algorithm, the updation of the centroid is different. If there are m-point in a cluster, swap the previous centroid with all other (m-1) points and finalize the point as a new centroid with a minimum loss. Minimum loss is computed by the cost function shown in Figure 1. A working example of the same is outlined below (GeeksforGeeks, 2019).

2.1 Algorithm — Dataset

Let us consider a set of data points and features.

Figure 2. Illustrates a hypothetical set of features for 7 data points. Image prepared by the author using Jupyter and Markdown.

2.2 Initialization Step

Let us randomly select two medoids, so select 𝐾= 2 and let M1 =(4, 5), and M2 =(9, 10) be the two medoids. Note clustering algorithm chooses data points from the data as medoids
Calculate the dissimilarity of all points to Medoids M1 and M2. Note dissimilarity is calculated as the summation of the absolute difference between medoids and data points. E.g., D=|Mx — Feature X| + |My — Feature Y|

Figure 3. Illustrates the dissimilarity calculation. Image prepared by the author using Jupyter and Markdown.

2.3 Assignment Step

Assign data points to Medoids with lower Dissimilarity (Cost). Data Points 0 and 2 go to M2, and Data points 3, 4, and 5 go to M1. Total Cost = Cost/Dissimilarity of points assigned to M1 + Cost/Dissimilarity of points assigned to M2.

Total Cost=(4+3+4)+(4+7)=22; Cost for M1 = Dissimilarity of point 3 with M1 (4) + Dissimilarity of point 4 with M1 (3) +Dissimilarity of point 5 with M1 (4). Repeat the same calculation for M2.

2.4 Centroid Recompilation Step

Randomly select one non-medoid point and recalculate the cost. Let’s select Data Point 5 M1 as (2,3) as the Medoid now and recompute the cost.

Figure 4. Illustrates the dissimilarity calculation using an updated centroid, i.e., using data point 5 (2,3) as M1. Image prepared by the author using Jupyter and Markdown.

Now only points 4 and 6 go to M1, and the rest of the points go to M2. Total Cost=(7+4)+(4+7+6)=28 and Swap Cost = New Cost — Previous Cost=28–22=6.

We will undo the swap since the Swap Cost is greater than 0. This process is continued till a convergence criterion is met.

3. Implementation using Python

We will simulate the above example using K-Medoid algorithm from scikit-learn-extra. Few of the codes and concepts and inspired from package documentation (scikit-learn-extra, 2019)

# — — — — — — -Importing Packages — — — — — — — — — — — -

import matplotlib.pyplot as plt
import numpy as np

from sklearn_extra.cluster import KMedoids

# — — — — — — -Assigning Initial Centers — — — — — — — — — — — -
centers = [[4, 5], [9, 10]]

# — — — — — — -Assigning Data: Dummy Data used in example above — — — — — — — — — — — — — — — — — — 
df=np.array([[7,8], [9,10], [11,5], [4,9], [7,5], [2,3], [4,5]])

# — — — — — — -Fit KMedoids clustering — — — — — — — — — — — -
KMobj = KMedoids(n_clusters=2).fit(df)

# — — — — — — -Assigning Cluster Labels — — — — — — — — — — — -
labels = KMobj.labels_

Visualizing the Cluster.

# — — — — — — -Extracting Unique Labels — — — — — — — — — — — -

unq_lab = set(labels)

# — — — — — — -Setting Up Color Codes — — — — — — — — — — — -
colors_plot = [
 plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unq_lab))
]

for k, col in zip(unq_lab, colors_plot):

class_member_mask = labels == k
 
 # — — — — — — -Setting datapoint Feature X and Feature Y — — — — — — — — — — — -

xy = df[class_member_mask]
 
 # — — — — — — -Plotting Feature X and Feature Y for each cluster labels — — — — — — — — — — — -
 
 plt.plot(
 xy[:, 0],
 xy[:, 1],
 “o”,
 markerfacecolor=tuple(col),
 markeredgecolor=”white”,
 markersize=10,
 );

# — — — — — — -Annotate Centroids — — — — — — — — — — — -

plt.plot(
 KMobj.cluster_centers_[:, 0],
 KMobj.cluster_centers_[:, 1],
 “o”,
 markerfacecolor=”orange”,
 markeredgecolor=”k”,
 markersize=10,
);

# — — — — — — -Add title to the plot — — — — — — — — — — — -

plt.title(“KMedoids clustering on Dummy Data- Medoids are represented in Orange.”, fontsize=14);

Figure 5. Illustrates the final cluster centroids using K-Medoids. Image prepared by the author using Jupyter Notebook.

Reference

scikit-learn-extra. (2019). KMedoids Demo — scikit-learn-extra 0.2.0 documentation. Scikit-Learn-Extra.readthedocs.io. https://scikit-learn-extra.readthedocs.io/en/stable/auto_examples/plot_kmedoids.html#sphx-glr-auto-examples-plot-kmedoids-py
GeeksforGeeks. (2019, May 17). ML | K-Medoids clustering with solved example. GeeksforGeeks. https://www.geeksforgeeks.org/ml-k-medoids-clustering-with-example/

About the Author: Advanced analytics professional and management consultant helping companies find solutions for various problems through a mix of business, technology, and math on organizational data. A Data Science enthusiast, here to share, learn and contribute; You can connect with me on Linked and Twitter;