Esteban Thilliez

Data Science with Python — Cluster Analysis

Photo by Pierre Bamin on Unsplash

This article is part of the “Data Science with Python” series.

Cluster analysis is an important part of data science. It consists of grouping similar data points together based on their characteristics.

Today, we’ll explore how to perform cluster analysis in Python.

What is Cluster Analysis?

Cluster analysis is an unsupervised learning method used in data science to group similar data points together based on their characteristics. The main purpose of cluster analysis is to partition a dataset into subsets, or clusters, such that data points within each cluster share common traits and are dissimilar from those in other clusters.

Cluster analysis has many applications in various fields such as market segmentation, customer profiling, image processing, and biological data analysis. In marketing, cluster analysis is used to segment customers into groups with similar demographics, behavior, and preferences. In biology, cluster analysis is used to group genes or proteins based on their function or expression patterns.

There are several types of clustering techniques, including hierarchical clustering, partition-based clustering, density-based clustering, and model-based clustering. Hierarchical clustering builds a tree-like structure of clusters, while partition-based clustering assigns each data point to a specific cluster. Density-based clustering identifies dense regions of data points, while model-based clustering assumes that the data points are generated from a mixture of underlying probability distributions.

Clustering Algorithms

There are various clustering algorithms that can be used to perform cluster analysis. The right choice depends mostly on the dataset: its size, the shape of the clusters we expect, and whether it contains noise or outliers.

  • K-Means Clustering: K-means is a popular partition-based clustering algorithm that aims to partition a dataset into K clusters. The algorithm works by randomly selecting K data points as centroids, then assigning each data point to the closest centroid based on a distance metric such as Euclidean distance. The centroids are then updated by computing the mean of all data points assigned to each cluster. The process of assigning data points and updating centroids is repeated until the centroids no longer move or a maximum number of iterations is reached.
  • Hierarchical Clustering: Hierarchical clustering builds a tree-like hierarchy of nested clusters (a dendrogram) instead of a single flat partition. There are two main types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts by treating each data point as a separate cluster and then iteratively merges the closest pair of clusters until all data points belong to a single cluster. Divisive clustering starts with all data points in a single cluster and then recursively splits clusters into smaller subsets until each cluster contains only one data point. In both cases, we get a flat clustering by cutting the tree at the desired level.
  • DBSCAN Clustering: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that identifies dense regions of data points separated by areas of lower density. The algorithm works by defining a neighborhood around each data point based on a distance metric and a user-defined radius parameter (epsilon). A data point whose neighborhood contains at least a minimum number of data points is a core point, and core points that lie within each other's neighborhoods form a cluster. A data point that is not a core point but lies within the neighborhood of one is a border point and joins that cluster. Data points that are neither core points nor within the neighborhood of a core point are considered noise.
  • Gaussian Mixture Models (GMM): GMM is a model-based clustering algorithm that assumes that the data points are generated from a mixture of Gaussian distributions. The algorithm estimates the parameters of the Gaussian distributions by maximizing the likelihood of the data points, typically with the expectation-maximization (EM) algorithm. Each data point is then assigned to the Gaussian component under which it is most probable. Unlike K-means, GMM also gives the probability that a point belongs to each component, so the assignment can be soft.
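To make the K-means loop described above more concrete, here is a minimal NumPy sketch of the assign-then-update steps. The function name `kmeans_sketch` and the toy points are made up for illustration, and this is not how scikit-learn implements K-means internally (its default uses a smarter initialization, among other things).

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Toy K-means for illustration: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of 2-D points (made-up values).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)
```

Which group ends up as cluster 0 or cluster 1 depends on the random initialization, so only the grouping matters, not the label numbers.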

Building Clustering Models in Python

Python provides several libraries that can be used to build clustering models. We’ll use scikit-learn, as it’s one of the most widely used and one of the easiest to get started with.

To perform clustering using scikit-learn, we first need to import the necessary modules:

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

We can then load the data into a Pandas DataFrame and perform data preparation, including data cleaning, scaling, and feature selection. Once the data is prepared, we can create an instance of the clustering algorithm and fit it to the data. For example, using KMeans:

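# `data` is the prepared feature matrix (for example, a scaled DataFrame)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(data)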

We can then use the trained model to predict the cluster labels for new data points:

labels = kmeans.predict(new_data)
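The same pattern works for the other algorithms, with one difference worth knowing: AgglomerativeClustering and DBSCAN have no predict method, so we get the labels with fit_predict on the data we are clustering. Gaussian mixtures live in sklearn.mixture and can both predict and return membership probabilities. Here is a small sketch on made-up points; the eps, min_samples, and n_clusters values are only chosen to suit this toy data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Made-up toy data: two tight groups plus one isolated point.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1], [5.0, 5.2],
              [10.0, 0.0]])

# Hierarchical clustering: every point must end up in a cluster,
# so with 3 clusters the isolated point forms a cluster of its own.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(agglo_labels)

# DBSCAN: the number of clusters is not given in advance,
# and points in sparse regions get the noise label -1.
dbscan_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(dbscan_labels)

# Gaussian mixture fitted on the two dense groups only.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X[:8])
new_points = np.array([[1.0, 1.0], [5.0, 5.0]])
gmm_labels = gmm.predict(new_points)       # hard assignment
gmm_proba = gmm.predict_proba(new_points)  # soft assignment, one row per point
print(gmm_labels)
print(gmm_proba.round(3))
```

As with K-means, the label numbers themselves are arbitrary. Only -1 has a fixed meaning in DBSCAN, where it marks noise.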

Example

Let’s walk through a simple example to illustrate the process of clustering using scikit-learn. I will write a more detailed article about a real use case later. For now, here is a small example so that you can try out what you have learned.

Suppose we have a dataset with two features, height and weight, and we want to cluster the data into three groups based on these features. We can start by loading the data into a Pandas DataFrame:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'height': [170, 168, 180, 175, 174, 172, 169, 177, 181, 178],
    'weight': [70, 65, 80, 73, 72, 68, 66, 76, 82, 79]
})

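We can then perform data preparation. Here, we standardize the data with StandardScaler, which rescales each feature to zero mean and unit variance so that height and weight contribute equally to the distance computation: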

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

We can then create an instance of the KMeans clustering algorithm and fit it to the data:

# Fix random_state so the clustering is reproducible across runs
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(scaled_data)

Finally, we can use the trained model to predict the cluster labels for new data points:

new_data = pd.DataFrame({
    'height': [173, 179, 171],
    'weight': [71, 81, 67]
})

scaled_new_data = scaler.transform(new_data)

labels = kmeans.predict(scaled_new_data)

print(labels)  # e.g. [0 2 0]; the label numbers are arbitrary and depend on the initialization
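The labels alone don't say much about what each cluster looks like. One way to interpret them, sketched below under the same setup as the example, is to map the cluster centers back to the original height and weight units with the scaler's inverse_transform. The random_state and n_init values are my own choice to make the run reproducible.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same toy dataset as in the example above.
data = pd.DataFrame({
    'height': [170, 168, 180, 175, 174, 172, 169, 177, 181, 178],
    'weight': [70, 65, 80, 73, 72, 68, 66, 76, 82, 79]
})

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled_data)

# The centers live in the scaled space; convert them back to the original units.
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=data.columns,
)
print(centers.round(1))

# How many training points fell into each cluster.
cluster_sizes = pd.Series(kmeans.labels_).value_counts().sort_index()
print(cluster_sizes)
```

Each row of centers is the average height and weight of one cluster, which is usually easier to read than the standardized values.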

Fine-Tuning

Once we have built a clustering model using scikit-learn, we may want to fine-tune the model to improve its performance. Here are some techniques we can use for fine-tuning our clustering models:

  • Choosing the optimal number of clusters: The number of clusters is a key hyperparameter in clustering algorithms. In K-means clustering, we can use the elbow method or the silhouette score to determine the optimal number of clusters. In hierarchical clustering, we can use the dendrogram to determine the optimal number of clusters.
  • Feature selection: In some cases, not all features are relevant for clustering. We can use feature selection techniques to identify the most important features and remove irrelevant or redundant features.
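  • Dimensionality reduction: High-dimensional data can be difficult to cluster, so we can use dimensionality reduction techniques such as PCA to reduce the number of features before clustering. t-SNE is another option, although it is mostly used to visualize clusters in two or three dimensions.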
  • Algorithm selection: Different clustering algorithms have different strengths and weaknesses, so we may want to try different algorithms to find the one that works best for our data.
  • Hyperparameter tuning: Clustering algorithms have several hyperparameters that can be tuned to improve performance, such as the distance metric, linkage method, and DBSCAN epsilon.
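As an illustration of the first point, here is a small sketch that compares several values of K on the toy height/weight dataset, using the inertia (for the elbow method) and the silhouette score. The range of K values, n_init, and random_state are arbitrary choices for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same toy dataset as in the example above.
data = pd.DataFrame({
    'height': [170, 168, 180, 175, 174, 172, 169, 177, 181, 178],
    'weight': [70, 65, 80, 73, 72, 68, 66, 76, 82, 79]
})
scaled_data = StandardScaler().fit_transform(data)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled_data)
    results[k] = {
        # Sum of squared distances to the closest center (used for the elbow method).
        'inertia': model.inertia_,
        # Between -1 and 1; higher means better-separated clusters.
        'silhouette': silhouette_score(scaled_data, model.labels_),
    }

summary = pd.DataFrame(results).T
print(summary.round(3))

# One simple rule: keep the K with the highest silhouette score.
best_k = summary['silhouette'].idxmax()
print(best_k)
```

With the elbow method, we would instead plot the inertia column against K and look for the point where the curve stops dropping sharply. On a dataset this small, both criteria should be taken with a grain of salt.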

I have already covered most of these techniques, so be sure to check the other stories of this series if you want to know more about them.

Final Note

Now you know how to solve clustering problems in Python.

In a future article, we will look at a concrete use case. Don’t hesitate to follow me if you don’t want to miss it!

