How to Perform DBSCAN Clustering in Python Using scikit-learn

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that groups data points based on their density. In this tutorial, we will learn how to implement DBSCAN in Python using the scikit-learn library.

Before we dive into the implementation, let’s understand the basic concepts of DBSCAN.

Basic Concepts

DBSCAN works by grouping together data points that are close to each other in the feature space. It requires two parameters: the radius (eps) and the minimum number of points (min_samples) required to form a dense region. The algorithm works as follows:

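Choose an arbitrary data point that has not been visited yet.
Retrieve all data points within a distance of eps from the chosen point (its neighborhood).
If the neighborhood contains at least min_samples points, the chosen point is a core point: create a new cluster, add the neighborhood to it, and keep expanding the cluster through every neighbor that is itself a core point.
If the neighborhood contains fewer than min_samples points, provisionally mark the chosen point as noise. It may later join a cluster as a border point if it lies within eps of a core point.
Repeat the process until all points have been visited.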
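
The steps above can be sketched in plain NumPy. This is only an illustrative sketch, not how scikit-learn implements DBSCAN: the function name dbscan_sketch and the constants NOISE and UNVISITED are hypothetical, distances are assumed to be Euclidean, and the full pairwise distance matrix makes it suitable for small datasets only.

```python
import numpy as np

NOISE = -1       # final label for points that belong to no cluster
UNVISITED = -2   # internal marker, never returned (hypothetical helper constant)


def dbscan_sketch(X, eps, min_samples):
    """Label each row of X with a cluster index, or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; quadratic memory, so small inputs only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Each neighborhood includes the point itself (distance 0).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    labels = np.full(n, UNVISITED)
    cluster = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        # i is a core point: start a new cluster and expand it.
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is also a core point: keep growing
        cluster += 1
    return labels


# Two tight groups of three points and one isolated point.
demo = np.array([[0, 0], [0, 0.1], [0.1, 0],
                 [5, 5], [5, 5.1], [5.1, 5],
                 [10, 0]])
demo_labels = dbscan_sketch(demo, eps=0.3, min_samples=3)
print(demo_labels)
```

In the toy data, each group of three forms its own cluster and the isolated point is labeled -1.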

DBSCAN can find clusters of arbitrary shape, including non-convex ones, and it does not need the number of clusters to be specified in advance. Unlike k-means, it has no random initialization to tune, but its results do depend on the choice of eps and min_samples.

Implementation

Now that we understand the basic concepts of DBSCAN, let’s implement it in Python using the scikit-learn library. We will use the Iris dataset, which is a popular dataset for classification and clustering tasks.

Import the necessary libraries

First, let’s import the necessary libraries:

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

This code block imports everything needed to run DBSCAN clustering in Python:

The load_iris function from the sklearn.datasets module loads the Iris dataset.
The DBSCAN class from the sklearn.cluster module implements the DBSCAN clustering algorithm.
The StandardScaler class from the sklearn.preprocessing module standardizes the features of the dataset.
The numpy module is imported as np to perform numerical operations.
The matplotlib.pyplot module is imported as plt to visualize the data and the clusters.

Together, these imports cover loading the data, running DBSCAN, and plotting the result.

Load the Iris dataset and standardize the features

Next, we will load the Iris dataset and standardize the features:

iris = load_iris()
X = iris.data
X = StandardScaler().fit_transform(X)

These three lines of code perform the following operations:

iris = load_iris() loads the famous Iris dataset from scikit-learn's built-in datasets. This dataset contains measurements of physical characteristics of three species of Iris flowers (Setosa, Versicolour, and Virginica). The dataset consists of 150 instances, each with four features: sepal length, sepal width, petal length, and petal width.
X = iris.data assigns the feature matrix of the Iris dataset to the variable X. The feature matrix consists of 150 rows (one for each instance) and 4 columns (one for each feature).
X = StandardScaler().fit_transform(X) standardizes the feature matrix X. The StandardScaler class rescales each feature to have a mean of 0 and a standard deviation of 1 by subtracting the feature's mean and dividing by its standard deviation. This matters for DBSCAN because the algorithm relies on distances and applies a single eps radius across all features, so a feature measured on a larger scale would otherwise dominate the neighborhoods. The fit_transform() method fits the scaler to the data and transforms it in one call. The standardized feature matrix is assigned back to the variable X.
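
As a quick sanity check, the standardization can be reproduced by hand. The sketch below assumes the scaler's default settings; the variable names X_raw, X_scaled, and X_manual are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_scaled = StandardScaler().fit_transform(X_raw)

# Manual standardization: subtract each column's mean and divide by its
# population standard deviation (NumPy's default, ddof=0).
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(np.allclose(X_scaled, X_manual))   # the two results should agree
print(X_scaled.mean(axis=0).round(6))    # per-feature means, close to 0
print(X_scaled.std(axis=0).round(6))     # per-feature standard deviations, close to 1
```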

Set the parameters of DBSCAN

We will then set the parameters for the DBSCAN algorithm:

dbscan = DBSCAN(eps=0.5, min_samples=5)

In the DBSCAN algorithm, the eps and min_samples parameters are used to define the density of clusters.

The eps parameter is a positive number that specifies the radius of the neighborhood around each point. Two points within a distance of eps of each other are neighbors, but neighbors are linked into the same cluster only through core points.
The min_samples parameter is the minimum number of points a neighborhood must contain for its center to count as a core point and form a dense region. In scikit-learn, this count includes the point itself. Points that are neither core points nor within eps of a core point are considered to be noise.
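
To make the two parameters concrete, the sketch below counts the neighbors of every point by hand and compares the resulting core points with the ones DBSCAN reports in its core_sample_indices_ attribute. It assumes the default Euclidean metric; the names neighbor_counts and core_by_hand are only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

eps, min_samples = 0.5, 5
X = StandardScaler().fit_transform(load_iris().data)

# For every point, find the indices of all points within eps of it.
# Because X is passed as the query, each point counts as its own neighbor.
nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)
neighbor_counts = np.array([len(idx) for idx in neighborhoods])

# A point is a core point when its neighborhood holds at least min_samples points.
core_by_hand = np.flatnonzero(neighbor_counts >= min_samples)

dbscan = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
print(np.array_equal(core_by_hand, dbscan.core_sample_indices_))
```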

The combination of eps and min_samples determines the shape and size of the resulting clusters. A larger eps value tends to merge neighborhoods into fewer, larger clusters, while a smaller eps value tends to produce more, smaller clusters and more noise points. A larger min_samples value makes it harder for a region to count as dense, so more points end up as noise, while a smaller value lets sparser regions form clusters.

It is important to note that the optimal values of eps and min_samples depend on the specific dataset and the problem at hand. Finding the optimal values can require some trial and error, and it is often a good idea to experiment with different values to see how they affect the clustering results.
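
One simple way to run such an experiment is to sweep a few eps values and record how many clusters and noise points each one produces. This is a minimal sketch: the candidate radii are arbitrary values chosen for illustration, and min_samples is held at 5.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

results = []
for eps in (0.3, 0.5, 0.7, 0.9):  # illustrative candidates, not recommendations
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # Exclude the noise label (-1) from the cluster count when it is present.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results.append((eps, n_clusters, n_noise))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With min_samples fixed, the number of noise points can only stay the same or shrink as eps grows, which gives a rough feel for where the clustering stabilizes.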

Fit the data to the DBSCAN algorithm

We can now fit the data to the DBSCAN algorithm and get the predicted labels:

dbscan.fit(X)
labels = dbscan.labels_

The predicted labels will be -1 for noise points and non-negative integers for cluster indices.

In the code block dbscan.fit(X), the DBSCAN algorithm is applied to the standardized feature matrix X using the fit() method of the DBSCAN class. This step computes the clusters based on the eps and min_samples parameters set earlier.

The resulting cluster assignments are stored in the labels_ attribute of the dbscan object. The labels array contains an integer label for each point in the dataset. Points that are not part of any cluster are labeled as -1, and points that belong to the same cluster are given the same label.

The resulting labels can be used for further analysis and visualization of the clusters. For example, you can plot the data points with different colors based on their cluster labels to visualize the separation of the clusters. It is important to note that the quality of the clustering results depends on the choice of eps and min_samples parameters, and the interpretation of the resulting labels should be done with care.
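
As a small example of such analysis, the sketch below summarizes the labels: it reports the size of each cluster and splits the points into core, border, and noise. The variable names are only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = dbscan.labels_

# Size of each group, keyed by label (-1 is noise).
values, counts = np.unique(labels, return_counts=True)
sizes = {int(v): int(c) for v, c in zip(values, counts)}
for label, size in sizes.items():
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {size} points")

# Core points are listed in core_sample_indices_; border points are the
# clustered points that are not core points.
core_mask = np.zeros(len(labels), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True
n_core = int(core_mask.sum())
n_border = int(np.sum(~core_mask & (labels != -1)))
n_noise = int(np.sum(labels == -1))
print(f"core: {n_core}, border: {n_border}, noise: {n_noise}")
```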

Visualizing clusters

We can visualize the clusters using the following code:

colors = np.array(['#ff0000', '#00ff00', '#0000ff'])
cluster_labels = np.unique(labels)
# Count clusters, excluding the noise label (-1) only if it is present
n_clusters = len(cluster_labels) - (1 if -1 in cluster_labels else 0)
plt.figure(figsize=(8, 6))
for label in cluster_labels:
    mask = labels == label
    if label == -1:
        # Plot noise points in black
        plt.scatter(X[mask, 0], X[mask, 1], c='k', s=50, label='Noise')
    else:
        # Plot points in the cluster's color (colors repeat beyond three clusters)
        plt.scatter(X[mask, 0], X[mask, 1], c=colors[label % len(colors)], s=50, label=f'Cluster {label}')
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')
plt.title(f'DBSCAN found {n_clusters} clusters')
plt.legend()
plt.show()

This code creates a scatter plot of the first two standardized features of the Iris dataset (sepal length and sepal width), where each point is colored according to its predicted label. Here is what each part of the code does:

colors = np.array(['#ff0000', '#00ff00', '#0000ff']): creates an array of three colors that will be used to color the clusters.
cluster_labels = np.unique(labels): obtains the unique labels assigned by DBSCAN to the points in the data.
n_clusters = len(cluster_labels) - (1 if -1 in cluster_labels else 0): computes the number of clusters from the number of unique labels, subtracting 1 only when the noise label -1 is present.
plt.figure(figsize=(8, 6)): creates a new figure with a size of 8x6 inches.
for label in cluster_labels:: loops through each label, and mask = labels == label selects the points that carry it.
if label == -1:: if the label is -1, it corresponds to noise points, so we plot them in black with a label of 'Noise'.
else:: if the label is not -1, it corresponds to a cluster, so we plot its points in a color from the colors array with a label of 'Cluster {label}', where {label} is replaced by the actual label value. The modulo keeps the index valid if there are more clusters than colors.
plt.xlabel(), plt.ylabel(), and plt.title(): name the two plotted features and show the number of clusters found.
plt.legend(): displays the legend with the labels created in the previous steps.
plt.show(): displays the scatter plot with the legend.

By visualizing the clusters in this way, we can gain insights into the underlying structure of the data and better understand the relationships between different data points.

The complete program is given below:

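from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

# Load the Iris features and standardize them
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

# Cluster the standardized features
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
labels = dbscan.labels_

# Plot the first two features: one color per cluster, black for noise.
# Indexing colors directly with labels would map -1 to the last color
# and fail for more than three clusters, so each label is plotted separately.
colors = np.array(['#ff0000', '#00ff00', '#0000ff'])
plt.figure(figsize=(8, 6))
for label in np.unique(labels):
    mask = labels == label
    if label == -1:
        plt.scatter(X[mask, 0], X[mask, 1], c='k', s=50, label='Noise')
    else:
        plt.scatter(X[mask, 0], X[mask, 1], c=colors[label % len(colors)], s=50, label=f'Cluster {label}')
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')
plt.legend()
plt.savefig('dbscan.png', dpi=600)
plt.show()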

Conclusion

DBSCAN is a robust and versatile clustering algorithm that is particularly useful for datasets that have complex structures and contain noise.

In this tutorial, we implemented DBSCAN in Python using the scikit-learn library. We loaded the Iris dataset, standardized it with the StandardScaler class, set the eps and min_samples parameters, and applied the algorithm using the fit() method of the DBSCAN class. Finally, we visualized the resulting clusters using a scatter plot.

With suitable values for eps and min_samples, DBSCAN can identify dense regions in our data and group them into clusters. This can help us to uncover patterns and relationships in our data that may not be immediately apparent.

DBSCAN is a useful tool for exploratory data analysis, with applications that include anomaly detection, image segmentation, and recommendation systems. With scikit-learn, running it takes only a few lines of Python, which makes it an accessible technique for data scientists and machine learning practitioners.
