How to Perform DBSCAN Clustering in Python Using scikit-learn
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that groups data points based on their density. In this tutorial, we will learn how to implement DBSCAN in Python using the scikit-learn library.
Before we dive into the implementation, let’s understand the basic concepts of DBSCAN.
Basic Concepts
DBSCAN works by grouping together data points that are close to each other in the feature space. It requires two parameters: the radius (eps) and the minimum number of points (min_samples) required to form a dense region. The algorithm works as follows:
- Choose a random data point that has not been visited yet.
- Retrieve all data points within a distance of eps from the chosen point.
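- If there are at least min_samples points within the eps distance (scikit-learn counts the chosen point itself), the chosen point is a core point: create a new cluster, add these neighbors to it, and keep expanding the cluster through any neighbors that are themselves core points.
- If there are fewer than min_samples points within the eps distance, mark the chosen point as noise for now. It may later join a cluster as a border point if it lies within eps of a core point.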
- Repeat the process until all points have been visited.
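The steps above can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the names simple_dbscan, NOISE, and UNVISITED are hypothetical, the point order is sequential rather than random, and the full pairwise distance matrix makes it suitable for small datasets only. It includes the cluster-expansion step that DBSCAN performs through core points, and it assumes that a neighborhood counts the point itself.

```python
import numpy as np

NOISE = -1
UNVISITED = -2  # internal marker used only by this sketch


def simple_dbscan(X, eps, min_samples):
    """Return one label per row of X: a cluster index, or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; O(n^2) memory, so small data only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Each neighborhood includes the point itself (distance 0).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, UNVISITED)
    cluster = -1
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is also core: keep expanding
    return labels


# Two tight groups of three points and one isolated point.
points = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                   [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
                   [10.0, 0.0]])
demo_labels = simple_dbscan(points, eps=0.5, min_samples=3)
print(demo_labels)
```

On this toy input, the two tight groups become clusters 0 and 1, and the isolated point is labeled -1.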
DBSCAN can find clusters of arbitrary shape, including non-convex ones, and unlike algorithms such as k-means it does not need the number of clusters to be specified in advance. Its results do, however, depend strongly on the choice of eps and min_samples.
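To illustrate the point about non-convex clusters, here is a small sketch on scikit-learn's synthetic two-moons data, which is not part of the original tutorial. The values eps=0.2 and min_samples=5 are assumptions chosen for this toy dataset, and k-means is included only as a point of comparison. The adjusted Rand index measures agreement with the known moon membership, and the scores you see will depend on these parameter choices.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a classic non-convex clustering problem.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples are assumed values for this toy dataset.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Agreement with the true moon membership (1.0 means a perfect match).
db_score = adjusted_rand_score(y_moons, db_labels)
km_score = adjusted_rand_score(y_moons, km_labels)
print(f"DBSCAN  adjusted Rand index: {db_score:.3f}")
print(f"k-means adjusted Rand index: {km_score:.3f}")
```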
Implementation
Now that we understand the basic concepts of DBSCAN, let’s implement it in Python using the scikit-learn library. We will use the Iris dataset, which is a popular dataset for classification and clustering tasks.
Import the necessary libraries
First, let’s import the necessary libraries:
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
The code block imports the libraries needed to implement the DBSCAN clustering algorithm in Python:
- The load_iris function from the sklearn.datasets module is used to load the Iris dataset.
- The DBSCAN class from the sklearn.cluster module is used to implement the DBSCAN clustering algorithm.
- The StandardScaler class from the sklearn.preprocessing module is used to normalize the features of the dataset.
- The numpy module is imported as np to perform numerical operations.
- The matplotlib.pyplot module is imported as plt to visualize the data and the clusters.
Together, these libraries provide a powerful framework to perform clustering on datasets using the DBSCAN algorithm in Python.
Load the Iris dataset and normalize the features
Next, we will load the Iris dataset and normalize the features:
iris = load_iris()
X = iris.data
X = StandardScaler().fit_transform(X)
These three lines of code perform the following operations:
- iris = load_iris() loads the famous Iris dataset from scikit-learn's built-in datasets. This dataset contains measurements of physical characteristics of three species of Iris flowers (Setosa, Versicolour, and Virginica). The dataset consists of 150 instances, each with four features: sepal length, sepal width, petal length, and petal width.
- X = iris.data assigns the feature matrix of the Iris dataset to the variable X. The feature matrix consists of 150 rows (one for each instance) and 4 columns (one for each feature).
- X = StandardScaler().fit_transform(X) standardizes the feature matrix X. Standardization is an important preprocessing step in many machine learning algorithms, including DBSCAN. The StandardScaler class scales each feature to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean and dividing by the standard deviation. Standardization brings all the features to a similar scale, which matters for DBSCAN because eps is a single distance threshold applied across all features. The fit_transform() method fits the scaler to the data and transforms it in one step. The standardized feature matrix is assigned back to the variable X.
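As a quick check of what standardization does, the sketch below compares StandardScaler with the same computation written out by hand. The variable names X_raw, X_scaled, and X_manual are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_scaled = StandardScaler().fit_transform(X_raw)

# Manual version: subtract each column's mean and divide by its
# population standard deviation (ddof=0, NumPy's default).
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(np.allclose(X_scaled, X_manual))   # do the two versions agree?
print(X_scaled.mean(axis=0).round(6))    # per-feature means, close to 0
print(X_scaled.std(axis=0).round(6))     # per-feature standard deviations, close to 1
```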
Set the parameters of DBSCAN
We will then set the parameters for the DBSCAN algorithm:
dbscan = DBSCAN(eps=0.5, min_samples=5)
In the DBSCAN algorithm, the eps and min_samples parameters are used to define the density of clusters.
- The eps parameter is a positive number that specifies the radius of the neighborhood around each point. Two points within a distance of eps of each other are neighbors. Being neighbors does not by itself put two points in the same cluster; that also requires enough density, as set by min_samples.
- The min_samples parameter is the minimum number of points (in scikit-learn, including the point itself) that must lie within a point's eps-neighborhood for that point to count as a core point of a dense region. Points that are neither core points nor within eps of a core point are considered to be noise.
The combination of eps and min_samples determines the shape and size of the resulting clusters. A larger eps value tends to produce fewer, larger clusters and fewer noise points, while a smaller eps value tends to produce more, smaller clusters and more noise. Similarly, a larger min_samples value makes it harder for a region to count as dense, so more points are labeled as noise and sparse groups may not form clusters at all, while a smaller min_samples value lets smaller groups form clusters.
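To see these effects on the Iris data, the following sketch runs DBSCAN over a small grid of parameter values and reports the number of clusters and noise points for each combination. The grid values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

results = []
for eps in (0.3, 0.5, 0.8, 1.2):          # arbitrary example values
    for min_samples in (3, 5, 10):        # arbitrary example values
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Count clusters, excluding the noise label (-1) if it is present.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int((labels == -1).sum())
        results.append((eps, min_samples, n_clusters, n_noise))
        print(f"eps={eps:<4} min_samples={min_samples:<3} "
              f"clusters={n_clusters} noise={n_noise}")
```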
It is important to note that the optimal values of eps and min_samples depend on the specific dataset and the problem at hand. Finding the optimal values can require some trial and error, and it is often a good idea to experiment with different values to see how they affect the clustering results.
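One common heuristic for narrowing down eps, which is not part of the original tutorial, is the k-distance curve: for each point, compute the distance to its k-th nearest neighbor with k set to min_samples, sort those distances, and look for the value where the curve bends sharply upward. The sketch below computes the curve with scikit-learn's NearestNeighbors; reading an eps value off it remains a judgment call.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
min_samples = 5

# kneighbors(X) on the training data returns each point itself as its
# nearest neighbor (distance 0), which matches scikit-learn's convention
# that min_samples counts the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from each point to its min_samples-th nearest point, sorted.
k_distances = np.sort(distances[:, -1])

# A point is a core point for a given eps when its k-distance is <= eps,
# so percentiles of this curve suggest candidate eps values.
for q in (50, 75, 90, 95):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}")
```

Plotting k_distances with plt.plot(k_distances) gives the usual visual form of this heuristic.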
Fit the data to the DBSCAN algorithm
We can now fit the data to the DBSCAN algorithm and get the predicted labels:
dbscan.fit(X)
labels = dbscan.labels_
The predicted labels will be -1 for noise points and non-negative integers for cluster indices.
In the code block dbscan.fit(X), the DBSCAN algorithm is applied to the standardized feature matrix X using the fit() method of the DBSCAN class. This step computes the clusters based on the eps and min_samples parameters set earlier.
The resulting cluster assignments are stored in the labels_ attribute of the dbscan object, which is accessed as dbscan.labels_. The labels array contains an integer label for each point in the dataset. Points that are not part of any cluster are labeled as -1, and points that belong to the same cluster are given the same label.
The resulting labels can be used for further analysis and visualization of the clusters. For example, you can plot the data points with different colors based on their cluster labels to visualize the separation of the clusters. It is important to note that the quality of the clustering results depends on the choice of eps and min_samples parameters, and the interpretation of the resulting labels should be done with care.
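Before plotting, it can help to summarize the labels numerically. The sketch below counts clusters, noise points, core points, and border points, using the labels_ and core_sample_indices_ attributes of the fitted DBSCAN object. The variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = dbscan.labels_

# Number of clusters, excluding the noise label (-1) if it is present.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

# Core points are listed in core_sample_indices_; the remaining
# clustered points are border points.
core_mask = np.zeros(len(labels), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True
n_core = int(core_mask.sum())
n_border = len(labels) - n_core - n_noise

# Size of each cluster, ignoring noise.
cluster_ids, sizes = np.unique(labels[labels != -1], return_counts=True)

print(f"clusters={n_clusters} core={n_core} border={n_border} noise={n_noise}")
for cluster_id, size in zip(cluster_ids, sizes):
    print(f"cluster {cluster_id}: {size} points")
```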
Visualizing clusters
We can visualize the clusters using the following code:
colors = np.array(['#ff0000', '#00ff00', '#0000ff'])
cluster_labels = np.unique(labels)
# Count clusters, excluding the noise label (-1) only if it is present
n_clusters = len(cluster_labels) - (1 if -1 in cluster_labels else 0)
plt.figure(figsize=(8, 6))
for i, label in enumerate(cluster_labels):
    if label == -1:
        # Plot noise points in black
        plt.scatter(X[labels == label, 0], X[labels == label, 1], c='k', s=50, label='Noise')
    else:
        # Plot points in current cluster color
        plt.scatter(X[labels == label, 0], X[labels == label, 1], c=colors[i % len(colors)], s=50, label=f'Cluster {label}')
plt.legend()
plt.show()
This code will create a scatter plot of the first two (standardized) features of the Iris dataset, where each point is colored according to its predicted label.
This code block plots each cluster separately so that the scatter plot gets a proper legend. Here is what each line of the code does:
- colors = np.array(['#ff0000', '#00ff00', '#0000ff']): creates an array of three colors that will be used to color the clusters. If there are more than three clusters, the colors repeat.
- cluster_labels = np.unique(labels): obtains the unique labels assigned by DBSCAN to the points in the data.
- n_clusters = len(cluster_labels) - (1 if -1 in cluster_labels else 0): computes the number of clusters in the data. The label -1 corresponds to noise points rather than a cluster, so 1 is subtracted from the number of unique labels, but only when noise points are actually present.
- plt.figure(figsize=(8, 6)): creates a new figure with a size of 8x6 inches.
- for i, label in enumerate(cluster_labels): loops through each cluster label and its corresponding points.
- if label == -1: if the label is -1, it corresponds to noise points, so we plot them in black with a label of 'Noise'.
- else: if the label is not -1, it corresponds to a cluster, so we plot its points in a color from the colors array with a label of 'Cluster {label}', where {label} is replaced by the actual label value.
- plt.legend(): displays the legend with the labels created in the previous steps.
- plt.show(): displays the scatter plot with the legend.
The plot is shown below:

By visualizing the clusters in this way, we can gain insights into the underlying structure of the data and better understand the relationships between different data points.
The complete program is given below:
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data
X = StandardScaler().fit_transform(X)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
labels = dbscan.labels_
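colors = np.array(['#ff0000', '#00ff00', '#0000ff'])
# Color each point by its cluster; noise points (label -1) are drawn in black
point_colors = np.where(labels == -1, '#000000', colors[labels % len(colors)])
plt.scatter(X[:, 0], X[:, 1], c=point_colors, s=50)
plt.savefig('dbscan.png', dpi=600)
plt.show()
Conclusion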
In conclusion, DBSCAN is a robust and versatile clustering algorithm that is particularly useful when dealing with datasets that have complex structures and contain noise.
In this tutorial, we have demonstrated how to implement the DBSCAN algorithm in Python using the scikit-learn library. We started by loading the Iris dataset and pre-processing the data by standardizing it with the StandardScaler class. We then set the eps and min_samples parameters and applied the DBSCAN algorithm using the fit() method of the DBSCAN class. Finally, we visualized the resulting clusters using a scatter plot.
By setting the right values for eps and min_samples, DBSCAN can identify dense regions in our data and group them into clusters. This can help us to uncover hidden patterns and relationships in our data that may not be immediately apparent.
DBSCAN is a useful tool for exploratory data analysis and can be applied in a wide range of settings, including anomaly detection, image segmentation, and recommendation systems. With the scikit-learn library, implementing DBSCAN in Python takes only a few lines of code, making it an accessible technique for data scientists and machine learning practitioners.