Data Science: DBSCAN Clustering using Python


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that groups data points based on their density. It is widely used in fields such as image segmentation, anomaly detection, and data mining. In this article, we discuss the advantages and disadvantages of DBSCAN clustering and provide a step-by-step guide to implementing it on a dataset using Python. The guide covers generating random data, creating a pandas DataFrame, checking for and filling missing data, scaling and normalizing the data, reducing the dimensionality of the dataset, performing DBSCAN clustering, assigning cluster colors, visualizing the clusters, and exporting the clustered data with cluster labels to a CSV file.

DBSCAN clustering has several advantages over other clustering algorithms. One of the major advantages is that it does not require any prior information about the number of clusters in the dataset: it identifies the number of clusters automatically from the density of the data points. DBSCAN is also robust to noise and handles outliers effectively; it groups points based on their density, and points that do not belong to any cluster are treated as noise (scikit-learn labels them -1). The sketch below illustrates both points.
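
As a minimal, self-contained sketch of both points (the data and parameter values here are hypothetical, chosen only for demonstration): DBSCAN is never told how many clusters to find, and the hand-placed outliers come back labeled -1.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# three dense blobs plus a few hand-placed outliers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
X = np.vstack([X, [[15, 15], [-15, -15], [15, -15]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# the number of clusters is discovered, not specified; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)            # expected: 3
print("Noise points:", np.sum(labels == -1))    # at least the 3 outliers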

Another advantage of DBSCAN clustering is its ability to handle clusters of different shapes and sizes. Unlike centroid-based algorithms such as K-Means, DBSCAN can identify clusters of any shape, including non-convex clusters; the half-moon sketch below is the classic demonstration. One caveat: because DBSCAN is distance-based, features on very different scales should still be standardized, which is why the code later in this article scales and normalizes the data. With spatial index structures for the neighborhood queries, DBSCAN can also handle large datasets.
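
The following sketch (again with illustrative parameter values) shows the classic case: two interleaving half-moons. K-Means splits them with a straight boundary because it assumes roughly convex clusters, while DBSCAN follows the curved shapes.

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# two interleaving, non-convex half-moon clusters
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # follows the curves
km_labels = KMeans(n_clusters=2, n_init=10,
                   random_state=42).fit_predict(X)          # cuts straight across them

# scatter-plotting X colored by each label set makes the contrast visible
print("DBSCAN labels found:", set(db_labels))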

However, DBSCAN clustering also has several disadvantages that need to be considered. Firstly, it requires tuning two parameters: epsilon (eps), the neighborhood radius, and minimum points (min_samples), the number of neighbors needed to form a dense region. Choosing good values is challenging and can significantly affect the clustering results; the k-distance plot sketched below is a common heuristic for eps. Secondly, DBSCAN is sensitive to eps, and a poor choice leads to poor clusters. Because a single eps applies everywhere, it effectively assumes all clusters have similar density, so it struggles with datasets whose clusters have very different densities, which is common in real-world data.
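
A widely used heuristic for eps, sketched below, is the k-distance plot: compute each point's distance to its k-th nearest neighbor (with k equal to min_samples), sort those distances, and look for the "elbow" where the curve bends sharply upward; eps is read off near that bend. This is a rule of thumb rather than part of the main code later in this article.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    """Plot sorted distances to each point's k-th nearest neighbor."""
    # note: each point counts as its own nearest neighbor here,
    # so this is an approximation of the usual k-distance curve
    neigh = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = neigh.kneighbors(X)
    k_dist = np.sort(distances[:, -1])   # k-th neighbor distance, ascending
    plt.plot(k_dist)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel("Distance to k-th nearest neighbor")
    plt.show()

# usage: k_distance_plot(x_principal, k=4), then read eps off the elbow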

Finally, DBSCAN clustering struggles with high-dimensional data, where the curse of dimensionality degrades density estimation: as the dimensionality of the dataset increases, the performance of DBSCAN decreases. DBSCAN can also be computationally expensive; its worst-case time complexity is O(n^2), where n is the number of data points, although spatial indexes (k-d trees or ball trees) bring the average case closer to O(n log n) for low-dimensional data, as sketched below. Clustering time can still be significant for large datasets.
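
In practice the worst case can often be mitigated: scikit-learn's DBSCAN accepts an algorithm argument that selects a spatial index for the neighborhood queries, and an n_jobs argument that parallelizes them. A minimal sketch (parameter values are illustrative):

from sklearn.cluster import DBSCAN

# a ball-tree index brings neighborhood queries closer to O(n log n)
# on average for low-dimensional data; n_jobs=-1 uses all CPU cores
dbscan = DBSCAN(eps=0.5, min_samples=5, algorithm='ball_tree', n_jobs=-1)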

In summary, DBSCAN is a powerful clustering algorithm that handles clusters of different shapes and sizes, is robust to noise, and does not need the number of clusters in advance. However, it requires careful tuning of eps and min_samples, is sensitive to the density parameter, struggles when clusters have very different densities or when the data is high-dimensional, and can be computationally expensive on large datasets.

Despite its limitations, DBSCAN clustering is widely used in various fields and is an essential tool for data mining and machine learning. To achieve the best results, choose the parameter values carefully (one simple search strategy is sketched below) and preprocess the data appropriately.
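
One simple way to search for a reasonable eps is to sweep a range of values and score each resulting clustering, for example with the silhouette coefficient computed over the non-noise points. The helper below is a sketch; the value ranges in the usage line are illustrative, not prescriptive.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def sweep_eps(X, eps_values, min_samples=4):
    """Try several eps values and report the silhouette score of each."""
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1                      # ignore noise points
        n_clusters = len(set(labels[mask]))
        if n_clusters < 2:                       # silhouette needs >= 2 clusters
            print(f"eps={eps:.3f}: {n_clusters} cluster(s), skipped")
            continue
        score = silhouette_score(X[mask], labels[mask])
        print(f"eps={eps:.3f}: {n_clusters} clusters, silhouette={score:.3f}")

# usage: sweep_eps(x_principal.values, np.linspace(0.01, 0.10, 10))
# (x_principal is the PCA-reduced DataFrame created in the code below)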

In this article, we use mock data for DBSCAN. Now let's begin coding.

Step-by-Step Guide to the Code:

  1. Importing Libraries: The first few lines of the code import the necessary libraries: NumPy, Pandas, Matplotlib, and scikit-learn's DBSCAN, StandardScaler, PCA, and normalize.
  2. Generating Random Data: The code then generates random data for each feature using various distributions (normal, binomial, and Poisson). The data is stored in variables named feature_1, feature_2, and so on. You can add as many features as you like, or as many as your dataset provides; for the purposes of this article we use five features. In a real dataset, examples could include Balance, Balance Frequency, Purchases, One-Off Purchases, Installments Purchases, Cash Advance, Purchases Frequency, One-Off Purchases Frequency, Purchases Installments Frequency, Cash Advance Frequency, Cash Advance Transactions, Purchases Transactions, Credit Limit, Payments, Minimum Payments, Periodic Full Payment, and Tenure, among many others.
  3. Creating a Pandas DataFrame: The generated data is then combined to create a pandas dataframe named “data”.
  4. Checking for Missing Data: The code checks for missing values in the dataset using the isnull() method and prints the number of missing values for each feature.
  5. Filling Missing Data: The missing data is then filled using the fillna() method with the mean value of the corresponding feature.
  6. Scaling and Normalizing Data: The data is then standardized and normalized using StandardScaler and normalize() functions, respectively.
  7. Reducing Dimensionality of Data: The PCA module is used to reduce the dimensionality of the dataset to two principal components. The transformed data is stored in a new dataframe named “x_principal”.
  8. DBSCAN Clustering: The DBSCAN module is used to perform clustering on the reduced dataset using the fit() method. The hyperparameters for the DBSCAN algorithm, such as the epsilon and minimum samples, are set to 0.036 and 4, respectively.
  9. Assigning Cluster Colors: The code assigns a color to each cluster label for visualization purposes.
  10. Visualizing Clusters: Finally, the clusters are visualized using Matplotlib's scatter plot function. The plot shows the two principal components on the X- and Y-axes, and the color of each point represents the cluster it belongs to.
  11. Exporting Data: The code exports the clustered data with cluster labels to a CSV file named “clustered_data.csv” using the to_csv() method.

Python Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random  # only used by the commented-out data set 2 below

from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize

# fix the random seed so the results are reproducible
np.random.seed(42)

# generate random data set 1 for each feature
feature_1 = np.random.normal(loc=50, scale=10, size=2000)
feature_2 = np.random.normal(loc=30, scale=5, size=2000)
feature_3 = np.random.binomial(n=1, p=0.5, size=2000)
feature_4 = np.ones(2000)
feature_5 = np.random.poisson(lam=2, size=2000)

# generate random data set 2 for each feature
# feature_1 = [random.randint(1, 100) for _ in range(2000)]
# feature_2 = [random.randint(1, 100) for _ in range(2000)]
# feature_3 = [random.randint(0, 1) for _ in range(2000)]
# feature_4 = [1 for _ in range(2000)]
# feature_5 = [random.randint(0, 1) for _ in range(2000)]

# create a pandas dataframe with the generated data
data = pd.DataFrame({'Feature 1': feature_1,
                     'Feature 2': feature_2,
                     'Feature 3': feature_3,
                     'Feature 4': feature_4,
                     'Feature 5': feature_5})

print("Head of the generated data:\n", data.head())

# Check for any missing values in the dataset
print("Number of missing values in the dataset:\n", data.isnull().sum())

# Fill any missing values in the dataset with their mean
data.fillna(data.mean(), inplace=True)

# Scaling and normalizing the dataset
scaler = StandardScaler()
x_scaled = scaler.fit_transform(data)
x_normal = normalize(x_scaled)
x_normal = pd.DataFrame(x_normal)

# Reduce the dimensionality of the dataset
pca = PCA(n_components=2)
x_principal = pca.fit_transform(x_normal)
x_principal = pd.DataFrame(x_principal)
x_principal.columns = ['V1', 'V2']
print("Head of the principal components:\n", x_principal.head())

# Perform DBSCAN clustering on the reduced dataset
dbscan = DBSCAN(eps=0.036, min_samples=4).fit(x_principal)
labels = dbscan.labels_
data['cluster'] = labels
print("Tail of the dataset with cluster labels:\n", data.tail())

# Assign colors to the clusters (-1 is DBSCAN's noise label)
clusterColor = {0: 'yellow', 1: 'green', 2: 'blue', -1: 'red'}
unique_labels = set(labels)
for label in unique_labels:
    if label not in clusterColor:
        clusterColor[label] = 'black'   # any extra clusters fall back to black
colors = [clusterColor[label] for label in labels]

# Visualize the clusters
plt.figure(figsize=(12, 10))
plt.scatter(x_principal['V1'], x_principal['V2'], c=colors)
plt.title("Implementation of DBSCAN Clustering", fontname="Times New Roman", fontweight="bold")
plt.show()

# Export the data with cluster labels to a CSV file
data.to_csv("clustered_data.csv", index=False)
Figure: DBSCAN clusters using data set 1 (the NumPy distributions).

Figure: DBSCAN clusters using data set 2 (the commented-out random integers).