Demonstrating Customer Segmentation with DBSCAN Clustering Using Python

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an award-winning clustering algorithm that catches our eye. In this post we look at what DBSCAN is and its application to customer segmentation, a critical area in business analytics.

Hshan.T
7 min readMar 4, 2021

Intro

To optimize business objectives, we often focus on a subset of potential customers rather than broad coverage, narrowing the target audience through customer segmentation. Recall customer segmentation via centroid-based clustering (K-Means), discussed in the previous post: the model applied there has some drawbacks that we may wish to minimize or even eliminate, with the goal of proposing a more flexible or robust alternative. In this piece of writing, we will go through a density-based solution, DBSCAN, that overcomes those issues.

Why DBSCAN?

Looking at the clustering result from K-Means, every data point is assigned to a cluster, and the mean of the points in each cluster is its centroid. Thus, every point plays a role in determining the clusters. In reality, it is not unusual for some data points to deviate from the 'normal'; these are the so-called anomalies or outliers. The presence of a single outlier, or a minor change in a single point, can distort the entire clustering result. Besides, the clusters formed by K-Means tend to be spherical, and problems arise if clusters have elliptical or other odd shapes. The scree plot and silhouette plot serve as guidance in choosing the parameter k, but they are highly influenced by the model's sensitivity to anomalies. New approaches are created to overcome the complications of existing methods, and DBSCAN handles most of the downsides of K-Means.

DBSCAN

DBSCAN is likewise a non-parametric unsupervised learning method: we do not prescribe any model for where the data comes from. The fewer the assumptions, the more flexible the model and the wider its application. Like K-Means, DBSCAN groups data by similarity based on a distance function, but it also uses density, here meaning the number of points in a region defined by the model parameters. Let us define the terms used to explain the process happening behind the scenes:

1. eps — defines the neighborhood of a point x, i.e. the radius of a circle (or hypersphere in N-D space) centred at x

2. minPts — minimum number of points in neighborhood to form a dense region

3. Core point — a point that has at least minPts data points within its eps-neighborhood

4. Border point — a point that lies within a dense region (a cluster) but is not itself a core point

5. Noise — a point that does not belong to any cluster

6. Directly density reachable — a point x is directly density-reachable from y if x is within the eps-neighborhood of y and y is a core point

7. Indirectly density reachable — a point x(1) is indirectly density-reachable(eps, minPts) from x(n) if there is a chain of points {x(1), …, x(n)} such that x(i) is directly density-reachable from x(i+1).

8. Density connectivity — points x and y are density-connected(eps, minPts) if there is a point z such that both x and y are density-reachable from z.
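These definitions map directly onto the output of sklearn's DBSCAN. As a minimal sketch on toy data invented here for illustration (two tight blobs plus one isolated point), the core/border/noise split can be recovered like this:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data: two tight blobs plus one isolated point far from both.
rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.1, size=(20, 2)),
    rng.normal(loc=[3, 3], scale=0.1, size=(20, 2)),
    [[10.0, 10.0]],  # expected to end up as noise
])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points
noise_mask = db.labels_ == -1               # noise gets label -1
border_mask = ~core_mask & ~noise_mask      # in a cluster, but not core

print("clusters:", sorted(set(db.labels_) - {-1}))
print("core:", int(core_mask.sum()),
      "border:", int(border_mask.sum()),
      "noise:", int(noise_mask.sum()))
```

sklearn exposes core points via `core_sample_indices_`; border points are simply the cluster members left over once core points and noise are removed.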

DBSCAN partitions the data such that dense regions are separated by sparse regions. It works well with data having the following characteristics:

· irregular cluster shapes

· different cluster sizes

· presence of outliers

Algorithm:

1. Randomly pick a point and check whether it is a core point, i.e. has at least minPts points in its eps-neighborhood; if so, it starts a cluster.

2. Recursively visit the points in the neighborhood of each core point until all are visited, assigning all density-reachable points to the cluster.

3. Repeat the process on the remaining unvisited points.
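The three steps above can be sketched in plain Python. This is a minimal illustration of the algorithm (brute-force distances, no spatial index), not the sklearn implementation:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch; labels are 0, 1, ... and -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)                 # everything starts as noise
    visited = np.zeros(n, dtype=bool)
    # Precompute eps-neighborhoods from the pairwise distance matrix.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    cluster = 0
    for i in range(n):                      # step 3: sweep unvisited points
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:     # step 1: not a core point
            continue
        labels[i] = cluster                 # step 1: core point starts a cluster
        queue = list(neighbors[i])          # step 2: expand the neighborhood
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # density-reachable: join the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])  # j is core: keep expanding
        cluster += 1
    return labels

# Two tight triads and one far-away point.
X_demo = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                   [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
                   [10.0, 10.0]])
print(dbscan(X_demo, eps=0.3, min_pts=3))  # triads form clusters; far point stays noise
```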

Illustrating DBSCAN process. credit: https://www.digitalvidya.com

Implementing DBSCAN

Data is only useful if we dive into the meaning behind the numbers; it remains a mere record if we do not analyze, digest and visualize it. After this brief introduction to DBSCAN, the text below demonstrates its application, adopting the preprocessed real-world data from the previous post. The overall plan stays the same, but this time DBSCAN takes over the partitioning task from K-Means.

First, we need to estimate and specify the parameters for DBSCAN, starting with minPts and then deriving eps from it. If minPts=1, every data point forms a cluster by itself, meaning n data points give n clusters, which makes clustering pointless. If minPts=2, the result is the same as single-linkage hierarchical clustering. Hence minPts should be set as minPts ≥ D+1, where D is the number of features. Nevertheless, by rule of thumb, it is set to minPts=2*D below for simplicity. Moving forward, eps is determined from the k-distance graph: the mean distance from each point to its k nearest neighbors is computed, sorted, and plotted. The y-value of the point where an 'elbow' is spotted on the curve is taken as eps; here, eps=0.7.

Python code:

code for determining parameter eps.
k distance graph
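The original code appears as an image; a sketch of the procedure described above might look like the following, with a randomly generated 3-D array standing in for the post's preprocessed RFM data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Stand-in for the preprocessed data from the previous post.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))

D = X.shape[1]       # number of features
min_pts = 2 * D      # rule of thumb: minPts = 2 * D

# Mean distance from each point to its min_pts nearest neighbours,
# sorted ascending to draw the k-distance graph.
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances.mean(axis=1))

plt.plot(k_dist)
plt.xlabel("points sorted by mean k-NN distance")
plt.ylabel("mean distance to %d nearest neighbours" % min_pts)
# eps is read off at the 'elbow' of this curve (0.7 for the post's data).
```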

Applying the estimated parameters, a DBSCAN model is fitted and the result is visualized in a 3D plot. Note that the black dots are detected anomalies; the DBSCAN model in the sklearn library labels them '-1'. The anomalies are loosely scattered away from the denser regions. The number of clusters and the set of outliers are strongly affected by the parameter settings. Spotting the 'elbow' can be subjective in ambiguous cases where the curve shows no sharp change in distance between successive points, so it is important to keep in mind a plausible number of clusters to look for. Changing the parameters in this case can produce a large number of clusters, which is not reasonable and digresses from what we are interested in; one explanation for this is the small sample size of the processed data. The important characteristics, or aggregated features, of each cluster are then analyzed and summarized in the snake plot below to aid business decision making.

Python code:

code for fitting a model.
code for visualizing clusters in 3D plot.
3D plot. Black color points are anomalies detected.
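The fitting and plotting code are images in the original; a hedged reconstruction, again on synthetic 3-D data standing in for the real RFM features, could read:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Synthetic stand-in: two dense blobs plus a few scattered points.
rng = np.random.RandomState(1)
X = np.vstack([
    rng.normal([0, 0, 0], 0.3, (60, 3)),
    rng.normal([3, 3, 3], 0.3, (60, 3)),
    rng.uniform(-6, 9, (8, 3)),     # likely anomalies
])

# Parameters as estimated above: eps from the k-distance graph, minPts = 2 * D.
db = DBSCAN(eps=0.7, min_samples=6).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("clusters found:", n_clusters)
print("anomalies (label -1):", n_noise)

# 3-D scatter with anomalies drawn in black, as in the post's figure.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for lab in sorted(set(labels)):
    pts = X[labels == lab]
    color = "black" if lab == -1 else "C%d" % lab
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=color,
               label="noise" if lab == -1 else "cluster %d" % lab)
ax.legend()
```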

Python code:
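The snake-plot code is likewise an image in the original; as a sketch, with hypothetical cluster assignments and standardised RFM values in place of the real result:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical standardised RFM table with a cluster label per customer.
rng = np.random.RandomState(2)
rfm = pd.DataFrame(rng.normal(size=(90, 3)),
                   columns=["Recency", "Frequency", "Monetary"])
rfm["Cluster"] = rng.randint(0, 3, size=len(rfm))

# Snake plot: one line per cluster across the RFM attributes.
melted = rfm.melt(id_vars="Cluster", var_name="Attribute", value_name="Value")
means = melted.groupby(["Cluster", "Attribute"])["Value"].mean().unstack()
for cluster, row in means.iterrows():
    plt.plot(row.index, row.values, marker="o", label="cluster %d" % cluster)
plt.legend()
plt.title("Snake plot of mean RFM values per cluster")
```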

The overall shapes in this snake plot are similar to Figure 6 here, but looking carefully at the numbers plotted above, there are some differences, the slightly more obvious one being in 'Monetary'. Cluster 2 is the most economically valuable to the business, with the lowest 'Recency' and the highest 'Monetary' and 'Frequency'. When the goal of the task changes, the way we look at this result should be updated to make a sound business decision. For example, if the business wishes to retain potentially churning customers, or is seeking targets for new products or promotions, the cluster with high 'Recency' grabs our attention, and another cluster is selected instead of cluster 2. A carefully and thoroughly planned analysis goal is crucial, because it affects how the data is processed and hence strongly impacts the result.

DBSCAN vs K-Means

Comparing K-Means and DBSCAN.
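The comparison is shown as an image in the original. As a quick stand-alone illustration (on sklearn's `make_moons` toy data, not the post's RFM data), the sketch below shows K-Means forcing spherical splits while DBSCAN recovers the non-spherical clusters:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: a classic non-spherical clustering case.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

def agreement(a, b):
    """Share of points labelled consistently with b, up to swapping 0 and 1."""
    return max(np.mean(a == b), np.mean(a == 1 - b))

print("K-Means agreement with truth:", round(agreement(km_labels, y), 3))
print("DBSCAN agreement with truth:", round(agreement(db_labels, y), 3))
```

On this data DBSCAN's density-based expansion follows each moon's shape, while K-Means, minimizing distances to two centroids, cuts the moons roughly in half.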

Conclusion

A distance function is used to compute similarities between samples, and the choice of parameters depends on the function selected. In simple cases we manage to identify dense regions in scatter plots visually and intuitively, but such situations are rare, especially as we are exposed to higher-dimensional data or data without clear boundaries between dense regions. Although the advantages of DBSCAN over K-Means are outlined above, its results are expected to deteriorate when the data dimension grows too large and the 'curse of dimensionality' kicks in. Clustering results on data with widely varying densities, or data that is loosely distributed, will not be decent either, and may even depart from our goal. At times, points falling in confusing regions with unclear boundaries may be grouped into either cluster. Understand the data you have, and choose a model able to handle its structure, to get the most veracious result telling the true story.
