DBSCAN: an algorithm for the masses

May 29, 2020

Figure: DBSCAN run on the same data points with different Eps values

Clustering algorithms are essential tools in machine learning, and a data scientist should be comfortable implementing a variety of them.

DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is a great algorithm to use when the usual K-Means clustering cannot handle the data well. It specializes in finding clusters of arbitrary shape.

Terminology:

Eps: Radius of the neighborhood around a point; two points are neighbors if they are at most Eps apart

Min_samples: Minimum number of points in the Eps-neighborhood required to form a dense region

Core Point: a point with at least min_samples points in its Eps-neighborhood

Border Point: a point with fewer than min_samples points in its Eps-neighborhood that nevertheless lies within the Eps-neighborhood of a core point

Noise (Outlier): a point that is neither a core point nor a border point

The core points shown here use a min_samples of 4. To be a core point, a point must have at least 4 data points within Eps of it (many implementations, scikit-learn included, count the point itself toward this total). A point that falls short but lies within Eps of a core point is a border point; anything else falls into the noise category.
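The definitions above can be sketched in a few lines of plain Python. This is only an illustration of the three categories, not a full DBSCAN: the function name `classify_points` and the sample coordinates are made up for the example, and it assumes the convention that a point counts as its own neighbor.

```python
from math import dist


def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise'.

    Assumption: a point is counted in its own Eps-neighborhood.
    """
    # For every point, collect the indices of all points within eps of it.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]

    labels = []
    for i, n in enumerate(neighbors):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in n):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Hypothetical toy data: a small square, one point hanging off it, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
labels = classify_points(points, eps=1.0, min_samples=3)
for point, label in zip(points, labels):
    print(point, label)
```

In this toy data the four corners of the square come out as core points, (2, 1) as a border point because it only touches the core point (1, 1), and (5, 5) as noise.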

Specialties/Use Cases:

No need to specify number of clusters

Largely insensitive to the ordering of the data points in the database

Robust to outliers

Great at finding arbitrarily shaped clusters
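To make the arbitrary-shape point concrete, here is a small sketch comparing K-Means and DBSCAN on scikit-learn's synthetic "two moons" data. The data set, the Eps of 0.2 and the min_samples of 5 are assumptions chosen for this illustration and have nothing to do with the ports data used later.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are not round blobs.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means needs the number of clusters up front; DBSCAN does not.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true moons.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)

# DBSCAN marks noise with the label -1, so exclude it when counting clusters.
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print(f"K-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI:  {dbscan_ari:.2f} ({n_dbscan_clusters} clusters found)")
```

K-Means tends to cut each moon in half with a straight boundary, while DBSCAN follows the dense curve of each moon.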

Disadvantages:

You are required to have an understanding of the data in order to choose Eps and min_samples

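Not completely deterministic: a border point within reach of more than one cluster can be assigned to either, depending on the order in which the points are processed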
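One common heuristic for getting a first guess at Eps is the k-distance curve: for each point, measure the distance to its min_samples-th nearest neighbor, then sort those distances. This is not the method used for the results in this post, just a sketch of the general idea; the synthetic data and the min_samples value of 5 are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

min_samples = 5  # illustrative value, not taken from the article

# Querying the training data returns each point as its own nearest neighbor
# (distance 0), which matches scikit-learn's convention that min_samples
# counts the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from each point to its min_samples-th neighbor, sorted ascending.
k_distances = np.sort(distances[:, -1])

# Plotted, this curve is usually flat and then rises sharply; an Eps near the
# bend is a reasonable starting point. Here we print percentiles instead.
for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}")
```

The value read off the curve is only a starting point; it still needs to be checked against what you know about the data.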

Implementing DBSCAN

In exploring the DBSCAN algorithm, I could only read so much material and watch video after video explaining the algorithm, before I wanted to try implementing it myself. As is typical with data science, you won’t really have a sense of the work until you actually do it.

I chose a data set consisting of all ports around the world. This includes airports, train stations and ferry ports.

When all the airports are placed on a scatter plot by latitude and longitude, anyone familiar with K-Means clustering can readily guess how K-Means would group them.

Here we can see how DBSCAN handles the data.

I played around with Eps and min_samples quite a bit before arriving at this clustering, concluding that an Eps of 6 and a min_samples of 11 worked best.
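A run with these settings might look roughly like the sketch below, using scikit-learn. The coordinates are randomly generated stand-ins, not the airports data, and the sketch assumes Eps is measured in degrees, with latitude and longitude treated as plain planar coordinates.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical stand-in for the airports data: (latitude, longitude) in degrees.
dense_a = rng.normal(loc=(48.0, 10.0), scale=2.0, size=(200, 2))
dense_b = rng.normal(loc=(38.0, -95.0), scale=3.0, size=(200, 2))
sparse = np.column_stack([rng.uniform(-60, 70, 40), rng.uniform(-180, 180, 40)])
coords = np.vstack([dense_a, dense_b, sparse])

# Eps and min_samples as quoted in the text.
model = DBSCAN(eps=6, min_samples=11).fit(coords)
labels = model.labels_

# scikit-learn labels noise points -1; every other label is a cluster id.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))

print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

The two dense groups become clusters, while most of the thinly scattered points are left as noise.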

This is where DBSCAN really shines. A good Eps can be found once you get a better sense of the data.
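Getting that sense of the data can be as simple as trying a range of Eps values and watching how the number of clusters and noise points changes. The sketch below does this on synthetic blobs; the helper name `sweep_eps`, the data and the candidate values are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in data; replace with your own coordinates.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)


def sweep_eps(X, eps_values, min_samples):
    """Return (eps, number of clusters, number of noise points) per Eps."""
    rows = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels) - {-1})  # -1 marks noise
        n_noise = int(np.sum(labels == -1))
        rows.append((eps, n_clusters, n_noise))
    return rows


results = sweep_eps(X, eps_values=[0.1, 0.3, 0.5, 1.0, 2.0], min_samples=5)
for eps, n_clusters, n_noise in results:
    print(f"eps={eps:<4} clusters={n_clusters} noise={n_noise}")
```

With min_samples fixed, a larger Eps can only shrink the set of noise points, so a very small Eps labels almost everything as noise and a very large one merges everything into a single cluster. The useful values sit somewhere in between.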

The grey points are the noise.

Let’s see it again, now on the train stations data.

The train station data set has about one-seventh as many data points as the airports data set.

Even with only one thousand data points, some clustering is still visible.

DBSCAN further reveals where the clustering is densest.

Again, the grey points are noise.

I chose an Eps of 7 and a min_samples of 7.

While DBSCAN might seem a bit daunting at first, I highly recommend using it when K-Means fails to properly cluster your data.

When density is your goal, DBSCAN is your algorithm!

Written by Claudia
