DBSCAN: an algorithm for the masses

Claudia
3 min read · May 29, 2020


DBSCAN made using random seeds
DBSCANs with same data points, but different Eps

Clustering algorithms are highly important tools when it comes to machine learning. A data scientist should become familiar with implementing a variety of clustering algorithms.

DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is a great algorithm to use when the usual K-Means clustering is not able to handle the data well. It specializes in finding clusters of arbitrary shape.

Terminology:

Eps: Maximum radius of the neighborhood

Min_samples: Minimum number of points in the Eps-neighborhood required to form a dense region

Core Point: a point with at least min_samples points in its Eps-neighborhood

Border Point: a point with fewer than min_samples points within its Eps-neighborhood that still lies within the Eps-neighborhood of a core point

Noise: (Outlier) not a core point or border point

Random Data Points
The core points here are shown with a min_samples of 4, meaning that in order to be a core point, a point must have at least 4 other data points within its Eps-neighborhood. A point that falls short of that but lies within Eps of a core point is a border point; anything else falls into the noise category.
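To make the three categories concrete, here is a minimal pure-Python sketch that labels each point as core, border, or noise. The function name classify_points and the toy coordinates are made up for illustration. It also assumes a point counts itself toward min_samples, which is how scikit-learn does it and is slightly different from the "4 other points" reading of the figure above.

```python
from math import dist

def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise'.

    Assumption: a point counts itself toward min_samples.
    """
    # Eps-neighborhood of each point: indices of all points within eps (itself included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]

    labels = []
    for i, n in enumerate(neighbors):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in n):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a plus sign, one point hanging off its arm, one far away.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (2, 0), (10, 10)]
labels = classify_points(points, eps=1.0, min_samples=4)

for point, label in zip(points, labels):
    print(point, label)
```

Notice that (2, 0) sits right next to the border point (1, 0) and still ends up as noise: only core points can pull a neighbor into a cluster.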

Specialties/Use Cases:

No need to specify number of clusters

Largely insensitive to the ordering of data points in the database (core points and noise come out the same in any order)

Robust to outliers

Great at finding arbitrarily shaped clusters
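A quick way to see the "arbitrarily shaped clusters" point is to run both algorithms on scikit-learn's two-moons toy data. This is only a sketch: the data set, the eps of 0.2 and the min_samples of 5 are my own picks for the demo, not values from the port data used later.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a classic non-convex shape.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means has to be told there are 2 clusters; DBSCAN works it out from density.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the labels match the true moons exactly.
kmeans_score = adjusted_rand_score(y_true, kmeans_labels)
dbscan_score = adjusted_rand_score(y_true, dbscan_labels)

# scikit-learn marks noise with the label -1, so leave it out of the count.
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print(f"K-Means ARI: {kmeans_score:.2f}")
print(f"DBSCAN ARI:  {dbscan_score:.2f} ({n_dbscan_clusters} clusters found)")
```

K-Means splits the plane with a straight boundary, so it cuts across both moons, while DBSCAN follows each curve because the points along it are densely connected.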

Disadvantages:

You are required to have an understanding of the data in order to choose Eps and min_samples

Not completely deterministic: a border point within reach of two clusters is assigned to whichever cluster is processed first
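On the first disadvantage, one common heuristic for getting a feel for Eps is to sort every point's distance to its k-th nearest neighbor and look for a sharp jump. The sketch below is pure Python on made-up toy data; the helper name k_distances is hypothetical, and tying k to min_samples (for example k = min_samples - 1 if a point counts itself) is an assumption rather than a rule.

```python
from math import dist

def k_distances(points, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)

# Hypothetical toy data: a tight 3x3 grid plus two far-away stragglers.
grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(10, 10), (20, 0)]

kd = k_distances(points, k=3)
for value in kd:
    print(f"{value:.2f}")
```

The nine grid points all have a small 3rd-neighbor distance, then the values jump for the two stragglers. An Eps just above the last small value keeps the dense group together and leaves the stragglers as noise. On real data the jump is usually a softer "knee", so treat this as a starting point for experimenting.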

Implementing DBSCAN

In exploring the DBSCAN algorithm, I could only read so much material and watch so many videos explaining it before I wanted to try implementing it myself. As is typical with data science, you won’t really have a sense of the work until you actually do it.

I chose a data set consisting of all ports around the world. This includes airports, train stations and ferry ports.

Ho hum, regular, degular scatter plot
oooh check out those clusters

When all the airports are placed on a scatter plot based on their latitude and longitude, anyone familiar with K-Means clustering can guess how K-Means would group them.

Here we can see how a DBSCAN handles the data.

I played around with Eps and min_samples quite a bit before arriving at this clustering, concluding that an Eps of 6 and a min_samples of 11 worked best.

This is where DBSCAN really shines. A good Eps can be found once you get a better sense of the data.

The grey points are the noise.
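For anyone who wants to try something similar, here is a rough sketch using scikit-learn's DBSCAN with the same Eps of 6 and min_samples of 11. The coordinates are synthetic stand-ins, not the airport data, and the sketch assumes Eps is measured in plain degrees of longitude/latitude with the default Euclidean metric.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical stand-in for the airport data: columns are (longitude, latitude).
blob_a = rng.normal(loc=(-95.0, 38.0), scale=2.0, size=(100, 2))
blob_b = rng.normal(loc=(10.0, 50.0), scale=2.0, size=(100, 2))
scattered = rng.uniform(low=(-180.0, -60.0), high=(180.0, 70.0), size=(30, 2))
coords = np.vstack([blob_a, blob_b, scattered])

# Same parameter values as in the article; the distance units are an assumption.
labels = DBSCAN(eps=6, min_samples=11).fit_predict(coords)

# scikit-learn gives noise points the label -1 (the grey points in a plot).
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

To draw a plot like the ones here, scatter the rows where the label is -1 in grey and give every other label its own color.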

Boring ol’ scatter plot
Watch out now! Clustering inbound!

Let’s see it again, now on the train stations data.

The train station data set has about 1/7 as many data points as the airports data set.

Even with only one thousand data points, there is still a bit of observable clustering occurring.

DBSCAN further reveals where the clustering is densest.

Again, the grey points are noise.

I chose an Eps of 7 and a min_samples of 7.

While DBSCAN might seem a bit daunting at first, I highly recommend using it when K-Means fails to properly cluster your data.

When density is your goal, DBSCAN is your algorithm!
