Clustering algorithms are essential tools in machine learning, and a data scientist should be comfortable implementing several of them.
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is a great algorithm to use when the usual K-Means clustering cannot handle the data well. It specializes in finding clusters of arbitrary shape.
Terminology:
Eps: Maximum radius of the neighborhood
Min_samples: Minimum number of points in the Eps-neighborhood required to form a dense region
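Core Point: A point with at least min_samples points inside its Eps-neighborhood (scikit-learn counts the point itself)
Border Point: A point with fewer than min_samples points inside its Eps-neighborhood that nevertheless lies within the Eps-neighborhood of a core point
Noise (Outlier): A point that is neither a core point nor a border point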
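To make the terminology concrete, here is a minimal sketch that labels each point as core, border, or noise by brute force. The function name classify_points and the six sample points are hypothetical, and the sketch assumes the scikit-learn convention that a point counts toward its own neighborhood.

```python
import math


def classify_points(points, eps, min_samples):
    """Label each point as 'core', 'border', or 'noise'.

    Assumption: a point is included in its own Eps-neighborhood, and a
    point is core when that neighborhood holds at least min_samples points.
    """
    n = len(points)
    # neighbors[i] lists every index within eps of point i (including i itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Hypothetical toy data: a small square, one point hanging off it, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify_points(points, eps=1.0, min_samples=3)
for point, label in zip(points, labels):
    print(point, label)
```

With these values the four corners of the square come out as core points, (2, 1) as a border point because its only neighbor is the core point (1, 1), and (8, 8) as noise.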
Specialties/Use Cases:
No need to specify number of clusters
Largely insensitive to the ordering of data points in the database (only border points can be affected)
Robust to outliers
Great at finding arbitrarily shaped clusters
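As an illustration of the point about arbitrarily shaped clusters, the following sketch compares K-Means and DBSCAN on scikit-learn's two-moons toy data. The toy data set, the eps of 0.2, and the min_samples of 5 are assumptions chosen for this example only, and are unrelated to the port data used later in this post.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a shape that centroid-based clustering struggles with.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means has to be told the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN is given a neighborhood radius and a density threshold instead.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the labels match the true moons exactly.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

# The label -1 marks noise, so it is not counted as a cluster.
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f} ({n_clusters} clusters found)")
```

K-Means splits the plane with a straight boundary between its two centroids, so each of its clusters ends up containing parts of both moons. DBSCAN follows the dense band of points along each moon instead.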
Disadvantages:
You are required to have an understanding of the data in order to choose Eps and min_samples
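Not completely deterministic: a border point within reach of core points from more than one cluster is assigned to whichever cluster reaches it first, which depends on the order in which the points are processed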
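One common heuristic for getting a feel for Eps, not something taken from the experiments in this post, is to look at each point's distance to its k-th nearest neighbor with k tied to min_samples. The sketch below computes those sorted distances with scikit-learn's NearestNeighbors. The synthetic data and the min_samples of 5 are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic stand-in data: two dense blobs plus sparse background points.
dense_a = rng.normal(loc=(0, 0), scale=0.5, size=(100, 2))
dense_b = rng.normal(loc=(10, 10), scale=0.5, size=(100, 2))
background = rng.uniform(low=-20, high=30, size=(20, 2))
X = np.vstack([dense_a, dense_b, background])

min_samples = 5

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), so the last column is the radius at which a point's
# neighborhood, itself included, first holds min_samples points.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# A sharp jump between percentiles hints at where dense regions end.
for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.2f}")
```

Plotting k_distances and picking an Eps near the bend in the curve is a starting point only. The final values still come from trying a few settings and looking at the resulting clusters.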
Implementing DBSCAN
While exploring DBSCAN, I could only read so much material and watch so many videos explaining the algorithm before I wanted to try implementing it myself. As is typical with data science, you won’t really have a sense of the work until you actually do it.
I chose a data set consisting of all ports around the world. This includes airports, train stations and ferry ports.
When all the airports are placed on a scatter plot by latitude and longitude, anyone familiar with K-Means clustering can readily guess how K-Means would group them.
Here we can see how DBSCAN handles the data.
I played around with Eps and min_samples quite a bit before arriving at this clustering, and concluded that an Eps of 6 and a min_samples of 11 worked best.
This is where DBSCAN really shines. A good Eps can be found once you get a better sense of the data.
The grey points are the noise.
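For readers who want to try this, here is a minimal sketch of such a run with scikit-learn's DBSCAN, using the Eps of 6 and min_samples of 11 mentioned above. The coordinates are synthetic stand-ins, not the airport data, and the sketch assumes plain Euclidean distance on longitude and latitude in degrees.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Hypothetical stand-in for airport coordinates as (longitude, latitude) in degrees:
# two dense regions plus points scattered thinly across the map.
hub_a = rng.normal(loc=(-95, 38), scale=3, size=(200, 2))
hub_b = rng.normal(loc=(10, 50), scale=3, size=(200, 2))
scattered = np.column_stack(
    [rng.uniform(-180, 180, 60), rng.uniform(-60, 70, 60)]
)
coords = np.vstack([hub_a, hub_b, scattered])

# Assumption: Euclidean distance on raw degrees (scikit-learn's default metric).
model = DBSCAN(eps=6, min_samples=11).fit(coords)
labels = model.labels_

# scikit-learn labels noise points -1; these would be the grey points in a plot.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(f"clusters found: {n_clusters}")
print(f"noise points:   {n_noise} of {len(coords)}")
```

To reproduce a plot like the one described, color each point by its entry in labels and draw the points labeled -1 in grey.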
Let’s see it again, now on the train stations data.
The train station data set has about 1/7 as many data points as the airports data set.
Even with only one thousand data points, there is still some observable clustering.
DBSCAN further reveals where the clustering is densest.
Again, the grey points are noise.
I chose an Eps of 7 and a min_samples of 7.
While DBSCAN might seem a bit daunting at first, I highly recommend using it when K-Means fails to properly cluster your data.
When density is your goal, DBSCAN is your algorithm!