Thursday 4:20 PM–5:00 PM in Central Park West (#6501)

HDBSCAN, fast density based clustering, the how and the why

John Healy

Audience level:
Novice

Description

HDBSCAN is a popular hierarchical density-based clustering algorithm with an efficient Python implementation. In this talk we show how it works, why it works, and why it should be among the first algorithms you use when exploring a new data set. We will also show how we took an inherently O(n^2) algorithm and turned it into the O(n log n) algorithm that is available in scikit-learn-contrib.

Abstract

Density-based clustering is a family of algorithms that make few assumptions about your data. Hierarchical clustering allows a user to explore the cluster structure within their data at multiple levels of resolution without forcing them to make an assumption about the number of clusters. This makes both techniques excellent tools in a data scientist's arsenal during the initial exploratory stages of any data analysis project.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a generalization of DBSCAN (Density-Based Spatial Clustering of Applications with Noise). As a hierarchical density-based clustering algorithm, HDBSCAN should be among the first tools a data scientist brings to bear on a new data set.

In this talk we discuss why density-based clustering -- and specifically these techniques -- work, how they work, and how we made them scale to large data.

We begin with a short overview of clustering techniques, showing how density-based clustering, specifically the popular HDBSCAN algorithm, fits within the landscape of unsupervised learning. We provide insight into the mathematical and algorithmic underpinnings of this technique and describe how we transformed an inherently $O(n^2)$ algorithm into a fast, scalable $O(n \log n)$ algorithm.
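To give a feel for where the quadratic cost comes from, here is a minimal brute-force sketch of the first stage of HDBSCAN as it is commonly described: compute each point's core distance, build the mutual reachability distance between every pair of points, and take a minimum spanning tree of the result. This is not the speaker's implementation and not the accelerated algorithm from the talk. The function names, the toy data, and the convention that `k` counts neighbours excluding the point itself are all assumptions made for illustration.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (brute force).

    Assumption for this sketch: k excludes the point itself.
    """
    result = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[k])  # dists[0] is the point itself (distance 0)
    return result


def mutual_reachability(points, k):
    """Dense n x n matrix with entries max(core(a), core(b), d(a, b))."""
    core = core_distances(points, k)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(core[i], core[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense matrix; returns edges (i, j, weight)."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge into each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges


# Toy data: two tight groups of three points, far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
mst = minimum_spanning_tree(mutual_reachability(points, k=2))
longest_edge = max(weight for _, _, weight in mst)
```

Every pair of points is examined, so both the distance matrix and the spanning tree step cost on the order of n^2 here. In the toy data the single long edge in `mst` is the one bridging the two groups; cutting the tree at long edges is what gives rise to the cluster hierarchy.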

This talk is targeted at a broad base of users. Participants new to data science should leave the talk with a better understanding of clustering algorithms in general and a good, easy-to-use default clustering option. More experienced data science practitioners should gain a good understanding of how the popular HDBSCAN algorithm works. Finally, data science researchers should come away with knowledge of techniques applicable to making other clustering algorithms scale well.

Though we will describe the mathematical underpinnings of HDBSCAN, we will do so primarily through visualizations to make the talk accessible to a broad audience.
