PyData New York City 2018 - Presentation: HDBSCAN, fast density based clustering, the how and the why

Density based clustering constitutes a set of algorithms that make few assumptions about your data. Hierarchical clustering allows a user to explore the cluster structure within their data at multiple levels of resolution without forcing them to make an assumption regarding the number of clusters. This makes both techniques excellent tools in a data scientist's arsenal during the initial exploratory stages of any data analysis project.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a generalization of DBSCAN (Density-Based Spatial Clustering of Applications with Noise). As a hierarchical density based clustering algorithm, HDBSCAN should be among the first tools a data scientist brings to bear on a new data set.

In this talk we discuss why density based clustering -- and specifically these techniques -- work, how they work and how we made them scale to large data.

We begin by giving a short overview of clustering techniques, showing how density based clustering, specifically the popular HDBSCAN algorithm, fits within the landscape of unsupervised learning. We provide insight into the mathematical and algorithmic underpinnings of this technique and describe how we transformed an inherently $O(n^2)$ algorithm into a fast, scalable $O(n \log n)$ algorithm.
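To make the quadratic starting point concrete, here is a minimal sketch assuming the standard HDBSCAN formulation, in which each point gets a core distance (the distance to its k-th nearest neighbor) and pairs of points are compared by mutual reachability distance. The function names, the toy points, and the convention of not counting a point as its own neighbor are illustrative assumptions, not code from the talk. Materializing the full pairwise matrix as done here is what costs $O(n^2)$ time and memory.

```python
import math

def pairwise_distances(points):
    # Naive all-pairs Euclidean distances: O(n^2) time and memory.
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def core_distances(dist, k):
    # Core distance: distance to the k-th nearest neighbor.
    # Assumption: index 0 of each sorted row is the point itself (distance 0),
    # so index k is the k-th neighbor excluding the point.
    return [sorted(row)[k] for row in dist]

def mutual_reachability(dist, core):
    # d_mreach(a, b) = max(core(a), core(b), d(a, b)) for a != b.
    n = len(dist)
    return [
        [0.0 if i == j else max(core[i], core[j], dist[i][j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical toy data: three nearby points and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
dist = pairwise_distances(points)
core = core_distances(dist, k=2)
mreach = mutual_reachability(dist, core)
```

Mutual reachability never shrinks a distance, and it pushes points in sparse regions (here the far-away point, with its large core distance) further from everything else, which is what lets noise separate from dense clusters.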

This talk is targeted at a broad base of users. Participants new to data science should leave the talk with a better understanding of clustering algorithms in general and a good, easy-to-use default clustering option. More experienced data science practitioners should gain a good understanding of how the popular HDBSCAN algorithm works. Finally, data science researchers should come away with knowledge of techniques applicable to making other clustering algorithms scale well.

Though we will describe the mathematical underpinnings of HDBSCAN, we will do so primarily through visualizations to make the talk accessible to a broad audience.

Thursday 4:20 PM–5:00 PM in Central Park West (#6501)

John Healy
