PyData Delhi 2017 - Presentation: Understanding Clustering: Supervising the unsupervised

Evaluating the results of a clustering model is very useful for choosing an algorithm and its parameters. We will discuss some clustering evaluation metrics and their efficacy in making more informed decisions. I'll also demo Clustervision, a visual analytics tool that helps data scientists find the right clustering among the large number of techniques and parameters available.

Clustering, the process of grouping similar data points into distinct partitions, is a common unsupervised machine learning technique that can be useful for summarizing and aggregating complex multi-dimensional data. However, one needs to choose from a plethora of algorithms, each with a number of parameters. Having a way to evaluate the results of a clustering model is very useful for choosing an algorithm and its parameters. Unfortunately, because there is no ground truth, this is an incredibly hard problem. We will discuss some clustering evaluation metrics (with Python implementations by the speaker) and their efficacy in making more informed decisions.

Since randomization is an important ingredient of many clustering algorithms (e.g. k-means), exploring the robustness of the results to this randomness is very important for building trust in the results and making them transparent. This talk will explain some measures (with Python implementations) to tackle the problem.

I'll also demo Clustervision, a system developed by researchers at IBM. It is a visual analytics tool that helps data scientists find the right clustering among the large number of techniques and parameters available. This work has appeared at KDD 2017 and has been accepted for publication at IEEE VIS 2017. Time permitting, I'll also discuss some other ideas, e.g. consensus clustering, meta clustering and aggregate clustering, which are very important for getting a grip on cluster analysis.

This talk is about the issues related to understanding and analyzing clustering to obtain the desired results. We will discuss some clustering evaluation metrics and explore the robustness of clustering to randomness, along with other related concepts.
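To make the idea of robustness to randomness concrete, here is a minimal sketch that is not taken from the talk or from Clustervision. It assumes one plausible measure: run k-means several times with different random seeds and average the pairwise adjusted Rand index (ARI) between the resulting labelings. The function names (`adjusted_rand_index`, `kmeans_labels`, `seed_stability`), the toy data and the tiny pure-Python k-means are all illustrative assumptions, not the speaker's implementation.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of points placed together in both labelings, in A only, in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # degenerate case: both labelings are trivial
        return 1.0
    return (together_both - expected) / (maximum - expected)


def kmeans_labels(points, k, seed, n_iter=20):
    """Toy Lloyd's k-means (illustrative only); initial centers depend on the seed."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: sum((x - c) ** 2 for x, c in zip(pt, centers[j])))
            for pt in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def seed_stability(points, k, seeds):
    """Mean pairwise ARI between runs that differ only in their random seed."""
    runs = [kmeans_labels(points, k, s) for s in seeds]
    scores = [adjusted_rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: two loose groups of 2-D points.
    data = [(0.0, 0.2), (0.3, 1.1), (1.2, 0.1), (0.9, 0.8),
            (9.7, 10.1), (10.4, 11.0), (11.1, 9.8), (10.8, 10.9)]
    for k in (2, 3, 4):
        print(k, round(seed_stability(data, k, seeds=range(5)), 3))
```

A stability value near 1 suggests that the partition barely depends on the seed, while lower values suggest that different initializations lead to different clusterings. In practice one would likely use a library implementation such as scikit-learn's `KMeans` and `adjusted_rand_score` rather than the toy versions sketched here.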

Sunday 2:30 PM–3:00 PM in C01

Janu Verma
