Sunday 1:30 PM–2:15 PM in Modeling & Data Techniques - Rm 100A

Dimensionally reducing data - squeezing out the good stuff

Aabir Abubaker Kar

Audience level:
Intermediate

Description

Data is often high-dimensional - millions of pixels, frequencies, categories. A lot of this detail is unnecessary for data analysis - but how much exactly? This talk will discuss the basic principles and techniques of dimensionality reduction, provide (just a little!) mathematical intuition about how it's done, and use scikit-learn to show you how Netflix uses it to lead you from binge to binge.

Abstract

In this session, we'll start by remembering what data really is and what it stands for. Data is a structured set of numbers, and these numbers typically (hopefully!) hold some information. This will lead us naturally to the concept of a high-dimensional space, the mystical realm in which data lives. It turns out that data in this space has an extremely useful property: a datapoint can be known by the company it keeps. This is one of the basic ideas behind k-means clustering, which we will briefly discuss.
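The "company it keeps" idea can be sketched with scikit-learn's k-means on synthetic data (the talk's own dataset isn't assumed here; `make_blobs` is a stand-in):

```python
# Minimal sketch: k-means groups points by proximity in feature space,
# so a point's cluster label is determined by its neighbours.
# Synthetic data only -- not the dataset used in the talk.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated clusters in a 10-dimensional space.
X, true_labels = make_blobs(n_samples=300, n_features=10,
                            centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print(labels[:10])  # each point assigned to one of 3 clusters
```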

We'll discuss some seminal use-cases of dimensionality reduction, including quantitative finance, medical diagnosis and of course, recommender systems like Netflix.

We'll then talk about why some aspects of the data are more informative than others. This lays the mathematical foundation for the technique of Principal Component Analysis (PCA), which we will run on the Netflix movie dataset using scikit-learn.
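As a rough sketch of what that scikit-learn step looks like (the Netflix dataset isn't bundled with scikit-learn, so a random user-by-movie ratings matrix stands in here):

```python
# Hedged sketch: PCA keeps the directions of highest variance and
# discards the rest. The ratings matrix below is random stand-in data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(500, 200)).astype(float)  # 500 users x 200 movies

# Keep the smallest number of components explaining >= 90% of the variance.
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(ratings)

print(ratings.shape, "->", reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

Passing a float to `n_components` is scikit-learn's shorthand for "keep enough components to reach this fraction of explained variance".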

We will also touch upon the mathematical intuitions and ideal use cases of t-SNE, another popular dimensionality-reduction algorithm. I'll leave you with some tips on best practices and use-cases.
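A minimal t-SNE sketch in the same scikit-learn style, again on synthetic stand-in data:

```python
# Hedged sketch: t-SNE embeds high-dimensional points into 2-D for
# visualization. It preserves local neighbourhoods rather than global
# distances, so it is best used for plotting, not as model input.
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=200, n_features=50, centers=4, random_state=0)

# Common practice: PCA initialization and a perplexity well below
# the number of samples.
embedding = TSNE(n_components=2, perplexity=30,
                 init="pca", random_state=0).fit_transform(X)
print(embedding.shape)  # one 2-D point per input sample
```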

I will be using scikit-learn for processing and matplotlib for visualization. The purpose of this session is to introduce dimensionality-reduction to those who do not know it, and to provide useful guiding intuitions to those who do.
