This talk provides a step-by-step overview and demonstration of several dimensionality (feature) reduction techniques. Attendees should have a basic understanding of data wrangling and supervised learning. The presentation will also include snippets of Python code, so familiarity with Python will be useful.
Many machine learning applications involve datasets with high dimensionality. In most cases, the intrinsic dimensionality is much smaller than the observed dimensionality of the data, and it becomes essential to eliminate uninformative and redundant features before performing the core analysis. This is useful not only because it speeds up the core analysis, which typically involves computationally expensive algorithms, but also because it may improve the accuracy of the results (such as the classification accuracy of a classification model). Statisticians and data scientists have proposed and used many feature reduction techniques over the years. These methods can be broadly categorized into two groups: supervised and unsupervised.
In this presentation, I will focus on several dimensionality reduction techniques pertaining to the pre-processing of data for supervised learning. A large dataset will be used to demonstrate these techniques by following a dimensionality reduction workflow. The objective of this presentation is to introduce several dimensionality reduction techniques, demonstrate how to implement them (using Python), and assess their efficacy for supervised learning problems.
Here’s a list of twelve dimensionality reduction techniques that will be discussed in this presentation: (1) Percent missing values, (2) Amount of variance, (3) Correlation (with the target), (4) Pairwise correlation, (5) Multicollinearity, (6) Principal Component Analysis (PCA), (7) Cluster analysis, (8) Forward selection, (9) Backward elimination, (10) Stepwise selection, (11) LASSO, (12) Tree-based methods.
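To give a flavor of the kind of Python snippets the talk will include, here is a minimal sketch (not taken from the talk itself) applying three of the listed techniques to a synthetic dataset: the percent-missing-values filter (1), the amount-of-variance filter (2), and PCA (6). The column names, thresholds, and data are assumptions chosen purely for demonstration.

```python
# Illustrative sketch only: filter out high-missingness and near-constant
# columns, then apply PCA. Data and thresholds are hypothetical.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mostly_missing": np.where(rng.random(200) < 0.7, np.nan, rng.random(200)),
    "near_constant": np.full(200, 5.0) + rng.normal(0, 1e-6, 200),
    "feature_a": rng.normal(0, 1, 200),
    "feature_b": rng.normal(0, 2, 200),
})

# (1) Percent missing values: drop columns with more than 50% missing entries.
missing_frac = df.isna().mean()
df = df.loc[:, missing_frac <= 0.5]

# (2) Amount of variance: drop near-constant columns (variance below a threshold).
df = df.loc[:, df.var() > 1e-4]

# (6) PCA: keep enough components to explain 95% of the remaining variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(df.fillna(df.mean()))
print(f"Kept {components.shape[1]} of {df.shape[1]} dimensions")
```

Note that the first two techniques are simple column filters that need no target variable, whereas methods such as forward selection, LASSO, and tree-based importance (items 8 through 12) are supervised and require labels.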