We'd like to think of ML algorithms as smart and sophisticated learning machines, but they can be fooled by the different types of noise present in your data. Training an algorithm on a large set of variables, hoping that your model will separate signal from noise, is not always the right approach. We'll discuss different ways to do feature selection, and the open-source implementations available for them.
In general, I want to keep it light on the maths and talk a lot about practical (code) examples of feature selection algorithms. I want to convince the audience that it pays off to do feature selection, and introduce them to some of the Python frameworks out there that do feature selection.
The talk will start with motivating why to do feature selection, and introduce the three main types of feature selection methods: wrapper, filter, and embedded methods. I'll have diagrams illustrating these three types of methods as part of the presentation.
I. Why should I care to do feature selection? (3 min)
- Feature collinearity and scarcity of data mean we can't just give a model many features and let it decide which ones are useful and which ones are not.
- Business constraints might also mean the shotgun/Kaggle approach to feature engineering will not work.
II. What makes a good feature selection algorithm? (5 min)
- A good feature selection algorithm will select a compact set of variables that relate to your target variable in a meaningful way.
- What does 'compact' mean here? It means we'd like to minimise the overlap in information between the selected features, and remove variables whose information about the target variable is already captured by other variables in the subset.
- A good feature selection algorithm also shouldn't look at variables purely in isolation:
- Two variables that are useless by themselves can be useful together.
- Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.
I'll provide simple examples for each point; one of them is sketched below.
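To make the second point concrete, here is a minimal sketch (a hypothetical XOR-style toy example using NumPy and scikit-learn, not necessarily the exact example from the talk) of two features that are useless in isolation but perfectly predictive together:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Build a target that is the XOR of two binary features.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=10_000)
x2 = rng.integers(0, 2, size=10_000)
y = x1 ^ x2
X = np.column_stack([x1, x2])

# Univariate mutual information with the target is ~0 for each feature alone...
print(mutual_info_classif(X, y, discrete_features=True))

# ...yet a model that sees both features predicts the target perfectly.
print(DecisionTreeClassifier().fit(X, y).score(X, y))
```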
III. Wrapper methods: performance based (5 min)
- Wrapper methods are the simplest class of feature selection algorithms. A subset of features is selected based on the out-of-sample performance of a model trained on only those features.
- Example (with code): Iterate over subsets of features using a random forest and a hold-out set (see the sketch after this list).
- As wrapper methods train a new model for each subset of variables, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model.
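A minimal sketch of such a wrapper loop, assuming scikit-learn's breast-cancer toy data and an exhaustive search over subsets of at most two features (real problems usually need a greedy forward or backward search instead):

```python
from itertools import combinations

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

best_score, best_subset = -1.0, None
for k in (1, 2):
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        # Train on the candidate subset only, score on the hold-out set.
        model = RandomForestClassifier(n_estimators=25, random_state=0)
        model.fit(X_train[:, cols], y_train)
        score = model.score(X_val[:, cols], y_val)
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)
```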
IV. Filter methods: mutual information based (7 min)
- Filter methods do not use a learner on the original data X, but only consider statistical characteristics of the data set. A filter method uses a statistical measure to decide which features are relevant to the target variable.
- Example (with code): the mifs and skfeature libraries, which perform filter-based feature selection using mutual information-based methods (see the sketch after this list).
- The advantage of filter methods is that they typically scale better to high-dimensional data sets, are computationally cheaper, and are independent of the learning algorithm.
- However, they ignore the interaction with the learner, and often employ lower-dimensional approximations to keep computations tractable, which means they may miss interactions between different features.
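For illustration, here is a minimal filter-method sketch; it uses scikit-learn's mutual information scorer as a stand-in for the mifs/skfeature calls that the talk will show:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Rank features by their univariate mutual information with the target
# and keep the ten highest-scoring ones; no learner is trained on X.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the selected features
```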
V. Embedded methods: stability selection (7 min)
- Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. Some examples include recursive feature elimination (RFE), Boruta feature selection, and stability selection.
- Example (with code): an implementation of stability selection with an sklearn-like API (see the sketch after this list).
- Embedded methods lie somewhere between wrapper and filter methods in terms of computational complexity. Their advantage is that they take into account both the interaction between feature subset search and model selection, and dependencies between features.
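A minimal sketch of the stability-selection idea (not the implementation presented in the talk), assuming an L1-penalised logistic regression and scikit-learn's breast-cancer data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

rng = np.random.default_rng(0)
n_samples, n_features = X.shape
n_bootstraps, threshold = 100, 0.8

# Fit a sparse model on many random half-samples and count how often
# each feature receives a non-zero coefficient.
selection_counts = np.zeros(n_features)
for _ in range(n_bootstraps):
    idx = rng.choice(n_samples, size=n_samples // 2, replace=False)
    model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
    model.fit(X[idx], y[idx])
    selection_counts += (model.coef_.ravel() != 0)

# Keep the features that are selected in a large fraction of the fits.
stability_scores = selection_counts / n_bootstraps
print(np.where(stability_scores >= threshold)[0])
```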
VI. Practical tips (3 min)
VII. Q&A (5 min)