PyData London 2019 - Presentation: The unreasonable effectiveness of feature hashing

Most machine learning (ML) algorithms are trained on numerical vectors; however, training data frequently consist of categorical features or raw text that must be transformed before they can be used. For example, one-hot encoding is often applied to categorical features, whilst a bag-of-words representation is commonly used for text.

Feature hashing is an approximate technique for mapping sparse, high-dimensional features onto a fixed-size vector space. Rather than building a dictionary from the training data, it applies a hash function to compute feature indices directly. This makes it not only space-efficient but also well-suited to online learning scenarios. Collisions naturally occur in the typical use case where the hash space is much smaller than the original feature space, but don't appear to compromise the statistical performance of models trained on hashed features.

Despite its general applicability to many ML algorithms, feature hashing has found preferential adoption in conjunction with linear models such as linear and logistic regression. These models can be trained very efficiently on massive data sets, even in the presence of extremely sparse features. Interaction terms are straightforward to introduce in hash space and increase model expressiveness, achieving similar statistical performance to tree-based methods.

Topics covered

Feature hashing for categorical and textual features
Efficient hash functions and reversibility
Hash sizes and the effect of collisions
Quadratic and higher-order interactions
Training linear models with feature hashing using scikit-learn, Spark ML and Vowpal Wabbit (VW)
Interpretability of linear models with feature hashing

Audience

This talk will be of interest to ML practitioners, particularly those working with large data set and/or very sparse categorical or textual features. Attendees should be familiar with fundamental ML concepts such as cross-validation and regularisation.

Materials

Slides are available at https://github.com/gcampanella/pydata-london-2019.

Sunday 11:00–11:45 in Tower Suite 1

The unreasonable effectiveness of feature hashing

Gianluca Campanella

Description

Abstract

Topics covered

Audience

Materials

Subscribe to Receive PyData Updates