Machine learning models prefer numbers to words, but categorical fields are everywhere in real-world datasets. It's both a blessing and a curse that there are countless ways of handling them. This talk will be super-practical: I'll provide a broad view of the different approaches at your disposal, and give you some ideas about which to try in different situations.
The talk will start off with a quick tour of the many ways to plug a categorical feature into any machine learning model. These are common, widely applicable techniques, and each comes with its own trade-offs.
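To give a flavour of the simplest of these techniques, here is a dependency-free sketch of one-hot encoding, which maps each category to its own indicator column. The `one_hot` helper below is illustrative, not taken from any particular library; in practice you would reach for something like `sklearn.preprocessing.OneHotEncoder` or `pandas.get_dummies`:

```python
def one_hot(values):
    """Encode a list of category labels as indicator vectors.

    Returns (rows, categories), where rows[i][j] == 1 iff
    values[i] == categories[j]. Categories are sorted so the
    column order is deterministic.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


rows, cats = one_hot(["red", "blue", "red"])
# cats == ["blue", "red"]; "red" becomes [0, 1], "blue" becomes [1, 0]
```

Note the dimensionality cost: a column with thousands of distinct values becomes thousands of new features, which is one of the trade-offs the talk examines.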
We'll weigh the upsides and downsides of each approach using experiments on public and synthetic datasets, with plenty of signposting to packages and other resources worth knowing about. There will also be some discussion of what the ever-popular boosting frameworks (XGBoost, LightGBM, CatBoost) are doing behind the scenes.
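One idea that comes up when discussing these frameworks is target encoding: replacing each category with a smoothed mean of the target variable (CatBoost builds on a related idea, ordered target statistics, with extra machinery to avoid target leakage). A simplified sketch, where the `smoothing` parameter and function interface are my own for illustration:

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target.

    Smoothing shrinks rare categories toward the global mean,
    which guards against overfitting on sparsely observed levels.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }


encoding = target_encode(["a", "a", "b"], [1, 0, 1], smoothing=1.0)
# "a" is pulled toward the global mean of 2/3; so is the rare "b"
```

Applied naively to the training set, this leaks target information into the features; the talk's discussion of what CatBoost does behind the scenes covers how to mitigate that.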
The second part will focus on strategy, tackling questions such as: