Saturday 11:45–12:30 in Tower Suite 1

How am I going to deal with all of these cats?

Liam Kirwin

Audience level:
Novice

Description

Machine learning models prefer numbers to words, but categorical fields are everywhere in real-world datasets. It's both a blessing and a curse that there are countless ways of handling them. This talk will be super-practical: I'll provide a broad view of the different approaches at your disposal, and give you some ideas about which to try in different situations.

Abstract

The talk will start off with a quick tour of the many ways to plug a categorical feature into any machine learning model. These are common, widely applicable techniques, and each comes with its own upsides and downsides.

We'll talk about the upsides and downsides of each approach, looking at some experiments on public/generated datasets. There will be lots of signposting to packages and other resources to be aware of. There will also be some discussion about what the ever-popular boosting frameworks (XGBoost, LightGBM, CatBoost) are doing behind the scenes.
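To make "plugging a categorical feature into a model" concrete, here is a minimal sketch of two widely used encodings, one-hot and mean target encoding, written in plain Python. The abstract does not name the techniques the talk covers, so the choice of these two, the function names, and the toy cat data are all assumptions made for illustration.

```python
from collections import defaultdict


def one_hot_encode(values):
    """Map each value to a 0/1 indicator vector, one slot per distinct category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


def mean_target_encode(values, targets):
    """Replace each category with the mean target observed for that category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, target in zip(values, targets):
        totals[value] += target
        counts[value] += 1
    means = {cat: totals[cat] / counts[cat] for cat in totals}
    return [means[value] for value in values]


# Hypothetical toy data: a categorical column and a binary target.
breeds = ["siamese", "tabby", "siamese", "persian"]
adopted = [1, 0, 0, 1]

categories, one_hot = one_hot_encode(breeds)
target_encoded = mean_target_encode(breeds, adopted)
```

One-hot encoding adds a column per category, which gets unwieldy for high-cardinality fields. Mean target encoding keeps a single column, but computing it on the same rows used for training, as this sketch does, leaks the target; in practice it is usually fitted on separate folds or smoothed.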

The second part will focus on strategy: which of these approaches to try in different situations.
