Machine learning models prefer numbers to words, but categorical fields are everywhere in real-world datasets. It's both a blessing and a curse that there are countless ways of handling them. This talk will be super-practical: I'll provide a broad view of the different approaches at your disposal, and give you some ideas about which to try in different situations.
The talk will start off with a quick tour of the many ways to plug a categorical feature into any machine learning model. These are common, widely applicable techniques, and each comes with its own trade-offs.
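To give a flavour of the simplest of these techniques, here is a dependency-free sketch of one-hot encoding, which maps each category to its own indicator column. The `one_hot` helper below is illustrative, not taken from any particular library; in practice you would reach for something like `sklearn.preprocessing.OneHotEncoder` or `pandas.get_dummies`:

```python
def one_hot(values):
    """Encode a list of category labels as indicator vectors.

    Returns (rows, categories), where rows[i][j] == 1 iff
    values[i] == categories[j]. Categories are sorted so the
    column order is deterministic.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


rows, cats = one_hot(["red", "blue", "red"])
# cats == ["blue", "red"]; "red" becomes [0, 1], "blue" becomes [1, 0]
```

Note the dimensionality cost: a column with thousands of distinct values becomes thousands of new features, which is one of the trade-offs the talk examines.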
We'll weigh the upsides and downsides of each approach using experiments on public and synthetic datasets, with plenty of signposting to packages and other resources worth knowing about. There will also be some discussion of what the ever-popular boosting frameworks (XGBoost, LightGBM, CatBoost) are doing behind the scenes.
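One idea that comes up when discussing these frameworks is target encoding: replacing each category with a smoothed mean of the target variable (CatBoost builds on a related idea, ordered target statistics, with extra machinery to avoid target leakage). A simplified sketch, where the `smoothing` parameter and function interface are my own for illustration:

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target.

    Smoothing shrinks rare categories toward the global mean,
    which guards against overfitting on sparsely observed levels.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }


encoding = target_encode(["a", "a", "b"], [1, 0, 1], smoothing=1.0)
# "a" is pulled toward the global mean of 2/3; so is the rare "b"
```

Applied naively to the training set, this leaks target information into the features; the talk's discussion of what CatBoost does behind the scenes covers how to mitigate that.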
The second part will focus on strategy, tackling questions such as: