Sunday 10:50–11:25 in Auditorium

(Re)training word embeddings for a specific domain

Jetze Schuurmans

Audience level:


Word embeddings (like GloVe, fastText and word2vec) are very powerful for capturing general word semantics. What if your use case is domain specific? Will your embeddings still work? If they don’t, how do you retrain them?


In this presentation we will cover how word embeddings work, what kinds of embeddings exist, and how you can (re)train them using Python.

We will dive into the differences between count-based and direct-prediction (skip-gram/CBOW) models, and how GloVe combines the two. We will also compare training embeddings end-to-end as part of a model with training them separately.
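To make the count-based versus direct-prediction distinction concrete, here is a minimal sketch using only the standard library. The function names, the toy corpus and the window size are illustrative assumptions, not material from the talk. Both views start from the same context windows: a count-based model aggregates them into a co-occurrence table first, while a skip-gram model consumes them one (centre, context) pair at a time.

```python
from collections import Counter


def context_positions(tokens, i, window):
    """Yield the indices within `window` tokens of position i, excluding i."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            yield j


def cooccurrence_counts(sentences, window=2):
    """Count-based view: aggregate (word, context) counts over the corpus."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in context_positions(tokens, i, window):
                counts[(word, tokens[j])] += 1
    return counts


def skipgram_pairs(sentences, window=2):
    """Direct-prediction view: a stream of (centre, context) training pairs."""
    pairs = []
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in context_positions(tokens, i, window):
                pairs.append((word, tokens[j]))
    return pairs


# Toy, made-up domain corpus (already tokenised).
corpus = [
    ["the", "patient", "received", "aspirin"],
    ["the", "patient", "received", "ibuprofen"],
]

counts = cooccurrence_counts(corpus, window=1)
pairs = skipgram_pairs(corpus, window=1)
```

The two outputs carry the same information: counting the pairs reproduces the table. The difference lies in what is done next. A count-based method factorises the aggregated table, a skip-gram model updates its vectors pair by pair, and GloVe fits vectors to the logarithm of the aggregated counts.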

To (re)train embeddings you need a domain-specific corpus. We also cover some ways to collect your own using Python, for example with the Python Wikipedia API or by building a web crawler.
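As a rough sketch of the parsing step inside a web crawler, the following uses only Python's standard library to pull visible text and outgoing links from a page. The class name, helper function and sample HTML are hypothetical and chosen for illustration. A real crawler would also fetch pages over the network, follow the collected links, and respect each site's robots.txt, none of which is shown here.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collect visible text and href targets from an HTML document."""

    SKIP_TAGS = {"script", "style"}  # their contents are not corpus text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (text, links) for one HTML page given as a string."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Made-up page standing in for a downloaded document.
sample_html = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>Aspirin reduces fever.</p>"
    "<a href='/wiki/Ibuprofen'>Ibuprofen</a>"
    "<script>var x = 1;</script>"
    "</body></html>"
)

text, links = extract(sample_html)
```

The extracted text would go into the corpus after tokenisation, and the links would feed the crawler's queue of pages to visit next.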

We will give an overview of NLP software packages that come in handy: spaCy; Gensim; Flashtext; Bounter; themis-ml; Eli5.lime; Vowpal Wabbit; NLTK. We won’t have enough time to cover them all in detail, but we will mention how they can be useful in an NLP pipeline.
