Friday 11:00–12:30 in Tower Suite 2

Build text classification models (CBOW and Skip-gram) with FastText in Python

Kajal Puri, Sandeep Saurabh

Audience level:


NLP is an exciting way to interpret textual data, especially given that computers can neither speak nor understand any human language. So how do we represent each word of a language as a unique numerical pattern and process it as quickly as possible? The answer is the FastText library.


FastText was open-sourced by Facebook in 2016 and, with its release, became one of the fastest and most accurate libraries available in Python for text classification and word representation. It can be seen as an alternative to the gensim package's word2vec implementation. It implements two important NLP methods: Continuous Bag of Words (CBOW) and Skip-gram. FastText performs well in both supervised and unsupervised learning.
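To make the difference between the two methods concrete, here is a minimal pure-Python sketch of the training examples each one builds from a sentence. CBOW predicts a word from its surrounding context, while Skip-gram predicts each context word from the centre word. The function names and the fixed window size are assumptions made for illustration; this is not FastText's own code, which also samples the window size and uses subword information.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return up to `window` tokens on each side of position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: one example per position, (context words) -> target word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: one example per (target word, context word) pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


# Hypothetical toy sentence, used only for illustration.
tokens = "the movie was great".split()

for context, target in cbow_examples(tokens, window=1):
    print("CBOW     ", context, "->", target)

for target, ctx in skipgram_examples(tokens, window=1):
    print("Skip-gram", target, "->", ctx)
```

With the same window, Skip-gram produces more, smaller training examples than CBOW, which is one reason the two models trade off speed and quality differently.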

The talk will be divided into the following five segments:

  1. 0-5 minutes: The talk will begin by explaining the differences between the word embeddings generated by word2vec, GloVe and FastText, and how FastText outperforms the other libraries with better accuracy in less time.

  2. 5-30 minutes: The code for both models (CBOW and Skip-gram), trained with the FastText library on a standard labelled text data set with two categories (positive and negative), will be shown and explained line by line. The segment will conclude with tips on hyperparameter tuning to get the best possible embeddings (word representations) for further assessment.

  3. 30-50 minutes: How to use the pre-trained word embeddings released by FastText for various languages, and where to use them. Use cases showing what kinds of problems can be solved with FastText in Python.

  4. 50-75 minutes: More information and code reviews on how these word-representation vectors can be used in deep learning NLP architectures such as RNNs and LSTMs to improve accuracy. Discussion of applications of NLP in machine translation, emotion classification and text generation.

  5. 75-90 minutes: Q&A session.
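
As a small preview of the classification segment, here is a hedged sketch of how a two-category (positive/negative) labelled data set can be prepared. FastText's supervised mode reads one example per line, with the label marked by the default `__label__` prefix. The helper names, the lowercasing step and the toy reviews are assumptions for illustration, not the speakers' code.

```python
import re
from typing import Iterable, List, Tuple


def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as '__label__<label> <text>' on a single line."""
    # Lowercase and collapse whitespace (including newlines) so that each
    # example occupies exactly one line of the training file.
    cleaned = re.sub(r"\s+", " ", text.strip().lower())
    return f"__label__{label} {cleaned}"


def to_fasttext_lines(examples: Iterable[Tuple[str, str]]) -> List[str]:
    """Format an iterable of (label, text) pairs."""
    return [to_fasttext_line(label, text) for label, text in examples]


# Hypothetical toy reviews, used only for illustration.
reviews = [
    ("positive", "Great film,\nloved it"),
    ("negative", "Dull  and far too long"),
]

lines = to_fasttext_lines(reviews)
for line in lines:
    print(line)
```

Once such lines are written to a file (say a hypothetical `train.txt`), the official Python bindings can train a classifier with `fasttext.train_supervised(input="train.txt")`, and unsupervised embeddings with `fasttext.train_unsupervised("corpus.txt", model="cbow")` or `model="skipgram"`.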
