In NLP, we often encounter situations where our text documents are not limited to a single language. In this talk, we'll explore multilingual embeddings as an alternative to traditional word embeddings for building NLP models that scale effectively to data in multiple languages.
In natural language processing, we use pre-trained word embedding models such as word2vec to encode text data in a vector space where the relative distances between tokens/words reflect their semantic relatedness. These embeddings are then used in downstream tasks such as classification and sentiment analysis. A more recent addition is multilingual embeddings, which are trained on data from multiple languages (e.g., Spanish, English, Chinese) so that semantically similar text is aligned in the shared space regardless of language.
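To make the "distance reflects meaning" idea concrete, here is a minimal sketch using gensim's pre-trained vectors. The specific model name (`glove-wiki-gigaword-50`) is just a convenient small download for illustration, not something prescribed by the talk; any pre-trained word2vec-style model behaves similarly.

```python
import gensim.downloader as api

# Load a small set of pre-trained word vectors (GloVe here, chosen
# purely because it is a small download; word2vec works the same way).
model = api.load("glove-wiki-gigaword-50")

# Semantically related words sit close together in the vector space.
print(model.similarity("king", "queen"))   # relatively high
print(model.similarity("king", "banana"))  # relatively low

# The classic analogy: king - man + woman is nearest to "queen".
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```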
In this talk, I'll go over the recently released multilingual sentence encoder from Google and demonstrate how it can be used to build a language-agnostic document classification model. This is particularly useful when the corpus contains documents written in multiple languages, as it avoids the need to collect separate training data and build a separate model for each language.
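The abstract doesn't name the exact model, but a likely candidate is Google's Multilingual Universal Sentence Encoder on TF Hub. The sketch below, with illustrative sentences of my own choosing, shows the property the talk relies on: translations of the same sentence land near each other in the shared embedding space, so a classifier trained on those vectors in one language can score documents in another.

```python
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers ops the encoder needs

# Load the Multilingual Universal Sentence Encoder from TF Hub.
embed = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
)

sentences = [
    "The movie was fantastic",     # English
    "La película fue fantástica",  # Spanish, same meaning
    "I need to return this item",  # English, unrelated meaning
]
vectors = embed(sentences).numpy()

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Cross-lingual alignment: the translation pair scores much higher
# than the unrelated English pair.
print(cosine(vectors[0], vectors[1]))  # high: same meaning, different languages
print(cosine(vectors[0], vectors[2]))  # lower: different meaning
```

Because these vectors are language-aligned, the language-agnostic classifier amounts to fitting any standard model (e.g., logistic regression) on the embeddings of labeled documents in whichever languages you have, then applying it unchanged to documents in the others.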