Expanding NLP models to new languages typically involves annotating completely new data sets for each language, which is expensive in both time and resources. To avoid this tedious and costly work, we use cross-lingual embeddings to enable knowledge transfer from languages with sufficient training data to low-resource languages. Cross-lingual embeddings represent words from multiple languages in a shared vector space, capturing semantic similarities across languages. In this talk, you will hear about our experience with learning cross-lingual embeddings for a sequence labelling task, namely multilingual CV (resume) parsing.
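To give a feel for the shared-space idea, here is a minimal sketch of one standard mapping-based approach (orthogonal Procrustes over a small seed dictionary); it is not necessarily the method used in our system, and the words, vectors, and dictionary pairs are invented for illustration:

```python
import numpy as np

# Toy monolingual vectors; in practice these would be pre-trained
# embeddings (e.g. fastText), one space per language.
rng = np.random.default_rng(0)
dim = 4
en_vecs = {"resume": rng.normal(size=dim), "skill": rng.normal(size=dim)}
fr_vecs = {"cv": rng.normal(size=dim), "competence": rng.normal(size=dim)}

# Small seed dictionary of (English, French) translation pairs.
seed_pairs = [("resume", "cv"), ("skill", "competence")]

# Stack the dictionary pairs into matrices X (source) and Y (target).
X = np.stack([en_vecs[s] for s, _ in seed_pairs])
Y = np.stack([fr_vecs[t] for _, t in seed_pairs])

# Orthogonal Procrustes: the rotation W minimizing ||XW - Y||_F
# is U @ Vt, where U, S, Vt = svd(X.T @ Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After mapping, dictionary pairs end up closer together in the
# shared space, so similarity can be compared across languages.
mapped = en_vecs["resume"] @ W
print(cosine(mapped, fr_vecs["cv"]))
```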
During the presentation, you'll see how to apply transfer learning to parse documents in low-resource languages. We will show where the approach works well and why, discuss the challenges of training domain-specific cross-lingual embeddings, and elaborate on the factors that affect embedding quality. Finally, we will share what we learned and walk through the workflow of creating your own cross-lingual system.
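To make the transfer step concrete, the sketch below shows zero-shot transfer in its simplest form: a token classifier trained on the high-resource language only, then applied unchanged to the low-resource one via the shared embedding space. The words, labels, and nudged vectors are hypothetical stand-ins, and a real CV parser would use a sequence model rather than per-token logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy token vectors assumed to already live in one shared
# cross-lingual space (e.g. after an alignment like the one above).
rng = np.random.default_rng(1)
shared = {"education": rng.normal(size=4), "skills": rng.normal(size=4)}
# Invented French counterparts placed near their English translations,
# mimicking what a good cross-lingual alignment achieves.
shared["formation"] = shared["education"] + 0.1 * rng.normal(size=4)
shared["competences"] = shared["skills"] + 0.1 * rng.normal(size=4)

def embed(tokens):
    return np.stack([shared[t] for t in tokens])

# Train a per-token classifier on labelled English tokens only;
# the BIO-style section tags are invented for the example.
clf = LogisticRegression().fit(embed(["education", "skills"]),
                               ["B-EDUCATION", "B-SKILLS"])

# Zero-shot transfer: label French tokens without any French labels.
print(clf.predict(embed(["formation", "competences"])))
```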
We will cover the following topics:

- Transfer learning for document parsing in low-resource languages
- Training domain-specific cross-lingual embeddings and the factors that affect their quality
- Pitfalls and practical details when building a cross-lingual sequence labelling system
- A workflow for creating your own cross-lingual system
You will leave this talk with concrete recommendations on creating cross-lingual embeddings for sequence labelling, based on our experimental results. Moreover, you will learn about the pitfalls and crucial details to pay attention to when developing such a cross-lingual system.
This talk will be most useful for those looking to expand an existing system to new languages, including low-resource ones.