PyData Amsterdam 2019 - Presentation: How to expand your NLP solution to new languages using transfer learning

Expanding NLP models to new languages typically involves annotating completely new data sets for each language, which is time-consuming and resource-intensive. To avoid these tedious and costly tasks, we use cross-lingual embeddings to enable knowledge transfer from languages with sufficient training data to low-resource languages. Cross-lingual embeddings represent words from multiple languages in a shared vector space, so that words with similar meanings lie close together regardless of language. In this talk, you will hear about our experience with learning cross-lingual embeddings for a sequence labelling task, namely multilingual CV (resume) parsing.

During the presentation, you'll see how to apply transfer learning to document parsing in low-resource languages. We will show you where the approach works well and why. We'll also discuss the challenges of training domain-specific cross-lingual embeddings and the factors that affect the quality of the embeddings. Finally, we will share what we learned and show the workflow for creating your own cross-lingual system.

We will cover the following topics:

Sensitivity of cross-lingual embeddings to the choice of the bilingual lexicon. Since our application focuses on the specific domain of human resources, we will show how adding domain-specific terms to the bilingual lexicons improves the downstream performance.
What can be achieved with cross-lingual embeddings in a zero-shot setting, and what we gain by adding a little data from the low-resource language.
Comparison of data sampling strategies for the multilingual training setup.
Intrinsic and extrinsic evaluation: how intrinsic metrics can provide insight into the cross-lingual model and help predict downstream task performance.
Lessons learned working with different language groups.
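
To make the idea of intrinsic evaluation concrete, here is a minimal sketch of one common intrinsic metric: precision@1 on word translation retrieval in a shared vector space. It is not the evaluation used in the talk. The toy vectors, the English-Dutch word pairs, and the function names (`cosine`, `nearest_target_word`, `precision_at_1`) are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_target_word(source_vec, target_embeddings):
    """Return the target-language word whose vector is closest to source_vec."""
    return max(
        target_embeddings,
        key=lambda word: cosine(source_vec, target_embeddings[word]),
    )


def precision_at_1(source_embeddings, target_embeddings, test_lexicon):
    """Fraction of held-out word pairs whose nearest neighbour is the correct translation."""
    hits = sum(
        1
        for src, tgt in test_lexicon
        if nearest_target_word(source_embeddings[src], target_embeddings) == tgt
    )
    return hits / len(test_lexicon)


# Toy 3-dimensional embeddings, assumed to be already mapped into one shared space.
# Real embeddings would have hundreds of dimensions and far larger vocabularies.
english = {
    "salary": [0.9, 0.1, 0.0],
    "education": [0.1, 0.9, 0.1],
    "skills": [0.0, 0.2, 0.9],
}
dutch = {
    "salaris": [0.8, 0.2, 0.1],
    "opleiding": [0.2, 0.8, 0.0],
    "vaardigheden": [0.1, 0.1, 0.9],
}

# Hypothetical held-out bilingual lexicon with HR-domain terms.
held_out_lexicon = [
    ("salary", "salaris"),
    ("education", "opleiding"),
    ("skills", "vaardigheden"),
]

score = precision_at_1(english, dutch, held_out_lexicon)
print(f"precision@1 = {score:.2f}")
```

A score like this can be computed without any labelled CVs, which is what makes it an intrinsic metric. Whether it tracks downstream parsing quality is an empirical question that has to be checked against the extrinsic results.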

You will leave this talk having heard some concrete recommendations on creating cross-lingual embeddings for sequence labelling, based on our experimental results. Moreover, you will learn more about the pitfalls and crucial details to pay attention to when developing such a cross-lingual system.

This talk will be most useful for those looking to expand their existing system to new languages, including low-resource ones.

Sunday 11:40–12:15 in Auditorium

Lena Shakurova
