Sunday 10:15–11:00 in Tower Suite 3

RNN sequence labeling for document parsing in TensorFlow

Carsten van Weelden

Audience level:
Novice

Description

In this talk we will show how we improved our CV parsing performance by training RNN models using TensorFlow. We will demonstrate how to set up and train a BLSTM sequence labeling model and discuss extensions such as learning line representations, combining RNN and CRF layers, and training multilingual models.

Abstract

At Textkernel we apply AI techniques to match people to jobs. The backbone of this technology is our multilingual CV and vacancy parsing service. This service takes the unstructured text of these documents and extracts the important information (job titles, companies, dates, skills, etc.). By replacing our classical NLP pipeline with RNNs, we recently reduced errors in this service by ~30% while also reducing the need for manual feature engineering.

In this talk we will show how to set up and train a BLSTM-CRF sequence labeling model, which is at the core of our new pipeline. We will discuss the choices we made and the issues you will run into when productizing these models. We will also cover some of the extensions that we made to this model, such as learning line embeddings and training multilingual models. The talk will be illustrated with TensorFlow code snippets.
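
The talk's own snippets are written in TensorFlow. As a library-free illustration of what a CRF layer adds on top of per-token BLSTM scores, here is a minimal sketch of Viterbi decoding in plain Python. It is not the Textkernel implementation: the label set, the emission scores (standing in for BLSTM outputs), the transition matrix, and the function name `viterbi_decode` are all assumptions made up for this example.

```python
# Minimal sketch of linear-chain CRF decoding (Viterbi) over per-token scores.
# All labels and numbers below are invented for illustration only.

def viterbi_decode(emissions, transitions):
    """Return (best label index path, path score).

    emissions[t][j]:   score of label j at token t (e.g. from a BLSTM)
    transitions[i][j]: score of moving from label i to label j
    """
    num_labels = len(transitions)
    # best[j] = score of the best path that ends in label j at the current token
    best = list(emissions[0])
    backpointers = []
    for token_scores in emissions[1:]:
        step_best, step_back = [], []
        for j in range(num_labels):
            # Previous label that gives the best path into label j
            prev = max(range(num_labels),
                       key=lambda i: best[i] + transitions[i][j])
            step_back.append(prev)
            step_best.append(best[prev] + transitions[prev][j] + token_scores[j])
        best = step_best
        backpointers.append(step_back)
    # Follow the backpointers from the best final label to recover the path
    last = max(range(num_labels), key=lambda j: best[j])
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, best[last]


LABELS = ["O", "B-JOB", "I-JOB"]  # hypothetical BIO tags for job titles
FORBIDDEN = -1e4                   # large penalty: "I-JOB" may not follow "O"
TRANSITIONS = [
    [0.0, 0.0, FORBIDDEN],  # from O
    [0.0, 0.0, 0.0],        # from B-JOB
    [0.0, 0.0, 0.0],        # from I-JOB
]
# Made-up scores for the tokens: "worked", "as", "data", "scientist"
EMISSIONS = [
    [2.0, 0.1, 0.0],
    [2.0, 0.2, 0.1],
    [0.5, 0.9, 1.0],
    [0.3, 0.2, 1.5],
]

# Labeling each token independently picks the highest score per token
greedy = [LABELS[max(range(len(LABELS)), key=lambda j: row[j])]
          for row in EMISSIONS]
path, score = viterbi_decode(EMISSIONS, TRANSITIONS)
decoded = [LABELS[j] for j in path]

print(greedy)   # ['O', 'O', 'I-JOB', 'I-JOB']  (invalid: I-JOB right after O)
print(decoded)  # ['O', 'O', 'B-JOB', 'I-JOB']
```

In this toy example the per-token argmax yields a tag sequence that is not well-formed, while decoding with transition scores recovers a consistent one. In a trained BLSTM-CRF the transition scores are learned together with the network rather than set by hand as they are here.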

Who is this talk for?

This talk is aimed at machine learning engineers, data scientists, NLP engineers, and other machine learning practitioners who want to learn how to build and train a sequence labeling model in TensorFlow. You should be somewhat familiar with word embeddings, RNNs, and TensorFlow basics to make the most of this talk.

Find the slides and the notebook here.
