Tuesday 2:55 PM–3:35 PM in Central Park East (6501a)

Implementing Lightweight Random Indexing for Polylingual Text Classification

Ian Whalen

Audience level:
Intermediate

Description

Most NLP tasks offer a simple setup: one language and a target. In the real world, we have to face the fact that people speak different languages when we build models. This talk will motivate the use of lightweight random indexing (see paper here) to combine data sets across multiple languages and walk through a working implementation.

Abstract

Description:


Text classification is the complex task of assigning labels to natural language text. Polylingual text classification extends this task by requiring a modeling approach that assigns labels to documents in corpora spanning multiple languages. Modeling each language separately may cause overfitting when a language has too few data points. To remedy this, we present Lightweight Random Indexing, a method that can deal with large corpora and project multiple languages into the same space.

Random indexing is a dimensionality reduction method that projects data points into a lower-dimensional space while approximately preserving the distances between points. This process often involves a large matrix multiplication, which can be infeasible on some datasets. Lightweight random indexing introduces a dictionary trick that stores the mapping of vectors to a subspace using significantly less memory. We will present all of the above topics and their implementations in Python using common libraries.
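To make the dictionary idea concrete, here is a minimal sketch of one way it could look. It is not the implementation from the talk or the paper. It assumes that each term gets a sparse random index vector with a few ±1 entries, stored in a plain dict as (positions, signs), and that a document vector is the sum of its terms' index vectors. The names `make_index_vector` and `project` and the parameters `dim` and `nonzeros` are hypothetical.

```python
import numpy as np


def make_index_vector(rng, dim, nonzeros):
    """Draw a sparse random index vector as (positions, signs)."""
    # Assumption: `nonzeros` distinct positions, each holding +1 or -1.
    positions = rng.choice(dim, size=nonzeros, replace=False)
    signs = rng.choice([-1.0, 1.0], size=nonzeros)
    return positions, signs


def project(docs, dim=64, nonzeros=2, seed=0):
    """Project tokenized documents into a shared `dim`-dimensional space."""
    rng = np.random.default_rng(seed)
    index = {}  # term -> (positions, signs); only the nonzeros are stored
    out = np.zeros((len(docs), dim))
    for row, tokens in enumerate(docs):
        for tok in tokens:
            if tok not in index:
                index[tok] = make_index_vector(rng, dim, nonzeros)
            positions, signs = index[tok]
            # Accumulate the sparse index vector directly, so no dense
            # vocabulary-by-dim projection matrix is ever materialized.
            out[row, positions] += signs
    return out, index


# Terms from different languages share one dictionary, so every
# document lands in the same reduced space.
docs = [["the", "cat", "sleeps"], ["el", "gato", "duerme"], ["the", "cat", "sleeps"]]
vectors, index = project(docs)
```

Because the dictionary holds only a handful of positions and signs per term, memory grows with the vocabulary size times `nonzeros` rather than with the vocabulary size times `dim`.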

Outline:


Additional Notes:


We have a working implementation that can be cleaned up and shared publicly before the talk.
