Most NLP tasks assume a simple setup: one language, one target. In the real world, the people we build models for speak many different languages. This talk will motivate the use of lightweight random indexing (see paper here) to combine datasets across multiple languages and walk through a working implementation.
Text classification is the task of assigning labels to natural language text. Polylingual text classification extends this task by requiring a single modeling approach to assign labels to corpora that span multiple languages. Modeling each language separately risks overfitting on languages with too few data points. To remedy this, we present Lightweight Random Indexing, a method that can handle large corpora and project multiple languages into a shared space.
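For concreteness, the naive approach might look like the sketch below: one isolated TF-IDF-plus-linear-model pipeline per language. The function name train_per_language and the pipeline choices are illustrative assumptions, not taken from the talk's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_per_language(corpora):
    """Train one isolated classifier per language (hypothetical sketch).

    corpora: dict mapping a language code to (texts, labels),
    e.g. {"en": (en_texts, en_labels), "fr": (fr_texts, fr_labels)}.
    """
    models = {}
    for lang, (texts, labels) in corpora.items():
        # Each model only ever sees its own language's examples, so
        # languages with small corpora get weak, overfit-prone models.
        models[lang] = make_pipeline(
            TfidfVectorizer(),
            LogisticRegression(max_iter=1000),
        ).fit(texts, labels)
    return models
```

A shared representation space, built next via random indexing, is what lets us avoid this per-language data starvation.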
Random indexing is a dimensionality reduction method that projects data points into a lower-dimensional space while approximately preserving the distances between them. The usual formulation involves a large matrix multiplication that can be infeasible on some datasets. Lightweight random indexing introduces a clever dictionary trick that stores each term's mapping into the subspace using significantly less memory. We will present all of the above topics and their implementations in Python using common libraries.
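To make the distinction concrete, here is a minimal sketch of both ideas. The dense version materializes the full vocabulary-by-K projection matrix; the lightweight version instead keeps a dictionary that lazily assigns each term a few non-zero positions and signs. The constants and names (K, NONZEROS, index_vector, embed) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

K = 512        # target dimensionality (illustrative choice)
NONZEROS = 4   # non-zero entries per index vector (illustrative)

rng = np.random.default_rng(0)

# Vanilla random indexing: multiply the (n_docs x V) document-term
# matrix by a (V x K) random projection. Materializing that V x K
# matrix is what becomes infeasible for large vocabularies.
def project_dense(doc_term_matrix, projection):
    return doc_term_matrix @ projection

# Lightweight variant: keep a dictionary that lazily assigns each
# term a sparse index vector, stored as a (positions, signs) pair
# rather than a full row of the projection matrix.
index_vectors = {}

def index_vector(term):
    if term not in index_vectors:
        positions = rng.choice(K, size=NONZEROS, replace=False)
        signs = rng.choice([-1.0, 1.0], size=NONZEROS)
        index_vectors[term] = (positions, signs)
    return index_vectors[term]

def embed(doc_tokens):
    # A document's embedding is the sum of its terms' index vectors.
    vec = np.zeros(K)
    for term in doc_tokens:
        positions, signs = index_vector(term)
        vec[positions] += signs
    return vec
```

Because terms from every language share the one dictionary, documents in different languages land in the same K-dimensional space, e.g. np.vstack([embed(["the", "cat"]), embed(["le", "chat"])]) yields rows a single classifier can be trained on.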
Talk agenda
Problem Definition (5 minutes; 7 total)
Polylingual setting
Naive Approach (3 minutes; 10 total)
Drawbacks of this approach
Random Indexing (8 minutes; 18 total)
Python implementation
Problem with Vanilla Random Indexing (1 minute; 19 total)
Lightweight Random Indexing (4 minutes; 23 total)
Python implementation
Empirical Results (2 minutes; 25 total)
Baseline comparison: TF-IDF trained on a single dataset vs. a model trained with another language's embeddings
Questions (5 minutes; 30 total)
We have a working implementation that can be cleaned up and shared publicly before the talk.