Most NLP tasks assume a simple setup: one language, one target. In the real world, the people we build models for speak many different languages. This talk will motivate the use of lightweight random indexing (see paper here) to combine datasets across multiple languages and walk through a working implementation.
Text classification is the task of assigning labels to natural language text. Polylingual text classification extends this task by requiring a single modeling approach to assign labels to corpora that span multiple languages. Modeling each language separately risks overfitting on languages with too few data points. To remedy this, we present Lightweight Random Indexing, a method that can handle large corpora and project multiple languages into a shared space.
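For concreteness, the naive approach might look like the sketch below: one isolated TF-IDF-plus-linear-model pipeline per language. The function name train_per_language and the pipeline choices are illustrative assumptions, not taken from the talk's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_per_language(corpora):
    """Train one isolated classifier per language (hypothetical sketch).

    corpora: dict mapping a language code to (texts, labels),
    e.g. {"en": (en_texts, en_labels), "fr": (fr_texts, fr_labels)}.
    """
    models = {}
    for lang, (texts, labels) in corpora.items():
        # Each model only ever sees its own language's examples, so
        # languages with small corpora get weak, overfit-prone models.
        models[lang] = make_pipeline(
            TfidfVectorizer(),
            LogisticRegression(max_iter=1000),
        ).fit(texts, labels)
    return models
```

A shared representation space, built next via random indexing, is what lets us avoid this per-language data starvation.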
Random indexing is a dimensionality reduction method that projects data points into a lower-dimensional space while approximately preserving the distances between them. The usual formulation involves a large matrix multiplication that can be infeasible on some datasets. Lightweight random indexing introduces a clever dictionary trick that stores each term's mapping into the subspace using significantly less memory. We will present all of the above topics and their implementations in Python using common libraries.
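To make the distinction concrete, here is a minimal sketch of both ideas. The dense version materializes the full vocabulary-by-K projection matrix; the lightweight version instead keeps a dictionary that lazily assigns each term a few non-zero positions and signs. The constants and names (K, NONZEROS, index_vector, embed) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

K = 512        # target dimensionality (illustrative choice)
NONZEROS = 4   # non-zero entries per index vector (illustrative)

rng = np.random.default_rng(0)

# Vanilla random indexing: multiply the (n_docs x V) document-term
# matrix by a (V x K) random projection. Materializing that V x K
# matrix is what becomes infeasible for large vocabularies.
def project_dense(doc_term_matrix, projection):
    return doc_term_matrix @ projection

# Lightweight variant: keep a dictionary that lazily assigns each
# term a sparse index vector, stored as a (positions, signs) pair
# rather than a full row of the projection matrix.
index_vectors = {}

def index_vector(term):
    if term not in index_vectors:
        positions = rng.choice(K, size=NONZEROS, replace=False)
        signs = rng.choice([-1.0, 1.0], size=NONZEROS)
        index_vectors[term] = (positions, signs)
    return index_vectors[term]

def embed(doc_tokens):
    # A document's embedding is the sum of its terms' index vectors.
    vec = np.zeros(K)
    for term in doc_tokens:
        positions, signs = index_vector(term)
        vec[positions] += signs
    return vec
```

Because terms from every language share the one dictionary, documents in different languages land in the same K-dimensional space, e.g. np.vstack([embed(["the", "cat"]), embed(["le", "chat"])]) yields rows a single classifier can be trained on.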
Talk agenda
Problem Definition (5 minutes; 7 total)
Polylingual setting
Naive Approach (3 minutes; 10 total)
Drawbacks of this approach
Random Indexing (8 minutes; 18 total)
Python implementation
Problem with Vanilla Random Indexing (1 minute; 19 total)
Lightweight Random Indexing (4 minutes; 23 total)
Python implementation
Empirical Results (2 minutes; 25 total)
Baseline comparison: TF-IDF trained on a single dataset vs. a model trained with another language's embeddings
Questions (5 minutes; 30 total)
We have a working implementation that can be cleaned up and shared publicly before the talk.