Monday 1:55 PM–2:45 PM in Central Park East 6501a (6th fl)

Addressing Prejudice in Text Data

Mike Cunha

Audience level:
Intermediate

Description

Biases for and against genders, races, and more exist in popular pretrained word embeddings such as GloVe and word2vec. This talk will discuss how to detect and remove prejudice in text datasets and the word embeddings derived from them, along with the impact of ignoring it.

Abstract

Word embeddings have become a widespread component of machine learning algorithms and deep learning architectures as a more compact way to represent text data. Unwanted biases in popular text corpora are amplified in the word embeddings trained on them. Left unaddressed, these prejudices present a high risk of naively creating unfair and harmful data products and services.

This talk will cover metrics for measuring prejudice in text data, as well as ways to remove unwanted bias from trained word embeddings. Current fixes work better on some types of prejudice than others.
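
As an illustration of what such a fix can look like, here is a minimal sketch of the "neutralize" step from hard debiasing (Bolukbasi et al., 2016), one common technique for removing a bias component from trained vectors. The abstract does not specify which method the talk covers, and the `embeddings` dict and word choices below are assumptions for the example.

    import numpy as np

    def neutralize(vector, bias_direction):
        """Return `vector` with its component along `bias_direction` removed."""
        b = bias_direction / np.linalg.norm(bias_direction)
        return vector - np.dot(vector, b) * b

    # Illustrative usage with a hypothetical `embeddings` dict of word -> vector:
    # bias_dir = embeddings["he"] - embeddings["she"]       # crude bias direction
    # embeddings["programmer"] = neutralize(embeddings["programmer"], bias_dir)

Estimating the bias direction from a single word pair is the simplest possible choice; in practice it is usually derived from several definitional pairs via PCA.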

This talk is for anyone using word embeddings to build products such as chatbots, sentiment analysis, neural machine translation, information retrieval, and search. This includes data scientists, machine learning engineers, and anyone responsible for the products they build.

At the end of this talk you will know:

  1. How to test your text data for obvious prejudice using the Word Embedding Association Test (WEAT) in Python (see the sketch after this list)
  2. An overview of the consequences of ignoring unwanted bias in text data
  3. Where to find free text data that has been “debiased”
  4. Questions to ask that will make it easier to deal with remaining implicit bias
  5. Where to find more information about specific pitfalls
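
Below is a minimal sketch of the WEAT effect size as described by Caliskan et al. (2017), assuming an `embeddings` dict mapping words to NumPy vectors (e.g. loaded from GloVe). The target and attribute word sets are illustrative choices, not the ones used in the talk.

    import numpy as np

    def cosine(u, v):
        """Cosine similarity between two vectors."""
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def association(w, A, B, emb):
        """s(w, A, B): mean similarity of w to attribute set A minus set B."""
        return (np.mean([cosine(emb[w], emb[a]) for a in A])
                - np.mean([cosine(emb[w], emb[b]) for b in B]))

    def weat_effect_size(X, Y, A, B, emb):
        """Effect size of the differential association of target sets X, Y
        with attribute sets A, B (larger magnitude = stronger bias)."""
        assoc_X = [association(x, A, B, emb) for x in X]
        assoc_Y = [association(y, A, B, emb) for y in Y]
        pooled_std = np.std(assoc_X + assoc_Y, ddof=1)
        return (np.mean(assoc_X) - np.mean(assoc_Y)) / pooled_std

    # Illustrative target/attribute sets (hypothetical choices):
    X = ["programmer", "engineer", "scientist"]   # target set 1
    Y = ["nurse", "teacher", "librarian"]         # target set 2
    A = ["man", "male", "he"]                     # attribute set 1
    B = ["woman", "female", "she"]                # attribute set 2

    # embeddings = {...}  # load your own word vectors here
    # print(weat_effect_size(X, Y, A, B, embeddings))

A full WEAT also includes a permutation test over the target sets to estimate a p-value for the observed statistic; the effect size above is the part most often reported for comparing embeddings.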
