Saturday 10:45–11:30 in A208

Gold standard data: lessons from the trenches

Miroslav Batchkarov

Audience level:
Novice

Description

The first stage in a data science project is often to collect training data. However, getting a good data set is surprisingly tricky and takes longer than one expects. This talk describes our experiences in labelling gold-standard data and the lessons we learnt the hard way. We will present three case studies from natural language processing and discuss the challenges we encountered.

Abstract

It is often said that rather than spending a month figuring out how to apply unsupervised learning to a problem domain, a data scientist should spend a week labelling data. However, the difficulty of annotating data is often underestimated. Gathering a sufficiently large collection of good-quality labelled data requires careful problem definition and multiple iterations. In this talk, I will describe three case studies and the lessons learnt from them. Each case highlights several aspects of the process that should be considered in advance to ensure the project is successful.

Case study 1: word embeddings

Methods for representing words as vectors have become popular in recent years. Historically, these have been evaluated by correlating the "similarity" of word pairs, as predicted by a model, with that assigned by human judges. A typical data set consists of word pairs and a similarity score, e.g. cat - dog = 80% and cat - democracy = 21%.
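To make the setup concrete, here is a minimal sketch of that evaluation, assuming a toy dictionary of word vectors and a handful of human-scored pairs (all names and numbers below are made up):

```python
# Sketch: evaluate word vectors against a human-judged similarity data set.
# The `vectors` dict and the pair list are placeholder toy data.
import numpy as np
from scipy.stats import spearmanr

vectors = {
    "cat": np.array([0.2, 0.9, 0.1]),
    "dog": np.array([0.3, 0.8, 0.2]),
    "democracy": np.array([0.9, 0.1, 0.7]),
}

# (word1, word2, human similarity score)
human_judgements = [
    ("cat", "dog", 0.80),
    ("cat", "democracy", 0.21),
    ("dog", "democracy", 0.25),
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in human_judgements]
human_scores = [score for _, _, score in human_judgements]

# Spearman's rho: how well the model's ranking of pairs matches the humans'.
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgements: {rho:.2f}")
```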

Questions:

Is the task clearly defined?

What exactly is word similarity? We all have an intuitive understanding, but there are many corner cases, e.g. what about rice - cooking, or big - small? Each annotator may interpret your instructions differently. Can you guarantee they have not misunderstood the task?

Is the task easy to do?

Because it is so hard to give annotators exact instructions, they will often disagree. The similarity of tiger - cat ranges from 50% to 90% in a popular data set. If humans cannot agree on the right answer for a given input, how can a model ever do well?

Do you have quality control in place?

How should one deal with poor-quality annotation? Do you have a mechanism for identifying annotator errors or consistently under-performing annotators?
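One simple check, sketched below under the assumption that every annotator scores the same word pairs, is to compare each annotator's scores with the average of everyone else's and flag outliers:

```python
# Sketch: flag annotators whose judgements correlate poorly with everyone else's.
# The score matrix is a made-up example: rows = annotators, columns = word pairs.
import numpy as np
from scipy.stats import spearmanr

scores = np.array([
    [0.90, 0.10, 0.80, 0.30, 0.60, 0.20],   # annotator A
    [0.80, 0.20, 0.90, 0.40, 0.50, 0.10],   # annotator B
    [0.85, 0.15, 0.70, 0.35, 0.55, 0.25],   # annotator C
    [0.30, 0.70, 0.20, 0.90, 0.10, 0.75],   # annotator D -- looks suspicious
])

for i, row in enumerate(scores):
    others = np.delete(scores, i, axis=0).mean(axis=0)
    rho, _ = spearmanr(row, others)
    flag = "  <-- review this annotator's work" if rho < 0.5 else ""
    print(f"annotator {i}: correlation with the others = {rho:.2f}{flag}")
```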

How much data can you get?

Can you get enough data? How much do you need? What if you have to discard data produced by unreliable annotators?

Case study 2: symptom recognition in medical data

The task is to identify mentions of symptoms in notes taken by a doctor during routine exams, e.g. "**Abdominal pain**. Causes may include **acute bacterial tonsillitis**. No allergies." (symptom mentions in bold).
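For concreteness, a labelled note is often stored as character-offset spans over the raw text; the schema below is purely illustrative, not the format used in the project:

```python
# Sketch: one possible representation of symptom annotations as character spans.
note = "Abdominal pain. Causes may include acute bacterial tonsillitis. No allergies."

# Each annotation records where a symptom mention starts and ends in the note.
annotations = [
    {"start": 0, "end": 14, "label": "SYMPTOM"},    # "Abdominal pain"
    {"start": 35, "end": 62, "label": "SYMPTOM"},   # "acute bacterial tonsillitis"
]

for ann in annotations:
    print(note[ann["start"]:ann["end"]], "->", ann["label"])
```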

Do you need an expert (linguist/doctor/banker)?

Can you find subject experts? Can you communicate to them exactly what the task is, keeping in mind they will not be NLP experts?

People issues

What if annotators do not show up, or take two-week breaks between sessions?

Can you measure inter-annotator agreement?

Can you quantify annotator agreement? In the case of word similarity, this is reasonably easy, but it gets trickier when the annotation unit has a complex structure (e.g. when it is a phrase). See the submission by Alexandar Savkov for more details.
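For token-level labels, a common starting point is a chance-corrected measure such as Cohen's kappa; the toy label sequences below are invented for illustration:

```python
# Sketch: chance-corrected agreement between two annotators on token-level labels.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["O", "SYMPTOM", "SYMPTOM", "O", "O", "SYMPTOM"]
annotator_2 = ["O", "SYMPTOM", "O",       "O", "O", "SYMPTOM"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```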

Do you have a process for resolving conflicts?

If two annotators disagree, can the conflict be resolved, or should the data point be discarded?
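One common fallback, sketched here rather than prescribed, is to keep the majority label where one exists and discard or adjudicate the rest:

```python
# Sketch: majority-vote conflict resolution; items without a strict majority
# are sent back for adjudication or discarded.
from collections import Counter

def resolve(labels):
    """Return the majority label, or None if there is no strict majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

print(resolve(["SYMPTOM", "SYMPTOM", "O"]))  # SYMPTOM
print(resolve(["SYMPTOM", "O"]))             # None -> adjudicate or discard
```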

Case study 3: boilerplate removal in HTML pages

What is the data for?

The distinction between content and boilerplate is blurry and may change depending on what you plan to do next.

Tooling

Do you need specialist software? Do you have to write it yourself?
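If nothing off the shelf fits, even a small script can get labelling started. The sketch below (using BeautifulSoup on a made-up page) simply shows each text block and records a content/boilerplate decision by hand:

```python
# Sketch: a bare-bones labelling aid for boilerplate removal.
# Each block of visible text in the page is shown and labelled interactively.
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="nav">Home | About | Subscribe</div>
  <p>Methods for representing words as vectors have become popular.</p>
  <div class="footer">Copyright 2016</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
labels = []
for block in soup.find_all(["p", "div"]):
    text = block.get_text(strip=True)
    answer = input(f"CONTENT or BOILERPLATE?\n  {text}\n> ")
    labels.append((text, answer.strip().upper()))

print(labels)
```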
