Monday 15:00–15:30 in Track 2

Hand in hand with weak supervision using snorkel

Szymon Wojciechowski

Audience level:
Intermediate

Description

In Natural Language Processing, a frequent problem when we want to use a supervised model is the availability of labels. One way of circumventing the tedious and costly manual annotation of (tens of) thousands of samples is weak supervision, where heuristics, partially annotated datasets, and crowd-sourced labels (on a fraction of samples) can be unified to generate labels.

Abstract

Let's assume the following scenario: you've scraped half of the web in order to build a model supporting your research. You've collected and organized thousands, if not millions, of data pieces that you want to reason about. Everything is fine as long as you don't need any labels for this horrifying amount of data...

What can you do?
- Give up.
- Cry in the depth of your soul, traversing samples one by one, manually annotating them, and finally give up around the 200th sample.
- Crowdsource for the annotations, where you may be subjected to all shades of mankind like malice or incompetence.

Or you can find some loose patterns in the data which characterize samples that you are interested in for whatever reason, sum them up and let computers reason for you. This is where weak supervision steps in.

Weak supervision is a paradigm where you can gather under one coherent interface as many types of data annotations as you have at your disposal. They do not need to be 100% perfect, they do not need to cover each and every sentence, they do not need to be very strong. If they are meaningful in the context of your problem, then you are good to go. That's why you can mix all the annotations that you tediously produced before you gave up, partially faulty results from crowdsourcing, and all the remarks from domain experts around you.
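As a rough illustration of that "one coherent interface" idea, the sketch below stores three hypothetical sources (a partial manual annotation, a noisy crowdsourced vote, an expert's rule of thumb) as per-sample votes, with a sentinel value for "no opinion". The source names, labels, and data are invented for illustration and do not come from the talk.

```python
ABSTAIN = -1  # sentinel: the source has no opinion on this sample

# Hypothetical votes for five samples (1 = positive, 0 = negative).
sources = {
    "manual_partial": [1, 0, ABSTAIN, ABSTAIN, ABSTAIN],
    "crowdsourced":   [1, 1, 0, ABSTAIN, 1],
    "expert_rule":    [ABSTAIN, 0, 0, ABSTAIN, 1],
}


def coverage(votes):
    """Fraction of samples on which a source gives any label."""
    return sum(v != ABSTAIN for v in votes) / len(votes)


def conflicts(all_sources):
    """Indices of samples where at least two sources disagree."""
    n_samples = len(next(iter(all_sources.values())))
    disagreeing = []
    for i in range(n_samples):
        labels = {v[i] for v in all_sources.values() if v[i] != ABSTAIN}
        if len(labels) > 1:
            disagreeing.append(i)
    return disagreeing


source_coverage = {name: coverage(v) for name, v in sources.items()}
conflicting = conflicts(sources)
```

No source here covers every sample, and the sources disagree on one of them, yet all three fit the same representation. Resolving the disagreement is the job of whatever combines the votes.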

To make the vague description above more practical, the presentation will discuss one solution to this problem: Snorkel. It is a tool designed under the data programming paradigm, where your expertise is passed to the model as a bundle of functions expressing everything you know so far about the domain.
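To give a feel for what "a bundle of functions" can look like, here is a minimal plain-Python sketch of data programming. It does not use Snorkel's API: the spam/ham task, the three heuristics, and the sample texts are all hypothetical, and a simple majority vote stands in for the learned label model that Snorkel uses to combine votes.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


# Each labeling function encodes one loose heuristic and may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_vote(text):
    """Combine the non-abstaining votes; abstain on ties or no votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # tie between labels: no confident answer
    return counts[0][0]


texts = [
    "Claim your prize at http://example.com now",
    "See you tomorrow",
    "The quarterly report is attached for your review",
]
weak_labels = [majority_vote(t) for t in texts]
```

The third text is left unlabeled because no heuristic fires on it, which is acceptable in this setting: the weak labels only need to be good enough, on enough samples, to train a downstream supervised model.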
