PyData Amsterdam 2019 - Presentation: Sebenz.ai - South African job creation through gamified data labeling for machine learning

Sebenza means "work" in isiXhosa and isiZulu, two widely spoken languages in South Africa. Our mission is to create 1 million jobs in Africa. Unemployment in South Africa currently stands at 28%. Most unemployed people have a smartphone, and they all have free time.

Sebenz.ai has two parts: (1) a machine learning (ML) labeling game that creates jobs for people in Africa, who earn money on their phones by labeling training data for ML models; and (2) an interface where customers upload their own unlabeled data and we train a custom ML model that is accessible by API.

Initially, we do not have enough labeled data to train an accurate ML model, so Sebenz.ai workers are the API: we simply return the human-generated labels, with near real-time response times because many workers label the data in parallel. We continuously train a model (using open source deep neural network libraries) and use it to predict a label for each subsequent request. The model returns a confidence with each prediction. When the confidence is high, we return the model prediction; when it is low, human workers create a new label and the model is trained on the additional example. Over time, more and more requests are handled by the ML model and fewer need to be handled by humans. From the perspective of a customer using our machine learning APIs, it is impossible to know whether a given request was completed by human workers or by the model. We use consensus between workers, ground-truth examples and statistical methods to ensure that worker labels are accurate.
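The confidence-based hand-off between model and workers can be sketched roughly as follows. This is a minimal illustration, not our production code: the `HybridLabeler` class, the `predict` and `ask_humans` callables, the toy stand-ins and the 0.9 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class HybridLabeler:
    # Hypothetical interface: returns (label, confidence in [0, 1]).
    predict: Callable[[str], Tuple[str, float]]
    # Hypothetical interface: returns a label agreed on by human workers.
    ask_humans: Callable[[str], str]
    # Assumed cut-off; the real value would be tuned per model.
    threshold: float = 0.9
    # Human-labeled examples queued for the next training run.
    new_examples: List[Tuple[str, str]] = field(default_factory=list)

    def label(self, item: str) -> Tuple[str, str]:
        """Return (label, source), where source is 'model' or 'human'."""
        predicted, confidence = self.predict(item)
        if confidence >= self.threshold:
            return predicted, "model"
        human_label = self.ask_humans(item)
        self.new_examples.append((item, human_label))
        return human_label, "human"


# Toy stand-ins for a trained model and a pool of workers.
def toy_predict(item: str) -> Tuple[str, float]:
    known = {"cat.jpg": ("cat", 0.97)}
    return known.get(item, ("unknown", 0.2))


def toy_humans(item: str) -> str:
    return "dog"


labeler = HybridLabeler(predict=toy_predict, ask_humans=toy_humans)
confident_result = labeler.label("cat.jpg")   # answered by the model
uncertain_result = labeler.label("dog.jpg")   # falls back to workers
```

The `source` value is only for internal bookkeeping in this sketch; the customer-facing response would carry just the label, which is why the two paths are indistinguishable from outside.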

There is no isiXhosa speech-to-text model, and most computer vision models fail on African faces because they aren't trained on African data. You only need 10,000 hours of audio to train a speech-to-text model. You could get one worker to label 10,000 hours over approximately 65 months, or we can get 10,000 workers to label the data in parallel and train a new speech model in a few hours using Sebenz.ai.
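The arithmetic behind those two figures can be checked in a few lines. The sketch assumes, purely for illustration, that labeling happens at real-time speed (one hour of work per hour of audio); the variable names are ours.

```python
AUDIO_HOURS = 10_000          # audio needed, as quoted above
MONTHS_SINGLE_WORKER = 65     # single-worker estimate, as quoted above
WORKERS = 10_000              # size of the parallel workforce

# Working hours per month implied by the single-worker estimate.
hours_per_month = AUDIO_HOURS / MONTHS_SINGLE_WORKER

# Share of the audio each worker handles when the job is split evenly.
hours_per_worker = AUDIO_HOURS / WORKERS

print(f"{hours_per_month:.1f} hours per month for one worker")
print(f"{hours_per_worker:.1f} hour(s) per worker in parallel")
```

Under that assumption the 65-month figure corresponds to roughly a full-time job, while the parallel case leaves each worker with about one hour of audio.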

We apply privacy-preserving transformations to the data before sending it to the Sebenz.ai human workforce, to reduce the possibility of sensitive data leaking out of the system. The transformations depend on the model type. For example, we slice audio into 3-second snippets and shuffle the snippets before sending them to workers to transcribe, with no single worker labeling more than 3 seconds in any rolling 30-second window. For digitizing forms, we digitally shred the document into small pieces with a few characters each, then recombine the pieces into a document-level OCR result once workers (and later the model) have labeled each piece.
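One simple way to satisfy the audio constraint above is a round-robin assignment over at least ten workers, followed by a shuffle of the delivery order. The sketch below is an assumption about how this could be done, not a description of our actual scheduler; `slice_audio`, `assign_snippets` and `respects_window` are hypothetical names.

```python
import random
from typing import Dict, List, Optional, Tuple

SNIPPET_SECONDS = 3
WINDOW_SECONDS = 30

Snippet = Tuple[int, int]  # (start, end) offsets in seconds


def slice_audio(duration_seconds: int) -> List[Snippet]:
    """Cut a recording into consecutive snippets of at most 3 seconds."""
    return [(start, min(start + SNIPPET_SECONDS, duration_seconds))
            for start in range(0, duration_seconds, SNIPPET_SECONDS)]


def assign_snippets(duration_seconds: int, n_workers: int,
                    seed: Optional[int] = None) -> List[Tuple[int, Snippet]]:
    """Return shuffled (worker_id, snippet) tasks."""
    min_workers = WINDOW_SECONDS // SNIPPET_SECONDS
    if n_workers < min_workers:
        raise ValueError(f"need at least {min_workers} workers")
    snippets = slice_audio(duration_seconds)
    # Round-robin: a worker's snippets start 3 * n_workers seconds apart,
    # which is at least 30 seconds when n_workers >= 10.
    tasks = [(i % n_workers, snippet) for i, snippet in enumerate(snippets)]
    # Shuffle the delivery order so snippets arrive out of sequence.
    random.Random(seed).shuffle(tasks)
    return tasks


def respects_window(tasks: List[Tuple[int, Snippet]]) -> bool:
    """Check that no worker has two snippets starting under 30 s apart."""
    starts_by_worker: Dict[int, List[int]] = {}
    for worker, (start, _end) in tasks:
        starts_by_worker.setdefault(worker, []).append(start)
    for starts in starts_by_worker.values():
        starts.sort()
        if any(b - a < WINDOW_SECONDS for a, b in zip(starts, starts[1:])):
            return False
    return True


tasks = assign_snippets(duration_seconds=120, n_workers=10, seed=0)
```

Two snippets whose start times are at least 30 seconds apart can never contribute more than 3 seconds in total to the same 30-second window, which is why the check compares start times only.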

We use consensus algorithms and track each worker's skill relative to gold-standard datasets, to other workers and to their own performance over time, to guarantee label quality. We automatically allocate workers to tasks that they are good at, and adjust worker rewards to provide latency guarantees and to match supply and demand for worker attention.
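As an illustration of how consensus and gold-standard tracking can fit together, the sketch below weights each worker's vote by their accuracy on gold-standard items. It is one common approach and an assumption on our part for this example, not necessarily the algorithm we run; the function names, the worker names and the default weight of 0.5 for unseen workers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def worker_accuracy(gold: Dict[str, str],
                    answers: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """answers holds (worker, item, label); score workers on gold items only."""
    correct: Dict[str, int] = defaultdict(int)
    seen: Dict[str, int] = defaultdict(int)
    for worker, item, label in answers:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[item])
    return {worker: correct[worker] / seen[worker] for worker in seen}


def weighted_consensus(votes: List[Tuple[str, str]],
                       accuracy: Dict[str, float],
                       default_weight: float = 0.5) -> Tuple[str, float]:
    """votes holds (worker, label); return (winning label, weight share)."""
    if not votes:
        raise ValueError("no votes to aggregate")
    totals: Dict[str, float] = defaultdict(float)
    for worker, label in votes:
        totals[label] += accuracy.get(worker, default_weight)
    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("all voters have zero weight")
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total_weight


gold = {"g1": "cat", "g2": "dog"}
answers = [
    ("ann", "g1", "cat"), ("ann", "g2", "dog"),   # both correct
    ("bob", "g1", "cat"), ("bob", "g2", "cat"),   # one correct
    ("cy", "g1", "dog"), ("cy", "g2", "cat"),     # none correct
]
accuracy = worker_accuracy(gold, answers)
winner, share = weighted_consensus(
    [("ann", "cat"), ("bob", "dog"), ("cy", "dog")], accuracy)
```

In this toy case a plain majority vote would pick "dog", but the accuracy weighting lets the one reliable worker outvote two unreliable ones. The returned weight share could also serve as a signal for when to request more votes.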

We pay our workers more than minimum wage and empower them by giving them a way to work from anywhere, on their own terms.

Sunday 15:45–16:20 in Auditorium

Alex Conway

