Many data science and machine learning techniques require labelled data. In many businesses, this means that a lot of time, energy or money goes into acquiring labels. Active learning makes this process more efficient by creating a feedback loop between a model and the labellers: data points are chosen for labelling based on current model performance. Here, I discuss methods of doing so easily and quickly in the interactive python ecosystem.
Active learning works by choosing the data points to be labelled based on the confidence of a statistical or machine learning model. The less confident a model is about a data point, the more useful that data point is likely to be in providing the model with additional information.
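As a minimal sketch of this idea, the common "least confident" selection criterion can be written in a few lines of numpy. The function name and the example probabilities below are illustrative, not taken from any particular library:

```python
import numpy as np

def least_confident(probabilities, n=2):
    """Rank unlabelled points by model uncertainty (least-confident sampling).

    probabilities: (n_samples, n_classes) array of predicted class probabilities.
    Returns the indices of the n points the model is least sure about.
    """
    confidence = probabilities.max(axis=1)  # probability of the predicted class
    return np.argsort(confidence)[:n]

# Hypothetical predictions for four unlabelled points:
probs = np.array([
    [0.98, 0.02],  # very confident
    [0.55, 0.45],  # uncertain
    [0.90, 0.10],
    [0.51, 0.49],  # most uncertain
])
least_confident(probs, n=2)  # indices 3 and 1: the two least confident points
```

Those are exactly the points a human labeller should look at next.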
The python ecosystem has amazing support for both statistical and machine learning models. Through the IPython, Jupyter and ipywidgets projects, it also has great support for interactive tool building, enabling users to create rich user interfaces from pure python code.
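For instance, a bare-bones labelling prompt can be built from pure python with ipywidgets. This is a hypothetical sketch of the general pattern, not superintendent's actual interface:

```python
import ipywidgets as widgets

# A minimal labelling prompt: two buttons that record the chosen
# label via a callback, built entirely in python.
labels = []

def make_button(label):
    button = widgets.Button(description=label)
    button.on_click(lambda _: labels.append(label))
    return button

buttons = widgets.HBox([make_button("cat"), make_button("dog")])
# In a Jupyter notebook, evaluating `buttons` renders the clickable widget.
```

Every click appends to `labels`, so a plain python list (or any model object) can sit directly behind the interface.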
The combination of the two allows us to build outstanding active learning tools. By creating a user interface that enables simple interactions with complex data structures and models, active learning fits right into the standard python data science workflow. I'll discuss how this is implemented by the library superintendent, which enables the creation of active learning labelling workflows using numpy arrays or pandas dataframes as input, and sklearn-compatible models as the active learning mechanism.
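The loop that a tool like this runs underneath can be sketched with plain sklearn and numpy, with a simulated oracle standing in for the human labeller. All names and data here are illustrative, and superintendent wraps this kind of loop behind a widget interface rather than exposing it directly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A toy pool of unlabelled 1-d points; the true label is whether x > 0.
pool = rng.normal(size=(100, 1))
labelled_X = np.array([[-2.0], [2.0]])
labelled_y = np.array([0, 1])

model = LogisticRegression()
for _ in range(5):  # five rounds of simulated labelling
    model.fit(labelled_X, labelled_y)
    confidence = model.predict_proba(pool).max(axis=1)
    pick = int(np.argmin(confidence))        # least confident point in the pool
    new_label = int(pool[pick, 0] > 0)       # oracle stands in for a human
    labelled_X = np.vstack([labelled_X, pool[pick:pick + 1]])
    labelled_y = np.append(labelled_y, new_label)
    pool = np.delete(pool, pick, axis=0)
```

Because any model with `fit` and `predict_proba` fits this pattern, the sklearn-compatible interface is a natural choice for the active learning mechanism.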
While generating user interfaces in notebooks works for single use cases, active learning at scale requires simple web interfaces that don't require the user to run code. This is possible thanks to libraries such as voila, a web server that turns jupyter notebooks into web pages with full interactivity. Because data scientists can prototype user interfaces in notebooks and then deploy them to the web without changes, scaling up becomes much quicker.
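Assuming the labelling widgets live in a notebook called labelling.ipynb (the filename here is illustrative), serving it as a standalone web app can be as simple as:

```shell
# Install voila and serve the notebook as an interactive web page.
pip install voila
voila labelling.ipynb --port=8866 --no-browser
```

Labellers then only need a browser and the URL, not a running python kernel of their own.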
However, distributed labelling also requires decoupling the model from the labelling. Fitting and evaluating a model has to happen independently of the labelling effort, and coordination of the labelling has to occur through a queueing system. Superintendent solves these scaling issues using a database as a queue, and a worker / orchestration split that can be implemented using docker compose.
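A sketch of what such a deployment might look like as a docker-compose file — the service names, entry points and images here are illustrative assumptions, not superintendent's actual configuration:

```yaml
version: "3"
services:
  db:
    # the database doubles as the labelling queue
    image: postgres:latest
    environment:
      POSTGRES_PASSWORD: example
  orchestrator:
    # refits and evaluates the model as new labels arrive
    build: .
    command: python orchestrate.py  # hypothetical entry point
    depends_on:
      - db
  labelling-ui:
    # voila serves the labelling notebook to many labellers at once
    build: .
    command: voila labelling.ipynb --no-browser --port=8866
    ports:
      - "8866:8866"
    depends_on:
      - db
```

The key design point is that the labelling interface and the model-fitting worker only communicate through the database, so either side can be scaled or restarted independently.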