Sunday 14:15–15:00 in Tower Suite 1

Data Science Frameworks and Managed Services: When to Avoid the Shiny New Toys

Jon Tutcher

Audience level:
Novice

Description

How do we spend less time fighting with computers and more time having fun? Software frameworks and managed services promise to save data scientists' time, but often introduce complexity and learning curves. In this session we'll discuss the advantages and costs of adopting these tools, from the experience of a team that's spent the last year building Python recommendation engines and services.

Abstract

The quality and capability of hosted machine learning solutions provided by Google, Amazon, and others have continued to grow over the last few years. For data science work in Python, open source projects like Kubernetes, Airflow, and JupyterHub allow us to develop more quickly, reuse code, and deploy software more flexibly. Even as these tools rapidly become easier to use, there's still an art to choosing the right tool for the job. Whilst some tasks are automated away, using and configuring a framework can sometimes take more effort than the work it replaces.

In the BBC's Datalab team, we've spent the last 18 months building machine learning APIs and recommendation systems for use in the corporation and by the public. In December we launched a mobile app that uses collaborative filtering to recommend content to users based on their viewing habits. With a 1:1 ratio of data scientists to engineers, we ended up spending a significant amount of our time exploring ways to deploy machine learning systems at scale, eventually choosing to manage our own containerised services in Kubernetes for exploration, model training, and deployment. Recently, we've deployed another system into the BBC's more traditional "Cosmos" system using plain old VMs and autoscaling groups.

We had assumed that buying in managed infrastructure or data science solutions would be easier for the team. Instead, we found that the time saved by not running our own compute infrastructure was taken up by reading documentation and learning how to interact with various managed "black boxes".

This talk will suit anyone hoping to get a grasp on how to evaluate tools for deploying ML models to the real world. It will also suit those who are sceptical of "CV-driven development" approaches to technology adoption.
