PyData Amsterdam 2018 - Presentation: Scaling Machine Learning jobs with Kubernetes

Running machine learning jobs at scale places painful operational demands on infrastructure. As the number of jobs increases, an easy-to-use infrastructure becomes a necessity. In this talk we will cover how we use Kubernetes at Textkernel as a job manager to scale our TensorFlow-based jobs. We will also explore other solutions such as distributed TensorFlow and Kubeflow.

Summary

At Textkernel, training parsing models is part of our everyday work to improve our products. As the number of training jobs, team members, and machines to schedule the jobs on grows, coordinating this work by hand becomes daunting. We needed a way to organize and automate the process and to separate the concerns of Research Engineers from those of the underlying infrastructure.

Challenges

Setting up a runtime environment that uses TensorFlow
Starting a training job that does hyperparameter tuning on the right machine given the hardware needs
Effective utilization of computing resources
Speeding up training by distributing the work

Solution

Kubernetes (k8s) was a natural fit for some of these problems, given its features as a container orchestration platform. Running ML jobs in Kubernetes gave us the following benefits:

Running work in a distributed fashion
The ability to manage a cluster of machines with different hardware specs, each fitting a certain type of job
Having an easy way to isolate runtime environments using Docker
Using the k8s scheduler for job scheduling and resource management
The ability to scale the infrastructure easily by adding new machines on demand
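
To illustrate the scheduling and hardware points above, a training job can be described to Kubernetes as a Job manifest that states its resource needs and the kind of node it should land on. The following is a minimal sketch under stated assumptions, not the setup presented in the talk: the helper name build_training_job, the image name, the node label, and the CPU and memory figures are all hypothetical. It uses only the Python standard library and prints a JSON manifest of the kind that kubectl accepts.

```python
import json


def build_training_job(name, image, command, gpus=0, node_selector=None):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict.

    Hypothetical helper for illustration; all values are assumptions.
    """
    # Baseline CPU and memory requests (illustrative numbers).
    resources = {"requests": {"cpu": "2", "memory": "8Gi"}}
    if gpus:
        # GPUs are requested via the extended resource exposed by the
        # NVIDIA device plugin and are specified under "limits".
        resources["limits"] = {"nvidia.com/gpu": gpus}

    pod_spec = {
        # A finished training run should not be restarted in place.
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "trainer",
                "image": image,
                "command": command,
                "resources": resources,
            }
        ],
    }
    if node_selector:
        # Restrict scheduling to nodes carrying these labels.
        pod_spec["nodeSelector"] = node_selector

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {"backoffLimit": 0, "template": {"spec": pod_spec}},
    }


if __name__ == "__main__":
    job = build_training_job(
        name="parser-training-example",           # hypothetical job name
        image="example.org/tf-trainer:latest",    # hypothetical image
        command=["python", "train.py", "--learning-rate", "0.001"],
        gpus=1,
        node_selector={"hardware": "gpu"},        # hypothetical node label
    )
    print(json.dumps(job, indent=2))
```

With a manifest like this, the choice of machine is left to the k8s scheduler: it places the pod only on a node that matches the label and has the requested resources free, which is the separation between job authors and infrastructure described above.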

Outline

Overview of the problem
Introduction to k8s
Why k8s is a good fit for managing machine learning jobs
A walkthrough of our current solution and discussion of the benefits
Other improvements and solutions proposed by the k8s community to this problem

Target Audience

Mainly Software and Machine Learning Engineers who are interested in scaling ML jobs
Data Scientists who are curious about what happens under the hood when you start a training job, and why it is a non-trivial task

Knowledge Prerequisites

Basic understanding of distributed systems and machine learning. Familiarity with Kubernetes and TensorFlow is a plus.

Sunday 11:30–12:05 in Megatorium

Tarek Mehrez, Carsten Lygteskov Hansen
