Sunday 11:30–12:05 in Megatorium

Scaling Machine Learning jobs with Kubernetes

Tarek Mehrez, Carsten Lygteskov Hansen

Audience level:
Intermediate

Description

Running machine learning jobs at scale places painful operational demands on infrastructure. As the number of jobs increases, easy-to-use infrastructure becomes a necessity. In this talk we will cover how we use Kubernetes at Textkernel as a job manager to scale our TensorFlow-based jobs. We will also explore other solutions such as distributed TensorFlow and Kubeflow.

Abstract

Summary

At Textkernel, training parsing models is part of our everyday work to improve our products. With an increasing number of training jobs, team members, and machines to schedule the jobs on, the task becomes extremely daunting. We needed a way to organize and automate the process and to separate concerns between research engineers and the underlying infrastructure.

Challenges

Solution

Kubernetes (k8s) was a natural solution to some of these problems, given its features as a container-orchestration platform. Running ML jobs on Kubernetes gave us several benefits.
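As a rough sketch of this pattern (the names, image, and resource figures below are hypothetical illustrations, not Textkernel's actual configuration), a single TensorFlow training run can be packaged as a Kubernetes Job, letting the cluster handle node selection, retries, and cleanup:

```yaml
# Hypothetical example: one TensorFlow training run as a Kubernetes Job.
# Image name, command, and resource requests are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-parsing-model
spec:
  backoffLimit: 2            # retry a failed training run up to two times
  template:
    spec:
      restartPolicy: Never   # a finished or failed run is not restarted in place
      containers:
        - name: trainer
          image: example.registry/tf-trainer:latest  # hypothetical image
          command: ["python", "train.py", "--epochs", "10"]
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              nvidia.com/gpu: 1   # schedule onto a node with a free GPU
```

Submitted with `kubectl apply -f job.yaml`, the scheduler finds a suitable machine and runs the job to completion, so nobody has to pick machines by hand.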

Outline

Targeted Audience

Knowledge Prerequisites

Basic understanding of distributed systems and machine learning. Familiarity with Kubernetes and TensorFlow is a plus.
