Running machine learning jobs at scale places heavy operational demands on infrastructure. As the number of jobs increases, an easy-to-use infrastructure becomes a necessity. In this talk we will cover how we use Kubernetes at Textkernel as a job manager to scale our TensorFlow-based jobs. We will also explore other solutions such as distributed TensorFlow and Kubeflow.
Summary
At Textkernel, training parsing models is part of our everyday work to improve our products.
As the number of training jobs, team members, and machines to schedule the jobs on grows, managing them becomes daunting. We needed a way to organize and automate the process and to separate the concerns of Research Engineers from those of the underlying infrastructure.
Challenges
- Setting up a runtime environment that uses TensorFlow
- Starting a training job that does hyperparameter tuning on a machine that matches its hardware needs
- Effective utilization of computing resources
- Speeding up training by distributing the work
Solution
Kubernetes (k8s) was a natural solution to several of these problems, given its features as a container orchestration platform.
Running ML jobs in Kubernetes gave us the following benefits:
- Running work in a distributed fashion
- The ability to manage a cluster of different hardware specs, each fitting a certain type of job
- Having an easy way to isolate runtime environments using Docker
- Using the k8s scheduler for job scheduling and resource management
- The ability to scale the infrastructure easily by adding new machines on demand
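As a rough illustration of the benefits above, here is a minimal sketch of how a training run could be described as a Kubernetes Job so that the scheduler picks a suitable machine. It is not Textkernel's actual setup: the function name, image name, `train.py` command, the `hardware: gpu` node label, and the resource values are all hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def build_training_job(name, image, command, gpus=0, cpu="4", memory="16Gi",
                       node_selector=None):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    # CPU and memory requests let the scheduler find a node with enough capacity.
    resources = {"requests": {"cpu": cpu, "memory": memory}}
    if gpus:
        # GPUs are requested through limits; assumes the NVIDIA device plugin.
        resources["limits"] = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Restrict the pod to nodes carrying these labels (assumed labels).
                    "nodeSelector": node_selector or {},
                    "containers": [{
                        "name": "trainer",
                        "image": image,  # Docker image holding the runtime environment
                        "command": command,
                        "resources": resources,
                    }],
                }
            },
        },
    }


# One Job per hyperparameter setting; the scheduler places each one independently.
learning_rates = [0.1, 0.01, 0.001]
jobs = [
    build_training_job(
        name=f"parser-train-{i}",
        image="registry.example.com/parser-trainer:latest",  # hypothetical image
        command=["python", "train.py", f"--learning-rate={lr}"],
        gpus=1,
        node_selector={"hardware": "gpu"},  # hypothetical node label
    )
    for i, lr in enumerate(learning_rates)
]

print(json.dumps(jobs[0], indent=2))
```

Each manifest can be written out as JSON and submitted with `kubectl apply -f`, which accepts JSON as well as YAML.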
Outline
- Overview of the problem
- Introduction to k8s
- Why k8s is a good fit for managing machine learning jobs
- A walkthrough of our current solution and discussion of the benefits
- Other improvements and solutions proposed by the k8s community to this problem
Target Audience
- Mainly Software and Machine Learning Engineers who are interested in scaling ML jobs
- Data Scientists who are curious about what happens under the hood when you start a training job, and why it is a non-trivial task
Knowledge Prerequisites
Basic understanding of distributed systems and machine learning. Familiarity with Kubernetes and TensorFlow is a plus.