Tuesday 1:00 PM–1:45 PM in The Franklin Suite, 3rd Floor / Technical

Deploy and Use a Multiframework Distributed Deep Learning Platform on Kubernetes

Animesh Singh, Tommy Li

Audience level:
Intermediate

Description

Learn how to use Fabric for Deep Learning (FfDL) to execute distributed deep learning training for models written using multiple frameworks

Abstract

Training deep neural network models requires a highly tuned system with the right combination of software, drivers, compute, memory, network, and storage resources. Deep learning frameworks like TensorFlow, PyTorch, Caffe, Torch, Theano, and MXNet have contributed to the popularity of deep learning by reducing the effort and skill needed to design, train, and use deep learning models. Fabric for Deep Learning (FfDL, pronounced “fiddle”) provides a consistent way to run these deep learning frameworks as a service on Kubernetes. FfDL uses a microservices architecture to reduce coupling between components, keep each component simple and as stateless as possible, isolate component failures, and allow each component to be developed, tested, deployed, scaled, and upgraded independently.
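
To make the "deep learning as a service" idea concrete, below is a minimal Python sketch of submitting a training job to an FfDL deployment through its REST API. The endpoint path, API version string, default credentials, header, and form-field names are assumptions drawn from the examples in the FfDL GitHub repository, not details from the talk itself; verify them against your own deployment.

    # Sketch: submit a training job to a local FfDL deployment over its
    # REST API. Endpoint, credentials, header, and form-field names are
    # assumptions based on the FfDL repository examples.
    import requests

    FFDL_API = "http://localhost:31001"  # assumed NodePort of the FfDL REST API service

    with open("manifest.yml", "rb") as manifest, open("model.zip", "rb") as model:
        resp = requests.post(
            f"{FFDL_API}/v1/models",
            params={"version": "2017-02-13"},  # assumed API version string
            auth=("test-user", "test"),        # assumed default dev credentials
            headers={"X-Watson-Userinfo": "bluemix-instance-id=test-user"},
            files={
                "manifest": manifest,        # job spec: framework, learners, GPUs, data stores
                "model_definition": model,   # zipped training code for the chosen framework
            },
        )

    resp.raise_for_status()
    print("Queued training job:", resp.json())  # response is assumed to include a model id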

Animesh Singh and Tommy Li share lessons learned while building and using FfDL and demonstrate how to use it to run distributed deep learning training for models written in multiple frameworks, backed by GPUs and object storage. They then explain how to take models from IBM’s Model Asset Exchange, train them using FfDL, and deploy them on Kubernetes for serving and inference.
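
Inside a job, the training code itself stays framework-native. As a rough illustration, the sketch below shows how a learner script might locate its data and results directories, assuming FfDL's convention (per its repository examples) of mounting the object-storage buckets named in the manifest and exposing them through the DATA_DIR and RESULT_DIR environment variables; treat those names as assumptions to verify.

    # Sketch of the learner side: FfDL mounts the object-storage buckets
    # from the manifest and (per the FfDL examples) points the DATA_DIR
    # and RESULT_DIR environment variables at them.
    import os
    from pathlib import Path

    data_dir = Path(os.environ["DATA_DIR"])      # mount of the training-data bucket
    result_dir = Path(os.environ["RESULT_DIR"])  # writable mount for checkpoints and logs

    print("Training files available to this learner:")
    for path in sorted(data_dir.iterdir()):
        print(" ", path.name)

    # Train with TensorFlow, PyTorch, Caffe, etc. as usual, then persist
    # the model under RESULT_DIR; FfDL copies that bucket's contents back
    # to object storage when the job completes.
    (result_dir / "model.bin").write_bytes(b"\x00")  # placeholder for a real checkpoint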
