Friday, October 29, 5:30 PM – 6:00 PM in Talks I

Towards Cloud-Native Distributed Machine Learning Pipelines at Scale (pre-recorded)

Yuan Tang

Prior knowledge:
Previous knowledge expected
machine learning, Docker, Python

Summary

This talk presents best practices for, and challenges of, building large, efficient, scalable, and reliable distributed machine learning pipelines using cloud-native technologies such as Argo Workflows and Kubeflow, as well as how these technologies fit into the Python ecosystem alongside cutting-edge distributed machine learning frameworks such as TensorFlow and PyTorch.

Description

Presentation slides: https://github.com/terrytangyuan/public-talks/blob/main/talks/towards-cloud-native-distributed-machine-learning-pipelines-at-scale-pydata-global-2021/presentation.pdf

In recent years, machine learning has made tremendous progress, yet large-scale machine learning remains challenging. Given the variety of machine learning frameworks such as TensorFlow and PyTorch, it is not easy to automate the process of training models on distributed Kubernetes clusters. Machine learning researchers and algorithm engineers with little or no DevOps experience cannot easily launch, manage, monitor, and optimize distributed machine learning pipelines.
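To illustrate how a training operator can hide some of this DevOps complexity, here is a minimal sketch of a Kubeflow PyTorchJob manifest; the container image and script path are hypothetical placeholders, and a real job would also specify resources, volumes, and credentials:

```yaml
# Minimal sketch of a Kubeflow PyTorchJob manifest.
# The image name and script path are hypothetical placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.registry/train:latest  # hypothetical image
              command: ["python", "/workspace/train.py"]
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.registry/train:latest  # hypothetical image
              command: ["python", "/workspace/train.py"]
```

The operator handles pod creation, rendezvous between the master and workers, and restart on failure, so researchers submit a single manifest instead of orchestrating the cluster by hand.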

This talk presents best practices for building large, efficient, scalable, and reliable distributed machine learning pipelines using cloud-native technologies such as Argo Workflows and Kubeflow, as well as cutting-edge distributed machine learning frameworks such as TensorFlow and PyTorch. We will also discuss some of the challenges we encountered when managing distributed machine learning pipelines in our production Kubernetes clusters with tens of thousands of nodes.
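As a sketch of the kind of pipeline the talk covers, the following Argo Workflow chains preprocessing, training, and evaluation steps as a DAG; the image name and commands are hypothetical placeholders standing in for real pipeline stages:

```yaml
# Minimal sketch of an Argo Workflow expressing an ML pipeline as a DAG.
# Image name and commands are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: preprocess
            template: step
            arguments:
              parameters: [{name: cmd, value: "python preprocess.py"}]
          - name: train
            dependencies: [preprocess]
            template: step
            arguments:
              parameters: [{name: cmd, value: "python train.py"}]
          - name: evaluate
            dependencies: [train]
            template: step
            arguments:
              parameters: [{name: cmd, value: "python evaluate.py"}]
    - name: step
      inputs:
        parameters:
          - name: cmd
      container:
        image: example.registry/pipeline:latest  # hypothetical image
        command: [sh, -c]
        args: ["{{inputs.parameters.cmd}}"]
```

Expressing the pipeline as a DAG lets Argo schedule each stage as its own pod, retry failed stages independently, and run independent branches in parallel.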