Presentation: Serving BERT Models in Production with TorchServe

Friday October 29 6:30 PM – Friday October 29 8:00 PM in Workshop/Tutorial II

Adway Dhillon, Nidhin Pattaniyil

Prior knowledge: Previous knowledge expected
Basic knowledge of Docker, some deep learning, and Python

Summary

This talk is for data scientists and ML engineers looking to serve their PyTorch models in production.
It covers the post-training steps that should be taken to optimize a model, such as quantization and conversion to TorchScript.
It also walks the user through packaging and serving the model with Facebook’s TorchServe.

Description

Intro (10 mins)
- Introduce the BERT deep learning model.
- Walk through the notebook setup on Google Colab.
- Show the end model served along with sample inference.

Review Some Deep Learning Concepts (10 mins)
- Review sample trained PyTorch model code
- Review sample model transformer architecture
- Tokenization / pre- and post-processing
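
To make the tokenization and pre-processing step concrete, here is a minimal sketch of WordPiece-style tokenization (greedy longest-match-first), written in plain Python. The vocabulary, the `encode` helper, and `max_len` are hypothetical stand-ins chosen for illustration; a real BERT tokenizer ships with a vocabulary of roughly 30,000 entries and handles casing, punctuation, and Unicode that this sketch ignores.

```python
# Toy vocabulary (hypothetical); index in the list doubles as the token id.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
         "serving", "bert", "model", "##s", "torch", "##serve"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def wordpiece(word):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, max_len=8):
    """Pre-processing: tokenize, add special tokens, truncate, and pad."""
    body = []
    for word in text.lower().split():
        body.extend(wordpiece(word))
    tokens = ["[CLS]"] + body[: max_len - 2] + ["[SEP]"]
    input_ids = [TOKEN_TO_ID[t] for t in tokens]
    padding = max_len - len(input_ids)
    return {
        "tokens": tokens,
        "input_ids": input_ids + [TOKEN_TO_ID["[PAD]"]] * padding,
        "attention_mask": [1] * len(input_ids) + [0] * padding,
    }


encoded = encode("Serving BERT models")
print(encoded["tokens"])
```

The attention mask marks which positions hold real tokens, so that padding added to reach a fixed length can be ignored by the model.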

Optimizing the model (30 mins)
- Two modes of PyTorch: eager vs script mode
- Benefits of script mode and PyTorch JIT
- Post-training optimization methods: static and dynamic quantization, distillation
- Hands-on:
  - Quantizing the model
  - Converting the BERT model to TorchScript
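
As background for the quantization items, the sketch below shows the arithmetic behind 8-bit quantization in plain Python: floats are mapped to small integers through a scale factor, then mapped back with a bounded rounding error. It assumes symmetric, per-tensor quantization and uses made-up weights; it is an illustration of the idea, not PyTorch's implementation. Roughly, dynamic quantization converts weights ahead of time and computes activation scales at run time, while static quantization also calibrates activation ranges ahead of time on sample data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, standing in for one row of a linear layer.
weights = [0.62, -1.27, 0.05, 0.0, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored in one byte instead of four, which is where the reduction in model size comes from; the cost is the rounding error measured by `max_error`. In the hands-on part, this step would presumably be done with PyTorch's own utilities, such as `torch.quantization.quantize_dynamic`, rather than by hand.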

Deploying the model (30 mins)
- Overview of deployment options: a pure Flask app vs model servers like TorchServe / TF Serving
- Benefits of TorchServe: high-performance serving, multi-model serving, model versioning for A/B testing, server-side batching, support for pre- and post-processing
- Exploring the built-in model handlers and how to write your own
- Managing the model through the Management API
- Exploring the built-in and custom metrics provided by TorchServe
- Hands-on:
  - Package the given model using Torch Model Archiver
  - Write a custom handler to support pre-processing and post-processing
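
The custom handler exercise can be sketched as follows. This is a dependency-free outline of the shape a handler takes, assuming a JSON request with a hypothetical `text` field and a two-label classifier; `SketchBertHandler`, `fake_model`, and `LABELS` are illustrative names, and the stub model stands in for a real TorchScript BERT model. An actual handler would typically subclass `ts.torch_handler.base_handler.BaseHandler` and load the model and tokenizer in `initialize`.

```python
import json


def fake_model(texts):
    """Stub standing in for a real model: returns one score row per text."""
    return [[0.0, 1.0] if "good" in text else [1.0, 0.0] for text in texts]


class SketchBertHandler:
    """Mirrors the initialize / preprocess / inference / postprocess flow."""

    LABELS = ["negative", "positive"]  # hypothetical label set

    def initialize(self, context):
        # A real handler would read the model directory from `context`
        # and load the serialized model and tokenizer here.
        self.model = fake_model
        self.initialized = True

    def preprocess(self, requests):
        # With server-side batching, `requests` holds one entry per request.
        texts = []
        for request in requests:
            payload = request.get("data") or request.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            if isinstance(payload, str):
                payload = json.loads(payload)
            texts.append(payload["text"])
        return texts

    def inference(self, texts):
        return self.model(texts)

    def postprocess(self, scores):
        # Return exactly one response per request, in the same order.
        return [
            {"label": self.LABELS[row.index(max(row))], "scores": row}
            for row in scores
        ]

    def handle(self, requests, context):
        return self.postprocess(self.inference(self.preprocess(requests)))


handler = SketchBertHandler()
handler.initialize(context=None)
responses = handler.handle(
    [{"body": b'{"text": "good movie"}'}, {"data": {"text": "bad plot"}}],
    context=None,
)
print(responses)
```

Keeping pre-processing and post-processing inside the handler means clients send raw text and receive labels, and the same code path works whether the server passes one request or a batch.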

Lessons Learned (10 mins)
- Share some performance benchmarks of the model served at Walmart Search
- Next steps

Q&A (5 mins)