Friday October 29 6:30 PM – Friday October 29 8:00 PM in Workshop/Tutorial II

Serving BERT Models in Production with TorchServe

Adway Dhillon, Nidhin Pattaniyil

Prior knowledge:
Previous knowledge expected
Basic knowledge of Docker and Python, and some deep learning


  • This talk is for a data scientist or ML engineer looking to serve their PyTorch models in production.

  • It will cover post-training steps for optimizing the model, such as quantization and conversion to TorchScript.

  • It will also walk the user through packaging and serving the model with Facebook’s TorchServe.


Intro (10 mins).
- Introduce the deep learning BERT model.
- Walk through the notebook setup on Google Colab.
- Show the final served model along with a sample inference.

Review Some Deep Learning Concepts (10 mins)
- Review sample trained PyTorch model code.
- Review sample model transformer architecture.
- Tokenization / pre- and post-processing.

Optimizing the model (30 mins)
- Two modes of PyTorch: eager vs script mode.
- Benefits of script mode and PyTorch JIT.
- Post-training optimization methods: static and dynamic quantization, distillation.
- Hands on:
  - Quantizing the model.
  - Converting the BERT model with TorchScript.
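
To give a feel for what post-training quantization does to a model's weights, here is a minimal sketch in plain Python. It uses a simplified symmetric int8 scheme chosen for illustration; it is not the scheme PyTorch applies internally, and the function names `quantize_symmetric_int8` and `dequantize` are hypothetical helpers, not library APIs.

```python
def quantize_symmetric_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale.

    Simplified illustration only (assumption): real frameworks also handle
    zero points, per-channel scales, and activation quantization.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-0.8, -0.1, 0.0, 0.3, 1.2]  # made-up example values
    q, scale = quantize_symmetric_int8(weights)
    restored = dequantize(q, scale)
    for w, r in zip(weights, restored):
        print(f"original={w:+.4f} restored={r:+.4f} error={abs(w - r):.5f}")
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale. That trade between model size, speed, and accuracy is what the hands-on portion explores on the actual BERT model.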

Deploying the model (30 mins)
- Overview of deployment options: pure Flask app vs model servers like TorchServe / TF Serving.
- Benefits of TorchServe: high-performance serving, multi-model serving, model versioning for A/B testing, server-side batching, support for pre- and post-processing.
- Exploring the built-in model handlers and how to write your own.
- Managing the model through the management API.
- Exploring built-in and custom metrics provided by TorchServe.
- Hands on:
  - Package the given model using Torch Model Archiver.
  - Write a custom handler to support pre-processing and post-processing.
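
The shape of a custom handler can be sketched without installing TorchServe. The class below is a hypothetical stand-in: it borrows the method names used by TorchServe handlers (`initialize`, `preprocess`, `inference`, `postprocess`, `handle`) but does not import or subclass anything from the library. The toy vocabulary, labels, and word-count "model" are assumptions made so the example runs on its own.

```python
import math


class SketchTextHandler:
    """Standalone illustration of the pre-process / infer / post-process flow."""

    def initialize(self, context=None):
        # A real handler would load the model and tokenizer here.
        # Toy vocabulary and labels are assumptions for this sketch.
        self.vocab = {"[UNK]": 0, "great": 1, "terrible": 2, "movie": 3}
        self.labels = ["negative", "positive"]
        self.initialized = True

    def preprocess(self, requests):
        # Each request is assumed to be a dict carrying text under "data" or "body".
        batch = []
        for request in requests:
            text = request.get("data") or request.get("body") or ""
            if isinstance(text, (bytes, bytearray)):
                text = text.decode("utf-8")
            # Whitespace split stands in for a real BERT tokenizer.
            batch.append([self.vocab.get(tok, 0) for tok in text.lower().split()])
        return batch

    def inference(self, batch):
        # Stand-in for the model forward pass: count sentiment words.
        logits = []
        for token_ids in batch:
            positive = float(sum(1 for i in token_ids if i == 1))
            negative = float(sum(1 for i in token_ids if i == 2))
            logits.append([negative, positive])
        return logits

    def postprocess(self, logits):
        # Softmax over each row, then return one result per request.
        results = []
        for row in logits:
            peak = max(row)
            exps = [math.exp(x - peak) for x in row]
            total = sum(exps)
            probs = [e / total for e in exps]
            best = max(range(len(probs)), key=probs.__getitem__)
            results.append({"label": self.labels[best], "score": probs[best]})
        return results

    def handle(self, requests, context=None):
        return self.postprocess(self.inference(self.preprocess(requests)))


if __name__ == "__main__":
    handler = SketchTextHandler()
    handler.initialize()
    print(handler.handle([{"body": b"Great movie"}, {"data": "terrible movie"}]))
```

The handler returns one result per incoming request, which matters once server-side batching groups several requests into a single call.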

Lessons Learned (10 mins)
- Share some performance benchmarks of the model served at Walmart Search.
- Next steps.

Q&A (5 mins)