This talk is for data scientists and ML engineers looking to serve their PyTorch models in production.
It will cover post-training steps that can be taken to optimize the model, such as quantization and TorchScript.
It will also walk the audience through packaging and serving the model with Facebook’s TorchServe.
Intro (10 mins)
- Introduce the BERT deep learning model.
- Walk through the notebook setup on Google Colab.
- Show the final model being served, along with a sample inference.
Review Some Deep Learning Concepts (10 mins)
- Review sample trained PyTorch model code
- Review the sample model's transformer architecture
- Tokenization / pre- and post-processing (see the sketch after this list)
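To make the tokenization step concrete, here is a minimal pre-processing sketch. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, which are illustrative choices rather than the exact model used in the talk:

```python
# Minimal tokenization sketch (assumes Hugging Face transformers and bert-base-uncased).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Pre-processing: raw text -> token ids + attention mask, padded/truncated to a fixed length.
encoded = tokenizer(
    "Serving PyTorch models in production",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)       # torch.Size([1, 128])
print(encoded["attention_mask"].shape)  # torch.Size([1, 128])

# Post-processing typically does the reverse: map model logits back to human-readable labels.
```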
Optimizing the model (30 mins)
- Two modes of PyTorch: eager vs script mode
- Benefits of script mode and PyTorch JIT
- Post-training optimization methods: static and dynamic quantization, distillation
- Hands-on (see the sketch after this list):
  - Quantizing the model
  - Converting the BERT model to TorchScript
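A rough sketch of what the hands-on portion covers, assuming a Hugging Face BertForSequenceClassification checkpoint; torch.quantization.quantize_dynamic and torch.jit.trace are the standard tools, but the exact recipe shown in the talk may differ:

```python
# Sketch: dynamic quantization + TorchScript export of a BERT classifier.
# Assumes Hugging Face transformers; the talk's actual model/config may differ.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# torchscript=True makes the model return tuples so it can be traced.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

# Dynamic quantization: nn.Linear weights are stored as int8 and dequantized
# on the fly, shrinking the model and speeding up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# TorchScript via tracing: record the ops run on an example input so the model
# can execute outside Python (e.g. inside TorchServe workers).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
example = tokenizer("sample input", return_tensors="pt")
traced = torch.jit.trace(
    quantized_model, (example["input_ids"], example["attention_mask"])
)
torch.jit.save(traced, "bert_quantized_traced.pt")
```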
Deploying the model (30 mins)
- Overview of deployment options: a plain Flask app vs. model servers like TorchServe / TF-Serving
- Benefits of TorchServe: high-performance serving, multi-model serving, model versioning for A/B testing, server-side batching, support for pre- and post-processing
- Exploring the built-in model handlers and how to write your own
- Managing the model through the management API
- Exploring the built-in and custom metrics provided by TorchServe
- Hands-on (see the sketch after this list):
  - Package the given model using Torch Model Archiver
  - Write a custom handler to support pre-processing and post-processing
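A minimal sketch of the hands-on deployment steps. The handler follows TorchServe's BaseHandler pattern, but the module name (bert_handler.py), tokenizer choice, and pre/post-processing details are illustrative assumptions rather than the exact code shown in the talk:

```python
# bert_handler.py -- illustrative custom TorchServe handler (names are assumptions).
# Packaged with the model archiver, e.g.:
#   torch-model-archiver --model-name bert_classifier --version 1.0 \
#       --serialized-file bert_quantized_traced.pt --handler bert_handler.py
import torch
from ts.torch_handler.base_handler import BaseHandler
from transformers import BertTokenizer


class BertClassifierHandler(BaseHandler):
    def initialize(self, context):
        # BaseHandler loads the serialized (TorchScript) model from the archive.
        super().initialize(context)
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def preprocess(self, data):
        # Pre-processing: raw request bodies -> input_ids / attention_mask tensors.
        texts = [row.get("data") or row.get("body") for row in data]
        texts = [t.decode("utf-8") if isinstance(t, (bytes, bytearray)) else t for t in texts]
        return self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    def inference(self, inputs):
        with torch.no_grad():
            outputs = self.model(inputs["input_ids"], inputs["attention_mask"])
        return outputs[0]  # logits

    def postprocess(self, logits):
        # Post-processing: one predicted class index per request in the batch.
        return logits.argmax(dim=-1).tolist()
```

Once archived, the resulting .mar file can be placed in the model store and registered at runtime through the management API (port 8081 by default), which is also where worker scaling and model versioning for A/B tests are controlled.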
Lessons Learned (10 mins)
- Share some performance benchmarks of the model served at Walmart Search
- Next steps
Q&A (5 mins)