Saturday 11:00–11:45 in LG6

High-Performance Distributed TensorFlow: Request Batching and Model Post-Processing Optimizations

Chris Fregly

Audience level:
Intermediate

Description

In this completely demo-based talk, Chris will demonstrate various techniques to post-process and optimize trained TensorFlow AI models to reduce deployment size and increase prediction performance.

Abstract

In this completely demo-based talk, Chris will demonstrate various techniques to post-process and optimize trained TensorFlow AI models to reduce deployment size and increase prediction performance.

First, using techniques such as 8-bit quantization, weight rounding, and batch-normalization folding, we'll simplify the forward-propagation and prediction path.

Next, we'll load-test and compare our optimized and unoptimized models, both with and without request batching enabled.
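Why batching helps can be shown with a toy server-side sketch (our own illustration, not TensorFlow Serving's batching API): instead of running one forward pass per incoming request, queue the requests and run a single vectorized pass over the whole batch.

```python
import numpy as np

# Toy model: one dense layer. predict_one handles a single request;
# predict_batched stacks a queue of requests into one matmul.
def predict_one(x, weights):
    return x @ weights                 # one kernel launch per request

def predict_batched(queue, weights):
    batch = np.stack(queue)            # (batch, features)
    return batch @ weights             # one kernel launch for all requests

weights = np.ones((3, 2), dtype=np.float32)
queue = [np.array([1.0, 2.0, 3.0], dtype=np.float32) for _ in range(4)]

single = np.stack([predict_one(x, weights) for x in queue])
batched = predict_batched(queue, weights)
print(np.allclose(single, batched))    # same answers, far fewer launches
```

The answers are identical; the win is throughput, since the per-call overhead (dispatch, kernel launch, network round trip) is paid once per batch instead of once per request. The load tests in the talk measure exactly this trade-off against latency.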

Last, we'll dive deep into Google's TensorFlow Graph Transform Tool to build custom model-optimization functions.
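What a custom graph-transform function conceptually does can be sketched with batch-norm folding as the example: fold a BatchNorm's scale and shift into the preceding layer's weights and bias, so inference runs one op instead of two. This is pure NumPy on raw arrays, an assumption-laden stand-in for the real tool, which rewrites GraphDef protos.

```python
import numpy as np

# Fold y = gamma * ((x @ w + b) - mean) / sqrt(var + eps) + beta
# into a single affine layer y = x @ w_folded + b_folded.
def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-3):
    std = np.sqrt(var + eps)
    w_folded = w * (gamma / std)                  # rescale each output channel
    b_folded = (b - mean) * (gamma / std) + beta  # absorb shift into the bias
    return w_folded, b_folded

rng = np.random.default_rng(1)
w, b = rng.normal(size=(3, 2)), rng.normal(size=2)
gamma, beta = rng.normal(size=2), rng.normal(size=2)
mean, var = rng.normal(size=2), rng.uniform(0.5, 1.5, size=2)

x = rng.normal(size=(5, 3))
y_unfolded = gamma * ((x @ w + b) - mean) / np.sqrt(var + 1e-3) + beta
w_f, b_f = fold_batch_norm(w, b, gamma, beta, mean, var)
y_folded = x @ w_f + b_f
print(np.allclose(y_unfolded, y_folded))  # identical outputs, one op fewer
```

The outputs match exactly, but the folded graph has fewer nodes on the forward path, which is the same size-and-latency win the Graph Transform Tool delivers on full models.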