Testing is an essential practice in software engineering, yet it remains overlooked by many Machine Learning practitioners. This talk describes a workflow that practitioners can incrementally adopt, using open-source tools, to deploy models with confidence.
This talk is directed at practitioners who deploy models and are looking for a practical guide to improving their workflow.
Attendees will gain a high-level understanding of how to test Machine Learning code that they can apply to their own projects.
The workflow consists of five maturity levels that practitioners can adopt as they progress.
[0 - 2 minutes] Introduction
The section introduces concepts such as training pipeline and inference pipeline.
[2 - 6 minutes] Level 1: Smoke testing
Smoke testing is implemented at the beginning of the project. It is the most basic form of testing: we do not check the actual outputs; we only ensure that our code runs.
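A minimal sketch of what such a smoke test might look like. The pipeline below is a hypothetical stand-in (the function names are not from the talk): it "trains" by computing per-feature means after dropping incomplete rows.

```python
def run_training_pipeline(rows):
    """Hypothetical training pipeline: clean the data, then fit a trivial model."""
    clean = [r for r in rows if None not in r.values()]
    # Stand-in for model fitting: a per-feature mean "model".
    return {k: sum(r[k] for r in clean) / len(clean) for k in clean[0]}

def test_smoke():
    # A smoke test only asserts that the pipeline runs end to end on a
    # tiny sample; it does not check that the output is correct.
    sample = [
        {"age": 25, "income": 40_000},
        {"age": 32, "income": 55_000},
        {"age": None, "income": 61_000},
    ]
    run_training_pipeline(sample)  # passes as long as no exception is raised

test_smoke()
```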
[6 - 10 minutes] Level 2: Integration and unit testing
We start testing the output of our code. First, we add integration tests that check data quality for every task in our pipeline (e.g., check that the data produced by a clean_data function meets specific criteria). Second, we abstract parts of our code into functions that we then unit test using sample inputs.
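As an illustration, a unit test for a clean_data step might look like the sketch below; the cleaning rules here (drop incomplete rows, lowercase a city field) are hypothetical examples, not the talk's actual pipeline.

```python
def clean_data(rows):
    """Hypothetical cleaning step: drop incomplete rows, normalize 'city'."""
    return [
        {**r, "city": r["city"].lower()}
        for r in rows
        if all(v is not None for v in r.values())
    ]

def test_clean_data():
    raw = [
        {"city": "Lisbon", "income": 40_000},
        {"city": "Porto", "income": None},  # should be dropped
    ]
    # Unit test with a hand-written sample input and a known expected output.
    assert clean_data(raw) == [{"city": "lisbon", "income": 40_000}]

test_clean_data()
```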
[10 - 14 minutes] Level 3: Testing variable distributions and inference pipeline
In the previous level, our integration tests only verified basic data properties. Now we go further: first, we write tests that check whether the distribution of certain variables has changed; second, we test our inference pipeline to ensure it runs, or throws an informative error if the input is incorrect.
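One lightweight way to sketch both ideas (an illustration under my own assumptions, not the talk's prescribed method): compare a variable's summary statistics against reference values recorded at training time, and validate inference inputs so errors are informative. All function names here are hypothetical.

```python
import statistics

def check_distribution(values, ref_mean, ref_stdev, tolerance=0.2):
    """Fail if mean or stdev drift more than `tolerance` (relative) from the reference."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    assert abs(mean - ref_mean) <= tolerance * abs(ref_mean), f"mean drifted: {mean:.3f}"
    assert abs(stdev - ref_stdev) <= tolerance * ref_stdev, f"stdev drifted: {stdev:.3f}"

def predict(features):
    """Hypothetical inference entry point with input validation."""
    required = {"age", "income"}
    missing = required - features.keys()
    if missing:
        raise ValueError(f"missing input fields: {sorted(missing)}")
    return features["income"] / features["age"]  # stand-in for a real model

# Distribution check against reference statistics recorded at training time.
check_distribution([29.0, 31.5, 30.2, 28.8, 30.5], ref_mean=30.0, ref_stdev=1.0)

# The inference pipeline must throw an informative error on incomplete input.
try:
    predict({"age": 30})
except ValueError as error:
    assert "income" in str(error)
```

In practice, a statistical test (e.g., a two-sample Kolmogorov-Smirnov test) would be a more robust drift check than raw summary statistics.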
[14 - 18 minutes] Level 4: Inference pipeline integration testing
We now add more complete tests that check the outputs of our inference pipeline. The objective is to guard against training-serving skew, a problem that arises when the pre-processing at training time does not match the pre-processing at serving time.
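A skew test of this kind might be sketched as follows: both pre-processing code paths must produce identical features for the same raw record. The two functions and their feature names are hypothetical placeholders for a real training pipeline and serving code path.

```python
def preprocess_training(record):
    """Feature engineering as implemented in the training pipeline (hypothetical)."""
    return {"age_scaled": record["age"] / 100, "income_bucket": record["income"] // 10_000}

def preprocess_serving(record):
    """Feature engineering as implemented in the serving path (hypothetical)."""
    return {"age_scaled": record["age"] / 100, "income_bucket": record["income"] // 10_000}

def test_no_training_serving_skew():
    # The same raw input must yield byte-for-byte identical features.
    raw = {"age": 42, "income": 57_000}
    assert preprocess_training(raw) == preprocess_serving(raw)

test_no_training_serving_skew()
```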
[18 - 22 minutes] Level 5: Testing deployment artifact and model quality
This last maturity level ensures that our testing suite prevents the deployment of low-quality models. We also check that our deployment artifact works and can make predictions.
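A hedged sketch of both checks, using a trivial threshold classifier as a stand-in for a real model and pickle as an assumed serialization format: gate deployment on a held-out accuracy bar, then verify the serialized artifact can be loaded and still predicts.

```python
import io
import pickle

class ThresholdModel:
    """Trivial stand-in classifier: predicts 1 when the input crosses a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, xs):
        return [1 if x >= self.threshold else 0 for x in xs]

def test_model_quality_and_artifact():
    model = ThresholdModel(threshold=0.5)
    # Quality gate: block deployment if held-out accuracy is below the bar.
    X, y = [0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1]
    preds = model.predict(X)
    accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
    assert accuracy >= 0.9, f"model below quality bar: {accuracy:.2f}"
    # Artifact check: the serialized model must load and still predict.
    restored = pickle.load(io.BytesIO(pickle.dumps(model)))
    assert restored.predict([0.7]) == [1]

test_model_quality_and_artifact()
```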
[22 - 25 minutes] Summary and Conclusions
Summary of each testing level and conclusions.