Stress Test Center: moving the stress from the user to the model

Simona Maggio

Prior knowledge: Machine Learning (previous knowledge expected)


Before moving an experimental ML model to production, we need to validate its resilience to known data changes that may occur at deployment time. For this, performance metrics on the test set are not enough: we also need robustness measures. This talk shows users concerned with robust ML how to design stress tests and which robustness metrics are insightful for model selection and understanding.


Even when an ML model performs well on a test set, things can still go wrong at deployment time, when the incoming data differs substantially from a carefully collected training set. Before shipping the model to production, we should ask ourselves: “Is this model robust?” Even without formalizing their requirements, domain experts have expectations about a model’s robustness. For instance, in NLP, users would expect the model to be invariant to synonyms or typos; for models consuming the user’s age, a difference of one year should probably not change the prediction.

In this talk we show how to evaluate a model’s robustness, leveraging a stress test center. First we design pertinent stress tests via simulation, reflecting the desired properties of the model; then we compute three types of robustness metrics: (1) the model performance drop, (2) the stress failure rate, and (3) the prediction stability across stress tests. We show how to leverage these metrics to make sure that a model matches our expectations before deployment, but also at design time to compare multiple equally accurate models and select the most robust one for production.
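To make the procedure concrete, here is a minimal sketch of a stress test and the three metrics. All names here (`inject_typo`, `robustness_report`, the `predict` callable) are illustrative assumptions for this abstract, not the API of any specific tool: a perturbation function simulates a stress (a typo, in the NLP example), and the report compares predictions on clean versus stressed inputs.

```python
import random

def inject_typo(text, rng):
    """Simulate a typo by swapping two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def robustness_report(predict, X_test, y_test, perturb, seed=0):
    """Compute the three robustness metrics for one stress test.

    predict: callable mapping a list of inputs to a list of labels.
    perturb: callable (sample, rng) -> stressed sample.
    """
    rng = random.Random(seed)
    X_stress = [perturb(x, rng) for x in X_test]
    y_clean = predict(X_test)
    y_stress = predict(X_stress)
    n = len(y_test)

    acc = lambda y_hat: sum(p == t for p, t in zip(y_hat, y_test)) / n
    return {
        # 1. performance drop: clean accuracy minus stressed accuracy
        "performance_drop": acc(y_clean) - acc(y_stress),
        # 2. stress failure rate: fraction of wrong predictions under stress
        "stress_failure_rate": sum(p != t for p, t in zip(y_stress, y_test)) / n,
        # 3. prediction stability: fraction of predictions unchanged by stress
        "prediction_stability": sum(a == b for a, b in zip(y_clean, y_stress)) / n,
    }
```

In this sketch a perfectly robust model has zero performance drop and stability 1.0; comparing these numbers across several equally accurate candidate models is what lets us pick the most robust one for production.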