Debugging machine learning apps is hard. I feel that this topic is important, yet it is touched on relatively rarely compared to, e.g., the latest models or interesting new ML applications. In this talk, I will try to fill that gap by discussing best practices, recommendations, and my own experience on the subject.
Recently, OpenAI examined ten popular reimplementations of reinforcement learning algorithms and found that six (!) of them contained significant non-breaking bugs [1]. This illustrates that debugging and testing machine learning software is difficult and requires extra attention.
In particular, apps driven by machine learning (ML) exhibit a rich spectrum of failure types specific to the domain. ML systems can be diagnosed:
not to work at all (e.g., providing nearly random outcomes);
to work with (say) 70% accuracy, while users expect much better performance;
to work on the first batch of data, but to fail miserably in production when new data arrive;
to initially work very well in production, but to deteriorate significantly over time.
In this talk, I will touch on each of the above scenarios, describing possible pitfalls and recommended engineering practices, and providing real-life examples.
[1] https://blog.openai.com/openai-baselines-dqn/