Machine learning pipelines are software pipelines after all. Their complexity and design viscosity lead to spectacular, costly and even deadly ML failures. This talk describes the most important Clean Code and Clean Architecture design principles, applied to machine learning applications. It aims to help the audience reduce machine learning technical debt, and to design robust ML architectures.
As a community, our work in machine learning inherently depends on external tools and frameworks. However, we have no control over the development and maintenance of these external dependencies. The primary problem is that the more a machine learning pipeline becomes intertwined with a specific ML framework, the harder and more expensive it is to change. This leads ML teams to accumulate technical debt, with serious symptoms like entanglement, hidden feedback loops, undeclared consumers, and pipeline jungles.
However, from a business perspective, TensorFlow, PyTorch, and scikit-learn are details. MySQL, EMR, and Hive are details. Airflow, Kubeflow, and Dask are also details. There must be a way to decouple our ML applications from these frameworks and tools. This talk aims to cover the most important Clean Code design principles that can help evolve our ML engineering craftsmanship.
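As a rough illustration of what such decoupling can look like, here is a minimal sketch in plain Python. All names in it (`Regressor`, `MeanRegressor`, `train_and_evaluate`) are hypothetical and invented for this example; they are not taken from any framework. The pipeline code depends only on a small interface, so a framework-backed model could be swapped in behind the same interface without touching the pipeline.

```python
from typing import List, Protocol, Sequence


class Regressor(Protocol):
    """Hypothetical interface: the only thing the pipeline knows about a model."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> None: ...

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]: ...


class MeanRegressor:
    """Framework-free implementation that always predicts the training mean.

    A wrapper around a real framework model would expose the same two methods.
    """

    def __init__(self) -> None:
        self._mean = 0.0

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> None:
        self._mean = sum(y) / len(y)

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]:
        return [self._mean for _ in X]


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def train_and_evaluate(
    model: Regressor,
    X_train: Sequence[Sequence[float]],
    y_train: Sequence[float],
    X_test: Sequence[Sequence[float]],
    y_test: Sequence[float],
) -> float:
    """Pipeline step that depends on the interface, not on a framework."""
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


# Toy data, made up for the example.
score = train_and_evaluate(
    MeanRegressor(),
    X_train=[[0.0], [1.0], [2.0]],
    y_train=[1.0, 2.0, 3.0],
    X_test=[[3.0]],
    y_test=[4.0],
)
print(score)
```

In this sketch the framework becomes a detail behind `Regressor`: replacing `MeanRegressor` with an adapter around a real library changes one class, not the pipeline.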
We will cover the following goals of a clean machine learning architecture:
To achieve those goals we will dive into the clean code design principles, and explain how they relate to common ML tasks and components:
It is well accepted that a good architecture maximizes the number of decisions not made. Creating good architecture requires extensive experience in the target domain. However, as of 2019, 40% of data scientists in the USA have fewer than five years of experience, which makes good architecture harder to achieve. At the same time, we are experiencing a boom in ML development and usage, similar to previous software engineering expansions in the 2000s. The current expansion manifests itself in a menagerie of constructs, frameworks, and workflows. This creates a multitude of integration challenges that remind us of good old software engineering problems. Some challenges of ML engineering are indeed new. However, the majority of its software engineering concerns have been seen before. Going back to the basics of good software engineering can help with today’s ML engineering problems.
This talk will help the audience apply the principles of clean machine learning code, and escape the vicious cycle of ML technical debt.