The road from prototype Machine Learning models in a notebook to a full-fledged pipeline running 24/7 in a production environment is long and winding. I present tough lessons we had to learn while developing and deploying ML in the old-fashioned industry of cement production. These range from how much less time is spent on models than on infrastructure to the unexpected ways customers interact with ML.
In December 2018 my co-founder and I launched alcemy, a Machine Learning startup for the cement and concrete supply chain. I experienced first-hand the move from a simple proof of concept, an ML model inside a Jupyter notebook, to a full-fledged pipeline running 24/7 and steering massive amounts of cement production in real plants. I can tell you the road was long and winding. I want to share with you some of the hard lessons we learned along the way. If you are an aspiring ML or Software Engineer, Data Scientist, or Entrepreneur, or if you are just wondering what Machine Learning applied in the wild looks like, this talk is for you. No prior knowledge is required except some familiarity with the basic concepts and terminology of Machine Learning.
Cement alone is responsible for about 8% of worldwide CO2 emissions. Fortunately, we quickly learned that low-carbon alternatives to "conventional" cement and concrete already exist. For instance, 60% of carbon emissions can be avoided if burnt limestone, the main ingredient of cement, is partly replaced by limestone powder (which isn't burnt and therefore doesn't release carbon into the atmosphere). Yet, these low-carbon cement recipes have a substantial shortcoming: they react much more sensitively to changes, e.g. in weather conditions or in the chemical and mineralogical composition of their ingredients. As a consequence, low-carbon cements and the resulting concrete (made by mixing cement with sand and water) can only be reliably produced under laboratory conditions.
We are changing this. We use data intelligence and predictive Machine Learning control to optimize production processes such that low-carbon cement and concrete can be manufactured in real plants and at scale. I will quickly introduce our solution, which is already deployed in 5 cement plants. Moreover, we are currently prototyping our way into concrete production as well. Of course, we do this (mostly) in Python.
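To give a rough idea of what "predictive control" means here, below is a minimal, purely illustrative sketch of a predict-then-recommend step; the model, feature names, target value, and gain are all invented for this example and are much simpler than what actually runs in a plant.

```python
# Purely illustrative sketch of a predict-then-recommend control step.
# The model, feature names, target value and gain are invented here;
# a real controller has to respect plant constraints and operator review.
import pandas as pd

TARGET_STRENGTH = 42.5   # desired 28-day strength in MPa (illustrative)
TOLERANCE = 1.0          # acceptable deviation before recommending action
GAIN = 0.5               # how aggressively to react (illustrative)

def control_step(strength_model, live_sample: pd.DataFrame) -> float:
    """Predict quality from a live mill sample and return a recommended
    setpoint adjustment (0.0 means: leave everything as it is)."""
    predicted = strength_model.predict(live_sample)[0]
    deviation = TARGET_STRENGTH - predicted
    if abs(deviation) <= TOLERANCE:
        return 0.0
    return GAIN * deviation
```

The point is not this toy controller itself but the pattern: a model forecasts a quality property long before the lab can measure it, and that forecast drives small, continuous adjustments.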
Machine Learning in production is vastly different from solving a Kaggle challenge. In fact, the particular choice of Machine Learning model is much less important than you might think. I will cover the benefits of using rather simple models such as random forests or even linear regression in comparison to deep learning. If stuff goes wrong, and it will, interpretable and debuggable models are far superior to complex architectures. Proper model evaluation that reflects production requirements and good baselines for comparison are also crucial first steps and pay off in the long run. It was surprising how much less time we spent on the core Machine Learning algorithms compared to infrastructure, such as deployments on AWS Fargate or k8s, re-training processes, proper database layout, or home-brewed tooling to make configuring dozens of ML models easier.
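To give a flavour of what "production-like evaluation with a proper baseline" can look like, here is a minimal sketch; the data file, feature names, and target are placeholders, but the pattern of a time-ordered split plus a naive baseline is the part that matters.

```python
# Minimal sketch: compare simple models against a naive baseline using a
# time-ordered split, which mimics production better than random K-fold.
# The CSV file, feature and target column names are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("lab_measurements.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp")
features = ["blaine", "clinker_ratio", "mill_temperature"]  # illustrative
target = "strength_28d"

split = int(len(df) * 0.8)               # train on the past, test on the future
train, test = df.iloc[:split], df.iloc[split:]

# Naive baseline: carry the last observed target value forward.
baseline_pred = test[target].shift(1).fillna(train[target].iloc[-1])

results = {"baseline": mean_absolute_error(test[target], baseline_pred)}
for name, model in [("linear", LinearRegression()),
                    ("forest", RandomForestRegressor(random_state=0))]:
    model.fit(train[features], train[target])
    results[name] = mean_absolute_error(test[target],
                                        model.predict(test[features]))

print(results)  # only keep a model if it clearly beats the naive baseline
```

A model that cannot clearly beat the naive baseline under this split is not worth deploying, no matter how well it scores in random cross-validation.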
We quickly learned that data is way more important than models. Some might have heard the phrase "garbage in, garbage out", coined by programmers in the 1950s. It is even more relevant given today's widespread use of Machine Learning. We run ML not on our own data, but on data provided by our customers. While the level of data maintenance and quality that our customers are used to is sufficient for in-house bookkeeping and short analyses, it does not necessarily suffice for ML. I will discuss why and how we spend a good amount of time cleaning and really drilling into the data provided by our customers.
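As an illustration of what "really drilling into the data" starts with, here is a small sketch of plausibility checks; the column names and ranges are invented for this example.

```python
# Sketch of basic plausibility checks on customer lab data before it ever
# reaches a model. Column names and ranges are invented for illustration.
import pandas as pd

PLAUSIBLE_RANGES = {
    "blaine": (2000.0, 6000.0),     # cement fineness in cm^2/g (illustrative)
    "strength_28d": (10.0, 80.0),   # 28-day strength in MPa (illustrative)
}

def clean_lab_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate samples and values outside plausible physical ranges."""
    df = df.drop_duplicates(subset=["sample_id"])
    for column, (low, high) in PLAUSIBLE_RANGES.items():
        implausible = df[column].notna() & ~df[column].between(low, high)
        if implausible.any():
            # In practice we log these and follow up with the plant's lab
            # rather than silently dropping rows.
            print(f"{implausible.sum()} implausible values in {column!r}")
        df = df[~implausible]
    return df
```

The interesting part is rarely the code itself, but the conversations with the plant's lab that each flagged value triggers.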
Moreover, differences between training and real-time inference data can be a real challenge. For example, it is not guaranteed that the location where samples are drawn from cement mills, i.e. the live data used for inference, is as representative of the actual cement as the silo samples that can be used for training. Fine particles might not be captured simply due to the physical properties of the sampling site. To tackle problems like these, as a Machine Learning engineer you have to become an expert in the domain your models are applied in. You really need to understand the data in every detail, know how it is generated by your customers, and understand the context and consequences of all of your customers' processes.
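One way to make such mismatches visible, sketched below with invented feature names, is to compare feature distributions between the training source (silo samples) and the live source (mill samples), e.g. with a two-sample Kolmogorov-Smirnov test.

```python
# Sketch: flag features whose distribution differs noticeably between the
# silo samples used for training and the live mill samples used for
# inference. Feature names and the significance level are illustrative.
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(train_df: pd.DataFrame, live_df: pd.DataFrame,
                 features: list[str], alpha: float = 0.01) -> pd.DataFrame:
    """Run a two-sample Kolmogorov-Smirnov test per feature."""
    rows = []
    for feature in features:
        stat, p_value = ks_2samp(train_df[feature].dropna(),
                                 live_df[feature].dropna())
        rows.append({"feature": feature, "ks_stat": stat,
                     "p_value": p_value, "drifted": p_value < alpha})
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
```

A flagged fineness-related feature is exactly the kind of finding you then have to discuss with the plant's process engineers rather than fix purely in code.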
Our customers are, of course, not Machine Learning experts. Why should they be? If they were, they wouldn't need us anyway. However, we as Machine Learning engineers often forget the ramifications of this. I will talk about customer relations and how customers interact with our Machine Learning models. For example, we had to deal with a rather skeptical customer who did not believe our model's predictions and went against pretty much all of its recommendations. Although it is nice if the model's predictions turn out to be right in the end, your customer does not necessarily feel the same way. On the contrary, the customer does not enjoy being wrong and may even feel mocked by a machine. Having a strong customer success team that knows both how ML works and, of course, how the customer operates and thinks is often more valuable than "rockstar" Machine Learning engineers.
Lastly, a tough lesson to learn was that Machine Learning as a service should not be mistaken for a software-as-a-service business model. Our marginal costs are not zero. Besides the great deal of consulting every customer needs, onboarding a new customer is time-consuming and requires a lot of work. Integrating into the existing infrastructure of cement plants (which are not top-notch IT companies) can be tough or downright frustrating at times. Therefore, scaling a Machine Learning startup can be hard, and we learned that it is better to go hunting for elephants, i.e. a few high-paying customers, than for mice, i.e. many low-paying ones.