What do you do if you have a lot of models to fit, don’t want to spend all day with your laptop as a space heater, and have access to AWS? Take it to the cloud! I’ll share my experience setting up a system to take models coded with scikit-learn and run them in a cloud computing environment. This talk will focus on training data that fit in memory and prediction data that may not.
I worked as part of a team to create software which moves data to and from scikit-learn models running in AWS’s EC2 service, and my talk will highlight some of the challenges we faced and the solutions we came up with. This project is possible because scikit-learn has a standardized API for all model types: every estimator, no matter which algorithm it implements, exposes the same methods with the same arguments.
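As a quick illustration of that uniform interface (the toy dataset below is purely for demonstration and not part of the project), the same fit/predict code works for any estimator:

    # Any scikit-learn estimator exposes the same fit/predict interface, so the
    # surrounding training and prediction code never has to change.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    for estimator in (LogisticRegression(), RandomForestClassifier()):
        estimator.fit(X, y)                  # same method, same arguments
        predictions = estimator.predict(X)   # same method, same arguments
        print(type(estimator).__name__, predictions[:5])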
Data start and end either as tables in AWS’s Redshift (a PostgreSQL-based data warehouse) or as CSVs stored in AWS’s S3 (an object store). The training data need to fit in memory, but we can make predictions on arbitrarily large Redshift tables in roughly constant time, given a large enough pool of EC2 instances. The software and execution environment are packaged into Docker containers for reproducibility and fast setup on new EC2 instances.
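To make the S3 boundary concrete, here is a minimal sketch, with hypothetical bucket, key, and column names, of pulling a training CSV into memory and pushing prediction output back out:

    # Hypothetical bucket/key/column names: the training CSV fits in memory,
    # and the prediction output goes back to S3 as another CSV.
    import io

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")

    # Pull the training CSV down into a DataFrame.
    obj = s3.get_object(Bucket="my-project-bucket", Key="inputs/training.csv")
    train = pd.read_csv(obj["Body"])

    # ... fit a scikit-learn model on `train` here ...

    # Push a CSV of scores back up (placeholder scores for illustration).
    scores = pd.DataFrame({"id": train["id"], "score": 0.5})
    buffer = io.StringIO()
    scores.to_csv(buffer, index=False)
    s3.put_object(Bucket="my-project-bucket", Key="outputs/scores.csv",
                  Body=buffer.getvalue())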
The challenges on the training side are in massaging input data to match the formats which scikit-learn models expect and in storing enough metadata to ensure that we can reproduce the arrays of features at prediction time. Predictions are distributed as chunks of data, each sent to its own EC2 instance. I’ll show off the custom backend for the joblib library that we use to manage the remote prediction processes.
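One common way to capture that feature-reproduction metadata, sketched here with made-up column names rather than anything from the project, is to fit the preprocessing and the estimator together so a single persisted artifact records how raw columns become a feature array:

    # Illustrative only: bundling preprocessing with the estimator means one
    # pickled object holds everything needed to rebuild identical feature
    # arrays from raw rows at prediction time.
    import pickle

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    train = pd.DataFrame({
        "state": ["IL", "WI", "IL", "MN"],           # made-up columns
        "income": [52000, 61000, 48000, 70000],
        "target": [0, 1, 0, 1],
    })

    features = ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["state"]),
        ("numeric", StandardScaler(), ["income"]),
    ])
    model = Pipeline([("features", features), ("estimator", LogisticRegression())])
    model.fit(train.drop(columns="target"), train["target"])

    # A prediction job can unpickle this and call predict() on raw rows.
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)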
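As for the prediction side, here is a rough sketch of how a custom backend plugs into joblib. The backend name and class are hypothetical, joblib’s backend classes live in a semi-private module whose details vary across versions, and the real system dispatches work to EC2 instances rather than falling back to local threads as this toy version does:

    # Hypothetical sketch: register a custom joblib backend whose apply_async
    # is the hook where each chunk of prediction work would be shipped to its
    # own remote worker. Here it simply defers to local threads.
    from joblib import Parallel, delayed, parallel_backend, register_parallel_backend
    from joblib._parallel_backends import ThreadingBackend


    class RemotePredictionBackend(ThreadingBackend):
        def apply_async(self, func, callback=None):
            # Hook point: serialize the batch, hand it to a remote EC2 worker,
            # and return a handle whose .get() retrieves the results.
            return super().apply_async(func, callback=callback)


    register_parallel_backend("remote-prediction", RemotePredictionBackend)

    # Any joblib.Parallel call inside this context uses the custom backend.
    with parallel_backend("remote-prediction", n_jobs=2):
        chunk_results = Parallel()(delayed(pow)(i, 2) for i in range(8))
    print(chunk_results)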
Our software runs in Civis Analytics’s data science platform. For the application described in this talk, the platform mediates interactions with AWS services to provide security and permissioning. The principles I’ll discuss apply broadly to anyone interested in building cloud-based production systems on top of scikit-learn.