Friday 2:50 PM–3:30 PM in Room 2

Balancing scale and interpretability in analytical applications with sklearn and ensembling methods

Lanhui Wang

Audience level:
Novice

Description

We present a machine learning framework that uses ensemble learning to combine models developed by multiple analysts for distinct subsets of a large feature space. We apply the framework to retail sales data, but its design can accommodate other types of target variables. This is a novel application of ensemble learning because it addresses both analytical and organizational challenges.

Abstract

Models of large-scale, high-dimensional business data are often challenging to develop because they must be simultaneously accurate (they must account for the data well) and intelligible (they must yield output that is easily interpretable by the people who use the models to inform business decisions). The challenge is compounded by the fact that multiple analysts are often responsible for contributing to these analyses. We demonstrate a way of addressing this problem in which we (1) divide a large set of features into smaller subsets with easily interpretable labels, (2) fit a separate model to each feature subset to obtain predictors for the ensembling model, and then (3) combine these separate predictors via ensemble learning. We also assess each predictor's importance in the ensembled model. We implement this solution in Python, leaning heavily on sklearn and taking advantage of Python's flexibility and ease of use in both software development and data analysis.
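To make the three steps concrete, the sketch below shows one way such a framework could be assembled with scikit-learn: each interpretable feature subset feeds its own base model through a column-selecting pipeline, the base predictors are combined with a stacking ensemble, and the meta-model's coefficients give a rough measure of each predictor's importance. The feature groups, synthetic data, and estimator choices here are illustrative assumptions, not the case-study variables or the authors' exact implementation.

# Hedged sketch of the three-step framework; feature groups, synthetic data,
# and estimator choices are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(500, 6)),
    columns=["price", "promo_depth",         # pricing subset
             "temp", "precip",               # weather subset
             "store_size", "foot_traffic"],  # store subset
)
y = 2 * X["price"] + X["promo_depth"] + 0.5 * X["temp"] + rng.normal(scale=0.1, size=500)

# (1) Divide the feature space into smaller, interpretably labeled subsets.
feature_groups = {
    "pricing": ["price", "promo_depth"],
    "weather": ["temp", "precip"],
    "store": ["store_size", "foot_traffic"],
}

# (2) Fit one base model per subset; each pipeline drops all other columns.
base_models = [
    (name, Pipeline([
        ("select", ColumnTransformer([("cols", "passthrough", cols)], remainder="drop")),
        ("model", GradientBoostingRegressor(random_state=0)),
    ]))
    for name, cols in feature_groups.items()
]

# (3) Combine the base predictors via a stacking (ensembling) meta-model.
stack = StackingRegressor(estimators=base_models, final_estimator=RidgeCV())
stack.fit(X, y)

# Assess each base predictor's importance via the meta-model's coefficients.
for (name, _), coef in zip(base_models, stack.final_estimator_.coef_):
    print(f"{name:8s} weight in ensemble: {coef:.3f}")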

We describe the modeling framework in detail and present a case study in which we model the weekly sales volume of household goods across many locations of a national retail chain. The resulting models leverage the predictive information of many variables without sacrificing interpretability, and they are easily extensible and modifiable by many contributors. The framework is implemented in Python and showcases Python's broad applicability in data-intensive industrial applications. Base models are developed using the Python analytics stack (primarily pandas, NumPy, and scikit-learn), and Python's native object-oriented functionality provides users with an easy way to flexibly and quickly combine and ensemble base models for new data sets.
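As a rough illustration of that object-oriented flexibility, the sketch below shows one possible interface in which each contributor registers a base model for a named feature subset and the framework assembles the pieces into a single stacked model. The class name, method names, and example columns are hypothetical and not the authors' actual API.

# Hedged, hypothetical sketch of an object-oriented interface for combining
# contributors' base models; class and method names are assumptions.
from dataclasses import dataclass, field

from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline


@dataclass
class EnsembleBuilder:
    """Collects per-analyst base models and assembles a stacked ensemble."""
    contributions: list = field(default_factory=list)

    def register(self, name, columns, estimator):
        """Add one analyst's model, restricted to a named feature subset."""
        self.contributions.append((name, columns, estimator))
        return self  # allow chained registration

    def build(self):
        """Wrap each contribution in a column-selecting pipeline and stack them."""
        estimators = [
            (name, Pipeline([
                ("select", ColumnTransformer([("cols", "passthrough", cols)])),
                ("model", clone(est)),
            ]))
            for name, cols, est in self.contributions
        ]
        return StackingRegressor(estimators=estimators, final_estimator=RidgeCV())


# Hypothetical usage on a new data set (column names assumed):
#   builder = EnsembleBuilder()
#   builder.register("pricing", ["price", "promo_depth"], GradientBoostingRegressor())
#   builder.register("weather", ["temp", "precip"], RandomForestRegressor())
#   model = builder.build().fit(X, y)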