Thursday October 28 10:30 PM – Thursday October 28 11:00 PM in Talks I

Submodular optimization for minimizing redundancy in massive data sets

Jacob Schreiber

Prior knowledge:
No previous knowledge expected


As data sets grow in size they generally grow in usefulness, but they also frequently grow in redundancy. Submodular optimization is a theoretically-grounded set of approaches for minimizing redundancy while preserving the information within data sets. In this talk, I will describe submodular optimization at a high level and introduce Apricot, a Python package that implements these methods.


The amount of data produced is dramatically increasing every year. While this increase in data has led to an explosion of sophisticated machine learning models, a notable downside is one needs a corresponding increase in compute power to handle the data. Unfortunately, because the availability of data is outpacing the availability of hardware, many data scientists are left with the difficult task of handling huge data sets with only modest resources.

Submodular optimization is a solution to this massive data problem. This approach is commonly considered to be the discrete analog of convex optimization and formalizes.the selection of representative subsets from massive data sets explicitly as an optimization problem. Empirically, these subsets have numerous advantages over the original data sets, including being easier to compute on while still maintaining the diversity of the original data set.

Apricot is a Python package that implements submodular optimization. Internally, the package is organized in a similar manner to deep learning libraries, with submodular function objects that are analogous to layers/model objects and optimizers optimizer objects that implement various strategies for optimizing submodular functions. The API follows that of a scikit-learn transformer, with pairs of submodular functions and optimizers being wrapped in a selector object that can be dropped into existing scikit-learn pipelines if desired.

This tutorial will focus on the practical usage of submodular optimization for summarizing large data sets with several examples to highlight key points. It will likely be most useful for those that have had experience training machine learning models, but no theoretical knowledge in either machine learning or submodular optimization will be needed.