Thursday 10:50 AM–11:35 AM in Track 2 Room

Preparing messy data for supervised learning with vtreat

John Mount, Nina Zumel

Audience level:


Cleaning messy data is a necessary component of data science projects. The vtreat package automates common data preparation steps for supervised machine learning. In this talk, we will introduce vtreat and demonstrate its effective use with Pandas and xgboost on real-world data.


Data characterization, treatment, and cleaning are necessary (though not always glamorous) components of machine learning and data science projects. While there is no substitute for getting your hands dirty in the data, many data issues recur from project to project. In particular, there are pitfalls in properly handling missing data values, categorical values not seen during training, and high-cardinality categorical variables.
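To make these three pitfalls concrete, here is a minimal hand-rolled sketch in pandas. It is not vtreat's implementation or API; the toy data, the column names, and the helper functions `fit_plan` and `apply_plan` are all hypothetical and exist only for illustration. The idea is to learn the treatment parameters from the training data alone, then apply the same plan to new data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; all column names are illustrative only.
train = pd.DataFrame({
    "x_num": [1.0, np.nan, 3.0, 5.0],
    "x_cat": ["a", "b", "a", None],
    "y": [1, 0, 1, 0],
})
test = pd.DataFrame({
    "x_num": [np.nan, 2.0],
    "x_cat": ["c", "a"],  # level "c" never appeared in training
})


def fit_plan(d):
    """Learn treatment parameters from the training data only."""
    return {
        # Replacement value for missing numeric entries.
        "num_fill": d["x_num"].mean(),
        # Categorical levels observed during training.
        "levels": sorted(d["x_cat"].dropna().unique()),
        # Naive impact code: per-level mean outcome minus the overall mean.
        # Fitting this on the same rows used for modeling can overfit;
        # it is kept naive here only to show the idea.
        "impact": d.groupby("x_cat")["y"].mean() - d["y"].mean(),
    }


def apply_plan(d, plan):
    """Apply a fitted plan to any data frame with the same input columns."""
    out = pd.DataFrame(index=d.index)
    # Missing values: record where they occurred, then fill them.
    out["x_num_is_bad"] = d["x_num"].isna().astype(int)
    out["x_num"] = d["x_num"].fillna(plan["num_fill"])
    # Known levels become indicators; unseen or missing levels
    # simply encode as all zeros instead of raising an error.
    for lev in plan["levels"]:
        out["x_cat_lev_" + lev] = (d["x_cat"] == lev).astype(int)
    # High-cardinality route: a single numeric column per variable.
    # Levels without a fitted impact value get a neutral 0.0.
    out["x_cat_impact"] = d["x_cat"].map(plan["impact"]).fillna(0.0)
    return out


treated = apply_plan(test, fit_plan(train))
print(treated)
```

Every output column is numeric and free of missing values, which is the form most modeling libraries expect. The impact-coding step in particular needs more care in practice than this sketch shows, since the encoded column has already seen the outcome.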

In this talk, we will discuss using the vtreat package to prepare data for supervised machine learning. We will demonstrate vtreat on a real-world data set, with xgboost and Pandas. The vtreat package automates the statistically sound treatment of common data problems, leaving the data scientist free to concentrate on problem-specific data and modeling issues.
