Wednesday 1:00 PM–1:45 PM in Track 2 Room

Introducing Autoimpute: a Python Package for Grappling with Missing Data

Joseph Kearney, Shahid Barkat

Audience level:
Intermediate

Description

Real-world data is messy and often incomplete, yet most statistical models require it to be clean and complete. Analysts are often well versed in modeling, but few are familiar with handling missingness. This talk teaches data professionals best practices for dealing with missingness and introduces Autoimpute, our Python package that helps users grapple with missing data during statistical analysis.

Abstract

Most real-world datasets contain missing data, but many statistical models expect input datasets to be complete. This disconnect requires analysts to figure out what to do about missing data before they can proceed with statistical analysis.

Unfortunately, most aspiring data professionals spend the bulk of their time studying statistical models themselves, not techniques to handle missing data. As a result, individuals opt for simple methods such as listwise deletion or mean imputation and underestimate the impact these methods have on parameter inference of statistical models.
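
To make the cost of these simple methods concrete, the following is a minimal sketch using only the Python standard library; it does not use Autoimpute. It assumes a simulated variable with roughly 30% of values removed completely at random (the sample size, distribution, and missingness rate are illustrative choices, not figures from the talk). It compares the sample variance after listwise deletion with the variance after mean imputation.

```python
import random
import statistics

random.seed(0)

# Simulated complete variable (illustrative values, not real data).
complete = [random.gauss(50, 10) for _ in range(1000)]

# Remove roughly 30% of values completely at random; None marks missing.
observed = [x if random.random() > 0.3 else None for x in complete]

# Listwise deletion: keep only the observed values.
kept = [x for x in observed if x is not None]

# Mean imputation: replace each missing value with the observed mean.
fill = statistics.fmean(kept)
mean_imputed = [fill if x is None else x for x in observed]

var_deleted = statistics.variance(kept)
var_imputed = statistics.variance(mean_imputed)

print(f"missing values:           {len(observed) - len(kept)}")
print(f"variance after deletion:  {var_deleted:.2f}")
print(f"variance after mean fill: {var_imputed:.2f}")
```

Every filled value sits exactly at the observed mean, so mean imputation adds observations without adding spread. The sample variance of the imputed variable is therefore smaller than that of the observed values alone, which is one way uncertainty can be understated in later inference.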

This problem inspired us to create Autoimpute, a Python package that offers a framework for properly handling missing data during end-to-end analysis. In this talk, we focus on handling missing data during regression analysis, and we demonstrate how to use Autoimpute to tackle the problem methodically.

To start, we provide context to understand different types of missing data, and we define terminology used in the remainder of the talk. We then walk through examples that contain different types of missingness. Each example uses a four-step methodology we developed to perform statistical analysis with missing data. We start by assessing the extent of the missing data problem using descriptive and visual measures. We end by measuring the impact of imputation on the bias and variance of parameters derived from regression models built on imputed data.
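
As a rough illustration of the descriptive part of the first step, the sketch below computes the share of missing values per column and the frequency of each missingness pattern. It uses only the Python standard library rather than Autoimpute's own API, and the records and column names are hypothetical.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 0.7},
    {"age": None, "income": 61000, "score": 0.4},
    {"age": 29, "income": None, "score": None},
    {"age": 41, "income": None, "score": 0.9},
    {"age": 38, "income": 47000, "score": 0.5},
]
columns = ["age", "income", "score"]

# Share of missing values per column.
missing_share = {
    col: sum(row[col] is None for row in rows) / len(rows) for col in columns
}

# Frequency of each missingness pattern (1 = missing, 0 = observed).
patterns = Counter(
    tuple(int(row[col] is None) for col in columns) for row in rows
)

# Share of rows that listwise deletion would keep.
complete_share = patterns[(0, 0, 0)] / len(rows)

print(missing_share)
print(dict(patterns))
print(complete_share)
```

Comparing the per-column shares with the share of fully observed rows shows how much data listwise deletion would discard even when each individual column looks mostly complete.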

By the end of the talk, each listener should leave equipped with a methodological approach to handling missing data during statistical analysis. Additionally, the audience should feel comfortable using the Autoimpute package as a tool in their Python data analysis ecosystem.
