Saturday 11:15–12:00 in Hall 1

Holy D@t*! How to Deal with Imperfect, Unclean Datasets

Katharine Jarmul

Audience level:


Ever wondered what sort of sick person created the datasets you work with? Sadly, we can't answer that question directly, but we can aim to handle messy data problems. From non-significant or null data, to unclean and unclear string data, to difficult formats like PDFs, we'll take a closer look at how best to work with imperfect data and which questions you can answer given your datasets.



This talk will cover how to work with unclean and imperfect datasets. We'll go through several common issues and suggestions for dealing with them, along with some code examples for managing messy data.


  • The Noble Quest against Messy Data
  • Working with null data
  • Insignificant data
  • Messy strings: regex, fuzzy matching
  • XML / HTML Data
  • PDF Data
  • Where to go from here
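
As a taste of the null-data and messy-strings topics in the outline, here is a minimal sketch using only the Python standard library. It is not taken from the talk itself: the names `NULL_MARKERS`, `clean_string`, and `fuzzy_match`, the list of placeholder values, and the 0.8 similarity threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumed placeholders that stand in for "no value"; adjust for your data.
NULL_MARKERS = {"", "n/a", "na", "null", "none", "-"}


def clean_string(value):
    """Collapse whitespace and lowercase; return None for null-like values."""
    if value is None:
        return None
    # Regex step: squeeze any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", str(value)).strip().lower()
    return None if text in NULL_MARKERS else text


def fuzzy_match(value, candidates, threshold=0.8):
    """Return the candidate most similar to value, or None below threshold."""
    cleaned = clean_string(value)
    if cleaned is None:
        return None
    best, best_score = None, 0.0
    for candidate in candidates:
        # ratio() is a similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, cleaned, candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

With a list of known city names, for example, `fuzzy_match("  Nwe York ", ["New York", "Newark", "Boston"])` should resolve the typo to `"New York"`, while a placeholder such as `"n/a"` yields `None` instead of a spurious match. The threshold trades false matches against missed ones, so it is worth tuning on a sample of your own data.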