Sunday 4:30 PM–5:15 PM in Room 2

Machine learning techniques for data cleaning

Cathy Deng

Audience level:


Often, the most interesting datasets (data about people and organizations) are the messiest and most difficult to analyze. When data comes from multiple sources, or when data is entered manually, variation and ambiguity are inevitable. Learn about ways to infer structure and relationships in messy data, using open-source Python libraries.


  1. How does messiness arise, and why is it challenging?
  2. Inferring structure in unstructured strings
  3. NLP parsers for names, organizations, addresses
  4. How to make your own probabilistic string parser
  5. Inferring relationships in datasets
  6. Clustering similar rows in a dataset
  7. Linking similar rows across datasets
  8. Clustering & linking without writing any code, using the interface
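
As a rough illustration of what "clustering similar rows in a dataset" can mean, here is a minimal sketch that uses only the Python standard library. It is not the approach or the libraries covered in the talk: the `similarity` and `cluster_rows` helpers, the 0.85 threshold, and the sample rows are all assumptions made up for this example, and a plain edit-based string ratio stands in for a learned similarity model.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring case and repeated whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def cluster_rows(rows, threshold=0.85):
    """Greedy single-pass clustering (hypothetical helper, for illustration).

    Each row joins the first cluster whose first member is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for row in rows:
        for cluster in clusters:
            if similarity(row, cluster[0]) >= threshold:
                cluster.append(row)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([row])
    return clusters


# Made-up sample rows: three spellings of one record and one unrelated record.
rows = [
    "Jane Smith, 123 Main St",
    "jane smith, 123 Main St.",
    "Robert Jones, 9 Oak Avenue",
    "Jane  Smith, 123 Main St",
]
clusters = cluster_rows(rows)
for cluster in clusters:
    print(cluster)
```

With these sample rows, the three "Jane Smith" variants end up in one cluster and the "Robert Jones" row in another. A fixed threshold on raw string similarity is brittle on real data, which is the gap that trained, probabilistic approaches aim to close.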