PyData NYC | Presentation: The Art and Science of Data Matching

Wednesday 1:55 p.m.–2:35 p.m.

The Art and Science of Data Matching

Mike Mull

Audience level: Intermediate

Description

Data matching is the process of finding records in one or more data sources that refer to the same item. Variants of this process include de-duplication (one data source), record linkage (two data sources), and entity resolution (two or more data sources). This talk will discuss Python tools and libraries that can be applied to data matching, as well as various tricks of the trade.
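As a rough illustration of the de-duplication variant, here is a minimal sketch using only the Python standard library. It is not taken from the talk: the helper names (normalize, similarity, find_duplicates), the sample records, and the 0.85 threshold are all assumptions chosen for the example, and a real matcher would likely use domain-specific comparison and blocking rather than comparing every pair.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase, drop periods, turn commas into spaces, collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a similarity ratio in [0, 1] for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs whose similarity meets the (assumed) threshold.

    Compares every pair, so this is O(n^2) and only suitable for small inputs.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(a, b) >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical single data source containing likely duplicates.
records = ["Jon Smith", "John Smith", "Mary Jones", "jon smith."]

if __name__ == "__main__":
    for i, j in find_duplicates(records):
        print(records[i], "<->", records[j])
```

Record linkage follows the same pattern, except that candidate pairs are drawn from two different sources instead of from within one.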

Abstract

Data matching enriches existing data sources, leading to new data products or clean input for further analysis. Correct matching is also a crucial aspect of information quality for enterprise data. Although there are many commercial tools for data matching, the Python ecosystem has components that make it relatively simple to build domain-specific matching applications or to incorporate matching into products and services.

Data matching uses basic computer science, NLP, statistics, and machine learning, combined with a variety of hacks to deal with notoriously messy data like human names and street addresses. This talk will work through a test case, covering the following specific areas: