Wednesday 1:55 p.m.–2:35 p.m.

The Art and Science of Data Matching

Mike Mull

Audience level:
Intermediate

Description

Data matching is the process of finding records in one or more data sources that refer to the same item. Variants of this process include de-duplication (one data source), record linkage (two data sources), and entity resolution (2+ data sources). This talk will discuss Python tools and libraries that can be applied to data matching, as well as various tricks of the trade.

Abstract

Data matching enriches existing data sources, leading to new data products or clean input for further analysis. Correct matching is also a crucial aspect of information quality for enterprise data. Although there are many commercial tools for data matching, the Python ecosystem has components that make it relatively simple to build domain-specific matching applications or to incorporate matching into products and services.

Data matching draws on basic computer science, NLP, statistics, and machine learning, combined with a variety of hacks for dealing with notoriously messy data such as human names and street addresses. This talk will work through a test case, covering the following specific areas:

  • Using pandas as a framework for pre-processing and merging data
  • Profiling data to assess how hard or successful the matching process might be
  • Similarity metrics for approximate string matching
  • Techniques for parsing and matching human names
  • Techniques for handling address data, including geo-coding
  • Using blocking or indexing to reduce the number of comparisons
  • Probabilistic methods for optimal matching, such as the Fellegi-Sunter method
  • Using scikit-learn classifiers for record linkage
  • A demonstration of the open-source dedupe tool
  • Information quality metrics
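
To make a few of the listed areas concrete, here is a minimal sketch of blocking, approximate string matching, and Fellegi-Sunter style match weights, using only the Python standard library. It is not taken from the talk: the toy records, the function names, the 0.85 name threshold, and the m/u probabilities are all assumptions chosen for illustration, and difflib's SequenceMatcher stands in for the specialized similarity metrics (such as Jaro-Winkler) that matching tools more commonly use.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records, invented for this sketch.
records = [
    {"id": 1, "name": "Jonathan Smith", "zip": "60601"},
    {"id": 2, "name": "Jonathon Smith", "zip": "60601"},
    {"id": 3, "name": "Maria Garcia", "zip": "60601"},
    {"id": 4, "name": "Jon Smith", "zip": "94105"},
]


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(rows, key):
    """Blocking: only compare records that share the same value for `key`."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row)
    for block in blocks.values():
        yield from combinations(block, 2)


# Assumed probabilities per field (not estimated from data):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
M_U = {"name": (0.95, 0.01), "zip": (0.90, 0.05)}


def match_weight(a, b, name_threshold=0.85):
    """Sum of log2 likelihood ratios over fields, Fellegi-Sunter style."""
    agrees = {
        "name": similarity(a["name"], b["name"]) >= name_threshold,
        "zip": a["zip"] == b["zip"],
    }
    weight = 0.0
    for field, (m, u) in M_U.items():
        if agrees[field]:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


# Score only the pairs that survive blocking on zip code, best first.
scored = sorted(
    ((match_weight(a, b), a["id"], b["id"]) for a, b in candidate_pairs(records, "zip")),
    reverse=True,
)

for weight, left_id, right_id in scored:
    print(f"{left_id} vs {right_id}: weight {weight:.2f}")
```

Blocking on zip code reduces the six possible pairs to three, at the cost of never comparing record 4 ("Jon Smith") with the others. That trade-off between fewer comparisons and missed matches is the usual reason to choose blocking keys carefully.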
