Friday October 29 12:00 PM – Friday October 29 12:30 PM in Talks I

DedupliPy: a new deduplication package

Frits Hermans

Prior knowledge:
No previous knowledge expected

Summary

Deduplication, or entity resolution, is the task of identifying which records belong to the same real-world entity. DedupliPy is a new Python package that enables the user to quickly train a deduplication model with a minimum of data labelling. In this talk I explain how the algorithm works, and in a live demo I show how you can start using DedupliPy today.

Description

Which of the following names belong to the same company: ‘AirBnB’, ‘Airbus’, ‘AirBnb UK Ltd’ and ‘Airbus Ltd’? For humans this is an easy question, but coding it brings a lot of challenges. First, comparing every name with every other name makes the number of comparisons grow quadratically with the number of records. Second, when do two names refer to the same company? Note how ‘AirBnB’ and ‘Airbus’ differ by only two letters, while ‘Airbus’ and ‘Airbus Ltd’ differ by three. So the former pair is more similar but not a match, whereas the latter pair is a match. This shows that string similarity alone does not tell you whether a pair is a match.
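To make both problems concrete, here is a minimal sketch in plain Python. It is an illustration only and not taken from DedupliPy: the helper names `levenshtein` and `n_pairs` are hypothetical, and lower-cased edit distance is just one possible similarity measure. Edit distance also counts the inserted space, so it scores ‘Airbus’ versus ‘Airbus Ltd’ as 4 rather than 3.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def n_pairs(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


names = ["AirBnB", "Airbus", "AirBnb UK Ltd", "Airbus Ltd"]

# Compare every name with every other name, ignoring case.
distances = {
    (a, b): levenshtein(a.lower(), b.lower())
    for a, b in combinations(names, 2)
}

for (a, b), distance in distances.items():
    print(f"{a!r} vs {b!r}: {distance}")

print(n_pairs(len(names)))   # pairs for the four names above
print(n_pairs(1_000_000))    # pairs for a million records
```

The non-matching pair (‘AirBnB’, ‘Airbus’) gets a smaller distance than the matching pair (‘Airbus’, ‘Airbus Ltd’), so no single distance threshold separates matches from non-matches here. The pair count for a million records is roughly half a trillion, which is why comparing everything with everything does not scale.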

The new Python package DedupliPy (www.deduplipy.com) implements a smart deduplication algorithm. The user only needs to label a small number of record pairs during an active learning session. DedupliPy handles the quadratic scaling problem efficiently. The package works out-of-the-box and allows advanced users to configure the algorithm to their needs.
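A common way to tame the quadratic scaling problem in entity resolution is blocking: only records that share a cheap key are compared with each other. The sketch below shows the general idea under stated assumptions. It is not DedupliPy's implementation; the function names `block_key` and `candidate_pairs`, the prefix-based key and the two extra example names are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name: str, prefix_len: int = 3) -> str:
    """Hypothetical blocking rule: the lower-cased first few characters."""
    return name.lower()[:prefix_len]


def candidate_pairs(names, prefix_len: int = 3):
    """Return only the pairs of names that fall in the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name, prefix_len)].append(name)

    pairs = []
    for members in blocks.values():
        # Names in different blocks are never compared.
        pairs.extend(combinations(members, 2))
    return pairs


names = ["AirBnB", "Airbus", "AirBnb UK Ltd", "Airbus Ltd", "Boeing", "Boeing Co"]

all_pairs = list(combinations(names, 2))
blocked_pairs = candidate_pairs(names)

print(len(all_pairs), len(blocked_pairs))
```

With this key the six names fall into two blocks, so 7 candidate pairs remain instead of 15. The trade-off is that a rule that is too strict can separate true duplicates, which is why the choice of blocking rule matters in practice.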

This talk is for any data scientist who works with less-than-perfect text records, such as those in databases containing names or transactions. At the end of the talk you will be able to deduplicate your data with only a few lines of code using DedupliPy. Moreover, you will understand the challenges of deduplication and how the package deals with them.