PyData Carolinas 2016 | Presentation: Scalable Patient Records De-duplication using machine learning

Thursday 1:10 PM–1:50 PM in Room 1

Scalable Patient Records De-duplication using machine learning

Jaafar Ben-Abdallah

Audience level: Intermediate

Description

Simple matching produces numerous errors when identifying duplicate patient records, for reasons such as noisy and incomplete data. To improve the identification of duplicates, we built an incremental model on top of an existing machine-learning-based Python package. We made the model updatable and scalable to accommodate an ever-growing patient record file.

Abstract

Objective:

To improve the identification of duplicate records in a continuously growing patient records database.

Problem:

Proper identification of duplicated patient information remains an arduous problem for hospitals, pharmacies and service providers. Simple matching of these records fails to identify existing duplicates correctly, for reasons such as noisy and incomplete records.
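To illustrate why simple matching breaks down on noisy and incomplete records, here is a minimal sketch using only the Python standard library. The two records, the field names and the averaging rule are all invented for illustration; they are not taken from the talk or from any real data set.

```python
from difflib import SequenceMatcher

# Two hypothetical records for the same patient: the second has a
# misspelled first name and a missing date of birth.
record_a = {"name": "Katherine Johnson", "dob": "1975-03-02", "zip": "27601"}
record_b = {"name": "Kathrine Johnson", "dob": "", "zip": "27601"}


def exact_match(a, b):
    """Simple matching: every field must be identical."""
    return all(a[k] == b[k] for k in a)


def fuzzy_score(a, b):
    """Mean string similarity over the fields that are filled in on both sides."""
    shared = [k for k in a if a[k] and b[k]]
    if not shared:
        return 0.0
    ratios = [SequenceMatcher(None, a[k], b[k]).ratio() for k in shared]
    return sum(ratios) / len(ratios)


is_exact = exact_match(record_a, record_b)
score = fuzzy_score(record_a, record_b)
print(is_exact, round(score, 3))
```

Exact comparison rejects the pair because of a single dropped letter and an empty field, while a similarity score still places the two records close together. A learned model plays the role of `fuzzy_score` in a real system, weighting fields rather than averaging them.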

Methodology:

In this work, we build on top of a Python package that uses active learning to create a labeled data set, trains a linear regression model on that sample, and then derives predicate rules to speed up the pairwise comparison of records before matching them. This existing methodology identifies duplicates with a high level of performance but does not scale effectively. To provide scalability, our application creates an initial model on a sample of the data and then updates the model as more batches of records are added to the database, all the while de-duplicating those records by merging them with existing entries. Our contributions consist of:

1. Applying a cluster-updating model that adds each new batch of records to an existing cluster when applicable, or creates a new cluster otherwise. Each unique patient is represented by a cluster.

2. Producing a scalable solution capable of processing millions of records and updating the cluster representatives and record assignments, all under the constraints of patient information protection rules.
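The batch-wise cluster update described above can be sketched roughly as follows. This is not the authors' implementation: `block_key` is a hypothetical predicate rule, `similarity` is a stand-in for the trained pairwise model, and `THRESHOLD`, `ClusterStore` and the sample records are assumptions made for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher

THRESHOLD = 0.85  # hypothetical cutoff; a trained model would supply the score


def block_key(record):
    """Hypothetical predicate rule: first 3 letters of the surname plus zip."""
    surname = record["name"].split()[-1].lower()
    return (surname[:3], record["zip"])


def similarity(a, b):
    """Stand-in for the learned pairwise model: mean string similarity."""
    fields = ("name", "dob")
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)


class ClusterStore:
    """Keeps one cluster per unique patient, indexed by blocking key."""

    def __init__(self):
        self.clusters = []              # each: {"rep": record, "members": [...]}
        self.index = defaultdict(list)  # blocking key -> list of cluster ids

    def add_batch(self, batch):
        for record in batch:
            key = block_key(record)
            # Compare only against representatives that share the blocking key.
            best_id, best_score = None, 0.0
            for cid in self.index[key]:
                score = similarity(record, self.clusters[cid]["rep"])
                if score > best_score:
                    best_id, best_score = cid, score
            if best_id is not None and best_score >= THRESHOLD:
                # Duplicate of a known patient: merge into the existing cluster.
                self.clusters[best_id]["members"].append(record)
            else:
                # Unseen patient: open a new cluster with this record as its representative.
                self.clusters.append({"rep": record, "members": [record]})
                self.index[key].append(len(self.clusters) - 1)


store = ClusterStore()
store.add_batch([
    {"name": "Katherine Johnson", "dob": "1975-03-02", "zip": "27601"},
    {"name": "Luis Ortega", "dob": "1988-11-20", "zip": "27514"},
])
store.add_batch([
    {"name": "Kathrine Johnson", "dob": "1975-03-02", "zip": "27601"},
    {"name": "Maria Chen", "dob": "1990-06-15", "zip": "27701"},
])
print(len(store.clusters), len(store.clusters[0]["members"]))
```

Because each incoming record is compared only with cluster representatives in its own block, the cost of a batch depends on block sizes rather than on the full database. The sketch keeps the first record as the representative and never retrains the scoring function; the approach in the talk also updates the representatives and the model as batches arrive.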