Friday 13:10–13:40 in Track 3

Masking personal data in medical documents

Kornel Lewandowski

Audience level:
Intermediate

Description

Medical and pharmaceutical companies produce large volumes of documents. Analyzing them could provide meaningful insights. Unfortunately, these texts usually contain personal data, which prevents passing them to any external analytics tools. I will present our scalable solution, which achieves excellent results and handles a few European languages.

Abstract

Introduction

Medical and pharmaceutical companies produce large volumes of documents. Analyzing them could provide meaningful insights. Unfortunately, these texts usually contain personal data, which prevents passing them to any external analytics tools. I will present our scalable solution to this problem, which achieves excellent results and handles a few European languages.

Solution overview

Masking should be backed by high-quality algorithms that maximize the number of true positives and minimize the error rate. Our solution combines several techniques from natural language processing, text mining and advanced data analysis to find as many pieces of personal data as possible, together with custom filters and taxonomy-based methods that reduce the number of false positives. Our results are promising: the processed documents can be passed to all Roche units, and even to external companies, in compliance with most data privacy rules.
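
The abstract does not spell out the detection or filtering logic, so the following is only a minimal sketch of the general idea: a high-recall detection step followed by filters that drop false positives. The capitalized-word regex is a crude stand-in for the NLP models, and `MEDICAL_TAXONOMY`, `COMMON_WORDS` and `mask_personal_data` are hypothetical names invented for this illustration.

```python
import re

# Stand-in detector (assumption): every capitalized word is a name candidate.
# A real system would use trained NLP models instead of this regex.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")

# Hypothetical taxonomy of medical terms that look like person names.
MEDICAL_TAXONOMY = {"parkinson", "alzheimer", "crohn", "hodgkin"}

# Hypothetical custom filter: frequent words that are capitalized only
# because they start a sentence.
COMMON_WORDS = {"the", "a", "patient", "doctor", "diagnosis"}


def find_candidates(text):
    """High-recall step: return (start, end) spans of possible names."""
    return [(m.start(), m.end()) for m in CAPITALIZED.finditer(text)]


def is_false_positive(word):
    """Filtering step: reject taxonomy terms and common words."""
    lowered = word.lower()
    return lowered in MEDICAL_TAXONOMY or lowered in COMMON_WORDS


def mask_personal_data(text, placeholder="[NAME]"):
    """Replace every surviving candidate span with a placeholder."""
    spans = [
        (start, end)
        for start, end in find_candidates(text)
        if not is_false_positive(text[start:end])
    ]
    # Replace from right to left so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


print(mask_personal_data("Patient Anna Nowak was diagnosed with Parkinson disease."))
```

In this sketch the taxonomy keeps the disease name "Parkinson" readable while "Anna" and "Nowak" are masked. Widening the detector raises recall, and the filters are what keep the resulting false positives under control.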

Technical details

Our solution is highly scalable because we use Hadoop Streaming for parallel and distributed processing. Most of the code is written in Python and uses popular natural language processing tools, models and libraries such as NLTK and CoreNLP.
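
Hadoop Streaming runs any executable that reads records from standard input and writes results to standard output, which is why a Python masking step can be distributed this way. The sketch below shows the shape such a mapper might take. It is not the code from the talk: the e-mail regex stands in for the NLTK and CoreNLP based detection, and `mask_line`, `mask_stream` and `run_mapper` are hypothetical names.

```python
import re
import sys

# Stand-in for the real detection step (assumption): mask e-mail addresses only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_line(line):
    """Mask personal data in a single input record (one line of text)."""
    return EMAIL.sub("[EMAIL]", line)


def mask_stream(lines):
    """Lazily mask an iterable of lines, stripping trailing newlines."""
    for line in lines:
        yield mask_line(line.rstrip("\n"))


def run_mapper(stdin, stdout):
    """Streaming contract: read records from stdin, write results to stdout."""
    for masked in mask_stream(stdin):
        stdout.write(masked + "\n")


# When saved as a mapper script for a streaming job, the file would end with:
#     run_mapper(sys.stdin, sys.stdout)
```

Because each line is processed independently, Hadoop can split the input across many mapper processes without coordination. A job of this kind would typically be submitted through the Hadoop Streaming jar with its `-input`, `-output` and `-mapper` options, with no reducer needed for a pure masking pass. The exact jar location and paths depend on the installation.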
