Monday 3:40 PM–4:25 PM in Winter Garden (5412)

Simplified Data Quality Monitoring of Dynamic Longitudinal Data: A Functional Programming Approach

Jacqueline Gutman

Audience level:
Intermediate

Description

Data scientists are often tasked with being the first to detect data quality issues that may have serious consequences for downstream consumers of their data. We will demonstrate how to simplify data quality monitoring through a functional programming approach built on the 5 key pillars of a user-friendly workflow: readability, compositionality, reproducibility, efficiency, and robustness.

Abstract

Ensuring the quality of data we deliver to customers or provide as inputs to models is often one of the most underappreciated and yet time-consuming responsibilities of a modern data scientist. This task is challenging enough with static data, but when we work with dynamic, longitudinal, continuously updating data, that added complexity can become an asset. At Flatiron Health, a cancer research company acquired for billions of dollars partly due to the complexity, breadth, and immediacy of its data, we harness the dynamic nature of our ever-changing datasets to define a robust and systematic process for early and actionable detection of data quality concerns.

Outline

Introduction: I am a data scientist and cancer researcher working with electronic medical records in oncology at the healthcare technology company Flatiron Health (3 min)

Problem: How do we identify issues in data that is longitudinal, demands recency, and is dynamic within and between people over time? (4 min)

As a cancer researcher, I know that undetected data quality issues have real downstream consequences that impact patients' lives: for example, data could be used as evidence proving that one cancer drug works better than another when in fact the opposite is true.

Solution: Build a continuous and efficient process to monitor data quality, simplify dimensions on which the data has recently changed, and create actionable flags for investigation (1 min)
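
As a rough sketch of what an actionable flag might look like (the metric names, values, and tolerance below are hypothetical illustrations, not Flatiron's actual pipeline), we can diff summary metrics between two consecutive snapshots and flag large shifts:

    import pandas as pd

    def flag_metric_shifts(prev: pd.Series, curr: pd.Series,
                           rel_tolerance: float = 0.10) -> pd.DataFrame:
        # Flag any quality metric whose relative change between two
        # consecutive snapshots exceeds the tolerance.
        change = (curr - prev) / prev.abs()
        return pd.DataFrame({
            "previous": prev,
            "current": curr,
            "relative_change": change,
            "needs_investigation": change.abs() > rel_tolerance,
        })

    # Hypothetical per-snapshot metric summaries
    prev = pd.Series({"pct_missing_diagnosis_date": 0.02, "row_count": 41000})
    curr = pd.Series({"pct_missing_diagnosis_date": 0.09, "row_count": 41350})
    print(flag_metric_shifts(prev, curr))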

Conceptual framework: We continually freeze and version our data, treating each distinct versioned snapshot in time as a self-contained input to readable, declarative functions that summarize key quality metrics of each dataset. We can borrow 5 key principles and paradigms from functional programming to reason about data quality in an enjoyable, scalable, and self-documenting process. (5 min)
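
A minimal sketch of that idea (the column and metric names are invented for illustration): each versioned snapshot is a DataFrame, and each quality metric is a pure function of that snapshot, so the same frozen snapshot always yields the same summary:

    from typing import Callable, Dict
    import pandas as pd

    # A quality metric is a pure, declarative function of one snapshot:
    # the same frozen snapshot in always yields the same summary out.
    MetricFn = Callable[[pd.DataFrame], float]

    def pct_missing(column: str) -> MetricFn:
        # Build a metric reporting the share of missing values in a column.
        return lambda snapshot: snapshot[column].isna().mean()

    def summarize_snapshot(snapshot: pd.DataFrame,
                           metrics: Dict[str, MetricFn]) -> pd.Series:
        # Apply every metric to one versioned snapshot of the data.
        return pd.Series({name: fn(snapshot) for name, fn in metrics.items()})

    # Hypothetical snapshot and metric suite
    metrics = {"row_count": len,
               "pct_missing_diagnosis_date": pct_missing("diagnosis_date")}
    snapshot = pd.DataFrame({"diagnosis_date": ["2018-01-02", None, "2018-03-09"]})
    print(summarize_snapshot(snapshot, metrics))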

5 Key Pillars of Enjoyable, User-Friendly Data Quality Monitoring (with code snippets): We provide code using functools and pandas to demonstrate how we reason about multiple evolving versions of our data to assess their quality
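
As a taste of those snippets (the file layout and names here are hypothetical, and this is a sketch under assumptions rather than the talk's actual code): functools.partial specializes a generic metric to a column, functools.lru_cache keeps repeated loads of frozen snapshots efficient, and functools.reduce folds per-version summaries into one longitudinal table:

    import functools
    import pandas as pd

    def pct_missing(column: str, snapshot: pd.DataFrame) -> float:
        return snapshot[column].isna().mean()

    # functools.partial turns the generic metric into a column-specific one.
    missing_dx = functools.partial(pct_missing, "diagnosis_date")

    @functools.lru_cache(maxsize=None)
    def load_snapshot(version: str) -> pd.DataFrame:
        # Snapshots are frozen, so caching reads by version is safe
        # and keeps repeated comparisons efficient. Path is hypothetical.
        return pd.read_parquet(f"snapshots/{version}.parquet")

    def metrics_by_version(versions):
        # functools.reduce folds one row of metrics per frozen snapshot
        # into a single table for inspecting drift across versions.
        rows = (pd.DataFrame({"version": [v],
                              "pct_missing_dx": [missing_dx(load_snapshot(v))]})
                for v in versions)
        return functools.reduce(
            lambda acc, row: pd.concat([acc, row], ignore_index=True), rows)

    # Usage, assuming snapshots/<version>.parquet files exist:
    # print(metrics_by_version(["2019-01", "2019-02", "2019-03"]))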

Wrap Up: We are all responsible for protecting the integrity of the data we work with. This framework allows us to strive towards readability, compositionality, reproducibility, efficiency, and robustness in monitoring data quality. (2 min)
