Monday 3:40 PM–4:25 PM in Winter Garden (5412)

Simplified Data Quality Monitoring of Dynamic Longitudinal Data: A Functional Programming Approach

Jacqueline Gutman

Audience level:
Intermediate

Description

Data scientists are often tasked with being the first to detect data quality issues that may have serious consequences for downstream consumers of their data. We will demonstrate how to simplify data quality monitoring through a functional programming approach built on the 5 key pillars of a user-friendly workflow: readability, compositionality, reproducibility, efficiency, and robustness.

Abstract

Ensuring the quality of data we deliver to customers or provide as inputs to models is often one of the most underappreciated and yet time-consuming responsibilities of a modern data scientist. This task is challenging enough with static data, but when we work with dynamic, longitudinal, continuously updating data, that added complexity can become an asset. At Flatiron Health, a cancer research company acquired for billions of dollars partly due to the complexity, breadth, and immediacy of its data, we harness the dynamic nature of our ever-changing datasets to define a robust and systematic process for early and actionable detection of data quality concerns.

Outline

Introduction: I am a data scientist and cancer researcher working with electronic medical records in oncology at the healthcare technology company Flatiron Health (3 min)

Problem: How do we identify issues in data that is longitudinal, demands recency, and is dynamic within and between people over time? (4 min)

As a cancer researcher, I know that undetected data quality issues have real downstream consequences that impact patients' lives: for example, data could be used as evidence proving that one cancer drug works better than another when in fact the opposite is true.

Solution: Build a continuous and efficient process to monitor data quality, simplify dimensions on which the data has recently changed, and create actionable flags for investigation (1 min)
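
As a rough sketch of what an actionable flag might look like (the metric names, values, and tolerance below are hypothetical illustrations, not Flatiron's actual pipeline), we can diff summary metrics between two consecutive snapshots and flag large shifts:

    import pandas as pd

    def flag_metric_shifts(prev: pd.Series, curr: pd.Series,
                           rel_tolerance: float = 0.10) -> pd.DataFrame:
        # Flag any quality metric whose relative change between two
        # consecutive snapshots exceeds the tolerance.
        change = (curr - prev) / prev.abs()
        return pd.DataFrame({
            "previous": prev,
            "current": curr,
            "relative_change": change,
            "needs_investigation": change.abs() > rel_tolerance,
        })

    # Hypothetical per-snapshot metric summaries
    prev = pd.Series({"pct_missing_diagnosis_date": 0.02, "row_count": 41000})
    curr = pd.Series({"pct_missing_diagnosis_date": 0.09, "row_count": 41350})
    print(flag_metric_shifts(prev, curr))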

Conceptual framework: We continually freeze and version our data, treating each distinct versioned snapshot in time as a self-contained input to readable, declarative functions that summarize key quality metrics of each dataset. We can borrow 5 key principles and paradigms from functional programming to reason about data quality in an enjoyable, scalable, and self-documenting process. (5 min)
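
A minimal sketch of that idea (the column and metric names are invented for illustration): each versioned snapshot is a DataFrame, and each quality metric is a pure function of that snapshot, so the same frozen snapshot always yields the same summary:

    from typing import Callable, Dict
    import pandas as pd

    # A quality metric is a pure, declarative function of one snapshot:
    # the same frozen snapshot in always yields the same summary out.
    MetricFn = Callable[[pd.DataFrame], float]

    def pct_missing(column: str) -> MetricFn:
        # Build a metric reporting the share of missing values in a column.
        return lambda snapshot: snapshot[column].isna().mean()

    def summarize_snapshot(snapshot: pd.DataFrame,
                           metrics: Dict[str, MetricFn]) -> pd.Series:
        # Apply every metric to one versioned snapshot of the data.
        return pd.Series({name: fn(snapshot) for name, fn in metrics.items()})

    # Hypothetical snapshot and metric suite
    metrics = {"row_count": len,
               "pct_missing_diagnosis_date": pct_missing("diagnosis_date")}
    snapshot = pd.DataFrame({"diagnosis_date": ["2018-01-02", None, "2018-03-09"]})
    print(summarize_snapshot(snapshot, metrics))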

5 Key Pillars of Enjoyable, User-Friendly Data Quality Monitoring (with code snippets): We provide code using functools and pandas to demonstrate how we reason about multiple evolving versions of our data to assess their quality
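
As a taste of those snippets (the file layout and names here are hypothetical, and this is a sketch under assumptions rather than the talk's actual code): functools.partial specializes a generic metric to a column, functools.lru_cache keeps repeated loads of frozen snapshots efficient, and functools.reduce folds per-version summaries into one longitudinal table:

    import functools
    import pandas as pd

    def pct_missing(column: str, snapshot: pd.DataFrame) -> float:
        return snapshot[column].isna().mean()

    # functools.partial turns the generic metric into a column-specific one.
    missing_dx = functools.partial(pct_missing, "diagnosis_date")

    @functools.lru_cache(maxsize=None)
    def load_snapshot(version: str) -> pd.DataFrame:
        # Snapshots are frozen, so caching reads by version is safe
        # and keeps repeated comparisons efficient. Path is hypothetical.
        return pd.read_parquet(f"snapshots/{version}.parquet")

    def metrics_by_version(versions):
        # functools.reduce folds one row of metrics per frozen snapshot
        # into a single table for inspecting drift across versions.
        rows = (pd.DataFrame({"version": [v],
                              "pct_missing_dx": [missing_dx(load_snapshot(v))]})
                for v in versions)
        return functools.reduce(
            lambda acc, row: pd.concat([acc, row], ignore_index=True), rows)

    # Usage, assuming snapshots/<version>.parquet files exist:
    # print(metrics_by_version(["2019-01", "2019-02", "2019-03"]))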

Wrap Up: We are all responsible for protecting the integrity of the data we work with. This framework allows us to strive towards readability, compositionality, reproducibility, efficiency, and robustness in monitoring data quality. (2 min)
