PyData Seattle 2017 - Presentation: Provenance for Reproducible Data Science

In science, results that peer scientists cannot reproduce are of no value. Among other practices, recording the provenance of data makes data science results easier to reproduce and lets users be confident in the quality of the data. This talk shows how to record and analyse provenance for Python data analytics processes using the provenance model PROV.

In science, results that peer scientists cannot reproduce are of no value. Good practices for reproducible science include publishing the code used under open source licenses, performing code reviews, saving computational environments in containers (e.g., Docker), using open data formats, using a data management system, and recording the provenance of all actions.

The provenance of data is detailed information about the origin of that data, including its ownership and the actions and modifications performed on it. With provenance information, data becomes traceable and users can be confident in its quality. To specify and store provenance information, the W3C has standardized the provenance model PROV. Using PROV and its associated implementations, users can record the provenance of data analytics processes. The recorded provenance forms a directed acyclic graph that can be analysed to gain insight into those processes.
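As a rough illustration of the idea that provenance is a directed acyclic graph, the following minimal sketch uses only the Python standard library. It is not the API of any PROV implementation: the `ProvGraph` class, its methods, and the example identifiers (`raw.csv`, `clean.csv`, `cleaning`, `alice`) are hypothetical. Only the node types (entity, activity, agent) and the relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`) are taken from the PROV model.

```python
class ProvGraph:
    """Toy provenance store (hypothetical, not a real PROV library)."""

    def __init__(self):
        self.nodes = {}   # identifier -> PROV type: "entity", "activity", "agent"
        self.edges = {}   # source identifier -> list of (relation, target)

    def add(self, identifier, prov_type):
        self.nodes[identifier] = prov_type

    def relate(self, source, relation, target):
        # PROV relations point from the later element back to what it
        # depended on, which is why the graph stays acyclic.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("declare both ends of a relation first")
        self.edges.setdefault(source, []).append((relation, target))

    def lineage(self, identifier):
        """Return everything the given element depends on, depth-first."""
        seen, order, stack = set(), [], [identifier]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            order.append(node)
            # Push in reverse so relations are followed in insertion order.
            stack.extend(t for _, t in reversed(self.edges.get(node, [])))
        return order[1:]  # drop the starting element itself


# Hypothetical example: one cleaning step run by one person.
g = ProvGraph()
g.add("raw.csv", "entity")
g.add("clean.csv", "entity")
g.add("cleaning", "activity")
g.add("alice", "agent")
g.relate("cleaning", "used", "raw.csv")
g.relate("clean.csv", "wasGeneratedBy", "cleaning")
g.relate("cleaning", "wasAssociatedWith", "alice")

lineage = g.lineage("clean.csv")
print(lineage)
```

Walking the graph from `clean.csv` answers the traceability question directly: which activity produced the file, which input that activity used, and who was responsible for it.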

The talk covers:
* Introduction to provenance and PROV
* Modelling provenance for data processing
* Python APIs for provenance recording
* Provenance recording for Jupyter notebooks
* Storing provenance in graph databases
* Analysis of provenance information

Thursday 10:00 AM–10:45 AM in Track 3 - Hood

Andreas Schreiber
