PyData New York City 2018 - Presentation: Repeatable Data Setup for Repeatable Science using Julia

This talk presents DataDeps.jl, a framework for working with datasets in a more automatic and replicable fashion, together with its companion tool DataDepsGenerators.jl. DataDeps.jl provides a system for the automatic download and unpacking of datasets. DataDepsGenerators.jl automatically generates the code that DataDeps.jl requires to download data from several public repositories.

Every data scientist dreams of getting data on their plate without much hassle, whether it is data for a new set of experiments or data needed to reproduce an existing result. Vandewalle et al. (2009) distinguish six degrees of reproducibility for scientific code. Achieving either of the two highest levels requires that “The results can be easily reproduced by an independent researcher with at most 15 min of user effort”. In our experience, much of that time is often spent just on setting up the data: reading the instructions, locating the download link, transferring the file to the right location, extracting an archive, and working out how to tell the script where the data is located. These tasks are automatable and therefore should be automated, both to save user time and to remove the opportunity for mistakes, as per the key practice identified by Wilson et al. (2014): “let the computer do the work”.

DataDeps.jl is a library for the Julia programming language that addresses exactly this problem. It uses a registration block, a chunk of Julia code that describes where the data can be downloaded, who created it, what the terms and conditions for its use are, and so on. The URLs given in these blocks are used to download the data required to run the experiment. Writing a registration block by hand can be tedious, but the support package DataDepsGenerators.jl generates them for the most popular data repositories.
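To make the registration-block idea concrete, here is a minimal sketch in Python of a "declare once, fetch on first use" data dependency. It is not the DataDeps.jl API and says nothing about how that package is implemented. The names `DataDependency`, `register`, `resolve` and the `.demo_datadeps` store directory are all hypothetical, chosen only for illustration, and the sketch assumes the data is either a single file or a zip archive.

```python
import hashlib
import shutil
import tempfile
import urllib.request
import zipfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class DataDependency:
    """Hypothetical stand-in for a registration block."""
    name: str                     # key the experiment code uses to ask for the data
    message: str                  # who created the data, terms of use, citation
    url: str                      # where the data can be downloaded
    sha256: Optional[str] = None  # expected checksum, if known
    unpack: bool = False          # treat the download as a zip archive


# Hypothetical in-memory registry: dataset name -> registration.
REGISTRY = {}

# Hypothetical default location for fetched datasets.
DEFAULT_STORE = Path.home() / ".demo_datadeps"


def register(dep: DataDependency) -> None:
    """Record a dependency; nothing is downloaded at this point."""
    REGISTRY[dep.name] = dep


def resolve(name: str, store: Path = DEFAULT_STORE) -> Path:
    """Return the local directory for a dataset, fetching it on first use."""
    dep = REGISTRY[name]
    store = Path(store)
    target = store / dep.name
    if target.is_dir():
        return target  # already fetched, nothing to do

    print(dep.message)  # show provenance and terms before downloading
    store.mkdir(parents=True, exist_ok=True)
    # Work in a staging directory so a failed fetch never looks complete.
    staging = Path(tempfile.mkdtemp(dir=store))
    try:
        local_file = staging / dep.url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(dep.url, local_file)
        if dep.sha256 is not None:
            digest = hashlib.sha256(local_file.read_bytes()).hexdigest()
            if digest != dep.sha256:
                raise ValueError(f"checksum mismatch for {dep.name}")
        if dep.unpack:
            with zipfile.ZipFile(local_file) as archive:
                archive.extractall(staging)
            local_file.unlink()  # keep only the extracted contents
        staging.rename(target)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return target
```

With a registration in place, experiment code would ask for the data by name only, for example `resolve("Demo") / "table.csv"` for a hypothetical dataset called "Demo", and would never hard-code a local path. The first call downloads and unpacks the data, and later calls return the existing directory.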

At present, DataDepsGenerators.jl supports the UCI ML, GitHub and DataOne repositories, which between them host a large number of datasets. UCI ML provides around 436 commonly used datasets, while DataOne provides a whopping 800k datasets with over 46 TB of content. Future work will add support for many other data repositories (such as CKAN, OAI-PMH and DataCite DOIs), eventually placing almost all of the world's open research data at your fingertips.

Thursday 2:40 PM–3:20 PM in Central Park West (#6501)


Sebastin Santy

