Sunday 2:00 PM–2:45 PM in Room 2

AutoDocish: Automated-ish Dataset Documentation

Elizabeth Wickes

Audience level:
Intermediate

Description

AutoDocish is a command-line Python tool that semi-automates the dataset documentation process. Built on a framework designed for expansion and customization, it produces Markdown template files that contain a basic data dictionary structure. This talk explains dataset documentation practices and how the tool could fit into a data publishing workflow.

Abstract

Creating dataset documentation is often left to the authoring researcher. Documentation is vital to a dataset's future reusability, yet this time-consuming process yields very little immediate benefit. One can argue that well-documented datasets are more reusable and can generate better citation metrics in the long term, but the work still takes time and attention away from the core short-term research mission.

Deconstructing the documentation process shows that much of the content can be derived from data profiling tools, which can be automated. The remaining work is then explaining the context of the data and the meaning of certain values.

AutoDocish is a prototype Python tool that helps anyone who needs to create data documentation. It generates a base template that includes a basic profile of each column of data, along with places for authors to provide meanings and explanations for each. The tool runs from the command line and is written with a framework for expansion, so researchers can easily add their own custom functions for specific types of data profiles and calculations. The output is basic Markdown, so it can easily be added to a website or included as a readable plain-text file within a dataset deposit.
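The workflow described above can be illustrated with a minimal sketch. This is not AutoDocish's actual code or API: the function names, the `PROFILES` registry, and the output layout are all hypothetical, chosen only to show how per-column profiles and author placeholders might combine into a Markdown data dictionary, and how a custom profile function could be plugged in.

```python
import csv
import io


def count_missing(values):
    """Count empty or whitespace-only cells in a column."""
    return sum(1 for v in values if v.strip() == "")


def count_unique(values):
    """Count distinct non-empty values in a column."""
    return len({v for v in values if v.strip() != ""})


def numeric_range(values):
    """Return 'min to max' if every non-empty cell is numeric, else 'n/a'."""
    numbers = []
    for v in values:
        if v.strip() == "":
            continue
        try:
            numbers.append(float(v))
        except ValueError:
            return "n/a"
    if not numbers:
        return "n/a"
    return f"{min(numbers):g} to {max(numbers):g}"


# Hypothetical registry: a researcher could add a custom profile
# by adding another label -> function pair here.
PROFILES = {
    "Missing values": count_missing,
    "Unique values": count_unique,
    "Numeric range": numeric_range,
}


def build_data_dictionary(csv_text, profiles=PROFILES):
    """Build a Markdown data dictionary template from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    lines = ["# Data dictionary", ""]
    for column in reader.fieldnames or []:
        # Short rows yield None for missing cells; treat those as empty.
        values = [row[column] or "" for row in rows]
        lines.append(f"## {column}")
        lines.append("")
        for label, func in profiles.items():
            lines.append(f"- {label}: {func(values)}")
        # Placeholder for the part that cannot be automated.
        lines.append("- Description: TODO (author explains meaning and context)")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "site,temp_c\nA,12.5\nB,\nA,14\n"
    print(build_data_dictionary(sample))
```

Keeping the profiles in a plain dictionary is one simple way to make the tool extensible: adding a domain-specific calculation means writing one function that takes a list of cell values and registering it under a label.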