Presentation: Building Responsible Data Science Workflows: Transparency, Reproducibility, and Ethics by Design

Thursday October 28 8:30 PM – Thursday October 28 10:30 PM in Workshop/Tutorial I

Valentin Danchev, Ben Marwick, Dr. Brandeis Marshall (she/her), Kirstie Whitaker, Sara Stoudt, Thibault Lestang, Yacine Jernite

Prior knowledge: No previous knowledge expected

Summary

Diverse communities have developed principles and tools aiming to harness transparency, reproducibility, research design, software reusability, and more recently, interpretability and fairness of data science and AI/ML/NLP models. This workshop will bring together members of these communities to discuss the best ways of integrating such principles and tools into responsible data science workflows.

Description

What constitutes a responsible data science workflow? What makes such workflows effective, computationally reproducible, and ethical? Over the last few years, many diverse communities have developed guidelines, metrics, and tools aiming to harness transparency, reproducibility, research design, software reusability, and more recently, interpretability, fairness, and equity of data-intensive and computational research, and AI/ML/NLP models.

As a result, issues of research workflow are not narrowly centred on the process of data analysis alone but integrate data analysis with considerations of good research and computational practices, collaboration, inclusion, communication, ethics, and the social impact of research. For example, earlier discussions of computational reproducibility have materialised into reproducible computational notebooks for scientific computing (e.g., Project Jupyter), literate programming, and reproducible computational environments. Research software engineers and communities, such as the Software Sustainability Institute, have been developing reliable tools for collaboration, code testing (and reviewing), continuous integration, and pipeline automation, fostering reliable data analysis and reproducible results.

Open-science communities have advocated for the importance of preregistration of research prior to data collection and for FAIR (Findable, Accessible, Interoperable, and Reusable) data, code, and materials, extending the notion of workflow to considerations both before and after the process of data analysis. Relatedly, many tools have been developed to foster open communication, including tools to collaboratively and reproducibly assemble code, text, and outputs into various documents (articles, presentations, interactive dashboards) that can be communicated to broader audiences. The causal inference research community has emphasised another pre-analysis consideration: the importance of causal diagrams in research design for guiding reliable data analysis and reducing bias.

Further, to improve reproducibility in machine learning research, ML communities have developed reproducibility initiatives, e.g., the Machine Learning Reproducibility Checklist. Recently, many communities have been developing models and tools to improve the interpretability, fairness, and equity of AI/ML/NLP models. Relatedly, the High Level Expert Group on Artificial Intelligence at the European Commission has published Ethics Guidelines for Trustworthy AI.

Finally, open-science collaborations have put into practice many of these emerging guidelines, metrics, and tools. For example, the Turing Way is an open, community-driven project 'dedicated to making collaborative, reusable and transparent research “too easy not to do”'. More recently, the BigScience project has brought together hundreds of researchers in an open scientific collaboration to build and then evaluate large computational language models from various perspectives, including bias, social impact, limitations, ethics, and carbon impact.

While different in many respects, the above communities and initiatives have something in common: they drive some aspect of good research and computational practices, collaboration, communication, ethics, and societal impact of research to the core of data science workflows. This workshop aims to bring together members of these diverse communities to discuss (1) how these various aspects can (or cannot) work together to form responsible data science workflows that foster transparent, reproducible, and ethical research by design (see Open Science by Design), (2) what changes in the incentive system in academia and industry can promote the implementation of responsible workflows, and (3) what the future of responsible data science workflows should be: should we strive for a single, overarching workflow (unlikely and probably unproductive!); should we develop standardised workflows for different types of data-intensive research (similar to reporting guidelines in medical research); should workflows be discipline-specific or interdisciplinary; or should we imagine new ways of developing and updating responsible research workflows? Participants in the workshop will aim to synthesise key themes from the discussion into a collaborative publication, highlighting the potential and challenges of responsible research workflows.