Presentation: Love your (data scientist) neighbour: Reproducible data science the Easydata way

Saturday October 30 7:00 PM – Saturday October 30 9:00 PM in Workshop/Tutorial I

Amy Wooding

Prior knowledge: Previous knowledge expected
Basic familiarity with conda, git and jupyter notebooks

Summary

Tired of wasting your time and energy re-doing work that you’ve done before? Want to reduce the hidden costs that come with collaboration? In this hands-on tutorial, we’ll uncover the overlooked parts of making your data science workflow reproducible. You’ll learn about gotchas, reproducibility bugs, and better defaults along the way.

Description

Getting Oriented: What is this all about?

What is reproducibility and why should I care?
Better by Default: automate the boring stuff
Get out of the way: Doing data science your way

Tool Time: The Right Tools for the Job

Easydata: An open-source framework for Reproducible Data Science
python3, Cookiecutter, Conda: getting things set up
Make: Automating and documenting the process
git: tracking our history
jupyter: notebooks for storytelling

Saving the Environment: Making your environment reproducible

Conda vs the world
Makefiles for reproducible environment management
One environment per project
environment.yml and lockfiles
Local data and the Catalog
Using Paths
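
To make the "Using Paths" item concrete, here is a minimal sketch of project-relative paths built with Python's standard `pathlib`. It is not Easydata's own API: the function names `find_project_root` and `project_paths`, the choice of `environment.yml` as the root marker, and the directory layout are all assumptions made for illustration.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "environment.yml") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    The marker file name is an assumption; any file that lives at the
    top of the project would work.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")


def project_paths(root: Path) -> dict:
    """Return commonly used locations relative to the project root.

    The directory names below are a hypothetical layout.
    """
    return {
        "data": root / "data",
        "raw": root / "data" / "raw",
        "processed": root / "data" / "processed",
        "notebooks": root / "notebooks",
    }
```

A notebook could then call `project_paths(find_project_root(Path.cwd()))["raw"]` instead of hard-coding an absolute path that only exists on one collaborator's machine.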

Git in the Flow: Making your workflow reproducible

A mental model for git
.gitignore: keeping data out of the repo
A simple collaborative git workflow (and cheat sheet)

Transform your World: Making your Data reproducible

Packaging datasets for re-use: the Dataset
On the importance of Metadata
A quick note on data licenses
Keeping data out of git: Transformers and Caches
Jupyter notebooks as Transformers
Sharing your Datasets
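
One way to picture "Transformers and Caches" is a function that derives data from an input and stores the result on disk, so the derived data can be regenerated or reused without ever being committed to git. The sketch below is a generic illustration, not Easydata's Transformer implementation: the name `cached_transform`, the JSON cache format, and the hashing scheme are assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_transform(func: Callable[[list], list], data: list, cache_dir: Path) -> list:
    """Apply `func` to `data`, reusing a cached result when one exists.

    Hypothetical helper: the cache key combines the function name and the
    input contents, so changed inputs trigger a recomputation.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    payload = json.dumps({"func": func.__name__, "data": data}, sort_keys=True)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"

    if cache_file.exists():
        # Cache hit: skip the computation entirely.
        return json.loads(cache_file.read_text())

    result = func(data)
    cache_file.write_text(json.dumps(result))
    return result
```

With the cache directory listed in `.gitignore`, only the transformation code is tracked; deleting the cache and re-running rebuilds the same data. Note that keying on the function name alone does not detect edits to the function body, which a fuller implementation would need to handle.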

The punchline:

Blow it away and make it again