Saturday October 30, 7:00 PM – 9:00 PM in Workshop/Tutorial I

Love your (data scientist) neighbour: Reproducible data science the Easydata way

Amy Wooding

Prior knowledge:
Basic familiarity with conda, git, and Jupyter notebooks is expected.

Summary

Tired of wasting your time and energy re-doing work that you’ve done before? Want to reduce the hidden costs that come with collaboration? In this hands-on tutorial, we’ll uncover the overlooked parts of making your data science workflow reproducible. You’ll learn about gotchas, reproducibility bugs, and better defaults along the way.

Description

Getting Oriented: What is this all about?

  • What is reproducibility and why should I care?
  • Better by Default: automate the boring stuff
  • Get out of the way: Doing data science your way

Tool Time: The Right Tools for the Job

  • Easydata: An open-source framework for Reproducible Data Science
  • python3, Cookiecutter, Conda: getting things set up
  • Make: Automating and documenting the process
  • git: tracking our history
  • jupyter: notebooks for storytelling

Saving the Environment: Making your environment reproducible

  • Conda vs the world
  • Makefiles for reproducible environment management
  • One environment per project
  • environment.yml and lockfiles
  • Local data and the Catalog
  • Using Paths
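
As a rough illustration of the "Using Paths" idea, here is a minimal sketch of resolving data locations relative to the project root instead of hard-coding absolute paths. The helper names `find_project_root` and `data_path`, the `data/` directory layout, and the use of `environment.yml` as the root marker are assumptions made for this example; they are not Easydata's actual API.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "environment.yml") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: the marker file name is an assumption for this sketch.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")


def data_path(root: Path, *parts: str) -> Path:
    """Build a path under the project's data directory (assumed to be `data/`)."""
    return root.joinpath("data", *parts)


if __name__ == "__main__":
    try:
        root = find_project_root(Path.cwd())
        print(data_path(root, "raw", "example.csv"))
    except FileNotFoundError as err:
        print(f"not inside a project: {err}")
```

Because every path is derived from the project root, a notebook that uses these helpers behaves the same wherever a collaborator happens to clone the repo.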

Git in the Flow: Making your workflow reproducible

  • A mental model for git
  • .gitignore: keeping data out of the repo
  • A simple collaborative git workflow (and cheat sheet)

Transform your World: Making your Data reproducible

  • Packaging datasets for re-use: the Dataset
  • On the importance of Metadata
  • A quick note on data licenses
  • Keeping data out of git: Transformers and Caches
  • Jupyter notebooks as Transformers
  • Sharing your Datasets
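
To make the "Dataset plus metadata" idea concrete, here is a minimal sketch of bundling raw data with descriptive metadata and a content hash, so a collaborator can check that a regenerated dataset matches the original. The `DatasetSketch` class and its fields are hypothetical and written only for illustration; Easydata's own Dataset object may look quite different.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class DatasetSketch:
    """Hypothetical container: raw bytes plus the metadata that describes them."""

    name: str
    data: bytes
    metadata: dict = field(default_factory=dict)

    def content_hash(self) -> str:
        """SHA-256 of the data, usable as a reproducibility check."""
        return hashlib.sha256(self.data).hexdigest()

    def describe(self) -> str:
        """JSON summary that is small enough to commit to git, unlike the data."""
        record = {
            "name": self.name,
            "sha256": self.content_hash(),
            "metadata": self.metadata,
        }
        return json.dumps(record, sort_keys=True, indent=2)


if __name__ == "__main__":
    ds = DatasetSketch(
        name="demo",
        data=b"a,b\n1,2\n",
        metadata={"license": "CC-BY-4.0", "source": "generated for illustration"},
    )
    print(ds.describe())
```

The design choice here is that only the small JSON description is tracked in git, while the data itself is rebuilt or fetched and then verified against the recorded hash.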

The punchline:

  • Blow it away and make it again