You're a data scientist. You have a bunch of analyses you performed in Jupyter Notebooks, but anything older than 2 months is totally useless because it never works right when you open the notebook again. Also, you cannot remember the dropout rate on the second-to-last layer of the convolutional neural network that gave really great results 2 weeks ago and that you now want to deploy into production. Does that ring a bell?

You're a software engineer in a data science team. You can't live without Git. Reviews of readable files, tests, code analysis and CI used to be part of your daily routine. You thought of Jupyter Notebooks only as a demo tool. You need reproducibility for every step of your work, even if you lose a server. And last but not least, you want to be able to deliver to production something usable by anyone.

What

This tutorial explains and shows how to use MLV-tools to set up a development environment and deliver a project while avoiding the frustrations caused by siloed teams or differing points of view.

There is no magical solution, but compromises can be found. MLV-tools helps to:

Keep using Jupyter Notebooks but get synchronized executable & configurable Python 3 scripts
Increase testability and usage of IDE features
Easily version and share experiments
Avoid re-running time-consuming tasks
Export an ML pipeline
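
To make the first point concrete, here is a minimal sketch of what a "configurable" Python 3 script can look like once a notebook's hard-coded values become command-line parameters. This is not MLV-tools output and does not use its API: the function names, the parameter names (--dropout-rate, --epochs) and their defaults are all hypothetical, chosen only for illustration.

```python
import argparse
import json


def train(dropout_rate: float, epochs: int) -> dict:
    """Placeholder training step (hypothetical): returns the parameters it ran with."""
    # A real step would build and fit a model here.
    return {"dropout_rate": dropout_rate, "epochs": epochs}


def parse_args(argv=None) -> argparse.Namespace:
    """Expose what would be hard-coded notebook values as command-line options."""
    parser = argparse.ArgumentParser(description="Hypothetical training step")
    parser.add_argument("--dropout-rate", type=float, default=0.5)
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)


def main(argv=None) -> str:
    args = parse_args(argv)
    params = train(args.dropout_rate, args.epochs)
    # Serialising the parameters makes them easy to commit next to the code.
    return json.dumps(params, sort_keys=True)


if __name__ == "__main__":
    print(main())
```

Calling main(["--dropout-rate", "0.3", "--epochs", "5"]) returns the parameters as a JSON string, so the dropout rate used for a given run can be recorded and versioned instead of remembered.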

Requirements

clone the MLV-tools tutorial Git repository
be familiar with code versioning (basic usage of Git is enough)
install Git, Virtualenv, Docker, Docker Compose, Jupyter and an editor
have some familiarity with Machine Learning workflows

Outline

Global goal: be able to easily set up your own project using MLV-tools

Attendees will be guided step by step to experiment on their own computer.

1 - Introduction

Goal: expose versioning, automation and reproducibility issues with Machine Learning projects.

Jupyter Notebooks and automation/delivery?
Code x Data x Hyperparameters versioning?

2 - What is DVC and how does it work?

Goal: understand how to handle code, hyperparameters and data versioning using Git and a DVC pipeline.

What is DVC?
Install [interactive part]
Create a dummy pipeline [interactive part]
Collaborative work example [interactive part]
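
The idea of skipping a pipeline step when its inputs have not changed can be sketched in a few lines. This is a simplified illustration of the general principle (compare a content hash of the input with the one recorded at the last run), not DVC's implementation; the function names and the JSON state file are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_if_changed(step_name: str, input_path: Path, state_path: Path, step) -> bool:
    """Run step(input_path) only if the input changed since the last recorded run.

    Returns True if the step was executed, False if it was skipped.
    """
    # Load previously recorded digests (hypothetical JSON state file).
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    digest = file_digest(input_path)
    if state.get(step_name) == digest:
        return False  # input unchanged: skip the time-consuming step
    step(input_path)
    # Record the digest only after the step succeeded.
    state[step_name] = digest
    state_path.write_text(json.dumps(state))
    return True
```

Running the same step twice on an unchanged file executes it only the first time; editing the file makes the next call run it again.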

3 - Handle an ML project with DVC and MLV-tools

Goal: easily use DVC on a Machine Learning project with MLV-tools.

Project organisation proposal (explanation)
From a set of Jupyter Notebooks to a DVC pipeline [interactive part]
Create and run a new experiment [interactive part]
Re-run part of a pipeline [interactive part]

4 - Going further

Goal: see how the process fits everyday use cases.

See (or try) several realistic cases

Friday 11:15–12:45 in GoDataDriven

How to easily set up and version your Machine Learning pipelines, using DVC and MLV-tools

Stéphanie Bracaloni, Sarah Diot-Girard

Description

Abstract