Sunday 14:15–15:00 in Audimax

Data versioning in machine learning projects

Dmitry Petrov

Audience level:
Intermediate

Description

In machine learning projects it is easy to lose track of the many versions of your data files. Data Version Control (DVC) is an open source tool for data science projects, created to resolve discrepancies between code and data files. It works on top of Git: when you switch between Git branches, it retrieves not only the source code but also the matching version of the data files.

Abstract

It is easy to lose track of data file versions when you work on machine learning projects:
- Did you ever forget what a file name like model_withvgg16_l45tune_120e.pkl stands for?
- Did you ever lose the right version of a data file that was used to train a particular version of your machine learning code?

Data Version Control (DVC) is an open source tool for data science projects, created to resolve discrepancies between code and data files. DVC works on top of any Git repository and extends Git with a set of data versioning commands whose semantics are similar to Git's. The tool keeps data files outside of your Git repository, in a key-value store on your hard drive.

DVC lets you switch back and forth between your Git branches and retrieves not only the source code but also the right version of the data files from the key-value store.
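
The idea of keeping data in a key-value store while Git tracks only a small reference can be illustrated with a minimal sketch. This is not DVC's actual implementation or API: the function names `add` and `checkout`, the `.ptr` pointer file suffix, and the use of SHA-256 as the key are all assumptions made for illustration. The sketch stores each file under its content hash and restores whichever version the pointer file currently names.

```python
import hashlib
import json
import shutil
from pathlib import Path

POINTER_SUFFIX = ".ptr"  # hypothetical suffix, not a DVC convention


def file_key(path: Path) -> str:
    """Return the content hash used as the key in the store (SHA-256 assumed)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def add(path: Path, cache_dir: Path) -> Path:
    """Copy a data file into the store and write a small pointer file.

    The pointer file is what would be committed to Git instead of the data.
    """
    key = file_key(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / key
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(path, cached)
    pointer = path.with_name(path.name + POINTER_SUFFIX)
    pointer.write_text(json.dumps({"key": key}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file version that the pointer file refers to."""
    key = json.loads(pointer.read_text())["key"]
    target = pointer.with_name(pointer.name[: -len(POINTER_SUFFIX)])
    shutil.copy2(cache_dir / key, target)
    return target
```

In this sketch, switching Git branches would change the contents of the pointer file, and calling `checkout` afterwards would bring the data file in the working directory back in line with the code on that branch.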

Git-style integration with remote machines and support for cloud storage let you efficiently sync your code and data files between your laptop, a desktop machine with a GPU, and a cloud instance with extra memory.

DVC project page: https://github.com/dataversioncontrol/dvc/
