Thursday 2:20 PM–3:05 PM in Track 2 Room

Datasets and machine learning models versioning using open source tools

Dmitry Petrov

Audience level:
Intermediate

Description

AI and ML are becoming an essential part of software engineering. Open-source tools like Git, Git-LFS, and MLflow can increase ML teams' productivity by introducing best practices. However, these tools do not cover the management and versioning of large datasets. We will show how to overcome their limitations using DVC (DVC.org), an open-source project for versioning ML models and datasets.

Abstract

AI and ML are becoming an essential part of software engineering. The traditional engineering toolset does not fully cover machine learning teams' needs. These teams need new tools for data versioning, ML pipeline versioning, ML model versioning, experiment metrics tracking, and more.

The ML workflow is data-centric, while the software engineering workflow is centered around source code. We will discuss current practices for organizing ML projects using open-source tools like Git, Git-LFS, and MLflow, as well as their limitations. These limitations motivate the development of new ML-specific data versioning systems.

Data Version Control, or DVC (DVC.org), is an open-source command-line tool. We will show how to version ML models and multi-gigabyte datasets, how to use your favorite cloud storage (S3, Google Cloud Storage, or a bare-metal SSH server) as a data file backend, how to apply best engineering practices to your ML projects, and how to combine different tools in the same project.
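To make the idea of versioning large data files alongside source code more concrete, here is a minimal Python sketch of content-addressed storage: the large file is copied into a cache under its hash, and only a small pointer file would be committed to Git. This is an illustration of the general technique, not DVC's actual implementation or file format. The function names (`track`, `checkout`), the `.pointer.json` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache under its hash and write a small pointer file.

    The pointer file (hypothetical format) is what would go under Git;
    the cache directory could live on any storage backend.
    """
    checksum = file_sha256(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"sha256": checksum, "path": data_file.name}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file version that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["sha256"][:2] / meta["sha256"][2:]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cached, target)
    return target


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    dataset = root / "data.csv"
    dataset.write_bytes(b"a,b\n1,2\n")
    pointer_file = track(dataset, root / "cache")
    dataset.write_bytes(b"overwritten")           # simulate a later change
    checkout(pointer_file, root / "cache")        # roll back to the tracked version
    print(dataset.read_bytes())
```

Because the pointer file is tiny and changes whenever the data changes, checking out an older Git commit and then restoring from the cache recovers the matching dataset version without storing the data itself in Git.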
