Data Workflows - getting repeatable results in Data Science

Rob Parrish

Audience level:
Intermediate

Description

Reproducibility in Data Science is hard. So is organizing the logic in complex Machine Learning models. In this talk, we’ll discuss how workflow engines can address both of these challenges. I’ll give an overview of the available open source Python toolkits, with a particular focus on Luigi & Airflow. We’ll conclude with a few example workflows to help you get up & running.

Abstract

Achieving reproducibility in Data Science poses significant challenges, including changing data formats, increasing scale, evolving processing systems, machine learning model complexity, and business pressures that lead to innate bias.

This talk will discuss what’s needed to achieve reproducibility, and why data workflows should be used as a critical component of any production machine learning or analytics pipeline.

I’ll provide an overview of the available open source Python workflow engines, discuss the design principles upon which they were created and cover their shortcomings. We’ll wrap up with a few examples to help you avoid common pitfalls.

Outline:

  • The challenge of reproducible Data Science
  • What’s needed to achieve it
  • A first step: workflows
  • Workflow design paradigms
  • Available options in Python: Luigi, Airflow, & Jupyter
  • Examples
  • What’s next
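
To make the workflow idea in the outline concrete, here is a minimal sketch of the target-based paradigm that Luigi follows: each task declares what it requires and what it outputs, and a task counts as complete once its output exists. This is not Luigi's or Airflow's actual API and is not taken from the talk; the `Task`, `ExtractNumbers`, `SumNumbers`, `build` and `demo` names are hypothetical, and only the Python standard library is used.

```python
import json
import tempfile
from pathlib import Path


class Task:
    """Hypothetical unit of work that is complete once its output file exists."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        # Upstream tasks that must finish before this one runs.
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return self.output().exists()


class ExtractNumbers(Task):
    def output(self):
        return self.workdir / "numbers.json"

    def run(self):
        # Stand-in for loading raw data (illustrative values only).
        self.output().write_text(json.dumps([1, 2, 3, 4]))


class SumNumbers(Task):
    def requires(self):
        return [ExtractNumbers(self.workdir)]

    def output(self):
        return self.workdir / "total.json"

    def run(self):
        numbers = json.loads(self.requires()[0].output().read_text())
        self.output().write_text(json.dumps({"total": sum(numbers)}))


def build(task, log):
    """Run upstream tasks first, skipping any task whose output already exists."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream, log)
    task.run()
    log.append(type(task).__name__)


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        first_log, second_log = [], []
        final = SumNumbers(workdir)
        build(final, first_log)   # runs both tasks in dependency order
        build(final, second_log)  # does nothing: outputs already exist
        total = json.loads(final.output().read_text())["total"]
        return first_log, second_log, total
```

Calling `demo()` builds the pipeline twice in a temporary directory. The second build runs nothing because every output is already on disk, which is the property that makes reruns repeatable. A real engine would add scheduling, retries, and atomic writes on top of this.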
