Saturday 11:45 AM–12:30 PM in Room #370B/C (3rd Floor)

How I learned to time travel, or, data pipelining and scheduling with Airflow

Laura Lorenz

Audience level:
Intermediate

Description

Data warehousing and analytics projects can, like ours, start out small - and fragile. With an organically growing mess of scripts glued together and triggered by cron jobs hiding on different servers, we needed better plumbing. After surveying the data pipelining landscape, we landed on Airflow, an Apache-incubating batch-processing pipeline and scheduler tool from Airbnb.

Abstract

Any reporting tool is only as powerful as the data behind it, so when our data warehousing process outgrew its humble origins, we went looking for something better. After testing several options, including Drake, Pydoit, Luigi, AWS Data Pipeline, and Pinball, we landed on Airflow, an Apache-incubating batch-processing pipeline and scheduler tool originating at Airbnb. Airflow lets you build pipelines as directed acyclic graphs (DAGs) and pairs them with a scheduler that handles alerting, retries, callbacks, and more to keep your pipelines robust. This talk will discuss the value of DAG-based pipelines for data processing workflows, highlight useful features in each of the pipelining projects we tested, and dive into some of the specific challenges (like time travel) and successes (like time travel!) we’ve experienced using Airflow to productionize our data engineering tasks. By the end of this talk, you will learn:

  • pros and cons of several Python-based/Python-supporting data pipelining libraries
  • the design paradigm behind Airflow, an Apache-incubating data pipelining and scheduling service, and what it is good for
  • some epic fails to avoid and some epic wins to emulate from our experience porting our data engineering tasks to a more robust system
  • some quick-start tips for implementing Airflow at your organization
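
To give a flavor of what DAG-based pipeline construction looks like, here is a minimal sketch of an Airflow DAG with three dependent tasks. It is purely illustrative and not taken from the talk: the owner, task names, schedule, and retry settings are assumptions, and it uses the Airflow 1.x-style BashOperator import.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

    # Defaults applied to every task; the scheduler retries a failed task
    # twice, five minutes apart, instead of silently dying like a cron job.
    default_args = {
        "owner": "data-team",                 # hypothetical owner
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    }

    # The DAG itself: a daily batch pipeline, backfilling from Jan 1, 2016.
    dag = DAG(
        dag_id="example_etl",                 # hypothetical pipeline name
        default_args=default_args,
        start_date=datetime(2016, 1, 1),
        schedule_interval="@daily",
    )

    # Placeholder tasks standing in for real extract/transform/load steps.
    extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

    # Declare the edges of the graph: extract -> transform -> load.
    extract.set_downstream(transform)
    transform.set_downstream(load)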