Sunday 10:00–10:35 in Megatorium

Elegant data pipelining with Apache Airflow

Bolke de Bruin

Audience level:
Intermediate

Description

Batch data processing, historically known as ETL, is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot. As ETL pipelines grow in complexity, and as data teams grow in numbers, using methodologies that provide clarity isn’t a luxury; it’s a necessity.

Abstract

Batch data processing, historically known as ETL, is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot. As ETL pipelines grow in complexity, and as data teams grow in numbers, using methodologies that provide clarity isn’t a luxury; it’s a necessity. In this talk I will draw on paradigms of functional programming and on experiences from Netflix, Lyft, and ING to show how to manage semantics and apply them to workflows in Apache Airflow. Functional programming brings clarity. When functions are “pure”, meaning they have no side effects, they can be written, tested, reasoned about, and debugged in isolation, without the need to understand the external context or the history of events surrounding their execution. Semantics, in turn, give your workflows meaning and consistency.
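
To make the idea of a “pure” function concrete, here is a minimal sketch in plain Python. It is not taken from the talk and does not use any Airflow API; the function name daily_totals and the row layout are assumptions chosen purely for illustration. The point it tries to show is that the result depends only on the arguments, so the function can be tested and re-run in isolation.

```python
from datetime import date


def daily_totals(rows, day):
    """Hypothetical pure transform: sum amounts per user for one day.

    The output depends only on the arguments. Nothing is read from or
    written to the outside world, and the input list is not modified.
    """
    totals = {}
    for row in rows:
        if row["day"] == day:
            totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals


# Illustrative input data (made up for this sketch).
rows = [
    {"day": date(2018, 1, 1), "user": "a", "amount": 5},
    {"day": date(2018, 1, 1), "user": "a", "amount": 7},
    {"day": date(2018, 1, 2), "user": "b", "amount": 3},
]

# Running the same computation twice gives the same answer,
# which is what makes re-runs and backfills easy to reason about.
first = daily_totals(rows, date(2018, 1, 1))
second = daily_totals(rows, date(2018, 1, 1))
```

Because the function takes the day as an explicit argument rather than reading the current time, re-running it for a past date should give the same result as the original run, provided the input rows are unchanged.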
