Presentation: Designing Functional Data Pipelines for Reproducibility and Maintainability

Friday October 29 8:30 AM – Friday October 29 9:00 AM in Talks I

Designing Functional Data Pipelines for Reproducibility and Maintainability

Chin Hwee Ong

Prior knowledge: Previous knowledge expected
Basic understanding of data pipelines; hands-on experience in imperative (procedural and object-oriented) programming

Summary

Designing reliable and extensible data pipelines at scale is a challenge, as testing and debugging across compute units are often complex and time-consuming due to dependencies at runtime. In this talk, I will explore how functional programming design patterns in Python/Spark enable us to build production-ready data pipelines that are reproducible and maintainable at scale.

Description

When building data pipelines at scale, it is crucial that they are reliable, scalable and extensible as business needs evolve. Designing for reproducibility and maintainability is a challenge, because testing and debugging across compute units (threads, cores, compute nodes) is often complex and time-consuming due to dependencies and shared state at runtime. In this talk, I will share common challenges in designing reproducible and maintainable data pipelines at scale, and explore how functional programming in Python and Apache Spark can be used to build scalable, production-ready pipelines. Through analogies and realistic examples inspired by data pipeline designs in production environments, you will learn about:

What Functional Programming is, and how it differs from other programming paradigms
Key Principles of Functional Programming
How "control flow" is implemented in Functional Programming
Functional design patterns for data pipeline design in Python and Apache Spark, and how they improve reproducibility and maintainability
Whether it is possible to write a purely functional program
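
As a rough illustration of the kind of functional design pattern the list above refers to, here is a minimal sketch in plain Python. It is not taken from the talk: the step names (`drop_missing_amount`, `add_amount_with_tax`), the `compose` helper and the sample records are all hypothetical. It assumes a pipeline is modelled as a chain of pure functions that return new data instead of mutating their input.

```python
from functools import reduce

# Hypothetical pipeline steps: each is a pure function that takes a list of
# records and returns a NEW list, leaving its input untouched.

def drop_missing_amount(records):
    """Keep only records whose 'amount' field is present."""
    return [r for r in records if r.get("amount") is not None]

def add_amount_with_tax(rate):
    """Build a step that adds an 'amount_with_tax' field (higher-order function)."""
    def step(records):
        return [
            {**r, "amount_with_tax": round(r["amount"] * (1 + rate), 2)}
            for r in records
        ]
    return step

def compose(*steps):
    """Chain steps left to right into a single pipeline function."""
    return lambda records: reduce(lambda acc, step: step(acc), steps, records)

# The pipeline is itself just a function built from smaller functions.
pipeline = compose(drop_missing_amount, add_amount_with_tax(0.1))

raw = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": None}]
result = pipeline(raw)
```

Because each step depends only on its arguments and shares no state, running `pipeline(raw)` twice gives the same result and `raw` is left unchanged, which is what makes each step testable in isolation.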

This talk assumes a basic understanding of building data pipelines with functions and classes/objects. While the main target audience is data scientists, data engineers and developers building data-intensive applications, anyone with hands-on experience in imperative programming (including Python) should be able to understand the key concepts and use cases in functional programming.