Saturday 3:00 PM–3:45 PM in C11

Automating Data Pipeline using Apache Airflow

Mridu Bhatnagar

Audience level:
Intermediate

Description

Manually running scripts to extract, transform and load data is tedious, cumbersome and time-consuming. Building a data pipeline can be automated: extraction scripts can be scheduled using crontab. However, crontab has its own drawbacks, and one major challenge is monitoring. Airflow is a platform to programmatically author, schedule and monitor workflows.

Abstract

Today, we are moving towards machine learning: making predictions and finding insights based on data. The first step is to have efficient processes in place for collecting data from a variety of sources. Traditional ways of collecting data are tedious and cumbersome, and manually running scripts to extract, transform and load data costs time.

To make the process efficient, the data pipeline can be automated. Scripts to extract data can be scheduled using crontab. However, crontab has its own drawbacks, and one major challenge is monitoring. This is where Apache Airflow, an open source tool built by the Airbnb engineering team, helps. Airflow is a platform to programmatically author, schedule and monitor workflows.

The talk aims to introduce attendees to:

  1. Airflow: overview of the tool, its advantages and disadvantages
  2. Directed acyclic graphs: examples of directed acyclic and directed cyclic graphs
  3. Operators
     a. Bash Operator
     b. Python Operator
     c. Email Operator
  4. Python context manager
  5. Examples
  6. Demo
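
The distinction in item 2 between directed acyclic and directed cyclic graphs can be sketched with the Python standard library alone. This is a minimal illustration rather than Airflow code: the task names and the `run_order` helper are hypothetical, and the sketch assumes Python 3.9 or later for `graphlib`. It shows why a workflow needs to be acyclic: only then does a valid execution order exist.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
acyclic = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Making "extract" depend on "load" closes a loop, so no valid order exists.
cyclic = {
    "extract": {"load"},
    "transform": {"extract"},
    "load": {"transform"},
}


def run_order(dependencies):
    """Return a valid execution order, or None if the graph contains a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None


print(run_order(acyclic))
print(run_order(cyclic))
```

The first graph yields an order in which every task runs after the tasks it depends on; the second yields `None`, because each task would have to wait for itself.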
