This tutorial introduces the basics of Apache Airflow and how to orchestrate workloads on Google Cloud Platform (GCP). A GCP environment will be provided; users just need to log in with a Google account.
Apache Airflow is a pipeline orchestration tool initially built at Airbnb and later open-sourced. It allows data engineers to configure multi-system workflows that are executed in parallel across any number of workers. A single pipeline may contain one or many operations, such as running Python code, executing Bash commands, or submitting a Spark job to the cloud. Airflow itself is written in Python, and users can write their own custom operators in Python.
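To make this concrete, here is a minimal sketch of an Airflow pipeline (a DAG) with one Python task and one Bash task. The import paths assume the Airflow 2.x package layout, and the DAG id, schedule, and `greet` function are illustrative, not part of this tutorial's workshop code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    # A plain Python function that Airflow runs as a task.
    print("Hello from Airflow!")


with DAG(
    dag_id="example_pipeline",      # hypothetical name for illustration
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # run once per day
    catchup=False,                  # do not backfill past runs
) as dag:
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)
    list_files = BashOperator(task_id="list_files", bash_command="ls -la")

    # ">>" declares the dependency: say_hello runs before list_files.
    say_hello >> list_files
```

Placed in Airflow's `dags/` folder, this file is picked up by the scheduler, which runs the two tasks in the declared order on each daily run.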
A data pipeline is a critical component of an effective data science product, and orchestrating pipeline tasks makes development simpler and the resulting engineering more robust and scalable.
In this tutorial, we will give a practical introduction to Apache Airflow. We will cover:
Prerequisite: basic familiarity with Python.