Stau - lightweight job orchestration for data science workloads

Christian Juncker Brædstrup

Prior knowledge:
No previous knowledge expected

Summary

This talk presents a lightweight tool for orchestrating real-world data science jobs. It outlines the design of a barebones orchestration system supporting job scheduling, complex dependencies, parallel execution of work, and more - all contained in ~1500 lines of Python and built to onboard new team members quickly. The last mile to production is often the hardest - but it doesn't have to be.

Description

Have you ever built a great model or data pipeline, only to stall when moving it to production? Ensuring the code runs when needed and that its dependencies are met can be a large and daunting challenge. There are plenty of great frameworks (Airflow, NiFi, Prefect, Kedro, and others) to help you, but they are often just as complicated to set up and maintain. What if you didn't have to think about orchestration when building a new model? What if everything just worked, almost automagically?

Recently, my data science team redesigned our existing data orchestration toolchain to handle increasing data volumes and to reduce our execution times. Unfortunately, all existing tools were either too simple or too complex for our setting. Every framework required a large refactoring of the codebase, which wasn't an option. The solution had to provide maximum flexibility in algorithm choice without cluttering the code with framework-specific function calls and decorators.

The resulting tool is called Stau, German for traffic jam, and takes an alternative design approach. Dependencies are declared alongside the code as simple Python types. Add a single variable to your script and you're done. This lets developers keep job-relevant information in a single place, next to the code, and allows dependencies to be detected and resolved automatically simply by importing the Python modules. Stau relies heavily on the APScheduler and NetworkX packages to keep a low code footprint.
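
As a rough illustration, a job and a minimal dependency scan could look like the sketch below. The variable name STAU_DEPENDENCIES, the job names, and the helper module are hypothetical illustrations of the pattern, not Stau's actual API:

    # jobs/aggregate_measurements.py - a job declares its upstream
    # dependencies as a plain Python variable (the name is hypothetical).
    STAU_DEPENDENCIES = ["ingest_measurements", "clean_measurements"]

    def main():
        ...  # ordinary data science code, no framework-specific decorators


    # discover.py - build the dependency graph by importing the job modules.
    import importlib
    import networkx as nx

    JOB_MODULES = ["ingest_measurements", "clean_measurements", "aggregate_measurements"]

    def build_graph() -> nx.DiGraph:
        graph = nx.DiGraph()
        for name in JOB_MODULES:
            module = importlib.import_module(f"jobs.{name}")
            graph.add_node(name)
            for dep in getattr(module, "STAU_DEPENDENCIES", []):
                graph.add_edge(dep, name)  # edge from dependency to dependent
        return graph

    if __name__ == "__main__":
        # A topological sort yields a valid execution order for the jobs.
        print(list(nx.topological_sort(build_graph())))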

The design bundles all components into a single executor that can schedule jobs, resolve dependencies, execute work, and run as multiple parallel copies. The tool currently runs production workloads, processing 4000+ jobs and hundreds of millions of measurements per day. The design supports most mid-size data science workloads and attempts to remove as much complexity as possible.
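
To give a sense of how little glue such a single executor needs, a bare-bones sketch combining APScheduler with the dependency graph above might look like this. The schedule, function names, and sequential execution are assumptions for illustration; the real executor also handles parallel execution and multiple running copies:

    # executor.py - one process that schedules work, resolves dependencies,
    # and executes jobs. A simplified, hypothetical sketch of the idea.
    import importlib

    import networkx as nx
    from apscheduler.schedulers.blocking import BlockingScheduler

    from discover import build_graph  # the dependency scan sketched above

    def run_all_jobs():
        graph = build_graph()
        # Execute each job only after its dependencies, in topological order.
        for job_name in nx.topological_sort(graph):
            importlib.import_module(f"jobs.{job_name}").main()

    if __name__ == "__main__":
        scheduler = BlockingScheduler()
        scheduler.add_job(run_all_jobs, "interval", minutes=15)
        scheduler.start()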