Everybody has that one nasty scientific data set that takes 15 steps to munge and still isn't ready to process. "Aha!", you say, "a workflow is just a flowchart, and a flowchart is just a DAG - and I've got plenty of tools and techniques for that!"
Let's talk about the theory and practice of workflow modeling and processing!
Scientific and industrial data sets are often incredibly difficult to massage into a form that is actually usable for analysis - especially in the context of live production environments. Every scientist can think of one data set that required numerous steps to acquire, preprocess, fix, reformat, reorient, and store before any actual work could be done.
From a systems administration perspective, these workflows are merely graph-based execution problems, very similar to the build-package-deploy problem that a number of build frameworks and tools have already solved.
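To make the analogy concrete, here is a minimal sketch of a data workflow treated as a DAG of tasks and executed in dependency order, much as a build tool resolves targets. The task names and graph structure are illustrative, not taken from any particular tool.

```python
# A hypothetical five-step workflow expressed as a dependency graph.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
workflow = {
    "acquire":    set(),
    "preprocess": {"acquire"},
    "reformat":   {"preprocess"},
    "store":      {"reformat"},
    "analyze":    {"store"},
}

# Placeholder actions standing in for real munging steps.
actions = {
    "acquire":    lambda: print("downloading raw data set"),
    "preprocess": lambda: print("cleaning and fixing records"),
    "reformat":   lambda: print("converting to analysis format"),
    "store":      lambda: print("writing to the data store"),
    "analyze":    lambda: print("running the actual analysis"),
}

# Run every task exactly once, after all of its dependencies.
for task in TopologicalSorter(workflow).static_order():
    actions[task]()
```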
Indeed, people already use many of these tools for their data processing. This talk discusses a popular cross-section of these tools, identifying the successes and failures of each, and proposes a number of design ideas, with corresponding implementations, for building a better workflow system.
These ideas include:

- The use of Python as a DSL for describing a workflow (as sketched below)
- An interpreter-embedding approach to constructing the DSL (i.e. repurposing Python syntax, rather than inventing syntax that requires a complex custom parser)
- Automatic generation of first-class tooling (e.g. automatic workflow progress reporting)
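The sketch below illustrates the flavor of these ideas, assuming a hypothetical decorator-based API: plain decorated Python functions declare tasks and their dependencies, and the same task registry is reused to generate tooling (here, simple progress reporting) for free. The decorator name and reporting format are assumptions for illustration, not the thesis's actual design.

```python
# Minimal sketch: Python repurposed as a workflow DSL.
from graphlib import TopologicalSorter

_registry = {}  # task name -> (function, set of dependency names)

def task(*deps):
    """Register a function as a workflow step with named dependencies."""
    def decorator(func):
        _registry[func.__name__] = (func, set(deps))
        return func
    return decorator

@task()
def acquire():
    print("  fetching raw data")

@task("acquire")
def preprocess():
    print("  cleaning records")

@task("preprocess")
def analyze():
    print("  computing results")

def run():
    """Execute tasks in dependency order, reporting progress along the way."""
    graph = {name: deps for name, (_, deps) in _registry.items()}
    order = list(TopologicalSorter(graph).static_order())
    for i, name in enumerate(order, start=1):
        print(f"[{i}/{len(order)}] {name}")
        _registry[name][0]()

if __name__ == "__main__":
    run()
```

Because the workflow is ordinary Python, the full language (functions, modules, conditionals) is available for describing it, and tooling such as the progress report above falls out of the same declarations rather than requiring a separate configuration format.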
This talk is based on my Master's thesis, which I am currently preparing. The design and implementation topics above are drawn from my research. All of my code is open-source licensed and will be available on GitHub as part of my thesis materials.