Processing data is hard. Building a robust system for data processing is even harder. This talk examines how the demands of modern data science architectures have impacted data engineering. We'll use real-world examples, gathered through my work as a PMC member of Apache Airflow, to motivate a discussion of best practices for building robust data pipelines.
We distinguish between two kinds of engineering: positive and negative.
Positive engineering includes all the things we want to do -- whether analytics, infrastructure, or deployment.
Negative engineering involves all the things we have to do to make sure those things actually happen -- like anticipating and trapping errors, dealing with malformed data, and managing resources.
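For a concrete flavor of the distinction, here is a minimal sketch (with hypothetical names like `fetch_records`) in which the positive work is a single call and everything wrapped around it is negative engineering:

```python
import time

RETRYABLE_ERRORS = (ConnectionError, TimeoutError)

def fetch_records(url):
    # The "positive" work: pull rows from a service (stubbed here).
    return []

def robust_fetch(url, attempts=3):
    # Everything below is "negative" engineering: retries, backoff,
    # and guarding downstream code against malformed records.
    for attempt in range(attempts):
        try:
            records = fetch_records(url)
        except RETRYABLE_ERRORS:
            time.sleep(2 ** attempt)  # back off, then retry transient errors
            continue
        # Drop malformed records before they reach downstream steps.
        return [r for r in records if isinstance(r, dict) and "id" in r]
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")
```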
Unfortunately, as data systems become more complex, negative engineering demands an ever-greater share of data engineers' and data scientists' time. For that reason, workflow management systems (WMSs) have become a critical part of building any data application. While a good WMS won't guarantee that code always runs properly, it lets us take concrete steps to diagnose and solve problems as they occur.
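For example, Airflow lets you declare retries and failure callbacks once and have the scheduler enforce them for every run. The sketch below assumes Airflow 1.x import paths, and `extract` and `notify_oncall` are illustrative names:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def notify_oncall(context):
    # In practice: page someone, post to a channel, file a ticket...
    print(f"Task {context['task_instance'].task_id} failed after all retries")

def extract():
    # Placeholder for real work that might raise on bad input,
    # network errors, or missing resources.
    ...

dag = DAG(
    dag_id="example_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    default_args={
        "retries": 3,                          # anticipate transient failure
        "retry_delay": timedelta(minutes=5),   # back off between attempts
        "on_failure_callback": notify_oncall,  # diagnose when failure persists
    },
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
```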
Robust data pipelines treat failure as an expected outcome, not an outlier. This philosophy is at odds with most data science frameworks, which are inherently fragile: they assume every step will succeed. We explore whether data engineering best practices can help us design better data science pipelines, using code from Airflow and the forthcoming Prefect library.
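As a taste of that philosophy, here is a hedged sketch using the Prefect API as it appeared in its early 0.x releases (task names are illustrative): a failed task doesn't crash the run, and downstream tasks can be triggered by failure itself.

```python
from datetime import timedelta

from prefect import task, Flow
from prefect.triggers import any_failed

@task(max_retries=3, retry_delay=timedelta(minutes=1))
def load_data():
    # Failure here is an expected outcome: Prefect records a Failed
    # state, retries, and lets downstream tasks react to it.
    ...

@task(trigger=any_failed)
def cleanup():
    # Runs only when an upstream task fails -- failure handling is
    # part of the pipeline, not an afterthought.
    ...

with Flow("resilient-pipeline") as flow:
    cleanup(upstream_tasks=[load_data()])

state = flow.run()  # returns a State object even when tasks fail
```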