Building data pipelines is hard; building reusable and testable data pipelines is even harder. Should you use notebooks, or is it preferable to work in an IDE? This talk will focus on lessons learned implementing these data pipelines using PySpark and related components such as Delta Lake.
Data pipelines usually consist of loading the data, transforming it, and writing it to some other location. At first glance, this does not sound very complicated, so why is it so hard to do well? In this talk we will discuss how to perform these steps in PySpark, and in particular what the latest developments are around Delta Lake, data quality checks, and data modeling. Which patterns are preferable, and why? By the end of this talk, data engineers and data scientists should have a clear view of a pattern that fits many common situations and helps them set up a pipeline more quickly while preventing a lot of issues upfront.
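To make the load-transform-write pattern concrete, below is a minimal sketch in PySpark, assuming a Spark session configured with the delta-spark package; the paths, function names (load_orders, clean_orders, write_orders), and column names (order_id, order_ts, amount) are illustrative assumptions, not material from the talk.

```python
# Minimal load-transform-write sketch; paths and schema are hypothetical.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-pipeline").getOrCreate()


def load_orders(spark: SparkSession, path: str) -> DataFrame:
    # Load: read the raw input (assumed here to be Parquet files).
    return spark.read.parquet(path)


def clean_orders(orders: DataFrame) -> DataFrame:
    # Transform: a pure DataFrame-in, DataFrame-out function,
    # which keeps the logic easy to unit test outside a notebook.
    return (
        orders
        .dropDuplicates(["order_id"])
        .withColumn("order_date", F.to_date("order_ts"))
        .filter(F.col("amount") > 0)
    )


def write_orders(orders: DataFrame, path: str) -> None:
    # Write: persist the result as a Delta table.
    orders.write.format("delta").mode("overwrite").save(path)


if __name__ == "__main__":
    raw = load_orders(spark, "/data/raw/orders")             # hypothetical input path
    write_orders(clean_orders(raw), "/data/silver/orders")   # hypothetical output path
```

Keeping the transformation step as a pure function that takes and returns a DataFrame is one way to make the pipeline both reusable and testable, regardless of whether it is triggered from a notebook or an IDE.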