As data teams scale, data pipelines become increasingly interconnected and often share components. Though sharing components is efficient for development, it means upstream changes can have unintended consequences for downstream datasets. In this talk, we'll show how data validation solves this problem, focusing especially on how to scale current validation frameworks to handle big data with Spark and Dask.
Data validation is the practice of implementing checks to confirm that data is arriving (and being processed) as expected. Data teams apply data validation to preserve the integrity of existing data workflows. As data pipelines become interconnected, it becomes easy for one pipeline's changes to break other data applications. In situations like this, data validation serves both as a test suite for the pipeline and as a monitoring solution that stops malformed data from flowing through the system. Without these checks, data applications can produce inaccurate results without anyone being alerted.
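As a minimal sketch of the idea, a validation step can be a plain function that either passes a batch through or raises before bad data reaches downstream consumers. The column names (`price`, `state`) and the allowed value set here are illustrative, not from any specific framework:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Pass the batch through if all checks hold; otherwise raise."""
    # Check 1: prices must be non-negative.
    if df["price"].lt(0).any():
        raise ValueError("price contains negative values")
    # Check 2: state codes must come from a known set (illustrative).
    if not df["state"].isin(["CA", "FL", "NY"]).all():
        raise ValueError("state contains unexpected values")
    return df

# A valid batch flows through unchanged; a malformed one raises.
clean = validate(pd.DataFrame({"price": [9.99, 5.00], "state": ["CA", "NY"]}))
```

Validation libraries such as Pandera formalize this pattern with declarative schemas, but the underlying contract is the same: check, then pass through or fail loudly.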
While data validation frameworks are available, it is still hard to bring these solutions to big data. Most frameworks are built for pandas and are difficult, if not impossible, to apply with distributed compute frameworks such as Spark and Dask. In this talk, we will cover the basics of data validation, but more importantly, we will also discuss how to apply it to a large dataset.
To do this, we will use Fugue, an abstraction layer that enables users to port pandas, Python, and SQL code to Spark and Dask. By combining Fugue with existing validation frameworks such as Pandera, we can take pandas-based validation code and apply it in a distributed setting. For large-scale data, there is also a unique use case: applying different validations to different partitions of the data. This is currently not feasible with any single validation library. In this talk, we will show how validation by partition can be achieved by combining Fugue with validation frameworks such as Pandera.
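The shape of validation by partition can be sketched in pure pandas, assuming hypothetical per-partition rules keyed by a `state` column (the rules and thresholds are invented for illustration). With Fugue, the same per-partition function would be mapped across partitions on Spark or Dask instead of looped over locally:

```python
import pandas as pd

# Hypothetical rules: each partition (state) gets its own validation logic.
RULES = {
    "CA": lambda part: part["price"].between(0, 1200).all(),
    "FL": lambda part: part["price"].between(0, 800).all(),
}

def validate_by_partition(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a partition-specific check to each group of rows."""
    for state, part in df.groupby("state"):
        check = RULES.get(state)
        if check is not None and not check(part):
            raise ValueError(f"validation failed for partition {state!r}")
    return df
```

In the distributed version, the loop disappears: the engine partitions the data by `state` and runs the appropriate check on each partition's workers, which is what makes this pattern practical at scale.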