Saturday October 30 4:30 PM – Saturday October 30 5:00 PM in Talks II

Simplifying Testing of Spark Applications

Megan Yow

Prior knowledge:
Previous knowledge expected
Spark

Summary

Unit tests, and even tests of small code snippets, require a Spark environment to be spun up, which is costly and resource-intensive. Pandas and native Python, on the other hand, can't scale to handle big data, but applications written with them are significantly faster to test. Here we introduce Fugue, an abstraction layer that lets us test code in a Pandas environment before executing it on Spark, accelerating development.

Description

Data practitioners use distributed computing frameworks such as Apache Spark to work with big data. One of the major pain points of Apache Spark is its testability. To run tests on simple code changes, users have to spin up a local PySpark instance, which takes a few minutes. Some users submit jobs to a cluster just to test code. Even worse, libraries such as databricks-connect forward all of the local Spark code to be executed on a cluster, so even simple tests need a running Spark cluster. This makes projects very expensive, in both wasted developer time and unneeded cluster usage.

The lack of testability also leads to slow development cycles. In machine learning applications, rapid iterations are needed to achieve optimal performance. In this demo, we introduce a library called Fugue that serves as an abstraction layer for distributed compute frameworks. Users can write code in native Python or Pandas, and then port it to Spark or Dask at execution time. This allows users to test code much faster, free from Spark dependencies. When ready to run on a cluster, users only need to specify the execution engine (Pandas or Spark). Fugue dramatically speeds up development cycles and makes data projects cheaper.
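
As a rough illustration of this workflow, the sketch below keeps the business logic in a plain Python function and tests it with no Spark session at all. The function `add_total`, its `price` and `quantity` columns, and the helper `run_with_fugue` are hypothetical names chosen for this example. The helper assumes Fugue's `transform` function, which takes a dataframe, a function, an output schema, and an optional engine; it is not called here and needs Fugue installed.

```python
from typing import Any, Dict, List


def add_total(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical business logic: add a `total` field to each row."""
    # Build new dicts so the caller's input rows are left unmodified.
    return [{**row, "total": row["price"] * row["quantity"]} for row in rows]


def test_add_total() -> None:
    """Plain unit test: runs in milliseconds, no Spark or Pandas needed."""
    rows = [{"price": 2.0, "quantity": 3}, {"price": 1.5, "quantity": 4}]
    result = add_total(rows)
    assert [r["total"] for r in result] == [6.0, 6.0]
    assert "total" not in rows[0]  # input was not mutated


def run_with_fugue(df: Any, engine: Any = None) -> Any:
    """Sketch of handing the same function to Fugue (assumes Fugue is installed).

    With engine=None, Fugue runs locally; passing a SparkSession as `engine`
    is assumed to run the same logic on Spark.
    """
    from fugue import transform  # imported lazily so the test has no Fugue dependency

    # "*,total:double" declares the output as all input columns plus `total`.
    return transform(df, add_total, schema="*,total:double", engine=engine)


if __name__ == "__main__":
    test_add_total()
```

The point of this layout is that the tested function contains no Spark or Fugue imports, so the engine choice stays a single argument at the call site rather than something the logic itself depends on.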