Friday October 29 7:30 AM – Friday October 29 8:00 AM in Talks I

FugueSQL - The Enhanced SQL Interface for Pandas, Spark, and Dask DataFrames

Chengxuan Wang

Prior knowledge:
Previous knowledge expected
SQL

Summary

SQL users working with Pandas, Spark, and Dask DataFrames quickly realize SQL is a second-class interface, invoked between predominantly Python code. We will introduce FugueSQL, an enhanced SQL interface that allows SQL lovers to express end-to-end workflows predominantly in SQL. With a Jupyter notebook extension, SQL commands used for interactive handling of in-memory datasets.

Description

To cater users coming from a SQL background, Pandas, Spark, and Dask have SQL interfaces (some unofficial) that allow SQL lovers to manipulate DataFrames in their language of choice. Still, these SQL interfaces are often insufficient for being used for end to end workflows. Having such an interface would allow SQL users to seamlessly transition between working on databases, to working on flat files. There would remove the need to learn specific frameworks to work with data in-memory.

In this talk we'll introduce FugueSQL, a language that allows SQL-lovers to use their preferred tool in expressing end-to-end computation workflows. FugueSQL has additional keywords such as LOAD and SAVE that support ETL operations on flat files such as parquet and csv. Because SQL as an interface is agnostic to frameworks, FugueSQL also allows scaling from Pandas to Spark or Dask without any code change.

To elevate FugueSQL as a first class interface for working with data, enhancements such as variable assignment, jinja templating, and Python interoperability have been added. Python functions can be invoked inside predominantly SQL code. FugueSQL also is meant to take advantage of the Spark and Dask engines to handle big data. Thus, it has support for distributed compute-relevant operations such as PARTITION, PERSIST and BROADCAST. In this talk, we will show the functionality of FugueSQL for end-to-end Extract, Transform, Load (ETL) pipelines with the Spark and Dask engines.