Monday 4:25 PM–5:10 PM in Central Park West (6501)

Spark Backend for Ibis: Seamless Transition Between Pandas and Spark

Li Jin, Hyonjee Joo

Audience level:
Intermediate

Description

Ibis is a library that provides a unified pandas-like API on top of both single-node local execution (e.g. pandas) and multi-node remote execution (e.g. BigQuery, Impala). In this talk, we will introduce a new backend for executing Ibis programs on Spark and show how you can write analytics that run on both Spark and pandas.

Abstract

Pandas is the de facto standard (single-node) DataFrame implementation in Python. As datasets grow, however, pandas struggles to keep up because it executes on a single node. Spark, on the other hand, has become a very popular choice for analyzing large datasets in the past few years. There is an API gap between pandas and Spark, though, and as a result users who switch from pandas to Spark often need to rewrite their programs.

Ibis is a library designed to bridge the gap between local execution (pandas) and cluster execution (BigQuery, Impala, etc.). In this talk, we will introduce a Spark backend for Ibis and demonstrate how users can move between pandas and Spark with the same code.
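
To make the idea of "the same code on different backends" concrete, here is a toy sketch in plain Python. It is not Ibis's API: the `Filter` class, `run_local`, `to_sql`, and the `trades` table are all hypothetical names invented for illustration. The point is only that a single backend-neutral expression can be evaluated locally or compiled to a query for a remote engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """Backend-neutral description: rows of `table` where `column` > `threshold`.

    Hypothetical stand-in for an expression object; not part of Ibis.
    """
    table: str
    column: str
    threshold: float


def run_local(expr, tables):
    """Toy 'local' backend: evaluate the expression on in-memory rows."""
    return [row for row in tables[expr.table] if row[expr.column] > expr.threshold]


def to_sql(expr):
    """Toy 'remote' backend: compile the same expression to a SQL string."""
    return f"SELECT * FROM {expr.table} WHERE {expr.column} > {expr.threshold}"


# One expression, written once.
expr = Filter(table="trades", column="price", threshold=100)

# Execute it locally on a small in-memory table (made-up sample rows)...
tables = {"trades": [{"price": 90}, {"price": 120}]}
local_rows = run_local(expr, tables)

# ...or hand the compiled query to a cluster engine instead.
sql_text = to_sql(expr)
```

The user-facing expression never changes; only the function that consumes it does. A real backend would of course cover far more operations than a single filter.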
