Saturday 3:00 PM–3:45 PM in Track 2

Hyperrest: A new Apache Arrow API For High Performance Data Access in Pandas

Sudheesh Katkam

Audience level:
Intermediate

Description

Pandas is one of the most popular data analytics frameworks for Python and is widely used in machine learning applications. Pandas can read from many data sources through ODBC, but that interface is relatively slow. We will review performance benchmarks using Arrow with Pandas and demonstrate Hyperrest, a new API for Arrow implemented in Dremio, a new open source project for Data Fabric.

Abstract

Pandas is one of the most popular data analytics frameworks for Python. Data in Pandas is accessed through the DataFrame, a column-oriented data structure that allows fast access, filtering, and transformation.

Pandas also supports many data sources, including relational databases, as long as a compatible ODBC driver exists. However, the ODBC API is designed around a record-centric paradigm, so some processing is required to convert data received from an ODBC source to a Pandas DataFrame. As a result, ODBC access is relatively slow compared to other approaches to data access.
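The conversion cost described above can be illustrated with a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for an ODBC connection (both hand back results one record at a time through a DB-API cursor); the `trips` table and the `rows_to_columns` helper are hypothetical names chosen for this example, not part of Pandas or any ODBC driver.

```python
import sqlite3


def rows_to_columns(cursor):
    """Pivot row tuples from a DB-API cursor into a dict of column lists."""
    # cursor.description holds one entry per result column; the name comes first.
    names = [d[0] for d in cursor.description]
    columns = {name: [] for name in names}
    # Every value of every record is touched individually. This per-row work
    # is the overhead a record-centric API imposes on a columnar consumer.
    for row in cursor:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# In-memory database standing in for a remote ODBC source (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1, 2.5), (2, 7.0), (3, 1.2)],
)

cursor = conn.execute("SELECT id, distance FROM trips ORDER BY id")
columns = rows_to_columns(cursor)
conn.close()
```

The resulting dict of lists is the shape a DataFrame constructor expects, and the nested loop that produces it grows with rows times columns.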

Dremio is a new open source project for Data Fabric that uses the Apache Arrow in-memory columnar format to represent data internally. The Arrow format is very similar to a Pandas DataFrame, and the Apache Arrow project provides fast conversion functions between Arrow and Pandas. Dremio offers both an ODBC driver for general-purpose tools and an API that returns Arrow data directly to Pandas.

We will show how to connect Pandas to Dremio using either the ODBC driver or the Arrow API, and how the two compare in terms of performance.
