Sunday 14:15–15:00 in Kursraum 1

Extending Pandas using Apache Arrow and Numba

Uwe L. Korn

Audience level:
Intermediate

Description

The latest release of Pandas introduced the ability to extend it with custom dtypes. Using Apache Arrow as the in-memory storage and Numba for fast, vectorized computations on these memory regions, it is possible to extend Pandas in pure Python while achieving the same performance as the built-in types. In the talk, we implement a native string type as an example.

Abstract

In the 0.23 release of Pandas, the concept of ExtensionArrays was introduced. They allow Pandas DataFrames and Series to be extended with custom, user-defined types. The most prominent example is cyberpandas, which adds an IP dtype that is backed by an appropriate representation using NumPy arrays. While NumPy arrays are sufficient for a large set of custom dtypes, NumPy's focus on matrices means that some kinds of data cannot be represented in an optimal way. Apache Arrow, on the other hand, is aimed at standardizing columnar memory and provides efficient columnar storage for many data types.

Apache Arrow offers a broad set of types and has a vast integration test suite that ensures data can be passed between different programming languages. What it still lacks is an analytic layer on top, and that layer is exactly the core competence of Pandas. Combining the modern columnar storage of Arrow with Pandas' analytic layer enables new data types. These are not limited to primitive ones like strings but can also extend to nested types like lists.
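To make the storage side concrete, here is a minimal pure-Python sketch of how a column of variable-length strings is laid out in Arrow's columnar format: a validity bitmap, an offsets buffer with one more entry than there are values, and one contiguous UTF-8 data buffer. The helper names `encode_strings` and `get_string` are hypothetical and exist only for this illustration; they are not part of Arrow or of the talk's implementation.

```python
from array import array


def encode_strings(values):
    """Pack a list of str/None into (validity, offsets, data) buffers."""
    # One validity bit per value, least significant bit first.
    validity = bytearray((len(values) + 7) // 8)
    # n + 1 offsets; "i" is a signed C int (32-bit on common platforms).
    offsets = array("i", [0])
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)
            data += value.encode("utf-8")
        # A null contributes no bytes, so its start and end offsets coincide.
        offsets.append(len(data))
    return bytes(validity), offsets, bytes(data)


def get_string(buffers, i):
    """Read value i back out of the three buffers."""
    validity, offsets, data = buffers
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


buffers = encode_strings(["foo", None, "héllo"])
```

Because all characters live in a single buffer and each value is just a pair of offsets, slicing a column or scanning every string touches contiguous memory instead of chasing one pointer per Python object.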

The core algorithms of Apache Arrow are written in C++. This enables reuse between multiple languages like Python and R while achieving maximum performance, but it also creates an entry barrier that one needs to overcome before extending it. Adding a new storage type to Pandas is only one part of implementing an ExtensionArray. One also needs to implement a basic set of algorithms so that Pandas operations like groupby or slicing work on top of the new structures. Additionally, each new data type has its own special operations that users frequently want to apply. To implement these algorithms in pure Python while keeping them performant, we use Numba. It allows us to define arbitrary operations on the Arrow memory structures and compiles them with its just-in-time compiler to efficient, vectorized code. As all operations are performed on native Arrow memory, we can also release the GIL and make use of Numba's parallelization features.
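As a rough sketch of what such a kernel could look like, the following counts the characters of every string directly on an offsets buffer and a UTF-8 data buffer. It is an illustration under stated assumptions, not the implementation from the talk: the names `to_buffers` and `char_lengths` are hypothetical, the buffers are built with NumPy instead of being taken from an Arrow array, nulls are ignored, and the decorator falls back to plain Python if Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, run the kernel as plain Python.
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


def to_buffers(values):
    """Build Arrow-style offsets and data buffers from a list of str."""
    encoded = [v.encode("utf-8") for v in values]
    offsets = np.zeros(len(encoded) + 1, dtype=np.int32)
    offsets[1:] = np.cumsum([len(e) for e in encoded])
    data = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    return offsets, data


@njit(nogil=True)
def char_lengths(offsets, data, out):
    """Write the number of code points of each string into out."""
    for i in range(len(out)):
        count = 0
        for j in range(offsets[i], offsets[i + 1]):
            # UTF-8 continuation bytes look like 0b10xxxxxx; skip them.
            if (data[j] & 0xC0) != 0x80:
                count += 1
        out[i] = count


offsets, data = to_buffers(["Arrow", "naïve", ""])
lengths = np.empty(len(offsets) - 1, dtype=np.int64)
char_lengths(offsets, data, lengths)
```

The kernel only reads integers and bytes from flat buffers and never touches Python objects, which is what makes `nogil=True` safe here. With a real Arrow array, the same buffers could presumably be obtained from `pyarrow` (for example via `Array.buffers()`) and viewed as NumPy arrays without copying.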
