Sunday 10:00–10:45 in LG6

Efficient and portable DataFrame storage with Apache Parquet

Uwe L. Korn

Audience level:
Intermediate

Description

Apache Parquet is the most used columnar data format in the big data processing space and recently gained Pandas support. It leverages various techniques to store data in a CPU and I/O efficient way and provides capabilities to push-down queries to the I/O layer. In this talk, it is shown how to use it in Python, detail its structure and present the portable usage with other tools.

Abstract

Since its creation in 2013, Apache Parquet has risen to be the most widely used binary columnar storage format in the big data processing space. While supporting basic attributes of a columnar format like reading a subset of columns, it also leverages techniques to store the data efficiently while providing fast access. In addition the format is structured in such a fashion that when supplied to a query engine, Parquet provides indexing hints and statistics to quickly skip over chunks of irrelevant data.

In recent months, efficient implementations to load and store Parquet files in Python became available, bringing the efficiency of the format to Pandas DataFrames. While this provides a new option to store DataFrames, it especially allows us to share data between Pandas and a lot of other popular systems like Apache Spark or Apache Impala. In this talk we will show the improvements that Parquet bring performance-wise but also will highlight important aspects of the format that make it portable and efficient for queries on large amount of data. As not all features are yet available in Python, an overview of the upcoming Python-specific improvements and how the Parquet format will be extended in general is given at the end of the talk.