PyData Barcelona 2017 - Presentation: HDF5 and pandas

HDF5 is a hierarchical, binary database format that is extremely popular and includes features such as chunking, ragged data, extensible data, parallel I/O, compression, and complex selection, among others. For its part, pandas is the de facto standard for high-performance, easy-to-use data structures and data analysis tools in Python. Together, HDF5 and pandas make a winning combination.

Description

HDF5 is a hierarchical, binary database format that has become the de facto standard for scientific computing. While the specification may be used in a relatively simple way (persistence of static arrays), it also supports several high-level features that prove invaluable. These include chunking, ragged data, extensible data, parallel I/O, compression, complex selection, and more. Moreover, HDF5 bindings exist for almost every language, including two Python libraries (PyTables and h5py). This tutorial will cover HDF5 itself through the lens of both h5py and PyTables and will show how to use them to persist both NumPy and pandas containers.
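As a rough illustration of the features listed above, the following minimal sketch uses h5py to persist a NumPy array as a chunked, compressed, extensible dataset. The file name, the dataset path `measurements/run0`, the chunk shape, and the compression level are all assumptions made for this example, not recommendations from the tutorial.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location and data, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
data = np.arange(1_000_000, dtype=np.int64).reshape(1000, 1000)

with h5py.File(path, "w") as f:
    # Intermediate groups in the path are created automatically.
    dset = f.create_dataset(
        "measurements/run0",
        data=data,
        chunks=(100, 1000),      # assumed chunk shape: 100 full rows per chunk
        maxshape=(None, 1000),   # extensible along the first axis
        compression="gzip",
        compression_opts=4,      # assumed compression level
    )
    # Extensible data: grow the dataset by 100 rows and fill them with zeros.
    dset.resize(1100, axis=0)
    dset[1000:] = 0

with h5py.File(path, "r") as f:
    dset = f["measurements/run0"]
    # Selection: only the chunks overlapping this slice are read and decompressed.
    block = dset[10:20, :5]
    stored_shape = dset.shape
    stored_chunks = dset.chunks
```

Reading `block` touches a single chunk here, since rows 10 to 19 all fall inside the first 100-row chunk.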

This tutorial will discuss tools, strategies, and hacks for squeezing every ounce of performance out of HDF5 in new or existing projects. It will also go over fundamental limitations in the specification and provide creative and subtle strategies for getting around them. We will also see how pandas can use HDF5 via its HDFStore class. Overall, this tutorial will show how HDF5 plays nicely with all parts of an application, making the code faster and the data smaller.
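The pandas side of this can be sketched with `HDFStore`, which requires PyTables to be installed. The example below is a minimal sketch under assumed names: the key `readings`, the column names, and the compression settings are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file location and dataframe, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "store.h5")
df = pd.DataFrame({
    "sensor": np.repeat(["a", "b"], 500),
    "value": np.arange(1000, dtype=np.float64),
})

with pd.HDFStore(path, mode="w", complevel=5, complib="zlib") as store:
    # format="table" allows on-disk queries; data_columns lists the
    # columns that may appear in a `where` expression.
    store.put("readings", df, format="table", data_columns=["sensor", "value"])

with pd.HDFStore(path, mode="r") as store:
    # Only the matching rows are loaded into memory.
    subset = store.select("readings", where="sensor == 'b' & value > 990")
```

The default `fixed` format is usually faster to write and read in full, but it does not support `where` selections, which is why `format="table"` is used in this sketch.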

Knowledge of Python, NumPy, pandas, and basic HDF5 is recommended but not required.

Outline

Intro and setup (10 min)
Basic datatypes (10 min)
- Homogeneous types (Arrays)
- Compound types (Tables)
Chunking (10 min)
- How it works
- How to properly select your chunksize
Meaning in layout (10 min)
- Tips for choosing your hierarchy
Why you should always use compression (20 min)
- Compression algorithms (aka codecs) available
- Choosing the most adequate codec
- Exercise
Queries and Selections (30 min)
- Table.where() in PyTables
- Indexed queries
- Exercise
Integration with pandas (HDFStore) (20 min)
- Storing/loading dataframes
- Querying a serialised dataframe
- Exercise
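The compound types, compression, and query items in the outline above can be sketched together with PyTables. This is a minimal sketch, not material from the tutorial itself: the `Reading` description, the group and table names, and the filter settings are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "table.h5")


class Reading(tables.IsDescription):
    """Compound type: one record per row (assumed schema)."""
    sensor_id = tables.Int32Col()
    value = tables.Float64Col()


# Assumed codec and level; PyTables ships other codecs as well.
filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, mode="w") as h5:
    group = h5.create_group("/", "experiment")
    table = h5.create_table(group, "readings", Reading, filters=filters)

    # Build all rows as one structured array and append them in bulk.
    rows = np.zeros(10_000, dtype=table.dtype)
    rows["sensor_id"] = np.arange(10_000) % 10
    rows["value"] = np.arange(10_000, dtype=np.float64)
    table.append(rows)
    table.flush()

    # Index the column so queries on it can avoid a full table scan.
    table.cols.value.create_index()
    indexed = table.cols.value.is_indexed

    # In-kernel query: the condition string is evaluated by PyTables
    # while iterating over the table on disk.
    condition = "(sensor_id == 3) & (value >= 9000)"
    values = [row["value"] for row in table.where(condition)]
```

When the whole result fits in memory, `table.read_where(condition)` returns the matching rows as a NumPy structured array instead of an iterator.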

Friday 15:00–17:00 in Intermediate

Francesc Alted
