PyData Amsterdam | Presentation: Store and manage data effortlessly with HDF5

Sunday 15:15–15:55 in Room 2

Store and manage data effortlessly with HDF5

Margaret Mahan

Audience level: Novice

Description

Are you looking for accessible, compressed, organized data? HDF5 might be the solution you’re looking for. HDF5 works like a file system within a file, designed for flexible and efficient storage and I/O of high-volume, complex data. Come learn from a Pyentist how to leverage HDF5, get started with h5py, and see a real-world example of a processing pipeline utilizing HDF5.

Abstract

Are you

- a Pyentist¹?
- frequently ‘grep’-ing?
- drowning in ASCII files?
- extending filenames for each processing step?
- looking for accessible, compressed, organized data?

If you answered yes to any of these questions, then HDF5 might be the solution you’re looking for. HDF5 is entirely open source and supported by a variety of programming languages and tools, including Python (h5py). HDF5 not only supports large, complex, heterogeneous data but is also self-describing, since metadata is stored alongside the data, and supports data slicing, so part of a dataset can be read without loading all of it. In this talk, a Pyentist will show you how to embrace HDF5.

This talk is aimed at data scientists who have large, numerical datasets that need to be managed and stored but also accessed and processed efficiently. Basic knowledge of NumPy and UNIX will be useful for attendees but not required. Attendees will learn how to get started with h5py, as well as how to leverage HDF5 in order to attain accessible, compressed, and organized data.

HDF5 stands for Hierarchical Data Format, version 5. It is a file format, library, and data model for storing and managing data. More simply, HDF5 can be described as a file system within a file. An HDF5 file contains two kinds of objects: datasets and groups. Datasets work like NumPy arrays, while groups work like dictionaries that hold datasets and other groups. In addition, any object can carry attributes, which hold metadata. HDF5 is designed for flexible and efficient storage and I/O of high-volume, complex data. Data scientists will find HDF5 to be invaluable for managing, manipulating, and storing their data.
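As a rough illustration of the structure described above, the following sketch creates one group, one dataset, and one attribute with h5py, then reads part of the dataset back. The file name, group name, dataset name, and attribute are all made up for this example; it is not taken from the talk's notebook.

```python
import os
import tempfile

import h5py
import numpy as np

# Write to a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")  # hypothetical file name

    with h5py.File(path, "w") as f:
        # Groups behave like dictionaries (or folders in a file system).
        session = f.create_group("session_01")  # hypothetical group name
        # Datasets behave like NumPy arrays that live on disk.
        signal = session.create_dataset(
            "signal", data=np.arange(12.0).reshape(3, 4)
        )
        # Attributes attach metadata to a group or dataset.
        signal.attrs["units"] = "arbitrary"

    with h5py.File(path, "r") as f:
        names = list(f["session_01"].keys())  # dictionary-style listing
        dset = f["session_01/signal"]         # path-style lookup
        row = dset[1, :]                      # slicing reads only this row
        units = dset.attrs["units"]
        shape = dset.shape

print(names, row, units, shape)
```

Slicing a dataset returns an ordinary NumPy array, so the result can be used after the file has been closed.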

Part of this talk will demonstrate how to get started with HDF5. In this demo, attendees will learn how to: create and handle HDF5 files using h5py, manage and manipulate datasets, work with groups, and make use of attributes. A real-world example of a processing pipeline of brain recordings, utilizing HDF5 for storing and managing data at each processing step, will be presented. Attendees will have access to an IPython notebook to follow along during the demo and explore examples. After this talk, attendees will be able to begin using HDF5 to effortlessly store and manage their data.
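One way such a pipeline might be laid out, sketched here under stated assumptions, is to keep every processing step in its own group inside a single file rather than in separately named files. The data below is synthetic random noise standing in for a recording, and the step names, the `source` attribute, and the sampling rate are hypothetical choices for this sketch, not the pipeline presented in the talk.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a recording: 4 channels x 1000 samples (not real data).
raw = rng.normal(size=(4, 1000))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "recording.h5")  # hypothetical file name

    # Store the raw data; intermediate groups in the path are created for us.
    with h5py.File(path, "w") as f:
        f.attrs["sampling_rate_hz"] = 1000  # assumed value for illustration
        f.create_dataset("raw/data", data=raw)

    # Step 1: remove each channel's mean and store the result in a new group.
    with h5py.File(path, "a") as f:
        data = f["raw/data"][...]
        step = f.create_group("demeaned")
        step.create_dataset("data", data=data - data.mean(axis=1, keepdims=True))
        step.attrs["source"] = "raw/data"  # record where this step came from

    # Step 2: scale each channel to unit standard deviation.
    with h5py.File(path, "a") as f:
        data = f["demeaned/data"][...]
        step = f.create_group("zscored")
        step.create_dataset("data", data=data / data.std(axis=1, keepdims=True))
        step.attrs["source"] = "demeaned/data"

    # Inspect the file: every step is present, along with its provenance.
    with h5py.File(path, "r") as f:
        steps = sorted(f.keys())
        provenance = {name: f[name].attrs.get("source") for name in steps}
        final = f["zscored/data"][...]

print(steps)
print(provenance)
```

Keeping the steps as sibling groups means the raw data, each intermediate result, and the metadata describing how it was produced travel together in one file.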

Outline:

Introduction (3-4 min)
- Who am I?
- How I began using HDF5
What is HDF5? (4-5 min)
- Brief history of HDF5
- Overview of primary features
- Explain why you’d use HDF5 (big-picture)
HDF5 specifics (5-6 min)
- HDF5 structure (datasets, groups, attributes)
- Expand on why you’d use HDF5 (detailed)
Getting started with HDF5 in Python {IPython notebook} (10-13 min)
- Imports (h5py, numpy) and data setup
- Creating and handling HDF5 files
- Working with datasets, groups, and attributes
- Examples using brain data recordings
What else can I do with HDF5? (3-5 min)
- Overview of advanced features (chunking, parallel I/O)
- How to use viewers (HDFView, HDFCompass)
Q & A (5-7 min)
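The chunking item in the outline, together with the compression mentioned in the description, can be sketched with h5py's `create_dataset` options. The array contents, chunk shape, and gzip level here are arbitrary choices for illustration. Parallel I/O is not shown, since it requires an MPI-enabled build of HDF5 and h5py.

```python
import os
import tempfile

import h5py
import numpy as np

# An all-zero array compresses extremely well; real data will compress less.
data = np.zeros((1000, 1000), dtype="float64")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunked.h5")  # hypothetical file name

    with h5py.File(path, "w") as f:
        f.create_dataset(
            "grid",
            data=data,
            chunks=(100, 100),      # stored on disk as 100 x 100 tiles
            compression="gzip",     # each tile is compressed independently
            compression_opts=4,     # gzip level, 0 (none) to 9 (maximum)
        )

    with h5py.File(path, "r") as f:
        dset = f["grid"]
        chunk_shape = dset.chunks
        codec = dset.compression
        # Only the chunks overlapping this slice are read and decompressed.
        block = dset[0:100, 0:100]

    size_on_disk = os.path.getsize(path)

print(chunk_shape, codec, block.shape)
print(size_on_disk, "bytes on disk vs", data.nbytes, "bytes in memory")
```

A chunk shape that matches the way the data is usually sliced tends to work best, because a read has to decompress every chunk it touches.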

¹ A Pyentist is a Python-programming scientist.