Friday October 29 7:00 PM – Friday October 29 7:30 PM in Talks I

All you need is zarr: Parallel access to remote HDF5, TIFF, grib2 and others.

Martin Durant

Prior knowledge:
No previous knowledge expected

Summary

We introduce ReferenceFileSystem, a virtual filesystem implementation for fsspec that exposes arbitrary byte ranges under specific keys. It presents chunks of HDF5, TIFF, grib2 and other files at paths that conform to zarr's storage model. You can therefore use zarr to load data from potentially thousands of remote data files, selecting only what you need, with parallelism and concurrency.

Description

fsspec's ReferenceFileSystem provides a filesystem-like virtual view onto chunks of bytes stored in arbitrary locations elsewhere, e.g., cloud bucket storage. Each byte chunk is presented at a path in the filesystem that conforms to the zarr hierarchy model, so that the original set of chunks, potentially spread across many files, appears as a single zarr dataset. This brings the following advantages:

  • only zarr (plus the relevant codecs) is required to read the data, even though the original data may be stored in HDF5, TIFF or grib2 files (with more formats to come)
  • the metadata is extracted once, so subsequent opens of the dataset do not need to scan the target files for metadata and are therefore much faster
  • you get a single logical view over potentially thousands of files, but due to zarr's access model, you only load the data you need
  • loading can happen chunk-wise and in parallel
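
The idea can be sketched in plain Python. What follows is a minimal illustration, not fsspec's implementation: the class ReferenceStore, the helper make_demo_file and the key names are hypothetical, a local temporary file stands in for a remote data file, and the reference layout (an inline string, or a [path, offset, length] triple) is only loosely modelled on the kind of reference set kerchunk produces.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_demo_file(path):
    """Write a stand-in for a remote data file: two raw chunks surrounded
    by unrelated header and padding bytes (hypothetical layout)."""
    with open(path, "wb") as f:
        f.write(b"HEADER--" + b"chunk-zero" + b"~pad~" + b"chunk-one!")


class ReferenceStore:
    """Read-only mapping from zarr-style keys to bytes stored elsewhere.

    A reference is either a string (inline content, e.g. metadata) or a
    [path, offset, length] triple pointing into another file.
    """

    def __init__(self, refs):
        self.refs = refs

    def __getitem__(self, key):
        ref = self.refs[key]
        if isinstance(ref, str):
            # Inline reference: the bytes live in the reference set itself.
            return ref.encode()
        path, offset, length = ref
        # Byte-range reference: read only the requested slice of the file.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def getitems(self, keys, max_workers=4):
        """Fetch several keys concurrently, one read per chunk."""
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return dict(zip(keys, pool.map(self.__getitem__, keys)))


demo_path = os.path.join(tempfile.mkdtemp(), "data.bin")
make_demo_file(demo_path)

# The metadata is stored inline, so opening the dataset never scans the file.
refs = {
    "data/.zarray": json.dumps({
        "zarr_format": 2, "shape": [20], "chunks": [10], "dtype": "|u1",
        "compressor": None, "fill_value": 0, "filters": None, "order": "C",
    }),
    "data/0": [demo_path, 8, 10],   # bytes 8..17 of the file
    "data/1": [demo_path, 23, 10],  # bytes 23..32 of the file
}

store = ReferenceStore(refs)
meta = json.loads(store["data/.zarray"])
chunks = store.getitems(["data/0", "data/1"])
print(meta["chunks"], chunks)
```

In this sketch the reader never parses the container format: it only asks for keys, and each key resolves either to inline metadata or to a byte range, which is what lets independent chunks be fetched selectively and in parallel.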

The details of how to make such reference files are described at https://github.com/fsspec/kerchunk , and the latest result for some 80TB/370k files of water flow modelling can be seen at https://nbviewer.jupyter.org/gist/rsignell-usgs/02da7d9257b4b26d84d053be1af2ceeb . Note that this example uses xarray to process the data, but only zarr is needed to load it.