Single-cell sequencing generates a new kind of genomic data that promises to revolutionize our understanding of the fundamental units of life. The Human Cell Atlas is a multi-year, multi-institution effort to develop and standardize methods for generating and processing this data, which poses interesting storage and compute challenges.
I'll talk about recent work parallelizing the analysis of this data on a variety of distributed backends (Apache Spark, Dask, PyWren, Apache Beam). I'll also discuss Zarr, a format for storing and working with N-dimensional arrays that several scientific domains have recently gravitated toward in response to the challenges of using HDF5 in parallel and in the cloud.
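
To give a rough sense of why chunked array storage suits parallel backends, here is a toy sketch in plain Python. It is not Zarr's API and not the code from this work: the dict-based `store`, the function names, and the tiny cells-by-genes matrix are all hypothetical, chosen only to illustrate the idea. Each block of rows is stored under its own key, so independent workers can process chunks without coordinating, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: a cells x genes count matrix as a list of rows.
N_CELLS, N_GENES, CHUNK_ROWS = 10, 4, 3
matrix = [[cell * N_GENES + gene for gene in range(N_GENES)]
          for cell in range(N_CELLS)]


def write_chunks(rows, chunk_rows):
    """Split rows into row-chunks and store each chunk under its own key.

    A plain dict stands in for a key-value store (a directory of files or
    an object-store bucket would play this role in a real chunked format).
    """
    store = {}
    for start in range(0, len(rows), chunk_rows):
        key = str(start // chunk_rows)
        store[key] = rows[start:start + chunk_rows]
    return store


def chunk_gene_totals(chunk):
    """Per-gene sums for a single chunk; needs no data from other chunks."""
    return [sum(column) for column in zip(*chunk)]


def parallel_gene_totals(store):
    """Process every chunk independently, then combine the partial sums."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(chunk_gene_totals, store.values()))
    return [sum(column) for column in zip(*partials)]


store = write_chunks(matrix, CHUNK_ROWS)
totals = parallel_gene_totals(store)
print(sorted(store), totals)
```

The thread pool here is only a stand-in for a distributed backend; the point of the sketch is that the per-chunk step has no shared state, which is what lets the same pattern be handed to different execution engines.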