PyData New York City 2019 - Presentation: Improve the efficiency of your Big Data application

Our libraries take a novel approach: data is stored and processed in compressed form in memory, which keeps memory consumption low while maintaining high performance.

Caterva: Built on top of C-Blosc2, Caterva implements a simple multidimensional container for compressed data. It adds the capability to store, extract, and transform compressed data in these containers, either in memory or on disk.

IronArray: Built on top of Caterva and C-Blosc2, IronArray adds type safety as well as a computational engine, so that matrix and vector calculations are performed efficiently on compressed, multidimensional containers.

Caterva

While there are several existing solutions for storing compressed data (HDF5 is one of the best-known examples), Caterva brings the following novel features, which set it apart from its competitors:

In-memory compression: By default, the multidimensional container is stored entirely in memory in compressed form. This keeps memory consumption low while still providing high-performance access to the data.
Compression algorithms: Through C-Blosc2, the user can choose from various state-of-the-art compression algorithms and compression filters, making it possible to tune the trade-off between performance and memory efficiency.
Minimized memory copies: Unlike other solutions, which often treat compression as an afterthought, Caterva keeps the number of memory copies to a minimum and thereby increases performance.
On-disk persistence: In-memory and on-disk storage are supported in the same way, so the same API works whether the data lives in memory or on disk.
Plain buffer data layout: Support for a plain buffer layout allows essentially zero-copy data sharing with existing libraries such as NumPy, so their functionality can be used directly on Caterva containers without losing performance.
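
To make the idea of a chunked, compressed in-memory container more concrete, here is a minimal one-dimensional sketch. It is not Caterva's API: the class name `CompressedArray1D`, its methods, and its parameters are hypothetical, and the standard-library `zlib` codec stands in for C-Blosc2. It only illustrates how a slice can be extracted by decompressing the chunks that overlap it rather than the whole container.

```python
import zlib
from array import array


class CompressedArray1D:
    """Hypothetical 1-D container of float64 values kept as compressed chunks.

    zlib is used as a stand-in codec; Caterva itself relies on C-Blosc2.
    """

    def __init__(self, values, chunk_len=1024, level=5):
        self.chunk_len = chunk_len
        self.length = len(values)
        self.chunks = []
        # Compress each chunk independently so it can be read back on its own.
        for start in range(0, self.length, chunk_len):
            raw = array("d", values[start:start + chunk_len]).tobytes()
            self.chunks.append(zlib.compress(raw, level))

    def _chunk(self, index):
        """Decompress a single chunk into an array of doubles."""
        out = array("d")
        out.frombytes(zlib.decompress(self.chunks[index]))
        return out

    def get_slice(self, start, stop):
        """Return values[start:stop], touching only the overlapping chunks."""
        start = max(0, min(start, self.length))
        stop = max(start, min(stop, self.length))
        result = array("d")
        if start == stop:
            return result
        first = start // self.chunk_len
        last = (stop - 1) // self.chunk_len
        for index in range(first, last + 1):
            chunk = self._chunk(index)
            base = index * self.chunk_len
            lo = max(start - base, 0)
            hi = min(stop - base, len(chunk))
            result.extend(chunk[lo:hi])
        return result

    @property
    def compressed_nbytes(self):
        """Total size of the compressed chunks, in bytes."""
        return sum(len(c) for c in self.chunks)
```

A slice such as `get_slice(990, 1010)` with `chunk_len=1000` decompresses only two chunks, whatever the total size of the container. Smaller chunks reduce the work per access but usually compress less well, which is one face of the performance versus memory trade-off mentioned above.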

IronArray

IronArray implements a computational engine that is optimized to deal with compressed data. It adds type definitions to Caterva containers and works to minimize compression overhead so that calculations run seamlessly on them. Its ultimate goal is to perform computations on compressed containers at the same speed as on uncompressed containers.
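
As a rough illustration of computing on compressed containers, the sketch below evaluates an element-wise expression chunk by chunk, so that only one chunk per operand is uncompressed at any time. This is an assumption about the general technique, not IronArray's engine or API: the helper names `compress_chunks`, `decompress_chunk`, and `evaluate` are hypothetical, and `zlib` again stands in for C-Blosc2.

```python
import zlib
from array import array


def compress_chunks(values, chunk_len=1024):
    """Split values into float64 chunks and compress each one independently."""
    return [
        zlib.compress(array("d", values[i:i + chunk_len]).tobytes())
        for i in range(0, len(values), chunk_len)
    ]


def decompress_chunk(blob):
    """Decompress one chunk back into an array of doubles."""
    out = array("d")
    out.frombytes(zlib.decompress(blob))
    return out


def evaluate(func, *operands):
    """Apply func element-wise, one chunk at a time.

    Every operand is a list of compressed chunks; all operands are assumed
    to share the same length and chunk size. The result is returned as a
    list of compressed chunks as well.
    """
    result = []
    for blobs in zip(*operands):
        # Only the current chunk of each operand is held uncompressed.
        inputs = [decompress_chunk(b) for b in blobs]
        computed = array("d", (func(*xs) for xs in zip(*inputs)))
        result.append(zlib.compress(computed.tobytes()))
    return result
```

For example, `evaluate(lambda a, b: a * b + 1.0, cx, cy)` computes `x * y + 1` for two compressed operands without ever materializing a full uncompressed array. A real engine would use vectorized kernels and a much faster codec; the pure-Python loop here only shows the data flow.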

During our talk, we will introduce the features of Caterva and IronArray using cat4py, a Python wrapper for Caterva, and IronArray for Python.

Tuesday 10:05 AM–10:45 AM in Winter Garden (5412)

Francesc Alted, Christian Steiner
