If you're using NumPy and your data uses too much memory or requires too many computational resources, this talk is for you! We'll introduce Caterva and IronArray, two libraries that, when used together, can greatly improve the efficiency and reduce the cost of your big data applications.
Our libraries take a novel approach: data is stored and processed in memory in compressed form, which keeps memory consumption low while maintaining high performance.
Caterva: Built on top of C-Blosc2, it implements a simple multidimensional container for compressed data. It adds the capability to store, extract, and transform compressed data in these containers, either in memory or on disk.
IronArray: Built on top of Caterva and C-Blosc2, it adds type safety as well as a computational engine, so that matrix and vector calculations are performed efficiently on compressed, multidimensional containers.
While there are several existing solutions for storing compressed data (HDF5 is one of the best-known examples), Caterva brings the following novel features, which set it apart from its competitors:
In-memory compression: By default, the multidimensional container is stored entirely in memory in compressed form. This keeps memory consumption low while still providing high-performance access to the data.
Compression algorithms: By using C-Blosc2, the user can choose from various state-of-the-art compression algorithms and filters, which makes it possible to tune the trade-off between performance and memory efficiency.
Minimize memory copies: Compared to other solutions, which often treat compression as an afterthought, Caterva keeps the number of memory copies to a minimum and hence increases performance.
On-disk persistence: In-memory and on-disk containers are supported in the same way, so the same API can be used regardless of whether the data lives in memory or on disk.
Plain buffer support: Containers can also use a plain buffer data layout. This allows for essentially zero-copy data sharing with existing libraries (NumPy), so one can apply existing functionality directly to Caterva containers without losing performance.
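To make the in-memory compression idea concrete, here is a minimal sketch of a chunked, compressed container. It is not Caterva's API: the class `CompressedArray` and its methods are hypothetical, the container is one-dimensional for brevity, and the standard library's `zlib` stands in for the C-Blosc2 codecs. The point it illustrates is that a slice can be read by decompressing only the chunks that overlap it.

```python
import zlib
from array import array


class CompressedArray:
    """Toy 1-D float64 container held in memory as compressed chunks.

    Hypothetical illustration only; zlib stands in for C-Blosc2.
    """

    def __init__(self, values, chunk_len=1024, level=5):
        self.length = len(values)
        self.chunk_len = chunk_len
        self.chunks = []
        # Compress each fixed-size chunk independently.
        for start in range(0, self.length, chunk_len):
            raw = array("d", values[start:start + chunk_len]).tobytes()
            self.chunks.append(zlib.compress(raw, level))

    @property
    def nbytes(self):
        """Size of the data if it were stored uncompressed."""
        return self.length * 8

    @property
    def cbytes(self):
        """Size of the compressed chunks actually held in memory."""
        return sum(len(chunk) for chunk in self.chunks)

    def _load_chunk(self, index):
        chunk = array("d")
        chunk.frombytes(zlib.decompress(self.chunks[index]))
        return chunk

    def get_slice(self, start, stop):
        """Return values[start:stop], touching only the overlapping chunks."""
        start = max(0, start)
        stop = min(stop, self.length)
        result = array("d")
        if start >= stop:
            return result
        first = start // self.chunk_len
        last = (stop - 1) // self.chunk_len
        for index in range(first, last + 1):
            base = index * self.chunk_len
            chunk = self._load_chunk(index)
            result.extend(chunk[max(start - base, 0):stop - base])
        return result


# Repetitive sample data, chosen so that it compresses well.
data = [float(i % 100) for i in range(100_000)]
container = CompressedArray(data)
window = container.get_slice(1500, 2600)  # decompresses two chunks only
print(f"{container.nbytes} bytes uncompressed, {container.cbytes} bytes compressed")
```

The chunk length is the main tuning knob in a design like this: smaller chunks make partial reads cheaper, while larger chunks usually compress better.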
IronArray implements a computational engine that is optimized to deal with compressed data. It adds type definitions to Caterva containers and takes every measure to reduce the compression overhead so that calculations on them run seamlessly. Its ultimate goal is to perform computations on compressed containers at the same speed as on uncompressed ones.
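One way to picture computing on compressed containers is chunk-wise evaluation: decompress one chunk of each operand, compute, and compress the result before moving on, so only a chunk-sized working set is ever uncompressed. The sketch below is an illustration under that assumption, not IronArray's engine or API. All function names are hypothetical and `zlib` again stands in for C-Blosc2.

```python
import zlib
from array import array

CHUNK_LEN = 4096  # hypothetical chunk size, in elements


def compress_chunks(values, chunk_len=CHUNK_LEN):
    """Split a sequence of floats into independently compressed chunks."""
    return [
        zlib.compress(array("d", values[i:i + chunk_len]).tobytes())
        for i in range(0, len(values), chunk_len)
    ]


def load_chunk(blob):
    """Decompress a single chunk back into an array of float64 values."""
    chunk = array("d")
    chunk.frombytes(zlib.decompress(blob))
    return chunk


def evaluate(func, *operands):
    """Apply func elementwise over compressed operands, one chunk at a time.

    Operands must share the same length and chunk size. The result is
    returned as a list of compressed chunks.
    """
    result = []
    for blobs in zip(*operands):
        chunks = [load_chunk(blob) for blob in blobs]
        out = array("d", map(func, *chunks))
        result.append(zlib.compress(out.tobytes()))
    return result


def to_list(blobs):
    """Decompress a whole container (used here only to check the result)."""
    values = []
    for blob in blobs:
        values.extend(load_chunk(blob))
    return values


x = [i / 1000 for i in range(20_000)]
y = [2.0] * 20_000
z = evaluate(lambda a, b: a * b + 1.0, compress_chunks(x), compress_chunks(y))
```

A real engine would replace the Python-level `map` with compiled, vectorized kernels and process chunks in parallel. The sketch only shows the data flow.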
During our talk, we will introduce the features of Caterva and IronArray by using cat4py, a Python wrapper for Caterva, and IronArray for Python.