Wednesday 2:45 PM–3:25 PM in Radio City (#6604)

The TileDB Array Data Storage Manager

Stavros Papadopoulos, Jake Bolewski

Audience level:
Intermediate

Description

TileDB is an open-source storage manager for multi-dimensional sparse and dense array data. It has a novel architecture that addresses some of the pain points in storing array data on “big-data” and “cloud” storage architectures. This talk will highlight TileDB’s design and its ability to integrate with analysis environments relevant to the PyData community, such as Python, R, and Julia.

Abstract

TileDB is an open-source, MIT-licensed, embeddable storage manager for persisting multi-dimensional array data that arise in scientific applications. In contrast to existing scientific array data management systems, TileDB is optimized for both multi-dimensional sparse and dense arrays. The TileDB storage format (based on log-structured merge trees) is well adapted to modern append-only / cloud storage backends, including HDFS and AWS S3. TileDB’s key idea is to organize array elements into ordered collections called fragments. Each fragment is dense or sparse, and groups contiguous array elements into data tiles of fixed capacity. The organization of data into fragments turns random writes / updates into sequential writes, and, coupled with a novel read algorithm, leads to efficient reads. TileDB enables effective compression (supporting a variety of state-of-the-art compressors) and massively parallel reads / writes (through process- and thread-safety).
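
The fragment idea can be illustrated with a small toy model in Python. This is only a sketch under stated assumptions and does not use TileDB's API or on-disk format: the names `Fragment`, `ToyArray`, and `tile_capacity` are hypothetical, the array is a 2-D sparse array held in memory, and the read path is a naive "newest fragment wins" merge rather than TileDB's actual read algorithm. It shows how each write becomes a new ordered, append-only fragment split into fixed-capacity tiles, so an update never modifies earlier data in place.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]
Cell = Tuple[Coord, float]


class Fragment:
    """Hypothetical immutable batch of cells produced by a single write."""

    def __init__(self, cells: Dict[Coord, float], tile_capacity: int) -> None:
        # Sort cells by coordinate (row-major) so the fragment is an
        # ordered collection that could be written out sequentially.
        ordered: List[Cell] = sorted(cells.items())
        # Group consecutive cells into data tiles of fixed capacity
        # (the last tile may be partially filled).
        self.tiles: List[List[Cell]] = [
            ordered[i:i + tile_capacity]
            for i in range(0, len(ordered), tile_capacity)
        ]


class ToyArray:
    """Hypothetical in-memory stand-in for a fragment-based sparse array."""

    def __init__(self, tile_capacity: int = 2) -> None:
        self.tile_capacity = tile_capacity
        self.fragments: List[Fragment] = []

    def write(self, cells: Dict[Coord, float]) -> None:
        # A write or update only appends a new fragment; existing
        # fragments are never modified.
        self.fragments.append(Fragment(cells, self.tile_capacity))

    def read(self, rows: Tuple[int, int], cols: Tuple[int, int]) -> Dict[Coord, float]:
        # Naive merge (an assumption, not TileDB's read algorithm):
        # scan fragments oldest to newest so later writes override
        # earlier ones for the same coordinate.
        result: Dict[Coord, float] = {}
        for fragment in self.fragments:
            for tile in fragment.tiles:
                for (r, c), value in tile:
                    if rows[0] <= r <= rows[1] and cols[0] <= c <= cols[1]:
                        result[(r, c)] = value
        return result


arr = ToyArray(tile_capacity=2)
arr.write({(0, 0): 1.0, (0, 1): 2.0, (3, 3): 3.0})  # first fragment
arr.write({(0, 1): 20.0, (5, 5): 4.0})              # update lands in a second fragment
subarray = arr.read(rows=(0, 3), cols=(0, 3))       # inclusive ranges
```

In this toy model the update to cell (0, 1) is stored in the second fragment, and the read resolves it by preferring the newer fragment. A real storage manager would avoid scanning every tile, for example by consulting per-tile bounding information, which this sketch omits.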

This talk will focus on the high-level design of TileDB and how TileDB’s features and built-in parallelism can be leveraged by users in Python, R, etc. to accelerate data science applications. Attendees who have used existing array data management solutions such as HDF5 and need to store and query massive amounts of array data locally, on distributed file systems, or in the cloud using Python or another high-level technical computing language/environment should benefit from this talk.

TileDB was originally developed as a research project at the ISTC-BD (a collaboration between Intel Labs and MIT) and has seen production use in various genomics applications. TileDB, Inc., the company, was formed to continue TileDB’s development after the eventual termination of ISTC-BD and bring TileDB’s benefits to the larger data science community.
