Monday 11:40 AM–12:20 PM in Central Park East (6501a)

Small Big Data: using NumPy and Pandas when your data doesn't fit in memory

Itamar Turner-Trauring

Audience level:
Intermediate

Description

Your data is too big to fit in memory (loading it crashes your program), but it's also too small to justify a complex Big Data cluster. How do you process it simply and quickly?

In this talk you'll learn the basic techniques for dealing with Small Big Data: money (buying or renting more RAM), compression, batching and parallelization, and indexing. In particular, you'll learn how to apply these techniques to NumPy and Pandas.

Abstract

Your data is big enough that loading it into memory crashes your program, but small enough that setting up a Big Data cluster isn't worth the trouble. You're dealing with Small Big Data, and in this talk you'll learn the basic techniques used to process data that doesn't fit in memory.

First, you can just buy—or rent—more RAM. Sometimes that isn't sufficient or possible, in which case you can also:

  1. Compress your data so it fits in RAM.
  2. Chunk your data processing so you don't have to load all of it at once, and if possible parallelize processing to use multiple CPUs.
  3. Index your data so you can quickly load into memory only the subset you actually care about.
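
As a rough illustration of the chunking idea in item 2, here is a minimal sketch in plain Python. It is not taken from the talk: the helper names `read_in_chunks` and `chunked_mean` are hypothetical, and a small generator stands in for a data source too large to load at once.

```python
from itertools import islice


def read_in_chunks(values, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size=1000):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Stand-in for a large data source (assumption for illustration only);
# a generator never materializes all of its values at once.
data = (x for x in range(1, 10001))
mean = chunked_mean(data, chunk_size=256)
```

Because each chunk is processed independently before the totals are combined, the same structure could in principle be spread across multiple CPUs, one chunk per worker.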

You'll also learn how to apply these techniques to NumPy:

  1. Compress using smaller data types and sparse arrays.
  2. Chunk using Zarr.
  3. Parallelize with Dask.
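
To make item 1 concrete, the following sketch shows how choosing a smaller data type can shrink a NumPy array. The array contents are invented for illustration (values between 0 and 255, such as 8-bit pixel intensities); whether a smaller type is safe depends on the actual range of your data.

```python
import numpy as np

# One million values in the range 0-255, stored as 64-bit integers.
# (Synthetic data, assumed for illustration only.)
rng = np.random.default_rng(0)
as_int64 = rng.integers(0, 256, size=1_000_000, dtype=np.int64)

# The same values fit in an unsigned 8-bit integer without loss.
as_uint8 = as_int64.astype(np.uint8)

bytes_int64 = as_int64.nbytes  # 8 bytes per value
bytes_uint8 = as_uint8.nbytes  # 1 byte per value
```

Here the array shrinks to one eighth of its original size with no loss of information, since every value fits in a single byte.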

As well as Pandas:

  1. Compress using smaller data types.
  2. Read in chunks.
  3. Parallelize with Dask.
  4. Index and quickly load partial subsets using HDF5.
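
Items 1 and 2 can be combined in a single `pandas.read_csv` call, as the sketch below suggests. The CSV contents and the column names `city` and `temp` are made up for illustration, and an in-memory string stands in for a file too large to load whole.

```python
import io

import pandas as pd

# Synthetic CSV with 1000 rows (assumed data, for illustration only).
rows = [f"{'NYC' if i % 2 else 'BOS'},{i % 40}" for i in range(1000)]
csv_text = "city,temp\n" + "\n".join(rows)

total_rows = 0
temp_sum = 0

# chunksize makes read_csv return an iterator of DataFrames, so only
# 100 rows are in memory at a time; dtype picks compact column types.
reader = pd.read_csv(
    io.StringIO(csv_text),
    chunksize=100,
    dtype={"city": "category", "temp": "int8"},
)
for chunk in reader:
    total_rows += len(chunk)
    temp_sum += int(chunk["temp"].sum())

mean_temp = temp_sum / total_rows
```

Only running totals survive between iterations, so peak memory use is governed by the chunk size, not the size of the file.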
