Sunday 13:30–14:05 in Megatorium

DataFrames: scaling up and out

Ondrej Kokes

Audience level:
Novice

Description

DataFrames of all sorts have become very popular for tabular data analytics thanks to their nifty APIs and ease of use. But since they usually operate as in-memory engines, they can be hard to scale. In my talk, I'd like to outline several ways you can scale your compute platform to handle larger datasets without incurring much cost.

Abstract

If you've dealt with tabular data in Python, you've probably used pandas. But given its in-memory nature, you may have struggled when you tried to load larger and larger datasets. What I'd like to cover is how to overcome this issue in various ways.

Much like in other areas, there isn't a one-size-fits-all solution. I will show four ways I have overcome this issue in the past, each chosen for a different reason. These approaches, sketched briefly after the list, are:

  1. Changing application logic to handle streams rather than loading the whole dataset into memory.
  2. Actually scaling up - locally by buying more memory and/or faster disk drives, or by deploying servers in the cloud and SSH tunnelling to remote Jupyter instances.
  3. Scaling your data source and utilising pandas' SQL connector. This will help in other areas as well (e.g. direct connections in BI).
  4. Using a distributed DataFrame engine - dask or PySpark. These scale from laptops to large clusters, using the very same API the whole way through.
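To give a flavour of approaches 1, 3 and 4, here is a minimal sketch in pandas and dask. The file names, column names and connection string are hypothetical placeholders for illustration, not material from the talk itself.

    import pandas as pd
    import sqlalchemy
    import dask.dataframe as dd

    # Approach 1: stream the file in chunks instead of loading it all at once.
    # "events.csv" and the "amount" column are hypothetical placeholders.
    total = 0
    for chunk in pd.read_csv("events.csv", chunksize=100_000):
        total += chunk["amount"].sum()

    # Approach 3: let the database do the heavy lifting and pull back only the result.
    # The connection string is a placeholder for whatever your data source uses.
    engine = sqlalchemy.create_engine("postgresql://user:password@dbhost/mydb")
    summary = pd.read_sql(
        "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id", engine
    )

    # Approach 4: a distributed DataFrame engine with a pandas-like API.
    # dask splits the work across cores on a laptop or across a whole cluster.
    ddf = dd.read_csv("events-*.csv")
    per_user = ddf.groupby("user_id")["amount"].sum().compute()

Which of these fits depends on the dataset, the infrastructure already in place, and how much operational overhead you are willing to take on; that trade-off is what the talk walks through.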

I will cover the differences between these approaches and outline their upsides (e.g. scaling and performance) and downsides (DevOps difficulties, cost). I won't be doing any live demos (demo gods have not been kind to me in the past); instead, I will mainly focus on the use cases that fit each approach and what to look out for.
