Wednesday 3:40 p.m.–4:15 p.m.

Dask - Parallelizing NumPy/Pandas through Task Scheduling

Jim Crist

Audience level:
Intermediate

Description

Dask is a pure Python library that allows for easy parallelism through task scheduling and blocked algorithms. By leveraging the existing PyData ecosystem (NumPy, Pandas, etc.), as well as some clever algorithms, we're able to compute on arrays and dataframes that are larger than memory while exploiting parallelism.

Abstract

The PyData ecosystem is great for doing data analysis. Packages like NumPy and Pandas provide an excellent interface for running complicated computations on datasets. With only a few lines of code one can load some data into a NumPy array, run some analysis, and plot the results. However, this workflow starts to falter when working with data that's larger than the memory on your computer.

Dask is designed to fit the space between in-memory tools like NumPy/Pandas and distributed tools like Spark/Hadoop. By using blocked algorithms and the existing Python ecosystem, it's able to work efficiently on large arrays or dataframes - often in parallel.
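
To make the idea of a blocked algorithm concrete, here is a minimal sketch using only the Python standard library. It is not Dask's API or implementation: the names `split_into_blocks` and `blocked_mean` are hypothetical, and a plain list stands in for what would normally be a NumPy array. The sketch assumes a non-empty input and computes a mean by reducing each block independently, then combining the small per-block results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(values, block_size):
    """Cut a sequence into consecutive blocks of at most block_size items."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def blocked_mean(values, block_size, max_workers=4):
    """Compute a mean from per-block partial results (hypothetical helper)."""
    blocks = split_into_blocks(values, block_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each block is reduced on its own, so the tasks are independent
        # and a scheduler is free to run them in any order or in parallel.
        partials = list(pool.map(lambda block: (sum(block), len(block)), blocks))
    # Only the small (sum, count) pairs are combined at the end.
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


data = list(range(1, 101))
result = blocked_mean(data, block_size=10)
```

Only one block needs to be held by a worker at a time, which is what lets this style of algorithm extend to data that does not fit in memory. With pure-Python work like `sum` the threads here are unlikely to give a speedup; the point of the sketch is the structure of the computation.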

In this talk we'll discuss both the what and the how of Dask. Starting from examples of using Dask collections that mirror NumPy arrays and Pandas DataFrames, we'll then dive into how these collections are actually implemented. Along the way we'll discuss the global interpreter lock and its relevance (or lack of relevance) to parallel computation in numeric Python.
