Dask is a flexible tool for parallelizing Python code on a single machine or across a cluster. It builds upon familiar tools in the PyData ecosystem (e.g. NumPy and Pandas) while allowing them to scale across multiple cores or machines. This tutorial will cover both the high-level use of dask collections, as well as the low-level use of dask graphs and schedulers.
We can think of dask at a high level and a low level:
- High-level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic and build upon NumPy arrays, Python lists, and Pandas DataFrames, but that can operate in parallel on datasets that do not fit into main memory (see the sketch after this list).
- Low-level schedulers: Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also power custom, user-defined workloads to expose latent parallelism in procedural code. These schedulers are low-latency and run computations with a small memory footprint.
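As a taste of the high-level collections, here is a minimal sketch (not part of the tutorial materials): the chunk sizes, the `data/*.csv` glob, and the `name`/`amount` column names are illustrative assumptions.

```python
# Minimal sketch of dask.array and dask.dataframe; chunk sizes,
# the "data/*.csv" path, and column names are illustrative only.
import dask.array as da
import dask.dataframe as dd

# dask.array: a NumPy-like array split into 1000x1000 chunks.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
result = (x + x.T).mean(axis=0)     # builds a task graph lazily
print(result.compute())             # executes the graph in parallel

# dask.dataframe: a Pandas-like frame spread over many CSV files.
df = dd.read_csv("data/*.csv")
print(df.groupby("name").amount.sum().compute())
```

Operations on these collections only build a task graph; nothing runs until `.compute()` hands the graph to a scheduler.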
Different users operate at different levels, but it is useful to understand both. This tutorial will cover both the high-level use of dask.array and dask.dataframe and the low-level use of dask graphs and schedulers. Attendees will come away knowing how to use dask.delayed to parallelize existing code, how to work with the high-level collections (dask.array and dask.dataframe), and how and when to use each.
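For the low-level side, the sketch below shows how dask.delayed can wrap ordinary Python calls into a task graph that a scheduler then runs in parallel; the functions inc, double, and add are toy stand-ins for real user code, not part of the tutorial.

```python
# Minimal sketch of dask.delayed; inc, double, and add are toy
# stand-ins for functions in existing code.
from dask import delayed

def inc(x):
    return x + 1

def double(x):
    return 2 * x

@delayed
def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = delayed(inc)(x)       # wrap the call: nothing runs yet
    b = delayed(double)(x)
    output.append(add(a, b))  # dependencies are tracked automatically

total = delayed(sum)(output)
print(total.compute())                        # default threaded scheduler -> 50
print(total.compute(scheduler="processes"))   # same graph, different scheduler
```

Calling `.compute()` runs the graph on the default threaded scheduler; passing `scheduler="processes"` (or connecting to a distributed cluster) swaps the execution engine without changing the delayed code.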