Dask is a flexible tool for parallelizing Python code on a single machine or across a cluster. It builds upon familiar tools in the PyData ecosystem (e.g. NumPy and Pandas) while allowing them to scale across multiple cores or machines. This tutorial will cover both the high-level use of dask collections, as well as the low-level use of dask graphs and schedulers.
We can think of dask at a high level and a low level:
- High-level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic and build upon NumPy arrays, Python lists, and Pandas DataFrames, but that can operate in parallel on datasets that do not fit into main memory (see the sketch after this list).
- Low-level schedulers: Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also power custom, user-defined workloads to expose latent parallelism in procedural code. These schedulers are low-latency and run computations with a small memory footprint.
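As a taste of the high-level collections, here is a minimal sketch (not part of the tutorial materials): the chunk sizes, the `data/*.csv` glob, and the `name`/`amount` column names are illustrative assumptions.

```python
# Minimal sketch of dask.array and dask.dataframe; chunk sizes,
# the "data/*.csv" path, and column names are illustrative only.
import dask.array as da
import dask.dataframe as dd

# dask.array: a NumPy-like array split into 1000x1000 chunks.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
result = (x + x.T).mean(axis=0)     # builds a task graph lazily
print(result.compute())             # executes the graph in parallel

# dask.dataframe: a Pandas-like frame spread over many CSV files.
df = dd.read_csv("data/*.csv")
print(df.groupby("name").amount.sum().compute())
```

Operations on these collections only build a task graph; nothing runs until `.compute()` hands the graph to a scheduler.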
Different users operate at different levels, but it is useful to understand both. This tutorial will cover both the high-level use of dask.array and dask.dataframe and the low-level use of dask graphs and schedulers. Attendees will come away knowing how to use dask.delayed to parallelize existing code, how to work with the high-level collections (dask.array and dask.dataframe), and how and when to use each.
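For the low-level side, the sketch below shows how dask.delayed can wrap ordinary Python calls into a task graph that a scheduler then runs in parallel; the functions inc, double, and add are toy stand-ins for real user code, not part of the tutorial.

```python
# Minimal sketch of dask.delayed; inc, double, and add are toy
# stand-ins for functions in existing code.
from dask import delayed

def inc(x):
    return x + 1

def double(x):
    return 2 * x

@delayed
def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = delayed(inc)(x)       # wrap the call: nothing runs yet
    b = delayed(double)(x)
    output.append(add(a, b))  # dependencies are tracked automatically

total = delayed(sum)(output)
print(total.compute())                        # default threaded scheduler -> 50
print(total.compute(scheduler="processes"))   # same graph, different scheduler
```

Calling `.compute()` runs the graph on the default threaded scheduler; passing `scheduler="processes"` (or connecting to a distributed cluster) swaps the execution engine without changing the delayed code.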