Presentation: Dask: From POC to Production

Time Zone

Thursday October 28 1:30 PM – Thursday October 28 2:00 PM in Talks I

Dask: From POC to Production

April Rathe

Prior knowledge:: No previous knowledge expected

Summary

Dask is a popular library for parallelizing computations in the PyData ecosystem. While we found it easy to demonstrate performance gains in POCs, it was more challenging to use Dask widely across our team. In this talk, we’ll describe our experiences integrating Dask into our environment. We'll provide a roadmap of decisions, development and patterns for taking Dask from POC to production.

Description

Quickly analyzing large historical datasets is key to our quantitative investment research efforts. These efforts can be constrained when working with datasets bigger than available memory or by long-running, complex computations. Dask offers easy-to-learn interfaces for parallelizing computations for researchers already familiar with NumPy and Pandas. Library selection was easy; widespread use was harder. There are limited guides available to help navigate the options, decisions and patterns to successfully integrate Dask. In this talk, we aim to provide an example guide for teams looking to enable the use of Dask across their team.