Saturday October 30 4:00 PM – Saturday October 30 4:30 PM in Talks II

Snowflake & Dask: How to scale workloads using distributed fetch capabilities

Miles Adkins, James Bourbeau, Mark Keller

Prior knowledge:
Previous knowledge expected: Dask & data warehouses

Summary

We will cover how Snowflake, a cloud data platform, integrates with Dask, a distributed compute framework, and the broader PyData ecosystem, allowing data scientists to work with the data in their warehouses while using the PyData tools they love. This session demonstrates best practices for using Snowflake's distributed fetch capabilities with Dask.
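
As a rough illustration of what this integration looks like in practice, here is a minimal sketch using the dask-snowflake package's read_snowflake helper to load a query result as a Dask DataFrame. The credentials and the table name my_table are placeholders, not values from the talk.

    # A minimal sketch: load a Snowflake query result into a Dask DataFrame,
    # assuming the dask-snowflake package. Credentials are placeholders.
    from dask_snowflake import read_snowflake

    connection_kwargs = {
        "user": "my_user",
        "password": "my_password",
        "account": "my_account",
        "warehouse": "my_warehouse",
        "database": "my_database",
        "schema": "PUBLIC",
    }

    # Each Dask worker fetches its own partition of the result in parallel,
    # so the full dataset never flows through a single client connection.
    ddf = read_snowflake(
        query="SELECT * FROM my_table",  # hypothetical table
        connection_kwargs=connection_kwargs,
    )

    print(ddf.head())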

Description

Exponential year-over-year growth in generated data has created demand for distributed compute engines that can handle data too large to fit in memory on a single machine.

While data warehouses have adopted a distributed query approach, they do not offer the free-form distributed operations and analyses that are crucial to machine learning. Python frameworks and libraries such as Dask, XGBoost, and PyTorch offer a way to distribute machine learning workloads.
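
To make the "distribute machine learning" step concrete, the sketch below trains an XGBoost model across a Dask cluster using xgboost's built-in Dask integration. The synthetic arrays stand in for data that would, in practice, arrive via a distributed fetch from the warehouse.

    # A sketch of distributed training on Dask collections with xgboost.dask.
    # The random data here is an illustrative placeholder.
    import dask.array as da
    import xgboost as xgb
    from dask.distributed import Client

    client = Client()  # connect to (or start) a Dask cluster

    # Partitioned synthetic features and binary labels.
    X = da.random.random((100_000, 10), chunks=(10_000, 10))
    y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

    # DaskDMatrix keeps each partition on the worker that already holds it,
    # so training is distributed rather than funneled through the client.
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"objective": "binary:logistic"},
        dtrain,
        num_boost_round=10,
    )
    booster = output["booster"]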

This creates a new issue: how do we move all the data from the warehouse into a distributed framework when the bandwidth of a single connection is limited? The answer: native distributed fetch capabilities, where each node in the distributed compute engine concurrently reads a portion of the data into memory.
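
One way to see the distributed-fetch pattern itself is through the result-batch API in snowflake-connector-python: the server splits a query result into independently downloadable batches, and each batch can be materialized on a different worker. This is a hedged sketch with placeholder connection parameters, not a description of any one library's internals.

    # Sketch of distributed fetch via Snowflake result batches plus Dask.
    # Connection parameters are placeholders.
    import dask
    import dask.dataframe as dd
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="my_user", password="my_password", account="my_account"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM my_table")  # hypothetical table

    # Each ResultBatch can be shipped to a worker and fetched there,
    # so downloads happen concurrently across the cluster.
    batches = cursor.get_result_batches()
    partitions = [dask.delayed(batch.to_pandas)() for batch in batches]
    ddf = dd.from_delayed(partitions)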