We will cover how Snowflake, a cloud data warehouse, integrates with Dask, a distributed compute framework, and the broader PyData ecosystem, allowing data scientists to work with the data in their warehouse while using the PyData tools they love. This session demonstrates best practices for using Snowflake’s distributed fetch capabilities with Dask.
Year-over-year exponential growth in data has created demand for distributed compute engines that can handle datasets too large to fit in memory, or even on a single machine.
While data warehouses have adopted distributed query execution, they cannot offer the free-form distributed operations and analyses that are crucial to machine learning. Python frameworks and libraries such as Dask, XGBoost, and PyTorch offer a way to distribute machine learning workloads.
This creates a new problem: how do we move all the data from our warehouse into a distributed framework when bandwidth is limited? The answer: native distributed fetch, in which each node of the compute cluster concurrently reads a portion of the result set into memory, rather than funneling everything through a single client.
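As a minimal sketch of what this pattern can look like, the Snowflake Python connector exposes `get_result_batches()`, which returns a list of lightweight, serializable batch handles instead of the rows themselves; each handle can be shipped to a Dask worker, which downloads its own chunk directly from Snowflake. The connection parameters, the table name `my_table`, and the query below are placeholders, not part of the original session.

```python
import dask
import dask.dataframe as dd
import snowflake.connector

# Connection parameters are placeholders -- fill in your own account details.
conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

cur = conn.cursor()
cur.execute("SELECT * FROM my_table")  # hypothetical query for illustration

# Instead of fetching all rows through this one client, ask Snowflake for
# the individual result batches. Each ResultBatch knows how to download
# its own chunk of the result and can be serialized to a remote worker.
batches = cur.get_result_batches()

@dask.delayed
def load_batch(batch):
    # Runs on a Dask worker: the worker itself pulls its chunk directly
    # from Snowflake and materializes it as a pandas DataFrame.
    return batch.to_pandas()

# Assemble the lazily loaded chunks into one distributed DataFrame.
ddf = dd.from_delayed([load_batch(batch) for batch in batches])
```

Because each `to_pandas()` call executes on a worker, the download is parallelized across the cluster, and no single machine ever has to hold, or transfer, the full result set.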