Monday 2:50 PM–3:30 PM in Central Park West (6501)

Is Spark still relevant? Multi-node CPU and single-node GPU workloads with Spark, Dask and RAPIDS.

Eric Dill

Audience level:
Intermediate

Description

We compare Dask, Spark, and RAPIDS on a typical business workload: dataset exploration, model building, prediction, and dashboarding / reporting. Of specific interest are infrastructure requirements, ease of use, and the overall practicality of these compute frameworks. This talk specifically addresses how relevant each of these systems is to the kinds of problems modern businesses actually face.

Abstract

This talk compares Dask, Spark, and RAPIDS for data science use cases, particularly in traditional business workloads.

This talk compares the implementation in each of these ecosystems along the following axes: (a) the infrastructure requirements of each, (b) the difficulty of development and deployment for someone already familiar with Spark on YARN but not Dask or RAPIDS, and (c) performance (core-hours, gigabyte-hours, and cost). We source data from S3. For Spark, we use EMR on AWS. For Dask, we reuse the same EMR cluster, along with EKS and bare EC2 instances. For RAPIDS (GPU-based), we use a local workstation and a GPU-equipped cloud instance.
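To make the comparison concrete, here is a minimal sketch of the same S3-backed aggregation expressed in each framework. The bucket path, column names, and the aggregation itself are hypothetical illustrations, not the workload from the talk, and all cluster provisioning (EMR, EKS, EC2, GPU drivers) is omitted:

```python
# Hypothetical Parquet dataset at s3://example-bucket/events/ with columns
# "user_id" and "amount". Each snippet assumes its framework is already
# installed and pointed at a configured cluster or GPU.

# --- Spark (e.g., on EMR) ---
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compare").getOrCreate()
sdf = spark.read.parquet("s3://example-bucket/events/")
spark_result = sdf.groupBy("user_id").sum("amount").toPandas()

# --- Dask (same EMR cluster, EKS, or bare EC2; S3 access via s3fs) ---
import dask.dataframe as dd

ddf = dd.read_parquet("s3://example-bucket/events/")
dask_result = ddf.groupby("user_id")["amount"].sum().compute()

# --- RAPIDS / cuDF (local GPU workstation or GPU cloud instance) ---
import cudf

gdf = cudf.read_parquet("s3://example-bucket/events/")
rapids_result = gdf.groupby("user_id")["amount"].sum()
```

Note how similar the Dask and cuDF snippets are to pandas: much of the "difficulty of development" axis comes down to how far each API diverges from what a data scientist already knows.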

This talk will end with a discussion of the practicality of Dask, Spark, and RAPIDS. We will address questions such as:

* “What if I already have a large Spark / Hadoop investment?” (see the sketch after this list)
* “For a greenfield project, what’s the recommendation?”
* “New architectures are especially hard to adopt in businesses. Are there any really compelling reasons to use Dask / RAPIDS over Spark?”
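On the existing-investment question, one relevant fact is that Dask can be deployed onto a Hadoop/YARN cluster without new infrastructure, via the dask-yarn project. A minimal sketch, assuming a conda environment packaged as environment.tar.gz (e.g., with conda-pack); the worker sizing and dataset path are illustrative, not recommendations:

```python
# Hedged sketch: running Dask on an existing YARN cluster with dask-yarn.
from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(
    environment="environment.tar.gz",  # assumed conda-pack archive shipped to YARN containers
    worker_vcores=2,
    worker_memory="8GiB",
)
cluster.scale(10)          # ask YARN for 10 workers
client = Client(cluster)   # ordinary Dask code now runs on the Hadoop cluster

import dask.dataframe as dd
ddf = dd.read_parquet("s3://example-bucket/events/")  # hypothetical path
print(ddf["amount"].sum().compute())
```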

This talk will be especially relevant if you are already familiar with one or more of these tools, but it will be of general interest to practicing data scientists, data engineers, IT, DevOps, analysts, and engineering or analytics management.
