Friday, October 29, 2:00 PM – 4:00 PM in Workshop/Tutorial I

Data Processing at Scale

Benjamin Zaitlen, James Bourbeau, Martin Durant, Matthew Powers, Richard Zamora

Prior knowledge:
Previous knowledge expected
Pandas, Parquet, cloud storage

Summary

While PyData is the standard for interactive data science, it has historically lagged behind Apache Spark for large-scale data processing and ETL workloads. This workshop digs into the challenges around scalability, shows how those challenges have recently been addressed throughout the ecosystem, and highlights case studies where these changes have had a strong impact, especially Dask and Parquet at scale.

Description

While Python has long been popular for interactive data analysis, large-scale data processing and ETL workloads have historically been handled by tools like Hadoop or Apache Spark. However, as Python gains the capacity to operate at scale through technologies like Dask, RAPIDS, and Apache Arrow, some users are converting large-scale data processing workloads from Spark to a more PyData-native approach, as sketched below.
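As a point of reference, a PyData-native pipeline typically swaps a Spark session for a Dask cluster and works on familiar pandas-style dataframes. The sketch below is illustrative only: the bucket path and dataset layout are hypothetical, and reading s3:// paths additionally requires the s3fs package.

    import dask.dataframe as dd
    from dask.distributed import Client

    # Start a local cluster; for real scale, pass the address of a
    # deployed scheduler instead (e.g. on Kubernetes or an HPC system).
    client = Client()

    # Lazily read a partitioned Parquet dataset from object storage.
    # The bucket and layout here are hypothetical stand-ins.
    df = dd.read_parquet("s3://example-bucket/events/")

    # Work proceeds with the familiar pandas API: one pandas DataFrame
    # per partition, scheduled across the cluster.
    print(df.npartitions)
    print(df.head())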

Getting this to work smoothly is hard, though, and has required a number of changes throughout the PyData ecosystem. This workshop digs into the challenges practitioners face today by walking through a large, representative workload, tracking down the problems that arise, and showing how they have been fixed through collaboration across several PyData projects.
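For concreteness, a representative workload of the kind described here might look like the following end-to-end ETL sketch. The dataset, columns, and paths are hypothetical, not the workshop's actual example.

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()  # connect to a cluster (local by default)

    # Extract: read only the needed columns from a (hypothetical)
    # partitioned Parquet dataset in cloud storage.
    df = dd.read_parquet(
        "s3://example-bucket/trips/2021/",
        columns=["passenger_count", "trip_distance", "fare_amount"],
    )

    # Transform: filter out bad records and compute an aggregate.
    # Everything is lazy until compute() or to_parquet() is called.
    cleaned = df[df.trip_distance > 0]
    summary = cleaned.groupby("passenger_count").fare_amount.mean()

    # Load: write the cleaned data back out as Parquet, and pull the
    # small aggregate result down to the client.
    cleaned.to_parquet("s3://example-bucket/trips-cleaned/")
    print(summary.compute())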