The Magic Behind PySpark: how it impacts performance & the "future"


Audience level:
Intermediate

Description

A look at how PySpark "works" today and how we can make it better in the future + insert engine noises of a fast car +

Abstract

This talk will introduce PySpark along with the magic done to make it work with (and be friends with) the JVM. We will discuss why lazy evaluation makes a huge difference in PySpark, both in terms of the general optimizations it opens up and Python-specific considerations. From there we will explore much of the future of Spark, DataFrames & Datasets, and what this means for PySpark. Most Spark DataFrame examples limit themselves to the relational-style query language, but we will explore how to add more functionality through UDFs.
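
To make the lazy-evaluation and UDF points concrete, here is a minimal sketch, assuming a local SparkSession; the data, column names, and the shout function are illustrative, not material from the talk itself:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").appName("lazy-udf-sketch").getOrCreate()

# Transformations are lazy: each call below only extends the query plan,
# nothing actually runs yet.
df = spark.createDataFrame([("spark", 1), ("pyspark", 2)], ["word", "count"])
filtered = df.filter(df["count"] > 1)

# A Python UDF escapes the relational query language, at the cost of
# shipping rows out to a Python worker process.
shout = udf(lambda s: s.upper(), StringType())
loud = filtered.select(shout(filtered["word"]).alias("loud_word"))

# Only an action such as show() forces the whole plan to execute.
loud.show()
```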

We will wrap up by looking at the different pieces of work being done to make PySpark faster, from better interchange formats like Apache Arrow to crazy harebrained schemes inspired by (but not the fault of) the JavaScript on Spark project.
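
As a rough sketch of what the Arrow-based work enables, here is a vectorized (pandas) UDF, assuming Spark 3.0+ with pyarrow installed; plus_tax and the price column are made up for illustration:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# With Arrow, whole column batches cross the JVM/Python boundary at once
# instead of being pickled row by row.
@pandas_udf("double")
def plus_tax(price: pd.Series) -> pd.Series:
    # Vectorized: operates on a batch of values in a single call.
    return price * 1.15

# Hypothetical usage on a DataFrame with a "price" column:
# taxed = df.withColumn("taxed_price", plus_tax("price"))
```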

Hopefully no one is scared away from using Spark once they see the 300 small gnome-like creatures behind the curtain, but parental guidance is encouraged for those who still believe in magic, reliable distributed systems, and vendor marketing brochures.
