Monday 4:20 PM–5:00 PM in Music Box 5411/Winter Garden 5412 (5th fl)

Improving Pandas and PySpark performance and interoperability with Apache Arrow

Li Jin

Audience level:
Intermediate

Description

Apache Spark has become a popular and successful way for Python programmers to parallelize and scale up data processing. However, it is not well integrated with popular Python tools such as Pandas, and using Pandas with PySpark often results in poor performance. In this talk, we will demonstrate how we improve PySpark performance with Apache Arrow.

Abstract

Apache Spark has become a popular and successful way for Python programmers to parallelize and scale up data processing. In many use cases, though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master.
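A concrete example of that cost is collecting a Spark DataFrame into Pandas: the default toPandas() path serializes the data row by row through Py4J. A minimal sketch of the Arrow-backed alternative, assuming Spark 2.3+ (where the spark.sql.execution.arrow.enabled setting is available) and pyarrow installed; the app name and example data are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

    # Assumption: Spark 2.3+, where this configuration flag exists. It
    # switches toPandas() to transfer Arrow columnar batches instead of
    # serializing individual rows through Py4J.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    df = spark.range(1 << 20).selectExpr("id", "id * 2 AS doubled")

    # The columnar transfer typically makes this conversion much faster.
    pdf = df.toPandas()
    print(pdf.head())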

Apache Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables them to be used together seamlessly and efficiently, with minimal overhead. When collocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When remote, scatter-gather I/O sends the memory representation directly to the socket, avoiding serialization costs.
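To make the IPC idea concrete, here is a hypothetical sketch using the pyarrow library: a Pandas DataFrame becomes an Arrow record batch whose in-memory columnar layout is written to an IPC stream byte for byte, so neither side performs per-value serialization. The data and column names are illustrative only:

    import pandas as pd
    import pyarrow as pa

    # Illustrative data; the column names are made up for this sketch.
    df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

    # Convert to Arrow's columnar in-memory representation.
    batch = pa.RecordBatch.from_pandas(df)

    # Write the batch to an in-memory IPC stream; the bytes written are the
    # same columnar layout Arrow uses in memory, so there is no encode step.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)

    # The receiving side maps the buffer back into Arrow arrays without
    # copying or deserializing individual values.
    reader = pa.ipc.open_stream(pa.BufferReader(sink.getvalue()))
    print(reader.read_all().to_pandas())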

In this talk, we will demonstrate how we improve PySpark performance with Apache Arrow.
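One example of the kind of Arrow-backed improvement involved, sketched here under the assumption of Spark 2.3+ rather than as the talk's exact material, is the vectorized Pandas UDF interface, pyspark.sql.functions.pandas_udf. It moves data between the JVM and Python as Arrow columnar batches and hands the UDF whole Pandas Series instead of one Python object per row:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

    # A vectorized UDF: v arrives as a pandas Series holding a batch of
    # rows, transferred from the JVM as Arrow columnar data.
    @pandas_udf("double")
    def plus_one(v):
        return v + 1.0

    df = spark.range(10).withColumn("value", plus_one("id"))
    df.show()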
