Thursday 1:20 PM–2:00 PM in Radio City (#6604)

How to spend ¾ of your yearly budget in 3 weeks: a PySpark cautionary tale

Nicole Carlson

Audience level:
Intermediate

Description

This is a cautionary tale of how I spent a large amount of money running PySpark clusters and what I discovered about deploying a recommendation engine at scale. I’ll describe how I finally tracked down an intermittent error and the lessons I learned along the way, including why you can’t always write PySpark code like Python code and how important it is to check third-party library integrations.

Abstract

This talk is a cautionary tale of how I chased down an intermittent S3 ExpiredToken error (which was hiding a completely different error) and what I learned about PySpark along the way.

My company, ShopRunner, is an online members-only marketplace that partners with retailers like Newegg, Bloomingdales, and Neiman Marcus. One of our projects is recommending items similar to a target item. Our first recommendation engine was built on order data.

Since the order-based engine had worked well, I assumed it would be straightforward to modify the code to handle pageviews. Instead, I ran into confusing, intermittent PySpark errors, and three weeks later I realized that I had spent a very large chunk of our budget.

While trying to solve this issue, I made some classic errors: throwing more nodes and clusters at the problem, rerunning the same notebook and hoping it would work, and not trying the code on a smaller test set first.
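As a quick illustration of the "smaller test set first" lesson, here is a minimal sketch (my own, not code from the talk) of developing against a sample before running on the full dataset; the S3 path and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-first").getOrCreate()

# Hypothetical pageview data; substitute your real source.
pageviews = spark.read.parquet("s3://example-bucket/pageviews/")

# Develop against ~1% of the data; scale up only once this runs cleanly.
sample = pageviews.sample(False, 0.01, seed=42)

views_per_item = sample.groupBy("item_id").count()
views_per_item.show(10)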

I’ll discuss better methods for debugging PySpark, more efficient ways to write PySpark code (you can’t just write it like Python code), and some issues to watch for when integrating third-party libraries with each other.
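To give a flavor of the "you can’t write it like Python" point, here is a hedged sketch (my own example, not from the talk) contrasting a driver-side Python loop with Spark’s built-in column expressions; the toy data is invented.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-not-python").getOrCreate()

df = spark.createDataFrame(
    [("shoes", 59.99), ("jacket", 120.00), ("socks", 8.50)],
    ["item", "price"],
)

# Python habit: pull every row to the driver and loop over it.
# Fine for three rows, disastrous for billions.
prices_with_tax = [row.price * 1.08 for row in df.collect()]

# Spark habit: express the same computation as a column expression,
# so it runs in parallel on the executors and never leaves the cluster.
with_tax = df.withColumn("price_with_tax", F.col("price") * 1.08)
with_tax.show()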

At the end of this talk, I hope you'll have picked up some tips and tricks for running PySpark clusters at scale and staying on (or under) budget.
