Monday 3:40 PM–4:20 PM in Music Box 5411/Winter Garden 5412 (5th fl)

Replacing Hadoop with Your Laptop: The Case for Multiprocessing

Vicki Boykis

Audience level:
Intermediate

Description

Many companies work with big data, but distributed systems are complicated and come with a number of tradeoffs. In this talk, I'll walk through analyzing the same dataset in bash, with Python multiprocessing, and with PySpark running locally. I'll also talk about some of the tradeoffs you make as you move from local environments to distributed systems.

Abstract

Hadoop clusters and other distributed systems are a common go-to architectural pattern for working with large amounts of data. But what constitutes large? By the time you set up a Hadoop cluster, Kerberize it, and get your analysts up to speed on Spark/JVM/HDFS, you could have already analyzed the data on a single server, or even your laptop.

Based on my consulting work in the Hadoop and big data space, I'll walk through how to think about your data size and analysis paradigms using:

+ Bash processing
+ Python multiprocessing
+ Spark with PySpark locally

focusing on understanding the tradeoffs between the three environments. I'll also cover considerations specific to big data environments.
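To make the comparison concrete, here is a minimal sketch (not taken from the talk) of the same small analysis done with Python multiprocessing and with PySpark running locally. The data/*.log paths and the "ERROR" filter are placeholder assumptions, and the equivalent bash one-liner is noted in a comment.

```python
# Bash equivalent, for reference:
#   grep -c "ERROR" data/*.log

from multiprocessing import Pool
from pathlib import Path


def count_errors(path):
    """Count lines containing 'ERROR' in a single file."""
    with open(path, errors="ignore") as f:
        return sum(1 for line in f if "ERROR" in line)


if __name__ == "__main__":
    files = sorted(Path("data").glob("*.log"))  # hypothetical input files

    # Python multiprocessing: one worker process per CPU core by default.
    with Pool() as pool:
        total = sum(pool.map(count_errors, files))
    print("multiprocessing total:", total)

    # PySpark running locally on all cores ("local[*]"): same answer,
    # but with the extra overhead of starting a JVM-backed SparkSession.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.master("local[*]").appName("errors").getOrCreate()
    df = spark.read.text([str(p) for p in files])
    print("pyspark total:", df.filter(col("value").contains("ERROR")).count())
    spark.stop()
```

For a dataset that fits on one machine, the multiprocessing version is usually the simpler and faster path; the local Spark version mainly buys you a codebase that can later move to a cluster.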

By the end of this talk, you should be able to evaluate whether what you have is really "big data," decide which solution makes the most sense for you, and see how multiprocessing can speed up your analysis. You will also be better equipped to evaluate the kinds of decisions you need to make as you move to big data.
