Many companies use big data, but distributed systems are complex and come with many trade-offs. In this talk, I'll walk through analyzing the same dataset in bash, with Python multiprocessing, and with PySpark locally. I'll also discuss the trade-offs you make as you move from local environments to distributed systems.
Hadoop clusters and other distributed systems are a common go-to architectural pattern for working with large amounts of data. But what constitutes large? By the time you set up a Hadoop cluster, Kerberize it, and get your analysts up to speed on Spark/JVM/HDFS, you could have already analyzed the data on a single server, or even your laptop.
Based on my consulting work in the Hadoop and big data space, I'll walk through how to think about your data size and analysis paradigms using:

+ Bash processing
+ Python multiprocessing
+ Spark with PySpark locally

focusing on understanding the trade-offs between the three environments. I'll also cover big data environment considerations.
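As a flavor of the middle option, here is a minimal sketch (illustrative only, not taken from the talk) of fanning a word-count-style analysis out across worker processes with Python's `multiprocessing.Pool`; the chunk data and function names are hypothetical:

```python
# Hypothetical example: count words across input chunks in parallel,
# then merge the per-worker results. Chunks could come from splitting
# a large file; here they are inlined for brevity.
from multiprocessing import Pool
from collections import Counter

def count_words(text):
    # Count whitespace-separated tokens in one chunk of input.
    return Counter(text.split())

if __name__ == "__main__":
    chunks = ["big data is big", "data is everywhere", "big big data"]
    # Each chunk is handled by a separate worker process.
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)
    # Counter supports +, so summing merges the partial results.
    total = sum(partial_counts, Counter())
    print(total.most_common(3))
```

The same map-then-merge shape carries over almost directly to the bash (`sort | uniq -c`) and PySpark (`flatMap`/`reduceByKey`) versions, which is what makes the three environments comparable.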
By the end of this talk, you should be able to evaluate whether what you have is really "big data" and which solution makes the most sense for you, and you'll see how multiprocessing can speed up your analysis. You will also be better equipped to evaluate the decisions you need to make as you move to a big data environment.