So you've gone through all the word counts, but now you have some real data to analyse and found the answers to your problems missing, especially for Python? I can help! This tutorial shows how to deal with common issues data scientists face when using Spark on real data sets. I'll demonstrate some common practices such as memory tuning, working efficiently with not-so-clean data, and much more.
So you've gone through all the word counts and other hello-world style code, but now you have some real data to analyse and found the answers to your problems missing, especially for Python? I can help! This tutorial goes beyond the basic examples and shows how to deal with common issues data scientists face when using Spark on real (big) data sets. I focus on Spark SQL and the DataFrame API, show why they are a good fit for Python users, and explain how to avoid typical errors and how to deal with them when they occur. I'll also demonstrate some common practices such as memory tuning and working efficiently with not-so-clean data, show what can go wrong in a join, and explain how to write data in the best way for later use. This will involve a more in-depth discussion of Spark internals and of how Spark differs from other systems such as databases. The tutorial focuses on Spark 2+, but many of the techniques shown also apply to older versions of Spark and even to other similar systems.