Saturday 10:30–12:45 in Hall 7

Using Spark -- With PySpark

Dr. Frank Gerhardt, Bence Zambo

Audience level:
Novice

Description

Spark is one of the most popular Big Data frameworks for scaling up your tasks in a cluster. Although it is written in Java/Scala, it supports Python via PySpark. We show you how to get started with Spark. We provide a small cluster of compute nodes to work on. You will run machine learning algorithms on a publicly available data set and present the results in a Jupyter notebook.

Abstract

Spark is one of the most popular Big Data frameworks. It is a great choice when you need to scale up your data science jobs. It is written in Java/Scala but also supports other languages such as R and Python. PySpark is the API for using Spark from Python. With Spark DataFrames, the overhead of calling Spark from Python is claimed to be near zero compared to Java/Scala. Spark development is supported by a number of notebook environments: the Spark Notebook, the IPython and Jupyter notebooks, and the new Apache Zeppelin notebook. Are you interested yet?
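To give you a feel for what this looks like in practice, here is a minimal sketch of the PySpark DataFrame API (Spark 2.x style). The file name and column name are illustrative placeholders, not the workshop data set:

    from pyspark.sql import SparkSession

    # Start (or attach to) a Spark session from Python.
    spark = SparkSession.builder.appName("pyspark-taste").getOrCreate()

    # Read a CSV file into a DataFrame, letting Spark infer the schema;
    # "data.csv" and "category" are placeholder names.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # Operations are written in Python but executed by the JVM-based
    # Spark engine, which is why the Python overhead stays small.
    df.groupBy("category").count().orderBy("count", ascending=False).show(10)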

In this tutorial we show you how to get started with Spark. We provide a small cluster for you so that we don't lose time installing servers. We then take a publicly available data set and show you how to load and use it in Spark. We will use machine learning algorithms to analyze the data and make predictions, and visualize the results in the notebooks. You will also see the performance difference it makes to run your code on a cluster instead of a single node.
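As a rough sketch of the machine-learning step (not the exact workshop code), an analysis with spark.ml might look like the following; the column names "feature1", "feature2" and "label" are hypothetical placeholders:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ml-sketch").getOrCreate()
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # spark.ml expects all input features in a single vector column;
    # the input and label column names here are placeholders.
    assembler = VectorAssembler(inputCols=["feature1", "feature2"],
                                outputCol="features")
    data = assembler.transform(df)

    # Split into training and test sets, fit a model, and predict.
    train, test = data.randomSplit([0.8, 0.2], seed=42)
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(test).select("label", "prediction").show(5)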

Participants should bring their own laptop with a web browser installed and working Wi-Fi access. All coding will be done in Jupyter notebooks on the cluster we provide. At the end of the tutorial you can export all your notebooks to take them with you.

Please download Docker and our Docker image before the workshop. On Mac and Windows, install Docker Toolbox; on Linux, install Docker Engine (www.docker.com). Then download the Docker image we have prepared for the workshop from https://hub.docker.com/r/gerhardt/pyspark-workshop/. The image contains Python, Spark, PySpark and Jupyter. In total, the downloads will be around 1 GB.
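The pull command below follows from the Docker Hub URL above; the run options are an assumption about how the image is configured (Jupyter images commonly expose port 8888), so check the image's Hub page for the exact invocation:

    # Fetch the workshop image from Docker Hub.
    docker pull gerhardt/pyspark-workshop

    # Assumed invocation: mapping port 8888 is a guess based on the
    # usual Jupyter setup; see the Hub page for the actual command.
    docker run -it -p 8888:8888 gerhardt/pyspark-workshop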