Saturday 13:45–16:00 in Hall 7

Which city is the cultural capital of Europe? An introduction to Apache PySpark for GeoAnalytics

Shoaib Burq, Kashif Rasul

Audience level:
Novice

Description

In this workshop we will give you a quick introduction to the Apache Spark stack and then get into the meat of performing a full-featured geospatial analysis. Using OpenStreetMap data as our base, our end goal will be to find the most cultural city in Western Europe!

Abstract

Which city is the cultural capital of Europe? An introduction to Apache PySpark for Big Data GeoAnalytics

In this workshop we will give you a quick introduction to the Apache Spark stack and then get into the meat of performing a full-featured geospatial analysis. Using OpenStreetMap data as our base, our end goal will be to find the most cultural city in Western Europe!

That's right! We will develop our own Cultural Weight Algorithm (TM) ;) and apply it to a set of major cities in Europe. The data will be analyzed using Apache Spark, and in the process we will work through the following phases of a Big Data project:

  • Consuming: retrieving raw data from REST APIs (OpenStreetMap).
  • Preparation: data exploration and schema creation for geospatial data.
  • Summarizing: querying data by location, performing spatial operations such as finding overlapping geospatial features, joining datasets by location (also known as spatial joins), and finally computing location-based summary statistics to arrive at our answer regarding the cultural capital of Europe (see the sketch after this list).
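To make these phases concrete, here is a minimal, hypothetical PySpark sketch of such a pipeline. The file name, column names, bounding boxes, and the choice of "cultural" tags are all illustrative assumptions, and simple bounding-box containment stands in for a real spatial join:

    # A minimal sketch, assuming the raw OSM features have already been
    # fetched (e.g. from a REST API) and dumped to JSON with lat, lon,
    # and a tag column. All names here are illustrative.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cultural-capital").getOrCreate()

    # Preparation: load the raw features and keep the "cultural" ones.
    features = spark.read.json("osm_features.json")  # columns: lat, lon, tag
    cultural = features.filter(F.col("tag").isin("museum", "theatre", "arts_centre"))

    # A toy cities table with bounding boxes (min/max latitude and longitude).
    cities = spark.createDataFrame(
        [("Berlin", 52.3, 52.7, 13.1, 13.8), ("Paris", 48.7, 49.0, 2.2, 2.5)],
        ["city", "lat_min", "lat_max", "lon_min", "lon_max"],
    )

    # Summarize: a crude spatial join via bounding-box containment,
    # then a per-city count as a stand-in summary statistic.
    joined = cultural.join(
        cities,
        F.col("lat").between(F.col("lat_min"), F.col("lat_max"))
        & F.col("lon").between(F.col("lon_min"), F.col("lon_max")),
    )
    joined.groupBy("city").count().orderBy(F.desc("count")).show()

A real analysis would of course use proper geometries and spatial predicates rather than bounding boxes, which is exactly what we will build up to in the workshop.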

Here's a summary of the workshop as a sketch:

[Sketch: Summary of Workshop]

We hope you will join us on this journey of exploring one of the most exciting technology stacks to come out of the good folks at UC Berkeley.

Why Spark?

Spark has quickly overtaken Hadoop as the front-runner among big data analysis technologies. There are a number of reasons for this, such as its developer-friendly interactive mode, its polyglot interface in Scala, Java, Python, and R, and the full stack of algorithmic libraries that those language ecosystems offer.

Out of the box, Spark includes a powerful set of tools: the ability to write SQL queries, perform streaming analytics, run machine learning algorithms, and even tackle graph-parallel computations. But what really stands out is its usability.

Its interactive shells (in both Scala and Python) make prototyping big data applications a breeze.
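For instance, a complete round of prototyping in the pyspark shell can be as short as this (the shell creates the spark session for you; the data file and its columns are hypothetical):

    # Typed directly into the pyspark shell, where `spark` already exists.
    # "events.json" and its columns are hypothetical.
    df = spark.read.json("events.json")
    df.createOrReplaceTempView("events")

    # One of the out-of-the-box tools: plain SQL over the DataFrame.
    spark.sql("SELECT city, COUNT(*) AS n FROM events GROUP BY city ORDER BY n DESC").show()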

Why PySpark?

PySpark provides integrated API bindings around Spark and, via Python's pickle serialization, enables full usage of the Python ecosystem on every node of the Spark cluster. More importantly, it supplies access to Python's rich ecosystem of machine learning libraries such as Scikit-Learn and data processing libraries such as Pandas.
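As a rough sketch of that bridge, here is a plain Python function distributed across the cluster as a UDF, with the small aggregated result handed over to Pandas at the end. The input file, columns, and weights are invented for the example:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("pyspark-interop").getOrCreate()

    # Hypothetical input: one row per OSM feature with a city and a tag.
    features = spark.read.json("osm_features.json")

    # A plain Python function shipped to the workers via pickle; the
    # weights are invented for the example.
    @F.udf(DoubleType())
    def cultural_weight(tag):
        weights = {"museum": 3.0, "theatre": 2.0, "arts_centre": 1.0}
        return weights.get(tag, 0.0)

    scores = features.withColumn("weight", cultural_weight("tag")) \
                     .groupBy("city").sum("weight")

    # Small aggregated results can cross over into Pandas for local work.
    pdf = scores.toPandas()
    print(pdf.head())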

During the workshop we are going to use a Docker container with the relevant libraries preinstalled. Please have the latest version of Docker running on your machine for the hands-on work!