Friday 16:25–18:00 in LG6

Using Python with Hadoop to create production ready big data applications

Ulrich Zink

Audience level:


A practical end-to-end walk-through of how to use Python to create a Hadoop production pipeline for big data:

* Data analysis - Python APIs for Hadoop Streaming, PySpark...
* Interactive visualisation and reporting dashboards - Bokeh, Matplotlib... and D3
* Python web service to run Hadoop/Spark-based applications


For practical and economic reasons, the open-source Apache Hadoop ecosystem has become the de facto solution for big data and machine learning at scale.

Although Hadoop is extremely powerful for analysing data, it was initially designed as an engineer's tool, and allowing non-technical users to interact with it can be challenging. Many business intelligence solutions from the RDBMS era keep failing the Hadoop test.

In this talk we propose to walk through a practical example, demonstrating how to combine Python and Hadoop to create production-grade big data applications.

We will start by going through some of the APIs that allow Python to be used as the glue to hold the different components of the Hadoop ecosystem together.
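As a rough illustration of the Hadoop Streaming style of API mentioned above, here is a minimal word-count sketch. It is not taken from the talk: the function names `mapper`, `reducer` and `main` are hypothetical, and it only assumes the usual Streaming contract, in which records are tab-separated `key<TAB>value` lines and the reducer receives its input sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts for each word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(stage, stdin, stdout):
    """Hypothetical entry point: run one stage, reading from stdin and writing to stdout."""
    step = mapper if stage == "map" else reducer
    for record in step(stdin):
        stdout.write(record + "\n")
```

A script used with Hadoop Streaming would typically call something like `main(sys.argv[1], sys.stdin, sys.stdout)`. Locally, sorting the mapper output before passing it to `reducer` stands in for the shuffle phase.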

We will then move on to visualisation and dashboarding, demonstrating how to use some of the most popular Python libraries, such as Bokeh and Matplotlib, in a big data context and, for more data-intensive jobs, how to produce JSON output that can be consumed by JavaScript.
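To make the JSON hand-off concrete, the sketch below aggregates records in Python and serialises the result as an array of objects, a shape that JavaScript charting code such as D3 can load directly. The function name `to_chart_json` and the `label`/`value` field names are assumptions chosen for illustration, not something prescribed by the talk.

```python
import json
from collections import Counter


def to_chart_json(records, key):
    """Count records per value of `key` and return a JSON array of objects."""
    counts = Counter(record[key] for record in records)
    # Sort by label so the output is stable from run to run.
    rows = [{"label": label, "value": value} for label, value in sorted(counts.items())]
    return json.dumps(rows)
```

In a real pipeline the heavy aggregation would presumably run on the cluster, with only the small summarised result serialised this way for the browser.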

Finally, we will show an example of a Flask-based web application running on Hadoop/Spark.