Spark is a great tool for exploration and batch processing, but what about on-demand REST services? In this talk we will go over how to create a Flask service that vends data using a Spark backend with Spark SQL/DataFrames.
Python has first-class web development and data science tools. This talk looks at an approach that combines the two in a unique way in order to provide powerful access to Spark compute clusters via Flask routes.
I will go over an example of creating a Spark SQL context using pyspark within a Flask application, performing initial processing on startup, caching the dataset in memory across worker nodes, and then creating Flask routes that access data through the Spark SQL context. By doing this, you can build simple REST APIs to share the data in your Spark cluster with the world.
This approach raises some issues, which I will discuss. We will go over approaches to get around them, including using gunicorn's default synchronous worker mode as well as putting nginx as a proxy in front of multiple Spark-Flask applications.