Friday 10:45 AM–11:30 AM in Track 1 - McKinley

Building a community fountain around your data stream

Maria Patterson

Audience level:
Intermediate

Description

With the trend towards data streams, building successful streaming analysis systems means building a community comfortable with streaming tech. But getting started with stream processing can be intimidating for anyone. In this talk, I'll walk through designing and deploying a mini-testbed system to scale down the stream, and show how you can practice your favorite algorithm on an astronomical data stream.

Abstract

The increasing availability of real-time data sources and the Internet of Things movement have pushed data analysis pipelines towards stream processing. But what does this really mean for my applications, and how do I have to change my code and workflow? In a new era of “Kappa architecture,” it’s easier than ever to use the same programming model for both batch and stream processing.
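To make that concrete, here is a minimal sketch (illustrative only, not code from the talk) of what "same programming model" looks like in PySpark: the query is written once, and only the source changes between batch and streaming. The column names and the ./alerts/ path are assumptions for the example.

# A minimal sketch of the Kappa-style "write the query once" idea in PySpark.
# Assumes pyspark and a hypothetical directory of JSON files in ./alerts/
# with made-up "magnitude" and "band" fields.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

def bright_alert_counts(df):
    """The same transformation works on a batch or a streaming DataFrame."""
    return df.filter(col("magnitude") < 19.0).groupBy("band").count()

# Batch: read the files once and run the query.
batch_df = spark.read.json("alerts/")
bright_alert_counts(batch_df).show()

# Streaming: watch the same directory and run the identical query continuously.
stream_df = spark.readStream.schema(batch_df.schema).json("alerts/")
(bright_alert_counts(stream_df)
 .writeStream.outputMode("complete").format("console").start()
 .awaitTermination())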

For those interested in the design and operations side, I will cover high-level design considerations for architecting a modular and scalable stream processing infrastructure that can support the flexibility of different use cases and can welcome a community of users who are more familiar with batch processing.

For the fast-batching Pythonistas, I'll talk about some of the advantages of using streaming tech in a data processing pipeline and how to make your life easier with 1) built-in replication, scalability, and stream "rewind" for data distribution with Kafka, 2) structured messages with strictly enforced schemas and dynamic typing for fast parsing with Avro, and 3) a stream processing interface in Spark that feels like batch and that you can even use from a Jupyter notebook.
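As a small taste of the Avro piece, here is a sketch (mine, not the talk's, with a made-up "alert" schema and assuming the fastavro package) of writing and parsing a schema-enforced message the way one would travel through a Kafka topic.

# Serialize and parse a single Avro message with a strictly enforced schema.
# The schema and field names below are hypothetical examples.
import io
import fastavro

schema = fastavro.parse_schema({
    "name": "alert",
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "magnitude", "type": "double"},
    ],
})

# Write one record as a compact, schemaless payload (as it would sit in a topic).
buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, {"id": 42, "magnitude": 17.8})

# Parse it back; a record with a missing field or the wrong type would fail.
buf.seek(0)
record = fastavro.schemaless_reader(buf, schema)
print(record)  # {'id': 42, 'magnitude': 17.8}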

When you’re ready to jump into the stream, or at least take a drink from the fountain, I’ll point you to an open-source, Docker-containerized streaming ecosystem testbed that you can deploy to mock a stream of data and take your streaming analytics on a dry run over an astronomical data stream.
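Until you have that testbed in hand, a mock stream can be as simple as a small Python producer replaying fake records into a Kafka topic, as in the sketch below (assuming the kafka-python package, a broker on localhost:9092, and a hypothetical "alerts" topic).

# Mock a data stream by trickling fake records into a Kafka topic.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Emit a slow trickle of fake observations so downstream consumers can keep up.
for i in range(100):
    record = {"id": i, "magnitude": round(random.gauss(18.0, 0.5), 3)}
    producer.send("alerts", value=record)
    time.sleep(0.1)

producer.flush()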
