Monday 2:45 PM–3:25 PM in Music Box 5411/Winter Garden 5412 (5th fl)

Streaming Processing with Dask

Matthew Rocklin

Audience level:
Intermediate

Description

This talk discusses ongoing work to build streaming data processing systems for Python with Dask, a Pythonic library for parallel computing. It covers streaming primitives, dataframes, and integration with the Jupyter notebook, and uses examples from financial time series and cyber-security.

Abstract

Continuous data streams arise in many applications like the following:

  1. Log processing from web servers
  2. Scientific instrument data like telemetry or image processing pipelines
  3. Financial time series
  4. Machine learning pipelines for real-time and on-line learning
  5. Network security

The PyData stack contains tools like NumPy and Pandas for analytics of fixed-size datasets but generally lacks data structures and algorithms for online computation. To address this gap, we introduce a small library for streaming programming that integrates with Pandas for tabular data processing and with Dask for parallel and distributed computation. The result is an intuitive, efficient, and scalable solution for streaming data processing in Python.
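To make the contrast between fixed-size and online computation concrete, here is a minimal sketch in plain Python. It is not the library presented in the talk; the `RunningMean` class and the `batches` generator are hypothetical names made up for illustration. The idea is that an online algorithm keeps a small amount of state and updates it as each batch arrives, rather than holding the whole dataset in memory.

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Online mean: state is updated per batch and no history is kept.

    Hypothetical helper for illustration, not part of Dask or Pandas.
    """
    count: int = 0
    total: float = 0.0

    def update(self, batch):
        # Fold one incoming batch into the running state.
        self.count += len(batch)
        self.total += sum(batch)
        return self.mean

    @property
    def mean(self):
        # Undefined until at least one value has been seen.
        return self.total / self.count if self.count else float("nan")


def batches():
    # Hypothetical stand-in for an unbounded source such as log lines or ticks.
    yield [1.0, 2.0, 3.0]
    yield [4.0, 5.0]
    yield [6.0]


rm = RunningMean()
# One updated result per arriving batch.
means = [rm.update(b) for b in batches()]
print(means)
```

The same pattern (small state plus a per-batch update) would apply if each batch were a Pandas dataframe instead of a list, which is roughly the kind of integration the abstract describes.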

We motivate this solution with examples from financial time series and network security, and use IPython widgets and Bokeh plots for interactive streaming analysis within the Jupyter notebook.
