This talk discusses ongoing work to build streaming data processing systems for Python with Dask, a Pythonic library for parallel computing. It covers streaming primitives, dataframes, and integration with the Jupyter notebook, and uses examples from financial time series and cyber-security.
Continuous data streams arise in many applications like the following:
The PyData stack contains tools like NumPy and Pandas for analytics of fixed-sized datasets but generally lacks data structures and algorithms for online computation. To resolve this we introduce a small library for streaming programming that integrates nicely with Pandas for tabular data processing, and with Dask for parallel and distributed computation. This results in an intuitive, efficient, and scalable solution for streaming data processing in Python.
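To make the idea of online computation concrete, here is a minimal sketch of a push-based stream pipeline using only the standard library. The `Stream` class and its `map`, `sliding_window`, and `sink` methods are hypothetical names chosen for illustration; they are not the API of the library described in this talk, and the price values are made up.

```python
from collections import deque
from statistics import mean


class Stream:
    """Hypothetical push-based stream node (illustration only)."""

    def __init__(self):
        self.downstreams = []

    def emit(self, x):
        # Push an element to every node attached downstream.
        for node in self.downstreams:
            node.update(x)

    def update(self, x):
        # Default behaviour: pass the element through unchanged.
        self.emit(x)

    def map(self, func):
        return self._attach(_Map(func))

    def sliding_window(self, n):
        return self._attach(_SlidingWindow(n))

    def sink(self, func):
        return self._attach(_Sink(func))

    def _attach(self, node):
        self.downstreams.append(node)
        return node


class _Map(Stream):
    def __init__(self, func):
        super().__init__()
        self.func = func

    def update(self, x):
        self.emit(self.func(x))


class _SlidingWindow(Stream):
    def __init__(self, n):
        super().__init__()
        self.buffer = deque(maxlen=n)

    def update(self, x):
        self.buffer.append(x)
        # Emit only once the window holds n elements.
        if len(self.buffer) == self.buffer.maxlen:
            self.emit(tuple(self.buffer))


class _Sink(Stream):
    def __init__(self, func):
        super().__init__()
        self.func = func

    def update(self, x):
        self.func(x)


# Build a pipeline: extract price -> 3-element window -> mean -> collect.
source = Stream()
rolling_means = []
(source.map(lambda tick: tick["price"])
       .sliding_window(3)
       .map(mean)
       .sink(rolling_means.append))

# Feed made-up ticks one at a time, as a live feed would.
for price in [10.0, 11.0, 12.0, 13.0, 14.0]:
    source.emit({"price": price})

print(rolling_means)
```

Each element is processed as it arrives, so memory use depends on the window size rather than on the length of the stream. That property is what separates this style from batch analytics over a fixed-size dataset.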
We motivate this solution with examples from financial time series and network security, and use IPython widgets and Bokeh plots for interactive streaming analysis within the Jupyter notebook.