Scan statistics is a distribution-based methodology for detecting anomalies. This talk explores the use of scan statistics for real-time analysis of streaming data with Spark Streaming.
Scan statistics is a distribution-based methodology for detecting anomalous data. Simpler methods such as moving averages and exponential smoothing rely on previously observed data; scan statistics instead runs a hypothesis test against the assumed distribution of the data, so the analysis can be performed in real time. Spark Streaming is a framework that lends itself well to this use case. This talk introduces a Python package built for Spark Streaming that performs real-time anomaly detection using various distributions for count data.
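
To make the hypothesis-testing idea concrete, here is a minimal sketch in plain Python. It is not the package presented in the talk: the names `poisson_sf` and `PoissonScanDetector` are hypothetical, the counts are assumed to follow a Poisson distribution with a known baseline rate, and each window is tested on its own, whereas a full scan statistic would also account for the maximum over all scanned windows.

```python
import math
from collections import deque


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    # Accumulate P(X <= k - 1) term by term, then take the complement.
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


class PoissonScanDetector:
    """Hypothetical sliding-window detector for count data.

    Assumes a known baseline rate (expected events per interval).
    """

    def __init__(self, baseline_rate, window_size, alpha=0.01):
        self.baseline_rate = baseline_rate
        self.alpha = alpha
        # Keep only the most recent `window_size` counts.
        self.window = deque(maxlen=window_size)

    def update(self, count):
        """Add one interval's count; return (is_anomaly, p_value)."""
        self.window.append(count)
        observed = sum(self.window)
        # Under the null hypothesis the window total is Poisson
        # with mean baseline_rate * number of intervals in the window.
        expected = self.baseline_rate * len(self.window)
        p_value = poisson_sf(observed, expected)
        return p_value < self.alpha, p_value


detector = PoissonScanDetector(baseline_rate=2.0, window_size=3, alpha=0.001)
for count in [2, 1, 3, 2, 20, 25]:
    is_anomaly, p_value = detector.update(count)
    print(count, is_anomaly, p_value)
```

Because the test compares each window with a reference distribution rather than with a smoothed history, a decision is available as soon as a count arrives. In a Spark Streaming job, a function like `update` could be applied to the per-batch counts of each micro-batch.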