There are two ways to solve any problem: accurately or approximately. Exact data structures have their drawbacks: they use too much memory and do not scale to the real-time nature of the data. In this talk we will see how to take advantage of the newly released Redis 4.0 and its pluggable module system to build a data pipeline that uses probabilistic data structures to get real-time insights.
Many different insights and metrics can be obtained from log event data. Processing the data in real time and getting exact results is possible in theory; in practice, it is not so easy.
Not all results and metrics need to be exact. In many cases the tradeoff between accuracy and memory usage or scalability is worth it, and that is where probabilistic data structures (PDS) come in. In this talk I will explain several PDSs and how they work, and I will show how Redis and its pluggable module system let us use these data structures much more efficiently.
Introduction
a. Problem: Parsing high-volume, high-velocity log event data.
b. Various metrics to be measured.
Difference between exact data structures and probabilistic data structures
Top-K - Getting the top k most frequent items from a data set
Bloom filter - Checking for set membership
Count-Min Sketch - Estimating item counts
HyperLogLog - Estimating the cardinality of sets (all four structures are demonstrated in the first sketch after this outline)
Redis 4.0
a. Using the new module system to access these data structures
b. Building our data pipeline to process real-time log events (sketched after this outline).
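
To make the outline concrete, here is a minimal sketch of driving the four structures above from Python through Redis. It assumes a Redis server with the RedisBloom module loaded (the descendant of the 4.0-era ReBloom); the key names, error bounds, and log fields are hypothetical choices for illustration. The BF.*, CMS.*, and TOPK.* commands come from the module, while PFADD/PFCOUNT (HyperLogLog) ship with core Redis.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# One-time setup: reserve each sketch with explicit error bounds.
# (These keys and bounds are illustrative assumptions.)
r.execute_command("BF.RESERVE", "seen:ips", 0.01, 1000000)      # 1% false positives
r.execute_command("CMS.INITBYPROB", "hits:urls", 0.001, 0.002)  # epsilon, delta
r.execute_command("TOPK.RESERVE", "top:urls", 10)               # track the top 10 URLs

def process_log_event(ip: str, url: str) -> bool:
    """Feed a single log event into each probabilistic structure."""
    # Bloom filter: membership ("have we seen this IP before?")
    first_time = r.execute_command("BF.ADD", "seen:ips", ip)
    # Count-Min Sketch: approximate per-URL hit counts
    r.execute_command("CMS.INCRBY", "hits:urls", url, 1)
    # Top-K: keep only the heaviest hitters
    r.execute_command("TOPK.ADD", "top:urls", url)
    # HyperLogLog (core Redis): approximate number of unique visitors
    r.pfadd("uniques:ips", ip)
    return bool(first_time)

process_log_event("203.0.113.7", "/index.html")

print(r.execute_command("CMS.QUERY", "hits:urls", "/index.html"))  # ~hit count
print(r.execute_command("TOPK.LIST", "top:urls"))                  # top URLs so far
print(r.pfcount("uniques:ips"))                                    # ~unique IPs
```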
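And a rough sketch of the pipeline itself, feeding parsed log lines into the function above. The Apache-combined-style log format and the local file source are stand-ins for illustration; a real deployment would consume a live stream (Kafka, syslog, etc.).

```python
import re

# Extract client IP and request URL from a combined-format access log line.
LOG_LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<url>\S+)')

def run_pipeline(lines):
    """Parse each log line and feed it into the probabilistic structures."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            process_log_event(match.group("ip"), match.group("url"))

with open("access.log") as f:  # hypothetical log source
    run_pipeline(f)
```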