We often need to store loads of unique values, whether for a count(distinct) query or to track which users interacted with a given post. Done naively, this can be very expensive in terms of memory and CPU. I will walk you through four data structures built for efficient storage and retrieval of such sets: the bitmap, the Roaring Bitmap, the Bloom filter, and the HyperLogLog.
Say you run a bookmarking service and you want to do some analytics on saved bookmarks. A popular article gets a million interactions; stored as raw user IDs, that's four to eight megabytes of integers, and it doesn't even carry that much information. Analyse 100 such articles and you're suddenly loading up to half a gigabyte of data just to know who interacted with what and whether the audiences overlap.
What if you could store this information in a couple hundred kilobytes instead? And if you're willing to lose some information, you could use just two kilobytes per article (not a typo). How is that possible?
The answer is smart data structures and algorithms aimed at storing large numbers of unique values. We will cover two that give you exact answers (at a memory cost) and two probabilistic data structures, which squeeze your information into a tiny footprint at the cost of giving you only approximate answers (with tunable accuracy).
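To make those numbers concrete, here is the back-of-envelope arithmetic as a minimal sketch. The assumptions are mine, not exact figures from any particular library: 32- or 64-bit user IDs, an ID space of roughly one million (a plain bitmap needs one bit per *possible* ID, so its size depends on that space, not on how many IDs are set), and a small ~2 KB HyperLogLog configuration, which lands around 2% relative error.

```python
# Rough memory math for tracking which users interacted with one article.
# All figures are illustrative assumptions: 32/64-bit user IDs, an ID
# space of ~1 million, and a ~2 KB HyperLogLog.

MILLION = 1_000_000

strategies = {
    "raw 32-bit ints (exact)": 4 * MILLION,   # 4 bytes per stored ID
    "raw 64-bit ints (exact)": 8 * MILLION,   # 8 bytes per stored ID
    "plain bitmap (exact)":    MILLION // 8,  # 1 bit per *possible* ID
    "HyperLogLog (approx.)":   2 * 1024,      # ~2% error at this size
}

for name, size_bytes in strategies.items():
    per_100 = 100 * size_bytes
    print(f"{name:<26} {size_bytes / 1024:>8,.0f} KiB/article"
          f" -> {per_100 / 2**20:>6,.1f} MiB for 100 articles")
```

Running this prints roughly 3,906 KiB per article for 32-bit IDs (about 381 MiB for 100 articles, 763 MiB for 64-bit IDs), versus about 122 KiB for the bitmap and 2 KiB for the HyperLogLog, which is where the "couple hundred kilobytes" and "two kilobytes" claims above come from.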
There are many more data structures in this family, but we will only have time for these four. I chose some of the most popular, and along the way I will point you towards a few other interesting ones.