We often need to store loads of unique values, whether for a count(distinct) query or to track which users interacted with a given post. Done naively, this can be very expensive in terms of memory and CPU. I will walk you through four data structures built for efficient storage and retrieval of such sets: the bitmap, the Roaring Bitmap, the Bloom filter, and the HyperLogLog.
Say you run a bookmarking service and you want to do some analytics on saved bookmarks. A popular article gets a million interactions; stored as raw user IDs, that's four to eight megabytes of integers, and it doesn't even carry that much information. Analyse 100 such articles and you're suddenly loading up to half a gigabyte of data just to know who interacted with what and whether the audiences overlap.
What if you could store this information in a couple hundred kilobytes instead? And if you're willing to lose some information, you could use just two kilobytes per article (not a typo). How is that possible?
The answer is smart data structures and algorithms aimed at storing large numbers of unique values. We will cover two that give you exact answers (at a memory cost) and two probabilistic data structures, which squeeze your information into a tiny footprint at the cost of giving you only approximate answers (with tunable accuracy).
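To make those numbers concrete, here is the back-of-envelope arithmetic as a minimal sketch. The assumptions are mine, not exact figures from any particular library: 32- or 64-bit user IDs, an ID space of roughly one million (a plain bitmap needs one bit per *possible* ID, so its size depends on that space, not on how many IDs are set), and a small ~2 KB HyperLogLog configuration, which lands around 2% relative error.

```python
# Rough memory math for tracking which users interacted with one article.
# All figures are illustrative assumptions: 32/64-bit user IDs, an ID
# space of ~1 million, and a ~2 KB HyperLogLog.

MILLION = 1_000_000

strategies = {
    "raw 32-bit ints (exact)": 4 * MILLION,   # 4 bytes per stored ID
    "raw 64-bit ints (exact)": 8 * MILLION,   # 8 bytes per stored ID
    "plain bitmap (exact)":    MILLION // 8,  # 1 bit per *possible* ID
    "HyperLogLog (approx.)":   2 * 1024,      # ~2% error at this size
}

for name, size_bytes in strategies.items():
    per_100 = 100 * size_bytes
    print(f"{name:<26} {size_bytes / 1024:>8,.0f} KiB/article"
          f" -> {per_100 / 2**20:>6,.1f} MiB for 100 articles")
```

Running this prints roughly 3,906 KiB per article for 32-bit IDs (about 381 MiB for 100 articles, 763 MiB for 64-bit IDs), versus about 122 KiB for the bitmap and 2 KiB for the HyperLogLog, which is where the "couple hundred kilobytes" and "two kilobytes" claims above come from.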
There are many more data structures in this family, but we will only have time for these four. I chose some of the most popular, and along the way I will point you towards a few other interesting ones.