Coding Bloom Filters

Preface

The idea here is to build a working architecture from first principles. I’m still learning how to build applications without the obvious bottlenecks, and this is an attempt at that. There are benefits and tradeoffs to what I’ve built, and I’ve tried to discuss them below.

What is a Bloom Filter?

A Bloom filter is a probabilistic data structure that helps determine whether a certain record has been seen before, with possible false positives but no false negatives. That’s the Wikipedia definition. Let’s go easier.

The Instagram Reels, YT Shorts, and TikTok problem

Setting the Wikipedia definition aside, let’s dive into the problem we’re trying to solve and why Bloom filters exist.

Let’s say you’re on Instagram Reels. Usually, all of these reels are recommended to you by Instagram. In very high-level terms, you’d open Instagram, and it would already have a batch of reels cached specifically for you. That batch may have originated from a database that stores the references (URLs) and other metadata for the recommended videos, while the videos themselves would sit in an object store, let’s say S3. Each time Instagram runs its recommendation algorithm for me, how does it guarantee that none of the reels it recommends and puts into the cache have been seen by me before? Neither you, I, nor any other user wants to see the same set of reels repeatedly, which means I need to store somewhere which reels I’ve watched, so that none of them repeat.

Let’s explore our options for this. Forget about high-level ideas like “I’ll store this in a DB, or in an in-memory data store” for now; let’s look at the fundamental data structures themselves that we can use to store these. We’re also not considering hundreds of servers here: just very fundamental data structures with 1 server and 1 user on the app. We’ll eventually build up to the “scale” conversation.

Arrays: In theory, I can store all the reels I’ve seen in an array. The array would look something like:

["helicopter_flying_reel_id", "baby_crying_reel_id","baby_crying_reel_2_id"....]

There’s a problem here. To check whether I’ve already watched a recommended reel, I would have to go over every recommended reel, check whether it is present in the array above, and discard it if so. The pseudocode would be something like:

reels_watched = ["helicopter_flying_reel_id", "baby_crying_reel_id", "baby_crying_reel_2_id"....]
reels_recommended = ["brain_rot_reel_id", "baby_dancing_reel_id"....]
for _, reel_recommended := range reels_recommended {
    for _, reel_watched := range reels_watched {
        if reel_watched == reel_recommended {
            // already watched: discard this recommendation
        }
    }
}

This gives us a time complexity of O(m * n), where m is the length of the reels_recommended array and n is the length of the reels_watched array. Both can become very large for a single user as their time on the platform grows.

The optimization is to keep the reels_watched array sorted and use binary search to check whether each recommended reel exists in it, which reduces our time complexity to O(m log n). But if we think about Instagram at large for a second, we’re saying that EVERY time we generate a user feed, we add this O(m log n) operation. Considering that the goal of these systems is (usually) to prioritize availability over consistency, it would help to find an optimization over the O(m log n) operation.
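To make the binary search idea concrete, here is a minimal Go sketch. The function name filterUnseen is my own, and the reel IDs are the illustrative ones from above. It assumes the watched list can be sorted once up front, which costs O(n log n) before any lookup happens.

```go
package main

import (
	"fmt"
	"sort"
)

// filterUnseen returns the recommended reels that are not in watched.
// It sorts a copy of watched once, then binary-searches it per recommendation.
func filterUnseen(watched, recommended []string) []string {
	sorted := append([]string(nil), watched...)
	sort.Strings(sorted) // binary search only works on sorted data

	var unseen []string
	for _, reel := range recommended {
		i := sort.SearchStrings(sorted, reel) // O(log n) lookup
		if i < len(sorted) && sorted[i] == reel {
			continue // already watched: discard
		}
		unseen = append(unseen, reel)
	}
	return unseen
}

func main() {
	watched := []string{"helicopter_flying_reel_id", "baby_crying_reel_id", "baby_crying_reel_2_id"}
	recommended := []string{"brain_rot_reel_id", "baby_crying_reel_id", "baby_dancing_reel_id"}
	fmt.Println(filterUnseen(watched, recommended))
}
```

Here baby_crying_reel_id is dropped because it is in the watched list, and the other two recommendations pass through.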

Technically, we could also store the watched reels in a balanced search tree, which would make each lookup O(log n), but we’re still left with the same O(m log n) complexity.

This is where Bloom Filters come in.

(**Note**: I wrote the content below a while ago, and it serves a use case very similar to the one we were solving for: dating apps, where you swipe right or left on a set of recommended users. Those users come from a recommendation engine, and we need to make sure that a user who has already been recommended to us doesn’t show up again.)

The Bloom Filters Solution

What follows is me thinking out loud about how I understood Bloom filters. The problem we’re trying to solve is how to avoid repeat users coming up in a feed.

The best time complexity we reached above was O(m log n), but what if I want something even better, a complexity that gets very close to O(m)? That’s where a Bloom filter comes in.

A Bloom filter doesn’t actually store the data; it stores the existence of that data. It’s essentially an array of bits whose length is fixed and doesn’t depend on how much data there is. Each time you want to mark a new user as seen in the feed of, let’s say, a person named Aneesh, you pass that user’s ID into a hash function and mod the result by the fixed length of the array. This gives you an index into the bit array that looks random but is always the same for the same ID, and in Aneesh’s bit array you change that bit from 0 to 1. The next time the same user is recommended, we pass the ID through the hash function and mod it by the length of the array, just like for any other user record. That gives us the same index as before, and because the bit is 1, we drop that user record and don’t send it back for the user’s feed. (A textbook Bloom filter uses several hash functions and sets one bit per hash; I’m using a single hash function here to keep things simple.)
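The hash-and-mod step can be sketched in a few lines of Go. This is only an illustration under my own assumptions: it uses FNV-1a from the standard library rather than murmur3 so that it runs without third-party packages, the 64-bit filter size is hypothetical, and the names filterBits and indexFor are made up for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// filterBits is a hypothetical fixed filter size, small enough to fit in one uint64.
const filterBits = 64

// indexFor hashes the ID and mods by the filter size to get a bit index.
func indexFor(id string) uint32 {
	h := fnv.New32a() // assumption: FNV-1a stands in for murmur3 here
	h.Write([]byte(id))
	return h.Sum32() % filterBits
}

func main() {
	var bits uint64 // the whole bit array, all zeros to start

	id := "user_42"
	idx := indexFor(id)

	fmt.Println("seen before marking:", bits&(uint64(1)<<idx) != 0) // false
	bits |= uint64(1) << idx                                        // flip the bit from 0 to 1
	fmt.Println("seen after marking: ", bits&(uint64(1)<<idx) != 0) // true

	// Hashing the same ID again always lands on the same index.
	fmt.Println("same index on rehash:", indexFor(id) == idx) // true
}
```

Note that the ID itself is never stored anywhere; only one bit changes.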

This creates a problem: since the size of the array is predetermined, two different user IDs can hash and mod to the same index. Let’s say user ID 1, when hashed and modded, gives me index 2. I mark that bit as 1 and send user ID 1 to the user’s feed. Much later, user ID 2 also hashes and mods to index 2, which is already marked as 1, even though user ID 2 was never shown in the user’s feed. This is called a false positive: the Bloom filter reports a user who was never shown as already shown. That’s a trade-off we make.
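A collision like this is easy to provoke with a tiny filter. The sketch below is illustrative only: it again assumes FNV-1a in place of murmur3, and the helper firstCollision and the user_N IDs are hypothetical. With 9 IDs and only 8 possible indexes, at least two IDs must share an index.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// indexFor hashes the ID and mods by the filter size.
func indexFor(id string, size uint32) uint32 {
	h := fnv.New32a() // assumption: FNV-1a stands in for murmur3 here
	h.Write([]byte(id))
	return h.Sum32() % size
}

// firstCollision walks user_0..user_{n-1} and returns the first pair of
// distinct IDs that map to the same index, if any.
func firstCollision(n int, size uint32) (string, string, bool) {
	owner := make(map[uint32]string) // index -> first ID that claimed it
	for i := 0; i < n; i++ {
		id := fmt.Sprintf("user_%d", i)
		idx := indexFor(id, size)
		if prev, taken := owner[idx]; taken {
			return prev, id, true
		}
		owner[idx] = id
	}
	return "", "", false
}

func main() {
	const size = 8
	// 9 IDs into 8 slots: a collision is guaranteed.
	a, b, found := firstCollision(size+1, size)
	fmt.Println(a, b, found)
}
```

If the first ID of the pair was shown and its bit set, the filter would then report the second ID as seen too, which is exactly the false positive described above.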

Because we strictly want a system that doesn’t show repeat users, we’re okay with occasionally not showing a user, since the assumption is that repeats would make for worse UX. The number of false positives increases over time as more bits get set, and to counter that, we can reinitialize the Bloom filter with a new, larger size.
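To get a feel for how quickly false positives build up, here is a rough estimate in Go. It assumes a single hash function that spreads IDs uniformly over the bits, in which case the chance that a never-seen ID lands on an already-set bit after n distinct insertions into m bits is 1 - (1 - 1/m)^n. The function name falsePositiveRate and the sizes are my own choices for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate estimates the probability that a never-seen ID hits a
// set bit in a single-hash filter with `bits` bits after `inserted` distinct
// IDs. Assumption: the hash distributes IDs uniformly.
func falsePositiveRate(bits, inserted int) float64 {
	stillZero := math.Pow(1-1/float64(bits), float64(inserted))
	return 1 - stillZero
}

func main() {
	for _, bits := range []int{1024, 8192} {
		for _, n := range []int{100, 1000} {
			fmt.Printf("bits=%d inserted=%d rate=%.3f\n", bits, n, falsePositiveRate(bits, n))
		}
	}
}
```

Under these assumptions, a 1024-bit filter is at roughly 9% after 100 IDs and about 62% after 1000, which is why moving to a larger filter helps.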

Bloom Filter Code - (Implemented myself, can also use Redis)

Redis also has an implementation of Bloom filters, but to understand them better, I ended up implementing my own in-memory version.

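import (
    "fmt"

    // Assumption: the original does not say which murmur3 package it uses;
    // this one provides murmur3.New32().
    "github.com/spaolacci/murmur3"
)

// BloomFilter is a fixed-size bit array; size is the number of bits addressed.
type BloomFilter struct {
    filter []byte
    size   int
}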

type userToFilterMap struct {
    m map[string]*BloomFilter
}

type BloomFilterPerUser struct {
    bfMap *userToFilterMap
}

var globalBloomFilter BloomFilterPerUser

func InitializeGlobalBloomFilter() (*BloomFilterPerUser, error) {
    bfMap := &userToFilterMap{
        m: make(map[string]*BloomFilter),
    }

    bf := &BloomFilterPerUser{
        bfMap: bfMap,
    }

    globalBloomFilter = *bf

    return bf, nil
}

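// NewBloomFilterForUser creates a filter with size bits for userID in the
// global map. If the user already has a filter, it is left untouched.
func NewBloomFilterForUser(size int, userID string) (*BloomFilterPerUser, error) {
    if size <= 0 {
        return nil, fmt.Errorf("bloom filter size must be positive, got %d", size)
    }

    if _, exists := globalBloomFilter.bfMap.m[userID]; exists {
        return &globalBloomFilter, nil
    }

    globalBloomFilter.bfMap.m[userID] = &BloomFilter{
        // size counts bits, so round up to the number of bytes needed.
        filter: make([]byte, (size+7)/8),
        size:   size,
    }

    return &globalBloomFilter, nil
}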

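// hashValueAndModBySize maps key to a bit index in [0, size).
func hashValueAndModBySize(key string, size int) int {
    hasher := murmur3.New32()
    _, _ = hasher.Write([]byte(key))
    hash := hasher.Sum32()
    // Reduce in uint32 space so the result is never negative,
    // even on platforms where int is 32 bits wide.
    return int(hash % uint32(size))
}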

func (bfpu *BloomFilterPerUser) AddToBloomFilterForUser(key string, userID string) error {

    bf, exists := bfpu.bfMap.m[userID]
    if !exists {
        return fmt.Errorf("BloomFilter not found for user: %s", userID)
    }

    idx := hashValueAndModBySize(key, bf.size)

    byteIdx := idx / 8
    bitIdx := idx % 8
    bf.filter[byteIdx] |= 1 << bitIdx

    return nil
}

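// MembershipCheck reports whether key has (probably) been seen by userID.
// Side effect: a key that has not been seen is added to the filter, so the
// next check for the same key returns true.
func (bfpu *BloomFilterPerUser) MembershipCheck(key string, userID string) (bool, error) {
    bf, exists := bfpu.bfMap.m[userID]

    if !exists {
        // First check for this user: lazily create a 1024-bit filter.
        // Note that this writes to the global map, so bfpu is expected to be
        // the value returned by InitializeGlobalBloomFilter.
        if _, err := NewBloomFilterForUser(1024, userID); err != nil {
            return false, fmt.Errorf("failed to create bloom filter for user %s: %w", userID, err)
        }
        bf, exists = bfpu.bfMap.m[userID]
        if !exists {
            return false, fmt.Errorf("bloom filter missing after creation for user: %s", userID)
        }
    }

    idx := hashValueAndModBySize(key, bf.size)
    byteIdx := idx / 8
    bitIdx := idx % 8
    if bf.filter[byteIdx]&(1<<bitIdx) != 0 {
        // Bit already set: seen before, or a false positive.
        return true, nil
    }

    // Not seen yet: record it, and surface any error instead of swallowing it.
    if err := bfpu.AddToBloomFilterForUser(key, userID); err != nil {
        return false, err
    }

    return false, nil
}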