A brief introduction to Bloom filter

Last time we learned how to use HyperLogLog to make an estimate of big data. It is very valuable and can solve many statistical needs with low precision. But if we want to know if a value is already in the HyperLogLog structure, it does nothing. It only provides pfadd and pfcount methods. It does not provide methods like CONTAINS.

Here’s an example, like when you swipe Tiktok:

Have you ever swiped repeated recommendations? With so much content to recommend to so many users, how does it ensure that when each user watches the recommended content, there won’t be a recommended video that they’ve already seen? That is to say, how does Douyin achieve push weight removal?

You can imagine that the server records all the history that the user has seen, and when the recommendation system recommends short videos it will sift through each user’s history, filtering out those that already exist. The problem is that when the number of users is large and each user has watched a lot of short videos, does this method of recommender system rework keep up with the performance?

In fact, if the history is stored in a relational database, deduplication requires frequent EXISTS queries against the database. When the system has high concurrency, the database is difficult to withstand the pressure.

You might think of caching again, but with so many users and so many histories, it would take a lot of space to cache them all. (Maybe the boss takes one look at the bill, one look at you…) And the amount of storage grows linearly over time, so even if you can use the cache for a month, how long can you keep it going? Without caching performance can not keep up, what to do?

The Bloom Filter, shown above, is one such high-level data structure designed to solve the problem of de-duplication. However, like HyperLogLog, it is also slightly inaccurate and has a certain probability of miscalculation, but it can solve the problem of gravity, while saving more than 90% in space, is also very worthwhile.

What is a Bloom filter

Bloom Filter was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions (more on that below), but you can also think of it simply as an imprecise set structure that might misjudge the existence of an object when you use its Contains method. But the Bloom filter is not particularly inaccurate, as long as the parameters are set reasonably, its accuracy can be controlled relatively accurate enough, there will only be a small probability of misjudgment.

When a Bloom filter says a value exists, it may not exist; When it says it doesn’t exist, it certainly doesn’t exist. For example, when it says it doesn’t know you, it really doesn’t know you, but when it says it knows you, it may misjudge you because you look like another friend it knows (with a similar face).

The application scenario of bloom filter

Based on the above functions, we can use the Bloom filter in the following scenarios:

  • If your server has enough memory, then using HashMap may be a good solution. Theoretically, the time complexity can reach O(1), but when the data volume is large, only Bloom filters can be considered.
  • Resolve cache penetration: We often put hot data in Redis as a cache, such as product details. Usually after a request to come we will first query cache, instead of reading the database directly, this is the most simple performance which is the most common practice, but If one has been requested does not exist the cache, so at this time must be no cache, there will be a lot of request directly to the database, cause cache penetration, Bloom filters can also be used to solve such problems.
  • Crawler/mailbox system filtering: I don’t know if you have noticed that some normal mail is also put in the spam directory, this is due to the use of bloom filter misjudgment.

Two, bloom filter principle analysis

Bloom filters are essentially bitvectors or bitlists (lists containing only 0 or 1 bit values) of length m. Initially all values are set to 0, so let’s first create a slightly longer bitvector for demonstration:

When we add data to the Bloom filter, we use multiple hash functions to evaluate the key, calculate a certificate index value, and then modulo the length of the bit array to get a position, each of which will evaluate a different position. To complete the add operation, set each position of the bit array to 1. For example, we add a wmyskxz:

When a Bloor filter is queried for the existence of a key, just as the add operation does, the key is computed through the same hash functions to check whether the corresponding position is 1. If one bit is 0, the key does not exist in the bloor filter. If all of these positions are 1, it does not mean that the key exists, but rather that it is highly likely to exist, because the 1 in these positions may be caused by the existence of other keys.

Query for a nonexistent key after adding some data:

Obviously, the 1 in 1/3/5 is due to the first wmyskxz added above, so there is a miscalculation here. Fortunately, Bloom filter has a formula that can predict the rate of misjudgment, which is more complex, and interested friends can read it on their own, more brain-burning.. Just keep the following in mind:

  • Do not use the actual number of elements much larger than the initial number;
  • Bloom filters should be applied when the actual number of elements exceeds the initial numberThe reconstructionAnd reassign onesizeLarger filters, then batch all the historical elementsaddConduct;

Three, the use of bloom filter

The bloom filter was introduced in Redis 4.0 as a plugin. Bloom filters are loaded into Redis Server as a plug-in, providing Redis with powerful bloom de-duplexing capabilities. Let’s try out the Bloom filter of Redis 4.0. In order to save the tedious installation process, let’s use Docker directly.

> docker pull redislabs/rebloom # pull mirror
> docker run -p6379:6379 redislabs/rebloom Run the container
> redis-cli Connect to the Redis service in the container
Copy the code

If the above three instructions execute without problem, the following can experience bloom filter.

  • Of course, if you don’t want to use the Docker, can also check the machine Redis version after qualified to install the plugin, you can refer to here: blog.csdn.net/u013030276/…

Basic usage of bloom filter

The bloom filter has two basic directives, bf.add to add an element and bf.exists to check if an element exists. This is used in the same way as sadd and sismember of set sets. Note that bf.add can only add one element at a time. If you want to add more than one element at a time, you need to use the bf.madd directive. Similarly, if you need to query the existence of multiple elements at once, you need to use the bf. Mexists directive.

127.0.0.1:6379> bf.add codehole user1
(integer) 1
127.0.0.1:6379> bf.add codehole user2
(integer) 1
127.0.0.1:6379> bf.add codehole user3
(integer1 127.0.0.1:6379> bf.exists codehole user1 (integer) 1
127.0.0.1:6379> bf.exists codehole user2
(integer1 127.0.0.1:6379> bf.exists codehole user3 (integer1 127.0.0.1:6379> bf.exists codehole user4 (integer) 0 127.0.0.1:6379> bf.madd codehole user4 user5 user6integer1) (2)integer) 1 (3)integer) 1 127.0.0.1:6379> bf. Mexists codehole user4 user5 user6 user7 1) (integer1) (2)integer) 1 (3)integer) 1 (4)integer) 0
Copy the code

The Bloby filter used above is just the bloby filter for the default parameter, which was created automatically when we first added it. Redis also provides bloem filters with custom parameters that can be created explicitly using the bf.reserve directive before add. If the corresponding key already exists, bf. Reserve will report an error.

Bf. reserve has three parameters, which are key, error_rate and initial_size:

  • The lower the error_rate is, the more space is needed. For occasions that do not need to be too precise, it is ok to set it a little larger. For example, the push system mentioned above will only filter out a small part of the content, and the overall viewing experience will not be greatly affected.
  • Initial_size means the number of elements expected to be put in. When the actual number exceeds this value, the misjudgment rate will increase. Therefore, it is necessary to set a large value in advance to avoid the increase of misjudgment rate caused by exceeding this value.

If bf.reserve is not applicable, the default error_rate is 0.01 and the default initial_size is 100.

Four, bloom filter code implementation

Their own simple simulation

Based on the above basic theory, we can easily implement a Bloem filter data structure for simple simulation by ourselves:

public static class BloomFilter {

    private byte[] data;

    public BloomFilter(int initSize) {
        this.data = new byte[initSize * 2]; // Create a space of size 2 by default
    }

    public void add(int key) {
        int location1 = Math.abs(hash1(key) % data.length);
        int location2 = Math.abs(hash2(key) % data.length);
        int location3 = Math.abs(hash3(key) % data.length);

        data[location1] = data[location2] = data[location3] = 1;
    }

    public boolean contains(int key) {
        int location1 = Math.abs(hash1(key) % data.length);
        int location2 = Math.abs(hash2(key) % data.length);
        int location3 = Math.abs(hash3(key) % data.length);

        return data[location1] * data[location2] * data[location3] == 1;
    }

    private int hash1(Integer key) {
        return key.hashCode();
    }

    private int hash2(Integer key) {
        int hashCode = key.hashCode();
        return hashCode ^ (hashCode >>> 3);
    }

    private int hash3(Integer key) {
        int hashCode = key.hashCode();
        return hashCode ^ (hashCode >>> 16); }}Copy the code

Here is very simple, only internal maintenance of a byte type of data array, in fact, byte still occupy a byte, can be optimized into a bit to replace, here is only for the convenience of simulation. In addition, I created three different hash functions, which use their own hash and the result of shifting different digits to the right using HashMap hash jitter. Basic Add and CONTAINS methods are provided.

Here’s a quick test of how the Bloom filter works:

public static void main(String[] args) {
    Random random = new Random();
    // Suppose we have data of 1 million
    int size = 1 _000_000;
    // Use a data structure to hold all the actual values
    LinkedList<Integer> existentNumbers = new LinkedList<>();
    BloomFilter bloomFilter = new BloomFilter(size);

    for (int i = 0; i < size; i++) {
        int randomKey = random.nextInt();
        existentNumbers.add(randomKey);
        bloomFilter.add(randomKey);
    }

    // Verify that all existing numbers exist
    AtomicInteger count = new AtomicInteger();
    AtomicInteger finalCount = count;
    existentNumbers.forEach(number -> {
        if(bloomFilter.contains(number)) { finalCount.incrementAndGet(); }}); System.out.printf("Actual amount of data: %d, judged existing amount of data: %d \n", size, count.get());

    // Verify 10 nonexistent numbers
    count = new AtomicInteger();
    while (count.get() < 10) {
        int key = random.nextInt();
        if (existentNumbers.contains(key)) {
            continue;
        } else {
            // This must be a nonexistent numberSystem.out.println(bloomFilter.contains(key)); count.incrementAndGet(); }}}Copy the code

The output is as follows:

Actual amount of data: 1,000,000; actual amount of data: 1,000,000false
true
false
true
true
true
false
false
true
false
Copy the code

That’s what it says. When a Bloom filter says a value exists, it probably doesn’t exist, when it says a value doesn’t exist, it definitely doesn’t exist, and there’s a certain misjudgment rate…

Manual implementation reference

Of course, the above version is very low, but the main idea is not bad, here is also a better version of the implementation of their own reference:

import java.util.BitSet;

public class MyBloomFilter {

    /** * the size of the bit array */
    private static final int DEFAULT_SIZE = 2 << 24;
    /** * from this array you can create 6 different hash functions */
    private static final int[] SEEDS = new int[] {3.13.46.71.91.134};

    /** * bit array. The elements in the array can only be 0 or 1 */
    private BitSet bits = new BitSet(DEFAULT_SIZE);

    /** * an array of classes containing hash functions */
    private SimpleHash[] func = new SimpleHash[SEEDS.length];

    /** * Initializes an array of classes containing hash functions, each with a different hash function */
    public MyBloomFilter(a) {
        // Initialize multiple different Hash functions
        for (int i = 0; i < SEEDS.length; i++) {
            func[i] = newSimpleHash(DEFAULT_SIZE, SEEDS[i]); }}/** * add element into array */
    public void add(Object value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true); }}/** * Determines whether the specified element exists in the bitarray */
    public boolean contains(Object value) {
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /** * static inner class. For hash operations! * /
    public static class SimpleHash {

        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        /** * Computes the hash value */
        public int hash(Object value) {
            int h;
            return (value == null)?0 : Math.abs(seed * (cap - 1) & ((h = value.hashCode()) ^ (h >>> 16))); }}}Copy the code

Use the Bloom filter that comes with Google’s open source Guava

The purpose of self-implementation is mainly to make myself understand the principle of Bloom filter. The realization of Bloom filter in Guava is relatively authoritative, so we do not need to manually implement a bloom filter in the actual project.

First we need to introduce Guava’s dependencies into the project:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.0 the jre</version>
</dependency>
Copy the code

The actual usage is as follows:

We created a Bloom filter that holds up to 1500 integers, and we can tolerate a 1% (0.01) probability of misjudgment.

// Create a Bloom filter object
BloomFilter<Integer> filter = BloomFilter.create(
        Funnels.integerFunnel(),
        1500.0.01);
// Checks whether the specified element exists
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));
// Add the element to the Bloom filter
filter.put(1);
filter.put(2);
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));
Copy the code

In our example, when the mightContain() method returns true, we can be 99% sure that the element is in the filter, and when the filter returns false, we can be 100% sure that the element is not in the filter.

Guava provides a good implementation of bloom filters (see the source code for more details), but it has a major drawback that it can only be used in a single machine (and capacity expansion is not easy), and the Internet is generally distributed. To solve this problem, we need to use the Bloom filter in Redis.

reading

  1. Redis (1) – 5 kinds of basic data structures – www.wmyskxz.com/2020/02/28/…
  2. Redis (2) – jump table – www.wmyskxz.com/2020/02/29/…
  3. Redis (3) – distributed lock delve into – www.wmyskxz.com/2020/03/01/…
  4. Reids (4) – magical HyperLoglog solving statistical problems – www.wmyskxz.com/2020/03/02/… \

The resources

  1. The depth of the Redis adventures – Qian Wenpin/a – book.douban.com/subject/303…
  2. 5 minutes to understand bloom filter, billions of data filtering algorithm you deserve! – juejin. Cn/post / 684490…
  3. Don’t know about Bloom filters? A penny for you to make it clear! – github.com/Snailclimb/…
  • Github: More Than Java: More Than Code star: github.com/wmyskxz/Mor…
  • Personal public number: wmyskxz, personal independent domain name blog: wmyskxz.com, adhere to the original output, below the scan code concern, 2020, and you grow together!

Thank you very much for reading this, if you think this article is well written, if you think there is something about “I don’t have three hearts”, please like, follow, share and leave a comment!

Creation is not easy, your support and recognition, is the biggest motivation for my creation, we will see you in the next article!