This article is to discuss the Bloom filter from the perspective of the white, if you are a graduate, or smart, or really want to fully understand the bloom filter can move on.

I do not know when to start, the original unknown bloom filter suddenly became famous, as if in the Internet, doing development, no one knows, no one knows, even if the technology is not very concerned about the partner also heard its name. I also spent a lot of time studying Bloom filters and reading a lot of blogs, but I’m not a graduate, I’m not that smart, AND I’m lazy… After the infinite cycle of “give up, pick up, give up, pick up”, it should be understood the core idea of Bloom filter, so I want to share with you.

Application of Bloom filter

Let’s take a look at the bloom filter application scenario to show you what the magic Bloom filter can do.

The cache to penetrate

We often put some data in Redis caches, such as product details. In this way, when a query request comes in, we can directly fetch data from the cache according to the product Id, instead of reading the database. This is the simplest, most common, and most effective way to improve performance. The general query request flow is like this: first check the cache, if there is a cache directly return, if there is no cache, then go to the database query, and then put the data from the database into the cache, everything looks good. But what happens if a lot of requests come in now, and they’re all asking for a product Id that doesn’t exist? Since the product Id does not exist, there must be no cache, no cache, then a large number of requests to the database will be on the pressure of the database, and may kill the database. There are many ways to solve this problem, but our hero is the Bloom filter, and yes, the Bloom filter can solve (alleviate) the cache penetration problem. As for why it is “remission”, you will understand it after reading.

A lot of data to determine if a given is in it

Now there is a large amount of data, and the size of the data has far exceeded the memory of the server, now I give you another data, how to determine whether the data given to you is in it. If the server memory is large enough, HashMap is a good solution, theoretically O(1) time, but now the size of the data is far too large to use HashMap, the “Bloom filter” can be used to solve the problem. But again, there is a certain “misjudgment rate.”

What is a Bloom filter

The Bloom filter was proposed by a guy named Bloom, and it is itself a long binary vector, and since it is a binary vector, it is obvious that it stores either zeros or ones.

Now let’s create a new bloom filter of length 16, with the default value 0, like this:

Now we need to add a piece of data:

We calculate that Hash1(data)=5 in some way, such as Hash1. We change the subscript 5 to 1, like this:

We calculate that Hash2(data)=9 using some method, such as Hash2. We change the subscript 9 to 1, like the following:

If Hash3 =2, we change the subscript 2 to 1, like this:

Thus, the data just added occupies the bloom filter “5”, “9”, “2” three grids.

As you can see, the Bloom filter itself does not store complete data at all, but just uses a series of random mapping functions to calculate the position and fill in the binary vector.

How does that help? For example, if you are given another data, and you want to determine whether the data is repeated, what do you do?

All you have to do is figure out which cells the number occupies using the three fixed calculations above, and then see if all of these cells are filled with 1s. If any of these cells is not a 1, that means the number is not one of them. This is easy to understand, for example, now you are given the data you just added, you through three fixed calculation methods, the result must be the same as the above, also occupy the Bloom filter “5”, “9”, “2” three grids.

However, there is a problem to note that if all the 1’s are placed in these cells, it does not necessarily mean that the given data must be repeated, and maybe the results of other data calculated by three fixed calculation methods are also the same. It makes sense, for example, that we need to determine whether objects are equal, not just whether their hashes are equal.

That is, the Bloom filter can only determine whether the data must not exist, but not whether the data must exist.

Normally, after introducing the process of adding and querying, it is necessary to introduce the process of deleting, but unfortunately, it is very difficult to delete data bloom filter, why? For example, if you want to delete the data given to you, you change the “5”, “9” and “2” to 0, but other data may also map to the “5”, “9” and “2” three squares, isn’t that a mess?

I believe that after my introduction, you should have a superficial understanding of Bloom filter, at least you should know the advantages and disadvantages of bloom filter:

  • Advantages: because the storage is not complete data, so the memory occupation is very little, and new, query speed is fast enough;
  • Disadvantages: The misjudgment rate increases with the increase of data; Unable to delete data; It can only judge whether the data must not exist, but not whether the data must exist.

As you can see, the advantages of the Bloom filter are just as obvious as the disadvantages.

In this paper, I cite examples of binary vector length is 16, calculated by the three random mapping function, position, in the actual development, if you want to add a large amount of data, only 16 is not enough, in order to reduce misjudgment rate, we can also use more random mapping function, the longer the binary vector to calculate the position.

Guava implements bloom filters

Now BELIEVE that you should have a more perceptual understanding of The Bloom filter, the core idea of the Bloom filter is not difficult, difficult is how to design random mapping function, exactly how many times to map, the length of the binary vector set is better, which may not be the general development can control, Let’s take a look at the “gift” Google has given us by providing us with an out-of-the-box component to help implement bloom filters.

First, introduce “gifts” in POM:

<dependency> <groupId>com.google.guava</groupId> <artifactId> <version>19.0</version> </dependency>Copy the code

Then you can test:

    private static int size = 1000000;// How much data is expected to be inserted

    private static double fpp = 0.01;// Expected misjudgment rate

    private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size, fpp);

    public static void main(String[] args) {
        // Insert data
        for (int i = 0; i < 1000000; i++) {
            bloomFilter.put(i);
        }
        int count = 0;
        for (int i = 1000000; i < 2000000; i++) {
            if (bloomFilter.mightContain(i)) {
                count++;
                System.out.println(i + "Misjudged.");
            }
        }
        System.out.println("Total number of miscarriages of justice :" + count);
    }
Copy the code

Code simple analysis: we define a Bloom filter, there are two important parameters, respectively is how much data we expect to insert, we expect the misjudgment rate, the misjudgment rate can not be 0. I inserted 0-1000000 into the Bloom filter, and used 1000000-2000000 to test the misjudgment rate.

Running results:

1999501, 1999567, 1999640, 1999697, 1999827, 1999942 total number of miscarriages of justice: 10,314Copy the code

There are now a total of 1 million data that don’t exist, 10,314 misjudgments, so let’s calculate the misjudgment rate

Redis implements bloom filters

We can also use Redis to implement the Bloom filter. The data structure we are going to use is bitmap. You may have questions, redis supports five data structures: String, List, Hash, Set, ZSet, no bitmap. Yes, bitmaps are essentially strings.

Some people might say, Nani, how can a bitmap come out before the bloom filter is introduced? It’s ok, you can think of a bitmap as a binary vector.

To use Redis to achieve bloom filter, we need to design their own mapping function, measure the length of binary vector, this for me, is undoubtedly an impossible task, can only use the search engine, the following directly released the code.

public class RedisMain {
    static final int expectedInsertions = 1000;// How much data to insert
    static final double fpp = 0.01;// Expected misjudgment rate

    //bit The length of the array
    private static long numBits;

    // Number of hash functions
    private static int numHashFunctions;

    static {
        numBits = optimalNumOfBits(expectedInsertions, fpp);
        numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    }

    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost".6379);
        for (int i = 0; i < 1000; i++) {
            long[] indexArray = getIndexArray(String.valueOf(i));
            for (long index : indexArray) {
                jedis.setbit("codebear:bloom", index, true); }}int num = 0;
        for (int i = 1000; i < 2000; i++) {
            long[] indexArray = getIndexArray(String.valueOf(i));
            for (long index : indexArray) {
                if(! jedis.getbit("codebear:bloom", index)) {
                    System.out.println(i + "It must not exist.");
                    num++;
                    break;
                }
            }
        }
        System.out.println("There must be something that doesn't exist." + num + "个");
    }

    /** * Get bitmap subscript */ based on key
    private static long[] getIndexArray(String key) {
        long hash1 = hash(key);
        long hash2 = hash1 >>> 16;
        long[] result = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            long combinedHash = hash1 + i * hash2;
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            result[i] = combinedHash % numBits;
        }
        return result;
    }

    private static long hash(String key) {
        return Hashing.MURMUR_HASH.hash(key);
    }

    // Calculate the number of hash functions
    private static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Calculate the length of the bit array
    private static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2))); }}Copy the code

Running results:

1997 definitely didn't exist 1998 definitely didn't exist 1999 definitely didn't exist 989Copy the code

That’s the end of this blog, thank you.