preface

For the record, we all know the usage scenarios of Bloom filter, and we probably ask too many questions in the interview. For example, the second kill scenario of member marketing, we can use bloom filter to do part of the blocking function to prevent cache penetration. These scenes of bloom filter data rarely changes or no obvious expired features, if the data is a large amount of data, outdated data is big, with the passage of time the data quantity is more and more to may have billions a year levels of data quantity, all in bloom filter inside the server is also hard to bear, or starting from the principle analysis below.

define

Bloom Filter (English: Bloom Filter) was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions.

The principle of

BloomFilter is composed of a fixed size binary vector or bitmap and a series of mapping functions.

In the initial state, for a bit array of length m, all its bits are set to 0, as shown below:

When a variable is added to the set, it is mapped to K points in the bitmap by K mapping functions and set them to 1 (assuming that two variables are passed by 3 mapping functions). For example, the value “baidu” and three different hash functions generate hash values of 1, 4 and 7 respectively with the value “Tencent”. If the hash function returns 4, 9, 13, then the following figure is shown

So when we look up a variable we just have to see if all of these points are 1 and we know with a high probability that it’s in the set

  • If any of these points have a 0, the queried variable must not be there;
  • If both are 1, the queried variable is likely to exist

Why is it possible, not certain? That’s because the mapping function itself is a hash function, and hash functions are subject to collisions.

I’m not going to talk about the rate of misjudgment that comes out of the mathematical formula.

Misjudgment rate

The error of bloom filter is that multiple inputs are hashed at the same bit position 1, so it is impossible to determine which input is generated. Therefore, the error is caused by the same bit being mapped multiple times and set to 1.

features

The advantages and disadvantages of the Bloom filter are obvious

  1. When an element is judged to exist, there is a misjudgment, not necessarily an existence. But to say that an element does not exist means that it does not exist.
  2. Bloom filters can only add elements, not remove them.

use

Use I do not describe here, general local words can use Guava package bloom filter, distributed Redis bloom filter can use Redisson package.

<dependencies>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>26.0 the jre</version>
    </dependency>
</dependencies>
Copy the code

Estimated capacity

Refer to the following two blogs, I directly post conclusions, some of the following articles are also a little different, about 10 billion URLS, error rate 0.01%, about the size is about 25G

The last

Why is there no Bloom filter to remove elements, because removing elements can lead to a higher miscalculation rate. We have some scenarios that require timed expired and expired data that do not need to be placed in the Bloom filter. How to do that? This is the scenario I discussed with my colleagues tonight. We can do some circumvention from business. For example, your data volume in one month is 100 million, but these data will expire after three months.

Again, architecture without a specific business scenario is empty talk, and good architects are “good at faking things!!”

Reference documentation

www.cnblogs.com/qdhxhz/p/11… www.cnblogs.com/D-Rui/artic… Blog.csdn.net/BrHiker/art…