Recently I saw an article explaining the principle of Bloom filter, speak very clearly, reprint record. The original link: www.cnblogs.com/rinack/p/97…

1, the principle of

The great use of the Bloom filter is that it can quickly determine whether an element is in a set. So he has the following three usage scenarios: Web crawlers to URL to heavy, avoiding the same URL anti-spam crawl, from the list of billions of spam judge whether a mailbox junk mail (by the same token, the spam messages) cache penetration, put all possible data cache in bloom filter, when hackers access does not exist the cache quickly returned to avoid cache and DB hang up. OK, let’s talk about how bloom filters work

A bit array of all zeros is maintained internally. It should be noted that The Bloom filter has a concept of misjudgment rate. The lower the misjudgment rate is, the longer the array is and the larger the space it occupies. The higher the misjudgment rate, the smaller the array and the smaller the space occupied. Suppose, based on the error rate, we generate an array of 10 bits and two hash functions ((f_1,f_2)), as shown in the figure below (the number of bits in the generated array and the number of hash functions, we do not care how to generate, there are mathematical papers to prove it).

Assuming that the input set is ((N_1,N_2)), the value obtained by calculation (f_1(N_1)) is 2, and the value obtained by calculation (f_2(N_1)) is 5, then the position of the array subscript 2 and table 5 is set to 1, as shown in the figure below








2. Performance test

(1) Create a Maven project and introduce the Guava package

<dependencies> <dependency> <groupId>com.google.guava</groupId> <artifactId> <version>22.0</version> </dependency> </dependencies>Copy the code

(2) The time required to test whether an element belongs to a million-element set

import java.util.ArrayList; import java.util.List; import com.google.common.hash.BloomFilter; import com.google.common.hash.Funnels; public class Test { private static int size = 1000000; private static BloomFilter<Integer> bloomFilter =BloomFilter.create(Funnels.integerFunnel(), size); public static void main(String[] args) {for(int i = 0; i < size; i++) { bloomFilter.put(i); } List<Integer> list = new ArrayList<Integer>(1000); // Deliberately take 10,000 values that are not in the filter to see how many are considered in the filterfor (int i = size + 10000; i < size + 20000; i++) {
            if (bloomFilter.mightContain(i)) {
                list.add(i);
            }
        }
        System.out.println("The number of miscarriages of justice:"+ list.size()); }}Copy the code

The output is as follows: 330 If, as shown in the code above, we deliberately took 10,000 values that were not in the filter, yet 330 were still thought to be in the filter, this indicates a misjudgment rate of 0.03. That is, without setting anything, the default misjudgment rate is 0.03. Here is the source code to prove:

Next, when the misjudgment rate is 0.03, the length of the low-level maintained bit array is shown in the figure below

Private static bloomfilter bloomfilter = bloomfilter. create(funnel.integerFunnel (), size,0.01); That is, the misjudgment rate is 0.01. In this case, the length of the low-level maintained bit array is shown below

3. Practical use

The Redis pseudocode is shown below

String get(String key) {
    String value = redis.get(key);     
    if (value  == null) {
        if(! bloomfilter.mightContain(key)){return null; 
        }else{ value = db.get(key); redis.set(key, value); }}returnThe value; }Copy the code

advantages

  1. Thinking is simple
  2. Ensure consistency
  3. Strong performance

disadvantages

  1. Code complexity increases
  2. A separate collection needs to be maintained to hold cached keys
  3. Bloor filters do not support deleting values