This is the fifth day of my participation in the November Gwen Challenge. Check out the details: The last Gwen Challenge 2021

HashSet is the first thing that comes to mind when we get the need to de-weight a collection. Indeed, using hashCode() and equals() methods, HashSet is efficient, fast and strict in preventing duplicate elements within a collection.

However, when the number of set elements is too large, the memory of a HashSet becomes very large, and such a large HashSet is clearly undesirable. That’s where the Bloom filter comes in.

concept

The Bloom filter is essentially a clever probabilistic data structure that inserts and queries efficiently and tells you “something definitely doesn’t exist or might exist.”

The data structure

The Bloom filter consists of a BitMap. A BitMap is essentially an array. Unlike normal arrays, bitmaps have one bit per array element, which is a huge space saver.

When using a Bloom filter, the following is done for each element added:

  1. Performs n hash computations for the elements to be added.
  2. Set the BitMap bit corresponding to the hash result to 1.

For queries (to determine whether an element exists), the following operations are performed:

  1. Performs n hash computations for the elements to be queried.
  2. Query whether the BitMap bit corresponding to the hash result is 1. If both are 1, it indicates that the element may exist. If there are non-1 bits, the element must not exist.

miscalculation

Note that the Bloom filter cannot determine the presence of an element, only that it does not exist. The reason is that multiple inputs are hashed at the same bit position 1, so it is impossible to determine which input is generated, so the root cause of misjudgment is that the same bit is mapped multiple times and set to 1.

This situation also causes bloor filter deletion problems, i.e. there is no bloor filter deletion. Because each bit of a Bloom filter is not exclusive, it is possible that multiple elements share one bit. If we delete this bit directly, it will affect the other elements.

implementation

Since the implementation

Using the above principles, we can implement a Bloom filter ourselves, as shown in the following example:

package com.example.springtest; import java.util.BitSet; Public class BloomFilter {1073741824 Private static final int SIZE = 1 << 30; Private BitSet BitSet = new BitSet(SIZE); Seed Private static final int[] seeds = {3, 37, 61}; seed private static final int[] seeds = {3, 37, 61}; Private static HashFunction[] functions = new HashFunction[seeds.length]; public BloomFilter() { for (int i = 0; i < functions.length; i++) { HashFunction hashFunction = new HashFunction(SIZE, seeds[i]); functions[i] = hashFunction; Public void add(String s){public void add(String s){public void add(String s){public void add(String s){public void add(String s){public void add(String s){public void add(String s) : functions) { int hash = function.hash(s); bitSet.set(hash,true); }} /** * @description specifies whether an element exists * @param [s] * @return Boolean **/ public Boolean contains(String s){int containsCount = 0; for (HashFunction function : functions) { int hash = function.hash(s); boolean b = bitSet.get(hash); if (b){ containsCount++; If (containsCount >= functions. Length){return true; }else { return false; } } } class HashFunction { private int size; private int seed; public HashFunction(int size, int seed) { this.size = size; this.seed = seed; } public int hash(String value) { int result = 0; int len = value.length(); for (int i = 0; i < len; i++) { result = seed * result + value.charAt(i); } int r = (size - 1) & result; return (size - 1) & result; }}Copy the code

The test code is as follows:

Public class BloomDemo {public static void main(String[] args) {// Add 100 BloomFilter BloomFilter = new BloomFilter(); for (int i = 0; i < 100; i++) { bloomFilter.add(String.valueOf(i)); } // Verify system.out. println("30: "+ bloomfilter. contains(string.valueof (30))); System.out.println("71:"+bloomFilter.contains(String.valueOf(71))); System. The out. Println (" 93: "+ bloomFilter. The contains (String) the valueOf (93))); System.out.println("3:"+bloomFilter.contains(String.valueOf(3))); System.out.println("149:"+bloomFilter.contains(String.valueOf(149))); System.out.println("179:"+bloomFilter.contains(String.valueOf(179))); }}Copy the code

Execution Result:

Existing implementations

Better implementations of Bloom filters exist.

Guava, for example

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </dependency>
Copy the code

Create a BloomFilter with bloomfilter.create ()

BloomFilter<Integer> BloomFilter = bloomfiler.create (Funnels. IntegerFunnel (), 100000, 0.01);Copy the code

Among them:

Funnel is a PrimitiveSink interface that converts arbitrary types of data into HashCode (byte arrays). Guava predefined a few primitive Funnel types, such as String, Long, Integer, and so on.

100000 is the expected capacity of a Bloom filter element.

0.01 is the allowable error rate and must be less than 1.

From the above parameters, BloomFilter will calculate the appropriate capacity.

  /**
   * Computes m (total bits of Bloom filter) which is expected to achieve, for the specified
   * expected insertions, the required false positive probability.
   *
   * <p>See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the
   * formula.
   *
   * @param n expected insertions (must be positive)
   * @param p false positive rate (must be 0 < p < 1)
   */
  @VisibleForTesting
  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }
Copy the code

Redis implementation

In distributed environment, using Redis to implement bloom filter is more suitable.

The installation

In Redis, bloom filters are provided as plug-ins that need to be installed before use.

  1. Downloading the Installation package

    Wget HTTP: / / https://github.com/RedisBloom/RedisBloom/archive/refs/tags/v2.2.6.tar.gzCopy the code
  2. Unzip and compile

    Tar -zxf v2.2.6.tar.gz CD RedisBloom-2.2.6/ makeCopy the code
  3. Configure this module in redis.conf

    ################################## MODULES ##################################### # Load modules at startup. If the server is not able to load modules # it will abort. It is possible to use multiple loadmodule directives. # # loadmodule /path/to/my_module.so # loadModule /path/to/other_module.so loadModule /usr/local/RedisBloom-2.2.6/ RedisBloomCopy the code
  4. Start Redis, specifying the configuration file to start

    redis-server 
    Copy the code

use

The basic commands for RedisBloom are as follows:

The command format instructions
bf.reserve bf.reserve {key} {error_rate} {initial_size} Create an empty Bloom filter with size initial_size bitvector length and error rate error_rate
bf.add bf.add{key} {item} Adds an element item to the Bloom specified by key
bf.madd Bf. madd {key} {item} {item2} {item3}… Add multiple elements at once
bf.exists bf.exists {key} {item} Query whether the element exists
bf.mexists Bf. Mexists {key} {item} {item2} {item3}… Check to see if multiple elements exist
bf.info bf.info {key} Query information about the Bloom specified by key
bf.debug bf.debug {key} View the internal details of BloomFilter (such as the number of elements per layer, error rate, etc.)
cf.reserve cf.reserve {key} {initial_size} Create an empty Bloom filter of initial_size bit vector length
cf.add cf.add {key} {item} Adds an element item to the Bloom specified by key
cf.exists cf.exists {key} {item} Check to see if the element exists
bf.scandump bf.scandump {key} {iter} (key: the name of the Bloom filter,iter: the first call to the value 0, or the value returned from the last call to this command)
bf.localchunk bf.localchunk {key} {iter} {data} Load the SCANDUMP persisted Bloom data (key: the name of the target Bloom filter, iter: the iterator value returned by SCANDUMP, corresponding to data, data: the data chunk returned by SCANDUMP)

The following is an example:

Initialize a Bloom filter, Each parameter subtable represents key error rate expected capacity 127.0.0.1:6379> bf.reserve BFTest 0.01 10000 OK # Add element 127.0.0.1:6379> bf.add BFTest 1 (integer) 1 127.0.0.1:6379> bf.add BFTest 2 (integer) 1 # Add BFTest 4 5 6 7 1) (integer) 12) (integer) BFTest 1 (integer) 1 127.0.0.1:6379> bf.exists BFTest 1 (integer) 1 127.0.0.1:6379> bf.exists BFTest 2 (integer) 1 127.0.0.1:6379> bf.exists BFTest 3 (integer) 0Copy the code

Application scenarios

Typical applications of Bloom filters are:

  • Prevent database penetration. Google Bigtable, HBase, Cassandra, and Postgresql use BloomFilter to reduce disk lookups for nonexistent rows or columns. Avoiding costly disk lookups can greatly improve the performance of database query operations.
  • Prevent cache penetration. Cache penetration is data that is not in the cache or the database, and the user keeps making requests, such as for data with id “-1” or data with an ID that is particularly large and does not exist. In this case, the user is likely to be the attacker, and the attack will cause the database to be overburdened. Using bloom filters reduces the strain on the database by avoiding frequent queries for non-existent data.
  • In business scenarios, determining whether a user has read a certain video or article, such as Douyin or Toutiao, will of course lead to certain misjudgments, but users will not see repeated content.
  • WEB interceptor intercepts the same request if it is used to prevent repeated attacks. For the first request, the user puts the request parameters into the Bloom filter. For the second request, the user determines whether the request parameters are matched by the Bloom filter. Can improve cache hit ratio. The Squid web proxy cache server uses Bloom filters in cache digests. Google Chrome uses bloom filters to speed up secure browsing services