Questions to think about

  1. What ways can you think of to tell whether an element is in a set?
  2. Bloom filters do not store the data itself, so how do they answer that question?
  3. How is a Bloom filter implemented, and how are its parameters configured?

In general, when we use a List, Map, or Set to determine whether an element exists, we first store the element and then look it up. The drawback of all these structures is that the data itself must be stored. If we only want to know whether an element is present, and do not need the element itself, this wastes space.

When do we only need to know whether an element exists? In system design we have to plan for a large number of concurrent requests, many of which may ask for data that does not exist. There is no need to process those requests any further: they can be filtered out directly at the API gateway layer.

Principle of Bloom Filter

A Bloom filter is a data structure built on a binary vector (a bit array), proposed by Burton Howard Bloom in 1970. It is efficient in both space and time and is used to test whether an element is a member of a set.

Instead of storing the data itself, a Bloom filter uses k hash functions to compute k positions in a bit array and sets each of those positions to 1. To test membership it recomputes the same k positions: if any of them is 0, the element was definitely never added; if all of them are 1, the element is probably present. The value of k is calculated from a formula listed below. Besides k, we also need to calculate the length m of the bit array.
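To make the mechanism concrete, here is a minimal sketch of the idea in plain Java. The class name SimpleBloomFilter, the FNV-1a hash, and the "h1 + i * h2" way of deriving k positions are illustrative assumptions, not how Guava does it internally, and m and k are passed in by hand rather than computed.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter: m bits, k derived positions per element. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of hash positions per element

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // 32-bit FNV-1a hash; the seed lets us derive two different base hashes.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // i-th position for an element: combine two base hashes, then map into [0, m).
    private int position(byte[] data, int i) {
        int h1 = fnv1a(data, 0);
        int h2 = fnv1a(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, m);
    }

    /** Sets the k bits for this value; the value itself is never stored. */
    void put(String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < k; i++) {
            bits.set(position(data, i));
        }
    }

    /** False means "definitely absent"; true means "probably present". */
    boolean mightContain(String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(data, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.put("11");
        System.out.println(filter.mightContain("11")); // always true after put
        System.out.println(filter.mightContain("12")); // usually false, may be a false positive
    }
}
```

Only bit positions are recorded, which is why the filter can answer "definitely not present" with certainty but can only answer "probably present" for a positive.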

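The length m of the bit array is calculated as:

$$m = -\frac{n \ln p}{(\ln 2)^2}$$

  • p (fpp): the target false-positive rate (must satisfy 0 < p < 1)
  • n: the estimated number of elements to be inserted

The number of hash functions k is calculated as:

$$k = \frac{m}{n} \ln 2$$

  • m: the length of the bit array
  • n: the estimated number of elements to be inserted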
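As a worked example, assume the standard sizing formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2, which are the same ones used by the Guava code shown later in this article. The values n = 1,000,000 and p = 0.01 are chosen purely for illustration:

$$
\begin{aligned}
m &= -\frac{n \ln p}{(\ln 2)^2} = -\frac{10^{6} \times \ln 0.01}{(\ln 2)^2} \approx \frac{10^{6} \times 4.6052}{0.4805} \approx 9.59 \times 10^{6} \text{ bits} \approx 1.2 \text{ MB} \\
k &= \frac{m}{n} \ln 2 \approx 9.585 \times 0.6931 \approx 6.64 \;\Rightarrow\; k = 7
\end{aligned}
$$

In other words, a 1% false-positive rate costs roughly 9.6 bits per element, regardless of how large the elements themselves are.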

Let’s use the value 11 as an example. There is also a website for testing Bloom filters online.

Advantages and disadvantages of Bloom filter

Advantages:

  • Space-efficient: instead of storing all the data, it only computes positions from hash values and records them in a bit array.
  • Fast: insertion and lookup each take O(k) time for k hash functions, which is constant with respect to the number of stored elements.

Disadvantages:

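  • It can report false positives. Suppose A hashes to positions 1 and 3, and B hashes to positions 5 and 7. If C hashes to positions 1 and 7, both bits are already set, so the filter reports that C may exist even though it was never added. A negative answer, by contrast, is always correct.
  • Elements cannot be deleted directly. Deleting means resetting the corresponding positions to 0, but those positions may be shared with other elements, which would then be wrongly reported as absent.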
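Both drawbacks can be reproduced with a tiny sketch that skips hashing and uses the positions from the example directly. The class FalsePositiveDemo and its helper methods are hypothetical, and the element D (positions 3 and 5) is an extra assumption added only to show the deletion problem.

```java
import java.util.BitSet;

class FalsePositiveDemo {
    // Records an element by setting the bits at its (pre-computed) positions.
    static void put(BitSet bits, int... positions) {
        for (int p : positions) {
            bits.set(p);
        }
    }

    // Reports "maybe present" only if every position is already set.
    static boolean mightContain(BitSet bits, int... positions) {
        for (int p : positions) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(8);
        put(bits, 1, 3); // A
        put(bits, 5, 7); // B

        // C (positions 1 and 7) was never added, but both bits are set by A and B.
        System.out.println(mightContain(bits, 1, 7)); // prints true: a false positive

        put(bits, 3, 5); // D, hypothetical element sharing bit 3 with A

        // Naive deletion of A: clear its bits.
        bits.clear(1);
        bits.clear(3);

        // D was added and never removed, yet it now looks absent.
        System.out.println(mightContain(bits, 3, 5)); // prints false: a false negative
    }
}
```

The second result is the more serious one: a plain Bloom filter is only useful because a negative answer can be trusted, and clearing shared bits breaks that guarantee.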

Bloom filter implementation

Google's Guava library already provides an implementation, so we don’t have to write one ourselves. Let’s see how it calculates the parameters:

   /**
    * Calculates the length of the bit array.
    * n: estimated number of elements
    * p: false-positive rate, between 0 and 1
    */
   @VisibleForTesting
   static long optimalNumOfBits(long n, double p) {
       if (p == 0.0D) {
           // Avoid log(0) by substituting the smallest positive double (4.9E-324).
           p = Double.MIN_VALUE;
       }

       return (long) ((double)(-n) * Math.log(p) / (Math.log(2.0D) * Math.log(2.0D)));
   }
    /**
     * Calculates the number of hash functions.
     * n: estimated number of elements
     * m: bit array length
     */
    @VisibleForTesting
    static int optimalNumOfHashFunctions(long n, long m) {
        // Cast before dividing so that m / n is not truncated by integer division.
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2.0D)));
    }
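To get a feel for what these two formulas produce, the sketch below re-implements them in a standalone class (BloomSizing is a hypothetical name, not part of Guava) and applies them to n = 800,000,000 and p = 0.00001, the values used in the example later in this article.

```java
class BloomSizing {
    // Bit array length: m = -n * ln(p) / (ln 2)^2
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions: k = (m / n) * ln 2, at least 1
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 800_000_000L;
        double p = 0.00001;

        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);

        // Convert bits to gigabytes to see the memory cost.
        double gigabytes = m / 8.0 / 1e9;
        System.out.printf("bits=%d (about %.2f GB), hash functions=%d%n", m, gigabytes, k);
    }
}
```

With these inputs the filter needs roughly 19.2 billion bits, which is about 2.4 GB, and 17 hash functions. That is about 24 bits per element, far less than storing 800 million strings, but still large enough to matter when sizing the JVM heap.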

Try it yourself

  • expectedInsertions: the estimated number of elements to be inserted. The filter is sized for this count, so if you insert more elements than estimated, the actual false-positive rate rises above fpp. In the example below you can set fpp freely; the larger it is, the more often mightContain returns true for elements that were never inserted.
  • fpp: the desired false-positive probability, strictly between 0 and 1

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterTest {

    public static void main(String[] args) {

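        // Note: at these settings the bit array needs roughly 2.4 GB of heap;
        // lower expectedInsertions if you only want to experiment.
        int expectedInsertions = 800000000;
        double fpp = 0.00001;

        BloomFilter<CharSequence> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), expectedInsertions, fpp);
        int i = 10000;
        while (i > 1) {
            bloomFilter.put("aa" + i);
            // "ab" + i was never inserted, so this prints false unless it is a false positive.
            System.out.println(bloomFilter.mightContain("ab" + i));
            i--;
        }
    }
}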
