This is the 18th day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021

The idea of a Bloom filter

If you want to determine whether an element is in a set, the general idea is to store all the elements and compare them. Linked lists, trees, and so on. But with the increase of elements in the set, we need more and more storage space and slower and slower retrieval speed (O(n),O(logn)).

Data structure of the Hash table

But there is another kind of data structure called a Hash table. It maps an element to a point in a Bit array using a Hash function. So in this case, we just have to look at whether this point is 1 to see if it’s in the set. That’s the basic idea of a Bloom filter.

Hash algorithm for bloom filters

The problem with Hash is conflict. Assuming the Hash function is good, if our bit array is m points long, then if we want to reduce the collision rate to say 1%, the Hash table can only hold m over 100 elements. Obviously, that’s not space-efficient anymore. The solution is simply to use multiple hashes, and if one of them says the element is not in the collection, it’s definitely not. If they’re both true, there’s a good chance they’re lying, but the odds of that being intuitively low.

Pure Java implementation of the scheme

 public class MyBloomFilter {
 
    /** * A 1 billion bit */
    private static final int DEFAULT_SIZE = 256 << 22;
 
    /** * To reduce the error rate, the addition hash algorithm is used, so define an 8-element prime array */
    private static final int[] seeds = {3.5.7.11.13.31.37.61};
 
    /** * is equivalent to building 8 different hash algorithms */
    private static HashFunction[] functions = new HashFunction[seeds.length];
 
    /** * Initializes the bitmap of the Bloom filter */
    private static BitSet bitset = new BitSet(DEFAULT_SIZE);
 
    /** * Add data **@paramValue Specifies the value to be added */
    public static void add(String value) {
        if(value ! =null) {
            for (HashFunction f : functions) {
                // Compute the hash value and change the bitmap position to true
                bitset.set(f.hash(value), true); }}}/** * determines whether the corresponding element has *@paramValue Specifies the element *@returnResults the * /
    public static boolean contains(String value) {
        if (value == null) {
            return false;
        }
        boolean ret = true;
        for (HashFunction f : functions) {
            ret = bitset.get(f.hash(value));
            // a hash function returns false to break out of the loop
            if(! ret) {break; }}return ret;
    }
 
    /** * test... * /
    public static void main(String[] args) {
 
        for (int i = 0; i < seeds.length; i++) {
            functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
        }
 
        // Add 100 million data
        for (int i = 0; i < 100000000; i++) {
            add(String.valueOf(i));
        }
        String id = "123456789";
        add(id);
 
        System.out.println(contains(id));   // true
        System.out.println("" + contains("234567890"));  //false}}class HashFunction {
 
    private int size;
    private int seed;
 
    public HashFunction(int size, int seed) {
        this.size = size;
        this.seed = seed;
    }
 
    public int hash(String value) {
        int result = 0;
        int len = value.length();
        for (int i = 0; i < len; i++) {
            result = seed * result + value.charAt(i);
        }
        int r = (size - 1) & result;
        return (size - 1) & result; }}Copy the code

Redis implements bloom filters

Introduction to Bloom filter

Bloom filter is a long binary vector and a series of random mapping functions, suitable for determining whether a certain data in the set exists, there will be false identification.

  • The advantage is that the space efficiency and query time are much better than the general algorithm

  • The disadvantage is that there is a certain error recognition rate and deletion difficulty.

Application scenario of bloom filter

Client – Bloom filter (HashMap)- Redis cache -DB database

  1. At startup, cache all the redis keys into a Bloom filter. You can also use hashMap, but the Bloom filter will perform much faster than HashMap.
  2. When the client requests, it first passes through the Bloom filter to determine whether the key exists. If not, it directly returns the key, which can solve the redis penetration and breakdown.
  3. The Bloom filter misjudged through redis, and did not cause the original influx of large quantities, which is acceptable
  4. If the redis key changes, let bloom filter cache preheat, to solve the deletion problem

Problems with bloom filters

miscalculation

  • Jarye2 itself does not exist in the binary vector table. It is thought to exist because of the hash value and other collisions.

  • Solution: Set the misjudgment probability low enough, but it will lead to a much larger direction scale.

Delete the difficult

If Jarye2 is deleted, the value of vector address 8 and 13 will be set to 0, resulting in the failure of Jarye1 to be hit

** Solution: ** cache is reheated.

Bloem filter demo example

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>22.0</version>
</dependency>
Copy the code
/** * test demo */
public class BlongTest {
    /** * Store 1 million bytes of data in bloom */
    private static Integer size = 1000000;

    public static void main(String[] args) {
        /** * The last parameter is the misjudgment rate, which must be >0.0 * the misjudgment rate is 3% 100W data, the length of the two-lane array is 730W * the misjudgment rate is 1% 100W data, the length of the two-lane array is 960W * The overall efficiency and accuracy, the recommended value is 1% ** /
        BloomFilter<Integer> integerBloomFilter = 
            BloomFilter.create(Funnels.integerFunnel(), size, 0.01);
        for (int i = 0; i < size; i++) {
            integerBloomFilter.put(i);
        }
        // Check whether the data exists from the blob
        ArrayList<Integer> strings = new ArrayList<>();
        for (int j = size; j < size + 10000; j++) {
            if (integerBloomFilter.mightContain(j)) {
                strings.add(j);
            }
        }
        System.out.println("Number of miscarriages of justice :"+ strings.size()); }}Copy the code

Solve Redis breakdown problem based on Bloom filter

@RequestMapping("/getOrder")
public OrderEntity getOrder(Integer orderId) {
    if(integerBloomFilter ! =null) {
        if(! integerBloomFilter.mightContain(orderId)) { System.out.println("Key not found in Bloom filter");
            return null; }}// 1. Check whether the data in Redis exists
    OrderEntity orderRedisEntity = (OrderEntity) redisTemplateUtils.getObject(orderId + "");
    if(orderRedisEntity ! =null) {
        System.out.println("Return data directly from Redis");
        return orderRedisEntity;
    }
    // 2. Query the database contents
    System.out.println("Query data from DB");
    OrderEntity orderDBEntity = orderMapper.getOrderById(orderId);
    if(orderDBEntity ! =null) {
        System.out.println("Put Db data into Redis");
        redisTemplateUtils.setObject(orderId + "", orderDBEntity);
    }
    return orderDBEntity;
}

@RequestMapping("/dbToBulong")
public String dbToBulong() {
    List<Integer> orderIds = orderMapper.getOrderIds();
    integerBloomFilter = BloomFilter.create(Funnels.integerFunnel(), orderIds.size(), 0.01);
    for (int i = 0; i < orderIds.size(); i++) {
        integerBloomFilter.put(orderIds.get(i));
    }
    return "success";
}
Copy the code

Data reference

  • www.cnblogs.com/wuwuyong/p/…