Does anyone else not understand bloom filters?

1. Introduction

When we use caching, we have to think about how to deal with cache avalanche, cache penetration, and cache breakdown. Here’s how to deal with cache penetration.

Cache penetration refers to that when someone requests non-existent data illegally, the cache will not take effect because the data does not exist, and the request will be directly sent to the database. When a large number of requests are concentrated on the non-existent data, the database will hang.

So there are several solutions:

  • When the database fails to query, an empty object corresponding to the request is automatically created in the cache and has a short expiration time
  • Use bloom filters to reduce database load.

So what does bloom filter come from?

Bloom Filter was proposed by Bloom in 1970. It is used to determine whether an element is in a set. By converting the element to a hash function and doing a bunch of calculations, you end up with multiple subscripts, which are then changed to 1 in an array of length N.

So how do you know if this element is in this set? You just need to know if the value of the index is 1.

Of course, bloom filter is not perfect, its disadvantages are misjudgment, difficult to delete

advantages disadvantages
There is no need to store key values, which occupies less space There is a misjudgment, the element cannot be 100% determined to exist
The spatial efficiency and query time are much higher than ordinary algorithms Delete the difficult

Bloem filter principle: When an element is added to a collection, the element is mapped to K points in a Bit array by K hash functions, setting them to 1. When you retrieve it, you can tell if the element is in the set by looking at whether all of these points are 1; If any of these points have a 0, the element being checked must not be there; If both are 1, then the element being examined is likely to be.

So why does the error exist? Because when the storage capacity is large, the hash index may be the same, so when the hash index of two elements is the same, it is impossible to determine whether the element must exist.

The same is true of deletion difficulties, because subscripts can be duplicated, and when you zero the value of that subscript, it can affect other elements as well.

Therefore, to deal with cache penetration, we only need to check whether the element exists on the Bloom filter, if it does not, we will directly return, if it exists, we will query the cache and database, although it has the impact of false judgment rate, but also can greatly reduce the burden of the database, but also can block most of the illegal requests.

Practice 2.

2.1 Redis implements bloom filter

Redis has a series of bit operation commands, such as setbit, getbit can set the value of the bit array, this feature can be a good implementation of the Bloom filter, there are ready-made dependencies have implemented the Bloom filter.

<dependency>
    <groupId>org.redisson</groupId>
    <artifactId>redisson</artifactId>
    <version>3.16.0</version>
</dependency>
Copy the code

The following is the test code. We first fill in the 800W number, and then cycle the next 200W number to test its misjudgment rate

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedisBloomFilter {

    public static void main(String[] args) {

        Config config = new Config();
        config.useSingleServer().setAddress("redis://localhost:6379");
        config.useSingleServer().setPassword("123456");
        // Create a redis connection
        RedissonClient redissonClient = Redisson.create(config);


        // Initializes the Bloom filter and passes in a custom name for the filter
        RBloomFilter<Integer> bloomFilter = redissonClient.getBloomFilter("BloomFilter");

        // Initializes the Bloom filter parameters, setting the number of elements and error rate
        bloomFilter.tryInit(110000.0.1);

        // Fill 800W digits
        for (int i = 0; i < 80000; i++) {
            bloomFilter.add(i);
        }

        // Start from 8000001 to check whether there is a false error rate
        double count = 0;
        for (int i = 80001; i < 100000; i++) {
            if(bloomFilter.contains(i)) { count++; }}// count/(1000000-8000001
        System.out.println("count=" + count);
        System.out.println("Miscarriage of justice =" + count / (100000 - 80001)); }}Copy the code

Conclusion: under the rounding, the miscarriage rate is 0.1

2.2 Google Guava tool class to implement bloom filter

Add Guava utility class dependencies

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>30.1.1 - jre</version>
</dependency>
Copy the code

Write test code:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class GuavaBloomFilter {
    public static void main(String[] args) {

        // Initializes the Bloom filter
        BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(),
                                                              110000.0.1);

        // Populate the data
        for (int i = 0; i < 80000; i++) {
            bloomFilter.put(i);
        }

        // Error detection
        double count = 0;
        for (int i = 80000; i < 100000; i++) {
            if (bloomFilter.mightContain(i)) {
                count++;
            }
        }

        System.out.println("count="+ count);
        System.out.println("The miscarriage rate is + count / (100000-80000)); }}Copy the code

Results:

The result is lower than the set misjudgment rate, I guess it may be caused by the different hash algorithms used at the bottom of the two. In the process of use, it is obvious that the speed of Bloom filter using Guava tool class is much faster than that using Redisson, which may be because Guava directly operates on memory. Redisson’s ability to interact with Redis is certainly not as fast as Guava’s ability to manipulate memory directly.

2.3 Handwritten Bloom filter

We use Java, a wrapped bit array called BitSet. BitSet provides a number of apis. The basic operations include:

  • Empty the array of data
  • Flip a bit of data
  • Sets the data for a certain bit
  • Get data for a certain bit
  • Gets the number of bits in the current bitSet

The following points to consider when writing a Bloom filter:

  • The size of the bit array must be specified. The larger the space is, the lower the probability of hash conflicts and the lower the misjudgment rate
  • Multiple hash functions, we should use multiple different primes as seeds
  • Implement two methods, one is to add elements to the filter, one is to determine whether the bloom filter exists the element

Hash out the high and low bits for xor, multiply the seed, and take the remainder of the size of the bit array.

import java.util.BitSet;

public class MyBloomFilter {

    // Default size
    private static final int DEFAULT_SIZE = Integer.MAX_VALUE;

    // The smallest size
    private static final int MIN_SIZE = 1000;

    // The size is the default
    private int SIZE = DEFAULT_SIZE;

    // Seed of the hash function
    private static final int[] HASH_SEEDS = new int[] {3.5.7.11.13.17.19.23.29.31};

    // An array of bits, 0/1, representing features
    private BitSet bitSet = null;

    / / the hash function
    private HashFunction[] hashFunctions = new HashFunction[HASH_SEEDS.length];

    // No parameter initialization
    public MyBloomFilter(a) {
        // Use the default size
        init();
    }

    // Initialize with parameters
    public MyBloomFilter(int size) {
        // The size is initialized to be less than the minimum size
        if (size >= MIN_SIZE) {
            SIZE = size;
        }
        init();
    }

    private void init(a) {
        // Initialize bit size
        bitSet = new BitSet(SIZE);
        // Initialize the hash function
        for (int i = 0; i < HASH_SEEDS.length; i++) {
            hashFunctions[i] = newHashFunction(SIZE, HASH_SEEDS[i]); }}// Add an element to an array
    public void add(Object value) {
        for (HashFunction f : hashFunctions) {
            // Set the hash position to true
            bitSet.set(f.hash(value), true); }}// Determine if the element's characteristics exist in the bitarray
    public boolean contains(Object value) {
        boolean result = true;
        for (HashFunction f : hashFunctions) {
            result = result && bitSet.get(f.hash(value));
            // If one of the hash functions evaluates to false, it returns
            if(! result) {returnresult; }}return result;
    }

    / / the hash function
    public static class HashFunction {
        // Bit array size
        private int size;
        / / hash seeds
        private int seed;

        public HashFunction(int size, int seed) {
            this.size = size;
            this.seed = seed;
        }

        / / the hash function
        public int hash(Object value) {
            if (value == null) {
                return 0;
            } else {
                / / hash value
                int hash1 = value.hashCode();
                // High hash value
                int hash2 = hash1 >>> 16;
                // Merge hash values (equivalent to combining high and low features)
                int combine = hash1 ^ hash1;
                // Multiply the remainder
                returnMath.abs(combine * seed) % size; }}}public static void main(String[] args) {
        Integer num1 = 12321;
        Integer num2 = 12345;
        MyBloomFilter myBloomFilter = newMyBloomFilter(); System.out.println(myBloomFilter.contains(num1)); System.out.println(myBloomFilter.contains(num2)); myBloomFilter.add(num1); myBloomFilter.add(num2); System.out.println(myBloomFilter.contains(num1)); System.out.println(myBloomFilter.contains(num2)); }}Copy the code

The handwritten code is from juejin.cn/post/696168…

The code shows that implementing a simple Bloom filter requires an array of bits, multiple hash functions, and ways to add elements to the filter and determine whether they exist. The larger the bit array space is, the smaller the probability of hash collision is. Therefore, the misjudgment rate and space size in bloom filter are correlated. The lower the misjudgment rate, the larger the space required.

2.4 Application scenarios of bloom filters

The purpose of a Bloom filter is to determine whether an element exists in a collection. Some interviewers may ask if now give you 10 w data collection, so I want to quickly determine how a data exists in the collection, bloom filter can be used to solve this problem, after all, despite the bloom filter exist, but can be 100% sure that the data does not exist, compared to the downside, perfectly acceptable.

There are also some application scenarios:

  • Check whether an email address is in the email blacklist
  • In a crawler, a URL that has been crawled is de-duplicated

Solve the cache through we can preheat, ahead of the data in the bloom filter, the request came in, query bloom filter whether there is the data first, if the data does not exist is returned directly, if the data is the first query Redis, Redis does not exist to query the database again, if there is new data to add, can also add data into bloom filter. Of course, if you have a lot of data that needs to be updated, it’s best to rebuild the Bloom filter.

3. Summary

  • Bloom filter is the use of an N-length bit array, through the elements of a variety of hashes, get a number of subscript values, in the bit array to get the value of the subscript is changed to 1, then the completion of the element storage
  • The error rate of a Bloom filter is negatively correlated with its bit array, and the lower the error rate, the larger the bit array required
  • The advantages of Bloom filter are higher storage space efficiency and faster query time, but the disadvantages are difficult deletion and misjudgment
  • Bloom filters are easy to implement in Redis. There is also a Bloom filter module on Github that can be installed on Redis, and Google’s Guava utility class in Java has bloom filters
  • Bloom filter is one of the solutions to cache penetration, through which you can determine whether the queried element exists

4. Refer to the article

  • Juejin. Cn/post / 696168…
  • Redis (13) —— Redis Bloom filter – YSOcean – Blog Park (CNblogs.com)
  • Bloom filter, this article tells you very clearly – Aliyun.com
  • Brief introduction and Application of Bloom filter – GeaoZhang – Blogpark (CNblogs.com)