A preface

If you have a system of 1.5 billion users, with hundreds of millions of users accessing the system every day, how can you quickly determine whether you are a user in the system?

  • Method 1:1.5 billion users are stored in the database. Every time a user accesses the system, the user queries the database for judgment. The accuracy is high, but the query speed is slow.
  • Method 2:1.5 billion users are cached in Redis memory. Every time users access the system, they will query and judge in Redis. The accuracy is high and the query speed is fast, but it takes up a lot of memory. Even if only user IDS are stored, one character per user ID, then 1.5 billion *8 bytes =12GB, which is relatively wasteful for some servers with limited memory space.

As for the website crawler project, we all know the number of websites in the world and its many, every time we climb a new website URL, how to quickly judge whether crawler? There are spam mail filtering, advertising phone filtering and so on. If you still use the above two methods, obviously not the best solution.

Moreover, the query is a system of the highest frequency operation, when the query of a data, first to the cache query (such as Redis), if the cache is dead, so to the persistent layer database (Mongo, mysql, etc.) query, found that there is no this data, so this query failed. If there are a lot of users, and the cache is dead, and then all requests to the persistent layer database, this will put a lot of pressure on the database, which can seriously drag down the database. Commonly known as cache penetration.

You may have heard another term called cache breakdown, which is when you have a hot key, and it’s running with high concurrency, and all of a sudden the key fails, and at the moment of failure, a lot of requests to the cache die, all of them go to the database.

How can these and similar scenarios be solved efficiently? In view of this, bloom filter came into being.


Two Bloom filter

Bloom Filter was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. The advantage of this algorithm is that the space efficiency and query time are much better than the general algorithm, but the disadvantage is that there is a certain error recognition rate and deletion difficulty.

A binary vector is simply a binary array. The values in this array are either 0 or 1.

A mapping function that maps an element to a point in a Bit array. So by looking at this point, you can tell whether or not this element is in the set.

The basic idea

  • When an element is added to the collection, the element is mapped to K points in an array of bits by K hash functions, setting them to 1.
  • When you retrieve an element, you map that element through these K hash functions, see if all of these positions are 1’s and you can tell if that element exists in the set. If any of these positions have a 0, then the elementA certainDoes not exist; If both are 1, then the checked elementIs likely toThere is.

Bloom Filter instead of a single hash function mapping, Bloom Filter uses k hash functions, with each element corresponding to k bits. Thus reducing the probability of conflict.

advantages

  1. Binary array, small memory footprint, and insert and query fast, constant level.
  2. Hash functions are not necessarily related to each other, so they can be implemented by hardware in parallel.
  3. Storing only zeros and ones, without storing the elements themselves, is an advantage in some cases where secrecy is very strict.

disadvantages

  1. There is a margin of error. As the number of stored elements increases, the miscalculation rate increases. (For example, in real life if normal emails are also put into the spam directory, normal SMS is blocked) you can add a small whitelist to store elements that might be misjudged.
  2. Deletion is difficult. An element mapped to the k positions of the bit array is 1. When deleting it, it cannot be simply set to 0 directly, which may affect the judgment of other elements. Because it’s possible for other elements to be mapped to 1 in the same place. Can be usedCounting Bloom FilterTo solve.

Three Redis implementation

In Redis, there is a data structure called a bitmap, or bitmap. The following are some common operation commands.

In the Redis command, SETBIT key offset value is used to set the binary array of values corresponding to the key from left to right, with the binary number of the offset subscript set to value.

The value of key K1 is keke, the corresponding ASCII code is 107 101 107 101, and the corresponding binary code is 0110 1011,0110 0101,0110 1011,0110 1011,0110 0101. Set the position of subscript 5 to 1, so it becomes 0110 1111,0110 0101,0110 1011,0110 0101. Oeke namely.

GETBIT key offset command, which is used to obtain the value of the specified subscript.

Another common command, BITCOUNT key [start end], is used to get the number of bitmaps in a range of 1. Note that start and end specify the number of bytes, not the bitarray subscripts.

Redisson is a library for manipulating Redis in Java programs. With Redisson, we can easily use Redis in programs. Redisson is a client tool that implements the Bloom filter, which is based on a bitmap data structure.

Bloom filters are available in Redis after Redis 4.0 made plug-ins available. Bloom filter is loaded into Redis Server as a plug-in, which provides Redis with a powerful bloom deduplication function. This article will not elaborate, you can be interested in the official view of the detailed documentation. It also uses the following common commands:

  1. Bf. add: Adds elements
  2. Bf. madd: Add elements in batches
  3. Bf.exists: Retrieves whether an element exists
  4. Bf. Mexists: Retrieves whether multiple elements exist
  5. Bf. reserve: Custom Bloom filter, set key, error_rate and initial_size

The following demonstration is in the local single node Redis implementation, if the amount of data is very large, and the error rate is very low, the single node memory may be insufficient. Of course, in clustered Redis, it is also possible to implement distributed Bloom filters through Redisson.

Introduction of depend on

<! -- https://mvnrepository.com/artifact/org.redisson/redisson -->
<dependency>
    <groupId>org.redisson</groupId>
    <artifactId>redisson</artifactId>
    <version>3.13.6</version>
</dependency>
Copy the code

The test code

package com.nobody;

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

/ * * *@Description
 * @Author Mr.nobody
 * @Date 2021/3/6
 * @Version1.0 * /
public class RedissonDemo {

    public static void main(String[] args) {

        Config config = new Config();
        config.useSingleServer().setAddress("Redis: / / 127.0.0.1:6379");
        // config.useSingleServer().setPassword("123456");

        RedissonClient redissonClient = Redisson.create(config);
        // Get a Bloom filter whose redis key is users
        RBloomFilter<Integer> bloomFilter = redissonClient.getBloomFilter("users");

        // Suppose the number of elements is 100,000
        int size = 100000;

        // Initialize with an expected element of 100,000 with an error rate of 1%
        bloomFilter.tryInit(size, 0.01);

        // Map the 100,000 numbers from 1 to 100000 to the Bloom filter
        for (int i = 1; i <= size; i++) {
            bloomFilter.add(i);
        }

        // Check if any values already in the filter do not match
        for (int i = 1; i <= size; i++) {
            if(! bloomFilter.contains(i)) { System.out.println("Mismatched values exist:"+ i); }}// Check if any of the 1000 values that are not in the filter match
        int matchCount = 0;
        for (int i = size + 1; i <= size + 1000; i++) {
            if (bloomFilter.contains(i)) {
                matchCount++;
            }
        }
        System.out.println("Number of misjudgments:"+ matchCount); }}Copy the code

All 100,000 elements that existed were matched; There are no 1,000 elements in the Bloom filter, and there are 23 misjudgments.

Number of misjudgments: 23Copy the code

Four Guava implementation

There are many implementations and optimizations of Bloom filters, including one in Guava. The blom filter’s bit array is stored in JVM memory, so it is stand-alone and has a maximum bit size of int.

  • When using a Bloom filter, the important concerns are the estimated data volume N and the expected misjudgment rate FPP.
  • When implementing bloom filters, the important concerns are the selection of hash functions and the size of the bit array.

Bit Array size selection

According to the estimated data volume N and misjudgment rate FPP, the calculation method of M of bit array size is as follows:

Guava source code implementation is as follows:

@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
  if (p == 0) {
    p = Double.MIN_VALUE;
  }
  return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
Copy the code

Hash function selection

The choice of the number of hash functions is also very particular. The choice of hash functions affects the performance, and a good hash function can map elements to each Bit with approximately equal probability. How do you choose to construct k functions? Well, an easy way to do that is to take a hash function and send it in k different parameters.

The number of hash functions k can be calculated from the estimated amount of data n and the length of the bit array M:

Guava source code implementation is as follows:

@VisibleForTesting
  static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }
Copy the code

Introduction of depend on

<! -- https://mvnrepository.com/artifact/com.google.guava/guava --> <dependency> <groupId>com.google.guava</groupId> < artifactId > guava < / artifactId > < version > 28.2 jre < / version > < / dependency >Copy the code

The test code

package com.nobody;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

/ * * *@Description
 * @Author Mr.nobody
 * @Date 2021/3/6
 * @Version1.0 * /
public class GuavaDemo {

    public static void main(String[] args) {

        // Suppose the number of elements is 100,000
        int size = 100000;

        // The expected element is 100,000, with a 1% margin of error
        BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size, 0.01);

        // Map the 100,000 numbers from 1 to 100000 to the Bloom filter
        for (int i = 1; i <= size; i++) {
            bloomFilter.put(i);
        }

        // Check if any values already in the filter do not match
        for (int i = 1; i <= size; i++) {
            if(! bloomFilter.mightContain(i)) { System.out.println("Mismatched values exist:"+ i); }}// Check if any of the 1000 values that are not in the filter match
        int matchCount = 0;
        for (int i = size + 1; i <= size + 1000; i++) {
            if (bloomFilter.mightContain(i)) {
                matchCount++;
            }
        }
        System.out.println("Number of misjudgments:"+ matchCount); }}Copy the code

All 100,000 elements that existed were matched; There are no 1,000 elements in the Bloom filter, there are 10 misjudgments.

Number of misjudgments: 10Copy the code

When the value of FPP is changed to 0.001, that is, the error rate is reduced, the number of misjudgments is 0.

Number of misjudgments: 0Copy the code

The analysis showed that the misjudgment rate was indeed about the same as the error tolerance rate we passed in, and that the elements in the Bloom filter matched.

Source code analysis

When the expected number of elements is 100,000 and the value of FPP is 0.01, 958505 bits and 7 hash functions are required.

When 100,000 elements are expected and FPP is 0.001, 1437,758 bits and 10 hash functions are required.

Come to the conclusion

  • The greater the fault tolerance rate, the less space and time required, and the smaller the fault tolerance rate, the more space and time required.
  • An int is 4 bytes, or 32 bits, and requires 3.2 million bits. If you use HashMap storage, you need 6.4 million bits for 50 percent HashMap storage efficiency. However, even if the fault tolerance rate of BloomFilter is 0.001, it only needs 1437758 bits, which shows that the storage space of BloomFilter is very small.

Five Expand knowledge points

Suppose you have A server with only 4GB of memory and two large files on disk. File A stores 10 billion urls and file B stores 10 billion urls. How to find the URL intersection of two files blur? How to delicately find the URL intersection of two files.

Fuzzy intersection:

With the help of Bloom filter idea, the URL of one file is mapped to the bit array by hash function first, which greatly reduces memory storage, and then the URL of another file is read to match in the bit array.

Exquisite intersection:

Hash large files into smaller files, for example, 1000 small files (if the server has A smaller memory, more smaller files can be split). For example, file A can be split into A1, A2, A3… An, file B split into B1, B2, B3… Bn. By using the same hash function, the same URL must be mapped to A small file with the same subscript. For example, www.baidu.com of file A must be mapped to A1, and www.baidu.com of file B must be mapped to B1. Finally, by finding the intersection of small files with the same subscript (such as A1 and B1) (A2 and B2).


Welcome to wechat public account: “Java words” technical articles continue to update, please continue to follow……

  • Learn the latest technical articles as soon as possible
  • Get the latest technology learning materials video
  • Latest Internet information and interview experience