preface
We talked earlier about Redis cache avalanche, penetration, and breakdown. In this article we mentioned that one of the solutions to cache penetration is the Bloom filter, but we didn’t talk about how to use the Bloom filter last time.
As a warm man’s elder brother, to make up for you, please call me IT old warm man.
What is a Bloom filter
The Bloom Filter, which was introduced in 1970 by a guy named Bloom, is now 50 years old, as old as my brother.
It’s actually a long binary vector and a bunch of random mapping functions, and binary, as you probably all know, stores data that’s either 0 or 1, and the default is 0.
It is used to determine whether an element is in a set. 0 indicates that certain data does not exist, and 1 indicates that certain data exists.
Do you understand? As aSunshine boy
My brother is drawing you a picture to help you understand:
Use of Bloom filter
-
Resolving Redis cache penetration (today’s focus)
-
In crawler, the crawler url is filtered, and the url that already exists in Braun is not crawled.
-
Spam filtering, to determine whether each address that sends mail is in bloom’s blacklist, if so, it is judged as spam.
The above is just a simple example of use, we can draw inferences from one another, flexible use in the work.
Bloom filter principle
In the process
A Bloom filter, as it says, is a collection of binary data. When a piece of data is added to the set, it undergoes the following (there are drawbacks, which will be discussed below) :
-
The data is computed by K hash functions and returns K computed hash values
-
These K hash values map to the corresponding K binary array subscripts
-
Change the binary data corresponding to K subscripts to 1.
For example, if the first hash function returns x and the second and third hash functions return y and z, then the binary of x, y, and z is changed to 1.
As shown in the figure:
The query process
The main function of bloom filter is to query a data, in this binary set, the query process is as follows:
-
This data is computed by K hash functions, corresponding to K hash values computed
-
Find the corresponding binary array subscript using the hash value
-
Judgment: If there is a location where the binary data is 0, then the data does not exist. If both are 1’s, the data is in the set. (There are drawbacks, which will be covered below)
Removal process
You generally can’t delete data in bloom filters, which is one of the drawbacks we’ll examine below.
Advantages and disadvantages of bloom filters
advantages
-
Since you store binary data, it takes up very little space
-
Its insert and query speed is very fast and the time complexity is O (K), reminiscent of the process of HashMap
-
Confidentiality is good because it doesn’t store any raw data itself, only binary data
disadvantages
Which brings us back to the shortcomings we mentioned above.
Adding data is done by calculating the hash value of the data, so it is quite possible that two different data will compute the same hash value.
For example, if “hello” and “hello” in the figure end up with the same hash value, they will change the binary data with the same subscript to 1.
At this point, you don’t know whether binary with subscript 2 means “hello” or “hello.”
The following disadvantages can be obtained:
First, miscarriage of justice
If the graph above does not contain “hello” but only “hello”, then a query using “hello” will determine that “hello” exists in the collection.
Because the hash value for “hello” and “hello” is the same, the binary data found by using the same hash value is also the same, which is 1.
Two, delete difficulty
I’m going to stop here and I think you can understand why, but as your warm brother, I’ll tell you.
Again, use the above example, because “hello” and “hello” have the same hash value and the corresponding array index is the same.
At this time, the elder brother wanted to delete “hello”, and changed the binary data in the subscript 2 from 1 to 0.
Do we delete “hello” at the same time? (0 means there is such data, 1 means there is no such data)
By this point, you’ve already understood the Bloom filter. I’m the warm guy.
Implement bloom filter
There are many ways to implement this, one of which is the one provided by Guava.
Introduction of Guava POM configuration
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
< version > 29.0 jre < / version >
</dependency>
Copy the code
Two, code implementation
And let’s measure the error rate by the way.
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
public class BloomFilterCase {
/ * *
* How much data is expected to be inserted
* /
private static int size = 1000000;
/ * *
* Expected miscarriage rate
* /
private static double fpp = 0.01;
/ * *
* Bloom filter
* /
private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size, fpp);
public static void main(String[] args) {
// Insert 100,000 sample data
for (int i = 0; i < size; i++) {
bloomFilter.put(i);
}
// Test the error rate with another hundred thousand test data
int count = 0;
for (int i = size; i < size + 100000; i++) {
if (bloomFilter.mightContain(i)) {
count++;
System.out.println(i + "Misjudged.");
}
}
System.out.println("Total number of miscarriages of justice :" + count);
}
}
Copy the code
Running results:
There are 947 misjudgments in 100,000 data, which is about 0.01%, which is the misjudgment rate set in our code: FPP = 0.01.
Dive deep into code
The core bloomfilter.create method
@VisibleForTesting
static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
.
}
Copy the code
There are four parameters:
-
Funnel: Data type (used in the Funnels tool class)
-
ExpectedInsertions: The number of values you expect to insert
-
FPP: Misjudgment rate (default 0.03)
-
Strategy: Hash algorithm
Let’s focus on the FPP parameter
FPP misjudgment rate
Scene 1:FPP = 0.01
-
Number of misjudgments: 947
-
Memory size: 9585058 bits
Scene two:FPP = 0.03
(Default parameters)
-
Number of misjudgments: 3,033
-
Memory size: 7298440 bits
Scene summary
-
The misjudgment rate can be adjusted by FPP parameters
-
The smaller the FPP, the more memory space is required: 0.01 requires more than 9 million bits and 0.03 requires more than 7 million bits.
-
The smaller the FPP, the more hash functions needed to calculate the more hash values to store in the corresponding array subscripts when adding data to the collection. (Forgot to look at the process of filtering the data stored in bloom above)
NumBits = 1 million int digits Theoretically save a million numbers, an int is 4 bytes 32 bits, need 481000000=32 million bits. If you use HashMap to store, you need 64 million bits for 50 percent HashMap storage efficiency. As you can see, BloomFilter has a very small storage space, about 1/10 of that of HashMap
The numHashFunctions above indicate that several hash functions are required to map different subscripts to whether the numbers exist (0 or 1).