Bloom filter
A Bloom filter is a probabilistic data structure consisting of a bit array and multiple hash functions that returns two possible and certain non-existent results.
An element in a Bloom filter is determined by multiple state values. Bit arrays store state values, and hash functions calculate the position of state values.
According to its algorithm structure, it has the following characteristics:
- A finite bit array is used to represent the number of elements greater than its length, because a bit’s state value can simultaneously identify multiple elements.
- Elements cannot be deleted. This is because a bit state value may simultaneously identify multiple elements.
- Adding elements never fails. However, as more elements are added, the misjudgment rate increases.
- If the judgment element does not exist, it must not exist.
For example, X,Y and Z are determined by three state values to determine whether the element exists, and the position of the state value is calculated by three hash functions.
Mathematical relationship
The probability of misjudgment
As for the probability of misjudgment, the state value of each bit may identify multiple elements at the same time, so it has a certain probability of misjudgment. If the bit array is full, it will always return true when determining whether an element exists, with a 100% error rate for non-existent elements.
So, the probability of miscalculation is related to the number of elements added, the length of the Bloom filter (the size of the bit array), the number of hash functions.
According to Wikipedia inference, the probability of misjudgment PfpP_{fp}Pfp has the following relationship:
- MMM is the size of the bit array;
- NNN is the number of added elements;
- KKK is the number of hash functions;
- Eee mathematical constant, approximately 2.718281828.
It can be concluded that when the number of added elements is 0, the false positive rate is 0. When the bit array is all 1, the false positive rate is 100%.
The relation between Pfp P_{fp}Pfp and N nn under different number of hash functions is shown as follows:
There are a couple of things you can do based on the probability of misjudgment formula
- Estimate the optimum bloom filter length.
- Estimate the optimal number of hash functions.
Optimum bloom filter length
When NNN adds elements and PfpP_{fp}Pfp false positive probability is determined, MMM equals:
Optimal number of hash functions
When NNN and PfpP_{fp}Pfp are determined, KKK is equal to:
When NNN and MMM are determined, KKK equals:
Implement bloom filter
Before using a Bloom filter, we typically evaluate two factors.
- The maximum number of elements expected to be added.
- The business’s tolerance for error. For example, if one error is allowed in 1,000 cases, the probability of misjudgment should be less than one in 1,000.
Many Bloom filtering tools provide the expected number of additions and probability of miscalculation configuration parameters, and they calculate the optimal length and number of hash functions based on the configured parameters.
There are some good Bloom filtering toolkits in Java.
Guava
中BloomFilter
.redisson
中RedissonBloomFilter
It can be used in Redis.
Take a look at Guava’s simple implementation of BloomFilter, which calculates the length of the bit array and the number of hash functions before creating it.
static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
/** * expectedInsertions: Expect number of additions * FPP: Misjudgment probability */
long numBits = optimalNumOfBits(expectedInsertions, fpp);
int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
try {
return new BloomFilter<T>(new BitArray(numBits), numHashFunctions, funnel, strategy);
} catch (IllegalArgumentException e) {
throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e); }}Copy the code
According to the optimal Bloom filter length formula, the optimal bit array length is calculated.
static long optimalNumOfBits(long n, double p) {
if (p == 0) {
p = Double.MIN_VALUE;
}
return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
Copy the code
According to the optimal number of hash functions formula, calculate the optimal number of hash functions.
static int optimalNumOfHashFunctions(long n, long m) {
return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
Copy the code
The calculation method of RedissonBloomFilter in Redisson is also consistent.
private int optimalNumOfHashFunctions(long n, long m) {
return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
private long optimalNumOfBits(long n, double p) {
if (p == 0) {
p = Double.MIN_VALUE;
}
return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
Copy the code
Memory footprint
Imagine a scenario where a mobile phone number is deleted. Each mobile phone number occupies 22 bytes. The estimated logical memory is as follows.
expected | HashSet | FPP = 0.0001 | FPP = 0.0000001 |
---|---|---|---|
1 million | 18.28 MB | 2.29 MB | 4MB |
10 million | 182.82 MB | 22.85 MB | 40MB |
100 million | 1.78 G | 228.53 MB | 400MB |
Note: The actual physical memory is larger than the logical memory.
Misjudgment probability PPP and added elements NNN, bit array length MMM, number of hash functions KKK relationship is as follows:
Application scenarios
- Weak password detection;
- Spam address filtering.
- Browser detects phishing sites;
- Cache penetration.
Weak cipher detection
Maintain a list of hashed weak passwords. When a user registers or updates a password, a bloom filter is used to check for the new password and the prompt user is detected.
Spam address filtering
Maintains a hashed list of spam addresses. When a user receives a message, bloom filter is used to detect it and identifies it as spam.
Browser detects phishing sites
Use the Bloom filter to find if a URL for a site exists in the phishing site database.
The cache to penetrate
Cache penetration is a query for data that does not exist and is not hit by either the cache layer or the database. When the cache is not hit, the database is queried
- If the database is not hit, empty results are not written back to the cache and returned.
- The database hits, the query result is written back to the cache and the result is returned.
A typical attack simulates a large number of requests for nonexistent data, all of which fall on the database, causing the database to go down.
One solution is to put the existing cache into a Bloom filter and validate the filter before the request is made.
summary
For quadrillion levels of data, the use of Bloom filters has certain advantages, and it is critical to properly evaluate the expected number of additions and the probability of misjudgment based on the business scenario.
reference
En.wikipedia.org/wiki/Bloom_…
hur.st/bloomfilter