Bloom Filter
thinking
If you were constantly trying to determine the presence of an element, what would you do?
- It’s easy to imagine that you could use a HashSet (HashMap) to look up elements as keys
- The time complexity of this method is O(1), but the disadvantage is that the space utilization is not high, and more memory resources need to be occupied
But if you’re writing a web crawler to crawl a billion web sites, how do you tell if a site has crawled before, in order to avoid crawling to duplicate sites?
- Obviously, a HashSet is not a very good alternative to a HashMap because it consumes a lot of memory
Is there a low time complexity, low memory footprint solution? Bloom filters do just that.
Introduction to Bloom filter
Bloom filter was proposed by Bloom in 1970. It is a probabilistic data structure with high space utilization, which can be used to tell you that an element must not exist or may exist. Based on this conclusion, bloom filter has the following advantages and disadvantages
- Advantages: space efficiency and query time are far more than the general algorithm
- Disadvantages: there is a certain misjudgment rate, difficult to delete
Although the Bloom filter has a certain misjudgment rate, the misjudgment rate can still be controlled by code, so it can be adjusted according to business requirements. In general, bloom filters can be considered in the following situations
- It is often necessary to determine whether an element exists
- The number of elements is large and you want to have less memory space
- A certain percentage of misjudgment is allowed
Essence: The essence of a Bloom filter is a long binary vector and a series of random mapping functions (Hash functions). As described above, a Bloom filter consists of two parts, one is a Hash function and the other is a binary vector
Binary vector: Can be understood as a binary array
Common application
- Web page blacklist system, spam filtering system, crawler url judge system, to solve the problem of cache penetration
The principle of bloom filter
Assume that the Bloem filter consists of a 20-bit binary (with an initial value of 0) and three hash functions, each of which generates an index
- Add elements and set the index generated by each hash function to 1. For example: Now suppose you want to add an element A, it will use three hash function respectively, used to generate the corresponding value (assuming that the first hash function generated by the index of 4, the second hash function generated by the index of 7, the third hash function to generate index to 18), and then in the array corresponds to the index value is set to 1 if you want to continue to add elements of B, It also takes three hash functions, generates the corresponding index value (assuming the first hash function generates the index value of 2, the second hash function generates the index value of 7, and the third hash function generates the index value of 15), and sets the corresponding index value in the array to 1
- Query whether the element exists: use hash function to calculate the index of the element under the corresponding function. If all the values in the corresponding array index are 1, it means that the element may exist. If at least one of the values in the corresponding index is 0, it means that the element must not exist
- If a hash function generates an index at a position other than 1, it does not exist (100% accurate)
- If each hash function generates an index position of 1, it is present, but there is a certain misjudgment rate
So, according to the principle of bloom filter, we know that
The time complexity of add/query is O(k), where K is the number of hash functions
The space complexity is O(m), where M is the number of binary bits
Error rate of Bloom filter
Generally speaking, the misjudgment rate P is affected by three factors, respectively
- The number of bits m
- The number of hash functions k
- Data size n
According to the formula shown in the figure below, the current miscarriage rate p can be calculated
Since the value of n is very large when the data size is very large, the constant coefficient of 0,5 can be ignored, and the number of binary bits is also very large, so the constant 1 can also be ignored. Therefore, the simplified formula is as follows
Therefore, in actual development, the misjudgment rate is determined in combination with the business, so the misjudgment rate can be considered as a known value. And the data size is also known, so the number of binary bits M and the number of hash table K can be obtained by using the misjudgment rate P and the data size N
The scientists came up with the following formula:
Count the number of bits
Count the number of hash tables
The implementation of Bloom filter
Bloom filters provide two apis, one for adding elements and the other for checking whether elements exist
The implementation of the two apis is as follows
/* * n is the data size * p is the misjudgment rate (0,1) * */
public BloomFilter(int n,double p) {
if (n <= 0 || p <= 0 || p >= 1) {throw new IllegalArgumentException("wrong n or p");
}
double ln2 = Math.log(2);
// Compute the length of the binary vector
bitSize = (int)(- (n * Math.log(p)) / (ln2 * ln2));
// Count the number of hash functions
hashSize = (int)(bitSize * ln2 / n);
// The length of the bits array
bits = new long[(int)((bitSize + Long.SIZE - 1)) / Long.SIZE];
}
/* * Add element * */
public void put(T value){
nullCheck(value);
int hash1 = value.hashCode();
int hash2 = hash1 >>> 16;
for (int i = 1; i <= hashSize; i++) {
int combinedHash = hash1 + (i * hash2);
if (combinedHash < 0) {
combinedHash = ~combinedHash;
}
// Generate a binary index
int index = combinedHash % bitSize;
// Set the binary bit of index to 1set(index); }}Copy the code
The above is the main implementation logic of the two apis. Specific implementation can be referred to demo.
Demo Download Address
Finished!