As we all know,IO has always been a bottleneck in computers. Many frameworks, technologies and even hardware are designed to reduce IO operations. Today, let’s talk about filters and let’s talk about a scene:
Our business back-end involves a database. When a request message queries for some information, it may first check whether there is relevant information in the cache. Any return, if not likely to go inside the database query, by this time there is a problem, if a lot of requests is the data in the request database doesn’t exist, the database will be frequent response this unnecessary IO query, if some more, most database IO request operation in response to this kind of meaningless, so how to stop these requests Outside? Filters were born!
Java backend architecture advanced Study notes:Java from entry to architecture Growth Notes :JVM+ Concurrency + source code + distributed + Microservices + Dachang field projects + Dachang performance tuning solutions (click here to get free)
Bloom filter
The general idea of Bloom Filter is that when the information you request comes in, first check whether the data you query is available. If yes, the request will be pressed to the database. If not, it will be returned directly.
As shown in the figure, a bitmap is used for recording, and the original value of the bitmap is all 0. When a data is stored in, three Hash functions are used to calculate the Hash value three times, and the corresponding position of the bitmap is set to 1. In the figure above, the bitmap Positions 1,3, and 6 are marked as 1, so if a data request comes in,** still computs the Hash value using the previous three Hash functions,** If the same data, certainly will still is mapped to 1,3,6, so you can judge the data store before, if the new data mapping three locations, there is a match, not to join the mapping to 1,3,7, because seven is 0, that is the data didn’t join into the database before, so returned directly.
Bloem filter problem
As you can already see, bloom filters have some problems:
First, the Bloom filter may misjudge:
Suppose there is such a scenario, when data packet 1 is put in, the 1,3 and 6 bits of bitmap are set to 1, and when data packet 2 is put in, the 3,6 and 7 bits of bitmap are set to 1. At this time, a data packet that has not been stored requests 3. After three hashes, the corresponding bitmap sites are 1,6 and 7 respectively, and the data has not been stored before Yes, but since data packets 1 and 2 are stored at the corresponding point of 1, request 3 will also be pressed to the database. In this case, it will increase as data is stored.
Secondly, the Bloom filter cannot delete data, which has the following two dilemmas:
- First, it is uncertain whether the data exists in the database due to the possibility of misjudgment, such as packet 3.
- Second, when you delete the flag on the bitmap corresponding to a certain packet, other packets may be affected. For example, in the above example, if you delete packet 1, it means that bitmap1,3 and 6 bits will be set to 0. At this time, when packet 2 requests, it will show that it does not exist, because the 3 and 6 bits have been set to 0.
Bloom filter enhanced edition
In order to solve the problem of the Counting Bloom filter, there is an enhanced version of the Bloom filter Filter), the idea of this Filter is to replace the bitmap of the Bloom Filter with an array. When a certain position in the array is mapped once, it will be +1, and when deleted, it will be -1. In this way, the problem of recalculating the Hash of other packets after deleting data in the common Bloom Filter is avoided, but it still cannot avoid misjudgment.
Cuckoo filter
In order to solve the problem that the Bloom Filter cannot delete elements, the author of Cuckoo Filter: Better Than Bloom proposed a Cuckoo Filter. Compared with the cuckoo filter, bloom filter has the following disadvantages: poor query performance, low space utilization efficiency, no reverse operation (delete), and no counting support.
Poor query performance is due to the fact that the Bloom filter needs to use multiple hash functions to detect multiple different loci in the bitmap, which are widely spread across memory, resulting in low CPU cache row hit ratio.
The spatial efficiency is low because the cuckoo filter is significantly more space-efficient than The Bloon filter at the same misjudgment rate, saving about 40% in space. However, bloom filters do not require bitmap length to be an exponent of two, as cuckoo filters do. From this point of view, it seems that bloom filters are more spatially scalable.
The problem of not supporting the reverse delete operation really hits the soft spot of bloom filter. In a dynamic system elements are always coming and going. Bloem filters are like blots. They come in and they leave, and they can’t clean up after they leave. For example, if you only have 1kW elements left in the system, but there are hundreds of millions of flowing elements in the system, the Bloom filter has no option but to store the traces of those missing elements there forever. Over time, the filter gets more and more crowded, until one day you realize its misjudgment rate is so high that you have to rebuild it.
The cuckoo filter claims in the paper to solve this problem by effectively supporting reverse deletion. And use it as an important selling point to lure you away from the Bloom filter and into the cuckoo filter.
Why the cuckoo?
There is an idiom, “hatoyama to occupy the nest”, cuckoo also, cuckoo never build their own nest. It lays its eggs in other people’s nests and lets others help it hatch. When the cuckoo chick emerges, because of its relatively large size, it squeezes the other babies (or eggs) of its adoptive mother out of the nest — falling to their deaths.
Cuckoo hash
The simplest cuckoo hash is a one-dimensional array structure, with two hash algorithms mapping the new element to two positions in the array. If one of the two positions is empty, the element can be put directly into it. But if both slots are filled, it will have to “occupy the nest”, kicking out one at random and taking the slot itself.
p1 = hash1(x) % l
p2 = hash2(x) % l
Copy the code
Unlike the cuckoo, the cuckoo hashing algorithm helps its victims find other nests. Because every element can be placed in two places, you can cram it into any one of them that’s free. So the sad, evicted egg will check to see if the other spot is open, and if it is, move to it and everyone will be happy. But what if the seat is also taken by someone else? Well, it will do it all over again, passing on the role of victim to others. The new victim then repeats the process until all the eggs have found their own nest.
The cuckoo hash problem
The problem is that if the array is so crowded that you can kick it hundreds of times without stopping, then the insertion efficiency can be severely affected. At this point, the cuckoo hash sets a threshold, and when continuous nesting behavior exceeds a certain threshold, the array is considered nearly full. You need to expand it and replace everything.
The other problem is that there could be a run cycle. For example, if two different elements are in exactly the same position after the hash, they can be in one position. But then a third element comes in, and it’s in the same position after the hash, and obviously you’re going to have a run cycle. But the probability that three different elements will be in the same position after two hashes is not very high, unless your hash algorithm is badly wrong.
The cuckoo hash algorithm treats this run cycle as if the array is too crowded and needs to be expanded (it’s not).
To optimize the
The average space utilization of the above cuckoo hash algorithm is not very high, perhaps only 50%. At that percentage, you quickly get a series of runs that exceed the threshold. The value of such a hash algorithm is not obvious, so it needs to be improved.
One of the improvements is to add hash functions so that each element has not just two nests, but three nests and four nests. This can greatly reduce the probability of collision and improve the space utilization to about 95%.
Another improvement is to hang multiple seats at each position of the array, so that even if two elements are hash in the same position, you don’t have to immediately “occupy the nest” because there are multiple seats and you can sit on any one of them. A run is only necessary if all these seats are taken. Obviously, this would also significantly reduce the number of runs. The space utilization of this scheme is only about 85%, but the query efficiency is very high. Multiple seats in the same location are contiguous in memory space, which makes efficient use of THE CPU cache.
So a more efficient solution would be to merge the two improvements above, such as using four hash functions with two seats above each position. In this way, we can get both time efficiency and space efficiency. This combination can even increase space utilization by 99%, which is a remarkable space efficiency.
Cuckoo filter
A cuckoo filter, like a cuckoo hash, is a one-dimensional array, but unlike a cuckoo hash, which stores the entire element, a cuckoo filter stores only the element’s fingerprint information (a few bits, similar to a Bloom filter). Here the filter sacrifices data accuracy for spatial efficiency. It is the fingerprint information of the element that is stored, so there is a misjudgment rate, just like the Bloom filter.
First, the cuckoo filter will still only use two hash functions, but each position can hold multiple seats. These two hash functions are chosen specially because only fingerprint information can be stored in the filter. When the fingerprint in this position is squeezed, it needs to compute another dual position. To compute this dual position, you need the element itself, so let’s recall the hash position formula.
fp = fingerprint(x)
p1 = hash1(x) % l
p2 = hash2(x) % l
Copy the code
We know the fingerprints of P1 and x, so we can’t figure out p2 directly.
Special hash function
The neat thing about the cuckoo filter is that it has a unique hash function that allows p2 to be computed directly from P1 and element fingerprints, rather than the full x element.
Fp = fingerprint(x) p1 = hash(x) p2 = p1 ^ hash(fp) // XORCopy the code
And as you can see from the formula above, when we know fp and P1, we can immediately figure out p2. And similarly, if we know p2 and FP, we can just figure out p1 — duality.
p1 = p2 ^ hash(fp)
Copy the code
So we don’t need to know whether the current position is P1 or P2 at all, we just need to xor the current position and hash(FP) to get the dual position. And just make sure to hash(FP)! Lambda equals 0, so p1 factorial is guaranteed. = p2, so you don’t have to kick yourself and cause an endless loop.
You might ask why the hash function doesn’t modulo the length of the array. Actually, yes, but the cuckoo filter enforces the length of the array to be an exponent of two, so modulo the length of the array is equivalent to taking the last n bits of hash. In xOR operations, ignore all but the lowest n bits. Leaving the calculated position P low n bits is the final dual position.
The last
Think a good partner remember to like attention oh, the follow-up will continue to update selected technical articles!