This is the ninth day of my participation in the First Challenge 2022. For details: First Challenge 2022.
primers
The nature of hashing is to map from a larger space to a smaller space, so after enough data is inserted, according to the pigeonnest principle, there must be position conflicts. Common Hash tables (or dictionaries) handle collisions through linked lists, open address detection, and so on. The single-bucket multi-function cuckoo hash is a kind of open address method to deal with the conflict of the hash table, but after the conflict, not through linear search for new location, but through the additional hash function to find.
Author: A miscellany of wood birdswww.qtmuniao.com/2021/12/07/…
The principle of
Cuckoo hashing was first made public by Rasmus Pagh and Flemming Friche Rodler at a conference in 2001 [1]. Its basic idea is:
- Use two hash functions h1(x), H2 (x) and two hash buckets T1 and T2.
- Insert element x:
- If either T1[h1(x)] or T2[h2(x)] is empty, insert. They’re both empty, so just pick one and insert it.
- If T1[h1(x)] and T2[h2(x)] are both full, select either of them randomly (set to Y), kick them out and insert X.
- Repeat the process and insert element y.
- If too many kicks are made during insertion, the hash bucket is full. Capacity expansion, ReHash, and insert again.
- Query element x:
- T1[h1(x)], T2[H2 (x)] and X can be read
The Cuckoo, or great Cuckoo, likes to lay its eggs in other birds’ nests. When a baby cuckoo is born, it kicks out the other eggs and uses the nest to grow. The key design of the cuckoo hash is “kicks out,” which is a masterstroke of design and name.
variant
Cuckoo hashes can come in many varieties, such as:
- Use two hash functions and a hash bucket.
- Use two hash functions and four hash buckets.
It can be proved that the bucket utilization rate of the latter will be higher. If you are interested, you can refer to the original paper [3] to see the demonstration process.
application
Bin Fan et al., CMU University, published a paper called Cuckoo Filter in 2014: Practically Better Than Bloom [3]. The basic idea is to apply the idea of key-value queries to the direction of set membership, which can replace Bloom Filter commonly used in projects. It has the following advantages:
- Support for deleting elements
- Higher query efficiency, especially at high load factors
- Easier to implement than other filters that support deletion
- If false positives are expected to be less than 3%, less space is used than Bloom Filter
To achieve the above effect, the Cuckoo Filter makes the following changes to the original Cuckoo Hash:
- To improve bucket utilization, multiple hash buckets are used.
- To reduce memory usage, only the key fingerprint is stored.
In addition, when a value of x is kicked out, another position needs to be found. The Cuckoo Hash is computed by an additional Hash function applied to x: h2(x). However, in order to save memory, the fingerprint finger(x) of fixed length is saved instead of the original value x in Cuckoo Filter. When x is kicked out, how do I find another position for it? Intuitively, I have two thoughts:
Method 1: Calculate another position by h2(Finger (x)). However, when calculating the second position, it is equivalent to forcibly compressing the original data space into the fingerprint space, which will lose a lot of information and increase the probability of collision.
Method 2: Write down another position in the value, pair(finger, the other position), but this takes up a lot of space.
The Cuckoo Filter uses a clever way to xor the position (h1(x)) with the hash(finger(x)) of the corresponding value to get another position. We instruct that the xOR operation satisfies the property that any two of the three values xor can yield a third value. When another position X is kicked out, the original position can be obtained by the same method, as shown in the figure below.
Why not xor directly with Finger (x) to compute another position? Because finger(x) tends to have fewer digits, like 8 bits. If this method is followed, it is understood physically to find another position within the range of the original position ±2^8 = 256, because xOR only changes the value of the lower 8 bits. This range is too small for balanced hashing.
There are two other points to note. First, the Cuckoo Filter cannot be expanded without additional information, because we have lost the original value x, so we cannot compute the new position hash(x) after expansion. Of course, if the corresponding key set is saved, you can also expand the capacity and insert the original key set again. Second, if two different values x and y happen to satisfy hash(x) == hash(y) && finger(x) == finger(y), a false positive will occur. To understand false positives, we can start with the nature of hashing. As mentioned in the beginning, the essence of hashing is mapping from large space to small space, so there must be collisions in small space. If one of the two original values of the collision exists, people will think that the other also exists. Back to the Cuckoo Filter, if x and y both exist in the Cuckoo Filter, when deleting x or Y, delete either of the two identical fingers.
implementation
The implementation of the cuckoo filter is very simple, pseudo code is also given in the paper, posted here.
Insert the Insert (x)
f = fingerprint(x);
i1 = hash(x); I2 = i1 radiushash(f);
if bucket[i1] or bucket[i2] has an empty entry then
add f to that bucket;
return Done;
// must relocate existing items;
i = randomly pick i1 or i2;
for n = 0; n < MaxNumKicks; n++ do
randomly select an entry e from bucket[i];
swap f andthe fingerprint stored in entry e; I = I radiushash(f);
if bucket[i] has an empty entry then
add f to bucket[i];
return Done;
// Hashtable is considered full;
return Failure;
Copy the code
Find the Lookup (x)
f = fingerprint(x);
i1 = hash(x); I2 = i1 radiushash(f);
if bucket[i1] or bucket[i2] has f then
return True;
return False;
Copy the code
Delete the Delete (x)
f = fingerprint(x);
i1 = hash(x); I2 = i1 radiushash(f);
if bucket[i1] or bucket[i2] has f then
remove a copy of f from this bucket;
return True;
return False;
Copy the code
reference
Cuckoo hashes and cuckoo filters
- Cuckoo hash: www.itu.dk/people/pagh…
- The cuckoo hash wikipedia: en.wikipedia.org/wiki/Cuckoo…
- Cuckoo filters: Practically Better Than Bloom www.cs.cmu.edu/~binfan/pap…
- Cuckoo filter implementation: [coolshell.cn/articles/17…]
I am Green teng mu bird, a distributed system programmer like photography, more interesting articles, welcome to pay attention to my public number: “Mu Bird miscellaneous Record”.