This is why’s 86th original article

In an earlier article I wrote about bloom filters:

Ah, this terrible typography, a long story…….

Writing articles is the same as writing code.

When I see a piece of code that burns my eyes, I want to smell it in my mouth.

As a result, the author of the code is himself.

I can’t believe it, but I have to look at git’s commit record.

It turns out that I typed and submitted the code myself a few months ago.

So silently get rid of.

In this situation, I often comfort myself: nothing, this is a good thing, it shows that I am making progress.

All right, let’s get down to business.

In the article at the time, I said that I couldn’t explain the inner workings of the Bloom filter.

In fact, I’m just too lazy to write it. It’s not complicated. What can’t I say?

Bloom filter

Bloom filter, in a reasonable use of the scene has the role of four to two thousand jin, because the use of the scene is in a large amount of data in the scene, so this thing is similar to the second kill, although not really used, but also to say that it is reasonable.

Common in interviews: judging duplicate data in large sets, cache penetration problems, etc.

First share a real case of Bloom filter in Tencent short video products:

https://toutiao.io/posts/mtrvsx/preview

So how does bloom filter achieve these requirements above?

First, the Bloom filter does not store raw data, because its function is to tell you whether an element exists or not. You don’t need to know what elements are in the Bloom filter.

Of course, if we know what elements are in the container, we can tell if an element exists.

However, in this way, we need to store all the existing elements. In the case of large data volume, such storage takes up very much space.

How does a Bloom filter know if an element exists without storing elements?

It’s simple: a long array plus a few Hash algorithms.

In the diagram above, there are three different Hash algorithms and an array of length 10 that stores bits, only zeros and ones. The initial value is 0.

Suppose we now have an element [why] that passes through the Bloom filter.

First [why] three different numbers are obtained through three Hash algorithms.

The Hash algorithm ensures that the resulting number is between 0 and 9, that is, within the length of the array.

We assume that the calculation results are as follows:

  • Hash1(why)=1
  • Hash2(why)=4
  • Hash3(why)=8

This corresponds to the picture:

If the element [why] is added, the index of the Hash algorithm will still be 1,4,8. Suggesting that this element was most likely present.

Notice, it says highly likely. That is to say there is a certain rate of misjudgment.

Let’s first store one more element [Jay].

  • Hash1(jay)=0
  • Hash2(jay)=5
  • Hash3(jay)=8

At this point, we combine the two elements and have the following image:

The position with subscript 8 is special, and both elements point to it.

The picture looks a little uncomfortable like this, so I’ll embellish it:

Ok, now the array looks like this:

You say, if you only look at this thing, can you tell that there were why and Jay in this filter?

Don’t say you don’t know, even the filter itself.

Now, suppose there is another element [Leslie], and after three Hash algorithms, the result is as follows:

  • Hash1(Leslie)=0
  • Hash2(Leslie)=4
  • Hash3(Leslie)=5

And by looking at the elements up here, we know that 0,4, and 5 are all 1’s.

The Bloom filter will think that this element might have been present before. It returns to the caller that [Leslie] was present.

But what about the reality?

In fact, our heart is clear, [Leslie] never come.

That’s the case with false positives.

That’s what it says: elements that bloom filters say exist don’t necessarily exist.

If the hash value of an element is 0, the element does not exist.

However, it has a fatal shortcoming, is that it does not support deletion.

Why is that?

If you want to delete [why], set the positions 1,4, and 8 to 0.

But you see, [Jay] also pointed to position 8.

If [why] is removed and position 8 becomes 0, is [jay] removed?

Why is it fatal not to support deletion?

The Bloom filter is used in large data scenarios. As time goes by, the number of ones in the filter array increases, resulting in an increase in misjudgment rates. And you have to rebuild.

Therefore, the example of Tencent cited at the beginning of the article has the following sentence:

In addition to the deletion issue, bloom filters have another problem: poor query performance.

Because the array length in a real filter is very long, the memory span of the resulting array subscripts may be very large after multiple different Hash functions. A large span is a discontinuity. Discontinuous, which will result in low CPU cache line hit ratio.

This thing, you know. Just recite the eight-part essay.

It’s a snow trap, a sound trap. That’s a Bloom filter.

If you want to play with bloom filters, you can visit this website:

https://www.jasondavies.com/bloomfilter/

Insert left, query right:

What if you want the Bloom filter to support deletion?

One is called Counting Bloom Filter.

It takes a counter array and replaces the bits of the array so that the space of one bit is enlarged into a counter.

The Bloom Filter adds deletions at the cost of taking up several times more storage.

This is also a solution.

But there is a better solution, a cuckoo filter.

In addition, there is a mathematical reasoning formula for the error rate of the Bloom filter. Very complicated, very boring, I won’t talk about, interested can go to understand.

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html

The cuckoo hash

Cuckoo filters first appeared in a paper published in 2014 called “Cuckoo Filters: Practically Better Than Bloom.”

https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

But before we talk about Cuckoo filters, let’s briefly cover Cuckoo hashing, which means Cuckoo hash.

Because this word is the key word in the paper, it appears 52 times.

Cuckoo Hashing, originally from this 2001 paper:

https://www.cs.tau.ac.il/~shanir/advanced-seminar-data-structures-2009/bib/pagh01cuckoo.pdf

Look at this part of the paper:

It works like this, summed up like this:

It has two hash tables, called T1 and T2.

Two hash functions, h1, h2.

When a nonexistent element is inserted, its position in table T1 is calculated based on h1, and if it is empty, it can be inserted.

If the position is not empty, calculate its position in the T2 table according to H2. If the position is empty, it can be put in.

If the position is not empty, just kick out the element in the current position and put the current element in.

You can also randomly kick out one of the two positions, and one element will be kicked out.

What about the elements that get kicked out?

It’s okay. It has another place of its own.

The pseudocode in the paper looks like this:

It doesn’t matter if you don’t understand it. Let’s draw a diagram:

The picture above says something like this:

I want to insert element X, which, after two hash functions, has two positions in table T1 at position 2 and table T2 at position 1.

If both positions are occupied, then randomly kick y out of position 2 in table T1.

And the other position of y is occupied by the z element.

So Y kicked Z out without mercy.

Z finds that his alternate position is still vacant (even though it is also the alternate position of element V) and rushes into position.

So, when x is inserted, the graph looks like this:

The figure above is actually from the paper:

This solution similar to nesting dolls seems to be feasible, but there is always a problem that x cannot be put into the circular kick.

Such as (b) in the figure above.

When this happens, it indicates that the cuckoo hash has reached its limit and should be expanded or optimized.

So, when you look at the pseudocode again, you will understand what MaxLoop means.

The meaning of this MaxLoop is to set a threshold so that the process of kicking each other out is not executed too many times.

A cuckoo hash is an operation that resolves hash conflicts.

If you want to get your hands on it, visit this website:

http://www.lkozma.net/cuckoo_hashing_visualization/

After 16 kicks (MaxLoop), it tells you to rehash and expand the array:

That’s what cuckoo Hash is all about.

Next, we look at the cuckoo filter.

Cuckoo filter

This is the opening page of “Cuckoo Filter: Practically Better Than Bloom,” a paper on Cuckoo filters.

I’m a cuckoo filter. I’m just a little better than you.

He starts pointing his finger at the weakness of others.

One limitation of standard Bloom filters is that they cannot delete data that already exists. If you use a variant like Counting Bloom Filter, the space is 3 to 4 times bigger

I was different:

This paper will demonstrate that support for deletion does not require a higher overhead in space or performance than standard Bloom filters.

The cuckoo filter is a practical data structure that offers four advantages:

  • 1. Dynamically add and delete elements.
  • 2. Provides higher lookup performance than traditional Bloom filters, even when near full (such as 95% space utilization).
  • 3. Easier to implement than alternatives such as Quotient Filter.
  • 4. If an error rate of less than 3% is required, it takes up less space than a Bloom filter in many practical applications.

The API of the cuckoo filter is nothing more than insert, query and delete.

The most important of these is insertion. Take a look:

Part of the paper, you may have a glance, don’t understand it, I’m going to give you a wave of analysis.

Insert part of the pseudocode, you can see a little bit of cuckoo hash, because that’s what it’s based on.

So what are the biggest changes?

It’s just changing the hash function.

See of my eye stare dog stay, wish: still have this SAO operation?

First, let’s recall that the cuckoo hash stores the original value of the inserted element, such as x, which passes through two hash functions. If we had a set of length L, it would look like this:

  • p1 = hash1(x) % L
  • p2 = hash2(x) % L

How does the cuckoo filter calculate position?

  • h1(x) = hash(x),
  • H2 (x) = h1(x) ⊕ Hash (x’s fingerprint).

We can see that when we compute h2 (position 2), we do a hash on the fingerprint of x.

We’ll talk about fingerprints in a minute, but let’s focus on location calculations.

The xOR operation in the algorithm above ensures an important property: position H2 can be calculated from the “fingerprint” stored in position H1 and H1.

In human language, as long as we know the location (H1) of an element and the “fingerprint” information stored in that location, we can know the alternate location (H2) of that “fingerprint”.

These two positions are duality because of the xOR operation used.

Just make sure hash(X’s Fingerprint)! =0, so we can make sure that h2! =h1, so you can make sure that you don’t have a self-kicking loop.

Also, why do xOR operations follow a hash of the “fingerprint”?

The paper presents a counterproof: if the “fingerprint” is 8 bits long without a hash, its dual position can be calculated as far as 256 away from the current position.

Why? It says in the paper:

Because if the “fingerprint” is 8 bits long, xor only changes the lower 8 bits of the current position H1 (x), not the higher.

Even if you change all the lower eight bits, you get what I just said: up to 256 bits.

So, hashing the fingerprint ensures that the elements that are kicked out can be relocated to a completely different bucket in the hash table, reducing hash collisions and improving table utilization.

And then there’s another problem with this hash function do you see it?

It doesn’t modulo the length of the array, so how does it make sure that the index that it calculates is in the array?

Which brings us to another limitation of the cuckoo filter.

The mandatory array must be an exponential of 2.

An exponential binary of 2 must look like this: 10000000… N zeros.

The advantage of this restriction is that when xOR is performed, it ensures that the calculated index must be in the array.

The downside of this restriction is that:

  • Cuckoo filter: I support delete operation.
  • Bloom filter: I don’t need to limit the length to an exponential of 2.
  • Cuckoo filter: I find better performance than you.
  • Bloom filter: I don’t need to limit the length to an exponential of 2.
  • Cuckoo filter: I’m also space-efficient.
  • Bloom filter: I don’t need to limit the length to an exponential of 2.
  • Cuckoo filter: I’m sick of it, f * * k!

Next, “fingerprints.”

This is where “fingerprints” appear for the first time in the paper.

A fingerprint is a hash of the inserted element, and the result of the hash is a few bits.

The element’s “fingerprint” is stored in the cuckoo filter.

When querying data, it is to see if there is a corresponding “fingerprint” information on the corresponding position:

When deleting data, it simply erases the “fingerprint” of the location:

Since elements are being hashed, it is inevitable that fingerprints will be identical, that is, misjudgments will occur.

No raw data is stored, so data accuracy is sacrificed, but only a few bits are saved, thus improving spatial efficiency.

Speaking of space utilization, if you think about it, what is the space utilization of the cuckoo Hash?

In the perfect case, that is, before hashing occurs, it has a maximum space utilization of only 50%.

Since there was no conflict, at least half the seats were empty.

How can a cuckoo filter improve its space utilization beyond just storing “fingerprints”?

Look at what the paper says:

The previous (a) and (b) hash functions are simple, but instead of storing data in two arrays, it is a cuckoo hash based on a one-dimensional array, and the core is still kicked.

The point is (c), which expands the array from one dimensional to two dimensional.

For each subscript, I can put four elements.

Such a small change, the space utilization went from 50% to 98% :

I asked you if you were scared?

The first point in the paper shown in the screenshot above is a statement of the fact that:

When the hash function is fixed to two, if only one element can be placed in one index, the space utilization is 50%.

But if you can put two, four, eight elements in one index, the utilization goes up to 84%, 95%, 98%.

So far, we have understood the optimization point of the cuckoo filter for the cuckoo hash and how it works.

Everything seemed so perfect.

All the indicators are better than bloom filters, but the main feature is the deletion operation.

But is it really that good?

When I read this paragraph in section 6 of the paper, I fell silent:

Limit duplicate data: If the cuckoo filter needs to support deletion, it must know how many times a piece of data has been inserted. Do not insert the same data into KB +1 times. Where k is the number of hash functions, and b is a subscript position for how many elements.

For example, two hash functions, a two-dimensional array, which can insert up to four elements per index. The same element can be inserted up to eight times.

For example:

Why has already been inserted 8 times, and if you insert a why again, you will get a looping kick up to the maximum number of loops, and then return false.

How to avoid this problem?

We just keep a record of how many times each element is inserted.

Although the logic is simple, but can not stand the large amount of data ah. If you think about it, what’s the storage space of this table?

It hurts to think about it.

If you’re going to delete a cuckoo filter, you’re going to have to suffer.

Finally, take a look at the comparison of the various types of filters:

Also, the mathematical reasoning process, not to mention, it hurts my eyes, and it’s easy to lose my hair.

Drought cavity be out of tune

Do you know why it’s called a cuckoo?

Cuckoo, also called cuckoo.

The Compendium of Materia Medica has the following record: “鸤 turtle cannot be a nest, but live in other nests and bear children.” This is the cuckoo’s nest parasitic behavior. Nest parasitism refers to the breeding method in which a bird lays its eggs in the nest of another bird species instead of building a nest. It includes interspecific nest parasitism (the host and the host are of different species) and intraspecies nest parasitism (the host and the host are of the same species). Among the more than 10,000 species of birds today, more than 100 species have nest parasitic behavior, among which the most typical is the great rhododendron.

That is, it lays its eggs in other birds’ nests and lets other birds incubate its chicks. Oh no, hatching birds.

When the cuckoo hatches, it pushes the eggs of other nesting birds out of the nest so the mother can focus on feeding it.

Oh, my God, that is so cruel.

But isn’t the action of “pushing the nest” the same as the algorithm described above?

But our algorithm is a little bit cuter, and the egg that gets pushed out, the element that gets kicked out, gets put in a different place.

When I looked it up, I was shocked to learn that a cuckoo is a cuckoo.

There are cuckoos in many poems, such as Jin Se by Tang Dynasty poet Li Shangyin, which I like very much:

Jinse gratuitously fifty strings, a string a column think of Huannian. Zhuangsheng xiao dream fan butterfly, Wang Emperor spring heart of cuckoo. Canghai pearl tears, blue fields warm jade smoke. Right here waiting to recall, just at that time already disconsolate.

From time immemorial. The debate over whether the poem is about “mourning” or “self-injury” has never stopped.

But does it matter?

It doesn’t matter to me.

Important is, in appropriate opportunity, below appropriate atmosphere, recall the thing in the past when can appropriate come on 1: “this affection can wait for recall, just at that time already disconsolate”.

Instead of saying: hey, now think of it, a lot of things did not cherish, really TM regret.

Oh, right.

One other thing I noticed while I was writing the article.

Bloom filters were the brainchild of a guy named Burton Howard Bloom in 1970.

When I write this stuff, I want to see what the big guy looks like.

But something amazing happened. I looked inside and out, and I didn’t find any pictures of the big guy.

My search ended with the discovery of this website:

https://www.quora.com/Where-can-one-find-a-photo-and-biographical-details-for-Burton-Howard-Bloom-inventor-of-the-Bloom- filter

This question was asked nine years ago, in 2012:

There is no picture of Burton Howard Bloom on the Internet.

What a magical, understated boss.

He could be a gorgeous man.

One last word (for attention)

If you find something wrong, you can raise it backstage and I can fix it.

Thank you for reading, I insist on original, very welcome and thank you for your attention.

I’m Why, a programmer who mainly writes code, often writes articles, and occasionally shoots videos.