0 x00 preface

The standard Bloom Filter is a relatively simple data structure that supports only insert and lookup operations. The standard Bloom Filter works fine when the collection to be expressed is static, but when the collection to be expressed is constantly changing, the standard Bloom Filter suffers because it does not support deletion. This leads to the Counting Bloom Filter, which is abbreviated as CBF in this article.

0 x01 principle

First, why does BF not support deletion

Why can’t BF delete elements? We can take an example to illustrate.

For example, to delete the member Dantezhao in the set, k hash functions will be used to calculate it first. Because Dantezhao is already a member of the set, the corresponding position of the bit array must be 1. If we want to delete the member Dantezhao, I’m going to have to set all the ones that I calculated to be 0, so I’m going to set 5 and 16 to be 0.

Here’s the problem! Now, it is assumed that YYj itself is an element belonging to the set. If it is necessary to query whether YYj is in the set, we will judge whether the 16th and 26th bits are 1 after the hash function calculation. At this time, the 16th bit is 0, that is, YYj does not belong to the set. Clearly there was a miscalculation.

What is Counting Bloom Filter

Counting Bloom Filter expands each bit of the standard Bloom Filter array into a small Counter. Add 1 to the value of k Counter (k is the number of hash functions) when inserting elements, and subtract 1 from the value of k Counter when deleting elements. The Counting Bloom Filter adds deletion operations to the Bloom Filter at the cost of occupying multiple storage space. Is the basic principle simple? Take a look at the image below to see how it differs from the Bloom Filter.

3, Counter size selection

One of the main differences between CBF and BF is that CBF replaces one of BF with a Counter. How large is Counter? From the perspective of use, of course, the bigger the better, because the larger Counter is, the more information it can represent. But a larger Counter means more resource usage and, in many cases, a huge waste of space.

Therefore, when we select Counter, we can see how small the range of the value of Counter is to meet the needs.

According to the description in the paper, the probability that the value of a certain Counter is greater than or equal to I can be described by the following formula, where N is the size of the set, m is the number of Counter, and k is the number of hash functions.

In the previous article “Mathematical Background of Bloom Filter”, it has been concluded that the best value of k is K = m/ N * ln2, which can be obtained by plugging it into the formula.

If you allocate 4 bits to each Counter, it will overflow when the value of Counter reaches 16. The probability is as follows, and the value is small enough that for most applications, four bits will suffice.

For the choice of Counter size in CBF, the main reference is Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol, which is explained specifically on pages 6 and 7. No more derivation details, only a general explanation, interested children can refer to the original paper.

0x03 Simple implementation

Or to implement a simple program to get familiar with the principle of CBF, there are two differences with BF:

  • One is that instead of using the bitarray provided by bitarray, we use a bytearray provided by bytearray, so the value of each Counter ranges from 0 to 255.

  • Another is the addition of a remove method to remove elements from the collection.

The code is simple, just to understand the concepts, and the libraries used in practice will be quite different.

Although CBF has solved the problem that BF cannot delete elements, it still has many defects to be improved. For example, the introduction of Counter will bring a great waste of resources, and there is still a lot of room to reduce THE FP of CBF. Therefore, there will be many upgraded versions of CBF in actual use scenarios.

For example, SBF (Spectral Bloom Filter) puts forward the concept of element occurrence frequency query on the basis of CBF, and extends the application of CBF to the field of multi-set. DlCBF (D-left Counting Bloom Filter) uses D-left hashing method to store fingerprint and solve the load balancing problem of hash table. Accurate Counting Bloom Filter (ACBF) uses offset indexing to divide the Counter array into multiple levels to reduce misjudgment rates. This content, when the opportunity to share.