Bloom filter details

Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.

Intuitively speaking, bloom algorithm is like a hash set to determine whether an element (key) is in a set. Unlike a normal hash set, this algorithm does not store the value of the key, but only k bits for each key, each storing a flag to determine whether the key is in the set.

Algorithm:

First we need k hash functions, each of which hashes the key into 1 integer \
To initialize, an array of n bits is required, with each bit initialized to 0\
When a key is added to the set, k hash functions are used to calculate k hash values and the corresponding bit position in the array is 1\
When determining whether a key is in the set, k hash functions are used to calculate K hash values and query the corresponding bit bits in the array. If all the bit bits are 1, the key is in the set.

Advantages: No need to store keys, saving space

Disadvantages:

When the algorithm determines that the key is in the set, there is a probability that the key is not actually in the set \
Unable to delete

The typical application scenario is as follows: Empty query defects exist in the design of some storage systems: When you query a non-existent key, you need to access a slow device, resulting in low efficiency. A front-end page caching system, for example, might be designed to check whether a page exists locally, return it if it does, and fetch it from the back end if it doesn’t. However, when a page is frequently queried from the caching system, the caching system will frequently request the back end to transfer the pressure to the back end. This is a service that only needs to add a Bloom algorithm, and when a key is inserted into the back end, it is set in this service to check whether the key exists in the back end when it needs to query the back end, so as to avoid the pressure of the back end.

Bloom Filter was proposed by Burton Howard Bloom in 1970. It actually consists of a long binary vector and a series of random mapping functions, and the Bloom filter can be used to retrieve whether an element is in a set. The advantage of Bloom Filter is that its space efficiency and query time are far more than the general algorithm, but its disadvantage is that it has a certain False positives rate (Bloom Filter reports that an element exists in a set, but the element is not in the set) and it is difficult to delete. But there is no misdetection case (that is, False negatives to communicate if an element is indeed not in the set, then the Bloom Filter will not report that element is in the set, so it will not miss).

In everyday life, including when designing computer software, we often have to determine whether an element is in a set. For example, in word processing software, you need to check whether an English word is spelled correctly (that is, whether it is in a known dictionary); In the FBI, whether a suspect’s name is already on the list of suspects; In web crawlers, whether a url has been visited and so on. The most direct method is to store all the elements in the set in the computer. When encountering a new element, it can be directly compared with the elements in the set. In general, collections on computers are stored using hash tables. Its advantage is fast and accurate, but its disadvantage is that it costs storage space. When the set is small, this problem is not significant, but when the set is large, the problem of hash table storage efficiency becomes apparent. For example, a public email provider like Yahoo,Hotmail and Gmai will always have to filter spam from spamers. One way to do this is to keep track of the email addresses of the spammers. Because senders are constantly registering new addresses, there are at least a few billion spamming addresses around the world, and storing them all requires a lot of web servers. If use hash table, each store one hundred million email address, you need 1.6 GB of memory (using hash table implementation of specific measures is to each email address corresponding to eight gigabytes of information into a fingerprint (see: googlechinablog.com/2006/08/blo… “, and then fingerprints the information into a hash table. Since the storage efficiency of hash tables is usually only 50 percent, an email address takes up 16 bytes. A hundred million addresses takes about 1.6GB, or 1.6 billion bytes of memory. So storing billions of email addresses could require hundreds of gigabytes of memory. Unless it is a supercomputer, ordinary servers cannot store it [2]. (this quote Google mathematical beauty: www.google.com.hk/ggblog/goog…

The basic concept

If you want to determine whether an element is in a set, the general idea is to store all the elements and compare them. Linked lists, trees, and so on. But with the increase of elements in the collection, we need more and more storage space, and the retrieval speed is slower and slower. But there is another kind of data structure called a Hash table. It maps an element to a point in a Bit Array using a Hash function. So in this case, we just have to see if this point is 1 to see if it’s in our set. That’s the basic idea of a Bloom filter.

The problem with Hash is conflict. Assuming the Hash function is good, if our bit array is m points long, then if we want to reduce the collision rate to say 1%, the Hash table can only hold m over 100 elements. Obviously, this is not space-efficient. The solution is simply to use multiple hashes, and if one of them says the element is not in the collection, it’s definitely not. If they’re both true, there’s a good chance they’re lying, but the odds of that being intuitively low.

advantages

Bloom filters have huge advantages in both space and time over other data structures. Bloom filter storage space and insert/query time are constant. In addition, the Hash functions are unrelated to each other, so they can be implemented by hardware in parallel. Bloom filters do not need to store the elements themselves, which can be advantageous in situations where confidentiality is very important.

The Bloom filter can represent the whole set, and no other data structure can;

The intersection of two Bloom filters using the same set of Hash functions, k and m, can be performed using bitwise operations.

disadvantages

But the disadvantages of bloom filters are as obvious as their advantages. False Positive is one of them. As the number of stored elements increases, the miscalculation rate increases. But if the number of elements is too small, a hash table is sufficient.

In addition, you generally cannot remove elements from a Bloom filter. It is easy to think of a bit array as an array of integers, incrementing the counter for each element that is inserted, and then subtracting the counter when an element is deleted. However, it is not so simple to delete elements safely. First we must make sure that the deleted element is actually inside the Bloom filter. That’s not guaranteed by the filter alone. Counter winding can also cause problems.

False positives probability derivation

Given that the Hash function selects and sets a Bit in the Bit Array with equal probability, where M is the size of the Bit Array and k is the number of Hash functions, the probability that a particular Bit in the Bit Array will not be set in the Hash operation on element insertion is: \

Then the probability that the bit is not set to “1” after all k Hash operations is:

If we insert n elements, then the probability that one digit is still “0” is:

So the probability of this bit being “1” is:

Now check if an element is in the collection. All k positions needed to determine whether an element is in the set are set to “1” as described above, but this method may cause the algorithm to mistakenly believe that an element that is not in the set is detected as being in the set. The probability is determined by the following formula:

The result was based on the assumption that the bits needed to be set are independent of each other. As M increases, the probability of False Positives decreases, and the number of inserted elements n increases. For a given m, n, the number of Hash functions k should be determined by the following formula:

The probability of False Positives is:

Given the probability P of False Positives, how do I choose the optimal bit array size M?

The above formula shows that the size of the bit array had better be linearly related to the number of inserted elements. For a given m, n, k, the probability of false positive examples is maximum:

The following diagram shows the relationship between the probability of false positive examples of Bloom filter p and the size of bit array m and the number of inserted elements in the set N. It is assumed that the optimal number of Hash functions is selected:

Bloom Filter use cases

Google’s famous distributed database Bigtable uses bloom filters to find non-existent rows or columns to reduce disk lookups [3].

The Squid web proxy cache server uses A Bloom filter in cache digests [4].

The Venti document storage system also employs a Bloom filter to detect previously stored data [5].

SPIN model detectors also use Bloom filters to track the reachable state space when verifying problems on a large scale [6].

Google Chrome uses bloom filters to speed up secure browsing services [7].

Bloom filter is also used in many key-value systems to speed up the query process, such as Hbase, Accumulo, and Leveldb. Generally speaking, Value is stored in disk, which takes a lot of time to access. However, a Bloom filter can be used to quickly determine whether the Value corresponding to a Key exists, thus avoiding many unnecessary DISK I/O operations. However, the introduction of a Bloom filter will cause a certain amount of memory consumption. The following figure shows the typical use of a Bloom filter in a key-value system:

Bloom filter related extension

Counting filters

Basic Bloom filters do not support Deletion, but Counting filters provide a way to support element Deletion without rebuilding bloom filters. In Counting filters, each bit in the original bit array is extended to an n-bit counter. In fact, the basic Bloom filter can be considered as Counting filters with only one bit counter. Original insert operation has also been extended to add 1 n – a bit of a counter and find operations which checks a nonzero array can, and delete operations is defined as an array of the corresponding a minus 1, but this method is also an arithmetic overflow problem, namely one after multiple deletions could become negative, so a be sufficiently large array size m. The other problem is that Counting filters do not scale. Since Counting filters do not scale, the maximum number of elements that need to be saved needs to be known in advance. Otherwise, the probability of false positives increases dramatically once the number of inserted elements exceeds the size of the bit array. Of course, some people also propose a D-left Hash method to implement a Bloomed filter that supports deletion operation and has higher spatial efficiency than Counting filters. \

Data synchronization

Byers et al. proposed using bloom filter to approximate data synchronization [9]

Bloomier filters

Chazelle et al. proposed a general Bloom filter that could associate a certain value with each inserted element and implemented an associative array Map [10]. Like normal Bloom filters, Chazelle’s bloom filter also achieves low space consumption, but also generates false positive. However, in Bloomier filter, if a key is not in the map, False positive is defined when it returns. The Map structure does not return incorrect values in the Map related to keys.

A link to the

Table of false-positive rates for different configurations from a University of Wisconsin — Madison website
Interactive Processing demonstration from ashcan.org
“More Optimal Bloom Filters,” Ely Porat (Nov/2007) Google TechTalk video on YouTube
“Using Bloom Filters” Detailed Bloom Filter explanation using Perl

The resources

[1] Wikipedia: Bloom Filter: zh.wikipedia.org/zh/%E5%B8%8…

[2] the beauty of mathematics in the 21st: Bloom Filter (Bloom Filter) : www.google.com.hk/ggblog/goog…

[3] Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson; Wallach, Deborah; Burrows, Mike; Chandra, Tushar; Fikes, Andrew et al. (2006), “Bigtable: A Distributed Storage System for Structured Data”, Seventh Symposium on Operating System Design and Implementation

[4] Wessels, Duane (January 2004), “10.7 Cache Digests”, Squid: The Definitive Guide (1st ed.), O’Reilly Media, p. 172, ISBN 0-596-00162-2, “Cache Digests are based on a technique first published by Pei Cao, called Summary Cache. The fundamental idea is to use a Bloom filter to represent the cache contents.”

[5] plan9.bell-labs.com/magic/man2h…

[6] spinroot.com/

[7] src.chromium.org/viewvc/chro…

[8] en.wikipedia.org/wiki/Bloom_… \

[9] Byers, John W.; Considine, Jeffrey; Mitzenmacher, Michael; Rost, Stanislav (2004), “Informed content delivery across adaptive overlay networks”, IEEE/ACM the Transactions on Networking 12 (5) : 767, DOI: 10.1109 / TNET. 2004.836103