Original address: www.johngo689.com/2201/
Title: bloom filter (BloomFilter) principle | billion level data filtering solution
In 1970, Mr. Bloom came up with an excellent filter algorithm for determining whether an element is in a set
“Bloom filter algorithm”
The story begins
Look at the structure of the story first
As far as the current Internet environment is concerned, the Internet ecology of the head is developing more and more in the form of high concurrency and distributed. For example, each big web page blacklist system, crawler repeat rate judgment. There are more and more of these scenes.
For example, in the real-time state, more than 10 billion urls may need to be judged whether they conform to specifications or exist in the system and can be used normally.
In general, each URL is 64 bytes, so if the number of urls is 10 billion, you need about 640GB of memory capacity. For the current online server… That’s still a lot! But if you take advantage of the Bloom filter, you only need 10 billion bits with no error rate, that is, 1.2GB, and even to reduce the error rate, you don’t need more than a few tens of GB of space.
So in this case, the bloom filter is used to solve is very good, good to wikipedia says “it has the advantage of space efficiency and query time is far more than the general algorithm”, space complexity and time complexity are far more than the general algorithm, bloom filter storage space and insert/query time constant of O (1)! Note: far more!!
Looking at the great boast from all sides, let’s arrange the Bloom filter to all sides to see how it works in detail…
Wikipedia concept: A Bloom filter is actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. Its advantage is that its space efficiency and query time are far more than the general algorithm, but its disadvantage is that it has certain error recognition rate and deletion difficulty. Based on this explanation, the following is from
And so on systematic say say…
【 1 】
Main function: determine whether an element is in a set
There will be many such scenarios. Determines whether the element to be queried exists in the collection. [Whether the site allows the user to log in and whether the site accepts such URL requests]
This method is reasonable, but if you add a layer of Bloom scrubs in the middle, it is more reasonable! Why is that?
Suppose you want to query for an element that does not exist
A. If BloomFilter is not available, the query is performed directly in the database after the query is completed in the cache. This phenomenon is “slow”. After all, it takes a long time to query from the database, which greatly affects service performance.
B. BloomFilter exists in the cache. After the BloomFilter element is queried from the cache, it is found that the BloomFilter element does not exist. In this way, a large unnecessary cost to machine or service performance is avoided.
As shown in the figure above, assuming the element does not exist, if there is no Bloom filter, the query will be directed to [disk/database], which can cause a large unnecessary performance drain!
If it’s so efficient, what’s the reason?
【 2 】
Data structure: binary array + Hash algorithm
Binary array
This is the key data structure that will hash each element into each binary array.
In terms of time complexity and space complexity, both are constant, which is also due to the fact that a data structure is an array of bits, so it has an advantage in space. In the case of URL 64B, the corresponding bit is 1, which is roughly 64*8:1 space ratio.
The hash algorithm
It is familiar with hash algorithms, in which data elements are mapped to different arrays through hash functions. If conflicts occur, you can use some methods to resolve them, such as the zipper method.
When applied to BloomFilter, the hash function differs from the normal hash function:
A. BloomFilter maps the value not to an address, but to the corresponding binary array bit, which is then set to 1
B. BloomFilter does not use the common hash function to resolve address conflicts. Instead, it sets the binary array position of the same element to 1 after several hash functions. You can determine that the element exists.
See below:
According to the legend, each data element will go through three different hash functions, and then correspond to different binary array bits and set them to 1. In this way, on the one hand, the probability of conflict is reduced, and on the other hand, the error rate is reduced.
When a client queries, URL1, URL2, and URL3 all exist in BloomFilter, so these three elements must be searchable.
Now let’s try to use URL1, URL4 and URL5 for example: search success, search does not exist and search error these three real existence situation. [Below 👇]
URL1, URL4, and URL5 on the right side of the binary array of BloomFilter, you can see:
- URL1: can completely correspond to the array bit when storing URL1 and are all 1 at the same time, i.e. position: 1, 5, 9. Query successful!
- URL4:0 in array bit 2, 1 in the rest of the array, resulting in three positions not all 1, search does not exist, return null!
- URL5: The three bits of URL5 are all 1. The result is that the URL5 element is present in BloomFilter, but you can clearly see that the query error! This is where the query produces an error.
This shows that BloomFIlter is extremely excellent in terms of space and time, reaching O(1) level, but has the disadvantage of false recognition rate.
Note here that the false identification rate only exists in the case of a successful search, and there is no false identification rate if the query does not exist. That is:
- Query exists: There is a false identification rate, as shown in the above illustration
- The query does not exist: there is certainly no case of all 1s, so there is no false identification rate in this regard
[3]
Advantages: space saving + high performance
Save space
In the case of URL length, it is the binary array that is used. An element through the hash function corresponds to an array bit (bit). If it is a real URL (64B), the space ratio is about 64*8:1, that is, 512:1
Bit arrays enable BloomFilter to save a lot of space and is extremely efficient in terms of spatial efficiency.
A high performance
BloomFilter does not need to be traversed one by one like a linked list query. Instead, it will act like an array, similar to finding the desired result by simply fetching the subscript.
The difference is that BloomFilter requires several hash functions to find subscripts. So, overall, the performance is very high!
Comprehensive time and space efficiency, in the case of a very low false recognition rate, all aspects are far superior to other algorithms.
[4]
Disadvantages: Misjudgment + cannot delete elements
Miscarriage of justice – Hash conflict exists
In the previous example of URL5, this would cause the query to fail, which is a direct manifestation of traditional hash conflicts, and is why several hash functions are used. This kind of error identification can be marked in the form of whitelist to make up for the deficiency in this aspect.
In real industrial use, however, this conflict, or query error, is kept below 0.01%, both for the hash function and by increasing the number of bits in the BloomFilter binary array to avoid this conflict.
So let’s take a look at the above two considerations: what is the size of the binary array and how many hash functions are needed to keep the error rate in the above case below 0.01%?
Selection of the length of the binary bit array
N =100n=100n=10 billion, p=0.01p=0.01% P =0.01, BloomFilter binary bit array length m using the following formula:
We can find: m=19.19n, that is, the binary bit array length is about 20 times the number of elements n, 20 billion bits, equivalent to about 25GB of space, for our average industrial server, is enough to hold the array
How many hash functions is best
The industry generally uses this formula to calculate k hash functions:
That is, 14 hash functions are required to construct
According to the above calculation, we can get the best scheme to use BloomFilter
Cannot delete elements
Different elements can be set to 1 by using the hash function. You can never delete an element and set its position to 0.
If you want to delete URL1, you should set the corresponding position of the URL1 hash to 0. Obviously, if you want to delete URL1, you should set the corresponding position of the URL1 hash to 0. This will not do, will affect the URL2 query.
[5]
The industry application
There are many scenarios used in industry, such as:
-
Blog system blacklist restrictions
-
Determination of reptile repetition rate
-
The application of Bitcoin
-
Spam filtering
And so on…
However, they are generally common solutions, as shown below:
As information is written, it is written to the cache, database, and Bloom filter at the same time. If a query is needed, the cache will be queried first. If it cannot be found, BloomFilter will be used to determine whether it exists. If it exists, the database will continue to query. If it doesn’t exist, it returns empty.
This greatly improves the efficiency of application services!
Reprinted at: www.johngo689.com/2201/