1. Problem scenario
If an interviewer asks you if a website has 10 billion urls on a blacklist, each url averages 64 bytes. Ask this blacklist how to save? If you enter a URL at random, how to determine whether the URL is in the blacklist?
For the first question, if you think of the blacklist as a collection, stored in a HashMap, it seems too large, requiring 640 GIGABytes, which is clearly unscientific.
So what to do? Ok, now for today’s hero — bloom filters can solve this problem.
First of all, what is a Bloom filter? Wikipedia explains it this way:
Bloom Filter (English: Bloom Filter) was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. Its advantage is that its space efficiency and query time are far more than the general algorithm, but its disadvantage is that it has certain error recognition rate and deletion difficulty.
The official view is good, if you do not understand it does not matter, in fact, it is not difficult, let us speak slowly.
2. Specific introduction
A Bloom filter is actually a long binary vector and a series of random mapping functions.
“Very long binary vector” : this is a very long array, what kind of array? An array of bits (1Byte = 8 bits, 1KB = 1024 bytes).
“A series of random mapping functions” : There are multiple hash functions. So what is a hash function? There is a method in the JDK to calculate the hash value, which is a hash function.
Bloom filters can be used to retrieve whether an element is in a collection. Its advantage is that its space efficiency and query time are far more than the general algorithm, but its disadvantage is that it has certain error recognition rate and deletion difficulty
Doesn’t that solve our problem in the first place? So how does it work out?
3. Solution process
Next, I will talk about the general process. The details can not be understood first. The important thing is to understand the process.
Suppose that the bit array has a length of m, each element has a value of 0, and has k hash functions.
First, when entering a URL, the URL will be processed by k hash functions to obtain multiple hash values (v1,v2… And vitamin k). And then you get these hash values that correspond to the subscripts of the array, and then you set all of these subscripts to 1.
So how to determine a URL in the blacklist? Enter a URL that, after the above processing, yields the subscript positions of multiple arrays. If these subscripts are already 1, they are in the blacklist, otherwise they are not.
Overall is such a process, the following questions you may have:
1. How to construct an array of type bit
2, get v1,v2… Vk, how do you get the index position in the array and set it to 1?
Java does not have a type like bit. We can use int. An int is 32 bits.
// create a 100 * 32bit array int[] arr = new int[100]; Arr [0]; // Arr [0];Copy the code
So it says again, “Divide these hashes by the length of the array, m, and modulo m, to find the subscript positions of these hashes in the array.”
We can take a hash of data, for example, and let’s say the int array is 100.
Void Set(int data) {// ByteNO = data / 32; // bitNo specifies the 32-bit bit. int BitNo = data % 32; / / buy 1 _table [ByteNo] | = (1 < < BitNo); }Copy the code
4. Use effect
At the beginning we said that if it takes 640GB to put 10 billion urls into a HashMap, how much space does it take to use the bloom filter? The answer is about 23 GB. In contrast, the size of this space is not acceptable a lot.
5. Shortcomings
Bloom filters have the property that they can kill a hundred people by mistake, but not one. In human language, a URL that belongs to the blacklist can be correctly judged to be in the blacklist, but a URL that does not belong to the blacklist may also be considered in the blacklist, and there is a certain error rate.
As for the error rate, array length and number of hash functions to be set respectively, you need to choose according to the actual situation. There are corresponding mathematical formulas on the Internet, so I won’t expand on them here.
Reference: blog.csdn.net/wenqiang120…
PS: This article was originally published in the wechat public account “not only Java”, the background reply “Java”, send you 13 Classic Java e-books. The public account focuses on sharing Java dry goods, reading notes, growth thinking.