5 minutes to understand bloom filter, billions of data filtering algorithm you deserve!

Author: semlinker segmentfault.com/a/119000002…

In the world of programming, bloom filters are a great tool for programmers to quickly solve some of the more difficult problems in a project. Such as web URL deduplication, spam identification, the determination of duplicate elements in large collections and cache penetration and other problems.

Bloom Filter was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. The advantage of this algorithm is that the space efficiency and query time are much better than the general algorithm, but the disadvantage is that there is a certain error recognition rate and deletion difficulty.

A brief introduction to Bloom filter

When you insert new data into a simple array or list, the index of the inserted item is not determined by its value. This means that there is no direct relationship between the index value of the newly inserted item and the data value. That way, when you need to search for a value in an array or list, you have to walk through the existing collection. If there is a large amount of data in the collection, it will affect the efficiency of data search.

For this problem, you can consider using hash tables. With a hash table, you can hash a value to get the key or index value of that value, and then store that value in the index position in the list. This means that the index value is determined by the value of the inserted item, and when you need to determine whether the value exists in the list, you just need to hash the value and search the corresponding index position, which is very fast.

By definition, a Bloom filter checks whether a value is “probably in the set” or “definitely not in the set.” “Possible” means there is a certain probability, that is to say there may be a certain misjudgment rate. So why is there a miscarriage of justice? Let’s analyze the specific reasons.

A Bloom Filter is essentially a bit vector of length m or a bit list (a list containing only 0 or 1 bit values), all of which are initially set to 0, as shown in the figure below.

To add the data item to the Bloom filter, we provide K different hash functions and set the value of the corresponding bit at the result position to “1”. In the hash table mentioned earlier, we use a single hash function, so only a single index value can be output. For bloom filters, we will use multiple hash functions, which will produce multiple index values.

As shown above, when “semlinker” is entered, the default three hash functions will output 2, 4, and 6, and we put the corresponding position 1. Given another input “kakuqo”, the hash function outputs 3, 4, and 7. You may have noticed that index bit 4 has been marked by the previous “semlinker”. At this point, we have filled the bit vector with two input values of “semlinker” and “Kakuqo”. The marking state of the current bit vector is:

When searching for a value, similar to a hash table, we will use three hash functions to hash the “searched value” and see the index value it generates. Suppose, when we search for “fullstack”, the three indexes output by the three hash functions are 2, 3, and 7:

As you can see from the figure above, the corresponding index bits are set to 1, which means we can say that “fullstack” may have been inserted into the collection. In fact, this is a false positive situation, caused by a coincidence of hash collisions that store different elements in the same bit. Fortunately, bloom filters have a predictable rate of misjudgment (FPP) :

N is the number of added elements;
K number of hashes;
M The length of the Bloom filter (such as the size of the bit array);

In extreme cases, when the Bloom filter has no free space (full), each query returns true. This means that the choice of m depends on the expected number of elements, n, and that M needs to be much larger than n.

In practice, the length of Bloom filter m can be calculated according to the given misjudgment rate (FFP) and the expected number of elements n by the following formula:

After all this, we can conclude that when we search for a value, if any index bit of the value after K hash functions is “0”, then the value is definitely not in the set. However, if all hash index values are “1”, it can only be said that the value of the search may exist in the collection.

Two, bloom filter application

In practice, the common application scenarios of Bloom filter are as follows:

Web crawler to the URL to avoid crawling the same URL address;
Anti-spam, judging whether a mailbox is spam from billions of spam lists;
Google Chrome uses bloom filters to identify malicious urls;
Medium uses a Bloom filter to avoid recommending articles that users have already read;
Google BigTable, Apache HBbase, and Apache Cassandra use The Bloen filter to reduce look-ups for nonexistent rows and columns. In addition to the above application scenarios, another application scenario of Bloom filters is to solve the problem of cache penetration. The so-called cache penetration is that the service caller queries data that is not in the cache every time, so that each service call will be queried in the database. If there are too many such requests, the database will become stressed, and the cache will become meaningless.

With bloom filters we can pre-cache the primary key of the data query, such as the user ID or article ID, into the filter. When querying data according to ID, we first determine whether the ID exists. If so, we proceed to the next step. If not, return it directly so that subsequent database queries are not triggered. It is important to note that cache penetration cannot be completely solved and can only be controlled within a tolerable range.

Three, bloom filter actual combat

Bloom Filter has many implementations and optimizations. The famous Guava library developed by Google provides Bloom Filter implementation. To use Guava’s Bloom filter in a Maven-based Java project, simply introduce the following coordinates:

< the dependency > < groupId > com. Google. Guava < / groupId > < artifactId > guava < / artifactId > < version > 28.0 jre < / version > </dependency>Copy the code

After importing the Guava library, we create a new BloomFilterDemo class. In the main method we create a BloomFilter using the bloomfilter. create method, and then we initialize 1 million pieces of data into the filter. Then add 10000 pieces of data on the original basis and judge whether these data exist in bloom filter:

import com.google.common.base.Charsets; import com.google.common.hash.BloomFilter; import com.google.common.hash.Funnels; public class BloomFilterDemo { public static void main(String[] args) { int total = 1000000; // Total BloomFilter<CharSequence> bf = bloomfilter. create(funnel.stringfunnel (Charsets.UTF_8), total); // Initialize 1,000,000 pieces of data to the filterfor (int i = 0; i < total; i++) {
            bf.put(""+ i); } // check whether the value exists in the filter int count = 0;for (int i = 0; i < total + 10000; i++) {
            if (bf.mightContain("" + i)) {
                count++;
            }
        }
        System.out.println("Matched quantity"+ count); }}Copy the code

When the above code runs, the console prints the following:

The number of matches has been 1000309Copy the code

Obviously, the above output results have been given false positives, because there are 309 more elements than the expected result, and the misjudgment rate is:

309/(1000000 + 10000) * 100 ≈ 0.03059405940595940593Copy the code

If we want to improve the matching accuracy, we can set the false error rate FPP when creating the Bloom filter:

BloomFilter<CharSequence> bf = BloomFilter.create(
  Funnels.stringFunnel(Charsets.UTF_8), total, 0.0002
);
Copy the code

Within BloomFilter, the default value for misjudgment rate FPP is 0.03:

// com/google/common/hash/BloomFilter.class
public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
  returnCreate (good, expectedInsertions, 0.03 D); }Copy the code

After resetting the misjudgment rate to 0.0002, we re-run the program and the console outputs the following:

The number of matched packets is 1000003Copy the code

By observing the above results, it can be seen that the smaller the value of misjudgment rate FPP, the higher the matching accuracy. When the value of FPP of misjudgment rate is reduced, more storage space is required. Therefore, a tradeoff between misjudgment rate and storage space should be made in actual use.

Four,

This paper mainly introduces the concept of Bloom Filter and common application occasions. In the actual practice part, we demonstrate the basic use of Bloom Filter provided by Google’s famous Guava library. Meanwhile, we also introduce the reasons for false positives of Bloom Filter and how to improve judgment accuracy. Finally, in order to facilitate your understanding of Bloom filter, we introduce a simple version of bloom filter SimpleBloomFilter.

Architecture Digest, a public account, publishes a great article every day in the field of architecture, covering the application architecture of first-tier Internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture and other hot fields.

5 minutes to understand bloom filter, billions of data filtering algorithm you deserve!

A brief introduction to Bloom filter

Two, bloom filter application

Three, bloom filter actual combat

Four,

Related Posts

Github 11600+ : The most complete machine learning course resources for Ng (complete notes, subtitled videos in Chinese and English, Python assignments, provided with Baidu cloud image!) .

Sp_WhoIsActive sp_WhoIsActive sp_WhoIsActive sp_WhoIsActive

Get these 5 MySQL optimization tips and you’ll be great!