What is the

It was proposed by Blom in 1970 and consists of a long binary vector and a series of random mapping functions. It can help us quickly locate whether an element exists in a set or not. However, it should be pointed out that this is a probabilistic data structure. Of course, probability is unidirectional. When it determines that an element does not exist, it must not exist; when it determines that there is a probability.

The working principle of

We need to prepare an M bit binary array with k hash functions that map the contents to locations from 0 to m-1. When a new element is added, k hash functions compute k values and change the identifier at these locations to occupied (gray). When it is necessary to query whether an element exists, k hash functions are also used to calculate K values. If all the identifiers in the K positions are occupied, then the element is judged to exist by the Bloom filter (actually it only exists with probability). As long as the identity of 1 location is not occupied, then the element must not exist.

Let’s go through an example where, for simplicity, we set k=2 and m=8, which means there are only 2 hash functions +1 8-bit binary data. In the initial case, all 8 locations are empty.

Let’s add one more element, Jerry

Hash1 evaluates to 4 and hash2 evaluates to 6, so we set 4 and 6 to occupied

Let’s add one more element, Tom

Hash1 evaluates to 1 and hash2 evaluates to 4, so we set positions 1 and 4 to occupied

We need to query for the existence of 1 element, Lucy

Hash1 evaluates to 3, hash2 evaluates to 1, and since position 3 is not occupied, it must not exist

We need to query the existence of 1 element, Lily

Hash1 evaluates to 6, hash2 evaluates to 1, because both positions 1 and 6 are already occupied, and the Bloom filter determines that the element exists, but in fact, we know that the element does not exist, and here positions 1 and 6 are occupied because of hash conflicts with other elements

why

Essentially, a Bloom filter is a data structure. This one-way probability data structure is actually designed to improve the HashMap under certain conditions. It’s faster and takes up less space, but because it’s a probabilistic data structure, it’s bound to be used in fewer scenarios than HashMap.

Why faster?

One important reason is that hash collisions are rarely handled, and when judged to exist, whether or not they actually exist is a probabilistic event. We can compare Java’s HashMap. Since M is always finite, the probability of hash collisions increases as the number of elements increases. Then how does Java deal with this? After 1.8, if the list elements are greater than 8, it will be optimized as a red-black tree; Either way, the Hash collision takes much longer than the Hash function.

Why use multiple hash functions

In fact, you can directly compare HashMap, because M is finite, hash collisions are inevitable, and the probability of collisions increases as the element /m increases. In the event of a conflict, instead of looking for it like a HashMap, the Bloom filter simply returns that the element already exists (which it probably doesn’t). However, if you have two hash functions, you can alleviate this situation, because the probability that both hash functions will exist is smaller than that of one hash function. As you can imagine, as the value of K increases, there will always be a critical point where increasing the hash function will cause each element to enter the Bloom filter to occupy more slots, which will exacerbate the hash conflict.

Why does it take up less space

Using a binary array, each flag bit occupies only one bit and does not store the element itself, which is much smaller than the way HashMap stores entries

delete

One important disadvantage of the Bloom filter is that it does not support deleting elements. Of course, there are already bloom filters that support deleting elements, but from the principle of implementation, this compromise will partly lose the above two important advantages, but make it meaningless in the scenarios where it is originally applicable. This is a principle we’ll write about later, but from what I understand so far, the bloom filter with the delete feature has little practical value.

Applicable scenario

So what are the scenarios that apply? Some typical scenarios are listed as follows: short chain system, judging whether a short chain has used the mail system, judging whether a mailbox is cached in the blacklist system, and judging whether a follow-up query is needed. It is not difficult to find that these scenarios are two important advantages that require bloom filter: fast speed and small space; But for two important disadvantages: probability exists and does not support deletion, both scenarios can be tolerated. In fact, to find anything applicable scenarios are these four words: foster strengths and avoid weaknesses.

thinking

When you think about why you don’t use HashMap, you’ll have a better understanding of the pros and cons of using a Bloom filter.