The following are the questions from fans:

There are 5 billion phone numbers, there are 100,000 phone numbers, how to quickly and accurately determine whether these numbers already exist?

1, through the database query ——- to achieve fast is a little difficult.

2, the data is pre-placed into the memory set: 5 billion *8 bytes is about 40G, the memory is too large

Basic introduction

Bloom Filter (English: Bloom Filter) was proposed by Bloom in 1970. It’s actually a long binary array + a series of random hash algorithm mapping functions that determine whether an element is in a collection.

Often we encounter a lot of business scenarios to determine whether an element is in a collection, and the common idea is to save all the elements in the collection and then compare them. Linked lists, trees, Hash tables, and other data structures work this way.

However, as the number of elements in the collection increases, the amount of storage space we need also increases linearly, eventually reaching the bottleneck. At the same time, the retrieval speed is slower and slower. The retrieval time complexity of the above three structures is O(n),O(logn),O(1) respectively. That’s where the Bloom Filter comes in.In a word:It consists of an array of bits with zero initial values and a number of hash functions to quickly determine whether a certain data exists. The essence is to judge whether specific data exists in a large set. Bloom filter is a data structure similar to SET, but the statistical results are not accurate.

Bloom filter characteristics

  • Inserts and queries are efficient, take up little space, and return results that are nondeterministic.
  • An element may not exist if it is judged to exist, but it certainly does not exist if it is judged to not exist.
  • A Bloom filter can add elements, but cannot remove them. Because deleting elements increases the misjudgment rate.
  • False judgments occur only for elements not added to the filter, not for added elements.

One thing is guaranteed: if the Bloom filter determines that an element is not in a set, that element must not be in the set, i.e., yes, maybe, none, and definitely none

Usage scenarios

1. Solve the problem of cache penetration cache penetration is generally, first check whether the redis cache has the data, if not, then query the database. When that data also does not exist in the database, each query accesses the database, which is called cache penetration. The problem with transparent caching is that when a large number of requests are made to query data that does not exist in the database, it can strain the database and even drag it down.

A Bloom filter can be used to solve the cache penetration problem by storing the key of the existing data in the Bloom filter, which is equivalent to a Bloom filter in front of redis. When there is a new request, check whether it exists in bloom filter first:

  • If the item does not exist in the Bloom filter, it is returned directly.
  • If there is a bloom filter, then query cache REDis, if there is no query in redis, then query Mysql database

2. Verify the blacklist

If it is found in the blacklist, it performs a specific operation. For example: identify spam, as long as the mailbox in the blacklist, identify as spam.

Assuming that the number of blacklists is in the hundreds of millions, storage is very expensive storage space, Bloom filter is a better solution.

Put all blacklists in the Bloom filter. When receiving emails, check whether the email address is in the Bloom filter.

Bloom filter principle

In traditional Hash in Java, a hash function is a function that converts input data of any size into output data of a specific size. The converted data is called a hash value or hash encoding, also called a hash valueIf two hashes are not the same (according to the same function) then the original input of the two hashes is also not the same.

This property is a consequence of the deterministic nature of the hash function, which is called a unidirectional hash function.

The input and output of a hash function are not unique correspondence. If two hash values are the same, the two input values are likely to be the same, but may also be different, which is called “Collision of hash”.

When storing large amounts of data in hash tables, the space efficiency is still very low, and when there is only one hash function, it is also prone to hash collisions.

Java hash conflict Java case, the code is shown as follows:

 
public class HashCodeConflictDemo
{
    public static void main(String[] args)
    {
        Set<Integer> hashCodeSet = new HashSet<>();

        for (int i = 0; i <200000; i++) {
            int hashCode = new Object().hashCode();
            if(hashCodeSet.contains(hashCode)) {
                System.out.println(Duplicate hashcode appears:+hashCode+"\t run to"+i);
                break;
            }
            hashCodeSet.add(hashCode);
        }
 
System.out.println("Aa".hashCode());
System.out.println("BB".hashCode());
System.out.println("Willow wood".hashCode());
System.out.println("Chai 柕".hashCode()); }}Copy the code

Bloom Filter is a kind of advanced data structure specially designed to solve the problem of de-duplication. Essentially,A large array of bits and several different unbiased hash functions (unbiased means evenly distributed). It consists of an array of bits with zero initial values and hash functions“Is used to quickly determine whether a certain data exists. But like HyperLogLog, it is also slightly imprecise and prone to miscalculation.When adding a key, multiple hash functions are used to hash the key to obtain an integer index value, and modulo the length of the bit array to obtain a position. Each hash function yields a different position. Set each position to 1 to complete the add operation.

When you query a key, if one of the keys is 0, the key does not exist. If both keys are 1, the corresponding key may not exist.

Conclusion: Yes, yes, yes, no, no

When a variable is added to the set, the variable is mapped to N points in the bitmap by N mapping functions, setting them to 1 (assuming that there are two variables that each pass three mapping functions).When we look up a variable, we just have to see if all of these points are 1’s, and we know with a high probability that it’s in the set

If any of these points are zero then the queried variable is definitely not there, and if they’re both 1 then the queried variable is probably there. Why is it possible, not certain? That’s because the mapping function itself is a hash function, and hash functions are subject to collisions.

Initializing a Bloom filter is essentially composed of a bit vector or a bit list (a list containing only 0 or 1 bit values) of length m, with all values initially set to 0.When we add data to a Bloom filter, in order to minimize address conflicts, we use multiple hash functions to evaluate the key, calculate a subscript index value, and then modulo the length of the bit array to obtain a position. Each hash function calculates a different position. Add by setting each of these bits to 1. For example, we add a string wmyskxz.3. To check whether a key exists, run the same hash functions on the key to check whether the corresponding position is 1. If one bit is 0, the key does not exist in the Bloom filter. If all of these positions are 1, then it is highly likely to exist; Because the 1 in these positions can be caused by the presence of other keys, which is the hash conflict mentioned above…

For example, after adding the string wmyskxz data, it is clear that the 1 in the lower 1/3/5 is caused by the first addition of wmyskxz; When an existing string, inexistent-key, was not added to the session, it might calculate 1/3/5, which is an error at……

The error of bloom filter is that multiple inputs are hashed at the same bit position 1, so it is impossible to determine which input is generated. Therefore, the error is caused by the same bit being mapped multiple times and set to 1. This situation also creates deletion problems for The Bloom filter, because each bit of the bloom filter is not exclusive, and it is possible that multiple elements share one bit. If we delete this bit directly, it will affect the other elements

features

  • If an element is judged to have none,
  • If the result is present, the element does not necessarily exist.

A Bloom filter can add elements, but cannot remove them. Because deleting elements increases the misjudgment rate.

A small summary

  • It is guaranteed that if the Bloom filter determines that an element is not in a collection, that element is not in the collection
  • It is best not to use the actual number of elements much larger than the initial number
  • When the actual number of elements exceeds the initial number, the Bloom filter should be rebuilt, a larger filter should be reassigned, and all historical elements should be added in batches

Advantages and disadvantages of bloom filter 1. Advantages: Efficient insertion and query, occupies less space

2. Disadvantages:

  • Elements cannot be deleted. Because deleting elements increases the error rate, and because multiple objects in a hash conflict can be shared, you can delete one element at the same time as deleting others.
  • Different data may generate the same hash value.

Cuckoo filter (Understand)

In order to solve the problem that the Bloom filter could not delete elements, the cuckoo filter was born. Cuckoo Filter: Better Than Bloom

The author makes an in-depth comparison between the cuckoo filter and the Bloom filter. Compared to the cuckoo filter, the Bloom filter has the following disadvantages: poor query performance, low space utilization efficiency, no reverse operation (delete), and no counting support