preface

Make writing a habit together! This is the sixth day of my participation in the “Gold Digging Day New Plan ยท April More text Challenge”. Click here for more details.

I have seen such a paragraph on the Internet before ๐Ÿ‘‡

Data structures are nothing different. They are like the bookshelves of your application where you can organize your data. Different data structures will give you different facility and benefits. To properly use the power and accessibility of the data structures you need to know the trade-offs of using one.

Data structures are no different. In an application, they are like bookshelves where you can organize your data. Different data structures offer different advantages and benefits, and you need to carefully weigh your needs and use them appropriately. Bloom Filter is the representative of practicing this sentence, so today we will talk about Bloom Filter in simple terms.

Bloom Filter

Or the old rules, before learning a new knowledge, we need to understand his basic concept ๐Ÿ‘‡

Bloom Filter was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. The advantage of this algorithm is that the space efficiency and query time are much better than the general algorithm, but the disadvantage is that there is a certain error recognition rate and deletion difficulty.

Bloom filters can be used to retrieve whether an element is in a collection. Some of you might ask: Why use a Bloom filter when you can just use HashMap to retrieve elements? A HashMap does help us do this, and by determining the location of an element in a single operation, a HashMap can help us check whether an element is in a collection. So let’s think about the next question: if there are a billion random numbers in this set of elements, how do we determine whether a number exists?

If the element set contains one billion random numbers, it will first bring us space problems. If one random number occupies four sub-sections, the one billion random numbers will occupy nearly 4G storage space, which will consume a lot of space. That’s where the Bloom Filter comes in.

๐Ÿ“๐Ÿ“ Basic idea of bloom filter ๐Ÿ“๐Ÿ“

We mentioned above that the element set contains a billion random numbers, and it would take up a lot of space to store these billion elements directly, so we definitely can’t store elements directly. So how do we store it? Storage is also very clever, you can use the way of bit array, not directly store elements, but the state of whether the elements exist, so you can save a lot of storage space ~ (also do not know how the gods think of such a clever way ๐Ÿ˜‚)

How does the Bloom filter determine if an element is in a set ๐Ÿ‘‡

๐Ÿฅ โ‘  ๐Ÿฅ now has a collection and an array of bits with both initial states 0

N hash functions are used to calculate the element’s position in the array, and the corresponding position in the array is changed from 0 to 1

๐Ÿฅ (3) ๐Ÿฅ if the need to determine whether the element X exists, then the element X also passes through the N the operation of hash function in the array several position, if the value of a number of positions 1, then prove the existence of element X is likely and collection, conversely prove element X must not exist in the collection. As shown in the figure below, at this point, the values of the computed positions of N hash elements of element X are not all 1, so it is proved that element X does not exist in the set.

๐Ÿ“๐Ÿ“ Bloom filter features ๐Ÿ“๐Ÿ“

In point โ‘ข, there is such a sentence: if the value in several positions obtained is 1, then it is proved that the element X is likely to exist in the set. Why is it possible to have all 1’s, but not certain? This has to do with the characteristics of hash functions…

We all know that a hash function is a function that converts input data of any size into output data of a specific size (the converted data is called a hash value). A hash function also has the following two characteristics:

  • If you get different hashes based on the same hash function, the original input values of the two hashes must be different.
  • If two hashes derived from the same hash function are equal, the original input values of the two hashes may or may not be equal.

This is similar to the way in Java that hashCodes of two objects are equal, but the objects themselves are not necessarily equal. To be clear, the value of the mapping points in the bitarray computed by the hash function is all 1, not necessarily the value set when the variable to be queried was stored in, but may also be the mapping points of other elements. This brings us to one of the peculiarities of bloom filters: there is a certain degree of misjudgment.

In League of Legends, you can trust Bloom completely, but be wary of ๐Ÿ˜‚ when writing code

So can we delete the elements in the bit array? Obviously not, because it is possible to map multiple input values to the same point in the bit array, and deleting it would affect the results of other elements in the Bloom filter. This is another feature of bloom filters: you cannot delete elements in bloom filters.

So we can sum up the advantages and disadvantages of bloom filter ๐Ÿ‘‡

๐ŸŽ advantages ๐ŸŽ : in space and time, have a huge advantage. Because it is not a complete data store, but a binary vector, it can save a lot of memory space. In terms of time complexity, since the query is calculated according to the hash function, then assuming N hash functions, the time complexity is O(N). At the same time, the storage of elements is not the element itself, but the binary vector, so it has certain advantages in some scenarios with strict requirements on confidentiality.

๐ŸŒ disadvantages ๐ŸŒ : certain misjudgment exists (the more elements are stored in the Bloom filter, the higher the misjudgment rate); You can’t delete elements in the Bloom filter. As you use it for a longer time, more and more elements are stored in the bloom filter, resulting in more memory usage, higher error rate, and finally having to reset.

๐Ÿ“๐Ÿ“ Bloom filter application ๐Ÿ“๐Ÿ“

We’ve all heard of “cache penetration”, so how do we solve cache penetration? That’s right, to solve this problem through the Bloom filter ๐Ÿ’ช.

The problem of cache penetration is mainly because the Key value passed in Redis does not exist, so it will directly hit the database, which increases the pressure of the database. In view of this situation, you can add bloom filter in front of Redis, add the data in the database to bloom filter in advance, judge whether the Key value exists through bloom filter before querying Redis, if not, return directly, if the Key value exists, then continue to execute according to the original process.

To solve the problem of cache penetration, bloom filter determines that if the result does not exist, it must not exist. However, due to certain misjudgment of Bloom filter, it cannot be said that cache penetration is completely solved, but it can greatly alleviate the problem of cache penetration.

๐Ÿ“๐Ÿ“ Simulation implementation of bloom filter ๐Ÿ“๐Ÿ“

Finally, a piece of code is affixed to simulate the implementation of the Bloom filter ๐Ÿ‘‡

import java.util.BitSet;

/ * * *@description: MyBloomFilter
 * @author: Zhuang Ba. Liziye *@create: thus * * / were 2022-04-01
public class MyBloomFilter {

    /** * A 1 billion bit */
    private static final int DEFAULT_SIZE = 256 << 22;

    /** * To reduce the error rate, the addition hash algorithm is used, so define an 8-element prime array */
    private static final int[] seeds = {3.5.7.11.13.31.37.61};

    /** * is equivalent to building 8 different hash algorithms */
    private static HashFunction[] functions = new HashFunction[seeds.length];

    /** * Initializes the bitmap of the Bloom filter, that is, the bit array */
    private static BitSet bitset = new BitSet(DEFAULT_SIZE);

    /** * Add data **@paramValue Specifies the value to be added */
    public static void add(String value) {
        if(value ! =null) {
            for (HashFunction f : functions) {
                // Compute the hash value and change the bitmap position to true
                bitset.set(f.hash(value), true); }}}/** * determines whether the corresponding element has *@paramValue Specifies the element *@returnResults the * /
    public static boolean contains(String value) {
        if (value == null) {
            return false;
        }
        boolean ret = true;
        for (HashFunction f : functions) {
            ret = bitset.get(f.hash(value));
            // a hash function returns false to break out of the loop
            if(! ret) {break; }}return ret;
    }

    /** * test */
    public static void main(String[] args) {

        for (int i = 0; i < seeds.length; i++) {
            functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
        }

        // Add 100 million elements
        for (int i = 0; i < 100000000; i++) {
            add(String.valueOf(i));
        }
        String id = "123456789";
        add(id);

        System.out.println("Does the element 123456789 exist:" + contains(id));
        System.out.println("Does element 234567890 exist:" + contains("234567890")); }}class HashFunction {

    private int size;
    private int seed;

    public HashFunction(int size, int seed) {
        this.size = size;
        this.seed = seed;
    }

    public int hash(String value) {
        int result = 0;
        int len = value.length();
        for (int i = 0; i < len; i++) {
            result = seed * result + value.charAt(i);
        }
        int r = (size - 1) & result;
        return (size - 1) & result; }}Copy the code

summary

My experience is limited, some places may not be particularly in place, if you think of any questions when reading, welcome to leave a message in the comments section, we will discuss one by one ๐Ÿ™‡

Please take a thumbs up and a follow (โœฟโ—กโ€ฟโ—ก) for this article ~ a crab (โ—’โ—ก’โ—)

If there are mistakes in the article, welcome to comment correction; If you have a better, more unique understanding, you are welcome to leave your valuable ideas in the comments area.

Love what you love, do what you do, listen to your heart and ask nothing