Ref

  • Bloom filter | wikipedia
  • What is a bloom filter | the nuggets
  • Derivation of FPP formula for false identification rate of Bloom filter
  • Redis – avoid cache penetrating weapon BloomFilter | the Denver nuggets

Overview

Bloom Filter is a data structure proposed by Burton Howard Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions.

Bloom filters can be used to efficiently retrieve whether an element is in a collection. Its advantage is that the space efficiency and query time are far better than the general algorithm, the disadvantage is that there is a certain error recognition rate, and difficult to delete (generally not supported, need additional implementation).

The error rate refers to the fact that it can be determined that the element is definitely not in the set, that the element may be in the set, and that it cannot be determined that the element is definitely in the set.

The Bloom filter is efficient because it is a probabilistic data structure that confirms that an element is definitely not in the set, or that it might be. It is possible because it has a false identification rate that makes it impossible to be 100% sure that the element is in the collection.

Questions lead

A common requirement in everyday work is to determine whether an element is in a collection. For example, the following scenario

  • Given an IP address blacklist library, check whether the specified IP address is in the blacklist.
  • When receiving emails, check whether an email address is spam.
  • Check if an English word is spelled correctly in word processing software.

When faced with this kind of problem, it is often intuitive to use collections as data structures. For example, all IP addresses in the IP address blacklist database are stored in a set, and then the specified IP address is added to the set to check whether it exists. If it exists, the IP address matches the blacklist.

Through a code to simulate the IP blacklist library storage and inspection.

public class IPBlackList {

	public static void main(String[] args) {
		Set<String> set = new HashSet<>();
		set.add("192.168.1.1");
		set.add("192.168.1.2 instead");
		set.add("192.168.1.4");
		System.out.println(set.contains("192.168.1.1"));
		System.out.println(set.contains("192.168.1.2 instead"));
		System.out.println(set.contains("192.168.1.3"));
		System.out.println(set.contains("192.168.1.4")); }}Copy the code

The internals of collections are usually implemented using hash tables. Its advantage is that the query is very efficient, but its disadvantage is that it consumes more storage space.

In general, when the amount of data is relatively small, we will use collections for storage. Changing space for time can improve query efficiency while occupying less space.

However, when storing large amounts of data, consuming large amounts of space can become a problem. This data is usually stored in process memory to speed up queries. Memory on a machine is usually limited and should be used as efficiently as possible.

Hash tables, on the other hand, need to be balanced between space and efficiency. If you store the same number of elements, the smaller the hash table, the higher the probability of conflicts, and the more time it takes to resolve conflicts, affecting performance.

Bloom Filter can solve this problem well. The ability to store data with less memory on the one hand and very efficient query performance on the other.

The basic principle of

  1. First, build a binary vector and set all bits to 0
  2. Then, K hash functions are selected to hash the elements K times and calculate the bitwise subscripts of the vectors.

Add elements

When an element is added to the set, K values are generated as subscripts by K hash functions acting on each element, and the corresponding bit of the vector is set to 1.

Check the element

If you want to check whether an element is in the set, use the same hash method to generate K subscripts and check that the corresponding bits of the vector are all 1’s.

If all are 1, the element is most likely in the set; Otherwise (as long as one or more bits are 0), the element is definitely not in the set.

Demo

  1. Suppose you have a Bloon filter with a capacity of 15 bits, using two hash functions, as shown below.

| 0 | 1 | | | 2 | 3 | 4 5 6 7 8 9 10 | | | | | 11 12 13 14 15 | | | | | | – | – | – | | – | | – | | – | | – | | – | | – | | – | | – | | 0 | | | | | | | | | | | | | | | | | | |

  1. Add a stringa, the subscripts are 4 and 10, and the corresponding bits of 4 and 10 are marked as 1 by 0.

| 0 | 1 | | | 2 | 3 | 4 5 6 7 8 9 10 | | | | | 11 12 13 14 15 | | | | | | – | – | – | | – | | – | | – | | – | | – | | – | | – | | – | | | | | | 1 | | | | | | 1 | | | | | | | | |

  1. Add another stringb, the subscript is 11, and the bit corresponding to 11 is marked as 1 by 0.

| 0 | 1 | | | 2 | 3 | 4 5 6 7 8 9 10 | | | | | 11 12 13 14 15 | | | | | | – | – | – | | – | | – | | – | | – | | – | | – | | – | | – | | | | | | 1 | | | | | | | 1 | | | | | | | |

  1. Add another stringc, the subscripts are 11 and 12, and the corresponding bit is marked as 1 instead of 0.

| 0 | 1 | | | 2 | 3 | 4 5 6 7 8 9 10 | | | | | 11 12 13 14 15 | | | | | | – | – | – | | – | | – | | – | | – | | – | | – | | – | | – | | | | | | 1 | | | | | | | 1 | 1 | | | | | | |

  1. Finally, add a stringsam, the subscripts are 0 and 7, and the corresponding bit is marked as 1 by 0.

| 0 | 1 | | | 2 | 3 | 4 5 6 7 8 9 10 | | | | | 11 12 13 14 15 | | | | | | – | – | – | | – | | – | | – | | – | | – | | – | | – | | – | | 1 | | | | 1 | | | 1 | | | | 1 | 1 | | | | | | |

Above, we added four strings, each hashed twice, with the corresponding two bits marked as 1, and finally marked as 1 with six bits instead of eight.

This shows that different elements can be hashed into overlapping positions. If you have more elements, the probability of overlap is higher. If two elements overlap, we can’t tell if any element is in the set.

If you want to check whether the element exists in the set, you just need to hash twice in the same way, and search the corresponding bits of the resulting 2 subscripts in the Bloom filter. If the corresponding 2 bits are not all 1’s, the element is definitely not in the set. If the corresponding 2 bits are all 1, the element may or may not be in the set.

For example, if the string B is checked to see if it exists in the set, the hash results in two subscripts of 11. Check that 11 corresponds to bit 1. But that doesn’t necessarily mean that B is in the set. This is because the hashed subscript of the string C also contains 11, so it is possible that the string C is in the set but b is not, which causes false identification, also known as a false positive.

Again, the string foo is hashed with subscripts 8 and 13, respectively, and the corresponding bits are 0. Therefore, the string foo is definitely not in the collection.

Mathematical principles

Here’s the math behind bloom’s filter

The probability of two completely random numbers collides is small, so you can store a lot of information in a very small amount of space with a very small misidentification rate.

recognition

The false positive probabilistic (FPP) rate was calculated as follows.

Assume that the Bloom filter is m bits in size, stores N elements, and uses k hash functions to calculate where the elements are stored.

  • Add 1 element, then the probability that any bit is 1 is1/m, the probability that any bit is 0 is1-1/m;
  • The probability of any bit being zero after k hashing of an element is(1-1/m)^k, the probability that any bit is 1 is1-(1-1/m)^k;
  • If n elements are added, then the probability that any bit is zero is(1-1/m)^kn, the probability that any bit is 1 is1-(1-1/m)^kn;
  • If you add a new element to an existing onen, then the probability that any bit is already 1 is the same as above, and the probability is1-(1-1/m)^kn. Therefore, the probability that all k bits are 1 is:(1-(1-1/m)^kn)^k, which is the false identification rate of newly inserted elements.

When n is large,$(1-(1-1/m)^{kn})^k$Approximately equal to the$(1-e^{-kn/m})^k$

Assuming that the Bloom filter size m is 16 bits, k is 8, and storage element n is 1, the probability of false identification (false positive) is 5 in 10,000 (5/10,000).

At this point, m/n=16, which actually means that one element uses 16 bits of space to store it.

Therefore, when K is the same and M /n is 16/1, 32/2 and 64/4, the false recognition rate is the same.

Bloom Filters – The Math provides a table of false recognition rates for Bloom Filters, which makes it easy to detect false recognition rates for different m, n, and K values.

Resolve false identification rate

Common approaches to addressing false identification rates include

  1. White list
  2. Go back to confirm

White list

A common solution to the misidentification rate is to create a small whitelist of data that could be misidentified.

Take spam filtering as an example. Suppose we have a junk mail library that filters out junk mail as it is received.

In this case, the spam library can be stored in the Bloom filter, and when received, the bloom filter can be used to efficiently filter out most of the normal mail.

For a small number of spam hits, some of them may be normal.

Create a whitelist library. When a spam message is detected in the Bloom filter, query the whitelist to check whether the message is normal.

Messages that are not in the whitelist are moved to the trash by default. Through manual identification, when normal mail is found in the trash bin, it is whitelisted.

Back to the source to confirm

In many cases, bloom filters are used to intercept large amounts of data that is not in the collection at a low cost and efficiently. For example,

  • Google Bigtable, Apache HBase, and Apache Cassandra and PostgreSQL use Bloom filters to reduce disk look-ups for nonexistent rows or columns. Avoiding expensive disk lookups can greatly improve the performance of database query operations.
  • A Web browser that uses bloom filters to identify malicious urls in Google Chrome. All urls are first checked against the local Bloom filter, and the executed URL is fully checked only if the Bloom filter returns a positive result (the user is warned if that result also returns a positive result).
  • Block a large number of non-IP blacklist requests. For a small number of IP addresses that may be in the blacklist, query the blacklist database again.

This is a very typical application scenario for bloom filters, which filter out most of the requests and then process only a few ambiguous requests.

This approach, unlike whitelisted libraries, does not require another set of libraries to process. Instead, it uses data and logic that already exists.

Google Bigtable, for example, is a query for rows that need to be queried, but uses bloom filters to block most unnecessary requests. Whether the IP address is blacklisted also needs to be queried. It is also necessary to use bloom filter to block most IP addresses first.

And the above spam processing, for the possible spam, not through the complete spam database to check again for confirmation, but by adding the white list to judge. Because in general, whitelist libraries are smaller and easier to cache.

By back source, we mean a request that may have been mistaken for something else, and then finally goes back to the data source or logic for confirmation.

Implementation of blom container in Guava

Guava provides an implementation of the Bloom Filter.

Guava bag into

To use the Bloom Filter, you need to introduce the Guava package

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>23.0</version>
</dependency>  
Copy the code

Misjudgment rate test

The misjudgment rate of Bloom container is tested in the following two steps

  1. You put a million numbers in the filter, and then you verify that those million numbers can get through the filter
  2. Find another 10,000 numbers and check the number of fish that escaped
/** * Test bloom filter (available for Redis cache penetration) **@authorAoBing * /
public class TestBloomFilter {

    private static int total = 1000000;
    private static BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), total);
//    private static BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), total, 0.001);

    public static void main(String[] args) {
        // Initialize 1,000,000 pieces of data to the filter
        for (int i = 0; i < total; i++) {
            bf.put(i);
        }

        // Match the values already in the filter, and see if there are any unmatched values
        for (int i = 0; i < total; i++) {
            if(! bf.mightContain(i)) { System.out.println("Some bad guy got away..."); }}// Match the 10000 values that are not in the filter, how many are matched
        int count = 0;
        for (int i = total; i < total + 10000; i++) {
            if (bf.mightContain(i)) {
                count++;
            }
        }
        System.out.println("Number of friendly fire injuries:"+ count); }}Copy the code

The results

Number of friendly fire injuries: 320Copy the code

The result of the run is that when you go through the million numbers in the filter, all of them are identified. Ten thousand numbers are not in the filter, and 320 are mistakenly injured, with an error rate of about 0.03.

Bloom Filter source code analysis

public static <T> BloomFilter<T> create(Funnel<? super T> funnel, int expectedInsertions) {
        return create(funnel, (long) expectedInsertions);
    }  

    public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
        return create(funnel, expectedInsertions, 0.03); // FYI, for 3%, we always get 5 hash functions
    }

    public static <T> BloomFilter<T> create(
          Funnel<? super T> funnel, long expectedInsertions, double fpp) {
        return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);
    }

    static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {... }Copy the code

There are four Bloom Filter create methods, all of which eventually lead to a fourth. Take a look at what each parameter means

  1. funnel: data type (typically in the Funnels tool class)
  2. expectedInsertions: The number of values expected to be inserted
  3. fpp: Error rate (default 0.03)
  4. strategy: Application of Hashing algorithm Bloom Filter