Remember the article from a while ago? Redis uses bitmap to record the status of online users, or it needs to implement a record of the status of online IM users. Today, we will talk about another solution, Bloom filter

Bloom filter

In our daily life and work, we often encounter this scenario, retrieving information from an Excel or not in an Excel. Remember the fear of being controlled by CTRL+F? Anyway, in software development, we usually use a Hash Table, which is also called a Hash Table. The disadvantage is the waste of storage space. In our scenario, the storage of the logged-in userId into the hash table is inefficient when the user size is very large. Today, we introduce a mathematical tool: Blom filter, which can solve the same problem with 1/8 to 1/4 of the size of the hash table.

In the endorsement

The Bloom Filter, which was proposed by Burton Bloom in 1970, is actually a long binary vector and a series of random mapping functions.

The principle of

Using our scenario, let’s say we have 100 million people online at the same time on our personal website. To store the online status of 100 million people, we first construct a vector of 1.6 billion bits, or 200 million bytes, and then write down all 1.6 billion bits as zero. For each login userId, 8 different algorithms are used to generate 8 different information fingerprints. An algorithm is used to hide the 8 information into the 8 positions of the 1.6 billion bits, and the 8 positions are set to 1, thus building a Bloom filter that records the online status of 100 million users.

Retrieval is the same principle, use the same algorithm to retrieve the userId to generate 8 information fingerprints, and then look at the eight information fingerprints in the 1.6 billion bits corresponding to the value of 1, 1 means that the userId is online, the following is to use Java code to achieve a Bloom filter.

Java implements bloom filters

Start by implementing a simple Bloom filter

package edu.se; import java.util.BitSet; /** * @author ZhaoWeinan * @date 2018/10/28 * @description */ public class BloomFileter { Private static final int[] primes = new int[]{2, 3, 5, 7, 11, 13, 17, 19}; private static final int[]{2, 3, 5, 7, 11, 13, 17, 19}; Private Hash[] hashList = new Hash[primes. Length]; private Hash[] hashList = new Hash[primes. Private BitSet bits = new BitSet(256 << 22); public BloomFileter() { for (int i = 0; i < primes.length; I ++) {// Create eight algorithms using 8 primes hashList[I] = new Hash(primes[I]); Public void add(String value) {for (Hash f: Bits. set(f.hash(value), true); // Set (f.hash(value), true); }} public Boolean contains(String value) {if (value == null) {return false; } boolean ret = true; For (Hash f: hashList) {ret = ret && bits.get(f.hash(value)); } return ret; } public static class hash {private int prime; public Hash(int prime) { this.prime = prime; } public int hash(String key) { int hash, i; for (hash = key.length(), i = 0; i < key.length(); i++) { hash += key.charAt(i); } return (hash % prime); } } public static void main(String[] args) { BloomFileter bloomFileter = new BloomFileter(); System.out.println(bloomFileter.contains("5324512515")); bloomFileter.add("5324512515"); For (int I = 1; i < 100000000 ; i ++){ bloomFileter.add(String.valueOf(i)); } long begin = System.currentTimeMillis(); System.out.println(begin); System.out.println(bloomFileter.contains("5324512515")); long end = System.currentTimeMillis(); System.out.println(end); System.out.println(" check whether 5324512515 is online :" + (begin-end)); }}Copy the code

This code builds a 1 billion bit bitSet, then adds 100 million userIDS to our Bloom filter, and recently determines whether the userId 5324512515 is logged in or not, typing out the execution time of the code

Let’s look at the memory footprint

An instance of the BloomFileter class takes more than 100 MB

It seems that bloom filters are really efficient for storage

False identification of Bloom filter

Bloom filter the advantage of fast, save space, but there are certain recognition, this probability is very small, to compute a deterrent to other probability is not difficult, put a book here The bloom filter has m bits, there are n elements, each element corresponding to k information fingerprint hash function, in the bloom filter insert an element, The probability of the bit is set to 1 to 1 / m, it is still the probability of 0 to 1-1 / m, so k hash function have no probability of his set to 1 to 1 to 1 / m k power, a bit after inserting the n elements, is set to 1 to 1 minus the probability 1-1 / m kn power, finally the book gives a formula, is not posted here, Here is a table of the false positive probabilities under different m/n ratios and different values of K:

Bloom filter for everyone here, welcome everyone to exchange, point out some of the wrong places in the article, let me deepen my understanding. Thank you!