Questions to think about first
- How would you check whether an element is in a set?
- A Bloom filter does not store the data itself, so how does it answer that question?
- How is a Bloom filter implemented, and how are its parameters configured?
In general, when we use a List, Map, or Set to check whether an element exists, we first store the element and then look it up. The drawback of all these structures is that the data itself must be stored. When we only want to know whether an element exists, not what it is, that storage is wasted space.
When do we only need to know whether an element exists? In system design we have to handle large numbers of concurrent requests, and many of them may ask for data that does not exist. There is no need to pass such requests on; they can be filtered out directly at the API gateway layer.
Principle of the Bloom filter
The Bloom filter is a data structure built on a binary vector (a bit array), proposed by Burton Howard Bloom in 1970. It is efficient in both space and time and is used to test whether an element is a member of a set.
Instead of storing the data itself, a Bloom filter uses k hash functions to compute k positions in a bit array and sets the value at each position to 1. A lookup computes the same k positions: if any of them is 0, the element was definitely never added; if all of them are 1, the element may be present. The value of k is calculated from a formula, which is listed later. In addition to k, we need to calculate the length m of the bit array.
The bit array length is m = -n * ln(fpp) / (ln 2)^2, where:
- fpp: the acceptable false positive rate (must satisfy 0 < fpp < 1)
- n: the estimated number of elements to be inserted
The number of hash functions is k = (m / n) * ln 2, where:
- m: the bit array length
- n: the estimated number of elements to be inserted
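To get a feel for how these parameters relate, here is a minimal sketch that computes m and k from n and fpp using the standard Bloom filter sizing formulas (written out in the comments). The class name BloomSizing and the sample values n = 1,000,000 and fpp = 0.01 are assumptions chosen for illustration only.

```java
// Hypothetical helper class for illustration; not part of any library.
final class BloomSizing {

    /** m = -n * ln(p) / (ln 2)^2, truncated to a whole number of bits. */
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** k = (m / n) * ln 2, rounded to the nearest integer and at least 1. */
    static int optimalNumOfHashFunctions(long n, long m) {
        // Cast before dividing so the ratio m / n keeps its fractional part.
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // assumed number of elements
        double p = 0.01;     // assumed false positive rate
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.println("m = " + m + " bits, k = " + k);
    }
}
```

With these sample inputs the sketch yields about 9.6 million bits (roughly 1.2 MB) and 7 hash functions, far less than storing a million keys directly.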
Let's take the number 11 as an example. There is also a website for trying out Bloom filters online.
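The principle described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name MiniBloomFilter is hypothetical, and the hashing scheme (an FNV-1a hash combined with String.hashCode() as h1 + i * h2) is a simple stand-in, not what Guava uses internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative sketch only; names and hashing scheme are assumptions.
final class MiniBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    MiniBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** Sets the k bit positions derived from the value. */
    void put(String value) {
        for (int i = 0; i < k; i++) {
            bits.set(position(value, i));
        }
    }

    /** Returns false if any position is 0 (definitely absent), otherwise true (possibly present). */
    boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    // i-th position: (h1 + i * h2) mod m, kept non-negative with floorMod.
    private int position(String value, int i) {
        int h1 = fnv1a(value.getBytes(StandardCharsets.UTF_8));
        int h2 = value.hashCode() | 1; // forced odd so it is never 0
        return Math.floorMod(h1 + i * h2, m);
    }

    // 32-bit FNV-1a hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    public static void main(String[] args) {
        MiniBloomFilter filter = new MiniBloomFilter(1 << 16, 7);
        filter.put("11");
        System.out.println(filter.mightContain("11")); // true: its own bits were just set
        System.out.println(filter.mightContain("12")); // usually false, but a false positive is possible
    }
}
```

Note that the filter only ever stores bits, never the value "11" itself, which is where the space saving comes from.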
Advantages and disadvantages of Bloom filter
Advantages:
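- Space saving: instead of storing all the data, it only computes positions from hash values and records them in a bit array.
- Fast: an insert or a lookup costs k hash computations, so the time complexity is O(k), which is effectively O(1) because k is a small constant.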
Disadvantages:
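- It is not fully accurate: false positives are possible. Suppose A hashes to positions 1 and 3, and B hashes to positions 5 and 7. If C hashes to positions 1 and 7, both bits are already set, so the filter reports that C may exist even though C was never added. A negative answer, in contrast, is always correct.
- Elements cannot be deleted directly. Deleting means setting the corresponding positions back to 0, which may break lookups for other elements that share those positions.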
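Both disadvantages can be reproduced with a tiny sketch. The class name BloomPitfalls and its helper methods are hypothetical, and the bit positions for A, B, and C are taken directly from the example above rather than computed by real hash functions.

```java
import java.util.BitSet;

// Illustrative sketch: positions are hard-coded from the A/B/C example.
final class BloomPitfalls {

    static void add(BitSet bits, int... positions) {
        for (int p : positions) {
            bits.set(p);
        }
    }

    static boolean mightContain(BitSet bits, int... positions) {
        for (int p : positions) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    // Naive deletion: clears the element's bits, including any shared ones.
    static void naiveRemove(BitSet bits, int... positions) {
        for (int p : positions) {
            bits.clear(p);
        }
    }

    /** A and B are added; C (positions 1 and 7) was never added. */
    static boolean falsePositiveForC() {
        BitSet bits = new BitSet(8);
        add(bits, 1, 3); // A
        add(bits, 5, 7); // B
        return mightContain(bits, 1, 7); // C borrows bit 1 from A and bit 7 from B
    }

    /** A and C are added, then A is "deleted"; is C still found? */
    static boolean cSurvivesDeletingA() {
        BitSet bits = new BitSet(8);
        add(bits, 1, 3); // A
        add(bits, 1, 7); // C shares bit 1 with A
        naiveRemove(bits, 1, 3);
        return mightContain(bits, 1, 7);
    }

    public static void main(String[] args) {
        System.out.println("C reported present without being added: " + falsePositiveForC());
        System.out.println("C still found after deleting A: " + cSurvivesDeletingA());
    }
}
```

The first check returns true (a false positive), and the second returns false: clearing A's bits also wiped the bit that C depends on, which is why plain Bloom filters do not support deletion.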
Bloom filter implementation
A Bloom filter is already implemented in the Google Guava library, so we do not have to write one ourselves. Let's look at how it calculates the two parameters:
/**
 * Calculates the length of the bit array.
 * n: estimated number of elements
 * p: false positive rate, between 0 and 1
 */
@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
    if (p == 0.0D) {
        p = 4.9E-324D; // Double.MIN_VALUE, avoids taking log(0)
    }
    return (long) ((double)(-n) * Math.log(p) / (Math.log(2.0D) * Math.log(2.0D)));
}

/**
 * Calculates the number of hash functions.
 * n: estimated number of elements
 * m: bit array length
 */
@VisibleForTesting
static int optimalNumOfHashFunctions(long n, long m) {
    // As shown here, m / n is long division, so the ratio is truncated before it is scaled by ln 2.
    return Math.max(1, (int)Math.round((double)(m / n) * Math.log(2.0D)));
}
Try it yourself
- expectedInsertions: the estimated number of elements to insert. The filter stays accurate as long as the real number of insertions does not exceed this value. You can set it freely in the following example, but if it is too small for the data actually inserted, mightContain returns true more and more often for elements that were never added.
- fpp: the acceptable false positive rate, between 0 and 1
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterTest {
    public static void main(String[] args) {
        // With this fpp, 800 million expected insertions need roughly 2.4 GB of heap;
        // lower the value for a quick experiment.
        int expectedInsertions = 800000000;
        double fpp = 0.00001;
        BloomFilter<CharSequence> bloomFilter =
                BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), expectedInsertions, fpp);
        int i = 10000;
        while (i > 1) {
            bloomFilter.put("aa" + i);
            // "ab" + i was never inserted, so this prints false unless a false positive occurs.
            System.out.println(bloomFilter.mightContain("ab" + i));
            i--;
        }
    }
}