Bloom filter is a data structure designed to solve the existence of retrieval in massive data.
Ideally, an n-bit bitmap can hold 2n2^n2n elements, and the probability of false positives increases significantly when there are more elements.
Suppose the bitmap length is MMM, there are KKK hash functions, and the hash values are evenly distributed over [0,m][0,m][0,m] for any element in the input field.
When NNN elements have been inserted into the current filter, K ∗ NK * NK ∗n has been inserted for n times, and the probability of zero at any one bit is P0=(1−1m)k∗nP_0=(1-\frac{1}{m})^{k*n}P0=(1− M1)k∗n, Is the probability that is 1 P1 = 1 – (1-1 m) k ∗ nP_1 = 1 – (1 – \ frac {1} {m}) ^ n} {k * P1 = 1 – (1 – (m1) k ∗ n.
At this point, an element is inserted to obtain K positions. If all k positions are 1, the value is false positive, with the probability P=(1−(1−1m)k∗n)kP=(1-(1-\frac{1}{m})^{k*n})^kP=(1−(1− M1)k∗n)k
Graph with Matlab, k=3,m=1000k=3,m=1000k=3,m=1000, P changes with n trend
It can be seen that under the condition of m=1000, when n=1000, the error rate is as high as 80%+. In this case, even according to the simplest bitmap, it can be 100% correct. What are the advantages of bloom filters?