Introduction to the
The Bloom filter is a Probabilistic data structure that is efficient for insertion and query, calculating “something must not exist or could exist.”
- If there is, there might be.
- If it doesn’t exist, it doesn’t exist
Compared with traditional List, Set, Map and other data structures, it is more efficient to insert and query and occupies less space. However, its disadvantage is that the returned results may be misjudged. Reasonable length and number of hash functions can improve accuracy.
Bloom filter principle
- Adds elements to the Bloom filter
key
, will use multiplehash
Function ofkey
forhash
, calculate an integer index value, and then modulo the length of the bit data to get a position of 1, eachhash
The function always gets a position - judge
key
Whether there is, along the same lines as above, to proceedhash
Modulo operation, determine whether these positions in the array are all 1, as long as one bit is 0, indicating thiskey
Does not exist. If all of these positions are 1, that doesn’t necessarily mean they exist. - If the bit array is sparse, the probability of getting it right is high; otherwise, the probability is low
Basic usage
127.0.0.1:6379> bf.add days day1 (integer) 1 127.0.0.1:6379> bf.add days day2 (integer) 1 127.0.0.1:6379> Bf.exists days Day1 (integer) 1 127.0.0.1:6379> Bf.exists days day2 (integer) 1 127.0.0.1:6379> Bf.exists days day3 (integer) 0 127.0.0.1:6379> bf.madd days day4 day5 day6 1)(integer) 12)(integer) 1 3)(integer) 1 127.0.0.1:6379> bf.mexists days day4 day5 day6 day7 1)(integer) 1 2)(integer) 1 3)(integer) 1 4)(integer) 0Copy the code
Redis also provides a custom parameter Bloom filter with the following parameters:
- Error_rate: indicates the error rate. A smaller value indicates a larger space. The default value is 0.01
- Initial_size: The expected number of elements. When the number exceeds this value, the error rate increases. Default value: 100
The advantages and disadvantages
The advantages of bloom filters are obvious:
- There is no need to store data, only bits are represented, so there is a huge advantage in space occupancy
- Retrieval efficiency, insert and query time complexity are
O(K)
(K is the number of hash functions) - The hashing functions are independent from each other and can be calculated in parallel at the hardware instruction level, so the efficiency is high.
Disadvantages:
- There are uncertain factors and it is impossible to determine whether an element definitely exists, so it is not suitable for scenes requiring 100% accuracy
- You can only insert and query elements, not delete them.