1. What is a Bloom filter
Bloom Filter was proposed by Bloom in 1970. We can think of it as a data structure consisting of binary vectors (or arrays of bits) and a series of random mapping functions (hash functions). Compared with List, Map, Set and other data structures we usually use, it occupies less and is more efficient. The disadvantage is that the results are probabilistic rather than very accurate. In theory, the more elements you add to the collection, the greater the possibility of false positives, and the data stored in the Bloom filter is not easy to delete.
Each element in a bit array takes up only 1 bit, and each element can only be 0 or 1. So applying a bit array of 100W elements takes up only 1000000Bit / 8 = 125000 Byte = 125000/10214 KB ≈ 122KB
2. Bloom filter application scenario
- Prevent cache penetration: a user queries a data and the cache does not hit the database. This is equivalent to cache penetration, bloom filter (determine whether the requested data is valid to avoid bypassing the cache request database directly), and so on
- To heavy: crawler we crawled the URL tens of millions, how to determine whether we crawled this URL
- Mailbox filtering: Checks whether the mail is spam
3. Introduction to the principle of Bloom filter
Add elements
When an element is added to a Bloom filter:
- Run the same hash again using the Hash function of the Bloom filter to compute a value
- If the value is 1, then the value is in the Bloom filter. If there is a value that is not 1, then the element is not in the Bloom filter
As shown, when a string is to be added to a Bloom filter, the hash(key) value is computed, and the following table of the corresponding bit array is set to 1 (all positions are 0 when the array is initialized)
Access to elements
In the Bloom filter, if a string is in the Bloom filter, you only need to recalculate the hash value and determine whether each element in the digit is 1. If the value is 1, the value is in the Bloom filter. If the value is not 1, the element is not in the Bloom filter
Different strings may hash to the same position, in which case we can increase the size of the bit array or adjust our hash function.
4. Java handwriting Bloom filter
- Implements the initialization of the array to hold data
- Calculate the hash value
- Add to bloom filter array
- Determines whether the element exists
public class BloomFileter {
/** * Initialize the size */
private static final int DEFAULT_SIZE = 2 << 24;
/** * bit array. The elements in an array can only be 0 and 1 */
private BitSet bits = new BitSet(DEFAULT_SIZE);
/** * Computes the hash value *@param key
* @return* /
static final int hash(Object key) {
int h;
return (key == null)?0 : Math.abs((h = key.hashCode()) ^ (h >>> 16));
}
public BloomFileter(a){}/** * add element into array */
public void add(Object value) {
bits.set(hash(value), true);
}
/** * Determines whether the specified element exists in the bitarray */
public boolean contains(Object value) {
returnbits.get(hash(value)); }}Copy the code
test
public static void main(String[] args) {
String value1 = "https://javaguide.cn/";
String value2 = "https://github.com/Snailclimb";
BloomFileter filter = new BloomFileter();
System.out.println(filter.contains(value1));
System.out.println(filter.contains(value2));
filter.add(value1);
filter.add(value2);
System.out.println(filter.contains(value1));
System.out.println(filter.contains(value2));
}
Copy the code
false
false true true
5. Use the Bloom filter that comes with Google’s open source Guava
The implementation of bloom filter in Guava is quite authoritative
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>28.0-jre</version>
</dependency>
Copy the code
public class GuavaBloomFilter {
public static void main(String[] args) {
BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8),100000.0.01);
// Checks whether the specified element exists
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));
// Add the element to the Bloom filter
filter.put(1);
filter.put(2);
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2)); }}Copy the code
6. Bloom filter in Redis
We know that computers are stored in binary as the base unit, one byte equals eight bits. In Redis, Bitmaps provides a set of commands that operate like each bit of the string above
setbit :
getbit :
bitcount
Redisson:
public class RedissonBloomFilter {
public static void main(String[] args) {
Config config = new Config();
config.useSingleServer().setAddress("Redis: / / 192.168.14.104:6379");
config.useSingleServer().setPassword("123");
/ / Redisson construction
RedissonClient redisson = Redisson.create(config);
RBloomFilter<String> bloomFilter = redisson.getBloomFilter("phoneList");
// Initialize the Bloom filter: the expected element is 100000000L, with an error rate of 3%
bloomFilter.tryInit(100000000L.0.03);
// Insert the number 10086 into the Bloom filter
bloomFilter.add("10086");
// Check whether the following numbers are in the Bloom filter
System.out.println(bloomFilter.contains("123456"));//false
System.out.println(bloomFilter.contains("10086"));//true}}Copy the code
“When there is a high wind, life does not give up”