Bloem filters, this one gives you the lowdown

What problems can bloom filter solve?

For example: there are 5 billion phone numbers, but if you are given 100,000 phone numbers, how can you quickly and accurately determine whether those numbers exist?

Plan A: Database? —-> 5 billion phone numbers, query efficiency is very slow

Plan B: Memory? – > Just press 1 phone number 8 bytes, 5 billion x 8 bytes = 40 GB memory

There are many similar problems, such as:

  • Spam filtering
  • Word processing software (such as Word) error word detection
  • Web crawler duplicate URL detection
  • Determines whether an element exists in the billion level data
  • Hbase line filter
  • .

These questions boil down to one thing: How do you find out if a value exists in a mass of data?

Turn on the Bloom filter

Mode 1: Compilation mode

1. Download and install the Bloom filter

Git clone https://github.com/RedisLabsModules/redisbloom.git CD redisbloom make # compiler redisbloomCopy the code

After normal compilation, a redisbloom. So file is generated in the root directory.

2. Start the Redis server

> ./src/redis-server redis.conf --loadmodule ./src/modules/RedisBloom-master/redisbloom.so
Copy the code

Mode 2: Docker mode

Use of bloom filters

Common commands:

(1) bf.add: add elements

(2) bf.exists: determines whether an element exists

(3) bf.madd: Add multiple elements

(4) bf. Mexists: Judge whether multiple elements exist

(5) BF. Reserve: Set the accuracy of Bloom filter

  • This command must be executed at the beginning of the element, or an error will be reported. It takes three arguments:Key, error_rate, and initial_size. Among them:
    • error_rate: Allows the error rate of the Bloom filter. The lower the value, the more space the filter takes up, so this determines the size of the bit array, which is used to store results. The larger the bit array takes up (the more information it stores), the lower the error rate.
    • initial_size: The size of the element stored by the Bloom filter. If the actual value stored is greater than this value, the accuracy is reduced. The default value is 100.

The code field

Jedis + LUA script

import redis.clients.jedis.Jedis;
import utils.JedisUtils;

import java.util.Arrays;

public class BloomExample {
    private static final String _KEY = "user";

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (int i = 1; i < 10001; i++) {
            bfAdd(jedis, _KEY, "user_" + i);
            boolean exists = bfExists(jedis, _KEY, "user_" + i);
            if(! exists) { System.out.println("No data found I =" + i);
                break;
            }
        }
        System.out.println("Execution completed");
    }
    /** * add element *@paramJedis Redis client *@param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.add', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }
    /** * query if the element exists *@paramJedis Redis client *@param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false; }}Copy the code

useRedissonThe framework

Github.com/redisson/re…

RBloomFilter<SomeObject> bloomFilter = redisson.getBloomFilter("sample");
// initialize bloom filter with 
// expectedInsertions = 55000000
/ / falseProbability = 0.03
bloomFilter.tryInit(55000000L.0.03);

bloomFilter.add(new SomeObject("field1Value"."field2Value"));
bloomFilter.add(new SomeObject("field5Value"."field8Value"));

bloomFilter.contains(new SomeObject("field1Value"."field8Value"));
bloomFilter.count();
Copy the code

For a Bloom filter, it says there are no values, and it says there may be no values.

For example, we can determine whether an IP address is in the blacklist. If the IP address is in the blacklist, it must be able to determine that the IP address is in the blacklist. If the IP address is not in the blacklist, it may also determine that the IP address is in the blacklist because of hash conflict. But the probability is very small.

The principle of

The underlying Redis is a Bitmap.

Redis bloom filter implementation, rely on it in the data structure a bit array, each store keys, not directly to store data in the data structure, because it takes up too much space, it is to use several different unbiased hash function, and the hash value of the elements of the uniform store in the array, that is to say, Each addition is computed by several unbiased hash functions, which are set to 1 to complete the addition.

When element judgment is performed, the query is made to see if the value of several hash positions of the element is 1. If all of them are 1, the value exists, and if one of them is 0, the value does not exist. Because the location is hashed, even if the location is 1, there is no certainty that the element identified it as 1, so the value may not exist when the bloom filter queries that the value exists, but it certainly does not exist when the value does not exist.

And when the bit array stores sparse values, the accuracy of the query will be higher, while when the bit array stores more and more values, the error will also increase.

Advantages and disadvantages of bloom filter

advantages

Bloom filters have huge advantages in both space and time over other data structures. The storage space and insert/query time of the Bloom filter are constant O(K)O(K)O(K) O(K). In addition, the hash functions are unrelated to each other, which is convenient for parallel implementation by hardware.

Bloom filters do not need to store the elements themselves, which can be advantageous in situations where confidentiality is very important.

A Bloom filter can represent the complete set, not any other data structure.

disadvantages

But the disadvantages of bloom filters are as obvious as their advantages. Miscalculation is one of them. As the number of stored elements increases, the miscalculation rate increases. But if the number of elements is too small, a hash table is sufficient.

In addition, you generally cannot remove elements from a Bloom filter. It’s easy to think of turning an array of bits into an array of integers, incrementing the counter for each element that is inserted, and then subtracting the counter when an element is deleted. However, ensuring that elements are safely removed is not so simple. First we must make sure that the deleted element is actually inside the Bloom filter. That’s not guaranteed by the filter alone. Counter winding can also cause problems.

Much work has been done to reduce the error rate, leading to the emergence of many variants of bloom filters.

reference

Redis Core Principles and Combat

Redis advanced – Bloom filter

Redis (13) —— Redis Bloom filter

Mass data weighting — Bloom Filter vs. Bitmap