Author: Lu Wei, senior backend engineer at Zhang Read
Why introduce a Bloom filter
In our business we often run into cache-penetration problems: requests that miss the cache fall through to the database. Caching usually solves this, but if the data has many dimensions and the result sets are large, caching alone is not very effective. So, to solve the penetration problem, we introduced a Bloom filter.
Open source project address: github.com/luw2007/blo…
Let’s start by looking at the general business caching process:
Query the cache first; on a miss, query the database. Even when the data does not exist, a cache entry must still be created, to keep requests from falling through to the database. This requires distinguishing whether the data exists: for nonexistent data, the cache TTL can be set relatively short, so that problems such as master/slave replication lag are not amplified.
The weak point in this process is that when there are many users, we end up caching a lot of empty entries, and when a wave of cold users arrives it triggers an avalanche effect. For this case we produced a second version of the process: filtering cold users with Redis.
We put the users that exist in the database into a Redis set and never expire it. Redis then acts as an index of the database: a single Redis query tells us whether the data exists. If it does not exist in Redis, the result can be returned directly; if it does, we follow the general business caching process above.
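A minimal sketch of this flow, with the Redis set and the cache simulated by in-memory Python structures (all names here are illustrative, not from the project; in production the membership test would be Redis `SISMEMBER`):

```python
# Sketch of the "Redis set as database index" flow described above.
user_index = {"u1", "u2"}      # stands in for the never-expiring Redis set of user ids
cache = {}                     # stands in for the ordinary cache layer
db = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}

def fetch_user(uid):
    if uid not in user_index:  # SISMEMBER in real Redis
        return None            # cold user: return without touching the DB
    if uid in cache:           # normal cache flow for known-hot users
        return cache[uid]
    row = db.get(uid)
    cache[uid] = row           # populate the cache on a miss
    return row
```

Cold users never reach the database, so no empty cache entries accumulate.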
You are probably already asking more questions:

- Redis can act as the cache itself, so why not just return the data from Redis directly?
- If the amount of data is large, won't a single set key run into performance problems?
- For a non-critical business, storing the full data set in Redis occupies a lot of server memory. Isn't the input disproportionate to the output?
Question 1 depends on the business scenario. When the result data is small, we can use Redis as the cache directly and return the data from it. When the result is not suitable for Redis storage, we cannot: for example UGC content, where a single comment may contain tens of thousands of characters and many business fields.
Question 2 is the classic bigkey problem. Redis is easy to misuse, and a bigkey does serious harm: releasing or requesting large blocks of memory during expansion or shrinkage, or returning a huge amount of data through an ill-chosen query command, can affect the stability of Redis. I won't go into the causes and harm here. The solution to bigkey is simple: use a hash function to split the data into buckets, spreading it across multiple keys. This reduces the size of any single key without affecting query efficiency.
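A sketch of this bucket-splitting idea (the bucket count and key format are assumptions for illustration):

```python
# Split one bigkey into BUCKETS smaller keys by hashing the member.
from zlib import crc32

BUCKETS = 16

def bucket_key(prefix, member):
    # The same member always maps to the same bucket,
    # so a membership query still touches exactly one key.
    return "%s:%d" % (prefix, crc32(str(member).encode()) % BUCKETS)
```

For example, `SISMEMBER bucket_key("users", 42) 42` replaces a single giant `SISMEMBER users 42`.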
Question 3 is that Redis storage takes up too much memory, so we need to reduce memory usage. Rethink why Redis was introduced: Redis acts like a set, and all the business needs is to verify whether a request parameter is in that set.
Most programming languages have a filter built in. In Python, for example, the filter function filters a sequence, dropping the elements that do not meet a criterion and returning a list of those that do.
Let’s look at an example:
```
$ python2
Python 2.7.10 (default, Oct  6 2017, 22:29:07)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = {2, 4}
>>> filter(lambda x: x in s, [0, 1, 2])
[2]
```
The set s contains the numbers 2 and 4, and we want to know which of 0, 1, and 2 are in it. `lambda x: x in s` constructs an anonymous function that checks whether its argument x is in the set s. filter then applies that function to each number in the list, and finally returns the list [2].
Redis implements set with two encodings: intset and hashtable. Non-numeric members, or a large number of members, cause it to degrade into a hashtable. Could a clever algorithm save the space a hashtable uses?
In fact, as early as 1970, Burton Howard Bloom proposed the Bloom filter. It is essentially a long binary vector plus a series of random mapping functions, and can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of general algorithms; its disadvantages are a certain false-positive rate and difficulty of deletion.
BloomFilter principle
We commonly concatenate business fields, take their MD5, and put it into a collection. MD5 produces a fixed-length 128-bit digest. If we used a plain bitmap indexed by the digest, we would need

```
2**128 = 340282366920938463463374607431768211456 bits
```
Determining whether a value exists then means checking whether its position in the bitmap is 1. But all the machines in the world together do not have enough storage for such a bitmap, so we can only allocate a limited amount of space for it. For example:
```python
from zlib import crc32

def BloomFilter(sample, size, hash_size=1):
    # Construct a hash function that maps input data to a position in [0, size)
    hash = lambda x: crc32(str(x).encode()) % size
    collision, s = 0, set()
    for i in range(sample):
        k = set()
        for j in range(hash_size):
            # Offsetting the input simulates hash_size different hash functions
            k.add(hash(i + j * size // hash_size))
        # i is counted as a duplicate only if all of its hashes k are already in s
        if not k - s:
            collision += 1
            continue
        # Merge the hash results k into the set s
        s |= k
    return collision
```
When there is only one hash function, collisions happen easily; for example, in one run above, the inputs 1 and 2 both hashed to position 7. What happens if we add a hash function?
Testing with more hash functions and larger data sets shows that adding hash functions effectively reduces the collision probability. However, adding hash functions also lowers space efficiency: once the set occupies about 25% of the total space, the benefit of additional hashes becomes insignificant.
This use of multiple hash functions to reduce collisions is the core idea of the Bloom filter.
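The core idea can be sketched as a tiny standalone Bloom filter. This is illustrative only: the class name, the use of crc32, and the byte-salt trick for deriving multiple hash functions are assumptions of this sketch, not production choices.

```python
from zlib import crc32

class SimpleBloom(object):
    """k hash functions set k bits on add; membership requires all k bits set."""

    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bits = 0  # a Python int used as a bitmap

    def _positions(self, value):
        data = str(value).encode()
        # Salt the input to derive hash_count different hash functions
        return [crc32(bytes([i]) + data) % self.size
                for i in range(self.hash_count)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def __contains__(self, value):
        return all(self.bits >> pos & 1 for pos in self._positions(value))
```

A hit can be a false positive (some other elements may have set the same bits), but a miss is definitive: at least one required bit is still 0.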
Suitable scenarios
- Database protection against cache penetration. Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce disk lookups for nonexistent rows or columns. Avoiding costly disk lookups considerably improves database query performance. This matches the business scenario at the beginning: when the data volume is too large to cache conveniently, requests must be intercepted before they penetrate to the database.
- Cache outage. After a cache outage, a Bloom filter may misjudge. Besides the filter's own false-positive rate, the cache before the outage may not have covered all the data in the DB, so after the outage a request for data never requested before can be misjudged. Still, as a contingency for cache outages, this should be tolerable.
- Web interceptor. Intercept identical requests to prevent attacks: on a user's first request, put the request parameters into the BloomFilter; on a repeated request, check whether the parameters are already matched. This can improve the cache hit ratio.
- Malicious address detection. Chrome checks for malicious URLs: any URL is first checked against a local BloomFilter, and a full check is performed only if the BloomFilter returns a positive result (the user is warned if that check is also positive).
- Bitcoin speed-up. Bitcoin uses BloomFilter to speed up wallet synchronization.
Advantages of the algorithm:

- Small data space: the data itself is not stored.

Disadvantages of the algorithm:

- Elements can be added to the set, but cannot be deleted.
- A match can only guarantee "definitely not in the set"; a positive match does not guarantee the value is actually in the set.
- The false-positive probability increases as the set fills, that is, as it approaches its estimated maximum capacity.
- Data space enlargement: in general, for a 1% false-positive probability, each element needs fewer than 10 bits, regardless of the size or number of elements in the set.
- Queries slow down as the number of hash functions increases, since each match must check multiple bit positions.
Against all the benefits of the BloomFilter, the drawbacks are negligible. After all, roughly k·N bits are enough to store N elements, which is excellent space efficiency.
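The "fewer than 10 bits per element at 1% error" figure follows from the standard Bloom filter sizing formulas; a quick check (the function name is ours):

```python
# Standard sizing formulas for a Bloom filter holding n elements
# with false-positive probability p:
#   m = -n * ln(p) / (ln 2)**2   (total number of bits)
#   k = (m / n) * ln 2           (optimal number of hash functions)
import math

def bloom_params(n, p):
    m = -n * math.log(p) / math.log(2) ** 2
    k = m / n * math.log(2)
    return math.ceil(m), math.ceil(k)

bits, hashes = bloom_params(100000, 0.01)
bits_per_element = bits / 100000.0  # about 9.6 bits, with 7 hash functions
```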
How do I use BloomFilter
BloomFilter requires a large bitmap for storage. Given the current state of the company, the best storage container is Redis. A simple survey starts from the GitHub topic bloom-filter.
Options for integrating BloomFilter with Redis:

- Native Python that calls setbit to construct the BloomFilter
- Lua scripts
- Rebloom - a Bloom filter module for Redis (note: Redis modules were introduced in Redis 4.0)
- pyreBloom, which calls Redis through hiredis

The native Python approach is too slow, and Lua scripts and modules are cumbersome to deploy, so we recommend pyreBloom.
```
pyreBloom:master λ ls
Makefile  bloom.h  bloom.pxd  mutation.c  pyrebloom.pyx
bloom.c   bloom.o  main.c     pyreBloom
```
As the file names suggest, bloom is written in C and pyreBloom in Cython. bloom.h implements the core logic of the BloomFilter: the interaction with the Redis server, the hash functions, and the add, check, and delete operations.
```c
int init_pyrebloom(pyrebloomctxt * ctxt, char * key, uint32_t capacity, double error,
    char * host, uint32_t port, char * password, uint32_t db);
int free_pyrebloom(pyrebloomctxt * ctxt);
int add(pyrebloomctxt * ctxt, const char * data, uint32_t len);
int add_complete(pyrebloomctxt * ctxt, uint32_t count);
int check(pyrebloomctxt * ctxt, const char * data, uint32_t len);
int check_next(pyrebloomctxt * ctxt);
int delete(pyrebloomctxt * ctxt);
```
pyreBloom.pyx:

```cython
import math
import random

cimport bloom


class pyreBloomException(Exception):
    '''Some sort of exception has happened internally'''
    pass


cdef class pyreBloom(object):
    cdef bloom.pyrebloomctxt context
    cdef bytes key

    property bits:
        def __get__(self):
            return self.context.bits

    property hashes:
        def __get__(self):
            return self.context.hashes

    def __cinit__(self, key, capacity, error, host='127.0.0.1', port=6379,
                  password='', db=0):
        self.key = key
        if bloom.init_pyrebloom(&self.context, self.key, capacity,
                                error, host, port, password, db):
            raise pyreBloomException(self.context.ctxt.errstr)

    def __dealloc__(self):
        bloom.free_pyrebloom(&self.context)

    def delete(self):
        bloom.delete(&self.context)

    def put(self, value):
        if getattr(value, '__iter__', False):
            r = [bloom.add(&self.context, v, len(v)) for v in value]
            r = bloom.add_complete(&self.context, len(value))
        else:
            bloom.add(&self.context, value, len(value))
            r = bloom.add_complete(&self.context, 1)
        if r < 0:
            raise pyreBloomException(self.context.ctxt.errstr)
        return r

    def add(self, value):
        return self.put(value)

    def extend(self, values):
        return self.put(values)

    def contains(self, value):
        # If the object is iterable...
        if getattr(value, '__iter__', False):
            r = [bloom.check(&self.context, v, len(v)) for v in value]
            r = [bloom.check_next(&self.context) for i in range(len(value))]
            if min(r) < 0:
                raise pyreBloomException(self.context.ctxt.errstr)
            return [v for v, included in zip(value, r) if included]
        else:
            bloom.check(&self.context, value, len(value))
            r = bloom.check_next(&self.context)
            if r < 0:
                raise pyreBloomException(self.context.ctxt.errstr)
            return bool(r)

    def __contains__(self, value):
        return self.contains(value)

    def keys(self):
        '''Return a list of the keys used in this bloom filter'''
        return [self.context.keys[i] for i in range(self.context.num_keys)]
```
To summarize the interface of `cdef class pyreBloom`:

- `bits`, `hashes`: properties exposing the filter's bit count and hash count
- `delete()`: delete the filter
- `put(value)`, `add(value)`, `extend(values)`: insert a single value or an iterable of values
- `contains(value)`: existence check; returns a list of matched values when `value` is iterable, otherwise a bool
- `keys()`: the list of keys stored in Redis
Because pyreBloom uses the hiredis library and has no reconnection logic of its own, a simple wrapper is indispensable:
```python
# coding=utf-8
'''
Bloom filter wrapper. Methods: extend, keys, contains, add, put, hashes, bits, delete

>>> class TestModel(BaseModel):
...     PREFIX = "bf:test"
>>> t = TestModel()
>>> t.add('hello')
1
>>> t.extend(['hi', 'world'])
2
>>> t.contains('hi')
True
>>> t.delete()
'''
import logging

from six import PY3 as IS_PY3
from pyreBloom import pyreBloom, pyreBloomException

from BloomFilter.utils import force_utf8


class BaseModel(object):
    '''
    PREFIX:   Redis key prefix
    BF_SIZE:  filter capacity
    BF_ERROR: allowed error rate
    RETRIES:  number of connection retries
    host/port/db: Redis server address
    _bf_conn: internally cached pyreBloom instance
    '''
    SLOT = {'add', 'contains', 'extend', 'keys', 'put', 'delete',
            'bits', 'hashes'}
    PREFIX = ""
    BF_SIZE = 100000
    BF_ERROR = 0.01
    RETRIES = 2

    def __init__(self, redis=None):
        '''
        Initialize the Redis configuration.
        :param redis: Redis configuration
        '''
        # Initialize per-instance state here, so that multiple inheriting
        # classes do not share (and pollute) class-level data
        self._bf_conn = None
        self._conf = {
            'host': '127.0.0.1', 'password': '',
            'port': 6379, 'db': 0
        }
        if redis:
            for k, v in redis.items():
                if k in self._conf:
                    self._conf[k] = redis[k]
        self._conf = force_utf8(self._conf)

    @property
    def bf_conn(self):
        # Lazily initialize pyreBloom
        if not self._bf_conn:
            prefix = force_utf8(self.PREFIX)
            logging.debug(
                'pyreBloom connect: redis://%s:%s/%s, (%s %s %s)',
                self._conf['host'], self._conf['port'], self._conf['db'],
                prefix, self.BF_SIZE, self.BF_ERROR,
            )
            self._bf_conn = pyreBloom(
                prefix, self.BF_SIZE, self.BF_ERROR, **self._conf)
        return self._bf_conn

    def __getattr__(self, method):
        '''
        Proxy whitelisted methods to pyreBloom.
        :param method:
        :return: pyreBloom.{method}
        '''
        # Only the enumerated methods are exposed
        if method not in self.SLOT:
            raise NotImplementedError()

        # Catch pyreBloom exceptions and log what is necessary
        def catch_error(*a, **kwargs):
            '''Retry the call several times on failure'''
            args = force_utf8(a)
            kwargs = force_utf8(kwargs)
            for _ in range(self.RETRIES):
                try:
                    func = getattr(self.bf_conn, method)
                    res = func(*args, **kwargs)
                    # pyreBloom returns bytes under Python 3;
                    # handle the return type manually
                    if method == 'contains' and IS_PY3:
                        if isinstance(res, list):
                            return [i.decode('utf8') for i in res]
                    return res
                except pyreBloomException as error:
                    logging.warning(
                        'pyreBloom Error: %s %s', method, str(error))
                    self.reconnect()
                    if _ == self.RETRIES - 1:
                        logging.error('pyreBloom Error')
                        raise error

        return catch_error

    def __contains__(self, item):
        '''
        Support the `in` operator via __contains__.
        :param item: element or list of elements to query
        :type item: list/basestring
        :return: [bool...]/bool
        '''
        return self.contains(item)

    def reconnect(self):
        '''
        The pyreBloom connection uses the C hiredis driver and exposes no
        timeout option; it relies on the driver's built-in timeout:
            struct timeval timeout = { 1, 5000 };
            ctxt->ctxt = redisConnectWithTimeout(host, port, timeout);
        We add multiple retries on top of it to ensure service reliability.
        `del self._bf_conn` invokes the C dealloc built into pyreBloom,
        which closes the Redis connection.
        '''
        if self._bf_conn:
            logging.debug('pyreBloom reconnect')
            del self._bf_conn
            self._bf_conn = None
            _ = self.bf_conn
```
Advanced: Counting Filter
Counting filters provide a way to implement delete on a BloomFilter without recreating the filter. In a counting filter, each array position (bucket) is extended from a single bit to an n-bit counter. In fact, a regular Bloom filter can be seen as a counting filter with a bucket size of one bit.

The insert operation is extended to increment the buckets' values, and a lookup checks that each required bucket is non-zero. The delete operation then decrements each of those buckets.
Arithmetic overflow of the buckets is a problem: buckets should be large enough that overflow is rare. When it does occur, the increment and decrement operations must leave the bucket saturated at the maximum possible value, in order to preserve the BloomFilter's properties.
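The description above can be sketched as a small counting filter (the counter width, hashing scheme, and names are illustrative assumptions):

```python
# Counting Bloom filter sketch: each bucket is a small counter instead of
# a bit, which makes deletion possible.
from zlib import crc32

class CountingBloom(object):
    MAX = 15  # 4-bit counters saturate here to preserve filter properties

    def __init__(self, size, hash_count):
        self.counters = [0] * size
        self.size = size
        self.hash_count = hash_count

    def _positions(self, value):
        data = str(value).encode()
        # Salt the input to derive hash_count different hash functions
        return [crc32(bytes([i]) + data) % self.size
                for i in range(self.hash_count)]

    def add(self, value):
        for pos in self._positions(value):
            if self.counters[pos] < self.MAX:   # saturate instead of overflowing
                self.counters[pos] += 1

    def remove(self, value):
        # Never decrement a saturated counter: its true count is unknown
        for pos in self._positions(value):
            if 0 < self.counters[pos] < self.MAX:
                self.counters[pos] -= 1

    def __contains__(self, value):
        return all(self.counters[pos] > 0 for pos in self._positions(value))
```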
Counters are usually 3 or 4 bits, so a counting Bloom filter uses 3 to 4 times the space of a static one. In contrast, the data structures of Pagh, Pagh and Rao (2005) and Fan et al. (2014) also allow deletion, but use less space than a static BloomFilter.
Another problem with counting filters is limited scalability. Because the counting Bloom filter table cannot be extended, the maximum number of keys stored simultaneously must be known in advance. Once the table's designed capacity is exceeded, the false-positive rate grows rapidly as more keys are inserted.
Bonomi et al. (2006) introduced a data structure based on d-left hashing that is functionally equivalent but uses about half the space of a counting BloomFilter. It has no scalability problem: once the designed capacity is exceeded, the keys can be reinserted into a new hash table of double size.
The space-efficient variant of Putze, Sanders and Singler (2007) can also be used to implement counting filters by supporting inserts and deletes.
Rottenstreich, Kanizo and Keslassy (2012) introduced a general method based on variable increments that significantly reduces the false-positive probability of counting Bloom filters and their variants, while still supporting deletion. Instead of a unit increment, each element insertion increments the hashed counters by a hashed variable increment. To query an element, the exact counter values are considered, not just whether they are positive: if the sum represented by a counter value cannot be composed of the corresponding variable increment for the queried element, a negative answer is returned.
Open source project address: github.com/luw2007/blo…