Why introduce the Bloom filter

We often encounter database-penetration problems in our business, which caching can usually solve. But if the data has many dimensions and the resulting data set is large, caching alone is not very effective.

Therefore, to solve the penetration problem, we introduce the Bloom filter.


Suitable scenarios

  • Database protection against penetration. Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce disk lookups for nonexistent rows or columns. Avoiding costly disk lookups considerably improves the performance of database queries. This matches the business scenario at the beginning: when the data volume is too large to cache conveniently, requests need to be intercepted before they penetrate to the database.
  • Cache outage. If the cache goes down, a Bloom filter may misjudge. Besides the filter's own false-positive rate, the cache before the outage may not have covered all the data in the DB, so after the outage a request for data that was never requested before can be misjudged. Still, using a Bloom filter as a contingency for cache outages is usually tolerable.
  • Web interceptor. Intercept identical requests to prevent attacks: on the first request, put the request parameters into the Bloom filter; on subsequent requests, check whether the parameters match. This can also improve the cache hit ratio.
  • Malicious address detection. Chrome checks for malicious URLs. Any URL is first checked against a local Bloom filter, and a full remote check is performed only if the Bloom filter returns a positive result; the user is warned only if the full check is also positive.
  • Bitcoin speedup. Bitcoin uses Bloom filters to speed up wallet synchronization.

Open source project address: github.com/luw2007/blo…

Let’s start by looking at the general business caching process:

Query the cache first. If the cache misses, query the database. Even if the data does not exist, you still need to create a cache entry, to prevent repeated requests from passing through to the database.

You need to distinguish whether the data exists. If it does not, the cache time can be set relatively short, so that problems such as master/slave synchronization delay are not magnified.
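This first-version flow can be sketched in a few lines of Python. Here an in-memory dict stands in for Redis and another for the database; `CACHE`, `DB`, `MISSING`, and `get_user` are illustrative names, not part of any real client API:

```python
import time

CACHE = {}            # key -> (value, expire_at); stand-in for Redis
DB = {"42": "alice"}  # stand-in for the database

TTL_HIT = 3600        # normal cache time for existing data
TTL_MISS = 60         # short TTL for nonexistent rows, so replication-lag
                      # problems are not magnified

MISSING = object()    # sentinel cached for nonexistent data ("empty cache")

def get_user(uid):
    entry = CACHE.get(uid)
    if entry and entry[1] > time.time():
        value = entry[0]
        return None if value is MISSING else value
    value = DB.get(uid)  # cache miss: go to the database
    if value is None:
        # cache the miss too, so repeated lookups don't hit the DB
        CACHE[uid] = (MISSING, time.time() + TTL_MISS)
        return None
    CACHE[uid] = (value, time.time() + TTL_HIT)
    return value
```

The key design choice is caching the miss itself, with a short TTL, which is exactly the "empty data" that piles up in the next paragraph.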

The weak point in this process: when there are many cold users, we cache a lot of empty data, and a wave of cold users coming in can trigger an avalanche effect.

For this case, we produce a second version of the process: filtering cold users with Redis.

We put the users that exist in the database into a Redis set and set it never to expire. Redis then acts as an index of the database: a single query to Redis tells you whether the data exists.

If it does not exist in Redis, the result can be returned directly. If it does, continue with the general business caching process mentioned above.
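The second-version flow can be sketched as follows. A plain Python `set` stands in for the Redis set; `warm_index` and `lookup` are hypothetical names (a real deployment would use SADD/SISMEMBER against Redis):

```python
USER_INDEX = set()  # stands in for a Redis SET of ids known to exist in the DB
DB = {"1": "alice", "2": "bob"}  # stand-in for the database

def warm_index():
    # at startup (and on every write), record each existing id
    # (SADD in real Redis, with no expiry)
    USER_INDEX.update(DB.keys())

def lookup(uid):
    # SISMEMBER: if the id is not in the index, it cannot be in the DB,
    # so return immediately without touching cache or DB
    if uid not in USER_INDEX:
        return None
    # otherwise fall through to the normal cache/DB path
    return DB.get(uid)
```

Cold users are now rejected by the index instead of filling the cache with empty entries.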

You’re smart enough to come up with more questions:

  1. Redis can serve as a cache itself, so why not just return the data from it?
  2. If the data volume is large, won't a single set have performance problems?
  3. For unimportant business, storing the full data set in Redis occupies a large amount of server memory. Is the cost worth the benefit?

Problem 1 depends on the business scenario. If the result data is small, we can use Redis directly as the cache and return the data. But some results are not suitable for Redis storage, for example UGC content: a single comment may contain tens of thousands of characters and many business fields.

Problem 2: Redis has many pitfalls, and bigkeys are among the most harmful. Memory allocation during capacity expansion or reduction, or large responses returned by improperly used query commands, can affect the stability of Redis. I won't go into the causes and harm here.

The solution to bigkeys is simple: use a hash function to split the data into buckets, spreading it among multiple keys. This reduces the size of a single key without affecting query efficiency.
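A minimal sketch of this bucket-splitting idea, with a dict of Python sets standing in for many small Redis sets (`bucket_key`, `sadd`, and `sismember` are illustrative names, not a real client API):

```python
import hashlib

NUM_BUCKETS = 16  # number of sub-keys; tune so each set stays small
buckets = {}      # key -> set; stands in for many small Redis SETs

def bucket_key(prefix, member):
    # stable hash -> bucket id, so the same member always maps
    # to the same sub-key, and queries stay O(1)
    h = int(hashlib.md5(member.encode("utf8")).hexdigest(), 16)
    return "%s:%d" % (prefix, h % NUM_BUCKETS)

def sadd(prefix, member):
    buckets.setdefault(bucket_key(prefix, member), set()).add(member)

def sismember(prefix, member):
    return member in buckets.get(bucket_key(prefix, member), set())
```

Each lookup still hashes to exactly one key, so splitting one big set into 16 small ones costs nothing at query time.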

Problem 3: Redis storage takes up too much memory, so we need to reduce memory usage. Rethink why Redis was introduced: Redis acts as a set, and the whole business need is to verify that a request parameter is in that set.

The structure is like a two-way mixing valve in a shower: hot water on one side, cold water on the other.

Most programming languages have filters built in. In Python, for example, the filter function filters a sequence, removing the elements that do not meet the criteria and returning a list of those that do.

Let’s look at an example:

```
$ python2
Python 2.7.10 (default, Oct  6 2017, 22:29:07)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = {2, 4}
>>> filter(lambda x: x in s, [0, 1, 2])
[2]
```

The set s contains the numbers 2 and 4, and we want to know which of 0, 1, and 2 are in s. `lambda x: x in s` constructs an anonymous function that determines whether its argument x is in the set s. filter applies the anonymous function to each number in the list, and finally the list [2] is returned.

Redis implements sets with two structures: intset and hash table. Non-numeric values, or a large number of values, degrade it to a hash table. So, can a good algorithm save the space a hash table uses?

In fact, as early as 1970 Burton Howard Bloom proposed the Bloom filter. It is essentially a long binary vector plus a series of random mapping functions, and it can be used to test whether an element is in a set.

Its advantage is that its space efficiency and query time far exceed those of general algorithms; its disadvantages are a certain false-positive rate and the difficulty of deletion.

BloomFilter principle

We commonly concatenate business fields, hash them with MD5, and put the result in a collection. MD5 generates a fixed-length 128-bit string. If we used a bitmap directly, we would need

```
2^128 = 340282366920938463463374607431768211456 bits
```

To determine whether a value is present, we check whether its bit in the bitmap is 1. But no machine in the world has enough storage for a bitmap that size, so we can only allocate a limited amount of space for storage. For example:

With only one hash function, collisions happen easily.

As you can see, both 1 and 2 above hash to 7. What happens if we add another hash function?

Testing with more hash functions and larger data sets yields the table below.

It can be seen that adding hash functions effectively reduces the collision probability. The measured data are as follows:

However, adding hash functions reduces space efficiency. When the set occupies about 25% of the total space, the benefit of adding more hashes becomes insignificant.

Using multiple hash functions to reduce collisions is the core idea of the Bloom filter.
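This core idea can be made concrete with a toy Bloom filter: k salted hashes set k positions in an m-bit array, and a value is "possibly present" only if all k positions are set. This is a teaching sketch (class and method names are my own), not the library code discussed later:

```python
import hashlib

class SimpleBloom(object):
    """Toy Bloom filter: an m-bit array and k hash functions,
    derived from salted MD5 as in the article's MD5 example."""

    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = 0  # a Python int serves as the bit array

    def _positions(self, value):
        # k different hashes of the same value, via k salts
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, value)).encode("utf8")).hexdigest()
            yield int(h, 16) % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos  # set each of the k bits

    def __contains__(self, value):
        # all k bits set -> "possibly in set"; any bit clear -> "definitely not"
        return all(self.bits >> pos & 1 for pos in self._positions(value))
```

Note that `add` never stores the value itself, which is where the space saving comes from, and also why deletion is impossible: clearing a bit might break other values that share it.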

Advantages of the algorithm:

  • The data space is small, and the data itself is not stored.

Disadvantages of the algorithm itself:

  • Elements can be added to the set, but not deleted.
  • A match can only tell you "definitely not in the set" or "possibly in the set"; there is no guarantee that a matched value is really in the set.
  • The probability of false positives increases as the set approaches its estimated maximum capacity.
  • Data space amplification. In general, for a 1% false-positive probability, each element needs fewer than 10 bits, regardless of the size or number of elements in the set. Queries also slow down as the number of hash functions increases, since multiple bit positions must be checked for each lookup.

Against all the benefits of the Bloom filter, the drawbacks are negligible. After all, only about kN bits are needed to store N elements; the space efficiency is excellent.
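The "fewer than 10 bits per element at 1% error" figure follows from the standard sizing formulas m = −n·ln p / (ln 2)² and k = (m/n)·ln 2, where n is the number of elements, p the target false-positive rate, m the bit count, and k the hash count. A small helper (the function name is mine) makes them concrete:

```python
import math

def bloom_params(n, p):
    """Optimal bit count m and hash count k for n elements and
    target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # total bits
    k = round(m / n * math.log(2))                         # hash functions
    return m, k

# e.g. 100,000 elements at a 1% error rate needs ~9.6 bits per element
# and 7 hash functions
m, k = bloom_params(100000, 0.01)
```

These are also the parameters (`capacity`, `error`) that the pyreBloom wrapper below asks for; it derives m and k internally the same way.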

How do I use BloomFilter

A Bloom filter requires a large bitmap for storage. Given the current state of our company, the best storage container is Redis. A quick survey of the GitHub topic bloom-filter turns up several options.

Redis integrated BloomFilter solution:

  • Native Python calling SETBIT to construct the Bloom filter
  • A Lua script
  • ReBloom – Bloom Filter Module for Redis (note: Redis modules were introduced in Redis 4.0)
  • pyreBloom, which calls Redis through hiredis

The native Python approach is too slow, and Lua scripts and modules are cumbersome to deploy, so we recommend pyreBloom, which uses hiredis underneath.

```
pyreBloom:master λ ls
Makefile  bloom.h  bloom.pxd  mutation.c  pyrebloom.pyx
bloom.c   bloom.o  main.c     pyreBloom
```

As the file names suggest, bloom is written in C and pyreBloom is written in Cython.

bloom.h implements the core logic of the Bloom filter: the interaction with the Redis server, the hash functions, and the add, check, and delete implementations.

pyreBloom.pyx

```
import math
import random

cimport bloom


class pyreBloomException(Exception):
    '''Some sort of exception has happened internally'''
    pass


cdef class pyreBloom(object):
    cdef bloom.pyrebloomctxt context
    cdef bytes key

    property bits:
        def __get__(self):
            return self.context.bits

    property hashes:
        def __get__(self):
            return self.context.hashes

    def __cinit__(self, key, capacity, error, host='127.0.0.1', port=6379,
                  password='', db=0):
        self.key = key
        if bloom.init_pyrebloom(&self.context, self.key, capacity,
                                error, host, port, password, db):
            raise pyreBloomException(self.context.ctxt.errstr)

    def __dealloc__(self):
        bloom.free_pyrebloom(&self.context)

    def delete(self):
        bloom.delete(&self.context)

    def put(self, value):
        if getattr(value, '__iter__', False):
            r = [bloom.add(&self.context, v, len(v)) for v in value]
            r = bloom.add_complete(&self.context, len(value))
        else:
            bloom.add(&self.context, value, len(value))
            r = bloom.add_complete(&self.context, 1)
        if r < 0:
            raise pyreBloomException(self.context.ctxt.errstr)
        return r

    def add(self, value):
        return self.put(value)

    def extend(self, values):
        return self.put(values)

    def contains(self, value):
        # If the object is 'iterable'...
        if getattr(value, '__iter__', False):
            r = [bloom.check(&self.context, v, len(v)) for v in value]
            r = [bloom.check_next(&self.context) for i in range(len(value))]
            if (min(r) < 0):
                raise pyreBloomException(self.context.ctxt.errstr)
            return [v for v, included in zip(value, r) if included]
        else:
            bloom.check(&self.context, value, len(value))
            r = bloom.check_next(&self.context)
            if (r < 0):
                raise pyreBloomException(self.context.ctxt.errstr)
            return bool(r)

    def __contains__(self, value):
        return self.contains(value)

    def keys(self):
        '''Return a list of the keys used in this bloom filter'''
        return [self.context.keys[i] for i in range(self.context.num_keys)]
```

Native pyreBloom methods:

```
cdef class pyreBloom(object):
    cdef bloom.pyrebloomctxt context
    cdef bytes key

    property bits:
    property hashes:

    def delete(self):
    def put(self, value):
    def add(self, value):
    def extend(self, values):
    def contains(self, value):   # returns list / bool
    def keys(self):              # list of keys stored in redis
```

Because pyreBloom uses the hiredis library, which has no reconnection logic, we wrap it in a simple encapsulation:

```
# coding=utf-8
'''
Bloom filter wrapper.
Supported methods: extend, keys, contains, add, put, hashes, bits, delete

>>> class TestModel(BaseModel):
...     PREFIX = "bf:test"

>>> t = TestModel()
>>> t.add('hello')
1
>>> t.extend(['hi', 'world'])
2
>>> t.contains('hi')
True
>>> t.delete()
'''
import logging

from six import PY3 as IS_PY3
from pyreBloom import pyreBloom, pyreBloomException

from bloomfilter.utils import force_utf8


class BaseModel(object):
    '''
    BF_SIZE:  estimated capacity
    BF_ERROR: maximum allowed error rate
    RETRIES:  number of connection retries
    host/port/password/db: the Redis server
    _bf_conn: internally cached pyreBloom instance
    '''
    SLOT = {'add', 'contains', 'extend', 'keys', 'put',
            'delete', 'bits', 'hashes'}
    PREFIX = ""
    BF_SIZE = 100000
    BF_ERROR = 0.01
    RETRIES = 2

    def __init__(self, redis=None):
        self._bf_conn = None
        self._conf = {
            'host': '127.0.0.1', 'password': '',
            'port': 6379, 'db': 0
        }
        if redis:
            for k, v in redis.items():
                if k in self._conf:
                    self._conf[k] = v
        self._conf = force_utf8(self._conf)

    @property
    def bf_conn(self):
        if not self._bf_conn:
            prefix = force_utf8(self.PREFIX)
            logging.debug(
                'pyreBloom connect: redis://%s:%s/%s, (%s %s %s)',
                self._conf['host'], self._conf['port'], self._conf['db'],
                prefix, self.BF_SIZE, self.BF_ERROR,
            )
            self._bf_conn = pyreBloom(
                prefix, self.BF_SIZE, self.BF_ERROR, **self._conf)
        return self._bf_conn

    def __getattr__(self, method):
        '''Proxy non-enumerated attributes to pyreBloom.
        :param method: method name; must be in SLOT
        :return: pyreBloom.{method}, wrapped with retries
        '''
        if method not in self.SLOT:
            raise NotImplementedError()

        def catch_error(*a, **kwargs):
            '''Retry on connection errors to improve availability'''
            args = force_utf8(a)
            kwargs = force_utf8(kwargs)
            for attempt in range(self.RETRIES):
                try:
                    func = getattr(self.bf_conn, method)
                    res = func(*args, **kwargs)
                    # pyreBloom returns bytes under Python 3; decode them
                    if method == 'contains' and IS_PY3:
                        if isinstance(res, list):
                            return [i.decode('utf8') for i in res]
                    return res
                except pyreBloomException as error:
                    logging.warn(
                        'pyreBloom Error: %s %s', method, str(error))
                    self.reconnect()
                    if attempt == self.RETRIES - 1:  # retries exhausted
                        logging.error('pyreBloom Error')
                        raise error
        return catch_error

    def __contains__(self, item):
        '''Support the `in` operator.
        :param item: a list of elements or a single element
        :type item: list / basestring
        :return: [bool...] / bool
        '''
        return self.contains(item)

    def reconnect(self):
        '''
        The pyreBloom connection uses the C driver and exposes no timeout
        option; it relies on the driver's built-in timeout plus the
        retries above to keep the service reliable:

            struct timeval timeout = { 1, 5000 };
            ctxt->ctxt = redisConnectWithTimeout(host, port, timeout);

        `del self._bf_conn` invokes pyreBloom's built-in C dealloc,
        which closes the Redis connection.
        '''
        logging.debug('pyreBloom reconnect')
        del self._bf_conn
        self._bf_conn = None
        _ = self.bf_conn
```

Advanced: Counting Filter

This provides a way to implement deletion on a Bloom filter without recreating the filter. In a counting filter, each array position (bucket) expands from a single bit to an n-bit counter. In fact, a regular Bloom filter can be seen as a counting filter with a bucket size of one bit.

The insert operation is extended to increment the buckets' values, and a lookup checks that each required bucket is non-zero. The delete operation then decrements each bucket's value.

Arithmetic overflow of buckets is a problem: buckets should be large enough to make overflow rare. If it does occur, the increment and decrement operations must leave the bucket at its maximum possible value in order to preserve the Bloom filter's properties.
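A toy counting filter showing increment, decrement, and the saturation rule above (counters stick at their maximum instead of overflowing). Class and method names are my own illustration, not a real library:

```python
import hashlib

class CountingBloom(object):
    """Toy counting Bloom filter: each bucket is a small counter
    instead of a single bit, so deletion becomes possible."""

    MAX = 15  # 4-bit counters; saturate rather than overflow

    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, value):
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, value)).encode("utf8")).hexdigest()
            yield int(h, 16) % self.m

    def add(self, value):
        for pos in self._positions(value):
            if self.counters[pos] < self.MAX:  # saturated counters stay put
                self.counters[pos] += 1

    def remove(self, value):
        for pos in self._positions(value):
            # never decrement a saturated counter: its true count is unknown
            if 0 < self.counters[pos] < self.MAX:
                self.counters[pos] -= 1

    def __contains__(self, value):
        return all(self.counters[pos] > 0 for pos in self._positions(value))
```

Setting `MAX` and refusing to touch saturated counters is exactly the overflow rule described above: once a bucket pins at its maximum, it is frozen so no false negative can be introduced.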

Counters are usually 3 or 4 bits in size, so a counting Bloom filter uses 3 to 4 times the space of a static one. By contrast, the data structures of Pagh, Pagh and Rao (2005) and Fan et al. (2014) also allow deletion but use less space than a static Bloom filter.

Another problem with counting filters is limited scalability. Because the counting Bloom filter table cannot be extended, the maximum number of keys to be stored simultaneously must be known in advance. Once the table's designed capacity is exceeded, the false-positive rate grows rapidly as more keys are inserted.

Bonomi et al. (2006) introduced a data structure based on d-left hashing that is functionally equivalent but uses about half the space of a counting Bloom filter. It has no scalability issue: once the designed capacity is exceeded, the keys can be reinserted into a new hash table of double the size.

The space-saving variant of Putze, Sanders and Singler (2007) can also be used to implement counting filters, since it supports inserts and deletes.

Rottenstreich, Kanizo and Keslassy (2012) introduced a general method based on variable increments that significantly reduces the false-positive probability of counting Bloom filters and their variants, while still supporting deletion.

Unlike a counting Bloom filter, which increments by one on each element insertion, this scheme increments a counter by a hashed variable increment. To query an element, the exact counter values are considered, not just whether they are positive: if the sum represented by a counter value cannot be composed from the query element's variable increment, a negative answer can be returned.