18. Implementation and application of Bloom filter
Introduction to Bloom filter
The Bloom Filter, proposed by Burton Howard Bloom in 1970, is actually a long binary vector (an array of bits) and a series of random mapping functions (hashes). Bloom filter can be used to retrieve whether an element is in a set. It is a probabilistic data structure with high spatial query rate. The advantage of this algorithm is that the space efficiency and query time are much better than the general algorithm, but the disadvantage is that there is a certain error recognition rate and deletion difficulty.
It is designed to detect the presence of a particular element in a collection. Sounds like a very common requirement here, so why use this data structure?
When is a Bloom filter needed?
Let’s start with a few common examples:
- In word processing software, you need to check whether an English word is spelled correctly;
- Web crawler to the URL to avoid crawling the same URL address;
- Anti-spam, judging whether a mailbox is spam from billions of spam lists;
- Cache penetration, which puts existing caches in bloom and returns quickly when hackers access non-existing caches to avoid cache and DB failures.
These examples have a common feature: How do you determine if an element exists in a set?
General idea:
- An array of
- The list
- Tree, balanced binary tree, Trie(dictionary tree)
- Map (Red black tree)
- Hash table
Although these data structures described above, combined with common sorting and binary search, can quickly and efficiently handle most of the requirements to determine whether elements exist in the set. But what if the set has a large enough number of elements, 500 million records or even 100 million records? This is where the problem with conventional data structures becomes apparent. Data structures such as arrays, linked lists and trees store the contents of elements. Once the amount of data is too large, the memory consumption increases linearly and finally reaches the bottleneck.
Some of you might say, well, isn’t hashing tables very efficient? The query efficiency can reach O(1). But hash tables still consume a lot of memory. The cost of using hash tables to store 100 million spam email addresses? First, the hash function maps an email address to an 8-byte fingerprint of information. Considering that hash table storage efficiency is usually less than 50% (hash conflict); So memory consumption: 8 * 2 * 100 million bytes = 1.6 GIGABytes of memory, which a census computer cannot provide. That’s where the Bloom Filter comes in.
In order to understand the principle of Bloom filters, let’s review the knowledge of hash functions.
The hash function
The concept of a hash function is: a function that converts data of any size into data of a specific size. The converted data is called a hash value or hash code.
For hash table it is not only a hash function to get such an index value, and it will put all the elements you want to deposit the hash table inside, this is a no error data structure and the number of elements, each element has how old, so all of these elements need to be memory space in the hash table to find the corresponding memory size to put in.
So a lot of times when we’re doing industrial applications, we find that we don’t need to store all the elements themselves, we just need to store one piece of information, which is whether this element is in my table or not. In this case if there is no, as long as there is still a query at this time we need a more efficient data structure, can lead to a more efficient data structure as a result, there are many elements to save, but this table we need very little memory space, at the same time, we don’t need to save all the elements themselves, We just have to say whether this thing is there or not. Let’s look at how this data structure is actually designed.
Bloom filter principle
The core implementation of the Bloom Filter is a large bit array and several hash functions. Suppose the length of the bit array is m and the number of hash functions is K.
Specific operation flow: suppose there are 3 elements {x,y,z} in the set, and the number of hash functions is 3.
- We first initialize the bit array, setting each bit to 0.
- For each element in the set, the elements are mapped through three hash functions in turn. Each mapping produces a hash value corresponding to a point in the bit array, and then marks the position of the bit array with bit 1.
- The query
W
The same method maps W through the hash table to the three points on the array. If one of the three points is not 1, then the element must not be in the set. Conversely, if all three points are 1, the element may exist in the set.
Note: It is not possible to judge whether the element exists in the set, and there may be a certain misjudgment rate. As you can see from the figure, suppose an element is mapped to the three points with subscripts 4, 5, and 6. Although these three points are all 1, it is obvious that these three points are the positions of different elements obtained through the hash table. Therefore, this situation indicates that although elements are not in the set, they may also correspond to 1, which is the reason for the existence of misjudgment rate.
A common case
- Bitcoin network
- Distributed systems (Map-reduce) – Hadoop and Search Engine
- Redis cache
- Filtering of spam, comments, etc
Design of Bloom filter
The idea of Bloom filter is relatively simple, but for the design of random mapping function of Bloom filter, need to calculate several times, how much vector length setting is more appropriate, this is the need to seriously discuss.
- If the vector length is too short, the misjudgment rate will skyrocket.
- If the vector is too long, you waste a lot of memory.
- Too many computations take up computing resources and can easily fill up the filter quickly.
Simple implementation of bloom filter
Bloom filter adds elements
- I’m going to add elements to k hash functions
- We get k positions that correspond to the bit array
- Let’s set these k positions to be 1
The Bloom filter queries elements
- The element to be queried is given k hash functions
- We get k positions that correspond to the bit array
- If one of k positions is 0, you’re definitely not in the set
- If all k positions are 1, then it might be in the set
Java implementation
//Java
public class BloomFilter {
private static final int DEFAULT_SIZE = 2 << 24;
private static final int[] seeds = new int[] { 5, 7, 11, 13, 31, 37, 61 };
private BitSet bits = new BitSet(DEFAULT_SIZE);
private SimpleHash[] func = new SimpleHash[seeds.length];
public BloomFilter() {
for (int i = 0; i < seeds.length; i++) {
func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
}
}
public void add(String value) {
for (SimpleHash f : func) {
bits.set(f.hash(value), true);
}
}
public boolean contains(String value) {
if (value == null) {
return false;
}
boolean ret = true;
for (SimpleHash f : func) {
ret = ret && bits.get(f.hash(value));
}
return ret;
}
// Inner class, simpleHash
public static class SimpleHash {
private int cap;
private int seed;
public SimpleHash(int cap, int seed) {
this.cap = cap;
this.seed = seed;
}
public int hash(String value) {
int result = 0;
int len = value.length();
for (int i = 0; i < len; i++) {
result = seed * result + value.charAt(i);
}
return (cap - 1) & result;
}
}
}
Copy the code
Python implementation
# Python
from bitarray import bitarray
import mmh3
class BloomFilter:
def __init__(self, size, hash_num):
self.size = size
self.hash_num = hash_num
self.bit_array = bitarray(size)
self.bit_array.setall(0)
def add(self, s):
for seed in range(self.hash_num):
result = mmh3.hash(s, seed) % self.size
self.bit_array[result] = 1
def lookup(self, s):
for seed in range(self.hash_num):
result = mmh3.hash(s, seed) % self.size
if self.bit_array[result] == 0:
return "Nope"
return "Probably"
bf = BloomFilter(500000, 7)
bf.add("dantezhao")
print (bf.lookup("dantezhao"))
print (bf.lookup("yyj"))
Copy the code
C/C + + implementation
C/C++
#include <iostream>
#include <bitset>
#include <cmath>
using namespace std;
typedef unsigned int uint;
const int DEFAULT_SIZE = 1 << 20;
const int seed[] = { 5, 7, 11, 13, 31, 37, 61 };
class BloomFilter {
public:
BloomFilter() : hash_func_count(3) {}
BloomFilter(int bitsize, int str_count) {
hash_func_count = ceil((bitsize / str_count) * log(2));
}
~BloomFilter() {}
uint RSHash(const char *str, int seed);
void Add(const char *str);
bool LookUp(const char *str);
private:
int hash_func_count;
bitset<DEFAULT_SIZE> bits;
};
uint BloomFilter::RSHash(const char *str, int seed) {
uint base = 63689;
uint hash = 0;
while (*str) {
hash = hash * base + (*str++);
base *= seed;
}
return (hash & 0x7FFFFFFF);
}
void BloomFilter::Add(const char* str) {
int index = 0;
for(int i = 0; i < hash_func_count; ++i) {
index = static_cast<int>(RSHash(str, seed[i])) % DEFAULT_SIZE;
bits[index] = 1;
}
return ;
}
bool BloomFilter::LookUp(const char* str) {
int index = 0;
for(int i = 0; i < hash_func_count; ++i) {
index = static_cast<int>(RSHash(str, seed[i])) % DEFAULT_SIZE;
if(! bits[index])return false;
}
return true;
}
Copy the code
Javascript implementation
// JavaScript
class BloomFilter {
constructor(maxKeys, errorRate) {
this.bitMap = [];
this.maxKeys = maxKeys;
this.errorRate = errorRate;
// Bitmap variable length needs to be calculated according to maxKeys and errorRate
this.bitSize = Math.ceil(maxKeys * (-Math.log(errorRate) / (Math.log(2) * Math.log(2))));
// Hash number
this.hashCount = Math.ceil(Math.log(2) * (this.bitSize / maxKeys));
// Number of added elements
this.keyCount = 0;
}
bitSet(bit) {
let numArr = Math.floor(bit / 31);
let numBit = Math.floor(bit % 31);
this.bitMap[numArr] |= 1 << numBit;
}
bitGet(bit) {
let numArr = Math.floor(bit / 31);
let numBit = Math.floor(bit % 31);
return (this.bitMap[numArr] &= 1 << numBit);
}
add(key) {
if (this.contain(key)) {
return- 1;
}
let hash1 = MurmurHash(key, 0, 0),
hash2 = MurmurHash(key, 0, hash1);
for (let i = 0; i < this.hashCount; i++) {
this.bitSet(Math.abs(Math.floor((hash1 + i * hash2) % this.bitSize)));
}
this.keyCount++;
}
contain(key) {
let hash1 = MurmurHash(key, 0, 0);
let hash2 = MurmurHash(key, 0, hash1);
for (let i = 0; i < this.hashCount; i++) {
if(! this.bitGet(Math.abs(Math.floor((hash1 + i * hash2) % this.bitSize)))) {
return false;
}
}
return true;
}
}
/ * *
* MurmurHash
*
* refer to http://murmurhash.googlepages.com/
*
* data: value to be hashed
* offset:
* seed: set of seeds
*
* /
function MurmurHash(data, offset, seed) {
let len = data.length,
m = 0x5bd1e995,
r = 24,
h = seed ^ len,
len_4 = len >> 2;
for (let i = 0; i < len_4; i++) {
let i_4 = (i << 2) + offset,
k = data[i_4 + 3];
k = k << 8;
k = k | (data[i_4 + 2] & 0xff);
k = k << 8;
k = k | (data[i_4 + 1] & 0xff);
k = k << 8;
k = k | (data[i_4 + 0] & 0xff);
k *= m;
k ^= k >>> r;
k *= m;
h *= m;
h ^= k;
}
// avoid calculating modulo
let len_m = len_4 << 2,
left = len - len_m,
i_m = len_m + offset;
if(left ! = 0) {
if (left >= 3) {
h ^= data[i_m + 2] << 16;
}
if (left >= 2) {
h ^= data[i_m + 1] << 8;
}
if (left >= 1) {
h ^= data[i_m];
}
h *= m;
}
h ^= h >>> 13;
h *= m;
h ^= h >>> 15;
return h;
}
letBloomFilter = new bloomFilter (10000, 0.01);
bloomFilter.add("abcdefgh");
console.log(bloomFilter.contain("abcdefgh"));
console.log(bloomFilter.contain("abcdefghi"));
Copy the code
Attached: other implementations
- Example of a Bloom filter Python implementation
- Example of a high-performance Bloom filter implemented in Python
- Bloom filter Java implementation example 1
- Bloom filter Java implementation example 2
Reference article: Using Bloom filter to solve cache breakdown, spam identification, set weight
Some pictures from the network, copyright to the original author, delete.Copy the code