The headlines are spooky. Clickbait is a meow. Hahahahahaha. Okay, back to business, what if we guarantee de-duplication while we’re doing some crawling data, and I’ll talk to you today about de-duplication using bloom filters.
First of all, what is the Bloom filter? Let’s still look at Baidu.
- Bloom Filter was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. The advantage of this algorithm is that the space efficiency and query time are much better than the general algorithm, but the disadvantage is that there is a certain error recognition rate and deletion difficulty.
Look silly, in fact, I also gave a straight dry meng. The binary vector is actually a bit array (redis, for example, provides a 2^32 bit array of 512MB), and a series of random mapping functions are actually hash functions. So let’s talk about hashing with you
The hash function
A hash function, also known as a hash function, is a function that takes an input of any length, hashes it into a fixed length output, and the output is the hash value. The common hash functions are MD5, SHA1, SHA256, etc. The hash function can also be understood as a means of encryption, but the hash function is irreversible, that is, you can not get the hash value from the output field and reverse derive the value of the input field. (It is irreversible because of the heavy science involved.) So let’s talk about some of the properties of hashing
- The input field is infinite, and the output field is limited (if you put a helloWord and a 10G movie in the input field, the output field will output a fixed length of output. If the input field is only one character different, the resulting hash value will be wildly different.)
- The same input must yield the same output. (If you play the same 10G movie twice with the same hash function, you will get the same hash.)
- This phenomenon is called a hash collision (because the input field is infinite and the output field is limited, it is inevitable that you can input two different values and the output field will get the same hash value).
- Each result of the output domain is evenly distributed over the entire output domain, which is also the discreteness of the hash function (the good of a hash function is generally reflected in its discreteness, the more evenly distributed the output results are, the better discreteness).
Hash functions are very, very useful, for example
Look at this. When you’re done downloading League of Legends, how do you know that you downloaded the full game successfully?
The official will put this file to hash function (md5) encryption, as shown in the above hash value, and then when you download is complete, your downloaded files encrypted with the same hash function, if the resulting value and value provided by the official, then you absolutely is complete to download the file. This type of file verification using hash functions is common.
So you can understand me when I say this (whisper bb….)
Well! Now that we’ve all got that, let’s move on to bloem filters.
Bloom filter
First, let’s talk about the process of bloom filter
- First, the data is hashed through k hash functions to obtain K hash values (H1, H2… Hk)
- Determine the length of a bit array let’s say m. Modulo the k hash values with respect to m yields k values (v1, v2….) Vk) ensure that each value is within the length range of the bit array [0,m-1]
- If you need to save, in the dimensional array, take the k values obtained from the previous step as the subscript values, find the k subscripts corresponding to the value of 1
- If you need to check whether the data exists, check whether all values of the k subscripts are 1. If all values are 1, the data exists; otherwise, the data does not exist
I’m guessing you’re a little confused, but at this point you think of the bit array as a blank sheet of paper
And then, assuming that you haven’t added any data yet, you go through four hash functions and modulo and you get four values that must be within the length of the bit array because you’re modulo, and those four values are somewhere on the white paper, and you color those four places black
So when you add a lot of data, the white paper will have a lot of little black dots.
So how do you know if you’re repeating something? If you’re doing multiple hashes and you’re modulating the values that are black on this piece of paper, that’s when you think it’s repeating something. As long as one position is white, then do not repeat, continue to add, black the position is white, and don’t care about the position is black.
Therefore, to determine the data duplication, we just need to check whether the corresponding subscript positions in the bit array are all 1. If they are all 1, the data can be repeated, and this data can be discarded.
I guess you all get it…… Emmmm, whether you understand or not, LET me go on!
When we use the Bloom filter to de-weight, there must be a miss rate (because of the nature of hash collisions), but it is guaranteed that duplicate data can be identified! (This means that there is a very small probability that your data does not duplicate, I say you duplicate! Don’t let you add data! Stand up straight!!)
In addition, Bloom filter has another characteristic, that is, it only focuses on the amount of data, not considering the size of a single data sample (this is determined by the hashing property 1, the input domain is infinite, the output domain is limited), so Bloom filter is particularly effective for massive data deduplication of large data samples.
At this point, you will find that the error rate is closely related to the length of the bit array, M, and the number of hash functions, K. How to design a Bloom filter at this time? Don’t worry! Scientists have come up with a formula for us, and we just need to calculate what you need based on your sample size and estimated error rate. That’s how exciting it is!
From the number of samples, n, and the error rate, P, you can calculate m.
Calculate the number of hash functions k according to m and n
Through the values of m, K and n, the actual error rate can be calculated
The resulting array length and the number of functions are rounded up. For example, if k is 8.23, then nine hash functions are selected directly, and the error rate will be lower than expected
So the Bloom filter can de-weigh a lot of data at a fraction of the cost, and that’s about it!
With all this bullshit, it’s time for code!
Code (Python)
Bit arrays, I’m going to use the bits provided in Redis. First of all, you need several different hash functions (this is from MIT).
#
# * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
# * *
#* General Purpose Hash Function Algorithms Library *
# * *
#* Author: Arash Partow - 2002 *
#* URL: http://www.partow.net *
#* URL: http://www.partow.net/programming/hashfunctions/index.html *
# * *
#* Copyright notice: *
#* Free use of the General Purpose Hash Function Algorithms Library is *
#* permitted under the guidelines and in accordance with the MIT License. *
#* http://www.opensource.org/licenses/MIT *
# * *
# * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
def rs_hash(key):
a = 378551
b = 63689
hash_value = 0
for i in range(len(key)):
hash_value = hash_value * a + ord(key[i])
a = a * b
return hash_value
def js_hash(key):
hash_value = 1315423911
for i in range(len(key)):
hash_value ^= ((hash_value << 5) + ord(key[i]) + (hash_value >> 2))
return hash_value
def pjw_hash(key):
bits_in_unsigned_int = 4 * 8
three_quarters = (bits_in_unsigned_int * 3) / 4
one_eighth = bits_in_unsigned_int / 8
high_bits = 0xFFFFFFFF << int(bits_in_unsigned_int - one_eighth)
hash_value = 0
test = 0
for i in range(len(key)):
hash_value = (hash_value << int(one_eighth)) + ord(key[i])
test = hash_value & high_bits
iftest ! =0:
hash_value = ((hash_value ^ (test >> int(three_quarters))) & (~high_bits))
return hash_value & 0x7FFFFFFF
def elf_hash(key):
hash_value = 0
for i in range(len(key)):
hash_value = (hash_value << 4) + ord(key[i])
x = hash_value & 0xF0000000
ifx ! =0:
hash_value ^= (x >> 24)
hash_value &= ~x
return hash_value
def bkdr_hash(key):
seed = 131 # 31 131 1313 13131 131313 etc..
hash_value = 0
for i in range(len(key)):
hash_value = (hash_value * seed) + ord(key[i])
return hash_value
def sdbm_hash(key):
hash_value = 0
for i in range(len(key)):
hash_value = ord(key[i]) + (hash_value << 6) + (hash_value << 16) - hash_value;
return hash_value
def djb_hash(key):
hash_value = 5381
for i in range(len(key)):
hash_value = ((hash_value << 5) + hash_value) + ord(key[i])
return hash_value
def dek_hash(key):
hash_value = len(key);
for i in range(len(key)):
hash_value = ((hash_value << 5) ^ (hash_value >> 27)) ^ ord(key[i])
return hash_value
def bp_hash(key):
hash_value = 0
for i in range(len(key)):
hash_value = hash_value << 7 ^ ord(key[i])
return hash_value
def fnv_hash(key):
fnv_prime = 0x811C9DC5
hash_value = 0
for i in range(len(key)):
hash_value *= fnv_prime
hash_value ^= ord(key[i])
return hash_value
def ap_hash(key):
hash_value = 0xAAAAAAAA
for i in range(len(key)):
if (i & 1) = =0:
hash_value ^= ((hash_value << 7) ^ ord(key[i]) * (hash_value >> 3))
else:
hash_value ^= (~((hash_value << 11) + ord(key[i]) ^ (hash_value >> 5)))
return hash_value
Copy the code
Next up is the implementation of Bloom
from BloomFilter.GeneralHashFunctions import *
import redis
import pickle
import base64
class BloomFilterRedis(object):
Hash function list
hash_list = [rs_hash, js_hash, pjw_hash, elf_hash, bkdr_hash, sdbm_hash, djb_hash, dek_hash]
def __init__(self, key, host='127.0.0.1', port=6379):
self.key = key
self.redis = redis.StrictRedis(host=host, port=port, charset='utf-8')
def random_generator(self, hash_value):
Map the hash function value to the interval [0, 2^32-1].
return hash_value % (1 << 32)
def do_filter(self, item, save=True):
Param item: :return: """
flag = True # default exists
for hash in self.hash_list:
# calculate the hash value
hash_value = hash(item)
Get the subscript value of the mapped array
index_value = self.random_generator(hash_value)
# check whether the specified position flag is 0
if self.redis.getbit(self.key, index_value) == 0:
Write if there is nothing to save
if save:
self.redis.setbit(self.key, index_value, 1)
flag = False
return flag
class Stu(object):
def __init__(self, name, age):
self.name = name
self.age = age
if __name__ == '__main__':
bloom = BloomFilterRedis("bloom_obj")
To de-duplicate an object, serialization must be implemented
data = pickle.dumps(Stu("xiaohong".18))
data1 = base64.b64encode(data).decode()
ret = bloom.do_filter(data1)
print(ret)
data = pickle.dumps(Stu("xiaohong".18))
data1 = base64.b64encode(data).decode()
ret = bloom.do_filter(data1)
print(ret)
bloom = BloomFilterRedis("bloom_url")
ret = bloom.do_filter("http://www.baidu.com")
print(ret)
ret = bloom.do_filter("http://www.baidu.com")
print(ret)
Copy the code
Well, the code is also masturbated up, welcome you to put forward your opinion hahaha hahaha, so today is over here, I first leave!!