Preface
Today is the traditional Chinese "Monkey Night Festival", jokingly the festival on which programmers work on code all night. AI Kaola's technical partners are committed to breaking with tradition: under the slogan "We don't work overtime" and the guiding principle "We leave work early", we share some technical knowledge with you here. We wish you all a happy holiday; leave work early and enjoy the real Lantern Festival!
Requirements
In financial business systems, checking whether a user is on a blacklist is a common requirement.
Suppose our system has a million blacklisted users, and a person identified by a cell phone number applies for a loan. How do we determine whether that number is on the blacklist?
A general method
The most direct method is to query the database. With an index, a database lookup can reach O(log n) or theoretically O(1) time complexity, but it is still a disk operation, which is not in the same order of magnitude as a memory operation.
Therefore, we can cache the blacklisted phone numbers in memory, stored in an array. This has two problems: first, lookup takes O(n) time, which is slow; second, it occupies a lot of memory.
Lookup can be further sped up by replacing the array with a set, whose internal implementation can be a balanced binary tree or a hash table. Insertion and lookup then take O(log n) or theoretically O(1) time, but at a cost in space: a set occupies even more memory than the array.
Let's look at code comparing the two methods:
import random
import sys

def generate_random_phone():
    """Generate a random 11-digit phone number string."""
    phone = ''
    for j in range(0, 11):
        phone += str(random.randint(0, 9))
    return phone

# 100,000 blacklisted users
black_list = []
for i in range(0, 100000):
    black_list.append(generate_random_phone())

# convert to set
black_set = set(black_list)
print(len(black_list), len(black_set))
Take a look at the space footprint of the two data structures:
print("size of black_list: %f M" % (sys.getsizeof(black_list) / 1024 / 1024))
print("size of black_set: %f M" % (sys.getsizeof(black_set) / 1024 / 1024))
def brute_force_find():
    """Linear search in the list for a random element that may or may not exist, O(n)."""
    if random.randint(0, 10) % 2:
        target = black_list[random.randint(0, len(black_list) - 1)]
        return __brute_force_find(target)
    else:
        return __brute_force_find(generate_random_phone())

def __brute_force_find(target):
    for i in range(0, len(black_list)):
        if target == black_list[i]:
            return True
    return False

def set_find():
    """Set lookup for a random element that may or may not exist, O(1) on average."""
    if random.randint(0, 10) % 2:
        target = black_list[random.randint(0, len(black_list) - 1)]
        return __set_find(target)
    else:
        return __set_find(generate_random_phone())

def __set_find(target):
    return target in black_set

print(brute_force_find())
print(set_find())
As you can see, the array and the set have equal lengths, confirming that the elements are unique. The list occupies 0.78 MB while the set occupies 4 MB, about 5 times as much, mainly because the hash table's data structure needs extra slots and pointers to chain colliding elements. And that is only 100,000 phone numbers; with 100 million, the set would occupy about 3.9 GB.
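Note that sys.getsizeof is shallow: for a list it counts only the list header and its element pointers, not the string objects themselves. A quick sketch of the difference (the 100,000-element list of 11-digit strings here is an assumption mirroring the example above):

```python
import sys

# sys.getsizeof on a list reports only the list object itself (header plus
# one pointer per element), not the string objects it points to. A rough
# "deep" size adds the per-string footprint as well.
phones = [str(10_000_000_000 + i) for i in range(100_000)]  # 100k 11-digit strings

shallow = sys.getsizeof(phones)
deep = shallow + sum(sys.getsizeof(p) for p in phones)

print("shallow: %.2f MB" % (shallow / 1024 / 1024))
print("deep:    %.2f MB" % (deep / 1024 / 1024))
```

So the real memory cost of both the list and the set is noticeably higher than the shallow figures printed above; the relative comparison between the two containers still holds.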
Let’s take a look at performance testing:
import timeit
print(timeit.repeat('brute_force_find()', number=100, setup="from __main__ import brute_force_find"))
print(timeit.repeat('set_find()', number=100, setup="from __main__ import set_find"))
[0.0016423738561570644, 0.0013590981252491474, 0.0014535998925566673]
As you can see, the direct linear search takes about 0.85 s, while the set lookup takes only about 0.0016 s, a qualitative improvement in speed, but at too great a cost in space!
Is there a data structure that makes set membership lookups efficient in both time and space?
The answer is the Bloom filter, though it can misjudge: when a lookup reports that a phone number is on the blacklist, there is a small probability that it actually is not. Back in our business context, declining a loan to a borrower who has, say, a 0.001% chance of being wrongly flagged is acceptable; as the risk-control saying goes, better to wrongly reject a hundred than to wrongly approve one. This suggests the Bloom filter is a good fit for the problem.
Bloom filter principle
The principle is simple: maintain a very large bitmap of length m and choose k hash functions.
Initially, every bit in the bitmap is set to 0. For each phone number in the blacklist, compute k index values with the k hash functions and set those k positions in the bitmap to 1. To query an element, compute the same k index values: if the bits at all k positions are 1, the element may exist; if any one of them is 0, the element definitely does not exist.
A query can only say "may exist" because hash functions can collide: the k indices of an element that was never inserted may all happen to have been set by other elements, and that is when a false positive occurs. The false-positive rate can therefore be reduced by increasing the bitmap length and tuning the number of hash functions.
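The false-positive rate has a standard approximation: with n inserted elements, m bits, and k hash functions, p ≈ (1 - e^(-kn/m))^k. A small sketch, plugging in the parameters used in the code below (100,000 numbers, a bitmap 20 times as long, 10 hash functions):

```python
import math

def false_positive_rate(n, m, k):
    """Theoretical Bloom filter false-positive rate:
    p ≈ (1 - e^(-k * n / m)) ** k
    for n inserted elements, m bits, and k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

# Parameters matching the example below: 100k entries, 20x bitmap, 10 hashes
n, m, k = 100_000, 20 * 100_000, 10
print("predicted error rate: %.6f" % false_positive_rate(n, m, k))
```

The predicted rate comes out to roughly 0.0001, which matches the error rate measured empirically below.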
Take a look at the code:
from bitarray import bitarray
import mmh3
class BloomFilter:
    def __init__(self, arr):
        # Bitmap length set, for now, to 20 times the size of the blacklist
        self.SIZE = 20 * len(arr)
        self.bit_array = bitarray(self.SIZE)
        self.bit_array.setall(0)
        for item in arr:
            for pos in self.get_positions(item):
                self.bit_array[pos] = 1

    def get_positions(self, val):
        # Use 10 hash functions (murmurhash with seeds 40-49) to compute index values
        return [mmh3.hash(val, i) % self.SIZE for i in range(40, 50)]

    def find(self, val):
        for pos in self.get_positions(val):
            if self.bit_array[pos] == 0:
                return False
        return True

bloomFilter = BloomFilter(black_list)
print("size of bloomFilter's bit_array: %f M" % (sys.getsizeof(bloomFilter.bit_array) / 1024 / 1024))

def get_error_rate():
    # Test the error rate of the Bloom filter with 10,000 random phone numbers
    size = 10000
    error_count = 0
    for i in range(0, size):
        phone = generate_random_phone()
        bloom_filter_result = bloomFilter.find(phone)
        set_result = __set_find(phone)
        if bloom_filter_result != set_result:
            error_count += 1
    return error_count / size

print(get_error_rate())
size of bloomFilter's bit_array: 0.000092 M
0.0001
As you can see, although the bitmap is 20 times longer than the number of entries, it takes very little space: eight bitmap cells cost only 1 byte, while a single 11-digit string in the original list costs at least 11 bytes. (The printed figure is so tiny because sys.getsizeof here may count only the bitarray object itself, not its underlying bit buffer; the 2 million bits themselves occupy about 0.24 MB.)
The measured error rate is about 0.0001. You can try different bitmap lengths; at 30 times, for example, the measured error rate drops to essentially 0.
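Conversely, you can start from a target error rate and derive the bitmap length and hash count. A sketch using the standard sizing formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2 (the helper name is mine, not from this article):

```python
import math

def optimal_params(n, p):
    """Given n expected elements and a target false-positive rate p,
    return (m, k) from the standard Bloom filter sizing formulas:
      m = -n * ln(p) / (ln 2)^2   bits
      k = (m / n) * ln 2          hash functions
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

m, k = optimal_params(100_000, 0.0001)
print("bits: %d (%.1f per element), hash functions: %d" % (m, m / 100_000, k))
```

For 100,000 entries and a 0.0001 target, this suggests roughly 19 bits per element and 13 hash functions, close to the 20x bitmap and 10 hashes used above.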
Finally, let’s look at the performance test of the three algorithms:
def bloom_filter_find():
    if random.randint(0, 10) % 2:
        target = black_list[random.randint(0, len(black_list) - 1)]
        return bloomFilter.find(target)
    else:
        return bloomFilter.find(generate_random_phone())
print(timeit.repeat('brute_force_find()', number=100, setup="from __main__ import brute_force_find"))
print(timeit.repeat('set_find()', number=100, setup="from __main__ import set_find"))
print(timeit.repeat('bloom_filter_find()', number=100, setup="from __main__ import bloom_filter_find"))
[0.70748823415488, 0.7686979519203305, 0.7785645266994834]
[0.001686999574303627, 0.002007704693824053, 0.0013333242386579514]
[0.001962156966328621, 0.0018132571130990982, 0.0023592300713062286]
As you can see, the Bloom filter's lookup speed is close to the set's, sometimes even faster. When a very low false-positive rate is acceptable, the Bloom filter saves both time and space and is the best choice here.
Copyright belongs to the author of this article. Please do not reprint without authorization. Thank you.