Preface

In everyday work, we often need to determine whether an element belongs to a collection. Suppose you want to add a feature to your browser that warns users when they enter a malicious URL, and you have a data set of about 10 million malicious URLs. How would you do it? My first instinct used to be a hash table: run all the malicious URLs through a hash function and store the results in a hash table (array). This scheme has an obvious flaw: it has to store the original elements themselves, so the memory footprint is large. What we actually care about is whether the URL the user just entered is in our malicious-URL data set; the concrete values of the URLs already stored do not matter. From Wu Jun's book The Beauty of Mathematics I learned that, for exactly this scenario, the big-data field has a well-suited algorithm for testing whether an element already exists in a massive data set. Its key property is that it does not store the elements themselves. That algorithm is the Bloom filter.

How it works

The Bloom filter was proposed by Burton Howard Bloom in 1970, and Wikipedia describes it as follows:

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

A Bloom filter is a data structure used to test whether an element is in a set. It is fast and memory-efficient, and it consists of a very long binary vector (bit array) and a series of random mapping (hash) functions. The price of its efficient inserts and queries is that it is a probabilistic data structure: it can only tell us that an element is definitely not in the set, or that it might be in the set. In other words, a Bloom filter is fast and space-saving, but it has a certain false-positive rate. Its underlying data structure is a bit vector. Suppose we have a bit vector of length 16; the simple example below shows how it works.

Each cell in the bit vector above represents one bit, and the number below it is the index of that position. To add an element to the Bloom filter, we simply hash the input with several hash functions and set the corresponding bits to 1. The following figure shows the process of adding the elements https://www.mghio.cn and https://www.abc.com to the Bloom filter, using the simple hash functions func1 and func2.

When we add an element to the set, we take each hash of the element modulo the length of the bit vector and set the bit at that position to 1 (the 1s in the figure mark the positions set by the newly added elements). To check whether an element is in the set, we apply the same hash functions to it and look at whether the corresponding bits in the bit vector are all 1. If any of them is 0, the element is definitely not in the set. If they are all 1, we only know that the element might be in the set, because those bits could have been set by other elements or by a combination of other elements. That is how Bloom filters work.
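To make that concrete, here is a purely illustrative trace (the positions are made up for this example, not actual outputs of func1 and func2): suppose func1("https://www.mghio.cn") % 16 = 3 and func2("https://www.mghio.cn") % 16 = 9, so adding that URL sets bits 3 and 9. A later query for a URL that hashes to positions 3 and 12 finds bit 12 still 0, so that URL is definitely not in the set; a query for a URL that happens to hash to 3 and 9 returns "possibly present" even though it was never added, which is exactly the false-positive case.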

How to do it yourself

The idea behind a Bloom filter is fairly simple. In the constructor we initialize an int array of the given length. To add an element, we compute its hash values with the hash functions func1 and func2, take each value modulo the array length, and set the corresponding positions to 1. To check whether an element is in the set, we compute the same two hashes and read the values at the corresponding positions; if either position holds 0, the element is considered absent. The following is a simple Java implementation of the Bloom filter from the example above:

public class BloomFilter {

  /** Array length */
  private int size;

  /** Bit array */
  private int[] array;

  public BloomFilter(int size) {
    this.size = size;
    this.array = new int[size];
  }

  /** Add data */
  public void add(String item) {
    int firstIndex = func1(item);
    int secondIndex = func2(item);
    array[firstIndex % size] = 1;
    array[secondIndex % size] = 1;
  }

  /** Check whether item exists in the set */
  public boolean contains(String item) {
    int firstIndex = func1(item);
    int secondIndex = func2(item);
    int firstValue = array[firstIndex % size];
    int secondValue = array[secondIndex % size];
    return firstValue != 0 && secondValue != 0;
  }

  /** Hash algorithm func1 */
  private int func1(String key) {
    int hash = 7;
    hash += 61 * hash + key.hashCode();
    hash ^= hash >> 15;
    hash += hash << 3;
    hash ^= hash >> 7;
    hash += hash << 11;
    return Math.abs(hash);
  }

  /** Hash algorithm func2 */
  private int func2(String key) {
    int hash = 7;
    for (int i = 0, len = key.length(); i < len; i++) {
      hash += key.charAt(i);
      hash += (hash << 7);
      hash ^= (hash >> 17);
      hash += (hash << 5);
      hash ^= (hash >> 13);
    }
    return Math.abs(hash);
  }
}
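Here is a quick usage sketch of the class above (the length 16 matches the earlier bit-vector example, and the third URL is a made-up one that was never added, so its result is usually false but may be true because of a false positive):

public class BloomFilterDemo {
  public static void main(String[] args) {
    BloomFilter filter = new BloomFilter(16);
    filter.add("https://www.mghio.cn");
    filter.add("https://www.abc.com");

    System.out.println(filter.contains("https://www.mghio.cn"));  // true
    System.out.println(filter.contains("https://www.abc.com"));   // true
    // Never added: usually false, occasionally true (a false positive)
    System.out.println(filter.contains("https://www.evil-example.com"));
  }
}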

Although this self-made implementation is simple, it has one problem: the false-positive rate is relatively high, which follows directly from the principle described above. We can reduce the false-positive rate by increasing the array length and the number of hash computations, but CPU and memory consumption will increase accordingly, so we have to weigh the trade-off against our business needs.

Some soul-searching questions

How to design hash functions?

The hash functions in a Bloom filter should ideally be as independent and uniformly distributed as possible, and they also need to be fast (cryptographic hash functions such as SHA-1, although widely used elsewhere, are not a good choice here).
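One common trick for getting several fast, roughly independent hash functions is "double hashing": derive the i-th index as g_i(x) = h1(x) + i * h2(x), as popularized by Kirsch and Mitzenmacher. The sketch below only illustrates that idea and is not the scheme used earlier in this article; in particular, h2 here is just a bit-mixed variant of h1 for brevity, whereas a real implementation would use a second independent hash such as MurmurHash.

public final class DoubleHashing {

  /** Returns k bit positions in [0, m) for the given key. */
  public static int[] positions(String key, int k, int m) {
    int h1 = key.hashCode();
    int h2 = (h1 >>> 16) | (h1 << 16);  // cheap second hash derived from the first
    int[] result = new int[k];
    for (int i = 0; i < k; i++) {
      int combined = h1 + i * h2;       // g_i(x) = h1(x) + i * h2(x)
      result[i] = Math.floorMod(combined, m);
    }
    return result;
  }
}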

How large should bloom filters be designed?

Personally, one of the nice features of Bloom filters is that we can tune the filter's false-positive rate: a larger filter has a lower error rate than a smaller one. Suppose the Bloom filter uses k hash functions, has m bits (the length of the bit array), and holds n inserted elements; then the false-positive rate is approximately (1 - e^(-kn/m))^k. So you only need to estimate n, the size of the data set to be inserted, and then adjust k and m to configure the filter for your application.
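That formula is easy to turn into code; the numbers in main below are just illustrative (10 million elements, 8 bits per element, and 6 hash functions give an error rate of roughly 2%):

public final class BloomMath {

  /** Approximate false-positive rate: (1 - e^(-kn/m))^k. */
  public static double falsePositiveRate(int k, long m, long n) {
    return Math.pow(1 - Math.exp(-(double) k * n / m), k);
  }

  public static void main(String[] args) {
    // 10 million elements, 80 million bits (8 bits per element), 6 hash functions
    System.out.println(falsePositiveRate(6, 80_000_000L, 10_000_000L)); // ~0.02
  }
}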

How many hash functions should I use?

Obviously, the more hash functions a Bloom filter uses, the slower it runs; but with too few hash functions, the false-positive rate goes up. So this is a trade-off, and the number of hash functions has to be fixed when the Bloom filter is created, which means you need to estimate in advance how many elements the set will hold. Even then, you still have to choose the number of bits and the number of hash functions. This may look like a hard optimization problem, but fortunately, for a given m (number of bits) and n (number of elements in the set), the optimal k (number of hash functions) is (m/n) ln 2 (see Wikipedia if you want the derivation). In other words, we can determine the parameters of a Bloom filter with the following steps (a sizing sketch in code follows the list):

  1. Determine the range of n (the number of elements in the set).
  2. Select the value of m (number of bits).
  3. Compute the optimal value of k (number of hash functions).
  4. Calculate the error rate for the chosen n, m, and k; if it is unacceptable, go back to step 2 and choose a larger m.
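A sizing sketch of those steps, using the standard closed-form formulas m = -n * ln(p) / (ln 2)^2 and k = (m/n) ln 2 (the target values in main are example figures, not recommendations):

public final class BloomSizing {

  /** Optimal number of bits for n elements and target false-positive rate p. */
  public static long optimalBits(long n, double p) {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  /** Optimal number of hash functions: (m/n) * ln(2). */
  public static int optimalHashCount(long m, long n) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

  public static void main(String[] args) {
    long n = 10_000_000L;              // step 1: expected number of elements
    double p = 0.01;                   // target false-positive rate
    long m = optimalBits(n, p);        // step 2: choose m
    int k = optimalHashCount(m, n);    // step 3: optimal k
    System.out.println("m = " + m + " bits, k = " + k);
    // step 4: plug n, m, k back into (1 - e^(-kn/m))^k and check the result
  }
}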

Time and space complexity of Bloom filter?

For a Bloom filter with m bits and k hash functions, both insertion and membership testing run in O(k) time: every time you insert an element or query whether an element is in the set, you only need to evaluate the k hash functions on that element and then set or check the corresponding bits. The space used is the m-bit array itself, independent of the size of the stored elements.

Conclusion

Bloom filters have a wide range of practical applications, especially in scenarios where you need to test the presence of an element in a large amount of data. As you can see, the algorithm behind a Bloom filter is relatively simple, but building a production-grade Bloom filter is still quite complex. Google's open-source Guava library provides a Java implementation, BloomFilter, whose usage is very simple; a minimal example is shown below. One final question to think about: does a Bloom filter support deleting elements?
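For reference, a minimal usage sketch of Guava's BloomFilter (the expected insertion count and false-positive rate are example values, and the last URL is a made-up one that was never added):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

public class GuavaBloomFilterDemo {
  public static void main(String[] args) {
    // Expect up to 10 million URLs with a target false-positive rate of 1%
    BloomFilter<String> filter = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    filter.put("https://www.mghio.cn");
    filter.put("https://www.abc.com");

    System.out.println(filter.mightContain("https://www.mghio.cn")); // true
    // false for most URLs that were never added; true only on a false positive
    System.out.println(filter.mightContain("https://www.example.org"));
  }
}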