The bitmap

In the process of business development, when an object needs to be marked or a certain state of the object needs to be recorded, only two operations need to be inserted and read, and data structure records such as hash table and red-black tree are commonly used.

When the amount of recorded data is very large, the memory consumption will be very large, and the storage efficiency and query efficiency will be very low. If the hash table is used to record whether 100 million users have accessed the user id today, and the int user ID is used to store the user ID, 381MB memory space (4bytes x 100000000/1024/1024) is required. If the linked list Pointers are included, the memory consumption may be more than 500M.

Using 4 bytes to store each state value wastes a lot of memory. Whether the user has accessed it can be indicated by using at least one bit. A bit is used to represent a user. 100 million users need at least 100 million bits, which is 11MB, greatly reducing memory usage.

Although the insertion and read time complexity of Hash table and bitmap is O(1), the bitmap does not require expansion, Hash calculation, uses bit operation, and continuous memory is conducive to CPU cache. Theoretically, the time efficiency is higher than Hash table.

Bitmaps also have the fatal disadvantage of being less spatial efficient than hash tables when data density is very low. In the example above, 100 million users have ids ranging from 1 billion to 10 billion and use 1.16GB of space.

Advantages:
  1. Save memory space
  2. Insert and query time complexity O(1)
Disadvantages:
  1. The data must be unique

  2. Only positive integer data can be processed

  3. It has high requirements on data, and high spatial efficiency can be achieved only when data density is high

Bloom Filter

Bloom filter solves the problem of bitmap inefficiency when storing low data density.

In the example above, 100 million users are distributed between 100 and 10 billion, and we use 1 billion bits to store 100 million users. If a Hash algorithm is used to calculate the distribution of users between 100 and 1 million, Hash conflicts are highly likely to occur. However, if multiple different Hash algorithms are used, the probability that different users will end up with different results is almost negligible.

Bloem filters store elements in bitmaps through multiple Hash algorithms. The same Hash algorithm is used for query. If all values are true, the elements exist. In this way, bitmap storage efficiency can be greatly improved.

The Bloom filter also has a fatal flaw, which is the misjudgment rate, also known as the false positive rate. When the amount of data increases, the number of non-true positions in the bitmap decreases, and uninserted data is likely to occur, and the query result is true. So the bloom filter is False is always False. True is maybe True. Due to the existence of misjudgment rate, Bloom filter is mostly used for blacklist verification.

The misjudgment rate can be predicted by the number of Hash functions and the expected number of inserted data, as shown in The BloomFilter implemented by Guava. If too much data is inserted, the expected error rate is exceeded. You can choose to create a new bitmap to insert new data into the new bitmap, but you may need to query one more bitmap.

Advantages:
  1. Improves bitmap storage efficiency
  2. Compared with Hash table, there is no linked list traversal, and the main time is Hash calculation, which is theoretically more time efficient
Disadvantages:
  1. Presence of misjudgment rate
  2. Unable to delete data

The specific implementation

  1. JAVA BitSet
  2. Guava BloomFilter
  3. Redis BitMap

application

  1. Blacklist check
  2. Bitmap sort (bucket sort)
  3. Quick go to heavy
  4. ConsistencyCheck
  5. Crawler URL verification
  6. Resolve cache penetration issues

Matters needing attention

  1. The efficient space utilization of bitmaps can only be realized in the case of data density. Recently encountered scenario: popup ads when users enter for the first time every day, user ids ranging from 1 billion to 2 billion, stored with Integer. Assume that the number of daily access users is N, and the storage consumes 32N bits. A billion bits are needed for bitmap storage. So when there are 31.25 million visitors per day, bitmap storage is more efficient than list storage.

  2. Bitmap can achieve accurate calibration, bloom filter has misjudgment.