Bitmap and Bloom filters

Is a given value present in a huge set of integers? The bitmap

In a program, we often need to determine whether a certain number exists in a set. In most cases a simple data structure such as a map or a list is enough; in a high-level language you can call a few wrapped APIs, add a few ifs and elses, and in two or three lines of code watch your "perfect" and "robust" solution run in the console.

However, nothing is perfect. Under high concurrency, every case becomes extreme. Suppose the set is very large; to give it a concrete value, say one hundred million elements. Even ignoring the pointer overhead of a hash map's linked lists, one hundred million ints alone take just over 380 MB (4 bytes × 10^8), and at one billion that grows to roughly 4 GB, before we even talk about performance, no matter how common 128 GB servers have become.
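Those numbers can be sanity-checked in a couple of lines; the class and method names below are just for illustration:

```java
// Back-of-the-envelope comparison: storing N ints directly (4 bytes each)
// versus one bit per possible value in a bitmap.
class MemoryCost {
    static double intArrayMb(long count) {
        return count * 4 / 1024.0 / 1024.0;   // 4 bytes per int, in MB
    }

    static double bitmapMb(long range) {
        return range / 8.0 / 1024.0 / 1024.0; // 1 bit per possible value, in MB
    }
}
```

For 10^8 values this gives roughly 381 MB for raw ints versus about 12 MB for the bitmap discussed below.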

A bitmap uses one bit per possible number: the bit's position represents the number, and the 0 or 1 stored there indicates whether that integer is present. The model looks like this: a "bitmap" covering 0-9 in which the four numbers 1, 2, 3 and 4 are present.

Now calculate the bitmap's memory cost: for a membership test over numbers below 100 million, we only need 100 million bits, roughly 12 MB of memory, to search this massive data set. That is quite an interesting reduction. Here is a Java bitmap implementation:

```java
public class MyBitMap {
    private byte[] bytes;
    private int initSize;

    public MyBitMap(int size) {
        if (size <= 0) {
            return;
        }
        initSize = size / 8 + 1;
        bytes = new byte[initSize];
    }

    public void set(int number) {
        // Which byte the number falls in (number / 8)
        int index = number >> 3;
        // Which bit within that byte (number % 8)
        int position = number & 0x07;
        // OR: if either operand bit is 1, the result bit is 1,
        // so this sets the target bit without disturbing the others
        bytes[index] |= 1 << position;
    }

    public boolean contain(int number) {
        int index = number >> 3;
        int position = number & 0x07;
        return (bytes[index] & (1 << position)) != 0;
    }

    public static void main(String[] args) {
        MyBitMap myBitMap = new MyBitMap(32);
        myBitMap.set(30);
        myBitMap.set(13);
        myBitMap.set(24);
        System.out.println(myBitMap.contain(2)); // false
    }
}
```

Using a simple byte array and bit operations, we strike a near-perfect balance between time and space.

But note the catch: if the values are known to be below 100 million while the set itself holds only on the order of ten elements, a bitmap still consumes about 12 MB; and if the values range up to 1 billion, the cost rises to about 120 MB. A bitmap's space overhead is tied to the range of its values, not to how many of them there are, so it can only make its mark on truly massive data.

Let's talk about the extreme case just mentioned. Suppose there are 10 million elements, but their values are stubbornly spread anywhere below 1 billion. Must we still swallow the 120 MB cost, or is there a way out?

Bloom filter

What if we first hash each value from the sub-billion range down into a sub-100-million range, and then look the hash up in a bitmap? That is essentially what a Bloom filter does: it applies several different hash functions per value to reduce the damage done by hash collisions.

As described in the figure above, using multiple hash functions lowers the collision probability, but whenever a collision does occur there can be a false positive, so we can never be absolutely certain that a value really exists. The charm of the structure lies in the opposite guarantee: it may not be sure that you exist, but it can be certain that you do not. That is exactly why this construction is called a filter.
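A minimal Bloom filter can be sketched on top of the same byte-array bitmap idea. The class name, the seed values and the simple string hash below are illustrative choices, not a recommendation for production use:

```java
// Minimal Bloom filter sketch: k hash functions (here, one string hash with
// k different seeds) each set one bit in a shared bit array.
class SimpleBloomFilter {
    private final byte[] bits;
    private final int bitSize;
    // Different seeds stand in for independent hash functions.
    private final int[] seeds = {7, 31, 131};

    SimpleBloomFilter(int bitSize) {
        this.bitSize = bitSize;
        this.bits = new byte[bitSize / 8 + 1];
    }

    private int hash(String key, int seed) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = h * seed + key.charAt(i);
        }
        return (h & 0x7FFFFFFF) % bitSize; // clamp to a valid bit position
    }

    void add(String key) {
        for (int seed : seeds) {
            int pos = hash(key, seed);
            bits[pos >> 3] |= 1 << (pos & 0x07);
        }
    }

    // false => definitely absent; true => probably present (false positives possible)
    boolean mightContain(String key) {
        for (int seed : seeds) {
            int pos = hash(key, seed);
            if ((bits[pos >> 3] & (1 << (pos & 0x07))) == 0) {
                return false;
            }
        }
        return true;
    }
}
```

Note the asymmetry in `mightContain`: a single zero bit proves absence, while all-ones only suggests presence.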

High concurrency cache design strategy

Why cache?

If you studied computer science, the word "cache" should ring loudly in your ears. In a computer, the cache is a mediator between the CPU and main memory, smoothing over the gap in their processing speeds; in an operating system, the page cache plays the same mediator role between memory and I/O.

The cache as a mediator? It sounds odd, but it is actually quite apt.

That covers most of the algorithm theory; before you doze off, let's jump straight into the second half of the topic: high-concurrency cache design.

Even at the software level, we need to start with the simplest service architecture: the server receives requests and performs CRUD operations against a relational database such as MySQL. Architectures like this ultimately persist to disk, and even with indexes backed by data structures such as B+ trees to optimize queries, efficiency is still stuck on I/O that requires frequent disk seeks.

So we add an in-memory layer to relieve the pressure of slow I/O. Adding a cache is not the problem; using it correctly is.

Cache consistency issues

There are several common caching patterns:

  • Cache aside;
  • Read through;
  • Write through;
  • Write behind caching.
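As a sketch of the first of these, cache aside: read the cache first, fall back to the database on a miss and populate the cache, and on writes update the database and then invalidate the cached entry. The in-memory maps below are stand-ins for a real cache and database:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Cache-aside pattern sketch with in-memory stand-ins for cache and database.
class CacheAsideDemo {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database = new HashMap<>();

    Optional<String> read(String key) {
        String value = cache.get(key);
        if (value != null) {           // cache hit
            return Optional.of(value);
        }
        value = database.get(key);     // cache miss: go to the database
        if (value != null) {
            cache.put(key, value);     // populate the cache for next time
        }
        return Optional.ofNullable(value);
    }

    void write(String key, String value) {
        database.put(key, value);      // update the source of truth first
        cache.remove(key);             // invalidate, don't update, the cache
    }

    void seedDatabase(String key, String value) {
        database.put(key, value);      // test helper: bypass the cache
    }
}
```

Invalidating rather than updating the cache on writes keeps the pattern simple and avoids writing values that may never be read again.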

Cache penetration problem

Cache penetration occurs when a request cannot be served from the cache and falls through to the database; at that point the cache no longer relieves any pressure.

Imagine a scenario in which a malicious user fires heavy traffic at a record that does not exist in the database. Every request penetrates the cache, and the database is bound to be crushed. How to avoid cache penetration is a real problem.

There are two options. The first is to cache null values: when a database query returns nothing, we store a null (empty) value for that key, so subsequent lookups do not hit the database again.
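A sketch of this first option, assuming an in-memory cache with per-entry expiry (all names here are hypothetical): a database miss is stored as a sentinel with a short TTL, so repeated lookups for the missing key skip the database.

```java
import java.util.HashMap;
import java.util.Map;

// Caching "empty" results: a database miss is stored as a sentinel value
// with a short expiry, so hot lookups for missing keys stop at the cache.
class NullCache {
    private static final String NULL_SENTINEL = "__NULL__";
    private static final long NULL_TTL_MILLIS = 60_000; // keep misses briefly

    private static class Entry {
        final String value;
        final long expiresAt;
        Entry(String value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final Map<String, String> database = new HashMap<>();

    String get(String key) {
        Entry e = cache.get(key);
        if (e != null && e.expiresAt > System.currentTimeMillis()) {
            // Hit: may be a cached miss, which we report as null
            return NULL_SENTINEL.equals(e.value) ? null : e.value;
        }
        String value = database.get(key);
        if (value == null) {
            // Cache the miss itself, with a short TTL so real inserts show up soon
            cache.put(key, new Entry(NULL_SENTINEL,
                    System.currentTimeMillis() + NULL_TTL_MILLIS));
            return null;
        }
        cache.put(key, new Entry(value, Long.MAX_VALUE));
        return value;
    }

    void insert(String key, String value) {
        database.put(key, value); // test helper: write straight to the database
    }
}
```

The short TTL on the sentinel is the important design choice: it caps how long a key that later comes into existence will keep appearing absent.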

The second solution is to use a Bloom filter: add a Bloom filter layer between the web server and the cache to record the keys that actually exist, and reject lookups for keys that certainly do not. This, too, solves the cache penetration problem.

Cache avalanche problem

A cache avalanche occurs when many cached entries fail at the same moment, for example because they were all given the same expiry time, leading to a flood of simultaneous cache misses hitting the database.

A distributed lock is one solution: only the request holding the lock may query the database. But when too many requests arrive, large numbers of threads block, and memory becomes overloaded instead.
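Within a single JVM, the same idea can be sketched with per-key locks: concurrent misses for one key block behind a single database load. A real deployment would use a distributed lock (for example one built on Redis) rather than `synchronized`; everything below, including `loadFromDatabase`, is an illustrative stand-in:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-key locking sketch: only one thread rebuilds an expired/missing key,
// so a burst of misses issues a single database query per key.
class LockedLoader {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, Object> locks = new ConcurrentHashMap<>();

    String get(String key) {
        String v = cache.get(key);
        if (v != null) {
            return v;
        }
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {
            v = cache.get(key);            // re-check after acquiring the lock
            if (v == null) {
                v = loadFromDatabase(key); // only the lock holder hits the DB
                cache.put(key, v);
            }
        }
        return v;
    }

    private String loadFromDatabase(String key) {
        return "value-for-" + key; // stand-in for the real query
    }
}
```

The double-check inside the lock is what prevents the waiting threads from each re-querying the database once they acquire it.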

Warming the cache in advance and spreading out expiry times both reduce the probability of a cache avalanche.
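Spreading out expiry times is usually done by adding random jitter to a base TTL; a quick sketch, with arbitrary names and numbers:

```java
import java.util.concurrent.ThreadLocalRandom;

// Add random jitter to a base TTL so keys warmed at the same moment
// do not all expire at the same instant.
class TtlJitter {
    static long expireSeconds(long baseSeconds, long maxJitterSeconds) {
        // e.g. 30 minutes base plus 0-5 minutes of jitter
        return baseSeconds + ThreadLocalRandom.current().nextLong(maxJitterSeconds + 1);
    }
}
```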

Improve cache availability. A single cache node is itself a potential avalanche trigger, and most cache middleware provides a high-availability architecture, such as Redis's master-slave plus Sentinel setup.

Original link: blog.csdn.net/that_is_coo…