User data is generally stored in a database, and database data lives on disk. Disk reads and writes are among the slowest operations in a computer.

When user requests all hit the database directly, a spike in traffic can easily bring the database down. To shield the database from direct user access, we put Redis in front of it as a cache layer. Redis is an in-memory database, so caching database data in Redis effectively keeps it in memory, and memory reads and writes are orders of magnitude faster than disk, which greatly improves system performance.

With the introduction of a cache layer come three classic cache problems: cache avalanche, cache breakdown, and cache penetration.

These three problems are also common interview questions, and we need to understand not only how they happen, but also how to solve them.

No more talking, let’s go!


Cache avalanche

To keep the data in the cache consistent with the data in the database, we usually set an expiration time on data in Redis. When cached data has expired and a user requests it, the business system has to rebuild the cache: it reads the database and writes the data back to Redis.

When a large amount of cached data expires at the same time, or Redis itself crashes, a flood of incoming requests can no longer be served from Redis, so every request goes straight to the database. The database pressure spikes, which in severe cases brings the database down and sets off a chain reaction that crashes the whole system. That is a cache avalanche.

As you can see, cache avalanches occur for two reasons:

  • Large amounts of data expire at the same time;
  • Redis is down.

Different triggers call for different coping strategies.

Large amounts of data expire at the same time

Common responses to the cache avalanche caused by a large amount of data expiring at the same time are as follows:

  • Set expiration times evenly;
  • Mutex lock;
  • Dual-key strategy;
  • Background cache updates.

1. Set expiration times evenly

If you set expiration times for cached data, avoid giving a large batch of keys the same expiration. We can add a random offset to each key's base expiration time, which ensures the data does not all expire at the same moment.
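A minimal sketch of the idea, assuming an arbitrary base TTL of one hour and up to five minutes of jitter (both numbers are illustrative, not from the original text):

```python
import random

def ttl_with_jitter(base_seconds=3600, jitter_seconds=300):
    """Return the base TTL plus a random offset so keys set together
    do not all expire at the same instant."""
    return base_seconds + random.randint(0, jitter_seconds)

# With redis-py this would be used as:
#   redis_client.set("user:42", payload, ex=ttl_with_jitter())
```

Each key's expiration is spread across a five-minute window, so even keys written in the same batch expire at different times.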

2. Mutex lock

When a business thread is processing a user request and finds that the accessed data is not in Redis, it adds a mutex lock to ensure that only one request at a time builds the cache (reading data from the database and updating the data to Redis), and releases the lock when the cache is completed. Requests that fail to acquire the mutex will either wait for the lock to be released and re-read the cache, or return null or default values.

When implementing the mutex, always set a timeout on the lock. Otherwise, if the first request acquires the lock and then blocks or crashes without releasing it, no other request can ever acquire the lock, and the whole system becomes unresponsive.
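The lock-with-timeout pattern can be sketched in pure Python. `FakeRedis` below is a hypothetical in-memory stand-in that mimics Redis's SET NX EX semantics (in production you would use `redis_client.set(key, token, nx=True, ex=10)` against a real Redis); the unique token ensures only the lock's owner releases it:

```python
import time
import uuid

class FakeRedis:
    """Minimal in-memory stand-in for Redis supporting SET NX EX semantics."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ex):
        now = time.monotonic()
        current = self._data.get(key)
        if current and current[1] > now:
            return False                    # lock held and not yet expired
        self._data[key] = (value, now + ex)
        return True

    def delete_if_owner(self, key, value):
        current = self._data.get(key)
        if current and current[0] == value:  # only the owner may release
            del self._data[key]

def rebuild_cache_with_lock(store, key, loader):
    token = str(uuid.uuid4())
    # The 10-second expiry guards against a crashed holder never releasing.
    if store.set_nx_ex("lock:" + key, token, ex=10):
        try:
            return loader()   # read the database and refill the cache here
        finally:
            store.delete_if_owner("lock:" + key, token)
    return None               # lost the race: wait and re-read, or return a default
```

Even if the holder dies mid-rebuild, the lock expires on its own and another request can take over.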

3. Dual-key policy

We can store the cached data under two keys: a primary key with an expiration time, and a backup key with no expiration time. Only the keys differ; the values are identical, which amounts to keeping a copy of the cached data.

If the business thread cannot read the cached data under the primary key, it returns the cached data under the backup key instead, and whenever the cache is updated, both the primary and backup keys are updated together.
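A sketch of the dual-key read and write paths, using a plain dict as a stand-in for Redis (in real Redis the primary key would be written with an EX expiration and the backup key without one):

```python
def write_with_dual_keys(cache, key, value):
    """Update both copies together whenever the cache is refreshed."""
    cache["primary:" + key] = value  # would carry a TTL in real Redis
    cache["backup:" + key] = value   # no TTL: survives primary expiry

def read_with_dual_keys(cache, key):
    """Prefer the primary key; fall back to the never-expiring backup."""
    value = cache.get("primary:" + key)
    if value is None:                # primary expired or evicted
        value = cache.get("backup:" + key)
    return value
```

When the primary key expires, readers still get the (possibly slightly stale) backup value instead of flooding the database.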

4. Background cache updates

The business thread is no longer responsible for updating the cache, and the cache does not set an expiration date. Instead, the cache is made “permanent” and updated periodically by background threads.

In fact, not setting an expiration time does not guarantee the data stays in memory: when memory pressure is high, Redis may evict some cached data. Between the moment an entry is evicted and the next scheduled background refresh, any business thread reading the cache gets a miss and returns an empty value, which from the business's point of view looks like lost data.

There are two ways to solve the above problem.

In the first approach, the background thread is responsible not only for updating the cache periodically, but also for frequently checking whether the cache is still valid. If it detects that the cache is gone, likely evicted under memory pressure, it immediately reads the data from the database and rebuilds the cache.

With this approach the detection interval must not be too long: the longer the data is missing, the longer users see empty values instead of real data. The interval should therefore be on the order of milliseconds, but there is always some gap, so the user experience is only mediocre.

In the second approach, when a business thread finds the cached data invalid (evicted), it sends a message through a message queue to notify the background thread. On receiving the message, the background thread first checks whether the cache entry already exists; if it does, it does nothing, and if not, it reads the database and loads the data into the cache. This updates the cache more promptly than the first approach and gives a better user experience.

It is better to load data into the cache ahead of time, when the business starts up, than to wait for a user request to trigger cache construction. This is called cache warm-up, and the background-update mechanism is a perfect fit for it.

Redis is down

Common solutions to cache avalanche caused by Redis failures are as follows:

  • Service circuit breaker or request rate limiting;
  • Build a highly reliable Redis cache cluster.

1. Service circuit breaker or request rate limiting

For a cache avalanche caused by a Redis outage, we can trip a service circuit breaker: temporarily suspend business access to the cache service and return errors directly, rather than letting requests fall through to the database. This relieves the database of the extra load and keeps it running normally. Once Redis recovers, business applications are allowed to access the cache service again.

The circuit breaker protects the database's normal operation, but while business applications are cut off from the cache system, none of them can serve requests properly.

To reduce the impact on the business, we can instead enable request rate limiting: only a small fraction of requests are forwarded to the database, and the rest are rejected at the entrance. Once Redis recovers and the cache has been warmed up, the rate limit is lifted.
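One simple way to implement this is a fixed-window counter; the limit and window size below are illustrative, and production systems often prefer token-bucket or sliding-window variants:

```python
import time

class SimpleRateLimiter:
    """Fixed-window counter: allow at most `limit` requests per window."""
    def __init__(self, limit, window_seconds=1.0):
        self.limit = limit
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now  # start a fresh window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True              # forward this request to the database
        return False                 # reject at the entrance
```

Requests beyond the per-window budget are refused immediately instead of piling onto the database.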

2. Build a Redis cache highly reliable cluster

A circuit breaker or rate limiter is a reaction to a cache avalanche after the fact. To prevent it, it is better to build a highly reliable Redis cache cluster with master and slave nodes.

If the Redis master node fails and goes down, a slave node can be promoted to master and continue providing the cache service, avoiding the cache avalanche that a Redis failure would otherwise cause.


Cache breakdown

Our business usually has a few pieces of data that are accessed extremely frequently, for example the items in a flash sale. Such frequently accessed data is called hot data.

If a piece of hot data expires from the cache while a flood of requests for it arrives, none of those requests can be served from the cache, so they all hit the database directly, and the database is easily overwhelmed by the concurrent load. This is the cache breakdown problem.

Cache breakdowns are similar to cache avalanche, and you can think of cache breakdowns as a subset of cache avalanche.

There are two ways to deal with cache breakdown:

  • Use a mutex lock, ensuring that only one business thread rebuilds the cache at a time. Requests that fail to acquire the lock either wait for it to be released and re-read the cache, or return a null or default value.
  • Have a background thread update the cache asynchronously: either set no expiration time on hot data, or notify the background thread to refresh the data and reset its expiration just before it expires.

Cache penetration

With a cache avalanche or a cache breakdown, the database still holds the data the application needs; once the cache is rebuilt with that data, the pressure on the database subsides. Cache penetration is different.

When a user requests data that exists neither in the cache nor in the database, the cache misses, and the subsequent database lookup also finds nothing, so no cache entry can be built to serve later requests. When a large number of such requests arrive, the database pressure rises sharply. That is the cache penetration problem.

Cache penetration generally occurs in two ways:

  • Business misoperation: the data is accidentally deleted from both the cache and the database, so it exists in neither;
  • Malicious attack: hackers deliberately issue large numbers of requests for data that does not exist.

There are three common schemes for dealing with cache penetration:

  • The first scheme: limit illegal requests;
  • The second scheme: cache null or default values;
  • The third scheme: use a Bloom filter to quickly judge whether the data exists, avoiding a database query just to check for existence.

The first option is to limit illegal requests

Cache penetration can occur when malicious requests target large amounts of non-existent data. We should therefore validate parameters at the API entrance: check whether they are reasonable, whether they contain illegal values, and whether required fields are present. If a request is judged malicious, return an error immediately, avoiding any further access to the cache or the database.

The second option is to cache null or default values

When we observe cache penetration in an online service, we can cache a null or default value for the queried data, so that subsequent requests read that value from the cache and return it to the application instead of continuing to query the database.

The third option is to use a Bloom filter to quickly determine whether the data exists, rather than querying the database to determine whether the data exists.

When writing data to the database, we also mark it in the Bloom filter. Later, when a user request arrives and the business thread finds the cache missing, it queries the Bloom filter to quickly determine whether the data exists; if the filter says it does not, there is no need to query the database at all.

Even if cache penetration occurs, the flood of requests only hits Redis and the Bloom filter, not the database, so the database keeps running normally. Redis itself also supports Bloom filters.

So how does a Bloom filter work? Let me introduce it.

A Bloom filter consists of a bitmap array whose bits are all initialized to 0, and N hash functions. When we write data to the database, we mark it in the Bloom filter; later, to check whether the data is in the database, we only need to query the Bloom filter. If the query finds the data is not marked, the data is definitely not in the database.

Marking data in a Bloom filter takes three steps:

  • First, the data is run through the N hash functions, producing N hash values.
  • Second, each of the N hash values is taken modulo the length of the bitmap array, giving each hash value a position in the bitmap array.
  • Third, the bit at each of those positions in the bitmap array is set to 1.
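The three steps above can be sketched as follows; the use of salted SHA-256 digests to derive the N hash functions is an implementation choice for the sketch, not prescribed by the text:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size  # bitmap array, all positions start at 0

    def _positions(self, item):
        # Steps 1 and 2: N hashes of the item, each modulo the bitmap length.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1  # step 3: set each position to 1

    def might_contain(self, item):
        # All positions 1 -> "maybe present" (false positives are possible);
        # any position 0 -> definitely absent.
        return all(self.bits[pos] for pos in self._positions(item))
```

Note that `might_contain` returning True is only "probably present", while False is a guarantee of absence, which is exactly the property the penetration defense relies on.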

For example, suppose we have a Bloom filter whose bitmap array has length 8, with 3 hash functions.

After writing data X to the database and marking it in the Bloom filter, X is run through the three hash functions to produce three hash values, and each value is taken modulo 8. Suppose the results are 1, 4, and 6; the bits at positions 1, 4, and 6 of the bitmap array are then set to 1. When the application later wants to know whether X is in the database, the Bloom filter only needs to check whether the bits at positions 1, 4, and 6 are all 1. If any of them is 0, X is definitely not in the database.

Because the Bloom filter's fast lookups are based on hash functions, hash collisions are possible. For example, data X and data Y might both map to positions 1, 4, and 6, even though data Y is not actually in the database: a false positive.

Therefore, if a Bloom filter reports that data exists, the data does not necessarily exist in the database; but if it reports that data does not exist, the data is definitely not in the database.


Conclusion

There are three problems with cache exceptions: cache avalanche, breakdown, and penetration.

Among them, the main reason for cache avalanche and cache breakdown is that data is not in the cache, which leads to a large number of requests to access the database. As a result, the database pressure increases rapidly, which easily causes a series of chain reactions, leading to system crash. However, once the data is reloaded back into the cache, the application can quickly read the data from the cache again without further access to the database, and the database stress is instantly reduced. Therefore, cache avalanche and cache breakdown have similar solutions.

The main reason for cache penetration is that the data is neither in the cache nor in the database. Therefore, cache penetration is not the same as cache avalanche and breakdown.

I’ve compiled a table here that gives you a good idea of the difference between cache avalanche, breakdown and penetration, and what to do about it.


References
  1. Geek Time: Redis Core Technology and Practice
  2. Github.com/doocs/advan…
  3. medium.com/@mena.meseh…