
Last time, we learned about data inconsistencies between the cache and the database and how to deal with them. Beyond inconsistency, we often face three other cache problems: cache avalanche, cache breakdown, and cache penetration. When any of these occurs, a large number of requests piles up at the database layer; if the concurrency is high, the database can be overwhelmed or crash, which is a serious production incident.

In this lesson, I'll walk you through the symptoms, causes, and solutions for these three problems. As the saying goes, know your enemy and know yourself, and you can win a hundred battles. Once we understand the causes, we can prepare in advance by using the Redis cache sensibly, choosing reasonable cache settings, and adding the corresponding safeguards at the front end of the business application.

Let’s take a look at the cache avalanche and how to deal with it.

Cache avalanche

A cache avalanche occurs when a large number of application requests cannot be served by the Redis cache, so the application sends them on to the database layer, causing a surge of pressure there.

Cache avalanches generally result from two causes, and solutions vary, so let’s take a look at each one.

The first cause is that a large amount of cached data expires at the same time, so a large number of requests cannot be served from the cache.

Specifically, when we store data in the cache and set an expiration time, and a large amount of that data expires at the same moment just as the application is accessing it, cache misses occur. The application then sends the requests to the database and reads the data from there. If the application's concurrency is high, the database comes under pressure, which in turn affects its ability to process other business requests normally. Let's take a look at a simple example, as shown below:

I offer you two solutions to a cache avalanche caused by a large amount of data expiring at the same time.

First, we can avoid setting the same expiration time for a large amount of data. If the business layer does require a batch of data to expire around the same time, we can still use EXPIRE to set each key's expiration individually, but add a small random offset to each (for example, an extra 1 to 3 minutes). The keys then expire at slightly different times; the differences are small, so the data still expires at roughly the same time and business requirements are met, but we avoid a large amount of data expiring in the same instant.
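To make this concrete, here is a minimal sketch in Python using the redis-py client (the client, the key names, and the one-hour base TTL are my own assumptions, not anything specified above):

```python
import random

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

BASE_TTL = 3600  # base expiration time (one hour) shared by this batch of keys

def cache_with_jitter(key, value):
    """Cache a value with the base TTL plus a small random offset (1~3 minutes),
    so keys written together do not all expire at the same instant."""
    jitter = random.randint(60, 180)          # extra 1~3 minutes, as suggested above
    r.set(key, value, ex=BASE_TTL + jitter)   # SET and EXPIRE in a single call

# Example: a batch of product keys that would otherwise share one expiration time
for product_id in range(1000, 1010):
    cache_with_jitter(f"product:{product_id}", "serialized-product-data")
```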

In addition to fine-tuning expiration times, we can also deal with cache avalanches by downgrading services.

Service degradation means that when a cache avalanche occurs, different data is handled differently:

  • When the business application accesses non-core data (such as e-commerce product attributes), it temporarily stops querying the cache for that data and returns predefined information, a null value, or an error message.
  • When the business application accesses core data (such as e-commerce inventory), queries to the cache are still allowed, and on a cache miss the data can still be read from the database.

In this way, only part of the requests for expired data, those for core data, are sent to the database, and the pressure on the database is reduced. The graph below shows how data requests are executed when the service is degraded.
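As a rough illustration of such a degradation switch, here is a Python sketch; the `DEGRADED` flag, the key prefix used to identify core data, and the `load_from_db` callback are all hypothetical names of mine:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

DEGRADED = False                 # flipped to True when an avalanche is detected
CORE_KEY_PREFIX = "inventory:"   # core data, e.g. e-commerce inventory
NON_CORE_DEFAULT = ""            # predefined value returned for non-core data

def get_data(key, load_from_db):
    """Read a value, degrading non-core reads while an avalanche is in progress."""
    if DEGRADED and not key.startswith(CORE_KEY_PREFIX):
        # Non-core data (e.g. product attributes): skip cache and database entirely
        # and return a predefined value / null / error message.
        return NON_CORE_DEFAULT

    # Core data (or normal operation): query the cache first...
    value = r.get(key)
    if value is not None:
        return value
    # ...and fall back to the database on a cache miss.
    return load_from_db(key)
```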

In addition to a cache avalanche caused by a large amount of data expiring at the same time, an avalanche can also occur when the Redis cache instance itself fails and cannot process requests, so a large number of requests piles up at the database layer.

Typically, a single Redis instance can sustain tens of thousands of requests per second, while a single database may only sustain a few thousand, a difference of nearly a factor of ten. When the Redis cache fails entirely in an avalanche, the database suddenly faces nearly ten times the request volume it can handle and may crash under the strain.

At this point, because the Redis instance is down, we need to deal with the cache avalanche in other ways. I offer you two pieces of advice.

The first recommendation is to implement service circuit breakers or request traffic limiting mechanisms in business systems.

A service circuit breaker means that when a cache avalanche occurs, in order to prevent it from cascading into the database and even bringing down the entire system, we suspend the business application's access to the cache service. More concretely, when the business application calls the cache interface, the cache client does not send the request to the Redis instance but returns directly; once the Redis instance has recovered, application requests are allowed to reach the cache again.

In this way, we prevent a large number of cache-miss requests from piling up on the database, keeping it running normally.
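Here is one possible shape of such a circuit breaker, a minimal Python sketch rather than a prescribed implementation; the class and the reset hook are my own naming:

```python
import redis

r = redis.Redis(host="localhost", port=6379, socket_timeout=0.5, decode_responses=True)

class CacheCircuitBreaker:
    """When open, the cache client returns immediately instead of calling Redis."""

    def __init__(self):
        self.open = False   # can also be set by an external monitoring job

    def get(self, key):
        if self.open:
            return None     # breaker open: do not touch Redis at all
        try:
            return r.get(key)
        except redis.exceptions.ConnectionError:
            self.open = True   # Redis unreachable: trip the breaker
            return None

    def try_reset(self):
        """Called periodically; closes the breaker once Redis answers PING again."""
        try:
            if r.ping():
                self.open = False
        except redis.exceptions.ConnectionError:
            self.open = True

breaker = CacheCircuitBreaker()
value = breaker.get("product:1001")   # a None result falls through to the database
```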

While the business system is running, we can monitor load metrics on the machines hosting the Redis cache and the database, such as requests per second, CPU utilization, and memory utilization. If we find that the Redis instance has gone down and the load on the database host suddenly rises (for example, requests per second spike), a cache avalanche is happening and a large number of requests is hitting the database. We can then enable the circuit breaker mechanism to suspend the business application's access to the cache service and relieve the pressure on the database, as shown in the following figure:

A service circuit breaker keeps the database running, but it suspends access to the entire cache system, which affects the business broadly. To reduce that impact, we can instead apply request rate limiting: at the front end of the business system's request entry point, we control how many requests enter the system per second, so that too many requests never reach the database.

Let me give you an example. Assume that when the business system runs normally, the request entry point lets 10,000 requests per second into the system; 9,000 of them are served by the cache, and only 1,000 are sent by the application to the database.

Once a cache avalanche occurs and the number of requests per second reaching the database suddenly jumps to 10,000, we can turn on rate limiting at the entry point and allow only 1,000 requests per second into the system; the rest are denied service right at the entry. By limiting requests in this way, we keep a flood of concurrent requests from being passed down to the database layer.
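As one way to picture this, here is a simple fixed-window limiter in Python (real systems often do this in a gateway or with a token bucket; the names and numbers below are illustrative):

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per one-second window at the request entry."""

    def __init__(self, limit):
        self.limit = limit
        self.window_start = int(time.time())
        self.count = 0

    def allow(self):
        now = int(time.time())
        if now != self.window_start:   # a new second has started: reset the counter
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False                   # over the limit: deny service at the entry

def process(request):
    ...   # placeholder for the normal cache/database handling path

# Normal operation might allow 10,000 req/s; during an avalanche we drop to 1,000.
limiter = FixedWindowLimiter(limit=1000)

def handle_request(request):
    if not limiter.allow():
        return "503 Service Unavailable"   # rejected before cache or database is touched
    return process(request)
```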

Using a service circuit breaker or request rate limiting to handle a cache avalanche caused by a Redis instance outage is an "after the fact" response: the avalanche has already happened, and these two mechanisms only reduce its impact on the database and the overall business system.

The second piece of advice I would give you is prevention.

The idea is to build a highly reliable Redis cache cluster with master and replica nodes. If the master node fails and goes down, a replica can be promoted to master and continue to provide the cache service, avoiding the cache avalanche that an instance outage would otherwise cause.
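For example, if the cluster's failover is managed by Redis Sentinel (an assumption on my part; the text only says to build a master/replica cluster), the application client can always follow the current master, so a failover does not turn into an avalanche:

```python
from redis.sentinel import Sentinel

# Sentinel nodes monitor the master and promote a replica if the master goes down.
sentinel = Sentinel(
    [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
    socket_timeout=0.5,
)

# master_for() always resolves to whichever node is currently the master, so the
# cache keeps serving requests after a failover instead of avalanching onto the DB.
master = sentinel.master_for("mymaster", socket_timeout=0.5, decode_responses=True)
replica = sentinel.slave_for("mymaster", socket_timeout=0.5, decode_responses=True)

master.set("product:1001", "serialized-product-data")
print(replica.get("product:1001"))
```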

A cache avalanche happens when a large amount of data expires at the same time; cache breakdown, which I'll describe next, happens when a single piece of hotspot data expires. Compared with an avalanche, the amount of expired data involved in a breakdown is much smaller and the response is different, so let's take a look.

Cache breakdown

Cache breakdown occurs when requests for a piece of very frequently accessed hot data cannot be served by the cache, and a flood of requests for that data suddenly hits the back-end database, causing a surge in database pressure that affects its ability to process other requests. Cache breakdown usually happens when hotspot data expires, as shown in the following figure:

To avoid the spike in database pressure caused by cache breakdown, the solution is straightforward: do not set an expiration time on hot data that is accessed very frequently. Requests for hot data can then always be served from the cache, and Redis's throughput of tens of thousands of requests per second can absorb the concurrency.
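In redis-py terms this simply means writing the hot key without an `ex`/`px` argument; the key names below are made up for illustration:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Ordinary data: give it a TTL as usual.
r.set("product:detail:42", "serialized-detail", ex=3600)

# Hotspot data: no ex/px argument, so the key never expires and requests
# for it keep being answered by the cache.
r.set("hot:product:detail:1001", "serialized-detail")
```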

Ok, so far we've covered cache avalanche and cache breakdown and how to deal with them. In both cases, the data the application wants is still stored in the database. Cache penetration is different: the data is not in the database either, which puts pressure on both the cache and the database. Let's take a look.

Cache penetration

Cache penetration means that the data being accessed is neither in the Redis cache nor in the database. When a request reaches the cache, it misses; when it then reaches the database, the data cannot be found there either. The application therefore cannot read the data from the database and write it back to the cache to serve later requests, so the cache becomes useless, a mere decoration. If the application keeps sending large numbers of requests for such data, both the cache and the database come under heavy pressure, as shown in the following figure:

So, when does cache penetration occur? In general, there are two cases.

  • Business-layer misoperation: data is deleted by mistake from both the cache and the database, so neither holds it.
  • Malicious attack: deliberately accessing data that does not exist in the database.

To avoid the impact of cache penetration, I offer you three solutions.

The first option is to cache null or default values.

Once cache penetration occurs, we can cache a null value in Redis for the queried data, or a default value agreed with the business layer (for example, a default inventory of 0). Subsequent queries for that data then read the null or default value directly from Redis and return it to the business application, avoiding a flood of requests to the database and keeping it running normally.
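A minimal sketch of this approach in Python follows; the default value, the 60-second TTL on the cached default, and the `load_from_db` callback are my own assumptions rather than anything the text prescribes:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

DEFAULT_STOCK = "0"   # default inventory value negotiated with the business layer

def get_stock(item_id, load_from_db):
    key = f"stock:{item_id}"
    value = r.get(key)
    if value is not None:
        return value                  # includes a previously cached default

    value = load_from_db(item_id)
    if value is None:
        # The data is in neither the cache nor the database: cache the default
        # (here with a short TTL, an assumption) so follow-up queries stop at Redis.
        r.set(key, DEFAULT_STOCK, ex=60)
        return DEFAULT_STOCK

    r.set(key, value, ex=3600)
    return value
```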

The second solution is to use a Bloom filter to quickly determine whether the data exists, so that we don't have to ask the database and can reduce its load.

Let’s take a look at how the Bloom filter works.

A Bloom filter consists of a bit array whose bits are initialized to 0 and N hash functions, and it can quickly tell us whether a piece of data exists. When we want to mark data as existing (for example, because it has been written to the database), the Bloom filter does three things:

  • First, it applies the N hash functions to the data, producing N hash values.
  • Then it takes each hash value modulo the length of the bit array to find that hash value's position in the array.
  • Finally, it sets the bit at each of those positions to 1, which completes marking the data in the Bloom filter.

If the data does not exist (for example, there is no data written to the database) and we have not marked the data with the Bloom filter, then the value of the corresponding bit in the bit array is still 0.

When we need to query for some data, we perform the same calculation, first obtaining the data's N positions in the bit array, and then checking the values of those N bits. As long as any one of those N bits is not 1, the Bloom filter has not marked the data, so the queried data is definitely not in the database. I drew a diagram to help you understand; take a look.

In the figure, the Bloom filter is an array containing 10 bits, using three hash functions. When marking data X in the Bloom filter, X will be hashed three times and modulo 10. The modulo results are 1, 3 and 7 respectively. So, the first, third, and seventh bits of the bit array are set to 1. When an application wants to query for X, it simply looks to see if the first, third, and seventh bits of the array are 1, and if any of them are 0, X is definitely not in the database.

Because Bloom filter checks are fast, we can mark data in the Bloom filter whenever it is written to the database. Then, on a cache miss, we first ask the Bloom filter whether the data exists; if it does not, there is no need to query the database at all. This way, even when cache penetration occurs, the flood of requests is answered by Redis and the Bloom filter rather than piling up on the database, and normal database operation is not affected. The Bloom filter itself can be implemented with Redis, which can handle the concurrent access.
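One way to implement the Bloom filter on Redis is with a bitmap and the SETBIT/GETBIT commands, as sketched below; the key name, bit-array size, and number of hash functions are choices of mine:

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)

BLOOM_KEY = "bloom:products"   # Redis string used as the bit array
NUM_BITS = 10_000_000          # length of the bit array
NUM_HASHES = 3                 # N hash functions

def _bit_positions(data):
    """Hash the data N times (salted SHA-1 here) and map each hash into the array."""
    for i in range(NUM_HASHES):
        digest = hashlib.sha1(f"{i}:{data}".encode()).hexdigest()
        yield int(digest, 16) % NUM_BITS

def bloom_add(data):
    """Mark the data when it is written to the database: set its N bits to 1."""
    for pos in _bit_positions(data):
        r.setbit(BLOOM_KEY, pos, 1)

def bloom_might_exist(data):
    """If any of the N bits is 0, the data is definitely not in the database."""
    return all(r.getbit(BLOOM_KEY, pos) for pos in _bit_positions(data))

bloom_add("product:1001")
print(bloom_might_exist("product:1001"))   # True
print(bloom_might_exist("product:9999"))   # almost certainly False
```

If the RedisBloom module is available, its BF.ADD and BF.EXISTS commands provide the same functionality without hand-rolling the hashing.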

The third solution is to detect and filter requests at the front end of the request entry point. One cause of cache penetration is a large volume of malicious requests for data that does not exist, so an effective response is to validate requests as soon as the business system receives them and directly filter out malicious ones (for example, requests with unreasonable parameters, illegal parameter values, or requests for fields that do not exist), denying them access to the back-end cache and database. Then cache penetration cannot cause problems.
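A sketch of that kind of entry-point check might look like the following; the validation rules and the ID bound are entirely hypothetical, since what counts as a legitimate request depends on the business:

```python
MAX_KNOWN_PRODUCT_ID = 100_000   # hypothetical upper bound on IDs that can exist

def is_legitimate(params):
    """Reject obviously malicious requests before they reach the cache or database."""
    product_id = params.get("product_id")
    if not isinstance(product_id, int):
        return False                 # missing or ill-typed parameter
    if product_id <= 0 or product_id > MAX_KNOWN_PRODUCT_ID:
        return False                 # value outside any range that could exist
    return True

def query_cache_then_db(params):
    ...   # placeholder for the normal read path (cache first, then database)

def handle_request(params):
    if not is_legitimate(params):
        return "400 Bad Request"     # filtered out at the request entry point
    return query_cache_then_db(params)
```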

Compared with cache avalanche and cache breakdown, cache penetration is the more serious problem, so I hope you pay special attention to it. From a prevention perspective, we should avoid accidentally deleting data from the database and the cache; from a response perspective, we can cache null or default values, use Bloom filters, and detect malicious requests in the business system.

Summary

In this lesson, we learned about cache avalanche, cache breakdown, and cache penetration. In terms of causes, avalanche and breakdown happen mainly because the data is no longer in the cache, while penetration happens because the data is in neither the cache nor the database. Therefore, after an avalanche or breakdown, once the data has been written back into the cache from the database, the application can again read it quickly from the cache and the database pressure drops accordingly. With cache penetration, both the Redis cache and the database remain under constant request pressure.

For your convenience, I have summarized the causes and solutions of these three problems into a table, which you can review again.

Finally, I want to emphasize that service circuit breakers, service degradation, and request rate limiting are all "lossy" solutions: they protect the database and the overall system, but at some cost to the business application. For example, when the service is degraded, requests for some data only get an error message back and cannot be processed normally. If a service circuit breaker is used, the entire cache service is suspended, which affects an even wider range of business. With request rate limiting, the throughput of the whole system drops and fewer concurrent user requests can be served, which hurts the user experience.

So, my advice to you is to use the precautionary plan whenever possible:

  • For cache avalanche, set data expiration times sensibly and build a highly reliable cache cluster.
  • For cache breakdown, do not set an expiration time on very frequently accessed hotspot data.
  • For cache penetration, implement malicious request detection at the request entry point in advance, and standardize database deletion operations to avoid accidental deletion.

Question for this lesson

Well, as usual, let me leave you with a quick question. When discussing cache avalanches, I mentioned that service circuit breakers, service degradation, and request rate limiting can be used to deal with them. Do you think these three mechanisms can also be used to deal with cache penetration?

Feel free to write down your thoughts and answers in the comments section so we can discuss them together. If you found today's content helpful, please share it with your friends and colleagues. See you next time.