Hi, I’m Li Ge.

Last time we discussed the caching architecture of distributed systems, from browser cache to client cache, CDN cache, reverse proxy cache, local cache, and distributed cache. There are many caches along the whole request path.

Along this cache chain, all kinds of problems can occur, such as cache penetration, cache breakdown, cache avalanche, and cache data consistency. Other common problems include cache skew, cache blocking, slow cache queries, master-slave cache consistency, cache high availability, cache failure detection and recovery, cluster scale-out and scale-in, and hot keys and big keys.

Today we are going to talk about cache breakdown.

As always, take a look at the outline of this article:

What is cache breakdown?

As we know, a cache works by fetching data from the cache first: if the data is there, it is returned directly to the user; if not, the actual data is read from the slow device (such as a database) and put into the cache. Something like this:
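The read-through flow above can be sketched as follows. This is a minimal illustration, not a production implementation: the dictionaries `cache` and `database`, and the helper `db_query`, are stand-ins for a real cache (e.g. Redis) and a real slow device.

```python
cache = {}                         # stands in for the cache layer (e.g. Redis)
database = {"user:1": "Li Ge"}     # stands in for the slow device (e.g. a database)

def db_query(key):
    """Read from the slow device."""
    return database.get(key)

def get_with_cache(key):
    value = cache.get(key)
    if value is not None:          # cache hit: return directly
        return value
    value = db_query(key)          # cache miss: fall through to the slow device
    if value is not None:
        cache[key] = value         # populate the cache for subsequent requests
    return value
```

In a real system the `cache[key] = value` step would also set an expiration time, which is exactly where cache breakdown comes from.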

In many cases, cached data has an expiration time. When a piece of cached data expires, requests pass through the cache layer to the slow device. If that data is being heavily accessed at the moment (hot data), the slow device is bombarded by this wave of traffic and may break down.

Something like this:

This is cache breakdown.

Cache breakdown emphasizes the expiration of a single hot key combined with high concurrency.

What are the pain points of cache breakdown?

From the flow chart above, we already know that the biggest pain point is the slow device going down, which triggers a series of serious problems: if it is a single-database service, the whole service becomes almost unavailable; if the data is sharded across databases and tables, every request hashed to the downed databases becomes almost unavailable.

Second, the service QPS drops instantly. Suppose it takes 0.01s to fetch data from the cache and 1s to fetch it from the database; then one thread can process 100 requests per second from the cache, but only 1 request per second from the database. Assuming the service's Tomcat thread pool has 200 threads, then with the cache, QPS = threads * requests processed per second per thread = 200 * 100 = 20,000; without it, QPS = 200 * 1 = 200.
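The back-of-envelope math above, spelled out (the latencies and thread count are the article's assumed figures, not measurements):

```python
threads = 200          # assumed Tomcat thread pool size
cache_latency = 0.01   # seconds per request served from the cache
db_latency = 1.0       # seconds per request served from the database

# Each thread processes 1/latency requests per second.
qps_cached = threads * (1 / cache_latency)   # 200 * 100 = 20,000
qps_db = threads * (1 / db_latency)          # 200 * 1   = 200
```

A 100x latency difference per request translates directly into a 100x drop in achievable QPS.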

When the request rate exceeds this capacity (actual QPS greater than 200), there are not enough threads, and users have to queue up waiting for threads to be released. This is when users say: "What a crappy site, it's stuck!"

Cache breakdown solution?

When cache breakdown occurs, it looks like the slow device's problem, but the slow device is not the real culprit. Why?

I believe this phenomenon is caused by unreasonable architecture design. In my opinion, cache breakdown should be governed in several stages:

  1. Isolate if necessary
  2. Ensure that hotspot data is cached
  3. Prevent hot data from not being in the cache

Isolate if necessary

Isolation? How?

In general, isolation is not necessary, because its cost is too high; this kind of isolation is only considered for super promotion events such as 618, Double 11, and Double 12.

Here, hotspot isolation means that the normal environment and the hotspot environment are separated so that they do not affect each other.

That way, even if a slow device in the hotspot environment cannot handle the instantaneous traffic and goes down, requests in the normal environment are not affected.

In real production environments there are other isolation schemes as well, such as machine-room isolation, environment isolation, cluster isolation, process isolation, thread isolation, resource isolation, and so on.

Ensure that hotspot data is cached

If you have this kind of highly concurrent data, you need to make sure the cache layer holds up.

So how do you make sure the cache layer holds up?

There are two cases to consider: cache breakdown caused by the cache blocking or going down, and cache breakdown caused by the data not being in the cache.

If the cache is blocked or down, the solutions are:

  1. Check the cause of downtime, check the slow query, optimize the slow query;
  2. Cache cluster optimization or expansion;

If the data is not in the cache, the solutions are:

  1. Hotspot cache data never expires
  2. Periodically refresh the cache data and expiration time using scheduled tasks to ensure the existence of cache data
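Option 2 can be sketched with a background task that re-writes hot keys before they expire. This is an in-process sketch using the standard library's `threading.Timer`; the key list, refresh interval, and `load_from_db` helper are all assumptions. With a real cache like Redis, the refresh would rewrite both the value and its TTL.

```python
import threading

cache = {}

def load_from_db(key):
    """Placeholder for the real slow-device read."""
    return f"value-of-{key}"

def refresh_hot_keys(keys, interval_seconds):
    # Refresh every hot key so the cached copy (and, with a real cache,
    # its expiration time) is renewed before it can expire.
    for key in keys:
        cache[key] = load_from_db(key)
    # Re-schedule the next refresh well inside the expiration window.
    timer = threading.Timer(interval_seconds, refresh_hot_keys,
                            args=(keys, interval_seconds))
    timer.daemon = True
    timer.start()
```

The refresh interval must be comfortably shorter than the cache TTL, otherwise there is still a window in which the hot key is missing.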

Prevent hot data from not being in the cache

Above, we analyzed many cases and put many protective measures in place. Even so, we still cannot guarantee that the cached data will always be available, or that it will never expire.

So we also need a last line of defense: what if the data is not in the cache?

If you don't prepare for it, once that happens there is a good chance that both the service and the database go down, and then... pack your schoolbag and go home.

So be sure to act.

I don't know if you remember, but in the last article we talked about converting "high concurrency into low concurrency"; you can read the previous article for details:

The idea is that when the cache layer is broken through, high-concurrency requests fall on the slow device, and we can turn that "high concurrency into low concurrency": only the request that acquires a lock reads the slow device and rebuilds the cache, while the others wait and then read from the cache. Most of the time we use a distributed lock.

There are many options for using distributed locks:

  1. Database unique key
  2. Database pessimistic locking
  3. Based on Redis's SETNX
  4. Implementation based on Redisson
  5. Implemented based on Zookeeper

Locks are not our topic today; we will publish a series of lock-related articles later, so stay tuned.

Conclusion

Concept: Cache breakdown refers to a single piece of hot data expiring, after which the surge of concurrent requests sent to the slow device may bring the slow device down.

Symptoms: The slow device goes down, service QPS plummets, and users complain that the site is slow.

Solution:

  1. Improve cache high availability;
  2. Ensure that hotspot data is cached;
  3. Turn high concurrency into low concurrency when accessing slow devices (generally with distributed locks).

Thanks for reading, and follow me on the road to becoming an architect.

Your sharing, liking and watching will be the biggest support for me!