This article is archived on Github /JavaMap, which holds a map of advanced technical knowledge for Java programmers along with my series of articles. Stars are welcome.
Incident background
Recently, the company ran a flash-sale promotion. Because of an operational mistake by a back-end developer, the promotion did not go as planned, and users and agents complained. My manager asked me to review the production incident with my colleagues.
What caused it?
The flash sale was scheduled to start at exactly 00:00.
At 22:00, operations staff put the products online through the admin console.
At 23:00, the back-end developer loaded the products into the cache to warm it up in advance.
Traffic is very heavy when a flash sale begins, so the plan was to serve most user query requests from Redis and keep them from all falling on the database.
As pictured above, most requests were expected to hit the cache. However, when warming the cache, the back-end developer set every product's cache entry to expire after 2 hours, so all the entries expired at the same moment. Every request then fell on the database at once, the database collapsed under the load, and all user requests failed with timeout errors.
In reality, nearly all requests went straight to the database, as shown below:
When was it discovered?
At 01:02 am, the SRE received a system alarm. After logging in to the O&M management system, they found that CPU and memory usage on the database node had exceeded their thresholds.
Why wasn’t it discovered earlier?
Because the cache entries were warmed at 23:00 and set to expire after 2 hours, the cache served most requests until 1 am, and the database ran normally until then.
What was done when it was discovered?
After locating the problem through the logs, the back-end developer carried out the following steps:
First, the API Gateway was used to throttle most of the incoming traffic. Then the crashed database service was restarted and the cache was warmed up again. Once the cache and the database service were confirmed to be healthy, the Gateway throttling was lifted.
How can we avoid it next time?
The root cause of this incident was a cache avalanche: the query volume was huge, the cache entries all disappeared at once, and the requests fell directly on the database, which broke down under the pressure.
The industry has mature ways to deal with a cache avalanche, such as:
- Uniform expiration
- Add a mutex
- Cache never expires
(1) Uniform expiration
Set different expiration times so that cache entries expire as evenly as possible over time. A common approach is to add a random value to each entry's validity period, or to plan the validity periods so that they are staggered.
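The random-value idea can be sketched as follows. This is a minimal illustration, not the code used in the incident: the class name `JitteredTtl`, the method `withJitter`, and the 30-minute jitter window are all assumptions made for the example, and the actual write to Redis is left out.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: spread cache expirations by adding random jitter to a base TTL. */
final class JitteredTtl {
    private JitteredTtl() {}

    /**
     * Returns baseTtl plus a random extra between zero and maxJitter (inclusive).
     * Both values are illustrative and should be tuned to the real traffic pattern.
     */
    static Duration withJitter(Duration baseTtl, Duration maxJitter) {
        if (baseTtl.isNegative() || maxJitter.isNegative()) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        // nextLong(bound) is exclusive of the bound, so add 1 to include maxJitter itself.
        long extraMillis = ThreadLocalRandom.current().nextLong(maxJitter.toMillis() + 1);
        return baseTtl.plusMillis(extraMillis);
    }

    public static void main(String[] args) {
        Duration base = Duration.ofHours(2);       // the TTL used in the incident
        Duration jitter = Duration.ofMinutes(30);  // assumed jitter window
        for (int productId = 1; productId <= 5; productId++) {
            Duration ttl = withJitter(base, jitter);
            // With a real Redis client, ttl would be passed as the expiry when writing the key.
            System.out.println("product:" + productId + " ttl=" + ttl.getSeconds() + "s");
        }
    }
}
```

With a window like this, entries warmed at 23:00 would expire at scattered times between 01:00 and 01:30 instead of all at 01:00.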
(2) Add a mutex
As with the solution to cache breakdown, only one thread is allowed to rebuild the cache at a time, while the other threads block and wait their turn.
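A rough sketch of the mutex approach inside a single JVM is shown below. It is an illustration under assumptions: the class `MutexCacheLoader` is hypothetical, an in-memory map stands in for Redis, and the `database` function stands in for the real query. A deployment with several application instances would need a distributed lock instead, which is not shown here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Sketch: on a cache miss, only one thread per key rebuilds the entry. */
final class MutexCacheLoader<K, V> {
    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();      // stands in for Redis
    private final ConcurrentMap<K, Object> locks = new ConcurrentHashMap<>(); // one lock object per key
    private final Function<K, V> database;                                    // hypothetical slow lookup

    MutexCacheLoader(Function<K, V> database) {
        this.database = database;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached; // fast path: cache hit, no locking
        }
        // Note: this lock map is never cleaned up; a real implementation would bound it.
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {
            // Double-check: another thread may have rebuilt the entry while we waited.
            cached = cache.get(key);
            if (cached == null) {
                cached = database.apply(key);
                if (cached != null) {
                    cache.put(key, cached);
                }
            }
            return cached;
        }
    }

    /** Simulates the entry expiring. */
    void evict(K key) {
        cache.remove(key);
    }

    public static void main(String[] args) {
        MutexCacheLoader<String, String> loader =
                new MutexCacheLoader<String, String>(key -> "details-of-" + key);
        System.out.println(loader.get("sku-1")); // first call loads from the stand-in database
        System.out.println(loader.get("sku-1")); // second call is served from the cache
    }
}
```

The trade-off is latency: threads that miss the cache wait for the rebuild, but the database sees one query per key rather than one per request.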
(3) The cache never expires
As with the solution to cache breakdown, the cache entry never physically expires; instead, an asynchronous thread updates the cached value.
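One way to picture this is a "logical expiration" stored next to the value: readers always get an answer from the cache, and a stale entry only triggers a background refresh. The sketch below assumes this design. The class `LogicalExpiryCache`, its constructor parameters, and the in-memory map standing in for Redis are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Sketch: entries are never removed; a logical expiry time triggers an async refresh. */
final class LogicalExpiryCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expireAtMillis; // logical expiry, stored with the value
        Entry(V value, long expireAtMillis) {
            this.value = value;
            this.expireAtMillis = expireAtMillis;
        }
    }

    private final ConcurrentMap<K, Entry<V>> store = new ConcurrentHashMap<>(); // stands in for Redis
    private final Set<K> refreshing = ConcurrentHashMap.newKeySet();           // keys being refreshed
    private final Function<K, V> database; // hypothetical slow lookup
    private final Executor refresher;      // runs the asynchronous refresh
    private final LongSupplier clock;      // injectable so the behaviour can be tested
    private final long ttlMillis;

    LogicalExpiryCache(Function<K, V> database, Executor refresher, LongSupplier clock, long ttlMillis) {
        this.database = database;
        this.refresher = refresher;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    /** Loads a key ahead of time, as in the warm-up step before the sale. */
    void warmUp(K key) {
        store.put(key, new Entry<>(database.apply(key), clock.getAsLong() + ttlMillis));
    }

    /** Returns the cached value, possibly stale; returns null if the key was never warmed up. */
    V get(K key) {
        Entry<V> entry = store.get(key);
        if (entry == null) {
            return null;
        }
        // Logically expired: schedule at most one refresh per key, then serve the old value.
        if (clock.getAsLong() >= entry.expireAtMillis && refreshing.add(key)) {
            try {
                refresher.execute(() -> {
                    try {
                        store.put(key, new Entry<>(database.apply(key), clock.getAsLong() + ttlMillis));
                    } finally {
                        refreshing.remove(key);
                    }
                });
            } catch (RejectedExecutionException e) {
                refreshing.remove(key); // could not schedule; a later read will try again
            }
        }
        return entry.value;
    }

    public static void main(String[] args) {
        LogicalExpiryCache<String, String> cache = new LogicalExpiryCache<String, String>(
                key -> "details-of-" + key, Runnable::run, System::currentTimeMillis, 2 * 60 * 60 * 1000L);
        cache.warmUp("sku-1");
        System.out.println(cache.get("sku-1"));
    }
}
```

The cost of this design is that readers may briefly see stale data while a refresh is in flight, so it suits data where short-lived staleness is acceptable.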
Review summary
By reviewing the production incident with my colleagues, we gained a deeper understanding of the cache avalanche. To avoid another one, we discussed three solutions:
(1) Uniform expiration
(2) Add a mutex
(3) The cache never expires
I hope every engineer treats each line of code with respect!
— END —
A daily request for likes: hello, fellow engineers. Make it a habit to like before you read. Your likes keep me moving forward, and more good content is on the way.
About the author: the blogger holds a master's degree from Huazhong University of Science and Technology and is a programmer who is passionate about both technology and life. The author has spent several years at a number of top-tier Internet companies and has many years of hands-on development experience.
Search WeChat for the official account [love laughing architect]. I have technology and stories waiting for you.