Preface
Hello everyone! I’m Wan Junfeng, the author of go-zero. Thank you, ArchSummit, for this great opportunity to share go-zero’s cache best practices.
First of all, think about this: when our traffic surges, which part of the server side is most likely to become the first bottleneck? I believe most people will find it is the database: once the volume arrives, slow queries pile up and the database may even grind to a halt. At that point, no matter how strong the governance capabilities of the upper-layer services are, they won’t help.
We often say that you can tell how well a system architecture is designed by how well its cache is designed. We once ran into exactly this problem. Before I joined, our service had no cache at all. Although the traffic was not high back then, we would still get nervous every day during peak hours. I was a consultant at the time; after looking at the system design, all I could do to save the day was ask everyone to add caching first. However, due to the lack of caching knowledge and the chaos of the old system, every business developer hand-rolled the cache in their own way. The result was that caches were in use, but the data was fragmented and there was no way to guarantee consistency. That was genuinely a painful experience, one that should strike a chord with many of you.
I then took over the whole system and redesigned it. The architectural design of the cache layer played a crucial role in that redesign, hence today’s sharing.
I will mainly discuss the following parts with you:
- Common problems in cache systems
- Caching and automatic cache management for single-row queries
- Caching mechanism for multi-row queries
- Design of a distributed cache system
- Automated cache code generation in practice
A cache system involves many problems and knowledge points, which I divide into the following aspects:
- Stability
- Correctness
- Observability
- Standardized adoption and tooling
Due to the length of the material, this article is the first in the series and covers cache system stability.
Cache system stability
When it comes to cache stability, almost every caching article and talk on the web addresses three key points:
- Cache penetration
- Cache breakdown
- Cache avalanche
Why put cache stability first? Think back to when we introduce a cache in the first place: generally when the DB is under pressure, or even frequently overwhelmed. So the very reason we introduce a cache system is to solve stability problems.
Cache penetration
Cache penetration is caused by requests for data that does not exist. In Figure 1, request 1 asks for such data first; since the data does not exist, it is certainly not in the cache, so the request falls through to the DB. Requests 2 and 3 for the same data likewise pass through the cache and land on the DB. When there are many requests for non-existent data, the pressure on the DB becomes especially high, particularly under malicious requests (an attacker discovers that some data does not exist and then launches a flood of requests for it).
go-zero’s solution: for a request whose data does not exist, we briefly store a placeholder in the cache (for one minute, say), so that the number of DB requests for the same non-existent data is decoupled from the number of requests on the business side. Of course, the business side can also delete the placeholder when the data is created, ensuring that new data is immediately queryable.
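To make the placeholder approach concrete, here is a minimal sketch in Go. It is illustrative only: the in-memory `Cache` type, the `queryDB` callback, and the TTL values are my assumptions for the example; go-zero itself implements this on top of Redis.

```go
package cache

import (
	"errors"
	"sync"
	"time"
)

// ErrNotFound is returned when neither the cache nor the DB has the data.
var ErrNotFound = errors.New("record not found")

// placeholder marks "this key is known not to exist in the DB".
const placeholder = "*"

type entry struct {
	value    string
	expireAt time.Time
}

// Cache is a tiny in-memory stand-in for Redis, for illustration only.
type Cache struct {
	mu   sync.Mutex
	data map[string]entry
}

func New() *Cache {
	return &Cache{data: make(map[string]entry)}
}

func (c *Cache) set(key, val string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = entry{value: val, expireAt: time.Now().Add(ttl)}
}

func (c *Cache) get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.data[key]
	if !ok || time.Now().After(e.expireAt) {
		return "", false
	}
	return e.value, true
}

// Get consults the cache first. On a miss it queries the DB; if the row
// does not exist, it caches a short-lived placeholder so repeated requests
// for the same missing key stop hammering the DB.
func (c *Cache) Get(key string, queryDB func(string) (string, error)) (string, error) {
	if v, ok := c.get(key); ok {
		if v == placeholder {
			return "", ErrNotFound // known missing: skip the DB entirely
		}
		return v, nil
	}
	v, err := queryDB(key)
	switch {
	case errors.Is(err, ErrNotFound):
		c.set(key, placeholder, time.Minute) // brief placeholder, e.g. one minute
		return "", ErrNotFound
	case err != nil:
		return "", err
	default:
		c.set(key, v, time.Hour)
		return v, nil
	}
}
```

When the business side later creates the row, it simply deletes (or overwrites) the key, so the placeholder never hides fresh data for more than the brief TTL.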
Cache breakdown
Cache breakdown is caused by the expiration of hot data. Because the data is hot, a large number of requests may arrive for it at the very moment it expires. They all miss the cache, and if they all reach the DB simultaneously, the DB comes under huge instantaneous pressure and may even freeze up.
go-zero’s solution: for the same piece of data, we use core/syncx/SharedCalls to ensure that only one request at a time falls on the DB; other requests for the same data wait for the first one to return and share its result or error. Depending on the concurrency scenario, we can choose an in-process lock (when concurrency is not very high) or a distributed lock (for high concurrency). We generally recommend the in-process lock unless absolutely necessary, since a distributed lock adds complexity and cost. Remember Occam’s razor: do not multiply entities beyond necessity.
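Here is a short sketch of how SharedCalls can be used, based on the core/syncx API in the tal-tech/go-zero releases current at the time of writing; the hot key name and the simulated query are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/tal-tech/go-zero/core/syncx"
)

func main() {
	calls := syncx.NewSharedCalls()
	var wg sync.WaitGroup

	// Ten concurrent requests for the same hot key: only the first caller
	// executes the query function; the rest block and share its result.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			val, err := calls.Do("hot-key", func() (interface{}, error) {
				fmt.Println("querying DB") // printed once, not ten times
				time.Sleep(100 * time.Millisecond)
				return "some-value", nil
			})
			fmt.Println(val, err)
		}()
	}
	wg.Wait()
}
```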
Let’s walk through the cache breakdown protection flow above, using different colors for different requests:
- The green request arrives first, finds no data in the cache, and goes to the DB to query
- The pink request arrives asking for the same data; since the green request is already in flight, it waits inside SharedCalls
- The green request returns, and the pink request returns with the result shared from the green request
- Subsequent requests, such as the blue one, fetch the data directly from the cache
Cache avalanche
The cause of a cache avalanche is that a large number of keys loaded at the same time are given the same expiration time, so a huge batch of them expires within a short window. Many requests then fall on the DB at once, the DB’s pressure spikes, and it may even freeze up.
Take online teaching during the epidemic as an example: senior high, junior high, and primary schools start classes in a few staggered time slots, so a large amount of data is loaded at roughly the same time and given the same expiration time. When that expiration time arrives, a wave of DB requests appears; the pressure wave is then passed on to the next cycle and can even pile up.
go-zero’s solution is:
- Use a distributed cache to prevent cache avalanches caused by a single point of failure
- Add a 5% standard deviation to the expiration time; 5% is the empirical p-value used in hypothesis testing (interested readers can look it up)
Let’s run an experiment: if we load 10,000 items with an expiration time of 1 hour and a standard deviation of 5%, the expiration times spread out roughly between 3,400 and 3,800 seconds. With our default expiration time of 7 days, they would spread over a 16-hour window centered on 7 days. Either way, a cache avalanche is effectively prevented.
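As a sketch of the idea (the helper name `jitteredExpiry` is mine, not go-zero’s, and go-zero’s internal implementation may differ in detail), the expiration time can be perturbed like this:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredExpiry spreads expiration times around base with a 5% standard
// deviation, so keys loaded together don't all expire together.
func jitteredExpiry(base time.Duration) time.Duration {
	const stdDev = 0.05
	return time.Duration(float64(base) * (1 + stdDev*rand.NormFloat64()))
}

func main() {
	// Five TTLs around one hour: each differs by a few minutes at most.
	for i := 0; i < 5; i++ {
		fmt.Println(jitteredExpiry(time.Hour))
	}
}
```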
To be continued
In this article, we discussed the common stability problems of cache systems. In the next article, I will analyze the data consistency problems of caches.
The solutions to all of these problems are built into the go-zero microservices framework. If you want a deeper understanding of the go-zero project, please visit the official website, where you will find detailed examples.
Video Playback Address
ArchSummit Architect Summit – Cache architecture design for massive concurrency
Project address
github.com/tal-tech/go-zero
You are welcome to use go-zero, and please star it to support us!
WeChat communication group
Follow the "Microservice Practice" WeChat official account and click "Join Group" to get the QR code for the community group.