Preface

I didn't expect this article to need a sequel. But this week, while looking into Redis performance problems in production, I traced a data issue affecting one user (the No. 1 user on the leaderboard) to a single line of buggy code, and I thought the whole analysis process was worth recording.

Review of the weekly leaderboard scheme

In an earlier leaderboard article, "Design of ranking and liking scheme based on Zset and Bloom Filter", Redis middleware is used extensively. The weekly leaderboard is implemented entirely on the Redis Zset data structure, and functionally it has held up fine, both in testing and in production. The general effect is shown below: the user's basic information on the left and the leaderboard data on the right. The implementation logic of the leaderboard itself was described in the previous article and will not be repeated here.

The user's basic information carries one special requirement: users tied on the same number of stars share a single rank. That is, if the first and second users in the list both have 100 stars, both are displayed as rank 1 and the third user is displayed as rank 2. The Zset API cannot satisfy this on its own, because Zset ranks never collapse ties, so we have to work out the de-duplicated rank ourselves. I mainly use two APIs to do that:

1. Get the reverse rank of the member with the given ID in the sorted set; the rank is a 0-based index: `int reverseRank(zset, name)`
2. Get all members with their scores between a start and stop index, e.g. the top five with `reverseRangeWithScores(zset, 0, 4)`: `List<Object> reverseRangeWithScores(zset, start, stop)`

Combining these two APIs, my logic is to compute the current user's de-duplicated rank: first get the user's raw reverse rank, then fetch the members ranked above that position and collapse the entries that share the same score.
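A minimal sketch of that logic, assuming a Spring Data Redis `ZSetOperations`-style client (the class name, key name and return types here are illustrative assumptions, not the actual production code):

```java
import java.util.Objects;
import java.util.Set;
import org.springframework.data.redis.core.ZSetOperations;
import org.springframework.data.redis.core.ZSetOperations.TypedTuple;

public class WeeklyRankService {

    // Illustrative key name, not the real one used online.
    private static final String WEEKLY_ZSET = "weekly:rank";

    private final ZSetOperations<String, String> zSetOps;

    public WeeklyRankService(ZSetOperations<String, String> zSetOps) {
        this.zSetOps = zSetOps;
    }

    /**
     * De-duplicated (dense) rank: users tied on the same score share one rank.
     * Returns a 1-based rank, or -1 if the user is not in the set.
     */
    public long denseRank(String userId) {
        // 1. Raw 0-based reverse rank of this user in the sorted set (ZREVRANK).
        Long rank = zSetOps.reverseRank(WEEKLY_ZSET, userId);
        if (rank == null) {
            return -1;
        }
        // 2. Everyone ranked strictly above this user: indexes 0 .. rank - 1 (ZREVRANGE ... WITHSCORES).
        //    Note: when rank == 0 the stop index becomes -1, which is the bug analysed later in this article.
        Set<TypedTuple<String>> above =
                zSetOps.reverseRangeWithScores(WEEKLY_ZSET, 0, rank - 1);
        // 3. The dense rank is the number of distinct scores above this user, plus one.
        long distinctScoresAbove = above.stream()
                .map(TypedTuple::getScore)
                .filter(Objects::nonNull)
                .distinct()
                .count();
        return distinctScoresAbove + 1;
    }
}
```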

Locating the problem from slow requests

Spotting the problem

This time the Zset holds the IDs of all of our active users, so the key grows with the user base. Estimated with Redis MEMORY USAGE, it comes to roughly 100-200 MB, a typical Redis big key. Big keys are one of the more serious problems you run into in production, so I paid particular attention to this one and checked the Alibaba Cloud Redis monitoring almost every day.

Sure enough, the monitoring showed some slow requests:

| Time | Duration (ms) | Query |
| --- | --- | --- |
| January 21, 2021 15:01:41 | 33.78 | ZREVRANGE zset 0 -1 withscores |
| January 21, 2021 15:00:26 | 33.11 | ZREVRANGE zset 0 -1 withscores |
| January 21, 2021 14:34:00 | 32.35 | ZREVRANGE zset 0 -1 withscores |
| January 21, 2021 14:25:50 | 33.12 | ZREVRANGE zset 0 -1 withscores |
| January 21, 2021 14:18:45 | 33.59 | ZREVRANGE zset 0 -1 withscores |
These slow requests all sit at around 30 ms. Anyone familiar with Redis will know that it runs entirely in memory with a single-threaded command execution model, so a basic command completes in microseconds; once a command takes more than about 5 ms it noticeably delays the other requests queued behind it, so these clearly qualify as slow requests.

Since our Alibaba Cloud Redis product runs in cluster mode, my first thought was that the CRC16 hash-slot sharding of keys had concentrated hot keys on some shard, and that the high load on that shard was driving up the latency of commands routed to it. So I found the corresponding shard and pulled up its instance metrics for the high-latency periods, shown in the figure below: the instance itself was under no pressure at the times the latency was high.

Pinpointing the root cause

So neither the shard's CPU nor its memory was the bottleneck; the problem had to be in the command itself. If you know the internals of the Redis Zset data structure, you know it is backed by a skip list: a range query first locates the starting node via a logarithmic search and then walks forward or backward to collect the elements, so the cost is O(log M + N), where M is the total number of members and N is the number of elements returned. Given our total number of members and the small ranges the business normally queries, there should not be latency anywhere near this high.

Then look again at the slow command itself: ZREVRANGE zset 0 -1 withscores. Checking the documentation, a stop index of -1 means "return everything up to the last member", i.e. the whole Zset is scanned. N suddenly becomes the entire set, the cost degenerates to O(M), and the request turns slow.
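To make this concrete, a fragment reusing the assumed `WeeklyRankService` sketch above (`zSetOps` and `WEEKLY_ZSET` are the same illustrative names): the only difference between the fast call and the slow one is the stop index.

```java
// Top five entries: roughly ZREVRANGE weekly:rank 0 4 WITHSCORES, cost about O(log M + 5).
Set<TypedTuple<String>> topFive =
        zSetOps.reverseRangeWithScores(WEEKLY_ZSET, 0, 4);

// Stop index -1 means "up to the last member": ZREVRANGE weekly:rank 0 -1 WITHSCORES,
// which returns the whole set, an O(M) full scan. These are the slow requests in the log above.
Set<TypedTuple<String>> everything =
        zSetOps.reverseRangeWithScores(WEEKLY_ZSET, 0, -1);
```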

So where did the -1 come from? Back in my ranking logic, the range query is reverseRangeWithScores(zset, 0, rank - 1). If the user's raw rank is 0, i.e. the user is already first, the stop index becomes -1 and the call scans the entire Zset; on top of that, the subsequent calculation then works on the wrong range and cannot produce a correct first place. That was the bug in production. I reproduced it with the request data of the No. 1 user, and an emergency release went out with logic that handles the first-place calculation as a separate case.
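A sketch of that fix, reusing the assumed `WeeklyRankService` from earlier (again an illustration, not the actual production patch): the rank-0 case is short-circuited before any range query is issued.

```java
public long denseRank(String userId) {
    Long rank = zSetOps.reverseRank(WEEKLY_ZSET, userId);
    if (rank == null) {
        return -1;   // user not on the board at all
    }
    if (rank == 0) {
        return 1;    // already first: nothing above to de-duplicate, so skip the range query entirely
    }
    // rank >= 1 here, so the stop index rank - 1 can never collapse to -1.
    Set<TypedTuple<String>> above =
            zSetOps.reverseRangeWithScores(WEEKLY_ZSET, 0, rank - 1);
    long distinctScoresAbove = above.stream()
            .map(TypedTuple::getScore)
            .filter(Objects::nonNull)
            .distinct()
            .count();
    return distinctScoresAbove + 1;
}
```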

Conclusion

This was a typical case of monitoring production middleware, analysing and locating the problem, and finally fixing it. Looking back, it still leaves me in a cold sweat: our cluster Redis is shared by multiple services, and this key is a big key. Because Redis serves requests on a single thread, a big key in production slows down every other service's requests as well, and one small slip could trigger an avalanche across the whole call chain.

The reason I could spot the problem early is that the risk had already been flagged when the business requirement and its design were reviewed: a leaderboard demands extra attention to Redis, so the problem could be fixed before it grew. We should keep treating this cache middleware with respect. To summarise:

  1. Watch the Redis memory usage over the long term. If it rises sharply, check for keys that never expire and clean them up regularly; running short of memory will eventually stall the service.
  2. Check the logs for hot keys. If there are any, split them across service instances or degrade them to an in-process local cache on each instance (see Guava Cache and Caffeine; a small sketch follows this list), because a hot key puts heavy pressure on the shard it lands on.
  3. Watch the slow-request statistics. A small number of slow requests across different commands at peak time usually means the cluster is close to its performance ceiling and can be scaled out. A large number of slow requests for the same command means the source needs to be located quickly and checked against whether the business really requires it. There are many other reasons Redis can suddenly become slow, and they have to be analysed case by case.
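As mentioned in point 2, one way to take the heat off a hot key is to serve it from an in-process cache on each instance. A minimal Caffeine sketch, purely illustrative (the cache size, TTL and loader below are assumptions, not our production configuration):

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class HotKeyLocalCache {

    // Small local cache in front of Redis: reads of a hot key are absorbed on each instance.
    private final LoadingCache<String, String> localCache = Caffeine.newBuilder()
            .maximumSize(1_000)                        // bound the memory used on this instance
            .expireAfterWrite(Duration.ofSeconds(30))  // tolerate slightly stale data for a hot key
            .build(this::loadFromRedis);               // fall back to Redis on a miss

    public String get(String key) {
        return localCache.get(key);
    }

    private String loadFromRedis(String key) {
        // Illustrative placeholder: real code would call the Redis client here.
        return "value-of-" + key;
    }
}
```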

References

Redis zrangebyscore

Estimated Redis memory usage