Zhang Weifan (NetEase Yanxuan Technical Team)
RedisTemplate is one of the most commonly used Redis data access classes provided by Spring. It is powerful and simple to use, but there is more going on beneath its seemingly simple API. Starting from a production performance problem, this article explores RedisTemplate step by step and uncovers a pitfall in how it issues PSETEX.
I. Background
During an earlier stress test, the response time (RT) of the product details page stayed high, and the yx_malltech_user Redis cluster raised abnormal alarms. The first symptom was a large number of connections on the slave nodes of the Redis cluster, consuming a lot of CPU, even though the services had not enabled read/write splitting. This led to Redis master/slave disconnections and other abnormal behavior.
Using the database's real-time monitoring platform, the DBA found that a large number of CLUSTER commands were being issued. These requests had always been present; the heavy stress-test traffic simply amplified the problem. A single CLUSTER command took more than 30 milliseconds, and all of them landed on one arbitrary node, causing that node's CPU to surge.
The DBA then tried to handle the problem, including but not limited to isolating the abnormal node on an independent machine and tuning configuration parameters, but the problem was not fundamentally solved. We could conclude that the problem was not on the Redis server side but on the business code side, so we needed to start from how the business code used Redis and find out what triggered so many CLUSTER commands.
On September 6, a stress test of the details page alone showed that the problem still existed. Let's start the investigation.
II. Service information
- Service involved: Yanxuan-app
- Scenario involved: product details page
- Spring version: 4.0.8
- spring-data-redis version: 1.8.4.RELEASE
- Jedis version: 2.9.0
III. Analysis process
1. Monitoring and observation
Since there were too many abnormal CLUSTER commands, let's look at the monitoring data first. The data comes from the details-page stress test on the afternoon of September 6.
The monitoring shows that CLUSTER commands are indeed excessive and slow, which affects normal GET and SET operations.
A closer look shows that the abnormal requests are not only CLUSTER but also PSETEX. Normally we do not call PSETEX directly for caching. Let's look at each one individually.
Comparing the PSETEX request graph with the CLUSTER request graph, the two curves are almost identical, so PSETEX and CLUSTER are strongly correlated.
[Figure: PSETEX requests] [Figure: CLUSTER requests]
In addition, the CLUSTER commands were sent to only one machine: 10.130.68.239:16379.
2. Code analysis
The PSETEX curve roughly matches the CLUSTER curve, and PSETEX accounts for many requests. Yet in our own code we never call PSETEX directly; we always go through redisTemplate.opsForValue().set().
This is strange, so let's look at the code.
Our caching logic is encapsulated in a wrapper whose bottom layer calls redisTemplate.opsForValue().set(), so let's step into the RedisTemplate source to see what it does behind the scenes.
The source shows where PSETEX comes from: if the timeout is given in milliseconds, PSETEX is used; for any other unit, SETEX is used. That is one step forward. But why are the PSETEX and CLUSTER commands proportional? Let's keep going.
On the PSETEX path, the node that owns the key's slot is calculated from the key, which requires the cluster topology. Let's go back to the topology method.
At this point the truth is staring us in the face. The cluster topology is cached with an expiration time of 100 ms. When the cache expires, the code loops over the cluster nodes and sends the CLUSTER NODES command to refresh the topology. This explains the correlation between the CLUSTER command and the PSETEX command: with a 100 ms expiration, CLUSTER NODES is bound to be issued frequently. DBA Dafengi confirmed that the large number of CLUSTER commands seen during the stress test were indeed CLUSTER NODES commands. The conjecture was verified.
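The effect of a 100 ms topology cache can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the spring-data-redis code: the class `TopologyCacheSketch`, its fields, and the `simulate` helper are hypothetical names, the clock is injected so the run is deterministic, and the "topology" is just a placeholder string standing in for a parsed CLUSTER NODES reply.

```java
import java.util.function.LongSupplier;

/** Hypothetical model of a topology cache with a fixed time-to-live. */
final class TopologyCacheSketch {
    private final long ttlMillis;
    private final LongSupplier clock;   // injected clock, so the sketch is deterministic
    private long fetchedAt;
    private String cachedTopology;      // stands in for a parsed CLUSTER NODES reply
    private int clusterNodesCalls;      // how often the cluster had to be asked

    TopologyCacheSketch(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached topology while it is fresh, otherwise "sends" CLUSTER NODES. */
    String getTopology() {
        long now = clock.getAsLong();
        if (cachedTopology != null && fetchedAt + ttlMillis > now) {
            return cachedTopology;
        }
        clusterNodesCalls++;            // cache miss: one CLUSTER NODES round trip
        cachedTopology = "topology@" + now;
        fetchedAt = now;
        return cachedTopology;
    }

    int clusterNodesCalls() {
        return clusterNodesCalls;
    }

    /** Simulates one millisecond-timeout write every stepMillis and counts the refreshes. */
    static int simulate(long ttlMillis, long durationMillis, long stepMillis) {
        long[] now = {0};
        TopologyCacheSketch cache = new TopologyCacheSketch(ttlMillis, () -> now[0]);
        for (long t = 0; t < durationMillis; t += stepMillis) {
            now[0] = t;
            cache.getTopology();        // every such write needs the topology
        }
        return cache.clusterNodesCalls();
    }

    public static void main(String[] args) {
        // One write per millisecond for one second, with a 100 ms cache.
        System.out.println(simulate(100, 1000, 1)); // prints 10
    }
}
```

In this model a single client under steady write load refreshes the topology about ten times per second, no matter how many writes share each cache window. Multiplied across many application instances, that is consistent with the steady stream of CLUSTER NODES requests seen in monitoring.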
IV. Solution and verification
1. Solution
The solution is simple: change the code that sets the cache expiration in milliseconds so that it sets the expiration in seconds.
Once the expiration is specified in seconds, SETEX is used, and SETEX does not trigger the CLUSTER command. This avoids the frequent CLUSTER NODES commands caused by PSETEX.
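The change can be sketched as a small helper that converts an existing millisecond timeout to whole seconds before it reaches the cache layer. This is an illustrative sketch, not the team's actual code: `ExpirySketch`, `commandFor`, and `toWholeSeconds` are hypothetical names, `commandFor` only models the unit-based choice described above, and rounding up (so an entry never expires earlier than before) is an assumption about what the business wants.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper illustrating the millisecond-to-second change. */
final class ExpirySketch {

    /** Models the choice described in the article: millisecond timeouts map to PSETEX, all others to SETEX. */
    static String commandFor(TimeUnit unit) {
        return unit == TimeUnit.MILLISECONDS ? "PSETEX" : "SETEX";
    }

    /** Converts a millisecond timeout to whole seconds, rounding up so entries never expire early. */
    static long toWholeSeconds(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeout must be positive: " + timeoutMillis);
        }
        return (timeoutMillis + 999) / 1000;
    }

    public static void main(String[] args) {
        long oldTimeoutMillis = 30_000;                  // hypothetical existing setting
        long newTimeoutSeconds = toWholeSeconds(oldTimeoutMillis);

        System.out.println(commandFor(TimeUnit.MILLISECONDS) + " with " + oldTimeoutMillis + " ms");
        System.out.println(commandFor(TimeUnit.SECONDS) + " with " + newTimeoutSeconds + " s");
    }
}
```

With RedisTemplate, the call site would then pass the converted value together with `TimeUnit.SECONDS`, for example `redisTemplate.opsForValue().set(key, value, seconds, TimeUnit.SECONDS)`.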
2. Verification
After the change went live, we watched the monitoring. The abnormal command volume showed an obvious downward trend during the rollout, and during off-peak hours it dropped to a very low level.
The final stress test confirmed that performance was also very good. The problem was solved.
In the core-link stress test, the MRT of the details page as a whole decreased from over 100 ms to 66 ms.
V. Summary and reflections
1. Summary
This is an interesting problem, because the culprit turned out to be a time unit. Since 1000 ms and 1 s are theoretically the same, most people do not care which one they use. However, Spring left this usage pitfall in its lower layer, and it caused problems under heavy traffic. A small difference made a big difference.
We do not need millisecond precision for cache expiration, so we simply changed it to seconds.
2. Reflections
PSETEX and SETEX are not fundamentally different except for the precision of the expiration time.
Spring chooses between SETEX and PSETEX based on the time unit, and the interface gives no hint of it. I personally feel this is a weak point in Spring's interface design. The details are hidden for the developer's convenience, but the underlying logic is not the same: it is not just a matter of sending a different command, because the two paths also differ in logic such as fetching CLUSTER NODES. This is also a wake-up call for the interfaces and methods we design and expose ourselves. An interface or method should be easy to understand and unambiguous. Otherwise, split it into multiple interfaces and let the user choose, rather than leaving a pitfall for the user.
Follow-up optimization measures:
- Work out the best practices for one version, then standardize the framework version in use. For example, spring-data-redis 1.x standardizes on 1.8.4 with its best practices, and 2.x standardizes on 2.2.4 with its corresponding best practices. This avoids falling into the same pitfalls repeatedly and lets us see farther by standing on the shoulders of giants.
- Establish a standard code review mechanism. With multiple reviews and different people's experience, hidden pitfalls can be uncovered before they cause problems in production.
VI. One more thing
1. Why does the CLUSTER command always hit the same node?
As mentioned earlier, another problem is that the CLUSTER command keeps hitting one node, causing its CPU usage to spike.
The reason lies in the spring-data-redis code that refreshes the topology.
Its for loop iterates over all nodes, sending CLUSTER NODES to each in turn starting from the first, and returns as soon as one of them answers. Our node information rarely changes, so the order of entrySet does not change either, and CLUSTER NODES commands are always sent to the same node.
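The "first node that answers wins" behaviour is easy to reproduce in isolation. The following is a simplified model under stated assumptions, not the spring-data-redis source: `NodeSelectionSketch` and the node names are hypothetical, and the shuffled variant is only one possible way to spread the load, not necessarily what the official fix does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical model of picking a node to ask for the cluster topology. */
final class NodeSelectionSketch {

    /** Asks candidates in order and returns the first one that is reachable. */
    static String firstResponder(List<String> candidates, Map<String, Boolean> reachable) {
        for (String node : candidates) {
            if (Boolean.TRUE.equals(reachable.get(node))) {
                return node;            // first answer wins, the rest are never asked
            }
        }
        throw new IllegalStateException("no reachable node");
    }

    /** Counts which node serves each refresh; a non-null shuffler randomizes the order per refresh. */
    static Map<String, Integer> countHits(Map<String, Boolean> reachable, int refreshes, Random shuffler) {
        Map<String, Integer> hits = new TreeMap<>();
        for (int i = 0; i < refreshes; i++) {
            List<String> candidates = new ArrayList<>(reachable.keySet());
            if (shuffler != null) {
                Collections.shuffle(candidates, shuffler);
            }
            hits.merge(firstResponder(candidates, reachable), 1, Integer::sum);
        }
        return hits;
    }

    /** Three healthy nodes with made-up addresses, in a stable iteration order. */
    static Map<String, Boolean> threeHealthyNodes() {
        Map<String, Boolean> nodes = new LinkedHashMap<>();
        nodes.put("node-a:16379", true);
        nodes.put("node-b:16379", true);
        nodes.put("node-c:16379", true);
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Boolean> nodes = threeHealthyNodes();
        // Stable order: every refresh lands on the first node.
        System.out.println(countHits(nodes, 1000, null)); // prints {node-a:16379=1000}
        // Shuffled order: refreshes are spread across the nodes.
        System.out.println(countHits(nodes, 1000, new Random(42)));
    }
}
```

As long as the first node in iteration order stays healthy, it absorbs every topology refresh from every client, which matches the single hot machine seen in monitoring.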
This bug has been officially fixed: jira.spring.io/browse/DATA…
2. Why is the CLUSTER NODES command slow?
The explanation below is taken from the Internet.
The CLUSTER_SLOTS constant equals 16384, and Redis loops over the slots to assemble the slot information for each node. The CPU therefore runs the 16384-iteration loop at least N times, where N is the number of Redis cluster masters. As the cluster grows and the number of clients issuing CLUSTER NODES increases, the CPU load caused by CLUSTER NODES becomes more and more serious.
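A rough back-of-the-envelope estimate makes the scaling visible. The sketch below only multiplies out the lower bound stated above (16384 slot checks per master per CLUSTER NODES call); `ClusterNodesCostSketch` is a hypothetical name, and the master count, client count, and 100 ms refresh interval in `main` are illustrative assumptions, not measurements from the incident.

```java
/** Hypothetical estimate of the slot-loop work caused by CLUSTER NODES. */
final class ClusterNodesCostSketch {
    static final int CLUSTER_SLOTS = 16384;

    /** Lower bound on slot checks for one CLUSTER NODES call. */
    static long slotChecksPerCall(int masters) {
        return (long) CLUSTER_SLOTS * masters;
    }

    /** Upper bound on refreshes per second for one client whose topology cache lives cacheTtlMillis. */
    static long maxCallsPerSecondPerClient(long cacheTtlMillis) {
        if (cacheTtlMillis <= 0) {
            throw new IllegalArgumentException("cache TTL must be positive");
        }
        return 1000 / cacheTtlMillis;
    }

    /** Slot checks per second if every client refreshes as often as its cache allows. */
    static long slotChecksPerSecond(int masters, int clients, long cacheTtlMillis) {
        return slotChecksPerCall(masters) * maxCallsPerSecondPerClient(cacheTtlMillis) * clients;
    }

    public static void main(String[] args) {
        int masters = 10;   // assumed cluster size
        int clients = 50;   // assumed number of application instances
        System.out.println(slotChecksPerCall(masters));                 // prints 163840
        System.out.println(slotChecksPerSecond(masters, clients, 100)); // prints 81920000
    }
}
```

Under these assumed numbers the node that receives every request performs tens of millions of slot checks per second, which is consistent with its CPU being saturated.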
This performance issue with the CLUSTER NODES command was reported to the Redis maintainers in 2018: github.com/antirez/red…
3. What is the PSETEX command used for?
redis.io/commands/ps…
PSETEX works exactly like SETEX with the sole difference that the expire time is specified in milliseconds instead of seconds.
In other words, the two commands differ only in the precision of the expiration time, so PSETEX should be reserved for scenarios that need highly precise expiration.
Thanks to the DBA for the great optimization work on the server side and for pointing the analysis in the right direction, which let us troubleshoot the problem quickly and clearly from the business code. Wen Yuan's help was indispensable during the business code investigation. Thank you very much!