On Monday morning, right as I got to work, a flood of users suddenly complained that pages were loading very slowly. When I logged in to the server I found that Redis calls were timing out badly, so the cache had turned into a bottleneck: with no data coming back, request responses were slow.
Web monitoring
Through Aliyun's Grafana monitoring, the application server's CPU load, memory, and network I/O all looked normal, so the problem had to be on the Redis side.
Our application uses a single-node 32M 16GB Aliyun Redis instance. Opening the performance monitoring in the web console, we found that its CPU usage had soared to 100%!!
QPS rose from around 1,000 to 6,000, still far below the limit, and the number of connections jumped from 0 to 3,000, also far below the limit (probably because users had just arrived at work and started sending requests; as responses slowed down, commands queued up and more and more connections were opened).
Temporary fix: rent a new Redis instance, point the application servers' Redis configuration at it, and restart the applications so that fewer users would be affected.
Then we kept digging into what exactly was going on inside Redis.
Server command monitoring
Log in with redis-cli and check server status and command statistics with the info command. Sangge summarized two anomalies:
Query the Redis slow log. The top ten slowest entries are all keys * and are extremely time-consuming. Running keys * under the current traffic blocks the service, causing slow queries and high CPU usage. Notably, the application layer does not expose any keys * call, so we needed to check whether some person or background program was triggering the command.
Check the command execution statistics, excluding exec, flushall, and other non-business instructions. Among the business commands, the heaviest were setnx (75 million calls, 6 μs per call on average), setex (84 million calls, 7.33 μs on average), and del (260 million calls, 69 μs on average). hmset averaged 64 μs over 100 million calls, hmget averaged 9 μs over 68 million calls, hgetall averaged 205 μs over 1.4 billion calls, and keys averaged 3,740 μs over 20 million calls.
Generally speaking, the time taken by these commands is proportional to the value size, so we can check whether the data touched by these commands has grown significantly recently. Alternatively, a recent business change that started calling these commands much more frequently would also drive CPU up.
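This next bit is not from the original troubleshooting notes, but as a minimal sketch of how you might spot-check value growth: sample a handful of keys with redis-cli and print each one's memory footprint (MEMORY USAGE requires Redis 4.0+; the user:session:* pattern below is purely hypothetical).

```bash
# Sketch only: scan up to 20 keys matching a hypothetical pattern and
# report how many bytes each value occupies (requires Redis 4.0+).
redis-cli --scan --pattern 'user:session:*' | head -n 20 | while read key; do
  echo -n "$key  "
  redis-cli memory usage "$key"
done
```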
(I forgot to take a screenshot at the time; the following just shows the command and field meanings.)
The info commandstats command outputs Redis command statistics in the following format:
cmdstat_XXX:calls=XXX,usec=XXX,usec_per_call=XXX
where calls is the number of calls, usec is the total CPU time consumed (in microseconds), and usec_per_call is the average CPU time per call (also in microseconds).
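For illustration, here is a made-up sample of what that output looks like (the numbers are invented, not the real stats from this incident):

```bash
xxxxx> info commandstats
# Commandstats
cmdstat_get:calls=3580291,usec=17901455,usec_per_call=5.00
cmdstat_hgetall:calls=1400000,usec=287000000,usec_per_call=205.00
cmdstat_keys:calls=200000,usec=748000000,usec_per_call=3740.00
```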
Slow log: the slowlog get command returns the most recent slow queries recorded by Redis.
(I forgot to grab a screenshot here as well, so below is what the slowlog output looks like.)
xxxxx> slowlog get 10
3) 1) (integer) 411
   2) (integer) 1545386469
   3) (integer) 232663
   4) 1) "keys"
      2) "mecury:*"
The fields in the output represent:
- 1 = unique identifier (ID) of the slow log entry
- 2 = Unix timestamp at which the command was executed
- 3 = execution time of the command, in microseconds; the value here, 232663, is roughly 233 ms
- 4 = the command and its arguments, as an array; the full command is keys mecury:*
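For reference (my addition, not part of the original notes): the slow log itself is governed by two configuration parameters, and can be inspected or cleared with the commands below.

```bash
xxxxx> config get slowlog-log-slower-than   # threshold in microseconds; commands slower than this are logged
xxxxx> config get slowlog-max-len           # maximum number of entries the slow log keeps
xxxxx> slowlog len                          # how many entries are currently stored
xxxxx> slowlog reset                        # clear the slow log
```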
So keys * really was the command behind the slow queries and the CPU spike.
Finally, we posted these statistics and slow commands to the R&D group, and it turned out that another application had mistakenly pointed its configuration at our Redis. That application has a data-crawling business scenario, so a huge number of calls suddenly poured in, with keys * firing constantly, which overwhelmed our Redis. They then corrected their configuration and stopped calling our Redis.
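A side note of mine (we actually found the culprit through the group chat, as described above): if you need to confirm which machines are connected to your Redis, client list prints every connection with its source address and last command. The addresses below are invented for illustration, and the output is abbreviated.

```bash
xxxxx> client list
id=3 addr=10.0.0.15:53422 ... db=0 ... cmd=keys
id=4 addr=10.0.0.21:40818 ... db=0 ... cmd=hgetall
```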
Conclusion
- When Redis jitters, look at the web monitoring first (Aliyun's console does this well!)
- Use commands like info commandstats and slowlog get to check command statistics and slow queries
- Consider optimizing how your code uses Redis (see the sketch after this list)
- If traffic continues to rise, consider upgrading =-=
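A minimal sketch of the kind of optimization meant in the third point above: SCAN walks the keyspace incrementally with a cursor instead of blocking the whole server the way keys * does. The key names below are invented; only the mecury:* pattern comes from the slow log earlier.

```bash
xxxxx> scan 0 match mecury:* count 100
1) "655360"                 # cursor to pass to the next scan call; 0 means iteration is done
2) 1) "mecury:order:1001"   # hypothetical keys matching the pattern
   2) "mecury:user:42"
```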