On Thursday night, one of our servers ran into a very strange situation: CPU load was pinned at 100%, but memory usage stayed low. After NGINX cut traffic to it and it stopped accepting HTTP requests, the server recovered on its own.

So we logged on to the server in question and went through the logs and monitoring to see what the problem was. Both before and after the traffic cut, the ActiveMQ message service and Dubbo RPC calls kept running. The difference is that before the cut the CPU load suddenly jumped to 100%, while memory usage never rose to the OOM level (so there is no dump file).
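
For context, the JVM only writes a heap dump automatically when an actual OutOfMemoryError is thrown and the process was started with the dump-on-OOM options enabled, something like the following (the dump path here is only an example):

    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps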

Start by inspecting the server:

1: Running top on the server shows that the JVM process with PID 30284 is using 100% of the CPU

2: List that process's threads, sorted with the highest CPU consumers first

[admin@xxxxx]# top -Hp 30284

3: Take the PID of the most CPU-consuming thread and convert it to hexadecimal (jstack reports thread IDs in hex, in the nid field)

[admin@xxxxx]# printf "%x\n" 31448
7ad8

4: Run jstack to dump the JVM's thread stacks and grep for the time-consuming thread by its hex ID

[admin@xxxx]# jstack 30284 | grep 7ad8 -A 60


Further analysis:

The MQ consumer thread is in a TIMED_WAITING state, waiting for a message to arrive. Looking at the source code, there is always a timed wait in the consume loop; the MQ client simply keeps waiting for the next message to process, so this state by itself seems to be fine =-=
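
As a rough illustration (a minimal sketch of the pattern, not ActiveMQ's actual source), a consumer thread that polls its queue with a timeout shows up in jstack as TIMED_WAITING whenever it is idle, which is expected and not a problem in itself:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class IdleConsumerSketch {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        public void runLoop() throws InterruptedException {
            while (!Thread.currentThread().isInterrupted()) {
                // Blocks for up to one second; while parked here the thread is
                // reported by jstack as TIMED_WAITING.
                String msg = queue.poll(1, TimeUnit.SECONDS);
                if (msg != null) {
                    handle(msg);
                }
            }
        }

        private void handle(String msg) {
            // the actual message processing would go here
        }
    }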

Looking at the business logs, the onMessage consumption is particularly time-consuming.

Some of the message processing exceeded 100000 ms, which is not normal at all. On several other servers running the same application, message processing speed and the other metrics are normal, so for now we could only cut traffic away from this server.
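
One way to surface this kind of slowness in the logs (a minimal sketch, not our actual listener) is to wrap the onMessage processing in a timer and log any message that crosses a threshold. The 100000 ms threshold mirrors the slow cases above, and processMessage() is a placeholder for the real consumption logic:

    import javax.jms.Message;
    import javax.jms.MessageListener;

    public class TimedMessageListener implements MessageListener {

        private static final long SLOW_THRESHOLD_MS = 100_000L;

        @Override
        public void onMessage(Message message) {
            long start = System.currentTimeMillis();
            try {
                processMessage(message);
            } finally {
                long costMs = System.currentTimeMillis() - start;
                if (costMs > SLOW_THRESHOLD_MS) {
                    // flag abnormally slow messages so they can be traced later
                    System.err.println("slow message: cost=" + costMs + "ms, id=" + safeId(message));
                }
            }
        }

        private void processMessage(Message message) {
            // real business logic goes here
        }

        private String safeId(Message message) {
            try {
                return message.getJMSMessageID();
            } catch (Exception e) {
                return "unknown";
            }
        }
    }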

Several slow messages were also traced, and one database query method turned out to be particularly time-consuming, dragging down the entire message-handling flow.

Checking the message processing on the other servers, the same query method takes one to two seconds there, which is still acceptable. The slow queries on this machine are most likely a consequence of its high CPU load; after all, the Redis and JDBC connections are affected as well.
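
To confirm which step is dragging, the suspected query can be timed on its own so its cost can be compared across servers. This is only a sketch under assumptions: the SQL, table, and DataSource below are placeholders rather than the real query.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import javax.sql.DataSource;

    public class QueryTimer {

        private final DataSource dataSource;

        public QueryTimer(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        public int countOrders(long userId) throws Exception {
            long start = System.nanoTime();
            try (Connection conn = dataSource.getConnection();
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT COUNT(*) FROM orders WHERE user_id = ?")) {
                ps.setLong(1, userId);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    return rs.getInt(1);
                }
            } finally {
                // log the cost so it can be compared with the one-to-two-second baseline on healthy servers
                long costMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("query cost=" + costMs + "ms");
            }
        }
    }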

Traffic to this server was cut off temporarily; once the remaining messages were processed, it recovered on its own. But the "culprit" behind the rising CPU load still has not been found and cannot yet be explained, so we have to keep observing.


Summary:

ActiveMQ is message-oriented middleware that I am still not very familiar with, and I need to learn more about it.


References:

1. TIME_WAIT and CLOSE_WAIT details and solutions

2. ActiveMQ source code parsing (2): What is a session?