For historical reasons, there is a service dedicated to processing MQ messages with Alibaba Cloud RocketMQ (ONS), using SDK version 1.2.6 (released in 2016).

As the business grew, the number of consumers in this application kept increasing and is now close to 200. As a result, the ECS instance hosting the application has been under high load for a long time and triggers alarms frequently.

Symptom analysis

The ECS server hosting the application (it runs only this one service) has been under high load for a long time, yet CPU, I/O, and memory utilization are all low. The following figure shows the system load:

ECS configuration: 4 cores, 8 GB RAM (4 physical CPUs, 1 core per physical CPU, i.e. single-core multi-processor)

For system load, a multi-core CPU behaves much like multiple CPUs: when judging the load, divide the load average by the total number of cores. As long as the load per core does not exceed 1.0, the system is operating normally; in general, a load average below the number of cores N is considered healthy. For example, on this 4-core machine a load of 5 means 5 / 4 = 1.25 per core, i.e. sustained overload.

Applying this rule: load_15m and load_5m sit between 3 and 5, which means the system load has been fairly high over the medium and long term. load_1m fluctuates a lot and is well above the number of CPU cores for long stretches. Busy in the short term and strained in the medium to long term, this is likely the start of congestion.

Locating the cause

Check the cause of high load

Tip: a high system load does not necessarily mean CPU resources are insufficient. A high load only means that too many tasks are waiting in the run queue; those tasks may be CPU intensive, I/O intensive, and so on.

In the figure, load_15, load_5, and load_1 are all greater than the number of cores (4), which indicates the system is overloaded:

  • User CPU time (us) = 8.6%
  • Kernel CPU time (sy) = 9.7%
  • Idle (id) = 80%
  • CPU time spent waiting for I/O (wa) = 0.3%

The CPU, memory, and I/O usage shown above are all low: CPU usage is low while the load is high, so insufficient CPU resources can be ruled out as the cause of the high load.

Next, use vmstat to check the overall state of processes, memory, and I/O, as shown in the following figure:

The results show that the I/O block-in (bi) and block-out (bo) counts are not high, but the system interrupt count (in) and context switch count (cs) are very high. This easily causes the CPU to spend a lot of time saving and restoring resources such as registers, the kernel stack, and virtual memory, shortening the time spent actually running processes and driving the load up.

CPU registers are small but fast memory built into the CPU. The program counter stores the location of the instruction the CPU is currently executing, or of the next instruction to be executed.

They are the environment the CPU depends on to run any task, and are therefore also called the CPU context.

A CPU context switch saves the CPU context of the previous task (its registers and program counter), then loads the context of the new task into the registers and program counter, and finally jumps to the location indicated by the program counter to run the new task.

General direction of the investigation: frequent interrupts and thread switching (since this ECS runs only one Java service, that process is the focus of the investigation).

vmstat only shows the system-wide total of context switches; thread-level context switch information can be viewed with pidstat -wt 1.

Looking at the figure above, it’s easy to see two things:

First, there is an extremely large number of Java threads; second, some threads consistently show 100+ context switches per second. (The specific reasons are analyzed later.)

To verify where these Java threads come from, check the thread count of the application process: cat /proc/17207/status

Number of threads 9749 (off-peak)

Direction of investigation:

  • Too many threads
  • Some threads have too many context switches per second

Start with the main suspect: the threads whose per-second context switch count is too high. Pull a stack dump of the process on the live host, find the thread IDs that switch 100+ times per second, convert each thread ID to hexadecimal, and search for it in the stack dump.
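As a quick illustration of the hex-conversion step (a minimal sketch; the thread id below is hypothetical and only reuses the process id from above as an example):

```java
public class TidToNid {
    public static void main(String[] args) {
        long tid = 17207L;                          // hypothetical thread id taken from pidstat/top -H
        String nid = "0x" + Long.toHexString(tid);  // jstack prints thread ids as nid=0x...
        System.out.println(nid);                    // prints 0x4337
    }
}
```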

From the figure above you can see that the thread state is TIMED_WAITING and the code is in com.alibaba.ons.open.trace.core.dispatch.impl.AsyncArrayDispatcher. Checking a few of the other threads that switch context frequently, their stack traces are basically the same.

Next, look at the problem of too many threads. Analyzing the stack dump reveals a large number of ConsumeMessageThread threads (communication, listener, heartbeat, and similar threads are ignored for now).

Searching the RocketMQ source code for this thread name quickly locates the following code:

From these two pieces of code, the problem can be narrowed down to the MQ consumer's initialization and startup process; the next step is to analyze it against the code.

Code analysis

ConsumeMessageThread_ threads are managed by a thread pool. Look at the key parameters of this thread pool: the core pool size is this.defaultMQPushConsumer.getConsumeThreadMin(), the maximum pool size is this.defaultMQPushConsumer.getConsumeThreadMax(), and the work queue is an unbounded LinkedBlockingQueue.

Since the thread pool's work queue is an unbounded LinkedBlockingQueue (its default capacity is Integer.MAX_VALUE), no worker threads beyond the core size are created until the queue fills up, so the maximum pool size effectively never takes effect.
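A minimal sketch of this ThreadPoolExecutor behavior (not the RocketMQ source; the 20/64 sizes mirror the defaults discussed below): with an unbounded LinkedBlockingQueue, extra submissions are queued instead of growing the pool toward its maximum.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedQueuePoolDemo {
    public static void main(String[] args) throws InterruptedException {
        // Core size 20, max size 64, unbounded queue -- mirrors the consumer defaults.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                20, 64, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        // Submit far more tasks than the core size.
        for (int i = 0; i < 1_000; i++) {
            pool.execute(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            });
        }
        Thread.sleep(500);
        // The pool never grows past the core size because the queue never fills up.
        System.out.println("pool size = " + pool.getPoolSize()); // prints 20
        pool.shutdownNow();
    }
}
```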

Looking at message-consumer's configuration for the core and maximum thread counts, there is no special configuration at the code level, so the system defaults are used, as shown in the figure below.

At this point, we can roughly locate the cause of excessive threads:

Because the number of consuming threads (consumeThreadNums) is not specified, the defaults apply: 20 core threads and a maximum of 64. Each consumer creates a thread pool with 20 core threads when it is initialized, so each consumer most likely ends up with 20 message-consuming threads, and the total thread count spikes to roughly 20 × the number of consumers. However, most of these consuming threads turn out to be in the sleep/wait state and do not contribute to the context switching.

Checking at the code level: this code cannot be found in the open-source RocketMQ code. The application uses the Alibaba Cloud ONS SDK; searching the SDK and following the context and call chain shows that this code belongs to the message trace reporting module.

The workflow of the AsyncArrayDispatcher module, analyzed from the code, is summarized as follows:

The code from the thread stack dump is located in the SDK source as follows:

From this code and the stack trace you can see that the problem occurs at traceContextQueue.poll(5, TimeUnit.MILLISECONDS). traceContextQueue is a bounded blocking queue; when polling, if the queue is empty the thread blocks for a certain period of time, causing it to switch frequently between RUNNABLE and TIMED_WAITING.

poll(5, TimeUnit.MILLISECONDS) is used here instead of take() in order to reduce network I/O.

The poll() method returns the head of the queue and removes it; if the queue is empty it returns null immediately without blocking the current thread (internally it calls dequeue()). It also has an overloaded variant, poll(long timeout, TimeUnit unit), which waits up to the given time if the queue is empty.
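A minimal sketch of this pattern (not the SDK source; the queue capacity and element type are assumptions): a single dispatch thread polling a bounded queue with a 5 ms timeout keeps bouncing between RUNNABLE and TIMED_WAITING whenever the queue is empty.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class TraceDispatchLoopSketch {
    // Bounded trace queue; capacity 2048 is an assumption for illustration.
    private final BlockingQueue<Object> traceContextQueue = new ArrayBlockingQueue<>(2048);
    private volatile boolean stopped = false;

    public void runDispatchLoop() throws InterruptedException {
        while (!stopped) {
            // Wait at most 5 ms for a trace entry; returns null on timeout.
            Object context = traceContextQueue.poll(5, TimeUnit.MILLISECONDS);
            if (context != null) {
                report(context); // hand the entry over for reporting
            }
            // When the queue stays empty, the thread wakes up roughly every 5 ms,
            // i.e. it cycles RUNNABLE -> TIMED_WAITING -> RUNNABLE on the order of
            // 100+ times per second, matching the observed context switches.
        }
    }

    private void report(Object context) {
        // placeholder for submitting the entry to the trace-reporting thread pool
    }
}
```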

traceContextQueue is an ArrayBlockingQueue, a bounded blocking queue backed by an array that uses a lock for concurrent access and orders elements FIFO.

As the code above shows, concurrency control is implemented with a ReentrantLock. ReentrantLock supports both fair and non-fair modes, and ArrayBlockingQueue uses a non-fair lock by default, which does not guarantee that threads access the queue in a fair order.

Fair means blocked threads access the queue in the order in which they blocked; non-fair means that when the queue becomes available, the blocked threads compete for access, so a thread that blocked first may end up accessing the queue last.
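For reference, a small sketch of the JDK constructor that controls this fairness (the capacity is illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueFairnessDemo {
    public static void main(String[] args) {
        // Default constructor: non-fair ReentrantLock, higher throughput.
        BlockingQueue<String> nonFair = new ArrayBlockingQueue<>(1024);
        // Second argument true: fair lock, waiting threads are served in FIFO order.
        BlockingQueue<String> fair = new ArrayBlockingQueue<>(1024, true);
        System.out.println(nonFair.remainingCapacity() + " / " + fair.remainingCapacity());
    }
}
```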

Since each consumer starts only one trace dispatch thread, there is no lock contention in this part.

Take a look at the blocking implementation of ArrayBlockingQueue

As this code shows, the blocking is ultimately implemented with the park method, which is a native method.

The park method blocks the current thread and returns when any of the following four conditions occurs (see the sketch after this list):

  • The matching unpark is executed, or has already been executed
  • The thread is interrupted
  • The timeout specified by the time parameter elapses
  • It returns spuriously (that is, for no apparent reason)
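A minimal sketch of the timed park that underlies a 5 ms poll (general LockSupport usage, not the JDK's internal AQS code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

public class ParkSketch {
    public static void main(String[] args) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(5);
        // The calling thread shows up as TIMED_WAITING in a stack dump while parked.
        LockSupport.parkNanos(nanos);
        System.out.println("returned after timeout, unpark, interrupt, or a spurious wakeup");
    }
}
```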

To summarize, the reasons for the frequent thread switches and interrupts are:

In the message trace reporting module of the Alibaba Cloud SDK, each consumer has one dispatch thread, one trace queue, and one trace-reporting thread pool. The dispatch thread polls the trace queue; if it gets nothing it blocks for up to 5 ms, and when it does get trace data it hands the data to the reporting thread pool, which uploads it.

With so many dispatch threads frequently switching between RUNNABLE and TIMED_WAITING, the load stays high. In addition, because the minimum and maximum number of message-consuming threads is not set at the code level, each consumer starts 20 core threads to consume messages, so an excessive number of threads consumes system resources while mostly running idle.

Optimization scheme

Based on the causes above, targeted optimizations were carried out.

At the code level, expose a thread count configuration item for each consumer, so that each consumer can set its core thread count according to the business it actually carries. This reduces the overall number of threads and avoids a large number of idle threads.
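A hedged sketch of this configuration against the Alibaba Cloud ONS API (the group id, thread count, and omitted credential properties are placeholders; property key names can differ between SDK versions, e.g. older versions use ConsumerId instead of GROUP_ID):

```java
import java.util.Properties;

import com.aliyun.openservices.ons.api.Consumer;
import com.aliyun.openservices.ons.api.ONSFactory;
import com.aliyun.openservices.ons.api.PropertyKeyConst;

public class ConsumerThreadConfigSketch {
    public static Consumer buildConsumer(String groupId, int consumeThreadNums) {
        Properties props = new Properties();
        props.put(PropertyKeyConst.GROUP_ID, groupId);
        // Cap the consuming threads per consumer instead of relying on the 20/64 defaults.
        props.put(PropertyKeyConst.ConsumeThreadNums, String.valueOf(consumeThreadNums));
        // AccessKey, SecretKey, NAMESRV_ADDR and other required properties omitted here.
        return ONSFactory.createConsumer(props);
    }
}
```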

The analysis above is based on version 1.2.6 of the Alibaba Cloud ONS SDK, which has since been iterated to version 1.8.5. Analyzing the trace reporting module of version 1.8.5 shows that a switch has been added for trace reporting, and that trace reporting can be configured to run as a singleton: all consumers share one dispatch thread, one bounded trace queue, and one trace-reporting thread pool. Upgrading to that version can be considered after verification.

