The technical deep dives you haven't seen in a long time are back! Today's post comes from a report by UCloud's operations colleagues that some hosts showed abnormal peaks in per-process CPU usage. This article walks through the complete analysis of the problem and how it was successfully resolved with a hot patch. If you find it useful, feel free to like and share! On to the main text~
It started when the company's operations engineers reported that a few hosts showed abnormal CPU usage peaks. Only a handful out of tens of thousands of machines were affected, a few in ten thousand. A small monitoring glitch that causes no serious consequence such as an outage is easy to overlook. But precisely because the anomaly was so fleeting and so hard to detect, there could be more cases out there, and machines that look normal today could turn abnormal tomorrow. The natural curiosity of kernel developers told us: something must be wrong here! So we tracked it down.
Problem symptoms
Symptom 1: CPU monitoring reads either 0 or 100%
On the affected hosts, the monitored CPU usage of a Redis process is sometimes 100% and sometimes 0; it can even stay at 0 for tens of minutes, burst to 100% for one second, and then drop back to 0, as shown in the figure below.
Statistics across a large number of machines show that the phenomenon does not occur on the 2.6.32 kernel, while several cases exist on the 4.1 kernel. 2.6.32 is our earlier version and has provided solid support for the platform's stable growth; 4.1 satisfies many newer requirements such as new CPUs, new boards, RDMA, NVMe and Binlog2.0. Both versions are maintained seamlessly in the background, with a gradual migration to 4.1 and later for enhanced capability and optimization.
Symptom 2: top reads either 0 or 300%
Logging in to the machine and running a command along the lines of top -b -d 1 -p <pid> | grep <pid>, we can see the process's CPU utilization jump to 300% once every few minutes to tens of minutes, meaning its three threads can each saturate a CPU; this matches the anomaly shown by the monitoring program.
Problem analysis
Both anomalous readings come from the same data source: the user-mode time (utime) and kernel-mode time (stime) of the process in /proc/pid/stat. When we captured the updates of utime and stime, we found that on the affected hosts utime or stime is updated only every few minutes or even every few tens of minutes, and each update jumps by hundreds to more than a thousand ticks, whereas a normal process updates every few seconds with steps of a few dozen ticks.
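For reference, the data source can be sampled with a small probe like the sketch below (this is only a minimal illustration, not our actual monitoring code): it reads utime and stime, which are fields 14 and 15 of /proc/<pid>/stat and are expressed in clock ticks, and prints the step between two samples.

```c
/*
 * Minimal sketch (not the actual monitoring program): sample utime/stime of a
 * process from /proc/<pid>/stat (fields 14 and 15, in clock ticks) once per
 * second and print how much they advanced.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int read_cpu_ticks(const char *pid,
                          unsigned long long *ut, unsigned long long *st)
{
    char path[64], buf[4096], *p;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%s/stat", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    p = strrchr(buf, ')');          /* comm may contain spaces; skip past it */
    if (!p)
        return -1;

    /* skip fields 3..13, then read utime (14) and stime (15) */
    return sscanf(p + 2,
                  "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu",
                  ut, st) == 2 ? 0 : -1;
}

int main(int argc, char **argv)
{
    unsigned long long ut0, st0, ut1, st1;

    if (argc != 2 || read_cpu_ticks(argv[1], &ut0, &st0))
        return 1;
    for (;;) {
        sleep(1);
        if (read_cpu_ticks(argv[1], &ut1, &st1))
            return 1;
        printf("utime +%llu  stime +%llu  (ticks, HZ=%ld)\n",
               ut1 - ut0, st1 - st0, sysconf(_SC_CLK_TCK));
        ut0 = ut1;
        st0 = st1;
    }
}
```

On a healthy process this prints small steps every second; on an affected host the steps stay at 0 for a long time and then jump by hundreds at once.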
Having located the anomaly, we still needed to find its cause. After ruling out the monitoring logic, I/O load and call bottlenecks, the CPU time accounting of the 4.1 kernel itself turned out to be buggy.
Cputime accounting logic
Tracing the code paths that update utime and stime in /proc/pid/stat, we found something suspicious in cputime_adjust():
The suspect is the check utime + stime >= rtime. Here rtime is the runtime, the total CPU time the process has actually consumed as accounted by the scheduler. Normally it is equal, or approximately equal, to the user-mode time plus the kernel-mode time. However, because the kernel is configured with the CONFIG_VIRT_CPU_ACCOUNTING_GEN option, utime and stime each grow monotonically on their own, so their sum can run ahead of rtime.
Every time the kernel updates the utime and stime exposed in /proc/pid/stat, it first compares the previously reported utime + stime against rtime. As long as utime + stime is greater than rtime, the code takes the early exit and /proc/pid/stat is not refreshed; utime and stime start moving again only once rtime has caught up with utime + stime. That is exactly the long freeze followed by a big jump that we observed.
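A simplified sketch of the pre-4.3 cputime_adjust() in kernel/sched/cputime.c (abbreviated, not the verbatim 4.1 source) shows the early exit in question:

```c
/*
 * Simplified sketch of the pre-4.3 cputime_adjust() (kernel/sched/cputime.c).
 * Details are abbreviated; this is not the verbatim 4.1 source.
 */
static void cputime_adjust(struct task_cputime *curr,
                           struct cputime *prev,
                           cputime_t *ut, cputime_t *st)
{
        cputime_t rtime, stime, utime;

        /* rtime: what the scheduler says the task has really run */
        rtime = nsecs_to_cputime(curr->sum_exec_runtime);

        /*
         * If the values reported last time are already ahead of rtime,
         * bail out and keep reporting the old numbers.  While this holds,
         * utime/stime in /proc/<pid>/stat appear frozen.
         */
        if (prev->stime + prev->utime >= rtime)
                goto out;

        stime = curr->stime;
        utime = curr->utime;

        /* ... scale stime/utime so that together they account for rtime ... */

        /* enforce monotonicity of the reported values */
        prev->stime = max(prev->stime, stime);
        prev->utime = max(prev->utime, utime);

out:
        *ut = prev->utime;
        *st = prev->stime;
}
```

Because the tick-sampled utime and stime can grow faster than the scheduler's rtime, the early exit can hold for minutes at a time, and the reported values only catch up in one large step.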
Cold patches and hot patches
Round 1: Cold patch
Searching upstream, we found a changelog for kernel/sched/cputime.c whose point is to make sure stime + utime == rtime. Its description reads: with a tool like top, utilization can show over 100% and then stay at zero for a period of time. Isn't that exactly our problem? After wearing out iron shoes searching everywhere, the answer turned up without any effort at all! (patch link: lore.kernel.org/patchwork/p…)
The patch was merged in kernel 4.3 and later, but was never committed to the 4.1 stable branch, so we back-ported it to our 4.1 kernel. Under stress testing with the patch applied, cputime no longer flips between 100% and 0%; instead it fluctuates smoothly between 0 and 100%.
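For context, here is a simplified sketch of the adjusted logic after the upstream fix, as we understand it (not the verbatim commit): readers are serialized with a new spinlock stored alongside the previously reported values, and stime/utime are derived from rtime so that their sum always equals rtime while each stays monotonic.

```c
/*
 * Simplified sketch of cputime_adjust() after the upstream fix that
 * guarantees stime + utime == rtime; not the verbatim commit.
 */
static void cputime_adjust(struct task_cputime *curr,
                           struct prev_cputime *prev,
                           cputime_t *ut, cputime_t *st)
{
        cputime_t rtime, stime, utime;
        unsigned long flags;

        /* the new lock member the hot patch later has to cope with */
        raw_spin_lock_irqsave(&prev->lock, flags);
        rtime = nsecs_to_cputime(curr->sum_exec_runtime);

        if (prev->stime + prev->utime >= rtime)
                goto out;

        stime = curr->stime;
        utime = curr->utime;

        /* split rtime in proportion to the tick-sampled stime/utime */
        stime = scale_stime(stime, rtime, stime + utime);
        if (stime < prev->stime)        /* keep stime monotonic */
                stime = prev->stime;
        utime = rtime - stime;          /* so utime + stime == rtime */
        if (utime < prev->utime) {      /* keep utime monotonic too */
                utime = prev->utime;
                stime = rtime - utime;
        }

        prev->stime = stime;
        prev->utime = utime;
out:
        *ut = prev->utime;
        *st = prev->stime;
        raw_spin_unlock_irqrestore(&prev->lock, flags);
}
```

The important property is that the reported pair is always a split of rtime, so a monitoring program that sums them can never race ahead of the scheduler's accounting again.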
At this point you might think the problem is solved. In fact it is only half solved, and the "but" is the whole point.
Round 2: Hot patch
Merging this cold patch into our kernel source only fixes newly provisioned servers; the company still has tens of thousands of servers in service that cannot simply be rebooted into an upgraded kernel.
If there were no better option, the in-service fleet would be forced into the following compromise: change the monitoring statistics to stop relying on utime and stime and use runtime to measure the process's execution time instead.
Although this scheme is fast and feasible, it also has major disadvantages:
Many business teams would have to modify their statistics programs, which means high development cost; utime and stime in /proc/pid/stat are standard statistics and some third-party components cannot easily be changed; and the root cause of the inaccurate utime and stime would remain, so users, developers and operations staff would still be confused when using ps and top, creating extra communication and coordination costs.
Fortunately, we could instead rely on a technique that UCloud has already used successfully many times: hot patching.
Hot patching means patching the binary of the kernel or a process that is already loaded into memory, while the faulty server keeps running, so that the program executes the new, correct logic live and online. Think of it as Lord Guan having the poison scraped from his bone while wide awake and without anesthesia. Of course, the kernel feels no pain during the operation, but scrape it badly and it will crash on you immediately, without hesitation, clean and quick.
Hot patch repair
There are two difficulties in this hot patch repair:
Difficulty 1: Making the hot patch
This hot patch adds a spinlock member to a structure, which involves allocating and releasing memory for the new member. Whenever an instance of the structure is copied or freed, the new member needs additional handling; any omission could cause a memory leak or bring the machine down, which raises the risk.
The other issue is that the structure instance is initialized when a process starts, so instances belonging to already-running processes simply do not have the new spinlock member. Our idea was to intercept the code paths where the original patch uses the spinlock member and, if an instance does not yet carry the member, allocate and initialize it on the fly before locking and releasing it as the original patch would, as sketched below.
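The sketch below illustrates the idea only. A hot patch cannot grow the layout of the existing 4.1 structure, so the lock has to live outside it; the helpers hotpatch_lock_lookup() and hotpatch_lock_attach() are hypothetical placeholders for whatever per-instance side table the hot-patching framework provides, not real kernel or kpatch APIs.

```c
/*
 * Conceptual sketch only.  A hot patch cannot change the existing
 * struct cputime layout, so the new lock lives in a side table keyed by the
 * instance address.  hotpatch_lock_lookup()/hotpatch_lock_attach() are
 * hypothetical placeholders for that side table, not real kernel/kpatch APIs.
 */
static raw_spinlock_t *prev_cputime_lock(struct cputime *prev)
{
        raw_spinlock_t *lock;

        /* already attached (new process, or we have seen this instance) */
        lock = hotpatch_lock_lookup(prev);
        if (lock)
                return lock;

        /* instance that predates the patch: allocate and initialize now */
        lock = kmalloc(sizeof(*lock), GFP_ATOMIC);
        if (!lock)
                return NULL;    /* caller falls back to the unpatched path */
        raw_spin_lock_init(lock);

        /* publish it; if another CPU raced us, keep the winner's lock */
        return hotpatch_lock_attach(prev, lock);
}
```

A matching hook on the structure's copy and free paths has to duplicate or release the attached lock; forgetting either case is exactly the kind of omission that leaks memory or crashes the machine.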
The fix therefore meant both climbing a technical mountain and keeping the risk under control. The team wrote scripts that loaded and unloaded the hot patch millions of times in testing; with no memory leaks and completely stable behaviour on a single machine, we were ready to move on to the next stage.
Difficulty 2: The problem is hard to reproduce
The other difficulty is that the problem is hard to reproduce: the only known cases live in the production environment, and we cannot put users' environments at risk just to verify a hot patch. We already have a standardized process for this situation, a carefully designed gray-release strategy, which is also a core concept and capability that UCloud keeps emphasizing internally. After analysis, the verification can be decomposed into checking the stability and the correctness of the hot patch, so we adopted the following gray-release plan:
Stability verification: first hot-patch a few test machines and confirm they behave normally, then hot-patch about 500 of the company's machines of secondary importance and let them run for several days. This verifies stability while keeping the risk under control.
Correctness verification: find a faulty machine and print utime + stime and rtime at the same time (see the sketch after this list). According to the code logic, the old path runs while rtime is less than utime + stime, and the new hot-patch logic runs once rtime exceeds utime + stime. As shown in the figure below, after the new hot-patch logic kicks in, utime + stime prints normally and advances in step with rtime, which verifies the correctness of the hot patch.
Whole-network change: finally, the hot patch is applied in batches to the machines in the production environment, rolling the fix out across the entire fleet and solving the problem at its root. Many thanks to the operations colleagues for their full assistance.
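As an illustration of what the utime + stime versus rtime comparison can look like from user space (the actual verification may well have printed these values from inside the kernel; this is just a minimal sketch), utime and stime can be read from /proc/<pid>/stat, and rtime from the first field of /proc/<pid>/schedstat, which is the time the task has spent on the CPU in nanoseconds:

```c
/*
 * Minimal user-space sketch (the real verification may have printed these
 * values from inside the kernel): compare utime + stime from /proc/<pid>/stat
 * with rtime taken from /proc/<pid>/schedstat (first field, nanoseconds).
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64], buf[4096], *p;
    unsigned long long ut, st, rt_ns;
    FILE *f;

    if (argc != 2)
        return 1;

    /* utime (field 14) and stime (field 15), in clock ticks */
    snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
    f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof(buf), f))
        return 1;
    fclose(f);
    p = strrchr(buf, ')');
    if (!p || sscanf(p + 2,
                     "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu",
                     &ut, &st) != 2)
        return 1;

    /* rtime: total time actually run, first field of schedstat, in ns */
    snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);
    f = fopen(path, "r");
    if (!f || fscanf(f, "%llu", &rt_ns) != 1)
        return 1;
    fclose(f);

    printf("utime+stime = %.2f s, rtime = %.2f s\n",
           (double)(ut + st) / sysconf(_SC_CLK_TCK), (double)rt_ns / 1e9);
    return 0;
}
```

On an unpatched faulty machine, utime + stime stays ahead of rtime and appears frozen; after the hot patch, the two values advance together.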
Conclusion
In summary, we have walked through the complete analysis and resolution of the cputime accounting anomaly. It is not an outage-level problem, but the confusing monitoring data can lead users to believe a machine is overloaded and needs more resources, so fixing it avoids unnecessary cost. It also spares developers, operations and technical-support colleagues the confusion they would otherwise run into with the top and ps commands. In the end, we carefully analyzed and verified the root cause and solved the problem properly with a hot patch.