Background
I am responsible for the development and operation of an Internet of Things platform. On Friday, December 18, 2020, I was thinking I could wrap things up and enjoy a nice weekend, so I decided to take a look at the device data reported in the early hours of the previous day. I opened the browser, typed in the URL, and could not log in; the page reported an internal server error! Something was wrong: the system I look after had been stable for a long time, so how could this happen? Everybody sit down, don't panic, we'll take our time.
The process
First, I logged in to the server over SSH using FinalShell, and what I saw immediately made me start to panic…
A Java process was eating a huge amount of memory and CPU. A quick check of the process ID showed that it was the backend service of the IoT platform (and since the platform is deployed with Docker, the investigation had to continue inside the Docker container).
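As an aside, if it is not obvious which container a busy host PID belongs to, docker top lists each container's processes with their host-side PIDs, so a quick loop over the running containers finds the match (a small optional helper):
for c in $(docker ps -q); do
  echo "== container $c"
  docker top "$c"        # shows the container's processes with their host PIDs
done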
docker exec -it cfe1916c608a /bin/bash
After entering the container with docker exec, I first used the good old top command to see what was going on.
Sure enough, consistent with what we saw on the host, there was a Java process with RES as high as 8.6G and %CPU as high as 277.0 (anything above 100% means the process is using more than one core). At that point I even got a bit excited: another pit to climb out of means another experience worth sharing, and that write-up is what you are reading now. Let's analyze and deal with it step by step…
Next, we approach the analysis from the high CPU usage. Take the process ID (8) and run top -Hp to view the state of each thread in the process:
top -Hp 8
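A quick note on reading this per-thread view (standard procps top behaviour, nothing specific to this box):
# each row of `top -Hp 8` is a single thread of process 8
# the PID column here is the thread ID (decimal); jstack shows the same ID in hex as the nid field
# sort by %CPU (shift+p) and note the IDs of the busiest threads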
We pick one of these threads, PID 13, for analysis. Note that jstack displays thread IDs (the nid field) in hexadecimal, so we first convert the thread ID to hex.
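13 in decimal is d in hexadecimal; one common way to do the conversion in a shell is:
printf '%x\n' 13     # prints: d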
The important step is to run jstack and look at the stack information, filtering for that hex thread ID:
jstack 8 | grep d
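Grepping for a single hex digit works but matches plenty of unrelated lines; a slightly more targeted variant (optional, same idea) matches the nid field directly and prints some trailing context so the whole stack of that thread is visible:
jstack 8 | grep -A 20 'nid=0xd'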
It turned out that 0xa, 0xb, 0xc, and 0xd (threads 10~13 in the output above) are all GC threads. It is most likely the frequent garbage collection that is preventing the rest of the service from working properly. OK, let's verify this with jstat:
jstat -gcutil 8 2000 10
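For readers who do not look at jstat output every day, the -gcutil columns are (standard HotSpot output; a quick reference, not data from this incident):
# S0/S1     survivor space 0/1 utilization (%)
# E         Eden space utilization (%)
# O         old generation utilization (%)
# M         metaspace utilization (%)
# CCS       compressed class space utilization (%)
# YGC/YGCT  young GC count / total young GC time (s)
# FGC/FGCT  full GC count / total full GC time (s)
# GCT       total GC time (s)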
Sure enough, the full GC frequency and time shown in the FGC and FGCT columns were enough to make your scalp tingle. Next, a little trick: use the jmap -histo command to look at the number and size of the objects in the heap:
jmap -histo:live 8 | more
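The histogram output has four columns (again a general reference for HotSpot's jmap, not values from this box):
# num         rank of the class, ordered by total bytes
# #instances  number of live instances of that class
# #bytes      total bytes occupied by those instances
# class name  JVM internal class descriptor (see the legend below)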
From this output we can see that [B and [C occupy a huge amount of memory. What are they? They are byte arrays and char arrays, so a bold guess is that there are an enormous number of String objects…
The class name column uses the JVM's internal type descriptors:
- B  byte
- C  char
- D  double
- F  float
- I  int
- J  long
- Z  boolean
- [  array, e.g. [I for int[]
- [L + class name  other object types
If GC cannot reclaim these objects, the JVM has no way to free the memory they occupy, so the next step is to look at exactly what is filling the heap.
First, we export a heap dump of the current JVM using jmap -dump:
jmap -dump:format=b,file=dump.dat 8
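One optional variation: jmap's dump command also accepts a live suboption, which triggers a full GC first and dumps only the objects that are still reachable, usually producing a smaller file (use it with care on an already struggling process):
jmap -dump:live,format=b,file=dump_live.dat 8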
Then download the dump.dat file from the server to a local machine and analyze it with the Eclipse Memory Analyzer (MAT) tool.
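Since the dump was produced inside the container, it has to be copied out first; a typical two-step route (the paths and hostname here are illustrative, adjust them to your environment) is docker cp to the host followed by scp to your workstation:
docker cp cfe1916c608a:/dump.dat /tmp/dump.dat    # adjust the in-container path to wherever jmap wrote the file
scp user@server:/tmp/dump.dat ./dump.dat          # then pull it down to the local machine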
Overall memory usage
Click the Reports → Leak Suspects link below to generate a report and see what is causing the memory leak.
You can see from the figure that one suspicious object consumes nearly 93.43% of the memory. Click Details to read on.
Remark:
Shallow Heap is the amount of memory occupied by the object itself, excluding the objects it references.
Retained Heap is the size of the object plus all objects that are kept alive only through it, i.e. the amount of memory that would be reclaimed if the object itself were collected.
Click dominator_tree to view the reference tree of the entries heap object and analyze it.
Finally, expand the entries reference tree layer by layer to follow the object references. Eventually something familiar shows up: the data appears to be associated with device_log_mTU_2020-12.
At this point we have roughly located the problem. The next step is to trace the business scenario and analyze the corresponding system code, which finally leads to the root cause:
The following screenshot shows my conversation with the system's developers. In the end, the cause was confirmed: a logic bug in the code led to an endless loop.
Conclusion
Run the top command to check the CPU and memory usage of the processes. If a process's CPU usage is high, run top -Hp <pid> to check the state of each thread inside it. After finding the thread with high CPU usage, convert its thread ID to hexadecimal and look up that thread's state and call stack in the jstack output. There are two cases:
- If it is a normal user thread, use its stack trace to see where in our own code it is burning CPU.
- If it is a VM Thread (or GC thread), monitor the GC status of the system with jstat -gcutil <pid> <period> <times>, and export the current heap with jmap -dump:format=b,file=<filepath> <pid>. Load the exported dump into Eclipse's MAT tool for analysis to find the objects that consume most of the memory, and then fix the relevant code (the full command sequence is summarized below).
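Putting the whole routine together, here is a condensed checklist of the commands used above (the angle-bracket placeholders stand for your own container ID, process ID, and thread ID):
top                                        # find the process with high CPU / memory
docker exec -it <container> /bin/bash      # if the process runs inside a container
top -Hp <pid>                              # find the busiest threads of that process
printf '%x\n' <thread-id>                  # convert the thread ID to hex
jstack <pid> | grep -A 20 'nid=0x<hex>'    # inspect that thread's call stack
jstat -gcutil <pid> 2000 10                # watch GC frequency and time
jmap -histo:live <pid> | more              # see which classes dominate the heap
jmap -dump:format=b,file=dump.dat <pid>    # dump the heap for MAT analysis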
If your organization does not have dedicated monitoring and operations tools, the JDK itself provides a number of monitoring and analysis tools (jps, jstack, jmap, jhat, jstat, hprof, etc.); used flexibly, they can locate roughly 80% of the performance problems in Java projects. Arthas, an open-source Java diagnostic tool from Alibaba, can also help answer questions like the following (a quick-start sketch follows the list):
- From which JAR is this class loaded? Why are all kinds of class-related exceptions reported?
- Why didn’t the code I changed execute? Did I not commit? Got the branch wrong?
- When you hit a problem that cannot be debugged online, is the only option to add logging and redeploy?
- A user's data is processed incorrectly in production, but it cannot be debugged online and cannot be reproduced offline.
- Is there a global view of the health of the system?
- Is there any way to monitor the real-time health of the JVM?
- How can you quickly locate application hotspots and generate a flame graph?
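A minimal way to try Arthas against a running JVM (the download URL and command names below are Arthas' documented defaults; treat the exact options as a sketch rather than a tutorial):
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar       # pick the target Java process from the list
# inside the Arthas console:
#   dashboard       # live view of threads, memory, and GC
#   thread -n 3     # the three busiest threads with their stacks (no hex conversion needed)
#   thread <id>     # the call stack of a specific thread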
References
The system runs slowly, the CPU hits 100%, and Full GC happens too often
jps, jstack, jmap, jhat, jstat, hprof (CSDN blog)