It started just yesterday. I got to work in the morning and opened K8S to check whether the build packaged the night before had been released.

When I opened the container list, I noticed something was wrong. My app had eaten 6GB of memory.

Although I have not been at the company long, I clearly remembered the heap size configured in production: only 1G.

In other words, up to 5G was being used outside the heap. Even at a glance, that looked very much like a memory leak.

Still, that was only a guess, and I didn't think much of it. After all, the service had been running in beta for a while, and the business code seemed unlikely to be causing a memory leak.

I was still too naive... A while later, a tester reported in the group chat that the interface was responding slowly. I checked, and every response was taking more than 2s. Given the suspected leak outside the heap, frequent Full GCs seemed the most likely cause of the pauses.

Sure enough, where there is a problem, it will eventually surface. Well, in that case, it was time to take a closer look.

Analysis of JVM parameters

Enter the Docker container to view the startup information of the Java process

ps -ef | grep java
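If the process command line is too long to read comfortably, the same flags can also be read with jinfo, assuming the JDK tools are available inside the container (replace <pid> with the Java process id; jinfo attaches to the process, so the same user caveat described in the dump section below applies):

jinfo -flags <pid>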

The first few items are the JVM parameters:

-Xmx1g -Xms1g -Xloggc:/opt/app/bin/../log/gc.log -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=5M -XX:HeapDumpPath=/opt/app/bin/../log/HeapDumpOnOutOfMemoryError/

No collector is explicitly specified, so the default Parallel GC is in use, which means Concurrent Mode Failure is not a concern here.

When a Concurrent Mode Failure does happen (under CMS), the JVM falls back to a single-threaded, stop-the-world collector, which causes a long pause on a large heap.

The -Xmx1g setting shows that the heap is not particularly large.
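As a quick sanity check, you can confirm which collector the JDK in use defaults to, for example (assuming a JDK 8 HotSpot JVM; on a typical server-class machine this prints UseParallelGC = true):

java -XX:+PrintFlagsFinal -version | grep -E 'Use(ParallelGC|ConcMarkSweepGC|G1GC)'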

GC Log Analysis

GC logs are important! For applications running in production, any reasonably competent ops team will almost always have the -Xloggc parameter configured.

So thanks to the ops colleagues!

Open the GC log directly inside the container.

1.366: [GC (Allocation Failure) 262144K->13458K(1005056K), 0.0377596 secs]
2.061: [GC (Allocation Failure) 275602K->12506K(1005056K), 0.0303328 secs]
[GC (Allocation Failure) 274650K->12482K(1005056K), 0.0231478 secs]
[GC (Allocation Failure) 274626K->15146K(1005056K), 0.0332410 secs]

The early GC activity looks healthy: no Full GCs, and each collection finishes quickly.

So I pressed vi's Shift + G shortcut to jump to the end of the file, and something was clearly wrong.

349908.124: [Full GC (Ergonomics) 1041404K->1009967K(1044992K), 1.8348046 secs]
349909.987: [Full GC (Ergonomics) 1041399K->1010051K(1044992K), 1.8928675 secs]
349911.906: [Full GC (Ergonomics) 1041395K->1009637K(1044992K), 1.9980643 secs]

As you can see, by this stage every collection is a Full GC taking nearly 2s, and the key point is that a huge number of objects survive each collection: from roughly 1041404K down to only 1009967K, barely 30 MB reclaimed out of a heap of about 1 GB.

At this point, it’s almost certainly a memory leak.

But which object is causing the memory leak? This requires analysis of the application’s memory snapshot.

Generating the dump

Note that the ops team did set the -XX:HeapDumpPath parameter, which should mean that on OOM, a memory snapshot of the application gets written out.

However, when I went to the corresponding directory to look, there was no file there, meaning no OOM had actually occurred!

This is a good place to mention the difference between a memory leak and a memory overflow; if you are interested, see my earlier article on the topic.

In short, at the point I was investigating, the memory leak had not yet led to a memory overflow, so this parameter had not yet kicked in.

Which also, indirectly, supports the theory that this was a memory leak!

PS (added 2020/12/30): the -XX:HeapDumpPath parameter only specifies where the memory snapshot is written; it has to be combined with -XX:+HeapDumpOnOutOfMemoryError to actually generate a snapshot when an OOM occurs.
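Put together, a fairly typical combination for this kind of setup might look like the following (a sketch only; the paths here are illustrative rather than our exact production values):

-Xms1g -Xmx1g
-Xloggc:/opt/app/log/gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=5M
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/opt/app/log/

With both of the last two flags set, the JVM writes an .hprof snapshot into the given directory when an OutOfMemoryError occurs.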

So it was necessary to use the jmap tool to generate a memory snapshot of the application manually.

jmap -dump:file=filename,format=b pid

When jmap runs, it looks up the process information under /tmp/hsperfdata_$user and uses it to attach to the process and generate the memory snapshot.
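You can see these per-user directories for yourself inside the container, for example:

ls -ld /tmp/hsperfdata_*

Each directory is owned by the user that started the corresponding JVM, which is exactly what tripped me up next.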

There was another pitfall here: jmap reported the following error.

Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding

The reason for this error is that entering the container drops you in as root by default, while ops had created a separate user to start the application at deployment time.

Naturally, jmap run as root cannot find the corresponding process information under root's hsperfdata directory.

After switching to the application user, the error went away and a dump file of more than 1 GB was generated.
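Concretely, the fix was along these lines (the user name appuser is illustrative; use whichever user ops started the application with, and the real pid):

su appuser
jmap -dump:file=/tmp/heap.hprof,format=b <pid>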

MAT tool analysis

MAT is the Eclipse Memory Analyzer Tool. To be honest, I am not very proficient with it either, but my rough-and-ready English is enough to more or less find my way around.

After a bit of fumbling, I came across the following information.

See the Top Consumers

As you can see, objects created by a class called ThreadLocalScope are taking up 84% of the memory; these objects are the likely source of the leak.

Clicking further in:

As you can see, many instances of this class are created, but only some of them leak. Inside each one there is a field called wrapped, which holds an ArrayList of log entries, and it is this object that was blowing up the memory.

Digging further, it came down to a problem in the business code. Briefly: another colleague, when using the Jaeger component, had not handled part of the logic properly, which caused objects to pile up. I later reported the problem to him and pointed out the bug so he could fix it.
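To illustrate the general shape of this kind of leak, here is a small, self-contained sketch in Java. It is only an analogue of the pattern, not Jaeger's actual ThreadLocalScope code and not our real business code: a scope object wraps the previously active scope and collects log entries, and if it is never closed, every scope activated on a pooled thread stays reachable through the ThreadLocal chain.

import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical analogue of a ThreadLocal-based tracing scope.
// This is NOT Jaeger's actual ThreadLocalScope implementation, only the shape of the leak.
public class ScopeLeakDemo {

    static final ThreadLocal<Scope> ACTIVE = new ThreadLocal<>();

    static class Scope {
        final Scope wrapped;                          // the scope that was active before this one
        final List<String> logs = new ArrayList<>();  // per-scope log entries

        Scope(Scope wrapped) {
            this.wrapped = wrapped;
        }

        void log(String message) {
            logs.add(message);
        }

        // Restores the previously active scope; forgetting to call this is the bug.
        void close() {
            ACTIVE.set(wrapped);
        }
    }

    static Scope activate() {
        Scope scope = new Scope(ACTIVE.get());
        ACTIVE.set(scope);
        return scope;
    }

    public static void main(String[] args) {
        // Simulate many requests handled by the same pooled thread.
        for (int i = 0; i < 1_000_000; i++) {
            Scope scope = activate();
            scope.log("handling request " + i);
            // BUG: scope.close() is never called, so every Scope (and its logs)
            // stays reachable through the ThreadLocal chain. In a long-running
            // service this shows up as steady memory growth, frequent Full GCs
            // and eventually an OutOfMemoryError.
        }
    }
}

With the scope closed properly (for example in a try/finally block), each request's scope becomes unreachable as soon as the request finishes, and the memory is reclaimed by normal GC.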

I hope sharing this troubleshooting process is helpful to you.