This article is participating in the Java Theme Month – Java Debug Notes Event.

Preface

This article records the process and reasoning behind troubleshooting an off-heap memory leak in an online JVM, and shares some notes on JVM memory allocation along with the common commands and tools used for JVM troubleshooting. I hope it helps you.

During the investigation I took many detours, but in this article I will still lay out the complete train of thought as a lesson for others to learn from, and at the end I summarize several principles for quickly troubleshooting memory leaks.

Main contents of this article:

  • Fault description and troubleshooting process
  • Fault cause and solution analysis
  • How the JVM allocates heap and off-heap memory
  • Introduction and usage of common commands and tools for detecting process memory leaks

Writing articles is not easy; please support my original technology WeChat public account: Back-end Technology Ramble.

Fault description

At noon on August 12, our business service received an alarm: the service process's physical memory usage in the container (16G) had exceeded the 80% threshold and was still rising.

Pulling up the chart in the monitoring system:

It looked like the Java process had a memory leak, while our heap memory limit was 4G.

We confirmed the startup configuration of the service process at that time:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80

Although no new code had gone live that day, that morning we had been using a message-queue repair script to push historical data, and this task would call one of our service's interfaces in large volume, so we initially suspected it was related to that interface.

The following figure shows the change in call volume of that interface on the day:

You can see that the call volume was much higher than usual (5000+ calls per minute versus the normal 200+ calls per minute).

We temporarily stopped the script from sending messages; the interface call volume dropped back to 200+ calls per minute, the container memory stopped its steep climb, and everything seemed to be back to normal.

The next step was to check whether this interface had a memory leak.

The troubleshooting process

First, let's review how memory is allocated in a Java process, which will help with the troubleshooting ideas below.

Take JDK 1.8 as an example, the version we use online. There are already many summaries of JVM memory layout on the web, so I won't repeat them here.

The JVM memory region is divided into two parts: heap and non-heap.

  • Heap area: the young generation and old generation we are familiar with.
  • Non-heap area: as shown in the figure, the non-heap area includes the metaspace and direct memory.

One additional note here: the permanent generation (removed natively in JDK 8) holds classes used by the JVM at runtime, and objects in the permanent generation are garbage collected during a full GC.
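For reference, these regions can also be observed from inside the process through JMX — a minimal sketch for illustration (not part of our service code):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MemoryRegions {
    public static void main(String[] args) {
        // Heap and non-heap usage as the JVM itself reports them
        System.out.println("Heap:     " + ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
        System.out.println("Non-heap: " + ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage());

        // Individual pools, including Metaspace
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getName() + ": " + pool.getUsage());
        }

        // NIO buffer pools ("direct" and "mapped"), i.e. part of the off-heap memory
        for (BufferPoolMXBean buf : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.println(buf.getName() + " buffers: used=" + buf.getMemoryUsed()
                    + " bytes, count=" + buf.getCount());
        }
    }
}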

Having reviewed memory allocation in the JVM, let's get back to the failure.

Heap memory analysis

Although it was fairly clear from the start that the problem was not in the heap, since the leaked memory usage exceeded the 4G heap limit, we still checked the heap memory to be on the safe side.

We looked at the young-generation and old-generation memory usage curves and collection counts, which looked as they usually did, and then dumped the JVM heap memory on the container at the scene of the incident.

Heap memory Dump

Command to dump a heap snapshot:

jmap -dump:live,format=b,file=xxxx.hprof pid

Aside: you can also view live objects in the heap directly using jmap -histo:live <pid>.
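For a quick look without a full dump, the histogram output can be trimmed to the largest classes — a small sketch (pid is a placeholder):

jmap -histo:live <pid> | head -n 20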

Once exported, the dump file was downloaded locally and could then be opened with Eclipse MAT (Memory Analyzer Tool) or the JVisualVM that ships with the JDK.

Open the file using MAT as shown in the figure below:

As you can see, there are some large NIO-related objects in the heap, such as the nioChannel that receives messages from the message queue and nio.HeapByteBuffer, but there are not enough of them to support a judgment.

Next, I started browsing the interface code. The main internal logic of the interface was to call the group's WCS client and write the data looked up from the database table to WCS; there was no other additional logic.

Finding no special logic, I began to suspect a memory leak in the WCS client package. The reason for this suspicion was that the WCS client is built on top of the SCF client; as an RPC framework, its underlying communication transport might allocate direct memory.

Was my code triggering a bug in the WCS client that kept allocating direct memory and eventually exhausted it?

I contacted the WCS on-call engineer and described the problem to them. They replied that they would run a local stress test of the write operation to see whether they could reproduce our problem.

While waiting for their feedback, we set out to figure it out ourselves.

I suspected direct memory: the excessive interface call volume, combined with improper use of NIO by the client, was causing ByteBuffer to keep requesting direct memory.
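To be concrete about what "requesting direct memory through ByteBuffer" means, here is a minimal illustration (not our actual client code) of how NIO direct buffers consume memory outside the heap:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectBufferDemo {
    public static void main(String[] args) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (int i = 0; i < 1024; i++) {
            // Each call reserves 1MB outside the Java heap; the memory is only released
            // when the buffer object is garbage collected. With -XX:MaxDirectMemorySize set,
            // exceeding the cap throws "OutOfMemoryError: Direct buffer memory" instead.
            buffers.add(ByteBuffer.allocateDirect(1024 * 1024));
        }
        System.out.println("Allocated " + buffers.size() + " MB of direct memory");
    }
}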

Aside: as it turned out, this preconceived idea led to a detour in the investigation. It's fine to narrow the scope with reasonable guesses, but it's best to list every possibility first; if one line of investigation goes nowhere, go back and look at the others.

Reproducing in the sandbox environment

To reproduce the failure, I requested a stress-test machine in the sandbox environment and made sure it was consistent with the online environment.

First, we simulated the memory-overflow scenario (calling the interface in large volume):

We let the script keep pushing data and calling our interface, and kept watching the memory footprint.

Once the calls started, memory kept growing and did not appear to be constrained (no Full GC was triggered by any limit).

Next, we simulated the normal call volume (calling the interface at the usual rate):

We routed the interface's normal call traffic (relatively small, called in batches every 10 minutes) to the stress-test machine and got the following trend for old-generation memory and physical memory:

This raised the question: why does the memory keep rising until it fills the machine?

The assumption at the time was that because the JVM process was not configured with a direct memory size limit (-XX:MaxDirectMemorySize), growing direct memory would not trigger a Full GC.

Two conclusions can be drawn from the graph:

  • When the interface is called in large volume and memory is leaking, if the heap (for example, the old generation) never satisfies the Full GC conditions, a Full GC is never triggered and memory just keeps climbing.
  • At low call volume the leak is relatively slow, and a Full GC always comes along eventually and reclaims the leaked part, which is why there had been no problem during normal operation for a long time.

As mentioned above, our process's startup parameters did not limit direct memory, so we added the -XX:MaxDirectMemorySize setting and tested again in the sandbox environment.
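For reference, the cap is a single additional startup flag — a sketch of what we added for the sandbox test (2g is the value we chose, matching the parameters shown later in the article):

-XX:MaxDirectMemorySize=2g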

It turned out that the physical memory used by the process kept growing past our limit; the configuration "seemed" not to work.

This surprised me. Is there a problem with the JVM’s memory limitations?

At this point you can see that I was fixated on a direct memory leak and drifting further and further from the real cause.

Aside: we should trust the JVM's handling of memory. If a parameter seems not to work, first check whether you are using the parameter incorrectly.

Direct memory analysis

To investigate what was in direct memory, I started with direct memory itself. Since direct memory is not as easy to inspect as heap memory, we need some commands to troubleshoot it; I used several approaches to look for the problem.

Viewing process memory with pmap

pmap — report the memory map of a process

The pmap command reports the memory mappings of a process. It is a handy tool for Linux debugging and operations.

pmap -x <pid>

# Sort by RSS if needed:
pmap -x <pid> | sort -n -k3

After running it, I got the following output (truncated):

...
00007fa2d4000000    8660    8660    8660 rw---   [ anon ]
00007fa65f12a000    8664    8664    8664 rw---   [ anon ]
00007fa610000000    9840    9832    9832 rw---   [ anon ]
00007fa5f75ff000   10244   10244   10244 rw---   [ anon ]
00007fa6005fe000   59400   10276   10276 rw---   [ anon ]
00007fa3f8000000   10468   10468   10468 rw---   [ anon ]
00007fa60c000000   10480   10480   10480 rw---   [ anon ]
00007fa614000000   10724   10696   10696 rw---   [ anon ]
00007fa6e1c59000   13048   11228       0 r-x--   libjvm.so
00007fa604000000   12140   12016   12016 rw---   [ anon ]
00007fa654000000   13316   13096   13096 rw---   [ anon ]
00007fa618000000   16888   16748   16748 rw---   [ anon ]
00007fa624000000   37504   18756   18756 rw---   [ anon ]
00007fa62c000000   53220   22368   22368 rw---   [ anon ]
00007fa630000000   25128   23648   23648 rw---   [ anon ]
00007fa63c000000   28044   24300   24300 rw---   [ anon ]
00007fa61c000000   42376   27348   27348 rw---   [ anon ]
00007fa628000000   29692   27388   27388 rw---   [ anon ]
00007fa640000000   28016   28016   28016 rw---   [ anon ]
00007fa620000000   28228   28216   28216 rw---   [ anon ]
00007fa634000000   36096   30024   30024 rw---   [ anon ]
00007fa638000000   65516   40128   40128 rw---   [ anon ]
00007fa478000000   46280   46240   46240 rw---   [ anon ]
0000000000f7e000   47980   47856   47856 rw---   [ anon ]
00007fa67ccf0000   52288   51264   51264 rw---   [ anon ]
00007fa6dc000000   65512   63264   63264 rw---   [ anon ]
00007fa6cd000000   71296   68916   68916 rwx--   [ anon ]
00000006c0000000 4359360 2735484 2735484 rw---   [ anon ]

As you can see, the bottom line is the mapping of the heap, occupying about 4G; the remaining mappings each have a very small footprint, and we still could not pinpoint the problem from this information.

Off-heap memory tracking with NativeMemoryTracking

Native Memory Tracking (NMT) is a feature the HotSpot VM provides to analyze its internal memory usage. We can use the jcmd tool (built into the JDK) to access NMT data.

NMT must first be turned on through the VM startup parameters, but note that there is a 5%-10% performance penalty for turning NMT on.

-XX:NativeMemoryTracking=[off | summary | detail]

# off:     default, NMT is disabled
# summary: collect memory usage aggregated by category only
# detail:  collect memory usage down to individual call sites

Then restart the process and view the native memory with the following command:

jcmd <pid> VM.native_memory [summary | detail | baseline | summary.diff | detail.diff | shutdown] [scale=KB | MB | GB]

# summary:      memory usage aggregated by category
# detail:       summary plus virtual memory usage and individual call sites
# baseline:     take a memory snapshot as the reference point for later comparison
# summary.diff: summary-level comparison against the last baseline
# detail.diff:  detail-level comparison against the last baseline
# shutdown:     turn NMT off
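For a leak that grows slowly, the baseline/diff mode is often the most convenient way to see which NMT category is actually increasing — a quick sketch of that workflow (pid is a placeholder):

# Take a snapshot as the reference point
jcmd <pid> VM.native_memory baseline

# ... keep the service running / keep calling the interface for a while ...

# Show only what has changed since the baseline
jcmd <pid> VM.native_memory summary.diff scale=MB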

We use:

jcmd pid VM.native_memory detail scale=MB > temp.txt

The result is shown as follows:

The information in the figure above did not clearly reveal the problem; at least I could not see the problem from it at the time.

The investigation seemed to be at an impasse.

When hills and streams seem to leave no way out

While the investigation was stalled, we got responses from both WCS and SCF. Both confirmed that their packages had no memory leaks: WCS did not use direct memory at all, and SCF, although it is the underlying RPC protocol, would not leave such an obvious memory bug; otherwise there would already have been plenty of feedback online.

Viewing JVM memory configuration with jmap

At this point, unable to find the problem, I spun up a new sandbox machine, ran the service process, and then ran the jmap command to see the actual JVM memory configuration:

jmap -heap pid

Results obtained:

Attaching to process ID 1474, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.66-b17

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 4294967296 (4096.0MB)
   NewSize                  = 2147483648 (2048.0MB)
   MaxNewSize               = 2147483648 (2048.0MB)
   OldSize                  = 2147483648 (2048.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1932787712 (1843.25MB)
   used     = 1698208480 (1619.5378112792969MB)
   free     = 234579232 (223.71218872070312MB)
   87.86316621615607% used
Eden Space:
   capacity = 1718091776 (1638.5MB)
   used     = 1690833680 (1612.504653930664MB)
   free     = 27258096 (25.995346069335938MB)
   98.41346682518548% used
From Space:
   capacity = 214695936 (204.75MB)
   used     = 7374800 (7.0331573486328125MB)
   free     = 207321136 (197.7168426513672MB)
   3.4349974840697497% used
To Space:
   capacity = 214695936 (204.75MB)
   used     = 0 (0.0MB)
   free     = 214695936 (204.75MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 2147483648 (2048.0MB)
   used     = 322602776 (307.6579818725586MB)
   free     = 1824880872 (1740.3420181274414MB)
   15.022362396121025% used

29425 interned Strings occupying 3202824 bytes.

From the output we can see that the old generation and young generation look quite normal, the metaspace occupies only about 20M, and direct memory looks to be about 2G...

Huh? Why is MaxMetaspaceSize = 17592186044415 MB? That looks like no limit at all.

Take a closer look at our startup parameters:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80

-XX:PermSize=256m -XX:MaxPermSize=512m set the size of the permanent generation. But starting with 1.8, the HotSpot VM removed the permanent generation and replaced it with metaspace, so on 1.8 these two parameters are obsolete — which means our metaspace size was never actually limited.
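For comparison, a sketch of the two generations of size flags — the values here are only an illustrative mapping of the old settings, not what we ultimately deployed:

# JDK 1.7 and earlier: limit the permanent generation
-XX:PermSize=256m -XX:MaxPermSize=512m

# JDK 1.8 and later: the permanent generation is gone; limit metaspace instead
-XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m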

The following diagram depicts the change from 1.7 to 1.8: the permanent generation was replaced by metaspace:

Could it be a metaspace memory leak?

I chose to test locally, making it easy to change parameters and visually see memory changes using the JVisualVM tool.

Observing the running process with JVisualVM

First, restrict the metaspace size with -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=128m, and then call the problematic interface in a loop locally.

As shown below:

As you can see, when the metaspace is exhausted, a Full GC is triggered, metaspace memory is reclaimed, and many classes are unloaded.

Then we removed the metaspace limit and used parameters like the ones that caused the problem earlier:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80 -XX:MaxDirectMemorySize=2g -XX:+UnlockDiagnosticVMOptions

As shown below:

We can see that metaspace usage keeps rising, and the number of loaded classes also rises as the call volume increases, showing a clearly positive correlation.

Another village beyond the willows and flowers

The problem became clear: on every interface call, some class was most likely being created anew and loaded again, eating up metaspace memory.

Observing JVM class loading with -verbose

When debugging a program, you sometimes need to know which classes it loads, when memory is reclaimed, which native interfaces are called, and so on; this is what the -verbose options are for. In MyEclipse you can set it in the run configuration (as shown below), or you can pass it on the command line, for example java -verbose:class.

-verbose:class   print class loading information
-verbose:gc      print JVM memory reclamation (GC) information
-verbose:jni     print native method invocation information

In the local environment, we added the startup parameter -verbose:class and called the interface in a loop.
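For reference, enabling it from the command line is just a matter of adding the flag to the java launch command — a sketch (the jar name is a placeholder):

java -verbose:class -jar our-service.jar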

We could see that countless com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto classes were being created:

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

After many calls had accumulated a certain number of classes, we manually triggered a Full GC to collect the class loaders, and found that a large number of fastjson-related classes were reclaimed.

Using jmap to inspect class loading before the collection, you can also see a large number of fastjson-related classes:

jmap -clstats 7984

With that lead, I went through the code to see where fastjson was used in the logic, and found the following code:

/**
 * Returns a JSON string with snake_case field names.
 */
public static String buildData(Object bean) {
    try {
        SerializeConfig CONFIG = new SerializeConfig();
        CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
        return JSON.toJSONString(bean, CONFIG);
    } catch (Exception e) {
        return null;
    }
}

Back to the problem

Before calling WCS we serialize the entity class's camelCase fields into snake_case fields, which requires fastjson's SerializeConfig — and we instantiated it inside a static method. By default, SerializeConfig creates an ASM proxy class to serialize the target object; that is exactly the com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto class being created over and over above. If we reuse the same SerializeConfig, fastjson looks up the proxy class it has already created and reuses it. But with new SerializeConfig() on every call, the previously generated proxy class is never found, so fastjson keeps generating new WlkCustomerDto proxy classes.

The following two figures show the source code where the problem was located:

We solved the problem by making SerializeConfig a static member of our class:

private static final SerializeConfig CONFIG = new SerializeConfig();

static {
    CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
}
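With the shared config in place, the serialization method itself only needs to reference the static instance — a sketch of the corrected buildData (following the earlier snippet; error handling kept as before):

/**
 * Returns a JSON string with snake_case field names,
 * reusing the shared SerializeConfig so fastjson can reuse its ASM proxy class.
 */
public static String buildData(Object bean) {
    try {
        return JSON.toJSONString(bean, CONFIG);
    } catch (Exception e) {
        return null;
    }
}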

What does fastjson's SerializeConfig do?

A brief introduction to SerializeConfig:

The main function of SerializeConfig is to configure and record the serializer class corresponding to each Java type (an implementation of the ObjectSerializer interface). For example, Boolean.class uses BooleanCodec as its serializer implementation, and float[].class uses FloatArraySerializer. These serializer implementations are either built into fastjson (for example, for Java base classes), generated through the ASM framework (for example, for user-defined classes), or even custom serializers supplied by the user (for example, the framework's default Date serialization uses milliseconds while the application needs seconds). This also raises the question of whether to generate the serializer class via ASM or use JavaBean-based serialization; the decision depends, among other things, on whether the code is running on Android (the environment variable java.vm.name being "dalvik" or "lemur" indicates Android), but that is not the only check — more specific judgments follow.

In theory, if a SerializeConfig instance serializes the same class again, it finds the previously generated proxy class for that class and reuses it. But the service instantiated a new ParseConfig object on every interface call to configure fastjson deserialization; with ASM proxying not disabled, the previously created proxy classes could never be found because the config was a brand-new instance each time. As a result, fastjson kept creating new proxy classes and loading them into metaspace, which ultimately caused metaspace to expand continuously and exhaust the machine's memory.
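To make the difference concrete, here is a minimal, self-contained sketch of the leaky pattern versus the fixed pattern (assuming fastjson 1.2.x on the classpath; DemoDto is a stand-in for the real WlkCustomerDto). Running it with -verbose:class and a small -XX:MaxMetaspaceSize should show ASMSerializer_1_DemoDto classes being loaded over and over in the first loop, but only once for the shared config:

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.PropertyNamingStrategy;
import com.alibaba.fastjson.serializer.SerializeConfig;

public class SerializeConfigLeakDemo {

    public static class DemoDto {
        private String customerName = "test";
        public String getCustomerName() { return customerName; }
    }

    // Reused config: the ASM proxy class for DemoDto is generated once and cached.
    private static final SerializeConfig SHARED = new SerializeConfig();
    static {
        SHARED.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
    }

    public static void main(String[] args) {
        DemoDto dto = new DemoDto();

        // Leaky pattern: a fresh SerializeConfig per call, so fastjson
        // generates and loads a brand-new proxy class every time.
        for (int i = 0; i < 10_000; i++) {
            SerializeConfig leaky = new SerializeConfig();
            leaky.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
            JSON.toJSONString(dto, leaky);
        }

        // Fixed pattern: the cached proxy class is looked up and reused.
        for (int i = 0; i < 10_000; i++) {
            JSON.toJSONString(dto, SHARED);
        }
    }
}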

Upgrade to JDK1.8

The cause of the problem deserves attention: why didn't this happen before we upgraded? That brings us to the differences between the HotSpot VM shipped with JDK 1.8 and JDK 1.7.

Starting with JDK 1.8, the HotSpot VM dropped the old permanent generation and added the metaspace. Metaspace is functionally similar to the permanent generation — its main job is to store class metadata — but the actual mechanism is quite different.

First, the default maximum size of metaspace is the physical memory of the whole machine, so as metaspace keeps expanding, the Java program eats into the system's available memory until the system has none left. The permanent generation, by contrast, had a fixed default size and would not expand to the whole machine's available memory; when its allocated memory was exhausted, a full GC was triggered to collect class metadata (Class objects) in the permanent generation. In metaspace, whether class metadata can be collected is determined by whether the classloader that loaded it can be collected: as long as the classloader is alive, the class metadata it loaded will not be reclaimed. This explains why our two services only ran into trouble after upgrading to 1.8: on earlier JDK versions, although many proxy classes were created on each fastjson call and many proxy Class instances were loaded into the permanent generation, those Class instances were created during method calls and became unreachable once the calls finished, so they were reclaimed whenever a full GC swept the permanent generation.

On 1.8, however, the proxy classes are loaded through the main thread's classloader, which is never collected while the program runs, so the proxy classes it loaded are never collected either, causing metaspace to keep expanding until memory is exhausted.

This problem is not limited to fastjson; it can occur anywhere a program dynamically creates and loads classes. In particular, frameworks often rely heavily on bytecode-enhancement tools such as ASM and javassist. Before JDK 1.8, dynamically loaded classes could in most cases be collected during a full GC, so problems rarely surfaced, and many frameworks and toolkits therefore never addressed the issue; once you upgrade to 1.8, the problem gets exposed.

Conclusion

With the problem solved, I looked back at the whole troubleshooting process, which exposed quite a few of my own shortcomings. The most important one was insufficient familiarity with how memory allocation differs between JVM versions, which led to wrong judgments about the old generation and metaspace, a lot of detours, and a long time wasted investigating direct memory.

Second, an investigation needs to be both careful and comprehensive. It's best to lay out all the possibilities first; otherwise it's easy to lock yourself into a predetermined scope and hit a dead end.

Finally, to summarize the takeaways from this problem:

  • Metaspace is functionally similar to the permanent generation and mainly stores class metadata, but the actual mechanism is quite different.
  • The memory the JVM can use needs to be limited at startup — not only the familiar heap memory, but also direct memory and metaspace; this is the last line of defense for keeping an online service running properly.
  • When using third-party libraries, pay attention to how the calling code is written and avoid obvious memory leaks.
  • Be especially careful with libraries that rely on bytecode-enhancement tools such as ASM (especially on JDK 1.8 and later).

Writing articles is not easy; please support my original technology WeChat public account: Back-end Technology Ramble.

References

Observe the class loading process as the program runs

blog.csdn.net/tenderhearted/article/details/39642275

Metaspace (permanent generation replacement reason, Metaspace features, Metaspace memory view analysis method)

www.cnblogs.com/duanxz/p/35…

Troubleshooting Flowchart for Java Memory Usage Exceptions (including out-of-heap Memory exceptions)

my.oschina.net/haitaohu/bl…

JVM source code analysis out of the heap memory full interpretation

lovestblog.cn/blog/2015/0…

Unload JVM classes

www.cnblogs.com/caoxb/p/127…

Fastjson enables ASM on JDK 1.8

Github.com/alibaba/fas…

Fastjson: PropertyNamingStrategy_cn

Github.com/alibaba/fas…

Beware of Metaspace memory leaks caused by dynamic proxies

blog.csdn.net/xyghehehehe…

Follow me

I'm a back-end development engineer focusing on back-end development, data security, web crawlers, the Internet of Things, edge computing and related areas. Feel free to reach out.

You can find me on these platforms:

  • Wechat official account: A ramble on back-end technology
  • GitHub: @qqxx6661
  • CSDN: @Pretty three knives
  • Zhihu: @ Ramble on back-end technology
  • Jianshu: @pretty three knives a knife
  • Juejin: @pretty three knife knife
  • Tencent Cloud + community: @ Back-end technology ramble

Main topics of my original articles:

  • Back-end development practices
  • Java Interview Knowledge
  • Design pattern/data structure/algorithm problem solving
  • Reading notes / anecdotes / programmer life

Personal public account: Back-end technology ramble

If this article is helpful to you, please like it and bookmark it