This article covers the following:

  • Enabling NMT to view JVM memory usage
  • Running the pmap command to check the physical memory usage of a process
  • Using smaps to view the memory address ranges of a process
  • Dumping memory blocks with the gdb command

Background

Recently I received feedback from the operations team that the RSS of one node of a project was more than twice its Xmx. Because the project runs on its own ECS machine, it could keep running; fortunately the machine had enough memory, otherwise other applications might have been affected.

Troubleshooting the problem

After logging in to the target machine, I ran the top command and pressed c to show full command lines. The RES of the process was over 8 GB (I forgot to take a screenshot at the time), while we had configured -Xmx at only 3 GB. Since the program was running normally, the heap alone could not be taking up that much, which pointed the investigation toward off-heap memory.

The first thing that came to mind was Arthas: attach it to the process and run the dashboard command to check the memory distribution.

However, the non-heap section it shows covers only code_cache and metaspace, which was nowhere near enough to account for the extra memory.

NMT

NMT is short for Native Memory Tracking, a HotSpot feature introduced in Java 7u40. Once enabled, the JVM's memory usage can be tracked with the jcmd command. Note that, according to the official Java documentation, enabling NMT causes a 5%-10% performance penalty.

-XX:NativeMemoryTracking=[off | summary | detail]
# off:     the default; NMT is disabled
# summary: collect aggregate memory usage by category only
# detail:  collect memory usage by individual call sites

Add -XX:NativeMemoryTracking=detail to the startup parameters and restart the project.
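For reference, a minimal sketch of a launch command with the flag in place (the jar name and heap settings here are placeholders, not the project's actual configuration):

java -Xms3g -Xmx3g -XX:NativeMemoryTracking=detail -jar app.jar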

After the process had run for some time, top showed its RES already above 5 GB.

Run the following command:

jcmd <pid> VM.native_memory summary scale=MB
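NMT can also diff against a baseline, which is handy for watching native memory grow over time. These are standard jcmd NMT subcommands, not something the original investigation showed:

jcmd <pid> VM.native_memory baseline
# ... let the process run for a while, then:
jcmd <pid> VM.native_memory summary.diff scale=MB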

The total committed memory of heap plus non-heap added up to just over 2 GB, so where did the remaining 3 GB go?

pmap

On Linux, the pmap command reports the memory mappings of a process's address space.

Running the following sorts the mappings by actual resident memory (RSS is the third column of pmap -x output):

pmap -x <pid> | sort -n -k3 > pmap-sorted.txt

Looking at pmap-sorted.txt, I found a large number of memory blocks of roughly 64 MB.
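A rough way to count them (my own filter, assuming such blocks report a Kbytes value in the 60000-65536 KB range; this command is not from the original investigation):

pmap -x <pid> | awk '$2 >= 60000 && $2 <= 65536 {count++} END {print count, "blocks of ~64MB"}'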

Could this be the classic glibc 64 MB memory problem on Linux? Zhang wrote an article about exactly this problem ("A Java process OOM investigation analysis (glibc)"), so I prepared to follow its troubleshooting approach.

I tried setting the environment variable MALLOC_ARENA_MAX=1 and restarting the project. After it had run for some time, I executed pmap again and saw no change.
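For completeness, a sketch of how the variable would be set before launch (the launch command itself is a placeholder):

export MALLOC_ARENA_MAX=1
java -Xmx3g -jar app.jar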

smaps + gdb

Since we can see so many suspicious memory blocks, is there a way to look at what is inside them? There is: after some searching, I found that gdb can dump a memory block given its address range.

To dump with gdb, you first need an address range. smaps provides detailed information about every memory mapping of a process, including its address range and origin:

cat /proc/<pid>/smaps > smaps.txt

Searching smaps.txt for the memory block in question yields its address range, for example 7fb9b0000000-7fb9b3ffe000.
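To avoid scrolling through the whole file, a rough one-liner can list only the mappings with a large resident size (my own addition, assuming the standard smaps layout; the 60000 KB threshold is arbitrary):

awk '/^[0-9a-f]+-[0-9a-f]+ /{addr=$1} $1=="Rss:" && $2>60000 {print addr, $2 " kB"}' smaps.txt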

Attach gdb to the process:

gdb attach <pid>

Then dump the specified memory range to a file. Note that the addresses taken from smaps need a 0x prefix:

dump memory /tmp/0x7fb9b0000000-0x7fb9b3ffe000.dump 0x7fb9b0000000 0x7fb9b3ffe000
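If you would rather not attach interactively, the same dump can be done in one shot with gdb's batch mode (--batch, --pid, and -ex are standard gdb options; the paths mirror the example above):

gdb --batch --pid <pid> -ex "dump memory /tmp/0x7fb9b0000000-0x7fb9b3ffe000.dump 0x7fb9b0000000 0x7fb9b3ffe000"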

Then print every string of at least 10 characters found in the dump:

strings -10 /tmp/0x7fb9b0000000-0x7fb9b3ffe000.dump
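To see which strings dominate the dump, counting duplicates helps (a standard shell idiom, added here by me):

strings -10 /tmp/0x7fb9b0000000-0x7fb9b3ffe000.dump | sort | uniq -c | sort -rn | head -20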

The output contained a large number of repeated entries (highlighted in the red box in the original screenshot): they were the messages that the backend pushes to the frontend over WebSocket. How could this content be sitting in off-heap memory? Checking the project code, we found that the backend WebSocket is implemented with netty-socketio, pulled in through the following Maven dependency:

<dependency>
    <groupId>com.corundumstudio.socketio</groupId>
    <artifactId>netty-socketio</artifactId>
    <version>1.7.12</version>
</dependency>

This is an open-source Java implementation of the Socket.IO framework; see github.com/mrniko/nett…

A look at the recent release logs showed that the latest version of the framework was 1.7.18, and that several of the releases after 1.7.12 had fixed memory leaks.

So I upgraded the dependency to the latest version and redeployed. Checking again the next day, the RES was still high.

In the meantime I carefully reviewed the code that uses the framework and found that joinRoom was called on connect, but leaveRoom was never called when a connection was dropped. As a result, group messages would still be sent to disconnected clients (in theory this should produce an error, although I never saw a related log entry). I fixed the code and released a new version, let it run for a full day, and still saw no improvement.
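For illustration only, a minimal sketch of the join/leave pairing described above, written against netty-socketio's listener API; the class name, room name, and port are placeholders, not the project's actual code:

import com.corundumstudio.socketio.Configuration;
import com.corundumstudio.socketio.SocketIOServer;

public class PushServer {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.setPort(9092); // placeholder port

        SocketIOServer server = new SocketIOServer(config);

        // Join the room on connect...
        server.addConnectListener(client -> client.joinRoom("unread-count"));

        // ...and leave it on disconnect, so that group broadcasts
        // no longer target dead connections.
        server.addDisconnectListener(client -> client.leaveRoom("unread-count"));

        server.start();
    }
}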

Resolution

At this point it could be confirmed that the memory leak was tied to the WebSocket feature: during the upgrade (first to 1.7.18, then back down to 1.7.17) there was a long window in which the WebSocket service was not exposed at all, and during exactly that window the RES stayed flat with no upward trend. Since the feature is minor and barely noticed by users, after discussing it with the product team we decided to drop WebSocket here and have the frontend request an interface for the data instead. The feature only shows the unread message count of a live post, which very few users care about, so querying on page refresh is good enough. In this way the problem was solved, if in a roundabout way 😁. As for the specific root cause, I still have not found it; it may be a framework bug or a problem in how we used it. If we ever need to rely heavily on WebSocket in the future, we may write our own implementation on top of Netty, which would be more controllable.

Conclusion

Although all this effort ultimately only proved the saying "no requirement, no bug", I still gained a lot from it. Many of these commands I used for the first time, and there were twists and turns along the way that are not all written out here; I picked out the more valuable parts for this summary. I hope it helps anyone facing a similar problem, and that it can serve as a reference the next time one comes up.

Acknowledgements

This investigation drew on a lot of material from the internet. Many thanks to the authors of the following articles for sharing their experience:

  • A Java process OOM investigation analysis (glibc)
  • Java off-heap memory troubleshooting summary
  • Using gdb to dump memory on Linux