Background

To manage projects in a more unified way, we migrated one of the group's projects to the MDP framework (based on Spring Boot), after which the system frequently raised alarms that Swap usage was too high.

The author was called in to help investigate and found that the heap was configured with 4G of memory, yet the process was actually using as much as 7G of physical memory, which was clearly abnormal. The JVM parameters were -XX:MetaspaceSize=256M -XX:MaxMetaspaceSize=256M -XX:+AlwaysPreTouch -XX:ReservedCodeCacheSize=128M -XX:InitialCodeCacheSize=128M -Xss512K -Xmx4g -Xms4g -XX:+UseG1GC -XX:G1HeapRegionSize=4M. The physical memory actually used is shown below:

The memory information displayed by the top command

The troubleshooting process

  1. Use Java-level tools to locate the memory region (heap memory, the Code area, or off-heap memory requested via unsafe.allocateMemory and DirectByteBuffer)

The author added the -XX:NativeMemoryTracking=detail JVM parameter, restarted the project, and ran the jcmd pid VM.native_memory detail command to check the memory distribution, as follows:

JCMD displays memory

The committed memory shown by jcmd is smaller than the physical memory, because the memory jcmd tracks includes the heap, the Code area, and memory requested via unsafe.allocateMemory and DirectByteBuffer, but not off-heap memory requested by other native code. So the author guessed that the problem was caused by memory requested from native code.

To avoid a misjudgment, the author used pmap to look at the memory distribution and found a large number of 64M blocks whose address ranges are not within the address space reported by jcmd, so it could basically be concluded that these 64M blocks were the culprit.

Pmap shows memory

  2. Use system-level tools to locate the off-heap memory

At this point the author had basically confirmed that the problem was caused by native code, and Java-level tools are not well suited to troubleshooting such problems, so only system-level tools could be used to locate it.

First, gperftools was used to locate the problem

Refer to the gperftools documentation for how to use it; its monitoring output is as follows:

Gperftools monitoring

As the figure above shows, the memory requested through malloc peaks at around 3G, is then released, and afterwards stays at 700M-800M. The author's first reaction: does the native code not go through malloc at all, but request memory directly with mmap/brk? (gperftools works by using dynamic linking to replace the operating system's default memory allocator (glibc) with tcmalloc, so it only sees allocations that go through malloc.)
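
To make that guess concrete, here is a minimal sketch (an illustration added here, not code from the original investigation) of requesting memory straight from the kernel with mmap; because malloc is never called, an allocator interposed via LD_PRELOAD, such as tcmalloc, never sees the request:

```c
// build: gcc mmap_direct.c -o mmap_direct
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t size = 64UL << 20; // 64MB, the block size seen in pmap
    // Ask the kernel for memory directly; no malloc is involved, so
    // malloc-level profilers and interposed allocators cannot see it.
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 0, size);  // touch the pages so they count toward RSS
    getchar();           // keep the process alive; inspect it with pmap
    munmap(p, size);
    return 0;
}
```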

Then, use strace to trace system calls

Since gperftools could not track memory requested this way, the command strace -f -e "brk,mmap,munmap" -p pid was used directly to trace memory requests to the OS, but no suspicious memory requests were found. The strace monitoring is shown below:

The strace monitoring

Next, GDB is used to dump suspicious memory

Since strace did not catch any suspicious memory requests, the author decided to look at the memory contents directly, using GDB's dump memory mem.bin startAddress endAddress command, where startAddress and endAddress can be found in /proc/pid/smaps. Then strings mem.bin was used to inspect the dumped contents, as follows:

Contents of the dumped memory

Judging from the contents, it looks like decompressed JAR package information. JAR packages are read when the project starts, so running strace after the project has already started is not very useful; strace should be run while the project is starting, not afterwards.

Again, strace is used to track system calls at project startup

Tracking system calls with strace while the project started showed that a great deal of 64M memory blocks were indeed requested, as shown in the screenshot below:

The strace monitoring

The address space applied for using mmap is as follows:

The pmap entries corresponding to the address space requested via mmap in strace

Finally, use jstack to look at the corresponding thread

The strace output already shows the ID of the thread requesting the memory. The jstack pid command can then be used directly to find the corresponding thread stack (note the conversion between decimal and hexadecimal thread IDs), as follows:

The thread stack corresponding to the memory request seen in strace
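
As a note on that ID conversion (an illustration added here): strace reports thread IDs in decimal, while jstack prints them in hexadecimal as nid=0x..., so a quick conversion is needed; a trivial sketch of the mapping:

```c
#include <stdio.h>
#include <stdlib.h>

// Usage: ./tid2nid 27225   (any decimal thread ID taken from strace)
int main(int argc, char **argv) {
    if (argc < 2) return 1;
    long tid = strtol(argv[1], NULL, 10);  // decimal LWP id from strace
    printf("nid=0x%lx\n", tid);            // the hexadecimal form jstack prints
    return 0;
}
```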

At this point the problem is basically located: MCC uses Reflections to scan packages, and the underlying implementation uses Spring Boot to load the JAR packages. Because decompressing the JARs uses the Inflater class, off-heap memory is needed. Btrace was then used to trace this class, with the following stack:

Btrace tracing stack

Where MCC was used, no package-scanning path had been configured, so by default all packages were scanned. The code was modified to configure the scanning path, and after the fix was released online the memory problem was solved.

  3. Why isn't the off-heap memory freed?

While the problem has been solved, a few questions remain:

  * Why is there no such problem with the old framework?
  * Why isn't the off-heap memory freed?
  * Why are the memory blocks all 64MB? JAR packages cannot be that large, let alone all exactly the same size.
  * Why does gperftools ultimately show only about 700M of memory in use? Does the decompression really not use malloc to request memory?

With these doubts in mind, the author went straight to the source code of the Spring Boot Loader. Spring Boot wraps the JDK's InflaterInputStream and uses Inflater, and the Inflater itself uses off-heap memory to decompress JAR packages. The wrapping class ZipInflaterInputStream does not free the off-heap memory held by the Inflater. Thinking the cause had been found, the author immediately reported the bug to the Spring Boot community. However, following the feedback, the author found that Inflater itself implements the finalize method, which contains the logic to release the off-heap memory. In other words, Spring Boot relies on GC to free the off-heap memory.

Looking at the objects in the heap with jmap showed that there were almost no Inflater objects left, which raised the suspicion that finalize was not being called during GC. With this doubt, the author replaced the Inflater wrapped by the Spring Boot Loader with a custom Inflater and instrumented its finalize method, which confirmed that finalize was indeed being called. The author then looked at the C code behind Inflater and found that it uses malloc to allocate memory at initialization and calls free to release it when end is called.
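
For reference, the native side of java.util.zip.Inflater is built on zlib. The following is a minimal sketch (added here for illustration, not Spring Boot's or the JDK's code) of that allocate-on-init, free-on-end lifecycle using the zlib C API:

```c
// build: gcc inflate_lifecycle.c -lz -o inflate_lifecycle
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    unsigned char raw[] = "hello, off-heap memory";
    unsigned char compressed[128], decompressed[128];

    // Compress a small buffer so there is something to inflate.
    uLongf clen = sizeof(compressed);
    compress(compressed, &clen, raw, sizeof(raw));

    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    // inflateInit allocates zlib's internal state with malloc; this is
    // the native memory that an Inflater instance holds on to.
    inflateInit(&strm);

    strm.next_in   = compressed;
    strm.avail_in  = (uInt)clen;
    strm.next_out  = decompressed;
    strm.avail_out = sizeof(decompressed);
    inflate(&strm, Z_FINISH);
    printf("%s\n", decompressed);

    // inflateEnd frees that state; on the Java side this corresponds to
    // Inflater.end(), which finalize() also calls.
    inflateEnd(&strm);
    return 0;
}
```

If inflateEnd is never reached (or, in Java, end()/finalize() never runs), the state allocated by inflateInit simply stays resident as off-heap memory.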

At this point the only remaining suspicion was that free does not actually release the memory, so the author replaced the InflaterInputStream wrapped by Spring Boot with the JDK's own built-in InflaterInputStream for decompression.

Looking back at the gperftools memory distribution at this point: with Spring Boot in use, memory usage kept increasing, and at a certain moment it suddenly dropped a lot (from about 3G straight down to about 700M). That drop should have been caused by GC and the memory should have been freed, yet no change was visible at the operating system level. Was the memory not returned to the operating system, but held by the memory allocator instead?
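
This behavior is easy to reproduce outside the JVM. Below is a minimal sketch (added for illustration, assuming Linux and glibc): after free(), the blocks go back to the allocator but the process RSS barely shrinks, and only an explicit malloc_trim(0) hands the free pages back to the kernel:

```c
// build: gcc free_vs_rss.c -o free_vs_rss   (Linux + glibc assumed)
#include <malloc.h>   // malloc_trim() is a glibc extension
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Print this process's resident set size from /proc/self/status.
static void print_rss(const char *label) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%-12s %s", label, line);
    fclose(f);
}

enum { N = 50000, SZ = 16 * 1024 };   // ~800MB of small heap blocks

int main(void) {
    static void *blocks[N];
    for (int i = 0; i < N; i++) {
        blocks[i] = malloc(SZ);
        memset(blocks[i], 1, SZ);      // touch the pages
    }
    print_rss("allocated:");

    // Free everything except the most recently allocated block, which
    // keeps the top of the glibc heap pinned.
    for (int i = 0; i < N - 1; i++)
        free(blocks[i]);
    print_rss("after free:");          // RSS stays high: the allocator keeps the pages

    malloc_trim(0);                    // ask glibc to return free pages to the kernel
    print_rss("after trim:");

    free(blocks[N - 1]);
    return 0;
}
```

In other words, free returns memory to the allocator, not necessarily to the operating system, which is exactly the gap between the gperftools numbers and what top shows.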

Exploring further, the author found that the memory address distribution under the system's default allocator (glibc 2.12) differs noticeably from the distribution under gperftools, and that about 2.5G of addresses were found via smaps to belong to the native stack. The memory address distribution is as follows:

Memory address distribution as shown by gperftools

At this point it was almost certain that the memory allocator was playing tricks: starting from version 2.11, glibc introduces a per-thread memory pool (64M in size on 64-bit machines).

The glibc per-thread memory pool

Modifying the MALLOC_ARENA_MAX environment variable as described in that article had no effect. Looking at tcmalloc (the memory allocator used by gperftools), it also uses a memory-pool approach.
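
A small sketch (added here, assuming Linux and glibc) makes the per-thread pool behavior visible: several threads allocating concurrently typically cause glibc to create additional arenas, which malloc_stats() then reports; running the same program with MALLOC_ARENA_MAX=1 caps them:

```c
// build: gcc -pthread arenas.c -o arenas
#include <malloc.h>    // malloc_stats() is a glibc extension
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8

static void *worker(void *arg) {
    (void)arg;
    // Concurrent allocation creates lock contention on the main arena;
    // glibc then gives threads their own arenas, reserved as 64M heaps
    // on 64-bit machines.
    for (int i = 0; i < 100000; i++) {
        void *p = malloc(128);
        free(p);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    // Prints per-arena statistics to stderr; compare with
    //   MALLOC_ARENA_MAX=1 ./arenas
    malloc_stats();
    return 0;
}
```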

To verify whether the memory pool was the culprit, the author simply wrote a memory allocator with no memory pool. Build the dynamic library with gcc zjbmalloc.c -fPIC -shared -o zjbmalloc.so, then replace the glibc memory allocator with export LD_PRELOAD=zjbmalloc.so. The demo code is as follows:

    #include <sys/mman.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    void* malloc(size_t size)
    {
        long* ptr = mmap(0, size + sizeof(long), PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
        if (ptr == MAP_FAILED) {
            return NULL;
        }
        *ptr = size;                // First 8 bytes contain length.
        return (void*)(&ptr[1]);    // Memory that is after length variable
    }

    void* calloc(size_t n, size_t size)
    {
        void* ptr = malloc(n * size);
        if (ptr == NULL) {
            return NULL;
        }
        memset(ptr, 0, n * size);
        return ptr;
    }

    void* realloc(void* ptr, size_t size)
    {
        if (size == 0) {
            free(ptr);
            return NULL;
        }
        if (ptr == NULL) {
            return malloc(size);
        }
        long* plen = (long*)ptr;
        plen--;                     // Reach top of memory
        long len = *plen;
        if (size <= len) {
            return ptr;
        }
        void* rptr = malloc(size);
        if (rptr == NULL) {
            free(ptr);
            return NULL;
        }
        rptr = memcpy(rptr, ptr, len);
        free(ptr);
        return rptr;
    }

    void free(void* ptr)
    {
        if (ptr == NULL) {
            return;
        }
        long* plen = (long*)ptr;
        plen--;                     // Reach top of memory
        long len = *plen;           // Read length
        munmap((void*)plen, len + sizeof(long));
    }

Through instrumentation added to the custom allocator, it could be seen that the off-heap memory actually requested by the application always stays between 700M and 800M after startup, and gperftools also shows memory usage of around 700M-800M. From the operating system's perspective, however, the amount of memory occupied by the process differs considerably (only off-heap memory is counted here).

The author ran a test, scanning packages to different extents with different allocators; the memory occupied is as follows:

Memory test comparison

Why does a custom malloc request 800M and end up taking up 1.7GB of physical memory?

Because the custom memory allocator allocates with mmap, each request is rounded up to a whole number of pages, which wastes a huge amount of space. According to the monitoring, about 536K pages were requested in total, which comes to roughly 536K * 4K (page size) ≈ 2G. Why is this number larger than the 1.7G of physical memory actually observed?

Because the operating system uses lazy allocation: when mmap requests memory from the system, the system only returns an address range and does not allocate physical memory. Only when the memory is actually used does a page fault occur, and only then is a physical page actually allocated.
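
A minimal sketch (added for illustration, Linux assumed) of this demand paging: the mmap call itself barely changes the process RSS, and physical pages only appear once the memory is actually touched:

```c
// build: gcc lazy_pages.c -o lazy_pages
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

// Print this process's resident set size from /proc/self/status (Linux).
static void print_rss(const char *label) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%-15s %s", label, line);
    fclose(f);
}

int main(void) {
    size_t size = 1UL << 30;  // reserve 1GB of address space
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    print_rss("after mmap:");      // RSS barely changes: no physical pages yet
    memset(p, 1, size / 4);        // touch 256MB -> page faults allocate pages
    print_rss("after touching:");  // RSS grows by roughly 256MB
    munmap(p, size);
    return 0;
}
```

This is why the roughly 2G of pages requested by the custom allocator shows up as only about 1.7G of physical memory: pages that were never touched were never actually allocated.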