This post documents the troubleshooting of an off-heap memory leak on a stable, long-running system with no recent code changes. I actually ran into this problem last year, but laziness kept me from writing it up. New year, new habits, so this time I am finally paying off that debt.

How it started

At midnight one night, a large number of online services suddenly started reporting errors, and the error trace pointed to the system I was responsible for _(:з」∠)_. I quickly logged onto the machine to see what was wrong, but the server did not respond at all, so I had to ask ops to force a restart. After the restart the service ran smoothly for a while under observation, and the crisis was temporarily over.

Problem analysis

Locating the problem

Checking monitoring & logs

The first reaction to a service failure is to check the monitoring. The memory charts (Figure 1) show the heap usage staying flat while the memory used by the process keeps climbing, which strongly suggests an off-heap memory leak (the blank section at the start is because the earlier data had already been cleared by the time I took the screenshot). Checking the system logs with dmesg confirmed that physical memory had been exhausted and the Tomcat process had been killed, which is why the service became unavailable. A dump on OOM is only produced when an OutOfMemoryError occurs inside the heap; the JVM can do nothing about off-heap memory, so the trail went cold here for the time being.

Figure 1. Heap and process memory monitoring

Adding an alarm & exporting a dump file

Since there was no way to reproduce the failure scenario from that night, I asked ops to add a memory-usage alarm and waited to see whether it would happen again. Fortunately (or not), the alarm fired a few days later, so I quickly had ops export a dump file and started the investigation.

Analyzing the dump file with MAT

When I imported the dump file, MAT asked whether I wanted to generate a Leak Suspects report. The report showed a large number of objects being held in a Hashtable field named records on a sun.java2d.Disposer object (Figure 2). The Disposer class exists to release resources: an object and its corresponding DisposerRecord are registered in records via Disposer#addObjectRecord, and when the object is collected its reference is enqueued on a ReferenceQueue. A daemon thread named "Java2D Disposer" polls that queue, takes the matching DisposerRecord out of records, and calls DisposerRecord#dispose to release the resource. From this you can already guess what happened: some object that had allocated off-heap memory was, for some reason, never collected by the GC, so its dispose method never ran and the off-heap memory it occupied was never released.

Figure 2. The suspected leaking object located by MAT

Figure 3. Classes referenced from the records field of Disposer
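To make this mechanism concrete, here is a minimal sketch of the Disposer pattern described above. It is my own illustration, not the JDK source; the method and thread names mirror the real ones, but the implementation is simplified.

    import java.lang.ref.PhantomReference;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.util.Hashtable;

    // Simplified model of the Disposer idea: register a record that knows how
    // to free a native resource, keyed by a reference to the owning object.
    class MiniDisposer implements Runnable {

        interface DisposerRecord {
            void dispose();
        }

        private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
        private final Hashtable<Reference<?>, DisposerRecord> records = new Hashtable<>();

        void addObjectRecord(Object owner, DisposerRecord record) {
            // The reference is enqueued once the owner becomes unreachable.
            records.put(new PhantomReference<>(owner, queue), record);
        }

        @Override
        public void run() {
            while (true) {
                try {
                    Reference<?> ref = queue.remove();      // blocks until an owner has been collected
                    DisposerRecord record = records.remove(ref);
                    if (record != null) {
                        record.dispose();                   // free the native / off-heap resource
                    }
                    ref.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        static MiniDisposer start() {
            MiniDisposer disposer = new MiniDisposer();
            Thread t = new Thread(disposer, "Mini Disposer"); // the real thread is "Java2D Disposer"
            t.setDaemon(true);
            t.start();
            return disposer;
        }
    }

The crucial point is that dispose only runs after the owner has actually been collected; if something, such as a cache entry, keeps the owner reachable, its record just sits in records and the native memory is never freed.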

Tracking memory allocation with gperftools

gperftools works by replacing malloc with its own tcmalloc so that it can record every memory allocation; I won't repeat the specific usage here. From the MAT report you can see that all the objects held in Disposer's records are Font related, and the only Font-related code we have is in the image-generation path, so I pulled that code out and traced its memory allocations with gperftools. The result shows that the off-heap memory is allocated in the T2KFontScaler#initNativeScaler method (Figure 4).

Figure 4. The memory allocation call chain traced by gperftools
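For context, the image-generation code in question looked roughly like the sketch below. This is a reconstruction for illustration only (the class name, font path and sizes are made up, not our actual project code); the point is that a new Font was created from a font file on every request.

    import java.awt.Color;
    import java.awt.Font;
    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.File;

    public class PosterRenderer {

        // Called once per request. Every call to Font.createFont loads the font
        // again, and the scaler behind the new Font allocates native memory when
        // text is drawn, memory that is only released once its Disposer record
        // eventually runs.
        public BufferedImage render(String text) throws Exception {
            Font base = Font.createFont(Font.TRUETYPE_FONT, new File("/data/fonts/example.ttf"));
            Font font = base.deriveFont(Font.PLAIN, 24f);

            BufferedImage image = new BufferedImage(400, 200, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = image.createGraphics();
            try {
                g.setFont(font);
                g.setColor(Color.BLACK);
                g.drawString(text, 20, 100);
            } finally {
                g.dispose();
            }
            return image;
        }
    }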

Locating the faulty code

Once the entry point of the off-heap allocation is known, things get much easier. Stepping through the code shows that the FileFont object created along the Font2D#getStrike path ends up holding a reference to the T2KFontScaler object through its scaler field, while Font2D's strikeCache holds a soft reference that keeps it alive. As a result, as long as the FileFont has not been collected, T2KFontScaler#dispose never gets a chance to run.

    public abstract class Font2D {
        private FontStrike getStrike(FontStrikeDesc var1, boolean var2) {
            FontStrike var3 = (FontStrike)this.lastFontStrike.get();
            if (var3 != null && var1.equals(var3.desc)) {
                return var3;
            } else {
                Reference var4 = (Reference)this.strikeCache.get(var1);
                if (var4 != null) {
                    var3 = (FontStrike)var4.get();
                    if (var3 != null) {
                        this.lastFontStrike = new SoftReference(var3);
                        StrikeCache.refStrike(var3);
                        return var3;
                    }
                }
                if (var2) {
                    var1 = new FontStrikeDesc(var1);
                }
                var3 = this.createStrike(var1);
                int var5 = var1.glyphTx.getType();
                if (var5 != 32 && ((var5 & 16) == 0 || this.strikeCache.size() <= 10)) {
                    // The new strike is cached behind a reference created by StrikeCache
                    // (a SoftDisposerRef by default), which keeps it, and the scaler it
                    // references, alive until the soft reference is cleared.
                    var4 = StrikeCache.getStrikeRef(var3);
                } else {
                    var4 = StrikeCache.getStrikeRef(var3, true);
                }
                this.strikeCache.put(var1, var4);
                this.lastFontStrike = new SoftReference(var3);
                StrikeCache.refStrike(var3);
                return var3;
            }
        }
    }

The cause of the problem

Going back to the failure itself: the system had been running steadily for a long time with no code changes, so why did this suddenly happen? It turned out that a few days earlier, in preparation for a big promotion, we had run a load test. We found that GC pauses were long and a FullGC occurred roughly every two hours, so we increased the heap size accordingly to avoid the FullGCs. The problem is that a larger heap means the strikeCache holds more FontStrike soft references, and the T2KFontScaler objects they keep alive occupy more off-heap memory. At the same time, with FullGC happening less often, those soft references are not cleared in time, so the Disposer cannot release the off-heap memory that the T2KFontScalers have allocated. On top of that, physical memory is fixed, so a bigger heap leaves less room for off-heap memory: one grows while the other shrinks, and the off-heap memory is exhausted before a FullGC ever gets a chance to free it. That eventually triggers Linux's oom-killer, which kills the whole Tomcat process to reclaim memory.
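The soft-reference behaviour at the heart of this can be seen in a tiny standalone demo (my own illustration, unrelated to the project code): a SoftReference normally survives ordinary GCs and is only cleared when the heap itself comes under pressure, so the bigger the heap, the longer its referent, and everything that referent points to, stays alive.

    import java.lang.ref.SoftReference;
    import java.util.ArrayList;
    import java.util.List;

    public class SoftRefDemo {
        public static void main(String[] args) {
            SoftReference<byte[]> ref = new SoftReference<>(new byte[8 * 1024 * 1024]);

            System.gc();
            // With plenty of free heap, the soft reference is normally kept.
            System.out.println("after plain GC, still alive: " + (ref.get() != null));

            // Fill the heap: the JVM clears soft references before throwing OOME.
            List<byte[]> pressure = new ArrayList<>();
            try {
                while (ref.get() != null) {
                    pressure.add(new byte[8 * 1024 * 1024]);
                }
            } catch (OutOfMemoryError expected) {
                // all softly reachable objects are guaranteed to have been cleared by now
            }
            pressure = null;
            System.out.println("under heap pressure, still alive: " + (ref.get() != null));
        }
    }

Run it with a small -Xmx and then a large one: the larger the heap, the more has to be allocated before the soft reference goes away, which is exactly why enlarging our heap let the FontStrike entries, and the off-heap memory behind their scalers, pile up.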

The solution

Once the cause of the problem is identified, the fix is straightforward. One option is to set the JVM property -Dsun.java2d.font.reftype=weak so that the font strike cache holds weak references and the memory is freed promptly. In our case the system only uses a handful of fonts, so simply creating them once at project startup and keeping them as static variables, instead of creating new Font objects on every request, solves the problem.
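A minimal sketch of that second fix, continuing the hypothetical PosterRenderer example from earlier (names and paths are still illustrative): load the font once in a static initializer and reuse it for every request instead of calling Font.createFont each time.

    import java.awt.Font;
    import java.io.File;

    public final class Fonts {

        // Loaded exactly once when the class is initialized; every request then
        // reuses the same underlying font instead of creating a new one.
        public static final Font DEFAULT_24;

        static {
            try {
                Font base = Font.createFont(Font.TRUETYPE_FONT, new File("/data/fonts/example.ttf"));
                DEFAULT_24 = base.deriveFont(Font.PLAIN, 24f);
            } catch (Exception e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        private Fonts() {
        }
    }

The rendering code then simply calls g.setFont(Fonts.DEFAULT_24), so only one scaler, and one slice of off-heap memory, ever exists per font.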

Afterword

This time we used MAT's Leak Suspects report to locate the object that was probably leaking and the corresponding code, then used gperftools to track down the method that actually allocates the off-heap memory, and finally found the cause of the leak. But what if the code had been hard to pin down, or the off-heap memory had not been allocated through malloc? Could the cause have been located with MAT alone? In fact, yes. Look more closely at the classes referenced from the Disposer entries (Figure 3): the WeakReference is the key of each entry, and the other classes are implementations of DisposerRecord, meaning they implement their resource-release logic in DisposerRecord#dispose. If a class's dispose method contains logic that frees memory, then that class is the one that allocated it. Following this line, you quickly find that T2KFontScaler's dispose path calls disposeNativeScaler, a native method, to free the memory.

    class T2KFontScaler extends FontScaler {
        private synchronized void disposeScaler() {
            // disposeNativeScaler is a native method that actually frees the off-heap memory.
            this.disposeNativeScaler(this.nativeScaler, this.layoutTablePtr);
            this.nativeScaler = 0L;
            this.layoutTablePtr = 0L;
        }

        public synchronized void dispose() {
            if (this.nativeScaler != 0L || this.layoutTablePtr != 0L) {
                Runnable var2 = new Runnable() {
                    public void run() {
                        T2KFontScaler.this.disposeScaler();
                    }
                };
                (new InnocuousThread(var2)).start();
            }
        }
    }

In addition, the name sun.font.StrikeCache$SoftDisposerRef already tells us that this object is a soft reference, and expanding it in MAT shows that it does hold a reference to a T2KFontScaler. So MAT alone is in fact enough to pinpoint where the leak comes from.
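This also ties back to the -Dsun.java2d.font.reftype=weak option mentioned in the solution. Conceptually (this is a paraphrase of the idea, not the actual JDK source), the strike cache decides once, based on that property, whether cached strikes are wrapped in soft or weak references:

    import java.lang.ref.Reference;
    import java.lang.ref.SoftReference;
    import java.lang.ref.WeakReference;

    public class RefTypeSketch {

        // Read once: "soft" (the default) keeps strikes alive until heap pressure,
        // "weak" lets them go at the next GC after they stop being strongly reachable.
        static final boolean CACHE_REF_TYPE_WEAK =
                "weak".equals(System.getProperty("sun.java2d.font.reftype", "soft"));

        static Reference<Object> wrapStrike(Object strike) {
            return CACHE_REF_TYPE_WEAK
                    ? new WeakReference<>(strike)
                    : new SoftReference<>(strike);
        }
    }

With weak references the scalers are released promptly after each GC, at the cost of having to rebuild strikes more often.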

Conclusion

This article documented the troubleshooting of an off-heap memory leak that was exposed by a heap-size adjustment. First, monitoring and system logs pointed to a likely off-heap leak; then a dump file captured when the leak recurred was analyzed with MAT to locate the code that was probably leaking; after that, gperftools was used to trace the class that actually allocates the memory, and the leaking code was confirmed in the source; finally, combined with the project's recent JVM tuning, the root cause was found and fixed. Reviewing the whole process afterwards also turned up a more direct way to locate the cause with MAT alone.

After procrastinating for more than half a year I have finally filled this pit. I had read plenty of books and articles on fault diagnosis and tuning, but when a real problem hit, my thinking was muddled and I took a lot of detours; just getting comfortable with MAT and the other tools was enough of a struggle. What you learn on paper always feels shallow; to really understand something you have to do it yourself. I learned a lot from this failure: Linux system logs and the oom-killer, how off-heap memory is allocated and reclaimed, and how to use the various memory-analysis tools. Writing down what you learn is a good habit to start the new year with.

References

Font.create() causes OOME problems
Using gperftools to locate Java memory leaks on macOS