The symptoms
One morning my phone alerted: the memory usage of one of our online machines had exceeded 90%. At the time I assumed a scheduled task was running, and since it was already late at night I did not investigate the specific cause. There was a release the next morning, memory came back down, and the problem went uninvestigated during the day. When the evening peak arrived, the over-90% memory alarm fired again, and after a while one of the online machines went down.
Preserving the scene
When the machine went down, I restarted it immediately and dumped the heap of the other machine:
jmap -dump:format=b,file=<filename> <pid>
A small hiccup occurred during the dump: the process could not be dumped.
This happens when the current user is not the user the process runs as, so prefix the command with sudo -u <user>.
The thread information was also saved:
sudo -u jetty jstack pid > /tmp/jstack2018-04-18.txt
Locating the problem
View system logs of the server
cat /var/log/messages
Killed process 7364, UID 502, (java) total-vm:10511252kB, anon-rss:7489308kB
PS: On the previous line, the two numbers for the java process, 2627813 and 1872416, are page counts, at 4 kB per page.
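As a quick sanity check on that note: 2627813 pages × 4 kB = 10511252 kB, which is exactly the total-vm figure above (about 10 GB), and 1872416 pages × 4 kB ≈ 7.1 GB, which lines up with the resident set size.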
You can see that the process was killed by the OOM killer because its memory usage exceeded the system's memory.
Next, I used MAT (Eclipse Memory Analyzer) to analyze the memory leak in detail.
Since I was not familiar with the MAT tool, and the distribution chart showed a total of only 103.6 MB, I assumed the heap memory was fine and went down the path of off-heap memory overflow instead. We checked the use of ThreadLocal in the code to make sure objects were cleaned up promptly, and also checked the number of threads online, since an excessive number of threads can cause off-heap memory to overflow. No obvious problems were found.
Continuing with the MAT analysis, using the Leak Suspects report, we found a large number of TrueTypeFont objects in memory.
I then remembered that this application uses fonts for drawing and sends the resulting images. Checking the requests around the time of the OOM, I found that there were indeed a large number of drawing requests, and that a single request drew more than 600 images.
Problem reproduction
I then simulated the request in the company's test environment, and sure enough, memory usage jumped from over 60% to over 80%.
Looking back at the drawing code, I found that there was a problem with the fonts.
The logic of this method is to load a font file on the server, draw a QR code with a brush, and save it to a temporary directory on the server.
The font was not handled properly: the font file is loaded every time a drawing task comes in, and a new font object is created in memory each time. The font file on our server is about 16 MB, which means that every drawing loads another 16 MB into memory.
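The original drawing code is not shown here, but based on the description above, the problematic pattern presumably looked something like the sketch below (class name, method, and font path are made up for illustration). At roughly 600 images × 16 MB per font load in a single peak request, that is on the order of 9.6 GB, which by itself exceeds the machine's 8 GB of memory.

```java
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

// Hypothetical reconstruction of the leaky drawing code, not the original source.
public class QrCodeDrawer {

    public void draw(String text, File output) throws Exception {
        // BUG: the ~16 MB font file is read and a new Font object is created
        // on every call, so 600 images in one request load ~9.6 GB of font data.
        Font font = Font.createFont(Font.TRUETYPE_FONT, new File("/data/fonts/brand.ttf"))
                        .deriveFont(32f);

        BufferedImage image = new BufferedImage(400, 400, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setFont(font);
            g.drawString(text, 20, 380);   // caption under the QR code
            // ... QR code drawing omitted ...
        } finally {
            g.dispose();
        }
        ImageIO.write(image, "png", output); // save to a temporary directory
    }
}
```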
Analyzing the cause
Since I was not familiar with MAT at the time, I still suspected that Font was using off-heap space, so I set out to test the off-heap memory.
Ruling out off-heap memory overflow
Our server has 8 GB of memory, and the JVM is configured with -Xmx4428m -Xms4428m -Xmn2767m. If off-heap space is not limited, it can grow to roughly the same size as the JVM's usage. So we limit the off-heap size, for example to only 100 MB with -XX:MaxDirectMemorySize=100m.
If Font used off-heap space, the off-heap usage would quickly reach 100 MB and trigger a Full GC, blocking all requests (stop-the-world). As long as each Full GC cleans up the off-heap memory in time, the blocked requests then continue to come in, push usage back up to 100 MB, and trigger another Full GC. In that case memory usage would never reach 8 GB and the process would not be killed by the system.
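As a side note, the mechanism this test relies on can be illustrated with a small standalone snippet (not from the original investigation): direct, off-heap buffers are capped by -XX:MaxDirectMemorySize, and once the cap is reached the JVM tries to reclaim unreferenced buffers via GC and otherwise throws an OutOfMemoryError.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of the -XX:MaxDirectMemorySize cap; compile it and
// run with: java -XX:MaxDirectMemorySize=100m DirectMemoryDemo
public class DirectMemoryDemo {

    public static void main(String[] args) {
        List<ByteBuffer> buffers = new ArrayList<>();
        try {
            while (true) {
                // Each direct buffer lives outside the Java heap.
                buffers.add(ByteBuffer.allocateDirect(10 * 1024 * 1024)); // 10 MB
                System.out.println("allocated " + buffers.size() * 10 + " MB off-heap");
            }
        } catch (OutOfMemoryError e) {
            // With a 100 MB cap this triggers after roughly 10 allocations,
            // because the buffers are still strongly referenced and GC cannot free them.
            System.out.println("hit the direct memory limit: " + e.getMessage());
        }
    }
}
```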
During the test, however, memory usage continued to soar and the process was eventually killed, so the suspicion shifted back to the heap. Using the same strategy, I capped the heap at 1 GB (a 100 MB heap would not even run), found that memory did not grow beyond that, and saw from the GC log that Full GCs were happening constantly, which roughly confirmed that it was heap memory being used.
As you can see, during the peak period a large number of drawing requests came in, causing a large number of large objects to be loaded into memory, which eventually pushed the JVM's memory beyond the system memory and got the process killed by the system.
Solving the problem
I changed the Font to a singleton, so only one copy exists in memory.
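The exact fix is not shown in the article; a minimal sketch of the idea (class name and font path are illustrative) is to load the font once, lazily and thread-safely, and reuse that single instance for every drawing task:

```java
import java.awt.Font;
import java.io.File;

// Minimal sketch of the fix: load the ~16 MB font exactly once and share it.
public final class FontHolder {

    // Class initialization guarantees this runs once, thread-safely.
    private static final Font BASE_FONT = loadFont();

    private FontHolder() {
    }

    private static Font loadFont() {
        try {
            return Font.createFont(Font.TRUETYPE_FONT, new File("/data/fonts/brand.ttf"));
        } catch (Exception e) {
            throw new IllegalStateException("failed to load font", e);
        }
    }

    // deriveFont() is cheap: it reuses the underlying font data.
    public static Font withSize(float size) {
        return BASE_FONT.deriveFont(size);
    }
}
```

The drawing method then calls something like g.setFont(FontHolder.withSize(32f)) instead of Font.createFont(...), so the font data is kept in memory exactly once.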
I tried the same request again in the company's test environment, and this time memory usage did not change. After the release to production, the problem did not reappear, so the issue was resolved for the time being.
Conclusion
- For online failures, preserve the scene first (heap dump, thread dump, logs).
- Be especially careful with large object operations.
- Be familiar with common tools, or you will waste a lot of unnecessary time. When using MAT to analyze a problem, pay more attention to the proportion of objects than to their absolute size. Had I been a bit more familiar with the tool, I could have skipped the detour through off-heap memory.
- Respond to alarms promptly. Fortunately only one machine went down this time, so the impact on production was not too great.
Follow-up questions
Follow-up investigation found that when the number of requests was small there was no OOM, but the memory that went up did not come back down, which suggested a memory leak. I searched for TrueTypeFont using MAT's Histogram.
Then look at its path to GC roots.
Filter out the weak references, which do not prevent garbage collection.
Several classes show up; the one that stands out most is Disposer, so I looked up Disposer online.
In MAT, after filtering out the weak references, what remains of the Disposer chain is a FontStrikeDisposer holding a strong reference to TrueTypeFont. But there are only eight of these, so they should not be the cause. At this point the trail went cold, so I went back to the TrueTypeFont source code to see if there was anything to find.