1. Problem
Recently, the operations department reported that the website intermittently returned 502 errors to small batches of users, and that responses were very slow just before the 502s appeared. After refreshing a few times, the site could be accessed normally again.
2. Investigation process
The project is deployed on two machines, and one of them kept restarting. Users whose requests reached that server while it was restarting got access errors.
2.1. View log files
The wrapper log showed that one of the machines had restarted several times. The operations team confirmed that the wrapper restarts the JVM automatically after several consecutive ping failures. No OutOfMemoryError or StackOverflowError was logged.
2.2. Production environment resources and GC
Running top showed high CPU usage and high load in the production environment, with GC occupying the CPU for long periods.
A closer look at GC showed that the FGC (full GC) count exceeded 70,000 shortly after the restart. The old generation stayed at 100% occupancy, while each individual FGC was very short.
jstat -gcutil [pid] 1000
Running jmap -histo [pid] to inspect the object instances in memory showed nothing abnormal.
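The counters that jstat prints can also be read from inside the JVM through the standard java.lang.management API. The following is only a minimal sketch, not what was used in this investigation; the class name GcSnapshot and the helper totalCollections are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, not part of the original project.
final class GcSnapshot {

    /** Sums the collection counts of all collectors; undefined counts (-1) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector: cumulative count and cumulative time in milliseconds.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
    }
}
```

The collector names printed depend on which garbage collector the JVM is running, so the old-generation collector has to be identified by name on the target machine.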
2.3. View heap information
Because no OOM was triggered, we manually added the following startup parameters so that the heap is dumped before and after each FGC:
-XX:+HeapDumpBeforeFullGC
-XX:+HeapDumpAfterFullGC
-XX:HeapDumpPath=/tmp
Sure enough, heap dumps started to appear shortly after the restart.
Next, we opened the heap dump files in JVisualVM (the memory analysis tool that comes with the JDK). The class views of the heap before and after FGC are listed below:
- Before FGC
- After FGC
Comparing the two showed a large number of int[] and char[] arrays before FGC, many of them very long, and these large arrays were effectively reclaimed by the FGC. Readers familiar with JVM garbage collection will know that the JVM may allocate large objects directly in the old generation to avoid moving them around repeatedly. At this point I began to suspect that the old generation was too small, so I checked the JVM startup parameters of the production environment one by one.
-Xms2048m
-Xmx2048m
...
-Xmn2000m
That was the cause: the heap is limited to 2 GB (2048 MB) and the young generation was set to 2000 MB, which leaves only 48 MB for the old generation. When the old generation does not have enough space to hold the large array objects, an FGC is triggered. We changed the setting to -Xmn683m, which matches the default young-to-old ratio of 1:2. After redeploying, the problem was solved, and the server status reported by top and jstat -gc looked healthy.
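The arithmetic behind these numbers can be sketched as follows. This is only an illustration: the class and method names are made up, and a real JVM aligns generation sizes internally, so the actual values may differ slightly. The ratio parameter follows the meaning of HotSpot's -XX:NewRatio, where 2 means the old generation is twice the size of the young generation.

```java
// Hypothetical helper class for illustration only.
final class HeapSizing {

    /** The old generation is whatever remains of the heap after the young generation. */
    static int oldGenMb(int heapMb, int youngMb) {
        if (youngMb > heapMb) {
            throw new IllegalArgumentException("young generation cannot exceed the heap");
        }
        return heapMb - youngMb;
    }

    /** Young generation size when old:young = newRatio:1, rounded to the nearest MB. */
    static int youngGenMbForRatio(int heapMb, int newRatio) {
        return Math.round((float) heapMb / (newRatio + 1));
    }

    public static void main(String[] args) {
        int heapMb = 2048;
        // Original setting: 2000 MB young generation leaves 48 MB for the old generation.
        System.out.println(oldGenMb(heapMb, 2000));
        // A 1:2 young-to-old ratio gives a young generation of about 683 MB.
        int youngMb = youngGenMbForRatio(heapMb, 2);
        System.out.println(youngMb);
        // The old generation then gets the remaining 1365 MB.
        System.out.println(oldGenMb(heapMb, youngMb));
    }
}
```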
3. Analysis
Because the project had been running for several years, our first instinct was to look for the problem in the project code, and we paid no attention to the JVM startup parameters. In fact, an experienced engineer could have located the problem from some of the information already collected. For example:
- GC: the GC statistics show the problem clearly. The FGC count is abnormally high (70,000+), but FGCT is not high (1.9W, i.e. about 19,000), which means each FGC is very efficient. In addition, the young generation occupancy is very low at each FGC, which rules out the program creating a large number of new objects in a short time and triggering FGC that way. Together these point the investigation at the conditions that trigger FGC, not at the program.
- Dump: the dump contains a large number of unreferenced objects, which also indirectly shows that no effective YGC ran before the FGC.
It is also worth noting that the default values of the JVM startup parameters work well in most cases. Stick to the defaults until the final stage of system tuning. Based on the project code and the large number of Pattern (regular expression) related objects present before FGC, our initial suspicion is that these objects come from widespread improper use of Pattern in the project.
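As an illustration of what improper use of Pattern can look like, the sketch below contrasts recompiling a regular expression on every call with reusing one precompiled instance. The regular expression and all names are hypothetical and are not taken from the project; whether this is the actual cause there has not been confirmed.

```java
import java.util.regex.Pattern;

// Hypothetical example; the regex and method names are not from the project.
final class PatternReuse {

    // Pattern instances are immutable and thread-safe, so one shared constant is enough.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** String.matches compiles a fresh Pattern on every call, creating short-lived garbage. */
    static boolean isDigitsRecompiled(String s) {
        return s.matches("\\d+");
    }

    /** Reuses the precompiled Pattern; only a lightweight Matcher is created per call. */
    static boolean isDigitsReused(String s) {
        return DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDigitsRecompiled("12345")); // true
        System.out.println(isDigitsReused("12345"));     // true
        System.out.println(isDigitsReused("12a45"));     // false
    }
}
```

Both methods return the same results; the difference is only in how many objects are allocated on a hot path.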
4. Summary of the investigation process
- Check whether the logs contain an OutOfMemoryError or StackOverflowError.
- Check the status of the production server and its GC activity.
- Analyze the heap dump files from before and after GC.
- Locate the problem.