1. Problem

Recently, the operations department reported that the website intermittently returns 502 errors for small batches of requests, and that responses are very slow just before the 502s appear. After refreshing a few times, the website can be accessed normally again.

2. Investigation process

The project is deployed on two machines, and one of them had been restarting. Users whose requests reached that server while it was restarting got access errors.

2.1. View log files

The wrapper log shows that one of the machines restarted several times. According to O&M, the wrapper restarts the service automatically after several consecutive failed pings. No OutOfMemoryError or StackOverflowError appears in the logs.

2.2. Production environment resources and GC

Running the top command on the production server showed high CPU usage and high load, with GC occupying the CPU for long periods.

A closer look at the GC statistics showed that the FGC (full GC) count reached 70,000+ within a very short time after the restart. The Old Generation stayed at 100% occupancy, and each FGC took very little time.

jstat -gcutil [pid] 1000
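The counters that jstat reports can also be read from inside a JVM through the standard java.lang.management API. The following is only a minimal sketch, not the tooling used in this investigation: the class name GcCounters is made up for illustration, and unlike jstat it observes the JVM it runs in, so it would have to run inside the application (or be adapted to a remote JMX connection).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounters {
    /** One line per collector: name, number of collections, accumulated time in ms. */
    static String describe(GarbageCollectorMXBean gc) {
        return String.format("%-25s count=%d time=%dms",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample once per second, similar in spirit to `jstat -gcutil [pid] 1000`.
        for (int i = 0; i < 5; i++) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(describe(gc));
            }
            Thread.sleep(1000);
        }
    }
}
```

A collection count that climbs quickly for the old-generation collector while its accumulated time stays comparatively low is the same pattern described above.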

Running jmap -histo [pid] to inspect the object instances in memory showed nothing abnormal.

[Screenshot: FGC]

2.3. View stack information

Because no OOM was triggered, I manually added the following startup parameters so that a heap dump is written before and after each FGC:

-XX:+HeapDumpBeforeFullGC
-XX:+HeapDumpAfterFullGC
-XX:HeapDumpPath=/tmp

Sure enough, heap dumps started to be written shortly after the restart.
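If it is unclear whether the startup parameters actually took effect, they can be queried from inside the JVM. The sketch below assumes a HotSpot-based JDK that ships com.sun.management.HotSpotDiagnosticMXBean; the class name HeapDumpFlags is hypothetical, and the code returns "unknown" for any flag the running JVM does not recognize.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpFlags {
    /** Returns the current value of a HotSpot flag, or "unknown" if this JVM lacks it. */
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // getVMOption throws when the option does not exist in this JVM.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        String[] flags = {"HeapDumpBeforeFullGC", "HeapDumpAfterFullGC", "HeapDumpPath"};
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```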

Next, I used JVisualVM (the memory analysis tool bundled with the JDK) to open the heap dump files. The class views of the heap dumps before and after FGC are listed below:

  • Before FGC
  • After FGC

Comparing the dumps before and after FGC showed a large number of int[] and char[] arrays before FGC, many of them very long, and these large arrays were effectively reclaimed by FGC. As readers familiar with JVM garbage collection will know, the JVM can allocate large objects directly in the Old Generation to avoid copying them repeatedly. At this point I began to suspect that the Old Generation was too small, so I checked the JVM startup parameters of the production environment one by one.
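One way to check how much room each generation really has is to list the heap memory pools of the running JVM. This is a rough sketch using the standard MemoryPoolMXBean API; the class name HeapPools is an assumption for illustration, and the pool names printed depend on the garbage collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class HeapPools {
    private static final long MB = 1024L * 1024L;

    /** Describes one memory pool: name, used size, and maximum size in MB. */
    static String describe(MemoryPoolMXBean pool) {
        MemoryUsage usage = pool.getUsage();
        if (usage == null) {
            return pool.getName() + " (pool no longer valid)";
        }
        // getMax() returns -1 when the pool has no defined maximum.
        String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / MB) + "MB";
        return String.format("%-25s used=%dMB max=%s",
                pool.getName(), usage.getUsed() / MB, max);
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                System.out.println(describe(pool));
            }
        }
    }
}
```

An old-generation pool whose maximum is only a few tens of MB, next to young-generation pools of well over a gigabyte, would point at the sizing parameters rather than at the application code.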

-Xms2048m
-Xmx2048m
...

-Xmn2000m

Sure enough: the heap on this machine is 2 GB (2048 MB) and the new generation takes 2000 MB of it, which leaves only 48 MB for the Old Generation. When the Old Generation does not have enough space to hold the large array objects, FGC is triggered. The fix was to change -Xmn to 683 MB, which matches the default new-to-old generation ratio of 1:2. After redeploying, the problem was solved, and the server status shown by top and jstat -gc looked healthy.
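The arithmetic behind these numbers can be written out as a small sketch. The class and method names are made up for illustration; the inputs (2048 MB heap, 2000 MB new generation, 1:2 ratio) come from the text above.

```java
public class HeapSplit {
    /** Old Generation size left over when the new generation is fixed with -Xmn. */
    static long oldGenMb(long heapMb, long youngMb) {
        return heapMb - youngMb;
    }

    /** New generation size for a new:old ratio of 1:oldPerYoung, rounded to the nearest MB. */
    static long youngForRatioMb(long heapMb, int oldPerYoung) {
        return Math.round((double) heapMb / (oldPerYoung + 1));
    }

    public static void main(String[] args) {
        long heap = 2048;
        System.out.println("old gen with -Xmn2000m: " + oldGenMb(heap, 2000) + " MB");   // 48 MB
        long young = youngForRatioMb(heap, 2);
        System.out.println("new gen at ratio 1:2:   " + young + " MB");                  // 683 MB
        System.out.println("old gen at ratio 1:2:   " + oldGenMb(heap, young) + " MB");  // 1365 MB
    }
}
```

With the 1:2 split, the Old Generation grows from 48 MB to about 1365 MB, which leaves room for the large arrays.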

3. Analysis

Because the project had been running for several years, my first instinct was to look for the problem in the project code, and I paid no attention to the JVM startup parameters. In fact, an experienced engineer could have located the problem from some of the information already collected. For example:

  1. The problem is clearly visible in the GC screenshot. The FGC count is abnormally high (70,000+), but FGCT is not (19,000), which means each FGC is very efficient. In addition, the New Generation occupancy at each FGC is very low, which rules out the program creating a large number of new objects in a short time as the trigger. Together these give a direct line of investigation: start from the conditions that trigger FGC, not from the program.
  2. The dump contains a large number of unreferenced objects, which indirectly shows that no effective YGC ran before FGC.

It is also worth noting that the default values of the JVM startup parameters work well in most cases. Stick with the defaults until the final stage of system tuning. From analyzing the project code, and from the large number of Pattern (regular expression) related objects present before FGC, my initial suspicion had been that the problem was caused by widespread improper use of Pattern in the project.
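The source does not show what the suspected improper Pattern usage looked like. As an illustration only, one common form is recompiling the same regular expression on every call instead of reusing a compiled Pattern. The class, method names, and the regular expression below are all hypothetical.

```java
import java.util.regex.Pattern;

public class PatternReuse {
    // Compiled once and shared; Pattern instances are immutable and safe to reuse.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Recompiles the regular expression on every call (String.matches compiles internally). */
    static boolean isNumberRecompiling(String s) {
        return s.matches("\\d+");
    }

    /** Reuses the precompiled Pattern and only creates a Matcher per call. */
    static boolean isNumberReusing(String s) {
        return DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNumberRecompiling("12345")); // true
        System.out.println(isNumberReusing("12a"));       // false
    }
}
```

Both methods return the same results; the difference is only in how many short-lived regex objects each call allocates. In this investigation the root cause turned out to be the generation sizing, not this kind of code.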

4. Summary of the investigation process

  1. Check whether the logs contain OutOfMemoryError, StackOverflowError, and similar errors.
  2. Check the production server status and the GC statistics.
  3. Analyze the heap dump files from before and after GC.
  4. Locate the problem.
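Step 1 of the checklist above can be sketched as a small log scan. This is only an illustration: the class name LogScan is made up, the log file path is passed in by the caller because the real log location is not given in this article, and the two error names searched for are the standard Java ones.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LogScan {
    /** Returns true if any line mentions an OutOfMemoryError or StackOverflowError. */
    static boolean containsFatalError(List<String> lines) {
        for (String line : lines) {
            if (line.contains("java.lang.OutOfMemoryError")
                    || line.contains("java.lang.StackOverflowError")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LogScan <log-file>");
            return;
        }
        // Reads the whole file into memory; fine for a sketch, not for very large logs.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(containsFatalError(lines) ? "fatal error found" : "no fatal error found");
    }
}
```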