After more than a month of effort, FullGC went from roughly 40 times a day to roughly once every 10 days, and total YoungGC time was cut by more than half. An optimization this large is worth writing up, so here is a record of the tuning process.

My knowledge of JVM garbage collection had always stayed at the theoretical level: I knew how objects get promoted from the young generation to the old generation, which is about enough to pass an interview. Recently, FullGC on our online servers became very frequent, averaging more than 40 times a day, and the servers would restart on their own every few days, a clear sign that something was badly wrong. With such a good opportunity in front of me, I naturally volunteered to do the tuning. The GC data before tuning showed that FullGC was indeed very frequent.

First, the server configuration is quite modest (2 cores, 4 GB RAM), and the cluster has 4 servers in total. The FullGC count and time on each server are roughly the same. The key JVM startup parameters are:

-Xms1000M -Xmx1800M -Xmn350M -Xss300K -XX:+DisableExplicitGC -XX:SurvivorRatio=4 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:LargePageSizeInBytes=128M -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC

-Xmx1800M: sets the JVM's maximum heap memory to 1800M.
-Xms1000M: sets the JVM's initial heap memory to 1000M. This can be set equal to -Xmx so the JVM does not have to resize the heap after each garbage collection.
-Xmn350M: sets the young generation size to 350M. Total JVM heap = young generation + old generation + permanent generation. With the permanent generation fixed at 64M, enlarging the young generation shrinks the old generation. This value has a significant impact on system performance; Sun officially recommends setting it to 3/8 of the whole heap.
-Xss300K: sets the stack size of each thread. Since JDK 5.0 the default per-thread stack size is 1M (before JDK 5.0 it was 256K). Adjust it to the memory the application's threads actually need; reducing it lets the same physical memory support more threads, but the operating system still caps the number of threads per process, so they cannot be created without limit. A reasonable empirical range is about 3000 to 5000 threads.
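
As a sanity check, the heap sizes the JVM actually ends up with can be read at runtime through the standard java.lang.management API; a minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Print the heap sizes the running JVM is actually using, to confirm
// that -Xms/-Xmx took effect as expected. Values are in bytes; >> 20 converts to MB.
public class HeapSettings {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        System.out.printf("heap init=%dM max=%dM used=%dM%n",
                heap.getInit() >> 20, heap.getMax() >> 20, heap.getUsed() >> 20);
        System.out.printf("Runtime.maxMemory=%dM%n",
                Runtime.getRuntime().maxMemory() >> 20);
    }
}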

First optimization

Looking at the parameters, my first reaction was: why is the young generation so small? With a young generation this small, throughput suffers and YoungGC is triggered frequently; the cumulative young generation collection time had reached about 830s. The initial heap size also differs from the maximum heap size, and most references recommend setting the two equal to avoid the JVM resizing the heap after each GC. Based on this, the first round of tuning was: increase the young generation and set the initial heap size equal to the maximum.

-Xmn350M -> -Xmn800M
-XX:SurvivorRatio=4 -> -XX:SurvivorRatio=8
-Xms1000M -> -Xms1800M

SurvivorRatio was changed to 8 so that as much garbage as possible would be collected in the young generation. This configuration was deployed to two of the online servers (prod1 and prod2; the other two were left unchanged for comparison). After 5 days of running, the GC results showed that the number of YoungGCs had dropped by more than half and their total time by about 400s, but the average number of FullGCs had gone up by about 41. YoungGC behaved as expected; FullGC did not.
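
For reference, a rough calculation of how -Xmn and -XX:SurvivorRatio divide the young generation (Eden : S0 : S1 = ratio : 1 : 1); the exact sizes HotSpot picks may differ slightly because of alignment:

// Back-of-the-envelope split of the young generation for the settings used here.
public class YoungGenSplit {
    public static void main(String[] args) {
        long youngGen = 800L * 1024 * 1024; // -Xmn800M
        int ratio = 8;                      // -XX:SurvivorRatio=8
        long survivor = youngGen / (ratio + 2);
        long eden = survivor * ratio;
        System.out.printf("Eden ~%dM, each Survivor ~%dM%n",
                eden / (1024 * 1024), survivor / (1024 * 1024));
        // With -Xmn800M and ratio 8: Eden ~640M, each survivor ~80M.
        // With the old ratio 4:       Eden ~533M, each survivor ~133M.
    }
}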

So the first optimization failed.

Second optimization

During the optimization work, my supervisor discovered that one object type T had more than 10,000 instances in memory, together occupying nearly 20 MB. Tracing how this bean was used, we found the cause in the project: an anonymous inner class holding a reference to it. The pseudocode looks like this:

public void doSomething(final T t) {
    redis.addListener(new Listener() {
        public void onTimeout() {
            if (t.success()) {
                // do something
            }
        }
    });
}

The listener is never released after registration, and the callback is a timeout operation: it only fires when an event exceeds the configured time (1 minute). As a result, the instances of T referenced by these listeners can never be reclaimed, which is why so many of them accumulated in memory. Having found this memory leak, we first went through the program's error logs and fixed all the error events. After the next release the GC behavior was roughly the same: a small memory leak had been fixed, but the servers kept restarting inexplicably, so the root cause had not been addressed.
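
A sketch of one way to break this kind of leak, assuming a listener registry shaped like the pseudocode above (Listener, ListenerRegistry and the names below are placeholders, not a real Redis client API): the callback keeps only the small identifier it needs instead of the whole T, and deregisters itself once it has fired.

// Placeholder types mirroring the pseudocode above.
interface Listener { void onTimeout(); }

interface ListenerRegistry {
    void addListener(Listener l);
    void removeListener(Listener l);
}

class TimeoutHandler {
    private final ListenerRegistry redis; // name kept from the pseudocode

    TimeoutHandler(ListenerRegistry redis) { this.redis = redis; }

    void doSomething(final String businessId) {     // pass the id, not the whole T
        redis.addListener(new Listener() {
            @Override
            public void onTimeout() {
                try {
                    // re-load the record by businessId here if the callback needs it
                } finally {
                    redis.removeListener(this);     // make sure the listener itself is released
                }
            }
        });
    }
}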

Memory leak investigation

Since the first round of tuning had surfaced a memory leak, everyone started investigating memory leaks. We reviewed the code first, but that was very inefficient and turned up essentially nothing. So I kept taking heap dumps during quieter periods online, and finally caught a large object.
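
For reference, a heap dump can also be triggered from inside the JVM via the standard HotSpotDiagnosticMXBean, which is handy when scripting dumps during quiet periods; a minimal sketch (the output path is just an example):

import java.io.IOException;
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

// Trigger a .hprof heap dump programmatically, equivalent to dumping with external tooling.
public class HeapDumper {
    public static void dump(String path, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap(path, liveObjectsOnly); // true = dump only live (reachable) objects
    }

    public static void main(String[] args) throws IOException {
        dump("/tmp/app-heap.hprof", true);
    }
}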

There were more than 40,000 instances of this object, all identical ByteArrowRow objects, which confirmed that the data was being generated during database queries or inserts. So another round of code analysis began. While it was going on, colleagues in operations found that inbound traffic had increased several-fold on one day, peaking as high as 83 MB/s. After some checking, we confirmed there was no business volume of that scale and no file upload feature. Alibaba Cloud support also confirmed it was entirely normal traffic, so an attack could be ruled out.

While I was still investigating the inbound traffic, another colleague found the root cause: under a certain condition, the code would query all unprocessed data in a specified table, but because a module condition was missing from the WHERE clause, the query returned more than 400,000 rows. The request time and data in the logs lined up with the 40,000-plus objects seen in the heap dump, and this also neatly explained why the servers kept restarting on their own.
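
A hedged sketch of the kind of fix this implies: constrain the query with the missing condition and an explicit upper bound so one request can never pull a whole table into memory. The table, columns and batch size below are illustrative, not the project's real schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Query unprocessed rows with the module condition in place and a hard cap per request.
public class UnprocessedDataDao {
    private static final int MAX_BATCH = 1000; // hypothetical per-request cap

    public void processUnprocessed(Connection conn, String module) throws SQLException {
        String sql = "SELECT id, payload FROM unprocessed_data "
                   + "WHERE status = 'PENDING' AND module = ? "   // the condition that was missing
                   + "ORDER BY id LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, module);
            ps.setInt(2, MAX_BATCH);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // handle one row at a time instead of materializing 400k+ rows
                }
            }
        }
    }
}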

After this problem was fixed, the online servers ran completely normally: even with the pre-tuning parameters, FullGC occurred only about 5 times over roughly 3 days of running.

Second tuning

With the memory leak resolved, the remaining work was to continue tuning. Checking the GC logs, we found that the first three FullGCs happened while the old generation occupied less than 30% of its memory, which prompted more digging. The blog post at blog.csdn.net/zjwstz/arti… clearly explains how Metaspace expansion can trigger FullGC: the server's default MetaspaceSize is about 21M, while the GC logs showed Metaspace peaking at around 200M, so MetaspaceSize was raised accordingly in the new configuration. Below are the modified parameters for prod1 and prod2 respectively; prod3 and prod4 stay unchanged. (A small sketch for checking Metaspace usage at runtime follows the parameter blocks.)

-Xmn350M -> -Xmn800M
-Xms1000M -> -Xms1800M
-XX:MetaspaceSize=200M
-XX:CMSInitiatingOccupancyFraction=75

and

-Xmn350M -> -Xmn600M
-Xms1000M -> -Xms1800M
-XX:MetaspaceSize=200M
-XX:CMSInitiatingOccupancyFraction=75
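
As mentioned above, a minimal sketch for checking Metaspace usage at runtime, using the standard MemoryPoolMXBean (the pool name "Metaspace" is what HotSpot reports on JDK 8):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Print current Metaspace usage so the ~21M default / ~200M peak can be confirmed
// directly, rather than only from GC logs.
public class MetaspaceUsage {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                System.out.printf("Metaspace used=%dM committed=%dM%n",
                        pool.getUsage().getUsed() >> 20,
                        pool.getUsage().getCommitted() >> 20);
            }
        }
    }
}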

Prod1 and prod2 differ only in young generation size; everything else is the same. They ran online for about 10 days, and the GC data of the four servers (prod1 through prod4) was then compared.

Comparing the two groups, FullGC on servers 1 and 2 is far less frequent than on servers 3 and 4, and their YoungGC count is also roughly half that of servers 3 and 4. The effect on server 1 is especially pronounced: besides the lower YoungGC count, its throughput is higher than that of servers 3 and 4 even though they had been running a day longer (judging by the number of threads started), so the throughput improvement on prod1 is particularly significant.

Judging by GC counts and GC time, this optimization can be declared a success, with prod1's configuration the better of the two: it greatly improves server throughput and cuts GC time by more than half.

The only FullGC in prod1:

The cause cannot be determined from the GC log. At the CMS remark phase the old generation only occupied about 660M, which should not be enough to trigger a FullGC, and the earlier YoungGC investigation had already ruled out the promotion of large objects, so none of the usual conditions for a FullGC seem to be met. This still needs further investigation; if anyone knows the answer, please point it out, and thanks in advance.

Conclusion

After more than a month of tuning, the following points are summarized:

  • FullGC more than once a day is abnormal.
  • When FullGC is frequent, investigate memory leaks first.
  • Once memory leaks are resolved, there is not much room left for JVM tuning; it is fine as a learning exercise, but don't invest too much time in it.
  • If CPU usage stays persistently high, rule out code problems first and then consult Alibaba Cloud support; during this investigation, one case of 100% CPU turned out to be a problem with the host itself and went back to normal after the server was migrated.
  • Database query results also count toward the server's inbound traffic, so if business volume is not that large and an attack has been ruled out, look at the database.
  • Monitor your servers' GC regularly so that problems are caught early (see the sketch below).
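
A minimal sketch of the lightweight GC monitoring meant in the last point, using the standard GarbageCollectorMXBean; in production this would feed a metrics or alerting system rather than stdout, and the 60-second interval is just an example:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Periodically log each collector's cumulative count and time so that a sudden
// jump in FullGC frequency or GC time is noticed early.
public class GcMonitor {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: count=%d totalTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(60_000);
        }
    }
}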


Reference: club.perfma.com/article/185… Author: coffeeboy