GC problems in online services are a very typical class of Java problem and a real test of an engineer's troubleshooting ability. They are also a near-universal interview topic, yet few people can answer well: some do not understand the principles, and others lack hands-on experience.

In the past six months, our advertising system has had several online incidents related to GC, including Full GC that was too frequent and Young GC that took too long. In each case the program stalled during GC, which led to service timeouts and affected advertising revenue.

In this article, I use an online case of frequent FGC (Full GC) to walk through the GC troubleshooting process in detail. I then give a practical guide based on how GC works, hoping it will be helpful to you. The content is divided into the following three parts:

Start with an online case of frequent FGC

The operating principles of GC

Practical guidelines for troubleshooting FGC problems

01 Start with an online case of frequent FGC

Last October, after a release went live, our ad recall system started receiving frequent FGC alerts. As the monitoring chart below shows, there was on average one FGC every 35 minutes, whereas before the release the FGC frequency was about once every two days. The following describes how we troubleshot the fault.

![](https://upload-images.jianshu.io/upload_images/23721221-119e05c43dad3f4f?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

1. Check the JVM configuration

Check the JVM startup parameters with the following command:

ps aux | grep "applicationName=adsearch"

-Xms4g -Xmx4g -Xmn2g -Xss1024K

-XX:ParallelGCThreads=5

-XX:+UseConcMarkSweepGC

-XX:+UseParNewGC

-XX:+UseCMSCompactAtFullCollection

-XX:CMSInitiatingOccupancyFraction=80

As you can see, the heap is 4G, split into a 2G new generation and a 2G old generation. The new generation uses the ParNew collector and the old generation uses the CMS (Concurrent Mark Sweep) collector. When old-generation memory usage reaches 80%, CMS starts collecting the old generation, which is the FGC discussed in this article.

Running jmap -heap 7276 | head -n20 further shows that the Eden area of the new generation is 1.6G, and that the S0 and S1 areas are 0.2G each.
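
If you cannot log in to the machine, roughly the same information can be read from inside the JVM through the standard java.lang.management API. The sketch below is only an illustration: the class name HeapLayoutReport is made up for this example, and the pool names it prints depend on which collectors are in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: prints the startup flags and heap pool sizes of the current JVM. */
final class HeapLayoutReport {

    /** JVM startup arguments such as -Xms4g or -XX:+UseConcMarkSweepGC. */
    static List<String> startupArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Maximum size in bytes of every heap pool (-1 when undefined), keyed by pool name. */
    static Map<String, Long> heapPoolMaxBytes() {
        Map<String, Long> pools = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getType() == MemoryType.HEAP && usage != null) {
                pools.put(pool.getName(), usage.getMax());
            }
        }
        return pools;
    }

    public static void main(String[] args) {
        startupArguments().forEach(System.out::println);
        heapPoolMaxBytes().forEach((name, max) ->
                System.out.printf("%-25s max=%d bytes%n", name, max));
    }
}
```

Unlike jmap, this only inspects the JVM it runs in, so it would have to be exposed through an admin endpoint or similar to be useful in production.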

2. Observe memory changes in the old generation

Looking at old-generation usage, memory falls back to around 500M after each FGC, so we ruled out a memory leak.

![](https://upload-images.jianshu.io/upload_images/23721221-01ec545b283ad1eb?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

3. Run the jmap command to view objects in the heap memory

Run the following command: jmap -histo 7276 | head -n20

![](https://upload-images.jianshu.io/upload_images/23721221-d77c0a359d10ea35?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The figure above lists the number of live instances, the memory they occupy, and the class name, sorted by memory occupied. int[] takes up more memory than any other live object, so at this point int[] became our prime suspect.
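
When the histogram has to be compared across several nodes or points in time, it can help to parse it instead of reading it by eye. The following is a minimal sketch that assumes the usual jmap -histo row layout (rank, instance count, bytes, class name) and that int[] is printed with its JVM descriptor [I. The class HistoParser and the sample rows in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for rows of jmap -histo output; header and separator lines are skipped. */
final class HistoParser {

    static final class Row {
        final long instances;
        final long bytes;
        final String className;

        Row(long instances, long bytes, String className) {
            this.instances = instances;
            this.bytes = bytes;
            this.className = className;
        }
    }

    // Matches e.g. "   1:   12   41943088  [I" (rank, #instances, #bytes, class name).
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                rows.add(new Row(Long.parseLong(m.group(1)),
                        Long.parseLong(m.group(2)), m.group(3)));
            }
        }
        return rows;
    }

    /** Turns JVM array descriptors such as "[I" into source-style names such as "int[]". */
    static String readableName(String jvmName) {
        int dims = 0;
        while (dims < jvmName.length() && jvmName.charAt(dims) == '[') {
            dims++;
        }
        if (dims == 0) {
            return jvmName; // not an array type
        }
        String element = jvmName.substring(dims);
        String[] primitives =
                {"boolean", "byte", "char", "short", "int", "long", "float", "double"};
        int idx = element.length() == 1 ? "ZBCSIJFD".indexOf(element) : -1;
        String base;
        if (idx >= 0) {
            base = primitives[idx];
        } else if (element.startsWith("L") && element.endsWith(";")) {
            base = element.substring(1, element.length() - 1);
        } else {
            base = element;
        }
        StringBuilder sb = new StringBuilder(base);
        for (int i = 0; i < dims; i++) {
            sb.append("[]");
        }
        return sb.toString();
    }
}
```

For example, feeding it a made-up row such as "   1:   12   41943088  [I" yields one Row whose className is "[I", which readableName turns into int[].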

4. Dump heap memory files for further analysis

Having singled out int[], we decided to dump the heap to a file and trace the origin of these arrays with a visualization tool. Because the program pauses during a heap dump, we first took the node out of rotation on the service management platform and then dumped the heap with the following command:

jmap -dump:format=b,file=heap 7276

Importing the heap dump into JVisualVM also shows the space occupied by each kind of object, with int[] accounting for more than 50% of memory. Drilling down to the business objects that hold the int[] arrays, we found that they come from the Codis base component provided by the architecture team.

![](https://upload-images.jianshu.io/upload_images/23721221-d630aa8ec7523ccd?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

5. Analyze suspicious objects through code

Code analysis showed that the Codis base component creates an int array of about 40M every minute to compute TP99 and TP90, and that each array lives for one minute. In step 2 we had observed that old-generation memory grows by a little more than 40M per minute, so we inferred that this 40M int array was being promoted from the new generation to the old generation.
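
As a sanity check, the numbers collected so far can be turned into a rough estimate of the interval between collections. The sketch below assumes steady linear growth, which is a simplification, and uses the figures from the text: a 2G old generation, an 80% trigger, about 500M left after each FGC, and about 40M of growth per minute. The class and method names are made up.

```java
/** Hypothetical back-of-the-envelope estimate of how fast the old generation fills up. */
final class OldGenFillEstimate {

    /**
     * Minutes for old-generation usage to climb from the post-GC baseline
     * to the occupancy threshold, assuming a constant growth rate.
     */
    static double minutesBetweenCollections(double oldGenMb, double triggerFraction,
                                            double baselineMb, double growthMbPerMinute) {
        if (growthMbPerMinute <= 0) {
            throw new IllegalArgumentException("growth rate must be positive");
        }
        double headroomMb = oldGenMb * triggerFraction - baselineMb;
        return Math.max(0, headroomMb / growthMbPerMinute);
    }

    public static void main(String[] args) {
        // 2G old generation, 80% trigger, ~500M after each FGC, ~40M growth per minute.
        double minutes = minutesBetweenCollections(2048, 0.80, 500, 40);
        System.out.printf("estimated interval: %.1f minutes%n", minutes);
    }
}
```

With these inputs the estimate comes out at roughly 28 minutes, which is in the same range as the one FGC every 35 minutes seen in monitoring and therefore consistent with the inference that the array is what fills the old generation.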

We then checked the YGC frequency monitoring. As the figure below shows, there are about 8 YGCs per minute, which largely confirms our inference: with the CMS collector the default tenuring threshold is 6, meaning that objects still alive after 6 YGCs are promoted to the old generation. The large array in the Codis component lives for 1 minute and therefore survives about 8 YGCs, which exceeds this threshold.

![](https://upload-images.jianshu.io/upload_images/23721221-625639ca0e77cba8?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

At this point the investigation was essentially complete. But why had the problem not occurred before the release? As the figure above shows, the YGC frequency before the release was about 5 times per minute, so the array died before reaching the threshold of 6. After the release the frequency rose to about 8 times per minute, which caused the problem.
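
The reasoning above boils down to one comparison: how many YGCs happen during the lifetime of the object versus the tenuring threshold. Here is a small sketch of that check, assuming evenly spaced YGCs and the simplified rule used in this article (an object that survives as many YGCs as the threshold is promoted). The names are made up.

```java
/** Hypothetical check of whether an object lives long enough to be promoted. */
final class PromotionCheck {

    /** Number of young collections an object lives through, assuming evenly spaced YGCs. */
    static long ygcsSurvived(double ygcPerMinute, double lifetimeSeconds) {
        return (long) Math.floor(ygcPerMinute * lifetimeSeconds / 60.0);
    }

    /** True when the object survives at least tenuringThreshold young collections. */
    static boolean reachesOldGen(double ygcPerMinute, double lifetimeSeconds,
                                 int tenuringThreshold) {
        return ygcsSurvived(ygcPerMinute, lifetimeSeconds) >= tenuringThreshold;
    }

    public static void main(String[] args) {
        // The int array lives for 60 seconds; the tenuring threshold is 6.
        System.out.println("before release (5 YGC/min): " + reachesOldGen(5, 60, 6));
        System.out.println("after release  (8 YGC/min): " + reachesOldGen(8, 60, 6));
    }
}
```

At 5 YGCs per minute the array survives only 5 collections and dies in the new generation; at 8 per minute it survives 8 and crosses the threshold.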

6. Solutions

To fix the problem quickly, we raised the tenuring threshold from 6 to 15 (the -XX:MaxTenuringThreshold parameter). After the change, the FGC frequency returned to once every 2 days, although the problem would recur if the YGC frequency ever exceeded 15 times per minute. The fundamental solution, of course, is to optimize the program to reduce the YGC frequency and to shorten the lifetime of the int array in the Codis component, which we won't expand on here.
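
After changing a startup parameter like this, it is worth confirming that the running process actually picked it up. The sketch below assumes a HotSpot JVM and that the generation age setting corresponds to the MaxTenuringThreshold flag. It reads the effective value through com.sun.management.HotSpotDiagnosticMXBean; the class name TenuringThresholdCheck is made up.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reads the effective tenuring threshold of the current HotSpot JVM. */
final class TenuringThresholdCheck {

    static int maxTenuringThreshold() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // getVMOption returns the current value of the flag as a string.
        return Integer.parseInt(bean.getVMOption("MaxTenuringThreshold").getValue());
    }

    public static void main(String[] args) {
        System.out.println("MaxTenuringThreshold = " + maxTenuringThreshold());
    }
}
```

This only works on HotSpot-based JVMs; on other JVMs the bean or the flag may not exist.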

02 GC Operating Principles

The case analysis above touches on many GC principles. If you start troubleshooting without understanding them, the whole investigation is essentially guesswork.

Here I pick a few core points to explain how GC works, and then give a practical guide.

1. Heap memory structure

As you know, GC is divided into YGC (Young GC) and FGC (Full GC), both of which operate on the JVM heap. Let's take a look at the heap structure in JDK 8:

![](https://upload-images.jianshu.io/upload_images/23721221-3907e1ed10d87c74?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

As you can see, the heap memory has a generational structure, including the new generation and the old generation. The new generation is divided into Eden zone, From Survivor zone (S0), and To Survivor zone (S1), and the default ratio of the three is 8:1:1. In addition, the default ratio of new generation to old generation is 1:2.

The generation structure of heap memory takes into account that most objects have short life cycles, so objects with different life cycles can be placed in different regions, and different garbage collection algorithms are adopted for the new generation and the old generation to maximize the efficiency of GC.

2. When is YGC triggered?

In most cases, objects are allocated directly in the Eden area of the new generation. If Eden does not have enough space, a Minor GC (YGC) is triggered, and YGC processes only the new generation. Because most objects become reclaimable within a short time, only a very small number survive a YGC; these are moved to S0 (using the copying algorithm).

When the next YGC is triggered, the live objects in Eden and S0 are moved to S1, and Eden and S0 are cleared. When YGC is triggered again, the areas processed become Eden and S1 (S0 and S1 swap roles). Each time an object survives a YGC, its age increases by 1.
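
The copy-and-swap behaviour is easier to follow in a toy model. The sketch below is not how a real collector is implemented: it ignores object sizes and survivor capacity, takes liveness as a caller-supplied predicate, and uses the simplified rule that an object is promoted once its age reaches the threshold (the exact off-by-one in a real JVM may differ). All names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy model of Eden plus two survivor spaces; sizes and capacities are ignored. */
final class ToyYoungGen {

    static final class Obj {
        final String name;
        int age;

        Obj(String name) {
            this.name = name;
        }
    }

    private final int tenuringThreshold;
    final List<Obj> eden = new ArrayList<>();
    List<Obj> from = new ArrayList<>(); // survivor space that currently holds objects
    List<Obj> to = new ArrayList<>();   // survivor space that is currently empty
    final List<Obj> old = new ArrayList<>();

    ToyYoungGen(int tenuringThreshold) {
        this.tenuringThreshold = tenuringThreshold;
    }

    Obj allocate(String name) {
        Obj o = new Obj(name);
        eden.add(o);
        return o;
    }

    /** One minor GC: copy live objects out of Eden and From, clear both, swap survivors. */
    void minorGc(Predicate<Obj> isLive) {
        copyLive(eden, isLive);
        copyLive(from, isLive);
        eden.clear();
        from.clear();
        List<Obj> tmp = from; // From and To swap roles for the next collection
        from = to;
        to = tmp;
    }

    private void copyLive(List<Obj> source, Predicate<Obj> isLive) {
        for (Obj o : source) {
            if (!isLive.test(o)) {
                continue; // dead objects are simply not copied
            }
            o.age++;
            if (o.age >= tenuringThreshold) {
                old.add(o); // old enough: promote
            } else {
                to.add(o);  // still young: copy into the empty survivor space
            }
        }
    }
}
```

With a threshold of 2, an object that stays live is copied into a survivor space by the first minorGc call (age 1) and promoted to old by the second (age 2), while an object that is dead at the first collection never leaves Eden.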

3. When is FGC triggered?

There are four ways in which an object enters the old generation:

During YGC, the To Survivor area is too small to hold the surviving objects, so the objects go straight to the old generation.

After surviving multiple YGCs, an object whose age reaches the configured threshold is promoted to the old generation.

Dynamic age determination: if the total size of objects of the same age in the To Survivor area exceeds half of that area's space, objects of that age or older enter the old generation directly, without having to reach the configured tenuring threshold.

Large objects: controlled by the -XX:PretenureSizeThreshold startup parameter. An object larger than this value bypasses the new generation and is allocated directly in the old generation.
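
The dynamic age rule is the least intuitive of the four, so here is a small sketch of how an effective threshold could be computed from a per-age size table. Note the assumption: this version accumulates sizes from age 1 upward and compares the running total with half of the survivor space, which is how the rule is commonly described for HotSpot (the 50% corresponds to the TargetSurvivorRatio flag). It is an illustration, not HotSpot's code, and the names are made up.

```java
/** Hypothetical calculation of the effective tenuring threshold from a per-age size table. */
final class DynamicTenuring {

    /**
     * @param bytesByAge           bytesByAge[i] is the total size of survivor objects of age i
     *                             (index 0 is unused)
     * @param survivorCapacity     capacity of one survivor space, in the same unit
     * @param targetSurvivorRatio  percentage of the survivor space that may be occupied (50 = half)
     * @param maxTenuringThreshold configured upper bound for the threshold
     */
    static int computeThreshold(long[] bytesByAge, long survivorCapacity,
                                int targetSurvivorRatio, int maxTenuringThreshold) {
        long desired = survivorCapacity * targetSurvivorRatio / 100;
        long total = 0;
        for (int age = 1; age < bytesByAge.length; age++) {
            total += bytesByAge[age];
            if (total > desired) {
                // Objects of this age or older no longer fit the target: promote them.
                return Math.min(age, maxTenuringThreshold);
            }
        }
        return maxTenuringThreshold; // survivors fit comfortably; keep the configured value
    }

    public static void main(String[] args) {
        // Sizes in MB for ages 1..4, with a 200 MB survivor space and a configured threshold of 6.
        long[] bytesByAge = {0, 30, 50, 40, 10};
        System.out.println(computeThreshold(bytesByAge, 200, 50, 6));
    }
}
```

In this example the running total first exceeds 100 MB at age 3 (30 + 50 + 40), so objects of age 3 or older would be promoted even though the configured threshold is 6.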

FGC (Major GC) is triggered when the objects being promoted to the old generation need more space than the old generation has left. FGC processes both the new generation and the old generation. In addition, the following four conditions also trigger FGC:

FGC is triggered when old-generation memory usage reaches a certain threshold (adjustable via a parameter).

Space allocation guarantee: before a YGC, the JVM checks whether the maximum contiguous free space in the old generation is greater than the total size of all objects in the new generation. If it is smaller, the YGC is not guaranteed to be safe, and the JVM checks whether the HandlePromotionFailure parameter allows the guarantee to fail. If it does not, a Full GC is triggered. If it does, the JVM further checks whether the maximum contiguous free space in the old generation is greater than the average size of objects promoted to the old generation; if it is smaller, a Full GC is also triggered.

Metaspace is expanded when it runs short of space, and FGC is also triggered when it grows to the value specified by the -XX:MetaspaceSize parameter.

FGC is triggered when System.gc() or Runtime.getRuntime().gc() is called explicitly.
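
The space allocation guarantee reads more clearly as a decision function. The sketch below encodes only the rule as described above; it is not JVM code, and the class, enum, and parameter names are made up.

```java
/** Hypothetical encoding of the space allocation guarantee check performed before a YGC. */
final class PromotionGuarantee {

    enum Decision { YOUNG_GC, FULL_GC }

    static Decision beforeYoungGc(long oldMaxContiguousFree, long youngUsed,
                                  boolean handlePromotionFailure, long avgPromoted) {
        if (oldMaxContiguousFree > youngUsed) {
            return Decision.YOUNG_GC; // even if everything survives, it fits: YGC is safe
        }
        if (!handlePromotionFailure) {
            return Decision.FULL_GC;  // a failed guarantee is not allowed
        }
        // Risky YGC is allowed only if the old generation can hold an average promotion.
        return oldMaxContiguousFree > avgPromoted ? Decision.YOUNG_GC : Decision.FULL_GC;
    }

    public static void main(String[] args) {
        // 40 units free in the old generation, 50 used in the new generation, 10 promoted on average.
        System.out.println(beforeYoungGc(40, 50, true, 10));
        System.out.println(beforeYoungGc(40, 50, false, 10));
    }
}
```

The first call proceeds with a YGC because the free space exceeds the average promotion size, while the second goes straight to a Full GC because the guarantee is not allowed to fail.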

4. Under what circumstances does GC affect a program?

Both YGC and FGC pause the program to some degree (the Stop The World problem: while the GC threads work, the application's worker threads are suspended). Even more advanced collectors such as ParNew, CMS, or G1 only shorten the pauses; they do not eliminate them.

So when exactly does GC affect a program? In descending order of severity, I think there are four situations:

FGC too frequent: FGC is usually slow, taking from a few hundred milliseconds to a few seconds. Normally it runs only every few hours or even every few days, and its impact on the system is acceptable. However, when FGC occurs frequently (for example, every few minutes), worker threads are paused again and again, the system appears to stutter, and overall program performance degrades.

YGC takes too long: a total YGC time of tens or hundreds of milliseconds is quite normal. It may delay the system by a few milliseconds or tens of milliseconds, but users barely notice and the impact on the program is negligible. However, if a YGC takes 1 second or even several seconds (almost as long as an FGC), the pauses become noticeable, and because YGC itself is frequent, many more requests time out.

FGC takes too long: the longer an FGC takes, the longer the pause. Especially for high-concurrency services, this can cause more timeouts and reduced availability during FGC, so it also deserves attention.

YGC too frequent: even if YGC does not cause service timeouts, overly frequent YGC lowers the overall performance of the service, which also deserves attention for high-concurrency services.

Of these, “FGC too frequent” and “YGC takes too long” are the typical GC problems and are very likely to affect a program's quality of service. The other two are less serious, but still deserve attention for high-concurrency or high-availability programs.
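
To see which of these four situations applies, you need the collection count and time per collector. Besides the monitoring system, these can be read in-process through java.lang.management.GarbageCollectorMXBean, as in the sketch below. Keep in mind that getCollectionTime reports accumulated collection time, which for a concurrent collector such as CMS is not the same as pure stop-the-world pause time. The class name GcStats is made up.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: prints collection count and accumulated time for each collector. */
final class GcStats {

    /** Average time per collection in milliseconds, or 0 when nothing has been collected yet. */
    static double averageMillis(long count, long totalMillis) {
        return count <= 0 ? 0 : (double) totalMillis / count;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            long time = gc.getCollectionTime();   // accumulated milliseconds, -1 if undefined
            System.out.printf("%s: count=%d totalMs=%d avgMs=%.1f%n",
                    gc.getName(), count, time, averageMillis(count, time));
        }
    }
}
```

Sampling these values twice and dividing the difference in counts by the elapsed time gives the frequency, which is what the alerts in the case above were based on.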

03 Practical guidelines for troubleshooting FGC problems

Based on the case analysis and theory above, this part summarizes how to approach FGC problems as a practical guide for your reference.

1. Know what can cause FGC from the program's point of view

Large objects: the system loads too much data into memory at once (for example, an SQL query without pagination), so large objects enter the old generation.

Memory leak: the program keeps creating large numbers of objects that cannot be reclaimed (for example, an IO object whose close method is never called to release the resource), causing FGC first and eventually OOM.

The program frequently creates objects with long lifetimes. Once these objects survive past the tenuring threshold they enter the old generation and eventually cause FGC (the case in this article).

A bug causes many new classes to be generated dynamically, filling up Metaspace and causing FGC first and then OOM.

The gc method is called explicitly in code, either in your own code or in framework code.

JVM parameter settings: total heap size, new-generation and old-generation sizes, Eden and Survivor sizes, Metaspace size, choice of garbage collector, and so on.
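
For the first cause, the usual remedy is to process data page by page so that no single oversized result list is ever held in memory. Here is a minimal sketch; fetchPage stands for a hypothetical data-access call (offset, limit) that returns at most limit rows, and the class name PagedLoader is made up.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

/** Hypothetical helper that walks a large result set one page at a time. */
final class PagedLoader {

    /**
     * Calls handler once per page and returns the total number of rows seen.
     * fetchPage.apply(offset, limit) is an assumed data-access call, not a real API.
     */
    static <T> long forEachPage(BiFunction<Integer, Integer, List<T>> fetchPage,
                                int pageSize, Consumer<List<T>> handler) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        long total = 0;
        int offset = 0;
        while (true) {
            List<T> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break; // no more rows
            }
            handler.accept(page); // only one page is strongly referenced at a time
            total += page.size();
            if (page.size() < pageSize) {
                break; // short page: this was the last one
            }
            offset += pageSize;
        }
        return total;
    }
}
```

Each page becomes garbage as soon as the handler returns, so the rows stay short-lived and small instead of forming one large object that ends up in the old generation.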

2. Know the tools available for troubleshooting

Corporate monitoring system: Most companies have a comprehensive monitoring system for the JVM.

JDK tools include jmap, jstat, and other common commands:

View the usage and GC of each area of heap memory

jstat -gcutil -h20 pid 1000

View the live objects in heap memory and sort them by space

jmap -histo pid | head -n20

Dump heap memory files

jmap -dump:format=b,file=heap pid

Visual heap memory analysis tools: JVisualVM, MAT, etc
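
When attaching jmap from outside is not an option, a HotSpot JVM can also dump its own heap programmatically through com.sun.management.HotSpotDiagnosticMXBean. The sketch below assumes a HotSpot-based JVM, and the class name HeapDumper and the default file name are made up. As with jmap, the program pauses while the dump is written, and passing true for live objects only forces a full collection first.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: writes a heap dump of the current HotSpot JVM to a file. */
final class HeapDumper {

    /**
     * @param outputFile path of the dump; it must not exist yet, and newer JDKs
     *                   expect the name to end with .hprof
     * @param liveOnly   true to dump only reachable objects
     */
    static void dump(String outputFile, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(outputFile, liveOnly);
    }

    public static void main(String[] args) throws IOException {
        String target = args.length > 0 ? args[0] : "heap.hprof";
        dump(target, true);
        System.out.println("heap dump written to " + target);
    }
}
```

The resulting file can be opened in JVisualVM or MAT in the same way as a dump produced by jmap.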

3. Troubleshooting Guide

Check the monitoring to find out when the problem started and what the current FGC frequency is (compare it with the normal frequency to judge whether it is abnormal).

Find out whether any release went live or any base component was upgraded shortly before that point in time.

Understand the JVM parameter settings, including the sizes of the heap regions and the garbage collectors used for the new generation and the old generation, and then judge whether they are reasonable.

Then rule out the possible causes listed in step 1 one by one. A full Metaspace, a memory leak, and explicit gc calls in code are the easier ones to check.

For FGC caused by large objects or long-lived objects, use jmap -histo together with a heap dump to analyze further. The first goal is to locate the suspicious objects.

Trace the suspicious objects back to the specific code and analyze it, combining GC principles with the JVM parameter settings to determine whether the objects meet the conditions for entering the old generation.

The last word

Using an online case together with the principles of GC, this article has described the FGC troubleshooting process in detail and provided a practical guide.

In a future article I will cover a case of YGC taking too long in the same way, hoping to help you understand GC troubleshooting better. If you found this article helpful, please share it or give it a like!