Problem description

A regular engine system of the company will be manually preheated every time the version is started. After the preheating, the young generation GC will occasionally appear for 1-2 seconds after the flow is cut in (the flow is not large, and this situation will occur in every service under LB).

After this long pause, each young GENERATION GC pause time was restored to within 20-100ms

2s May not seem like a long time, but compared to the rule engine’s response time of around 10ms, it is still unacceptable; In addition, due to the rule engine response timeout, the order timeout will fail

Problem analysis

After analyzing the GC logs of the system, it is found that 2s pauses occur in the Young GC phase, and each Young GC with long pauses occurs is accompanied by the Promotion of new generation objects.

Core JVM parameters (Oracle JDK7)

-Xms10G -Xmx10G -XX:NewSize=4G -XX:PermSize=1g -XX:MaxPermSize=4g -XX:+UseConcMarkSweepGCCopy the code

The first young GC log after startup

The T16 2020-04-23: thou. 108 + 0800: [GC2020-04-23 T16: thou. 108 + 0800: [ParNew2020-04-23 T16: thou. 229 + 0800: [SoftReference, 0 refs, 0.0000950SECs]2020-04-23T16:28:31.229+0800: [WeakReference, 1156 Refs, 0.0001040 secs]2020-04-23T16:28:31.229+0800: [FinalReference, 10410 Refs, 0.0103720 secs]2020-04-23T16:28:31.240+0800: [PhantomReference, 286 Refs, 2 refs, 0.0129420 secs]2020-04-23T16:28:31.253+0800: [JNI Weak Reference, 0.0000000 secs]Desired survivor size 214728704 bytes, new threshold 1 (Max 15) -age 1: 315529928 bytes, 315529928 total- age 2: 40956656 bytes, 356486584 total- age 3: 8408040 bytes, 364894624 total: 3544342K-> 354555k (3574912k), 0.1444710secs] [Times: User sys = = 1.46 0.09, real = 0.15 secs]Copy the code

Long pause young generation GC logs

T17:2020-04-23 and. 514 + 0800: [GC2020-04-23 T17: same. 514 + 0800: [ParNew2020-04-23 T17: disguise. 975 + 0800: [SoftReference, 0 refs, 0.0000660 secS]2020-04-23T17:18:29.975+0800: [WeakReference, 1224 refs, 0.0001400SECs]2020-04-23T17:18:29.975+0800: [FinalReference, 8898 Refs, 0.0149670 secs]2020-04-23T17:18:29.990+0800: [PhantomReference, 600 refs, 1 refs, 0.0344300 secs] 2020-04-23T17:18:30.0025 +0800: [JNI Weak Reference, 0.0000210 secs]Desired survivor size 214728704 bytes, new threshold 15 (Max 15) -age 1: 79203576 bytes, 79203576 total: 3730075K-> 374371k (3774912K), 1.5114000secs] [Times: User sys = = 6.32 0.58, real = 1.51 secs]Copy the code

From the long pause GC log, promotion occurred. After the Young GC, 363M+ objects were promoted to the old age. This promotion operation should be time-consuming (PS: after checking the safepoint reason, there is no exception).

Since the -xx :+PrintHeapAtGC parameter is not configured in the log parameter, here is the promotion size calculated manually:

Young Generation Young change - Full Heap Capacity change = Advancement size (304371K-3730075K) - (676858K-3730075K) = 372487K(363M)Copy the code

Next young generation GC log

T17:2020-04-23 for. 749 + 0800: [GC2020-04-23 T17: for 749 + 0800: [ParNew2020-04-23 T17: for 774 + 0800: [SoftReference, 0 refs, 0.0000500 secs]2020-04-23T17:23:39.774+0800: [WeakReference, 3165 refs, 0.0002720 secs]2020-04-23T17:23:39.774+0800: [FinalReference, 3520 refs, 0.0021520 secs]2020-04-23T17:23:39.776+0800: [PhantomReference, 150 refs, 1 refs, 0.0051910 secs]2020-04-23T17:23:39.782+0800: [JNI Weak Reference, 0.0000100 secs]Desired survivor size 214728704 bytes, new threshold 15 (Max 15) -age 1: 17076040 bytes, 17076040 total- age 2: 40832336 bytes, 57908376 total: 3659891K->90428K(3774912K), secs] 4032378K->462914K(10066368K), secs] [Times: User sys = = 0.30 0.00, real = 0.03 secs]Copy the code

At first glance, there is nothing wrong, but when I think about it carefully, I find something abnormal. Why did the promotion happen in the second GC just after the program started

This is probably due to dynamic age determination, where the threshold for advancement in GC is not fixed at 15, but calculated dynamically by the JVM after each GC

Promotion mechanism for younger generation

In order to better adapt to the memory conditions of different programs, the virtual machine does not always require that the object age must reach MaxTenuringThreshold to advance to the old age. If the sum of all object sizes of the same age in the Survivor space is greater than half of the Survivor space, Objects older than or equal to this age can go directly to the old age without waiting until the age specified in MaxTenuringThreshold

As described in Understanding the Java Virtual Machine, the age threshold for an object’s advancement is determined dynamically.

However, after checking other materials and verifying, I found that there is some discrepancy between this and the explanation in Understanding the Java Virtual Machine (or the explanation in the book is not clear enough).

If the total of this group is greater than half of survivor, the promotion age threshold is updated to the age of this group

Note: Instead of being promoted if you pass survivor half, passing survivor half will only reset the promotion threshold and the next GC will use this new threshold

3544342K->374555K, 0.1444710 secs] Young generation 3544342K->374555K, 0.1446290 secs] full heapCopy the code

As evidenced by the first GC log above, the full heap memory change was equal to the young generation memory change in this GC, so there was no object promotion

As in the log above, the first GC simply sets threshold to 1, because at this point the survivor half is 214728704 bytes, and the sum of objects with age 1 is 315529928 bytes, which exceeds the Desired survivor size. So after this GC, set threshold to the object age of 1

The object promotion age threshold was changed to 1Desired survivor size 214728704 bytes, and new threshold 1 (Max 15) -age 1: 315529928 bytes, 315529928 total- age 2: 40956656 bytes, 356486584 total- age 3: 8408040 bytes, 364894624 totalCopy the code

To explain the output of this age distribution:

- age   1:  315529928 bytes,  315529928 total Copy the code

– age 1 Indicates the group of objects whose age is 1. 315529928 bytes indicates the memory size occupied by objects whose age is 1

315529928 Total Indicates the total size of objects smaller than or equal to the current group age. The total size of age 1 is the total size of age 1. The total size of age 2 is the total size of age 1 + age 2. The total size of age N is age 1 + age 2 +… +age n total size, as shown in the following figure

The promotion threshold is updated when the total value of the maximum group exceeds survivor/2

On the second young generation GC, the long pause Young Generation GC log, with a new promotion age threshold of 1, promotion occurs for objects that have survived one GC and are still reachable

The long pause was caused by 363M object promotions in this GC

thinking

Is this “dynamic object age” really justified in the JVM? In my opinion, the mechanism is good and can better adapt to the memory condition of different programs, but it is not suitable for all scenarios, such as the scenario in this article where the GC is not started

Since most objects are aged 0 or 1 at the beginning of the program, it is easy to have a large number of viable objects aged 1. This “dynamic object age determination” mechanism causes a new promotion threshold to be set to 1, resulting in the promotion of objects that should not be promoted

For example, when the program is initializing and loading various resources, the Young GC occurs. The loading logic is still in progress. Many newly created objects are reachable during this GC.

After the GC, the age of these objects is updated to 1, but due to the “dynamic object age determination” mechanism, the promotion age threshold is updated to the age of the “largest object age group”, which is the group of objects that have just gone through a GC

Shortly after this GC, the resource initialization is complete, and the related objects involved are probably not reachable, but since the promotion age threshold was just updated to 1, the next normal Young GC will promote the objects of age 1 directly, prematurely or incorrectly

The solution

A review of the documentation shows that dynamic age can’t be disabled, so the only way to solve this problem is to “bypass” the calculation rules

Dynamic age is determined based on the fact that the total size of all objects of the same age in the Survivor space is greater than half of the size in the Survivor space, so it is also easy to solve according to this mechanism

Since we know our system well enough to know the approximate memory required to load resources, we can easily set survivor capacity to a value greater than the sum of these temporarily reachable objects

For example, in the log above, the object of age 1 after the first GC has 315529928 Bytes(300M), which has a Desired survivor size of (survivor size /2) 214728704 Bytes(204M), Then survivor can be set to 600M or more.

For safety, however, survivor was set to 800M, which desired survivor size to be around 400M, and after the first Young GC, There is no updating of the promotion age threshold because the sum of objects of age 1 exceeds the desired survivor size, and there is no GC long pause due to early/false promotion

Survivor size cannot be specified directly, but survivor size can be adjusted with -xx :SurvivorRatio

-XX:SurvivorRatio=8Copy the code

Denotes the ratio of two Survivor and Edgen zones, 8 denotes two survivors :Eden=2:8, i.e., one Survivor accounts for 1/10 of the new generation.

The calculation method is:

Survivor Size(1) = Young Generation Size / (2+SurvivorRatio)Eden Size = Young Generation Size / (2+SurvivorRatio) * SurvivorRatioCopy the code

Further reading

Why does it take so many times longer to get promoted to 300M than a younger generation to get back to 3G

According to the nature of the replication algorithm, the time consumption of the replication algorithm mainly depends on the size of the living object, not the size of the total space

For example, for the young generation of 4G above (actually only Eden+S0 is available), GC only requires traversing the object graph from GC ROOTS and copying the reachable objects to S1, rather than traversing the whole young generation

For a detailed introduction to the replication algorithm, please refer to my other article “Garbage Collection Algorithm Implementation – Replication Algorithm (complete runnable C language code)”.

In the above long pause GC log, a promotion of 363M took place and a recovery of 300M or so. Compared with the first GC, it can be concluded that the 1.5S spent is basically a promotion operation

So why does promotion take so long?

We don’t go into the details of young generation promotion for Oracle JVM implementations, but promotion involves cross-generation replication (in fact, both young generation and old generation are heap, there is no difference in replication, it is just memcpy, but there is more logic to process), and the logic to process is more complex. Operations like pointer updates are understandably more time-consuming,

Native code emulation

Also attached is a piece of code that can simulate the problem locally and run the tests directly under Oracle JDK7

/ / jdk7. import java.io.IOException; import java.util.ArrayList; import java.util.HashMap; import java.util.List; Public class PromotionTest {public static void main(String[] args) throws IOException {List<Object> dataList = new ArrayList<>();for(int i = 0; i < 5; i++) { dataList.add(new InnerObject()); } // Simulate traffic entering the scenefor (int i = 0; i < 73; i++) {            if(i == 72){                System.out.println("Execute young gc... Adjust promotion threshold to 1");            }            new InnerObject();        }        System.out.println("Execute full gc... dataList has been promoted to cms old space"); System.gc(); system.gc (); system.gc (); } public static byte[]createData(){ int dataSize = 1024*1024*4; //4m byte[] data = new byte[dataSize];for (int j = 0; j < dataSize; j++) {            data[j] = 1;        }        return data;    }    static class InnerObject{        private Object data;        public InnerObject() { this.data = createData(); }}}Copy the code

jvm options

-server -Xmn400M -XX:SurvivorRatio=9 -Xms1000M -Xmx1000M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintReferenceGC -XX:+PrintGCApplicationStoppedTime -XX:+UseConcMarkSweepGCCopy the code

Java Basics – Compiled source code

Java memory

The garbage collection

Performance testing tool

Performance tuning


Public account: Java Architects Association, updated daily technical good articles

Some of the information has been uploaded to my Git repository: download it if you need it

Gitee.com/biwangsheng…