Problem description
A regular engine system of the company will be manually preheated every time the version is started. After the preheating, the young generation GC will occasionally appear for 1-2 seconds after the flow is cut in (the flow is not large, and this situation will occur in every service under LB).
After this long pause, each young GENERATION GC pause time was restored to within 20-100ms
2s May not seem like a long time, but compared to the rule engine’s response time of around 10ms, it is still unacceptable; In addition, due to the rule engine response timeout, the order timeout will fail
Problem analysis
After analyzing the GC logs of the system, it is found that 2s pauses occur in the Young GC phase, and each Young GC with long pauses occurs is accompanied by the Promotion of new generation objects.
Core JVM parameters (Oracle JDK7)
-Xms10G
-Xmx10G
-XX:NewSize=4G
-XX:PermSize=1g
-XX:MaxPermSize=4g
-XX:+UseConcMarkSweepGC
Copy the code
The first young GC log after startup
The T16 2020-04-23: thou. 108 + 0800: [GC2020-04-23 T16: thou. 108 + 0800: [ParNew2020-04-23 T16: thou. 229 + 0800: [SoftReference, 0 refs, 0.0000950SECs]2020-04-23T16:28:31.229+0800: [WeakReference, 1156 Refs, 0.0001040 secs]2020-04-23T16:28:31.229+0800: [FinalReference, 10410 Refs, 0.0103720 secs]2020-04-23T16:28:31.240+0800: [PhantomReference, 286 Refs, 2 refs, 0.0129420 secs]2020-04-23T16:28:31.253+0800: [JNI Weak Reference, 0.0000000 secs] Desired survivor size 214728704 bytes, new threshold 1 (Max 15) -age 1: 315529928 bytes, 315529928 total - age 2: 40956656 bytes, 356486584 total - age 3: 8408040 bytes, 364894624 total : 3544342K-> 354555k (3574912k), 0.1444710secs] [Times: User sys = = 1.46 0.09, real = 0.15 secs]Copy the code
Long pause young generation GC logs
T17:2020-04-23 and. 514 + 0800: [GC2020-04-23 T17: same. 514 + 0800: [ParNew2020-04-23 T17: disguise. 975 + 0800: [SoftReference, 0 refs, 0.0000660 secS]2020-04-23T17:18:29.975+0800: [WeakReference, 1224 refs, 0.0001400SECs]2020-04-23T17:18:29.975+0800: [FinalReference, 8898 Refs, 0.0149670 secs]2020-04-23T17:18:29.990+0800: [PhantomReference, 600 refs, 1 refs, 0.0344300 secs] 2020-04-23T17:18:30.0025 +0800: [JNI Weak Reference, 0.0000210 secs] Desired survivor size 214728704 bytes, new threshold 15 (Max 15) -age 1: 79203576 bytes, 79203576 total : 3730075K-> 374371k (3774912K), 1.5114000secs] [Times: User sys = = 6.32 0.58, real = 1.51 secs]Copy the code
From the long pause GC log, promotion occurred. After the Young GC, 363M + objects were promoted to the old age. This promotion operation should be time-consuming (PS: after checking the safepoint reason, there is no exception).
Since the -xx :+PrintHeapAtGC parameter is not configured in the log parameter, here is the promotion size calculated manually:
Young Generation Young change - Full Heap Capacity change = Advancement size (304371K-3730075K) - (676858K-3730075K) = 372487K(363M)Copy the code
Next young generation GC log
T17:2020-04-23 for. 749 + 0800: [GC2020-04-23 T17: for 749 + 0800: [ParNew2020-04-23 T17: for 774 + 0800: [SoftReference, 0 refs, 0.0000500 secs]2020-04-23T17:23:39.774+0800: [WeakReference, 3165 refs, 0.0002720 secs]2020-04-23T17:23:39.774+0800: [FinalReference, 3520 refs, 0.0021520 secs]2020-04-23T17:23:39.776+0800: [PhantomReference, 150 refs, 1 refs, 0.0051910 secs]2020-04-23T17:23:39.782+0800: [JNI Weak Reference, 0.0000100 secs] Desired survivor size 214728704 bytes, new threshold 15 (Max 15) -age 1: 17076040 bytes, 17076040 total - age 2: 40832336 bytes, 57908376 total : 3659891K->90428K(3774912K), secs] 4032378K->462914K(10066368K), secs] [Times: User sys = = 0.30 0.00, real = 0.03 secs]Copy the code
At first glance, there is nothing wrong, but when I think about it carefully, I find something abnormal. Why did the promotion happen in the second GC just after the program started
This is probably due to dynamic age determination, where the threshold for advancement in GC is not fixed at 15, but calculated dynamically by the JVM after each GC
Promotion mechanism for younger generation
In order to better adapt to the memory conditions of different programs, the virtual machine does not always require that the object age must reach MaxTenuringThreshold to advance to the old age. If the sum of all object sizes of the same age in the Survivor space is greater than half of the Survivor space, Objects older than or equal to this age can go directly to the old age without waiting until the age specified in MaxTenuringThreshold
As described in Understanding the Java Virtual Machine, the age threshold for an object’s advancement is determined dynamically.
However, after checking other materials and verifying, I found that there is some discrepancy between this and the explanation in Understanding the Java Virtual Machine (or the explanation in the book is not clear enough).
If the total of this group is greater than half of survivor, the promotion age threshold is updated to the age of this group
Note: Instead of being promoted if you pass survivor half, passing survivor half will only reset the promotion threshold and the next GC will use this new threshold
3544342K->374555K, 0.1444710 secs] Young generation 3544342K->374555K, 0.1446290 secs] full heapCopy the code
As evidenced by the first GC log above, the full heap memory change was equal to the young generation memory change in this GC, so there was no object promotion
As in the log above, the first GC simply sets threshold to 1 because 214728704 bytes are survivor half and 315529928 bytes are total for objects of age 1. The Desired survivor size was exceeded, so after this GC the threshold was set to the object age 1 of age 1
The object promotion age threshold was set to 1. Desired survivor size 214728704 bytes, new threshold 1 (Max 15) -age 1: 315529928 bytes, 315529928 total - age 2: 40956656 bytes, 356486584 total - age 3: 8408040 bytes, 364894624 totalCopy the code
To explain the output of this age distribution:
- age 1: 315529928 bytes, 315529928 total
Copy the code
– age 1 Indicates the group of objects whose age is 1. 315529928 bytes indicates the memory size occupied by objects whose age is 1
315529928 Total Indicates the total size of objects smaller than or equal to the current group age. Group objects by age. The total of age 1 is the total size of age 1. The total of age 2 is the total size of age 1 + age 2. Age n Group Total: Age 1 + Age 2 +… +age n total size, as shown in the following figure
The promotion threshold is updated when the total value of the maximum group exceeds survivor/2
On the second young generation GC, the long pause Young Generation GC log, with a new promotion age threshold of 1, promotion occurs for objects that have survived one GC and are still reachable
The long pause was caused by 363M object promotions in this GC
thinking
Is this “dynamic object age” really justified in the JVM? In my opinion, the mechanism is good and can better adapt to the memory condition of different programs, but it is not suitable for all scenarios, such as the scenario in this article where the GC is not started
Since most objects are aged 0 or 1 at the beginning of the program, it is easy to have a large number of viable objects aged 1. This “dynamic object age determination” mechanism causes a new promotion threshold to be set to 1, resulting in the promotion of objects that should not be promoted
For example, when the program is initializing and loading various resources, the Young GC occurs. The loading logic is still in progress. Many newly created objects are reachable during this GC.
After the GC, the age of these objects is updated to 1, but due to the “dynamic object age determination” mechanism, the promotion age threshold is updated to the age of the “largest object age group”, which is the group of objects that have just gone through a GC
Shortly after this GC, the resource initialization is complete, and the related objects involved are probably not reachable, but since the promotion age threshold was just updated to 1, the next normal Young GC will promote the objects of age 1 directly, prematurely or incorrectly
The solution
A review of the documentation shows that dynamic age can’t be disabled, so the only way to solve this problem is to “bypass” the calculation rules
Dynamic age is determined based on the fact that the total size of all objects of the same age in the Survivor space is greater than half of the size in the Survivor space, so it is also easy to solve according to this mechanism
Since we know our system well enough to know the approximate memory required to load resources, we can easily set survivor capacity to a value greater than the sum of these temporarily reachable objects
For example, in the log above, the object of age 1 after the first GC has 315529928 Bytes(300M), which has a Desired survivor size of (survivor size /2) 214728704 Bytes(204M), Then survivor can be set to 600M or more.
For safety, however, survivor was set to 800M, which desired survivor size to be around 400M, and after the first Young GC, There is no updating of the promotion age threshold because the sum of objects of age 1 exceeds the desired survivor size, and there is no GC long pause due to early/false promotion
Survivor size cannot be specified directly, but survivor size can be adjusted with -xx :SurvivorRatio
-XX:SurvivorRatio=8
Copy the code
Denotes the ratio of two Survivor and Edgen zones, 8 denotes two survivors :Eden=2:8, i.e., one Survivor accounts for 1/10 of the new generation.
The calculation method is:
Survivor Size(1) = Young Generation Size / (2+SurvivorRatio)
Eden Size = Young Generation Size / (2+SurvivorRatio) * SurvivorRatio
Copy the code
Further reading
Why does it take so many times longer to get promoted to 300M than a younger generation to get back to 3G
According to the nature of the replication algorithm, the time consumption of the replication algorithm mainly depends on the size of the living object, not the size of the total space
For example, for the young generation of 4G above (actually only Eden+S0 is available), GC only requires traversing the object graph from GC ROOTS and copying the reachable objects to S1, rather than traversing the whole young generation
Replication algorithm in detail you can refer to my another article “starting from scratch, handwritten GC algorithm (young generation recycling) | | replication algorithm can run with complete source code,
In the above long pause GC log, a promotion of 363M took place and a recovery of 300M or so. Compared with the first GC, it can be concluded that the 1.5S spent is basically a promotion operation
So why does promotion take so long?
We don’t go into the details of young generation promotion for Oracle JVM implementations, but promotion involves cross-generation replication (in fact, both young generation and old generation are heap, there is no difference in replication, it is just memcpy, but there is more logic to process), and the logic to process is more complex. Such as pointer updates and other operations, RSet updates, etc., more time-consuming is also understandable.
Some people might think that I have turned up the young generation, which will cause the young generation’s GC to be slower and pause longer. In fact, no, if you have read my article on handwritten GC, you should know that the time of young generation GC is proportional to the number of live objects, not the number of recovered objects. In my scenario, there are very few live objects, so there is no side effect of increasing the young band
Native code emulation
Also attached is a piece of code that can simulate the problem locally and run the tests directly under Oracle JDK7
/ / jdk7.
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
public class PromotionTest {
public static void main(String[] args) throws IOException {
// Simulate the resource initialization scenario
List<Object> dataList = new ArrayList<>();
for (int i = 0; i < 5; i++) {
dataList.add(new InnerObject());
}
// Simulate traffic entering the scene
for (int i = 0; i < 73; i++) {
if(i == 72){
System.out.println("Execute young gc... Adjust promotion threshold to 1");
}
new InnerObject();
}
System.out.println("Execute full gc... dataList has been promoted to cms old space");
// The dataList object will enter the old age after the Full GC
System.gc();
}
public static byte[] createData(){
int dataSize = 1024*1024*4;//4m
byte[] data = new byte[dataSize];
for (int j = 0; j < dataSize; j++) {
data[j] = 1;
}
return data;
}
static class InnerObject{
private Object data;
public InnerObject(a) {
this.data = createData(); }}}Copy the code
jvm options
-server -Xmn400M -XX:SurvivorRatio=9 -Xms1000M -Xmx1000M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintReferenceGC -XX:+PrintGCApplicationStoppedTime -XX:+UseConcMarkSweepGC
Copy the code
reference
- Understanding the JAVA Virtual Machine in Depth. By Zhou Zhiming
- Useful JVM Flags — Part 5 (Young Generation Garbage Collection)
Original is not easy, reprint please contact the author. If my article helped you, please like/bookmark/follow it ❤❤❤❤❤❤