Background

I recently ran into an unusual GC problem that I felt was worth documenting and sharing. The service itself is an ordinary Spring Boot application running in a K8s environment; it makes a lot of network calls, has no caching, and does little CPU computation.

JVM version: Oracle JDK 1.8.161

6 CPU cores, 6 GB JVM heap, 2 GB young generation, CMS + ParNew.
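
The original startup command isn't shown in the post; a rough equivalent of the settings above would look like this (the jar name is a placeholder):

java -server -Xms6g -Xmx6g -Xmn2g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -jar app.jar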

The problem

During a load test, the GC time of this service was too high. So, on the usual reflex that more instances paper over most problems, we blindly scaled out. After a few days of running, frequent timeouts started to appear: a service whose responses normally take tens of milliseconds saw its TP9999 reach 300 ms!

The GC monitoring looked even worse: young-generation GC pauses were reaching 300 ms!

A 300 ms young GC really broke my mental model.

Looking back at the instances we had scaled out so aggressively, CPU utilization was pitifully low. Clearly, scaling out is not a cure-all.

The troubleshooting process

Young-generation GC takes too long

First, a search on long young-generation GC pauses turned up a treasure trove: blog.ragozin.info/p/garbage-c…

On young-generation GC: blog.ragozin.info/2011/06/und…

On CMS: blog.ragozin.info/2011/06/und…

A young-generation GC pause consists mainly of four phases: stack (root) scanning, survivor copying, card-table scanning, and scanning the old-generation areas that the card table marks as dirty.

The stack scan can basically be ignored. Survivor copying is usually quick because the survival rate is low, so few objects need to be copied (unless the young-generation survival rate is abnormally high). The bulk of the time therefore goes to the card-table scan and the old-generation scan driven by the dirty cards it finds. In short: the larger and fuller the old generation, the longer the young-generation GC pause.
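
Detailed GC logging is what makes this kind of analysis possible. The post doesn't list the exact logging flags used; a typical JDK 8 set would be:

java ... -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:gc.log ...

In particular, -XX:+PrintTenuringDistribution makes each ParNew entry print the survivor age histogram, from which allocation and promotion rates can be read off.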

With detailed GC logging turned on, you can determine the young generation's allocation and promotion rates. Here I looked at the Survivor space usage and found it was really low:

And the old-generation usage was really high:

Conventional wisdom has the young generation smaller than the old generation, and the JVM's default sizing (NewRatio) nudges us the same way. But for a stateless, IO-intensive application, a large old generation is mostly dead weight.
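
As a concrete illustration: the default server-class setting is -XX:NewRatio=2, i.e. the old generation gets twice the young generation's share, so a 6 GB heap splits into roughly 2 GB young and 4 GB old, which is exactly the shape this service started with.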

So the next idea was to shrink the old generation.

1. Shrink the old generation to 500 MB
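
The exact flags for this step aren't shown in the post; one way to cap the old generation at roughly 500 MB is to fix the heap size and hand most of it to the young generation, for example (values illustrative):

java -Xms2g -Xmx2g -Xmn1536m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC ...

With a 2 GB heap and a 1536 MB young generation, the old generation ends up at about 512 MB.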

After this change, the young-generation GC time can be seen climbing gradually from 10 ms up to about 60 ms; because the old generation is now bounded, the young-generation GC time has an upper bound too.

The application's TP9999 climbed from 50 ms to 150 ms accordingly.

This application's old generation grows very slowly, so it takes a long time to trigger a major GC, and the fuller the old generation gets, the longer each young-generation GC takes.

If the old-generation occupancy is never brought back down, young-generation GC stays slow, so we want the old generation to be collected more often.

2. Reduce the tenuring threshold to 2

Add the JVM parameter -XX:MaxTenuringThreshold=2. The old generation now grows a bit faster:

3. Trigger old-generation GC at 50% occupancy
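
The flag involved (it also appears in the final parameter list at the end) is CMSInitiatingOccupancyFraction, and UseCMSInitiatingOccupancyOnly is needed alongside it, otherwise HotSpot treats the value only as an initial hint and afterwards picks the trigger point adaptively:

-XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly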

The old generation finally gets collected:

And the young-generation GC time no longer grows without bound.

Solving the old-generation problem

We then iterated further on the size of the young generation. It turned out that only with the young generation around 1800 MB and CMSInitiatingOccupancyFraction=50 could the GC times be kept within an acceptable range.

But the GC log showed there was still a problem:

2021-04-07T19:23:47.633+0800: 1486.331: [GC (Allocation Failure) 2021-04-07T19:23:47.633+0800: 1486.331: [ParNew: 1314975K->1448K(1474560K), 0.00982 secs] 1544334K->1433312K(1933312K), 0.0101444 secs] [Times: user=0.21 sys=0.01, real=0.01 secs]
2021-04-07T19:23:47.643+0800: 1486.341: Total time for which application threads were stopped: 0.0114859 seconds, Stopping threads took: 0.0001502 seconds
2021-04-07T19:23:47.644+0800: 1486.341: Application time: 0.0000415 seconds
2021-04-07T19:23:47.644+0800: 1486.342: [GC (CMS Initial Mark) [1 CMS-initial-mark: 229603K(458752K)] 229279K(1933312K), 0.0017176 secs] [Times: ...]
2021-04-07T19:23:47.646+0800: 1486.344: Total time for which application threads were stopped: 0.0028056 seconds, Stopping threads took: 0.0001077 seconds
2021-04-07T19:23:47.646+0800: 1486.344: [CMS-concurrent-mark-start]
2021-04-07T19:23:47.668+0800: 1486.365: [CMS-concurrent-mark: 0.021/0.021 secs] [Times: ...]
2021-04-07T19:23:47.668+0800: 1486.365: [CMS-concurrent-preclean-start]
2021-04-07T19:23:47.670+0800: 1486.368: [CMS-concurrent-preclean: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2021-04-07T19:23:47.670+0800: 1486.368: [CMS-concurrent-abortable-preclean-start]
2021-04-07T19:23:49.870+0800: 1488.567: [CMS-concurrent-abortable-preclean: ...] [Times: ...]
2021-04-07T19:23:49.870+0800: 1488.567: Application time: 2.2235089 seconds
2021-04-07T19:23:49.871+0800: 1488.568: [GC (CMS Final Remark) [YG occupancy: ... K (1474560 K)]2021-04-07T19:23:49.871+0800: 1488.568: [Rescan (parallel) , 0.0901161 secs]2021-04-07T19:23:49.961+0800: 1488.659: [weak refs processing, 0.0003090 secs]2021-04-07T19:23:49.962+0800: 1488.659: [class unloading, ...]: [scrub symbol table, 0.0118417 secs]2021-04-07T19:23:50.004+0800: 1488.702: [scrub string table, 0.1605043 secs][1 CMS-remark: 229603K(458752K)] 914720K(1933312K), 0.2950712 secs] [Times: user=0.88 sys=0.01, real=0.29 secs]
2021-04-07T19:23:50.166+0800: 1488.864: Total time for which application threads were stopped: 0.2964125 seconds, Stopping threads took: 0.0001767 seconds
2021-04-07T19:23:50.166+0800: 1488.864: Application time: 0.0000984 seconds
2021-04-07T19:23:50.166+0800: 1488.864: [CMS-concurrent-sweep-start]
...
2021-04-07T19:23:50.298+0800: 1488.995: [CMS-concurrent-sweep: 0.108/0.131 secs] [Times: user=0.51 sys=0.04, real=0.13 secs]
2021-04-07T19:23:50.298+0800: 1488.995: [CMS-concurrent-reset-start]
2021-04-07T19:23:50.299+0800: 1488.996: [CMS-concurrent-reset: 0.001/0.001 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

The old-generation GC time shown in our monitoring was not accurate; the application pauses visible in the log are actually much longer. Specifically, of the two STW phases of CMS, the CMS Initial Mark did not take much time, but the CMS Final Remark took about 290 ms, which is a bit deadly…

And TP9999 did indeed spike whenever a major GC ran.

1. Run a young-generation GC before the CMS remark

How do we fix this? Normally CMS tries to piggyback on a young-generation GC before the Initial Mark to reduce the amount of scanning, and according to the log a young-generation GC did indeed occur shortly before the CMS cycle started.

The problem this time is in the Final Remark phase. Adding -XX:+CMSScavengeBeforeRemark forces another young-generation GC right before the remark, so there are far fewer young-generation objects to scan and the scan time drops. With that in place, we checked back the next day.

The Final Remark time now showed up in the monitoring too, but it was still wrong (a problem to follow up on later), because the service's TP9999 still rose noticeably during GC. Looking at the GC log:

2021-04-08T13:39:17.892+0800: 61397.551: [GC (CMS Final Remark) [YG occupancy: 304654 K (1474560 K)]2021-04-08T13:39:17.892+0800: 61397.552: [GC (CMS Final Remark) 2021-04-08T13:39:17.892+0800: 61397.552: [ParNew: 304654K->3064K(1474560K), 0.0084983 secs] [Times: user=0.18 sys=0.01, real=0.01 secs]
2021-04-08T13:39:17.901+0800: 61397.560: [Rescan (parallel) , 0.0020656 secs]2021-04-08T13:39:17.903+0800: 61397.562: [weak refs processing, 0.0005887 secs]2021-04-08T13:39:17.904+0800: 61397.563: [class unloading, ...]: [scrub symbol table, 0.0105486 secs]2021-04-08T13:39:17.944+0800: 61397.604: [scrub string table, 0.1656738 secs][1 CMS-remark: 229674K(458752K)] 232738K(1933312K), 0.2302244 secs] [Times: user=0.44 sys=0.01, real=0.23 secs]

Sure enough, a ParNew young-generation GC ran first, shrinking the young generation from roughly 600 MB to 3 MB, and the young-generation scan time dropped from 90 ms to 3 ms. The total remark time did fall to 230 ms, but that is still too high.

2. Resolve the String table problem

Looking more closely at the detailed CMS log, the following snippet shows the internal steps of the remark phase:

2021-04-08T13:39:17.901+0800: 61397.560: [Rescan (parallel) , 0.0020656 secs]
2021-04-08T13:39:17.903+0800: 61397.562: [weak refs processing, 0.0005887 secs]
2021-04-08T13:39:17.904+0800: 61397.563: [class unloading, ...]
...: [scrub symbol table, 0.0105486 secs]
2021-04-08T13:39:17.944+0800: 61397.604: [scrub string table, 0.1656738 secs]

The [scrub string table, 0.1656738 secs] step jumps out: scrubbing the string table alone took about 165 ms, most of the remark pause.
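
For context, the string table is the JVM-internal hash table behind String.intern(); CMS cleans it during the remark pause, so the bigger the table, the longer this step takes. A trivial, purely illustrative demo of how intern() keeps adding entries:

public class InternDemo {
    public static void main(String[] args) {
        // Every distinct string passed to intern() adds an entry to the
        // JVM string table; a huge table makes the remark's
        // "scrub string table" step proportionally slower.
        for (int i = 0; i < 1_000_000; i++) {
            String interned = ("field-" + i).intern();
        }
    }
}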

Searching for this term turned up an article that described a problem almost identical to ours.

The article pointed to Jackson's use of String.intern() as the culprit.
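
The post doesn't show exactly how Jackson was modified; assuming Jackson 2.x, field-name interning can be switched off on the JsonFactory, roughly like this:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonConfig {
    // Stop Jackson from calling String.intern() on parsed field names,
    // so heavy JSON parsing no longer keeps growing the JVM string table.
    public static ObjectMapper newMapper() {
        JsonFactory factory = new JsonFactory();
        factory.disable(JsonFactory.Feature.INTERN_FIELD_NAMES);
        return new ObjectMapper(factory);
    }
}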

Three more related articles turned up: one, two, three.

Recall that an earlier jmap -histo:live snapshot did show String near the top, but the footprint wasn't that large:

 num     #instances         #bytes  class name
----------------------------------------------
   1:         44126       42924632  [Ljava.lang.Object;
   2:         22542       14787552  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
   3:         12673       11341144  [B
   4:        120882        9288568  [C
   5:        118651        2847624  java.lang.String

Analyzing a heap dump with MAT also showed no obvious anomalies.

To confirm the problem, I wanted to add the -XX:+PrintStringTableStatistics parameter, but it only prints the String Table statistics when the JVM exits, and the program was running in K8s.

In the end, the only option left was the jmap -heap command, whose last line of output contains the String Table information.
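
Roughly what that looks like (the PID is a placeholder; the output line uses the "before" figures reported below):

jmap -heap <pid> | tail -n 1
307321 interned Strings occupying 23220408 bytes.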

  • Before modifying Jackson: 307321 interned Strings occupying 23220408 bytes.

  • After modifying Jackson: 26352 interned Strings occupying 2853024 bytes.

That's a ten-fold difference. The new CMS GC log is shown below; the old-generation GC pause has dropped to about 110 ms, which is pretty much ideal.

2021-04-20T17:25:03.781+0800: 452505.107: [GC (CMS Final Remark) 2021-04-20T17:25:03.781+0800: 452505.107: [ParNew
Desired survivor size 38109184 bytes, new threshold 15 (max 15)
- age  1:    4111480 bytes,    4111480 total
- age  2:      45496 bytes,    4156976 total
- age  3:      12184 bytes,    4169160 total
- age  4:      42608 bytes,    4211768 total
- age  5:       5104 bytes,    4216872 total
- age  6:       7472 bytes,    4224344 total
- age  7:       2256 bytes,    4226600 total
- age  8:      36400 bytes,    4263000 total
- age  9:       2152 bytes,    4265152 total
- age 10:       5056 bytes,    4270208 total
- age 11:        664 bytes,    4270872 total
- age 12:       1784 bytes,    4272656 total
- age 13:        480 bytes,    4273136 total
- age 14:      34808 bytes,    4307944 total
- age 15:       1232 bytes,    4309176 total
: 753754K->5633K(1563968K), 0.010023 secs] 753733K->5633K(2022720K), 0.0101724 secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
Heap after GC invocations=73456 (full 3):
 par new generation total 1563968K, used 5633K [0x0000000080000000, 0x00000000e4000000, 0x00000000e4000000)
  eden space 1489536K, 0% used [0x0000000080000000, 0x0000000080000000, 0x00000000daea0000)
  from space 74432K, 7% used [0x00000000daea0000, 0x00000000db420600, 0x00000000df750000)
  to   space 74432K, 0% used [0x00000000df750000, 0x00000000df750000, 0x00000000e4000000)
 concurrent mark-sweep generation total 458752K, used 422079K [0x00000000e4000000, 0x0000000100000000, 0x0000000100000000)
 Metaspace used 81187K, capacity 84281K, committed 84384K, reserved 1124352K
  class space used 9988K, capacity 10542K, committed 10548K, reserved 1048576K
}
2021-04-20T17:25:03.792+0800: 452505.118: [Rescan (parallel) , 0.0047549 secs]2021-04-20T17:25:03.797+0800: 452505.122: [weak refs processing, 0.0146939 secs]2021-04-20T17:25:03.811+0800: 452505.137: [class unloading, ...]2021-04-20T17:25:03.852+0800: ...: [scrub symbol table, 0.0147520 secs]2021-04-20T17:25:03.867+0800: 452505.193: [scrub string table, 0.0013593 secs][1 CMS-remark: 422079K(458752K)] 427712K(2022720K), 0.1081788 secs] [Times: user=0.14 sys=0.01, real=0.11 secs]

Final JVM parameters

java -server -Xmx2g -Xms2g -Xmn1600m -XX:+UseParNewGC -XX:MaxTenuringThreshold=2 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSScavengeBeforeRemark -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=7 -XX:GCLogFileSize=128m

Conclusion

This article has walked through the resolution of an atypical GC problem; because the process is recorded in detail, it should be useful as a reference. To close, a few takeaways:

The first step in debugging Java GC problems is to get familiar with the garbage collector. Familiar does not mean knowing the source code, but understanding its principles and characteristics. For example, knowing where a particular collector spends most of its time gives you a line of inquiry when reality doesn't match the theory.

Second, the problems we run into have almost certainly been hit by someone before, so once you can pin the issue down to a specific detail, a search engine will usually turn up a solution.

Third, detailed GC logs are critical for understanding how the garbage collector actually behaves. Once you have an initial judgment, all that remains is to experiment and measure until the system is tuned to a better state.

References

tech.meituan.com/2020/11/12/… (a systematic summary of all kinds of GC problems, well worth reading)

blog.ragozin.info/p/garbage-c…

cloud.tencent.com/developer/a…

dzone.com/articles/ho…

www.jianshu.com/p/051d566e1…

www.jianshu.com/p/4e96beb37…

mp.weixin.qq.com/s/aYwIH0TN3…