The starting point

Why should we analyze GC logs for dialogue and Trade services (not just those two)? Because dialogue on the user conversation link (requests from the user) acts as the service invocation center, if the service has a delay/call timeout, it could reply to the user’s conversation, but it is discarded because of timeout, which can result in the following situations

  • For businesses, resulting in a certain loss, response rate is reduced
  • The most intuitive feeling to users is, why did this kind of question have no reply

Characteristics of instance objects on the dialogue service

  • Large objects (proto objects with rich fields, context information, collections, etc.)
  • A certain amount (say ten) of user context information is cached, so the cached old context information is meaningless and can be recycled as garbage

Trade-build is driven by consuming order messages from third-party e-commerce

  • Many objects, few fields, short life cycle

The garbage collection

Serial, parallel, CMS (Serial, parallel, concurrent tag cleaning)

Heap memory structure

The young generation

Regions that allocate and age new objects

  • Eden area. Newly created objects are allocated to this area. Unless the object is very large, it is directly allocated to the old age
  • A survivor zone, as the name implies, is an object that has survived gc

The old s

Gc through the Young region is counted once, and a certain number (15 by default) is copied to the region

The permanent generation

Stores JVM metadata, basic library class information, and methods

CMS concurrent tag cleanup five phases

Stick the official document here. The CMS collector performs so he performs with The following phases on The old generation of The heap:

Phase Description
(1) Initial Mark

(Stop the World Event) | Objects in old generation are “marked” as reachable o those Objects which may be reachable from young generation. Pause times are typically short in duration relative to minor collection pause times. | | (2) Concurrent Marking | Traverse the tenured generation object graph for reachable objects concurrently while Java application threads are executing. Starts scanning from marked objects and transitively marks all objects reachable from the roots. The mutators are executing during the concurrent phases 2, 3, and 5 and any objects allocated in the CMS generation during these phases (including promoted objects) are immediately marked as live. | | (3) Remark

(Stop the World Event) | Finds objects that were missed by the concurrent mark phase due to updates by Java application threads to objects after the concurrent collector had finished tracing that object. | | (4) Concurrent Sweep | Collects the objects identified as unreachable during marking phases. The collection of a dead object adds the space for the object to a free list for later allocation. Coalescing of dead objects may occur at this point. Note that live objects are not moved. | | (5) Resetting | Prepare for next concurrent collection by clearing data structures. |

Garbage First

Heap allocation



In addition to these three areas, there is also a large object store. Each of the small squares above has a memory allocation of 1m-32M. If the size of a giant object exceeds 50% of the small squares, it is directly allocated to the giant object area



The multiple stages of garbage collection are posted here directly to the official documentation

Phase Description
(1) Initial Mark

(Stop the World Event) | This is a stop the world event. With G1, it is piggybacked on a normal young GC. Mark survivor regions (root regions) which may have references to objects in old generation. | | (2) Root Region Scanning | Scan survivor regions for references into the old generation. This happens while the application continues to run. The phase must be completed before a young GC can occur. | | (3) Concurrent Marking | Find live objects over the entire heap. This happens while the application is running. This phase can be interrupted by young generation garbage collections. | | (4) Remark

(Stop the World Event) | Completes the marking of live object in the heap. Uses an algorithm called snapshot-at-the-beginning (SATB) which is much faster than what was used in the CMS collector. | | (5) Cleanup

(Stop the World Event and Concurrent) |

– Performs accounting on live objects and completely free regions. (Stop the world)

– Scrubs the Remembered Sets. (Stop the world)

– Reset the empty regions and return them to the free list. (Concurrent)

| | (*) Copying

(Stop the World Event) | These are the stop the world pauses to evacuate or copy live objects to new unused regions. This can be done with young generation regions which are logged as [GC pause (young)] . Or both young and old generation regions which are logged as [GC Pause (mixed)] . |

Online situation

It can be seen from the dialogue GC log analysis chart that the total memory allocated is nearly 6G, 2.24G for young generation and 3.53g for old generation. The default NewRatio=2 is used. However, from the perspective of usage, the young generation is not enough, so when allocating new objects, the young generation’s GC is always tried. In the old days, 800M of free memory was not used, which was obviously not a reasonable memory allocation

Trade-build is also not enough for the young generation. When creating new objects, memory allocation often fails, triggering the gc of the young generation. However, the service is not on the call link of the user conversation. Even if there is gc delay, it only affects the time of data falling into the database. Therefore, most objects can be reclaimed in Eden of the young generation because of the large number and short life cycle. Try to adjust the configuration of **NewRatio ** to NewRatio=1, that is, new objects cannot be allocated if the heap memory occupied by the young generation and the old generation occupies half of the heap memory. Trigger the young GENERATION GC log as follows:

2021-03-18T10:13:44.256+0800: [GC (Allocation Failure) 2021-03-18T10:13:44.257+0800: [ParNew: Secs] [Times: Times: 3282K -> 3283k [Times: 3282k -> 3283k] Sys =0.00, real=0.02 secs] 2021-03-18T10:14:33.687+0800: [GC (Allocation Failure) 2021-03-18T10:14:33.687+0800: [ParNew: 405455K->36853K(407424K), secs] 405470K ->459797K(1296256K), secs] [Times: User =0.06 sys=0.00, real=0.03 secs] 2021-03-18T10:15:41.909+0800: [GC (Allocation Failure) 2021-03-18T10:15:41.910+0800: [ParNew: 407285K->32213K(407424K), secs] 40729K ->459058K(1296256K), secs [Times: User sys = = 0.05 0.01, real = 0.01 secs]Copy the code

After the adjustment

After adding -xxnewratio =1 to the startup script, the thread’s GC log is shown in the following figure, which clearly shows a decrease in the peak GC time. If you are interested, you can compare the two screenshots of GC log parsing.

GC monitoring dot chart

The red box marked example in the figure below is one published by Canary, which shows a significant decrease in the peak GC time relative to the other two

After the remaining two releases, all three GC peaks dropped

GC log parsing diagram

Reference documentation

Jvm Options Java Garbage Collection Basics GC tuning