Automatic memory management

This section is the most important one. Anyone who has done C/C++ development knows how important and how difficult memory management is. Although the JVM manages memory itself and does not require developers to free it manually, it still has shortcomings, and memory leaks and overflows remain common. When we troubleshoot these problems without any knowledge of JVM memory management we are blind and directionless; that knowledge is exactly what lets us solve them.

This part covers two topics. The first is the memory model: it is the foundation of memory management and also touches on multi-thread contention, but concurrency is a large subject in its own right and will not be detailed here. The second is the basics of garbage collection and the various GC algorithms. Understanding the GC algorithms is very important and helps with problem analysis and performance tuning; it is not necessary to dig into the implementation code of each algorithm, only to grasp its main ideas.

Using GC tools and analyzing GC logs is more of a hands-on skill.

The memory model

The image above shows the Java Virtual Machine runtime data areas. There are five parts to master: method area, heap, virtual machine stack, native method stack, and program counter.

  • Program counter: the line-number indicator of the bytecode being executed; thread-private
  • Java virtual machine stack: the memory model of Java method execution for each thread; its stack frames hold basic data types and object references; thread-private
  • Native method stack: serves Native methods, and is otherwise similar to the Java virtual machine stack; thread-private
  • Heap: holds object instances; shared by all threads
  • Method area: holds loaded type information, constants, static variables, the code cache produced by the just-in-time compiler, and other data; shared by all threads
    • Runtime constant pool: holds the class's version, field, method, and interface descriptions, as well as the constant pool table (the various literals and symbolic references generated at compile time)

There is also an area that is not part of the runtime data areas: direct memory. It is used frequently as well, so it is covered here.
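
A minimal sketch of using direct memory, assuming it is run as a standalone class; the flag named in the comment is real, but the sizes here are illustrative only:

    import java.nio.ByteBuffer;

    // Direct memory lives outside the Java heap; it is bounded by
    // -XX:MaxDirectMemorySize (by default roughly by -Xmx), not by the heap itself.
    public class DirectMemoryDemo {
        public static void main(String[] args) {
            // Allocate 512MB off-heap; exceeding the configured limit throws
            // OutOfMemoryError: Direct buffer memory.
            ByteBuffer buffer = ByteBuffer.allocateDirect(512 * 1024 * 1024);
            System.out.println("direct buffer capacity: " + buffer.capacity());
        }
    }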

The areas that can run out of memory, and typical ways to exhaust them: the Java virtual machine stack (continuously storing variables), the native method stack (continuous recursion), the heap (continuously creating objects), the method area (continuously generating class information), and direct memory.

For concrete OutOfMemoryError examples for each area, see the OOM practice section (2.4) of Understanding the Java Virtual Machine: Advanced JVM Features and Best Practices.
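
As a taste of that practice section, here is a minimal heap-OOM sketch in the same spirit (class name and flag values are illustrative): run it with a small heap such as -Xms20m -Xmx20m and it ends with java.lang.OutOfMemoryError: Java heap space.

    import java.util.ArrayList;
    import java.util.List;

    public class HeapOOM {
        static class OOMObject { }

        public static void main(String[] args) {
            List<OOMObject> list = new ArrayList<>();
            while (true) {
                // The list is reachable from a GC Root (the local variable on the stack),
                // so none of the objects can be reclaimed and the heap eventually fills up.
                list.add(new OOMObject());
            }
        }
    }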

Garbage collection basics

There is some prerequisite knowledge of garbage collection to understand first: the criterion for deciding what is garbage (the reachability analysis algorithm), generational collection theory, and the three basic garbage collection algorithms.

Reachability analysis algorithm

Reachability analysis algorithm: starting from a set of root objects called GC Roots, traverse the object graph; any object reached during the traversal is still alive, and any object that cannot be reached is reclaimable.
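
A minimal sketch of the idea (class names are made up for illustration): two objects that reference each other but are no longer reachable from any GC Root are still reclaimable under reachability analysis, whereas naive reference counting would keep them alive forever.

    public class ReachabilityDemo {
        static class Node {
            Node ref;
            byte[] payload = new byte[1024 * 1024]; // 1MB, just to make the effect visible in GC logs
        }

        public static void main(String[] args) {
            Node a = new Node();
            Node b = new Node();
            a.ref = b;   // a and b form a reference cycle
            b.ref = a;
            a = null;    // drop the GC Root references (local variables in the VM stack frame)
            b = null;
            // Neither object is reachable from any GC Root now, so both can be reclaimed
            // even though they still reference each other.
            System.gc();
        }
    }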

Objects that can always serve as GC Roots in Java include the following:

  • Objects referenced in the virtual machine stack (in the local variable table of a stack frame), such as the parameters, local variables, and temporary variables used in the methods each thread is currently calling.
  • Objects referenced by class static fields in the method area, such as a Java class's static variables of reference type.
  • Objects referenced by constants in the method area, such as references in the string constant pool (String Table).
  • Objects referenced by JNI (commonly known as Native methods) in the native method stack.
  • Internal references of the Java virtual machine, such as the Class objects corresponding to the basic data types, resident exception objects (NullPointerException, OutOfMemoryError), and the system class loader.
  • All objects held as locks by the synchronized keyword.
  • JMXBeans that reflect the internal state of the Java virtual machine, callbacks registered in JVMTI, native code caches, and so on.

Generational collection theory

Generational collection rests on several important hypotheses; before ZGC, GC algorithms were designed around them (a hedged flags sketch of how they shape the heap layout follows after the list):

  • 1) Weak Generational Hypothesis: the overwhelming majority of objects die young; they are created and become garbage very quickly.
  • 2) The Strong Generational Hypothesis: the more times an object goes through garbage collection, the less likely it is to die.
  • 3) Intergenerational Reference Hypothesis: cross-generational references are only a minority relative to same-generation references.
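
A hedged sketch of how these hypotheses shape the heap in practice: a HotSpot command line that sizes the young and old generations (all values are illustrative examples, and MyApp is a placeholder):

    # -Xmn                       size of the young generation, where most objects die
    # -XX:SurvivorRatio=8        Eden : each Survivor space = 8 : 1
    # -XX:MaxTenuringThreshold   collections an object must survive before promotion to the old generation
    java -Xms512m -Xmx512m -Xmn192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 MyApp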

Garbage collection algorithm

This part is quite important: every GC algorithm designed so far draws on at least one of the following basic ideas.

Mark-sweep algorithm

As its name suggests, it has two phases: first mark the objects to be reclaimed, then sweep them away (a toy sketch follows after the pros and cons below).

  • Advantages: simple; it is the foundation of the later algorithms
  • Disadvantages:
    • The first is unstable execution efficiency. If the Java heap contains a large number of objects and most of them need to be reclaimed, a large amount of marking and sweeping work must be done, so the cost of both the marking and the sweeping phase grows as the number of objects grows
    • The second is memory fragmentation. After marking and sweeping, a large number of discontinuous memory fragments remain; too much fragmentation means that when the running program later needs to allocate a large object, it may fail to find enough contiguous memory and has to trigger another garbage collection ahead of time
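
A toy mark-sweep sketch over an in-memory object graph, just to illustrate the two phases; all classes here are made up, and this is not how HotSpot is implemented:

    import java.util.ArrayList;
    import java.util.List;

    class MarkSweepSketch {
        static class Obj {
            boolean marked;
            List<Obj> refs = new ArrayList<>();
        }

        // Mark phase: everything reachable from the roots is marked alive.
        static void mark(Obj obj) {
            if (obj == null || obj.marked) return;
            obj.marked = true;
            for (Obj ref : obj.refs) mark(ref);
        }

        // Sweep phase: walk the whole "heap" and drop unmarked objects;
        // the cost of both phases grows with the number of objects.
        static void sweep(List<Obj> heap) {
            heap.removeIf(o -> !o.marked);
            for (Obj o : heap) o.marked = false; // reset marks for the next cycle
        }

        static void collect(List<Obj> roots, List<Obj> heap) {
            for (Obj root : roots) mark(root);
            sweep(heap);
        }
    }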

Mark-copy algorithm

It divides the available memory into two equally sized halves and uses only one of them at a time. When that half is used up, the surviving objects are copied into the other half and the used half is cleaned up in one go. If most objects in memory are alive, the algorithm incurs a large copying overhead; but when most objects are dead, only the minority of live objects need to be copied. Because each collection frees an entire half of the memory at once, allocation never has to deal with fragmentation: new objects are allocated sequentially simply by moving the heap pointer.

G1 and the collectors before it all use this idea for young-generation collection; because few objects survive there, copying takes little time (a toy semispace sketch follows after the pros and cons below).

  • Advantages: simple to implement, efficient, no memory fragmentation
  • Disadvantages: the usable memory shrinks to half of the original, which is a lot of wasted space
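
A toy semispace sketch of the mark-copy idea (a made-up structure, not HotSpot code): only the few survivors are copied into the empty half, after which allocation continues sequentially with a bump pointer.

    import java.util.List;

    class CopyingSketch {
        Object[] fromSpace = new Object[1024];
        Object[] toSpace = new Object[1024];
        int top = 0; // bump pointer: next free slot in the active half

        void collect(List<Integer> liveIndexes) {
            int newTop = 0;
            for (int i : liveIndexes) {
                toSpace[newTop++] = fromSpace[i]; // copy only the surviving objects
            }
            // Swap the roles of the two halves; the old half is now entirely free.
            Object[] tmp = fromSpace;
            fromSpace = toSpace;
            toSpace = tmp;
            top = newTop; // new objects are allocated sequentially from here, no fragmentation
        }
    }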

Mark-compact algorithm

The efficiency of the mark-copy algorithm drops when the object survival rate is high. More importantly, if you do not want to waste 50% of the space, you need extra space as an allocation guarantee against the extreme case where 100% of the objects in the used half survive, so the old generation generally cannot use this algorithm directly.

The marking phase is the same as in the mark-sweep algorithm, but instead of sweeping the reclaimable objects directly, all surviving objects are moved toward one end of the memory space, and then everything beyond that boundary is cleaned up in one step, as shown in the diagram above (a toy sketch of the compaction step follows after the pros and cons below).

There is also room for flexibility in when to compact: it can be done on every collection, or only when fragmentation reaches a certain level.

  • Advantages: no memory fragmentation, high space utilization
  • Disadvantages: moving live objects and updating all references to them is a very heavy operation, and the user application must be paused while objects are being moved
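
A toy sketch of the compaction step (a made-up structure, not HotSpot code): live slots slide toward the low end, and everything beyond the returned boundary becomes free space.

    class CompactSketch {
        // Returns the index where free space begins after compaction.
        static int compact(Object[] heap, boolean[] marked) {
            int boundary = 0;
            for (int i = 0; i < heap.length; i++) {
                if (marked[i]) {
                    // Move the live object toward one end. In a real collector every
                    // reference to it must also be updated, which is why user threads
                    // are paused while objects move.
                    heap[boundary++] = heap[i];
                }
            }
            for (int i = boundary; i < heap.length; i++) {
                heap[i] = null; // everything beyond the boundary is reclaimed
            }
            return boundary;
        }
    }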

Serial collector

As shown in the figure above, Serial collects garbage with a single thread. When it performs garbage collection it must stop the application code so it can concentrate on collecting. "When your mom cleans your room, she makes you stay out of it or sit still on your chair. If you keep throwing confetti while she cleans, how will the room ever get done?"

The young generation uses the mark-copy algorithm and the old generation uses the mark-compact algorithm.

ParNew collector

As shown above, ParNew is a parallel version of Serial; the idea is essentially unchanged, except that multiple threads are used to speed up garbage collection.

The young generation uses the mark-copy algorithm and the old generation uses the mark-compact algorithm.

Parallel Scavenge collector

The distinguishing feature of the Parallel Scavenge collector is its focus: whereas collectors such as CMS concentrate on shortening the pause time of user threads during garbage collection, the goal of Parallel Scavenge is a controllable Throughput. Throughput is the ratio of the time the processor spends running user code to the total elapsed processor time.

The collector is designed to hit a configured throughput target and can adjust itself adaptively to meet it (a hedged flags sketch follows below).
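
A hedged flags sketch for throughput-oriented tuning with this collector (all values are illustrative examples, and MyApp is a placeholder):

    # -XX:MaxGCPauseMillis         desired maximum pause time
    # -XX:GCTimeRatio=99           target: GC takes at most 1/(1+99) = 1% of total time
    # -XX:+UseAdaptiveSizePolicy   let the collector tune generation sizes itself
    java -XX:+UseParallelGC -XX:MaxGCPauseMillis=200 -XX:GCTimeRatio=99 -XX:+UseAdaptiveSizePolicy MyApp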

Serial Old collector

Serial Old is the old-generation version of the Serial collector. It is also single-threaded and uses the mark-compact algorithm. Its main purpose is likewise to be used by the HotSpot virtual machine in client mode. In server mode it may serve two purposes: paired with the Parallel Scavenge collector in JDK 5 and earlier, and as a fallback for the CMS collector when a Concurrent Mode Failure occurs during concurrent collection.

Parallel Old collector

It was not until the Parallel Old collector appeared that the Parallel Scavenge collector finally had a suitable old-generation partner, giving a genuinely throughput-first collector combination.

CMS collector

CMS is a breakthrough point: it subdivides the garbage collection process into phases so that the STW time is shortened, and part of the collection can run concurrently with the application code.

The CMS (Concurrent Mark Sweep) collector is a collector whose goal is to achieve the shortest collection pause time. At present a large share of Java applications sit on the server side of Internet sites or browser-based B/S systems; such applications usually care most about response speed and want system pauses to be as short as possible to give users a good interactive experience. The CMS collector is a good fit for such applications.

CMS is based on the mark-sweep algorithm and, as shown in the figure above, consists of four steps:

  • 1) Initial mark: requires STW; only marks the objects directly reachable from GC Roots, which is fast
  • 2) Concurrent mark: no STW required; starting from the objects directly reachable from GC Roots, traverses the whole object graph (runs concurrently with the user threads)
  • 3) Remark: requires STW; fixes up the part of the phase-2 marking that changed because the application kept running
  • 4) Concurrent sweep: no STW required; sweeps away the dead objects, leaving memory fragmentation behind

CMS has the following advantages and disadvantages:

  • Advantages: concurrent collection, low pauses, well suited to the fast-response requirements of today's Internet servers
  • Disadvantages:
    • Sensitive to processor resources: with 4 or more cores it is acceptable, but with fewer cores the concurrent phases take a large share of the CPU and noticeably slow the application
    • Because of floating garbage, some space must be reserved for the application while collection runs concurrently; if the reserved headroom is too small, a Concurrent Mode Failure occurs (a hedged flags sketch follows below)
    • It leaves memory fragmentation
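
A hedged flags sketch for CMS (values are illustrative examples, MyApp is a placeholder); the occupancy threshold is exactly the headroom reserved for floating garbage mentioned above:

    # -XX:CMSInitiatingOccupancyFraction   start a CMS cycle once the old generation is 70% full
    # -XX:+UseCMSInitiatingOccupancyOnly   always honour the fixed threshold above
    java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly MyApp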

G1 GC

The G1 GC is a milestone, its breakthroughs being a redesigned memory layout (Regions) and a controllable maximum pause time. It is fair to say that these designs and improvements, starting with CMS, are all aimed at highly responsive servers.

G1 is roughly divided into the following four phases:

  • 1) Initial Marking: only marks the objects directly reachable from GC Roots and modifies the value of the TAMS pointer, so that user threads running concurrently in the next phase can correctly allocate new objects in the available Regions. This phase requires pausing the threads, but it is very short and is done alongside a Minor GC, so G1 actually incurs no extra pause here.
  • 2) Concurrent Marking: analyzes the reachability of objects in the heap starting from GC Roots, recursively scanning the object graph of the whole heap to find the objects to be reclaimed. This stage is time-consuming but runs concurrently with the user program. When the object graph scan is complete, the objects recorded by SATB whose references changed during the concurrent phase must also be reprocessed.
  • 3) Final Marking: another short pause of the user threads to process the last few SATB records left over when the concurrent phase ends.
  • 4) Live Data Counting and Evacuation: updates the statistics of the Regions, sorts each Region by reclamation value and cost, and makes a collection plan based on the pause time the user expects. Any number of Regions can then be chosen to form the collection set; the surviving objects of those Regions are copied into empty Regions, and the old Regions are cleaned up entirely. This step involves moving live objects, so the user threads must be paused, and it is done in parallel by multiple collector threads.

As for whether G1 has STW pauses, here is an analogy with a gang of thieves:

  • 1) Mark the map together: first mark the rich households on the map and prepare to case them; marking them on the map requires STW
  • 2) Split up to scout the locations: go to the specific places to look around; people are coming and going and nothing stands still, so no STW is needed
  • 3) Meet to confirm on the map: the gang meets to confirm which households are rich and easy to rob and marks them on the map; this requires STW
  • 4) Split up and get to work: this must happen on a dark and windy night, when either nobody is home or everyone is sound asleep; this corresponds to needing STW

The G1’s pros and cons:

  • Advantages: controllable pause times, low latency, essentially no fragmentation; a good fit for heaps of roughly 6GB to 8GB and above (a hedged flags sketch follows below)
  • Disadvantages: cross-Region collection requires remembered sets (card tables), which use more memory and are more complex to maintain; below roughly 6GB, CMS may be the better choice
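
A hedged flags sketch for G1 (values are illustrative examples, MyApp is a placeholder):

    # -XX:MaxGCPauseMillis   the pause-time goal G1 plans its collection set around
    # -XX:G1HeapRegionSize   Region size, a power of two between 1MB and 32MB
    java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:G1HeapRegionSize=4m MyApp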

Shenandoah

Shenandoah feels like the most significant improvement over G1. Its steps are as follows; I will not go into deep detail here:

  • Initial Marking: as in G1, the objects directly reachable from GC Roots are marked first. This phase is still Stop The World, but the pause time depends only on the number of GC Roots, not on the heap size.
  • Concurrent Marking: as in G1, the object graph is traversed to mark all reachable objects. This phase runs concurrently with the user threads, and its duration depends on the number of live objects in the heap and the structural complexity of the object graph.
  • Final Marking: as in G1, the remaining SATB scans are processed, and in this phase the Regions with the highest reclamation value are identified and grouped into a collection set. There is also a short pause in the final marking phase.
  • Concurrent Cleanup: this phase cleans up Regions in which not a single live object was found (called Immediate Garbage Regions).
  • Concurrent Evacuation: the concurrent evacuation phase is the core difference between Shenandoah and the earlier HotSpot collectors. In this phase Shenandoah copies the live objects in the collection set into other, unused Regions. Copying objects is fairly simple if the user threads are paused, but becomes much harder when both must proceed concurrently: while an object is being moved, user threads may keep reading and writing it, and although moving the object itself is a one-off action, afterwards every reference to it throughout the heap still points at the old address, which cannot all be changed at once. Shenandoah solves this with read barriers and a forwarding pointer known as a Brooks Pointer (a toy sketch of the idea follows after this list). How long the concurrent evacuation phase takes depends on the size of the collection set.
  • Initial Update Reference: after the objects in the collection set have been copied, all references in the heap that still point at the old objects must be corrected to the new addresses; this correction is called reference updating. The initialization phase of reference updating does not do anything specific; it merely establishes a collector thread rendezvous point to ensure that all collector threads working in the concurrent evacuation phase have finished the object-moving tasks assigned to them. This results in a very short pause.
  • Concurrent Update Reference: the phase that actually performs the reference updates; it runs concurrently with the user threads, and its duration depends on the number of references in memory. Unlike concurrent marking, it no longer needs to walk the object graph; it simply scans references linearly in physical memory address order, changing old values to new ones.
  • Final Update Reference: after the references in the heap have been fixed, the references held in GC Roots must be corrected as well. This is Shenandoah's last pause, and its duration depends only on the number of GC Roots.
  • Concurrent Cleanup: after the concurrent evacuation and reference updating, none of the Regions in the collection set contain live objects any more; they have become Immediate Garbage Regions. Finally, the concurrent cleanup process reclaims the memory of these Regions for future allocation of new objects.
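
A toy sketch of the Brooks forwarding-pointer idea mentioned above (made-up classes, not Shenandoah's real implementation): every object carries one extra reference that normally points at itself; after evacuation it points at the new copy, and every access resolves through it, paying a small fixed cost each time.

    class BrooksSketch {
        static class Obj {
            Obj forwardee = this; // the extra word placed in front of each object
            int payload;
        }

        // The barrier every access goes through: always one extra hop.
        static Obj resolve(Obj obj) {
            return obj.forwardee;
        }

        // Evacuation: copy the object, then redirect its forwarding pointer
        // so that all later accesses land on the new copy.
        static Obj evacuate(Obj old, Obj newCopy) {
            newCopy.payload = old.payload;
            old.forwardee = newCopy;
            return newCopy;
        }
    }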

ZGC

The ZGC collector is a garbage collector based on a Region memory layout, (for now) without generations, that uses techniques such as read barriers, colored pointers, and multi-mapped memory to implement a concurrent mark-compact algorithm, with low latency as its primary goal.

Let's start with the ZGC memory layout. Like Shenandoah and G1, ZGC uses a Region-based heap layout, but unlike theirs, ZGC's Regions (called Page or ZPage in some official sources; this chapter keeps calling them Regions) are dynamic: they are created and destroyed dynamically, and their size is dynamic as well. On the x64 hardware platform, ZGC Regions come in small, medium, and large capacities, as shown in Figure 3-19.

  • Small Region: fixed capacity of 2MB, used to store small objects of less than 256KB.
  • Medium Region: fixed capacity of 32MB, used to store objects of at least 256KB but less than 4MB.
  • Large Region: the capacity is not fixed and can change dynamically, but must be an integer multiple of 2MB; used to store large objects of 4MB or more. Each large Region holds only one large object, which means the actual capacity of a large Region can be smaller than that of a medium Region, as low as 4MB. Large Regions are never relocated in the ZGC implementation (relocation is the ZGC operation that copies objects during collection, described later) because copying a large object is very expensive.

Next comes the core problem for ZGC: how to implement the concurrent mark-compact algorithm. The key is the colored pointer technique.

  • Colored pointers allow a Region to be freed and reused as soon as its live objects have been evacuated, without waiting for all references into that Region throughout the heap to be corrected. This is a big advantage over Shenandoah: in theory ZGC can complete a collection as long as a single free Region remains, whereas Shenandoah must wait until the reference-update phase ends before it can release the Regions in the collection set. In the extreme case where almost every object in the heap is alive and has to be copied 1:1 into new Regions, that means half of the Regions must be kept free. Why colored pointers make this possible comes down to their "self-healing" property, explained later.
  • Colored pointers can significantly reduce the number of memory barriers needed during garbage collection. Memory barriers, especially write barriers, are usually installed to record changes to object references; if that information is kept directly in the pointer, some dedicated recording operations can obviously be dropped. In fact, so far ZGC uses no write barriers at all, only read barriers, partly thanks to colored pointers and partly because ZGC does not yet support generational collection, so cross-generation references are simply not an issue. The runtime cost of memory barriers was discussed in the previous section, and being able to eliminate some of them clearly benefits program performance, so ZGC's impact on throughput is relatively low.
  • Colored pointers can serve as an extensible storage structure to record more data about marking and relocation, leaving room for further performance improvements. On Linux, the top 18 bits of a 64-bit pointer are currently unused; they cannot be used for addressing, but they can carry information in other ways. If those 18 bits were exploited, then beyond the 4 flag bits already in use, ZGC's maximum supported heap could grow from 4TB to 64TB, or the remaining bits could store more flags, for example tracking information that lets the collector move rarely accessed objects into less frequently accessed areas of memory (a toy sketch of the colored-pointer idea follows after this list).
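
A toy sketch of the colored-pointer idea (bit positions and names are illustrative only, not ZGC's real layout): metadata flags live in the high, otherwise unused bits of a 64-bit address, so the state travels with the reference itself.

    class ColoredPointerSketch {
        static final long ADDRESS_MASK    = (1L << 44) - 1; // low bits: the actual address
        static final long MARKED0_BIT     = 1L << 44;
        static final long MARKED1_BIT     = 1L << 45;
        static final long REMAPPED_BIT    = 1L << 46;
        static final long FINALIZABLE_BIT = 1L << 47;

        static long address(long coloredPtr)       { return coloredPtr & ADDRESS_MASK; }
        static boolean isRemapped(long coloredPtr) { return (coloredPtr & REMAPPED_BIT) != 0; }

        // "Coloring" a pointer: keep the address, set the Marked 0 flag for this marking cycle.
        static long markWithMarked0(long coloredPtr) {
            return (coloredPtr & ADDRESS_MASK) | MARKED0_BIT;
        }
    }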

The general steps are as follows:

  • Concurrent Mark: like G1 and Shenandoah, concurrent marking is the reachability-analysis stage that traverses the object graph. It also includes short pauses similar to the initial mark and final mark of G1 and Shenandoah (ZGC does not give them separate names), and they do similar things. Unlike G1 and Shenandoah, ZGC marks on pointers rather than on objects: in the marking phase it updates the Marked 0 and Marked 1 flag bits in the colored pointers.
  • Concurrent Prepare for Relocate: this stage decides, according to specific query conditions, which Regions make up this collection's relocation set. The relocation set differs from G1's collection set in that ZGC does not divide Regions up to do benefit-first incremental collection as G1 does; instead, ZGC scans all Regions on every collection, trading the cost of this wider scan for dropping the maintenance cost of G1's remembered sets. Therefore the relocation set only determines which Regions' live objects will be copied elsewhere so that those Regions can then be freed; it does not mean that collection targets only the Regions in the set, because marking covers the whole heap. In addition, class unloading and weak-reference processing, supported in ZGC since JDK 12, are also done in this phase.
  • Concurrent Relocate: relocation is the core stage of ZGC. It copies the live objects in the relocation set to new Regions and maintains a forward table for each Region in the set, recording the mapping from old objects to new ones. Thanks to colored pointers, ZGC can tell from a reference alone whether the object it points to is in the relocation set. If a user thread concurrently accesses an object in the relocation set, the access is intercepted by the pre-installed memory barrier, which immediately forwards the access to the new copy according to the Region's forward table and, at the same time, corrects the reference so that it points directly at the new object. ZGC calls this the pointer's "self-healing" ability (a toy sketch follows after this list). The benefit is that only the first access to an old object pays the forwarding cost, i.e. it is slow only once, whereas Shenandoah's Brooks forwarding pointer is a fixed overhead paid on every access, i.e. slow every time; so ZGC's runtime load on the user program is lower than Shenandoah's. Another direct benefit is that, because of colored pointers, once a Region's live objects have been copied out it can immediately be freed and reused for new allocations (although its forward table must still be kept); even if many pointers in the heap still refer to the old objects it does not matter, because those old pointers heal themselves when they are next used.
  • Concurrent Remap: what remapping does is fix up all references in the whole heap that still point at old objects in the relocation set. In terms of its goal it resembles Shenandoah's concurrent reference-update phase, but ZGC's concurrent remap is not an "urgent" task: as noted above, even a stale reference heals itself, at worst paying one extra forwarding and fix-up on its first use. The main purpose of cleaning up these stale references is simply to stop them being slow (and to allow the forward tables to be released once the fix-up is done), so it is not urgent. ZGC therefore cleverly merges the work of the concurrent remap phase into the concurrent marking phase of the next garbage collection cycle, which has to traverse all objects anyway, saving one extra traversal of the object graph. Once all pointers have been corrected, the forward tables recording the old-to-new mappings can be released.
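
A toy sketch of the self-healing load barrier described above (all structures are made up for illustration): when a stale reference into the relocation set is loaded, it is forwarded through the Region's forward table and fixed in place, so the cost is paid only on the first access.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class SelfHealingSketch {
        static class Slot { long ref; } // a field in the heap holding an object address

        Set<Long> relocationSet = new HashSet<>();       // addresses of objects being relocated
        Map<Long, Long> forwardTable = new HashMap<>();  // old address -> new address

        long load(Slot slot) {
            Long forwarded = relocationSet.contains(slot.ref) ? forwardTable.get(slot.ref) : null;
            if (forwarded != null) {
                slot.ref = forwarded; // "heal" the stale reference where it sits in the heap
            }
            return slot.ref;          // later loads of this slot take the fast path
        }
    }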

Evolution of GC algorithms

Looking at the development of GC algorithms, my personal feeling is that it has always revolved around one point: reducing the duration of a single STW pause.

For example, multiple threads are used to speed up collection and so shorten the total STW time of the serial collectors; or the collection process is split into stages, so that one long STW is broken into small pieces spread across the stages, reducing the length of any single STW.

And the evolution of GC algorithms has been driven by the need for fast server response.

The algorithm evolution diagram is roughly as follows: