1. Introduction

Java 11 includes a brand new garbage collector, ZGC, developed by Oracle, that promises very low pause times on multi-terabyte heaps. In this article, we cover the motivation for developing a new GC, give a technical overview, and look at some of the possibilities ZGC opens up.

So why do we need a new GC? After all, Java 10 already ships with four garbage collectors that have been released over the years, all of which are almost infinitely tunable. To put this in perspective, G1 was introduced to the HotSpot VM in 2006. At the time, the largest AWS instance had 1 vCPU and 1.7GB of RAM, whereas today AWS will happily rent you an x1e.32xlarge instance with 128 vCPUs and 3,904GB of RAM. ZGC is designed for that world: multi-terabyte heaps, low pause times (<10ms), and less than 15% impact on overall application throughput. Its implementation mechanisms could also be extended in the future to support some exciting features, such as multi-tier heaps (where hot objects live in DRAM and cold objects in NVMe flash) or compressed heaps.

2. GC terminology

To understand where ZGC fits among the existing collectors, and how the new GC is implemented, we first need some terminology. At its most basic, garbage collection involves identifying memory that is no longer in use and making it reusable. Modern collectors do this in several phases, which are usually described with the following terms:

  • Parallel: While the JVM is running, there are both application threads and garbage collector threads. A parallel phase is performed by multiple GC threads, that is, the GC work is divided between them. The term says nothing about whether the GC threads need to pause the application threads.

  • Serial: A serial phase is performed on a single GC thread only. As before, the term says nothing about whether the GC thread needs to pause the application threads.

  • Stop The World (STW): In a Stop The World phase, the application threads are paused so that the GC can do its work. When an application pauses because of GC, it is usually due to a Stop The World phase.

  • Concurrent: If a phase is concurrent, the GC threads can run at the same time as the application threads. Concurrent phases are complex because the application may invalidate their work before the phase completes, and they have to handle that.

  • Incremental: If a phase is incremental, it can run for a while and then terminate early because of some condition, such as a time budget or the need to perform a higher-priority GC phase, while still having done productive work. This is in sharp contrast to a phase that has to run to completion.

3. How it works

Now that we know the properties of the different GC phases, let’s move on to how ZGC works. To achieve its goals, ZGC adds two techniques that are new to the HotSpot garbage collectors: colored pointers and read barriers (which the ZGC project itself calls load barriers).

Colored pointers

Pointer coloring is a technique for storing information in the pointers themselves (or references, in Java terminology). Because a 64-bit pointer can address far more memory than any machine actually has (ZGC only supports 64-bit platforms), some of its bits can be used to store state instead. ZGC limits itself to heaps of at most 4TB, which need 42 bits of address, leaving 22 bits available. It currently uses 4 of them: finalizable, remap, mark0 and mark1. We’ll explain their purpose later.
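To make the bit arithmetic concrete, here is a minimal sketch of a colored pointer held in a Java long. The class name, the helper methods and the exact bit positions (42 to 45) are assumptions made for illustration; the article only states that 42 bits hold the address and that 4 of the remaining bits are used.

```java
// Illustrative sketch only: a 64-bit value with 42 address bits and 4 color bits.
// The bit positions are an assumption, not taken from the article.
final class ColoredPointer {
    static final int  ADDRESS_BITS = 42;                        // 2^42 bytes = 4TB
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;

    static final long MARK0       = 1L << ADDRESS_BITS;         // assumed bit 42
    static final long MARK1       = 1L << (ADDRESS_BITS + 1);   // assumed bit 43
    static final long REMAP       = 1L << (ADDRESS_BITS + 2);   // assumed bit 44
    static final long FINALIZABLE = 1L << (ADDRESS_BITS + 3);   // assumed bit 45

    /** Replaces whatever color the pointer carried with the given color bit. */
    static long color(long pointer, long colorBit) {
        return (pointer & ADDRESS_MASK) | colorBit;
    }

    /** Masks out the color bits, leaving the plain heap address. */
    static long address(long pointer) {
        return pointer & ADDRESS_MASK;
    }

    /** Tests whether the given color bit is set on the pointer. */
    static boolean has(long pointer, long colorBit) {
        return (pointer & colorBit) != 0;
    }

    public static void main(String[] args) {
        long p = color(0x1234L, MARK0);
        System.out.println("colored: 0x" + Long.toHexString(p));
        System.out.println("address: 0x" + Long.toHexString(address(p)));
        System.out.println("remap bit set: " + has(p, REMAP));
    }
}
```

The address method is the masking step that uncoloring requires on every access, which is the extra work that the next paragraph is concerned with.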

One problem with colored pointers is that they create extra work when you need to dereference them, because the information bits have to be masked out first. Platforms like SPARC have built-in hardware support for pointer masking, so this is not a problem there, while for x86 the ZGC team uses a neat multi-mapping trick.

Multi-mapping

To understand how multi-mapping works, we need to briefly explain the difference between virtual and physical memory. Physical memory is the actual memory available to the system, usually the capacity of the installed DRAM chips. Virtual memory is an abstraction that gives each application its own (usually isolated) view of physical memory. The operating system is responsible for maintaining the mapping between virtual and physical memory ranges. It does this with page tables, and the processor’s memory management unit (MMU) and translation lookaside buffer (TLB) use them to translate the addresses requested by the application.

Multi-mapping involves mapping several ranges of virtual memory to the same physical memory. Since, by design, only one of remap, mark0 and mark1 can be 1 at any point in time, three mappings are enough to achieve this. There is a nice diagram in the ZGC source code that illustrates it.
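The effect of multi-mapping can be imitated in plain Java. In the sketch below, an array stands in for physical memory and the translate method plays the role of the page tables, so that three differently colored virtual addresses resolve to the same slot. All names and bit positions are hypothetical, and on real hardware this translation is done by the MMU rather than by application code.

```java
// Illustrative simulation only: three virtual "views" of one physical array.
// On real hardware the operating system's page tables do what translate() does here.
final class MultiMappingSketch {
    static final long OFFSET_MASK = (1L << 42) - 1;  // 42 address bits
    static final long MARK0_VIEW  = 1L << 42;        // assumed base of the mark0 view
    static final long MARK1_VIEW  = 1L << 43;        // assumed base of the mark1 view
    static final long REMAP_VIEW  = 1L << 44;        // assumed base of the remap view

    private final long[] physical;                   // stands in for physical memory

    MultiMappingSketch(int words) {
        physical = new long[words];
    }

    /** Maps a virtual address in any of the three views to one physical slot. */
    private int translate(long virtualAddress) {
        long view = virtualAddress & ~OFFSET_MASK;
        if (view != MARK0_VIEW && view != MARK1_VIEW && view != REMAP_VIEW) {
            throw new IllegalArgumentException("address is not mapped");
        }
        return (int) (virtualAddress & OFFSET_MASK);
    }

    long load(long virtualAddress) {
        return physical[translate(virtualAddress)];
    }

    void store(long virtualAddress, long value) {
        physical[translate(virtualAddress)] = value;
    }

    public static void main(String[] args) {
        MultiMappingSketch memory = new MultiMappingSketch(16);
        memory.store(MARK0_VIEW | 7, 99L);
        // The same word is visible through the other two views.
        System.out.println(memory.load(MARK1_VIEW | 7));
        System.out.println(memory.load(REMAP_VIEW | 7));
    }
}
```

Because every color has its own valid mapping, a colored pointer can be dereferenced directly, with no masking instruction on the access path.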

Read barriers

Read barriers are snippets of code that run whenever an application thread loads a reference from the heap (i.e. accesses a non-primitive field of an object):

void printName( Person person ) {
    String name = person.name;  // Triggers the read barrier, because the
                                // reference is loaded from the heap
    System.out.println(name);   // No read barrier directly: name is a local variable
}

In the code above, the line String name = person.name follows the person reference to the object on the heap and then loads the reference stored in its name field into the local variable name. This is where the read barrier is triggered. The System.out.println line does not trigger a read barrier directly, because name is a local variable and no reference is loaded from the heap. However, other read barriers may be triggered inside System and out, or inside println.

This is in contrast to the write barriers used by other GCs, such as G1. The job of the read barrier is to check the state of the reference and potentially do some work before returning the reference (or possibly a different reference) to the application. In ZGC, it does this by testing the loaded reference to see whether certain bits are set. If the test passes, no further work is needed; if it fails, some phase-specific task is performed before the reference is returned to the application.
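The test described above can be sketched as a fast path and a slow path. This is a simplified model, not ZGC's actual barrier: the class, the goodColor field and the bit positions are assumptions, and the slow path merely recolors the reference as a stand-in for the phase-specific work. Writing the corrected reference back into the field is also a choice of this sketch.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a read barrier over colored references held in AtomicLong fields.
final class ReadBarrierSketch {
    static final long ADDRESS_MASK = (1L << 42) - 1;
    static final long MARK0 = 1L << 42;              // assumed bit positions
    static final long MARK1 = 1L << 43;
    static final long REMAP = 1L << 44;
    static final long COLOR_MASK = MARK0 | MARK1 | REMAP;

    // The color the current GC phase expects; a collector would change it between phases.
    static volatile long goodColor = REMAP;

    // Counts slow-path executions; not thread-safe, for demonstration only.
    static int slowPathCount = 0;

    /** Loads a reference from a heap field, running the barrier on the way. */
    static long loadReference(AtomicLong field) {
        long ref = field.get();
        long badMask = COLOR_MASK & ~goodColor;
        if ((ref & badMask) == 0) {
            return ref;                              // fast path: test passed (null also passes)
        }
        // Slow path: placeholder for phase-specific work such as marking or relocating.
        slowPathCount++;
        long fixed = (ref & ADDRESS_MASK) | goodColor;
        field.compareAndSet(ref, fixed);             // store the fixed reference back
        return fixed;
    }

    public static void main(String[] args) {
        goodColor = MARK0;
        AtomicLong nameField = new AtomicLong(0x40L | REMAP);   // stale color
        System.out.println(Long.toHexString(loadReference(nameField)));
        System.out.println("slow paths taken: " + slowPathCount);
    }
}
```

In this model a second load of the same field takes the fast path, because the field already holds a reference with the expected color.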

Marking

Now that we know what these two new techniques are, let’s take a look at the ZGC GC cycle.

The first part of the GC cycle is marking. Marking involves finding and marking all heap objects that the running application could still access, in other words, finding the objects that are not garbage.

ZGC’s marking is divided into three phases. The first phase is Stop The World, and in it the GC roots are marked as live. GC roots are things like local variables, which the application can use to reach other objects on the heap. If an object cannot be reached by traversing the object graph starting from the roots, the application cannot access it either, and it is considered garbage. The set of objects reachable from the roots is called the live set. The root marking step is very short, because the total number of roots is usually relatively small.

When this phase is complete, the application resumes and ZGC begins the next phase, which concurrently traverses the object graph and marks all reachable objects. During this phase, the read barrier tests every loaded reference against a mask that determines whether it has already been marked for this cycle; if it has not, the reference is added to a queue for marking.

After the traversal is complete, there is a final, short Stop The World phase that handles some edge cases (which we will ignore for now), and then the marking phase is complete.
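The marking phases above can be sketched as a root-marking step followed by a queue-driven traversal. The sketch is single-threaded and uses a boolean flag per object instead of pointer colors, so it only shows the shape of the algorithm; every name in it is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative single-threaded sketch of marking with a mark queue.
final class MarkingSketch {
    static final class Node {
        final String name;
        final List<Node> fields = new ArrayList<>();  // outgoing references
        boolean marked;

        Node(String name) {
            this.name = name;
        }
    }

    private final Deque<Node> markQueue = new ArrayDeque<>();

    /** First phase (Stop The World in the article): mark the roots and queue them. */
    void markRoots(List<Node> roots) {
        for (Node root : roots) {
            markAndEnqueue(root);
        }
    }

    /** Marks an unmarked object and queues it; a read barrier could call this as well. */
    void markAndEnqueue(Node node) {
        if (node != null && !node.marked) {
            node.marked = true;
            markQueue.add(node);
        }
    }

    /** Second phase (concurrent in the article): traverse the graph from the queue. */
    void drain() {
        Node node;
        while ((node = markQueue.poll()) != null) {
            for (Node field : node.fields) {
                markAndEnqueue(field);
            }
        }
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        Node garbage = new Node("garbage");
        a.fields.add(b);
        b.fields.add(a);                              // cycles are handled by the marked flag

        MarkingSketch marker = new MarkingSketch();
        marker.markRoots(Arrays.asList(a));
        marker.drain();
        System.out.println(a.marked + " " + b.marked + " " + garbage.marked);
    }
}
```

Anything still unmarked once the queue is empty is unreachable from the roots and therefore garbage.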

Relocation

The next major part of the GC cycle is relocation. Relocation involves moving live objects in order to free up sections of the heap. Why move objects rather than just fill the gaps between them? Some GCs do exactly that, but it has the unfortunate consequence that allocation becomes more expensive, because the allocator has to find a free gap that fits each new object. By contrast, if large chunks of memory can be freed, allocation is as simple as incrementing (bumping) a pointer by the size of the new object.
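Pointer-bump allocation is simple enough to show in full. The sketch below hands out addresses from one contiguous free region; the class name and the convention of returning -1 when the region is full are assumptions of this example.

```java
// Illustrative sketch of pointer-bump allocation inside one contiguous free region.
final class BumpAllocator {
    private final long end;   // first address past the region
    private long top;         // next free address

    BumpAllocator(long start, long size) {
        this.top = start;
        this.end = start + size;
    }

    /** Returns the address of the new object, or -1 if the region cannot fit it. */
    long allocate(long size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        if (end - top < size) {
            return -1;        // region exhausted; a real allocator would move to a new region
        }
        long address = top;
        top += size;          // the bump
        return address;
    }

    public static void main(String[] args) {
        BumpAllocator region = new BumpAllocator(1000, 64);
        System.out.println(region.allocate(16));
        System.out.println(region.allocate(32));
    }
}
```

There is no search for a suitable gap: each allocation is one comparison and one addition, which is why freeing whole regions by moving objects pays off.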

ZGC divides the heap into many pages, and at the beginning of this phase it concurrently selects a set of pages whose live objects need to be relocated. Once this relocation set is selected, there is a Stop The World pause in which ZGC relocates the objects in the relocation set that the roots refer to and maps their references to the new locations. As with the previous Stop The World step, the pause time here depends only on the number of roots and on the ratio of the size of the relocation set to the total live set, which is usually quite small. So, unlike in many collectors, the pause time does not increase as the heap grows.

Once the roots have been dealt with, the next phase is concurrent relocation. In this phase, GC threads walk the relocation set and relocate all the objects in the pages it contains. If an application thread tries to load a reference to an object before the GC has relocated it, the application thread can relocate the object itself. This happens in the read barrier (which is triggered when the reference is loaded from the heap), as shown in the flowchart below:

This ensures that every reference the application sees has been updated, and that the application can never operate on an object while it is being relocated.

The GC threads will eventually relocate every object in the relocation set, but there may still be references that point to the old locations of those objects. The GC could traverse the object graph and remap all of these references to the new locations, but that would be an expensive step. Instead, it is merged into the next marking phase: while traversing the object graph during the marking phase of the next GC cycle, any reference that has not yet been remapped is remapped and then marked as live.
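One way to picture concurrent relocation and the later remapping is a forwarding table from old addresses to new ones. The sketch below is a simplified model under that assumption: the class and method names are hypothetical, copying the object contents is omitted, and a ConcurrentHashMap stands in for whatever structure a real collector would use.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: relocation through a forwarding table (old address -> new address).
final class RelocationSketch {
    private final Map<Long, Long> forwarding = new ConcurrentHashMap<>();
    private final long relocStart;      // start of the range being evacuated
    private final long relocEnd;        // end of that range (exclusive)
    private final AtomicLong nextFree;  // bump pointer in the target region

    RelocationSketch(long relocStart, long relocEnd, long targetStart) {
        this.relocStart = relocStart;
        this.relocEnd = relocEnd;
        this.nextFree = new AtomicLong(targetStart);
    }

    boolean inRelocationSet(long address) {
        return address >= relocStart && address < relocEnd;
    }

    /**
     * Called by GC threads and, through the read barrier, by application threads.
     * Whoever gets there first picks the new address; everyone else sees the same one.
     */
    long relocate(long oldAddress, long size) {
        if (!inRelocationSet(oldAddress)) {
            return oldAddress;
        }
        // Copying the object's contents to the new address is omitted in this sketch.
        return forwarding.computeIfAbsent(oldAddress, a -> nextFree.getAndAdd(size));
    }

    /** Fixes a stale reference lazily, for example during the next marking phase. */
    long remap(long address) {
        Long forwarded = forwarding.get(address);
        return forwarded != null ? forwarded : address;
    }

    public static void main(String[] args) {
        RelocationSketch gc = new RelocationSketch(0, 100, 1000);
        System.out.println(gc.relocate(16, 8));   // first caller relocates the object
        System.out.println(gc.relocate(16, 8));   // later callers get the same new address
        System.out.println(gc.remap(16));         // stale reference fixed on a later pass
    }
}
```

The atomic computeIfAbsent call is what lets a GC thread and an application thread race to relocate the same object and still agree on a single new location.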

Summary

It is difficult to understand the performance characteristics of a garbage collector as complex as ZGC in isolation, but it is clear from the previous sections that almost all of the pauses we have encountered depend only on the size of the GC root set, not on the size of the live heap. The final pause of the marking phase, which handles mark termination, is the only exception, but it is incremental: if the GC time budget is exceeded, the GC goes back to concurrent marking and tries again later.

4. Performance

So how does ZGC perform?

Stefan Karlsson and Per Liden gave some numbers in their Jfokus talk earlier this year. ZGC’s SPECjbb 2015 throughput is roughly comparable to that of the Parallel GC (which is optimized for throughput), but with an average pause time of 1ms and a maximum of 4ms. By comparison, G1 and Parallel both had many GC pauses of more than 200ms.

However, garbage collectors are complex pieces of software, and real-world performance cannot always be inferred from benchmark results. We look forward to testing ZGC ourselves to see how its performance varies from workload to workload.

Reference: mp.weixin.qq.com/s/nAjPKSj6r…
