Learn more about the core principles behind the G1 garbage collector.

Three color tag

The CMS collector uses the Incremental Update algorithm, while the G1 collector uses the SATB algorithm. The idea behind both uses a three-color tagging algorithm, which is as follows:

  • Black: an object that itself has been marked and all fields have been marked;
  • Gray: an object that is marked itself, but has fields that are not marked (the object the collector is accessing);
  • White: The object is not marked.

Note: at the end of the marking, the marked object is alive, and the unmarked object is reclaimed.

In the parallel GC phase, the application thread and GC are in parallel, so two things can happen at the same time:

  • 1. Mutator stores a reference to a white object in a field of a black object;
  • 2. A white object loses all reference paths (directly or indirectly) that can reach it from a grey object.

The black object holds a reference to the white object. By definition, the Collector will no longer traverse the fields of the black object, so it will not find a live reference to the white object. It doesn’t matter if there is another grey object that holds a direct or indirect reference to the white object; If all references from the grey object to the white object are unfortunately cut off, the white object will miss scanning.

Therefore, if the above two conditions occur at the same time, the object will be missed and recycled.

Incremental Update is the algorithm adopted by the CMS collector to address the above problems. The algorithm does this: whenever a reference to a white object is assigned to a black object field in a write barrier, turn the white object gray (for example, marking it on the marking stack, or recording it in a mod-union table). This strongly eliminates the occurrence of the first situation mentioned above.

SATB treats all live objects in the logical snapshot at the start of the tag as live. This is done by turning all old references to non-white objects in a write barrier.

SATB

SATB (Snapshot at the Beginning) is the basis of G1 concurrency theory and is used to maintain and ensure the correctness of the recycle process. GC starts with a snapshot of a live object.

Next, the newly allocated objects are alive during the GC.

It is easy to know which objects are newly allocated after a GC has started: each region records two Pointers to top-at-Mark-start (TAMS), prevTAMS and nextTAMS. The Top and NextTAMS Pointers overlap at the beginning of the concurrent marker (as shown in the figure below), and the object pairs allocated in the concurrent phase are in the [Next TAMS, Top] interval. Objects within this range are alive by default, which is also called implicit markup.

However, if A white object A is referenced as A field by A gray object B before the Next TAMS and is assigned null before the concurrent token scans the field, then the reference relationship between B–>A is cut off and white object A may be missed.

G1 solves this problem by inserting a layer of pre-write barrier before the reference relationship is modified. The pre-write barrier records the old reference value every time the reference relationship changes. These reference values are placed in the SATB Mark Queue, and in the next concurrent marking phase, objects in the SATB Mark Queue are processed in turn to ensure that these objects are alive in the current GC.

void G1SATBCardTableModRefBS::enqueue(oop pre_val) {
  // Nulls should have been already filtered.
  assert(pre_val->is_oop(true), "Error");

  if(! JavaThread::satb_mark_queue_set().is_active()) return;
  Thread* thr = Thread::current(a);if (thr->is_Java_thread()) {
    JavaThread* jt = (JavaThread*)thr;
    jt->satb_mark_queue().enqueue(pre_val);
  } else {
    MutexLockerEx x(Shared_SATB_Q_lock, Mutex::_no_safepoint_check_flag);
    JavaThread::satb_mark_queue_set().shared_satb_queue() - >enqueue(pre_val); }}Copy the code

However, it is very likely that an object that is alive in the Snapshot may have died as the concurrent GC progresses, but SATB will let it live through the GC. This leads to so-called floating garbage.

RSet

The chapter is excerpted fromSome key technologies for the Java Hotspot G1 GC

Remembered Set is a structure that helps GC processes. It is a typical space swap tool, similar to Card tables. Another data structure that assists GC is a Collection Set (CSet), which records the Collection of regions to be collected by the GC. The regions in the Collection can be of any age. For cross-generation object references of old->young and old->old, only rsets in the corresponding CSet can be scanned during GC. Logically, each Region has an RSet. Rsets record the relationship between objects in other regions and objects in this Region. Rsets belong to the points-into structure (who references my object). Card tables, on the other hand, are a points-out structure where each Card covers a certain range of Heap (typically 512Bytes). The RSet of G1 is implemented on the basis of Card Table: each Region records the Pointers of other regions and marks which Card ranges these Pointers belong to. The RSet is actually a Hash Table. The Key is the start address of another Region, and the Value is a set. The element in the RSet is the Index of the Card Table.

The following diagram shows the relationship between RSet, Card, and Region.

Cards in different regions reference each other. Objects in Cards in Region1 reference objects in Cards in Region2. The solid blue line shows the points-out relationship. In Region2’s RSet, the Card of Region1, represented by the red dotted line, is recorded, which is points-into. The reference relationships in rSETS are maintained by post-write barriers and Concurrent refinement threads.

void oop_field_store(oop* field, oop new_value) {
  pre_write_barrier(field);             // pre-write barrier: for maintaining SATB invariant
  *field = new_value;                   // the actual store
  post_write_barrier(field, new_value); // post-write barrier: for tracking cross-region reference
}
Copy the code

The post-write barrier records cross-region reference updates, and the update log buffer records Cards that contain reference updates. Once the buffers are full, the post-write barrier is no longer in service and the buffer logs are processed by Concurrent Refinement Threads. How exactly does RSet assist GC? When doing YGC, you only need to select rsets of the Young Generation region as the root set. These Rsets record cross-generation references of old->young, avoiding scanning the whole old generation. However, in mixed GC, RSet of old->old is recorded in old generation, and reference of young->old is obtained by scanning all young generation regions, so it is not necessary to scan all old generation regions. So the introduction of Rsets greatly reduces the amount of GC work.

Learning materials

  • A thorough dissection of G1GC
  • Some key technologies for the Java Hotspot G1 GC
  • RednaxelaFX forum reply post
  • Study notes of Uncle F
  • Garbage First Garbage Collector Tuning
  • The Garbage Collection Handbook
  • SATB for G1 garbage collector