You have probably read plenty of articles about Java garbage collection. If not, this earlier one is highly recommended: After reading this article about garbage collection, you will be able to argue with your interviewer

Rather than rehash GC algorithms and garbage collectors, this article looks at several questions that are easy to overlook about what happens when a GC occurs. Understanding them will give you a deeper understanding of GC.

Main contents of this article

  • Q1: How is GC work initiated?
  • Q2: Stop The World: how do you make Java threads stop?
  • Q3: How do I find GC Roots?
  • Q4: How are the four special references handled in GC?
  • Q5: How are references corrected after objects are moved?

Q1: How was GC work initiated?

Garbage collection is divided into Minor GC and Full GC, which work on different heap partitions and have different trigger conditions. In general, GC triggers fall into two types, active and passive:

  • Active: the program explicitly calls System.gc() to request a GC (which is not necessarily performed immediately, or at all)
  • Passive: a memory allocation fails and space needs to be reclaimed

In either case, the GC is initiated in the same way:

  • Step 1: the thread that needs a GC initiates a VM_Operation (this is a base class, and each garbage collector initiates its own subclass; for example, the CMS collector initiates VM_GenCollectFullConcurrent)
  • Step 2: the operation is posted to a queue. A dedicated thread in the JVM, the VMThread, handles the requests in this queue by calling each VM_Operation's evaluate function.
  • Step 3: the evaluate function of VM_Operation calls the operation's own doit virtual function
  • Step 4: the VM_Operation subclass derived for each garbage collector overrides doit to implement its own collection work, a typical use of C++ polymorphism.
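
The four steps can be modeled in a few lines of Java. This is only a sketch of the pattern, not HotSpot code: the class names VmOperation, YoungCollectOp, FullCollectOp and VmThreadModel are made up for illustration, and the "VM thread" here is a plain method call on the same thread rather than a real dedicated thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of the VM_Operation base class.
abstract class VmOperation {
    // Step 3: evaluate() is the fixed entry point and delegates to doit().
    final void evaluate(List<String> log) {
        log.add("evaluate " + name());
        doit(log);
    }

    abstract String name();

    // Step 4: each collector-specific subclass overrides doit().
    abstract void doit(List<String> log);
}

final class YoungCollectOp extends VmOperation {
    String name() { return "YoungCollectOp"; }
    void doit(List<String> log) { log.add("collect young generation"); }
}

final class FullCollectOp extends VmOperation {
    String name() { return "FullCollectOp"; }
    void doit(List<String> log) { log.add("collect all generations"); }
}

// Hypothetical stand-in for the VMThread and its operation queue.
final class VmThreadModel {
    private final Queue<VmOperation> queue = new ArrayDeque<>();
    final List<String> log = new ArrayList<>();

    // Steps 1-2: a thread that needs a GC posts an operation to the queue.
    void execute(VmOperation op) { queue.add(op); }

    // The VM thread takes operations off the queue and evaluates them in order.
    void drain() {
        VmOperation op;
        while ((op = queue.poll()) != null) {
            op.evaluate(log);
        }
    }
}
```

The caller never needs to know which collector it is talking to: it only posts an operation, and the overridden doit decides what "collect" means.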

Q2: How do you make Java threads Stop?

You may have heard of STW (Stop The World): all working Java threads must be stopped while garbage collection is performed. As for why, to borrow a passage from the previous article:

Why are other worker threads suspended during garbage collection? Imagine that you are collecting rubbish while another group of people are throwing it away. Can the rubbish be picked up?

So how exactly do these Java threads stop?

First of all, the garbage collection thread certainly does not just call suspend on them.

A thread cannot be stopped at an arbitrary place, because the GC that follows will “migrate” objects in the heap. If a thread is stopped at the wrong place, it may wake up and operate on those objects in unexpectedly wrong ways.

So where do we stop? This leads to another important concept: the safepoint. A thread that has reached a safepoint is at a place where its reference relationships will not change.

Safepoint synchronization is initiated by the VMThread described in the previous section. It begins before a VM_Operation is processed and ends after the operation has been processed.

void VMThread::loop() {
  while (true) {
    ...
    _cur_vm_operation = _vm_queue->remove_next();
    ...
    // Safepoint synchronization starts
    SafepointSynchronize::begin();
    // Process the current VM_Operation
    evaluate_operation(_cur_vm_operation);
    ...
    // Safepoint synchronization ends
    SafepointSynchronize::end();
    ...
  }
  ...
}

Note that not every VM_Operation requires safepoint synchronization. For clarity and simplicity, that logic is omitted from the code above.

A Java thread can be in different states, and HotSpot uses a different way of getting it to a safepoint depending on which state it is in. A long comment in the HotSpot source explains this, covering the following cases:

1. Executing interpreted bytecode

Interpreted execution in the JVM can be thought of, in simplified terms, as a huge switch-case that keeps fetching a bytecode and executing the code that corresponds to it. The JVM therefore needs a table that maps each bytecode to its code block. This table is called the DispatchTable.

In fact, there are two such tables inside the JVM: one for the normal state and one for when threads need to enter a safepoint.

One of the jobs of the code that enters the safepoint is to replace the dispatch table currently in effect with the safepoint version. When the safepoint ends, the normal table is restored.

The code blocks in the replacement DispatchTable include safepoint check code, which is not expanded on here.
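
To make the table swap concrete, here is a toy interpreter in Java. It is a sketch under simplifying assumptions, not HotSpot's template interpreter: the two opcodes, the class DispatchTableDemo, and the methods noticeSafepoints and ignoreSafepoints are hypothetical, and the "safepoint check" only bumps a counter instead of blocking the thread.

```java
import java.util.function.IntUnaryOperator;

final class DispatchTableDemo {
    static final int OP_INC = 0;
    static final int OP_DOUBLE = 1;

    // Counts how often a handler from the safepoint table ran its check.
    int safepointChecks = 0;

    // Normal table: one handler per opcode.
    private final IntUnaryOperator[] normalTable = {
        acc -> acc + 1,   // OP_INC
        acc -> acc * 2    // OP_DOUBLE
    };

    // Safepoint table: the same handlers, each preceded by a check.
    private final IntUnaryOperator[] safepointTable =
        new IntUnaryOperator[normalTable.length];

    // The table the interpreter loop currently dispatches through.
    private IntUnaryOperator[] activeTable = normalTable;

    DispatchTableDemo() {
        for (int i = 0; i < normalTable.length; i++) {
            IntUnaryOperator plain = normalTable[i];
            safepointTable[i] = acc -> {
                safepointChecks++;            // a real VM would block here if needed
                return plain.applyAsInt(acc);
            };
        }
    }

    // Entering a safepoint: swap in the table with checks.
    void noticeSafepoints() { activeTable = safepointTable; }

    // Leaving a safepoint: restore the normal table.
    void ignoreSafepoints() { activeTable = normalTable; }

    // The "huge switch-case", expressed as a table lookup per bytecode.
    int run(int[] code, int acc) {
        for (int op : code) {
            acc = activeTable[op].applyAsInt(acc);
        }
        return acc;
    }
}
```

The point of the design is that the normal table pays nothing for safepoint support: the check exists only in the handlers that are swapped in when a safepoint is requested.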

2. Executing native code

Threads that are in the middle of a JNI call need no special handling in SafepointSynchronize::begin. A Java thread executing native code proactively checks whether it needs to suspend itself when it returns from the JNI call.

3. Executing compiled code

Most modern JVMs use just-in-time (JIT) compilation, which compiles hot code into native machine instructions during execution, usually at method granularity, to speed it up.

Put simply, when the JVM finds that a method is executed repeatedly, or that a block of code inside a method loops many times, it decides to compile that code into native code instead of interpreting it from bytecode.

Compiled code does not go through bytecode, so there is no bytecode dispatch table involved, and the table replacement used in the first case has no effect on a thread running such code. So what to do?

HotSpot uses an approach called active interruption to get these threads to a safepoint. Specifically, the JVM keeps a memory page that each thread reads from time to time as it works. Normally the read succeeds and nothing happens. Before a GC, however, the JVM’s housekeeper, the VMThread, marks that page as unreadable. When a worker thread next tries to read the page, it triggers a memory access exception, and the exception handler that the JVM installed in advance takes over the thread’s execution, then blocks and suspends the thread.

// Roll all threads forward to a safepoint
// and suspend them all
void SafepointSynchronize::begin() {
  ...
  os::make_polling_page_unreadable();
  ...
}

Calling os::make_polling_page_unreadable() makes the polling page unreadable. This function has a different implementation on each operating system platform. Take the common Linux and Windows platforms as examples:

Linux:

void os::make_polling_page_unreadable(void) {
  if (!guard_memory((char*)_polling_page,
                    Linux::page_size())) {
    fatal("Could not disable polling page");
  }
}

bool os::guard_memory(char* addr, size_t size) {
  // PROT_NONE: the page can be neither read nor written
  return linux_mprotect(addr, size, PROT_NONE);
}

static bool linux_mprotect(char* addr, size_t size, int prot) {
  // mprotect works on whole pages, so align the range first
  char* bottom = (char*)align_down((intptr_t)addr, os::Linux::page_size());
  assert(addr == bottom, "sanity check");
  size = align_up(pointer_delta(addr, bottom, 1) + size, os::Linux::page_size());
  return ::mprotect(bottom, size, prot) == 0;
}

In the end, the system-level API mprotect is called to set the protection of the memory page, which should be familiar to anyone who has done C/C++ programming on Linux.

Windows:

void os::make_polling_page_unreadable(void) {
  DWORD old_status;
  // PAGE_NOACCESS: any access to the page raises an access violation
  if (!VirtualProtect((char *)_polling_page,
                      os::vm_page_size(),
                      PAGE_NOACCESS,
                      &old_status)) {
    fatal("Could not disable polling page");
  }
}

In the end, the system-level API VirtualProtect is called to set the protection of the memory page, which should be familiar to anyone who has done C/C++ programming on Windows.

Where is this particular page? Its address is held in a static member variable (_polling_page) of the os class under runtime.
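
Java code cannot change page protections, but the idea of "workers poll, the coordinator flips a switch and waits" can be sketched with a volatile flag. Everything below is a hypothetical model: SafepointPollDemo and its methods are made-up names, the flag stands in for the polling page, and an explicit poll() call stands in for the page read that compiled code performs.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

final class SafepointPollDemo {
    private final Object lock = new Object();
    // Stand-in for the polling page: true plays the role of "page unreadable".
    private volatile boolean safepointRequested = false;
    // Registered worker threads that are not parked (guarded by lock).
    private int running = 0;

    void register() { synchronized (lock) { running++; } }

    void unregister() { synchronized (lock) { running--; lock.notifyAll(); } }

    // Called by workers from time to time, like the poll in compiled code.
    void poll() {
        if (!safepointRequested) return;        // fast path: nothing requested
        synchronized (lock) {
            running--;                          // this thread is now parked
            lock.notifyAll();
            while (safepointRequested) awaitLock();
            running++;
        }
    }

    // Called by the coordinator: returns once every worker is parked.
    void begin() {
        synchronized (lock) {
            safepointRequested = true;
            while (running > 0) awaitLock();
        }
    }

    // Called by the coordinator: lets the parked workers continue.
    void end() {
        synchronized (lock) { safepointRequested = false; lock.notifyAll(); }
    }

    private void awaitLock() {
        try { lock.wait(); }
        catch (InterruptedException e) { throw new IllegalStateException(e); }
    }

    // Runs two workers, holds them at a safepoint, and reports whether the
    // shared counter stayed unchanged while the safepoint was held.
    static boolean demo() {
        SafepointPollDemo sp = new SafepointPollDemo();
        AtomicLong counter = new AtomicLong();
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread[] workers = new Thread[2];
        for (int i = 0; i < workers.length; i++) {
            sp.register();
            workers[i] = new Thread(() -> {
                while (!stop.get()) { counter.incrementAndGet(); sp.poll(); }
                sp.unregister();
            });
            workers[i].start();
        }
        try {
            sp.begin();
            long before = counter.get();
            Thread.sleep(50);
            long after = counter.get();
            sp.end();
            stop.set(true);
            for (Thread t : workers) t.join();
            return before == after;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The page-protection trick in HotSpot serves the same purpose as the flag check here, but makes the fast path a single memory read with no branch.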

4. Blocked

Threads that are blocked on I/O, lock synchronization, and so on stay blocked and are not allowed to resume until the GC completes.

5. In the VM or transitioning between states

A Java thread spends most of its time interpreting and executing Java bytecode, but at some moments the JVM itself takes over execution. When a thread is at one of these special moments, the JVM also proactively checks the safepoint state whenever the thread’s state is switched.

Q3: How do I find GC Roots?

Who are GC Roots?

During GC, a reachability analysis algorithm finds the objects that are still in use, and these are copied and retained. The remaining objects, which are not on any reference chain, are cleaned up. The starting point of reachability analysis is a set of things called GC Roots. What are GC Roots? Where are they?

  • The object referenced in the virtual machine stack (the local variable table in the stack frame)
  • The object referenced by the class static property in the method area
  • The object referenced by the constant in the method area
  • Objects referenced by JNI (commonly referred to as Native methods) in the Native method stack
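
Reachability analysis from a set of roots is a plain graph traversal. The sketch below assumes a toy object model (the Obj class with a list of outgoing references is hypothetical) and only does the marking step, not copying or cleaning.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ReachabilityDemo {
    // Toy heap object: a name plus the references it holds.
    static final class Obj {
        final String name;
        final List<Obj> fields = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    // Marks everything reachable from the roots; anything else is garbage.
    static Set<Obj> mark(List<Obj> roots) {
        Set<Obj> live = new HashSet<>();        // identity-based: Obj has no equals()
        Deque<Obj> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            Obj o = work.pop();
            if (live.add(o)) {                  // first visit: follow its references
                work.addAll(o.fields);
            }
        }
        return live;
    }
}
```

Note that two objects which reference only each other are still garbage if no root leads to them, which is exactly why reachability is used instead of reference counting.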

Now we know who they are and where they are. But how does GC actually find them? Take the first kind, objects referenced from the stack. A JVM often runs dozens of threads, and each thread’s call stack is at least a dozen frames deep, sometimes dozens or hundreds. Finding every reference in all of those frames is clearly time-consuming work. Yet GC has to finish as quickly as possible to shorten the pause it imposes on the application, and it still has to trace reference chains and copy objects afterward, so there is not much time left for enumerating GC Roots.

Modern Java virtual machines, including HotSpot, adopt a space-for-time strategy. The core idea is simple: record the location of GC Roots in advance, and track them quickly when GC is performed.

So the question is, where does this location information exist? What kind of data structure? How is this information updated as threads are executing and reference relationships are changing?

Derivation of the OopMap

Before we answer those questions, let’s forget about GC Roots for a moment and consider another question:

If a JVM thread scans the Java stack and finds a 64-bit number 0x0007FF3080345600, how does the JVM know if this is an address to an object in the Java heap (that is, a reference) or if it is just a long variable?

As we all know, one of the biggest changes in Java compared with C/C++ is the elimination of troublesome pointers, which frees programmers from managing memory through them. But the JVM is written in C++, after all. Rather than saying Java has no pointers, it is fairer to say that Java is full of pointers. In Java they simply go by a different name: references.

It should be added that in some early JVM implementations, a reference was just a handle value, an index into a table of object addresses. Modern JVMs no longer implement references this way and use direct pointers instead. This is discussed further in Q5 of this article (How can references be corrected after objects are moved?).

Going back to the question: why does the JVM need to know whether a 64-bit value is a reference or a long variable? The answer: if it does not know, how can it reclaim memory?

This leads to another set of terms: conservative GC and exact GC.

  • Conservative GC: the virtual machine cannot clearly make the distinction described above, so it cannot tell which values on the stack are references. It therefore takes a conservative attitude: if a value looks like a pointer to an object (for example, the number falls within the heap and there happens to be an object header at that position), it is treated as a reference. Treating a possible non-reference as a reference is essentially a lazy shortcut, and it means some garbage may be retained instead of being collected (think about why).
  • Exact GC: in contrast to conservative GC, the JVM knows exactly whether a 64-bit value is a long or a reference to an object. Modern commercial JVMs use this more advanced approach: the JVM knows exactly what each slot in the stack and in each object holds, so it neither frees objects by mistake nor misses garbage.

So how does an exact GC know all this? The answer is that the JVM records this information in a data structure, which in HotSpot is called an OopMap.

To answer the last question in the previous section, the location information of GC Roots is also in OopMap.
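
The difference between the two approaches can be shown with a toy stack frame. This is a sketch with assumed data structures, not the real OopMap layout: the frame is a long[] of slots, the heap is a set of object start addresses, and the map is a BitSet with one bit per slot that holds a reference.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Set;

final class OopMapDemo {
    // Conservative scan: any slot whose value matches the start address of
    // a heap object is treated as a reference, whether it is one or not.
    static List<Long> conservativeRoots(long[] slots, Set<Long> objectStarts) {
        List<Long> roots = new ArrayList<>();
        for (long value : slots) {
            if (objectStarts.contains(value)) roots.add(value);
        }
        return roots;
    }

    // Exact scan: only the slots flagged in the map are references.
    static List<Long> exactRoots(long[] slots, BitSet oopMap) {
        List<Long> roots = new ArrayList<>();
        for (int i = oopMap.nextSetBit(0); i >= 0; i = oopMap.nextSetBit(i + 1)) {
            roots.add(slots[i]);
        }
        return roots;
    }
}
```

If a frame holds a real reference to 0x1000, the long value 42, and a long that happens to equal 0x2000 where another object starts, the conservative scan reports both 0x1000 and 0x2000 and so keeps the second object alive by accident, while the exact scan reports only 0x1000.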

What does OopMap look like?

How is OopMap data generated?

The HotSpot code that creates OopMap data is scattered throughout the source. You can find it by searching for new OopMap in the source directory. A preliminary reading shows that OopMaps are created at places such as function returns, exception jumps, and loop jumps. At these points the JVM records OopMap information for use by a later GC.

Q4: How to handle four special references in GC?

Any article on GC will tell us that reachability analysis starts from GC Roots to find objects that are no longer referenced. But the notion of a “reference” here is not that simple.

The references we normally use in Java are strong references, but there are other kinds:

  • Strong reference: the default, pointing directly at an object created with new
  • Soft reference: SoftReference
  • Weak reference: WeakReference
  • Phantom reference: PhantomReference, also called a virtual reference

The following is a brief introduction to the above types of references, excluding the default strong reference:

Soft references

Soft references describe objects that are useful but not essential. Before the system is about to throw an out-of-memory error, objects associated only with soft references are listed for a second round of collection. If there is still not enough memory after that collection, the out-of-memory error is thrown. (From Understanding the Java Virtual Machine in Depth)

To sum up: if an object A is referenced only by a SoftReference object, A is normally not cleaned up while memory is sufficient. But if memory is tight, then sorry, A has to go. This is why soft references are “soft”.

Weak references

Weak references also describe non-essential objects, but they are weaker than soft references. An object that is associated only with weak references (with no strong reference to it) is reclaimed at the next garbage collection. (From Understanding the Java Virtual Machine in Depth)

A WeakReference is weaker than a soft reference, so weak that even when memory is sufficient, if object A is referenced only by a WeakReference object, then sorry, A has to go as well. This is why weak references are “weak”.
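
A small experiment with the standard java.lang.ref classes shows both behaviors. The class name SoftWeakDemo is made up for this sketch. Only the first method has a guaranteed result; what main prints after the strong references are dropped depends on the JVM and on whether System.gc() actually triggers a collection, so no output is promised here.

```java
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

final class SoftWeakDemo {
    // While a strong reference exists, neither reference may be cleared,
    // no matter how many collections run.
    static boolean visibleWhileStronglyHeld() {
        Object a = new Object();
        SoftReference<Object> soft = new SoftReference<>(a);
        WeakReference<Object> weak = new WeakReference<>(a);
        System.gc();   // only a request; a is still strongly reachable below
        return soft.get() == a && weak.get() == a;
    }

    public static void main(String[] args) {
        System.out.println("visible while strongly held: " + visibleWhileStronglyHeld());

        // Referents with no strong reference: eligible for clearing.
        SoftReference<Object> soft = new SoftReference<>(new Object());
        WeakReference<Object> weak = new WeakReference<>(new Object());
        System.gc();   // best effort, not guaranteed to collect
        System.out.println("weak cleared: " + (weak.get() == null));
        System.out.println("soft cleared: " + (soft.get() == null));
    }
}
```

On a typical run with plenty of free heap one would expect the weak reference to be cleared and the soft reference to survive, which matches the descriptions above, but this is an expectation rather than a guarantee.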

Phantom reference

The existence of a phantom (virtual) reference does not affect an object’s lifetime at all, and an object instance cannot be obtained through one. The sole purpose of associating a phantom reference with an object is to receive a system notification when the object is reclaimed by the collector. (From Understanding the Java Virtual Machine in Depth)

This is weaker still than the weak reference above, and in some ways it is hardly a reference at all: the two reference types above return the original object from their get method, whereas PhantomReference overrides get to always return null:

public class PhantomReference<T> extends Reference<T> {
  public T get() {
    return null;
  }
}
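
The notification role of a phantom reference comes from pairing it with a ReferenceQueue. The sketch below uses only standard java.lang.ref APIs (Reference.reachabilityFence needs Java 9 or later); the class and method names are made up. The first method is deterministic, while the second depends on the collector actually running, so treat its result as best effort.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

final class PhantomDemo {
    // While the referent is strongly reachable, get() is still null and
    // nothing has been enqueued.
    static boolean checkWhileAlive() {
        Object referent = new Object();
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        PhantomReference<Object> ref = new PhantomReference<>(referent, queue);
        boolean getIsNull = ref.get() == null;
        boolean notEnqueued = queue.poll() == null;
        Reference.reachabilityFence(referent);   // keep referent alive up to here
        return getIsNull && notEnqueued;
    }

    // Best effort: waits up to timeoutMillis (must be > 0) for the reference
    // to be enqueued after its referent becomes unreachable.
    static boolean waitForNotification(long timeoutMillis) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        PhantomReference<Object> ref = new PhantomReference<>(new Object(), queue);
        System.gc();   // only a request
        Reference<?> enqueued = queue.remove(timeoutMillis);
        return enqueued == ref;
    }
}
```

The queue is the whole point: since get() never returns the object, the only thing a program can learn from a phantom reference is that its referent has been collected.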

Final references

In addition to the four types above, there is a special reference called FinalReference. It ensures that an object of a class that overrides the finalize method has finalize executed before the object is cleaned up.

The definitions for the above references are as follows in the HotSpot source:

Cleanup policy

How does the JVM treat these special types of references differently when performing GC?

In HotSpot, regardless of the garbage collector, after all references have been traversed from the GC Roots and before object cleanup is performed, the function ReferenceProcessor::process_discovered_references is called to handle the references that need cleaning, as its name suggests.

Before that function is called, there is one more step: ReferenceProcessor::setup_policy is called to set the handling policy.

The logic of the function is simple: the bool parameter always_clear determines whether _always_clear_soft_ref_policy or _default_soft_ref_policy is used.

As the names suggest, one is the policy that always clears soft references and the other is the default policy.

The always-clear policy is AlwaysClearPolicy.

The default policy is LRUMaxHeapPolicy when running in Server mode, and LRUCurrentHeapPolicy otherwise (Client mode).

ReferencePolicy is a base class whose core virtual function, should_clear_reference, decides whether the corresponding reference should be cleared. HotSpot provides four subclasses as reference processing policies:

  • NeverClearPolicy: never clears
  • AlwaysClearPolicy: always clears
  • LRUCurrentHeapPolicy: clears references that have not been used recently (how recent is evaluated from the free space in the current heap)
  • LRUMaxHeapPolicy: clears references that have not been used recently (how recent is evaluated from the free space relative to the maximum heap size)
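
The four policies can be modeled as one small interface. This is a simplified sketch, not a transcription of the HotSpot classes: ReferencePolicyModel, ReferencePolicies and the lru factory are hypothetical names, and the rule "allow a fixed number of milliseconds of idle time per free megabyte" is an assumption loosely based on the -XX:SoftRefLRUPolicyMSPerMB flag. The two LRU policies then differ only in which free-space figure is passed in: free space in the current heap, or free space measured against the maximum heap size.

```java
// Hypothetical model of ReferencePolicy::should_clear_reference.
interface ReferencePolicyModel {
    // idleMillis: how long ago the soft reference was last used.
    boolean shouldClear(long idleMillis);
}

final class ReferencePolicies {
    static final ReferencePolicyModel NEVER = idleMillis -> false;
    static final ReferencePolicyModel ALWAYS = idleMillis -> true;

    // LRU model: a soft reference may stay idle for msPerMb milliseconds
    // per free megabyte; beyond that it should be cleared.
    static ReferencePolicyModel lru(long freeMb, long msPerMb) {
        final long maxInterval = freeMb * msPerMb;
        return idleMillis -> idleMillis > maxInterval;
    }
}
```

For example, with 10 MB free and 1000 ms per MB, a soft reference idle for 5 seconds is kept and one idle for 11 seconds is cleared. With no free space at all, every idle soft reference is cleared, which ties the "softness" directly to heap pressure.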

Whether always_clear is passed to setup_policy as true or false matters, because it directly decides whether soft references are later handled by LRUCurrentHeapPolicy/LRUMaxHeapPolicy or by AlwaysClearPolicy.

In the HotSpot source code, different garbage collectors handle this slightly differently, but in general always_clear is false in most scenarios. Only after multiple attempts to allocate memory have failed is it set to true, so that soft references are cleared to free up more space.

Keep these policies in mind. The choice of policy will affect how soft references are handled.

Processing logic analysis of special references

Going back to the process_discovered_references function, let’s look at the contents of this function:

As can be seen from the variable name and comments, process_discovered_reflist is called internally to handle the Soft, Weak, Final, and Phantom special references.

This function is declared as follows:

Focus on the second parameter, policy, and the third parameter, clear_referent. Look back at the arguments passed in the call to this function above:

Reference type   | policy   | clear_referent
SoftReference    | non-NULL | true
WeakReference    | NULL     | true
FinalReference   | NULL     | false
PhantomReference | NULL     | true

Different parameters will determine the fate of four different references.

In process_discovered_reflist, the reference processing is divided into three phases. Let’s take a look at the first phase:

Phase 1: Handle soft references

According to the comments, the policy parameter is non-null only for soft references.

process_phase1, the function that does the actual work, iterates over all soft references. For referents that are no longer alive, the should_clear_reference function of the policy mentioned earlier decides whether the reference stays on the list to be cleaned, or is removed from it so that its referent is kept.

Phase 2: Remove references to surviving objects

The main task at this phase is to remove from the list any reference whose referent is still alive (because other strong references point to it).

Phase 3: Disconnect the remaining references from their referents

Looking back at the table above, clear_referent is true for Weak, Soft, and Phantom references, meaning that by this phase everything that should be kept has been kept and the rest is to be cleared. In this phase, the referent field of each remaining reference is set to null, cutting the last link between the object and the special reference and sealing the object’s fate in the subsequent collection.

For Final references, this parameter is false, so phase 3 does not disconnect the reference from its object. The link is cut only after the finalize method has been executed. So in this GC, an object of a class that overrides finalize is kept alive for the time being.

Summary

You may be a little confused at this point: so many types of references and so many processing phases can make anyone’s head spin. Don’t worry. Xuanyuan Jun felt the same on first reading, and even for this article had to go through the source code repeatedly and verify the details before sorting it all out.

Let’s review what happens to each type of reference in each phase:

  • Soft references
    • Phase 1: for referents that are no longer alive, the policy decides whether the reference is removed from the list to be cleaned
    • Phase 2: references whose referents are still alive are removed from the list to be cleaned
    • Phase 3: if the policy in phase 1 decided to clear soft references, the remaining soft references are cleared in phase 3, cutting their last link to the objects. If the policy decided not to clear them, the list to be cleaned is already empty by phase 3 and the soft references are kept.
    • Conclusion: when an object referenced only by soft references is cleaned up depends on the clearing policy and, ultimately, on current heap usage
  • Weak references
    • Phase 1: no processing. Phase 1 handles soft references only
    • Phase 2: references whose referents are still alive are removed from the list to be cleaned
    • Phase 3: the referents of the remaining weak references are no longer alive, so the weak references are cleared, cutting their last link to the objects
    • Conclusion: an object referenced only by weak references is cleaned up at the first GC
  • Phantom references
    • Phase 1: no processing. Phase 1 handles soft references only
    • Phase 2: references whose referents are still alive are removed from the list to be cleaned
    • Phase 3: the referents of the remaining phantom references are no longer alive, so the phantom references are cleared, cutting their last link to the objects
    • Conclusion: an object referenced only by phantom references is cleaned up at the first GC
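
The three phases fit into one short function. The following is a model written for this summary, not the HotSpot implementation: RefProcessingDemo, Ref and process are hypothetical names, liveness is a precomputed boolean rather than the result of a real marking pass, and the policy is a plain Predicate that is null for everything except soft references.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

final class RefProcessingDemo {
    // Toy special reference with its referent and the referent's liveness.
    static final class Ref {
        final String name;
        Object referent = new Object();   // set to null when cleared in phase 3
        boolean referentAlive;            // result of the marking pass
        Ref(String name, boolean referentAlive) {
            this.name = name;
            this.referentAlive = referentAlive;
        }
    }

    // discovered: the list to be cleaned, modified in place.
    // shouldClear: the policy, non-null only for soft references.
    static void process(List<Ref> discovered, Predicate<Ref> shouldClear,
                        boolean clearReferent) {
        // Phase 1 (soft references only): if the policy says "keep", the
        // referent is treated as alive and the reference leaves the list.
        if (shouldClear != null) {
            for (Iterator<Ref> it = discovered.iterator(); it.hasNext();) {
                Ref r = it.next();
                if (!r.referentAlive && !shouldClear.test(r)) {
                    r.referentAlive = true;
                    it.remove();
                }
            }
        }
        // Phase 2: references whose referent is still alive leave the list.
        discovered.removeIf(r -> r.referentAlive);
        // Phase 3: clear the referent, or keep it alive (final references).
        for (Ref r : discovered) {
            if (clearReferent) {
                r.referent = null;
            } else {
                r.referentAlive = true;
            }
        }
    }
}
```

Calling process with a policy that always answers "keep" leaves the list empty and every referent intact, while passing a null policy with clearReferent set to true (the weak and phantom rows of the table) clears every reference whose referent did not survive marking.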

Q5: How can references be corrected after objects are moved?

By now we know that garbage collection involves moving objects, and once an object has “moved”, every reference to it (references on the stack, references held in fields of heap objects, and so on) becomes invalid. Our program keeps running after a GC only because the JVM does a lot of work behind the scenes, so that it looks as if the world merely stopped for a moment and woke up as though nothing had happened.

The natural question is: how are references fixed up when objects move?

Before answering that, let’s look at how a reference actually “points” to an object in Java. Over the history of the JVM, two schemes have emerged:

Scheme 1: handles

The reference does not point directly at the object. The object’s address is stored in a table, and the reference itself is only the index of an entry in that table. Here’s an illustration from Understanding the Java Virtual Machine in Depth:

The handle idea is common elsewhere. On Windows, both windows and kernel objects (Mutex, Event, etc.) are described and managed in the kernel. For security, the address of a kernel object is never exposed directly. The application layer only gets a handle value and uses that handle for all interaction.

File descriptors on Linux reflect the same idea. It is even true of the virtual memory addresses used by modern operating systems, which are not physical memory addresses but are translated through address translation tables.

The advantage of this scheme is obvious: after an object is moved, none of the references need to be modified, only the object’s address in the table.

The downside is just as obvious: every object access needs one extra “translation” step, which costs performance.
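
A handle table is small enough to sketch directly. The class HandleTableDemo and its fixed-size long[] table are assumptions made for illustration; addresses are plain numbers rather than real memory locations.

```java
final class HandleTableDemo {
    // handle -> current address of the object
    private final long[] table = new long[16];
    private int next = 0;

    // Allocating an object hands out a handle, not an address.
    int newHandle(long address) {
        table[next] = address;
        return next++;
    }

    // Every access pays for one extra lookup.
    long resolve(int handle) {
        return table[handle];
    }

    // Moving an object updates only its table entry; all copies of the
    // handle held elsewhere stay valid without being touched.
    void move(int handle, long newAddress) {
        table[handle] = newAddress;
    }
}
```

However many copies of a handle exist in stacks and object fields, a move costs exactly one table write, and that is the whole appeal of the scheme.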

Scheme 2: direct pointers

The second scheme is the direct pointer: there is no middleman, and the reference itself is a pointer. Again, an illustration from Understanding the Java Virtual Machine in Depth:

Compared with the first scheme, the advantages and disadvantages are swapped.

Advantages: more direct access to objects and better performance.
Disadvantages: fixing references after objects are moved is cumbersome.

Modern commercial JVMs, represented by HotSpot, have chosen the direct pointer scheme for locating objects.

With this scheme, every existing reference value has to be modified after a move, which is no small amount of work.

Fortunately, the OopMap introduced in Q3 (How do I find GC Roots?) once again comes to the rescue.

The information stored in the OopMap tells the JVM where the references are. This key information is used not only to find GC Roots for garbage collection, but also as an important guide for fixing up references.
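
With a map of which slots hold references, the fix-up itself is a simple rewrite. The sketch below makes several assumptions: slots are a long[], the map is a BitSet with one bit per reference slot, and the collector is assumed to have produced a forwarding table from old addresses to new ones. None of these are the real HotSpot structures.

```java
import java.util.BitSet;
import java.util.Map;

final class PointerFixupDemo {
    // Rewrites every slot flagged as a reference, using the forwarding
    // table (old address -> new address) built while objects were moved.
    static void fixReferences(long[] slots, BitSet oopMap, Map<Long, Long> forwarding) {
        for (int i = oopMap.nextSetBit(0); i >= 0; i = oopMap.nextSetBit(i + 1)) {
            Long moved = forwarding.get(slots[i]);
            if (moved != null) {
                slots[i] = moved;   // the object moved: point at its new home
            }
        }
    }
}
```

Slots that are not flagged in the map are never touched, so a long that merely happens to equal an old object address keeps its value. This is the same exactness that made root finding safe, now used in the other direction.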

Reference links:

RednaxelaFX: Find the pointer/reference on the stack

Final words

After reading this article, I hope you know not only what GC is, but also more about the work that goes on behind the scenes, so that you can hold your own for a few more rounds when discussing GC with an interviewer.

Of course, the author’s technical level is limited, and a long article like this is hard to get entirely right. If you find any mistakes, please point them out so that they can be corrected in time. Thank you.
