JVM garbage collection process

This article mainly learns from “In-depth Understanding of Java Virtual Machine”, the content roughly follows the order of the book, plus their own understanding of the algorithm learning. Today Koizumi mainly with you to learn the whole process of recycling.

preface

Garbage Collection is an automatic memory management mechanism. As usual, the Wikipedia definition of garbage collection is posted below.

In computer science, Garbage Collection (abbreviated GC) refers to an automatic storage management mechanism. When a program occupies a portion of memory that is no longer accessible by the program, the program returns the portion of memory to the operating system through a garbage collection algorithm. The garbage collector can lighten the programmer’s burden and reduce errors in the program.

GC is much older than Java, having its roots in Lisp, and in 1960 the big guys at MIT were working on automatic memory management and garbage collection techniques. Over the years, the GC mechanism in the JVM has become very robust and mature.

Determine the objects for garbage collection

Before you can recycle your garbage, you must first find the objects that need to be recycled as garbage.

The JVM is divided into five regions — program counters, virtual machine stack, local method stack, heap, and method area. We know that program counters and stacks are thread-private and their lifetime depends on the lifetime of the thread. So you don’t have to worry about recycling.

The heap and method areas are shared by threads, where object instances and variables are stored, and can be dynamically allocated. Therefore, based on the previous definition and the analysis here, it is the memory areas in the heap and method areas that cannot be accessed again that need to be garbage collected.

Heap of recycling

Now that you know the definition of an object for garbage collection, how do you determine if the current object cannot be accessed again, that is, the object is “dead”?

Reference Counting algorithm

The reference counting algorithm is relatively simple. After the object is instantiated, the initial value of the reference count of the object is 1, and the subsequent operations are carried out according to the reference and release of the object. When the counter of this instance object returns to zero, the object is judged to be “dead”.

Reference counting algorithm is relatively simple, relatively high judgment efficiency, many places have adopted reference counting algorithm, such as Microsoft’s Component Object Model (COM) technology, Python language, object-C language have adopted reference counting algorithm memory management mode.

Of course, reference counting algorithm also has its own disadvantages, such as references between objects. Here’s a little chestnut from the book:

public clss ReferenceCountingGC {
  public Object instance = null;
  private int size = 10;// It has no actual meaning
  
  public statice void testGC(a) {
    ReferenceCountingGC objA = new ReferenceCountingGC();
    ReferenceCountingGC objB = new ReferenceCountingGC();
    // Reference each other
    objA.instance = objB;
    objB.instance = objA;
    
    objA = null;
    objB = null; }}Copy the code

The instantiation counter of the object starts at 1, is referenced +1, and finally references are released -1. The last two objects have a counter value of 1. The objects are not dead, but they are not used anywhere else.

Reachability Analysis

Reachability analysis algorithm is the main decision algorithm used in C# and Java. The algorithm flow is as follows:

  1. Select GC Roots as the start node set
  2. Starting from the start node, search nodes down according to the object reference relationship to form a reference chain

Reconstruct the process from the perspective of data structure and algorithm:

An undirected graph is formed by taking instance objects as nodes and reference relations of objects as edges. Specific nodes are selected from the undirected graph as specific nodes, and nodes that have no association with these specific nodes (directly or indirectly accessible) are selected.

In fact, it is the traversal of undirected graph in graph theory. If we only consider it from the perspective of algorithm, we can consider constructing adjacency matrix or adjacency linked list and use DFS or BFS to deal with it, which will be introduced in the series of data structures and algorithms in the future.

As shown in the figure, the nodes searched from GC Roots are all alive objects, while the nodes not searched are judged as “dead” objects.

Now that we know the flow of the algorithm, we also need to know the premise, how do we pick a particular node which is the root object?

In Java, the book gives several fixed sources of GC Roots:

  • Objects referenced in the virtual machine stack — temporary variables, local variables, and so on
  • Objects referenced by class static properties in the method area — static variables, etc
  • Objects referenced by constants in the method area — pool of string constants, etc
  • Objects referenced by the Native method stack JNI (Native method)
  • Internal JVM references — base data types, system classloaders, common exception objects (OOM, etc.), etc
  • Synchronization lock (sychronized keyword) holds the object
  • Jmxbeans, native code caches, and so on that reflect what’s going on inside the Java virtual machine

In addition to the fixed GC Roots collection, some “temporary” objects are added to the collection to make up the GC Roots collection. I’ll talk more about that in the future.

Will unreachable objects be reclaimed as long as they are not associated with the GC Roots collection?

Is unreachable going to be recycled?

Just because it’s unreachable doesn’t mean it’s bound to be recycled. After the JVM determines that an object has nothing to do with GC Roots through the accessibility analysis algorithm, it will mark this object and then make a secondary determination, which is as follows: Is it necessary to implement Finalize () of objects? If objects do not override Finalize () or finalize() method has been called by virtual machine, then it can be regarded as “not necessary to execute”.

If the second judgment is “necessary to execute”, the object will be placed in the F-queue, and then a Finalizer thread automatically set up by the JVM with low scheduling priority will execute finalize() of the object in the Queue, and execute the machine-triggered method without waiting for the completion of the method execution.

Therefore, unreachable objects have a way to save themselves at this time, that is to rewrite finalize() function, but notice that Finalize () function can only enter once, that is, unreachable objects only have a way to save themselves once.

Method area recovery

Method area reclamation was mentioned in the previous article about the JVM memory area.

For reclaimed objects, method area reclamation can be divided into two categories: discarded constants in the constant pool and types that are no longer used.

Abandoned constants

The determination of discarded constants is very similar to the determination of objects in the heap, and will not be described again. Refer to reference counting in recycling.

A type that is no longer used

Similarly, “no longer in use” needs to be scoped.

  1. All instances of this type (including derived subclasses) have been reclaimed
  2. Class loaders of this type have been recycled (difficult to achieve)
  3. Its java.lang.Class object is not referenced

Types that meet the above three requirements are allowed to be garbage collected, but it is important to note that this is only “allowed”, not required.

Garbage collection

Generation collection theory

After determining the object of garbage collection, there needs to be a garbage collection manager, namely garbage collector. At present, most garbage collectors follow the generational collection theory, which is proposed based on two hypotheses:

  1. The Weak Generational Hypothesis: the overwhelming majority of subjects were born and died.
  2. The Strong generation Hypothesis: the more garbage collections an object makes it through, the less likely it is to die.

Based on the above two hypotheses, our garbage collector will divide the Java heap into different regions and divide the collection objects into different regions based on the number of times the objects have passed through the garbage collection, while the garbage collection will collect one or more regions when it is collected.

Although different garbage collectors divide regions differently, regions are roughly divided into Young Generaion and Old Generation. This division also leads to a problem — objects in the two regions have a reference relationship (called cross-generation references), which can have an impact on the collection of a region (live objects in the region must be queried again).

Therefore, the third hypothesis is introduced:

  1. Intergenerational Reference Hypothesis: Cross-generational references are rare compared to generational references.

With the third hypothesis, we can assume that when intergenerational references occur between Cenozoic and old age, the indeterminability of the old age will enable the referenced new age objects to survive and be promoted to old age as the number of garbage collections increases.

At the same time, the third hypothesis tells us that there will be very few intergenerational references, so we only need to establish a data structure to mark the memory region of intergenerational references in the old age, and add it to GC Roots for the determination of Cenozoic surviving objects.

Types of garbage collection

According to different areas of garbage collection, garbage collection can be divided into the following ways:

  1. Partial GC: we strained with Partial GC, which we passed in a way that didn’t target the entire Java heap
    • Minor GC: New Generation collection, targeting only new generations
    • Major GC: Old age collection, targeting old age only
    • Mixed GC: A Mixed collection, the goal is to collect the new generation and part of the old generation
  2. Full GC: Whole heap collection, targeting the entire Java heap and method area

Garbage collector

Garbage collector is not introduced here, and I will share it with you after further study.

Garbage collection algorithm

Garbage collection algorithms can be divided into two categories according to the method of determining garbage objects to be reclaimed, one is called ReferenceCounting GARBAGE collection and the other is called Tracing garbage collection. Since the JVM uses a reachability analysis algorithm for garbage determination, this section focuses on the latter collection algorithm.

Mark-sweep algorithm

The tag removal algorithm is the foundation of the garbage collection algorithm. The idea is relatively simple, literally “mark”, then “clear” :

  1. tag

    First, according to the accessibility analysis algorithm, reachability was determined and then marked as recyclable

  2. remove

    Clear areas marked as recyclable for recycling

The mark-clean process is simple and therefore has a number of disadvantages, which can be summarized in two main points:

  1. Low efficiency

    The process of marking and clearing mainly involves scanning objects: all objects are first scanned for marking, then scanned for marking and then cleared, so the efficiency decreases as the number of recyclable objects increases.

  2. Fragmented memory

    Memory fragmentation is conceivable. Recyclable objects are not necessarily stored in a contiguous segment of memory. Therefore, cleaning without any processing will cause a lot of fragmented space, similar to the memory management operation of an operating system.

Mark-copy algorithm

The mark-copy algorithm is optimized on the basis of the mark-clear algorithm.

In 1969, Fenichel proposed a garbage collection algorithm of “half region copy” based on mark-clean algorithm, whose algorithm flow (the marking process has been omitted, and the mark-tidy algorithm also omitted this process) is as follows:

  1. The memory area is split into two, with one half being used at a time and the other half reserved.
  2. When half of the region is used, reachable (live) objects in the used region are first allocated to the reserved region.
  3. Reclaim the reclaimable memory region in the fully used region, and then use the original reserved region.

Such mark-copy algorithms are friendly for situations where a large amount of memory needs to be reclaimed, and there is no need to worry about fragmentation after a memory copy operation. But corresponding also has its disadvantages.

  1. Large memory copy overhead is incurred in the process of memory copy move, and the efficiency decreases as the number of living objects increases.
  2. The actual available area of memory becomes half of the original area, wasting memory space

This is a valid idea for the new generation of garbage collection, but the ratio no longer needs to be 1:1. IBM’s first hypothesis for generational collection suggests that 98% of the new generation will not survive the first round of collection. On this basis, Andrew Appel proposed a more optimized algorithm in 1989.

Appel type recycling

The new generation is divided into three parts: 80% Eden space, 10% Survivor space (called from), and 10% Survivor space (called to).

The allocated memory is 80% Eden space and 10% Survivor space named from. When the allocated memory is used up, the JVM performs garbage collection as follows:

  1. Copy the Eden and FROM reachable objects to the TO space
  2. Clear all recoverable objects in the reclaim Eden and FROM Spaces
  3. The original from space becomes the next to space, and the original to space becomes the next from space.

What if the surviving object exceeds this 10% Survivor space? Not so fast, there’s the old age in the back pocket, and we’re gonna use some of it to store any extra survivors.

Mark-compact algorithm

Mark-copy algorithm is relatively effective for the new generation of garbage collection, but for the old generation, this algorithm is not suitable, because the old generation needs to consider the extreme case that all memory is occupied, so in 1974, Edwaed LueDers proposed another targeted algorithm, namely mark-collation algorithm. The algorithm flow is as follows:

  1. Move all marked reachable objects to one end of the memory region to form a reachable object boundary region
  2. Clears all objects except the boundary

In contrast to mark-clean, mark-clean is just one more operation to move reachable objects. This has the advantage of eliminating fragmentation from memory allocation, and has the disadvantage of making memory reclamation more complex.

conclusion

After learning, we know that garbage collection in JVM is mainly for Java heap, and this article also focuses on the analysis of Java heap garbage collection: garbage collection object determination (reference counting algorithm && reachability analysis and secondary determination), garbage collection algorithm for different regions and proposed for different regions. This process is still more complex, or we need to learn more ah ~

For garbage collection algorithms, I summarized a small table for you, hoping to help you quickly understand the advantages and disadvantages of the three algorithms.

Garbage collection algorithm advantages disadvantages
Mark-clear algorithm Simple algorithm design Low efficiency, easy memory fragmentation after reclamation
Mark-copy algorithm Relatively efficient for the new generation, without considering the memory fragmentation problem It wastes some memory space and has memory replication overhead
Mark-collation algorithm Relatively efficient for the old age, do not consider the memory fragmentation problem The design is complex and has the overhead of memory movement

The garbage collection process is the process of the JVM to reclaim memory, this for later troubleshooting memory problems (such as memory leakage, memory overflow, etc.) has a significant help, I hope you read after, can harvest.