Preface
Most programmers know “garbage collection” from the Java language, but the concept was invented around 1959 by John McCarthy, an American computer scientist, to simplify memory management in Lisp. Reference counting was first conceived for Lisp before 1960; mark-sweep was successfully applied in Lisp in 1960; and the copying algorithm was proposed by M. L. Minsky in 1963. Starting in the 1970s, people gradually realized that an ideal garbage collector should not be as inefficient as mark-sweep, should not sacrifice a large amount of memory like the copying algorithm, and should not interrupt the execution of the application for long periods. So between 1970 and 1985, researchers dug deep into the problem, and good garbage collection algorithms such as mark-compact, incremental collection, and generational collection emerged.
Main text
“Low latency and high availability” is the performance goal we constantly pursue in a high-frequency e-commerce business like Daily Youxian. For example, some interfaces on the Daily Youxian homepage must respond within 100ms, and availability must reach four nines. This requirement is not excessive: you never know how many users you will lose, or how much payment conversion will drop, because of an error or a delay. And because of STW (Stop the World) pauses, the likelihood of such problems is magnified (as shown below).
Why G1?
One problem with the example above is that we sometimes expect the system to be “soft real-time” during garbage collection. Garbage collectors before G1 (parallel GC, concurrent GC, incremental GC, and so on) kept trying to reduce maximum pause times, which could degrade throughput, and none of them could predict pause times the way G1 can. In a scenario like the one above, the application can easily stall for long periods during garbage collection, increasing interface response times and hurting the user experience. (Tip: some readers may say, “just let me tune the GC to meet the requirement.” But the e-commerce landscape is complex and business traffic is volatile and ever-changing; a configuration you believe is well tuned may fail after the next business iteration or any sudden online traffic anomaly!)
Soft real-time
So what does “soft real-time” mean? Let me first give an example of “hard real-time”: in a hospital's core system used for rescuing patients, even a 1ms delay can have very serious consequences, so our requirement for such a system is “hard real-time”, meaning real-time behavior must be strictly guaranteed. With that, “soft real-time” is easy to understand: you do not need to strictly guarantee every deadline; you may miss the deadline a few times, but the number of misses must stay within the user's tolerance. G1 achieves soft real-time through the following two means:
1. Users can set a custom pause-time target for the application, and G1 tries to keep GC processing time within that user-defined pause time
2. G1 predicts how long the next GC would pause the application; if the predicted time is too long, it splits up the GC target regions or postpones the GC to stay as close as possible to the user-defined pause time. (Tip: many readers may also have heard that G1's design is well suited to servers, efficiently exploiting multi-core hardware without compromising GC performance, which is one of its core features.)
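As a concrete illustration, the pause-time target from point 1 is set with standard HotSpot flags (the flag values and the class name below are only examples, not recommendations):

```
# enable G1 and ask it to aim for pauses of at most 100ms
java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -Xms4g -Xmx4g MyApp
```

Note that -XX:MaxGCPauseMillis is a goal, not a guarantee, which is exactly the soft real-time behavior described above.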
Analysis of G1 principle
Origin
The G1 garbage collector was introduced experimentally in Java 6 and officially supported in Java 7, and Oracle intended it to replace CMS. In Java 9, G1 became the default garbage collector.
Everything grows from simple beginnings toward real complexity. The content that follows uses OpenJDK 7 source code as its reference (some very basic concepts will not be re-explained here; see the earlier JVM articles on my public account for those). Today, we'll focus on the following aspects of G1:
1. Structure of G1GC heap
The first basic concept we need to know is that the heap structure in G1GC looks like this (below)
G1GC divides one contiguous heap into regions. The default minimum size of each region is 1MB, and regions are filled with objects; the regions not yet in use are called free regions.
From this code (HeapRegion::setup_heap_region_size in OpenJDK 7's heapRegion.cpp), we can see several parameters:
MIN_REGION_SIZE = 1MB (minimum region size)
MAX_REGION_SIZE = 32MB (maximum region size)
TARGET_REGION_NUMBER = 2048 (target number of regions in the heap)
Moving down to line 293: the minimum heap size (the -Xms value we set) is divided by TARGET_REGION_NUMBER (2048), and the larger of that quotient and MIN_REGION_SIZE (1MB) is taken. In other words, the heap is split into 2048 equal parts, and the region size is the larger of 1MB and that split size.
Why divide the -Xms heap size by 2048 rather than the -Xmx heap size? Consider a setting like -Xms128m -Xmx32g: dividing the maximum heap by 2048 would give 16MB regions, and while the heap stayed near 128m it would hold only eight regions, too few for G1 to work with. Then, at line 301, the region size is rounded to a power of two. Finally, line 304 ensures the region size is no smaller than the 1MB minimum and no larger than the 32MB maximum.
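The logic of those three steps can be sketched in Java (the real code is C++ in heapRegion.cpp; the constant names follow it, but the method name is mine):

```java
// Java sketch of HeapRegion::setup_heap_region_size (OpenJDK 7, heapRegion.cpp)
class RegionSizeSketch {
    static final long MIN_REGION_SIZE = 1L << 20;        // 1MB
    static final long MAX_REGION_SIZE = 32L << 20;       // 32MB
    static final long TARGET_REGION_NUMBER = 2048;

    // minHeapSize: the -Xms value in bytes
    static long regionSize(long minHeapSize) {
        // ~line 293: split the minimum heap into 2048 parts, floor at 1MB
        long size = Math.max(minHeapSize / TARGET_REGION_NUMBER, MIN_REGION_SIZE);
        // ~line 301: round down to a power of two
        size = Long.highestOneBit(size);
        // ~line 304: clamp into [1MB, 32MB]
        return Math.min(Math.max(size, MIN_REGION_SIZE), MAX_REGION_SIZE);
    }
}
```

So -Xms128m yields 1MB regions (128MB/2048 is below the floor), while -Xms8g yields 4MB regions.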
2. Concurrent marking
2.1. Initial marking – bitmap creation
I think most of you know what the initial mark does; even without understanding its meaning, interviews have probably forced you to memorize the concept. That's right: the initial marking process marks the objects directly reachable from GC Roots. But G1 does something before that! It creates two bitmaps (NextBitmap and PrevBitmap) for each region. What do they look like? (As shown below – paper diagram.)
Of course, this is the original figure from the paper, which is not easy to understand, so I drew an easier version myself (as shown below – hand-drawn diagram).
NextBitmap is the bitmap for the current marking, and PrevBitmap holds the result of the previous marking. We can see six objects in the G1 heap; assume their sizes are 16, 24, 8, 8, 24, and 16 bytes. Assume each bit in the bitmap covers 8 bytes, so in the figure the first object corresponds to two bits, the second to three bits, and the third to one bit. Here, bottom points to the start of the region and top points to the end of the allocated objects; nextTAMS records where top was when the current marking started, and prevTAMS records where top was when the previous marking started. If an object is alive, its corresponding bit is marked 1 (red); if it is dead, its bit stays 0 (blue). When an object spans several bits, only the bit at its starting position is marked. (The 24-byte object in the figure spans three bits; when it is alive, only the first of those bits is set.)
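The 8-bytes-per-bit mapping can be illustrated with a toy sketch (addresses are plain longs and the names are mine, not HotSpot code):

```java
import java.util.BitSet;

// Toy mark bitmap: one bit covers 8 bytes of heap; only the bit at an
// object's start address is set when the object is marked live.
class MarkBitmapSketch {
    private final BitSet bits = new BitSet();
    private final long bottom;                 // start address of the region

    MarkBitmapSketch(long bottom) { this.bottom = bottom; }

    private int bitIndex(long objAddr) { return (int) ((objAddr - bottom) >> 3); }

    void mark(long objAddr)        { bits.set(bitIndex(objAddr)); }   // object is live
    boolean isMarked(long objAddr) { return bits.get(bitIndex(objAddr)); }
    void clearAll()                { bits.clear(); }                  // fresh NextBitmap
}
```

Marking a 16-byte object at the region start sets only its first bit; the bit for its interior 8 bytes stays clear.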
Ok, let’s review the process again (see the paper diagram above)
Step A: For the first initial mark, a NextBitmap is created. Since no GC has run before, the PrevBitmap is empty.
Step B: The number of objects in the heap appears to grow, because new objects may be allocated during concurrent marking. That doesn't matter: as we said, NextTAMS records where top was when marking started, so you will find NextTAMS still in its original position. The marking process then fills in the bitmap, black for live objects and white for dead ones.
Step C: At the end of the GC, some cleanup operations are performed: the current NextBitmap and PrevBitmap are swapped, so NextBitmap becomes empty in preparation for the next GC, while PrevBitmap becomes the surviving result of this GC. PrevTAMS now points to where NextTAMS was at the start of this GC, and NextTAMS is reset to bottom.
Step D: A new GC cycle begins: the NextBitmap is created first, and NextTAMS is pointed at the position where marking starts.
Steps E and F: the above repeats; nothing to emphasize here. Note that the GC thread creates the bitmaps for each region concurrently with the application threads.
The GC age of an object is recorded in its object header (4 bits), while whether the object is alive or dead is recorded in the bitmap.
2.2. Initial marking – root scan
Simply put, the mutator threads (application threads) are suspended, and the GC thread scans the objects directly referenced from the roots — the red line in the figure — starting from GC Roots and marking red every object it can reach directly.
Could we let the mutator threads keep running and finish the root scan without suspending them?
No. G1 has a mechanism called write barriers that can sense changes to objects, but most GC Roots are not objects, so changes to them cannot be sensed, and GC Roots sometimes change frequently. The authors also judged that suspending the mutator threads gives better performance.
2.3. Concurrent marking
In one sentence: scan the objects marked during the initial mark and, at the same time, mark most of the live objects. (You may wonder: the first phase marked the roots, so shouldn't this second phase simply mark everything below the root nodes? Why do I say “most” live objects rather than “all”? Don't worry; the details will become clear when we see what the final mark does.) Let me use a diagram to briefly illustrate what concurrent marking looks like (see below).
To explain the diagram: in the first phase, the root-scan marking only flagged object 3 red and set its bit to 1 (red) on the corresponding NextBitmap. Now let's look at what concurrent marking does. First, because the GC thread and the mutator threads run concurrently, new objects appear during marking — objects 7 and 8 in the figure. Their references did not exist when marking began, so 7 and 8 have no corresponding bits in the figure; objects allocated during marking are all treated as “scanned and marked”, which is why they are drawn red. We then see that top has moved past the newest object, and that the children 1 and 5 referenced by 3 have had their bits marked red. But consider this: the GC thread and the mutator threads mark concurrently, so if 1 and 5 are no longer referenced by 3 partway through (whether they become referenced by other objects or by nothing at all), have we mismarked them? So what does G1 do? The answer is SATB (snapshot at the beginning), which records the reference relationships between objects as a snapshot taken at the start of concurrent marking.
Let’s take a look at how it is described in the paper (as shown below)
The general idea is shown below
In other words, during concurrent marking, whenever a reference relationship changes, the old reference must be recorded. The SATB local queue is a thread-local queue held by each mutator thread, so the enqueue operation has no contention, as shown in the figure. When application thread 1's dedicated SATB local queue fills up, it is added to the global SATB queue set. During concurrent marking, the GC thread periodically checks the global SATB queue set; if a queue is present, the objects in it are scanned and marked.
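The queue handoff just described can be sketched minimally (capacities and names are mine; HotSpot's SATB buffers are native structures):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy SATB sketch: the pre-write barrier records the OLD value of a
// reference field before it is overwritten, preserving the snapshot
// taken at the start of marking.
class SatbSketch {
    static final int LOCAL_CAPACITY = 4;                 // tiny for the demo
    static final List<Queue<Object>> GLOBAL_SET = new ArrayList<>();

    // each mutator thread holds its own local queue
    Queue<Object> local = new ArrayDeque<>();

    // called by the write barrier before obj.field = newRef
    void preWriteBarrier(Object oldRef) {
        if (oldRef == null) return;                      // nothing to snapshot
        local.add(oldRef);
        if (local.size() == LOCAL_CAPACITY) {            // full: hand to global set
            GLOBAL_SET.add(local);
            local = new ArrayDeque<>();
        }
    }
}
```

Once the local queue is full it moves to the global set, where the GC thread will find and scan it.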
2.4. Final marking
I wonder if you remember what I said above: the concurrent marking phase only marks most of the live objects. The reason is that some mutator threads' SATB local queues may never fill up, so they are never added to the global SATB queue set. What the final marking phase does, then, is scan these remaining SATB local queues.
From the figure we can see the difference before and after the final mark. First, mutator thread 2's SATB local queue was not full, so it was never added to the global SATB queue set. Its queue holds a reference to object 6, so after scanning mutator thread 2's SATB local queue, object 6 and its child object 4 are marked. Of course, the final marking process suspends the mutator threads, because the local queues belong to and are operated by the mutator threads.
2.5. Counting live objects
As the paper tells us, this step counts, for each region, the number of bytes of objects live in the NextBitmap and stores it in the region's next_marked_bytes.
After this step finishes, the picture looks like this:
Since objects 1, 3, 4, 5, and 6 are alive, the live bytes come to 16+8+8+24+16 = 72 bytes. The newly allocated objects 7 and 8 are not counted. prev_marked_bytes is 0 bytes, since this is the first GC.
2.6. Cleanup
This step just ties up loose ends. Let's look at the picture below.
The GC thread scans each region, moves nextTAMS to bottom and prevTAMS to top, resets next_marked_bytes to 0 and sets prev_marked_bytes to 72, and swaps NextBitmap into PrevBitmap. During this scan it also computes each region's transfer efficiency and sorts the regions by it in descending order. How is the transfer efficiency calculated? We'll talk about that later.
3. The transfer
When all the marking work is finished, what remains is to collect the garbage. What does that process look like? Live objects are moved to free regions, and the regions they previously occupied are reset. It looks something like this:
We can see that before the transfer, GC Root references 1, 1 references 2, and 2 references 4 — all three objects are alive; 5 references 6, but both 5 and 6 are dead. After the transfer, the surviving objects of regions 1 and 2 have been moved into region 3, and regions 1 and 2 are emptied and reset as free regions. Before explaining the transfer steps, let me raise a question: generational and regional garbage collectors always face cross-generation or cross-region references. When scanning one region, must the whole rest of the heap (or the whole old generation) be placed in the GC Roots scan range? What does G1 do to avoid this problem? The answer is the Remembered Set. So what is a Remembered Set?
It is implemented via a card table: every 512B of the heap corresponds to a 1B entry in the card table, and the card table is made up of these 1B cards one after another, as shown below.
This way we can quickly compute the index of the card corresponding to any object address in the heap. Each region has an associated Remembered Set, represented as a hash table, as shown in the figure below.
Object A in region A is referenced by objects B and D in region B, and also by object C in region C. The hash-table structure of region A's Remembered Set is shown below: the key is the address of the referencing region, and the value is the set of card indexes of the referencing objects. So a region's Remembered Set records incoming references at card granularity. In the marking phase we explained the SATB write barrier, whose main job is to record changes in reference relationships between objects; the Remembered Set, by contrast, records changes in reference relationships between regions. When an object's field is modified, how do we sense it and record it? The paper puts it this way: “Each thread has an associated remembered set log, a current buffer or sequence of modified cards. In addition, there is a global set of filled RS buffers.” That is, each mutator thread holds a remembered-set log. G1 has a corresponding concept: the transfer-dedicated write barrier. This write barrier detects when an object's field changes and adds the card index of that object to the remembered-set log — a design similar to SATB's. When a mutator thread's remembered-set log is full, it is added to the global remembered-set log set (figure below).
Let me briefly describe the process. When object A in region A is suddenly updated to reference object B in region B, the transfer-dedicated write barrier senses the write and adds the index of A's card (512 in the figure) to the mutator thread's remembered-set log. When that log is full, it is added to the global remembered-set log set, and the mutator thread is given a fresh log. Tip: two terms to add here. Dirty cards are cards that the transfer-dedicated write barrier has sensed and added to a mutator thread's remembered-set log; clean cards are cards not in any mutator thread's remembered-set log. Earlier we saw that, during concurrent marking, GC threads periodically check, scan, and mark from the global SATB queue set. So who consumes the global remembered-set log set? A dedicated remembered-set maintenance thread, which maintains each region's Remembered Set based on the global remembered-set logs. How does it do that?
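Before moving on, the 512B-card arithmetic and the barrier's dirty-card logging just described can be sketched like this (addresses are plain longs, and the cross-region check is simplified to a boolean — both are my assumptions, not HotSpot's actual layout):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy sketch of the card table and the transfer-dedicated (post-write)
// barrier: 512 bytes of heap per 1-byte card.
class CardTableSketch {
    static final int CARD_SHIFT = 9;               // 2^9 = 512B per card
    final long heapBase;
    final Queue<Long> rsLog = new ArrayDeque<>();  // this thread's remembered-set log

    CardTableSketch(long heapBase) { this.heapBase = heapBase; }

    long cardIndex(long addr) { return (addr - heapBase) >>> CARD_SHIFT; }

    // called after holder.field = someRef; if the reference crosses regions,
    // the holder's card index is logged as dirty
    void postWriteBarrier(long fieldHolderAddr, boolean crossRegion) {
        if (crossRegion) rsLog.add(cardIndex(fieldHolderAddr));
    }
}
```

An object sitting 512 cards into the heap (offset 512 × 512 bytes) lands on card index 512, matching the figure.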
The general idea in the paper is as follows
First, it fetches a mutator thread's remembered-set log from the global remembered-set log set; for each card index in the log, it marks the corresponding card clean again. Third, it scans the fields of all objects in the range covered by the card. Finally, for every object referenced by an object in that range (object A in region A referencing, say, object B in region B and object C in region C), A's card is added to the Remembered Set of the referenced object's region. If object A in region A is then modified again by the mutator, must its card be added to the remembered-set log again? Yes — and this is where the hot card concept comes in: frequently modified objects can put too much pressure on the remembered-set maintenance thread. The solution is simple: G1 keeps a hot card queue. When a card becomes dirty more than a certain number of times (the default is 4), it is put at the end of the hot card queue. If the hot card queue is full, a card is taken from the front and handed to the remembered-set maintenance thread as an ordinary card. Cards still in the hot card queue are not processed by the maintenance thread — they are destined to become dirty again frequently, so they are left to be handled during the transfer process itself.
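A toy version of that hot-card logic (threshold 4 as stated above; the queue capacity here is deliberately tiny, and all names are mine):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy hot card queue: cards dirtied more than HOT_THRESHOLD times are
// parked in a bounded queue and skipped by normal refinement; when the
// queue overflows, the oldest card drops back to normal handling.
class HotCardSketch {
    static final int HOT_THRESHOLD = 4;
    static final int QUEUE_CAPACITY = 2;           // tiny for the demo

    final Map<Long, Integer> dirtyCounts = new HashMap<>();
    final Deque<Long> hotQueue = new ArrayDeque<>();

    // returns the card to refine now, or null if it was parked as hot
    Long onDirty(long card) {
        int n = dirtyCounts.merge(card, 1, Integer::sum);
        if (n <= HOT_THRESHOLD) return card;       // still a normal card
        hotQueue.addLast(card);                    // park it as hot
        return hotQueue.size() > QUEUE_CAPACITY
                ? hotQueue.pollFirst()             // overflow: evict the oldest
                : null;
    }
}
```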
3.1. Selecting the collection set
During the cleanup phase we sorted regions in descending order of transfer efficiency; now we pick which regions to collect. We keep choosing regions until the predicted pause time reaches the boundary the user can tolerate; beyond that we must stop, otherwise the user's pause-time target would be exceeded. Generally speaking, a region containing more garbage has a higher transfer efficiency. (As shown below.)
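That greedy selection under a pause budget can be sketched as follows (the efficiency metric here — reclaimable bytes per predicted millisecond — is my simplification, not G1's exact cost model):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Greedy collection-set selection: take regions in descending transfer
// efficiency until the predicted pause budget is spent.
class CsetSketch {
    static class Region {
        final String name;
        final long reclaimableBytes;
        final double predictedMs;
        Region(String name, long reclaimableBytes, double predictedMs) {
            this.name = name;
            this.reclaimableBytes = reclaimableBytes;
            this.predictedMs = predictedMs;
        }
        double efficiency() { return reclaimableBytes / predictedMs; }
    }

    static List<Region> select(List<Region> regions, double pauseBudgetMs) {
        List<Region> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparingDouble(Region::efficiency).reversed());
        List<Region> cset = new ArrayList<>();
        double spentMs = 0;
        for (Region r : sorted) {
            if (spentMs + r.predictedMs > pauseBudgetMs) break; // budget used up
            cset.add(r);
            spentMs += r.predictedMs;
        }
        return cset;
    }
}
```

With a 10ms budget and three 5ms regions, only the two most efficient regions make it into the collection set.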
3.2. Root object + object transfer
Which objects does the root transfer move? Objects directly referenced by the roots, plus objects processed by concurrent marking, plus objects outside the collection set that reference objects inside it. The object transfer process goes as follows: first, object A in the collection set is transferred to a free region (as shown below).
After the transfer, the new address A* is written back to A's old location as a forwarding pointer (as shown below).
Then all of A*'s reference fields that point at objects inside the collection set (A*.field1, A*.field2, A*.field3 in the figure below) are added to the yellow transfer queue in the figure. You may be wondering: why not just put objects B, C, and D on the transfer queue directly, instead of A*.field1, A*.field2, A*.field3?
When object A references an object E outside this collection set, the corresponding card recorded in region E's Remembered Set must be updated, because A has moved.
When object A is referenced by an object outside this collection set, the same applies: because A has moved, the card of object F in region F must be added to the Remembered Set of A*'s new region.
Finally, objects popped from the transfer queue are transferred in the same way; when the queue becomes empty, the transfer is complete. At this point, all live objects within the collection set have been evacuated.
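The whole evacuation loop — copy, forwarding pointer, transfer queue — can be sketched like this (all names are mine; real G1 works on raw heap memory, not object graphs like this):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy evacuation sketch: copy an object, leave a forwarding pointer at
// the old location, push its in-collection-set children on the transfer
// queue, and drain the queue until empty.
class EvacSketch {
    static class Obj {
        final String name;
        final List<Obj> fields = new ArrayList<>();
        Obj forwardee;                       // non-null once evacuated
        boolean inCollectionSet = true;
        Obj(String name) { this.name = name; }
    }

    static Obj evacuate(Obj root, List<Obj> toSpace) {
        Deque<Obj> transferQueue = new ArrayDeque<>();
        Obj rootCopy = copy(root, toSpace, transferQueue);
        while (!transferQueue.isEmpty()) {   // drain until empty
            Obj c = transferQueue.poll();
            for (int i = 0; i < c.fields.size(); i++) {
                Obj child = c.fields.get(i);
                if (!child.inCollectionSet) continue;        // stays put
                if (child.forwardee == null) copy(child, toSpace, transferQueue);
                c.fields.set(i, child.forwardee);            // fix the reference
            }
        }
        return rootCopy;
    }

    private static Obj copy(Obj obj, List<Obj> toSpace, Deque<Obj> queue) {
        Obj c = new Obj(obj.name + "*");
        c.fields.addAll(obj.fields);
        c.inCollectionSet = false;
        obj.forwardee = c;                   // forwarding pointer at old address
        toSpace.add(c);
        queue.add(c);
        return c;
    }
}
```

A reference to an object outside the collection set is left untouched; only in-set children are copied and re-pointed.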
4. How does G1GC predict pause times?
G1 uses history to predict the next pause time. In the code this takes the form of a decaying average and decaying variance, which we can see in the way history samples are added.
The advantage of this algorithm is that it gradually reduces the influence of older values on the result. (I will not go into detail here; interested readers can look it up themselves.) The paper also gives a formula for predicting object transfer time, shown in the figure below.
Roughly, it means: pause time = the fixed cost of a transfer + the per-region transfer cost summed over all regions in the collection set + the cost of scanning the remaining dirty cards. Note that the transfer process does not have to wait until concurrent marking ends.
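The decaying-average history mentioned above can be sketched like this (the alpha value in the usage below is illustrative; HotSpot's sequence classes also track a decaying variance):

```java
// Decaying (exponentially weighted) average sketch: each new sample gets
// weight (1 - alpha), so older samples fade geometrically.
class DecayingAvg {
    private final double alpha;   // weight kept by the old average
    private double avg;
    private boolean seeded;

    DecayingAvg(double alpha) { this.alpha = alpha; }

    double add(double sample) {
        avg = seeded ? alpha * avg + (1 - alpha) * sample : sample;
        seeded = true;
        return avg;
    }

    double average() { return avg; }
}
```

With alpha = 0.7, samples 10 then 20 give 0.7 × 10 + 0.3 × 20 = 13: the recent value pulls the estimate, but the past still dominates.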
Afterword
I am no great scholar, so this article inevitably contains mistakes and omissions; please bear with me. Of course, pinning performance and optimization entirely on JVM tuning as a last line of defense is unreliable and irresponsible. A well-known saying circulates at Daily Youxian: “Write code with conscience.” If you cut corners in your code because the schedule is tight and the workload heavy, you are being irresponsible to yourself, to those who come after you, and to the company. So, did you write your code with conscience?
References
1. Garbage-First Garbage Collection (the original G1 paper)
2. The Garbage Collection Handbook
3. Oracle G1 documentation
4. Some key technologies of Java HotSpot G1 GC
5. G1 wiki
6. Algorithm and practice of JVM G1
7. Java performance optimization practice