“What will be hot and what to learn in 2022? This article is participating in the” Talk about 2022 Technology Trends “essay campaign.
Advanced GC optimization
Discussed in an article in the new technology to have what characteristic, and good performance is a new technology a measuring dimension of “good”, this article focuses on the performance of the memory when it comes to memory we might think of GC, more or less also can say some basic knowledge of GC, but many people had the optimization measures in this respect is not very understanding, So this article discusses some historical optimization of GC, which is also a technical step we are working on this week.
Most programmers only know how to determine garbage, which methods to collect garbage, and how to collect garbage, and that’s what interviewers are looking for. But we also want to know how much, so the author from the past to explain, the optimized plan for the GC technology involves some details and changes in the recycling process optimization points, learn from these optimized point not only for GC, also the optimization of the scheme in the project or to understand the deeper knowledge will also help
1.GC Roots traversal improves efficiency
Previous practice
When the garbage collector thread does GC, the first step is to find GC Roots; In the second step, GC Roots are used to traverse objects that reference GC Roots in the heap to form a reference chain. The third step is to mark the objects that are not in the reference chain (objects that need to be recycled), or mark the objects in the reference chain (objects that need to be copied and collated), depending on the generational memory in the heap and the garbage collection algorithm used.
The place that can be optimized and the principle of optimization
In the second step of the above process, objects that reference GC Roots in the heap are traversed, and the time required for this part will increase gradually as the heap memory becomes larger and larger. It would be more efficient to know in advance which part of the heap is a reference to determine whether to refer to GC Roots.
Yes, the JVM has used Exact memory management since the Exact VM, that is, knowing which parts of memory are references; I also know what part of the stack or register is referenced during just-in-time compilation. At this point, I use a data structure to store the information. In step 2, I don’t need to traverse the entire heap, only the places where there is no memory reference (that is, the information that was not stored in the data structure).
Using the OopMap data structure to store this information in HotSpot can greatly improve the efficiency of GC Roots traversal, but where to put this information?
2. Improved the traversal efficiency of GC Roots but don’t know how to install them?
I mentioned earlier that traversal efficiency can be improved by using an OopMap data structure, but the contents of the OopMap data are different in different places (for example, the contents of my local variable list may be different in each method), so I put an OopMap near each instruction.
Wait, that’s a waste of memory.
Yeah, which is why we have to get it in the right place first! Well, let’s see: this data structure is there to optimize the efficiency of the second GC step, that is, only put the data during GC. The idea is there, but when does GC happen? I am not sure when GC will occur, but I am sure that it must STW as it traverses memory in the heap [otherwise, if the heap reference changes during marking, the marking result will be wrong] (explained in 2.1). I’m specifying that STW should only be executed somewhere in the code so that I can do this indirectly.
That is, when GC occurs, STW is only performed at a certain point, and THEN I put an OopMap data structure near that point to speed up the second step.
The name of this place is actually “safePoint”, as the name implies, and STW garbage collection can only be done if code is executed near the safePoint, as long as OopMap is placed near the safePoint.
2.1 Why is STW needed?
Mentioned above:
Otherwise, if the reference in the heap changes during the marking process, the marking result will be wrong.
One, three color marking method
Let’s use the tricolor notation to explain what would happen if there was no STW: first, the tricolor notation:
Second, there is no STW
Third, solutions
The above exception must satisfy both conditions:
1. Grey objects do not reference white objects
2. Black read/write references white objects
Therefore, it only takes one of these conditions not to be met, so two solutions emerge:
1. Incremental update: In this scheme, the second condition is not met, that is, when the black object references the white object, the black object is saved, and when the scanning is finished, the black object is taken out again for scanning. It can be simply understood as if the black object references the hundred objects, it will be marked as gray.
2. Original snapshot: When a gray object deletes a reference to a white object, the gray object is recorded. After the scan is complete, the gray object is scanned as the root.
CMS garbage collector uses incremental updates for concurrent marking, G1,Shenandoah uses raw snapshots
3. Where should I put safePoint?
SafePoint is explained above, but where do I put safePoint? Too much can cause GC collections to increase run-time memory pressure too frequently, and too little can cause the garbage collector to wait too long because the heap is not being recollected in a timely manner due to the increasing use of memory in the heap.
In this way, I define a rule that I will only install safePoint if there is an instruction characteristic that will keep the program running for a long time, but this characteristic has no specific definition of “long”, but has the meaning of “instruction sequence reuse”. For example, method calls, loop adjustments, exception jumps, etc., safePoint is only placed near these instructions.
SafePoint location has been chosen, but that was the last question
How do I quickly run to safePoint for STW when GC occurs? And how can I achieve this STW?
4. How to implement STW?
First of all, explain why it is called STW, which stands for “Stop the Word”, because the reference relationship in the memory cannot be changed during GC Roots traversing the memory in the heap. Therefore, all user thread operations need to be suspended to ensure that the reference chain formed by GC Roots is correct, that is, the marking process will not go wrong later.
There are two ways to suspend all threads: first, preemptive interrupt:
The system interrupts all user threads while the garbage collector is collecting. When a thread is found that is not near safePoint, let it resume running until it is near safePoint. This approach few virtual machines now respond to GC in this way.
Second, active interrupt:
I don’t directly to my user thread operation, when the GC occurs, I will give the user a thread to set up a sign, when user thread execution constantly polling the sign bit, if polling to run so I will I break my own, because this way is polling to immediately to hang so polling places and safePoint overlap.
To optimize the
“Constantly polling flag bits” sounds very time-consuming, ha ha, so how to optimize the virtual machine? And then I’m going to do the polling and I’m going to suspend myself and how does that work?
Hold on, I’m not going to go into the next question, I’m going to go through it:
Polling sign this operation is actually a assembly instruction, the meaning of the assembly instruction is when I need interrupt threads polling to sign a: I can set one of the pages to unreadable, this will lead to an exception to the signal, after receiving in the exception handler active interrupt operation.
5. A “Bug”: What if threads don’t execute?
As mentioned above, virtual machines almost use active interrupt to interrupt threads at present, and its implementation is to interrupt threads in the exception processing table by constantly polling flag bits to generate self-trapping exception signals during thread execution.
Did you notice a small bug: what if my polling operation never gets executed? How do I put the virtual machine into garbage collection at this time?
It is not always necessary to interrupt the thread, but remember why STW is needed: if the user thread is still executing at this time, the reference relationship in memory may change. If a thread didn’t get the CPU time slice to perform (in Java thread corresponding to the operating system thread, corresponding relationship can also find the author before to consult about SignCatcher understanding of thread), but I can ensure that part of the code area will not change memory references, so also can run through these threads.
Importing the Safe Region solution
“Safe zone: This part of the code does not change the reference relationship in memory,” so the virtual machine does not care about the thread once it enters the safe zone. When a thread leaves the safe zone, it cannot leave if the chain of references has not formed at this point (i.e. traversing the heap with GC Roots) and waits until it receives a signal that the chain of references is formed (or completes the phase where the garbage collector needs to suspend the user thread).
6. Will GC Roots increase with longer running time?
Introduction to Basic Knowledge
Different GC names are identified based on different regions in the heap (generational design) and reclaimed memory space: Partial reclaimed: Minor GC, MajorGC,….. Full memory collection: Full GC
If there are “cross-generation references” (typically old age objects referring to young age objects), such as traversing only ordinary GC Roots during a Minor GC, the result is not accurate (some objects are not GC Roots themselves but become old age objects as they experience more GC), If the object of the younger generation of this reference is marked as garbage removal, the object in the old age will have a problem, so the process of forming the reference chain needs to traverse the whole old age to ensure the accuracy of the result.
CPU cache line technology and pseudo-sharing solution
The memory set
Cross-domain can be understood as cross-memory access or access to memory within another generation
Going through the ages sounds time-consuming, and it is. So we can introduce the concept that if you reference an object in another memory then I will store you in a data structure in another memory, and then when the other memory is reclaimed, you just need to add the object that was added to the data structure to GC Roots.
We optimize: each different generations with an array, the array to a map of heap memory, I each small corresponding elements in the array is fixed in a generational size of memory (such as I said I refer to the first array subscripts is 0 to 100, the second array subscript refers to the 100-200 and so on). When the memory in my first array subscript refers to memory in other generations across domains, I mark the element value of the memory in my first array subscript as 1 for Dirty, and 0 for none. When garbage collection is done, I know which parts of memory are cross-generation references and add them to GC Roots for scanning (add the memory object corresponding to element 1 in the array to GC Roots).
According to the memory size precision of my mapping, I can subdivide it:
1. Word length accuracy: Records only one machine word length (the addressing bits of the processor) that contains cross-generation Pointers
2. Object precision: Record an object (the object field contains cross-generation Pointers)
3. Card precision: Record a memory region (this region has objects containing cross-generation Pointers)
The most commonly used precision
The memory set with “card precision” is realized by the data structure “card table”.
Using precision for the card, the implementation of memory set, also known as the card table, is actually an array of bytes in the table structure, each part of the elements in the array is corresponding to the size specified memory block, this section of memory called the card page, page when the card in the memory block in reference to other one or more objects in the block of memory, will be the element value into a card page. The dirty data is changed into one, and this part of memory is added to GC roots during collection. It goes something like this:
Possible problems
First, when to update the card table? First look at my picture ha ha, the word is not pretty, but the general meaning is similar.
I update the card table in the post-write barrier to ensure that my card table records are correct.
Two, above the “false sharing problem caused by” just about CPU cache line technology, is simply if two threads in the two independent variables in the same cache line, so whether which thread modification, another thread needs to be read from the main memory, and set the cache line is in order to speed up the reading efficiency, so it will reduce the efficiency.
If a cross-generation reference occurs in the memory corresponding to the card page, the card table is updated. The “pseudo-sharing” mentioned above can also occur here and affect performance, for example: a cache line of 64 bytes; An element in a card table is a byte, and a card page corresponding to each element stores 512 bytes. That is, 64 elements in a card table are stored in a cache line, and the total memory for these 64 elements is 32KB (64 X 512 bytes). If variables in two threads are allocated to this memory, A variable that later updates the card table element with a cross-generation reference will cause another thread’s cache row to be invalidated and fetched from main memory. Therefore, the operation of updating the card table should be reduced. If the dirty data is already updated, there is no need to update the card table.
This is the end of GC, but there is more to optimization than that. The more you look at this knowledge, the more addictive it becomes and the more you feel like you are learning nothing. Interested partners can continue to expand, can also be the above mentioned optimization measures for in-depth understanding, the next article will continue to explore personal technical pain points ~