Today, more solid content. This is the first in a series of hardcore JVM deep dives, starting with TLAB. Since the article is very long and everyone has different reading habits, it is published both as a single complete edition and as separate parts:

  • The most hardcore TLAB analysis on the web (single complete edition, no extras)
  • The most hardcore TLAB analysis on the web 1. Introduction to memory allocation ideas
  • The most hardcore TLAB analysis on the web 2. TLAB lifecycle and open questions
  • The most hardcore TLAB analysis on the web 3. The JVM EMA expectation algorithm and TLAB-related JVM startup parameters
  • The most hardcore TLAB analysis on the web 4. Complete analysis of the TLAB basic process
  • The most hardcore TLAB analysis on the web 5. Full TLAB source code analysis
  • The most hardcore TLAB analysis on the web 6. TLAB hot Q&A summary
  • The most hardcore TLAB analysis on the web 7. TLAB-related JVM log parsing
  • The most hardcore TLAB analysis on the web 8. Monitoring TLAB through JFR

1. Pre-viewing reminders

This content is quite hardcore and very comprehensive, covering everything from design ideas to implementation principles and source code, along with the corresponding logs and monitoring methods. If anything is unclear or you have questions, feel free to leave a comment.

The design ideas discussed here are mainly my personal understanding, and the implementation principles and source code analysis are my own organization of the material. If anything is inaccurate, corrections are very welcome! Thanks in advance ~~

2. Allocate memory

We often new an object, and that object takes up space. The space occupied the first time we new an object is shown in Figure 00.

Here we focus on the storage inside the heap; storage in the metaspace will be discussed in more detail in another series. The heap storage of an object includes the object header, the object body, and memory alignment padding. How is this space allocated?

First, the memory an object requires can be calculated before allocation, once the object's class has been parsed and loaded into the metaspace. Suppose we were designing heap allocation ourselves: the simplest approach is linear allocation, also known as bump-the-pointer.

Each time memory needs to be allocated, the required size is calculated and the allocation pointer shown in Figure 01 is advanced with a CAS, marking that memory as allocated. But memory is usually not this tidy: some memory has been allocated while other memory has been freed and reclaimed, so allocation usually cannot rely on bump-the-pointer alone. One idea is to add a FreeList on top of bump-the-pointer allocation.

A simple implementation adds the memory of freed objects to a FreeList. The next time an object is allocated, a block of suitable size is first looked up in the FreeList; only if none is found does allocation fall back to bump-the-pointer in main memory.

Although this solves the problem to some extent, most applications today are multi-threaded, so memory allocation is also multi-threaded. Allocating from main memory with CAS leads to frequent retries under contention and low efficiency. In current applications, different thread pools are generally separated by business, and in that case the memory allocation per thread is relatively stable. By stable I mean the size of the objects allocated each time, the number of objects allocated within each GC interval, and their total size. So we can consider reserving, after each thread allocation, a chunk of memory for that thread's next allocations, so that it does not have to go to main memory every time. If we can estimate how much memory each thread uses per GC round, we can even allocate that memory to the thread in advance to improve allocation efficiency. In the JVM, this allocation mechanism is the TLAB (Thread Local Allocation Buffer).
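The bump-the-pointer idea described above can be sketched in a few lines. This is an illustrative model, not HotSpot code: byte offsets stand in for heap addresses, and an AtomicLong plays the role of the shared allocation pointer that threads advance with CAS.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal bump-the-pointer allocator sketch: offsets stand in for real heap addresses.
class BumpRegion {
    private final long end;       // end offset of the region
    private final AtomicLong top; // next free offset (the "allocation pointer")

    BumpRegion(long start, long end) {
        this.end = end;
        this.top = new AtomicLong(start);
    }

    // Returns the start offset of the allocated block, or -1 if the region is exhausted.
    long allocate(long size) {
        while (true) {
            long oldTop = top.get();
            long newTop = oldTop + size;
            if (newTop > end) {
                return -1; // not enough room: a real JVM would fall back to FreeList / GC
            }
            // CAS advances the pointer; on contention another thread won, so retry.
            if (top.compareAndSet(oldTop, newTop)) {
                return oldTop;
            }
        }
    }
}
```

Under contention, the losing thread simply retries with the new pointer value; TLAB exists precisely to keep most allocations off this shared CAS.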

3. Brief description of JVM object heap memory allocation process

We are not considering stack allocation here (that will be examined in detail in the JIT series); we only look at objects that cannot be allocated on the stack and need to be shared.

In the HotSpot JVM, every GC algorithm is implemented as heap memory management, that is, each implements a heap abstraction via the CollectedHeap interface. When allocating heap memory for an object, CollectedHeap first checks whether TLAB is enabled; if so, TLAB allocation is attempted. If the current thread's TLAB has enough room, the object is allocated from the thread's current TLAB. If not, but the TLAB's remaining space is less than the maximum wasted space limit (a dynamic value we will examine in detail later), a new TLAB is requested from the heap (usually from Eden). Otherwise, the object is allocated directly outside the TLAB. Different GC algorithms have different allocation strategies outside the TLAB. Taking G1 as an example:

  • If the object is a humongous object (larger than half the Region size), it is allocated directly in a Humongous Region (a contiguous run of old-generation Regions).
  • Otherwise it is allocated within the Region at the current allocation index, according to the Mutator state.

4. Life cycle of TLAB

TLAB is thread-private. A TLAB is created and initialized when a thread initializes, and it is also created and initialized the first time a thread tries to allocate an object after a GC scan has occurred. The TLAB lifecycle ends (which does not mean the memory is reclaimed, only that the TLAB is no longer privately managed by the thread) when:

  • The current TLAB does not have enough room for an allocation and its remaining space is less than the maximum wasted space limit, so the TLAB is returned to Eden and a new one is requested
  • When GC occurs, TLAB is reclaimed.

5. The problems TLAB solves, the problems it introduces, and their solutions

The obvious problem TLAB solves is avoiding the frequent lock (CAS) contention caused by allocating memory directly from the heap.

After the introduction of TLAB, there are many problems worth considering in the design of TLAB.

5.1. TLAB introduces memory holes, which may affect GC scanning performance

1. When holes appear:

  • If the current TLAB does not have enough room for an allocation and its remaining space is less than the maximum wasted space limit, the TLAB is returned to Eden and a new TLAB is requested. The remaining space becomes a hole.
  • When GC occurs, the TLAB has not been used up, and its unallocated memory also becomes a hole.

If these holes were simply ignored, then since only the owning thread knows which parts of its TLAB are allocated, when GC scanning occurs after the TLAB is returned to Eden, the outside world would not know which parts are used and which are not, and would need extra checks, hurting GC scanning efficiency. So when a TLAB is returned to Eden, its remaining usable space is filled with a dummy object. Since a dummy object is an object already confirmed to be reclaimable, GC can simply mark it and skip that memory, improving scanning efficiency. But because a dummy object must be fillable, the TLAB must reserve space for this object's header.
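To make the header reservation concrete, here is a small sketch of the arithmetic. The 16-byte int[] header and 4-byte elements are illustrative assumptions for a typical 64-bit HotSpot-like layout, not values read from a real JVM:

```java
// Sketch: how many int elements a dummy int[] needs to fill a gap of gapBytes.
// Assumes a 16-byte int[] header (illustrative for a 64-bit HotSpot-like layout).
class DummyFill {
    static final int ARRAY_HEADER_BYTES = 16; // assumed int[] header size

    static int dummyLength(int gapBytes) {
        if (gapBytes < ARRAY_HEADER_BYTES) {
            // This is exactly why the TLAB always reserves space for a dummy header:
            // a gap smaller than the header could not be filled at all.
            throw new IllegalArgumentException("gap smaller than a dummy header");
        }
        return (gapBytes - ARRAY_HEADER_BYTES) / Integer.BYTES;
    }
}
```

A 16-byte gap is filled by a zero-length int[]; a 48-byte gap needs 8 elements.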

5.2. The memory a thread allocates per GC round is not stable

If we knew in advance how much memory each thread will allocate this round, we could allocate it up front. But that is wishful thinking. Each thread may allocate differently in each GC round:

  • Different thread business scenarios lead to different allocation object sizes. Generally, thread pools are separated by business. For user requests, the objects allocated each time may be small. For background analysis requests, the objects allocated each time are relatively large.
  • Thread pressure is uneven over time. There are peaks and valleys of business. More objects are allocated during peak hours.
  • The service pressure of threads in the same thread pool at the same time may not be very even. It is likely that only a few threads will be busy and others will be idle.

Therefore, considering the situations above, TLAB should be designed as follows:

  • Instead of requesting one large TLAB per thread up front, request a new one each time the thread's TLAB fills up, which is more flexible.
  • The size of each TLAB request varies rather than being fixed.
  • The size of each TLAB request should take into account the expected number of threads allocating objects in the current GC round.
  • The size of each TLAB request should take into account the number of times threads are expected to request a new TLAB after filling one in the current GC round.

6. The JVM's expectation algorithm: EMA

In the TLAB sizing discussion above, we kept mentioning expectations. An expectation is calculated from historical data: each time a sample value is input, the latest expected value is derived from the historical samples. This kind of expectation computation is used not only in TLAB but also in JVM mechanisms such as GC and JIT. Here we look at EMA (Exponential Moving Average), the algorithm used for this in TLAB:

The core of the EMA algorithm is setting an appropriate minimum weight. Consider a scenario: first 100 samples of value 100 (the first 100 samples exist to smooth out startup noise in the algorithm, so we ignore them here), then 50 samples of value 2, and finally 50 samples of value 200. Observe the resulting curves for different minimum weights.

As can be seen, the larger the minimum weight, the faster the expectation changes and the less it is affected by historical data. Setting the right minimum weight for your application makes the expectation track reality better.

The corresponding source code is the AdaptiveWeightedAverage class in gcUtil.hpp.
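A minimal sketch of such an EMA, loosely modeled on AdaptiveWeightedAverage: for roughly the first 100 samples the effective weight is boosted by the sample count so the average converges quickly, after which the configured minimum weight takes over. The class and method names here are my own; the real HotSpot code differs in detail.

```java
// EMA with a minimum weight, loosely modeled on HotSpot's AdaptiveWeightedAverage.
class WeightedEma {
    private final int minWeight; // percentage 0..100, e.g. TLABAllocationWeight = 35
    private float average = 0.0f;
    private int count = 0;

    WeightedEma(int minWeight) { this.minWeight = minWeight; }

    // Feed one sample and return the updated expectation.
    float sample(float value) {
        count++;
        // Boost the weight for early samples so the average warms up quickly.
        int countWeight = count < 100 ? 100 / count : 0;
        int w = Math.max(minWeight, countWeight); // effective weight in percent
        average = ((100 - w) * average + w * value) / 100.0f;
        return average;
    }

    float average() { return average; }
}
```

With a larger minimum weight the average tracks recent samples more aggressively, which matches the curves described above.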

7. JVM parameters related to TLAB

The parameters are just listed here with brief introductions; the detailed analysis later will help you understand each one. The following parameters and default values are based on OpenJDK 17.

7.1. TLABStats (deprecated)

Deprecated since Java 12, with no logic attached to it anymore. It was previously used to collect TLAB statistics for better TLAB scaling, but the performance cost was relatively high; today the scaling is computed mainly through EMA.

7.2. UseTLAB

Description: whether to enable TLAB; enabled by default.

Default: true

For example, to disable it: -XX:-UseTLAB

7.3. ZeroTLAB

Description: whether to zero out all bytes of a newly created TLAB. When we create an object, its fields have default values (boolean is false, int is 0, and so on), which is implemented by writing 0 to the allocated memory. Setting ZeroTLAB to true means the zeroing happens when the TLAB is requested; otherwise it happens when each object is allocated and initialized. Since TLAB allocation uses Allocation Prefetch to optimize for the CPU cache, zeroing the memory immediately after the TLAB is allocated should be more cache-friendly. Moreover, when a TLAB is not fully used, the dummy object filling it is a zeroed int array anyway, so enabling this should in theory be better. Nevertheless, ZeroTLAB is disabled by default.

Default: false

For example: -XX:+ZeroTLAB

7.4. ResizeTLAB

Description: whether the TLAB size is adjustable; yes by default, meaning the expected TLAB size is recalculated from the EMA of the thread's historical allocation data, and TLABs are requested at that size.

Default: true

For example, to disable it: -XX:-ResizeTLAB

7.5. TLABSize

Description: initial TLAB size, in bytes.

Default: 0. Zero does not mean an initial size of 0; it means the JVM computes the initial size per thread itself.

For example: -XX:TLABSize=65536

7.6. MinTLABSize

Description: minimum TLAB size, in bytes.

Default: 2048

For example: -XX:MinTLABSize=4096

7.7. TLABAllocationWeight

Description: the initial TLAB size calculation depends on the number of threads, but threads are created and destroyed dynamically, so the TLAB size must be computed by predicting the future thread count from the historical thread count. As usual for such predictions, the JVM uses an EMA. This parameter is the minimum weight shown in Figure 06: the higher the weight, the greater the influence of recent data. TLAB resizing based on the allocation ratio also uses the EMA algorithm, with TLABAllocationWeight as its minimum weight.

Default: 35

For example: -XX:TLABAllocationWeight=70

7.8. TLABWasteTargetPercent

Description: the TLAB size calculation involves the Eden size and the waste ratio. TLAB waste is the space of the old TLAB left unallocated when a new TLAB is requested, as described above. This parameter is the percentage of Eden that TLAB waste is allowed to occupy. Its role will be explained in detail in the principles section below.

Default: 1

Example: -XX:TLABWasteTargetPercent=10

7.9. TLABRefillWasteFraction

Description: a parameter of the initial maximum wasted space limit: initial maximum wasted space limit = current expected TLAB size / TLABRefillWasteFraction.

Default: 64

Example: -XX:TLABRefillWasteFraction=32

7.10. TLABWasteIncrement

Description: the maximum wasted space limit is not constant; it increases whenever a slow TLAB allocation occurs (i.e., the current TLAB does not have enough room and the object is allocated outside it). This parameter is the increment added to the allowed TLAB waste on each such slow allocation. Its unit is not bytes but MarkWords (heap words), the minimum unit of memory in the Java heap; on a 64-bit virtual machine, one word is 8 bytes.

Default: 4

Example: -XX:TLABWasteIncrement=4

8. TLAB basic process

8.0. How to design the TLAB size per thread

Earlier we described the problems TLAB must solve and the resulting design constraints. Based on those, we can design TLAB as follows.

First, the initial TLAB size should be related to the number of threads allocating objects per GC round. However, that number is not necessarily stable: it may be high during one period and lower during the next. Therefore, an EMA is needed that samples, at each GC, the number of threads that allocated objects, in order to compute the expected thread count.

Next, the best case is that within each GC round, all memory used for object allocation lies inside the TLAB of the corresponding thread. In JVM design terms, the memory available for object allocation per GC round is essentially the Eden size. Ideally, GC happens only when Eden is full and is never triggered for other reasons; that is the most efficient case. If Eden is used up and all allocation happened inside TLABs, then Eden was entirely occupied by the threads' TLABs, which is the fastest allocation.

However, the number of threads and the amount of memory allocated per GC round are uncertain. Handing out a big chunk at once causes waste; handing out chunks that are too small causes frequent TLAB requests from Eden, reducing efficiency. This size is difficult to control directly, but we can limit the maximum number of TLAB requests a thread makes from Eden within one GC round, giving users better control.

Finally, since the amount of memory each thread allocates per GC round is not stable, using only the initial size to guide subsequent TLAB sizes is clearly not enough. Put another way, the memory a thread allocates is correlated with its history, so we can infer from past allocations: each thread also uses an EMA that samples the memory it allocated each round, to guide the next expected TLAB size.

To sum up, we arrive at approximate TLAB sizing formulas:

Initial TLAB size per thread = Eden size / (maximum number of TLAB requests from Eden per GC round * expected number of allocating threads (EMA))

Recomputed TLAB size after GC = Eden size * this thread's allocation ratio (EMA) / maximum number of TLAB requests from Eden per GC round
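These sizing ideas can be tried with concrete, made-up numbers. With the default TLABWasteTargetPercent of 1 the refill count works out to 50 (as derived in the source analysis later); the Eden size, thread count, and allocation fraction below are purely illustrative:

```java
// Illustrative TLAB sizing arithmetic; all sizes in bytes, numbers are made up.
class TlabSizing {
    // Initial size: Eden / (refills per thread per round * expected allocating threads)
    static long initialSize(long edenBytes, long targetRefills, long expectedThreads) {
        return edenBytes / (targetRefills * expectedThreads);
    }

    // After GC: Eden * this thread's allocation-fraction EMA / refills per round
    static long resizedSize(long edenBytes, double allocationFractionEma, long targetRefills) {
        return (long) (edenBytes * allocationFractionEma / targetRefills);
    }
}
```

For a 64 MiB Eden, 50 refills, and 16 expected threads, the initial TLAB is about 82 KiB; a thread whose allocation-fraction EMA is 1/16 keeps roughly the same size after GC.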

Next, let’s analyze each process of the TLAB lifecycle in detail.

8.1. TLAB initialization

When a thread initializes, if the JVM has TLAB enabled (it is enabled by default and can be turned off with -XX:-UseTLAB), then the TLAB is initialized and TLAB memory of the desired size is requested when object allocation occurs. Likewise, TLAB memory is re-requested the first time the thread tries to allocate an object after a GC scan has occurred. Let's focus on initialization first; the process is shown in Figure 08:

The TLAB initial expected size is calculated during initialization. This involves the limits on TLAB size:

  • The minimum TLAB size: specified by MinTLABSize.
  • The maximum TLAB size varies by GC. In G1 GC it is the humongous object threshold, i.e., half the G1 region size: as mentioned at the beginning, in G1 large objects cannot be allocated inside a TLAB and go straight to the old generation. In ZGC it is 1/8 of the page size, and similarly in Shenandoah GC it is 1/8 of each region size in most cases; both expect at least 7/8 of a region to need no evacuation, to reduce the scanning cost of selecting CSets. For other GCs it is the maximum size of an int array, which is related to filling with dummy objects as mentioned earlier; the details are covered later.

In the subsequent process, any computed TLAB size is always clamped between the TLAB minimum size and the TLAB maximum size; we will not restate this restriction each time, to avoid being verbose ~~~! Any computed TLAB size is always clamped between the minimum and the maximum! Any computed TLAB size is always clamped between the minimum and the maximum! Important things are said three times.

The desired size of the TLAB is calculated during initialization, and it must be recalculated after the TLAB is reclaimed by GC and similar operations. The TLAB uses this desired size as the baseline for each TLAB allocation request.

8.1.1. TLAB Initial expected size calculation

As shown in Figure 08, if TLABSize is specified, this size is used as the initial expected size. If not specified, the following formula is used:

Total heap space available for TLAB / (expected number of currently effectively allocating threads * configured refill count)

  1. Total heap space available for TLAB: how much heap space can be handed out as TLABs. It varies by GC algorithm, but most GC implementations use the Eden size, for example:
    1. The traditional, deprecated Parallel Scavenge uses the Eden size. Reference: parallelScavengeHeap.cpp
    2. The default G1 GC uses (number of YoungList regions minus number of Survivor regions) * region size, which is effectively the Eden size. Reference: g1CollectedHeap.cpp
    3. ZGC uses the remaining Page space; Pages are similar to Eden in that most objects are allocated there. Reference: zHeap.cpp
    4. Shenandoah GC uses the FreeSet size, also similar in concept to Eden. Reference: shenandoahHeap.cpp
  2. Expected number of currently effectively allocating threads: a global EMA, the expectation computation described earlier, whose minimum weight is TLABAllocationWeight. The effectively-allocating-thread-count EMA is sampled when a thread makes its first effective object allocation, and this value is read during TLAB initialization to compute the expected TLAB size.
  3. TLAB refill count: computed from TLABWasteTargetPercent, whose meaning is to limit the maximum wasted space. Why the refill count is related to it is discussed later.
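To preview why the refill count follows from TLABWasteTargetPercent: HotSpot assumes that on average half of a thread's latest TLAB is wasted, so waste percent ≈ (1/2) × (1 / refills) × 100, giving refills = 50 / TLABWasteTargetPercent. A one-line sketch of that computation (the clamp to at least 2 matches the startup source shown later):

```java
// Refill count from the waste target, mirroring
// _target_refills = 100 / (2 * TLABWasteTargetPercent), clamped to at least 2.
class RefillTarget {
    static long targetRefills(long tlabWasteTargetPercent) {
        return Math.max(100 / (2 * tlabWasteTargetPercent), 2);
    }
}
```

With the default TLABWasteTargetPercent of 1, each thread is expected to refill its TLAB 50 times per GC round.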

8.1.2. TLAB initial allocation ratio calculation

As shown in Figure 08, the TLAB initial allocation ratio is then calculated.

Thread-private allocation ratio (EMA): whereas the number of effectively allocating threads describes globally how much TLAB space each thread should occupy, the allocation ratio (EMA) is thread-private and dynamically controls the share of the total TLAB space the current thread should occupy.

At initialization, the allocation ratio equals 1 / the current number of effectively allocating threads. Substituting the formula in Figure 08 into the earlier formula for the expected TLAB size indeed yields 1 / the current number of effectively allocating threads. This value serves as the initial sample collected into the thread-private allocation ratio EMA.

8.1.3. Clear thread private statistics

These statistics will be used later to compute and sample the current thread's allocation ratio, thereby influencing the thread's expected TLAB size.

8.2. TLAB allocation

The TLAB allocation process is shown in Figure 09.

8.2.1. Current TLAB allocation from threads

If TLAB is enabled (the default; it can be disabled with -XX:-UseTLAB), memory is first allocated from the thread's current TLAB. If that succeeds, the memory is returned; otherwise, a different allocation strategy is chosen based on the TLAB's current remaining space and the current maximum wasted space limit. The next steps explain exactly what this limit is.

8.2.2. Reapply for TLAB assignment

If the current remaining TLAB space is greater than the current maximum wasted space limit (from the process in Figure 08, we know its initial value is the expected size / TLABRefillWasteFraction), the object is allocated directly on the heap. Otherwise, a new TLAB is requested and the allocation happens there. Why is there a maximum wasted space limit?

When a new TLAB is requested, the original TLAB may have space left over, which must be filled with a dummy object before the TLAB is handed back to the heap. Since only the owning thread knows which parts of the TLAB are allocated, at GC scanning time after the TLAB returns to Eden, without the filling the outside world would not know which parts are used and would need extra checks. By filling with an object that is already confirmed reclaimable, a dummy object, GC can simply mark and skip that memory, improving scanning efficiency. And since this memory already belonged to the TLAB, no other thread can use it until after the next scan anyway. The dummy object is an int array. To guarantee there is room to fill a dummy object, TLABs always reserve space for a dummy object header, which is an int[] header, and the TLAB size must not exceed the maximum size of an int array; otherwise the unused space could not be filled with a dummy object.

However, filling dummy objects wastes space, and the waste cannot be allowed to grow too large, so the maximum wasted space limit is used to bound it.

The new TLAB size takes the smaller of the following two values:

  • The remaining heap space available for TLAB, which in most GC implementations is the corresponding remaining Eden space:
    • The traditional, deprecated Parallel Scavenge uses the remaining Eden size. Reference: parallelScavengeHeap.cpp
    • The default G1 GC uses the remaining size of the current Region, effectively an Eden region. Reference: g1CollectedHeap.cpp
    • ZGC uses the remaining Page space; Pages are similar to Eden in that most objects are allocated there. Reference: zHeap.cpp
    • Shenandoah GC uses the remaining FreeSet size, also similar in concept to Eden. Reference: shenandoahHeap.cpp
  • TLAB Expected size + Current space size to be allocated

After the TLAB is allocated, the ZeroTLAB setting decides whether every byte is zeroed. When an object is created, its fields are given initial values, and most fields are initialized to 0; also, when a TLAB returns to the heap, its remaining space is filled with a zeroed int[] array. So the memory can just as well be zeroed ahead of time. In addition, zeroing immediately when the TLAB is first allocated cooperates with the Allocation Prefetch mechanism to fit CPU cache lines (Allocation Prefetch will be described in another series). So enabling ZeroTLAB writes the zeros right after the TLAB space is allocated.

8.2.3. Allocate directly from the heap

Allocating directly from the heap is the slowest path. It is taken when the current TLAB's free space is greater than the current maximum wasted space limit. In addition, each such allocation increases the current maximum wasted space limit by TLABWasteIncrement, so after a certain number of direct heap allocations the limit keeps growing until the current TLAB's remaining space falls below it, at which point a new TLAB is requested for allocation.
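The slow-path decision described in 8.2.2 and 8.2.3 can be sketched as a tiny state machine. This is an illustrative model, not HotSpot code; in the real VM the limit and increment are measured in heap words, while plain numbers are used here:

```java
// Sketch of the TLAB slow-allocation decision (illustrative, not HotSpot code).
class TlabPolicy {
    long wasteLimit;            // current maximum wasted space limit
    final long wasteIncrement;  // added on each outside-TLAB allocation

    TlabPolicy(long initialWasteLimit, long wasteIncrement) {
        this.wasteLimit = initialWasteLimit;
        this.wasteIncrement = wasteIncrement;
    }

    // Decide what to do when an allocation does not fit in the current TLAB;
    // `remaining` is the free space left in that TLAB.
    String onSlowAllocate(long remaining) {
        if (remaining > wasteLimit) {
            // Retiring now would waste too much: allocate this one object outside
            // the TLAB, and raise the limit so this cannot go on forever.
            wasteLimit += wasteIncrement;
            return "allocate outside TLAB";
        }
        // Little enough is left: fill with a dummy object, retire, request a new TLAB.
        return "retire and refill TLAB";
    }
}
```

After enough outside-TLAB allocations the limit overtakes the remaining space, and the thread retires the TLAB and refills, which is exactly the feedback loop the text describes.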

8.3. Expected size of TLAB collection and recalculation during GC

The process is shown in Figure 10, with some operations performed on TLAB before and after GC.

8.3.1. Operations before GC

Before GC, if TLAB is enabled (the default; it can be turned off with -XX:-UseTLAB), all threads' TLABs must be filled with dummy objects and returned to the heap, and some values are computed and sampled for later TLAB size calculations.

First, to make sure the computation is meaningful as a reference, we check whether more than half of the heap's TLAB space has been used. If not, this GC round's data is considered not representative and is skipped. If more than half has been used, the new allocation ratio is computed: it equals the thread's allocation size in this GC round divided by the TLAB space used by all threads on the heap. The allocation ratio describes the current thread's share of the heap's TLAB space, which differs between threads, and this ratio dynamically adjusts the TLAB sizes of threads serving different kinds of work.

A thread's allocation size in this GC round includes memory allocated both inside and outside TLABs. As the thread allocation records in the flow charts of Figures 08, 09, and 10 show, the thread's total allocated size read now, minus the total recorded at the end of the previous GC round, is the size allocated during the current GC round.

Finally, the current TLAB is filled with dummy Object and returned to the heap.

8.3.2. Operations after GC

If TLAB is enabled (the default; it can be turned off with -XX:-UseTLAB) and TLAB resizing is enabled (the default; it can be turned off with -XX:-ResizeTLAB), then the expected TLAB size of each thread is recalculated after GC: new expected size = total heap space for TLAB * current allocation ratio EMA / configured refill count. The maximum wasted space limit is then reset to the new expected size / TLABRefillWasteFraction.

9. Source code analysis of OpenJDK HotSpot TLAB

If you have trouble reading this part, you can skip ahead to Chapter 10, the hot Q&A, which answers many popular questions.

9.1. TLAB class composition

During thread initialization, if the JVM has TLAB enabled (the default; it can be disabled with -XX:-UseTLAB), the TLAB is initialized.

A TLAB consists of the following fields (HeapWord* can be understood as a memory address within the heap): src/hotspot/share/gc/shared/threadLocalAllocBuffer.hpp

```cpp
class ThreadLocalAllocBuffer {
  static size_t   _max_size;                        // maximum TLAB size
  static int      _reserve_for_allocation_prefetch; // space reserved for the Allocation Prefetch
                                                    // CPU cache optimization
  static unsigned _target_refills;                  // expected number of refills per GC cycle

  // The main components of a TLAB
  HeapWord* _start;              // TLAB start address
  HeapWord* _top;                // address after the last allocated memory
  HeapWord* _pf_top;             // related to the Allocation Prefetch CPU cache optimization,
                                 // not relevant here
  HeapWord* _end;                // end address usable for allocation, i.e. _allocation_end
                                 // minus the reserved space (for the dummy object header)
  HeapWord* _allocation_end;     // actual TLAB end address, including the reserved space
  size_t    _desired_size;       // desired TLAB size, including the reserved space,
                                 // in HeapWords (actual bytes divided by HeapWordSize)
  size_t    _refill_waste_limit; // maximum wasted space limit: when the remaining TLAB space
                                 // cannot satisfy an allocation, the strategy is chosen against
                                 // this value; if the remaining space is greater, allocate
                                 // directly from Eden; otherwise return the current TLAB to
                                 // Eden and request a new TLAB from Eden
  AdaptiveWeightedAverage _allocation_fraction; // EMA of this thread's TLAB allocation fraction

  size_t _allocated_before_last_gc; // size the thread had allocated at the end of the last GC,
                                    // used in Figure 10 to compute the size allocated by the
                                    // thread in the current GC round

  // Thread allocation statistics
  unsigned _number_of_refills;  // number of TLAB refills
  unsigned _fast_refill_waste;  // TLAB waste from fast refills
  unsigned _slow_refill_waste;  // TLAB waste from slow refills (a slow refill fills the old
                                // TLAB with a dummy object and requests a new one)
  unsigned _gc_waste;           // TLAB waste at GC time
  unsigned _slow_allocations;   // number of TLAB slow allocations
  size_t   _allocated_size;     // allocated memory size
  size_t   _bytes_since_last_sample_point; // JVM TI sampling related, not relevant here
};
```

9.2. TLAB initialization

The global TLAB state is initialized first, at JVM startup: src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp

```cpp
void ThreadLocalAllocBuffer::startup_initialization() {
  // Initialize, i.e. zero the statistics
  ThreadLocalAllocStats::initialize();

  // Assume that, on average, half of each thread's current TLAB is wasted at GC scanning time
  // (note: only the latest TLAB is wasted; the previous ones are assumed to have no waste).
  // Then the wasted percentage per thread (TLABWasteTargetPercent) equals
  // 1/2 * (1 / expected refills per thread per epoch) * 100,
  // so the expected refills per epoch is 50 / TLABWasteTargetPercent, 50 by default.
  _target_refills = 100 / (2 * TLABWasteTargetPercent);
  // The initial _target_refills is set to at least 2,
  // to reduce the chance of GC during VM initialization.
  _target_refills = MAX2(_target_refills, 2U);

  // If the C2 JIT compiler exists and is enabled, reserve Allocation Prefetch space
  // for the CPU cache optimization.
#ifdef COMPILER2
  if (is_server_compilation_mode_vm()) {
    int lines = MAX2(AllocatePrefetchLines, AllocateInstancePrefetchLines) + 2;
    _reserve_for_allocation_prefetch = (AllocatePrefetchDistance + AllocatePrefetchStepSize * lines) /
                                       (int)HeapWordSize;
  }
#endif

  // Initialize the TLAB of the current thread
  guarantee(Thread::current()->is_Java_thread(), "tlab initialization thread not Java thread");
  Thread::current()->tlab().initialize();
  log_develop_trace(gc, tlab)("TLAB min: " SIZE_FORMAT " initial: " SIZE_FORMAT " max: " SIZE_FORMAT,
                              min_size(), Thread::current()->tlab().initial_desired_size(), max_size());
}
```

Each thread maintains its own TLAB, and TLAB sizes differ between threads. The TLAB size is mainly determined by the Eden size, the number of threads, and each thread's object allocation rate. When a Java thread starts running, its TLAB is allocated first: src/hotspot/share/runtime/thread.cpp

```cpp
void JavaThread::run() {
  // Initialize the thread-local allocation buffer related fields.
  this->initialize_tlab();
  // ... rest of the code omitted
}
```

Allocating the TLAB calls the initialize method of ThreadLocalAllocBuffer: src/hotspot/share/runtime/thread.hpp

```cpp
// Initialize the TLAB, unless TLABs were disabled via -XX:-UseTLAB.
void initialize_tlab() {
  if (UseTLAB) {
    tlab().initialize();
  }
}

// Thread-Local Allocation Buffer (TLAB) support
ThreadLocalAllocBuffer& tlab() { return _tlab; }
ThreadLocalAllocBuffer _tlab;
```

ThreadLocalAllocBuffer::initialize() sets up the TLAB fields, mentioned above, that we care about: src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp

```cpp
void ThreadLocalAllocBuffer::initialize() {
  // Set the initial pointers to NULL; no memory has been taken from Eden yet.
  initialize(NULL,  // start
             NULL,  // top
             NULL); // end

  // Compute the initial desired size and record it.
  set_desired_size(initial_desired_size());

  // Total capacity available to all TLABs; different GC implementations report
  // different values, usually the size of the Eden space. For G1 it is
  // (_policy->young_list_target_length() - _survivor.length()) * HeapRegion::GrainBytes,
  // i.e. the young regions minus the survivor regions.
  size_t capacity = Universe::heap()->tlab_capacity(thread()) / HeapWordSize;
  // Fraction of that capacity this thread is expected to allocate per epoch:
  // desired TLAB size * target refills / capacity.
  float alloc_frac = desired_size() * target_refills() / (float) capacity;
  // Record the sample in the allocation-fraction EMA.
  _allocation_fraction.sample(alloc_frac);

  // Compute and set the initial maximum refill waste:
  // desired size / TLABRefillWasteFraction.
  set_refill_waste_limit(initial_refill_waste_limit());

  // Reset the statistics.
  reset_statistics();
}
```

9.2.1. How is the initial expected size calculated?

src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp

```cpp
// Compute the initial desired TLAB size.
size_t ThreadLocalAllocBuffer::initial_desired_size() {
  size_t init_sz = 0;
  // If a TLAB size was set via -XX:TLABSize, that is the initial desired size,
  // expressed in heap words, i.e. TLABSize / HeapWordSize.
  if (TLABSize > 0) {
    init_sz = TLABSize / HeapWordSize;
  } else {
    // Expected number of allocating threads in the current epoch,
    // predicted by the EMA described earlier.
    unsigned int nof_threads = ThreadLocalAllocStats::allocating_threads_avg();
    // Different GC implementations report different TLAB capacities; for G1 it is
    // (_policy->young_list_target_length() - _survivor.length()) * HeapRegion::GrainBytes.
    // Initial size = capacity / (number of threads * target refills per epoch).
    init_sz = (Universe::heap()->tlab_capacity(thread()) / HeapWordSize) /
              (nof_threads * target_refills());
    init_sz = align_object_size(init_sz);
  }
  // Keep the size between min_size() and max_size().
  init_sz = MIN2(MAX2(init_sz, min_size()), max_size());
  return init_sz;
}

// The minimum size is determined by MinTLABSize, expressed in heap words and
// object-aligned, plus alignment_reserve — the size of the dummy object's
// filler header. (The JVM's CPU-cache prefetch is ignored here; it is
// examined in more detail in another section.)
static size_t min_size() {
  return align_object_size(MinTLABSize / HeapWordSize) + alignment_reserve();
}
```
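The sizing rule above can be made concrete with a small standalone sketch. The function shape and sample numbers below are mine, not HotSpot's, and object alignment is ignored for brevity:

```cpp
#include <cstddef>
#include <algorithm>

// Sketch of initial_desired_size(): if an explicit size is configured, use it;
// otherwise divide the TLAB capacity evenly among the expected allocating
// threads and the target refills per epoch, then clamp to [min_sz, max_sz].
// All sizes are in heap words.
std::size_t initial_desired_size(std::size_t tlab_size_flag,   // -XX:TLABSize (0 = unset)
                                 std::size_t capacity_words,   // heap()->tlab_capacity()
                                 unsigned    nof_threads,      // EMA of allocating threads
                                 unsigned    target_refills,
                                 std::size_t min_sz,
                                 std::size_t max_sz) {
    std::size_t init_sz;
    if (tlab_size_flag > 0) {
        init_sz = tlab_size_flag;
    } else {
        // Share the capacity across threads and refills.
        init_sz = capacity_words / (nof_threads * target_refills);
    }
    // Clamp between the minimum and maximum TLAB size.
    return std::min(std::max(init_sz, min_sz), max_sz);
}
```

For example, a 1,000,000-word Eden shared by 10 threads at 50 refills each yields a 2000-word initial TLAB.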

9.2.2. How is the maximum TLAB size determined?

Different GCs determine it in different ways:

For G1, a TLAB must not contain humongous objects, so the maximum TLAB size is the humongous-object threshold, i.e. half the region size: src/hotspot/share/gc/g1/g1CollectedHeap.cpp

```cpp
// For G1 TLABs should not contain humongous objects, so the maximum TLAB size
// must be equal to the humongous object limit.
size_t G1CollectedHeap::max_tlab_size() const {
  return align_down(_humongous_object_threshold_in_words, MinObjAlignment);
}
```

For ZGC it is 1/8 of the (small) page size; similarly, for Shenandoah it is 1/8 of a region in most cases. Both expect at least 7/8 of each region to remain available, so that the complexity of choosing Cset (collection set) scans does not grow: src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp

```cpp
MaxTLABSizeWords = MIN2(ShenandoahElasticTLAB ? RegionSizeWords : (RegionSizeWords / 8),
                        HumongousThresholdWords);
```

src/hotspot/share/gc/z/zHeap.cpp

```cpp
const size_t ZObjectSizeLimitSmall = ZPageSizeSmall / 8;
```

For the other GCs, the maximum is the maximum size of an int array, which is tied to filling a retired TLAB's empty space with a dummy object; the reason was explained earlier.

9.3. TLAB allocates memory

When an object is created with new, instanceOop InstanceKlass::allocate_instance(TRAPS) is called: src/hotspot/share/oops/instanceKlass.cpp

```cpp
instanceOop InstanceKlass::allocate_instance(TRAPS) {
  bool has_finalizer_flag = has_finalizer(); // Query before possible GC
  int size = size_helper();                  // Query before forming handle.

  instanceOop i;
  i = (instanceOop)Universe::heap()->obj_allocate(this, size, CHECK_NULL);
  if (has_finalizer_flag && !RegisterFinalizersAtInit) {
    i = register_finalizer(i, CHECK_NULL);
  }
  return i;
}
```

Its core is heap()->obj_allocate(this, size, CHECK_NULL), which allocates memory from the heap: src/hotspot/share/gc/shared/collectedHeap.inline.hpp

```cpp
inline oop CollectedHeap::obj_allocate(Klass* klass, int size, TRAPS) {
  ObjAllocator allocator(klass, size, THREAD);
  return allocator.allocate();
}
```

The global ObjAllocator implements the object's memory allocation: src/hotspot/share/gc/shared/memAllocator.cpp

```cpp
oop MemAllocator::allocate() const {
  oop obj = NULL;
  {
    Allocation allocation(*this, &obj);
    HeapWord* mem = mem_allocate(allocation);
    if (mem != NULL) {
      obj = initialize(mem);
    } else {
      // The unhandled oop detector will poison local variable obj,
      // so reset it to NULL if mem is NULL.
      obj = NULL;
    }
  }
  return obj;
}

HeapWord* MemAllocator::mem_allocate(Allocation& allocation) const {
  // If TLABs are enabled, try to allocate from the current TLAB first.
  if (UseTLAB) {
    HeapWord* result = allocate_inside_tlab(allocation);
    if (result != NULL) {
      return result;
    }
  }
  // Otherwise allocate outside the TLAB.
  return allocate_outside_tlab(allocation);
}

HeapWord* MemAllocator::allocate_inside_tlab(Allocation& allocation) const {
  assert(UseTLAB, "should use UseTLAB");
  // Fast path: bump the pointer inside the current TLAB.
  HeapWord* mem = _thread->tlab().allocate(_word_size);
  if (mem != NULL) {
    return mem;
  }
  // Slow path: the current TLAB could not satisfy the request.
  return allocate_inside_tlab_slow(allocation);
}
```

9.3.1. TLAB Fast allocation

src/hotspot/share/gc/shared/threadLocalAllocBuffer.inline.hpp

```cpp
inline HeapWord* ThreadLocalAllocBuffer::allocate(size_t size) {
  // Verify that the pointers are valid, i.e. _top lies within [_start, _end].
  invariants();
  HeapWord* obj = top();
  // If the remaining space is large enough, bump the pointer and return.
  if (pointer_delta(end(), obj) >= size) {
    set_top(obj + size);
    invariants();
    return obj;
  }
  return NULL;
}
```
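The fast path is nothing more than bump-the-pointer within the thread's private buffer. A minimal standalone sketch (a hypothetical Tlab struct, not the HotSpot class; sizes in bytes rather than heap words):

```cpp
#include <cstddef>

// Sketch of TLAB fast allocation: bump the top pointer if the remaining
// space [top, end) is large enough, otherwise report failure — which in
// HotSpot falls through to the slow path.
struct Tlab {
    char* top;  // next free byte
    char* end;  // one past the last usable byte

    void* allocate(std::size_t size) {
        if (static_cast<std::size_t>(end - top) >= size) {
            void* obj = top;  // the object starts at the old top
            top += size;      // bump the pointer
            return obj;
        }
        return nullptr;       // not enough room: caller takes the slow path
    }
};
```

No lock or CAS is needed here, which is exactly why TLAB allocation is so cheap: the buffer belongs to one thread only.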

9.3.2. TLAB slow allocation

src/hotspot/share/gc/shared/memAllocator.cpp

```cpp
HeapWord* MemAllocator::allocate_inside_tlab_slow(Allocation& allocation) const {
  HeapWord* mem = NULL;
  ThreadLocalAllocBuffer& tlab = _thread->tlab();

  // If the remaining TLAB space is larger than the maximum waste limit,
  // keep the current TLAB: record a slow allocation and let the caller
  // allocate this object directly in the heap.
  if (tlab.free() > tlab.refill_waste_limit()) {
    tlab.record_slow_allocation(_word_size);
    return NULL;
  }

  // Otherwise, recompute the TLAB size...
  size_t new_tlab_size = tlab.compute_size(_word_size);
  // ...and return the current TLAB to the Eden space.
  tlab.retire_before_allocation();
  if (new_tlab_size == 0) {
    return NULL;
  }

  // Compute the minimum TLAB size for this allocation.
  size_t min_tlab_size = ThreadLocalAllocBuffer::compute_min_size(_word_size);
  // Allocate a new TLAB.
  mem = Universe::heap()->allocate_new_tlab(min_tlab_size, new_tlab_size,
                                            &allocation._allocated_tlab_size);
  if (mem == NULL) {
    assert(allocation._allocated_tlab_size == 0,
           "Allocation failed, but actual size was updated. min: " SIZE_FORMAT
           ", desired: " SIZE_FORMAT ", actual: " SIZE_FORMAT,
           min_tlab_size, new_tlab_size, allocation._allocated_tlab_size);
    return NULL;
  }
  assert(allocation._allocated_tlab_size != 0,
         "Allocation succeeded but actual size not updated. mem at: " PTR_FORMAT
         " min: " SIZE_FORMAT ", desired: " SIZE_FORMAT,
         p2i(mem), min_tlab_size, new_tlab_size);

  // If the JVM flag ZeroTLAB is enabled, zero the whole new TLAB...
  if (ZeroTLAB) {
    // ..and clear it.
    Copy::zero_to_words(mem, allocation._allocated_tlab_size);
  } else {
    // ...and zap just allocated object.
  }

  // Make the newly allocated memory the current TLAB.
  tlab.fill(mem, mem + _word_size, allocation._allocated_tlab_size);
  // Return the address of the allocated object.
  return mem;
}
```

9.3.2.1. Maximum wasted space of TLAB

The initial value is TLAB size divided by TLABRefillWasteFraction: src/hotspot/share/gc/shared/threadLocalAllocBuffer.hpp

```cpp
size_t initial_refill_waste_limit() { return desired_size() / TLABRefillWasteFraction; }
```

Each slow allocation calls record_slow_allocation(size_t obj_size), which records the slow allocation and raises the TLAB's maximum waste limit:

src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp

```cpp
void ThreadLocalAllocBuffer::record_slow_allocation(size_t obj_size) {
  // On each slow allocation, raise refill_waste_limit by
  // refill_waste_limit_increment, i.e. the JVM flag TLABWasteIncrement.
  set_refill_waste_limit(refill_waste_limit() + refill_waste_limit_increment());
  _slow_allocations++;
  log_develop_trace(gc, tlab)("TLAB: %s thread: " INTPTR_FORMAT " [id: %2d]"
                              " obj: " SIZE_FORMAT
                              " free: " SIZE_FORMAT
                              " waste: " SIZE_FORMAT,
                              "slow", p2i(thread()), thread()->osthread()->thread_id(),
                              obj_size, free(), refill_waste_limit());
}

// refill_waste_limit_increment is the JVM flag TLABWasteIncrement.
static size_t refill_waste_limit_increment() { return TLABWasteIncrement; }
```
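Putting the two rules together, the waste limit behaves as in this sketch (an invented struct with made-up numbers; the real fields live in ThreadLocalAllocBuffer):

```cpp
#include <cstddef>

// Sketch of the refill-waste-limit bookkeeping: the limit starts at
// desired_size / TLABRefillWasteFraction and grows by TLABWasteIncrement on
// every slow allocation, so a thread that repeatedly hits the slow path
// becomes increasingly willing to retire its current TLAB.
struct WasteLimit {
    std::size_t limit;      // refill_waste_limit, in heap words
    std::size_t increment;  // TLABWasteIncrement (4 words by default)

    WasteLimit(std::size_t desired_size, std::size_t waste_fraction,
               std::size_t incr)
        : limit(desired_size / waste_fraction), increment(incr) {}

    // Called on each slow allocation that keeps the current TLAB.
    void record_slow_allocation() { limit += increment; }

    // The TLAB is only retired and refilled once the free space drops
    // to (or below) the limit.
    bool should_refill(std::size_t free_words) const { return free_words <= limit; }
};
```

With a 1024-word desired size and TLABRefillWasteFraction = 64, the limit starts at 16 words; three slow allocations at the default increment of 4 raise it to 28.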

9.3.2.2. Recalculate TLAB size

The recalculation takes the smallest of: the heap space currently available to TLABs, the desired TLAB size plus the space being allocated, and the maximum TLAB size:

src/hotspot/share/gc/shared/threadLocalAllocBuffer.inline.hpp

```cpp
inline size_t ThreadLocalAllocBuffer::compute_size(size_t obj_size) {
  // Heap space currently available for TLAB allocation.
  const size_t available_size = Universe::heap()->unsafe_max_tlab_alloc(thread()) / HeapWordSize;
  // Take the smallest of the available space, the desired size plus the
  // object being allocated, and the maximum TLAB size.
  size_t new_tlab_size = MIN3(available_size,
                              desired_size() + align_object_size(obj_size),
                              max_size());
  // If even that is smaller than the minimum size needed for this object,
  // report failure.
  if (new_tlab_size < compute_min_size(obj_size)) {
    log_trace(gc, tlab)("ThreadLocalAllocBuffer::compute_size(" SIZE_FORMAT ") returns failure",
                        obj_size);
    return 0;
  }
  log_trace(gc, tlab)("ThreadLocalAllocBuffer::compute_size(" SIZE_FORMAT ") returns " SIZE_FORMAT,
                      obj_size, new_tlab_size);
  return new_tlab_size;
}
```
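A standalone sketch of the same three-way minimum (parameters passed in explicitly rather than read from the heap; alignment ignored for brevity):

```cpp
#include <cstddef>
#include <algorithm>

// Sketch of compute_size(): the new TLAB size is the smallest of the space
// the heap can still hand out, the desired size plus the pending object,
// and the GC's maximum TLAB size. A result of 0 means "allocate this object
// outside a TLAB". All sizes are in heap words.
std::size_t compute_size(std::size_t obj_size,
                         std::size_t available,   // unsafe_max_tlab_alloc()
                         std::size_t desired,     // desired_size()
                         std::size_t max_sz,      // max_size()
                         std::size_t min_needed)  // compute_min_size(obj_size)
{
    std::size_t new_size = std::min({available, desired + obj_size, max_sz});
    if (new_size < min_needed) {
        return 0;  // even the best we can do is too small
    }
    return new_size;
}
```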

9.3.2.3. Put the current TLAB back into the heap

src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp

```cpp
// Called on the TLAB slow-allocation path: return the current TLAB to the heap.
void ThreadLocalAllocBuffer::retire_before_allocation() {
  // Count the remaining space of the current TLAB as slow-refill waste.
  _slow_refill_waste += (unsigned int)remaining();
  // Retire the TLAB; this is also called during GC to retire every thread's TLAB.
  retire();
}

// For TLAB slow allocation, stats is NULL.
// For GC calls, stats records the per-thread data.
void ThreadLocalAllocBuffer::retire(ThreadLocalAllocStats* stats) {
  if (stats != NULL) {
    accumulate_and_reset_statistics(stats);
  }
  // If the current TLAB is valid...
  if (end() != NULL) {
    invariants();
    // ...record the used size in the thread's allocated bytes,
    thread()->incr_allocated_bytes(used_bytes());
    // fill the remaining space with a dummy object,
    insert_filler();
    // and clear the current TLAB pointers.
    initialize(NULL, NULL, NULL);
  }
}
```

9.4. Gc-related TLAB operations

9.4.1. Prior to GC

Different GCs may implement this differently, but the timing of the TLAB operations is basically the same. Taking G1 as an example, before the actual GC:

src/hotspot/share/gc/g1/g1CollectedHeap.cpp

```cpp
void G1CollectedHeap::gc_prologue(bool full) {
  // Fill TLAB's and such
  {
    Ticks start = Ticks::now();
    // Ensure the heap is parsable.
    ensure_parsability(true);
    Tickspan dt = Ticks::now() - start;
    phase_times()->record_prepare_tlab_time_ms(dt.seconds() * MILLIUNITS);
  }
  // ... other code omitted
}
```

Why must the heap memory be made parsable? It allows faster scanning of objects on the heap. And what does making the memory parsable involve? Mainly, retiring each thread's TLAB and filling its unused space with a dummy object.

src/hotspot/share/gc/g1/g1CollectedHeap.cpp

```cpp
void CollectedHeap::ensure_parsability(bool retire_tlabs) {
  // A real GC must happen at a safepoint; safepoints are detailed later.
  assert(SafepointSynchronize::is_at_safepoint() || !is_init_completed(),
         "Should only be called at a safepoint or at start-up");

  ThreadLocalAllocStats stats;
  for (JavaThreadIteratorWithHandle jtiwh; JavaThread *thread = jtiwh.next();) {
    BarrierSet::barrier_set()->make_parsable(thread);
    // If TLABs are enabled globally...
    if (UseTLAB) {
      if (retire_tlabs) {
        // ...retire the TLAB (the retire() seen in 9.3.2.3)...
        thread->tlab().retire(&stats);
      } else {
        // ...or just fill the unused space with a dummy object.
        thread->tlab().make_parsable();
      }
    }
  }
  stats.publish();
}
```

9.4.2. After GC

Different GC implementations may differ, but the timing of the TLAB operations is basically the same. Again taking G1 as an example, after GC:

When and how does _desired_size change? src/hotspot/share/gc/g1/g1CollectedHeap.cpp

```cpp
void G1CollectedHeap::gc_epilogue(bool full) {
  // ... other code omitted
  resize_all_tlabs();
}
```

src/hotspot/share/gc/shared/collectedHeap.cpp

```cpp
void CollectedHeap::resize_all_tlabs() {
  // Must run at a safepoint; a GC is one.
  assert(SafepointSynchronize::is_at_safepoint() || !is_init_completed(),
         "Should only resize tlabs at safepoint");

  // If both UseTLAB and ResizeTLAB are enabled (both are by default)...
  if (UseTLAB && ResizeTLAB) {
    for (JavaThreadIteratorWithHandle jtiwh; JavaThread *thread = jtiwh.next(); ) {
      // ...recompute the desired TLAB size for each thread.
      thread->tlab().resize();
    }
  }
}
```

Recomputing each thread's desired TLAB size: src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp

```cpp
void ThreadLocalAllocBuffer::resize() {
  assert(ResizeTLAB, "Should not call this otherwise");

  // Multiply the allocation-fraction EMA by the TLAB capacity (usually Eden)
  // to estimate how much this thread will allocate in the next epoch.
  size_t alloc = (size_t)(_allocation_fraction.average() *
                          (Universe::heap()->tlab_capacity(thread()) / HeapWordSize));
  // Divide by the target refill count to get the new desired size...
  size_t new_size = alloc / _target_refills;
  // ...clamped between min_size() and max_size().
  new_size = clamp(new_size, min_size(), max_size());

  size_t aligned_new_size = align_object_size(new_size);

  log_trace(gc, tlab)("TLAB new size: thread: " INTPTR_FORMAT " [id: %2d]"
                      " refills %d  alloc: %8.6f desired_size: " SIZE_FORMAT " -> " SIZE_FORMAT,
                      p2i(thread()), thread()->osthread()->thread_id(),
                      _target_refills, _allocation_fraction.average(),
                      desired_size(), aligned_new_size);

  // Set the new desired TLAB size.
  set_desired_size(aligned_new_size);
  // Reset the maximum waste limit based on the new size.
  set_refill_waste_limit(initial_refill_waste_limit());
}
```
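The resize arithmetic can be sketched standalone (an invented helper with made-up numbers; the real code operates on the ThreadLocalAllocBuffer fields and aligns the result):

```cpp
#include <cstddef>
#include <algorithm>

// Sketch of the post-GC resize: the allocation-fraction EMA estimates what
// share of the TLAB capacity (roughly Eden) this thread will allocate next
// epoch; dividing by the target refill count and clamping gives the new
// desired TLAB size, in heap words.
std::size_t resize_tlab(double ema_alloc_fraction,   // _allocation_fraction.average()
                        std::size_t capacity_words,  // heap()->tlab_capacity()
                        unsigned target_refills,     // _target_refills
                        std::size_t min_sz, std::size_t max_sz) {
    std::size_t alloc = static_cast<std::size_t>(ema_alloc_fraction * capacity_words);
    std::size_t new_size = alloc / target_refills;
    return std::min(std::max(new_size, min_sz), max_sz);
}
```

A thread allocating about 10% of a 1,000,000-word Eden at 50 refills per epoch ends up with a 2000-word TLAB, while a nearly idle thread is clamped to the minimum.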

10. Q&A of TLAB process frequently Asked questions

This section will be updated over time to address readers' questions.

10.1. Why does TLAB need to fill dummy Object when returned to the heap

Mainly to ensure GC scanning efficiency. A TLAB's interior is known only to its owning thread, and at a GC scan the TLAB is returned to the Eden space; if the unused part were not filled, the outside world could not tell which part is in use and which is not, so extra checks would be needed. Filling the unused part with an object that is guaranteed to be dead — the dummy object — lets the GC simply mark and skip over that memory, increasing scan efficiency. That memory already belonged to the TLAB anyway; no other thread can use it until after the next scan. The dummy object is an int array. To guarantee there is always room to fill one in, the TLAB normally reserves space for a dummy object's header, i.e. an int[] header. This is also why a TLAB must not exceed the maximum size of an int array: otherwise the unused space could not be filled with a dummy object.
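To see the filler-length arithmetic concretely, here is a hedged sketch. The 16-byte int[] header is an assumption (typical of 64-bit HotSpot); the real code queries the actual array header size:

```cpp
#include <cstddef>

// Sketch of why the filler is an int[]: given the unused tail of a retired
// TLAB, the filler array's length is whatever makes
// header + length * sizeof(int) cover the gap exactly. The 16-byte header
// below is an assumption, not the value HotSpot computes.
std::size_t filler_int_array_length(std::size_t gap_bytes) {
    const std::size_t header_bytes = 16;  // assumed int[] header size
    return (gap_bytes - header_bytes) / sizeof(int);
}
```

This is also why the TLAB reserves alignment_reserve: the gap must never be smaller than a bare array header.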

10.2. Why does TLAB need a maximum waste space limit

When a new TLAB is allocated, the old one may still have space left. Before the old TLAB is returned to the heap, that space must be filled with a dummy object, making the memory unusable for object allocation — this is the "waste". If there were no limit, a large allocation could trigger a refill even while the current TLAB still had plenty of room: allocation efficiency would drop, much of the space would end up occupied by dummy objects, and GC would become more frequent.

10.3. Why the TLAB refill count configuration is equal to 100 / (2 * TLABWasteTargetPercent)

TLABWasteTargetPercent configures the target ratio of the maximum initially wasted space to the TLAB size.

First, the ideal is that all objects are allocated inside TLABs, i.e. the TLABs together fill up Eden. Before the next GC scan, memory a TLAB has returned to Eden cannot be used by other threads, because its remaining space has been filled with dummy objects. So the memory consumed by all threads equals (expected number of allocating threads in the epoch) × (refills per thread per epoch) × (TLAB size), and since objects are normally allocated in Eden, ideally that product covers the whole of Eden. But this is too idealistic: a GC scan can happen at any time, so there will always be some memory wasted on dummy-object fill. Assume that, on average, half of each thread's current TLAB is wasted at a GC scan (note that only the latest TLAB has waste; the earlier ones are assumed to have none). Then the percentage of memory wasted per thread (i.e. TLABWasteTargetPercent) equals:

1/2 × (1 / refills per thread per epoch) × 100

So the number of refills per thread per epoch equals 50 / TLABWasteTargetPercent, which is 50 by default (TLABWasteTargetPercent defaults to 1).
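The derivation above can be checked with a tiny sketch of the startup arithmetic (mirroring startup_initialization(), with the floor of 2 that limits GC pressure during VM startup):

```cpp
// Sketch of the refill-count arithmetic: waste% = 50 / refills, hence
// refills = 100 / (2 * TLABWasteTargetPercent), floored at 2.
unsigned target_refills(unsigned tlab_waste_target_percent) {
    unsigned refills = 100 / (2 * tlab_waste_target_percent);
    return refills < 2 ? 2 : refills;
}
```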

10.4. Why ZeroTLAB

After a TLAB is allocated, the ZeroTLAB flag decides whether every byte is zeroed. Because a TLAB is requested at the moment of an object allocation, the memory is about to be used and written immediately. Writing memory touches CPU cache lines, and in a multi-core environment also brings false sharing into play. As an optimization, the JVM performs Allocation Prefetch here, trying to load this memory into the CPU cache — meaning the most efficient time to write the memory is right when the TLAB is allocated.

When an object is created, each of its fields is given an initial value, and most fields are initialized to 0. Likewise, when a TLAB is returned to the heap, its remaining space is filled with an int[] of zeros.

Therefore, zeroing the TLAB when it is first allocated avoids having to zero it again later, and it can also exploit the Allocation Prefetch mechanism to cooperate with CPU cache lines (more on Allocation Prefetch in another series).

10.5. Why the JVM needs to warm up, and why Java code runs faster and faster (here only from the TLAB perspective; JIT, MetaSpace, GC, and so on also matter)

As analyzed earlier, each thread's TLAB size keeps adjusting to the thread's allocation pattern and gradually stabilizes. The size is driven mainly by the allocation-fraction EMA, which needs a certain number of samples to converge; moreover, the EMA is not very stable for roughly the first 100 samples, so TLAB sizes also change frequently when the program starts. As the program's threads stabilize and run for a while, each thread's TLAB settles at the size best suited to that thread's allocation pattern, approaching the ideal in which GC happens only when Eden is full and every object in Eden was allocated efficiently through a TLAB. From the TLAB perspective, this is why Java code runs faster and faster.