Hi everyone, my name is Bin, and it's time for our weekly meetup again. On January 10th I published my first article, "Evolution of IO Models from a Kernel Perspective", which walked through two kinds of IO thread models and finally arrived at Netty's network IO thread model. Readers found it quite hardcore. With your support, that article has received 2038 views, 80 likes and 32 bookmarks, which is a great encouragement for this little account that is barely a month old. Thank you again for your recognition, encouragement and support.

Today Bin is back with another hardcore technical article. In this article we will look in detail, from the angle of computer organization principles, at how objects are laid out in JVM memory, what memory alignment is, what consequences we would face if we stubbornly refused to align memory, and finally the principle and application of compressed pointers. We will also cover the causes of False Sharing and its performance impact in high-concurrency scenarios.

I believe you will gain a lot from this article. Without further ado, let's officially get started ~~

In our daily work, in order to prevent OOM in online applications, we sometimes need to calculate the memory footprint of some core objects during development, so as to better understand the overall memory usage of our application.

Based on the memory resource limit of our server and the order of magnitude of object creation, we can calculate the high and low memory usage threshold of the application. If the memory usage exceeds the high threshold, there may be OOM risk.

Based on the estimated high and low watermarks, we can add processing logic in the program to prevent OOM or to send an alarm, as sketched below.
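As a rough illustration of the idea, here is a minimal sketch of such a check based on the standard MemoryMXBean; the 80%/60% watermarks and the alarm hook are hypothetical placeholders that you would tune to your own service:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapWatermarkMonitor {

    // Hypothetical thresholds: tune them to your own service and traffic model.
    private static final double HIGH_WATERMARK = 0.80;
    private static final double LOW_WATERMARK  = 0.60;

    public static void checkHeapUsage() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() can be -1 if the max heap is undefined; fall back to the runtime's view.
        long max = heap.getMax() > 0 ? heap.getMax() : Runtime.getRuntime().maxMemory();
        double usedRatio = (double) heap.getUsed() / max;

        if (usedRatio > HIGH_WATERMARK) {
            // Above the high watermark: OOM risk, trigger an alarm or shed load here.
            System.err.printf("Heap usage %.1f%% exceeds the high watermark%n", usedRatio * 100);
        } else if (usedRatio < LOW_WATERMARK) {
            // Below the low watermark: memory usage is comfortably within limits.
            System.out.printf("Heap usage %.1f%% is healthy%n", usedRatio * 100);
        }
    }
}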

The core problem, then, is how to calculate the memory footprint of a Java object.

Before answering that question, let's take a look at the in-memory layout of Java objects, which is the subject of this article.

1. Memory layout of Java objects

As shown in the figure, Java objects are represented in the JVM by the instanceOopDesc structure, and the memory layout of a Java object in the JVM heap can be divided into three parts:

1.1 Object Headers

Each Java object contains an object header, which contains two types of information:

  • MarkWord: represented in the JVM by the markOopDesc structure, it stores the runtime data of the object itself, for example: HashCode, GC generational age, lock status flags, the lock held by a thread, biased thread ID, biased timestamp, etc. The MarkWord occupies 4B and 8B of memory on 32-bit and 64-bit operating systems, respectively.

  • Type Pointer: a Java class is wrapped in an InstanceKlass object in the JVM, which contains the meta information of the Java class, such as inheritance structure, methods, static variables, constructors, etc. The type pointer in the object header points to this class meta information.

    • With pointer compression disabled (-XX:-UseCompressedOops), type pointers occupy 4B and 8B of memory on 32-bit and 64-bit operating systems, respectively.
    • With pointer compression enabled (-XX:+UseCompressedOops), type pointers occupy 4B of memory on both 32-bit and 64-bit operating systems.
  • If the Java object is an array type, the object header of the array object also contains a 4B property that records the length of the array.

Since the array-length property in the object header takes up only 4B of memory, the maximum length that a Java array can theoretically declare is 2^32.

1.2 Instance Data

The instance data area of a Java object is used to store the instance fields defined in the Java class, including the instance fields of any parent class. That is, an instance of a subclass still allocates memory for the parent class's instance fields, even though the subclass cannot access the parent class's private instance fields, or even though a subclass field shadows a parent class field of the same name.

Field types in Java objects fall into two broad categories:

  • Primitive types: instance fields of primitive types defined in a Java class have the following memory footprint in the instance data area:

    • long | double occupy 8 bytes.
    • int | float occupy 4 bytes.
    • short | char occupy 2 bytes.
    • byte | boolean occupy 1 byte.
  • Reference types: instance fields of reference type in a Java class have two possible memory footprints in the instance data area:

    • With pointer compression disabled (-XX:-UseCompressedOops): reference types occupy 4 bytes on 32-bit operating systems and 8 bytes on 64-bit operating systems.
    • With pointer compression enabled (-XX:+UseCompressedOops): reference types occupy 4 bytes on 64-bit operating systems, and remain 4 bytes on 32-bit operating systems.

Why are 32-bit OS reference types 4 bytes and 64-bit OS reference types 8 bytes?

In Java, reference types hold the memory address of the referenced object. On a 32-bit operating system, memory addresses are represented by 32 bits, so it takes 4 bytes to record memory addresses. The virtual address space that can be recorded is 2^32, which is only 4 gigabytes of memory.

On a 64-bit operating system, a memory address is represented by 64 bits, so it takes 8 bytes to record. However, 64-bit systems currently only use the lower 48 bits, so the virtual address space is 2^48, which can represent 256TB of memory; the lower 128TB is user space and the upper 128TB is kernel space, which is quite large.
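If you want to inspect this layout for yourself, the OpenJDK JOL tool (org.openjdk.jol:jol-core) can print it; below is a minimal sketch, with the caveat that the exact output depends on your JVM version and flags:

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.vm.VM;

public class ObjectLayoutDemo {

    static class Sample {       // an arbitrary class just for illustration
        long l;
        int i;
        Object ref;
    }

    public static void main(String[] args) {
        // Prints word size, compressed-oops mode, object alignment, etc.
        System.out.println(VM.current().details());
        // Prints MarkWord, type pointer, field offsets and padding for one instance.
        System.out.println(ClassLayout.parseInstance(new Sample()).toPrintable());
    }
}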

Now that we have covered the memory layout of Java objects in the JVM as a whole, let’s look at how instance fields defined in Java objects are laid out in the instance data area:

2. Field rearrangement

In fact, the instance fields we write in the Java source file are rearranged by the JVM. The purpose of doing this is memory alignment. So what is memory alignment, and why is it needed? I will reveal the answers layer by layer as the article goes deeper ~~

In this section, I introduce the JVM field rearrangement rules:

The order in which the JVM lays out fields is affected by the -XX:FieldsAllocationStyle parameter, which defaults to 1. The layout of instance fields follows these rules:

  1. If a field occupies X bytes, then the OFFSET of this field needs to be aligned to NX.

The offset is the difference between the memory address of the field and the starting memory address of the Java object. For example, a field of type long occupies 8 bytes of memory, so its OFFSET should be a multiple of 8, i.e. 8N. If the current offset does not reach 8N, padding bytes are inserted.

  2. In a 64-bit JVM with compressed pointers enabled, the OFFSET of the first field in a Java class needs to be aligned to 4N; with compressed pointers disabled, the OFFSET of the first field needs to be aligned to 8N.

  3. By default, the JVM allocates fields in this order: long/double, int/float, short/char, byte/boolean, oops (Ordinary Object Pointers, i.e. reference type pointers), and instance variables defined in the parent class appear before instance variables defined in the subclass. When the JVM parameter -XX:+CompactFields is set (the default), fields that occupy less memory than long/double are allowed to be inserted into the gap before the first long/double field in the object, to avoid unnecessary memory padding.

The CompactFields option is marked as deprecated in JDK 14 and is likely to be removed in future releases. Details are available at the issue: bugs.openjdk.java.net/browse/JDK-…

The three field rearrangement rules above are very important, but they are also rather convoluted, abstract and not easy to grasp. I list them first so that everyone gets a rough intuition; below I will give a concrete example and explain them in detail, and working through the example will help you understand these three important rules thoroughly.

Suppose we now have a class definition like this:


public class Parent {
    long l;
    int i;
}

public class Child extends Parent {
    long l;
    int i;
}
  • According to Rule 3 introduced above, we know that fields in the parent class come before fields in the child class, and that within each class long l should be laid out before int i.

If the JVM has -XX:+CompactFields turned on, int fields may be inserted into the gap before the first long field in the object (that is, before Parent.l). If the JVM sets -XX:-CompactFields, this insertion of int fields is not allowed.

  • We know from rule 1 that the OFFSET of long fields in the instance data area needs to be aligned to 8N, and the OFFSET of int fields needs to be aligned to 4N.

  • According to Rule 2, if compressed pointers are enabled (-XX:+UseCompressedOops), the OFFSET of the first field of the Child object needs to be aligned to 4N. When compressed pointers are disabled (-XX:-UseCompressedOops), the OFFSET of the first field of the Child object needs to be aligned to 8N.

Because of the two JVM parameters UseCompressedOops and CompactFields, the field layout of a Child object in the instance data area falls into four cases. Below we analyze the field layout in each of these four cases based on the three rules above.

2.1 -XX:+UseCompressedOops -XX:-CompactFields: compressed pointers enabled, field compaction disabled

  • OFFSET = 8 holds the type pointer, which takes up 4 bytes because compressed pointers are enabled. The object header therefore takes up 12 bytes in total: MarkWord (8 bytes) + type pointer (4 bytes).

  • According to rule 3: Parent fields precede Child fields and long fields precede int fields.

  • According to Rule 2: with compressed pointers enabled, the first field in the Child object needs to be aligned to 4N, so the Parent.l field's OFFSET can be either 12 or 16.

  • According to Rule 1: the OFFSET of a long field in the instance data area needs to be aligned to 8N, so the Parent.l field's OFFSET must be 16, and OFFSET = 12 is filled with padding. The Child.l field can only be stored at OFFSET = 32: OFFSET = 28 cannot be used because 28 is not a multiple of 8, so 4 bytes of padding are inserted at OFFSET = 28.

Rule 1 also states that int OFFSETs must be aligned to 4N, so Parent.i and Child.i are stored at OFFSET = 24 and OFFSET = 40, respectively.

Because memory alignment in the JVM exists not only between fields but also between objects, memory addresses between Java objects need to be aligned to 8N.

So the end of the Child object is padded with 4 bytes, and the object size is padded from 44 bytes to 48 bytes.
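A hedged way to verify these numbers is to print the Child layout with JOL while forcing the flags for this case, assuming the Parent and Child classes from the example above are on the classpath; the offsets in the comments simply restate the analysis above and may differ on other JVM builds:

import org.openjdk.jol.info.ClassLayout;

public class ChildLayoutDemo {
    public static void main(String[] args) {
        // Run with: -XX:+UseCompressedOops -XX:-CompactFields
        // Expected per the analysis above: 12-byte header + 4 bytes of padding,
        // Parent.l at OFFSET 16, Parent.i at 24, Child.l at 32, Child.i at 40,
        // then 4 bytes of tail padding for a total of 48 bytes.
        System.out.println(ClassLayout.parseInstance(new Child()).toPrintable());
    }
}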

2.2 -XX:+UseCompressedOops -XX:+CompactFields: compressed pointers enabled, field compaction enabled

  • Building on the analysis of the first case, now that -XX:+CompactFields is turned on, the int field Parent.i can be inserted at OFFSET = 12 to avoid unnecessary byte padding.

  • As we can see from rule 2: The first field of the Child object should be aligned to 4N.

  • According to rule 1: All long fields of Child are aligned to 8N and all int fields are aligned to 4N.

In the end the Child object's instance size is 36 bytes. Since memory addresses between Java objects need to be aligned to 8N, 4 bytes of padding are added at the end of the Child object, bringing it to 40 bytes.

Here we can see that with field compaction enabled (-XX:+CompactFields), the size of the Child object drops from 48 bytes to 40 bytes.

2.3 -XX:-UseCompressedOops -XX:-CompactFields: compressed pointers disabled, field compaction disabled

First, when compressed pointers are turned off (-XX:-UseCompressedOops), the type pointer in the object header becomes 8 bytes, so the object header is 16 bytes in this case.

  • According to Rule 1, the OFFSET of a long field needs to be aligned to 8N. According to Rule 2, when compressed pointers are turned off, the first field of the Child object, Parent.l, needs to be aligned to 8N. So the Parent.l field's OFFSET is 16.

  • Because a long field's OFFSET needs to be aligned to 8N, the Child.l field's OFFSET must be 32, so the position at OFFSET = 28 is filled with 4 bytes of padding.

This results in a Child object size of 44 bytes, but since memory addresses between Java objects need to be aligned to 8N, 4 more bytes of padding are added at the end of the object, giving the Child object a final size of 48 bytes.

2.4 -XX:-UseCompressedOops -XX:+CompactFields: compressed pointers disabled, field compaction enabled

Based on the analysis of the third case, let’s look at the field arrangement of the fourth case:

With pointer compression turned off, the type pointer is 8 bytes, so there is no gap in front of the first field in the Child object, Parent.l, which is already aligned to 8N, and there is nowhere to insert an int field. So even with field compaction enabled (-XX:+CompactFields), the overall field order remains the same as in the third case.

Pointer compression (-XX:+UseCompressedOops) and field compaction (-XX:+CompactFields) are both enabled by default.

3. Alignment padding

As we saw in the previous section on field rearrangement in the instance data area, byte padding for memory alignment occurs not only between fields but also between objects.

We introduced three important rules for field rearrangement. Rules 1 and 2 define the memory alignment rules between fields, while Rule 3 defines the order in which fields are arranged.

For memory alignment purposes, unnecessary bytes need to be filled in between object headers and fields, as well as between fields.

For example, consider the first field-rearrangement case mentioned earlier: -XX:+UseCompressedOops -XX:-CompactFields.

In each of the four cases, a 4-byte space is filled after the object instance data area, because you need to satisfy memory alignment between objects as well as between fields.

Memory addresses between objects in the Java virtual machine heap need to be aligned to 8N (multiples of 8). If an object occupies less than 8N bytes of memory, it must be aligned to 8N bytes by padding unnecessary bytes behind the object.

The virtual machine's memory alignment option is -XX:ObjectAlignmentInBytes, whose default value is 8. In other words, the multiple to which memory addresses between objects must be aligned is controlled by this JVM parameter.

Let’s use the first case above as an example: the object in the figure actually occupies 44 bytes, but is not a multiple of 8, so we need to fill 4 more bytes and align the memory to 48 bytes.

These unnecessary bytes that are padded between fields and objects for memory alignment purposes are called Padding.
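The padding amount itself is just a round-up to the alignment boundary; here is a small sketch of the arithmetic, assuming the default 8-byte object alignment:

public class AlignmentMath {

    // Round size up to the next multiple of alignment (alignment must be a power of two).
    static long alignUp(long size, long alignment) {
        return (size + alignment - 1) & ~(alignment - 1);
    }

    public static void main(String[] args) {
        // 44 bytes of header + instance data rounds up to 48 bytes with 8-byte alignment.
        System.out.println(alignUp(44, 8)); // prints 48
    }
}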

4. Applications of alignment padding

Now that we know the concept of alignment padding, you might wonder why we do it and what kind of problems it is meant to solve.

Let's carry this question with us and read on ~~

4.1 Using alignment padding to solve False Sharing

In addition to the two alignment-padding scenarios described above (between fields and between objects), there is a third alignment-padding scenario in Java: padding added specifically to solve the problem of False Sharing.

Before introducing False Sharing, I will introduce the way the CPU reads data in memory.

4.1.1 CPU cache

According to Moore’s Law, the number of transistors on a chip doubles every 18 months. As a result, CPU performance and processing speed become faster and faster, and improving CPU speed is much easier and cheaper than improving memory speed, so the speed gap between CPU and memory becomes larger and larger.

To make up for the huge speed difference between the CPU and memory and to improve CPU processing efficiency and throughput, L1, L2 and L3 caches were introduced and integrated into the CPU. Of course, there are also the registers (sometimes called L0), which are closest to the CPU, have the fastest access, and incur essentially no delay.

A CPU contains multiple cores. When buying a computer we often see configurations such as "four cores, eight threads", which means the CPU has four physical cores and eight logical cores. Four physical cores means four threads can truly run in parallel at the same time. Eight logical cores means the processor uses hyper-threading to make each physical core simulate two logical cores; a physical core can still only execute one thread at a time, but the hyper-threaded chip can switch between threads very quickly. When one thread stalls, for example while accessing memory, the hyper-threaded core can switch to the other thread immediately. Because the switching is so fast, it looks as if eight threads are executing simultaneously.

The CPU core in the figure refers to the physical core.

L1Cache is the nearest cache to the CPU core, followed by L2Cache, L3Cache, and memory.

Caches that are closer to the CPU core are faster, more expensive, and, of course, smaller.

L1Cache and L2Cache are private to the physical core of the CPU.

L3Cache is shared by all physical cores of the entire CPU.

Logical CPU cores share the L1Cache and L2Cache of their physical core.

L1Cache

L1Cache is the closest cache to the CPU. It has the fastest access speed and the smallest capacity.

L1Cache is divided into two parts: Data Cache and Instruction Cache. One of them stores data and the other stores code instructions.

On a Linux machine, you can view CPU information under the /sys/devices/system/cpu/ directory.

In the /sys/devices/system/cpu/ directory, you can see the number of CPU cores (logical cores).

The processor on my machine doesn’t use hyperthreading so there are actually four physical cores.

Let’s dive into one of the CPU cores (CPU0) to see what L1Cache looks like:

The CPU cache details can be viewed in the /sys/devices/system/cpu/cpu0/cache directory:

Index0 describes DataCache in L1Cache:

  • level: indicates the level of the cache information. 1 indicates L1Cache.
  • type: Indicates the DataCache of L1Cache.
  • size: Indicates that the DataCache size is 32 KB.
  • shared_cpu_list: we mentioned earlier that L1Cache and L2Cache are private to a physical core, while the logical cores simulated by that physical core share its L1Cache and L2Cache. The entries under the /sys/devices/system/cpu/ directory describe logical cores, and shared_cpu_list lists exactly which logical cores share the same physical core.

Index1 describes the Instruction Cache in L1Cache:

The Instruction Cache in L1Cache is also 32K in size.

L2Cache

L2Cache information is stored in index2:

The L2Cache is 256K in size, which is larger than the L1Cache.

L3Cache

L3Cache information is stored in index3:

Here we can see that the DataCache and InstructionCache in L1Cache are both 32K, whereas L2Cache is 256K and L3Cache is 6M.

Of course, these values vary depending on the CPU configuration, but overall the L1Cache is tens of kilobytes, L2Cache is hundreds of kilobytes, and L3Cache is several megabytes.
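On Linux the same information can also be read programmatically; below is a minimal sketch (Java 11+ for Files.readString) that walks the sysfs directory shown above. The paths and attribute file names are Linux-specific and the index entries vary by CPU:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CpuCacheInfo {
    public static void main(String[] args) throws IOException {
        Path cacheDir = Paths.get("/sys/devices/system/cpu/cpu0/cache");
        try (DirectoryStream<Path> indexes = Files.newDirectoryStream(cacheDir, "index*")) {
            for (Path index : indexes) {
                String level = Files.readString(index.resolve("level")).trim();
                String type  = Files.readString(index.resolve("type")).trim();
                String size  = Files.readString(index.resolve("size")).trim();
                String line  = Files.readString(index.resolve("coherency_line_size")).trim();
                System.out.printf("L%s %s: size=%s, cache line=%s bytes%n", level, type, size, line);
            }
        }
    }
}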

4.1.2 CPU cache lines

Earlier we introduced the CPU cache structure. The purpose of the cache is to bridge the speed gap between the CPU and memory, and according to the principle of program locality, the CPU cache is naturally used to store hot data.

The principle of program locality manifests itself as time locality and space locality. Time locality means that if an instruction in a program is executed once, it may be executed again soon after. If a piece of data is accessed, it may be accessed again shortly thereafter. Spatial locality means that once a program accesses a storage unit, it will not be long before nearby storage units are accessed as well.

What is the basic unit for accessing data in the cache?

Hot data in the CPU cache is not accessed in units of individual variables or individual pointers, as we might imagine.

The basic unit of data access in the CPU cache is called a cache line. Cache line sizes are powers of 2, ranging from 32 to 128 bytes on different machines. On all current mainstream processors the cache line size is 64 bytes (note: bytes, not bits).

The cache line size is 64 bytes in L1Cache, L2Cache and L3Cache alike.

This means the CPU reads and writes memory in units of 64 bytes: even if you only need one bit, the CPU will load the whole 64 bytes containing it from memory. Likewise, the CPU flushes data from the cache back to memory 64 bytes at a time.

For example, if you access a long array, the CPU loads the first element of the array into the cache at the same time as it loads the next seven elements. This speeds up the efficiency of traversing the array.

Long takes up eight bytes in Java, and a cache line can hold eight long variables.

In fact, you can iterate very quickly over any data structure allocated in contiguous chunks of memory. If the items in your data structure are not adjacent in memory (for example, a linked list), you cannot take advantage of the CPU cache: because the data is not stored contiguously, each item access may cause a cache line miss (the principle of program locality again).
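A rough sketch of why this matters is below: traversing a contiguous long[] touches a new cache line only once every eight elements, while walking a LinkedList chases pointers scattered across the heap (and also pays for boxing), so it cannot benefit from the cache in the same way. This is only an illustration, not a rigorous benchmark; JMH would be the proper tool.

import java.util.LinkedList;
import java.util.List;

public class LocalityDemo {
    public static void main(String[] args) {
        int n = 5_000_000;                        // may need a few hundred MB of heap
        long[] array = new long[n];
        List<Long> linked = new LinkedList<>();
        for (int i = 0; i < n; i++) { array[i] = i; linked.add((long) i); }

        long t0 = System.nanoTime();
        long sumArray = 0;
        for (long v : array) sumArray += v;       // sequential access, cache-line friendly
        long t1 = System.nanoTime();
        long sumList = 0;
        for (long v : linked) sumList += v;       // pointer chasing, poor spatial locality
        long t2 = System.nanoTime();

        System.out.printf("array: %d ms, linked list: %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sumArray, sumList);
    }
}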

Remember when we introduced Selector creation in Netty's Reactor implementation: Netty's custom, array-based SelectedSelectionKeySet replaces the JDK's HashSet-based sun.nio.ch.SelectorImpl#selectedKeys. The goal is to take advantage of the CPU cache to improve the traversal performance of the set of IO-active SelectionKeys.

4.2 False Sharing

As an example, we define a FalseSharding class with two volatile long fields a and b.

public class FalseSharding {

    volatile long a;

    volatile long b;

}

Fields a and b are logically independent of each other: they store different data and have no relationship to one another.

The memory layout between fields in the FalseSharding class is as follows:

Fields A and B in FalseSharding class are stored adjacent to each other in memory, occupying 8 bytes respectively.

Suppose field a and field b are read into the same cache line by the CPU, and there are two threads: thread A modifies field a while thread B reads field b.

What happens to the read operations of thread B in this scenario?

We know that variables declared with the volatile keyword guarantee memory visibility in a multithreaded environment. At the hardware level, visibility of writes to volatile shared variables is guaranteed by the Lock prefix instruction and the cache coherence protocol (MESI).

  • The Lock prefix instruction causes the modifying core to flush the corresponding cache line back to memory immediately after the modification, and at the same time locks the cache lines that cache the modified variable in the other processor cores, preventing multiple cores from concurrently modifying the same cache line.

  • The cache coherence protocol is mainly used to keep the CPU caches of multiple cores consistent with each other and with memory. Each processor sniffs the bus for writes by other processors to memory addresses it has cached; if such an address is cached in its own core, the corresponding cache line is invalidated, and the next time the data in that cache line is needed it must be fetched from memory again.

Based on the volatile keyword principle above, let’s look at the first effect:

  • When thread A makes a change to field A in processor Core0, the Lock prefix instruction locks the corresponding cached row of field A in all processors, which causes thread B to be unable to read and modify field B in processor Core1.

  • Processor Core0 flushes the cache line where the modified field A is located back into memory.

We can see from the figure that at this point the value of field a has changed both in processor core0's cache line and in memory. However, the value of field a in processor core1's cache has not been updated, and because the cache line containing field a in core1 is locked, thread B can neither read nor write field b.

From the above procedure, we can see that even though fields A and B are logically independent, they are not related to each other at all, but thread A’s modification of field A causes thread B to be unable to read field B.

The second effect:

When processor core0 flushes the cache row containing field A back to memory, processor Core1 sniffs on the bus that the memory address of field A is being changed by another processor and sets its cache row as invalid. When thread B reads the value of field B in processor Core1 and finds that the cache row has been invalidated, Core1 needs to re-read the value of field B from memory even if nothing has happened to field B.

From the above two influences, we can see that field A and field B actually do not share and there is no correlation between them. Theoretically, any operation of thread A on field A should not affect the reading or writing of field B by thread B.

But in fact thread A’s modification of field A causes the cache rows of field B in CORE1 to be locked (the Lock prefix instruction), making it impossible for thread B to read field B.

After core0, the processor where thread A resides, synchronously flushes the cache row of field A back to memory, the cache row of field B in Core1 is set as invalid (cache consistency protocol), and thread B needs to go back to memory to read the value of field B and cannot take advantage of CPU cache.

Because field A and field B are in the same cache row, field A and field B are actually shared (which they should not be). This phenomenon is called False Sharing.

In high concurrency scenarios, this pseudo-sharing problem can have a significant impact on program performance.

If thread A makes changes to field A at the same time thread B makes changes to field B, the performance impact is even greater because it invalidates the corresponding cache rows in CORE0 and CORE1.
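To get a feel for this effect, here is a minimal sketch (assumed to live in the same package as the FalseSharding class above so it can touch its package-private fields): two threads hammer the adjacent volatile fields a and b, and the same run can be repeated against the padded version shown in the next section for comparison. Again, this is only an illustration; JMH would be the proper benchmarking tool.

public class FalseSharingDemo {
    public static void main(String[] args) throws InterruptedException {
        FalseSharding shared = new FalseSharding();
        final long iterations = 100_000_000L;

        Thread writerA = new Thread(() -> {
            for (long i = 0; i < iterations; i++) shared.a = i;  // keeps invalidating the shared cache line
        });
        Thread writerB = new Thread(() -> {
            for (long i = 0; i < iterations; i++) shared.b = i;  // competes for the very same cache line
        });

        long start = System.nanoTime();
        writerA.start(); writerB.start();
        writerA.join();  writerB.join();
        System.out.printf("took %d ms%n", (System.nanoTime() - start) / 1_000_000);
    }
}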

4.3 False Sharing solution

Since False Sharing occurs because field A and field B are in the same cache row, we need to find a way to make field A and field B not in the same cache row.

So what can we do to ensure that field A and field B are not assigned to the same cache row?

This is where the byte padding discussed in this section comes in handy.

Prior to Java 8, the common approach was to pad 7 long variables (a cache line is 64 bytes) before and after fields a and b, so that each field had a cache line to itself, thereby avoiding False Sharing.

For example, if we change our original example code to look like this, we can ensure that field A and field B each have a cache row.

public class FalseSharding {

    long p1,p2,p3,p4,p5,p6,p7;
    volatile long a;
    long p8,p9,p10,p11,p12,p13,p14;
    volatile long b;
    long p15,p16,p17,p18,p19,p20,p21;

}

The memory layout of the modified object is as follows:

We can see that to solve the False Sharing problem we padded the FalseSharding sample object from 32 bytes up to 200 bytes, which is a significant memory cost. For extreme performance, we often see False Sharing solved this way in the source code of highly concurrent frameworks or in the JDK itself, because in high-concurrency scenarios any small performance penalty, such as False Sharing, is magnified.

However, solving False Sharing comes at a significant memory cost, so even high-concurrency frameworks such as Disruptor or the JDK only apply it to shared variables that are frequently written in multithreaded scenarios.

What I want to emphasize here is that in our daily work, we should not see everything as a nail just because we happen to be holding a hammer.

We should clearly distinguish what impact and cost a problem brings, and whether that impact and cost are acceptable at the current stage of the business: is it actually a bottleneck? We also need to be clear about the cost of solving the problem, make a comprehensive evaluation and pay attention to the input-output ratio. Some problems are real problems, but at certain stages and in certain scenarios we do not need to invest in solving them; other problems are bottlenecks for the current stage of business development and must be solved. In architecture or program design, the scheme should be simple and appropriate, while leaving some room for future evolution.

4.3.1 The @Contended annotation

A new annotation @Contended was introduced in Java8 to solve the False Sharing problem, and it also affects the arrangement of fields in Java objects.

In the previous section we solved False Sharing by manually padding fields, but this has a problem: when padding by hand we have to take the CPU cache line size into account. Mainstream processors use 64-byte cache lines, but there are still processors with 32-byte cache lines, and some even use 128 bytes. There are a lot of hardware differences to consider.

Java 8 solves this problem with the @Contended annotation, so we no longer need to pad fields manually. Let's take a look at how the @Contended annotation helps us here.

The manual byte padding described in the previous section fills 64 bytes before and after the shared variable, which only guarantees that the variable has a cache line to itself on CPUs whose cache line size is 32 or 64 bytes. If the CPU's cache line size is 128 bytes, the False Sharing problem still exists.

The introduction of @Contended annotations allows us to ignore the differences in underlying hardware devices and achieve the original purpose of the Java language: platform independence.

By default the @Contended annotation only takes effect inside the JDK. If you want to use @Contended in your own program code, you need to set the JVM parameter -XX:-RestrictContended for it to take effect.

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.TYPE})
public @interface Contended {
    // contention group tag
    String value() default "";
}

The @Contended annotation can be placed on a class or on a field within a class. Data annotated with @Contended gets exclusive cache lines and will not share them with any other variable or object.

  • The @Contended annotation indicates on a class that the entire instance data in that class object needs to be exclusive to the cache line. The cache row cannot be shared with other instance data.

  • The @Contended annotation indicates on a field in a class that the field needs to have an exclusive cache line.

  • In addition, @Contended provides the notion of groups: the value attribute of the annotation names a contention group. Variables belonging to the same group are stored contiguously in memory and are allowed to share cache lines, while cache lines cannot be shared between different groups.

Let’s take a look at how the @Contended annotation affects the arrangement of fields in each of the three usage scenarios.

The @Contended annotation on a class
@Contended
public class FalseSharding {
    volatile long a;
    volatile long b;

    volatile int c;
    volatile int d;
}

When the @Contended annotation is on the FalseSharding example class, it means that the entire instance data area in the FalseSharding example object needs to have an exclusive cache line, and cannot share the cache line with other objects or variables.

Memory layout in this case:

As shown in the figure, after the FalseSharding example class is annotated @Contended, the JVM fills 128 bytes before and after the FalseSharding example object’s instance data area, ensuring that the memory between the fields in the instance data area is continuous, and ensuring that the entire instance data area is exclusive to the cache line. Cache rows are not shared with data outside the instance data area.

Careful friends may have noticed the problem. Didn’t we mention that the cache line size is 64 bytes earlier? Why is it filled with 128 bytes?

And since the manual padding described earlier was 64 bytes, why is the @Contended annotation filled with twice the size of the cache line?

There are actually two reasons for this:

  1. First of all, as we’ve already mentioned, most of the mainstream CPU cache lines are 64 bytes, but there are some CPU cache lines that are 32 bytes or 128 bytes. If you fill 64 bytes, It is possible to avoid FalseSharding on cpus with 32 and 64 bytes of cache row size, but on cpus with 32 and 64 bytes of cache row size128 bytesThe FalseSharding problem still occurs in the CPU of Java, where Java takes a pessimistic approach and defaults to fill128 bytes, while wasteful in most cases, hides underlying hardware differences.

However, the number of padding bytes used by the @Contended annotation can be controlled with the JVM parameter -XX:ContendedPaddingWidth, whose value ranges from 0 to 8192 and defaults to 128.

  2. The second reason is the core one: it mainly prevents False Sharing caused by CPU Adjacent Sector Prefetch.

CPU Adjacent Sector Prefetch: www.techarp.com/bios-guide/…

CPU Adjacent Sector Prefetch is a BIOS feature specific to Intel processors, and its default value is Enabled. It exploits the principle of program locality: when the CPU requests data from memory and reads the cache line containing the requested data, it also prefetches the next adjacent cache line, so that when our program processes data sequentially, CPU efficiency improves. This also reflects the spatial-locality aspect of the principle of program locality.

If the CPU Adjacent Sector Prefetch feature is disabled, the CPU obtains only the cache row where the requested data is located, but does not Prefetch the next cache row.

Therefore, when CPU Adjacent Sector Prefetch is Enabled, the CPU effectively fetches two cache lines at a time. In this case, padding twice the cache line size (128 bytes) is needed to avoid the False Sharing caused by Adjacent Sector Prefetch.

The @Contended annotation on a field
public class FalseSharding {

    @Contended
    volatile long a;
    @Contended
    volatile long b;

    volatile int c;
    volatile long d;
}

This time we put the @Contended annotation on fields a and b of the FalseSharding sample class. The effect is that fields a and b each get their own cache lines. In the memory layout, 128 bytes are padded before and after each of them to ensure that they do not share a cache line with any other data.

Fields C and D that are not annotated by the @Contended annotation are stored consecutively in memory and can share cached rows.

@Contended grouping
public class FalseSharding {

    @Contended("group1")
    volatile int a;
    @Contended("group1")
    volatile long b;

    @Contended("group2")
    volatile long  c;
    @Contended("group2")
    volatile long d;
}

This time we put fields a and b into one contention group (group1), and fields c and d into another contention group (group2).

In this way, field a and field b under the same group1 are stored consecutively in memory and can share cache rows.

Similarly, field C and field D under the same group2 are also stored consecutively in memory, allowing shared cache rows.

However, cache rows cannot be shared between groups. Therefore, 128 bytes are filled before and after field groups to ensure that variables between groups cannot share cache rows.
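Putting the pieces together, here is a hedged sketch of what using the annotation in your own code might look like on JDK 8, where the annotation lives in sun.misc (from JDK 9 on it moved to jdk.internal.vm.annotation, which additionally has to be exported to your code). The class and field names are made up for illustration, and the JVM must be started with -XX:-RestrictContended for the annotation to take effect outside the JDK:

import sun.misc.Contended; // JDK 8; jdk.internal.vm.annotation.Contended on JDK 9+

public class HotCounters {

    @Contended                     // this frequently written field gets padded cache lines of its own
    volatile long requestCount;

    volatile long rarelyWritten;   // un-annotated fields may still share a cache line with each other

    // Run with: java -XX:-RestrictContended HotCounters
}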

5. Memory alignment

From the above we have learned that the fields in a Java object's instance data area need to be memory-aligned, which is why the JVM rearranges them, and that padding out whole cache lines is used to avoid False Sharing.

We also learned that memory alignment occurs not only between objects, but also between fields within objects.

In this section I will introduce you to what memory alignment is. Before we begin, I will ask you two questions:

  • Why do we need memory alignment? If we are stubborn and refuse to align memory, what will happen?

  • Why should the starting address of objects in the Java virtual machine heap be aligned to multiples of 8? Why not align it to multiples of 4 or 16 or 32?

With these two questions in mind, let’s begin this section

5.1 Memory Structure

What we normally call memory is also called random-access memory or RAM. There are two types of RAM:

  • One is static RAM (SRAM), which is used for the CPU caches L1Cache, L2Cache, and L3Cache described above. Its characteristics are fast access speed, access speed of 1-30 clock cycles, but small capacity, high cost.

  • The other type is dynamic RAM(DRAM), which is used in what is often called main memory and is characterized by slow access (relative cache), with access speeds of 50-200 clock cycles, but large capacity and lower cost (relative cache).

Memory consists of memory modules inserted into expansion slots on the motherboard. Common memory modules typically transfer data to and from the memory controller in units of 64 bits (8 bytes).

The black components on a memory module are the DRAM chips. Multiple memory modules connected to the memory controller together make up main memory.

The DRAM chips described above are packaged in memory modules, and each memory module contains eight DRAM chips, numbered from 0 to 7.

The storage structure of each DRAM chip is a two-dimensional matrix. The elements stored in this matrix are called supercells, and each supercell is one byte (8 bits) in size. Each supercell has a coordinate address (i, j).

i represents the row address in the two-dimensional matrix, which in computer terms is called RAS (Row Access Strobe).

j represents the column address in the two-dimensional matrix, which in computer terms is called CAS (Column Access Strobe).

Supercell in the figure below has RAS = 2 and CAS = 2.

Information in the DRAM chip flows into and out of the DRAM chip through pins. Each pin carries 1 bit of signal.

The DRAM chip in the figure contains two address pins (addr), because we need RAS and CAS to locate the supercell to fetch. There are also 8 data pins (data): since the IO unit of the DRAM chip is one byte (8 bits), 8 data pins are needed to move data in and out of the DRAM chip.

Note that this is just to explain the concept of address pins and data pins, the number of pins in real hardware may not be the same.

5.2 Accessing DRAM chips

Let’s take the example of reading supercell at (2,2) to illustrate the process of accessing DRAM chips.

  1. First the memory controller sends the line address RAS = 2 to the DRAM chip via the address pin.

  2. The DRAM chip copies the entire contents of the second row of the two-dimensional matrix to the internal row buffer according to RAS = 2.

  3. The memory controller then sends CAS = 2 to the DRAM chip via the address pin.

  4. The DRAM chip copies the supercell in the second column from the internal row buffer according to CAS = 2 and sends it to the storage controller via data pins.

The unit of IO for a DRAM chip is one supercell, i.e. one byte (8 bits).

5.3 How does the CPU read and write Main Memory

Earlier we looked at the physical structure of memory, and how to access the DRAM chips in memory to get a single byte of data stored in Supercell.

In this section, we take a look at how the CPU accesses memory.

The internal structure of the CPU chip was introduced in detail in the False Sharing section. Here we mainly focus on the bus architecture between the CPU and memory.

5.3.1 Bus Structure

The exchange of data between CPU and memory is done through a bus, and the transfer of data across the bus is done through a series of steps called bus transactions.

The transfer of data from memory to CPU is called a Read transaction, and the transfer of data from CPU to memory is called a write transaction.

The signal transmitted on the bus includes address signal, data signal and control signal. The control signal transmitted on the control bus can synchronize the transaction and identify the transaction information being executed:

  • Is the current transaction to memory? Or to disk? Or to some other IO device?
  • Is the transaction read or write?
  • Is the signal currently on the bus an address signal (a memory address) or a data signal (data)?

Remember the MESI cache consistency protocol we talked about earlier? When core0 changes the value of field A, other CPU cores will sniff the memory address of field A on the bus. If the memory address of field A appears on the bus, it indicates that someone is modifying field A. In this way, other CPU cores will invalidate the cache line of field A.

As shown in the figure above, the system bus is connected to the CPU and IO bridge, and the storage bus is connected to the IO bridge and main memory.

The IO Bridge is responsible for converting electrical signals on the system bus to electrical signals on the storage bus. The I/O Bridge also connects the system bus and storage bus to the I/O bus (I/O devices such as disks). Here we see that the IO Bridge is actually used to convert electronic signals on different buses.

5.3.2 Process for the CPU to read data from the memory

Suppose the CPU now loads the contents of memory address A into A register for calculation.

First, the bus interface in the CPU chip initiates a read transaction on the bus. The read transaction is divided into the following steps:

  1. The CPU places memory address A on the system bus. The IO Bridge then passes the signal to the storage bus.

  2. The main memory senses the address signal on the storage bus and reads the memory address A on the storage bus through the storage controller.

  3. The storage controller locates the specific memory module based on memory address A and retrives data X corresponding to memory address A from the DRAM chip.

  4. The storage controller puts the read data X onto the storage bus, and the IO Bridge converts the data signal on the storage bus to the data signal on the system bus, and then passes it along the system bus.

  5. The CPU chip senses the data signal on the system bus, reads the data from the system bus and copies it to the register.

This is how the CPU reads data from memory into a register.

But there is also an important process involved, which we need to lay out here, that is, how does the storage controller read the corresponding data X from main memory through memory address A?

Next, we combine the memory structure and the process of reading data from DRAM chip to introduce how to read data from main memory.

5.3.3 How do I Read Data from main memory based on the Memory Address

As described above, when the memory controller in main memory senses the address signal on the memory bus, it will read the memory address from the memory bus.

The memory address is then used to locate the specific memory module. Remember the memory module in the memory structure?

Each memory module contains eight DRAM chips, numbered from 0 to 7.

The memory controller translates the memory address into the supercell coordinate address (RAS, CAS) within the DRAM chips' two-dimensional matrix and sends this coordinate address to the corresponding memory module. The memory module then broadcasts RAS and CAS to all of its DRAM chips, and each chip from DRAM0 to DRAM7 reads the supercell at (RAS, CAS).

We know that a Supercell stores 8 bits of data. Here we read 8 Supercell, or 8 bytes, from DRAM0 to DRAM7, and then return these 8 bytes to the storage controller, which puts the data onto the storage bus.

The CPU always reads data from memory in word size, which in 64-bit processors is 8 bytes. 64-bit memory can only handle 8 bytes at a time.

The CPU reads and writes one cache line (64 bytes) to memory at a time, but memory can only read 8 bytes at a time.

So in the memory module corresponding to the memory address, the DRAM0 chip stores the first (lowest) byte (supercell), the DRAM1 chip stores the second byte, and so on, until the DRAM7 chip stores the last (highest) byte.

Memory is read and written 8 bytes at a time, and what the programmer sees as contiguous memory addresses is actually physically non-contiguous, because those 8 consecutive bytes are stored on different DRAM chips, each chip storing one byte (supercell).

5.3.4 Process for the CPU to write data to the Memory

We now assume that the CPU writes the data X in the register to the memory address A. Similarly, a bus interface in a CPU chip initiates a Write transaction to the bus. Write transaction steps are as follows:

  1. The memory address A to be written by the CPU is placed on the system bus.

  2. Through signal conversion of IO Bridge, memory address A is transferred to the storage bus.

  3. The storage controller senses the address signal on the storage bus, reads the memory address A from the storage bus, and waits for the data to arrive.

  4. The CPU copies the data in the registers to the system bus and transfers the data to the storage bus through the signal conversion of the IO Bridge.

  5. The storage controller senses the data signal on the storage bus and reads the data from the storage bus.

  6. The memory controller locates the specific memory module through the memory address A, and finally writes the data to the eight DRAM chips in the memory module.

6. Why memory alignment

Now that we know about the memory structure and how the CPU reads and writes memory, let’s return to the question at the beginning of this section: Why is memory aligned?

Here are three reasons for memory alignment:

Speed

The unit of data read by the CPU is based on word size. In a 64-bit processor, Word size = 8 bytes, so the unit of data read and written by the CPU to the memory is 8 bytes.

In 64-bit memory, the unit of memory IO is 8 bytes. As mentioned earlier, a memory module usually transfers data to and from the memory controller in 64-bit (8-byte) units, because each memory IO reads one byte from each of the 8 DRAM chips of the memory module where the data resides, all at the same (RAS, CAS) coordinate; these bytes are then assembled into 8 bytes in the memory controller and returned to the CPU.

Due to the limitations of this physical storage structure of 8 DRAM chips per memory module, memory can only be read 8 bytes at a time, in address order and aligned on 8-byte boundaries.

  • Suppose we now read the 8 bytes at the contiguous memory addresses 0x0000-0x0007. Since memory is read 8 aligned bytes at a time and the address we are reading starts at 0 (a multiple of 8), every address in 0x0000-0x0007 maps to the same (RAS, CAS) coordinate, so the whole range can be read in one go from the 8 DRAM chips at that same (RAS, CAS).

  • The same holds if we now read the 8 bytes of contiguous memory at 0x0008-0x000F. Since this segment starts at 8 (a multiple of 8), every address in it also maps to the same (RAS, CAS) coordinate in the DRAM chips, so it too can be read in one go.

Note: the (RAS, CAS) coordinates of the 0x0000-0x0007 segment are different from the (RAS, CAS) coordinates of the 0x0008-0x000F segment.

  • But if we now read the 8 contiguous bytes at 0x0007-0x000E, it is a different story: the starting address 0x0007 has a different (RAS, CAS) in the DRAM chips than the trailing addresses 0x0008-0x000E. So the CPU must first read the 8 bytes at 0x0000-0x0007 into a result register and shift it to keep only the byte at 0x0007, then read the 8 bytes at 0x0008-0x000F into a temporary register and shift it to keep only the bytes at 0x0008-0x000E, and finally OR the two registers together to obtain the 8 bytes of the 0x0007-0x000E address segment (the sketch after this list mimics this combination in code).
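The following sketch mimics that two-read-and-combine process in plain Java over a little-endian ByteBuffer standing in for memory; the real work is of course done in hardware, this is only to make the shifting concrete:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class UnalignedReadDemo {

    // Simulated "memory" that is only ever read 8 aligned bytes at a time.
    static long readAlignedWord(ByteBuffer memory, int alignedOffset) {
        return memory.getLong(alignedOffset);     // alignedOffset is a multiple of 8
    }

    // Read 8 bytes starting at an unaligned offset by combining two aligned reads.
    static long readUnaligned(ByteBuffer memory, int offset) {
        int base  = offset & ~7;                  // e.g. 0x0007 -> 0x0000
        int shift = (offset - base) * 8;          // assumes offset is NOT already aligned (shift > 0)
        long low  = readAlignedWord(memory, base);
        long high = readAlignedWord(memory, base + 8);
        // Little-endian combination: drop the unwanted low bytes of the first word,
        // then append the missing high bytes taken from the second word.
        return (low >>> shift) | (high << (64 - shift));
    }

    public static void main(String[] args) {
        ByteBuffer memory = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 32; i++) memory.put(i, (byte) i);
        System.out.printf("0x%016x%n", readUnaligned(memory, 7)); // bytes 7..14 -> 0x0e0d0c0b0a090807
    }
}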

From the above analysis: when the CPU accesses a memory-aligned address, such as 0x0000 or 0x0008 (both multiples of 8), it can read the data in a single read transaction.

However, when the CPU accesses an unaligned address, such as 0x0007, the starting address is not aligned to a multiple of 8. The CPU needs two read transactions to read the data out.

Remember the question I asked at the beginning of the section?

"Why should the starting addresses of objects in the Java virtual machine heap be aligned to multiples of 8? Why not align to multiples of 4 or 16 or 32?" Can you answer it now?

Atomicity

The CPU can operate atomically on an aligned word-size chunk of memory (word size = 8 bytes on 64-bit processors).

Fit within a single cache line

When introducing False Sharing, we mentioned that the cache line size on current mainstream processors is 64 bytes. Aligning the starting addresses of heap objects to multiples of 8 makes it more likely that an object fits within a single cache line. An object whose starting address is not aligned may span two cache lines, which can make CPU access to it twice as slow.

One of the most important reasons for aligning fields within objects is to ensure that each field appears in only one cache line. If fields were not aligned, a field could span two cache lines: reading it might require loading (and evicting) two cache lines, and writing it would dirty two cache lines at once. Both cases hurt program efficiency.

In addition, the three field alignment rules introduced in section 2. Field Rearrangement ensure that the memory occupied by the instance data area is as small as possible on the basis of field memory alignment.

7. Compressed pointers

Having covered memory alignment, let's now cover the oft-mentioned compressed pointers. They can be enabled with the JVM parameter -XX:+UseCompressedOops, and they are enabled by default.

Before we get started in this section, let's discuss why we use compressed pointers at all.

Let’s say we are now preparing to switch from a 32-bit system to a 64-bit one. At first we might expect an immediate performance improvement, but that may not be the case.

The single biggest cause of performance degradation in the JVM on 64-bit systems is object references. As we mentioned earlier, object references and type pointers on 64-bit systems take up 64 bits, or 8 bytes.

This results in object references taking up twice as much memory on a 64-bit system as on a 32-bit system, which indirectly leads to more memory consumption and more frequent GC occurrences on a 64-bit system. The more CPU time GC takes up, the less CPU time our application takes up.

Another is that as object references become larger, the CPU has fewer objects to cache, increasing access to memory. The above points lead to the degradation of system performance.

On the other hand, on a 64-bit system, the addressing space of memory is 2^48 = 256T. Do we really need that much addressing space in reality? It doesn’t seem necessary ~~

So we had a new idea: Should we switch back to a 32-bit system?

If we did switch back to a 32-bit system, how would we get more than 4GB of memory addressing space on it? 4GB is obviously not enough for today's applications.

I think this is the same problem that the JVM developers had to deal with, and they did a great job of finding more than 4 gigabytes of memory addressing space for 32-bit object references on 64-bit systems using compressed Pointers.

7.1 How do compressed pointers work?

Remember earlier in the section on aligned padding and memory alignment that the starting addresses of objects in the Java virtual machine heap must be aligned to multiples of 8?

Since the starting addresses of objects in the heap are all aligned to multiples of 8, with compressed pointers turned on the lowest three bits of an object's address are always 0 (because the address is always divisible by 8).

Since the JVM already knows that the lowest three bits of these addresses are always zero, there is no need to store those meaningless zeros in the reference. Instead, those three bits can be used to carry meaningful address information, which effectively gives us three more bits of addressing space.

The JVM still stores the data as 32 bits, but the last three bits that were used to store zeros are now used to store meaningful address space information.

When addressing, the JVM shifts the 32-bit object reference three bits to the left (the lowest three bits are filled back in as 0). As a result, with compressed pointers turned on, our 32-bit addressing space effectively becomes 35 bits, and the addressable memory becomes 2^32 * 2^3 = 32GB.
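A small sketch of the encode/decode arithmetic is shown below. It assumes zero-based compressed oops (no heap base added) and the default shift of 3; real JVMs may additionally add a heap base or choose a different shift depending on heap size and ObjectAlignmentInBytes:

public class CompressedOopsMath {

    static final int SHIFT = 3; // log2(ObjectAlignmentInBytes); 3 for the default 8-byte alignment

    // 64-bit heap address -> 32-bit compressed reference (address must be 8-byte aligned).
    static int encode(long address) {
        return (int) (address >>> SHIFT);
    }

    // 32-bit compressed reference -> 64-bit heap address.
    static long decode(int compressed) {
        return (compressed & 0xFFFFFFFFL) << SHIFT;
    }

    public static void main(String[] args) {
        long address = 0x7_2345_6780L;   // an 8-byte aligned address below the 32GB limit
        int compressed = encode(address);
        System.out.printf("0x%x -> 0x%x -> 0x%x%n", address, compressed, decode(compressed));
        // Maximum addressable memory with 32 bits and a shift of 3: 2^32 * 2^3 = 32GB.
    }
}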

In exchange for a few extra bit operations, the JVM greatly increases the addressing space and halves the memory footprint of object references, saving a lot of space. And these bit operations are extremely cheap for the CPU.

Here we also discover, through the workings of compressed pointers, another important reason for memory alignment: by aligning to multiples of 8, a 64-bit system can use 32-bit object references to address up to 32GB of memory.

Starting with Java 7, compressed pointers are enabled by default when the maximum heap size is less than 32GB. When the maximum heap size exceeds 32GB, compressed pointers are turned off.

So how do we expand the addressing space even further with the compression pointer on?

7.2 How can I Expand the Addressing Space

As mentioned earlier, the starting addresses of objects in the Java virtual machine heap all need to be aligned to multiples of 8. However, this value can be changed with the JVM parameter -XX:ObjectAlignmentInBytes (the default is 8). The value must be a power of 2 in the range 8 to 256.

Because the object addresses are aligned to multiples of eight, the three extra bits allow us to store additional address information, increasing the addressable space from 4 gigabytes to 32 gigabytes.

Similarly, what if we set the value of ObjectAlignmentInBytes to 16?

Object addresses are aligned to multiples of 16, which gives us four more bits to store additional address information. The addressing space becomes 2^32 * 2^4 = 64G.

If the compression pointer is enabled on a 64-bit system, the addressing range can be calculated using the following formula: 4G * ObjectAlignmentInBytes = addressing range.

I don’t recommend doing this because increasing ObjectAlignmentInBytes increases the addressing range, but it may also increase byte padding between objects, making the compressed pointer less space-saving.

8. Memory layout of array objects

We’ve spent a lot of time talking about how Java ordinary objects are laid out in memory. In this last section we’ll look at how Java array objects are laid out in memory.

8.1 Memory layout of primitive arrays

The figure above shows the in-memory layout of the primitive array, which is represented in the JVM by the typeArrayOop structure, and the primitive array type meta-information by the TypeArrayKlass structure.

The memory layout of arrays is roughly the same as that of ordinary objects, except that there are four more bytes in the array type object header to represent the length of the array.

Let's look at the following example, first with pointer compression turned on and then with it turned off:

long[] longArrayLayout = new long[1];

With pointer compression enabled (-XX:+UseCompressedOops)

We see that the red box is the extra 4 bytes in the array header that represent the length of the array.

Because the long array in our example has only one element, the instance data area is only 8 bytes. If the array had two elements, the instance data area would become 16 bytes, and so on.
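The same JOL tool mentioned earlier can print array layouts too; a minimal sketch is below, with the sizes in the comments simply restating the analysis above for the compressed-oops case:

import org.openjdk.jol.info.ClassLayout;

public class ArrayLayoutDemo {
    public static void main(String[] args) {
        // With -XX:+UseCompressedOops: 8B MarkWord + 4B type pointer + 4B array length
        // = 16-byte header, then 8 bytes of instance data per long element.
        System.out.println(ClassLayout.parseInstance(new long[1]).toPrintable());
        System.out.println(ClassLayout.parseInstance(new long[2]).toPrintable());
    }
}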

With pointer compression disabled (-XX:-UseCompressedOops)

When pointer compression is turned off, the MarkWord in the object header still occupies 8 bytes, but the type pointer changes from 4 bytes to 8 bytes. The array length property remains unchanged at 4 bytes.

Here we find alignment padding between the instance data area and the object header. Do you remember why?

The three field arrangement rules we introduced earlier in the field rearrangement section continue to apply here:

  • Rule 1: If a field takes X bytes, the field’s OFFSET needs to be aligned to NX.

  • Rule 2: In 64-bit JVMS with compression Pointers turned on, the OFFSET of the first field in a Java class needs to be aligned to 4N, and the OFFSET of the first field in a class with compression Pointers turned off needs to be aligned to 8N.

When pointer compression is turned off, Rules 1 and 2 require the first long element to be aligned to a multiple of 8, so 4 bytes of padding are inserted between it and the object header, moving its starting offset to 24.

8.2 Memory layout of reference type arrays

The figure above shows the in-memory layout of an array of reference types, which is represented in the JVM by the objArrayOop structure, with the reference array type's meta information represented by the ObjArrayKlass structure.

The object header of a reference type array likewise contains a 4-byte property that records the array length.

Again, let's look at an example with pointer compression turned on and then turned off:

public class ReferenceArrayLayout {
    char a;
    int b;
    short c;
}

ReferenceArrayLayout[] referenceArrayLayout = new ReferenceArrayLayout[1];

With pointer compression enabled (-XX:+UseCompressedOops)

The biggest difference between the memory layout of a reference type array and that of a primitive array is the instance data area. Since compressed pointers are enabled, an object reference takes up 4 bytes, and the reference array in our example contains only one element, so the instance data area is only 4 bytes. Likewise, if the array contained two reference elements, the instance data area would become 8 bytes, and so on.

Finally, since Java objects need to be memory-aligned to multiples of 8, four bytes are filled after the instance data area of the reference array.

With pointer compression disabled (-XX:-UseCompressedOops)

When the compression pointer is turned off, the memory footprint for object references changes to 8 bytes, so the instance data area that references array types takes up 8 bytes.

According to field rearrangement rule 2, four bytes need to be filled between the header of the reference array type object and the instance data area for memory alignment purposes.


Conclusion

In this article we covered in detail the memory layout of ordinary Java objects and array objects, as well as how to calculate the memory footprint of these objects.

We also covered the three important field rearrangement rules for the instance data area of an object, the False Sharing problem caused by fields sharing a cache line and how alignment padding addresses it, and the @Contended annotation introduced in Java 8 to solve False Sharing, along with how to use it.

In order to explain the underlying principles of memory alignment, I also spent a lot of time explaining the physical structure of memory and the complete process of CPU reading and writing memory.

Finally, we introduced how compressed pointers work on the basis of memory alignment. From all of this we know four reasons for memory alignment:

  • CPU access performance: When a CPU accesses a memory-aligned address, a single word size of data can be read from a Read transaction. Otherwise, two Read Transactions are required.

  • Atomicity: the CPU can operate atomically on an aligned word-size chunk of memory.

  • Making the most of the CPU cache: memory alignment allows objects and fields to fit within single cache lines as much as possible, avoiding the performance penalty of data that spans cache lines.

  • Improving the memory addressing space of compressed Pointers: Memory alignment between objects allows us to increase the memory addressing space up to 32GB with 32-bit object references on 64-bit systems. This reduces the memory footprint of object references and increases the memory addressing space.

In this article we have also introduced several JVM parameters related to memory layout: -XX:+UseCompressedOops, -XX:+CompactFields, -XX:-RestrictContended, -XX:ContendedPaddingWidth, and -XX:ObjectAlignmentInBytes.

Finally, thank you all for being here. See you in the next article!