Personal tech blog: www.zhenganwen.top
Are shared variables visible across threads?
Let's start with some code that exposes a problem in the Java memory model. Two threads, t1 and t2, access the shared variable sharedVariable. Thread t2 increments sharedVariable step by step up to MAX and yields the CPU for 500 ms after each increment. Meanwhile, thread t1 polls sharedVariable and is expected to detect and print every change:
private static int sharedVariable = 0;
private static final int MAX = 10;

public static void main(String[] args) {
    new Thread(() -> {
        int oldValue = sharedVariable;
        while (sharedVariable < MAX) {
            if (sharedVariable != oldValue) {
                System.out.println(Thread.currentThread().getName() + " watched the change : " + oldValue + "->" + sharedVariable);
                oldValue = sharedVariable;
            }
        }
    }, "t1").start();

    new Thread(() -> {
        int oldValue = sharedVariable;
        while (sharedVariable < MAX) {
            System.out.println(Thread.currentThread().getName() + " do the change : " + sharedVariable + "->" + (++oldValue));
            sharedVariable = oldValue;
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }, "t2").start();
}
However, the actual running results of the above program are as follows:
t2 do the change : 0->1
t1 watched the change : 0->1
t2 do the change : 1->2
t2 do the change : 2->3
t2 do the change : 3->4
t2 do the change : 4->5
t2 do the change : 5->6
t2 do the change : 6->7
t2 do the change : 7->8
t2 do the change : 8->9
t2 do the change : 9->10
Volatile ensures visibility
You can see that t1 notices almost none of the changes t2 makes to sharedVariable. Why is this? You might be told to declare sharedVariable as volatile, and the result then indeed becomes what you would expect:
t2 do the change : 0->1
t1 watched the change : 0->1
t2 do the change : 1->2
t1 watched the change : 1->2
t2 do the change : 2->3
t1 watched the change : 2->3
t2 do the change : 3->4
t1 watched the change : 3->4
t2 do the change : 4->5
t1 watched the change : 4->5
t2 do the change : 5->6
t1 watched the change : 5->6
t2 do the change : 6->7
t1 watched the change : 6->7
t2 do the change : 7->8
t1 watched the change : 7->8
t2 do the change : 8->9
t1 watched the change : 8->9
t2 do the change : 9->10
This makes sense, as volatile ensures that shared variables are visible across threads.
Does synchronized guarantee visibility?
However, you might also be told to just use synchronized + wait/notify: put all operations on the shared variable into synchronized blocks and use wait/notify to coordinate writes and reads of the shared variable:
private static int sharedVariable = 0;
private static final int MAX = 10;
private static Object lock = new Object();
private static boolean changed = false;

public static void main(String[] args) {
    new Thread(() -> {
        synchronized (lock) {
            int oldValue = sharedVariable;
            while (sharedVariable < MAX) {
                while (!changed) {
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
                System.out.println(Thread.currentThread().getName() + " watched the change : " + oldValue + "->" + sharedVariable);
                oldValue = sharedVariable;
                changed = false;
                lock.notifyAll();
            }
        }
    }, "t1").start();

    new Thread(() -> {
        synchronized (lock) {
            int oldValue = sharedVariable;
            while (sharedVariable < MAX) {
                while (changed) {
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
                System.out.println(Thread.currentThread().getName() + " do the change : " + sharedVariable + "->" + (++oldValue));
                sharedVariable = oldValue;
                changed = true;
                lock.notifyAll();
            }
        }
    }, "t2").start();
}
You can see that sharedVariable and changed appear to be visible between t1 and t2 even though neither is declared volatile:
t2 do the change : 0->1
t1 watched the change : 0->1
t2 do the change : 0->2
t1 watched the change : 0->2
t2 do the change : 0->3
t1 watched the change : 0->3
t2 do the change : 0->4
t1 watched the change : 0->4
t2 do the change : 0->5
t1 watched the change : 0->5
t2 do the change : 0->6
t1 watched the change : 0->6
t2 do the change : 0->7
t1 watched the change : 0->7
t2 do the change : 0->8
t1 watched the change : 0->8
t2 do the change : 0->9
t1 watched the change : 0->9
t2 do the change : 0->10
t1 watched the change : 0->10
Does CAS guarantee visibility?
Now change the type of sharedVariable to AtomicInteger and have the t2 thread update it through AtomicInteger (for example with the CAS-based getAndSet). You'll see that the changes are visible as well:
private static AtomicInteger sharedVariable = new AtomicInteger(0);
private static final int MAX = 10;

public static void main(String[] args) {
    new Thread(() -> {
        int oldValue = sharedVariable.get();
        while (sharedVariable.get() < MAX) {
            if (sharedVariable.get() != oldValue) {
                System.out.println(Thread.currentThread().getName() + " watched the change : " + oldValue + "->" + sharedVariable);
                oldValue = sharedVariable.get();
            }
        }
    }, "t1").start();

    new Thread(() -> {
        int oldValue = sharedVariable.get();
        while (sharedVariable.get() < MAX) {
            System.out.println(Thread.currentThread().getName() + " do the change : " + sharedVariable + "->" + (++oldValue));
            sharedVariable.set(oldValue);
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }, "t2").start();
}
Why can synchronized and CAS also provide visibility? Because the release/acquire semantics of synchronized and the update/read semantics of CAS match the write/read semantics of volatile. With that in mind, let's dig into the Java memory model and the underlying implementations of synchronized, volatile and CAS!
CPU Cache
To understand the visibility of variables across threads, we first need to understand the CPU's read/write model. This part may be a bit dry, but it is very helpful for understanding concurrent programming!
Main memory RAM & Cache Cache
Throughout the development of computer technology, main memory access has been much slower than CPU computation, which keeps the CPU's processing power from being fully used and drags down the efficiency of the whole system. To bridge this gap, modern processors introduce a cache memory (the cache).
Caches can keep up with the CPU's access speed, but they are expensive and therefore much smaller than main memory. According to the principle of locality, when the CPU accesses a main memory cell (one byte per cell), there is a high probability that adjacent cells will be needed soon after. So when the CPU accesses a cell, the hardware automatically loads into the cache the whole group of cells containing it (a memory block, usually 64 contiguous bytes). The cell the CPU accesses next is then likely to be in the group that has just been cached, and the CPU can read it from the cache directly. If most main memory accesses can be served from the cache in this way, the system's overall processing speed improves significantly.
Terms related to Cache
The following terms may be a bit confusing at first, but take it easy.
Cache Line & Slot & Hot Data
As mentioned above, when a CPU requests access to a storage unit in main memory, it calls the group of units including that storage unit into the cache. This set of cells (commonly referred to as memory blocks) will be stored in the cache’s cache lines (also called slots). A cache divides its storage cells into equal parts, each of which is a cache line. Today, the typical CPU cache line is 64 bytes (that is, if the cache size is 512 bytes, there should be 8 rows).
In addition, the data cached by the cached row is called hot data.
Cache Hit
When the CPU accesses data (reads or writes) at the address held in a register, it searches the Cache first; if the data is found there, it is returned directly. This is a cache hit, further classified as a read hit or a write hit depending on the operation.
Cache Miss & Hit Latency
As opposed to a cache hit, if the data is not found in the Cache it has to be fetched from main memory across the system bus; this is a cache miss. When a miss happens, an access that would otherwise have gone straight to main memory pays an extra cost for consulting the Cache first. That cost is the hit latency: the time it takes to determine whether the target data is in the Cache.
Cache classification
If you open Task Manager and look at CPU performance, you may notice that the cache has three regions; on the author's machine they are L1 (level 1 cache, 128KB), L2 (level 2 cache, 512KB) and L3 (shared cache, 3.0MB):
At first there was only the L1 Cache. As technology advanced, on the one hand growing main memory meant more hot data to cache, and simply enlarging L1 became poor value for money; on the other hand, the speed gap between L1 and main memory kept widening, calling for a cache whose access speed sits between the two. For both reasons L2 was introduced: slower than L1 but faster than main memory, and larger than L1.
L1 and L2 are generally private to a core: each CPU core has its own L1 and L2 that are not shared with other cores. L3 was added to provide a cache shared by all cores and to raise the hit ratio further when both L1 and L2 miss. As you might guess, L3 is slower than L1 and L2 but has a larger capacity.
Cache Line Conflict
To keep the hit ratio high, the contents of the Cache must be replaced according to some policy. A common one is least recently used (LRU), which evicts the line that has gone unused for the longest time. One way to implement it is to give each line a counter: on a hit, the hit line's counter is reset to zero and all other counters are incremented; when a line must be replaced, the one with the largest counter is evicted. This is an effective scheme: the periodic resetting lets data that was once hot but is no longer needed (the lines with the largest counters) age out of the Cache, improving Cache utilization.
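To make the eviction idea concrete in software terms, here is a minimal LRU sketch built on Java's access-ordered LinkedHashMap; it only illustrates the replacement policy, not the counter-based hardware mechanism described above.

import java.util.LinkedHashMap;
import java.util.Map;

// A minimal LRU cache sketch: the least recently accessed entry is evicted
// once capacity is exceeded. Real CPU caches implement LRU (or an
// approximation of it) in hardware, per set of cache lines.
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCacheSketch(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: iteration order = access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruCacheSketch<Integer, String> cache = new LruCacheSketch<>(2);
        cache.put(1, "line-1");
        cache.put(2, "line-2");
        cache.get(1);             // touch 1, so 2 becomes the least recently used
        cache.put(3, "line-3");   // evicts 2
        System.out.println(cache.keySet()); // [1, 3]
    }
}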
The Cache is tiny compared with main memory, so no matter how the Cache storage mechanism is implemented (more on this later), without a suitable replacement policy the Cache inevitably fills up: a new memory block that needs to be cached may find that every Cache Line it is allowed to use is already occupied. This situation, where a new block has no free Cache Line available to it, is called a Cache Line Conflict.
CPU Cache architecture
At this point, we have roughly a CPU cache architecture:
As shown in the figure, when the CPU tries to access data at a given address, it searches L1, L2, L3 and main memory from top to bottom. As soon as the data is found, it is returned from the corresponding cache and the search stops. If L1, L2 and L3 all miss, the CPU has to go across the bus to main memory, or even to disk. The clock cycles required at each level (a clock cycle being the reciprocal of the CPU frequency) increase from top to bottom, so a cache design should aim for the highest possible hit ratio; otherwise, ending up in main memory every time defeats the purpose.
To make this easier to grasp, here is an analogy the author borrowed from CoolShell:
We know that data to be computed has to be scheduled from disk into memory, then into the L2 Cache, then into the L1 Cache, and finally into the CPU registers for calculation.
When my wife and I were buying this computer, the salesperson in the computer mall quoted these parameters. My wife didn't understand, so I explained them to her, and afterwards she said, "So that's how complicated it is inside; no wonder computers are always so slow. Wouldn't operating on memory directly be faster?" I was speechless.
I had to explain that this is actually done to make things faster, but she still didn't get it, so I made the following analogy: it's like feeding a baby.
A CPU register is like milk already in the baby's mouth: it can be swallowed directly. It takes one second.
The L1 cache is like milk in a bottle you just pick up and put into the baby's mouth. It takes five seconds.
The L2 cache is like milk powder at home: it has to be mixed with hot water before you pick the baby up and feed it. It takes two minutes.
RAM is like milk powder in supermarkets around the city, some far, some near: you have to find a store, then go there to buy it. It takes one to two hours.
Hard disks are like warehouses, which may be in distant suburbs or even at the factory: you need big trucks on the highway to bring the goods into the city. It takes two to ten days.
So, in this situation:
We can't keep no milk powder at home at all. Imagine the child getting hungry and only then going to the supermarket to buy some; wouldn't that be even slower?
We can't mix all the milk powder in advance and pour it into bottles either, because there aren't enough bottles. Nor can we bring all the supermarket's milk powder home, because housing is too expensive to afford such a big house.
And we can't put everything in the warehouse onto the supermarket shelves, because that would cost too much. Besides, if the shelves happen to sell out, stock has to be transferred from the warehouse or even from the manufacturer's factory; in computing this is called paging, and it is quite slow.
Cache structure and Cache affinity
If you were to design such a Cache, what would you do?
If, like the author, you have no hardware background, you might think a hash table is a good choice: one record per memory block, with a hash of the block's address as the key and the block's data as the value, giving O(1) lookups, simple and efficient.
But hashing the address on every cache access could easily take longer than the few dozen clock cycles a Cache access is supposed to cost. And this is not the memcached of our applications: the Cache is real hardware, and hashing memory addresses at the hardware level is asking too much of it.
Take a common x86 chip as an example; the structure of its Cache is shown in the figure below. The whole Cache is divided into S sets. Each set consists of E lines of the smallest storage unit, the Cache Line, and each Cache Line holds B (here B = 64) bytes of data. Each Cache Line also carries one valid bit and t tag bits: the valid bit indicates whether the Cache Line is valid, and the tag bits assist addressing by uniquely identifying the memory block stored in the Cache Line. The 64 bytes in the Cache Line are a copy of the data at the corresponding memory addresses. From this structure we can compute the size of a cache level as B x E x S.
A key decision in cache design is whether each main memory block can be stored in any cache slot, or only in some of them (a slot here being a cache line).
There are three ways to map a cache slot to a main memory block:
- Direct mapped cache: each memory block can only be mapped to one specific cache slot. A simple scheme maps block index block_index to slot (block_index % cache_slots). Two memory blocks that map to the same slot cannot be cached at the same time. (Note: block_index can be computed as physical address / cache line size in bytes.)
- N-way set associative cache: each memory block can be mapped to any of the N ways within one specific set of slots. For example, in a 16-way cache, each memory block can be cached in any of 16 different slots. Memory blocks whose addresses share the same low-order index bits (that is, contiguous memory a fixed stride apart) share those 16 slots.
- Fully associative cache: each memory block can be mapped to any cache slot, which makes the cache behave much like a hash table.
Among these, N-way set association is a refinement of the other two and is the mainstream implementation today. Examples of all three follow.
Fully associative cache
Fully associative means exactly what it says: a memory block to be cached can be placed in any Slot of the Cache. Take a 32-bit system (memory addressed with 32-bit addresses): suppose the 64-byte block covering addresses 0101…10 000000 through 0101…10 111111 needs to be cached; it is placed into whatever Slot happens to be available. Each Slot holds a 64-byte Cache Line plus a valid bit and t tag bits; in this example t = 32 - 6 = 26. When the CPU later accesses data at an address in this range, it first checks whether any Slot's tag bits match the high 26 bits of the address (0101…10); if a match is found, the low 6 bits of the address locate the storage unit within that Slot's Cache Line. These low 6 bits are called the word offset.
You might say: isn't that just a hash table? Not quite. The choice of Slot is indeed arbitrary, but no hash of the address is computed and no hash value is used as the tag, so it is fundamentally different from a hash table.
The reason this scheme is not widely used is that the Slot holding a given block is unknown, so the CPU has to linearly compare the high bits of the data address (here the high 26 bits) against the tag bits of every Slot in the Cache. For my 128KB L1 that is 128 * 1024 / 64 = 2048 slots; even with hardware-level parallelism, this is not efficient.
Direct Mapped Cache
In this scheme, the memory blocks in main memory and the slots in the Cache are both numbered, giving block_index and slot_index, and block_index modulo the number of slots determines which Slot a memory block goes into, as shown in the figure below:
The following will take my L1 Cache 128KB and memory 4GB as an example for analysis:
The addressing range of 4GB of memory runs from 000…000 (32 zeros) to 111…111 (32 ones). Given a 32-bit data address, how do we determine whether that address is cached in L1?
First, the 32-bit address is divided into the following three parts:
For a given 32-bit data address, ignore the low 6 bits and use the middle slot-index bits to determine which slot to look at. Then compare that slot's tag bits with the remaining high bits of the address; if they match, it is a Cache Hit. Finally, the low 6 bits select the storage unit within that slot's Cache Line.
The flaw of the direct mapped cache is that memory blocks with the same low-order bits but different high-order bits map to the same Slot (taking them modulo the slot count gives the same result). If the CPU happens to access several such blocks in turn, only one of them can be cached in that Slot at a time, so Cache Line Conflicts arise easily.
N-Way Set Associative Cache
N-way set association combines the direct mapped cache and the fully associative cache: a given data address is mapped to a fixed Set of slots rather than to one fixed Slot.
As shown in the x86 Cache structure diagram above, the Cache is divided into S sets with E slots each. If my 128KB L1 Cache is organized as 16 slots per set, the number of sets is 128 * 1024 / 64 (slots) / 16 = 128 (each group of slots is called a Set). A given data address is then still divided into the following three parts:
The difference from the direct mapped cache is that instead of 11 middle bits selecting one of 2048 Slots, 7 middle bits select one of 128 Sets. Once the Set is determined, the Slot within it that receives the block is simply whichever is available at the time, and the remaining 19 high bits are stored as the tag.
The benefit is that a given data address maps to only one specific Set, which greatly reduces the probability of Cache Line Conflicts, while the CPU only has to search linearly within that one Set; since a Set holds only a few slots, the lookup is effectively O(1).
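To tie the numbers together, the following sketch decomposes a 32-bit address for the example cache above (128KB with 64-byte lines: 2048 slots when direct mapped, or 128 sets of 16 ways). The bit widths are specific to this example and would differ for other cache geometries.

// Decompose a 32-bit address for the example cache above:
// 128 KB L1, 64-byte lines -> 2048 slots; 16-way -> 128 sets.
public class CacheAddressSketch {
    static final int LINE_SIZE = 64;                    // bytes per cache line
    static final int SLOTS = 128 * 1024 / LINE_SIZE;    // 2048 slots (direct mapped)
    static final int WAYS = 16;
    static final int SETS = SLOTS / WAYS;               // 128 sets (16-way set associative)

    public static void main(String[] args) {
        int address = 0b0101_1100_0011_0101_0110_1010_0100_0110;

        int offset = address & (LINE_SIZE - 1);          // low 6 bits: byte within the line

        // Direct mapped: middle 11 bits select one of 2048 slots, high 15 bits form the tag.
        int slotIndex = (address >>> 6) & (SLOTS - 1);
        int dmTag = address >>> (6 + 11);

        // 16-way set associative: middle 7 bits select one of 128 sets, high 19 bits form the tag.
        int setIndex = (address >>> 6) & (SETS - 1);
        int saTag = address >>> (6 + 7);

        System.out.printf("offset=%d, direct-mapped slot=%d tag=%d, 16-way set=%d tag=%d%n",
                offset, slotIndex, dmTag, setIndex, saTag);
    }
}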
How to write Cache Hit-friendly programs
From the CPU read/write model above we know that whenever the CPU has to go all the way to main memory, the access suffers a large delay and program performance drops noticeably; as the saying goes, distant water cannot put out a nearby fire. So we want to raise the Cache hit ratio, which means making full use of the principle of locality.
Locality includes time locality and space locality.
- Temporal locality: the same data is used multiple times; after the first load, subsequent accesses hit the same Cache Line again and again, improving read speed (instead of reading from lower levels each time).
- Spatial locality: a Cache Line holds 64 bytes, so loading 64 contiguous bytes that the program will access shortly afterwards into one Cache Line raises the hit ratio (instead of addressing each piece of data all over again).
Try to read adjacent data addresses when reading
Let’s first look at the different overhead associated with the two ways of traversing a two-dimensional array:
static int[][] arr = new int[10000][10000];

public static void main(String[] args) {
    m1(); // output: 16
    m2(); // output: 1202, results vary slightly between runs
}

public static void m1() {
    long begin = System.currentTimeMillis();
    int a;
    for (int i = 0; i < arr.length; i++) {
        for (int j = 0; j < arr[i].length; j++) {
            a = arr[i][j];
        }
    }
    long end = System.currentTimeMillis();
    System.out.println(end - begin + "================");
}

public static void m2() {
    long begin = System.currentTimeMillis();
    int a;
    for (int j = 0; j < arr[0].length; j++) {
        for (int i = 0; i < arr.length; i++) {
            a = arr[i][j];
        }
    }
    long end = System.currentTimeMillis();
    System.out.println(end - begin + "================");
}
Repeated tests show that column-by-column traversal (m2) is significantly slower than row-by-row traversal (m1). When traversing row by row, consecutive accesses hit adjacent addresses, so 16 consecutive int values (16 x 4 = 64 bytes) can be served from the same Cache Line: once the first int is accessed and the 64 bytes containing it are loaded into a Cache Line, the following ints are read straight from that Cache Line with no extra work. When traversing column by column, consecutive accesses are a full row apart; with more than 16 columns a row spans more than 64 bytes, so successively accessed ints never share a Cache Line, and each access crosses into another Cache Line, likely taking a Cache Miss and a reload from memory, and at the very least paying an extra hit latency compared with the row-by-row reads.
The loop counters i and j in this example reflect temporal locality: they are operated on so frequently that they live in registers, where the CPU can reach them as fast as possible instead of going to the Cache or main memory.
Traversing the contiguous elements of a row first exploits spatial locality: loading 64 contiguous bytes into a Cache Line in one go makes the subsequent contiguous elements fast to access.
Cache Coherence & Cache Lock & False Sharing
So is it always better to operate within the same cache line rather than across cache lines? There is no mechanism that wins in every scenario, only one that fits a particular scenario best, and contiguous, tightly packed memory allocation (the Cache's smallest unit being the Cache Line) has drawbacks of its own.
The problem is cache coherence. Each CPU core has its own Cache (usually L1 and L2) and works on it most of the time, so the copies of a piece of data in different caches and the shared data in main memory can diverge. Sometimes we need the CPUs to cooperate through shared data in main memory, and then the caches have to be kept in sync with main memory.
This is where the cache coherence protocol comes in; in effect it says: if you want your cache lines kept in sync with main memory, everyone must modify shared variables according to my rules.
The cache subsystem tracks the state of every cache line it holds. Using a technique called bus snooping (bus monitoring), it watches all transactions on the system bus to detect when another party reads or writes an address that it has cached.
When the cache subsystem sees a read on the system bus of a memory region it has loaded, it sets the state of the corresponding cache line to "shared". If it sees a write to that address, it sets the state of the cache line to "invalid".
While snooping, the cache subsystem also wants to know whether it holds the only copy of the data. If its own CPU updates the data, it changes the cache line's state from "exclusive" to "modified". If it then detects another processor reading that address, it blocks the access, writes the updated data back to system memory, lets the access continue, and marks the cache line's state as "shared".
In short, every CPU snoops on the others over the bus. Once a CPU modifies a shared variable in its own Cache (provided the cache line holding it is not already invalid), the cache lines holding that variable in the other CPUs' caches are marked invalid. The next time one of those CPUs touches its invalidated cache line, it first requires the CPU that made the modification to write the change back to main memory, and then re-reads the latest value from main memory into its own cache line.
The cache coherence protocol also relies on a cache lock to make "modify the shared variable in my cache line and tell the other CPUs to invalidate their corresponding cache lines" atomic: while one CPU is modifying a shared variable in its cache, the other CPUs that also cache that variable are barred from accessing their copies, and they are notified to invalidate the corresponding cache lines before the cache lock is released.
Before cache locking, synchronization between CPUs was achieved by locking the bus: while one CPU wrote back to main memory, the bus was locked so that no other CPU could touch main memory. That mechanism is much more expensive.
Although the cache coherence protocol keeps caches and main memory in sync, it introduces a new problem: False Sharing.
As shown in the figure below, x, y and z are loaded into the same Cache Line; thread A on Core1 modifies x while thread B on Core2 modifies y. Suppose Core1 writes first: the Cache Line in Core1's L1 goes from the S (shared) state to the M (modified, dirty) state, and the other cores (Core2 in the figure) are told that their Cache Line for the same addresses is now invalid. When Core2 then initiates its write, it first forces Core1 to write x back to main memory, Core1's Cache Line going from M to I (invalid); Core2 re-reads the line from main memory, its Cache Line going from I to E (exclusive); and finally it modifies y, going from E to M. When multiple threads operate on different data that happens to share a Cache Line, they compete for that Cache Line and hold each other up (the so-called ping-pong effect), effectively turning into a serial program and losing concurrency. In such cases the data used by different threads should be isolated onto different Cache Lines to recover multi-threaded performance.
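The effect is easy to reproduce with a rough micro-benchmark like the sketch below (the class and field names are made up for the illustration, and the timings are only indicative, since field layout and padding are ultimately up to the JVM): two threads hammer two counters that sit side by side in one object, and then the run is repeated with the counters padded apart so they are likely to land on different cache lines.

// Rough false-sharing demo: two threads increment two independent counters.
// In Unpadded the counters probably share a cache line; in Padded the filler
// longs push them apart. The padded run is usually noticeably faster,
// though exact numbers depend on the JVM and the CPU.
public class FalseSharingSketch {
    static class Unpadded { volatile long a; volatile long b; }
    static class Padded   { volatile long a; long p1, p2, p3, p4, p5, p6, p7; volatile long b; }

    static final long ITERATIONS = 50_000_000L;

    static long run(Runnable r1, Runnable r2) throws InterruptedException {
        Thread t1 = new Thread(r1), t2 = new Thread(r2);
        long start = System.currentTimeMillis();
        t1.start(); t2.start();
        t1.join();  t2.join();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        Unpadded u = new Unpadded();
        Padded   p = new Padded();
        System.out.println("unpadded: " + run(
                () -> { for (long i = 0; i < ITERATIONS; i++) u.a++; },
                () -> { for (long i = 0; i < ITERATIONS; i++) u.b++; }) + " ms");
        System.out.println("padded:   " + run(
                () -> { for (long i = 0; i < ITERATIONS; i++) p.a++; },
                () -> { for (long i = 0; i < ITERATIONS; i++) p.b++; }) + " ms");
    }
}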
There are two solutions to Cache Line false sharing:
- Cache Line Padding: increase the distance between the two variables' addresses so that they land on two different Cache Lines; operations on the shared variables x and y then no longer disturb each other.
- Don't operate on the global shared variable directly: each thread reads a copy of it into its own local variable. Local variables are invisible to other threads, so the thread can work on its copy freely and write the result back to the global variable only when it is done (a sketch of this approach follows the list).
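Here is a minimal sketch of the second approach (the names are illustrative): each worker accumulates into a thread-private local variable and merges into the shared counter only once at the end, so the threads never contend for the shared variable's cache line while they work.

import java.util.concurrent.atomic.AtomicLong;

// Sketch of the "work on a local copy" approach: each worker accumulates
// privately and publishes its result to the shared counter exactly once.
public class LocalCopySketch {
    static final AtomicLong sharedCounter = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Runnable worker = () -> {
            long local = 0;                     // thread-private, lives on the stack / in a register
            for (int i = 0; i < 1_000_000; i++) {
                local++;                        // no cache-line contention here
            }
            sharedCounter.addAndGet(local);     // single write-back at the end
        };
        Thread t1 = new Thread(worker), t2 = new Thread(worker);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(sharedCounter.get()); // 2000000
    }
}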
Cache Line Padding
Doug Lea, the well-known concurrency guru, made JDK 7's LinkedTransferQueue more efficient by appending padding bytes:
public class LinkedTransferQueue<E> {
    private PaddedAtomicReference<QNode> head;
    private PaddedAtomicReference<QNode> tail;

    static final class PaddedAtomicReference<T> extends AtomicReference<T> {
        // Append 15 * 4 = 60 bytes to the object
        Object p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, pa, pb, pc, pd, pe;

        PaddedAtomicReference(T r) {
            super(r);
        }
    }
}

public class AtomicReference<V> implements Serializable {
    private volatile V value;
}
Can you make sense of the line that declares p0 through pe? It comes down to the memory layout of objects. If you have read Inside the Java Virtual Machine (2nd Edition), you will know that a non-array object consists of the following parts.
Object header
The object header is divided into three parts:
- Mark Word: 32 or 64 bits depending on the JVM, holding the object's hashCode, generational age, lock flag bits, and so on. This space can be reused to store the ID of the biased thread, a pointer to the lock record (displaced Mark Word) on the stack, or a pointer to the heavyweight lock.
- Class metadata pointer: a type pointer (also 32 or 64 bits) pointing to the type information of the object's class, which is stored in the method area after the class's bytecode is loaded into the JVM.
- Array length: present only if the object is an array.
Instance data
The data the object holds at runtime, which can change dynamically and is shared between threads. It consists of the following types:
- byte, char, short, int, float: 4 bytes each (note these are JVM-level data types, not Java-language types; because the JVM has a limited instruction set, types smaller than 4 bytes are handled with the int family of instructions).
- long and double: 8 bytes.
- reference: 4 or 8 bytes depending on the virtual machine implementation; on a 32-bit JVM a reference variable takes 4 bytes.
Alignment padding
This part carries no real data and exists only as a placeholder. The HotSpot JVM manages memory in units of 8 bytes, and the object header of a non-array object is exactly 8 bytes (32-bit JVM) or 16 bytes (64-bit JVM), so padding is added whenever the instance data is not a multiple of 8 bytes.
The instance data section of a PaddedAtomicReference object alone therefore contains the 15 reference variables p0 through pe (60 bytes) plus the 4-byte value reference inherited from AtomicReference, which brings the instance data section to 64 bytes. As a result, head and tail can never be loaded into the same cache line, so operations on the head and the tail of the queue are not serialized by cache locking and do not hold each other up in ping-pong fashion, which improves the queue's concurrent performance.
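Since JDK 8, a similar effect can be had without hand-written padding fields by using the sun.misc.Contended annotation (moved to jdk.internal.vm.annotation.Contended in JDK 9); note that for application classes the JVM honors it only when started with -XX:-RestrictContended. A minimal sketch, not taken from the JDK source:

// Sketch only: relies on sun.misc.Contended (JDK 8) plus -XX:-RestrictContended;
// without that flag the annotation is ignored for non-JDK classes and the two
// fields may still end up on the same cache line.
public class ContendedSketch {
    @sun.misc.Contended
    volatile long head;   // the JVM pads this field away from its neighbors

    @sun.misc.Contended
    volatile long tail;
}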
Three elements of concurrent programming
With the CPU Cache mentioned above, we can finally get into Java concurrency programming. If you really understand Cache, it is easy to understand the Java concurrency model.
The three elements of concurrent programming are atomicity, visibility and order.
visibility
The visibility problem is caused by the CPU Cache mechanism: most of the time the CPU does not access main memory directly but works on its Cache, and a thread may be scheduled onto different CPU cores across context switches, so you can think of each thread as having its own "local memory". This local memory is not a real piece of hardware; it is an abstraction over the CPU Cache:
If Thread-1 modifies the copy of a shared variable in its local memory but has not flushed the change to main memory, and nothing tells Thread-2 to re-read the variable from main memory, Thread-2 will not see Thread-1's change and will keep operating on the copy in its own local memory. This abstraction is the Java Memory Model (JMM).
So how do threads interact with main memory? The JMM defines the following eight operations for the interaction between threads and main memory, and requires JVM implementations to make each of them atomic and indivisible (except that, for variables of type double and long, the load, store, read and write operations are allowed to be non-atomic on some platforms):
- lock: acts on a variable in main memory, marking it as exclusively owned by one thread
- unlock: acts on a variable in main memory, releasing it from the locked state
- read: reads a variable from main memory
- load: places the variable obtained by read into local memory
- use: passes a variable in local memory to the execution engine; performed whenever the JVM executes a bytecode instruction that needs the variable's value
- assign: assigns a value received from the execution engine to a variable in local memory; performed whenever the JVM executes a bytecode instruction that assigns to the variable
- store: sends a variable from local memory back toward main memory
- write: main memory receives the write-back from the thread and updates the variable in main memory
If a thread needs to interact with main memory, it must execute read and then load, or store and then write, in that order. Note that "in order" does not mean "back to back": other operations may be interleaved, so read a -> read b -> load b -> load a is allowed.
Now the result of the first example at the beginning of this article makes sense. The t2 thread's sharedVariable = oldValue is carried out in three steps, assign -> store -> write: after t2 modifies the copy of the shared variable in its own local memory (assign), t1 can still read the stale shared variable before t2 gets around to store and write, which push the change back to main memory. And even after t2 has written the change back, t1 will keep staring at the copy in its own local memory unless some mechanism tells it to re-read the variable from main memory.
Why does volatile guarantee visibility across threads? Because the JVM backs volatile with cache coherence. If you look at the assembly generated (by the interpreter or the JIT compiler) for code that uses volatile, you will notice that writes to a volatile field produce an instruction with a lock prefix. The lock prefix tells the CPU to do two things:
- flush the write to this variable to main memory immediately (that is, make the assign -> store -> write sequence of the volatile write execute atomically);
- notify the other CPUs over the bus that the shared variable has been updated; a CPU that also caches this variable invalidates the corresponding Cache Line in its own Cache, and the next time it reads the variable it finds the cache line invalid and goes back to main memory for it.
You can see that this is the cache coherence protocol at work underneath. In other words, once a variable is declared volatile, every write to it is flushed to main memory immediately and every subsequent read of it goes back to main memory.
atomicity
Atomicity means that one or more operations must be performed continuously and cannot be decomposed. As mentioned above, the JMM provides eight atomic operations, so let’s look at a few simple examples to see which operations are atomic at the code level.
For int variables a and b:
- a = 1 is atomic: the bytecode is a single assignment instruction (putfield/putstatic).
- a = b is not atomic: it takes a getfield to read b and then a putfield to assign a.
- a++ is essentially a = a + 1: getfield reads a, add computes a + 1, and putfield writes the result back to a; three steps, so not atomic. (A look at the generated bytecode follows this list.)
- Object obj = new Object() first allocates memory for the object (allocMemory), then calls <init> to run the constructor, and finally stores the object's heap address into the reference; this is more involved still and naturally not atomic.
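To make the a++ decomposition concrete, here is a small class whose increment method you can disassemble with javap -c; the bytecode shown in the comments is approximately what javac produces for a static int field (constant-pool indexes omitted), and it is the gap between the read and the write-back that lets increments get lost.

public class IncrementNotAtomic {
    static int i;

    static void increment() {
        i++;
        // javap -c shows roughly the following for this line
        // (a static field uses getstatic/putstatic; an instance field
        //  would use getfield/putfield instead):
        //   getstatic  i   // read the current value of i
        //   iconst_1
        //   iadd           // compute i + 1
        //   putstatic  i   // write the result back
        // Another thread may update i between getstatic and putstatic,
        // in which case one of the two increments is lost.
    }
}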
ordering
Because the CPU has multiple execution units of different types, several instructions can be issued in a single clock cycle. To maximize parallelism, the CPU dispatches instructions of different types to different execution units to run at the same time, and the compiler may also reorder instructions at compile time.
Such as:
a = 1;
b = a;
flag = true;
flag = true can be reordered before b = a, or even before a = 1. However, neither the compiler nor the CPU will reorder instructions that have a data dependency, so b = a will never be moved before a = 1; the compiler can also insert memory barriers to prevent the CPU from reordering.
For two instructions with a dependency, then, the execution order is preserved. For independent instructions, program order only guarantees visibility, not actual execution order: a = 1 is written before flag = true and therefore happens-before it, but a = 1 is not necessarily executed before flag = true; the guarantee is only that the effect of a = 1 is visible to flag = true.
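A minimal sketch of this flag pattern follows (whether the broken interleaving actually shows up on a given run depends on the JIT and the hardware, so treat it as an illustration rather than a reliable reproduction): with a plain flag the reader could in principle observe flag == true and still read the stale a; making flag volatile forbids the reordering and guarantees the write to a is visible.

// Writer publishes a through a volatile flag. The volatile write/read pair
// keeps a = 1 from moving after flag = true and makes it visible, so the
// reader always prints 1. With a non-volatile flag neither is guaranteed.
public class ReorderingSketch {
    static int a = 0;
    static volatile boolean flag = false;

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            a = 1;          // ordinary write
            flag = true;    // volatile write: a = 1 cannot be moved after this
        });
        Thread reader = new Thread(() -> {
            while (!flag) { /* spin until the writer publishes */ }
            System.out.println(a);  // guaranteed to print 1 while flag is volatile
        });
        reader.start();
        writer.start();
        writer.join();
        reader.join();
    }
}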
happens-before
In Java there are some built-in happens-before rules that let us determine the ordering between two operations (that is, whether one happens-before the other) and thus whether explicit synchronization is needed (a small example follows the list):
- Program order rule: within a single thread, each operation happens-before every operation written after it in program order.
- Volatile variable rule: a write to a volatile field happens-before every subsequent read of that same volatile field.
- Monitor lock rule: a thread's release of a lock happens-before any subsequent acquisition of that same lock, by any thread, including the one that just released it.
- Thread start rule: a call to a thread's start() method happens-before every action in that thread's run() method.
- Thread termination rule: if thread t1 calls t2.join() and detects that t2 has terminated, then everything t2 did happens-before t1's return from join().
- Thread interruption rule: a call to a thread's interrupt() method happens-before the interrupted thread detects the interruption.
- Object finalization rule: the completion of an object's construction (new) happens-before the invocation of its finalize() method.
- Transitivity: if A happens-before B and B happens-before C, then A happens-before C.
With these rules we can resolve the question raised at the beginning of this article: why do synchronized lock release, CAS updates and volatile writes all have the same semantics (that is, they all make writes to a shared variable immediately visible to all threads)?
Lock release has volatile field write semantics
new Thread(() -> {
    synchronized (lock) {
        int oldValue = sharedVariable;
        while (sharedVariable < MAX) {
            while (!changed) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            System.out.println(Thread.currentThread().getName() + " watched the change : " + oldValue + "->" + sharedVariable);
            oldValue = sharedVariable;
            changed = false;
            lock.notifyAll();
        }
    }
}, "t1").start();

new Thread(() -> {
    synchronized (lock) {
        int oldValue = sharedVariable;
        while (sharedVariable < MAX) {
            while (changed) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            System.out.println(Thread.currentThread().getName() + " do the change : " + sharedVariable + "->" + (++oldValue));
            sharedVariable = oldValue;
            changed = true;
            lock.notifyAll();
        }
    }
}, "t2").start();
- For the t2 thread on its own, by the program order rule, t2's write to the shared variable sharedVariable happens-before its release of the lock when it leaves the synchronized block.
- For t1 and t2 running concurrently, by the monitor lock rule, t2's release of the lock happens-before t1's subsequent acquisition of the lock.
- Again by the program order rule, t1's acquisition of the lock happens-before its read of the shared variable sharedVariable.
- By the three steps above and transitivity, t2's write to sharedVariable happens-before t1's read of sharedVariable.
Summary: by placing the reads and writes of an ordinary field inside lock acquisition and release, those reads and writes gain the same semantics as volatile field writes and reads.
CAS updates on atomic classes have volatile field write semantics
As mentioned earlier, the JMM requires JVM implementations to guarantee atomicity for reads and assignments of primitive and reference types, so we don't need to worry about those. But how do we guarantee atomicity for compound operations?
A classic example: if we start ten threads and each performs 10,000 i++ operations on the shared variable i, will the result be 100,000 as expected?
private static int i = 0;

public static void main(String[] args) throws InterruptedException {
    ArrayList<Thread> threads = new ArrayList<>();
    Stream.of("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9").forEach(threadName -> {
        Thread t = new Thread(() -> {
            for (int j = 0; j < 10000; j++) {
                i++;
            }
        }, threadName);
        threads.add(t);
        t.start();
    });
    for (Thread thread : threads) {
        thread.join();
    }
    System.out.println(i);
}
The author tested several times and failed to meet expectations.
You might say: just make i volatile. Really? Give it a try.
Even on reflection, volatile cannot work here: volatile only guarantees visibility of i, not atomicity of compound operations on it. i++ is a compound operation that decomposes into three steps: read i, compute i + 1, and assign the result back to i.
For the count to come out right, each i++ must happen-before the next i++. The program doesn't satisfy this by itself, but we can add code that makes it so: for example, putting i++ in a critical section exploits the monitor lock rule. Let's verify this:
private static int i = 0;
private static Object lock = new Object();

public static void main(String[] args) throws InterruptedException {
    ArrayList<Thread> threads = new ArrayList<>();
    Stream.of("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9").forEach(threadName -> {
        Thread t = new Thread(() -> {
            for (int j = 0; j < 10000; j++) {
                synchronized (lock) {
                    i++;
                }
            }
        }, threadName);
        threads.add(t);
        t.start();
    });
    for (Thread thread : threads) {
        thread.join();
    }
    System.out.println(i); // 100000
}
The result confirms that our reasoning was correct. That is the benefit of theoretical backing: it gives us a method for finding solutions. Concurrency is not black magic; with enough theory behind us we can write highly concurrent and correct code, and correctness is the first requirement of concurrency. With that settled, let's talk about efficiency.
So let's go back and see whether this code's concurrency can be improved. With synchronized, at any moment only one of the ten threads holds the lock while the other nine block. Blocking and waking a thread involves switching between user mode and kernel mode (see my article on how Java threads are implemented), which is expensive; and all that cost just to execute a single i++. The result is wasted CPU and lower overall throughput.
To solve this problem, CAS was born.
CAS (Compare And Swap/Set) is an atomic compound operation taking three operands: a memory address, the expected old value, and the new value. When a shared variable needs updating, CAS compares the value at the address with the expected old value; if they match, the value is updated, otherwise the update fails and the data at the address is left untouched.
CAS spinning is also known as optimistic locking: it assumes contention is usually low, so even if this attempt fails it can simply retry a few times, which is cheaper than blocking a thread. When actual contention is low, its performance is far better than synchronized; but when contention really is high, many threads spinning on CAS for a long time burn CPU instead. Because contention is low most of the time, CAS is often used in place of synchronized for a performance boost.
The CAS spins in the Unsafe class are as follows:
public final int getAndSetInt(Object var1, long var2, int var4) {
int var5;
do {
var5 = this.getIntVolatile(var1, var2);
} while(!this.compareAndSwapInt(var1, var2, var5, var4));
return var5;
}
On x86, CAS is implemented with the CMPXCHG (compare-and-exchange) instruction (other instruction sets differ). Java does not expose CAS directly: it is defined in the Unsafe class (meant to be called only by the Java core libraries) as the compareAndSwapXxx methods. We could reach it via reflection, but the AtomicXxx atomic classes provided by the JDK cover most needs.
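As an illustration of how such a spin looks when written against the public API rather than Unsafe, here is a sketch (the method name incrementUpTo is made up for this example) that atomically increments a counter but never past a cap, using AtomicInteger.compareAndSet in a retry loop:

import java.util.concurrent.atomic.AtomicInteger;

// A hand-written CAS spin using the public AtomicInteger API:
// atomically increment the counter, but never let it exceed the cap.
public class CasRetrySketch {
    static int incrementUpTo(AtomicInteger counter, int cap) {
        while (true) {
            int current = counter.get();             // volatile read of the current value
            if (current >= cap) {
                return current;                      // already at the cap, nothing to do
            }
            // The CAS succeeds only if no other thread changed the value in
            // between; otherwise re-read and retry.
            if (counter.compareAndSet(current, current + 1)) {
                return current + 1;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(9);
        System.out.println(incrementUpTo(counter, 10)); // 10
        System.out.println(incrementUpTo(counter, 10)); // stays at 10
    }
}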
Now let's compare the cost of having 10 threads run the i++ loops below with CAS versus synchronized.
CAS is around 200:
private static AtomicInteger i = new AtomicInteger(0);

public static void main(String[] args) throws InterruptedException {
    ArrayList<Thread> threads = new ArrayList<>();
    long begin = System.currentTimeMillis();
    Stream.of("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9").forEach(threadName -> {
        Thread t = new Thread(() -> {
            for (int j = 0; j < 10000; j++) {
                i.getAndIncrement();
            }
        }, threadName);
        threads.add(t);
        t.start();
    });
    for (Thread thread : threads) {
        thread.join();
    }
    long end = System.currentTimeMillis();
    System.out.println(end - begin); // between 70-90
}
Synchronized is around 480:
private static int i = 0;
private static Object lock = new Object();

public static void main(String[] args) throws InterruptedException {
    ArrayList<Thread> threads = new ArrayList<>();
    long begin = System.currentTimeMillis();
    Stream.of("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9").forEach(threadName -> {
        Thread t = new Thread(() -> {
            for (int j = 0; j < 1000000; j++) {
                synchronized (lock) {
                    i++;
                }
            }
        }, threadName);
        threads.add(t);
        t.start();
    });
    for (Thread thread : threads) {
        thread.join();
    }
    long end = System.currentTimeMillis();
    System.out.println(end - begin);
}
But the question remains: why do CAS updates on atomic classes have volatile write semantics? CAS by itself only guarantees that the use -> assign step is atomic.
Let's open AtomicInteger and look at how getAndSet is implemented:
public class AtomicInteger extends Number implements java.io.Serializable {
    private volatile int value;

    public final int getAndSet(int newValue) {
        return unsafe.getAndSetInt(this, valueOffset, newValue);
    }
}
You'll see that the atomic class wraps a volatile field, value, and that CAS updates operate on that volatile field. We already know that an update to a volatile field causes two things to happen:
- the write is flushed to main memory immediately;
- the other CPUs are notified to invalidate the corresponding cache lines.
Volatile disallows reordering
The other aspect of volatile semantics is that it forbids instruction reordering: the lock-prefixed instruction generated for a volatile write also acts as a barrier, so instructions before the barrier cannot be reordered to after it. The best illustration is the concurrency-hardened version of the singleton pattern.
Hungry man mode (eager initialization)
This approach uses the initialization phase of class loading: the class constructor <clinit> initializes explicitly declared static variables, and a class is initialized as soon as it is first actively referenced. (For active and passive references, class constructors and class loading, see In-depth Understanding of The Java Virtual Machine (Second Edition).)
public class SingletonObject1 {
    private static final SingletonObject1 instance = new SingletonObject1();

    public static SingletonObject1 getInstance() {
        return instance;
    }

    private SingletonObject1() {}
}
What counts as an active reference to a class:
- The four bytecode instructions new, getstatic, putstatic and invokestatic reference the class; at the language level this means creating an instance of the class, reading or writing a static field of the class, or calling a static method of the class.
- Making a reflective call to the class through the java.lang.reflect package.
- When a class is initialized, its superclass is initialized first if it has not been already.
- When the JVM starts, the class containing the main method is initialized first.
What counts as a passive reference to a class:
- Accessing a parent class's static variable through a subclass does not immediately initialize the subclass.
- A class referenced only through an array definition is not immediately initialized.
- Accessing a class's constants does not initialize the class (the constant has already been copied into the referencing class's constant pool by constant propagation at compile time).
Lazy man mode 1 (lazy initialization)
Create the instance only when it is needed (to avoid loading a large object into memory before it is ever used):
public class SingletonObject2 {
    private static SingletonObject2 instance = null;

    public static SingletonObject2 getInstance() {
        if (SingletonObject2.instance == null) {
            SingletonObject2.instance = new SingletonObject2();
        }
        return SingletonObject2.instance;
    }

    private SingletonObject2() {}
}
Lazy man mode 2
The lazy version above is fine for a single thread, but if getInstance is called concurrently, t1 may have passed the null check but not yet created the object when t2 also passes the null check; both threads then create an instance and SingletonObject2 ends up instantiated more than once. So we serialize the check-and-create with synchronized:
public class SingletonObject3 {
    private static SingletonObject3 instance = null;

    public static SingletonObject3 getInstance() {
        synchronized (SingletonObject3.class) {
            if (SingletonObject3.instance == null) {
                SingletonObject3.instance = new SingletonObject3();
            }
        }
        return SingletonObject3.instance;
    }

    private SingletonObject3() {}
}
DoubleCheckedLocking
We already know that synchronized is a heavyweight lock. Once the singleton has been created, every later call to getInstance still has to acquire the lock, which adds up to a real cost over time. So we check for null outside the lock first and only synchronize when the instance has not been created yet (double-checked locking):
public class SingletonObject4 {
    private static SingletonObject4 instance = null;

    public static SingletonObject4 getInstance() {
        if (SingletonObject4.instance == null) {
            synchronized (SingletonObject4.class) {
                if (SingletonObject4.instance == null) {
                    SingletonObject4.instance = new SingletonObject4();
                }
            }
        }
        return SingletonObject4.instance;
    }

    private SingletonObject4() {}
}
DCL2
Is this really safe? It is true that only one thread at a time can enter the synchronized block and create the object, but remember that new SingletonObject4() can itself be decomposed! The corresponding pseudo-instructions are:
allocMemory // Allocate memory for objects
<init> // Execute the object constructor
return reference // Returns the heap address of the object
And the above three steps are non-dependent, which means they can be reordered as follows:
allocMemory // Allocate memory for objects
return reference // Returns the heap address of the object
<init> // Execute the object constructor
With this reordering, thread t1 can find at the outer null check that the instance reference is already non-null and start using the instance before its constructor has run. If the object contains fields that have not yet been properly initialized (for example a reference field that is still null) and t1 happens to access one of them, a NullPointerException is thrown.
So we declare instance volatile to forbid <init> from being reordered to after the assignment of the instance reference:
public class SingletonObject5 {
    private volatile static SingletonObject5 instance = null;

    public static SingletonObject5 getInstance() {
        if (SingletonObject5.instance == null) {
            synchronized (SingletonObject5.class) {
                if (SingletonObject5.instance == null) {
                    SingletonObject5.instance = new SingletonObject5();
                }
            }
        }
        return SingletonObject5.instance;
    }

    private SingletonObject5() {}
}
InstanceHolder
We can also exploit the fact that a class is initialized only once and write a more elegant version by defining the singleton inside a static inner class:
public class SingletonObject6 {
    private static class InstanceHolder {
        public static SingletonObject6 instance = new SingletonObject6();
    }

    public static SingletonObject6 getInstance() {
        return InstanceHolder.instance;
    }

    private SingletonObject6() {}
}
The constructor of an enumerated instance is called only once
This is required by the JVM specification and must be guaranteed by JVM implementations.
public class SingletonObject7 {
    private static enum Singleton {
        INSTANCE;

        SingletonObject7 instance;

        private Singleton() {
            instance = new SingletonObject7();
        }
    }

    public static SingletonObject7 getInstance() {
        return Singleton.INSTANCE.instance;
    }

    private SingletonObject7() {}
}
(End of article)
References
- A thorough article on pseudo-sharing, cache line padding, and CPU caching
- What you should know about CPU caches
- 7 examples of popular science CPU Cache
- False sharing
Cache consistency protocol:
- MSI
- MESI