The volatile keyword

  1. We are all familiar with the volatile keyword and its two guarantees, visibility and the prohibition of reordering, but how are they achieved? How does the JVM actually deliver these features?
  2. Why do DCL singletons need volatile?
  • Let's cut through the fog and see how it works under the hood
Bytecode layer
  • The best way to understand the volatile keyword is to look at the bytecode and the disassembled code
    • Useful tools here: HSDIS (HotSpot Disassembler) and JITWatch (JIT compile-log analysis)
    public class VolatileDemo {
        private static volatile int i = 0;

        public static void n() {
            i++;
        }

        public static void main(String[] args) {
            for (int j = 0; j < 1_000_000; j++) {
                n();
            }
        }
    }

    // javap -c output for n(), i.e. the bytecode of i++:
    0: getstatic     #2   // Field i:I
    3: iconst_1
    4: iadd
    5: putstatic     #2   // Field i:I
    8: return
    • The bytecode is identical with or without volatile, so how does the JVM know about the volatile keyword? Through the field's access flags, reached via constant-pool entry #2
      • With jclasslib (or javap -v -p) you can see the field's flags value 0x004A = private (0x0002) | static (0x0008) | volatile (0x0040)
    private static volatile int i;
      descriptor: I
      flags: ACC_PRIVATE, ACC_STATIC, ACC_VOLATILE
  1. To see how HotSpot executes the bytecode, look at BytecodeInterpreter: no compilation optimizations are applied there; the bytecode is purely interpreted at run time
    • volatile is only a keyword at the Java layer; the real implementation is left to each VM. The JVM specification only states the rule:
    • A write to a volatile field (§8.3.1.4) happens-before every subsequent read of that field.
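    To make the happens-before rule concrete, here is a minimal sketch (class and field names are mine): the volatile write to ready publishes the earlier plain write to data, so a reader that observes ready == true is guaranteed to see data == 42.

    public class HappensBeforeDemo {
        static int data = 0;              // plain field
        static volatile boolean ready = false;

        public static void main(String[] args) {
            Thread reader = new Thread(() -> {
                while (!ready) { }        // spins until the volatile write becomes visible
                System.out.println(data); // guaranteed to print 42, never 0
            });
            reader.start();
            data = 42;                    // ordinary write
            ready = true;                 // volatile write: publishes data as well
        }
    }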
BytecodeInterpreter bytecode analysis
  • Call-chain analysis:
  1. BytecodeInterpreter handles the putstatic bytecode in CASE(_putstatic)
CASE(_putfield):
CASE(_putstatic):
  ......
  if (cache->is_volatile()) {   // 1: the i field carries ACC_VOLATILE
    ......
    OrderAccess::storeload();
  }

// 1: accessFlags.hpp — for the i field, is_volatile() returns true
bool is_volatile() const { return (_flags & JVM_ACC_VOLATILE) != 0; }
  • Memory barriers again: does StoreLoad look familiar?
  1. OrderAccess is the parent class; the concrete implementation class varies from platform to platform
    class OrderAccess : AllStatic {
     public:
      // memory-barrier related methods
      static void loadload();
      static void storestore();
      static void loadstore();
      static void storeload();
      static void acquire();
      static void release();
      static void fence();
    };

  2. See each method's implementation in the platform class orderAccess_linux_x86.inline.hpp
    • Only storeload() calls fence(); the other three barriers do not
    inline void OrderAccess::loadload()   { acquire(); }
    inline void OrderAccess::storestore() { release(); }
    inline void OrderAccess::loadstore()  { acquire(); }
    inline void OrderAccess::storeload()  { fence();   }

    inline void OrderAccess::fence() {
      if (os::is_MP()) {   // is_MP(): return (_processor_count != 1)
    #ifdef AMD64
        __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
    #else
        __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
    #endif
      }
    }
    • lock; addl $0,0(%%rsp) adds 0 to the word at the stack pointer: the addl itself is a null statement; the whole point is the lock prefix in front of it
  3. So, as the analysis shows, the volatile keyword relies on a lock-prefixed instruction at the assembly level to deliver its Java-layer guarantees: visibility, the ban on reordering, and so on
  4. Flowchart of the call chain above

Volatile features

  • The volatile keyword has the following properties: visibility, no reordering, and partial atomicity (each individual read or write is atomic, but compound operations such as i++ are not; see the sketch below).
  • Both visibility and the reordering ban come from the LOCK-prefixed instruction, which is equivalent to inserting a memory barrier.
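  A small sketch of "partial atomicity" (class and names are mine): each individual read or write of the volatile field is atomic, but i++ is a read-add-write sequence, so updates can still be lost under contention.

    public class VolatileNotAtomic {
        static volatile int i = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> {
                for (int j = 0; j < 100_000; j++) i++;  // read-add-write, not atomic
            };
            Thread t1 = new Thread(task), t2 = new Thread(task);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(i);  // usually prints less than 200000
        }
    }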
The memory barrier
  1. CPU writes involve two hardware structures, the store buffer and the invalidate queue, and follow one of two main strategies:
    1. Write back: when the CPU writes data, it first puts the data into the store buffer and flushes it to memory at an appropriate later point in time.
    2. Write through: when the CPU writes data, it writes to the store buffer and to memory simultaneously.
  2. Most CPUs use the write-back policy: the CPU writes to memory asynchronously, and the latency is short enough to be acceptable. Only in a few special cases, such as a multithreaded environment with strict memory-visibility requirements, do CPU writes need to appear synchronous to the outside world, and that is achieved with memory barriers (the LOCK instruction)
  3. The compiler and the CPU may reorder instructions for performance as long as the output is unchanged; inserting a memory barrier tells both of them that instructions before the barrier must complete before instructions after it execute
  4. Another function of the memory barrier is to force cache updates on the other CPUs. For a write to a volatile field this means (see the fence sketch after this list):
    • once the write completes, any thread that reads the field gets the latest value;
    • before the write completes, everything that happened earlier is guaranteed to have happened, and any updated values are visible, because the barrier flushes all previous writes out of the cache.
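  The barrier kinds from OrderAccess have Java-level counterparts since Java 9; a sketch of their placement (field and method names are mine; the mapping to hardware instructions is up to the JIT):

    import java.lang.invoke.VarHandle;

    public class FenceDemo {
        static int a, b;

        static void writer() {
            a = 1;
            VarHandle.releaseFence();   // StoreStore + LoadStore: the write of a stays above
            b = 1;
        }

        static void reader() {
            int rb = b;
            VarHandle.acquireFence();   // LoadLoad + LoadStore: the read of b stays above
            int ra = a;
            System.out.println(ra + "," + rb);
            // VarHandle.fullFence() gives the StoreLoad-strength barrier,
            // the one HotSpot implements with the lock-prefixed addl shown earlier.
        }
    }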
The Lock instruction
  • The LOCK instruction: all x86 CPUs can lock a specific memory address; once it is locked, no other agent on the system bus can read or modify that address.
  • The LOCK prefix causes the CPU to assert the LOCK# signal, which guarantees exclusive use of the memory address on multiprocessor systems and under thread contention. The lock is released when the instruction completes.
  • When the memory is modified, the other CPUs need to learn that it has changed, which is where cache consistency comes in
Cache consistency principle
  • A cache-consistency protocol keeps shared data coherent across the caches of multiple CPUs
  • Cache line: the smallest unit of data exchanged between a cache and memory, typically 32 or 64 bytes depending on the CPU


The Cache line state
  • A cache line is in one of four states: Modified, Exclusive, Shared, or Invalid (MESI):
State | Description
M (Modified) | The contents of the cache line have been modified; the line is cached only in this CPU, and its data is inconsistent with main memory
E (Exclusive) | The data exists only in this CPU's cache, not in any other CPU's, and is consistent with main memory
S (Shared) | The data is shared with other CPUs and is consistent with main memory
I (Invalid) | The data in this CPU's cache is invalid and must be re-fetched from main memory
  • Cache-line data in the M and E states is held by only one CPU; the difference is that M-state data is inconsistent with main memory while E-state data is consistent with it
State transition
  • Each CPU not only knows its own state but also observes the reads and writes of the other caches by snooping the bus; each cache line migrates among the four states according to the reads and writes of the local core and the other cores.
  • There are four operation types: local read, local write, remote read, and remote write, giving 4 states × 4 operations = 16 transitions in total; a simplified sketch of the transition function follows below.

  • The MESI state transitions are driven mainly by the CPU snooping protocol.
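  As a stand-in for the transition table, here is a simplified sketch of the 16 transitions in Java (the encoding and names are mine; othersHoldCopy is a hypothetical parameter that matters only for the I + local-read case):

    public class MesiSketch {
        enum State { M, E, S, I }
        enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

        // Next state of one cache line in one core, given an event.
        static State next(State s, Event e, boolean othersHoldCopy) {
            switch (e) {
                case LOCAL_WRITE:  return State.M;                  // a write always ends in M
                case REMOTE_WRITE: return State.I;                  // another core took ownership
                case REMOTE_READ:  return s == State.I ? State.I
                                                       : State.S;   // M/E/S degrade to Shared (M writes back first)
                case LOCAL_READ:   return s == State.I
                                           ? (othersHoldCopy ? State.S : State.E)
                                           : s;                     // M/E/S keep their state on a local read
            }
            throw new AssertionError();
        }
    }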

CPU snooping protocol
  • All memory transfers happen on a shared bus that is visible to all processors: each cache is private, but memory is a shared resource, and all access to it is arbitrated (within one instruction cycle, only one CPU cache may read or write memory).
  • A CPU cache does not touch the bus only during its own memory transfers: it constantly snoops the data exchanges on the bus to track what the other caches are doing. So whenever one cache reads or writes memory on behalf of its processor, the other processors are notified and keep their caches in sync.
Lock instruction
  1. The original LOCK implementation locked the bus: other CPUs' memory reads and writes were blocked until the lock was released. Later processors lock the cache instead, because bus locking is expensive; while the bus is locked, no other CPU can access memory at all
  2. A locked write writes the modified data back while invalidating the related cache lines on other CPUs, which then reload the latest data from main memory
  3. It also acts as a memory barrier, preventing instructions on either side of it from being reordered across it: this is exactly why DCL needs volatile
Question to consider
  • Since CPUs already have the MESI protocol to keep caches consistent, why do we still need volatile for visibility (memory barriers)? Or is the cache-consistency protocol only triggered when volatile variables run on a multi-core CPU?
  1. On a multi-core machine, every CPU operation goes through the cache-consistency machinery, but the protocol alone is weak: it does not guarantee that a change made by one thread is immediately visible to the others. The writing CPU can modify the data and move on to other work; even though the other CPUs' copies are already marked Invalid, the modified value has not yet been flushed back to main memory, so a CPU that needs the variable re-reads the stale value from main memory. volatile guarantees visibility by flushing the value back to main memory immediately: the modify and the write-back must behave as one atomic operation.
  2. Ordinary operations do not pay this cost; the cache line is flushed eagerly only when the variable is volatile. (See the sketch below.)
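  A sketch of the visibility gap just described (names are mine): without volatile, the reading thread may never observe the update and spin forever; marking the field volatile forces the read through to coherent memory.

    public class VisibilityDemo {
        static boolean stop = false;   // try adding volatile here

        public static void main(String[] args) throws InterruptedException {
            Thread t = new Thread(() -> {
                while (!stop) { }      // may spin forever without volatile
                System.out.println("stopped");
            });
            t.start();
            Thread.sleep(100);
            stop = true;               // with volatile this is seen promptly
        }
    }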
Application
  • Now that we know how volatile works, what does it buy us in everyday Java programming?
  1. What problems can DCL cause, and why does adding the volatile keyword avoid them? (See the sketch below.)
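  A sketch of the classic answer: without volatile, instance = new Singleton() can be reordered (allocate, publish the reference, then run the constructor), so another thread may observe a non-null but half-constructed object.

    public class Singleton {
        private static volatile Singleton instance;  // remove volatile and the first check becomes unsafe

        private Singleton() { }

        public static Singleton getInstance() {
            if (instance == null) {                  // first check, no lock
                synchronized (Singleton.class) {
                    if (instance == null) {          // second check, under the lock
                        instance = new Singleton();  // the volatile write forbids the reordering
                    }
                }
            }
            return instance;
        }
    }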

2. How do we solve the performance problems caused by false sharing? Java's ConcurrentHashMap and RxJava's QueueDrainSubscriber both deal with false sharing, where independent fields used by different threads land on the same cache line

  • How to avoid false sharing: 1. manually pad fields out to the cache-line size; 2. use the @sun.misc.Contended annotation
  • RxJava 2 uses the first approach
    // QueueDrainSubscriber inherits a chain of padding classes:
    public abstract class QueueDrainSubscriber<T, U, V> extends QueueDrainSubscriberPad4 ...

    class QueueDrainSubscriberPad0 {
        volatile long p1, p2, p3, p4, p5, p6, p7;
        volatile long p8, p9, p10, p11, p12, p13, p14, p15;
    }

    /** The WIP counter. */
    class QueueDrainSubscriberWip extends QueueDrainSubscriberPad0 {
        final AtomicInteger wip = new AtomicInteger();
    }

    /** Pads away the wip from the other fields. */
    class QueueDrainSubscriberPad2 extends QueueDrainSubscriberWip {
        volatile long p1a, p2a, p3a, p4a, p5a, p6a, p7a;
        volatile long p8a, p9a, p10a, p11a, p12a, p13a, p14a, p15a;
    }

    /** Contains the requested field. */
    class QueueDrainSubscriberPad3 extends QueueDrainSubscriberPad2 {
        final AtomicLong requested = new AtomicLong();
    }

    /** Pads away the requested from the other fields. */
    class QueueDrainSubscriberPad4 extends QueueDrainSubscriberPad3 {
        volatile long q1, q2, q3, q4, q5, q6, q7;
        volatile long q8, q9, q10, q11, q12, q13, q14, q15;
    }
  • The second approach, the annotation, is used by ConcurrentHashMap's size() (a minimal standalone usage sketch follows at the end of this section)
    // size() => sumCount()
    @sun.misc.Contended
    static final class CounterCell {
        volatile long value;
        CounterCell(long x) { value = x; }
    }

    final long sumCount() {
        CounterCell[] as = counterCells; CounterCell a;
        long sum = baseCount;   // used when there is no contention
        if (as != null) {
            for (int i = 0; i < as.length; ++i) {
                if ((a = as[i]) != null)
                    sum += a.value;
            }
        }
        return sum;
    }
    • baseCount is used when there is no contention; under contention, a CAS is applied to a CounterCell instead, whose slot in the array is chosen by hash & (array length - 1) (e.g., with 8 slots, hash & 7), and the array grows the same way a HashMap does
    • Array elements are contiguous in memory and CounterCell holds only a single long field, so under heavy multithreaded contention even cells at different indices, say 0 and 1, would still cause false sharing. @Contended solves this by placing each value on its own cache line.
  • Note: not every false-sharing scenario needs to be fixed. The CPU cache is limited and padding sacrifices part of it, which is why Android removed @jdk.internal.vm.annotation.Contended
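  A minimal usage sketch of the annotation approach (assuming Java 8's sun.misc.Contended; the class and field names are mine, and user classes also need the -XX:-RestrictContended flag, otherwise the JVM ignores the annotation outside the JDK):

    import sun.misc.Contended;

    public class PaddedCounters {
        @Contended          // isolate each hot field on its own cache line
        volatile long hits;

        @Contended
        volatile long misses;
    }
    // compile normally, then run the JVM with -XX:-RestrictContended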

References

  1. Zhou Zhiming, Understanding the Java Virtual Machine
  2. Java disassembly tools and their usage (javap, HSDIS, JITWatch)
  3. JSR-133 (Java Memory Model), Chinese translation