Please credit the original source when reposting, thank you!
Preface
volatile is a popular keyword in Java. It comes up constantly in interviews and gets debated in all kinds of technical chat groups, yet the discussions never seem to reach a satisfying conclusion. With those doubts in mind, let's comb through it once more from the perspective of the JVM, C++, and assembly.
volatile has two main features: it prohibits reordering and it guarantees memory visibility. If you are not familiar with these concepts, see the Java volatile keyword first.
I know the concepts, but I’m still confused. How are they actually implemented?
This article touches on some assembly; it may take a couple of reads, but it should be understandable.
Reordering
To understand reordering, take a look at a simple piece of code
public class VolatileTest {
    int a = 0;
    int b = 0;

    public void set() {
        a = 1;
        b = 1;
    }

    public void loop() {
        while (b == 0) continue;
        if (a == 1) {
            System.out.println("i'm here");
        } else {
            System.out.println("what's wrong");
        }
    }
}
The VolatileTest class has two methods, set() and loop(). If thread B executes loop() and thread A then executes set(), is it guaranteed that "i'm here" gets printed?
The answer is no, because both compiler reordering and CPU instruction reordering are involved.
Compiler reordering
Without changing the single-thread semantics, the compiler is free to reorder bytecode instructions to make the program faster, so after compilation the assignment order of a and b in the code may change to b first, then a.
For thread A alone, it doesn't matter which one is set first.
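As a minimal illustration (a hypothetical C++ analogue, not from the original article), the same situation can be written in C++; because the two stores are independent, an optimizing compiler is allowed to emit them in either order:

// C++ analogue of set(): the two stores do not depend on each other, so an
// optimizing compiler may emit the store to b before the store to a without
// changing single-thread semantics. Compile with e.g. g++ -O2 -S to inspect
// the emitted order.
int a = 0;
int b = 0;

void set() {
    a = 1;
    b = 1;
}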
CPU instruction reordering
What about CPU instruction reordering? Before we dive in, let's take a look at the x86 CPU cache architecture.
1. Registers, used to store local variables and function parameters; each access takes 1 cycle, less than 1ns.
2. L1 Cache, the level 1 cache, local to each core, split into a 32K data cache (L1d) and a 32K instruction cache (L1i); accessing L1 takes 3 cycles, roughly 1ns.
3. L2 Cache, the level 2 cache, also local to each core, designed as a buffer between the L1 cache and the shared L3 cache, 256K in size; accessing L2 takes 12 cycles, roughly 3ns.
4. L3 Cache, the level 3 cache, shared by all cores in the same socket, divided into multiple 2M segments; accessing L3 takes 38 cycles, roughly 12ns.
Then there is main memory (DRAM); a memory access generally takes about 65ns, so for the CPU, memory is slow compared with the caches.
L1 and L2 are private to each core, so their data is not shared between cores; the MESI protocol is used to keep the caches coherent, but that comes at a cost.
In the MESI protocol, each Cache line has four states:
1. M (Modified): the line is valid but has been modified; its data is inconsistent with memory.
2. E (Exclusive): the line is valid and consistent with memory; the data exists only in the local cache.
3. S (Shared): the line is valid and consistent with memory; the data is present in multiple caches.
4. I (Invalid): the data in this line is invalid.
The cache controller of each core not only sees its own reads and writes, it also snoops on the reads and writes of the other caches. Suppose there are four cores:
1. Core1 loads variable X from memory with value 10; the cache line holding X in Core1 is in state E.
2. Core2 also loads variable X from memory; the cache lines holding X in Core1 and Core2 both change to state S.
3. Core3 also loads variable X from memory and then sets X to 20; the cache line in Core3 changes to M, and the corresponding lines in the other cores change to I.
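To make those state changes easy to follow, here is a tiny toy model in C++ (purely illustrative, not how real hardware works; all names are made up):

#include <cstdio>

// Toy model of the walkthrough above: we only track the MESI state of the
// cache line holding X in each core, and replay the three steps by hand.
enum State { M, E, S, I };

int main() {
    State core1 = I, core2 = I, core3 = I;

    core1 = E;                       // 1. Core1 loads X (value 10): exclusive copy
    core1 = S; core2 = S;            // 2. Core2 also loads X: both copies shared
    core3 = M; core1 = I; core2 = I; // 3. Core3 loads X and writes 20: modified, others invalidated

    printf("core1=%d core2=%d core3=%d\n", core1, core2, core3); // core1/core2 are I, core3 is M
    return 0;
}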
For example, Intel's Core i7 processors use the MESIF protocol, which evolved from MESI; F (Forward) is derived from the Shared state, and a cache line in the F state can pass its data directly to other cores. We don't need to worry about that here.
After a long evolution of optimizations, a LoadBuffer and a StoreBuffer were added between the registers and the L1 cache to reduce stall time. Together they form the Memory Ordering Buffers (MOB); the load buffer is 64 entries long and the store buffer 36 entries long, and the CPU does not have to wait while data moves between these buffers and L1.
1. When the CPU performs a load (read), it puts the read request into the LoadBuffer, so it does not have to wait for other CPUs to respond.
2. When the CPU performs a store (write), it writes the data into the StoreBuffer, and the StoreBuffer contents are flushed to main memory at an appropriate later time.
Because of the StoreBuffer, when the CPU writes data the value does not immediately end up in memory, so it is not yet visible to other CPUs; similarly, a request sitting in the LoadBuffer may not see the latest value just written by another CPU.
Since the store buffer and the load buffer drain asynchronously, from the outside there is no strict ordering between a write followed by a read, or a read followed by a write. A classic demonstration of this is sketched below.
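Here is the classic store-buffering litmus test, written in C++ as a hypothetical illustration (not part of the original article): with relaxed stores and loads, both threads can observe 0, which is exactly the store-to-load reordering that the StoreBuffer allows; a StoreLoad barrier between the store and the load rules that outcome out.

#include <atomic>
#include <cstdio>
#include <thread>

// Store-buffering litmus test: each thread's store may still be sitting in
// its StoreBuffer when the other thread loads, so r1 == 0 && r2 == 0 is a
// possible outcome unless a full (StoreLoad) fence separates store and load.
std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        // std::atomic_thread_fence(std::memory_order_seq_cst); // StoreLoad barrier
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        // std::atomic_thread_fence(std::memory_order_seq_cst); // StoreLoad barrier
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    printf("r1=%d r2=%d\n", r1, r2); // without the fences, r1=0 r2=0 can be observed
    return 0;
}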
How is memory visibility implemented
From the analysis above, loads and stores are handled asynchronously, which is why a write made by one CPU is not immediately visible to the others. How, then, can a CPU be sure to get the latest data on a load?
Setting volatile variables
Let's write some simple Java code that declares a volatile variable and assigns it a value:
public class VolatileTest {
    static volatile int i;

    public static void main(String[] args) {
        i = 10;
    }
}
Look at the compiled bytecode with javap -verbose VolatileTest:
The result is a bit disappointing: unlike synchronized, which compiles down to dedicated bytecode instructions (monitorenter, monitorexit), the putstatic generated for a volatile field looks no different from a normal one. The only difference is that the access flags of the field i include ACC_VOLATILE.
That flag, however, is a good place to start: when in doubt, search for where the keyword is used. A global search for ACC_VOLATILE in the HotSpot source does indeed turn up a similarly named accessor in accessFlags.hpp.
is_volatile() tells whether a field is volatile. Searching globally for where is_volatile() is used leads to the interpreter implementation of the putstatic bytecode instruction in bytecodeInterpreter.cpp, which calls the is_volatile() method.
Of course, during normal execution this interpreter logic is not what actually runs; the machine code generated for the bytecode is executed directly. This code path is useful when debugging, but the final logic is the same.
The cache variable here is the constant pool cache entry for the Java variable i. Because i is volatile, cache->is_volatile() is true, and the assignment to i is performed by release_int_field_put.
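The relevant branch looks roughly like this (a paraphrased sketch of the JDK 7 HotSpot bytecodeInterpreter.cpp, simplified to the int case, not a verbatim copy):

// Inside the putstatic/putfield handler, after the constant pool cache entry
// has been resolved into "cache" and the target oop into "obj":
if (cache->is_volatile()) {
    // volatile field: store with release semantics ...
    obj->release_int_field_put(field_offset, STACK_INT(-1));
    // ... then issue a StoreLoad barrier after the write
    OrderAccess::storeload();
} else {
    // normal field: plain store
    obj->int_field_put(field_offset, STACK_INT(-1));
}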
Look again at the release_int_field_put method
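It is a thin wrapper (roughly as in HotSpot's oop.inline.hpp; quoted from memory, so treat it as a sketch):

inline void oopDesc::release_int_field_put(int offset, jint contents) {
    OrderAccess::release_store(int_field_addr(offset), contents);
}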
OrderAccess::release_store is where the magic happens that lets other threads read the latest value of variable i.
Interestingly, in the implementation of OrderAccess::release_store the first parameter is declared volatile; this time it is clearly the C/C++ keyword.
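On linux_x86 the implementation is essentially a single volatile store (roughly as in orderAccess_linux_x86.inline.hpp; a sketch, not a verbatim copy):

inline void OrderAccess::release_store(volatile jint* p, jint v) { *p = v; }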
The C/C++ keyword volatile is a type qualifier for variables and is commonly used as a language-level compiler barrier. "The C++ Programming Language" describes it as follows:
A volatile specifier is a hint to a compiler that an object may change its value in ways not specified by the language so that aggressive optimizations must be avoided.
volatile is a type modifier. A variable declared volatile is understood to be able to change at any time, so every time it is used its value must be re-read from the variable's memory address. Consider the following code:
#include <iostream>

int foo = 10;
int a = 1;

int main(int argc, const char * argv[]) {
    // insert code here...
    a = 2;
    a = foo + 10;
    int b = a + 20;
    return b;
}
The first assignment to a in this code is actually redundant. Compiling with g++ -S -O2 main.cpp produces the following assembly:
In the generated assembly you can see that the redundant operations on variable a have indeed been optimized away. Now declare variable a as volatile:
#include <iostream>

int foo = 10;
volatile int a = 1;

int main(int argc, const char * argv[]) {
    // insert code here...
    a = 2;
    a = foo + 10;
    int b = a + 20;
    return b;
}
Generating the assembly again gives the following:
Compared with the first time, there are the following differences:
The assignment of 2 to a is retained even though it is a dead store; volatile prevents that optimization and acts as a compiler barrier.
Compiler barriers avoid the out-of-order memory accesses that compiler optimizations can introduce. You can also insert a compiler barrier into the code by hand; for example, the following code has the same effect as volatile:
#include <iostream>

int foo = 10;
int a = 1;

int main(int argc, const char * argv[]) {
    // insert code here...
    a = 2;
    __asm__ volatile ("" : : : "memory"); // compiler barrier
    a = foo + 10;
    __asm__ volatile ("" : : : "memory"); // compiler barrier
    int b = a + 20;
    return b;
}
Compiled, the result is similar to the volatile version above:
Here _a(%rip) is the address of variable a, so movl $2, _a(%rip) writes the value 2 directly to variable a's memory address.
So every assignment to variable a is written to memory, and every read of it is reloaded from memory.
Feeling a bit off track, let’s go back to the JVM code.
After the assignment, OrderAccess::storeload() is executed.
This is, in fact, the memory barrier people talk about so often; we had read about it before but never knew how it was implemented. From the analysis of the CPU cache structure we know that a load first goes through the LoadBuffer and then reads from memory, while a store first goes into the StoreBuffer and is later written to the cache. Both operations are asynchronous and can lead to incorrect instruction reordering, so the JVM defines a series of memory barriers to constrain the order in which instructions execute.
The memory barriers defined in the JVM are listed below (as implemented in JDK 1.7):
1. LoadLoad barrier (load1, LoadLoad, load2)
2. LoadStore barrier (load, LoadStore, store)
Both of these barriers are implemented through the acquire() method:
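On linux_x86 it looks roughly like this (a sketch based on orderAccess_linux_x86.inline.hpp, quoted from memory):

inline void OrderAccess::acquire() {
    volatile intptr_t local_dummy;
#ifdef AMD64
    __asm__ volatile ("movq 0(%%rsp), %0" : "=r" (local_dummy) : : "memory");
#else
    __asm__ volatile ("movl 0(%%esp),%0"  : "=r" (local_dummy) : : "memory");
#endif // AMD64
}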
Here __asm__ marks the start of inline assembly; volatile, as analyzed earlier, prevents the compiler from optimizing the code away; as for the instruction in the middle, even after compiling it I still did not fully understand it; and the trailing "memory" is what provides the compiler-barrier effect.
Inserting this barrier drains the loads that entered the LoadBuffer before the barrier before the operations after the barrier are allowed to execute, ensuring that the data read by the load is ready before the next store instruction.
3. StoreStore barrier (store1, StoreStore, store2), implemented with the release() method:
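On x86 it is very lightweight (again a sketch of orderAccess_linux_x86.inline.hpp, quoted from memory); a dummy volatile store is enough because x86 does not reorder store-store:

inline void OrderAccess::release() {
    // Avoid hitting the same cache-line from different threads.
    volatile jint local_dummy = 0;
}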
Inserting this barrier drains the stores already in the StoreBuffer before the barrier, and only then may the stores after the barrier execute, ensuring that the data written by store1 is visible to other CPUs by the time store2 executes.
4. StoreLoad barrier (store, StoreLoad, load). This is the barrier inserted after a volatile variable is assigned in Java, implemented with the fence() method:
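Roughly (a sketch of the JDK 7 orderAccess_linux_x86.inline.hpp, quoted from memory):

inline void OrderAccess::fence() {
    if (os::is_MP()) {
        // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
        __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
        __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
    }
}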
Are you excited to see this?
It first checks for multiple processors with os::is_MP(); if there is only one CPU, none of these problems exist.
The StoreLoad barrier itself is implemented entirely by the following instruction:
__asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc"."memory");
Copy the code
To see what this instruction actually does, let's write some C++ code and compile it.
#include <iostream>

int foo = 10;

int main(int argc, const char * argv[]) {
    // insert code here...
    volatile int a = foo + 10;
    // __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
    volatile int b = foo + 20;
    return 0;
}
Variables a and b are declared volatile so that the compiler does not optimize them away. The compiled assembly is as follows:
As you can see from the generated code, the second time foo is used it is not reloaded from memory; the value already held in a register is used instead.
Now uncomment the __asm__ volatile line and recompile:
Compared with before there are two extra instructions: lock and addl. The lock prefix works as follows: while the locked instruction executes, the processor asserts the LOCK# signal, which locks the bus and prevents other CPUs from accessing memory through it until the instruction completes, making the instruction effectively atomic; in addition, read and write requests issued before the lock instruction cannot be reordered past it, so it is equivalent to a memory barrier.
There is one more difference: the second time foo is used, it is reloaded from memory, guaranteeing that the latest value of foo is read. That part is achieved by the following:
__asm__ volatile ("" : : : "cc", "memory");
This is again a compiler barrier: it tells the compiler to regenerate the load instructions instead of reusing values already cached in registers.
Reading volatile variables
Back in the same bytecodeInterpreter.cpp file, find the interpreter implementation of the getstatic bytecode instruction.
Obtain the variable value through obj->obj_field_acquire(field_offset)
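For an int field, the corresponding accessor looks roughly like this (a sketch based on HotSpot's oop.inline.hpp, quoted from memory; obj_field_acquire mentioned above is the analogous accessor for object-reference fields):

inline jint oopDesc::int_field_acquire(int offset) const {
    return OrderAccess::load_acquire(int_field_addr(offset));
}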
The final implementation is OrderAccess::load_acquire
inline jint OrderAccess::load_acquire(volatile jint* p) { return *p; }
So the volatile read ultimately rests on the C++ volatile keyword: its compiler-barrier semantics force the value to be re-read from memory every time, which is how the latest value is always obtained.