A first look at volatile
The Java Language Specification (third edition) describes volatile roughly as follows: the Java programming language allows threads to access shared variables; to ensure that a shared variable is updated accurately and consistently, a thread should normally ensure that it has exclusive use of the variable by obtaining an exclusive lock on it. This may sound abstract, so let's start with an example:
package com.zwx.concurrent;

public class VolatileDemo {

    public static boolean finishFlag = false;

    public static void main(String[] args) throws InterruptedException {
        new Thread(() -> {
            int i = 0;
            while (!finishFlag) {
                i++;
            }
        }, "t1").start();

        Thread.sleep(1000); // make sure t1 enters the while loop before the main thread changes finishFlag
        finishFlag = true;
    }
}
The while loop in t1 never stops, even though the main thread has changed finishFlag, because the new value is not visible to thread t1. Now declare finishFlag as volatile:
public static volatile boolean finishFlag = false;
If you run it again, you’ll see that the while loop stops pretty quickly. From this example we can see that volatile solves the problem of variable visibility between threads. Visibility means that when one thread modifies a shared variable, another thread can read the changed value.
How does volatile guarantee visibility
Using HSDIS, we can see that the assembly printed for the volatile write contains a lock-prefixed instruction. lock is a control prefix: in a multi-processor environment, a lock-prefixed instruction achieves visibility through either a bus lock or a cache lock.
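For reference, here is a minimal sketch of how that output can be reproduced with HotSpot's diagnostic flags. This assumes the hsdis disassembler library is installed and picked up by the JVM; the exact instructions printed vary by platform and JVM version, and the `lock addl` line in the comments is only an illustrative example of typical x86 output:

```java
// Run with: java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly VolatileAssemblyDemo
// (requires the hsdis disassembler plugin; the output below is illustrative, for x86)
public class VolatileAssemblyDemo {

    private volatile long counter = 0;

    public static void main(String[] args) {
        VolatileAssemblyDemo demo = new VolatileAssemblyDemo();
        for (int i = 0; i < 1_000_000; i++) { // loop enough times for the JIT to compile the write
            demo.counter = i;                 // volatile write
        }
        // On x86, the JIT-compiled volatile write is typically followed by something like:
        //   lock addl $0x0,(%rsp)
        // i.e. a lock-prefixed instruction that serves as the memory barrier.
    }
}
```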
The nature of visibility
Hardware level
A thread is the smallest unit of CPU scheduling, and threads ultimately exist to make better use of the processor. However, the vast majority of computing tasks cannot be completed by the processor alone: the processor also has to interact with memory, for example to read operands and store results, and this I/O cannot be eliminated. Because the speed gap between the computer's storage devices and the processor is very large, modern computer systems add layers of cache between memory and the processor, running as close to the processor's speed as possible. The data needed for a computation is copied into the cache so that the computation can proceed quickly, and the result is synchronized from the cache back to memory when it is finished. Looking at the configuration of a personal computer, you can see that the CPU has L1, L2 and L3 caches; the rough structure is shown in the figure below.

As can be seen from the figure, the L1 and L2 caches are private to each CPU. With caches in place, each CPU first caches the data it needs, reads it from the cache while computing, writes the result back to the cache, and only synchronizes the cached data to main memory once the whole operation is complete. Since there are multiple CPUs, each thread may run on a different CPU, and each CPU has its own cache, the same piece of data may be cached by several CPUs at once. If threads running on different CPUs see different cached values for the same memory location, we have a cache inconsistency problem. How is it solved? The CPU provides two mechanisms: the bus lock and the cache lock.
The bus lock
When one of the CPUs wants to operate on shared memory, it issues a LOCK# signal on the bus. This signal prevents the other processors from accessing shared memory through the bus, so a bus lock effectively locks the communication between the CPUs and memory: while the lock is held, other processors cannot manipulate data at any other memory address either. The cost of this approach is obviously too high, so how can it be optimized? The optimization is to reduce the granularity of the lock, which is why the CPU introduced the cache lock.
Cache lock
The core mechanism of cache locking is the cache coherence protocol. IA-32 and Intel 64 processors implement it with MESI (note that MESI is not the only cache coherence protocol; different processors implement different ones).
MESI (Cache Coherence Protocol)
MESI is a common cache coherence protocol. The four letters stand for the four states a cache line can be in:

- M (Modified): the data is cached only in the current CPU's cache and has been modified, so the cached data is inconsistent with main memory.
- E (Exclusive): the data is cached only in the current CPU's cache and matches main memory.
- S (Shared): the data may be cached by multiple CPUs, and every cached copy matches main memory.
- I (Invalid): the cache line is invalid.

Each CPU also snoops on the reads and writes issued by the other CPUs. For reads and writes, the MESI protocol follows these rules:

- CPU read request: cache lines in the M, E or S state can be read directly; a line in the I state can only be read by going to main memory.
- CPU write request: cache lines in the M or E state can be written directly; to write a line in the S state, the corresponding cache lines in the other CPUs must first be invalidated.
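To make those rules more concrete, here is a deliberately simplified, hypothetical sketch in Java. Real MESI is implemented in hardware, with snooping and acknowledgement messages between caches; the class and method names here are made up purely for illustration:

```java
// Toy model of the MESI read/write rules described above (illustration only).
enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

class CacheLine {
    MesiState state = MesiState.INVALID;
    long value;

    // CPU read request: M, E and S can be served from the cache; I must go to main memory.
    long read(long mainMemoryValue) {
        if (state == MesiState.INVALID) {
            value = mainMemoryValue;   // reload from main memory
            state = MesiState.SHARED;  // other caches may also hold this line now
        }
        return value;
    }

    // CPU write request: M and E can be written directly; S (or I) must first invalidate
    // the copies held by other CPUs (modelled here by the invalidateOthers callback).
    void write(long newValue, Runnable invalidateOthers) {
        if (state == MesiState.SHARED || state == MesiState.INVALID) {
            invalidateOthers.run();    // send "invalidate" to the other caches and wait for acks
        }
        value = newValue;
        state = MesiState.MODIFIED;    // the cache line now differs from main memory
    }
}
```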
CPU workflow
With bus locking and cache locking in place, the way CPUs operate on memory to keep caches coherent can be abstracted into the following structure:
Problems with the MESI protocol
The MESI protocol achieves cache coherence, but it has a problem: the states of the cache lines in the individual CPUs are coordinated by message passing. If CPU0 wants to write to a variable that is shared in the caches, it first has to send an invalidate message to the other CPUs that have cached the data and wait for their acknowledgements, and CPU0 is blocked during this time. To avoid the waste of resources caused by this blocking, CPUs introduce store buffers.

As shown in the figure above, CPU0 simply writes the shared data into its store buffer, sends the invalidate message, and continues processing other instructions (asynchronously). The data in the store buffer is later moved into the cache line and finally synchronized from the cache line to main memory. However, this optimization creates a visibility problem, which can also be attributed to out-of-order CPU execution, i.e. instruction reordering (instruction reordering happens not only at the CPU level but also at the compiler level). Let's look at the problem caused by instruction reordering with a simple example.
package com.zwx.concurrent;

public class ReSortDemo {

    int value;
    boolean isFinish;

    void cpu0() {
        value = 10;      // S->I: writes value into the store buffer and notifies the other CPUs that their cached copies of value are invalid
        isFinish = true; // E
    }

    void cpu1() {
        if (isFinish) {                      // true
            System.out.println(value == 10); // may be false
        }
    }
}
Intuitively, if isFinish is true then value should already be 10. However, value = 10 may still be sitting in CPU0's store buffer: CPU0 does not wait for the invalidate acknowledgements and goes on to execute isFinish = true, so CPU1 can observe isFinish == true while value is still not 10. If you think about it, the hardware has no way of knowing the dependency between the two variables, so it cannot fix this automatically. Instead, the CPU provides memory barrier instructions (Intel calls them memory fences) and lets the software decide where to insert a barrier to prevent instruction reordering.
CPU-level memory barriers
CPU memory barriers fall into the following categories:

- Store memory barrier (write barrier): tells the processor to synchronize all data sitting in the store buffer before the barrier to main memory. In short, it makes the results of instructions before the write barrier visible to reads and writes after it.
- Load memory barrier (read barrier): forces all reads issued by the processor after the barrier to be executed after it. Combined with a write barrier, it makes memory updates performed before the write barrier visible to reads after the read barrier.
- Full memory barrier: ensures that all reads and writes before the barrier are committed to memory before any read or write after it.

These concepts may sound a little vague, so let's rewrite the example above to illustrate:
package com.zwx.concurrent;

public class ReSortDemo {

    int value;
    boolean isFinish;

    void cpu0() {
        value = 10;           // S->I: writes value into the store buffer and notifies the other CPUs that their cached copies are invalid
        storeMemoryBarrier(); // insert a write barrier so that value = 10 is forced out to main memory
        isFinish = true;      // E
    }

    void cpu1() {
        if (isFinish) {                      // true
            loadMemoryBarrier();             // insert a read barrier to force CPU1 to fetch the latest data from main memory
            System.out.println(value == 10); // true
        }
    }

    void storeMemoryBarrier() { // write barrier
    }

    void loadMemoryBarrier() {  // read barrier
    }
}
With the memory barriers above, we can prevent instruction reordering and get the expected result. In general, memory barriers ensure that shared data stays visible during parallel execution by preventing the CPU from accessing memory out of order. But how is such a barrier actually added? Going back to the volatile keyword from the beginning: it generates a lock-prefixed assembly instruction, and that instruction acts as a memory barrier. Let's continue with the JVM-level analysis of how volatile works.
The JVM level
At the JVM level, visibility issues are addressed by defining an abstract memory model (JMM) to regulate and control reordering.
JMM (Java Memory Model)
JMM stands for Java Memory Model. What is the JMM? From the analysis above, the root causes of visibility problems are caching and instruction reordering, and the JMM provides reasonable ways to disable caching and disable reordering, so its core value lies in solving visibility and ordering. The JMM is a language-level abstract memory model; you can think of it as an abstraction of the hardware model. It defines the read and write behaviour of multi-threaded programs on shared memory, and these rules regulate memory reads and writes so that instructions execute correctly. It deals with the memory access problems caused by multi-level caches, processor optimizations and CPU instruction reordering, and thereby guarantees visibility in concurrent scenarios. Note that the JMM does not stop the execution engine from using the processor's registers or caches to speed up instruction execution, nor does it forbid the compiler from reordering instructions. In other words, the JMM still faces cache coherence and instruction reordering problems; it merely abstracts these low-level issues up to the JVM level and then solves the concurrency problems on top of CPU-level memory barrier instructions and restrictions on compiler reordering.
JMM Abstract model structure
The JMM abstract model is divided into main memory and working memory. Main memory is shared by all threads and typically holds instance objects, static fields, array objects and other variables stored on the heap. Working memory is private to each thread; all of a thread's operations on variables must be carried out in its working memory, and a thread cannot read or write variables in main memory directly. Values of shared variables are transferred between threads via main memory, which can be abstracted as follows:
How does the JMM solve the visibility problem
Looking at the JMM's abstract model structure, if thread A and thread B want to communicate, they must go through the following two steps:

1) Thread A flushes the updated shared variable from its local memory A to main memory.
2) Thread B then reads from main memory the shared variable that thread A has just updated.

The figure below illustrates these two steps. Suppose that initially the x value in all three memories is 0. While thread A executes, it temporarily keeps its updated x value (say 1) in its own local memory A. When thread A and thread B need to communicate, thread A first flushes the modified x value from its local memory to main memory, so the x value in main memory becomes 1. Thread B then reads thread A's updated x value from main memory, and the x value in thread B's local memory also becomes 1. Taken as a whole, these two steps are essentially thread A sending a message to thread B, and this communication must go through main memory. By controlling the interaction between main memory and each thread's local memory, the JMM provides Java programmers with memory visibility guarantees.
Compiler instruction reordering
From the hardware-level and JVM-level analysis above, we know that the compiler and the processor often reorder instructions to improve performance when a program runs. There are three kinds of reordering:

1) Compiler-optimized reordering. The compiler may rearrange the execution order of statements as long as the semantics of the single-threaded program do not change.
2) Instruction-level parallel reordering. Modern processors use instruction-level parallelism (ILP) to overlap the execution of multiple instructions; if there is no data dependency, the processor may change the order in which the machine instructions corresponding to the statements are executed.
3) Memory system reordering. Because the processor uses caches and read/write buffers, loads and stores can appear to execute out of order.

The sequence of instructions from Java source code to actual execution goes through these three kinds of reordering, as shown below. Kinds 2 and 3 are processor reorderings (already analyzed at the hardware level). All of these reorderings can cause visibility problems. When reordering, the compiler and the processor respect data dependencies: they do not change the order of two operations that depend on each other's data (see the happens-before rules and as-if-serial semantics). For compiler reordering, the JMM's compiler reordering rules forbid certain kinds of compiler reordering (not all compiler reordering is forbidden). For processor reordering, the JMM's rules require the Java compiler, when generating the instruction sequence, to insert memory barrier instructions of specific types (Intel calls them memory fences) to prevent particular kinds of processor reordering.

The JMM is a language-level memory model that gives programmers consistent memory visibility guarantees across different compilers and processor platforms by forbidding certain kinds of compiler reordering and processor reordering. It is precisely this property of volatile that lets the singleton pattern use the volatile keyword to fix the double-checked locking (DCL) idiom.
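As a concrete illustration of that last point, here is one common form of the double-checked-locking singleton. Without volatile, the steps inside instance = new Singleton() (allocate memory, initialize the object, publish the reference) could be reordered, letting another thread observe a non-null but not yet initialized instance:

```java
public class Singleton {
    // volatile forbids reordering of "initialize the object" and "publish the reference",
    // so no thread can observe a half-constructed instance.
    private static volatile Singleton instance;

    private Singleton() { }

    public static Singleton getInstance() {
        if (instance == null) {             // first check, without locking
            synchronized (Singleton.class) {
                if (instance == null) {     // second check, with the lock held
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}
```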
Memory barriers at the JMM level
There are four types of memory barriers in the JMM: LoadLoad, StoreStore, LoadStore and StoreLoad barriers. Among them, the StoreLoad barrier is an "all-in-one" barrier that has the effect of the other three. Most modern processors support this barrier (the other types are not necessarily supported by every processor). Executing it is expensive, because the current processor usually has to flush all the data in its write buffer to memory (buffer fully flush).
Happens-before rules
Happens-before means that the result of the previous operation is visible to subsequent operations. It is a way of expressing the visibility of memory between multiple threads. So we can assume that in the JMM, if the result of one operation needs to be visible to another, there must be a happens-before relationship between the two operations. The two operations can be on the same thread or on different threads.
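As a small illustration (a sketch, not a complete treatment of all the happens-before rules): the write to a volatile variable happens-before every subsequent read of that variable, so ordinary writes made before the volatile write become visible after the volatile read:

```java
public class HappensBeforeDemo {

    int data;               // plain, non-volatile field
    volatile boolean ready; // volatile flag

    void writer() {         // runs on thread A
        data = 42;          // (1) ordinary write
        ready = true;       // (2) volatile write
    }

    void reader() {         // runs on thread B
        if (ready) {        // (3) volatile read that sees (2)
            // (1) happens-before (2), and (2) happens-before (3),
            // so (1) happens-before (4): data is guaranteed to be 42 here.
            System.out.println(data); // (4)
        }
    }
}
```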
Conclusion
Concurrent programming has three main properties: atomicity, visibility and ordering. volatile prevents instruction reordering through memory barriers, following these three rules:
- When the second operation is a volatile write, there is no reordering, regardless of the first operation. This rule ensures that operations before volatile writes are not reordered by the compiler after volatile writes.
- When the first operation is a volatile read, no matter what the second operation is, it cannot be reordered. This rule ensures that operations after volatile reads are not reordered by the compiler to those before volatile reads.
- When the first operation is volatile write and the second is volatile read, reorder cannot be performed.
To implement volatile's memory semantics, the compiler inserts memory barriers into the instruction sequence when generating bytecode, in order to prevent particular kinds of processor reordering. It is almost impossible for the compiler to find an optimal arrangement that minimizes the total number of barriers, so the JMM takes a conservative approach. Here is the JMM's memory barrier insertion strategy under the conservative policy (a sketch of the resulting placement follows the list):
- Insert a StoreStore barrier before each volatile write.
- Insert a StoreLoad barrier after each volatile write.
- Insert a LoadLoad barrier after each volatile read.
- Insert a LoadStore barrier after each volatile read.
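Applied to a simple class with one volatile field, the conservative strategy places barriers roughly as shown in the comments below. The barriers are emitted by the JIT compiler, not written by hand; this is only a sketch of where they conceptually sit:

```java
public class BarrierPlacementDemo {

    int a;
    volatile int v;

    void write() {
        a = 1;
        // StoreStore barrier: earlier ordinary writes cannot be reordered below the volatile write
        v = 2;       // volatile write
        // StoreLoad barrier: the volatile write cannot be reordered with a later volatile read
    }

    int read() {
        int r = v;   // volatile read
        // LoadLoad barrier: later ordinary reads cannot be reordered above the volatile read
        // LoadStore barrier: later ordinary writes cannot be reordered above the volatile read
        return r + a;
    }
}
```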
Finally, a special note on atomicity. The Java language specification encourages, but does not require, the JVM to make reads and writes of 64-bit long and double variables atomic. When the JVM runs on a 32-bit processor, it may split a 64-bit long/double write into two 32-bit writes, which may end up in different bus transactions, in which case the write to the 64-bit variable is not atomic. The semantics of locking guarantee the atomicity of code executed inside a critical section. And because a read of a volatile variable always sees the last write to it, even 64-bit longs and doubles are read and written atomically as long as they are declared volatile. However, multiple volatile operations, or a compound operation such as i++, are not atomic as a whole; to make a compound operation like i++ atomic, you need the synchronized keyword or another lock.
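Here is a small sketch of why i++ on a volatile variable is not atomic, and how the increment can be made atomic with synchronized; an AtomicInteger from java.util.concurrent.atomic is shown as well for comparison:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {

    static volatile int volatileCount = 0;  // visibility only; ++ is a read-modify-write, not atomic
    static int lockedCount = 0;
    static final AtomicInteger atomicCount = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                volatileCount++;                    // lost updates are possible
                synchronized (CounterDemo.class) {  // mutual exclusion makes the increment atomic
                    lockedCount++;
                }
                atomicCount.incrementAndGet();      // CAS-based atomic increment
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // volatileCount is often less than 20000, while lockedCount and atomicCount are always 20000.
        System.out.println(volatileCount + " " + lockedCount + " " + atomicCount.get());
    }
}
```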
Note: In the old memory model prior to JSR-133, both reads and writes of a 64-bit long/double variable could be split into two 32-bit operations. Starting with the JSR-133 memory model (that is, from JDK 5 on), only a write to a 64-bit long/double may be split into two 32-bit writes; any read must be atomic, that is, performed in a single read transaction.