Many developers are irritated by volatile, a humble Java keyword that has tripped up plenty of heroes. At first glance it looks simple; look at it closely and it becomes unsettling. This article takes you through the keyword layer by layer, from the surface down to the underlying principles.

1. Soul-searching questions

Have you ever been asked the following questions in an interview and felt a sudden urge to fight somebody?

  • What is the difference between volatile and synchronized? Where have you seen them used?
  • Which scenarios is volatile suited for?
  • What problem does volatile solve? What are its advantages and disadvantages?
  • Can you talk about the underlying principles of volatile? How about MESI?
  • Do you know the happens-before principle?

A combo like that is hard to survive. Who am I? Why am I here? It's getting late; maybe I should get some rest.

2. Case analysis

Let’s walk through this keyword with an example.

import java.util.concurrent.TimeUnit;

public class VolatileExample {

    /** thread-safe version */
    //public static volatile boolean run = true;

    /** thread-unsafe version */
    public static boolean run = true;

    public static void main(String[] args) throws InterruptedException {
        new Thread(() -> {
            int i = 0;
            while (run) {
                i++;
                Thread.yield();
                System.out.println(i + " loading...");
            }
        }).start();
        System.out.println("Main Thread ready to sleep...");
        TimeUnit.SECONDS.sleep(10);
        System.out.println("Main Thread finish sleep...");
        run = false;
    }
}

The example above demonstrates the difference between using the volatile keyword and not using it.

public static boolean run = true;

With the thread-unsafe version, there is a fairly small probability that the child thread stays in the RUNNABLE state and never stops. Note that the probability really is small: I had to run it tens of thousands of times before the child thread failed to stop once. But in production, that one occurrence could cause a serious incident.

public static volatile boolean run = true;

The version above is thread-safe and fundamentally eliminates the endless loop. Why?

In short, the core problem is that even though the main thread changes the value, the child thread does not perceive that run has been set to false. Why that happens is analyzed in detail below.

3. What is volatile?

The volatile keyword is used when multiple threads read and modify the value of the same variable. It guarantees that threads always see the latest written value of the shared variable, so multiple threads can work with the same field concurrently without stale reads. It can be applied to both Java primitive types and reference types.

While the above describes the observable behavior, the underlying principle is that, under the JMM (Java Memory Model), the volatile keyword marks a Java variable as "stored in main memory." More precisely, every read of a volatile variable is read from the computer's main memory rather than the CPU cache, and every write to a volatile variable is written back to main memory, not just to the CPU cache.

As we all know, when multiple threads access a shared variable at the same time, there are three properties to consider: atomicity, visibility, and ordering. Atomicity means that a thread's operation on shared data must not be interfered with by other threads mid-operation; visibility means that the effect of one thread's writes to shared data must be observable by other threads; ordering means that instructions should appear to execute in the order expressed in the source code.

Measured against these three properties, volatile provides:

  • Visibility: a read of a volatile variable always sees the last write to that variable (by any thread).
  • Atomicity: reads and writes of any single volatile variable are atomic, but compound operations such as i++ are not (see the sketch after this list).
  • Ordering: the volatile keyword disables instruction reordering around it, so it guarantees ordering to a certain extent.
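
To make the atomicity caveat concrete, here is a minimal sketch (the class name and iteration counts are my own, not from the original) showing that volatile does not make i++ atomic: two threads each increment a shared volatile counter 100,000 times, and the printed total is usually less than 200,000 because concurrent read-modify-write cycles lose updates.

public class VolatileNotAtomic {
    private static volatile int count = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int j = 0; j < 100_000; j++) {
                count++; // read, add, write back: three steps, not atomic even with volatile
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("count = " + count); // usually less than 200000
    }
}

An AtomicInteger or a synchronized block would fix this; volatile alone only guarantees the visibility of each individual read and write.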

4. Analyzing visibility at the hardware level

At its core, volatile is about visibility. So what is the nature of this visibility problem?

As we know, the CPU, memory, and I/O devices are the core components of a computer, and hardware vendors keep improving all three to raise overall performance. There are big differences in their processing speeds: CPU > memory > I/O devices. By the barrel principle (a barrel holds only as much water as its shortest stave allows), overall performance is capped by the slowest part. To balance the three and make the most of computing resources, as mentioned in the earlier article "Concurrent Programming: Processes, Threads, and Coroutines," a group of smart people invented operating systems, processes, and threads, maximizing CPU utilization through time-slice switching.

But that still was not enough. Could CPU instructions work on data stored directly on the CPU instead of reaching out to memory every time? Of course: that is the CPU cache. And once caches existed, compilers started reordering instructions to exploit them more effectively. These optimizations squeeze every last drop out of the CPU, but they also created a whole family of new problems, and the hardworking people who set out to solve them pushed concurrent programming into an era of explosive growth.

4.1 CPU Cache

The CPU cache is a small, fast memory that sits between the CPU and main memory. Its main job is to bridge the gap between CPU computation speed and memory access speed. Put bluntly, what we ultimately use is the computer's computing power, but a computation is never the CPU alone: the CPU must constantly read from and write to memory. To eliminate as much of that I/O wait as possible, hardware designers introduced the cache.
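
To get a feel for how much the cache matters, the sketch below (my own illustration; exact timings vary by machine, and a serious measurement would use a benchmark harness like JMH) reads the same array once sequentially and once with a large stride. Both loops touch every element, but the sequential pass is cache- and prefetch-friendly and typically runs several times faster.

public class CacheFriendliness {
    public static void main(String[] args) {
        int[] data = new int[16 * 1024 * 1024]; // 64 MB of ints
        long sum = 0;

        long start = System.nanoTime();
        for (int i = 0; i < data.length; i++) {
            sum += data[i]; // sequential: each 64-byte cache line is fully reused
        }
        long sequential = System.nanoTime() - start;

        start = System.nanoTime();
        int stride = 4096; // jump 16 KB at a time
        for (int s = 0; s < stride; s++) {
            for (int i = s; i < data.length; i += stride) {
                sum += data[i]; // strided: nearly every access misses the cache
            }
        }
        long strided = System.nanoTime() - start;

        System.out.println("sequential: " + sequential / 1_000_000 + " ms, strided: "
                + strided / 1_000_000 + " ms, sum = " + sum);
    }
}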

As shown in the figure:

  • L1 Cache: the level-1 cache, which mostly holds instructions and data and is exclusive to each core. Because of its complex structure and high cost, it is the smallest of the caches; common sizes range from 32KB to 512KB.
  • L2 Cache: the level-2 cache, which directly affects CPU performance and is likewise exclusive to each core. Common sizes range from 128KB to 24MB.
  • L3 Cache: the level-3 cache, which further reduces memory latency and improves performance on larger working sets; it is shared among cores. Common sizes range from 2MB to 32MB.

In general, the closer a cache sits to the CPU core, the smaller its capacity and the lower its access latency (that is, the faster it is).

Caching resolves the speed mismatch between processor and memory, but it introduces a new problem: cache consistency.

4.2 Cache Consistency

A typical operation first moves the data it needs into the cache and synchronizes the result back to memory when the operation finishes. On a multi-core CPU this is where the biggest problem arises: the same shared data in main memory may be cached by several cores at once. When one core writes, the new value lands in its own cache first and is only later synchronized to main memory, but the caches of the other cores may never notice the change. This is the cache inconsistency problem.

To ensure cache consistency, there are two classic approaches:

  • Bus locking
  • Cache locking

Bus locking, simply put, locks the communication between CPU and memory at the bus level: while the lock is held, no other core can access the locked memory address. This approach is expensive, affects a wide scope, and has coarse locking granularity, so it is a poor fit for the cache inconsistency problem.

Cache locking was therefore proposed as an optimization of bus locking. Its core mechanism is the cache coherence protocol, which shrinks the lock granularity and guarantees that every core's cached copy of the same shared data in main memory stays consistent.

4.3 CPU Cache and Cache Line

As mentioned above, cache locking relies on a cache coherence protocol to keep caches consistent. There are many such protocols; the most common is MESI. Since MESI is implemented by tracking one of four states per cache line, one piece of hardware terminology needs to be introduced first: the difference between the CPU Cache and a Cache Line.

The CPU Cache is divided into fixed-size units called cache lines, commonly 64 bytes each. For example, a 512-byte cache would hold eight (512/64 = 8) cache lines. MESI therefore operates at cache-line granularity, which gives the best possible outcome: when main memory and the cache need to synchronize, only the affected cache line is involved, not the entire CPU cache.
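
A side effect of line-granularity coherence worth knowing is false sharing: two unrelated variables that land on the same 64-byte line invalidate each other's cached copies on every write. The sketch below is my own illustration (field layout is ultimately up to the JVM; the padding trick is shown only to make the point):

public class FalseSharingDemo {
    static class Counters {
        volatile long a;                     // a and b likely share one 64-byte cache line
        // long p1, p2, p3, p4, p5, p6, p7; // uncomment to pad them onto separate lines
        volatile long b;
    }

    public static void main(String[] args) throws InterruptedException {
        Counters c = new Counters();
        Thread t1 = new Thread(() -> { for (long i = 0; i < 50_000_000L; i++) c.a++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < 50_000_000L; i++) c.b++; });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        // With a and b on one line, MESI invalidations ping-pong between the two cores;
        // padded apart, the threads stop interfering and the loops run noticeably faster.
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}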

4.4 The MESI cache coherence protocol

The MESI protocol defines four states, encoded in two bits per cache line: Modified, Exclusive, Shared, and Invalid.

  • Modified: the cache line is valid; the data exists only in the current cache and has been modified, so it is inconsistent with main memory.
  • Exclusive: the cache line is valid; the data exists only in the current cache and is consistent with main memory.
  • Shared: the cache line is valid; the data exists in more than one cache, shared among them, and each cached copy is consistent with main memory.
  • Invalid: the cache line is invalid.

Under MESI, each cache controller not only tracks its own reads and writes but also listens in on the reads and writes of other caches, a technique known as snooping (or "sniffing"). The controller performs different snooping duties depending on the current state of each cache line.

4.4.1 The state transition process

Let's use a simple example to see the snooping in action. Suppose we have a dual-core CPU, and main memory holds a variable i with the value 1. The CPU now needs i for a computation and reads it into the cache.

Step 1: CPU1 reads the data from main memory into its cache. The variable i in its cache is 1, and the cache line's state is E (exclusive). CPU1 keeps listening for any other cache loading this variable from main memory.

Step 2: CPU2 also reads variable i from main memory into its cache. CPU1 snoops this event and changes its cache line's state to S; CPU2's newly loaded line is marked S as well. Both cache lines now listen for events that would force them to invalidate their copy of i, as well as for other caches' requests for the variable.

Step 3: CPU1 finishes its computation. Its cache controller changes the cache line's state to M and sends an event notification to the other CPUs. CPU2 receives the notification and sets its cache line's state to I (invalid). CPU1 now listens for other caches reading i from main memory; CPU2's copy is unusable because its state is invalid.

Step 4: CPU2 needs variable i again, but the cache line that held it has been invalidated, so it must fetch from main memory. CPU1 snoops the main-memory read request and first synchronizes its modified value back to main memory, so main memory now holds i = 2, and CPU1's cache controller sets its line's state to E. Then, as in step 2, both CPUs' cache lines end up in state S.

4.4.2 State transition rules

In general, the MESI protocol handles CPU reads and writes according to the following principles:

  • CPU read request: a cache line in state M, E, or S can be read directly; in state I, the CPU can only read the data from main memory.
  • CPU write request: a cache line in state M or E can be written directly; in state I the line is invalid and cannot be written; in state S, a write is allowed only after the copies in the other caches have been set to invalid.
4.4.3 Problems introduced by MESI

Although MESI's four states and snooping achieve cache consistency, they also bring problems of their own.

As described above, when a CPU wants to write a computed result into a cache line, it must notify the other CPUs that hold the now-stale copy of the data and wait until they have completed their state changes before performing the write. During that entire period the CPU blocks, waiting synchronously, which badly hurts performance.

To solve the blocking wait, CPU designers introduced the Store Buffer. With this buffer, a CPU that wants to modify a cached value only needs to write the new data into the store buffer, after which it can go on executing other instructions. Later, once the other CPUs have set their copies of the cache line to I (invalid), the buffered value is stored into the cache line and, when necessary, synchronized to main memory.

This scheme is asynchronous and eliminates the CPU's synchronous blocking wait. But it also introduces new problems.

  • Because the operation is asynchronous, there is no definite moment at which the other CPUs' state-change notifications arrive, so there is also no definite moment at which the store buffer's data gets written into the cache line.
  • Before the other CPUs have acknowledged the state change, the writing CPU may itself read the data: it reads first from the store buffer, then, if the value is not there, from the cache line, and finally from main memory.

The new problem, and one with huge impact, is instruction reordering.

Let’s use an example to see what the problem is.

int value = 1;
boolean finish = false;

void runOnCPU1() {
    value = 2;
    finish = true;
}

void runOnCPU2() {
    if (finish) {
        assert value == 2;
    }
}

Assume runOnCPU1 and runOnCPU2 run on two separate CPUs. It is tempting to assume the assertion can never fail, yet it can. Here is one possible scenario.

Both key variables are cached by CPU1, with the following cache line states:

  • value: cache line state S (shared; another CPU also caches it)
  • finish: cache line state E (exclusive to CPU1)

While executing runOnCPU1, CPU1 writes value = 2 into its store buffer, notifies the other CPUs that hold the same variable to set their cache lines to I (invalid), waits asynchronously for their replies, and meanwhile goes on to execute finish = true.

Because the cache line holding finish is in state E (exclusive), finish = true can be written into the cache line immediately without notifying any other CPU. When CPU2 then reads finish, the line's state drops to S and finish = true reaches main memory, so CPU2 sees finish == true. CPU2 goes on to execute assert value == 2, fetching value from main memory; but CPU1's modified value is still parked in its store buffer, so CPU2 reads the stale value 1 and the assertion fails.

In other words, from the outside it looks as though, in runOnCPU1, finish was assigned before value, which is not what the source code says. This is the visibility problem caused by instruction reordering.

This visibility problem can be solved with the memory barriers defined by the JMM, and it is precisely here that volatile earns its reputation as the killer tool for visibility in multithreaded environments.

4.5 Memory Barriers

4.5.1 Analyzing memory barriers with the hsdis tool

hsdis is a disassembly plugin for inspecting the code the JIT compiler generates at Java runtime. It is a dynamic library (a DLL on Windows) that goes into the ${JAVA_HOME}/jre/bin/server directory. We can use it to compare the assembly emitted for a variable with the volatile keyword against the assembly emitted without it.

Start by configuring the hsdis environment in the Java installation directory, which essentially means dropping the two library files into the directory above.

After configuring hsdis, add the following parameters to the VM options used to run the program:

-server -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileCommand=compileonly,*VolatileExample.*
  • -XX:+UnlockDiagnosticVMOptions enables diagnostic mode so that PrintAssembly output is permitted.
  • -XX:+PrintAssembly prints the just-in-time-compiled code as assembly.
  • -XX:CompileCommand=compileonly,*VolatileExample.* restricts output to methods matching *VolatileExample.*.

Then run the VolatileExample from section 2.

With volatile, the emitted assembly instruction is:

0x0000000003565073: lock add dword ptr [rsp],0h ; *putstatic run

Without volatile, the emitted assembly instruction is:

0x00000000030e5ceb: push 0ffffffffc4834801h ; *putstatic run

The difference is plain to see: the volatile write carries a lock instruction. This lock prefix is the assembly-level control instruction that implements the memory barrier.

4.5.2 Memory barriers in detail

In Java, memory barriers are not exposed directly for use; instead the JVM inserts them into the underlying runtime instructions based on the code's semantics. For example, when a variable is modified with volatile, the emitted assembly gains the lock instruction shown above.

Different hardware exposes different memory barrier instructions. The familiar x86 architecture, for example, provides the lfence (load barrier), sfence (store barrier), and mfence (full barrier) instructions to enforce memory barriers.

  • Load Memory Barrier (read barrier): all reads after the barrier are performed after it completes. Working together with a write barrier, it makes the memory updates before the write barrier visible to the reads after the read barrier. In essence, a read barrier applies all invalidation messages already sitting in the invalidate queue before proceeding.
  • Store Memory Barrier (write barrier): all writes before the barrier must be flushed from the Store Buffer to main memory. The effect is that all reads and writes after the write barrier see the memory updates made before it.
  • Full Memory Barrier: all reads and writes before the barrier are visible to all reads and writes after it.

Returning to the example from section 4.4.3, a simple change, inserting memory barriers, restores visibility:

int value = 1;
boolean finish = false;

void runOnCPU1() {
    value = 2;
    storeMemoryBarrier(); // write barrier: flush the store buffer so value reaches main memory first
    finish = true;
}

void runOnCPU2() {
    if (finish) {
        loadMemoryBarrier(); // read barrier: apply pending invalidations before reading
        assert value == 2;   // now guaranteed to pass
    }
}

In summary, the Java compiler solves the instruction reordering problem by inserting memory barriers before and after volatile reads and writes. In fact, for the Java language the real hero behind taming instruction reordering is the JMM (Java Memory Model). Acting as a bridge over complex underlying hardware, the JMM solves visibility and ordering by providing well-defined ways to disable caching and reordering, which are then translated into concrete CPU instructions at compile time; the related language features include synchronized, volatile, and final. This article will not go deeper here; a detailed analysis will follow later.
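
To make the insertion points concrete, here is a sketch of the conservative barrier placement described in the JSR-133 cookbook for volatile accesses. The comments mark where the JVM conceptually inserts barriers; they are not literal emitted instructions.

class BarrierPlacement {
    int a;
    volatile boolean flag;

    void writer() {
        a = 1;
        // StoreStore barrier: earlier plain stores cannot sink below the volatile store
        flag = true;
        // StoreLoad barrier: the volatile store completes before any later read
        // (on x86 this is where the lock-prefixed instruction from section 4.5.1 appears)
    }

    void reader() {
        if (flag) {
            // LoadLoad + LoadStore barriers: later reads and writes cannot float above
            // the volatile load, so the write to a is visible here
            int r = a; // guaranteed to see 1
        }
    }
}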

4.6 happens-before

The happens-before principle is a core concept of the JMM and really belongs in a dedicated JMM article. But since we are already discussing visibility and ordering, let's briefly touch on it.

The happens-before principle states that the result of an earlier action must be visible to a later action. In other words, following this rule in a multithreaded environment guarantees the visibility of multiple operations on shared variables.

4.6.1 happens-before && as-if-serial

The happens-before relationship and the as-if-serial semantics are, in essence, the same idea.

  • happens-before and as-if-serial guarantee, respectively, that multithreaded and single-threaded execution results are not changed by reordering.
  • as-if-serial semantics create the illusion that a single thread executes strictly in program order; happens-before semantics create the illusion that a correctly synchronized multithreaded program executes in the order happens-before specifies.

The purpose of both is to raise the parallelism of program execution as much as possible without changing the program's results.

4.6.2 The happens-before rules

**JSR-133: Java Memory Model and Thread Specification** defines the following happens-before rules.

Program order rule: every action in a thread happens-before any subsequent action in that thread.

In short, the runtime is free to tweak the actual execution order to gain parallelism, as long as the single-threaded result stays the same; think of it as the baseline rule. A tiny sketch follows below.
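
The sketch (my own illustration): the two stores below are independent, so the compiler or CPU may commit them in either order, yet any observation made from within the same thread remains consistent with source order.

class ProgramOrder {
    int a, b;

    void run() {
        a = 1; // happens-before b = 2 from this thread's point of view
        b = 2; // hardware may commit this store first, but this thread alone
               // can never observe b == 2 together with a == 0
    }
}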

Monitor lock rule: unlocking a monitor happens-before every subsequent locking of that same monitor.

Example of the monitor lock rule:

static int y = 1;

static void testMonitorRule() {
    synchronized (Object.class) {
        if (y == 1) {
            y = 2;
        }
    }
}

If thread 1 sets y = 2 inside the synchronized block and then releases the lock, thread 2 is guaranteed to see y == 2 when it subsequently acquires the same lock.

Volatile variable rule: a write to a volatile field happens-before every subsequent read of that field.

Example of the volatile variable rule:

static int i = 1;
static volatile boolean flag = false;

static void testVolatileRuleWriter() {
    i = 2;       // 1
    flag = true; // 2
}

static void testVolatileRuleReader() {
    if (flag) {        // 3
        assert i == 2; // 4
    }
}

Here step 3 is a volatile read, so it is guaranteed to see the volatile write in step 2.

Transitivity: If A happens-before B, and B happens-before C, then A happens-before C.

By the transitivity rule, in the volatile example above, step 1 must happen-before step 4: 1 happens-before 2 by program order, 2 happens-before 3 by the volatile rule, and 3 happens-before 4 by program order.

start() rule: if thread A performs ThreadB.start() (starting thread B), then A's ThreadB.start() happens-before every action in thread B.

Example of the start() rule:

    static void testStartRule(){
        AtomicInteger x = new AtomicInteger(1);
        final Thread t = new Thread(() -> {
            assert x.get() == 2;
        });
        x.set(2);
        t.start();
    }

Everything the main thread does to shared variables before calling t.start() is visible to the child thread.

join() rule: if thread A performs ThreadB.join() and it returns successfully, then every action in thread B happens-before A's successful return from ThreadB.join().

Example of the join() rule:

static void testJoinRule() throws InterruptedException {
    AtomicInteger x = new AtomicInteger(1);
    final Thread t = new Thread(() -> {
        try {
            TimeUnit.SECONDS.sleep(10);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        x.set(2);
    });
    t.start();
    t.join();
    assert x.get() == 2;
}

Once the main thread has called t.join() on the child thread, all modifications the child thread made to shared variables are visible to the main thread.

5. Conclusion

5.1 Usage scenarios for volatile

volatile is everywhere in java.util.concurrent: for example, the value field of the Atomic classes and the state variable of AbstractQueuedLongSynchronizer are declared volatile to guarantee memory visibility.
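
For instance, the counter inside java.util.concurrent.atomic.AtomicInteger is a volatile field. The sketch below is a simplified excerpt based on the OpenJDK sources, trimmed for illustration:

// Simplified sketch of java.util.concurrent.atomic.AtomicInteger:
public class AtomicInteger {
    private volatile int value; // volatile: every thread sees the latest written value

    public final int get() {
        return value; // a plain volatile read is enough for visibility
    }

    // incrementAndGet() and friends pair this volatile field with CAS
    // (compare-and-swap) to add atomicity on top of visibility.
}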

It also comes up in day-to-day development, for example:

Using a flag bit so that one thread can stop another:

public static volatile boolean run = true;

public static void main(String[] args) throws InterruptedException {
    final Thread t = new Thread(() -> {
        int i = 0;
        while (run) {
            i++;
            Thread.yield();
        }
    });
    t.start();
    TimeUnit.SECONDS.sleep(10);
    run = false;
}

Double-checked locking to implement the singleton pattern:

public class DoubleCheckingLock {
    private static volatile Instance instance;
    public static Instance getInstance(){
        if(instance == null){
            synchronized (DoubleCheckingLock.class){
                if(instance == null){
                    instance = new Instance();
                }
            }
        }
        return instance;
    };
    static class  Instance { }
}
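
Why does this pattern need volatile at all? Without it, the write instance = new Instance() can be reordered so that the reference is published before the constructor finishes. The comment sketch below summarizes the usual explanation (the three conceptual steps follow the treatment in The Art of Java Concurrent Programming):

// instance = new Instance() conceptually breaks down into three steps:
//   1. memory = allocate();   // allocate space for the object
//   2. ctorInstance(memory);  // run the constructor, initializing fields
//   3. instance = memory;     // publish the reference
// Without volatile, steps 2 and 3 may be reordered (1 -> 3 -> 2). Another thread
// can then see instance != null at the outer null check and return a
// half-constructed object. Declaring instance volatile forbids this reordering.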

For the full details, check out the excellent book The Art of Java Concurrent Programming.

5.2 Wrapping up

This article started from an example to show that operating on shared variables in a multithreaded environment raises visibility and ordering problems. Looking at the hardware, we saw how CPUs introduced caches to use computing resources effectively and increase concurrency, and the problems caches brought with them. To solve the consistency problem caused by caching, we walked through the MESI protocol, and finally we examined the core principle behind volatile: memory barriers.

The questions listed at the beginning of the article should all be answerable after reading this far. The synchronized side of those questions is analyzed in the chapter on synchronized.


Hold on before you go! Leave a like and join the discussion in the comments. You are also welcome to follow my column "Interviews Without Panic | Java Concurrent Programming", so that neither interviews nor raises are anything to worry about. Follow me, and I promise to keep getting better.