Tracing the history of volatile

Many of ConcurrentHashMap's fields are declared volatile. So next I'll take you through a thorough analysis of what this keyword actually does. Follow my train of thought — the old driver is setting off! 🚗

What is volatile?

volatile guarantees the visibility of shared variables in a multiprocessor environment. So what exactly is visibility?

Think about it this way: in a single-threaded environment, if you write a value to a variable and then read it back with no intervening writes, the value you read is the value you wrote. That seems perfectly normal. But in a multithreaded environment, the reads and writes happen on different threads, and a reader thread may fail to see the latest value written by another thread in time. This is the visibility problem

To achieve memory visibility across threads, some mechanism must be used, and volatile is one such mechanism
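
To make this concrete, here is a minimal sketch of the classic stale-read loop (the class and field names are mine). Without volatile on the flag, the reader thread may spin forever on a JIT-compiled JVM, because the hoisted read never sees the update:

```java
public class VisibilityDemo {
    // Try removing volatile: the reader thread may never see the update
    private static volatile boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // busy-wait; with a plain boolean the JIT may hoist the read
                // out of the loop, so the write below is never observed
            }
            System.out.println("reader saw stop = true");
        });
        reader.start();

        Thread.sleep(1000);   // give the reader time to start looping
        stop = true;          // volatile write: visible to the reader
        reader.join();
    }
}
```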

How does volatile guarantee visibility?

If we mark a variable volatile, we will see an extra lock instruction in the generated assembly when it is accessed by multiple threads. lock is a control instruction: in a multithreaded environment, it achieves visibility through either a bus lock or a cache lock (if you want a detailed walkthrough of how this instruction operates, leave a message in the comments and I will fill it in).
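
If you want to verify this yourself, HotSpot can print the JIT-compiled assembly. A hedged sketch: the two flags below are real HotSpot diagnostic flags, but they require the hsdis disassembler plugin for your platform, and the exact instruction sequence varies by architecture — on x86 a volatile write typically compiles to a plain mov followed by a lock addl $0, (%rsp), which is the barrier:

```java
// Hypothetical demo class (the name is mine). Run with:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly LockPrefixDemo
// (requires the hsdis disassembler plugin to be installed)
public class LockPrefixDemo {
    private static volatile long value;

    public static void main(String[] args) {
        // loop enough times for the JIT to compile this method to native code
        for (long i = 0; i < 1_000_000; i++) {
            value = i; // on x86, look for a lock-prefixed instruction at this write
        }
        System.out.println(value);
    }
}
```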

Now that we’re talking about assembly instructions, let’s take a look at the nature of visibility from the hardware level

Understanding the nature of visibility at the hardware level

First, the core of a computer is the CPU, memory, and I/O devices. Over the course of computing history, the processing capacity of all three has been continuously upgraded, but a task rarely depends on just one of them — it is a cooperative process. And here lies a very awkward contradiction: the three run at very different speeds. The CPU is by far the fastest, then memory, and finally the I/O devices. To improve computing performance, CPUs went from single-core to multi-core and even use hyper-threading to maximize utilization, but if the other parts cannot keep up, overall efficiency does not improve much. So, to balance the speed gap among the three and squeeze the most out of the CPU, many optimizations were made in hardware, the operating system, and elsewhere:

  1. The CPU gained caches
  2. The operating system added processes, threads, and CPU time-slice switching to maximize CPU utilization
  3. The compiler added instruction optimizations, etc.

CPU cache

A thread is the smallest unit of CPU scheduling; threads were designed to make full use of the CPU. But most tasks cannot be completed by the processor through computation alone — they also need to interact with memory or I/O devices, and since the speed gap between the CPU and memory or I/O devices is huge, a layer of caching was added to improve processing speed

Although adding caches solves the speed mismatch between the CPU and memory, it also increases the complexity of the whole system and introduces a new problem: cache consistency

What is cache consistency?

With caches in place, each CPU now reads data from main memory into its own cache, performs its computation there, and then writes the result back to main memory

However, because there are multiple CPU cores, the same data may be cached by multiple CPUs, so at a given point in time different caches may hold different values for the same memory location. That is the cache inconsistency problem

To solve cache inconsistency, many approaches were tried at the CPU level, and two main solutions emerged

  1. Bus lock

    A bus lock asserts a lock signal on the bus when one CPU core operates on a shared variable, so that no other core can access that memory in the meantime. But this blocks far more than the one variable and is very expensive, so the mechanism is unsuitable

  2. Cache lock

    A cache lock exploits the fact that we only need to keep the same shared variable consistent across different CPUs, so it locks at the granularity of a cache line; it is implemented on top of the cache consistency protocol

Cache consistency protocol MESI

Next, I will give you a brief introduction. If you have questions, leave a message in the comments, and based on the feedback I will decide whether the cache consistency protocol deserves a detailed article of its own

MESI stands for the four states a cache line can be in:

  1. M (Modified): the shared data is cached only in the current CPU and has been modified, so the data in the cache is inconsistent with main memory
  2. E (Exclusive): the cache line is exclusive; the data is cached only in the current CPU and has not been modified
  3. S (Shared): the data may be cached by multiple CPUs, and each cache matches main memory
  4. I (Invalid): the cache line is invalid

In the MESI protocol, each cache's controller not only knows about its own reads and writes, but also snoops on reads and writes to the same cache line by other caches
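
To make the four states and the snooping behavior concrete, here is a toy model in Java. This is purely illustrative — the events and transitions are a simplified sketch of the protocol, not any real API:

```java
public class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    static State onLocalWrite(State s) {
        // a write makes our copy MODIFIED (after fetching ownership if needed);
        // the protocol invalidates every other CPU's copy
        return State.MODIFIED;
    }

    static State onRemoteWrite(State s) {
        // snooped: another CPU wrote the same line, so our copy is now stale
        return State.INVALID;
    }

    static State onRemoteRead(State s) {
        switch (s) {
            case MODIFIED:   // we must write back first, then both copies are SHARED
            case EXCLUSIVE:  // another CPU now holds a copy too
                return State.SHARED;
            default:
                return s;    // SHARED stays SHARED, INVALID stays INVALID
        }
    }

    public static void main(String[] args) {
        State s = State.EXCLUSIVE;   // we cached the line alone
        s = onLocalWrite(s);         // MODIFIED
        s = onRemoteRead(s);         // SHARED (after write-back)
        s = onRemoteWrite(s);        // INVALID: we must re-read from memory
        System.out.println("final state: " + s);
    }
}
```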

Summing up the nature of visibility

Visibility problems arise when multiple CPUs cache the same data. If CPU0 modifies the copy in its own cache, the change is not visible to CPU1, so when CPU1 then writes to the shared variable it operates on stale data, producing an unpredictable result
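
A minimal sketch of that dirty-data scenario (class and field names are mine): two threads each increment a shared counter, both can read the same stale value, and updates get lost. Note that volatile alone would not fix this particular loss, since count++ is a non-atomic read-modify-write — the point here is only that stale reads make the result unpredictable:

```java
public class DirtyWriteDemo {
    static int count = 0; // plain shared variable

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                count++; // read-modify-write on possibly stale data
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(count); // almost always less than 200000
    }
}
```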

The cache consistency protocol resolves cache inconsistencies, so why do we still need volatile? Why do visibility problems still appear?

Visibility issues with MESI optimization

The MESI protocol can achieve cache consistency, but it has problems of its own

For example: if CPU0 writes to a shared variable, it must first send an invalidate message to the other CPUs caching that line and wait for their acknowledgements. During this wait, CPU0 is blocked. To avoid the waste caused by blocking, Store Buffers were introduced into the CPU

When CPU0 needs to write to a shared variable, it simply writes the data into its store buffer, sends the invalidate message, and continues processing other instructions, so CPU0 no longer blocks

Finally, once the acknowledgements from the other CPUs arrive, the data is moved from the store buffer into the cache line and then synchronized to main memory

But this optimization poses two problems

  1. It is uncertain when the data will actually be committed, because that depends on waiting for the other CPUs' acknowledgements
  2. With store buffers introduced, the processor first tries to read a value from its own store buffer; if the data is there it reads it directly from the store buffer (store forwarding), otherwise it reads from the cache line

But this also leads to another problem: instruction reordering

While the data sits in the store buffer and the CPU goes on executing other instructions, the effective order of memory operations changes. This reordering also causes visibility problems, as the sketch below shows
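
Here is the classic store-buffer litmus test, sketched in Java (class and field names are mine). With plain fields, both threads' writes can linger in store buffers while their reads hit the cache, so both threads can read 0 — an outcome impossible in any interleaved order. Declaring x and y volatile rules it out. On a real machine it may take a very long time to show up (thread start/join overhead makes the race rare); tools like OpenJDK's jcstress exist precisely to stress such cases:

```java
public class StoreBufferLitmus {
    static int x, y, r1, r2; // try making x and y volatile to forbid (0, 0)

    public static void main(String[] args) throws InterruptedException {
        for (long i = 0; ; i++) {
            x = 0; y = 0; r1 = -1; r2 = -1;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join(); t2.join();
            if (r1 == 0 && r2 == 0) {   // neither thread saw the other's write
                System.out.println("reordering observed on iteration " + i);
                break;
            }
        }
    }
}
```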

In fact, it is very hard for the hardware to know what dependencies exist at the software level, so there is no way for hardware alone to fully solve the problem

The hardware engineers said: hmph~ since no optimization of ours satisfies you, write it yourselves!

So the CPU level provides memory barrier instructions that force the store buffers to be flushed, and the software layer decides where to insert those barriers

CPU-level memory barriers

A memory barrier simply forces the writes sitting in the store buffers out to memory, so that they become visible to other threads accessing the same shared memory.

Memory barriers come in several flavors — read barriers, write barriers, and full barriers — but I won't expand on them here; that would take us off topic

In general, memory barriers prevent the CPU from accessing memory out of order, ensuring that shared variables stay visible during parallel execution

But how do we add such a barrier?

Going back to the code at the beginning: the volatile keyword generates a lock assembly instruction, and that instruction acts as a memory barrier

Again, memory barriers and reordering depend on the platform and hardware architecture, and Java's promise of "write once, run anywhere" means this should not be the programmer's concern. So the Java language handles it for us
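
For the curious: since Java 9 the language does expose the barriers directly, through the static fence methods on java.lang.invoke.VarHandle. The following is a sketch of placing them by hand (class and field names are mine); in normal code you would simply declare the field volatile and let the JVM insert the barriers for you:

```java
import java.lang.invoke.VarHandle;

public class FenceDemo {
    static int data;      // plain field
    static boolean ready; // deliberately plain, not volatile

    static void writer() {
        data = 42;
        // releaseFence: loads and stores before it may not be reordered
        // with stores after it, so `data = 42` is published no later than `ready`
        VarHandle.releaseFence();
        ready = true;
    }

    static void reader() {
        if (ready) {
            // acquireFence: loads before it may not be reordered with
            // loads and stores after it, so a thread that saw ready == true
            // also sees data == 42
            VarHandle.acquireFence();
            System.out.println(data);
        }
    }
}
```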

JMM

What is the JMM?

JMM is short for Java Memory Model. From the previous analysis, the root causes of visibility problems are caching and reordering, and the JMM provides well-defined ways to disable caching and reordering where necessary in order to guarantee visibility.

The JMM is a language-level abstract memory model; you can simply think of it as an abstraction over the hardware model. It defines the behavior of read and write operations on shared memory in a multithreaded program: the low-level details of how the virtual machine stores shared variables into memory and loads them back out

These rules regulate memory reads and writes so as to guarantee the correctness of instruction execution. They address the memory access problems caused by multi-level CPU caches, processor optimization, and instruction reordering, and guarantee visibility in concurrent scenarios.
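
The rule you feel most often in practice is the one around volatile: a volatile write happens-before any subsequent volatile read of the same field, which "publishes" every plain write made before it. A small sketch (class and field names are mine):

```java
public class SafePublication {
    private String config;          // plain field
    private volatile boolean ready; // volatile "guard" field

    void init() {
        config = "loaded"; // 1. plain write
        ready = true;      // 2. volatile write: publishes write 1 along with it
    }

    void use() {
        if (ready) {                    // 3. volatile read
            System.out.println(config); // 4. guaranteed to print "loaded"
        }
    }
}
```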

The underlying implementation of the Java memory model can be briefly described as follows:

Reordering is prohibited through memory barriers, which the compiler replaces with specific CPU instructions depending on the underlying architecture. For the compiler, a memory barrier limits how much reordering optimization it may perform; for the processor, a memory barrier forces the cache to be flushed.

Conclusion

For volatile, the compiler inserts barriers before and after volatile accesses, so that writes to the shared variable go straight to main memory and multithreaded reads come straight from main memory, eliminating the visibility and reordering problems

Finally, let me ask: is anyone carsick yet 🚗?

Today I walked you through the underlying implementation of volatile, from the hardware level up to the software level. I also hope my article helps you spot the knowledge points you may still be missing. Going forward I will keep writing source code analysis, deep dives into the underpinnings of Java keywords, and articles on the principles behind middleware commonly used at the big tech companies. The next article should be a fundamental explanation of synchronized. If there is a topic you especially want me to cover sooner, leave me a comment in the comments section.

Finally, I leave you with one sentence: work hard and keep going — years from now, you will thank yourself for having worked so hard!

I am the one who loves writing code. See you next time!