Foreword

Following the earlier summary of the JMM and volatile, this review of the CPU's MESI cache coherence protocol will give you a better understanding of volatile. We have already looked at the internal structure of the CPU and its L1, L2, and L3 caches. With a multi-core CPU there are multiple caches, so how do we keep the cached data consistent and prevent the system's data from descending into chaos? That is where the MESI coherence protocol comes in. Before getting to MESI, let's take a look at how Java code flows from the JVM down to the CPU.

JVM-to-CPU low-level execution process

As shown in the figure above (how Java classes are loaded into the JVM is a topic for a later article), the JVM compiles the class into Java bytecode instructions, and the interpreter/JIT translates those into assembly instructions. The CPU cannot execute assembly directly, so the assembly instructions are in turn encoded into binary machine code, which is what the CPU actually executes. Every line of code has to go through this process before the CPU can execute it.

We looked at volatile in the previous article. If you add volatile to a field, the JVM adds the ACC_VOLATILE flag to the bytecode. But what happens when the code is converted to assembly instructions? Take a look at the following program.

```java
// VM options:
// -server -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
// -XX:CompileCommand=compileonly,*CodeVisibility.refresh
import lombok.extern.slf4j.Slf4j;

@Slf4j // assumed: the log field in the original comes from Lombok's @Slf4j
public class CodeVisibility {

    private volatile static boolean initFlag = false;
    private static int counter = 0;
    private static Integer counter2 = 0;

    public static void refresh() {
        log.warn("refresh data.......");
        initFlag = true;
        log.warn("refresh data success.......");
    }

    public static void main(String[] args) {
        Thread threadA = new Thread(() -> {
            while (!initFlag) {
            }
            log.warn("Thread: " + Thread.currentThread().getName()
                    + " detected the change in initFlag");
        }, "threadA");
        threadA.start();

        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        Thread threadB = new Thread(() -> refresh(), "threadB");
        threadB.start();
    }
}
```

The hsdis-amd64.dll and hsdis-amd64.lib files can be found by searching Baidu (or any search engine). Put these two files in the JDK's jre/bin directory, add the parameters above at runtime, and the assembly instructions will be printed.

In the generated assembly you will find one extra element in front of the write to initFlag, the volatile field: a lock prefix. Looking it up, the lock prefix used to trigger a bus lock; in other words, an application could lock the bus with this instruction. The main function of lock is to make the read-modify-write memory operation it prefixes atomic, a mechanism mostly used for reliable communication between processors in multiprocessor systems. As shown in the figure above, for a CPU to fetch variable data from main memory it must go through the bus. When core0's Thread0 needs to fetch the variable x=1 from main memory, the bus lock locks the bus, so other cores such as core1 cannot use the bus to reach main memory. That is, in the early days the lock prefix locked the bus: among all the cores of a multi-core CPU, only one core at a time could obtain the right to use the bus. That is what the lock prefix did, but it is inefficient.
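For reference, the lock-prefixed instruction HotSpot typically emits on x86 after the volatile store shows up in the PrintAssembly output looking roughly like the line below (abbreviated and representative; addresses and stack offsets vary by run):

```
lock addl $0x0,(%rsp)    ;*putstatic initFlag
```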

As you can see, in the early days, bus locks were used to ensure cache consistency.

Cache coherence protocol MESI

The name MESI comes from the first letters of its four states. Each cache line is in one of these four states, which can be encoded in two bits:

| State | Description | Listening task |
| --- | --- | --- |
| M (Modified) | The cache line is valid. The data has been modified and is inconsistent with main memory; the data exists only in this cache. | The line must listen for all attempts by other caches to read the corresponding main-memory location; such reads must be deferred until this cache writes the line back to main memory and changes its state to S (Shared). |
| E (Exclusive) | The cache line is valid and the data is consistent with main memory; the data exists only in this cache. | The line must listen for other caches reading the corresponding main-memory location; as soon as that happens, the line must change to S (Shared). |
| S (Shared) | The cache line is valid and the data is consistent with main memory; copies exist in several caches. | The line must listen for requests from other caches to invalidate or take exclusive ownership of it, and must then become I (Invalid). |
| I (Invalid) | The cache line is invalid. | None. |
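To make the state machine concrete, here is a minimal sketch in Java of how one core's copy of a cache line reacts to local and remote reads and writes. The MesiState and CacheLine types are hypothetical; this is a software model for illustration, not real hardware:

```java
// A software model of one core's copy of a cache line and its MESI state.
enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

final class CacheLine {
    MesiState state = MesiState.INVALID;

    // Local read: an invalid line is (re)loaded; it becomes SHARED if some
    // other cache already holds the line, EXCLUSIVE otherwise.
    void localRead(boolean heldByOtherCache) {
        if (state == MesiState.INVALID) {
            state = heldByOtherCache ? MesiState.SHARED : MesiState.EXCLUSIVE;
        }
        // M, E and S copies can be read locally without a state change.
    }

    // Local write: all other copies must first be invalidated (the bus signal
    // described below), then this copy becomes MODIFIED.
    void localWrite() {
        state = MesiState.MODIFIED;
    }

    // Another core reads the same line: a MODIFIED copy is written back to
    // main memory first; both M and E copies drop to SHARED.
    void remoteRead() {
        if (state == MesiState.MODIFIED || state == MesiState.EXCLUSIVE) {
            state = MesiState.SHARED;
        }
    }

    // Another core writes the same line: our copy becomes INVALID.
    void remoteWrite() {
        state = MesiState.INVALID;
    }
}
```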

Here’s an example:

As the bus-lock discussion above shows, every access to data in main memory has to go through the bus. As technology evolved, the hardware improved: instead of locking the whole bus, a CPU's access to data covered by the lock prefix is snooped, i.e. listened to, by the other CPUs on the bus.

As shown in the figure, main memory holds x=1, and core0 reads the x=1 variable into the L3 cache while no other CPU has read it; at this point x's state is E (Exclusive). If another CPU, core1, then reads the x=1 variable through the L3 cache (note: each CPU reads data through the L3 cache first, regardless of whether other CPUs have already read it), the state of core0's copy changes from E (Exclusive) to S (Shared), and the copy core1 reads is also in the S (Shared) state. Reading the data doesn't seem to be a problem. But what if both CPUs want to modify the x variable at the same time? Let's move on.



As shown in the figure, both CPUs have now read the data into their respective L1 caches which, as discussed in the previous article, consist of 64-byte cache lines that hold the variable data. If both CPUs want to modify the x variable at the same time, they compete for the same lock on the cache line where the x variable resides. The CPU that wins the competition locks the line and may modify x; the other cannot. Say core1 grabs the lock and successfully locks the cache line containing x: a signal is also sent on the bus, and every CPU holding a copy of x receives it and changes the state of its local copy from S (Shared) to I (Invalid). core1 can then change x from x=1 to x=3, and the state of its copy changes from S (Shared) to M (Modified). But it's not over yet. CPUs run very fast; what if both CPUs "successfully" lock their local cache lines at the same time? Whose write wins? Read on.



If both CPUs lock their own local cache lines at the same time, both send a local-write signal to the bus. At this point it is up to the bus to arbitrate and decide which CPU's lock actually succeeds.

Finally: in the example above, the x variable fits in a single cache line. But if the data is, say, 128 bytes, it takes multiple cache lines to hold it. MESI cannot guarantee consistency across multiple cache lines, so the operation is upgraded back to a bus lock.
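Replaying the core0/core1 walk-through with the CacheLine sketch from the MESI table section above (still a software model only; the comments show the intended state after each step):

```java
public class MesiWalkthrough {
    public static void main(String[] args) {
        CacheLine xOnCore0 = new CacheLine();
        CacheLine xOnCore1 = new CacheLine();

        xOnCore0.localRead(false);  // core0 reads x=1 alone    -> core0: EXCLUSIVE
        xOnCore1.localRead(true);   // core1 reads x=1 too      -> core1: SHARED
        xOnCore0.remoteRead();      // core0 snoops that read   -> core0: SHARED

        xOnCore1.localWrite();      // core1 wins the lock, x=3 -> core1: MODIFIED
        xOnCore0.remoteWrite();     // core0 snoops the write   -> core0: INVALID

        System.out.println("core0: " + xOnCore0.state + ", core1: " + xOnCore1.state);
    }
}
```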

Summary

A volatile field in Java code is translated by the interpreter into assembly in which the write to the volatile variable carries the lock prefix. CPUs reach main memory through the bus, and they snoop the bus for data covered by the lock prefix. This is what guarantees the visibility of the variable to every CPU.

The transitions between the states are shown in the figure above. The events that trigger the transitions are as follows:

| Trigger event | Description |
| --- | --- |
| Local read | The local cache reads data from its own cache line |
| Local write | The local cache writes data to its own cache line |
| Remote read | Another cache reads the data in this cache line |
| Remote write | Another cache writes to the data in this cache line |

Cache line false sharing

The CPU cache stores data in units of cache lines; on current CPUs a cache line is 64 bytes. In a multi-threaded situation, if threads modify "independent variables that share the same cache line", they can inadvertently hurt each other's performance. This is called False Sharing. For example, take two long variables a and b in the same cache line: if thread t1 accesses a while thread t2 accesses b, every time t1 modifies a the whole line is invalidated, forcing t2 to reload b. Java 8 added the @sun.misc.Contended annotation to address this: annotated classes have their fields automatically padded out to a full cache line. Note that the annotation is ignored by default; you need to start the JVM with -XX:-RestrictContended for it to take effect.
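A minimal sketch of the effect, assuming a 64-byte cache line. The PaddedLong type, its manual padding, and the iteration count are illustrative assumptions, not from the original article (@sun.misc.Contended with -XX:-RestrictContended achieves the same on Java 8); with the padding removed, the two counters can land in one cache line and the loop typically runs measurably slower, though numbers vary by machine:

```java
public class FalseSharingDemo {

    // Seven trailing longs pad the value so two PaddedLong instances are
    // unlikely to share a 64-byte cache line. Remove them to provoke false sharing.
    static final class PaddedLong {
        volatile long value;
        long p1, p2, p3, p4, p5, p6, p7; // padding
    }

    static final PaddedLong a = new PaddedLong();
    static final PaddedLong b = new PaddedLong();

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (long i = 0; i < 100_000_000L; i++) a.value++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < 100_000_000L; i++) b.value++; });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("took " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```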

MESI optimization and the problems it brings

Passing coherence messages between caches takes time, which introduces latency on every state switch. When one cache line switches state, the other caches must receive the message, perform their own switch, and send a response, and the CPU may wait a long time for all the responses to arrive. Any such blocking can cause all sorts of performance and stability problems. For example, to modify a piece of data in the local cache, the CPU must notify every other CPU that caches that data to move it to the I (Invalid) state, and then wait for their confirmations. Waiting for confirmation blocks the processor and degrades its performance, because the wait is far longer than the execution time of a single instruction. To avoid wasting CPU time like this, Store Buffers were introduced: the processor writes the value destined for main memory into the store buffer and goes on with other work, and the data is only committed once all the Invalidate Acknowledge messages have been received. But this carries two risks, listed below (a sketch of the mechanism follows the list).

  • First, the processor may try to read a value that is still sitting uncommitted in its own store buffer. The solution is called Store Forwarding: a load returns the value from the store buffer if one is present there.
  • Second, there is no guarantee of when the store will actually complete.
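A minimal, single-threaded sketch of the mechanism (the StoreBufferSketch type is hypothetical; real store buffers are hardware, not a Java map): write() only queues the store, read() applies store forwarding, and drain() models committing once every Invalidate Acknowledge has arrived:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class StoreBufferSketch {
    private final Map<String, Integer> cache = new HashMap<>();
    private final Deque<SimpleEntry<String, Integer>> storeBuffer = new ArrayDeque<>();

    // Risk 2: the write is only queued; there is no telling when it commits.
    void write(String address, int value) {
        storeBuffer.addLast(new SimpleEntry<>(address, value));
    }

    // Risk 1 and its fix, store forwarding: a load first checks the newest
    // pending store to this address before falling back to the cache.
    int read(String address) {
        Iterator<SimpleEntry<String, Integer>> it = storeBuffer.descendingIterator();
        while (it.hasNext()) {
            SimpleEntry<String, Integer> entry = it.next();
            if (entry.getKey().equals(address)) {
                return entry.getValue();
            }
        }
        return cache.getOrDefault(address, 0);
    }

    // Models committing the buffered stores once all acknowledgements are in.
    void drain() {
        SimpleEntry<String, Integer> entry;
        while ((entry = storeBuffer.pollFirst()) != null) {
            cache.put(entry.getKey(), entry.getValue());
        }
    }
}
```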

Here’s an example:

```java
int value = 3;
boolean isFinsh = false; // assumed declaration, missing from the original snippet

void exeToCPUA() {
    value = 10;
    isFinsh = true;
}

void exeToCPUB() {
    if (isFinsh) {
        // Must value equal 10 here?!
        assert value == 10;
    }
}
```

Imagine CPU A starts executing while it holds isFinsh in its cache in the E state, but does not hold value in its cache at all (e.g., it is Invalid). In that case, value leaves the store buffer later than isFinsh does. It is then possible for CPU B to read isFinsh as true while value is still not equal to 10; that is, the assignment to isFinsh became visible before the assignment to value. This observable change in behavior is called reordering. Note that this does not mean the positions of your instructions were maliciously (or even deliberately) changed; it simply means that other CPUs may observe the results in a different order than the program wrote them. Incidentally, NIO's design is very similar to that of Store Buffers.

Hardware memory model

Performing an invalidation is not a cheap operation either; the processor has to do real work to handle it. In addition, store buffers are not infinite, so the processor sometimes has to wait for invalidate acknowledgements to drain them. Both of these can significantly degrade performance. To deal with this, invalidate queues were introduced. Their conventions are as follows (a code sketch follows the list):

  • For every Invalidate request received, an Invalidate Acknowledge message must be sent back immediately.
  • The Invalidate is not actually applied at that point; it is placed in a special queue, to be applied at a convenient time.
  • The processor does not send any further messages concerning a cache entry until it has applied the pending Invalidate for it.
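Continuing in the same sketchy spirit (the InvalidateQueueSketch type is hypothetical; this is a software model only), the conventions map onto code roughly like this:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class InvalidateQueueSketch {
    private final Map<String, Integer> cache = new HashMap<>();
    private final Deque<String> invalidateQueue = new ArrayDeque<>();

    // Conventions 1 and 2: acknowledge at once, but only queue the invalidation.
    boolean onInvalidateRequest(String address) {
        invalidateQueue.addLast(address);
        return true; // the Invalidate Acknowledge message, sent immediately
    }

    // Convention 3, and what a load barrier forces: apply everything queued
    // before touching the affected entries again.
    void applyPendingInvalidates() {
        String address;
        while ((address = invalidateQueue.pollFirst()) != null) {
            cache.remove(address);
        }
    }
}
```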

Even so, the processor doesn't know when these optimizations are allowed and when they are not, so it simply hands the problem to the person writing the code, in the form of Memory Barriers. A Store Memory Barrier (a.k.a. ST, SMB, smp_wmb) tells the processor to apply all stores already sitting in the store buffer before executing any instruction that comes after the barrier. A Load Memory Barrier (a.k.a. LD, RMB, smp_rmb) tells the processor to apply all invalidations already sitting in the invalidate queue before executing any load that comes after the barrier.

```java
void executedOnCpu0() {
    value = 10;
    // Apply all stores already sitting in the store buffer before updating the data.
    storeMemoryBarrier();
    finished = true;
}

void executedOnCpu1() {
    while (!finished) {
    }
    // Apply all invalidations queued for this data before reading it.
    loadMemoryBarrier();
    assert value == 10; // assumed read, to show what the load barrier protects
}
```
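For comparison, a rough Java analogue of this pseudocode can be written with the fence methods added in Java 9 on java.lang.invoke.VarHandle. This is a sketch mirroring the pseudocode's explicit barriers; in real Java code you would simply declare the fields volatile, which also guarantees the busy-wait below cannot be optimized away:

```java
import java.lang.invoke.VarHandle;

class BarrierSketch {
    static int value = 3;
    static boolean finished = false;

    static void executedOnCpu0() {
        value = 10;
        VarHandle.releaseFence();     // plays the role of storeMemoryBarrier()
        finished = true;
    }

    static void executedOnCpu1() {
        while (!finished) {
            VarHandle.acquireFence(); // plays the role of loadMemoryBarrier()
        }
        assert value == 10;
    }
}
```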