One afternoon I came across the MESI cache coherence protocol, which raised a number of related questions for me, so I wrote this article to sort them out.

What you can learn from this article

  1. What is the MESI protocol, and what problems does it solve
  2. What is instruction reordering, and what problem does it solve
  3. What is a memory barrier, and what problem does it solve
  4. MESI vs. concurrent synchronization

The MESI cache coherence protocol

The premises for MESI

  • The emergence of multi-level caching
  • The emergence of multi-core CPUs

In short, modern computers adopt both of these designs for performance. The figure below compares the I/O speed of the different levels of storage.

The problem MESI solves

Since modern computers have multi-core CPUs, and each core has its own private cache (see the architecture in the figure below), multiple cores may operate on the same data, leaving the caches of different cores with inconsistent values for that data.

The concrete design of MESI

MESI takes its name from the initials of the four states a cache line (the basic unit of cache operations, analogous to a page on disk) can be in:

  1. Modified (M): the line has been modified and is now inconsistent with memory. If another CPU's cache also holds this line, that copy's state is changed to Invalid
  2. Exclusive (E): only one cache holds the line, and it has not been modified
  3. Shared (S): the caches of multiple CPUs hold the line, and none has modified it
  4. Invalid (I): the data in this cache line is invalid

State transitions

The state transitions of a cache line are shown in the figure below. By controlling the state bits of each cache line, the caches of a multi-core CPU are kept coherent.

A concrete example

  1. Suppose we have two CPUs whose caches are both empty, and a variable x = 0 in memory
  2. CPU1 reads x from memory over the bus, and the cache line's state is set to E
  3. CPU2 reads x from memory over the bus. CPU1 detects an address conflict, and the state of the cache line holding x is set to S on both CPUs
  4. CPU1 needs to change the value of x in its cache: it sets the state of x's line to M, tells CPU2 to change the state of its copy to I, and then modifies the value of x in its cache
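The four-step walkthrough above can be sketched as a toy simulation. This is an illustrative model, not a real protocol implementation: the `Cache` and `Bus` classes and their methods are invented for this example, and the write-back details are heavily simplified.

```python
# Toy model of MESI state transitions for a single cache line.
# States: "M", "E", "S", "I"; a missing entry means the line is not cached.

MEMORY = {"x": 0}

class Cache:
    def __init__(self, name, bus):
        self.name, self.bus, self.lines = name, bus, {}

    def read(self, addr):
        line = self.lines.get(addr)
        if line and line["state"] != "I":
            return line["value"]                     # cache hit
        others = self.bus.snoop_read(addr, self)     # other caches observe the read
        state = "S" if others else "E"               # Shared if anyone else holds the line
        self.lines[addr] = {"state": state, "value": MEMORY[addr]}
        return MEMORY[addr]

    def write(self, addr, value):
        self.read(addr)                              # make sure the line is present
        self.bus.snoop_invalidate(addr, self)        # tell the other caches to invalidate
        self.lines[addr] = {"state": "M", "value": value}

class Bus:
    def __init__(self):
        self.caches = []

    def snoop_read(self, addr, requester):
        holders = [c for c in self.caches
                   if c is not requester
                   and c.lines.get(addr, {}).get("state", "I") != "I"]
        for c in holders:
            if c.lines[addr]["state"] == "M":        # dirty line: write back first
                MEMORY[addr] = c.lines[addr]["value"]
            c.lines[addr]["state"] = "S"
        return holders

    def snoop_invalidate(self, addr, requester):
        for c in self.caches:
            if c is not requester and addr in c.lines:
                c.lines[addr]["state"] = "I"

bus = Bus()
cpu1, cpu2 = Cache("cpu1", bus), Cache("cpu2", bus)
bus.caches = [cpu1, cpu2]

cpu1.read("x")                                                # step 2
print(cpu1.lines["x"]["state"])                               # E
cpu2.read("x")                                                # step 3
print(cpu1.lines["x"]["state"], cpu2.lines["x"]["state"])     # S S
cpu1.write("x", 1)                                            # step 4
print(cpu1.lines["x"]["state"], cpu2.lines["x"]["state"])     # M I
```

Running it prints the E → S → M/I progression described in steps 2–4.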

Simulation website: www.scss.tcd.ie/Jeremy.Jone… (its behavior differs slightly)

The problem MESI introduces

Switching cache-line states across CPUs takes time. When a CPU initiates a state change, it must wait for the other CPUs to receive the message, finish switching the state of the corresponding lines in their own caches, and send back an acknowledgement. Any blocking along the way can cause all sorts of performance and stability problems.

In fact, the CPU could spend the time it would otherwise wait for the cache-line state switch executing the next instruction, which leads to the instruction reordering discussed below.

Instruction reordering and memory barriers

As mentioned above, the CPU can execute the next instruction without waiting for the current one to complete. This is called instruction reordering.

Instruction reordering happens in two phases: at compile time and at run time. Its purpose is performance: executing out of order without changing the semantics of the original program. The reordering discussed here is clearly the run-time kind; compile-time reordering is not covered.

How instruction reordering is implemented

  • Store buffer:
    • Originally, the CPU had to wait synchronously for acknowledgements from the other CPUs before it could modify a value in its cache
    • Now the result of the current store is placed directly in the store buffer and the next instruction executes immediately; once the acknowledgements from the other CPUs arrive asynchronously, the value in the store buffer is flushed to the cache
  • Invalidate queue:
    • Originally, when a CPU received an invalidation notice from another CPU, it had to stop what it was doing, switch the state of the corresponding cache line, and reply with an acknowledgement
    • Now the invalidation notice is placed directly in the invalidate queue and the acknowledgement is returned immediately; the queued notices are processed later, at leisure
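The two structures above can be sketched with a toy model. Everything here is a simplification invented for illustration (the `Core` class, its method names, and the explicit "drain" steps are assumptions); real hardware drains these structures on its own schedule, which is exactly why the ordering problem arises.

```python
from collections import deque

# Toy model: stores park in a per-core store buffer, and invalidations
# park in a per-core invalidate queue, until each is explicitly drained.

class Core:
    def __init__(self, cache):
        self.cache = dict(cache)
        self.store_buffer = {}           # addr -> pending value
        self.invalidate_queue = deque()  # pending invalidation notices

    def store(self, addr, value):
        self.store_buffer[addr] = value  # don't wait for the other cores

    def load(self, addr, memory):
        if addr in self.store_buffer:    # store forwarding: a core sees its own writes
            return self.store_buffer[addr]
        if addr in self.cache:           # possibly STALE until the queue is drained!
            return self.cache[addr]
        return memory[addr]

    def drain_store_buffer(self, memory, others):
        for addr, value in self.store_buffer.items():
            memory[addr] = value
            self.cache[addr] = value
            for o in others:             # queue invalidations instead of waiting
                o.invalidate_queue.append(addr)
        self.store_buffer.clear()

    def drain_invalidate_queue(self):    # what a read barrier forces, conceptually
        while self.invalidate_queue:
            self.cache.pop(self.invalidate_queue.popleft(), None)

memory = {"data": 0, "ready": 0}
cpu1 = Core({"data": 0, "ready": 0})
cpu2 = Core({"data": 0, "ready": 0})

cpu1.store("data", 42)
cpu1.store("ready", 1)
cpu1.drain_store_buffer(memory, [cpu2])  # writes visible; invalidations only queued

print(cpu2.load("ready", memory))        # 0 -- stale: the invalidation is still queued
cpu2.drain_invalidate_queue()
print(cpu2.load("ready", memory), cpu2.load("data", memory))  # 1 42
```

The stale `0` read in the middle is precisely the visibility problem described in the next section.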

Problems introduced by reordering

Visibility problems in concurrent environments

This reordering creates visibility problems in concurrent code, because changes made by one core are not immediately visible to the other cores (both the store buffer and the invalidate queue are processed asynchronously), which can lead to data inconsistencies in concurrent programs.

The memory barrier

The CPU does not know which instructions can and cannot be reordered, so it leaves that decision to the software (the programmer); the mechanism for expressing it is the memory barrier.

There are four types of memory barriers: LoadLoad barriers, StoreStore barriers, LoadStore barriers, and StoreLoad barriers
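A classic illustration of where these barriers matter is the message-passing pattern. The sketch below is plain Python (where the interpreter already guarantees this ordering, so it runs correctly as-is); the comments mark where two of the four barriers would have to go on a weakly ordered CPU in a lower-level language. This is an assumed illustration, not something Python itself requires.

```python
# Message passing: a producer publishes data, then raises a flag;
# a consumer waits for the flag, then reads the data.

data = 0
ready = False

def producer():
    global data, ready
    data = 42
    # StoreStore barrier would go here: the write to data must become
    # visible to other cores before the write to ready
    ready = True

def consumer():
    while not ready:     # spin until the flag is published
        pass
    # LoadLoad barrier would go here: the read of ready must complete
    # before the read of data
    return data

producer()
print(consumer())        # 42
```

Without those two barriers, a weakly ordered CPU could let the consumer see `ready == True` while still reading a stale `data`.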

Different processors implement memory barriers differently; let's take the x86 architecture as an example

Read barrier

Effect: all memory updates that happened before the read barrier are visible to loads issued after it

CPU operation: process all pending invalidation notices (lines marked I) in the invalidate queue

Write barrier

Effect: all memory updates (M) that happened before the write barrier are visible to subsequent instructions

CPU operation: wait until the store buffer is drained (all pending updates flushed to the cache) before executing the instructions after the barrier

Full barrier

Effect: combines both of the above

CPU operation: performs both of the above

Conclusion

Multi-core, multi-level-cache computers: improve on the performance of a cache-less single-core computer, but introduce the cache coherence problem

MESI: solves the cache coherence problem on multi-core machines, at a cost in performance

Instruction reordering: alleviates MESI's performance cost, but introduces visibility problems

Memory barriers: hand the visibility problem in concurrent code to the software (the programmer), who tells the CPU which instructions must not be reordered

All in all, this chain of designs aims to improve computer performance while keeping cached data coherent

Further thoughts

What is the difference between MESI cache coherence and concurrent synchronization?

One operates at the level of CPUs (caches), the other at the level of threads; one CPU can run multiple threads. Concurrent synchronization reasons about how threads interact with memory, and does not see the caches in between.

Let’s start with a single-core machine

  • Single-core machines do not have cache coherence problems, but they do have multi-threaded concurrent-synchronization problems

Now consider a multi-core machine

  1. Suppose two threads A and B run on CPU1 and CPU2, there is a variable x in memory, and both threads perform x++.

  2. Concurrency control requires the x++ operation to be locked, meaning the two threads cannot operate on x at the same time; they must run x++ serially even though they are on two CPUs

  3. What concurrency control sees is: thread A writes x=1 to memory, then thread B reads x from memory, performs x++, and writes x=2 to memory

  4. From the CPU's point of view, however, it talks directly to the cache. Without the MESI cache coherence protocol, it is possible that thread A reads x=0 from CPU1's cache, performs x++, and writes x=1 to that cache, while thread B also reads x=0 from CPU2's cache and writes x=1 to its cache.

  5. In practice, concurrency control relies on the MESI cache coherence protocol. Thread A reads x=0 from CPU1's cache, performs x++, writes x=1 to the cache, invalidates the x=0 line in CPU2's cache, and writes the value back to memory. Because its cached copy of x was invalidated, CPU2 can only read x=1 from memory; thread B then performs x++, writes x=2 back to memory, and invalidates the x=1 line in CPU1's cache.
To sum up, concurrency control guarantees that multiple threads execute critical sections serially, while on multi-core machines it relies on cache coherence to guarantee correct results.
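Point 2 above — locking forces the two threads to run x++ serially — can be demonstrated directly. This is a minimal sketch using Python threads; the iteration count and names are arbitrary choices for the example.

```python
import threading

# Two threads increment a shared counter. x += 1 is a read-modify-write,
# so without mutual exclusion updates can be lost; the lock forces the
# critical section to run serially, exactly as point 2 describes.

x = 0
lock = threading.Lock()

def increment(n):
    global x
    for _ in range(n):
        with lock:        # serialize the read-modify-write
            x += 1

t1 = threading.Thread(target=increment, args=(100_000,))
t2 = threading.Thread(target=increment, args=(100_000,))
t1.start(); t2.start()
t1.join(); t2.join()

print(x)                  # 200000 -- the lock guarantees no lost updates
```

Remove the `with lock:` line (in a runtime without a global interpreter lock) and the final count can come up short, which is the lost-update scenario of point 4.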

References

  1. zhuanlan.zhihu.com/p/370057417
  2. www.cnblogs.com/hello-shf/p…
  3. stackoverflow.com/questions/2…
  4. www.zhihu.com/question/27…