1 CPU Cache
The CPU often needs to process the same data and instructions repeatedly. By keeping this data and these instructions in the CPU cache, the CPU avoids reading them from main memory every time, reducing response time. This is the principle of locality at work.
A CPU cache is made up of a set of fixed-size blocks called Cache Lines. A cache line is the smallest unit of storage that can be allocated in the cache, usually 64 bytes.
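To see cache lines and locality in action, the following Java sketch (illustrative only, not a rigorous benchmark; the class name CacheLineDemo and the sizes are invented, and the numbers vary by machine) compares a traversal that touches consecutive elements within cache lines against one that jumps across them:

```java
// Row-by-row traversal of an int[N][N] touches consecutive elements that
// share cache lines; column-by-column traversal strides across rows and
// misses the cache far more often.
public class CacheLineDemo {
    static final int N = 2048;
    static final int[][] grid = new int[N][N];

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long sumRow = 0;
        for (int i = 0; i < N; i++)          // row-major: cache-friendly
            for (int j = 0; j < N; j++)
                sumRow += grid[i][j];
        long t1 = System.nanoTime();
        long sumCol = 0;
        for (int j = 0; j < N; j++)          // column-major: cache-hostile
            for (int i = 0; i < N; i++)
                sumCol += grid[i][j];
        long t2 = System.nanoTime();
        // Print the sums so the JIT cannot eliminate the loops.
        System.out.printf("row-major %d ms, column-major %d ms (sums %d/%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sumRow, sumCol);
    }
}
```

On typical hardware the column-major pass is several times slower, even though both loops do the same arithmetic.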
Caching improves performance, but it introduces consistency problems that must be handled.
2 Consistency Problem
For a single-core CPU, handling is relatively simple; there are usually two write policies:
- Write Through: every time the CPU modifies the cache, it immediately writes the update through to memory. This means every write to shared data causes a bus transaction.
- Write Back: the CPU does not immediately write modified cache data back to memory; instead, it defers the write-back until an appropriate time (for example, when the cache line is evicted).
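A toy Java sketch of the two policies (names like ToyCache are invented for illustration; a real cache operates on whole cache lines, not single addresses):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the two single-core write policies.
class ToyCache {
    private final Map<Integer, Integer> cache = new HashMap<>();  // address -> value
    private final Map<Integer, Integer> memory = new HashMap<>();
    private final boolean writeThrough;

    ToyCache(boolean writeThrough) { this.writeThrough = writeThrough; }

    void write(int addr, int value) {
        cache.put(addr, value);
        if (writeThrough) {
            memory.put(addr, value);  // write-through: every write reaches memory (a bus transaction)
        }
        // write-back: memory is only updated later, on eviction
    }

    void evict(int addr) {
        Integer v = cache.remove(addr);
        if (!writeThrough && v != null) {
            memory.put(addr, v);      // write-back: flush the dirty line when it is evicted
        }
    }
}
```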
Multi-core CPUs each have their own L1 cache. In addition to write-through and write-back, two further operations are introduced:
- Write invalidate: when one CPU modifies data, other CPUs holding that data are notified that their copies are invalid.
- Write update: when one CPU modifies data, other CPUs holding that data are notified to update their copies.
In addition, at the CPU level, two solutions are provided:
- Bus lock: when one CPU operates on a shared variable, it asserts a #LOCK signal on the bus, locking communication between CPUs and memory; other CPUs cannot operate on data at that memory address in the meantime.
- Cache lock: reduces the granularity of the lock; implemented on top of the cache coherence protocol.
The cache coherence protocol must satisfy two properties:
- Write Propagation: writes performed by one processor to a memory location become visible to the other processors.
- Write Serialization: all writes to the same memory location are serialized; that is, all processors observe those writes in the same order.
For write serialization: only one CPU write event can occur on the bus at any given time, so concurrent writes from multiple cores are turned into a serialized sequence of write events by the bus arbitration mechanism.
For write propagation: there are roughly two approaches:
- Snooping: a broadcast mechanism in which every cache listens to all activity on the bus.
- Directory-based: point-to-point; bus events are sent only to the interested CPUs (with the help of a directory).
The cache coherence protocol we usually refer to is the MESI protocol, which combines:
- Write back
- Write invalidate
- Cache lock
- Write propagation + write serialization
- Bus snooping
3 The MESI Protocol
3.1 Four States
- M: Modified
The current CPU's cache holds the latest data, and copies on other CPUs are invalid. The current CPU's data is inconsistent with memory, but the current CPU's copy is authoritative.
- E: Exclusive
Only the current CPU holds the data; other CPUs do not. The current CPU's data is consistent with memory.
- S: Shared
The current CPU holds the same data as other CPUs, consistent with memory.
- I: Invalid
The data in the current CPU's cache is invalid (other CPUs may or may not hold a valid copy). The data must be read from memory; the current CPU's copy is inconsistent with memory.
3.2 Four Operations
The four operations are LR (a read by the current CPU), LW (a write by the current CPU), RR (a read by another CPU), and RW (a write by another CPU). The state transitions are as follows.
3.2.1 Modified
LR: a read by the current CPU. The cache holds the latest data, so it is read directly from the cache; the state remains M.
LW: a write by the current CPU. It modifies its own cached data directly; after the modification the cache still holds the latest data, and the state remains M.
RR: a read by another CPU. The current CPU first writes its data back to memory to ensure consistency, so the other CPU reads the same data; the state changes to S.
RW: a write by another CPU. The current CPU writes its data back to memory, then the other CPU modifies it; the current CPU's cache state changes to I.
3.2.2 Exclusive
LR: a read by the current CPU; the state remains E.
LW: a write by the current CPU, modifying its cached value; the state changes to M.
RR: a read by another CPU. The two CPUs' data is now consistent with memory; the state changes to S.
RW: a write by another CPU, whose data is now the latest. The current CPU's data becomes invalid; the state changes to I.
3.2.3 Shared
LR: a read by the current CPU; the state remains S.
LW: a write by the current CPU. The data is not written back to memory immediately; to ensure consistency, the state changes to M.
RR: a read by another CPU. The data on all CPUs is consistent with memory, so the state remains S.
RW: a write by another CPU, whose data is now the latest. The current CPU's data becomes invalid; the state changes to I.
3.2.4 Invalid
LR: a read by the current CPU. Its cached copy is unavailable, so the data must be read from memory:
- If no other CPU has the data, the current CPU becomes the only holder; the state changes to E.
- If other CPUs have the data in state S or E, the current CPU's copy is consistent with the other CPUs and with memory; the state changes to S.
- If another CPU has the data in state M, that CPU first writes its data back to memory, then the current CPU reads it; the copies are now consistent with memory, and the state changes to S.
LW: a write by the current CPU. Its cached copy is unavailable, so memory must be accessed first:
- If no other CPU has the data, only the current CPU holds it after the write; since the data is now inconsistent with memory, the state changes to M.
- If other CPUs have the data in state S or E, their copies are invalidated; the current CPU's modified copy is the latest, and the state changes to M.
- If another CPU has the data in state M, that CPU first writes its data back to memory, then the current CPU performs its write; the state changes to M.
RR: a read by another CPU; it is irrelevant to the current CPU's invalid line, so the state remains I.
RW: a write by another CPU; it is irrelevant to the current CPU's invalid line, so the state remains I.
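The transition rules above condense into a small state machine. The following Java sketch is illustrative only (the name MesiState is invented; side effects such as write-backs and memory reads are omitted):

```java
// LR/LW are reads/writes by the current CPU; RR/RW are reads/writes
// observed from another CPU, as defined in section 3.2.
enum MesiState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID;

    MesiState onLocalRead(boolean othersHaveCopy) {
        switch (this) {
            case INVALID: return othersHaveCopy ? SHARED : EXCLUSIVE; // must read memory first
            default:      return this;  // M/E/S: cache hit, state unchanged
        }
    }

    MesiState onLocalWrite() {
        // Any local write leaves this line holding the newest,
        // memory-inconsistent data (from I, other copies are invalidated first).
        return MODIFIED;
    }

    MesiState onRemoteRead() {
        switch (this) {
            case MODIFIED:  return SHARED;  // write back first, then share
            case EXCLUSIVE: return SHARED;
            default:        return this;    // S stays S, I stays I
        }
    }

    MesiState onRemoteWrite() {
        // Another CPU now holds the latest data; any copy here is stale.
        return INVALID;
    }
}
```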
4 MESI Protocol Issues and Optimization
MESI relies on the bus snooping mechanism, so the whole process is serial and a CPU may block.
When an LW occurs on CPU0, it must first send an Invalidate message to CPU1 (and any other CPU that has cached the data) and wait for CPU1's acknowledgement. CPU0 is blocked during this time.
When an RW occurs from CPU1's perspective (that is, another CPU writes), CPU1 must invalidate its cached copy. Under heavy cache pressure it is hard to process invalidation events in real time, so there is some delay.
4.1 MESI protocol optimization
If the MESI protocol is followed strictly, there are serious performance problems. To solve the two problems above, Store Buffers and Invalidate Queues are introduced.
4.1.1 Store Buffer
Each CPU has its own store buffer (write buffer). With the store buffer, when an LW occurs the current CPU no longer blocks waiting for acknowledgements from other CPUs; instead, it writes the updated value into the store buffer and continues executing subsequent instructions.
On an LR, the CPU first checks the store buffer; if a pending entry exists, the value is fetched directly from the store buffer. This mechanism is called Store Forwarding.
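A minimal sketch of a store buffer with store forwarding (the name StoreBufferedCpu is invented, and variables are modeled as strings rather than cache lines):

```java
import java.util.HashMap;
import java.util.Map;

class StoreBufferedCpu {
    private final Map<String, Integer> storeBuffer = new HashMap<>();
    private final Map<String, Integer> cache = new HashMap<>();

    void write(String var, int value) {
        storeBuffer.put(var, value);   // park the write; don't block on other CPUs' acks
        // ... continue executing subsequent instructions immediately
    }

    int read(String var) {
        Integer pending = storeBuffer.get(var);
        if (pending != null) {
            return pending;            // store forwarding: the newest local write wins
        }
        return cache.getOrDefault(var, 0);
    }

    void drainStoreBuffer() {          // what a store barrier forces to happen
        cache.putAll(storeBuffer);
        storeBuffer.clear();
    }
}
```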
4.1.2 Invalidate Queue
When a CPU receives an invalidate message, it does not invalidate the corresponding cache line immediately. Instead, it puts the message into its invalidate queue and returns the acknowledgement right away.
Later, when idle, the CPU processes the messages in the invalidate queue and invalidates the corresponding cache lines.
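A matching sketch of the invalidate queue (again with invented names), showing why the immediate acknowledgement allows stale reads until the queue is drained:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class InvalidateQueueCpu {
    private final Set<String> validLines = new HashSet<>();
    private final Queue<String> invalidateQueue = new ArrayDeque<>();

    void cacheLine(String var) { validLines.add(var); }  // line brought into the cache

    boolean receiveInvalidate(String var) {
        invalidateQueue.add(var);  // queue it instead of invalidating now
        return true;               // immediate ack, so the writing CPU unblocks
    }

    boolean readHits(String var) {
        return validLines.contains(var);  // may be stale: the queue is not drained yet
    }

    void drainInvalidateQueue() {  // what a load barrier forces before any load
        while (!invalidateQueue.isEmpty()) {
            validLines.remove(invalidateQueue.poll());
        }
    }
}
```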
4.1.3 Problems after optimization
With the store buffer introduced, even though read and write instructions are issued in program order, they can still complete out of order.
For example, suppose write instructions A and B execute in sequence, where A's cache line is in state S and B's is in state E. A must wait for invalidate acknowledgements while B can write immediately, so write B completes before write A. Or suppose reads C and D execute in sequence, where C's cache line is in state I and D's is in state S. C must fetch from memory while D hits the cache, so D completes before C.
After the invalidate queue is introduced, stale data may be read.
For example, CPU0 executes a write and sends an invalidate message to CPU1, which immediately acknowledges it without actually invalidating its line. CPU0 then updates its cache line, while CPU1 can still read its stale copy, so the two processors are directly inconsistent.
```
x = 2;       // assume x == 1 initially
b = x + 1;   // expect b == 3
```
Correct execution: b == 3.
Single-core anomaly: the write x = 2 is still sitting in the store buffer, unconfirmed, while execution has already moved on to the next instruction; without store forwarding, the read of x would see the stale value.
Multi-core anomaly: CPU A has just written x = 2 into its store buffer, but CPU B has already started operating on x and still reads the old value.
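The multi-core anomaly is easy to provoke in Java. In the runnable sketch below (illustrative; the outcome depends on the JIT and the hardware), the reader thread may never observe the writer's update unless ready is declared volatile:

```java
public class VisibilityDemo {
    static boolean ready = false;   // try adding `volatile` here to fix the hang
    static int x = 1;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) { /* may spin forever without volatile */ }
            System.out.println("b = " + (x + 1)); // expect 3 once the write is visible
        });
        reader.start();
        Thread.sleep(100);
        x = 2;          // the write that may linger in a store buffer
        ready = true;   // without volatile, no guarantee the reader ever sees this
        reader.join();
    }
}
```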
4.2 CPU Memory Barriers
MESI originally provided strong consistency, but after these performance optimizations it weakens to eventual consistency: in some intermediate states, data is inconsistent across CPUs. Out-of-order execution, i.e. reordering, is also possible.
Generally speaking, there are three kinds of reordering:
- Compiler reordering: the compiler can rearrange the execution order of statements without changing single-threaded semantics.
- Instruction-level parallelism reordering: modern processors use instruction-level parallelism (ILP) to overlap the execution of multiple instructions; if there is no data dependency, the processor can change the execution order of the corresponding machine instructions.
- Memory system reordering: because the processor uses caches and read/write buffers, loads and stores can appear to execute out of order.
The first is compiler reordering; the last two are processor reordering.
Memory Barriers can be used to solve this problem.
- Store Memory Barrier: forces the contents of the store buffer to be flushed to the cache, or forces subsequent writes to wait in the store buffer until earlier entries have been flushed; also known as smp_wmb. (The write-side problem lives in the store buffer.)
- Load Memory Barrier: tells the CPU to process all messages in the invalidate queue before performing any load; also known as smp_rmb. (The read-side problem lives in the invalidate queue.)
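Java does not expose smp_wmb/smp_rmb directly, but JDK 9+ provides explicit fences on VarHandle that play analogous roles. A rough, illustrative mapping (not a 1:1 hardware correspondence; the class name FenceDemo is invented):

```java
import java.lang.invoke.VarHandle;

public class FenceDemo {
    static int data = 0;
    static boolean ready = false;

    static void writer() {
        data = 42;
        VarHandle.releaseFence();   // like a store barrier: earlier writes drain first
        ready = true;
    }

    static int reader() {
        boolean r = ready;
        VarHandle.acquireFence();   // like a load barrier: pending invalidations apply first
        return r ? data : -1;
    }
}
```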
But the CPU cannot decide on its own where and when to insert memory barriers; that decision is left to the software layer. In practice, the assembly instruction prefix LOCK is commonly used for this.
The LOCK instruction prefix does two things:
- Triggers a bus lock or cache lock by asserting a #LOCK signal on the bus. As mentioned above, the cache lock is generally implemented via the cache coherence protocol.
- Provides a memory-barrier effect for the instruction it decorates.
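In Java, the LOCK prefix is typically reached indirectly through the atomic classes. For example (the exact code generation is JVM- and platform-dependent; on x86, incrementAndGet is commonly JIT-compiled to a `lock xadd` and compareAndSet to a `lock cmpxchg`):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LockPrefixDemo {
    private static final AtomicInteger counter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) counter.incrementAndGet();
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(counter.get()); // always 200000: atomic + full barrier
    }
}
```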
This ensures the coherence of multi-core CPU caches at the hardware level. What about the software layers above? Let's take the JVM as an example.
5 JMM
At the JVM level, an abstract memory model, the Java Memory Model (JMM), is defined to address visibility and ordering issues.
5.1 Abstraction of the Java memory model
The JMM is a language-level abstract memory model that defines the rules for multithreaded reads and writes of shared memory. It pulls the underlying problems up to the JVM level, addressing the concurrency issues caused by multi-level CPU caches, processor optimizations, and instruction reordering, by building on the memory barrier instructions the CPU provides and by restricting compiler reordering.
The JMM determines when a write to a shared variable by one thread becomes visible to another thread. Conceptually:
Shared variables between threads are stored in main memory, and each thread has its own local memory holding copies of the shared variables it uses. Note that local memory is an abstraction and does not literally exist; it covers the CPU cache, write buffer, registers, and so on.
If thread A and thread B need to communicate:
- First, thread A flushes the updated shared variable from local memory A to main memory.
- Then, thread B reads from main memory the shared variable that thread A updated.
The JMM ensures visibility by controlling the interaction between main memory and each thread's local memory.
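A minimal sketch of this A-to-B communication (the class name JmmCommunication is invented), using a volatile flag so the JMM guarantees the flush and re-read ordering:

```java
public class JmmCommunication {
    static int shared = 0;             // the shared variable
    static volatile boolean published = false;

    public static void main(String[] args) {
        Thread a = new Thread(() -> {
            shared = 42;               // (1) write in A's local memory
            published = true;          // (2) volatile write: flushed to main memory
        });
        Thread b = new Thread(() -> {
            while (!published) { }     // (3) volatile read: re-read from main memory
            System.out.println(shared); // (4) guaranteed to print 42 (happens-before)
        });
        a.start();
        b.start();
    }
}
```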
5.2 JMM-Level Memory Barriers
As mentioned above, CPU-level memory barrier instructions can be used to limit instruction reordering. The JSR-133 specification also defines memory barriers at the JMM level. There are four types:
- LoadLoad barrier: sequence Load1, LoadLoad, Load2. Ensures that Load1's data is loaded before Load2 and any subsequent loads access their data.
- LoadStore barrier: sequence Load1, LoadStore, Store2. Ensures that Load1's data is loaded before Store2 and any subsequent writes are flushed out.
- StoreStore barrier: sequence Store1, StoreStore, Store2. Ensures that Store1's write is visible to other processors before Store2 and any subsequent writes execute.
- StoreLoad barrier: sequence Store1, StoreLoad, Load2. Ensures that Store1's write is visible to all processors before Load2 and any subsequent reads execute.
The StoreLoad barrier is the most expensive and has the effect of all three other barriers combined. These four barriers are a specification Java defined for cross-platform portability; depending on the CPU, some may be optimized away. On x86, for example, only StoreLoad requires an actual instruction.
5.3 Association with Volatile
The most common way to use memory barriers in Java is the volatile keyword. How does it work?
- A StoreStore barrier is inserted before each volatile write, ensuring that all writes queued in the store buffer before the volatile write have been flushed to the cache, and preventing the volatile write from being reordered with earlier writes.
- A StoreLoad barrier is inserted after each volatile write, ensuring that the volatile write itself is flushed to the cache before any subsequent write or read, and preventing subsequent reads/writes from being reordered with the volatile write.
- A LoadLoad barrier is inserted after each volatile read, ensuring that the invalidate queue has been drained before subsequent reads, and preventing subsequent reads from being reordered with the volatile read.
- A LoadStore barrier is inserted after each volatile read, ensuring that the invalidate queue has been drained before subsequent writes, and preventing subsequent writes from being reordered with the volatile read.
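Putting the four rules together, here is a pseudo-annotated sketch of where the barriers conceptually sit around a volatile field (the JIT then maps them to real instructions, or to nothing, depending on the platform; the class name VolatileBarriers is invented):

```java
public class VolatileBarriers {
    int a;
    volatile int v;
    int b;

    void writer() {
        a = 1;
        // StoreStore barrier: earlier writes flushed before the volatile write
        v = 2;               // volatile write
        // StoreLoad barrier: volatile write visible before later reads/writes
    }

    void reader() {
        int r = v;           // volatile read
        // LoadLoad barrier: drain the invalidate queue before later reads
        // LoadStore barrier: drain the invalidate queue before later writes
        b = r + a;
    }
}
```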