Memory barrier instructions [1] are supported by all common CPU architectures, such as the mfence/sfence/lfence instructions on x86 and the sync instruction on MIPS. In this article, I want to talk about how memory barriers [2] are implemented and about memory consistency models [3].

In the last article [4], I wrote about the implementation of atomic operations, and briefly introduced MESI, the cache coherence protocol. MESI ensures that CPU1 can read changes that CPU2 has written to its cache but not yet flushed to memory. It may seem that the consistency guarantee of the memory model (more on consistency models later) is already strict enough at this point, but it is not. Let's take a look at the implementation of a spinlock [5] given on Wikipedia:

; Intel syntax

locked:                      ; The lock variable. 1 = locked, 0 = unlocked.
     dd      0

spin_lock:
     mov     eax, 1          ; Set the EAX register to 1.

     xchg    eax, [locked]   ; Atomically swap the EAX register with
                             ;  the lock variable.
                             ; This will always store 1 to the lock, leaving
                             ;  the previous value in the EAX register.

     test    eax, eax        ; Test EAX with itself. Among other things, this will
                             ;  set the processor's Zero Flag if EAX is 0.
                             ; If EAX is 0, then the lock was unlocked and
                             ;  we just locked it.
                             ; Otherwise, EAX is 1 and we didn't acquire the lock.

     jnz     spin_lock       ; Jump back to the MOV instruction if the Zero Flag is
                             ;  not set; the lock was previously locked, and so
                             ; we need to spin until it becomes unlocked.

     ret                     ; The lock has been acquired, return to the calling
                             ;  function.

spin_unlock:
     mov     eax, 0          ; Set the EAX register to 0.

     xchg    eax, [locked]   ; Atomically swap the EAX register with
                             ;  the lock variable.

     ret                     ; The lock has been released.

However, this code is not guaranteed to work on every CPU: it depends on whether the atomic-swap instruction also acts as a memory barrier that keeps memory reads and writes in order. On x86 it happens to be correct, because an xchg with a memory operand is implicitly locked and fully orders memory accesses; an atomic swap on a weaker architecture need not be.
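For comparison, here is a minimal sketch of the same spinlock in portable C11 atomics (my own illustration, not from the Wikipedia article). The memory_order_acquire/memory_order_release arguments state the barrier requirement explicitly, instead of relying on xchg's implicit full barrier:

/* Minimal portable spinlock sketch using C11 atomics. */
#include <stdatomic.h>

static atomic_flag locked = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* Atomically swap 'true' in; acquire ordering prevents the reads and
       writes of the critical section from moving above the lock. */
    while (atomic_flag_test_and_set_explicit(&locked, memory_order_acquire))
        ;   /* spin until the previous value was 'false' (unlocked) */
}

void spin_unlock(void)
{
    /* Release ordering prevents the reads and writes of the critical
       section from moving below the unlock. */
    atomic_flag_clear_explicit(&locked, memory_order_release);
}

On x86 this compiles down to essentially the same xchg loop as the assembly above; on a weaker architecture the compiler additionally emits the barrier or load-acquire/store-release instructions that a bare swap would be missing.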

So what exactly does a memory barrier do? Take a look at the memory hierarchy first:

(Figure: a CPU core with its registers, store buffer, and load buffer sitting in front of the cache and main memory.)
The CPU accesses registers, the store buffer, and the load buffer much faster than it accesses cache and memory. In modern CPU design, the main way to improve performance is to increase parallelism. The two main ways to improve instruction-level parallelism are deepening the pipeline and issuing multiple instructions per cycle. Since the clock period is set by the most time-consuming stage of the pipeline, splitting the work into more, shorter stages lets instructions complete at a faster rate. However, there is an upper limit to how deep the pipeline can go, because the time spent transferring data between pipeline stages is not negligible.

The other approach is multiple issue: having the pipeline handle several instructions at once. Multiple issue comes in static and dynamic forms. Static multiple issue can be understood as packing several operations into a single instruction (VLIW [6]). Dynamic multiple issue is also called superscalar [7]: the pipeline has several execution units (floating-point, integer, store/load, and so on), and multiple instructions are dispatched to them in parallel according to their dependencies. In addition, thanks to the store buffer and load buffer, written data does not have to be flushed to memory immediately, and data can be read ahead of time, which further increases parallelism. For branch instructions there is also speculative execution, the technique the recent CPU security flaws were linked to.

As you can see from the above, our program's instructions are actually executed out of order. To ensure correctness, however, the results are committed in program order, which prevents a failed branch prediction from exposing instructions that should never have executed. During out-of-order execution, results live only temporarily in physical registers and buffers; only at commit are changes written to the ISA registers and the cache. So when a misprediction is detected, the speculative results are simply discarded, and uncommitted instructions are never exposed. By the way, the registers we use when writing assembly are logical registers, or ISA registers; there is a mapping (register renaming) between them and the real physical registers.
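As a rough illustration of instruction-level parallelism (a sketch of my own, not a rigorous benchmark; exact results depend heavily on the compiler and CPU, and aggressive optimization such as vectorization can erase the effect), summing through one dependent chain is often slower than splitting the same number of additions across two independent accumulators, because an out-of-order core can execute the independent additions in parallel:

/* ILP demonstration sketch: one dependent add chain vs. two independent
   chains. Both loops perform N additions. Try compiling with -O1. */
#include <stdio.h>
#include <time.h>

#define N 100000000L

int main(void)
{
    volatile double src = 1.0;   /* volatile: force a real load each time */
    double a = 0.0, b = 0.0;
    clock_t t0, t1, t2;

    t0 = clock();
    for (long i = 0; i < N; i++)
        a += src;                /* every add depends on the previous one */
    t1 = clock();
    for (long i = 0; i < N; i += 2) {
        a += src;                /* these two adds are independent, */
        b += src;                /*  so they can issue in parallel  */
    }
    t2 = clock();

    printf("dependent:   %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("independent: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("(sums: %f %f)\n", a, b);   /* keep a and b live */
    return 0;
}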

To summarize: first, data written by the CPU is not, and does not need to be, written straight to the cache; it is first held in registers and the store buffer, and data being read is likewise staged in the load buffer. Second, read and write instructions end up out of order because of all the parallel optimizations above.

At this point it should be clear that, seen from the outside, a memory barrier prevents reads and writes from being reordered across it. For example, mfence prevents memory reads and writes issued after it from being reordered before it, and prevents memory reads and writes issued before it from being reordered after it. sfence is similar to mfence, but constrains only memory writes; lfence constrains only memory reads.
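On x86 these barriers are reachable from C through compiler intrinsics (_mm_sfence, _mm_lfence, and _mm_mfence, provided by GCC/Clang/MSVC in <xmmintrin.h> and <emmintrin.h>). The sketch below, assuming writer() and reader() run on different cores, shows the classic publish/consume pattern. Note that on ordinary x86 stores and loads the hardware already preserves store-store and load-load order, so here the fences are mainly illustrative; they become essential on weaker architectures or with the non-temporal stores mentioned later:

/* Message-passing sketch using x86 fence intrinsics. */
#include <emmintrin.h>   /* _mm_lfence, _mm_mfence */
#include <xmmintrin.h>   /* _mm_sfence */

int data;
volatile int ready;      /* volatile: keep the compiler from caching it */

void writer(void)
{
    data = 42;
    _mm_sfence();        /* earlier stores become visible before later ones */
    ready = 1;
}

int reader(void)
{
    while (!ready)
        ;                /* wait for the flag */
    _mm_lfence();        /* earlier loads complete before later loads */
    return data;         /* guaranteed to see 42 */
}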

As for the implementation, take the x86 architecture as an example: sfence/mfence flush the changes buffered in the store buffer into the L1 cache, making them visible to other CPU cores, and writes issued after sfence/mfence cannot be scheduled before them.

According to official Intel documentation [8], early Intel processors such as the Intel 486 and Pentium followed sequential consistency [9]: the order of writes observed by all cores was consistent, and the order of reads and writes in a program was never disturbed, so no memory barrier was needed to protect the memory order. At that time none of the four orderings (read-read, read-write, write-read, write-write) could be violated.

However, newer Intel CPU implementations break these constraints to improve performance. First, prefetching is allowed to bring data into the load buffer ahead of time. Second, stores are allowed to wait in the store buffer, so a write followed by a read of a different location (write -> read) can effectively become read -> write: the read completes while the write is still buffered. Finally, the string-manipulation instructions and the cache-bypassing write instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) can be reordered with other writes. Therefore a memory barrier is needed to protect the memory order. By the way, readers interested in x86-TSO, the memory consistency model used by the x86 architecture, can find it here [10].
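This write -> read (store-load) reordering can actually be observed. The litmus test below is a sketch using C11 threads and relaxed atomics (which compile to plain mov instructions on x86): each thread stores to its own variable and then loads the other's. The outcome r1 == 0 && r2 == 0 is impossible under sequential consistency, but the x86 store buffer allows it; switching the operations to memory_order_seq_cst, which compilers implement with mfence or a locked instruction, rules it out:

/* Store-load reordering litmus test (C11 threads + atomics). */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x, y;
int r1, r2;

int t1(void *arg) { (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);   /* write own flag */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* read the other */
    return 0;
}
int t2(void *arg) { (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return 0;
}

int main(void)
{
    for (long i = 0; i < 1000000; i++) {
        thrd_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        if (r1 == 0 && r2 == 0) {   /* forbidden by sequential consistency */
            printf("store-load reordering observed after %ld runs\n", i + 1);
            return 0;
        }
    }
    printf("no reordering observed (try more runs)\n");
    return 0;
}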

We mentioned sequential consistency above, but as CPUs have been optimized over time, consistency has been weakened in exchange for performance. To keep programs correct under a weaker model, developers must use tools such as memory barriers. So how a CPU chooses its memory consistency model is essentially a trade-off: relaxing instruction-ordering constraints buys parallelism at the price of programming complexity.

The article was originally posted here:

Why do we need a memory barrier? · Issue #4 · Luohaha.com