Series

  1. iOS Assembly Tutorial (1): ARM64 Assembly Basics
  2. iOS Assembly Tutorial (2): Embedding Assembly Code in Xcode Projects
  3. iOS Assembly Tutorial (3): Assembly Sections and Data Access
  4. iOS Assembly Tutorial (4): Quickly Analyzing System Function Implementations with LLDB Dynamic Debugging
  5. iOS Assembly Tutorial (5): Objc Block Memory Layout and Assembly Representation
  6. iOS Assembly Tutorial (6): CPU Instruction Reordering and Memory Barriers

Introduction

In multithreaded programming, a mutex is often used to keep global variables thread-safe, for example pthread_mutex in pthread or a semaphore in Mach. These primitives maintain resource state through lock and unlock, or up and down, operations, ensuring that only a bounded number of threads can hold a given resource at a time.

So can a mutex really be implemented purely in software? The answer is no: the mutex state itself has to be stored in memory, and when multiple threads access that state, software alone cannot guarantee that those accesses are thread-safe. True exclusive access therefore requires hardware support, in the form of primitives at the processor instruction set level.

This article introduces the ARM exclusive-access instructions. By learning them, you will understand not only the nature of locks and how they are implemented, but also how read/write consistency is ensured at the assembly level.

Basic concepts

Acquire and Release

In the previous article, we introduced memory barrier instructions that limit out-of-order CPU execution. Acquire and Release are likewise memory barriers that prevent out-of-order execution from causing logic errors.

Read-Acquire

Acquire modifies a memory read instruction. A read-acquire instruction prevents the memory operations that follow it from being executed ahead of it; that is, later memory operations cannot be reordered upward across the barrier. The original figure illustrating this is taken from [1].

Write-Release

Release modifies a memory write instruction. A write-release instruction prevents the memory operations that precede it from being delayed until after the write completes; that is, earlier memory operations cannot be reordered downward across the barrier. The original figure illustrating this is taken from [2].
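The two halves are usually used as a pair. The following is a minimal sketch in C11, not code from the original article: a producer publishes data with a write-release store and a consumer picks it up with a read-acquire load. The names payload, ready, publish, and try_consume are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* ordinary, non-atomic data (hypothetical name) */
static atomic_bool ready;  /* flag guarding payload; zero-initialized to false */

/* Producer: write the data, then publish it with a write-release store.
 * The payload write cannot be reordered after the flag store. */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: read the flag with a read-acquire load. If it observes true,
 * the payload read below cannot be reordered before the flag load. */
bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    return true;
}
```

If try_consume returns true in one thread after publish ran in another, the consumer is guaranteed to see the published payload. Without the acquire/release pair, the payload access could be reordered across the flag access.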

Exclusive Monitor

How it works

To support mutual exclusion of reads and writes at the hardware level, the processor must be able to tell whether an address has been modified by another processor or core. ARM processors contain a state machine called the Exclusive Monitor, which tracks the exclusive state of memory and thereby ensures read/write consistency [2].

The state machine starts in the Open state. A load-exclusive read of an address marks that address as Exclusive. A store-exclusive write to the same address checks whether the monitor is still in the Exclusive state. If it is, the data is written and the state returns to Open. If the monitor is already Open, another processor or core has written to the address in the meantime, and this write fails.

In short, load-exclusive marks the address it reads as Exclusive, and store-exclusive writes only to an address that is still Exclusive, resetting it to Open on success. This keeps reads efficient while preventing the consistency problems caused by competing writes.
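The state machine described above can be sketched as a toy, single-threaded model in C. This is only an illustration of the Open/Exclusive transitions, not how the hardware is built; the type monitor and the functions model_load_exclusive, model_other_write, and model_store_exclusive are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { MONITOR_OPEN, MONITOR_EXCLUSIVE } monitor_state;

typedef struct {
    monitor_state state;
    const int *addr;  /* address tagged by the last load-exclusive */
} monitor;

/* Load-exclusive: read the value and tag the address as Exclusive. */
int model_load_exclusive(monitor *m, const int *addr) {
    m->state = MONITOR_EXCLUSIVE;
    m->addr = addr;
    return *addr;
}

/* A write by another observer to the tagged address resets the monitor. */
void model_other_write(monitor *m, int *addr, int value) {
    *addr = value;
    if (m->state == MONITOR_EXCLUSIVE && m->addr == addr) {
        m->state = MONITOR_OPEN;
    }
}

/* Store-exclusive: succeeds only while the address is still tagged.
 * Returns 0 on success and 1 on failure; the monitor ends up Open. */
int model_store_exclusive(monitor *m, int *addr, int value) {
    bool tagged = (m->state == MONITOR_EXCLUSIVE && m->addr == addr);
    m->state = MONITOR_OPEN;
    if (!tagged) {
        return 1;
    }
    *addr = value;
    return 0;
}
```

In this model, a load-exclusive followed directly by a store-exclusive succeeds, while an intervening model_other_write to the same address makes the store-exclusive fail and leaves the other writer's value in place.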

Hardware implementation

For multi-core architectures, ARM divides Exclusive Monitor into Local and Global.

Local Monitor

If a memory address is marked Non-shareable, it is visible only to the current processor, and its exclusive state only needs to be maintained by the Local Monitor inside that processor.

The Local Monitor maintains state only within the processor. Since no state is shared among processors, it does not have to tag the actual memory; the hardware may implement it either by tagging memory addresses or by tracking instructions. Consequently, code that uses non-shared memory must not be written on the assumption that the Local Monitor checks the address [2].

Global Monitor

For concurrent programming across multiple processors, a mutex or semaphore can be defined in memory marked Shareable. To guarantee multiple-reader, single-writer exclusion, a Global Monitor, a hardware structure shared by all processors, is required. It records which processor has marked a given shared memory location as Exclusive, so that concurrent reads and writes from multiple processors stay consistent.

Compare and Swap

Compare and Swap, CAS for short, is the most common technique in lock-free programming. To change a shared value a, CAS first reads a and copies it to both pre_a and new_a, then modifies new_a. Before writing back, it checks whether a in memory still equals pre_a. If it does, a has not been changed by anyone else, and new_a can be written to memory. Otherwise the operation fails and is usually retried.

The compare and the swap must execute as a single atomic step. Otherwise the comparison may find the value unchanged, but another thread may modify it between the compare and the swap, and that modification is then overwritten.
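As a minimal sketch of this pattern in C11, the function below increments a shared counter with a CAS retry loop. The function name cas_increment is hypothetical; the variable names follow the pre_a and new_a roles described above, and atomic_compare_exchange_strong is the standard library call that performs the compare and the swap as one atomic step.

```c
#include <stdatomic.h>

/* Atomically add 1 to *a using a compare-and-swap loop.
 * Returns the value that was written. */
int cas_increment(atomic_int *a) {
    int pre_a = atomic_load(a);  /* snapshot of the shared value */
    int new_a;
    do {
        new_a = pre_a + 1;       /* compute the update from the snapshot */
        /* Writes new_a only if *a still equals pre_a. On failure, pre_a
         * is refreshed with the current value of *a and the loop retries. */
    } while (!atomic_compare_exchange_strong(a, &pre_a, new_a));
    return new_a;
}
```

Because a failed exchange reloads pre_a, the loop never writes a value computed from a stale snapshot, which is exactly the overwrite that a non-atomic compare followed by a separate write would allow.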

Exclusive instructions

We have introduced three basic concepts: Acquire and Release, the Exclusive Monitor, and Compare and Swap. Each has dedicated instruction support at the assembly level.

LDXR & STXR

LDXR is the exclusive version of LDR. It is used exactly like LDR, but additionally carries load-exclusive semantics: the memory location it reads is marked Exclusive.

STXR is the exclusive version of STR. Because the store may fail, STXR takes an additional 32-bit register operand that receives the result of the store.

STXR  Ws, Xt, [Xn|SP{,#0}]

That is, STXR tries to write Xt to [Xn|SP{,#0}]. If the write succeeds, it writes 0 to Ws; otherwise it writes a nonzero value. It is usually paired with CBZ or CBNZ: if the write fails, the code jumps back to LDXR and repeats the LDXR & STXR sequence until it succeeds.

The following example implements an atomic increment using LDXR & STXR:

; extern int atom_add(int *val);
_atom_add:
mov x9, x0 ; Back up x0 so it can be restored on failure
ldxr w0, [x9] ; Load an int from val and mark the address Exclusive
add w0, w0, #1
stxr w8, w0, [x9] ; Try to write back to val; the status goes to w8
cbz w8, atom_add_done ; w8 == 0 means the store succeeded
mov x0, x9 ; Restore x0 and retry atom_add
b _atom_add
atom_add_done:
ret
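For comparison, the same retry structure can be written in C11. This is a sketch rather than the author's code, and atom_add_c is a hypothetical name. The weak compare-exchange is allowed to fail spuriously, much like a store-exclusive can fail, so it is always used inside a loop. On targets with exclusive load/store instructions, compilers commonly lower such a loop to a load-exclusive/store-exclusive pair, but the exact instructions depend on the compiler and target flags.

```c
#include <stdatomic.h>

/* Hypothetical C counterpart of atom_add: returns the incremented value. */
int atom_add_c(atomic_int *val) {
    int old = atomic_load_explicit(val, memory_order_relaxed);
    /* The weak form may fail even when *val == old. On any failure,
     * `old` is reloaded with the current value, so we simply retry. */
    while (!atomic_compare_exchange_weak_explicit(
               val, &old, old + 1,
               memory_order_relaxed, memory_order_relaxed)) {
        /* retry with the refreshed `old` */
    }
    return old + 1;
}
```

Relaxed ordering is used here because the function only needs the increment itself to be atomic; it makes no promise about the order of surrounding memory operations.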

A similar example exists in the OSAtomicAdd32 function provided by libkern:

; int32_t OSAtomicAdd32(int32_t __theAmount, volatile int32_t *__theValue);
ldxr    w8, [x1]
add     w8, w8, w0
stxr    w9, w8, [x1]
cbnz    w9, _OSAtomicAdd32
mov     x0, x8
ret     lr

Besides the plain exclusive versions, LDXR & STXR also have variants with acquire-release semantics, LDAXR & STLXR, which additionally constrain execution order. For a pure atomic add, the former are sufficient; if the code involves a read/write wait pattern like the one discussed in the previous article, the latter are needed to rule out out-of-order execution.
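A lock is the typical case where the acquire-release variants matter. The following C11 sketch of a spinlock is an illustration under assumptions, not code from the article: the spinlock type and function names are hypothetical, and which instructions the compiler emits for these calls is not guaranteed. Taking the lock uses acquire ordering so the critical section cannot move above it; releasing the lock uses release ordering so the critical section cannot move below it.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag locked; } spinlock;  /* hypothetical type */

/* Put the flag into the clear (unlocked) state. */
void spinlock_init(spinlock *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_relaxed);
}

/* test_and_set returns the previous value: false means we took the lock. */
bool spinlock_try_lock(spinlock *l) {
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Spin until the lock is acquired. */
void spinlock_lock(spinlock *l) {
    while (!spinlock_try_lock(l)) {
        /* busy-wait */
    }
}

/* Release the lock; earlier writes become visible to the next owner. */
void spinlock_unlock(spinlock *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A caller would wrap the protected code between spinlock_lock and spinlock_unlock. A production lock would normally also back off or yield while spinning, which this sketch omits.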

CAS

ARM provides several instructions that perform Compare and Swap directly. CAS is the most basic form, used as follows [4]:

CAS Xs, Xt, [Xn|SP{,#0}] ; 64-bit, no memory ordering

CAS compares the value in Xs with the value in memory at [Xn|SP{,#0}] and, if they are equal, writes Xt to memory. In either case, the value read from memory is written back to Xs. To tell whether the write succeeded, keep a copy of the expected value and compare it with Xs after the CAS: if the write succeeded, Xs still holds the expected value; if it failed, Xs holds the different value that was found in memory.

The following example implements the same atomic increment using CAS:

; extern int cas_add(int *val);
_cas_add:
mov x9, x0
ldr w10, [x9] ; w10 = expected (old) value
mov w11, w10 ; w11 is the compare value; CAS overwrites it with the value read from memory
add w12, w10, #1 ; w12 = new value
cas w11, w12, [x9]
cmp w11, w10 ; if CAS succeeded, w11 still equals the expected value
b.ne _cas_add
mov w0, w12
ret

Note: To assemble code containing CAS instructions on iOS, add a compile flag to the .s file: -march=armv8.1-a [5].

Like the exclusive instructions, CAS has variants with memory-ordering semantics: CASA has Acquire semantics, CASL has Release semantics, and CASAL has both.

Experimental code

If you want to try out and verify these instructions yourself, you can reuse the experimental code below, from which most of the code in this article is taken.

Create a new iOS project and add this code to main.m:

// main.m
#import <Foundation/Foundation.h>
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100

extern int atom_add(int *val);
extern int cas_add(int *val);
int as[10000] = {0};
int flags[10000] = {0};
int counter = 0;

void* pthread_add(void *arg1) {
    int idx = *(int *)arg1;
    // A plain, non-atomic increment like this would break the assert:
    // as[idx] += 1;
    cas_add(as + idx);
    __asm__ __volatile__("dmb sy");
    atom_add(flags + idx);
    return NULL;
}

void* pthread_end(void *arg1) {
    int idx = *(int *)arg1;
    while (flags[idx] != N); // spin until all N adder threads have finished
    assert(as[idx] == N);
    printf("a = %d\n", as[idx]);
    return NULL;
}

void test(int idx) {
    printf("begin test %d\n", idx);
    int n = N;
    pthread_t threads[n + 1];
    for (NSInteger i = 0; i < n; i++) {
        int *copyIdx = calloc(1, sizeof(int));
        *copyIdx = idx;
        pthread_create(threads + i, NULL, &pthread_add, (void *)copyIdx);
    }
    for (NSInteger i = 0; i < n; i++) {
        pthread_detach(threads[i]);
    }
    int *endIdx = calloc(1, sizeof(int)); // heap copy: &idx would dangle once test() returns
    *endIdx = idx;
    pthread_create(threads + n, NULL, &pthread_end, (void *)endIdx);
    pthread_detach(threads[n]);
}

int main(int argc, char * argv[]) {
    printf("atom_add at %p\n", atom_add);
    int round = 0;
    while (true) {
        test(round++);
    }
    
    // omit codes...
}

The assembly code for the two atomic increment implementations follows. It requires the compile flag -march=armv8.1-a.

; exclusive.s
.section __TEXT,__text,regular,pure_instructions
.p2align 2
.global _atom_add, _cas_add

_atom_add:
mov x9, x0
ldxr w0, [x9]
add w0, w0, #1
stxr w8, w0, [x9]
cbz w8, atom_add_done
mov x0, x9
b _atom_add
atom_add_done:
ret

_cas_add:
mov x9, x0
ldr w10, [x9] ; w10 = expected (old) value
mov w11, w10 ; w11 is the compare value; CAS overwrites it with the value read from memory
add w12, w10, #1 ; w12 = new value
cas w11, w12, [x9]
cmp w11, w10 ; if CAS succeeded, w11 still equals the expected value
b.ne _cas_add
mov w0, w12
ret

References

  1. Preshing on Programming. Acquire and Release Semantics
  2. ARM Info Center. Exclusive monitors
  3. Stack Overflow. ARM64: LDXR/STXR vs LDAXR/STLXR
  4. ARM Info Center. CASA, CASAL, CAS, CASL
  5. GCC, the GNU Compiler Collection. AArch64 Options