Original address: research.swtch.com/plmm

Original author: Russ Cox

Release date: July 6, 2021

Programming Language Memory Models (Memory Models, Part 2)

The memory model of a programming language answers the question of what behavior parallel programs can rely on when sharing memory between their threads. For example, consider this program written in a C-like language, where x and done both start out zero.

// Thread 1           // Thread 2
x = 1;                while(done == 0) { /* loop */ }
done = 1;             print(x);

The program tries to pass a message in x from thread 1 to thread 2, using done as a signal that the message is ready to be received. If thread 1 and thread 2, each running on its own dedicated processor, both run to completion, is the program guaranteed to finish and print 1, as intended? The memory model of a programming language answers this question and others like it.

Although the details differ for each programming language, there are a few general answers that apply essentially to all modern multithreaded languages, including C, C++, Go, Java, JavaScript, Rust, and Swift.

  • First, if x and done were ordinary variables, the loop in thread 2 might never stop. A common compiler optimization is to load a variable into a register at its first use and then reuse that register for later accesses to the variable, for as long as possible. If thread 2 copies done into a register before thread 1 executes, it may keep using that register for the entire loop, never noticing that thread 1 later modifies done.
  • Second, even if thread 2’s loop does stop, having observed done == 1, it might still print that x is 0. Compilers often reorder program reads and writes based on optimization heuristics or even just as a side effect of walking hash tables or other intermediate data structures while generating code. The compiled code for thread 1 might end up writing x after done instead of before, or the compiled code for thread 2 might end up reading x before the loop.

Given how broken the program is, the obvious question is how to fix it.

Modern languages provide special functionality, in the form of atomic variables or atomic operations, to allow a program to synchronize its threads. If we make done an atomic variable (or manipulate it using atomic operations, in languages that take that approach), then our program is guaranteed to finish and print 1. Making done atomic has many effects:

  • The compiled code for thread 1 must make sure that the write to x completes and is visible to other threads before the write to done becomes visible.
  • The compiled code for thread 2 must (re)read done on every iteration of the loop.
  • The compiled code for thread 2 must read x only after it reads done.
  • The compiled code must do whatever is necessary to disable hardware optimizations that might reintroduce any of these problems.

The end result of making done atomic is that the program behaves as we want, successfully passing the value in x from thread 1 to thread 2.
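To make this concrete, here is a minimal C++11 sketch of the fixed program (the discussion so far uses a C-like pseudocode; this is just one way to render it, using a sequentially consistent std::atomic for done):

#include <atomic>
#include <cstdio>
#include <thread>

int x;                    // ordinary (non-atomic) data
std::atomic<int> done(0); // synchronizing atomic, starts at 0

void thread1() {
	x = 1;    // must become visible before done = 1 does
	done = 1; // sequentially consistent atomic store
}

void thread2() {
	while (done == 0) { /* loop: done is re-read every iteration */ }
	std::printf("%d\n", x); // guaranteed to print 1
}

int main() {
	std::thread t1(thread1), t2(thread2);
	t1.join();
	t2.join();
	return 0;
}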

In the original program, after compiler reordering, thread 1 could be writing x at the same moment thread 2 reads x. That is a data race. In the revised program, the atomic variable done serves to synchronize access to x: it is now impossible for thread 1 to write x at the same moment thread 2 reads x. The program is data-race-free. In general, modern languages guarantee that data-race-free programs always execute in a sequentially consistent way, as if the operations from the different threads were interleaved, arbitrarily but without reordering, onto a single processor. This is the DRF-SC property of hardware memory models, adopted in the programming language context.

As an aside, these atomic variables or atomic operations would more properly be called “synchronizing atomics.” It’s true that the operations are atomic in the database sense, allowing simultaneous reads and writes that behave as if run sequentially in some order: what would be a race on ordinary variables is not a race when atomics are used. But it’s even more important that the atomics synchronize the rest of the program, providing a way to eliminate races on the non-atomic data. The standard terminology is plain “atomic,” though, so that’s what this article uses. Just remember to read “atomic” as “synchronizing atomic” unless noted otherwise.

The memory model of a programming language spells out these exact details of what programmers can expect and what compilers must provide, serving as a contract between them. The general features sketched above hold for essentially all modern languages, but the convergence is recent: in the early 2000s there was significantly more variation. Even today there is significant variation among languages on second-order questions, including:

  • What are the ordering guarantees for the atomic variables themselves?
  • Can a variable be accessed by both atomic and non-atomic operations?
  • Are there synchronization mechanisms besides atomics?
  • Are there atomic operations that don’t synchronize?
  • Are there any guarantees for programs with races?

After some preliminaries, the rest of this article examines how different languages answer these and related questions, along with the paths they took to get there. The article also points out the many false starts along the way, to emphasize that we are very much still learning what works and what does not.

Hardware, Litmus Tests, Happens-Before, and DRF-SC

Before we get into the details of any particular language, let’s briefly summarize the lessons we need to keep in mind about the hardware memory model.

Different architectures allow different amounts of instruction reordering, so code running in parallel on multiple processors can have different allowed results depending on the architecture. The gold standard is sequential consistency, in which any execution must behave as if the programs running on the different processors were simply interleaved in some order onto a single processor. That model is much easier for developers to reason about, but no significant architecture provides it today, because the weaker guarantees bring performance gains.

It is difficult to make completely general statements comparing different memory models. Instead, it helps to focus on specific test cases, called litmus tests. If two memory models allow different behaviors for a given litmus test, that proves they are different and can often help us see whether, at least for that test case, one is weaker or stronger than the other. For example, here is the litmus test form of the program we examined earlier.

Litmus Test: Message Passing
Can this program see r1 = 1, r2 = 0?

// Thread 1           // Thread 2
x = 1                 r1 = y
y = 1                 r2 = x
On sequentially consistent hardware: no.
On x86 (or other TSO): no.
On ARM/POWER: yes!
In any modern compiled language using ordinary variables: yes!

As in the previous article, we assume every example starts with all shared variables set to zero. The names rN denote private storage, like registers or function-local variables; other names like x and y denote distinct shared (global) variables. We ask whether a particular setting of the registers is possible at the end of an execution. When answering a litmus test for hardware, we assume there is no compiler reordering what happens on each thread: the instructions in the listing translate directly into assembly instructions handed to the processor to execute.

The outcome r1 = 1, r2 = 0 corresponds to the original program’s thread 2 finishing its loop (done is y) but then printing 0. For the assembly-language version, printing 0 is not possible on x86, although it is possible on looser architectures like ARM and POWER, due to the reordering performed by the processors themselves. In a modern language, the reordering that can happen during compilation makes this outcome possible no matter what the underlying hardware is.

As we noted earlier, today’s processors don’t guarantee sequential consistency but instead a property called “data-race-free sequential consistency,” or DRF-SC (sometimes written SC-DRF). A system guaranteeing DRF-SC must define specific instructions, called synchronizing instructions, which provide a way to coordinate different processors (equivalently, threads). Programs use those instructions to establish a happens-before relationship between code running on one processor and code running on another.

For example, here is a depiction of a short execution of a program on two threads; as usual, each thread is assumed to be on its own dedicated processor.

We also saw this program in the previous article. Thread 1 and thread 2 execute a synchronizing instruction S(a). In this particular execution of the program, the two S(a) instructions establish a happens-before relationship from thread 1 to thread 2, so the W(x) on thread 1 happens before the R(x) on thread 2.

Two events on different processors that are not ordered by happens-before might occur at the same moment: the exact order is unclear. We say they execute concurrently. A data race is a write to a variable executed concurrently with a read or another write to that same variable. Processors that provide DRF-SC (all of them, these days) guarantee that programs without data races behave as if they were running on a sequentially consistent architecture. This is the fundamental guarantee that makes it possible to write correct multithreaded assembly programs on modern processors.
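As a concrete illustration (mine, not from the original post), here is a C++ sketch in which a mutex unlock followed by a later lock plays the role of the S(a) instructions, creating a happens-before edge between the two threads:

#include <cstdio>
#include <mutex>
#include <thread>

int x;         // ordinary shared data
std::mutex mu; // the synchronizing object, playing the role of a

void writer() {
	std::lock_guard<std::mutex> g(mu);
	x = 42; // W(x), performed while holding mu
}           // unlocking mu is a synchronizing operation

void reader() {
	std::lock_guard<std::mutex> g(mu);
	std::printf("%d\n", x); // R(x): prints 42 if this lock follows
}                           // the writer's unlock, otherwise 0

int main() {
	std::thread t1(writer), t2(reader);
	t1.join();
	t2.join();
	return 0;
}

Either ordering of the two critical sections is possible, but because every access to x is ordered by the mutex, there is no data race, and each read sees a well-defined value.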

As we saw earlier, DRF-SC is also the basic guarantee adopted by modern languages, making it possible to write proper multithreaded programs in high-level languages.

Compilers and Optimizations

We have mentioned a few times that compilers might reorder the operations in the input program in the course of generating the final executable code. Let’s take a closer look at that claim, and at other optimizations that might cause problems.

It is generally accepted that a compiler can reorder ordinary reads from and writes to memory almost arbitrarily, as long as the reordering does not change the observed single-threaded execution of the code. For example, consider this program:

w = 1
x = 2
r1 = y
r2 = z

Since w, x, y, and z are all different variables, the four statements can be executed in the order the compiler deems best.

As we pointed out above, the freedom to reorder reads and writes makes the guarantees of ordinary compiled programs at least as weak as the ARM/POWER relaxed memory model, since compiled programs fail the message passing litmus test. In fact, the guarantees for compiled programs are weaker still.

In the hardware article, we looked at coherence as an example of something the ARM/POWER architectures do guarantee:

Litmus Test: Coherence
Can this program see r1 = 1, r2 = 2, r3 = 2, r4 = 1?
(Can Thread 3 see x = 1 before x = 2 while Thread 4 sees the reverse?)

// Thread 1    // Thread 2    // Thread 3    // Thread 4
x = 1          x = 2          r1 = x         r3 = x
                              r2 = x         r4 = x
On sequentially consistent hardware: no.
On x86 (or other TSO): no.
On ARM/POWER: no.
In any modern compiled language using ordinary variables: yes!

All modern hardware guarantees coherence, which can also be viewed as sequential consistency for the operations on a single memory location. In this program, one of the writes must overwrite the other, and the entire system has to agree about which is which. It turns out that, because of program reordering during compilation, modern languages do not even provide coherence.

Suppose the compiler reorders the two reads in thread 4, and then the instructions run as if interleaved in this order:

// Thread 1    // Thread 2    // Thread 3    // Thread 4
                                             // (reordered)
(1) x = 1                     (2) r1 = x     (3) r4 = x
               (4) x = 2      (5) r2 = x     (6) r3 = x

The result is r1 = 1, r2 = 2, r3 = 2, r4 = 1, which was impossible in the assembly programs but is possible in high-level languages. In this sense, programming language memory models are all weaker than even the most relaxed hardware memory models.

But there are some guarantees. Everyone agrees on providing DRF-SC, which disallows optimizations that introduce new reads or writes, even if those optimizations would be valid in single-threaded code.

For example, consider this code.

if(c) {
	x++;
} else {
	... lots of code ...
}

There is an if statement with lots of code in the else and only an x++ in the if body. It might be cheaper to have fewer branches and eliminate the if body entirely. We can do that by running the x++ before the if and then adjusting with an x-- in the big else body if we were wrong. That is, the compiler might consider rewriting this code as:

x++;
if(!c) {
	x--;
	... lots of code ...
}

Is this a safe compiler optimization? In a single-threaded program, yes. In a multithreaded program in which x is shared with another thread while c is false, no: the optimization introduces a race on x that was not present in the original program.

This example comes from Hans Boehm’s 2004 paper “Threads Cannot Be Implemented As a Library,” which makes the case that languages cannot be silent about the semantics of multithreaded execution.

The memory model of a programming language is an attempt to answer precisely these questions about which optimizations are allowed and which are not. By studying the history of attempts to write these models over the decades, we can learn what worked and what didn’t, and get a sense of where things are headed.

The original Java Memory Model (1996)

Java was the first mainstream language to try to write down its guarantees for multithreaded programs. It included mutexes, and it defined the memory ordering requirements they implied. It also included “volatile” atomic variables: all reads and writes of volatile variables were required to execute in program order directly in main memory, making operations on volatile variables behave in a sequentially consistent manner. Finally, Java specified (or at least attempted to specify) the behavior of programs with data races. One part of this mandated a form of coherence for ordinary variables, which we examine further below. Unfortunately, this attempt, in the first version of the Java Language Specification (1996), had at least two serious flaws. With the benefit of hindsight, they are easy to explain using the preliminaries we’ve already established. At the time, they were far less obvious.

Atomics need to synchronize

The first flaw was that volatile atomic variables were non-synchronizing, so they did not help eliminate races in the rest of the program. The Java version of the message passing program we saw above would be:

int x;
volatile int done;

// Thread 1           // Thread 2
x = 1;                while(done == 0) { /* loop */ }
done = 1;             print(x);

Because done is declared volatile, the loop is guaranteed to finish: the compiler cannot cache it in a register and create an infinite loop. However, the program is not guaranteed to print 1. The compiler was not prohibited from reordering the accesses to x and done, nor was it required to prohibit the hardware from doing the same.

Because Java volatiles were non-synchronizing atomics, you could not use them to build new synchronization primitives. In this sense, the original Java memory model was too weak.

Coherence is incompatible with compiler optimizations

The original Java memory model was also too strong: mandating coherence—once a thread reads a new value of a memory location, it cannot appear to later read the old value—meant that basic compiler optimizations became invalid. Earlier we looked at how reordering reads would break coherence, but you might think, well, just don’t reorder reads. There is a more subtle way coherence can be broken by another optimization: common subexpression elimination.

Consider this Java program.

// p and q may or may not point at the same object.
int i = p.x;
// ... maybe another thread writes p.x at this point ...
int j = q.x;
int k = p.x;
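Common subexpression elimination would notice that p.x is computed twice and want to reuse the first result, optimizing the last line to k = i. A sketch of the compiler’s rewritten version:

// p and q may or may not point at the same object.
int i = p.x;
// ... maybe another thread writes p.x at this point ...
int j = q.x;
int k = i;  // compiler reuses i instead of re-reading p.x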

But if p and q pointed to the same object, and another thread wrote to p.x between the reads into i and j, then reusing the stale value i for k violates coherence: the read into i saw an old value, the read into j saw a newer value, but then the read into k, reusing i, would once again see the old value. Not being able to eliminate redundant reads would hobble most compilers, making the generated code slower.

Coherence is easier for hardware to provide than for a compiler, because hardware can apply dynamic optimizations: it can adjust the optimization paths based on the exact addresses involved in a given sequence of memory reads and writes. In contrast, compilers can only apply static optimizations: they have to emit, ahead of time, an instruction sequence that is correct no matter what addresses and values are involved. In this example, the compiler cannot easily change what happens based on whether p and q happen to point to the same object, at least not without writing out code for both possibilities, leading to significant time and space overhead. The compiler’s incomplete knowledge of possible aliasing between memory locations means that actually providing coherence would require giving up fundamental optimizations.

Bill Pugh pointed out this and other problems in his 1999 paper “Fixing the Java Memory Model.”

The New Java Memory Model (2004)

Because of these problems, and because the original Java memory model was difficult even for experts to understand, Pugh and others started an effort to define a new memory model for Java. That model became JSR-133 and was adopted in Java 5.0, released in 2004. The canonical reference is “The Java Memory Model” (2005) by Jeremy Manson, Bill Pugh, and Sarita Adve, with additional details in Manson’s doctoral thesis. The new model follows the DRF-SC approach: data-race-free Java programs are guaranteed to execute in a sequentially consistent manner.

Synchronizing atomics (and other operations)

As we saw earlier, to write a data-race-free program, the programmer needs synchronizing operations that can establish happens-before edges, to ensure that one thread does not write a non-atomic variable concurrently with another thread reading or writing it. In Java, the main synchronizing operations are:

  • The creation of a thread happens before the first action in that thread.
  • An unlock of mutex m happens before any subsequent lock of m.
  • A write to volatile variable v happens before any subsequent read of v.

What does “subsequent” mean? Java defines that all lock, unlock, and volatile variable accesses behave as if they occurred in some sequentially consistent interleaving, giving a total order over all those operations in the entire program. “Subsequent” means later in that total order. That is: the total order over lock, unlock, and volatile variable accesses defines the meaning of subsequent, then subsequent defines which happens-before edges are created by a particular execution, and then the happens-before edges define whether that particular execution had a data race. If there is no race, the execution behaves in a sequentially consistent manner.

The fact that the volatile accesses must act as if in some total order means that in the store buffering litmus test you can’t end up with r1 = 0 and r2 = 0:

Litmus Test: Store Buffering
Can this program see r1 = 0, r2 = 0?

// Thread 1           // Thread 2
x = 1                 y = 1
r1 = y                r2 = x
On sequentially consistent hardware: no.
On x86 (or other TSO): yes!
On ARM/POWER: yes!
On Java using volatiles: no.

In Java, for volatile variables x and y, the reads and writes cannot be reordered: one write has to come second, and the read that follows the second write must see the first write. If we did not require a sequentially consistent total order—say, if volatiles only had to be coherent—the two reads could both miss the writes.

There is an important but subtle point here: the total order over all the synchronizing operations is separate from the happens-before relationship. There is not a happens-before edge between every pair of lock, unlock, or volatile variable accesses in a program: you only get a happens-before edge from a write to a read that observes the write. For example, a lock and unlock of different mutexes have no happens-before edges between them, nor do volatile accesses of different variables, even though collectively these operations must behave as if following a single sequentially consistent interleaving.

Semantics of racy programs

DRF-SC only guarantees sequentially consistent behavior for programs with no data races. The new Java memory model, like the original, defined the behavior of racy programs as well, for a number of reasons:

  • To support Java’s general safety and security guarantees.
  • To make it easier for programmers to find mistakes.
  • To make it harder for attackers to exploit problems, because the damage possible due to a race is more limited.
  • To make it clearer to programmers what their programs do.

Instead of relying on coherence, the new model reused the happens-before relation (already used to decide whether a program had a race at all) to decide the outcome of racing reads and writes.

Java’s specific rule is that, for variables of word size or smaller, a read r of a variable (or field) x must see the value stored by some single write w to x. The write w can be observed by r provided r does not happen before w, and provided there is no intervening write w′ with w happens-before w′ happens-before r.

Using happens-before in this way, combined with synchronizing atomics that can establish new happens-before edges, was a major improvement over the original Java memory model. It gave programmers more useful guarantees, and it made a large number of important compiler optimizations definitively allowed. This work remains the Java memory model today. That said, it is still not quite right: there are problems with this use of happens-before for defining the semantics of racy programs.

Happens-before does not rule out incoherence

The first problem with using happens-before to define program semantics has to do with coherence (again!). The following example is from Jaroslav Ševčík and David Aspinall’s paper “On Validity of Program Transformations in the Java Memory Model” (2007).

Here is a program with three threads. Let’s assume that thread 1 and thread 2 finish before thread 3 starts.

// Thread 1           // Thread 2           // Thread 3
lock(m1)              lock(m2)
x = 1                 x = 2
unlock(m1)            unlock(m2)
                                            lock(m1)
                                            lock(m2)
                                            r1 = x
                                            r2 = x
                                            unlock(m2)
                                            unlock(m1)

Thread 1 writes x = 1 while holding mutex m1. Thread 2 writes x = 2 while holding mutex m2. Those are different mutexes, so the two writes race. However, only thread 3 reads x, and it does so after acquiring both mutexes. The read into r1 can read either write: both happen before it, and neither definitively overwrites the other. By the same argument, the read into r2 can read either write. But strictly speaking, nothing in the Java memory model says the two reads have to agree: technically, r1 and r2 can be left having read different values of x; that is, this program can end with r1 and r2 holding different values. Of course, no real implementation is going to produce different r1 and r2: mutual exclusion means there are no writes happening between those two reads, so they must read the same value. But the fact that the memory model allows different reads shows that it is, in a certain technical sense, not precisely describing real Java implementations.

It gets worse. What if we add a single instruction, x = r1, between the two reads?

// Thread 1           // Thread 2           // Thread 3
lock(m1)              lock(m2)
x = 1                 x = 2
unlock(m1)            unlock(m2)
                                            lock(m1)
                                            lock(m2)
                                            r1 = x
                                            x = r1   // !?
                                            r2 = x
                                            unlock(m2)
                                            unlock(m1)

Now, clearly the r2 = x read must use the value written by x = r1, so the program must get the same values in r1 and r2. The two values r1 and r2 are now guaranteed to be equal.

The difference between these two programs means we have a problem for compilers. A compiler that sees r1 = x followed by x = r1 may well want to delete the second assignment, which is “obviously” redundant. But that “optimization” changes the second program, which must see the same values in r1 and r2, into the first program, which technically can have r1 different from r2. Therefore, according to the Java memory model, this optimization is technically invalid: it changes the meaning of the program. To be clear, this optimization would not change the meaning of Java programs executing on any real JVM you can imagine. But somehow the Java memory model disallows it, suggesting there is more that needs to be said.

For more about this example and others, see Ševčík and Aspinall’s paper.

Happens-before does not rule out acausality

That last example turns out to have been the easy problem. Here is a harder one. Consider this litmus test, using ordinary (non-volatile) Java variables:

Litmus Test: Racy Out Of Thin Air Values
Can this program see r1 = 42, r2 = 42?

// Thread 1           // Thread 2
r1 = x                r2 = y
y = r1                x = r2
(Obviously not!)

All the variables in this program start out zero, as always, and then this program effectively runs y = x in one thread and x = y in the other. Can x and y end up being 42? In real life, obviously not. But why not? It turns out the memory model based on happens-before does not disallow this result.

Suppose hypothetically that r1 = x did read 42. Then y = r1 would write 42 to y, and then the racing r2 = y could read 42, causing the x = r2 to write 42 to x, and that write races with (and is therefore observable by) the original r1 = x, appearing to justify the original hypothesis. In this example, 42 is called an out-of-thin-air value, because it appeared for no reason but then justified itself with circular logic. (What if the memory had previously held a 42 before its current 0, and the hardware incorrectly speculated that it was still 42? That speculation might become a self-fulfilling prophecy. This seemed more far-fetched before Spectre and related attacks showed just how aggressively hardware speculates. Even so, no hardware invents out-of-thin-air values this way.)

Clearly this program cannot end with r1 and r2 set to 42, but happens-before by itself does not explain why this can’t happen. Again, this suggests a certain incompleteness. The new Java memory model spent a lot of time on this incompleteness, about which more shortly.

This program has a race—the reads of x and y race against writes in the other thread—so we might argue that it’s an incorrect program. But here is a version with no data race:

Litmus Test: Non-Racy Out Of Thin Air Values
Can this program see r1 = 42, r2 = 42?

// Thread 1           // Thread 2
r1 = x                r2 = y
if (r1 == 42)         if (r2 == 42)
    y = r1                x = r2
(Obviously not!)

Since x and y start out zero, any sequentially consistent execution is never going to execute the writes, so this program has no racing writes, and therefore no races. Once again, though, happens-before alone does not exclude the possibility that, hypothetically, r1 = x sees a racing write that didn’t actually happen, and that from this hypothesis the conditions both end up true and x and y both end up 42 at the end of execution. This is another kind of out-of-thin-air value, but this time in a program with no race. Any model guaranteeing DRF-SC must guarantee that this program only sees all zeros at the end, yet happens-before alone does not explain why.

The Java memory model spends a lot of words—which I won’t go into—trying to exclude these kinds of acausal hypotheticals. Unfortunately, five years later, Sarita Adve and Hans Boehm had this to say about that work:

Prohibiting such causality violations in a way that does not also prohibit other desired optimizations turned out to be surprisingly difficult. … After many proposals and five years of spirited debate, the current model was approved as the best compromise. … Unfortunately, this model is very complex, was known to have some surprising behaviors, and has recently been shown to have a bug.

(Adve and Boehm, “Memory Models: A Case for Rethinking Parallel Languages and Hardware,” August 2010)

C++11 memory model (2011)

Let’s put Java aside and examine C++. Inspired by the apparent success of Java’s new memory model, many of the same people set out to define a similar memory model for C++, eventually adopted in C++11. Compared to Java, C++ deviated in two important ways. First, C++ makes no guarantees at all for programs with data races, which would seem to remove the need for much of the complexity of the Java model. Second, C++ provides three kinds of atomics: strong synchronization (“sequentially consistent”), weak synchronization (“acquire/release,” coherence-only), and no synchronization (“relaxed,” for hiding races). The relaxed atomics reintroduced all of Java’s complexity around defining the meaning of what amount to racy programs. The result is that the C++ model is more complicated than Java’s yet less helpful to programmers.
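To make the three levels concrete before we look at each in turn, here is a sketch of what they look like with C++11’s std::atomic (illustrative only):

#include <atomic>

std::atomic<int> a(0);

void flavors() {
	// Strong: participates in a single total order over all
	// sequentially consistent atomic operations (the default).
	a.store(1, std::memory_order_seq_cst);

	// Weak: a release observed by a later acquire creates a
	// happens-before edge, but there is no global total order.
	a.store(1, std::memory_order_release);
	int r1 = a.load(std::memory_order_acquire);

	// None: atomic (so not a data race), but no synchronization
	// and no ordering guarantees at all.
	a.store(1, std::memory_order_relaxed);
	int r2 = a.load(std::memory_order_relaxed);

	(void)r1;
	(void)r2;
}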

C++11 also defined atomic fences as an alternative to atomic variables, but they are not commonly used, and I won’t discuss them.

DRF-SC or Catch Fire

Unlike Java, C++ gives no guarantees to programs with races. Any program with a race anywhere in it falls into “undefined behavior.” A racing access in the first microseconds of program execution is allowed to cause arbitrary errant behavior hours or days later. This is often called “DRF-SC or Catch Fire”: if the program is data-race-free, it runs in a sequentially consistent manner; if not, it can do anything at all, including catch fire.

For a longer presentation of the arguments for DRF-SC or Catch Fire, see Boehm, “Memory Model Rationales” (2007) and Boehm and Adve, “Foundations of the C++ Concurrency Memory Model” (2008).

Briefly, there are four common justifications for this position:

  • C and C++ are already rife with undefined behavior, corners of the language where compiler optimizations run wild and users had better not wander, or else. What’s the harm in one more?
  • Existing compilers and libraries were written with no regard for threads, breaking racy programs in arbitrary ways. It would be too hard to find and fix all the problems—or so the argument goes, although it is unclear how those same unfixed compilers and libraries are meant to cope with relaxed atomics.
  • Programmers who really know what they are doing and want to avoid undefined behavior can use the relaxed atomics.
  • Leaving race semantics undefined allows an implementation to detect and diagnose races and stop execution.

Personally, that last justification is the only one I find compelling, although I observe that it is possible to say “race detectors are allowed” without also saying “one race on an integer invalidates your entire program.”

Here is an example from “Memory Model Rationales” that I think captures the essence of the C++ approach and its problems. Consider this program, which refers to a global variable x:

unsigned i = x;

if (i < 2) {
	foo: ...
	switch(i) {
	case 0: ...; break;
	case 1: ...; break;
	}
}

The claim is that a C++ compiler might be holding i in a register but then need the register for other purposes if the code at label foo is complex. Rather than spill the current value of i to the function stack, the compiler might instead decide to load i a second time from the global x upon reaching the switch statement. The result is that, halfway through the if body, i < 2 may stop being true. If the compiler compiled the switch into a computed jump using a table indexed by i, the code would index off the end of the table and jump to an unexpected address, which could do arbitrarily bad things.

From this example and others like it, the C++ memory model’s authors conclude that any racy access must be allowed to cause unbounded damage to the future execution of the program. My personal conclusion is that, in a multithreaded program, compilers should not assume they can reload a local variable like i by re-executing the memory read that initialized it. It may well have been impractical to expect existing C++ compilers, written for a single-threaded world, to find and fix code generation problems like this one, but in new languages, I think we should aim higher.

Digression: Undefined behavior in C and C++

As an aside, the C and C++ insistence on the compiler’s ability to behave arbitrarily badly in response to bugs in programs leads to truly ridiculous results. For example, consider this program, which was a topic of discussion on Twitter in 2017:

#include <cstdlib>

typedef int (*Function)();

static Function Do;

static int EraseAll() {
	return system("rm -rf slash");
}

void NeverCalled() {
	Do = EraseAll;
}

int main() {
	return Do();
}

If you were a modern C++ compiler like Clang, you might reason about this program as follows:

  • In main, it is clear that Do is either null or EraseAll.
  • If Do is EraseAll, then Do() is the same as EraseAll().
  • If Do is null, then Do() is undefined behavior and I can implement it however I want, including unconditionally as EraseAll().
  • Therefore, I can optimize the indirect call Do() into a direct call to EraseAll().
  • Here, I can also inline EraseAll.

The end result is that Clang optimizes the program down to:

int main() {
	return system("rm -rf slash");
}

You have to admit: next to this example, the possibility that the local variable i might suddenly stop being less than 2 halfway through the body of if(i < 2) seems positively tame.

In essence, modern C and C++ compilers assume no programmer would dare attempt undefined behavior. A programmer writing a program with a bug? Inconceivable!

As I said, in the new language, I think we should aim higher.

Acquire/release atomics

C++ adopted sequentially consistent atomic variables much like (the new) Java’s volatile variables (no relation to C++ volatile). In our message passing example, we can declare done as:

atomic<int> done;

and then use done as an ordinary variable, as in Java. Or we can declare an ordinary int done; and then use

atomic_store(&done, 1);

and

while(atomic_load(&done) == 0) { /* loop */ }

to access it. Either way, the operations on done take part in the sequentially consistent total order over the atomic operations and synchronize the rest of the program.

C++ also added weaker atomics, which can be accessed using atomic_store_explicit and atomic_load_explicit with an additional memory ordering argument. Using memory_order_seq_cst makes the explicit calls equivalent to the shorter ones above.
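For example, this store and loop (in C11/C++11 style) behave identically to the shorter atomic_store and atomic_load calls above:

atomic_store_explicit(&done, 1, memory_order_seq_cst);

while(atomic_load_explicit(&done, memory_order_seq_cst) == 0) { /* loop */ }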

The weaker atomics are called acquire/release atomics, in which a release observed by a later acquire creates a happens-before edge from the release to the acquire. The terminology is meant to evoke mutexes: release is like unlocking a mutex, and acquire is like locking that same mutex. The writes executed before the release must be visible to reads executed after the subsequent acquire, just as writes executed before unlocking a mutex must be visible to reads executed after later locking that same mutex.

To use weaker atomics, we can change our message passing example to use

atomic_store(&done, 1, memory_order_release);

and

while(atomic_load(&done, memory_order_acquire) == 0) { /* loop */ }

and it remains correct. But not all programs do.

Recall that sequentially consistent atomics required the behavior of all the atomics in the program to be consistent with some global interleaving—a total order—of the execution. Acquire/release atomics do not. They only require a sequentially consistent interleaving of the operations on a single memory location. That is, they only require coherence. The result is that a program using acquire/release atomics with more than one memory location may observe executions that cannot be explained by a sequentially consistent interleaving of all the acquire/release atomics in the program, arguably a violation of DRF-SC!

To show the difference, here is the store buffering example again:

Litmus Test: Store Buffering
Can this program see r1 = 0, r2 = 0?

// Thread 1           // Thread 2
x = 1                 y = 1
r1 = y                r2 = x
On sequentially consistent hardware: no.
On x86 (or other TSO): yes!
On ARM/POWER: yes!
On Java (using volatiles): no.
On C++11 (sequentially consistent atomics): no.
On C++11 (acquire/release atomics): yes!

C++’s sequentially consistent atomics match Java’s volatiles. But with acquire/release atomics, the program is allowed to behave as if r1 = y happened before y = 1, while at the same time r2 = x happened before x = 1, allowing r1 = 0, r2 = 0, in contradiction of whole-program sequential consistency. Why introduce these weaker acquire/release atomics at all? Probably because they are ordinary memory operations on x86.

Note that, for a given set of specific reads observing specific writes, C++’s sequentially consistent atomics and C++’s acquire/release atomics create the same happens-before edges. The difference between them is that some sets of specific reads observing specific writes are disallowed by sequentially consistent atomics but allowed by acquire/release atomics. One such example is the set that leads to r1 = 0, r2 = 0 in the store buffering case.

In a real sense, acquire/release atomics are less useful in practice than atomics providing sequential consistency. Here is an example. Suppose we have a new synchronization primitive: a single-use condition variable with two methods, Notify and Wait. For simplicity, only a single thread will ever call Notify, and only a single thread will ever call Wait. We want to arrange for Notify to be lock-free when the other thread is not yet waiting. We can do this with a pair of atomic integers:

class Cond {
	atomic<int> done;
	atomic<int> waiting;
	...
};

void Cond::notify() {
	done = 1;
	if (!waiting)
		return;
	// ... wake up waiter ...
}

void Cond::wait() {
	waiting = 1;
	if (done)
		return;
	// ... sleep ...
}

The important property of this code is that notify sets done before checking waiting, while wait sets waiting before checking done, so that concurrent calls to notify and wait cannot result in notify returning immediately while wait goes to sleep. But with C++ acquire/release atomics, they can. (To make matters worse, on some architectures, including 64-bit ARM, the best way to implement acquire/release atomics is as sequentially consistent atomics, so you might write a program that works fine on 64-bit ARM and only discover that it is incorrect when ported to other systems.)
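To see the failure concretely, here is a sketch (my adaptation, not from the original) of notify and wait written with explicit acquire/release operations. Shaped this way, the pair of methods is exactly the store buffering litmus test in disguise: a store-release to one variable followed by a load-acquire of the other imposes no ordering, so both threads can miss each other’s writes.

// Hypothetical acquire/release implementation -- incorrect:
void Cond::notify() {
	done.store(1, std::memory_order_release);
	if (!waiting.load(std::memory_order_acquire))
		return; // may read waiting == 0 and skip the wakeup
	// ... wake up waiter ...
}

void Cond::wait() {
	waiting.store(1, std::memory_order_release);
	if (done.load(std::memory_order_acquire))
		return;
	// ... sleep, possibly forever: notify may already have returned
}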

With this understanding, “acquire/release” is really an unfortunate name for these atomics, since the sequentially consistent ones do just as much acquiring and releasing. What is different about these is the loss of sequential consistency. It might have been better to call them “coherence” atomics. Too late.

Relaxed atomics

C++ did not stop with the merely coherent acquire/release atomics. It also introduced non-synchronizing atomics, called relaxed atomics (memory_order_relaxed). These atomics have no synchronizing effect at all—they create no happens-before edges—and they have no ordering guarantees at all either. In fact, there is no difference between a relaxed atomic read/write and an ordinary read/write, except that a race on relaxed atomics is not considered a race and cannot catch fire.
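A sketch of a typical legitimate use (my example, not from the original text): an event counter bumped by many threads, where increments must not be lost but nothing else is being published, so no ordering is needed:

#include <atomic>

std::atomic<long> hits(0);

void record_hit() {
	// Atomic, so concurrent increments are not a data race and
	// none are lost; but relaxed: no happens-before edges, no
	// ordering of any surrounding operations.
	hits.fetch_add(1, std::memory_order_relaxed);
}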

Much of the complexity of the revised Java memory model arises from defining the behavior of programs with data races. It would be nice if C++’s adoption of DRF-SC or Catch Fire, which effectively disallows programs with data races, meant we could throw away those weird examples we looked at earlier, making the C++ language spec simpler than Java’s. Unfortunately, including the relaxed atomics ends up preserving all those concerns, meaning the C++11 spec ended up no simpler than Java’s.

Like the Java memory model, the C++11 memory model also turned out to be flawed. Consider the data-race-free program from before:

Litmus Test: Non-Racy Out Of Thin Air Values
Can this program see r1 = 42, r2 = 42?

// Thread 1           // Thread 2
r1 = x                r2 = y
if (r1 == 42)         if (r2 == 42)
    y = r1                x = r2
(Obviously not!)

C++11 (ordinary variables): no.
C++11 (relaxed atomics): yes!

In their paper “Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it” (2015), Viktor Vafeiadis and others showed that the C++11 specification guarantees that this program must end with x and y set to zero when x and y are ordinary variables. But if x and y are relaxed atomics, then, strictly speaking, the C++11 specification does not rule out that r1 and r2 might both end up 42. (Surprise!)

See the paper for the details, but at a high level, the C++11 spec had some formal rules aiming to disallow out-of-thin-air values, combined with some vague words discouraging other kinds of problematic values. Those formal rules were the problem, so C++14 dropped them, leaving only the vague words. Quoting the rationale for removing them, the C++11 formulation turned out to be “both insufficient, in that it leaves it largely impossible to reason about programs with memory_order_relaxed, and seriously harmful, in that it arguably disallows all reasonable implementations of memory_order_relaxed on architectures like ARM and POWER.”

To recap: Java tried to exclude all acausal executions formally and failed. Then, with the benefit of Java’s hindsight, C++11 tried to formally exclude only some acausal executions, and it failed too. C++14 then said nothing formal at all. This is not going in the right direction.

In fact, a 2015 paper by Mark Batty and others titled “The Problem of Programming Language Concurrency Semantics” gave this sobering assessment:

Disturbingly, 40+ years after the introduction of the first relaxed-memory hardware (the IBM 370/158MP), the field still does not have a credible proposal for the concurrency semantics of any general-purpose high-level language that includes high-performance shared-memory concurrency primitives.

Even defining the semantics of weakly ordered hardware (setting aside the complications of software and compiler optimization) has not gone terribly well. A 2018 paper by Sizhuo Zhang and others titled “Constructing a Weak Memory Model” recounted more recent events:

Sarkar et al. published an operational model for POWER in 2011, and Mador-Haim et al. published an axiomatic model in 2012 that was proven to match the operational model. However, in 2014, Alglave et al. showed that the original operational model, as well as the corresponding axiomatic model, ruled out a newly observed behavior on POWER machines. For another instance, in 2016, Flur et al. gave an operational model for ARM, with no corresponding axiomatic model. One year later, ARM released a revision of their ISA manual explicitly forbidding behaviors allowed by Flur’s model, and this resulted in another proposed ARM memory model. Clearly, formalizing weak memory models empirically is error-prone and challenging.

The researchers who have been working to define and formalize all of this over the past decade are incredibly smart, talented, and persistent, and I don’t mean to detract from their efforts and accomplishments by pointing out inadequacies in the results. My conclusion is simply that, even without races, the problem of specifying the exact behavior of threaded programs is remarkably subtle and difficult. Today, it seems to remain beyond the grasp of even the best and brightest researchers. And even if it weren’t, a programming language definition works best when everyday developers can understand it without needing to spend a decade studying the semantics of concurrent programs.

C, Rust, and Swift memory models

C11 adopted the C++11 memory model as well, making it the C/C++11 memory model.

Rust 1.0.0 in 2015 and Swift 5.3 in 2020 both adopted the C/C++ memory model wholesale, including DRF-SC or Catch Fire and all the atomic types and atomic fences.

The adoption of the C/C++ model is unsurprising, since both languages are built on the C/C++ compiler toolchain (LLVM) and emphasize close integration with C/C++ code.

Hardware digression: efficient sequentially consistent atomics

Early multiprocessor architectures had a variety of synchronization mechanisms and memory models, with varying usability. In this diversity, the efficiency of different synchronization abstractions depended on how well they mapped onto what the architecture provided. To construct the abstraction of sequentially consistent atomic variables, sometimes the only choice was to use barriers that did more and cost far more than strictly necessary, especially on ARM and POWER.

With C, C++, and Java all providing this same abstraction of sequentially consistent synchronizing atomics, it behooves hardware designers to make that abstraction efficient. The ARMv8 architecture (both 32- and 64-bit) introduced the ldar and stlr load and store instructions, providing a direct implementation. In a 2017 talk, Herb Sutter claimed that IBM had approved him saying that they intended future POWER implementations to have somewhat more efficient support for sequentially consistent atomics as well, giving programmers “less reason to use relaxed atomics.” I can’t tell whether that happened, although here in 2021, POWER has turned out to matter much less than ARMv8.

The effect of this convergence is that sequentially consistent atomics are now well understood and can be efficiently implemented on all major hardware platforms, making them a good target for programming language memory models.

JavaScript Memory Model (2017)

You might think that JavaScript, a notoriously single-threaded language, would not need to worry about a memory model for code running in parallel on multiple processors. I certainly did. But you and I would be wrong.

JavaScript has web workers, which allow running code in another thread. As originally conceived, workers could only communicate with the main JavaScript thread by explicit message copying. With no shared writable memory, there was no need to consider issues like data races. However, ECMAScript 2017 (ES2017) added the SharedArrayBuffer object, which lets the main thread and workers share a block of writable memory. Why do this? In an early draft of the proposal, the first reason listed was compiling multithreaded C++ code to JavaScript.

Of course, having shared writable memory also requires defining atomic operations for synchronization and a memory model. JavaScript deviated from C++ in three important ways:

  • First, it limited the atomic operations to just the sequentially consistent ones. Other flavors can be compiled into sequentially consistent atomics, perhaps at some loss of efficiency but with no loss of correctness, and having only one flavor simplifies the rest of the system.

  • Second, JavaScript did not adopt “DRF-SC or Catch Fire.” Instead, like Java, it carefully defines the possible outcomes of racy accesses. The rationale is much the same as for Java, security in particular: allowing a racy read to return any value at all allows (arguably encourages) implementations to return unrelated data, which could leak private data at run time.

  • Third, in part because JavaScript provides semantics for racy programs, it defines what happens when atomic and non-atomic operations are used on the same memory location, as well as when the same memory location is accessed using different-sized accesses.

Defining the behavior of racy programs precisely leads to the usual complexities of relaxed memory semantics, along with the question of how to disallow out-of-thin-air reads and so on. In addition to those challenges, which are mostly the same as elsewhere, the ES2017 definition had two interesting bugs arising from a mismatch with the semantics of the new ARMv8 atomic instructions. These examples are adapted from Conrad Watt et al.’s 2020 paper “Repairing and Mechanising the JavaScript Relaxed Memory Model.”

As we noted in the preceding section, ARMv8 added the ldar and stlr instructions to provide sequentially consistent atomic loads and stores. These were targeted at C++, which does not define the behavior of any program with a data race. Unsurprisingly, then, the behavior of these instructions in racy programs did not match the expectations of the ES2017 authors, and in particular it did not satisfy the ES2017 requirements for racy program behavior.

Litmus Test: ES2017 racy reads on ARMv8
Can this program (using atomics) see r1 = 0, r2 = 1?

// Thread 1           // Thread 2
x = 1                 y = 1
r1 = y                x = 2 (non-atomic)
                      r2 = x
C++: yes (data race, can do anything at all).
Java: the program cannot be written.
ARMv8 using ldar/stlr: yes.
ES2017: no! (contradicting ARMv8)

In this program, all the reads and writes are sequentially consistent atomics except x = 2: thread 1 writes x = 1 using an atomic store, but thread 2 writes x = 2 using a non-atomic store. In C++, this is a data race, so all bets are off. In Java, this program can’t even be written: x must either be declared volatile or not; it can’t be accessed atomically only some of the time. In ES2017, the memory model turned out to disallow r1 = 0, r2 = 1: if r1 = y reads 0, thread 1 must finish before thread 2 begins, in which case the non-atomic x = 2 would seem to happen after and overwrite x = 1, causing the atomic r2 = x to read 2. This explanation seems entirely reasonable, but it is not the way ARMv8 processors work.

It turns out that, for the equivalent sequence of ARMv8 instructions, the non-atomic write to x can be reordered before the atomic write to y, so this program does in fact produce r1 = 0, r2 = 1 on that hardware. This is not a problem in C++, since the race means the program can do anything at all, but it is a problem for ES2017, which limits racy behaviors to a specific set of outcomes that does not include r1 = 0, r2 = 1.

Since it was an explicit goal of ES2017 that sequentially consistent atomic operations be implementable using the ARMv8 instructions, Watt et al. reported that their suggested fixes, which were expected to be included in the next revision of the standard, would weaken the racy behavior constraints just enough to allow this outcome. (It was unclear to me at the time whether “next revision” meant ES2020 or ES2021.)

Watt et al.’s suggested changes also included a fix for a second bug, first identified by Watt, Andreas Rossberg, and Jean Pichon-Pharabod, whereby a data-race-free program was not given sequentially consistent semantics by the ES2017 specification. That program is given by:

Litmus Test: ES2017 data-race-free program
Can this program (using atomics) see r1 = 1, r2 = 2?

// Thread 1           // Thread 2
x = 1                 x = 2
                      r1 = x
                      if (r1 == 1) {
                          r2 = x // non-atomic
                      }
On sequentially consistent hardware: no.
C++: I'm not enough of a C++ expert to say for sure.
Java: the program cannot be written.
ES2017: yes! (violating DRF-SC).

In this program, all the reads and writes are sequentially consistent atomics except r2 = x, as marked. This program is data-race-free: the non-atomic read, which would have to be involved in any race, only executes when r1 = 1, which proves that thread 1’s x = 1 happens before the r1 = x and therefore also before the r2 = x. DRF-SC means that the program must execute in a sequentially consistent manner, so r1 = 1, r2 = 2 is impossible, but the ES2017 specification allowed it.

The ES2017 specification of program behavior was therefore simultaneously too strong (it disallowed real ARMv8 behavior for racy programs) and too weak (it allowed non-sequentially-consistent behavior for race-free programs). As noted above, those mistakes have since been fixed. Even so, this is one more reminder of just how subtle it is to specify the semantics of data-race-free and racy programs exactly using happens-before, and how subtle it is to match up language memory models with the underlying hardware memory models.

It is encouraging that, at least for now, JavaScript has avoided adding any atomics besides the sequentially consistent ones and has resisted “DRF-SC or Catch Fire.” The result is a memory model that is valid as a C/C++ compilation target but otherwise much closer to Java’s.

Conclusion

Looking at C, C++, Java, JavaScript, Rust, and Swift, we can make the following observations:

  • They all provide sequentially consistent synchronizing atomics for coordinating the non-atomic parts of a parallel program.
  • They all aim for the guarantee that programs made data-race-free using proper synchronization behave as if executed in a sequentially consistent manner.
  • Java and JavaScript have avoided introducing weak (acquire/release) synchronizing atomics, which seem tailor-made for x86.
  • They all provide a way for programs to execute “intentional” data races without invalidating the rest of the program. In C, C++, Rust, and Swift, that mechanism is relaxed, non-synchronizing atomics, a special form of memory access. In Java and JavaScript, that mechanism is ordinary memory access.
  • None of the languages has found a way to formally disallow paradoxes like out-of-thin-air values, yet all informally disallow them.

Meanwhile, processor manufacturers seem to have accepted that the abstraction of sequentially consistent synchronizing atomics is important to implement efficiently, and they are starting to do so: ARMv8 and RISC-V both provide direct support.

Finally, a truly immense amount of verification and formal analysis work has gone into understanding these systems and stating their behaviors precisely. It is particularly encouraging that in 2020 Watt et al. were able to give a formal model of a significant subset of JavaScript and use a theorem prover to prove the correctness of compilation to ARM, POWER, RISC-V, and x86-TSO.

Twenty-five years after the first Java memory model, and after many person-centuries of research effort, we may be starting to be able to formalize entire memory models. Perhaps, one day, we will also fully understand them.

The next article in this series, on the Go memory model, is scheduled for the week of July 12.

Acknowledgements

This series of posts benefited greatly from discussions with and feedback from a long list of engineers I am lucky to work with at Google. My thanks to them. I take full responsibility for any mistakes or unpopular opinions.

