The Thin-air Problem

Working note, version of 2014-02-17

Mark Batty (mbatty@cantab.net), Peter Sewell (Peter.Sewell@cl.cam.ac.uk)

Introduction

A longstanding issue in the design of memory models for C and C++ is that of "thin-air" reads. The question is what the semantics of accesses like C/C++ relaxed atomics should be: accesses for which races are permitted but where we are not prepared to pay the cost of any barriers or other h/w instructions beyond normal reads and writes. Related questions arise in the semantics of C as used in the Linux Kernel (for "ACCESS_ONCE" accesses), and in Java.

There is not a precise and satisfactory definition of what it means for a read to be "out of thin air" (if there were, the problem would be solved). Rather, there are some example executions for which there is a consensus that the language should forbid them, and that current hardware and compiler optimisations do not exhibit. We regard these as being of two kinds: thin-air values and self-satisfying conditionals. There has not yet been a good way to exclude them without also forbidding compiler optimisations which should be permitted. In the C/C++ context we assume that to a first approximation it would be impractical to prevent any current compiler optimisations, and so the problem is essentially one of defining an envelope that permits all those, and the current hardware behaviour, while excluding the example executions that it is agreed should be forbidden. To be clear, this is a high-level-language specification problem: there is no suggestion that thin-air executions occur in practice with current compilers and hardware.

This note describes the thin-air problem via a series of examples, shows that thin-air executions cannot be forbidden by any per-candidate-execution condition using the C/C++11 notion of candidate executions, and summarises some possible future directions.

For more context:

The Examples

In the examples below, we identify particular executions (by specifying the values read) and discuss whether those executions should be allowed by the semantics or not. By convention, all accesses to x and y are relaxed-atomic accesses (in C11 terms) or perhaps ACCESS_ONCE (in Linux C terms). Variables r1, r2, r3, r4, ra, and rb are thread-local. Initially all locations and registers hold 0. Threads are separated by horizontal lines.

1 Example LB (language must allow)

r1 = x; // reads 42
y = 42;
------------------
r2 = y; // reads 42
x = 42;

This is allowed by current hardware (presuming that it is compiled directly, without compiler optimisations): it is architecturally permitted for ARM and IBM POWER, and observable on current ARM multiprocessors. Hence, the language semantics must allow it.

2 Example LB+datas (language could and should forbid)

r1 = x; // reads 42
y = r1;
-------------------
r2 = y; // reads 42
x = r2;

(This is Example 1 of N2176.)

This is the paradigmatic "thin-air read value" example, in which the value 42 "appears out of thin air". It is forbidden on current hardware (architecturally forbidden by x86, ARM, and IBM POWER), presuming that it is compiled directly. We don't expect future hardware to adopt the load-value prediction that would be required to make it observable. Moreover, to the best of our knowledge it cannot be exhibited by any reasonable current compiler optimisation combined with current hardware. Hence, the language semantics could forbid it.

Then one can argue that, just to make the language semantics as intuitive as possible, it should forbid it.

Moreover, if this is at a type where the language has to preserve some implementation invariant (such as that all pointer values point to allocated memory), then the language has to forbid it.

This example suggests a possible solution, to forbid this example by forbidding candidate executions with cycles in (rf + dep), the union of the reads-from and dependency relations. We'll call this Straw-man Proposal 1, and the next two examples show that it is not feasible.

3 Example LB+ctrl+data+po (language must allow)

r1 = x;        // reads 42
if (r1 == 42)
  y = r1;
--------------------------
r2 = y;        // reads 42
x = 42;

This is allowed on hardware as-is (compiled directly, without optimisation): architecturally allowed on ARM and POWER, and observable on both. Hence, the language must allow it.

4 Example LB+ctrl+data+ctrl-double (language must allow)

r1 = x;        // reads 42   
if (r1 == 42)
  y = r1;
---------------------------
r2 = y;        // reads 42
if (r2 == 42)
  x = 42;
else
  x = 42;

This is forbidden on hardware as-is, but reasonable thread-local optimisation (that compilers actually do - see Example 6 below) will collapse the conditional on the second thread, making it identical to the previous example, which is allowed on hardware. Hence the language must allow this execution.

However, this execution does have an (rf+dep) cycle, ruling out the Straw-man Proposal 1 above.

(Aside: we could change this example without affecting the above reasoning by removing the data dependency on the first thread, replacing y=r1 by y=42. We don't do that because we want to relate it to the following example.)

5 Example LB+ctrl+data+ctrl-single (language could and should forbid)

r1 = x;        // reads 42
if (r1 == 42)
  y = r1;
--------------------------
r2 = y;        // reads 42
if (r2 == 42)
  x = 42;

This is the paradigmatic "self-satisfying conditional" example. It is forbidden on h/w as-is (both ARM and POWER architectures prevent speculative writes becoming visible to other threads), and applying reasonable thread-local compiler optimisation doesn't change that. Hence, the language could forbid it.

Moreover, it is problematic for compositional reasoning (as observed by Batty, Dodds, and Gotsman in POPL 2013), so arguably the language should forbid it.

But, the candidate execution that we want to forbid here is identical to the execution of the previous example that we have to allow. Hence, we cannot do both simultanously with any adaptation of the C/C++11 per-candidate-execution definition that uses the same notion of candidate execution.

6 Dependencies via other compilation units

Returning to Example 4, it is tempting to think that the compiler should respect all control dependencies between relaxed atomic operations, especially those from loads to stores, which would rule that out. One might hope that, even if this required preventing some current compiler optimisations, the impact would be small.

But dependencies might go via functions in other compilation units that only involve non-atomic accesses, e.g. the accesses to ra and rb in f() below, and when compiling f() the compiler doesn't know whether it might be used in a dependency chain between atomics. So the compiler would have to preserve all such dependencies. The cost of that is unknown, and worth investigating experimentally, but we suspect it might be unacceptable.

(language must allow):

// in one compilation unit
void f(int ra, int*rb) {
  if (ra==42) 
    *rb=42;
  else    
    *rb=42;
} 

// in another compilation unit

r1 = x; // reads 42
f(r1,&r2);
y = r2;
-------------------
r3 = y; // reads 42
f(r3,&r4);
x = r4;

In practice, GCC does optimise away the control dependency in f. (Checked with GCC 4.6.3 on x86, at O1, O2, and O3. The ARM and POWER compilations should also be checked, but they seem unlikely to differ.)

This is allowed as-is by the combination of current hardware and current compiler optimisations, but it would be forbidden by a semantics that excludes (rf + dep) cycles.

Possible Solutions

Here we just summarise some possible future directions.

Undefined behaviour for unannotated (rf+dep) cycles

We (Batty and Sewell) have earlier observed that, for C-like languages, one could conceivably make it be largely the programmer's responsibility to avoid thin-air cycles. This would adapt proposal 2 above by requiring the user to annotate (probably at a function granularity) where the compiler is required to preserve dependencies, make the language semantics respect cycles in (rf + dep) that go through annotated code, and make it be a programmer error (leading to undefined behaviour) for them to write any program for which any consistent candidate execution contains a cycle in (rf + dep) that goes outside code that has been so annotated.

It's unclear whether this could be acceptable. It would clearly be unacceptable if it required annotation of much code outside of a concurrency library. Example 6 above makes it clear that annotation of code beyond the boundaries of a concurrency library could well be required, but only cycles that include relaxed atomics need to be annotated, and one might think it reasonable to impose the extra burden of dependency annotation on programmers who are capable of reasoning about relaxed atomics. It might be that concurrency libraries rarely ever pass values to other code without synchronising, in which case annotating dependencies in the client would be unnecessary. This pragmatic question about the corpus of code that will use relaxed atomics has not been investigated.

Even if not too much annotation is required, the reasoning required to show that a system is sufficiently annotated may be unpalatable.

Forbidding LB for relaxed atomics outright

In [N3710: Specifying the absence of "out of thin air" results (LWG2265)] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html Hans Boehm proposes to forbid cycles in the union of reads-from and happens-before (including dependencies), and to make this sound with respect to the first example above (LB) by strengthening the ARM implementation of a relaxed loads with a conditional branch (or equivalent new instruction) following the load. Something similar would be required w.r.t. the POWER specification, though not w.r.t. current POWER implementations, and perhaps also for NVIDIA GPUs. This would have some performance implications. It is unclear how serious they would be, but it is arguably unpalatable to accept them `simply' because we have not yet found a good language specification.

A conservative envelope around current optimisations

The above shows that a per-candidate execution semantics (with the existing C/C++11 notion of candidate execution) cannot be used to solve the problem. We are exploring semantics in several alternative styles at present.