Erratum for POPL 2012 and PLDI 2012 papers

In two papers (POPL12 and PLDI12), we described a purported proof that the C/C++11 concurrency model could be soundly implemented on Power, and by analogy on ARMv7, by two mapping schemes: one with a leading sync/dmb for SC atomics (which is the current suggested mapping to Power), and one with a trailing sync/dmb (which is the current suggested mapping to ARMv7) (mappings).

Unfortunately this turns out not to be the case: two groups have indenpently found counterexamples which demonstrate that the current mapping scheme of current C11 is not sound on the Power and ARMv7 models as we know them:

As far as we can see so far, the conclusion is that C11 should be weakened; there does not seem to be a strong reason to change either the hardware architectures(/models) or the mappings. The paper by Lahav et al. has a proposed weakening of the C11 model that could be soundly implemented via the existing mappings on both ARMv7 and Power.

The soundness of the ARMv8 mappings is unaffected by this, and the issues do not affect multi-copy-atomic ARMv7 implementations (all those that we have tested experimentally are multi-copy-atomic). They presumably would affect code compiled for ARMv7 running on future ARMv8.0 non-multi-copy-atomic implementations - but later revisions of the ARMv8 architecture also exclude non-multi-copy-atomic behaviour.

Examples

On to the examples and caveats. As usual, all locations are initialised to 0, and threads are written one per line, going left to right.

The examples all involve mixing SC atomic and weaker (release/acquire or relaxed) atomics on the same location. That is arguably a strange thing to do - it might be that most "real code" does not, though we have no rigorous way to assess that, and Hans Boehm mentioned examples to the contrary.

Trailing barrier

For the trailing-barrier case, consider the C11 example (due to Manerkar et al.):

Wsc x 1
Racq x 1; Rsc y 0
Racq y 1; Rsc x 0
Wsc y 1
(a variant of IRIW with SC atomics everywhere except the acquire reads on Threads 1 and 2).

That is forbidden in the current C11 model. The ARMv7 version (eliding irrelevant barriers or dependencies) is:


W x 1
R x 1; ctrlisb; R y 0
R y 1; ctrlisb; R x 0
W y 1
which is ARMv7 allowed in the model. However, to the best of our knowledge, this is not and should not be observable on any current ARMv7 hardware, as all such are multi-copy atomic.

For comparison, the Power trailing-barrier version (not the suggested compilation scheme there, which is leading-barrier) is:


W x 1
R x 1; ctrlisync; R y 0
R y 1; ctrlisync; R x 0
W y 1
That is Power-allowed, and observable on current (Power7) hardware.

Leading barrier

For the leading-barrier case, consider the example (due to Ori et al):

Wrlx x 2; Fsc; Rrlx y 0
Wsc y 2; Wrel x 1; Rrlx x 2
Rsc x 1
This is also C11-forbidden. There is no ARMv7 problem here: the leading-dmb version is

W x 2; dmb; R y 0
W y 2; dmb; W x 1; R x 2
R x 1
which is forbidden in the ARMv7 model (and also not observable of course on hardware). In fact, we can say something more general, since ARMv7 does not have a lwsync analogue and release writes are mapped to a "full dmb" followed by a write, the leading-dmb mapping would be sound on ARMv7.

But the leading-sync Power version is


W x 2; sync; R y 0
W y 2; lwsync; W x 1; R x 2
R x 1
which is Power-architecture allowed. This is a version of another test we know well, R+lwsync+sync (called R01 in our PLDI'11 paper)

W x 2; sync; R y 0
W y 2; lwsync; W x 1

(finally x = 2 to settle coherence)
As discussed in our PLDI'11 paper, this (and all related examples, including in particular those we know of that break the mapping) is not observable by us on any current Power hardware. However, IBM have been very clear that they do want this behaviour to be allowed in the architecture, so this is also a reason to weaken C11. For ARMv8, none of these problems arise: for ARMv8.0, as they do not use any barriers in their mappings, using instead the (ARMv8 version of) release-acquire accesses for both SC and release-acquire atomics The ARMv8.0 draft soundness proof by Shaked Flur is unaffected. ARMv8.2 is multi-copy-atomic.

We also believe the problem will never show up if one has only C11 reads and writes, not the fences or read-modify-writes. And we believe the problem will never show up if programmers do not write using a (C11) relaxed or acquire write to a location read from by a C11 SC read.

The purported proof

What about our purported "proof" from POPL12 and PLDI12? Manerkar et al. also identified the flaw: the proof has to construct a C11 execution from a machine execution, and the construction of the SC order is not consistent with hb, which is required by the current C11 model. That may have been a simple mistake on our part, but we may have been working from an earlier version of the standard that had weaker requirements. It is wrong w.r.t. the ratified standard, in any case.

That was a pen-and-paper proof, and an obvious high-level lesson from all this is that we should work towards machine-checked proofs of these results, though doing that for the full models we have now is a very substantial task.