Unfortunately this turns out not to be the case: two groups have indenpently found counterexamples which demonstrate that the current mapping scheme of current C11 is not sound on the Power and ARMv7 models as we know them:
The soundness of the ARMv8 mappings is unaffected by this, and the issues do not affect multi-copy-atomic ARMv7 implementations (all those that we have tested experimentally are multi-copy-atomic). They presumably would affect code compiled for ARMv7 running on future ARMv8.0 non-multi-copy-atomic implementations - but later revisions of the ARMv8 architecture also exclude non-multi-copy-atomic behaviour.
The examples all involve mixing SC atomic and weaker (release/acquire or relaxed) atomics on the same location. That is arguably a strange thing to do - it might be that most "real code" does not, though we have no rigorous way to assess that, and Hans Boehm mentioned examples to the contrary.
(a variant of IRIW with SC atomics everywhere except the acquire reads on Threads 1 and 2).
Wsc x 1 Racq x 1; Rsc y 0 Racq y 1; Rsc x 0 Wsc y 1
That is forbidden in the current C11 model. The ARMv7 version (eliding irrelevant barriers or dependencies) is:
which is ARMv7 allowed in the model. However, to the best of our knowledge, this is not and should not be observable on any current ARMv7 hardware, as all such are multi-copy atomic.
W x 1 R x 1; ctrlisb; R y 0 R y 1; ctrlisb; R x 0 W y 1
For comparison, the Power trailing-barrier version (not the suggested compilation scheme there, which is leading-barrier) is:
That is Power-allowed, and observable on current (Power7) hardware.
W x 1 R x 1; ctrlisync; R y 0 R y 1; ctrlisync; R x 0 W y 1
This is also C11-forbidden. There is no ARMv7 problem here: the leading-dmb version is
Wrlx x 2; Fsc; Rrlx y 0 Wsc y 2; Wrel x 1; Rrlx x 2 Rsc x 1
which is forbidden in the ARMv7 model (and also not observable of course on hardware). In fact, we can say something more general, since ARMv7 does not have a lwsync analogue and release writes are mapped to a "full dmb" followed by a write, the leading-dmb mapping would be sound on ARMv7.
W x 2; dmb; R y 0 W y 2; dmb; W x 1; R x 2 R x 1
But the leading-sync Power version is
which is Power-architecture allowed. This is a version of another test we know well, R+lwsync+sync (called R01 in our PLDI'11 paper)
W x 2; sync; R y 0 W y 2; lwsync; W x 1; R x 2 R x 1
As discussed in our PLDI'11 paper, this (and all related examples, including in particular those we know of that break the mapping) is not observable by us on any current Power hardware. However, IBM have been very clear that they do want this behaviour to be allowed in the architecture, so this is also a reason to weaken C11. For ARMv8, none of these problems arise: for ARMv8.0, as they do not use any barriers in their mappings, using instead the (ARMv8 version of) release-acquire accesses for both SC and release-acquire atomics The ARMv8.0 draft soundness proof by Shaked Flur is unaffected. ARMv8.2 is multi-copy-atomic.
W x 2; sync; R y 0 W y 2; lwsync; W x 1 (finally x = 2 to settle coherence)
We also believe the problem will never show up if one has only C11 reads and writes, not the fences or read-modify-writes. And we believe the problem will never show up if programmers do not write using a (C11) relaxed or acquire write to a location read from by a C11 SC read.
That was a pen-and-paper proof, and an obvious high-level lesson from all this is that we should work towards machine-checked proofs of these results, though doing that for the full models we have now is a very substantial task.