This document summarises some known mappings of C/C++11 atomic operations to x86, PowerPC, ARM and Itanium instruction sequences. These are collected for discussion, not as a definitive source. At the moment, we do not include mappings for all atomic operations - for example, atomic increment is missing. We would be grateful for any suggestions.
This document does not cover any optimisations of either the mappings or C/C++11 programs in general.
For each C/C++11 synchronisation operation and architecture, the document aims to provide an instruction sequence that implements the operation on the given architecture. This is not the only approach: one could instead provide a mapping that shows the necessary barriers (or other synchronisation mechanisms) between two program-order adjacent memory operations (either atomic or non-atomic). A good example of this approach is Doug Lea's cookbook for JVM compiler writers. While that approach can result in higher-performance mappings, we do not use it here because the resulting tables would be large and we have not investigated correct mappings for all the combinations. The per-operation approach that we take here would benefit from an optimisation pass that removes redundant synchronisation between adjacent operations.
| C/C++11 Operation | x86 implementation |
|---|---|
| Load Relaxed: | MOV (from memory) |
| Load Consume: | MOV (from memory) |
| Load Acquire: | MOV (from memory) |
| Load Seq_Cst: | MOV (from memory) |
| Store Relaxed: | MOV (into memory) |
| Store Release: | MOV (into memory) |
| Store Seq Cst: | (LOCK) XCHG // alternative: MOV (into memory),MFENCE |
| Consume Fence: | <ignore> |
| Acquire Fence: | <ignore> |
| Release Fence: | <ignore> |
| Acq_Rel Fence: | <ignore> |
| Seq_Cst Fence: | MFENCE |
The parenthesised (LOCK) reflects the fact that the XCHG instruction on x86 has an implicit LOCK prefix. If a compiler emits code using non-temporal stores, it must also emit sufficient fencing to make the usage of non-temporal stores unobservable to callers and callees.
Sources: Alexander Terekhov's cpp-threads mailing list post, and the Batty et al., C/C++11 POPL 2011 paper.
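As an illustration of how source-level operations line up with the rows of the x86 table above, here is a minimal C++11 message-passing sketch. The names `mp_payload`, `mp_ready`, `producer` and `consumer` are hypothetical and chosen only for this example, and the instruction names in the comments are simply those listed in the table, not a statement about what any particular compiler emits.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared state, for illustration only.
int mp_payload = 0;                   // ordinary non-atomic data
std::atomic<bool> mp_ready{false};    // flag used to publish the data

// Writer side: fill in the data, then publish it.
void producer() {
    mp_payload = 42;
    // Store Release row: MOV (into memory) on x86.
    mp_ready.store(true, std::memory_order_release);
}

// Reader side: wait for the flag, then read the data.
int consumer() {
    // Load Acquire row: MOV (from memory) on x86.
    while (!mp_ready.load(std::memory_order_acquire)) {
        // spin until the producer has published
    }
    // The acquire load synchronises with the release store,
    // so the write to mp_payload is visible here.
    return mp_payload;
}
```

The two functions are intended to run on separate threads, with `consumer` returning only after `producer` has published. If both operations on `mp_ready` were changed to `std::memory_order_seq_cst`, only the store would change under the table above, from a plain MOV to (LOCK) XCHG or MOV followed by MFENCE.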
Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq Cst store locks/fences the Seq Cst load:
| C/C++11 Operation | x86 implementation |
|---|---|
| Load Seq_Cst: | LOCK XADD(0) // alternative: MFENCE,MOV (from memory) |
| Store Seq Cst: | MOV (into memory) |
As there are typically more loads than stores in a program, this mapping is likely to be less efficient. Note also that mixing the read-fencing and write-fencing mappings in one program (e.g., by linking object files from two different compilers) can result in an incorrectly compiled program. We therefore strongly recommend using only the write-fencing mapping.
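The danger of mixing the two mappings can be seen with the classic store-buffering test, sketched below in C++11. The names `sb_x`, `sb_y`, `sb_thread0`, `sb_thread1` and `sb_forbidden` are hypothetical. If an SC store were compiled with the read-fencing mapping (plain MOV) and the following SC load with the write-fencing mapping (also plain MOV), neither access would be fenced, and x86 store buffering could then let both loads return 0, an outcome that sequential consistency forbids.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared variables for the store-buffering litmus test.
std::atomic<int> sb_x{0};
std::atomic<int> sb_y{0};

// Body of thread 0: SC store to x, then SC load of y.
int sb_thread0() {
    sb_x.store(1, std::memory_order_seq_cst);    // write-fencing: (LOCK) XCHG
    return sb_y.load(std::memory_order_seq_cst); // write-fencing: MOV
}

// Body of thread 1: SC store to y, then SC load of x.
int sb_thread1() {
    sb_y.store(1, std::memory_order_seq_cst);
    return sb_x.load(std::memory_order_seq_cst);
}

// The outcome that sequential consistency rules out:
// both threads read the initial value 0.
bool sb_forbidden(int r0, int r1) {
    return r0 == 0 && r1 == 0;
}
```

When the two bodies run concurrently under a single consistent mapping, `sb_forbidden` should never hold for the pair of results, whichever interleaving occurs.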
| C/C++11 Operation | PowerPC implementation |
|---|---|
| Load Relaxed: | ld |
| Load Consume: | ld + preserve dependencies until next kill_dependency OR ld; cmp; bc; isync |
| Load Acquire: | ld; cmp; bc; isync |
| Load Seq Cst: | hwsync; ld; cmp; bc; isync |
| Store Relaxed: | st |
| Store Release: | lwsync; st |
| Store Seq Cst: | hwsync; st |
| Cmpxchg Relaxed (32 bit): | _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit: |
| Cmpxchg Acquire (32 bit): | _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit: |
| Cmpxchg Release (32 bit): | lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit: |
| Cmpxchg AcqRel (32 bit): | lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit: |
| Cmpxchg SeqCst (32 bit): | hwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit: |
| Acquire Fence: | lwsync |
| Release Fence: | lwsync |
| AcqRel Fence: | lwsync |
| SeqCst Fence: | hwsync |
loop: lwarx r6,0,r3,1 #load lock and reserve
cmpw r4,r6 #skip ahead if
bne- wait # lock not free
stwcx. r5,0,r3 #try to set lock
bne- loop #loop if lost reservation
isync #import barrier
. .
wait... #wait for lock to free
Lock release (r3 contains the address of the lock structure, r4 the value of free lock):
sync #export barrier
stw r4,lock(r3)#release lock
Sources: Paul McKenney's C++ paper N2745, Power ISA 2.06, and Clarifying and Compiling C/C++ Concurrency: from C++11 to POWER by Mark Batty, Kayvan Memarian, Scott Owens, Susmit Sarkar, and Peter Sewell, in POPL 2012.
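The "preserve dependencies until next kill_dependency" entry in the Load Consume row can be illustrated at the source level as follows. This is only a sketch: `Node`, `dep_head`, `dep_table`, `publish`, `read_value` and `read_indexed` are hypothetical names, and in practice compilers commonly treat `memory_order_consume` as `memory_order_acquire`, which corresponds to the second alternative in that row (ld; cmp; bc; isync).

```cpp
#include <atomic>
#include <cassert>

// Hypothetical node type and publication pointer, for illustration only.
struct Node {
    int value;
    int index;
};

std::atomic<Node*> dep_head{nullptr};

// Constant lookup table that is never written concurrently.
const int dep_table[4] = {10, 20, 30, 40};

// Publisher: initialise the node, then publish it.
// Store Release row: lwsync; st.
void publish(Node* n, int value, int index) {
    n->value = value;
    n->index = index;
    dep_head.store(n, std::memory_order_release);
}

// Reader: the access to n->value is address-dependent on the consume
// load, so it is ordered after it without a barrier if the compiler
// preserves the dependency.
int read_value() {
    Node* n = dep_head.load(std::memory_order_consume);
    if (n == nullptr) return -1;
    return n->value;
}

// kill_dependency ends the dependency chain: the lookup in dep_table
// needs no ordering because the table is never modified.
int read_indexed() {
    Node* n = dep_head.load(std::memory_order_consume);
    if (n == nullptr) return -1;
    int i = std::kill_dependency(n->index);
    return dep_table[i];
}
```

Only accesses that carry a dependency from the consume load are ordered; anything after `std::kill_dependency` is deliberately left unordered, which is what lets an implementation drop the dependency tracking from that point on.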
As far as the memory model is concerned, the ARM processor is broadly similar to PowerPC, differing mainly in having a DMB barrier (analogous to the PowerPC hwsync in its programmer-observable behaviour for normal memory) and no analogue of the PowerPC lwsync. For the non-SC and non-cmpxchg operations, the translation to ARM is very similar to the translation to PowerPC, replacing both lwsync and hwsync by the dmb instruction. The cmpxchg operations are from some Linux kernel sources, and are not direct translations of the PowerPC mappings.
For SC atomics, two mappings have been discussed. The mapping immediately below puts a DMB after an SC load and both before and after an SC store, whereas the 'alternative' mapping shown in the second table below follows the PowerPC mapping. It is believed that the former should typically give better performance on ARM processors, but experimental investigation of this would be welcomed.
In any case, it is important that all compilers agree on the choice of mapping, as otherwise SC atomics will not work correctly in code constructed by linking together the results of separate compilation by multiple compilers.
| C/C++11 Operation | ARM implementation |
|---|---|
| Load Relaxed: | ldr |
| Load Consume: | ldr + preserve dependencies until next kill_dependency OR ldr; teq; beq; isb OR ldr; dmb |
| Load Acquire: | ldr; teq; beq; isb OR ldr; dmb |
| Load Seq Cst: | ldr; dmb |
| Store Relaxed: | str |
| Store Release: | dmb; str |
| Store Seq Cst: | dmb; str; dmb |
| Cmpxchg Relaxed (32 bit): | _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop |
| Cmpxchg Acquire (32 bit): | _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb |
| Cmpxchg Release (32 bit): | dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop |
| Cmpxchg AcqRel (32 bit): | dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb |
| Cmpxchg SeqCst (32 bit): | dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb |
| Acquire Fence: | dmb |
| Release Fence: | dmb |
| AcqRel Fence: | dmb |
| SeqCst Fence: | dmb |
| C/C++11 Operation | ARM implementation |
|---|---|
| Load Seq Cst: | dmb; ldr; teq; beq; isb OR dmb; ldr; dmb |
| Store Seq Cst: | dmb; str |
Note: the only way to get atomic 64-bit memory accesses on ARM is to use ldrex/strex with a loop (ldrd and strd instructions are not guaranteed to appear atomic).
Loop:
LDREX R5, [R1] ; read lock
CMP R5, #0 ; check if 0
WFENE ; sleep if the lock is held
STREXEQ R5, R0, [R1] ; attempt to store new value
CMPEQ R5, #0 ; test if store succeeded
BNE Loop ; retry if not
DMB ; ensures that all subsequent accesses are observed after the
; gaining of the lock is observed
; loads and stores in the critical region can now be performed
Lock release (r1 contains the address of the lock structure):
MOV R0, #0
DMB ; ensure all previous accesses are observed before the lock is
; cleared
STR R0, [R1] ; clear the lock.
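The lock acquire and release sequences above have a natural C++11 counterpart: a compare-exchange with acquire ordering to take the lock and a release store to free it. The sketch below assumes a hypothetical `SpinLock` class in which 0 means free and 1 means held. It is not taken from the cited sources, and it corresponds to the Cmpxchg Acquire and Store Release rows of the tables rather than to the exact barriers in the listings.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical spinlock: 0 means free, 1 means held.
class SpinLock {
public:
    // Spin until the lock is taken (Cmpxchg Acquire row).
    void lock() {
        int expected = 0;
        while (!state_.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            // On failure, expected holds the observed value; reset it.
            expected = 0;
        }
    }

    // Single attempt; returns true if the lock was taken.
    bool try_lock() {
        int expected = 0;
        return state_.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }

    // Free the lock (Store Release row).
    void unlock() {
        state_.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> state_{0};
};
```

`compare_exchange_weak` is used in the loop because it may fail spuriously, much as a store-conditional can lose its reservation, and the loop retries in either case. A production lock would usually also back off or yield while waiting, which this sketch omits.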
Sources: the mappings are a straightforward translation of the PowerPC mappings; the cmpxchg sequences were adapted from ARM Linux kernel sources; the spinlock implementation is from ARM's Barrier Litmus Tests and Cookbook.
| C/C++11 Operation | IA64 implementation |
|---|---|
| Load Relaxed: | ld.acq |
| Load Consume: | ld.acq |
| Load Acquire: | ld.acq |
| Load Seq_Cst: | ld.acq |
| Store Relaxed: | st.rel |
| Store Release: | st.rel |
| Store Seq Cst: | st.rel; mf |
| Cmpxchg Acquire: | cmpxchg.acq |
| Cmpxchg Release: | cmpxchg.rel |
| Cmpxchg AcqRel: | cmpxchg.rel; mf |
| Cmpxchg SeqCst: | cmpxchg.rel; mf |
| Consume Fence: | <ignore> |
| Acquire Fence: | <ignore> |
| Release Fence: | <ignore> |
| Acq_Rel Fence: | <ignore> |
| Seq_Cst Fence: | mf |
Source: Alexander Terekhov's post to the cpp-threads mailing list
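The fence rows in the tables correspond to `std::atomic_thread_fence` in C++11. The following sketch, with hypothetical names `fence_data`, `fence_flag`, `writer` and `reader`, publishes data using relaxed accesses combined with explicit fences. Under the IA64 table above, the relaxed store and load already map to st.rel and ld.acq, which appears to be why the acquire and release fences need no instruction there; on PowerPC or ARM the same fences would map to lwsync or dmb.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared state, for illustration only.
int fence_data = 0;
std::atomic<int> fence_flag{0};

// Writer: data first, then a release fence, then a relaxed flag store.
void writer() {
    fence_data = 1;
    std::atomic_thread_fence(std::memory_order_release); // Release Fence row
    fence_flag.store(1, std::memory_order_relaxed);      // Store Relaxed row
}

// Reader: returns -1 if the flag is not yet set, otherwise the data.
int reader() {
    if (fence_flag.load(std::memory_order_relaxed) == 0) { // Load Relaxed row
        return -1;
    }
    std::atomic_thread_fence(std::memory_order_acquire);   // Acquire Fence row
    return fence_data;
}
```

With `writer` and `reader` on separate threads, a reader that sees the flag set is guaranteed to see `fence_data == 1`, because the release fence before the store synchronises with the acquire fence after the load.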
Last updated: 22-12-2011 by Peter Sewell.