C/C++11 mappings to processors

This document summarises some known mappings of C/C++11 atomic operations to x86, PowerPC, ARM and Itanium instruction sequences. These are collected for discussion, not as a definitive source. At the moment, we do not include mappings for all atomic operations - for example, atomic increment is missing. We would be grateful for any suggestions.

This document does not cover any optimisations of either the mappings or C/C++11 programs in general.

Approach

For each C/C++11 synchronisation operation and architecture, the document aims to provide an instruction sequence that implements the operation on that architecture. This is not the only possible approach: one could instead provide a mapping that shows the necessary barriers (or other synchronisation mechanism) between each pair of program-order adjacent memory operations (atomic or non-atomic). A good example of that approach is Doug Lea's cookbook for JVM compiler writers. While it can result in higher-performance mappings, we do not use it here because the resulting tables would be large and we have not investigated correct mappings for all the combinations. The per-operation approach that we take here would benefit from an optimisation pass that removes redundant synchronisation between adjacent operations.
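
For reference, the source-level operations that the tables below map to instruction sequences are the C/C++11 atomic loads, stores, fences and compare-exchanges with explicit memory_order arguments. The following minimal C++11 sketch (the variable names and the message-passing shape are ours, purely for illustration) uses four of the operations that appear as rows in the per-architecture tables:

    #include <atomic>

    std::atomic<int>  data(0);
    std::atomic<bool> ready(false);

    void producer() {
        data.store(42, std::memory_order_relaxed);     // "Store Relaxed" row
        ready.store(true, std::memory_order_release);  // "Store Release" row
    }

    int consumer() {
        while (!ready.load(std::memory_order_acquire)) // "Load Acquire" row
            ;                                          // spin until the flag is set
        return data.load(std::memory_order_relaxed);   // "Load Relaxed" row
    }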

Architectures

x86 (including x86-64)

C/C++11 Operation | x86 implementation
Load Relaxed: MOV (from memory)
Load Consume: MOV (from memory)
Load Acquire: MOV (from memory)
Load Seq_Cst: MOV (from memory)
Store Relaxed: MOV (into memory)
Store Release: MOV (into memory)
Store Seq_Cst: (LOCK) XCHG // alternative: MOV (into memory), MFENCE
Consume Fence: <ignore>
Acquire Fence: <ignore>
Release Fence: <ignore>
Acq_Rel Fence: <ignore>
Seq_Cst Fence: MFENCE

The parenthesised (LOCK) reflects the fact that the XCHG instruction on x86 has an implicit LOCK prefix. If a compiler emits code using non-temporal stores, it must also emit sufficient fencing to make the usage of non-temporal stores unobservable to callers and callees.
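
As a concrete illustration (the code below is ours; real compiler output will of course differ), a sequentially consistent store in C++11 such as the following is, under the mapping above, the only store that needs anything beyond a plain MOV on x86: it becomes either an XCHG (with its implicit LOCK) or a MOV followed by MFENCE.

    #include <atomic>

    std::atomic<int> x(0);

    void publish() {
        x.store(1);   // memory_order_seq_cst is the default: "(LOCK) XCHG" or "MOV; MFENCE"
    }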

Sources: Alexander Terekhov's cpp-threads mailing list post, and the Batty et al. C/C++11 POPL 2011 paper.

Note: there is an alternative mapping of C/C++11 to x86 which, instead of locking (or fencing) the Seq_Cst store, locks (or fences) the Seq_Cst load:

C/C++11 Operation | x86 implementation
Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE, MOV (from memory)
Store Seq_Cst: MOV (into memory)

As there are typically more loads than stores in a program, this mapping is likely to be less efficient. We should also note that mixing the read-fencing and write-fencing mappings in one program (e.g., by linking object files from two different compilers) can result in an incorrectly compiled program. As a result, we strongly recommend using only the write-fencing mapping.
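
To see why mixing the two mappings is unsafe, consider the classic store buffering test below (a minimal sketch; the function names and the split across translation units are ours). C/C++11 forbids the outcome r1 == 0 && r2 == 0 when all four accesses are sequentially consistent. But if the store executed by thread 1 is compiled with the read-fencing mapping (a plain MOV) while the load it then performs is compiled with the write-fencing mapping (also a plain MOV), there is no fence between the two, the store may still sit in the store buffer when the load executes, and the forbidden outcome becomes observable.

    #include <atomic>

    std::atomic<int> x(0), y(0);
    int r1, r2;

    // Imagine this function compiled with the read-fencing mapping:
    // the seq_cst store becomes a plain MOV.
    void store_x() { x.store(1); }

    // ...and this one compiled with the write-fencing mapping:
    // the seq_cst load becomes a plain MOV.
    int load_y() { return y.load(); }

    void thread1() {
        store_x();        // MOV, no fence
        r1 = load_y();    // MOV, no fence: nothing orders it after the store above
    }

    void thread2() {
        y.store(1);       // compiled consistently (fenced store), but that alone
        r2 = x.load();    // cannot repair the missing fence in thread 1
    }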

PowerPC

C/C++11 Operation | PowerPC implementation
Load Relaxed: ld
Load Consume: ld + preserve dependencies until next kill_dependency
OR
ld; cmp; bc; isync
Load Acquire: ld; cmp; bc; isync
Load Seq Cst: hwsync; ld; cmp; bc; isync
Store Relaxed: st
Store Release: lwsync; st
Store Seq Cst: hwsync; st
Cmpxchg Relaxed (32 bit): _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
Cmpxchg Acquire (32 bit): _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Cmpxchg Release (32 bit): lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
Cmpxchg AcqRel (32 bit): lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Cmpxchg SeqCst (32 bit): hwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Acquire Fence: lwsync
Release Fence: lwsync
AcqRel Fence: lwsync
SeqCst Fence: hwsync
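
As a usage note (the example below is ours, not from the sources cited): the Cmpxchg rows correspond to the C/C++11 compare_exchange operations, which also take a separate failure ordering not shown in the table; note that, as the rows are written, the trailing isync is only reached on the success path. An acquire compare-exchange such as the following would be compiled to the lwarx/cmp/bc/stwcx. retry loop of the "Cmpxchg Acquire (32 bit)" row.

    #include <atomic>

    // Hypothetical helper (ours): try to claim a slot with an acquire CAS.
    bool try_claim(std::atomic<int>& slot) {
        int expected = 0;
        return slot.compare_exchange_strong(expected, 1,
                                            std::memory_order_acquire,   // success ordering
                                            std::memory_order_relaxed);  // failure ordering
    }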

Spinlock implementation

Lock acquire (the address of the lock is in r3; register r4 contains the value indicating a free lock, r5 the value indicating a taken lock):
                                           
loop:
   lwarx  r6,0,r3,1 #load lock and reserve                                             
   cmpw   r4,r6     #skip ahead if      
   bne-   wait      # lock not free              
   stwcx. r5,0,r3   #try to set lock
   bne-   loop      #loop if lost reservation           
   isync            #import barrier 
   .
   .
wait: ...         #wait for lock to free
Lock release (r3 contains the address of the lock structure, r4 the value of free lock):
    sync              #export barrier
    stw     r4,lock(r3) #release lock
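
For comparison, a C/C++11 spinlock written with the operations from the table above gives rise to the kind of sequences shown in the assembly: an acquire compare-exchange corresponds to the lwarx/cmp/stwcx./isync retry loop, and a release store corresponds to the lwsync; st row (the Power ISA example above uses the stronger sync). The sketch below is ours, purely for illustration.

    #include <atomic>

    struct Spinlock {
        std::atomic<int> word{0};                    // 0 = free, 1 = taken

        void lock() {
            int expected = 0;
            // Acquire CAS: corresponds to the lwarx/cmp/stwcx./isync retry loop.
            while (!word.compare_exchange_weak(expected, 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed))
                expected = 0;                        // reset and retry until the lock is free
        }

        void unlock() {
            // Release store: corresponds to lwsync; st (sync; stw in the example above).
            word.store(0, std::memory_order_release);
        }
    };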

Sources: Paul McKenney's C++ paper N2745, Power ISA 2.06, and Clarifying and Compiling C/C++ Concurrency: from C++11 to POWER by Mark Batty, Kayvan Memarian, Scott Owens, Susmit Sarkar, and Peter Sewell, in POPL 2012.

ARM

As far as the memory model is concerned, the ARM processor is broadly similar to PowerPC, differing mainly in having a DMB barrier (analogous to the PowerPC hwsync in its programmer-observable behaviour for normal memory) and no analogue of the PowerPC lwsync. For the non-SC and non-cmpxchg operations, the translation to ARM is very similar to the translation to PowerPC, replacing both lwsync and hwsync by the dmb instruction. The cmpxchg operations are from some Linux kernel sources, and are not direct translations of the PowerPC mappings.

For SC atomics, two mappings have been discussed. The mapping immediately below puts a DMB after an SC load and both before and after an SC store, whereas the "alternative" mapping shown in the second table below follows the PowerPC mapping. It is believed that the former should typically give better performance on ARM processors, but experimental investigation of this would be welcomed.

In any case, it is important that all compilers agree on the choice of mapping, as otherwise SC atomics will not work correctly in code constructed by linking together the results of separate compilation by multiple compilers.

C/C++11 Operation | ARM implementation
Load Relaxed: ldr
Load Consume: ldr + preserve dependencies until next kill_dependency
OR
ldr; teq; beq; isb
OR
ldr; dmb
Load Acquire: ldr; teq; beq; isb
OR
ldr; dmb
Load Seq Cst: ldr; dmb
Store Relaxed: str
Store Release: dmb; str
Store Seq Cst: dmb; str; dmb
Cmpxchg Relaxed (32 bit): _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop
Cmpxchg Acquire (32 bit): _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb
Cmpxchg Release (32 bit): dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop
Cmpxchg AcqRel (32 bit): dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb
Cmpxchg SeqCst (32 bit): dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb
Acquire Fence: dmb
Release Fence: dmb
AcqRel Fence: dmb
SeqCst Fence: dmb

Alternative SC atomic mapping

C/C++11 Operation | ARM implementation
Load Seq Cst: dmb; ldr; teq; beq; isb
OR
dmb; ldr; dmb
Store Seq Cst: dmb; str

Note: the only way to get atomic 64-bit memory accesses on ARM is to use ldrexd/strexd with a loop (the ldrd and strd instructions are not guaranteed to appear atomic).
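
As a usage illustration (our sketch, not compiler output): a 64-bit atomic in C/C++11 therefore cannot be compiled to plain ldrd/strd on these processors; implementations use the exclusive doubleword instructions in a loop instead.

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> counter(0);

    // Per the note above, this 64-bit store cannot be a plain strd if it is to
    // appear atomic; it is implemented with a ldrexd/strexd loop instead.
    void reset_counter() {
        counter.store(0, std::memory_order_relaxed);
    }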

Spinlock implementation

Lock acquire (address of the lock is in r1, the value of taken lock is in r0):
                                           
Loop:
    LDREX R5, [R1]              ; read lock
    CMP R5, #0                  ; check if 0
    WFENE                       ; sleep if the lock is held
    STREXEQ R5, R0, [R1]        ; attempt to store new value
    CMPEQ R5, #0                ; test if store succeeded
    BNE Loop                    ; retry if not
    DMB                         ; ensures that all subsequent accesses are observed after the
                                ; gaining of the lock is observed
    ; loads and stores in the critical region can now be performed
Lock release (r1 contains the address of the lock structure):
    MOV R0, #0
    DMB          ; ensure all previous accesses are observed before the lock is
                 ; cleared
    STR R0, [R1] ; clear the lock

Sources: a straightforward translation of the PowerPC mappings; the cmpxchg sequences were adapted from ARM Linux kernel sources; the spinlock implementation is from ARM's Barrier Litmus Tests and Cookbook.

Itanium

C/C++11 Operation | IA-64 implementation
Load Relaxed: ld.acq
Load Consume: ld.acq
Load Acquire: ld.acq
Load Seq_Cst: ld.acq
Store Relaxed: st.rel
Store Release: st.rel
Store Seq Cst: st.rel; mf
Cmpxchg Acquire: cmpxchg.acq
Cmpxchg Release: cmpxchg.rel
Cmpxchg AcqRel: cmpxchg.rel; mf
Cmpxchg SeqCst: cmpxchg.rel; mf
Consume Fence: <ignore>
Acquire Fence: <ignore>
Release Fence: <ignore>
Acq_Rel Fence: <ignore>
Seq_Cst Fence: mf

Source: Alexander Terekhov's post to the cpp-threads mailing list.

Last updated: 22-12-2011 by Peter Sewell.