C/C++11 mappings to processors

This document summarises some known mappings of C/C++11 atomic operations to x86, PowerPC, ARM and Itanium instruction sequences. These are collected for discussion, not as a definitive source. At the moment, we do not include mappings for all atomic operations - for example, atomic increment is missing. We would be grateful for any suggestions.

This document does not cover any optimisations of either the mappings or C/C++11 programs in general.

Approach

For each C/C++11 synchronisation operation and architecture, the document aims to provide an instruction sequence that implements the operation on that architecture. This is not the only possible approach: one could instead provide a mapping that shows the necessary barriers (or other synchronisation mechanism) between each pair of program-order adjacent memory operations (atomic or non-atomic). A good example of that approach is Doug Lea's cookbook for JVM compiler writers. While it can result in higher-performance mappings, we do not use it here because the resulting tables would be large and we have not investigated correct mappings for all the combinations. The per-operation approach that we take here would benefit from an optimisation pass that removes redundant synchronisation between adjacent operations.
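
For reference, the source-level operations that the tables below map to instruction sequences are the C/C++11 atomic loads, stores, fences and compare-exchanges with explicit memory_order arguments. The following minimal C++11 sketch (the variable names and the message-passing shape are ours, purely for illustration) uses four of the operations that appear as rows in the per-architecture tables:

    #include <atomic>

    std::atomic<int>  data(0);
    std::atomic<bool> ready(false);

    void producer() {
        data.store(42, std::memory_order_relaxed);     // "Store Relaxed" row
        ready.store(true, std::memory_order_release);  // "Store Release" row
    }

    int consumer() {
        while (!ready.load(std::memory_order_acquire)) // "Load Acquire" row
            ;                                          // spin until the flag is set
        return data.load(std::memory_order_relaxed);   // "Load Relaxed" row
    }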

Architectures

x86 (including x86-64)

C/C++11 Operation | x86 implementation
Load Relaxed: MOV (from memory)
Load Consume: MOV (from memory)
Load Acquire: MOV (from memory)
Load Seq_Cst: MOV (from memory)
Store Relaxed: MOV (into memory)
Store Release: MOV (into memory)
Store Seq_Cst: (LOCK) XCHG // alternative: MOV (into memory), MFENCE
Consume Fence: <ignore>
Acquire Fence: <ignore>
Release Fence: <ignore>
Acq_Rel Fence: <ignore>
Seq_Cst Fence: MFENCE

The parenthesised (LOCK) reflects the fact that the XCHG instruction on x86 has an implicit LOCK prefix. If a compiler emits code using non-temporal stores, it must also emit sufficient fencing to make the usage of non-temporal stores unobservable to callers and callees.
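
As a concrete illustration (the code below is ours; real compiler output will of course differ), a sequentially consistent store in C++11 such as the following is, under the mapping above, the only store that needs anything beyond a plain MOV on x86: it becomes either an XCHG (with its implicit LOCK) or a MOV followed by MFENCE.

    #include <atomic>

    std::atomic<int> x(0);

    void publish() {
        x.store(1);   // memory_order_seq_cst is the default: "(LOCK) XCHG" or "MOV; MFENCE"
    }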

Sources: Alexander Terekhov's cpp-threads mailing list post, and the Batty et al. C/C++11 POPL 2011 paper.

Note: there is an alternative mapping of C/C++11 to x86 which, instead of locking (or fencing) the Seq_Cst store, locks (or fences) the Seq_Cst load:

C/C++11 Operation | x86 implementation
Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE, MOV (from memory)
Store Seq_Cst: MOV (into memory)

As there are typically more loads than stores in a program, this mapping is likely to be less efficient. We should also note that mixing the read-fencing and write-fencing mappings in one program (e.g., by linking object files from two different compilers) can result in an incorrectly compiled program. As a result, we strongly recommend using only the write-fencing mapping.
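
To see why mixing the two mappings is unsafe, consider the classic store buffering test below (a minimal sketch; the function names and the split across translation units are ours). C/C++11 forbids the outcome r1 == 0 && r2 == 0 when all four accesses are sequentially consistent. But if the store executed by thread 1 is compiled with the read-fencing mapping (a plain MOV) while the load it then performs is compiled with the write-fencing mapping (also a plain MOV), there is no fence between the two, the store may still sit in the store buffer when the load executes, and the forbidden outcome becomes observable.

    #include <atomic>

    std::atomic<int> x(0), y(0);
    int r1, r2;

    // Imagine this function compiled with the read-fencing mapping:
    // the seq_cst store becomes a plain MOV.
    void store_x() { x.store(1); }

    // ...and this one compiled with the write-fencing mapping:
    // the seq_cst load becomes a plain MOV.
    int load_y() { return y.load(); }

    void thread1() {
        store_x();        // MOV, no fence
        r1 = load_y();    // MOV, no fence: nothing orders it after the store above
    }

    void thread2() {
        y.store(1);       // compiled consistently (fenced store), but that alone
        r2 = x.load();    // cannot repair the missing fence in thread 1
    }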

PowerPC

C/C++11 Operation | PowerPC implementation
Load Relaxed: ld
Load Consume: ld + preserve dependencies until next kill_dependency
OR
ld; cmp; bc; isync
Load Acquire: ld; cmp; bc; isync
Load Seq Cst: hwsync; ld; cmp; bc; isync
Store Relaxed: st
Store Release: lwsync; st
Store Seq Cst: hwsync; st
Cmpxchg Relaxed (32 bit): _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
Cmpxchg Acquire (32 bit): _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Cmpxchg Release (32 bit): lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
Cmpxchg AcqRel (32 bit): lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Cmpxchg SeqCst (32 bit): hwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Acquire Fence: lwsync
Release Fence: lwsync
AcqRel Fence: lwsync
SeqCst Fence: hwsync
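
As a usage note (the example below is ours, not from the sources cited): the Cmpxchg rows correspond to the C/C++11 compare_exchange operations, which also take a separate failure ordering not shown in the table; note that, as the rows are written, the trailing isync is only reached on the success path. An acquire compare-exchange such as the following would be compiled to the lwarx/cmp/bc/stwcx. retry loop of the "Cmpxchg Acquire (32 bit)" row.

    #include <atomic>

    // Hypothetical helper (ours): try to claim a slot with an acquire CAS.
    bool try_claim(std::atomic<int>& slot) {
        int expected = 0;
        return slot.compare_exchange_strong(expected, 1,
                                            std::memory_order_acquire,   // success ordering
                                            std::memory_order_relaxed);  // failure ordering
    }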

Spinlock implementation

Lock acquire (the address of the lock is in r3; register r4 contains the value indicating a free lock, r5 the value indicating a taken lock):
                                           
loop:
   lwarx  r6,0,r3,1 #load lock and reserve                                             
   cmpw   r4,r6     #skip ahead if      
   bne-   wait      # lock not free              
   stwcx. r5,0,r3   #try to set lock
   bne-   loop      #loop if lost reservation           
   isync            #import barrier 
   .
   .
wait: ...         #wait for lock to free
Lock release (r3 contains the address of the lock structure, r4 the value of free lock):
    sync              #export barrier
    stw     r4,lock(r3) #release lock
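
For comparison, a C/C++11 spinlock written with the operations from the table above gives rise to the kind of sequences shown in the assembly: an acquire compare-exchange corresponds to the lwarx/cmp/stwcx./isync retry loop, and a release store corresponds to the lwsync; st row (the Power ISA example above uses the stronger sync). The sketch below is ours, purely for illustration.

    #include <atomic>

    struct Spinlock {
        std::atomic<int> word{0};                    // 0 = free, 1 = taken

        void lock() {
            int expected = 0;
            // Acquire CAS: corresponds to the lwarx/cmp/stwcx./isync retry loop.
            while (!word.compare_exchange_weak(expected, 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed))
                expected = 0;                        // reset and retry until the lock is free
        }

        void unlock() {
            // Release store: corresponds to lwsync; st (sync; stw in the example above).
            word.store(0, std::memory_order_release);
        }
    };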

Sources: Paul McKenney's C++ paper N2745, Power ISA 2.06, and Clarifying and Compiling C/C++ Concurrency: from C++11 to POWER by Mark Batty, Kayvan Memarian, Scott Owens, Susmit Sarkar, and Peter Sewell, in POPL 2012.

ARM

As far as the memory model is concerned, the ARM processor is broadly similar to PowerPC, differing mainly in having a DMB barrier (analogous to the PowerPC hwsync in its programmer-observable behaviour for normal memory) and no analogue of the PowerPC lwsync. For the non-SC and non-cmpxchg operations, the translation to ARM is very similar to the translation to PowerPC, replacing both lwsync and hwsync by the dmb instruction. The cmpxchg operations are from some Linux kernel sources, and are not direct translations of the PowerPC mappings.

For SC atomics, two mappings have been discussed. The mapping immediately below puts a DMB after an SC load and both before and after an SC store, whereas the "alternative" mapping shown in the second table below follows the PowerPC mapping. It is believed that the former should typically give better performance on ARM processors, but experimental investigation of this would be welcomed.

In any case, it is important that all compilers agree on the choice of mapping, as otherwise SC atomics will not work correctly in code constructed by linking together the results of separate compilation by multiple compilers.

C/C++11 Operation | ARM implementation
Load Relaxed: ldr
Load Consume: ldr + preserve dependencies until next kill_dependency
OR
ldr; teq; beq; isb
OR
ldr; dmb
Load Acquire: ldr; teq; beq; isb
OR
ldr; dmb
Load Seq Cst: ldr; dmb
Store Relaxed: str
Store Release: dmb; str
Store Seq Cst: dmb; str; dmb
Cmpxchg Relaxed (32 bit): _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop
Cmpxchg Acquire (32 bit): _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb
Cmpxchg Release (32 bit): dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop
Cmpxchg AcqRel (32 bit): dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb
Cmpxchg SeqCst (32 bit): dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, rnewval, [rptr]; teq rres, 0; bne _loop; isb
Acquire Fence: dmb
Release Fence: dmb
AcqRel Fence: dmb
SeqCst Fence: dmb

Alternative SC atomic mapping

C/C++11 Operation | ARM implementation
Load Seq Cst: dmb; ldr; teq; beq; isb
OR
dmb; ldr; dmb
Store Seq Cst: dmb; str

Note: the only way to get atomic 64-bit memory accesses on ARM is to use ldrexd/strexd with a loop (the ldrd and strd instructions are not guaranteed to appear atomic).
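
As a usage illustration (our sketch, not compiler output): a 64-bit atomic in C/C++11 therefore cannot be compiled to plain ldrd/strd on these processors; implementations use the exclusive doubleword instructions in a loop instead.

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> counter(0);

    // Per the note above, this 64-bit store cannot be a plain strd if it is to
    // appear atomic; it is implemented with a ldrexd/strexd loop instead.
    void reset_counter() {
        counter.store(0, std::memory_order_relaxed);
    }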

Spinlock implementation

Lock acquire (address of the lock is in r1, the value of taken lock is in r0):
                                           
Loop:
    LDREX R5, [R1]              ; read lock
    CMP R5, #0                  ; check if 0
    WFENE                       ; sleep if the lock is held
    STREXEQ R5, R0, [R1]        ; attempt to store new value
    CMPEQ R5, #0                ; test if store succeeded
    BNE Loop                    ; retry if not
    DMB                         ; ensures that all subsequent accesses are observed after the
                                ; gaining of the lock is observed
    ; loads and stores in the critical region can now be performed
Lock release (r1 contains the address of the lock structure):
    MOV R0, #0
    DMB          ; ensure all previous accesses are observed before the lock is
                 ; cleared
    STR R0, [R1] ; clear the lock

Sources: a straightforward translation of the PowerPC mappings; the cmpxchg sequences were adapted from ARM Linux kernel sources; the spinlock implementation is from ARM's Barrier Litmus Tests and Cookbook.

Itanium

C/C++11 Operation | IA-64 implementation
Load Relaxed: ld.acq
Load Consume: ld.acq
Load Acquire: ld.acq
Load Seq_Cst: ld.acq
Store Relaxed: st.rel
Store Release: st.rel
Store Seq Cst: st.rel; mf
Cmpxchg Acquire: cmpxchg.acq
Cmpxchg Release: cmpxchg.rel
Cmpxchg AcqRel: cmpxchg.rel; mf
Cmpxchg SeqCst: cmpxchg.rel; mf
Consume Fence: <ignore>
Acquire Fence: <ignore>
Release Fence: <ignore>
Acq_Rel Fence: <ignore>
Seq_Cst Fence: mf

Source: Alexander Terekhov's post to the cpp-threads mailing list.

Last updated: 22-12-2011 by Peter Sewell.