The Semantics of x86 Multiprocessor Programs

Multiprocessors are now dominant, but real multiprocessors do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have subtle relaxed (or weak) memory models, usually described only in ambiguous (and sometimes flawed) prose, leading to widespread confusion.

We are working to develop mathematically rigorous and usable semantics for multiprocessor programs, focussing on three architectures: x86 (Intel 64/IA-32 and AMD64), Power, and ARM. Such models should provide a solid intuition for low-level programming, and a sound foundation for future work on verification, static analysis, and compilation of low-level concurrent code, and for the design of high-level concurrent programming languages.

This work is associated with a broader project on Relaxed-Memory Concurrency.

x86-TSO

For x86, we have developed the x86-TSO model. To the best of our knowledge, x86-TSO is sound with respect to actual processor behaviour, matches the current vendor intentions, and is a good model to program above. We give two equivalent definitions of x86-TSO: an intuitive operational model based on local write buffers, and an axiomatic total store ordering model, similar to that of the SPARCv8. Both are adapted to handle x86-specific features. We implemented the axiomatic model in our memevents tool, which calculates the set of all valid executions of test programs and compared the results with experimental data from our litmus tool. The model is described in the CACM paper below, written to be accessible for a broad audience.

The x86-TSO model was introduced and is formally defined in the TPHOLs paper below, together with its accompanying techreport and HOL4 definitions.

The paper below introduces a reasoning principle for programs running above the x86-TSO model: a generalisation, triangular race freedom of a data race freedom property. Of particular interest are the low-level implementations of the abstractions that support language-level concurrency - especially because they invariably contain data races. Triangular race freedom is used to analyze five concurrency abstraction implementations: two spinlocks (from Linux); a non-blocking write protocol; the double-checked locking idiom; and java.util.concurrent's Parker.

The paper below develops a program logic for x86-TSO assembly programs, extending rely-guarantee reasoning. It is formalised in HOL4 and illustrated with Simpson's 4-slot algorithm.

  • A Rely-Guarantee proof system for x86-TSO assembly code programs. Tom Ridge. In VSTTE 2010.

x86-CC

Our earlier x86-CC model, presented at POPL 2009, was based on the then-current Intel and AMD documentation: the Intel 64 and IA-32 Architectures Software Developer's Manual [up to revision 28, Sept. 2008] and AMD64 Architecture Programmer's Manual [vol 2, rev. 3.14, Sept. 2007], and also with the earlier Intel 64 Architecture Memory Ordering White Paper [Aug. 2007]. We tested the semantics against actual processors and the vendor litmus-test examples, and gave an equivalent abstract-machine characterisation of our axiomatic memory model. We proved that the memory model is equivalent to a simpler (`nice execution') definition. For programs that are, in some precise sense, data-race free, we proved in HOL that their behaviour is sequentially consistent. However, that vendor documentation, and hence x86-CC, turned out to be misleading or unsound with respect to the behaviour of actual x86 processors: it forbids behaviour that actual processors exhibit, as detailed in the Addendum to the x86-CC paper below. More recent vendor documentation (revisions 29ff of the Intel SDM) fixes that unsoundness but gives surprisingly weak guarantees to programmers. Accordingly, while the paper may still be of some interest (and the instruction semantics and testing material is still relevant), for most purposes one should focus on the x86-TSO model above.

Experimental data:

Instruction Semantics

To reason about multiprocessor machine-code or assembly code programs, the memory model must be integrated with a semantic definition of the behaviour of machine instructions. The HOL developments above for x86 (x86-CC) and for the Power and ARM (DAMP'09) include such definitions for modest fragments of the instruction sets. In the absence of fully verified processor implementations, we establish confidence in these definitions by empirical testing, comparing the behaviours permitted by the HOL definition against the behaviours observed in practice running randomly generated instructions on actual processors. Our x86sem tool and results for doing this on x86 are below.

See also the work by Fox and Myreen in the same vein on testing Fox and Gordon's ARM semantics.

Tools

Experimental data:

  • From the x86sem sequential testing: results

People

Computer Laboratory, University of Cambridge:

University of Leicester:

INRIA Paris-Rocquencourt:

Funding

We acknowledge funding from the EPSRC (grants EP/H005633/1 and EP/F036345), the INRIA equipes associees MM, and ANR grant ANR-06-SETI-010-02.

an x86-CC example execution

[validate]