Task 2: Architectural Multiprocessor Semantics

2 Task 2: Architectural Multiprocessor Semantics

The first key interface we address is the architectural interface between hardware and software, specifying the multiprocessor behaviours that processor implementations guarantee and that systems code relies on. Normal industrial practice is to specify this interface in informal-prose architecture reference manuals, sometimes with pseudocode descriptions of the sequential instruction semantics, but these descriptions cannot be used directly to test either the hardware below the interface or software above it, or to build verification tools, let alone as a basis for proof. Compounding the problem, architectures must be nondeterministic loose specifications, to accommodate microarchitectural variation, so one cannot simply characterise them with reference implementations.

There are two linked challenges: handling the subtlety of the concurrency architectures, and handling the scale of the instruction-set descriptions. We are addressing these for multiple architectures: the ARM A-class production architecture, which is the basis for the processors used in the majority of mobile phones and many other devices; the IBM POWER production architecture, used for high-performance server machines, among other areas; the open RISC-V architecture; the x86 architecture of Intel and AMD processors, used in the majority of laptop, desktop, and server machines; and the research CHERI architecture, of which more below.

2.1 Multiprocessor Concurrency: Rigorous Concurrency Architecture

The concurrency architecture of a family of multiprocessors defines the range of allowed behaviours of concurrent programs. It is the interface between multiprocessor hardware vendors, who endeavour to ensure that all their hardware implementations have behaviour within that range, and concurrent programmers (of high-performance concurrent algorithms, operating system kernels, etc.), who can thereby assume that range as their programmer’s model. Ten years ago, in 2008, most architectures, including ARM, x86, and IBM POWER, did not have well-defined concurrency architectures; they had only non-rigorous and highly confusing prose descriptions of the allowed behaviour, sometimes augmented by a few examples. These did not provide a usable basis for either hardware testing or concurrent programming.

Our work before REMS, in collaboration with other academic colleagues and with senior IBM staff, established rigorous concurrency models for modest fragments of the x86 and IBM POWER architectures. These covered the fundamental “user-mode” non-privileged concurrency instructions (loads, stores, dependencies, memory barriers, read-modify-write and load-linked/store-conditional instructions), with ad hoc descriptions of the intra-instruction behaviour and assuming no mixed-size accesses.

Within REMS, we have successfully moved to a world in which these aspects of production architectures are mathematically defined, based on rigorous engineering processes including formal modelling, proof about those models, making the models executable as test oracles, and testing against them.

ARMv8-A

We worked closely with ARM staff and our INRIA partners to clarify the concurrency model of their new ARMv8-A architecture, with a combination of rigorous formal modelling, experimental testing, test generation, and mathematical proof. This contributed to ARM’s 2017 switch to a “multi-copy-atomic” model – a substantial simplification. Together we have developed rigorous axiomatic and operational models, principally by ARM and by our academic group respectively, but deeply informed by each other and proved equivalent for the intersection of their scope. The ARM architecture definition now contains the former, replacing its previous unusable prose [7, 8, 10].

Along the way, our experimental testing discovered a number of hardware errata, in core designs from four major vendors, and our test suite is being used within ARM.

RISC-V

Most major architectures (x86, ARM, MIPS, IBM POWER, Itanium) are proprietary. In contrast, the RISC-V architecture is an open-source architecture, gathering increasing academic and industry interest. Its concurrency architecture is broadly similar to the simplified ARMv8 model, but with a number of subtle differences. We have contributed substantially to the RISC-V Memory Model Task Group and the resulting specification [14], which again has (experimentally) equivalent axiomatic and operational models.

We have also provided a permissive open-source library of approximately 7 000 concurrency litmus tests for RISC-V, including the reference results from the two models.

RISC-V Litmus Tests github repo: https://github.com/litmus-tests/litmus-tests-riscv

Mixed-size and instruction-semantics integration

Scaling up from the modest fragments of the instruction set of our pre-REMS work, we have extended all of these models – and also our operational IBM POWER (working closely with senior IBM staff) and x86 models – to cover mixed-size accesses (all the subtleties of overlapping accesses to parts of memory and registers), and integrated them with principled descriptions of the intra-instruction semantics. This finally suffices to define the behaviour of concurrent algorithm code, not just the litmus tests used in concurrency model development.

Our RMEM exploration tool

Combining all the above with extensive search and user-interface engineering, we provide a web-interface tool, RMEM, that allows one to explore, either interactive, randomly, or exhaustively, the behaviour of concurrent code for ARMv8, RISC-V, IBM POWER, and x86. This supports both litmus tests and small ELF binaries (e.g. compiled by gcc or clang), using our semantics for ELF linking and DWARF debug information from Task 3. The model-allowed behaviour results can then be compared with experimental data, obtained using the litmus tool of the diy tool suite (http://diy.inria.fr/doc/), from production silicon or simulation. The sources for our operational models and RMEM tool are available under a permissive open-source licence.

RMEM tool web interface: http://www.cl.cam.ac.uk/users/pes20/rmem.
RMEM github repo: https://github.com/rems-project/rmem.

GPU Concurrency

We also studied the concurrency semantics of GPUs, with colleagues elsewhere – conducting a large empirical study of their behaviour and proposing a model of Nvidia GPU hardware [11].

2.2 Instruction-Set Architecture (ISA)

The second important aspect of architecture specification is the sequential intra-instruction behaviour. At least for the non-privileged part of an architecture, and neglecting floating point, this is usually less semantically involved than the concurrency behaviour, but the scale of modern ISAs makes it a non-trivial problem, particularly when one wants definitions that are both readable and usable for mathematical proof. We have developed two domain-specific languages for describing ISA behaviour: first L3 and then Sail, as introduced in Task 1.

ARMv8-A

We are working closely with ARM to translate their internal (“ASL”) description of their ARMv8-A architecture into Sail [1]. The resulting Sail model is complete enough to boot Linux, and to pass most of the relevant ARM-internal Architecture Validation Suite tests (in the C emulator generated from Sail). We have approval from ARM to release this under a permissive open-source licence. This will be the first modern production architecture to have a public formal definition complete enough for an OS boot; it should provide a foundation for much future work on system software verification. As a first step towards this, we have an Isabelle machine proof of a characterisation of ARMv8-A address translation.

The first version of this was for a fragment based on the ARM public XML release for ARMv8.3-A, excluding system registers. A more complete version, updated to ARMv8.5-A, is now available under a permissive open-source licence (BSD 3 Clause Clear).

ARMv8.5-A ASL-derived Sail ISA model, in a github repo: https://github.com/rems-project/sail-arm.
ARMv8.3-A ASL-derived Sail ISA model, in the Sail github repo: https://github.com/rems-project/sail/tree/sail2/aarch64.

We also have an earlier hand-written Sail ISA model for a modest user-mode ARM fragment; this model is integrated with our RMEM tool to compute the set of all allowed behaviour for ARMv8 concurrency litmus tests or small ELF files.

ARMv8 hand-written Sail ISA model, in the Sail github repo: https://github.com/rems-project/sail/tree/sail2/arm.

ARMv6, ARMv7, and ARM Cortex-M0

We also developed sequential L3 models for the ARMv6-A and ARMv7-A architectures and the ARM Cortex-M0 implementations, the former including floating-point and some system-level instructions, and the latter equipped with accurate cycle counts for work on resource-aware compilation.

RISC-V

We are working closely with our SRI International partners to hand-write a formal RISC-V ISA specification. This is also complete enough to boot an OS, here Linux, FreeBSD, and seL4, and the model is integrated with our RMEM tool to compute the set of all allowed behaviour for the RISC-V concurrency litmus tests mentioned above.

RISC-V Sail ISA model github repo: https://github.com/rems-project/sail-riscv

IBM POWER

We worked closely with IBM to automatically translate their ISA pseudocode, for a modest user-mode non-vector non-floating-point fragment, from an XML version of their ISA documentation (not previously parsed or type-checked) into a Sail formal model (albeit currently only for an older version of Sail) [12]. This too is integrated with the RMEM tool. We tested the ISA model by generating single-instruction test cases and comparing their behaviour in the model and on POWER 7 hardware; for the 154 user-mode branch and fixed-point instructions, we generated 6984 tests.

IBM POWER Sail ISA model, in the Sail github repo: https://github.com/rems-project/sail/tree/sail2/power.

x86

We have hand-written a Sail model for a modest user-mode fragment of x86, again integrated with RMEM to support concurrency exploration.

x86 Sail ISA model, in the Sail github repo: https://github.com/rems-project/sail/tree/sail2/x86.

2.3 CHERI Architecture Formal Engineering

Turning to large-scale systems research, the CHERI project, led by Moore and Watson at Cambridge and Neumann at SRI International, largely DARPA-funded from 2010 to date, is developing new architectural mechanisms for security protection, revisiting the hardware-software security interface for general-purpose CPUs to fundamentally improve security. CHERI includes an architecture design, processor implementations, and a software stack adapting Clang/LLVM and FreeBSD.

In contrast to ARM and IBM POWER, this is a research architecture, not a mainstream production architecture, but it is being developed at scale, e.g. with a software stack including a variant of a production OS.

CHERI was originally developed using traditional systems and computer architecture methods, with prose and pseudocode descriptions of the architecture, hand-written test suites, and hardware and software prototypes – though with an unusual hardware/software co-design process. REMS created the opportunity for new collaborations between the original CHERI team, spanning systems, security, and computer architecture, and REMS researchers in semantics, verification, and programming languages. This gave us an ideal context to develop and try new rigorous engineering methods in practice:

Since 2014, CHERI architecture research has used formal models of the architecture as central design tools – first in our L3 language, and now in our Sail language. The Sail version is used as the principal definition in the current working version of the CHERI ISA specification (to be released as version 7 of the CHERI architecture).
An SML emulator (running at several hundred KIPS) automatically generated from the L3 formal model has been invaluable, e.g. to bring up a bootable CHERI version of FreeBSD above the model, and as an oracle for testing the CHERI hardware implementation.
We have augmented the hand-written CHERI test suite with an automatic test generator: it creates randomized instruction sequences paired with precise descriptions of their required behaviour, machine-checked against our model by HOL4/SMT provers [26, 25, 24]. We have also produced a generic synthesisable test bench, BlueCheck [19]. This has been used to check various microarchitectural properties of CHERI, finding some bugs. It has moreover been used for teaching in MIT (in the 6.175 Computer Architecture course).
As a first exercise in developing mathematically rigorous proof of specific security properties about an architecture description, we have generated theorem-prover definitions (in the Isabelle and HOL4 provers), discussed and clarified some of the fundamental security properties the CHERI designers intended the architecture to guarantee, and produced machine proofs (in Isabelle) of some of these.
We have proved (mechanised proof, in HOL4) correctness properties of the “CHERI-128” compressed capability scheme.
We have further contributed to the CHERI hardware implementation by building an efficient checker for its memory consistency model, our Axe tool [9], which has also been used for RISC-V.

The theory/practice interactions with the CHERI team have also been fruitful at higher levels in the stack, for the C language, linking, and OS interfaces, as we return to below. This has only been possible with intertwined groups that cover both systems and semantics at multiple layers of the stack, from hardware to systems software.

2.4 CakeML ISA models

We have used L3 to define the user-mode instruction semantics for each of the target architectures of the CakeML verified compiler, and to generate proof infrastructure for them [64, 65, 66, 67]. CakeML uses the above ARMv6-A and ARMv8-A L3 models, together with MIPS and x86-64 models developed here and a RISC-V model developed by our partners at SRI International.