# Operational Semantics of Concurrent Assembly

# Shaked Flur

Collaboration with: Peter Sewell
Christopher Pulte
Luc Maranget and others

# What Does this Program Do?

```
MOV X2,#1
STR X2,[X0]
STR X2,[X1]
```

# What Does this Program Do?

| Thread 0 |         | Thread 1     |
|----------|---------|--------------|
| MOV      | X2,#1   | LDR X3,[X1]  |
| STR      | X2,[X0] | LDR X4, [X0] |
| STR      | X2,[X1] |              |

#### Outline of the Talk

What's wrong with the reference manual?

Why operational semantics?

Why concurrency is hard?

An operational model for ARMv8

**Further abstractions** 

- TPHOLs 2009 A better x86 memory model: x86-TSO. Scott Owens, Susmit Sarkar, and Peter Sewell.
  - PLDI 2011 Understanding POWER Multiprocessors. Susmit Sarkar, Peter Sewell,

    Jade Alglave, Luc Maranget, and Derek Williams.
  - POPL 2016 Modelling the ARMv8 architecture, operationally: concurrency and ISA. Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell.
  - POPL 2017 Mixed-size Concurrency: ARM, POWER, C/C++11, and SC.

    Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, Luc Maranget, Kathryn
    E. Gray, Ali Sezgin, Mark Batty, and Peter Sewell.
  - POPL 2018 Simplifying ARM Concurrency: Multicopy-atomic Axiomatic and Operational Models for ARMv8. Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell.
  - POPL 2019 ISA Semantics for ARMv8-A, RISC-V, and CHERI-MIPS.

    Alasdair Armstrong, Thomas Bauereiss, Brian Campbell, Alastair Reid, Kathryn E. Gray,

    Robert M. Norton, Prashanth Mundkur, Mark Wassell, Jon French, Christopher Pulte,

    Shaked Flur, Ian Stark, Neel Krishnaswami, and Peter Sewell.
  - POPL 2020 ARMv8-A system semantics: instruction fetch in relaxed architectures. Ben Simner, Shaked Flur, Christopher Pulte, Alasdair Armstrong, Jean Pichon-Pharabod, Luc Maranget, and Peter Sewell.

# ARM Architecture Reference Manual is 7900 pages!!!

Informal, ambiguous, imprecise, incomplete...



# **Formal Semantics**

#### The architecture can be decomposed to two main parts:

- ► Instruction semantics (sequential execution)
- Memory model (multi-threading)

#### The architecture can be decomposed to two main parts:

- Instruction semantics (sequential execution)
- Memory model (multi-threading)

#### The ARM manual:

```
bits(datasize) result;
result = Zeros();
result < pos + 15: pos > = imm16;
X[d] = result;
```

#### Sail:

```
(bit['R]) result := 0;
result := Zeros();
result[(pos+15)..pos] := imm16;
wX(d) := result;
```

# MOV X2,#1

#### The architecture can be decomposed to two main parts:

- Instruction semantics (sequential execution)
- ► Memory model (multi-threading)

#### The ARM manual:

```
bits(datasize) result;
result = Zeros();
result < pos + 15: pos > = imm16;
X[d] = result;
```

#### Sail:

```
(bit['R]) result := 0;
result := Zeros();
result[(pos+15)..pos] := imm16;
wX(d) := result;
```

# MOV X2,#1

- ► There is no prior reliable model
- We model the architect's intent
- Our model provides formal foundation for other models

- ► There is no prior reliable model
- We model the architect's intent
- Our model provides formal foundation for other models

#### How do we make our model reliable?

- There is no prior reliable model
- We model the architect's intent
- Our model provides formal foundation for other models

#### How do we make our model reliable?

- Litmus tests with known results
- Soundness with respect to existing hardware

- There is no prior reliable model
- We model the architect's intent
- Our model provides formal foundation for other models

#### How do we make our model reliable?

- Litmus tests with known results
- Soundness with respect to existing hardware
- ► Expressed in the same terms architects use

**Axiomatic semantics** 

**Denotational semantics** 

#### **Axiomatic semantics**

- Candidate execution over memory events
- Apply a predicate to the execution
- ► The main construct here is relations

#### **Denotational semantics**

#### **Axiomatic semantics**

- Candidate execution over memory events
- Apply a predicate to the execution
- ► The main construct here is relations

#### **Denotational semantics**

Construct executions by applying rules to partially ordered multi-sets

#### **Axiomatic semantics**

- Candidate execution over memory events
- Apply a predicate to the execution
- ► The main construct here is relations

#### **Denotational semantics**

Construct executions by applying rules to partially ordered multi-sets

- State machine
- Construct execution incrementally
- Relations emerge from the structure (not explicit)
- ► This is the language architects use

#### Axiomatic and Denotational Semantics

- ► No microarchitectural intuition
- Not a reliable foundation
- How do we add system-level features?
- ► Allow more behaviour than the architecture
- For well-behaved code
- Combinatorial explosion
- Better suited for higher-level programming languages

- ► Familiar to the architects (microarchitectural intuition)
- Captures the architects intent
- Tightly integrated with instruction semantics
- Relatively easy to extend with system-level features
- Construct valid executions incrementally

# The ARM Operational Memory Model

- Written in Lem
- Core functionality: ~7.5k LoC
- ▶ Utilities and other models: ~10k LoC

```
let reorder_pred params state feic =
  match feic huffer below with
               -> Nothing
   e :: below ->
      guard ((e. feic.event) NIN state.reordered) >>
      guard (reorder_pred' params feic.event e) >>
      return (e, below)
  end
let reorder_next _params state feic = fun e below ->
  let buffer =
    feic.buffer_above ++ [e: feic.event] ++ below
  in
  < state with
    huffers =
     Map, insert feic, segment buffer state, buffers:
    reordered =
      {(feic.event. e)} union state.reordered:
  1>
```









#### https://www.cl.cam.ac.uk/~sf502/regressions/rmem/



# Why Concurrency is Hard?

- Computers create an illusion that single threaded code is executed sequentially
- ▶ It is not practical to do the same for multi-threaded programs
- Multi-threaded programs can observe optimisations

# Taste of Multi-threaded Behaviour

| Thread 0          |                             | Thread 1                     |
|-------------------|-----------------------------|------------------------------|
| MOV<br>STR<br>STR | X2,#1<br>X2,[X0]<br>X2,[X1] | LDR X3, [X1]<br>LDR X4, [X0] |







R data

W flag=1













#### Out of Order Execution





#### Out of Order Execution



R data=0

W flag=1

## Forcing In Order Execution

| Initial state: X0=&data, X1=&flag                          |                                                        |  |
|------------------------------------------------------------|--------------------------------------------------------|--|
| Thread 0                                                   | Thread 1                                               |  |
| MOV X2,#1<br>STR X2,[X0]<br>DMB SY//barrier<br>STR X2,[X1] | LDR X3,[X1]<br>AND X4,X3,#0//address<br>LDR X5,[X0,X4] |  |
| Final state: 1:X3=1 ∧ 1:X5=0                               |                                                        |  |

## Forcing In Order Execution

| Initial state: X0=&data, X1=&flag                          |                                                  |  |
|------------------------------------------------------------|--------------------------------------------------|--|
| Thread 0                                                   | Thread 1                                         |  |
| MOV X2,#1<br>STR X2,[X0]<br>DMB SY//barrier<br>STR X2,[X1] | LDR X3,[X1] AND X4,X3,#0//address LDR X5,[X0,X4] |  |
| Final state: 1: X3=1 ∧ 1: X5=0                             |                                                  |  |



## Forcing In Order Execution

| Initial state: X0=&data, X1=&flag                          |                                                  |  |
|------------------------------------------------------------|--------------------------------------------------|--|
| Thread 0                                                   | Thread 1                                         |  |
| MOV X2,#1<br>STR X2,[X0]<br>DMB SY//barrier<br>STR X2,[X1] | LDR X3,[X1] AND X4,X3,#0//address LDR X5,[X0,X4] |  |
| Final state: 1:X3=1 \ 1:X5=0                               |                                                  |  |



# Speculative Execution (Load)

| Initial state: XO=&data, X1=&flag |                                   |                                                          |  |
|-----------------------------------|-----------------------------------|----------------------------------------------------------|--|
| Thread 0 Thread 1                 |                                   |                                                          |  |
| STR<br>DMB                        | X2,#1<br>X2,[X0]<br>SY<br>X2,[X1] | LDR X3,[X1]<br>CBZ X3,end//branch<br>LDR X4,[X0]<br>end: |  |
| Final state: 1:X3=1 \( 1:X4=0 \)  |                                   |                                                          |  |

## Speculative Execution (Load)

| Initial state: XO=&data, X1=&flag |         |                                                          |
|-----------------------------------|---------|----------------------------------------------------------|
| TI                                | hread 0 | Thread 1                                                 |
| STR<br>DMB                        |         | LDR X3,[X1]<br>CBZ X3,end//branch<br>LDR X4,[X0]<br>end: |
| Final state: 1:X3=1 ∧ 1:X4=0      |         |                                                          |



## Speculative Execution (Load)

| Initial state: XO=&data, X1=&flag                 |                                                          |  |  |
|---------------------------------------------------|----------------------------------------------------------|--|--|
| Thread 0                                          | Thread 1                                                 |  |  |
| MOV X2,#1<br>STR X2,[X0]<br>DMB SY<br>STR X2,[X1] | LDR X3,[X1]<br>CBZ X3,end//branch<br>LDR X4,[X0]<br>end: |  |  |
| Final state: 1                                    | X3=1 /\ 1:X4=0                                           |  |  |



# Speculative Execution (Store)

| Initial state: X0=&a, X1=&b           |                                   |                                                                   |  |
|---------------------------------------|-----------------------------------|-------------------------------------------------------------------|--|
| Thread 0 Thread 1                     |                                   |                                                                   |  |
| DMB<br>MOV                            | X2,[X0]<br>SY<br>X3,#1<br>X3,[X1] | LDR X2, [X1]<br>CBZ X2, end<br>MOV X3, #1<br>STR X3, [X0]<br>end: |  |
| Final state: 0:X2=1 \(\times 1:X2=1\) |                                   |                                                                   |  |

## Speculative Execution (Store)

| Initial state: XO=&a, X1=&b  |                                   |                                                                   |  |  |
|------------------------------|-----------------------------------|-------------------------------------------------------------------|--|--|
| Thread 0 Thread 1            |                                   |                                                                   |  |  |
| DMB<br>MOV                   | X2,[X0]<br>SY<br>X3,#1<br>X3,[X1] | LDR X2, [X1]<br>CBZ X2, end<br>MOV X3, #1<br>STR X3, [X0]<br>end: |  |  |
| Final state: 0:X2=1 ∧ 1:X2=1 |                                   |                                                                   |  |  |



# Speculative Execution (Store)

| Initial state: X0=&a, X1=&b |          |         |       |            |
|-----------------------------|----------|---------|-------|------------|
| Thread 0 Thread 1           |          |         |       |            |
| LDR                         | X2,[X0]  | LDR     | Х2,   | [X1]       |
| DMB                         | SY       | CBZ     | 12,6  | nd         |
| VOM                         | X3,#1    | MOV     | X3, ‡ | <b>‡</b> 1 |
| STR                         | X3,[X1]  | STR     | X3,   | [XO]       |
|                             |          | end:    |       |            |
| Fina                        | etato: A | V 2 - 1 | 1 . V | 2-1        |



## Write Forwarding

Initial state: X0=&data, X1=&flag,
X2=&temp

| Thread 0   |                                   | Thread 1  |  |
|------------|-----------------------------------|-----------|--|
| STR<br>DMB | X3,#1<br>X3,[X0]<br>SY<br>X3,[X1] | MOV X4,#1 |  |

Final state: 1:X3=1 \(\times 1:X5=1 \) \(\times 1:X7=0 \)

## Write Forwarding

# Initial state: X0=&data, X1=&flag, X2=&temp

| Thread 0   |                                   | Thread 1                        |                                   |
|------------|-----------------------------------|---------------------------------|-----------------------------------|
| STR<br>DMB | X3,#1<br>X3,[X0]<br>SY<br>X3,[X1] | CBZ<br>MOV<br>STR<br>LDR<br>AND | X5,[X2]<br>X6,X5,#0<br>X7,[X0,X6] |

Final state: 1:X3=1 \(\Lambda\) 1:X5=1 \(\Lambda\)



## Write Forwarding

| Initial state: | X0=&data, | X1=&flag, |
|----------------|-----------|-----------|
| X2=&temp       |           |           |

| Thread 0 |         | Thread 1         |
|----------|---------|------------------|
| MOV      | X3,#1   | LDR X3, [X1]     |
| STR      | X3,[X0] | CBZ X3,end       |
| DMB      | SY      | MOV X4,#1        |
| STR      | X3,[X1] | STR X4, [X2]     |
|          |         | LDR X5, [X2]     |
|          |         | AND X6, X5, #0   |
|          |         | LDR X7, [X0, X6] |
|          |         | end:             |

Final state: 1:X3=1 \(\Lambda\) 1:X5=1 \(\Lambda\)























#### Thread transitions

#### Execute a store

Condition: a store can be executed if

- 1. all previous stores and loads to the same location have been executed;
- 2. all previous branches are determined; and
- 3. all previous barriers have been executed.

Action: add a write to the top of the buffer associated with the thread.

#### Execute a barrier

Condition: a barrier can be executed if

- all previous stores and loads to the same location have been executed; and
- 2. all previous branches are determined;

Action: add a barrier to the top of the buffer associated with the thread.

. . .

## Storage transitions

#### Flow event

The bottom most event of a non-root buffer can flow to the top of the root buffer, and the bottom most event of the root buffer can flow into memory.

#### Satisfy a read

A read that is adjacent to a write (read is higher) to the same location can be satisfied by the write.

#### Reorder events

Two adjacent memory accesses in a buffer can be reordered with each other, if they are to different memory locations.

. . .

Initial state: X0=&data, X1=&flag,
0:X2=1

| Thread 0 |         | Thread 1                                      |
|----------|---------|-----------------------------------------------|
| STR      | X2,[X0] | LDR X3,[X1]<br>AND X4,X3,#0<br>LDR X5,[X0,X4] |
| DMB      | SY      | AND X4, X3, #0                                |
| STR      | X2,[X1] | LDR X5, [X0, X4]                              |

Forbidden: 1:X3=1 \(\Lambda\) 1:X5=0







































































# Multi-copy Atomicity

Behaviour that arise from buffer topologies other than the flat one:



This turned out to be more complicated than ARM expected.

# Multi-copy Atomicity

Behaviour that arise from buffer topologies other than the flat one:



This turned out to be more complicated than ARM expected.

#### More Abstract Models

#### The Flat Model

- Similar thread sub-system
- ► Storage is a simple array (no buffers)

#### More Abstract Models

#### The Flat Model

- Similar thread sub-system
- Storage is a simple array (no buffers)

#### The Promising Model

- Instead of speculation, promise stores
- Instructions are executed in one step
- Instructions are executed in-order (almost)
- More suitable as basis for program logic

#### More Abstract Models

#### The Flat Model

- Similar thread sub-system
- Storage is a simple array (no buffers)

#### The Promising Model

- Instead of speculation, promise stores
- Instructions are executed in one step
- Instructions are executed in-order (almost)
- ► More suitable as basis for program logic

#### **Axiomatic Model**

- Very concise
- Appears in the ARM reference manual

# Research Methodology



# Research Methodology



#### Contribution

- As a result of my work ARM has changed the memory model
- ► A reliable model of the ARMv8 architecture
- ► Machine readable definitions
- A tool, derived from the definitions (RMEM)
- Bugs in commercial hardware
- Similar work for RISC-V

#### Contribution

- As a result of my work ARM has changed the memory model
- ► A reliable model of the ARMv8 architecture
- Machine readable definitions
- A tool, derived from the definitions (RMEM)
- Bugs in commercial hardware
- Similar work for RISC-V

# Thank you