# The Flat Operational Model

- CHRISTOPHER PULTE, University of Cambridge
- SHAKED FLUR, University of Cambridge
- WILL DEACON, ARM Ltd.

- JON FRENCH, University of Cambridge
- SUSMIT SARKAR, University of St. Andrews
- PETER SEWELL, University of Cambridge

This document gives a prose description of the Flat operational model, as formally defined in its Lem definition. This is part of the supplementary material for "Simplifying ARM Concurrency: Multicopy-atomic Axiomatic and Operational Models for ARMv8".

### 1 AN OPERATIONAL MODEL FOR MCA ARMV8

To help reading this document we have colour-coded some text as follows:

- [release/] Release/Acquire instructions
- [exclusive] Exclusive instructions
- [ dmb ld/ ] dmb ld and dmb st instructions

The operational model is expressed as a state machine, with states that are an abstract representation of hardware machine states. We first introduce the model states and transitions informally.

Model states A model state consists just of a shared memory and a tuple of thread model states:



The shared memory state effectively just records the most recent write to each location. To handle load/store-exclusives, the memory is extended with a map (the exclusives map) from read requests to sets of write slices, that associates a read request of a load-exclusive with the write slices it read from (excluding writes that have been forwarded to the read and have not reached memory yet).

Each thread model state consists principally of a list or tree of instruction instances, some of which have been finished, and some of which have not. For example, below we show a thread model state with instruction instances  $i_1, \ldots, i_{13}$ , and the program-order-successor relation between them. Three of those ( $i_1$ ,  $i_3$ , and  $i_4$ , boxed) have been finished; the remainder are non-finished.



Authors' addresses: Christopher Pulte, University of Cambridge, first.last@cl.cam.ac.uk; Shaked Flur, University of Cambridge, first.last@cl.cam.ac.uk; Will Deacon, ARM Ltd. first.last@arm.com; Jon French, University of Cambridge, first.last@cl.cam.ac.uk; Susmit Sarkar, University of St. Andrews, ss265@st-andrews.ac.uk; Peter Sewell, University of Cambridge, first.last@cl.cam.ac.uk.

2017. XXXX-XXXX/2017/10-ART \$15.00

48 https://doi.org/10.1145/nnnnnnnnn

#### :2 Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell

Non-finished instruction instances can be subject to restart, e.g. if they depend on an out-of-order or 50 speculative read that turns out to be unsound. The finished instances are not necessarily contiguous: 51 in the example,  $i_3$  and  $i_4$  are finished even though  $i_2$  is not, which can only happen if they are 52 sufficiently independent. Instruction instances  $i_5$  and  $i_9$  are conditional branches for which the 53 thread has fetched multiple possible successors. When a conditional branch is finished, any un-54 55 taken alternative paths are discarded, and instruction instances that follow (in program order) a non-finished conditional branch cannot be finished until that conditional branch is. One can choose 56 57 whether or not to allow simultaneous exploration of multiple successors of a conditional branch (as shown above); this does not affect the set of allowed outcomes. 58

The intra-instruction behaviour of a single instruction can largely be treated as sequential (but not atomic) execution of its ASL/Sail pseudocode. Each instruction instance state includes a pseudocode execution state, which one can think of as a representation of the pseudocode control state, pseudocode call stack, and local variable values. An instruction instance state also includes information, detailed below, about the instruction instance's memory and register footprints, its register and memory reads and writes, whether it is finished, etc.

**Model transitions** For any state, the model defines the set of allowed transitions, each of which is a single atomic step to a new abstract machine state. Each transition arises from the next step of a single instruction instance; it will change the state of that instance, and it may depend on or change the rest of its thread state and/or the shared memory state. Instructions cannot be treated as atomic units: complete execution of a single instruction instance may involve many transitions, which can be interleaved with those of other instances in the same or other threads, and some of this is programmer-visible. The transitions are introduced below and defined in §1.4, with a precondition and a construction of the post-transition model state for each. The transitions labelled  $\circ$  can always be taken eagerly, as soon as they are enabled, without excluding other behaviour; the  $\bullet$  cannot.

Transitions for all instructions:

- Fetch instruction: This transition represents a fetch and decode of a new instruction instance, as a program-order successor of a previously fetched instruction instance, or at the initial fetch address for a thread.
- Register read: This is a read of a register value from the most recent program-order predecessor instruction instance that writes to that register.
- Register write
- Pseudocode internal step: this covers ASL/Sail internal computation, function calls, etc.
- Finish instruction: At this point the instruction pseudocode is done, the instruction cannot be restarted or discarded, and all memory effects have taken place. For a conditional branch, any non-taken po-successor branches are discarded.

Load instructions:

- Initiate memory reads of load instruction: At this point the memory footprint of the load is provisionally known and its individual reads can start being satisfied.
- Satisfy memory read by forwarding from writes: This partially or entirely satisfies a single read by forwarding from its po-previous writes.
- Satisfy memory read from memory: This entirely satisfies the outstanding slices of a single read, from memory.
- Complete load instruction (when all its reads are entirely satisfied): At this point all the reads of the load have been entirely satisfied and the instruction pseudocode can continue execution. A load instruction can be subject to being restarted until the Finish instruction transition. In some cases it is possible to tell that a load instruction will not be restarted or

65

66

67

68

69

70

71

72

73

74 75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

discarded before that, e.g. when all the instructions po-before the load instruction are finished. The Restart condition over-approximates the set of instructions that might be restarted.

# <sup>101</sup> Store instructions:

99

100

103

104

105

106

113

114

115

116 117

118

- Initiate memory writes of store instruction, with their footprints: At this point the memory footprint of the store is provisionally known.
- Instantiate memory write values of store instruction: At this point the writes have their values and program-order-subsequent reads can be satisfied by forwarding from them.
- Commit store instruction: At this point the store is guaranteed to happen (it cannot be restarted or discarded), and the writes can start being propagated to memory.
- Propagate memory write: This propagates a single write to memory.
- Complete store instruction (when its writes are all propagated): At this point all writes have
   been propagated to memory, and the instruction pseudocode can continue execution.
- <sup>112</sup> Store-exclusive instructions:
  - Guarantee the success of store-exclusive: This guarantees the success of the store-exclusive.
  - Make a store-exclusive fail: This makes the store-exclusive fail.

#### Barrier instructions:

• Commit barrier

## 119 1.1 Intra-instruction Pseudocode Execution

To link the model transitions introduced above to the execution of the instructions an interface is needed between Sail and the rest of the concurrency model. For each instruction instance this intra-instruction semantics is expressed as a state machine, essentially running the instruction pseudocode, where each pseudocode execution state is a request of one of the following forms:

| 124 |                                                       |                          |
|-----|-------------------------------------------------------|--------------------------|
| 125 | READ_MEM(read_kind, address, size, read_continuation) | Read request             |
| 126 | Excl_res(res_continuation)                            | Store-exclusive result   |
| 127 | WRITE_EA(write_kind, address, size, next_state)       | Write effective address  |
| 127 | WRITE_MEMV(memory_value, write_continuation)          | Write value              |
| 128 | BARRIER( <i>barrier_kind</i> , <i>next_state</i> )    | Barrier                  |
|     | READ_REG(reg_name, read_continuation)                 | Register read request    |
| 130 | WRITE_REG(reg_name, register_value, next_state)       | Write register           |
| 131 |                                                       | e                        |
| 132 | INTERNAL( <i>next_state</i> )                         | Pseudocode internal step |
| 133 | Done                                                  | End of pseudocode        |
|     |                                                       |                          |

Each of these states is a suspended computation with a request for an action or input from the concurrency model and, except in the case of DONE, a continuation for the remaining execution.

Here memory values are lists of bytes, addresses are 64-bit numbers, read and write kinds identify 136 whether they are regular, exclusive, and/or release/acquire operations, register names identify 137 a register and slice thereof (start and end bit indices), and the continuations describe how the 138 instruction instance will continue for any value that might be provided by the surrounding memory 139 model. This largely follows Gray et al. [2015, §2.2], except that memory writes are split into two 140 steps, WRITE\_EA and WRITE\_MEMV. We ensure these are paired in the pseudocode, but there may 141 be other steps between them: it is observable that the WRITE\_EA can occur before the value to 142 be written is determined, because the potential memory footprint of the instruction becomes 143 provisionally known then. 144

We ensure that each instruction has at most one memory read, memory write, or barrier step, by rewriting the pseudocode to coalesce multiple reads or writes, which are then split apart into

the architecturally atomic units by the thread semantics; this gives a single commit point for all
 memory writes of an instruction.

150 Each bit of a register read should be satisfied from a register write by the most recent (in program order) instruction instance that can write that bit, or from the thread's initial register state if there is 151 no such. That instance may not have executed its register write yet, in which case the register read 152 should block. The semantics therefore has to know the register write footprint of each instruction 153 instance, which it calculates when the instruction instance is created. We ensure in the pseudocode 154 that each instruction does exactly one register write to each bit of its register footprint, and also 155 that instructions do not do register reads from their own register writes. In some cases, but not in 156 the fragment of ARM that we cover at present, register write footprints need to be dynamically 157 recalculated, when the actual footprint only becomes known during pseudocode execution. 158

159 Data-flow dependencies in the model emerge from the fact that a register read has to wait for 160 the appropriate register write to be executed (as described above). This has to be carefully handled 161 in order not to create unintentional strength. First, for some instructions we need to ensure that the pseudocode is in the maximally liberal order, e.g. to allow early computed-address register 162 writebacks before the corresponding memory write. Leaving load-pair aside (which we do not 163 cover), and the treatment of the multiple reads or writes that can be associated with a single load 164 or store instruction (which we do), we have not so far needed other intra-instruction concurrency. 165 Second, the model has to be able to know when a register read value can no longer change (i.e. due 166 to instruction restart). We approximate that by recording, for each register write, the set of register 167 and memory reads the instruction instance has performed at the point of executing the write. 168 This information is then used as follows to determine whether a register read value is final: if the 169 instruction instance that performed the register write from which the register reads from is finished, 170 171 the value is final; otherwise check that the recorded reads for the register write do not include memory reads, and continue recursively with the recorded register reads. For the instructions we 172 cover this approximation is exact. 173

We express the pseudocode execution semantics in two ways: a definitional interpreter for Sail [Gray et al. 2015], with an exhaustive symbolic mode to (re)calculate an instruction's memory and register footprints, and as a shallow embedding, translating Sail into directly executable code, with separate hand-written definitions of the footprint functions. The two are essentially equivalent: the first lets one small-step through the pseudocode interactively, while the second is more efficient and should be more convenient for proof.

#### 1.2 Instruction Instance States

Each instruction instance *i* has a state comprising:

- program\_loc, the memory address from which the instruction was fetched;
- *instruction\_kind*, identifying whether this is a load, store, or barrier instruction, each with the associated kind; or a conditional branch; or a 'simple' instruction.
- *regs\_in*, the set of input *reg\_names*, as statically determined;
- *regs\_out*, the output *reg\_names*, as statically determined;
- *pseudocode\_state* (or sometimes just 'state' for short), one of
  - *PLAIN next\_state*, ready to make a pseudocode transition;
  - PENDING\_MEM\_READS read\_cont, performing the read(s) from memory of a load; or
  - *Pending\_mem\_writes write\_cont*, performing the write(s) to memory of a store;
- reg\_reads, the accumulated register reads, including their sources and values, of this instance's execution so far;

195 196

180 181

182

183 184

185

186

187

188

189

190

191

192

193

194

, Vol. 1, No. 1, Article . Publication date: October 2017.

#### Simplifying ARM Concurrency

197

198 199

200

201

202

203 204

205

206

207

221

222

223

224

225

226

227

228

229 230

231 232

233 234

235

236

237

238

239

240

241

242

243

- *reg\_writes*, the accumulated register writes, including dependency information to identify the register reads and memory reads (by this instruction) that might have affected each;
  - *mem\_reads*, a set of memory read requests. Each request includes a memory footprint (an address and size) and, if the request has already been satisfied, the set of write slices (each consisting of a write and a set of its byte indices) that satisfied it.
    - *mem\_writes*, a set of memory write requests. Each request includes a memory footprint and, when available, the memory value to be written. In addition, each write has a flag that indicates whether the write has been propagated (passed to the memory) or not.
      - [exclusive] *successful\_exclusive*, for store-exclusives, indicates whether it was previously guaranteed to succeed or made to fail.
      - information recording whether the instance is committed, finished, etc.

208 Read requests include their read kind and their memory footprint (their address and size), the 209 as-yet-unsatisfied slices (the byte indices that have not been satisfied), and, for the satisfied slices, 210 information about the write(s) that they were satisfied from. Write requests include their write 211 kind, their memory footprint, and their value. When we refer to a write or read request without 212 mentioning the kind of request we mean the request can be of any kind. A load instruction which 213 has initiated (so its read request list *mem\_reads* is not empty) and for which all its read requests 214 are satisfied (i.e. there are no unsatisfied slices) is said to be entirely satisfied. A load-exclusive is 215 called *successful* if the first po-following store-exclusive that has not been made to fail has been 216 guaranteed to succeed (as opposed to does not exist or has not been guaranteed to succeed or made 217 to fail). The successful load-exclusive and the successful store-exclusive are said to be *paired*. If a 218 successful load-exclusive has a read request that is mapped, in the exclusives map, to a write slice 219 ws, we say the load-exclusive has an outstanding lock on ws. 220

### 1.3 Thread States

The model state of a single hardware thread includes:

- *thread\_id*, a unique identifier of the thread;
- *register\_data*, the name, bit width, and start bit index for each register;
- *initial\_register\_state*, the initial register value for each register;
- *initial\_fetch\_address*, the initial fetch address for this thread;
- *instruction\_tree*, a tree or list of the instruction instances that have been fetched (and not discarded), in program order.

## 1.4 Model Transitions

**Fetch instruction** A possible program-order successor of instruction instance *i* can be fetched from address *loc* if:

- it has not already been fetched, i.e., none of the immediate successors of *i* in the thread's *instruction\_tree* are from *loc*;
  - (2) *loc* is a possible next fetch address for *i*:
    - (a) for a non-branch/jump instruction, the successor instruction address (*i.program\_loc+4*);
  - (b) for an instruction that has performed a write to the program counter register (\_PC), the value that was written;
- (c) for a conditional branch, either the successor address or the branch target address<sup>1</sup>; or
- (d) for a jump to an address which is not yet determined, any address (this is approximated in our tool implementation, necessarily); and

<sup>&</sup>lt;sup>244</sup> <sup>1</sup>In AArch64, all the conditional branch instructions have statically determined addresses.

#### :6 Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell

(3) there is a decodable instruction in program memory at *loc*.

Note that this allows speculation past conditional branches and calculated jumps.

Action: construct a freshly initialized instruction instance *i'* for the instruction in the program memory at *loc*, including the static information available from the ISA model such as its *instruction\_kind*, *regs\_in*, and *regs\_out*, and add *i'* to the thread's *instruction\_tree* as a successor of *i*.

This involves only the thread, not the storage subsystem, as we assume a fixed program rather than modelling fetches with memory reads; we do not model self-modifying code.

### **Initiate memory reads of load instruction** An instruction instance *i* with next state

*READ\_MEM(read\_kind, address, size, read\_cont)* can initiate the corresponding memory reads. Action:
 (1) Construct the appropriate read requests *rrs*:

- if *address* is aligned to *size* then *rrs* is a single read request of *size* bytes from *address*;
- otherwise, *rrs* is a set of *size* read requests, each of one byte, from the addresses *ad*-*dress*...*address+size-1*.
- (2) set *i.mem\_reads* to *rrs*; and
- (3) update the state of *i* to *PENDING\_MEM\_READS read\_cont*.

**Satisfy memory read by forwarding from writes** For a load instruction instance *i* in state *PENDING\_MEM\_READS read\_cont*, and a read request, *r* in *i.mem\_reads* that has unsatisfied slices, the read request can be partially or entirely satisfied by forwarding from unpropagated writes by store instruction instances that are po-before *i*, if the *read-request-condition* predicate holds. This is if:

- (1) all po-previous dmb sy and isb instructions are finished;
- (2) [ dmb ld/ ] all po-previous dmb ld instructions are finished;
- (3)  $\begin{bmatrix} release/ \\ acquire \end{bmatrix}$  if *i* is a load-acquire, all po-previous store-releases are finished; and
- (4) [release/ acquire ] all non-finished po-previous load-acquire instructions are entirely satisfied.

Let wss be the maximal set of unpropagated write slices from store instruction instances po-before i (if i is a load-acquire, exclude store-exclusive writes), that overlap with the unsatisfied slices of r, and which are not superseded by intervening stores that are either propagated or read from by this thread. That last condition requires, for each write slice ws in wss from instruction i':

- that there is no store instruction po-between i and i' with a write overlapping ws, and
- that there is no load instruction po-between *i* and *i'* that was satisfied from an overlapping write slice from a different thread.

#### Action:

246 247

248

249

250

251

252

253

257

258

259

260

261 262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

285

286

287

288

- (1) update r to indicate that it was satisfied by wss; and
- (2) restart any speculative instructions which have violated coherence as a result of this, i.e., for
  every non-finished instruction *i'* that is a po-successor of *i*, and every read request *r'* of *i'* that
  was satisfied from *wss'*, if there exists a write slice *ws'* in *wss'*, and an overlapping write slice
  from a different write in *wss*, and *ws'* is not from an instruction that is a po-successor of *i*,
  restart *i'* and its data-flow dependents (including po-successors of load-acquire instructions).

Note that store-release writes cannot be forwarded to load-acquires: a load-acquire instruction cannot be satisfied before all po-previous store-release instructions are finished, and *wss* does not include writes from finished stores (as those must be propagated).

Satisfy memory read from memory For a load instruction instance *i* in state *PENDING\_MEM\_READS read\_cont*, and a read request *r* in *i.mem\_reads*, that has unsatisfied slices, the read request can be satisfied from memory if *i* is not a successful load-exclusive or no other successful load-exclusive from a different thread has an outstanding lock on the writes *r* is trying to read from.

293 294 If: the read-request-condition holds (see previous transition).

#### Simplifying ARM Concurrency

Action: let wss be the write slices from memory covering the unsatisfied slices of r, and apply 295 the action of Satisfy memory read by forwarding from writes. In addition, if i is a successful 296 297 load-exclusive, union wss with the set of write slices r is mapped to in the exclusives map.

Note that Satisfy memory read by forwarding from writes might leave some slices of the read request unsatisfied. Satisfy memory read from memory, on the other hand, will always satisfy all the unsatisfied slices of the read request.

302 Complete load instruction (when all its reads are entirely satisfied) A load instruction in-303 stance i in state PENDING\_MEM\_READS read\_cont can be completed (not to be confused with finished) 304 if all the read requests *i.mem\_reads* are entirely satisfied (i.e., there are no unsatisfied slices).

Action: update the state of i to PLAIN (read\_cont (memory\_value)), where memory\_value is assembled from all the write slices that satisfied *i.mem.reads*.

307 **Guarantee the success of store-exclusive** A store-exclusive instruction instance *i* with next 308 state *Excl\_res(res\_cont)* can be guaranteed to succeed if: 309

- (1) the store-exclusive has not been made to fail (as recorded in *i.successful\_exclusive*);
  - (2) assuming *i* is successful, it can be paired with a load-exclusive i' (see §1.2); and
- 311 (3) if i' has already been satisfied (not necessarily entirely), let wss be the set of propagated write 312 slices i' has read from, then, no slice in wss has been overwritten (in memory) by a write from 313 this thread, and no other successful load-exclusive from a different thread has an outstanding 314 lock on a write slice from wss. 315

#### Action:

298

299

300

301

305

306

310

316 317

320

321

325

326

- (1) record in *i.successful\_exclusive* that the store-exclusive will be successful;
- 318 (2) if i' has already been satisfied, union wss with the set of write slices the read request of i' is 319 mapped to in the exclusives map, where wss is as above; and
  - (3) update the state of *i* to *PLAIN* (*res\_cont* (*true*)).

Make a store-exclusive fail A store-exclusive instruction instance *i* with next state *Excl\_res(res\_con-*322 tinuation) can be made to fail if the store-exclusive has not been guaranteed to succeed (as recorded 323 in *i.successful\_exclusive*) Action: 324

- (1) record in *i.successful\_exclusive* that the store-exclusive was made to fail; and
- (2) update the state of *i* to *PLAIN* (*res\_cont* (*false*)).

327 Note the promise-success transition is enabled before the store-exclusive commits, and we do not 328 require it to have a fully-determined address or to be non-restartable. As a result, a store-exclusive 329 that has already promised its success might be restarted. Since other instructions may rely on 330 its promise, the restart will not affect the value of *i.successful\_exclusive*. Instead, when the store-331 exclusive is restarted it will take the same promise/failure transition as before its restart - based 332 on the value of *i.successful\_exclusive*. 333

Initiate memory writes of store instruction, with their footprints An instruction instance 334 i with next state WRITE\_EA(write\_kind, address, size, next\_state') can announce its pending write 335 footprint. Action: 336

- (1) construct the appropriate write requests:
  - if *address* is aligned to *size* then *ws* is a single write request of *size* bytes to *address*;
  - otherwise ws is a set of size write requests, each of one byte size, to the addresses address. . . address+size-1.
- (2) set *i.mem\_writes* to ws; and 341
- (3) update the state of *i* to *PLAIN next\_state'*. 342
- 343

337

338

339

:8 Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell

Note that at this point the write requests do not yet have their values. This state allows nonoverlapping po-following writes to propagate.

Instantiate memory write values of store instruction An instruction instance *i* with next state
 *WRITE\_MEMV(memory\_value, write\_cont)* can initiate the corresponding memory writes. Action:

- (1) split *memory\_value* between the write requests *i.mem\_writes*; and
- (2) update the state of *i* to *PENDING\_MEM\_WRITES write\_cont*.

Commit store instruction For an uncommitted store instruction *i* in state *PenDING\_MEM\_WRITES write\_cont*, *i* can commit if:

- (1) *i* has fully determined data (i.e., the register reads cannot change, see \$1.5);
- (2) all po-previous conditional branch instructions are finished;
- (3) all po-previous dmb sy and isb instructions are finished;
- (4) [ dmb ld/ dmb st ] all po-previous dmb ld instructions are finished;
  - (5) [<sup>release/</sup>] all po-previous load-acquire instructions are finished;
- (6) all po-previous store instructions, except for store-exclusives that failed, have initiated and so have non-empty *mem\_writes*;
  - (7)  $\begin{bmatrix} release/\\ acquire \end{bmatrix}$  if *i* is a store-release, all po-previous memory access instructions are finished;
  - (8) [dmb ld/dmb st ] all po-previous dmb st instructions are finished;
- (9) all po-previous memory access instructions have a fully determined memory footprint; and
- (10) all po-previous load instructions have initiated and so have non-empty mem\_reads.

Action: record i as committed.

**Propagate memory write** For an instruction *i* in state *Pending\_mem\_writes write\_cont*, and an unpropagated write, *w* in *i.mem\_writes*, the write can be propagated if:

- (1) all memory writes of po-previous store instructions that overlap *w* have already propagated
- (2) all read requests of po-previous load instructions that overlap with w have already been satisfied, and the load instruction is non-restartable (see §1.5);
- (3) all read requests satisfied by forwarding *w* are entirely satisfied; and
- (4) [exclusive] no successful load-exclusive from a different thread has an outstanding lock on a write slice that overlaps with *w*.

### Action:

349

350

354

355

356

357

358

359

360

361

362

363

364

365 366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

- (1) restart any speculative instructions which have violated coherence as a result of this, i.e., for every non-finished instruction *i*' po-after *i* and every read request *r*' of *i*' that was satisfied from *wss*', if there exists a write slice *ws*' in *wss*' that overlaps with *w* and is not from *w*, and *ws*' is not from a po-successor of *i*, restart *i*' and its data-flow dependents;
- (2) record *w* as propagated;
- (3) update the memory with *w*; and
- (4) [exclusive] for every successful load-exclusive that has read from w (by forwarding), add the slices of w this load-exclusive read from to the set of write slices the read request of the load-exclusive is mapped to in the exclusives map.

Complete store instruction (when its writes are all propagated) A store instruction *i* in state
 *PENDING\_MEM\_WRITES write\_cont*, for which all the memory writes in *i.mem\_writes* have been
 propagated, can be completed. Action: update the state of *i* to *PLAIN(write\_cont(true))*.

# **Commit barrier** A barrier instruction *i* in state *PLAIN next\_state* where *next\_state* is *BARRIER(barrier\_kind, next\_state')* can be committed if:

(1) all po-previous conditional branch instructions are finished;

, Vol. 1, No. 1, Article . Publication date: October 2017.

#### Simplifying ARM Concurrency

394

398

399

403

404

412

413 414

415

416

417

418

419

420

425

427

428

434

435

436

437

438

439

- (2)  $\begin{bmatrix} \frac{dmb \ ld}{dmb \ s} \end{bmatrix}$  if *i* is a dmb ld instruction, all po-previous load instructions are finished;
  - (3)  $\begin{bmatrix} dmb \, ld' \\ dmb \, st \end{bmatrix}$  if *i* is a dmb st instruction, all po-previous store instructions are finished;
- <sup>395</sup> (4) all po-previous dmb sy barriers are finished;
- (5) if *i* is an isb instruction, all po-previous memory access instructions have fully determined
   memory footprints; and
  - (6) if i is a dmb sy instruction, all po-previous memory access instructions and barriers are finished.
- Note that this differs from the previous Flowing and POP models: there, barriers committed in
   program-order and potentially re-ordered in the storage subsystem. Here the thread subsystem is
   weakened to subsume the re-ordering of Flowing's (and POP's) storage subsystem.
  - Action: update the state of *i* to *PLAIN next\_state'*.

**Register read** An instruction instance *i* with next state *ReAD\_REG(reg\_name, read\_cont)* can do a register read if every instruction instance that it needs to read from has already performed the expected register write.

Let *read\_sources* include, for each bit of *reg\_name*, the write to that bit by the most recent (in program order) instruction instance that can write to that bit, if any. If there is no such instruction, the source is the initial register value from *initial\_register\_state*. Let *register\_value* be the assembled value from *read\_sources*. Action:

- (1) add reg\_name to i.reg\_reads with read\_sources and register\_value; and
- (2) update the state of *i* to *PLAIN* (*read\_cont*(*register\_value*)).

**Register write** An instruction instance *i* with next state *WRITE\_REG(reg\_name, register\_value, next\_state')* can do the register write. Action:

- (1) add reg\_name to i.reg\_writes with write\_deps and register\_value; and
- (2) update the state of *i* to *PLAIN next\_state'*.

where *write\_deps* is the set of all *read\_sources* from *i.reg\_reads* and a flag that is set to true if *i* is a load instruction that has already been entirely satisfied.

421 Pseudocode internal step An instruction instance *i* with next state *INTERNAL(next\_state')* can do
 422 that pseudocode-internal step. Action: update the state of *i* to *PLAIN next\_state'*.
 423

Finish instruction A non-finished instruction *i* with next state DONE can be finished if:

- (1) if i is a load instruction:
- (a) all po-previous dmb sy and isb instructions are finished;
  - **(b)**  $\begin{bmatrix} dmb \ ld \\ dmb \ st \end{bmatrix}$  all po-previous dmb ld instructions are finished;
  - (c) [<sup>release/</sup>] all po-previous load-acquire instructions are finished;
- (d) it is guaranteed that the values read by the read requests of *i* will not cause coherence violations, i.e., for any po-previous instruction instance *i'*, let *cfp* be the combined footprint of propagated writes from store instructions po-between *i* and *i'* and fixed writes that were forwarded to *i* from store instructions po-between *i* and *i'* including *i'*, and let *cfp'* be the complement of *cfp* in the memory footprint of *i*. If *cfp'* is not empty:
  - (i) i' has a fully determined memory footprint;
  - (ii) i' has no unpropagated memory write that overlaps with cfp'; and
  - (iii) If i' is a load with a memory footprint that overlaps with cfp', then all the read requests of i' that overlap with cfp' are satisfied and i' can not be restarted (see §1.5).
  - Here a memory write is called fixed if it is the write of a non-exclusive-store instruction that has fully determined data.
- (e)  $\begin{bmatrix} release/\\ acquire \end{bmatrix}$  if *i* is a load-acquire, all po-previous store-release instructions are finished;
- 441

#### :10 Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell

- (2) *i* has fully determined data; and
  - (3) all po-previous conditional branches are finished.

### Action:

442

443 444

445

446

447

448 449

450 451

452

453

454

455

456 457

458

459

460

461

462

463 464

465

466

467

468

469

470

471 472

477

481

483

484

485

486

487

488

489 490

- (1) if i is a branch instruction, discard any untaken path of execution, i.e., remove any (nonfinished) instructions that are not reachable by the branch taken in *instruction\_tree*; and
- (2) record the instruction as finished, i.e., set finished to true.

#### **Auxiliary Definitions** 1.5

Fully determined An instruction is said to have fully determined footprint if the memory reads feeding into its footprint are finished: A register write w, of instruction i, with the associated write\_deps from *i.reg\_writes* is said to be *fully determined* if one of the following conditions hold:

- (1) *i* is finished; or
- (2) the load flag in *write\_deps* is *false* and every register write in *write\_deps* is fully determined.

An instruction *i* is said to have *fully determined data* if all the register writes of *read\_sources* in *i.reg\_reads* are fully determined. An instruction *i* is said to have a *fully determined memory footprint* if all the register writes of *read\_sources* in *i.reg\_reads* that are associated with registers that feed into *i*'s memory access footprint are fully determined.

**Restart condition** To determine if instruction *i* might be restarted we use the following recursive condition: *i* is a non-finished instruction and at least one of the following holds,

- (1) there exists an unpropagated write w such that applying the action of the Propagate memory write transition to *s* will result in the restart of *i*;
- (2) there exists a non-finished load instruction l such that applying the action of the Satisfy memory read from memory transition to l will result in the restart of i (even if l is already entirely satisfied); or
- (3) there exists a non-finished instruction i' that might be restarted and i is in its data-flow dependents (including po-successors of load-acquire instructions).

#### Remarks about load/store exclusive instructions 1.6

473 The MCA ARMv8 architecture intends that the success bit of store exclusives does not introduce 474 dependencies, to allow (e.g.) hardware optimisations that dynamically replace load/store exclusive 475 pairs by atomic read-modify-write operations that can execute in the memory subsystem and there-476 fore be guaranteed to succeed. The ARMv8-axiomatic definition assumes all address/data/control dependencies to be from reads, not writes. In the operational model, matching this weakness has 478 proved to be difficult: it means the operational model must be able to promise the success or failure 479 of a store-exclusive instruction even before any of its registers reads/writes have been done, so 480 before the store-exclusive's address and data are available. The early success promises are the source of deadlocks in the operational model. To illustrate this consider, for example, the following 482 litmus test and a state where both a and e are satisfied and finished, and where b and f are not propagated. Then d can promise its success, locking memory location x, and h can promise its success, locking location y. But now there is a deadlock:

- For d to propagate c has to be committed and hence b propagated. But b cannot propagate since y is locked.
- For h to propagate g has to be committed and hence f propagated. But f cannot propagate since x is locked.



Similar situations arise from cases where there are other barriers or release/acquire instructions in-between the load and the store exclusive, or if the store exclusive has additional dependencies that the load exclusive does not have. These are cases that are not really intended to be supported by the architecture.

The model can also currently deadlock if a load and a store-exclusive are paired successfully but later turn out to have different addresses: if the store-exclusive promises its success before its address is known it locks the matched load-exclusive's memory location; when they later turns out to be to a different addresses it never unlocks it. This issue can be fixed, but it is currently still being clarified what exactly the architecturally allowed behaviour should be.

#### REFERENCES

Kathryn E. Gray, Gabriel Kerneis, Dominic Mulligan, Christopher Pulte, Susmit Sarkar, and Peter Sewell. 2015. An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors. In Proc. MICRO-48, the 48th Annual IEEE/ACM International Symposium on Microarchitecture.