These Lectures

Part 1: Multicore Semantics: the concurrency of multiprocessors and programming languages

What concurrency behaviour can you rely on? How can we specify it precisely in semantic models? Linking to usage, microarchitecture, experiment, and semantics. x86, IBM POWER, ARM, Java, C/C++11

Part 2: Multicore Programming: Concurrent algorithms (Tim Harris, Oracle)

Concurrent programming: simple algorithms, correctness criteria, advanced synchronisation patterns, transactional memory.
Multicore Semantics

- Introduction
- Sequential Consistency
- x86 and the x86-TSO abstract machine
- x86 spinlock example
- Architectures
- Tests and Testing
- ...

## Implementing Simple Mutual Exclusion, Naively

Initial state: $x=0$ and $y=0$

<table>
<thead>
<tr>
<th>Thread 0</th>
<th>Thread 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>$x=1$</td>
<td>$y=1$</td>
</tr>
<tr>
<td>if ($y==0$) { ...critical section... }</td>
<td>if ($x==0$) { ...critical section... }</td>
</tr>
</tbody>
</table>

- repeated use?
- thread symmetry (same code on each thread)?
- performance?
- fairness?
- deadlock, global lock ordering, compositionality?
Let’s Try...

./runSB.sh
Fundamental Question

What is the *behaviour of memory*?

...at the *programmer abstraction*

...when *observed by concurrent code*
The abstraction of a *memory* goes back some time...
The calculating part of the engine may be divided into two portions

1st The Mill in which all operations are performed

2nd The Store in which all the numbers are originally placed and to which the numbers computed by the engine are returned.

[Dec 1837, On the Mathematical Powers of the Calculating Engine, Charles Babbage]
The Golden Age, (1837–) 1945–1962
1962: First(?) Multiprocessor

BURROUGHS D825, 1962

“Outstanding features include truly modular hardware with parallel processing throughout”

“FUTURE PLANS
The complement of compiling languages is to be expanded.”
... with Shared-Memory Concurrency

[Figure: threads Thread$_1$ ... Thread$_n$, each issuing writes (W) and reads (R) to a single Shared Memory]
Multiprocessors, 1962–now

Niche multiprocessors since 1962

IBM System 370/158MP in 1972

Mass-market since 2005 (Intel Core 2 Duo).
Multiprocessors, 2015

Intel Xeon E7-8895 v3
36 hardware threads

Commonly 4 or 8 hardware threads.

IBM Power 8 server
(up to 1536 hardware threads)

Oracle Sparc, Intel Itanium
Why now?

Exponential increases in transistor counts continuing — but not per-core performance

- energy efficiency (computation per Watt)
- limits of instruction-level parallelism

Concurrency finally mainstream — but how to understand, design, and program concurrent systems? Still very hard.
Concurrency everywhere

At many scales:
- intra-core
- multicore processors ← our focus
- ...and programming languages ← our focus
- GPU
- datacenter-scale
- internet-scale

explicit message-passing vs shared memory abstractions
Sequential Consistency
Our first model: Sequential Consistency

Multiple threads acting on a *sequentially consistent* (SC) shared memory:

*the result of any execution is the same as if the operations of all the processors were executed in some sequential order, respecting the order specified by the program*  

[Lamport, 1979]
Defining an SC Semantics: SC memory

Define the state of an SC memory $M$ to be a function from addresses $x$ to integers $n$, with $M_0$ mapping all to 0. Let $t$ range over thread ids.

Describe the interactions between memory and threads with labels:

\[
\begin{array}{rcll}
\text{label, } l & ::= & & \text{label} \\
 & | & t:\text{W } x=n & \text{write} \\
 & | & t:\text{R } x=n & \text{read} \\
 & | & t:\tau & \text{internal action (tau)}
\end{array}
\]

Define the behaviour of memory as a labelled transition system (LTS): the least set of $(M, l, M')$ triples satisfying these rules.

\[
M \xrightarrow{l} M' \qquad \text{memory } M \text{ does } l \text{ to become } M'
\]

\[
\frac{M(x) = n}{M \xrightarrow{t:\text{R } x=n} M} \quad \text{M_READ}
\]

\[
\frac{}{M \xrightarrow{t:\text{W } x=n} M \oplus (x \mapsto n)} \quad \text{M_WRITE}
\]
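The two memory rules can be read directly as a step function. The following is a minimal Python sketch, not part of the original slides: the tuple encoding of labels and the names `mem_step` and `is_trace` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: a label is (thread_id, kind, address, value),
# where kind is "W" (write) or "R" (read).
Label = Tuple[int, str, str, int]
Memory = Dict[str, int]


def mem_step(M: Memory, label: Label) -> Optional[Memory]:
    """Return M' if M --label--> M' is derivable from the rules, else None."""
    _t, kind, x, n = label
    if kind == "R":
        # M_READ: permitted only when M(x) = n; the memory is unchanged.
        # Addresses absent from the dict hold 0, as in M_0.
        return M if M.get(x, 0) == n else None
    if kind == "W":
        # M_WRITE: always permitted; the result is M updated with x -> n.
        return {**M, x: n}
    return None  # the memory itself has no internal (tau) transitions


def is_trace(labels: List[Label]) -> bool:
    """Check whether the list of labels is a trace of the initial memory M_0."""
    M: Optional[Memory] = {}
    for label in labels:
        M = mem_step(M, label)
        if M is None:
            return False
    return True
```

For example, `is_trace([(0, "W", "x", 1), (1, "R", "x", 1)])` holds, while replacing the read value by 0 gives a sequence that is not a trace.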
SC, said differently

In any trace \( \vec{l} \in \text{traces}(M_0) \) of \( M_0 \), i.e. any list of read and write events:

\[
l_1, l_2, \ldots l_k
\]

such that there are some \( M_1, \ldots, M_k \) with

\[
M_0 \xrightarrow{l_1} M_1 \xrightarrow{l_2} M_2 \ldots M_k,
\]

each read reads from the value of the most recent preceding write to the same address, or from the initial state if there is no such write.
SC, said differently

Making that precise, define an alternative SC memory state $L$ to be a list of labels, most recent at the head. Define $\text{lookup}$ by:

$$\begin{array}{lcll}
\text{lookup } x \ \text{nil} & = & 0 & \text{initial state value} \\
\text{lookup } x \ ((t:\text{W } x'=n) :: L) & = & n & \text{if } x = x' \\
\text{lookup } x \ (l :: L) & = & \text{lookup } x \ L & \text{otherwise}
\end{array}$$

$$L \xrightarrow{l} L' \qquad \text{list memory } L \text{ does } l \text{ to become } L'$$

$$\frac{\text{lookup } x \ L = n}{L \xrightarrow{t:\text{R } x=n} (t:\text{R } x=n) :: L} \quad \text{L_READ}$$

$$\frac{}{L \xrightarrow{t:\text{W } x=n} (t:\text{W } x=n) :: L} \quad \text{L_WRITE}$$

Theorem 1 (?) $M_0$ and nil have the same traces.
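The conjectured theorem can be sanity-checked on small cases, which is not a proof. The sketch below, with illustrative names (`map_step`, `list_step`, `agree_up_to`) that are not from the slides, implements both memories and compares them on every label sequence up to a given length over two addresses and two values.

```python
from itertools import product


def map_step(M, label):
    """Function-style SC memory: M_READ and M_WRITE."""
    _t, kind, x, n = label
    if kind == "R":
        return M if M.get(x, 0) == n else None
    return {**M, x: n}


def lookup(x, L):
    """Value of x in list memory L, most recent label at the head."""
    for (_t, kind, x2, n) in L:
        if kind == "W" and x2 == x:
            return n
    return 0  # lookup x nil = 0, the initial state value


def list_step(L, label):
    """List-style SC memory: L_READ and L_WRITE both cons the label."""
    _t, kind, x, n = label
    if kind == "R" and lookup(x, L) != n:
        return None
    return [label] + L


def is_trace(step, init, labels):
    state = init
    for label in labels:
        state = step(state, label)
        if state is None:
            return False
    return True


def agree_up_to(k):
    """Bounded check: do both memories accept exactly the same sequences?"""
    labels = [(0, kind, x, n) for kind in "RW" for x in "xy" for n in (0, 1)]
    return all(
        is_trace(map_step, {}, seq) == is_trace(list_step, [], seq)
        for length in range(k + 1)
        for seq in product(labels, repeat=length)
    )
```

Calling `agree_up_to(3)` compares the two models on 585 sequences. Agreement on a bounded set is only evidence for the theorem; a proof would relate the two state representations by induction on traces.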
Extensional behaviour vs intensional structure

Extensionally, these models have the same behaviour.

Intensionally, they have rather different structure – and neither is structured anything like a real hardware implementation.

In defining a model, we’re principally concerned with the extensional behaviour: we want to precisely describe the set of allowed behaviours, as clearly as possible. But (see later) sometimes the intensional structure matters too, and we may also care about computability, performance, provability,...
SC, glued onto a tiny PL semantics

In those memory models:

- the events within the trace of each thread were implicitly presumed to be ordered consistently with the *program order* (a control-flow unfolding) of that thread, and

- the values of writes were implicitly presumed to be consistent with the thread-local computation specified by the program.

To make these things precise, we can combine the memory model with a threadwise semantics for a tiny concurrent language....
A Tiny Language: Design Choices

- A concurrent imperative language.
- Distinguish syntactically between (thread-local) “registers” and “memory”.
- Include explicit parallel threads, with thread ids.
- Define an operational semantics that exposes the potential memory events (reads and writes of a value at a memory address) of a thread or process as labelled transitions; this lets us glue it on to an SC or TSO memory.
- Keep the register behaviour internal. Use an explicit register state rather than substitution to highlight the relationship to the memory semantics.
- Otherwise, as simple as possible: just enough computational power to write litmus tests (no loops, no thread creation,
A Tiny Language: Example

Thread 0: \( x = 1 \); \( r_0 = y \)
Thread 1: \( y = 1 \); \( r_1 = x \)

and, with the initial register state \( R_0 \) for each thread, and an initial SC memory state:

\[
\langle t_0 : \langle x = 1 ; r_0 = y, R_0 \rangle | t_1 : \langle y = 1 ; r_1 = x, R_0 \rangle, \{ x \mapsto 0, y \mapsto 0 \} \rangle
\]
Let $R$ range over register states, functions from register names $r$ to integers $n$. Write $R_0$ for the initial register state in which all registers hold 0.
A Tiny Language: Syntax

statement, \( s ::= \)

\begin{align*}
| & r = e & \text{compute register value} \\
| & r = x & \text{read from memory} \\
| & x = e & \text{write to memory} \\
| & \text{if } (e == n) s_1 \text{ else } s_2 & \text{conditional} \\
| & s_1; s_2 & \text{sequential composition} \\
| & \text{skip} & \text{empty statement} \\
\end{align*}

thread, \( T ::= \)

\begin{align*}
| & t : \langle s, R \rangle & \text{id, statement, reg state} \\
\end{align*}

process, \( P ::= \)

\begin{align*}
| & T & \text{thread} \\
| & P | P' & \text{parallel composition} \\
\end{align*}
That was just the syntax — now we’ll be precise about the permitted behaviours of programs
Defining the Semantics: expressions

\[ \langle e, R \rangle \rightarrow n \qquad \text{in register state } R \text{, } e \text{ evaluates to } n \]

\[ \frac{}{\langle n, R \rangle \rightarrow n} \quad \text{E_INT} \]

\[ \frac{R(r) = n}{\langle r, R \rangle \rightarrow n} \quad \text{E_REG} \]

\[ \frac{\langle e, R \rangle \rightarrow n \qquad \langle e', R \rangle \rightarrow n' \qquad n'' = n + n'}{\langle e + e', R \rangle \rightarrow n''} \quad \text{E_PLUS} \]

These expressions read the register state, but do not mutate registers or memory (as you can see just from the form of the judgement).
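As a minimal sketch of these three rules in Python (the encoding is an assumption for illustration: integers for constants, strings for register names, and `("+", e1, e2)` tuples for sums):

```python
def eval_expr(e, R):
    """Evaluate expression e in register state R (a dict from names to ints)."""
    if isinstance(e, int):
        return e  # E_INT: a constant evaluates to itself
    if isinstance(e, str):
        return R.get(e, 0)  # E_REG: look up r; unset registers hold 0, as in R_0
    op, e1, e2 = e
    if op == "+":
        return eval_expr(e1, R) + eval_expr(e2, R)  # E_PLUS
    raise ValueError(f"unknown operator: {op!r}")
```

The function returns a value and never modifies `R`, mirroring the fact that the judgement has no resulting register state.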
Defining the Semantics: threads (1/2)

\[ T \xrightarrow{l} T' \qquad \text{thread } T \text{ does } l \text{ to reach } T' \]

\[
\frac{}{t : \langle r = x, R \rangle \xrightarrow{t:\text{R } x=n} t : \langle \text{skip}, R \oplus (r \mapsto n) \rangle} \quad \text{T_READ}
\]

\[
\frac{\langle e, R \rangle \rightarrow n}{t : \langle x = e, R \rangle \xrightarrow{t:\text{W } x=n} t : \langle \text{skip}, R \rangle} \quad \text{T_WRITE}
\]

\[
\frac{\langle e, R \rangle \rightarrow n}{t : \langle r = e, R \rangle \xrightarrow{t:\tau} t : \langle \text{skip}, R \oplus (r \mapsto n) \rangle} \quad \text{T_COMPUTE}
\]

Register writes mutate the thread's register state \( R \); memory reads and writes are exposed as labelled transitions, with read values unconstrained.
Defining the Semantics: threads (2/2)

\[ T \xrightarrow{l} T' \qquad \text{thread } T \text{ does } l \text{ to reach } T' \]

\[
\frac{\langle e, R \rangle \rightarrow n' \qquad n = n'}{t : \langle \text{if } (e == n)\ s_1 \text{ else } s_2, R \rangle \xrightarrow{t:\tau} t : \langle s_1, R \rangle} \quad \text{T_COND1}
\]

\[
\frac{\langle e, R \rangle \rightarrow n' \qquad n \neq n'}{t : \langle \text{if } (e == n)\ s_1 \text{ else } s_2, R \rangle \xrightarrow{t:\tau} t : \langle s_2, R \rangle} \quad \text{T_COND2}
\]

\[
\frac{}{t : \langle \text{skip}; s_2, R \rangle \xrightarrow{t:\tau} t : \langle s_2, R \rangle} \quad \text{T_SEQ_SKIP}
\]

\[
\frac{t : \langle s_1, R \rangle \xrightarrow{l} t : \langle s'_1, R' \rangle}{t : \langle s_1; s_2, R \rangle \xrightarrow{l} t : \langle s'_1; s_2, R' \rangle} \quad \text{T_SEQ_CONTEXT}
\]
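The thread rules can be sketched as a function that lists every transition a thread can take. This is an illustrative Python encoding, not the lecture's own: statements are tuples, labels omit the thread id for brevity, and because read values are unconstrained at this level the caller supplies the candidate values in `read_values`.

```python
def eval_expr(e, R):
    """Expression evaluation: ints, register names, and ("+", e1, e2)."""
    if isinstance(e, int):
        return e
    if isinstance(e, str):
        return R.get(e, 0)
    _op, e1, e2 = e
    return eval_expr(e1, R) + eval_expr(e2, R)


SKIP = ("skip",)
TAU = ("tau",)


def thread_steps(s, R, read_values=(0, 1)):
    """Return all (label, s', R') with <s, R> --label--> <s', R'>."""
    kind = s[0]
    if kind == "read":  # T_READ: one transition per candidate value n
        _, r, x = s
        return [(("R", x, n), SKIP, {**R, r: n}) for n in read_values]
    if kind == "write":  # T_WRITE: the written value is computed from e
        _, x, e = s
        return [(("W", x, eval_expr(e, R)), SKIP, R)]
    if kind == "reg":  # T_COMPUTE: internal step updating register r
        _, r, e = s
        return [(TAU, SKIP, {**R, r: eval_expr(e, R)})]
    if kind == "if":  # T_COND1 / T_COND2
        _, e, n, s1, s2 = s
        return [(TAU, s1 if eval_expr(e, R) == n else s2, R)]
    if kind == "seq":
        _, s1, s2 = s
        if s1 == SKIP:  # T_SEQ_SKIP
            return [(TAU, s2, R)]
        # T_SEQ_CONTEXT: any step of s1 is a step of s1; s2
        return [(l, ("seq", s1p, s2), Rp)
                for (l, s1p, Rp) in thread_steps(s1, R, read_values)]
    return []  # skip has no transitions
```

For instance, `thread_steps(("read", "r0", "x"), {}, read_values=(7, 23))` yields two transitions, one per candidate value; it is the memory that later decides which of them can actually happen.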
Example thread transitions

\[ t_0 : \langle r_1 = r_0 + 1, R_0 \rangle \xrightarrow{t_0:\tau} t_0 : \langle \text{skip}, R_0 \oplus (r_1 \mapsto 1) \rangle \]

\[ t_0 : \langle r_0 = 3; r_1 = r_0, R_0 \rangle \xrightarrow{t_0:\tau} t_0 : \langle \text{skip}; r_1 = r_0, R_0 \oplus (r_0 \mapsto 3) \rangle \]
\[ t_0 : \langle r_1 = r_0, R_0 \oplus (r_0 \mapsto 3) \rangle \xrightarrow{t_0:\tau} t_0 : \langle \text{skip}, R_0 \oplus (r_0 \mapsto 3, r_1 \mapsto 3) \rangle \]

The transitions are those derivable by trees of instantiations of the rules, e.g.

\[
\dfrac{
  \dfrac{
    \dfrac{}{\langle 3, R_0 \rangle \rightarrow 3}\ \text{E_INT}
  }{
    t_0 : \langle r_0 = 3, R_0 \rangle \xrightarrow{t_0:\tau} t_0 : \langle \text{skip}, R_0 \oplus (r_0 \mapsto 3) \rangle
  }\ \text{T_COMPUTE}
}{
  t_0 : \langle r_0 = 3; r_1 = r_0, R_0 \rangle \xrightarrow{t_0:\tau} t_0 : \langle \text{skip}; r_1 = r_0, R_0 \oplus (r_0 \mapsto 3) \rangle
}\ \text{T_SEQ_CONTEXT}
\]

where that instance of T_SEQ_CONTEXT has instantiation:

\[
\begin{array}{lll}
t \mapsto t_0 & R \mapsto R_0 & l \mapsto t_0:\tau \\
s_1 \mapsto r_0 = 3 & s'_1 \mapsto \text{skip} & \\
s_2 \mapsto r_1 = r_0 & R' \mapsto R_0 \oplus (r_0 \mapsto 3) &
\end{array}
\]
Example thread transitions

\[ t_0 : \langle r_1 = r_0 + 1, R_0 \rangle \xrightarrow{t_0:\tau} t_0 : \langle \text{skip}, R_0 \oplus (r_1 \mapsto 1) \rangle \]

\[ t_0 : \langle r_0 = 3; r_1 = r_0, R_0 \rangle \xrightarrow{t_0:\tau} t_0 : \langle \text{skip}; r_1 = r_0, R_0 \oplus (r_0 \mapsto 3) \rangle \]
\[ \xrightarrow{t_0:\tau} t_0 : \langle r_1 = r_0, R_0 \oplus (r_0 \mapsto 3) \rangle \]
\[ \xrightarrow{t_0:\tau} t_0 : \langle \text{skip}, R_0 \oplus (r_0 \mapsto 3, r_1 \mapsto 3) \rangle \]

\[ t_0 : \langle x = 3, R_0 \rangle \xrightarrow{t_0:W \ x = 3} t_0 : \langle \text{skip}, R_0 \rangle \]

\[ t_0 : \langle r_0 = x, R_0 \rangle \]
\[ \xrightarrow{t_0:R \ x = 7} t_0 : \langle \text{skip}, R_0 \oplus (r_0 \mapsto 7) \rangle \]
\[ \xrightarrow{t_0:R \ x = 23} t_0 : \langle \text{skip}, R_0 \oplus (r_0 \mapsto 23) \rangle \]

\[ t_0 : \langle x = 3; r_0 = x, R_0 \rangle \]
\[ \xrightarrow{t_0:W \ x = 3} t_0 : \langle \text{skip}; r_0 = x, R_0 \rangle \]
\[ \xrightarrow{t_0:\tau} t_0 : \langle r_0 = x, R_0 \rangle \]
\[ \xrightarrow{t_0:R \ x = 7} t_0 : \langle \text{skip}, R_0 \oplus (r_0 \mapsto 7) \rangle \]
Defining the Semantics: lifting to processes

Remember the process syntax:

\[
\begin{array}{rcll}
\text{process, } P & ::= & & \text{process} \\
 & | & T & \text{thread} \\
 & | & P | P' & \text{parallel composition}
\end{array}
\]

\[ P \xrightarrow{l} P' \qquad \text{process } P \text{ does } l \text{ to become } P' \]

\[
\frac{T \xrightarrow{l} T'}{T \xrightarrow{l} T'} \quad \text{P_THREAD}
\]

\[
\frac{P_1 \xrightarrow{l} P_1'}{P_1 | P_2 \xrightarrow{l} P_1' | P_2} \quad \text{P_PAR_CONTEXT_LEFT}
\]

\[
\frac{P_2 \xrightarrow{l} P_2'}{P_1 | P_2 \xrightarrow{l} P_1 | P_2'} \quad \text{P_PAR_CONTEXT_RIGHT}
\]

Free interleaving of the transitions of each thread.
Defining an SC Semantics: whole-system states

An SC system state $S = \langle P, M \rangle$ is a pair of a process and an SC memory.

$$S \xrightarrow{l} S' \quad \text{system } S \text{ does } l \text{ to become } S'$$

$$\frac{P \xrightarrow{l} P' \qquad M \xrightarrow{l} M'}{\langle P, M \rangle \xrightarrow{l} \langle P', M' \rangle} \quad \text{S_ACCESS}$$

$$\frac{P \xrightarrow{t:\tau} P'}{\langle P, M \rangle \xrightarrow{t:\tau} \langle P', M \rangle} \quad \text{S_INTERNAL}$$

The rules force synchronisation between the process and the memory, constraining the values of the process’s read transitions to those the memory permits, and the memory’s write transitions to those the process does (threads can also freely do internal transitions).
Example system transitions: SC Interleaving

All threads can read and write the shared memory.

Threads execute asynchronously – the semantics allows any interleaving of the thread transitions. Here there are two:

\[
\langle t_1 : \langle x = 1, R_0 \rangle \mid t_2 : \langle x = 2, R_0 \rangle, \{ x \mapsto 0 \} \rangle
\]

\[
\xrightarrow{t_1:\text{W } x=1} \langle t_1 : \langle \text{skip}, R_0 \rangle \mid t_2 : \langle x = 2, R_0 \rangle, \{ x \mapsto 1 \} \rangle
\]

\[
\xrightarrow{t_2:\text{W } x=2} \langle t_1 : \langle \text{skip}, R_0 \rangle \mid t_2 : \langle \text{skip}, R_0 \rangle, \{ x \mapsto 2 \} \rangle
\]

and, symmetrically, the interleaving in which $t_2$ writes first, which ends with $\{ x \mapsto 1 \}$.

But each interleaving has a linear order of reads and writes to the memory. Cf. Lamport’s

“the result of any execution is the same as if the operations of all the processors were executed in some sequential order, respecting the order specified by the program”
Back to the naive mutual exclusion example

Initial state: \(x=0\) and \(y=0\)

<table>
<thead>
<tr>
<th>Thread 0</th>
<th>Thread 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>\(x = 1\);</td>
<td>\(y = 1\);</td>
</tr>
<tr>
<td>\(r_0 = y\)</td>
<td>\(r_1 = x\)</td>
</tr>
</tbody>
</table>

Allowed? Thread 0’s \(r_0 = 0\) ∧ Thread 1’s \(r_1 = 0\)

In other words: is there a trace

\[
\langle t_0 : \langle x = 1; r_0 = y, R_0 \rangle \mid t_1 : \langle y = 1; r_1 = x, R_0 \rangle, \{x \mapsto 0, y \mapsto 0\} \rangle
\]

\[
\xrightarrow{l_1} \ldots \xrightarrow{l_n}
\]

\[
\langle t_0 : \langle \text{skip}, R'_0 \rangle \mid t_1 : \langle \text{skip}, R'_1 \rangle, M' \rangle
\]

such that \(R'_0(r_0) = 0\) and \(R'_1(r_1) = 0\)?

In this semantics: no
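The "no" can be checked mechanically by enumerating every SC interleaving. The sketch below is a simplification rather than the full tiny-language semantics: each thread is assumed to be a straight-line list of constant writes `("W", x, n)` and reads `("R", r, x)`, and the names `sc_outcomes` and `SB` are chosen here for illustration.

```python
def sc_outcomes(threads, mem0):
    """Final register states reachable by freely interleaving the threads
    over a single sequentially consistent memory."""
    outcomes = set()

    def explore(pcs, mem, regs):
        finished = True
        for t, prog in enumerate(threads):
            if pcs[t] == len(prog):
                continue  # thread t has no instructions left
            finished = False
            op = prog[pcs[t]]
            new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
            if op[0] == "W":
                _, x, n = op
                explore(new_pcs, {**mem, x: n}, regs)  # write hits memory at once
            else:
                _, r, x = op
                explore(new_pcs, mem, {**regs, r: mem[x]})  # read sees current memory
        if finished:
            outcomes.add(tuple(sorted(regs.items())))

    explore((0,) * len(threads), dict(mem0), {})
    return outcomes


# The example above: x = 1; r0 = y  in parallel with  y = 1; r1 = x
SB = [
    [("W", "x", 1), ("R", "r0", "y")],
    [("W", "y", 1), ("R", "r1", "x")],
]
results = sc_outcomes(SB, {"x": 0, "y": 0})
both_zero = (("r0", 0), ("r1", 0)) in results
```

Under this enumeration `results` contains three outcomes, and the one with both registers equal to 0 is not among them: whichever write comes first in the interleaving is visible to the other thread's later read.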

But on x86 hardware, we saw it!
Options

1. the hardware is busted (either this instance or in general)
2. the program is bad
3. the model is wrong

SC is not a good model of x86 (or of Power, ARM, Sparc, Itanium...)

Even though most work on verification, and many programmers, assume SC...
Message Passing Example

In SC, message passing should work as expected:

<table>
<thead>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>data = 1</td>
<td>if (ready == 1)</td>
</tr>
<tr>
<td>ready = 1</td>
<td>print data</td>
</tr>
</tbody>
</table>

Under SC, the program either prints nothing or prints 1, and on bare-metal x86 that is what it does (but not on ARM or Power). What about Java or C?
Message Passing Example

<table>
<thead>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>data = 1</td>
<td>int r1 = data</td>
</tr>
<tr>
<td>ready = 1</td>
<td>if (ready == 1)</td>
</tr>
<tr>
<td></td>
<td>print data</td>
</tr>
</tbody>
</table>

Under SC the same holds here: the program either prints nothing or prints 1, regardless of other reads such as the extra read into r1.
Message Passing Example

<table>
<thead>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>data = 1</td>
<td>int r1 = data</td>
</tr>
<tr>
<td>ready = 1</td>
<td>if (ready == 1)</td>
</tr>
<tr>
<td></td>
<td>print r1</td>
</tr>
</tbody>
</table>

Under SC, the original program either prints nothing or prints 1. But common subexpression elimination (e.g. in HotSpot) can rewrite

\[
\text{print data} \quad \Rightarrow \quad \text{print r1}
\]

giving the program above.

So the compiled program can print 0
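The effect of the rewrite can be reproduced even with an SC memory, since the damage is done at compile time. The sketch below enumerates SC interleavings for both versions of Thread 2. It rests on modelling assumptions that are not in the slides: the conditional is replaced by an unconditional read of `ready` into a hypothetical register `rdy`, and the value "printed" is taken from the final registers only when `rdy` is 1.

```python
def sc_outcomes(threads, mem0):
    """Final register states reachable by interleaving straight-line threads
    of ("W", x, n) and ("R", r, x) instructions over an SC memory."""
    outcomes = set()

    def explore(pcs, mem, regs):
        finished = True
        for t, prog in enumerate(threads):
            if pcs[t] == len(prog):
                continue
            finished = False
            op = prog[pcs[t]]
            new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
            if op[0] == "W":
                _, x, n = op
                explore(new_pcs, {**mem, x: n}, regs)
            else:
                _, r, x = op
                explore(new_pcs, mem, {**regs, r: mem[x]})
        if finished:
            outcomes.add(tuple(sorted(regs.items())))

    explore((0,) * len(threads), dict(mem0), {})
    return outcomes


def printed(outcomes, reg):
    """Values of reg that get printed, i.e. in runs where ready was read as 1."""
    return {dict(o)[reg] for o in outcomes if dict(o)["rdy"] == 1}


mem0 = {"data": 0, "ready": 0}
writer = [("W", "data", 1), ("W", "ready", 1)]
# Original Thread 2: r1 = data; if (ready == 1) print data   (second read into r2)
original = [("R", "r1", "data"), ("R", "rdy", "ready"), ("R", "r2", "data")]
# After common subexpression elimination: the second read of data is replaced by r1
after_cse = [("R", "r1", "data"), ("R", "rdy", "ready")]

printed_original = printed(sc_outcomes([writer, original], mem0), "r2")
printed_cse = printed(sc_outcomes([writer, after_cse], mem0), "r1")
```

With these definitions `printed_original` contains only 1, while `printed_cse` also contains 0: the early read of `data` can happen before Thread 1 writes it, and the transformed program reuses that stale value.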
Similar Options

1. the hardware is busted
2. the compiler is busted
3. the program is bad
4. the model is wrong

**SC is also not a good model of C, C++, Java,...**

Even though most work on verification, and many programmers, assume SC...
What’s going on? Relaxed Memory

Multiprocessors and compilers incorporate many performance optimisations

(hierarchies of cache, load and store buffers, speculative execution, cache protocols, common subexpression elimination, etc., etc.)

These are:

- unobservable by single-threaded code
- sometimes observable by concurrent code

Upshot: they provide only various relaxed (or weakly consistent) memory models, not sequentially consistent memory.
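One of the optimisations listed above, store buffering, is already enough to make concurrent code observe non-SC behaviour. The following toy sketch assumes one FIFO store buffer per thread: writes are enqueued, a read takes the newest buffered write to that address by its own thread if there is one and otherwise reads memory, and buffered writes drain to memory at arbitrary later points. It is an illustration only, with hypothetical names, and is not a definition of any real architecture's model (there are no fences or locked instructions here).

```python
def buffered_outcomes(threads, mem0):
    """Final register states for straight-line threads of ("W", x, n) and
    ("R", r, x) instructions, with a FIFO store buffer per thread."""
    outcomes, seen = set(), set()

    def explore(pcs, bufs, mem, regs):
        key = (pcs, bufs, tuple(sorted(mem.items())), tuple(sorted(regs.items())))
        if key in seen:
            return
        seen.add(key)
        if all(pc == len(p) for pc, p in zip(pcs, threads)) and not any(bufs):
            outcomes.add(tuple(sorted(regs.items())))  # all done, buffers drained
            return
        for t, prog in enumerate(threads):
            if bufs[t]:
                # Drain the oldest buffered write of thread t to shared memory.
                (x, n), rest = bufs[t][0], bufs[t][1:]
                explore(pcs, bufs[:t] + (rest,) + bufs[t + 1:], {**mem, x: n}, regs)
            if pcs[t] < len(prog):
                op = prog[pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op[0] == "W":
                    _, x, n = op
                    new_buf = bufs[t] + ((x, n),)  # enqueue; memory not yet updated
                    explore(new_pcs, bufs[:t] + (new_buf,) + bufs[t + 1:], mem, regs)
                else:
                    _, r, x = op
                    own = [n for (x2, n) in bufs[t] if x2 == x]
                    value = own[-1] if own else mem[x]  # own newest write, else memory
                    explore(new_pcs, bufs, mem, {**regs, r: value})

    explore((0,) * len(threads), ((),) * len(threads), dict(mem0), {})
    return outcomes


SB = [
    [("W", "x", 1), ("R", "r0", "y")],
    [("W", "y", 1), ("R", "r1", "x")],
]
results = buffered_outcomes(SB, {"x": 0, "y": 0})
both_zero = (("r0", 0), ("r1", 0)) in results
```

Here `both_zero` is true: both writes can sit in their buffers while each thread reads the other variable from memory. A single thread running alone always reads its own buffered writes, which is why the optimisation is invisible to single-threaded code.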
New problem?

No: IBM System 370/158MP in 1972, already non-SC
But still a research question!

The mainstream architectures and languages are key interfaces

...but it’s been very unclear exactly how they behave.

More fundamentally: it’s been (and in significant ways still is) unclear how we can specify that precisely.

As soon as we can do that, we can build above it: explanation, testing, emulation, static/dynamic analysis, model-checking, proof-based verification, ...