CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 2 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory. L2.P3: THIS TIME Good morning, everybody. Last time we learned that concurrency exists both in parallel programming and in distributed computation. In contemporary computing, the principal difference between these two is whether there is cache-consistent shared memory available to the participants. But note that we don't have shared memory in the case of time-sharing processes with separate address spaces either, even though they run on the same machine. This part of the course is principally about parallel programming in a shared address space. The first problem that comes up is the race condition, where two threads are concurrently accessing the same shared data without really being aware of what the other is doing. And writing parallel programs is notoriously error-prone. Concurrency bugs are very hard to anticipate. This time, we'll take a quick look at parallel computer architecture and the relevant reference models. We'll look at the primitives provided in the instruction sets of older and contemporary machines for support of atomic operations. We'll look at how concurrent programs can be viewed from the perspective of automata theory, and see how that view can help us prove the correctness of programs, either using pen and paper or automatic tools such as model checkers. We'll look at one of the classic examples from concurrency theory, the Dining Philosophers. And finally, we'll look at alternatives to providing atomic operations in the instruction set, such as the Bakery Algorithm. L2.P4 General Comments Concurrency is essential in modern systems. Even from the earliest days of computing, the I/O devices tended to be mismatched in speed with the processor. For instance, you may have heard of a SPOOL file. Typically, a program generating data to be printed would not directly send it to the printer. It would instead send it to a file called a spool file. The SPOOL file would then be sent to the printer asynchronously at the rate that the printer operates, while the processor is freed up to move on to new work. And so the word spool is actually an acronym. It's short for "simultaneous peripheral operation offline". Even today, all I/O tends to run asynchronously using interrupts. And given that it's now much more energy efficient to build lots of computers rather than one faster one, parallel processing, either tightly-coupled or in a distributed system, is the mainstream computing paradigm. But this throws up lots of challenges. We need our systems to be safe and reliable, which mainly means being race-free. If there's a race, we get unreliable, random behaviour. We need mechanisms to enable one process to communicate and synchronise with another one reliably. We need to ensure liveness and not suffer deadlock problems, of which more in another lecture. Often it can be a good approach to write a single-threaded program that does what you want first, and only run it in a parallel or concurrent form later on, once you're happy with the basic functionality. But you should write the initial version of the program with the awareness that you are going to later run it in parallel, because this can have major consequences for the design of your data structures. And as always, where an existing library solves a problem, you should use it rather than implementing your own, unless you're implementing your own just to find out how to do it, or because you think you could do it better.
L2P5. Computer Architecture Let's just revise basic computer architecture: the architecture designed by John von Neumann, which everybody has used, more or less, ever since. There are four parts to a computer and we are only interested in two of them here. My top picture here shows the main memory and one CPU. That's two parts of a computer. Traditionally, the other two parts are the secondary storage, which is the disk drive(s), and the I/O devices. We'll need network I/O devices for distributed computing based on networks. But for the sake of this first half of the course, we're just interested in the processors and main memory. The memory contains the operating system and the application program. The CPU has two modes of operation, normal user-mode and kernel-mode, with different access rights accorded to them. In a time-sharing system, there will also be a timer which generates an interrupt at a regular rate such as 10 Hertz or 100 Hertz, or somewhere in between. In a real system with I/O devices we get interrupts from those as well. Both these forms of interrupt are very important in our analysis of concurrent programs and operating system design, because a thread is going to get preempted and something will be changed by the interrupt or another running thread before it's resumed. So in a sense, memory locations magically change their values under our feet. In a parallel processing environment, as we have in every laptop these days, we have multiple CPU cores as illustrated in the bottom picture. We can also have more than one memory bank, but that is a hardware detail that tends to be hidden from us. All of the memory is nominally in one flat, uniform physical address space. But each core will have a memory management unit which does virtual memory translation, so each process running on a core will see a different image and a subset of that main memory. L2P6 Operating System Behaviour Let's just revise how the operating system schedules tasks. By task I mean thread. The threads will belong to processes, with potentially multiple threads per process. On a uni-processor, only one thread can be running at any one time. The context of all the other threads must be saved by the operating system in thread control blocks (aka process control blocks). At a context swap, the operating system will save the processor state in the control block. The scheduler will decide what is the most worthy thread to run next. Its context will be loaded into the physical registers and the operating system will do essentially a 'return from interrupt': that thread will continue running without any knowledge that it had stopped. Here we see a classic thread state diagram. Although you might see other more complex diagrams, there are really only three states of interest. A new thread starts in the running mode and it continues until it either issues a system call that blocks it (that's a voluntary yield, in that the programmer made the system call and knew full well that the thread may well block at this point), or the thread may be preempted by an asynchronous interrupt from a device or the timer, in which case, in principle, it's still ready to run. So each of the suspended threads is either blocked or ready to run. For a blocked thread, the operating system needs to be able to tell cheaply what it's blocked on. We'll return to that in another lecture. The operating system will consider which is the most worthy amongst the ready-to-run threads when it decides what to run next, which is the SCHEDULING DECISION.
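To make that concrete, here is a minimal sketch in C of a thread control block and the three thread states just described. The field and type names are invented for this illustration and are not from the slides; a real operating system's structures are considerably richer.

    /* Illustrative sketch only: a minimal thread control block (TCB) and the
     * three-state model just described. All names are invented for this example. */

    #include <stdint.h>

    typedef enum { RUNNING, READY, BLOCKED } thread_state_t;

    typedef struct tcb {
        uint32_t       tid;             /* thread identifier                         */
        thread_state_t state;           /* RUNNING, READY or BLOCKED                 */
        uint64_t       saved_regs[32];  /* register context saved at a context swap  */
        uint64_t       saved_pc;        /* program counter to resume from            */
        void          *wait_object;     /* what a BLOCKED thread is waiting on, so
                                           the OS can tell cheaply                   */
        struct tcb    *next;            /* link in the scheduler's ready queue       */
    } tcb_t;

    /* The scheduling decision, in caricature: pick the most worthy READY thread. */
    tcb_t *schedule(tcb_t *ready_queue)
    {
        return ready_queue;             /* e.g. simple FIFO: take the head of the queue */
    }

The wait_object field is the hook that lets the operating system tell cheaply what a blocked thread is blocked on, as mentioned above.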
When interrupts from devices occur, they generally alter data structures and something that was blocked can often become ready-to-run. So most operating systems are designed so that once an interrupt has occurred, on return from interrupt, the operating system scheduler is invoked and a new determination of the most worthy work is made. L2P7: Hardware foundations for atomicity 1 What hardware do we need for concurrency? Well, we've already mentioned the operating system timer for time sharing, but we're also going to get interrupts from other sources. Such interrupts can happen at any time, and a thread can be preempted at any time without warning. This is pretty much an impossible situation: it's pretty much impossible to write software that could be interrupted at any point (see lock-free programming in L8, though). Normally we are bound to have some critical sections. At the hardware level, all processors have a means for disabling interrupts. Generally you don't want another interrupt while you're already processing one interrupt. There can be multiple levels of interrupt in practice, where a higher-level interrupt can interrupt a currently running lower-level interrupt, but we'll put that aside and just imagine that there is normal and interrupt mode. The action of the hardware on getting that interrupt consists of several steps, including entering supervisor mode, changing the program counter, possibly stashing some registers, but in particular it also sets the interrupt disable flag, which is also the hardware reset state of that flag. The flag will be found in a special processor control register alongside the N, Z, C and V flags that you may have come across, and so on. Anyway, there is a hardware flag that is manipulated by both software and hardware to turn interrupts on and off. Also, there is a special instruction, a variation on the normal subroutine return instruction, which is the return-from-interrupt instruction, and that will have the effect of restoring that flag to its previous value. The details of all of that aren't important for this course. We just need to know that there is such a register that can turn interrupts on or off and that it's manipulated in hardware, but also it can be manipulated in software. The instructions for setting and clearing the flag are supervisor-mode (aka kernel-mode) instructions, but in this course we're not going to be worried too much about the distinction between user mode and supervisor mode. Most of what we're discussing applies equally. So on the simple uni-processor system, we can turn interrupts off. We are then guaranteed not to be preempted. Memory locations will not change under our feet. This gives us the perfect environment for implementing a critical section, and at the end of that section we turn interrupts on again. While the interrupts are off, though, we cannot respond to external events, so real-time behaviour may well be affected. We may get a buffer overflow, for instance, if data arrives while we have interrupts off for an extended length of time. That's a hardware buffer I'm thinking of there. Later on, we'll talk extensively about software FIFO buffers. Moreover, this simple technique of turning interrupts on or off is not going to be suitable for use when there are other CPUs that change the contents of memory, or when DMA exists, copying data into memory from external devices. So we're not going to use the crude approach of turning interrupts off for managing critical sections.
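Just to show the shape of that crude approach before we abandon it, here is a minimal uniprocessor sketch in C. The two helper functions are placeholders for the privileged instructions that save/set and restore the interrupt-disable flag (the Linux kernel's local_irq_save()/local_irq_restore() macros play a similar role); they are defined as empty stubs here purely so the sketch compiles.

    /* Sketch only: a uniprocessor critical section implemented by turning
     * interrupts off. In a real kernel the two helpers would execute the
     * privileged instructions that manipulate the interrupt-disable flag. */

    static unsigned long irq_save_and_disable(void) { return 0; }          /* stub */
    static void          irq_restore(unsigned long flags) { (void)flags; } /* stub */

    static volatile int shared_counter;

    void increment_shared_counter(void)
    {
        unsigned long flags = irq_save_and_disable(); /* no preemption from here ...     */
        shared_counter += 1;                          /* the read-modify-write is safe    */
        irq_restore(flags);                           /* ... to here; interrupts back on  */
    }

Note that this only protects us against preemption on the same core; it does nothing about another CPU or a DMA engine writing the same memory, which is exactly why we need the atomic primitives discussed next.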
L2.P8 Hardware foundations for atomicity 2 Last time we saw the problem with the lack of atomic operations. If we want to check a flag and then set it if it wasn't set already on a simple instruction set, this requires a load and then a store. These are separated in time and it's possible that another device also loads between that load and our store. That's the classic race. We need the load and the store to be in some sense atomic, with nothing happening between the load and the store. So we need a new ISA primitive for protection against parallel access to the memory from another CPU, or indeed the same CPU running an interrupt. Older architectures provided instructions such as compare-and-swap, which we'll look at in the next slide. More modern architectures tend to use not an atomic instruction, but a pair of instructions that have an atomic effect somewhere else hidden in the system, such as load-linked and store-conditional. Modern architectures also have a variety of other atomic instructions, such as setting a bit in a word or incrementing a word, and so on. But we don't need to concern ourselves with those at the moment. They may come in handy for implementing lock-free data structures later on. L2P9: Atomic Compare and Swap (CAS) As I said, older instruction sets tended to support an instruction such as atomic compare-and-swap, or some variant of it. The compare-and-swap instruction takes a memory address, a value to compare against and a value to set that memory location to, and it returns a result as well. The value to compare against and the value to set it to could be stored in registers or they could be hard-coded as part of the instruction. That doesn't make a lot of difference. But certainly the memory address to operate on needs to come from a register and the return value from the instruction needs to go into a register on the machine. The semantics of compare-and-swap are straightforward. If the value in memory matches the compare value, then the new value is stored. If the old value doesn't match the compare value, then nothing happens. And either way we get a result which tells us whether the store was successful. Here we see the assembly language implementation of this for the x86 architecture. The instruction of interest is called compare-exchange. We can see it has a lock prefix in front of it. It takes 2 arguments: one is the address of the word to be operated on, which is called foo_lock, and the second is a register, the DX register. The details of this are not important. Compare-and-exchange is of course the same thing as compare-and-swap. The compare-and-swap instruction here is inside a loop and this loop is going to repeat the operation until it's successful. Later on, we'll call this a MUTEX ACQUIRE SPINLOCK. The value to be stored is 1. It's loaded initially into the DX register. The x86 exchange instruction puts its result in the AX register, even though AX isn't mentioned in the operand list of the instruction. That's one of the sort of peculiarities of older CISC instruction sets. We call that a non-orthogonal behaviour and it makes compiler writing a pain. But if we're writing this by hand in assembler, then we know the semantics of the instruction and just have to remember where the result went. But the low-level details are certainly not important. Another variant of this instruction is the atomic test-and-set, which is where the value to be tested against and the value to be stored are hardwired to unity.
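If you prefer to see that spin loop in portable C rather than x86 assembler, here is a minimal sketch using the C11 <stdatomic.h> compare-exchange; on x86 a compiler will typically implement it with a LOCK-prefixed compare-exchange instruction much like the one on the slide. The names foo_lock, acquire and release are invented for this example.

    /* Sketch: a spin-acquire built from compare-and-swap, in portable C11. */

    #include <stdatomic.h>

    static atomic_int foo_lock;          /* statically zero-initialised: 0 = free, 1 = held */

    void acquire(void)
    {
        int expected;
        do {
            expected = 0;                /* we only succeed if the lock currently reads 0 */
        } while (!atomic_compare_exchange_strong(&foo_lock, &expected, 1));
        /* spin until the compare-and-swap succeeds: the MUTEX ACQUIRE SPINLOCK */
    }

    void release(void)
    {
        atomic_store(&foo_lock, 0);      /* put the lock back to free */
    }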
The downside of the compare-and-swap instruction is nothing to do with this course. It's to do with hardware architecture. But essentially, the whole of the memory system needs to be locked between the load and the store phases of the atomic instruction, and this isn't necessarily a good thing in a highly-parallel, multi-processor environment. L2.P10: Load Linked-Store Conditional (LL/SC) The more modern equivalent of compare-and-swap, which is found more commonly on today's RISC architectures, is called load-linked/store-conditional. It's also known as load-exclusive/store-conditional. Rather than having one instruction which makes 2 memory operations, we have two instructions that make one memory operation each. For those interested in computer architecture, which is hopefully all of you, the instruction pair can be implemented in a slightly more cache-friendly way, using some of the cache structures, in fact. And also it doesn't require that the processor core is microcoded. Looking at the two halves in detail: the load-linked instruction is just like a normal load. It loads the value from memory into a register. But the memory system remembers the processor ID that just loaded that location. We can't see that that's happened from anything on the slide, but that's what's going on under the bonnet. The value that we've loaded, we can operate on. We can check it as we would need to do in the compare phase of a compare-and-swap. We can do arithmetic on it, or we can just produce a new constant which we're going to store back there using the store-conditional instruction. The store-conditional instruction is where the magic happens: store-conditional, from its very name, either stores or it doesn't. If it stores, then we say it's successful. Otherwise we say it's failed. And what is the condition for success? It's that no other processor has issued a load-linked or a store-conditional on that memory region in the meantime. There are some possible variations on the detailed semantics. For instance, any store at all to that location might potentially invalidate the link and cause the store-conditional to fail. What do I mean by memory region? Well, that will vary from one architecture to another and also might vary according to how the cache structure is currently organised, which can be adjusted in control registers. But basically, a memory region could be a single location, it could be 64 bytes, a cache line, something like that. Or it could indeed be a whole memory bank. So that's why the slide here says neighbourhood. That will include the address of interest and perhaps some surrounding addresses as well. Again, to implement MUTEX ACQUIRE SPINLOCK, we'll put it inside a loop and iterate it until we have success. In principle we require two loops: one to get a successful LL/SC pair, and an outer one to make the spin, which is equivalent to the loop that was present in the compare-and-swap example. The example assembly code on the bottom of the slide shows what you would run on a RISC-V. But note this just shows how to implement the test-and-set part and we'll need an external loop to make it into a spin. And again, the fine details of that are not relevant for examination purposes on this course. The takeaway from the last two slides is nothing more than that processors do tend to have atomic instructions in their instruction sets.
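In portable C, the 'weak' form of compare-exchange maps naturally onto LL/SC: on a RISC-V or Arm target the compiler will typically emit a load-linked/store-conditional pair for it, and the weak form is allowed to fail spuriously, just as a store-conditional can. Here is a minimal sketch (the names are invented for this example); note that the retry after a spurious failure and the spin waiting for the lock to become free end up folded into the same loop.

    /* Sketch: the same acquire written with the weak compare-exchange, which a
     * compiler typically implements with an LL/SC pair on RISC architectures. */

    #include <stdatomic.h>

    static atomic_int foo_lock;          /* zero-initialised: 0 = free, 1 = held */

    void acquire_llsc(void)
    {
        int expected;
        do {
            expected = 0;                /* only claim the lock if it was observed free ... */
        } while (!atomic_compare_exchange_weak(&foo_lock, &expected, 1));
        /* ... and the same loop absorbs any spurious store-conditional failures */
    }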
However, and interestingly, it's also possible to achieve atomic-like behaviour without atomic instructions, and at the end of this lecture we'll have a look at one of the algorithms for that. L2.P11 Finite State Machine Revision and Terminology A digital computer is a finite state machine. Of course, there's so much memory that the number of states is totally mind-boggling. Finite state machine theory does apply to a complete computer and to a system of communicating CPU cores or programs. But to model concurrent programs as finite state machines is, generally speaking, impractical. It certainly requires a very high level of abstraction, so that just the important properties are represented in our model. For instance, we might consider that the state of a program is just which basic block the program counter of each thread is in. And perhaps also add in the state of some important synchronisation primitives, such as the values of semaphores, which we'll talk about another time. Given such an abstraction, it is then possible to reason about programs and even automatically prove their correctness under various assumptions. As I hope you know already, the finite state machine can be formally defined as a tuple. We have a set of states, a starting state, an input alphabet and a next-state function (aka the transition function). Based on that, we can define some terminology to describe the general behaviour of finite state machines. In my diagram, FSM-1, we have an initial state. We might have several initialisation states that lead from one to another, forming the startup code for our program. But then we enter some sort of main eternal process loop. This is the general structure of most concurrent programs. It's known as a LASSO shape. We'll call these states the live states, and from any live state we can get back to all the other live states at any time in the future. So that's Q1, Q2 and Q3 in this diagram. The terminology I'm using here doesn't necessarily precisely align with what others might use, but it's going to be appropriate for this course. Generally speaking, there could be one state transition diagram for each thread of our program. But if several threads are running the same code, their diagrams have the same shape. We assume that the main, important, significant work of the thread is done in the live states. When we have two threads running, of course, the state space of the computer as a whole is the product of the state spaces of the individual threads. That would be the Cartesian product, of course. Not all of that may be REACHABLE, as we'll discuss. We define here a LIVE STATE to be one that can be returned to infinitely often in the future. On the other hand, a dead or a deadlocked state has no successors. The machine essentially is going to stop if we enter it, or at least this thread will. Other threads may still be able to continue. In digital hardware, we tend to refer to such a state as a dead state, whereas in concurrency theory we may say that the situation is deadlock, but we'll define that term much more thoroughly in another lecture. Also on this diagram we see states Q5 and Q4. These are outside of the live state set because once we've entered them, we can no longer get back to our live states. We could call these bad states. L2.P12 Finite State Machine: Fairness and Livelock Syphons In the FSM-2 figure, we have two sets of states which are in a cycle.
If we ignore the letter F on state Q4 for the moment, which I'll come to shortly, then which is the live set of this diagram? The live set is the set from which it's always possible to get to each other member of the live set in the future. So if we're in Q5 and Q6, we might say they're clearly live. What about Q2, Q3 and Q4? Well, they might be considered to be part of the start-up trajectory for the live set made of Q5 and Q6. So perhaps there's a bit of ambiguity with this diagram. But we can resolve this by marking some of the states with the letter F for fair. This denotes what we call a FAIR STATE. We must always be able to get back to it whatever happens. The live set must always include all the states marked with F. Hence Q4 must be in our live set and hence our main live set is now going to be Q4, Q2 and Q3. What does that mean for Q5 and Q6? Well, they're a bit like our deadlock state from the previous finite state machine. Once we've entered them, we can no longer get back to our live set. But this thread hasn't stopped working completely. We can say that it's entered a livelock cycle (or syphon) where it's repeatedly doing something, but it's no longer able to get back to its main operating set. If our thread enters a state from where it's no longer possible to reach all of the fair states, then that state was not a live state. If we also labelled, say, Q5 as a fair state, then this would be impossible and the labelling would be invalid. We can't have fairness for both Q4 and Q5: if Q5 is entered, we can no longer get back to Q4, so they can't both be fair states. Informally, we can define livelock as "we have not deadlocked, but we can no longer make useful progress". There are perhaps some more rigorous definitions of these terms that you would find in a formal concurrency course. But the definitions that I've given here are going to be sufficient to give us the required insight into what's going on with concurrent programming. L2P13: Finite State Machine: FSM view of thread control flow. Now we put some annotations on the arcs of our diagram and consider the input alphabet Sigma. There are two annotations potentially for each arc. The first is the condition that must hold for the arc to be taken. The second is the effects to be generated when we follow the arc. The word 'effect' here is short for side effect. The side effects mainly of interest will be storing values in variables, but it also includes writing an expression to an output device. The conditions for following an arc need to be boolean predicates. These will range over the values of shared variables and variables that are local to a given thread, and possibly even the state of other threads in terms of which state they're in. Together, all of these form our input alphabet. In automata theory terms, the input alphabet is really the vector (or the Cartesian product) of all of these possible observable variables. In our diagrams, we separate the condition and the effect with the slash symbol. (In some future diagrams, the distinction between input and output is blurred because it is a shared event or rendezvous.) Let's have an instance of this diagram for each possible human being sharing a flat. These human beings are asynchronously composed. So when one of them takes an action, it's not correlated or clocked at the same time as another one taking an action.
This is very different from the composition of the flip-flops in synchronous digital logic, where you have a common clock signal and each participant makes a transition at the same time. We'll take a quick look at that in some follow-up slides. In terms of shared state: there is a global variable which is observable and mutable by all threads, which is how much beer there is in the fridge. And another global variable is whether there is a note on the fridge or not. Our initial state Q0 for beer drinking shows two possible successors. But another possibility is that a human being who is executing this life doesn't do anything in the current moment. Maybe another human being makes a move instead. When we have more than one thread running different code, each thread will have its own instance of a control flow graph. It will have its own state transition diagram. L2P14 Finite State Machine: Product of Machines 1 Let's leave the beer aside and look at a couple of simpler threads which we're going to compose together. Suppose the first thread has two states A and B, and the second has three states labelled zero, one and two. Then the product space will have six states. For two machines in product, we can plot the composition in two dimensions conveniently, and if we look at this two-dimensional plot of the product, we can see that the arcs for the horizontal and vertical movement directly mirror what the individual machines did. But this is for the asynchronous product, where only one machine changes state at a time. If we instead have a synchronous product, in general, we'll get diagonal arcs. In future diagrams, we probably won't show most of the reflexive arcs. These are arcs where the source and destination are the same state. That's certainly the case for the stuttering product, which we discuss on the next slide. But sometimes a reflexive arc does have a side effect, in which case we will need to show it. L2P15 Finite State Machine: Product of Machines 2 What's the difference between asynchronous and synchronous? It's pretty straightforward. In the asynchronous product, exactly one machine steps at a time. That's what we get on a time-sharing uni-processor where there is only one core, so at most one thread can advance at any time. The asynchronous composition doesn't mean strict alternation between threads. It's not thread A and then thread B and then thread A again. There will be a lot of threads in general and each thread might make more than one step before another thread moves at all. That's what we call a STUTTERING ASYNCHRONOUS composition, rather than a TURN-TAKING composition. In the synchronous composition, however, all machines step at once. This will give us the diagonal arcs in the 2-D plot, and I don't know what the word for it is in multi-dimensions (non-planar?). The synchronous composition, of course, is highly relevant in digital hardware, where everything is clocked off the same clock within a clock domain, but that's a different course. In this course, we'll be considering exclusively asynchronous, stuttering composition. That's what you get from an operating system scheduler. L2P16 Finite State Machine: Product of Machines 3 Now, multiplicative composition of the machines is not all that happens: they also interact. As we combine machines, we get a more complex system in terms of the number of states: it grows quadratically. But the possible behaviours of the individual machines tend to reduce.
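Before we add the couplings, here is a tiny sketch in C of the uncoupled asynchronous product of the two machines just described: M1 with states A and B, and M2 with states 0, 1 and 2. The transition functions are invented purely to make the example concrete; the point is the stuttering rule that exactly one machine moves per step of the product.

    /* Sketch: enumerate the asynchronous (interleaved) product of two small machines.
     * The transition functions here are invented just for this example. */

    #include <stdio.h>

    static int step_m1(int s) { return (s == 0) ? 1 : 0; }   /* A <-> B           */
    static int step_m2(int s) { return (s + 1) % 3; }         /* 0 -> 1 -> 2 -> 0  */

    int main(void)
    {
        /* For every product state, list its successors under the stuttering
         * asynchronous rule: either M1 steps or M2 steps, never both. */
        for (int s1 = 0; s1 < 2; s1++) {
            for (int s2 = 0; s2 < 3; s2++) {
                printf("(%c,%d) -> (%c,%d) or (%c,%d)\n",
                       'A' + s1, s2,
                       'A' + step_m1(s1), s2,     /* only M1 moves */
                       'A' + s1, step_m2(s2));    /* only M2 moves */
            }
        }
        return 0;
    }

Running it prints each of the six product states together with its two possible successors, one for "M1 moves" and one for "M2 moves".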
And ultimately, you'll see, this can lead to deadlock, where there is no longer any behaviour possible in certain corners of the composite state space. Let's introduce a first coupling between our two machines, machine one and machine two. We'll ultimately consider two couplings, one in each direction. We'll call the situation where one machine depends on the other, but not the other way round, a half-coupled scenario, and then we'll talk about the fully-coupled scenario where they both depend on each other's behaviour. The input to machine two is a single boolean called Y. Let's make Y a function of the state of machine one. Let's suppose Y is simply equal to the condition that M1 is in state A. This will immediately remove arcs from the product diagram. Here I've plotted the synchronous and asynchronous half-coupled products. I didn't previously draw out the uncoupled synchronous product diagram. You can do that yourself if you want an exercise. Another exercise is just to look at these two diagrams and to see whether we've introduced any dead states. Is every state still reachable from every other state? If so, they are all in our live set. Yes, I've just had a look and I think they are all alive. Now let's introduce the second coupling, a reverse coupling. So we're also going to determine X, which is the input to machine one, from some aspect of machine 2. We'll choose that X holds when M2 is in state 0. Of course, in real-world situations there will be additional state that's not drawn in the state diagrams of the machines. It could be local variables of a thread, or it could be shared variables between threads. Each state is an abstract projection of a lot of states, and whether a thread's PC is included may vary according to whatever is the most sensible projection. So, local variable values could be expanded into the state space of the thread and PC values could be taken out. Having one PC value per basic block is often a reasonable projection. L02.P17: Finite State Machine: Product of Machines 4 Now we'll look at our example called fully-coupled product machines. It's not 'full' in any technical sense, but we now have couplings in both directions and they both influence each other. Again, we could have the synchronous and the asynchronous composition. Note that even though there are no free variables or external inputs any longer in the system, it's possible to get non-deterministic behaviour from the asynchronous product, because there is arbitrary behaviour in the scheduler as to which machine is going to step next. So we could see a state with more than one arc leading out of it. Also, we now see the new concept of states becoming unreachable. There's no way to get to them from the start state. In the fully-coupled asynchronous product at the top, we see that we can move from A0 along the top to A2 and deadlock there. Or we can go up and down between A0 and B0. B2 is a state we shouldn't normally get into, but under error circumstances, such as being hit by a high-energy radioactive particle, or a heavy axe, we could possibly get erroneous behaviour, and if we were in B2 we would exit and go to B0. The synchronous product has its distinctive diagonal arcs, but isn't very much more interesting. L2P18: Forward reference to semantics course notation. The semantics course will denote stuttering parallel composition as shown, using a vertical bar. L2P19: THE DINING PHILOSOPHERS The dining philosophers problem is a classic exercise in concurrency theory.
It's due to Dijkstra, along with many others like the Sleeping Barber problem and so on. They're all amenable to this simple finite state machine analysis. We have 5 philosophers sitting around the table. The life of a philosopher is very simple. It is completely governed by this eternal process loop: the philosopher alternately thinks and eats. Each philosopher requires 2 forks to eat, but being a bit poor, each philosopher can only afford one fork, so sharing of forks is needed. The control flow graph for the philosophers is given in this pseudo code here. When a philosopher wants to eat, they pick up the fork on their left, then pick up the fork on their right; the philosopher eats and then puts down the two forks. In this pseudo code, we've used the subroutine names wait and signal, whose meaning will become apparent in the next lecture. If we have 5 philosophers and each philosopher has about 5 states in its life, then we have five to the power of five states in our composite diagram. That's a bit much to draw out on a sheet of paper. But if you draw out two or three, you might be able to find the problem and illustrate it. I'll give you a heads up straight away: the problem is that, because each philosopher picks up their left fork first, there is a deadlock state where each philosopher has picked up the fork on their left but can't pick up the fork on their right, because the philosopher on their right is holding it in their left hand. This problem has a circle in it. The philosophers are sitting around a circular table, and circles and cycles are a primary cause of deadlock. In fact, they are needed for deadlock: we'll return to that in a future lecture. Also, there's a textbook solution to this problem which can be applied in general and not just to philosophers. L2P20 Reachable State Space Algorithm Given that we have our concurrent system behaviour expressed as finite state machines, there are a variety of automatic and manual tools that can be run to determine properties of the system. One of these tools is the automatic model checker and I'll put a demo of using the CBMC simple model checker on another link. Model checkers basically check two things, either safety or liveness. Safety checking is based on the reachable state space of the system. If we have multiple coupled finite state machines, we need to multiply them all together to get one finite state machine for input to model checking, and that's what's denoted by the FSM in the top line of this slide. The reachable state space is simply the set of states reachable from the initial state. That can easily be found with the following "iterate until closure". We initialise the reachable state set with just the starting state. And then for each state in the set, we add on all those that can be reached in one hop from there. And we iterate until nothing further changes. This is called an iterate-until-closure or a FIXED-POINT ITERATION. When we're model checking for a safety property, then we have some predicate which must hold in all reachable states. It doesn't matter if it holds outside the reachable state space, because we're not supposed to be able to get outside the reachable state space. If we're controlling the air lock on a space station, our safety property might be that the inner door and the outer door must never be simultaneously open. [But on a flying saucer that condition might be all right, provided that we've landed on a planet with breathable atmosphere.]
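Here is a minimal sketch in C of that fixed-point iteration over an explicit transition relation. The little five-state machine is invented just for the example; real model checkers represent these sets symbolically (with binary decision diagrams and the like) rather than as explicit arrays.

    /* Sketch: "iterate until closure" computation of the reachable state space. */

    #include <stdbool.h>
    #include <stdio.h>

    #define NSTATES 5

    /* next[i][j] is true when there is an arc from state i to state j */
    static const bool next[NSTATES][NSTATES] = {
        /* 0 */ {0, 1, 0, 0, 0},
        /* 1 */ {0, 0, 1, 0, 0},
        /* 2 */ {0, 1, 0, 0, 0},
        /* 3 */ {0, 0, 0, 0, 1},   /* states 3 and 4 are never reached from state 0 */
        /* 4 */ {0, 0, 0, 1, 0},
    };

    int main(void)
    {
        bool reachable[NSTATES] = { [0] = true };  /* start with just the initial state */
        bool changed = true;

        while (changed) {                          /* fixed-point iteration            */
            changed = false;
            for (int i = 0; i < NSTATES; i++)
                if (reachable[i])
                    for (int j = 0; j < NSTATES; j++)
                        if (next[i][j] && !reachable[j]) {
                            reachable[j] = true;   /* add states one hop away           */
                            changed = true;
                        }
        }

        for (int i = 0; i < NSTATES; i++)
            printf("state %d %s reachable\n", i, reachable[i] ? "is" : "is not");

        /* A safety check would now assert that the predicate of interest, such as
         * "inner and outer doors not both open", holds in every reachable state. */
        return 0;
    }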
The control systems for the air lock doors must not enter the unsafe states. There must be no sequence of pressing the buttons that enables somebody to be jettisoned into space as all the air rushes out. There are two ways of checking the safety predicate. We can either check it for each incremental version of the reachable state space as we develop it, in which case we could potentially report a violation earlier. Or we can check that the predicate holds for all reachable states once we've computed the state space set. For real-world systems of any moderate complexity, the reachable state space becomes infeasibly (exponentially) large, of course. But automated tools have techniques for handling that, perhaps using binary decision diagrams and so on, but that's not for this course. L2P21: Live States Algorithm The other thing that we can check with the model checker is liveness properties. A liveness property is essentially that something can be achieved in the future. We'll put it another way: all possible paths from where we are will eventually lead to a state where a given property holds. An example liveness exercise might be "it's always possible to unlock the doors on my car", and the classic deadlock situation, certainly with older cars, used to be that you would put the car keys in the boot and you would shut the boot, and that would be deadlock, because we can't open the boot without the key and the key is in the boot. Of course, if you have a fancy car like my wife has, then it's smart enough to spot that you put the keys in the boot and the boot won't shut. So the general principle of a liveness property is that whatever we do, we will eventually be able to make this liveness property hold: at some point in the future, we will get there, one way or another. There are more complicated liveness properties as well, and nothing on this slide is examinable for this course anyway. The liveness algorithm is pretty much the reverse of the safety algorithm. We're going to initialise our live set L to be the set of all states, rather than the starting state, which is what we did last time. To understand this algorithm, it might be rather helpful to think of that situation where we have a ring of people, each one seated, but they're seated on the knees of the person sitting behind them. The iteration is to preserve those states in the live set that have a successor in the live set. And the initial state will always be considered to be live. If you work through the details, you'll see that not only is the main live set (as we defined it earlier, the mesh of interconnected states reachable from each other) included, but the trajectory from the start-up state into that mesh will be included too. And if you want, you can trim that off by considering a further step where you intersect with, not the successors, but the predecessors instead. Alternatively, you can initialise to the reachable state space. Anyway, all the details of those are beyond this course; I'm just mentioning, for educational purposes, that model checkers check liveness and safety. If it's new to you right now, don't worry, it will be covered properly in other courses in the Tripos. L2P22: Mutual Exclusion Without Atomic Actions Earlier on, we spoke about the hardware primitives for atomic actions found in ancient and modern computers. But it's informative to know that there are ways of achieving atomic behaviour without atomic instructions.
Lamport's Bakery algorithm is perhaps the most famous one, and we'll discuss that on the next slide. Don't confuse this with the Banker's Algorithm, which we'll cover in another lecture. As well as the Bakery Algorithm, there's Dekker's algorithm for two parties, and so on. They all work essentially in the same way. They all use global shared variables, but the participants, the individual threads, will know their thread identifier, their TID. The TID is an integer and this will essentially be used as a subscript into a shared global array. On an operating system, you can invoke a system call to find out your current thread identifier. In Unix it's called gettid(), not surprisingly. Since all the computers we're likely to use do have atomic instructions, you might think that getting away without them is not very valuable. But actually there's something to learn here for distributed computing, where we have distributed reads and writes to a database, perhaps from many different computers. Again, we'll need to achieve atomicity one way or another, and perhaps some of these ideas could occasionally be useful in that sort of scenario. Anyway, Doctor X will handle that sort of issue in the second half of this course. These algorithms also serve as lovely toy examples for model checking exercises for those learning to use a model checker. However, these algorithms will tend to rely on SEQUENTIAL CONSISTENCY, which is that a sequence of values stored by one writer is seen to have been written in that order by observers running on other cores. On certain processor architectures, we may need to manually execute fence instructions to preserve that sequential consistency. But that's getting beyond what we need to talk about in this course. L2P23: Lamport Bakery Algorithm Here's the pseudo code for Lamport's Bakery algorithm. The idea is motivated by a bakery where there is a queue of customers waiting to be served, but it's a virtual queue. On entry to the bakery, a customer takes a number from the ticket issuing machine and then waits for their number to be called. The customer also needs to know their own personal identifier, which for threads will be their TID, and we'll assume that this is in a range that can directly index a small array. Although in real-world systems we probably need to compress it down somehow, because the TID could be an arbitrary 32-bit integer. We have two methods defined in the pseudo code, lock and unlock, and these will be used as mutex ACQUIRE and mutex RELEASE when we enter and leave a critical section, as discussed next time. The LOCK operation is essentially going to perform an atomic test-and-set for us. The algorithm operates with two arrays indexed by TID: a Boolean array called enter, and a numeric array called number. Let's put aside the enter array for the moment. So ignoring that, the first line of LOCK is that we're going to compute one plus the max over j of the number array. j will range over all entries in the array, which is over all possible threads. The array is initialised to 0, so this will be 0 in the very first instance and the very first user will be allocated ticket number 1 + 0, which is 1. A nice aspect of this coding compared with the real-life bakery is that the range of numbers tends to be fairly small. That's because when everybody's left the bakery, the number array will all be 0 again and the next person who comes in will be allocated number 1 again. So the numeric space gets recycled automatically by this algorithm.
Or this coding of the algorithm, I should say. There's another variation where we just allocate successively higher numbers to each newly arriving person, which is possibly more like the real-life bakery, where the number counter might only get reset when the shop closes. Now, of course, as seasoned concurrency experts now that we are on Lecture 2, we can see that there's a race potentially straight away. A number of threads might simultaneously be scanning the number array and they will come to the same conclusion about what their ticket number is. So it's entirely possible for two threads or more to be allocated the same number. Now we're going to allocate service based on comparing the number fields with each other, and in fact we're going to allocate service next to the person with the lowest number, which is what would happen in a real bakery: the lowest number that has not yet left the system. We disambiguate the situation where two clients have the same number by forming a pair. The pair consists of the ticket number paired with the client's TID. Since all the TIDs are unique, we can be sure that no two clients have the same pair. Now we can compare pairs using lexicographical ordering, where we give more significance to the ticket number and less significance to the TID, and just use the TID as a tie-break for two clients with the same ticket number. The second half of the lock code is pretty simple. Again, we'll ignore the enter array for now. What we do is, for all entries in the array, we see if there is one lower than the one that we hold, provided it isn't 0 (because we use the value 0 as a special value that denotes that the slot is empty). To reiterate, we just scan looking for a lower value lexicographically than our current one. If so, that customer is currently being served, and we spin, waiting for that condition to no longer hold. The condition will no longer hold when the other customer exits the system, invoking the unlock primitive. As you can see, the unlock primitive simply zeroes the number entry for that person. So remarkably, this more or less works if you manually step through it. If you put aside any possibility of races for the moment, you'll see that it works nicely. The one case where there is a possible race is where two threads have both read from the number array and they haven't yet updated their number field. We know that they'll both potentially end up with the same value, but we've already disambiguated that. But if one of the threads has made significant progress into the second half of the lock code before another thread has stored its number, this is where the race may be a problem. And that's where this additional array, the Boolean array of enter flags, comes in useful. We set our entry in the enter array while we are scanning and updating our own number, and we clear it afterwards. And very elegantly, if you look through it all very carefully, you'll see that this is sufficient to resolve all possible race errors. That's because we don't proceed to look at the number array while any of the enter array flags is set. So we don't have time in the lecture to prove this is correct, but I'll leave that to your private study. Or you might even try and put it in a model checker and prove it, but probably that's a bit advanced for the average second-year student right now. But we do typically have some remarkable people on our course! L2P24: Model checking demo.
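For reference, and as a possible input for the kind of model-checking exercise just mentioned, here is a minimal C sketch of the Bakery lock and unlock as described above. It assumes a fixed, small number of threads N with TIDs 0..N-1, and the sequentially-consistent atomics of C11 supply the sequential consistency the algorithm relies on. The array names enter and number follow the slide; the remaining names (N, bakery_lock, bakery_unlock, lex_less) are invented for this illustration.

    /* Sketch: Lamport's Bakery lock, following the pseudo code on the slide. */

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 8                           /* maximum number of threads (TIDs 0..N-1) */

    static _Atomic bool enter[N];         /* set while a thread is choosing its number */
    static _Atomic int  number[N];        /* ticket numbers; 0 means "not in the queue" */

    /* Lexicographic comparison of (number, tid) pairs: is (na, a) < (nb, b)? */
    static bool lex_less(int na, int a, int nb, int b)
    {
        return (na < nb) || (na == nb && a < b);
    }

    void bakery_lock(int tid)
    {
        enter[tid] = true;                /* announce that we are choosing a ticket */
        int max = 0;
        for (int j = 0; j < N; j++) {     /* take one more than the current maximum */
            int nj = number[j];
            if (nj > max) max = nj;
        }
        number[tid] = max + 1;
        enter[tid] = false;

        for (int j = 0; j < N; j++) {
            while (enter[j])              /* wait while thread j is still choosing */
                ;
            while (number[j] != 0 &&      /* wait while j holds a lower (number, tid) pair */
                   lex_less(number[j], j, number[tid], tid))
                ;
        }
        /* critical section follows */
    }

    void bakery_unlock(int tid)
    {
        number[tid] = 0;                  /* leave the queue; lowest remaining pair goes next */
    }

A thread calls bakery_lock(tid) before its critical section and bakery_unlock(tid) afterwards; the (number, tid) pair comparison in lex_less is the lexicographic tie-break described above.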
L2P25: Summary + Next Time In summary, we've looked at the underlying hardware primitives found on modern computer architectures, discussed memory consistency a little bit. We've looked at a finite state model of the control flow graph for a thread and we've spoken about how this could be analysed by our model checker. But do note that model checkers today cannot directly process the source code for a complete program. That's not reasonable with today's tools. Instead, they can analyse an abstraction of the program which may be sufficient to prove the point of interest. Next time we'll look at access to these hardware primitives and how they can be integrated efficiently with the operating system scheduler. We'll introduce one of the oldest concurrency primitives, which is the semaphore. And also look at how the same construct can be used to provide synchronisation between threads. By which we mean telling one thread that it's now time to look at the work produced by another thread. END OF TRANSCRIPT