CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 1 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory. Hello. My name is David Greaves. Welcome to this course on concurrent and distributed systems. This is what I look like. I'll be responsible for the first eight lectures on concurrent systems, and my colleague Dr X will present the second eight lectures on distributed systems. Welcome. I plan to be in the Intel lab for the C/C++ practical classes, so do feel free to contact me then to discuss things about concurrency and distributed systems. Concurrent programming is generally the topic of one part of a parallel program reliably communicating with other parts of itself. In the past, the term 'concurrent system' denoted a multithreaded or multiprocessing system running on a single processor with a shared address space between all of those components, although the virtual memory system may be providing isolation between processes. In the last 15 years or so, the single processor has been replaced with multiple processor cores that share the same memory space. Distributed systems are the topic of multiple programs talking to each other over unreliable links. The two topics share similar interests and concerns, such as scalability with increasing parallelism or with an increasing number of participating distributed computers. And both are essentially aiming to mask the communications latency between the various parallel strands. Human beings aren't very good at thinking in parallel, and by and large, if we're using an imperative programming language, we like to think in sequence. That's what imperative means: a sequence of commands, one following another. This is a familiar programming paradigm. Functional programs, on the other hand, tend to have more intrinsic parallelism: an expression can be evaluated as soon as its arguments have been evaluated. Automatic parallelization of programs is a related topic, but we will be concentrating mainly on imperative programming and manual parallelization, dividing the work into separate threads of execution ourselves. So while it's nice to be able to think in an imperative, sequential way, if we execute in that way only, we won't get any parallel speed-up. There will be no actual concurrency if we force everything to operate serially. But as we'll see, many possible parallel interleavings of fragments of imperative programs will produce incorrect results when executed. So this subject concentrates on how to maximise parallelism while giving the programmer the freedom to express things imperatively and giving the system the freedom to run things in parallel, making sure that all of the correctness guarantees that we want are respected. Debugging parallel programs is also a very tricky problem, especially when the schedulers which choose what to run next make decisions that are unrepeatable. SHARED MEMORY and SHARED VARIABLES Our concurrent system will typically be running on a shared-memory multiprocessor with cache consistency, so that the hardware automatically communicates values written by one processor to another. In a baseline model, the processors also see the same pattern of reads and writes in terms of sequential ordering; we'll discuss relaxing that another time. Distributed systems, by contrast, should expect to see communications failures, and to encounter them relatively often. 
Another variant of that is that the remote machine might be switched off, or have crashed or gone away for some indefinite length of time, in which case we will potentially see unbounded delay in responding to a message. On the other hand, we don't expect communications failures in a concurrent system within one computer. If we get a failure of the internal hardware mechanisms, that's normally regarded as a catastrophic failure, and we abend the whole program. Distributed systems, however, are expected to recover gracefully. And the final basic difference between concurrent and distributed systems concerns clocks. The clock tells us the current time in units of microseconds or similar. On a single computer, we can expect to reliably read a clock and get high precision in the reading. It might be very accurate, synchronised to one of the main global atomic clocks on the planet. We can rely on the fact that when two programs running on that computer compare the clock values that they read, if one sees a higher value than the other, then it read the clock later on. On the other hand, in a distributed computing system, we can't rely on the clocks being precisely synchronised over all the different machines. We have to design protocols that are robust against minor variations in clock settings. This slide describes the last 40 years of improvements in computing. Moore's law told us that the transistor count would continue to grow exponentially, which it has, because we see a straight line and this is a logarithmic ordinate. But we can also see another important trend post 2010, which is that clock frequencies have flattened off and the number of cores in a computer has gone up. A core is a processor. To exploit all of these cores, we have to program in a parallel way, and that's where concurrency comes in. The first design from Charles Babbage, the Difference Engine, had arithmetic capabilities associated with every storage location. His second design, the Analytical Engine, concentrated all the processing in what he called the mill, and he had as much memory connected onto that as he could make. [And he didn't end up making any of it. But that's another story.] The motivation for his design change was the engineering cost of the mill. Today that is not really an issue. Silicon VLSI is amazing: we can build massively parallel computers at very low cost. OVERVIEW Here's a quick overview of the first 8 lectures of this course. Today we'll introduce the principles of critical sections and mutual exclusion. And next time we will look at some of the theoretical foundations in automata theory and how we can connect automata together. This will give us the basis for theoretical analysis of some of the work that we look at. In Lecture 3, we'll look at the operating system and how it can help us schedule concurrent processes. We'll look at one of the oldest concurrency primitives, which is called the semaphore. In Lecture 4, we'll expand the primitive set to look at a richer set of functionality and the sort of thing that may be provided in more modern operating systems. In Lecture 5, we will look at the problem of deadlock, where processes have fouled each other up. And we'll look at how we can give different threads of execution different priorities, seeing problems that can arise as different priorities interact with each other. In Lecture 6, we'll look at alternative programming language constructs that avoid shared variables, which is one attractive way to go forward. 
Then we'll move the level of abstraction upwards towards distributed computing by talking about composite operations, where a transaction performs a number of atomic operations yet has atomicity itself as well. Lecture 7 takes us further, looking at how we can do this more efficiently. We will let systems run optimistically, hopefully achieving a high degree of throughput, but occasionally needing to retry an operation. And finally, in Lecture 8, we will look at how database systems typically recover from component crashes, plus a short digression into how to programme without using low-level locks at all, just using the consistency guarantees provided by the shared memory system: that's called lock-free programming. READING Here's some of the recommended reading for this course. WHAT IS CONCURRENCY Well, what is concurrency? Computers appear to do many things at once. For instance, we might be running multiple programmes on our laptop. The operating system is also doing concurrent things under the bonnet. A good example is the asynchronous write-back of data to secondary storage, to hard disc or SSD. The data transfer rate to secondary storage is much slower than to main memory, and once a file has been generated in memory, it can't be written instantaneously to the secondary storage. The operating system will typically give the illusion that it's written instantaneously, but it's actually going on more slowly in the background. You might have seen this especially with USB sticks. You copy a big file onto a USB stick and then you click on eject, but it won't eject. Why is that? Well, that's because the operating system has given you the illusion that the file is copied, but really it's still doing it, and it won't let you physically unmount the media until the operation is complete. TIME SHARING Before we had multicore processors, the concurrency was, generally speaking, an illusion. The principal implementation technique for this is time sharing, but asynchronous hardware interrupts from devices are another source of concurrency. With time sharing, processes are given a quantum of execution (eg 10 milliseconds) and then a hardware timer generates an interrupt. The operating system scheduler then works out whether something else that's ready-to-run is now more worthy to run. So both with time sharing and with external interrupts from hardware devices, the event that causes a change in what's running is the interrupt. But as I hope you know from operating systems knowledge, the other major cause of a context swap is a voluntary yield by a process when it performs a system call that can block, such as waiting for input. In all those cases, the impression of concurrency is an illusion, but with modern hardware there is genuine concurrency as well. One source of concurrent memory access is device DMA (DIRECT MEMORY ACCESS), where hardware is copying data to/from secondary storage or the network, or the sound card. The other main forms are other processors or GPUs that share the main memory. In these cases we have genuine concurrency at the hardware level. Whether pseudo or real, from the programming point of view, we have concurrency in all cases. I'm not going to talk to slide 7. Let's move on to Slide 8. RECALL: PROCESSES AND THREADS Here, we'll just recap some basic terminology from the operating systems world. A PROCESS is an instance of a running program. The operating system will load it, establish a memory space for it and run the program, providing it with at least one thread. 
Most operating systems use virtual memory, so the address space that the process has is by default disjoint and protected from all other address spaces. THREADS are entities managed by the operating system scheduler. They represent an individual execution context. In essence, a thread is a program counter (PC), but it's more than that: each thread will have its own complete register file and also its own stack. A processor (aka processor core, aka core) can be running one thread at a time. Threads that are not running on any core at the current time must have their state saved in a thread control block (aka task control block) managed by the operating system. There can be a lot of ancillary information in there as well, specifically concurrency-primitive interactions that we'll discuss in this course and scheduling information such as the thread's priority, which we'll also discuss. Also access rights, environment variables ... Threads run mainly in the address space of their process. But when they make a system call, they essentially run in the operating system kernel, although it might in practice be a proxy thread that runs while the user-space thread is nominally executing inside the kernel. A CONTEXT SWITCH is where the operating system saves the state of one thread and restores the state of another. If the switch is between threads in different processes, then the process state also needs to be switched. This is a matter of reprogramming the virtual memory management system of the computer. In essence, this means changing the value of one of the control registers in the processor to point to a new root translation table (segment table) for the new process. CONCURRENCY WITH A SINGLE CPU With a single CPU, we can see three forms of concurrency. But really, these are just different manifestations of the same basic principles. ONE: In the simplest case, a process X is going to run until it blocks. It performs a system call that cannot be serviced straight away (we say that's a BLOCKING SYSTEM CALL), or it might suffer a hardware interrupt and get preempted that way. Either way, the operating system is now going to run for a while. It might run the device driver for an I/O device, or it might be servicing the system call from process X. It might, for instance, be doing some TCP processing; TCP (transmission control protocol) is the transport protocol for the Internet. Or it might be any other activity that's implemented inside the kernel. When that kernel activity has finished, there are two possibilities. Either process X is still the most worthy process to run, or, quite often, the side effects of what's just happened have changed the worthiness, and something else is better deserving of running. TWO: If it was just the time-sharing interrupt from the timer, then process X has had its quantum and something else is now going to be deserving of running, so we might switch to process Y and then on to process Z and so on. THREE: we also have intra-process concurrency. Suppose process X has multiple threads X1, X2, X3 and so on. Then exactly the same sort of behaviour needs to happen in microcosm, potentially all in user space, or maybe with the kernel being involved. In this case we make context swaps between threads without reconfiguring the memory map. CONCURRENCY WITH A SINGLE CPU This slide illustrates the time-domain behaviour of a single CPU. We see it alternately running different processes, with the operating system doing a little bit of work in between at each context switch. 
Sometimes the work is just the context switch; other times there may need to be some kernel processing before we return to the next user process. We see that a process sometimes stops running when it expects to, such as when it invokes a blocking system call, but at other times it can get preempted, such as when it's run out of time slice or an external event has happened which means that some other task is more worthy to run. The precise interleaving is almost never known in advance: the concurrency pattern will vary from one execution to another for a given program or for any set of programs running together. We normally assume that this is completely non-deterministic. Alternatively, there is a discipline of deterministic scheduling which is helpful for avoiding concurrency bugs in safety-critical systems, but for general systems programming we'll assume it's non-deterministic, and we won't incur the minor overheads that arise if we insist on determinism in the behaviour. So to make the system reliable, we need to assume worst-case behaviour with very little assumption about when we're going to run next, except perhaps in certain hard-real-time specialist areas such as engine management control or Saturn 5 rocket motor control or whatever it is, in which case the operating system may be making us some reliable guarantees. CONCURRENCY WITH MULTIPLE CPUS (AKA CORES) Of course, most modern systems today have multiple CPUs. Hence, things really are going on in parallel, as shown by this diagram, showing two CPUs in the time domain. Notice that the operating system can be running on one CPU or the other, or even both. That means that we have genuine concurrency inside the operating system kernel. Also, the different threads of a given process can be running on different CPUs. This means that the memory system has a bit of a job keeping things consistent, but it means that we can exploit parallel speed-up in a parallel program. WHAT MIGHT THIS CODE DO? This is written in the C language. There is a global variable, a character string which is initialised to the constant "Thread". There is a function that takes an integer and which invokes the sleep system call. It's given an integer random number in the range zero to one, so there are two possible arguments, and when it returns from the sleep system call it executes a print statement. The main() code spawns NUMTHREADS different threads, and that was set to four by the preprocessor macro defined at the top of the page. So each of these four threads will be running in parallel. The main() code is going to join up with them when they've exited. Each thread is going to sleep by a random amount of time and then return. So how many different print statement sequences might we see on the console output? Well, let's try it and see. DEMO .... Check the course website for the output. There are 4 factorial possible outputs, and even if we take the random number generator out, how many possible outputs are there? Let's see that. And is the output from printf itself divisible? Or does it manage to maintain no interleaving of characters from various lines? That again will depend a lot on the implementation of the library on the particular computer you're using. 
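Here is a minimal sketch of the sort of program just described, assuming POSIX threads; the identifiers (NUMTHREADS, threadfn and so on) are my own rather than the exact code on the slide.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUMTHREADS 4

char *threadname = "Thread";            /* global string shared by all threads */

void *threadfn(void *arg) {
    int secs = (int)(long)arg;          /* zero or one, chosen at random in main() */
    sleep(secs);                        /* block in the sleep system call */
    printf("%s slept for %d second(s)\n", threadname, secs);
    return NULL;
}

int main(void) {
    pthread_t tid[NUMTHREADS];
    for (long i = 0; i < NUMTHREADS; i++)
        pthread_create(&tid[i], NULL, threadfn, (void *)(long)(rand() % 2));
    for (int i = 0; i < NUMTHREADS; i++)
        pthread_join(tid[i], NULL);     /* main joins up with each thread as it exits */
    return 0;
}

Running it several times may show the four output lines in different orders from run to run.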
MULTIPLE THREADS WITHIN A PROCESS A standard process has four segments: the code, the static variables, the heap and the stack. This slide only shows three of them, but think of the code segment as also containing the static variables, some of which may be constants, some of which might be initialised to zero, and so on. So in actual fact there could be several segments within that area, but we won't worry about that for this purpose. What's important to note, of course, is that each thread has its own register file. The PC will be pointing to the execution point of that thread, and each thread will also have its own stack, with the stack pointers pointing to different segments (or certainly different regions) of memory space. Each stack has a different start/base address, spaced out according to how much you think the stack might grow, although when using virtual memory, VM space is pretty much free, so we can afford to start the stacks a long way off from each other without significant cost. On the other hand, the heap, the global variables and the code are shared amongst all the threads that participate in this process. It is possible for one thread to read and write the stack of another thread in most systems, although this is tending to be blocked, since it is one of the security backdoors that are sometimes exploited by viruses and other malware. 1:N USER-LEVEL THREADING Let's now consider how the operating system interacts with per-process threads. The first paradigm is called one-to-N user-level threading. In this approach, the kernel only knows about, and so only schedules, processes. As far as the operating system sees it, the process has one thread. A user-space library is then responsible for allocating that operating system thread to the various user threads and performing context switches. Examples are old versions of the Java virtual machine, and threading libraries like QuickThreads, which has been ported to many operating systems and is widely used. Also, we may have a system of co-routines (if you care to read up about co-routines) where all yielding is voluntary. This will tend to give us the lightest possible threading overhead. Hence, all creation, termination and context switching is done in an application-specific way. It might be optimised for given coding styles in a high-level language where certain registers aren't even used. And it will be operating-system independent. But the disadvantages are that it's difficult to make system calls: if one thread wants to make a blocking system call, are all the other threads blocked in the meantime? That doesn't sound very helpful. If one of the threads accesses fresh memory, there'll be a page fault, and this might be perfectly reasonable: the operating system is going to allocate some memory for that access, but that might take time, and hence the real-time performance of the other threads will be affected while that page fault is being serviced. Another problem is time slicing. We can't easily stop one thread in this thread group from hogging the CPU. We have no hardware mechanisms available in user space for preempting one thread when it's been executing for too long. But perhaps the biggest problem is the lack of genuine parallelism: we won't be exploiting the multiple cores available on today's hardware platforms. So, all in all, this is a pretty old-fashioned approach. 
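To make the idea of purely voluntary, user-space context switching concrete, here is a minimal co-routine sketch using the POSIX ucontext API. This particular API is not mentioned on the slides; it is just one way a user-space threading library might be built, and it is deprecated in recent POSIX although still available on common systems.

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];        /* the co-routine's private stack */

static void co_fn(void) {
    printf("co-routine: first run\n");
    swapcontext(&co_ctx, &main_ctx);    /* voluntary yield back to main */
    printf("co-routine: resumed\n");
}                                       /* falling off the end follows uc_link */

int main(void) {
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof co_stack;
    co_ctx.uc_link = &main_ctx;         /* where to go when co_fn returns */
    makecontext(&co_ctx, co_fn, 0);

    printf("main: switching to co-routine\n");
    swapcontext(&main_ctx, &co_ctx);
    printf("main: co-routine yielded, resuming it\n");
    swapcontext(&main_ctx, &co_ctx);
    printf("main: co-routine finished\n");
    return 0;
}

The kernel sees only one thread here; all the switching is done by library calls in user space.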
1:1 - KERNEL-LEVEL THREADING So, by and large, a commonly preferred solution today is one-to-one kernel-level threading, where each user-space thread is managed by the operating system kernel. When the process starts, it has one thread, but it can make system calls which ask the O/S to add further threads. That sort of system call needs to be contrasted with something like the Unix fork system call, which creates a whole new process. [In fact, in modern operating systems, the underlying mechanism is generalised so that, with the parameters passed to a single system call, you can sweep the spectrum between just having a new thread and having a whole new process. But that's beyond this course. All that we need to know today is that the new threads will be created and managed by the operating system.] This scheme overcomes all of the disadvantages mentioned on the previous slide. There might be a slightly greater overhead in creating a new thread, so you might program so as to keep a thread sitting around idle. A thread doesn't have much overhead while it's idle (sitting on the kernel's not-ready-to-run list), so this can be better than killing it and starting another one when needed. It was once argued that kernel-level threading is not very portable over operating systems, but generally speaking it is available today on all the major operating systems, so that's not really an objection. M:N - HYBRID THREADING It's also possible to have, sort of, the best of both worlds, supporting a large number, N, of user-space threads with a smaller number, M, of kernel (aka native) threads. This is good when the number of threads needed greatly exceeds the available number of cores. The dynamic mapping is managed by the user-space runtime system. Hybrid threading of this general nature is a common requirement, especially in simulations of multi-agent systems. The agents might be motor cars or Internet switches etc. It is a good approach if we want to make the user-space threads exceptionally lightweight, just for running a very small fragment of code. We still have the problem that when one of the user threads makes a blocking system call, that thread is no longer available for providing parallelism in user space. But if only a limited number of the user-space threads are communicating outside the process, this works ok. By matching M to the number of CPU cores physically available, we do not overload the O/S scheduler with an excessive number of threads and a consequent high frequency of context swaps. A simulation of N agents is then efficiently portable over hardware with different numbers of cores by varying M. Certain operating systems implemented a fairly grand mechanism for generic M:N mapping, in which the kernel-provided execution contexts handed to user space were referred to as ACTIVATIONS. The kernel could return a new thread to user space when one of the activations blocked, and that new thread would serve as a replacement activation. O/S support for generic M:N hybrid threading of this nature has been removed from most mainstream operating systems. The overhead of having a reasonable number of sleeping threads (M<1000) is not severe, and so native O/S threads can be used in most cases. Above that number, the M:N mapping is typically managed instead by an application-specific run-time library. Not examinable: timer-based preemption can be managed by having one native thread spend most of its time sleeping on an O/S-provided timer, but the details of preempting user-space threads by the run-time system are tricky, and it is easiest if these are coded cooperatively, with regular calls to some yield()-like method as per co-routines. 
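As an illustration of the M:N idea, here is a minimal sketch that runs N lightweight 'agents' on M native threads; the names (N_AGENTS, M_WORKERS, agent_step) are invented for the example, and it uses a fixed static partition of agents to workers rather than the dynamic mapping a real run-time library would provide.

#include <pthread.h>
#include <stdio.h>

#define N_AGENTS  16    /* lightweight user-level tasks */
#define M_WORKERS 4     /* native threads, roughly one per core */

/* One tiny 'agent' step; in a real simulation this would advance the
   state of one car, switch, etc. */
static void agent_step(int id) { printf("agent %d stepped\n", id); }

static void *worker(void *arg) {
    long w = (long)arg;
    /* Static partition: worker w runs agents w, w+M, w+2M, ... */
    for (int round = 0; round < 3; round++)
        for (int a = (int)w; a < N_AGENTS; a += M_WORKERS)
            agent_step(a);
    return NULL;
}

int main(void) {
    pthread_t tid[M_WORKERS];
    for (long w = 0; w < M_WORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)w);
    for (int w = 0; w < M_WORKERS; w++)
        pthread_join(tid[w], NULL);
    return 0;
}

Changing M_WORKERS to match the core count of the machine leaves the N agents untouched.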
VIRTUALISATION (NOT ON THE SLIDES) Another form of concurrency we tend to see today, especially in data centres, is complete machine virtualization, perhaps using the Xen hypervisor, which was designed here in this department. These virtual machine systems are beyond the scope of this course, so this topic isn't examinable here. But in these virtual machine systems, the amount of concurrency is limited out-of-band and involves switching complete racks of equipment on or off, and indeed moving complete running operating systems from one machine to another, and so on. SUMMARY: ADVANTAGES OF CONCURRENCY But to get back down to Earth, let's just summarise the advantages of concurrency. We are able to overlap computation and I/O on a single machine. This is typically done using interrupts. It enables us to write simpler code when we need to handle genuinely concurrent situations, such as one thread handling the GUI while another handles input and a third handles the main algorithm of something like a game. If we're writing a web server, then it might be sensible to use one thread for each concurrent HTTP request that's being served. Again, this is one situation where user-space threading might actually still be highly relevant. Or if we have something like a Java virtual machine, then we would have one thread for each of the concurrent threads that the JVM is modelling for us, and also another one for the garbage collection, so that that can operate largely asynchronously. But of course, apart from code simplification, concurrency also enables us to exploit parallel speed-up. As we've seen, in general we have some number of processes, each with some number of threads, running on some number of computers in a distributed system, and each computer has some number of CPUs. For this first half of the course, we're going to focus more or less entirely on a single computer running a multi-threaded program in a shared address space. Virtually everything that we look at will generalise to multiple processes with their different address spaces, to multiple CPUs and to multiple machines, but perhaps with some ancillary complexities; for instance, in a distributed system we have more error recovery to consider in general. For the next three lectures or so we're going to concentrate on shared-memory computing. It's arguable whether that's a good paradigm in the long run, but it's certainly the mainstream paradigm at the moment, and we will look at how threads can cooperate as they share these variables. CONCRETE CONCURRENCY EXAMPLE: HOUSEMATES BUYING BEER Let's introduce a first concrete example: some housemates buying beer. These housemates keep their beer in the fridge, which is a common practice in many parts of the world. Maybe not here in England. The housemates are all identical, and we will model each housemate with an instance of a thread, with all the threads running the same code. The algorithm that they follow has three steps. Step one: look in the fridge. Step two: if there's some beer, take some out and drink it; but if there's no beer, go and buy some beer and put it in the fridge. Step three: go back to step one. In most cases this is going to work fine, but now we're going to see our first example of a race condition. Somebody whose name is 1 looks in the fridge, sees there's no beer and goes out. 
Meanwhile, somebody who's called 2 (imaginative parents in this naming scheme) comes along while person 1 is out. 2 is also going to look in the fridge and see that there's no beer there. And what does person 2 do? They go out as well to buy beer. The problem here is that 2 didn't know that 1 had already gone out to buy some beer. 1 is going to come back and 2 is going to come back. We don't know which order they'll return in, but they'll both put beer in the fridge, and the fridge might not be able to hold all of this beer, so we will have what we might later call a buffer overrun, and the solution for that, generally speaking, is to discard the excess material. Here we could drink the extra beer, I suppose. But if we weren't manipulating beer here but, say, control rods in a nuclear reactor, then the recovery procedure might be less pleasurable than drinking the extra beer. Although I do appreciate that not everybody likes beer. SOLUTION 1 - USE A NOTE What's the obvious solution to the beer-buying problem? It's to leave a note: when a housemate looks in the fridge and sees there's no beer, they put a note on the fridge. And if there is no beer but there is a note, then they do nothing. This will actually work in real life because of some informal interactions that aren't part of the algorithm we've documented: the two human beings spot each other looking in the fridge and writing notes and so on. So let's analyse the subtlety in computer science terms. Here we've coded up the algorithm in pseudo code. We now have two variables. We have the beer in the fridge, which is some sort of natural number, and we have a Boolean which says whether or not there's a note on the fridge. These variables are SHARED VARIABLES, looked at concurrently by multiple instances of the algorithm, one for each of the different housemates. The way this slide is coded, I think beer is not actually a shared variable. It's a local variable to each thread, but it's being updated from the global state about the level of beer in the fridge, and that global state is shared. But that doesn't matter: we're going to concentrate more on the variable called note. 
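Here is a sketch of Solution 1 as a runnable C program, assuming POSIX threads; the variable and function names are my own rather than the slide's, and the sched_yield() call is there purely to widen the window between testing the note and setting it so that the unlucky interleaving is easy to provoke.

#include <pthread.h>
#include <stdio.h>
#include <sched.h>

int beer = 0;       /* bottles in the fridge (shared) */
int note = 0;       /* is there a note on the fridge? (shared) */

void *housemate(void *arg) {
    long id = (long)arg;
    if (beer == 0) {            /* look in the fridge */
        if (!note) {            /* no note yet? */
            sched_yield();      /* a context switch here lets the other
                                   housemate also see note == 0 */
            note = 1;           /* leave the note */
            printf("housemate %ld goes out to buy beer\n", id);
            beer += 6;          /* come back with a six-pack */
            note = 0;           /* remove the note */
        }
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, housemate, (void *)1L);
    pthread_create(&t2, NULL, housemate, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%d bottles in the fridge\n", beer);   /* sometimes 12, not 6 */
    return 0;
}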
Instead of having multiple people, let's consider we're running this concurrent algorithm on a uniprocessor, which is going to be stepping through one person's code and then doing context swaps at potentially arbitrary times, resuming that code where it was preempted. Remember, a thread will stop execution for two essential reasons. One is that it makes some sort of blocking system call, which isn't relevant here; but it can also be stopped when it gets preempted by an interrupt, either from the hardware timer or from an I/O device connected to our uniprocessor, and that can context-switch us as well. Note: in the design of hardware and algorithms, you can't really rely on events being rare, because processors run at giga-operations per second, and something that happens once in a billion times will happen within a second. Hence the situation described on this slide, which is a context switch straight after thread one has checked for the absence of a note, is highly likely to arise. Those of us with any experience of this sort of thing will rapidly spot that this solution doesn't work when coded up this way. Thread one checks that there isn't a note. Then it's about to set the note using what we're going to call a test-and-set. But between the test and the set it experiences a context switch. Thread two now runs; it sees that there isn't any beer in the fridge, and it sees that the note isn't set, so thread two also goes out to buy beer. There is a second context switch. Thread one resumes, sets its note and goes out to buy some more beer. And we have the double-buying problem which the note was supposed to avoid. What are the chances of this going wrong? Well, it won't go wrong every time, but we could do a simple uniform-distribution analysis of the percentage of time that a couple of threads might be synchronised this way and end up with some probability. We'd have to make quite a lot of assumptions, but we'd still get a finite probability, and in computing, anything which happens more often than once per universe lifetime is going to be a problem. That's a general rule. And code that fails occasionally is much harder to debug than code that fails reliably under a given repeatable environment. Even if we replicate the initial conditions, the non-deterministic behaviour of the operating system scheduler will mean that we get different outcomes each time we run the program, or it will certainly take a different amount of time before it fails. And if we try to debug this using a tool like GDB, or whatever is appropriate for the language we're using, it's possible the bug may entirely disappear. Such a bug is known colloquially as a 'heisenbug'. CRITICAL SECTIONS and MUTUAL EXCLUSION The problem that we have here is that we have two threads trying to solve the same problem. They're both trying to execute concurrently, but ideally we just want one thread buying the beer. The fragment of code where the problem really lies is between the test and the set of our note variable. If we can be sure that we don't get preempted between those two instructions, then the problem goes away. We call that a CRITICAL SECTION: there should be at most one thread at any one time executing that instruction path. We respect that critical section by ensuring MUTUAL EXCLUSION. Mutual exclusion means that at most one thread is in there. One way to achieve mutual exclusion is to only have one thread: we could nominate one of the housemates as the beer buyer and not let any others participate in beer buying. That would solve the problem, but it is not a general solution because it restricts concurrency. Instead, in our first solution, we tried to provide mutual exclusion by leaving the note. The meaning of the note was "I'm in the critical section", and removing the note meant "I've finished doing my critical work". As we saw, it didn't work in this situation. The reason it didn't work is that we could have an unanticipated context switch between reading the note and setting it. Problems of this nature are called RACE CONDITIONS: a race where multiple threads compete with each other during conflicting access to shared resources. ATOMICITY The solution to this will be to make the test-and-set operations atomic, in other words indivisible. What we want is for the checking of the note and the setting of the note to happen without any other thread intervening. We might only set the note if it wasn't set already: it would be a conditional set. But for a simple binary flag like this, it doesn't matter whether the setting part of it is conditional or not. 
We don't care if another thread reads it after we're done or sets it before we start our check, but once we start our check, we need to continue without any interruption. When a sequence of operations is made to happen as though it were just one operation, we call that an ATOMIC OPERATION. The atom is indivisible from the point of view of the rest of the programme. An atomic read-and-set operation is going to be sufficient for us to implement a correct beer programme. We'll also actually need something similar when we take beer out of the fridge, but that's not discussed on these slides. Let's see our solution using a read-and-set operator, which is an atomic operation that takes a variable passed by reference, using the C syntax here with the ampersand at the call site to denote pass-by-reference. That means it will be able to check the value and, if it's zero, set it to one, returning one if it was successfully changed from zero to one, and returning zero otherwise. [For those learning C and C++ at the moment, a side remark: pass-by-reference in C++ is denoted with an ampersand in the signature or definition of the routine, whereas to get that effect in C you put a star there and put the ampersand at the call site.] This read-and-set primitive is sufficient to implement a working program. Let's look at the edits that we have made here. The test of the note has been changed to a read-and-set call, and that also replaces the setting of the note in the body of the basic block that's guarded by that call. So does this work? We will still experience context switches happening at arbitrary points, and now let's consider what happens if we get a context switch between thread one checking the beer variable and thread two checking the beer variable. Well, I don't think this works. It's perfectly possible for thread one to observe that there isn't any beer in the fridge, and then for thread two to make precisely the same observation. So they're both going to embark on the control flow where they go out and buy beer. Oh dear. What we need is the capability to turn a region of code into a critical section. GENERAL MUTUAL EXCLUSION Let's suppose we have a couple of calls which delimit the start and end of a critical section. We'll call them ENTER and LEAVE respectively. And this should work. In fact, we've made the whole behaviour of each thread a critical section, which looks a bit brutal. But we'd have to consider that in reality there are going to be other behaviours by the housemates outside of what's described in these threads. They potentially have a worthwhile life outside of just buying beer. They'll also be drinking beer, and so now actually we have two possible operations on the fridge and we'll need synchronisation for each of those. What sort of exclusion will they need? Well, we'll consider that later on when we look at multiple-reader, multiple-writer type locks. Anyway, as I said before, this should work, but now we need to consider how to implement the enter and leave primitives. IMPLEMENTING MUTUAL EXCLUSION So how can we implement the enter and leave primitives? One option that works, at least for a uniprocessor, is to prevent any context switches while we're in the critical section. If we're running in kernel mode, then we can disable all interrupts by setting a flag in the interrupt control register. We re-enable interrupts by restoring the value of the flag at the end of the critical section. 
If we're in user space with lightweight threads, then we can set a similar variable in memory that the scheduler running in user space looks at to decide whether context swaps are allowed. Although this banning of context swaps works, it's rather brutal. It will stop context swapping for all other threads, not just those that want to enter or participate in this critical section. There are also consequences for real-time behaviour and the smoothness of the system: if we are in a critical section for any significant duration, the granularity of sharing is increased. It is worse if we need to sleep in our critical section, which we'll look at when we consider condition variables for monitors in another lecture; this is a common enough need. If we need to sleep or wait during our critical section, we can't expect anything else to happen, because if we've turned off interrupts, for instance, we won't even get a timer interrupt. And finally, of course, it doesn't work where we have multiple processors or other devices that are making access to the main memory. See also slide seven of Lecture 2, where I make pretty much the same comments as I've just made. Rather than having some global variable that temporarily turns off all aspects of concurrent system behaviour, it's much better to have a local mechanism associated with each critical region. We'll use a shared variable as our local mechanism, and we'll call it a LOCK. We'll also call it a MUTEX or a MUTUAL EXCLUSION LOCK. Our enter and leave primitives will take the lock as their argument. We must make sure we manually pass in the correct lock; we'll talk about language-level support for making sure we have the correct lock in another lecture. The locking primitive on the lock, which is also known as ACQUIRE, is: while we do not hold the lock, repeatedly issue read-and-set. The corresponding UNLOCK operation is a simple store of the value zero into the lock. Unlock can also be known as RELEASE. As a digression about coding style, rather than having a while statement with a semicolon as its body, it's actually much nicer to put the word continue as the body of the while loop: it's harder to miss when you're quickly reading the code. Here's a beer-buying example using the lock primitive instead. 
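Here is a sketch of what that example might look like in C, assuming C11 atomics are available to provide the atomic read-and-set; the names (read_and_set, acquire, release, fridge_lock) are my own rather than the slide's.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int fridge_lock = 0;    /* 0 = free, 1 = held */
int beer = 0;                  /* shared state protected by the lock */

/* Atomically: if *lock is 0, set it to 1 and return 1; otherwise return 0. */
int read_and_set(atomic_int *lock) {
    int expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

void acquire(atomic_int *lock) {       /* ENTER */
    while (!read_and_set(lock))
        continue;                      /* spin until we win the lock */
}

void release(atomic_int *lock) {       /* LEAVE */
    atomic_store(lock, 0);             /* a simple store of zero */
}

void *housemate(void *arg) {
    long id = (long)arg;
    acquire(&fridge_lock);             /* start of the critical section */
    if (beer == 0) {
        printf("housemate %ld buys beer\n", id);
        beer += 6;
    }
    release(&fridge_lock);             /* end of the critical section */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, housemate, (void *)1L);
    pthread_create(&t2, NULL, housemate, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%d bottles in the fridge\n", beer);   /* always 6 now */
    return 0;
}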
SOLUTION #3 MUTUAL EXCLUSION LOCKS This is finally a correct program, but it's still not perfect. The lock might be held for quite a long time; for instance, somebody else might be wanting to use the fridge to get the milk out. The while loop is also likely to waste a lot of time. It will be SPINNING, as we say, waiting to acquire the lock, when really the processor has something better to do, for instance running the thread that's going to release the lock. When multiple threads want to acquire this lock, we have CONTENTION, and in another lecture we'll look at ways of efficiently resolving that. As I said, these locks are sometimes called mutexes, but we would probably prefer to reserve that name for a lock which is integrated with the operating system for efficiency, whereas what we have seen with this while loop is what we'll call a SPINLOCK. In general, doing a little bit of spinning before giving up and sleeping is a very good design point, but we'll see that another time. SUMMARY In summary, this time we've seen the definition of a concurrent system and looked at the sources of concurrency within a computer. We've done some basic revision on processes and threads and seen the challenges when we have concurrent access to shared resources. We introduced some important terms: the critical section, mutual exclusion, the race condition and the concept of atomicity. We looked at how we can implement mutual exclusion using an exclusion lock or a mutex. Next time we'll look at some of the underlying hardware mechanisms that the computer must provide to the operating system as the basis for concurrency primitives. We'll also look at a view of concurrency based on automata theory, which will help us formalise some of the terms that we use in general concurrency discussions. END OF TRANSCRIPT