CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 3 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory.

L3 Mutual exclusion, semaphores, and producer-consumer relationships.
Reminder from last time: concurrent systems can be very complicated, but ultimately they are finite state machines. Therefore, finite-state machine theory applies to them, and they can be analysed as systems of communicating automata. But the challenge is concurrent access to shared resources. Two writes to the same variable may happen in either order, and reads may be interleaved as well. The scheduler doesn't give us control over the relative ordering at a fine level, and hence we see races and unpredictable results. So we fix that using a variety of mechanisms, some at the hardware level, such as the atomic instructions of the processor, and then we build various software structures on top of those. These are, generally speaking, going to be tightly coupled with the operating system for efficient implementation. The main mechanism for avoiding races is going to be mutual exclusion, where we make sure that at most one piece of code is handling a group of interdependent variables at one time. The code so protected is, of course, called a critical section. The example from last time was the beer-buying exercise. The problem was that we couldn't take out a lock: it takes some time for somebody to go and buy the beer, and during that time there's no indication that somebody else shouldn't also set off to buy beer. We tried to provide a solution to that using a note, which was going to act as what we'll now call a mutual exclusion lock, but such locks need to be implemented properly.

L3P03 THIS TIME
This time we'll talk about how to implement mutual exclusion. We'll introduce one of the oldest forms of synchronisation primitive, which is called the semaphore. In general, as well as wanting to provide mutual exclusion, we'll also want to be able to provide other forms of synchronisation between threads, such as a thread starting to run again when a particular condition holds. There are various classic examples based around resource allocation and certain standard paradigms for it, which we will consider. The producer-consumer paradigm is where one agent produces items and puts them in a queue, and another agent takes them out of the queue. Adding items to the queue and removing them generally needs to be protected with semaphores or other synchronisation primitives. There may be just one producer and multiple consumers, or vice versa, or multiple producers and multiple consumers. Each of these is a common design pattern, which we'll look at. To implement mutual exclusion, we shall use locks. A lock for providing mutual exclusion is called a MUTEX. Only one client aiming to enter a critical section is able to hold the mutex, and the others will have to wait while the mutex is busy before they can enter. Normally we try to make the critical sections as short as possible so that the time holding a mutex is minimised. This tends to help the system run more smoothly.

L3P05 Implementing mutual exclusion
The simplest way to implement mutual exclusion is to use the hardware primitives provided by the system directly, such as a read-and-set instruction. That instruction can be run in user space. It doesn't require any privilege, although obviously the memory that it accesses must also be in user space.
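As a concrete picture of what such an instruction does, here is a minimal sketch of read-and-set semantics, written as C that must be imagined to execute atomically (in reality it is a single hardware instruction; the next sketch uses the C11 atomic_flag facility, which provides equivalent behaviour portably):

    /* Sketch only: the whole body executes atomically in hardware. */
    int read_and_set(int *lock)
    {
        int old = *lock;    /* read the current value                */
        *lock = 1;          /* unconditionally mark the lock taken   */
        return old;         /* old == 0 means the caller acquired it */
    }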
We'll delimit our critical section with a couple of primitives, one called ENTER_CS (enter the critical section) and one called LEAVE_CS (leave the critical section). The programmer must call ENTER_CS at the start of the critical code and make sure they call LEAVE_CS as they leave the critical section of code, under all control flows. Modern high-level languages provide various means to avoid the programming error of forgetting to invoke the leave operation, but we won't discuss many of these right here. The lock primitive can be implemented directly by using the read-and-set instruction inside a while loop (sketched below). A thread wanting to gain access to the critical section will call LOCK and it will 'spin', as we call it, repeatedly executing the read-and-set instruction until that returns false. When it returns false, it means no other thread was in the critical section and hence the caller has gained the lock. Exiting the critical section is very easy: we just use a straightforward store instruction to set the mutex back to 0. This is really simple, assuming all clients do have access to the mutex variable through the memory protection system. It is fine inside the kernel of an operating system, and it is fine in user space where all the threads are in the same process. But I guess you can see what the disadvantage is? The while loop is very inefficient. The processor is going to be repeatedly polling a mutex to see if it can gain access. So this spin approach is only suitable for the very simplest of operating systems or for mutexes which are highly likely to always be available.

L3P06 Semaphores
In general, there are two things we can be waiting for. One is to wait for something to be done by another process on a shared data structure, and the other is to wait for an external device to do something (via an interrupt). Indeed, if we're on a uniprocessor, then while we're busy spinning in a spinlock, actually no other process is running, so the first of these two cases is definitely not going to work. Also, for general efficiency, we'll want the spin to be replaced by an interaction with the operating system scheduler, and for the core to halt, saving energy, if there really is nothing to do. For instance, when we want to acquire a mutex and it's busy, we'll want to cause the current thread to suspend and for the scheduler to choose something more sensible for the processor to do in the meantime. This means that the operating system scheduler needs to have some visibility of the user-space variables and data structures. There are various approaches to the design of this. We don't want the operating system to be complicated in its scheduling decisions, because that would be inefficient, but we do want a reasonably flexible set of primitives. Operating systems have varied over the decades in quite what set of solutions they provide here. One of the oldest and most noteworthy is the so-called semaphore. An early operating system was the THE multiprogramming system from Dijkstra and his group, and this had semaphores. So a semaphore is nothing more than an integer stored in memory, which can be accessed by the various client threads and also the operating system scheduler. We can initialise it to 0, but sometimes we'll initialise it to a natural number. But the basic aim is that it never goes negative. There are two operations on the semaphore. They're called wait and signal. Sometimes these operators are called down and up; when we look at the code that implements these operators shortly, we'll see why.
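Before looking at how the semaphore is implemented, here is the spin-lock version of ENTER_CS and LEAVE_CS described above, so it can later be compared with the semaphore-based approach. It is a minimal sketch using the C11 atomic_flag type, whose test-and-set operation plays the role of the read-and-set instruction; the names ENTER_CS and LEAVE_CS follow the slides rather than any particular library.

    #include <stdatomic.h>

    atomic_flag mutex = ATOMIC_FLAG_INIT;       /* clear means 'free' */

    void ENTER_CS(void)
    {
        /* Spin: keep doing an atomic read-and-set until it returns false,
           i.e. until the flag was previously clear and we have now set it. */
        while (atomic_flag_test_and_set(&mutex))
            ;                                   /* busy wait */
    }

    void LEAVE_CS(void)
    {
        atomic_flag_clear(&mutex);              /* a plain store back to 0 */
    }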
In the original work, the semaphore operations were commonly called P and V. I don't know if the names P and V meant anything in Dutch, but certainly I've never been able to remember which one is which, and we probably won't use those names. According to how we initialise the semaphore, we can use these two primitives to implement two main programming paradigms. The first is mutual exclusion, but with optimisations to spinning so that we do enter the scheduler and don't waste excessive amounts of CPU time. And the other is condition synchronisation, which is where we want to make a thread sleep until some particular condition holds. The typical example will be that there is something waiting to be taken out of the queue in a producer-consumer type scenario.

L3P07 Semaphore Implementation
Well, how is the semaphore implemented? Let's look at the operating system pseudocode for the implementation of the two primitives. I said the semaphore consists of an integer (a shared variable), but it also has a queue. The queue contains threads which are waiting on the semaphore. We'll explain what that means in a second. The queue cannot (normally) be inspected from user space or application code. In fact, the queue might not exist as a free-standing component inside the operating system. It could well be bundled in with other data structures used by the scheduler using some sort of pointer arrangement. But it behaves like a queue in all important respects. Looking at the pseudocode, firstly, for wait we see that wait is essentially trying to decrement the value held in the semaphore. Indeed, if the semaphore is greater than 0, then wait simply decrements it and returns. But if it is zero, then, since the semaphore is not allowed to go negative, the thread will instead be blocked. There are various ways of expressing this pseudocode. In this example, there's no WHILE loop inside the wait primitive. But in an alternative, another part of the system may have incremented the semaphore, and then a WHILE loop here would exit and then decrement it again, getting the original effect; but ignore that for the moment. Let's look at the SIGNAL primitive next. The signal primitive is essentially the opposite of wait. Wait tries to decrement the semaphore; signal tries to increment it, hence the names down and up. But the signal primitive doesn't blindly increment the semaphore. First it checks to see if there are any threads waiting in the associated queue. If there are no threads waiting, it increments the semaphore and returns. But if there are threads waiting on the thread queue, then one of those is taken off and allowed to run. We could potentially have a first-come, first-served FIFO discipline in this queue, or we could choose the thread which has the highest priority. So it should already be apparent what one of the basic use modes for the semaphore is: WAIT is going to wait for the semaphore to be greater than 0 and then decrement it. SIGNAL is nominally non-blocking and, in essence, increments the semaphore, but depending on the precise coding we may not increment it; we may instead free up a waiting thread, as indicated in this pseudocode. But, in general, if we have made a higher-priority thread ready-to-run, we can expect to get preempted. This means that the variable SEM holds, in a sense we'll discuss, a pool of available items: WAIT will take one and SIGNAL will return it to the pool.
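As a minimal sketch of that pseudocode, rendered in C-like form (the thread-queue operations and the scheduler hooks, such as enqueue, dequeue, queue_empty, current_thread, suspend_current_thread and make_runnable, are hypothetical names standing for whatever the particular kernel provides):

    typedef struct {
        int          value;      /* never allowed to go negative          */
        thread_queue waiting;    /* threads blocked on this semaphore     */
    } semaphore;

    void wait(semaphore *s)      /* also known as down, or P              */
    {
        if (s->value > 0) {
            s->value--;                          /* take one from the pool */
        } else {
            enqueue(&s->waiting, current_thread());
            suspend_current_thread();            /* block until signalled  */
        }
    }

    void signal(semaphore *s)    /* also known as up, or V                */
    {
        if (queue_empty(&s->waiting)) {
            s->value++;                          /* return one to the pool  */
        } else {
            make_runnable(dequeue(&s->waiting)); /* wake one waiting thread */
        }
    }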
This is just pseudocode, and if implemented in the operating system kernel in a non-preemptive way this would be fine, but obviously these primitives themselves are not intrinsically thread-safe as coded here. So appropriate precautions must be taken if they are implemented in a pre-emptible environment.

L3P08 HARDWARE SUPPORT FOR WAKEUPS
We'll just have a little digression and return to hardware support on multicore processors. We spoke about the various hardware mechanisms for achieving atomicity in the instruction set. [The fine details are not overly relevant for this course.] But what does it mean to wake up a thread? Well, on a uniprocessor we might think we get non-preemptive execution at the lowest level, but this is disrupted by interrupts, either from an I/O device or a timer. And in user space we also have blocking system calls, where user code voluntarily yields. After both forms of disruption (interrupt and system call), a uniprocessor operating system is coded to invoke the system scheduler, decide which thread is most worthy to run, and carry on with the thread decided in that way. Now on a multiprocessor, it's possible that the actions of one core mean that something else is more worthy to run, but we don't necessarily want to suspend the initiating core straight away. Instead, we want some mechanism for other cores to be aware that something has changed and that something may have become more worthy of running. So all we need is a means for one core to send a hardware interrupt to another core. This is called an interprocessor interrupt or an inter-core interrupt (ICI). Special hardware is provided to do this. It might be part of a general load-balancing interrupt controller that distributes real hardware interrupts onto appropriate cores. Or it might be a free-standing hardware component somewhere on the chip which provides mailbox facilities for cores to interrupt each other. But the basic behaviour is not going to be any different from an external device interrupt; whether it's a different core or an external device doesn't actually make a great deal of difference. The initiator, whether it's a core or a device with its device driver, is going to change some data structure that the scheduler looks at. Then it will issue the inter-core interrupt. The receiving core need only have a pretty much null device driver for handling that interrupt. Nothing is done by the interrupt handler itself, but the side effect is that the core receiving the interrupt enters its scheduler. It looks at the shared data structures for load balancing across the system and decides what's most worthy to run, perhaps performing a context switch. But that level of hardware knowledge is not examinable in this course. This is really just an aside for those of you who are interested to know how multicore operating systems might be structured.

L3P09 Mutual exclusion with a semaphore
Let's look at a little worked example of mutual exclusion with a semaphore. If you're currently looking at just the simple still version of this slide, I will annotate it in a future release of this video. The slide shows three threads called A, B & C, and time runs from the top of the page downwards. On the left we see a semaphore, which is an integer value and a list. The integer value of the semaphore is initialised to one and the list of threads waiting on it is empty. This will show us how to achieve mutual exclusion: we only want one thread at a time to be in a critical section.
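The pattern each thread follows is simply this; a minimal sketch, using the wait and signal primitives sketched above, with the semaphore initialised to 1:

    semaphore mutex;            /* initialised to 1, with an empty queue         */

    void thread_body(void)
    {
        wait(&mutex);           /* down: proceeds at once if free, else blocks   */
        /* ... critical section: manipulate the shared variables here ...        */
        signal(&mutex);         /* up: wakes one waiter, or sets value back to 1 */
    }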
So the first thread, thread A, comes along; the semaphore has the value one in it. A issues wait, which decrements it to zero, does not block, and it enters its critical section. Then, while A is in the critical section, B comes along and wants to enter the critical section. So it issues wait, but it has to block because the semaphore is currently 0, so B goes on to the wait queue for this semaphore. Then yet another thread, C, comes along; it also issues wait and joins the queue as well. After a bit, thread A finishes its work and signals the semaphore. Because the queue is non-empty, one of the members of the queue is going to be taken off and made ready to run. The lucky thread is thread B, so thread B now unblocks from its wait call and can enter its critical section, and when it's finished, it issues its SIGNAL. Now thread C is given a wake-up. It runs its critical section and finishes with the SIGNAL. And because there is nothing waiting on the semaphore any longer, signal has the effect of incrementing the semaphore value back to its starting value of one. Hence we see that no two threads are ever in the critical section at once, which is what we want. Hurrah.

L3P10 CONDITIONAL SYNCHRONISATION
So now we'll consider a second primary use case for the semaphore. We'll have two threads that want to communicate with each other, and we'll initialise the semaphore to 0 this time. In this example we have threads A & B. A is some sort of server and it's waiting for work to be given to it by thread B. There are two possible situations which can arise as we start up the system. It could be that A becomes ready to serve before B offers any work, or it could be that B issues some work before A has started. So the left-hand diagram shows wait before signal, which is where A, our server, has started up and doesn't have anything to do at the moment, so it waits on the semaphore: it will block there because the semaphore is initialised to zero and no work is pending on it. It is put on the semaphore queue. Later on, B has some work to give to A, so it issues a signal on the semaphore, which unblocks A, and A will start working. When A has finished, it may typically loop round to the top of the diagram again. We might be in either of the left- or right-hand paradigms at that point. The other startup paradigm is the right-hand side: signal before wait. B issues work by signalling the semaphore, but A hasn't yet become ready to serve, or it's still busy with the last work item. So when B issues SIGNAL, the semaphore is incremented to one to show that one work item is ready, waiting to be served. A comes along and issues a WAIT on the semaphore, which already has work associated with it, so the WAIT is non-blocking and the value in the semaphore is decremented back to zero straight away. This basic pattern provides the baseline for a variety of resource-management and producer/consumer scenarios which we'll have a look at. Some more of them will be in the next lecture, but we'll look at one or two today.

L3P11 N-RESOURCE ALLOCATION
A third use of the semaphore is pooled resource allocation. We can also use semaphores to manage resource allocation where we have a pool of N equivalent instances of something, such as a printer. If we have N identical printers, then we will initialise our semaphore to N, and any job wanting a printer will issue wait on that semaphore.
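A minimal sketch of that pattern, again using the wait/signal primitives from above (the printer bookkeeping helpers claim_free_printer and mark_printer_free are hypothetical):

    #define N_PRINTERS 4

    semaphore printers;                 /* initialised to N_PRINTERS              */

    int acquire_printer(void)
    {
        wait(&printers);                /* blocks only once all N are in use      */
        return claim_free_printer();    /* hypothetical: record which one we got  */
    }

    void release_printer(int id)
    {
        mark_printer_free(id);          /* hypothetical bookkeeping               */
        signal(&printers);              /* wake a blocked job, or count one back  */
    }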
We'll also need auxiliary data structures for keeping track of which printers are currently in use, and so on, but let's ignore those details for the sake of this simple explanation. Provided the semaphore was greater than 0, a client will not block; it will simply decrement the semaphore, so this is a non-blocking acquisition of a resource. But after N clients have all acquired resources, all the printers will be in use, and the next client to try to wait on the semaphore will instead block and go on the semaphore's wait queue. When a printer has finished being used, the releasing client will signal the semaphore, which will either increment it or release a waiting client. And as I said, some auxiliary data structures are going to be needed in practice, but this shows the basic principle.

L3P13 PRODUCER-CONSUMER PROBLEM
So we've seen three basic design patterns: mutual exclusion, condition synchronisation and N-way resource allocation. Now we look at combining some of these in the producer/consumer examples. Later we'll consider a multitude of producers and consumers that asynchronously send work from the producers to the consumers using a shared buffer pool. It does not matter which consumer processes the work that came from any particular producer. But to start with, we consider a single producer and a single consumer. Now the buffer pool will have finite capacity, but we want the whole system to be lossless, and therefore we can't allow any buffer OVERFLOW problem. So we will have blocking reads and writes, or blocking enqueues and dequeues (puts and gets), on the shared buffer. The call which inserts (enqueues) something in the buffer will block if the buffer is full. The call or primitive that takes something out of the buffer (dequeues it) will block if the buffer is empty. [Later, for a general shared data structure, we'll consider the special case of multiple readers and a single writer for that data structure. The multiple-reader/single-writer paradigm crops up quite a lot, and commonly there are special, slightly optimised primitives for dealing with this situation.] Ignore work stealing as mentioned on slide 13 for now. The principle of work stealing is that consumers search around many queues that might contain work. Maintaining FIFO order is not necessarily part of the producer-consumer problem in general; it also applies to LIFO stacks. But we here discuss a FIFO example.

L3P14 PRODUCER-CONSUMER PSEUDO SOLUTION
A common structure used for a FIFO queue is the circular ring buffer. Logically, items progress through the buffer from the input to the output, but in reality nothing moves. Instead we keep track of input and output pointers. The buffer itself consists of a statically allocated array. Here it just contains integers, and it's called buffer. We have a couple of integer pointers, IN and OUT, which are both initialised to 0; when they both have the same value, the buffer is empty. The IN pointer refers to the location where the next item is to be inserted, and OUT refers to the location where the next item to be taken out is currently residing. We need two methods to operate on the queue: one to insert an item and the other to take an item out, and, as said, both are potentially blocking. The pseudocode here has a bit of natural language highlighted in it, but the rest of the code is fairly straightforward. What we have illustrated here is in fact the complete thread, with its outer loop, for both the producer and the consumer.
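For reference, a sketch of that pseudo-solution in C-like form (the highlighted natural language is left as comments, and get_next_item and consume_item are hypothetical stand-ins for whatever generates and uses the data):

    #define N 8                          /* buffer capacity                    */

    int buffer[N];
    int in  = 0;                         /* where the next item is inserted    */
    int out = 0;                         /* where the next item is taken from  */

    void producer(void)
    {
        while (1) {
            int item = get_next_item();  /* hypothetical source                */
            /* ... wait until there is space in the buffer ... */
            buffer[in] = item;
            in = (in + 1) % N;           /* advance modulo the buffer length   */
        }
    }

    void consumer(void)
    {
        while (1) {
            /* ... wait until there is an item in the buffer ... */
            int item = buffer[out];
            out = (out + 1) % N;
            consume_item(item);          /* hypothetical sink                  */
        }
    }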
An infinite while loop is called an 'eternal loop', and in this context a 'process loop'. In the producer thread we alternately get the next item from whatever is generating it, and then we invoke the method for inserting it in the queue, which is the next three lines of code, here shown inlined. Similarly for the consumer thread: we dequeue an item using some sort of method call, which again has been inlined for conciseness, then we consume the item and then we get the next one. After we've used the current IN or OUT pointer, we increment it modulo the buffer length, and it's that modulo effect which gives the circular buffer, or ring FIFO, its name. Rather than providing a general method for inserting and removing items from the buffer, we've inlined the code in the producer and consumer threads.

L3P15 OO-STYLE PRODUCER-CONSUMER FIFO
[This slide was added, I think, in 2023, and so slide numbers might be out w.r.t. the videos for the remainder.] The buffer code here has been refactored as an OO class. The code shape is different, but the executed code is identical. This OO style is what's often used in practice, especially w.r.t. the MONITOR programming style mentioned later. Note: both exported methods are blocking. Our method implementations will shortly be generalised to be re-entrant.

L3P16 PRODUCER-CONSUMER SOLUTION
Let's look at the working code for producer/consumer using two semaphores to replace the pseudocode. The two semaphores are called SPACES and ITEMS. SPACES is initialised to N and ITEMS is initialised to 0. The `if there is space' condition is replaced with a WAIT on SPACES, and the `if there is an item' pseudocode is replaced with a WAIT on ITEMS. In the first instance, the wait on SPACES will not block; it will just decrement the SPACES semaphore. Whereas the wait on ITEMS will block the thread, because the ITEMS semaphore is 0. We also, of course, have to have a SIGNAL: you can't have a semaphore which doesn't have both some waits and some signals on it. So we SIGNAL the ITEMS semaphore after we have inserted an item, and we SIGNAL the SPACES semaphore after we've removed an item in the consumer thread.

L3P17 PRODUCER-CONSUMER SOLUTION
So does this work? It seems that we've replicated some state. We have three different ways of representing how many items are in the buffer: the SPACES semaphore, the ITEMS semaphore, and the difference between the IN and OUT pointers. You might think that that is a terrible replication of state, and replication of state tends to lead to bugs. Now, we don't have any explicit mutual exclusion illustrated here, and actually it's not needed because of the way the examples have been coded: there is exactly one producer and exactly one consumer. So the methods for enqueue and dequeue are never entered by more than one thread at once; they do not need to be re-entrant. A REENTRANT method is one where potentially two or more threads can be executing it at once. So there's not really any race condition, for instance in the producer thread, between where we read the value of IN to index the buffer and where we update it and assign it a new value. Likewise for the consumer thread. On the replication of state: it's not completely redundant, because in fact the situation where IN equals OUT represents two cases. One is that the buffer is empty, which was our initial case, and the other is where it's full. When it's empty, the consumer will sleep waiting on the ITEMS semaphore, and when it's full, the producer will sleep waiting on the SPACES semaphore.
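Putting that together, here is a sketch of the single-producer, single-consumer solution with the two semaphores (buffer, in, out and N as declared in the previous sketch; get_next_item and consume_item remain hypothetical):

    semaphore spaces;                    /* initialised to N: free slots       */
    semaphore items;                     /* initialised to 0: items available  */

    void producer(void)
    {
        while (1) {
            int item = get_next_item();
            wait(&spaces);               /* block if the buffer is full        */
            buffer[in] = item;
            in = (in + 1) % N;
            signal(&items);              /* one more item is now available     */
        }
    }

    void consumer(void)
    {
        while (1) {
            wait(&items);                /* block if the buffer is empty       */
            int item = buffer[out];
            out = (out + 1) % N;
            signal(&spaces);             /* one more slot is now free          */
            consume_item(item);
        }
    }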
And, as said, IN and OUT are only accessed in one thread each, and so there aren't any races in how they're used. This works nicely enough, but we will need to generalise it to consider multiple producers and/or multiple consumers.

L3P18+19 GENERALIZED PRODUCER-CONSUMER
So far we've had exactly one producer thread and exactly one consumer thread. More generally, we might have many threads: multiple consumers, multiple producers or both. If so, then we will need explicit mutual exclusion. We need to make the methods, as I called them, for enqueuing and dequeuing THREAD SAFE. The easiest way of doing that is to put mutexes around their bodies so that they become critical regions themselves. This will stop two consumers trying to consume the same item, and it will stop two producers trying to store their items in the same slot. We know that semaphores can act as mutexes, so we can do this with perhaps one or possibly two additional semaphores. So the generalised producer-consumer solution for any number of readers and writers is to put critical regions around the manipulations of the IN and the OUT pointers. We call our additional semaphore GUARD, and we will initialise it to 1, as normal for mutual exclusion. In the producer, after we've waited for SPACES, we now need to wait for the guard to acquire the lock, and when we've done our critical code, we can signal the guard. Similarly for the consumer thread: we'll acquire the guard before our critical section, which is after we've waited for an item to exist. And then when we've dequeued that item, we will signal SPACES and release the mutex, in either order (a sketch is given below, after the summary). [In a later lecture, we'll look at how to phrase this code as a monitor and also write lock-free data structures which could avoid some of this, but that's for the future.] Of course, this has made simultaneous enqueue and dequeue impossible, because we've shared a single mutex over both the producer and the consumer methods, so we leave it as an exercise to perhaps add one more semaphore and avoid that situation.

L3P20 SEMAPHORES SUMMARY
Let's see what we've learned about semaphores. They are a powerful abstraction for implementing concurrency control. We can use them for mutual exclusion and condition synchronisation. They're much better than naive use of the read-and-set instruction, because the operating system is involved and will avoid the time wasted during spinning. The processor will be given something more useful to do, or else halt, saving electricity. [The halt instruction on a processor makes it stop until the next interrupt comes along.] This has been a fairly low-level presentation for clarity purposes. Typically these low-level operations will be hidden, or certainly made easier to use, by various high-level language constructs. In our presentation here, we've relied on the programmer manually associating a semaphore with what needs to be synchronised or excluded. There are various possible programmer errors, such as forgetting to wait before accessing shared data, providing a code path that doesn't SIGNAL afterwards, or simply using the wrong semaphore in a given context. The situation will only get more and more difficult with more complex programs. So, generally speaking, we should use higher-level abstractions. There are different primitives that we might use, and there are high-level language constructs which automatically deploy these primitives for us. Both of these techniques will help reduce programmer error.
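For reference, here is a sketch of the generalised producer-consumer code from slides 18 and 19, with the guard semaphore added around the pointer manipulations (declarations as in the earlier sketches; guard is initialised to 1):

    semaphore guard;                    /* initialised to 1: mutex over IN and OUT */

    void insert_item(int item)          /* may be called by many producers         */
    {
        wait(&spaces);                  /* wait for a free slot                    */
        wait(&guard);                   /* enter the critical region               */
        buffer[in] = item;
        in = (in + 1) % N;
        signal(&guard);                 /* leave the critical region               */
        signal(&items);                 /* announce the new item                   */
    }

    int remove_item(void)               /* may be called by many consumers         */
    {
        wait(&items);                   /* wait for an item to exist               */
        wait(&guard);
        int item = buffer[out];
        out = (out + 1) % N;
        signal(&guard);
        signal(&spaces);                /* announce the freed slot                 */
        return item;
    }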
To return to the point about different primitives: in POSIX Pthreads, which is perhaps the most mainstream concurrency primitive set of recent decades, there is no semaphore; there's just a mutex and various other facilities provided. Semaphores may still be used inside the operating system, especially in legacy code.

L3P21 MUTUAL EXCLUSION AND INVARIANCE
An invariant is a property that is declared to always hold (is always true). But in reality that property might temporarily not hold while something is being manipulated. Typically we put the manipulation inside a critical section so that nothing else can look at the situation while the property doesn't hold; so we're using locking to avoid exposure of inconsistent transient states. An example might be incrementing the real-time clock and calendar. When it gets to midnight, we need to change both the time and the date. If we can't atomically store both fields on our processor architecture, then we have to change one after the other. If we use a simple carry technique, then we may have updated the time but not yet changed the date when somebody reads them, and they will see the time having apparently jumped back nearly 24 hours when it wasn't supposed to. So our invariant must hold at the time the mutex is acquired, can be violated while we hold the mutex in the critical region, and must be restored before the mutex is released. Now take a doubly-linked list. The invariant is that an entry is either in the list or not in the list; but, given different pointer chains for forwards and backwards, we cannot update them both at once on a uniprocessor, and on a multiprocessor we can't either --- at least not reliably, and not without lots of other synchronisation overheads. So we may need either to use creative data structures which avoid this, which we'll touch on in another lecture, or else to have a critical region around the update, covering the situation where a component is still on the forward list but has been removed from the backward list, or vice versa.

L3P22 SUMMARY AND NEXT TIME
So we've looked at implementing mutual exclusion using low-level hardware support for atomicity, and we've also briefly discussed inter-processor interrupts. Spinning isn't a particularly efficient way of proceeding because of the time wasted in a spin. It also wastes energy, which may be important if you're running off a battery, or if you care about the planet. So tying the synchronisation primitives into the operating system scheduler is the best way to go, and the semaphore is the baseline example of how to do that. We saw how to use semaphores for condition synchronisation, resource allocation and mutual exclusion. We looked at the two-party and generalised producer-consumer relationships. And we spoke about the role of invariants in data structure design, and why it may be appropriate to have a lock around a data structure while it is temporarily violating its invariant. Next time we'll extend to the multiple-reader, single-writer model of shared data structures, look at some alternatives to semaphores and locks, and look at concurrent primitives in practice. We'll also talk about starvation and fairness, because it's all very well having a system that doesn't have any races, but we do need to make sure that everything runs and gets some service. Thank you. END