CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 3 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory.

L3 Mutual exclusion, semaphores, and producer-consumer relationships.
Reminder from last time: concurrent systems can be very complicated, but ultimately they are finite state machines. Therefore, finite-state machine theory applies to them, and they can be analysed as systems of communicating automata. But the challenge is concurrent access to shared resources. Two writes to the same variable may happen in either order, and reads may be interleaved as well. The scheduler doesn't give us control over the relative ordering at a fine level, and hence we see races and unpredictable results. So we fix that using a variety of mechanisms, some at the hardware level, such as the atomic instructions of the processor, and then we build various software structures on top of those. These are, generally speaking, going to be tightly coupled with the operating system for efficient implementation. The main mechanism for avoiding races is going to be mutual exclusion, where we make sure that at most one piece of code is handling a group of interdependent variables at one time. The code so protected is, of course, called a critical section. The example from last time was the beer-buying exercise. The problem was that we couldn't take out a lock: it takes some time for somebody to go and buy the beer, and during that time there's no indication that somebody else shouldn't also set off to buy beer. We tried to provide a solution to that using a note, which was going to act as what we'll now call a mutual exclusion lock, but such locks need to be implemented properly.

L3P03 THIS TIME
This time we'll talk about how to implement mutual exclusion. We'll introduce one of the oldest forms of synchronisation primitive, which is called the semaphore. In general, as well as wanting to provide mutual exclusion, we'll also want to be able to provide other forms of synchronisation between threads, such as a thread starting to run again when a particular condition holds. There are various classic examples based around resource allocation and certain standard paradigms for it, which we will consider. The producer-consumer paradigm is where one agent produces items and puts them in a queue, and another agent takes them out of the queue. Adding items to the queue and removing them generally needs to be protected with semaphores or other synchronisation primitives. There may be just one producer and multiple consumers, or vice versa, or multiple producers and multiple consumers. Each of these is a common design pattern, which we'll look at. To implement mutual exclusion, we shall use locks. A lock for providing mutual exclusion is called a MUTEX. Only one client aiming to enter a critical section is able to hold the mutex, and the others will have to wait while the mutex is busy before they can enter. Normally we try to make the critical sections as short as possible so that the time holding a mutex is minimised. This tends to help the system run more smoothly.

L3P05 Implementing mutual exclusion
The simplest way to implement mutual exclusion is to use the hardware primitives provided by the system directly, such as a read-and-set instruction. That instruction can be run in user space. It doesn't require any privilege, although obviously the memory that it accesses must also be in user space.
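As a concrete picture of what such an instruction does, here is a minimal sketch of read-and-set semantics, written as C that must be imagined to execute atomically (in reality it is a single hardware instruction; the next sketch uses the C11 atomic_flag facility, which provides equivalent behaviour portably):

    /* Sketch only: the whole body executes atomically in hardware. */
    int read_and_set(int *lock)
    {
        int old = *lock;    /* read the current value                */
        *lock = 1;          /* unconditionally mark the lock taken   */
        return old;         /* old == 0 means the caller acquired it */
    }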
We'll delimit our critical section with a couple of primitives, one called ENTER_CS (enter the critical section) and one called LEAVE_CS (leave the critical section). The programmer must call ENTER_CS at the start of the critical code and make sure they call LEAVE_CS as they leave the critical section of code, under all control flows. Modern high-level languages provide various means to avoid the programming error of forgetting to invoke the leave operation, but we won't discuss many of these right here. The lock primitive can be implemented directly by using the read-and-set instruction inside a while loop (sketched below). A thread wanting to gain access to the critical section will call LOCK and it will 'spin', as we call it, repeatedly executing the read-and-set instruction until that returns false. When it returns false, it means no other thread was in the critical section and hence the caller has gained the lock. Exiting the critical section is very easy: we just use a straightforward store instruction to set the mutex back to 0. This is really simple, assuming all clients do have access to the mutex variable through the memory protection system. It is fine inside the kernel of an operating system, and it is fine in user space where all the threads are in the same process. But I guess you can see what the disadvantage is? The while loop is very inefficient. The processor is going to be repeatedly polling a mutex to see if it can gain access. So this spin approach is only suitable for the very simplest of operating systems or for mutexes which are highly likely to always be available.

L3P06 Semaphores
In general, there are two things we can be waiting for. One is to wait for something to be done by another process on a shared data structure, and the other is to wait for an external device to do something (via an interrupt). Indeed, if we're on a uniprocessor, then while we're busy spinning in a spinlock, actually no other process is running, so the first of these two cases is definitely not going to work. Also, for general efficiency, we'll want the spin to be replaced by an interaction with the operating system scheduler, and for the core to halt, saving energy, if there really is nothing to do. For instance, when we want to acquire a mutex and it's busy, we'll want to cause the current thread to suspend and for the scheduler to choose something more sensible for the processor to do in the meantime. This means that the operating system scheduler needs to have some visibility of the user-space variables and data structures. There are various approaches to the design of this. We don't want the operating system to be complicated in its scheduling decisions, because that would be inefficient, but we do want a reasonably flexible set of primitives. Operating systems have varied over the decades in quite what set of solutions they provide here. One of the oldest and most noteworthy is the so-called semaphore. An early operating system was the THE multiprogramming system from Dijkstra and his group, and this had semaphores. So a semaphore is nothing more than an integer stored in memory, which can be accessed by the various client threads and also the operating system scheduler. We can initialise it to 0, but sometimes we'll initialise it to a natural number. But the basic aim is that it never goes negative. There are two operations on the semaphore. They're called wait and signal. Sometimes these operators are called down and up; when we look at the code that implements these operators shortly, we'll see why.
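Before looking at how the semaphore is implemented, here is the spin-lock version of ENTER_CS and LEAVE_CS described above, so it can later be compared with the semaphore-based approach. It is a minimal sketch using the C11 atomic_flag type, whose test-and-set operation plays the role of the read-and-set instruction; the names ENTER_CS and LEAVE_CS follow the slides rather than any particular library.

    #include <stdatomic.h>

    atomic_flag mutex = ATOMIC_FLAG_INIT;       /* clear means 'free' */

    void ENTER_CS(void)
    {
        /* Spin: keep doing an atomic read-and-set until it returns false,
           i.e. until the flag was previously clear and we have now set it. */
        while (atomic_flag_test_and_set(&mutex))
            ;                                   /* busy wait */
    }

    void LEAVE_CS(void)
    {
        atomic_flag_clear(&mutex);              /* a plain store back to 0 */
    }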
In the original work, the semaphore operations were commonly called P and V. I don't know if the names P and V meant anything in Dutch, but certainly I've never been able to remember which one is which, and we probably won't use those names. According to how we initialise the semaphore, we can use these two primitives to implement two main programming paradigms. The first is mutual exclusion, but with optimisations to spinning so that we do enter the scheduler and don't waste excessive amounts of CPU time. And the other is condition synchronisation, which is where we want to make a thread sleep until some particular condition holds. The typical example will be that there is something waiting to be taken out of the queue in a producer-consumer type scenario.

L3P07 Semaphore Implementation
Well, how is the semaphore implemented? Let's look at the operating system pseudocode for the implementation of the two primitives. I said the semaphore consists of an integer (a shared variable), but it also has a queue. The queue contains threads which are waiting on the semaphore. We'll explain what that means in a second. The queue cannot (normally) be inspected from user space or application code. In fact, the queue might not exist as a free-standing component inside the operating system. It could well be bundled in with other data structures used by the scheduler using some sort of pointer arrangement. But it behaves like a queue in all important respects. Looking at the pseudocode, firstly, for wait we see that wait is essentially trying to decrement the value held in the semaphore. Indeed, if the semaphore is greater than 0, then wait simply decrements it and returns. But if it is zero, then, since the semaphore is not allowed to go negative, the thread will instead be blocked. There are various ways of expressing this pseudocode. In this example, there's no WHILE loop inside the wait primitive. But in an alternative, another part of the system may have incremented the semaphore, and then a WHILE loop here would exit and then decrement it again, getting the original effect; but ignore that for the moment. Let's look at the SIGNAL primitive next. The signal primitive is essentially the opposite of wait. Wait tries to decrement the semaphore; signal tries to increment it, hence the names down and up. But the signal primitive doesn't blindly increment the semaphore. First it checks to see if there are any threads waiting in the associated queue. If there are no threads waiting, it increments the semaphore and returns. But if there are threads waiting on the thread queue, then one of those is taken off and allowed to run. We could potentially have a first-come, first-served FIFO discipline in this queue, or we could choose the thread which has the highest priority. So it should already be apparent what one of the basic use modes for the semaphore is: WAIT is going to wait for the semaphore to be greater than 0 and then decrement it. SIGNAL is nominally non-blocking and, in essence, increments the semaphore, but depending on the precise coding we may not increment it; we may instead free up a waiting thread, as indicated in this pseudocode. But, in general, if we have made a higher-priority thread ready-to-run, we can expect to get preempted. This means that the variable SEM holds, in a sense we'll discuss, a pool of available items: WAIT will take one and SIGNAL will return it to the pool.
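As a minimal sketch of that pseudocode, rendered in C-like form (the thread-queue operations and the scheduler hooks, such as enqueue, dequeue, queue_empty, current_thread, suspend_current_thread and make_runnable, are hypothetical names standing for whatever the particular kernel provides):

    typedef struct {
        int          value;      /* never allowed to go negative          */
        thread_queue waiting;    /* threads blocked on this semaphore     */
    } semaphore;

    void wait(semaphore *s)      /* also known as down, or P              */
    {
        if (s->value > 0) {
            s->value--;                          /* take one from the pool */
        } else {
            enqueue(&s->waiting, current_thread());
            suspend_current_thread();            /* block until signalled  */
        }
    }

    void signal(semaphore *s)    /* also known as up, or V                */
    {
        if (queue_empty(&s->waiting)) {
            s->value++;                          /* return one to the pool  */
        } else {
            make_runnable(dequeue(&s->waiting)); /* wake one waiting thread */
        }
    }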
This is just pseudocode, and if implemented in the operating system kernel in a non-preemptive way this would be fine, but obviously these primitives themselves are not intrinsically thread-safe as coded here. So appropriate precautions must be taken if they are implemented in a pre-emptible environment.

L3P08 HARDWARE SUPPORT FOR WAKEUPS
We'll just have a little digression and return to hardware support on multicore processors. We spoke about the various hardware mechanisms for achieving atomicity in the instruction set. [The fine details are not overly relevant for this course.] But what does it mean to wake up a thread? Well, on a uniprocessor we might think we get non-preemptive execution at the lowest level, but this is disrupted by interrupts, either from an I/O device or a timer. And in user space we also have blocking system calls, where user code voluntarily yields. After both forms of disruption (interrupt and system call), a uniprocessor operating system is coded to invoke the system scheduler, decide which thread is most worthy to run, and carry on with the thread decided in that way. Now on a multiprocessor, it's possible that the actions of one core mean that something else is more worthy to run, but we don't necessarily want to suspend the initiating core straight away. Instead, we want some mechanism for other cores to be aware that something has changed and that something may have become more worthy of running. So all we need is a means for one core to send a hardware interrupt to another core. This is called an interprocessor interrupt or an inter-core interrupt (ICI). Special hardware is provided to do this. It might be part of a general load-balancing interrupt controller that distributes real hardware interrupts onto appropriate cores. Or it might be a free-standing hardware component somewhere on the chip which provides mailbox facilities for cores to interrupt each other. But the basic behaviour is not going to be any different from an external device interrupt; whether it's a different core or an external device doesn't actually make a great deal of difference. The initiator, whether it's a core or a device with its device driver, is going to change some data structure that the scheduler looks at. Then it will issue the inter-core interrupt. The receiving core need only have a pretty much null device driver for handling that interrupt. Nothing is done by the interrupt handler itself, but the side effect is that the core receiving the interrupt enters its scheduler. It looks at the shared data structures for load balancing across the system and decides what's most worthy to run, perhaps performing a context switch. But that level of hardware knowledge is not examinable in this course. This is really just an aside for those of you who are interested to know how multicore operating systems might be structured.

L3P09 Mutual exclusion with a semaphore
Let's look at a little worked example of mutual exclusion with a semaphore. If you're currently looking at just the simple still version of this slide, I will annotate it in a future release of this video. The slide shows three threads called A, B & C, and time runs from the top of the page downwards. On the left we see a semaphore, which is an integer value and a list. The integer value of the semaphore is initialised to one and the list of threads waiting on it is empty. This will show us how to achieve mutual exclusion: we only want one thread at a time to be in a critical section.
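The pattern each thread follows is simply this; a minimal sketch, using the wait and signal primitives sketched above, with the semaphore initialised to 1:

    semaphore mutex;            /* initialised to 1, with an empty queue         */

    void thread_body(void)
    {
        wait(&mutex);           /* down: proceeds at once if free, else blocks   */
        /* ... critical section: manipulate the shared variables here ...        */
        signal(&mutex);         /* up: wakes one waiter, or sets value back to 1 */
    }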
So the first thread, thread A, comes along; the semaphore has the value one in it. A issues wait, which decrements it to zero, does not block, and it enters its critical section. Then, while A is in the critical section, B comes along and wants to enter the critical section. So it issues wait, but it has to block because the semaphore is currently 0, so B goes on to the wait queue for this semaphore. Then yet another thread, C, comes along; it also issues wait and joins the queue as well. After a bit, thread A finishes its work and signals the semaphore. Because the queue is non-empty, one of the members of the queue is going to be taken off and made ready to run. The lucky thread is thread B, so thread B now unblocks from its wait call and can enter its critical section, and when it's finished, it issues its SIGNAL. Now thread C is given a wake-up. It runs its critical section and finishes with the SIGNAL. And because there is nothing waiting on the semaphore any longer, signal has the effect of incrementing the semaphore value back to its starting value of one. Hence we see that no two threads are ever in the critical section at once, which is what we want. Hurrah.

L3P10 CONDITIONAL SYNCHRONISATION
So now we'll consider a second primary use case for the semaphore. We'll have two threads that want to communicate with each other, and we'll initialise the semaphore to 0 this time. In this example we have threads A & B. A is some sort of server and it's waiting for work to be given to it by thread B. There are two possible situations which can arise as we start up the system. It could be that A becomes ready to serve before B offers any work, or it could be that B issues some work before A has started. So the left-hand diagram shows wait before signal, which is where A, our server, has started up and doesn't have anything to do at the moment, so it waits on the semaphore: it will block there because the semaphore is initialised to zero and no work is pending on it. It is put on the semaphore queue. Later on, B has some work to give to A, so it issues a signal on the semaphore, which unblocks A, and A will start working. When A has finished, it may typically loop round to the top of the diagram again. We might be in either of the left- or right-hand paradigms at that point. The other startup paradigm is the right-hand side: signal before wait. B issues work by signalling the semaphore, but A hasn't yet become ready to serve, or it's still busy with the last work item. So when B issues SIGNAL, the semaphore is incremented to one to show that one work item is ready, waiting to be served. A comes along and issues a WAIT on the semaphore, which already has work associated with it, so the WAIT is non-blocking and the value in the semaphore is decremented back to zero straight away. This basic pattern provides the baseline for a variety of resource-management and producer/consumer scenarios which we'll have a look at. Some more of them will be in the next lecture, but we'll look at one or two today.

L3P11 N-RESOURCE ALLOCATION
A third use of the semaphore is pooled resource allocation. We can also use semaphores to manage resource allocation where we have a pool of N equivalent instances of something, such as a printer. If we have N identical printers, then we will initialise our semaphore to N, and any job wanting a printer will issue wait on that semaphore.
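A minimal sketch of that pattern, again using the wait/signal primitives from above (the printer bookkeeping helpers claim_free_printer and mark_printer_free are hypothetical):

    #define N_PRINTERS 4

    semaphore printers;                 /* initialised to N_PRINTERS              */

    int acquire_printer(void)
    {
        wait(&printers);                /* blocks only once all N are in use      */
        return claim_free_printer();    /* hypothetical: record which one we got  */
    }

    void release_printer(int id)
    {
        mark_printer_free(id);          /* hypothetical bookkeeping               */
        signal(&printers);              /* wake a blocked job, or count one back  */
    }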
We'll also need auxiliary data structures for keeping track of which printers are currently in use, and so on, but let's ignore those details for the sake of this simple explanation. Provided the semaphore was greater than 0, a client will not block; it will simply decrement the semaphore, so this is a non-blocking acquisition of a resource. But after N clients have all acquired resources, all the printers will be in use, and the next client to try to wait on the semaphore will instead block and go on the semaphore's wait queue. When a printer has finished being used, the releasing client will signal the semaphore, which will either increment it or release a waiting client. And as I said, some auxiliary data structures are going to be needed in practice, but this shows the basic principle.

L3P13 PRODUCER-CONSUMER PROBLEM
So we've seen three basic design patterns: mutual exclusion, condition synchronisation and N-way resource allocation. Now we look at combining some of these in the producer/consumer examples. Later we'll consider a multitude of producers and consumers that asynchronously send work from the producers to the consumers using a shared buffer pool. It does not matter which consumer processes the work that came from any particular producer. But to start with, we consider a single producer and a single consumer. Now the buffer pool will have finite capacity, but we want the whole system to be lossless, and therefore we can't allow any buffer OVERFLOW problem. So we will have blocking reads and writes, or blocking enqueues and dequeues (puts and gets), on the shared buffer. The call which inserts (enqueues) something in the buffer will block if the buffer is full. The call or primitive that takes something out of the buffer (dequeues it) will block if the buffer is empty. [Later, for a general shared data structure, we'll consider the special case of multiple readers and a single writer for that data structure. The multiple-reader/single-writer paradigm crops up quite a lot, and commonly there are special, slightly optimised primitives for dealing with this situation.] Ignore work stealing as mentioned on slide 13 for now. The principle of work stealing is that consumers search around many queues that might contain work. Maintaining FIFO order is not necessarily part of the producer-consumer problem in general; it also applies to LIFO stacks. But we here discuss a FIFO example.

L3P14 PRODUCER-CONSUMER PSEUDO SOLUTION
A common structure used for a FIFO queue is the circular ring buffer. Logically, items progress through the buffer from the input to the output, but in reality nothing moves. Instead we keep track of input and output pointers. The buffer itself consists of a statically allocated array. Here it just contains integers, and it's called buffer. We have a couple of integer pointers, IN and OUT, which are both initialised to 0; when they both have the same value, the buffer is empty. The IN pointer refers to the location where the next item is to be inserted, and OUT refers to the location where the next item to be taken out is currently residing. We need two methods to operate on the queue: one to insert an item and the other to take an item out, and, as said, both are potentially blocking. The pseudocode here has a bit of natural language highlighted in it, but the rest of the code is fairly straightforward. What we have illustrated here is in fact the complete thread, with its outer loop, for both the producer and the consumer.
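For reference, a sketch of that pseudo-solution in C-like form (the highlighted natural language is left as comments, and get_next_item and consume_item are hypothetical stand-ins for whatever generates and uses the data):

    #define N 8                          /* buffer capacity                    */

    int buffer[N];
    int in  = 0;                         /* where the next item is inserted    */
    int out = 0;                         /* where the next item is taken from  */

    void producer(void)
    {
        while (1) {
            int item = get_next_item();  /* hypothetical source                */
            /* ... wait until there is space in the buffer ... */
            buffer[in] = item;
            in = (in + 1) % N;           /* advance modulo the buffer length   */
        }
    }

    void consumer(void)
    {
        while (1) {
            /* ... wait until there is an item in the buffer ... */
            int item = buffer[out];
            out = (out + 1) % N;
            consume_item(item);          /* hypothetical sink                  */
        }
    }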
An infinite while loop is called an 'eternal loop', and in this context a 'process loop'. In the producer thread we alternately get the next item from whatever is generating it, and then we invoke the method for inserting it in the queue, which is the next three lines of code, here shown inlined. Similarly for the consumer thread: we dequeue an item using some sort of method call, which again has been inlined for conciseness, then we consume the item and then we get the next one. After we've used the current IN or OUT pointer, we increment it modulo the buffer length, and it's that modulo effect which gives the circular buffer, or ring FIFO, its name. Rather than providing a general method for inserting and removing items from the buffer, we've inlined the code in the producer and consumer threads.

L3P15 OO-STYLE PRODUCER-CONSUMER FIFO
[This slide was added, I think, in 2023, and so slide numbers might be out w.r.t. the videos for the remainder.] The buffer code here has been refactored as an OO class. The code shape is different, but the executed code is identical. This OO style is what's often used in practice, especially w.r.t. the MONITOR programming style mentioned later. Note: both exported methods are blocking. Our method implementations will shortly be generalised to be re-entrant.

L3P16 PRODUCER-CONSUMER SOLUTION
Let's look at the working code for producer/consumer using two semaphores to replace the pseudocode. The two semaphores are called SPACES and ITEMS. SPACES is initialised to N and ITEMS is initialised to 0. The `if there is space' condition is replaced with a WAIT on SPACES, and the `if there is an item' pseudocode is replaced with a WAIT on ITEMS. In the first instance, the wait on SPACES will not block; it will just decrement the SPACES semaphore. Whereas the wait on ITEMS will block the thread, because the ITEMS semaphore is 0. We also, of course, have to have a SIGNAL: you can't have a semaphore which doesn't have both some waits and some signals on it. So we SIGNAL the ITEMS semaphore after we have inserted an item, and we SIGNAL the SPACES semaphore after we've removed an item in the consumer thread.

L3P17 PRODUCER-CONSUMER SOLUTION
So does this work? It seems that we've replicated some state. We have three different ways of representing how many items are in the buffer: the SPACES semaphore, the ITEMS semaphore, and the difference between the IN and OUT pointers. You might think that that is a terrible replication of state, and replication of state tends to lead to bugs. Now, we don't have any explicit mutual exclusion illustrated here, and actually it's not needed because of the way the examples have been coded: there is exactly one producer and exactly one consumer. So the methods for enqueue and dequeue are never entered by more than one thread at once; they do not need to be re-entrant. A REENTRANT method is one where potentially two or more threads can be executing it at once. So there's not really any race condition, for instance in the producer thread, between where we read the value of IN to index the buffer and where we update it and assign it a new value. Likewise for the consumer thread. On the replication of state: it's not completely redundant, because in fact the situation where IN equals OUT represents two cases. One is that the buffer is empty, which was our initial case, and the other is where it's full. When it's empty, the consumer will sleep waiting on the ITEMS semaphore, and when it's full, the producer will sleep waiting on the SPACES semaphore.
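Putting that together, here is a sketch of the single-producer, single-consumer solution with the two semaphores (buffer, in, out and N as declared in the previous sketch; get_next_item and consume_item remain hypothetical):

    semaphore spaces;                    /* initialised to N: free slots       */
    semaphore items;                     /* initialised to 0: items available  */

    void producer(void)
    {
        while (1) {
            int item = get_next_item();
            wait(&spaces);               /* block if the buffer is full        */
            buffer[in] = item;
            in = (in + 1) % N;
            signal(&items);              /* one more item is now available     */
        }
    }

    void consumer(void)
    {
        while (1) {
            wait(&items);                /* block if the buffer is empty       */
            int item = buffer[out];
            out = (out + 1) % N;
            signal(&spaces);             /* one more slot is now free          */
            consume_item(item);
        }
    }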
And, as said, IN and OUT are only accessed in one thread each, and so there aren't any races in how they're used. This works nicely enough, but we will need to generalise it to consider multiple producers and/or multiple consumers.

L3P18+19 GENERALIZED PRODUCER-CONSUMER
So far we've had exactly one producer thread and exactly one consumer thread. More generally, we might have many threads: multiple consumers, multiple producers or both. If so, then we will need explicit mutual exclusion. We need to make the methods, as I called them, for enqueuing and dequeuing THREAD SAFE. The easiest way of doing that is to put mutexes around their bodies so that they become critical regions themselves. This will stop two consumers trying to consume the same item, and it will stop two producers trying to store their items in the same slot. We know that semaphores can act as mutexes, so we can do this with perhaps one or possibly two additional semaphores. So the generalised producer-consumer solution for any number of readers and writers is to put critical regions around the manipulations of the IN and the OUT pointers. We call our additional semaphore GUARD, and we will initialise it to 1, as normal for mutual exclusion. In the producer, after we've waited for SPACES, we now need to wait for the guard to acquire the lock, and when we've done our critical code, we can signal the guard. Similarly for the consumer thread: we'll acquire the guard before our critical section, which is after we've waited for an item to exist. And then when we've dequeued that item, we will signal SPACES and release the mutex, in either order (a sketch is given below, after the summary). [In a later lecture, we'll look at how to phrase this code as a monitor and also write lock-free data structures which could avoid some of this, but that's for the future.] Of course, this has made simultaneous enqueue and dequeue impossible, because we've shared a single mutex over both the producer and the consumer methods, so we leave it as an exercise to perhaps add one more semaphore and avoid that situation.

L3P20 SEMAPHORES SUMMARY
Let's see what we've learned about semaphores. They are a powerful abstraction for implementing concurrency control. We can use them for mutual exclusion and condition synchronisation. They're much better than naive use of the read-and-set instruction, because the operating system is involved and will avoid the time wasted during spinning. The processor will be given something more useful to do, or else halt, saving electricity. [The halt instruction on a processor makes it stop until the next interrupt comes along.] This has been a fairly low-level presentation for clarity purposes. Typically these low-level operations will be hidden, or certainly made easier to use, by various high-level language constructs. In our presentation here, we've relied on the programmer manually associating a semaphore with what needs to be synchronised or excluded. There are various possible programmer errors, such as forgetting to wait before accessing shared data, providing a code path that doesn't SIGNAL afterwards, or simply using the wrong semaphore in a given context. The situation will only get more and more difficult with more complex programs. So, generally speaking, we should use higher-level abstractions. There are different primitives that we might use, and there are high-level language constructs which automatically deploy these primitives for us. Both of these techniques will help reduce programmer error.
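For reference, here is a sketch of the generalised producer-consumer code from slides 18 and 19, with the guard semaphore added around the pointer manipulations (declarations as in the earlier sketches; guard is initialised to 1):

    semaphore guard;                    /* initialised to 1: mutex over IN and OUT */

    void insert_item(int item)          /* may be called by many producers         */
    {
        wait(&spaces);                  /* wait for a free slot                    */
        wait(&guard);                   /* enter the critical region               */
        buffer[in] = item;
        in = (in + 1) % N;
        signal(&guard);                 /* leave the critical region               */
        signal(&items);                 /* announce the new item                   */
    }

    int remove_item(void)               /* may be called by many consumers         */
    {
        wait(&items);                   /* wait for an item to exist               */
        wait(&guard);
        int item = buffer[out];
        out = (out + 1) % N;
        signal(&guard);
        signal(&spaces);                /* announce the freed slot                 */
        return item;
    }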
To return to the point about different primitives: in POSIX Pthreads, which is perhaps the most mainstream concurrency primitive set of recent decades, there is no semaphore; there's just a mutex and various other facilities provided. Semaphores may still be used inside the operating system, especially in legacy code.

L3P21 MUTUAL EXCLUSION AND INVARIANCE
An invariant is a property that is declared to always hold (is always true). But in reality that property might temporarily not hold while something is being manipulated. Typically we put the manipulation inside a critical section so that nothing else can look at the situation while the property doesn't hold; so we're using locking to avoid exposure of inconsistent transient states. An example might be incrementing the real-time clock and calendar. When it gets to midnight, we need to change both the time and the date. If we can't atomically store both fields on our processor architecture, then we have to change one after the other. If we use a simple carry technique, then we may have updated the time but not yet changed the date when somebody reads them, and they will see the time having apparently jumped back nearly 24 hours when it wasn't supposed to. So our invariant must hold at the time the mutex is acquired, can be violated while we hold the mutex in the critical region, and must be restored before the mutex is released. Now take a doubly-linked list. The invariant is that an entry is either in the list or not in the list; but, given different pointer chains for forwards and backwards, we cannot update them both at once on a uniprocessor, and on a multiprocessor we can't either --- at least not reliably, and not without lots of other synchronisation overheads. So we may need either to use creative data structures which avoid this, which we'll touch on in another lecture, or else to have a critical region around the update, covering the situation where a component is still on the forward list but has been removed from the backward list, or vice versa.

L3P22 SUMMARY AND NEXT TIME
So we've looked at implementing mutual exclusion using low-level hardware support for atomicity, and we've also briefly discussed inter-processor interrupts. Spinning isn't a particularly efficient way of proceeding because of the time wasted in a spin. It also wastes energy, which may be important if you're running off a battery, or if you care about the planet. So tying the synchronisation primitives into the operating system scheduler is the best way to go, and the semaphore is the baseline example of how to do that. We saw how to use semaphores for condition synchronisation, resource allocation and mutual exclusion. We looked at the two-party and generalised producer-consumer relationships. And we spoke about the role of invariants in data structure design, and why it may be appropriate to have a lock around a data structure while it is temporarily violating its invariant. Next time we'll extend to the multiple-reader, single-writer model of shared data structures, look at some alternatives to semaphores and locks, and look at concurrent primitives in practice. We'll also talk about starvation and fairness, because it's all very well having a system that doesn't have any races, but we do need to make sure that everything runs and gets some service. Thank you. END