CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 4 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory. CCR, monitors, and concurrency in practice.

Last time we looked at building mutual exclusion out of hardware primitives directly and briefly looked at the interprocessor interrupt. Then we considered semaphores: how to use them for mutual exclusion, condition synchronisation and resource allocation. We looked at the generalised producer-consumer situation with a FIFO queue between them. We considered a single producer and a single consumer, and then multiple producers and multiple consumers. And we touched on invariants and their relationship with mutex locking. We looked at the semaphore and saw that it's better than directly using the hardware primitives, but that considerable care is needed. It's easy for the programmer to get it wrong and to forget one of the wait or signal primitives, especially in code with multiple flows of control. So semaphores are not widely used in application code these days, but they may be found inside thread-safe implementations of common library components and in the operating system kernel itself. They say what to do, but they're not strongly tied to any particular programming style.

THIS TIME Today we'll look at the multiple-reader, single-writer situation and return to producer-consumer. We'll look at various codings of these design problems, and we'll look to see how they vary in terms of fairness, or whether they can suffer from starvation problems. Also, we'll move on from looking at semaphores and simple mutexes to some more sophisticated design patterns. Patterns that are more likely to crop up in user code. User code tends to be written in block-structured languages, certainly for imperative programs. So we'll look at some of the history and some of the current practice for coding in such languages, touching on conditional critical regions, the MONITOR concept and CONDITION VARIABLES, which I should say straight away can sometimes better be thought of as 'CONDITION QUEUES'. And we'll look at two alternative semantics for what to do when synchronising between different tasks inside a MONITOR: that's the signal-and-wait versus the signal-and-continue semantic. We'll look at the POSIX Pthreads API, which is a widely used if somewhat primitive set of primitives. And that will be it for our discussion of primitives.

L4P05 Multiple-Readers Single-Writer (MRSW)

The multiple-reader, single-writer paradigm crops up a lot. It's very common for a data structure to be updated from one particular source, but once updated, many readers are wanting to have a look at what's contained in there. It's a bit like a notice board with a single person allowed to update it, but many people looking at it. We were talking about the FIFO or LIFO queue structure previously, but for these examples we'll just think more of a notice board or a shared data structure, regardless of its functionality. So a typical example use would be a cache where data has been fetched from a remote source by a worker, and then it's shared out by multiple local readers, such as entries from the Domain Name Service. The DNS maps textual names of machines and resources to IP addresses and port numbers, and that sort of thing. So with a cache or a notice board, multiple readers can be reading it simultaneously. They don't need to be synchronised with each other in any way. But the data structure consists of multiple words and it cannot be updated atomically by a single store.
So one technique is to have two copies and a pointer to say which one is the active one, and then that pointer can indeed be atomically updated. That's sometimes called the SHADOW PAGE mechanism, but we won't consider that here. Our MRSW situation will use a real data structure, one copy only, stored in memory. It might just be two words long, or it could be a megabyte long; that doesn't matter. The important aspect is that no readers are allowed to see it in an inconsistent state while the writer is halfway through changing it. Remember invariants from last time. So the writer must have exclusive access during updates. The obvious and preferred implementation is to have two locks: one is a read lock and one is a write lock. A simple implementation will use two semaphores. The first semaphore is used for writing and it just acts as a mutual exclusion lock. Any writer will have to claim that write lock before it can write. Now I've just said any writer, but we're talking about a single writer. Isn't that a contradiction? What we mean is that there's a single writer updating at any one time, but there could be a pool of contending writers that take it in turns to perform updates. And the second semaphore will protect another variable which counts the number of readers currently reading. That number will be incremented each time a new reader comes in, and decremented when a reader leaves. The first reader that comes in will essentially acquire a mutex, stopping any writers, and that mutex will only essentially be released once all the readers have stopped reading and have all left.

L4P06 Simplest MRSW solution

So our simple solution uses the number-of-readers variable NR, initialised to 0, and two semaphores, one for reading and one for writing, which are both used as mutexes, so they're initialised to 1. The code executed by the writer thread is fairly straightforward. It essentially has to acquire the write mutex by waiting on the write semaphore. It then performs its updates to the data. And finally, it releases the mutex that it holds by signalling the write semaphore. The reader side is slightly more complicated. We have two critical regions: one as a thread starts to read and the other executed when the thread has finished its reading activity. Each critical region is guarded by the READ semaphore. We can see the two shaded areas bracketed with WAIT and SIGNAL on the reader thread. The area marked as "dot dot ellipsis read data" is not a critical region in itself. Any number of threads can be in there. It's not subject to exclusion, except that no writing is allowed. The writer thread is essentially prevented from doing any writing because the readers have grabbed the write mutex. That mutex was acquired when the first reader came in and it's not released until the last reader has gone out. When the first reader came in, we incremented NR to 1, and when the last reader goes out, we decrement NR to 0.

L4P07 Simplest MRSW solution

Consider this implementation to see what's good and what's bad about it. Well, it's correct. It does what we want. Only one writer will be able to access the data structure, but provided there is no writer, any number of readers can have access to it. However, the writers can starve: if the readers continue to arrive, there might be no occasion where there aren't any. This will make the writer wait, potentially forever, a situation that we call starvation, STARVATION OF SERVICE.
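As a concrete illustration, here's a minimal sketch of that simple scheme in C, using POSIX semaphores as mutexes (the names nr, rsem and wsem are illustrative, and error checking is omitted):

    #include <semaphore.h>

    static int nr = 0;         /* number of readers currently reading      */
    static sem_t rsem, wsem;   /* both used as mutexes, initialised to 1:  */
                               /* sem_init(&rsem, 0, 1); sem_init(&wsem, 0, 1); */

    void writer(void)
    {
        sem_wait(&wsem);       /* acquire exclusive write lock */
        /* ... update the shared data ... */
        sem_post(&wsem);       /* release it */
    }

    void reader(void)
    {
        sem_wait(&rsem);       /* critical region protecting nr */
        if (++nr == 1)
            sem_wait(&wsem);   /* first reader in: lock out writers */
        sem_post(&rsem);

        /* ... read data: any number of readers may be here at once ... */

        sem_wait(&rsem);       /* critical region protecting nr */
        if (--nr == 0)
            sem_post(&wsem);   /* last reader out: let writers back in */
        sem_post(&rsem);
    }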
Well, what would be fairer? A more equitable solution? Perhaps the writer needs only to wait for all the current readers to exit. This essentially gives it a token or position in a round-robin schedule. It's then going to be serviced eventually, assuming that each reader does itself exit after some time; they need not all exit simultaneously, which is the key point.

L4P08 A fairer MRSW solution

This fairer solution can be implemented with one additional semaphore. We've called the new semaphore TURN, and that too is initialised to 1, so it behaves a bit like a further mutex. As a writer arrives, it acquires the TURN mutex, and it will ultimately release that when it's finished. But we see that all readers check the TURN semaphore and they just stop if it's held. So all readers that arrive after TURN has been grabbed by a writer are held up and they won't increment NR. But existing readers are free to continue working, doing their read data and finally decrementing NR.
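Continuing the sketch from above, the fairer variant just adds the turn semaphore (again illustrative names, not the slide's exact code):

    static sem_t turn;         /* also initialised to 1 */

    void fair_writer(void)
    {
        sem_wait(&turn);       /* claim our turn: new readers now hold back */
        sem_wait(&wsem);       /* wait for the existing readers to drain out */
        /* ... update the shared data ... */
        sem_post(&wsem);
        sem_post(&turn);       /* let the held-up readers through again */
    }

    void fair_reader(void)
    {
        sem_wait(&turn);       /* held here while a writer has claimed its turn */
        sem_post(&turn);       /* otherwise pass straight through */
        /* ... then exactly as reader() above: adjust nr under rsem,
               read the data, and adjust nr again on the way out ... */
    }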
L4P09 Conditional Critical Regions

Now we move on and consider higher-level language support for concurrency. What we want to do is make the block-structured nature of the imperative programming language more closely reflect and tie in with the synchronisation primitives. This should give hints to the programmer, or indeed make the concurrency primitive invocation completely automatic. What we need to do is, at least logically, and perhaps physically, associate a lock with the data that it's protecting. And then the methods that have access to that data will also somehow be associated with that lock. In an object-oriented programming style, this is all fairly natural. So one early solution, in the 1970s, was the conditional critical region. We start to use the compiler for the HLL to help us, and we are able to design new language constructs and new phases in the compiler processing to assist with concurrency. This code fragment in some perhaps fictional language illustrates some of the concepts that we want to develop. We see the declaration of three variables A, B and C, and they're marked as shared. In general, it's very helpful to a compiler to know which variables are going to be shared between threads. Of course, that can also be worked out by static analysis if we have the complete source code for the program at the time we compile, or, failing that, some ancillary declarations. But here we're making it explicit in the language. That may be good style as well, because it might enforce better programmer discipline as well as helping the compiler. Our fragment shows that a region of code in the block-structured style is going to make use of certain shared variables. We've listed the variables A and B as part of the region syntax. (Again, you could say maybe the compiler could work that out.) The code in our critical region uses a primitive called AWAIT which supports unrestricted conditions as its predicate. A predicate will return true or false. Presumably the thread is allowed to proceed when it evaluates to true. Now, as an aside, putting arbitrary conditions in an AWAIT primitive is a very problematic thing to do, which we will return to. But I'll just flag it now. The system scheduler needs to be very efficient and it needs to be able to work out what's ready to run very quickly. Otherwise we'll have overhead on each context swap. So if we're allowed to put arbitrary conditions in our wait primitives, then we might be giving the operating system scheduler a somewhat arbitrary amount of work to do, potentially unbounded in processing delay terms, before it can decide what to run next. So this would generally be a bad design style. It also means, potentially, that the operating system must be able to evaluate arbitrary code in some high-level language, but we won't go into that too much. Now let's assume it's a fairly simple-to-evaluate predicate, such as whether a variable is zero or non-zero. The precise semantics of this fictional example are not clear, but presumably we have mutual exclusion inside this region. Perhaps the exclusion is defined so that no two regions that name a variable in the shared pool can be active at once, so this region will be exclusive with respect to any other code that names A and B in its region statement. The next slide might show us a little bit more about what the expected semantics need to be.

L4P10 CCR example: Producer-Consumer

The slide shows just two instances of threads, one for the producer, one for the consumer. But now that we're using critical regions for the operations, we could have any number of instances without a problem. There's no explicit lock, but the system will implement the required locks, given that we have listed the variables that are going to be updated in the constructs. All of the use sites list the same three variables in, out and buffer, so the system might be smart and allocate one shared lock for that set of three rather than three individual locks. And we're also using a powerful AWAIT primitive where an arbitrary predicate can be given as its argument. So this looks like a definite step up. The locking primitives are being handled by the high-level language compiler. This is a pretty elegant coding style. There's clearly a space of possible designs. Here we've asked the programmer to explicitly list the variables that are going to be locked against, and we also had the "shared" modifier on the variable declaration. And it can be argued over how much of that should be manual, how much of it should be explicit in what the programmer writes, and how much the compiler could statically and automatically determine and infer. But let's look at this AWAIT construct. If there are no other threads in the critical region and the expression evaluates to true, then our thread can directly enter the critical section without delay. That's nice and efficient. But the expression might not hold, or there might be other threads in the region. Supposing the expression does hold, but there are other threads in the region. Well, we're now going to have to suspend until those other threads have left. So I wonder, will that expression still hold at that later time? Should we check it again? And if there's considerable work in evaluating the expression (there could be trigonometric functions in there, for instance) then how frequently should we re-evaluate the expression? Ideally we need to have further static analysis which looks at the SUPPORT of the expression. The phrase "support" is used in compiler design to refer to the free variables of the expression. Essentially, we need to check whether any of those have changed value, because then it would be a good time to re-evaluate the expression. If you're learning Verilog at the moment, you'll have come across the "always_comb" construct, which does something exactly like that, but here it would be for general software application.
We can't expect the system scheduler to frequently be evaluating arbitrary bits of user code. Indeed, what if the expression has side effects? But the implementation that's implicit to this kind of AWAIT construct is for the operating system to re-enable the thread so that the expression can be evaluated and, if it doesn't hold yet, for the thread to go back to sleep and control to be surrendered back to the scheduler. And then it will have to be tried again at some future time. Basically we're in a spin lock with an expensive predicate to spin on. Now, sometimes these situations can't be avoided; that is indeed what we need to do. But the general approach in modern language design is not to have such a primitive available in the language. (The await primitive in C# is basically a join.) For these rare situations, we ask the programmer to write an explicit while loop and to implement the repeated re-evaluation manually and as efficiently as they can. (We'll also use similar while loops as a basic paradigm in signal-and-continue monitor programming.) As a mainstream solution for the baseline, we want something a bit simpler than the generalised AWAIT. The answer perhaps is going to be CONDITION VARIABLES, which we'll look at in conjunction with MONITORs. (But straight away, let me give you a warning that condition variables do not store any user data. They're not variables, really.)

L4P12 Monitors

An approach that remains popular and relatively successful is called the MONITOR. Of course, that word is used for many things in computer science, but we're talking about concurrency monitors in this sense, and they're not monitoring anything really. They're not giving us a report: it's a slightly different use of the word; they are guarding against illicit concurrent activity. With the CCR we might have had our bits of code that were mutually exclusive dotted around the system. In the monitor approach, we're going to collect all the code that operates on a specific shared data structure together and have one mutex associated with each instance of it. I say the word "instance" because I'm naturally thinking about object-oriented programming and the instance is of a class. The monitor construct fits very well with object-oriented programming. Indeed, the first object-oriented programming language was SIMULA, which was for highly concurrent simulations of multi-agent systems. (Recall we spoke about these being good candidates for user-space threads packages in L1.) In OO terms, our monitor is an instance of a class which has some methods, but at most one thread may be running at any one time across the bodies of all those methods. But there will still be situations where a thread needs to pause waiting for a condition: for this we'll need a further synchronisation technique, the condition variable. The basic idea is that only one thread can ever be executing within the monitor at one time, but others can be somehow paused inside the monitor. Firstly, we need a mutex associated with each monitor instance. When a thread calls a method in the monitor, if another thread is active in the monitor, it will fail to acquire that mutex and it will block waiting for entry to the monitor. When the current thread that's in there stops (notice I said "stops" and not "exits") there are two possibilities: one is that it indeed exits and the other is that it pauses nominally inside the monitor.
But the essential aspect is that all code written inside a monitor can be written and coded as though it were single-threaded (i.e. non-reentrant). Hence, if we're updating two parts of the data structure inside a monitor, we don't need to worry about which order we do them in, or about invariant-breaking intermediate states --- no other thread will be reading them at the same time. And if we're reading the data held in the monitor, which is sensibly going to be the fields of the object (although we could adopt other coding rules), then we can be sure it won't change while we're reading it. The Java language has monitors, and we turn a class into a monitor using the "synchronized" keyword. That keyword can be put in various places inside the definition of a class. These placements vary in, for instance, whether all instances of the class share the same lock, or whether it is on a per-instance basis, and that sort of thing. But we'll concentrate on the mainstream theory before looking at any Java peculiarities.

L4P13 Example Monitor syntax

Here's an example definition of a monitor in some fictional language. We have a special keyword monitor for declaring it. We have braces to delimit what goes inside it. We have this strange foo in angle brackets there. What does that mean? Well, I didn't write this slide, but I can imagine that three things need to be in there potentially. The first would be to give the monitor some sort of instance or class name. The second would be some sort of template type parameter, inspired by C++ and Java-like template syntax (or generics, if you want to call it that). But thirdly, it might be some sort of name for the mutex to be used, if we're going to be explicit about it. But any decent implementation will make this automatic to a certain extent, with a default, for instance, that every instance gets its own mutex, and we'd only need further annotation in here if we want to deviate from that. Otherwise, the contents of the monitor are fairly straightforward. We essentially have three different forms in here:
- We have declarations of the variables that are going to be protected by the monitor. These could be static variables or they could be per-instance variables,
- Then we have the methods that operate on those variables, defined as normal,
- And then we have some sort of constructor and initialisation code for the values of those variables.
So that's all seemingly very straightforward. At most one of these procedure bodies can be active at any one time. There's a slight semantic question about whether one procedure in a monitor can call another procedure in the same monitor without re-acquiring the mutex. It's most sensible if they can. Otherwise you would have to manually code so that the common code is factored out into a free-standing subroutine that's a private method of the monitor, and share it in that sort of way. But actual implementations do vary in how they interpret that rule.

L4P14 Condition Variables (Queues)

Associated with every monitor there is a mutex somewhere. But as we've seen, mutual exclusion isn't the only primitive needed. There are other concurrency primitives required for synchronisation between communicating tasks. The monitor as so far described just provides us with the mutual exclusion. So what if a thread enters a monitor but finds that the state inside the monitor isn't ready for what it wants to do, and it wants to suspend itself?
Well, clearly it could leave the monitor with a specific return code and rely on its caller re-invoking it again at some point in the future. But this would basically take us back into a spinning arrangement. The information about when it should be re-invoked is probably inside the monitor. To facilitate this, we have the concept of the CONDITION VARIABLE. Now, CONDITION VARIABLE is the historic name for this construct. But it's not really a variable that the user can store any information in, and it's not a Boolean that becomes true when the condition of interest starts to hold. So possibly a better, more informative name is the CONDITION QUEUE. It's a variable only in the sense that it logically refers to the head of some sort of queue of waiting threads. The condition variable is explicitly declared and used by the programmer. It doesn't have an integrated counter: there is no value that can be incremented and decremented like in a semaphore. Instead, it supports three operations. So remember, these condition variables (aka condition queues) are inside a monitor, so they are always logically associated with an instance of a mutex for that monitor. There can be multiple condition variables inside a monitor, but there's only one entry mutex for the monitor (instance) as a whole. The first operation is to WAIT on a condition variable. This is a voluntary yield of the current thread --- the thread that invoked WAIT. The thread becomes suspended and it's added to a logical queue which is associated with the condition variable. But these queues are typically opaque to the user programmer, much in the same way as the queue associated with a semaphore is (normally) hidden from the programmer. The critical thing about the wait on a condition variable is that the monitor's mutex is released at the same time as the current thread suspends. This means that another thread that is waiting to enter the monitor, or another thread that comes along shortly afterwards, is able to proceed and enter the monitor despite there being one or more threads in the monitor already waiting on the condition variable. The next operation on a condition variable is to SIGNAL it. When a thread calls SIGNAL, it may suspend itself or not. We'll discuss that divergence shortly. But regardless of that, the side effect of calling SIGNAL is that if there is another thread waiting on the condition variable, then that thread is unblocked and can continue running inside the monitor. If there's more than one thread blocked on the condition variable, SIGNAL will normally just release one thread. There's another variant called BROADCAST which would unblock all of the threads waiting on a particular condition variable. BROADCAST is sometimes called SIGNAL_ALL. And as we'll see, for some monitor systems, the distinction between signal and broadcast should not be completely relied on by the programmer, and defensive programming should be deployed.

L4P15 Monitor Producer-Consumer solution?

Let's see if we can use our monitor for producer/consumer. We'll now switch over to showing the methods for enqueue and dequeue (or produce and consume, as labelled on some slides) explicitly. As with the OO FIFO slide last time, this makes it clear that they can be called by a multitude of different calling threads. Again, we'll use the ring FIFO with IN and OUT pointers: the circular buffer. Our monitor contains the two methods, the buffer and the IN and OUT pointers. It also contains the condition variables not_full and not_empty.
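The slide's code is in a fictional monitor language; a C-like reconstruction of what it presumably shows (details such as the exact siting of the SIGNAL calls are my guess, guided by the discussion of slide L4P19 below) is:

    monitor ProducerConsumer {
        int buffer[N];                 // the ring buffer
        int in = 0, out = 0;           // free-running pointers, used modulo N
        condition not_full, not_empty; // condition queues, initially empty

        procedure produce(item) {
            if (in - out == N)         // buffer full?
                wait(not_full);        //   then block until space appears
            buffer[in % N] = item;
            signal(not_empty);         // wake a blocked consumer, if any
            in = in + 1;               // (signal sited before the increment: see L4P19)
        }

        procedure consume() {
            if (in - out == 0)         // buffer empty?
                wait(not_empty);       //   then block until an item appears
            item = buffer[out % N];
            signal(not_full);          // wake a blocked producer, if any
            out = out + 1;
            return item;
        }
    }

Both the 'if' before each WAIT and the placement of the SIGNALs are features of the slide that the rest of this lecture picks apart.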
But for ease of coding in this example we have made the increment operations not wrap on N (they're not taken modulo N); instead, we're just taking the pointers modulo N at the point where we use them. So this isn't a real-world implementation, because they will eventually overflow, but it serves to illustrate the point nice and concisely here. The condition variables will both be initialised by the system to be empty queues (their condition queues). (If you have an old version of these slides, these condition variables are initialised to true and false respectively. That's a complete nonsense. I've deleted that now. Somebody else who wrote the slide thought that you might be storing values in a condition variable. And actually, it's quite helpful for me to make it completely explicit that we don't store values in these condition variables, and that it's a common pitfall to think that you might.) Consider: a thread comes along and wants to enqueue the first item. The in and out pointers will both initially be 0, so we won't be waiting on the not_full condition variable. We store our item and increment the in pointer. The composition of these two operations is intrinsically atomic because of the exclusivity (the mutual exclusion) provided by being in the monitor context. We can see that the enqueue/produce method also signals the not_empty condition variable, so any thread that's blocked on that will potentially be woken up, which is going to be very helpful if something's already waiting to get something out of this buffer. And also in the enqueue/produce method we have a call to WAIT at the start, for when the buffer is already full. In other words, if the buffer is already full, we want to stall the caller thread. We don't want to enqueue and overrun the buffer. Hence we call WAIT, which will suspend the enqueueing thread. Turning now to the consume method, this is going to check whether the buffer is empty, and that will be determined by the in and out pointers having the same value, the difference being 0. If it is empty, we will WAIT for the not_empty condition. And it's very important to note that we've got an 'IF' statement there: if IN - OUT is zero, then wait for not_empty. We'll look at that 'IF' in more detail shortly, but it makes sense for the moment. What we're saying is that when we unblock from that WAIT, if we've blocked at all, then the condition that we were waiting on holds: i.e. IN is different from OUT and there's something for us to dequeue. So now we're running. We can do the logical dequeue operation, which is to read out the item addressed by the OUT pointer and then increment that pointer. We then return the item. We see also that if the queue were full, then we can SIGNAL the not_full condition, and then any threads waiting to store something in there will be unblocked and they can continue making their stores. So this all looks very nice, but does it work?

L4P16 Does this work?

Does the monitor-based producer-consumer solution work? Well, it's going to depend on the precise semantics of WAIT and SIGNAL. Our first problem potentially arises when one thread in a monitor unblocks another one in the monitor that was waiting on a condition variable. Let's look at a scenario where that's going to happen. Putting aside the current example, just imagine a monitor in general with two threads in it. T1 enters the monitor and calls WAIT and suspends itself. Now T2 will enter and invoke SIGNAL on the condition variable that T1 was waiting on.
T1 is unblocked, capable of running again. But we can only have one thread active inside a monitor! Two threads running inside the monitor would be very, very wrong. How can we change the semantics of WAIT and SIGNAL to avoid this problem? There are two possible semantics that are going to work. One is called signal-and-continue, and the other is called signal-and-wait. With the signal-and-continue semantics, T2 is going to continue running after its signal and T1 will start to operate at some point later. And there's no guarantee how much later that will be. Other things could go on in the meantime: the state of the shared variables may be adjusted. And what if several things were woken up? Well, again, there's no guarantee on the order that we can or should rely on. So this, on the face of it, doesn't sound that good. But I think we'll come to like it in the long run. The alternative semantics is called signal-and-wait. This makes the SIGNAL operation blocking, and the thread which issues SIGNAL is going to start waiting while the unblocked thread continues instead.

L4P17 Signal-and-Wait ("Hoare Monitors")

Let's explore signal-and-wait a bit further. These are called "Hoare Monitors" after Tony Hoare, who is a member of this department, so they ought to be good, you'd have thought. The situation's going to get a bit more complicated in terms of the number of queues involved. Let's give them names. We call the entry mutex for the monitor as a whole E. We've been using the letter C to refer to one of the condition variables in our example monitor. We're going to need a further condition queue (condition variable) which we'll call S. The entry queue operates as we'd expect: if the monitor is occupied, threads are going to wait in the entry queue E. This may not necessarily be FIFO, of course; it could be based on priorities or something else, but at least it should be fair, meaning that everything gets serviced eventually (as we discussed in a previous lecture). And fair enough, if a thread T1 waits on a condition variable C, then it gets added to the queue for C. That's straightforward. So now T2 comes along and it SIGNALs, waking up T1. We want to implement the signal-and-wait semantics, so we're going to add T2 now to a new queue S, which is logically in some sense ahead of the entry queue E. By making this transfer of the thread from one queue to another, we have stored the fact that it's been signalled. T1 then continues and eventually exits, or perhaps WAITs on something else. When the monitor has eventually become free of anything running inside it, we look at what's waiting to come in. Normally we would just look at the entry queue E, but now we have the additional, higher-priority queue S. We serve that in preference, and only dequeue from the entry queue E when S is empty.

L4P18 Signal-and-Wait pros and cons

Let's look at the pros and cons of the signal-and-wait semantics for a monitor. When we signal a thread, it's going to wake up straight away. So if we signal that thread when a particular condition holds, THEN THAT CONDITION WILL STILL HOLD AS THE THREAD WAKES UP. And that's really what we want, because we're using these condition variables to communicate the fact that a condition holds between one thread and another, and this will do it successfully. But on the downside, we have the additional queue of threads called S, and it can be a bit more difficult to reason about: a call to SIGNAL may or may not result in a context switch. This turns out to be a real pain.
The key attractive idea of a monitor is that we have effective exclusion during a method body. We are not expecting to be preempted or to be blocked. We can write code that knows that all the side effects it creates will be seen by other threads atomically. Well, under signal-and-wait this isn't really going to be the case. If calling SIGNAL causes a potential context switch, that paradigm gets broken in a big way. We should be using the concept of invariants to guide our programming, and such invariants should hold at each context switch. So we'll need to make sure that we restore invariants if we've tampered with them before we call SIGNAL, and we'll have to look very carefully at where we put SIGNAL in our control flow.

L4P19 Monitor Producer-Consumer solution?

If we turn to look at the producer-consumer example, we might see a problem. Where are the problems in this monitor implementation of producer-consumer? Well, if we look at the siting of the two calls to SIGNAL, neither has been placed at an optimal point. I guess this was deliberate on the part of the slide writer. You might be inclined to avoid this bug if you are a more mature programmer. One call to SIGNAL is between the use of the IN pointer and its increment, and the same with the use of the OUT pointer and its increment. At both of those points, our fundamental invariant is broken. So being preempted at that point is likely to lead to bugs: underrun, overrun, replicated data and so on. This makes us consider the alternative semantics. The alternative is signal-and-continue.

L4P20 Signal-and-Continue

This form of monitor came from the Mesa programming language developed at Xerox PARC (Palo Alto Research Center), which was a leader in computer science for many decades. I should add that it's still going, though it's not part of Xerox anymore. The signal-and-continue semantics doesn't have an additional entry queue called S: it just uses the main entry queue. But more importantly, it doesn't block the thread that invokes SIGNAL --- there's no immediate change to the concurrency environment from the point of view of the issuing thread. Signal is like any other change to the mutable environment. So this is a lot simpler. It gives us a more familiar programming paradigm for our critical region. The whole of the method body is non-preemptible, except, of course, if we voluntarily yield by calling WAIT on a condition variable, but that's not preemption. But what's the downside of this? Well, the thread that waited was waiting on some particular condition, and under the other semantics that condition was bound to hold at the time the thread resumed --- provided that the signal was issued responsibly, of course. Under the new semantics, the thread that's woken up needs to check whether the condition still holds. Indeed, in one extreme form of this, a single condition variable is all that's available for the monitor (or even shared over all monitors!) and any thread that sleeps on it and gets woken up needs to check that the condition it was expecting to hold does indeed hold. But either way, we have the possibility that the condition doesn't hold when the thread is woken up: anything may have happened between the signal and the wake-up. So the waking-up thread needs to check the condition again, and we'll do that using a WHILE loop. The next three slides will go into that, but in summary, just replace the 'if' with a 'while'.

L4P21 Signal-and-Continue example (1/2)

Here's a worked example: we have two producers and one consumer thread.
The buffer is going to be full as we start our example. So any enqueue operation is going to have to dally. (The word dally in concurrency theory actually has a special meaning and I was just using it instead of wait in that last sentence. A dally is a very short wait in the expectation that something that we're waiting for is about to happen imminently, which is not necessarily the case for this example.) P1 enters. It wants to enqueue something, but the buffer is full, so it's going to wait on the condition not_full. Therefore, it will enter the wait queue for that condition variable and the main mutex (the entry mutex) will be unlocked, allowing something else to happen. Let's assume that the consumer thread comes along next and enters. But then P2, another producer, comes along and tries to enter, but it can't because the consumer is already in there and holds the entry lock. Remember, P1 is waiting on the condition variable and P2 is waiting on the entry lock. The consumer carries on working, consumes its item, and finally SIGNALs the not_full condition variable. The consumer then happily exits. P1 now unblocks and is going to re-enter the monitor by reacquiring the entry lock if it can. If we assume some sort of FIFO ordering on the entry lock, then P2 got that first, so P2 will actually enter first. P2 puts its work item in the buffer and exits, releasing the entry lock. This has filled up the buffer again. But when P2 exits, P1 can now enter. P1 has had its wake-up on its condition variable. It was woken up because the buffer wasn't full any longer. But actually it is full again. Therefore, as coded, we have a bug. P1 will now put its item into the full buffer, causing a buffer overrun and consequential chaos. So what's the fix for this? It's very simple. We'll have to change our IF to a WHILE so that we retest the condition when we've resumed from a WAIT.

L4P23 Monitor Producer/Consumer Solution

Here's the modified code: there are two very small modifications. We've changed two occurrences of the word 'if' to the word 'while'. Of course, if the loop goes round only once, then 'if' is the same as 'while'. In both cases WAIT is first called when the condition of interest doesn't hold. When we unblock from WAIT, the WHILE will make us check the condition again. If it holds now, well and good. But if it still doesn't hold, or it held and has gone away again in the meantime, then we will re-enter the wait. Note also that the semantics of the SIGNAL call have changed. It's no longer a yield. So we can freely insert the signal calls wherever we want in our code, without consideration for whether we get preempted [by another thread in the monitor --- general preemption/stuttering remains fine].
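In the same C-like pseudocode as before, the corrected Mesa-style methods presumably read:

    procedure produce(item) {
        while (in - out == N)      // re-test the condition on every wake-up
            wait(not_full);
        buffer[in % N] = item;
        in = in + 1;               // invariant restored before anything else runs...
        signal(not_empty);         // ...and the signal's siting is now uncritical
    }

    procedure consume() {
        while (in - out == 0)      // re-test the condition on every wake-up
            wait(not_empty);
        item = buffer[out % N];
        out = out + 1;
        signal(not_full);
        return item;
    }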
L4P24 Monitors Summary

Overall, the monitor is a very effective structure for concurrency control. It groups together the related components: the methods and the data. Today we would call this an object. It's considerably easier and more idiomatic than the use of semaphores, but there can still be perilous coding opportunities if you're not careful. Having at most one thread in the monitor at once is overly conservative in some applications. The classic example is the multiple-reader single-writer. We want multiple readers to co-exist, but we can get around that and still use a monitor by requiring that the readers have two methods in the monitor that they observantly call. One will be a BeginRead and the other will be an EndRead. And for the writer we might do the same thing, but alternatively it's often the case that the write operation can usefully be put entirely inside the monitor as a single method. I leave it to you as an exercise to code up a working multiple-reader, single-writer scenario using the monitor construct.

L4P25 Concurrency In Practice

Let's look at some concrete examples of concurrency primitives in popular languages today. Ultra-modern languages might make it even easier for us, but then there might not be anything to show you. Our first concrete example is the pthreads library that we've seen in demos already. POSIX is the Portable Operating System Interface. POSIX interfaces are available on many operating systems, especially Unix-based ones, but it's also available on Windows (WSL 1 was a POSIX gateway to Windows internals). Pthreads provides some other facilities, such as the ability to pass data between threads and to have thread-local data and so on, but there isn't a lot more to it than what I'm about to cover. Pthreads came from a uni-processor background, but ported naturally to multiprocessors. Threads will be dynamically allocated to processor cores based on demand, although now there are some additional primitives for defining AFFINITY to a particular core, if you want. But anyway, let's just look at the basics here.

L4P26: Example: pthreads (1)

We'll review three of the primitive sets from pthreads: mutexes, condition variables and barriers. There is also a counting semaphore in /usr/include/sem.h, but here we just have a binary mutex. As we've seen, a mutex is very like a semaphore with only two values in use: locked and unlocked. The baseline way to access pthreads is through C. Pthreads can also be used with C++. This makes a difference when it comes to spawning another thread, because in an object-oriented language you will not only need to give the entry point for the new thread to start in, you'll also need to give it an object 'this' pointer to run anything. The pthreads create, destroy, join and so on primitives, as shown in these slides, take an additional 'void *' argument that you can type-cast as you like, and for create this can be used to pass the target object handle explicitly. It's not very elegant. So C++ now (2011 onwards) has its own thread abstraction that is cleaner to use, std::thread. Of course, one can use some grander libraries instead of pthreads that make it a bit cleaner, but I'm strongly in favour of learning how things work from the bottom up, and this is definitely the bottom. On this slide, we show the primitives available for operating on a mutex. The mutex is an opaque type: it's a structure of type pthread_mutex_t which is defined in the pthread.h header file. Once you've instantiated one of these, these are some of the operations that you can perform on it. Each of these subroutines takes the address of the mutex structure as an argument passed by reference. To initialise a mutex we call the pthread_mutex_init subroutine. But since this is C and not C++, we must manually take the address to pass by reference, and hence we have a pointer-style argument. To lock the mutex you call the LOCK primitive, and this will block the thread if something else is currently holding it. And when the critical section is finished, you call UNLOCK, which will be non-blocking but may possibly free up something else that's blocked, and the caller could get pre-empted straightaway.
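A minimal usage sketch of those calls (error checking omitted; the shared counter is purely illustrative):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER; /* or pthread_mutex_init(&m, NULL) */
    static long counter;          /* the shared data the mutex protects */

    void *worker(void *arg)
    {
        (void)arg;                /* unused here */
        pthread_mutex_lock(&m);   /* blocks while another thread holds the mutex */
        counter += 1;             /* the critical section */
        pthread_mutex_unlock(&m); /* non-blocking; may unblock a waiter */
        return NULL;
    }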
There's also the TRY_LOCK call, which is a non-blocking version of LOCK. This will not block. It may or may not acquire the lock, and it will return telling you whether you did or you didn't acquire it, but it won't block in the case that you didn't. (And of course neither variant blocks in the case where you did.)

L4P27 Example: pthreads (2)

Looking at the support for monitors and condition variables, we'll find the following primitives. This has to work for both C and C++, and in particular this is a C API. C does not have objects, and so the concept of a monitor is not enforced in any language-level construct when we're using C; we just have to program following that paradigm: it's a matter of manual discipline. To write a monitor, we first of all have to declare a mutex. All threads entering and leaving the "monitor" must manually acquire and release the LOCK. Then, as a goodie, we can have some condition variables. We define the condition variables much like we defined the mutex, and we have a corresponding initialisation function which sets each one to be an empty queue. Remember, there are no user values stored in a condition variable. The COND_WAIT call doesn't intrinsically understand which monitor the condition variable is associated with, so we have to pass both the condition variable and the monitor's mutex as arguments by reference to the WAIT. The SIGNAL call is non-blocking. It's in the Mesa style and it will just wake up any one particular thread waiting on the condition variable/queue. There's also the BROADCAST, which will wake them all up. NB: the additional 'while' around the waits that we will use in a Mesa-style condition variable scenario gives us some defensive programming. In many programs it will not matter quite how many threads are woken up by either of these two calls, so it's just a matter of efficiency. But at other times it does matter, and you do actually need every thread to be woken up, because there won't be any other wake-up event for some thread and it could end up sleeping forever on the condition variable.

L4P28 Example: pthreads (3)

The barrier construct is new ground for us. We haven't considered this before. It's a handy programming paradigm. A barrier is a synchronisation method that's used for some fixed and finite number of threads, NTHREADS, such as 10. We specify the number of threads in the initialisation call to the barrier, as N here. These threads are normally server-type threads in an outer loop, where they encounter the barrier wait call at one point, perhaps at the top or bottom of their outer loop. The barrier semantic is that any thread arriving at the barrier has to wait at that barrier. It's suspended until all the threads in that thread group have arrived at that barrier, and then they're all unblocked at once. (So the last-arriving thread does not actually have to wait!) The barrier isn't really a primitive, in that you can implement its functionality using the other concurrency primitives. An exercise? But it's a useful thing to offer.
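To make the condition-variable calls concrete, here's a sketch of the earlier bounded buffer written as a hand-rolled "monitor" in C with pthreads (Mesa-style, hence the while loops; the names and fixed size are illustrative):

    #include <pthread.h>

    #define N 16

    static int buffer[N];
    static int in = 0, out = 0;   /* free-running pointers, used modulo N;
                                     as noted earlier, they would eventually overflow */

    static pthread_mutex_t entry     = PTHREAD_MUTEX_INITIALIZER; /* the monitor's entry lock */
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    void produce(int item)
    {
        pthread_mutex_lock(&entry);
        while (in - out == N)                     /* re-test after every wake-up */
            pthread_cond_wait(&not_full, &entry); /* atomically releases entry and sleeps */
        buffer[in++ % N] = item;
        pthread_cond_signal(&not_empty);          /* signal-and-continue: we still hold the lock */
        pthread_mutex_unlock(&entry);
    }

    int consume(void)
    {
        pthread_mutex_lock(&entry);
        while (in - out == 0)
            pthread_cond_wait(&not_empty, &entry);
        int item = buffer[out++ % N];
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&entry);
        return item;
    }

And the barrier, for completeness (NTHREADS being whatever the thread-group size is):

    static pthread_barrier_t bar;  /* pthread_barrier_init(&bar, NULL, NTHREADS) at start-up */

    void *server_thread(void *arg)
    {
        for (;;) {
            /* ... do this round's work ... */
            pthread_barrier_wait(&bar);  /* suspend until all NTHREADS have arrived */
        }
    }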
Previously this course had quite a lot of emphasis on the FreeBSD kernel, but I just show this slide now for completeness rather than for examination purposes. "FreeBSD" means the Berkeley Software Distribution of Unix. It's free because it's in the public domain. Nearly everything that's running today which isn't Windows is one way or another based on Unix and large chunks of the FreeBSD source code. Linux rewrote it, but used many of the same names and signatures and design styles for all of the main functionality.

L4P29 Example: FreeBSD kernel

In the days of uni-processors, it was fine for the kernel to be single-threaded. And even with multi-processors, we could possibly assume that the kernel's not going to be doing very much, having exported all of our work into daemons (and so on) that run in user space, so we could still get away with a single thread operating in the kernel. But we do _need_ a multi-threaded kernel to fully exploit a multithreaded hardware platform, and in reality kernels like the FreeBSD kernel have evolved to be highly concurrent even inside the kernel. So as well as providing the standard concurrency APIs to user-space programs, the additional richness can also sometimes be made available to user-space programs. Each user thread has a corresponding kernel thread that handles the system calls from that user thread, giving concurrency while in the kernel. (By "concurrency" here I mean parallel-processing speed-up.) It has facilities for deferred work. Deferred work is essentially a complete closure, a future, or a "thunk" (in ML terms, a value of type unit -> alpha): a piece of work that requires no further arguments, but which can be computed asynchronously when processing power is available. Modern high-level languages have constructs like the "async" keyword in C# that can collect the result from a future. But deferred work is also useful for handling interrupts, especially when an interrupt service routine must exit very quickly but has additional work that can be done soon afterwards as deferred work outside the interrupt context. The kernel has interesting monitoring and debugging facilities which can track, for instance, the contention experienced by a program, and it can be dynamically optimised into better behaviour. For instance, when acquiring a mutex, is it better to spin on it briefly because you know that actually it will be available fairly readily? Or is it better to suspend the thread and get a later reschedule from the scheduler? The kernel has deadlock detection, which we'll talk about another time, and it has lock-order checking, which we'll also talk about another time.

L4P30: Example: Java synchronization (1)

Now, Java is an object-oriented language, and so the monitor paradigm is pretty easy to implement. In Java, every object you create intrinsically has a mutex associated with it. You might think that's a waste of space, but typically it's sufficient for the Java VM to use the unique object handle, with the scheduler entering that handle in a collection type to represent that the mutex is held. I think the "synchronized" keyword in a class definition turns every instance into its own monitor. Here, the first example has the keyword "synchronized" in the signature for a method declaration, in its modifier list. This method will acquire its parent instance's mutex as the caller tries to enter it. If the lock is free, then we enter as normal. Otherwise we suspend, waiting for the lock to become available. And that same mutex will be shared by all other methods of the instance that have the synchronized keyword in their declaration --- the object becomes a monitor. In the second example there is a localised critical section using a "synchronized" block in the body of the code. In the second fragment, the synchronized() argument is "this", which names the mutex associated with the current instance. But sometimes that's not what we want.
We might just want literally a critical section here, and we don't care how many other threads are also elsewhere in this instance, in which case we could name a local lock as the argument to the synchronized keyword. And sometimes we might want the exclusion to have wider scope, in which case we would share a mutex amongst a number of synchronized statements and get mutual exclusion amongst all of them. In Java, the exclusion is said to be re-entrant. This means that a thread which is in a synchronized method can call other methods that are synchronized on the same lock, which would normally be others in the same class instance. And it calls those without any disruption, any suspension; there's no synchronisation overhead for that, they're just called directly like a normal subroutine call. But when we return from the top-level synchronized subroutine, the lock is released. Perhaps I should have said top-level synchronized method, but you know what I mean.

L4P31 Example: Java synchronization (2)

What do we want to do for synchronisation within a Java monitor? We'll need the equivalent of a condition variable. (Or a condition queue, as I would urge you to think of it.) The first thing to note is the point I mentioned before, which is that we can always get away with just one condition variable per monitor. That's because we're using signal-and-continue semantics, where the WAIT will be wrapped up in a WHILE, which will check whether the condition of interest really holds at the time it unblocks. Looking at these two Java fragments here, we see a WAIT and a NOTIFY. It's a NOTIFY_ALL in fact, but that doesn't matter. We see that wait and notify don't take any arguments. They are intrinsically referring to the one condition variable that exists for this instance. We see that the wait is indeed inside the body of a while loop, which is checking the relevant condition. So we might be checking for several conditions on a single wait call, or we might have various wait calls each checking for a specific condition. All of these programming paradigms are possible. But clearly we have a less focused wake-up than if we were able to have multiple condition variables. So that might be a source of inefficiency. But it could also be less bug-prone, since waking up more threads than necessary is likely to suffer less from missed wake-ups! Finally, at the bottom of the slide, as with locks, you see we can name the object that we want to call wait on. This looks like a pretty hairy programming style. Probably best avoided. There be dragons!

L4P32 Example: Java synchronization (3)

Memory consistency is a very important and topical subject. We're talking about shared memory. We know that something one processor writes, other processors are expected to be able to see and read, regardless of how many levels of cache and so on are involved. But SEQUENTIAL CONSISTENCY is a matter of whether writes are observed in the same order by all other processors. With something like the Java virtual machine, we are potentially able to define our own sequential consistency semantics. But we need to do this with care. We need, as always, to balance between what the underlying hardware is giving us essentially for free, and what we would like to offer the user, which might incur some overhead on our part as a VM implementer.
But the baseline model for most of these systems is that although reads and writes might sometimes appear in different orders to other processors (especially on Arm), whenever we invoke a concurrency primitive there is an intrinsic memory fence: the writes made before that primitive are definitely seen as before it, and are separated from those which we perform after the primitive is invoked. When we look at lock-free data structures, where we won't invoke any primitives, we'll need to know a bit more about our memory consistency model. Both Java and .NET have refined their memory consistency models several times since the languages were first invented. And the library of synchronisation primitives in Java has been augmented recently as well, with thread pools, semaphores, cyclic barriers and so on. A concurrent collection is essentially a thread-safe collection which can be re-entrant on its read and write methods. But it may also have additional facilities like a concurrent/parallel map (aka a concurrent "for all"). This will invoke a method on every member stored in the collection, potentially running them in parallel based on how much parallelism is available in the underlying hardware.

L4P33 Parallel C++ Extensions: Cilk and OpenMP

We saw that pthreads was a way of achieving parallelism for C and C++ code using an additional library, but it can also be done, perhaps more elegantly and at a finer grain, using extensions to the compiler. A really smart compiler might be able to work out how to do this for itself, and we'll look at problems that that might encounter shortly. But on this slide, we see two ways in which the programmer explicitly indicates where parallelism should be deployed by marking up the source code. In the Cilk example on the left, we see the keywords SPAWN and SYNC. SPAWN is an indication that a subroutine invocation should be run with its own thread. The SYNC command means collect together the results from the previous spawns before proceeding any further. In this example you might think that SYNC is fairly unnecessary, because clearly we'll need both results before we can compute X + Y. But that was the nature of this particular approach. On the right-hand side, we have an OpenMP example where the programmer's source code has been extended with several hash-pragma markups. If the hash-pragma markups are ignored, then we have a single-threaded program which should have the correct behaviour. But when they're not ignored by an augmented compiler, the FOR loop will receive special attention. In particular, it's going to run each iteration in parallel rather than in series. We can see that the body of the FOR loop contains a reference to a "Normalise()" subroutine that we don't know anything about, but presumably it has sufficient complexity to warrant being run in parallel with other instances of itself, to amortise the inevitable overheads of managing the parallelism. We also need to note that the results of the individual iterations are going to be combined using summation and, as we'll mention shortly, it's important that that is an associative operator.
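A hedged sketch of the kind of OpenMP code the slide presumably shows (the Normalise() routine and the array are illustrative):

    /* Compile with, e.g., gcc -fopenmp; without that flag the pragma is
       ignored and this is an ordinary single-threaded loop. */
    double Normalise(double x);   /* some suitably expensive per-element routine */

    double total(const double *a, int n)
    {
        double sum = 0.0;
        /* Iterations run in parallel; 'reduction' gives each thread a private
           partial sum and combines them at the end (summation is associative). */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += Normalise(a[i]);
        return sum;
    }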
L4P34 Parallel Iteration in Modern HLLs

Many modern imperative high-level languages have structures similar to these on this slide, but we should also note that if we were programming in a more declarative language such as ML, in the pure subset of ML that's purely functional, then the compiler has complete freedom over what to run when. It can optimise parallelism as it wishes. Here we see a recent example from the C# language from Microsoft, which is a very good language, but it is an imperative language. This example shows the Parallel.ForEach construct being applied to a sequence in C#. A sequence is an abstract lazy list, or something else that can be iterated over. This example is essentially implementing a fold summation, and I've put the ML equivalent at the bottom so that we can see what's going on a bit more clearly. The construct is using three anonymous functions, or lambda expressions. The first one takes unit and gives us the starting value to be used in the parallel reduction step. The second one is very much like the ML equivalent. It's the function that's being iterated, the difference being that it has this extra argument called loop state, where I think you can keep additional state as you wish, but I'm not an expert on this and I'm not expecting you to be an expert on it. And the final and third anonymous function is the one that's going to combine together the results from each part that's run in parallel. What we see here is that the programmer has structured the code so there's a part of the loop that is not necessarily thread-safe, where each iteration is run on the same processing node or core, and another part, which is thread-safe, which is going to be used to combine the results from each of the parallel parts. The clever thing here is that there's no explicit partitioning of the sequence into what's going to be run on an individual node and what's going to be run across nodes. The best partitioning will vary according to the number of processing nodes available and many other local conditions. However, the body of the final function here needs to be thread-safe. Adding to a scalar variable is available as a thread-safe primitive on most modern CPU architectures, but on others it might have to be made out of a load and a store, with a possible preemption between those. So just to be sure here, we've wrapped it up inside a lock construct, using another part of the C# concurrency primitives, to establish a mini monitor here. Of course, none of the details of this are examinable for this course; I'm presenting the general approach only.

In summary, concurrent systems require a means to ensure both safety and progress. Safety is provided by the mutual exclusion from critical sections, so that other parts of the system see a coherent view that observes declared system invariants. We also need progress guarantees. We need to be able to trigger other tasks reliably, and we don't want to encounter deadlock, which we'll talk about next time. Of the various primitives that we've looked at, many are used in practice, but they can vary from one system to another in the details of their implementation. These variations can lead to obscure bugs, as can sequential consistency differences between, say, Intel and Arm architectures. Where the concurrency primitives are built into the high-level language, we're generally speaking on safer ground than when we're using the primitives provided by the operating system directly, which is what we're doing with C and C++. With the low-level approaches, there are risks of programming problems, such as taking out a lock and failing to release it on some code path, especially if you're using a goto or something like that. There's also the problem of lost wake-ups. If a signal is sent to a condition queue that has nothing waiting on it, then the signal is lost.
If a thread then arrives at the WAIT and waits for that lost signal, well, it's never going to come, and potentially we'll have an eternal bug.

In summary, we looked at multiple-reader, single-writer locks and we looked at the alternatives to using low-level semaphores and mutexes. We briefly discussed conditional-critical-region-type ideas. Then we zoomed in on the monitor, added the condition variable to the monitor, and talked about signal-and-wait versus signal-and-continue semantics for the monitor. We had a brief look at concurrency primitives in practice in older languages and in Java. But that's all for the primitives, and now we'll move on to slightly grander problems and grander solutions. Next time we'll look at deadlock and livelock, and at priority and priority inversion. We'll look at avoiding deadlock using resource allocation graphs, and at the difference between deadlock prevention, detection and recovery. We'll look at how priority affects scheduling and the particular problem of priority inversion (which can be solved using priority inheritance). Thank you for your attention.