CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 2 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory. L2.P3: THIS TIME Good morning, everybody. Last time we learned that concurrency exists both in parallel programming and in distributed computation. In contemporary computing, the principal difference between these two is whether there is cache-consistent shared memory available to the participants. But note that we don't have shared memory in the case of time-sharing processes with separate address spaces either, even though they run on the same machine. This part of the course is principally about parallel programming in a shared address space. The first problem that comes up is the race condition, where two threads are concurrently accessing the same shared data without really being aware of what the other is doing. And writing parallel programs is notoriously error-prone. Concurrency bugs are very hard to anticipate. This time, we'll take a quick look at parallel computer architecture and the relevant reference models. We'll look at the primitives provided in the instruction sets of older and contemporary machines for support of atomic operations. We'll look at how concurrent programs can be viewed from the perspective of automata theory, and see how that view can help us prove the correctness of programs, either using pen and paper or automatic tools such as model checkers. We'll look at one of the classic examples from concurrency theory, the Dining Philosophers. And finally, we'll look at alternatives to providing atomic operations in the instruction set, such as the Bakery Algorithm. L2.P4 General Comments Concurrency is essential in modern systems. Even from the earliest days of computing, the I/O devices tended to be mismatched in speed with the processor. For instance, you may have heard of a SPOOL file. Typically, a program generating data to be printed would not directly send it to the printer. It would instead send it to a file called a spool file. The SPOOL file would then be sent to the printer asynchronously at the rate that the printer operates, while the processor is freed up to move on to new work. And so the word spool is actually an acronym. It's short for "simultaneous peripheral operation offline". Even today, all I/O tends to run asynchronously using interrupts. And given that it's now much more energy efficient to build lots of computers rather than one faster one, parallel processing, either tightly-coupled or in a distributed system, is the mainstream computing paradigm. But this throws up lots of challenges. We need our systems to be safe and reliable, which mainly means being race-free. If there's a race, we get unreliable, random behaviour. We need mechanisms to enable one process to communicate and synchronise with another one reliably. We need to ensure liveness and not suffer deadlock problems, of which more in another lecture. Often it can be a good approach to write a single-threaded program that does what you want first, and only run it in a parallel or concurrent form later on, once you're happy with the basic functionality. But you should write the initial version of the program with the awareness that you are going to later run it in parallel, because this can have major consequences for the design of your data structures. And as always, where an existing library solves a problem, you should use it rather than implementing your own, unless you're implementing your own just to find out how to do it, or because you think you could do it better.
L2P5. Computer Architecture Let's just revise basic computer architecture: the architecture designed by John von Neumann, which everybody has used, more or less, ever since. There are four parts to a computer and we are only interested in two of them here. My top picture here shows the main memory and one CPU. That's two parts of a computer. Traditionally, the other two parts are the secondary storage, which is the disk drive(s), and the I/O devices. We'll need network I/O devices for distributed computing based on networks. But for the sake of this first half of the course, we're just interested in the processors and main memory. The memory contains the operating system and the application program. The CPU has two modes of operation, normal user-mode and kernel-mode, with different access rights accorded to them. In a time-sharing system, there will also be a timer which generates an interrupt at a regular rate such as 10 Hertz or 100 Hertz, or somewhere in between. In a real system with I/O devices we get interrupts from those as well. Both these forms of interrupt are very important in our analysis of concurrent programs and operating system design, because a thread is going to get preempted and something will be changed by the interrupt or another running thread before it's resumed. So in a sense, memory locations magically change their values under our feet. In a parallel processing environment, as we have in every laptop these days, we have multiple CPU cores as illustrated in the bottom picture. We can also have more than one memory bank, but that is a hardware detail that tends to be hidden from us. All of the memory is nominally in one flat, uniform physical address space. But each core will have a memory management unit which does virtual memory translation, so each process running on a core will see a different image and a subset of that main memory. L2P6 Operating System Behaviour Let's just revise how the operating system schedules tasks. By task I mean thread. The threads will belong to processes, with potentially multiple threads per process. On a uni-processor, only one thread can be running at any one time. The context of all the other threads must be saved by the operating system in thread control blocks (aka process control blocks). At a context swap, the operating system will save the processor state in the control block. The scheduler will decide what is the most worthy thread to run next. Its context will be loaded into the physical registers and the operating system will do essentially a 'return from interrupt': that thread will continue running without any knowledge that it had stopped. Here we see a classic thread state diagram. Although you might see other more complex diagrams, there are really only three states of interest. A new thread starts in the running mode and it continues until it either issues a system call that blocks it (that's a voluntary yield, in that the programmer made the system call and knew full well that the thread may well block at this point), or the thread may be preempted by an asynchronous interrupt from a device or the timer, in which case, in principle, it's still ready to run. So each of the suspended threads is either blocked or ready to run. For a blocked thread, the operating system needs to be able to tell cheaply what it's blocked on. We'll return to that in another lecture. The operating system will consider which is the most worthy amongst the ready-to-run threads when it decides what to run next, which is the SCHEDULING DECISION.
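To make that concrete, here is a minimal sketch in C of a thread control block and the three thread states just described. The field and type names are invented for this illustration and are not from the slides; a real operating system's structures are considerably richer.

    /* Illustrative sketch only: a minimal thread control block (TCB) and the
     * three-state model just described. All names are invented for this example. */

    #include <stdint.h>

    typedef enum { RUNNING, READY, BLOCKED } thread_state_t;

    typedef struct tcb {
        uint32_t       tid;             /* thread identifier                         */
        thread_state_t state;           /* RUNNING, READY or BLOCKED                 */
        uint64_t       saved_regs[32];  /* register context saved at a context swap  */
        uint64_t       saved_pc;        /* program counter to resume from            */
        void          *wait_object;     /* what a BLOCKED thread is waiting on, so
                                           the OS can tell cheaply                   */
        struct tcb    *next;            /* link in the scheduler's ready queue       */
    } tcb_t;

    /* The scheduling decision, in caricature: pick the most worthy READY thread. */
    tcb_t *schedule(tcb_t *ready_queue)
    {
        return ready_queue;             /* e.g. simple FIFO: take the head of the queue */
    }

The wait_object field is the hook that lets the operating system tell cheaply what a blocked thread is blocked on, as mentioned above.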
When interrupts from devices occur, they generally alter data structures and something that was blocked can often become ready-to-run. So most operating systems are designed so that once an interrupt has occurred, on return from interrupt, the operating system scheduler is invoked and a new determination of the most worthy work is made. L2P7: Hardware foundations for atomicity 1 What hardware do we need for concurrency? Well, we've already mentioned the operating system timer for time sharing, but we're also going to get interrupts from other sources. Such interrupts can happen at any time, and a thread can be preempted at any time without warning. This is pretty much an impossible situation: it's pretty much impossible to write software that could be interrupted at any point (see lock-free programming in L8, though). Normally we are bound to have some critical sections. At the hardware level, all processors have a means for disabling interrupts. Generally you don't want another interrupt while you're already processing one interrupt. There can be multiple levels of interrupt in practice, where a higher-level interrupt can interrupt a currently running lower-level interrupt, but we'll put that aside and just imagine that there is normal and interrupt mode. The action of the hardware on getting that interrupt consists of several steps, including entering supervisor mode, changing the program counter, possibly stashing some registers, but in particular it also sets the interrupt disable flag, which is also the hardware reset state of that flag. The flag will be found in a special processor control register alongside the N, Z, C and V flags that you may have come across, and so on. Anyway, there is a hardware flag that is manipulated by both software and hardware to turn interrupts on and off. Also, there is a special instruction, a variation on the normal subroutine return instruction, which is the return-from-interrupt instruction, and that will have the effect of restoring that flag to its previous value. The details of all of that aren't important for this course. We just need to know that there is such a register that can turn interrupts on or off and that it's manipulated in hardware, but also it can be manipulated in software. The instructions for setting and clearing the flag are supervisor-mode (aka kernel-mode) instructions, but in this course we're not going to be worried too much about the distinction between user mode and supervisor mode. Most of what we're discussing applies equally. So on the simple uni-processor system, we can turn interrupts off. We are then guaranteed not to be preempted. Memory locations will not change under our feet. This gives us the perfect environment for implementing a critical section, and at the end of that section we turn interrupts on again. While the interrupts are off, though, we cannot respond to external events, so real-time behaviour may well be affected. We may get a buffer overflow, for instance, if data arrives while we have interrupts off for an extended length of time. That's a hardware buffer I'm thinking of there. Later on, we'll talk extensively about software FIFO buffers. Moreover, this simple technique of turning interrupts on or off is not going to be suitable for use when there are other CPUs that change the contents of memory, or when DMA exists, copying data into memory from external devices. So we're not going to use the crude approach of turning interrupts off for managing critical sections.
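Just to show the shape of that crude approach before we abandon it, here is a minimal uniprocessor sketch in C. The two helper functions are placeholders for the privileged instructions that save/set and restore the interrupt-disable flag (the Linux kernel's local_irq_save()/local_irq_restore() macros play a similar role); they are defined as empty stubs here purely so the sketch compiles.

    /* Sketch only: a uniprocessor critical section implemented by turning
     * interrupts off. In a real kernel the two helpers would execute the
     * privileged instructions that manipulate the interrupt-disable flag. */

    static unsigned long irq_save_and_disable(void) { return 0; }          /* stub */
    static void          irq_restore(unsigned long flags) { (void)flags; } /* stub */

    static volatile int shared_counter;

    void increment_shared_counter(void)
    {
        unsigned long flags = irq_save_and_disable(); /* no preemption from here ...     */
        shared_counter += 1;                          /* the read-modify-write is safe    */
        irq_restore(flags);                           /* ... to here; interrupts back on  */
    }

Note that this only protects us against preemption on the same core; it does nothing about another CPU or a DMA engine writing the same memory, which is exactly why we need the atomic primitives discussed next.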
L2.P8 Hardware foundations for atomicity 2 Last time we saw the problem with the lack of atomic operations. If we want to check a flag and then set it if it wasn't set already on a simple instruction set, this requires a load and then a store. These are separated in time and it's possible that another device also loads between that load and our store. That's the classic race. We need the load and the store to be in some sense atomic, with nothing happening between the load and the store. So we need a new ISA primitive for protection against parallel access to the memory from another CPU, or indeed the same CPU running an interrupt. Older architectures provided instructions such as compare-and-swap, which we'll look at in the next slide. More modern architectures tend to use not an atomic instruction, but a pair of instructions that have an atomic effect somewhere else hidden in the system, such as load-linked and store-conditional. Modern architectures also have a variety of other atomic instructions, such as setting a bit in a word or incrementing a word, and so on. But we don't need to concern ourselves with those at the moment. They may come in handy for implementing lock-free data structures later on. L2P9: Atomic Compare and Swap (CAS) As I said, older instruction sets tended to support an instruction such as atomic compare-and-swap, or some variant of it. The compare-and-swap instruction takes a memory address, a value to compare against and a value to set that memory location to, and it returns a result as well. The value to compare against and the value to set it to could be stored in registers or they could be hard-coded as part of the instruction. That doesn't make a lot of difference. But certainly the memory address to operate on needs to come from a register and the return value from the instruction needs to go into a register on the machine. The semantics of compare-and-swap are straightforward. If the value in memory matches the compare value, then the new value is stored. If the old value doesn't match the compare value, then nothing happens. And either way we get a result which tells us whether the store was successful. Here we see the assembly language implementation of this for the x86 architecture. The instruction of interest is called compare-exchange. We can see it has a lock prefix in front of it. It takes 2 arguments: one is the address of the word to be operated on, which is called foo_lock, and the second is a register, the DX register. The details of this are not important. Compare-and-exchange is of course the same thing as compare-and-swap. The compare-and-swap instruction here is inside a loop and this loop is going to repeat the operation until it's successful. Later on, we'll call this a MUTEX ACQUIRE SPINLOCK. The value to be stored is 1. It's loaded initially into the DX register. The x86 exchange instruction puts its result in the AX register, even though AX isn't mentioned in the operand list of the instruction. That's one of the sort of peculiarities of older CISC instruction sets. We call that a non-orthogonal behaviour and it makes compiler writing a pain. But if we're writing this by hand in assembler, then we know the semantics of the instruction and just have to remember where the result went. But the low-level details are certainly not important. Another variant of this instruction is the atomic test-and-set, which is where the value to be tested against and the value to be stored are hardwired to unity.
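If you prefer to see that spin loop in portable C rather than x86 assembler, here is a minimal sketch using the C11 <stdatomic.h> compare-exchange; on x86 a compiler will typically implement it with a LOCK-prefixed compare-exchange instruction much like the one on the slide. The names foo_lock, acquire and release are invented for this example.

    /* Sketch: a spin-acquire built from compare-and-swap, in portable C11. */

    #include <stdatomic.h>

    static atomic_int foo_lock;          /* statically zero-initialised: 0 = free, 1 = held */

    void acquire(void)
    {
        int expected;
        do {
            expected = 0;                /* we only succeed if the lock currently reads 0 */
        } while (!atomic_compare_exchange_strong(&foo_lock, &expected, 1));
        /* spin until the compare-and-swap succeeds: the MUTEX ACQUIRE SPINLOCK */
    }

    void release(void)
    {
        atomic_store(&foo_lock, 0);      /* put the lock back to free */
    }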
The downside of the compare-and-swap instruction is nothing to do with this course. It's to do with hardware architecture. But essentially, the whole of the memory system needs to be locked between the load and the store phases of the atomic instruction, and this isn't necessarily a good thing in a highly-parallel, multi-processor environment. L2.P10: Load Linked-Store Conditional (LL/SC) The more modern equivalent of compare-and-swap, which is found more commonly on today's RISC architectures, is called load-linked/store-conditional. It's also known as load-exclusive/store-conditional. Rather than having one instruction which makes 2 memory operations, we have two instructions that make one memory operation each. For those interested in computer architecture, which is hopefully all of you, the instruction pair can be implemented in a slightly more cache-friendly way, using some of the cache structures, in fact. And also it doesn't require that the processor core is microcoded. Looking at the two halves in detail: the load-linked instruction is just like a normal load. It loads the value from memory into a register. But the memory system remembers the processor ID that just loaded that location. We can't see that that's happened from anything on the slide, but that's what's going on under the bonnet. The value that we've loaded, we can operate on. We can check it as we would need to do in the compare phase of a compare-and-swap. We can do arithmetic on it, or we can just produce a new constant which we're going to store back there using the store-conditional instruction. The store-conditional instruction is where the magic happens: store-conditional, from its very name, either stores or it doesn't. If it stores, then we say it's successful. Otherwise we say it's failed. And what is the condition for success? It's that no other processor has issued a load-linked or a store-conditional on that memory region in the meantime. There are some possible variations on the detailed semantics. For instance, any store at all to that location might potentially invalidate the link and cause the store-conditional to fail. What do I mean by memory region? Well, that will vary from one architecture to another and also might vary according to how the cache structure is currently organised, which can be adjusted in control registers. But basically, a memory region could be a single location, it could be 64 bytes, a cache line, something like that. Or it could indeed be a whole memory bank. So that's why the slide here says neighbourhood. That will include the address of interest and perhaps some surrounding addresses as well. Again, to implement MUTEX ACQUIRE SPINLOCK, we'll put it inside a loop and iterate it until we have success. In principle we require two loops: one to get a successful LL/SC pair, and an outer one to make the spin, which is equivalent to the loop that was present in the compare-and-swap example. The example assembly code on the bottom of the slide shows what you would run on a RISC-V. But note this just shows how to implement the test-and-set part and we'll need an external loop to make it into a spin. And again, the fine details of that are not relevant for examination purposes on this course. The takeaway from the last two slides is nothing more than that processors do tend to have atomic instructions in their instruction sets.
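In portable C, the 'weak' form of compare-exchange maps naturally onto LL/SC: on a RISC-V or Arm target the compiler will typically emit a load-linked/store-conditional pair for it, and the weak form is allowed to fail spuriously, just as a store-conditional can. Here is a minimal sketch (the names are invented for this example); note that the retry after a spurious failure and the spin waiting for the lock to become free end up folded into the same loop.

    /* Sketch: the same acquire written with the weak compare-exchange, which a
     * compiler typically implements with an LL/SC pair on RISC architectures. */

    #include <stdatomic.h>

    static atomic_int foo_lock;          /* zero-initialised: 0 = free, 1 = held */

    void acquire_llsc(void)
    {
        int expected;
        do {
            expected = 0;                /* only claim the lock if it was observed free ... */
        } while (!atomic_compare_exchange_weak(&foo_lock, &expected, 1));
        /* ... and the same loop absorbs any spurious store-conditional failures */
    }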
However, and interestingly, it's also possible to achieve atomic-like behaviour without atomic instructions, and at the end of this lecture we'll have a look at one of the algorithms for that. L2.P11 Finite State Machine Revision and Terminology A digital computer is a finite state machine. Of course, there's so much memory that the number of states is totally mind-boggling. Finite state machine theory does apply to a complete computer and to a system of communicating CPU cores or programs. But to model concurrent programs as finite state machines is, generally speaking, impractical. It certainly requires a very high level of abstraction, so that just the important properties are represented in our model. For instance, we might consider that the state of a program is just which basic block the program counter of each thread is in. And perhaps also add in the state of some important synchronisation primitives, such as the values of semaphores, which we'll talk about another time. Given such an abstraction, it is then possible to reason about programs and even automatically prove their correctness under various assumptions. As I hope you know already, the finite state machine can be formally defined as a tuple. We have a set of states, a starting state, an input alphabet and a next-state function (aka the transition function). Based on that, we can define some terminology to describe the general behaviour of finite state machines. In my diagram, FSM-1, we have an initial state. We might have several initialisation states that lead from one to another, forming the startup code for our program. But then we enter some sort of main eternal process loop. This is the general structure of most concurrent programs. It's known as a LASSO shape. We'll call these states the live states, and from any live state we can get back to all the other live states at any time in the future. So that's Q1, Q2 and Q3 in this diagram. The terminology I'm using here doesn't necessarily precisely align with what others might use, but it's going to be appropriate for this course. Generally speaking, there could be one state transition diagram for each thread of our program. But if several threads are running the same code, their diagrams have the same shape. We assume that the main, important, significant work of the thread is done in the live states. When we have two threads running, of course, the state space of the computer as a whole is the product of the state spaces of the individual threads. That would be the Cartesian product, of course. Not all of that may be REACHABLE, as we'll discuss. We define here a LIVE STATE to be one that can be returned to infinitely often in the future. On the other hand, a dead or a deadlocked state has no successors. The machine essentially is going to stop if we enter it, or at least this thread will. Other threads may still be able to continue. In digital hardware, we tend to refer to such a state as a dead state, whereas in concurrency theory we may say that the situation is deadlock, but we'll define that term much more thoroughly in another lecture. Also on this diagram we see states Q5 and Q4. These are outside of the live state set because once we've entered them, we can no longer get back to our live states. We could call these bad states. L2.P12 Finite State Machine: Fairness and Livelock Syphons In the FSM-2 figure, we have two sets of states which are in a cycle.
If we ignore the letter F on state Q4 for the moment, which I'll come to shortly, then which is the live set of this diagram? The live set is the set from which it's always possible to get to each other member of the live set in the future. So if we're in Q5 and Q6, we might say they're clearly live. What about Q2, Q3 and Q4? Well, they might be considered to be part of the start-up trajectory for the live set made of Q5 and Q6. So perhaps there's a bit of ambiguity with this diagram. But we can resolve this by marking some of the states with the letter F for fair. This denotes what we call a FAIR STATE. We must always be able to get back to it whatever happens. The live set must always include all the states marked with F. Hence Q4 must be in our live set and hence our main live set is now going to be Q4, Q2 and Q3. What does that mean for Q5 and Q6? Well, they're a bit like our deadlock state from the previous finite state machine. Once we've entered them, we can no longer get back to our live set. But this thread hasn't stopped working completely. We can say that it's entered a livelock cycle (or syphon) where it's repeatedly doing something, but it's no longer able to get back to its main operating set. If our thread enters a state from where it's no longer possible to reach all of the fair states, then that state was not a live state. If we also labelled, say, Q5 as a fair state, then this would be impossible and the labelling would be invalid. We can't have fairness for both Q4 and Q5: if Q5 is entered, we can no longer get back to Q4, so they can't both be fair states. Informally, we can define livelock as "we have not deadlocked, but we can no longer make useful progress". There are perhaps some more rigorous definitions of these terms that you would find in a formal concurrency course. But the definitions that I've given here are going to be sufficient to give us the required insight into what's going on with concurrent programming. L2P13: Finite State Machine: FSM view of thread control flow. Now we put some annotations on the arcs of our diagram and consider the input alphabet Sigma. There are two annotations potentially for each arc. The first is the condition that must hold for the arc to be taken. The second is the effects to be generated when we follow the arc. The word 'effect' here is short for side effect. The side effects mainly of interest will be storing values in variables, but it also includes writing an expression to an output device. The conditions for following an arc need to be boolean predicates. These will range over the values of shared variables and variables that are local to a given thread, and possibly even the state of other threads in terms of which state they're in. Together, all of these form our input alphabet. In automata theory terms, the input alphabet is really the vector (or the Cartesian product) of all of these possible observable variables. In our diagrams, we separate the condition and the effect with the slash symbol. (In some future diagrams, the distinction between input and output is blurred because it is a shared event or rendezvous.) Let's have an instance of this diagram for each possible human being sharing a flat. These human beings are asynchronously composed. So when one of them takes an action, it's not correlated or clocked at the same time as another one taking an action.
This is very different from the composition of the flip-flops in synchronous digital logic, where you have a common clock signal and each participant makes a transition at the same time. We'll take a quick look at that in some follow-up slides. In terms of shared state: there is a global variable which is observable and mutable by all threads, which is how much beer there is in the fridge. And another global variable is whether there is a note on the fridge or not. Our initial state Q0 for beer drinking shows two possible successors. But another possibility is that a human being who is executing this life doesn't do anything in the current moment. Maybe another human being makes a move instead. When we have more than one thread running different code, each thread will have its own instance of a control flow graph. It will have its own state transition diagram. L2P14 Finite State Machine: Product of Machines 1 Let's leave the beer aside and look at a couple of simpler threads which we're going to compose together. Suppose the first thread has two states A and B, and the second has three states labelled zero, one and two. Then the product space will have six states. For two machines in product, we can plot the composition in two dimensions conveniently, and if we look at this two-dimensional plot of the product, we can see that the arcs for the horizontal and vertical movement directly mirror what the individual machines did. But this is for the asynchronous product, where only one machine changes state at a time. If we instead have a synchronous product, in general, we'll get diagonal arcs. In future diagrams, we probably won't show most of the reflexive arcs. These are arcs where the source and destination are the same state. That's certainly the case for the stuttering product, which we discuss on the next slide. But sometimes a reflexive arc does have a side effect, in which case we will need to show it. L2P15 Finite State Machine: Product of Machines 2 What's the difference between asynchronous and synchronous? It's pretty straightforward. In the asynchronous product, exactly one machine steps at a time. That's what we get on a time-sharing uni-processor where there is only one core, so at most one thread can advance at any time. The asynchronous composition doesn't mean strict alternation between threads. It's not thread A and then thread B and then thread A again. There will be a lot of threads in general and each thread might make more than one step before another thread moves at all. That's what we call a STUTTERING ASYNCHRONOUS composition, rather than a TURN-TAKING composition. In the synchronous composition, however, all machines step at once. This will give us the diagonal arcs in the 2-D plot, and I don't know what the word for it is in multi-dimensions (non-planar?). The synchronous composition, of course, is highly relevant in digital hardware, where everything is clocked off the same clock within a clock domain, but that's a different course. In this course, we'll be considering exclusively asynchronous, stuttering composition. That's what you get from an operating system scheduler. L2P16 Finite State Machine: Product of Machines 3 Now, multiplicative composition of the machines is not all that happens: they also interact. As we combine machines, we get a more complex system in terms of the number of states: it grows quadratically. But the possible behaviours of the individual machines tend to reduce.
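Before we add the couplings, here is a tiny sketch in C of the uncoupled asynchronous product of the two machines just described: M1 with states A and B, and M2 with states 0, 1 and 2. The transition functions are invented purely to make the example concrete; the point is the stuttering rule that exactly one machine moves per step of the product.

    /* Sketch: enumerate the asynchronous (interleaved) product of two small machines.
     * The transition functions here are invented just for this example. */

    #include <stdio.h>

    static int step_m1(int s) { return (s == 0) ? 1 : 0; }   /* A <-> B           */
    static int step_m2(int s) { return (s + 1) % 3; }         /* 0 -> 1 -> 2 -> 0  */

    int main(void)
    {
        /* For every product state, list its successors under the stuttering
         * asynchronous rule: either M1 steps or M2 steps, never both. */
        for (int s1 = 0; s1 < 2; s1++) {
            for (int s2 = 0; s2 < 3; s2++) {
                printf("(%c,%d) -> (%c,%d) or (%c,%d)\n",
                       'A' + s1, s2,
                       'A' + step_m1(s1), s2,     /* only M1 moves */
                       'A' + s1, step_m2(s2));    /* only M2 moves */
            }
        }
        return 0;
    }

Running it prints each of the six product states together with its two possible successors, one for "M1 moves" and one for "M2 moves".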
And ultimately, you'll see, this can lead to deadlock, where there is no longer any behaviour possible in certain corners of the composite state space. Let's introduce a first coupling between our two machines, machine one and machine two. We'll ultimately consider two couplings, one in each direction. We'll call the situation where one machine depends on the other, but not the other way round, a half-coupled scenario, and then we'll talk about the fully-coupled scenario where they both depend on each other's behaviour. The input to machine two is a single boolean called Y. Let's make Y a function of the state of machine one. Let's suppose Y is simply equal to the condition that M1 is in state A. This will immediately remove arcs from the product diagram. Here I've plotted the synchronous and asynchronous half-coupled products. I didn't previously draw out the uncoupled synchronous product diagram. You can do that yourself if you want an exercise. Another exercise is just to look at these two diagrams and to see whether we've introduced any dead states. Is every state still reachable from every other state? If so, they are all in our live set. Yes, I've just had a look and I think they are all alive. Now let's introduce the second coupling, a reverse coupling. So we're also going to determine X, which is the input to machine one, from some aspect of machine 2. We'll choose that X holds when M2 is in state 0. Of course, in real-world situations there will be additional state that's not drawn in the state diagrams of the machines. It could be local variables of a thread, or it could be shared variables between threads. Each state is an abstract projection of a lot of states, and whether a thread's PC is included may vary according to whatever is the most sensible projection. So, local variable values could be expanded into the state space of the thread and PC values could be taken out. Having one PC value per basic block is often a reasonable projection. L02.P17: Finite State Machine: Product of Machines 4 Now we'll look at our example called fully-coupled product machines. It's not 'full' in any technical sense, but we now have couplings in both directions and they both influence each other. Again, we could have the synchronous and the asynchronous composition. Note that even though there are no free variables or external inputs any longer in the system, it's possible to get non-deterministic behaviour from the asynchronous product, because there is arbitrary behaviour in the scheduler as to which machine is going to step next. So we could see a state with more than one arc leading out of it. Also, we now see the new concept of states becoming unreachable. There's no way to get to them from the start state. In the fully-coupled asynchronous product at the top, we see that we can move from A0 along the top to A2 and deadlock there. Or we can go up and down between A0 and B0. B2 is a state we shouldn't normally get into, but under error circumstances, such as being hit by a high-energy radioactive particle, or a heavy axe, we could possibly get erroneous behaviour, and if we were in B2 we would exit and go to B0. The synchronous product has its distinctive diagonal arcs, but isn't very much more interesting. L2P18: Forward reference to semantics course notation. The semantics course will denote stuttering parallel composition as shown, using a vertical bar. L2P19: THE DINING PHILOSOPHERS The dining philosophers problem is a classic exercise in concurrency theory.
It's due to Dijkstra, along with many others like the Sleeping Barber problem and so on. They're all amenable to this simple finite state machine analysis. We have 5 philosophers sitting around the table. The life of a philosopher is very simple. It is completely governed by this eternal process loop: the philosopher alternately thinks and eats. Each philosopher requires 2 forks to eat, but being a bit poor, each philosopher can only afford one fork, so sharing of forks is needed. The control flow graph for the philosophers is given in this pseudo code here. When a philosopher wants to eat, they pick up the fork on their left, then pick up the fork on their right; the philosopher eats and then puts down the two forks. In this pseudo code, we've used the subroutine names wait and signal, whose meaning will become apparent in the next lecture. If we have 5 philosophers and each philosopher has about 5 states in its life, then we have five to the power of five states in our composite diagram. That's a bit much to draw out on a sheet of paper. But if you draw out two or three, you might be able to find the problem and illustrate it. I'll give you a heads up straight away: the problem is that, because each philosopher picks up their left fork first, there is a deadlock state where each philosopher has picked up the fork on their left but can't pick up the fork on their right, because the philosopher on their right is holding it in their left hand. This problem has a circle in it. The philosophers are sitting around a circular table, and circles and cycles are a primary cause of deadlock. In fact, they are needed for deadlock: we'll return to that in a future lecture. Also, there's a textbook solution to this problem which can be applied in general and not just to philosophers. L2P20 Reachable State Space Algorithm Given that we have our concurrent system behaviour expressed as finite state machines, there are a variety of automatic and manual tools that can be run to determine properties of the system. One of these tools is the automatic model checker and I'll put a demo of using the CBMC simple model checker on another link. Model checkers basically check two things, either safety or liveness. Safety checking is based on the reachable state space of the system. If we have multiple coupled finite state machines, we need to multiply them all together to get one finite state machine for input to model checking, and that's what's denoted by the FSM in the top line of this slide. The reachable state space is simply the set of states reachable from the initial state. That can easily be found with the following "iterate until closure". We initialise the reachable state set with just the starting state. And then for each state in the set, we add on all those that can be reached in one hop from there. And we iterate until nothing further changes. This is called an iterate-until-closure or a FIXED-POINT ITERATION. When we're model checking for a safety property, then we have some predicate which must hold in all reachable states. It doesn't matter if it holds outside the reachable state space, because we're not supposed to be able to get outside the reachable state space. If we're controlling the air lock on a space station, our safety property might be that the inner door and the outer door must never be simultaneously open. [But on a flying saucer that condition might be all right, provided that we've landed on a planet with breathable atmosphere.]
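Here is a minimal sketch in C of that fixed-point iteration over an explicit transition relation. The little five-state machine is invented just for the example; real model checkers represent these sets symbolically (with binary decision diagrams and the like) rather than as explicit arrays.

    /* Sketch: "iterate until closure" computation of the reachable state space. */

    #include <stdbool.h>
    #include <stdio.h>

    #define NSTATES 5

    /* next[i][j] is true when there is an arc from state i to state j */
    static const bool next[NSTATES][NSTATES] = {
        /* 0 */ {0, 1, 0, 0, 0},
        /* 1 */ {0, 0, 1, 0, 0},
        /* 2 */ {0, 1, 0, 0, 0},
        /* 3 */ {0, 0, 0, 0, 1},   /* states 3 and 4 are never reached from state 0 */
        /* 4 */ {0, 0, 0, 1, 0},
    };

    int main(void)
    {
        bool reachable[NSTATES] = { [0] = true };  /* start with just the initial state */
        bool changed = true;

        while (changed) {                          /* fixed-point iteration            */
            changed = false;
            for (int i = 0; i < NSTATES; i++)
                if (reachable[i])
                    for (int j = 0; j < NSTATES; j++)
                        if (next[i][j] && !reachable[j]) {
                            reachable[j] = true;   /* add states one hop away           */
                            changed = true;
                        }
        }

        for (int i = 0; i < NSTATES; i++)
            printf("state %d %s reachable\n", i, reachable[i] ? "is" : "is not");

        /* A safety check would now assert that the predicate of interest, such as
         * "inner and outer doors not both open", holds in every reachable state. */
        return 0;
    }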
The control systems for the air lock doors must not enter the unsafe states. There must be no sequence of pressing the buttons that enables somebody to be jettisoned into space as all the air rushes out. There are two ways of checking the safety predicate. We can either check it for each incremental version of the reachable state space as we develop it, in which case we could potentially report a violation earlier. Or we can check that the predicate holds for all reachable states once we've computed the state space set. For real-world systems of any moderate complexity, the reachable state space becomes infeasibly (exponentially) large, of course. But automated tools have techniques for handling that, perhaps using binary decision diagrams and so on, but that's not for this course. L2P21: Live States Algorithm The other thing that we can check with the model checker is liveness properties. A liveness property is essentially that something can be achieved in the future. We'll put it another way: all possible paths from where we are will eventually lead to a state where a given property holds. An example liveness exercise might be "it's always possible to unlock the doors on my car", and the classic deadlock situation, certainly with older cars, used to be that you would put the car keys in the boot and you would shut the boot, and that would be deadlock, because we can't open the boot without the key and the key is in the boot. Of course, if you have a fancy car like my wife has, then it's smart enough to spot that you put the keys in the boot and the boot won't shut. So the general principle of a liveness property is that whatever we do, we will eventually be able to make this liveness property hold: at some point in the future, we will get there, one way or another. There are more complicated liveness properties as well, and nothing on this slide is examinable for this course anyway. The liveness algorithm is pretty much the reverse of the safety algorithm. We're going to initialise our live set L to be the set of all states, rather than the starting state, which is what we did last time. To understand this algorithm, it might be rather helpful to think of that situation where we have a ring of people, each one seated, but they're seated on the knees of the person sitting behind them. The iteration is to preserve those states in the live set that have a successor in the live set. And the initial state will always be considered to be live. If you work through the details, you'll see that not only is the main live set (as we defined it earlier, the mesh of interconnected states reachable from each other) included, but the trajectory from the start-up state into that mesh will be included too. And if you want, you can trim that off by considering a further step where you intersect with, not the successors, but the predecessors instead. Alternatively, you can initialise to the reachable state space. Anyway, all the details of those are beyond this course; I'm just mentioning, for educational purposes, that model checkers check liveness and safety. If it's new to you right now, don't worry, it will be covered properly in other courses in the Tripos. L2P22: Mutual Exclusion Without Atomic Actions Earlier on, we spoke about the hardware primitives for atomic actions found in ancient and modern computers. But it's informative to know that there are ways of achieving atomic behaviour without atomic instructions.
Lamport's Bakery algorithm is perhaps the most famous one, and we'll discuss that on the next slide. Don't confuse this with the Banker's Algorithm, which we'll cover in another lecture. As well as the Bakery Algorithm, there's Dekker's algorithm for two parties, and so on. They all work essentially in the same way. They all use global shared variables, but the participants, the individual threads, will know their thread identifier, their TID. The TID is an integer and this will essentially be used as a subscript into a shared global array. On an operating system, you can invoke a system call to find out your current thread identifier. In Unix it's called gettid(), not surprisingly. Since all the computers we're likely to use do have atomic instructions, you might think that getting away without them is not very valuable. But actually there's something to learn here for distributed computing, where we have distributed reads and writes to a database, perhaps from many different computers. Again, we'll need to achieve atomicity one way or another, and perhaps some of these ideas could occasionally be useful in that sort of scenario. Anyway, Doctor X will handle that sort of issue in the second half of this course. These algorithms also serve as lovely toy examples for model checking exercises for those learning to use a model checker. However, these algorithms will tend to rely on SEQUENTIAL CONSISTENCY, which is that a sequence of values stored by one writer is seen to have been written in that order by observers running on other cores. On certain processor architectures, we may need to manually execute fence instructions to preserve that sequential consistency. But that's getting beyond what we need to talk about in this course. L2P23: Lamport Bakery Algorithm Here's the pseudo code for Lamport's Bakery algorithm. The idea is motivated by a bakery where there is a queue of customers waiting to be served, but it's a virtual queue. On entry to the bakery, a customer takes a number from the ticket issuing machine and then waits for their number to be called. The customer also needs to know their own personal identifier, which for threads will be their TID, and we'll assume that this is in a range that can directly index a small array. Although in real-world systems we probably need to compress it down somehow, because the TID could be an arbitrary 32-bit integer. We have two methods defined in the pseudo code, lock and unlock, and these will be used as mutex ACQUIRE and mutex RELEASE when we enter and leave a critical section, as discussed next time. The LOCK operation is essentially going to perform an atomic test-and-set for us. The algorithm operates with two arrays indexed by TID: a Boolean array called enter, and a numeric array called number. Let's put aside the enter array for the moment. So ignoring that, the first line of LOCK is that we're going to compute one plus the max over j of the number array. j will range over all entries in the array, which is over all possible threads. The array is initialised to 0, so this will be 0 in the very first instance and the very first user will be allocated ticket number 1 + 0, which is 1. A nice aspect of this coding compared with the real-life bakery is that the range of numbers tends to be fairly small. That's because when everybody's left the bakery, the number array will all be 0 again and the next person who comes in will be allocated number 1 again. So the numeric space gets recycled automatically by this algorithm.
Or this coding of the algorithm, I should say. There's another variation where we just allocate successively higher numbers to each newly arriving person, which is possibly more like the real-life bakery, where the number counter might only get reset when the shop closes. Now, of course, as seasoned concurrency experts now that we are on Lecture 2, we can see that there's a race potentially straight away. A number of threads might simultaneously be scanning the number array and they will come to the same conclusion about what their ticket number is. So it's entirely possible for two threads or more to be allocated the same number. Now we're going to allocate service based on comparing the number fields with each other, and in fact we're going to allocate service next to the person with the lowest number, which is what would happen in a real bakery: the lowest number that has not yet left the system. We disambiguate the situation where two clients have the same number by forming a pair. The pair consists of the ticket number paired with the client's TID. Since all the TIDs are unique, we can be sure that no two clients have the same pair. Now we can compare pairs using lexicographical ordering, where we give more significance to the ticket number and less significance to the TID, and just use the TID as a tie-break for two clients with the same ticket number. The second half of the lock code is pretty simple. Again, we'll ignore the enter array for now. What we do is, for all entries in the array, we see if there is one lower than the one that we hold, provided it isn't 0 (because we use the value 0 as a special value that denotes that the slot is empty). To reiterate, we just scan looking for a lower value lexicographically than our current one. If so, that customer is currently being served, and we spin, waiting for that condition to no longer hold. The condition will no longer hold when the other customer exits the system, invoking the unlock primitive. As you can see, the unlock primitive simply zeroes the number entry for that person. So remarkably, this more or less works if you manually step through it. If you put aside any possibility of races for the moment, you'll see that it works nicely. The one case where there is a possible race is where two threads have both read from the number array and they haven't yet updated their number field. We know that they'll both potentially end up with the same value, but we've already disambiguated that. But if one of the threads has made significant progress into the second half of the lock code before another thread has stored its number, this is where the race may be a problem. And that's where this additional array, the Boolean array of enter flags, comes in useful. We set our entry in the enter array while we are scanning and updating our own number, and we clear it afterwards. And very elegantly, if you look through it all very carefully, you'll see that this is sufficient to resolve all possible race errors. That's because we don't proceed to look at the number array while any of the enter array flags is set. So we don't have time in the lecture to prove this is correct, but I'll leave that to your private study. Or you might even try and put it in a model checker and prove it, but probably that's a bit advanced for the average second-year student right now. But we do typically have some remarkable people on our course! L2P24: Model checking demo.
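For reference, and as a possible input for the kind of model-checking exercise just mentioned, here is a minimal C sketch of the Bakery lock and unlock as described above. It assumes a fixed, small number of threads N with TIDs 0..N-1, and the sequentially-consistent atomics of C11 supply the sequential consistency the algorithm relies on. The array names enter and number follow the slide; the remaining names (N, bakery_lock, bakery_unlock, lex_less) are invented for this illustration.

    /* Sketch: Lamport's Bakery lock, following the pseudo code on the slide. */

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 8                           /* maximum number of threads (TIDs 0..N-1) */

    static _Atomic bool enter[N];         /* set while a thread is choosing its number */
    static _Atomic int  number[N];        /* ticket numbers; 0 means "not in the queue" */

    /* Lexicographic comparison of (number, tid) pairs: is (na, a) < (nb, b)? */
    static bool lex_less(int na, int a, int nb, int b)
    {
        return (na < nb) || (na == nb && a < b);
    }

    void bakery_lock(int tid)
    {
        enter[tid] = true;                /* announce that we are choosing a ticket */
        int max = 0;
        for (int j = 0; j < N; j++) {     /* take one more than the current maximum */
            int nj = number[j];
            if (nj > max) max = nj;
        }
        number[tid] = max + 1;
        enter[tid] = false;

        for (int j = 0; j < N; j++) {
            while (enter[j])              /* wait while thread j is still choosing */
                ;
            while (number[j] != 0 &&      /* wait while j holds a lower (number, tid) pair */
                   lex_less(number[j], j, number[tid], tid))
                ;
        }
        /* critical section follows */
    }

    void bakery_unlock(int tid)
    {
        number[tid] = 0;                  /* leave the queue; lowest remaining pair goes next */
    }

A thread calls bakery_lock(tid) before its critical section and bakery_unlock(tid) afterwards; the (number, tid) pair comparison in lex_less is the lexicographic tie-break described above.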
L2P25: Summary + Next Time In summary, we've looked at the underlying hardware primitives found on modern computer architectures, discussed memory consistency a little bit. We've looked at a finite state model of the control flow graph for a thread and we've spoken about how this could be analysed by our model checker. But do note that model checkers today cannot directly process the source code for a complete program. That's not reasonable with today's tools. Instead, they can analyse an abstraction of the program which may be sufficient to prove the point of interest. Next time we'll look at access to these hardware primitives and how they can be integrated efficiently with the operating system scheduler. We'll introduce one of the oldest concurrency primitives, which is the semaphore. And also look at how the same construct can be used to provide synchronisation between threads. By which we mean telling one thread that it's now time to look at the work produced by another thread. END OF TRANSCRIPT