CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 5 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory. L5 Transcript: Liveness and Priority Guarantees

L5P02
Good morning, everybody. Last time we looked at the multiple-reader single-writer lock. This is the sort of thing that you'll find in a real-world primitive library. And in such a library we'll also find constructs to help support more elaborate concurrency behaviours, especially monitor-based solutions with their condition variables or condition queues. We looked at two variant semantics for the signal primitive for a monitor, and concluded that signal-and-continue is the preferred way. Overall, we need our concurrency system to guarantee safety and progress. For safety, the main thing we've considered so far is mutual exclusion in our critical sections. And for progress, we've looked at mechanisms for enabling one thread to unblock another one. This inter-thread triggering can be unreliable in the sense that the scheduler tends to be non-deterministic: a thread that's woken up by a SIGNAL under one interleaving of threads could well suffer a missed wake-up under another interleaving where it hasn't yet waited at the time the corresponding signal arrives. This problem arises from the design of some of these primitives, which is maybe not ideal, and the programmer will have to put defensive structures around their use to avoid these problems. Another source of progress problems is deadlock, which we'll discuss this time.

L5P04
This time we'll look at liveness and deadlock in detail. Deadlock isn't inevitable: certain conditions have to exist for deadlock to occur. We can investigate those conditions using a resource allocation graph. The graph can be computed offline in advance using static analysis. This will lead us to consider deadlock avoidance, deadlock detection and deadlock recovery. We'll also look at thread priorities and scheduling issues. In particular, we'll discuss priority inversion, which can arise as threads interact. A solution to that will be priority inheritance.

L5P05
We need to ensure that we eventually make progress: we want to avoid deadlock, where threads are sleeping, waiting for each other to move. We also want to avoid livelock, where the threads are moving but nothing useful is happening. And practically speaking, we also need good performance. We need to avoid starvation, or in other words, we need to make sure that every thread eventually makes some progress. That's the technical definition of FAIRNESS. And we may have further real-world concepts of fairness as well, such as the relative performance of each thread that we'd like to see. We also want minimal overhead from our concurrency primitives so that we get good performance. Sometimes we see excessive serialisation. This is where we have a group of threads that we expect to execute in parallel, but owing to the structure of the locks or signalling between them, they end up with one running after the other. They get serialised, and the parallel speed-up that we're expecting does not materialise. So our aims for liveness and good performance may sometimes be in conflict with good safety. Obviously we have to design our system so that we maintain safety and get the best performance possible. We saw that as we composed our finite state machines, fewer behaviours were possible. And so although a thread might look as though it can make progress on its own, the combined system deadlocks.
L5P06
Concurrent system deadlock is manifested by some group of K threads that cannot wake up. Each is waiting for something to be done by another one in that set of K, but since that one's asleep as well, that doesn't happen. We have a deadlock. If you watch 'The X Factor', you'll see that they sometimes enter what they call deadlock. Actually, what they mean is it's a tie and they need a tie-break. Well, they have a tie-break procedure, which they can then use. So it's not really deadlock as we would call it: there's certainly no cyclic dependency. In general, if you have a well-defined procedure for automatically resolving a problem, then probably it's not deadlock. If you have an emergency-type procedure that you realise you eventually must invoke, you might call that a deadlock recovery procedure, and therefore we are recovering from deadlock. We'll discuss that again later on. Here's another example, this time taken from the railway system. The standard operating procedure for a crossing on the railway where two trains are approaching was defined as follows: "When two trains approach each other at a crossing, both shall come to a full stop, and neither shall start up again until the other one has gone." Do we call it Catch-22? I'd call it deadlock. One thing to notice about the two trains: each is waiting on the other. We have a cycle, and cycles and deadlock go together. Let's look at a situation from computing science. Here we have two threads, thread one and thread two. Thread one takes out lock X and then it takes out lock Y. Thread two potentially takes out the locks in a different order: it takes out lock Y and then possibly it takes out lock X. We've drawn a horizontal line across the two threads. If both get there concurrently, we could have a deadlock, since if that condition holds, neither will make progress.

L5P07 Requirements for deadlock
Like all concurrency bugs, deadlock might be rare. Suppose that the condition on the previous slide was mostly false. Quite often a bug may be found early in design, but sometimes that kind of bug might not be found at all during the design, and it's found only after deployment. We'll see an example where that happened on Mars soon. In practice, there are four necessary conditions for deadlock to arise. The first is that there must be a bounded number of owners. For an everyday critical section, that number of users is one: it's either not in use or it's locked by one user. But we're going to generalise and consider resources that can be simultaneously in use by some number of users at once. For deadlock to arise, that number of users needs to have a concrete upper bound: there's a limited availability. We also need our hold-and-wait situation, which is where our participant grabs a resource and then waits for another resource to be available. Thirdly, we need no preemption: once the holder has acquired a resource, it cannot be freed by other means. Lastly, critically, we need a circular wait. We need a cycle of dependencies between participants, such as the two trains at the crossing, each waiting for the other. We need all four of these necessary conditions to hold. The concurrency primitives that we have in today's systems enable us to write programs that satisfy all four of these, and indeed it's very hard to write programs that don't tend to use the first three of these on a routine basis. So our main axis of attack is going to be to avoid number four, the circular wait.
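[Aside: since the slide's code isn't reproduced in this transcript, here is a minimal sketch in Python, with names and details assumed rather than taken from the lecture, of the two-thread, two-lock pattern just discussed. It satisfies all four conditions and can deadlock under an unlucky interleaving.]

    # Sketch only: thread 1 takes X then Y, thread 2 takes Y then X.
    # If both pass their first acquire before either reaches its second,
    # we have hold-and-wait plus a circular wait, and neither progresses.
    import threading

    lock_x = threading.Lock()
    lock_y = threading.Lock()

    def thread1():
        with lock_x:          # hold X ...
            with lock_y:      # ... and wait for Y
                pass          # critical work would go here

    def thread2():
        with lock_y:          # hold Y ...
            with lock_x:      # ... and wait for X: the circular wait
                pass

    t1 = threading.Thread(target=thread1)
    t2 = threading.Thread(target=thread2)
    t1.start(); t2.start()
    t1.join(); t2.join()      # may never return if the deadlock strikes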
It's tempting to think that deadlocks will only arise with mutual exclusion locks. But there are many other resources in a computing system which can suffer from deadlock, such as memory allocations or locking a server for access to it. And this type of resource, as I said earlier, is designed to be held by or shared amongst more than one user at a time. But if there's a hard limit on the maximum number of simultaneous users of such a resource, then that's where deadlock is likely to crop up.

L5P08 Resource allocation graphs
Consider a graphical representation of use patterns between resources and users: a resource allocation graph. There are various styles of these. Here's the most basic one. We will use circles for our customers, which are going to be threads, possibly processes, but threads normally. And we'll use a box to indicate a resource. To start with, we'll consider single-owner resources like a mutex for a critical region, where there can only be one holder at a time. We'll have two types of edges, which we will show solid and dashed. The solid ones come from a resource and go to a customer. They indicate that that resource is currently locked by that customer. The dashed ones come from a customer to a resource, and they indicate that that customer would next like to get hold of that resource. What we now need to look out for is any cycles in the resulting graph. And this particular graph does indeed have a cycle. Nothing's going to happen: if the next thing T1 wants to do is acquire RB, then it won't be able to get it because RB is held by T3, and so on around the cycle. None of them is going to release what they hold because they're waiting for something else to become free, and that something else is held by somebody else in the cycle. [It's possible they might have a choice of future behaviours using trylock etc, meaning more than one dotted line from a thread at any instant, but we'll put that idea aside for now.] We could compute this graph offline in some static analysis of the source code, or we might be able to generate it on the fly dynamically as the program runs. To do it online does require that the primitives contain sufficient information: they might have to store the identity of who holds them, where that might not be needed otherwise. And the cost of cycle finding in the graph is not negligible at runtime either. So an offline analysis is preferable whenever possible.

L5P09 Resource allocation graphs (2)
Let's generalise this graph now slightly. We'll allow a resource to be allocated to some bounded number of holders simultaneously. We annotate the resources with the maximum number of clients that they can be allocated to at once. That's going to be an upper bound on the number of solid lines emanating upwards from a resource. If we look at this graph, we can see there's a cycle in the middle: T2, T3, RB and RC are in a cycle. Does this mean there's a deadlock, or that there's bound to be one shortly? Both the resources in the cycle, RB and RC, are 'saturated'. By saturated I mean the number of arcs emanating from them is at the maximum it's allowed to be. So the downward arcs from T2 and T3, indicating what they'd next like to do, cannot immediately be satisfied. So T2 and T3 are going to have to wait, and they're in a cycle. But both T1 and T4 look perfectly happy, so they're liable to be proceeding and they will probably release their locks in due course. This will enable T2 and T3 to satisfy their demands, because the resources that they want are no longer saturated. So we conclude that in this graph, the presence of the cycle indicates that the system might deadlock, but doesn't guarantee that it will. Certainly, if there were no cycle, then there would be no deadlock possibility.
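[Aside: here is a minimal sketch in Python, not from the lecture, of what the online check might look like for the simple single-owner case. The wait-for graph has a node per thread and an edge from T to U when T is requesting a resource currently held by U; any cycle found is a deadlocked set. The example snapshot at the bottom is hypothetical, arranged to match the cyclic picture described above.]

    # Sketch only: depth-first search for a cycle in a wait-for graph.
    def find_cycle(waits_for):
        """waits_for maps each thread to the set of threads it is waiting on."""
        WHITE, GREY, BLACK = 0, 1, 2      # unvisited / on current path / finished
        colour = {t: WHITE for t in waits_for}

        def dfs(t, path):
            colour[t] = GREY
            for u in waits_for.get(t, ()):
                if colour.get(u, WHITE) == GREY:      # back edge: a cycle
                    return path[path.index(u):] + [u]
                if colour.get(u, WHITE) == WHITE:
                    found = dfs(u, path + [u])
                    if found:
                        return found
            colour[t] = BLACK
            return None

        for t in waits_for:
            if colour[t] == WHITE:
                cycle = dfs(t, [t])
                if cycle:
                    return cycle
        return None

    # Hypothetical snapshot: T1 waits for T3 (which holds RB), T3 waits
    # for T2, and T2 waits for T1.
    print(find_cycle({"T1": {"T3"}, "T3": {"T2"}, "T2": {"T1"}}))
    # -> ['T1', 'T3', 'T2', 'T1']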
L5P10 Resource allocation graphs (3)
In the general case, there's more than one way that a thread might proceed. The graphs that we've looked at assume that a thread has exactly one intent for what it's going to do next. But in real life, a thread might use the TRY_LOCK call or something equivalent and have some options to explore. Remember the TRY_LOCK primitive? It grabs the mutex if it can, but otherwise returns failure instead of blocking the thread. Grander forms of graph can represent these alternative ways forward. They may have additional node types that indicate possibilities. One such is called the AND-OR wait-for graph, which you can read up about. But that sort of thing is non-examinable.

L5P11 Deadlock: Three Design Approaches
Deadlock is a nasty spectre. We don't want it. What should we do? Well, our two main options are to ensure it never happens, or to let it happen but recover somehow. To ensure that it never happens, we can use offline static analysis and make sure that we don't ever run a program that might deadlock. Or we can use some runtime algorithms that dynamically avoid deadlock. We've looked at the possibility of constructing graphs at runtime and searching for cycles in them. Next we'll look at the Banker's Algorithm, which is another runtime technique. Or we can let it happen, detect it somehow at runtime, and then recover from it somehow. Detecting it might be a bit of a problem; we could have a timeout-based approach. Recovery can be a bit of a problem too: we might need to roll back various operations, and we'll talk all about that in future lectures. And the third way forward, which is very widely used in practice, is to ignore the problem. We call this the ostrich algorithm: burying our heads in the sand. The solution there is again deadlock recovery, but at a higher level. "Have you tried turning it off and on again?" Turning it off and on again may well work for your domestic TV when Netflix refuses to let you exit, which is a common enough problem in my household. But it's not really applicable for the global telephone network, or indeed for our credit card clearing software. [My cousin works at one of the two main agencies that clears credit cards. It uses software that was written in IBM System 360 assembler decades ago and has never ever been edited since.]

L5P12 Deadlock Static Prevention
How can we avoid deadlock statically? We know there are four necessary conditions. Perhaps we can avoid fulfilling one of those conditions? The first is mutual exclusion, or more generally bounding the number of simultaneous owners to some static upper bound. We could violate that by just allowing additional access, but this is likely to be unsafe: the number of accesses was limited for some good reason, no doubt. We might be able to optimise the situation. For instance, the multiple-reader single-writer lock is an optimisation, because there's no problem having multiple readers; it's just that we mustn't have any readers at the same time as a writer. What about the hold-and-wait condition? The problem really arises from the serial nature of acquiring the resources, and that perhaps could be defeated if we could in some sense acquire them atomically. Then we'd have what we need to proceed, or we wouldn't have taken out any.
Well, we can't really do things simultaneously on everyday computer architectures, and certainly in the second half of this course, on distributed computing, that's going to be impossible. But perhaps we can somehow get the effect of simultaneous acquisition. For instance, we could try to get the resources serially but not really do anything with them yet, and if any of them fails to be obtained, just free them all again and start again. Hmmm, this seems more like a deadlock recovery technique... What about number three, no preemption? Well, we could steal a resource that's been allocated to something. Is this going to be a good idea? We don't want to just steal it and not tell. We had better send it a message that we've stolen it, or at least make sure that a reliable error comes back if it tries to use something that's been deallocated under its feet. Hmm, maybe that might work. We'll return to something like that under the topic of transactions. And what about condition four, the circular dependency? Supposing we give every lock a unique number and put them in a total order. We could then make sure that we only acquire locks in ascending order. If we do that, we will never have the cycle. Actually, a partial order is probably going to be sufficient, because we don't care about relative ordering between things in totally disjoint groups or in certain other remote corners of the overall graph. Well, that sounds quite attractive. I wonder how we implement it? Maybe the programmer has to do it manually, or perhaps the high-level language system can guide us and make sure that we obey such a discipline. Or perhaps it can even be built into the operating system in some way, such as the FreeBSD WITNESS system, if you care to look into that. [But the details of that system are not examinable.]

L5P13 Example: Dining Philosophers
Let's look at the dining philosophers problem that we had to look at briefly in a previous lecture. The circular table gives us a circular dependency, and the deadlock case is pretty clear: for each philosopher i, we have acquired fork i and we're waiting for fork (i+1) mod 5. Perhaps we can apply the global ordering discipline to this and, by changing the order in which the philosophers pick up their forks, solve the problem.

L5P14 Example: Dining Philosophers Solution
Here's the modification that we need to make. We'll allocate two new local variables, first and second. These are the numbers of the two forks that we need to pick up, and we'll sort them numerically: first will be lower than second. This will work: it's a very small modification and a huge step for mankind. Choosing the order of acquiring locks is not an expensive activity: often it can be done entirely at compile time. And it's hard to think of a situation where the order of acquiring locks has any semantic effect. By and large, we need all the locks before we can do anything, although sometimes it may be inefficient to acquire them all before we proceed: it could lead to excessive serialisation. And other times we won't know which lock we want to acquire next, because it will depend on something that we've just read under the current lock that we've most recently taken out. In which case we might have to release that lock and go back a bit and take out locks again, just so that we maintain order.
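[Aside: here is a minimal sketch in Python, not the lecture's own code, of the fork-ordering fix just described. Each philosopher sorts the two fork numbers and always picks up the lower-numbered fork first, so the cycle can never form.]

    # Sketch only: dining philosophers with globally ordered fork acquisition.
    import threading

    N = 5
    forks = [threading.Lock() for _ in range(N)]

    def philosopher(i, rounds=1000):
        left, right = i, (i + 1) % N
        first, second = min(left, right), max(left, right)   # the global order
        for _ in range(rounds):
            with forks[first]:
                with forks[second]:
                    pass                  # eat (the critical section)

    threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(N)]
    for t in threads: t.start()
    for t in threads: t.join()            # terminates: no deadlock is possible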
L5P15 Deadlock Dynamic Avoidance
So that was static deadlock prevention. We can also consider dynamic deadlock avoidance at runtime. There are two forms of this: we can change the behaviour of a thread by encouraging it to go down a different execution path, or we can stall the thread to stop it taking out one of the locks that's going to lead to the deadlock cycle. We make sure that when we release that thread to run again later on, it's not going to fall into the deadlock pattern with the other threads that it was previously heading towards. Just stalling such a thread isn't going to lead to deadlock in itself, because something else is going on, and that something else is ultimately going to free up some resources that our current thread is going to enjoy using soon after. What general properties do we need for such a scheme to be feasible? For this sort of exercise, it's best to think in terms of tasks, which are work items that will use resources and ultimately finish. We assume that we know the maximum number of resources of each type that a task is going to use or need under any of its possible control flows. And we assume that a task that is granted all of the resources that it wants is going to successfully complete and exit, and free up all of the resources that it needed by the time it exits. We also assume that we can readily read off how many resources each task is using at any instant. That is a vector quantity, because we have resources of different types. Now when a task makes a request for a resource, we will only grant it if we can guarantee that there will ultimately be no deadlock, even if all the other tasks take all of the resources it's been determined they might possibly need. If we can't make that guarantee, then we stall this task, and that stalling, as I said, doesn't lead to deadlock in itself. It's just going to defer operation of that task while other tasks are usefully doing work.

L5P17 Banker's Algorithm
The most famous algorithm in this class is the Banker's Algorithm. It is only useful where we know a lot of information about a task statically. It is a relevant approach in certain use cases, but it is not a general solution for systems that we don't know much about. Where we don't have the necessary information for dynamic avoidance, we'll just have to use deadlock detection. When we've detected deadlock, we need a deadlock recovery procedure, of which more anon. But first, let's observe that a predicate or evaluator function which detects deadlock or potential deadlock can be used in two ways. It can be used as part of the dynamic avoidance system, or it can be used as a genuine deadlock detection system, which would then invoke deadlock recovery techniques (which we'll talk about anon). So let's assume that we have a deadlock checking algorithm. We come across a new request for a resource. What we can do is nominally grant that resource and then run the deadlock checking algorithm to see whether deadlock arises. If it would deadlock after the allocation, then we don't make the allocation; we just pause the requester, let the rest of the system continue, and at some point reconsider giving the requester what it wants. Now here's a key and subtle point: it is sufficient to check that there's at least one non-deadlocking way forward. We don't have to do anything to encourage the tasks to follow that route. There may be other ways forward, but if there were only that one, then that's the one they will end up following. The deadlock checking evaluator will return safe or unsafe. A safe state is one where there is a way forward without deadlocking. What should we use as our evaluator function?
Well, we looked at the wait-for graphs. For the simple WAIT-FOR graph with a maximum of one customer per resource, the safe states are those without any cycles. When resources can be shared by up to a certain number of customers and we have the necessary information about a task, then we'll be able to use the Banker's Algorithm. But there are many more schemes in the huge literature on this subject. The Banker's Algorithm is motivated by the behaviour of an investment banker. An investment banker doesn't want to run out of capital. They're making multiple investments, and each investment has different cash needs and needs for other things such as management consultants (yuck) or key staff or temporary buildings. But like our tasks, each investment has a finite lifetime and will eventually finish, hopefully returning some profit, but also returning all of the resources that it required to the free pool, ready for future projects. Let's suppose we have M different types of resource and N active tasks. We can avoid deadlock by managing vectors of resource allocation. There is a maintained vector V of the currently available resources. At day zero, this will be initialised to the number of instances of each resource that's available. As time goes on, some will be allocated to various tasks, and we'll keep track of the current allocations in a 2-D array, A. There needs to be another array of the same dimensions called R. This holds how many further requests for each resource a task is maximally going to make. The basic assumption needed to use this algorithm is that that information exists. We initially install a number in the R array for that, but as a task progresses and we grant allocations to it, we can decrement the entry accordingly. So at all times it reflects the maximum further number of requests for a given resource that a task might possibly make or need. From these arrays, together with V, it can be determined whether we're in a safe state or not. We'll describe that on the next slide, but it involves marking rows of A. If we can't complete the marking, then we're in an unsafe state. So remember, as I said on the last slide, we're going to do this normally when we get a new request from the live system: we're going to nominally allocate it, run this algorithm, and find out whether it gives a safe or an unsafe state. If it's safe, then we'll make the real allocation and allow things to proceed. We'll update the arrays accordingly: one location in the R array will be decremented and the corresponding A entry will be incremented. Also the corresponding place in the V vector will be decremented, because that resource is no longer in the shared pool ready for others to use. Otherwise, we will suspend the requester for the meantime and reconsider after something's been released and the V vector has gone up again. The assumption underlying this is that if we can fulfil a task's request then it will continue to run. Eventually it will complete and free up its resources. And indeed, it may free up some of them before it completes, and that's also fine: we will be able to exploit them with this algorithm.

L5P18 Banker's Algorithm (2)
OK. How do we evaluate whether a state is safe or not using the Banker's Algorithm? What we're going to do is run a simulation forward from where we currently are. It's a very quick simulation. It will determine if there's a possible way forward. As I said two slides ago, it doesn't matter if the real system follows the simulation path that we've just explored.
The existence of _a_ path is sufficient for us to be able to grant the current request. We apply this algorithm to the nominal system which has just been augmented with the tentative reservation. We'll see whether that tentative nominal system is safe or not. We do that by successively marking all the rows of A with a flag. If we can't mark them all, then it was unsafe. First of all, we set the flag for all rows of A that only contain zeroes, because they can't be part of a deadlock consideration. Also, for tasks that are fully finished, their future allocation would be zero and we could delete those rows entirely from the array, so they're not in further consideration. Our simulation is going to use a working vector, which we'll call W, and we initialise that to the current setting of V. That's the value of V after we've done the nominal decrement for the resource we're tentatively just about to allocate. The essence of the simulation is very straightforward: we add into W the resources freed up by tasks that nominally finish at some point in the future. When one task is finished, it frees up resources that may enable another task to progress. The condition for marking a row is that all of its entries in the R array are less than or equal to the corresponding entries in the current W vector. If so, that task can be marked: it is considered to be able to run to completion. It will then free up its resources, and the resources that it will be freeing up are those that it currently holds in the A array. So we add the contents of that row of the A array to our working vector.

L5P19 Banker's Algorithm: Example 1
Let's look at an example of running the Banker's Algorithm. We have five threads, T0 to T4. At the current instant we have the already-allocated resources. We have three types of resource: X, Y and Z. And at this point of execution, we also have the R matrix, which records, you'll recall from the last slide, the maximum number of resources of each type the task/thread is still going to need in order to complete. And we normally have some free resources in the system, but their counts are currently all zero at this point, so we won't be able to allocate anything until something gets released. So the algorithm proceeds as follows. We will simulate the future progression in the work vector W. We'll need to find an unmarked row in A to start with and process it. Let's first of all dispatch all of those that don't require any further resources. T0 doesn't require any further resources, so it's going to run to completion and it will free up the resources it currently has, which is one Y, so that will transfer into W. And also T2 here doesn't require any further resources, so it will run quite happily, and its currently-in-use resources will also become deallocated and we can add those into W as well. Let's consider thread T3, which requires one X to complete. As we can see, over here we have three X's available. So this is going to be able to complete, and when it does, its freed resources will also be accumulated in W. Now we have T1 and T4 potentially to run next. T1 requires two further X's and two further Z's. T4 will require two further Z's. Over here in W we have sufficient resources to run either of those. Let's choose T4, and so it's going to run and free up its resources into W. And now that just leaves T1. T1 needs two X's and two Z's. There are sufficient resources in W, so that can also run to completion. All the rows are marked. The deadlock set is empty: there are none left unmarked. It would be OK to make the currently requested allocation.
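[Aside: here is a minimal sketch in Python, not the lecture's own code, of the marking procedure just described. A[i] is what task i currently holds, R[i] is the most it might still request, and V is the free pool, all as per-resource-type vectors. The tiny example at the bottom is made up for illustration; it is not the slide's data.]

    # Sketch only: the Banker's Algorithm safety check by row marking.
    def is_safe(A, R, V):
        n, m = len(A), len(V)
        W = list(V)                        # working vector, starts at V
        marked = [False] * n
        progress = True
        while progress:
            progress = False
            for i in range(n):
                if not marked[i] and all(R[i][j] <= W[j] for j in range(m)):
                    for j in range(m):     # task i can finish: free its holdings
                        W[j] += A[i][j]
                    marked[i] = True
                    progress = True
        return all(marked)                 # any unmarked rows form a deadlock set

    # Made-up illustration with two resource types (not the slide's numbers).
    A = [[1, 0], [0, 2], [1, 1]]           # current allocations
    R = [[0, 0], [1, 0], [0, 2]]           # maximum further needs
    print(is_safe(A, R, [0, 0]))           # True: T0, then T1, then T2 can finish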
L5P20 Banker's Algorithm: Example 2
Let's look at a modified scenario. This time T2 here is going to require one resource of type Z. So how's this going to work out? Well, let's run the algorithm. We'll start off by marking off those that don't need any further resources. So T0, as before, will be able to run to completion, and its assets will be transferred over to the W working vector. And now we're going to see whether there's another row that we can cross off. Let's have a look. Well, T1 will need two X's, and we don't have any X's. It will also need two Z's, and we don't have any Z's, so we definitely can't cross that one off. T2 needs a Z; we don't have any Z's. T3 needs one X, but we don't have any X's. And T4 needs two Z's, and we don't have such an asset. Oh dear, we can't proceed. There are no tasks that can run to completion, because we don't have sufficient resources. Each is going to depend on another freeing up some of the resources that it's already had allocated. These threads are in a deadlock set. This is not a happy situation.

L5P21 Deadlock recovery
As I said, the deadlock detection algorithms used for deadlock avoidance can also serve as detectors of real deadlock. As well as the sophisticated algorithms, we can also use brute-force techniques based on timeout. Most systems have a hardware watchdog timer, also known as the dead-man's handle. These are hardware devices that have to be stimulated on a regular basis by writing a hardware register, and if they don't get stimulated, then they apply something brutal like a high-priority interrupt or a system reset. But often these won't be relevant: the processes that we're interested in detecting deadlock amongst are not the sort which are typically involved in regularly pinging a hardware timeout mechanism. Those hardware mechanisms are more suited to detecting complete O/S-level lockup rather than localised deadlock amongst a few participants. A simpler solution is to kill something and thereby release its held resources. And of course we should choose something that's in the deadlocking set. Killing something is a bit brutal, but often it's the best we can do. A hardware reset, of course, is killing everything that's on that platform. Linux has an out-of-memory detector which, when memory is getting very low, kills something in the hope of rescuing the system a bit. But it might kill something fairly critical to the user, like the window manager. Rather than a complete kill, we could perhaps roll back to a checkpoint: that might work. Many systems don't detect or recover from deadlock at all; they just rely on the programmer or the user switching it off and on again. But it is possible to be elegant about it, especially when applications are coded using exception handlers to gracefully recover. Note that "killing something" is breaking one of our deadlock conditions: we're breaking the no-preemption condition.

L5P22 Livelock
Livelock, generally speaking, is easy for humans to detect. Again, the system stops making any effective progress, and so timeout-based solutions will detect it. But if we look at the low-level behaviour of the threads, then they are continuing to do something, and so there's less of a manifest pattern to livelock. Here's an example of livelock arising from a programming paradigm that tries to be creative and defensive against deadlock. Thread one locks X and then it tries to lock Y. But if it fails, then it's friendly: it unlocks X, has a sleep (yielding for an arbitrary amount of time), and then it locks X again and has another go. Now if another thread is doing the same sort of thing but with a different ordering of the locks, then we will have a deadlock-like situation, and they will iterate. It could go on for quite some time before one of them happens to get both locks. This is a typical paradigm that we might see under badly behaving "optimistic concurrency control", which we'll see later on. We need these yields to have a random element in them to disrupt the pattern.
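[Aside: here is a minimal sketch in Python, not the lecture's own code, of the lock-then-trylock pattern just described, with a randomised back-off. The random element in the sleep breaks the lock-step repetition that would otherwise keep the two threads colliding.]

    # Sketch only: acquire two locks without blocking while holding one,
    # backing off for a random interval when the second one is busy.
    import random
    import threading
    import time

    lock_x = threading.Lock()
    lock_y = threading.Lock()

    def acquire_both(first, second):
        while True:
            first.acquire()
            if second.acquire(blocking=False):     # trylock on the second
                return                             # success: caller holds both
            first.release()                        # be friendly: give it back
            time.sleep(random.uniform(0, 0.01))    # randomised back-off

    def worker(a, b):
        for _ in range(1000):
            acquire_both(a, b)
            try:
                pass                               # critical work would go here
            finally:
                b.release(); a.release()

    t1 = threading.Thread(target=worker, args=(lock_x, lock_y))
    t2 = threading.Thread(target=worker, args=(lock_y, lock_x))  # opposite order
    t1.start(); t2.start(); t1.join(); t2.join()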
L5P23 Scheduling and thread priorities
Let's turn now to thread priority considerations. Which thread should we run when there's more than one runnable thread? When a thread unlocks something that's been contended, we may suddenly have two possible threads that could be running; and when a condition variable broadcast wakes up multiple threads, there are several suddenly runnable. Which should we choose? The two baseline scheduling disciplines are static priority and round-robin. In round-robin, all of the contenders are put in some logical cyclic ring, and the next one along from the last one that was served is served next. Alternatively, we can assign a fixed priority to a thread and then choose the one out of the ready set which has the highest priority every time. We might choose the highest priority for real-time tasks, then we put interactive and short-lived events next, followed by bulky processes. At the bottom of the heap will be the idle process, which contains a HALT instruction and is run when nothing else needs to run. And when it runs, the CPU core stops until the next interrupt, saving energy. In Unix-like operating systems, you can adjust the static priority of a process with a system call called "nice", which is also available as a command-line command. The static priorities don't really have to be static: they can be varied. They can be adjusted by various calls such as the nice call, and that may be done to dynamically balance the performance of the system, for instance to boost the priority of a task after it's recently done some I/O. Gang scheduling is where we consider a group of threads together (or processes even) and schedule them all at once, in a gang. That is a good approach where these cooperating tasks share resources in common. This will help with caching and also perhaps paging, if that's going on, when the required data structures get paged in from disc or secondary storage. And we may care to have a certain amount of core affinity. That means we will tend to run a given thread on a specific core each time it is scheduled to run. And again, this is good for cache behaviour, because the data that that thread needs is likely to be in the localised caches for that processor core. The design of the scheduling system tries to optimise multiple metrics at once, which isn't always possible. We might want good latency or good throughput or low energy use, and generally speaking we will always want some fairness, or at least to avoid total starvation.

L5P24 Priority inversion
If we've put a lot of work into giving threads priorities, then we want the system ultimately to pay some attention to them. So what happens when a thread waits for another thread of a different priority?
If a high-priority thread and a low-priority thread both tend to use a lock in common, then we would hope that the low-priority thread is respectful and releases that lock sharpish, so that the time that the high-priority thread is held up is minimised. That's just good system design. [Also, of course, when several threads are blocked on a lock, the scheduler should choose the highest-priority one to run when the lock next becomes free.] But sometimes it can go wrong, and the classic example is the three-thread priority inversion problem. So suppose we have three threads T1, T2, T3, with T1 being the highest priority and T3 being the lowest priority. We have a lock which is shared amongst the outer two, T1 and T3. Call the lock LL. Consider the following pattern of events. Suppose T3 has acquired the lock LL, but then T1 preempts T3 because it's higher priority. T1 then goes to sleep waiting for the lock. This should normally enable T3 to continue so that it releases the lock for T1. But while T3 is running, T2 comes along, and that's higher priority, so it can preempt T3. So there we are. We have T2 running and T1 blocked. T1 is blocked waiting for T3. This is our classic priority inversion. T1 wants to run; T2 is running, but T1 is higher in priority. T2 has got to stop before T3 can continue, release the lock, and finally let T1 take its rightful place. This is not deadlock or livelock, and it's not desirable, especially in real-time systems. This particular problem in fact disabled the main activities of the Mars Pathfinder robot. You can read Mike Jones's paper from 1997 about that event. Fortunately, the bug could be diagnosed and a patch applied remotely.

L5P25 Priority Inheritance
The typical solution to this is something called "priority inheritance". Priority inheritance is a fairly easy mechanism to add to a set of concurrency primitives. All we do is temporarily boost the priority of a lock holder to that of the highest-priority thread waiting on that lock. This is pretty easy to do inside the operating system scheduler as it considers what to run, although it does require that the lock holder's identity is recorded in the lock. But that's not a bad thing to do in general: it can certainly help with debugging and that sort of thing. So returning to the classic scenario, if we have priority inheritance, T3 will now run with T1's priority while it's holding the lock LL that T1 was waiting for. T3 now has higher priority than T2, and so T2 would not have preempted it. This solves the main issue and provides concrete benefits to the system's responsiveness. The Mars Pathfinder was running the very popular real-time microkernel called VxWorks, and a Nat??? system. You can specify whether priority inheritance works on a per-instance basis, as you create locks. But clearly the wrong decision was made, either as a matter of global policy or local stupidity, at the time that critical mutex was established. There are alternative and less scientific approaches. In some versions of Windows, there was a monitor which checks whether any process hasn't run for 300 ticks, which would be about 3 seconds or something like that, and if so, its scheduler quantum is doubled and its priority is doubled as well, giving it a bit of a higher priority in a somewhat brute-force approach. But the principle being demonstrated there is actually quite useful in general as a matter of fairness. In general, things that haven't run for some while should be given a better chance and a bigger bite at the cherry next time they do want to run. The term quantum here refers to the number of operating system time slices that a task is given before it's preempted.
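[Aside: here is a toy sketch in Python, not the lecture's own code and not a real scheduler, just a small model to show the mechanism. Tasks have a base priority, the lock records its holder, and a blocked waiter donates its priority to that holder. The three-task scenario at the bottom mirrors the T1/T2/T3 story above.]

    # Sketch only: a toy model of priority inheritance.
    class Task:
        def __init__(self, name, base_prio):
            self.name = name
            self.base_prio = base_prio
            self.inherited = 0           # boost donated by waiters
            self.blocked_on = None       # lock this task is waiting for

        def effective_prio(self):
            return max(self.base_prio, self.inherited)

    class PILock:
        def __init__(self):
            self.holder = None           # the holder's identity is recorded
            self.waiters = []

        def acquire(self, task):
            if self.holder is None:
                self.holder = task
            else:
                self.waiters.append(task)
                task.blocked_on = self
                # priority inheritance: boost the current holder
                self.holder.inherited = max(self.holder.inherited,
                                            task.effective_prio())

        def release(self, task):
            assert self.holder is task
            task.inherited = 0           # drop any inherited boost
            self.holder = None
            if self.waiters:             # hand over to the highest-priority waiter
                nxt = max(self.waiters, key=Task.effective_prio)
                self.waiters.remove(nxt)
                nxt.blocked_on = None
                self.holder = nxt

    def pick_next(tasks):
        """The 'scheduler': run the highest effective-priority unblocked task."""
        ready = [t for t in tasks if t.blocked_on is None]
        return max(ready, key=Task.effective_prio)

    # The classic scenario: T1 high, T2 medium, T3 low; T3 holds LL.
    T1, T2, T3 = Task("T1", 3), Task("T2", 2), Task("T3", 1)
    LL = PILock()
    LL.acquire(T3)                       # T3 takes the lock
    LL.acquire(T1)                       # T1 blocks; T3 inherits priority 3
    print(pick_next([T1, T2, T3]).name)  # -> T3, not T2, so LL is soon released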
L5P26 Problems with priority inheritance
Priority inheritance is a bit of a hack in some sense. If there's a lot of it going on, then maybe your system isn't designed that well to start with. We do end up with threads running at the wrong priority for some of their time, and that may disrupt our budgeting. It can get a bit complicated when there are a lot of locks involved. You might consider how it would work for a multiple-reader, single-writer system, for instance. And of course, we don't just have locks; we have other forms of inter-thread synchronisation. With locks, we can easily record which thread is holding the lock. But when we're waiting on a semaphore, then there won't normally be any notion of which thread might signal that semaphore, and the same goes for condition variables. Do we boost the priority of every thread that might possibly raise that signal? So overall, it's better if we redesign our system to avoid the need for priority inheritance as much as possible. This generally means avoiding sharing resources between threads of different priority.

L5P27 Limits to Parallelisation
With the way that digital electronics on silicon has gone, using concurrency and parallel computing is the primary means of getting more performance today. So new software and new programming languages need to take into account the need for parallelism, and we're also faced with the problem of trying to somehow auto-parallelise old legacy software. In this course we shall next study concurrent transactions, and the important thing is how these transactions interact with each other. The amount of interaction limits the amount of parallelism that we can achieve. If there are no dependencies between two parts of a program, then they can be run in any order or in parallel. When there are no dependencies, we call it an "embarrassingly parallel" problem, and the reason for that is that as we increase the number of cores, if we don't get linear speed-up with the number of cores, then the standard of our engineering skills is an embarrassment. Classic examples of embarrassingly parallel problems arise in graphics. For instance, computing each pixel in a Mandelbrot picture can be done in parallel with no dependencies, and the same goes for decoding each tile of a JPEG image. Where the result of one computation is needed before we can proceed to the next, we have a DATA DEPENDENCY between them. It's also possible to have a data anti-dependency, where the result of one computation is going to overwrite a value needed for another one, but that can be avoided by having sufficient storage and using renaming in the storage management. With a data dependency, the result of one computation is needed as the input for a subsequent one, but we can also have a control dependency. Under a CONTROL DEPENDENCY, whether a computation is needed or not depends on the result of a previous one. With a control dependency, we can speculate and do the work anyway, even if we don't know whether it's going to be needed. But we must make sure that we don't commit any results that shouldn't have been visible if the work wasn't actually commanded. And we can also speculate about data dependencies.
We can speculate that a location isn't likely to be changed, and hence we can safely read the value from it and base our computation on it. Or we can speculate that it will be changed, and we can guess what value it's going to be changed to and base our speculative work on that prediction.

L5P28 Available Parallelism of a task
Here's an example data dependency graph involved in a computation. In this example, work consists of little chunks which are all the same size. We need to run four of these chunks before we have enough starting information to go four different ways. We can then, at the end, combine the results of the four parallel strands, do two more units of work, and then we've finished. The total number of units of work is 35. If each takes the same amount of time to run, then on a uni-processor this will take 35 units of time. If there are four processing elements available, then we can organise the work as shown in the diagram and the total execution time will come down to 14. If we have further processing elements available, then we won't be able to do any better than this schedule. You can try it if you like. We could also work out the execution time using two and three cores. I think for three cores it's 16 units. If we take the ratio of execution time on one processing element to what we can achieve with an unlimited number of processing elements, then we have the AVAILABLE PARALLELISM of the task. Here it comes out as two and a half. [This diagram, by the way, comes from my textbook which came out this summer, 2021. So I hope your college library has a copy.]

L5P29 Auto-parallelisation possibilities
A lot of old software and classic algorithms are expressed imperatively in procedural programming languages, and they were designed for single-threaded execution. Let's just look at whether we can auto-parallelise some of these algorithms so they work well on today's multicore hardware. In the green example here we have something which parallelises quite well. We have a for loop, and this basically follows the map-reduce paradigm. We have some computation which is made for each index value in the FOR loop. That is the so-called mapped computation. We again assume that xf0 is expensive and worthy of synchronisation overheads. And then we have the reduce operation, a scalar reduction: the answer is a single scalar variable. The way that the results of the various iterations are combined and accumulated in that scalar variable is associative. There are quite a number of associative operators that we know, such as maximum or addition (not floating-point addition, by the way), and these are binary operators. When we have a lot of values to reduce to a single value using a binary operator, we inevitably use some sort of binary tree structure. And the point about an associative operator is that it doesn't matter what structure of tree we use, we'll get the same answer. Hence, if this is a PARALLEL FOR being orchestrated by some system that we don't know quite the details of, we are confident that we're going to get the same answer regardless of which order it composes the individual results to get the final scalar result. Obviously, the details of these two examples don't matter. What does matter is the general structure of the data flow between the various operations. You could say that the variable 'i' (the loop induction variable) is shared by the various iterations in this green example and also the red example. But that doesn't matter: we take it for granted that we can cheaply replicate the control of induction variables and loop control flow. The subtle difference with this red example is the way that the variable VD behaves. We see it's given an initial value before the loop starts, but moreover, in each loop iteration it is both read and written. The result of one loop iteration updates VD, and that value of VD is read in the next iteration. So this is a data dependency, and it is going to prevent parallelisation. In fact, we call it a "loop-carried data dependency", or just a LOOP-CARRIED DEPENDENCY for short.
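[Aside: the green and red code on the slide isn't reproduced in this transcript, so here is a small Python sketch of the two shapes just described. The name xf0 is borrowed from the slide but its body here is just a placeholder; everything else is assumed.]

    # Sketch only: a parallelisable map-plus-associative-reduce versus a
    # loop with a loop-carried dependency.
    from concurrent.futures import ProcessPoolExecutor

    def xf0(i):
        return i * i                       # placeholder for an expensive function

    def green(n):
        # Independent iterations reduced with an associative operator (+),
        # so the iterations can be run in any order, or in parallel.
        with ProcessPoolExecutor() as pool:
            return sum(pool.map(xf0, range(n)))

    def red(n):
        # Each iteration both reads and writes vd, so iteration i cannot
        # start until iteration i-1 has finished: a loop-carried dependency.
        vd = 1
        for i in range(n):
            vd = xf0(i) + vd // 2
        return vd

    if __name__ == "__main__":
        print(green(8), red(8))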
L5P30 Mutable arrays (and collections) are the biggest pain for auto-parallelisation
Now these two examples are nice and easy to analyse, because they're operating only on scalar variables. But when we come to look at arrays, things get much more tricky to analyse. In fact, they're going to become undecidable in general. If you don't know it already, you'll learn the meaning of undecidable in the Computation Theory course. The main memory of a computer is in fact one of the most powerful components for doing computation. If we look at, for instance, bin sort, we can sort a very large amount of data in linear time, and that's because the memory uses a lot of transistors to decode which particular location, out of millions or billions, you are currently addressing. And of course that main memory is made available to the programmer in imperative programming languages using the array construct. For early computers in the 1950s, there wasn't much main memory, and algorithms were very much focused on making the most of the tape drives, moving data in and out of that small amount of memory. And funnily enough, we're sort of going full circle these days when we try to process data sets which are much, much bigger than will fit into the main memory of any of our processing elements. But anyway, all through the 60s, 70s, 80s and 90s and so on, the array construct was the most important component for many algorithms. Of course, the algorithms might be expressed using sets, but the sets would be built using arrays underneath. However, if we want to use pure programming languages, purely functional or declarative languages that don't mutate any variables, then the array suddenly becomes far less useful. We might be able to do something by storing to each location exactly once, but not being able to change the values stored in locations does hamper many traditional algorithms. And this is really quite a big shame, considering that array access is such a powerful hardware primitive on traditional computers. For auto-parallelisation, however, arrays start to become unattractive. They suffer from what's known as the name or address alias problem. In the examples on the previous slide we used scalars, and if we'd used arrays with fixed subscripts, there would still be no problem: we would be able to tell by looking at the code whether a given access to that array was referring to the same location as another access or a different one. So the name alias problem is not being able to tell at compile time whether two accesses to an array will definitely be different or will definitely be the same. If we can't tell, then we have a name alias problem. Sometimes it is very easy to tell: one might be at location X and the other at location X + 1. Well, we can easily tell that they're not the same as each other. And a set of rules will be able to give us a clear answer for a large number of simple cases.
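[Aside: a tiny hypothetical illustration, not from the lecture, of the alias problem just described.]

    # Sketch only: whether these two accesses touch the same element
    # depends on the runtime values of i and j, so a compiler cannot in
    # general prove that the two statements are independent.
    def update(a, i, j):
        a[i] = a[i] + 1      # may or may not alias with the next line
        a[j] = a[j] * 2      # if i == j, the order of these two lines matters

    # Easy cases do exist: a[i] and a[i + 1] can never be the same location.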
But from Computation Theory you'll discover there will always be programs where it is impossible to tell. If we cannot tell whether two accesses to an array will definitely be different, then we have a potential data dependency between them, and this will limit the amount of auto-parallelism that we can deploy, for exactly the reason at the bottom of the last slide. However, maybe this isn't such a severe problem. Things are moving on: after all, modern computers do not offer uniform random access to main memory. They have multiple layers of memory hierarchy, and the power of the array is diminished compared with what it used to be. They also speculate in hardware over whether two addresses are the same or not. And with the advent of multiple cores, the importance of the traditional algorithms has diminished, because new algorithms give better performance on parallel cores. For instance, for single-source shortest path, the classic Dijkstra algorithm does not perform at all well on a multicore computer, and instead an alternative algorithm, which might do a bit more work but is better able to use the cores, will tend to be deployed instead. I'm having a bit of a rant here. The content of this slide is not examinable on this course, but you might well find that it influences quite a lot of the rest of your programming career.

L5P31 Summary + next time
In summary, we looked at liveness and deadlock and the requirements for deadlock to exist. We looked at sketching the pattern of use in various resource allocation graphs, and considered deadlock avoidance, both offline and online, plus deadlock detection and, briefly, deadlock recovery. Then we looked at thread priority and briefly mentioned various scheduling algorithms. Finally, we considered the priority inversion problem and looked at the priority inheritance mechanism for addressing that. Next time we will look at our underlying assumption of cache-consistent shared memory and whether that's a good idea. We'll see some fundamental alternatives to shared memory; principally that will be reliable message passing. And we'll look at transaction processing systems, where transactions have to effectively complete atomically. You'll remember the ACID properties mentioned in the databases courses, and we'll look at the isolation and serialisability of the effects of transactions. END