CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 8 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory.

DURABILITY

Last time we looked at isolation and strict isolation and three types of transaction processing system and how they provide atomicity. But a transaction processing system needs to provide more than atomicity. It also needs to provide durability. And we need the mechanisms for undoing the effects of a transaction should it abort before it is committed. The principal mechanism to support this is going to be the log, but the log itself also has to have certain atomicity and persistence properties that we'll discuss. We look at write-ahead logging (WAL). We look at checkpoints, recovery and rollback. We'll also look at how to program in memory without using locks, if time permits: lock-free programming. And we'll briefly introduce the transactional memory paradigm, where writes are made to a local memory and then that is committed atomically.

L8P03: Crash Recovery & Logging

Transaction processing systems need to preserve the four ACID properties. We've spoken a great deal about isolation, and if the aggregate effect of a transaction's individual updates preserves invariants, then we've covered implicitly managed consistency. So how can we ensure atomicity and durability? We need to make sure that the transaction is always done entirely or not at all. That is the essence of atomicity. For durability, we need to make sure that a transaction that has been reported as committed remains committed regardless of what happens, even if the system crashes. Well, we need to make some assumptions about the nature of the crashes that we have. A crash is not going to be the same as a hack from a malicious agent. We're going to exclusively use the FAIL STOP model of system failure, which is: if the system crashes, then all of the primary storage, which is the in-memory data, is lost, but all data on the disc in secondary storage, whether it's spinning disc or SSD, remains available after a reboot. But more precisely, the FAIL STOP model relates to a system that just stops when it fails. It doesn't fail in a way where it starts mowing the roses. This comes from a lawn mower analogy where the lawn mower either is working well, cutting the grass, or it stops and is no longer cutting the grass, but it doesn't drive off under its own steam into the prized rose bed and cut down the roses.

L8P04: Semantics of secondary store

With the simple spinning discs of previous decades, every sector or block written on the disc has a cyclic redundancy check: a glorified checksum on the end of that sector, which has to tally when the data is read back. Otherwise the whole block is considered invalid on read. And with those designs, it's very reasonable to model the write of a disc block as an atomic operation. If the system crashes during the write, the cyclic redundancy check will not be written on the end properly, and the whole block will be invalid and will be essentially erased when we read it back. So the crash possibilities are that the block isn't written at all, so we have the old data there; it's erased because we've half-written it but not fully written it; or it's written properly and the crash happened after we'd written that block. So we have pretty much an atomic model where we treat the erasure as lost data.
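To make that model concrete, here is a tiny Python sketch of the "block is valid only if its checksum tallies" idea. It is just a toy model of what the drive firmware does for us, not an interface you would program against, and the four-byte little-endian CRC layout is an assumption made purely for illustration.

    import zlib

    def write_block(payload: bytes) -> bytes:
        # The drive appends a CRC to the sector as it writes it.
        crc = zlib.crc32(payload).to_bytes(4, "little")
        return payload + crc

    def read_block(raw: bytes):
        # On read-back the CRC must tally; otherwise the whole block is
        # treated as invalid, i.e. as erased / lost data.
        payload, crc = raw[:-4], raw[-4:]
        if zlib.crc32(payload).to_bytes(4, "little") != crc:
            return None
        return payload

A half-written block is then indistinguishable from an erased one, which is exactly the atomic-or-lost model just described.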
The situation with flash-based SSDs is more complex. We might think we're doing something good for redundancy purposes by writing data to two different places on the SSD, but internally the SSD might actually only store it in one place. Under the bonnet, the SSD is undoing our attempt to implement redundancy, and so on. There's an article on the course website about how atomic operations really work on flash-based secondary storage. But for the sake of this course, we'll just assume that a disc block write is atomic.

L8P05: Using persistent (non-volatile) storage

So given we have an atomic write to our secondary storage, you might think the simplest solution is just to write every updated object straight out to disc when we commit the transaction, and then we can read that back if we have a reboot. But generally speaking, a transaction will write to multiple places, perhaps on different disc drives or even on different computers. The throughput requirements might be a problem as well. Generally speaking we need to have a disc write queue of blocks waiting to be written out. We can't afford the time to flush that queue every time we commit a transaction. So the solution is to split our update into two stages. We'll firstly write the proposed updates to our write-ahead log, and this could be in quite a concise form. And then we can later on make the actual updates. If we have a crash during the first phase, then these were only proposed writes, so no data is going to be lost, and if we have a crash during the second phase then we'll be able to undo or redo the transactions using the log information. Achieving conciseness in the log can be done in various ways, but basically we need to use lossless compression, such as recording the smaller edits made to a larger data structure. Oh yes, and don't forget that another thing that we have to handle is where transactions abort. And where we have non-strict isolation we can get cascading aborts. So although these next few slides concentrate on using the log to help us recover from crashes, the related state recovery that needs to be done when we wish to roll back a transaction can also be done from the log.

L8P06: Write-ahead logging

The log is an ordered, append-only file on disc. It will contain entries of the following sort of form: the transaction identifier, the object to be modified, the operation on the object, the old value, and the new value. That sort of thing. And as we'll see, the log doesn't need to grow indefinitely; it just needs to hold recent history sufficient to recover from a crash, and also to help with rolling back transactions. But note that we must have sufficient information in the log so that for each operation we can potentially redo it or undo it. Each transaction will have a number of entries in the log: there are start and end bracketing entries and then there will be intermediate entries for all of the operations on the individual objects. Of course, normally we have a lot of concurrent transactions going on and the individual operations for each transaction will be interleaved in the log, reflecting the lack of strict isolation. The write-ahead log is going to need for itself some sort of commit or flush operation that makes sure that what has been written to it is persistently stored, and this operation is to take place between the 1st and 2nd phases on the previous slide.
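To make the record format and the flush requirement concrete, here is a minimal Python sketch of such a log. The field names, the JSON encoding and the file name are purely illustrative assumptions (a real system would use a much more compact encoding); the flush is just the ordinary POSIX fsync.

    import json, os

    log_fd = os.open("wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def start_record(txid):  return {"kind": "start",  "tx": txid}
    def commit_record(txid): return {"kind": "commit", "tx": txid}
    def abort_record(txid):  return {"kind": "abort",  "tx": txid}

    def op_record(txid, obj, op, old, new):
        # One entry per individual operation: enough to undo it or redo it.
        return {"kind": "op", "tx": txid, "obj": obj, "op": op,
                "old": old, "new": new}

    def log_append(record):
        # Appends to the head of the log; the OS may buffer this lazily,
        # so nothing is guaranteed to be persistent yet.
        os.write(log_fd, (json.dumps(record) + "\n").encode())

    def log_flush():
        # The commit/flush operation the slide calls for: block until the
        # appended records have reached the point of persistence.
        os.fsync(log_fd)

Interleaved operations from concurrent transactions simply appear as interleaved op records in this one append-only file.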
For those who are interested, you might want to read up and see that modern processors and bus standards have a so-called write-persist operation which overcomes any write buffering that's in the bus structures and caches for performance reasons and makes sure that data is flushed all the way through to the so-called "point of persistence", where it's known that it will be there after a crash or a reboot. That sort of detail, of course, is beyond the examinable scope of this course. I'm just mentioning it for those interested in advanced processor architecture and bus standards.

L8P07: Using a write-ahead log

As I said, operating systems have always tended to use lazy write-back for performance reasons, and with our write-ahead log it's no different, except that we will need to put this flush in, as I've said, at the critical points. Both the writes to the log and the writes of updates to the main data-holding pages are going to be subject to lazy write-back. We always write the log records before we write the corresponding data. But when we wish to commit a transaction, we will first write to the log that final commit record, and then we will synchronously flush the log. This means the transaction context won't do anything else until it hears back that the log is flushed to its persistence point. That's what we meant by synchronous there. Then we can return to the client that the transaction is committed. So in Unix there are the fsync and fdatasync system calls that you can use on an open file. We'll only report to the application code that the transaction is committed when we have received a positive fsync return code from flushing the log. Now critically, note that doesn't mean that we've flushed out all the data changes. The data writes can still be sitting in the lazy write buffers. [The difference between fsync and fdatasync is beyond the scope of this course to some extent, but actually it involves all the same concepts that we're covering in this course. Many Unix systems use the EXT4 journalling file system, which you may care to read up on, and so, for everyday data, it's actually maintaining a system very much like what we're discussing here. So this whole thing is actually going on at two different levels in a sense, but if you're new to this, then just ignore that remark.] Of course, now we are flushing something on every commit, even though it's just the log, so we are again perhaps going to lose performance, and that could be intolerable. So what we will do in a better system is to batch up transaction commits. And we will make the flush less frequently, but for application code that's waiting for confirmation of a commit, there will be a performance penalty if we dally in that way. [Note that we are presenting a reference implementation of write-ahead logging in these slides, and real-world implementations will optimise away some of this flushing of the log and other "stop-the-world" aspects like the checkpoints introduced shortly.] And again, I'm using the word dally as a technical term, which is where we wait around hoping that we can do something worthwhile, rather than pressing ahead with something else that would accrue a greater total penalty if we pressed ahead with it and then immediately needed to do something almost the same straight afterwards. So at any one time, we're going to have the head of the write-ahead log in memory (in RAM) and the rest of it flushed out to disc. By "head" of a log, I'm referring to the end that we're adding to (whereas in queueing, one joins the tail of a queue and receives service when one reaches the head!).
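Continuing the sketch above (and assuming the log_append, log_flush and commit_record helpers from it), the commit path looks roughly like this. Note that the data pages themselves are deliberately not flushed here.

    def commit(txid):
        # 1. Append the final commit record to the head of the log.
        log_append(commit_record(txid))
        # 2. Synchronously flush the log: wait for fsync to report success.
        log_flush()
        # 3. Only now do we report back to the client.  The updated data
        #    pages may still be sitting in the lazy write-back queue; if we
        #    crash, the flushed log is enough to redo this transaction.
        return "committed"

A batching variant would queue several commit records and issue one fsync for the whole batch, trading commit latency for throughput; that is exactly the dallying trade-off just described.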
L8P08: The Big Picture

This diagram gives the general setup. We can see that the in-RAM part of the log has transaction T1 updating X, T2 updating Y, then transaction T2 aborted, and the most recent thing is that transaction T3 has started. We also see the object values in RAM. These are dirty values that have yet to be written out to the persistent storage. Whereas on the disc we have the committed part of the log: the bit that's been flushed out to persistent storage. Also on disc we see the object values that have been written out (of course). See that T0 has committed but its final change of X to 2 has not yet hit the disc. Indeed it probably won't, since T1 has changed it twice further, but those changes need to be erased if T1 aborts, revealing T0's final change. There's a subtlety here. You might tend to think that the disc only contains the committed values, but that's not the case. We don't artificially delay the write-out of changed values to persistent storage. We don't wait until the transaction has finished, but we do make sure that the log entries that describe the changes we've made as part of uncommitted transactions have been written out to the disc persistently before the values themselves are written out; hence the log will contain what we need to know for recovery. Remember, the basic assumption is that everything that's persistent on secondary storage is preserved over a crash, whereas everything that's in RAM is lost at a crash, which could happen at any time.

L8P09: Checkpoint Approach

Clearly a log that gets infinitely long is not practical, especially if we have to read the whole of it every time we have a crash recovery. Only a finite amount of information is useful and it is practical to identify a cut-off point, before which everything is redundant for rollback and recovery purposes. Prior to that cut-off horizon, only the information required for audit purposes (e.g. previous years' bank statements) might be kept. A periodic checkpointing approach is commonly used. The simple checkpoint is a complete snapshot of the unsaved changes to objects and the log, at a particular point in time. This (obviously?) encapsulates everything needed for recovery. The term "snapshot" implies taking full notes of everything required atomically, which essentially implies a "stop the world while I do it" approach, since, in reality, making such an extensive checkpoint will take some time. Note that some/many transactions are still in progress at the checkpoint time. Waiting for an opportunity when no transaction is in flight could be a wait forever (like MRSW writer starvation). Using a strict write-ahead policy, a record of all unsaved changes to objects is at all times persistently saved in the on-disk part of the log. Hence the work at the checkpoint is to flush the log to disk and then to delete historic parts of it that are assuredly not needed for any future abort rollback or crash recovery. It is also common to write out unsaved changes to objects as part of the checkpoint. This is particularly relevant if the server is about to be shut down! Since the checkpoint needs to be a self-consistent "snapshot", it is easiest to pause all additions to the log and in-memory changes to objects while the checkpoint and log trimming are performed; object reading by running transactions can continue.
[But, once you're familiar with the material in this lecture, you will perhaps see that it's an incremental change to allow updates to objects to continue in the background during the 'snapshotting', provided they don't change the snapshotted versions and, as always, these further changes are first written out to a new log, etc. It's layers upon layers. But the simple approach described above is a fine and sufficient baseline.]

So sticking with an atomic checkpoint, we need a heuristic for when to make it. For an object-flushing snapshot, we don't want to do it overly frequently, because that will reduce the benefit of asynchronously writing back to the disk. So what we need is some reasonable granularity covering the last 30 seconds or so, or the last 50 transactions, or 5000 transactions if that was faster than the last 30 seconds. That sort of heuristic.

L8P10: Checkpoint Approach

And here are the four steps that we conduct to perform a checkpoint. Firstly, we do the flush of the log. We make sure that everything that's in RAM for the log is transferred out to disc. Then we write to the log a special checkpoint record which lists all of the currently active transactions, that is, those that have been started but which have not yet been either committed or aborted. We'll only be using that checkpoint record if we have to do crash recovery, and we'll be searching for the related operations for the transactions listed in the checkpoint record. We can now flush all of our dirty objects out to disc to make sure the values in persistent storage reflect the most recent edits or changes. And that might be a fairly slow process, but it's just part of our normal operation anyway. We're going to keep one additional data structure, which is the address of the most recent checkpoint that was committed to disc. This has to be kept persistently as well, and it could be a disc block number or it could be some sort of index or file offset in the log. We need to keep this note in a well-known place. If we were doing this manually, we might write it in our diary or on the whiteboard or something like that. On a computer it can be stored in a particular disc block that's always referred to, such as disc block 401 on a particular drive. [In reality, it's best to keep the last N (the last three or four) checkpoint addresses in our special place. As is well known to anyone who's used these systems in reality, if we have a special place and a crash, it's always the special place that tends to get corrupted. So we'll have several special places in practice, each at a well-known address, so that we have some resilience against one of them being corrupt.]
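Here is a rough Python sketch of those four checkpoint steps, building on the earlier log helpers. The current_log_offset and write_object_to_disc helpers and the special_places store are hypothetical names standing in for whatever the real system provides.

    def checkpoint(active_txids, dirty_objects, special_places):
        # Step 1: flush the in-RAM head of the log out to disc.
        log_flush()
        # Step 2: append a checkpoint record listing the currently active
        #         transactions (started but neither committed nor aborted),
        #         and flush again so that the record itself is persistent.
        ckpt_offset = current_log_offset()       # hypothetical helper
        log_append({"kind": "checkpoint", "active": list(active_txids)})
        log_flush()
        # Step 3: write all dirty object values out to persistent storage.
        for obj in dirty_objects:
            write_object_to_disc(obj)            # hypothetical helper
        # Step 4: note where this checkpoint record lives in a well-known
        #         place, keeping the last few such addresses for resilience.
        special_places.record(ckpt_offset)

Once the checkpoint is complete, log entries older than it that are assuredly not needed for any possible rollback or recovery can be trimmed.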
L8P10: Checkpoints and recovery

Here's the general setup with a checkpoint-based log. At the time we failed, there will have been a most recent checkpoint, which we can find from our special noting place. Now there are five possibilities for the state of a transaction with respect to the failure point.
1. It could have successfully committed before the checkpoint.
2. It could have been active during the checkpoint, but have successfully committed afterwards, before the crash.
3. It could have started before the checkpoint and still be ongoing at the time of the crash; indeed, it could have started several checkpoints ago.
4. Or, in the T4 case, it could have started after the checkpoint and committed successfully before the failure.
5. In the T5 case, it could have started after the checkpoint and still be outstanding at failure time.
Of these five cases, all but the first require recovery after failure. The two that had reported successful commits before the crash need to be redone, and those which didn't commit need to be undone.

L8P11: Recovery algorithm

The recovery algorithm proceeds as follows. We have two collections, U and R, for undo and redo. We initialise U to the set of active transactions at the checkpoint time. These will potentially need undoing if they didn't commit. We initialise R to be empty. We walk forward in the log starting at the checkpoint record position. If we see a start record, then we add that transaction to U, since it potentially will need undoing. And if we see a commit record, we move the transaction from U to R. So each transaction that's going to need some tidy-up is now either in U or R. First of all, we undo the transactional side effects that need undoing, and we do that in reverse order. So we walk backwards from the end of the log, process all of the entries in the log that match the U list, and undo their effects on the persistent storage. The persistent stores may not have been flushed out, so we also need to be able to handle that. This means that correcting a credit with a matching debit won't work if we can't be sure the credit happened. Instead, an explicit-state, idempotent-style recovery action is needed. (Idempotent means doing it more than once is the same as doing it once.) Ultimately, we will have removed all evidence of the transactions that we need to undo. Now we process the redo list. We work forward from the checkpoint position, redoing the side effects of all the transactions that are on the R list. Some of these may have been written out, in which case we'll be making no changes to the written-out data. Some of them may have been written out twice, in which case ultimately we'll also be making no changes to the written-out data. Idempotency again. And some of them may not have been written out, in which case we will be making the change. Finally we have redone all of the side effects of all of the transactions that successfully committed before the failure time. Once we've completed the recovery, we've essentially formed a new checkpoint at the time of failure. Any entries in the log that are no longer needed for crash recovery can be truncated. Can we simply start a new log going forward? Not necessarily: it might be that certain long-running transactions span multiple checkpoints, and if one of those aborts, then in order to undo its operations we might need earlier parts of the log. The order in which we process the undo operations, doing them backwards, is important because that will properly handle the cases where multiple transactions touch the same data. We perform the redoing in the natural forward order because we're essentially just simulating what happened before. You might be tempted to worry that we need to redo a committed transaction that depends on a change made by something that we've undone, but that never happens because of a basic invariant in such transaction processing systems, whereby we never commit such a transaction. So we're sure that everything a committed transaction depended on was itself committed.
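A minimal sketch of that recovery procedure, using the record shapes from the earlier sketches: log_records is the persistent log loaded as a list, ckpt_index is where the checkpoint record was found via the special place, and write_value is a hypothetical idempotent write to persistent storage. Aborted transactions and various other refinements are ignored for brevity.

    def recover(log_records, ckpt_index, write_value):
        ckpt = log_records[ckpt_index]
        U = set(ckpt["active"])    # may need undoing (active at checkpoint)
        R = set()                  # must be redone (committed before crash)

        # Forward pass from the checkpoint: classify transactions.
        for rec in log_records[ckpt_index + 1:]:
            if rec["kind"] == "start":
                U.add(rec["tx"])
            elif rec["kind"] == "commit":
                U.discard(rec["tx"])
                R.add(rec["tx"])

        # Undo pass: walk backwards over the whole log, restoring old values
        # for every operation belonging to a transaction in U.
        for rec in reversed(log_records):
            if rec["kind"] == "op" and rec["tx"] in U:
                write_value(rec["obj"], rec["old"])

        # Redo pass: walk forwards from the checkpoint, reapplying the new
        # values of every operation belonging to a transaction in R.
        for rec in log_records[ckpt_index + 1:]:
            if rec["kind"] == "op" and rec["tx"] in R:
                write_value(rec["obj"], rec["new"])

Note the orders: undo runs backwards over the whole log (a long-running transaction's earliest operations may pre-date several checkpoints), while redo runs forwards from the checkpoint, replaying history in its natural order.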
L8P12: Write-ahead logging: Assumptions

There are a few further details that need to be solved in a real-world system. We've assumed that the operations that a transaction performs on each object are themselves atomic operations, and in earlier lectures we've looked at how to apply locks and so on to make operations on individual objects atomic. And we've assumed the atomic sector or block write paradigm for our secondary storage. But if one of these objects occupies more than one sector or block, then the save of that object, after the operation that modified it, may itself crash halfway through, and the on-disc copy of the object may be self-inconsistent. Alternatively, many modern file systems are implemented using copy-on-write methods (this laptop has the Btrfs, "butter", file system!), which is where, when we change or update something, we don't change the original stored version at all. We write a new version, or part of the new version, to somewhere fresh and update a pointer somewhere further up to indicate which is the current copy of the live version. This is especially true on media that's write-once or media with a bulk erase which needs wear levelling: flash memory is of that nature. The latest generation of computer hardware has new non-volatile memory slots on the motherboard (3D XPoint memory, Intel Optane, M10 cards), and so having a strange mixture between primary and secondary storage is increasing in mainstream computing; it's also, of course, fairly common in embedded systems. Where copy-on-write is being used, it's often the case that for very low additional overhead we can have a lightweight undo operation on an object supported by the file system itself, so maybe that will be used to augment and to simplify log-based recovery of multi-block objects. And as I've already mentioned, for SSD the underlying media might not be honest about what's happening. You may think that you've written a sector, but really it's written a much larger amount of data, or nothing at all. On some very high density discs, the minimum amount that you can write is a whole track, and that's only if you're about to write the track afterwards as well, because it constructively uses the crosstalk: as you write one track, it disturbs the magnetic regions in the nearby track, which is ever so finely spaced, and so on. So all of these tricks are going on at the lowest level, which basically means that we have to use qualified discs, specially provided for database applications, rather than relying on the general-purpose, lowest-quality discs that are made available for everyday laptop use and that sort of thing. On Linux, given adequate permissions, you can directly open /dev/hda and get physical access to the blocks of your disc. But are you really getting physical access to those? This could be a virtual disc made up of several physical discs under a RAID system, or a logical volume manager, and so on. There are so many levels there these days: you really need to know what you're doing. Also, we've assumed fail-stop operation. But mowing the roses can happen!

L8P13: Transactions: Summary

Let's review what we've learned about transactions. Standard mutual exclusion techniques are not programmer-friendly when we're dealing with more than one object; we're going to need to take out multiple locks on multiple objects, or else have a single coarser-grain lock, which would limit concurrency. So transactions allow us a better way of proceeding. We can perform multiple operations, reads and updates, on many objects, but get the effect of them all happening atomically. The underlying transaction processing system implements the isolation that we want, allowing safe concurrency and even providing fault tolerance.
There is commonly an abort facility, where a transaction can abort and the underlying system will roll back its effects, and that may cause other transactions to abort when isolation is not strict. These transactional approaches obviously form the foundation for database systems, but they're also increasingly used in everyday file systems, so that an unexpected reboot or a battery going flat doesn't require hours of disc scanning to restore the disc to a workable state next time you power up. So that's all for this part of the course. I will put the lock-free programming and transactional memory topics in a separate recording. END