Chapter 15 Crash resilience and persistent data

Exercises

15-1 How do you think an operating system which has bugs and is about to crash might behave differently from the fail-stop model described in Section 15.2?

Why is it unlikely that an operating system that is about to crash will write all over the disks it manages?

Is it more likely that an operating system that is about to crash will write all over main memory? Note that such erroneous writes may be to non-volatile memory that is being used as a cache for data and metadata waiting to be written to disk.

How could it be made less likely that a non-volatile cache would be corrupted in this way?

We are concerned that if we take steps to preserve main memory it might have been corrupted by a failing OS. A failing OS might write at random in main memory and probably has the privilege to write anywhere.

We have seen how disks are programmed: driver code sets up the main memory and disk addresses for a required transfer of one or more blocks. Although a failing OS could cause an isolated error, by corrupting a disk buffer, for example, it is very unlikely to cause many erroneous writes to take place before being stopped.

If we use a non-volatile write cache we should protect it from a failing OS. Needham et al. proposed a protocol in microcode. The idea was to make the OS perform a simple test before writing to NVRAM. An alternative approach is to write-protect NVRAM so that the OS must change the protection of the required page correctly before writing to it.

15-2 Define "idempotent operation" and "atomic operation". Give examples of operations that can and cannot be made idempotent. How can an operation that cannot be made idempotent be made atomic?

An idempotent operation is repeatable. No matter how many times you carry it out the same effect is achieved. An atomic operation is implemented so that either all or none of it is done. We make an atomic operation repeatable by ensuring that, in the event of a crash, for example, when we do not know whether an operation wrote to persistent store or not, we can restore the system state to its value before the operation started. The operation can then safely be repeated.

x := 3 is idempotent and machine-level atomic

x := x+1 is not idempotent and not necessarily atomic, depending on

the machine level implementation of the instruction.

"Write this block of data at offset n" in a data structure in main memory is idempotent but not machine-level atomic. That is, concurrency control is necessary if the operation of writing the block is to be guaranteed to be carried out without interference from concurrent processes.

"Write this byte sequence at byte-offset n" in a file is idempotent. It is atomic if only one disk-block needs to be written but not otherwise.

An operation can be made atomic by recording the system state before the operation so that if it does not complete the system can be returned to its original state.

15-3 How would you support atomic operations in a system which updates persistent data values "in place"; that is, the new values overwrite the old values in the persistent store.

Record each operation in a log with the persistent data values before and after the operation.

15-4 What is meant by a shadow copy of a portion of a filing system?

We have seen that the filing system uses disk pages as the unit of allocation and transfer. Suppose that when a file is changed it is not updated in place. New pages are used and a new version of the metadata is created. When the new metadata overwrites the old on disk (we assume in a single block transfer) the change has been effected and the old pages can be recovered as free pages. Before this happens both the old and new versions of the file are achievable. Note that the old and new versions of the metadata may point to some common pages and some different ones, in which case a page is said to have a shadow page.