Fault Tolerance

So far in this chapter we have described techniques for isolating faults and for helping the distributed applications programmer avoid some of the pitfalls of synchronization and consistency in a distributed system. We have not described how a distributed system can improve availability in any way. The starting point is to recognize that, although there are more possible independent failure modes in a distributed system than in a centralized one, an application may not need many of the hosts or storage systems to be running. We should then note that there is a high degree of replication of hardware, communications and operating system facilities in a distributed system. Some of these components (e.g. CSMA/CD and FDDI/DAS Local Area Network technology) are highly reliable, and have no dependent failure modes. Other components may not be very reliable but are inexpensive to replicate (e.g. microprocessors, small disks and memory). The aim of reliable distributed system software is to take advantage of this hardware replication or reliability to place, replicate or migrate processes and data (methods/objects) so as to avoid or mask failures.
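To make the argument about replication concrete, the following is a minimal sketch of how availability grows when an application needs only some of the replicated hosts to be running. It assumes that hosts fail independently and that each is up with the same probability; both are simplifying assumptions, and the function name availability_k_of_n and its parameters are illustrative, not taken from the text.

```python
from math import comb


def availability_k_of_n(p_up: float, n: int, k: int) -> float:
    """Probability that at least k of n hosts are up.

    Assumes (for illustration only) that each host is up independently
    with the same probability p_up.
    """
    if not 0.0 <= p_up <= 1.0:
        raise ValueError("p_up must be a probability between 0 and 1")
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # Sum the binomial probabilities of exactly i hosts being up, for i = k..n.
    return sum(
        comb(n, i) * p_up**i * (1.0 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # One unreliable host versus three replicas of which any one suffices.
    print(availability_k_of_n(0.9, 1, 1))
    print(availability_k_of_n(0.9, 3, 1))
```

Under these assumptions, a single host that is up 90% of the time gives 0.9 availability, while three such hosts, any one of which suffices, give 1 - 0.1^3 = 0.999. The same function shows the opposite effect when every host is required: availability_k_of_n(0.9, 3, 3) is 0.9^3 = 0.729, which is why software must be structured so that it does not depend on all components at once.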