The Object Model and Fault Transparency

The object-oriented approach is a good way of providing fault confinement. Faults are unlikely to propagate from a well-defined module, except in a very restricted way in the form of erroneous messages, which other objects may check in a number of ways. Fault detection and diagnosis may be based on comparing the kind of output expected from an object with its actual output. The Abstract Data Type approach helps with this. In an Open Distributed System, one of the base methods in the Class Hierarchy may be the "liveness" test (an echo/loopback facility). This also helps to isolate and diagnose faults, including those to do with timeliness.

Faults are often masked by various means based on replication or error correction. For replication of a service to be transparent to the application or user, there must be some agent that collects replies and performs majority voting on them. The service provided by this agent can be derived from the non-replicated form of the service object.

In some systems, it may be sufficient to retry any failed operation. Whether this works depends on the fault model presented by the operation. For example, a retry only guarantees freedom from a value failure if the retried operation is either idempotent (meaning that repeated completion of the operation has the same effect as completing it once) or has "all or nothing" semantics. Retrying may get round transient faults (e.g. network noise), but it does not provide any bound on the time to complete the operation.

Reconfiguration may be required when the number or rate of appearance of faults exceeds what can be handled by the mechanisms outlined so far. Providing this functionality in a distributed system often involves migration of objects, but when these mechanisms are used, they should not be visible to the application. One extreme case of reconfiguration is to employ some recovery technique to re-establish an earlier system state known to be correct.
This may involve roll-back or undoing of some number of logged operations. Once recovery is complete, it may be feasible simply to restart all the outstanding operations. If they are idempotent, this is straightforward. If not, it may require informing the original end user so that they can re-submit the original command (e.g. re-type their request to withdraw x from a cash point).

A distributed system may be partitioned by a temporary network failure. An important aspect of fault tolerance is the ability to repair the system transparently.

In a distributed system, faults and errors are more common and more complex than in a centralized system. This is usually due to the communications infrastructure (the larger the system in geographical scale, the more likely errors in communication become). It is also due to the heterogeneous nature of an Open Distributed System: it is not guaranteed that all components (workstations, servers, etc.) are of the same quality.

It must be stressed that in a distributed system, faults and errors may consist only of the lack of information. The primary way in which these kinds of conditions can be detected is by the use of timeout and retry facilities. It is frequently found that an application that relies on a "reliable transport" of messages fails in obscure ways. Alternative approaches use self-checking server processes. Of course, a value fault can be detected by consistency checks at any point in a system.

In distributed systems, most of the facilities available to the conventional communications programmer should be visible to the applications programmer, but only when wanted. The object model provides us with a convenient way of inheriting base methods. These can include base exceptions for handling timeouts and similar events.

Any operating system or application has to provide services within time constraints. Some constraints are looser than others.
The time available to deal with the arrival of data from a monitoring device (e.g. a radar signal in an Air Traffic Control System) may be very much less than the time available to deal with running a process for a user (e.g. formatting a document).

A system must provide varying degrees of reliability. Some data may not be recoverable, and some sequences of events may not be repeatable. The relative costs of fixing or preventing faults differ at different stages of a system's design and implementation. The engineer needs to make appropriate cost/benefit trade-offs when thinking about faults and their consequent failures, according to the resulting loss of service availability, integrity or performance.
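
Several of the mechanisms described in this section can be illustrated together in a few lines of Python. The sketch below is only one possible reading of the text, not a definitive design: the names ServiceObject, ServiceTimeout, NameService, VotingNameService and retry, and the sample data, are all hypothetical. It shows a liveness test and a timeout exception inherited from a base class, retry of an idempotent operation, and a voting agent derived from the non-replicated form of a service. Transient faults are simulated by a counter rather than by a real network.

```python
from collections import Counter


class ServiceTimeout(Exception):
    """Hypothetical base exception: no usable reply arrived in time."""


class ServiceObject:
    """Hypothetical root class; every service inherits the liveness test."""

    def ping(self):
        return "alive"  # echo/loopback reply


class NameService(ServiceObject):
    """Non-replicated service with one idempotent read operation."""

    def __init__(self, table, transient_faults=0):
        self.table = dict(table)
        self.transient_faults = transient_faults  # simulated lost replies

    def lookup(self, name):
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise ServiceTimeout(name)
        return self.table[name]


class VotingNameService(NameService):
    """Agent derived from the non-replicated service: same interface,
    but replies from several replicas are collected and voted on."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def lookup(self, name):
        replies = []
        for replica in self.replicas:
            try:
                replies.append(replica.lookup(name))
            except ServiceTimeout:
                pass  # a silent replica contributes no vote
        if replies:
            value, votes = Counter(replies).most_common(1)[0]
            if votes > len(self.replicas) // 2:
                return value  # strict majority of all replicas agrees
        raise ServiceTimeout(name)  # no majority: report lack of information


def retry(operation, attempts=3):
    """Re-run an operation after a timeout; safe only if it is idempotent."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ServiceTimeout:
            if attempt == attempts:
                raise  # the fault was not transient


# Two correct replicas (one of which loses its first reply) and one
# replica that returns a wrong value.
good = {"printer": "room-12"}
bad = {"printer": "room-99"}
service = VotingNameService(
    [NameService(good), NameService(bad), NameService(good, transient_faults=1)]
)
result = retry(lambda: service.lookup("printer"))
```

In this run the first attempt finds no majority, because one correct replica is silent, and the retry then succeeds. The retry is acceptable here only because lookup is idempotent. An operation with side effects would need "all or nothing" semantics before it could be retried in this way.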