The Object Oriented approach is a good way of providing fault
confinement. Faults are unlikely to propagate from a well-defined
module, except in a very restricted way in the form of erroneous
messages, which may be checked by other objects in a number of ways.
Fault detection and diagnosis may be based on comparing
the kind of output expected from an object with the actual output.
The Abstract Data Type approach will help with this. In an Open
Distributed System, one of the base methods in the Class Hierarchy may
be the "liveness" test (echo/loopback facility). This will also help
isolate and diagnose faults, including those to do with timeliness.
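As a rough illustration, a liveness test inherited from a base class might look like the sketch below. The names DistributedObject, echo, check_liveness and PrintServer are hypothetical, and the call is local rather than over a network; the point is only that every subclass gets the loopback method for free and that the probe checks both the value and the time taken.

```python
import time


class DistributedObject:
    """Hypothetical base class: every object in the hierarchy inherits echo()."""

    def echo(self, payload):
        # Loopback facility: return the payload unchanged.
        return payload


class PrintServer(DistributedObject):
    """Example subclass (assumed name); it defines no echo() of its own."""


def check_liveness(obj, deadline_s=0.5):
    """Send one echo probe and return (alive, elapsed_seconds)."""
    probe = b"ping"
    start = time.monotonic()
    try:
        reply = obj.echo(probe)
    except Exception:
        # Any exception from the probe counts as "not alive".
        return False, time.monotonic() - start
    elapsed = time.monotonic() - start
    # A wrong reply is a value fault; a late reply is a timing fault.
    return reply == probe and elapsed <= deadline_s, elapsed
```

The deadline_s parameter is what lets the same probe report timeliness faults as well as value faults.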
Faults are often masked by various means based on replication or
error correction. For replication of a service to be transparent to
the application/user, there must be some agent that collects replies
and performs some majority voting on these replies. The service
provided by this agent can be
derived from the non-replicated form of the service object.
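A minimal sketch of such a voting agent follows, assuming a hypothetical Service class with a single lookup method and replies that are hashable. The agent is derived from the non-replicated class, so the caller sees the same interface; the strict-majority rule used here is one possible voting policy, not the only one.

```python
from collections import Counter


class Service:
    """Hypothetical non-replicated service."""

    def lookup(self, key):
        return key.upper()


class ReplicatedService(Service):
    """Voting agent with the same interface as Service."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def lookup(self, key):
        replies = []
        for replica in self.replicas:
            try:
                replies.append(replica.lookup(key))
            except Exception:
                continue  # a crashed replica contributes no vote
        if not replies:
            raise RuntimeError("no replica replied")
        value, votes = Counter(replies).most_common(1)[0]
        # Require a strict majority of all replicas, not just of the replies.
        if votes * 2 <= len(self.replicas):
            raise RuntimeError("no majority among replicas")
        return value
```

With three replicas, one of which returns a wrong value, the agent still returns the value on which the other two agree.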
In some systems, it may be sufficient to retry any failed operation.
This relies on the fault model presented by the operation. For
example, it will only guarantee freedom from a value failure if the
retried operation is either idempotent (meaning that repeated
completion of the operation has the same effect as a single completion) or has
``all or nothing semantics''.
This may get round transient faults (e.g. network noise), but will not
provide any bounds on the time to complete the operation.
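The difference idempotence makes can be sketched as follows. The retry helper and the FlakyAccount class are hypothetical; the account applies each request but loses the first reply, which is one assumed form of transient fault.

```python
def retry(operation, attempts=3, transient=(ConnectionError, TimeoutError)):
    """Call operation() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except transient as err:
            last_error = err  # assume the fault is transient and try again
    raise last_error


class FlakyAccount:
    """Applies every request, but the reply to the first one is lost."""

    def __init__(self):
        self.balance = 0
        self.calls = 0

    def _maybe_lose_reply(self):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("reply lost")

    def set_balance(self, amount):  # idempotent
        self.balance = amount
        self._maybe_lose_reply()

    def deposit(self, amount):  # not idempotent
        self.balance += amount
        self._maybe_lose_reply()
```

Retrying set_balance(100) leaves the balance at 100, whereas retrying deposit(100) applies the deposit twice: a value failure caused by the retry itself. Note that the helper puts a bound on the number of attempts but not on the time each attempt takes.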
Reconfiguration may be required when the number or rate of appearance
of faults exceeds what can be handled by the mechanisms outlined so
far. Providing this functionality in a distributed system
often involves migration of
objects, but when these mechanisms are used, they should not be
visible to the application.
One extreme case of reconfiguration is to employ some recovery
technique to re-establish an earlier system state known to be correct.
This may involve roll-back or un-doing of some number of logged
operations.
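One simple way to picture roll-back is an undo log over a key/value state, sketched below. The UndoLog class and its method names are assumptions made for illustration; a real system would keep the log on stable storage rather than in memory.

```python
class UndoLog:
    """Hypothetical in-memory key/value state with an undo log."""

    _MISSING = object()  # marks "key did not exist before this write"

    def __init__(self):
        self.state = {}
        self.log = []  # (key, previous value) pairs, oldest first

    def write(self, key, value):
        # Record the old value before overwriting it.
        self.log.append((key, self.state.get(key, self._MISSING)))
        self.state[key] = value

    def checkpoint(self):
        # A checkpoint is simply a position in the log.
        return len(self.log)

    def roll_back(self, checkpoint):
        # Undo logged writes, newest first, back to the checkpoint.
        while len(self.log) > checkpoint:
            key, previous = self.log.pop()
            if previous is self._MISSING:
                del self.state[key]
            else:
                self.state[key] = previous
```

Taking a checkpoint at a state known to be correct and later calling roll_back with it re-establishes exactly that state.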
Once recovery is complete, it may be feasible simply to restart all
the outstanding operations. If they are idempotent
this is straightforward. If not, it may require informing the
original end user so that they may re-submit the original command
(e.g. re-type their request to withdraw x from a cash point).
A distributed system may be partitioned by a temporary network
failure. An important aspect of fault tolerance is the ability to
repair the system transparently.
In a distributed system, faults and errors are more common and complex
than in a centralized system. This is usually due to the
communications infrastructure (the larger the system in geographical
scale, the more likely errors in communication). It is also due to the
heterogeneous nature of an Open Distributed System. It is not
guaranteed that all components (workstations/servers/etc.) are of the
same quality.
It must be stressed that in a distributed system, faults and errors
may consist only of the lack of information. The primary way in which
this type of condition can be detected is by use of timeout and
retry facilities. It is frequently found that an application that
relies on a "reliable transport" of messages fails in obscure ways.
Alternative approaches use self-checking server processes. Of course,
the detection of a value fault can be handled by consistency checks at
any point in a system.
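The two kinds of detection can be sketched together. In the hypothetical receiver below, an in-process queue stands in for the network, a timeout detects the lack of information, and a CRC-32 checksum serves as one possible consistency check for value faults. The names seal and receive are assumptions.

```python
import queue
import zlib


def seal(payload: bytes):
    """Attach a CRC-32 checksum to a payload before sending it."""
    return payload, zlib.crc32(payload)


def receive(replies, timeout_s=0.1):
    """Classify the next message as 'ok', 'value-fault' or 'omission'."""
    try:
        payload, checksum = replies.get(timeout=timeout_s)
    except queue.Empty:
        # Nothing arrived in time: the fault is the lack of information.
        return "omission", None
    if zlib.crc32(payload) != checksum:
        # Something arrived, but it fails the consistency check.
        return "value-fault", None
    return "ok", payload
```

A caller would typically respond to "omission" with a retry and to "value-fault" by discarding the message and asking for it again.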
In distributed systems, most of the facilities available to the
conventional communications programmer should be visible to the
applications programmer, but only when wanted. The object model
provides us with a convenient way of inheriting base methods. These
can include base exceptions for handling timeouts and such events.
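A sketch of how inherited base methods and base exceptions could keep timeouts out of sight until they are wanted is given below. All class and method names (DistributedError, OperationTimeout, RemoteObject, invoke, on_timeout, SilentServer, CachedClient) are hypothetical, and the transport is simulated.

```python
class DistributedError(Exception):
    """Hypothetical root of the exception hierarchy."""


class OperationTimeout(DistributedError):
    """No reply arrived within the allowed time."""


class RemoteObject:
    """Hypothetical base class; subclasses inherit the timeout handling."""

    timeout_s = 1.0

    def invoke(self, request):
        reply = self._send(request, self.timeout_s)
        if reply is None:
            # Delegate to a hook that subclasses may override.
            return self.on_timeout(request)
        return reply

    def on_timeout(self, request):
        # Default policy: surface the timeout as a base exception.
        raise OperationTimeout(f"no reply to {request!r}")

    def _send(self, request, timeout_s):
        raise NotImplementedError  # transport is supplied by subclasses


class SilentServer(RemoteObject):
    def _send(self, request, timeout_s):
        return None  # simulates a message that never arrives


class CachedClient(SilentServer):
    def on_timeout(self, request):
        # This application chooses to handle the timeout itself.
        return "stale-value"
```

An application that does not care inherits the default behaviour and sees only OperationTimeout; one that does care overrides on_timeout, as CachedClient does.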
Any operating system or application has to provide services within time
constraints. Some constraints are looser than others. The time to deal
with the arrival of data from a monitoring device (e.g. a radar signal
in an air traffic control system) may be very much less than the time
to deal with running a process for a user (e.g. formatting a
document).
A system must provide varying degrees of reliability. Some data may
not be recoverable and some sequences of events may not be repeatable.
The relative costs of fixing or preventing faults are different at
different stages of a system design and implementation. Appropriate
cost/benefit trade-offs need to be made by the engineer when thinking
about faults and their consequent failure, depending on loss of
service availability, integrity or performance.