A model of failures gives us a handle on how to provide fault
transparency. At some level of a system, a fault occurs. If we provide
a mechanism to mask it, the system continues to operate correctly. If
we do not, the fault may result in a failure.
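The distinction between a fault and a failure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TransientFault`, `make_flaky_read` and `masked` are hypothetical, and retrying is just one possible masking mechanism, chosen here because it is short.

```python
class TransientFault(Exception):
    """Hypothetical fault raised by a lower level of the system."""


def make_flaky_read(faults_before_success):
    """Return a read operation that faults a fixed number of times, then succeeds."""
    remaining = [faults_before_success]

    def read():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("lower level fault")
        return "data"

    return read


def masked(operation, max_attempts=3):
    """Retry the operation so that transient faults are hidden from the caller."""
    last_fault = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TransientFault as fault:
            last_fault = fault
    # The mask was not sufficient: the fault now surfaces as a failure.
    raise last_fault
```

With two faults and three attempts the caller of `masked` sees only the correct result. With three faults the mask is exhausted and the caller observes a failure.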
Failures are usually categorised as follows:
-
A crash is where a system ceases to run the programs that it was
intended to run.
-
A system that fails "silent" is one that crashes and does not report its
fault, and may even return valid (but meaningless) responses to
messages.
-
A "Fail Stop" system is one that guarantees that it will give no
response once it has crashed.
-
A commission fault is one where a system silently fails to update the
state of some variable in stable storage (say, disk), but appears to
have done so.
-
A value fault is said to have occurred when, after some
sequence of updates, the state in stable storage is not what was
intended.
-
A timing fault is where a system fails to meet some deadline.
Each type of fault has an appropriate method for hiding it so that the
corresponding failure does not occur. Of course, each method will have
an associated cost.
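As one illustration of a hiding method and its cost, the sketch below masks the commission fault described above by reading back every write. It is a minimal example under assumptions, not a prescribed technique: `LossyStore` is a hypothetical stand-in for stable storage that silently drops some writes, and `verified_write` is a hypothetical helper.

```python
class LossyStore:
    """Hypothetical stable storage that silently drops its first few writes."""

    def __init__(self, dropped_writes=0):
        self._data = {}
        self._drops_left = dropped_writes
        self.reads = 0  # counts the extra work done by verification

    def write(self, key, value):
        if self._drops_left > 0:
            self._drops_left -= 1
            return  # fails to update the state, but appears to have succeeded
        self._data[key] = value

    def read(self, key):
        self.reads += 1
        return self._data.get(key)


def verified_write(store, key, value, max_attempts=3):
    """Write, then read back, repeating until the stored state matches."""
    for _ in range(max_attempts):
        store.write(key, value)
        if store.read(key) == value:
            return True
    # The fault could not be hidden; report it rather than fail silently.
    return False
```

The cost is visible in the `reads` counter: every write now incurs at least one additional read, and a dropped write incurs a further write and read.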