A fault is a defect in a system that may lead to an error. It may be
permanent, transient or intermittent. It may in fact never betray
itself.
An error is a piece of information in a system that results from a
fault and may cause a failure when processed in good faith.
A failure is a deviation in the observable behavior of the system from
its specification. This can include the failure to provide some
service within some specified interval.
The quality of a reliable system is often measured in terms of its
Mean Time Between Failure (MTBF), its Mean Time To Repair (MTTR)
and its Availability.
The first reflects how often the system fails; the second, how long
the system takes to become available again; the last is the percentage
of time the system offers the specified service.
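The three measures are related by the standard steady-state formula, Availability = MTBF / (MTBF + MTTR). A minimal sketch (the function name and the example figures are illustrative, not taken from the text):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is
    in service, given its mean time between failures and mean time
    to repair (both in the same unit, here hours)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative figures: a system that fails on average every 1000 hours
# and takes 2 hours to repair is available 1000/1002 of the time,
# i.e. roughly 99.8% uptime.
uptime = availability(1000.0, 2.0)
```

Note that improving either measure raises availability: a system that fails often but recovers very quickly can be as available as one that fails rarely but takes long to repair.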
The ability to be fault tolerant can be based on many approaches.
These all increase the overall availability of the distributed system despite
internal faults. They all do so with some associated cost to some
aspect of the performance of the system.
In addition to these approaches, timeliness brings a further list of
requirements:
-
Tightness of deadline
There must be enforcement of bounds on delivery time for real time
systems. If a non-deterministic communications medium is used (e.g.
Ethernet) then it must be used within the statistical performance
bounds that are an acceptable risk.
-
Hiding Sins of Omission
Any retry mechanism must be bounded by a best estimate of the time to
expect an answer (cf. Round Trip Time estimation).
-
Bounds on outages
Outages should not persist to the point of overburdening recovery
mechanisms, i.e. the Mean Time To Repair should be specified.
-
Priorities
If different systems need different service rates, priorities may have
to be implemented right down to the lowest level to give the right
performance.
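The bounded-retry requirement above can be sketched with an exponentially weighted moving average of observed round trip times, in the spirit of TCP's RTT estimation. The class name, the smoothing constants and the four-deviation margin below are illustrative assumptions, not values from the text:

```python
class RttEstimator:
    """Smoothed round-trip-time estimate used to bound retry timeouts.

    Keeps an exponentially weighted moving average of RTT samples
    (srtt) and of their mean deviation (rttvar); the retry timeout is
    the estimate plus a margin for variance, so retries fire neither
    too eagerly nor unboundedly late.  alpha, beta and the margin
    multiplier are illustrative choices.
    """

    def __init__(self, alpha: float = 0.125, beta: float = 0.25):
        self.srtt = None      # smoothed RTT estimate
        self.rttvar = 0.0     # smoothed mean deviation of RTT samples
        self.alpha = alpha    # gain for the RTT average
        self.beta = beta      # gain for the deviation average

    def observe(self, sample: float) -> None:
        """Fold one measured round trip time into the estimate."""
        if self.srtt is None:
            # First sample: seed the estimate and assume wide deviation.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            self.rttvar = ((1 - self.beta) * self.rttvar
                           + self.beta * abs(self.srtt - sample))
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample

    def timeout(self) -> float:
        """Retry timeout: best estimate plus a variance margin."""
        return self.srtt + 4 * self.rttvar
```

A retry loop would wait `timeout()` before each resend and give up after a fixed number of attempts, so that omission failures are hidden without letting recovery attempts persist indefinitely.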