A fault is a defect in a system that may lead to an error. It may be
permanent, transient or intermittent. It may in fact never betray
itself.
An error is a piece of information in a system that results from a
fault and may cause a failure when processed in good faith.
A failure is a deviation in the observable behavior of the system from
its specification. This can include the failure to provide some
service within some specified interval.
The quality of a reliable system is often measured in terms of its
Mean Time Between Failure (MTBF), its Mean Time To Repair (MTTR)
and its Availability.
The first reflects how often the system fails; the second, how long
the system takes to become available again; the last is the percentage
of time the system offers the specified service.
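The three measures are related by the standard steady-state formula, Availability = MTBF / (MTBF + MTTR). A minimal sketch (the function name and the example figures are illustrative, not taken from the text):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is
    in service, given its mean time between failures and mean time
    to repair (both in the same unit, here hours)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative figures: a system that fails on average every 1000 hours
# and takes 2 hours to repair is available 1000/1002 of the time,
# i.e. roughly 99.8% uptime.
uptime = availability(1000.0, 2.0)
```

Note that improving either measure raises availability: a system that fails often but recovers very quickly can be as available as one that fails rarely but takes long to repair.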
The ability to be fault tolerant can be based on many approaches.
These all increase the overall availability of the distributed system despite
internal faults. They all do so with some associated cost to some
aspect of the performance of the system.
In addition to these approaches, timeliness brings a further list of
requirements:
-
Tightness of deadline
There must be enforcement of bounds on delivery time for real time
systems. If a non-deterministic communications medium is used (e.g.
Ethernet) then it must be used within the statistical performance
bounds that are an acceptable risk.
-
Hiding Sins of Omission
Any retry mechanism must be bounded by a best estimate of the time to
expect an answer (cf. Round Trip Time estimation).
-
Bounds on outages
Outages should not persist to the point of overburdening recovery
mechanisms, i.e. the Mean Time To Repair should be specified.
-
Priorities
If different systems need different service rates, priorities may have
to be implemented right down to the lowest level to give the right
performance.
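The bounded-retry requirement above can be sketched with an exponentially weighted moving average of observed round trip times, in the spirit of TCP's RTT estimation. The class name, the smoothing constants and the four-deviation margin below are illustrative assumptions, not values from the text:

```python
class RttEstimator:
    """Smoothed round-trip-time estimate used to bound retry timeouts.

    Keeps an exponentially weighted moving average of RTT samples
    (srtt) and of their mean deviation (rttvar); the retry timeout is
    the estimate plus a margin for variance, so retries fire neither
    too eagerly nor unboundedly late.  alpha, beta and the margin
    multiplier are illustrative choices.
    """

    def __init__(self, alpha: float = 0.125, beta: float = 0.25):
        self.srtt = None      # smoothed RTT estimate
        self.rttvar = 0.0     # smoothed mean deviation of RTT samples
        self.alpha = alpha    # gain for the RTT average
        self.beta = beta      # gain for the deviation average

    def observe(self, sample: float) -> None:
        """Fold one measured round trip time into the estimate."""
        if self.srtt is None:
            # First sample: seed the estimate and assume wide deviation.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            self.rttvar = ((1 - self.beta) * self.rttvar
                           + self.beta * abs(self.srtt - sample))
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample

    def timeout(self) -> float:
        """Retry timeout: best estimate plus a variance margin."""
        return self.srtt + 4 * self.rttvar
```

A retry loop would wait `timeout()` before each resend and give up after a fixed number of attempts, so that omission failures are hidden without letting recovery attempts persist indefinitely.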