Conventional Hardware and Software Reliability

Hardware support for enhancing the reliability of single processors has centered on confining errors to individual processes and on providing error correction for memory. I/O errors are usually dealt with by retrying the operation or by replicating I/O devices. Hardware support for software includes memory management units, which isolate the address spaces accessible to different processes. The natural extension of this is capability-based hardware, where all resources are protected and only processes holding the correct capabilities may access a given resource. Processes are usually grouped into several levels in a hierarchy. Typically, there are user and superuser (privileged) processes. In some systems, interrupt service routines (really transient processes) are divided into a number of prioritized levels within the superuser level. Access from a lower priority level to a higher one may not be allowed. Processes at different priority levels run with different stacks, often protected by guard words. Single instructions are usually provided to switch context (including all process state and priority). Most hardware architectures allow certain operations to be uninterruptible (e.g. test-and-set).

Some network hardware may provide some level of reliability. For example, some financial service information networks provide at least duplicate links and transmit all information over both routes. This is not simply for resilience to line failure, but also so that errors in transmission can be detected by comparing the received messages. One secure system went even further in protecting data from interference. Access was made only through approved "black box" network interfaces, which implemented all the required communications, including generating constant random traffic, both to prevent traffic pattern analysis by intruders and to provide very rapid detection of network failure.

Software reliability is currently based on various stages of testing.
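The duplicate-link scheme described above, in which every message travels over two routes and the copies are compared on arrival, can be sketched as follows. This is only an illustration under stated assumptions: the function name `reconcile`, the status labels, and the convention that `None` stands for a copy that never arrived are all hypothetical and are not taken from any real network.

```python
from typing import Optional, Tuple


def reconcile(primary: Optional[bytes],
              secondary: Optional[bytes]) -> Tuple[str, Optional[bytes]]:
    """Classify one message received over two independent routes.

    None means the copy never arrived on that route (assumed convention).
    """
    if primary is None and secondary is None:
        # Both routes failed: nothing can be delivered.
        return ("lost", None)
    if primary is None or secondary is None:
        # One route failed: the message survives, but there is no
        # second copy to check it against.
        return ("unverified", primary if primary is not None else secondary)
    if primary == secondary:
        # Both copies agree, so a transmission error is unlikely.
        return ("ok", primary)
    # The copies differ: a transmission error occurred on at least one
    # route, and comparison alone cannot tell which copy is correct.
    return ("mismatch", None)
```

Comparison of two copies detects an error but does not correct it; deciding which copy to trust would need something extra, such as a checksum or a third route.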
In the future, it may be possible to use automated program proving techniques, especially when more systems are formally specified using methods like Abstract Data Types [see chapter 5]. The notion of "reliability growth" is inherent in the idea that a set of staged tests through the software development cycle will improve the quality of the system. These staged tests are:

Development: During software development, errors may be detected. Based on the number found and on experience, an estimate can be made of the number of errors that remain undetected. Based on the complexity of the program (branching of the call graph, etc.), an estimate of the number of likely errors can also be made. It is certainly true that the more channels a distributed application uses, the more error prone it is likely to be. But testing parts of a distributed application in isolation rarely detects more than simple problems.
Testing: When the software is integrated, a coherent set of tests may be carried out, including deliberate error seeding of the program: parts of the program are corrupted, and the effect on the rest of the system is observed. The results give some indication of the robustness of the program. This technique is useful in a distributed system, where it is often vital to check that a server is safe from rogue clients generating bad requests, and that clients will not fail badly when servers return meaningless results.
Validation: When the software is complete, it is usually validated by running it in a non-operational environment, with test data either taken from the real environment or drawn from a distribution based on a real-world model.
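The check that a server is safe from rogue clients can be sketched as a small fault-seeding loop. This is a minimal illustration, not a description of any particular system: the JSON request format, the `handle_request` entry point, and the single-byte corruption model are all assumptions made for the example.

```python
import json
import random


def handle_request(raw: bytes) -> dict:
    """Hypothetical server entry point; it should never raise on bad input."""
    try:
        req = json.loads(raw.decode("utf-8"))
        if not isinstance(req, dict):
            raise ValueError("request must be a JSON object")
        op = req["op"]
        a, b = req["args"]
        operands_ok = all(
            isinstance(x, int) and not isinstance(x, bool) for x in (a, b)
        )
        if op != "add" or not operands_ok:
            raise ValueError("unsupported request")
        return {"status": "ok", "result": a + b}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed input is reported to the client, not allowed to crash us.
        return {"status": "error", "reason": type(exc).__name__}


def corrupt(raw: bytes, rng: random.Random) -> bytes:
    """Seed one fault: overwrite, delete, or insert a single random byte."""
    pos = rng.randrange(len(raw))
    kind = rng.choice(("overwrite", "delete", "insert"))
    junk = bytes([rng.randrange(256)])
    if kind == "overwrite":
        return raw[:pos] + junk + raw[pos + 1:]
    if kind == "delete":
        return raw[:pos] + raw[pos + 1:]
    return raw[:pos] + junk + raw[pos:]


def seed_faults(trials: int = 1000, seed: int = 0) -> dict:
    """Send corrupted copies of a valid request and tally the outcomes."""
    rng = random.Random(seed)
    good = json.dumps({"op": "add", "args": [2, 3]}).encode("utf-8")
    counts = {"ok": 0, "error": 0, "crash": 0}
    for _ in range(trials):
        try:
            reply = handle_request(corrupt(good, rng))
            counts[reply["status"]] += 1
        except Exception:
            # Any escaped exception means a rogue client could take
            # the server down.
            counts["crash"] += 1
    return counts
```

A non-zero "crash" count would indicate that the server is not robust against bad requests. Some corrupted requests still parse as valid ones (for example when an operand digit changes), so "ok" replies are expected as well as "error" replies. The same loop can be turned around to corrupt server replies and observe how a client copes.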

The more tests are run, the greater the confidence in the reliability of the software. Unfortunately, this leads to excessive testing costs. An alternative is to restrict the test data to "likely" data, chosen by looking at the nature of the input the program expects. What makes distributed systems a particular challenge is that they often exhibit a higher degree of non-deterministic behavior than centralized systems (due to the inevitable concurrency). Thus the number of possible traces of the system can be extremely large. In a distributed system, it may be possible to monitor a network and replay all the messages in a test harness, and so test the new system with real data in a highly effective way. However, it is worth giving an example of a failure of a system to illustrate how hard this can be to do effectively.
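The idea of monitoring a network and replaying the captured messages in a test harness might look like the sketch below. The message representation (a sender name plus a payload), the handler signature, and the name `replay` are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

# One captured message: (sender, payload). Assumed representation.
Message = Tuple[str, bytes]
Handler = Callable[[str, bytes], bytes]


def replay(log: Sequence[Message], old: Handler, new: Handler) -> List[int]:
    """Feed captured messages, in order, to the old and new handlers.

    Returns the indices of the messages whose replies differ.
    """
    differences = []
    for index, (sender, payload) in enumerate(log):
        if old(sender, payload) != new(sender, payload):
            differences.append(index)
    return differences


# Hypothetical handlers standing in for the old and new systems.
def old_handler(sender: str, payload: bytes) -> bytes:
    return payload.upper()


def new_handler(sender: str, payload: bytes) -> bytes:
    return payload.upper() if payload else b"EMPTY"


captured = [("a", b"ping"), ("b", b""), ("a", b"pong")]
suspect = replay(captured, old_handler, new_handler)  # indices to inspect
```

A captured log fixes one particular ordering of messages, so a replay exercises only one trace of the system. Ignoring synchronization, two processes of 10 atomic steps each can already interleave in 184,756 different ways, which is why a clean replay gives limited assurance about the orderings that were never observed.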