Conventional Hardware and Software Reliability

Hardware support to enhance the reliability of single processors has centered on confining errors to processes and providing error correction for memory. I/O errors are usually dealt with by retry or by replication of I/O devices. Hardware support for software includes the use of memory management units to isolate the address spaces accessible to different processes. The natural extension of this is capability-based hardware, where all resources are protected and only those processes holding the correct capabilities may access a given resource.

Processes are usually grouped into several levels in a hierarchy. Typically, there are user and superuser privileged processes. In some systems, interrupt service routines (really transient processes) are divided into a number of levels of prioritized processes within the superuser priority, and access from lower priority to higher priority may not be allowed. Processes of different priorities run with different stacks, often protected by guard words. Single instructions are usually provided to switch context (including all process state and priority), and most hardware architectures allow certain operations to be uninterruptable (e.g. test-and-set).

Some network hardware may provide some level of reliability. For example, some financial service information networks provide at least duplicate links and transmit all information over both routes. This is not simply for resilience to line failure: errors in transmission can be detected by comparing the messages received over the two routes. One secure system went even further in protecting its data from interference. Access was made only through approved "black box" network interfaces, which implemented all the required communications, including generating constant random traffic both to prevent traffic-pattern analysis by intruders and to provide very rapid detection of network failure.

Software reliability is currently based on various stages of testing. In the future, it may be possible to use automated program proving techniques, especially as more systems are formally specified using methods like Abstract Data Types [see chapter 5]. The notion of "reliability growth" is inherent in the idea that a set of staged tests through the software development cycle, typically unit, integration, system, and acceptance testing, will improve the quality of the system. The more tests that are run, the greater the confidence in the reliability of the software. Unfortunately, this leads to excessive cost in testing. An alternative is to restrict the test data to "likely" data, chosen by examining the nature of the input the program expects.

What makes distributed systems a particular challenge is that they often exhibit a higher degree of non-deterministic behavior than centralized systems (due to the inevitable concurrency). Thus the number of possible traces of the system can be extremely large. In a distributed system, it may be possible to monitor a network and replay all the messages in a test harness, and so test a new system with real data in a highly effective way. However, it is worth giving an example of a failure of a system to illustrate how hard this can be to do effectively.
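Before turning to that example, a few sketches in C may help make the mechanisms described above concrete. First, the capability idea: a process may access a resource only if it holds a capability naming that resource with the required rights. The structure and names below are illustrative assumptions, not any particular machine's interface.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A capability names a resource and the operations permitted on it. */
    typedef struct {
        uint32_t resource;   /* which resource this capability names */
        uint32_t rights;     /* bitmask of permitted operations      */
    } capability;

    /* Grant access only if the process holds a capability for this
     * resource that includes all of the rights being requested. */
    bool may_access(const capability *caps, size_t ncaps,
                    uint32_t resource, uint32_t wanted)
    {
        for (size_t i = 0; i < ncaps; i++)
            if (caps[i].resource == resource &&
                (caps[i].rights & wanted) == wanted)
                return true;
        return false;
    }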
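The uninterruptable test-and-set operation is the classic building block for mutual exclusion. A minimal sketch of a spinlock built on it, using C11's atomic_flag as a stand-in for the hardware instruction, might look like this:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void acquire(void)
    {
        /* Atomically set the flag and test its previous value; spin
         * until we observe that it was previously clear. */
        while (atomic_flag_test_and_set(&lock))
            ;   /* busy-wait */
    }

    void release(void)
    {
        atomic_flag_clear(&lock);
    }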
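The duplicate-link scheme can be sketched in the same spirit. Here recv_link is an assumed receive primitive rather than a real API; the point is that a message is accepted only when the copies arriving over the two routes agree:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Assumed primitive: read the next message from the given link
     * into buf and return its length.  Purely illustrative. */
    size_t recv_link(int link, char *buf, size_t max);

    /* Accept a message only if the copies received over both routes
     * agree, so a transmission error on either line is detected by
     * comparison rather than silently trusted. */
    bool recv_duplicated(char *out, size_t max)
    {
        char a[1024], b[1024];
        size_t la = recv_link(0, a, sizeof a);
        size_t lb = recv_link(1, b, sizeof b);

        if (la != lb || memcmp(a, b, la) != 0)
            return false;   /* mismatch: flag an error, request retransmission */
        if (la > max)
            return false;   /* message too large for the caller's buffer */
        memcpy(out, a, la);
        return true;
    }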
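Finally, a monitor-and-replay harness might look like the following, assuming a capture file of length-prefixed messages and a handle_message entry point into the system under test (both are assumptions for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    /* Entry point of the system under test; assumed for illustration. */
    void handle_message(const unsigned char *msg, size_t len);

    /* Replay a capture of length-prefixed messages (4-byte big-endian
     * length, then the message bytes -- an assumed format) through the
     * system under test, exercising new software with real traffic. */
    int replay(const char *capture_file)
    {
        FILE *f = fopen(capture_file, "rb");
        if (!f)
            return -1;

        unsigned char lenbuf[4];
        while (fread(lenbuf, 1, 4, f) == 4) {
            size_t len = (size_t)lenbuf[0] << 24 | (size_t)lenbuf[1] << 16
                       | (size_t)lenbuf[2] << 8  | (size_t)lenbuf[3];
            unsigned char *msg = malloc(len + 1);  /* +1: valid pointer even if len == 0 */
            if (!msg || fread(msg, 1, len, f) != len) {
                free(msg);
                break;      /* allocation failure or truncated capture */
            }
            handle_message(msg, len);
            free(msg);
        }
        fclose(f);
        return 0;
    }

Even with real recorded traffic, the non-determinism noted above means that replayed messages may be handled in interleavings different from the original run; this is one reason such testing is hard to do effectively.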