Hardware support to enhance the reliability of single processors has
centered on confining errors to individual processes and on
providing error correction for memory. I/O errors are usually dealt
with by retrying the operation or by replicating I/O devices.
Hardware support for software includes the use of memory management
units to isolate the address spaces accessible to different processes.
The natural extension of this is capability-based hardware,
where all resources are protected and only those processes holding the correct
capabilities may access a given resource.
Processes are usually grouped into several levels in a hierarchy.
Typically, there are user and superuser privileged processes. In some
systems, interrupt service routines (really transient processes) are
divided into some number of levels of prioritized processes within
the superuser privilege level. Access from a lower priority level to a
higher one may not be allowed. Processes at different priority levels run
on separate stacks, often protected by guard words. Single instructions are
usually provided to switch context (including all process state and priority).
Most hardware architectures allow certain operations to be
uninterruptible (e.g. test-and-set).
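To illustrate why an uninterruptible test-and-set matters, here is a minimal
sketch of a spin lock built on one. Python does not expose the hardware
instruction, so this sketch assumes that a non-blocking `threading.Lock`
acquire can stand in for it; the names `AtomicFlag`, `SpinLock` and
`run_counter` are hypothetical and chosen only for illustration.

```python
import threading


class AtomicFlag:
    """Emulates a one-bit cell with an atomic test-and-set operation.

    Assumption: a non-blocking Lock.acquire() stands in for the single
    uninterruptible hardware instruction.
    """

    def __init__(self):
        self._cell = threading.Lock()

    def test_and_set(self):
        # Returns the OLD value: True if the flag was already set,
        # False if this call is the one that set it.
        return not self._cell.acquire(blocking=False)

    def clear(self):
        self._cell.release()


class SpinLock:
    """Mutual exclusion by busy-waiting on test-and-set."""

    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self):
        while self._flag.test_and_set():
            pass  # someone else holds the lock; try again

    def release(self):
        self._flag.clear()


def run_counter(workers=4, increments=500):
    """Several threads increment a shared counter under the spin lock."""
    lock = SpinLock()
    total = [0]

    def work():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

If the test and the set were two separate, interruptible steps, two
processes could both observe the flag as clear and both enter the critical
section.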
Some network hardware may provide some level of reliability. For
example, some financial service information networks provide at least
duplicate links and transmit all information over both routes. This is
not simply for resilience to line failure, but also so that errors in
transmission can be detected by comparing received messages.
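The comparison of messages received over duplicate links might look
something like the following sketch. The function name `reconcile` and the
status strings are assumptions made for illustration, not part of any
particular network's interface.

```python
def reconcile(primary, secondary):
    """Compare the copies of one message received over two routes.

    Each argument is the bytes received on that route, or None if
    nothing arrived. Returns a (status, message) pair.
    """
    if primary is None and secondary is None:
        return ("lost", None)            # both routes failed
    if primary is None or secondary is None:
        # One route failed: the message survives, but it can no
        # longer be cross-checked against a second copy.
        return ("single", primary if primary is not None else secondary)
    if primary == secondary:
        return ("ok", primary)           # copies agree
    return ("mismatch", None)            # transmission error detected
```

Note that with only two copies a mismatch can be detected but not
corrected, since there is no way to tell which copy is the damaged one.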
One secure system went even further in protecting the data from
interference. Access was made only through approved "black box"
network interfaces which implemented all the required communications,
including generating constant random traffic, both to prevent traffic
pattern analysis by intruders and to provide very rapid network
failure detection.
Software reliability is currently based on various stages of
testing. In the future, it may be possible to use automated program
proving techniques, especially when more systems are formally
specified using methods like Abstract Data Types. [See chapter 5].
The notion of "reliability growth" is inherent in the idea that a set
of staged tests through the software development cycle will improve
the quality of the system.
These staged tests are:
-
Development
During software development, errors may be detected. Based on the
number found and on past experience, an estimate can be made of the number
of errors that remain undetected.
Based on the complexity of the program (branching of the call graph
etc.), an estimate of the number of likely errors can be made.
It is certainly true that the more channels a distributed application uses, the
more likely it is to be error prone. But testing parts of a distributed
application in isolation rarely detects more than simple problems.
-
Testing
When the software is integrated, a set of coherent tests may be
carried out including deliberate error seeding of the program. Parts
of the program are corrupted, and the effect on the rest of the system
is observed. The results give some indication of the robustness of the
program. This technique is useful in a distributed system. It is
often vital to have checked whether a server is safe from rogue
clients generating bad requests, and that clients will not fail badly
when servers return meaningless results.
-
Validation
When software is complete, it is usually validated by running in a
non-operational environment with test data taken from the real
environment or with a distribution based on that from a real world
model.
The more tests run, the more confidence in the reliability of the
software. Unfortunately, this leads to excessive cost in testing. An
alternative is to restrict the test data by choosing it from "likely"
data - chosen by looking at the nature of input the program expects.
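The check described under Testing above, that a server is safe from rogue
clients generating bad requests, can be sketched as follows. This is only an
illustration under stated assumptions: `handle_request` is a hypothetical
server handler for a made-up `ADD <int> <int>` protocol, and `corrupt` and
`seed_errors` are hypothetical helpers that damage a valid request and count
how often the handler crashes or replies with something malformed.

```python
import random


def handle_request(raw):
    """Hypothetical server handler: expects b"ADD <int> <int>"."""
    try:
        parts = raw.decode("ascii").split()
        if len(parts) != 3 or parts[0] != "ADD":
            return b"ERR bad request"
        return b"OK %d" % (int(parts[1]) + int(parts[2]))
    except (UnicodeDecodeError, ValueError):
        # Reject anything that is not ASCII or not a pair of integers.
        return b"ERR bad request"


def corrupt(raw, rng, flips=2):
    """Overwrite a few randomly chosen bytes of a valid request."""
    data = bytearray(raw)
    for _ in range(flips):
        data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)


def seed_errors(handler, good_request, trials=1000, seed=0):
    """Feed corrupted requests to a handler.

    Returns (crashes, malformed): how many requests raised an exception
    and how many produced a reply that is neither OK nor ERR.
    """
    rng = random.Random(seed)  # fixed seed so a failing run can be repeated
    crashes = 0
    malformed = 0
    for _ in range(trials):
        bad = corrupt(good_request, rng)
        try:
            reply = handler(bad)
        except Exception:
            crashes += 1
            continue
        if not (reply.startswith(b"OK ") or reply.startswith(b"ERR ")):
            malformed += 1
    return crashes, malformed
```

The same harness can be turned around to check that a client does not fail
badly when a server returns meaningless results, by corrupting replies
instead of requests.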
What makes Distributed Systems a particular challenge is the fact that
they often exhibit a higher degree of non-deterministic behavior than
centralized systems (due to the inevitable concurrency). Thus the number
of traces of the system can be extremely large.
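To give a feel for how quickly the number of traces grows, the following
sketch counts the possible interleavings of several independent sequential
processes. It makes a simplifying assumption that is not in the text above:
each process executes a fixed sequence of atomic steps and any interleaving
that preserves each process's own order is possible. The function name
`interleavings` is hypothetical.

```python
from math import factorial


def interleavings(steps_per_process):
    """Number of distinct orderings of all steps that keep each
    process's own steps in their original order.

    This is the multinomial coefficient
    (k1 + k2 + ... + kn)! / (k1! * k2! * ... * kn!).
    """
    total = factorial(sum(steps_per_process))
    for k in steps_per_process:
        total //= factorial(k)
    return total
```

Under this assumption, two processes of two steps each already have 6
possible traces, and three processes of ten steps each have about
5.5 x 10^12, far more than any test campaign could enumerate.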
In a distributed system, it may be possible to monitor a network, and
replay all the messages in a test harness, and so test the new system
with real data in a highly effective way. However, it is worth giving
an example of a failure of a system to illustrate how hard this can be
to do effectively.