In a campus system, it was observed that a number of networked PCs crashed
occasionally, but only late on Friday afternoons.
The only known change to their software was the introduction of
a recently written Ethernet device driver.
The bug was eventually traced to a periodic broadcast packet on the
network, advertising the number of users on another particular machine,
which exceeded the buffer size for received packets in the PC. Normally,
PCs only talked to each other, and therefore never sent large packets or
large broadcast packets. For most of the week the large machine had
fewer users. But on Fridays coursework was due in, so many more students
used the machine, and its broadcast became large enough to overflow the
PC's buffer.
In software terms, the problem was that the driver checked the packet
length only after it had checked who the packet was for and had already
copied the packet into a buffer that was too small.
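
The ordering problem can be illustrated with a minimal sketch in C. It
is not the original driver: the function name rx_accept, the buffer size
RX_BUF_SIZE, and the frame layout (destination address in the first six
bytes) are all assumptions made for illustration. The point is only that
the length check comes before any copy into the fixed-size buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 128  /* hypothetical receive buffer size in bytes */
#define ADDR_LEN 6       /* length of an Ethernet address in bytes */

static const uint8_t BROADCAST[ADDR_LEN] =
    {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

/* Hypothetical receive filter. Returns 1 if the frame was copied into
   out, 0 if it was dropped. Assumes the destination address occupies
   the first ADDR_LEN bytes of the frame. */
int rx_accept(const uint8_t *frame, size_t len,
              const uint8_t my_addr[ADDR_LEN],
              uint8_t out[RX_BUF_SIZE])
{
    assert(frame != NULL && my_addr != NULL && out != NULL);

    /* Length check first: drop frames too short to hold an address
       and frames that would not fit in the receive buffer. */
    if (len < ADDR_LEN || len > RX_BUF_SIZE)
        return 0;

    /* Address check second: keep frames sent to us or broadcast. */
    if (memcmp(frame, my_addr, ADDR_LEN) != 0 &&
        memcmp(frame, BROADCAST, ADDR_LEN) != 0)
        return 0;

    /* Safe: len has already been bounded by RX_BUF_SIZE. */
    memcpy(out, frame, len);
    return 1;
}
```

With this ordering an oversized broadcast is simply dropped. The faulty
driver described above did the equivalent of the memcpy before the
length test, so the same packet overran the buffer instead.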
Operational testing is identical to validation, except of course for
the cost of failures.
During maintenance (repairs, upgrades) it is always possible that
reliability decreases. An example of the danger of maintenance in a
distributed system is what happens when an upgrade to one machine is
then distributed to all the other machines in the system. If, for
example, the upgrade introduces an error that leaves machines unable
to contact one another, then the fault cannot be rectified
automatically, because the fix would have to be distributed over the
very communication that has failed. This has implications for the
allowed rate of change of systems when they are distributed.
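
One way to picture the effect of the rate of change is a small
simulation in C. This is only a sketch under stated assumptions, not a
method taken from the text: the function staged_rollout, the batch
parameter, and the breaks_contact array (which stands in for a real
reachability probe) are all hypothetical. Machines are upgraded a batch
at a time, and the rollout halts as soon as an upgraded machine can no
longer be contacted.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated rollout. breaks_contact[i] is nonzero if the upgrade leaves
   machine i unable to contact its peers (an assumption standing in for
   a real probe). Machines are upgraded in batches of size batch, and
   each batch is checked after it has been upgraded.
   Returns the number of machines upgraded when the rollout stops. */
size_t staged_rollout(const int *breaks_contact, size_t n, size_t batch)
{
    assert(breaks_contact != NULL && batch > 0);

    size_t done = 0;
    while (done < n) {
        /* The last batch may be smaller than the others. */
        size_t end = (n - done < batch) ? n : done + batch;

        for (size_t i = done; i < end; i++) {
            if (breaks_contact[i])
                return end;  /* halt: the whole batch is already upgraded */
        }
        done = end;
    }
    return done;
}
```

If the upgrade breaks every machine, a batch size equal to the number of
machines upgrades all of them before the fault is noticed, whereas a
batch size of 2 stops after 2. The machines not yet upgraded keep the
old software and can still contact one another.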