The CAR Conferencing System used RPC because
the system designers were
familiar with RPC, and not so familiar with lower level network
programming tools. The modularisation that we chose lent itself to
separate implementation and testing. Different programmers were able
to carry out these tasks, even though several of these programmers had no
previous experience with networks. Testing could take place without
the need for a network at all, since the RPCs could be made between
processes on a single machine. The benefit was that the complete
system was produced in a matter of months.
It was expected that conferences would be small and tightly controlled
so that a single, centralised Conference Server would be adequate.
Networks were believed to be reliable.
Our experience of running the system across European sites
interconnected by IP routers across the narrow band ISDN, was that
demand for wide area conferences amongst larger
groups exceeded our expectations. We also found that
network failures were common.
Several single points of failure can bring down the entire system:
- If any of the Switch Controller, Switching Server, Directory Server
or Conference Server fails, the entire system fails.
- If the servers above are distributed across a network, then if any
link between them fails, the entire system blocks until that link
recovers.
- Worse still, the nature of synchronous blocking RPC is such that if a
link fails between the Conference Server acting as a Notification
Client and a Notification Server, the entire system blocks until
that link recovers.
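The compounding cost of such serial dependencies can be illustrated with a back-of-the-envelope calculation: a system that fails whenever any one component fails has an availability equal to the product of its components' availabilities. The figures below are invented for illustration, not measurements from the CAR system.

```python
# Illustrative only: the per-server availability of 0.99 is an assumed
# figure, not a measured property of the CAR system's servers.
servers = {
    "Switch Controller": 0.99,
    "Switching Server": 0.99,
    "Directory Server": 0.99,
    "Conference Server": 0.99,
}

def serial_availability(availabilities):
    """Availability of a system that fails if ANY component fails."""
    a = 1.0
    for v in availabilities:
        a *= v
    return a

print(round(serial_availability(servers.values()), 4))  # → 0.9606
```

Even with each of the four servers available 99% of the time, the conference as a whole is available only about 96% of the time, and every additional serial dependency (such as a network link between servers) multiplies the figure down further.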
Of course, the programmer using an RPC system can always set a finite timeout.
However, the programmer then has two problems: determining a valid timeout
without access to the underlying communications code (e.g. the round-trip time
estimators in the transport protocol) is hard; and in the event of failure
(whether or not the call was received and acted on), the caller must handle
the exception. One might prefer to use a messaging system from the start.
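Both problems can be seen in a minimal sketch of a blocking request/reply call with a finite timeout. The function name, message format, and the timeout value here are all hypothetical; the point is that when the timeout fires, the caller cannot distinguish a lost request from a lost reply.

```python
import socket

# Hypothetical interface: notify_conference and NOTIFICATION_TIMEOUT are
# illustrative names, not part of the CAR system's actual implementation.
NOTIFICATION_TIMEOUT = 2.0  # seconds; hard to choose well without access
                            # to the transport's round-trip time estimates

def notify_conference(host, port, message):
    """Send a notification as a blocking request/reply over TCP.

    Returns the server's reply. On timeout, raises an error: at that
    point the caller cannot tell whether the request was acted on.
    """
    with socket.create_connection((host, port),
                                  timeout=NOTIFICATION_TIMEOUT) as s:
        s.settimeout(NOTIFICATION_TIMEOUT)
        s.sendall(message)
        try:
            return s.recv(4096)
        except socket.timeout:
            # Ambiguous failure: the notification may or may not have
            # been received and executed by the server.
            raise RuntimeError("notification outcome unknown") from None
```

The exception handler above can only report that the outcome is unknown; resolving the ambiguity requires extra protocol machinery (e.g. sequence numbers and idempotent operations), which is exactly the kind of work a messaging system takes on for the application.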
Since the Conference Server is central to a conference, it is also a
performance bottleneck for the system. Recent experience
broadcasting the Internet Engineering Task Force Meeting over the
Internet in the US [IETF]
using a more loosely structured system
shows that it is not unreasonable to have 500 participants in a
networked multimedia conference. However, there is little hope of the
CAR Conferencing System Architecture scaling to 500. In the
Internet, 500 sites rarely have full connectivity, and
any single link outage will thus frequently block the progress
of the entire conference. It may be that real-time
multimedia chatlines will become as commonplace as
Internet Relay Chat and Bulletin Board use - in this case, we may see
many systems spread over the wide area, and inevitably have to
deal with partial availability.
We believe that our instantiation of components as
processes, and the resulting communications architecture, was a poor
implementation strategy.
To use Open Distributed Processing terminology, from the
information viewpoint we have
reflected the right model, but from the engineering viewpoint, we may
not have.
Simple replication of the servers provides no solution to the lack of
fault tolerance - an entire protocol must be developed to make any
access to those replicated servers location transparent, and map
failures into the right application exceptions. In other
words, the decomposition into services has done little more than basic
software engineering would, and does not address the real distributed
computing aspects of the problem. A classical approach, consisting of
replicated servers, timeouts, and re-location by the client to a secondary
server, would rectify these problems, but seems somewhat ad hoc.
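That classical approach can be sketched as a client that holds an ordered list of replica addresses and, when a call to one replica times out or fails, re-locates to the next. The replica addresses, call format, and timeout below are assumptions for illustration, not part of the CAR system.

```python
import socket

# Hypothetical replica list; in a real deployment these would come from
# a directory service rather than being hard-coded.
REPLICAS = [("primary.example.org", 9000), ("secondary.example.org", 9000)]
CALL_TIMEOUT = 2.0  # seconds; as noted above, choosing this is itself hard

def call_with_failover(replicas, request, timeout=CALL_TIMEOUT):
    """Try each replica in turn; return the first successful reply.

    Raises an error only when every replica is unreachable. Note that
    this masks location, but does not resolve the ambiguity of a call
    that timed out after being received by a replica.
    """
    last_error = None
    for host, port in replicas:
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.settimeout(timeout)
                s.sendall(request)
                return s.recv(4096)
        except OSError as e:
            last_error = e  # re-locate: try the next replica
    raise RuntimeError("all replicas unreachable") from last_error
```

Even this small sketch shows why the approach is ad hoc: the failover logic, the replica list, and the mapping of failures onto application exceptions all live in the client, and every client must reimplement them, which is precisely the "entire protocol" that simple replication of servers fails to provide.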