22.1 Introduction
22.2 Process groups
22.3 Consistency of data replicas
22.4 Ordering message delivery
22.5 Distributed, n-process mutual exclusion
22.6 Summary of Part III
Exercises

22.1 Introduction

In this chapter we focus on the special problems associated with designing algorithms which are implemented with distributed components. Many are useful building blocks for designing distributed systems. Chapter 5 introduced the fundamental properties of distributed computations: concurrent execution of components, independent failure modes of components, communications delay, potential inconsistency of data and the absence of a single time-frame. Here, we shall assume that the nodes involved have clocks that are synchronised as described in Chapter 5. Some of the algorithms we consider will make use of locally issued timestamps, using process identifiers to arbitrate between requests with equal timestamps.

We shall be concerned with the fact that any component of a computation, or any connection path, may fail at any time. It is often impossible to tell which of these has happened. The situation is made worse by the possibility of communications delay: a timeout may result from a heavily loaded network or server rather than from a failure. We took account of this for a simple request-response protocol in Chapter 15. The higher-level components we consider in this chapter are built above such communications.

We have already taken a preliminary look at some distributed algorithms, but without a full discussion of failure implications. In Chapter 7 we saw that clients of distributed filing systems are likely to maintain a cache of files or blocks of files for their users. Maintenance of cache consistency requires a distributed algorithm. In the approaches we studied, the file server played a central role. In Chapter 17 we considered how distributed deadlock detection might be carried out; again, we omitted to discuss the failure modes of the approaches we considered. Chapter 21 covered distributed transactions and we studied an atomic commitment protocol, two-phase commit. We assumed that a known group of processes were to reach consensus on whether to commit or abort a result. In this chapter we extend the discussion to the composition of the group and the election of a coordinator by a group of processes.

Naming was introduced in Chapter 5 as a fundamental issue underlying all system design. We saw that the data held by name servers should be highly available and that a tradeoff may have to be made between availability and consistency of data. Implementation approaches are discussed here.

22.2 Process groups

Many algorithms assume that a certain number of components are participating; indeed, an algorithm may make use of the number of processes involved and their identities. Before such an algorithm commences we need to have a means of controlling group membership. Typical primitives for managing groups are as follows:

CREATE (group-name)
KILL (group-name)
JOIN (group-name, process)
LEAVE (group-name, process)

The system may provide a `group server' to which all these requests must be sent, or it may be that JOIN and LEAVE may be sent to any member of the group. It would of course be necessary for a process to have the appropriate access privilege to be able to KILL a group.
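To make these primitives concrete, the following Python sketch shows a minimal, centralised group server; the class name, the dictionary representation and the owner-only KILL check are illustrative assumptions rather than features of any particular system.

import threading

class GroupServer:
    """A minimal, centralised group membership service (sketch)."""

    def __init__(self):
        self.lock = threading.Lock()   # requests may arrive concurrently
        self.groups = {}               # group-name -> set of process IDs
        self.owners = {}               # group-name -> creator (may KILL)

    def create(self, group, creator):
        with self.lock:
            if group in self.groups:
                raise ValueError("group already exists")
            self.groups[group] = set()
            self.owners[group] = creator

    def kill(self, group, requester):
        with self.lock:
            # Only a suitably privileged process may destroy a group.
            if self.owners.get(group) != requester:
                raise PermissionError("no privilege to KILL this group")
            del self.groups[group], self.owners[group]

    def join(self, group, process):
        with self.lock:
            self.groups[group].add(process)

    def leave(self, group, process):
        with self.lock:
            # A crashed process is treated as having left; it must
            # rejoin on restarting.
            self.groups[group].discard(process)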
It is necessary to specify whether communication involving groups is closed or open. For example, a distributed algorithm may be such that a process must be a member of the group in order to participate in the algorithm. Alternatively, a group of servers may exist to carry out a service, in which case the service is invoked by a message from a process outside the group to any member of it.

We shall consider cases where the group has no internal structure and the algorithms are fully distributed. Alternatively, we shall study algorithms in which the group is structured and a leader acts as coordinator.

Any process may fail at any time. A crashed process is deemed to have left any group of which it is a member and must rejoin on restarting. The group control algorithms must be robust under failure, for example on failure of a process during LEAVE and JOIN. Failure of a group leader is considered in the next section.

Systems concerned with fault tolerance may be built above a group abstraction. For example, the processes of a group might each compute and vote on the result of a computation. Communication is likely to be based on a reliable multicast protocol in which any message is guaranteed to be delivered to every member of the group. ISIS is an example of such a system, see (Birman and Joseph, 1987; Birman, 1993).

22.2.1 Leadership election

The coordinator of an algorithm may in some cases be any member of a group. For example, the group members may each be maintaining a data replica and any member that receives an external request to update the data will act as coordinator of the update, see Section 22.3. If there is no obvious initiator then a leader may be `elected'. Let us assume:

* each process has a unique ID known to all members
* the process with the highest ID is the leader
* any process may fail at any time.

The Bully election algorithm

One election algorithm that has been proposed is called BULLY. All processes behave as follows:

* P notices there is no reply from the coordinator
* P sends an ELECT message to all processes with higher IDs
* if any process replies, then P exits the election
* if there are no replies, then P wins, obtains any state needed to function as leader, then sends a COORDINATOR message to all processes
* on receipt of an ELECT message a process must both reply to the sender and start an election if it is not already holding one.

Figure 22.1 Election algorithm: BULLY
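The steps listed above can be turned into a compact simulation. In this Python sketch, message passing is modelled by direct method calls, so `no reply' reduces to `no live higher-ID process', and a live higher-ID process taking the election over stands in for the reply-then-elect behaviour. The names Process and elect are our own.

class Process:
    """One group member, identified by a unique ID (sketch)."""

    def __init__(self, pid):
        self.pid = pid
        self.alive = True
        self.coordinator = None    # learned via COORDINATOR messages

    def elect(self, processes):
        """Start a BULLY election from this process; returns the winner."""
        higher = [p for p in processes if p.pid > self.pid and p.alive]
        if higher:
            # A live higher-ID process replies and takes the election
            # over; this process then exits its own election.
            return min(higher, key=lambda p: p.pid).elect(processes)
        # No replies from any higher-ID process: this process wins and
        # sends COORDINATOR to all (modelled as setting their field).
        for p in processes:
            if p.alive:
                p.coordinator = self.pid
        return self.pid

# Four processes; the coordinator (ID 4) crashes and process 1 notices.
procs = [Process(i) for i in [1, 2, 3, 4]]
procs[3].alive = False
print(procs[0].elect(procs))   # -> 3, the highest surviving ID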
Notice that there can be concurrent elections in progress, see Figure 22.1(ii), and that the algorithm can involve a large number of messages. Also, it is not specified how long a process should wait before deciding that no reply has come back and that it is therefore the new leader. This is an issue for the lower-level protocols; the high-level algorithms act according to the best information available from the lower levels.

A ring-based election algorithm

An alternative is to use a RING algorithm in which processes are ordered cyclically, the order being known to all processes. A failed process can therefore be bypassed. The algorithm requires messages to be acknowledged so that any process that fails during its execution can be bypassed. All processes behave as follows, see Figure 22.2:

* P notices the coordinator is not functioning
* P sends an ELECT message containing its own ID to the next process in the ring
* on receipt of an ELECT message:
  a) without the receiver's ID - the receiver adds its own ID and passes on the message
  b) with the receiver's ID (the message has been round the ring) - the receiver sends a message (COORDINATOR, highest-ID in the message) around the ring.

Figure 22.2 Election algorithm: RING
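A sketch of the RING algorithm in the same style; here the cyclic ordering is a Python list, bypassing a failed process is modelled by skipping IDs absent from an alive set, and acknowledgements are not simulated. All names are illustrative.

def ring_elect(starter, pids, alive):
    """RING election (sketch).

    pids  - cyclic ordering of process IDs, known to all processes
    alive - set of IDs currently functioning; failed ones are bypassed
    Returns the ID announced in the COORDINATOR message.
    """
    n = len(pids)

    def next_live(i):
        # Step round the ring, bypassing failed processes.
        j = (i + 1) % n
        while pids[j] not in alive:
            j = (j + 1) % n
        return j

    collected = [starter]                 # IDs the ELECT message carries
    i = next_live(pids.index(starter))
    while pids[i] != starter:             # until the message is back
        collected.append(pids[i])         # receiver adds its own ID
        i = next_live(i)
    # The message has been round the ring: announce the highest ID.
    return max(collected)

# Ring 1-2-3-4; the coordinator (ID 4) has failed and 2 starts.
print(ring_elect(2, [1, 2, 3, 4], alive={1, 2, 3}))   # -> 3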
The process with the highest ID then functions as coordinator; all the processes have found out this ID as the message traversed the ring. The algorithm involves fewer messages than the BULLY algorithm. Communication delay may again interfere with the algorithm, in that a heavy load on a process's node or on the communication path to it may cause the process to be bypassed. In both cases the algorithms are specified at a higher level than these timing concerns.

22.3 Consistency of data replicas

Data that is needed to support the smooth operation of a system is often replicated. An example is the naming data held by name servers that was discussed in Section 15.9. Replication of data ensures that if a single computer that holds such data fails, another copy exists. Replication may also be used to ensure that there is a copy reasonably near to all points of the system, which is particularly relevant to large-scale systems. Data is therefore replicated for reasons of availability and performance. Applications and services in general may maintain data replicas for these reasons.

Communications delay and distributed updates mean that replicas can get out of step. We may have sacrificed consistency to achieve availability. When data replicas exist in a system there must be a policy about how changes to such data are handled. Alternative policies are as follows:

* Weak consistency A change made to one of the replicas is visible immediately. The system will propagate the change to the other replicas in due course, but in the meantime the system has become inconsistent.

* Strong consistency Another approach is to attempt to make the change to all the copies before allowing the data to become visible again. The problem here is that, even if all the computers holding the replicas are available, the process is much slower than updating one nearby copy.

Here we have a simpler model than in the transactional systems of Chapter 21: we have replicas of a single object which may be read and written by distributed processes. Another term which captures the requirement for strong consistency is single-copy serialisability (1SR).

If any replica is on a system which is not available we have either to forbid the update or to handle the inconsistency when the system becomes available again. It has been said that a distributed system is one in which the crash of a computer you have never heard of can stop you getting on with your own work! We see how this problem can be alleviated in the next section. Another scenario is that all the replicas may be on systems that are running but the network has become partitioned. There is then the possibility of inconsistent updates being made within the separate partitions.

A special case of replication is that systems of moderate scale may maintain a hot standby. In this case a duplicate is maintained in step with the primary copy so that, if the primary fails, a switch can be made to the duplicate without loss of state. We shall not consider this option further but will focus on the more general case of multiple distributed replicas.

22.3.1 Quorum assembly for strong consistency

An approach to alleviating the problem of contacting every replica is to allow an update to take place when a majority of the replicas can be assembled to take part in the process. This is called a write quorum. A read quorum is also defined and its value is chosen to ensure that at least one of its member replicas contains the most recent update.
If there are n replicas, a write quorum WQ and a read quorum RQ are typically defined such that:

WQ > n/2
RQ + WQ > n

The first condition ensures that two write quorums cannot be assembled concurrently; the second ensures that any read quorum overlaps any write quorum in at least one replica. For example, if n = 7, you may have to assemble and write to 5 replicas and assemble 3 for reading. On reading, you check the time of last update of all the replicas in the quorum and use the most recent. The system software is likely to attempt to bring all the replicas up to date behind the scenes. Figure 22.3 illustrates.

Figure 22.3 Quorum assembly
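The quorum rules can be sketched as follows, with in-memory replicas, a version number standing in for the time of last update, and random.sample standing in for assembling whichever replicas respond; the atomic commitment step across the write quorum (Section 21.7) is elided. The names Replica, quorum_write and quorum_read are our own.

import random

class Replica:
    def __init__(self):
        self.value = None
        self.version = 0      # stands in for the time of last update

def quorum_write(replicas, wq, value):
    # Assemble a write quorum of wq replicas.
    quorum = random.sample(replicas, wq)
    new_version = max(r.version for r in quorum) + 1
    for r in quorum:          # in reality: atomic commitment protocol
        r.value, r.version = value, new_version

def quorum_read(replicas, rq):
    # Any read quorum overlaps the last write quorum (RQ + WQ > n),
    # so the most recent version is guaranteed to be present.
    quorum = random.sample(replicas, rq)
    freshest = max(quorum, key=lambda r: r.version)
    return freshest.value

# n = 7 replicas with WQ = 5 and RQ = 3, as in the example above.
replicas = [Replica() for _ in range(7)]
quorum_write(replicas, 5, "new name-server record")
print(quorum_read(replicas, 3))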
Note that the replicas are maintained by a group of processes of size n. Quorum assembly is tolerant of failure of some of the nodes: an update can proceed provided that a write quorum can still be assembled. An atomic commitment protocol must then be used to commit (or fail to commit) the update at all (or none) of the members of the quorum, see Section 21.7. Quorum assembly is not tolerant of the group size being increased concurrently with the assembly of a quorum, and this must be taken into account in the protocol which implements JOIN.

22.3.2 Large-scale systems

Quorum assembly alleviates the problem of having to contact every replica to make an update in a system where strong consistency is required. In large-scale systems, however, the quorum size might still be large and the process of assembly very slow. Figure 22.4 illustrates how strong and weak consistency might both be used in a large-scale system. A hierarchy of replicas is created. Recall that a similar approach was used for NTP (the Network Time Protocol) to synchronise the clocks of computers on the Internet, see Section 5.6.4.
Figure 22.4 Hierarchical organisation for large scale

Figure 22.4 shows a number of primary servers at the top level of the hierarchy. They are responsible for maintaining strongly consistent replicas. The system specification may be that all updates must be made by contacting some primary server. That server is then responsible for assembling a quorum among the primary servers and then using an atomic commitment protocol to commit the update securely. The primary servers then propagate the update down their respective hierarchies.

A process which requires a read that is guaranteed to be up to date must contact a primary server. If a quick response is required, any local server may be contacted; there is then a small risk that an update to the data item required is still on its way to that server.

(Ma, 1992) used a hierarchical scheme of this kind for name servers. (Adly et al., 1993; 1995) give a design which allows a variety of application semantics to be provided within a hierarchically structured system; the protocols are tolerant of changes in group membership. (Adly, 1995) gives an overview of replication algorithms for large-scale systems as well as describing this work in detail.
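The division of labour in Figure 22.4 might be sketched as below, reusing Replica, quorum_write and quorum_read from the sketch in Section 22.3.1. A strong update commits at a quorum of the primaries and is then propagated downwards (immediately here; asynchronously in practice); a read chooses between a primary, which is guaranteed up to date, and a local server, which is quick but possibly stale. The class and its methods are our own invention.

class HierarchicalStore:
    """Two-level hierarchy (sketch): primaries plus local servers."""

    def __init__(self, primaries, local_servers, wq):
        self.primaries = primaries          # strongly consistent replicas
        self.local_servers = local_servers  # weakly consistent replicas
        self.wq = wq                        # write quorum among primaries

    def update(self, value):
        # Strong step: quorum assembly and commitment among the primaries.
        quorum_write(self.primaries, self.wq, value)
        # Weak step: propagation down the hierarchy "in due course".
        src = max(self.primaries, key=lambda r: r.version)
        for r in self.local_servers:
            r.value, r.version = src.value, src.version

    def read(self, up_to_date=False):
        if up_to_date:
            # Guaranteed current: read quorum among the primaries,
            # with RQ chosen so that RQ + WQ > n.
            rq = len(self.primaries) - self.wq + 1
            return quorum_read(self.primaries, rq)
        # Quick response: any local server, possibly slightly stale.
        return self.local_servers[0].value

store = HierarchicalStore([Replica() for _ in range(5)],
                          [Replica() for _ in range(3)], wq=3)
store.update("record")
print(store.read(), store.read(up_to_date=True))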
22.4 Ordering message delivery

We have seen that a group of processes may be set up and may execute distributed algorithms. Each member may be maintaining a data replica and we have seen how a single update may be committed by all members of the group or by none of them, using quorum assembly followed by an atomic commitment protocol. Concurrent update requests are serialised by the fact that only one quorum assembly can succeed. Strong consistency has been achieved: we have single-copy serialisability and have imposed a total order of updates system-wide (within the group).

In the above we did not take into account the order in which the update requests originated. We could specify an additional requirement: that updates are committed at the replicas in timestamp order. In an application domain where this is appropriate, it could well be beneficial to separate the application making updates from the message delivery system, as shown in Figure 22.5.

Figure 22.5 Ordering message delivery

In some applications this strong consistency is not needed. If we remove the requirement for consistency we can allow messages to be delivered to the application in any order. We can ensure that every update reaches every replica eventually, but the order in which the updates are made is not controlled.

Between these extremes we may require causal ordering of message delivery. Here we assert that a message received by a process can potentially affect any subsequent message sent by that process; those messages should be delivered in that order at all processes. Unrelated messages may be delivered in any order. Figure 22.6 illustrates. The causal ordering condition may be stated more formally: if the sending of m at process i happened before the sending of n at process j, then at every process k that receives both, m must be delivered before n:

send_i(m) -> send_j(n)  implies  deliver_k(m) -> deliver_k(n)
Figure 22.6 Causal order

22.4.1 Vector clocks

We assume that every message is to be delivered eventually to every process. The message delivery system must be able to deliver messages to the application in causal order if required. This implies that some messages which are received `early' must be held back by the message delivery system before being delivered. A data structure must be maintained on which such decisions can be based. One approach is to use so-called `vector clocks', see Schneider's chapter in (Mullender, 1993) and Figure 22.7.

Figure 22.7 Example showing vector clocks

We assume a fixed number of processes, each with a corresponding entry in a vector. The entry for a process indicates the event count at that process; the idea is as described for logical time and event ordering in Section 5.6.3. Each time a process sends or receives a message it increments its own event count (its entry in the vector).

SEND (vector, message)
When a process sends a message it tags the message with its value of the vector. This indicates the belief of that process about the state (event counts) of all the other processes.

RECEIVE (vector, message)
When a process receives a message it updates its vector entries for the other processes from the values in the incoming vector, when these are greater than the values in its locally maintained vector.

Each process, implementing the message service for an application, therefore maintains its best knowledge of the state of the system. Based on the values in the vector it can decide whether or not to pass a given message up to the application. Figure 22.7 illustrates.

Note that we have assumed that the number of processes is fixed. Each process has a name which indicates its slot in the vector. This notation has been used historically, but a set notation would be more general: it would allow the group size to vary as processes join and leave, and would avoid the problem of a gap in the vector caused by a leaving process.
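A minimal sketch of causal delivery using vector clocks, for the fixed group size assumed above. Note one simplification relative to the description: following common causal-broadcast practice, a sender increments only its own entry, and a receiver updates its vector on delivery; messages received `early' wait in a pending queue. All names are our own.

class CausalProcess:
    """Causal-order message delivery with vector clocks (sketch)."""

    def __init__(self, slot, n):
        self.slot = slot          # this process's slot in the vector
        self.clock = [0] * n      # event counts for all n processes
        self.pending = []         # received too early; held back
        self.delivered = []       # passed up to the application

    def send(self, message):
        # SEND: count the send event, tag the message with the vector.
        self.clock[self.slot] += 1
        return (self.slot, list(self.clock), message)

    def receive(self, tagged):
        self.pending.append(tagged)
        self._try_deliver()

    def _deliverable(self, sender, vec):
        # Deliver when this is the next message from that sender and
        # we have already seen everything the sender had seen.
        return (vec[sender] == self.clock[sender] + 1 and
                all(vec[k] <= self.clock[k]
                    for k in range(len(vec)) if k != sender))

    def _try_deliver(self):
        progress = True
        while progress:           # delivering one may release others
            progress = False
            for t in list(self.pending):
                sender, vec, msg = t
                if self._deliverable(sender, vec):
                    self.pending.remove(t)
                    self.clock[sender] = vec[sender]  # RECEIVE: update
                    self.delivered.append(msg)
                    progress = True

# Demo: p2 receives p1's (causally later) message first; it is held
# back until p0's message arrives, then both are delivered in order.
p0, p1, p2 = (CausalProcess(i, 3) for i in range(3))
m1 = p0.send("update A")
p1.receive(m1)
m2 = p1.send("update B, after seeing A")
p2.receive(m2)                 # held back
p2.receive(m1)                 # releases both
print(p2.delivered)            # ['update A', 'update B, after seeing A']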
22.5 Distributed, n-process mutual exclusion

Three approaches to achieving n-process mutual exclusion over a distributed group are: a centralised algorithm, in which a coordinator grants entry to the critical region (CR); a token ring, in which permission to enter the CR circulates around the processes; and a fully distributed algorithm, in which a process wishing to enter the CR requests agreement from all the other processes. The algorithms may be criticised as follows.

The centralised algorithm is fair; it implements first-come first-served priority. It is also economical in the number of messages required. However, it has a single point of failure and the coordinator will become a bottleneck in large-scale systems. The conditions of failure of the coordinator and of denial of access to the CR cannot be distinguished. This can be solved by using an acknowledgement of the request; the requester is then trusted to wait for the CR.

The token ring algorithm is reasonably efficient, although the token continues to circulate when no process wishes to enter the CR. It is not fair, in that the ring ordering does not honour the order in which processes request the CR. The algorithm is susceptible to loss of the token. Acknowledgement messages are needed to detect failure of components, which must then be bypassed.

The distributed algorithm replaces a single point of failure with n separate points of failure and n bottlenecks. Although it is fair, it is expensive in the number of messages required. As above, acknowledgement messages should be used to distinguish between failure and denial of the CR.
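As one concrete illustration, here is a failure-free sketch of the token ring approach criticised above; detecting a lost token and bypassing failed processes would require the acknowledgements and timeouts just discussed. The function and its parameters are our own.

def token_ring_mutex(ring, wants_cr, rounds=1):
    """Token-ring mutual exclusion (sketch, failure-free).

    ring     - list of process IDs in cyclic order
    wants_cr - maps a process ID to True if it wants the CR
    The token circulates; only its current holder may enter the CR.
    """
    entries = []
    for _ in range(rounds):
        for pid in ring:                  # token passes round the ring
            if wants_cr.get(pid):
                entries.append(pid)       # holder enters, then exits, the CR
                wants_cr[pid] = False     # request satisfied
            # otherwise the token is simply forwarded to the next process
    return entries

# Processes 3 and 1 want the CR; ring order, not request order, decides.
print(token_ring_mutex([1, 2, 3, 4], {3: True, 1: True}))   # -> [1, 3]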
22.6 Summary of Part III

Finally, we stepped back from transactions involving persistent data and examined some multi-component distributed computations. The algorithms covered are suitable for use as building blocks within various categories of application. For each algorithm we considered the implications of the fundamental properties of distributed systems.

Exercises

22.1 How is the distributed deadlock detection algorithm of Section 17.10 affected by failure of the nodes involved?

22.2 Expand the algorithms and protocols for cache coherency outlined in Section 7.6.5. Discuss their failure semantics.

22.3 Discuss object coherency and failure semantics for a distributed object mapping system as described in Section 7.7.2.

22.4 Discuss how the nodes executing an atomic commitment protocol after write-quorum assembly should handle an incoming JOIN or LEAVE message from a process which is not in the write quorum.

22.5 For the BULLY and RING election algorithms: How many messages are involved in electing a new coordinator? What should happen when a failed coordinator restarts?

22.6 Give examples of applications which you would expect to require strong consistency of data and of applications which you would expect to prefer or tolerate weak consistency.

22.7 For the n-process mutual exclusion algorithms: Is each algorithm correct and free from deadlock? How many messages are required for the entry and exit protocols, with and without failure detection? How many processes does each process need to communicate with? How easy is it for the group of processes to be reconfigured by JOIN and LEAVE?