22.1 Introduction
22.2 Process groups
22.3 Consistency of data replicas
22.4 Ordering message delivery
22.5 Distributed, n-process mutual exclusion
22.6 Summary of Part III
Exercises

22.1 Introduction

In this chapter we focus on the special problems associated with designing algorithms which are implemented with distributed components. Many are useful building blocks for designing distributed systems. Chapter 5 introduced the fundamental properties of distributed computations: concurrent execution of components, independent failure modes of components, communications delay, potential inconsistency of data and the absence of a single time-frame. Here, we shall assume that the nodes involved have clocks that are synchronised as described in Chapter 5. Some of the algorithms we consider will make use of locally issued timestamps, using process identifiers to arbitrate between requests with equal timestamps.

We shall be concerned with the fact that any component of a computation, or any connection path, may fail at any time. It is often impossible to tell which of these has happened. The situation is made worse by the possibility of communications delay: a timeout may result from a heavily loaded network or server rather than from a failure. We took account of this for a simple request-response protocol in Chapter 15. The higher-level components we consider in this chapter are built above such communications.

We have already taken a preliminary look at some distributed algorithms, but without a full discussion of failure implications. In Chapter 7 we saw that clients of distributed filing systems are likely to maintain a cache of files or blocks of files for their users. Maintenance of cache consistency requires a distributed algorithm. In the approaches we studied, the file server played a central role. In Chapter 17 we considered how distributed deadlock detection might be carried out; again, we omitted to discuss the failure modes of the approaches we considered. Chapter 21 covered distributed transactions and we studied an atomic commitment protocol, two-phase commit. We assumed that a known group of processes were to reach consensus on whether to commit or abort a result. In this chapter we extend the discussion to the composition of the group and the election of a coordinator by a group of processes.

Naming was introduced in Chapter 5 as a fundamental issue underlying all system design. We saw that the data held by name servers should be highly available and that a tradeoff may have to be made between availability and consistency of data. Implementation approaches are discussed here.

22.2 Process groups

Many algorithms assume that a certain number of components are participating; indeed, an algorithm may make use of the number of processes involved and their identities. Before such an algorithm commences we need to have a means of controlling group membership. Typical primitives for managing groups are as follows:

CREATE (group-name)
KILL (group-name)
JOIN (group-name, process)
LEAVE (group-name, process)

The system may provide a `group server' to which all these requests must be sent, or it may be that JOIN and LEAVE may be sent to any member of the group. It would of course be necessary for a process to have the appropriate access privilege to be able to KILL a group.
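To make these primitives concrete, the following Python sketch shows a minimal, centralised group server; the class name, the dictionary representation and the owner-only KILL check are illustrative assumptions rather than features of any particular system.

import threading

class GroupServer:
    """A minimal, centralised group membership service (sketch)."""

    def __init__(self):
        self.lock = threading.Lock()   # requests may arrive concurrently
        self.groups = {}               # group-name -> set of process IDs
        self.owners = {}               # group-name -> creator (may KILL)

    def create(self, group, creator):
        with self.lock:
            if group in self.groups:
                raise ValueError("group already exists")
            self.groups[group] = set()
            self.owners[group] = creator

    def kill(self, group, requester):
        with self.lock:
            # Only a suitably privileged process may destroy a group.
            if self.owners.get(group) != requester:
                raise PermissionError("no privilege to KILL this group")
            del self.groups[group], self.owners[group]

    def join(self, group, process):
        with self.lock:
            self.groups[group].add(process)

    def leave(self, group, process):
        with self.lock:
            # A crashed process is treated as having left; it must
            # rejoin on restarting.
            self.groups[group].discard(process)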
It is necessary to specify whether communication involving groups is closed or open. For example, a distributed algorithm may be such that a process must be a member of the group in order to participate in the algorithm. Alternatively, a group of servers may exist to carry out a service, in which case the service is invoked by a message from a process outside the group to any member of it.

We shall consider cases where the group has no internal structure and the algorithms are fully distributed. Alternatively, we shall study algorithms in which the group is structured and a leader acts as coordinator.

Any process may fail at any time. A crashed process is deemed to have left any group of which it is a member and must rejoin on restarting. The group control algorithms must be robust under failure, for example on failure of a process during LEAVE and JOIN. Failure of a group leader is considered in the next section.

Systems concerned with fault tolerance may be built above a group abstraction. For example, the processes of a group might each compute and vote on the result of a computation. Communication is likely to be based on a reliable multicast protocol in which any message is guaranteed to be delivered to every member of the group. ISIS is an example of such a system, see (Birman and Joseph, 1987; Birman, 1993).

22.2.1 Leadership election

The coordinator of an algorithm may in some cases be any member of a group. For example, the group members may each be maintaining a data replica and any member that receives an external request to update the data will act as coordinator of the update, see Section 22.3. If there is no obvious initiator then a leader may be `elected'. Let us assume:

* each process has a unique ID known to all members
* the process with the highest ID is the leader
* any process may fail at any time.

The Bully election algorithm

One election algorithm that has been proposed is called BULLY. All processes behave as follows:

* P notices there is no reply from the coordinator
* P sends an ELECT message to all processes with higher IDs
* if any process replies, then P exits the election
* if there are no replies, then P wins, obtains any state needed to function as leader, then sends a COORDINATOR message to all processes
* on receipt of an ELECT message a process must both reply to the sender and start an election if it is not already holding one.

Figure 22.1 Election algorithm: BULLY
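The steps listed above can be turned into a compact simulation. In this Python sketch, message passing is modelled by direct method calls, so `no reply' reduces to `no live higher-ID process', and a live higher-ID process taking the election over stands in for the reply-then-elect behaviour. The names Process and elect are our own.

class Process:
    """One group member, identified by a unique ID (sketch)."""

    def __init__(self, pid):
        self.pid = pid
        self.alive = True
        self.coordinator = None    # learned via COORDINATOR messages

    def elect(self, processes):
        """Start a BULLY election from this process; returns the winner."""
        higher = [p for p in processes if p.pid > self.pid and p.alive]
        if higher:
            # A live higher-ID process replies and takes the election
            # over; this process then exits its own election.
            return min(higher, key=lambda p: p.pid).elect(processes)
        # No replies from any higher-ID process: this process wins and
        # sends COORDINATOR to all (modelled as setting their field).
        for p in processes:
            if p.alive:
                p.coordinator = self.pid
        return self.pid

# Four processes; the coordinator (ID 4) crashes and process 1 notices.
procs = [Process(i) for i in [1, 2, 3, 4]]
procs[3].alive = False
print(procs[0].elect(procs))   # -> 3, the highest surviving ID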
Notice that there can be concurrent elections in progress, see Figure 22.1(ii), and that the algorithm can involve a large number of messages. Also, it is not specified how long a process should wait before deciding that no reply has come back and that it is therefore the new leader. This is an issue for the lower-level protocols; the high-level algorithms act according to the best information available from the lower levels.

A ring-based election algorithm

An alternative is to use a RING algorithm in which processes are ordered cyclically, the order being known to all processes. A failed process can therefore be bypassed. The algorithm requires messages to be acknowledged so that any process that fails during its execution can be bypassed. All processes behave as follows, see Figure 22.2:

* P notices the coordinator is not functioning
* P sends an ELECT message containing its own ID to the next process in the ring
* on receipt of an ELECT message:
  a) without the receiver's ID - the receiver adds its own ID and passes on the message
  b) with the receiver's ID (the message has been round the ring) - the receiver sends a message (COORDINATOR, highest-ID in the message) around the ring.

Figure 22.2 Election algorithm: RING
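A sketch of the RING algorithm in the same style; here the cyclic ordering is a Python list, bypassing a failed process is modelled by skipping IDs absent from an alive set, and acknowledgements are not simulated. All names are illustrative.

def ring_elect(starter, pids, alive):
    """RING election (sketch).

    pids  - cyclic ordering of process IDs, known to all processes
    alive - set of IDs currently functioning; failed ones are bypassed
    Returns the ID announced in the COORDINATOR message.
    """
    n = len(pids)

    def next_live(i):
        # Step round the ring, bypassing failed processes.
        j = (i + 1) % n
        while pids[j] not in alive:
            j = (j + 1) % n
        return j

    collected = [starter]                 # IDs the ELECT message carries
    i = next_live(pids.index(starter))
    while pids[i] != starter:             # until the message is back
        collected.append(pids[i])         # receiver adds its own ID
        i = next_live(i)
    # The message has been round the ring: announce the highest ID.
    return max(collected)

# Ring 1-2-3-4; the coordinator (ID 4) has failed and 2 starts.
print(ring_elect(2, [1, 2, 3, 4], alive={1, 2, 3}))   # -> 3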
The process with the highest ID then functions as coordinator; all the processes have found out this ID as the message traversed the ring. The algorithm involves fewer messages than the BULLY algorithm. Communication delay may again interfere with the algorithm, in that a heavy load on a process's node or on the communication path to it may cause the process to be bypassed. In both cases the algorithms are specified at a higher level than these timing concerns.

22.3 Consistency of data replicas

Data that is needed to support the smooth operation of a system is often replicated. An example is the naming data held by name servers that was discussed in Section 15.9. Replication of data ensures that if a single computer that holds such data fails, another copy exists. Replication may also be used to ensure that there is a copy reasonably near to all points of the system, which is particularly relevant to large-scale systems. Data is therefore replicated for reasons of availability and performance. Applications and services in general may maintain data replicas for these reasons.

Communications delay and distributed updates mean that replicas can get out of step. We may have sacrificed consistency to achieve availability. When data replicas exist in a system there must be a policy about how changes to such data are handled. Alternative policies are as follows:

* Weak consistency A change made to one of the replicas is visible immediately. The system will propagate the change to the other replicas in due course, but in the meantime the system has become inconsistent.

* Strong consistency Another approach is to attempt to make the change to all the copies before allowing the data to become visible again. The problem here is that, even if all the computers holding the replicas are available, the process is much slower than updating one nearby copy.

Here we have a simpler model than in the transactional systems of Chapter 21: we have replicas of a single object which may be read and written by distributed processes. Another term which captures the requirement for strong consistency is single-copy serialisability (1SR).

If any replica is on a system which is not available we have either to forbid the update or to handle the inconsistency when the system becomes available again. It has been said that a distributed system is one in which the crash of a computer you have never heard of can stop you getting on with your own work! We see how this problem can be alleviated in the next section. Another scenario is that all the replicas may be on systems that are running but the network has become partitioned. There is then the possibility of inconsistent updates being made within the separate partitions.

A special case of replication is that systems of moderate scale may maintain a hot standby. In this case a duplicate is maintained in step with the primary copy so that, if the primary fails, a switch can be made to the duplicate without loss of state. We shall not consider this option further but will focus on the more general case of multiple distributed replicas.

22.3.1 Quorum assembly for strong consistency

An approach to alleviating the problem of contacting every replica is to allow an update to take place when a majority of the replicas can be assembled to take part in the process. This is called a write quorum. A read quorum is also defined and its value is chosen to ensure that at least one of its member replicas contains the most recent update.
If there are n replicas, a write quorum WQ and a read quorum RQ are typically defined such that:

WQ > n/2
RQ + WQ > n

The first condition ensures that two write quorums cannot be assembled concurrently; the second ensures that any read quorum overlaps any write quorum in at least one replica. For example, if n = 7, you may have to assemble and write to 5 replicas and assemble 3 for reading. On reading, you check the time of last update of all the replicas in the quorum and use the most recent. The system software is likely to attempt to bring all the replicas up to date behind the scenes. Figure 22.3 illustrates.

Figure 22.3 Quorum assembly
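The quorum rules can be sketched as follows, with in-memory replicas, a version number standing in for the time of last update, and random.sample standing in for assembling whichever replicas respond; the atomic commitment step across the write quorum (Section 21.7) is elided. The names Replica, quorum_write and quorum_read are our own.

import random

class Replica:
    def __init__(self):
        self.value = None
        self.version = 0      # stands in for the time of last update

def quorum_write(replicas, wq, value):
    # Assemble a write quorum of wq replicas.
    quorum = random.sample(replicas, wq)
    new_version = max(r.version for r in quorum) + 1
    for r in quorum:          # in reality: atomic commitment protocol
        r.value, r.version = value, new_version

def quorum_read(replicas, rq):
    # Any read quorum overlaps the last write quorum (RQ + WQ > n),
    # so the most recent version is guaranteed to be present.
    quorum = random.sample(replicas, rq)
    freshest = max(quorum, key=lambda r: r.version)
    return freshest.value

# n = 7 replicas with WQ = 5 and RQ = 3, as in the example above.
replicas = [Replica() for _ in range(7)]
quorum_write(replicas, 5, "new name-server record")
print(quorum_read(replicas, 3))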
Note that the replicas are maintained by a group of processes of size n. Quorum assembly is tolerant of failure of some of the nodes: an update can proceed provided that a write quorum can still be assembled. An atomic commitment protocol must then be used to commit (or fail to commit) the update at all (or none) of the members of the quorum, see Section 21.7. Quorum assembly is not tolerant of the group size being increased concurrently with the assembly of a quorum, and this must be taken into account in the protocol which implements JOIN.

22.3.2 Large-scale systems

Quorum assembly alleviates the problem of having to contact every replica to make an update in a system where strong consistency is required. In large-scale systems, however, the quorum size might still be large and the process of assembly very slow. Figure 22.4 illustrates how strong and weak consistency might both be used in a large-scale system. A hierarchy of replicas is created. Recall that a similar approach was used for NTP (the Network Time Protocol) to synchronise the clocks of computers on the Internet, see Section 5.6.4.
Figure 22.4 Hierarchical organisation for large scale

Figure 22.4 shows a number of primary servers at the top level of the hierarchy. They are responsible for maintaining strongly consistent replicas. The system specification may be that all updates must be made by contacting some primary server. That server is then responsible for assembling a quorum among the primary servers and then using an atomic commitment protocol to commit the update securely. The primary servers then propagate the update down their respective hierarchies.

A process which requires a read that is guaranteed to be up to date must contact a primary server. If a quick response is required, any local server may be contacted; there is then a small risk that an update to the data item required is still on its way to that server.

(Ma, 1992) used a hierarchical scheme of this kind for name servers. (Adly et al., 1993; 1995) give a design which allows a variety of application semantics to be provided within a hierarchically structured system; the protocols are tolerant of changes in group membership. (Adly, 1995) gives an overview of replication algorithms for large-scale systems as well as describing this work in detail.
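The division of labour in Figure 22.4 might be sketched as below, reusing Replica, quorum_write and quorum_read from the sketch in Section 22.3.1. A strong update commits at a quorum of the primaries and is then propagated downwards (immediately here; asynchronously in practice); a read chooses between a primary, which is guaranteed up to date, and a local server, which is quick but possibly stale. The class and its methods are our own invention.

class HierarchicalStore:
    """Two-level hierarchy (sketch): primaries plus local servers."""

    def __init__(self, primaries, local_servers, wq):
        self.primaries = primaries          # strongly consistent replicas
        self.local_servers = local_servers  # weakly consistent replicas
        self.wq = wq                        # write quorum among primaries

    def update(self, value):
        # Strong step: quorum assembly and commitment among the primaries.
        quorum_write(self.primaries, self.wq, value)
        # Weak step: propagation down the hierarchy "in due course".
        src = max(self.primaries, key=lambda r: r.version)
        for r in self.local_servers:
            r.value, r.version = src.value, src.version

    def read(self, up_to_date=False):
        if up_to_date:
            # Guaranteed current: read quorum among the primaries,
            # with RQ chosen so that RQ + WQ > n.
            rq = len(self.primaries) - self.wq + 1
            return quorum_read(self.primaries, rq)
        # Quick response: any local server, possibly slightly stale.
        return self.local_servers[0].value

store = HierarchicalStore([Replica() for _ in range(5)],
                          [Replica() for _ in range(3)], wq=3)
store.update("record")
print(store.read(), store.read(up_to_date=True))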
22.4 Ordering message delivery

We have seen that a group of processes may be set up and may execute distributed algorithms. Each member may be maintaining a data replica and we have seen how a single update may be committed by all members of the group or by none of them, using quorum assembly followed by an atomic commitment protocol. Concurrent update requests are serialised by the fact that only one quorum assembly can succeed. Strong consistency has been achieved: we have single-copy serialisability and have imposed a total order of updates system-wide (within the group).

In the above we did not take into account the order in which the update requests originated. We could specify an additional requirement: that updates are committed at the replicas in timestamp order. In an application domain where this is appropriate, it could well be beneficial to separate the application making updates from the message delivery system, as shown in Figure 22.5.

Figure 22.5 Ordering message delivery

In some applications this strong consistency is not needed. If we remove the requirement for consistency we can allow messages to be delivered to the application in any order. We can ensure that every update reaches every replica eventually, but the order in which the updates are made is not controlled.

Between these extremes we may require causal ordering of message delivery. Here we assert that a message received by a process can potentially affect any subsequent message sent by that process; those messages should be delivered in that order at all processes. Unrelated messages may be delivered in any order. Figure 22.6 illustrates. The causal ordering condition may be stated more formally: if the sending of m at process i happened before the sending of n at process j, then at every process k that receives both, m must be delivered before n:

send_i(m) -> send_j(n)  implies  deliver_k(m) -> deliver_k(n)
Figure 22.6 Causal order

22.4.1 Vector clocks

We assume that every message is to be delivered eventually to every process. The message delivery system must be able to deliver messages to the application in causal order if required. This implies that some messages which are received `early' must be held back by the message delivery system before being delivered. A data structure must be maintained on which such decisions can be based. One approach is to use so-called `vector clocks', see Schneider's chapter in (Mullender, 1993) and Figure 22.7.

Figure 22.7 Example showing vector clocks

We assume a fixed number of processes, each with a corresponding entry in a vector. The entry for a process indicates the event count at that process; the idea is as described for logical time and event ordering in Section 5.6.3. Each time a process sends or receives a message it increments its own event count (its entry in the vector).

SEND (vector, message)
When a process sends a message it tags the message with its value of the vector. This indicates the belief of that process about the state (event counts) of all the other processes.

RECEIVE (vector, message)
When a process receives a message it updates its vector entries for the other processes from the values in the incoming vector, when these are greater than the values in its locally maintained vector.

Each process, implementing the message service for an application, therefore maintains its best knowledge of the state of the system. Based on the values in the vector it can decide whether or not to pass a given message up to the application. Figure 22.7 illustrates.

Note that we have assumed that the number of processes is fixed. Each process has a name which indicates its slot in the vector. This notation has been used historically, but a set notation would be more general: it would allow the group size to vary as processes join and leave, and would avoid the problem of a gap in the vector caused by a leaving process.
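A minimal sketch of causal delivery using vector clocks, for the fixed group size assumed above. Note one simplification relative to the description: following common causal-broadcast practice, a sender increments only its own entry, and a receiver updates its vector on delivery; messages received `early' wait in a pending queue. All names are our own.

class CausalProcess:
    """Causal-order message delivery with vector clocks (sketch)."""

    def __init__(self, slot, n):
        self.slot = slot          # this process's slot in the vector
        self.clock = [0] * n      # event counts for all n processes
        self.pending = []         # received too early; held back
        self.delivered = []       # passed up to the application

    def send(self, message):
        # SEND: count the send event, tag the message with the vector.
        self.clock[self.slot] += 1
        return (self.slot, list(self.clock), message)

    def receive(self, tagged):
        self.pending.append(tagged)
        self._try_deliver()

    def _deliverable(self, sender, vec):
        # Deliver when this is the next message from that sender and
        # we have already seen everything the sender had seen.
        return (vec[sender] == self.clock[sender] + 1 and
                all(vec[k] <= self.clock[k]
                    for k in range(len(vec)) if k != sender))

    def _try_deliver(self):
        progress = True
        while progress:           # delivering one may release others
            progress = False
            for t in list(self.pending):
                sender, vec, msg = t
                if self._deliverable(sender, vec):
                    self.pending.remove(t)
                    self.clock[sender] = vec[sender]  # RECEIVE: update
                    self.delivered.append(msg)
                    progress = True

# Demo: p2 receives p1's (causally later) message first; it is held
# back until p0's message arrives, then both are delivered in order.
p0, p1, p2 = (CausalProcess(i, 3) for i in range(3))
m1 = p0.send("update A")
p1.receive(m1)
m2 = p1.send("update B, after seeing A")
p2.receive(m2)                 # held back
p2.receive(m1)                 # releases both
print(p2.delivered)            # ['update A', 'update B, after seeing A']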
22.5 Distributed, n-process mutual exclusion

Three approaches to achieving n-process mutual exclusion over a distributed group are: a centralised algorithm, in which a coordinator grants entry to the critical region (CR); a token ring, in which permission to enter the CR circulates around the processes; and a fully distributed algorithm, in which a process wishing to enter the CR requests agreement from all the other processes. The algorithms may be criticised as follows.

The centralised algorithm is fair; it implements first-come first-served priority. It is also economical in the number of messages required. However, it has a single point of failure and the coordinator will become a bottleneck in large-scale systems. The conditions of failure of the coordinator and of denial of access to the CR cannot be distinguished. This can be solved by using an acknowledgement of the request; the requester is then trusted to wait for the CR.

The token ring algorithm is reasonably efficient, although the token continues to circulate when no process wishes to enter the CR. It is not fair, in that the ring ordering does not honour the order in which processes request the CR. The algorithm is susceptible to loss of the token. Acknowledgement messages are needed to detect failure of components, which must then be bypassed.

The distributed algorithm replaces a single point of failure with n separate points of failure and n bottlenecks. Although it is fair, it is expensive in the number of messages required. As above, acknowledgement messages should be used to distinguish between failure and denial of the CR.
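As one concrete illustration, here is a failure-free sketch of the token ring approach criticised above; detecting a lost token and bypassing failed processes would require the acknowledgements and timeouts just discussed. The function and its parameters are our own.

def token_ring_mutex(ring, wants_cr, rounds=1):
    """Token-ring mutual exclusion (sketch, failure-free).

    ring     - list of process IDs in cyclic order
    wants_cr - maps a process ID to True if it wants the CR
    The token circulates; only its current holder may enter the CR.
    """
    entries = []
    for _ in range(rounds):
        for pid in ring:                  # token passes round the ring
            if wants_cr.get(pid):
                entries.append(pid)       # holder enters, then exits, the CR
                wants_cr[pid] = False     # request satisfied
            # otherwise the token is simply forwarded to the next process
    return entries

# Processes 3 and 1 want the CR; ring order, not request order, decides.
print(token_ring_mutex([1, 2, 3, 4], {3: True, 1: True}))   # -> [1, 3]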
22.6 Summary of Part III

Finally, we stepped back from transactions involving persistent data and examined some multi-component distributed computations. The algorithms covered are suitable for use as building blocks within various categories of application. For each algorithm we considered the implications of the fundamental properties of distributed systems.

Exercises

22.1 How is the distributed deadlock detection algorithm of Section 17.10 affected by failure of the nodes involved?

22.2 Expand the algorithms and protocols for cache coherency outlined in Section 7.6.5. Discuss their failure semantics.

22.3 Discuss object coherency and failure semantics for a distributed object mapping system as described in Section 7.7.2.

22.4 Discuss how the nodes executing an atomic commitment protocol after write-quorum assembly should handle an incoming JOIN or LEAVE message from a process which is not in the write quorum.

22.5 For the BULLY and RING election algorithms: How many messages are involved in electing a new coordinator? What should happen when a failed coordinator restarts?

22.6 Give examples of applications which you would expect to require strong consistency of data and of applications which you would expect to prefer or tolerate weak consistency.

22.7 For the n-process mutual exclusion algorithms: Is each algorithm correct and free from deadlock? How many messages are required for the entry and exit protocols, with and without failure detection? How many processes does each process need to communicate with? How easy is it for the group of processes to be reconfigured by JOIN and LEAVE?