Chapter 5 Fundamentals of Distributed Systems

5.1 Introduction
5.2 Evolution of distributed systems for the workplace
5.3 Ubiquitous computing
5.4 Models and software architecture
5.5 Special characteristics of distributed systems
5.6 Time in distributed systems
5.7 Naming in distributed systems
5.8 Mobile users, computers and objects
5.9 Requirements for security in distributed systems
5.10 Summary

5.1 Introduction

Since distributed systems have become commonplace we should take distribution of software into account at an early stage in our study of concurrent software design. In Chapter 2 we took an architectural view of how software systems might be structured by decomposition into modules, subsystems and objects, and introduced the idea that such components might reside on different computers in a distributed system. Each component of such a system comprises an operating system, with internal modular structure, which in turn supports modular services and applications. We saw that a distributed system can even be built above different operating systems running on different hardware. A platform might run as a service above the heterogeneous operating systems; the idea is to convert the various operating systems' system call interfaces into a higher-level common interface to be used by the higher-level modules of the distributed system.

In Chapter 3 we studied the hardware-software interface and the functionality of communications subsystems within and above operating systems. We therefore have the basic framework in place: distributed computations can be built at higher levels, making use of communications services. In Chapter 4 we saw how processes and threads are supported by operating systems and therefore understand in detail how modules are executed dynamically. We now establish a framework for understanding how a distributed computation comprising distributed modules can be executed dynamically by distributed processes.

In this chapter we establish the fundamental properties of distributed systems that distinguish them from centralised systems. For the rest of the book we look first at the general principles of software system design and then go on to study the special problems of a distributed implementation.

The advent of local area networks in the 1970s provided the motivation for the development of distributed systems. We start this chapter by discussing the evolution of distributed systems, first for a professional, workplace environment and then for more recent personal and home environments.

Terminology
A number of different names are used in the literature to indicate a single component computer system in a distributed system, such as host, node or site. We shall use the term node, and also client and server where appropriate.

5.2 Evolution of distributed systems for the workplace

In the 1970s Local Area Network (LAN) communications technology became available; see Chapter 3. The lightweight protocols which exploited the low error rates and high bandwidth made it possible to base distributed systems on frequent interaction between component systems. A number of design tradeoffs were explored in research environments. The issues to be decided were what kind of system to put on each user's desk and what to provide as shared services across the network.

5.2.1 Pool of processors plus servers, Figure 5.1a

The Cambridge Distributed System was an early example of a distributed computing environment which was in everyday use throughout the 1980s.
When it was designed, workstations were expensive, so each user was given only a dumb terminal. Work was done by acquiring a processor from a pool and requesting an operating system to be loaded into it. The pool was managed by the Resource Manager (a server), and other servers existed for filing, printing, bootstrapping, serving the time, gateways and authentication.

5.2.2 Diskless workstations plus servers, Figure 5.1b

The V system at Stanford (home of the Stanford University Network, from which SUN took its name) provided a diskless workstation on the desk. The idea is that code and data can easily be transferred across the network when required, and information which needs to be shared is naturally held in file servers. System software can be held and maintained remote from the end systems. The workstations can be relatively inexpensive.

5.2.3 Workstations plus servers, Figure 5.1c

At Xerox PARC, the home of Ethernet, the philosophy was to provide a workstation on the desk, even before this was economically feasible outside a research environment. The idea was to stimulate the development of next-generation software by providing a next-generation environment.

As time went by, price/performance tradeoffs changed; also, workstation software became as complex in functionality as the mainframe software it had replaced. Current workstations can function as timesharing systems if the owner authorises remote logins. A shared file service offered by a number of servers has become the conventional way to share data. The system on the desk will probably have local storage, but this is likely to be used for system software, temporary storage and space for "swapping out" programs that are not in immediate use, rather than functioning as part of a permanent filing system.

5.3 Ubiquitous computing

The price/performance ratio of small computers has led to their widespread use as personal and small-business systems. Network access for electronic mail and other internet services such as the World Wide Web has become increasingly popular outside professional computing laboratories and offices. The computer companies are now looking to the profits that may be made from widespread ubiquitous computing, home area networks and the like. It is interesting to see the tradeoffs which were explored for the computing workplace revisited for the home environment.

The 'network computer' white paper by Oracle advocates that the end system should be as inexpensive and simple to use as possible. SUN's JavaStation is based on a similar approach; see
http://www.oracle.com/nca/html/nca_wp.html
http://www.sun.com/961029/JES/whitepapers/javastation/javast_ch1.html
Such systems will need a network interface, which will add to their cost. A diskless workstation is proposed, with extra processing and all storage across the network. Environments in which a single application is run continuously can use a simple system of this kind, for example for flight reservation, airport check-in or tax calculation. In the home, an implication of a diskless system is that the user is obliged to transfer all programs and data across the network, and their use can therefore be monitored and charged.

5.4 Models and software architecture

Whatever the physical architecture of the distributed system to be used, we need to establish principles for designing and engineering the distributed software that is to run on it. The following questions raise the basic issues.

1. Model
What are the entities that comprise the distributed system? How do they interoperate?
How is their behaviour specified?

2. Architecture
How are the components named, located and protected?

3. Engineering
How is acceptable performance achieved? Is the fact that the system is distributed transparent to the user or the application programmer?

We defined a system model in Chapter 2. Components might be modules, abstract data types or objects. When the system is operational the components must be able to interwork. We saw one approach to supporting this interworking: a platform or layer of software that lies above heterogeneous operating systems and converts their different interfaces to a uniform one. We shall study the details of such platforms later, and we shall also study the architectural and engineering issues raised above at a later stage. We now set down the fundamental properties of distributed software systems which apply whatever the model and architecture.

5.5 Special characteristics of distributed systems

The distinguishing characteristics of a distributed system may be summarised as follows:

1. Concurrency
The components of a distributed computation may run at the same time.

2. Independent failure modes
The components of a distributed computation, and the network connecting them, may fail independently of each other.

3. No global time
We assume that each component of the system has a local clock, but the clocks might not record the same time. The hardware on which the clocks are based is not guaranteed to run at precisely the same rate at all components of the system, a phenomenon called clock drift.

4. Communications delay
It takes time for the effects of an event at one point in a distributed system to propagate throughout the system.

5. Inconsistent state
Concurrency, failures and communications delay together imply that the view of any state maintained by the distributed computation will not be consistent throughout the system.

We are concerned with property 1, concurrent software design, throughout the book. The details of communications protocols are outside the scope of this book, but it is worth noting that an aspect of their design is to take account of the possibility of failure, property 2. We shall take failure into account whenever we consider a distributed implementation of a particular service or algorithm, for example in distributed filing systems and later in general distributed algorithms. Property 4, potential delay, makes the detection of failure quite difficult; the protocol designer has to distinguish between delay, when components are operating slowly, and failure, when something has crashed. Property 5 is a concern of Part 3 of the book and we postpone an in-depth treatment until then.

For the rest of this chapter we focus on time in distributed systems and discuss how naming schemes, such as those introduced for objects in Chapter 2, can be designed for systems which might be large scale and widely distributed.

5.6 Time in distributed systems

5.6.1 Physical time

We should first note that earth time is established only by convention; there is no universal time. Einstein's special theory of relativity is built on the postulate that the velocity of light is constant for all observers, irrespective of their own velocity; see Figure 5.2.
Figure 5.2: Time beyond the earth

Let us assume that our distributed system is earth-based, so that we can assume some notion of conventional physical time. Standard earth time was originally based on the earth's rotation, in terms of days and years, but accurate values are now derived from atomic caesium clocks. This time is known as Coordinated Universal Time (UTC); the abbreviation is a compromise that matches neither the English nor the French word order. It is possible to buy for a computer a device that can receive a UTC signal, either from a satellite or from a radio station. Examples of satellite services are GPS (the Global Positioning System) and GEOS (the Geostationary Operational Environment Satellite). The accuracy of time received from these services varies with atmospheric conditions but is typically 10 ms for radio broadcasts and 0.1 ms (GEOS) to 1 ms (GPS) for satellite services. It is not feasible to equip every computer with a time receiver, but those that are so equipped may function as time servers for the others.

We can assume that each computer has a programmable timer module based on a quartz crystal oscillator. These devices are programmed to interrupt at some interval, typically 10 ms; see Section 3.2.8. Their accuracy is typically 1 part in 10^6 (a drift of about 1 second every 11.6 days), and they also vary with temperature, i.e. they are subject to 'clock drift'.

We next consider how applications might make use of time. This motivates the need for drifting local clocks to be kept in synchrony with a 'network time server', and we then outline how this may be achieved.

5.6.2 Use of time by distributed processes

The following examples cover possibilities of how time might be used in a range of applications.

1. Resource reservation, such as airline booking. We may have a specification such as: 'if the resource requests of two transactions may each be satisfied, but there are insufficient resources for both, then the transaction with the earliest timestamp wins.'

2. Banking
a) The interest calculation program runs after midnight. All transactions made before midnight are attributed to the previous day.
b) Was a credit made to a certain bank account before money was withdrawn from it?

3. Programming environments
Only those files of a related set with timestamps which indicate they have been edited since the last compilation of the set (for example in a UNIX 'make') will be recompiled.

4. Share dealing
a) Can it be proved that executive X read certain sensitive information before initiating the transaction to buy or sell some shares?
b) The cost of the shares you purchase will be taken to be their value at the time of your transaction.

It should be noted that although we are sometimes concerned with the time at which something happened, we are also concerned with the order of events. Bearing in mind clock drift and communications delay, the order of events in a distributed system can be difficult to establish.

5.6.3 Logical time - event ordering

Figure 5.3 shows three nodes in a distributed system, with a time graph for a process running at each node. Events within a single process are defined to be sequential. Let us first consider their behaviour without relating it to physical time. Process X communicates with process Y (by sending a message, say). We can assert that the send in process X happened before the receive in process Y. Process Y subsequently communicates with process Z. We see that communication imposes a partial ordering on events in the system of processes.
Figure 5.3: Communicating processes in a distributed system

Using < for 'happened before' we can assert:

events in region x1 < events in regions y2 and y3,
events in region x1 < events in region z2,
events in regions y1 and y2 < events in region z2.

For other regions we 'can't say'. This reasoning must impose constraints on any mechanism for achieving a common view of time in a system of communicating processes.

Let us now suppose that the processes each have a local clock which, as we have seen above, may drift. We could use the local clocks to timestamp events within each process, and we could agree a convention for resolving who has won when timestamps are equal, for example by appending the process id to the timestamp. It seems that a total ordering of events in a system is within reach, provided we can ensure that clock drift does not cause the constraints we established above to be violated, for example:
Figure: Message send and receive times

Suppose that process X puts a timestamp tx on the message it sends to process Y: send(m, tx). When process Y receives the message, receive(m, tx), its own clock reads ty. If ty > tx, all is well; if ty <= tx, we have a violation of the event-ordering constraint. Process Y could reset its clock to tx plus one increment and all would be well, except that system time would drift further and further ahead of real time. We would, however, have a total ordering of events in the system.

In the next section we look at how local clocks can be synchronised with standard earth time while maintaining the event-ordering constraints.

5.6.4 Algorithms for clock synchronisation

Algorithms for synchronising with a UTC receiver
First suppose that one computer has a receiver and is functioning as a time server. (At this point you should remember the fundamental properties of distributed systems and ask 'suppose it fails?'.) The other computers (clients) poll the time server periodically (the period depending on the accuracy required) to ask for the time, and the server sends back the time. On receiving the time message from the server, a client may simply use this value, or may adjust it to allow for communications delay, for example by adding the known minimum delay or by adding half the time that has elapsed since the request was sent. If the value from the time server is greater than local time, the local clock can either be set to it or be adjusted to run a little faster for a short time so as to build up gradually to this time. If the value received is less than local time, the local clock is drifting ahead of standard time. We may not put it back, because of the event-ordering constraints, but we can slow it down for a period, for example by lengthening the interval between clock updates from 10 ms to 11 ms while still adding only 10 ms at each update.

This algorithm has a single point of failure: the single time server. For reliability we could have a number of servers, poll them all and use the first reply. For further detail see [Cristian 89].

Algorithms which do not rely on a UTC receiver
If no computer has a receiver, the participants may attempt to arrive at an average value of the time shown by their various clocks. One computer may act as coordinator, ask all the machines for their times, compute the average value from their replies and broadcast this to all of them. They can then adjust their clocks as described above. An operator might set the coordinator's clock manually from time to time. For details see [Gusella and Zatti 89].

Algorithms for large scale synchronisation
The algorithms discussed above rely on all participants contacting a server; such an approach does not scale to a large number of participants. An example of a large scale system is the Internet, in which the Network Time Protocol (NTP) achieves clock synchronisation to an accuracy of a few tens of milliseconds. A common approach to coping with large numbers is to impose a hierarchy on them, and a three-level hierarchy is used by NTP. At the top level are a number of primary servers with UTC receivers. Each of these is responsible for propagating this time down a subtree. At the second level of each subtree are a group of servers which interact with their primary UTC server as described above. Each second-level server then broadcasts its time to the level-three servers for which it is responsible. As always in distributed systems, we should specify the behaviour under failure.
Also, large scale systems are typically subject to reconfiguration, with nodes joining and leaving the system, and this should also be taken into account. These and other refinements of the basic approach are explained in [Mills 91].
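The polling scheme described above can be made concrete with a short sketch. The following Python fragment is a minimal illustration under stated assumptions, not the NTP or Cristian implementation: ask_time_server is a hypothetical callable standing for whatever mechanism delivers the server's UTC reading, and the correction is applied by temporarily altering the clock's rate, never by stepping it backwards, so that the event-ordering constraint of Section 5.6.3 is preserved.

import time

def read_hardware_clock():
    # Stand-in for the quartz crystal timer of Sections 3.2.8 and 5.6.1.
    return time.monotonic()

class SoftwareClock:
    """A local clock that may be speeded up or slowed down, never set back."""
    def __init__(self):
        self._last_tick = read_hardware_clock()
        self._value = time.time()     # initial, possibly drifting, local time
        self._rate = 1.0              # seconds added per second of hardware time

    def now(self):
        tick = read_hardware_clock()
        self._value += (tick - self._last_tick) * self._rate
        self._last_tick = tick
        return self._value

    def adjust_towards(self, target, interval=10.0):
        # Spread the correction over roughly 'interval' seconds: run fast if
        # behind, slow (but never backwards) if ahead.  In this sketch the
        # rate stays in force until the next adjustment; a real implementation
        # would restore the nominal rate once the correction is absorbed.
        error = target - self.now()
        self._rate = max(0.0, 1.0 + error / interval)

def synchronise(clock, ask_time_server):
    # One polling round: measure the round trip, add half of it to the
    # server's reading as an estimate of the one-way delay, then adjust
    # gradually.  'ask_time_server' is assumed, not part of any real library.
    t0 = read_hardware_clock()
    server_time = ask_time_server()
    t1 = read_hardware_clock()
    clock.adjust_towards(server_time + (t1 - t0) / 2.0)

Polling several servers and using the first reply, as suggested above for reliability, would amount to wrapping synchronise in a loop over candidate servers.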
Figure: Outline of the Network Time Protocol (NTP)

5.7 Naming

An operating system maintains a name space for a single system; examples of the objects it supports are processes, files and I/O streams. We saw in Chapter 2 that object-based systems are able to use a unified naming scheme for all objects, and we sketched how this could be defined for a single system. When we consider distributed systems we need to expand name spaces beyond those maintained by a single operating system.

Names are defined in a context. It may be that a distributed system design is based on a homogeneous operating system which supports a distributed object model. Alternatively, we may have to work in a heterogeneous world and devise a naming scheme for the distributed application we wish to build. For example, names may be used only within the context of a distributed file service, mail service, news service or bank account management service. It is useful to model this as a type manager naming objects of that type. In all cases each object named must have a unique name within the context in which it is used. We now consider how uniqueness may be achieved.

5.7.1 Creating unique names

Uniqueness may be achieved through so-called unique identifiers or by using a hierarchical naming scheme.

Unique identifiers (UIDs)
The name space is a specified number of bits, so a UID is a number in the range 0 to 2**N - 1 for an N-bit UID; 32, 64 and 128 bits are typical choices. A UID is never reused, so a given bit pattern either refers to the same object at all times or to no object at all.

Hierarchies
Uniqueness is achievable by using a hierarchy, as well as by using a long bit pattern; for example, puccini.cl.cam.ac.uk is a unique name for a computer. The manager of the domain cl must ensure that puccini is unique in that context; the manager of the domain cam must ensure that the name cl is unique in that context, and so on.

5.7.2 Pure and impure names

A useful distinction is between a pure name, which in itself yields no information, and an impure name [Needham].

Pure names
A pure name yields no information such as the location of the named object, or the context in which the name is to be resolved, i.e. looked up. An example is a UID which is interpreted as a flat bit pattern with no internal structure. All we can do with a pure name is compare it with other names of that type, for example in table lookup. The major problem with pure names is therefore to know where to look them up. It might be that a pure name refers to nothing, but how do we avoid a global search to be sure that no object with that name exists?

Impure names
An impure name yields information and commits the system to maintaining the context in which the name is to be resolved, for example:

puccini.cl.cam.ac.uk (a computer);
jeanb@alexandria.lcs.mit.edu (a registered user);
jmb@cl.cam.ac.uk (a registered user).

We have examples here of location-dependent names. Note that there is no way of telling from the names that the two user names refer to the same person; jmb is unique within the domain cl.cam.ac.uk, jeanb is unique within the domain alexandria.lcs.mit.edu. If Jean Bacon from Cambridge spends some time at MIT she is registered separately in both places and must arrange manually that any interactions with her Cambridge name are forwarded to her MIT name.

host-id, object-id (a form of unique object identifier)

This last example raises an interesting point.
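Before taking up that point, here is a minimal sketch in Python of how identifiers of this form might be generated. The 64-bit layout, a 32-bit host identifier concatenated with a 32-bit per-host counter, is an assumption made for the example rather than a prescribed format, and the host-id value shown is purely illustrative.

import itertools

class UIDGenerator:
    """Issue 64-bit UIDs of the form (host-id, per-host counter).

    The counter is monotonically increasing and never reused on this host.
    A real system would also have to preserve uniqueness across restarts,
    for example by including a boot epoch or keeping the counter in stable
    storage; this sketch ignores that.
    """
    def __init__(self, host_id):
        assert 0 <= host_id < 2**32
        self._host_id = host_id
        self._counter = itertools.count()

    def new_uid(self):
        count = next(self._counter)
        if count >= 2**32:
            raise RuntimeError("counter exhausted: UIDs must never be reused")
        return (self._host_id << 32) | count

generator = UIDGenerator(host_id=0x0A000001)   # illustrative host identifier
uid = generator.new_uid()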
It may be that the algorithm used to ensure the uniqueness of a UID is to append the host-id at which the object was created to a monotonically increasing count at that host. At first sight the object name appears to yield location information. System policy must determine whether the name is to be treated as a pure or an impure name. If pure, the host-id should either be ignored or at most be regarded as a hint about the object's location. If the name is impure, we may expect the creating host to keep a record of where any object has moved to.

The major problems with impure names are object mobility (an object cannot move without changing its name) and the difficulty of restructuring a hierarchical name space.

5.7.3 An example: the Internet Domain Name Service (DNS)

An outline of this familiar name service [Mockapetris 89] will help to introduce the issues of name service provision. In practice, the objects named are computers, mail hosts and domains. Examples of domain names are:

uk
ac.uk
cam.ac.uk

An example of a computer name is puccini.cl.cam.ac.uk; an example of a mail host name is swan.cl.cam.ac.uk. The type of the object named is therefore not apparent from the name itself.

Below a notional root are top-level domain names such as:

com (US companies)
edu (US academic institutions)
gov (US government)
net (network management)
org (organisations)
int (international)
uk (the United Kingdom's root)
fr (the root for France)
... and so on.

Note that the naming scheme has an unfortunate US bias; for example, within the uk domain are the nested domains:

ac.uk (UK academic institutions)
co.uk (companies within the UK domain)

and similarly within all countries other than the US. Although the name space is badly designed, it is impossible for it to be restructured without invalidating huge numbers of existing names, for example for all names ending in edu to be renamed to end in edu.us.

A domain has a manager, but the management role may be delegated to sub-domains. It is the responsibility of the manager of a domain to ensure that unique names are assigned within it.

A directory is a sequence of names, each with an associated list of values (attributes). Users may query DNS to obtain the attributes associated with a given name, for example:

computer name, location? -> IP address
user, mail host? -> list of computer names, ordered by preference

The DNS name space is huge. Originally, a naming database was held centrally and was downloaded into selected hosts periodically; when the scale of the internet made this approach unworkable, DNS was introduced. The domain database is partitioned (as are all large scale naming databases) into directories, and the directories are replicated at suitable locations on the internet.

To resolve a multi-component name such as puccini.cl.cam.ac.uk:

puccini is looked up in the directory for the cl domain,
cl is looked up in cam,
cam is looked up in ac,
ac is looked up in uk.

If these directories were stored on different computers, four computers would have to be contacted to resolve the name. The directories need not necessarily be held at different locations: heavily used directories may be stored and replicated strategically to avoid many computers being contacted to resolve a multi-component name. As we have seen, name resolution could be a lengthy process, so to increase efficiency a resolved name is likely to be cached by your local software. Also, because many users are using DNS simultaneously, there is scope for batching queries and responses.
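The iterative resolution and caching just described can be sketched in a few lines of Python. The directory contents below are invented for illustration (including the address returned for puccini) and do not represent the real DNS protocol or data; the sketch resolves a multi-component name one directory at a time, starting from the top-level domain, and remembers the result so that a repeated lookup contacts no directory at all.

# A toy, in-memory stand-in for the partitioned naming database.
directories = {
    "uk":           {"ac": "directory for ac.uk"},
    "ac.uk":        {"cam": "directory for cam.ac.uk"},
    "cam.ac.uk":    {"cl": "directory for cl.cam.ac.uk"},
    "cl.cam.ac.uk": {"puccini": "128.232.0.1"},   # illustrative IP address
}

cache = {}   # previously resolved names, kept by the local resolver software

def resolve(name):
    """Resolve a multi-component name such as 'puccini.cl.cam.ac.uk'."""
    if name in cache:
        return cache[name]
    labels = name.split(".")       # ['puccini', 'cl', 'cam', 'ac', 'uk']
    domain = labels.pop()          # start at the top-level domain, 'uk'
    value = None
    while labels:
        label = labels.pop()       # next component: 'ac', then 'cam', ...
        value = directories[domain][label]   # one directory contacted per step
        domain = label + "." + domain
    cache[name] = value
    return value

print(resolve("puccini.cl.cam.ac.uk"))   # four directories consulted
print(resolve("puccini.cl.cam.ac.uk"))   # answered from the cache

Replication would be modelled by holding copies of a directory at several servers and choosing any one of them; the cache is what saves repeated navigation in practice.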
Updates are made locally by the domain manager. This makes the local copy of that domain directory up to date, but all other copies then become out of date. Changes are propagated to all copies, which become up to date in due course.

In this example we have seen:

* the specification (syntax) of unique names within the DNS hierarchical name space;
* management of the name space;
* querying the name service;
* partitioning and replicating the naming database because of its large scale;
* name resolution;
* some efficiency tactics.

We now discuss these issues in general terms. Further examples are given in Chapter 22.

5.7.4 Naming, name services and binding

Name services provide clients with values (attributes) associated with named objects.

Name space
A name space is the collection of valid names recognised by a name service, for example:

a 128-bit UID;
a pathname in a filing system, such as /a/b/c/d in UNIX;
a DNS multi-component name, such as puccini.cl.cam.ac.uk.

The structure (syntax) of names must be specified.

Naming (management) domain
A naming domain is a name space for which there exists a single administrative authority for assigning names within it. This authority may *delegate* name assignment for nested sub-domains. For example, we have seen within DNS that the manager of the domain ac.uk may delegate management of the subdomain cam.ac.uk.

Directories
The values (attributes) associated with the names for a domain are held in a directory for that domain. The name service is offered by a number of name servers which hold the naming directories and respond to queries.

Binding or name resolution
This is the process of obtaining a value which allows an object to be used, for example determining the network address of a named computer, or the network addresses of the computers which receive mail (hold the mailbox) for a named individual. For large scale systems the naming database is invariably partitioned, replicated and distributed among the name servers, so resolving a multi-part name is an iterative process which involves navigation among name servers. It is good programming practice to bind late and to embed unresolved names, not addresses, in programs.

5.7.5 Attributes stored by name services

A name service maintains an information database. The fundamental requirement is to hold the locations of various types of object, such as users and resources, so that they can be contacted or used. A number of attributes may be associated with a given name, and a query must specify both the name and which attribute is required. An alternative form of query, which may or may not be supported by a name service, is the attribute-based query: here an attribute is supplied and a list of names with that attribute is returned. This is sometimes referred to as a Yellow Pages service, although a naming service may have the functionality of both the white pages (directories) and yellow pages services of the telephone companies. We shall focus on the white pages, or directory, function.

Name services have typically held the following types of object and associated attributes:

user: login name, a list of mail hosts for that user ordered by preference, authenticator e.g. password (usually held elsewhere)
computer: architecture, operating system, network address, owner
service: list of computers which provide that service
group: list of names of members of the group
alias: name
(directory: list of computers which hold that directory)

It may be that, as an optimisation to avoid several queries, the network addresses of the mail hosts etc.
are stored and returned on the first query. The type of service listed here is likely to be stable, long-lived and widely used, such as a mail service. We shall see later that distributed programs that are short-lived and 'private' will tend to use a special-purpose name service for name-to-location mapping.

A design choice is whether a directory should be treated as just another type of named object or whether directories should form a separate structure. In either case, a directory name will resolve to a list of computer names (plus network addresses) which hold that directory.

It may be argued that making a great deal of information freely available is bad for system security. It could be useful for a potential intruder to be able to look up which OS and which version of a given service are running.

5.8 Mobile users, computers and objects

5.8.1 Mobile computers

The use of small computers such as laptops and palmtops has increased over the years. Developments in technology have made it possible to build laptops with substantial processing power and memory and a certain amount of local storage. Increasingly, their owners expect to be able to attach such computers to a network wherever they happen to be. The requirement is then to be able to use the home environment from anywhere in the world, as well as general services. This requirement raises many questions:

* How should a mobile computer be named so that it can be recognised whenever it attaches to a network?
* How should a user's home environment, especially files in the filing system, be kept secure while still allowing authorised remote access, as described above?
* Which services should be used when the computer system is reattached? There may be an instance of a service that is closer than the one in use before the computer was detached.
* If the user was sharing objects with others before detaching, how should the shared objects be restored to a consistent state when the user reattaches? The potential problem is that reading and writing by the sharers of the objects may have continued without restriction during the detached period.

5.8.2 Mobile users and objects

Another scenario is that the computing environment is fixed but users may move around. At present the user has to take explicit action by making a connection and logging in from a different system. If she has moved outside her home security domain she may have to make special arrangements to be allowed to log in from outside. In the future, the system may provide a higher level of support. For example, a user may wear a locating device such as an active badge. A working environment such as an office or university department will contain a network of sensors so that the badges are detected as their wearers move around. The user's environment may move with her, or she may cause it to transfer to any computer by clicking on her badge at a nearby detector. This can be extended to wide area interworking among a number of geographically separated departments; naming, location and security issues must first be addressed. At present such systems are being investigated in research projects.

5.9 Requirements for security in distributed systems

Network-based systems have additional security requirements over and above those of centralised systems (see Section 2.7). Applications such as home banking and internet shopping have added to the urgency with which the problems are being addressed.
In outline:

* The communications medium is insecure; there is the potential risk of 'wire-tapping' - information in transit may be intercepted and read by unauthorised parties.
* End-to-end consistency of security mechanisms is needed; the source and destination systems for an information transfer may be as secure as required, but the information may pass through intermediate systems, whose degree of security should also be specified.
* Authentication in distributed systems is more complex than in centralised systems, because secret information, such as a password, may need to be transferred across a network. An authentication protocol may establish that 'you are who you say you are'. It may be that an eavesdropper has observed the authentication procedure and steals your password, if it is transferred 'in clear', for subsequent use. More subtly, an eavesdropper might capture what can be inferred to be your encrypted password, or some key which indicates that you have logged in successfully, or a ticket (capability) that allows you to access some object.

Security mechanisms must incorporate checks and safeguards to guard against a wide variety of risks.

5.10 Summary

The aim of this chapter is to make the reader familiar with the special properties of distributed systems at an early stage in the study of software system design. Most of the readers of this book will be using a networked computer with a window interface as their working environment; distribution and concurrency are therefore natural properties of everyday systems.

We have looked at time in distributed systems, first emphasising its fundamentally relativistic nature. Current technology allows the clocks of our networked computers to be synchronised at a reasonably fine granularity, but we should be aware of any assumptions about time and event ordering on which a software design is based. Object naming was introduced in Chapter 2, and here we have shown how naming schemes can be designed to allow for large scale and wide distribution.

In Part 3 we consider multi-component concurrent computations. At that stage we shall cover distributed algorithms in detail, including those involved in maintaining large scale naming systems.