Chapter 4 Design Considerations for QOS The basic, low level mechanisms required to support QOS in an operating system are run time resource allocation, accounting, policing and a means by which applications can determine the QOS which they are currently obtaining. This chapter describes design of an operating system which incorporates these mechanisms. 4.1 Approach While it would be possible to take an existing operating system or kernel and add to it the mechanisms required to support QOS, it is easy in doing so to lose track of the implementation goals; there is the possibility of having to alter the QOS mechanisms to fit them into the system thus obfuscating their exact requirements, intentions and objectives. Once they have been implemented, a large or complex system might also impede accurate measurement of their performance. Additionally, the mechanisms being discussed here are present at the very lowest levels of an operating system, so their use may have repercussions throughout large portions of the system; much of the time spent propagating these side effects could be unrewarding. Depending on the number and complexity of services provided, the design of a complete operating system ab initio can be a large task. Since the scope of the work presented in this dissertation is necessarily limited by time and resources, the following sections will focus on those issues in the design which are pertinent to QOS and support for CM. Nevertheless, it is still possible to design and implement enough basic system functionality to demonstrate the viability of the approach. This motivates the approach taken in this and the following chapters in which the design, implementation and evaluation of an experimental kernel is presented. The 48 kernel is tailored to the provision of mechanisms which provide the necessary system­ level support for QOS guarantees. Current experience with operating system design evinces the advantages of structur­ ing operating systems as a group of client and server processes which are supported by a microkernel. The microkernel provides exactly the functionality required by the clients and servers to interact both among themselves and with the system hardware. This functionality consists of protection between the address spaces of different pro­ cesses, some form of Inter­Process Communication (IPC), resource allocation in the form of a scheduler or dispatcher and an interface between the hardware and those processes which require access to it. These basic services are sufficient to support user level applications and system servers and more complex systems may be devel­ oped from the low level software which provides them. Consequently, in order to prove sufficient for the construction of systems of a reasonable utility and size, the design presented in this chapter will have to provide at least this basic functionality. 4.2 Design Goals A number of supplementary requirements guide the design and are used as ulti­ mate decision criteria to select one from a number of seemingly equivalent design possibilities should the need arise. Firstly, the resources used by system software are unavailable to the user, so one of the goals of any system design ought to be to provide a reasonable environment in which a user's applications may run while maximising the amount of resource which is available to the user. Secondly, at the lower levels, system software ought to provide only the most basic, necessary mechanisms. Policy decisions ought to be left to the users of the resources provided by the system. Thirdly, it is important to keep in mind the environment in which the end system will be operating; this will aid the derivation of a suitable set of design requirements and ensure that neither too much nor too little emphasis is placed on any one aspect of the design. In the case of the design presented here, the environment is a workstation which is connected to other workstations and server machines via a high speed network. This does not mean that the design is to be so myopic that the resulting system is only useful within a workstation; the use of sufficient modularity in the design ought to make the system amenable to distribution and the basic kernel functionality ought also to be usable in other applications such as a dedicated server. Also to be noted is that a server is a relatively controlled environment when 49 compared to a workstation, so that some features, such as memory protection, which are often necessary in a workstation may be dispensed with in a server to obtain better performance. 4.3 Addressing Recently, there has been some renewed interest in the construction of systems which use a Single Address Space (SAS). In the case of systems structured as sets of processes which communicate among themselves and cooperate by sharing data, the ability to refer to some data from different processes by using one address enables certain optimisations to be obtained. In particular, in the case of processes which communicate, large data structures can be passed between processes as arguments and return values by simply passing the address of the data. This obviates the need to convert addresses between address spaces or to copy the data as is done in some systems [Accetta86] [Hildebrand92] and results in a significant performance improvement over such systems. Use of a single address space will affect the design of other parts of the system. 4.3.1 Loading Executables In a Multiple Address Space (MAS) system, program code and data can be bound to their run time addresses before execution because it is known that each program will run at the same virtual address. A similar strategy could be used in a SAS system if addresses are never reused. Whenever a program is linked for execution, it is allocated a range of addresses by the system, and it will be permanently bound to that range. The allocation has to remain valid across system reboots, and the system needs to take great care in remembering where the base of unallocated ad­ dress space is. This approach is less viable in machines with a small (32 bits or less) address space as the size of the address space will limit the lifetime of the system. Recently, a number of microprocessors which have much larger (up to 64 bits) address spaces have been developed[DEC92] [MIPS91]. These large address spaces mean that addresses may not have to be reused in a typical system until it has been running for a very long time; use of addresses at the rate of 1 gigabyte per second would exhaust a 64­bit address space after approximately 500 years. It is important to keep in mind that in such a system, the size of the address space only determines, for a given rate of address consumption, the lifetime of the system after which it has to be restarted from its initial state; it is in many respects, unrelated to the fact that there is only one address space. 50 In a system in which addresses may be reused, the binding of programs to their run time addresses can be left until the programs are executed. At this stage, a system loader can obtain a suitable range of addresses from the system and relocate the program to use these addresses as it is loaded. In a small address space, this can be used in conjunction with some additional code which maintains a map of unallocated addresses to prolong the lifetime of the system. The ability to reassign addresses whenever a program is executed for the first time removes the need to remember the pointer to the base of unused addresses across system reboots. This pointer can be set to the start of the address space at every system initialisation. 4.3.2 Sharing Text Allowing multiple processes to share a single text segment can improve the utilisation of memory, so some consideration has to be given to how this might be done in a SAS system. If all program addresses are resolved just prior to execution, text sharing can be implemented with the aid of a data segment base register dsb. At link time, the static data are coalesced into a single data segment and each datum is allocated an appropriate offset from the base of the segment. When the program is loaded into memory, the system allocates space for the data segment and initialises dsb to point to the base of the segment. Data are referenced from the shared text segment by using these offsets to index off the dsb register. Figure 4.1 shows how two processes which share a text segment access their own versions of the data segment. For the Process 1 Context Process 2 Context load reg,v(dsb) Shared Text Segment Process 2 Data Segment Process 1 Data Segment v v dsb dsb Figure 4.1: Sharing text in a single address space. design presented here, a single address space is chosen to obtain the performance advantages outlined above and to gain some experience with its use. The resulting 51 system can also be used in further work to investigate some of the outstanding issues pertaining to the use of a single address space such as the implementation of more dynamic systems which allow run time binding to shared library modules. 1 4.4 Memory Access Protection Memory access protection in operating systems is used to limit the scope of the accesses which can be performed by a program. Typical operations which can be performed on data by a program are read, write and execute. Appropriate use of protection mechanisms can prevent both accidental and malicious accesses of all of these types. The costs of using such protection are that it requires extra hardware to implement and that a certain amount of time is required to perform the checking of accesses. Because a workstation is intended to support a single user, 2 the use of protection to guard against malicious access is not always necessary and many accidental accesses can be prevented by strict compiler checking and careful programming. Systems have successfully been constructed on these principles [Swinehart86]. The microprocessors used within current workstations usually provide some form of hardware memory protection mechanism which is often integrated with the address translation unit [Kane88]. These typically provide an unprivileged user mode, a supervisor mode in which it is possible to execute privileged instructions and the ability to set read, write and execute permissions on areas of memory, down to the granularity of the hardware page size. Current operating systems use this hardware in essentially two ways. Monolithic architectures [Ritchie74] place the whole of the operating system in su­ pervisor mode. In such systems, communication between system modules requires the cost of a procedure call, the lack of reinforcement of system module boundaries means that their interfaces can become blurred and more code executes in privileged mode than is necessary. Microkernel architectures [Accetta86] [Rozier89] [Hildebrand92] are based on a ker­ nel of reduced functionality which provides a high­level interface to the hardware, support for processes, memory management and some form of IPC which is typically based on message passing and implemented within the kernel. In such systems, the 1 This subject is currently being investigated within the Computer Laboratory by Timothy Roscoe [Roscoe94]. 2 In an environment in which a number of ``workstations'' are interconnected via a LAN, this is not always the case. 52 operating system software is structured as a set of servers which provide system services to client programs. Message passing is a low­level facility, so often a Re­ mote Procedure Call (RPC) mechanism is provided as a convenient programming abstraction. Characteristic of such systems is that the nature of their construction facilitates distribution over networks of processors, but while the boundaries be­ tween system modules are well enforced, communication between modules requires one or two message passing operations. The design attempts to utilise the protection hardware by providing facilities to share data among protection processes which need access to it. The granularity of protection domains is such that a protection domain encompasses a system server or application. Intra­module protection is static and relies on compiler type checking while inter­module protection is dynamic and requires appropriate hardware. Appli­ cations can interact with the system's memory and protection management server to arrange to share a part of their domain of protection with other system entities. The amount of code which executes in supervisor mode is minimised by the provi­ sion of a facility which allows a sufficiently privileged process to request for sections of its code to be executed in privileged mode. This is motivated by a number of factors. Moving between user and supervisor mode is becoming less expensive on modern RISC processors because the hardware does not automatically incur the overhead of saving the processor state, this task being left to hand crafted exception management code which is provided as part of the system supervisor. Conceptually, microkernels contain a small subset of the functionality required of an operating sys­ tem, but in reality this requires a not inconsiderable amount of code to implement. This code executes in supervisor mode and on some architectures (for example the MIPS R3000) necessarily has unlimited access to the whole of the system address space. The ability to request that the processor be placed into privileged mode means that the majority of the code which usually constitutes a microkernel can be executed in user mode at a correspondingly lower privilege level, enabling better use to be made of the available protection hardware. A final factor which motivates this element of the design occurs as a result of having dynamically loadable system entities such as device drivers. Separating the portions of the driver code which absolutely must execute in supervisor mode from the rest improves the chances of being able to check this sensitive code before it is loaded into the system. The long term goal of this is that if sufficiently clever checkers can be constructed, then the system can relax its restrictions about who is allowed to use such checked, privi­ leged code to the extent that, certainly system provided libraries and possibly even user applications, might be afforded the ability to execute short sections of code in privileged or uninterruptable mode. Figure 4.2 shows the effect which such a design has on the structure of an operating system. Most of the functionality provided by a typical microkernel has been moved into a user mode domain; requests to have the 53 Privileged Code Machine Hardware Kernel Domain System Domains Application Domains Figure 4.2: The structure of a system which uses privileged library. processor placed into supervisor mode which can be identified as originating from the privileged kernel or system domains are granted. This identification may be based on a knowledge of the process which is currently running. 4.5 Communication Within the system being designed, communication occurs between two processes, between a process and the system and between the hardware and system processes. These forms of communication will occur frequently, so they ought to cost as little as possible. Many systems fall prey to lack of performance caused by programming generality. Not only do many microkernel systems offer a limited set of message passing primitives, but the cost of message passing between processes on the same machine is much more expensive than a local procedure call. In at least one system, this has resulted in the migration of user mode servers into the kernel and supervisor mode purely for performance reasons [Bricker91]. Such performance gains in themselves are not directly related to the implementation of QOS within a system, but the performance of IPC primitives is of great impor­ tance in evaluating the viability of a design. Microkernel architectures offer the advantages of well enforced modularity at the expense of more costly inter­module interactions. Since IPC will be used frequently in a system, it is important that it be implemented in as efficient a manner as possible. The ultimate goal of doing so is to make it of the same cost as a subroutine call. The aspect of IPC design which will be influenced the most by a QOS implementa­ tion is timeliness. Many current systems provide support for the timely scheduling of processes which communicate via IPC by assigning static priorities to the processes. The scheduling decisions which are based on these priorities are bound implicitly 54 within the IPC mechanism so applications have no means of explicitly expressing their temporal requirements or of controlling the times at which scheduling decisions are made. The design should incorporate sufficient flexibility to enable applications to express temporal requirements explicitly and allow them control over when then scheduling decisions which result from IPC operations are to occur. 4.5.1 Components of Communication Communication between processes consists of two independent actions: data trans­ fer and synchronisation. In an implementation which is based on message passing, these two actions are combined in the message passing primitives provided by the system. The manner in which data is transferred is determined by the maximum message size and the amount of control given to a process over the type of synchro­ nisation desired and varies depending on whether the primitives are synchronous or asynchronous. Combining these actions into the one message abstraction incurs unnecessary overhead when it is desired to use only one of them independently. The use of messages containing no data for synchronisation and of messages to request the current value of some data between processes which are executing on the same machine are examples of this. The desire to provide support for CM streams within the design provides additional motivation for separating the synchronisation and data transfer aspects of commu­ nication. Applications which need to synchronise their actions with CM streams often do not need access to the stream data itself. This factor guided the design of the stream agents presented in [Sreenan92] which can be used to filter a stream of CM data and present to an application a sequence of synchronisation events. Recent advances in the design of workstation architectures emphasise this separa­ tion. [Hayter93] presents the Desk Area Network (DAN), a workstation architecture which is well suited to handling CM streams. DAN replaces the traditional back­ plane bus with an ATM switch, which is used to interconnect devices such as the processor, frame store and LAN interface. [Pratt92] presents the ``ATM camera'' which converts an analogue video signal into a stream of ATM cells and connects directly to the switch in the DAN. The switch can be programmed to direct a CM stream from the camera to the processor if the video data is to be delivered to an application. If the video data is only to be displayed, the switch can be configured to route the video data stream directly to the frame store, bypassing the processor. A later version of the camera [Pratt93] produces in addition to the data stream, a control stream containing synchronisation and timing information. Within the DAN, separation of the data and synchronisation components of a video stream en­ able the data stream to be routed directly to the frame store and the control stream 55 to be delivered to an application. It can be concluded that, while message passing is a convenient abstraction, useful for building distributed systems, it can be constructed from more basic primitives which provide a lower level of abstraction. The separation of these primitives into data transfer and synchronisation operations makes them better suited to a number of applications such as the handling of CM streams. It also enables certain optimi­ sations to be made when communication is between entities which are running on the same machine. 4.5.1.1 Sharing Data Data can be shared between two processes provided that they can agree which data to share; the data can be protected according to how much each of the processes trusts the other and they can synchronise accesses to the data. An example of the usefulness of data sharing can be found in the information about itself which a process is able to know and which usually resides within the operating system; the current time and accumulated processor usage are examples of such information. When this is resident in the operating system a system call is required to obtain it. If the operating system made this information available to the process, it could be read directly, thus avoiding the overhead of a system call. The mechanics of such operations should be hidden behind suitable interfaces provided by system libraries so that application portability is not compromised. If the system trusts a process, the process can be given open access to the data used by the system. If the process is not trusted, memory management hardware can be programmed to allow the process only the ability to read the data and not write it. Data consistency can be provided by the use of atomic updates or some more general synchronisation primitives as might be provided by the system. The type of synchronisation best suited to a particular instance is dependent upon the type of data being shared and not all synchronisation requirements will need to use the mechanisms provided by the system. The agreement between processes as to what data are to be shared must therefore also include how it is to be shared. 4.5.1.2 Synchronisation Synchronisation marks points in the execution of processes where scheduling deci­ sions have to be made. In a message passing system, a scheduling decision has to be made after a message is sent. A blocking send primitive necessarily suspends the sending process, surrendering the processor, until a reply is received. A more flexible system allows the sender to return, still in possession of the processor. A synchronous 56 send can be implemented using asynchronous primitives by atomically implement­ ing a send then a receive. The synchronisation primitives designed ought to provide at least the functionality of the synchronisation provided by asynchronous message passing implementations in that after a synchronisation point has been reached, a process is given the choice of whether or not it wants to retain possession of the processor. 4.5.2 Events The synchronisation aspect of communication can be modelled as the signalling of the occurrence of an event by one process to another. Figure 4.3 depicts the phases through which the path of execution of a processor passes when processing a hypothetical event. The event is said to arrive when the system first becomes aware Record Deliver Schedule Time (t) 0 1 2 3 4 5 6 7 8 9 10 System Event Arrival Event Response Schedule P 1 P 2 11 12 Figure 4.3: Arrival, processing and delivery of an event. of the event's occurrence (time t = 1); in the case of the example, the event may be generated by a process or as the result of some external hardware interrupt. At the time of arrival, P 2 is running and P 1 is the event's target. In response to an arrival, the system must at the very least make a logical record of the arrival so that it is not forgotten. Some time later, (time t = 3), this record is examined, the event is delivered to its target, and a scheduling decision is made. Delivery of an event does not necessarily mean that the target process is given the processor next; this decision is made on other information and in the case of the figure, after delivery of the event to P 1 , P 2 is given the processor next. Eventually, another scheduling decision is made, and P 1 gets to run, at which stage, it becomes aware of the event delivery and may choose to respond to it whenever suitable. The model can be applied with equal success to both hardware generated events and their software 57 equivalents. The ability to postpone delivery and scheduling after an event has arrived enables the implementation of critical sections. P 2 can implement a critical section by setting a flag upon entering it. This flag indicates to the event handler that the current process is not to lose the processor until it has left the critical section. The event handler, upon seeing this flag set, simply records the event and resumes P 2 . Upon leaving the critical section, P 2 clears the flag and then allows any pending events which may have arrived while it was in the critical section to be processed. In the case of interrupts, event masking, recording and delivery are implemented by the machine hardware. The scheduling decisions made after event delivery (at t = 4 and t = 7) imply that the model distinguishes between the logical value of any event (i.e. the record that an event has happened) and the time at which the response to the event occurs. In cases where the scheduler is provided with some information concerning the temporal requirements of an event, it can use this information to schedule processes so that these requirements are met. The ability to make such scheduling decisions extends to any user level schedulers which might exist within a process; in the example of the figure, even though P 1 is running at t = 9, it postpones the response to the event until t = 10, because it has more urgent work to do beforehand. 4.5.3 Design of the Event Mechanism The design includes a mechanism called an event channel which is a unidirectional path through which one process can communicate the occurrence of events to an­ other. Within its own domain of execution, a process is free to accumulate records of events as it pleases, and doing so does not have any direct effect on the scheduling of the process. When it decides to, a process may communicate the occurrences of events n i on channels e i by executing the call signal(vec), where vec is a vector of (e i ; n i ) pairs. After doing this, the events have arrived into the system. For each event channel e i , the system determines the end point of the channel. This locates an integer counter within the protection domain of the target process which the system atomically increments by n i ; this increment records the delivery of the n i oc­ currences of the event. A scheduling decision is then made. When the target process is next executed, it is informed of the arrival of events and may respond to them as it sees fit. The use of the vector argument vec to the signal call allows signalling on a single event channel or atomic signalling on a number of event channels. The latter provides support for a type of multicast signalling facility. A process which receives events is expected to maintain a record of which events it 58 has processed and which are still pending. This can be done by reading the value of the event to determine the total number of events which have been delivered (d, say) and comparing this with a locally maintained count of the number of events to which a response has been generated (r, say). At any time, two invariants hold: (1) The number of pending events is d \Gamma r and; (2) d \Gamma r – 0. The presentation of events has so far assumed that event channels are in existence when they are used; a means is required whereby event channels can be created. The basic problem to be solved in this instance is that before a process P 1 , say, is allowed to send events to another process P 2 , say, P 2 must agree to accept events from P 1 . The event mechanism is designed to be usable for communications between all of the active entities in a system, including communications between a process and the system itself. For this reason, when a process is created, it is provided with at least the means to communicate with any necessary system entities. An optimisation is to provide processes with communications channels to all of the frequently required system servers. 3 A process P 1 can use these established channels to request a channel for sending events to another process P 2 by executing the system call request(P2), which re­ turns to P 1 a channel identifier e. At this stage, any attempt to signal events on e will result in an error condition. The request call causes P 2 to receive via its existing interface with the system an asynchronous request for a connection e from P 1 which is identified uniquely by the pair (P 1 ; e). P 2 can decide either to accept or to reject the request. Rejection causes an error return in P 1 . Acceptance is indi­ cated by calling accept(P1, e, &v) which informs the system that P 2 is willing to receive events from P 1 on channel e, and that the event deliveries are to be recorded in the variable v, whose location is given by &v. This causes the state of the channel to be updated so that events may be signalled on it and also causes P 1 to be notified that the channel may now be used. An event channel may be closed by the process at either end; this causes the process at the other end of the channel to be notified of the closure and any further signals to return an error status. 3 This is the equivalent of the provision of unix processes with the standard I/O connections stdin, stdout and stderr when they are created. 59 4.5.4 IPC from Primitives Shared memory and events can be used to implement efficient IPC within the same machine. A unidirectional communications path can be established from process P 1 to process P 2 using an event and a shared memory segment with protection which allows P 1 permission to read and write and P 2 to read. P 1 can marshal one or more sets of arguments directly into the shared memory segment then signal the appropriate count of events to P 2 . Operating in conjunction with a similar structure operating in the reverse direction, this implements a bidirectional communications path. 4.5.5 Hardware Interface The IPC abstraction provided by the system can be used to interface user mode processes to hardware devices. Device driver processes are assumed to be privileged in that they are allowed access to device registers either by mapping them into their protection domain or via the execution of privileged code. The interface between a device driver and the system then consists of the device registers (and buffer memory if present) and an event which is signalled by the system as the result of a device interrupt. In cases where good performance is required, the device driver can export to other processes an IPC interface in which the shared memory segment maps some onboard device buffers, so that the processes marshal their arguments directly into the device buffers, avoiding unnecessary copying. The ability to do this effectively is determined by the granularity of protection af­ forded by the system memorymanagement unit. In the majority of current processor designs, the control of protection and cacheing is associated with virtual memory page size. Device drivers however, can utilise a granularity which allows them to protect down to the size of the device registers and consideration should be given to this in the design of the system hardware. 4.6 Scheduling Scheduling occurs at a number of levels, the QOS manager makes long term schedul­ ing decisions such as whether or not to admit a new process or withdraw resources from one process and give them to another, the system RTRA is responsible for presenting resources to processes in a timely manner and within the processes which use them, user­level thread schedulers are responsible for allocating resources to the 60 application threads. The mechanisms required to implement this resource manage­ ment strategy need to perform run time resource allocation, accounting and policing; the implementation of each of these requires some knowledge of the current time. 4.6.1 Time To allocate resources in a timely fashion, the notion of time has to be ubiquitous throughout the system. Arithmetic operations on temporal quantities will be fre­ quent, so a sensible representation of time can limit the amount of resources required to effect these calculations. The resolution of the clock ought to be sufficient to mea­ sure the time taken to execute some code on a processor to within a few instructions for the purposes of accurate profiling. Within the design, all times are represented as 64­bit integers which record the number of nanoseconds having elapsed since some well known epoch and can distin­ guish events which are less than 2 64 nanoseconds or approximately 580 years apart. Maintaining times as a single integer reduces the number of instructions required for temporal arithmetic over representations which use counts of seconds and decimal fractions of seconds. The system clock consists of two registers; the current register stores the time since the epoch and the alarm register stores the time at which the next clock interrupt is to be generated. A system clock module uses this (ideally hardware) clock device to implement a timeout queue which provides equivalent virtual clock devices for all processes. Processes may request that the clock module signal the occurrence of an alarm event some time into the future by programming their virtual alarm clocks. All processes are given direct, read only access to the current register to provide them with a cheap, accurate value of the current time. 4.6.2 Process Classes and States The design distinguishes between the various states in which a process exists by maintaining a number of queues. Q c contains all of the processes which hold con­ tracts awarded by the QOS manager and are runnable. When a contracted process has used all of its allocated resources and requires more, it is placed in Qw until more resources become available for it. Processes which have indicated that they wish to wait for some event are placed into Q b , the queue of blocked processes and best effort processes are placed into Q f . The most suitable algorithm for determining which process ought to get the pro­ 61 cessor next will be the result of considerable experience in the everyday use of the system, so a simple dispatching policy is described to demonstrate the use of the mechanisms provided. Whenever the process dispatcher is called to give the proces­ sor to the most eligible process, it chooses a process from Q c ; this aims to satisfy the requirements of the contracted processes before any other processes. If this queue is empty, a process is chosen from Qw , the aim being to give contracted processes the chance to make use of any resources which become fortuitously available after they have already received their contracted quota. If Qw is empty, the processor is offered to the best effort processes in Q f . This algorithm will be used in the system presented in this dissertation, but further work is needed to refine it. 4 For the two classes of processes (the contracted processes in Q c and Qw and the best effort processes in Q f ), the RTRA implements a dispatching policy. The best effort processes may be scheduled using some form of multiple level feedback queue. Recall from section 3.8 that the RTRA dispatching policy has a large impact on the type of contracts which it is possible to offer to processes, and that the designer is under no obligation to choose one particular policy over another, provided that the QOS manager is aware of the policy which is being used so that it can perform admission control and that processes are made aware of the nature of the contract which is being offered to them when they are admitted. Section 2.1.1, however, motivates the use of dispatching policies which take into account the temporal requirements of the contracted processes on the grounds that doing so will enable the system to meet more deadlines with a given amount of physical resource than otherwise would be the case. A simple static priority scheme could be used in conjunction with the RM algo­ rithm to effect some temporal ordering in which the processor is shared amongst the contracted processes. Static priority assignments have a number of disadvantages. Firstly, they afford a limited flexibility in expressing temporal requirements; static assignments may need to be recalculated in a dynamic system when new processes are given contracts. Secondly, the processes may be running multiple threads, the temporal requirements of which require more than a single priority to express. The design instead uses deadlines in conjunction with the EDF algorithm to sequence the execution of the processes. Each process specifies its deadline (in the case of multiply threaded processes, this will be that of the thread whose deadline is ear­ liest) to the RTRA, which uses this information to order the queue of contracted processes such that the process with the earliest deadline is at the head of the queue. When the processor is given to a process, the process is informed of the next earliest 4 For example, the algorithm presented does not make use of the fact that the response times of the best effort processes in Q f can be reduced by delaying execution of the contracted processes in Q c for as long as possible while not allowing them to miss their deadlines [Tindell93]. 62 deadline in the system; a process scheduler then knows that it can execute all of its threads whose deadlines are earlier than this before having to give up the processor. This allows processes which want to cooperate to obtain a high utilisation of system resources. There are times during the execution of a process when it can do no useful work even if it is given the processor; a process which is processing a video stream has no requirement for the processor until the frame of video for which it is waiting has arrived. A process may also detect that it does not require all of the processor time which has been allocated to it. In these circumstances, a user level scheduler may idle and waste processor cycles. The policing mechanism will prevent this from affecting other contracts, but best effort processes may benefit from being able to use these otherwise wasted cycles. The design allows processes to call into the kernel when they want to give up the processor until an event occurs. This causes the process to be removed from one of the runnable queues and placed into Q b until any event occurs. Arrival of an event when a process is in this state causes the process to be rescheduled into an appropriate runnable queue. Allocation of processor cycles is important to an application, but applications which block may not always want to be informed of it. The mechanism allows processes to specify whether or not they wish to be informed of processor allocation while they are waiting for an event to occur. Having used all of its resources, a process is removed from Q c and placed into Qw to wait for its next allocation of processor bandwidth. Should additional resources become available before then, a process can be selected from Qw and given those resources. This allows processes which are able to make use of additional resources to do so. 4.6.3 Accounting The design incorporates a basic accounting mechanism called a timer which records the accumulated time for which it has been running in nanoseconds. A number of primitives are provided for operating on timers: timer reset(t) initialises the timer to zero; timer start(t) starts the timer accumulating elapsed time; timer stop(t) freezes the timer and; timer read(t) returns the time accumulated by the timer. The RTRA maintains, for each process, a timer which records the processor time accumulated by that process. 63 4.6.4 Policing A policing mechanism is required to ensure that contracted processes use no more cycles than they have been allocated by the QOS manager. The design incorporates a simple policing scheme which assumes that, if a process has been allocated a processor bandwidth of (C; T ), then the process is eligible for C units of processor time every T units of real­time. For every process, a timer t a is maintained which is reset to zero at the start of each period T and records the processor time accumulated by the process during that period. Whenever the process loses the processor, either because it is preempted by another process, or because the alarm clock signals that it has no more processor time left, t a is stopped. The contracted processor time which is remaining for that process is calculated as t r = C \Gamma t a . If t r = 0, the process has no more processor time left and it is removed from the runnable queue and forced to wait until the start of its next period when it is allocated another C units of processor time. Whenever the process is about to be given the processor, the system alarm clock is set to interrupt t r into the future, timer t a is started, and the processor is given to the process. 4.7 Virtual Processor Interface The resource allocation strategies used to provide applications with various qualities of service sometimes give the applications fewer resources than they require. Sec­ tion 3.11 describes some ways in which applications can degrade their QOS when resources are scarce. Applications can do this more easily if they are informed of the availability or otherwise of resources. This is effected within the design by having the system present applications with this information through the VPI. The inter­ face includes a notification mechanism which informs applications when resources have been allocated, when they are available and when additional resources have been made available. The same mechanism can be used to inform processes of the delivery of events such as timer and communication events. The purpose of the notification mechanism is to divert the path of execution of a process into the user level scheduler, which may make the decision of what to do based on the current resource availability. If the thread which was running when the notification was delivered is to be suspended and another thread run in its stead, some means of obtaining the thread's saved context from the system is required as well as a means of resuming a new thread. Notification will occur asynchronously with respect to the execution of a process, so some means is required of synchronising the delivery of such information with access to critical sections. 64 4.7.1 Activation To notify a process of the occurrence of an event, the distinction is made between process activation and process resumption. Usually, when the processor is given back to a process, execution is resumed from the previously saved state and pro­ cesses execute unaware that they lose the processor and regain it from time to time. activation allows a process to specify to the system the address of some code which is where it would like execution to begin whenever it is given the processor; this is typically the entry point to some user level thread scheduling code. EVENT Status Register Machine Context Activation Address NEW AIP Event Array Deadline Received Acknowledged ALLOC PRMPT EXTRA Received Acknowledged Figure 4.4: Activation section of the virtual processor interface. Figure 4.4 shows the section of the VPI which implements activations; this consists of an activation address, a status register, and space for a saved machine context. Table 4.1 describes the functions of the bits in the status and control registers. When a process is created, its activation handler is set to be the address of the Bit Meaning ALLOC Processor time has been allocated PRMPT Process has been preempted EXTRA Extra processor time is available EVENT An event has been signalled NEW Events occurred during activation AIP An activation is in progress Table 4.1: Definitions of bits in the virtual processor interface registers. process's initialisation code. As part of its initialisation, a process installs into the activation address register, the address of some activation handler code. When­ ever a process loses the processor, its context is saved into the machine context field. Whenever the processor is given to the process, the status register is updated to re­ 65 flect the reason for activation and the process is activated by forcing its execution to start at the address specified in the activation handler. Figure 4.5 shows the t 0 t 1 t 2 t 3 t 4 t 5 ALLOC PRMPT EXTRA C 1 C 2 C e T Figure 4.5: Execution of a process showing activation points and reasons. timing diagram of a process which has a reserved processor bandwidth of (C; T ) and is preempted at time t 2 , so that it receives the bandwidth in two portions C 1 and C 2 such that C 1 +C 2 = C. At time t 0 , the process is allocated C processor cycles and is eligible to run but must wait for a higher priority process. At t 1 the process is given the processor, and because it has just been allocated more resources, the ALLOC bit is set in the status register prior to activation. At t 2 the process is preempted and at t 3 the process is again given the processor. No resources have been allocated since the last activation but the process still has C 2 cycles available, so the PRMPT bit is set in the status register to reflect this. By t 4 the process has used all of its processor allocation, but the system is otherwise idle and decides to give the process additional cycles so the EXTRA bit is set in the status register is set and the process is activated. During the execution of the activation handler, the AIP bit in the status register is set, causing any further activations to be queued. The activation handler uses the status register bits EVENT, EXTRA, PRMPT and ALLOC to determine its course of action. If this involves rescheduling threads, the state of the thread which was running when the process last lost the processor is available in the machine context area. Before continuing with the execution of the process, the activation handler must enable activations by clearing AIP then check the NEW bit in the status register to see if other events have occurred during the execution of the handler. If so, the handler must call back into the system to give it the opportunity to present any pending events. Critical sections can be implemented using a global, boolean variable which records entry into any such sections. Before entry to a critical section, a thread sets the variable to TRUE, should the process be activated while in a critical section, the thread scheduling code in the activation handler can inspect this variable to determine what to do with the thread. Possible actions are to resume the thread until it leaves the critical section, or to suspend it and run another thread which is known not to be in conflict with it. Another means of implementing critical sections uses two activation handlers, the usual handler which contains a thread scheduler and an alternate handler which always resumes the current thread. Entry to a critical section can be 66 effected by installing the alternate handler, which makes any activations invisible during the critical section. 4.8 Summary This chapter has presented the design of a kernel which incorporates mechanisms providing support for QOS within the operating system. The design includes an efficient IPC mechanism which makes use of a single address space, shared memory and means of signalling events between processes. The separation of data transfer and synchronisation aspects within the IPC mechanism make it better suited to some CM applications such as remote synchronisation with a stream of CM data than message based IPC mechanisms. The design also incorporates accounting and policing mechanisms, which are required if QOS contracts are to be maintained. It also augments the virtual processor interface which is seen by applications that applications can be informed of the availability of resources within the system over time. 67