The purpose of an operating system is to multiplex shared resources between applications. Traditional operating systems have presented physical resources to applications by virtualisation, e.g. UNIX applications run in virtual time on a virtual processor - most are unaware of the passage of real time and that they often do not receive the CPU for prolonged periods. The operating system proffers the illusion that they are exclusive users of the machine.
Multimedia applications tend to be sensitive to the passage of real time. They need to know when they will have access to a shared resource and for how long. In the past, it has been considered sufficient to implement little more than access-control on physical resources. It is becoming increasingly important to account, schedule and police shared resources so as to provide some form of QoS guarantee.
Whilst it is necessary to provide the mechanisms for multiplexing resources, it is important that the amount of policy hard-wired into the operating system kernel is kept to an absolute minimum. That is, applications should be free to make use of system-provided resources in the manner which is most appropriate. At the highest level, a user may wish to impose a globally consistent policy, but in the Nemesis model this is the job of a QoS-Manager agent acting on the user's behalf and under the user's direction. This is analogous to the use of a window manager to allow the user to control the decoration, size and layout of windows on the screen, but which does not otherwise constrain the behaviour of each application.
Figure 1.1: The Structure of a Nemesis System
Nemesis was designed to provide QoS guarantees to applications. In a microkernel environment, an application is typically implemented by a number of processes, most of which are servers performing work on behalf of more than one client. This leads to enormous difficulty in accounting resource usage to the application. The guiding principle in the design of Nemesis was to structure the system in such a way that the vast majority of functionality comprising an application could execute in a single process, or domain. As mentioned previously, this leads to a vertically-structured operating system (figure 1.1).
The Nemesis kernel consists of a scheduler (one version was less than 250 instructions) and a small amount of code known as the NTSC, used for Inter-Domain Communication (IDC) and to interact with the scheduler. The kernel also includes the minimum code necessary to initialise the processor immediately after booting and handle processor exceptions, memory faults, unaligned accesses, TLB misses and all other low-level processor features. The Nemesis kernel bears a striking resemblance to the original concept of an operating system kernel or nucleus expressed in [Brinch-Hansen 70].
The kernel demultiplexes hardware interrupts to the stage where a device specific first-level interrupt handler may be invoked. First-level interrupt handlers consist of small stubs which may be registered by device-drivers. These stubs are entered with all interrupts disabled and with the minimal number of registers saved and usually do little more than send notification to the appropriate device driver.
The term domain is used within Nemesis to refer to an executing program and can be thought of as analogous to a UNIX process - i.e. a domain encapsulates the execution state of a Nemesis application. Each domain has an associated scheduling domain (determining CPU time allocation) and protection domain (determining access rights to regions of the virtual address space).
Nemesis is a SAS operating system i.e. any accessible region of physical memory appears at the same virtual address in each protection domain. Access rights to a region of memory, however, need not be the same. The use of a single address space allows the use of pointers in shared data-structures and facilitates rich sharing of both program text and data leading to significant reduction in overall system size.
All operating system interfaces are written using an IDL known as MIDDL which provides a platform and language independent specification. Modules, with interfaces written in MIDDL, are used to support the single address space and allow operating system code to be migrated into the application. These techniques are discussed in detail in [Roscoe 95b].
Table 1.1: Alpha NTSC Call Interface.
The NTSC interface may be divided into two sections - those calls which may be invoked by any domain and those which may only be invoked by a privileged domain. Unprivileged calls are used for interaction with the kernel scheduler and to send events to other domains. Privileged calls are used to affect the processor mode and interrupt priority level and to register first-level interrupt stubs. As an example, table 1.1 lists the major NTSC calls for the Alpha architecture.
The NTSC interacts with domains and the kernel scheduler via a per-domain area of shared memory known as the DCB. Portions of the DCB are mapped read-write into the domain's address-space, whilst others are mapped read-only to prevent modification of privileged state. The read-only DCB contains scheduling and accounting information used by the kernel, the domain's privilege level, read-only data structures used for implementing IDC channels and miscellaneous other information. The read-write section of the DCB contains an array of processor-context save slots and user-writable portions of the IDC channel data structures.
Nemesis presents the processor to domains via the Virtual Processor Interface (VP.if). This interface specifies a platform independent abstraction for managing the saving and restoring of CPU context, losing and regaining the real processor and communicating with other domains. It does not however attempt to hide the multiplexing of the underlying processor(s). The virtual processor interface is implemented over the NTSC calls described in section 1.2.2.
When a domain is handed the processor, it is informed whether it is currently running on guaranteed time, or merely being offered use of some of the slack-time in the system. QoS-aware applications must take account of this before deciding to adapt to apparent changes in system load. This may be used to prevent QoS feedback mechanisms from reacting to transient improvements in resource availability.
When a virtual processor loses the CPU, its context must be saved. The DCB contains an array of context-save slots for this purpose. Two indices into this array specify the slots to use when in activation mode and when in normal mode, based on the current state of the activation flag.
When a domain is preempted it will usually be executing a user-level thread. The context of this thread is stored in the save slot of the DCB and may be reloaded by the activation handler of the domain when it is next upcalled. If a domain is preempted whilst in activation mode, the processor context is saved in the resume slot and restored transparently when the domain regains the CPU rather than the usual upcall.
The only means of communication directly provided by the Nemesis kernel is the event. Each domain has a number of channel-endpoints which may be used either to transmit or to receive events. A pair of endpoints may be connected by a third party known as the Binder, to provide an asynchronous simplex communications channel.
This channel may be used to transmit a single 64-bit value between two domains. The event mechanism was designed purely as a synchronisation mechanism for shared memory communication, although several simple protocols have been implemented which require nothing more than the event channel itself, e.g. the TAP protocol, described in [Black 94], used for start-of-day communication with the binder. Unlike message-passing systems such as Mach [Accetta 86] or Chorus [Rozier 90], the kernel is not involved in the transfer of bulk data between two domains.
Nemesis also separates the act of sending an event and that of losing the processor. Domains may exploit this feature to send a number of events before being preempted or voluntarily relinquishing the CPU. For bulk data transports such as the Rbufs mechanism, described in section 1.4.3, pipelined execution is usually desirable and the overheads of repeatedly blocking and unblocking a domain may be avoided. For more latency-sensitive client-server style communication a domain may choose to cause a reschedule immediately in order to give the server domain a chance to execute.
Various forms of IDC have been implemented on top of the Nemesis event mechanism. Some of the most commonly used are described below.
Since Nemesis domains share a single address space, the use of shared memory for communication is relatively straightforward. Data structures containing pointers are globally valid and the only further requirement is to provide some synchronisation mechanism to allow the data structure to be updated atomically and to prevent readers from seeing an inconsistent state. Very lightweight locking primitives may easily be built on top of the kernel-provided event mechanism.
Same-machine RPC [Birrell 84] is widely used within Nemesis. Although the majority of operating system functionality is implemented within the application, there are many out-of-band operations which require interaction with a server in order to update some shared state.
When a server wishes to offer a service to other domains, it locates an IDC transport module and requests a new offer (step 1 in figure 1.2) from it. It then places it into the Object Table of the server domain (2). The object table maintains a mapping between offers and interfaces.
Figure 1.2: Offering an RPC service
Before RPC can take place, the offer must be given to the client by the server. Typically, the server will place an offer into a traded namespace shared between all domains on the system and the client will retrieve it from there but it could just be passed directly.
When the client obtains the offer, it invokes a bind operation on the offer (step 3 in figure 1.2, step 1 in figure 1.3). This causes the offered IDC service to be added to the client domain's object table. A third party domain, the binder, then establishes the event channels necessary for invocations between the client and the server. The binder domain also causes the connection detailed to be looked up in to the object table of the server and thus the server side connection state to be established.
Figure 1.3: Binding to an RPC service
The default RPC transport is based on an area of shared memory and a pair of event channels between the client and server domains. To make an invocation (step 1 in figure 1.4), the client marshalls an identifier for the call and the invocation arguments into the shared memory (step 2) and sends an event to the server domain (step 3). The server domain receives the event, unmarshalls the arguments (step 4) and performs the required operation (step 5). The results of the call, or any exception raised are then marshalled into the shared memory (step 6) and an event sent back to the client (step 7). The client unmarshalls the result (step 8) and returns from the client stub. Marshalling code and the client and server stubs are generated automatically from the MIDDL interface definition and loaded as shared libraries.
Figure 1.4: Invoking an RPC service
The average cost of a user-thread to user-thread null-RPC between two Nemesis domains, using the default machine-generated stubs and the standard user-level threads package, was measured at just over on the Sandpiper platform [Roscoe 95b].
It is worth noting that the above complexity is largely hidden by the use of standard macros. Typical code on the server side looks like:
ServerType server; IDC_EXPORT("svc>myservice", ServerType, server);
And, on the client side:
ServerType server; server = IDC_OPEN("svc>myservice", ServerType); result = Server$Method(server, args);
Figure 1.5: High Volume I/O Using Rbufs
The transport mechanism is once again implemented using Nemesis event-channels and shared memory. Three areas of shared memory are required as shown in figure 1.5. One contains the data to be transferred and the other two are used as FIFOs to transmit packet descriptors between the source and sink. The head and tail pointers of these FIFOs are communicated by Nemesis event-channels.
Packets comprised of one or more fragments in a large pool of shared memory are described by a sequence of (base, length) pairs known as iorecs. Figure 1.5a shows iorecs describing two packets, one with two fragments and the other with only a single fragment. Rbufs are highly suited to protocol processing operations since they allow simple addition or removal of both headers and trailers and facilitate segmentation and reassembly operations.
In receive mode, the sink sends iorecs describing empty buffer space to the source, which fills the buffers and updates the iorecs accordingly before returning them to the sink. In transmit mode, the situation is the converse. The closed loop nature of communication provides back-pressure and feedback to both ends of the connection when there is a disparity between the rates of progress of the source and sink.
The intended mode of operation relies on the ability to pipeline the processing of data in order to amortise the context-switch overheads across a large number of packets. Sending a packet on an Rbufs connection does not usually cause a domain to lose the CPU. Figure 1.6 shows the MIDDL interface type for the Rbufs transport.
Figure 1.6: MIDDL interface for Rbufs (IO.if)
Scheduling can be viewed as the process of multiplexing the CPU resource between computational tasks. The schedulable entity of an operating system often places constraints both on the scheduling algorithms which may be employed and the functionality provided to the application.
The recent gain in popularity of multi-threaded programming due to languages such as Modula-3 [Nelson 91] has led many operating system designers to provide kernel-level thread support mechanisms [Accetta 86, Rozier 90]. The kernel therefore schedules threads rather than processes. Whilst this reduces the functionality required in applications and usually results in more efficient processor context-switches, the necessary thread scheduling policy decisions must also be migrated into the kernel. As pointed out in [Barham 96], this is highly undesirable.
Attempts to allow applications to communicate thread scheduling policy to the kernel scheduler [Coulson 93, Coulson 95] lead to increased complexity of the kernel and the possibility for uncooperative applications to misrepresent their needs to the operating system and thereby gain an unfair share of system resources. For example, in the above systems user processes are required to communicate the earliest deadline of any of their threads to the kernel thread scheduler.
Nemesis allows domains to employ a split-level scheduling regime with the multiplexing mechanisms being implemented at a low level by the kernel and the application-specific policy decisions being taken at user-level within the application itself. Note that the operating system only multiplexes the CPU resource once. Most application domains make use of a threads package to control the internal distribution of CPU resource between a number of cooperating threads of execution.
Inter-process scheduling in Nemesis is performed by the kernel scheduler. This scheduler is responsible for controlling the exact proportions of bulk processor bandwidth allocated to each domain according to a set of QoS parameters in the DCB. Processor bandwidth requirements are specified using a tuple of the form (p, s, x, l) with the following meaning:
The p and s parameters may be used both to control the amount of processor bandwidth and the smoothness with which it is provided. The latency hint parameter is used to provide the scheduler with an idea as to how soon the domain should be rescheduled after unblocking.
The kernel scheduler interacts with the event mechanism allowing domains to block until they next receive an event, possibly with a timeout. When a domain blocks it loses any remaining CPU allocation for its current period - it is therefore in the best interest of a domain to complete as much work as possible before giving up the processor.
The current kernel scheduler employs a variant of the EDF algorithm [Liu 73] where the deadlines are derived from the QoS parameters of the domain and are purely internal to the scheduler. The scheduler is capable of ensuring that all guarantees are respected provided that and is described in detail in [Roscoe 95b]. Despite the internal use of deadlines, this scheduler avoids the inherent problems of priority or deadline based scheduling mechanisms which focus on determining who should be allocated the entire processor resource and provide no means to control the quantity of resource handed out.
In order to provide fine-grained timeliness guarantees to applications which are latency sensitive, higher rates of context-switching are unavoidable. The effects of context-switches on cache and memory-system performance are analysed in [Mogul 91]. Mogul showed that a high rate of context switching leads to excessive numbers of cache and TLB misses reducing the performance of the entire system. The use of a single address space in Nemesis removes the need to flush a virtually addressed cache on a context switch and the process-ID fields present in most TLBs can be used to reduce the number of TLB entries which need to be invalidated. The increased sharing of both code and data in a SAS environment also helps to reduce the cache-related penalties of context-switches.
Intra-process scheduling in a multimedia environment is an entirely application-specific area. Nemesis does not have a concept of kernel threads for this reason. A domain may use a user-level scheduler to internally distribute the CPU time provided by the kernel scheduler using its own policies. The application specific code for determining which context to reload is implemented in the domain itself.
The activation mechanism described in section 1.3.1 provides a convenient method for implementing a preemptive user-level threads package. The current Nemesis distribution provides both preemptive and non-preemptive threads packages as shared library code.
The default thread schedulers provide lightweight user-level synchronisation primitives such as event counts and sequencers [Reed 79] and the mutexes and condition variables of SRC threads [Birrell 87]. The implementation of various sets of synchronisation primitives over the top of event counts and sequencers is discussed in [Black 94].
It is perfectly possible for a domain to use an application specific threads package, or even to run without a user-level scheduler. A user-level threads package based on the ANSAware/RT [ANSA 95] concepts of Tasks and Entries has been developed as part of the DCAN project at the Computer Laboratory. The ANSAware/RT model maps naturally onto the Nemesis Virtual Processor interface.
Nemesis is a SAS operating system; i.e., translations from virtual to physical addresses are shared between all executing domains. This is in contrast to existing commercial operating systems such as Unix, VMS and Windows NT, where each executing process (or task) its own `personal' virtual address space.
The use of an SAS by an operating system gives a number of advantages:
Global translations do not in any way imply any lack of inter-domain protection: all domains execute in some protection domain which contains a set of access rights for any given page. Thus there is a clear seperation between the concepts of protection and of translation.
On a single machine in Nemesis, any process will `see' the same virtual address space, the only difference being the access rights that the process has on each part of the address space. The virtual address space is divided into pages, which are the smallest protectable component of the space.
The machine itself presents a Physical Address Space, which includes any physically addressable location in a machine's memory, not just RAM. The physical address space is divided into frames, which are the physical equivalents of pages. A frame typically contains the same number of bytes as a page.
Protection is in terms of sets of mappings of the virtual address space to the set of access rights. These mappings, known as protection domains, represent the protection environments in which domains execute.
The basic unit of allocation of virtual memory is the stretch, which is a contiguous region of the virtual address space, all of which has the same accessibility in any protection domain. This is similar to a region in Mach or Chorus, or even to a MS-DOS arena. It is not, however, to be confused with a Unix-style segment, which can shrink and grow at will -- a stretch corresponds to the same set of contiguous pages throughout its lifetime. Stretches are disjoint, i.e., they cannot overlap. One consequence is that stretches cannot be refined: if you have a stretch it doesn't make sense to talk to the VM system about a subset of the stretch.
A stretch must be associated with a stretch driver before it can be useful. A stretch by itself is merely an abstract object created by the stretch allocator; it has no meaning to the rest of the system. The stretch driver provides physical resources, page fault handling, or whatever, to the stretch, and sets up virtual to physical mappings by invoking the mapping systems. The stretch driver is a close analogue of what in System VR4 is called a segment driver, or what in Mach is called a memory object.
The translation system is a machine-specific component which enters, removes and updates both mapping and protection information for virtual addresses.
An outline of the components of the virtual memory system is shown in figure 1.7.
Figure 1.7: Virtual memory system components
Of key importance within Nemesis is the idea of accountability: every domain should perform whatever tasks are necessary for its own execution. In terms of the VM system, this means that a domain is responsible for satisfying any faults which occur on stretches which it owns. There is no concept of a ``system pager'', or of an ``application level pager'': instead, every domain is its own pager. The domain will typically invoke a stretch driver provided in an existing library, though it may use a private implementation if it desires.
In order to support such a mechanism, Nemesis extends guarantees to the allocation of physical memory. Every domain has a guaranteed number of physical frames which it is allowed. This is initially negotiated at domain creation time though may be renogiated at a later stage. A frame stack structure is shared between the frames allocator and the owning domain. The former keeps track of the number of frames which have been allocated, while the domain maintains their attributes (i.e. free, mapped, etc.).
On a page fault, the NTSC sends an event to the faulting domain. The next time the domain is activated, its driver for the stretch in question should resolve the fault. This may involve simply mapping one of its unmapped frames, or may involve paging out a resident page. The decision on which page to replace is totally under the control of the domain which hence may use whatever policy it prefers. In the case that resolution of the page fault may take significant time, the domain's user-level thread scheduler may decide to simply run another thread.
Temporal bounds on the actual paging operation are facilitated by the use of a user-safe disk (USD) [Barham 97]. Each domain owning a pageable stretch generally has a pre-established binding to the USD, owns certain extents (which it may use for paging) and holds a particular rate guarantee. Hence it is possible for the domain to predictably determine an upper bound to the amount of time required to load a given page from disk. This allows individual domains to make both disk layout and page replacement decisions which maximise either throughput or capacity as it sees fit.
A domain may also be allocated physical frames beyond the scope of its guarantee. These are known as optimistic frames and may, if necessary, be revoked by the frames allocator. Clearly revocation of a physical frame could have catastophic consequences for a domain, and hence a domain need be careful with what it maps into such frames.
Only low level I/O primitives are supported with the majority of high-level data-path processing performed by the client using shared libraries. The structure of a generic Nemesis device driver is shown in figure 1.8. As can be seen from the diagram, the functionality is implemented by two main modules:
Figure 1.8: Nemesis Device Driver Architecture
The DDM is intended to contain the minimal functionality to provide secure user-level access to the hardware and support QoS guarantees to clients. The DDM serves three main purposes:
Application domains communicate with block- and packet-based device drivers using an asynchronous shared-memory communications mechanism known as RBufs [Black 94]. Each connection to the driver has an independent queue and so the QoS-crosstalk, prevalent in FCFS queueing systems, is avoided. Rbufs also provide effective low-latency feedback to applications since clients are able to observe both the length of the queue and the rate at which requests are being serviced and may use this information to adapt their behaviour.
Out-of-band control of the translation, protection and multiplexing functions of the DDM is performed by a separate management entity known as the Device Control-Path Module (DCM). The DCM communicates with the multiplexing layer of the DDM in order to set up new connections and adjust the QoS-parameters associated with existing connections. The DCM is never involved with the in-band operations of the device driver.
The DCM uses high-level descriptions of access-permissions (e.g. meta-data for a file-system, or window placement in a window system), together with access-control policies to generate the low-level protection and translation information required by the DDM. These low-level permissions are often cached within the DDM's per-connection state records to reduce the number of interactions with the device manager. This cache provides conceptually similar functionality to the TLB of modern processors.
Device drivers typically require access to hardware registers which can not safely be made accessible directly to user-level code. This can be achieved by only mapping the registers into the address space of the device driver domain.
Some hardware registers are inherently shared between multiple device drivers, e.g. interrupt masks and bus control registers. The operating system must provide a mechanism for atomic updates to these registers. In kernel-based operating systems this has traditionally been performed by use of a system of interrupt-priority levels within the kernel. On most platforms, Nemesis provides similar functionality via privileged NTSC calls.
The majority of I/O devices have been designed with the implicit assumption that they can asynchronously send an interrupt to the operating system which will cause appropriate device-driver code to be scheduled immediately with absolute priority over all other tasks. Indeed, failure to promptly service interrupt requests from many devices can result in serious data loss. It is ironic that serial lines, the lowest bit-rate I/O devices on most workstations, often require the most timely processing of interrupts due to the minimal amounts of buffering and lack of flow-control mechanisms in the hardware. [Barham 96] describes how this phenomenon influences DMA arbitration logic on the Sandpiper.
More recently designed devices, particularly those intended for multimedia activities, are more tolerant to late servicing of interrupts since they usually have more internal buffering and are expected to cope with transient overload situations.
In order to effectively deal with both types of device, Nemesis allows drivers to register small sections of code known as interrupt-stubs to be executed immediately when a hardware interrupt is raised. These sections of code are entered with a minimal amount of saved context and with all interrupts disabled. They thus execute atomically. In the common case, an interrupt-stub will do little more than send an event to the associated driver causing it to be scheduled at a later date, but for devices which are highly latency sensitive it is possible to include enough code to prevent error conditions arising. The unblocking latency hint to the kernel scheduler is also useful for this purpose.
This technique of decoupling interrupt notification from interrupt servicing is similar to the scheme used in Multics which is described in [Reed 76], but the motivation in Nemesis is to allow effective control of the quantity of resources consumed by interrupt processing code, rather than for reasons of system structure. [Dixon 92] describes a situation where careful adjustment of the relative priorities of interrupt processing threads led to increased throughput under high loads when drivers were effectively polling the hardware and avoiding unnecessary interrupt overhead. The Nemesis mechanisms are more generic and have been shown to provide better throughput on the same hardware platform [Black 94].
The majority of device-driver code requires no privilege, but small regions of device driver code often need to execute in kernel mode. For example, performing I/O on a number of processors requires the use of instructions only accessible within a privileged processor mode. Nemesis provides a lightweight mechanism for duly authorised domains to switch between kernel and user mode.