Chapter 5 Implementation This chapter describes the implementation of an experimental system called Nemo which was built to investigate some aspects of the design presented in the previous chapter. This implementation runs on a DECstation 5000/25 workstation, which uses a MIPS R3000 processor. 5.1 System Overview Figure 5.1 shows the structure of a typical Nemo system. The Nemo Trusted Su­ Figure 5.1: Structure of a typical Nemo system. pervisor Call (NTSC) code implements those functions which are required by user 68 mode processes and which need to run in supervisor mode. It is also responsible for mapping the low level hardware interface onto the VPIs which appear in each of the processes. The NTSC code provides support for three types of processes. System processes implement the majority of the services provided by the operating system. Device driver processes are in many respects similar to system processes, but are distinguished by the fact that they are attached to device interrupt stubs which execute in supervisor mode and hence are part of the NTSC code. Application processes contain user programs. Processes interact with each other via the system IPC facilities which are implemented using events and, if required, shared memory. The following sections describe the operation of the main system entities. 5.2 NTSC Code The NTSC code is about 2.5 kilobytes in size, written entirely in assembler and implements the routines which provide the support for VPs. Processes gain entry to the NTSC code via the standard system call mechanism and, once they have entered it, their execution is not preempted. Code is placed into the NTSC for a number of reasons: - it needs access to privileged instructions and needs to execute within supervisor mode; - it is executed as the result of an exception such as an interrupt, in which case the hardware has forced execution of the exception handler to be initiated in supervisor mode or; - the code implements commonly used functionality which needs to be run with­ out preemption. The routines which are provided for use by processes to implement activation han­ dlers and which will be discussed in a later section are examples of the latter. NTSC calls are separated into two classes, one containing calls which may only be executed by a suitably privileged system process such as the kernel, the other containing calls which may be executed by any process. Table 5.1 lists the NTSC code segments which currently exist and their functions. Of these segments, sc activate and sc halt are executable only by the kernel, the rest may be executed by any pro­ cess. sc halt causes the system to be halted, and execution to enter the bootstrap monitor. sc ici and sc dci allow a range of addresses in the instruction and data caches respectively to be invalidated. The function of the remaining code segments is described in later sections. 69 Name Purpose sc activate Activate a process sc halt Halt the processor sc rfa Return from activation sc rfar Return from activation, restore context sc kernel Activate the kernel sc deliver Deliver any pending events sc ici Invalidate a region of the instruction cache sc dci Invalidate a region of the data cache Table 5.1: NTSC code segments. The NTSC maintains a small amount of data which include: kvp, the address of the kernel's VP; cvp the currently active VP and; ccx the address of the structure into which the machine context is to be saved when a process loses the processor. These are maintained as a result of the execution of NTSC code by the kernel and user processes. 5.2.1 Device Interrupt Stubs The NTSC code is also responsible for providing an interface between device hard­ ware and its associated driver process; an example of this is provided by the clock device. The DECstation 5000/25 provides a periodic source of interrupts which have an inter­arrival time of 1 millisecond; this source is used by the NTSC to implement the system clock. Figure 5.2 shows the interface to the Nemo system clock, which consists of two registers called current and alarm. current contains the number of Interrupt Current Time Alarm Time (Current >= Alarm) ? 1 : 0 Figure 5.2: Nemo clock device interface. milliseconds which have elapsed since the epoch. alarm contains the time at which the system next wants to receive a clock interrupt. The clock driver will generate an 70 event whenever the value of the current register is greater than or equal to the value of the alarm register. The NTSC stub for the periodic interrupt is implemented as a fast and a slow path. The entry to this code saves the minimum number of reg­ isters required, increments the current register and compares the updated value with the value of the alarm register. If the alarm has not expired, the fast path is taken; the partially saved state is restored and execution returns from the exception. If current is greater than or equal to alarm, the alarm has expired and the stub needs to generate an interrupt. In this case, the remaining context is saved and the interrupt is delivered to the kernel in the form of an event. The current and alarm values are maintained within the NTSC code, but are made available to processes as shared memory segments. The current register can only be modified by the NTSC code, but may be read by any other process. Processes are given the address of this register at initialisation time and so have ready access to the current system time to millisecond resolution at the cost of a memory read operation. In addition to the address of the current register, the kernel is informed of the address of the alarm register, which it may write with the time at which it would next like to be sent an event by the NTSC. This device driver implementation exemplifies a number of important aspects of the mechanisms whose design was presented in chapter 4. Firstly, in the case of devices which present only a low level hardware interface to the system software, code can be implemented within a device driver stub to implement a higher level interface; this allows the system implementor to trade hardware complexity and cost off against the processor cycles required to implement a high level device interface. Secondly, the implementation of the interface between the clock device registers and the processes which use them provides extremely efficient accesses to the register values without compromising the basic shared memory IPC abstraction. Thirdly, the same shared memory and event mechanisms which are used for communications between processes are also used to provide the interface between hardware devices and their device driver processes. 5.3 Processes Processes are the active entities within a Nemo system and consist of a domain of protection and a thread of execution. A process may be multi­threaded, in which case it will implement its own thread scheduling algorithms in addition to the process scheduling which is done by the system. 71 5.3.1 The Virtual Processor Interface Of interest in this discussion is the interface between a Nemo process and the system. Nemo provides each process with a virtual processor (VP); resources are allocated to each VP by the RTRA and resource availabilities are communicated to processes via the Virtual Processor Interface (VPI), which is defined by the code segment in figure 5.3. Within a process's VPI, activate contains the address at which /* * Virtual processor interface */ typedef struct -- void (*activate)(); /* Activation handler */ u—int status; /* VP status (see below) */ u—int disable; /* Disable activations */ context—t ecx; /* Execution context */ context—t acx; /* Activation context */ char *kpm; /* Kernel call shared memory */ char *pkm; /* Upcall shared memory */ ť vp—t; /* * Status register */ #define VP—STS—ARMSK 0x0F /* Mask for activation reason */ #define VP—STS—PRMPT 0x01 /* Process was preempted */ #define VP—STS—ALLOC 0x02 /* Been allocated some time */ #define VP—STS—EXTRA 0x04 /* Obtained extra resources */ #define VP—STS—EVENT 0x08 /* Events have been delivered */ #define VP—STS—NEW 0x10 /* New events are pending */ #define VP—STS—AIP 0x20 /* Activation in progress */ Figure 5.3: Nemo virtual processor interface. execution is started whenever the processor is given to the process. This field can be altered with a single write by the process. Whenever a process is activated, the status field is updated to reflect the reason for activation as explained in section 4.7.1. While not executing an activation handler, the NTSC code saves a process's machine context in the ecx field of the VPI whenever the process loses the processor. If the 72 machine context needs to be saved while executing an activation handler, it is saved in the acx field and this context is resumed when the process is again given the processor. This removes the need to write reentrant activation handlers, but also means that a process may not immediately be informed of scheduling events which occur during activation. kpm and pkm contain the addresses of areas of memory which are shared between a process and the kernel. To perform the equivalent of a system call, a process marshals its arguments directly into the memory pointed to by pkm and then executes a call to sc kernel, which causes the NTSC code to deliver an event to the kernel and activate it. Conversely, whenever the kernel wishes to call a process to inform it of the results of a particular system call, it marshals the results directly into the memory pointed to by kpm and sends an event to the process to alert it to the presence of the results. 5.3.2 Activation and the VPI Activation is performed by code within the NTSC as the result of: - the kernel issuing an sc activate call to give the processor to a particular process; - an interrupt causing the NTSC to take the processor from the currently run­ ning process and activate the kernel and; - a process issuing a sc kernel call to surrender the processor and activate the kernel. In the usual case, when a process loses the processor, its context is stored into its VP's ecx field. When NTSC code is called to activate the process, its cvp variable is updated to point to the VP to be activated, and ccx is set to point to the acx field of the current VP. The reason for activation is updated in the VPI status field, the VP STS AIP bit is set, and execution is continued at the address specified by the VP's activation field in user mode. The activation handler can now process any pending events and make any scheduling decisions which it requires. If it decides to select a new thread for execution, the saved machine context of the thread which was running when it last lost the processor is available in ecx. Any external occurrences which would ordinarily result in an activation cause the VP STS NEW bit to be set if they occur while an activation is in progress. To leave an activation handler, a process executes a call to either sc rfa() or sc rfar(). Both of these NTSC calls check to see if any new activations arrived 73 while the previous activation handler was executing and if so, clear the VP STS NEW bit and reactivate the process. If no new activations had arrived, cvp is set to point to the current VP's ecx field. After this, sc rfa() causes execution to return to the instruction following the call to sc rfa(). sc rfar() takes one argument which is a pointer to a context which is to be resumed. A thread scheduler may use this to resume its previous execution context by calling sc rfar(&vpp­?ecx), where vpp is a pointer to its VPI, or it may nominate the address of the saved context of a new thread. 5.4 Building a Nemo System The NTSC code and the code for each of the processes is a separately compiled and linked object module. A machine dependent configuration utility called build is used to read a configuration file and bind the specified modules together to con­ struct a system image. Figure 5.4 shows an example configuration file. In this # # Configuration file for Nemo # # File Name Period Cycles # ntsc /ntsc kernel /kernel dir /lib/dir video0 /proc/0 67 30 video1 /proc/1 67 30 Figure 5.4: Example Nemo configuration file. configuration file, entries in the first column contain the unix pathname of the linked object modules which are to be used to build the system. The second column contains the ASCII strings which are used to locate the modules. In the case that the entity is a user process, its name begins with the string /proc/ and the third and fourth columns of its entry in the configuration file respectively contain the period with which processor time is allocated to the process and the amount of processor time which is allocated. The example shows that there are two contracted processes, one called video0 and the other called video1 which are both allocated a processor bandwidth of (67; 30) milliseconds. 74 In addition to calculating the final addresses of the processes and relocating them, build also initialises the important system data structures. Information from the configuration file along with the final addresses of the object modules are used to initialise the directory database. A process control block (PCB) is allocated and initialised for each process in the system, and each process's VP is initialised. The activation field of each VPI is set to point to the first instruction in the process's program space. The nexus contains a list of addresses which are required by the system at startup. These include the address of the first byte of memory after the end of the image, the address of the kernel's VP and the address of the directory data and operations. Figure 5.5 shows the layout of the bootable image created by build. NTSC Code Nexus Directory Data Initial PCBs Kernel video0 video1 Unused Memory vp program + data vp program + data vp program + data Directory Operations Figure 5.5: Layout of a Nemo system image as constructed by build. 5.5 The Bootstrap Sequence The built image is loaded into memory and execution starts at the first instruction in the NTSC code. This code extracts the address of the kernel's VP from the nexus and activates the kernel, passing to it the address of the nexus as an argument. The kernel locates the directory data and operations from the information which it retrieves from the nexus and uses this to locate the PCBs of the processes which were loaded as part of the system build. Each of the PCBs found is entered into the current schedule and initialisation is completed by entering the dispatcher. 5.6 Resource Management The Nemo scheduler is part of the kernel and maintains four process queues cq, wq, fq and bq. cq holds the current schedule of contracted processes sorted such that the most eligible process is at the head of the queue. wq contains those processes which have consumed their contracted amount of processor time and are able to make use of more time should it become available. fq contains any best effort processes and bq contains processes which are blocked, waiting for some event to occur. 75 Each process's PCB contains a number of fields which are used to maintain re­ source management information. These fields include period, cycles, pnext and remaining. period and cycles express the processor bandwidth which is allocated to the process. pnext is the time at which the process will next be allocated more resources and remaining contains the amount of contracted processor time which the process has remaining at any time. 5.6.1 Accounting and Policing When a process is given the processor, the time is recorded from the clock current register and an entry is made in the clock driver timeout queue so that a policing procedure will be called at current+remaining should the process still be running when its contracted processor time has been consumed. When the processor is taken away from a process, the policing timeout is removed and remaining is decremented by the amount of time which the process consumed while it had the processor. If remaining is zero, the process is moved from cq to wq. 5.6.2 Allocation For each contracted process, a periodic software timer is used to allocate processor time to the process by setting remaining to cycles every period milliseconds and updating the status field in the process's VPI. This timer also checks to see if the process is in wq because it had used all of its contracted time and if so, moves it from wq to cq. 5.7 Events Support for signalling between processes is provided by the Nemo event mechanism. 5.7.1 Creating an Event Channel Suppose process A wants to create an event channel for signalling events to process B, whose process identifier is 2. A can issue the system call ev create(''/proc/2'', rid). rid is an integer which identifies a particular instance of the system call; in a multithreaded process, rid would typically be the identifier of the thread issuing the system call. 76 ev create locates the named process and obtains a pointer to B's PCB if the named process exists. It then allocates an entry from an array of structures which is held in A's PCB. Figure 5.6 shows the structure of one of these entries. The fields of /* * PCB data required for an event */ typedef struct -- struct pcb *pp; /* Process receiving events */ u—int *recp; /* Address of event record */ u—int state; /* State flags ­ see below */ u—int rid; /* Result identifier */ ť pcb—ev—t; #define PCB—EV—WACC 0x1 /* Waiting for accept */ #define PCB—EV—OPEN 0x2 /* Channel is open */ Figure 5.6: PCB event structure. this structure are then initialised the state set to PCB EV WACC and the kernel upcalls both processes. Process A receives a return value which identifies the event channel, along with a copy of the rid given in the ev create call so that it can match the upcall results with the request. Process B is upcalled with a request to receive events and is given both a pointer to A's PCB and the identifier of the PCB event structure in A. This aims to provide enough information to enable B to identify the possible source of events and decide whether or not it is willing to receive events from that source. If it does not want to receive events from A, B issues the system call ev reject(pp, ev id) which causes the event structure to be deallocated in A's PCB and A to be upcalled with an error status. If B is willing to receive events from A, it issues the system call ev accept(pp, ev id, &value) where &value is the address of an integer which is used to record the deliveries of this particular event. This causes the state of the PCB event structure to be set to PCB EV OPEN and A to be upcalled with a good return status and the appropriate rid. 5.7.2 Signalling Events Once an event channel is in place, A can signal the occurrence of n events to B by issuing the ev signal(e, n) system call. Figure 5.7 shows the signalling of an event. The signal call uses the event identifier to locate the event channel's 77 A B ev_signal(e, n); e v += n RECD ACKD v VP PCB VP PCB status 1 VP_STS_EVENT Figure 5.7: Process A signalling the occurrence of n events to process B. structure in A's PCB. It uses the information contained in this structure to locate the address of the event record, updates the number of events received by B and sets the VP STS EVENT bit in B's VP status field. The scheduler is alerted that B has received an event, so that if B was in bq, waiting for an event, B is rescheduled. 5.8 Summary An implementation of the design presented in the previous chapter has been pre­ sented in the form of the Nemo kernel. The description of Nemo has demonstrated how the mechanisms which are required to provide support for QOS within an op­ erating system might be realised. The implementation also demonstrates that the primitives such as shared memory and events which were proposed in the design for use by application processes are also able to be used within the system itself. These primitives provide a level of abstraction which is not as high as that of message passing. However, they are better suited to a number of CM applications and their use for IPC also affords some performance gains over message passing. 78