System Design

 This section presents the design of the virtual memory management system in terms of the general concepts introduced in Section 3.

Protection Model

  In a multi-address space operating system, protection is typically considered to be a function of address space management -- each process executes with a myopic view of the virtual address space which prevents unauthorised accesses. For a SASOS this is inappropriate.

Instead, Nemesis uses protection domains. A protection domain is a mapping from the virtual address space to a set of access rights, i.e. a protection domain P may be formally defined by:

\begin{displaymath}
P : V \rightarrow R^{*}
\end{displaymath}

where $V$ is the set of valid virtual addresses, and $R = \{$ read, write, execute, meta $\}$ is the set of access rights.

Logically, all protection domains are considered subsets of an additional global protection domain. This allows certain memory regions to be globally protected or shared; that is, it grants a base level of permissions to all protection domains.
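
Viewing P(v) as the set of rights held on a valid address v, this relationship can be made explicit: writing G for the global protection domain, the effective rights held under a protection domain P are given by

\begin{displaymath}
P_{eff}(v) = P(v) \cup G(v)
\end{displaymath}

(a formulation implied by the design above rather than part of the original notation).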

Every domain is associated with exactly one protection domain, although multiple domains may share such an association. The protection domain of an executing domain determines the accessibility of any region of memory to that executing domain.


  
Figure 2: Protection Domains in Nemesis

A large portion of the Nemesis operating system consists of shared text in the form of stateless libraries and closure-invoked modules[*]. These, along with small amounts of globally accessible data, may be considered to reside within the global protection domain. The NTSC -- a rather special case -- will in general exist partly within the global protection domain, and partly outside any protection domain. This is illustrated in Figure 2.

There is a close relationship between protection domains and stretches: since a stretch encapsulates a region of the virtual address space with uniform accessibility, a protection domain may equivalently be considered a mapping from stretches to access rights. Hence the stretch interface is the logical place to expose operations which modify access control.

When a stretch is allocated, it is initialised with a default set of access rights -- read, write and meta -- for the protection domain of the caller. The third of these is the most interesting: a meta right authorises a protection domain to modify the access rights on the stretch, and to map or unmap virtual addresses within it.

Four kinds of access control operations are provided on a stretch:

1. SetProt(pdom, rights): set the access rights on the stretch for protection domain pdom to be rights. This requires the invoking domain to be authorised.

2. SetGlobal(rights): set the global access rights on the stretch to be rights. This also requires an authorised invoking domain.

3. QueryProt(pdom) $\rightarrow$ rights: retrieve the access rights for protection domain pdom.

4. QueryGlobal() $\rightarrow$ rights: retrieve the global access rights.

The ``global'' rights referred to are the rights of the global protection domain of which all other protection domains are subsets. One consequence of this is that the accessibility of a stretch from a particular protection domain is always at least the global rights.
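
To make the interface concrete, the operations above might be declared as follows in C. This is a minimal sketch: the type names, function names and rights bitmask are illustrative assumptions, not the actual Nemesis interface definitions.

    /* Hypothetical C rendering of the stretch access-control operations.
     * All names here are illustrative, not the real Nemesis interface. */
    typedef unsigned int rights_t;          /* bitmask of access rights  */
    #define RIGHT_READ    (1u << 0)
    #define RIGHT_WRITE   (1u << 1)
    #define RIGHT_EXECUTE (1u << 2)
    #define RIGHT_META    (1u << 3)         /* authorises changes        */

    typedef struct stretch stretch_t;       /* opaque stretch handle     */
    typedef struct pdom    pdom_t;          /* opaque protection domain  */

    /* Both require the invoking domain to be suitably authorised. */
    int stretch_set_prot(stretch_t *s, pdom_t *pd, rights_t rights);
    int stretch_set_global(stretch_t *s, rights_t rights);

    /* Query operations. */
    rights_t stretch_query_prot(stretch_t *s, pdom_t *pd);
    rights_t stretch_query_global(stretch_t *s);

Under the assumption that QueryProt returns the domain-specific rights alone, the effective accessibility of the stretch from pd would be stretch_query_prot(s, pd) | stretch_query_global(s), reflecting the fact that the global rights form a lower bound.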

Address Allocation

This section describes the mechanisms by which regions of an address space may be allocated to specific domains, and later freed. Other aspects of the overall memory architecture, such as the translation system and the handling of memory faults, are covered in later sections.

Stretch Allocation

A region of the virtual address space is allocated by creating a stretch, and hence the part of the VM system which handles virtual address allocation is called the stretch allocator. Any domain may request a stretch from a stretch allocator, specifying the desired size and (optionally) a starting address and attributes. Should the request be successful, a new stretch will be created and returned to the caller. The caller is now the owner of the stretch. The starting address and length of the returned stretch may then be queried; these will always be a multiple of the page size.

There may be a number of stretch allocators, although they will manage non-overlapping regions. This may be useful in cases where it is desirable for certain regions of the virtual address space to be associated with certain properties; for example, one allocator might provide stretches for DMA buffers, another provide stretches which must be mapped in a certain way, and so forth. Using different allocators for these various portions of the address space allows differing resource management policies to be implemented.

A stretch itself encapsulates a particular contiguous region of the virtual address space, as determined by a (base,length) pair, with each address within the region having the same access permissions. It is not possible to move, shorten or extend the stretch once it has been created -- the base and length remain the same throughout the lifetime of the stretch.

These properties rule out the possibility of nested stretch allocators, since it would not be possible to modify the access permissions of any ``sub-stretch'' independently of any other. It is of course possible, however, to subdivide and reallocate the virtual addresses of the stretch in terms of some second level allocator (e.g. a heap).

When allocated, a stretch need not in general be backed by physical resources. Before any address within the stretch may be referenced, then, the stretch must be associated with a stretch driver -- we say that the stretch must be bound to a stretch driver. The stretch driver is the object responsible for providing any backing (physical memory, disk space, etc.) for the stretch.

Many implementations of stretch drivers exist; two in particular are worth mentioning:

1. Nailed Stretch Driver: this provides physical backing for the virtual addresses spanned by the stretch at all times. This essentially requires that sufficient physical frames be allocated to back the stretch when it is associated with the driver. In such a case, the stretch will never cause a page fault.

2. Demand-Paged Stretch Driver: this provides no physical backing at all for the stretch on first association. Instead, once an address within the stretch is accessed, a page fault will occur. Only at this point will the stretch driver actually provide any physical resources, and it may map only the faulting page.

Clearly many other alternatives are conceivable: the only essential property of a stretch driver is that it somehow deals with events involving any part of any stretch with which it is associated.


  
Figure 3: Life-Cycle of a Stretch

The freeing of a stretch is similarly a two-stage process. First the stretch is unbound from its stretch driver; this allows any physical resources which may have been associated with the stretch to be disposed of correctly. Secondly, the stretch is destroyed via a stretch allocator. This must be the same stretch allocator used to allocate the stretch initially. Figure 3 shows the life-cycle of a stretch.

Note that the permission to unbind or destroy a stretch is independent of the standard $\{$ read, write, execute, meta $\}$ rights associated with the stretch. Instead it is the responsibility of the relevant stretch driver or allocator to implement a protection scheme. Generally the unbinding and destruction of a stretch are restricted to the owning domain.
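
The life-cycle of Figure 3 can be summarised by the following C sketch. The function names are assumptions for illustration (stretch_t as in the earlier sketch); the real Nemesis interfaces differ in detail.

    /* Hypothetical sketch of the stretch life-cycle (cf. Figure 3).
     * All names are illustrative assumptions.                        */
    typedef struct stretch_allocator stretch_allocator_t;
    typedef struct stretch_driver    stretch_driver_t;

    void stretch_lifecycle(stretch_allocator_t *sa,
                           stretch_driver_t    *sd,
                           unsigned long        bytes)
    {
        /* 1. Allocate: the caller becomes the owner; base and
         *    length are fixed for the lifetime of the stretch.   */
        stretch_t *s = stretch_alloc(sa, bytes);

        /* 2. Bind: the driver now provides any backing required. */
        stretch_bind(s, sd);

        /* ... addresses within the stretch may be referenced ... */

        /* 3. Unbind: physical resources are disposed of.         */
        stretch_unbind(s);

        /* 4. Destroy: must use the allocator which created it.   */
        stretch_destroy(sa, s);
    }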

Physical Memory Allocation

Nemesis provides fine-grained control over the allocation of physical memory, including (where applicable) I/O space. A stretch driver may request specific physical frames, or frames within a ``special'' region[*]. This allows a stretch driver with platform knowledge to make use of page colouring, or to take advantage of ``super-page'' TLB mappings, etc. Clearly, a default allocation policy is also supported for domains with no special requirements.

As with virtual memory, the allocation of physical memory is done by a central frames allocator. This allocator actually supports two different interfaces so that domains may request physical memory in appropriate ways:

1. Via the Frames interface: this allows for the allocation and freeing of individual (or contiguous blocks of) physical frames. In normal circumstances, however, this memory will be of limited use, since it is not possible to address it! Only certain privileged domains have the power to perform physical accesses, and hence the use of this interface is restricted to these domains.

2. Via the FrameStack interface: like the Frames interface, this allows the allocation and freeing of frames (or groups thereof), but instead of returning raw (and generally useless) physical addresses to the domain, it creates (or augments) that domain's frame stack. This represents the set of physical frames which a stretch driver may use for backing any of its stretches. The set is recorded in a nailed area defined by the translation system so that it is easy to validate mapping attempts by a domain.

As has been seen in the previous section, the responsibility for providing physical backing for a stretch is delegated to the stretch driver. Thus in general it is the stretch drivers that deal with a domain's frame stack.
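
A sketch of the FrameStack style of allocation follows; the names, types and signatures are assumptions for illustration only.

    /* Hypothetical sketch of frame allocation via FrameStack.      */
    typedef unsigned long paddr_t;              /* physical address  */
    typedef struct frame_stack frame_stack_t;

    /* Allocate n frames. Rather than returning raw physical
     * addresses, the allocator pushes the frames onto the calling
     * domain's frame stack, which lives in a nailed area known to
     * the translation system so that subsequent mapping attempts
     * can be validated cheaply.                                    */
    int framestack_alloc(frame_stack_t *fs, unsigned int n);

    /* A stretch driver later draws a frame from the stack when it
     * needs backing for one of its stretches.                      */
    paddr_t framestack_take(frame_stack_t *fs);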

As for stretch allocation, it is often desirable to have more than one frame allocator. This is particularly convenient on machines with memory-mapped I/O space or with physical addresses which encode information about caching or buffering. The use of a separate allocator allows:

a) additional access control checks to be carried out, since different physical memory spaces are often best limited to certain privilege classes.

b) the use of a different physical frame size: for `normal' memory on most architectures, frame size = page size = granularity of allocation. However, larger frames may be more suitable for I/O-type physical memory regions.

c) the reservation of certain regions of physical memory for system, device or other purposes.

d) the separation of allocation and revocation policy based on the characteristics of a region of physical memory.

Clearly, in the case where multiple frame allocators are available, it is not necessary for every one of them to provide both the Frames and FrameStack interfaces; these might rather be limited to a single allocator.

One further concern for physical memory allocators is that of revocation. Limited physical memory[*] generally implies considerable contention.

In Nemesis memory management the ideas of guaranteed and optimistic resources are used, as they are for other resources in the system, e.g. CPU. A domain has some explicitly guaranteed number of physical frames which are immune from revocation in the short term. In addition, any domain may also have some number of optimistically allocated physical frames. This latter set is currently available for use by the domain, but is subject to later revocation without notification. The fact that a domain knows explicitly which frames it has been granted, and under what conditions, allows it to place data in appropriate places.
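
Purely as an illustrative assumption, this distinction might be represented by tagging each frame-stack entry with its allocation class (paddr_t as in the earlier sketch):

    /* Illustrative only: each frame is tagged as guaranteed
     * (immune from short-term revocation) or optimistic (usable
     * now, but revocable without notification).                  */
    typedef enum { FRAME_GUARANTEED, FRAME_OPTIMISTIC } frame_class_t;

    typedef struct {
        paddr_t       base;     /* physical address of the frame  */
        frame_class_t class;    /* conditions of the grant        */
    } frame_entry_t;

A domain might then place, for example, easily recomputed data in optimistic frames and critical state in guaranteed ones.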

Address Translation

  The translation system deals with inserting, retrieving or deleting mappings between virtual and physical addresses. As such it may be considered an interface to a table of information held about these mappings; the actual mapping will typically be performed as necessary by whatever memory management hardware or software is present.

Nemesis expects each domain to deal with mapping its own stretches, and as such a domain will generally require some physical frames which it may use for this purpose. Furthermore it is important that attempted use of the translation system can be easily validated to prevent unauthorised behaviour. This validation usually requires two checks to be made:

a) Valid virtual address: the calling domain must execute in a protection domain which holds a meta right for the stretch which contains the virtual address. A consequence of this is that it is not possible to map a virtual address which is not part of some stretch.

b) Valid physical address: the calling domain must own the frame which is being used for the mapping. Checking this condition is facilitated by the use of the frame stack.

Conceptually, then, the translation system provides three operations:

1. Map(va, pa): arrange so that the virtual address va maps onto the physical address pa. In practice the va will be page-aligned, and the pa frame-aligned.

2. Unmap(va): remove the mapping of the virtual address va. Any further access to the address should cause some form of memory fault.

3. Trans(va) $\rightarrow$ pa: retrieve the currently installed mapping of the virtual address va, if any.

The translation system is, however, a rather machine-specific object, and so particular implementations may refine or augment the above. A good example of where this is useful is for machines which have software managed TLBs -- on such architectures one may use an arbitrary page-table format and thus provide for reference bits if desired. An example of an implementation of the translation system will be seen in Section 5.
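
In C the conceptual interface might read as below; the comments record the two validation checks above, and all names are assumptions rather than the real Nemesis interface (paddr_t as in the earlier sketch).

    /* Hypothetical sketch of the translation system interface.   */
    typedef unsigned long vaddr_t;              /* virtual address */

    /* Map va onto pa. In practice va is page-aligned and pa is
     * frame-aligned. Fails unless (a) the caller's protection
     * domain holds a meta right for the stretch containing va,
     * and (b) the caller owns the frame at pa (checked against
     * its frame stack).                                           */
    int trans_map(vaddr_t va, paddr_t pa);

    /* Remove the mapping; further accesses to va should fault.    */
    int trans_unmap(vaddr_t va);

    /* Retrieve the current mapping of va, if any.                 */
    paddr_t trans_translate(vaddr_t va);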

Fault Handling

  A wide variety of memory faults may be generated by a typical system. Clearly the exact number, form and semantics of these are machine specific, but a generalised taxonomy may be provided in order to facilitate portable implementations[*]:
TNV (Translation Not Valid): the translation of a virtual address to a physical address failed because there is no valid information about the mapping. Generally this occurs when an unallocated virtual address is used.

ACV (Access Violation): an attempt to access some virtual address failed due to insufficient privilege (i.e. a protection fault).

FOR (Fault On Read): a read access was attempted on a physical page marked as ``fault on read''. This may be used to collect access pattern statistics, or as a hook for persistence, and so on.

FOE (Fault On Execute): an instruction stream fetch was attempted from a non-executable page.

FOW (Fault On Write): a write access was attempted to a page marked ``fault on write''. This might, for example, be used to implement copy on write.

UNA (Unaligned Access): an unaligned access was attempted. This may simply be a bad virtual address, or may be a true unaligned attempt.

PAGE (Page Fault): the page to which an access was attempted is not currently mapped.

The above list includes all the faults which Nemesis currently recognises; it is not expected that every type of fault will be possible on every architecture.
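
For portable implementations the taxonomy might simply be carried as an enumeration in the fault notification; the following is an illustrative assumption rather than the actual Nemesis definition.

    /* Illustrative fault codes mirroring the taxonomy above.      */
    typedef enum {
        FAULT_TNV,    /* Translation Not Valid                     */
        FAULT_ACV,    /* Access Violation (protection fault)       */
        FAULT_FOR,    /* Fault On Read                             */
        FAULT_FOE,    /* Fault On Execute                          */
        FAULT_FOW,    /* Fault On Write (e.g. copy-on-write)       */
        FAULT_UNA,    /* Unaligned Access                          */
        FAULT_PAGE    /* Page Fault: page not currently mapped     */
    } fault_t;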

Of key importance within Nemesis is the idea of accountability: every domain should perform whatever tasks are necessary for its own execution. In terms of the VM system, this means that a domain is responsible for satisfying any faults which occur on stretches which it owns. There is no concept of a ``system pager'', or of an ``application level pager'': instead, every domain is its own pager (or, more generally, memory fault handler).

Clearly this would be unduly complex in a traditional or microkernel operating system, but with a vertically integrated system, practically any system task may easily be shared by all domains. In this particular case, the domain will typically invoke the shared library code of a stretch driver implementation, though it may use a private implementation if it desires.

In order to achieve the above we require a lightweight, low-latency, asynchronous mechanism whereby a domain may be notified of a fault occurring. The NTSC provides this by means of the event mechanism, which allows a domain to be notified asynchronously of some occurrence. Event channels are used throughout Nemesis to provide inter-domain communication and to support device driver stubs -- and to signal memory faults.

On a memory fault, the NTSC identifies the stretch containing the faulting address and sends the faulting domain an event. The next time this domain is activated, it should resolve the fault. This may involve simply mapping one of its unmapped frames, or may involve paging out a resident page. The decision on which page to replace is entirely under the control of the domain, which may therefore use whatever policy it prefers.

If the stretch was being shared, the domain which faulted may not be sufficiently authorised (i.e. may not possess meta rights for the stretch). In this case, it receives the event and notices that it cannot resolve the fault itself. If, as is common, it contains a user-level thread scheduler, it may decide to simply run another thread. Alternatively it may optimistically retry the faulting instruction, or attempt to contact a domain which can potentially resolve the fault.


  
Figure 4: Event Handling

It is important to understand how the domain handles the incoming event, and how it dispatches to the correct handler. Figure 4 illustrates how domains can handle events. Note that the structure of the user-space part illustrated is not a mandatory aspect of the design -- any domain may deal with incoming events in any way it wishes. The illustrated structure is, however, typical.

In the case of memory faults, the first two stages of the event delivery happen as normal. In the notification handler, however, there are usually two choices:

i) The fault is fixed up by the stretch driver within the notification handler itself. This is the ``fast path'' option and is suitable where no further action is required in order to satisfy the fault.

ii) The notification handler simply unblocks a worker thread in the memory management entry (MMEntry). This thread will later be scheduled by the ULS and will invoke the stretch driver to satisfy the fault.

In the latter case, IDC operations are possible, facilitating paging, the handling of faults on shared stretches, and the invocation of a user-level debugger for unresolvable faults. More details on these aspects of fault handling are given in Section 5.5.
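
A sketch of such a notification handler is given below. The helper names (stretch_driver_fixup, mmentry_unblock_worker) are hypothetical, and the types are those of the earlier sketches.

    /* Hypothetical notification handler for memory-fault events.  */
    void mem_fault_notify(stretch_t *s, vaddr_t va, fault_t cause)
    {
        /* (i) Fast path: fix the fault up immediately if no
         *     further action is required to satisfy it.           */
        if (stretch_driver_fixup(s, va, cause))
            return;

        /* (ii) Slow path: unblock a worker thread in the MMEntry.
         *      The ULS will later schedule it to invoke the
         *      stretch driver; IDC operations are possible there. */
        mmentry_unblock_worker(s, va, cause);
    }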

Low-Level Interfaces

Nemesis is most suited to RISC architectures -- the ability to choose a page-table structure for the SAS model, to prefetch TLB entries on a protection domain boundary, or to make use of software TLBs allows a fine-tuning process not possible with CISC MMUs. One additional aspect of certain RISC architectures is software cache invalidation and/or prefetching. Nemesis supports such architectures by means of cache hints.

Cache hints provide a declarative interface to the hardware; the actual effect they have is implementation and machine specific. Likely semantics include the invalidation of a cache line (or of the entire cache) and the prefetching of a line (or lines) into a cache. Currently three cache hints are defined:

1. RequireRead(region) -- the caller intends to read the specified addresses. The system may decide to prefetch some or all of the relevant lines, or may do nothing.

2. RequireWrite(region) -- the caller intends to write the specified addresses. The system may decide to prefetch some or all of the relevant lines for writing (which might be the same as prefetching for reading).

3. Modified(region) -- the caller has modified the specified addresses. On some architectures[*], this hint may be required in order to ensure cache consistency in the presence of DMA access by I/O devices. In most cases, however, cache consistency is provided by the hardware and the likely reaction to this hint will be a nop.

In each case the associated range of virtual addresses is specified by region. The range is specified at the granularity of bytes (or at least words) rather than pages, to allow caches to be invalidated or pushed over small objects (for example ATM cell payloads).
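
Declaratively, the hints might appear as follows; the region type and function names are assumptions for illustration (vaddr_t as in the earlier sketch).

    /* Hypothetical declarations of the cache hints. A region is
     * a (base, length) pair specified in bytes, not pages.        */
    typedef struct {
        vaddr_t       base;
        unsigned long bytes;
    } region_t;

    void cache_require_read(region_t r);   /* may prefetch lines    */
    void cache_require_write(region_t r);  /* may prefetch to write */
    void cache_modified(region_t r);       /* may push/invalidate;
                                              often a nop           */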

It is important to note that various other ``useful'' low-level features (such as a partitioned cache [Hayter94], or a protection translation buffer [Koldinger92]) may be available on other or future architectures. It is difficult to design interfaces for all possible enhancements, and it is not attempted here. Whenever such interfaces are designed, however, they should be declarative. This allows implementations without the specific hardware to trivially support a nop reaction.

