Next: Related Work Up: Nemesis Virtual Address and Previous: System Design

Subsections

Implementation

An initial implementation of much of the architecture described here has been completed on one of the Nemesis platforms, the EB164. which uses a Digital Alpha 21164 processor[DEC95]. This section gives an overview of the current status and looks ahead to the work remaining to be completed.

The Translation System

The main component of the translation system is the page-table; this is walked (in software) in order to perform TLB fills, and is modified by the mapping/unmapping operations.

Currently a guarded page table [Liedtke95a] is used. Guarded page tables (GPTs) are a modification of standard n-level page tables in which each level may translate a variable number of bits. This provides for a flexibility in terms of the ``page'' sizes which may be mapped. In principle using GPTs can reduce the overall memory requirement for translation information. It is also possible to arrange for a relatively shallow tree (which should result in shorter lookup times).

**Figure 5:** Guarded Page Table Entries
$\begin{figure} \vspace{3mm} \centerline{ \includegraphics [width=0.99\textwidth]{figures/html_gpt_bits.eps} } \vspace{4mm} \vspace{3mm}\end{figure}$

The Nemesis implementation of GPTs on the 21164 makes use of several of the ideas presented in [Liedtke96b]. A guarded page table entry (GPTE) comprises two words: the first word is the guard word, while the second word is either a page table pointer (PTP) or a leaf page table entry (LPTE). These structures are illustrated in Figure 5.

The guard word contains four significant parts:

the index , a bit string whose width is $\log_{2}$ of the size of the current page table,
the guard , a variable length bit string,
the information bits, currently a 2-bit field with a valid bit V and a leaf bit L, and
the guard size, a 6-bit field size. This encodes the guard and index lengths (in bits) in the form 64-size.

The index and guard fields combined make up the extended guard .

A PTP represents the pair (p,s), where p is the physical address of the next level page table and 2^s is its size (in entries). This is encoded by using the top 58 bits of a word for p and the lower 6 bits to hold s. A consequence of this is that all page tables must be aligned to 64-bit boundaries. If the leaf bit is set in a guard, then the second word holds a LPTE. This holds the physical frame number (32 bits), the stretch identifier (16 bits) and the protection bits as defined in [Sites92].

When translating a virtual address (va), the procedure is as follows:

1.: The first n untranslated bits of va are used as an index into the current page table, where n is $\log_{2}$ of the current page table size. This results in a GPTE.
2.: If the GPTE has the V bit clear, then the translation is not valid, and is aborted. Otherwise, va is XORed with the guard, and the result shifted right by sz. If va does not match the guard and index of the GPTE, then the results will be non-zero. This causes a guard-fault to be reported. Otherwise the extended guard has matched.
3.: If the GPTE has the L bit set, then the second word contains the LPTE, and the translation is complete. Otherwise the second word contains a PTP for the next level. This contains the base address and size of the new page table, required for step 1.

Figure 6 illustrates the first stage of this translation process. In subsequent stages, the guard word will always have a zero-filled prefix.

**Figure 6:** Translating Virtual Addresses
$\begin{figure} \centerline{ \includegraphics {figures/gpts.eps} }\end{figure}$

Notice the section now marked MBZ in the guard word. By forcing these bits to be zero, the result of a successful guard match leaves the remaining bits of the virtual address untouched. Thus the result of the XOR performed may be used as the input to the next translation stage.

A single page table is used for all mappings, and is not context switched. In addition to mapping (i.e. VA $\rightarrow$ PA) information, each PTE also contains some protection information. This is used on the EB164 to hold the global protections for any individual page.

The page table base is kept in an internal processor register and hence is available to the PALCODE routines used to handle ITB and DTB misses. It is also used by the non-privileged PALCODE calls map, unmap and trans which implement the address translation functions roughly as specified in Section 4.3. The main difference is that physical addresses are not themselves passed to the calls, but rather indices into the caller's frame stack. This provides a simple way to validate the frames referred to.

The Protection System

As mentioned above, the global protection information is kept in the page table. Protection domain information is used to augment the rights found there: a TLB fill involves first a page-table lookup (thereby getting the mapping and global protection information), and secondly a check of the protection domain for additional access rights.

This is implemented efficiently on the EB164 by using stretch identifiers ; i.e. each stretch is assigned a unique identifier. Recall that the protections on every page of a stretch must be the same. This means we only require to keep an array of access rights indexed by stretch identifier. This is currently implemented by a page frame (8K) per protection domain, which can hold information for up to 16384 stretches.

**Figure 7:** Protection Domain Implementation
$\begin{figure} \centerline{ \includegraphics {figures/prot_impl.eps} }\end{figure}$

On a TLB miss, the virtual address is first looked up in the page table as described in Section 5.1. Assuming that this succeeds, we have a PTE which contains the PFN, the global protection bits -- and the stretch identifier! Using this stretch identifier (SID), we index into the current protection domain and retrieve the access rights. These are ORed into the PTE (and hence may only augment the access rights), and finally the TLB is updated. This procedure is illustrated in Figure 7.

A pointer to the current protection domain is kept, like the page table base, in an internal processor register. Unlike the PTBR, it must generally be context switched. To minimise the effects of the context switches, we make use of the 21164's address space numbers (ASNs). A pair of internal processor registers (itbAsn and dtbAsn) is set to any 7-bit value, and any subsequent fill of the TLBs is tagged with this value. The current implementation uses one of these per protection domain. Whenever a context switch occurs, the ASN of the protection domain of the new domain is inserted into itbAsn and dtbAsn.

This means that the TLBs need not be flushed at all on a protection domain switch; invalid protections stored in the TLB from the previous domain simply result in misses and refills. As a further optimisation, the ASM bit is also used. This bit, if set in a TLB entry, allows the entry to cause a hit regardless of the current ASN values. Hence it may be used in Nemesis to map pages within a stretch which has significant global rights. Care must be taken to ensure that no domain has additional rights for the particular stretch, since this might cause a protection fault where none was necessary.

Frame Stack

The frame stack is logically composed of two parts:

1.: The set of frames allocated to the domain. It is important that the system know which these are so that it can validate mapping attempts.
2.: The attributes -- clean, dirty, referenced etc. -- of these frames. The translation system does not in general need to know about these.

The current implementation of the frame stack is similarly bipartite. The set of frames allocated to any domain, plus their current mapped status, are kept in an array mapped read-only to that domain. This array is conveniently located within the domain's DCB, read-only part (DCBRO). Also kept in the DCBRO are the indices (high-water marks) of the number of guaranteed and optimistic frames allocated to the domain.

A second array, indexed in parallel, is kept in the read-write part of the DCB (i.e. the DCBRW). This array contains 16 bits of information for each respective PFN; the format and meaning of this information is entirely up to the domain, but typically contains flags to mark pages as mapped, accessed, dirty, etc.

By laying out the frame stack like this, authenticating mapping attempts becomes easy. When inserting a new mapping, instead of explicitly passing down the frame, the caller gives an index into its frame stack. Thus the call looks like ntsc_map(va,idx). The called routine simply needs to:

a): Range check 0 $\leq$ idx $\leq$ xtraf, where xtraf is the high-water mark of the array; and
b): Check the virtual address is within a stretch for which the current protection domain holds meta rights.

Assuming the authentication checks are successful, the page-table is walked, the mapping inserted, the frame's mapped status updated and the relevant TLB entries flushed.

Initial allocation and subsequent extension of the frame stack are achieved via IDC to the Nemesis domain. Currently, however, there is no proper support for dealing with the revocation of frames -- a frame may be ``revoked'' by deallocating it from a particular domain, but no notification of this is currently given. A frame which is currently mapped cannot be revoked at all.

Fault Dispatching

The 21164 defines a number of PALCODE entry points for faults and interrupts. The entry points relevant to memory management are:

IACCVIO: I-Stream access violation
DFAULT: D-Stream access violation
UANLIGN: D-Stream unaligned access
ITBMISS: I-Stream TLB miss
DTBMISS: D-Stream TLB miss

When any of the above happen, the pipeline is drained, the faulting PC is stored in an internal processor register, and the new PC is loaded by indexing from the start of the PALCODE base. The TLB miss handlers load mappings as described above as long as the translation is valid. If a miss handler determines a translation or access fault, or if one of the other three entry points is entered, then the following occurs:

1.: The reason for the fault (one of TNV, ACV, FOR, FOE, FOW, UNA or PAGE) is determined.
2.: The faulting virtual address is identified. This is typically easy to find from the value of one of the internal processor registers va or excAddr at the time of the fault.
3.: The stretch containing the virtual address is obtained by using a table mapping stretch identifiers to stretches. Clearly this only succeeds for addresses within a currently allocated stretch.
4.: The above three pieces of information are stored in the current (i.e. faulting) domain's DCB.
5.: A short assembly stub in the NTSC is entered with interrupts off. If the current domain has not yet set an event channel to receive memory management events, or if it has activations off, then it is simply destroyed. Otherwise an event is sent on the domain's memory management event channel, and the scheduler entered.

**Figure 8:** Implementation Structure
$\begin{figure} \centerline{ \includegraphics [width=0.99\textwidth]{figures/impl.eps} }\end{figure}$

At some point in the future the faulting domain will be activated. Its event demultiplexing code will call the memory management entry notification handler about the event on the memory management channel. This notification handler does the following:

Blocks the faulting thread.
All user-level threads packages in Nemesis provide an interface for the use of entries which allows for the blocking and unblocking of individual threads.
Determines whether the fault is resolvable or not.
Currently it is considered that ACV, UNA and TNV are always unresolvable; i.e. it is not sensible to talk about retrying the faulting instruction.
- If the fault is unresolvable, a todo structure is filled in with details of the fault (including the faulting thread's context ). This todo structure is put on a queue for a worker thread which is then unblocked.
- If the fault is potentially recoverable, the stretch driver for the faulting stretch is invoked. If the stretch driver succeeds, the faulting thread is unblocked. If, on the other hand, the stretch driver fails to satisfy the fault, a todo structure is used as described above.
Returns to the event demultiplexer.

Once the notification handler returns, the activation handler will typically pass control to a user-level thread scheduler, which will schedule a memory management worker thread at some point. This thread checks its queue of todo structures. If there is no work present, it blocks itself. Otherwise it takes the first item from the queue and determines what action to take. There are two possibilities here:

1.: If the fault is an unrecoverable one, the memory management worker thread will invoke the user-level debugger.
2.: If the fault is potentially recoverable, the thread invokes the stretch driver for the second time. It is possible that the fault resolution will succeed on this occasion due to the potential for communication with other domains (viz. IDC). This is explained in more detail in Section 5.5.

The structure of the implementation of fault handling is illustrated in Figure 8.

Stretch Drivers

The current implementation includes three stretch drivers which may be used to handle faults. The simplest is the nailed stretch driver; this provides physical frames to back a stretch at bind time, and hence never deals with page faults. The second is the physical stretch driver. This provides no backing frames for any virtual addresses within a stretch initially. The first authorised attempt to access any virtual address within a stretch will cause a page fault which is dispatched in the manner described in Section 5.4. The physical stretch driver is invoked from within the notification handler, where the following occurs:

Basic sanity checks are performed. These include checking that the relevant stretch is bound to this stretch driver, that the faulting virtual address is within the stretch, and that the faulting domain holds meta rights for the stretch. If any of these checks fail, the stretch driver returns Failure.
The stretch driver looks for an unused (i.e. unmapped) frame within its portion of the frame stack. If this fails, it cannot proceed further now -- but may be able to request more physical frames when activations are on. Hence it returns Retry.
The stretch driver sets up the new mapping with a call to ntsc_map(va,idx,bits), where idx is the index of the unused frame within the frame stack and bits holds the new value for the reference bits, and returns Success.

In the case where Retry is returned, a memory management entry worker thread will invoke the physical stretch driver for a second time once activations are on. In this case, IDC operations are possible, and hence the stretch driver may attempt to gain additional physical frames by invoking the frame allocator via the FrameStack interface. If this succeeds, the stretch driver sets up a mapping from the faulting virtual address to a newly allocated physical frame. Otherwise the stretch driver returns Failure.

The third stretch driver implemented is the paged stretch driver. This may be considered an extension of the physical stretch driver; indeed, the bulk of its operation is precisely the same as that described above.

However in the case where no physical frames are available for mapping, the paged stretch driver has two choices: it may request more frames via the FrameStack interface, or it may swap a page out to a backing store. The backing store used currently is a user-safe disk (USD)[Barham96,Barham97]. This allocates resources in terms of extents -- contiguous ranges of blocks on the disk -- and rate guarantees . These latter are supported by the use of a custom QoS entry in the driver which schedules requests according to these guarantees.

Each domain making use of the paged stretch driver has a pre-established binding to the USD, owns certain extents (which it may use for paging or other purposes) and holds a particular rate guarantee. Hence it is possible for the paged stretch driver to predictably determine an upper bound on the amount of time required to load a given page from disk. This allows individual stretch drivers to make both disk layout and page replacement decisions which maximise either throughput or capacity as they see fit.

Evaluation

The prototype implementation has proved the validity of the architecture, but is not yet complete. A number of items remain for future work, the most urgent being:

Page-Tables.
Although one might expect GPTs to be especially suitable for a 64-bit single address space, it turns out that in general one achieves either small space requirement or shallow trees. It seems difficult to get both at once. While there are various theorems on GPTs[Liedtke93,Liedtke94] which assert the existence of various minimal trees, the transformations required to achieve them seem rather computationally expensive.
In particular what is lacking is a good heuristic for local transforms when adding or removing entries close at hand. Further work will be required to see if one of these can be discovered, and to investigate other translation structures such as clustered page-tables[Talluri95].
Physical Memory Revocation.
The idea of `optimistic' allocation of physical frames is only useful if there is a sensible implementation of some revocation mechanism. Exactly how this should work is not yet clear: the idea of sending a ``revocation event'' may suffice, but seems rather imprecise.
Other Architectures.
The system as implemented here conforms rather closely to the design. This is due at least in part to the fact that the EB164 is a 64-bit RISC processor -- precisely one of the target architectures for the Nemesis VM system.
When implementing the system on the CISC Pentium/Pentium Pro machines, or on the StrongARM, more difficulties arise. For example, the fact that page-tables are walked in hardware provides little flexibility in the translation systems used. Nonetheless good progress has been made in realising a version of the VM system on the other Nemesis platforms.

Additionally there is considerable scope for the development of stretch drivers. Currently three are implemented: a NULL implementation for nailed-down stretches, a demand-paged stretch driver which uses only physical memory, and an extended version of this which also pages to and from the USD. All of these require optimisation and might benefit from some reorganisation. A number of other 'interesting' stretch drivers (providing support for DSVM or peristence, for example) might also be implemented.

Next: Related Work Up: Nemesis Virtual Address and Previous: System Design

Robin Fairbairns
2/17/1998