Currently a guarded page table [Liedtke95a] is used. Guarded page tables (GPTs) are a modification of standard n-level page tables in which each level may translate a variable number of bits. This provides flexibility in the ``page'' sizes which may be mapped. In principle, using GPTs can reduce the overall memory requirement for translation information. It is also possible to arrange for a relatively shallow tree (which should result in shorter lookup times).
The Nemesis implementation of GPTs on the 21164 makes use of several
of the ideas presented in [Liedtke96b]. A guarded page table
entry (GPTE) comprises two words: the first word is the guard word,
while the second word is either a page table pointer
(PTP) or a leaf page table entry (LPTE). These structures are
illustrated in Figure 5.
The guard word contains four significant parts:
A PTP represents the pair (p, s), where p is the physical address of the next-level page table and 2^s is its size (in entries). This is encoded by using the top 58 bits of a word for p and the lower 6 bits to hold s. A consequence of this is that all page tables must be aligned to 64-byte boundaries. If the leaf bit is set in a guard, then the second word holds an LPTE. This holds the physical frame number (32 bits), the stretch identifier (16 bits) and the protection bits as defined in [Sites92].
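To make the encoding concrete, the sketch below gives one plausible C rendering of a GPTE. The PTP encoding follows the description above; the packing order of the LPTE fields is an assumption for illustration, since bit positions are not given here.

    #include <stdint.h>

    /* One plausible rendering of a two-word GPTE (field packing assumed). */
    typedef struct {
        uint64_t guard;   /* guard word: guard bits, MBZ field, leaf bit */
        uint64_t body;    /* PTP if the leaf bit is clear, LPTE if set   */
    } gpte_t;

    /* PTP: top 58 bits hold p (hence 64-byte alignment); low 6 bits hold s. */
    #define PTP_ADDR(w)   ((w) & ~0x3fULL)             /* physical address p    */
    #define PTP_SIZE(w)   ((unsigned)((w) & 0x3fULL))  /* table has 2^s entries */

    /* LPTE: PFN (32 bits), SID (16 bits), protection (16 bits); this
     * particular packing order is assumed.                             */
    #define LPTE_PFN(w)   ((uint32_t)((w) >> 32))
    #define LPTE_SID(w)   ((uint16_t)((w) >> 16))
    #define LPTE_PROT(w)  ((uint16_t)((w) & 0xffff))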
When translating a virtual address (va), the procedure is as follows:
Figure 6 illustrates the first stage of this translation process. In subsequent stages, the guard word will always have a zero-filled prefix.
Notice the section now marked MBZ in the guard word. By forcing these bits to be zero, the result of a successful guard match leaves the remaining bits of the virtual address untouched. Thus the result of the XOR performed may be used as the input to the next translation stage.
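A minimal sketch of a single matching step, under the assumptions above, is given below: the guard word is taken to carry the expected prefix in its top bits with the MBZ field zeroed, and since the encoding of the prefix length within the guard word is not described here it is passed in explicitly. Control bits such as the leaf bit are assumed to have been masked off the guard beforehand.

    #include <stdint.h>

    /* One translation step (sketch).  On a match the XOR cancels the
     * guarded prefix; because the MBZ field is zero, the remaining bits
     * of va pass through unchanged, ready for the next stage.           */
    static int guard_step(uint64_t va, uint64_t guard, unsigned prefix_len,
                          uint64_t *next_va)
    {
        uint64_t r = va ^ guard;                  /* prefix cancels on match */
        if (prefix_len && (r >> (64 - prefix_len)) != 0)
            return 0;                             /* guard mismatch: fault   */
        *next_va = r;                             /* zero-filled prefix      */
        return 1;
    }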
A single page table is used for all mappings, and is not context
switched. In addition to mapping (i.e. VA→PA)
information, each PTE also contains some protection information. This
is used on the EB164 to hold the global protections for any
individual page.
The page table base is kept in an internal processor register and hence is available to the PALCODE routines used to handle ITB and DTB misses. It is also used by the non-privileged PALCODE calls map, unmap and trans which implement the address translation functions roughly as specified in Section 4.3. The main difference is that physical addresses are not themselves passed to the calls, but rather indices into the caller's frame stack. This provides a simple way to validate the frames referred to.
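By way of illustration, these calls might be declared as below. Only ntsc_map(va, idx) is named explicitly later in the text; the other names, the types and the return conventions are assumptions.

    #include <stdint.h>

    typedef uint64_t word_t;    /* illustrative types */
    typedef uint64_t addr_t;

    /* Frames are named by index into the caller's frame stack rather than
     * by raw physical address (signatures assumed for illustration).      */
    word_t ntsc_map(addr_t va, word_t idx);  /* map va to the idx'th frame */
    word_t ntsc_unmap(addr_t va);            /* remove the mapping for va  */
    word_t ntsc_trans(addr_t va);            /* look up the mapping for va */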
As mentioned above, the global protection information is kept in the page table. Protection domain information is used to augment the rights found there: a TLB fill involves first a page-table lookup (thereby getting the mapping and global protection information), and secondly a check of the protection domain for additional access rights.
This is implemented efficiently on the EB164 by using stretch identifiers; i.e. each stretch is assigned a unique identifier. Recall that the protections on every page of a stretch must be the same. This means we need only keep an array of access rights indexed by stretch identifier. This is currently implemented by a page frame (8K) per protection domain, which can hold information for up to 16384 stretches.
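Since an 8K page holds rights for 16384 stretches, each entry must be 4 bits wide. A lookup might then be implemented as below; the nibble packing is an inference from those figures rather than a stated layout.

    #include <stdint.h>

    /* 8192 bytes / 16384 stretches = 4 bits of access rights per SID;
     * packing two SIDs per byte is assumed for illustration.          */
    static unsigned sid_rights(const uint8_t *pdom, uint16_t sid)
    {
        uint8_t b = pdom[sid >> 1];
        return (sid & 1) ? (b >> 4) : (b & 0x0f);
    }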
On a TLB miss, the virtual address is first looked up in the page table as described in Section 5.1. Assuming that this succeeds, we have a PTE which contains the PFN, the global protection bits -- and the stretch identifier! Using this stretch identifier (SID), we index into the current protection domain and retrieve the access rights. These are ORed into the PTE (and hence may only augment the access rights), and finally the TLB is updated. This procedure is illustrated in Figure 7.
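Under the same assumptions, the fill path of Figure 7 might be sketched as follows, reusing LPTE_SID and sid_rights from the earlier sketches; gpt_lookup, dtb_write and PROT_SHIFT are hypothetical names.

    extern uint64_t gpt_lookup(uint64_t ptbr, uint64_t va);  /* GPT walk   */
    extern void     dtb_write(uint64_t va, uint64_t pte);    /* TLB insert */
    #define PROT_SHIFT 0      /* position of the protection bits (assumed) */

    void dtb_miss(uint64_t ptbr, const uint8_t *cur_pdom, uint64_t va)
    {
        uint64_t lpte = gpt_lookup(ptbr, va);   /* mapping + global bits */
        unsigned sid  = LPTE_SID(lpte);
        /* OR in the domain's rights: access may only be augmented. */
        lpte |= (uint64_t)sid_rights(cur_pdom, sid) << PROT_SHIFT;
        dtb_write(va, lpte);
    }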
A pointer to the current protection domain is kept, like the page
table base, in an internal processor register. Unlike the PTBR, it
must generally be context switched.
To minimise the effects of the context switches, we make use of the
21164's address space numbers (ASNs). A pair of internal
processor registers (itbAsn and dtbAsn) may be set to
any 7-bit value, and any subsequent fill of
the TLBs is tagged with this value. The current implementation uses
one ASN per protection domain. Whenever a context switch
occurs, the ASN of the new domain's protection domain is inserted
into itbAsn and dtbAsn.
This means that the TLBs need not be flushed at all on a protection domain switch; invalid protections stored in the TLB from the previous domain simply result in misses and refills. As a further optimisation, the ASM bit is also used. This bit, if set in a TLB entry, allows the entry to cause a hit regardless of the current ASN values. Hence it may be used in Nemesis to map pages within a stretch which has significant global rights. Care must be taken to ensure that no domain has additional rights for the particular stretch, since this might cause a protection fault where none was necessary.
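A protection-domain switch might thus be sketched as follows; the register-write helpers are hypothetical, while itbAsn and dtbAsn are the internal processor registers named above.

    #include <stdint.h>

    extern void wr_itbAsn(uint64_t asn);   /* hypothetical IPR accessors */
    extern void wr_dtbAsn(uint64_t asn);

    struct pdom {
        uint8_t        asn;      /* 7-bit address space number      */
        const uint8_t *rights;   /* base of the SID-rights page     */
    };

    /* Only the ASN tags change on a protection-domain switch, so no
     * TLB flush is required; stale entries simply fail to match.    */
    static void switch_pdom(const struct pdom *next)
    {
        wr_itbAsn(next->asn & 0x7f);   /* tag subsequent ITB fills */
        wr_dtbAsn(next->asn & 0x7f);   /* tag subsequent DTB fills */
    }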
The frame stack is logically composed of two parts:
The current implementation of the frame stack is similarly bipartite. The set of frames allocated to a domain, together with their current mapped status, is kept in an array mapped read-only to that domain. This array is conveniently located within the read-only part of the domain's DCB (the DCBRO). Also kept in the DCBRO are the indices (high-water marks) recording the number of guaranteed and optimistic frames allocated to the domain.
A second array, indexed in parallel, is kept in the read-write part of the DCB (i.e. the DCBRW). This array contains 16 bits of information for each respective PFN; the format and meaning of this information is entirely up to the domain, but typically contains flags to mark pages as mapped, accessed, dirty, etc.
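An illustrative layout of the two arrays is sketched below; the field names and the capacity are assumptions, not taken from the Nemesis sources.

    #include <stdint.h>

    #define FSTACK_MAX 1024           /* illustrative capacity */

    struct frames_ro {                /* mapped read-only: in the DCBRO     */
        uint32_t n_guaranteed;        /* high-water mark: guaranteed frames */
        uint32_t n_optimistic;        /* high-water mark: optimistic frames */
        uint64_t frame[FSTACK_MAX];   /* per-entry PFN plus mapped status   */
    };

    struct frames_rw {                /* writable: in the DCBRW             */
        uint16_t info[FSTACK_MAX];    /* per-PFN flags (mapped, accessed,
                                         dirty, ...); interpretation is up
                                         to the owning domain               */
    };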
By laying out the frame stack like this, authenticating mapping attempts becomes easy. When inserting a new mapping, instead of explicitly passing down the frame, the caller gives an index into its frame stack. Thus the call looks like ntsc_map(va,idx). The called routine simply needs to:
Assuming the authentication checks are successful, the page-table is walked, the mapping inserted, the frame's mapped status updated and the relevant TLB entries flushed.
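The individual checks are not enumerated above, but given the layout sketched earlier they might plausibly reduce to something like the following (the names and the PFN packing are hypothetical):

    #include <stdint.h>

    /* Authenticate a mapping request in ntsc_map(va, idx) (sketch):
     * bounds-check the index against the caller's high-water marks,
     * then extract the PFN from the frame-stack entry.               */
    static int check_map(const struct frames_ro *fs, uint64_t idx,
                         uint64_t *pfn)
    {
        if (idx >= (uint64_t)fs->n_guaranteed + fs->n_optimistic)
            return 0;                    /* index beyond allocation     */
        *pfn = fs->frame[idx] >> 32;     /* PFN field (packing assumed) */
        return 1;
    }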
Initial allocation and subsequent extension of the frame stack are achieved via IDC to the Nemesis domain. Currently, however, there is no proper support for dealing with the revocation of frames -- a frame may be ``revoked'' by deallocating it from a particular domain, but no notification of this is currently given. A frame which is currently mapped cannot be revoked at all.
When any of the above happens, the pipeline is drained, the faulting PC is stored in an internal processor register, and the new PC is loaded by indexing from the start of the PALCODE base. The TLB miss handlers load mappings as described above, provided the translation is valid. If a miss handler determines a translation or access fault, or if one of the other three entry points is entered, then the following occurs:
At some point in the future the faulting domain will be activated. Its event demultiplexing code will notice the event on the memory management channel and call the memory management entry's notification handler. This notification handler does the following:
Once the notification handler returns, the activation handler will typically pass control to a user-level thread scheduler, which will schedule a memory management worker thread at some point. This thread checks its queue of todo structures. If there is no work present, it blocks itself. Otherwise it takes the first item from the queue and determines what action to take. There are two possibilities here:
The structure of the implementation of fault handling is illustrated in Figure 8.
In the case where Retry is returned, a memory management entry worker thread will invoke the physical stretch driver for a second time once activations are on. In this case, IDC operations are possible, and hence the stretch driver may attempt to gain additional physical frames by invoking the frame allocator via the FrameStack interface. If this succeeds, the stretch driver sets up a mapping from the faulting virtual address to a newly allocated physical frame. Otherwise the stretch driver returns Failure.
The third stretch driver implemented is the paged stretch driver. This may be considered an extension of the physical stretch driver; indeed, the bulk of its operation is precisely the same as that described above.
However, in the case where no physical frames are available for mapping, the paged stretch driver has two choices: it may request more frames via the FrameStack interface, or it may swap a page out to a backing store. The backing store currently used is a user-safe disk (USD) [Barham96,Barham97]. This allocates resources in terms of extents -- contiguous ranges of blocks on the disk -- and rate guarantees. The latter are supported by the use of a custom QoS entry in the driver which schedules requests according to these guarantees.
Each domain making use of the paged stretch driver has a pre-established binding to the USD, owns certain extents (which it may use for paging or other purposes) and holds a particular rate guarantee. Hence it is possible for the paged stretch driver to determine a predictable upper bound on the time required to load a given page from disk. This allows individual stretch drivers to make both disk layout and page replacement decisions which maximise either throughput or capacity as they see fit.
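For example, under an assumed guarantee of 1 MB/s, fetching an 8K page is bounded by roughly 8 ms of transfer time plus the scheduling delay of the QoS entry; a driver can therefore explicitly cost the choice between evicting a clean page (a single fetch) and a dirty one (a write-back followed by a fetch, roughly doubling the bound).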
The prototype implementation has proved the validity of the architecture, but is not yet complete. A number of items remain for future work, the most urgent being:
In particular, what is lacking is a good heuristic for applying local transformations to the tree when entries are added or removed nearby. Further work will be required to see whether such a heuristic can be found, and to investigate other translation structures such as clustered page tables [Talluri95].
When implementing the system on CISC Pentium/Pentium Pro machines, or on the StrongARM, more difficulties arise: for example, the fact that page tables are walked in hardware leaves little flexibility in the translation structures which may be used. Nonetheless, good progress has been made in realising a version of the VM system on the other Nemesis platforms.
Additionally, there is considerable scope for the development of stretch drivers. Currently three are implemented: a NULL implementation for nailed-down stretches, a demand-paged stretch driver which uses only physical memory, and an extended version of this which also pages to and from the USD. All of these require optimisation and might benefit from some reorganisation. A number of other ``interesting'' stretch drivers (providing support for DSVM or persistence, for example) might also be implemented.