A Fresh Approach to File System Quality of Service P. R. Barham Paul.Barham@cl.cam.ac.uk +44 1223 334600 University of Cambridge Computer Laboratory, New Museums Site, Pembroke Street, Cambdridge. CB2 3QG ENGLAND Abstract This paper describes a file system structure for sup­ porting Quality of Service (QoS) guarantees. The de­ vice driver model clearly separates control­ and data­ path operations and presents a low­level of abstrac­ tion. The data­path module provides translation and protection of I/O requests enabling the file system lay­ ers to be executed as unprivileged code within shared libraries. Scheduling of low­level operations within the device driver is used to provide isolation between clients and Quality of Service guarantees. 1 Introduction A large proportion of the code executed on behalf of an application in a traditional operating system re­ quires no additional privilege and does not, therefore, need to execute in a separate protection domain. Typ­ ically, code which must atomically and securely update important datastructures is rarely executed and usu­ ally associated with out­of­band operations such as opening or closing a file. It is only this code which must necessarily execute in a separate protection do­ main. The Nemesis operating system [1] makes use of these observations to move the majority of operating system services into the application itself, leading to a vertically structured operating system. The kernel is still responsible for implementing scheduling and pro­ tection of hardware resources but this functionality is provided at a much lower level of abstraction. In a multimedia system this has the advantage of allowing applications to make their own resource management policy decisions. Nemesis focuses on providing predictable and guar­ anteed Quality of Service (QoS) for each system re­ source, rather than optimal peak performance. This is vital when dealing with continuous media (CM) data. The structure of Nemesis and it's I/O subsystem have many similarities to operating systems designed over 25 years ago [2, 3, 4, 5]. Aspects of each of these op­ erating systems take on renewed relevance when QoS is the most important consideration. 2 Device Drivers Under Nemesis Nemesis provides QoS guarantees for I/O using a non­orthodox approach to device abstraction which completely separates the control­ and data­path func­ tions [6]. A fundamental requirement for effective pro­ vision of Quality of Service is that there is some form of ``connection'' against which to account I/O resources. All device drivers under Nemesis are therefore connec­ tion oriented -- i.e. clients send a stream of requests to the driver on a pre­established connection and these requests are serviced according to the QoS parameters associated with that connection. Only low level I/O primitives are supported with the majority of high­level data­path processing per­ formed by the client using shared libraries. The struc­ ture of a generic Nemesis device driver is shown in figure 1. As can be seen from the diagram, the func­ tionality is implemented by two main modules : 1 ffl The Device Data­Path Module (DDM) ffl The Device Control­Path Module (DCM) Whilst these two modules may often execute in the same Nemesis domain, 2 for certain devices the DCM is more sensibly implemented in a separate domain. 1 A Nemesis module is a simply a self­contained linkage unit with no external references. 2 A Nemesis domain may be considered as a process for the purposes of this paper. 1 Figure 1: Nemesis Device Driver Architecture 2.1 Device Data­Path Module (DDM) The DDM is intended to contain the minimal func­ tionality to provide secure user­level access to the hardware and support QoS guarantees to clients. The DDM serves three main purposes: - Translation (of stream addresses). It is highly desirable that the addresses present in an I/O stream should be independent of the destination of the stream. This allows a single stream to be multicast efficiently to a number of destinations and also to be processed entirely in hardware, should the workstation provide this facility. - Protection (between clients). The addresses to which each client may perform I/O are usually different. The DDM should ensure that a client cannot perform I/O to addresses for which it does not have suitable access permissions. Ad­ dress translation and protection functions are per­ formed individually for each I/O connection - Multiplexing (of physical resources). The DDM is assumed to be the single multiplexing point for a hardware resource. It is responsible not only for providing shared access to an I/O resource, but also for controlling both the amount of re­ source which each client receives and the time at which the client receives it. The DDM may pro­ vide a scheduler for this purpose. This is the most fundamental difference between a Nemesis device driver and those of traditional operating systems, microkernel or otherwise. Explicit and visible multiplexing is performed at a single point within each driver. It is at this point where hardware resources are scheduled. With each connection are associated QoS parameters which are used to control the multiplexing. These parameters are themselves determined by out­of­band negotiation between the client and a QoS­manager domain. The exact scheduling algorithms applied are likely to be de­ vice specific, but in general any algorithm which sup­ ports QoS­guarantees is potentially applicable. Exam­ ples include stride­scheduling [7], jubilee­scheduling [8] and the RSCAN algorithm for disk­head scheduling pre­ sented in section 7.1. Application domains communicate with block­ and packet­based device drivers using an asynchronous shared­memory communications mechanism known as RBufs [8]. Each connection to the driver has an inde­ pendent queue and so the QoS­crosstalk, 3 prevalent in ``first­come, first­served'' (FCFS) queueing systems, is avoided. Rbufs also provide effective low­latency feed­ back to applications since clients are able to observe both the length of the queue and the rate at which re­ quests are being serviced and may use this information to adapt their behaviour. 2.2 Device Control­Path Module (DCM) Out­of­band control of the translation, protection and multiplexing functions of the DDM is performed by a separate management entity known as the De­ vice Control­Path Module (DCM). The DCM com­ municates with the multiplexing layer of the DDM in order to set up new connections and adjust the QoS­ parameters associated with existing connections. The DCM is never involved with the in­band operations of the device driver. The DCM uses high­level descriptions of access­ permissions (e.g. meta­data for a file­system, or win­ dow placement in a window system), together with access­control policies to generate the low­level pro­ tection and translation information required by the DDM. These low­level permissions are often cached within the DDM's per­connection state records to re­ duce the number of interactions with the device man­ ager. This cache provides conceptually similar func­ tionality to the translation lookaside buffer (TLB) of modern processors. 3 File System QoS The vast majority of work in the field of disk and file­system performance is devoted to increasing total throughput and/or decreasing average response times. The large variance in service times caused by several disk scheduling algorithms is usually considered as of secondary importance. In contrast, the multimedia environment often requires that total throughput of a 3 QoS­crosstalk is the process whereby the actions of one client have an adverse affect on the QoS observed by other clients. system is sacrificed in order to maintain predictable QoS on each connection. 3.1 Block­Structured File Systems Block structured file systems are the most common variety of file system and until recently have been used by most forms of unix. They achieve high space uti­ lization and are easy to integrate with virtual memory systems and buffer­caching schemes due to the fixed­ sized unit of I/O. It is often arranged that the block­ size supported by the file system is the same as the page­size supported by the virtual­memory system. Unfortunately, block allocation is a frequent oper­ ation requiring update of shared state and in a micro­ kernel environment this incurs the cost of communi­ cation with the file­system server. When writing high volume data to disk, as is the case with CM files, it would be preferable to be able to perform disk space allocation in larger units. Since all disk I/O requests are single blocks, the disk seek overheads are extremely large in comparison with the time taken to transfer the actual data. 4 This leads to very poor performance if the pattern of disk I/O requests is essentially random. 3.2 Log­Structured File Systems As a result of analysing large traces of unix file system activity, the observations were made that the majority of files are written exactly once in their en­ tirety and that most files are fairly short­lived. By implementing a file­system where the only disk write operation permitted is to append to a log, a large frac­ tion of the seek overheads of a block­structured file­ system may be eliminated [9]. It cannot be stressed enough that the above observations are highly unix specific. The addition of a cache allows most reads to be ser­ viced from memory and in addition means than most short­lived temporary files never need to hit the disk. The log is periodically compacted to remove old and deleted file data by a background process similar to a garbage­collector. Log­structured file systems allow disk write throughput to approach the maximum transfer rate of the device, but require that all file updates are per­ formed by a single server. This potentially introduces large amounts of QoS­crosstalk. Read performance de­ pends on the amount of fragmentation in the log and the number of concurrent read requests. Attempting to read a number of continuous­media files simultane­ ously will still incur seek penalties. 4 For example, the time taken to transfer an 8KB block across a 10MHz SCSI bus is 0.8ms, whilst the typical access time (com­ prised of seek time and rotational latency) is around 15ms. Although a log­structured file system is ideally suited to simultaneous recording of a number of con­ tinuous media streams with differing QoS guaran­ tees, simultaneous playback of these streams suffers unavoidable QoS­crosstalk due to the interleaving of file data on the disk surface and the impossibility of caching large continuous media files. 3.3 Extent­Based File Systems Extent­based file systems [2, 10, 11] are a compro­ mise between the predictability of block­structured file systems and the throughput achievable using a log­ structured file system. Disk space is allocated as con­ tiguous ranges of blocks called extents. The number of blocks in an extent is variable and typically dependent on the expected size of the file. Allocating disk space in this way cuts down the fre­ quency of operations requiring access to shared state and allows disk throughput to be increased by use of larger transactions. It also results in less disk fragmen­ tation and therefore an increased likelihood of consec­ utive reads and writes. File creation and deletion is potentially more ex­ pensive than in a block­structured file system, but this can be amortised using schemes similar to the Parti­ tioned Datasets of MVS [12] where a number of files owned by the same user are grouped together into a single dataset stored in pre­allocated extents on disk. Extent­based file systems are becoming popular in situations where time constraints are important and the unpredictability of a log­structured file system is unacceptable. 4 CM File Servers A number of researchers have abandoned conven­ tional file system technology altogether in favour of dedicated CM file servers optimised for very large files[13, 14]. Such file servers are typically constructed as an embedded hard­real time system running on stand­alone machines connected to a high speed net­ work and so do not experience contention from other activities on the workstation. Disk layout is optimised for very large files -- it is not uncommon to allocate disk space, and indeed perform I/O, in units related to the physical geometry of the drive, e.g. a cylinder at a time. Often the file system is assumed to be es­ sentially read­only and its layout optimised off­line. For example, in a Video on Demand (VOD) system serving a number of streams from a single disk it is possible to achieve more predictable average­cost seek times by deliberately striping file data across the disk. A CM file server may make the assumptions that there will only be a small number of files open at any particular time and that clients will perform predomi­ nantly sequential accesses. The abstraction supported is often closer to that of a VCR with play, pause and rewind operations [15] implying that demands for I/O bandwidth are constant for prolonged periods of time. With these simplified access patterns it is possible to provide stream­oriented I/O at a predetermined rate purely by use of large amounts of read­ahead [16]. Un­ fortunately, this abstraction is completely unsuited to conventional general­purpose file access. Random ac­ cess to files is usually very slow and often not allowed at all. Where it is allowed, it often causes transient QoS problems for other streams. 5 Custom Storage Systems There are a number of applications, includ­ ing persistent programming languages such as PS­ ALGOL or Napier88 and database management sys­ tems (DBMS), whose performance is highly dependent on disk I/O and which have disk access patterns differ­ ing significantly from conventional or multimedia file access patterns. A true multi­service operating sys­ tem should be equally able to support applications of this form. It is common for DBMS software running over a conventional general­purpose operating system to use a ``raw'' disk interface, bypassing the file system layer altogether. The DBMS is allowed to use an entire disk partition in whatever manner it desires and, using application specific knowledge, is able to do a much better job of scheduling its disk accesses. It would be highly desirable to be able to make these application specific optimisations without having to bypass the file system completely. 6 The User­Safe Disk Driver The User­Safe Disk (USD) device driver provides the DDM for the disk. The driver (dev/USD) is there­ fore the single multiplexing point for disk I/O and is responsible for implementing all necessary protection between clients. The granularity of protection pro­ vided is the extent -- a contiguous range of blocks on the disk. The driver exports a privileged control interface (USDCtl.if) for each partition of each disk to which a single DCM may bind. The DCM must register a callback interface (USDCallback.if) which, amongst other things, provides a fault­handler called whenever a client attempts to access a new area of the disk. The control interface also provides operations for creation and deletion of I/O streams implemented us­ ing Nemesis Rbufs channels. With each stream is as­ sociated a QoS which may be updated via the control interface. In addition, each stream keeps a cache of disk extents which the client may access. Clients perform I/O directly to the disk by sending read or write requests down the Rbufs channel. Re­ quests consist of a small record containing the block number and the number of blocks to read or write. In the case of a write, this record is followed by iorecs pointing to the data itself. For a read, iorecs de­ scribing an empty buffer are sent. The USD driver will first check the extent cache for that stream to see if the permissions for that area of the disk are already known. If an entry is not found in the cache, then the DCM is upcalled via the USDCallback.if interface. The control interface also allows the extent cache to be explicitly loaded in advance and flushed if neces­ sary. If the client has suitable permissions for that area of the disk, then the transaction is enqueued with the disk I/O scheduler. If the transaction is not permitted, then the buffer will not be updated in the case of a read and data will simply be discarded in the case of a write. In either case, an acknowledgement is sent to the client containing the length of data successfully read or written. In order to support QoS guarantees on client con­ nections, it is vital that the USD driver schedules disk transactions. Research into disk head scheduling algo­ rithms has invariably focused on increasing the utiliza­ tion of the drive, at the expense of predictability. The resulting high latencies and variance of service times for individual transactions has proven to be a source of problems for continuous media. The exact scheduling algorithm used is largely unimportant, provided that it is capable of deliver­ ing the necessary QoS guarantees, but excessive disk head seeking can dramatically reduce the aggregate throughput of the device. The following section de­ scribes two scheduling algorithms which attempt to satisfy these requirements. 7 Disk Scheduling for QoS The performance of disk drives is usually charac­ terised by a handful of parameters related to the me­ chanical timings of the device, e.g. seek and rotation times. Although these figures are typically the only in­ formation provided with a drive, they are a dramatic over­simplification of the actual behaviour of the de­ vice [6]. The average seek and access times given in the drive specifications are sometimes of little use when trying to estimate the cost of a transaction since mod­ ern drive controllers are optimised for certain common access patterns. When these assumptions fail the re­ sults are costly. Another complicating factor is that the internal buffer memory in the drive is not simply used as a FIFO. Drive controllers will typically partition the buffer space so as to cache around 8 regions of the disk surface. Models of disk drives are usually highly inac­ curate unless they take into account these factors [17]. Due to the features described above, computing an estimate of the cost of a disk transaction is a compli­ cated process. The total cost is derived from a large number of factors, the most significant of which de­ pend on the pattern of previous accesses and invisi­ ble state within the drive [18]. Although it would be highly desirable for the microcode inside the drive to schedule transactions according to QoS parameters, it is not feasible to apply hard­real­time scheduling tech­ niques outside the drive itself. Any scheduling algo­ rithm is therefore largely at the mercy of the heuristics built into the individual drive. Figure 2: RSCAN Scheduling Algorithm 7.1 The RSCAN Algorithm The RSCAN algorithm was a simplistic first attempt at QoS directed disk head scheduling and is a compro­ mise solution which also attempts to minimise the cost of ``context­switches''. It differs from previous disk head scheduling algorithms in that it aims to provide per­client rate guarantees at the possible expense of disk utilization. The scheduler may even split a long contiguous transfer at a block boundary in order to meet the QoS guarantee of another client. RSCAN maintains a list of pending transactions for all streams sorted by block number. The scheduler at­ tempts to minimise seek overheads by servicing trans­ actions in a manner similar to the SCAN algorithm de­ scribed in [19]. Due to the impracticality of computing the cost of a disk transaction in advance and the dis­ proportionate cost of ``context­switches'', the sched­ uler accounts the actual cost of each transaction in arrears. The cost is measured in units of time, rather than bytes, so that the total amount of resource is Figure 3: RSCAN Scheduler Trace a constant. The I/O bandwidth achieved by a client within its allocated time is therefore dependent on the access patterns of that client. RSCAN uses a credit scheme based on ``leaky­ buckets'' to rate­control each stream. Higher level QoS specifications are translated into a bucket size and a rate at which credits are allocated. To sup­ port best­effort I/O, the scheduler must also maintain an estimate of the remaining slack­time in the system and distribute this in the form of additional credits between all clients with pending I/O transactions. Figure 3 shows a 20 second trace obtained from the RSCAN scheduler. The trace shows two clients, C1 and C2, attempting to use the device simultaneously. Client C1 has been guaranteed 80% of the device band­ width, whilst C2 has no guarantee and is relying en­ tirely on best­effort bandwidth. Initially C1 obtains almost 100% of the disk bandwidth. At time T1, C2 begins making I/O requests and the non­guaranteed bandwidth is shared approximately equally between C1 and C2. At time T2, C1 completes its I/O and the entire bandwidth is available for best­effort clients. The small gaps in the trace are caused by conservative behaviour of the RSCAN algorithm in the presence of both guaranteed bandwidth and best­effort traffic. Although this algorithm can produce good results, the effectiveness of the scanning algorithm is tightly coupled to the number of transactions which are al­ lowed into the sorted queue by the rate control mech­ anism. In order for the scan to be effective, higher latencies must be tolerated. Furthermore, given the inherent asynchronous na­ ture of I/O under Nemesis, each client may have an ar­ bitrary number of transactions pending and care must be taken to preserve any ordering semantics. 7.2 The Atropos Scheduler Processor bandwidth in Nemesis is also under the control of a QoS based scheduler. The default CPU scheduler is known as Atropos [20] and is able to pre­ cisely control the proportions of bulk processor band­ width allocated to each domain to resolutions of under a microsecond. Processor bandwidth requirements are specified using a tuple of the form (p; s; x) with the fol­ lowing meaning: p The allocation period of the domain in ns. s The slice of time allocated per period in ns. x Willing to accept extra time (boolean). The p and s parameters may be used to control both the amount of bandwidth and the smoothness with which it is provided. Each domain is guaranteed to receive pns of time in every s (though not necessarily contiguously). Clearly, the smaller the periods in the job set, the more frequently context­switches will be required. Atropos employs a variant of the Earliest Deadline First (EDF) algorithm [21] where the deadlines are derived from the QoS parameters of the domain and are purely internal to the scheduler. The scheduler is capable of ensuring that all guarantees are respected provided that: X i s i p i Ÿ 1 Since Atropos does not context­switch any faster than is absolutely necessary to preserve QoS guaran­ tees, it seems likely that it may also be used to sched­ ule disk transactions so as to minimise the number of seeks whilst preserving bandwidth guarantees. If clients have predominantly sequential access patterns then Atropos fulfils the same r“ole as a SCAN algorithm without reordering transactions on any given connec­ tion. The Atropos scheduler was therefore used for all subsequent experiments described in this paper. 7.3 Admission Control and Translation of QoS Parameters When using either of the above algorithms, which allocate each client a proportion of the available time in which to perform their I/O, the admission­control policy must take account of the likely seek over­ heads incurred when multiplexing between the vari­ ous streams. QoS requirements specified in terms of bandwidth must be translated into periods and slices according to the current set of QoS guarantees in the system. Figure 4: The EFS File System Although this requires that worst­case seek over­ heads be assumed, 5 useful guarantees of minimum bandwidth may still be made and this does not prevent the entire bandwidth of the disk being available in sit­ uations where worst­case overheads are not realised. A single client may read or write directly to the disk at the maximum rate supported by the physical drive. A more sophisticated algorithm could provide each client with a seek budget allowing an additional num­ ber of seek operations in each period as part of the QoS contract. Any seeks over this budget would nec­ essarily be performed using best­effort resources. This would be useful for DBMS style applications with non­ sequential access patterns, or for supporting paging in a virtual memory system. 8 The EFS File System In order to make effective use of the USD device it is also necessary to manage both disk space usage and allocation of disk bandwidth. The DCM func­ 5 An estimate may easily be calculated using the bandwidth guarantee periods and slices of each client. tions for the disk device are traditionally part of a file system. As a demonstration of the flexibility of the Nemesis I/O architecture, it was decided to im­ plement a prototype file system which could equally well support conventional, multimedia and DBMS ap­ plications. The EFS file system was written for this purpose. EFS is a simple extent­based file system, but im­ plements only the out­of­band file system functions. Files consist of a number of extents which are used to control the extent­based protection provided by the USD. EFS uses the same disk layout and meta­data format as the Continuous Media File Service described in [15], but the in­band data­path operations are im­ plemented entirely by the client and the USD. An architectural overview of the system is shown in figure 4 together with a brief description of the various Nemesis domains involved. The EFS server is respon­ sible for maintaining the meta­data associated with each file. This is performed in exactly the same man­ ner as for a traditional file system. The meta­data is cached in memory in order to reduce the latency of out­of­band operations. 6 Most clients interact with the EFS server using a shared library providing default implementations of the standard unix file operations (lib/EFS). Multi­ media clients with timing constraints usually program directly to the underlying interfaces. In either case, control­path operations require communication with the EFS server via an unprivileged Remote Procedure Call (RPC) interface which supports all of the out­ of­band control operations of a traditional file­system (EFSClient.if). This interface is used to create, de­ stroy, open and close files and to read the file meta­ data. The open operation causes the EFS server to in­ voke the USD control interface and obtain an offer for an Rbufs channel. This is returned to the client which may bind to the offer creating an I/O stream connected directly to the USD device (IO.if). The read and write implementations update the data contained in a file by sending requests on the Rbufs connection. These I/O transactions are serviced sequentially at a rate determined by the QoS associ­ ated with the connection. The asynchronous nature of disk I/O allows applications to perform appropri­ ate amounts of read­ahead and avoids the problem of the file system being required to predict the behaviour of the application. An attempt to access an area of the disk for which 6 In unix file system traces, stat operations account for a large proportion of requests. the permissions are not in the USD driver's per­stream cache causes the EFS server to be upcalled. The server uses information supplied by the USD to determine the client's permissions for that area of the disk. If the attempted access lies within an extent belonging to the file then the access is permitted and the extent is recorded in the USD cache. An attempt to write past the end of a file re­ quires the file to be extended by adding a new extent. This is achieved by performing an RPC to the EFS server. Since applications are free to perform this al­ location operation in advance, and file meta­data is write­buffered, the QoS­crosstalk introduced is mini­ mal. The size of extents is chosen by the application de­ pending on the type of data the file contains. Multi­ media files typically use large extents to minimise the overheads of checking access rights during I/O. As a rough guide, the extent size chosen should be compa­ rable to the granularity at which I/O is expected to be performed. 7 9 Evaluation In the following experiments, one or more clients at­ tempt to repeatedly read a large file as fast as possible. Each client uses a separate file to eliminate caching ef­ fects, and the files are separated across the surface of the disk. 9.1 Quality of Service Parameters Figure 5 shows the effects of changing various QoS parameters of a single stream. The left­hand plot shows the effects of changing the period and slice pa­ rameters of a client with no extra (best­effort) time. The achieved I/O bandwidth is observed to vary lin­ early with the QoS guarantee up to the full disk band­ width at 100% guarantee. The plot also shows that this is true over a wide range of periods. (note that periods are plotted using a logarithmic scale.) The right hand plot shows the effect of vary­ ing the transfer size and percentage guarantee for a stream with a 100ms period. (The block size is 512 bytes.) For all but the smallest transfers, the per­ transaction overheads do not significantly affect the achieved bandwidth. With 8K transfers and above the entire media transfer rate is achievable. This indicates some degree of read­ahead by the disk controller. 9.2 Simultaneous QoS Guarantees Figure 6 shows the results of an experiment with five simultaneously active clients, each with a different QoS guarantee ranging from 5% to 25%, and no extra 7 The CMFS file system performed low level disk I/O in units of an extent. Figure 5: Effects of QoS Guarantees on a Single Stream time. In this experiment, each client had the same period of 200ms and each client was performing 8K read transactions. The left hand graphs shows the percentage of time allocated to each client by the Atropos scheduler, and the right hand graph shows the I/O bandwidth from disk as observed by the client. 9.3 Adding New Streams An experiment was constructed where an initial client was provided with a 10% guaranteed bandwidth (over a period of a second) and extra time if available. Every 30 seconds an additional client was added to the system with a 10% guarantee. Figure 7 clearly shows the bandwidth achieved by the first client reducing in steps by approximately 10% every 30 seconds until the entire bandwidth of the disk is guaranteed and no extra time is available. It may be difficult to observe the traces of the 9 identical clients since they overlay each other precisely. The experiment was then repeated with allocation periods of 200ms and smaller transfers to investigate the effects of the much higher rate of context­switches. Figure 8 shows the results of this experiment. It may be observed that at high load the bandwidth obtained by each client is slightly reduced, presumably due to higher context­switch overheads. It seems appar­ ent that the read­ahead strategy of the disk used for these experiments is highly compatible with the Atro­ pos scheduling algorithm! 10 Conclusions and Further Work Despite the simplistic approach, the USD/EFS combination is able to provide useful QoS guaran­ tees to clients without discarding the protection tra­ ditionally provided by file­system code. This benefit is achieved at the expense of some performance. Since all disk I/O is directly accountable to the client, no complicated resource transfer mechanisms are required as is the case with a traditional file sys­ tem implementation. The CPU time expended in file system code is also accounted to the client since this code is actually a shared library within the applica­ tion. QoS­crosstalk is therefore eliminated. Once a client has been allocated an area of the disk, it may use application specific storage management policies and access patterns, whilst receiving the ben­ efits of guaranteed I/O performance. This facility al­ lows efficient implementation of persistent program­ ming languages and database management systems without bypassing the file system completely. Should an application desire, it would be perfectly possible to run a number of log­structured file systems, within EFS files, each with a guaranteed proportion of the disk I/O bandwidth! Using the USD abstraction, together with a driver providing pages of physical memory, an interesting new virtual memory (VM) system is under construc­ tion where applications are responsible for performing their own paging to disk. Such a VM system should allow fine control over the amount of physical mem­ ory and disk bandwidth dedicated to each application, effectively being able to guarantee a maximum page­ fault rate. Clients are also free to implement whatever paging strategies best suit their requirements, e.g. run­ ning a different thread while the fault is being satisfied, or perhaps deciding never to satisfy the fault. Figure 6: 5 Clients with Different QoS Guarantees (200ms period, 8K transfers) Figure 7: 10 Staggered Streams, 1s period, 32K transfers Figure 8: 10 Staggered Streams, 200ms period, 8K transfers References [1] Ian Leslie, Derek McAuley, Richard Black, Tim­ othy Roscoe, Paul Barham, David Evers, Robin Fairbairns, and Eoin Hyden. The Design and Implementation of an Operating System to Sup­ port Distributed Multimedia Applications. IEEE Journal on Selected Areas in Communication, 14(7):1280--1297, September 1996. [2] W. A. Clark. The Functional Structure of OS/360 Part III: Data Management. IBM Systems Jour­ nal, 5(1):30--51, 1966. [3] Per Brinch­Hansen. The Nucleus of a Multipro­ gramming System. Communications of the ACM, 13(4):238--241,250, April 1970. [4] E.I. Organick. The Multics System: An Exami­ nation of Its Structure. MIT Press, 1972. [5] D. Swinehart, P. Zellweger, R. Beach, and R. Hagemann. A Structural View of the Cedar Programming Environment. Technical Report CSL­86­1, Xerox Corporation, Palo Alto Re­ search Center, June 1986. (published in ACM Transactions on Computing Systems 8(4), Octo­ ber 1986). [6] Paul R. Barham. Devices in a Multi­Service Op­ erating System. PhD thesis, University of Cam­ bridge Computer Laboratory, July 1996. Avail­ able as Technical Report No. 403. [7] Carl A. Waldspurger and William E. Weihl. Stride Scheduling: Deterministic Proportional­ Share Resource Mangement. Technical report, Massachusetts Institute of Technology Computer Science Laboratory, June 1995. Technical Memo MIT/LCS/TM­528. [8] Richard J. Black. Explicit Network Scheduling. PhD thesis, University of Cambridge Computer Laboratory, 1994. [9] John Ousterhout and Fred Douglis. Beating the I/O Bottleneck: A Case for Log­Structured File Systems. ACM Operating Systems Review, 23(1):11--28, January 1989. [10] Dan Hildebrand. An Architectural Overview of QNX. In USENIX Workshop Proceedings : Micro­kernels and Other Kernel Architectures, pages 113--126, April 1992. [11] Sai­Lai Lo. A Modular and Extensible Network Storage Architecture. PhD thesis, University of Cambridge Computer Laboratory, 1993. Avail­ able as Technical Report No. 326. [12] IBM Systems. OS/VS2 MVS: Overview, 2nd edi­ tion, 1980. [13] Peter Bosch, Sape Mullender, and Tage Stabell­ Kulø. Huygens File Service and Storage Architec­ ture. In BROADCAST Technical Report Series, 1st year report, October 1993. [14] Philip Lougher and Doug Shepherd. The Design of a Storage Server for Continuous Media. The Computer Journal, 36(1):32--42, February 1993. [15] Paul Jardetzky. Network File Server Design for Continuous Media. PhD thesis, University of Cambridge Computer Laboratory, October 1992. Available as Technical Report No. 268. [16] A. L. Narashima Reddy and James C. Wyllie. I/O Issues in a Multimedia System. IEEE Com­ puter, pages 69--74, March 1994. [17] Chris Ruemmler and John Wilkes. An Introduc­ tion to Disk Drive Modelling. IEEE Computer, pages 17--28, March 1994. [18] Bruce L. Worthington, Gregory R. Ganger, and Yale N. Patt. Scheduling Algorithms for Modern disk Drives. In Proceedings of ACM SIGMET­ RICS '94, Santa Clara, CA. USA, pages 241--251, May 1994. [19] E. G. Coffman, L. A. Klimko, and B. Ryan. Analysis of Scanning Policies for Reducing Disk Seek Times. SIAM Journal of Computing, 1(3), September 1972. [20] Timothy Roscoe. The Structure of a Multi­ Service Operating System. PhD thesis, University of Cambridge Computer Laboratory, April 1995. Available as Technical Report No. 376. [21] C. L. Liu and James W. Layland. Scheduling Al­ gorithms for Multiprogramming in a Hard­Real­ Time Environment. Journal of the Association for Computing Machinery, 20(1):46--61, January 1973.