The storage system in Pegasus is intended to store traditional file data as well as multimedia data efficiently. A storage service for multimedia data must have a large storage capacity (video produces half a megabyte per second compressed, so a half-hour video already occupies a gigabyte) and a guaranteed (fixed) service rate.
Ordinary data usually occupies less space and does not require a guaranteed service rate. The data rate does not have to be constant, but should be as high as possible. Locality of reference can be exploited by caching data in client and/or server memory. Most modern file systems demonstrate that caching yields substantial performance gains.
This applies to naming data as well, although directories can be cached more effectively when the semantics of directory operations are exploited in the caching algorithms.
In contrast, caching video and audio is usually not a good idea. If the system can already guarantee the appropriate rate for a video or audio stream when it is not cached, caching it merely uses up memory without improving performance; a fixed rate is all that is desired. To make matters worse, caching would often be counterproductive: most video sequences and many audio sequences are larger than the cache, so, by the time a user has seen, or an application has processed, a video to the end, the beginning has already been evicted from the (LRU) cache.
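The eviction effect is easy to see in a small simulation (a sketch in Python, not Pegasus code): a stream with more blocks than the cache holds, played to the end and then replayed, never hits an LRU cache, because every block is evicted before it is needed again.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, cache_size):
    """Simulate an LRU block cache; return the fraction of accesses that hit."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used block
    return hits / len(accesses)

# A video of 100 blocks, played to the end and then replayed,
# through a cache that holds only 50 blocks:
print(lru_hit_rate(list(range(100)) * 2, cache_size=50))   # prints 0.0
```

By the time block 0 is requested again, blocks 50-99 of the first pass have pushed it out, and the replay then evicts each block just before its reuse.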
Since different kinds of data require different treatment in our storage service, we decided on a hierarchical design, in which a common bottom layer is responsible for reading and writing the data on secondary and tertiary storage devices and for maintaining the storage structures on them. Above this layer, different service stacks can be built using specialized algorithms for particular kinds of data.
These service stacks can be partially or wholly mirrored in file-server agents on client machines. Thus, caching strategies, for instance, can be jointly implemented by corresponding layers of code in client and server machines.
The service stack for continuous data on the server has been designed to interact directly with the multimedia devices of Pegasus. As described in Section 2, continuous streams composed of several substreams (synchronized video and audio is a typical example) will cause several data streams and one control stream to be generated. The storage server stores the data streams and uses the control stream to generate indexing information. This information then allows reading synchronized streams from a particular point, and fast forward, reverse play, etc. of these streams.
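Such an index can be pictured as a mapping from a point in presentation time to a byte offset in every substream. The sketch below illustrates the idea only; the record format, class, and field names are hypothetical, not the Pegasus data structures.

```python
from bisect import bisect_right

class StreamIndex:
    """Maps a presentation time to a byte offset in every substream."""
    def __init__(self):
        self.times = []     # ascending timestamps from the control stream
        self.offsets = []   # parallel list of {substream name: byte offset}

    def add_control_record(self, t, offsets):
        """Called for every record the control stream generates."""
        self.times.append(t)
        self.offsets.append(dict(offsets))

    def seek(self, t):
        """Return, per substream, the offset to start reading from at time t."""
        i = bisect_right(self.times, t) - 1
        return self.offsets[max(i, 0)]

idx = StreamIndex()
idx.add_control_record(0.0, {"video": 0, "audio": 0})
idx.add_control_record(1.0, {"video": 500_000, "audio": 16_000})
print(idx.seek(1.5))   # prints {'video': 500000, 'audio': 16000}
```

Because the same timestamp resolves to an offset in each substream, a seek lands at mutually synchronized positions, which is what fast forward and reverse play need.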
The bottom layer of the Pegasus storage service is called the core layer. It manages storage structures on secondary and tertiary storage devices and carries out the actual I/O. Pegasus uses a log-structured storage layout as was exemplified by Sprite LFS [.rosenblum sosp13.].
The log is segmented in megabyte segments. Each segment is striped across four disks. A fifth disk is used as a parity disk and allows recovery from disk errors.
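The parity scheme can be illustrated with byte-wise XOR, as in a RAID organization with a dedicated parity disk: the parity unit is the XOR of the four data units, and any one lost unit is the XOR of the survivors and the parity. The stripe-unit handling below is a simplified sketch, not the actual Pegasus layout.

```python
from functools import reduce

def xor_blocks(a, b):
    """Byte-wise XOR of two equally long blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_with_parity(segment, n_data=4):
    """Split a segment into n_data stripe units plus one XOR parity unit."""
    unit = (len(segment) + n_data - 1) // n_data
    stripes = [segment[i * unit:(i + 1) * unit].ljust(unit, b'\0')
               for i in range(n_data)]
    parity = reduce(xor_blocks, stripes)
    return stripes, parity

def recover(stripes, parity, lost):
    """Rebuild the stripe unit of a failed disk from the survivors and parity."""
    survivors = [s for i, s in enumerate(stripes) if i != lost]
    return reduce(xor_blocks, survivors, parity)

stripes, parity = stripe_with_parity(bytes(range(16)))
assert recover(stripes, parity, lost=2) == stripes[2]
```

Since parity covers exactly one unit per stripe, the scheme tolerates the failure of any single disk out of the five.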
Normal file data ends up in the log much as in Sprite LFS. Continuous data, however, is collected in separate segments, although its metadata (the inodes, or pnodes as we call them) is appended to the normal log.
On modern disks, the seek overhead between reading and writing whole segments is less than ten per cent, so a transfer rate of at least five megabytes per second per disk is possible on high-performance disk hardware. Striping over four disks makes a total bandwidth of 20 MB per second possible. We have not been able to test this yet, since our ATM network runs at only 100 megabits per second, about 12 MB per second.
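As a quick check of these figures (the rates are taken from the text; the bits-to-bytes conversion is the only added step):

```python
disk_rate_mb = 5                       # MB/s per disk with segment-sized transfers
stripe_width = 4                       # data disks in the stripe
aggregate = disk_rate_mb * stripe_width
atm_mb = 100 / 8                       # 100 Mbit/s ATM link, in MB/s
print(aggregate, atm_mb)               # prints 20 12.5
```

The network, not the striped disk array, is thus the bottleneck in the current setup.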
Partly as a consequence of storing multimedia data, we have to expect that our storage service will grow large. We have set ourselves the goal of making the storage-service algorithms scale to a system size of 10 terabytes. Cleaning algorithms for a storage service of this size have to be designed carefully: if any part of the cleaning process scales with, say, the square of the system size, cleaning a multi-terabyte file system will take prohibitively long.
We are currently implementing a cleaning algorithm whose complexity depends only on the number of segments to be cleaned and the amount of `garbage'. Roughly, it works as follows. During normal operation of the file system, the core maintains a garbage file. Every time a client write or delete operation creates garbage, an entry describing the hole in the log that corresponds to the obsolete data is appended to the garbage file.
When the file system needs to be cleaned, the garbage file is read and its entries are sorted by segment number. Then, a single pass through the garbage file is needed to find and clean all segments containing garbage. When cleaning is complete, the garbage file is truncated to a single entry describing the old garbage file itself.
Allowing client operations to continue during cleaning does not complicate the cleaning algorithm. At the start of a cleaning operation, the current place in the garbage file must be marked and cleaning uses only information before the marker while new garbage is appended after it. When cleaning is complete, the portion of the garbage file before the marker is deleted.
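The whole procedure can be sketched as follows. The in-memory list stands in for the on-disk garbage file, and the entry format (segment, offset, length) is an assumption made for illustration; the marker logic matches the description above.

```python
from itertools import groupby

class Cleaner:
    """Sketch of the garbage-file cleaner described in the text."""
    def __init__(self):
        self.garbage = []                 # append-only, like the garbage file

    def note_garbage(self, segment, offset, length):
        """Called whenever a write or delete obsoletes data in the log."""
        self.garbage.append((segment, offset, length))

    def clean(self, clean_segment):
        marker = len(self.garbage)            # new garbage lands after the marker
        entries = sorted(self.garbage[:marker])   # sort by segment number
        # single pass: visit each segment containing garbage exactly once
        for seg, holes in groupby(entries, key=lambda e: e[0]):
            clean_segment(seg, [(off, length) for _, off, length in holes])
        del self.garbage[:marker]             # drop the cleaned portion

cleaner = Cleaner()
cleaner.note_garbage(3, 0, 10)
cleaner.note_garbage(1, 5, 2)
cleaner.note_garbage(3, 20, 4)
cleaned = []
cleaner.clean(lambda seg, holes: cleaned.append(seg))
print(cleaned)   # prints [1, 3] -- each dirty segment visited once, in order
```

The cost is the sort plus one pass over the entries, so it depends only on the amount of garbage, not on the total system size; entries appended during cleaning simply wait for the next round.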
The first prototype of the core of the Pegasus file server is now running, albeit with an incomplete cleaning mechanism. Higher-level services are being added; a Unix v-node interface has been installed, which allows the storage system to be used as a Unix file system.
Since files are stored on RAID, recovery from disk failures is straightforward. Once files have reached the disk, it is unlikely that they will be lost in a crash. Files, therefore, should be put on disk as soon as possible after they are written by the application. From a performance viewpoint, however, disk writes should be delayed so that overwrite and delete operations can be exploited to save disk operations. In the Pegasus storage service we have tried to get the best of both worlds.
For this, we make use of the assumption that client and server machines crash independently. When an application makes a write operation, the client agent sends the data to the server and keeps a copy of the data in its buffers. When the server receives the data, it acknowledges this to the client agent which, in turn, unblocks the application. The data is now safe under single-point failures: when the server crashes, the client agent notices and either writes the data to an alternative server or waits for the crashed server to come back up; when the client machine crashes, the server will complete the write operation.
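The protocol can be sketched as follows (class and method names are invented for illustration). The point is that the acknowledgement confirms receipt in the server's volatile buffers, not a completed disk write, and the client's retained copy covers a server crash in the meantime.

```python
class Server:
    """Buffers incoming writes in volatile memory; the disk write is delayed."""
    def __init__(self):
        self.buffers = {}

    def receive(self, file_id, data):
        self.buffers[file_id] = data
        return True          # acknowledges receipt, not a completed disk write

class ClientAgent:
    """Keeps a copy of each write until the server has written it to the log."""
    def __init__(self, server):
        self.server = server
        self.pending = {}    # data held for recovery if the server crashes

    def write(self, file_id, data):
        self.pending[file_id] = data          # this copy survives a server crash
        if self.server.receive(file_id, data):
            return True                       # application unblocks here
        # server unreachable: retry on another server or wait for a restart
        return False

    def on_flushed(self, file_id):
        """Server reports the data safely on disk: the copy can be released."""
        self.pending.pop(file_id, None)

agent = ClientAgent(Server())
agent.write("report", b"draft")
assert "report" in agent.pending      # still covered against a server crash
agent.on_flushed("report")
assert "report" not in agent.pending  # the server now has it on stable storage
```

A loss then requires both machines to fail before the delayed write completes, which the independent-failure assumption makes unlikely; the power-failure case is handled separately, as described below.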
When there is a power failure, client and server will crash together. To guard against this, the servers can either be equipped with battery-backed-up memory, or with an uninterruptible power supply. With the latter, when a power failure occurs, the server has time to write its volatile-memory buffers to disk and halt.
These mechanisms obviate the need for writing data to disk quickly. For normal file traffic, this is not only beneficial for write performance -- [.baker sosp13.] showed that 70% of files are deleted or overwritten within 30 seconds -- but also for cleaning performance: the data that does eventually get written to the log is reasonably stable, so garbage is created at a much lower rate.