Task activity vectors Different functional unit on process run at different temperatures. This is dependent on the task. You get hotspots in some areas. Hotspots decrease chip lifetime, cause a performance penalty (due to throttling), but scheduling policies are unaware of this. New approach: schedule suitable tasks to influence power consumption and temperature. e.g. If the FPU is a hotspot, schedule an integer-based task next. Need to characterise tasks by the locations at which they dissipate heat (using activity vectors), and they then introduce a scheduling policy based on these. Characterising tasks could use chip temperature sensors, but these have insufficient spatial and temporal resolution. In their appraoch, you provide a priori information about where a task will dissipate heat. Activity vector is an n-vector with a component per functional unit, value is the utilisation of a unit (inferred from performance counters) when the task is running. Runqueue sorting schedules tasks in a sequence that uses different functional units between subsequent tasks. Leads to a lower maximum temperature in the individual units. Don't keep a sorted runqueue, but instead sort lazily, keeping two runqueues: search in the first queue, after execution insert the task in the second queue, maintaining order. Switch queues when the first queue is empty. Problem: temperature is determined by the previously running task, which is only the case if the task ran for a sufficiently long time (otherwise the task before that will have an effect). Also have to take into account the neighbouring units. Solution is to consider the unit temperature when choosing the next task, performing online temperature estimation and power estimates from the performance counters. So, schedule tasks that use units with low temperature, which has a higher scheduling overhead but gives a much better schedule. Activity balancing: influence the contents of the runqueue using task migrations, to ensure that there is a mixture of tasks on each CPU (so that runqueue sorting works). Works for SMP, but not SMT. (Applying runqueue sorting on the same physical chip will often schedule two or more things on the same functional unit at once.) Activity unbalancing for SMT: schedule tasks using different functional units on sibling processors. Collect tasks that use the same functional units on the same logical processor. Evaluation: implemented on Linux, an 8-way Xeon, ran benchmarks. Record power consumption, but this doesn't provide enough informaton so they repeated this on a temperature simulator. Run three instances each of an integer-heavy and FPU-heavy chip. Simple runqueue sorting removes spikes from the temperature distribution, and spends much less time at the higher temperature. Enhanced runqueue sorting does even better, reducing the maximum and diminishing the hotspot effect, though with a slightly higher low-temperature. 1.3--1.5% overhead from the (simple to enhanced) runqueue sorting. Activity balancing has a less pronounced effect, as there is less of a high-temperature spike, and removes a big low-temperature spike. 0.3% performance overhead. Activity unbalancing gives a 3.6% performance gain, and concentrates the temperature distribution around 75--77 degrees C. Much lower maximum. Apparently no effect on throttling, because the throttling hardware isn't sophisticated enough to take individual functional units into account. Q: realistic workloads - might not be generally applicable when you tend only to have a single task that is ready to run. Might work better when you have server consolidation and virtualisation. --- Efficient Guaranteed Disk Request Scheduling with Fahrrad Want to guarantee performance for I/O with hard and soft real-time and best-effort approaches. e.g. Backup (throughput expectations), multimedia, transaction processing (latency) and other apps (responsiveness). There is a tradeoff between good guarantees and good performance. Traditionally just reserve throughput (based on worst-case response time assumptions), which means you can only reserve a small fraction of total bandwidth. Fahrrad is based on actual (not worst case) response times, which means almost 100% is reservable and performance is better. Implemented as a disk driver. Reservations made via a broker. Given requirements and I/O behaviour, the broker translates these into the disk-time utilisation and reservation granularity (period for reservation). Each stream is guaranteed some portion of disk time. Disk time utilisation: easy to reserve, measure, manage and guarantee. No dependent on the workload, and minimises worst-case assumptions which leads to high performance and workload isolation. Guarantees throughput for a specified workload. Use EDF to schedule, which is feasible as long there is room for one worst case request in the reservation with the shortest period. Since requests are non-preemptible, the utilisation must also include an additional worst case request at the end of the period. Each request has a micro-deadline, and dispatch requests based on the earliest micro-deadline first. The micro-deadlines can be shifted based on the actual I/O time. Ordering matters, so we need to allow efficient reordering without violating any guarantees. Determine the largest set of requests that can be executed in any order. Need to account for inter-stream seeks, so pad utilisation with two additional seeks per reservation period. Implementation as a blkdev in Linux 2.6.17. 2406 lines of code. Applications make reservations via ioctl. fcntl associates a stream with the reservation. Changing the reservation period of a soft real-time stream has no effect on other streams (but longer period => better throughput). On the other hand, changing the period of a hard real-time stream affects the throughput of other soft real-time streams. Fahrrad does much better than Linux (in terms of constant throughput), with two media streams, a transaction stream and a best-effort background scheme. --- Exploiting Idle Resources from Outside Operating Systems Using underutilised resources for things like open computing projects or important local tasks (disk error checking, backup, virus checking, etc.). But they must not disturb foreground processes, which are more important/time-critical. Priority-based schedulers don't guarantee strict prioritisation of FG processes over BG processes: may get contention in places other than the CPU, and priorities may dynamically change. Mechanism must be easily deployable, consider different resource types in combination (e.g. disk and network I/O), and handle varied workloads. Approach is to estimate resource usage at user level (using dtrace), derive resource shares from this, and suspend BG processes when their shares decrease. FG processes will use the reclaimed resource. To monitor CPU use, measure time; disks, count blocks read/written synchronously; and network, count number of send and write calls. Don't consider asynchronous I/O and network traffic. Scheme needs to ignore some system services that consistently consume resources (e.g. the swapper, X), and not include them in the calculation of the share for BG tasks. Also need to take idleness into account (if foreground is only using 20%, then the background share can be higher than the threshold). Must also take multiprocessor schemes into account. Must judge when to resume the suspended BG processes. Use exponentially increasing scheduling intervals: if they don't cause contention, then let them run (exponential backoff for contention). Also attempt to detect idle periods (when the swapper has a high CPU share, or no disk or network requests have been issued recently). Implementation: a daemon on Solaris 10, using dtrace to get the system information. Background processes are contrlled by signals from the daemon. Evaluation: show that FG throughput isn't degraded by the BG processes. Compared against executing BG processes with the maximum nice value. Run over disk microbenchmarks. Shows that the effect of running background processes doesn't increase the execution time of the foreground applications, compared to just using nice (which leads to longer execution times). --- Parallax: Virtual Disks for Virtual Machines Present demands of VMs and looking at potential future uses. Gold mastering (creating and provisioning disks). Read-only data sharing (keep sizes manageable as number of VMs increases). Limitless lightweight snapshots. Each physical host gains a "storage appliance VM" running Parallax, which serves up virtual disks, possibly based on a centralised virtualised storage system. Can use this to implement cool features that would be hard to get into commercial systems. Expands the storage domain from the storage server into the physical hosts. Virtual disk structure maps the logical/virtual address (seen by the guest file system) onto physical blocks. Like a page table. For read only, we can share parts of the table. Read sharing is critical for space efficiency. Write sharing can be done at a higher level. Avoids complexity/performance/failure characteristics of distributed lock management. Allocation done from large (2GB) extents, in a lock free manner. Extents can be typed to separated data from metadata (i.e. the virtual disk structure radix tree). Guests take write locks over whole trees. The Parallax instance locks a whole extent when it wants to allocate from it. Snapshotting simply clones the root of the tree, and changes the pointer types in the new root to be read-only. Subsequent writes are copy-on-write. Gold mastering takes a snapshot when the system is in a well-defined state, and the read-only sharing/CoW will keep storage costs down. Garbage Collection is parallel mark and sweep as a background process -- does not pause or lock live disks. Performance is linear with the system size. Allocation bitmaps are resident in memory (100MB per TB of disk storage). Might look at paging this to disk. Performance is marginally worse than a straightforward virtualised disk or direct (Dom0) access, on the Bonnie+ benchmark. Postmark benchmark within 10%. Compared with accesses on a NetApp filer, and with a local disk (Parallax looks better against the filer). Snapshot latency is about 1ms (8kb to disk). Under load, this is 90% of the time within 10ms. Increaseing the snapshotting interval has little effect (~5%) on the duration of a kernel build. Lots of smart performance improvements and caching, not discussed in the talk, but in the paper. Related to whole-system state recording (NSDI paper). Want to look at defragmentation and linearization - a known problem with CoW that they throw blocks all over the disk. Can use a lineariser in the background, in conjunction with the garbage collector. Super page mapping, and making file systems aware of Parallax would be future work also. Q: Small block size for better sharing and because it matches the block size of most file systems. Q: Could consider studying the differences between conventional and VM-based systems. --- Replication Degree Customization for High Availability More replication - more availability, but more space consumed. Skew in the popularity distribution of objects - we should replicate popular objects more. Is the space constraint important? Yes, because there is a high failure rate (and therefore we need a lot of replication), and, for performance, we want to keep things in RAM. Most related work uses uniform-degree replication. Some are slightly adaptive (give higher degree to popular objects, lower degree to the rest). Some ideas from erasure coding used. Can come up with an analytical constraint that you can optimise (based on popularity and degree of replication) - minimum expected unavailability. Replication degree for some object is some constant plus the log (to the base of the inverse of the machine failure probability) of the ratio of its popularity to its size. Complex system models. What about multi-object operations? Unavailability is approximately the sum of the unavailability of the sum of the individual objects. Also non-uniform rates: the probability becomes the average per-machine failure rate (as long as you use random replica placement). Object popularity:size skewness. We know popularity is skewed (power law, e.g. web store or keyword search). In a web store object size is independent of the popularity. Which is great! But in keyword search, popular keywords appear in a lot of documents (so the size is skewed as well). So, in this scheme, the most popular objects do not necessarily get the most replication. Popularity stability: is popularity stable? Comparing traces over a month-long period, the web store is quite stable, but with a few large variations. With the keyword search, it's much more stable. But this doesn't deal with "flash crowds" -- e.g. when a popular patch is released. Dynamic maintenance overhead -- if the system adapts. But the object popularities and replica assignments are relatively stable, so there isn't too much overhead. With coarse-grain adaptation (just low/high replication degrees), the overhead is even less. Trace-driven evaluation (taking a real trace from Planetlab or based on Internet server failures). Reduces the unavailability by a factor of between 10 and 200, on both traces, for four different algorithms. The availability improvement is independent of the space limit. Maintenance overhead is less than 0.3%. Q: independent failures is a strong assumption. Don't have a direct analytical solution for correlated failures. But the system does work well in the case (the PlanetLab trace) where there are correlated failures. --- GreenFS: Making Enterprise Computers Greenet by Protecting Them Better Half the price of a hard disk is the purchase price: the other half is in electricity for it. They generate noise, which has a detrimental effect on our productivity. And they are more fragile when they are spinning. So we want them to spin less. Classic solution is to use timeouts: if there are no requests pending, after some time, just spin them down. But spinning up and down the disk can shorten the life of the disk. So in reality, this solution isn't used. Hybrid drives with a non-volatile flash cache improve this. Diskless architectures are unloved by clients because of the latency and lack of off-line working. Data reliability and user convenience are more important than time and noise. GreenFS performs all-time all-data backup. And at the same time it is greener. Clients have flash and perform versioning. The server eliminates redundancy and stores the data on a disk array. Offline working: a hybrid disk implemented at the file system level. Can use a pen drive in conjunction with remote storage to reconstruct your data even if your laptop is stolen. Occasionally need to synchronise the local version from the server (so, on shutdown, when many writes are outstanding, when connectivity is poor, periodically and manually). Uses fan-out stackable file system. Mounted over ext3 for the HDD and flash drive, and NFS. Can also stack in, for example, an encrypted file system on the HDD (in case of theft). And stack NFS over VersionFS. Evaluation: UnionFS has a 1--3% overhead on top of normal ext2/3 FS. NFS has a big latency. Tested on a compile of OpenSSH. GreenFS has a slightly higher overhead than UnionFS, but much faster than NFS alone. For emacs invocation, GreenFS is 2x faster than the HDD alone. Saves 2.5--14% power (especially good for server and desktop, but less advantage for the notebook, because the power consumption is highly optimised). Evaluation of shock protection based on a scenario with almost perfect network availability, so GreenFS had almost total protection of the hard disk platters. --- VPFS: Building a Virtual Private File System with a Small Trusted Computing Base Solving the problems of isolation and huge TCB when using applications on top of a legacy OS. Add one file system per application. A general purpose file system that makes a minimal addition to the application TCB. Run an untrusted VPFS proxy on top of the legacy OS, and have it communicate with both the hard disk and the trusted VPFS server in the trusted domain. Want to protect Confidentiality and Integrity -- Availability is impossible if you rely on a untrusted component, but we can still focus on Recoverability. VPFS uses a TPM to enhance its protection - in particular sealed storage. VPFS server is a file system wrapper that does a similar job to a VPN client -- tunnelling secure data through an untrusted channel. Integrity is based on a Merkle hash tree. A master hash file is stored in sealed memory, and contains master entries, which are like inodes in a Unix FS. The directory structure is first-class metadata and hence security critical. Confidentiality: want to keep filenames and directory structure secret. We can encrypt filenames and use a flat naming scheme in the untrusted file system to hide the directory structure. Untrusted Naming. VPFS server encrypts the pathname, and passing this to the (untrusted) VPFS proxy, which can then use the legacy file system API to do lookup. For integrity, we need to make sure that the correct file is being accessed. Also a file can't be overwritten silently. If you can't open an existing file, this is a deletion attack; deleted files should no longer be accessible. And directory listings must be correct, complete and up-to-date (not provided by this naming scheme). Untrusted naming doesn't support hard links or ls. An alternative is trusted directories, in the TCB, which provides these features at the cost of more code in the TCB. VPFS is 4000 to 4600 lines of code; against 50000 LOC for the Linux file-system stack. Performance evaluated using Bonnie++. VPFS with plaintext files is marginally slower for reads, but over 20% slower for writes (versus a native baseline). On encrypted files, it is about 40% slower for reads and even more for writes, but it is better than EncFS. Future work: recoverability (trusted backup server), and journalling/robustness using a split journalling mechanism. --- Application-Level Isolation and Recovery with Solitude Want to mitigate the risk of attack to networked applications. A browser application can erase the entire home directory; server programs run with root privileges. After an attack, we may lose data, and malware could have entangled itself in the system. Recovery is manual, error prone and time consuming. So want to limit the damage that an attack can cause, and make it easy to recover from it. So, isolate applications from each other, limiting their access (principle of least privilege). Then use taint analysis to evaluate the effects of an attack, and selectively revert these effects. Solitude architecture gives each application an isolated CoW filesystem. Modifications are isolated within the application environment, but some files may be shared or critical: this "contamination" of the trusted base file system is tracked and used in taint detection as follows. Isolated File System (IFS) is the CoW file system, per-application. Most applications can run without modification on this. This protects the base filesystem, stops applications affecting each another. Applications run within a chroot jain, with disabled root privileges, but this doesn't work for setuid/some server-side applications. Linux capabilities are used to give explicit capabilities to these programs running within that environment. Fine-grained file privileges are provided. A specification file containing the capabilities and file privileges must be written for these applications. You risk losing critical data if you just discard the CoW FS when an intrusion is detected. Users are allowed to commit specific files to the base FS. Also a need to share files across these isolated applications. So allow applications to commit files in certain directories (by adding entries to the capability file). e.g. a web browser could have this right over the downloads directory. CoW enables pre-commit recovery. Post-commit recovery is taint-based. All committed files are tainted, and taint analysis at the system-call layer can determine the activity of the attack. Selected system calls within the log are replayed to recover the system. Evaluation: degree of sharing, storage overhead and performance overhead. Sharing: track all file operations on two machines (an author's workstation and a group server). About 0.6% of client files are written by >1 application; a higher percentage on the server (but some are temp files). Storage overhead is improved over previous system (Taser) as fewer operations (FS only?) are logged. Negligible percentage of client-machine calls needed to be logged (about 20Mb over three weeks); 2.4% of events on the server (about 630Mb over four weeks). Performance evaluation on untar and build of the Linux kernel sources. Server workload involved downloading and uploading a large file. Untar is really expensive (5x overhead), but download is almost the same as base Linux. Other tasks are around 2x overhead. FUSE implementation is to blame for some of the poor performance. --- Towards Cinematic Internet Video-on-Demand VoD is costly. P2P has helped some applications, but can it help VoD, when we have high bandwidth and real-time constraints? GridCast uses an ActiveX control on a website or a dedicated application. It has a hybrid client-server/P2P architecture. A central tracker and source server exist, but peers can fetch chunks of the file from other peers. Evaluated GridCast by measuring the concurrency (people watching the same video), chunk cost (proportion of chunks that had to be fetched from the source) and continuity. GridCast gets a good cost (which should be (1/number of concurrent hosts -- only a single peer fetches from the source server). Based on caching only the currently-watched video. However 80% of viewing sessions have a concurrency of 1 -- long tail in video watching. Users have 90% spare upload capacity and 60% spare download capacity -- about 2.5Mbps in either direction is used. If you cache multiple videos, the chunk cost decreases by ~75%. Classify peer-cache misses by causes: it might be new, the peer may have left, or evicted, maybe you cannot connect or there is insufficient bandwidth. In single-video caching, most misses are caused by the peer being evicted; in multi-video cachine, it's mostly peer departure.