Batch modification with sequential commit -- BOSC

SYNC vs ASYNC reads: reads are show stoppers, because they need to be synchronous. Update operation = read + write. Does the read need to be synchronous? Making it async can make disk scheduling much easier.

Another app: B-tree. Modify function distinctly non-trivial. Sort function in leaves requires data in leaves. Answer: async fetch the block from disk, then blindly apply required logic, then store it back.

Logging. Data in temp buffer can be lost following crash. Preventing that requires a persistent store. Want to make that *one* update faster. Answer: just write log records to wherever the disk head happens to be right now. Call it TRAIL logging. Assumption seems to be that you have a dedicated logging disk. Implemented as a user-level library or as a kernel thing.

Eval: B+-tree thing. Show that a normal B+-tree implementation doesn't get better when you have more buffer memory. BOSC does much better (20x faster), but still doesn't scale past 1GB of buffer space. Another one: dedupe of on-disk structures with a garbage collect. Again BOSC much faster -- 25 to 50 times -- but it's still not clear where they got their baseline from.

Q: You compare against vanilla implementations. Where did they come from?
A: We designed them.
Q: SSDs. Comment.
A: Should work for flash disks. Not in paper, but the sync vs. async read thing should apply anywhere.
Q: How hard is it for the programmer to segregate sync and async reads?
A: Tricky, but most of the time they just need to express things as updates, which is a bit easier.
Q: How do you handle failure of logging disk?
A: Haven't considered it. Just use RAID.
Q: RAID5 -> read-modify-write cycle.
A: Could use RAID1 instead. Really want the head-synced form of RAID.

---------------------------------------

Adrian Caulfield
UCSD
Moneta-Direct: Providing safe, user-space access to fast solid state disks

Virtualisation to eliminate overheads.
There is lots of data in the world, but it is hard to turn data into information. Considering faster-than-flash NVRAM. Assume as fast as DRAM, as dense as flash, non-volatile, and reliable. Project 12us latency, 1.7GB/s bandwidth, i.e. very, very fast. Means that OS overheads dominate hardware ones.

Moneta == previous work. SSD for fast NVM, using DDR2 to emulate phase change memory by introducing some extra delays. Careful optimisation got OS latency down to 5us, filesystem down to 5us. Filesystem costs you about 7x performance on a write-intensive workload (using XFS).

Answer: move the access control down to hardware, leave control path in OS, and allow data path to be exposed all the way to userspace. ``Virtualise the interface, not the device''. Basically do the VMDQ thing for splitting the device many ways. Prototype supports 1000 independent channels. Uses an extent-based permission table. 16k entries, shared between channels, but permissions are per-channel.

Problem: libmoneta cache and block cache are not coherent. Also, file fragmentation can lead to overflowing the cache -> poor performance.

Result: reduce cost of OS and filesystem. Still have a bit of a bottleneck in interrupt handler. Fixes:
-- spin and wait for hardware to DMA in a particular completion flag
-- kernel takes interrupt, sets completion flag
-- or the usual sleep mechanism.
User-space spin is the fastest, in wake-run terms, and sleep the slowest, but sleep gives the highest bandwidth if request size >16KiB, due to better CPU multiplexing.

Overall results: *much* better for small accesses, marginally better for large ones. For their workload the results are almost optimal, in the sense that FS-level activity sees similar performance to going to the raw disk. Without application changes, which is quite cool. Also have an async interface, which helps further but requires API change. Seems to be a bigger win for u-benchmarks (~5x) than macros (~2x), but still not bad. Also show better scaling with CPU count.
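
Aside on the extent-based permission table: the notes only say that it holds 16k entries shared between channels, with permissions kept per channel. A minimal sketch of how such a check might look, under loud assumptions: the class name, the method names, the READ/WRITE bits and the "refuse new grants when full" policy are all hypothetical, and the real check happens in hardware, not in Python.

```python
from bisect import bisect_right

READ, WRITE = 1, 2  # hypothetical permission bits


class PermissionTable:
    """Toy model: one shared entry budget, separate extent lists per channel."""

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity  # total entries shared by all channels
        self.used = 0
        self.channels = {}  # channel id -> sorted list of (start, length, perms)

    def grant(self, channel, start, length, perms):
        """Record an extent for one channel; False if the shared table is full."""
        if self.used >= self.capacity:
            return False  # assumption: caller must evict or take a slow path
        extents = self.channels.setdefault(channel, [])
        # Assumes extents within a channel never overlap.
        index = bisect_right([e[0] for e in extents], start)
        extents.insert(index, (start, length, perms))
        self.used += 1
        return True

    def allowed(self, channel, block, length, perm):
        """True if [block, block + length) lies inside one extent granting perm."""
        extents = self.channels.get(channel, [])
        index = bisect_right([e[0] for e in extents], block) - 1
        if index < 0:
            return False
        start, extent_len, perms = extents[index]
        # Requests that straddle two extents are conservatively denied.
        return block + length <= start + extent_len and (perms & perm) == perm
```

For example, after `table.grant(1, 100, 50, READ | WRITE)`, the call `table.allowed(1, 120, 10, WRITE)` succeeds, while the same request on a channel with no grant is refused. The sketch also shows why fragmentation hurts: a fragmented file needs one entry per extent, so it eats the shared budget faster.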

Q: SR-IOV. Comment.
A: One device with many interfaces, rather than lots of interfaces.
Q: SR-IOV presents a different PCI function, but it's one device on the back.
A:
Q: SR-IOV does precisely this. Your slow path goes through the OS. Have you tried getting rid of that? e.g. distributed filesystem in user-level.
A: Considered it, but wanted to support many FSes, and minimal changes to FS. Might be worthwhile to do it the other way.
Q: Some of your apps show relatively small performance win, and you said that you might be able to change that with optimisations. Most of them are already heavily optimised, so that might be hard.
A: Some optimisations (e.g. read cache) stop making sense when the disk is this fast, so you could help things by removing them again.
Q: Sharing between applications? Is cache coherent?
A: Rely on app-level caching. Devices often fast enough that cache can't pay for itself. Getting rid of cache makes coherency trivial.

--------------------------

Dushyanth Narayanan
Whole-system persistence using non-volatile memory
MSRC

NVRAMs: byte-addressable, non-volatile, DRAM replacement. Could use as storage or as memory. What is the right abstraction?

Transparent persistence isn't really transparent: persistent objects cannot point at volatile ones. Hard to enforce; very hard to make all of your libraries enforce it. Also has substantial performance overhead compared to DRAM, because persistence requires pushing stuff out of cache.

Hardware options: battery-free NVDIMMs. Combination of DRAM + flash, flush DRAM to flash on power-off. Plugs in as ordinary DIMM. Powered by an ultracap. Really exists, as commercial product.

Whole system persistence: keep everything in NVRAM. Make power-failure look like suspend/resume. Problems:
-- suspend: too slow -- run out of power
-- don't suspend, recover from NVRAM: too much state lost (e.g. registers, dirty cache lines)

Alternative: flush-on-failure: use power-supply internal capacitance to keep the machine up for a few tens of milliseconds, and use that to save volatile data. Commodity PSU apparently gives you ~30ms warning when mains power goes away before its outputs go away. They tried a couple of different PSUs, and they all gave you at least 10ms notice, even when loaded. These are fairly small machines, though. Quite a lot of variance, though. ATX spec requires that you get at least 17ms.

Expensive part is flushing the cache. Worst they found was 4ms. Almost independent of amount of dirty stuff in cache; probably an artifact of wbinvd implementation. System suspend can take up to 6 seconds.

Performance: good, because you don't have to keep flushing stuff out of processor cache into NVRAM. Can be up to a factor of four on one test.

Q: Can you restore things when power comes back?
A: Relies on BIOS support to restore memory. Device restart: do it in the hypervisor.
Q: What happens if you get a power event which lasts less than 10ms?
A: Have to go all the way down and then back up.

------------------------

General discussion on storage

1: Lots of point solutions, no real idea of where we're going overall.
DN: Less gung-ho than we were a while back. NVDIMMs are there today; not revolutionary in same way that PCM will be. Don't want to make big predictions.
2: Have working PCM in lab. Not quite as fast as desired, but will get there. Also looking at flash trends; latency up and reliability down. That will force use of other schemes.
3: NVDIMMs: what's the market?
DN: Made by startup, so secretive about business model. Seems to be battery-backed RAM in RAID cards, but supporting very fast memory.
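
Aside, going back to the flush-on-failure numbers in the whole-system persistence talk: the argument is a simple budget check, comparing the time needed to save volatile state against the warning the PSU gives. A rough sketch using only the figures from the notes above; the function and constant names are made up for illustration, and it ignores anything the notes don't quantify (register save time, device state).

```python
# Figures as recorded in the notes (milliseconds).
FLUSH_WORST_MS = 4.0            # worst observed cache flush
SUSPEND_WORST_MS = 6000.0       # system suspend, "up to 6 seconds"
OBSERVED_MIN_WINDOW_MS = 10.0   # least warning seen from the PSUs tried
ATX_MIN_WINDOW_MS = 17.0        # minimum the notes attribute to the ATX spec


def fits_in_window(save_ms, window_ms, margin_ms=0.0):
    """True if saving state (plus a safety margin) completes within the window."""
    return save_ms + margin_ms <= window_ms


def headroom_ms(save_ms, window_ms):
    """Time left over after the save; negative means power is lost first."""
    return window_ms - save_ms


if __name__ == "__main__":
    for name, cost in (("flush", FLUSH_WORST_MS), ("suspend", SUSPEND_WORST_MS)):
        ok = fits_in_window(cost, OBSERVED_MIN_WINDOW_MS)
        print(name, "fits" if ok else "does not fit",
              "headroom:", headroom_ms(cost, OBSERVED_MIN_WINDOW_MS), "ms")
```

On these numbers the cache flush fits even in the smallest observed window, while a full suspend overshoots it by roughly three orders of magnitude, which is the whole case for flush-on-failure.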