RESOLVE 2012

Hardware and network virtualisation session

The case for reconfigurable IO channels -- Anil Madhavapeddy, Steven Smith,...

IPC matters. Previously just an issue for microkernel people, now an issue for everyone, esp. due to virtualisation (including JVM-style virtual machines). Hyperthreading, NUMA, multicore all make it more interesting.

First benchmark: all-pairs throughput and latency tests on 48 core AMD machine. Native TCP result is fairly obvious, but explanation depends on knowledge of machine architecture. Virtualise it on Xen: basically turns into noise. Problem is Xen scheduler rebalancing stuff underneath you. Pin vcpus: less weird, but still some very strange results. Problem: pinning vcpus needs root access!

Now switch mode: UNIX domain sockets. Native is still pretty reasonable, but Xen now shows exciting banding effects. Problem with Xen is a NUMA effect: all memory allocated on node 0.

Another test: Xen IPI latency test. AMD Opteron architecture is very obvious. Intel one is much less obvious.

Conclusion: this is very hard to predict -> need a new API which hides it.

FABLE performance graphs. Lots of different schemes for moving data around. TCP performance is terrible (often a factor of two, up to a factor of fourteen down).

FABLE is a new API for this. First-class IO channels which deal with flows (rather than packets). Support reconfiguration, including automatic transport selection.

Example: running DB and frontend VMs on Xen. Want a fast transport which works whether you're running locally or not i.e. transparently convert between vchan (or ...) and TCP.

Uses: improve cloud IO (e.g. Xen ring protocols, CIEL, Hadoop, Mirage, etc.)

Questions:

Q: Is this heading in the direction of a magic self-parameterising compiler which hides this?
A: A lot of things which used to be static are now dynamic (e.g. context switch costs). That means a kind of online expert-system optimiser. Inputs are dynamic, so outputs can't be static.
Q: How do you solve the addressing problem?
A: Using a name daemon. Big missing link in Unix.
Q: Can you do it in userspace?
A: Mostly. Need a syscall for trust. Possibly more portable to just define a new kernel API, rather than dicking about with LD_PRELOAD.
Q: Upstreaming means getting it everywhere, which isn't necessarily easier?
A: First prototype will be an LD_PRELOAD hack, to avoid that, but long-term goal is upstream kernel.
Q: First problem is lack of information, second problem is how you use the information once you have it. Not sure how to solve second one.
A: All links individually chosen by system, but nothing has a good idea of the overall architecture. A single all-knowing system could do more optimisations.
Q: Is there cross-talk between transports?
A: Yes. Also did contended tests, and it does make a big difference.
Q: Is it transport-dependent?
A: Yes, it is. Layered stack makes reasoning about this hard. Mirage fixes it. :)
Q: Problems: implicit assumptions in models, another to do with scheduling. Is it possible that the Unix domain socket on Xen is a bad scheduling interaction? A real problem for optimising contended locks in BSD on Xen. Do we need more transparency?
A: Lots of people have tried it. Turns out to be quite hard.

--------------------------------------------

Next talk: hardware and software techniques for assisted execution runtime systems. Kestor, Gioiosa, et al. Barcelona + Catalunya

Assisted execution systems: Amdahl's law limits maximum speedup for parallel execution.

Test case: STM^2. Offloads some transactional memory operations to other threads in order to improve performance. Up to ~ factor 5 performance boost over other STMs.

Pathological performance if you get low transaction rates? Something to do with auxiliary threads using resources even when idle? Solve by supporting fine-grained division between app threads and aux threads. Requires help from whole stack (STM^2, OS, hardware).
Using IBM POWER7 hardware thread priority support to say how many re-order buffers etc. each thread gets. Three interesting cases:
-- Application doesn't do any transactions, so drop aux thread priority to nothing.
-- Application doesn't do any transactions, because it's too busy.
-- Application does lots of transactions, so aux threads become overloaded, so boost their priority.

Seems to work in their microbenchmarks. Not convinced I really understand them, though.

Bored of microbenchmarks now, moving on to STAMP benchmarks. Some apps show performance win, some show a hit. No real idea how to predict which is which.

Q: Did you compare to non-assisted execution STMs?
A: Yes. On most of these tests, assisted execution is such a big win that it swamps this.
Q: Applicability to problems outside of STM-land?
A: Yes, could be applied to other assisted execution systems, but possibly not beyond that.
Q: How long does it take to reconfigure?
A: The submitted one doesn't adapt, but now we have a new one which adapts quickly. It does ``a very good job''.
Q: Always needs a syscall?
A: No, some userlevel ones can be changed with a register write.
Q: Priority inversion problems?
A: Yes, that can happen. We wrote a paper on it. It's in ISCA.

-----------------------------------

Final talk of the session, rwatson and the CL security group, plus most of hardware, and a good chunk of SRI.

CHERI: A research platform deconflating hardware virtualization and protection. The custard talk.

1980s: microkernels. Large pieces of software break down to smaller ones. Not widely deployed due largely to performance problems.

2000s: privsep. Much more widely used this time around. Partly due to better hardware, partly higher tolerance (due to more immediate problems).

Appear to be stuck with C-language TCBs, therefore want to make them safe.

Q: Will we run into the same barriers here as we did on u-kernels?
A: Yes.

Previous work: Capsicum -> capabilities on Unix.
Problem: hard to have lots of processes for sandboxing on modern hardware+OSes. e.g. can't put each image-processing task into its own box.

DARPA CRASH project: throw away hardware, start from scratch, what would you get? -> CHERI MIPS == Capsicum in hardware. Looking at exactly where the hardware/software barrier should be.

Traditional hardware uses rings as its trust model, which is tricky if you're using things with mutual distrust or non-hierarchical trust. Answer: capabilities. Use a strange hybrid system to make porting old code easier (in most cases a no-op, but without any of the nice advantages of the new protection model).

One important aim: make building security contexts cheap. Each address space has its own security model and executive, a la privsep.

Hardware: add capability coprocessor. Extend register file with capability fields, so that you can change them without kernel traps. Hybrid operation: allow unmodified MIPS binaries to run.

Implementation details: most research processors are missing lots of important features. Fixing this requires cooperation across multiple research fields (hardware people vs. software people). e.g. looking at interactions of TLB size and OS strategy?

Status: have a basic MIPS processor, bringing up FreeBSD. Next up: multithreading (esp capability-aware IPC)

www.cl.cam.ac.uk/research/security/ctsrd/

Q: A lot of this stuff was in x86 20 years ago (e.g. call gates), and they kind of dropped away.
A: Yep, and it's getting further away (e.g. loss of segment registers). Question we're asking is about path dependency: did we go the wrong way in the past?
Q: Easy to verify properties of pure kernel (e.g. L4). Peripherals are much harder.
A: seL4 is cool, but, yes, it largely ignores the very low-level operations (e.g. TLB flushes), which is kind of sad. Working on improving that. One collaboration is integrating a theorem prover into Bluespec for checking that kind of thing. Pretty tricky, esp.
given that we want to keep a C-language TCB.

Panel session
-------------

Q: Trust is a global property. Where do we go from here?
Anil: Trust is an economic property. Every reseller provides a certain level of indemnity, and that works better than provable correctness. i.e. depend on social and business mechanisms, rather than mathematical ones.
Rob: Tension between API and trust, and between security and performance. If security is optional, people will turn it off. Sometimes security helps performance e.g. provable isolation. Tricky to arrange.
G: Aim is to make parallel programming easier, not so much about security and trust. Our trade off is ease of programming vs. performance.
Anil: Something has to change. Not clear whether it's languages or hardware or what.

Q: Hardware-software interface is interesting, but increasingly difficult to play with, despite better techniques i.e. need more people. i.e. field is maturing. Do we need to rethink the way we do this kind of research? If so, how?
Rob: Look at effects of open source OS research. Many OSes built in part to run experiments on. Proprietary OSes hard to do reproducible tests on. Then we can do experimental method. To some extent, FPGAs open the door to doing this kind of research again, because it doesn't matter so much that we can't keep up with Intel and AMD. e.g. RISC isn't nearly as dead as it used to be (ARM, and other non-WinTel platforms). One aim of BERI is to provide a platform for HW/SW research in the way that Linux is a platform for OS research.
Anil: Lots of layers of emulation now. Field is entering middle age. Generation computing: what do we see our jobs as being in a hundred years? Do we expect to always be able to understand every layer? In ten years, unless Raspberry Pi saves us, new programmers won't understand any of the lower levels.

Q: Lots of students don't understand the low level stuff.
Q: Lots of practitioners don't understand the low level stuff.
Rob: I describe this as cross-disciplinary, but ten years ago this was all one discipline.

Q: Specifying security versus enforcement?
A: ???

Q: Things are complex, so people tend to deal with simplified versions. Does that make the results less useful?
Anil: It just means CS is turning into a real science. Much more complicated stats and experimental methods. Partly poor training of computer scientists.
Rob: You went to the wrong university.
G: If you're an arch guy, you just design arch, and don't care about runtime. If you're a runtime guy, you don't understand the hardware. Co-design might allow more optimisations.
Rob: Trying to do that in BERI. Risk of over-simplification in much research. BERI should help; not clear whether it's actually sufficient. We've historically done very badly at managing systemic complexity. If you can't reason about the large system from the micro cases you've done something wrong.
Anil: We have leverage from older CPUs. One of the FABLE aims is to do longitudinal studies of IO performance, so that we can track how performance is changing over time. Won't know if it's worked for at least five years. Everyone should go to fable.io to run the tests.
Malte: It's a bit sad that we don't know the ground truth, given that it's entirely man-made.
Rob: We like to ask historic questions and compare new and old CPU designs. e.g. what happens if TLB size goes to infinity? Interesting to compare to results from FABLE. Cross-validation of hypotheses.

Q: What are the lessons in what you've learned for people building these architectures?
Anil: Rebuilding the old SGI machines. Make the IRIX people do what they did twenty years ago. Evolutionary business model.

Q: Clean-slate approach to multicore architecture?
Q: e.g. switch instruction set while CPU is running
Anil: Programming languages have global impact. IETF ... build protocols by printing out ASCII documents.
Rob: Capsicum is trying to change the world in an incremental way i.e.
sensible deployment strategy, but at the same time asking clean slate questions. Completely clean slate is tricky because you end up changing lots of things at once and can't split out individual effects.
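
Aside, going back to the FABLE talk: the "measure, then pick a transport automatically" idea can be tried on a single host with something like the sketch below. This is not the FABLE API and not the speakers' benchmark. Every function name here (round_trip_seconds, measure_transports, pick_transport) is made up for illustration, and it only compares a UNIX domain socket pair against TCP over loopback, assuming a POSIX system with AF_UNIX available. It says nothing about vchan or cross-VM transports.

```python
import socket
import threading
import time


def _recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf.extend(chunk)
    return bytes(buf)


def _echo(sock, rounds, size):
    """Echo `rounds` fixed-size messages back to the sender."""
    for _ in range(rounds):
        sock.sendall(_recv_exact(sock, size))


def round_trip_seconds(a, b, rounds=200, size=64):
    """Mean ping-pong round-trip time between two connected sockets."""
    echo_thread = threading.Thread(target=_echo, args=(b, rounds, size), daemon=True)
    echo_thread.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)
        _recv_exact(a, size)
    elapsed = time.perf_counter() - start
    echo_thread.join()
    return elapsed / rounds


def unix_pair():
    """Connected pair of UNIX domain stream sockets (POSIX only)."""
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)


def tcp_loopback_pair():
    """Connected pair of TCP sockets over 127.0.0.1 on an ephemeral port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    for s in (client, server):
        # Disable Nagle so small ping-pong messages are sent immediately.
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return client, server


def measure_transports():
    """Return {transport name: mean round-trip seconds} for each transport."""
    results = {}
    for name, make_pair in (("unix", unix_pair), ("tcp", tcp_loopback_pair)):
        a, b = make_pair()
        try:
            results[name] = round_trip_seconds(a, b)
        finally:
            a.close()
            b.close()
    return results


def pick_transport(results):
    """Naive selection policy: lowest measured round-trip time wins."""
    return min(results, key=results.get)


if __name__ == "__main__":
    measured = measure_transports()
    for name, rtt in sorted(measured.items()):
        print(f"{name}: {rtt * 1e6:.1f} us per round trip")
    print("selected:", pick_transport(measured))
```

Per the talk, numbers from a test like this depend heavily on core placement, NUMA layout and whether you are under Xen, so a one-off measurement on one machine is only a starting point. Re-measuring and re-selecting at run time is the part a real reconfigurable channel would have to add.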