Comprehensive kernel instrumentation via dynamic binary translation Peter Feiner Toronto Operating systems getting bigger; mostly drivers. Also more swearing. Need more tolerance for bugs -> need more tools. Easier if you could do DBT on kernel. Ported DynamoRIO to Linux kernel. Took about 18 months. Then built heap debugging tool in 5 days. Hypervisors? e.g. VMWare. No API. PinOS == Pin + Xen -> only emulated devices So add DBT to a hypervisor with pass-through devices -> what they did. Approach: load a module which shims itself under the OS and lifts it onto a JIT. Complications: how do you do IO? Concurrency? Interrupts? Interrupts are hard bit. Want to copy interrupt handler to code cache (plus instrumentation) and run it. Problem happens if instrumentation isn't reentrant. Turns out to be quite hard . So delay interrupts instead. How long for? First answer: try the end of the BB. Doesn't work: what if the interrupt BB disables interrupts while it's pending. That includes changes to IO APIC! Must instead delay until start of next OS instruction. How to delay interrupts? First answer: cli/sti pairs. Overhead kills you. Instead patch code cache in response to interrupts (!) by just lobbing in int $n instruction when we want to handle the interrupt. Clear IF flag in hardware-pushed interrupt stack frame, return to code cache. Eventually hit int $n, hardware pushes another interrupt frame, framework removes patch, fixes up frame, and then runs the OS interrupt handler. Performance: 3% for compute-bound, up to ~5x for worst case they've found . Q: How come you have userspace overhead? A: Some indirect effects. Hypothesise cache effects Q: Any interesting debugging advice you'd like to give us? A: KVM makes things a lot easier. --------------------------------- Continuous object access profiling and optimizations to overcome memory wall and bloat. Rei Odaira IBM Tokyo Discourage instrumentation. OO programs allocate lots of objects in a wasteful way. Several optimisations to improve this: truncation, compression, equal object merging,... Depend on access profiling e.g. are objects immutable. Want to build accurate and lightweight object profiler. e.g. Lucene allocates a 16k buffer but then only touches first 100 bytes. Speculative buffer shrink useful performance win (4x!), but need fixup logic, details in paper. Must track buffer for entire lifetime, because until then you can't do max. Don't need to track everything: only one buffer per allocation site. Propose barrier profiler. Memory-protection based. Profiling done from fault handler. Object-based protection, not page-based. Object sampling is adaptive. Can stop profiling at any time. Accurate == low error probability, not zero error probability. Use barrier pointers. Reserve a protected region outside the heap. Talk presents simplified version where the region is the same size as the heap. Not real memory; just virtual address space. Used in obvious way to create pseudo-shadow objects and emulate from page fault handler. Sampling strategy: sample one object for everythin 8th MB allocated. Perf overhead: average ~1.5% for their tests. Bursty tracing got ~9%. Use it to implement the object optimisations. Useful win;
. Q: Used barrier pointers to do sampling. Could you instead use watchpoint registers? A: Yes, sometimes, but you'd need quite a lot. Q: What happens if the sampled object is hot? A: Can stop profiling an object at any time, but that compromises accuracy. Stop profiling if something is frequently accessed and it doesn't look like it's optimisable. Q: Results don't do accuracy/performance tradeoff e.g. other sampling rates? A: In paper. Tried a bunch of different ones, 8MB seems like a good trade-off. ------------------------------- A case for unlimited watchpoints Joseph Greathouse UMich, CMU. Ref previous talk. Aim is performance . ``Make dynamic software analysis faster'' Existing dynamic analysis tools tend to have high overhead. Existing work proposes hardware accelerators for this. Need a generic accelerator -> watchpoints! Taint analysis. Set watchpoints on all tainted values. All very easy. Race detection: watch everything to check whether values are shared between threads. Need lots of watchpoints for this to work. Emulating with page protection is a bit sucky. Also, page protection harder to make per-thread. Also want range watches. Could use ECC mangling, but that's not per-thread, no ranges, and per physical address. His design: store in memory rather than on chip (but cache on chip). Hardware design uses a range cache (previous work). Stores ranges and a watched/not-watched flag. Hardware can apparently do a very easy split/merge operation. Can parallelise lookup across every range quite easily. Ranges in memory are just a tree. Per-core hardware range cache with software overflow/miss handler. Cache is write-back. He wants it to be precise and entirely user-level. Eval: pin-based simulator. Benchmarks only consider the shadow value cache accesses; not any kind of propagation logic in e.g. taint tracking. Using a 128K range cache. Performance is apparently then perfect; RC reduces overheads to zero. Not convinced I believe that. Need for lots of small ranges. e.g. byte-accurate data race detection needs lots of small ranges. Extend design: some range-cache entries are instead pointers to bitmaps representing the bytes which are watched. Moving to 64K RC + 64K bitmap cache buys you ~20%? Future work: u-arch analysis. Q: Frequent metadata updates? e.g. memcheck A: If you're always working on metadata then temporarily turning metadata updates off won't help you. Q: A: Okay, it doesn't work for everything, but does work for lots of things. Q: Have you looked at signatures e.g. encoding watched reasons as a hash? A: Hard to do with good performance. Also hard to remove stuff from watched set. Q: If this was widely deployed, lots of people would use it (e.g. JVM for GC). Would that make analyses which depend on this more difficult? A: Yes, in that case you'd need some kind of cooperation between the different users to virtualise. ------------------------ Aikido: A system for dynamically accelerating shared data analysis Marek Olszewski CSAIL Analyse instructions which access shared data. Conservative approach has high overhead. Sampling would help. Would help more if sampling could be targetted at instructions which access shared data. Can cope with fairly coarse granularity -> do it with page protection. Tricky part: most OSes share page tables between threads. So don't do that. Alternative: dynamic instrumentation using user-level virtualization. Their approach: lob a hypervisor under the OS which allows you to do per-thread page protection, and use that to accelerate a user-level thing. Talk assumes shadow page tables; implied that paper covers nested page table hardware. They're essentially putting the thread-level stuff in the shadow tables, and switching shadow table whenever the guest OS switches threads. libAikido makes hypercalls to control what's going on. Only instrument accesses once the page has been demonstrated to be shared -> some false negatives, but relatively easy to characterise. Example app: fast race detector. Compare with Aikido and without Aikido. Results: some benchmarks see up to 6x speedup, but lots show no benefit at all. Average seems to be about 75%. Q: Per-thread protection needed for JVM optimisation. Do you need a hypervisor? A: Not really, could do at OS level.