Session 3, Memory.

Memory reclamation in Garbage Collected runtimes
Ben Corrie

Only in Java, but might work elsewhere.

Running JVM on top of another virtualisation layer. Hypervisor then doesn't know when JVM frees stuff. Hypervisor then has to try to reclaim memory some other way. Easy way: page sharing, dedupe stuff. Alternative: balloon driver. Ballooning can lead to swapping. Final fallback: memory compression, hypervisor swap.

Obvious analogy with JVM running on an operating system.

Combine JVM with balloon driver -> apply memory pressure to OS rather than to JVM -> heap gets swapped out -> major GC sweeps are very slow. Tends to lead to suddenly falling off a cliff, rather than a nice graceful degradation.

Directly comparing GC policies is a little tricky, but macro benchmarks show that some policies are better than others (serial okay, parallel disastrous, CMS pretty tolerable).

Basic solution: have something in Java inside the JVM which allocates large zeroed objects when the balloon driver would normally run, and then let page sharing handle it. Leads to much more graceful degradation.

Future work: is there any inherent efficiency advantage to proactive/reactive memory reclamation? How do we decide the appropriate thing to reclaim?

Q: How does this work in different languages?
A: What matters is whether things get tenured. Medium-life data (a few hours) is kind of annoying; gets tenured, grows, and then gets left around.

Q: On Xen we sometimes send signals to apps to tell them to shrink. Have you tried that?
A: Needs a crystal ball. Hypervisor tends to ask for memory in small dribbles, which leads to lots of small GCs.

Q: Three-party problem. You're looking at the app and the hypervisor; does the OS have anything to do?
A: Some JVMs have a policy of making the heap as small as possible, in which case ballooning is stupid. At that point Java is no longer a special case.

Q: DB world has similar tensions. Need consensus between OS and DB layer.
A: We can solve this to some extent by hacking stuff together, but it tends to be quite fragile. Important to decide how much hypervisor should know.

Q: Even if the hypervisor knew everything, it'd still be non-obvious what the optimal policy is.
A: Lots of out-of-heap caching solutions, which are mostly hack-arounds for this kind of thing.

Q: Don't really need a balloon class in the JVM, because it could just zap the contents of the free pages to zeroes and then let the page dedupe handle it. Performance hit?
A:

Q: Trying to be JVM-agnostic. Are there Java applications which would work badly with this?
A: Yes. Things with lots of short-lived data. We only really care about the tenured part of the heap, because if you look at the short-lived part it's changing too quickly due to heap compaction and the dedupe fails.

Q: Really thinking about weak data type things. Balloon might accidentally flush the cache.
A: Discussed in paper.

-------------------------------------------------------

KSM++: Using IO hints to make memory-deduplication scanners efficient
Konrad Miller, et al.

Memory deduplication is sometimes useful. Semantic gap: hypervisor doesn't know about semantics of page cache, which is a major source of duplication.

Scanners usually work using a periodic scan, but that has problems with short-lived duplication. Other approach: Satori-style, keyed off of PV disk drivers.

Observe that hypervisor does IO on behalf of guest VMs. Use that to hint to the scanner that it should scan IO pages first. Continue to use the normal scan to determine whether to share pages. Can safely drop old hints if you get an IO burst. Throttling to avoid starvation effects.

Hint buffer size: ~8K seems to be the answer on at least one benchmark. Further increases cause problems. Not sure quite what happens there; they speculate that it might be due to duplication.

Q: Why not dedupe the buffer?
A: Wanted to avoid performance penalty, but will probably look at it later.
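
As a rough illustration of the hinting idea in the notes above (not the authors' implementation), the sketch below models a scanner that looks at hinted pages ahead of its ordinary periodic scan. Everything named here is an assumption made for illustration: the `HintedScanner` class, the `io_hint` and `step` methods, the dict standing in for guest memory, and the strict alternation between hint queue and linear scan, which is only a stand-in for whatever throttling the real system uses. The bounded deque silently discards the oldest hint when an IO burst overfills it.

```python
from collections import deque
from itertools import cycle


class HintedScanner:
    """Toy model of a dedup scanner that visits IO-hinted pages first."""

    def __init__(self, pages, hint_capacity=8192):
        self.pages = pages                        # page number -> contents (toy memory model)
        self.hints = deque(maxlen=hint_capacity)  # a full deque drops its oldest hint
        self.by_content = {}                      # contents -> a page seen holding them
        self.shared = {}                          # duplicate page -> page it was merged into
        self._linear = cycle(sorted(pages))       # the ordinary periodic scan order
        self._hint_turn = True                    # alternate hint queue and linear scan

    def io_hint(self, page):
        """Record that IO just touched this page (hypothetical hook)."""
        self.hints.append(page)

    def _scan(self, page):
        self.shared.pop(page, None)               # forget any earlier merge of this page
        content = self.pages[page]
        first = self.by_content.get(content)
        # Re-check the candidate, since it may have changed since it was recorded.
        if first is not None and first != page and self.pages[first] == content:
            self.shared[page] = first
        else:
            self.by_content[content] = page

    def step(self):
        """Scan one page, taking it from the hint queue on every other step."""
        use_hint = self._hint_turn and bool(self.hints)
        self._hint_turn = not self._hint_turn
        page = self.hints.popleft() if use_hint else next(self._linear)
        self._scan(page)
        return page
```

With two identical pages far into the address space, hinting both lets the toy scanner merge them within a few calls to `step()`, whereas the linear scan alone would have to walk up to them first. A real scanner would also need copy-on-write protection for merged pages, which this model leaves out.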
Eval: compare KSM and KSM++ against a theoretical optimum. Seem to get reasonably close to theoretical limit, and get much closer than ordinary KSM (2x to 10x increase in merging opportunities). Performance generally as good as or better than KSM. Only benchmark discussed is kbuild; others run but not shown; results similar.

Q: What proportion of hints are correct?
A: Depends on workload. Up to seven times as effective as a naive deduplication strategy.

Q: The hints are supplied by the host OS?
A: Yes, as part of IO path. Hook into virtual DMA controller.

Q: Try to balance regular scan and this queue. Do you take into account their relative effectiveness?
A: The number of pages which can be shared from each queue is workload dependent. Would need to do it based on observed behaviour. For our benchmarks 1:1 worked well.

Q: You now have two queues. Would you consider adding any more?
A: Thought about networking, but that's much harder. NFS on host already handled, but remote DMA not.

Q: Is this more effective on immutable data?
A: No, there's no requirement for that. This works very well for mutable data because we also try to merge on write operations.

Q: Transparent informed prefetching? Economic model to compare cost/benefit of good/bad hints.
A: Running simulations now.

------------------------------

Using solid state drives for virtual block devices
Sang-Hoon Kim

TRIM command exists. VBD and VDI have their Xen meanings. VBDs don't pass through TRIM commands from guest, and neither do the other bits of the stack. They want to fix that.

Complication: Many VDIs are backed by filesystem SRs, and they tend not to expose the TRIM command in a nice way. Plumbed through in obvious way.

Complication: punching holes in the middle of a VDI leads to fragmentation, and defragmentation is hard.

End result seems to be somewhat better performance, which is handy. Deletions slower, though. Also reduces degradation over time due to fragmentation etc.
Q: ftrim -- is that just a make-sparse fcntl?
A: Semantically similar, but we've not looked at the details.

Q: Have you patched the guest OS?
A: Modified ext2 filesystem in IO domain. Also needed to modify VBD driver to actually enable the TRIM command in the frontend.

Q: Did you try any other filesystems? e.g. a log structured filesystem might behave differently.
A: No, we didn't.

Q: Can you make the TRIM policy adaptive?
A: It could be, but isn't.

Panel session
-------------

Q: Theme is leaky abstractions. Comment?
BC: Spring framework is all about layers of abstraction. Strong believer in separation of concerns. The less the hypervisor knows the better, for un-fragility.
Q: But the less you know, the less you can optimise.
BC: The surprising thing is always how much performance degradation the customer will accept if it makes things simple.
KM: The abstractions haven't changed, but the needs have changed.
BC: JVM makes assumptions about e.g. memory, but that assumption no longer holds. Assumptions kind of breaking abstraction, i.e. it's previous cross-layer optimisations which break things.
Q: But leaks exist, so the answer is just to be careful about what the leaks are.
RW: Very small changes to interfaces needed.
Q: Many changes can be made fail-safe, e.g. ftrim.
KM: But dramatic changes might get you there much sooner.
BC: Hints can make it difficult to understand what's going on, though. e.g. Java annotations -- hard to tell if it's actually worked.
RMcI: gcc unlikely hints almost always wrong.
KM: Information transport problem. e.g. throwing exceptions in OO languages. No real equivalent in operating systems.
Anil: Balloon driver is an example of an abstraction which has gone horribly wrong. Why would you get memory and then give it back?
BC: Value add, not hack. A trick which the VMM can do. Resource sharing?
Q: Are the arguments economic?
BC: Well, some are technical, but a lot of people won't do it anyway.
For people who have batch jobs which run at a certain time and then don't run again, or 24 hour service around the world (i.e. time-plexed), you can shift load around.
Anil: MIRAGE MIRAGE MIRAGE i.e. use hotplug rather than ballooning. Need to time out old abstractions.
RW: GC+virtualisation thing. Linux handling of large pages is odd. How important is that?
BC: Large pages help in lots of places.
RW: Are you relying on the behaviour that the balloon driver can't reclaim large pages?
BC: Not really, just an implementation artifact.
RW: Extensions end up relying on heuristics, and you then end up locking yourself to them, which then constrains the OS in future.
BC: Aim is to avoid doing that. Zeroing pages is pretty much agnostic of everything.
Q: The L in RESOLVE is Layering. Finally arrived there. Sometimes layering is good, sometimes it's bad. Choosing a bad interface can lock you into a bad path if you don't get enough control, but the whole point is to hide things you don't need to know. If you get it right, performance is good and things are well-hidden. Otherwise, it all goes badly. Doing a bottom-up rethinking (e.g. CHERI) avoids some of the historically nasty abstractions.
BC: Only really find out what works in ten years' time.
RW: Java is complete world-view, but problems still keep coming up.
Q: And LISP machines have the same problem.
Q: Performance mostly comes from good caching, but a lot of stuff is . What is more important?
KM: Page colouring: Linux doesn't have it. NUMA doesn't matter that much.
Q: NUMA does matter.
KM: Need to know how you're going to use the data?