Debugging through time with the Tralfamadore Debugger
Christopher Head, Geoffrey Lefebvre, ...

Also talking about program understanding. Suppose I want to know how to write a NIC driver for Linux. Look at an existing driver -> too long to read the entire thing, comments quite hard to follow. NIC drivers are quite tricky: need to be fast, memory allocation, interrupts. Unforgiving of errors. Quite complex interfaces, both to hardware and kernel. Control flow complex due to dispatch tables. -> need better ways of understanding large bodies of source.

Tralfamadore: exercise code under Tralf, then ask questions post-hoc. Questions: what parameters are passed to a function (distribution rather than specific instance), what is the life cycle of a structure, ... . Don't have to modify kernel, don't have to iterate, ...

Demo: Shiny UI for asking those questions. Also a nice way of drilling down and getting more precise queries (e.g. who allocated this one-byte packet buffer?).

Q: How big are trace files?
A: Representative: kernel build = 106GB.

Another trick: generate a dataflow graph, with weights.

END OF DEMO

Architecture: start with a trace taken under qemu, then apply an operator pipeline to extract useful information. Individual operators are quite simple, but can be combined to do something more fun. Doing it offline, rather than live, makes developing the tools easier, because you get nice determinism.

Breakpoint command: finds *all* of the places in the trace which ran a particular instruction.
Backtrace command: produces a DAG of backtraces, with weights, for all hits in the trace (selected by some other command).

RW: Selected a device driver as the example. What are the limitations there?
A: Device drivers are hard to plug into virtual environments. Need qemu support. Working on record/replay for KVM/Xen so that you can use PCI pass-through and instrument that way.
RW: But you can only instrument things you can see?
A: Yes, but we can see reads, so it's kind of equivalent.
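
Aside: the operator pipeline from the architecture notes above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tralfamadore's actual trace format or operator API (neither is given in these notes). The Event record, the operator names breakpoint_op/backtrace_op and the toy trace are all made up; the only thing taken from the talk is the idea that a breakpoint operator selects every hit in the trace and a backtrace operator merges the hits' call stacks into a weighted graph.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    """One executed instruction in a (hypothetical) recorded trace."""
    pc: int                  # address of the instruction that ran
    stack: Tuple[str, ...]   # call stack at that point, outermost first


def breakpoint_op(trace: Iterable[Event], pc: int) -> Iterator[Event]:
    """Yield every event in the trace that executed the instruction at pc."""
    return (ev for ev in trace if ev.pc == pc)


def backtrace_op(hits: Iterable[Event]) -> Counter:
    """Merge the call stacks of all hits into weighted caller -> callee edges."""
    edges: Counter = Counter()
    for ev in hits:
        for caller, callee in zip(ev.stack, ev.stack[1:]):
            edges[(caller, callee)] += 1
    return edges


# Toy trace (made-up data): address 0x100 is reached along two different paths.
TRACE = [
    Event(0x100, ("net_rx", "alloc_skb", "kmalloc")),
    Event(0x200, ("net_rx", "netif_rx")),
    Event(0x100, ("net_rx", "alloc_skb", "kmalloc")),
    Event(0x100, ("open", "kmalloc")),
]

# Operators compose: backtrace_op consumes the output of breakpoint_op.
dag = backtrace_op(breakpoint_op(TRACE, 0x100))
for (caller, callee), weight in sorted(dag.items()):
    print(f"{caller} -> {callee}: {weight}")
```

Because both operators only consume and produce streams of events, they can be chained in any order offline, and rerunning the same pipeline over the same trace gives the same answer.
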
Q: Can you handle multi-processor code?
A: No, but working on it. Probably just going to grab an off-the-shelf DRS.
RW: Example is kernel, but you've also done Apache? Can you do traces which cover both?
A: Yes.
KM: Can you compress the traces a bit? Can you find places where likely()/unlikely() are used incorrectly or not used where they should be?
A: Yeah, we've been thinking about that.
Q: How hard is it to make one of these pipelines? Could you compare two different runs of a program, to see how they differ?
A: We've thought about that. Not really done much.
BC: Can you add state to the operators?
A: Yes.
BC: Can you do TESLA-style temporal assertions?
A: Yes.
Q: Are you limited by the fidelity with which you can map the code back to the original source?
A: Kind of. Anything you could do live can be done on the trace.
MS: Is this generally available?
A: Will be eventually.
MS: Title is ``debugging through time'', but the UI doesn't have any indication of time. Could you add one?
A: Thought about it, but wanted to look at this as more of a query-over-trace thing than a point-in-time thing.
Q: Mentioned that threads have a data race. Was that multi-threaded on a uniprocessor?
A: Haven't actually built that yet.
Q: Multi-processor behaviour usually goes away when you trace, due to performance effects.
A: Working towards using DRS to reduce overhead and get away from some of those effects, but not there yet.
Q: Related work: Chronicle, Chronomancer. Compare and contrast?
A: Not familiar with the exact details of those systems, but most of them are on point-in-time debugging rather than continuous time debugging.

----------------------------------------------------------------

Feasibility of mutable replay for automated regression testing of security updates
Ilia Kravets

Sysadmin's perspective. Updates can break things, but are sometimes necessary. Question: does this update change system behaviour in a common case?
(assuming that the system mostly works to start out with)

Options: proof or testing. Proof is hard, so use testing. Needs to be automatic, user-specific, scalable, application independent (no source and no vendor coop), and plausible for novice users.

Regression testing: use input from user workload, run both versions. For trivial patches we can use a DRS, but for more complex ones the DRS will diverge. Divergence could indicate:
1. old is correct, new incorrect;
2. new correct, old incorrect;
3. both correct.
Aim is to distinguish case 3 from cases 1 and 2.

Plan: record what happens in production with the old version, then check on a separate system. DRS can record on one of several levels: application, OS, hypervisor, hardware. They do it at OS level, using SCRIBE. Haven't quite finished implementation yet.

Validation: look at the 65 most popular packages on Debian to find 220 updates and manually try to figure out whether divergence will happen. They claim that for most the divergence will be tolerable. Come up with a bunch of toleration techniques.

Techniques:
-- memory layout can change
-- read-only syscalls
-- sometimes output changes in ways which don't matter (Q: actually, that does sometimes matter A: usually it doesn't)
-- ``temporary change''. In the example, this is inserting a lock acquire/release pair i.e. change undoes itself later
-- freeing unneeded resources (e.g. calling close())
-- semantically-equivalent output format change

They claim that implementing a relatively small number of techniques allows you to tolerate a fairly large selection of divergences.

Q: Can you do this with static analysis?
A: You can do it with static analysis, but it's not very scalable (in terms of lines of code).
Q: Is your technique scalable?
A: I claim it will be once it exists.
NS: How would you handle on-line patching e.g. splice?
A: Somewhat orthogonal. Can apply the patch to the test machine and test the patched version as if it were an offline patch.
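
Aside: a rough sketch of what tolerating divergence during replay might look like, covering three of the techniques listed in the talk (memory layout changes, extra read-only syscalls, freeing unneeded resources). This is not SCRIBE's interface or the speaker's implementation, which the notes say is unfinished. The Syscall and Ptr records, the READ_ONLY set and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical set of syscalls assumed to have no externally visible effect.
READ_ONLY = {"getpid", "stat", "gettimeofday"}


@dataclass(frozen=True)
class Ptr:
    """Marks an argument as an address, which may move between versions."""
    addr: int


@dataclass(frozen=True)
class Syscall:
    name: str
    args: Tuple[object, ...]


def same_modulo_layout(a: Syscall, b: Syscall) -> bool:
    """True if names match and arguments differ only in pointer values."""
    if a.name != b.name or len(a.args) != len(b.args):
        return False
    return all(
        (isinstance(x, Ptr) and isinstance(y, Ptr)) or x == y
        for x, y in zip(a.args, b.args)
    )


def replay_tolerates(recorded: List[Syscall], replayed: List[Syscall]) -> bool:
    """Return True if every divergence between the two logs is tolerable."""
    i = 0
    for call in replayed:
        if i < len(recorded) and same_modulo_layout(recorded[i], call):
            i += 1              # matches the recording, up to memory layout
        elif call.name in READ_ONLY or call.name == "close":
            continue            # extra read-only call or resource release
        else:
            return False        # divergence that none of the rules covers
    return i == len(recorded)   # the whole recording must be accounted for


# Made-up logs: OLD is the production recording, NEW and BAD are replays.
OLD = [
    Syscall("open", ("/etc/passwd",)),
    Syscall("read", (3, Ptr(0x1000), 64)),
    Syscall("write", (1, Ptr(0x1000), 64)),
]
NEW = [
    Syscall("open", ("/etc/passwd",)),
    Syscall("stat", ("/etc/passwd",)),        # extra read-only syscall
    Syscall("read", (3, Ptr(0x2000), 64)),    # buffer moved
    Syscall("write", (1, Ptr(0x2000), 64)),
    Syscall("close", (3,)),                   # frees an unneeded resource
]
BAD = OLD[:2] + [Syscall("write", (1, Ptr(0x1000), 32))]  # output changed

print(replay_tolerates(OLD, NEW), replay_tolerates(OLD, BAD))
```

The rules here are deliberately crude: as the questions in the session point out, deciding which divergences are really harmless is a matter of judgement, not an exact analysis.
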
RW: Brought in a lot of intuitions about which changes are harmless. How correct do you think those are?
A: Need to read sources to determine whether changes are safe, yes. Not an exact analysis.

Panel session
-------------

Q: You need to map from code back to source. How would it handle dynamically compiled languages? Also need to make assumptions about what's safe and what isn't? Also, how do you actually tolerate a change?
IK: Don't try to prove any correctness. Toleration involves relaxing assumptions which the DRS makes e.g. assuming that the code is the same. We focus on security updates, which tend to be small and unobtrusive. e.g. to tolerate changes in address space, OS replay cannot just assume that syscall pointers will be the same.
A: Focus is on whether you can converge, which means that you can get some slightly wrong semantics.
Q: Looking at small patches tells you something about scaling.
IK: This scales to large applications, but not large patches.
Q: Looking at divergence to check whether things are safe is cool, but what if you want to change behaviour?
IK: Assume that you're not changing the *common* case i.e. when you're not under attack.
A: Knowing when you're under attack is useful.
RW: Limited class of attacks considered. Was that deliberate?
IK: If it doesn't affect ``overall behaviour'' it'll be fine.
RW: You care about buffer overflows, say, but not e.g. resource exhaustion. They're major changes, and therefore rare, but they do happen.
IK: Yes, we found that 6% of the updates were hard, and we don't handle them.
KM: How often do you get bugs which are visible in memory dumps, but not visible elsewhere? e.g. silent buffer overflow. Lots of approaches where you send the same input to multiple versions of a program and compare the results.
IK: Hard to handle non-determinism in that model.
KM: Can handle it ad-hoc.
A: What about multi-core, shared memory stuff?
KM: Splitter handles that.
Q: What challenges did you have with indexing?
CH: Fundamental format is just a lot of basic blocks, but also have a lot of indices. Choosing them well is important.
IK: Do you need debug info during record?
CH: No, only during analysis phase.
IK: So you could record a smaller log and only run analysers during replay?
CH: Yes. Will do that soon. Don't need debugging info during record -> could debug closed-source stuff by recording at customer site and then shipping back the traces.
RW: Spooling gigabytes of data to disk is going to change behaviour. Also, if you're looking for differences, how much do you need to deal with?
CH: Record/replay is probably low enough overhead to get away with it.
RW: TCP delay stuff?
CH: You'd need low overhead, but it could be done.
Q: Tralf exists, other system doesn't. Do you have a plan for implementing it?
IK: We are currently working on it.
Q: Does it have any commonality with Tralf?
IK: No; working at a completely different level.
CH: IK is trying to get useful data out of something without source code, which is a fundamental difference.
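
Aside on the indexing question from the panel: one plausible kind of index over a trace of basic blocks maps each block's start address to the positions in the trace where it ran, so that a breakpoint-style query does not have to scan the whole trace. This is a guess at the general idea only; the notes do not say which indices Tralfamadore actually keeps, and build_index, hits_between and the toy data are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List


def build_index(block_starts: List[int]) -> Dict[int, List[int]]:
    """Map each basic-block start address to the trace positions where it ran."""
    index = defaultdict(list)
    for pos, addr in enumerate(block_starts):
        index[addr].append(pos)   # positions are appended in increasing order
    return dict(index)


def hits_between(index: Dict[int, List[int]], addr: int, lo: int, hi: int) -> List[int]:
    """Positions in [lo, hi) at which the block starting at addr executed."""
    positions = index.get(addr, [])
    return positions[bisect_left(positions, lo):bisect_left(positions, hi)]


# Toy trace (made-up data): the sequence of basic-block start addresses executed.
TRACE_BLOCKS = [0x10, 0x20, 0x10, 0x30, 0x10, 0x20]
INDEX = build_index(TRACE_BLOCKS)

print(INDEX[0x10])
print(hits_between(INDEX, 0x10, 1, 5))
```

The index costs extra space on top of an already large trace, which is presumably part of why choosing indices well matters.
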