ProtectIT Want to provide a trust guarantee on the exchange of sensitive data. Deal with heterogeneous and legacy services and policies. Do this with minimal modification, in an adaptable and safe manner. Assuming a well-known message format and protocol (e.g. HTTP, FTP, etc.). Some use of virtualisation to isolate protection domains. Path Controller (in privileged domain) collects the system properties to make a trusted data path between other platforms running the same software. Monitoring service. Evaluates the system and application properties and collects information. Attribute-value pairs in the /proc filesystem. Trust Controller polls these files to see if there are changes in the system, and passes this information to the Path Controller. Trust Controllers in different machines communicate. Some concepts (tags and labels) are borrowed from Distributed Information Flow Control. Protected data records are tagged, and channels are assigned roles, which are mapped to labels and indicate which kinds of tagged data it can carry. Traffic Interceptions are based on the principal, location/destination (kernel/driver domain/remote/local), time and what operations are applied to the traffic. e.g. Send an email tagged as confidential. You have the choice between a corporate-encrypted SMTP server, a corporate-plaintext server and an external server. The rule will choose the encrypted one. e.g. A DB request is tagged as reliable, but you have the choice between two servers providing a best-effort service. So the rule will send the query to both of them, to improve reliability. Implemented on Xen. Data flows through socket layer, through a packet queue then into the split devices. Interception happens in the packet queues (either in the guest or in the driver domain). If you intercept in the guest, you are in the same protection domain as the guest, so you are more vulnerable. In the driver domain, you have better isolation, but you have to simulate the network stack to handle the data -- worse performance. Can run the action code in the driver domain (its kernel -- better performance, worse trust -- or userspace), or in a separated domain. Interception is done through a VFS-style interface. Evaluation is based on the overhead of applying a dummy actionlet (roughly linear with the size of the message, on the order of tens of milliseconds). --- The Economics of Media How much is it worth to govern 50% worth of internet traffic? Ten or so years ago, people would speculate that this would be a multi-billion dollar business. Well, almost 75% of traffic is P2P; two-thirds of which is BitTorrent, but BitTorrent itself doesn't make a lot of (any?) money. Online media distribution accounts for about 6.7% of income for the record industry (about 85.6% for CD albums). The Radiohead model: choose-what-you-pay made $2.26 per album; conventional model would see them getting about $1.60 per album in royalty -- cutting out the middleman is profitable. Apple is the rare success, but it's making money from the hardware, and running a loss on iTMS. Economically, you want to establish a "stock market" of content, with free trade. Might never happen, but in the academic world we can dream! What if we could virally resell content? Spread music (initiating at a server farm) in a peer-to-peer fashion. Money still goes back to the central server, but the selling peer gets a cut. Lowers the operating cost of the central store. Want this to work off-line, anytime, anywhere and with anyone. Hollywood wants this, because you're more likely to buy it the first time you hear it. This would also reduce the energy required for distribution. Need security for the transaction. Assumes a standard PKI and tamper-resistant devices. Protocol needs atomic exchange of receipts. Limitation: doesn't support user-created content. How do you verify that your self-created work isn't a Madonna video. Built an economic model. The server has a certain expense for marketing (sending the music to the first buyers). There is a price for the first purchase, and for peer-to-peer purchases (to which the incentive must be added). The price is negotiated between peers. For the buyer, there is the amount that he is prepared to pay at a particular time. There is a probability of the transaction failing, and the cost at the server and peers. The viral network is modelled as a scale-free network. The model is optimised for profit. Came up with a dynamic pricing algorithm. Results are so good they had to pick the especially bad ones to convince people that it was real. --- DepSpace: A Byzantine Fault-Tolerant Coordination Service Want to find abstractions (i.e. middleware) for dependable distributed systems. Software is becoming complex. What about tuple spaces? Used in some modern systems (JavaSpaces etc.). They are a shared memory abstraction: can put tuples in the shared memory, and take them and read them. They can be used to build dependable distributed systems. How are the implemented? Usually as a SPOF tuple server, but this isn't dependable. Can it be made dependable? Yes. DepSpace is a BFT and secure tuple space. Bridges the gap between data confidentiality and BFT state machine replication. Coordination service can be used to coordinate processes in an untrusted and unreliable environment, like Chubby and Sinfonia. A service is easier to program, less resource-demanding and easier to make dependable than a coordination library. Why a tuple space? Theory: turing powerful, can be used to simulate a universal shared memory, and efficient to build a BFT algorithm on top of it. Practice: simple to program, content-addressable, and time- and space-decoupled communication. It's powerful and elegant. A dependable tuple space is a typical tuple space, with the addition of an atomic compare-and-swap operation (if a tuple is not present, then out a tuple based on the other template you provide). Assume n >= 3f+1, f Byzantine failures, independent failures (usual BFT assumptions). Replication layer: state machine replication using a typical BFT algorithm. All servers perform the same set of operations in the same order. Confidentiality layer: problem is that if you replicate, you get more replicas and more availability; but for confidentiality you want less availability and less replication (trusting fewer people). We cannot trust the servers, and we can't encrypt the tuples (or we lose content-addressability). Use secret sharing protocol, that must be verifiable and indeed publicly verifiable. Tuple insertion. Use PVSS to generate n shares for the tuple and use a hash function to generate a hash fingerprint for each field of the tuple. Shares are encrypted with the destination server shared key. Distribute these to replicas, which get the shares and the fingerprint of the tuple. Read or removal. Generate a hash fingerprint for all the known fields of the template. The servers match based on the fingerprints, the secrets are sent back to the client, and they can be combined to generate the desired tuple. What if malicious clients write fingerprints that don't correspond to stored tuples? Use lazy recovery and put the client on a black list. Policy enforcement. Performs fine-grained access control on the tuple space. Implemented in Java, in about 12Kloc, big part spent on the BFT (Paxos at War protocol). Uses Java crypto extensions, and Schoenmakers PVSS scheme. Policy enforcement scripts written in Groovy. Evaluation is latency and throughput on Emulab. Compared against GigaSpaces (non-fault-tolerant). Looks like O(log n) latency in the size of the tuple, and ~10ms latency. Confidentiality has no impact on throughput, . Read throughput is much higher than write throughput. How to use it? Build simple BFT protocols on top of DepSpace (on top of complex BFT protocols). e.g. Solving Byzantine consensus with DepSpace. Can be written in about 5 lines of code. Is this the killer app for tuple spaces? --- Hang Analysis: Fighting Responsiveness Bugs Software has prolems with responsiveness: how do we detect and eliminate the bugs that cause this? e.g. TortoiseSVN GUI for SVN. Hangs when the network is unresponsive, because it attempts a synchronous network send (for listing the directory). This blocks the message loop. Soft-hangs eventually return control to the user. Usually not a correctness bug, but caused by bad coding style. These exist in both client and server apps. User survey shows that people are beginning to complain about hangs as much as about crashes, so this is becoming important. Address them with: coding discipline (e.g. no synchronous I/O in the GUI thread) (doesn't help legacy code, and is hard to inforce due to the use of black-box libraries), debugging (make timing assertions, but this has low coverage as it is environment-sensitive (quality of network connection, etc.)). Or let the end user make bug reports (e.g. using Watson to send bug reports) -- but the damage is already done. Idea is to use static analysis to find blocking operations in the the source code, based on critical and blocking patterns. Intersect critical invocations (from the GUI thread, e.g.) with blocking invocations. If there exists a path in the call graph from the critical entry to a blocking call, then we have a problem. There are complex cases, though. For example, in a file chooser where the files are remote, the network access could be implicit, depending on the filesystem on which that file resides. So we need to be context sensitive (know whether the file is local, or can be remote). Or else we get a lot of false alarms. Goals are to be expressive, and scale. Framework uses the Phoenix VC compiler plugin to dump all function invocations, and thereby create a context-sensitive call graph. This is pruned to include only the critical invocations, and then pruned again to consider only possibly-hanging invocations. Post-processing finally generates the hang reports. Use a DB to store invocations and patterns. This can be implemented efficiently using a Binary Decision Diagram (scaling to 10^23 call paths). Also takes into account indirect calls through function pointers and OS callbacks. Do clone-based context sensitive analysis. This greatly expands the call graph. Critical entries are the UI thread function, the event/message dispatcher, the event/message handler and any others added by the developer. A recursive query is used to perform reachability analysis. Blocking functions are specified by blocking patterns. These include certain blocking syscals. And some conditionally-blocking operations: e.g. sleep(0) to yield against sleep(30000), or file operations which may be remote based on the filename, or sync/async socket modes. Hang gates: the external function in a library which leads to a blocking operation -- as post-processing, because the developer only want to deal with his own code. Evaluation required annotating about 100 Win32 APIs, identifying hang paths in GIMP, TortoiseSVN and one other application. --- BorderPatrol: Shrinking the Black-Box for Precise Application Tracing Why trace? Want to understand or optimise your application, pinpoint faults or check its correctness. Hasn't this all been done? (Everything up to DTRACE). Today's applications are not single processes: they are distributed, multi-tier, load-balanced, replicated. e.g. Web server--application server(--web service)--database server. Want to know how a particular web request is handled, not where a particular process spends its time. Simply instrument all modules. Widen all interfaces to pass a request ID. Module entry/exit logs the request ID. But you can't just change standard libraries, protocols. You might not have source access. Black-box inference. Observe communication between unmodified applications, and infer the paths between them. Although if requests overlap then you get ambiguities that must be handled statistically. The paths are not precise. BorderPatrol improves this by pervasive tracing within black boxes, understanding the interfaces, and changing the environment (not the code). It assumes that modules actually implement their specification, operate on events in isolation, and operate on events immediately. Tracing at the library interposition level. Despite these assumptions, many different modules (web servers, databases, frameworks, CGI things) exhibit these behaviours. Black boxes can do anything, but they don't! Threading doesn't have an effect on the assumptions. Even if you're concurrent, we can detect the fork. Library interposition wraps around 40 standard functions. Only allow one event to enter a module at any time (event serialisation). But even something like read() on HTTP/1.1 can give you more than one request. A small bit of code that understands the protocol envelope can serialise these (e.g. look for GET requests in the HTTP stream). Not as severe as having to reimplement the protocol. Is it easier than instrumentation? The HTTP processor is about 100 LOC and can work with any web server. Is it ridiculous to serialise events? We're serialising events, not requests, so parallelism can exist. Evaluation based on microbenchmarks. Bad overhead on exec-bound workload, otherwise not bad. Evaluated in the wild based on dearinter.net. Apache/TurboGears/PostgreSQL system (low overhead of around 20%). Also on Zeus, a high-performance, event-based, closed-source web server (gives high overhead of 96% on static content). --- 30 Seconds is Not Enough! A Study of Operating System Timer Usage All OSs have system timer facilities which are used to schedule notifications in the future. While hardware has changed a lot, this abstraction has not. A lot of timers are used (as much as 7000 timers/sec). Network and GUI code makes a lot of use of this. Point of the study is to find out how timer functionality can be enhanced, and how a clean-slate approach would look. Timers implemented as a timing wheel on a priority queue. In Linux, multiplexed in-kernel and also by system calls that are used by user-space applications. Two implementations, standard jiffy timers and high-resolution timers. In Vista, it's much more complicated with more multiplexing points. Method: profile running systems on real hardware to identify usage patterns and frequency, then inspect the code. Four workloads: idle desktop, Firefox dispaying MySpace, a Skype call, and a loaded web server (with no GUI). What are timers used for? Delay, periodic, timeout, watchdog and "other". Web server has a lot of watchdog; Idle has a lot of periodic. All have a lot of "other" -- information hidden by the multiplexing layer. Analysis of the timer values for different worklaods: Firefox has lots of short ones; Idle has a lot of long ones; Webserver has a wide, sparse distribution (based on programmer's educated guess?); Skype is more spread. How amenable are timers to adaptation? Most timers on Linux-Idle expire at 100%, so not amenable to adaptation, or else very short, so not much effect. But Vista has a much wider spread, and there are timeouts that indicate over-conservative guesses. Also Vista shows timers that can be up to 100% late due to the best-effort (whereas Linux doesn't). Not sure why some programs employ these timers. Skype shows a lot of very short timers, which are not being serviced in time. Adaptive timeouts. These should learn their environment: which has been used and works well on TCP. Currently trying this approach on Vista. Timeout provenance/dependency: multiplexing in various layers makes debugging and profiling difficult. With an appropriate interface, we could make provenance explicit. Could view timer behaviour as an aspect. Better notion of time. Some programmers are forced to over-specify time (e.g. run me in ten minutes vs. run me at some appropriate time in the next ten minutes). Handled with a "deferrable" flag in Linux which prevents Linux waking up a sleeping CPU, but this is ad-hoc. Use-case specific interfaces (the four use cases seen above (delay, periodic, timeout, watchdog)) are helpful. So have a periodic timer with an explicit period (rather than manually handling this). Timeouts based on the event that the timeout is waiting for. Delay timers with a minimum and maximum. Interaction between timers and schedulers should also be studied. --- Samurai: Protecting Critical Data in Unsafe Languages Emerging platforms like smart phones and web-browsers make interoperation easy (flat address space), leading to innovation but with little isolation or fault-tolerance. Dangers are the flat, uniform (single) address space; unsafe languages which allow all sorts of transgressions using pointers; and unsrestricted third-party code. Lead to security risks, unreliability and corrupt data. Could just rewrite in a safe language, which takes a big investment. Or could use static techniques, which have false positives (and false negatives). Or dynamic techniques, which are difficult to apply to third-party code and incur performance overheads. Alternatively, we could tolerate the errors at runtime. "Failure oblivious computing" -- crashing or stopping the program is not a good solution, e.g. if it's an Air Traffic Control system. Data structure healing -- if the invariant fails, do something spurious to make it correct (100 planes invariant -- add or remove planes from the sky). But this is a bit unsound. Memory is a weak abstraction -- introduce critical memory to make this better. Some data is more important than other, and the program may or may not be able to continue if it gets corrupted. Programmers think that the file system is less likely to get corrupted than the memory. And the solution has to deal with third party code. To use critical memory, you have to identify what data is critical -- not a stretch because people are already thinking in these terms. Protect this with isolation and replication. It protects against software *and* hardware errors (even cosmic ray bit flips). Technique can be selective and incremental, when applied to existing code. And it is compatible with existing third-party libraries. "critical" is a type specifier like const. The data is shadowed in a parallel address space. Add cload and cstore operations to work with critical data. Writing to a critical variable is implemented as a cstore. If a buffer overflow causes the local variable to change, this will be remedied when you try to read the variable and do a cload on it. Can either tolerate this error, or fail fast on the cload. Library code doesn't need to be aware of critical memory. If it should be modifying critical data, we have a mechanism to "promote" the changed value on return. Similar to the idea of committing changes on return in Solitude from yesterday. Samurai involved having a critical heap, using crit_malloc/crit_free. Done in software, making a base copy plus two shadow copies. The allocator is robust to reduce the likelihood of correlated errors. On crit_malloc, put shadow pointers in the malloc metadata, and randomise allocation (to avoid correlated erros). A regular store causes a memory error; critical store writes to all copies. Critical load takes the majority value and detects or repairs on a mismatch. Metadata is protected with checksums and backup. Implemented as a compiler pass using Phoenix to instrument the loads and stores. Evaluated on SPEC2000 modified with particular data structures that should be protected (subjective choice of variables to instrument). Also evaluated on making libraries like the STL list class and the heap allocator use critical memory. Small performance overhead in most cases (<8%), except in gzip, when making everything critical. Evaluated using fault injection, which shows almost 100% success for Samurai. The STL list class using critical memory had a 9% overhead on a web server. Use critical data as programmers currently use the file system to protect data. --- Controlled, Systematic and Efficient Code Replacement for Running Java Programs Runtime adaptation -- adding and removing extensions at runtime without shutting down. Services may need to run continuously. And yet unanticipated performance problems may arise, and the implementation may need to be changed on the fly. Must be programmable using existing development tools and techniques, and not introduce any additional complexity. Should be extensible to add new adaptation techniques. Should be flexible and efficient. PROSE: platform for PROrammable extenSions of sErvices. Allows you to identify where to change the program (method boundaries, etc.), and establish the additional actions that are needed at those points. Works on running Java applications. Based on dynamic bytecode instrumentation, and method code replacement that operates while an application is running. Crossover with Aspect-Oriented Programming, borrowed terminology (but different goals and implementation). A change is an aspect. An aspect comprises crosscuts, which contain "advice" (the new code) and a "pointcut" (pattern expression). Join-points specify where the code is to be executed, and Proceed calls the existing method. Weaving is adding aspect functionality to an existing application: allows hundreds of places to be changed at once. Uses a modified JVM, which passes aspects to the JIT, and modifies the JIT to add a "Weaver" that generates "woven" bytecode before passing this to the JIT. The native machine code is thereby "woven". Implemented on Jikes RVM and Sun HotSpot. For Jikes, the JVM is modified to support changes to classes at runtime. Triggers the recompilation of methods to weave advice. The JIT triggers the recompilation and includes the new code. Compatible with an optimising compiler. On the Sun JVM, the JVM is unmodified and uses a JVM plugin which uses the HotSwap mechanism to change the implementation. Some advice could contain multiple "proceeds", which are wrapped by a prolog and epilog, and surrounded by advice code. The proceed code is inlined once at the end of the method code body, and subroutine call is simulated within the code. Inlining everywhere would lead to code blowup. Challenges are to deal with method arguments, return values, autoboxing, varargs and exception handling for the single and multiple proceed inlining. Possible to have multiple redefinitions of methods (execute two lots of advice code for a single method). Controlled by priorities. Evaluated as the overhead on SPECjvm98, and Java Grande. Around 2% overhead with a standard deviation of less than 7%. Both JVMs tested -- Jikes baseline, Jikes optimising and HotSpot; with and without PROSE. Also compared with AspectJ5, ABC and Steamloom, and comes out 6-to-10 faster than AspectJ5 (PROSE at compile time vs. AspectJ at load time), 100 times faster than Steamloom. Compared at compile time, load time and run time, for call modification and methody body modification. However, the load time mechanism for PROSE is slower than the load time mechanism for AspectJ5. So, the execution time is much better, at the cost of being slower at modification time. Example applications include dynamic persistence, and adding caching strategies to multi-tier Java applications at runtime. --- Documenting and Automating Collateral Evolutions in Linux Device Drivers Collateral Evolution is e.g. when a library function changes its name, and therefore all the client code has to change the name. Or we could add a new parameter. The target is Linux device drivers -- 50% of the code of Linux, with approximately one library per device type (PCI, sound, etc.). 1200 evolutions in Linux 2.6, which led to far more collateral evolutions. Currently this is done nearly manually: difficult, time consuming and error prone. This has led to mistakes in the code base. This is worse for Linux because development is concurrent and distributed. Difficult to get a single patch that atomically updates all of the code sites, because new drivers are occurring all the time. Need a taxonomy of transformations (add parameters, split data structures, change protocol sequencing, change return type, add error code, etc.). And one of collateral evolutions. Current refactoring tools can do little better than renaming. e.g. scsi_get and scsi_put will disappear. So we must delete the call to the library and delete the error checking code. A local variable becomes a parameter. How should that transformation be specified? A Perl script or sed command could not cope with this. Idea is Semantic Patches: has syntax much like a patch with modifiers that say add or delete lines, but it must be abstract across all the possible code sites. A declarative language with metavariables. Ellipsis operator to correspond to a wildcard in the code body. Only transform if everything matches. Applied using a command line tool called spatch, which operates like patch. SmPL (Semantic Patch Language ("Simple")). Abstracts away irrelevant details, like spacing/indentation/variable names/irrelevant code (the ellipsis)/isomorphisms (if (!y)/if (y == NULL)/etc.). Ellipsis is based on the control flow graph. "..." means all subsequent paths in the control flow graph. C isomorphisms. e.g. the NULL example. Reversing the order of an if-else. Pointer accesses x->y vs. *x.y. More isomorphisms can be specified by the user. SmPL syntax for specifying these "<=>" is the isomorphism operator. Architecture translates the C file to a control flow graph. The semantic patch is expanded with isomorphisms and translated to extended CTL ("Computational Tree Logic" -- a compact formalism for staying stuff about trees, with some additional featuers). Together, the CTL is matched against the CFG using a model-checking algorithm, and the matched code is modified based on the patch. Finally an unparsed version is output. Issues: need to produce readable code, keeping space, indentation, comments and CPP instructions. May also have to work on CPP instructions! Leads to 60,000 lines of OCaml code. Evaluated by using it do detect past collateral evolutions in Linux 2.5 and 2.6 using pre-existing patchparse tool. Selected 60 of them to make a test suite for the spatch tool. Then compare applying the spatch tool to what the Linux programmers did manually. Used 20 complex evolutions, where programmers made mistakes or missed some sites. 23 mega evolutions that affected over 100 sites (requiring 40 people over the course of two years). And 26 typical evolutions (involving a typical, median directory). More than 5800 driver files were transformed. Semantic patches are an average of 106 lines long: often 100 times smaller than "human-made" patches, which is a measure of time saved. Easy to review the patch compared to a huge, long patch, and perhaps less time to generate the semantic patch. Correct and complete automated transformation for 93% of files. Missed code sites on the remaining 7%. Just false positives, no false negatives. Processing time of 0.7s per relevant file. Sometimes the tool did better than the human patches. Also implemented some semantic patches for current kernel modifications, and it can be used to do bug-fixing. The patches generated by spatch have been accepted in the latest Linux kernel. Can it be useful for different software bases, different types of programmers (not just Linux/device drivers), other kinds of evolutions or transformations?