Remus: High Availability via Asynchronous Virtual Machine Replication (Brendan Cully, UBC) *** Best Paper *** High availability: fault tolerance is rarely worth the trouble. Vanishingly small number of fully fault tolerant systems in the world, due to the effort it requires. Redundant hardware is too expensive. Building it into software is known to be hard to get right: and once you have it, it's hard to maintain as your application grows and changes. And you need the source code: end users of closed source software are out of luck. An easier way? Virtualisation makes it easy to capture changes to memory or the CPU, without needing additional hardware. Demonstrated by live migration. Can we have a continuous copy of the system to another location? Three steps: 1. Checkpoint at discrete intervals, and send out over the network. But then you need throughput, so 2. replicate these checkpoints asynchronously (keep the system running while you checkpoint). But then there's some state that's not backed up, so 3. make the uncommitted execution speculative. Gives completely seamless failure recovery: application will continue running from the point it left off (even active network connections will be maintained). Requires no knowledge about the hardware or the operating system: completely general, build on Xen but portable to other systems. Transparent to the protected system. Backup doesn't have to run any of the software that's running on the primary: don't require the instruction stream to run deterministically at both -- that would mean the work at the backup is being discarded most of the time. Deterministic running on both machines requires hardware knowledge, and also places load on the backup: Remus can put several backups on the same machine. Operation: put system in a VM to interpose on various operations. Checkpoint at a high frequency (every 20ms). Buffer network output between checkpoints, and on failure, resume execution from the last checkpoint. Checkpointing requires a roughly 2ms pause, and uses log-dirty shadow page tables to note changes while the machine is running. The image is saved to a local buffer, and speculative execution begins (saving network output). Memory checkpointing buffers locally (as opposed to live migration -- which takes several seconds to run, with 50--100ms downtime). It defers most of the work of suspending to the point when the failover occurs. Details in the paper. Network buffering: never see output from the system that hasn't been checkpointed to the backup location -- make no promises you can't keep. Don't need to replay network after the failover. If received packets are lost, IP will step in and help you. The disk requires much stronger guarantees: anything the VM believes it has written to disk must be on disk. The disk state at the time of the last checkpoint must be available. Disk writes are simultaneously sent to a backup buffer on the secondary host. Writes are only released to disk after a checkpoint operation. Detecting failure. If the backup doesn't respond, the primary disables replication; if the primary doesn't respond, the backup reactivates the VM from the last checkpoint. If the network partitions and both machines are operational: use two bonded NICs to avoid the single point of failure. Two main overheads: taking the checkpoint, and the network delay. One control: how far apart you want the checkpoints to be. Also, replication delay as a function of checkpoint size. Time taken to take a checkpoint is tiny compared to the time transmitting the data over a GigE link. Main bottleneck is the network link, improving which can let you have checkpoints closer together. Failover test evaluated by having a disk-, network- and CPU-heavy workload. Continually ping the VM. The applications at network connections continue without visible disruption. Hypervisor based fault tolerance did deterministic execution on primary and backup. Required deep hardware knowledge. Has also been shown that deterministic execution for SMP can even be slower than the uniprocessor case. Future work is to tie this together with Parallax (see EuroSys notes). Want to look at time travel with checkpoint re-execution. Also want to look at disaster protection with long-distance links => high latency and WAN IP-related issues. --- Nysiad: Practical Protocol Transformation to Tolerate Bynzantine Failures (Robbert van Renesse, Cornell) Want to mask any kind of failure in a large distributed system. So you'll need redundancy and diversity. Deterministic or logical failures can't be handled by this: want instead to tolerate non-deterministic failures (heisenbugs, etc.) or malicious behaviour. Crash-failures are relatively rare (<20% of failures, shown by studies). Fail-stop is a subset of crash failure is a subset of arbitrary or Byzantine failures. How do you detect fail-stop on the Internet? Pinging alone doesn't give accurate detection. Remus deals with fail-stop. Paxos deals with crash failure. This paper looks at arbitrary/Byzantine failures. Know how to deal with crashes: detect failures and reconfigure. We also know how to build storage services that tolerate arbitrary failures efficiently. How do we make DNS/BGP/MapReduce deal with these failures? Generalised approach uses replicated state machines: ensure that all state machines take the same inputs in the same order. Requires consensus for agreement; output requires simple voting. But consensus is hard to solve (impossible with crash failures and no assumptions about timing), and expensive. How do you replicate a big distributed system? You could replicate every node, ending up with a much larger number of nodes. But this would be overkill. The network is already capable of dealing with crash failures. Nysiad makes the network able to deal with arbitrary failures. Like replicated state machine approach, do something like primary-backup replication, which does FIFO broadcast of inputs from primary to backups. Again, for output, voting is sufficient. Each host has a collection of guards, which track the state of the primary host, attest the validity of its output and makes sure it processes all of its input or none of it. They do not take over when the primary fails (mask crash failures). Each host gets 3t+1 guards (t is the number of tolerated failures). Any two neighbours (hosts connected by a communication link) must have 2t+1 common "monitors" (of the communication link). OARCAST is an ordered, authenticated, reliable broadcast protocol which is FIFO. The sender might be Byzantine, so every guard sends in the same order, even if it is Byzantine. The messages are persistent. Guarantees the relay property: if one guard delivers a message, all correct guards deliver a message. Also, a primary is authenticated (can't pretend to be somebody else). This is weaker than consensus: only one "proposer", and it may block if the proposer is faulty. Primary proposes message to its guards, collects n-t signed responses from the guards, then sends an "order certificate" to the guards. Assume tacitly that the guard graph is static and known. But the communication graph is unknown initially and changes over time, which means the guard graph has to be updated accordingly -- a non-trivial problem. Each host goes through a sequence of epochs. A host has an epoch certificate, with the host identifier, the epoch number, the set of identifiers of the guards for that host, and a hash of the final state of the host in its previous epoch. This certificate is generated by a certification authority: the "Olympus". Changing of the guards protocol: 1. guards send a hash of their state to Olympus. 2. Olympus collects n-t identical state hashes (by disabling OARcast for this epoch, this fixes the state for this epoch). 3. Olympus assigns new guards to the host and generates a new epoch certificate. How expensive is Nysiad? Can it be used to replicate a file-server -- can we hide all these network messages behind the disk requests? But does it also apply when there isn't disk I/O? Automatic translation of Paxos (crash-tolerant) is pretty similar to PBFT. Evaluated by simulations on Scalable Source Routing, and Multicast Trees. Tested on a random network and a balanced binary tree network. The low-level latency cost is 3x (due to the three phases of OARCAST). Message overhead factor (Byzantine vs. Crash-tolerant) is expected to be about 19, due to OARCAST. Gets worse with t, but scales well in the number of hosts. The computational overhead looks roughly logarithmic in the number of hosts. The message overhead can be very low, but seems to reach an asymptote around a factor of 12. How could performance be improved? Hardware multicast would help (Evaluation used point-to-point messages). Could reduce the use of Public Key Cryptography. Could combine with Intrusion Detection and Accountability to detect some faulty guards. Application-dependent optimisations are also possible. Shown that you don't need to solve Consensus to achieve Byzantine fault tolerance (Byzantine and Crash failures are orthogonal). --- BFT Protocols under Fire (Atul Singh, MPI-SWS/Rice) BFT research has seen a lot of attention, with many major papers showing an improvement in performance. However, there is no apples-to-apples comparison, and we don't see how they perform under a variety of conditions. (For example, most evaluations are based on fairly benign conditions.) How could we compare them apples-to-apples? Firstly, based on implementations, but this is difficult because of different implementation choices (of languages, crypto, transport, etc.). Or could do it analytically, but this is only possible for relatively simple cases: it's hard to capture the dynamic behaviour of the protocols. Talk presents BFTSim, which allows the specification of protocols in a high-level declarative language (OverLog/P2). The cost of message processing and cryptographic operations is specified, as is the network topology. Execution is then simulated. These inputs are sufficient to predict the performance of a native implementation. Brief explanation of the Zyzzyva BFT protocol (see SOSP notes). BFTSim uses OverLog, which is built on DataLog. An action is predicated on an event occurring and a number of preconditions being meant. BFTSim simulations the computation and the network part. OverLog is extended to simulate compute-intensive operations, which introduces a delay (e.g. for crypto) in virtual time. PBFT implemented in 130 OverLog ruls; Q/U in 90 rules; Zyzzyva/5 in 150 rules. Compared to roughly 10--15k lines of code. OverLog engine is built over the ns2 network simulator, which enables the simulation of effects in the network. Validation of the results: can BFTSim match the results of real implementations. Validated in three settings (baseline (f=1), with a faulty mute replica, and with a higher replication factor). BFTSim shows a Latency-Throughput curve that roughly matches the published results of the PBFT, Zyzzyva (difference within 10%) and Q/U protocols. More results are shown in the paper, which also exhibit the matching characteristics. What can we use BFTSim? Can compare protocols in identical conditions. Can then evaluate the protocols under a wide range of conditions. Enables the evaluation of protocol optimisations before having to implement them in gory details. Can finally make BFT protocols more accessible. Under identical conditions, compare Q/U, PBFT and Zyzzyva (shows Zyzzyva beats Q/U, which beats PBFT) for throughput/latency. For batching = 100, PBFT beats Q/U (with no batching), but Zyzzyva wins by a slight margin. For larger payloads, all protocols perform roughly equally badly. But, zooming in, Q/U beats both PBFT and Zyzzyva. Why does Q/U do better? Because clients can pick a preferred quorum. So there is no one-size-fits-all protocol (at least at present). Evaluation with heterogeneous network connectivity: one replica has a slow link. Q/U shows performance degradation as the slow link is made even slower. PBFT shows a slight improvement in performance as the slow link becomes slower (because it does less crypto and doesn't have to wait for all replicas to get back to it). Zyzzyva slows down dramatically, but Zyzzyva/5 shows no change in performance. Q/U doesn't perform as expected. No contention in the workload is ideal for Q/U, but it is still worse compared to Zyzzyva/5. They show a potential optimisation for Q/U which uses "optimised messages". --- Uncovering Performance Differences Among Backbone ISPs with Netdiff (Ming Zhang, MSR) There have been lots of studies evaluating and comparing many systems (e.g. file systems, databases, web servers), but little systematic work on comparing/evaluating ISPs. Customers want to know what ISP is best for their needs. ISPs offer service level agreements, but these are typically only specified within the ISP network, not end-to-end. Only consider aggregate performance, not the user-perceived performance. And it's hard to perform a comparison. The Keynote study hade limited covereage, and only measured individual paths. ISP comparison must be relevant to customers (measuring end-to-end path, targetting destinations of interest and comparing based on application workloads). Can also be helpful for ISPs, to compare internal versus external performance, identify bad points of presence (or destinations). Ideally cover every PoP, destination, etc. Instead, Netdiff places Probers at the edge of the network. Using a single probe, can infer performance between ingress and egress PoPs. Therefore, Netdiff doesn't require cooperation from the ISPs. Measurement process uses a centralised controller, and lots of probers. The controller chooses a probing list of destinations, and sends these out to the probers. The probers probe and send the results back to the controller. How are the probed destinations selected? If you probe all prefixes, there is a lot of redundancy. Therefore Netdiff maps the ISP topology at a large time scale (every day or so). Every 15 minutes or so, Netdiff measures the ISP performance, by probing a subset of prefixes. Want to cover as many paths as possible without causing overload or doing too much work: use an optimisation algorithm (covering/pack problem -- greedy algorithm). Using the optimisation, 18 backbone ISPs can be probed with 5--23k probes per ISP, which is a 400x reduction (compared to the naïve approach). This takes 20 minutes per ISP. Another issue is, due to using single-ended probes, can get noise and errors. Nodes and links can be overloaded, paths can be asymmetric, etc. Validated with Keynote (double-ended probes). CDF of the relative difference in paths shows a fairly small difference factor. Deployed on PlanetLab, with 300+ sites worldwide, and 18 backbone ISPs considered. Compares internal paths (PoP->PoP) and destination (PoP->destination network). Evaluated measured latency minus optimal latency, accounting for difference in path length (the path stretch). Paths are groups based on length (short <20ms, medium 20--50ms and long >50ms (international, transcontinental)). Comparison of latency stretch shows that internal vs. external latency stretch gives different rankings of ISPs. Best choice of ISP is dependent on workload (e.g. distance, geographic region of clients, etc.). Rankings are totally different for medium path vs. long path. Medium paths to/from the US gives a different ranking from general medium paths. Netdiff.org provides raw data, and ways of drilling down, based on source/destination country, city and destination. --- Effective Diagnosis of Routing Disruptions from End Systems (Ying Zhang, UMich) Can this diagnosis be made without accessing ISP-proprietary data? Yes, and this paper shows how. Routing disruptions impact the performance of real-time applications like VoIP, etc. Routing events can cause high packet loss and long delays. Existing diagnosis work are ISP-centric, based on analysis of BGP data, collected using BGP monitors on routers belonging to ISPs. This is collected passively (by establishing a BGP session with the routers), and accurately reflects AS-level path changes. On the other hand, it is limited, because it is difficult to get ISPs to release this data (privacy concerns). Also, BGP paths reflect AS-level paths, which sometimes do not reflect data plane paths, and intra-domain changes. So we want to infer the data-plane paths of many routers from end systems. Required to cover all the PoPs of a target ISP. Also need to cover most of the destinations on the Internet (PoP-to-destination paths). Routing events are identified by comparing paths that are measured consecutively (snapshots of the network routes). Advantage is that no access to the sensitive data is required. It identifies actual data-plane paths, and monitors data plane performance. Challenges are that the probing resources are limited to the hosts we control (need still to ensure coverage of probed paths and timing granularity). There is also measurement noise in the traceroute data. Colloborative probing -> event identification and classification -> event correlation and inference -> event impact analysis -> reports. Colloborative probing uses traceroute from a set of probing hosts, to learn the routing state and improve coverage (to cover PoPs and paths). Also need to reduce overhead on the system. Event classification by comparing routing snapshots according to ingress and egress changes. Type 1 event: ingress PoP change (upstream ISP changing route, failure in the ISP, downstream ISP changing announcements). Type 2 event: same ingress PoP, different egress PoP (internal or external disruptions). Type 3 event: same ingress and egress PoP. Likely causes: link failures from egress PoP to neighbouring AS. This causes a new egress PoP to be chosen; internal distance changes as the cost of an old internal path increases or the cost of the new internal path decreases (hot potato routing). Event correlation: spatial correlation where a single network failure affects multiple routers, or temporal correlation where routing events happening close together are likely to be related. Inference based on an "evidence": an event that supports the cause; or a conflict: a measurement trace that conflicts with the cause. Greedy algorithm over evidence and conflict graphs which finds the minimum set of causes that can explain all the evidence while minimising the number of conflicts. Evaluated against a number of target ASs, classified routing events into types. Most changes are Type 3, making changes in the Internal PoP path. Validated against a BGP-based approach, to see how many hot potato changes are identified (and the intersection between the two approaches). The methods show a lot of overlap, but also 30--45% false positives/negatives. Inaccuracy is due to the limited coverage (only access to PlanetLab nodes), coarse-grained probing (only probing every 18 minutes) and measurement noise. The performance can keep up with generated routing state. --- cSAMP: A System for Network-Wide Flow Monitoring (Vyas Sekar, CMU) Why is network monitoring needed? Need to know how to configure routers so that users see good performance, understand new applications, detect anomalies and perform network forensics. Need good traffic measurements for this: not just the volume, but the fine grained structure of the end-to-end communication. Monitoring applications generate flow reports, showing the ports and protocols for flows between the same source and destination (plus packet or byte counters). These reports should respect resource constraints (not generate too much traffic themselves), provide high coverage of flows, and fulfil network-wide goals, and should force low overhead in data management -- shouldn't have too much redundancy. Routers have to sample the flows: insufficient resources to track every flow -- even as these limits rise, the demand increases at least as fast. State of the art is based on uniform packet sampling. Each router samples packets independently, and aggregates these into flow reports. This respects resource constraints, does not have high flow coverage (because they are biased towards large flows), does not satisfy network-wide goals (because they are too coarse-grained), and contains a lot of redundant measurements, so there is a lot of data processing required to transform the reports into a useful aggregate. Observe that packets sampling has low flow coverage due to its bias -- can we pick an unbiased sampling algorithm? Also, routers sample independently leading to redundancy -- can we manage them in a coordinated fashion to satisfy high-level objectives (and still satisfy resource constraints). cSAMP == coordinated sampling. On a single router, do random *flow* (not packet) sampling. The packet header (protocol, source/dest IP source/dest port) is hashed and the router only logs when the has is within its subset of the hash space. By sampling these at random, this increases coverage. What if there are multiple routers on a single path? Use hash-based coordination: the hash ranges for each router are disjoint. This implements coordination without communication. Some implicit communication is required to start and configure the hash ranges. Network-wide optimisation is used to satisfy network-wide constraints and objectives. What if we move from a single path to a network? Multiple origin-destination pairs in the network. Per origin-destination pair, assing non-overlapping ranges to each router. Each router has a sampling manifest that specifies the hash range for each origin-destination pair that it might see. For each packet, see if it should be logged (based on hash and origin-destination), and log it. Network operations centre now has to generate sampling manifests for its routers. The routers then generate flow reports which can be sent back to existing applications. Manifests are generated by taking origin-destination pair info, traffic stats for these routes and the router topology, plus router constraints. This is then optimised for the whole network: maximising the total product of coverage and traffic at each router. Also want to maximise to minimum fractional coverage at any router. This can be formulated as a linear program. And the output is a set of sampling manifests for each router. Practical challenges. What about changing traffic flows? Can use a history a traffic matrices and an adaptation scheme. Interior routers identify origin-destination pairs by assuming that routers mark packets with this information. For multi-path routing, this can be dealt with by a simple, lightweight extension. The algorithm needs two optimisations to be scalable (linear programming as-is is not scalable) Evaluated using data from Educational and Rocketfuel ASs, a traffic matrix based on a population model. Total volume is scaled based on the number of PoPs (assuming 8 million flows baseline for Internet2). Measured coverage, network-wide goals and redundancy. Compared to flow sampling (fixed-rate and maximal) with same memory (400k flow records); and packet sampling (1/100 and 1/50) with infinite memory. Hard to compare flow- and packet-sampling unless you give packet sampling infinite memory (which is optimistic for packet sampling). cSAMP gets better coverage (as a fraction of the flows covered) than the flow or packet sampling methods. Significantly better for the minimal fractional coverage at any router. Evaluated robustness to traffic dynamics, based on data from Internet2. Scalability? Takes 0.16s to generate sampling manifests for 70 PoPs; 10s for 350 routers. Enables responding to traffic dynamics within minutes. Achieves all of the goals set out at the start of the paper. --- Studying Black Holes on the Internet with Hubble (Ethan Katz-Bassett, UW) Goal of the internet is global reachability: every address reachable from every other address. It seems to usually work, and can we assume that this is just due to the protocols? But observing the outages and NANOG mailing lists, we can see lots of outages. But traces from PlanetLab show that some endpoints become unreachable from some monitoring nodes, but not others. Goal of Hubble is to monitor these long-lasting reachability problems and identify the causes. Ping monitors identify targets. Then distributed traceroutes are launched, and the problem is classified by grouping failed traceroutes. Spoofed probes try to isolate the direction of failure (based on known working routes) for asymmetric paths. Architecture is to ping prefixes every two minutes from PlanetLab to see if it is still reachable, reporting the target after a series of failed pings. Also maintain BGP tables based on feeds. Then launch the distributed traceroutes to determine which routers and ASs have reachability. While problem persists, keep probing every 15 minutes. Problem is classified, based on current topology, historical topology and direction isolation. Want to do this automatically, in real-time, to find a common entity that explains a substantial number of the failed traceroutes to a prefix. Historical topoplgy uses daily probes from PlanetLab to all prefixes to give a baseline view of paths before problems occur. Direction isolation because traceroutes only return routers on the forward path, so you might assume the last hop is the problem, but this assumes a working reverse path. Use spoofing from nodes that have visibility to fix this. Deployment on RON does this, PlanetLab disallows it. Evaluated based on the identification of targets (how much of the Internet), reachability analysis and problem classification. Coverage is 89% of net address space every 2 minutes; 92% of edge ASs, origin ASs for 99% of 14 million BitTorrent users. Path coverage is compared with BGP paths of 447 RIPE peers. Better coverage on "downhill" ASs, as PlanetLab is mostly at education institutions. For 60% of monitored ASs, traverses ALL downhill RIPE ASs. Classification into 9 classes, automatically classified 82% of 457,960 problems. Spoofing manages to isolate 68% of failures from spoofing sources. 47% of failures on forward paths, 21% on reverse. Study using Hubble: 3 weeks of data starting 17/9/2007 -- 31,000 black holes involving 10,000 prefixes. 20% of black holes lasted at least 10 hours; 68% were cases of partial reachability. Study also showed BGP updates (from RouteViews) only correlate with 38% of the problems. Multi-homing may not give failure-resilience. Inconsistencies across ASs occur: where an AS is responsible for partial reachability, other paths across the AS work. 75% of router problems are with routers that are not on the baseline path. --- Maelstrom: Transparent Error Correction for Lambda Networks (Mahesh Balakrishnan, Cornell) Lambda network: a high-speed optical network between distributed data centres (used for data mirroring, disaster tolerance, outsourcing). Lots of bandwidth but performance is really bad on these links: TCP collapses. TCP fails three ways: throughput collapse (due to loss -- interpreted as congestion); requires massive buffers for high-rate traffic; and feedback loop between receiver and sender which causes recovery delays for time-critical traffic. But we can't change TCP: so have tried splitting flows into multiple flows (ad-hoc, application specific), resize TCP buffers, spend a lot of money to ensure losslessness. Is perfection achievable? Perhaps yes, but not in today's high-speed networks. TeraGrid is an excellent data point: heavily-used but not congested. End-to-end monitoring framework between two sites on TeraGrid, showed 24% of measurements had 0.01% of their packets lost. 14% of measurements had 0.1% of their packets lost. Due to transient congestion, degraded fibre, malfunctioning/misconfigured HW, many other reasons. Want to run unmodified TCP/IP over high-speed lossy networks. Maelstrom is an appliance installed as a passive device at either side of the lambda link. No modification to end host or network. Uses Forward Error Correction (FEC): given a communication channel, the sender injects redundancy into the channel so that the receiver can retrieve missing data. Recovery latency doesn't depend on the length of the link (RTT). FEC has a constant data overhead (proportion of packets that are injected for redundancy). Packet-level FEC at end hosts is inexpensive and requires no extra hardware, but you can't simply change all end-host operating systems, and the recovery latency depends on the channel data rate. (Because the sender does not create redundant packets until it has all the data packets.) Solution is to do end-to-end FEC between data centres. Mechanism: send-side appliance creates error-correction packets for all IP packets that leave the data centre (e.g. by XORing 5 packets together). FEC is susceptible to bursty/correlated losses. The existing solution is to interleave several high-rate streams, giving a tradeoff between recovery latency and burst protection. Instead of protecting against the maximum burst size, want latency to depend on the actual burst size. Therefore use layered interleaving (every 100 packets, every 10 packets, etc.), giving not quite as strong correction as existing codes, but better latency. Two flow control modes for TCP/IP traffic: end-to-end or split. Split the connection into multiple flows, which avoids client buffer resizing (performance-enhancing proxy). Works also for UDP-based protocols, but loss can also occur at the end-host buffer (kernel overflow or bad NIC). Maelstrom on the receive side also acts as a packet cache. Works at 1Gbps on a 3GHz box. Implemented as a Linux kernel module. Showed how TCP throughput collapses with 0.1% and 1% loss. Maelstrom is almost as good as the no loss case, except where the latency is very short (cannot keep up with a 900Mbps connection). As the latency increases, throughput decreases as with TCP. Split mode and buffering makes throughput independent of RTT, effectively. For TCP, with loss, the delivery latency is up to 500ms (big spikes, jitter). Maelstrom is very steady, with no spikes above 100ms. 80% of packets are recovered right away with layered interleaving, loss burst length of 1. As burst length increases, the packets take longer to be recovered. Q: rateless codes code be used? Possibly too expensive, but there may be new codes (systematic rateless codes, published in 2006) that are efficient to decode. Q: should evaluate against the ad-hoc solutions? Are they fair to other flows? Q: delta between this and OverQoS? Q: can Maelstrom do 1-N connectivity? No, does 1-1 mapping of Maelstrom appliances (send/receive). --- Swift: A Fast Dynamic Packet Filter (Zhenyu Wu, College of William and Mary) CSPF is the pioneer packet filter: small piece of code in the kernel which does in-place filtering. BPF is a milestone, with novel techniques, high performance, cross platform, library support, and is widely used. Problems with it: long filter install latency (compilation, user-kernel copy and security checking -- order of milliseconds to seconds). With dynamic filtering, the filtering criteria may not be known in advance and this loading may have to be done online. Example is FTP passive data transfer: neither client nor server knows the port that will be used. Filter update has to complete before the data transfer begins (and after the PASV command is replied to). If it doesn't, this leaves an opening to the attacker. Filter execution also incurs latency. Lots of pseudo-machine interpretation overhead to do simple things (for security). Previous solutions focussed on improving execution efficiency. xPF, MPF, DPF. FFPF proposed a packet-filtering framework that improves the efficiency of multiple simultaneous filters: external functions are loaded into the kernel as a kernel module. Difficult to program and could introduce security issues. Swift uses a pseudo-machine, but renovates the filter engine with fast updates and improved instruction set efficient. 1000x speedup over BPF for update, and 3x faster for execution. Includes a specialised instruction set to avoid filter compilation, and bridge the gap between high and low level languages. Instruction set based on BPF language primitives. Strict program organisation allows incremental program update: only modify what is necessary. Simple to address the part of the program that needs to be modified by indexing. Security checking is eliminated by using acyclic definitive finite automata: make the engine secure and so don't need to check the program. Programs involve executing all instructions in a program (if all pass, the packet is passed); if any fails, then go to the next program. Swift designed to be SIMD, for e.g. comparing against multiple IP addresses at once. Or reduce the number of instructions required to check multiple things (three different fields at once). Swift filters that are copied admit optimisations so you don't have to execute the same checks more than once. Implemented in Linux 2.6: 3 added files, 9 modified files. Compatible with BPF. Controlled with a user-level control library. Evaluated for dynamic filtering with FTP passive data transfer capturing. Workload of 1--200 concurrent FTP data sessions with and without high speed background traffic. Filter update latency is constant in the number of sessions, and also is on the order of microseconds (against milliseconds for BPF). Swift misses far fewer packets than BPF. For static filtering, compared against LSF and optimised C, with six different filtering criteria of increasing complexity and workload. Swift has slightly higher execution latency than LSF, but not particularly significantly, for simple filters. For more complex filters, Swift outperforms LSF. --- Securing distributed systems with information flow control (Nickolai Zeldovich, Stanford) Traditional web applications have lots of trusted code (HTTP frontend, application code, database). Typically the application is the least trustworthy thing. Millions of lines of code, lots of 3rd part libraries, etc. Vulnerabilities here can lead to catastrophic exploits. Information Flow Control has been used in operating systems to contain user data securely, even if the application code is buggy or malicious. Ensures that only the correct user's data can be sent to that user's web browser. But IFC only works on a single machine. Web applications often need multiple machines to scale. This work extends IFC to distributed systems. Traditional IFC. Labels control information flow, labels associated with processes, files, etc. So blue files can be read by blue processes, but purple files can't be read by the same process, for example. Can give processes the permission the remove labels, request new labels, etc. For a web app, give user data a label (in the database), run separate versions of the application for each user that have that label, and allow the HTTP front end to declassify for sending to the user (trusting the front end to do the right thing). Distributed challenge: how can we decide whether two processes on two different machines should be able to communicate? Need to be decentralised (no fully trusted part) (not traditional DIFC). Encode a label in messages, and let each machine enforce IFC locally. How can we know to trust the OS on the recipient machine? Trust on a per-category basis. Add an "exporter" daemon on each machine that is used to send or receive messages, enforcing IFC on the messages locally, and ensuring correct labels on external messages. Exporter provides a simple send API. Covert channel exists if you just query category owners to see if the message should be able to be sent to a particular host (strawman design). Another strawman is to put trust relationships in the exporter, which also has a covert channel. Can't eliminate all covert channels, but can avoid them in the interface. Self-certifying category names: categories are named by a public key; trust for a category is defined by a certificate signed by the category's private key. The sender provides all certificates needed by the exporter to make the decision on trust. The exporter is stateless (avoiding covert channels), enables the sending of labelled datagrams, and is small (only 3700 lines of code plus cryptographic libraries). Can implement RPC on top of the exporter using exp_send datagrams. Can build resource allocation service on top of this RPC, also a program invocation service (start code running on another machine), and applications on top of these. How to bootstrap a new machine in the system? Need to gain access to a new machine's resources using admin privileges on an existing machine. Need to set up public/private keys on the new machine, then use the program invocation service. Traditional web server has >1 million lines of trusted code. Can use IFC to ensure that the application code cannot leak user data. Can also deprivilege a number of other components. This extends to the distributed case using the exporter daemons. Can also work with multiple different operating systems (Flume, HiStar, even legacy OSs such as Linux). Incremental deployment is possible, such as running an untrusted perl script on HiStar, and running the rest of the application on legacy Linux. Benefit of decentralised nature is that the application can be scaled to run on third party computational clusters. This would enable secure mashups that combine data from multiple web sites without having to share the same application platform. Q: for the non-expert, how does this differ from a capability system? Strictly more general: capabilities are like the privileged things that can declassify data. --- Wedge: Splitting Applications into Reduced-Privilege Compartments (Andrea Bittau, UCL) Servers hold sensitive data, and software vulnerabilities can be exploited by attackers to get access to this data. e.g. Need to keep an SSL server's RSA private key secret. A master reads the key and creates worker processes that talk to the network. If the web server runs as root and is exploited, it could read the RSA key from disk: so we run it as nobody. The worker process though has a bunch of memory that includes the private key (handed down by the master). Therefore it is still possible to attack the worker and read out the private key from memory. Want to limit the access of code to memory at a fine granularity. Based on the principle of least privilege: partition applications into compartments and restrict what they can do. Could just use processes, by forking. But this means that the child inherits memory map and file descriptors. Memory needs a lot of scrubbing to be sure that it doesn't contain sensitive data. Want a default-deny policy: inherit nothing from parent. But this is difficult to use for legacy code. How many permissions do we need to explicitly grant? Lots: over 600 memory objects for an Apache worker process. Contribute new system calls for default-deny that create compartments and specify privileges. Contribute tools to make this usable when partitioning legacy code. sthreads are default-deny compartments. Creating a new sthread, it inherits nothing (secure by default). Can only give permissions at a page-level granularity (privilege enforced by page tables). What if sensitive and non-sensitive data ends up on the same page? Use "tagged memory", malloc with a tag, which puts all memory with the same tag within the same memory region (page-aligned). sthreads can be granted access to tags. sthreads use sensitive data through call gates. A call gate is an entry-point with predefined privileges -- a subset of the creators privileges. Can then be used for privilege elevation. But Apache is complicated. Can we do static analysis to identify memory accesses? It might fail (due to function pointers) or might be too conservative and give too many privileges. Crowbar performs runtime analysis of memory accesses to learn the privileges that each compartment needs. Difficulty is ensuring high trace coverage. Dynamic analysis will really give least privilege, but might be too little if you haven't considered a possible code path. Applied to Apache+OpenSSL. Goal is to protect the private key. But what about the session key? Threat models are passive eavesdropping or active man-in-the-middle. An attacker can generate an arbitrary session key. Can prevent arbitrary session key leak, but still be vulnerable to a man-in-the-middle attack, if the man-in-the-middle has exploited the server. Need to prevent the session key from ever being exposed during the SSL handshake, so that it isn't visible to the man-in-the-middle. Solution is to kill the SSL handshake handler (which could have been exploited) and creating a client handler, which can only be exploited by knowing the session key (which hasn't been exposed). Only the client handler touches the sensitive data. sthreads is a userland library of 1154 lines. Crowbar is a 2391-line binary instrumentation tool, based on Pin. Needed to change 1700 lines of code in Apache to make it separated. Only 6% of the 252030 is privileged (in call gates) after the separation. Crowbar can get a trace for Apache in 15s. For SPEC applications, it takes 82s on average. Anecdotally, one trace is enough. Cost in throughput: 2x slower than vanilla Apache. Due to the inability to cache session keys (overhead of RSA). For a small static page. Wedge is finer-grained than previous privilege separation work. DIFC is also related. Crowbar could be used for DIFC. Wedge does not allow unprivileged code to compute over sensitive data at all, unlike DIFC which allows unprivileged code to handle sensitive data and control where it may be sent. --- Reducing Network Energy Consumption via Sleeping and Rate Adaptation (Sergiu Nedevschi, UCB) Network energy consumption is becoming a big issue: routers are power hungry, power costs are becoming a significant fraction of TCO, and of course there are energy concerns. IEEE has a task force on energy-efficiency. Opportunity: networks are provisioned for peak load (e.g. the phone network needs to work well on Mother's Day at 2PM), but average utilisation is quite low (33% on AT&T switched voice). But energy consumption is proportional to capacity, not actual utilisation. Idle energy consumption is high. Cisco linecard draws 80W when idle; 90W when fully loaded. Idea: let equipment sleep for brief periods of time or slow down when lightly loaded to save energy. Inspired by PCs and processors which do this (frequency scaling, Hibernate mode). Sleeping reduces the idle energy, and slowing down reduces the power used for idle and active. Assuming that there's support for these states in NICs, linecards, switches, routers... which there isn't at present (though it's forthcoming). Depends on the type and extent of this support, and the careful use of these states to protect performance and maximise savings. How much energy can we save without compromising performance? What kind of hardware support do we need? Can we realise these savings with practical schemes? Methodology is to 1. Model hardware support for sleep and rate adaptation, 2. Evaluate savings/performance with trace-driven simulations (on ns), and 3. Look for (unrealistic) bounds on the performance and savings. Model sleep as Power_idle, Power_sleep and a transition period of maybe 10us or longer -- maybe 10ms. Timer or activity-driven wakeup -- proposed to be standardised by IEEE. Assume independent sleep state for different interfaces. Measure the energy savings in percentage of time asleep, and the performance loss and maximum delay. A link can sleep between packets if the transition time is less than half of the inter-packet time. What if we buffer and burst packets so that there is a longer period of time during which we can sleep? More efficient than trying to sneak in the sleep between the packets. Propose buffering packets at the ingress, then every Bms, burst them across the link (say B = 10). Buffering only at the ingress, so only one additional delay end-to-end. Works in a chain topology. In a more generic topology, need to coordinate burst times between different ingresses. Want to align these by adjusting the relative timing phase of different ingresses. So that internal routers handle all the bursts in a big burst and can then go to sleep. Perfect coordination is not generally possible, but can find an upper bound (optB&B) for how much can be achieved with this scheme. Potential benefits of sleeping are bounded by average utilisation = 100% - %age of time asleep. optB&B does better than traffic with no shaping (either CBR or pareto flows). Practical sleeping algorithm (practB&B) can achieve quite close performance to optB&B. A simple scheme with buffering and bursting independently. The impact of sleeping on delay is that there is no added loss, and the added delay is bounded by the burst period, B. Need quick sleep transition times of <1ms to give low delays. For rate adaptation model, assume we have a discrete set of rates, each costing a particular amount in power. Also a transition period between rates, and independent rate adaptation. Same metrics. Want to decrease rate as much as possible without introducing more than d ms per hop. So need to pick the service rate so that the service curve follows shortest Euclidean distance between arrival curve and departure curve (of packets at the link/router). practRA is a practical algorithm for achieving this. Performs well for uniform rate distributions (close to upper bound), but not so well for exponential distributions. No packet loss observed, and added delay less than d * number of hops. Comparison between sleeping and rate adaptation. Look at power measurements of a multi-rate Gbit NIC. Idle and active powers are pretty similar, and appear to scale with rate. Sleep gives the best energy savings -- much better than rate adaptation. Modelling future power profiles: active power is a constant plus some function of the rate. Frequency scaling power is linear with the rate, but voltage scaling power is about cubic with the rate. Rate adaptation does much better when dynamic voltage scaling is available; not so well with frequency scaling. More opportunity for scaling when the idle power is large. Rate adaptation does better when more of the power scales with the rate (e.g. with dynamic voltage scaling). Can halve energy consumption for lightly-loaded links (10--20%). With little impact on performance, and realised with simple mechanisms. --- Energy-Aware Server Provisioning and Load Dispatching for Connection-Intensive Internet Services (Jie Liu, MSR Redmond) Energy consumption in the IT industry is getting out of control. Annual energy use doubled from 2000 to 2006. Fastest growing sector for energy consumption. Opportunity comes from server loads fluctuating over time (e.g. Windows Live Messenger (MSN) shows circadian rhythm in load pattern). An idle server consumes up to 60% of its peak energy. Power consumption is fairly linear in workload (from a high origin). Shutting down an unused server is the best thing to do. Dynamic voltage/frequency scaling isn't effective for server applications (yet?). Work focuses on stateful connection servers (IM), with long-lived, persistent connections, and which are CPU, memory and network intensive. Messenger works with a dispatch server that fields login requests, picks a connection server from a cluster and gives its address back to the client. Connection persists until the client logs off (or the server crashes). User observable performance: service availability (service not available error), service continuity (server initiated disconnection), service latency (transaction delays). Transaction delays aren't a big problem in the MSN interaction trace. Service continuity is the most important. Servers can host 100k connections but only accept O(dozens) of new connections per second. Handling a new login is orders of magnitude more expensive than keeping a connection alive. When a new server comes in, it takes load only gradually (takes a long time to fill up), but when a server is turned off, it takes a long time to redistribute these users, and we want to optimise this. Provisioning strategies means looking ahead. Need to compensate for forecasting error in this lookahead. And compensate for the load dispatching dynamics. Also need to be careful about shutting down servers (interruptions). Load forecasting by seasonal data regression: long term dependency on time, but with local adjustments. (To predict for this Thursday, look at last two Thursdays, and what happened half an hour ago (compared to what you expected).) Forecasting is "pretty good". Load dispatching: aim to have all servers having the same load (balancing), starve servers before shutting them down. Or load skewing, which manages the number of "tail servers" and seems to give better shutdown candidates. Can combine forecasting/reacting with balancing/skewing. Evaluated on 6-week trace from MSN. Scaled to 60 connection servers. 5 weeks of data for training; 6th week was forecasted. 2 weeks of data for modelling the CPU and system load, validated on 4 weeks of data. CPU power consumption modelled as linear relationship based on measuring a real server. Forecasting/balancing can save 30% of energy, but with huge amounts of disruption (compared to other approaches). Can tradeoff energy saving against the number of service interruptions. Reactive skewing can perform well with unpredictable loads, but accurate prediction gives the best energy savings/fewest interruptions. Should work at migrating connections between servers. This will get rid of the disruption (or just do client-side automatic reconnection).