Consensus Routing: The Internet as a Distributed System (John John, UW) *** Best Paper *** "A good network is one that I never have to think about" -- Greg Minshall. But the Internet doesn't quite satisfy this property, because BGP fails to achieve global reachability. Why? Physical paths exist, but no BGP path; 10--15% of BGP updates cause loops and black holes, and 90& of packet losses can be attributed to loops. Autonomous systems have different goals and incentives; may compete. BGP gives opaque policy routing: make preferred routes visible to neighbours, but the policy is hidden, and the ASs are autonomous. ASs can change policy at any time without prior notification. When AS receives new path, start using it right away; forward it to neighbours after some delay => globally inconsistent view. Example of what happens when there is a link failure, and how a routing loop can occur, which might last for tens of seconds. Hard to tell between a link failure withdrawal and a policy change. Problems are caused by inconsistent global state: link failures, traffic engineering, scheduled maintenance, links coming up. The protocol behaviour is complex and unpredictable, and hard to know when the system has converged. Consistency is the key to high availability! From distributed system design: separate safety from liveness. For safety, forwarding tables should always be consistent and policy compliant. For liveness, should react to changes quickly. Safety: apply updates only after they have reached everywhere, and do it synchronously. ASs compute and forward routes as before but don't apply them to the forwarding table. Periodically take a distributed snapshot (Chandy-Lamport). At the end, ASs know what updates they consider to be complete, so send these to consolidators (tier-1 ASs), which gather this information, agree on it (consensus protocol), then flood this information back to all ASs. Now all ASs know exactly which updates are incomplete and are hence unsafe to apply. So they just apply the completed ones. Liveness: problem is that on link failure, we need to wait til the path reaches everyone. During this delay, there will be packet losses and unavailability. So dynamically reroute the packets around the failed link, using existing techniques (pre-computed backup paths; detour routing). With consensus routing, global reachability is maintained while the link failure occurs (using the transient routing). After the first snapshot when the system has reconverged, can resume regular routing. For scheduled maintenance (withdrawal paired with announcement), can show that two updates depend on each other (so if one is incomplete, the other is incomplete). This makes sure of atomicity for these updates. Evaluated: how does it affect connectivity and what is the traffic overhead? Run on a simulator, emulating BGP messages sent by ASs. So evaluated at scale on the internet topology. Method: fail one of the access links for a multi-homed stub AS (should not cause disconnection). See what fraction of ASs are temporarily disconnected until convergence. On failing the link: BGP is seamless for <40% of ASs. Consensus routing achieves seamless connectivity for over 99% of ASs. The overhead is 30% at the media (30% more bytes sent for each failure). BGP updates are large.... --- Passport: secure and adoptable source authentication (Xiaowei Yang, UC Irvine) In DoS flooding attacks, attempt to exhaust a shared resource (*bandwidth*, memory, CPU): the most newsworthy weakness of the internet -- can be lucrative, bring down whole countries (' internet connection). The ability to spoof sources weakens defense against DoS. Filtering: Attackers can impersonate legitimate hosts, which leads to denial of service for the legitimate hosts (if they are blocked). Also, this makes it possible to evade filters. Also possible to launch a reflector attack (spoof the victim's address as the source and send packets to reflectors). DNS servers can act as reflectors. Pushback: attacks not sent directly to the source, instead hop-by-hop. Could combat DoS by making source addresses trustworthy (assumed by ingress filtering). Then build defense systems that build on this: filter-based, capability-based, etc. Ingress filtering is not secure, but is lightweight though not particularly adoptable (about 20% of hosts on the Internet can spoof, due to weak ASs). Digital signatures are secure, but too heavyweight (though they are adoptable). Passport is secure, lightweight and adoptable. Use efficient symmertric key crypto, use routing to distribute keys. Passport prevents AS-level spoofing. e.g. using egress filtering. Each AS is responsible for the prevention of internal spoofing. Each AS shares a pairwise symmetric key with each other AS. Border router stamps Message Authentication Code in a Passport header. Other border routers verify corresponding MACs. How do we obtain shared secret keys? Chicken-and-egg problem: need keys to send packets, need to send packets to share keys. Do Diffie-Hellman key exchange via routing messages. This is very efficient: an AS only needs to send one announcement to establish all pairwise keys. Incremental deployability: transparent to hosts and interoperable with legacy ASs. Downstream legacy ASs can also benefit from it. Secure under host, monitor and router attackers. Also handles path changes. Evaluated a Linux-based implementation using Click and XORP, measured overhead (throughput, processing, header, memory); model adoptability using "Modeling Adoptability of Secure BGP Protocols" from SIGCOMM 2006. Also do a security analysis. Calculate the immediate security benefit of adopting the system (based on how many hosts can no longer spoof). Model adoption by saying that if the benefit to someone is greater than some threshold, it will be adopted by an AS. The metric is the critical threshold: the higher it is, the more adoptable the system is. Compare Passport, Ingress filtering and SAVE for this metric. Ingress is not very adoptable using this metric. Passport is far more so. SAVE is somewhere in between. Also can show that Passport provides a stronger security benefit that the others. (Ingress filtering only provides a strong security benefit if all ASs implement it!) Applications: Passport prevents reflector attacks, strengthens capability-based DoS systems (ensures bandwidth fairness on the request channel), can be used to build a secure authenticated filtering system; also used for resource allocation etc. Secure/scalable filtering: filters can be installed close to the source. --- Context Based Routing (Saumitra Das, Purdue) A whole new paradigm for doing routing in wireless networks. Evolution of routing in multi-hop wireless routing. Started with Dynamic Source Routing, and then AODV. Broadcast and find out the route with the smallest hop count. Then in 2003, we saw hop count is not enough: should also consider loss rate. But then loss rate isn't enough, should also consider bandwidth. How do you characterise a link? Best transmission time? Can construct a graph of the network with link costs, and use Dijkstra's algorithm. Current practice: with a link property metric, do Dijkstra. Works well only in basic scenarios. More advanced scenario? Wireless network coding. Implemented by knowing what your neighbours have, and what they need. Need now to choose a path that exploits network coding benefits: mirror existing paths (if A->B->C->D exists should then use D->C->B->A to exploit coding). Or multi-radio, multi-channel networks. Can send different flows on different channels. Which path really exploits the presence of multiple radios (perhaps at different speeds; which might perhaps interfere?). Context based routing captures link interdependencies. Relies on Context-based metric design (for a specific scenario) and path selection (generic). How to do a context-based metric for network coding? Cost of a network link depends on the context (e.g. where the packet has come from), which reduces the cost when you already have a flow in the opposite direction. For multi-radio, state of the art is WCETT: try to use each different channel an equal number of times, but you can end up with primary interference (incoming and outgoing on the same channel) or secondary interference (A->B->C->D with A->B and C->D on the same channel). Can do better if you take into account the fact that the same channel has been used on the previous hop (or the hop before that). Context-based path selection: can't directly compute the metric with Dijkstra's algorithm -- the sub-path of a shortest path may not be a shortest path. The ideal solution will store context information about the past, but this leads to exponential growth in information. Fortunately, only the immediate past matters, which uses finite memory and enables you to prune paths (store the best path for each past context category). Evaluated by implementing in the Qualnet simulator, and also for Windows XP and for Linux. Compared performance in simulated topologies and a 14-node and 20-node testbed. CRP improves network coding: CRP codes 20,366 packets, versus LOSR+COPE manages 1,583 coded packets (for a regular 2-D mesh). On real testbed, it gets around 1.29--1.71x improvement on the number of coded packets. Also achieves gains for multi-radio (much better for UDP than TCP). --- NetComplex: A Complexity Metric for Networked System Designs (Byung-Gon Chun, ICSI) Engineering maxims stress simplicity: the same goes for system design. Many major papers talk about how a design is simpler than what has gone before. But we have no quantitative metrics for deciding whether a system is simple or complex. We only have algorithmic complexity, which is at time incongruent with system simplicity: flooding is simple, but has high message complexity. Is it possible to measure design simplicity? Focus on a complexity metric for the algorithmic component of a networked system. Should enable us to more rigorously compare and contrast designs, design simpler systems and align the goals of the theory and system communities. Aim to capture distributed dependencies: designs are centred on state, which may be distributed and have dependencies, but current metrics treat all state as equal. So we should measure dependencies, but how? Can just count them to avoid complicated probabilistic models (complexity metric should be simple!). For a given piece of state, s, measure the ensemble of distributed dependencies that must hold together for s to be consistent with the inputs from which s is derived. Different types of dependencies: value dependencies (linked if a change must be propagated, unlinked otherwise) and transport dependencies. Simple counting of dependencies is unsatisfactory. Series versus parallel data collection would end up with the same count of dependencies. Where you have intermediate dependencies (series), this is more complex than a single point of dependency (parallel). So we instead sum up all of the dependencies of intermediate state. Complexity for a node is the number of state changes relayed, the the number of transport states for relayeing, and the complexity of the local state. Then we can compute complexity for a particular piece of state. Operation complexity is derived from state complexity as the complexity of the state that is the result of an operation. e.g. routing: the operation that delivers state x from source to destination using the routing entries at d intermediate hops. So the complexity is 1 + the sum of the complexity of the routing entries at each hop. Deep dependencies are more complex than wide dependencies. (Three scenarios, 1-hop unicast, m direct value dependencies, 1 direct/m-1 indirect value dependencies -- the indirection makes it more complex). To compute a metric, define a dependency structure for a networked system algorithm (annotated DAG as a composition of canonical scenarios). Shows examples for link state and distance vector routing. Evaluated by computing complexity (against state and message complexity) for several routing schemes. Compact routing is the most complex, but has the lowest state and message complexity (most scalable). Also analysed classical distributed systems (consistency algorithms, coordination, shared variable, update propagation). Complexity metric appears to agree with intuition. Metric validated by a survey that shows it mostly matches a survey of 19 graduate students (in a distributed systems class). Metric complements scalability metrics. Does little to validate the assumptions, correctness or quality of a solution. Does relate to robustness, but only indirectly (low complexity can imply robustness). We might need better metrics (requirements discussed in paper). Could the metric be used to work out robustness? Might be intresting to compute complexity automatically, or use it to guide simpler system designs. --- DieCast: Testing Distributed Systems with an Accurate Scale Model (Diwaker Gupta, UCSD) Scalable framework for the testing of large systems. Ideally, you want to test your (e.g. high performance filesystem) on a set of diverse deployment environments (like all your clients use), but you only have limited testing infrastructure. How can you use this to test a large system? Goals are fidelity (close replication of the target system), reproducibility (controlled experiments) and efficiency (using little resources). DieCast scales up test infrastructure by >10x. Want to replicate target system using fewer machines, whilst maintaining resource equivalence (CPU capacity, network/disk response). Preserve application performance. But don't attempt to scale physical memory or secondary storage. We can use virtual machines to encapsulate individual machines and run them together on a single physical machine. Need to recreate the network topology using a network emulator. Efficient (in terms of resources used), but each VM only gets a fraction of the resource used, so we lose fidelity. Howe do we get fidelity back? Need to "inflate" the amount of apparent resource held by the virtual machines. Use previous work on time dilation: use time as a resource and trade it off. Makes the CPU speed, and network and disk I/O seem much faster. But external systems are running in a real time frame (e.g disk, network connections). For Disk I/O scaling, seek time and read/write throughput must be scaled, but the low level functionality is implemented in firmware (plus must deal with virtualised I/O model). Per-request scaling becomes difficult. Implementation on Xen 2 and 3, and can be ported to non-virtualised systems. Supports unmodified guest OSs and have disk I/O scaling for fullyvirtualised (with DiskSim) and paravirtualised (with scaling in the device driver). Fullyvirt Disk I/O handled in Dom0 by ioemu-dm. Guest assumes that it is using a real disk, where it might be backed by a file, or a partition, or whatever. ioemue passes request to DiskSim which manipulates the request completion time for the simulated disk. Simulated hardware is now decoupled from real hardware (can simulate different kinds of disks -- useful for motivating example of high performance file systems). Delay in DiskSim must take into account real hardware delay. Network I/O scaling. With no time dilation: real and perceived network bandwidth is about the same. But we need to scale down the bandwidth and scale up the latency. Use a network emulation environment, like ModelNet or Dummynet (or Linux traffic shaping). This all gives DieCast fidelity, reproducibility and efficiency. The scaled system almost looks like the original system. Validation: does it match the original system performance? Can a smaller system be configured to match the resources of a much larger system? Applications: RUBiS (eBay-like ecommerce service which ships with a workload generator). Topology includes databases, web servers and workload generators. Multiplexed 40 virtual machines onto 4 physical machines with Xen 3.1/fullyvirt. DieCast tracks the throughput performance of the Baseline system (no virtualisation) perfectly. The non-DieCast (just virtualised) system trails off. Response time similarly tracks the baseline system. CPU usage, memory usage and network utilisation tracks the baseline. Also validated on BitTorrent and Isaac. Matches application specific metrics and preserves the resource utilisation profile. Case study on Panasas -- who build storage systems for HPC. Difficult to replicate client deployment environment, and limited testing resources. Panasas uses a custom OS with integrated hardware/software. Doesn't run on Xen! So they ported DieCast to this environment. (Clients are just Linux so these can go in VMs.) They matched the Panasas performance metrics for their internal tests. 100 physical machines scaled up to 1000 clients. Limitations: don't scale memory, long running workloads are a problem, specialised time-sensitive hardware appliancies are a problem. --- D^3S: Debug Deployed Distributed Systems (Xuezheng Liu, MSR Asia) Difficult to reproduce bugs in a distributed system (machines execute concurrently, machines and network may fail). e.g. a distributed reader/writer lock: invariant is only one client can hold it exclusively. But optimisations make this tricky to reason about. Can debug it by simulation and model-checking (but these aren't good enough), so we need runtime checking. The state of the art in runtime checking is debugging by printf. Need to collect logs and retain partial order. Requires too much manual effort, and it's difficult to know what to log. For a large system, this becomes challenging: a central checker cannot keep up, and snapshots must be consistent. Focus on making runtime checking for large distributed systems. D3S lets you write a predicate which can be injected into the running system without stopping the running system. Can use multiple checkers to check the system concurrently. Contributed: simple language for distributed predicates, programmers can change checks on the fly, failure tolerance. Goal: to make it simple for the developer (to write predicates), to run in parallel (multiple predicate checkers), and to be correct. Idea from MapReduce for writing predicates. Predicate written as a pair of states, dataflow and triggers; and a C++ class containing checker code. Checkers see a sequence of consistent snapshots of tuples of exposed states. Hooks are injected using binary rewriting. To construct consistent snapshots, use a Lamport clock, but how does the checker know whether or not it receives all necessary states for a snapshot? Need to detect application node failures. Use an external service or built-in heartbeats. When a node isn't exposing changed state, it should periodically announce its timestamp. Experiment on five real-world systems. Does D3S help finding bugs, does it place a burden on developers, and is the checking overhead acceptable. All third-party systems! Case study on leader election, on a database application over 8 machines, with random crash & restart. Debugged with 3 checkers, partitioned by replica groups. Found a bug in this. For data centre and wide-area applications, found a number of correctness bugs, performance problems and tradeoffs. Though this involved writing 50--210 predicates per application. Performance overhead (under a stress test) is 3--8%. Negligible overhead in other systems. --- FIN ---