Intro 175 submissions -> 796 reviews -> 30 papers (all records). Union of USENIX, SIGOPS and SIGCOMM. >200 attendees (also a record). --- Best Papers: * Remus: High Availability via Asynchronous Virtual Machine Replication (see day 2, first paper) * Consensus Routing: The Internet as a Distributed System (see day 3, first paper) --- Xen and the Art of Virtualization Revisited (Ian Pratt, keynote) Brief history of Xen, why virtualisation matters, review of paravirtualisation and hardware-software co-design. History: Began with XenoServers at HotOS (federated, distributed computing infrastructure -- account for resource used and bill it to the users; mutually distrusting users using a VMM; locating computation at the most efficient place in the network). A struggle to get funding for this! Bet Keir Fraser that he couldn't knock something up: development started in April 2002, initially within the SRG, but then GPL'd it and got other people involved. Other technologies available at the time didn't provide the wanted isolation, and weren't open source, so this got the community interested. At SOSP in 2003, word got out, getting interest from industry. First solid release in April 2004: far more robust than the version for the SOSP paper. Could have stopped with the SOSP version, and focussed on writing research papers, but decided to turn it into something useful -> the 1.0 release. Making enterprise software (software that really works) is a lot of work! Hardware vendors started taking Xen seriously in 2004 -- iap worked at the boundary of software and hardware -- asking what they wanted in the CPU. Takes about 4 years to go from feature request to feature. But maybe better software algorithms will help in the future and move the goalposts. Xen 2.0 in November 2004: production-quality for running Linux and other OSS OSs. Used for virtual hosting commercially. Amazon initially used this. In 2005, gained traction with the OS vendors (RedHat, Novell, Sun), who wanted to incorporate Xen in their OS. Started adding paravirtualisation to make their OSs run well on Xen. In 2006, VMware and Microsoft began to see the benefits of paravirtualisation, incorporating it in their products. In September 2006, the first complete product -- XenEnterprise -- was released commercially. 7.5 years from the idea to the commercial release! Next month will see Xen available as an option embedded in Flash on Dell and HP servers. Makes virtualisation ubiquitous, built into the platform. Project mission: build the industry standard open source hypervisor (to be incorporated in products from multiple vendors); maintain industry-leading performance (new CPU feature -> CPU vendor works with project to implement support; help with paravirtualisation); maintain stability and quality (security! Belt and braces approach); work on multiple CPU types (4096 IA-64 box -> laptops/mobile devices -> Samsung/ARM mobile phones (3 VMs: radio software (completely isolated), factory installed applications, and user-supplied applications); foster innovation (work with other research groups (e.g. Brendan Cully's work which won the best paper award). Why is virtualisation hot? Scale-out projects: using commodity x86 servers with commodity OSs, instead of mainframes. One application per commodity server, leading to "server sprawl". Huge number of servers arriving every week, but only 5--15% utilisation of each. Why? Popular OSs don't provide: full configuration isolation (registry tweaks, DLL hell, but not just on Windows); sufficient temporal isolation for performance predictability (vendors won't support application if it's running alongside another application on the same machine); strong spatial isolation for security and reliability (possibility of correlated failures caused by one application crashing); and true backward compatibility for applications (especially for Linux: might be running RHEL3 and Linux 2.4 for existing applications, only putting new applications on RHEL4 (and scared of RHEL5)). Benefits: Provides a simple interface to OS images (compared to the system call interface), and is simpler than an OS server, so it's easier to reason about spatial and temporal protection. So you can consolidate on a single server, exploting multi-core CPUs. Makes management much easier: a uniform way across heterogeneous hardware. Use a single image across heterogeneous machines: useful for disaster recovery. 2nd generation benefits: avoid planned downtime with VM relocation (when you know a machine is going to fail). Dynamically rebalance workloads to meet SLAs -- but this puts quite a lot of load on an already-busy maching. Smart NICs can make this better (remote DMA; TCP offload). Hardware Fault Tolerance (viz. Remus): keep two instances synchronised on different hardware -- using checkpointing or deterministic replay. Security. Risk of a hidden hypervisor is a myth (Bluepill). However, there is a risk that an installed hypervisor could be exploited, and this can be dangerous. Hypervisors add to the attack surface: a network-facing control stack (written in OCaml for the commercial version: takes out a whole class of attacks -- actually pretty good for systems programming). Even worse: what if software running in a guest VM could break out of its containment? Maybe an exploit in the emulated devices? (Stub domains can help with this -- done for security and temporal isolation, but has led to better performance.) Measured launch: what you booted is what you installed, in read-only memory (TXT/SKINIT) can also help. Improving security with hypervisors. Move firewalls/IDS/malware scanning into another VM: more robust/harder to disable than if it is in the same VM. Do it on every machine, rather than for the whole network: put protection closer to the application, and no longer rely on protection provided by the OS. Harden OSs with immutable memory to protect kernel data structures. Taint tracking to make sure network-provided data never becomes executable. Logging and replay: how did things become infected? Next frontier is to make it easier to administer VMs. (Already made it easier to adminster the physical machines.) Breaking the bond between OS and hardware. Simplifies application-stack certification: only against a particular OS image; OS image then certified against the hypervisor -> virtual appliances. Virtual hardware makes it easier to create or modify OSs. e.g. Application-specific OSs (Database e.g.), or native execution of Apps on top of the hypervisor (JVM/Xen). Only need to target virtual hardware: don't need a variety of physical device drivers. Makes it easier for hardware vendors to "light up" new featuers more rapidly -- don't need to wait for new OS features. Paravirutalisation. First done by IBM on VM/370, then on the Denali and Xen projects around 2001/2. If you design your I/O interface to take advantage of virtualisation (quite different from a hardware interface), you can do better in performance. Let the OS know what physical memory is actually available to it lets it make better decisions. Difficult to reconcile virtual and wall-clock time: PV to make scheduling decisions on virtual time. Now some of these things aren't compulsory due to the virtualisation hardware from Intel and AMD: but instead you can do incremental paravirtualisation. Still important for performance, even with better hardware: higher-level paravirtualisation yield greater benefits, and these are best done in software. MMU virtualisation: challenging to make this fast, especially on SMP. Hot unplugging of virtual CPUs (if they are unnecessary), or multicast TLB flushes (huge win as a paravirtualisation -- expensive to emulate what is done in hardware). 3 MMU virtualisation modes: direct mode (original and arguably the best) (guest is aware of the subset of physical RAM that it owns; hypervisor enforces validity; guest calls hypervisor (in batches) to make changes to its tables). Shadow pagetables (guest thinks it has real page tables and contiguous physical memory; hypervisor maintains and synchronises a second set of real page tables) -- looked expensive and hard to do well, but with some guest modifications, software algorithms can do a good job: things keep getting better. Hardware assisted paging (NPT/EPT): hardware handles the translation from guest-physical to machine-physical, but this can vastly increase the number of memory accesses to perform a TLB fill (factor of 5 increase!); currently a 15% slowdown compared to shadow PTs (although better performance on wide-SMP architectures). Network I/O virtualisation: difficult because of high packet rate, small batches, and data copy to the VM on receipt. Xen has asked for hardware modifications in the NIC, and some of these are coming online. Level 0: modern conventional NICs (bunch of different offloads). Typical Xen architecture with only Dom0 accessing hardware. But can now do direct device assignment using an IOMMU, though this might give problems (can't share device; live migration (hotplug/unplug)). Or use Xen driver domains to virtualise a particular NIC for performance or reliability reasons. Use grant tables to keep good isolation between the VMs. Level 1 (smarter NICs): multiple receive queues based on destination MAC address for VMs -- VM-supplied buffer given to NIC to avoid copying. Level 2 (direct guest access): safe and protected access from VMs to the NIC (Solarflare/Infiniband) -- Dom0 sets up accelerated routes and then DomU can access hardware directly -- simply an accelerator module, which allows easy hotplug/unplug but also gives line-rate performance. Level 3: present NIC as multple PCI devices (no real performance improvement), and have a full switch on the NIC -- takes away some of the benefits of virtualisation (lock-in to the hardware). Smart NICs reduce CPU overhead substantially (272% of the CPU usage for Type 0 -> 126% on type 2 (100% for native Linux)). Open source is a great technique for making University projects have impact. Soon will see ubiquitous hypervisors, and could see a golden age in operating system diversity. Still a fun area to be working in -- using and influencing the hardware! Q: the tension between added features to Xen and the security threat -- is there a finite set of features where things could settle down and become secure? Depends where you add the features: Xen only puts a small amount of code in the hypervisor/TCB, and keeps narrow interfaces. Q: the interaction between hardware and OS has been studied and optimised intensively -- does Xen preclude some of these optimisations? Idea is to make the hypervisor interface sufficiently low-level that as much of the hardware as possible is exposed -- some people will always want to do more, as the hardware becomes faster, this might become less of an issue. Xen gives good performance! Q: every OS attempts to provide virtualisation, main difference is the interface (POSIX vs. hardware interface) -- why can't we provide virtualisation at the POSIX level? (Does the hardware interface not obscure intent?) The interface is so broad and high level that you have a lot of stuff to do to get down to the hardware -- leads to the potential for bugs, crosstalk. Q: plan for scalability with massively multi-core? Has been *booted* on 4096 cores, but not optimal performance. Does well on 64/128-way boxes. --- One Hop Reputations for Peer to Peer File Sharing Workloads (Michael Piatek, UW) Potential of P2P vs. the practicality. Depends on user contributions but users are stingy. 70% of users in Gnutella didn't contribute anything to the network. Led to contribution incentives in network design (tit-for-tat servicing in BitTorrent/eDonkey). How are these doing? Experiment by varying contribution level (1/30/100KBps) to see the cumulative distribution of swarms. 74% of download rates are <50KBps (even if contributing 100KBps). 25% of swarms are completely unavailable. At 80% level, a 100-fold increase in contribution gives only 2.7x increase in download rate. Contribution: case study of problems with BitTorrent (show why incentives are often ineffective) and proposed fix (one hop reputations). Measurement questions. Is there a fundamental lack of (aggregated) capacity (due to e.g. asymmetric bandwidth)? No. Average capacity is 8x average performance -- people are witholding capacity. Is there a fundamental lack of interest? No: just that most peers download and quickly leave -- unpopularity at a given instant doesn't mean that it was never popular. Top 10% of peers are responsible for 80% of possible bandwidth. Most of the total capacity is held by users with the least incentive to contribute it (because providing extra capacity gives little benefit). Swarm popularity peaks early -- after downloading, users leave: capacity and replicas are lost, and eventually the total data becomes unavailable. Want late-arrivers to be able to draw on the capacity of previous downloaders. Tit-for-tat in BitTorrent. Incentives only apply when users are actively downloading. We need these to persist. Simple fix: instead of scheduling on a short-term basis, peers could remember all contributions and reciprocate across time and different swarms. Depends on frequent repeat interactions, which are rare. Experiment: trace of 55523 swarms, chance of peers interacting again (assuming an 8-hour session or an infinite session). With 8-hour session durations, 99% of peers that interact do so only once. With infinite duration, 91.5% of peers interact just once. Can one-hop reputations increase the set of hosts with known reputation? Indirect reciprocation. If A sends to B->C->D->E, in future, E should recognise A's indirect contribution to it. How long a path do we need (too long and it can be compromised; too short and little benefit): one hop is enough! 97% of peers are connected through 0.00002% of the most popular peers. Protocol. Need persistent long-term identification (self-generated public/private key pair). What do intermediaries do? Maintain accounting information and provide signed verification receipts to users that contribute to them. How are intermediaries discovered? Gossip during connecton setup. Mechanism described in the paper. Intermediaries are selected by popularity. They get priority service as an incentive to serve as an intermediary. Peers are evaluated as the intermediary's valuation of the peer times the local valuation of the intermediary. Evaluation. How many shared intermediaries exist for two random peers? 97% of peer pairs have at least one; median is 134 when the set of potential intermediaries is of size 2000. Time to reciprocation: what if a peer chooses an unpopular intermediary? There might be a long delay before it has the chance to get reciprocation. How many selected intermediaries are encountered again? 73% of intermediaries are selected again within 100 interactions. Conclusions. We need persistent contribution incentives for P2P to reach its potential. One hop reputations are a protocol and policy that fosters these incentives. --- Ostra: Leveraging trust to thwart unwanted communication (Alan Mislove, MPI-SWS) Electronic systems make communication very low cost. Makes it easy to make content available to millions of users. But this leads to unwanted communication -> spam or unwanted Skype invitations, mislabelled content on YouTube. Users are not accountable. Free accounts make it easier for a banned user to create a new identity. Previous approaches. Filter based on content (hard for photos and videos, subject to false +ves). Charge money to send (requires a micropayment infrastructure). Introduce strong identities (passport or social socurity # to sign up for an account - controversial). Ostra doesn't use any of these: instead works in conjunction with an existing comms system and leverages an existing social network. Exploits the cost of maintaining social networks. Hawala (Indian system for transferring money): give the money to a hawala dealer who would transfer it via the hawala dealer social network. Hawala dealers only exchange promissory notes (settle up in the future). Like debts between banks, but using only pairwise trust, and no centralised trust (e.g. the Fed). Works because links take effort to form and be maintained. Can't get new links easily. Misbehaviour leads to being ostracised. A short-term gain leads to a long-term loss. Ostra doesn't need as high a level of trust as hawala, because there is far less at stake (a spam message rather than losing money). Applicable to messaging, group communication and content sharing. Communication systems usually embed a social network (contact lists/address books/friend list) and this can be explicit or implicit (derived from who talks to whom). Assumes links take some effort to form and maintain, and that the social network is maintained by a trusted site. Recipients classify messages (e.g. as unwanted), which can lead to the link being penalised (in the direct-send case) until the link can no longer be used. In the indirect case, we would look for a trust chain between source and destination, and all links in the path would be penalised on unwanted messages. Messages themselves are sent directly. Each link has a credit balance, B (how much one user is "in debt" with the other). Also has credit bounds (lower L, and upper U). L <= B <= U. When sending a message, the lower bound is temporarily raised and reset once the message is classified. If the message is wanted, L is relowered to its original level. If not, B is decreased towards L. If B is at the lower bound, the sender must wait until he has enough credit. This iterates for sending to non-friends, on any path from source to destination. Intermediate users are indifferent to outcome (an unwanted message passed on leads to less credit on the outbound link, more credit on the inbound link). Guarantees a per-user bound on sending spam: received spam + (|L| * number of links). This holds for any subgraph (credit can neither be created nor destroyed) => collusion doesn't help attackers. Note that a user cannot receive if B = U (or send if B = L). Therefore introduce a credit decay of, say, 10% per day. Preserves the conservation of credit, and allows people to send/receive again eventually. Offline users may cause credit reservation, so introduce a classification timeout, T, which treats a message as wanted if it is unclassified after T days have passed. Offers plausible deniability of receipt. For content sharing, create a virtual identity for the content-sharing site, and consider an upload to be a message to this identity. Security. Conspiring users cannot create credit and Ostra cannot reach starvation. What about multiple identities (sybils)? Any subgraph is limited by the cut between that subgraph and the rest of the graph, so sybils connected to themselves are no use. Can attackers target vital links? Social networks tend to have a dense core: the min-cut is almost always at source or destination. Evaluation. Based on a simulation of a social network (crawled from YouTube - 476000 users and 1.7 million links) and message trace (150 users at MPI over 3 months; 13000 messages). Effectiveness of blocking unwanted communication. Simulated in three scenarios: messaging with random traffic, messaging with proximity-biased traffic and centralised content-sharing. Simulated with random attacking users L = -3 = -U, and decay of 10% per day. With 20% of the users being attackes, only 4 spam messages per good user per week: Ostra meets its expected bounds on unwanted messages per user per week. Do messages get delayed? With a classification delay of 2 or 5 hours, only 1.3% of messages get delayed, and 2 hour classification delay => 4.1 hour average delay; 6 hour delay => ~14 hour average delay. --- Detecting In-Flight Page Changes with Web Tripwires (Charles Reis, UW) Inspired by ISPs inserting adverts into web pages. Wanted to see how often this occurs. Detect how your own page is being modified using a small bit of JavaScript (a tripwire) which runs in the client's browser, finds the changes and reports these to user and the server. First fetches and renders the original page, fetches JavaScript code in the background (including an encoded version of the original page) which then diffs the two versions diffed. Attracted visitors by posts to Slashdot and Digg (front pages in July 2007) -> 50000 unique IP addresses. 657 clients saw changes (1.3%), many of which were made by client software (not AdBlock, but maybe client proxies) and some made by agents in the network (e.g. ISPs and enterprise firewalls). Shows diverse incentives for modifying content which may cause concern for publishers. Changes by ISPs. 2.4% showed injected adverts in web page (NebuAd (contracted by ISP to show targeted ads), MetroFi, LokBox (free WiFi providers)). Leads to revenue for ISP, but may annoy users. This may be a growing trend: PerfTech, Front Porch, Adzilla and Phorm who are basing business models on this idea. 4.6% showed compression by ISPs (e.g. image distillation on a cellular network). Changes by enterprises. Security-checking scripts (2.3%; BlueCoat Web Filter), looking for phishing attacks, for example. Makes things safer for clients, reducing risk of browsing. Changes by client proxies. Popup and Ad blockers (71%; Zone Alarm, Ad Muncher,...). Make the web less annoying for users, but impacts the publishers' revenue streams. Changes by malware. 1 client had adware installed that modified the page to show adverts amongst the content (adding extra links on the page, double-underlined so that a frame would appear with adverts related to those words). 2 clients may have been infected with worms (ARP poisoning?) that send ARP notifications to make the machine look like the router, and then inject changes. The changes help the malware author, but at a risk to the user. Some changes inadvertantly broke pages, with JavaScript errors. Popup blocking or image compression JavaScript were bad for this. These interfered with MySpace or forum posting forms. Some introduced vulnerabilities, such as cross-site scripting. This is usually fixed at the server, but sometimes proxies make the pages vulnerable (Ad Muncher, Proxomitron). Affected most (unencrypted) HTTP requests, and could even affect encrypted pages (with Proxomitron): analogously bad as a root exploit. XSS via Proxy, which injects script code to look for ads and remove them. The code includes the full URL of the page in a comment or a string used for logging purposes. The attacker can place script code at the end of the URL and it will still be valid. Users who follow the URL then run the injected code. An example exploit uses a web tripwire to see if the modification is happening as expected, then if so redirect the user to the Google front page and inject code into the search form that ultimately appends exploit code to all outgoing links. These vulnerabilities have been reported and are now fixed. Web tripwires helped them to find the vulnerabilities, by searching for the URL in page changes. How to react? Option 1: use HTTPS, but this is costly (for the certificate, and in terms of additional processing) and rigid. Some enterprise security checks, transparent caching etc. can't be used. Option 2: web tripwires. Easy for publishers to deploy (a configurable toolkit or a web tripwire service (using a single extra line of JS in a third-party page)) which generates reports a la Google Analytics. However, this is not cryptographically secure: possible for the attacker to get round it. Tradeoffs (HTTPS vs tripwires). HTTPS prevents most changes and some useful services; Web tripwires only detect in-flight changes. HTTPS is cryptographically robust; Web tripwires could face an arms race (attackers try to detect the tripwire code), but obfuscation can challenge adversaries. HTTPS is expensive; tripwires are less expensive to deploy. Evaluation of performance impact. Tripwires have low latency and high throughput, when compared to HTTPS (and little overhead over the unprotected case). HTTPS massively impacts on throughput (in sessions per second) whereas tripwires only add a few bytes of extra code to each page. --- Phalanx: Withstanding Multimillion-Node Botnets (Colin Dixon, UW) For example, BlueSecurity (anti-spammers) were forced offline by a botnet. Firms are hitting rivals with web attacks. Estonia was taken offline by a botnet. 12 Fortune 500 companies have been afflicted by botnets. Why hasn't it been solved? It is solved for static content (large-scale CDNs, replication everywhere, larger than any feasible botnet). Could be solved if we can replace all routers, based on "clean slate" academic research, but the pervasive nature of bots means that you need universal deployment. Remains unsolved for dynamic content on today's Internet. How can we use a pervasive set of machines to solve the problem, without changing every router? Goals are to be deployable and scalable. How would we design it? Use nodes as proxies, making filtering decisions and forward remaining traffic to the server. How do they make these decisions? Do we trust their decisions? And how does the network know that we trust them? Idea is to use nodes as packet mailboxes: the server must explicitly request packets from the nodes, placing the policy at the destination and requiring no trust in mailboxes, explicit network trusting. But any node is vulnerable to attack. Secure random multipathing using a shared secret sequence chooses a pseudorandom sequence of mailboxes in communication. Botnet can take down one mailbox, but communication can continue and diluted attacks against all mailboxes will fail. Negotiate the secret X at connection setup, and make a hash chain between these to choose the random mailbox. Filtering ring. Why don't attackers just hammer the server? Need to drop unrequested traffic in the network. Must upgrade the routers at your ISP to filter these packets out. Put a nonce in requests, which come back with the data packet. Whitelist the nonce on the request, blacklist it once the data has been retrieved. For connection setup, the server periodically issues "first packet" requests, access to which must be mediated. Could use computational puzzles or authentication tokens. Large CDN serves static content. Dynamic server is protected by Phalanx. First get static content and Java applet from the CDN. Applet does connection setup, gets and solves the puzzle or presents auth token. Server issues first packet request. The first packet and the request are eventually paired and sent to the server. The server returns the list of mailboxes and the secret X. Now protected communication can begin. Evaluation with microbenchmarks on PlanetLab (in paper) and simulation based on gathered topology data (from Akamai) and with PlanetLab nodes sending in for server. 7200 Akamai nodes as mailboxes. Upload bandwidth based on BitTorrent data. With 7200 10Mb mailboxes and only the victim AS doing filtering, it can defeat 100k-node botnets. All mailboxes see less than 30% "goodput". 60% of mailboxes see no loss. 20% of mailboxes see high loss, which demonstrates the need for multipathing. At some point, any fixed deployment is going to reach its limit and fail. With 4 million attackers, see a change which is due to the limit of the mailboxes. If we have 5x as many mailboxes, 40% see no loss even with 4 million attackers (and we're not sure if botnets are really that big -- these attacks are probably 5 or 6 years off). --- Lunch: Lemon Chicken at Bamboo Chinese restaurant on Polk. --- Harnessing Exposed Terminals in Wireless Networks (Mythili Vutukuru, MIT) A conflict is what occurs when two transmissions lead to lower throughput when sent concurrently. Need a threshold loss rate to infer that a conflict is occurring. Loss rate of u->v when x is concurrent: if >50%, then infer that there is a conflict at v. Conflict entries if inferred by mistake are timed out periodically. Populating the conflict map. Stop u transmitting to v when x transmits to anyone; stop x transmitting to anyone when u->v. Channel access decisions. Nodes always overhear channel, and consult conflict map before transmission: totally overriding CSMA. A new ACK scheme. The ACK can be lost because x might start transmitting before v can send an ack to u. So have a sliding window of ACKs at the sender. Backoff policy: 802.11-like backoff policy. Cannot defer when there are hidden terminals. So use an exponential backoff when the loss rate in ACKs > some threshold. Implementation challenges. First how to identify colliding senders. Add a trailer: when two packets overlap they are not totally synchronised, so we can probably get a header from one and trailer from another. How to identify ongoing transmissions at the sender: need to change the MAC/Physical layer interface. At present, the header is treated as part of the packet by the Physical layer. Could use software radios, or send separate header and trailer packets (using commodity hardware). Conflict maps, ACKs and Backoff implemented as a Click kernel module. More implementation at the physical layer. Evaluated on a 50-node 802.11a testbed - does it improve concurrency? Pick two senders in range and send 1400-byte UDP packets at 6Mbps. Take 50 unique sets of four nodes. Evaluate with CMAP, CSMA, and no CS or ACKs. Either we can have interfering transmissions (should send separately) or exposed (should send concurrently). With CSMA, for 20% of the nodes upto 4Mbps this is better; for about 20% it was better to use no CS. Ideal MAC should emulate CSMA when the nodes are interfering and be like CS-off when they are non-interfering. CMAP approximately traces the ideal curve. Experiment with multiple concurrent senders: multiple single-hop transmissions. Varied number of concurrent senders from 3--6. CMAP has 20--47% better throughput than CSMA. Limitations: Losses incurred until the CMAP is populated. Unequal packet sizes make it take longer to detect conflicts, and cannot detect conflicts when headers cannot be decoded (e.g. from a microwave oven). Contributed a MAC that improves throughput by increasing concurrency by watching and discovering conflicts. The experiments show increased throughput, with a 2x improvement over CSMA for exposed terminals (no interference) and a ~50% improvement in AP and mesh networks. --- Designing High Performance Enterprise Wi-Fi Networks (Rohan Murty, Harvard) Wireless using in the Enterprise is increasing. Users tend to use WiFi even when wired connections are available. Could move towards an all-wireless office. What about capacity issues? Conventional WLAN: 1 AP, 12 users with each getting < 1Mbps each (despite a nominal capacity of 54Mbps). Current focus on coverage (fewer APs than clients), which causes a bad rate anomaly. Clients pick which AP they associate with, based on the RSSI of beacon packets, but without finding out about congestion issues. No adaptive behaviour (load balancing). DenseAP focuses on capacity, deploying lots of APs densely. Clients now talk to APs nearby, which mitigates the rate anomaly. The infrastructure picks client-AP associations, based on a global view of network conditions (channel load, interference, etc.). System is adaptable, enabling the load-balancing of associations and dynamic channel assignment. For practical use, want no modifications to legacy client hardware (only to the infrastructure). Also want it to be self-managing, so that adding more APs does not increase management challenge. DenseAP nodes (DAPs) send summaries of packets handled to the DenseAP Central Controller (DC), which makes decisions. DAPs contain an ACL which says to what clients that DAP may reply to a probe request. Want to associate client with a DAP that will yield good throughput. Need to consider the quality of the connection between them (rate) and the congestion (free air time). RateMap: correlation between the RSSI (received signal strength indication) of probe request packets with the average throughput between a DAP-client pair. This is a rough approximation which can be used to generate an ordering of DAPs. RateMap is an online profiling method that builds the estimated RSSI-to-data rate table. The free-air time is estimated using a technique similar to ProbeGap: measure the time taken to finish a packet transmission. Channel assignment is integrated into the association process. The DC assigns a channel to the DAP only when it goes from being passive to being active. Load balancing occurs as the DC monitors load on every DAP: when the channel load exceeds a certain threshold, the client causing the most load is determined and moved to another DAP. Evaluated on a testbed, deployed 24 DAPs in a space that is ordinarily served by a single AP. DenseAP gets a 1250% gain in capacity over the single AP. Why? More channels, denser APs and intelligent associations. What if all DAPs were on the same channel? Evaluates the effect of density. DenseAP consistently does better than the single AP case: higher density => better performance. So is it necessary to perform intelligent association? Compare client-driven choice (conventional WLAN) with DenseAP: DenseAP achieves a 160% gain. Load balancing shows that moving three clients from all on the same channel to all on different channels yields much better performance for each client. --- Fat Virtual Access Points (Srikanth Kandula, MIT) State of the art is to connect to the AP with the highest RSSI. But the AP uplink is probably the bottleneck: why not access all local access points concurrently to improve throughput? Also, all laptops in the same room connect to the same AP while another AP with lower RSSI might be idle (and hence offer better throughput). Should spread the load across all local APs: individual changes make a global benefit. Instead of joining the closest AP, join a virtual AP, which is the sum of the nearby APs. This aggregates bandwidth for the user, and balances the load across APs. Realise this with only client-side changes. Send packets to one AP, which builds up a queue: while the AP is processing the queue, toggle to another AP and build up a queue there. But what about receive? If you're toggled away, then you would drop packets from the first AP. So use the power-save idea, whereby the AP buffers packets when disconnected. Hard to sustain TCP flows through each AP. Cannot lose packets yet switch quickly. How do you choose which APs to connect to and for how long. Some APs are more valuable than others (e.g. more bandwidth). How do you divide traffic across the APs? FatVAP: divides time across APs to maximise throughput, and is transparent to APs and remote endpoints. Scans for available APs, computs a schedule to divide time across them, and switches APs per the schedule. Traffic is spread by pinning flows to APs. How much time do you spend at an AP. If AP drains queue at rate e (end-to-end throughput), and client sends packets at rate w (wireless link throughput), then the useful fraction of time is <= e/w. If e = 2w, it's only worth spending half of the time on that AP. Don't pick based on end-to-end throughput: the larger the wireless bandwidth, the more data you can send/receive in the time. But it takes time to switch: 5ms. Can't linger too long (100ms period) based on maintaining TCP flows. Schedule based on bin-packing optimisation/approximation algorithm. Use synchronous ACKs to estimate the wireless achievable bandwidth. For end-to-end, count bytes recevied in a window, looking only at back-to-back large packets. How to spread traffic across APs? Need to virtualise 802.11 state, having an IP for each interface, and modifying the kernel to spread flows to different APs, by performing fast header re-writing. There is a warm-up cost on each switch to a different AP, so maintain a state machine for each AP and perform soft switches. Also maintain private queues for each AP. Evaluation: compare FatVAP driver with unmodified MadWifi. Three scenarios: testbed, residentialy and commercial. Long-lived TCP flows, BitTorrent, and one other kind of flow. FatVAP successfully aggregates bandwidth linearly until it hits the wireless bandwidth bottleneck. For load-balancing, FatVAP gives a roughly fair share between two APs with 2Mbps and 12Mbps uplinks respectively. Is FatVAP adaptable? Experiment shows switching the uplinks between two APs: MadWifi doesn't adapt; FatVAP does (after a slightly noisy period). --- Efficiency through Eavesdropping: Link-layer packet caching (Mikhail Afanasyev, UCSD) Three nodes: A, B and C. Send A -> B -> C. If A sends directly to C, B may overhear the transmission, and if C doesn't quite receive it, A has to send through B, which is wasteful, as it has already got a copy of the packet. Or what if the data packet is sent, ACK lost, and the data packet is retransmitted (with two nodes)? Still wasteful. Challenges: wireless is shared and broadcast-based, and routing must adapt to changing channel conditions and mobile devices. Existing solutions require total protocol redesign, and increase latency significantly, making them incompatible with TCP. Idea: ask receiver before packet is sent, "Do you already have this?" The 802.11 spec helps us: RTS/CTS exchange, which is designed for hidden terminals. Nodes send RTS to reserve the channel, and only one is given CTS (clear to send) to stop the hidden terminal transmitting. Idea is to put packet ID in the RTS. If the receiver has that packet, it sends an RTS-ACK, which precludes sending the packet. Received packets are always added to the cache, before possibly passing to upper layers (if it's intended for that node). On receiving an RTS-id, check to see if it's in the cache, and send an RTS-ACK if it is, passing the cached packet to the upper layers (making it look like the packet has just been received); otherwise send a CTS. Packet IDs are CRC32 or SHA-1 of the packet contents. Need to account for changing fields: next-hop MAC, IP TTL, protocol-specific IDs. So checksum only the static fields. RTS-id adds overhead: even higher overhead if RTS/CTS is not enabled. Should enable it selectively on a per-destination basis, based on the cache hit rate. Implemented using CalRadio, an experimental radio platform. The RTS packet is the same, with a 32-bit trailer for the RTS-id, so it will look normal to non-supporting stations. RTS-ACK is a regular CTS with duration = 0, which releases the channel for normal operation. In testbed, hard-code routes (A to B, then to C), experiment is to send UDP as fast as possible. With no overhearing, need about 2.05 transmissions per packet. With overhearing, need only 1.01 transmissions per packet (because C has already overhead most packets that B wants to send it). Experiment with a trace-driven simulation from Roofnet 2004. Uses Roofnet routing algorithm to compute routes. Lots of overhearing occurs: for 25% of source-destination pairs, the probability of overhearing is >= 10%. For paths of different lengths, as paths get longer, you can save more and more transmissions. Median transmission savings is 17%. Also need to look at total air time (different rates/rate adaptation; transmission of RTS/CTS takes time), and traffic patterns. RTS-id always results in a saving over total air time over RTS/CTS. But for 80% of paths, it takes more time than no RTS. Better to make it adaptive in this case, and fall back to no RTS, which gives no-worse performance than no RTS, and some benefits for about 20% of paths. RTS-id effectiveness tends to depend on poor routing choices. But this means you can use simple routing algorithms and still get good performance. Future work to design a routing protocol that takes advantage of RTS-id. --- Beyond Pilots: Keeping Rural Wireless Networks Alive (Sonesh Surana, UCB) First challenge was to enable rural wireless network: get wifi-based long distance links, by modifying MAC layer etc. (previous paper). Next challenge is to keep them alive! Have been able to show that emulated performance has been matched, even over 382kms: the world record distance for a wifi connection. The operating environment is different to the one we're used to: pilots that rely on standard equipment do not scale. Systems design can mitigate these challenges. Challenges: bad quality grid power (higher component failures and more downtime); limited local expertise (local operation, maintenance and diagnosis is difficult); lack of connectivity (complicates remote diagnosis/management); remote locations (travelling to them is difficult and infrequent). AirJaldi network: north India, serving Tibetan community in exile. WiLD links and APs, links 10--40km long, 4--5Mbps per link. Works for VoIP and Internet. Aravind Eye Hospital Network: All WiLD links, 1--15km long, 4--5Mbps per link, for video conferencing. Hardware faults in Aravind network: mostly router board not powered (63 incidents costing 63 days downtime), most other ones are power-related. >90% of all faults are power-related. Also software faults, but less prevalent. The quality of the power (voltage regulation) is very poor: many spikes of up to 1000V. Leads to lost power adapters, burned Power-over-Ethernet ports, CF corruptions, battery damage. Low voltages also cause incomplete boots. Affordable UPS systems are "standby-type", to deal with outages, but tend to pass through grid power when it is available. Problem of low voltage. Disconnect the load at low voltages, which prevents battery over-discharge and hung routers. Without these, they had 50 visits/week for manual reboots at AirJaldi. Off-the-shelf LVDs oscillate too much, causing too many automatic reboots. So they designed a new circuit with a better delay, leading to no more manual visits or reboots. To deal with spikes, swells and power at remote sites, they made a low-cost solar power controller. Features Peak Power Tracking, leading to 15% more power draw. LVD + trickle charging doubles the battery life. Voltage regulator prevents spikes and swells (when grid power is used in addition to solar). Power-over-Ethernet can be used for remote management. And it only costs $70, compared to $300 commercial units (more expensive to deal with higher wattages). No routers lost in 1 year of use. Data collection and monitoring: a push-based PhoneHome system, which measures node, link and network level properties, posting to server every 3 hours. Can also be used for remote management. Has been used to help with diagnosis. Alternate network entry points: getting into the network is itself a challenge. Tries using satellite and cellphone access points. Satellite access works 60% of the time. GPRS phone access also possible. Can be used to recover from software configuration errors, but doesn't help with physical faults (which require visits). Recovery mechanism: use a hardware watchdog, whihch costs less than $1 for routers without on-board watchdogs. Results: measures number of fault incidents, and the migration of responsibility. Number of faults is much less (or none at all) in all categories. Particularly good is the elimination of weekly manual reboots. At first, UCB were responsible for everything; now only for equipment supply. Arvind do maintenance and management; local vendor does installation. At the same time, five more clinics have been added. Lessons: reduce the need for trained staff (higher training leads to higher turnover as people get other jobs) by automating as much as possible. A simple re-design is often enough. The real cost of power is cleaning it up, which is a 20x overhead in cost => solar become competitive. --- UsenetDHT: A Low Overhead Design for Usenet (Emil Sit, MIT) Usenet is growing at over 2 million new articles and files daily; 30Mbyte/s new content globally. When an article appears on Usenet, the server floods the article to its peers, which flood it recursively, until all the servers have a copy of the article. Means full replication of articles, but requires massive capacity at the servers (needs to match global input rate of 30Mbytes/s -> Tbytes/day of storage). UsenetDHT is a shared Usenet server, which enables smaller sites to cooperatively store more of Usenet. Only one copy of each article is stored in a DHT. Developed a new DHT maintenance algorithm which is simple. DHash, the first public DHT to store Tbytes durably, with 6Mbytes/s random-access reads per server. Approach is to share article storage among peers, using a DHT (ideal because articles are immutable). Storage based on message ID. Write implementation: separate header from article, store article in DHT, all servers exchange only the header (which is much smaller). Advantages: only one logical copy of the article is stored, servers benefit from peer resources, per-server meta-data is compatible with existing architecture (also only <1% of object size). Demands high throughput storage (to cope with the incoming data rate), and data durability (need to detect and repair after disk failures, without overcompensating for transient failures). Maintenance work must not interfere with storage performance. Data maintenance algorithms at present are synchronisation-heavy. Maintenance involves ensuring that there are enough copies of every objects, which often involves tracking replicats through synchronisation. Each node must sync with many others (O(N) or O(log N) others) -- in a wide area this will mean some nodes are slow. Also a database of replica counts is required: costing much disk or memory, but is also a single point of failure. Also, nodes are dependent on other to know what to do, receiving objects randomly from different sources, and can't individually act to reclaim space or create new replicas. Passing Tone: repairs based on r_L replica threshold (previous work). Each node normally only talks to two peers, and each node is responsible for its own replicas. Two algorithms: local (in talk) and global (in paper). Objects are written to r_L immediate successors. Replicas are automatically used after failures, and individual failures affect fewer nodes. Need to teach each node which objects it must replicate, instead of placing responsibility on a single successor. Use predecessor lists to allow nodes to handle their own maintenance. So each node can identify if it is missing an object or has an extra. (Because it knows the range of keys for which it is responsible.) Then synchronises periodically with its successor and predecessor to identify remote objects in range but missing locally. Each node only has to maintain one single synchronisation data structure (Merkle sync tree), which is easy to update on disk, and cached in memory. Implemented in 1500 lines of code. Is its simplicity a problem? Risk losing data due to failures, or make too many copies because of implicit state. Evaluated with a simulation over a 1 year trace of PlanetLab (20k transient and 219 permanent failures). No objects end up with 0 replicas. Overall, Passing Tone ends up with more replicas than Carbonite. Performance scales (for writes and reads) roughly linearly up to 8 nodes in a switched GigE network. Wide-area deployment, based on 12 test-bed nodes at 7 sites on the east and west coast of the US. Supports the MIT Usenet feed (2.5Mbyte/s). --- San Fermín: Aggregating Large Data Sets using a Binomial Swap Forest (Justin Cappos, University of Arizona) For information retrieval from large networks, we have a large number of nodes, and can aggregate the large quantity of information about that network. For example, debugging with remote sampling (of end-user traces) by extracting execution counts for statements (to find bugs or detect hotspots): but you can sum these across all executions of the program. The bc calculator creates 120kb of data. Also CERT flow data to detect attacks. Goals of aggregation: completeness, containing no duplicates (which could skew the result), no partial data (could also skew), speed (semi-interactive), failure tolerance and load-balance. Binomial Swap tree is a binomial tree where parent-child relationship indicates nodes that swap data with one another. But this relationship is symmetrical. So we end up with a proliferation of trees (one per node). Each node has a Node ID (think of it as a bit vector). For each bit of the ID (from LSB to MSB) swap with only one node that has the ID that differs in only that bit and has the same prefix (i.e. 011 with 010, then 001, then 111). Simple where you have a full tree, 3-bit IDs and 8 nodes, for example. Trickier when you have missing nodes. Where there are no nodes with a target bit, skip that bit. When all nodes with the target bit are taken, abort. The node that aborts doesn't get a full tree, but its data is included in other trees. Aggregation: requestor disseminates query, a binomial swap forest is built, a node returns the answer, and the requestor stops the query (because nodes may process at different rates). Evaluated a prototype, based on Pastry, on PlanetLab. Used nodes that had low clock skew and transitive connectivity. Compared against SDIMS (also based on Pastry, designed for working with small amounts of data (a couple of bytes)). Up to 128kb, all algorithms perform well. After that, SDIMS falls off (especially with 0.5Mb). Compared completeness with the number of nodes. And completeness versus completion time. San Fermín is faster and more consistent. For data transfer rate, SDIMS peaks very high then has a long tail; San Fermín plateaus at a high rate and then goes quickly to zero.