tl;dr: Our QJump paper at NSDI 2015 shows how to build a guaranteed-latency messaging service within a data centre by allowing a limited number of packets to skip queues.0 Our approach works on commodity hardware and requires neither protocol nor application modifications. This is exciting, because it enables fast failure detection and more optimistic distributed system designs.
Distributed systems are hard for two key reasons: first, they can experience component failures, and second, messages sent across the network can experience all sorts of craziness – they might be dropped, re-ordered, or delayed for arbitrarily long times. As a programmer, I can basically make no assumptions about anything.
As a result, distributed systems designers must be incredibly conservative, and always engineer for the worst case. This is prudent when dealing with a wholly unpredictable distributed system such as the global internet. However, data centres are more predictable (we're not the only ones to observe this): we know the network topology, the link rates and have some idea of the workloads. Moreover, the entire physical network is under a single administrative authority.1
Our QJump system (on which we presented a paper at NSDI 2015 in sunny Oakland last week) exploits this, and we show that some simple calculations combined with prioritisation support found in existing commodity switches are sufficient to offer a guaranteed-latency messaging service for a data centre. The trick is simple: if we pace the ingress of messages into the network in such a way that no unbounded queues can ever build up, then we can calculate a hard bound on packet latency.2
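The pacing idea can be sketched as a simple per-host rate limiter: each host may inject at most one (maximally sized) packet into the guaranteed-latency level per network epoch. Here is a minimal sketch in Python – the `EpochPacer` class, the epoch length and the use of sleeping are our illustrative assumptions; the real system enforces pacing in a Linux traffic-control module, not in application code:

```python
import time

class EpochPacer:
    """Admits at most one packet into the network per epoch (illustrative)."""
    def __init__(self, epoch_seconds):
        self.epoch = epoch_seconds
        self.next_slot = time.monotonic()

    def wait_for_slot(self):
        # Block until the start of our next permitted transmission slot,
        # then reserve the slot after that for the following packet.
        now = time.monotonic()
        if now < self.next_slot:
            time.sleep(self.next_slot - now)
        self.next_slot = max(self.next_slot, now) + self.epoch

pacer = EpochPacer(epoch_seconds=0.001)  # 1 ms epoch, illustrative
start = time.monotonic()
for _ in range(5):
    pacer.wait_for_slot()
    # send_packet() would go here
elapsed = time.monotonic() - start
# Five sends are spread over at least four full epochs, so no sender
# can ever contribute more than one packet to any queue per epoch.
```

Because every host is limited in this way, the total traffic that can arrive at any queue within one epoch is bounded, which is what makes the hard latency bound calculable.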
Of course, limiting the ingress rate has an impact on throughput: data centre networks are typically oversubscribed at the core and use statistical multiplexing to share the fabric between network flows. QJump's guaranteed latency bound is a function of the number of hosts on the network: in essence, the maximum rate of ingress is the network's rate (e.g. R = 10 Gbit/s)3 divided by the number of hosts (e.g. n = 1,000). However, many coordination services are happy with this rate (consider, e.g. a quorum of Zookeeper servers). Moreover, QJump allows other traffic on the same network. This traffic is allowed greater throughput, but receives no hard latency guarantees.
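For intuition, these worst-case numbers can be worked out directly: in the worst case, all n hosts send a maximally sized packet towards the same bottleneck link, so a packet may have to wait behind n − 1 others. A back-of-envelope calculation (the packet size is an illustrative assumption, and we ignore the small per-hop processing cushion the paper adds):

```python
R = 10e9          # slowest link rate in bit/s (10 Gbit/s)
n = 1000          # number of hosts
P = 1500 * 8      # maximum packet size in bits (1500 bytes, illustrative)

serialisation = P / R       # time to put one packet on the wire
epoch = n * serialisation   # worst case: queue behind n - 1 other packets
per_host_rate = R / n       # guaranteed-latency throughput per host

print(f"one-packet serialisation: {serialisation * 1e6:.2f} us")
print(f"network epoch (latency bound): {epoch * 1e3:.2f} ms")
print(f"per-host guaranteed rate: {per_host_rate / 1e6:.0f} Mbit/s")
```

With these numbers, each host gets a guaranteed 10 Mbit/s at the guaranteed-latency level and a latency bound of about 1.2 ms – modest throughput, but ample for coordination traffic.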
In this blog post, we focus on the utility of the hard-bounded, guaranteed latency offered by QJump; if you are interested in how we offer statistical improvements at different throughput levels, see the paper.
Specifically, QJump guarantees that if a message is sent and no failures occur, it arrives at the destination within a known, bounded delay. Let's see what benefits this gives us:
- We can reliably detect failures.
Distributed systems often conflate delays and failures; for example, a recent EuroSys paper observes that "When a remote process does not respond, the application cannot tell if the process or network have failed, or if they are just slow." Not so with QJump: since QJump's guaranteed latency level ensures that no unbounded congestion can build up in the network, no packets can be lost due to congestion. This means that a packet loss now indicates that a component has failed. This can be the remote host, a link or a network switch.4 In other words, if we observe a lost packet, we can immediately initiate recovery action, since a failure must have happened. This is quite a powerful property: consider, for example, the recently proposed (and excellently named) Visigoth Fault Tolerance (VFT). VFT relaxes the traditional Byzantine Fault Tolerance threshold for consensus from 3f + 1 (three times the number of faulty nodes, f, plus one) to f + s + o + 1 (the sum of the faulty nodes, f, the slow-but-correct nodes, s, the number of correlated faults, o, plus one) under certain assumptions that hold true in data centres5. With QJump, we can improve even further on this: since we know the exact upper bound on message delivery time in the absence of faults, we can set s = 0, reducing the VFT consensus threshold to f + o + 1.
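The arithmetic behind this reduction is simple but the effect is significant. A sketch, using illustrative fault counts (the specific values of f, s and o here are our assumptions, not from the VFT paper):

```python
def vft_nodes(f, s, o):
    """Minimum cluster size under the VFT model: f + s + o + 1."""
    return f + s + o + 1

# Illustrative numbers: tolerate f = 1 arbitrary fault and o = 1
# correlated fault, with s slow-but-correct nodes budgeted for.
without_qjump = vft_nodes(f=1, s=2, o=1)  # must budget for slow nodes
with_qjump    = vft_nodes(f=1, s=0, o=1)  # bounded delay lets us set s = 0
print(without_qjump, with_qjump)  # 5 vs 3 nodes
```

In other words, bounding message delay lets the same fault-tolerance goals be met with a smaller cluster.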
- We can make the common case blazingly fast.
Most applications use time-outs to detect component and message delivery failures. These timeouts are typically chosen conservatively to avoid spurious failure reports caused by packets that are merely delayed. With QJump, we can set the timeout aggressively: for example, the response from a round-trip RPC that is handled immediately and whose processing takes less than a network epoch will definitely arrive within four network epochs (<1ms for most networks): the sender may have to wait up to one epoch for its next transmission slot, transmission of the request takes at most one epoch, processing takes at most one epoch, and transmission of the response takes at most one epoch. Such aggressive timeouts make rapid failure detection possible, and allow for fast request completion in the common, non-failure case. For example, consider what recovery performance RAMCloud could deliver with a sub-millisecond failure detection mechanism (currently, the timeout is 150ms), or what latency a ZooKeeper or etcd consensus service could deliver with guaranteed 100μs round-trips!
- We can use datagram protocols, multicast and broadcast in new ways.
Data centres today are dominated by applications using TCP for its convenient reliable delivery, ordering semantics and flow control. However, many have observed that today's TCP (and especially its conservative congestion control) is not a great fit for data centres. Moreover, application-level multicast is common in data centres, but has no effective support today. However, QJump's guaranteed latency level is based on worst-case assumptions. This worst case even covers the situation in which all hosts send their traffic as broadcast (i.e. to all other hosts). The QJump network epoch is calculated such that we can tolerate both maximal fan-in (all hosts send to one) and full broadcast (all hosts send to all).6 In our paper, we use the example of a two-phase commit service to illustrate this property: the QJump-enabled version of the service not only achieves stable throughput even in the presence of aggressive bursts that badly degrade TCP, it also increases throughput by ≈30% over TCP or UDP with acknowledgements by using broadcast UDP. QJump also enables interesting protocol innovations. For example, the guaranteed latency level can be used as a coordination channel to negotiate transmission slots (similar to TDMA Ethernet or Fastpass), or to signal congestion and notify hosts to slow down. Indeed, a version of QJump based on less pessimistic assumptions (using additional information about outstanding flows) could be bootstrapped using QJump itself!
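A rough message-count comparison shows where the broadcast win comes from in two-phase commit. This back-of-envelope model is our illustration, not the paper's evaluation: with n participants, a unicast coordinator sends n copies of each prepare and decision message, while a broadcast coordinator sends one.

```python
def two_pc_messages(n, broadcast=False):
    """Messages on the wire for one two-phase commit round with n
    participants: prepare + votes (phase 1), decision + acks (phase 2)."""
    downstream = 1 if broadcast else n  # prepare / decision fan-out
    upstream = n                        # votes / acks are always unicast
    return 2 * (downstream + upstream)

n = 5
print(two_pc_messages(n), two_pc_messages(n, broadcast=True))  # 20 vs 12
```

The saving grows with n, and it is only safe to rely on raw broadcast UDP like this because QJump's guaranteed level turns congestion loss into an impossibility rather than an everyday event.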
As you can see, QJump's guaranteed latency level offers an interesting network fabric with unusual properties. In fact, when discussing possible applications for QJump amongst ourselves, we found it surprisingly hard to reason about reliable, bounded-delay messaging – deeply ingrained assumptions about how packet-switched networks work are hard to side-step!
We're quite excited about QJump and hope to take it further in the future. If you have any thoughts about possible applications, or other interesting properties that can be derived from guaranteed-latency messaging, get in touch or leave us a message in the comments below.
1 – This is true even in a multi-tenant environment such as Amazon EC2: the provider controls the physical network (switches, cables etc.) and the hypervisor on each machine. Thus, it can force tenants' VMs to play by certain rules.
2 – A simple intuition shows why this works: every time a host gets to send a message, it can choose exactly one target. If all hosts happen to choose the same destination, we still make the latency bound; if they choose different targets (and thus distinct paths through the network), the message latency remains below the bound.
3 – More precisely, we're looking for the rate of the slowest link in the network here: in the worst case, it is the bottleneck and all messages have to proceed through this link. However, the link rates in a data centre usually increase from the edge (hosts, at e.g. 10 Gbit/s) to the core (at e.g. 40 Gbit/s), so the bound typically depends on the rate of the hosts' edge links.
4 – Network partitions are covered by the latter two. Of course, as smart readers might observe, this does not account for the possibility of the remote host being simply slow to respond, rather than having failed. This is true, but arguably there are ways to ensure that the host responds pronto: dedicating cores, spinning on a socket, integrating the responder in the OS or generating acknowledgements in a programmable NIC.
6 – These two cases are actually identical from the point of view of QJump: the network epoch calculation is based on the hose constraint and assumes that link rates never increase towards the edge of the data centre (i.e. hosts are connected at ≤ the bandwidth of the core), and thus all hosts fanning into one host creates queues no longer than if all hosts send to all hosts.