IMPROVED COMMUNICATION LOWER BOUNDS FOR DISTRIBUTED OPTIMISATION

Abstract

Motivated by the interest in communication-efficient methods for distributed machine learning, we consider the communication complexity of minimising a sum of $d$-dimensional functions $\sum_{i=1}^N f_i(x)$, where each function $f_i$ is held by one of $N$ different machines. Such tasks arise naturally in large-scale optimisation, where a standard solution is to apply variants of (stochastic) gradient descent. As our main result, we show that $\Omega(Nd \log(d/\varepsilon))$ bits in total need to be communicated between the machines to find an additive $\varepsilon$-approximation to the minimum of $\sum_{i=1}^N f_i(x)$. The result holds for deterministic algorithms, and for randomised algorithms under some restrictions on the parameter values. Importantly, our lower bounds require no assumptions on the structure of the algorithm, and are matched within constant factors for strongly convex objectives by a new variant of quantised gradient descent. The lower bounds are obtained by bringing over tools from communication complexity to distributed optimisation, an approach we hope will find further use in the future.

1. INTRODUCTION

The ability to distribute the processing of large-scale data across several computing nodes has been one of the main technical enablers of recent progress in machine learning, and the last decade has seen significant research effort dedicated to efficient distributed optimisation. One specific area of interest is communication reduction for distributed machine learning. Recently, several algorithms have been proposed to reduce the communication footprint of popular methods in machine learning and optimisation, in particular gradient descent and stochastic gradient descent; see e.g. Arjevani & Shamir (2015); Alistarh et al. (2017); Suresh et al. (2017); Tang et al. (2019) for recent work, and Ben-Nun & Hoefler (2019) for a survey. Despite this extensive work, less is known about the theoretical limits of communication complexity of optimisation, especially in terms of lower bounds on the minimal number of bits which machines need to transmit to jointly solve an optimisation problem. In this paper, we study this question in a classical distributed optimisation setting, where data is split among $N$ machines that can communicate by sending point-to-point messages to each other. Given input dimension $d$, and a domain $D \subseteq \mathbb{R}^d$, each machine $i$ is given an input function $f_i \colon D \to \mathbb{R}$, and the machines need to jointly minimise the sum $\sum_{i=1}^N f_i(x)$, e.g. the empirical risk, with either deterministic or probabilistic guarantees on the output. The setting is a standard way to model the distributed training of machine learning models. For instance, if the individual loss functions are assumed to be (strongly) convex, we can model a classic regression setting, whereas if the functions are non-convex, we can model distributed training of deep neural networks.
In this context, the key question is: what is the minimal number of bits which need to be exchanged for this optimisation procedure to be successful, and how does this number depend on the properties of the functions $f_i$ and the parameters $N$ and $d$?

1.1. OUR RESULTS

Setting. We consider this question in the classic message-passing model, where $N$ nodes communicate by sending messages to each other; specifically, each message is sent to a single receiver and not seen by the other nodes. Our complexity measure is the total number of bits sent by all the nodes. Given this complexity measure, the model is equivalent (up to a constant factor) to a model where all messages are relayed via a special coordinator node, known in communication complexity as the coordinator model and in machine learning as the parameter server model (Li et al., 2014). For convenience of presentation, we set $D = [0,1]^d$, and consider a problem where each node $i$ is given an input function $f_i \colon [0,1]^d \to \mathbb{R}$, and the task is to approximate the minimum of the sum of the functions. That is, the coordinator needs to output $z \in [0,1]^d$ and an estimate $r \in \mathbb{R}$ for the minimum function value such that

$\sum_{i=1}^N f_i(z) \le \inf_{x \in [0,1]^d} \sum_{i=1}^N f_i(x) + \varepsilon$ and $\sum_{i=1}^N f_i(z) \le r \le \sum_{i=1}^N f_i(z) + \varepsilon$.   (1)

Specifically, this models a standard distributed machine learning setting where we require one of the nodes to return the optimised final model, as well as the final value of the loss function. When proving lower bounds, we allow the nodes to compute arbitrary values that depend on the input functions and to operate on real numbers; only the number of communicated bits is limited. The precise definition is somewhat subtle, so we defer the details of the model to Section 2.

Lower bounds for convex functions. We show that, even if the input functions $f_i$ at the nodes are promised to be quadratic functions $x \mapsto \beta_0 \|x - x^*\|_2^2$ for a constant $\beta_0 > 0$, finding a solution satisfying (1) deterministically requires $\Omega(Nd \log(\beta d/\varepsilon))$ total bits to be communicated, where $\beta = \beta_0 N$ is the smoothness parameter of $\sum_{i=1}^N f_i$, for parameters satisfying $\beta d/\varepsilon = \Omega(1)$.
For randomised algorithms, we give a lower bound of $\Omega(Nd \log(\beta d/(N\varepsilon)))$ total bits to be communicated, for parameters satisfying $\beta d/(N^2\varepsilon) = \Omega(1)$. While this lower bound is slightly weaker due to the additional dependence on $N$, in most practical settings the number of parameters $d$ will be significantly larger than the number of machines $N$ multiplied by the error tolerance $\varepsilon$. (Specifically, in most practical settings $N \le 1000$, whereas $\varepsilon \le 10^{-3}$. More generally, it is sufficient that $d = \Omega(N^{2+\delta})$ for constant $\delta > 0$ for the randomised lower bound to match the deterministic one asymptotically.)

At a very high level, our results generalise the Tsitsiklis & Luo (1987) idea of linking the communication complexity to the number of quadratic functions with distinct minima in the domain. To extend this approach to the multi-node case $N > 2$ and to randomised (stochastic) algorithms, we build connections to results and techniques from communication complexity. Such connections have not, to our knowledge, been explored in the context of (real-valued) optimisation tasks, despite reductions from communication complexity being a standard lower bound technique e.g. in distributed computing (Das Sarma et al., 2012; Abboud et al., 2016; Drucker et al., 2014). Our work thus provides a model and a basic toolkit for applying communication complexity results to distributed optimisation, which should also be useful for understanding other optimisation tasks.

Extensions. While, for convenience, we work with functions over $[0,1]^d$, our bounds immediately extend to arbitrary convex domains, as long as we can bound the number of functions with distinct minima in the domain. Beyond strongly convex and smooth functions, we also show that for non-convex $\lambda$-Lipschitz functions, solving (1) requires $N \exp(\Omega(d \log(\lambda d/\varepsilon)))$ total bits communicated.
The main takeaway from this result is that for non-convex objectives, one can induce exponentially higher communication cost by building convoluted input families where the coordinator is essentially required to learn all local input functions of the nodes.

Optimal upper bound. To complement our lower bound, we show that for strongly convex and strongly smooth functions, finding a solution to (1) can be done deterministically with $O(Nd\kappa \log \kappa \log(\beta d/\varepsilon))$ total bits communicated, where $\sum_{i=1}^N f_i$ is $\alpha$-strongly convex and $\beta$-strongly smooth, and $\kappa = \beta/\alpha$ is the condition number. This algorithm matches our lower bound for constant condition number. It is a variant of deterministic quantised gradient descent (Magnússon et al., 2019). However, to achieve a tight bound, we need to (a) ensure that our gradient quantisation is sufficiently parsimonious, using $O(d \log \kappa)$ bits per gradient, and (b) avoid all-to-all exchange of gradients. For (a), we specialise a recent lattice-based quantisation scheme which allows arbitrary centering of iterates (Alistarh et al., 2020), and for (b), we use a two-stage quantisation approach, where the nodes first send their quantised gradients to the coordinator, and the coordinator then broadcasts the carefully quantised sum back to the nodes.

1.2. RELATED WORK

Message-passing versus broadcast. There are two communication models frequently used in both communication complexity and distributed optimisation: the first is the message-passing model we focus on, and the second is the broadcast or blackboard model, where each message sent by a node is seen by all nodes. The broadcast model is more powerful than the message-passing model: lower bounds for the broadcast model apply also to message-passing, but upper bounds for broadcast do not directly translate to message-passing. Arguably, the message-passing model is closer to reality, as constant-cost, high-bandwidth broadcast mechanisms do not exist in real systems.

Optimisation lower bounds. The first communication lower bounds for a variant of (1) were given in the seminal work of Tsitsiklis & Luo (1987), who study optimising sums of convex functions in a two-machine setting. For deterministic algorithms, they prove that $\Omega(d \log(\beta_0 d/\varepsilon))$ bits are necessary. Zhang et al. (2013) use a nearly identical argument to give an $\Omega(d \log(\beta_0 d/\varepsilon))$ lower bound for randomised algorithms in the broadcast model. (See also Table 1 in the Appendix.) The basic intuition behind these lower bounds is that a node without information about the input needs to receive $\Omega(d \log(\beta_0 d/\varepsilon))$ bits, as otherwise the node cannot produce sufficiently many different output distributions to cover all possible locations of the minimum (cf. Lemma 2). It is worth emphasising that their bound is on the received bits of the output node, and does not directly imply anything for other nodes; for example, an algorithm where each node transmits $O(d \log(\beta_0 d/\varepsilon)/N)$ bits is not ruled out by these previous results. Generalising their approach to match our results seems challenging, as we would have to (a) explicitly require that all nodes output the solution, and (b) ensure that no node can use its local input as a source of extra information. Our study was inspired in part by the recent work of Vempala et al.
(2020), who characterised the communication complexity of solving linear systems, linear regression, and related problems over bounded integer matrices. The results are based on communication complexity arguments, similarly to our lower bound. However, there are some notable technical differences: first, the arbitrary real functions we consider do not have a natural binary encoding, and therefore their approach would not directly extend to our setting. Second, the approximation ratio for linear regression is defined multiplicatively in their work, whereas we consider additive approximations. Both formulations are popular in terms of upper bounds, with the additive error formulation being arguably more popular in the context of machine learning applications. Overall, our results complement theirs, enriching the landscape of lower bounds for distributed optimisation.

Statistical estimation lower bounds. In statistical estimation, nodes receive random samples from some input distribution, and must infer properties of the input distribution, e.g. its mean. Specifically, for mean estimation, there are statistical limits on how good an estimate one can obtain from a limited number of samples, although inputs are drawn from a distribution instead of adversarially. Concretely, the results of Shamir (2014) and Suresh et al. (2017) apply only to restricted types of protocols. Garg et al. (2014) and Braverman et al. (2016) give lower bounds for Gaussian mean estimation, where each node receives $s$ samples from a $d$-dimensional Gaussian distribution with variance $\sigma^2$. The latter reference shows that achieving the minimax rate $\sigma^2 d/(Ns)$ on mean squared error requires $\Omega(Nd)$ total communication. These results do not imply optimal lower bounds for our setting.

Lower bounds on round and oracle complexity. Beyond bit complexity, one previous setting assumes that nodes can transmit vectors of real numbers, while restricting the types of computation allowed for the nodes.
This is useful for establishing bounds on the number of iterations required for convergence of distributed optimisation algorithms (Arjevani & Shamir, 2015; Scaman et al., 2017), but does not address the communication cost of a single iteration. A second, related but different setting assumes that the nodes can access their local functions only via specific oracle queries, such as gradient or proximal queries, and bounds the number of such queries required to solve an optimisation problem (Woodworth & Srebro, 2016; Woodworth et al., 2018).

Upper bounds. There has been a tremendous amount of work recently on communication-efficient optimisation algorithms in the distributed setting. Due to space constraints, we focus on a small selection of closely related work. One critical difference relative to practical references, e.g. Alistarh et al. (2017), is that they usually assume gradients are provided as 32-bit inputs, and focus on reducing the amount of communication by constant factors, which is reasonable in practice. One exception is Suresh et al. (2017), who present a series of quantisation methods for mean estimation on real-valued input vectors. Recently, Alistarh et al. (2020) studied the same problem, focusing on replacing the dependence on input norm with a variance dependence. We adapt their scheme for our upper bound. Tsitsiklis & Luo (1987) gave a deterministic upper bound in a two-node setting, with $O(\kappa d \log(\kappa d) \log(\beta d/\varepsilon))$ total communication cost. Recently, Magnússon et al. (2019) extended this to the $N$-node case in the broadcast model, with $O(N\kappa d \log(\kappa d) \log(\beta d/\varepsilon))$ total communication cost. For randomised algorithms and constant condition number, a better upper bound of $O(Nd \log(\beta d/\varepsilon))$ total communication cost in the broadcast model follows by plugging QSGD stochastic quantisation (Alistarh et al., 2017) into stochastic variance-reduced gradient descent (SVRG) (Johnson & Zhang, 2013). See Künstner (2017) for a detailed treatment.
(See also Table 2 in the Appendix.)

2. PRELIMINARIES AND BACKGROUND

Coordinator model. We consider communication protocols in the classic coordinator model (Dolev & Feder, 1992; Phillips et al., 2012; Braverman et al., 2013). In this model, we have $N$ nodes as well as a separate coordinator node. The task is to compute the value of a function $\Gamma \colon B^N \to A$, where $B$ and $A$ are arbitrary input and output domains; each node $i = 1, 2, \ldots, N$ receives an input $b_i \in B$. There is a communication channel between each of the nodes and the coordinator, and nodes can communicate with the coordinator by exchanging binary messages. The coordinator has to output the value $\Gamma(b_1, b_2, \ldots, b_N)$. Furthermore, all nodes, including the coordinator, have access to a stream of private random bits.

More precisely, we assume without loss of generality that computation is performed as follows: (1) Initially, each node $i = 1, 2, \ldots, N$ receives the input $b_i$. The coordinator and the nodes $i = 1, 2, \ldots, N$ receive independent and uniformly random binary strings $r, r_i \in \{0,1\}^c$, respectively, where $c$ is a constant. (2) The computation then proceeds in sequential rounds, where in each round, (a) the coordinator first takes action by either outputting an answer or sending a message to a single node $i$, and (b) the node $i$ that received a message from the coordinator responds by sending a message to the coordinator. A transcript for a node is a list of the messages it has sent and received. A protocol $\Pi$ is a mapping giving the actions of the coordinator and the nodes; for the coordinator, the next action is a function of its transcript so far and the private random bits $r$, and for node $i$, the next action is a function of its input $b_i$, its transcript so far, and the private random bits $r_i$. The protocol $\Pi$ also determines the number of random bits the nodes receive. We say that a protocol $\Pi$ computes $\Gamma \colon B^N \to A$ with error $p$ if for all $(b_1, b_2, \ldots, b_N) \in B^N$, the output of $\Pi$ is $\Gamma(b_1, b_2, \ldots, b_N)$ with probability at least $1 - p$.
The communication complexity of a protocol $\Pi$ is the maximum number of total bits transmitted by all nodes, i.e. the total length of the transcripts, over any input $(b_1, b_2, \ldots, b_N) \in B^N$ and any private random bits of the nodes. While the model definition may appear restrictive, the protocol restrictions do not matter when the complexity measure is the total number of bits exchanged. Any algorithm using parallel synchronous or even asynchronous communication can be transformed into a sequential protocol by sequentialising the communication steps to occur one after the other. Likewise, algorithms using all-to-all message-passing can be transformed to the coordinator model by routing all messages via the coordinator. These transformations incur at most a constant factor overhead. Finally, observe that the model is nonuniform, i.e. each protocol is defined only for specific functions $\Gamma \colon B^N \to A$ and specific input and output sets $B$ and $A$. As such, we do not need to impose any requirements on the computability of the protocol actions; rather, these can be arbitrary functions. Any uniform algorithm working for a range of parameters induces a series of nonuniform protocols, so lower bounds in the coordinator model translate to uniform algorithms.

Communication complexity. We now recall some basic definitions and results from communication complexity. In the following, we assume that the sets $B$ and $A$ are finite, as this is the standard setting of communication complexity. For a function $\Gamma \colon B^N \to A$, the deterministic communication complexity $\mathrm{CC}(\Gamma)$ is the minimum communication complexity of a deterministic protocol computing $\Gamma$. Likewise, the $\delta$-error randomised communication complexity $\mathrm{RCC}_\delta(\Gamma)$ is the minimum communication complexity of a protocol that computes $\Gamma$ with error probability $\delta$.
For a distribution $\mu$ over $B^N$, we define the $\delta$-error $\mu$-distributional communication complexity of $\Gamma$, denoted by $\mathrm{D}^\mu_\delta(\Gamma)$, as the minimum communication complexity of a deterministic protocol that computes $\Gamma$ with error probability $\delta$ when the input is drawn from $\mu$. Similarly, the $\delta$-error $\mu$-distributional expected communication complexity of $\Gamma$, denoted by $\mathrm{ED}^\mu_\delta(\Gamma)$, is the minimum expected communication cost of a protocol that computes $\Gamma$ with error probability $\delta$, where the expectation is taken over the input drawn from $\mu$ and the random bits of the protocol. Yao's Lemma (Yao, 1977) relates the distributional communication complexity to the randomised communication complexity; see Woodruff & Zhang (2017) for a proof in the coordinator model.

Lemma 1 (Yao's Lemma). For a function $\Gamma$ and $\delta > 0$, we have $\mathrm{RCC}_\delta(\Gamma) \ge \max_\mu \mathrm{D}^\mu_\delta(\Gamma)$.

Properties of convex functions. Recall that a continuously differentiable function $f$ is $\beta$-(strongly) smooth if $\|\nabla f(x) - \nabla f(y)\|_2 \le \beta \|x - y\|_2$, and $\alpha$-strongly convex if $(\nabla f(x) - \nabla f(y))^T (x - y) \ge \alpha \|x - y\|_2^2$, for all $x$ and $y$ in the domain of $f$. For an $\alpha$-strongly convex and $\beta$-strongly smooth function $f$, we say that $f$ has condition number $\kappa = \beta/\alpha$. If $f_1$ is $\alpha_1$-strongly convex and $\beta_1$-strongly smooth and $f_2$ is $\alpha_2$-strongly convex and $\beta_2$-strongly smooth, then $f_1 + f_2$ is $(\alpha_1 + \alpha_2)$-strongly convex and $(\beta_1 + \beta_2)$-strongly smooth. A quadratic function $f(x) = \beta \|x - y\|_2^2 + C$ is $\beta$-strongly convex and $\beta$-strongly smooth. For $\varepsilon > 0$, if $f$ has minimum value $0$ at $x^*$ and $f(x) \le \varepsilon$, then $\|x - x^*\|_2 \le (\varepsilon/\beta)^{1/2}$. A sum of quadratics $F(x) = \sum_{j=1}^k a_j \|x - y_j\|_2^2$, where $y_j \in \mathbb{R}^d$ and $a_j \ge 0$ for $j = 1, 2, \ldots, k$, is itself a quadratic function $F(x) = A \|x - x^*\|_2^2 + C$, where $A = \sum_{j=1}^k a_j$, $C$ is a constant, and $x^* = \sum_{j=1}^k a_j y_j / A$ is the minimum of $F$.

Point packing. We will make use of the following elementary result, which bounds the number of points we can pack into $[0,1]^d$ while maintaining a minimum distance between all points.

Lemma 2 (Tsitsiklis & Luo (1987)).
For $\delta > 0$ and $d \ge 1$, there is a set of points $S \subseteq [0,1]^d$ with $\|x - y\|_2 > \delta$ for all distinct $x, y \in S$, and $|S| \ge (d^{1/2}/(C\delta))^d$, where $C = (\pi e/2)^{1/2}$ is a constant.
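As a concrete illustration of Lemma 2 (our sketch; the function name and constants are ours, not from the paper), an axis-aligned grid already gives a $\delta$-separated set of roughly $(1/\delta)^d$ points. The lemma's ball-packing argument is stronger for large $d$, gaining a factor of about $d^{1/2}/C$ per dimension, so the numeric comparison below only holds for small $d$.

```python
import itertools
import math

def grid_packing(d, delta):
    """delta-separated points in [0,1]^d via an axis-aligned grid.

    Adjacent grid points differ by slightly more than delta in one
    coordinate, so all pairwise distances exceed delta."""
    spacing = delta * 1.0001
    m = int(1.0 // spacing) + 1          # points per axis
    coords = [i * spacing for i in range(m) if i * spacing <= 1.0]
    return list(itertools.product(coords, repeat=d))

d, delta = 3, 0.3
S = grid_packing(d, delta)
# pairwise separation holds by construction
for i, p in enumerate(S):
    for r in S[i + 1:]:
        assert math.dist(p, r) > delta
# for this small d the grid already beats the lemma's guarantee
C = math.sqrt(math.pi * math.e / 2)
assert len(S) >= (math.sqrt(d) / (C * delta)) ** d
```

For $d \ge 5$ the lemma's bound $(d^{1/2}/(C\delta))^d$ exceeds what any grid can achieve, which is exactly why the non-constructive packing argument is needed.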

3.1. DETERMINISTIC LOWER BOUND FOR QUADRATIC FUNCTIONS

We start with a warm-up result, proving a lower bound against deterministic protocols. Essentially, we show that even recognising whether all the nodes have the same input function is hard. Recall that in the $N$-player equality problem over a universe of size $d$, denoted by $\mathrm{EQ}_{d,N}$, each player $i$ is given an input $b_i \in \{0,1\}^d$, and the task is to decide if all players have the same input. That is, $\mathrm{EQ}_{d,N}(b_1, \ldots, b_N) = 1$ if all inputs are equal, and $0$ otherwise. It is known (Vempala et al., 2020) that the deterministic communication complexity of $\mathrm{EQ}_{d,N}$ is $\mathrm{CC}(\mathrm{EQ}_{d,N}) = \Omega(Nd)$.

Theorem 3. Given parameters $N$, $d$, $\varepsilon$, $\beta_0$ and $\beta = \beta_0 N$ satisfying $d\beta/\varepsilon = \Omega(1)$, any deterministic protocol solving (1) for quadratic input functions $x \mapsto \beta_0 \|x - x_0\|_2^2$ has communication complexity $\Omega(Nd \log(\beta d/\varepsilon))$.

Proof. Assume $\Pi$ is a deterministic protocol solving (1) with communication complexity $C_\Pi$. We show that $\Pi$ can then solve $N$-party equality over a universe of size $D = \Omega(d \log(\beta d/\varepsilon))$, implying $C_\Pi = \Omega(ND) = \Omega(Nd \log(\beta d/\varepsilon))$. More specifically, let $S$ be the set given by Lemma 2 with $\delta = (\varepsilon/(2\beta))^{1/2}$, and let $D = \log |S| = \Theta(d \log(\beta d/\varepsilon))$. Note that since we assume $d\beta/\varepsilon = \Omega(1)$, the set $S$ has at least two elements and $D \ge 1$. For technical convenience, assume $|S| = 2^D$, and identify each binary string $b \in \{0,1\}^D$ with an element $\tau(b) \in S$. Next, assume that each node $i$ is given a binary string $b_i \in \{0,1\}^D$ as input, and we want to compute $\mathrm{EQ}_{D,N}(b_1, b_2, \ldots, b_N)$. The nodes simulate protocol $\Pi$ with input function $f_i$ for node $i$, where $f_i(x) = \beta_0 \|x - \tau(b_i)\|_2^2$. Let us denote $F = \sum_{i=1}^N f_i$. Upon termination of the protocol, the coordinator learns a point $y \in [0,1]^d$ satisfying $F(y) \le F(x^*) + \varepsilon$ and an estimate $r \in \mathbb{R}$ satisfying $F(y) \le r \le F(y) + \varepsilon$, where $x^*$ is the true global minimum. The coordinator can now adjudicate equality based on $r$ as follows: (1) If all inputs $b_i$ are equal, then the functions $f_i$ are also equal, and $F(x^*) = 0$. In this case, we have $F(y) \le \varepsilon$ and thus $r \le 2\varepsilon$, and the coordinator outputs 1. (2) If there are nodes $i$ and $j$ such that $b_i \ne b_j$, then for all points $x \in [0,1]^d$, we have $f_i(x) + f_j(x) > 2\varepsilon$ by the definition of $S$, and thus $F(x^*) > 2\varepsilon$. In this case, we have $r \ge F(y) > 2\varepsilon$, and the coordinator outputs 0. Since communication is only used for the simulation of $\Pi$, this computes $\mathrm{EQ}_{D,N}(b_1, b_2, \ldots, b_N)$ with $C_\Pi$ total communication, completing the proof.
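The decision rule in the proof can be checked numerically. The following sketch (our illustration; the helper `F_min` and the example points are not from the paper) uses the fact that $F(x) = \sum_i \beta_0 \|x - p_i\|_2^2$ is minimised at the mean of the $p_i$, and shows that the threshold $2\varepsilon$ separates the all-equal case from a case with two well-separated inputs.

```python
def F_min(points, beta0):
    """Minimum of F(x) = sum_i beta0*||x - p_i||^2.

    F is quadratic with minimiser at the mean of the p_i, so the
    minimum value is beta0 * sum_i ||mean - p_i||^2."""
    d = len(points[0])
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return beta0 * sum(sum((mean[k] - p[k]) ** 2 for k in range(d))
                       for p in points)

beta0, eps = 1.0, 0.01
# all-equal inputs: F(x*) = 0, so the estimate r is at most 2*eps
same = [(0.2, 0.7)] * 4
assert F_min(same, beta0) <= 2 * eps
# one input far from the others: F(x*) > 2*eps, so r > 2*eps
diff = [(0.2, 0.7)] * 3 + [(0.9, 0.1)]
assert F_min(diff, beta0) > 2 * eps
```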

3.2. RANDOMISED LOWER BOUND FOR QUADRATIC FUNCTIONS

We now prove our main result, giving a lower bound on the communication complexity of any algorithm solving (1) that holds even for randomised protocols, albeit with a slightly weaker bound.

Theorem 4. Given parameters $N$, $d$, $\varepsilon$, $\beta_0$ and $\beta = \beta_0 N$ satisfying $d\beta/(N^2\varepsilon) = \Omega(1)$, any protocol solving (1) for quadratic input functions $x \mapsto \beta_0 \|x - x_0\|_2^2$ has communication complexity $\Omega(Nd \log(\beta d/(N\varepsilon)))$.

Discussion. As this is our main result, we pause to discuss its implications. First, this result is more general than Theorem 3, as it applies to stochastic algorithms such as SGD or SVRG (Johnson & Zhang, 2013), which are arguably more popular in practice. The price for the increased generality is the additional $N$ factor in the denominator inside the logarithm, which appears due to technical requirements in the reduction. Finally, we note that the constant lower bound on $d\beta/(N^2\varepsilon)$ is the relatively small $e\pi < 8.6$, and that this condition is likely to hold in most practical settings of interest, as $d$ is usually quite large, and $\varepsilon$ is usually quite small.

Proof overview. To formally apply communication complexity tools, we will prove a lower bound for a discretised version of (1), where both the input and output sets are finite, which will imply Theorem 4. Let $N$, $d$, $\varepsilon$, and $\beta$ be fixed, assume $d\beta/(N^2\varepsilon) = \Omega(1)$, and (1) let $S$ be the set given by Lemma 2 with $\delta = 3N(\varepsilon/\beta)^{1/2}$, and (2) let $T \subseteq [0,1]^d$ be an arbitrary finite set of points such that for any $x \in [0,1]^d$, there is a point $t \in T$ with $\|x - t\|_2 \le (\varepsilon/(4\beta))^{1/2}$. By the assumption $d\beta/(N^2\varepsilon) = \Omega(1)$, the set $S$ has size at least 2. In the discretised problem $\mathrm{MEAN}^{\varepsilon,\beta}_{d,N}$, each node $i$ receives a point of $S$, encoded as a binary string $b_i$, and the coordinator must output a point $t \in T$ within distance $(\varepsilon/\beta)^{1/2}$ of the mean of the input points. First, we observe that any algorithm for solving (1) can be used to solve $\mathrm{MEAN}^{\varepsilon,\beta}_{d,N}$.

Lemma 6. For fixed $N$, $d$, $\varepsilon$, $\beta_0$ and $\beta = \beta_0 N$, any randomised protocol solving (1) for quadratic functions $x \mapsto \beta_0 \|x - x_0\|_2^2$ with error probability $1/3$ has communication complexity at least $\mathrm{RCC}_{1/3}(\mathrm{MEAN}^{\varepsilon,\beta/4}_{d,N})$.

The natural next step is to prove a lower bound on the communication complexity of $\mathrm{MEAN}^{\varepsilon,\beta}_{d,N}$.
We do this by using an instance of the symmetrisation technique of Phillips et al. (2012), via a reduction from the expected communication complexity of a two-party communication problem where one player has to learn the complete input of the other player. Specifically, in the two-player problem called $\text{2-BITS}_d$, player 1 (Alice) receives a binary string $b \in \{0,1\}^d$ of length $d$, and the task is for player 2 (Bob) to output $b$. Let $\zeta_p$ be the distribution over binary strings $b \in \{0,1\}^d$ where each bit is set to 1 with probability $p$ and to 0 with probability $1-p$. The following lower bound for $\text{2-BITS}_d$ is known, and holds even for protocols with public randomness, i.e. when Alice and Bob have access to the same string of random bits:

Lemma 7 (Phillips et al. (2012)). $\mathrm{ED}^{\zeta_p}_{1/3}(\text{2-BITS}_d) = \Omega(dp \log p^{-1})$.

Lemma 8. For $N$, $d$, $\varepsilon$, and $\beta$ satisfying $d\beta/(N^2\varepsilon) = \Omega(1)$, we have $\mathrm{RCC}_{1/3}(\mathrm{MEAN}^{\varepsilon,\beta}_{d,N}) = \Omega(N \cdot \mathrm{ED}^{\zeta_{1/2}}_{1/3}(\text{2-BITS}_D)) = \Omega(Nd \log(\beta d/(N\varepsilon)))$.

Due to space constraints, we defer the proofs of Lemmas 6 and 8 to Appendix A.1. Theorem 4 now follows immediately from Lemmas 6 and 8. The result can be generalised to arbitrary convex domains $D \subseteq \mathbb{R}^d$ as $\Omega(N \log s)$, given a point packing bound $s$ for $D$ as in Lemma 2.

3.3. LOWER BOUND FOR NON-CONVEX FUNCTIONS

We now show a simple lower bound for optimisation over non-convex objective functions. Specifically, we construct a set of hard input functions as follows. Let $\varepsilon$, $d$ and $\beta$ be constants satisfying $d\beta/\varepsilon = \Omega(1)$, and consider the set $S$ given by Lemma 2 with $\delta = 2\varepsilon/\beta$. This gives a set $S$ of size at least $(\beta d^{1/2}/(2C\varepsilon))^d = \exp(\Omega(d \log(\beta d/\varepsilon)))$. Let us identify the points in $S$ with the elements of $\{1, 2, \ldots, |S|\}$. For a binary string $b \in \{0,1\}^{|S|}$, define the function $f_b$ by

$f_b(x) = \beta \|x - s\|_2$ if $\|x - s\|_2 < \varepsilon/\beta$ for some $s \in S$ with $b_s = 1$, and $f_b(x) = \varepsilon$ otherwise.

Since the distance between points in $S$ is at least $2\varepsilon/\beta$, the functions $f_b$ are well-defined, continuous and $\beta$-Lipschitz. The proof works by reduction from $N$-player set disjointness (Braverman et al., 2013); we defer the details to Appendix B.

Theorem 9. Given parameters $N$, $d$, $\varepsilon$ and $\beta$ satisfying $d\beta/\varepsilon = \Omega(1)$ and $(\beta d^{1/2}/(2C\varepsilon))^d = \omega(\log N)$, any protocol solving (1) with error probability $\delta > 0$ when the inputs are guaranteed to be functions $f_b$ for $b \in \{0,1\}^{|S|}$ has communication complexity $N \exp(\Omega(d \log(\beta d/\varepsilon)))$.
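The hard instances $f_b$ can be sketched as follows (our illustration; here $S$ is a small hand-picked point set rather than the packing set of Lemma 2, and the helper name is ours). Each marked point carries a narrow cone dipping to 0, and the function is the constant $\varepsilon$ everywhere else, so the minima encode the bit string $b$.

```python
import math

def make_f_b(S, b, beta, eps):
    """Hard non-convex input: f_b(x) = beta*||x - s||_2 near marked
    points s (those with b_s = 1), and the constant eps elsewhere.

    Points of S are at distance >= 2*eps/beta apart, so the balls of
    radius eps/beta around them are disjoint and f_b is well defined,
    continuous, and beta-Lipschitz."""
    def f_b(x):
        for s, bit in zip(S, b):
            if bit == 1 and math.dist(x, s) < eps / beta:
                return beta * math.dist(x, s)
        return eps
    return f_b

beta, eps = 1.0, 0.1
S = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]  # pairwise distance >= 2*eps/beta
f = make_f_b(S, [1, 0, 1], beta, eps)
assert f((0.0, 0.0)) == 0.0    # marked point: minimum value 0
assert f((0.5, 0.5)) == eps    # unmarked point: constant value eps
assert f((0.25, 0.25)) == eps  # far from every marked point
```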

4. TIGHT DETERMINISTIC UPPER BOUND

We now present a deterministic algorithm which matches our lower bound for constant condition number in the coordinator model. We follow the general structure of communication-reduced algorithms, e.g. Magnússon et al. (2019): the nodes collectively execute an instance of gradient descent (GD), where they carefully quantise their updates, which are accumulated at a coordinator. The algorithm relies on two new technical ingredients to achieve optimality: 1) we use two-step quantisation to avoid (inherently suboptimal) all-to-all communication; 2) we remove a superfluous $\log d$ factor in the communication by employing a lattice-based quantisation scheme allowing for arbitrary centring of the gradient estimates to be averaged (Alistarh et al., 2020). The second step causes non-trivial complications, as this scheme may fail if inputs are too far apart.

Preliminaries. We assume that the input function of each node $i$ is $f_i \colon [0,1]^d \to \mathbb{R}$, which is $\alpha_0$-strongly convex and $\beta_0$-strongly smooth. This implies that $F = \sum_{i=1}^N f_i$ is $\alpha$-strongly convex and $\beta$-strongly smooth for $\alpha = N\alpha_0$ and $\beta = N\beta_0$. Consequently, the functions $f_i$ and $F$ have condition number bounded by $\kappa = \beta/\alpha$. Furthermore, we assume that the local functions $f_i$ all have minimum value $\inf_{x \in [0,1]^d} f_i(x) = 0$, and thus range $[0, \beta_0 d]$. We aim to reach the global minimum $x^*$ of the sum $\sum_{i=1}^N f_i(x)$ by starting from an arbitrary point $x^{(0)} \in [0,1]^d$, and applying a surrogate of the GD update

$x^{(t+1)} = x^{(t)} - \gamma \sum_{i=1}^N \nabla f_i(x^{(t)})$,

where $\gamma > 0$ is the learning rate parameter. It is well-known, e.g. Bubeck (2015), that GD converges at an exponential rate in $(1 - 1/\kappa)$. The algorithm has each node generate gradients of its local function $f_i$, and quantise them in a carefully parametrised way. Specifically, the quantisation we use works in a setting where the nodes all know a point $q \in [0,1]^d$, and the points to be quantised are in the vicinity of $q$; in the algorithm, the point $q$ will be the previous quantised gradient.
The quantisation is parametrised by $R$, the maximum distance between the input point and $q$, and by $\varepsilon$, the maximum quantisation error we wish to tolerate.

Corollary 10 (Alistarh et al. (2020)). Let $R$ and $\varepsilon$ be fixed positive parameters, $q \in \mathbb{R}^d$ be an estimate vector, and $B \in \mathbb{N}$ be the number of bits used by the quantisation scheme. Then, there exists a deterministic quantisation scheme, specified by a function $Q_{\varepsilon,R} \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$, an encoding function $\mathrm{enc}_{\varepsilon,R} \colon \mathbb{R}^d \to \{0,1\}^B$, and a decoding function $\mathrm{dec}_{\varepsilon,R} \colon \mathbb{R}^d \times \{0,1\}^B \to \mathbb{R}^d$, with the following properties: (1) (Validity.) $\mathrm{dec}_{\varepsilon,R}(q, \mathrm{enc}_{\varepsilon,R}(x)) = Q_{\varepsilon,R}(x, q)$ for all $x, q \in \mathbb{R}^d$ with $\|x - q\|_2 \le R$. (2) (Accuracy.) $\|Q_{\varepsilon,R}(x, q) - x\|_2 \le \varepsilon$ for all $x, q \in \mathbb{R}^d$ with $\|x - q\|_2 \le R$. (3) (Cost.) If $\varepsilon = \lambda R$ for any $\lambda < 1$, the bit cost of the scheme satisfies $B = O(d \log \lambda^{-1})$.

Algorithm description. We now describe the algorithm and overview its guarantees. The full description and analysis are available in Appendix C. We assume that the constants $\alpha$ and $\beta$ are known to all nodes, so the parameters of the quantised gradient descent can be computed locally, and take $W$ to be an upper bound on the diameter of the convex domain $D$, e.g. $W = d^{1/2}$ if $D = [0,1]^d$. We assume that the initial iterate $x^{(0)}$ is arbitrary, but the same at all nodes, and set the initial quantisation estimates $q^{(0)}$ and $q^{(0)}_i$ at each node $i$ to the origin. The algorithm proceeds in rounds $t = 1, 2, \ldots, T$. At the beginning of round $t+1$, each node $i$ knows the values of the iterate $x^{(t)}$, the previous global quantised gradient $q^{(t)}$, and its local quantised gradient $q^{(t)}_i$; the coordinator knows all these values. We define the following parameters for the algorithm. Let $\gamma = \beta^{-1}$ and $\xi = 1 - \kappa^{-1}$ be the step size and convergence rate of gradient descent, and let $W$ be such that $\|x^{(0)} - x^*\|_2 \le W$. We define

$K = 2/\xi$, $\delta = \xi(1-\xi)/4$, $\mu = \delta K + \xi$, $R^{(t)} = \beta K W \mu^t$.

Assuming $\kappa \ge 2$, we have $\mu < 1$, $\xi \ge 1/2$ and $K \ge 1$.
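To make the interface of Corollary 10 concrete, here is a naive baseline (our sketch, not the lattice scheme of Alistarh et al. (2020)): a coordinate-wise grid centred at $q$ with step $2\varepsilon/\sqrt{d}$. It satisfies validity and accuracy, but for $\|x - q\|_2 \le R$ it needs about $d \log_2(R\sqrt{d}/\varepsilon)$ bits, i.e. an extra $(d/2)\log_2 d$ bits over the $O(d \log \lambda^{-1})$ cost of the lattice scheme. This is precisely the superfluous $\log d$ factor the algorithm removes.

```python
import math

def encode(x, q, eps):
    """Coordinate-wise grid quantiser centred at q, step 2*eps/sqrt(d).

    Per-coordinate error is at most step/2 = eps/sqrt(d), so the
    l2 error of the decoded point is at most eps."""
    d = len(x)
    step = 2 * eps / math.sqrt(d)
    return [round((x[k] - q[k]) / step) for k in range(d)]

def decode(code, q, eps):
    d = len(q)
    step = 2 * eps / math.sqrt(d)
    return [q[k] + c * step for k, c in enumerate(code)]

q = [0.5, 0.5, 0.5]
x = [0.61, 0.44, 0.52]
eps = 0.01
xq = decode(encode(x, q, eps), q, eps)
assert math.dist(xq, x) <= eps  # accuracy property of Corollary 10
```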
At step $t$, the nodes perform the following steps: (1) Each node $i$ updates its iterate as $x^{(t+1)} = x^{(t)} - \gamma q^{(t)}$. (2) Each node $i$ computes its local gradient at $x^{(t+1)}$, and transmits it in quantised form to the coordinator as follows. Let $\varepsilon_1 = \delta R^{(t+1)}/(2N)$ and $\rho_1 = R^{(t+1)}/N$. (a) Node $i$ computes $\nabla f_i(x^{(t+1)})$ locally, and sends the message $m_i = \mathrm{enc}_{\varepsilon_1,\rho_1}(\nabla f_i(x^{(t+1)}))$ to the coordinator. (b) The coordinator receives the messages $m_i$ for $i = 1, 2, \ldots, N$, and decodes them as $q^{(t+1)}_i = \mathrm{dec}_{\varepsilon_1,\rho_1}(q^{(t)}_i, m_i)$. The coordinator then computes $r^{(t+1)} = \sum_{i=1}^N q^{(t+1)}_i$. (3) The coordinator sends the quantised sum of gradients to all other nodes as follows. Let $\varepsilon_2 = \delta R^{(t+1)}/2$ and $\rho_2 = (1 + \delta/2) R^{(t+1)}$. (a) The coordinator sends the message $m = \mathrm{enc}_{\varepsilon_2,\rho_2}(r^{(t+1)})$ to each node $i$. (b) Each node decodes the coordinator's message as $q^{(t+1)} = \mathrm{dec}_{\varepsilon_2,\rho_2}(q^{(t)}, m)$.

Guarantees. The key technical trick behind the algorithm is the extremely careful choice of parameters for quantisation at every step. This balances the fact that the quantisation has to be fine enough to ensure optimal GD convergence, but coarse enough to ensure optimal communication cost. Overall, the algorithm ensures the following guarantees, whose proof is provided in Appendix C.

Theorem 11. Let $\varepsilon > 0$, a dimension $d$, and a convex domain $D \subseteq \mathbb{R}^d$ of diameter $W$ be fixed. Given $N$ nodes, each assigned a function $f_i \colon D \to \mathbb{R}$ such that $F = \sum_{i=1}^N f_i$ is $\alpha$-strongly convex and $\beta$-smooth, the above algorithm converges to a point $x^{(T)}$ with $F(x^{(T)}) - F(x^*) \le \varepsilon$ using $O(Nd\kappa \log \kappa \log(\beta W/\varepsilon))$ bits of communication.
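The two-stage message flow can be simulated end to end. The sketch below is ours: it substitutes a crude grid quantiser and a geometric error schedule $0.5^t$ for the actual enc/dec of Corollary 10 and the radii $R^{(t)}$, and runs on exact quadratics $f_i(x) = \beta_0 \|x - c_i\|_2^2$ (for which $\gamma = 1/(2\beta)$ makes the unquantised step exact), purely to illustrate stage 1 (nodes to coordinator) and stage 2 (coordinator to nodes).

```python
import math

def quantise(v, q, eps):
    # stand-in for enc/dec of Corollary 10: grid of step 2*eps/sqrt(d)
    # centred at q, l2 error at most eps
    d = len(v)
    step = 2 * eps / math.sqrt(d)
    return [q[k] + round((v[k] - q[k]) / step) * step for k in range(d)]

N, d, beta0 = 3, 2, 1.0
centres = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]
beta = N * beta0                       # smoothness of F = sum_i f_i
gamma = 1.0 / (2 * beta)               # f_i has Hessian 2*beta0*I
x = [0.0, 0.0]
q_glob = [0.0] * d                     # previous global quantised gradient
q_loc = [[0.0] * d for _ in range(N)]  # previous local quantised gradients
for t in range(50):
    # each node applies the surrogate GD step with the quantised sum
    x = [x[k] - gamma * q_glob[k] for k in range(d)]
    eps_t = 0.5 ** t                   # shrinking error, mimicking R^(t)
    # stage 1: nodes quantise local gradients against previous estimates
    grads = [[2 * beta0 * (x[k] - c[k]) for k in range(d)] for c in centres]
    q_loc = [quantise(g, ql, eps_t / N) for g, ql in zip(grads, q_loc)]
    r = [sum(ql[k] for ql in q_loc) for k in range(d)]
    # stage 2: coordinator quantises the sum against the previous q_glob
    q_glob = quantise(r, q_glob, eps_t)
# F is minimised at the mean of the centres
x_star = [sum(c[k] for c in centres) / N for k in range(d)]
assert math.dist(x, x_star) < 1e-3
```

Since the per-round quantisation error shrinks geometrically while the exact GD step is contractive, the iterate tracks the true trajectory up to the current quantisation radius, mirroring the balance described in the Guarantees paragraph.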

5. DISCUSSION AND FUTURE WORK

We have provided the first tight bounds on the communication complexity of optimising sums of quadratic functions in the N-party model with a coordinator. Our results are algorithm-independent, and immediately imply the same lower bounds for the practical parameter-server and decentralised models of distributed optimisation. In terms of future work, we expect that the randomised lower bound can be improved to match the deterministic one even for small d, possibly via a reduction from a suitable gap problem in communication complexity (e.g. Chakrabarti & Regev (2011)). Another avenue for future work is to investigate tight upper and lower bounds in the case where the functions being optimised are not quadratics, as isolating the "right" dependency on the condition number does not appear immediate. Finally, understanding the exact complexity of optimisation in the broadcast model remains open.

A OMITTED PROOFS, SECTION 3.2

A.1 PROOF OF LEMMA 6

Proof of Lemma 6. Let Π be a protocol solving (1) with communication complexity C and error probability 1/3. We show that we can use it to solve MEAN^{ε,β/4}_{d,N} with total communication cost C and error probability 1/3, implying the claim. Given input (b₁, b₂, ..., b_N) for MEAN^{ε,β/4}_{d,N}, the nodes can simulate the protocol Π with input functions f_i(x) = β₀‖x − τ(b_i)‖₂². By the properties of quadratic functions, we have F(x) = Σ_{i=1}^N f_i(x) = β‖x − x*‖₂² + C′, where x* = Σ_{i=1}^N τ(b_i)/N and C′ is a constant. Thus, the output y of Π satisfies ‖y − x*‖₂ ≤ (ε/β)^{1/2}. The coordinator now outputs the closest point t ∈ T to y. We therefore have ‖x* − t‖₂ = ‖x* − y + y − t‖₂ ≤ ‖x* − y‖₂ + ‖y − t‖₂ ≤ 2(ε/β)^{1/2} = (4ε/β)^{1/2}.

A.2 PROOF OF LEMMA 8

Proof of Lemma 8. Let µ denote the distribution on ∏_{i=1}^N {0,1}^D where each D-bit string is selected independently and uniformly at random, and let ζ be the uniform distribution on {0,1}^D. We will prove that D^{1/3}_µ(MEAN^{ε,β}_{d,N}) = Ω(N · ED^{1/3}_ζ(2-BITS_D)). Since ED^{1/3}_ζ(2-BITS_D) = Ω(D) by Lemma 7, the claim then follows by Yao's Lemma.

Suppose now that we have a deterministic protocol Π₁ for MEAN^{ε,β}_{d,N} with worst-case communication cost C and error probability 1/3 on input distribution µ. Given Π₁, we define a 2-player protocol Π₂ with public randomness for 2-BITS_D as follows; assume that Alice is given b ∈ {0,1}^D as input.

(1) Alice and Bob pick a random index i ∈ [N] uniformly at random using the shared randomness. Without loss of generality, we may assume that the picked node is node i = 1.
(2) Alice and Bob simulate protocol Π₁, with Alice simulating node 1 and Bob simulating the coordinator and nodes 2, 3, ..., N. For the inputs b₁, b₂, ..., b_N to Π₁, Alice sets b₁ = b, and Bob selects the inputs b₂, b₃, ..., b_N uniformly at random using the public randomness. Messages Π₁ sends between the coordinator and node 1 are communicated between Alice and Bob, and all other communication is simulated by Bob internally.
(3) Once the simulation is complete, Bob knows the output t ∈ T of Π₁, which satisfies ‖t − z*‖₂ ≤ (ε/β)^{1/2}, where z* = Σ_{i=1}^N τ(b_i)/N.

As the final step, we show that Bob can now recover Alice's input from t. Let y = Σ_{i=2}^N τ(b_i)/(N − 1) be the average of the points τ(b₂), τ(b₃), ..., τ(b_N). A simple calculation shows that Nz* − (N − 1)y = τ(b₁). Since ‖t − z*‖₂ ≤ (ε/β)^{1/2}, it follows that

‖(Nt − (N − 1)y) − τ(b₁)‖₂ = ‖Nt − Nz* + Nz* − (N − 1)y − τ(b₁)‖₂ = ‖Nt − Nz*‖₂ = N‖t − z*‖₂ ≤ N(ε/β)^{1/2} .

Since the distance between any two points in S is at least 3N(ε/β)^{1/2}, τ(b₁) is the only point of S within distance N(ε/β)^{1/2} of Nt − (N − 1)y. As Bob knows both t and τ(b₂), τ(b₃), ..., τ(b_N) after the simulation, he can recover the point τ(b₁) and thus infer Alice's input.

Now let us analyse the expected cost of Π₂ under input distribution ζ. First, observe that since the simulation runs Π₁ on input distribution µ, the output t is correct with probability 2/3, and thus the output of Π₂ is correct with probability 2/3. Now let C_{Π₁} be the worst-case communication cost of Π₁, and let C_{Π₁}(b₁, ..., b_N) and C_{Π₁,i}(b₁, ..., b_N) denote the total communication cost and the communication used by node i in Π₁ on input (b₁, ..., b_N), respectively. Finally, let C_{Π₂}(b, r) be a random variable giving the communication cost of Π₂ on input b and random bits r.

Now we have that

E_{b₁,r}[C_{Π₂}(b₁, r)]
 = Σ_{b₁∈{0,1}^D} 2^{−D} E_r[C_{Π₂}(b₁, r)]
 = Σ_{b₁∈{0,1}^D} 2^{−D} Σ_{b₂,...,b_N} (1/(N 2^{(N−1)D})) Σ_{i=1}^N C_{Π₁,i}(b₁, ..., b_N)
 = (1/N) Σ_{b₁,...,b_N} 2^{−ND} Σ_{i=1}^N C_{Π₁,i}(b₁, ..., b_N)
 = (1/N) Σ_{b₁,...,b_N} 2^{−ND} C_{Π₁}(b₁, ..., b_N)
 ≤ (1/N) Σ_{b₁,...,b_N} 2^{−ND} C_{Π₁} = C_{Π₁}/N .

Since E_{b₁,r}[C_{Π₂}(b₁, r)] ≥ ED^{1/3}_ζ(2-BITS_D), and the argument holds for any protocol Π₁ solving MEAN^{ε,β}_{d,N} with error probability 1/3, we have that D^{1/3}_µ(MEAN^{ε,β}_{d,N}) ≥ N · ED^{1/3}_ζ(2-BITS_D), completing the proof.
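The recovery step in the proof above is just arithmetic on averages: an additive error of (ε/β)^{1/2} in the mean is amplified by at most a factor of N when the first point is reconstructed. A tiny numeric illustration (the points and the error magnitude are made up for illustration):

```python
# Bob's recovery step: from an approximate mean t of N points and the exact
# remaining N-1 points, reconstruct the first point as N*t - (N-1)*y.
N = 4
pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # stand-ins for tau(b_i)
z_star = [sum(p[k] for p in pts) / N for k in range(2)]    # exact mean z*
t = [z_star[0] + 0.01, z_star[1] - 0.01]                   # approximate output of Pi_1
y = [sum(p[k] for p in pts[1:]) / (N - 1) for k in range(2)]
rec = [N * t[k] - (N - 1) * y[k] for k in range(2)]        # ~ tau(b_1)
# the error ||rec - tau(b_1)|| is exactly N times the error of t:
err = sum((rec[k] - pts[0][k]) ** 2 for k in range(2)) ** 0.5
assert err <= N * 0.02
```

Because the points of S are separated by at least 3N(ε/β)^{1/2}, this amplified error still identifies τ(b₁) uniquely.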

B LOWER BOUND FOR NON-CONVEX FUNCTIONS, FULL VERSION

We now show a simple lower bound for optimisation over non-convex objective functions. We reduce from N-player set disjointness over a universe of size d, denoted DISJ_{d,N}: each player i is given an input b_i ∈ {0,1}^d, and the coordinator needs to output 0 if there is a coordinate ℓ ∈ [d] such that b_i(ℓ) = 1 for all i ∈ [N], and 1 otherwise.

Theorem 12 (Braverman et al. (2013)). For δ > 0, N ≥ 1 and d = ω(log N), the randomised communication complexity of set disjointness is RCC_δ(DISJ_{d,N}) = Ω(Nd).

Again consider, for fixed ε, d and β, the set S given by Lemma 2 with δ = 2ε/β. This gives a set S of size at least (βd^{1/2}/(2Cε))^d = exp(Ω(d log(βd/ε))). Let us identify the points in S with indices in [|S|]. For a binary string b ∈ {0,1}^{|S|}, define the function f_b by

f_b(x) = β‖x − s‖₂ if ‖x − s‖₂ < ε/β for some s ∈ S with b_s = 1, and f_b(x) = ε otherwise.

Since the distance between any two points in S is at least 2ε/β, the functions f_b are well-defined, continuous and β-Lipschitz. By definition, Π can be used to distinguish between the two cases, and thus to solve set disjointness.
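The hard instances f_b are simple to write down explicitly. The sketch below abstracts the packing set S as an explicit list of points; the points, β and ε are illustrative, not the actual packing of Lemma 2:

```python
import math

def make_f_b(S, b, beta, eps):
    """f_b(x) = beta*||x - s||_2 if x lies within eps/beta of some s in S
    with b_s = 1, and eps otherwise (the cone only dips where b marks a 1)."""
    r = eps / beta  # radius at which beta*||x - s|| reaches eps (continuity)
    def f_b(x):
        for s, bit in zip(S, b):
            dist = math.sqrt(sum((xi - si) ** 2 for xi, si in zip(x, s)))
            if bit == 1 and dist < r:
                return beta * dist
        return eps
    return f_b
```

Since the points of S are at least 2ε/β apart, the cones are disjoint, so the first matching point (if any) is the only one, and the function is well-defined and β-Lipschitz.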

C DESCRIPTION AND ANALYSIS OF THE UPPER BOUND, FULL VERSION

We now describe our deterministic upper bound in detail. Our algorithm uses quantised gradient descent, loosely following the outline of Magnússon et al. (2019). However, there are two crucial differences. First, we use a carefully calibrated instance of the quantisation scheme of Alistarh et al. (2020) to remove a log d factor from the communication cost, and second, we use two-step quantisation to avoid all-to-all communication.

Preliminaries on gradient descent. We assume that the input functions f_i : [0,1]^d → R are α₀-strongly convex and β₀-smooth. This implies that F = Σ_{i=1}^N f_i is α-strongly convex and β-smooth for α = Nα₀ and β = Nβ₀. Consequently, the functions f_i and F have condition number bounded by κ = β/α. Furthermore, we assume that the local functions f_i all have minimum value inf_{x∈[0,1]^d} f_i(x) = 0, and thus range [0, β₀d]. Gradient descent optimises the sum Σ_{i=1}^N f_i(x) by starting from an arbitrary point x^(0) ∈ [0,1]^d and applying the update rule x^(t+1) = x^(t) − γ Σ_{i=1}^N ∇f_i(x^(t)).

C.2 ANALYSIS

For simplicity, we split the analysis into two parts: the first describes and analyses the algorithm in an abstract way, and the second describes the details of implementing it in the coordinator model. For technical convenience, we assume κ ≥ 2; for smaller condition numbers, we can run the algorithm with κ = 2.

Convergence. Let γ = β^{-1}, and let x^(0) ∈ [0,1]^d, q^(0) ∈ R^d and q^(0)_i ∈ R^d for i = 1, 2, ..., N be arbitrary initial values. From the algorithm description, we see that the update rule for our quantised gradient descent is

x^(t+1) = x^(t) − γq^(t) ,
q^(t+1)_i = Q(∇f_i(x^(t+1)), q^(t)_i, R^(t+1)/N, δR^(t+1)/(2N)) ,
r^(t+1) = Σ_{i=1}^N q^(t+1)_i ,
q^(t+1) = Q(r^(t+1), q^(t), (1 + δ/2)R^(t+1), δR^(t+1)/2) .

Lemma 14.
The inequalities

‖x^(t) − x*‖₂ ≤ µ^t W , (Q1)
‖∇f_i(x^(t)) − q^(t)_i‖₂ ≤ δR^(t)/(2N) , (Q2)
‖∇F(x^(t)) − q^(t)‖₂ ≤ δR^(t) (Q3)

hold for all t, assuming that they hold for x^(0), q^(0) and q^(0)_i for i = 1, 2, ..., N.

Proof. We apply induction on t: we assume that all three inequalities hold for t, and prove that they also hold for t + 1. Since we assume the inequalities hold for t = 0, the base case is trivial.

Convergence (Q1): We have

‖x^(t+1) − x*‖₂ = ‖x^(t) − γq^(t) + γ∇F(x^(t)) − γ∇F(x^(t)) − x*‖₂
 ≤ ‖γq^(t) − γ∇F(x^(t))‖₂ + ‖(x^(t) − γ∇F(x^(t))) − x*‖₂
 ≤ γ‖∇F(x^(t)) − q^(t)‖₂ + ξ‖x^(t) − x*‖₂
 ≤ β^{-1}δR^(t) + ξµ^t W = β^{-1}δβKµ^t W + ξµ^t W = (δK + ξ)µ^t W = µ^{t+1} W .

Local quantisation (Q2): First, observe that to prove that (Q2) holds for t + 1, it is sufficient to show ‖∇f_i(x^(t+1)) − q^(t)_i‖₂ ≤ R^(t+1)/N, as the claim then follows from the definition of q^(t+1)_i and Corollary 10. We have

‖∇f_i(x^(t+1)) − q^(t)_i‖₂ = ‖∇f_i(x^(t+1)) − ∇f_i(x^(t)) + ∇f_i(x^(t)) − q^(t)_i‖₂
 ≤ ‖∇f_i(x^(t+1)) − ∇f_i(x^(t))‖₂ + ‖∇f_i(x^(t)) − q^(t)_i‖₂
 ≤ β₀‖x^(t+1) − x^(t)‖₂ + δR^(t)/N
 ≤ β₀(‖x^(t+1) − x*‖₂ + ‖x^(t) − x*‖₂) + δR^(t)/N
 ≤ 2β₀µ^t W + δR^(t)/N = 2βµ^t W/N + δβKµ^t W/N = (2/K + δ)Kβµ^t W/N
 = (ξ + δ)Kβµ^t W/N ≤ (ξ + δK)Kβµ^t W/N = Kβµ^{t+1} W/N = R^(t+1)/N .

Global quantisation (Q3):

To prove (Q3), we start by giving two auxiliary inequalities. First, we prove that ‖∇F(x^(t+1)) − r^(t+1)‖₂ ≤ δR^(t+1)/2:

‖∇F(x^(t+1)) − r^(t+1)‖₂ = ‖Σ_{i=1}^N ∇f_i(x^(t+1)) − Σ_{i=1}^N q^(t+1)_i‖₂
 ≤ Σ_{i=1}^N ‖∇f_i(x^(t+1)) − q^(t+1)_i‖₂ ≤ N · δR^(t+1)/(2N) = δR^(t+1)/2 .

Next, we want to prove ‖r^(t+1) − q^(t+1)‖₂ ≤ δR^(t+1)/2. Again, it is sufficient to show ‖r^(t+1) − q^(t)‖₂ ≤ (1 + δ/2)R^(t+1), as the claim then follows from the definition of q^(t+1) and Corollary 10. We have

‖r^(t+1) − q^(t)‖₂ = ‖r^(t+1) − ∇F(x^(t+1)) + ∇F(x^(t+1)) − ∇F(x^(t)) + ∇F(x^(t)) − q^(t)‖₂
 ≤ ‖r^(t+1) − ∇F(x^(t+1))‖₂ + ‖∇F(x^(t+1)) − ∇F(x^(t))‖₂ + ‖∇F(x^(t)) − q^(t)‖₂
 ≤ δR^(t+1)/2 + β‖x^(t+1) − x^(t)‖₂ + δR^(t)
 ≤ δR^(t+1)/2 + R^(t+1) = (1 + δ/2)R^(t+1) ,

where the last inequality follows from the argument used in the proof of (Q2). Finally, putting things together, we have

‖∇F(x^(t+1)) − q^(t+1)‖₂ ≤ ‖∇F(x^(t+1)) − r^(t+1)‖₂ + ‖r^(t+1) − q^(t+1)‖₂ ≤ δR^(t+1)/2 + δR^(t+1)/2 = δR^(t+1) ,

completing the proof.

Lemma 15. For any ε > 0 and t ≥ 2κ log(W/ε), we have ‖x^(t) − x*‖₂ ≤ ε.

Proof. By Lemma 14, we have ‖x^(t) − x*‖₂ ≤ µ^t W = (1 − (1 − µ))^t W ≤ e^{−(1−µ)t} W. Assuming t ≥ (1/(1−µ)) log(W/ε), we have e^{−(1−µ)t} W ≤ e^{−log(W/ε)} W = (ε/W)W = ε. The claim follows by observing that 1/(1−µ) = 2κ by definition.

Communication cost. Finally, we analyse the distributed implementation described at the beginning of this section and bound its total communication cost. Recall that we assume that the parameters α and β are known to all nodes, so the parameters of the quantised gradient descent can be computed locally, and we use W = d^{1/2}. Note that W is the only parameter depending on the input domain, so the algorithm also applies to an arbitrary convex domain D ⊆ R^d by setting W to be the diameter of D. Since δ < 1, we have by Corollary 10 that each of the messages sent by the nodes has length at most O(d log δ^{-1}) bits.
Assuming κ ≥ 2, we have log δ^{-1} = log(4κ/(1 − κ^{-1})) ≤ log 8κ. Since the nodes send a total of 2N messages of O(d log κ) bits each per round, the total communication cost of a single round is O(Nd log κ) bits. To get F(x^(T)) − F(x*) ≤ ε, it suffices to have ‖x^(T) − x*‖₂ ≤ (ε/β)^{1/2}. By Lemma 15, selecting T = O(κ log(βW/ε)) is sufficient. Finally, using W = O(d^{1/2}), we have that the total communication cost of the optimisation is O(Ndκ log κ log(βd/ε)) bits. For the transmission of the local function values f_i(x^(T)), there are at most (β₀d + 1)N/ε possible values, so each node needs to send O(log(βd/ε)) bits.
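The parameter relations used throughout the analysis (µ = 1 − 1/(2κ), the bound on δ^{-1}, and the assumptions ξ ≥ 1/2, K ≥ 1) can be sanity-checked numerically:

```python
import math

def params(kappa):
    """Parameters of the quantised GD analysis for condition number kappa."""
    xi = 1 - 1 / kappa          # contraction factor of exact gradient descent
    K = 2 / xi
    delta = xi * (1 - xi) / 4   # relative quantisation accuracy
    mu = delta * K + xi         # contraction factor with quantisation error
    return xi, K, delta, mu

for kappa in [2, 10, 1000]:
    xi, K, delta, mu = params(kappa)
    # mu = (1 + xi)/2 = 1 - 1/(2*kappa), so 1/(1 - mu) = 2*kappa (used in Lemma 15)
    assert math.isclose(mu, 1 - 1 / (2 * kappa))
    assert xi >= 0.5 and K >= 1 and mu < 1
    # 1/delta = 4*kappa/(1 - 1/kappa) <= 8*kappa, so messages cost O(d log kappa)
    assert 1 / delta <= 8 * kappa + 1e-9
```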



The constant hidden by Ω(1) in the parameter dependency is at most πe < 8.6 in all lower bounds.



Let D = log|S| = Θ(d log(βd/(Nε))). Again, for convenience, assume 2^D = |S|, and identify each binary string b ∈ {0,1}^D with an element τ(b) ∈ S.

Definition 5. Given parameters N, d, ε, β, we define the problem MEAN^{ε,β}_{d,N} as follows:
- The node inputs are from {0,1}^D, and
- valid outputs for input (b₁, b₂, ..., b_N) are points t ∈ T that satisfy ‖x* − t‖₂ ≤ (ε/β)^{1/2}, where x* = Σ_{i=1}^N τ(b_i)/N is the average over the inputs.




Given parameters N, d, ε and β satisfying dβ/ε = Ω(1) and (βd^{1/2}/(2Cε))^d = ω(log N), any protocol solving (1) with error probability δ > 0 when the inputs are guaranteed to be functions f_b for b ∈ {0,1}^{|S|} has communication complexity N exp(Ω(d log(βd/ε))).

Proof. Assume there is a protocol Π with the properties stated in the claim, and worst-case communication cost C_Π. We now show that we can use Π to solve set disjointness over a universe of size |S| with C_Π total communication, which implies

C_Π ≥ RCC_δ(DISJ_{|S|,N}) = N exp(Ω(d log(βd/ε))) ,

yielding the claim. First, we observe that if b₁, b₂, ..., b_N ∈ {0,1}^{|S|} all contain a 1 in some position s, then Σ_{i=1}^N f_{b_i}(s) = 0, and thus inf_{x∈[0,1]^d} Σ_{i=1}^N f_{b_i}(x) = 0. Otherwise, for any point x ∈ [0,1]^d, consider the closest point s ∈ S to x: there is at least one b_i with b_i(s) = 0, and for that function f_{b_i}(x) = ε by definition. Thus, if b₁, b₂, ..., b_N are a YES-instance for set disjointness, then inf_{x∈[0,1]^d} Σ_{i=1}^N f_{b_i}(x) ≥ ε, and if b₁, b₂, ..., b_N are a NO-instance, then inf_{x∈[0,1]^d} Σ_{i=1}^N f_{b_i}(x) = 0.

Comparison of existing lower bounds on total communication required to solve (1). 'BC' denotes results for broadcast model, and 'MP' denotes results for message-passing model. Note that lower bounds for the broadcast model also apply to the message-passing model.

Upper bounds for distributed optimisation over β-smooth, α-strongly convex input functions with condition number κ. 'BC' denotes results for broadcast model, and 'MP' denotes results for message-passing model. Note that upper bounds for the message-passing model also apply to the broadcast model, but not vice versa.


Here γ > 0 is the step-size parameter. Let x* denote the global minimiser of F. We use the following standard result on the convergence of gradient descent; see e.g. Bubeck (2015).

Theorem 13. For γ = β^{-1}, we have ‖x^(t+1) − x*‖₂ ≤ (1 − κ^{-1}) ‖x^(t) − x*‖₂.

Preliminaries on quantisation. To compress the gradients the nodes send to the coordinator, we use the recent quantisation scheme of Alistarh et al. (2020). Whereas the original scheme selects the quantisation point randomly to obtain an unbiased estimator, we use a deterministic version that picks an arbitrary feasible quantisation point (e.g. the closest one). This yields the guarantees stated in Corollary 10.
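The contraction of Theorem 13 is easy to check numerically on a simple quadratic (the instance below is illustrative): with step size 1/β, each GD step shrinks the distance to the optimum by at least the factor 1 − κ^{-1}.

```python
# F(x) = (alpha/2)*x1^2 + (beta/2)*x2^2 is alpha-strongly convex and beta-smooth,
# with minimiser x* = (0, 0); gradient descent with step 1/beta contracts
# ||x - x*|| by (1 - 1/kappa) per step.
alpha, beta = 1.0, 4.0
kappa = beta / alpha
grad = lambda x: [alpha * x[0], beta * x[1]]
x = [1.0, 1.0]
for _ in range(10):
    g = grad(x)
    x_next = [x[0] - g[0] / beta, x[1] - g[1] / beta]
    d = (x[0] ** 2 + x[1] ** 2) ** 0.5
    d_next = (x_next[0] ** 2 + x_next[1] ** 2) ** 0.5
    assert d_next <= (1 - 1 / kappa) * d + 1e-12   # Theorem 13 contraction
    x = x_next
```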

C.1 ALGORITHM DESCRIPTION

We now describe the algorithm and give an overview of its guarantees. We assume that the constants α and β are known to all nodes, so the parameters of the quantised gradient descent can be computed locally, and take W to be an upper bound on the diameter of the convex domain D, e.g. W = d^{1/2} if D = [0,1]^d. We assume that the initial iterate x^(0) is arbitrary but the same at all nodes, and set the initial quantisation estimates q^(0) and q^(0)_i at each node i to the origin.

The algorithm proceeds in rounds t = 1, 2, ..., T. At the beginning of round t + 1, each node i knows the values of the iterate x^(t), the global quantisation estimate q^(t), and its local quantisation estimate q^(t)_i. We define the following parameters for the algorithm. Let γ = β^{-1} and ξ = 1 − κ^{-1} be the step size and convergence rate of gradient descent, and let W be such that ‖x^(0) − x*‖₂ ≤ W. We define

K = 2/ξ ,  δ = ξ(1 − ξ)/4 ,  µ = δK + ξ ,  R^(t) = βKWµ^t .

Assuming κ ≥ 2, we have µ < 1, ξ ≥ 1/2 and K ≥ 1. At step t, the nodes perform the following steps:

(1) Each node i updates its iterate as x^(t+1) = x^(t) − γq^(t).
(2) Each node i computes its local gradient at x^(t+1) and transmits it in quantised form to the coordinator as follows. Let ε₁ = δR^(t+1)/(2N) and ρ₁ = R^(t+1)/N.
  (a) Node i computes ∇f_i(x^(t+1)) locally, and sends the message m_i = enc_{ε₁,ρ₁}(∇f_i(x^(t+1))) to the coordinator.
  (b) The coordinator receives the messages m_i for i = 1, 2, ..., N, and decodes them as q^(t+1)_i = dec_{ε₁,ρ₁}(q^(t)_i, m_i). The coordinator then computes r^(t+1) = Σ_{i=1}^N q^(t+1)_i.
(3) The coordinator sends the quantised sum of gradients to all other nodes as follows. Let ε₂ = δR^(t+1)/2 and ρ₂ = (1 + δ/2)R^(t+1).
  (a) The coordinator sends the message m = enc_{ε₂,ρ₂}(r^(t+1)) to each node i.
  (b) Each node decodes the coordinator's message as q^(t+1) = dec_{ε₂,ρ₂}(q^(t), m).

After round T, all nodes know the final iterate x^(T).
The nodes compute their local value f i (x (T ) ), and send an approximate value to the coordinator; specifically, each node computes a partition of the range [0, β 0 d] into segments of length ε/N , and sends the index of the smallest segment endpoint r satisfying r ≥ f i (x (T ) ) to the coordinator.
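The round structure above can be sketched as a single-process simulation. The coordinate-wise rounding below is a simplified stand-in for the lattice quantiser of Corollary 10 (it ignores the estimate q and pays extra log d factors, but has the same accuracy interface), and the toy instance at the end is illustrative, not part of the paper's construction:

```python
import math

def q_round(x, eps, d):
    """Stand-in quantiser: deterministic rounding onto a grid of spacing
    2*eps/sqrt(d), so the l2 rounding error is at most eps."""
    h = 2 * eps / math.sqrt(d)
    return [round(xi / h) * h for xi in x]

def quantised_gd(grads, x0, alpha, beta, W, T):
    """Simulate T rounds of the two-step quantised GD of Appendix C."""
    N, d = len(grads), len(x0)
    kappa = max(beta / alpha, 2)        # run with kappa = 2 if better conditioned
    xi = 1 - 1 / kappa
    K, delta = 2 / xi, xi * (1 - xi) / 4
    mu = delta * K + xi
    x = list(x0)
    q = [0.0] * d                       # global quantised gradient q^(t)
    for t in range(1, T + 1):
        R = beta * K * W * mu ** t      # shrinking quantisation radius R^(t)
        x = [xj - qj / beta for xj, qj in zip(x, q)]                    # step (1)
        local = [q_round(g(x), delta * R / (2 * N), d) for g in grads]  # step (2)
        r = [sum(col) for col in zip(*local)]       # coordinator sums gradients
        q = q_round(r, delta * R / 2, d)            # step (3): broadcast estimate
    return x

# Toy instance: f_i(x) = ||x - c_i||^2, so F is minimised at the mean of the c_i.
cs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grads = [(lambda x, c=c: [2 * (xj - cj) for xj, cj in zip(x, c)]) for c in cs]
x = quantised_gd(grads, [0.0, 0.0], alpha=2 * len(cs), beta=2 * len(cs),
                 W=math.sqrt(2), T=30)
```

Since the radius R^(t) decays geometrically, the per-round quantisation error shrinks at the same rate as the iterates converge, which is exactly the balance the analysis of Lemma 14 formalises.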

