IMPROVED COMMUNICATION LOWER BOUNDS FOR DISTRIBUTED OPTIMISATION

Abstract

Motivated by the interest in communication-efficient methods for distributed machine learning, we consider the communication complexity of minimising a sum of $d$-dimensional functions $\sum_{i=1}^N f_i(x)$, where each function $f_i$ is held by one of $N$ different machines. Such tasks arise naturally in large-scale optimisation, where a standard solution is to apply variants of (stochastic) gradient descent. As our main result, we show that $\Omega(Nd \log(d/\varepsilon))$ bits in total need to be communicated between the machines to find an additive $\varepsilon$-approximation to the minimum of $\sum_{i=1}^N f_i(x)$. The result holds for deterministic algorithms, and for randomised algorithms under some restrictions on the parameter values. Importantly, our lower bounds require no assumptions on the structure of the algorithm, and are matched within constant factors for strongly convex objectives by a new variant of quantised gradient descent. The lower bounds are obtained by bringing over tools from communication complexity to distributed optimisation, an approach we hope will find further use in future.

1. INTRODUCTION

The ability to distribute the processing of large-scale data across several computing nodes has been one of the main technical enablers of recent progress in machine learning, and the last decade has seen significant research effort dedicated to efficient distributed optimisation. One specific area of interest is communication reduction for distributed machine learning. Recently, several algorithms have been proposed to reduce the communication footprint of popular methods in machine learning and optimisation, in particular gradient descent and stochastic gradient descent; see e.g. Arjevani & Shamir (2015); Alistarh et al. (2017); Suresh et al. (2017); Tang et al. (2019) for recent work, and Ben-Nun & Hoefler (2019) for a survey. Despite this extensive work, less is known about the theoretical limits of the communication complexity of optimisation, especially in terms of lower bounds on the minimal number of bits which machines need to transmit to jointly solve an optimisation problem.

In this paper, we study this question in a classical distributed optimisation setting, where data is split among $N$ machines that can communicate by sending point-to-point messages to each other. Given input dimension $d$ and a domain $D \subseteq \mathbb{R}^d$, each machine $i$ is given an input function $f_i \colon D \to \mathbb{R}$, and the machines need to jointly minimise the sum $\sum_{i=1}^N f_i(x)$, e.g. the empirical risk, with either deterministic or probabilistic guarantees on the output. This setting is a standard way to model the distributed training of machine learning models. For instance, if the individual loss functions are assumed to be (strongly) convex, we can model a classic regression setting, whereas if the functions are non-convex, we can model distributed training of deep neural networks.
In this context, the key question is: what is the minimal number of bits which need to be exchanged for this optimisation procedure to be successful, and how does this number depend on the properties of the functions f i , and the parameters N and d?
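To fix intuition for this question, the following toy simulation (our illustration, not an algorithm from this paper; the function `local_grad`, the 32-bit-per-float accounting, and all parameter values are our assumptions) runs parameter-server gradient descent over quadratic local losses and tallies the bits exchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 3                      # machines and dimension (toy values)
beta0 = 1.0                      # per-machine quadratic coefficient
minima = rng.random((N, d))      # machine i holds f_i(x) = beta0 * ||x - minima[i]||^2

def local_grad(i, x):
    """Gradient of machine i's local quadratic, computed from its private data."""
    return 2.0 * beta0 * (x - minima[i])

x = np.full(d, 0.5)              # coordinator's current iterate
bits_per_float = 32
bits_sent = 0
for _ in range(100):             # parameter-server gradient descent
    grads = [local_grad(i, x) for i in range(N)]          # each machine sends d floats
    bits_sent += N * d * bits_per_float
    x -= (1.0 / (2 * beta0 * N)) * np.sum(grads, axis=0)  # 1/L step for the sum's Hessian 2*beta0*N*I
    bits_sent += N * d * bits_per_float                   # coordinator broadcasts the new iterate

# For a sum of identical-coefficient quadratics, the minimiser is the mean of the local minima.
assert np.allclose(x, minima.mean(axis=0))
```

With full-precision gradients the cost here is `2 * N * d * 32` bits per round; communication-reduction schemes shrink the per-round factor, while the lower bounds below constrain the total over any scheme.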

1.1. OUR RESULTS

Setting. We consider this question in the classic message-passing model, where $N$ nodes communicate by sending messages to each other; specifically, each message is sent to a single receiver and not seen by the other nodes. Our complexity measure is the total number of bits sent by all the nodes. Given this complexity measure, the model is equivalent (up to a constant factor) to a model where all messages are relayed via a special coordinator node, known in communication complexity as the coordinator model and in machine learning as the parameter server model (Li et al., 2014). For convenience of presentation, we set $D = [0,1]^d$, and consider a problem where each node $i$ is given an input function $f_i \colon [0,1]^d \to \mathbb{R}$, and the task is to approximate the minimum of the sum of the functions. That is, the coordinator needs to output $z \in [0,1]^d$ and an estimate $r \in \mathbb{R}$ for the minimum function value such that

$$\sum_{i=1}^N f_i(z) \le \inf_{x \in [0,1]^d} \sum_{i=1}^N f_i(x) + \varepsilon \qquad \text{and} \qquad \sum_{i=1}^N f_i(z) \le r \le \sum_{i=1}^N f_i(z) + \varepsilon. \tag{1}$$

Specifically, this models a standard distributed machine learning setting where we require one of the nodes to return the optimised final model, as well as the final value of the loss function. When proving lower bounds, we allow the nodes to compute arbitrary values that depend on the input functions and to operate on real numbers; only the number of communicated bits is limited. The precise definition is somewhat subtle, so we defer the details of the model to Section 2.

Lower bounds for convex functions. We show that, even if the input functions $f_i$ at the nodes are promised to be quadratic functions $x \mapsto \beta_0 \|x - x^*\|_2^2$ for a constant $\beta_0 > 0$, finding a solution satisfying (1) deterministically requires $\Omega(Nd \log \frac{\beta d}{\varepsilon})$ total bits to be communicated, where $\beta = \beta_0 N$ is the smoothness parameter of $\sum_{i=1}^N f_i$, for parameters satisfying¹ $\beta d/\varepsilon = \Omega(1)$.
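For the quadratic input family used in the lower bound, both requirements of (1) can be checked exactly, because the unconstrained minimiser of the sum is the mean of the local minima, which stays inside $[0,1]^d$. A minimal sketch (all names and parameter values are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, beta0, eps = 3, 2, 1.0, 1e-6
minima = rng.random((N, d))          # f_i(x) = beta0 * ||x - x_i*||^2 with x_i* in [0,1]^d

def F(x):
    """Value of the sum sum_i f_i(x)."""
    return beta0 * np.sum((x - minima) ** 2)

# The unconstrained minimiser of F is the mean of the x_i*; since every x_i* lies in
# [0,1]^d, so does their mean, hence it is also the minimiser over the domain [0,1]^d.
z = minima.mean(axis=0)
r = F(z)                             # the exact value serves as the estimate

assert F(z) <= F(z) + eps            # first inequality of (1): F(z) <= inf F + eps
assert F(z) <= r <= F(z) + eps       # second inequality of (1)
```

A communication-bounded protocol cannot compute this mean exactly; the lower bound quantifies how many bits are needed to approximate it to accuracy $\varepsilon$.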
For randomised algorithms, we give a lower bound of $\Omega(Nd \log \frac{\beta d}{N\varepsilon})$ total bits, for parameters satisfying $\beta d/(N^2 \varepsilon) = \Omega(1)$. While this lower bound is slightly weaker due to the additional dependence on $N$, in most practical settings the number of parameters $d$ will be significantly larger than the number of machines $N$ multiplied by the error tolerance $\varepsilon$. (Specifically, in most practical settings $N \le 1000$, whereas $\varepsilon \le 10^{-3}$. More generally, it is sufficient that $d = \Omega(N^{2+\delta})$ for constant $\delta > 0$ for the randomised lower bound to match the deterministic one asymptotically.)

At a very high level, our results generalise the idea of Tsitsiklis & Luo (1987) of linking the communication complexity to the number of quadratic functions with distinct minima in the domain. To extend this approach to the multi-node case $N > 2$ and to randomised (stochastic) algorithms, we build connections to results and techniques from communication complexity. To our knowledge, such connections have not been explored in the context of (real-valued) optimisation tasks, despite reductions from communication complexity being a standard lower bound technique e.g. in distributed computing (Das Sarma et al., 2012; Abboud et al., 2016; Drucker et al., 2014). Our work thus provides a model and a basic toolkit for applying communication complexity results to distributed optimisation, which should also be useful for understanding other optimisation tasks.

Extensions. While, for convenience, we work with functions over $[0,1]^d$, our bounds immediately extend to arbitrary convex domains, as long as we can bound the number of functions with distinct minima in the domain. Beyond strongly convex and smooth functions, we also show that for non-convex $\lambda$-Lipschitz functions, solving (1) requires $N \cdot \exp(\Omega(d \log \frac{\lambda d}{\varepsilon}))$ total bits to be communicated.
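The link between bits and the number of functions with distinct minima can be made concrete with a back-of-the-envelope packing estimate (our heuristic illustration, not the paper's proof; all constants are suppressed): for $\beta$-smooth quadratics, minima closer than roughly $\sqrt{\varepsilon/\beta}$ yield sums whose optimal values differ by less than $\varepsilon$, so an $\varepsilon$-accurate protocol must distinguish about $(\beta/\varepsilon)^{d/2}$ candidate minima in $[0,1]^d$, i.e. learn on the order of $\frac{d}{2}\log_2(\beta/\varepsilon)$ bits per machine.

```python
import math

d, beta, eps, N = 1000, 1.0, 1e-3, 10   # toy parameter values (our choice)

# Grid of candidate minima in [0,1]^d with spacing delta ~ sqrt(eps/beta):
# minima further apart than delta must be distinguished by an eps-accurate protocol.
delta = math.sqrt(eps / beta)
points_per_axis = math.floor(1.0 / delta)
log2_candidates = d * math.log2(points_per_axis)   # bits to name one candidate minimum

# Each machine's hidden minimum must effectively be identified: ~N * d * log(beta/eps) bits.
bits_total = N * log2_candidates
print(f"heuristic total: ~{bits_total:.0f} bits "
      f"(Theta(N d log(beta/eps)) up to constants)")
```

Turning this counting heuristic into a rigorous bound for $N > 2$ machines and randomised protocols is exactly where the communication-complexity machinery discussed above comes in.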
The main takeaway from this result is that for non-convex objectives, one can force exponentially higher communication cost by constructing convoluted input families in which the coordinator is essentially required to learn all of the nodes' local input functions.
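To make the convex versus non-convex gap concrete, a quick comparison with all hidden constants set to 1 (our toy numbers, not figures from the paper) contrasts $Nd \log(\beta d/\varepsilon)$ bits with $N \cdot (\lambda d/\varepsilon)^{d}$ bits:

```python
import math

N, d, eps = 10, 20, 1e-2
beta = lam = 1.0   # smoothness / Lipschitz constants (toy values)

# Convex lower bound: Omega(N d log(beta d / eps)) bits.
convex_bits = N * d * math.log2(beta * d / eps)

# Non-convex lower bound: N * exp(Omega(d log(lambda d / eps))) = N * (lambda d / eps)^Omega(d);
# we work on a log2 scale to avoid overflow.
nonconvex_log2 = math.log2(N) + d * math.log2(lam * d / eps)

print(f"convex: ~{convex_bits:.0f} bits; non-convex: ~2^{nonconvex_log2:.0f} bits")
```

Even at a modest dimension $d = 20$, the non-convex bound already dwarfs any feasible communication budget, which is why the convex structure is essential for communication-efficient training.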



¹ The constant hidden by $\Omega(1)$ in the parameter dependency is at most $\pi e < 8.6$ in all lower bounds.

