IMPROVED COMMUNICATION LOWER BOUNDS FOR DISTRIBUTED OPTIMISATION

Abstract

Motivated by the interest in communication-efficient methods for distributed machine learning, we consider the communication complexity of minimising a sum of d-dimensional functions ∑_{i=1}^N f_i(x), where each function f_i is held by one of N different machines. Such tasks arise naturally in large-scale optimisation, where a standard solution is to apply variants of (stochastic) gradient descent. As our main result, we show that Ω(N d log(d/ε)) bits in total need to be communicated between the machines to find an additive ε-approximation to the minimum of ∑_{i=1}^N f_i(x). The result holds for deterministic algorithms, and for randomised algorithms under some restrictions on the parameter values. Importantly, our lower bounds require no assumptions on the structure of the algorithm, and are matched within constant factors for strongly convex objectives by a new variant of quantised gradient descent. The lower bounds are obtained by bringing over tools from communication complexity to distributed optimisation, an approach we hope will find further use in the future.

1. INTRODUCTION

The ability to distribute the processing of large-scale data across several computing nodes has been one of the main technical enablers of recent progress in machine learning, and the last decade has seen significant research effort dedicated to efficient distributed optimisation. One specific area of interest is communication reduction for distributed machine learning. Recently, several algorithms have been proposed to reduce the communication footprint of popular methods in machine learning and optimisation, in particular gradient descent and stochastic gradient descent; see e.g. Arjevani & Shamir (2015); Alistarh et al. (2017); Suresh et al. (2017); Tang et al. (2019) for recent work, and Ben-Nun & Hoefler (2019) for a survey. Despite this extensive work, less is known about the theoretical limits of the communication complexity of optimisation, especially in terms of lower bounds on the minimal number of bits which machines need to transmit to jointly solve an optimisation problem.

In this paper, we study this question in a classical distributed optimisation setting, where data is split among N machines that can communicate by sending point-to-point messages to each other. Given input dimension d and a domain D ⊆ R^d, each machine i is given an input function f_i : D → R, and the machines need to jointly minimise the sum ∑_{i=1}^N f_i(x), e.g. the empirical risk, with either deterministic or probabilistic guarantees on the output. This setting is a standard way to model the distributed training of machine learning models. For instance, if the individual loss functions are assumed to be (strongly) convex, we can model a classic regression setting, whereas if the functions are non-convex, we can model the distributed training of deep neural networks. In this context, the key question is: what is the minimal number of bits which need to be exchanged for this optimisation procedure to be successful, and how does this number depend on the properties of the functions f_i and on the parameters N and d?
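As a toy illustration of this setting (not the paper's algorithm), the sketch below simulates N machines that each hold a strongly convex quadratic f_i(x) = ½‖x − a_i‖², so the minimiser of the sum is the mean of the a_i. The machines minimise the sum by gradient descent while exchanging uniformly quantised gradients, and we count the total number of bits sent. All function names, the quantiser, and the constants here are our own choices for illustration.

```python
import numpy as np

def quantise(v, bits=12, scale=10.0):
    """Uniformly quantise each coordinate of v to `bits` bits on [-scale, scale]."""
    levels = 2 ** bits - 1
    clipped = np.clip(v, -scale, scale)
    idx = np.round((clipped + scale) / (2 * scale) * levels)
    return idx / levels * 2 * scale - scale

def distributed_gd(targets, rounds=200, lr=0.5, bits=12):
    """Minimise sum_i 0.5*||x - a_i||^2 by exchanging quantised gradients.

    Each round, every machine i sends a quantised local gradient
    (d coordinates, `bits` bits each); we tally the bits sent.
    """
    N, d = targets.shape
    x = np.zeros(d)
    bits_sent = 0
    for _ in range(rounds):
        grads = [quantise(x - a, bits) for a in targets]  # one message per machine
        bits_sent += N * d * bits
        x = x - lr * np.mean(grads, axis=0)               # step on the average gradient
    return x, bits_sent

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, size=(4, 3))        # N = 4 machines, d = 3
x_hat, total_bits = distributed_gd(a)
print(np.allclose(x_hat, a.mean(axis=0), atol=1e-2))  # True: converges near the mean
```

With 12-bit quantisation the per-coordinate rounding error is at most scale/(2^bits − 1) ≈ 2.4·10⁻³, so the iterate settles within that distance of the true minimiser while each message costs only d·bits bits instead of full precision.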

1.1. OUR RESULTS

Setting. We consider this question in the classic message-passing model, where N nodes communicate by sending messages to each other; specifically, each message is sent to a single receiver and not seen by the other nodes. Our complexity measure is the total number of bits sent by all the nodes.
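To give a feel for the scale of this complexity measure, the following back-of-the-envelope computation (our own illustration, with constants omitted from the Ω-bound) compares the paper's Ω(N d log(d/ε)) total-communication lower bound with the naive cost of every machine shipping a full-precision 64-bit gradient in every round.

```python
import math

def lower_bound_bits(N, d, eps):
    """Omega(N d log(d/eps)) total-communication lower bound, constants omitted."""
    return N * d * math.log2(d / eps)

def naive_gd_bits(N, d, rounds, bits_per_coord=64):
    """Naive cost: each machine sends a full-precision gradient every round."""
    return N * d * bits_per_coord * rounds

N, d, eps = 16, 1000, 1e-3
print(f"lower bound  ~ {lower_bound_bits(N, d, eps):.2e} bits")
print(f"naive GD, 100 rounds: {naive_gd_bits(N, d, 100):.2e} bits")
```

For these (hypothetical) parameter values the naive scheme is several orders of magnitude above the lower bound, which is the gap that quantised-gradient methods aim to close.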

