DECENTRALIZED SGD WITH ASYNCHRONOUS, LOCAL, AND QUANTIZED UPDATES

Abstract

The ability to scale distributed optimization to large node counts has been one of the main enablers of recent progress in machine learning. To this end, several techniques have been explored, such as asynchronous, decentralized, or quantized communication, which significantly reduce the cost of synchronization, and the ability for nodes to perform several local model updates before communicating, which reduces the frequency of synchronization. In this paper, we show that these techniques, which have so far been considered independently, can be jointly leveraged to minimize distribution cost for training neural network models via stochastic gradient descent (SGD). We consider a setting with minimal coordination: we have a large number of nodes on a communication graph, each with a local subset of data, performing independent SGD updates onto their local models. After some number of local updates, each node chooses an interaction partner uniformly at random from its neighbors, and averages a possibly quantized version of its local model with the neighbor's model. Our first contribution is in proving that, even under such a relaxed setting, SGD can still be guaranteed to converge under standard assumptions. The proof is based on a new connection with parallel load-balancing processes, and improves existing techniques by jointly handling decentralization, asynchrony, quantization, and local updates, and by bounding their impact. On the practical side, we implement variants of our algorithm and deploy them onto distributed environments, and show that they can successfully converge and scale for large-scale image classification and translation tasks, matching or even slightly improving the accuracy of previous methods.

1. INTRODUCTION

Several techniques have recently been explored for scaling the distributed training of machine learning models, such as communication reduction, asynchronous updates, or decentralized execution. For background, consider the classical data-parallel distribution strategy for SGD (Bottou, 2010), with the goal of solving a standard empirical risk minimization problem. Specifically, we have a set of samples $S$, and wish to minimize the $d$-dimensional function $f : \mathbb{R}^d \to \mathbb{R}$, which is the average of losses over samples from $S$, by finding $x^* = \operatorname{argmin}_x \sum_{s \in S} \ell_s(x)/|S|$. We have $n$ compute nodes which can process samples in parallel. In data-parallel SGD, each node computes the gradient for one sample, followed by a gradient exchange. Globally, this leads to the iteration $x_{t+1} = x_t - \eta_t \sum_{i=1}^{n} g^i_t(x_t)$, where $x_t$ is the value of the global parameter, initially $0^d$, $\eta_t$ is the learning rate, and $g^i_t(x_t)$ is the stochastic gradient with respect to the parameter $x_t$, computed by node $i$ at time $t$. When executing this procedure at large scale, two major bottlenecks are communication, that is, the number of bits transmitted by each node, and synchronization, i.e., the fact that nodes need to wait for each other in order to progress to the next iteration. Specifically, to maintain a consistent view of the parameter $x_t$ above, the nodes need to broadcast and receive all gradients, and need to synchronize globally at the end of every iteration. Significant work has been dedicated to removing these two barriers.
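As a rough illustration of the data-parallel iteration above, the following sketch simulates n nodes that each draw one sample, compute a gradient at the shared parameter, and apply the summed update. The scalar quadratic loss $\ell_s(x) = (x - s)^2/2$, the function name data_parallel_sgd, and all parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import random


def data_parallel_sgd(samples, n_nodes, steps, lr, seed=0):
    """Simulate x_{t+1} = x_t - lr * sum_i g_i(x_t) on a toy scalar problem.

    Assumed loss (illustrative only): l_s(x) = 0.5 * (x - s) ** 2, whose
    gradient at x is (x - s). The minimizer of the average loss is the
    mean of the samples.
    """
    rng = random.Random(seed)
    x = 0.0  # global parameter, initialized at zero
    for _ in range(steps):
        # Every node evaluates its stochastic gradient at the SAME global x.
        grads = [x - rng.choice(samples) for _ in range(n_nodes)]
        # Global synchronization point: all gradients are summed and applied.
        x -= lr * sum(grads)
    return x


if __name__ == "__main__":
    print(data_parallel_sgd([1.0, 2.0, 3.0], n_nodes=4, steps=2000, lr=0.01))
```

In this sketch no update can happen until all n gradients for the current step are available, which is the synchronization barrier discussed above.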
In particular, there has been progress on communication-reduced variants of SGD, which propose various gradient compression schemes (Seide et al., 2014; Strom, 2015; Alistarh et al., 2017; Wen et al., 2017; Aji and Heafield, 2017; Dryden et al., 2016; Grubic et al., 2018; Davies et al., 2020) , asynchronous variants, which relax the strict iteration-by-iteration synchronization (Recht et al., 2011; Sa et al., 2015; Duchi et al., 2015) , as well as large-batch or periodic model averaging methods, which aim to reduce the frequency of communication (Goyal et al., 2017; You et al., 2017) and (Chen and Huo, 2016; Stich, 2018) , or decentralized variants, which allow each node to maintain its own, possibly inconsistent, model variant (Lian et al., 2017; Tang et al., 2018; Koloskova et al., 2019) . (We refer the reader to the recent surveys of (Ben-Nun and Hoefler, 2019; Liu and Zhang, 2020) for a detailed discussion.) Using such techniques, it is possible to scale SGD, even for complex objectives such as the training of deep neural networks. However, for modern large-scale models, the communication and synchronization requirements of these parallel variants of SGD can still be burdensome. Contribution. In this paper, we take a further step towards removing these scalability barriers, showing that all the previous scaling techniques-decentralization, quantization, asynchrony, and local steps-can in fact be used in conjunction. We consider a highly decoupled setting with n compute agents, located at vertices of a connected communication graph, each of which can execute sequential SGD on its own local model, based on a fraction of the data. Periodically, after some number of local optimization steps, a node can initiate a pairwise interaction with a uniform random neighbor. 
Our main finding is that this procedure can converge even though the nodes can take several local steps between interactions, may perform asynchronous communication, reading stale versions of each others' models, and may compress data transmission through quantization. However, both in theory and practice, we observe trade-offs between convergence rate and degree of synchronization, in that the algorithm may need to perform additional gradient steps in order to attain a good solution, relative to the sequential baseline. Our algorithm, called SwarmSGD, is decentralized in the sense that each node maintains a local version of the model, and two interacting nodes only see each others' models. We further allow that the data distribution at the nodes may not be i.i.d. Specifically, each node $i$ is assigned a set of samples $S_i$, and maintains its own parameter estimate $x^i$. Each node $i$ performs local SGD steps on its model $x^i$ based on its local data, and then picks a neighbor uniformly at random to share information with, by averaging the two models. (To streamline the exposition, we ignore quantization and model staleness unless otherwise specified.) Effectively, if node $i$ interacts with node $j$, node $i$'s updated model becomes $x^i_{t+1} \leftarrow (x^i_{t,H_i} + x^j_{t,H_j})/2$, where $t$ is the total number of interactions performed by all nodes up to this point, $j$ is the interaction partner of $i$ at step $t + 1$, and the input models $x^i_{t,H_i}$ and $x^j_{t,H_j}$ have been obtained by iterating the SGD step $H_i$ and $H_j$ times, respectively, locally from the previous interaction of either node. We assume that $H_i$ and $H_j$ are random variables with mean $H$, that is, each node performs $H$ local steps in expectation between two communication steps. The update for node $j$ is symmetric, so that the two models match after the averaging step. In this paper, we analyze variants of the above SwarmSGD protocol.
The main intuition behind the algorithm is that the independent SGD steps will allow nodes to explore local improvements to the objective function on their subset of the data, while the averaging steps provide a decentralized way for the models to converge jointly, albeit in a loosely coupled way. We show that, as long as the maximum number of local steps is bounded, this procedure still converges, in the sense that gradients calculated at the average over all models are vanishing as we increase the number of interactions. Specifically, assuming that the n nodes each take a constant number of local SGD steps on average before communicating, we show that SwarmSGD has Θ( √ n) speedup to convergence in the nonconvex case. This matches results from previous work which considered decentralized dynamics but which synchronized upon every SGD step, e.g. (Lian et al., 2017; 2018) . Our analysis also extends to arbitrary regular graph topologies, non-blocking (delayed) averaging of iterates, and quantization. Generally, we show that the impact of decentralization, asynchrony, quantization, and local updates can be asymptotically negligible in reasonable parameter regimes. On the practical side, we show that this algorithm can be mapped to a distributed system setting, where agents correspond to compute nodes, connected by a dense communication topology. Specifically, we apply SwarmSGD to train deep neural networks on image classification and machine translation (NMT) tasks, deployed on the Piz Daint supercomputer (Piz, 2019) . Experiments confirm the intuition that the average synchronization cost of SwarmSGD per iteration is low: it stays around 10% or less of the batch computation time, and remains constant as we increase the number of nodes. 
For example, using SwarmSGD deployed on 16 nodes, we are able to train a Transformer-XL (Vaswani et al., 2017) model on WMT17 (En-Ge) 1.5× faster than a highly-optimized large-batch SGD baseline, and to slightly higher accuracy, without additional hyper-parameter tuning. At the same time, our method appears to be faster and more accurate than the previous practical decentralized methods, e.g. (Lian et al., 2017; 2018; Assran et al., 2018), in the same setting. Importantly, we also note a negative result: in less overparametrized settings such as training residual CNNs (He et al., 2016) on ImageNet (Russakovsky et al., 2015), nodes do need to perform more iterations over the dataset relative to the baseline in order to recover full accuracy. This is predicted by the analysis, and confirms similar findings in previous work (Assran et al., 2018). Overall, however, our family of methods should be well-suited to training very large modern models in large-scale settings, where global synchronization among all nodes is prohibitively expensive.

Related Work. The study of decentralized optimization algorithms dates back to Tsitsiklis (1984), and is related to the study of gossip algorithms for information dissemination (Kempe et al., 2003; Xiao and Boyd, 2004; Boyd et al., 2006). Gossip is usually studied in one of two models (Boyd et al., 2006): synchronous, structured in global rounds, where each node interacts with a randomly chosen neighbor, and asynchronous, where each node wakes up at times given by a local Poisson clock, and picks a random neighbor to interact with. The model we consider can be seen as equivalent to the asynchronous gossip model. The key differences between our work and averaging in the gossip model, e.g. Boyd et al.
(2006), are that 1) we consider local SGD steps, which would not make sense in the case of averaging fixed initial values; and 2) the gossip input model is static (node inputs are fixed, and node estimates must converge to the true mean), whereas we study a dynamic setting, where models are continually updated via SGD. Several optimization algorithms have been analyzed in this setting (Nedic and Ozdaglar, 2009; Johansson et al., 2009; Shamir and Srebro, 2014), while Tang et al. (2018) and Koloskova et al. (2019) analyze quantization in the synchronous gossip model. Lian et al. (2017; 2018) and Assran et al. (2018) considered SGD-type algorithms in gossip-like models. Specifically, they analyze the SGD averaging dynamic in the non-convex setting but do not allow nodes to perform local updates or quantize. In particular, nodes perform pairwise averaging upon every SGD step. Table 2 in the Appendix provides a thorough comparison of assumptions, results, and rates. Their results are phrased in the synchronous gossip model, in which nodes interact in a sequence of perfect matchings, for which they provide $O(1/\sqrt{Tn})$ convergence rates under analytical assumptions. Lian et al. (2018) extend these results to a variant of the gossip model where updates can be performed based on stale information, similarly to our non-blocking extension. Upon careful examination, one can find that their results can be extended to the asynchronous gossip setting we consider, as long as nodes are not allowed to perform local SGD updates to their models (corresponding to $H = 1$) or to quantize communication. Extending the analysis of distributed SGD to allow for local steps is challenging even in centralized models, see for instance Stich (2018). If we assume $H = 1$, our technique yields similar or better bounds relative to previous work in the decentralized model, as our potential analysis is specifically tailored to this dynamic interaction model. For instance, for Assran et al. (2018), the speedup with respect to the number of nodes depends on a parameter $C$, which in turn depends on 1) the dimension $d$ of the objective function, 2) the number of iterations for the graph given by edge sets of all matrices used in averaging to be connected, and 3) the diameter of the aforementioned connected graph. In the dynamic interaction model we consider, the parameter $C$ will be at least linear in the number of nodes $n$, which will eliminate any speedup. We present a systematic comparison in Appendix B.

In sum, relative to prior work on decentralized algorithms, our contributions are as follows. We are the first to consider the impact of local updates, asynchrony, and quantization in conjunction with decentralized SGD. We show that the cost for the linear reduction in communication in $H$ given by local steps is at worst a squared variance increase in the parameter $H$. Our analysis technique relies on a fine-grained analysis of individual interactions, which is different from that of previous work, and can yield improved bounds even in the case where $H = 1$.
By leveraging the lattice-based quantization scheme of Davies et al. (2020) , we also allow for communication-compression. From the implementation perspective, the performance of our algorithm is superior to that of previous methods, notably D-PSGD (Lian et al., 2017) , AD-PSGD (Lian et al., 2018) and SGP (Assran et al., 2018) , mainly due to the ability to take local steps. Wang and Joshi (2018) and Koloskova et al. (2020) provide analysis frameworks for the synchronous version of decentralized SGD with local updates, and possibly changing topologies. This is a different setting from ours, since it requires each agent to take an equal number of gradient steps before every interaction round, and therefore does not allow for agents to progress at different speeds (asynchrony). Further, we support quantization, and validate our analysis at scale.

2. PRELIMINARIES

The Distributed System Model. We consider a model which consists of $n \ge 2$ anonymous agents, or nodes, each of which is able to perform local computation. We assume that the communication network of the nodes is an $r$-regular graph $G$ with spectral gap $\lambda_2$, which denotes the second smallest eigenvalue of the Laplacian of $G$. This choice of communication topology models supercomputing and cloud networks, which tend to be regular, densely connected and low-diameter, mimicking regular expanders (Kim et al., 2008; Besta and Hoefler, 2014). The execution proceeds in discrete steps, where in each step we sample an edge of the graph $G$ uniformly at random and we allow the agents corresponding to the edge endpoints to interact. Each of the two chosen agents updates its state according to a state update function, specified by the algorithm. The basic unit of time is a single pairwise interaction between two nodes. Notice however that in a real system $\Theta(n)$ of these interactions could occur in parallel. Thus, a standard global measure is parallel time, defined as the total number of interactions divided by $n$, the number of nodes. Parallel time intuitively corresponds to the average number of interactions per node to convergence. We note that our model is virtually identical to the population model of distributed computing (Angluin et al., 2006), or to asynchronous gossip models (Xiao and Boyd, 2004).

Stochastic Optimization. We assume that the agents wish to minimize a $d$-dimensional, differentiable function $f : \mathbb{R}^d \to \mathbb{R}$. Specifically, we will assume the empirical risk minimization setting, in which agents are given access to a set of $m$ data samples $S = \{s_1, \ldots, s_m\}$ coming from some underlying distribution $D$, and to functions $\ell_i : \mathbb{R}^d \to \mathbb{R}$ which encode the loss of the argument at the sample $s_i$. The goal of the agents is to converge on a model $x^*$ which minimizes the empirical loss over the $m$ samples, that is

$x^* = \operatorname{argmin}_x f(x) = \operatorname{argmin}_x (1/m) \sum_{i=1}^{m} \ell_i(x)$. (2)

In this paper, we assume that the agents employ these samples to run a decentralized variant of SGD, described in detail in the next section. For this, we will assume that each agent $i$ has access to stochastic gradients $g_i$ of the function $f$, which are functions such that $E[g_i(x)] = \nabla f(x)$. Stochastic gradients can be computed by each agent by sampling i.i.d. from the distribution $D$, and computing the gradient of $f$ at $x$ with respect to that sample. (Our analysis can be extended to the case where each agent is sampling from its own partition of data, see Section H in the Appendix.) We will assume the following conditions about the objective function (one of the extensions removes the second moment bound):

1. Smooth Gradients: The gradient $\nabla f(x)$ is $L$-Lipschitz continuous for some $L > 0$, i.e. for all $x, y \in \mathbb{R}^d$: $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$. (4)

2. Bounded Second Moment: The second moment of the stochastic gradients is bounded by some $M^2 > 0$, i.e. for all $x \in \mathbb{R}^d$ and every agent $i$: $E\|g_i(x)\|^2 \le M^2$.

Note that throughout this paper, for any random variable $X$, by $E\|X\|^2$ we mean $E[\|X\|^2]$.
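The interaction model above can be sketched in a few lines. The example below is a minimal illustration, assuming a cycle on n nodes as the regular graph (r = 2); the helper names ring_edges, simulate_interactions and parallel_time are hypothetical and chosen only for this sketch.

```python
import random


def ring_edges(n):
    """Edges of a cycle on n >= 3 nodes, a simple 2-regular graph (assumed topology)."""
    return [(i, (i + 1) % n) for i in range(n)]


def simulate_interactions(edges, n, total_interactions, seed=0):
    """Sample one edge uniformly at random per step and count interactions per node."""
    rng = random.Random(seed)
    counts = [0] * n
    for _ in range(total_interactions):
        i, j = rng.choice(edges)  # the two endpoints interact in this step
        counts[i] += 1
        counts[j] += 1
    return counts


def parallel_time(total_interactions, n):
    """Parallel time: total number of interactions divided by the number of nodes."""
    return total_interactions / n


if __name__ == "__main__":
    n, total = 10, 1000
    counts = simulate_interactions(ring_edges(n), n, total)
    print(counts, parallel_time(total, n))
```

Since every interaction involves exactly two nodes, the per-node counts always sum to twice the number of interactions, so the average number of interactions per node equals parallel time up to this constant factor of two.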

3. THE SWARMSGD ALGORITHM

Algorithm Description. We now describe a decentralized variant of SGD, designed to be executed by a population of $n$ nodes, interacting over the edges of an $r$-regular graph $G$. We assume that each node $i$ has access to local stochastic gradients $g_i$, and maintains a model estimate $X_i$. For simplicity, we will assume that the initial model is $0^d$ at each agent, although its value may be arbitrary. Each agent performs SGD steps on its local estimate $X_i$. At random times given by a Poisson clock, we pick two neighboring agents $i$ and $j$ uniformly at random from $G$, and have them average their estimates. The interaction is precisely described in Algorithm 1. For simplicity, the pseudocode is sequential, although in practice nodes perform their local SGD steps in parallel. Also, we have assumed a constant learning rate; we will detail the update procedure in the next section, as well as more complex variants of this basic update.

Algorithm 1 Sequential SwarmSGD pseudocode for each interaction between nodes i and j.
% Let G be an r-regular graph.
% Sample an edge (i, j) of G uniformly at random.
Require: agents i and j chosen for interaction
% choose H_i and H_j
% agent i performs H_i local SGD steps
for q = 1 to H_i do
    X_i ← X_i − η g_i(X_i)
end for
% agent j performs H_j local SGD steps
for q = 1 to H_j do
    X_j ← X_j − η g_j(X_j)
end for
% agents average their estimates coordinate-wise
avg ← (X_i + X_j)/2
X_i ← X_j ← avg
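To make Algorithm 1 concrete, the following is a minimal sequential simulation, offered as a sketch under stated assumptions rather than as the authors' implementation. It assumes scalar models, a toy quadratic loss $\ell_s(x) = (x - s)^2/2$ per sample, a complete communication graph, and geometrically distributed numbers of local steps; the names swarm_sgd, geometric and complete_graph_edges are hypothetical.

```python
import random


def complete_graph_edges(n):
    """All unordered pairs of nodes (assumed topology for this sketch)."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def geometric(rng, mean):
    """Sample a geometric random variable on {1, 2, ...} with the given mean."""
    p = 1.0 / mean
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def swarm_sgd(data, edges, interactions, mean_local_steps, lr, seed=0):
    """Sequential simulation of pairwise interactions in the style of Algorithm 1.

    data[i] holds the local samples of node i. The assumed toy loss
    l_s(x) = 0.5 * (x - s) ** 2 has gradient (x - s).
    """
    rng = random.Random(seed)
    models = [0.0] * len(data)  # every node starts from the zero model
    for _ in range(interactions):
        i, j = rng.choice(edges)  # sample an edge uniformly at random
        for node in (i, j):
            # Each endpoint performs its own random number of local SGD steps.
            for _ in range(geometric(rng, mean_local_steps)):
                s = rng.choice(data[node])
                models[node] -= lr * (models[node] - s)
        # The two agents average their estimates.
        avg = (models[i] + models[j]) / 2
        models[i] = models[j] = avg
    return models


if __name__ == "__main__":
    data = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]  # non-i.i.d. split
    print(swarm_sgd(data, complete_graph_edges(4), 2000, 2, 0.05))
```

In this toy setting the global minimizer is the mean of all samples (2.0 for the data above), and the local models are expected to stay close to it even though no node ever sees the data of another node.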

4. THE CONVERGENCE OF SWARMSGD

We begin by analyzing the convergence of the baseline SwarmSGD algorithm. Fix an integer $H \ge 1$. First, we will consider a variant where $H_i$ and $H_j$ are independent, geometrically-distributed random variables, with mean $H$. This corresponds to interaction times being chosen by a Poisson clock of constant rate. To handle the fact that the number of local steps upon an interaction is a random variable, in this first case we will require stochastic gradients to satisfy the bounded second moment assumption, specified above. Intuitively, this is required since otherwise the "distance travelled" by a node could be virtually unbounded. In this setting, we prove the following:

Theorem 4.1. Let $f$ be a non-convex, $L$-smooth function, whose stochastic gradients satisfy the bounded second moment assumption above. Let the number of local stochastic gradient steps performed by each agent upon interaction be a geometric random variable with mean $H$. Define $\mu_t = \sum_{i=1}^{n} X^i_t / n$ to be the average of the local parameters, where $X^i_t$ is the value of model $i$ after $t$ interactions. Then, for learning rate $\eta = n/\sqrt{T}$ and any number of interactions $T \ge n^4$:

$$\frac{1}{T} \sum_{t=0}^{T-1} E\|\nabla f(\mu_t)\|^2 \le \frac{4 (f(\mu_0) - f(x^*))}{\sqrt{T} H} + \frac{2304 H^2 \max(1, L^2) M^2}{\sqrt{T}} \left( \frac{r^2}{\lambda_2^2} + 1 \right).$$

Discussion. First, we note that this notion of convergence is standard in the non-convex case, e.g. (Lian et al., 2015; 2017; 2018), and that each of the upper bound terms has an intuitive interpretation: the first represents the reduction in loss relative to the initialization, and gets divided by the number of local steps $H$, since progress is made in this term in every local step; the second represents the influence of the variance of each individual local step multiplied by a term which bounds the impact of the graph topology on the convergence. In particular, this term negatively impacts convergence for large values of $H$, $L$, and $M$, but gets dampened if the graph is well-connected (i.e. large $\lambda_2$). For example, in the case of the complete graph, we have $\lambda_2 = n$.

Second, let us consider the algorithm's communication complexity, which we measure in terms of the total number of communication steps. We notice an interesting trade-off between the linear reduction in $H$ in the first term of the bound, showing that the algorithm takes advantage of the local gradient steps, and the quadratic increase in $H$ in the second (variance) term. Hence, the end-to-end speedup of our algorithm versus the variant with $H = 1$ will depend on the relationship between these two terms, which depends on the parameter values.

Third, importantly, the time $T$ in this bound counts the total number of interactions. However, in practice $\Theta(n)$ pairwise interactions will occur in parallel, as they are independent. Therefore, we can replace $T$ by $nT$ in the above formula, to estimate the speedup in terms of wall-clock time, obtaining a speedup of $\Theta(\sqrt{n})$. At the same time, notice that this speedup is dampened in the second term by the non-trivial additional variance due to noisy local gradient steps, a fact which we will revisit in the experimental section.

Fourth, although the requirement $T \ge n^4$ appears restrictive, some non-trivial dependency between $n$ and $T$ is necessary, as gradient information has to "mix" well in the graph before global optimization can occur. Previous work requires stronger variants of this restriction: specifically, Lian et al. (2018) require $T \ge n^6$, while Assran et al. (2018) require $T = \Omega(n d^2)$.

Proof Overview. At a high level, the argument rests on two technical ideas. The first idea is to show that, due to the pairwise averaging process, and in spite of the local steps, the nodes' parameters will have to remain concentrated around their mean $\mu_t$. The second is to show that, even though stochastic gradients are taken at perturbed, noisy estimates of this mean, the impact of this noise on convergence can be bounded.
In particular, the main technical difficulty in the proof is to correctly "encode" the fact that parameters are well concentrated around the mean. For this, we define the potential $\Gamma_t$, which denotes the variance of models after $t$ interactions. Formally, $\Gamma_t = \sum_{i=1}^{n} \|X^i_t - \mu_t\|^2$, where $\mu_t = \sum_{i=1}^{n} X^i_t / n$. We bound the expected evolution of $\Gamma_t$ in terms of $r$, the degree of nodes in the interaction graph $G$, and $\lambda_2$, the second smallest eigenvalue of the Laplacian of $G$. For both algorithm variants we consider, our bound depends on the learning rate, number of local steps, and the bound provided by the assumption on the stochastic gradients (the bound $M^2$). The critical point is that the upper bound on the expectation of $\Gamma_t$ does not depend on the number of interactions $t$. Our approach leverages techniques from the analysis of static load balancing schemes, e.g. Berenbrink et al. (2009). Two key elements of novelty in our case are that (1) for us the load balancing process is dynamic, in the sense that loads (gradients) get continually added; (2) the load-balancing process we consider is multi-dimensional, whereas usually the literature considers simple scalar weights. The complete argument is presented in the Appendix. This technique is quite powerful, as it allows for a number of non-trivial extensions:

Extension 1: Removing the second-moment bound and allowing for non-i.i.d. local data. In the first extension of the algorithm, we assume that the number of local steps performed by each agent is fixed and is equal to $H$. In this case, we are able to remove the bounded second moment assumption, and are able to prove convergence under standard assumptions for non-i.i.d. data. Specifically, in the non-i.i.d. setting, we consider that each $f_i(x)$ is the local function of agent $i$ (computed over the samples available to $i$).
We will require that 1) the function $f_i$ is $L$-smooth, that 2) for each agent $i$, $g_i$ is an unbiased estimate of $\nabla f_i$, and that 3) for any $x$, $E[\|g_i(x) - \nabla f_i(x)\|^2] \le \sigma^2$. We define $f(x) = \sum_{i=1}^{n} f_i(x)/n$ and assume the bound $\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2 / n \le \rho^2$.

Theorem 4.2. Let $f$ be a non-convex, $L$-smooth function whose minimum $x^*$ we are trying to find via the SwarmSGD procedure given in Algorithm 1. Assume the local functions of agents satisfy the conditions discussed above. Let $H$ be the number of local stochastic gradient steps performed by each agent before interacting. Define $\mu_t = \sum_{i=1}^{n} X^i_t / n$, where $X^i_t$ is the value of model $i$ after $t$ interactions. For learning rate $\eta = n/\sqrt{T}$ and $T = \Omega\left( n^4 H^2 \max(1, L^2) \left( \frac{r^2}{\lambda_2^2} + 1 \right)^2 \right)$ we have that:

$$\frac{1}{T} \sum_{t=0}^{T-1} E\|\nabla f(\mu_t)\|^2 \le \frac{1}{\sqrt{T} H} E[f(\mu_0) - f(x^*)] + \frac{376 H^2 \max(1, L^2) (\sigma^2 + 4\rho^2)}{\sqrt{T}} \left( \frac{r^2}{\lambda_2^2} + 1 \right).$$

Please see Appendix H for the details of the proof; we note that we did not optimize for constants. Relative to Theorem 4.1, we have the same quadratic dependency on the number of local steps $H$ and on $L$, but now the second moment bound is replaced by the variance terms. We emphasize that for non-i.i.d. data under second-moment bounds, the exact same bounds as in Theorem 4.1 will hold.

Extension 2: Non-blocking averaging. Algorithm 1 is blocking, in that it requires both nodes to complete their local iterations at the same time before they can interact. In practice, nodes can average their local updates without synchronizing, as follows. Each node $i$ keeps two copies of the model: the live copy $X_i$ (on which local SGD iterations are applied) and the communication copy $Y_i$, which can be accessed asynchronously by communicating nodes. When completing its local steps, a node $i$ first checks if some other node averaged against its communication copy $Y_i$ since its last communication step.
If the answer is yes, it simply applies its locally-generated update to its communication model $Y_i$, updates the live copy so that $X_i = Y_i$, and proceeds with the next iteration of local computation. If no other node has averaged against its communication copy, then the node actively seeks a random communication partner $j$, and averages its live copy against the communication copy $Y_j$ of node $j$, updating both to $(X_i + Y_j)/2$. The node then proceeds with the next iteration of local computation. Please see Appendix F for the precise definition of the algorithm, and for the formal convergence guarantee for this variant.

Extension 3: Quantization. For large models, the cost of the averaging step can become significant, due to bandwidth constraints. To remove the bandwidth bottleneck, we allow the averaging step to be performed with respect to quantized versions of the two models. While communication-compression has been considered in a decentralized context before, e.g. (Tang et al., 2018; Lu and Sa, 2020), our approach is different. Instead of modifying the algorithm or maintaining neighbor information at every node, we make use of a quantization scheme with some useful properties by Davies et al. (2020), which we slightly adapt to our context. The key issue when quantizing decentralized models is that for most known quantization schemes, e.g. (Alistarh et al., 2017), the quantization error depends on the norm of the inputs: here, the inputs are models, which are not necessarily close to the origin. Thus, the quantization error at each step would depend on the norm of the models, which would break our bound on $\Gamma_t$. Instead, we observe that the quantization scheme of Davies et al. (2020) has error which is bounded by the distance between inputs, rather than input norms. Crucially, we show that $\Gamma_t$ can in fact be used to bound the distance between models, so we can bound the quantization error in terms of $\Gamma_t$ at each step.
This allows us, with some care, to generalize the analysis to the case where the models are quantized. We provide a full description and proof of convergence in Appendix G. Specifically, quantization ensures the same convergence bounds as in Theorem 4.1, but with an expected communication cost of $O(d + \log T)$ bits per step. By contrast, non-quantized decentralized algorithms assume that nodes can exchange infinite-precision real numbers, while the only other memory-less compression scheme (Lu and Sa, 2020) induces a linear dependence in $d$ in the rate. In our applications, $d \gg \log T$, and therefore our cost is essentially constant per dimension; specifically, we show that we can quantize to 8 bits per coordinate without loss of accuracy.
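The quantization argument combines two ingredients: the potential $\Gamma_t$ bounds the distance between any two models, and the quantizer's error depends on that distance rather than on the model norms. The sketch below illustrates both with a deliberately simplified stand-in, namely a per-coordinate cubic lattice of spacing eps whose indices are transmitted modulo q. It is not the lattice scheme of Davies et al. (2020); the names encode, decode, potential, eps and q are assumptions made for this illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two models."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def potential(models):
    """Gamma = sum_i ||X_i - mu||^2, with mu the average of all models."""
    n, d = len(models), len(models[0])
    mu = [sum(m[k] for m in models) / n for k in range(d)]
    return sum(sq_dist(m, mu) for m in models)


def encode(x, eps, q):
    """Send each coordinate's lattice index modulo q (log2(q) bits per coordinate)."""
    return [round(v / eps) % q for v in x]


def decode(code, reference, eps, q):
    """Return the lattice point with the received residues nearest to `reference`.

    Decoding recovers the sender's rounded model whenever every coordinate of
    the sender's model and of `reference` differ by less than about
    (q / 2 - 1) * eps, independently of how large the models themselves are.
    """
    out = []
    for c, r in zip(code, reference):
        r_idx = round(r / eps)
        # Centered offset between sender and receiver indices, in [-q//2, q - q//2 - 1].
        delta = (c - r_idx + q // 2) % q - q // 2
        out.append((r_idx + delta) * eps)
    return out


if __name__ == "__main__":
    eps, q = 0.01, 256  # 256 residues correspond to 8 bits per coordinate
    x = [1000.123, -250.456, 3.14159]  # sender's model, far from the origin
    y = [v + 0.5 for v in x]  # receiver's model, close to x
    print(decode(encode(x, eps, q), y, eps, q))
```

Since $\|X_i - X_j\|^2 \le 2(\|X_i - \mu\|^2 + \|X_j - \mu\|^2) \le 2\Gamma$, a bound on the potential directly controls the distance that such a decoder has to tolerate.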

5. EXPERIMENTAL RESULTS

In this section, we validate our analysis by applying the algorithm to training deep neural networks for image classification and machine translation. We map the algorithm onto a multi-node supercomputing setting, in which we have a large number of compute nodes, connected by fast communication links. The key overhead in this setting is synchronization: at large node counts, the cost of synchronizing all nodes so they execute in lock-step can be very high, see e.g. Li et al. (2019) for numerical results on different workloads. SwarmSGD mitigates this overhead, since nodes synchronize only sporadically and in pairs. Harnessing the computational power of this large-scale distributed setting is still an underexplored area (Ben-Nun and Hoefler, 2019).

Target System and Implementation. We run SwarmSGD on the CSCS Piz Daint supercomputer, which is composed of Cray XC50 nodes, each with a Xeon E5-2690v3 CPU and an NVIDIA Tesla P100 GPU, using a state-of-the-art Aries interconnect over a Dragonfly network topology, which is regular. Please see (Piz, 2019) for more details. We implemented SwarmSGD in PyTorch and TensorFlow using NCCL and MPI-based primitives. Both variants implement the version with non-blocking averaging. The PyTorch implementation is built on top of the SGP framework (Assran et al., 2018), and uses SwarmSGD to train ResNets on the CIFAR-10/100 (Krizhevsky et al., 2014) and ImageNet (Russakovsky et al., 2015) datasets, while we use the TensorFlow implementation to train a much larger Transformer-XL model (Vaswani et al., 2017) on the WMT17 (En-Ge) dataset. We note that all algorithms used the same topology overlay (fully-connected with random pairings), and that SGP was run with overlap factor 1, as suggested by Assran et al. (2018).

Under review as a conference paper at ICLR 2021

Training Process.
Our training methodology follows data-parallel training, with some differences due to decentralization, and is identical to previous work on decentralized and local SGD, e.g. (Lian et al., 2017; Assran et al., 2018; Lin et al., 2018) . Training proceeds in epochs, each of which corresponds to processes collectively performing a full pass over the dataset. At the beginning of each epoch, we re-shuffle the dataset and partition it among processes (Lin et al., 2018) . As noted in previous work (Lian et al., 2017; 2018; Assran et al., 2018) variants of decentralized SGD are not always able to recover sequential SGD accuracy within the same number of epochs as this baseline. This is justified by Theorems 4.1 and 4.2, which predict that the slower mixing (and higher local model variance) can affect convergence. Thus, in some experiments, we will allow the decentralized schemes to execute for more epochs, by a constant multiplier factor between 1 and 3. Once we have fixed the number of epochs, we do not alter the other training hyperparameters: in particular, the learning rate schedule, momentum and weight decay terms are identical to sequential SGD, for each individual model. Accuracy and Speed. We first examined whether SwarmSGD can in fact recover full accuracy versus the sequential or large-batch SGD baselines. In Table 1 we provide an overview of parameter values to recover or exceed large-batch SGD accuracy (following (Goyal et al., 2017 )) using SwarmSGD, on the ResNet/ImageNet/CIFAR tasks. We execute for 32 nodes on ImageNet, and 8 nodes on CIFAR-10. (Local batch sizes are 256 for ResNet20 and ResNet18, and 128 for ResNet50. Quantization is not applied.) The results show that Swarm can recover or slightly exceed the accuracy of the large-batch baselines, and that it has lower practical communication cost relative to existing methods (see Figure 2(b) , where we separate the average computation cost per batch). 
However, Swarm requires significant additional passes over the data (up to 2.7×) to achieve full accuracy, which negates its performance benefits in this specific setting, relative to large-batch SGD. (Please see Appendix Figure 5 for an end-to-end time comparison. We do not take the cost of fine-tuning the hyperparameters for large-batch SGD into account in this example.) This finding is in line with previous work on decentralized methods (Assran et al., 2018).

Next, we examine accuracy for the WMT17 task. The results are provided in Figure 1(a), in accuracy-vs-time format, for 16 and 32 nodes, executing for 10 global epochs. Here, the large-batch SGD (LB-SGD) baseline (BLEU score 26.1 at 16 nodes) is a poor alternative at high node counts: its throughput is very low, due to the size of the model (see Figure 1(b)). At 16 nodes, Swarm slightly exceeds the baseline accuracy at 26.17 BLEU, for an end-to-end speedup of ∼1.5×. In the same setting, Swarm outperforms all other decentralized methods (the fastest previous method, AD-PSGD, is 30% slower, and less accurate), both in terms of BLEU score and in terms of end-to-end time. (The objective loss graph is similar, and is given in Appendix Figure 7.) At 32 nodes, all decentralized methods reach lower scores (∼23.5) after 10 epochs. However, we observed experimentally that running Swarm for an additional 5 epochs at 32 nodes recovered a BLEU score of ∼25.9, 30% faster than the 16-node version in terms of end-to-end time (omitted for visibility).
In addition, we investigated: 1) the accuracy of the real average of all models throughout training: it is usually more accurate than an arbitrary model, but not significantly, corroborating the claim that individual models tend to stay close to the mean; 2) the influence of the number of local steps on accuracy: perhaps surprisingly, we were able to recover baseline accuracy on ResNet18/ImageNet for up to 4 local steps (see Figure 2(a)); 3) the impact of quantization on convergence, where we were able to recover accuracy when applying 8-bit model quantization to Swarm. We encourage the reader to examine the full experimental report in the Appendix, which contains data on these experiments, as well as additional ablation studies.

Discussion. Generally, the performance and accuracy of SwarmSGD are superior to those of previous decentralized methods (see Figure 1 for an illustration, and Figure 2(b) for a performance breakdown). In particular, a closer examination of the average batch times in Figure 2(b) shows that the time per node per batch (including communication and computation) is largely constant as we increase the number of nodes, which gives our method close-to-ideal scaling behaviour. This advantage relative to previous schemes, notably AD-PSGD, comes mainly from the reduction in communication frequency: Swarm communicates less often, and therefore incurs lower average communication cost. The main disadvantage of Swarm is that, similar to previous decentralized methods, it may need additional data passes in order to fully recover accuracy at high node counts. However, we also note that our method did not benefit from the high level of hyperparameter tuning applied to large-batch SGD, e.g. (Goyal et al., 2017). We find it interesting that this accuracy issue is less prevalent in the context of large, over-parameterized models, such as the Transformer, where Swarm could be a practically viable alternative to large-batch SGD within the same number of epochs.

6. CONCLUSIONS AND FUTURE WORK

We analyzed the convergence of SGD in a decoupled model of distributed computing, in which nodes mostly perform independent SGD updates, interspersed with intermittent pairwise averaging steps, which may be performed in an inconsistent and noisy manner. We showed that SGD still converges in this restrictive setting, under considerable consistency relaxations, and moreover can still achieve speedup in terms of iteration time. Empirical results in a supercomputing environment complement and validate our analysis, showing that this method can outperform previous proposals. Natural extensions would be to generalize the bounds to arbitrary communication graphs, to relax the assumptions on the objective, or to experiment on large-scale decentralized testbeds.

A SUMMARY OF THE APPENDIX SECTIONS

The Appendix contains the following sections:
• In Section B we compare SwarmSGD with some of the existing algorithms. We list convergence bounds and the assumptions needed to achieve them.
• In Section C we provide crucial properties of the load balancing process on the graph.
• In Section D we provide definitions for the local steps we use in the later sections.
• In Section E we provide a sketch of the proof of Theorem 4.1, which shows the convergence of SwarmSGD assuming the second moment bound on the gradients. Recall that the number of local steps in this case is a geometric random variable with mean H.
• In Section F we provide the proof for the non-blocking version of the SwarmSGD algorithm. We again assume the second moment bound on the gradients and that the number of local steps is a geometric random variable with mean H.
• In Section G we provide the proof for the quantized version of the SwarmSGD algorithm. We again assume the second moment bound on the gradients and that the number of local steps is a geometric random variable with mean H.
• In Section H we prove Theorem 4.2. In this case we do not assume the second moment bound, data is not distributed identically, and the number of local steps performed by each agent is a fixed number H.
• In Section I we provide additional experimental results for SwarmSGD.

B COMPARISON OF RESULTS

In this section we compare the convergence rates of existing algorithms, while specifying the bounds they require for convergence. In the table, $T$ corresponds to the parallel time and $n$ is the number of processes. We use the following notation for the needed bounds (or assumptions):
1. $\sigma^2$: bound on the variance of the gradient.
2. $M^2$: bound on the second moment of the gradient.
3. $d$: bounded dimension.
4. $\lambda_2$: bounded spectral gap of the averaging matrix (of the interaction graph in the case of SwarmSGD).
5. $\tau$: bounded message delay.
6. $r$: the interaction graph is $r$-regular.
7. $\Delta$: bounded diameter of the interaction graph.

Algorithm                     | Assumptions                  | Convergence Rate
------------------------------|------------------------------|------------------
SwarmSGD                      | $\sigma^2, \lambda_2, r$     | $O(1/\sqrt{Tn})$
SwarmSGD                      | $M^2, \lambda_2, r$          | $O(1/\sqrt{Tn})$
AD-PSGD (Lian et al., 2018)   | $\sigma^2, \lambda_2, \tau$  | $O(1/\sqrt{Tn})$
SGP (Assran et al., 2018)     | $\sigma^2, d, \Delta, \tau$  | $O(1/\sqrt{Tn})$

Table 2: Comparison of theoretical results in the non-convex case.

Discussion. We compare in more detail against Lian et al. (2018) and Assran et al. (2018), since these are the only other papers which do not require explicit global synchronization in the form of rounds. (By contrast, e.g. Wang and Joshi (2018); Koloskova et al. (2020) require that nodes synchronize in rounds, so that at every point in time each node has taken the same number of steps.) In Assran et al. (2018), all nodes perform gradient steps at each iteration, but averaging steps can be delayed by $\tau$ iterations. Unfortunately, in this case the mixing time depends on the dimension $d$ (more precisely, it contains a $\sqrt{d}$ factor), on the delay bound $\tau$, and on $\Delta$, defined as the number of iterations over which the interaction graph is well connected. Additionally, the analysis is not suitable for random interactions. On the other hand, Lian et al. (2018) consider random interaction matrices and do not require the agents to perform the same number of gradient steps. Unlike our model, in their case more than two nodes can interact during the averaging step.
To circumvent the global synchronization issue, Lian et al. (2018) allow agents to have outdated views during the averaging step. Yet, we would like to emphasize that they require blocking during the averaging steps, while we allow some amount of non-blocking behaviour. By "some amount" we mean that the algorithm needs blocking only in the case when some node takes more than two consecutive iterations to complete its local gradient steps. This means that, for each node $i$, completing its $H_i$ local steps should not take more than $O(n)$ global steps (since each node interacts with probability $2/n$ at each step); this assumption also holds for Lian et al. (2018). In summary, our algorithm reduces the synchronization required by averaging steps, by considering pairwise interactions, by introducing local steps, and by providing a non-blocking version of the algorithm as well (in SGP and AD-PSGD, agents perform one local step and one averaging step per iteration). We would like to point out that we also allow a random number of local steps between interactions in the case when we have a second moment bound on the stochastic gradient, which reduces synchronization costs even further. Finally, our algorithm requires $T = \Omega(n^4)$ iterations to achieve the convergence rate of $O(1/\sqrt{Tn})$ in the case of the blocking algorithm, and $T = \Omega(n^6)$ in general. (By contrast, Lian et al. (2018) requires $T = \Omega(n^6)$.)
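The claim that each node interacts with probability $2/n$ per step follows from sampling an edge uniformly at random in an $r$-regular graph: a node is incident to $r$ of the $rn/2$ edges. The following minimal sketch checks this exactly on two example graphs; the helper names (`interaction_probability`, `cycle_edges`, `complete_edges`) are illustrative and not part of the original implementation.

```python
from fractions import Fraction


def cycle_edges(n):
    """Edges of the n-cycle, a 2-regular graph."""
    return [(i, (i + 1) % n) for i in range(n)]


def complete_edges(n):
    """Edges of the complete graph, an (n-1)-regular graph."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def interaction_probability(edges, node):
    """Exact probability that `node` is an endpoint of a uniformly sampled edge."""
    incident = sum(1 for u, v in edges if node in (u, v))
    return Fraction(incident, len(edges))


n = 10
p_cycle = interaction_probability(cycle_edges(n), node=3)
p_complete = interaction_probability(complete_edges(n), node=3)
# Both equal 2/n, so the waiting time between interactions of a fixed node
# is geometric with mean n/2, which is the O(n) bound used in the text.
expected_wait = 1 / p_cycle
```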

C PROPERTIES OF THE LOAD BALANCING

In this section we provide useful lemmas which will help us in the later sections. We are given a simple undirected graph $G$, with $n$ nodes (for convenience we number them from 1 to $n$) and edge set $E$. Each node is adjacent to exactly $r$ nodes. Each node $i$ of graph $G$ keeps a local model vector $X_t^i \in \mathbb{R}^d$ ($t$ is the number of interactions, or steps); let $X_t = (X_t^1, X_t^2, \ldots, X_t^n)$ be the vector of local models at step $t$. An interaction (step) is defined as follows: we pick an edge $e = (u, v)$ of $G$ uniformly at random and update the vector models correspondingly. Let $\mu_t = \sum_{i=1}^n X_t^i / n$ be the average of the models at step $t$ and let $\Gamma_t = \sum_{i=1}^n \|X_t^i - \mu_t\|^2$ be the potential at time step $t$. Let $L$ be the Laplacian matrix of $G$ and let $\lambda_2$ be the second smallest eigenvalue of $L$. For example, if $G$ is a complete graph, $\lambda_2 = n$. First we state the following lemma from Ghosh and Muthukrishnan (1996):

Lemma C.1.
$$\lambda_2 = \min_{v = (v_1, v_2, \ldots, v_n)} \left\{ \frac{v^T L v}{v^T v} \;\middle|\; \sum_{i=1}^n v_i = 0 \right\}.$$

Now, we show that Lemma C.1 can be used to lower bound $\sum_{(i,j)\in E} \|X_t^i - X_t^j\|^2$:

Lemma C.2.
$$\sum_{(i,j)\in E} \|X_t^i - X_t^j\|^2 \ge \lambda_2 \sum_{i=1}^n \|X_t^i - \mu_t\|^2 = \lambda_2 \Gamma_t.$$

Proof. Observe that
$$\sum_{(i,j)\in E} \|X_t^i - X_t^j\|^2 = \sum_{(i,j)\in E} \|(X_t^i - \mu_t) - (X_t^j - \mu_t)\|^2.$$
Also, notice that, since $v^T L v = \sum_{(i,j)\in E} (v_i - v_j)^2$, Lemma C.1 means that for every vector $v = (v_1, v_2, \ldots, v_n)$ such that $\sum_{i=1}^n v_i = 0$, we have:
$$\sum_{(i,j)\in E} (v_i - v_j)^2 \ge \lambda_2 \sum_{i=1}^n v_i^2.$$
Since $\sum_{i=1}^n (X_t^i - \mu_t)$ is the zero vector, we can apply the above inequality to each of the $d$ components of the vectors $X_t^1 - \mu_t, X_t^2 - \mu_t, \ldots, X_t^n - \mu_t$ separately, and by elementary properties of the 2-norm we prove the lemma.
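Lemma C.2 can be checked numerically on small graphs. The sketch below is only an illustration, assuming NumPy and using hypothetical helper names (`laplacian`, `potential`, `edge_disagreement`); it builds the Laplacian of a cycle, computes $\lambda_2$, and compares both sides of the inequality for random local models.

```python
import numpy as np


def laplacian(n, edges):
    """Graph Laplacian (degree matrix minus adjacency matrix)."""
    lap = np.zeros((n, n))
    for u, v in edges:
        lap[u, u] += 1.0
        lap[v, v] += 1.0
        lap[u, v] -= 1.0
        lap[v, u] -= 1.0
    return lap


def potential(X):
    """Gamma_t: sum of squared distances of the local models to their mean."""
    mu = X.mean(axis=0)
    return float(np.sum((X - mu) ** 2))


def edge_disagreement(X, edges):
    """Sum over edges (i, j) of ||X^i - X^j||^2."""
    return float(sum(np.sum((X[u] - X[v]) ** 2) for u, v in edges))


n, d = 8, 3
edges = [(i, (i + 1) % n) for i in range(n)]        # 2-regular cycle
lam2 = np.linalg.eigvalsh(laplacian(n, edges))[1]   # second smallest eigenvalue
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                         # one row per local model
lhs = edge_disagreement(X, edges)
rhs = lam2 * potential(X)                           # Lemma C.2 states lhs >= rhs
```

On the complete graph the inequality holds with equality, since there $\lambda_2 = n$ and $\sum_{i<j}\|X^i - X^j\|^2 = n\,\Gamma_t$.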

D DEFINITIONS FOR THE LOCAL STEPS

In this section we provide the formal definition of the local steps performed by our algorithms. Recall that $X_t^i$ is the local model of node $i$ at step $t$. Let $H_t^i$ be the number of local steps node $i$ performs in the case when it is chosen for interaction at step $t + 1$. A natural case is for $H_t^i$ to be fixed throughout the whole algorithm, that is: for each time step $t$ and node $i$, $H_t^i = H$. However, the optimal choice of $H_t^i$ depends on whether a second moment bound on gradients (5) is assumed. Let
$$\tilde h_i^0(X_t^i) = 0,$$
and for $1 \le q \le H_t^i$ let
$$\tilde h_i^q(X_t^i) = \tilde g_i\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big).$$
Note that the stochastic gradient is recomputed at each step, but we omit the superscript for simplicity, that is: $\tilde h_i^q(X_t^i) = \tilde g_i^q\big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\big)$. Further, for $1 \le q \le H_t^i$, let
$$h_i^q(X_t^i) = \mathbb{E}\Big[\tilde g_i\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big)\Big] = \nabla f\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big)$$
be the expected value of $\tilde h_i^q(X_t^i)$ taken over the randomness of the stochastic gradient $\tilde g_i$. Let $\tilde h_i(X_t^i)$ be the sum of the $H_t^i$ local stochastic gradients we computed:
$$\tilde h_i(X_t^i) = \sum_{q=1}^{H_t^i} \tilde h_i^q(X_t^i).$$
Similarly, for simplicity we avoid using the index $t$ on the left side of the above definition, since it is clear that if the local steps are applied to model $X_t^i$, we compute them in the case when node $i$ interacts at step $t + 1$. The update step in SwarmSGD (Algorithm 1) is (before averaging):
$$X_{t+1}^i = X_t^i - \eta\, \tilde h_i(X_t^i) = X_t^i - \eta \sum_{q=1}^{H_t^i} \tilde h_i^q(X_t^i) = X_t^i - \eta \sum_{q=1}^{H_t^i} \tilde g_i\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big).$$
Notice that
$$\mathbb{E}\|\tilde h_i^q(X_t^i)\|^2 = \mathbb{E}\Big\|\tilde g_i\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big)\Big\|^2 \overset{\text{Assumption (5)}}{\le} M^2.$$
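The recursive definition above can be read as a short loop: each local gradient is evaluated at the starting model shifted by the gradients accumulated so far. The following is a minimal sketch under stated assumptions: the function names and the `stochastic_grad(x, rng)` callback are hypothetical, and the geometric sampler uses success probability $1/H$ so that the number of local steps has mean $H$.

```python
import numpy as np


def sample_num_local_steps(mean_H, rng):
    """Geometric number of local steps on {1, 2, ...} with mean mean_H."""
    return int(rng.geometric(1.0 / mean_H))


def accumulated_local_gradient(x, stochastic_grad, eta, num_steps, rng):
    """Return the sum over q = 1..num_steps of h^q, where h^q is the
    stochastic gradient at x - eta * (h^1 + ... + h^{q-1})."""
    h_sum = np.zeros_like(x)
    for _ in range(num_steps):
        h_sum += stochastic_grad(x - eta * h_sum, rng)
    return h_sum


# Noise-free example with f(x) = 0.5 * ||x||^2, whose gradient is x itself.
rng = np.random.default_rng(0)
x = np.array([1.0])
h = accumulated_local_gradient(x, lambda y, rng: y, eta=0.5, num_steps=2, rng=rng)
x_next = x - 0.5 * h   # same as two plain gradient steps: (1 - 0.5)^2 * x
```

In this example the two local gradients are 1 and 0.5, so `h` is 1.5 and the model after the local update is 0.25.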

E ANALYSIS UNDER SECOND MOMENT BOUND AND RANDOM NUMBER OF LOCAL STEPS

In this section we consider Algorithm 1, where for each node $i$, $H_i$ is a geometric random variable with mean $H$. We also assume a gradient second moment bound (5). We provide only a sketch of the proof, since the proof for the non-blocking version of the algorithm in Section F is more general. If nodes $i$ and $j$ interact at step $t + 1$ and their local models have values $X_t^i$ and $X_t^j$ after step $t$, their new model values become:
$$X_{t+1}^i = X_{t+1}^j = \big(X_t^i + X_t^j - \eta\, \tilde h_i(X_t^i) - \eta\, \tilde h_j(X_t^j)\big)/2.$$
Recall that $\mu_t$ is the average of the values of the models after time step $t$ and $\Gamma_t = \sum_{i=1}^n \|X_t^i - \mu_t\|^2$. First of all, we can prove that (see Lemma F.1; the result can be achieved even though the algorithms differ):
$$\mathbb{E}[\Gamma_{t+1}] \le \mathbb{E}[\Gamma_t]\Big(1 - \frac{\lambda_2}{2rn}\Big) + \Big(2 + \frac{4r}{\lambda_2}\Big)\frac{\eta^2}{n} \sum_{i=1}^n 4\,\mathbb{E}\|\tilde h_i(X_t^i)\|^2.$$
We can further show that Lemma F.2, and therefore Lemmas F.3 and F.4, also hold, yielding:
$$\mathbb{E}[\Gamma_t] \le \Big(\frac{40r}{\lambda_2} + \frac{80r^2}{\lambda_2^2}\Big) n \eta^2 H^2 M^2,$$
and
$$\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \big\rangle \le 2HL^2\, \mathbb{E}[\Gamma_t] - \frac{3Hn}{4}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + 12 H^3 n L^2 M^2 \eta^2.$$
Next, in a similar fashion as in the proof of Theorem F.8, we can show that:
$$\mathbb{E}[f(\mu_{t+1})] \le \mathbb{E}[f(\mu_t)] + \frac{2\eta}{n^2} \sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \big\rangle + \frac{20 L \eta^2 H^2 M^2}{n^2}.$$
This allows us to show that:

Theorem 4.1. Let $f$ be a non-convex, $L$-smooth function, whose stochastic gradients satisfy the bounded second moment assumption above. Let the number of local stochastic gradient steps performed by each agent upon interaction be a geometric random variable with mean $H$. Let the learning rate we use be $\eta = n/\sqrt{T}$. Define $\mu_t = \sum_{i=1}^n X_t^i / n$, where $X_t^i$ is the value of model $i$ after $t$ interactions, to be the average of the local parameters. Then, for learning rate $\eta = n/\sqrt{T}$ and any number of interactions $T \ge n^4$:
$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{4(f(\mu_0) - f(x^*))}{\sqrt{T} H} + \frac{2304 H^2 \max(1, L^2) M^2}{\sqrt{T}} \Big(\frac{r^2}{\lambda_2^2} + 1\Big).$$

Proof.
We again skip the calculations and follow the steps from the proof of Theorem F.8 (note that constants can be improved, but for simplicity we keep them the same). After applying (13) and (14), this results in:
$$\mathbb{E}[f(\mu_{t+1})] - \mathbb{E}[f(\mu_t)] \le \Big(\frac{160r}{\lambda_2} + \frac{320r^2}{\lambda_2^2}\Big) \frac{\eta^3 H^3 M^2 L^2 n}{n^2} - \frac{\eta H}{4n}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + \frac{76 H^3 L^2 M^2 \eta^3}{n} + \frac{20 L \eta^2 H^2 M^2}{n^2}.$$
Once we sum up the above inequality for $t = 0$ to $t = T - 1$, multiply by $\frac{4n}{\eta H T}$, and rearrange terms, we get (additionally recall that $\mathbb{E}[f(\mu_T)] \ge f(x^*)$):
$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{4n(f(\mu_0) - f(x^*))}{T H \eta} + \Big(\frac{640r}{\lambda_2} + \frac{1280r^2}{\lambda_2^2}\Big) \eta^2 H^2 M^2 L^2 + \frac{80 L \eta H M^2}{n} + 304 H^2 L^2 M^2 \eta^2.$$
Finally, since $\eta = n/\sqrt{T} \le \frac{1}{n}$ (because $T \ge n^4$), we have $\eta^2 \le \eta/n = 1/\sqrt{T}$, and we get the proof of the theorem. Note that the difference between this theorem and Theorem F.8 is that we have a lower bound of $n^4$ for $T$ here, instead of $n^4(n + 1)^2$. The reason is that we are not required to use Lemmas F.7 and F.5, since our algorithm allows blocking, and this means that interacting agents do not have incomplete values for the models.
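The blocking interaction analyzed in this section can be simulated in a few lines. The sketch below is a toy illustration only, not the implementation used in the experiments: it assumes a complete interaction graph, the quadratic objective $f(x) = \frac{1}{2}\|x\|^2$ with Gaussian gradient noise, and hypothetical names (`swarm_interaction`, `run_toy_swarm`).

```python
import numpy as np


def swarm_interaction(X, i, j, stochastic_grad, eta, H_i, H_j, rng):
    """Blocking interaction: both nodes finish their local SGD steps,
    then both adopt the average of the two updated models."""
    for k, num_steps in ((i, H_i), (j, H_j)):
        for _ in range(num_steps):
            X[k] = X[k] - eta * stochastic_grad(X[k], rng)
    avg = 0.5 * (X[i] + X[j])
    X[i] = avg
    X[j] = avg


def run_toy_swarm(n=8, d=5, T=4000, mean_H=2, eta=0.05, noise=0.1, seed=0):
    """Run T pairwise interactions; return the mean model and the potential."""
    rng = np.random.default_rng(seed)

    def grad(x, rng):
        # Gradient of 0.5 * ||x||^2 plus Gaussian noise (toy assumption).
        return x + noise * rng.normal(size=x.shape)

    X = 5.0 * rng.normal(size=(n, d))
    for _ in range(T):
        i, j = rng.choice(n, size=2, replace=False)      # uniform random edge
        H_i, H_j = rng.geometric(1.0 / mean_H, size=2)   # geometric, mean mean_H
        swarm_interaction(X, i, j, grad, eta, int(H_i), int(H_j), rng)
    mu = X.mean(axis=0)
    gamma = float(np.sum((X - mu) ** 2))
    return mu, gamma


mu, gamma = run_toy_swarm()
```

In this toy run the mean model ends up close to the minimizer at the origin, and the potential $\Gamma_t$ stays small, in line with the qualitative behaviour the analysis predicts.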

F ANALYSIS OF THE NONBLOCKING VARIANT, WITH SECOND MOMENT BOUND AND RANDOM NUMBER OF LOCAL STEPS

First we define how the non-blocking property changes our interactions. Let $i$ and $j$ be the nodes which interact at step $t + 1$. We set
$$X_{t+1/2}^i = \frac{X_t^i}{2} + \frac{{X'}_t^j}{2}, \qquad X_{t+1/2}^j = \frac{X_t^j}{2} + \frac{{X'}_t^i}{2},$$
and
$$X_{t+1}^i = X_{t+1/2}^i - \eta\, \tilde h_i(X_t^i), \qquad X_{t+1}^j = X_{t+1/2}^j - \eta\, \tilde h_j(X_t^j),$$
where for each node $k$, if $p_t^k + 1$ is the last time it interacted before and including step $t$:
$${X'}_t^k = X_{p_t^k + 1/2}^k = X_{p_t^k + 1}^k + \eta\, \tilde h_k(X_{p_t^k}^k) = X_t^k + \eta\, \tilde h_k(X_{p_t^k}^k).$$
Intuitively, the last definition means that node $k$ has computed $X_{p_t^k + 1/2}^k$ but has not finished computing $X_{p_t^k + 1}^k$; hence, when some other node tries to read $X_{p_t^k + 1}^k$, it reads the value which is missing the local gradient update step, but it does not have to wait for node $k$ to finish computing. Since $p_t^k + 1$ is the last step at which node $k$ interacted, we have that $X_t^k = X_{p_t^k + 1}^k$. More formally:

Algorithm 2 Sequential non-blocking SwarmSGD pseudocode for each interaction between nodes i and j.
  % Let G be an r-regular graph.
  % Sample an edge (i, j) of G uniformly at random.
  Require: agents i and j chosen for interaction, i is the initiator
  % choose H_i and H_j
  % agent i performs H_i local SGD steps
  S_i ← X_i
  for q = 1 to H_i do
    X_i ← X_i − η g̃_i(X_i)
  end for
  % agent j performs H_j local SGD steps
  S_j ← X_j
  for q = 1 to H_j do
    X_j ← X_j − η g̃_j(X_j)
  end for
  % agents update their estimates
  X_i ← (S_i + X'_j)/2 + (X_i − S_i)
  X_j ← (S_j + X'_i)/2 + (X_j − S_j)

Notice the differences between the main algorithm and the non-blocking one: first, local gradient steps are applied only after the averaging steps (this corresponds to the term $X_i - S_i$ for node $i$), and second, nodes get access to the model of their interacting partner, which might not be complete for the reasons described above (for example, node $i$ is forced to use $X'_j$ instead of $X_j$ in its averaging step).
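The non-blocking interaction can be sketched as follows. This is an illustrative sequential simulation under the worst-case assumption used in the analysis (a partner always reads the stale copy), not the distributed implementation; the names `local_sgd_delta`, `nonblocking_interaction`, and the array `X_half`, which stores each node's model right after its last averaging step, are hypothetical.

```python
import numpy as np


def local_sgd_delta(x, grad_fn, eta, num_steps, rng):
    """Run local SGD steps from x and return the accumulated update,
    i.e. -eta times the sum of the local stochastic gradients."""
    y = x.copy()
    for _ in range(num_steps):
        y -= eta * grad_fn(y, rng)
    return y - x


def nonblocking_interaction(X, X_half, i, j, grad_fn, eta, H_i, H_j, rng):
    """One interaction of the sequential non-blocking scheme.
    X[k] is node k's current model; X_half[k] is the copy its partners can
    read, which is missing node k's most recent local gradient update."""
    delta_i = local_sgd_delta(X[i], grad_fn, eta, H_i, rng)
    delta_j = local_sgd_delta(X[j], grad_fn, eta, H_j, rng)
    new_half_i = 0.5 * (X[i] + X_half[j])   # average own model with stale partner
    new_half_j = 0.5 * (X[j] + X_half[i])
    X_half[i], X_half[j] = new_half_i, new_half_j
    X[i] = new_half_i + delta_i             # local update applied after averaging
    X[j] = new_half_j + delta_j


# Noise-free example with gradient(y) = y.
rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [2.0, 4.0], [1.0, 1.0]])
X_half = X.copy()
nonblocking_interaction(X, X_half, 0, 1, lambda y, rng: y, 0.5, 1, 1, rng)
nonblocking_interaction(X, X_half, 1, 2, lambda y, rng: y, 0.5, 1, 1, rng)
```

In the second interaction, node 2 averages with node 1's stale copy `[1, 2]` rather than with its up-to-date model `[0, 0]`, which is exactly the inconsistency the analysis has to bound.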
If node $i$ is the initiator of the interaction and its chosen interaction partner $j$ is still computing the local gradients from its previous interaction, this algorithm allows node $i$ not to wait for $j$ to finish its computation. In this case, $i$ simply leaves its value $X^i$ in $j$'s memory. Notice that since $i$ has finished its computation, it does not need to pass an outdated model to $j$, but we assume the worst case. We proceed by proving the following lemma, which upper bounds the expected change in potential:

Lemma F.1. For any time step $t$ we have:
$$\mathbb{E}[\Gamma_{t+1}] \le \mathbb{E}[\Gamma_t]\Big(1 - \frac{\lambda_2}{2rn}\Big) + \Big(2 + \frac{4r}{\lambda_2}\Big) \frac{\eta^2}{n} \sum_{i=1}^n \Big(4\,\mathbb{E}\|\tilde h_i(X_t^i)\|^2 + \mathbb{E}\|\tilde h_i(X_{p_t^i}^i)\|^2\Big).$$

Proof. First we bound the change in potential $\Delta_t = \Gamma_{t+1} - \Gamma_t$ for some fixed time step $t > 0$. For this, let $\Delta_t^{i,j}$ be the change in potential when we choose agents $(i, j) \in E$ for interaction (while calculating $\Delta_t^{i,j}$ we assume that $X_t$ is fixed). Let
$$R_t^i = -\eta\, \tilde h_i(X_t^i) + \frac{\eta\, \tilde h_j(X_{p_t^j}^j)}{2} \quad \text{and} \quad R_t^j = -\eta\, \tilde h_j(X_t^j) + \frac{\eta\, \tilde h_i(X_{p_t^i}^i)}{2}.$$
We have that:
$$X_{t+1}^i = \frac{X_t^i + X_t^j}{2} + R_t^i, \qquad X_{t+1}^j = \frac{X_t^i + X_t^j}{2} + R_t^j, \qquad \mu_{t+1} = \mu_t + \frac{R_t^i + R_t^j}{n}.$$
This gives us that:
$$X_{t+1}^i - \mu_{t+1} = \frac{X_t^i + X_t^j}{2} + \frac{n-1}{n} R_t^i - \frac{1}{n} R_t^j - \mu_t,$$
$$X_{t+1}^j - \mu_{t+1} = \frac{X_t^i + X_t^j}{2} + \frac{n-1}{n} R_t^j - \frac{1}{n} R_t^i - \mu_t.$$
For $k \ne i, j$ we get that
$$X_{t+1}^k - \mu_{t+1} = X_t^k - \frac{1}{n}(R_t^i + R_t^j) - \mu_t.$$
Hence:
$$\Delta_t^{i,j} = \Big\|\frac{X_t^i + X_t^j}{2} + \frac{n-1}{n} R_t^i - \frac{1}{n} R_t^j - \mu_t\Big\|^2 - \|X_t^i - \mu_t\|^2 + \Big\|\frac{X_t^i + X_t^j}{2} + \frac{n-1}{n} R_t^j - \frac{1}{n} R_t^i - \mu_t\Big\|^2 - \|X_t^j - \mu_t\|^2 + \sum_{k \ne i,j} \Big(\Big\|X_t^k - \frac{1}{n}(R_t^i + R_t^j) - \mu_t\Big\|^2 - \|X_t^k - \mu_t\|^2\Big)$$
$$= 2\Big\|\frac{X_t^i - \mu_t}{2} + \frac{X_t^j - \mu_t}{2}\Big\|^2 - \|X_t^i - \mu_t\|^2 - \|X_t^j - \mu_t\|^2 + \Big\langle X_t^i - \mu_t + X_t^j - \mu_t,\ \frac{n-2}{n} R_t^i + \frac{n-2}{n} R_t^j \Big\rangle + \Big\|\frac{n-1}{n} R_t^i - \frac{1}{n} R_t^j\Big\|^2 + \Big\|\frac{n-1}{n} R_t^j - \frac{1}{n} R_t^i\Big\|^2 + \sum_{k \ne i,j} 2\Big\langle X_t^k - \mu_t,\ -\frac{1}{n}(R_t^i + R_t^j) \Big\rangle + \sum_{k \ne i,j} \Big(\frac{1}{n}\Big)^2 \|R_t^i + R_t^j\|^2.$$
Observe that:
$$\sum_{k=1}^n \Big\langle X_t^k - \mu_t,\ -\frac{1}{n}(R_t^i + R_t^j) \Big\rangle = 0.$$
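After combining the above two equations, we get that:
$$\Delta_t^{i,j} = -\frac{\|X_t^i - X_t^j\|^2}{2} + \big\langle X_t^i - \mu_t + X_t^j - \mu_t,\ R_t^i + R_t^j \big\rangle + \frac{n-2}{n^2} \|R_t^i + R_t^j\|^2 + \Big\|\frac{n-1}{n} R_t^i - \frac{1}{n} R_t^j\Big\|^2 + \Big\|\frac{n-1}{n} R_t^j - \frac{1}{n} R_t^i\Big\|^2$$
$$\overset{\text{Cauchy-Schwarz}}{\le} -\frac{\|X_t^i - X_t^j\|^2}{2} + \big\langle X_t^i - \mu_t + X_t^j - \mu_t,\ R_t^i + R_t^j \big\rangle + 2\Big(\frac{n-2}{n^2} + \frac{1}{n^2} + \frac{(n-1)^2}{n^2}\Big)\big(\|R_t^i\|^2 + \|R_t^j\|^2\big). \tag{17}$$
Recall that $R_t^i = -\eta\, \tilde h_i(X_t^i) + \frac{\eta}{2} \tilde h_j(X_{p_t^j}^j)$ and $R_t^j = -\eta\, \tilde h_j(X_t^j) + \frac{\eta}{2} \tilde h_i(X_{p_t^i}^i)$. Using the Cauchy-Schwarz inequality we get that
$$\|R_t^i\|^2 \le 2\eta^2 \|\tilde h_i(X_t^i)\|^2 + \frac{\eta^2}{2} \|\tilde h_j(X_{p_t^j}^j)\|^2, \qquad \|R_t^j\|^2 \le 2\eta^2 \|\tilde h_j(X_t^j)\|^2 + \frac{\eta^2}{2} \|\tilde h_i(X_{p_t^i}^i)\|^2.$$
Denote $2\eta^2 \|\tilde h_i(X_t^i)\|^2 + \frac{\eta^2}{2} \|\tilde h_i(X_{p_t^i}^i)\|^2$ by $S_t^i$ and $2\eta^2 \|\tilde h_j(X_t^j)\|^2 + \frac{\eta^2}{2} \|\tilde h_j(X_{p_t^j}^j)\|^2$ by $S_t^j$, so that $\|R_t^i\|^2 + \|R_t^j\|^2 \le S_t^i + S_t^j$. Hence (17) can be rewritten as:
$$\Delta_t^{i,j} \le -\frac{\|X_t^i - X_t^j\|^2}{2} + \big\langle X_t^i - \mu_t + X_t^j - \mu_t,\ R_t^i + R_t^j \big\rangle + 2\Big(\frac{n-2}{n^2} + \frac{1}{n^2} + \frac{(n-1)^2}{n^2}\Big)\big(S_t^i + S_t^j\big)$$
$$\le -\frac{\|X_t^i - X_t^j\|^2}{2} + \big\langle X_t^i - \mu_t + X_t^j - \mu_t,\ R_t^i + R_t^j \big\rangle + 2\big(S_t^i + S_t^j\big).$$
Further:
$$\big\langle X_t^i - \mu_t + X_t^j - \mu_t,\ R_t^i + R_t^j \big\rangle \overset{\text{Young}}{\le} \frac{\lambda_2 \|X_t^i - \mu_t + X_t^j - \mu_t\|^2}{8r} + \frac{2r \|R_t^i + R_t^j\|^2}{\lambda_2}$$
$$\overset{\text{Cauchy-Schwarz}}{\le} \frac{\lambda_2 \|X_t^i - \mu_t\|^2 + \lambda_2 \|X_t^j - \mu_t\|^2}{4r} + \frac{4r \|R_t^i\|^2 + 4r \|R_t^j\|^2}{\lambda_2} \le \frac{\lambda_2 \|X_t^i - \mu_t\|^2 + \lambda_2 \|X_t^j - \mu_t\|^2}{4r} + \frac{4r (S_t^i + S_t^j)}{\lambda_2}.$$
This gives us:
$$\sum_{(i,j)\in E} \Delta_t^{i,j} \le \sum_{(i,j)\in E} \Big(-\frac{\|X_t^i - X_t^j\|^2}{2} + \frac{\lambda_2 \|X_t^i - \mu_t\|^2 + \lambda_2 \|X_t^j - \mu_t\|^2}{4r} + \frac{4r (S_t^i + S_t^j)}{\lambda_2} + 2(S_t^i + S_t^j)\Big)$$
$$\overset{\text{Lemma C.2}}{\le} -\frac{\lambda_2 \Gamma_t}{2} + \sum_{i=1}^n \Big(2r + \frac{4r^2}{\lambda_2}\Big) S_t^i + \sum_{i=1}^n \frac{\lambda_2 \|X_t^i - \mu_t\|^2}{4} = -\frac{\lambda_2 \Gamma_t}{4} + \sum_{i=1}^n \Big(2r + \frac{4r^2}{\lambda_2}\Big) S_t^i.$$
Next, we use the above inequality to upper bound $\Delta_t$ in expectation:
$$\mathbb{E}[\Delta_t \mid X_0, X_1, \ldots, X_t] = \frac{1}{rn/2} \sum_{(i,j)\in E} \mathbb{E}[\Delta_t^{i,j} \mid X_0, X_1, \ldots, X_t] \le \frac{1}{rn/2} \Big(-\frac{\lambda_2 \Gamma_t}{4} + \sum_{i=1}^n \Big(2r + \frac{4r^2}{\lambda_2}\Big) \mathbb{E}[S_t^i \mid X_0, X_1, \ldots, X_t]\Big)$$
$$= -\frac{\lambda_2 \Gamma_t}{2rn} + \sum_{i=1}^n \Big(4 + \frac{8r}{\lambda_2}\Big) \frac{\mathbb{E}[S_t^i \mid X_0, X_1, \ldots, X_t]}{n}.$$
Finally, we remove the conditioning:
$$\mathbb{E}[\Delta_t] = \mathbb{E}\big[\mathbb{E}[\Delta_t \mid X_0, X_1, \ldots, X_t]\big] \le -\frac{\lambda_2\, \mathbb{E}[\Gamma_t]}{2rn} + \Big(4 + \frac{8r}{\lambda_2}\Big) \sum_{i=1}^n \frac{\mathbb{E}[S_t^i]}{n}.$$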
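By considering the definitions of $\Delta_t$ and $S_t^i$, we get the proof of the lemma. Next, we upper bound the second moment of the local updates, for any step $t$ and node $i$:

Lemma F.2.
$$\sum_{i=1}^n \mathbb{E}\|\eta\, \tilde h_i(X_t^i)\|^2 \le 2\eta^2 n H^2 M^2.$$

Proof.
$$\sum_{i=1}^n \mathbb{E}\|\eta\, \tilde h_i(X_t^i)\|^2 = \eta^2 \sum_{u=1}^{\infty} \Pr[H_t^i = u] \sum_{i=1}^n \mathbb{E}\Big\|\sum_{q=1}^u \tilde h_i^q(X_t^i)\Big\|^2 \le \eta^2 \sum_{u=1}^{\infty} \Pr[H_t^i = u] \sum_{i=1}^n u \sum_{q=1}^u \mathbb{E}\|\tilde h_i^q(X_t^i)\|^2 \overset{(5)}{\le} \eta^2 \sum_{u=1}^{\infty} \Pr[H_t^i = u]\, u^2 \sum_{i=1}^n M^2 \le 2n\eta^2 H^2 M^2,$$
where in the last step we used
$$\sum_{u=1}^{\infty} \Pr[H_t^i = u]\, u^2 = \mathbb{E}[(H_t^i)^2] = 2H^2 - H \le 2H^2.$$
This allows us to upper bound the potential in expectation for any step $t$.

Lemma F.3.
$$\mathbb{E}[\Gamma_t] \le \Big(\frac{40r}{\lambda_2} + \frac{80r^2}{\lambda_2^2}\Big) n \eta^2 H^2 M^2. \tag{18}$$

Proof. We prove the claim by induction. The base case $t = 0$ trivially holds. For the induction step, we assume that $\mathbb{E}[\Gamma_t] \le \big(\frac{40r}{\lambda_2} + \frac{80r^2}{\lambda_2^2}\big) n \eta^2 H^2 M^2$. We get that:
$$\mathbb{E}[\Gamma_{t+1}] \le \mathbb{E}[\Gamma_t]\Big(1 - \frac{\lambda_2}{2rn}\Big) + \Big(2 + \frac{4r}{\lambda_2}\Big) \frac{\eta^2}{n} \sum_{i=1}^n \Big(4\,\mathbb{E}\|\tilde h_i(X_t^i)\|^2 + \mathbb{E}\|\tilde h_i(X_{p_t^i}^i)\|^2\Big) \overset{\text{Lemma F.2}}{\le} \Big(1 - \frac{\lambda_2}{2rn}\Big) \mathbb{E}[\Gamma_t] + \Big(20 + \frac{40r}{\lambda_2}\Big) H^2 M^2 \eta^2$$
$$\le \Big(1 - \frac{\lambda_2}{2rn}\Big) \Big(\frac{40r}{\lambda_2} + \frac{80r^2}{\lambda_2^2}\Big) n \eta^2 H^2 M^2 + \Big(20 + \frac{40r}{\lambda_2}\Big) H^2 M^2 \eta^2 = \Big(\frac{40r}{\lambda_2} + \frac{80r^2}{\lambda_2^2}\Big) n \eta^2 H^2 M^2.$$

The next lemma allows us to upper bound $\sum_{i=1}^n \mathbb{E}\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \rangle$, which will be used later once we apply $L$-smoothness to upper bound $f(\mu_{t+1})$. The intuition is as follows: if for each $i$, $\tilde h_i(X_t^i)$ was just a single stochastic gradient ($H_i = 1$), then by the unbiasedness property we would have to upper bound
$$\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\nabla f(X_t^i) \big\rangle = \sum_{i=1}^n \Big(\mathbb{E}\big\langle \nabla f(\mu_t), \nabla f(\mu_t) - \nabla f(X_t^i) \big\rangle - \mathbb{E}\|\nabla f(\mu_t)\|^2\Big),$$
which can be done by using $L$-smoothness and then the definition of $\Gamma_t$.

Lemma F.4. For any time step $t$:
$$\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \big\rangle \le 2HL^2\, \mathbb{E}[\Gamma_t] - \frac{3Hn}{4}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + 12 H^3 n L^2 M^2 \eta^2. \tag{19}$$

Proof.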
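$$\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \big\rangle = \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u]\, \mathbb{E}\Big\langle \nabla f(\mu_t), -\sum_{q=1}^u \tilde h_i^q(X_t^i) \Big\rangle = \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u] \sum_{q=1}^u \Big(\mathbb{E}\big\langle \nabla f(\mu_t), \nabla f(\mu_t) - \tilde h_i^q(X_t^i) \big\rangle - \mathbb{E}\|\nabla f(\mu_t)\|^2\Big)$$
$$= \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u] \sum_{q=1}^u \Big(\mathbb{E}\Big\langle \nabla f(\mu_t), \nabla f(\mu_t) - \nabla f\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big) \Big\rangle - \mathbb{E}\|\nabla f(\mu_t)\|^2\Big).$$
Using Young's inequality we can upper bound $\mathbb{E}\big\langle \nabla f(\mu_t), \nabla f(\mu_t) - \nabla f\big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\big) \big\rangle$ by
$$\frac{\mathbb{E}\|\nabla f(\mu_t)\|^2}{4} + \mathbb{E}\Big\|\nabla f(\mu_t) - \nabla f\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big)\Big\|^2.$$
Plugging this into the above equality we get:
$$\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \big\rangle \le \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u] \sum_{q=1}^u \Big(\mathbb{E}\Big\|\nabla f(\mu_t) - \nabla f\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big)\Big\|^2 - \frac{3\,\mathbb{E}\|\nabla f(\mu_t)\|^2}{4}\Big)$$
$$\overset{(4)}{\le} \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u] \sum_{q=1}^u \Big(L^2\, \mathbb{E}\Big\|\mu_t - X_t^i + \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big\|^2 - \frac{3\,\mathbb{E}\|\nabla f(\mu_t)\|^2}{4}\Big).$$
Next we use the Cauchy-Schwarz inequality on $\mathbb{E}\big\|\mu_t - X_t^i + \sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\big\|^2$:
$$\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \big\rangle \le \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u] \sum_{q=1}^u \Big(2L^2\, \mathbb{E}\|\mu_t - X_t^i\|^2 + 2L^2\, \mathbb{E}\Big\|\sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\Big\|^2 - \frac{3\,\mathbb{E}\|\nabla f(\mu_t)\|^2}{4}\Big).$$
The term $\mathbb{E}\big\|\sum_{s=0}^{q-1} \eta\, \tilde h_i^s(X_t^i)\big\|^2$ can be upper bounded by $\eta^2 q^2 M^2$ using Cauchy-Schwarz and assumption (5). Hence:
$$\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \big\rangle \le \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u] \sum_{q=1}^u \Big(2L^2\, \mathbb{E}\|\mu_t - X_t^i\|^2 + 2L^2 \eta^2 q^2 M^2 - \frac{3\,\mathbb{E}\|\nabla f(\mu_t)\|^2}{4}\Big)$$
$$= \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u]\, u \Big(2L^2\, \mathbb{E}\|\mu_t - X_t^i\|^2 - \frac{3\,\mathbb{E}\|\nabla f(\mu_t)\|^2}{4}\Big) + \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u]\, u(u+1)(2u+1) L^2 M^2 \eta^2 / 3. \tag{20}$$
Note that:
$$\sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u]\, u \Big(2L^2\, \mathbb{E}\|\mu_t - X_t^i\|^2 - \frac{3\,\mathbb{E}\|\nabla f(\mu_t)\|^2}{4}\Big) = 2HL^2\, \mathbb{E}[\Gamma_t] - \frac{3Hn}{4}\, \mathbb{E}\|\nabla f(\mu_t)\|^2. \tag{21}$$
Also:
$$\sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u]\, u(u+1)(2u+1) L^2 M^2 \eta^2 / 3 \le \sum_{i=1}^n \sum_{u=1}^{\infty} \Pr[H_t^i = u]\, 2u^3 L^2 M^2 \eta^2 \le 12 H^3 n L^2 M^2 \eta^2, \tag{22}$$
where in the last step we used (recall that $H_t^i$ is a geometric random variable with mean $H$):
$$\sum_{u=1}^{\infty} \Pr[H_t^i = u]\, u^3 = \mathbb{E}[(H_t^i)^3] = 6H^3 - 6H^2 + H \le 6H^3.$$
By plugging inequalities (22) and (21) into inequality (20) we get the proof of the lemma.

Our next goal is to upper bound $\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), \tilde h_i(X_{p_t^i}^i) \big\rangle$.

Lemma F.5.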
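$$\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), \tilde h_i(X_{p_t^i}^i) \big\rangle \le 2HL^2 \sum_{i=1}^n \mathbb{E}\|\mu_t - X_{p_t^i}^i\|^2 + \frac{5Hn}{4}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + 12 H^3 n L^2 M^2 \eta^2. \tag{23}$$

Proof. The proof is very similar to the proof of Lemma F.4, except that when we subtract and add the term $\mathbb{E}\|\nabla f(\mu_t)\|^2$ in the proof, it eventually ends up with a positive sign (after using Young's inequality it has a factor of $\frac{1}{4} + 1$ instead of a factor of $\frac{1}{4} - 1$), and we have $\sum_{i=1}^n \mathbb{E}\|\mu_t - X_{p_t^i}^i\|^2$ instead of $\mathbb{E}[\Gamma_t] = \sum_{i=1}^n \mathbb{E}\|\mu_t - X_t^i\|^2$. Thus, we omit the proof in this case.

The next step is to upper bound $\sum_{i=1}^n \mathbb{E}\|\mu_t - X_{p_t^i}^i\|^2$; for this we will need the following lemma:

Lemma F.6. For any node $i$ and time step $t$,
$$\mathbb{E}\|\mu_t - \mu_{p_t^i}\|^2 \le 10 \eta^2 H^2 M^2.$$

Proof. Notice that
$$\mathbb{E}\|\mu_t - \mu_{p_t^i}\|^2 = \sum_{t'=0}^{t} \Pr[p_t^i = t']\, \mathbb{E}\Big\|\sum_{s=t'}^{t-1} (\mu_{s+1} - \mu_s)\Big\|^2 \le \sum_{t'=0}^{t} \sum_{s=t'}^{t-1} \Pr[p_t^i = t']\, (t - t')\, \mathbb{E}\|\mu_{s+1} - \mu_s\|^2, \tag{24}$$
where we used the Cauchy-Schwarz inequality in the last step. Fix a step $s$. Let $u$ and $v$ be the nodes which interact at step $s + 1$. We have that
$$\mathbb{E}\|\mu_{s+1} - \mu_s\|^2 = \mathbb{E}\Big\|-\frac{\eta\, \tilde h_u(X_s^u)}{n} + \frac{\eta\, \tilde h_u(X_{p_s^u}^u)}{2n} - \frac{\eta\, \tilde h_v(X_s^v)}{n} + \frac{\eta\, \tilde h_v(X_{p_s^v}^v)}{2n}\Big\|^2$$
$$\le \frac{4\eta^2}{n^2}\, \mathbb{E}\|\tilde h_u(X_s^u)\|^2 + \frac{4\eta^2}{n^2}\, \mathbb{E}\|\tilde h_v(X_s^v)\|^2 + \frac{\eta^2}{n^2}\, \mathbb{E}\|\tilde h_u(X_{p_s^u}^u)\|^2 + \frac{\eta^2}{n^2}\, \mathbb{E}\|\tilde h_v(X_{p_s^v}^v)\|^2.$$
We again used the Cauchy-Schwarz inequality; the expectation is taken only over the randomness of sampling and the number of local steps. We can use the approach from Lemma F.2 to upper bound $\mathbb{E}\|\tilde h_u(X_s^u)\|^2$, $\mathbb{E}\|\tilde h_v(X_s^v)\|^2$, $\mathbb{E}\|\tilde h_u(X_{p_s^u}^u)\|^2$ and $\mathbb{E}\|\tilde h_v(X_{p_s^v}^v)\|^2$ by $2H^2M^2$ each. (In that lemma we upper bound the sum of $n$ similar terms, each multiplied by $\eta^2$.) Hence:
$$\mathbb{E}\|\mu_t - \mu_{p_t^i}\|^2 \le \sum_{t'=0}^{t} \sum_{s=t'}^{t-1} \Pr[p_t^i = t']\, (t - t')\, \mathbb{E}\|\mu_{s+1} - \mu_s\|^2 \le \sum_{t'=0}^{t} \Pr[p_t^i = t']\, (t - t')^2\, \frac{20 \eta^2 H^2 M^2}{n^2} = \frac{20 \eta^2 H^2 M^2}{n^2}\, \mathbb{E}[(t - p_t^i)^2].$$
Here $t - p_t^i$ is a geometric random variable with mean $n/2$ (because the probability that node $i$ interacts is $2/n$ at every step). Thus,
$$\mathbb{E}[(t - p_t^i)^2] = 2\big(\mathbb{E}[t - p_t^i]\big)^2 - \mathbb{E}[t - p_t^i] \le \frac{n^2}{2},$$
and therefore
$$\mathbb{E}\|\mu_t - \mu_{p_t^i}\|^2 \le 10 \eta^2 H^2 M^2.$$
Finally we can show that:

Lemma F.7.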
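For any time step $t$,
$$\sum_{i=1}^n \mathbb{E}\|\mu_t - X_{p_t^i}^i\|^2 \le 20 n H^2 M^2 \eta^2 + \Big(\frac{80r}{\lambda_2} + \frac{160r^2}{\lambda_2^2}\Big) n^2 \eta^2 H^2 M^2.$$

Proof. Using the Cauchy-Schwarz inequality we get:
$$\sum_{i=1}^n \mathbb{E}\|\mu_t - X_{p_t^i}^i\|^2 \le \sum_{i=1}^n \Big(2\,\mathbb{E}\|\mu_t - \mu_{p_t^i}\|^2 + 2\,\mathbb{E}\|\mu_{p_t^i} - X_{p_t^i}^i\|^2\Big) \le \sum_{i=1}^n \Big(2\,\mathbb{E}\|\mu_t - \mu_{p_t^i}\|^2 + 2\,\mathbb{E}[\Gamma_{p_t^i}]\Big) \le 20 n H^2 M^2 \eta^2 + \Big(\frac{80r}{\lambda_2} + \frac{160r^2}{\lambda_2^2}\Big) n^2 \eta^2 H^2 M^2,$$
where the last inequality comes from Lemmas F.3 and F.6.

Now we are ready to prove the main theorem.

Theorem F.8. Let $f$ be a non-convex, $L$-smooth function satisfying assumption (5), whose minimum $x^*$ we are trying to find via the non-blocking version of the SwarmSGD procedure (see Algorithm 2). Let the number of local stochastic gradient steps performed by each agent upon interaction be a geometric random variable with mean $H$. Let the learning rate we use be $\eta = n/\sqrt{T}$. Define $\mu_t = \sum_{i=1}^n X_t^i / n$, where $X_t^i$ is the value of model $i$ after $t$ interactions. Then, for learning rate $\eta = n/\sqrt{T}$ and any $T \ge n^4 (n + 1)^2$:
$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{4(f(\mu_0) - f(x^*))}{\sqrt{T} H} + \frac{2304 H^2 \max(1, L^2) M^2}{\sqrt{T}} \Big(\frac{r^2}{\lambda_2^2} + 1\Big).$$

Proof. Let $\mathbb{E}_t$ denote expectation conditioned on $\{X_t^1, X_t^2, \ldots, X_t^n\}$. By $L$-smoothness we have that
$$\mathbb{E}_t[f(\mu_{t+1})] \le f(\mu_t) + \mathbb{E}_t\big\langle \nabla f(\mu_t), \mu_{t+1} - \mu_t \big\rangle + \frac{L}{2}\, \mathbb{E}_t\|\mu_{t+1} - \mu_t\|^2.$$
After removing the conditioning:
$$\mathbb{E}[f(\mu_{t+1})] = \mathbb{E}\big[\mathbb{E}_t[f(\mu_{t+1})]\big] \le \mathbb{E}[f(\mu_t)] + \mathbb{E}\big\langle \nabla f(\mu_t), \mu_{t+1} - \mu_t \big\rangle + \frac{L}{2}\, \mathbb{E}\|\mu_{t+1} - \mu_t\|^2. \tag{25}$$
First we look at $\mathbb{E}[\mu_{t+1} - \mu_t]$. If agents $i$ and $j$ interact (which happens with probability $\frac{1}{rn/2}$), we have that
$$\mu_{t+1} - \mu_t = -\frac{\eta}{n} \tilde h_i(X_t^i) - \frac{\eta}{n} \tilde h_j(X_t^j) + \frac{\eta}{2n} \tilde h_i(X_{p_t^i}^i) + \frac{\eta}{2n} \tilde h_j(X_{p_t^j}^j).$$
Hence we get that
$$\mathbb{E}_t[\mu_{t+1} - \mu_t] = \frac{1}{rn/2} \sum_{(i,j)\in E} \Big(\mathbb{E}_t\Big[-\frac{\eta}{n} \tilde h_i(X_t^i) - \frac{\eta}{n} \tilde h_j(X_t^j)\Big] + \frac{\eta}{2n} \tilde h_i(X_{p_t^i}^i) + \frac{\eta}{2n} \tilde h_j(X_{p_t^j}^j)\Big) = -\frac{2\eta}{n^2} \sum_{i=1}^n \mathbb{E}_t[\tilde h_i(X_t^i)] + \frac{\eta}{n^2} \sum_{i=1}^n \tilde h_i(X_{p_t^i}^i),$$
and
$$\mathbb{E}[\mu_{t+1} - \mu_t] = \mathbb{E}\big[\mathbb{E}_t[\mu_{t+1} - \mu_t]\big] = -\frac{2\eta}{n^2} \sum_{i=1}^n \mathbb{E}[\tilde h_i(X_t^i)] + \frac{\eta}{n^2} \sum_{i=1}^n \mathbb{E}[\tilde h_i(X_{p_t^i}^i)].$$
Next we look at $\mathbb{E}\|\mu_{t+1} - \mu_t\|^2$.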
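If agents $i$ and $j$ interact (which happens with probability $\frac{1}{rn/2}$), we have that
$$\mu_{t+1} - \mu_t = -\frac{\eta}{n} \tilde h_i(X_t^i) - \frac{\eta}{n} \tilde h_j(X_t^j) + \frac{\eta}{2n} \tilde h_i(X_{p_t^i}^i) + \frac{\eta}{2n} \tilde h_j(X_{p_t^j}^j).$$
Hence we get that
$$\mathbb{E}_t\|\mu_{t+1} - \mu_t\|^2 = \frac{1}{rn/2} \sum_{(i,j)\in E} \mathbb{E}_t\Big\|-\frac{\eta}{n} \tilde h_i(X_t^i) - \frac{\eta}{n} \tilde h_j(X_t^j) + \frac{\eta}{2n} \tilde h_i(X_{p_t^i}^i) + \frac{\eta}{2n} \tilde h_j(X_{p_t^j}^j)\Big\|^2$$
$$\overset{\text{Cauchy-Schwarz}}{\le} \frac{1}{rn/2} \sum_{(i,j)\in E} \frac{\eta^2}{n^2} \Big(4\,\mathbb{E}_t\|\tilde h_i(X_t^i)\|^2 + 4\,\mathbb{E}_t\|\tilde h_j(X_t^j)\|^2 + \|\tilde h_i(X_{p_t^i}^i)\|^2 + \|\tilde h_j(X_{p_t^j}^j)\|^2\Big)$$
$$= \frac{2}{n} \sum_{i=1}^n \frac{4\eta^2}{n^2}\, \mathbb{E}_t\|\tilde h_i(X_t^i)\|^2 + \frac{2}{n} \sum_{i=1}^n \frac{\eta^2}{n^2} \|\tilde h_i(X_{p_t^i}^i)\|^2 \overset{\text{Lemma F.2}}{\le} \frac{16 \eta^2 H^2 M^2}{n^2} + \frac{2}{n} \sum_{i=1}^n \frac{\eta^2}{n^2} \|\tilde h_i(X_{p_t^i}^i)\|^2,$$
and
$$\mathbb{E}\|\mu_{t+1} - \mu_t\|^2 = \mathbb{E}\big[\mathbb{E}_t\|\mu_{t+1} - \mu_t\|^2\big] \le \frac{16 \eta^2 H^2 M^2}{n^2} + \frac{2}{n} \sum_{i=1}^n \frac{\eta^2}{n^2}\, \mathbb{E}\|\tilde h_i(X_{p_t^i}^i)\|^2 \overset{\text{Lemma F.2}}{\le} \frac{20 \eta^2 H^2 M^2}{n^2}.$$
Hence, we can rewrite (25) as:
$$\mathbb{E}[f(\mu_{t+1})] \le \mathbb{E}[f(\mu_t)] + \frac{2\eta}{n^2} \sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), -\tilde h_i(X_t^i) \big\rangle + \frac{\eta}{n^2} \sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), \tilde h_i(X_{p_t^i}^i) \big\rangle + \frac{20 L \eta^2 H^2 M^2}{n^2}.$$
Next, we use Lemmas F.4 and F.5:
$$\mathbb{E}[f(\mu_{t+1})] \le \mathbb{E}[f(\mu_t)] + \frac{2\eta}{n^2} \Big(2HL^2\, \mathbb{E}[\Gamma_t] - \frac{3Hn}{4}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + 12 H^3 n L^2 M^2 \eta^2\Big) + \frac{\eta}{n^2} \Big(2HL^2 \sum_{i=1}^n \mathbb{E}\|\mu_t - X_{p_t^i}^i\|^2 + \frac{5Hn}{4}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + 12 H^3 n L^2 M^2 \eta^2\Big) + \frac{20 L \eta^2 H^2 M^2}{n^2}$$
$$= \mathbb{E}[f(\mu_t)] + \frac{4HL^2 \eta\, \mathbb{E}[\Gamma_t]}{n^2} + \frac{2HL^2 \eta \sum_{i=1}^n \mathbb{E}\|\mu_t - X_{p_t^i}^i\|^2}{n^2} - \frac{H\eta}{4n}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + \frac{36 H^3 L^2 M^2 \eta^3}{n} + \frac{20 L \eta^2 H^2 M^2}{n^2}.$$
We use Lemmas F.3 and F.7 to upper bound $\mathbb{E}[\Gamma_t]$ and $\sum_{i=1}^n \mathbb{E}\|\mu_t - X_{p_t^i}^i\|^2$ respectively:
$$\mathbb{E}[f(\mu_{t+1})] - \mathbb{E}[f(\mu_t)] \le \Big(\frac{160r}{\lambda_2} + \frac{320r^2}{\lambda_2^2}\Big) \frac{\eta^3 H^3 M^2 L^2 (n^2 + n)}{n^2} - \frac{\eta H}{4n}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + \frac{76 H^3 L^2 M^2 \eta^3}{n} + \frac{20 L \eta^2 H^2 M^2}{n^2}.$$
By summing the above inequality for $t = 0$ to $t = T - 1$, we get that
$$\mathbb{E}[f(\mu_T)] - f(\mu_0) \le \sum_{t=0}^{T-1} \Big(\Big(\frac{160r}{\lambda_2} + \frac{320r^2}{\lambda_2^2}\Big) \frac{\eta^3 H^3 M^2 L^2 (n + 1)}{n} - \frac{\eta H}{4n}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 + \frac{20 L \eta^2 H^2 M^2}{n^2} + \frac{76 H^3 L^2 M^2 \eta^3}{n}\Big).$$
From this we get that:
$$\sum_{t=0}^{T-1} \frac{\eta H}{4n}\, \mathbb{E}\|\nabla f(\mu_t)\|^2 \le f(\mu_0) - \mathbb{E}[f(\mu_T)] + \Big(\frac{160r}{\lambda_2} + \frac{320r^2}{\lambda_2^2}\Big) \frac{\eta^3 H^3 M^2 L^2 T (n + 1)}{n} + \frac{20 T L \eta^2 H^2 M^2}{n^2} + \frac{76 T H^3 L^2 M^2 \eta^3}{n}.$$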
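Note that $\mathbb{E}[f(\mu_T)] \ge f(x^*)$; hence, after multiplying the above inequality by $\frac{4n}{\eta H T}$, we get that
$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{4n(f(\mu_0) - f(x^*))}{T H \eta} + \Big(\frac{640r}{\lambda_2} + \frac{1280r^2}{\lambda_2^2}\Big) \eta^2 H^2 M^2 L^2 (n + 1) + \frac{80 L \eta H M^2}{n} + 304 H^2 L^2 M^2 \eta^2.$$
Observe that $\eta = n/\sqrt{T} \le \frac{1}{n(n+1)}$, since $T \ge n^4 (n + 1)^2$. This allows us to finish the proof:
$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{4n(f(\mu_0) - f(x^*))}{T H \eta} + \Big(\frac{640r}{\lambda_2} + \frac{1280r^2}{\lambda_2^2}\Big) \frac{L^2 \eta M^2 H^2}{n} + \frac{80 L \eta H M^2}{n} + \frac{304 H^2 L^2 M^2 \eta}{n}$$
$$= \frac{4(f(\mu_0) - f(x^*))}{\sqrt{T} H} + \Big(\frac{640r}{\lambda_2} + \frac{1280r^2}{\lambda_2^2}\Big) \frac{L^2 M^2 H^2}{\sqrt{T}} + \frac{80 L H M^2}{\sqrt{T}} + \frac{304 H^2 L^2 M^2}{\sqrt{T}} \le \frac{4(f(\mu_0) - f(x^*))}{\sqrt{T} H} + \frac{2304 H^2 \max(1, L^2) M^2}{\sqrt{T}} \Big(\frac{r^2}{\lambda_2^2} + 1\Big).$$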
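The proofs in this section rely on two moment identities for a geometric random variable with mean $H$ (support $\{1, 2, \ldots\}$, success probability $1/H$): $\mathbb{E}[X^2] = 2H^2 - H$, used in Lemmas F.2 and F.6, and $\mathbb{E}[X^3] = 6H^3 - 6H^2 + H$, used in Lemma F.4. The following sketch checks them by truncated summation; the function name and the truncation length are illustrative choices.

```python
def geometric_moment(mean_H, power, num_terms=5000):
    """Truncated series for E[X^power], where X is geometric on {1, 2, ...}
    with success probability 1 / mean_H (and hence mean mean_H)."""
    p = 1.0 / mean_H
    return sum(u**power * p * (1.0 - p) ** (u - 1) for u in range(1, num_terms + 1))


H = 5
second = geometric_moment(H, 2)   # closed form: 2*H**2 - H
third = geometric_moment(H, 3)    # closed form: 6*H**3 - 6*H**2 + H
```

The same identity with mean $n/2$ gives $\mathbb{E}[(t - p_t^i)^2] = n^2/2 - n/2 \le n^2/2$, which is the bound used at the end of the proof of Lemma F.6.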

G ANALYSIS OF QUANTIZED AVERAGING, ASSUMING SECOND MOMENT BOUND AND RANDOM NUMBER OF LOCAL STEPS

First we define how quantization of models changes our interactions. Both the algorithm and the analysis in this case are similar to those of Section F. Let $i$ and $j$ be the nodes which interact at step $t + 1$. We set
$$X_{t+1/2}^i = \frac{X_t^i}{2} + \frac{{X'}_t^j}{2}, \qquad X_{t+1/2}^j = \frac{X_t^j}{2} + \frac{{X'}_t^i}{2},$$
and
$$X_{t+1}^i = X_{t+1/2}^i - \eta\, \tilde h_i(X_t^i), \qquad X_{t+1}^j = X_{t+1/2}^j - \eta\, \tilde h_j(X_t^j),$$
where for each node $k$, ${X'}_t^k$ is a quantized version of the model $X_t^k$. We use the quantization provided in Davies et al. (2020). The key property of this quantization scheme is summarized below:

Lemma G.1. Let $q$ be a parameter we will fix later. If the inputs $x_u$, $x_v$ at nodes $u$ and $v$, respectively, satisfy $\|x_u - x_v\| \le q^{q^d} \epsilon$, then with probability at least $1 - \log\log\big(\frac{1}{\epsilon}\|x_u - x_v\|\big) \cdot O(q^{-d})$, the quantization algorithm of Davies et al. (2020) provides node $v$ with an unbiased estimate $x'_u$ of $x_u$, with $\|x'_u - x_u\| \le (q^2 + 7)\epsilon$, and uses $O\big(d \log\big(\frac{q}{\epsilon}\|x_u - x_v\|\big)\big)$ bits to do so.

For our purposes, $x_u$ and $x_v$ are the local models of the nodes $u$ and $v$, and $d$ is their dimension (we omit the time step here). In the following, we refer to the above lemma as the quantization lemma. Recall that in Section F, for each node $k$, if $p_t^k + 1$ is the last time it interacted before and including step $t$: ${X'}_t^k = X_{p_t^k + 1/2}^k = X_t^k + \eta\, \tilde h_k(X_{p_t^k}^k)$.

Analysis Outline. The crucial property we used in the analysis is that $\mathbb{E}\|\eta\, \tilde h_k(X_{p_t^k}^k)\|^2 \le 2H^2 M^2 \eta^2$ (see Lemma F.2). We also used that, for $\tilde h_k(X_{p_t^k}^k)$, we can use the smoothness property (4) in Lemmas F.5 and F.7. In our case we plan to use the quantization lemma above. For this, first notice that the estimate is unbiased: this means that $\mathbb{E}[{X'}_t^k] = X_t^k$, eliminating the need to use Lemma F.5 and subsequently F.7, since $\sum_{i=1}^n \mathbb{E}\big\langle \nabla f(\mu_t), {X'}_t^i - X_t^i \big\rangle = 0$ in our case. Secondly, if we set $(q^2 + 7)\epsilon = H\eta M$, we also satisfy Lemma F.2.
This means that entire analysis can be replicated, and even further as in the case of section E we will only need T ≥ n 4 (In one case we do not use Lemma F.5 at all, and in the second case upper bound can be replaced by 0, which is the same as not using it). Now we concentrate on calculating the probability that x u -x v ≤ q q d (which we call the distance criterion) required by the quantization lemma to hold over T steps and each pair of nodes. We also need to calculate the probability with which we fail to decode. Assume that Lemma F.3 holds for step t, as in the proof of this lemma we will use induction and Lemma F.1 (we omit the proof since it will be exactly the same given that the conditions we discuss above hold). That is: E[Γ t ] ≤ ( 40r λ2 + 80r 2 λ 2 2 )nη 2 H 2 M 2 . Notice that for a pair of nodes u and v, X u t -X v t 2 ≤ 2Γ t (Using Cauchy-Schwarz). Hence we need to calculate the probability that Γ t ≥ (q q d ) 2 /2. Using Markov's inequality, the probability of this happening is at most: 2E[Γ t ] (q q d ) 2 ≤ 80r λ 2 + 160r 2 λ 2 2 n(q 2 + 7) 2 2 (q q d ) 2 = 80r λ 2 + 160r 2 λ 2 2 n(q 2 + 7) 2 (q q d ) 2 . We set q = 2 + T 3/d , this means that (q q d ) 2 ≥ 2 T 3 . So given that T ≥ n 4 , we have that P r[Γ t ≥ (q q d ) 2 /2] ≤ O(1/T 4 ) (note that r ≤ n -1 and λ 2 = Ω(1/n 2 ) since our graph is connected). Hence the distance criterion is satisfied with probability 1 -O(1/T 2 ). Given that it is satisfied, we also have failure probability log log ( 1 X u t -X v t ) • O(q -d ) = O( log log T T 3 ). So, the total probability of failure, either due to contravening the distance criterion or by probabilistic failure, is at most O(T -2 ). Hence with probability 1 -O(1/T 2 ) we can use Lemma F.1 and prove that E[Γ t+1 ] ≤ 40r λ 2 + 80r 2 λ 2 2 nη 2 H 2 M 2 . What is left is to union bound over T steps 2 , and we get that with probability 1 -O(1/T ) = 1 -O(1/n 4 ) the quantization algorithm never fails and the distance criterion is always satisfied. 
The total number of bits used per step is $O(d\log q) = O(d + \log T)$. With this we can state the main theorem:

Theorem G.2. Let $f$ be a non-convex, $L$-smooth function whose stochastic gradients satisfy the bounded second moment assumption above. Consider the quantized version of Algorithm 1. Let the number of local stochastic gradient steps performed by each agent upon interaction be a geometric random variable with mean $H$. Define $\mu_t = \sum_{i=1}^n X^i_t/n$, the average of the local parameters, where $X^i_t$ is the value of model $i$ after $t$ interactions. Then, for learning rate $\eta = n/\sqrt{T}$ and any number of interactions $T \ge n^4$, with probability at least $1 - O(1/n^4)$ we have that
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{4(f(\mu_0)-f(x^*))}{\sqrt{T}H} + \frac{2304H^2\max(1,L^2)M^2}{\sqrt{T}}\Big(\frac{r^2}{\lambda_2^2}+1\Big),
\]
and additionally we use $O(d + \log T)$ communication bits per step.
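The identity $O(d \log q) = O(d + \log T)$ for the choice $q = 2 + T^{3/d}$ can be illustrated numerically. The sketch below sets the constant hidden in the $O(\cdot)$ notation to 1, which is an assumption made only for illustration, and compares $d\log_2 q$ against the explicit bound $d\log_2 3 + 3\log_2 T$, which follows from $2 + y \le 3y$ for $y \ge 1$. The function names are hypothetical.

```python
import math

def quantization_parameter(T, d):
    """q = 2 + T**(3/d), the choice of q made in the analysis."""
    return 2.0 + T ** (3.0 / d)

def bits_per_step(T, d):
    """d * log2(q): the O(d log q) term with the hidden constant set to 1 (assumption)."""
    return d * math.log2(quantization_parameter(T, d))

def linear_plus_log_bound(T, d):
    """Explicit instance of the O(d + log T) form, valid for T >= 1."""
    return d * math.log2(3.0) + 3.0 * math.log2(T)

# Hypothetical settings: a tiny model and a large one, both with T = 10**8 interactions.
for d in (1, 10**6):
    print(d, bits_per_step(10**8, d), linear_plus_log_bound(10**8, d))
```

For large $d$ the exponent $3/d$ is tiny, so $q$ is close to 3 and the cost is dominated by the term linear in $d$; the $\log T$ term matters only for very small models.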

H FIXED NUMBER OF LOCAL STEPS WITH VARIANCE BOUND AND NON-IDENTICALLY DISTRIBUTED DATA

We again deal with a non-convex $L$-smooth function, but we no longer require a second moment bound, and no longer assume that data is distributed identically. We use a constant learning rate $\eta$ and a fixed number of local steps $H^i_t = H$, for each node $i$ and step $t$. Each agent $i$ has access to a local function $f_i$ such that:

1. For each agent $i$, the gradient $\nabla f_i(x)$ is $L$-Lipschitz continuous for some $L > 0$, i.e. for all $x, y \in \mathbb{R}^d$: $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$. (27)
2. For every $x \in \mathbb{R}^d$: $\sum_{i=1}^n f_i(x)/n = f(x)$. (28)
3. For each agent $i$ and $x \in \mathbb{R}^d$: $\mathbb{E}[\tilde g_i(x)] = \nabla f_i(x)$. (29)
4. For each agent $i$ and $x \in \mathbb{R}^d$ there exists $\sigma^2$ such that: $\mathbb{E}\|\tilde g_i(x) - \nabla f_i(x)\|^2 \le \sigma^2$. (30)
5. For each $x \in \mathbb{R}^d$ there exists $\rho^2$ such that: $\sum_{i=1}^n \|\nabla f_i(x) - \nabla f(x)\|^2/n \le \rho^2$. (31)

Notice that since data is not distributed identically, for $1 \le q \le H$, we no longer have that
\[
h^q_i(X^i_t) = \mathbb{E}[\tilde h^q_i(X^i_t)] = \mathbb{E}\Big[\tilde g_i\Big(X^i_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_i(X^i_t)\Big)\Big] = \nabla f\Big(X^i_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_i(X^i_t)\Big).
\]
Instead,
\[
h^q_i(X^i_t) = \mathbb{E}[\tilde h^q_i(X^i_t)] = \mathbb{E}\Big[\tilde g_i\Big(X^i_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_i(X^i_t)\Big)\Big] = \nabla f_i\Big(X^i_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_i(X^i_t)\Big).
\]
We proceed by proving the following lemma, which upper bounds the expected change in potential:

Lemma H.1. For any time step $t$, we have:
\[
\mathbb{E}[\Gamma_{t+1}] \le \Big(1 - \frac{\lambda_2}{2rn}\Big)\mathbb{E}[\Gamma_t] + \Big(2 + \frac{8r}{\lambda_2}\Big)\eta^2\sum_{i=1}^n \frac{\mathbb{E}\|\tilde h_i(X^i_t)\|^2}{n}.
\]

Proof. First we bound the change in potential $\Delta_t = \Gamma_{t+1} - \Gamma_t$ for some fixed time step $t > 0$. For this, let $\Delta^{i,j}_t$ be the change in potential when we choose agents $(i,j) \in E$ for interaction. We have that:
\begin{align*}
X^i_{t+1} &= \big(X^i_t + X^j_t + \eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big)/2, \\
X^j_{t+1} &= \big(X^i_t + X^j_t + \eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big)/2, \\
\mu_{t+1} &= \mu_t + \eta\,\tilde h_i(X^i_t)/n + \eta\,\tilde h_j(X^j_t)/n.
\end{align*}
This gives us that:
\begin{align*}
X^i_{t+1} - \mu_{t+1} &= \frac{X^i_t + X^j_t}{2} + \frac{n-2}{2n}\eta\,\tilde h_i(X^i_t) + \frac{n-2}{2n}\eta\,\tilde h_j(X^j_t) - \mu_t, \\
X^j_{t+1} - \mu_{t+1} &= \frac{X^i_t + X^j_t}{2} + \frac{n-2}{2n}\eta\,\tilde h_i(X^i_t) + \frac{n-2}{2n}\eta\,\tilde h_j(X^j_t) - \mu_t.
\end{align*}
For $k \ne i, j$ we get that
\[
X^k_{t+1} - \mu_{t+1} = X^k_t - \frac{1}{n}\big(\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big) - \mu_t.
\]
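Hence:
\begin{align*}
\Delta^{i,j}_t ={}& \Big\|\frac{X^i_t + X^j_t}{2} + \frac{n-2}{2n}\big(\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big) - \mu_t\Big\|^2 - \|X^i_t - \mu_t\|^2 \\
&+ \Big\|\frac{X^i_t + X^j_t}{2} + \frac{n-2}{2n}\big(\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big) - \mu_t\Big\|^2 - \|X^j_t - \mu_t\|^2 \\
&+ \sum_{k \ne i,j}\Big(\Big\|X^k_t - \frac{1}{n}\big(\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big) - \mu_t\Big\|^2 - \|X^k_t - \mu_t\|^2\Big) \\
={}& 2\Big\|\frac{X^i_t - \mu_t}{2} + \frac{X^j_t - \mu_t}{2}\Big\|^2 - \|X^i_t - \mu_t\|^2 - \|X^j_t - \mu_t\|^2 \\
&+ 2\Big\langle X^i_t - \mu_t + X^j_t - \mu_t,\ \frac{n-2}{2n}\eta\,\tilde h_i(X^i_t) + \frac{n-2}{2n}\eta\,\tilde h_j(X^j_t)\Big\rangle + 2\Big(\frac{n-2}{2n}\Big)^2\big\|\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big\|^2 \\
&+ \sum_{k \ne i,j} 2\Big\langle X^k_t - \mu_t,\ -\frac{1}{n}\big(\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big)\Big\rangle + \sum_{k \ne i,j}\Big(\frac{1}{n}\Big)^2\big\|\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big\|^2 .
\end{align*}
Observe that:
\[
\sum_{k=1}^n \Big\langle X^k_t - \mu_t,\ -\frac{1}{n}\big(\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big)\Big\rangle = 0.
\]
After combining the above two equations, we get that:
\begin{align*}
\Delta^{i,j}_t ={}& -\frac{\|X^i_t - X^j_t\|^2}{2} + \big\langle X^i_t - \mu_t + X^j_t - \mu_t,\ \eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big\rangle \\
&+ \Big(2\Big(\frac{n-2}{2n}\Big)^2 + (n-2)\Big(\frac{1}{n}\Big)^2\Big)\big\|\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big\|^2 \\
\overset{\text{Cauchy-Schwarz}}{\le}{}& -\frac{\|X^i_t - X^j_t\|^2}{2} + \big\langle X^i_t - \mu_t + X^j_t - \mu_t,\ \eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big\rangle \\
&+ 2\Big(2\Big(\frac{n-2}{2n}\Big)^2 + (n-2)\Big(\frac{1}{n}\Big)^2\Big)\Big(\|\eta\,\tilde h_i(X^i_t)\|^2 + \|\eta\,\tilde h_j(X^j_t)\|^2\Big) \\
\le{}& -\frac{\|X^i_t - X^j_t\|^2}{2} + \big\langle X^i_t - \mu_t + X^j_t - \mu_t,\ \eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big\rangle + \|\eta\,\tilde h_i(X^i_t)\|^2 + \|\eta\,\tilde h_j(X^j_t)\|^2 .
\end{align*}
This gives us:
\begin{align*}
\sum_{(i,j)\in E}\Delta^{i,j}_t \le{}& \sum_{(i,j)\in E}\Big(-\frac{\|X^i_t - X^j_t\|^2}{2} + \big\langle X^i_t - \mu_t + X^j_t - \mu_t,\ \eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big\rangle + \|\eta\,\tilde h_i(X^i_t)\|^2 + \|\eta\,\tilde h_j(X^j_t)\|^2\Big) \\
\overset{\text{Lemma C.2}}{\le}{}& -\frac{\lambda_2\Gamma_t}{2} + \sum_{i=1}^n r\|\eta\,\tilde h_i(X^i_t)\|^2 + \sum_{(i,j)\in E}\big\langle X^i_t - \mu_t + X^j_t - \mu_t,\ \eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\big\rangle \\
\overset{\text{Young}}{\le}{}& -\frac{\lambda_2\Gamma_t}{2} + \sum_{i=1}^n r\|\eta\,\tilde h_i(X^i_t)\|^2 + \sum_{(i,j)\in E}\Big(\frac{\lambda_2\|X^i_t - \mu_t + X^j_t - \mu_t\|^2}{8r} + \frac{2r\|\eta\,\tilde h_i(X^i_t) + \eta\,\tilde h_j(X^j_t)\|^2}{\lambda_2}\Big) \\
\overset{\text{Cauchy-Schwarz}}{\le}{}& -\frac{\lambda_2\Gamma_t}{2} + \sum_{i=1}^n r\|\eta\,\tilde h_i(X^i_t)\|^2 + \sum_{(i,j)\in E}\Big(\frac{\lambda_2\|X^i_t - \mu_t\|^2 + \lambda_2\|X^j_t - \mu_t\|^2}{4r} + \frac{4r\|\eta\,\tilde h_i(X^i_t)\|^2 + 4r\|\eta\,\tilde h_j(X^j_t)\|^2}{\lambda_2}\Big) \\
={}& -\frac{\lambda_2\Gamma_t}{2} + \sum_{i=1}^n r\|\eta\,\tilde h_i(X^i_t)\|^2 + \sum_{i=1}^n \frac{\lambda_2\|X^i_t - \mu_t\|^2}{4} + \sum_{i=1}^n \frac{4r^2\|\eta\,\tilde h_i(X^i_t)\|^2}{\lambda_2}.
\end{align*}
Next, we use the definition of $\Gamma_t$:
\[
\sum_{(i,j)\in E}\Delta^{i,j}_t \le -\frac{\lambda_2\Gamma_t}{4} + \sum_{i=1}^n\Big(r + \frac{4r^2}{\lambda_2}\Big)\|\eta\,\tilde h_i(X^i_t)\|^2 .
\]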
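Next, we use the above inequality to upper bound $\Delta_t$ in expectation:
\begin{align*}
\mathbb{E}[\Delta_t \mid X_0, X_1, \ldots, X_t] &= \frac{1}{rn/2}\sum_{(i,j)\in E}\mathbb{E}[\Delta^{i,j}_t \mid X_0, X_1, \ldots, X_t] \\
&\le \frac{1}{rn/2}\Big(-\frac{\lambda_2\Gamma_t}{4} + \sum_{i=1}^n\Big(r + \frac{4r^2}{\lambda_2}\Big)\mathbb{E}\big[\|\eta\,\tilde h_i(X^i_t)\|^2 \mid X_0, X_1, \ldots, X_t\big]\Big) \\
&= -\frac{\lambda_2\Gamma_t}{2rn} + \sum_{i=1}^n\Big(2 + \frac{8r}{\lambda_2}\Big)\eta^2\,\frac{\mathbb{E}\big[\|\tilde h_i(X^i_t)\|^2 \mid X_0, X_1, \ldots, X_t\big]}{n}.
\end{align*}
Finally, we remove the conditioning:
\[
\mathbb{E}[\Delta_t] = \mathbb{E}\big[\mathbb{E}[\Delta_t \mid X_0, X_1, \ldots, X_t]\big] \le -\frac{\lambda_2\mathbb{E}[\Gamma_t]}{2rn} + \Big(2 + \frac{8r}{\lambda_2}\Big)\eta^2\sum_{i=1}^n\frac{\mathbb{E}\|\tilde h_i(X^i_t)\|^2}{n}.
\]
By considering the definition of $\Delta_t$, we get the proof of the lemma.

Lemma H.2. For any $1 \le q \le H$ and step $t$, we have that
\[
\sum_{i=1}^n \mathbb{E}\|\nabla f_i(\mu_t) - h^q_i(X^i_t)\|^2 \le 2L^2\mathbb{E}[\Gamma_t] + \sum_{i=1}^n 2L^2\eta^2\,\mathbb{E}\Big\|\sum_{s=0}^{q-1}\tilde h^s_i(X^i_t)\Big\|^2 .
\]

Proof.
\begin{align*}
\sum_{i=1}^n \mathbb{E}\|\nabla f_i(\mu_t) - h^q_i(X^i_t)\|^2 &= \sum_{i=1}^n \mathbb{E}\Big\|\nabla f_i(\mu_t) - \nabla f_i\Big(X^i_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_i(X^i_t)\Big)\Big\|^2 \\
&\overset{(27)}{\le} \sum_{i=1}^n L^2\,\mathbb{E}\Big\|\mu_t - X^i_t + \eta\sum_{s=0}^{q-1}\tilde h^s_i(X^i_t)\Big\|^2 \\
&\overset{\text{Cauchy-Schwarz}}{\le} \sum_{i=1}^n 2L^2\,\mathbb{E}\|X^i_t - \mu_t\|^2 + \sum_{i=1}^n 2L^2\eta^2\,\mathbb{E}\Big\|\sum_{s=0}^{q-1}\tilde h^s_i(X^i_t)\Big\|^2 \\
&= 2L^2\mathbb{E}[\Gamma_t] + \sum_{i=1}^n 2L^2\eta^2\,\mathbb{E}\Big\|\sum_{s=0}^{q-1}\tilde h^s_i(X^i_t)\Big\|^2 .
\end{align*}

Lemma H.3. For any $1 \le q \le H$ and step $t$, we have that
\[
\sum_{i=1}^n \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2 \le n\sigma^2 + 4n\rho^2 + 16L^2\mathbb{E}[\Gamma_t] + \sum_{i=1}^n 16L^2\eta^2\,\mathbb{E}\Big\|\sum_{s=0}^{q-1}\tilde h^s_i(X^i_t)\Big\|^2 + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\]

Proof.
\begin{align*}
\sum_{i=1}^n \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2 \le{}& \sum_{i=1}^n\big(\sigma^2 + \mathbb{E}\|h^q_i(X^i_t)\|^2\big) = n\sigma^2 + \sum_{i=1}^n \mathbb{E}\Big\|\nabla f_i\Big(X^i_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_i(X^i_t)\Big)\Big\|^2 \\
\overset{(28)}{\le}{}& n\sigma^2 + \sum_{i=1}^n \mathbb{E}\Big\|\nabla f_i\Big(X^i_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_i(X^i_t)\Big) - \nabla f_i(\mu_t) + \nabla f_i(\mu_t) - \nabla f(\mu_t) \\
&\qquad + \sum_{j=1}^n \nabla f_j(\mu_t)/n - \sum_{j=1}^n \nabla f_j\Big(X^j_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_j(X^j_t)\Big)/n + \sum_{j=1}^n \nabla f_j\Big(X^j_t - \sum_{s=0}^{q-1}\eta\,\tilde h^s_j(X^j_t)\Big)/n\Big\|^2 \\
\overset{\text{Cauchy-Schwarz}}{\le}{}& n\sigma^2 + \sum_{i=1}^n 4\,\mathbb{E}\|\nabla f_i(\mu_t) - h^q_i(X^i_t)\|^2 + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n\big(\nabla f_i(\mu_t) - h^q_i(X^i_t)\big)/n\Big\|^2 \\
&+ 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 + 4\sum_{i=1}^n \|\nabla f_i(\mu_t) - \nabla f(\mu_t)\|^2 \\
\overset{\text{Cauchy-Schwarz},\,(31)}{\le}{}& n\sigma^2 + 4n\rho^2 + \sum_{i=1}^n 8\,\mathbb{E}\|\nabla f_i(\mu_t) - h^q_i(X^i_t)\|^2 + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 \\
\overset{\text{Lemma H.2}}{\le}{}& n\sigma^2 + 4n\rho^2 + 16L^2\mathbb{E}[\Gamma_t] + \sum_{i=1}^n 16L^2\eta^2\,\mathbb{E}\Big\|\sum_{s=0}^{q-1}\tilde h^s_i(X^i_t)\Big\|^2 + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\end{align*}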
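To build intuition for the contraction factor in Lemma H.1, the following sketch isolates the averaging part of an interaction (no gradient steps, i.e. $\eta = 0$) on a complete graph with scalar models. It assumes that $\lambda_2$ denotes the second-smallest eigenvalue of the graph Laplacian, which equals $n$ for the complete graph, with degree $r = n - 1$; the function names are hypothetical. In this special case the expected potential shrinks by exactly the factor $1 - \lambda_2/(rn)$, which is within the weaker factor $1 - \lambda_2/(2rn)$ stated in the lemma.

```python
import itertools
import random

def potential(models):
    """Gamma = sum_i (X_i - mu)^2 for scalar models."""
    mu = sum(models) / len(models)
    return sum((x - mu) ** 2 for x in models)

def expected_potential_after_averaging(models):
    """Exact expectation of Gamma after one interaction on the complete graph:
    an edge (i, j) is drawn uniformly among all n(n-1)/2 edges and both
    endpoints adopt the average of their two models."""
    n = len(models)
    pairs = list(itertools.combinations(range(n), 2))
    total = 0.0
    for i, j in pairs:
        updated = list(models)
        updated[i] = updated[j] = (models[i] + models[j]) / 2
        total += potential(updated)
    return total / len(pairs)

random.seed(0)
models = [random.gauss(0.0, 1.0) for _ in range(8)]
n = len(models)
r, lam2 = n - 1, n  # complete graph: degree n-1, Laplacian lambda_2 = n (assumption)

gamma = potential(models)
gamma_next = expected_potential_after_averaging(models)
print(gamma_next / gamma, 1 - lam2 / (r * n), 1 - lam2 / (2 * r * n))
```

The remaining slack between the two factors is what the proof spends, via Young's inequality, to absorb the cross terms introduced by the local gradient steps.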
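Next we use the above lemma to show the upper bound for $\sum_{q=1}^H\sum_{i=1}^n \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2$:

Lemma H.4. For $\eta \le \frac{1}{6LH}$, we have that:
\[
\sum_{q=1}^H\sum_{i=1}^n \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2 \le 2Hn(\sigma^2 + 4\rho^2) + 32HL^2\mathbb{E}[\Gamma_t] + 8n\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\]

Proof. Notice that if $\eta \le \frac{1}{6LH}$, Lemma H.3 gives us that:
\begin{align*}
\sum_{i=1}^n \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2 &\le n(\sigma^2 + 4\rho^2) + 16L^2\mathbb{E}[\Gamma_t] + \sum_{i=1}^n \frac{1}{2H^2}\mathbb{E}\Big\|\sum_{s=0}^{q-1}\tilde h^s_i(X^i_t)\Big\|^2 + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 \\
&\le n(\sigma^2 + 4\rho^2) + 16L^2\mathbb{E}[\Gamma_t] + \sum_{i=1}^n \frac{q}{2H^2}\sum_{s=0}^{q-1}\mathbb{E}\|\tilde h^s_i(X^i_t)\|^2 + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 \\
&\le n(\sigma^2 + 4\rho^2) + 16L^2\mathbb{E}[\Gamma_t] + \sum_{i=1}^n \frac{1}{2H}\sum_{s=0}^{q-1}\mathbb{E}\|\tilde h^s_i(X^i_t)\|^2 + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 . \tag{34}
\end{align*}
For $0 \le q \le H$, let $R_q = \sum_{i=1}^n\sum_{s=0}^{q}\mathbb{E}\|\tilde h^s_i(X^i_t)\|^2$. Observe that inequality (34) can be rewritten as:
\[
R_q - R_{q-1} \le \frac{1}{2H}R_{q-1} + n(\sigma^2 + 4\rho^2) + 16L^2\mathbb{E}[\Gamma_t] + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 ,
\]
which is the same as
\[
R_q \le \Big(1 + \frac{1}{2H}\Big)R_{q-1} + n(\sigma^2 + 4\rho^2) + 16L^2\mathbb{E}[\Gamma_t] + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\]
By unrolling the recursion we get that
\[
R_H \le \sum_{q=0}^{H-1}\Big(1 + \frac{1}{2H}\Big)^q\Big(n(\sigma^2 + 4\rho^2) + 16L^2\mathbb{E}[\Gamma_t] + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^{H-q}_i(X^i_t)/n\Big\|^2\Big).
\]
Since $\big(1 + \frac{1}{2H}\big)^H \le \big(e^{\frac{1}{2H}}\big)^H = e^{1/2} \le 2$, we have that
\begin{align*}
R_H = \sum_{q=1}^H\sum_{i=1}^n \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2 &\le 2\sum_{q=1}^H\Big(n(\sigma^2 + 4\rho^2) + 16L^2\mathbb{E}[\Gamma_t] + 4n\,\mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&= 2Hn(\sigma^2 + 4\rho^2) + 32HL^2\mathbb{E}[\Gamma_t] + 8n\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\end{align*}

Next we derive the upper bound for $\sum_{t=0}^T \mathbb{E}[\Gamma_t]$:

Lemma H.5. For $\eta \le \frac{1}{12HL\sqrt{2r/\lambda_2 + 8r^2/\lambda_2^2}}$, we have that:
\[
\sum_{t=0}^T \mathbb{E}[\Gamma_t] \le \frac{8nr\eta^2(\sigma^2 + 4\rho^2)H^2T}{\lambda_2}\Big(2 + \frac{8r}{\lambda_2}\Big) + \frac{32nr\eta^2H}{\lambda_2}\Big(2 + \frac{8r}{\lambda_2}\Big)\sum_{t=1}^T\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\]

Proof.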
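By Lemma H.1 we get that:
\begin{align*}
\mathbb{E}[\Gamma_{t+1}] \le{}& \Big(1 - \frac{\lambda_2}{2rn}\Big)\mathbb{E}[\Gamma_t] + \frac{\eta^2H}{n}\Big(2 + \frac{8r}{\lambda_2}\Big)\sum_{q=1}^H\sum_{i=1}^n \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2 \\
\overset{\text{Lemma H.4}}{\le}{}& \Big(1 - \frac{\lambda_2}{2rn}\Big)\mathbb{E}[\Gamma_t] + \frac{\eta^2H}{n}\Big(2 + \frac{8r}{\lambda_2}\Big)\Big(2Hn(\sigma^2 + 4\rho^2) + 32HL^2\mathbb{E}[\Gamma_t] + 8n\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
={}& \Big(1 - \frac{\lambda_2}{2rn}\Big)\mathbb{E}[\Gamma_t] + 2\eta^2(\sigma^2 + 4\rho^2)H^2\Big(2 + \frac{8r}{\lambda_2}\Big) + \frac{32H^2L^2\eta^2}{n}\Big(2 + \frac{8r}{\lambda_2}\Big)\mathbb{E}[\Gamma_t] \\
&+ 8\eta^2H\Big(2 + \frac{8r}{\lambda_2}\Big)\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\end{align*}
Notice that for $\eta \le \frac{1}{12HL\sqrt{2r/\lambda_2 + 8r^2/\lambda_2^2}}$ we can rewrite the above inequality as
\[
\mathbb{E}[\Gamma_{t+1}] \le \Big(1 - \frac{\lambda_2}{4nr}\Big)\mathbb{E}[\Gamma_t] + 2\eta^2(\sigma^2 + 4\rho^2)H^2\Big(2 + \frac{8r}{\lambda_2}\Big) + 8\eta^2H\Big(2 + \frac{8r}{\lambda_2}\Big)\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\]
Since $\sum_{i=0}^\infty \big(1 - \frac{\lambda_2}{4nr}\big)^i \le \frac{1}{1 - (1 - \frac{\lambda_2}{4nr})} = \frac{4nr}{\lambda_2}$, we get that:
\begin{align*}
\sum_{t=0}^T \mathbb{E}[\Gamma_t] &\le \frac{4nr}{\lambda_2}\Big(2\eta^2(\sigma^2 + 4\rho^2)H^2T\Big(2 + \frac{8r}{\lambda_2}\Big) + 8\eta^2H\Big(2 + \frac{8r}{\lambda_2}\Big)\sum_{t=1}^T\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&= \frac{8nr\eta^2(\sigma^2 + 4\rho^2)H^2T}{\lambda_2}\Big(2 + \frac{8r}{\lambda_2}\Big) + \frac{32nr\eta^2H}{\lambda_2}\Big(2 + \frac{8r}{\lambda_2}\Big)\sum_{t=1}^T\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\end{align*}

Now, we are ready to prove the following theorem:

Theorem 4.2. Let $f$ be a non-convex, $L$-smooth function whose minimum $x^*$ we are trying to find via the SwarmSGD procedure given in Algorithm 1. Let the local functions of the agents satisfy conditions (27), (28), (29), (30) and (31). Let $H$ be the number of local stochastic gradient steps performed by each agent upon interaction. Define $\mu_t = \sum_{i=1}^n X^i_t/n$, where $X^i_t$ is the value of model $i$ after $t$ interactions. For learning rate $\eta = \frac{n}{\sqrt{T}}$ and $T \ge 57600n^4H^2\max(1,L^2)\big(\frac{r^2}{\lambda_2^2} + 1\big)^2$ we have that:
\[
\frac{\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2}{T} \le \frac{1}{\sqrt{T}H}\mathbb{E}[f(\mu_0) - f(x^*)] + \frac{376H^2\max(1,L^2)(\sigma^2 + 4\rho^2)}{\sqrt{T}}\Big(\frac{r^2}{\lambda_2^2} + 1\Big).
\]

Proof. Let $\mathbb{E}_t$ denote expectation conditioned on $\{X^1_t, X^2_t, \ldots, X^n_t\}$. By $L$-smoothness we have that
\[
\mathbb{E}_t[f(\mu_{t+1})] \le f(\mu_t) + \mathbb{E}_t\langle\nabla f(\mu_t), \mu_{t+1} - \mu_t\rangle + \frac{L}{2}\mathbb{E}_t\|\mu_{t+1} - \mu_t\|^2 . \tag{35}
\]
First we look at $\mathbb{E}_t[\mu_{t+1} - \mu_t]$. If agents $i$ and $j$ interact (which happens with probability $\frac{1}{rn/2}$), we have that $\mu_{t+1} - \mu_t = -\frac{\eta}{n}\tilde h_i(X^i_t) - \frac{\eta}{n}\tilde h_j(X^j_t)$.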
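Hence we get:
\begin{align*}
\mathbb{E}_t[\mu_{t+1} - \mu_t] &= \frac{1}{rn/2}\sum_{(i,j)\in E}\mathbb{E}_t\Big[-\frac{\eta}{n}\tilde h_i(X^i_t) - \frac{\eta}{n}\tilde h_j(X^j_t)\Big] = \frac{2}{n}\sum_{i=1}^n \mathbb{E}_t\Big[-\frac{\eta}{n}\tilde h_i(X^i_t)\Big] \\
&= -\frac{2\eta}{n^2}\sum_{i=1}^n \mathbb{E}_t[\tilde h_i(X^i_t)] \overset{(3)}{=} -\frac{2\eta}{n^2}\sum_{i=1}^n\sum_{q=1}^H \mathbb{E}_t[h^q_i(X^i_t)].
\end{align*}
Using the above equality we have:
\begin{align*}
\mathbb{E}_t\langle\nabla f(\mu_t), \mu_{t+1} - \mu_t\rangle &= \langle\nabla f(\mu_t), \mathbb{E}_t[\mu_{t+1} - \mu_t]\rangle = \Big\langle\nabla f(\mu_t), -\frac{2\eta}{n^2}\sum_{i=1}^n\sum_{q=1}^H \mathbb{E}_t[h^q_i(X^i_t)]\Big\rangle \\
&= \frac{\eta}{n}\sum_{q=1}^H\Big(\mathbb{E}_t\Big\|\nabla f(\mu_t) - \sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 - \|\nabla f(\mu_t)\|^2 - \mathbb{E}_t\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&\overset{(28)}{=} \frac{\eta}{n}\sum_{q=1}^H\Big(\mathbb{E}_t\Big\|\sum_{i=1}^n\big(\nabla f_i(\mu_t) - h^q_i(X^i_t)\big)/n\Big\|^2 - \|\nabla f(\mu_t)\|^2 - \mathbb{E}_t\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&\le \frac{\eta}{n}\sum_{q=1}^H\Big(\frac{1}{n}\sum_{i=1}^n \mathbb{E}_t\|\nabla f_i(\mu_t) - h^q_i(X^i_t)\|^2 - \|\nabla f(\mu_t)\|^2 - \mathbb{E}_t\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big).
\end{align*}
Here we used the Cauchy-Schwarz inequality at the last step. Next we look at $\mathbb{E}_t\|\mu_{t+1} - \mu_t\|^2$. If agents $i$ and $j$ interact (which happens with probability $\frac{1}{rn/2}$), we have that $\mu_{t+1} - \mu_t = -\frac{\eta}{n}\tilde h_i(X^i_t) - \frac{\eta}{n}\tilde h_j(X^j_t)$. Hence we get that
\begin{align*}
\mathbb{E}_t\|\mu_{t+1} - \mu_t\|^2 &= \frac{1}{rn/2}\sum_{(i,j)\in E}\mathbb{E}_t\Big\|-\frac{\eta}{n}\tilde h_i(X^i_t) - \frac{\eta}{n}\tilde h_j(X^j_t)\Big\|^2 \\
&\overset{\text{Cauchy-Schwarz}}{\le} \frac{1}{rn/2}\sum_{(i,j)\in E}\frac{\eta^2}{n^2}\Big(2\,\mathbb{E}_t\|\tilde h_i(X^i_t)\|^2 + 2\,\mathbb{E}_t\|\tilde h_j(X^j_t)\|^2\Big) \\
&= \frac{2}{n}\sum_{i=1}^n \frac{2\eta^2}{n^2}\mathbb{E}_t\|\tilde h_i(X^i_t)\|^2 \overset{\text{Cauchy-Schwarz}}{\le} \frac{4\eta^2H}{n^3}\sum_{i=1}^n\sum_{q=1}^H \mathbb{E}_t\|\tilde h^q_i(X^i_t)\|^2 .
\end{align*}
So, we can rewrite (35) as:
\begin{align*}
\mathbb{E}_t[f(\mu_{t+1})] \le{}& f(\mu_t) + \frac{\eta}{n}\sum_{q=1}^H\Big(\frac{1}{n}\sum_{i=1}^n \mathbb{E}_t\|\nabla f_i(\mu_t) - h^q_i(X^i_t)\|^2 - \|\nabla f(\mu_t)\|^2 - \mathbb{E}_t\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&+ \frac{2L\eta^2H}{n^3}\sum_{i=1}^n\sum_{q=1}^H \mathbb{E}_t\|\tilde h^q_i(X^i_t)\|^2 .
\end{align*}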
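Next, we remove conditioning:
\begin{align*}
\mathbb{E}[f(\mu_{t+1})] ={}& \mathbb{E}\big[\mathbb{E}_t[f(\mu_{t+1})]\big] \\
\le{}& \mathbb{E}[f(\mu_t)] + \frac{\eta}{n}\sum_{q=1}^H\Big(\frac{1}{n}\sum_{i=1}^n \mathbb{E}\|\nabla f_i(\mu_t) - h^q_i(X^i_t)\|^2 - \mathbb{E}\|\nabla f(\mu_t)\|^2 - \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&+ \frac{2L\eta^2H}{n^3}\sum_{i=1}^n\sum_{q=1}^H \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2 \\
\overset{\text{Lemma H.2}}{\le}{}& \mathbb{E}[f(\mu_t)] + \frac{\eta}{n}\sum_{q=1}^H\Big(\frac{1}{n}\Big(2L^2\mathbb{E}[\Gamma_t] + \sum_{i=1}^n 2L^2\eta^2\,\mathbb{E}\Big\|\sum_{s=0}^{q-1}\tilde h^s_i(X^i_t)\Big\|^2\Big) - \mathbb{E}\|\nabla f(\mu_t)\|^2 - \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&+ \frac{2L\eta^2H}{n^3}\sum_{i=1}^n\sum_{q=1}^H \mathbb{E}\|\tilde h^q_i(X^i_t)\|^2 \\
\overset{\text{Lemma H.4}}{\le}{}& \mathbb{E}[f(\mu_t)] + \frac{2\eta L^2H}{n^2}\mathbb{E}[\Gamma_t] - \frac{H\eta}{n}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&+ \frac{2L^2\eta^3}{n^2}\Big(2Hn(\sigma^2 + 4\rho^2) + 32HL^2\mathbb{E}[\Gamma_t] + 8n\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&+ \frac{2LH\eta^2}{n^3}\Big(2Hn(\sigma^2 + 4\rho^2) + 32HL^2\mathbb{E}[\Gamma_t] + 8n\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2\Big) \\
&- \frac{\eta}{n}\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\end{align*}
Next we choose $\eta \le \frac{1}{9L}$ and $\eta \le \frac{n}{80LH}$, so that $\frac{16L^2\eta^3}{n} \le \frac{\eta}{5n}$ and $\frac{16LH\eta^2}{n^2} \le \frac{\eta}{5n}$. This, together with the above inequalities, allows us to derive the following upper bound for $\mathbb{E}[f(\mu_{t+1})]$ (we eliminate the terms with positive multiplicative factor in front of $\mathbb{E}\|\sum_{i=1}^n h^q_i(X^i_t)/n\|^2$):
\begin{align*}
\mathbb{E}[f(\mu_{t+1})] \le{}& \mathbb{E}[f(\mu_t)] + \frac{2\eta L^2H}{n^2}\mathbb{E}[\Gamma_t] - \frac{H\eta}{n}\mathbb{E}\|\nabla f(\mu_t)\|^2 + \frac{2L^2\eta^3}{n^2}\Big(2Hn(\sigma^2 + 4\rho^2) + 32HL^2\mathbb{E}[\Gamma_t]\Big) \\
&+ \frac{2LH\eta^2}{n^3}\Big(2Hn(\sigma^2 + 4\rho^2) + 32HL^2\mathbb{E}[\Gamma_t]\Big) - \frac{3\eta}{5n}\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 .
\end{align*}
We proceed by summing up the above inequality for $0 \le t \le T-1$:
\begin{align*}
\sum_{t=0}^{T-1}\mathbb{E}[f(\mu_{t+1})] \le{}& \sum_{t=0}^{T-1}\mathbb{E}[f(\mu_t)] + \frac{4L^2H\eta^3(\sigma^2 + 4\rho^2)T}{n} + \frac{4LH^2\eta^2(\sigma^2 + 4\rho^2)T}{n^2} \\
&+ \frac{2\eta L^2H}{n^2}\sum_{t=0}^{T-1}\mathbb{E}[\Gamma_t] + \frac{64L^4H\eta^3}{n^2}\sum_{t=0}^{T-1}\mathbb{E}[\Gamma_t] + \frac{64L^3H^2\eta^2}{n^3}\sum_{t=0}^{T-1}\mathbb{E}[\Gamma_t] \\
&- \sum_{t=0}^{T-1}\frac{3\eta}{5n}\sum_{q=1}^H \mathbb{E}\Big\|\sum_{i=1}^n h^q_i(X^i_t)/n\Big\|^2 - \sum_{t=0}^{T-1}\frac{\eta H}{n}\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{align*}
Recall that $\eta = \frac{n}{\sqrt{T}}$ to get:
\begin{align*}
\frac{\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2}{T} &\le \frac{1}{\sqrt{T}H}\mathbb{E}[f(\mu_0) - f(\mu_T)] + \frac{4L^2(\sigma^2 + 4\rho^2)}{\sqrt{T}} + \frac{4LH(\sigma^2 + 4\rho^2)}{\sqrt{T}} + 16L\cdots \\
&\le \frac{1}{\sqrt{T}H}\mathbb{E}[f(\mu_0) - f(x^*)] + \frac{376H^2\max(1,L^2)(\sigma^2 + 4\rho^2)}{\sqrt{T}}\Big(\frac{r^2}{\lambda_2^2} + 1\Big),
\end{align*}
where in the last step we used $f(\mu_T) \ge f(x^*)$. Notice that all assumptions and upper bounds on $\eta$ are satisfied if $\eta \le \frac{1}{240nH\max(1,L)\big(\frac{r^2}{\lambda_2^2} + 1\big)}$, which is true when $T \ge 57600n^4H^2\max(1,L^2)\big(\frac{r^2}{\lambda_2^2} + 1\big)^2$.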
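The relation between the step-size cap and the lower bound on $T$ in the last sentence of the proof is a direct substitution of $\eta = n/\sqrt{T}$. The sketch below computes both quantities for given problem parameters; the function names and the sample values are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def graph_factor(r, lam2):
    """The recurring term r^2 / lambda_2^2 + 1."""
    return r**2 / lam2**2 + 1.0

def min_interactions(n, H, L, r, lam2):
    """Lower bound on T from Theorem 4.2."""
    return 57600 * n**4 * H**2 * max(1.0, L**2) * graph_factor(r, lam2) ** 2

def learning_rate(n, T):
    """eta = n / sqrt(T), the learning rate used in the theorem."""
    return n / math.sqrt(T)

def learning_rate_cap(n, H, L, r, lam2):
    """Sufficient cap on eta stated at the end of the proof."""
    return 1.0 / (240 * n * H * max(1.0, L) * graph_factor(r, lam2))

# Hypothetical example: 8 nodes on a complete graph (r = 7, lambda_2 = 8), H = 3, L = 2.
n, H, L, r, lam2 = 8, 3, 2.0, 7, 8.0
T_min = min_interactions(n, H, L, r, lam2)
print(T_min, learning_rate(n, T_min), learning_rate_cap(n, H, L, r, lam2))
```

At $T$ equal to the threshold, the learning rate coincides with the cap (up to floating-point rounding); any smaller $T$ gives a learning rate above it.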

I ADDITIONAL EXPERIMENTAL RESULTS

We validated our analysis by applying the algorithm to training deep neural networks for image classification and machine translation. Target System and Implementation. We run SwarmSGD on the CSCS Piz Daint supercomputer, which is composed of Cray XC50 nodes, each with a Xeon E5-2690v3 CPU and an NVIDIA Tesla P100 GPU, using a state-of-the-art Aries interconnect. Please see (Piz, 2019) for hardware details. We note that the best-performing alternative (AD-PSGD) is known to drop accuracy relative to the baselines, e.g. (Assran et al., 2018). Results. The accuracy results for the ImageNet experiments are given in Table 1 and Figures 3(a) and 3(b). As is standard, we follow Top-1 validation accuracy versus number of steps. Communication cost. We now look deeper into SwarmSGD's performance. For this, we examine in Figure 4 the average time per batch of different methods when executed on our testbed. The base value on the y axis (0.4s) is exactly the average computation time per batch, which is the same across all methods. Thus, the extra values on the y axis equate roughly to the communication cost of each algorithm. The results suggest that the communication cost can be up to twice the batch cost (SGP and D-PSGD). Moreover, this cost increases with the number of workers (x axis) for all methods except SwarmSGD. The lower cost of SwarmSGD is explained by its reduced communication frequency: it communicates less often, and therefore the average cost of communication per step is lower. We can therefore conclude that our method is scalable, in the sense that its communication cost remains constant relative to the total size of the system. The ablation results show that the method preserves convergence even at very high node counts (256), and suggest a strong correlation between accuracy and the number of epochs executed per model. The number of local steps executed also impacts accuracy, but to a much lesser degree. Quantization.
Finally, we show convergence and speedup for a WideResNet-28 model with width factor 2, trained on the CIFAR-10 dataset. We note that the epoch multiplier factor in this setup is 1, i.e. Swarm (and its quantized variant) execute exactly the same number of epochs as the baseline. Notice that the quantized variant provides approximately 10% speedup in this case, for a < 0.3% drop in Top-1 accuracy. 
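The epoch multiplier mentioned above rescales a sequential training schedule into a per-node one. A minimal sketch of that arithmetic follows, using the ResNet18 example from the implementation notes (90 sequential epochs with learning rate decay at 30 and 60, 32 nodes, multiplier 2) and the CIFAR-10 ablation (300 epochs, 8 nodes, multiplier 2). The function name is hypothetical, and rounding the decay points to whole epochs is an assumption.

```python
def per_node_schedule(total_epochs, milestones, num_nodes, multiplier):
    """Scale a sequential schedule by multiplier / num_nodes.

    Returns the number of epochs each node executes and the (unrounded)
    per-node epochs at which the learning rate is decreased.
    """
    scale = multiplier / num_nodes
    return total_epochs * scale, [m * scale for m in milestones]

# ResNet18 / ImageNet example: 90 epochs, decay at 30 and 60, 32 nodes, multiplier 2.
epochs, decay_points = per_node_schedule(90, [30, 60], num_nodes=32, multiplier=2)
print(epochs, decay_points, [round(p) for p in decay_points])

# CIFAR-10 / ResNet20 ablation: 300 epochs, 8 nodes, multiplier 2.
cifar_epochs, _ = per_node_schedule(300, [], num_nodes=8, multiplier=2)
print(cifar_epochs)
```

With multiplier 1 each node executes exactly the sequential epoch count divided by the number of nodes, so the total work matches the baseline.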



The unusual log T factor arises because the quantization scheme of Davies et al. (2020) can fail with some probability, which we handle as part of the analysis. Note that we do not need to union bound over all pairs of u and v, since we can assume that u and v are the ones which interact at step t + 1.



Throughput vs previous work. Higher is better.

Figure 1: Convergence and Scalability on the Transformer/WMT Task with multiplier = 1.

(a) Convergence in time versus number of local steps for ResNet18 on ImageNet. All variants recover the target accuracy, but we note the lower convergence of variants with more local steps. The experiment is run on 32 nodes. (b) Average time per batch for previous methods, compared to SwarmSGD, on ResNet18/ImageNet. The base value on the y axis (0.4) is the average computation time per batch, so values above represent average communication time per batch.

Figure 2: Convergence results and performance breakdown for ResNet18/ImageNet.

Figure 4: Average time per batch for previous methods, compared to SwarmSGD, on ResNet18/ImageNet, across 1000 repetitions with warm-up. Notice that 1) the time per batch of SwarmSGD stays constant relative to the number of nodes; 2) it is lower than any other method. This is due to the reduced communication frequency. Importantly, the base value on the y axis of this graph (0.4) is the average computation time per batch. Thus, everything above 0.4 represents the average communication time for this model.

Figure 5: Convergence versus time for ResNet18/ImageNet for the SGD baseline vs Swarm, executing at 32 nodes. We note that Swarm iterates for 2.7× more epochs for convergence, which explains the similar runtime despite the better scalability of Swarm.

Figure 3(b) shows the convergence versus time for ResNet18 on the ImageNet dataset, at 32 nodes, with 3 local steps per node, and ∼ 7 epochs per model. Convergence versus Steps and Epochs. Figure 8 shows and discusses the results of additional ablation studies with respect to the number of nodes/processes and number of local steps / total epochs on the CIFAR-10 dataset / ResNet20 model.

(a) Convergence versus number of epochs (per model) for CIFAR-10/ResNet20, at node counts between 8 and 256. We note that the algorithm converges and recovers SGD accuracy (91.35% Top-1) for all node counts, although there are oscillations at high node counts. (b) Accuracy versus local epochs and local steps for CIFAR-10/ResNet20. The original schedule for this model has 300 epochs, and this experiment is executed on 8 nodes. If the convergence scaling were perfect, 300/8 = 37.5 epochs would have been sufficient to converge. However, in this case we need an epoch multiplier of 2, leading to 75 epochs to recover full accuracy (which in this case is 91.35%).

Figure 6: Additional convergence results for CIFAR-10 dataset, versus number of nodes (left), and local steps (right).

Figure 7: Objective loss versus time for the Transformer-XL/WMT experiment, for various methods, executing at 16 nodes.

(a) Convergence versus number of steps for the quantized variant. (b) Convergence versus time.

Figure 8: Convergence results for quantized 2xResNet28 trained on the CIFAR-10 dataset, versus iterations (left), and time (right).

Parameters for full Top-1 validation accuracy on CIFAR-10 and ImageNet running on 32 nodes. Swarm step count represents local SGD steps per model between two averaging steps, and epochs are counted in terms of total passes over the data.

$\cdots(\mu_0) - f(\mu_T)] + 4L^2\eta^2(\sigma^2 + 4\rho^2) + \frac{4LH\eta(\sigma^2 + 4\rho^2)}{n} + 16\eta^2L^2H^2(\sigma^2 + 4\rho^2)\big(2r/\lambda_2 + 8r^2/\lambda_2^2\big) + 512L^4H^2\eta^4(\sigma^2 + 4\rho^2)\big(2r/\lambda_2 + 8r^2/\lambda_2^2\big) + \frac{512L^3H^3\eta^3(\sigma^2 + 4\rho^2)}{n}\big(2r/\lambda_2 + 8r^2/\lambda_2^2\big)$. Under review as a conference paper at ICLR 2021. Next we use $\eta \le 1/n$ and $\eta \le \frac{1}{6HL}$:

2 H 2 (σ 2 + 4ρ 2 )

annex

We implemented SwarmSGD in PyTorch and TensorFlow, using NCCL and MPI respectively. Each node runs a computation thread and a communication thread, each of which stores a copy of the model. The "live" copy, which is being updated with gradients, is stored by the computation thread. Periodically, the threads synchronize their two models. When interacting, the two nodes exchange model information via their communication threads. Our implementation closely follows the non-blocking Swarm algorithm description. We used SwarmSGD to train ResNets on the classic CIFAR-10/ImageNet datasets, and a Transformer Vaswani et al. (2017) on the WMT17 dataset (English-German). The code will be made available upon publication. Hyperparameters. The only additional hyperparameter is the total number of epochs we execute for. Once we have fixed the number of epochs, we do not alter the other training hyperparameters: in particular, the learning rate schedule, momentum and weight decay terms are identical to sequential SGD, for each individual model. Practically, if sequential SGD trains ResNet18 in 90 epochs, decreasing the learning rate at 30 and 60 epochs, then SwarmSGD with 32 nodes and multiplier 2 would run for 90 · 2/32 ≈ 5.6 epochs per node, decreasing the learning rate at 2 and 4 epochs. Specifically, for the ImageNet experiments, we used the following hyperparameters. For ResNet18 and ResNet50, we ran for 240 total parallel epochs using 32 parallel nodes. The first communicated every 3 local steps, whereas the second communicated every 2 local steps. We used the same hyperparameters (initial learning rate 0.1, annealed at 1/3 and 2/3 through training, and standard weight decay and momentum parameters). For the WMT17 experiments, we ran a standard Transformer-large model, and executed for 10 global epochs at 16, 32, and 64 nodes. We ran a version with multiplier 1 (i.e. 10/NUM_NODES epochs per model) and one with multiplier 1.5 (i.e. 15/NUM_NODES epochs per model) and registered the BLEU score for each. Baselines. We consider the following baselines:
• Data-parallel SGD: Here, we consider both the small-batch (strong scaling) version, which executes a global batch size of 256 on ImageNet/CIFAR experiments, and the large-batch (weak scaling) baseline, which maximizes the batch per GPU. For the latter version, the learning rate is tuned following Goyal et al. (2017).
• Local SGD: Stich (2018); Lin et al. (2018). We follow the implementation of Lin et al. (2018), communicating globally every 5 SGD steps (which was the highest setting which provided good accuracy on the WMT task).
• Previous decentralized proposals: We experimented also with D-PSGD Lian et al. (2017), AD-PSGD Lian et al. (2018), and SGP Assran et al. (2018). Due to computational constraints, we did not always measure their end-to-end accuracy. Our method matches the sequential / large-batch accuracy for the models we consider within 1%.

