LOCAL SGD MEETS ASYNCHRONY

Abstract

Distributed variants of stochastic gradient descent (SGD) are central to training deep neural networks on massive datasets. Several scalable versions of data-parallel SGD have been developed, leveraging asynchrony, communication compression, and local gradient steps. Current research seeks a balance between distributed scalability (minimizing the amount of synchronization needed) and generalization performance (achieving the same or better accuracy relative to the sequential baseline). However, a key issue in this regime is largely unaddressed: if "local" data-parallelism is aggressively applied to better utilize the computing resources available at the workers, the generalization performance of the trained model degrades. In this paper, we present a method to improve the "local scalability" of decentralized SGD. In particular, we propose two key techniques: (a) shared-memory-based asynchronous gradient updates at decentralized workers that keep the local minibatch size small, and (b) asynchronous non-blocking in-place averaging that overlaps with the local updates, thus utilizing all compute resources at all times without the need for large minibatches. Empirically, the additional noise introduced by the procedure proves to be a boon for better generalization. On the theoretical side, we show that this method guarantees ergodic convergence for non-convex objectives and achieves the classic sublinear rate under standard assumptions. On the practical side, we show that it improves upon the performance of local SGD and related schemes without compromising accuracy.

1. INTRODUCTION

In this paper, we consider the classic problem of minimizing an empirical risk, defined as $\min_{x \in \mathbb{R}^d} \sum_{i \in [I]} f_i(x)$, where $d$ is the dimension, $x \in \mathbb{R}^d$ denotes the set of model parameters, $[I]$ is the training set, and $f_i(x) : \mathbb{R}^d \to \mathbb{R}$ is the loss on the training sample $i \in [I]$. Stochastic gradient descent (SGD) (Robbins & Monro, 1951) is an extremely popular iterative approach to solving this problem:

$$x_{k+1} = x_k - \alpha_k \nabla f_{B_k}(x_k),$$

where $\nabla f_{B_k}(x_k) = \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k)$ is the average of gradients computed over samples, typically selected uniformly at random as a minibatch $B_k \subseteq [I]$, and $\alpha_k$ is the learning rate at iteration $k$.

1.1. BACKGROUND ON DECENTRALIZED DATA-PARALLEL SGD

For better or worse, SGD and its variants currently represent the computational backbone of many large-scale optimization tasks, most notably the training of deep neural networks (DNNs). Arguably the most popular SGD variant is minibatch SGD (MB-SGD) (Bottou (2012)). In a distributed setting with decentralized workers $q \in [Q]$, it follows the iteration $x_{k+1} = x_k - \alpha_k \frac{1}{Q} \sum_{q=1}^{Q} \nabla f_{B^q_k}(x_k)$, where $B^q_k \subseteq [I]$ is a local minibatch selected by worker $q \in [Q]$ at iteration $k$. This strategy is straightforward to scale in a data-parallel way: each worker processes a subset of the samples in parallel, and the model is then updated by the average of the workers' gradient computations. For convenience, we assume the same batch size per worker. This approach has achieved tremendous popularity recently, and there has been significant interest in running training with increasingly large batch sizes aggregated over a large number of GPUs, e.g. Goyal et al. (2017). An alternative approach is parallel or local SGD (L-SGD) (Zinkevich et al. (2010); Zhang et al. (2016c); Lin et al.
(2020)): $x^q_{j,t+1} = x^q_{j,t} - \alpha_{j,t} \nabla f_{B^q_{j,t}}(x^q_{j,t})$ for $0 \le t < K_j$, followed by $x^q_{j+1,0} = \frac{1}{Q} \sum_q x^q_{j,K_j}$, where $x^q_{j,t}$ denotes the local model at worker $q \in [Q]$ after $j$ synchronization rounds followed by $t$ local gradient updates, and $B^q_{j,t}$ is the local minibatch sampled at the same iteration. $K_j$ denotes the number of local gradient update steps before the $j$-th synchronization. Essentially, workers run SGD without any communication for several local steps, after which they globally average the resulting local models. This method is intuitively easy to scale, since it reduces the frequency of communication. Recently, a variant called post-local SGD (PL-SGD) (Lin et al. (2020)) was introduced to address the loss in generalization performance of L-SGD: the averaging frequency during the initial phase of training is high and is reduced later, once optimization stabilizes.

Table 1 (settings): $|B^q_k| = B_{loc}, \forall q \in [Q], k \ge 0$ for MB-SGD and $|B^q_{j,t}| = B_{loc}, \forall q \in [Q], j, t \ge 0$ for PL-SGD. The LR is warmed up for the first 5 epochs to scale from $\alpha_0$ to $\alpha_0 \times \frac{B_{loc} \times Q}{128}$, where 128 is the reference batch size, $Q$ is the number of workers (2 here), and $\alpha_0 = 0.1$. In PL-SGD, we average the model after each gradient update for the first 150 epochs, and thereafter the averaging frequency $K$ is set to 16 as in Lin et al. (2020); other HPs are identical to theirs. The listed results are averages of 3 runs with different seeds.

Although very popular in practice, these two approaches suffer from the same limitation: their generalization accuracy decreases for larger local batch sizes, as would be appropriate to fully utilize the computational power offered by the workers. We illustrate this in Table 1: examine the results of training RESNET-20 (He et al. (2016)) over CIFAR-10 (Krizhevsky (2009)) with MB-SGD and PL-SGD on a workstation packing two Nvidia GeForce RTX 2080 Ti GPUs (a current standard).
We observe that as the local batch size $B_{loc}$ grows, the throughput improves significantly; however, the optimization results degrade sharply, more glaringly so with PL-SGD. Clearly, these methods cannot tolerate a larger $B_{loc}$, even though the GPUs can support it. This shortcoming of the existing methods in harnessing the growing data-parallelism has also been identified in empirical studies in the literature (Golmant et al. (2018); Shallue et al. (2019)). To our knowledge, no effective remedy yet exists to address this challenge. Notice that our core target here is maximally harnessing the local data-parallelism, and therefore a larger local batch size, as opposed to the existing trend in the literature wherein a large number of GPUs is deployed to obtain a large aggregated global batch size with a relatively small $B_{loc}$. For example, consider the performance of MB-SGD and PL-SGD as listed in Table 1 of Lin et al. (2020): with 16 GPUs, each with $B_{loc} = 128$, for a total minibatch size of 2048 (identical to the one with 2 GPUs each with $B_{loc} = 1024$ as above, with exactly the same LR scaling and warmup strategy), both MB-SGD and PL-SGD do not face generalization degradation. Unfortunately, such an implementation setting incurs excessive wastage of the available data-parallel compute resources on each of the GPUs. Indeed, existing techniques such as LARS (You et al. (2017)), designed to address poor generalization in global large-batch training, are insufficient for larger local minibatch sizes; we describe this empirically in Section 3 (Table 11).
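The contrast between the two update rules above can be sketched on a toy one-dimensional least-squares problem (our own illustration, not the paper's code; all names and constants are ours): MB-SGD averages gradients at every step, while L-SGD runs K independent local steps and then averages the models.

```python
import random

# Toy 1-D sketch: f(x) = mean_i (x - d_i)^2 over a synthetic dataset.
random.seed(0)
DATA = [random.gauss(1.0, 0.5) for _ in range(256)]  # minimizer ~= 1.0

def grad(x, batch):
    # gradient of the mean squared loss over one minibatch
    return sum(2.0 * (x - d) for d in batch) / len(batch)

def mb_sgd(steps, Q=2, b_loc=8, lr=0.1):
    x = 0.0
    for _ in range(steps):
        # all workers' gradients are averaged before one global update
        g = sum(grad(x, random.sample(DATA, b_loc)) for _ in range(Q)) / Q
        x -= lr * g
    return x

def local_sgd(rounds, K=4, Q=2, b_loc=8, lr=0.1):
    x = 0.0
    for _ in range(rounds):
        finals = []
        for _ in range(Q):       # each worker starts from the averaged model
            xq = x
            for _ in range(K):   # K local steps with no communication
                xq -= lr * grad(xq, random.sample(DATA, b_loc))
            finals.append(xq)
        x = sum(finals) / Q      # average the models, not the gradients
    return x

x_mb = mb_sgd(64)
x_loc = local_sgd(16)            # 16 rounds x 4 local steps = 64 updates
```

Both runs land near the minimizer; the point of the sketch is only the communication pattern, gradient averaging every step versus model averaging every K steps.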

1.2. LOCALLY-ASYNCHRONOUS PARALLEL SGD

Now, consider an implementation scheme as follows: 1. In a decentralized setting of L-SGD, i.e. wherein each worker $q \in [Q]$ has a local model $x^q$ undergoing local SGD updates as described earlier, multiple local concurrent processes $u \in U^q$ share the model $x^q$. Processes $u \in U^q$ perform asynchronous concurrent gradient updates locally. 2. The workers average their models whenever any one of them has accumulated at least $K_j$ local shared updates, where $K_j$ is as in Equation 4. The averaging is performed asynchronously and in a non-blocking way by the (averaging) processes $a^q$ on behalf of each worker $q \in [Q]$. Essentially, the decentralized workers run shared-memory-based asynchronous SGD locally and periodically synchronize in a totally non-blocking fashion.

More formally, consider Algorithm 1. The model $x^q$ on a GPU $q \in [Q]$ is shared locally by the processes $p \in P^q = \{a^q\} \cup U^q$. The processes $p \in P^q$ also maintain a shared counter $S^q$, initialized to 0. The operation read-and-inc implements an atomic (with lock) read and increment of $S^q$, whereas read provides an atomic read. $S^q$ essentially enables ordering the shared gradient updates. In turn, this order streamlines the synchronization among workers and thereby determines the averaging rounds $j$. The (updater) processes $u \in U^q$ asynchronously and lock-freely update $x^q$ with gradients computed over a non-blocking, potentially inconsistent snapshot $v^{u,q}$ of $x^q$, essentially going Hogwild! (Recht et al. (2011)); see Algorithm 1a.

Initialize s = 0;
while s ≤ T do
  v^{u,q}[i] := x^q[i], ∀ 1 ≤ i ≤ d;
  s := read-and-inc(S^q);
  Compute ∇f_{B^q_s}(v^{u,q});
  x^q[i] -= α_s ∇f_{B^q_s}(v^{u,q})[i], ∀ 1 ≤ i ≤ d;
(a) Local asynchronous gradient update by process u ∈ U^q.
1 Initialize s_cur = s_pre = |U^q|, j = 0;
2 while s_cur ≤ T do
3   s_cur := read(S^q); compute j corresponding to s_cur;
4   if s_cur - s_pre ≥ K_j then
5     v^{a,q}_j[i] := x^q[i], ∀ 1 ≤ i ≤ d;
6     Synchronize across a^r, r ∈ [Q] \ {q}, to compute v̄_j := (1/Q) Σ_{q∈[Q]} v^{a,q}_j;
7     Compute Δv^q_j = v̄_j - v^{a,q}_j; s_pre := s_cur;
8     x^q[i] += Δv^q_j[i], ∀ 1 ≤ i ≤ d; j = j + 1;
(b) Asynchronous non-blocking in-place averaging.
Algorithm 1: Locally-asynchronous Parallel SGD (LAP-SGD)

The process $a^q$, which performs averaging for the worker $q \in [Q]$, concurrently keeps atomically reading $S^q$; see Algorithm 1b. As soon as it notices an increment of $K_j$ in $S^q$, i.e. $x^q$ has concurrently been updated with $K_j$ gradients, it takes a non-blocking snapshot $v^{a,q}_j$ of $x^q$ and synchronizes with the $a^r$ of peers $r \in [Q] \setminus \{q\}$ to compute the average $\bar{v}_j$ of the snapshots. Thereafter, $a^q$ adds the difference between the average and the snapshot $v^{a,q}_j$ to the model $x^q$ without blocking the concurrent asynchronous local gradient updates. We call this method locally-asynchronous parallel SGD (LAP-SGD). This method closely resembles Hogwild++ (Zhang et al. (2016a)), which targets heterogeneous NUMA-based multi-core machines, though there are key differences, which we describe in Section 4. Results of the same training task as before with LAP-SGD are given in Table 2. The distinction of this implementation is that it harnesses the compute power of the GPUs not by increasing the size of $B_{loc}$ but by concurrently computing many minibatch gradients. Evidently, LAP-SGD provides speed-up without losing the quality of optimization in comparison to the baseline.

More specifically, building on LAP-SGD, we consider locally partitioned gradient computation along with asynchronous lock-free updates. Essentially, we partition the model $x^q$ into $\{x^q_{i(u)}\}$ for $u \in U^q$, with $i(u) \cap i(w) = \emptyset$ for all $u \ne w \in U^q$ (i.e., non-overlapping block components of the vector $x$).
With that, a partitioned gradient computation amounts to computing $\nabla_{i(u)} f_{B^q_s}(v^{u,q})$, the minibatch gradient with respect to the partition $x^q_{i(u)}$, at line 5 in Algorithm 1a. Accordingly, the update step at line 6 in Algorithm 1a transforms to $x^q[i] \mathrel{-}= \alpha_s \nabla f_{B^q_s}(v^{u,q})[i], \forall i \in i(u)$. Note that we do not use a write lock for iterations at any stage. Having devised a partitioned update scheme, we propose locally-partitioned asynchronous parallel SGD (LPP-SGD), described as follows. 1. Processes $u \in U^q$ maintain a process-local variable last_iter, which can take two values: PARTITIONED and FULL. Each $u \in U^q$ initializes last_iter as FULL. 2. While $s \le T_{st}$, each process $u \in U^q$ performs LAP-SGD updates as in lines 3 to 6 of Algorithm 1a. 3. If $T_{st} < s \le T$, each process $u \in U^q$ performs (a) a partitioned gradient computation and update, $x^q[i] \mathrel{-}= \alpha_s \nabla f_{B^q_s}(v^{u,q})[i], \forall i \in i(u)$, if last_iter = FULL, and sets last_iter = PARTITIONED; (b) an LAP-SGD update if last_iter = PARTITIONED, and sets last_iter = FULL. Essentially, after some initial stabilizing epochs, each process $u \in U^q$ alternates between full and partitioned lock-free asynchronous gradient updates to the model $x^q$. Our experiments showed that $T_{st} = T/10$ was almost always sufficient to obtain a competitive optimization result. The results of a sample implementation of LPP-SGD are available in Table 3. It is clear that LPP-SGD handsomely speeds up the computation and provides equally competitive optimization results.
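The mechanics of the updater loop and the LPP-SGD alternation can be sketched in plain Python (a minimal illustration of ours, not the paper's implementation): threads stand in for the processes u ∈ U^q, a toy quadratic replaces the CNN loss, and the noisy-gradient model, sizes, and names are all assumptions. Python threads only approximate truly parallel lock-free updates, but the structure (snapshot, read-and-inc, lock-free in-place writes, full/partitioned alternation) mirrors the description above.

```python
import threading
import random

# Toy objective f(x) = ||x||^2 / 2; the minimizer is the zero vector.
D, U, T, T_ST, LR = 8, 4, 600, 60, 0.02

model = [1.0] * D                    # shared local model x^q, no write lock
counter = {"s": 0}                   # shared counter S^q
counter_lock = threading.Lock()      # the lock guards only the counter

def read_and_inc():
    with counter_lock:               # atomic read-and-increment of S^q
        counter["s"] += 1
        return counter["s"]

def noisy_grad(v):
    # stochastic gradient of ||x||^2/2: the iterate plus sampling noise
    return [vi + random.gauss(0.0, 0.1) for vi in v]

def block(u):
    # i(u): disjoint coordinate blocks, one per updater
    return range(u * D // U, (u + 1) * D // U)

def updater(u):
    last_iter = "FULL"
    while True:
        v = list(model)              # non-blocking, possibly inconsistent snapshot
        s = read_and_inc()
        if s > T:
            return
        g = noisy_grad(v)
        if s <= T_ST or last_iter == "PARTITIONED":
            idx, last_iter = range(D), "FULL"         # full (LAP-SGD) update
        else:
            idx, last_iter = block(u), "PARTITIONED"  # partitioned update on i(u)
        for i in idx:
            model[i] -= LR * g[i]    # lock-free in-place write

threads = [threading.Thread(target=updater, args=(u,)) for u in range(U)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Under this toy dynamics the shared model contracts toward the minimizer despite lock-free overwrites; in the actual method the snapshot, the counter order s, and the FULL/PARTITIONED alternation play exactly these roles, with real processes on a GPU in place of the threads and the averaging process running concurrently.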

2. CONVERGENCE THEORY

At a naive first glance, studying the convergence properties of locally asynchronous SGD would seem an incremental extension of existing analyses for local SGD, e.g. Stich (2018); Zhou & Cong (2017), in particular incorporating local stochastic gradient evaluations at delayed, lock-free, Hogwild!-like parameter vectors. However, there is one significant difference that presents a theoretical challenge: sometimes the vectors used for gradient computation, or components thereof, have been read from the local shared memory before the last averaging across GPUs had taken place. Especially in the nonconvex case, a priori it is impossible to place a reasonable bound on the accuracy of these gradient evaluations relative to what they "should be" in order to achieve descent.

Initialize x̄_0 = x^q_0 for all q;
for j = 1, ..., J do
  for all q do
    Set x^{q,j}_0 = x̄_j;
    for t = 1, ..., K_j do
      Let x^{q,j}_t = x^{q,j}_{t-1} - α_{j,t,q} ∇_{i(j,t,q)} f(v^{q,j}_t);
  Let x̄_{j+1} = (1/Q) Σ_{q=1}^{Q} x^{q,j}_{K_j};
Algorithm 2: Iterations of the view x̄_j.

In order to present a convergence rate result, we need to define an anchor point with respect to which we can consider convergence to an approximate stationary point in expectation. This is not trivial, as both local iterations and averaging are performed asynchronously across different GPUs at distinct moments in time, with the time at which each iteration occurs potentially varying with system-dependent conditions, while the averaged iterates are what matters for convergence. We define the major iteration x̄_j consistently with the analysis presented for the convergence of local SGD in the nonconvex setting in Zhou & Cong (2017). In this case, with asynchrony, x̄_j is a theoretical construct, i.e., it may not exist in any particular GPU's memory at any moment in time.
Let $s^q_j := s_{cur} - |U^q|$ be the current state of the shared counter before the $j$-th synchronization round at GPU $q$; then x̄_j is defined as $x^q_{s^q_j} + \Delta v_j$, where $x^q_{s^q_j}$ is the state of the model at GPU $q$ when $v^{a,q}_j$ was saved and made available for averaging in "major iteration" $j$. Thus, although de facto averaging could have taken place after a set of local updates in time, these local updates conceptually correspond to updates after iteration $j$. This makes x̄_j properly correspond to the equivalent iterate in Zhou & Cong (2017). With that, we consider a sequence of major iteration views $\{\bar{x}_j\}$ with associated inter-averaging local iteration counts $K_j$, local views $\{v^{q,j}_t\}$ at which an unbiased estimate of the (possibly partitioned) gradient is computed, with $0 \le t < K_j$, as well as the local models in memory $\{x^{q,j}_t\}$. The partition, which could be the entire vector in general, is denoted by $i(j,t,q)$. As each GPU has its own annealed stepsize, we denote it in general as $\alpha_{j,t,q}$. We state the formal mathematical algorithm as Algorithm 2. Note that this is the same procedure as the practical "what the processor actually does" Algorithm 1, but with the terms redefined in order to obtain a precise mathematical sequence that is well-defined for analysis.

We make the following standard assumptions: unbiasedness and bounded variance of the stochastic gradient, a bounded second moment, gradient Lipschitz continuity, and lower boundedness of $f$.

Assumption 2.1. 1. The stochastic gradient $\tilde{\nabla}_i f(v^{q,j}_t)$ satisfies, independently of $i$,
$$\mathbb{E}\big[\tilde{\nabla}_i f(v^{q,j}_t)\big] = \nabla_i f(v^{q,j}_t); \quad \mathbb{E}\big\|\tilde{\nabla}_i f(v^{q,j}_t) - \nabla_i f(v^{q,j}_t)\big\|^2 \le \sigma^2; \quad \mathbb{E}\big\|\tilde{\nabla}_i f(v^{q,j}_t)\big\|^2 \le G.$$
2. $f$ is Lipschitz continuously differentiable with constant $L$ and is bounded below by $f_m$.

We must also define some assumptions on the probabilistic model governing the asynchronous computation; as these are fairly technical, we defer them to the appendix. Theorem 2.1.
Given Assumption 2.1, it holds that
$$\frac{1}{Q} \sum_{j=1}^{J} \sum_{q=1}^{Q} \sum_{t=0}^{K_j - 1} \big[\alpha_{j,t,q} C_1 - \alpha^2_{j,t,q} C_2\big] \, \mathbb{E}\big\|\nabla_{i(j,t,q)} f(v^{q,j}_t)\big\|^2 + \frac{1}{Q} \sum_{j=1}^{J} \sum_{q=1}^{Q} \sum_{t=0}^{K_j - 1} \big[\alpha_{j,t,q} C_3 - \alpha^2_{j,t,q} C_4\big] \, \mathbb{E}\big\|\nabla_{i(j,t,q)} f(\bar{x}_j)\big\|^2 \le f(\bar{x}_0) - f_m \quad (5)$$
where $C_1$, $C_2$, $C_3$, and $C_4$ depend on $L$, $B$, and probabilistic quantities defining the asynchronous computation (see Appendix). Thus there exists a set of such constants such that if $\alpha_{j,t,q} = \Theta(1/\sqrt{J})$ then Algorithm 2 ergodically converges with the standard $O(1/\sqrt{J})$ rate for nonconvex objectives.

Proof Summary: The proof follows the structure of the ergodic convergence proof of K-step local SGD given in Zhou & Cong (2017), wherein at each round of averaging there are $Q K_j$ total updates to the model vector, associated with the $Q$ GPUs and $K_j$ minor iterations. Insofar as these updates are close (stochastically, as unbiased estimates, and based on the local models not having changed too much) to the globally synchronized model vector at the last averaging step, there is an expected amount of descent achieved by the sum of these steps. This is balanced against the amount of possible error in this estimate, based on how far the model vector has moved. In cases wherein $v^{q,j}_{t,i} = x^{q,j}_{s,i}$ for $s < 0$ (i.e., the stochastic gradients are taken, due to local asynchrony, at model vectors with components that existed in memory before the last averaging step), we simply bound the worst-case increase in the objective. To balance these two cases, the analysis takes an approach, inspired partially by the analysis given in Cartis & Scheinberg (2018), of separating these into "good" and "bad" iterates, with "good" iterates corresponding to views read after the last model was stored for averaging, with some associated guaranteed descent in expectation, and "bad" iterates those read beforehand.
By considering the stochastic process governing the amount of asynchrony as being governed by probabilistic laws, we can characterize the probability of a "good" and a "bad" iterate, and ultimately seek to balance the total expected descent from one against the worst possible ascent from the other, as a function of these probabilities.

Remark 2.1. [Speedup due to concurrent updates] Consider the case of classical vanilla local SGD, in which there is complete symmetry in the number of local gradient computations between averaging steps and in block sizes across the processors. In this case, for every major iteration there are Q gradient norms on the left-hand side, and at the same time the sum is divided by Q. Thus local SGD as a parallel method does not exhibit classical speedup; rather, it can be considered an approach that uses parallelism to perform more robust and stable gradient updates with multiple batches computed in parallel. Moreover, due to the idle time that exists between the slowest and fastest processors, it exhibits negative speedup relative to the fastest processor. With the approach given in this paper, this negative speedup is corrected for, in that the potential idleness is filled with additional stochastic gradient computations by the fastest process. Alternatively, one can consider this as exhibiting positive speedup relative to the slowest process, whereas standard local SGD has zero speedup relative to the slowest process. Above and beyond this, adding more processes introduces additional latency and delay, which has a mixed effect: on the one hand, we expect gradient norms at delayed iterates to be larger while the process is converging, so having more delayed gradients on the left-hand side makes convergence faster; on the other hand, the resulting error in the degree to which a gradient step decreases the objective increases the constants $C_2$ and $C_4$.

3.1. EXPERIMENTAL SET-UP

Decentralized training. We evaluate the proposed methods LAP-SGD and LPP-SGD against the existing MB-SGD and PL-SGD schemes, using the CNN models RESNET-20 (He et al. (2016)), SQUEEZENET (Iandola et al. (2017)), and WIDERESNET-16x8 (Zagoruyko & Komodakis (2016)) for the 10-/100-class image classification tasks on the CIFAR-10/CIFAR-100 datasets (Krizhevsky (2009)). We also train RESNET-18 for a 1000-class classification problem on the IMAGENET (Russakovsky et al. (2015)) dataset. We keep the sample processing budget identical across the methods. We use the typical approach of partitioning the sample indices among the workers, which can access the entire training set; the partition indices are reshuffled every epoch following a seeded random permutation based on epoch order. To this effect we use a shared counter among the concurrent processes $u \in U^q$ in the asynchronous methods. Thus, our data sampling is i.i.d.

Platform specification. Our experiments are based on a set of Nvidia GeForce RTX 2080 Ti GPUs (Nvidia (2020)) (referred to as GPUs henceforth) with 11 GB on-device memory. We use the following specific settings: (a) S1: a workstation with two GPUs and an Intel(R) Xeon(R) E5-1650 v4 CPU running @ 3.60 GHz with 12 logical cores, (b) S2: a workstation with four GPUs and two Intel(R) Xeon(R) E5-2640 v4 CPUs running @ 2.40 GHz totaling 40 logical cores, and (c) S3: two S2 workstations connected with a 100 Gb/s Infiniband link. The key enabler of our implementation methodology is multiple independent client connections between a CPU and a GPU. Starting from early 2018, with the release of the Volta architecture, Nvidia's Multi-Process Service (MPS) technology supports this efficiently. For more technical specifications please refer to the MPS documentation (MPS (2020)).

Implementation framework.

We used the open-source Pytorch 1.5 (Paszke et al. (2017)) library for our implementations. For cross-GPU/cross-machine communication we use the NCCL (NCCL (2020)) primitives provided by Pytorch. MB-SGD is based on the DistributedDataParallel Pytorch module. The PL-SGD implementation is derived from the authors' code (LocalSGD (2020)) and adapted to our setting. Having generated the computation graph of the loss function of a CNN, the autograd package of Pytorch allows a user to specify the leaf tensors with respect to which gradients are needed. We used this functionality to implement the partitioned gradient computation in LPP-SGD.

Locally-asynchronous implementation. One key requirement of the proposed methods is to support non-blocking synchronization among GPUs. This is both a challenge and an opportunity. Specifically, we use a process on each GPU, working as a parent, to initialize the CNN model and share it among spawned child processes. The child processes work as $u \in U^q, \forall q \in [Q]$, to compute the gradients and update the model. Concurrently, the parent process, instead of remaining idle as commonly happens with such concurrency models, acts as the averaging process $a^q \in P^q, \forall q \in [Q]$, thereby productively utilizing the entire address space occupied over the GPUs. The parent and child processes share the iteration and epoch counters. Notice that here we are using process-level concurrency, which is outside the purview of the thread-level global interpreter lock (GIL) (Python (2020)) of the Python multi-threading framework.

Hyperparameters (HPs). Each of the methods uses identical momentum (Sutskever et al. (2013)) and weight decay (Krogh & Hertz (1991)) for a given CNN/dataset case; we rely on their previously used values (Lin et al. (2020)). The learning rate (LR) schedules for MB-SGD and PL-SGD are identical to Lin et al. (2020). For the proposed methods we used a cosine annealing schedule without any intermediate restarts (Loshchilov & Hutter (2017)).
Following well-accepted practice, we warm up the LR for the first 5 epochs, starting from the baseline value used for single-worker training. In some cases, a grid search (Pontes et al. (2016)) suggested that for LPP-SGD, warming the LR up to 1.25× the warmed-up LR of LAP-SGD for the given case improves the results.
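The warmup-then-cosine schedule described above can be sketched as follows (our paraphrase; the scale factor, epoch counts, and function names are illustrative, not the exact implementation):

```python
import math

# Linear warmup from alpha0 to the scaled peak over the first 5 epochs,
# then cosine annealing without restarts for the rest of training.

def lr_at(epoch, total_epochs, alpha0=0.1, scale=2.0, warmup=5):
    peak = alpha0 * scale          # e.g. alpha0 * (B_loc * Q) / 128
    if epoch < warmup:             # linear warmup from alpha0 to peak
        return alpha0 + (peak - alpha0) * epoch / warmup
    # cosine annealing from the warmed-up peak down toward 0
    t = (epoch - warmup) / (total_epochs - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * t))

schedule = [lr_at(e, 300) for e in range(300)]
```

The schedule rises linearly to the peak at epoch 5 and then decays monotonically, reaching nearly zero at the final epoch.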

In the following discussion we use these abbreviated notations: U: $|U^q|$, B: $B_{loc}$, Tr.A.: training accuracy.

Concurrency on GPUs. We allocate the processes on a GPU up to the availability of the on-device memory. However, once the data-parallel compute resources get saturated, allocating more processes degrades the performance. For example, see Table 4, which lists the average performance of 5 runs for different combinations of U and B for training RESNET-20/CIFAR-10 by LAP-SGD on the setting S1.

Asynchronous averaging frequency. Following PL-SGD, as a general rule, for the first half of the training (up to P epochs) we set the averaging frequency K = 1. However, notice that, unlike PL-SGD, setting K < Q in LAP-SGD and LPP-SGD may not necessarily increase the aggregate number of averaging rounds j. Intuitively, in a locally-asynchronous setting, along with the non-blocking (barrier-free) synchronization among GPUs, the increment events on the shared counter S would be "grouped" on the real-time scale if the processes $u \in U^q$ do not face delays in scheduling, which we control by allocating an optimal number of processes to maximally utilize the compute resources. For instance, Table 5 lists the results of 5 random runs of RESNET-20/CIFAR-10 training with B = 128 and U = 6 with different combinations of K and P on S1. This micro-benchmark indicates that the overall latency and the final optimization result of our method remain robust under small changes in K, which may not be the case with PL-SGD.

Scalability. Table 6 presents the results of WIDERESNET-16x8/CIFAR-10 training in the three system settings that we consider. We observe that in each setting the relative speed-ups of the different methods are approximately consistent.
In particular, we note the following: (a) the reduced communication cost helps PL-SGD marginally outperform MB-SGD; (b) interestingly, increasing $B_{loc}$ from 128 to 512 does not improve the latency by more than ∼4%; this non-linear scalability of data-parallelism was also observed by Lin et al. (2020); (c) each of the methods scales by more than 2x as the implementation is moved from S1, which has 12 CPU cores, to S2, which has 40 CPU cores; furthermore, the scalability is approximately 2x on S3 in comparison to S2, which shows that for each of the methods we utilize the available compute resources maximally; (d) in each setting LAP-SGD achieves ∼30% better throughput compared to MB-SGD with standard batch size; (e) in each setting LPP-SGD outperforms LAP-SGD by ∼12%, making it the fastest method; (f) the training and test accuracy of local large-minibatch training is poor; and (g) the methods LAP-SGD and LPP-SGD consistently improve on the baseline generalization accuracy.

The CIFAR-10/CIFAR-100 training results are listed in Tables 7, 8, and 9. In each case we use K = 16 in LAP-SGD, LPP-SGD, and PL-SGD after 50% of the total sample processing budget. As an overview, the relative latencies of the methods are as seen in Table 6, whereas in each case LAP-SGD and LPP-SGD recover or improve the baseline training results.

Imagenet training results. Having comprehensively evaluated the proposed methods on CIFAR-10/CIFAR-100, here we present their performance on the 1000-class Imagenet dataset. Notice that for this training task, with 8 commodity GPUs at our disposal, we are very much in the small-minibatch setting. A plethora of existing work in the literature efficiently trains a RESNET on IMAGENET with batch sizes up to multiple thousands. Another system-dependent constraint of our setting is that there are hardly any leftover compute resources to exploit locally at a worker. Even so, the proposed methods remain competitive in this setting.

LR tuning strategy. It is pertinent to mention that the existing techniques, such as LARS (You et al.
(2017)), which provide an adaptive LR tuning strategy for large-minibatch settings over a large number of GPUs, wherein each worker locally processes a small minibatch, are insufficient when $B_{loc}$ is increased. For example, see Table 11, which lists the average performance of 5 runs on the setting S1 with Q = 2 for training RESNET-20/CIFAR-10 using the compared methods combined with LARS with η = 0.001 (You et al. (2017)). We scale the LR proportionately: $\alpha_0 \times \frac{B \times Q}{B_{loc}}$, where $B_{loc} = 128$, $Q$ is the number of workers (2 here), and $\alpha_0 = 0.1$. Evidently, LARS did not help MB-SGD and PL-SGD curb the poor generalization due to larger B.
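For reference, the LARS-style layer-wise trust ratio used in this comparison can be sketched as follows (a simplified rendering after You et al. (2017); the weight-decay handling and constants here are our assumptions, not the exact code behind Table 11):

```python
# The local LR for a layer with weights w and gradient g is
# eta * ||w|| / (||g|| + wd * ||w||), multiplying the base LR.

def l2_norm(v):
    return sum(x * x for x in v) ** 0.5

def lars_step(w, g, base_lr=0.1, eta=0.001, wd=5e-4):
    # trust ratio: weight norm over (gradient norm + decayed weight norm)
    trust = eta * l2_norm(w) / (l2_norm(g) + wd * l2_norm(w) + 1e-12)
    # weight decay folded into the update, as is common in LARS variants
    return [wi - base_lr * trust * (gi + wd * wi) for wi, gi in zip(w, g)]

w = [1.0, -2.0, 0.5]
g = [0.3, 0.1, -0.2]
w_new = lars_step(w, g)
```

The per-layer trust ratio rescales the step by the ratio of weight norm to gradient norm, which stabilizes global large-batch training but, as Table 11 indicates, does not by itself rescue a larger local $B_{loc}$.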

3.3. ON THE GENERALIZATION OF THE PROPOSED METHODS

Let us comment on a theoretical basis for the remarkable generalization performance of our methods. In Lin et al. (2020), the modeling of SGD algorithms as a Euler-Maruyama discretization of a Stochastic Differential Equation (SDE) presents the perspective that the batch size corresponds to the inverse of the injected noise. Whereas distributed SGD, by combining stochastic gradients, effectively results in an SGD step with a larger batch size, local SGD, by averaging models rather than gradients, maintains the noise associated with the local small-batch gradients. Given the well-established benefits of greater noise in improving generalization accuracy (e.g. Smith & Le (2018) and others), this presents a heuristic argument as to why local SGD tends to generalize better than distributed SGD. In the Appendix we present an additional argument for why local SGD generalizes well. However, we see that our particular variation, with asynchronous local updates and asynchronous averaging, seems to provide additional generalization accuracy above and beyond local SGD. We provide the following explanation as to why this could be the case, again from the perspective of noise as it would appear in an SDE. Let us recall three facts: 1. The noise appearing in the discretization of the Brownian motion term in the diffusion SDE, and correspondingly the injected noise studied as a driver of increased generalization in previous works on neural network training, is i.i.d. 2. The covariances of minibatch gradients, as statistical estimates of the gradient at $x$ and $x'$, are going to be more similar when $x$ and $x'$ are closer together. 3. (See Section 2.) A challenging property, from the perspective of convergence analysis with locally asynchronous updates, is gradients taken at model snapshots read before a previous all-to-all averaging step, and thus far from the current model in memory.
Thus, the presence of these "highly asynchronous" stochastic gradients, while potentially problematic from the convergence perspective, induces greater probabilistic independence among the injected noise terms, bringing the noise for these updates far closer to the i.i.d. noise that appears in a discretized SDE, and thereby strengthening the injected-noise analogy that already favors local SGD over distributed data-parallel SGD.
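The batch-size/noise correspondence invoked above can be written in one line (a heuristic rendering of ours, not a derivation from Lin et al. (2020)): modeling the minibatch gradient as the true gradient plus an estimation error of covariance $\Sigma(x_k)/B$, the SGD step

```latex
x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \alpha_k \xi_k,
\qquad \xi_k \sim \mathcal{N}\!\Big(0, \tfrac{1}{B}\,\Sigma(x_k)\Big),
```

is an Euler-Maruyama step of a diffusion whose temperature scales as $\alpha_k / B$: increasing the batch size $B$ shrinks the injected noise, while model averaging (as in local SGD and in our methods) preserves the per-worker $B$, and hence the noise.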

4. RELATED WORK

In the previous sections we cited the existing literature wherever applicable; in this section we present a brief overview of closely related works and highlight our novelty. In the shared-memory setting, HOGWILD! (Recht et al. (2011)) is now the classic approach to implementing SGD. However, it applies to the centralized setting of a single worker and therefore is not known to have been practically utilized for large-scale DNN training. Its success led to the design of variants targeting specific system aspects of multi-core machines. For example, Buckwild! (Sa et al. (2015)) proposed using restricted-precision training on a CPU. Another variant, called HOGWILD!++ (Zhang et al. (2016b)), harnesses multi-core computers based on the non-uniform memory access (NUMA) architecture. In this method, threads pinned to individual CPUs on a multi-socket mainboard, with access to a common main memory, form clusters. In principle, the proposed LAP-SGD can be seen as deriving from HOGWILD!++.
However, there are important differences. (a) At the structural level, the averaging in HOGWILD!++ is binary on a ring graph of thread-clusters; further, it is a token-based procedure wherein each round only two neighbours synchronize, whereas in LAP-SGD the averaging is all-to-all. (b) In HOGWILD!++ each cluster maintains two copies of the model: a locally updating copy and a buffer copy storing the last synchronized view of the model, whereby each cluster essentially passes the "update" in its local model since the last synchronization to its neighbour. This approach has a drawback, identified by the authors: the update that is passed around a ring of workers eventually "comes back" to its originator, thereby leading to divergence; to overcome this problem they decay the sent-out update. By contrast, LAP-SGD uses no buffer and does not track updates as such; averaging the model with each peer, similar to L-SGD, helps each of the peers adjust their optimization dynamics. (c) It is not known whether the token-based model averaging of HOGWILD!++ is sufficient for training DNNs, where generalization is the core concern; in contrast, we observed that our asynchronous averaging provides an effective synchronization protocol and often improves generalization. (d) Comparing the HOGWILD!++ thread-clusters to the concurrent processes on GPUs in LAP-SGD, the latter uses a dedicated process that performs averaging without disturbing the local gradient updates, thereby maximally reducing the communication overhead. (e) Finally, the convergence theory of LAP-SGD guarantees its efficacy for DNN training, which we demonstrated experimentally; by contrast, HOGWILD!++ has no convergence guarantee. Recently, Wang et al. (2020) proposed Overlap-local-SGD, wherein they suggested keeping a model copy at each worker, very similar to HOGWILD!++, which is simultaneously averaged while sequential computation for multiple iterations happens locally.
Their limited experiments showed that it reduced the communication overhead in a non-iid training case based on CIFAR-10; however, little is known about its performance in general. The asynchronous partitioned gradient update of LPP-SGD derives from Kungurtsev et al. (2019); unlike them, however, we use no locks, and our implementation setting is decentralized and thus scalable.
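The structural contrast in difference (a) can be made concrete with a small sketch. The code below is our own simplified model of the two protocols, not taken from either codebase: token-based pairwise averaging on a ring (as in HOGWILD!++) versus the all-to-all averaging used by LAP-SGD.

```python
import numpy as np

def ring_token_average(models, rounds=1):
    """Token-based pairwise averaging on a ring: in each round only the
    token holder and its next neighbour synchronize (simplified model)."""
    models = [m.copy() for m in models]
    Q = len(models)
    token = 0
    for _ in range(rounds):
        nxt = (token + 1) % Q
        avg = 0.5 * (models[token] + models[nxt])
        models[token] = avg.copy()
        models[nxt] = avg.copy()
        token = nxt  # the token passes on to the neighbour
    return models

def all_to_all_average(models):
    """All-to-all averaging: every worker adopts the global mean."""
    mean = np.mean(models, axis=0)
    return [mean.copy() for _ in models]

models = [np.array([float(q)]) for q in range(4)]  # replicas 0, 1, 2, 3
# One all-to-all step brings every replica to the global mean.
synced = all_to_all_average(models)
# One ring round only mixes a single pair; replicas still disagree.
ring = ring_token_average(models, rounds=1)
```

The sketch illustrates why the all-to-all scheme needs no update decay: a single averaging step already makes every replica identical, whereas the ring protocol must circulate information over many rounds.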

5. CONCLUSION

Picking up from where Golmant et al. (2018) concluded, referring to their findings: "These results suggest that we should not assume that increasing the batch size for larger datasets will keep training times manageable for all problems. Even though it is a natural form of data parallelism for large-scale optimization, alternative forms of parallelism should be explored to utilize all of our data more efficiently", our work introduces a fresh approach to addressing this challenge. In our experiments, we observed that the natural system-generated noise in some cases effectively improved the generalization accuracy, which we could not obtain using the existing methods irrespective of the choice of seed for random sampling. The empirical findings suggest that the proposed variant of distributed SGD has an appropriate place among efficient optimization methods for training deep neural networks. As a general guideline for the applicability of our approach, we suggest the following: monitor the resource consumption of a GPU that trains a CNN; if there is any sign that utilization is below 100%, try out LAP-SGD and LPP-SGD instead of arduously, and at times unsuccessfully, tuning the hyperparameters in order to harness the data-parallelism. The asynchronous averaging protocol makes LAP-SGD and LPP-SGD especially attractive in settings with a large number of workers. There is a plethora of small-scale model and dataset combinations where the critical batch size, after which the returns in terms of convergence per wall-clock time diminish, is small relative to existing system capabilities (Golmant et al. (2018)). In such cases LAP-SGD and LPP-SGD become readily useful. Yet, exploring the efficiency of LAP-SGD and LPP-SGD at massive scales, where hundreds of GPUs enable training IMAGENET in minutes (Ying et al. (2018)), remains an ideal future goal.
We also plan to extend the proposed methods by combining them with communication-optimization approaches such as QSGD (Alistarh et al. (2017)).
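The guideline above (check whether GPU utilization sits below 100% before reaching for LAP-SGD or LPP-SGD) can be automated. A minimal sketch follows; it assumes `nvidia-smi` is available with its CSV query format, and the helper names and the 95% threshold are our own illustrative choices:

```python
import subprocess

def parse_utilization(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`:
    one integer percentage per line, one line per GPU."""
    return [int(line.strip()) for line in csv_text.strip().splitlines()]

def underutilized_gpus(threshold=95):
    """Return indices of GPUs whose utilization is below `threshold` percent.
    Requires nvidia-smi on the PATH; not invoked in the example below."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return [i for i, u in enumerate(parse_utilization(out))
            if u < threshold]

# Example on canned output: two GPUs, the second one has spare capacity.
sample = "99\n62\n"
utils = parse_utilization(sample)
spare = [i for i, u in enumerate(utils) if u < 95]
```

If `spare` is non-empty during training, that is the signal that local asynchronous updates could put the idle capacity to work.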

A APPENDIX A: CONVERGENCE THEORY

A.1 PROBABILISTIC ASSUMPTIONS GOVERNING THE ASYNCHRONOUS COMPUTATION

Now let us discuss the formalities of the asynchronous computation. We consider that the presence of local HOGWILD-like asynchronous computation introduces stochastic delays: at each stochastic gradient computation, the set of parameters at which the stochastic gradient is evaluated is random, following some distribution. Thus, considering that, in the general case, $v^{q,j}_{t,i} \in \bigcup_{k\in\{0,\dots,j\}}\{x^{q,k}_{s,i}\}_{s\in\{0,\dots,t\}}$, we can now define a probability that this parameter-view block is equal to each of these potential historical parameter values. To this end we define $I^{t,i,q,j}_{s,k}$ as the event that block $i$ of the view for GPU $q$ at major iteration $j$ and minor iteration $t$ is equal to the actual parameter $x^{q,k}_{s,i}$, and $p^{t,i,q,j}_{s,k}$ as its probability. Now, $v^{q,j}_{t,i}$ could be, e.g., $x^{q,j-1}_{s,i}$ for some $s\in\{0,\dots,K_{j-1}\}$, i.e., it could have been evaluated at a parameter from before the last averaging took place. In the nonconvex setting it would be entirely hopeless to bound $\|x^{q,j-1}_{s,i} - v^{q,j}_{t,i}\|$ in general; in this case we can only hope that the objective function decrease achieved by iterations whose gradients are computed after an averaging step outweighs the worst-case increase from those computed before it. In order to perform such an analysis, we need to bound the probability that this worst case occurs. To facilitate the notation for this scenario, let us define $x^{q,j}_{-1,i} = x^{q,j-1}_{K_{j-1}-1,i}$, $x^{q,j}_{-2,i} = x^{q,j-1}_{K_{j-1}-2,i}$, etc., and then $p^{t,i,q,j}_{-1,j}$ correspondingly. We can extend this analogously to delays reaching back before two averaging steps, etc. Note that with this notation, $p^{t,i,q,j}_{l,j}$ is well defined for any $l \le t$, $l \in \mathbb{Z}$, as long as $|l| \le \sum_{k=0}^{j-1}K_k$.
In order to derive an expected decrease in the objective, we need to bound the probability of an increase, i.e., the probability that the view is older than the previous averaging; this can in turn be bounded via the probability that a particular read is more than some number $\tau$ of iterations old. We thus make the following assumptions.

Assumption A.1. It holds that:
1. $p^{t,i,q,j}_{l,j} = 0$ for $l \le t - D$ (maximum delay);
2. there exist $\{p_\tau\}_{\tau\in\{1,\dots,D\}}$ such that for all $(q,j,t)$ it holds that $\mathbb{P}\left[\cup_i I^{t,i,q,j}_{t-\tau,j}\right] \le p_\tau$ (uniform bound on the components' delays).

With these, we can make a statement on the error in the view. In particular, it holds that
$$\mathbb{E}\left[\left\|v^{q,j}_t - x^{q,j}_t\right\| \,\middle|\, \cap_i\cup_{l\ge 0} I^{t,i,q,j}_{l,j}\right] \le \alpha^2_{j,t,q}B \qquad (6)$$
for some $B > 0$. This bound comes from Section A.B.6 in Nadiradze et al. (2020). Thus, if the view is taken such that all components were read after the last averaging step, then we can bound the error as given.
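Assumption A.1 can be illustrated with a small simulation of our own (the decay weights and bounds below are illustrative, not from the paper): per-read delays are drawn from a distribution truncated at the maximum delay $D$, and the empirical delay frequencies are checked against known per-delay bounds $p_\tau$.

```python
import random

def sample_delay(D, decay=0.5, rng=random):
    """Sample a read delay in {0,...,D-1} with geometric-like weights,
    truncated at the maximum delay D (Assumption A.1, item 1)."""
    weights = [decay ** tau for tau in range(D)]
    u = rng.random() * sum(weights)
    acc = 0.0
    for tau, w in enumerate(weights):
        acc += w
        if u <= acc:
            return tau
    return D - 1

random.seed(0)
D = 4
delays = [sample_delay(D) for _ in range(20000)]
# True delay probabilities are about 0.533, 0.267, 0.133, 0.067, so the
# uniform bounds below (item 2) hold with a comfortable margin.
p = [0.6, 0.3, 0.15, 0.08]
freqs = [delays.count(tau) / len(delays) for tau in range(D)]
```

The maximum-delay property corresponds to `max(delays) < D`, and the uniform bounds to `freqs[tau] <= p[tau]` for every `tau`.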

A.2 PROOF OF MAIN CONVERGENCE THEOREM

Recall that the major iterate $\bar{x}_j$ is defined as the value of the parameters after an averaging step has taken place, which is well-defined since at that point every GPU holds the same set of parameter values. The update to $\bar{x}_j$ can be written as
$$\bar{x}_{j+1} = \frac{1}{Q}\sum_{q=1}^{Q}\left(\bar{x}_j - \sum_{t=0}^{K_j-1}\alpha_{j,t,q}\,\tilde{\nabla}_{i(q,j,t)}f\left(v^{q,j}_t\right)\right)$$
where we define $\tilde{\nabla}_{i}f(v^{q,j}_t)$ as the vector of size $n$ whose $i$'th block is the computed stochastic gradient at $v^{q,j}_t$, with the rest of the components padded with zeros. The chosen block $i(q,j,t)$ depends on the GPU and on the minor and major iteration, allowing flexibility in the choice of block update (including the entire vector).

We are now ready to prove the convergence theorem. The structure of the proof follows Zhou & Cong (2017), who derive the standard sublinear convergence rate for local SGD in a synchronous environment for nonconvex objectives. We begin with the standard application of the descent lemma,
$$f(\bar{x}_{j+1}) - f(\bar{x}_j) \le -\left\langle \nabla f(\bar{x}_j),\, \frac{1}{Q}\sum_{q=1}^{Q}\sum_{t=0}^{K_j-1}\alpha_{j,t,q}\tilde{\nabla}_{i(j,t,q)}f(v^{q,j}_t)\right\rangle + \frac{L}{2}\left\|\frac{1}{Q}\sum_{q=1}^{Q}\sum_{t=0}^{K_j-1}\alpha_{j,t,q}\tilde{\nabla}_{i(j,t,q)}f(v^{q,j}_t)\right\|^2$$
Now, since $\mathbb{E}\,\tilde{\nabla}_{i(j,t,q)}f(v^{q,j}_t) = \nabla_{i(j,t,q)}f(v^{q,j}_t)$,
$$-\mathbb{E}\left\langle \nabla f(\bar{x}_j), \tilde{\nabla}_{i(j,t,q)}f(v^{q,j}_t)\right\rangle = -\frac{1}{2}\left(\left\|\nabla_{i(j,t,q)}f(\bar{x}_j)\right\|^2 + \mathbb{E}\left\|\nabla_{i(j,t,q)}f(v^{q,j}_t)\right\|^2 - \mathbb{E}\left\|\nabla_{i(j,t,q)}f(v^{q,j}_t) - \nabla_{i(j,t,q)}f(\bar{x}_j)\right\|^2\right) \qquad (9)$$
We now split the last term according to the two cases and use equation 6. Let $A_{t,q,j} := \cap_i\cup_{l\ge 0}I^{t,i,q,j}_{l,j}$ denote the event that every block of the view was read after the last averaging step. Then
$$\mathbb{E}\left\|\nabla_{i(j,t,q)}f(v^{q,j}_t) - \nabla_{i(j,t,q)}f(\bar{x}_j)\right\|^2 \le \alpha^2_{j,t,q}B\,\mathbb{P}\left[A_{t,q,j}\right] + 2\left(\left\|\nabla_{i(j,t,q)}f(\bar{x}_j)\right\|^2 + \mathbb{E}\left\|\nabla_{i(j,t,q)}f(v^{q,j}_t)\right\|^2\right)\mathbb{P}\left[A^c_{t,q,j}\right]$$
and thus, combining with equation 9 and writing $G_{j,t,q} := \|\nabla_{i(j,t,q)}f(\bar{x}_j)\|^2 + \mathbb{E}\|\nabla_{i(j,t,q)}f(v^{q,j}_t)\|^2$ for brevity, we get the overall bound
$$-\mathbb{E}\left\langle \nabla f(\bar{x}_j), \tilde{\nabla}_{i(j,t,q)}f(v^{q,j}_t)\right\rangle \le -\frac{1}{2}G_{j,t,q} + \frac{\alpha^2_{j,t,q}B}{2}\mathbb{P}\left[A_{t,q,j}\right] + G_{j,t,q}\,\mathbb{P}\left[A^c_{t,q,j}\right] = -\left(\mathbb{P}\left[A_{t,q,j}\right] - \frac{1}{2}\right)G_{j,t,q} + \frac{\alpha^2_{j,t,q}B}{2}\mathbb{P}\left[A_{t,q,j}\right]$$
It can be seen from this expression that we must have
$$\mathbb{P}\left[A_{t,q,j}\right] \ge \frac{1}{2} + \delta \qquad (11)$$
for some $\delta > 0$ to achieve descent in expectation at iteration $j$ for sufficiently small stepsizes. Since we are taking the sum of such iterations, we must ultimately have
$$\frac{1}{Q}\sum_{q=1}^{Q}\sum_{t=0}^{K_j-1}\alpha_{j,t,q}\left(\mathbb{P}\left[A_{t,q,j}\right] - \frac{1}{2} - \frac{\alpha_{j,t,q}}{QK_j} - \frac{\alpha_{j,t,q}B}{2}\right)\mathbb{E}\left\|\nabla_{i(j,t,q)}f(v^{q,j}_t)\right\|^2 + \frac{1}{Q}\sum_{q=1}^{Q}\sum_{t=0}^{K_j-1}\alpha_{j,t,q}\left(\mathbb{P}\left[A_{t,q,j}\right] - \frac{1}{2} - \frac{\alpha^2_{j,t,q}B}{2}\right)\left\|\nabla_{i(j,t,q)}f(\bar{x}_j)\right\|^2 \ge \bar{\delta}_j \qquad (12)$$
with $\sum_{j=0}^{\infty}\bar{\delta}_j \ge f(\bar{x}_0) - f_m$, where, recall, $f_m$ is a lower bound on $f$, in order to achieve asymptotic convergence. The standard sublinear SGD convergence rate is recovered with any choice $\alpha_{j,t,q} = \Theta\left(1/\sqrt{J}\right)$ and thus, ultimately, $\bar{\delta}_j = \Omega\left(1/\sqrt{J}\right)$.

Let us now consider the quantity $\mathbb{P}[A_{t,q,j}]$ in more detail and study how the nature of the concurrency affects the possibility and rate of convergence. In particular, notice that $\mathbb{P}\left[\cap_i I^{t,i,q,j}_{l,j}\right] \le 1 - p_{t-\tau}$ for $l \ge t-\tau$. In general, of course, we expect this quantity to increase as $l$ gets closer to $t$. Consider two extreme cases. If there is always only one SG iteration for all processes in every major iteration $j$, i.e., $K_j \equiv 1$, then any delay means reading a vector in memory from before the last major iteration, and thus the probability of a delay greater than zero must be very small in order to offset the worst possible ascent. On the other hand, if in general $K_j \gg \tau$, then while the first $\tau$ minor iterations could be problematic, at a level depending on the probability of the small delay times, for $t > \tau$ the vector $v^{q,j}_t$ clearly satisfies equation 6. Thus we can sum up our conclusions in the following statements. 1. The higher the mean, the variance, and the thickness of the tails of the delay distribution, the more problematic convergence becomes. 2. The larger the number of local iterations each GPU performs between averaging steps, the more likely a favorable convergence occurs. The first is, of course, standard and obvious. The second presents the interesting finding that, when running HOGWILD-type SG iterations on local shared memory, performing local SGD with a larger gap in time between averaging results in more robust performance. This suggests a certain fundamental harmony between asynchronous concurrency and local SGD: more "aggressive" locality, in the sense of doing more local updates between averaging, coincides with expected performance gains and robustness under more "aggressive" asynchrony and concurrency, in the sense of delays in the computations associated with local processes. In addition, contrasting convergence with respect to the block size, clearly the larger the block, the faster the overall convergence, since the norms of the gradient vectors appear. An interesting possibility to consider is that, if a process can roughly estimate or predict when averaging will be triggered, robustness could be gained by attempting block updates right after an expected averaging step, and full parameter-vector updates later in the major iteration.
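The claim that bounded, reasonably rare delays still permit descent can be sanity-checked numerically. The toy below is our own illustration (not the paper's experimental code): SGD on a strongly convex quadratic where each gradient is evaluated at a view up to `D` iterations stale.

```python
import random

def stale_sgd(dim=5, steps=400, alpha=0.05, D=3, seed=1):
    """SGD on f(x) = 0.5*||x||^2 where each gradient is computed at a
    uniformly random view drawn from the last D iterates (bounded delay)."""
    rng = random.Random(seed)
    x = [1.0] * dim
    history = [list(x)]
    for _ in range(steps):
        # pick a stale view; delay is bounded by D (Assumption A.1, item 1)
        view = history[-1 - rng.randrange(min(D, len(history)))]
        noise = [rng.gauss(0.0, 0.01) for _ in range(dim)]
        # gradient of 0.5*||v||^2 is v; add small stochastic noise
        x = [xi - alpha * (vi + ni) for xi, vi, ni in zip(x, view, noise)]
        history.append(list(x))
        if len(history) > D + 1:
            history.pop(0)  # keep only the last D+1 iterates
    return x

x0_norm = 5 ** 0.5
x_final = stale_sgd()
final_norm = sum(v * v for v in x_final) ** 0.5
```

Despite every gradient being potentially stale, the iterates contract toward the minimizer, in line with the sublinear-rate conclusion above; increasing `D` or the delay probabilities degrades this behavior.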

A.3 CONVERGENCE -SIMPLER CASE

In reference to the classic local SGD theory, in particular Stich (2018) for the strongly convex case and Zhou & Cong (2017) for the nonconvex case, we consider the simpler scenario wherein $i(q,j,t) = [n]$ and $v^{q,j}_{t,i} = x^{q,j}_{s,i}$ with $s \ge 0$ for all $v^{q,j}_{t,i}$, i.e., at no point are local updates computed from gradients evaluated at model components that existed in memory prior to the last averaging step. We shall see that the local asynchrony introduces a mild adjustment of the constants in the strongly convex case, relative to the classic result, and results in no change whatsoever in the nonconvex case.
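As a concrete reference point for this simpler scenario, a minimal synchronous local-SGD loop can be sketched as follows (an illustration of our own; the quadratic objective, constants, and function name are not from the paper): $Q$ workers each take $K$ local stochastic steps, then all workers average, so views are never older than the last averaging step.

```python
import random

def local_sgd(Q=4, K=5, rounds=30, alpha=0.1, dim=3, seed=7):
    """Local SGD on f(x) = 0.5*||x||^2: each of Q workers takes K local
    stochastic steps, then all workers adopt the average of their iterates."""
    rng = random.Random(seed)
    x = [[1.0] * dim for _ in range(Q)]
    for _ in range(rounds):
        for q in range(Q):
            for _ in range(K):
                # stochastic gradient of 0.5*||x||^2 at the local iterate
                g = [x[q][i] + rng.gauss(0.0, 0.01) for i in range(dim)]
                x[q] = [x[q][i] - alpha * g[i] for i in range(dim)]
        # synchronous averaging step: every worker gets the mean
        mean = [sum(x[q][i] for q in range(Q)) / Q for i in range(dim)]
        x = [list(mean) for _ in range(Q)]
    return mean

final = local_sgd()
final_norm = sum(v * v for v in final) ** 0.5
```

The asynchronous variants analyzed below replace the fresh local iterate in the gradient computation with a possibly stale view $v^q_t$, which is exactly where the delay terms in the constants originate.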

A.3.1 STRONGLY CONVEX CASE

The proof of convergence will be based on Stich (2018), the synchronous case. The formalism of the argument changes to Algorithm 4; note that this is functionally the same, simply dispensing with the concept of major iterations, except to define $K_j$.

Algorithm 4:
  Initialize $x^q_0$ for all $q$
  for $t = 1, \dots, T$ do
    for all $q$ do: let $x^q_t = x^q_{t-1} - \alpha_{t,q}\tilde{\nabla}f(v^q_t)$
    if $t \bmod \sum_{j=1}^{J}K_j = 0$ for some $J$ then: let $x^q_t = \bar{x}_t = \frac{1}{Q}\sum_{q=1}^{Q}x^q_t$ for all $q$

The only change in the procedure is that the stochastic gradients are computed as evaluated at a view $v^q_t$, so we shall see how the convergence result contrasts with Stich (2018), where averaging is synchronous, when computations are performed in this manner. Let
$$\bar{x}_t = \frac{1}{Q}\sum_{q=1}^{Q}x^q_t,\qquad g_t = \frac{1}{Q}\sum_{q=1}^{Q}\tilde{\nabla}f(v^q_t),\qquad \bar{g}_t = \frac{1}{Q}\sum_{q=1}^{Q}\nabla f(v^q_t)$$
where $\tilde{\nabla}f$ denotes a stochastic gradient and $\nabla f$ the true gradient (the analogous averages at $x^q_t$ appear only as intermediate quantities). We have, as in the proof of Lemma 3.1 of Stich (2018),
$$\left\|\bar{x}_{t+1} - x^*\right\|^2 = \left\|\bar{x}_t - \alpha_t g_t - x^*\right\|^2 = \left\|\bar{x}_t - x^* - \alpha_t\bar{g}_t\right\|^2 + \alpha^2_t\left\|g_t - \bar{g}_t\right\|^2 + 2\alpha_t\left\langle \bar{x}_t - x^* - \alpha_t\bar{g}_t,\, \bar{g}_t - g_t\right\rangle \qquad (13)$$
Continuing,
$$\left\|\bar{x}_t - x^* - \alpha_t\bar{g}_t\right\|^2 = \left\|\bar{x}_t - x^*\right\|^2 + \alpha^2_t\left\|\bar{g}_t\right\|^2 - 2\alpha_t\left\langle \bar{x}_t - x^*, \bar{g}_t\right\rangle \le \left\|\bar{x}_t - x^*\right\|^2 + \frac{\alpha^2_t}{Q}\sum_{q=1}^{Q}\left\|\nabla f(v^q_t) - \nabla f(x^*)\right\|^2 - \frac{2\alpha_t}{Q}\sum_{q=1}^{Q}\left\langle v^q_t - x^*, \nabla f(v^q_t)\right\rangle - \frac{2\alpha_t}{Q}\sum_{q=1}^{Q}\left\langle \bar{x}_t - v^q_t, \nabla f(v^q_t)\right\rangle$$
Using Young's inequality and $L$-smoothness,
$$-2\left\langle \bar{x}_t - v^q_t, \nabla f(v^q_t)\right\rangle \le 2L\left\|\bar{x}_t - v^q_t\right\|^2 + \frac{1}{2L}\left\|\nabla f(v^q_t)\right\|^2 = 2L\left\|\bar{x}_t - v^q_t\right\|^2 + \frac{1}{2L}\left\|\nabla f(v^q_t) - \nabla f(x^*)\right\|^2 \le 2L\left\|\bar{x}_t - v^q_t\right\|^2 + \left(f(v^q_t) - f^*\right)$$
Applying this to the above estimate of $\|\bar{x}_t - x^* - \alpha_t\bar{g}_t\|^2$, together with strong convexity, we get
$$\left\|\bar{x}_t - x^* - \alpha_t\bar{g}_t\right\|^2 \le \left\|\bar{x}_t - x^*\right\|^2 + \frac{2\alpha_t L}{Q}\sum_{q=1}^{Q}\left\|\bar{x}_t - v^q_t\right\|^2 + \frac{2\alpha_t}{Q}\sum_{q=1}^{Q}\left(\left(\alpha_t L - \frac{1}{2}\right)\left(f(v^q_t) - f^*\right) - \frac{\mu}{2}\left\|v^q_t - x^*\right\|^2\right)$$
Let $\alpha_t \le \frac{1}{4L}$, so that $\alpha_t L - \frac{1}{2} \le -\frac{1}{4}$. By the convexity of $\frac{1}{4}\left(f(x) - f(x^*)\right) + \frac{\mu}{2}\left\|x - x^*\right\|^2$,
$$-\frac{2\alpha_t}{Q}\sum_{q=1}^{Q}\left(\frac{1}{4}\left(f(v^q_t) - f^*\right) + \frac{\mu}{2}\left\|v^q_t - x^*\right\|^2\right) \le -\frac{2\alpha_t}{Q}\sum_{q=1}^{Q}\left(\frac{1}{4}\left(f(x^q_t) - f^*\right) + \frac{\mu}{2}\left\|x^q_t - x^*\right\|^2\right) + \frac{\alpha_t}{2Q}\sum_{q=1}^{Q}\left(G\left\|v^q_t - x^q_t\right\| + 2\mu\left\|v^q_t - x^q_t\right\|^2\right)$$
Putting this in equation 13 and taking expectations, we get
$$\mathbb{E}\left\|\bar{x}_{t+1} - x^*\right\|^2 \le \left(1 - \mu\alpha_t\right)\mathbb{E}\left\|\bar{x}_t - x^*\right\|^2 + \alpha^2_t\,\mathbb{E}\left\|g_t - \bar{g}_t\right\|^2 - \frac{\alpha_t}{2}\mathbb{E}\left(f(\bar{x}_t) - f^*\right) + \frac{2\alpha_t L}{Q}\sum_{q=1}^{Q}\mathbb{E}\left\|\bar{x}_t - v^q_t\right\|^2 + \frac{\alpha_t}{2Q}\sum_{q=1}^{Q}\mathbb{E}\left(G\left\|v^q_t - x^q_t\right\| + 2\mu\left\|v^q_t - x^q_t\right\|^2\right) \qquad (14)$$
By Assumption 2.1, we have
$$\mathbb{E}\left\|g_t - \bar{g}_t\right\|^2 = \mathbb{E}\left\|\frac{1}{Q}\sum_{q=1}^{Q}\left(\tilde{\nabla}f(v^q_t) - \nabla f(v^q_t)\right)\right\|^2 \le \frac{\sigma^2}{Q} \qquad (15)$$
We have that
$$\sum_{q=1}^{Q}\left\|v^q_t - x^q_t\right\| \le \sum_{q=1}^{Q}\left\|v^q_t - x^q_t\right\|_1 \le \sum_{q=1}^{Q}\sum_{s=\max\{t-\tau,\,t_0\}}^{t-1}\alpha_s\left\|\tilde{\nabla}f(v^q_s)\right\|_1 \le \alpha_t Q\tau\sqrt{n}G \qquad (16)$$
and similarly
$$\sum_{q=1}^{Q}\left\|v^q_t - x^q_t\right\|^2 \le \sum_{q=1}^{Q}\sum_{s=\max\{t-\tau,\,t_0\}}^{t-1}\alpha^2_s\left\|\tilde{\nabla}f(v^q_s)\right\|^2 \le Q\alpha^2_t\tau G^2 \qquad (17)$$
Letting the index $t_0$ be such that $t - t_0 \le H := \max_j\{K_j\}$ and averaging takes place at $t_0$, i.e., $\bar{x}_{t_0} = x^q_{t_0}$ for all $q$, we have
$$\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\left\|\bar{x}_t - v^q_t\right\|^2 = \frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\left\|v^q_t - x^q_t + x^q_t - \bar{x}_{t_0} - \left(\bar{x}_t - \bar{x}_{t_0}\right)\right\|^2 \le \frac{2}{Q}\sum_{q=1}^{Q}\left(\mathbb{E}\left\|v^q_t - x^q_t\right\|^2 + \mathbb{E}\left\|x^q_t - \bar{x}_{t_0} - \left(\bar{x}_t - \bar{x}_{t_0}\right)\right\|^2\right) \le \frac{2}{Q}\sum_{q=1}^{Q}\mathbb{E}\left\|x^q_t - \bar{x}_{t_0}\right\|^2 + 2\alpha^2_t\tau G^2 \le \frac{2}{Q}\sum_{q=1}^{Q}H\alpha^2_{t_0}\sum_{s=t_0}^{t-1}\mathbb{E}\left\|\tilde{\nabla}f(v^q_s)\right\|^2 + 2\alpha^2_t\tau G^2 \le 2H^2\alpha^2_{t_0}G^2 + 2\alpha^2_t\tau G^2 \le 8H^2\alpha^2_t G^2 + 2\alpha^2_t\tau G^2 \qquad (18)$$
where we use $\mathbb{E}\|X - \mathbb{E}X\|^2 = \mathbb{E}\|X\|^2 - \|\mathbb{E}X\|^2$ and equation 17 to go from the second expression to the third, and $\alpha_{t_0} \le 2\alpha_t$ in the last step.
Finally, putting equation 15, equation 18, equation 16 and equation 17 into equation 14, we get that
$$\mathbb{E}\left\|\bar{x}_{t+1} - x^*\right\|^2 \le \left(1 - \mu\alpha_t\right)\mathbb{E}\left\|\bar{x}_t - x^*\right\|^2 + \frac{\alpha^2_t\sigma^2}{Q} - \frac{\alpha_t}{2}\mathbb{E}\left(f(\bar{x}_t) - f^*\right) + 16\alpha^3_t LH^2G^2 + \alpha^2_t\tau\sqrt{n}G^2 + 2\left(\mu + 2L\right)\tau\alpha^3_t G^2$$
Finally, using Lemma 3.4 of Stich (2018), with $a > \max\{16\kappa, H\}$ for $\kappa = L/\mu$, $w_t = (a+t)^2$, and $S_T = \sum_{t=0}^{T-1}w_t$, we obtain
$$\mathbb{E}f\left(\frac{1}{QS_T}\sum_{q=1}^{Q}\sum_{t=0}^{T-1}w_t x^q_t\right) - f^* \le \frac{\mu a^3}{2S_T}\left\|x_0 - x^*\right\|^2 + \frac{4T\left(T + 2a\right)}{\mu S_T}\left(\frac{\sigma^2}{Q} + \tau\sqrt{n}G^2\right) + \frac{256T}{\mu^2 S_T}\left(16LH^2G^2 + 2\left(\mu + 2L\right)\tau G^2\right)$$
which simplifies, using $\mu\left\|x_0 - x^*\right\| \le 2G$, to
$$\mathbb{E}f\left(\frac{1}{QS_T}\sum_{q=1}^{Q}\sum_{t=0}^{T-1}w_t x^q_t\right) - f^* = O\left(\frac{1}{\mu QT} + \frac{\kappa + H}{\mu QT^2}\right)\sigma^2 + O\left(\frac{\tau\sqrt{n}}{\mu T}\right)G + O\left(\frac{\tau\sqrt{n}\left(\kappa + H\right)}{\mu T^2}\right)G + O\left(\frac{\kappa H^2 + \tau\left(\mu + 2L\right)}{\mu T^2} + \frac{\kappa^3 + H^3}{\mu T^3}\right)G^2$$

A.3.2 NONCONVEX CASE

This proof will again follow Zhou & Cong (2017). In this case the step-sizes $\{\alpha_{j,t,q}\}$ are independent of $t$ and $q$, i.e., they are simply $\{\alpha_j\}$. Thus,
$$\bar{x}_{j+1} = \frac{1}{Q}\sum_{q=1}^{Q}\left(\bar{x}_j - \sum_{t=0}^{K_j-1}\alpha_j\tilde{\nabla}f\left(v^{q,j}_t\right)\right)$$
and thus,
$$f(\bar{x}_{j+1}) - f(\bar{x}_j) \le -\left\langle \nabla f(\bar{x}_j),\, \frac{1}{Q}\sum_{q=1}^{Q}\sum_{t=0}^{K_j-1}\alpha_j\tilde{\nabla}f(v^{q,j}_t)\right\rangle + \frac{L}{2}\left\|\frac{1}{Q}\sum_{q=1}^{Q}\sum_{t=0}^{K_j-1}\alpha_j\tilde{\nabla}f(v^{q,j}_t)\right\|^2 \qquad (21)$$
Now, since $\mathbb{E}\,\tilde{\nabla}f(v^{q,j}_t) = \nabla f(v^{q,j}_t)$,
$$-\mathbb{E}\left\langle \nabla f(\bar{x}_j), \tilde{\nabla}f(v^{q,j}_t)\right\rangle = -\frac{1}{2}\left(\left\|\nabla f(\bar{x}_j)\right\|^2 + \mathbb{E}\left\|\nabla f(v^{q,j}_t)\right\|^2 - \mathbb{E}\left\|\nabla f(v^{q,j}_t) - \nabla f(\bar{x}_j)\right\|^2\right) \le -\frac{1}{2}\left(\left\|\nabla f(\bar{x}_j)\right\|^2 + \mathbb{E}\left\|\nabla f(v^{q,j}_t)\right\|^2\right) + \frac{L^2}{2}\mathbb{E}\left\|v^{q,j}_t - \bar{x}_j\right\|^2$$
We now continue the proof along the same lines as in Zhou & Cong (2017). In particular, we get
$$\mathbb{E}\left\|v^{q,j}_t - \bar{x}_j\right\|^2 \le t^2\alpha^2_j\sigma^2 + t\alpha^2_j\,\mathbb{E}\sum_{s=0}^{t-1}\left\|\nabla f(v^{q,j}_s)\right\|^2$$
Let us define $\bar{K} = \max_j\{K_j\}$ and $\underline{K} = \min_j\{K_j\}$. We now have
$$-\alpha_j\sum_{t=0}^{K_j-1}\mathbb{E}\left\langle \nabla f(\bar{x}_j), \tilde{\nabla}f(v^{q,j}_t)\right\rangle \le -\frac{\left(\underline{K}+1\right)\alpha_j}{2}\left(1 - \frac{L^2\alpha^2_j\bar{K}\left(\bar{K}-1\right)}{2\left(\underline{K}+1\right)}\right)\left\|\nabla f(\bar{x}_j)\right\|^2 - \frac{\alpha_j}{2}\left(1 - \frac{L^2\alpha^2_j\left(\bar{K}+1\right)\left(\bar{K}-2\right)}{2}\right)\sum_{t=0}^{K_j-1}\mathbb{E}\left\|\nabla f(v^{q,j}_t)\right\|^2 + \frac{L^2\alpha^3_j\sigma^2\left(2\bar{K}-1\right)\bar{K}\left(\bar{K}-1\right)}{12}$$
Similarly, it also holds that
$$\frac{L}{2}\,\mathbb{E}\left\|\frac{1}{Q}\sum_{t=0}^{K_j-1}\alpha_j\tilde{\nabla}f(v^{q,j}_t)\right\|^2 \le \frac{LK^2_j\alpha^2_j\sigma^2}{2Q} + \frac{LK_j\alpha^2_j}{2}\sum_{t=0}^{K_j-1}\mathbb{E}\left\|\nabla f(v^{q,j}_t)\right\|^2$$
And so, finally,
$$\mathbb{E}f(\bar{x}_{j+1}) - f(\bar{x}_j) \le -\frac{\left(\underline{K}+1\right)\alpha_j}{2}\left(1 - \frac{L^2\alpha^2_j\bar{K}\left(\bar{K}-1\right)}{2\left(\underline{K}+1\right)} - \frac{L\alpha_j\bar{K}}{\underline{K}+1}\right)\left\|\nabla f(\bar{x}_j)\right\|^2 - \frac{\alpha_j}{2}\left(1 - \frac{L^2\alpha^2_j\left(\bar{K}+1\right)\left(\bar{K}-2\right)}{2} - L\alpha_j\bar{K}\right)\sum_{q=1}^{Q}\sum_{t=1}^{K_j-1}\mathbb{E}\left\|\nabla f(v^{q,j}_t)\right\|^2 + \frac{L^2\alpha^3_j\sigma^2\left(2\bar{K}-1\right)\bar{K}\left(\bar{K}-1\right)}{12} + \frac{L\bar{K}^2\alpha^2_j\sigma^2}{2Q}$$
Now, if/once $\alpha_j$ is small enough such that
$$1 \ge \frac{L^2\alpha^2_j\left(\bar{K}+1\right)\left(\bar{K}-2\right)}{2} + L\alpha_j\bar{K}$$
then the second term above disappears, and the result is exactly the same as in Zhou & Cong (2017). Specifically, if $1 - \delta \ge \frac{L^2\alpha^2_j\bar{K}\left(\bar{K}-1\right)}{2\left(\underline{K}+1\right)} + \frac{L\alpha_j\bar{K}}{\underline{K}+1}$, then
$$\frac{\mathbb{E}\sum_{j=1}^{J}\alpha_j\left\|\nabla f(\bar{x}_j)\right\|^2}{\sum_{l=1}^{J}\alpha_l} \le \frac{2\left(f(\bar{x}_1) - F^*\right)}{\left(\underline{K}-1+\delta\right)\sum_{j=1}^{J}\alpha_j} + \sum_{j=1}^{J}\frac{L\bar{K}\alpha^2_j M}{\sum_{l=1}^{J}\alpha_l\left(\underline{K}-1+\delta\right)}\left(\frac{\bar{K}}{Q} + \frac{L\left(2\bar{K}-1\right)\left(\bar{K}-1\right)\alpha_j}{6}\right)$$

B APPENDIX B: AN ARGUMENT FOR INCREASED GENERALIZATION ACCURACY FOR LOCAL SGD

B.1 WIDE AND NARROW WELLS

In general it has been observed that whether a local minimizer is shallow or deep, and how "flat" it is, seems to affect its generalization properties (Keskar et al. (2019)). Motivated by investigating the impact of batch size on generalization, Dai & Zhu (2018) analyzed the generalization properties of SGD by considering the escape time from a "well", i.e., a local minimizer in the objective landscape, for a constant-stepsize variant of SGD, modeling it as an overdamped Langevin-type diffusion process,
$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2\epsilon}\,dW_t$$
In general, "flatter" minima have longer escape times than sharper ones, where the escape time is the expected number of iterations (defined as a continuous parameter in this sense) until the iterates leave the well to explore the rest of the objective landscape. Any procedure that increases the escape time for flatter minima relative to sharper ones should, in theory, result in better generalization properties, as it is then more likely that the procedure will return an iterate lying in a flat minimizer upon termination. Denote with index $w$ a "wide"-valley local minimizer and with $n$ a "narrow" one, which correspond to smaller and larger minimal Hessian eigenvalues, respectively. The work Berglund (2011) discusses the ultimately classical result that, as $\epsilon \to 0$, the escape time from a local minimizer valley satisfies
$$\mathbb{E}[\tau_e] = H e^{C/\epsilon}$$
and, letting the constant $H$ depend on the type of minimizer, it holds that $H_w > H_n$, i.e., this time is longer for a wider valley. The same reference also provides a tail bound on the escape time.
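The wide-versus-narrow escape-time gap can be illustrated numerically; the toy below is our own and not from the cited references. Under the diffusion above, a narrow well of the same depth is a wide well rescaled in space, which provably speeds up its escape (rescaling $x \mapsto 2x$ speeds up time by a factor of 4); an Euler-Maruyama simulation makes this visible.

```python
import random

def escape_time(curvature, boundary, eps=0.4, dt=0.01,
                trials=400, max_steps=20000, seed=3):
    """Mean first time |X_t| exceeds `boundary` for the overdamped Langevin
    diffusion dX = -f'(X) dt + sqrt(2*eps) dW with f(x) = curvature * x^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x, step = 0.0, 0
        while abs(x) <= boundary and step < max_steps:
            # Euler-Maruyama step: drift -f'(x) = -2*curvature*x plus noise
            x += -2.0 * curvature * x * dt \
                 + (2.0 * eps * dt) ** 0.5 * rng.gauss(0.0, 1.0)
            step += 1
        total += step * dt
    return total / trials

# Both wells have the same depth f(boundary) = 1; the narrow well is the
# wide well with x rescaled by 1/2, so its escape is faster in distribution.
t_wide = escape_time(curvature=1.0, boundary=1.0)    # f(x) = x^2
t_narrow = escape_time(curvature=4.0, boundary=0.5)  # f(x) = 4x^2 = f_wide(2x)
```

With these parameters the mean escape time from the wide well comes out roughly four times that of the narrow one, matching the rescaling argument.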

B.2 AVERAGING

We now contrast two procedures and compare the difference in their escape times for narrow and wide local minimizers. One is standard SGD; in the other we perform averaging every $\tau_a$ time units. In each case there are $Q$ processors, in the first case running independent instances of SGD, and in the other periodically averaging their iterates. We model averaging probabilistically as resulting in a randomized re-initialization within the well; thus the escape time is a sequence of independent trials of length $\tau_a$ starting from an initial point in the well, i.e., escaping at time $\tau_e$ means that there were $Q\left(\tau_e/\tau_a\right)$ trials in which none of the $Q$ sequences escaped within $\tau_a$, and then one of them escaped in the next set of $Q$ trials. For ease of calculation, let us assume that $\tau_a = \frac{1}{2}\mathbb{E}[\tau^w_e] = 2\,\mathbb{E}[\tau^n_e]$, where $\tau^w_e$ and $\tau^n_e$ are the calculated single-process escape times from a wide and a narrow well, respectively. If any one of the local runs escapes, then nothing can be said about the averaged point, so a lack of escape is indicated by the case in which all trajectories, while periodically averaged, stay within the local minimizer valley. Now consider that if no averaging takes place, we sum up, for the wide valley, the probabilities that all runs remain in the well after time $(i-1)\tau_a$ and, given that they do so, that not all of them remain after $i\tau_a$. The difference in the expected escape times then satisfies
$$\mathbb{E}[\tau^w_e] - \mathbb{E}[\tau^n_e] \le \sum_{i=1}^{\infty}\left[e^{-\frac{Q(i-1)}{2}}\left(1 - e^{-\frac{Q}{2}}\right)i - e^{-2Q(i-1)}\left(1 - e^{-2Q}\right)(i-1)\right]\tau_a$$
Recall that, in the case of averaging, if escape takes place between $(i-1)\tau_a$ and $i\tau_a$, there were no escapes within $\tau_a$ for $Q$ processors over $i-1$ windows, and at least one escape between $(i-1)\tau_a$ and $i\tau_a$, i.e., not all failed to escape between these two times. The difference in the expected first escape times among the $Q$ trajectories, with averaging, thus satisfies
$$\mathbb{E}[\tau^{a,w}_e] - \mathbb{E}[\tau^{a,n}_e] \le \sum_{i=1}^{\infty}\left[e^{-\frac{(i-1)Q}{2}}\left(1 - e^{-\frac{iQ}{2}}\right)i - e^{-2(i-1)Q}\left(1 - e^{-2iQ}\right)(i-1)\right]\tau_a$$
It is clear from these expressions that the upper bound for the difference is larger in the case of averaging. This implies that averaging results in a greater difference between the escape times of wide and narrow local minimizers, suggesting that, on average, if one were to stop training and use the resulting iterate as the estimate of the parameters, this iterate would more likely come from a flatter local minimizer when generated with a periodic averaging procedure, relative to standard SGD. Thus it should be expected, at least by this argument, that better generalization is more likely with periodic averaging. Note that since both quantities are upper bounds, this is not a formal proof that in all cases the escape times are more favorable for generalization under averaging, but a guide to the mathematical intuition for how this could be the case.
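The probabilistic model above (averaging as a re-randomization inside the well, so that escape becomes a sequence of independent windows of length $\tau_a$) can be checked by simulation. The sketch below is our own and adds one simplifying assumption not in the text: single-process escape times have increasing hazard (Weibull with shape greater than 1), in which case the resets discard the "aging" of the trajectories and prolong escape.

```python
import random

def first_escape_no_avg(Q, rng):
    """Q independent processes; each draws one escape time; the first of
    them to escape determines the observed escape time."""
    return min(rng.weibullvariate(1.0, 2.0) for _ in range(Q))

def first_escape_with_avg(Q, tau_a, rng):
    """Averaging every tau_a re-randomizes all Q iterates inside the well,
    so each window is an independent set of Q fresh draws; escape occurs
    in the first window in which some draw falls below tau_a."""
    t = 0.0
    while True:
        m = min(rng.weibullvariate(1.0, 2.0) for _ in range(Q))
        if m < tau_a:
            return t + m
        t += tau_a

rng = random.Random(11)
Q, tau_a, N = 4, 0.3, 5000
no_avg = sum(first_escape_no_avg(Q, rng) for _ in range(N)) / N
with_avg = sum(first_escape_with_avg(Q, tau_a, rng) for _ in range(N)) / N
```

Under this increasing-hazard assumption the simulated mean escape time with periodic averaging exceeds the one without, consistent with the argument that averaging lengthens well residence, and more so for wide wells.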



Tr.L.: Training Loss, Tr.A.: Training Accuracy, Te.L.: Test Loss, Te.A.: Test Accuracy, and T: Time in Seconds. The asynchronous methods have inherent randomization due to process scheduling by the operating system. Therefore, each micro-benchmark presents the mean of 3 runs unless otherwise mentioned.

Figure 2: Top-1 Test Accuracy. Notice that for this training task, with 8 commodity GPUs at our disposal, we are very much in the small-minibatch setting. A plethora of existing works efficiently train a RESNET on IMAGENET with batch sizes up to multiple thousands. Another system-dependent constraint of our considered setting is that there are hardly any leftover compute resources to exploit locally at a worker. Yet, we see for RESNET-18 (see Table 10) that LAP-SGD improves on the generalization accuracy of the baseline with a speed-up.

From the same reference, the tail of the escape time satisfies
$$\mathbb{P}\left[\tau_e > s\,\mathbb{E}[\tau_e]\right] = e^{-s}$$

Without averaging, the expected first escape time among $Q$ independent runs from the wide well satisfies
$$\mathbb{E}[\tau^w_e] \le \sum_{i=1}^{\infty}\mathbb{P}\left(\tau^w_e > (i-1)\tau_a\right)^Q\left(1 - \mathbb{P}\left(\tau^w_e > i\tau_a \mid \tau^w_e > (i-1)\tau_a\right)^Q\right)i\tau_a \le \sum_{i=1}^{\infty}e^{-\frac{Q(i-1)}{2}}\left(1 - e^{-\frac{Q}{2}}\right)\tau_a i$$
For the narrow well this is
$$\mathbb{E}[\tau^n_e] \ge \sum_{i=1}^{\infty}\mathbb{P}\left(\tau^n_e > (i-1)\tau_a\right)^Q\left(1 - \mathbb{P}\left(\tau^n_e > i\tau_a \mid \tau^n_e > (i-1)\tau_a\right)^Q\right)(i-1)\tau_a \ge \sum_{i=1}^{\infty}e^{-2Q(i-1)}\left(1 - e^{-2Q}\right)\tau_a(i-1)$$
The difference in the expected escape times satisfies,

With averaging, the escape time from the wide valley satisfies
$$\mathbb{E}[\tau^{a,w}_e] \le \sum_{i=1}^{\infty}\mathbb{P}\left[\tau^w_e > \tau_a\right]^{(i-1)Q}\left(1 - \mathbb{P}\left[\tau^w_e > \tau_a\right]^{Qi}\right)\tau_a i \le \sum_{i=1}^{\infty}e^{-\frac{(i-1)Q}{2}}\left(1 - e^{-\frac{iQ}{2}}\right)\tau_a i$$
And now, with averaging, the escape time from a narrow valley satisfies
$$\mathbb{E}[\tau^{a,n}_e] \ge \sum_{i=1}^{\infty}\mathbb{P}\left[\tau^n_e > \tau_a\right]^{(i-1)Q}\left(1 - \mathbb{P}\left[\tau^n_e > \tau_a\right]^{Qi}\right)\tau_a(i-1) \ge \sum_{i=1}^{\infty}e^{-2(i-1)Q}\left(1 - e^{-2iQ}\right)\tau_a(i-1)$$
with which the difference in the expected escape times satisfies,

Table and figure captions for results referenced in this section:
RESNET-20/CIFAR-10 training for 300 epochs on 2 GPUs; throughout the training the local batch size is kept constant across workers.
LAP-SGD performance.
Performance of RESNET-20 on CIFAR-100 over the setting S1.
Performance of WIDERESNET-16x8 on CIFAR-100 over the setting S3.
Performance of SQUEEZENET on CIFAR-10 over the setting S1.
Other CIFAR-10/CIFAR-100 results: the performance of the proposed methods in comparison to the baselines for other training tasks on CIFAR-10/CIFAR-100 is available in the accompanying tables.
Table 10: LAP-SGD improves on the generalization accuracy of the baseline with a speed-up.
LARS performance.

