DELAY-TOLERANT LOCAL SGD FOR EFFICIENT DISTRIBUTED TRAINING

Abstract

The heavy communication required for model synchronization is a major bottleneck for scaling up distributed deep neural network training to many workers. Moreover, model synchronization can suffer from long delays in scenarios such as federated learning and geo-distributed training. Thus, it is crucial that distributed training methods be both delay-tolerant AND communication-efficient. However, existing works cannot simultaneously address the communication delay and the bandwidth constraint. To address this important and challenging problem, we propose a novel training framework, OLCO3, which achieves delay tolerance with a low communication budget by using stale information. OLCO3 introduces novel staleness compensation and compression compensation techniques to combat the influence of staleness and compression error. Theoretical analysis shows that OLCO3 achieves the same sub-linear convergence rate as the vanilla synchronous stochastic gradient descent (SGD) method. Extensive experiments on deep learning tasks verify the effectiveness of OLCO3 and its advantages over existing works.

1. INTRODUCTION

Data-parallel synchronous SGD is currently the workhorse algorithm for large-scale distributed deep learning tasks with many workers (e.g., GPUs), where each worker calculates the stochastic gradient on local data and synchronizes with the other workers in each training iteration (Goyal et al., 2017; You et al., 2017; Huo et al., 2020). However, high communication overheads make it inefficient to train large deep neural networks (DNNs) with a large number of workers. Generally speaking, the communication overheads come in two forms: 1) high communication delay due to an unstable network or a large number of communication hops, and 2) a large communication budget caused by the large size of DNN models under limited network bandwidth. Although communication delay is not a prominent problem in the data center environment, it can severely degrade training efficiency in practical scenarios, e.g., when the workers are geo-distributed or placed under different networks (Ethernet, cellular networks, Wi-Fi, etc.) in federated learning (Konečnỳ et al., 2016). Existing works addressing the communication inefficiency of synchronous SGD can be roughly classified into three categories: 1) pipelining (Pipe-SGD (Li et al., 2018)); 2) gradient compression (Aji & Heafield, 2017; Stich et al., 2018; Alistarh et al., 2018; Yu et al., 2018; Vogels et al., 2019); and 3) periodic averaging (also known as Local SGD) (Stich, 2019; Lin et al., 2018a). In pipelining, the model update uses stale information so that the next iteration does not wait for the synchronization of the current iteration's information. As the synchronization barrier is removed, pipelining can overlap computation with communication to achieve delay tolerance. Gradient compression reduces the amount of data transferred in each iteration by condensing the gradient with a compressor C(•).
Representative methods include scalar quantization (Alistarh et al., 2017; Wen et al., 2017; Bernstein et al., 2018), gradient sparsification (Aji & Heafield, 2017; Stich et al., 2018; Alistarh et al., 2018), and vector quantization (Yu et al., 2018; Vogels et al., 2019). Periodic averaging reduces the frequency of communication by synchronizing the workers every p (larger than 1) iterations. Periodic averaging is also shown to be effective for federated learning (McMahan et al., 2017). In summary, existing works handle the high communication delay with pipelining, and use gradient compression and periodic averaging to reduce the communication budget. However, all existing methods fail to address both. It is also unclear how the three communication-efficient techniques introduced above can be used jointly without hurting the convergence of SGD.

Table 1: Comparison of communication-efficient methods. p is the number of local iterations per communication round, and s is the staleness in communication rounds.

Method                         | Delay-tolerant | Compressed | p   | s
Gradient Compression           | ×              | √          | = 1 | = 0
Periodic Averaging (Local SGD) | ×              | ×          | ≥ 1 | = 0
Pipelining (Pipe-SGD)          | √              | ×          | = 1 | ≥ 1
CoCoD-SGD                      | √              | ×          | ≥ 1 | = 1
OverlapLocalSGD                | √              | ×          | ≥ 1 | = 1
OLCO3 (Ours)                   | √              | √          | ≥ 1 | ≥ 1

In this paper, we propose a novel framework, Overlap Local Computation with Compressed Communication (i.e., OLCO3), to make distributed training both delay-tolerant AND communication-efficient by enabling and improving the combination of the above three communication-efficient techniques. In Table 1, we compare OLCO3 with the aforementioned works and two recent state-of-the-art delay-tolerant methods, CoCoD-SGD (Shen et al., 2019) and OverlapLocalSGD (Wang et al., 2020). Under the periodic averaging framework, we use p to denote the number of local SGD iterations per communication round, and s to denote the number of communication rounds for which the information used in the model update has been outdated. Let the computation time of one SGD iteration be T_comput; then we can pipeline the communication and the computation whenever the communication delay is less than s·p·T_comput. For simplicity, we define the delay tolerance of a method as T = sp.
Local SGD has to use up-to-date information for the model update (s = 0, p ≥ 1, T = sp = 0). CoCoD-SGD and OverlapLocalSGD combine pipelining and periodic averaging by using stale results from the last communication round (s = 1, p ≥ 1, T = sp = p), while our OLCO3 supports arbitrary staleness (s ≥ 1, p ≥ 1, T = sp) and all the other features in Table 1. The main contributions of this paper are summarized as follows:
• We propose the novel OLCO3 method, which achieves extreme communication efficiency by addressing both the high communication delay and the large communication budget.
• OLCO3 introduces novel staleness compensation and compression compensation techniques. Convergence analysis shows that OLCO3 achieves the same convergence rate as SGD.
• Extensive experiments on deep learning tasks show that OLCO3 significantly outperforms existing delay-tolerant methods in both communication efficiency and model accuracy.

2. BACKGROUNDS & RELATED WORKS

We consider K workers collaboratively minimizing f(x) = (1/K) Σ_{k=1}^K f_k(x), where ξ_t^{(k)} is the stochastic sampling variable and ∇F_k(x_t; ξ_t^{(k)}) is the corresponding stochastic gradient at worker k. Throughout this paper, we assume by default that the stochastic gradient is an unbiased estimator, i.e., E_{ξ_t^{(k)}}[∇F_k(x_t; ξ_t^{(k)})] = ∇f_k(x_t).

Pipelining. Pipe-SGD (Li et al., 2018) parallelizes the communication and computation of SGD via pipelining. At iteration t, worker k computes the stochastic gradient ∇F_k(x_t; ξ_t^{(k)}) at the current model x_t and communicates to obtain the averaged stochastic gradient (1/K) Σ_{k=1}^K ∇F_k(x_t; ξ_t^{(k)}). Instead of waiting for the communication to finish, Pipe-SGD concurrently updates the current model with a stale averaged stochastic gradient via

x_{t+1} = x_t − (η_t/K) Σ_{k=1}^K ∇F_k(x_{t−s}; ξ_{t−s}^{(k)}).

Note that Pipe-SGD is different from asynchronous SGD (Ho et al., 2013; Lian et al., 2015), which computes the stochastic gradient at a stale model and does not parallelize the computation and communication of a worker. A problem of Pipe-SGD is that its performance deteriorates severely under high communication delay (large s).

Pipelining with Periodic Averaging. CoCoD-SGD (Shen et al., 2019) utilizes periodic averaging to reduce the number of communication rounds, and parallelizes the local model update and the global model averaging by concurrently conducting

x̄_t = (1/K) Σ_{k=1}^K x_t^{(k)}   and   x_{t+p}^{(k)} = x_t^{(k)} − Σ_{τ=t}^{t+p−1} η_τ ∇F_k(x_τ^{(k)}; ξ_τ^{(k)}),   (1)

where x_t^{(k)} denotes the local model at worker k, as the local models on different workers are no longer consistent in non-communicating iterations. When the operations in Eq. (1) finish, the local model is updated via x_{t+p}^{(k)} ← x̄_t + x_{t+p}^{(k)} − x_t^{(k)} and t ← t + p. CoCoD-SGD can tolerate delay of up to p SGD iterations (i.e., one communication round in periodic averaging). OverlapLocalSGD (Wang et al., 2020) improves CoCoD-SGD by heuristically pulling x_{t+p}^{(k)} back towards x̄_t after the operations in Eq. (1) via x_{t+p}^{(k)} ← (1 − α) x_{t+p}^{(k)} + α x̄_t, where 0 ≤ α < 1. The motivation is to reduce the inconsistency of the local models across workers. OverlapLocalSGD also develops a momentum variant, which maintains a slow momentum buffer for x̄_t following SlowMo (Wang et al., 2019).
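The stale-gradient update of Pipe-SGD is easy to simulate. The following minimal single-process sketch (our illustration, not the authors' implementation) applies, at each step, a gradient computed at the model from s iterations ago, with a simple quadratic objective standing in for the loss:

```python
import numpy as np

def pipe_sgd(grad, x0, eta=0.1, s=2, iters=200):
    """Simulate the Pipe-SGD recursion x_{t+1} = x_t - eta * grad(x_{t-s}).

    In a real system the stale gradient is the one whose all-reduce has just
    finished; here we simply index into the history of past iterates.
    """
    history = [np.array(x0, dtype=float)]
    x = history[0]
    for t in range(iters):
        x_stale = history[max(t - s, 0)]  # model from s iterations ago
        x = x - eta * grad(x_stale)
        history.append(x)
    return x

# quadratic f(x) = 0.5 * ||x||^2, so grad(x) = x and the optimum is 0
x_final = pipe_sgd(lambda x: x, x0=[1.0, -2.0], eta=0.1, s=2, iters=200)
```

With a small enough step size the stale recursion still contracts toward the optimum; making the delay s or the step size η too large destabilizes it, which mirrors the observation that Pipe-SGD deteriorates under large s.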
As both CoCoD-SGD and OverlapLocalSGD communicate the non-compressed local model update, they suffer from a large communication budget in each communication round.

Gradient Compression. The gradient vector v ∈ R^d can be sent with a much smaller communication budget by applying a compressor C(•). Specifically, scalar quantization rounds 32-bit floating-point gradient components to low-precision values of only several bits. One important such algorithm is scaled SignSGD (called SignSGD in this paper) (Bernstein et al., 2018; Karimireddy et al., 2019), which uses C(v) = (‖v‖₁/d) sign(v) to compress each component of v to 1 bit. Gradient sparsification only communicates large gradient components. Vector quantization uses a codebook in which each code is a vector, and quantizes the gradient vector as a linear combination of the vector codes. With the local error feedback technique (Seide et al., 2014; Lin et al., 2018b; Wu et al., 2018; Karimireddy et al., 2019; Zheng et al., 2019), which adds the previous compression error (i.e., v − C(v)) to the current gradient before compression, gradient compression can achieve performance comparable to full-precision training. Local error feedback works for both one-way compression (compressing the communication from worker to server) (Karimireddy et al., 2019) and two-way compression (compressing the communication in both directions between worker and server) (Zheng et al., 2019).

Challenges. Simultaneously achieving communication compression with pipelining and periodic averaging requires careful algorithm design because 1) pipelining introduces staleness, and 2) state-of-the-art vector quantization methods usually require an additional round of communication to compute the compressor C(•), which is unfavorable in high communication delay scenarios.
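The scaled-sign compressor and local error feedback described above can be sketched as follows. This is a generic single-worker error-feedback loop under our own simplifications (deterministic gradient, hypothetical helper names), not the paper's implementation:

```python
import numpy as np

def sign_compress(v):
    # scaled SignSGD: C(v) = (||v||_1 / d) * sign(v), 1 bit per component
    return (np.abs(v).sum() / v.size) * np.sign(v)

def ef_sgd_step(x, grad, error, eta=0.1):
    """One error-feedback step: compress (step + carried error), apply the
    compressed update, and carry the new compression error forward."""
    v = eta * grad(x) + error   # add previous compression error
    c = sign_compress(v)        # compressed update actually sent
    return x - c, v - c         # new model, new local error

# sanity check: EF-SGD still minimizes f(x) = 0.5 * ||x - b||^2
b = np.array([1.0, -3.0, 2.0])
x, err = np.zeros(3), np.zeros(3)
for _ in range(500):
    x, err = ef_sgd_step(x, lambda z: z - b, err, eta=0.05)
```

The reason this works is the identity behind the paper's auxiliary-variable analysis: the corrected iterate x̃ = x − err satisfies x̃_{t+1} = x̃_t − η ∇f(x_t) exactly, i.e., it follows uncompressed gradient descent up to the bounded error term.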

3. THE PROPOSED FRAMEWORK: OLCO 3

In this section, we introduce our new delay-tolerant and communication-efficient training framework OLCO3. We discuss two variants of OLCO3: OLCO3-TC, for two-way compression in the master-slave communication mode, and OLCO3-VQ, which adopts commutative vector quantization for both the master-slave and ring all-reduce communication modes. Note that one-way compression is just a special case of OLCO3-TC, so we omit it here for conciseness. We use "line x" to refer to the x-th line of Algorithm 1. The key differences between OLCO3-TC and OLCO3-VQ are marked in color in Algorithm 1.

3.1. OLCO 3 -TC FOR TWO-WAY COMPRESSION

Motivation. OLCO3-TC is presented in the green part of Algorithm 1 for efficient master-slave distributed training. Naively pipelining local computation with compressed communication breaks the update rule of momentum SGD for the averaged model x̄_t = (1/K) Σ_{k=1}^K x_t^{(k)}, leading to non-convergence. Therefore, we consider an auxiliary variable x̃_t := (1/K) Σ_{k=1}^K x_t^{(k)} − (1/K) Σ_{k=1}^K e_t^{(k)} − e_{t−sp} that accounts for the local and server compression errors.

Algorithm 1: OLCO3-TC (green, lines 7-14) and OLCO3-VQ (yellow, lines 15-21)
1: Input: local error e_0^{(k)} = 0, server error e_0 = 0, local momentum buffer m_0^{(k)} = 0, and momentum constant 0 < µ < 1.
2: Variables with negative subscripts are 0.
3: for t = 0, 1, …, T−1 do
4:   m_{t+1}^{(k)} = µ m_t^{(k)} + ∇F_k(x_t^{(k)}; ξ_t^{(k)});  x_{t+1}^{(k)} = x_t^{(k)} − η_t m_{t+1}^{(k)}   // Momentum Local SGD
5:   if (t + 1) mod p = 0 then
6:     Maintain or reset the momentum buffer.
       — OLCO3-TC (green): —
7:     ∆_{t+1}^{(k)} = x_{t+1−p}^{(k)} − x_{t+1}^{(k)} + e_t^{(k)}   // Compression compensation
8:     e_{t+1}^{(k)} = e_{t+2}^{(k)} = … = e_{t+p}^{(k)} = ∆_{t+1}^{(k)} − C(∆_{t+1}^{(k)})   // Compression
9:     Invoke the communication thread in parallel, which does:
10:      (1) Send C(∆_{t+1}^{(k)}) to and receive C(∆̄_{t+1}) from the server node.
11:      (2) Server: ∆̄_{t+1} = (1/K) Σ_{k=1}^K C(∆_{t+1}^{(k)}) + e_t;  e_{t+1} = e_{t+2} = … = e_{t+p} = ∆̄_{t+1} − C(∆̄_{t+1}).
12:    Block until C(∆̄_{t+1−sp}) is ready.
13:    x̄_{t+1} = x̄_{t+1−p} − C(∆̄_{t+1−sp})
14:    x_{t+1}^{(k)} ← x̄_{t+1} − Σ_{i=0}^{s−1} C(∆_{t+1−ip}^{(k)})   // Staleness compensation
       — OLCO3-VQ (yellow): —
15:    ∆_{t+1}^{(k)} = x_{t+1−p}^{(k)} − x_{t+1}^{(k)} + e_{t−sp}^{(k)}   // Compression compensation
16:    Invoke the communication thread in parallel, which does:
17:      (1) e_{t+1}^{(k)} = e_{t+2}^{(k)} = … = e_{t+p}^{(k)} = ∆_{t+1}^{(k)} − C(∆_{t+1}^{(k)})   // Compression
18:      (2) Average to obtain (1/K) Σ_{k=1}^K C(∆_{t+1}^{(k)}) by ring all-reduce or master-slave communication.
19:    Block until (1/K) Σ_{k=1}^K C(∆_{t+1−sp}^{(k)}) and e_{t+1−sp}^{(k)} are ready.
20:    x̄_{t+1} = x̄_{t+1−p} − (1/K) Σ_{k=1}^K C(∆_{t+1−sp}^{(k)})
21:    x_{t+1}^{(k)} ← x̄_{t+1} − Σ_{i=0}^{s−1} ∆_{t+1−ip}^{(k)}   // Staleness compensation
22:   end if
23: end for
24: Output: averaged model x̄_T = (1/K) Σ_{k=1}^K x_T^{(k)}

To overlap communication and computation, we compress the local update ∆_{t+1}^{(k)} (line 7) for efficient communication and, at the same time, update the model with a stale compressed global update C(∆̄_{t+1−sp}) (line 13) that has been outdated for s communication rounds (i.e., the staleness is s). The momentum buffer can be maintained or reset to zero every p iterations (line 6). If the delay tolerance T = sp is larger than the actual communication delay, the blocking in line 12 becomes a no-op and there is no synchronization barrier. The server compresses the sum of the compressed local updates from all workers (line 11) and sends it back, making OLCO3-TC an efficient two-way compression method.

Compensation. To make the update of the auxiliary variable x̃_t follow momentum SGD, we propose to 1) compensate staleness with all compressed local updates with staleness in [0, s−1] (line 14), which requires no communication and allows less stale local updates to affect the local model, and 2) maintain a local error (line 8) and add it to the next local update before compression (line 7) to compensate the compression error. With these two compensation techniques, Lemma 1 shows that the update rule of x̃_t follows momentum SGD with the averaged momentum (1/K) Σ_{k=1}^K m_t^{(k)}.

Lemma 1. For OLCO3-TC, let x̃_t := (1/K) Σ_{k=1}^K x_t^{(k)} − (1/K) Σ_{k=1}^K e_t^{(k)} − e_{t−sp}. Then x̃_t = x̃_{t−1} − (η_{t−1}/K) Σ_{k=1}^K m_t^{(k)}.

Note that there is a "gradient mismatch" problem, as the local momentum m_t^{(k)} is computed at the local model x_t^{(k)} but used in the update rule of the auxiliary variable x̃_t (Karimireddy et al., 2019; Xu et al., 2020). However, our analysis shows that it does not affect the convergence rate. We have also considered OLCO3 for one-way compression (i.e., OLCO3-OC) as a special case of OLCO3-TC: in OLCO3-OC, the compressor at the server side is the identity function and the server error e_t is 0.
For OLCO3-OC, the auxiliary variable x̃_t also follows momentum SGD, as stated in Lemma 2.

Lemma 2. For OLCO3-OC, let x̃_t := (1/K) Σ_{k=1}^K x_t^{(k)} − (1/K) Σ_{k=1}^K e_t^{(k)}. Then x̃_t = x̃_{t−1} − (η_{t−1}/K) Σ_{k=1}^K m_t^{(k)}.

The delay tolerance of both OLCO3-TC and OLCO3-OC is T = sp (s ≥ 1, p ≥ 1). They have a memory overhead of O(sd) for storing information with staleness in [0, s−1]. For most compression schemes, such as SignSGD, the computation complexity of C(•) is O(d).

3.2. OLCO3-VQ FOR COMMUTATIVE VECTOR QUANTIZATION

OLCO3-TC and OLCO3-OC work for compressed communication in the master-slave communication paradigm. In contrast, OLCO3-VQ (the yellow part of Algorithm 1) works for both the master-slave and ring all-reduce communication paradigms. Ring all-reduce reduces communication congestion by avoiding the centralized aggregation of master-slave communication (Yu et al., 2018). OLCO3-VQ relies on a state-of-the-art vector quantization scheme, PowerSGD (Vogels et al., 2019), whose compressor is commutative with addition, i.e., C(v₁) + C(v₂) = C(v₁ + v₂). However, directly using PowerSGD breaks the delay tolerance of OLCO3, as its compressor C(•) requires communication and thus introduces synchronization barriers. Specifically, PowerSGD invokes communication across all workers to compute a transformation matrix, which is used to project the local updates to the compressed form.

Pipelining with a Communication-Dependent Compressor. To make OLCO3-VQ delay-tolerant, we further propose a novel compression compensation technique based on the stale local error (line 15). This is in contrast to OLCO3-TC and OLCO3-OC, which use the immediate compressed results to calculate the up-to-date local error. As this technique removes the dependency on the immediate compressed results, we can move the whole compression and averaging process into the communication thread (lines 17 and 18).
For staleness compensation, OLCO3-VQ uses all uncompressed local updates with staleness in [0, s−1] (line 21), instead of the compressed local updates used in OLCO3-TC. With the two compensation techniques, Lemma 3 shows that for OLCO3-VQ, the auxiliary variable x̃_t associated with the stale local error also follows the momentum SGD update rule.

Lemma 3. For OLCO3-VQ, let x̃_t := (1/K) Σ_{k=1}^K x_t^{(k)} − (1/K) Σ_{k=1}^K e_{t−sp}^{(k)}. Then x̃_t = x̃_{t−1} − (η_{t−1}/K) Σ_{k=1}^K m_t^{(k)}.
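To make the compensation bookkeeping concrete, the following numpy sketch (hypothetical helper names; our own simplification of the worker-side steps in Algorithm 1, lines 7-8) implements compression compensation and checks its key telescoping property: everything sent plus the carried error equals the true total model drift, so no information is permanently lost to compression.

```python
import numpy as np

def sign_compress(v):
    # scaled SignSGD compressor: C(v) = (||v||_1 / d) * sign(v)
    return (np.abs(v).sum() / v.size) * np.sign(v)

def compress_round(x_old, x_new, e_prev, compress=sign_compress):
    """Worker-side compression-compensation bookkeeping: fold the carried
    compression error into the local update before compressing, then carry
    the new compression error forward."""
    delta = x_old - x_new + e_prev   # compression compensation
    sent = compress(delta)           # what is actually communicated
    e_new = delta - sent             # compression error carried forward
    return sent, e_new

# run a few rounds over an arbitrary trajectory of local model snapshots
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 8))
e = np.zeros(8)
total_sent = np.zeros(8)
for i in range(1, len(xs)):
    sent, e = compress_round(xs[i - 1], xs[i], e)
    total_sent += sent
# telescoping: sum of sent updates + final error == total drift x_0 - x_4
```

This exact cancellation is what the auxiliary variables x̃_t in Lemmas 1-3 exploit: subtracting the carried errors from the averaged model recovers an iterate that follows the uncompressed momentum SGD recursion.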

4. THEORETICAL RESULTS

In this section, we provide the convergence results of the OLCO3 variants for both SGD and momentum SGD with a maintained momentum buffer (line 6 of Algorithm 1), under common assumptions. As OLCO3-OC is a special case of OLCO3-TC, we only analyze OLCO3-TC and OLCO3-VQ. The detailed proofs of Theorems 1, 2, 3, and 4 can be found in Appendices D, E, F, and G, respectively. The detailed proofs of Lemmas 1, 2, and 3 can be found in Appendix C. We use f* to denote the optimal loss.

Assumption 1. (L-Lipschitz Smoothness) Both the local (f_k(•)) and global (f(•) = (1/K) Σ_{k=1}^K f_k(•)) loss functions are L-smooth, i.e.,

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, ∀x, y ∈ R^d,   (2)
‖∇f_k(x) − ∇f_k(y)‖₂ ≤ L‖x − y‖₂, ∀k ∈ [K], ∀x, y ∈ R^d.   (3)

Assumption 2. (Local Bounded Variance) The local stochastic gradient ∇F_k(x; ξ) has bounded variance, i.e., E_{ξ∼D_k}‖∇F_k(x; ξ) − ∇f_k(x)‖₂² ≤ σ², ∀k ∈ [K], ∀x ∈ R^d. Note that E_{ξ∼D_k} ∇F_k(x; ξ) = ∇f_k(x).

Assumption 3. (Bounded Variance across Workers) The L₂ norm of the difference between the local and global full gradients is bounded, i.e., ‖∇f_k(x) − ∇f(x)‖₂² ≤ κ², ∀k ∈ [K], ∀x ∈ R^d. κ = 0 corresponds to i.i.d. data distributions across workers.

Assumption 4. (Bounded Full Gradient) The second moment of the global full gradient is bounded, i.e., ‖∇f(x)‖₂² ≤ G², ∀x ∈ R^d.

Assumption 5. (Karimireddy et al., 2019) The compression function C(•): R^d → R^d is a δ-approximate compressor for 0 < δ ≤ 1 if for all v ∈ R^d, ‖C(v) − v‖₂² ≤ (1 − δ)‖v‖₂².
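Assumption 5 can be checked numerically for concrete compressors. As an illustration (our own example, not from the paper), top-k sparsification is a standard δ-approximate compressor with δ = k/d, since the k kept components carry at least a k/d fraction of the squared mass:

```python
import numpy as np

def top_k(v, k):
    # top-k sparsification: keep the k largest-magnitude components
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def check_delta_approximate(compress, delta, dim, trials=500, seed=0):
    """Empirically check Assumption 5: ||C(v) - v||^2 <= (1 - delta) ||v||^2."""
    rng = np.random.default_rng(seed)
    return all(
        np.sum((compress(v) - v) ** 2) <= (1 - delta) * np.sum(v ** 2) + 1e-12
        for v in rng.normal(size=(trials, dim))
    )

# top-k with k = 4 out of d = 16 components should satisfy delta = 4/16
ok = check_delta_approximate(lambda v: top_k(v, 4), delta=4 / 16, dim=16)
```

An empirical check like this cannot prove the worst-case property, but it is a cheap sanity test before plugging a new compressor into the convergence bounds below.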

4.1. SGD

Theorem 1. For OLCO3-VQ with vanilla SGD, under Assumptions 1, 2, 3, 4, and 5, if the learning rate η ≤ min{1/(6L(s+1)p), 1/(9L)}, then

(1/T) Σ_{t=0}^{T−1} E‖∇f((1/K) Σ_{k=1}^K x_t^{(k)})‖₂² ≤ 6(f(x₀) − f*)/(ηT) + 9ηLσ²/K + 12η²L²(s+1)pσ²[1 + (14(1−δ)/δ²)(s+1)p] + 36η²L²(s+1)²p²κ²(1 + 5(1−δ)/δ²) + (168(1−δ)/δ²)η²L²(s+1)²p²G².   (4)

If we set the learning rate η = O(K^{1/2} T^{−1/2}) and the communication interval p = O(K^{−3/4} T^{1/4} (s+1)^{−1}), the convergence rate is O(K^{−1/2} T^{−1/2}). This O(K^{−1/2} T^{−1/2}) rate is the same as for synchronous SGD and Local SGD, and achieves linear speedup with respect to the number of workers K.

Theorem 2. For OLCO3-TC with vanilla SGD, under Assumptions 1, 2, 3, 4, and 5, if the learning rate η ≤ min{1/(6L(s+1)p), 1/(9L)}, and letting h(δ) = ((1−δ)/δ²)(1 + 4(2−δ)/δ²), then

(1/T) Σ_{t=0}^{T−1} E‖∇f((1/K) Σ_{k=1}^K x_t^{(k)})‖₂² ≤ 6(f(x₀) − f*)/(ηT) + 9ηLσ²/K + 12η²L²pσ²(s + 1 + 80h(δ)p) + 12η²L²p²κ²(3(s+1)² + 80h(δ)) + 960η²L²p²G²h(δ).   (5)

If we set the learning rate η = O(K^{1/2} T^{−1/2}) and the communication interval p = O(K^{−3/4} T^{1/4} (s+1)^{−1}), the convergence rate is O(K^{−1/2} T^{−1/2}). When the data distributions across workers are i.i.d. (i.e., κ = 0), if we instead choose the learning rate η = O(K^{1/2} T^{−1/2}) and the communication interval p = min{O(K^{−3/2} T^{1/2} (s+1)^{−1}), O(K^{−3/4} T^{1/4})} (which equals O(K^{−3/4} T^{1/4}) for a large enough T), the convergence rate is still O(K^{−1/2} T^{−1/2}). Therefore, in the i.i.d. setting OLCO3-TC can tolerate a larger communication interval p (O(K^{−3/4} T^{1/4})) than OLCO3-VQ (O(K^{−3/4} T^{1/4} (s+1)^{−1})); in the non-i.i.d. setting the two are the same.

4.2. MOMENTUM SGD

Theorem 3. For OLCO3-VQ with momentum SGD, under Assumptions 1, 2, 3, 4, and 5, if the learning rate η ≤ min{(1−µ)/(√72 L(s+1)p), (1−µ)/(9L)}, and letting g(µ, δ, s, p) = 15/(1−µ)² + 60(1−δ)(s+1)²p²/δ², then

(1/T) Σ_{t=0}^{T−1} E‖∇f((1/K) Σ_{k=1}^K x_t^{(k)})‖₂² ≤ 6(1−µ)(f(x₀) − f*)/(ηT) + 9Lησ²/((1−µ)K) + (4η²L²/(1−µ)²)[(4(s+1)p + g(µ, δ, s, p))σ² + (12(s+1)²p² + g(µ, δ, s, p))κ² + g(µ, δ, s, p)G²].   (6)

Theorem 4. For OLCO3-TC with momentum SGD, under Assumptions 1, 2, 3, 4, and 5, if the learning rate η ≤ min{(1−µ)/(√72 L(s+1)p), (1−µ)/(9L)}, and h(δ) = ((1−δ)/δ²)(1 + 4(2−δ)/δ²), then

(1/T) Σ_{t=0}^{T−1} E‖∇f((1/K) Σ_{k=1}^K x_t^{(k)})‖₂² ≤ 6(1−µ)(f(x₀) − f*)/(ηT) + 9Lησ²/((1−µ)K) + (6η²L²/(1−µ)²)[σ²(9/(1−µ)² + 2(s+1)p + 168h(δ)p²) + κ²(9/(1−µ)² + 6(s+1)²p² + 168h(δ)p²) + G²(9/(1−µ)² + 168h(δ)p²)].

The same convergence rates and communication intervals p as in Section 4.1 are achieved.

5. EXPERIMENTS

We compare the following methods: 1) Local SGD (baseline, no delay tolerance, T = 0); 2) Pipe-SGD; 3) CoCoD-SGD; 4) OverlapLocalSGD with hyperparameters following Wang et al. (2020); 5) OLCO3-OC with SignSGD compression; 6) OLCO3-VQ with PowerSGD compression; 7) OLCO3-TC with SignSGD compression. The momentum buffer is maintained (line 6 of Algorithm 1) by default. We do not report the results of Pipe-SGD, as it does not converge for the large delay tolerances T we experimented with. We train ResNet-110 (He et al., 2016).

Varying Delay Tolerance. We vary the delay tolerance T with the staleness fixed at s = 1 in the left plot of Figure 2. The goal is to check the robustness of OLCO3 to different periods p. The results show that OLCO3-OC and OLCO3-TC always outperform the other delay-tolerant methods and come closest to the performance of Local SGD; note that both provide a significantly smaller communication budget according to Figure 1. OLCO3-VQ also outperforms CoCoD-SGD with a much smaller communication budget.

Varying Staleness. We vary the staleness s of OLCO3 in the right plot of Figure 2 under a fixed delay tolerance T. Local SGD only supports s = 0 (no delay tolerance), and CoCoD-SGD and OverlapLocalSGD only support s = 1, so each has only one result in the figure. When the staleness increases beyond 2 for OLCO3, the deterioration of model performance is very small, especially for OLCO3-VQ. This suggests that the staleness compensation techniques in OLCO3 are effective. The performance peaks at s = 2, possibly because a moderate amount of staleness introduces noise that helps generalization. In comparison, the staleness s cannot be tuned for better performance in CoCoD-SGD and OverlapLocalSGD.

6. CONCLUSION

In this paper, we proposed OLCO3, a novel delay-tolerant and communication-efficient distributed training framework that combines pipelining, periodic averaging, and communication compression through novel staleness compensation and compression compensation techniques. Theoretical analysis shows that OLCO3 achieves the same sub-linear convergence rate as vanilla synchronous SGD, and extensive experiments verify its effectiveness and its advantages over existing delay-tolerant methods.

A EXPERIMENTAL DETAILS

ImageNet. We train the ResNet-50 model with 16 workers on the ImageNet (Russakovsky et al., 2015) image classification task. The model is trained for 120 epochs with cosine learning rate scheduling (Loshchilov & Hutter, 2016). The base learning rate is 0.4 and the total batch size is 2048. The momentum constant is 0.9 and the weight decay is 1 × 10⁻⁴. We linearly warm up the learning rate from 0.025 to 0.4 in the first 5 epochs. The rank of PowerSGD is 50. Random cropping, random flipping, and standardization are applied as data augmentation techniques.

The Non-i.i.d. Setting. Similar to Wang et al. (2020), we randomly choose a fraction α of the whole data, sort it by class, and evenly assign it to the workers in order. The remaining fraction (1 − α) of the data is randomly and evenly distributed across workers (Figure 3). When 0 < α ≤ 1 is large, the data distribution across workers is non-i.i.d. and highly skewed; when α = 0, it becomes an i.i.d. data distribution across workers. In our non-i.i.d. experiments, we choose α = 0.8.

A.4 HYPERPARAMETERS s & p

Figure 6 empirically confirms the theoretical results in Theorems 1, 2, 3, and 4: OLCO3-TC can handle a larger period p than OLCO3-VQ, and this gap increases with the staleness s in the i.i.d. setting. Note that in the right plot of Figure 2, the gap between OLCO3-TC and OLCO3-VQ does not increase with s because the period p decreases (the delay tolerance T = sp is fixed).

B ASSUMPTIONS

Assumption 1. (L-Lipschitz Smoothness) Both the local (f_k(•)) and global (f(•) = (1/K) Σ_{k=1}^K f_k(•)) loss functions are L-smooth, i.e.,

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, ∀x, y ∈ R^d,
‖∇f_k(x) − ∇f_k(y)‖₂ ≤ L‖x − y‖₂, ∀k ∈ [K], ∀x, y ∈ R^d.

Assumption 2. (Local Bounded Variance) The local stochastic gradient ∇F_k(x; ξ) has bounded variance, i.e., E_{ξ∼D_k}‖∇F_k(x; ξ) − ∇f_k(x)‖₂² ≤ σ², ∀k ∈ [K], ∀x ∈ R^d. Note that E_{ξ∼D_k} ∇F_k(x; ξ) = ∇f_k(x).

Assumption 3.
(Bounded Variance across Workers) The L₂ norm of the difference between the local and global full gradients is bounded, i.e., ‖∇f_k(x) − ∇f(x)‖₂² ≤ κ², ∀k ∈ [K], ∀x ∈ R^d, where κ = 0 corresponds to i.i.d. data distributions across workers.

Assumption 4. (Bounded Full Gradient) The second moment of the global full gradient is bounded, i.e., ‖∇f(x)‖₂² ≤ G², ∀x ∈ R^d.

Assumption 5. (δ-Approximate Compressor) The compression function C(•): R^d → R^d is a δ-approximate compressor for 0 < δ ≤ 1 if for all v ∈ R^d, ‖C(v) − v‖₂² ≤ (1 − δ)‖v‖₂².
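The α-based skewed data partition described in the experimental details above can be sketched as follows (a hypothetical helper under our own simplifications, not the authors' data pipeline): a fraction α of the indices is sorted by class and dealt out in order, and the rest is shuffled and split evenly.

```python
import numpy as np

def non_iid_partition(labels, num_workers, alpha, seed=0):
    """Split dataset indices across workers: a fraction alpha is sorted by
    class and assigned in order (skewed part), the remaining (1 - alpha)
    is shuffled and distributed uniformly."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_skew = int(alpha * len(labels))
    skew, rest = idx[:n_skew], idx[n_skew:]
    # sort the skewed part by class label (stable to keep it deterministic)
    skew = skew[np.argsort(np.asarray(labels)[skew], kind="stable")]
    ordered = np.concatenate([skew, rest])
    # deal contiguous, (nearly) equally sized shards to the workers
    return np.array_split(ordered, num_workers)

labels = [i % 10 for i in range(1000)]
shards = non_iid_partition(labels, num_workers=4, alpha=0.8)
```

With α = 0 this reduces to a uniform random split (i.i.d.), and with α = 1 each worker receives a contiguous, class-sorted slice (maximally skewed), matching the two extremes discussed above.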

C BASIC LEMMAS

Lemma 1. For OLCO3-TC, let x̃_t := (1/K) Σ_{k=1}^K x_t^{(k)} − (1/K) Σ_{k=1}^K e_t^{(k)} − e_{t−sp}. Then x̃_t = x̃_{t−1} − (η_{t−1}/K) Σ_{k=1}^K m_t^{(k)}.

Proof. For t = np with n an integer, using line 14 (staleness compensation), line 13 (global update), lines 7-8 (local compression with error feedback), and line 11 (server compression with error feedback) of Algorithm 1,

x̃_np = (1/K) Σ_k x_np^{(k)} − (1/K) Σ_k e_np^{(k)} − e_{(n−s)p}
= x̄_np − (1/K) Σ_k Σ_{i=0}^{s−1} C(∆_{(n−i)p}^{(k)}) − (1/K) Σ_k e_np^{(k)} − e_{(n−s)p}
= x̄_{(n−1)p} − C(∆̄_{(n−s)p}) − (1/K) Σ_k Σ_{i=0}^{s−1} C(∆_{(n−i)p}^{(k)}) − (1/K) Σ_k e_np^{(k)} − e_{(n−s)p}
= (1/K) Σ_k [x_{(n−1)p}^{(k)} + Σ_{i=0}^{s−1} C(∆_{(n−1−i)p}^{(k)})] − C(∆̄_{(n−s)p}) − (1/K) Σ_k Σ_{i=0}^{s−1} C(∆_{(n−i)p}^{(k)}) − (1/K) Σ_k e_np^{(k)} − e_{(n−s)p}
= (1/K) Σ_k x_{(n−1)p}^{(k)} − (1/K) Σ_k ∆_np^{(k)} + (1/K) Σ_k C(∆_{(n−s)p}^{(k)}) − C(∆̄_{(n−s)p}) − e_{(n−s)p}
= (1/K) Σ_k x_{(n−1)p}^{(k)} − (1/K) Σ_k ∆_np^{(k)} − e_{(n−s)p−1}
= (1/K) Σ_k x_{(n−1)p}^{(k)} − (1/K) Σ_k Σ_{τ=(n−1)p}^{np−1} η_τ m_{τ+1}^{(k)} − (1/K) Σ_k e_{np−1}^{(k)} − e_{(n−s)p−1}
= x̃_{np−1} − (η_{np−1}/K) Σ_k m_np^{(k)}.

For t ≠ np, the errors do not change (e_t^{(k)} = e_{t−1}^{(k)} and e_{t−sp} = e_{t−sp−1}) and only the local SGD step (line 4) is applied, so

x̃_t = (1/K) Σ_k x_t^{(k)} − (1/K) Σ_k e_t^{(k)} − e_{t−sp} = (1/K) Σ_k x_t^{(k)} − (1/K) Σ_k e_{t−1}^{(k)} − e_{t−sp−1} = x̃_{t−1} − (η_{t−1}/K) Σ_k m_t^{(k)}. ∎

Lemma 2. For OLCO3-OC, let x̃_t := (1/K) Σ_{k=1}^K x_t^{(k)} − (1/K) Σ_{k=1}^K e_t^{(k)}. Then x̃_t = x̃_{t−1} − (η_{t−1}/K) Σ_{k=1}^K m_t^{(k)}.

Proof. For t = np with n an integer (the server-side compressor is the identity, so the server error vanishes),

x̃_np = (1/K) Σ_k x_np^{(k)} − (1/K) Σ_k e_np^{(k)}
= x̄_np − (1/K) Σ_k Σ_{i=0}^{s−1} C(∆_{(n−i)p}^{(k)}) − (1/K) Σ_k e_np^{(k)}
= x̄_{(n−1)p} − (1/K) Σ_k C(∆_{(n−s)p}^{(k)}) − (1/K) Σ_k Σ_{i=0}^{s−1} C(∆_{(n−i)p}^{(k)}) − (1/K) Σ_k e_np^{(k)}
= (1/K) Σ_k [x_{(n−1)p}^{(k)} + Σ_{i=0}^{s−1} C(∆_{(n−1−i)p}^{(k)})] − (1/K) Σ_k C(∆_{(n−s)p}^{(k)}) − (1/K) Σ_k Σ_{i=0}^{s−1} C(∆_{(n−i)p}^{(k)}) − (1/K) Σ_k e_np^{(k)}
= (1/K) Σ_k x_{(n−1)p}^{(k)} − (1/K) Σ_k C(∆_np^{(k)}) − (1/K) Σ_k e_np^{(k)}
= (1/K) Σ_k x_{(n−1)p}^{(k)} − (1/K) Σ_k ∆_np^{(k)}
= (1/K) Σ_k x_{(n−1)p}^{(k)} − (1/K) Σ_k Σ_{τ=(n−1)p}^{np−1} η_τ m_{τ+1}^{(k)} − (1/K) Σ_k e_{np−1}^{(k)}
= x̃_{np−1} − (η_{np−1}/K) Σ_k m_np^{(k)}.

For t ≠ np, the errors do not change, so

x̃_t = (1/K) Σ_k x_t^{(k)} − (1/K) Σ_k e_t^{(k)} = (1/K) Σ_k x_t^{(k)} − (1/K) Σ_k e_{t−1}^{(k)} = x̃_{t−1} − (η_{t−1}/K) Σ_k m_t^{(k)}. ∎
Lemma 3. For OLCO3-VQ, let x̃_t := (1/K) Σ_{k=1}^K x_t^{(k)} − (1/K) Σ_{k=1}^K e_{t−sp}^{(k)}. Then x̃_t = x̃_{t−1} − (η_{t−1}/K) Σ_{k=1}^K m_t^{(k)}.

Proof. For t = np with n an integer, using lines 15 and 17-21 of Algorithm 1,

x̃_np = (1/K) Σ_k x_np^{(k)} − (1/K) Σ_k e_{(n−s)p}^{(k)}
= x̄_np − (1/K) Σ_k Σ_{i=0}^{s−1} ∆_{(n−i)p}^{(k)} − (1/K) Σ_k e_{(n−s)p}^{(k)}
= x̄_{(n−1)p} − (1/K) Σ_k C(∆_{(n−s)p}^{(k)}) − (1/K) Σ_k Σ_{i=0}^{s−1} ∆_{(n−i)p}^{(k)} − (1/K) Σ_k e_{(n−s)p}^{(k)}
= (1/K) Σ_k [x_{(n−1)p}^{(k)} + Σ_{i=0}^{s−1} ∆_{(n−1−i)p}^{(k)}] − (1/K) Σ_k C(∆_{(n−s)p}^{(k)}) − (1/K) Σ_k Σ_{i=0}^{s−1} ∆_{(n−i)p}^{(k)} − (1/K) Σ_k e_{(n−s)p}^{(k)}
= (1/K) Σ_k [x_{(n−1)p}^{(k)} + Σ_{i=0}^{s−1} ∆_{(n−1−i)p}^{(k)}] − (1/K) Σ_k ∆_{(n−s)p}^{(k)} − (1/K) Σ_k Σ_{i=0}^{s−1} ∆_{(n−i)p}^{(k)}
= (1/K) Σ_k x_{(n−1)p}^{(k)} − (1/K) Σ_k ∆_np^{(k)}
= (1/K) Σ_k x_{(n−1)p}^{(k)} − (1/K) Σ_k Σ_{τ=(n−1)p}^{np−1} η_τ m_{τ+1}^{(k)} − (1/K) Σ_k e_{np−1−sp}^{(k)}
= x̃_{np−1} − (η_{np−1}/K) Σ_k m_np^{(k)}.

For t ≠ np, the errors do not change, so

x̃_t = (1/K) Σ_k x_t^{(k)} − (1/K) Σ_k e_{t−sp}^{(k)} = (1/K) Σ_k x_t^{(k)} − (1/K) Σ_k e_{t−1−sp}^{(k)} = x̃_{t−1} − (η_{t−1}/K) Σ_k m_t^{(k)}. ∎

D PROOF OF THEOREM 1

Lemma 4. For OLCO3-VQ with vanilla SGD, under Assumptions 2, 3, 4, and 5, the local error satisfies E‖e_t^{(k)}‖₂² ≤ (12(1−δ)/δ²) p²η²(σ² + κ² + G²).

Proof. First, splitting into the stochastic noise, the worker deviation, and the full gradient,

E‖∇F_k(x_t^{(k)}; ξ_t^{(k)})‖₂² ≤ 3E‖∇F_k(x_t^{(k)}; ξ_t^{(k)}) − ∇f_k(x_t^{(k)})‖₂² + 3E‖∇f_k(x_t^{(k)}) − ∇f(x_t^{(k)})‖₂² + 3E‖∇f(x_t^{(k)})‖₂² ≤ 3σ² + 3κ² + 3G².

Let S_t = ⌊t/p⌋, so that e_t^{(k)} = e_{S_t p}^{(k)}. Then, for any ρ > 0,

E‖e_t^{(k)}‖₂² = E‖C(∆_{S_t p}^{(k)}) − ∆_{S_t p}^{(k)}‖₂² ≤ (1−δ) E‖∆_{S_t p}^{(k)}‖₂²
= (1−δ) E‖Σ_{t'=(S_t−1)p}^{S_t p−1} η ∇F_k(x_{t'}^{(k)}; ξ_{t'}^{(k)}) + e_{(S_t−s−1)p}^{(k)}‖₂²
≤ (1−δ)(1+ρ) E‖e_{(S_t−s−1)p}^{(k)}‖₂² + (1−δ)(1+1/ρ) E‖Σ_{t'=(S_t−1)p}^{S_t p−1} η ∇F_k(x_{t'}^{(k)}; ξ_{t'}^{(k)})‖₂²
≤ (1−δ)(1+ρ) E‖e_{(S_t−s−1)p}^{(k)}‖₂² + 3(1−δ)(1+1/ρ) p²η²(σ² + κ² + G²).

Unrolling this recursion,

E‖e_t^{(k)}‖₂² ≤ 3(1−δ)(1+1/ρ) p²η²(σ² + κ² + G²) Σ_{i≥0} [(1−δ)(1+ρ)]^i ≤ (3(1−δ)(1+1/ρ) / (1 − (1−δ)(1+ρ))) p²η²(σ² + κ² + G²).

Choosing ρ = δ/(2(1−δ)), so that 1 + 1/ρ = (2−δ)/δ ≤ 2/δ and 1 − (1−δ)(1+ρ) = δ/2, gives E‖e_t^{(k)}‖₂² ≤ (12(1−δ)/δ²) p²η²(σ² + κ² + G²). ∎
Lemma 5. For OLCO$^3$-VQ with vanilla SGD and under Assumptions 1, 2, 3, 4, and 5, if the learning rate $\eta \le \frac{1}{6L(s+1)p}$, we have
\[
\frac{1}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t - x_t^{(k)}\big\|_2^2
\le 3\eta^2(s+1)p\sigma^2\Big(1+\frac{12(1-\delta)}{\delta^2}(s+1)p\Big)
+ 9\eta^2(s+1)^2p^2\kappa^2\Big(1+\frac{4(1-\delta)}{\delta^2}\Big)
+ \frac{36(1-\delta)}{\delta^2}\eta^2(s+1)^2p^2G^2 .
\]

Proof. Let $S_t=\lfloor t/p\rfloor$. Unrolling the updates of $\bar{x}_t$ and $x_t^{(k)}$ from iteration $(S_t-s)p$, the transmitted differences $\Delta^{(k)}_{(S_t-i)p}$ telescope into local gradient sums and compression errors, so that
\begin{align*}
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2
&\le \frac{2\eta^2}{K}\sum_{k=1}^{K}\mathbb{E}\bigg\|-\frac{1}{K}\sum_{k'=1}^{K}\sum_{t'=(S_t-s)p}^{t-1}\nabla F_{k'}\big(x^{(k')}_{t'};\xi^{(k')}_{t'}\big)+\sum_{t'=(S_t-s)p}^{t-1}\nabla F_k\big(x^{(k)}_{t'};\xi^{(k)}_{t'}\big)\bigg\|_2^2\\
&\quad+\frac{2}{K}\sum_{k=1}^{K}\mathbb{E}\bigg\|-\frac{1}{K}\sum_{k'=1}^{K}e^{(k')}_{t-sp}-\frac{1}{K}\sum_{k'=1}^{K}\sum_{i=0}^{s-1}e^{(k')}_{(S_t-i-s-1)p}+\sum_{i=0}^{s-1}e^{(k)}_{(S_t-i-s-1)p}\bigg\|_2^2. \tag{28}
\end{align*}
The first term is bounded by separating the stochastic noise from the gradient drift:
\[
\text{first term of (28)}
\le 2\eta^2(s+1)p\sigma^2+\frac{2\eta^2(s+1)p}{K}\sum_{t'=t-(s+1)p}^{t-1}\sum_{k=1}^{K}\mathbb{E}\bigg\|-\frac{1}{K}\sum_{k'=1}^{K}\nabla f_{k'}\big(x^{(k')}_{t'}\big)+\nabla f_k\big(x^{(k)}_{t'}\big)\bigg\|_2^2, \tag{29}
\]
where we used the independence and bounded variance of the stochastic gradients together with the identity $\frac{1}{K}\sum_{k}\|\frac{1}{K}\sum_{k'}a_{k'}-a_k\|_2^2=\frac{1}{K}\sum_{k}\|a_k\|_2^2-\|\frac{1}{K}\sum_{k}a_k\|_2^2\le\frac{1}{K}\sum_{k}\|a_k\|_2^2$. Moreover, for every $t'$,
\begin{align*}
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\bigg\|-\frac{1}{K}\sum_{k'=1}^{K}\nabla f_{k'}\big(x^{(k')}_{t'}\big)+\nabla f_k\big(x^{(k)}_{t'}\big)\bigg\|_2^2
&\le \frac{3}{K}\sum_{k=1}^{K}\mathbb{E}\Big[\big\|\nabla f_k\big(x^{(k)}_{t'}\big)-\nabla f_k(\bar{x}_{t'})\big\|_2^2+\big\|\nabla f_k(\bar{x}_{t'})-\nabla f(\bar{x}_{t'})\big\|_2^2\\
&\qquad+\Big\|\nabla f(\bar{x}_{t'})-\frac{1}{K}\sum_{k'=1}^{K}\nabla f_{k'}\big(x^{(k')}_{t'}\big)\Big\|_2^2\Big]
\le \frac{6L^2}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_{t'}-x^{(k)}_{t'}\big\|_2^2+3\kappa^2. \tag{30}
\end{align*}
The second term of (28) is bounded with $\|a+b\|_2^2\le(1+s)\|a\|_2^2+(1+\frac{1}{s})\|b\|_2^2$ and Lemma 4:
\[
\text{second term of (28)}
\le \frac{2(s+1)}{K}\sum_{k=1}^{K}\sum_{i=0}^{s}\mathbb{E}\big\|e^{(k)}_{(S_t-i-s)p}\big\|_2^2
\le \frac{24(1-\delta)}{\delta^2}(s+1)^2p^2\eta^2(\sigma^2+\kappa^2+G^2). \tag{31}
\]
Combining (29), (30), and (31),
\begin{align*}
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2
&\le 2\eta^2(s+1)p\sigma^2\Big(1+\frac{12(1-\delta)}{\delta^2}(s+1)p\Big)+6\eta^2(s+1)^2p^2\kappa^2\Big(1+\frac{4(1-\delta)}{\delta^2}\Big)\\
&\quad+\frac{24(1-\delta)}{\delta^2}\eta^2(s+1)^2p^2G^2+12\eta^2L^2(s+1)p\sum_{t'=t-(s+1)p}^{t-1}\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_{t'}-x^{(k)}_{t'}\big\|_2^2. \tag{32}
\end{align*}
Summing (32) from $t=0$ to $T-1$ and dividing by $T$,
\begin{align*}
\frac{1}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2
&\le 2\eta^2(s+1)p\sigma^2\Big(1+\frac{12(1-\delta)}{\delta^2}(s+1)p\Big)+6\eta^2(s+1)^2p^2\kappa^2\Big(1+\frac{4(1-\delta)}{\delta^2}\Big)\\
&\quad+\frac{24(1-\delta)}{\delta^2}\eta^2(s+1)^2p^2G^2+12\eta^2L^2(s+1)^2p^2\cdot\frac{1}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2. \tag{33}
\end{align*}
Therefore,
\[
\frac{1}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2
\le \frac{2\eta^2(s+1)p\sigma^2\big(1+\frac{12(1-\delta)}{\delta^2}(s+1)p\big)+6\eta^2(s+1)^2p^2\kappa^2\big(1+\frac{4(1-\delta)}{\delta^2}\big)+\frac{24(1-\delta)}{\delta^2}\eta^2(s+1)^2p^2G^2}{1-12\eta^2L^2(s+1)^2p^2}. \tag{34}
\]
Since $\eta\le\frac{1}{6L(s+1)p}$ implies $12\eta^2L^2(s+1)^2p^2\le\frac{1}{3}$, multiplying the numerator of (34) by $\frac{3}{2}$ yields the claimed bound. \hfill (35) $\square$

Theorem 1. For OLCO$^3$-VQ with vanilla SGD and under Assumptions 1, 2, 3, 4, and 5, if the learning rate $\eta\le\min\{\frac{1}{6L(s+1)p},\frac{1}{9L}\}$, then
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
\le \frac{6(f(x_0)-f^*)}{\eta T}+\frac{9\eta L\sigma^2}{K}
+12\eta^2L^2(s+1)p\sigma^2\Big(1+\frac{14(1-\delta)}{\delta^2}(s+1)p\Big)
+36\eta^2L^2(s+1)^2p^2\kappa^2\Big(1+\frac{5(1-\delta)}{\delta^2}\Big)
+\frac{168(1-\delta)}{\delta^2}\eta^2L^2(s+1)^2p^2G^2. \tag{36}
\]

Proof. According to Assumption 1 ($L$-smoothness),
\[
\mathbb{E}_t f(\bar{x}_{t+1})-f(\bar{x}_t)
\le -\eta\Big\langle\nabla f(\bar{x}_t),\frac{1}{K}\sum_{k=1}^{K}\nabla f_k\big(x^{(k)}_t\big)\Big\rangle
+\frac{L\eta^2}{2}\mathbb{E}_t\Big\|\frac{1}{K}\sum_{k=1}^{K}\nabla F_k\big(x^{(k)}_t;\xi^{(k)}_t\big)\Big\|_2^2. \tag{37}
\]
For the first term, since $\nabla f(\bar{x}_t)=\frac{1}{K}\sum_{k}\nabla f_k(\bar{x}_t)$,
\begin{align*}
-\Big\langle\nabla f(\bar{x}_t),\frac{1}{K}\sum_{k=1}^{K}\nabla f_k\big(x^{(k)}_t\big)\Big\rangle
&=-\|\nabla f(\bar{x}_t)\|_2^2-\Big\langle\nabla f(\bar{x}_t),\frac{1}{K}\sum_{k=1}^{K}\big(\nabla f_k\big(x^{(k)}_t\big)-\nabla f_k(\bar{x}_t)\big)\Big\rangle\\
&\le -\frac{1}{2}\|\nabla f(\bar{x}_t)\|_2^2+\frac{1}{2}\Big\|\frac{1}{K}\sum_{k=1}^{K}\big(\nabla f_k\big(x^{(k)}_t\big)-\nabla f_k(\bar{x}_t)\big)\Big\|_2^2
\le -\frac{1}{2}\|\nabla f(\bar{x}_t)\|_2^2+\frac{L^2}{2K}\sum_{k=1}^{K}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2. \tag{38}
\end{align*}
For the second term,
\[
\mathbb{E}_t\Big\|\frac{1}{K}\sum_{k=1}^{K}\nabla F_k\big(x^{(k)}_t;\xi^{(k)}_t\big)\Big\|_2^2
\le \frac{3\sigma^2}{K}+\frac{3L^2}{K}\sum_{k=1}^{K}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2+3\|\nabla f(\bar{x}_t)\|_2^2. \tag{39}
\]
Combining (37), (38), and (39),
\[
\mathbb{E}_t f(\bar{x}_{t+1})-f(\bar{x}_t)\le -\frac{\eta}{2}(1-3\eta L)\|\nabla f(\bar{x}_t)\|_2^2+\frac{\eta L^2}{2}(1+3\eta L)\cdot\frac{1}{K}\sum_{k=1}^{K}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2+\frac{3\eta^2L\sigma^2}{2K}, \tag{40}
\]
and choosing $\eta\le\frac{1}{9L}$,
\[
\mathbb{E}_t f(\bar{x}_{t+1})-f(\bar{x}_t)\le -\frac{\eta}{3}\|\nabla f(\bar{x}_t)\|_2^2+\frac{2\eta L^2}{3}\cdot\frac{1}{K}\sum_{k=1}^{K}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2+\frac{3\eta^2L\sigma^2}{2K}. \tag{41}
\]
Then for the averaged parameters $\frac{1}{K}\sum_{k}x^{(k)}_t$, using $\bar{x}_t-\frac{1}{K}\sum_{k}x^{(k)}_t=-\frac{1}{K}\sum_{k}e^{(k)}_{t-sp}$ and (41),
\begin{align*}
\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
&\le 2\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)-\nabla f(\bar{x}_t)\Big\|_2^2+2\|\nabla f(\bar{x}_t)\|_2^2
\le \frac{2L^2}{K}\sum_{k=1}^{K}\big\|e^{(k)}_{t-sp}\big\|_2^2+2\|\nabla f(\bar{x}_t)\|_2^2\\
&\le \frac{6\big[f(\bar{x}_t)-\mathbb{E}_tf(\bar{x}_{t+1})\big]}{\eta}+\frac{9\eta L\sigma^2}{K}+\frac{4L^2}{K}\sum_{k=1}^{K}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2+\frac{2L^2}{K}\sum_{k=1}^{K}\big\|e^{(k)}_{t-sp}\big\|_2^2. \tag{42}
\end{align*}
Taking total expectation, summing from $t=0$ to $T-1$, and rearranging,
\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
&\le \frac{6\big[f(x_0)-\mathbb{E}f(\bar{x}_T)\big]}{\eta T}+\frac{9\eta L\sigma^2}{K}+\frac{24(1-\delta)}{\delta^2}p^2\eta^2L^2(\sigma^2+\kappa^2+G^2)\\
&\quad+12\eta^2L^2(s+1)p\sigma^2\Big(1+\frac{12(1-\delta)}{\delta^2}(s+1)p\Big)+36\eta^2L^2(s+1)^2p^2\kappa^2\Big(1+\frac{4(1-\delta)}{\delta^2}\Big)+\frac{144(1-\delta)}{\delta^2}\eta^2L^2(s+1)^2p^2G^2\\
&\le \frac{6(f(x_0)-f^*)}{\eta T}+\frac{9\eta L\sigma^2}{K}+12\eta^2L^2(s+1)p\sigma^2\Big(1+\frac{14(1-\delta)}{\delta^2}(s+1)p\Big)\\
&\quad+36\eta^2L^2(s+1)^2p^2\kappa^2\Big(1+\frac{5(1-\delta)}{\delta^2}\Big)+\frac{168(1-\delta)}{\delta^2}\eta^2L^2(s+1)^2p^2G^2, \tag{43}
\end{align*}
where the second inequality follows Lemmas 4 and 5. $\square$
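As a sanity check on the rate (this reformulation is ours, not stated in the paper): with the standard step-size choice $\eta=\frac{\sqrt{K}}{L\sqrt{T}}$, which satisfies the constraint $\eta\le\min\{\frac{1}{6L(s+1)p},\frac{1}{9L}\}$ once $T\ge 81K$ and $T\ge 36K(s+1)^2p^2$, the bound of Theorem 1 becomes
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
=\mathcal{O}\Big(\frac{L\big(f(x_0)-f^*\big)+\sigma^2}{\sqrt{KT}}\Big)
+\mathcal{O}\Big(\frac{K(s+1)^2p^2\,(\sigma^2+\kappa^2+G^2)}{\delta^2\,T}\Big),
\]
since every $\eta^2L^2$ factor equals $K/T$. The leading term matches the $\mathcal{O}(1/\sqrt{KT})$ rate of vanilla synchronous SGD, while the staleness $s$, the period $p$, and the compression parameter $\delta$ enter only the higher-order $\mathcal{O}(K/T)$ term.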

E PROOF OF THEOREM 2

Lemma 6. For OLCO$^3$-TC with vanilla SGD and under Assumptions 2, 3, 4, and 5, the local error satisfies
\[
\mathbb{E}\big\|e^{(k)}_t\big\|_2^2\le\frac{12(1-\delta)}{\delta^2}p^2\eta^2(\sigma^2+\kappa^2+G^2). \tag{44}
\]

Proof. Same as the proof of Lemma 4, with $e^{(k)}_t$ given by the OLCO$^3$-TC update rule. $\square$

Lemma 7. For OLCO$^3$-TC with vanilla SGD and under Assumptions 2, 3, 4, and 5, the server error satisfies
\[
\mathbb{E}\big\|\bar{e}_t\big\|_2^2\le\frac{96(2-\delta)(1-\delta)}{\delta^4}p^2\eta^2(\sigma^2+\kappa^2+G^2). \tag{45}
\]

Proof. Let $S_t=\lfloor t/p\rfloor$. First,
\begin{align*}
\mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}C\big(\Delta^{(k)}_{S_tp}\big)\Big\|_2^2
&\le 2\,\mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}C\big(\Delta^{(k)}_{S_tp}\big)-\frac{1}{K}\sum_{k=1}^{K}\Delta^{(k)}_{S_tp}\Big\|_2^2+2\,\mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}\Delta^{(k)}_{S_tp}\Big\|_2^2\\
&\le \frac{2}{K}\sum_{k=1}^{K}\mathbb{E}\big\|C\big(\Delta^{(k)}_{S_tp}\big)-\Delta^{(k)}_{S_tp}\big\|_2^2+\frac{2}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\Delta^{(k)}_{S_tp}\big\|_2^2
\le \frac{2(2-\delta)}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\Delta^{(k)}_{S_tp}\big\|_2^2. \tag{46}
\end{align*}
Following the proof of Lemma 4, $\mathbb{E}\|\Delta^{(k)}_{S_tp}\|_2^2\le\frac{3(1+1/\rho)}{1-(1-\delta)(1+\rho)}p^2\eta^2(\sigma^2+\kappa^2+G^2)$. Therefore,
\begin{align*}
\mathbb{E}\|\bar{e}_t\|_2^2=\mathbb{E}\|\bar{e}_{S_tp}\|_2^2
&\le(1-\delta)\,\mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}C\big(\Delta^{(k)}_{S_tp}\big)+\bar{e}_{(S_t-1)p}\Big\|_2^2\\
&\le(1-\delta)\Big(1+\frac{1}{\rho}\Big)\mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}C\big(\Delta^{(k)}_{S_tp}\big)\Big\|_2^2+(1-\delta)(1+\rho)\,\mathbb{E}\|\bar{e}_{(S_t-1)p}\|_2^2\\
&\le 2(2-\delta)(1-\delta)\Big(1+\frac{1}{\rho}\Big)\frac{3(1+1/\rho)}{1-(1-\delta)(1+\rho)}p^2\eta^2(\sigma^2+\kappa^2+G^2)+(1-\delta)(1+\rho)\,\mathbb{E}\|\bar{e}_{(S_t-1)p}\|_2^2\\
&\le \frac{6(2-\delta)(1-\delta)(1+1/\rho)^2}{[1-(1-\delta)(1+\rho)]^2}p^2\eta^2(\sigma^2+\kappa^2+G^2). \tag{47}
\end{align*}
Choosing $\rho=\frac{\delta}{2(1-\delta)}$, so that $1+\frac{1}{\rho}=\frac{2-\delta}{\delta}\le\frac{2}{\delta}$ and $1-(1-\delta)(1+\rho)=\frac{\delta}{2}$, gives $\mathbb{E}\|\bar{e}_t\|_2^2\le\frac{96(2-\delta)(1-\delta)}{\delta^4}p^2\eta^2(\sigma^2+\kappa^2+G^2)$. $\square$

Lemma 8. For OLCO$^3$-TC with vanilla SGD and under Assumptions 1, 2, 3, 4, and 5, if the learning rate $\eta\le\frac{1}{6L(s+1)p}$ and $h(\delta)=\frac{1-\delta}{\delta^2}\big(1+\frac{4(2-\delta)}{\delta^2}\big)$, we have
\[
\frac{1}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2
\le 3\eta^2p\sigma^2\big(s+1+72h(\delta)p\big)+9\eta^2p^2\kappa^2\big((s+1)^2+24h(\delta)\big)+216h(\delta)\eta^2p^2G^2. \tag{48}
\]

Proof. Let $S_t=\lfloor t/p\rfloor$. Since $e^{(k)}_{S_tp}=\Delta^{(k)}_{S_tp}-C(\Delta^{(k)}_{S_tp})$, the compressed differences telescope:
\[
\sum_{i=0}^{s-1}C\big(\Delta^{(k)}_{(S_t-i)p}\big)
=\sum_{i=0}^{s-1}\Big[\sum_{t'=(S_t-i-1)p}^{(S_t-i)p-1}\eta\nabla F_k\big(x^{(k)}_{t'};\xi^{(k)}_{t'}\big)+e^{(k)}_{(S_t-i-1)p}-e^{(k)}_{(S_t-i)p}\Big]
=\sum_{t'=(S_t-s)p}^{S_tp-1}\eta\nabla F_k\big(x^{(k)}_{t'};\xi^{(k)}_{t'}\big)+e^{(k)}_{(S_t-s)p}-e^{(k)}_{S_tp}. \tag{50}
\]
Hence
\begin{align*}
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2
&\le \frac{2\eta^2}{K}\sum_{k=1}^{K}\mathbb{E}\bigg\|-\frac{1}{K}\sum_{k'=1}^{K}\sum_{t'=(S_t-s)p}^{t-1}\nabla F_{k'}\big(x^{(k')}_{t'};\xi^{(k')}_{t'}\big)+\sum_{t'=(S_t-s)p}^{t-1}\nabla F_k\big(x^{(k)}_{t'};\xi^{(k)}_{t'}\big)\bigg\|_2^2\\
&\quad+\frac{2}{K}\sum_{k=1}^{K}\mathbb{E}\bigg\|-\frac{1}{K}\sum_{k'=1}^{K}e^{(k')}_{(S_t-s)p}+e^{(k)}_{(S_t-s)p}-e^{(k)}_{S_tp}-\bar{e}_{(S_t-s)p}\bigg\|_2^2, \tag{51}
\end{align*}
where the first term is bounded following Eqs. (29, 30). The second term satisfies
\[
\text{second term of (51)}
\le \frac{6}{K}\sum_{k=1}^{K}\mathbb{E}\big\|e^{(k)}_{(S_t-s)p}\big\|_2^2+\frac{6}{K}\sum_{k=1}^{K}\mathbb{E}\big\|e^{(k)}_{S_tp}\big\|_2^2+6\,\mathbb{E}\big\|\bar{e}_{(S_t-s)p}\big\|_2^2
\le 144h(\delta)p^2\eta^2(\sigma^2+\kappa^2+G^2), \tag{52}
\]
where the last inequality follows Lemmas 6 and 7. Combining the two bounds,
\begin{align*}
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2
&\le 2\eta^2p\sigma^2\big(s+1+72h(\delta)p\big)+6\eta^2p^2\kappa^2\big((s+1)^2+24h(\delta)\big)+144h(\delta)p^2\eta^2G^2\\
&\quad+12\eta^2L^2(s+1)p\sum_{t'=t-(s+1)p}^{t-1}\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_{t'}-x^{(k)}_{t'}\big\|_2^2. \tag{53}
\end{align*}
Following Eqs. (33, 34, 35) yields (48). $\square$

Theorem 2. For OLCO$^3$-TC with vanilla SGD and under Assumptions 1, 2, 3, 4, and 5, if the learning rate $\eta\le\min\{\frac{1}{6L(s+1)p},\frac{1}{9L}\}$ and $h(\delta)=\frac{1-\delta}{\delta^2}\big(1+\frac{4(2-\delta)}{\delta^2}\big)$, then
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
\le \frac{6(f(x_0)-f^*)}{\eta T}+\frac{9\eta L\sigma^2}{K}
+12\eta^2L^2p\sigma^2\big(s+1+80h(\delta)p\big)
+12\eta^2L^2p^2\kappa^2\big(3(s+1)^2+80h(\delta)\big)
+960\eta^2L^2p^2G^2h(\delta). \tag{55}
\]
(The factor $L^2$ on the $\eta^2$ terms, present throughout the proof below, restores a factor that was dropped in the original statement.)

Proof. Following the proof of Theorem 1, we have the same inequality as Eq. (41):
\[
\mathbb{E}_t f(\bar{x}_{t+1})-f(\bar{x}_t)\le -\frac{\eta}{3}\|\nabla f(\bar{x}_t)\|_2^2+\frac{2\eta L^2}{3}\cdot\frac{1}{K}\sum_{k=1}^{K}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2+\frac{3\eta^2L\sigma^2}{2K}. \tag{56}
\]
Then for the averaged parameters $\frac{1}{K}\sum_{k}x^{(k)}_t$, using $\bar{x}_t-\frac{1}{K}\sum_{k}x^{(k)}_t=-\big(\frac{1}{K}\sum_{k}e^{(k)}_t+\bar{e}_{t-sp}\big)$,
\begin{align*}
\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
&\le 2L^2\Big\|\frac{1}{K}\sum_{k=1}^{K}e^{(k)}_t+\bar{e}_{t-sp}\Big\|_2^2+2\|\nabla f(\bar{x}_t)\|_2^2
\le \frac{4L^2}{K}\sum_{k=1}^{K}\big\|e^{(k)}_t\big\|_2^2+4L^2\|\bar{e}_{t-sp}\|_2^2+2\|\nabla f(\bar{x}_t)\|_2^2\\
&\le \frac{6\big[f(\bar{x}_t)-\mathbb{E}_tf(\bar{x}_{t+1})\big]}{\eta}+\frac{9\eta L\sigma^2}{K}+\frac{4L^2}{K}\sum_{k=1}^{K}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2+\frac{4L^2}{K}\sum_{k=1}^{K}\big\|e^{(k)}_t\big\|_2^2+4L^2\|\bar{e}_{t-sp}\|_2^2. \tag{57}
\end{align*}
Taking total expectation, summing from $t=0$ to $T-1$, and rearranging,
\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
&\le \frac{6\big[f(x_0)-\mathbb{E}f(\bar{x}_T)\big]}{\eta T}+\frac{9\eta L\sigma^2}{K}+\frac{4L^2}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|e^{(k)}_t\big\|_2^2+\frac{4L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\bar{e}_{t-sp}\|_2^2+\frac{4L^2}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|\bar{x}_t-x^{(k)}_t\big\|_2^2\\
&\le \frac{6\big[f(x_0)-\mathbb{E}f(\bar{x}_T)\big]}{\eta T}+\frac{9\eta L\sigma^2}{K}
+12\eta^2L^2p\sigma^2\Big(s+1+72h(\delta)p+\frac{4(1-\delta)}{\delta^2}p+\frac{32(2-\delta)(1-\delta)}{\delta^4}p\Big)\\
&\quad+12\eta^2L^2p^2\kappa^2\Big(3(s+1)^2+72h(\delta)+\frac{4(1-\delta)}{\delta^2}+\frac{32(2-\delta)(1-\delta)}{\delta^4}\Big)
+12\eta^2L^2p^2G^2\Big(72h(\delta)+\frac{4(1-\delta)}{\delta^2}+\frac{32(2-\delta)(1-\delta)}{\delta^4}\Big)\\
&\le \frac{6(f(x_0)-f^*)}{\eta T}+\frac{9\eta L\sigma^2}{K}+12\eta^2L^2p\sigma^2\big(s+1+80h(\delta)p\big)+12\eta^2L^2p^2\kappa^2\big(3(s+1)^2+80h(\delta)\big)+960\eta^2L^2p^2G^2h(\delta), \tag{58}
\end{align*}
where the second inequality follows Lemmas 6, 7, and 8. $\square$
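To read the constants (an observation of ours, not a claim from the paper): the compression penalty enters Theorem 2 only through $h(\delta)=\frac{1-\delta}{\delta^2}\big(1+\frac{4(2-\delta)}{\delta^2}\big)$. For a lossless compressor, $\delta=1$ and
\[
h(1)=\frac{1-1}{1}\Big(1+\frac{4(2-1)}{1}\Big)=0,
\]
so the bound collapses to $\frac{6(f(x_0)-f^*)}{\eta T}+\frac{9\eta L\sigma^2}{K}+12\eta^2L^2(s+1)p\sigma^2+36\eta^2L^2(s+1)^2p^2\kappa^2$, i.e. the delay-tolerant Local SGD bound without any compression terms. As $\delta\to 0$ (very aggressive compression), $h(\delta)=\Theta(1/\delta^4)$, which quantifies how quickly the higher-order terms grow with the contraction factor of the compressor.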

F PROOF OF THEOREM 3

We first define two virtual variables $p_t$ and $z_t$ satisfying
\[
p_t=\begin{cases}\dfrac{\mu}{1-\mu}\big(\bar{x}_t-\bar{x}_{t-1}\big), & t\ge 1,\\[4pt] 0, & t=0,\end{cases}
\qquad\text{and}\qquad z_t=\bar{x}_t+p_t. \tag{60}
\]
Then the update rule of $z_t$ satisfies
\begin{align*}
z_{t+1}-z_t
&=(\bar{x}_{t+1}-\bar{x}_t)+\frac{\mu}{1-\mu}(\bar{x}_{t+1}-\bar{x}_t)-\frac{\mu}{1-\mu}(\bar{x}_t-\bar{x}_{t-1})\\
&=-\frac{\eta}{K}\sum_{k=1}^{K}m^{(k)}_{t+1}-\frac{\mu}{1-\mu}\cdot\frac{\eta}{K}\sum_{k=1}^{K}m^{(k)}_{t+1}+\frac{\mu}{1-\mu}\cdot\frac{\eta}{K}\sum_{k=1}^{K}m^{(k)}_{t}\\
&=-\frac{\eta}{(1-\mu)K}\sum_{k=1}^{K}\big(m^{(k)}_{t+1}-\mu m^{(k)}_{t}\big)
=-\frac{\eta}{(1-\mu)K}\sum_{k=1}^{K}\nabla F_k\big(x^{(k)}_t;\xi^{(k)}_t\big), \tag{61}
\end{align*}
which holds for OLCO$^3$-OC, OLCO$^3$-VQ, and OLCO$^3$-TC.

Lemma 9. For OLCO$^3$ with Momentum SGD, we have
\[
\mathbb{E}\big\|m^{(k)}_t\big\|_2^2\le\frac{3(\sigma^2+\kappa^2+G^2)}{(1-\mu)^2}. \tag{62}
\]

Proof. By Jensen's inequality applied to the convex combination with weights $\mu^{t-1-t'}/\sum_{t''=0}^{t-1}\mu^{t-1-t''}$,
\begin{align*}
\mathbb{E}\big\|m^{(k)}_t\big\|_2^2
&=\mathbb{E}\Big\|\sum_{t'=0}^{t-1}\mu^{t-1-t'}\nabla F_k\big(x^{(k)}_{t'};\xi^{(k)}_{t'}\big)\Big\|_2^2
=\Big(\sum_{t'=0}^{t-1}\mu^{t-1-t'}\Big)^2\,\mathbb{E}\Big\|\sum_{t'=0}^{t-1}\frac{\mu^{t-1-t'}}{\sum_{t''=0}^{t-1}\mu^{t-1-t''}}\nabla F_k\big(x^{(k)}_{t'};\xi^{(k)}_{t'}\big)\Big\|_2^2\\
&\le\Big(\sum_{t'=0}^{t-1}\mu^{t-1-t'}\Big)^2\max_{t'}\mathbb{E}\big\|\nabla F_k\big(x^{(k)}_{t'};\xi^{(k)}_{t'}\big)\big\|_2^2
\le\frac{3(\sigma^2+\kappa^2+G^2)}{(1-\mu)^2}. \tag{63}
\end{align*}
$\square$

Lemma 10. For OLCO$^3$ with Momentum SGD, we have
\[
\mathbb{E}\|p_t\|_2^2\le\frac{3\mu^2\eta^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^4}. \tag{64}
\]

Proof.
\[
\mathbb{E}\|p_t\|_2^2=\frac{\mu^2}{(1-\mu)^2}\mathbb{E}\|\bar{x}_t-\bar{x}_{t-1}\|_2^2
=\frac{\mu^2\eta^2}{(1-\mu)^2}\mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}m^{(k)}_t\Big\|_2^2
\le\frac{\mu^2\eta^2}{(1-\mu)^2K}\sum_{k=1}^{K}\mathbb{E}\big\|m^{(k)}_t\big\|_2^2
\le\frac{3\mu^2\eta^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^4}. \tag{65}
\]
$\square$

Lemma 11. For OLCO$^3$-VQ with Momentum SGD and under Assumptions 2, 3, 4, and 5, the local error satisfies
\[
\mathbb{E}\big\|e^{(k)}_t\big\|_2^2\le\frac{12(1-\delta)}{(1-\mu)^2\delta^2}p^2\eta^2(\sigma^2+\kappa^2+G^2). \tag{66}
\]

Proof. Let $S_t=\lfloor t/p\rfloor$. Then
\begin{align*}
\mathbb{E}\big\|e^{(k)}_t\big\|_2^2=\mathbb{E}\big\|e^{(k)}_{S_tp}\big\|_2^2
=\mathbb{E}\big\|C\big(\Delta^{(k)}_{S_tp}\big)-\Delta^{(k)}_{S_tp}\big\|_2^2
\le(1-\delta)\,\mathbb{E}\big\|\Delta^{(k)}_{S_tp}\big\|_2^2
&=(1-\delta)\,\mathbb{E}\Big\|\sum_{t'=(S_t-1)p}^{S_tp-1}\eta m^{(k)}_{t'}+e^{(k)}_{(S_t-s-1)p}\Big\|_2^2\\
&\le(1-\delta)(1+\rho)\,\mathbb{E}\big\|e^{(k)}_{(S_t-s-1)p}\big\|_2^2+(1-\delta)\Big(1+\frac{1}{\rho}\Big)\frac{3\eta^2p^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^2}, \tag{67}
\end{align*}
where the last step uses Lemma 9. Unrolling the recursion,
\[
\mathbb{E}\big\|e^{(k)}_t\big\|_2^2
\le 3(1-\delta)\Big(1+\frac{1}{\rho}\Big)\frac{p^2\eta^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^2}\sum_{i\ge 0}\big[(1-\delta)(1+\rho)\big]^i
\le \frac{3(1-\delta)(1+1/\rho)}{1-(1-\delta)(1+\rho)}\cdot\frac{p^2\eta^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^2}. \tag{68}
\]
Choosing $\rho=\frac{\delta}{2(1-\delta)}$, so that $1+\frac{1}{\rho}=\frac{2-\delta}{\delta}\le\frac{2}{\delta}$, gives the claim. $\square$

Lemma 12. For OLCO$^3$-VQ with Momentum SGD and under Assumptions 1, 2, 3, 4, and 5, if the learning rate $\eta\le\frac{1-\mu}{\sqrt{72}L(s+1)p}$, we have
\begin{align*}
\frac{1}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|z_t-x^{(k)}_t\big\|_2^2
&\le \frac{4\eta^2\sigma^2}{(1-\mu)^2}\Big(\frac{3}{(1-\mu)^2}+(s+1)p+\frac{12(1-\delta)(s+1)^2p^2}{\delta^2}\Big)\\
&\quad+\frac{12\eta^2\kappa^2}{(1-\mu)^2}\Big(\frac{1}{(1-\mu)^2}+(s+1)^2p^2+\frac{4(1-\delta)(s+1)^2p^2}{\delta^2}\Big)
+\frac{12\eta^2G^2}{(1-\mu)^2}\Big(\frac{1}{(1-\mu)^2}+\frac{4(1-\delta)(s+1)^2p^2}{\delta^2}\Big). \tag{69}
\end{align*}

Proof. Let $S_t=\lfloor t/p\rfloor$. Unrolling the updates from iteration $(S_t-s)p$, the difference $z_t-x^{(k)}_t$ splits into a momentum carry-over term, a gradient term, and a compression-error term, so that
\begin{align*}
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|z_t-x^{(k)}_t\big\|_2^2
&\le \frac{3\eta^2}{K}\sum_{k=1}^{K}\mathbb{E}\Big\|\frac{1}{K}\sum_{k'=1}^{K}m^{(k')}_{(S_t-s)p}\sum_{\tau=0}^{t-1-(S_t-s)p}\mu^{\tau+1}-m^{(k)}_{(S_t-s)p}\sum_{\tau=0}^{t-1-(S_t-s)p}\mu^{\tau+1}\Big\|_2^2\\
&\quad+\frac{3\eta^2}{K}\sum_{k=1}^{K}\mathbb{E}\Big\|\frac{1}{K}\sum_{k'=1}^{K}\sum_{t'=(S_t-s)p}^{t-1}\nabla F_{k'}\big(x^{(k')}_{t'};\xi^{(k')}_{t'}\big)\sum_{\tau=0}^{t-1-t'}\mu^{\tau}-\sum_{t'=(S_t-s)p}^{t-1}\nabla F_k\big(x^{(k)}_{t'};\xi^{(k)}_{t'}\big)\sum_{\tau=0}^{t-1-t'}\mu^{\tau}\Big\|_2^2\\
&\quad+\frac{3}{K}\sum_{k=1}^{K}\mathbb{E}\Big\|\frac{1}{K}\sum_{k'=1}^{K}e^{(k')}_{t-sp}+\frac{1}{K}\sum_{k'=1}^{K}\sum_{i=0}^{s-1}e^{(k')}_{(S_t-i-s-1)p}-\sum_{i=0}^{s-1}e^{(k)}_{(S_t-i-s-1)p}\Big\|_2^2. \tag{70}
\end{align*}
The first term is bounded by
\[
\frac{3\eta^2}{K}\sum_{k=1}^{K}\mathbb{E}\Big\|m^{(k)}_{(S_t-s)p}\sum_{\tau=0}^{t-1-(S_t-s)p}\mu^{\tau+1}\Big\|_2^2
\le\frac{3\eta^2}{(1-\mu)^2K}\sum_{k=1}^{K}\mathbb{E}\big\|m^{(k)}_{(S_t-s)p}\big\|_2^2
\le\frac{9\eta^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^4}, \tag{71}
\]
where the last inequality follows Lemma 9. Following Eqs. (29, 30), the second term is bounded by $\frac{3\eta^2(s+1)p\sigma^2}{(1-\mu)^2}+\frac{3\eta^2(s+1)p}{(1-\mu)^2}\sum_{t'=t-(s+1)p}^{t-1}\big(\frac{6L^2}{K}\sum_{k=1}^{K}\mathbb{E}\|z_{t'}-x^{(k)}_{t'}\|_2^2+3\kappa^2\big)$. Following Eq. (31) and Lemma 11, the third term is bounded by
\[
\frac{3(s+1)}{K}\sum_{k=1}^{K}\sum_{i=0}^{s}\mathbb{E}\big\|e^{(k)}_{(S_t-i-s)p}\big\|_2^2
\le\frac{36(1-\delta)\eta^2(s+1)^2p^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^2\delta^2}. \tag{72}
\]
Combining the three bounds,
\begin{align*}
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|z_t-x^{(k)}_t\big\|_2^2
&\le \frac{9\eta^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^4}+\frac{3\eta^2(s+1)p\sigma^2}{(1-\mu)^2}+\frac{9\eta^2(s+1)^2p^2\kappa^2}{(1-\mu)^2}+\frac{36(1-\delta)\eta^2(s+1)^2p^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^2\delta^2}\\
&\quad+\frac{18\eta^2L^2(s+1)p}{(1-\mu)^2}\sum_{t'=t-(s+1)p}^{t-1}\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|z_{t'}-x^{(k)}_{t'}\big\|_2^2. \tag{73}
\end{align*}
Summing from $t=0$ to $T-1$, dividing by $T$, and rearranging as in (33)-(35), the condition $\eta\le\frac{1-\mu}{\sqrt{72}L(s+1)p}$ gives $\frac{18\eta^2L^2(s+1)^2p^2}{(1-\mu)^2}\le\frac{1}{4}$, and multiplying by $\frac{4}{3}$ yields (69). $\square$

Theorem 3. For OLCO$^3$-VQ with Momentum SGD and under Assumptions 1, 2, 3, 4, and 5, if the learning rate $\eta\le\min\{\frac{1-\mu}{\sqrt{72}L(s+1)p},\frac{1-\mu}{9L}\}$ and $g(\mu,\delta,s,p)=\frac{15}{(1-\mu)^2}+\frac{60(1-\delta)(s+1)^2p^2}{\delta^2}$, then
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
\le \frac{6(1-\mu)(f(x_0)-f^*)}{\eta T}+\frac{9L\eta\sigma^2}{(1-\mu)K}
+\frac{4\eta^2L^2}{(1-\mu)^2}\Big[\big(4(s+1)p+g(\mu,\delta,s,p)\big)\sigma^2+\big(12(s+1)^2p^2+g(\mu,\delta,s,p)\big)\kappa^2+g(\mu,\delta,s,p)G^2\Big]. \tag{76}
\]

Proof. Following the proof of Theorem 1 and the update rule Eq. (61), choosing $\eta\le\frac{1-\mu}{9L}$ gives an inequality analogous to Eq. (41):
\begin{align*}
\mathbb{E}_t f(z_{t+1})-f(z_t)
&\le \frac{\eta}{1-\mu}\Big(-\frac{1}{2}\|\nabla f(z_t)\|_2^2+\frac{L^2}{2K}\sum_{k=1}^{K}\big\|z_t-x^{(k)}_t\big\|_2^2\Big)
+\frac{L\eta^2}{2(1-\mu)^2}\Big(\frac{3\sigma^2}{K}+\frac{3L^2}{K}\sum_{k=1}^{K}\big\|z_t-x^{(k)}_t\big\|_2^2+3\|\nabla f(z_t)\|_2^2\Big)\\
&\le -\frac{\eta}{3(1-\mu)}\|\nabla f(z_t)\|_2^2+\frac{2\eta L^2}{3(1-\mu)K}\sum_{k=1}^{K}\big\|z_t-x^{(k)}_t\big\|_2^2+\frac{3L\eta^2\sigma^2}{2(1-\mu)^2K}. \tag{77}
\end{align*}
Then for the averaged parameters $\frac{1}{K}\sum_{k}x^{(k)}_t$, using $z_t-\frac{1}{K}\sum_{k}x^{(k)}_t=-\frac{1}{K}\sum_{k}e^{(k)}_{t-sp}+p_t$,
\[
\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
\le 2L^2\Big\|\frac{1}{K}\sum_{k=1}^{K}e^{(k)}_{t-sp}-p_t\Big\|_2^2+2\|\nabla f(z_t)\|_2^2
\le 4L^2\Big\|\frac{1}{K}\sum_{k=1}^{K}e^{(k)}_{t-sp}\Big\|_2^2+4L^2\|p_t\|_2^2+2\|\nabla f(z_t)\|_2^2. \tag{78}
\]
Therefore,
\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
&\le \frac{6(1-\mu)\big[f(z_0)-\mathbb{E}f(z_T)\big]}{\eta T}+\frac{9L\eta\sigma^2}{(1-\mu)K}+\frac{4L^2}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|z_t-x^{(k)}_t\big\|_2^2\\
&\quad+\frac{4L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}e^{(k)}_{t-sp}\Big\|_2^2+\frac{4L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|p_t\|_2^2\\
&\le \frac{6(1-\mu)(f(x_0)-f^*)}{\eta T}+\frac{9L\eta\sigma^2}{(1-\mu)K}
+\frac{4\eta^2L^2}{(1-\mu)^2}\Big[\big(4(s+1)p+g(\mu,\delta,s,p)\big)\sigma^2+\big(12(s+1)^2p^2+g(\mu,\delta,s,p)\big)\kappa^2+g(\mu,\delta,s,p)G^2\Big], \tag{79}
\end{align*}
where the last inequality follows Lemmas 10, 11, and 12. $\square$

Lemma 13. For OLCO$^3$-TC with Momentum SGD and under Assumptions 2, 3, 4, and 5, the local error satisfies
\[
\mathbb{E}\big\|e^{(k)}_t\big\|_2^2\le\frac{12(1-\delta)}{(1-\mu)^2\delta^2}p^2\eta^2(\sigma^2+\kappa^2+G^2). \tag{84}
\]

Proof. Same as the proof of Lemma 11, with $e^{(k)}_t$ given by the OLCO$^3$-TC update rule. $\square$

Lemma 14. For OLCO$^3$-TC with Momentum SGD and under Assumptions 2, 3, 4, and 5, the server error satisfies
\[
\mathbb{E}\big\|\bar{e}_t\big\|_2^2\le\frac{96(2-\delta)(1-\delta)}{(1-\mu)^2\delta^4}p^2\eta^2(\sigma^2+\kappa^2+G^2). \tag{85}
\]

Proof. Following the proof of Lemma 11, $\mathbb{E}\|\Delta^{(k)}_{S_tp}\|_2^2\le\frac{3(1+1/\rho)}{1-(1-\delta)(1+\rho)}\cdot\frac{p^2\eta^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^2}$. Therefore, following the proof of Lemma 7,
\begin{align*}
\mathbb{E}\|\bar{e}_t\|_2^2
&\le 2(2-\delta)(1-\delta)\Big(1+\frac{1}{\rho}\Big)\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\Delta^{(k)}_{S_tp}\big\|_2^2+(1-\delta)(1+\rho)\,\mathbb{E}\|\bar{e}_{(S_t-1)p}\|_2^2\\
&\le \frac{6(2-\delta)(1-\delta)(1+1/\rho)^2}{[1-(1-\delta)(1+\rho)]^2(1-\mu)^2}p^2\eta^2(\sigma^2+\kappa^2+G^2).
\end{align*}
Choosing $\rho=\frac{\delta}{2(1-\delta)}$, so that $1+\frac{1}{\rho}=\frac{2-\delta}{\delta}\le\frac{2}{\delta}$, gives the claim. $\square$

Lemma 15. For OLCO$^3$-TC with Momentum SGD and under Assumptions 1, 2, 3, 4, and 5, if the learning rate $\eta\le\frac{1-\mu}{\sqrt{72}L(s+1)p}$ and $h(\delta)=\frac{1-\delta}{\delta^2}\big(1+\frac{4(2-\delta)}{\delta^2}\big)$, we have
\begin{align*}
\frac{1}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|z_t-x^{(k)}_t\big\|_2^2
&\le \frac{3\eta^2\sigma^2}{(1-\mu)^2}\Big(\frac{3}{(1-\mu)^2}+(s+1)p+72h(\delta)p^2\Big)
+\frac{3\eta^2\kappa^2}{(1-\mu)^2}\Big(\frac{3}{(1-\mu)^2}+3(s+1)^2p^2+72h(\delta)p^2\Big)\\
&\quad+\frac{3\eta^2G^2}{(1-\mu)^2}\Big(\frac{3}{(1-\mu)^2}+72h(\delta)p^2\Big). \tag{87}
\end{align*}

Proof. Let $S_t=\lfloor t/p\rfloor$ and expand $z_t-x^{(k)}_t$ as in the proof of Lemma 12, with $C(\Delta^{(k)}_{(S_t-i)p})$ in place of $\Delta^{(k)}_{(S_t-i)p}$ and the error terms $\frac{1}{K}\sum_{k'}e^{(k')}_t+\bar{e}_{t-sp}$ of OLCO$^3$-TC. The first (momentum) term is bounded following Lemma 9 and the second (gradient) term following Eqs. (29, 30), as in (72). The third (error) term satisfies, by Lemmas 13 and 14,
\[
\text{third term}
\le \frac{1-\delta}{(1-\mu)^2\delta^2}\Big(1+\frac{4(2-\delta)}{\delta^2}\Big)\cdot 216\,p^2\eta^2(\sigma^2+\kappa^2+G^2)
= \frac{216\,h(\delta)}{(1-\mu)^2}p^2\eta^2(\sigma^2+\kappa^2+G^2).
\]
Combining these bounds,
\begin{align*}
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|z_t-x^{(k)}_t\big\|_2^2
&\le \frac{9\eta^2(\sigma^2+\kappa^2+G^2)}{(1-\mu)^4}+\frac{216\,h(\delta)}{(1-\mu)^2}p^2\eta^2(\sigma^2+\kappa^2+G^2)+\frac{3\eta^2(s+1)p\sigma^2}{(1-\mu)^2}+\frac{9\eta^2(s+1)^2p^2\kappa^2}{(1-\mu)^2}\\
&\quad+\frac{18\eta^2L^2(s+1)p}{(1-\mu)^2}\sum_{t'=t-(s+1)p}^{t-1}\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|z_{t'}-x^{(k)}_{t'}\big\|_2^2. \tag{88}
\end{align*}

Under review as a conference paper at ICLR 2021

Summing the above inequality from $t=0$ to $T-1$, dividing by $T$, and choosing $\eta\le\frac{1-\mu}{\sqrt{72}L(s+1)p}$ yields (87). \hfill (89) $\square$

Theorem 4. For OLCO$^3$-TC with Momentum SGD and under Assumptions 1, 2, 3, 4, and 5, if the learning rate $\eta\le\min\{\frac{1-\mu}{\sqrt{72}L(s+1)p},\frac{1-\mu}{9L}\}$ and $h(\delta)=\frac{1-\delta}{\delta^2}\big(1+\frac{4(2-\delta)}{\delta^2}\big)$, then
\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
&\le \frac{6(1-\mu)(f(x_0)-f^*)}{\eta T}+\frac{9L\eta\sigma^2}{(1-\mu)K}
+\frac{6\eta^2L^2}{(1-\mu)^2}\Big[\sigma^2\Big(\frac{9}{(1-\mu)^2}+2(s+1)p+168h(\delta)p^2\Big)\\
&\quad+\kappa^2\Big(\frac{9}{(1-\mu)^2}+6(s+1)^2p^2+168h(\delta)p^2\Big)+G^2\Big(\frac{9}{(1-\mu)^2}+168h(\delta)p^2\Big)\Big]. \tag{90}
\end{align*}

Proof. Following the proof of Theorem 3, we have
\[
\mathbb{E}_t f(z_{t+1})-f(z_t)\le -\frac{\eta}{3(1-\mu)}\|\nabla f(z_t)\|_2^2+\frac{2\eta L^2}{3(1-\mu)K}\sum_{k=1}^{K}\big\|z_t-x^{(k)}_t\big\|_2^2+\frac{3L\eta^2\sigma^2}{2(1-\mu)^2K}. \tag{92}
\]
Therefore,
\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\nabla f\Big(\frac{1}{K}\sum_{k=1}^{K}x^{(k)}_t\Big)\Big\|_2^2
&\le \frac{6(1-\mu)\big[f(z_0)-\mathbb{E}f(z_T)\big]}{\eta T}+\frac{9L\eta\sigma^2}{(1-\mu)K}+\frac{4L^2}{KT}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\mathbb{E}\big\|z_t-x^{(k)}_t\big\|_2^2\\
&\quad+\frac{6L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}e^{(k)}_{t-sp}\Big\|_2^2+\frac{6L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\bar{e}_{t-sp}\|_2^2+\frac{6L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|p_t\|_2^2\\
&\le \frac{6(1-\mu)(f(x_0)-f^*)}{\eta T}+\frac{9L\eta\sigma^2}{(1-\mu)K}
+\frac{6\eta^2L^2}{(1-\mu)^2}\Big[\sigma^2\Big(\frac{9}{(1-\mu)^2}+2(s+1)p+168h(\delta)p^2\Big)\\
&\quad+\kappa^2\Big(\frac{9}{(1-\mu)^2}+6(s+1)^2p^2+168h(\delta)p^2\Big)+G^2\Big(\frac{9}{(1-\mu)^2}+168h(\delta)p^2\Big)\Big], \tag{93}
\end{align*}
where the last inequality follows Lemmas 10, 13, 14, and 15. $\square$
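The virtual sequence $z_t=\bar{x}_t+\frac{\mu}{1-\mu}(\bar{x}_t-\bar{x}_{t-1})$ used above can be checked numerically: under momentum SGD, $z_t$ steps exactly like plain SGD with the rescaled rate $\eta/(1-\mu)$. The following self-contained sketch (our illustration on a toy one-dimensional quadratic, not code from the paper) verifies that identity.

```python
# Numerical check of the virtual sequence: z_t = x_t + (mu/(1-mu)) * (x_t - x_{t-1})
# satisfies z_{t+1} - z_t = -(eta/(1-mu)) * g_t, where g_t is the gradient at step t,
# even though x_t itself follows heavy-ball momentum SGD.

mu, eta = 0.9, 0.1

def grad(x):
    # gradient of the toy objective f(x) = 0.5 * x^2
    return x

x_prev, x, m = None, 5.0, 0.0
zs, grads = [], []
for _ in range(20):
    g = grad(x)
    m = mu * m + g                      # momentum buffer: m_{t+1} = mu * m_t + g_t
    x_new = x - eta * m                 # momentum SGD step on x
    z = x + (mu / (1 - mu)) * (x - x_prev) if x_prev is not None else x
    zs.append(z)
    grads.append(g)
    x_prev, x = x, x_new

# z_{t+1} - z_t + (eta/(1-mu)) * g_t should vanish (up to floating-point rounding)
residual = max(abs(zs[i + 1] - zs[i] + eta / (1 - mu) * grads[i])
               for i in range(len(zs) - 1))
print(residual < 1e-9)
```

The same cancellation is what makes the descent argument of Theorems 3 and 4 go through: the momentum buffers drop out of the $z_t$ update, leaving only the fresh stochastic gradient.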



SGD and Pipelining. In distributed training, we minimize the global loss function $f(\cdot)=\frac{1}{K}\sum_{k=1}^{K}f_k(\cdot)$, where $f_k(\cdot)$ is the local loss function at worker $k\in[K]$. At iteration $t$, vanilla synchronous SGD updates the model $x_t\in\mathbb{R}^d$ with learning rate $\eta_t$ via $x_{t+1}=x_t-\frac{\eta_t}{K}\sum_{k=1}^{K}\nabla F_k\big(x_t;\xi^{(k)}_t\big)$.
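The contrast between synchronizing every iteration and synchronizing every $p$ iterations can be sketched as follows. This is a minimal toy illustration of ours (scalar models, one quadratic per worker), not the paper's implementation:

```python
# Toy comparison: data-parallel synchronous SGD (all-reduce every step) versus
# Local SGD (periodic model averaging every p steps).
# Worker k minimizes f_k(x) = 0.5 * (x - c[k])**2, so the global optimum is mean(c).

def grad(x, c_k):
    return x - c_k

def sync_sgd(c, eta=0.1, steps=100):
    x = 0.0  # single shared model
    for _ in range(steps):
        g = sum(grad(x, c_k) for c_k in c) / len(c)  # gradient all-reduce every step
        x -= eta * g
    return x

def local_sgd(c, eta=0.1, steps=100, p=5):
    xs = [0.0] * len(c)  # one model replica per worker
    for t in range(1, steps + 1):
        xs = [x - eta * grad(x, c_k) for x, c_k in zip(xs, c)]
        if t % p == 0:  # communicate only every p steps: average the replicas
            avg = sum(xs) / len(xs)
            xs = [avg] * len(xs)
    return sum(xs) / len(xs)

c = [1.0, 2.0, 3.0, 6.0]
opt = sum(c) / len(c)  # = 3.0
print(abs(sync_sgd(c) - opt), abs(local_sgd(c) - opt))
```

Both reach (approximately) the same optimum, but Local SGD communicates $p$ times less often, which is the communication saving that OLCO$^3$ builds on.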

Figure 1: Training curves using T = 56 and s = 1 for the delay-tolerant methods, and T = 0 and p = 56 for Local SGD. Test accuracy can be found in Appendix A.3. Best viewed in color.

Figure 3: Non-i.i.d. data partition across workers. Best viewed in color.

Figure 6: Vary delay tolerance T for ResNet-56 on CIFAR-10. We set p of Local SGD equivalent to T of other delay-tolerant methods. Left: s = 1 for the OLCO 3 variants. Middle: s = 2 for the OLCO 3 variants. Right: s = 4 for the OLCO 3 variants.



Comparison of communication-efficient methods for distributed DNN training. The period $p\in\mathbb{N}^+$ is the communication interval for periodic averaging. The staleness $s\in\mathbb{N}$ is the number of communication rounds by which the information used in the model update is outdated. For all methods in this table, the delay tolerance is $T=sp$.

where $e^{(k)}_t$ is the local compression error at worker $k$ and $\bar{e}_t$ is the compression error at the server. If $\bar{x}_t$ follows the update rule of momentum SGD, then the real trained model $x_t$ will gradually approach $\bar{x}_t$ as the training converges, because the gradient and the errors diminish. Overlap Local Computation with Compressed Communication (OLCO$^3$) on worker $k\in[K]$. Green part: OLCO$^3$-TC; yellow part: OLCO$^3$-VQ. Best viewed in color.

Non-i.i.d. test accuracy (%) of ResNet-110 on CIFAR-10. T = 56 for the delay-tolerant methods, and T = 0 and p = 56 for Local SGD. Training curves can be found in Appendix A.2. The training curves of ResNet-110 on CIFAR-10 and ResNet-50 on ImageNet are shown in Figure 1. We use s = 1 because CoCoD-SGD and OverlapLocalSGD do not support s ≥ 2. Compared with the other delay-tolerant methods, the communication budget of the OLCO$^3$ variants is significantly smaller due to compressed communication. OLCO$^3$ is also robust to communication delay with a large T = sp. Therefore, OLCO$^3$ achieves extreme communication efficiency by combining compressed communication, delay tolerance, and the low communication frequency of periodic averaging.

We train the ResNet-110 (He et al., 2016) model with 8 workers on the CIFAR-10 (Krizhevsky et al., 2009) image classification task. We report the mean and standard deviation over 3 runs. The base learning rate is 0.4 and the total batch size is 512. The momentum constant is 0.9 and the weight decay is 1 × 10⁻⁴. The model is trained for 200 epochs with a learning rate decay of 0.1 at epochs 100 and 150. We linearly warm up the learning rate from 0.05 to 0.4 during the first 5 epochs. For OLCO$^3$ with staleness s ∈ {2, 4, 8}, we set the base learning rate to 0.2 due to the increased staleness. The rank of PowerSGD is 4. Random cropping, random flipping, and standardization are applied for data augmentation. We also train ResNet-56 in Appendix A.4 to explore more combinations of s and p, with all other settings unchanged.
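The learning-rate schedule described above (warmup, then step decay) can be written down directly; the function below is our reconstruction from the stated numbers, not code released with the paper:

```python
# Learning-rate schedule from the experimental setup: linear warmup from 0.05 to 0.4
# over the first 5 epochs, then decay by a factor of 0.1 at epochs 100 and 150,
# for 200 epochs in total.

def lr_at_epoch(epoch, base_lr=0.4, warmup_start=0.05, warmup_epochs=5,
                milestones=(100, 150), decay=0.1):
    if epoch < warmup_epochs:
        # linear interpolation from warmup_start up to base_lr
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay
    return lr

print([lr_at_epoch(e) for e in (0, 5, 99, 100, 150, 199)])
```

For the s ∈ {2, 4, 8} runs mentioned above, passing `base_lr=0.2` reproduces the reduced schedule.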

Test accuracy (%) of ResNet-110 on CIFAR-10 and ResNet-50 on ImageNet using T = 56 for the delay-tolerant methods, and T = 0 and p = 56 for Local SGD. CR stands for compression ratio. For each method, the first row denotes maintaining momentum and the second row denotes resetting momentum (line 6).

Test accuracy (%) for Figure 6, obtained by selecting the best configurations of s and p for the OLCO$^3$ variants. We set p of Local SGD equal to T of the other delay-tolerant methods.




