DELAY-TOLERANT LOCAL SGD FOR EFFICIENT DISTRIBUTED TRAINING

Abstract

The heavy communication for model synchronization is a major bottleneck for scaling distributed deep neural network training to many workers. Moreover, model synchronization can suffer from long delays in scenarios such as federated learning and geo-distributed training. Thus, it is crucial that distributed training methods be both delay-tolerant AND communication-efficient. However, existing works cannot simultaneously address the communication delay and the bandwidth constraint. To address this important and challenging problem, we propose a novel training framework, OLCO³, that achieves delay tolerance with a low communication budget by using stale information. OLCO³ introduces novel staleness compensation and compression compensation to combat the influence of staleness and compression error. Theoretical analysis shows that OLCO³ achieves the same sub-linear convergence rate as the vanilla synchronous stochastic gradient descent (SGD) method. Extensive experiments on deep learning tasks verify the effectiveness of OLCO³ and its advantages over existing works.

1. INTRODUCTION

Data-parallel synchronous SGD is currently the workhorse algorithm for large-scale distributed deep learning tasks with many workers (e.g., GPUs), where each worker calculates the stochastic gradient on local data and synchronizes with the other workers in one training iteration (Goyal et al., 2017; You et al., 2017; Huo et al., 2020). However, high communication overheads make it inefficient to train large deep neural networks (DNNs) with a large number of workers. Generally speaking, the communication overheads come in two forms: 1) high communication delay due to an unstable network or a large number of communication hops, and 2) a large communication budget caused by the large size of DNN models combined with limited network bandwidth. Although communication delay is not a prominent problem in the data center environment, it can severely degrade training efficiency in practical scenarios, e.g., when the workers are geo-distributed or placed under different networks (Ethernet, cellular networks, Wi-Fi, etc.) in federated learning (Konečný et al., 2016). Existing works addressing the communication inefficiency of synchronous SGD can be roughly classified into three categories: 1) pipelining (Pipe-SGD (Li et al., 2018)); 2) gradient compression (Aji & Heafield, 2017; Stich et al., 2018; Alistarh et al., 2018; Yu et al., 2018; Vogels et al., 2019); and 3) periodic averaging (also known as Local SGD) (Stich, 2019; Lin et al., 2018a). In pipelining, the model update uses stale information such that the next iteration does not wait for the synchronization of information in the current iteration. As the synchronization barrier is removed, pipelining can overlap computation with communication to achieve delay tolerance. Gradient compression reduces the amount of data transferred in each iteration by condensing the gradient with a compressor C(·).
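To make the role of the compressor C(·) concrete, here is a minimal top-k sparsification sketch in NumPy. This is our own toy illustration (the function name and the returned residual are our choices), not the implementation of any cited method:

```python
import numpy as np

def topk_compress(grad, k):
    """Toy compressor C(.): keep the k largest-magnitude entries of the
    gradient, zero out the rest, and return the residual left behind."""
    idx = np.argsort(np.abs(grad))[-k:]   # indices of the k largest entries
    compressed = np.zeros_like(grad)
    compressed[idx] = grad[idx]
    residual = grad - compressed          # compression error (what was dropped)
    return compressed, residual

grad = np.array([0.1, -2.0, 0.5, 3.0, -0.2])
c, r = topk_compress(grad, k=2)           # only 2 of 5 entries are transmitted
```

Only the k nonzero values and their indices need to be communicated; the dropped residual can be accumulated and added back in later steps, which is the intuition behind error-compensation schemes for compressed communication.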
Representative methods include scalar quantization (Alistarh et al., 2017; Wen et al., 2017; Bernstein et al., 2018), gradient sparsification (Aji & Heafield, 2017; Stich et al., 2018; Alistarh et al., 2018), and vector quantization (Yu et al., 2018; Vogels et al., 2019). Periodic averaging reduces the frequency of communication by synchronizing the workers every p (larger than 1) iterations, and is also shown to be effective for federated learning (McMahan et al., 2017). In summary, existing works handle high communication delay with pipelining, and reduce the communication budget with gradient compression and periodic averaging. However, no existing method addresses both issues simultaneously. It is also unclear how the three communication-efficient techniques introduced above can be used jointly without hurting the convergence of SGD.

Method                          Delay-tolerant  Compressed  p     s
Gradient Compression            ×               √           = 1   = 0
Periodic Averaging (Local SGD)  ×               ×           ≥ 1   = 0
Pipelining (Pipe-SGD)           √               ×           = 1   ≥ 1
CoCoD-SGD                       √               ×           ≥ 1   = 1
OverlapLocalSGD                 √               ×           ≥ 1   = 1
OLCO³ (Ours)                    √               √           ≥ 1   ≥ 1

Under the periodic averaging framework, we use p to denote the number of local SGD iterations per communication round, and s to denote the number of communication rounds for which the information used in the model update has been outdated. Let the computation time of one SGD iteration be T_comput; then the communication can be pipelined with the computation as long as the communication delay is less than sp · T_comput. For simplicity, we define the delay tolerance of a method as T = sp. Local SGD has to use up-to-date information for the model update (s = 0, p ≥ 1, T = sp = 0). CoCoD-SGD and OverlapLocalSGD combine pipelining and periodic averaging by using stale results from the last communication round (s = 1, p ≥ 1, T = sp = p), while our OLCO³ supports arbitrary staleness (s ≥ 1, p ≥ 1, T = sp) and all other features in Table 1.
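The periodic-averaging schedule (p local steps per worker, then one synchronization with up-to-date models, i.e., s = 0) can be sketched on a toy quadratic. The worker objectives, step size, and counts below are our own illustrative choices, not the paper's experimental setup:

```python
import numpy as np

# Minimal local SGD (periodic averaging) sketch: K workers each take p local
# SGD steps on a worker-specific quadratic, then synchronize by averaging
# their models. With s = 0, the average is always computed from fresh models.
def local_sgd(K=4, p=5, rounds=20, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    optima = rng.normal(size=K)        # worker k minimizes 0.5*(x - optima[k])^2
    x = np.zeros(K)                    # each worker's local model
    for _ in range(rounds):
        for _ in range(p):             # p local iterations between syncs
            x = x - eta * (x - optima) # one local gradient step per worker
        x[:] = x.mean()                # communication round: average the models
    return float(x[0]), float(optima.mean())

model, global_opt = local_sgd()        # the iterate approaches the global optimum
```

Communication happens once every p iterations instead of every iteration, which is exactly the budget reduction that periodic averaging trades against consensus error between syncs.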
The main contributions of this paper are summarized as follows:
• We propose the novel OLCO³ method, which achieves extreme communication efficiency by addressing both the high-communication-delay and large-communication-budget issues.
• OLCO³ introduces novel staleness compensation and compression compensation techniques. Convergence analysis shows that OLCO³ achieves the same convergence rate as SGD.
• Extensive experiments on deep learning tasks show that OLCO³ significantly outperforms existing delay-tolerant methods in both communication efficiency and model accuracy.

2. BACKGROUND & RELATED WORKS

SGD and Pipelining. In distributed training, we minimize the global loss function $f(\cdot) = \frac{1}{K}\sum_{k=1}^{K} f_k(\cdot)$, where $f_k(\cdot)$ is the local loss function at worker $k \in [K]$. At iteration $t$, vanilla synchronous SGD updates the model $x_t \in \mathbb{R}^d$ with learning rate $\eta_t$ via $x_{t+1} = x_t - \frac{\eta_t}{K}\sum_{k=1}^{K} \nabla F_k(x_t; \xi_t^{(k)})$, where $\xi_t^{(k)}$ is the stochastic sampling variable and $\nabla F_k(x_t; \xi_t^{(k)})$ is the corresponding stochastic gradient at worker $k$. Throughout this paper, we assume that the stochastic gradient is an unbiased estimator by default, i.e., $\mathbb{E}_{\xi_t^{(k)}} \nabla F_k(x_t; \xi_t^{(k)}) = \nabla f_k(x_t)$.

Pipe-SGD (Li et al., 2018) parallelizes the communication and computation of SGD via pipelining. At iteration $t$, worker $k$ computes the stochastic gradient $\nabla F_k(x_t; \xi_t^{(k)})$ on the current model $x_t$ and communicates to get the averaged stochastic gradient $\frac{1}{K}\sum_{k=1}^{K} \nabla F_k(x_t; \xi_t^{(k)})$. Instead of waiting for the communication to finish, Pipe-SGD concurrently updates the current model with a stale averaged stochastic gradient via $x_{t+1} = x_t - \frac{\eta_t}{K}\sum_{k=1}^{K} \nabla F_k(x_{t-s}; \xi_{t-s}^{(k)})$. Note that Pipe-SGD differs from asynchronous SGD (Ho et al., 2013; Lian et al., 2015), which computes the stochastic gradient using a stale model and does not parallelize a worker's computation and communication. A problem of Pipe-SGD is that its performance deteriorates severely under high communication delay (large $s$).
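The stale-gradient update above can be simulated on a one-dimensional quadratic to see that a moderately delayed gradient still drives the iterates toward the optimum. The objective, step size, and staleness below are our own toy choices, not the paper's experiments:

```python
import numpy as np

# Toy sketch: compare a vanilla SGD update with a Pipe-SGD-style update that
# applies the gradient computed s iterations ago, on f(x) = 0.5 * x^2, whose
# gradient is simply grad f(x) = x.
def run(steps, eta, s):
    x = np.array([1.0])
    history = [x.copy()]                  # past iterates, to look up x_{t-s}
    for t in range(steps):
        x_stale = history[max(t - s, 0)]  # model from s iterations back
        grad = x_stale                    # grad f(x) = x for this toy objective
        x = x - eta * grad                # update with the (possibly stale) gradient
        history.append(x.copy())
    return float(x[0])

fresh = run(steps=50, eta=0.1, s=0)       # vanilla SGD (no staleness)
stale = run(steps=50, eta=0.1, s=2)       # pipelined update with staleness s = 2
```

Both runs approach the optimum at zero for this small step size; with a larger step size or larger s, the stale recurrence can oscillate or diverge, which matches the observation that Pipe-SGD degrades under high communication delay.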

Table 1: Comparison of communication-efficient methods for distributed DNN training. The period p ∈ N+ is the communication interval for periodic averaging. The staleness s ∈ N is the number of communication rounds for which the information used in the model update has been outdated. For all methods in this table, the delay tolerance is T = sp.

In this paper, we propose a novel framework, Overlap Local Computation with Compressed Communication (i.e., OLCO³), to make distributed training both delay-tolerant AND communication-efficient by enabling and improving the combination of the above three communication-efficient techniques. In Table 1, we compare OLCO³ with the aforementioned works and two succeeding state-of-the-art delay-tolerant methods, CoCoD-SGD (Shen et al., 2019) and OverlapLocalSGD (Wang et al., 2020).

