DELAY-TOLERANT LOCAL SGD FOR EFFICIENT DISTRIBUTED TRAINING

Abstract

The heavy communication required for model synchronization is a major bottleneck for scaling distributed deep neural network training to many workers. Moreover, model synchronization can suffer from long delays in scenarios such as federated learning and geo-distributed training. It is therefore crucial that distributed training methods be both delay-tolerant and communication-efficient. However, existing works cannot simultaneously address communication delay and the bandwidth constraint. To address this important and challenging problem, we propose a novel training framework, OLCO³, that achieves delay tolerance with a low communication budget by using stale information. OLCO³ introduces novel staleness compensation and compression compensation to combat the influence of staleness and compression error. Theoretical analysis shows that OLCO³ achieves the same sub-linear convergence rate as the vanilla synchronous stochastic gradient descent (SGD) method. Extensive experiments on deep learning tasks verify the effectiveness of OLCO³ and its advantages over existing works.

1. INTRODUCTION

Data-parallel synchronous SGD is currently the workhorse algorithm for large-scale distributed deep learning tasks with many workers (e.g. GPUs), where each worker calculates the stochastic gradient on local data and synchronizes with the other workers in each training iteration (Goyal et al., 2017; You et al., 2017; Huo et al., 2020). However, high communication overheads make it inefficient to train large deep neural networks (DNNs) with a large number of workers. Generally speaking, the communication overheads come in two forms: 1) high communication delay due to an unstable network or a large number of communication hops, and 2) a large communication budget caused by the large size of the DNN models combined with limited network bandwidth. Although communication delay is not a prominent problem in the data center environment, it can severely degrade training efficiency in practical scenarios, e.g. when the workers are geo-distributed or placed under different networks (Ethernet, cellular networks, Wi-Fi, etc.) in federated learning (Konečný et al., 2016). Existing works that address the communication inefficiency of synchronous SGD can be roughly classified into three categories: 1) pipelining (Pipe-SGD (Li et al., 2018)); 2) gradient compression (Aji & Heafield, 2017; Stich et al., 2018; Alistarh et al., 2018; Yu et al., 2018; Vogels et al., 2019); and 3) periodic averaging (also known as Local SGD) (Stich, 2019; Lin et al., 2018a). In pipelining, the model update uses stale information so that the next iteration does not wait for the synchronization of information from the current iteration. As the synchronization barrier is removed, pipelining can overlap computation with communication to achieve delay tolerance. Gradient compression reduces the amount of data transferred in each iteration by condensing the gradient with a compressor C(·).
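As an illustration of such a compressor C(·), the following is a minimal sketch of top-k sparsification (one of the representative methods cited below); the function name and the choice of k are ours, not from the paper:

```python
import numpy as np

def topk_compress(grad: np.ndarray, k: int) -> np.ndarray:
    """Illustrative C(grad): keep only the k largest-magnitude entries.

    The zeroed entries need not be transmitted, so the payload shrinks
    from d floats to k (index, value) pairs.
    """
    out = np.zeros_like(grad)
    idx = np.argsort(np.abs(grad))[-k:]  # indices of the k largest |g_i|
    out[idx] = grad[idx]
    return out

g = np.array([0.1, -2.0, 0.05, 3.0, -0.2])
print(topk_compress(g, 2))  # keeps -2.0 and 3.0, zeros the rest
```

The difference g - C(g) is the compression error; methods with error feedback (e.g. Stich et al., 2018) accumulate it locally and add it back to future gradients.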
Representative methods include scalar quantization (Alistarh et al., 2017; Wen et al., 2017; Bernstein et al., 2018), gradient sparsification (Aji & Heafield, 2017; Stich et al., 2018; Alistarh et al., 2018), and vector quantization (Yu et al., 2018; Vogels et al., 2019). Periodic averaging reduces the frequency of communication by synchronizing the workers only once every p (p > 1) iterations. Periodic averaging has also been shown to be effective for federated learning (McMahan et al., 2017). In summary, existing works handle high communication delay with pipelining, and use gradient compression and periodic averaging to reduce the communication budget. However, no existing method addresses both. It is also unclear how the three communication-efficient techniques introduced above can be used jointly without hurting the convergence of SGD.
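The periodic-averaging scheme can be sketched as follows on a toy per-worker quadratic loss; the function and loss are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def local_sgd(x0, grad_fn, num_workers=4, p=5, rounds=10, lr=0.1):
    """Sketch of Local SGD: p local steps per worker, then one averaging step.

    grad_fn(i, x) returns worker i's stochastic gradient at model x.
    Communication happens once per round instead of once per iteration,
    cutting the synchronization frequency by a factor of p.
    """
    models = [np.array(x0, dtype=float) for _ in range(num_workers)]
    for _ in range(rounds):
        for i in range(num_workers):
            for _ in range(p):                 # p local steps, no communication
                models[i] = models[i] - lr * grad_fn(i, models[i])
        avg = sum(models) / num_workers        # single synchronization per round
        models = [avg.copy() for _ in range(num_workers)]
    return models[0]

# Toy example: worker i minimizes (x - i)^2 / 2, so its local optimum is i.
# The averaged model converges toward the mean of the per-worker optima (1.5).
sol = local_sgd(x0=[0.0], grad_fn=lambda i, x: x - i)
print(sol)
```

With p = 1 this reduces to fully synchronous SGD; larger p trades communication for a larger deviation between local models within each round.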

