SEQUENTIAL GRADIENT CODING FOR STRAGGLER MITIGATION

Abstract

In distributed computing, slower nodes (stragglers) often become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients {g(1), g(2), . . . , g(J)}, where processing of each gradient g(t) starts in round-t and finishes by round-(t + T ). Here T ≥ 0 denotes a delay parameter. In the GC scheme, coding is applied only across computing nodes, which results in a solution with T = 0. On the other hand, allowing T > 0 makes it possible to design schemes that exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and repetition to the remainder of the tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves a significant reduction in computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the second scheme can yield a 16% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers.

1. INTRODUCTION

We consider a distributed system consisting of a master and n computing nodes (henceforth referred to as workers). We are interested in computing a sequence of gradients g(1), g(2), . . . , g(J). Assume for simplicity that there are no dependencies between the gradients (we will relax this assumption later in Sec. 2). If we naively distribute the computation of each gradient among n workers, the delayed arrival of results from any of the workers becomes a bottleneck. Such a delay could be due to various reasons, such as slower processing at workers, network issues leading to communication delays, etc. Irrespective of the actual reason, we will refer to a worker providing delayed responses to the master as a straggler. In a recent work, Tandon et al. (2017) propose Gradient Coding (GC) to distribute the computation of a single gradient across multiple workers in a straggler-resilient manner. Using (n, s)-GC, the master is able to compute the gradient as soon as (n − s) workers respond (s is an integer such that 0 ≤ s < n). Adapting GC to our setting, we obtain a scheme where g(t) is computed in round-t and, in every round, up to s stragglers are tolerated (the concept of a round will be formally introduced in Sec. 2). Experimental results (Yang et al., 2019) have demonstrated that intervals of "bad" rounds (where each round has a large number of stragglers) are followed by "good" rounds (where there are relatively fewer stragglers), leading to a natural temporal diversity.

* Equal contribution authors.
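To make the (n, s)-GC guarantee concrete, the following toy sketch simulates the fractional-repetition variant of gradient coding from Tandon et al. (2017): the n workers are split into n/(s+1) groups, every worker in a group computes the same sum of s+1 partial gradients, and since at most s workers straggle, any n − s responses contain at least one survivor per group. All function names and the simulation setup below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def make_assignments(n, s):
    # Illustrative assignment: worker w belongs to group w // (s+1); every
    # worker in group g is assigned the same s+1 data partitions.
    assert n % (s + 1) == 0, "fractional repetition needs (s+1) | n"
    return [list(range((w // (s + 1)) * (s + 1), (w // (s + 1) + 1) * (s + 1)))
            for w in range(n)]

def encode(assignments, partial_grads):
    # Each worker sends the sum of the partial gradients it was assigned.
    return [sum(partial_grads[p] for p in parts) for parts in assignments]

def decode(responses, n, s):
    # responses: dict worker_id -> coded message, with at least n - s entries.
    # Any subset of n - s workers leaves >= 1 responder per group of s+1,
    # so summing one survivor per group recovers the full gradient.
    total = 0.0
    for g in range(n // (s + 1)):
        members = range(g * (s + 1), (g + 1) * (s + 1))
        survivor = next(w for w in members if w in responses)
        total = total + responses[survivor]
    return total

# Simulation: n = 6 workers tolerating s = 2 stragglers.
n, s = 6, 2
rng = np.random.default_rng(0)
partials = [rng.standard_normal(4) for _ in range(n)]  # one per data partition
msgs = encode(make_assignments(n, s), partials)
stragglers = {1, 4}                                    # any s workers may be slow
responses = {w: msgs[w] for w in range(n) if w not in stragglers}
g_hat = decode(responses, n, s)                        # recovered full gradient
```

Here the master decodes from only n − s = 4 responses, at the cost of each worker computing s + 1 = 3 partial gradients; this (s+1)-fold computational load is exactly the overhead the paper's temporal schemes aim to reduce.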

