SEQUENTIAL GRADIENT CODING FOR STRAGGLER MITIGATION

Abstract

In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients {g(1), g(2), ..., g(J)}, where processing of each gradient g(t) starts in round-t and finishes by round-(t + T). Here T ≥ 0 denotes a delay parameter. For the GC scheme, coding is only across computing nodes and this results in a solution where T = 0. On the other hand, having T > 0 allows for designing schemes which exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and repetition for the remainder of the tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves a significant reduction in the computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the latter scheme can yield a 16% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers.

1. INTRODUCTION

We consider a distributed system consisting of a master and n computing nodes (referred to as workers). We are interested in computing a sequence of gradients g(1), g(2), ..., g(J). Assume for simplicity that there are no dependencies between the gradients (we will relax this assumption later in Sec. 2). If we naively distribute the computation of each gradient among the n workers, the delayed arrival of results from any of the workers becomes a bottleneck. Such a delay could be due to various reasons, such as slower processing at workers, network issues leading to communication delays, etc. Irrespective of the actual reason, we will refer to a worker providing delayed responses to the master as a straggler. In a recent work, Tandon et al. (2017) propose Gradient Coding (GC) to distribute the computation of a single gradient across multiple workers in a straggler-resilient manner. Using (n, s)-GC, the master is able to compute the gradient as soon as (n − s) workers respond (s is an integer such that 0 ≤ s < n). Adapting GC to our setting yields a scheme where g(t) is computed in round-t and, in every round, up to s stragglers are tolerated (the concept of a round is formally introduced in Sec. 2). Experimental results (Yang et al., 2019) have demonstrated that intervals of "bad" rounds (where each round has a large number of stragglers) are followed by "good" rounds (with relatively few stragglers), leading to a natural temporal diversity.

Published as a conference paper at ICLR 2023

Note that GC will enforce a large s, as dictated by the number of stragglers expected in bad rounds, leading to a large computational load per worker. The natural question to ask is: do there exist better coding schemes that exploit this temporal diversity and achieve better performance? In this paper, we present sequential gradient coding schemes which answer this question in the affirmative.
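To make the (n, s)-GC mechanism concrete, the following sketch works through the classical (n, s) = (3, 1) example of Tandon et al. (2017), in which each of 3 workers stores 2 of 3 data chunks and sends a single coded combination of its partial gradients. The numpy demo below is our own illustration, not the paper's implementation; in particular, decoding via least squares is just one convenient way to find the combining coefficients.

```python
import numpy as np

# (n, s) = (3, 1): three partial gradients g1, g2, g3 over three data chunks.
rng = np.random.default_rng(0)
g = rng.normal(size=(3, 4))        # rows: g1, g2, g3 (dimension 4 is arbitrary)

# Encoding matrix B from the classical example: row i is worker-i's combination.
B = np.array([[0.5, 1.0,  0.0],    # worker 0 sends g1/2 + g2
              [0.0, 1.0, -1.0],    # worker 1 sends g2 - g3
              [0.5, 0.0,  1.0]])   # worker 2 sends g1/2 + g3
coded = B @ g                      # one coded message per worker

def decode(coded, survivors):
    """Recover g1 + g2 + g3 from any n - s = 2 surviving workers."""
    # Find a with a^T B[survivors] = (1, 1, 1); it exists for every 2-subset here.
    a, *_ = np.linalg.lstsq(B[survivors].T, np.ones(3), rcond=None)
    return a @ coded[survivors]

full = g.sum(axis=0)
for survivors in ([0, 1], [0, 2], [1, 2]):
    assert np.allclose(decode(coded, survivors), full)  # any 1 straggler tolerated
```

Each worker computes 2 of the 3 partial gradients, so the per-worker computational load is 2/3 of the data set; this is the load that grows with s and that our schemes aim to reduce.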

1.1. SUMMARY OF CONTRIBUTIONS

In contrast to existing GC approaches, where coding is performed only across workers, we propose two coding schemes which explore coding across rounds as well as across workers.

• Our first scheme, namely, Selective-Reattempt-Sequential Gradient Coding (SR-SGC) (Sec. 3.2), is a natural extension of the (n, s)-GC scheme, in which unfinished tasks are selectively reattempted in future rounds. Despite its simplicity, the SR-SGC scheme tolerates a strict superset of the straggler patterns tolerated by (n, s)-GC, for the same computational load.

• In our more involved second scheme, namely, the Multiplexed-Sequential Gradient Coding (M-SGC) scheme (Sec. 3.3), we divide the tasks into two sets: those protected against stragglers via reattempts, and those protected via GC. We then carefully multiplex these tasks to obtain a scheme whose computational load per worker is significantly reduced. In particular, the load decreases by a factor of s when compared to the (n, s)-GC scheme and is close to the information-theoretic limit under certain conditions.

• Our experiments (Sec. 4) on an AWS Lambda cluster involving 256 worker nodes show that the M-SGC scheme achieves a significant reduction in runtime over GC under real-world conditions involving naturally occurring, non-simulated stragglers.
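The reattempt idea underlying the first scheme can be caricatured with a toy round-based scheduler. The sketch below is our own schematic illustration, not the paper's SR-SGC construction: `straggled` is a hypothetical oracle mapping each round to the set of jobs whose attempt fails in that round, and T is the delay budget from the sequential setting (job-t must finish by round-(t + T)).

```python
from collections import deque

def run_rounds(straggled, J, T):
    """Toy sketch (not the paper's scheme): attempt job-t in round-t and
    re-queue any failed job for a later round within the delay budget T."""
    pending = deque()   # jobs whose earlier attempts straggled
    finish = {}         # job -> round in which it finally completed
    for t in range(1, J + T + 1):
        attempts = list(pending) + ([t] if t <= J else [])
        pending.clear()
        for j in attempts:
            if j in straggled.get(t, set()):
                pending.append(j)   # reattempt in a subsequent round
            else:
                finish[j] = t
    # Delay requirement of the sequential setting: job-t done by round-(t + T).
    ok = all(finish.get(j, J + T + 1) <= j + T for j in range(1, J + 1))
    return finish, ok

# Job 2 straggles in round 2 but is recovered in round 3, within T = 1.
finish, ok = run_rounds({2: {2}}, J=3, T=1)
assert finish == {1: 1, 2: 3, 3: 3} and ok
```

The actual SR-SGC and M-SGC schemes additionally code the attempts across workers; the toy above only conveys how the temporal slack T creates room for recovery.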

1.2. RELATED WORK

We provide a brief overview of the use of erasure codes for straggler mitigation in Appendix B. The terminology of "sequential gradient coding" first appears in Krishnan et al. (2021). Both the current paper and the work of Krishnan et al. (2021) extend the classical GC setting (Tandon et al., 2017) by exploiting the temporal dimension. The authors of Krishnan et al. (2021) provide a non-explicit coding scheme that is resilient against communication delays but does not extend to the case where stragglers are due to computational slowdowns. In contrast, we propose two explicit coding schemes that are oblivious to the reason for the straggling behavior. Moreover, under equivalent straggler conditions, the computational load requirements of the newly proposed SR-SGC scheme and the scheme in Krishnan et al. (2021) are identical, whereas the new M-SGC scheme offers a significantly smaller load requirement. The works Ye & Abbe (2018); Kadhe et al. (2020) explore a trade-off between communication and straggler resiliency in GC. Several variations of the classical GC setting appear in Raviv et al. (2020); Halbawi et al. (2018); Wang et al. (2019a;b); Maity et al. (2019). There is a rich literature on distributed matrix multiplication or, more generally, polynomial function computation (see Yang et al. (2019); Yu et al. (2017); Lee et al. (2018); Ramamoorthy et al. (2019); Subramaniam et al. (2019); Yu et al. (2019); Dutta et al. (2020); Yu et al. (2020); Ramamoorthy et al. (2020) and references therein). In Krishnan et al. (2020), the authors introduce a sequential matrix multiplication framework, where coding across the temporal dimension is exploited to reduce the cumulative runtime of a sequence of matrix multiplication jobs. While we also explore the idea of introducing a temporal dimension to obtain improved coding schemes, extending the approach of Krishnan et al. (2020) to our setting yields an inferior solution requiring a large computational load.

2. SEQUENTIAL GRADIENT CODING SETTING

For integers a, b, let [a : b] ≜ {i | a ≤ i ≤ b} and [a : b]* ≜ {i mod n | a ≤ i ≤ b}. For an integer c, we have c + [a : b]* ≜ {c + i | i ∈ [a : b]*}.
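The interval notation can be made concrete in a few lines of Python. The helpers below are our own; the value of n, the parameter s, and the cyclic placement at the end are illustrative assumptions rather than definitions taken from the paper.

```python
# Helpers implementing the index notation: [a : b] = {i | a <= i <= b},
# [a : b]* = {i mod n | a <= i <= b}, and c + [a : b]* = {c + i | i in [a : b]*}.
def interval(a, b):
    return set(range(a, b + 1))

def interval_mod(a, b, n):
    return {i % n for i in range(a, b + 1)}

def shifted(c, a, b, n):
    return {c + i for i in interval_mod(a, b, n)}

n = 5  # illustrative number of workers
assert interval(2, 4) == {2, 3, 4}
assert interval_mod(3, 7, n) == {3, 4, 0, 1, 2}   # wraps around mod n
assert shifted(10, 4, 6, n) == {14, 10, 11}

# A cyclic data placement commonly used in GC (an illustrative choice, not
# necessarily this paper's): with eta = n chunks, worker-i stores [i : i + s]*.
s = 2
placement = {i: interval_mod(i, i + s, n) for i in range(n)}
assert placement[4] == {4, 0, 1}   # wrap-around placement for the last worker
```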
Workers are indexed by [0 : n − 1]. We are interested in computing a sequence of J gradients {g(i)}_{i∈[1:J]}. The computation of gradient g(i) is referred to as job-i. All gradients are computed with respect to a single data set D (this assumption is only for simplicity of exposition; our schemes can easily be adapted to multiple data sets). Data placement: The master partitions D into η data chunks {D_0, D_1, ..., D_{η−1}}, possibly of different sizes. Each worker-i stores the data chunks {D_j}_{j∈D_i}, where the index set D_i ⊆ [0 : η − 1]. Let g_j(t) denote the t-th

