BYZANTINE-RESILIENT NON-CONVEX STOCHASTIC GRADIENT DESCENT

Abstract

We study adversary-resilient stochastic distributed optimization, in which m machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an α-fraction of the machines are Byzantine, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging non-convex case. Our main result is a new algorithm SafeguardSGD which can provably escape saddle points and find approximate local minima of the non-convex objective. The algorithm is based on a new concentration filtering technique, and its sample and time complexity bounds match the best known theoretical bounds in the stochastic, distributed setting when no Byzantine machines are present. Our algorithm is very practical: it improves upon the performance of all prior methods when training deep neural networks, it is relatively lightweight, and it is the first method to withstand two recently-proposed Byzantine attacks.

1. INTRODUCTION

Motivated by the pervasiveness of large-scale distributed machine learning, there has recently been significant interest in providing distributed optimization algorithms with strong fault-tolerance guarantees. In this context, the strongest, most stringent fault model is that of Byzantine faults (Lamport et al., 1982): given m machines, each having access to private data, at most an α fraction of the machines can behave in arbitrary, possibly adversarial ways, with the goal of breaking or slowing down the algorithm. Although extremely harsh, this fault model is the "gold standard" in distributed computing (Lynch, 1996; Lamport et al., 1982; Castro et al., 1999), as algorithms proven to be correct in this setting are guaranteed to converge under arbitrary system behaviour.

A setting of particular interest in this context has been that of distributed stochastic optimization. Here, the task is to minimize some stochastic function f(x) = E_{s∼D}[f_s(x)] over a distribution D, where f_s(·) can be viewed as the loss function for sample s ∼ D. We assume there are m machines (workers) and an honest master, and an α < 1/2 fraction of the workers may be Byzantine. In each iteration t, each worker has access to a version of the global iterate x_t, which is maintained by the master. The worker can independently sample s ∼ D, compute ∇f_s(x_t), and then synchronously send this stochastic gradient to the master. The master aggregates the workers' messages, and sends an updated iterate x_{t+1} to all the workers. Eventually, the master has to output an approximate minimizer of f. Clearly, the above description only applies to honest workers; Byzantine workers may deviate arbitrarily and return adversarial "gradient" vectors to the master in every iteration.

This distributed framework is quite general and well studied. One of the first references in this setting studied distributed PCA and regression (Feng et al., 2014).
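The protocol described above can be illustrated with a minimal numpy simulation of a single synchronous round. The quadratic objective, the worker count, and the particular attack below are illustrative assumptions, not the paper's setup; the naive averaging master shown here is exactly the aggregation rule that Byzantine-resilient methods must avoid trusting blindly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, alpha, lr = 8, 4, 0.25, 0.1       # workers, dimension, Byzantine fraction, step size
n_byz = int(alpha * m)                  # the first n_byz workers act adversarially
x = rng.normal(size=d)                  # current iterate x_t, maintained by the master

def honest_gradient(x):
    """Unbiased stochastic gradient of the toy objective f(x) = ||x||^2 / 2."""
    return x + rng.normal(scale=0.1, size=x.shape)

# Each worker reports a "gradient"; Byzantine workers may send arbitrary vectors.
grads = [-10.0 * x if i < n_byz else honest_gradient(x) for i in range(m)]

# A naive master simply averages all reports and takes an SGD step -- here the
# two adversarial reports visibly bias the update away from the honest direction.
x_next = x - lr * np.mean(grads, axis=0)
```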
Other early approaches (Blanchard et al., 2017; Chen et al., 2017; Su & Vaidya, 2016a;b; Xie et al., 2018a) relied on defining generalizations of the geometric median. These approaches can withstand up to half of the nodes being malicious, but can have relatively high local computational cost Ω(m²d) (Blanchard et al., 2017; Chen et al., 2017), where m is the number of nodes and d is the problem dimension, and usually have suboptimal sample and iteration complexities. Follow-up work resolved this last issue when the objective f(·) is convex, leading to tight sample complexity bounds. Specifically, Yin et al. (2018) provided bounds for gradient descent-type algorithms, and showed that the bounds are tight when the dimension is constant. Alistarh et al. (2018) provided a stochastic gradient descent (SGD) type algorithm and showed that its sample and time complexities are asymptotically optimal even when the dimension is large.

Non-convex Byzantine-resilient stochastic optimization. In this paper, we focus on the more challenging non-convex setting, and shoot for the strong goal of finding approximate local minima (a.k.a. second-order critical points). In a nutshell, our main result is the following. Fix d to denote the dimension, and let the objective f : R^d → R be Lipschitz smooth and second-order smooth. We have m worker machines, each having access to unbiased, bounded estimators of the gradient of f. Given an initial point x_0, the SafeguardSGD algorithm ensures that, even if at most an α < 1/2 fraction of the machines are Byzantine, after T = Õ((α² + 1/m) · d(f(x_0) − min_x f(x)) / ε⁴) parallel iterations, for at least a constant fraction of the indices t ∈ [T], the following hold: ‖∇f(x_t)‖ ≤ ε and ∇²f(x_t) ⪰ −√ε·I. If the goal is simply ‖∇f(x_t)‖ ≤ ε, then T = Õ((α² + 1/m) · (f(x_0) − min_x f(x)) / ε⁴) iterations suffice. Here, the Õ notation serves to hide logarithmic factors for readability. We spell out these factors in the detailed analysis.
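To make the two regimes of the iteration bound concrete, the sketch below evaluates only the dominant factor (α² + 1/m) of T for illustrative parameter choices; the constants and logarithmic factors hidden by the Õ notation are ignored, and the specific numbers are assumptions for the example.

```python
def parallel_factor(alpha: float, m: int) -> float:
    """Dominant factor (alpha^2 + 1/m) in T = O~((alpha^2 + 1/m) * d * (f(x0) - min f) / eps^4)."""
    return alpha ** 2 + 1.0 / m

m = 100
# alpha < 1/sqrt(m): the 1/m term dominates, giving linear parallel speedup Omega(m).
small = parallel_factor(alpha=0.05, m=m)   # 0.0025 + 0.01 = 0.0125
# alpha in [1/sqrt(m), 1/2): the alpha^2 term dominates, so speedup degrades to ~1/alpha^2.
large = parallel_factor(alpha=0.4, m=m)    # 0.16 + 0.01 = 0.17
```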
• When α < 1/√m, our sample complexity (= mT) matches the best known result in the non-Byzantine case (Jin et al., 2019) without additional assumptions, and enjoys linear parallel speedup: with m workers, of which fewer than √m are Byzantine, the parallel speedup is Ω(m).¹
• For α ∈ [1/√m, 1/2), our parallel time complexity is O(α²) times that needed when no parallelism is used. This still gives parallel speedup. The α² factor also appears in convex Byzantine distributed optimization, where it is tight (Yin et al., 2018; Alistarh et al., 2018).
• The Lipschitz and second-order smoothness assumptions are the minimal assumptions needed to derive convergence rates for finding second-order critical points (Jin et al., 2019).

¹ By parallel speedup we mean the reduction in wall-clock time due to sampling gradients in parallel among the m nodes. In each time step, the algorithm generates m new gradients, although some may be corrupted.

Comparison with prior bounds. The closest known bounds are by Yin et al. (2019), who derived three gradient descent-type algorithms (based on the median, the mean, and iterative filtering) to find a weaker type of approximate local minima. Since it relies on full gradients, their algorithm is arguably less practical, and their time complexities are generally higher than ours (see Section 2.1). Other prior works consider a weaker goal: to find approximate stationary points with ‖∇f(x)‖ ≤ ε only. Bulusu et al. (2020) additionally assumed there is a guaranteed good (i.e., non-Byzantine) worker known to the master; Xie et al. (2018b) gave a practical algorithm for the case where the Byzantine attackers have no information about the loss function or its gradient; Yang et al. (2019), Xie et al. (2018a), and Blanchard et al. (2017) derived eventual convergence without an explicit complexity bound; and the non-convex result obtained in Yin et al. (2018) is subsumed by Yin et al. (2019), discussed above.

Our algorithm and techniques. The structure of our algorithm is deceptively simple. The master node keeps track of the sum of gradients produced by each worker across time. It labels as (allegedly) good those workers whose sum of gradients "concentrates" well around a surrogate of the median vector, and labels the remaining workers as bad. Once a worker is labelled bad, it is removed from consideration forever. The master then performs vanilla SGD, moving in the negative direction of the average gradient produced by the workers currently labelled good. We call our algorithm SafeguardSGD, since it maintains a safe guard that filters away bad workers. Its processing overhead at the master is O(md), negligible compared to standard SGD.

As the astute reader may have guessed, the key non-trivial technical ingredient is to identify the right quantity to check for concentration, and to make it compatible with the task of non-convex optimization. In particular, we manage to construct such quantities so that (1) good, non-Byzantine workers never get mislabelled as bad; (2) Byzantine workers may be labelled as good (which is inevitable), but when they are, the convergence rates are not impacted significantly; and (3) the notion requires no additional assumptions or running-time overhead. The idea of using concentration (for each worker across time) to filter out Byzantine machines
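The safeguard idea described above can be sketched in a few lines of numpy. This is a highly simplified illustration, not the paper's algorithm: the toy quadratic objective, the coordinate-wise median as the "surrogate of the median vector", the √t-shaped threshold, and all constants are assumptions made for the example; the paper's actual concentration filter and its analysis are more involved.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, T, lr, threshold = 8, 4, 50, 0.05, 3.0
n_byz = 2                                # workers 0 and 1 are Byzantine
x = rng.normal(size=d)
good = set(range(m))                     # workers currently labelled good
sums = np.zeros((m, d))                  # running sum of each worker's reported gradients

for t in range(T):
    grads = np.empty((m, d))
    for i in range(m):
        if i < n_byz:
            grads[i] = 5.0 * rng.normal(size=d)           # arbitrary adversarial report
        else:
            grads[i] = x + rng.normal(scale=0.1, size=d)  # noisy gradient of ||x||^2 / 2
    sums += grads
    # Surrogate of the median of the running sums, over workers still labelled good.
    med = np.median(sums[sorted(good)], axis=0)
    # Safeguard: permanently discard any worker whose running sum strays too far.
    good = {i for i in good
            if np.linalg.norm(sums[i] - med) <= threshold * np.sqrt(t + 1)}
    # Vanilla SGD step on the average gradient of the workers labelled good.
    x = x - lr * grads[sorted(good)].mean(axis=0)
```

Because honest workers' gradient sums differ from the median only by accumulated sampling noise, they stay within the growing threshold, while the adversarial workers' sums drift away and are (with high probability) filtered out early and excluded forever, after which the iterates follow plain averaged SGD.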

