BYZANTINE-RESILIENT NON-CONVEX STOCHASTIC GRADIENT DESCENT

Abstract

We study adversary-resilient stochastic distributed optimization, in which m machines can independently compute stochastic gradients and cooperate to jointly optimize over their local objective functions. However, an α fraction of the machines are Byzantine, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging non-convex case. Our main result is a new algorithm, SafeguardSGD, which can provably escape saddle points and find approximate local minima of the non-convex objective. The algorithm is based on a new concentration filtering technique, and its sample and time complexity bounds match the best known theoretical bounds in the stochastic, distributed setting when no Byzantine machines are present. Our algorithm is very practical: it improves upon the performance of all prior methods when training deep neural networks, it is relatively lightweight, and it is the first method to withstand two recently proposed Byzantine attacks.

1. INTRODUCTION

Motivated by the pervasiveness of large-scale distributed machine learning, there has recently been significant interest in providing distributed optimization algorithms with strong fault-tolerance guarantees. In this context, the strongest, most stringent fault model is that of Byzantine faults (Lamport et al., 1982): given m machines, each having access to private data, at most an α fraction of the machines can behave in arbitrary, possibly adversarial ways, with the goal of breaking or slowing down the algorithm. Although extremely harsh, this fault model is the "gold standard" in distributed computing (Lynch, 1996; Lamport et al., 1982; Castro et al., 1999), as algorithms proven correct in this setting are guaranteed to converge under arbitrary system behaviour.

A setting of particular interest in this context has been that of distributed stochastic optimization. Here, the task is to minimize some stochastic function f(x) = E_{s∼D}[f_s(x)] over a distribution D, where f_s(·) can be viewed as the loss function for sample s ∼ D. We assume there are m machines (workers) and an honest master, and an α < 1/2 fraction of the workers may be Byzantine. In each iteration t, each worker has access to a version of the global iterate x_t, which is maintained by the master. The worker can independently sample s ∼ D, compute ∇f_s(x_t), and then synchronously send this stochastic gradient to the master. The master aggregates the workers' messages and sends an updated iterate x_{t+1} to all the workers. Eventually, the master has to output an approximate minimizer of f. Clearly, the above description only applies to honest workers; Byzantine workers may deviate arbitrarily and return adversarial "gradient" vectors to the master in every iteration.

This distributed framework is quite general and well studied. One of the first references in this setting studied distributed PCA and regression (Feng et al., 2014).
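The master-worker protocol above can be made concrete with a small simulation. The sketch below is illustrative only and is not the paper's SafeguardSGD algorithm: honest workers return unbiased stochastic gradients of a toy one-dimensional objective f(x) = E_s[(x − s)²/2] with s ∼ N(3, 1), Byzantine workers return arbitrary values, and the master aggregates with either a plain average (not resilient) or a median (a simple one-dimensional stand-in for the median-style defenses discussed below). All function names and default parameters here are our own illustrative choices.

```python
import random

def honest_gradient(x, mu=3.0, sigma=1.0):
    # Unbiased stochastic gradient of f(x) = E_s[(x - s)^2 / 2]
    # with s ~ N(mu, sigma^2); in expectation it equals f'(x) = x - mu.
    return x - random.gauss(mu, sigma)

def byzantine_gradient(x):
    # A Byzantine worker may send any value whatsoever; here it sends
    # a huge constant aimed at derailing the master's update.
    return 1e6

def median(values):
    # Median aggregation: the result is unaffected by outliers as long
    # as fewer than half of the reported gradients are arbitrary.
    v = sorted(values)
    n = len(v)
    return v[n // 2] if n % 2 == 1 else 0.5 * (v[n // 2 - 1] + v[n // 2])

def sgd_round(x, m=10, alpha=0.2, lr=0.1, aggregate=None):
    # One synchronous round: each of m workers sends a "gradient" to the
    # master, with an alpha fraction of workers Byzantine; the master
    # aggregates the m messages and broadcasts the updated iterate.
    aggregate = aggregate or (lambda g: sum(g) / len(g))  # plain mean
    n_byz = int(alpha * m)
    grads = [byzantine_gradient(x) for _ in range(n_byz)]
    grads += [honest_gradient(x) for _ in range(m - n_byz)]
    return x - lr * aggregate(grads)
```

With α = 0, repeated mean-aggregated rounds approach the minimizer x = 3. With α = 0.2, a single round under mean aggregation jumps by roughly lr · α · 10⁶, while median aggregation keeps the update on the order of an honest gradient. Dimension one is used only to keep the sketch short; the same pattern applies coordinate-wise in dimension d.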
Other early approaches (Blanchard et al., 2017; Chen et al., 2017; Su & Vaidya, 2016a;b; Xie et al., 2018a) relied on defining generalizations of the geometric median. These approaches can withstand up to half of the nodes being malicious, but can have relatively high local computational cost Ω(m²d) (Blanchard et al., 2017; Chen et al., 2017), where m is the number of nodes and d is the problem dimension, and usually have suboptimal sample and iteration complexities. Follow-up work resolved this last issue when the objective f(·) is convex, leading to tight sample

