BYZANTINE-ROBUST DECENTRALIZED LEARNING VIA CLIPPEDGOSSIP

Abstract

In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus and benefit from collaborative training. To address these issues, we propose the CLIPPEDGOSSIP algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to an O(δ_max ζ²/γ²) neighborhood of a stationary point for non-convex objectives under standard assumptions. Finally, we demonstrate the encouraging empirical performance of CLIPPEDGOSSIP under a large number of attacks.

1. INTRODUCTION

"Divide et impera". Distributed training arises as an important topic due to privacy constraints of decentralized data storage (McMahan et al., 2017; Kairouz et al., 2019). As the server-worker paradigm suffers from a single point of failure, there is a growing body of work on training in the absence of a server (Lian et al., 2017; Nedic, 2020; Koloskova et al., 2020b). We are particularly interested in decentralized scenarios where direct communication may be unavailable due to physical constraints. For example, devices in a sensor network can only communicate with devices within short physical distance. Failures, from malfunctioning or even malicious participants, are ubiquitous in all kinds of distributed computing. A Byzantine adversarial worker can deviate from the prescribed algorithm, send arbitrary messages, and is assumed to have knowledge of the whole system (Lamport et al., 2019). This means that Byzantine workers not only collude, but also know the data, algorithms, and models of all regular workers. However, they cannot directly modify the states of regular workers, nor compromise messages sent between two connected regular workers.

Defending against Byzantine attacks in a communication-constrained graph is challenging. As secure broadcast protocols are no longer available (Pease et al., 1980; Dolev & Strong, 1983; Hirt & Raykov, 2014), regular workers can only utilize information from their own neighbors, who may have heterogeneous data distributions or be malicious, making it very difficult to reach global consensus. While some works attempt to solve this problem (Su & Vaidya, 2016a; Sundaram & Gharesifard, 2018), their strategies suffer from serious drawbacks: 1) they require regular workers to be very densely connected; 2) they only show asymptotic convergence, or give no convergence proof; 3) there is no evidence whether their algorithms are better than training alone.
In this work, we study Byzantine-robust decentralized training on a constrained topology and address the aforementioned issues. The main contributions of our paper are summarized as follows:

• We identify a novel network robustness criterion, characterized in terms of the spectral gap of the topology (γ) and the number of attackers (δ), for consensus and decentralized training, applying to a much broader spectrum of graphs than (Su & Vaidya, 2016a; Sundaram & Gharesifard, 2018).

• We propose CLIPPEDGOSSIP as the defense strategy and provide, for the first time, precise rates of robust convergence to an O(δ_max ζ²/γ²) neighborhood of a stationary point for stochastic objectives under standard assumptions.¹ We also empirically demonstrate the advantages of CLIPPEDGOSSIP over previous works.

• Along the way, we also obtain the fastest convergence rates for standard non-robust (Byzantine-free) decentralized stochastic non-convex optimization by using local worker momentum.
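The core aggregation step of CLIPPEDGOSSIP can be illustrated with a short sketch: each worker moves toward its neighbors' models, but clips each difference x_j − x_i to a radius τ before averaging, so the influence of any single (possibly Byzantine) neighbor is bounded. This is a minimal sketch under assumed simplifications; in particular the fixed clipping radius `tau` and the helper names below are illustrative, not the paper's exact adaptive choice.

```python
import math

def clip(z, tau):
    """Scale vector z so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= tau:
        return list(z)
    return [tau / norm * v for v in z]

def clipped_gossip(x_i, neighbor_models, weights, tau):
    """One CLIPPEDGOSSIP step at worker i (illustrative sketch).

    Instead of averaging neighbor models directly, worker i averages the
    *clipped* differences x_j - x_i, so a Byzantine neighbor j can move
    x_i by at most weights[j] * tau, no matter what it sends.
    """
    update = [0.0] * len(x_i)
    for j, x_j in neighbor_models.items():
        diff = clip([a - b for a, b in zip(x_j, x_i)], tau)
        for k in range(len(x_i)):
            update[k] += weights[j] * diff[k]
    return [a + b for a, b in zip(x_i, update)]
```

Note how an honest neighbor at distance 2 and a Byzantine neighbor sending a model with entries of size 10⁶ contribute equally after clipping with τ = 1: both are scaled down to unit-norm directions before the weighted average.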

2. RELATED WORK

Recently there have been extensive works on Byzantine-resilient distributed learning with a trustworthy server. The statistics-based robust aggregation methods cover a wide spectrum of works including median (Chen et al., 2017; Blanchard et al., 2017; Yin et al., 2018; Mhamdi et al., 2018; Xie et al., 2018; Yin et al., 2019), geometric median (Pillutla et al., 2019), signSGD (Bernstein et al., 2019; Li et al., 2019; yong Sohn et al., 2020), clipping (Karimireddy et al., 2021a;b), and concentration filtering (Alistarh et al., 2018; Allen-Zhu et al., 2020; Data & Diggavi, 2021). Other works explore special settings where the server owns the entire training dataset (Xie et al., 2020a; Regatti et al., 2020; Su & Vaidya, 2016b; Chen et al., 2018; Rajput et al., 2019; Gupta et al., 2021). The state-of-the-art attacks take advantage of the variance of good gradients and accumulate bias over time (Baruch et al., 2019; Xie et al., 2019). A few strategies have been proposed to provably defend against such attacks, including momentum (Karimireddy et al., 2021a; El Mhamdi et al., 2021) and concentration filtering (Allen-Zhu et al., 2021), whose rates have an optimal leading term σ²/(nε²). In this paper we improve this rate to O(σ²/(nε²) + σ^(2/3)/(γ^(2/3) ε^(4/3))) using local momentum.

Decentralized machine learning with certified Byzantine-robustness is less studied. When communication is unconstrained, there exist secure broadcast protocols that guarantee all regular workers have identical copies of each other's updates (Gorbunov et al., 2021; El-Mhamdi et al., 2021). We are interested in a more challenging scenario where not all workers have direct communication links. In this case, regular workers may behave very differently depending on their neighbors in the topology. One line of work constructs a Public-Key Infrastructure (PKI) so that the message from each worker can be authenticated using digital signatures. However, this is very inefficient, requiring quadratic communication (Abraham et al., 2020). Further, it also requires every worker to have a globally unique identifier that is known to every other worker. This assumption is rendered impossible on general communication graphs, motivating our work to explicitly address the graph topology in decentralized training. Sybil attacks are an important orthogonal issue where a single Byzantine node can create innumerable "fake nodes" overwhelming the network (cf. the recent overview by Ford (2021)). Truly decentralized solutions to this are challenging and sometimes rely on heavy machinery, e.g. blockchains (Poupko et al., 2021) or Proof-of-Personhood (Borge et al., 2017).

More related to the approaches we study, Su & Vaidya (2016a); Sundaram & Gharesifard (2018); Yang & Bajwa (2019b;a) use trimmed mean at each worker to aggregate the models of its neighbors. This approach only works when all regular workers have an honest majority among their neighbors and are densely connected. Guo et al. (2021) evaluate the incoming models of a good worker with its local samples and only keep the well-performing models for its local update step. However, this method only works for IID data. Peng & Ling (2020) reformulate the original problem by adding TV-regularization and propose a GossipSGD-type algorithm which works for strongly convex and non-IID objectives. However, its convergence guarantees are inferior to those of non-parallel SGD. In this work, we address all of the above issues and are able to provably relate the communication graph (spectral gap) with the fraction of Byzantine workers. Besides, most works do not consider attacks that exploit the communication topology, except (Peng & Ling, 2020), who propose the zero-sum attack. We defer detailed comparisons and more related works to § F.
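For concreteness, the trimmed-mean rule used in this line of work can be sketched as follows: for each coordinate, a worker discards the b largest and b smallest values among the received models and averages the rest. The function name and the parameter b (an assumed bound on the number of Byzantine neighbors) are illustrative; this is a generic sketch, not any particular paper's exact rule.

```python
def trimmed_mean(models, b):
    """Coordinate-wise trimmed mean of a list of model vectors.

    Per coordinate, sort the n received values, drop the b smallest and
    b largest, and average the remaining n - 2b values. Requires n > 2b,
    which is why these defenses need an honest majority among neighbors.
    """
    n, d = len(models), len(models[0])
    assert n > 2 * b, "need more than 2b models to trim b from each side"
    out = []
    for k in range(d):
        vals = sorted(m[k] for m in models)
        kept = vals[b : n - b]
        out.append(sum(kept) / len(kept))
    return out
```

With b = 1, a single outlier model with coordinates of size 100 is simply discarded in every coordinate, which is the intuition behind requiring each worker to have more regular than Byzantine neighbors.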

3.1. DECENTRALIZED THREAT MODEL

Consider an undirected graph G = (V, E) where V = {1, . . . , n} denotes the set of workers and E denotes the set of edges. Let N_i ⊂ V be the neighbors of node i and N̄_i := N_i ∪ {i}. In addition, we assume there are no self-loops and that the system is synchronous. Let V_B ⊂ V be the set of Byzantine workers, with b = |V_B|, and let V_R := V \ V_B be the set of regular (non-Byzantine) workers. Let G_R be the subgraph of G induced by the regular nodes V_R, i.e., G with all Byzantine nodes and their incident edges removed. If the reduced graph G_R is disconnected, then there exist two regular workers who cannot reliably exchange information. In this setting, training on the combined data of all the good workers is impossible. Hence, we make the following necessary assumption.
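The connectivity condition on G_R can be checked directly: remove the Byzantine nodes and all their incident edges, then test whether the remaining graph is connected, e.g. by breadth-first search. The sketch below uses hypothetical helper names and simply operationalizes the definition above.

```python
from collections import deque

def regular_subgraph_connected(n, edges, byzantine):
    """Check whether the subgraph G_R induced by regular workers is connected.

    n         -- workers are labeled 0 .. n-1
    edges     -- iterable of undirected edges (i, j)
    byzantine -- set of Byzantine worker ids (V_B)
    """
    regular = set(range(n)) - set(byzantine)
    if not regular:
        return False
    adj = {v: [] for v in regular}
    for i, j in edges:
        # Keep only edges between two regular workers; edges incident to
        # a Byzantine node are removed along with the node itself.
        if i in regular and j in regular:
            adj[i].append(j)
            adj[j].append(i)
    # BFS from an arbitrary regular worker.
    start = next(iter(regular))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen == regular
```

For example, on a path 0-1-2-3, making worker 1 Byzantine disconnects worker 0 from workers 2 and 3, violating the assumption, while making the endpoint 3 Byzantine leaves G_R connected.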



¹ In a previous version, we referred to CLIPPEDGOSSIP as self-centered clipping.




