β-STOCHASTIC SIGN SGD: A BYZANTINE RESILIENT AND DIFFERENTIALLY PRIVATE GRADIENT COMPRESSOR FOR FEDERATED LEARNING

Abstract

Federated Learning (FL) is a nascent privacy-preserving learning framework under which the local data of participating clients is kept local throughout model training. Scarce communication resources and data heterogeneity are two defining characteristics of FL. Moreover, an FL system is often implemented in a harsh environment, leaving the clients vulnerable to Byzantine attacks. To the best of our knowledge, no gradient compressor simultaneously achieves quantitative Byzantine resilience and privacy preservation. In this paper, we fill this gap by revisiting stochastic sign SGD Jin et al. (2020). We propose β-stochastic sign SGD, which contains a gradient compressor that encodes a client's gradient information in sign bits subject to the privacy budget β > 0. We show that β-stochastic sign SGD converges in the presence of partial client participation and mobile Byzantine faults, under both static and adaptive adversaries, and that it achieves quantifiable Byzantine resilience and differential privacy simultaneously even with non-IID local data. We show that our compressor works for both bounded and unbounded stochastic gradients, i.e., for both light-tailed and heavy-tailed distributions. As a byproduct, we show that when the clients report sign messages, the popular information aggregation rules simple mean, trimmed mean, median, and majority vote are identical in terms of the output signs. Our theories are corroborated by experiments on the MNIST and CIFAR-10 datasets.



1 INTRODUCTION

Federated Learning (FL) is a nascent learning framework that enables privacy-sensitive clients to collectively train a model without disclosing their raw data McMahan et al. (2017); Kairouz et al. (2021). Expensive communication overhead and non-IID local data are two defining characteristics of FL. A variety of communication-saving techniques have been introduced, including periodic averaging McMahan et al. (2017), large mini-batch sizes Lin et al. (2020), and gradient compressors Xu et al. (2020); Alistarh et al. (2017); Bernstein et al. (2018; 2019); Jin et al. (2020); Safaryan et al. (2021); Wang et al. However, challenges remain. An FL system is often massive in scale and is implemented in a harsh environment, leaving the clients vulnerable to unstructured faults such as Byzantine faults Lynch (1996). Moreover, FL clients are privacy-sensitive. Although clients' privacy is partially preserved by denying access to raw data, quantitative privacy guarantees are still desirable. Observing this, Bernstein et al. (2019) proposed SignSGD with majority vote, which is provably resilient to Byzantine faults. However, even in the absence of Byzantine faults, SignSGD fails to converge in the presence of non-IID data Safaryan & Richtárik (2021); Chen et al. (2020), and is not differentially private. To handle non-IID data, Jin et al. (2020) proposed stochastic sign SGD and its differentially private (DP) variant, whose gradient compressors are simple yet elegant. Unfortunately, their DP variant does not converge (their Theorem 6 analysis contains major flaws), and their standard stochastic sign SGD is not differentially private (shown in our Theorem 1). We discuss the relation between Jin et al. (2020) and our work in the related work.

Contributions. In this paper, we revisit the elegant compressor in Jin et al. (2020). We propose β-stochastic sign SGD, which contains a gradient compressor that encodes a client's gradient information in sign bits subject to the privacy budget β > 0, and works for unbounded and mini-batch stochastic gradients. A parameter B > 0 is chosen carefully to clip the unbounded gradients.

• We first show (in Theorem 1) that when β = 0, the compressor is not differentially private. In sharp contrast, when β > 0, the compressor is d · log((2B + β)/β)-differentially private, where d is the gradient dimension. We provide a finer characterization of the differential privacy preservation (in Theorem 2 and Corollary 1). In addition, to help the readers interpret our DP guarantee, we show (in Proposition 2) that our compressor with β > 0 can be viewed as a composition of a randomized sign flipping and the stochastic sign SGD compressor. To the best of our knowledge, this is the first result to establish DP with signed compressors in FL.

• We show (in Theorem 4) that β-stochastic sign SGD works for both bounded and unbounded stochastic gradients. Specifically, convergence bounds are derived for both light-tailed and heavy-tailed stochastic gradients. In addition, we show (in Theorem 4) the convergence of β-stochastic sign SGD in the presence of partial client participation and mobile Byzantine faults, establishing that it achieves Byzantine resilience and DP simultaneously. Both static and adaptive adversaries are considered.

• As a byproduct, we show (in Proposition 1) that when the clients report sign messages, the popular information aggregation rules simple mean, trimmed mean, median, and majority vote are identical in terms of the output signs. This implies that majority vote is a counterpart of "middle-seeking" Byzantine-resilient algorithms in the realm of sign aggregations.

Our theoretical findings are validated with experiments on the MNIST and CIFAR-10 datasets.

2 RELATED WORK

Communication Efficiency. Communication is a scarce resource in FL McMahan et al. (2017); Kairouz et al. (2021). Numerous efforts have been made to improve the provable communication efficiency of FL. FedAvg, the most widely adopted FL algorithm, and its variants save communication by performing multiple local updates at the client side McMahan et al. (2017); Wang & Joshi (2019); Stich (2019); Li et al. (2020a). A large mini-batch size is another communication-saving technique, yet its performance often turns out to be inferior to FedAvg Lin et al. (2020). Gradient compressors Xu et al. (2020) take the physical layer of communication into account and are used to reduce the number of bits used in encoding local gradient information. Quantized SGD (QSGD) Alistarh et al. (2017) is a lossy compressor with a provable trade-off between the number of bits communicated per iteration and the variance added to the process. However, its performance is shown to be inferior to simple compressors such as SignSGD Bernstein et al. (2019), which, based on the sign, compresses each coordinate of a local gradient into a single bit. Nevertheless, SignSGD fails to converge in the presence of non-IID data Safaryan & Richtárik (2021); Chen et al. (2020), and is not differentially private. This is because SignSGD neglects the information contained in the gradient magnitude.

Byzantine Resilience. Despite its popularity, FedAvg is vulnerable to Byzantine attacks on the participating clients Kairouz et al. (2021); Blanchard et al. (2017); Chen et al. (2017). This is because under FedAvg the PS aggregates the local gradients via simple averaging. Alternative aggregation rules such as Krum Blanchard et al. (2017), geometric median Chen et al. (2017), and coordinate-wise median and trimmed mean Yin et al. (2018) are shown to be resilient to Byzantine attacks, though they differ in the level of resilience with respect to the number of Byzantine faults, the model complexity, and the underlying data statistics in the presence of IID local data. Assuming the PS can access sufficiently many freshly drawn data samples in each iteration, Xie et al. (2019) proposed an algorithm, Zeno, that can tolerate more than a 1/2 fraction of the clients being Byzantine. Unfortunately, their analysis is restricted to homogeneous and balanced local data using techniques from robust statistics, and it is not straightforward to extend the results to non-IID data, owing to the difficulty of distinguishing statistical heterogeneity from Byzantine attacks Li et al. (2019). Many efforts have been devoted to mitigating the negative impacts stemming from heterogeneous data Ghosh et al. (2019); Karimireddy et al. (2022). Ghosh et al. (2019) used robust clustering techniques whose correctness crucially relies on large local datasets and local cost functions being strongly convex. Karimireddy et al. (2022) derived
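To make the compressor concrete, the following is a minimal sketch, not the paper's exact definition: it assumes each gradient coordinate is clipped to [-B, B] and mapped to +1 with probability (B + β + g)/(2(B + β)), a form consistent with the stated d · log((2B + β)/β) privacy budget (the worst-case per-coordinate probability ratio is then exactly (2B + β)/β). The function names and parameter defaults are illustrative only.

```python
import numpy as np

def beta_stochastic_sign(grad, B=1.0, beta=0.1, rng=None):
    """Hypothetical sketch of a beta-stochastic sign compressor.

    Each coordinate is clipped to [-B, B] and mapped to +1 with
    probability (B + beta + g) / (2 * (B + beta)), else to -1.
    With beta = 0 the output probability can hit 0 or 1, so no finite
    privacy budget exists; with beta > 0 the probability ratio between
    any two inputs is at most (2B + beta) / beta per coordinate.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.clip(np.asarray(grad, dtype=float), -B, B)
    p_plus = (B + beta + g) / (2.0 * (B + beta))
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

def privacy_budget(d, B=1.0, beta=0.1):
    """Privacy budget d * log((2B + beta) / beta) for a d-dim gradient."""
    return d * np.log((2 * B + beta) / beta)
```

Note how the budget degrades as β → 0 (the log term blows up) and improves toward perfect privacy as β grows (each output sign approaches a fair coin flip, carrying no gradient information).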
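Proposition 1's equivalence of aggregation rules on sign messages is easy to check numerically. The snippet below is a small illustration (not the paper's proof), assuming clients report coordinate-wise signs in {-1, +1} and the number of clients is odd so that no rule ties at zero:

```python
import numpy as np

def aggregate_signs(S, trim=1):
    """Aggregate a (clients x d) matrix of +/-1 sign messages with four
    popular rules, returning the sign of each rule's output."""
    mean_sign = np.sign(S.mean(axis=0))
    median_sign = np.sign(np.median(S, axis=0))
    # Coordinate-wise trimmed mean: drop the `trim` largest and smallest
    # entries per coordinate before averaging.
    T = np.sort(S, axis=0)[trim:-trim] if trim > 0 else S
    trimmed_sign = np.sign(T.mean(axis=0))
    majority = np.sign(S.sum(axis=0))  # majority vote
    return mean_sign, trimmed_sign, median_sign, majority

rng = np.random.default_rng(0)
S = rng.choice([-1.0, 1.0], size=(9, 5))  # 9 clients, 5 coordinates
m, t, med, maj = aggregate_signs(S, trim=2)
assert (m == maj).all() and (t == maj).all() and (med == maj).all()
```

Intuitively, when every message is ±1, each of these rules reduces to asking which sign holds the majority in each coordinate, which is why majority vote plays the role of the "middle-seeking" robust aggregators in the sign setting.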

