β-STOCHASTIC SIGN SGD: A BYZANTINE RESILIENT AND DIFFERENTIALLY PRIVATE GRADIENT COMPRESSOR FOR FEDERATED LEARNING

Abstract

Federated Learning (FL) is a nascent privacy-preserving learning framework under which the local data of participating clients is kept local throughout model training. Scarce communication resources and data heterogeneity are two defining characteristics of FL. Moreover, an FL system is often implemented in a harsh environment, leaving the clients vulnerable to Byzantine attacks. To the best of our knowledge, no existing gradient compressor simultaneously achieves quantitative Byzantine resilience and privacy preservation. In this paper, we fill this gap by revisiting stochastic sign SGD Jin et al. (2020). We propose β-stochastic sign SGD, which contains a gradient compressor that encodes a client's gradient information in sign bits subject to the privacy budget β > 0. We show that β-stochastic sign SGD converges in the presence of partial client participation and mobile Byzantine faults (both static and adaptive), and that it achieves quantifiable Byzantine resilience and differential privacy simultaneously, even with non-IID local data. We show that our compressor works for both bounded and unbounded stochastic gradients, i.e., for both light-tailed and heavy-tailed distributions. As a byproduct, we show that when the clients report sign messages, the popular information aggregation rules simple mean, trimmed mean, median, and majority vote are identical in terms of the output signs. Our theory is corroborated by experiments on the MNIST and CIFAR-10 datasets.

1. INTRODUCTION

Federated Learning (FL) is a nascent learning framework that enables privacy-sensitive clients to collectively train a model without disclosing their raw data McMahan et al. (2017); Kairouz et al. (2021). Expensive communication overhead and non-IID local data are two defining characteristics of FL. A variety of communication-saving techniques have been introduced, including periodic averaging McMahan et al. (2017), large mini-batch sizes Lin et al. (2020), and gradient compressors Xu et al. (2020); Alistarh et al. (2017); Bernstein et al. (2018; 2019); Jin et al. (2020); Safaryan et al. (2021); Wang et al. (2021). However, challenges remain. An FL system is often massive in scale and implemented in a harsh environment, leaving the clients vulnerable to unstructured faults such as Byzantine faults Lynch (1996). Moreover, FL clients are privacy-sensitive. Although denying raw-data access partially preserves clients' privacy, quantitative privacy preservation is still desirable. Observing this, Bernstein et al. (2019) proposed signSGD with majority vote, which is provably resilient to Byzantine faults. However, even in the absence of Byzantine faults, signSGD fails to converge in the presence of non-IID data Safaryan & Richtárik (2021); Chen et al. (2020), and is not differentially private. To handle non-IID data, Jin et al. (2020) proposed stochastic sign SGD and its differentially private (DP) variant, whose gradient compressors are simple yet elegant. Unfortunately, their DP variant does not converge, and their standard stochastic sign SGD is not differentially private (shown in our Theorem 1). We discuss the relation between Jin et al. (2020) and our work in the related work. Contributions. In this paper, we revisit the elegant compressor in Jin et al. (2020).
We propose β-stochastic sign SGD, which contains a gradient compressor that encodes a client's gradient information in sign bits subject to the privacy budget β > 0, and works for unbounded and mini-batch stochastic gradients. A parameter B > 0 is chosen carefully to clip the unbounded gradients.
• We first show (in Theorem 1) that when β = 0, the compressor is not differentially private. In sharp contrast, when β > 0, the compressor is d · log((2B + β)/β)-differentially private, where d is the gradient dimension. We provide a finer characterization of the differential privacy preservation (in Theorem 2 and Corollary 1). In addition, to help the reader interpret our DP guarantee, we show (in Proposition 2) that our compressor with β > 0 can be viewed as a composition of a randomized sign flip and the stochastic sign SGD compressor. To the best of our knowledge, this is the first result to establish DP with signed compressors in FL.
• We show (in Theorem 4) that β-stochastic sign SGD works for both bounded and unbounded stochastic gradients. Specifically, convergence bounds are derived for both light-tailed and heavy-tailed stochastic gradients. In addition, we show (in Theorem 4) the convergence of β-stochastic sign SGD in the presence of partial client participation and mobile Byzantine faults, establishing that it achieves Byzantine resilience and DP simultaneously. Both static and adaptive adversaries are considered.
• As a byproduct, we show (in Proposition 1) that when the clients report sign messages, the popular information aggregation rules simple mean, trimmed mean, median, and majority vote are identical in terms of the output signs. This implies that majority vote is the counterpart of "middle-seeking" Byzantine-resilient algorithms in the realm of sign aggregation.
• Our theoretical findings are validated with experiments on the MNIST and CIFAR-10 datasets.
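To make the compressor concrete, the following minimal sketch implements the per-coordinate rule used throughout the paper, P{output_i = +1} = (B + β + clip{g_i, B})/(2(B + β)) (the probability appearing in the proof of Proposition 2); the function name and the NumPy realization are our own.

```python
import numpy as np

def beta_stochastic_sign(g, B, beta, rng=None):
    """Compress a gradient vector into {-1, +1}^d.

    Each coordinate is clipped to [-B, B] and mapped to +1 with
    probability (B + beta + clip(g_i, B)) / (2 * (B + beta)), so that
    beta > 0 keeps both signs possible even for saturated coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.clip(np.asarray(g, dtype=float), -B, B)
    p_plus = (B + beta + g) / (2.0 * (B + beta))
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)
```

Each output coordinate has expectation clip{g_i, B}/(B + β), so the sign bits remain an unbiased (up to the β-rescaling) encoding of the clipped gradient.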

2. RELATED WORK

Communication Efficiency. Communication is a scarce resource in FL McMahan et al. (2017); Kairouz et al. (2021). Numerous efforts have been made to improve the provable communication efficiency of FL. FedAvg, the most widely adopted FL algorithm, and its variants save communication by performing multiple local updates at the client side McMahan et al. (2017); Wang & Joshi (2019); Stich (2019); Li et al. (2020a). A large mini-batch size is another communication-saving technique, yet its performance often turns out to be inferior to FedAvg Lin et al. (2020). Gradient compressors Xu et al. (2020) take the physical layer of communication into account and reduce the number of bits used to encode local gradient information. Quantized SGD (QSGD) Alistarh et al. (2017) is a lossy compressor with a provable trade-off between the number of bits communicated per iteration and the variance added to the process. However, its performance is shown to be inferior to simpler compressors such as signSGD Bernstein et al. (2019), which compresses each coordinate of a local gradient into a single sign bit. Nevertheless, signSGD fails to converge in the presence of non-IID data Safaryan & Richtárik (2021); Chen et al. (2020), and is not differentially private. This is because signSGD neglects the information contained in the gradient magnitude. Byzantine Resilience. Despite its popularity, FedAvg is vulnerable to Byzantine attacks on the participating clients Kairouz et al. (2021); Blanchard et al. (2017); Chen et al. (2017). This is because under FedAvg the PS aggregates the local gradients via simple averaging. Alternative aggregation rules such as Krum Blanchard et al. (2017), geometric median Chen et al. (2017), and coordinate-wise median and trimmed mean Yin et al.
(2018) are shown to be resilient to Byzantine attacks, though they differ in the level of resilience they provide with respect to the number of Byzantine faults, the model complexity, and the underlying data statistics, assuming IID local data. Assuming the PS can access sufficiently many freshly drawn data samples in each iteration, Xie et al. (2019) proposed an algorithm, Zeno, that can tolerate more than a 1/2 fraction of Byzantine clients. Unfortunately, their analysis is restricted to homogeneous and balanced local data using techniques from robust statistics, and it is not straightforward to extend the results to non-IID data, which stems from the difficulty of distinguishing statistical heterogeneity from Byzantine attacks Li et al. (2019). Privacy Preservation. FL algorithms typically share model weights or gradients McMahan et al. (2017) between the clients and the PS. However, recent results show that both weight- and gradient-sharing schemes may leak sensitive information Zhu et al. (2019); Phong et al. (2018). Two notions of differential privacy exist in FL Truex et al. (2020): (A) central privacy, where a trusted server masks the data and shares the perturbed updates with the distributed clients, and (B) local privacy, where each client protects its data from any external party, including the server. Most of the current works perturb the updates by adding Laplace Huang et al. (2015), Gaussian Wang et al. (2021), or Binomial noise Agarwal et al. (2018). The former two are applied to traditional gradients/updates, while the latter is applied to compressed gradients/updates whose values are represented in bits. Focusing on the distributed mean estimation problem, Agarwal et al. (2018) proposed a communication-efficient algorithm that achieves the same privacy-error trade-off as the Gaussian mechanism, provided that the model dimension is at most comparable to the client population size. In this work, we focus on protecting local privacy. Comparison with Jin et al. (2020).
While we acknowledge that β-stochastic sign SGD and the sto-sign compressor in Jin et al. (2020) are structurally alike, our results depart in significant ways. Their theoretical guarantees are flawed, as no explicit forms are given for the residual terms or the Byzantine resilience. They consider only full-batch gradients, which may be impractical, as edge clients in FL often have limited computing power and storage McMahan et al. (2017). Moreover, no partial client participation is considered. In sharp contrast, our work contains a quantitative characterization of the interactions among FL system hyperparameters such as the client number M, the mini-batch size n, and the sampling rate p. We show that our algorithm converges in the presence of both static and adaptive mobile Byzantine adversaries. We build our β-compressor on mini-batch stochastic gradients and derive convergence bounds under light-tailed and heavy-tailed noise using a variety of concentration bounds, which is technically non-trivial. We also show that our compressor works under partial client participation. We reserve a point-by-point comparison for Appendix A.

3. PROBLEM SETUP

We consider a cross-device FL system in which a large number of clients are involved Kairouz et al. (2021). The system consists of one parameter server (PS) and M clients that collaborate to solve

min_{w∈R^d} F(w) := (1/M) Σ_{m=1}^{M} f_m(w),

where f_m(w) := E_{D_m}[f_m(w, x, y)] is the local cost function at client m ∈ [M] := {1, ..., M}, with the expectation taken over heterogeneous local data (x, y) ∼ D_m. Client unavailability. Clients are also heterogeneous in their computation speeds and communication channel conditions, which results in intermittent client unavailability.

4. ALGORITHM

Our algorithm is formally described in Algorithm 1, which takes T, η, β, n, B, and ν ∈ R^d as inputs, where T is the iteration horizon, η > 0 is the stepsize, β ≥ 0 is the privacy budget, n is the mini-batch size of local stochastic gradients, B > 0 is the threshold used to clip stochastic gradients, and ν serves as the initial value of w(0). In each iteration t of Algorithm 1, each client m is selected by the PS with probability p. Let S(t) be the set of selected clients at time t. Since Byzantine clients can deviate from Algorithm 1 arbitrarily, lines 2-8 are executed only by clients in S(t) \ B(t). In each iteration t, client m ∈ S(t) \ B(t) first obtains n stochastic gradients g^1_m(t), ..., g^n_m(t). It then clips the average (1/n) Σ_{j=1}^{n} g^j_m(t) coordinate-wise to [-B, B], and compresses the clipped gradient via the compressor M_{B,β} in lines 5 and 6. For ease of exposition, let ĝ_{mi}(t) = [M_{B,β}]_i((1/n) Σ_{j=1}^{n} g^j_m(t)). Finally, client m reports ĝ_m(t) to the PS. On the PS side (lines 8-12), it first waits to receive the messages u_m(t) from the sampled clients, where u_m(t) = ĝ_m(t) if m ∉ B(t). The PS then passes {u_m : m ∈ S(t)} to an aggregation function agg, and takes the coordinate-wise sign of the function output to obtain ĝ(t). Next, it broadcasts ĝ(t) to S(t + 1); upon receiving ĝ(t), the model is updated as w(t + 1) ← w(t) − η ĝ(t).
For convenience of exposition, if no message is received from a selected client m (which occurs only when m ∈ B(t)), then the PS treats u_m as 0. Notably, if m ∈ B(t), the received u_m(t) can take an arbitrary value. Since ĝ_m ∈ {±1}^d for m ∉ B(t), if u_{mi}(t) ∉ {−1, 1}, then it must be that client m ∈ B(t), and u_{mi}(t) is removed from the aggregation by the PS. In other words, it is always a better strategy for a Byzantine client to restrict u ∈ {±1}^d. Henceforth, without loss of generality, we assume that u_m(t) ∈ {±1}^d for all received compressed gradients.
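A minimal PS-side sketch of lines 8-12 under the majority-vote aggregation rule may look as follows; the function name, the zero-tie-break convention, and the NumPy realization are our own assumptions, not part of Algorithm 1.

```python
import numpy as np

def server_round(w, reports, eta):
    """One PS step: discard malformed reports, majority-vote the signs,
    then take a sign-descent step w(t+1) = w(t) - eta * sign(agg)."""
    valid = [u for u in reports if np.all(np.isin(u, (-1.0, 1.0)))]
    g_hat = np.sign(np.sum(valid, axis=0))  # coordinate-wise majority vote
    g_hat[g_hat == 0] = 1.0                 # tie-break convention (ours)
    return w - eta * g_hat
```

Any report outside {±1}^d is dropped, mirroring the observation above that such a report immediately exposes a Byzantine client.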

4.1. AGGREGATION FUNCTIONS

Simple mean (i.e., naive averaging) is one of the most widely adopted aggregation rules in FL Kairouz et al. (2021); Blanchard et al. (2017), yet it is vulnerable to Byzantine attacks. Alternative aggregation rules such as Krum Blanchard et al. (2017), geometric median Chen et al. (2017), and coordinate-wise median and trimmed mean Yin et al. (2018) are shown to be resilient to Byzantine attacks, with different levels of resilience protection. We show that when the inputs to the aggregation functions are a collection of binary vectors in {±1}^d, the signs of the outputs of the simple mean, trimmed mean, and median aggregation rules are identical. Moreover, they are all equivalent to the coordinate-wise majority vote aggregation rule. We denote by agg_avg, agg_{trimmed,k}, agg_median, and agg_maj the coordinate-wise mean, k-trimmed-mean, median, and majority vote aggregation rules, respectively, whose definitions are deferred to Appendix B. Proposition 1. For any given S ⊆ [M] and any given u_m ∈ {±1}^d for m ∈ S, the aggregation rules agg_avg, agg_{trimmed,k} with k < |S|/2, agg_median, and agg_maj are equivalent in terms of their signs.
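Proposition 1 can be checked numerically on random sign matrices; the sketch below is our own, with |S| chosen odd so that no coordinate ties occur, and compares the four rules.

```python
import numpy as np

rng = np.random.default_rng(0)
checks = []
for _ in range(200):
    U = rng.choice([-1.0, 1.0], size=(7, 5))   # |S| = 7 clients (odd), d = 5
    maj = np.sign(U.sum(axis=0))               # coordinate-wise majority vote
    mean_sign = np.sign(U.mean(axis=0))
    median_sign = np.sign(np.median(U, axis=0))
    k = 2                                      # k-trimmed mean with k < |S|/2
    trimmed_sign = np.sign(np.sort(U, axis=0)[k:-k].mean(axis=0))
    checks.append(np.array_equal(mean_sign, maj)
                  and np.array_equal(median_sign, maj)
                  and np.array_equal(trimmed_sign, maj))
all_rules_agree = all(checks)
```

All four rules agree on every coordinate in every trial, as Proposition 1 predicts for ±1-valued inputs.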

5. PRIVACY PRESERVATION

In this section, we characterize the DP of our gradient compressor M_{B,β}. Over the entire training horizon, the differential privacy preserved for any given client can be quantified by applying the composition theorem of ϵ-differentially private algorithms (Dwork et al., 2014, Corollary 3.15). We first show that β is an enabler of DP for our compressor.
Theorem 1. M_{B,0} is not differentially private; that is, there does not exist a finite ϵ > 0 for which Definition 1 holds. When β > 0, M_{B,β} is d · log((2B + β)/β)-DP for all gradients.
Theorem 1 implies that as long as β > 0, M_{B,β} ensures ϵ-differential privacy with ϵ = O(d) at any iteration t. This might be pessimistic in the presence of a deep neural network. However, it is possible to reduce the order O(d), and we pave the way in Theorem 2 and Corollary 1.
Definition 4. For any given B > 0, let C_B := (−∞, −B) ∪ (B, ∞). For each g ∈ R, define dist(g, C_B) := inf_{g′∈C_B} |g − g′|.
Theorem 2. Let g, g′ ∈ G ⊆ R^d be an arbitrary pair of gradient inputs such that g′ ≠ g, and define the ℓ1 sensitivity ∆1 := max_{g,g′∈G} ∥g − g′∥1. For β > 0, M_{B,β} is max_{g∈G} Σ_{i=1}^{d} log(1 + ∆1/(β + dist(g_i, C_B)))-DP on G.
Corollary 1. Under the same definitions as in Theorem 2, M_{B,β} is (∆1/β)-DP.
Remark 1. Theorem 2 gives a finer characterization of the differential privacy preserved by M_{B,β} when β > 0. Unfortunately, the maximum in Theorem 2 is often hard to compute, and Corollary 1 tells us that we can remove the order O(d) by controlling the ℓ1 sensitivity ∆1; one way is to let B = ∆1/d.
Proposition 2. For any gradient g, M_{B,β}(g) is equal in distribution to (M_{B,flip} ∘ M_{B,0})(g), where M_{B,flip} flips each input sign independently with probability β/(2(B + β)). In Proposition 2, we decompose our M_{B,β} as a composition of a randomized sign flip M_{B,flip} and the plain compressor M_{B,0} to help the reader interpret how our DP is realized. A simple example of this equivalence is given in Fig. 1. Under M_{B,0}, an occurrence of −1 in any round of one of the experiments reveals that the input gradient is g′ rather than g.
Even if we constrain the set of gradients, the differential privacy ensured by M_{B,0} is not controllable (implied by the proof of Theorem 2). In contrast, under M_{B,β}, one can tune β to ensure a controllable privacy quantification as per Theorem 1.

Figure 1: d = 1, B = β = 0.1, and two gradients g = 0.1 and g′ = −0.05. Panels (a) and (b) depict the output-sign probabilities of g and g′ under M_{B,β} and under the composition M_{B,flip} ∘ M_{B,0}, respectively.
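Figure 1's example can be replayed numerically against the per-coordinate bound log((2B + β)/β) from Theorem 1; the helper below is our own sketch of the compressor's +1 probability.

```python
import math

def p_plus(g, B, beta):
    """P{compressor outputs +1} for one coordinate: clip, then bias by beta."""
    c = max(-B, min(B, g))
    return (B + beta + c) / (2.0 * (B + beta))

B = beta = 0.1
g, g2 = 0.1, -0.05                           # the gradient pair from Figure 1
eps_bound = math.log((2 * B + beta) / beta)  # per-coordinate bound of Theorem 1
ratios = [p_plus(g, B, beta) / p_plus(g2, B, beta),
          (1 - p_plus(g2, B, beta)) / (1 - p_plus(g, B, beta))]
ok = all(math.log(r) <= eps_bound + 1e-12 for r in ratios)
```

For this pair, p_plus(g) = 3/4 and p_plus(g2) = 3/8, matching the probabilities in Figure 1, and both likelihood ratios stay below (2B + β)/β = 3.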

6. CONVERGENCE ANALYSIS

Our analysis is derived under the following technical assumptions, which are standard in non-convex optimization Shalev-Shwartz & Ben-David (2014).
Assumption 1 (Lower bound). There exists F* such that F(w) ≥ F* for all w.
Assumption 2 (Smoothness). There exists a non-negative constant L such that F(w1) ≤ F(w2) + ⟨∇F(w2), w1 − w2⟩ + (L/2)∥w1 − w2∥²₂ for all w1, w2.
Assumption 3 (Bounded true gradient). For each coordinate i ∈ [d], there exists B_i > 0 such that |∇f_{mi}(w)| ≤ B_i for all m ∈ [M]. Let B₀ := max_{i∈[d]} B_i.
Stochastic gradients can have a significantly wider range than the true gradients.
Example 1 (Norm discrepancy between true and stochastic gradients). Let x ∼ N(0, I_d) be a standard Gaussian random vector, and let f(w, x) = ⟨w, x⟩ + ξ, where ξ is some unknown observation noise. It can be checked easily that for any w ∈ R^d, it holds that ∇f(w) = ∇E[f(w, x)] = E[x] = 0, whereas the natural stochastic gradient ∇f(w, x) = x has the entire R^d as its support.
Assumption 4 (Sub-Gaussianity). For a given client m ∈ [M], at any query w ∈ R^d, the stochastic gradient g_m(w) is an independent unbiased estimate of ∇f_m(w) that is coordinate-wise related to the gradient as g_{mi}(w) = ∇f_{mi}(w) + ξ_{mi} for all i ∈ [d], where ξ_{mi} is zero-mean σ_{mi}-sub-Gaussian, i.e., E[ξ_{mi}] = 0, and the two deviation inequalities P{ξ_{mi} ≥ t} ≤ exp(−t²/(2σ²_{mi})) and P{ξ_{mi} ≤ −t} ≤ exp(−t²/(2σ²_{mi})) hold. Let σ² := max_{m∈[M], i∈[d]} σ²_{mi}.
Assumption 5 (Heavy-tailed noise). Let ξ_{mi}, defined as in Assumption 4, be a zero-mean random variable with E[ξ²_{mi}] ≤ σ² and E[|ξ_{mi}|^{p′}] ≤ M_{p′} < ∞ for some p′ ≥ 4.
Notably, the class of sub-Gaussian random variables contains bounded and unbounded random variables as special cases. Tighter convergence bounds can be obtained under boundedness or Gaussianity assumptions on the noise; see Appendix C.1 for details.
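Example 1 can be visualized with a quick Monte Carlo draw (the sample sizes and thresholds below are our own illustrative choices): the empirical mean of the stochastic gradients ∇f(w, x) = x is near the true gradient 0, while individual samples range far wider.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 100_000
x = rng.standard_normal((n, d))   # stochastic gradients ∇f(w, x) = x
mean_grad = x.mean(axis=0)        # ≈ true gradient ∇E[f(w, x)] = 0
largest = np.abs(x).max()         # individual coordinates reach far beyond 0
```

With these sizes, every coordinate of `mean_grad` is within a few hundredths of 0, while `largest` exceeds 3 with overwhelming probability, which is exactly why clipping at B = (1 + ϵ₀)B₀ is needed before compression.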
We first bound the probability P{ĝ_i(t) ≠ sign(∇_i F(w(t)))}, and then bound (1/T) Σ_{t=0}^{T−1} E[∥∇F(w(t))∥₁] to conclude convergence. Recall that g_m(t) := ∇f_m(w(t)) denotes the true local gradient at client m.
Theorem 3. Choose c₀(n, p) = max{ √((8σ²/n) log(6/c)), √((8(B+β)²/p²) log(6/(3−5c))) } and B = (1 + ϵ₀)B₀. Fix t ≥ 1 and i ∈ [d]. Let c > 0 be any given constant such that c < 3/5. Define δ₁(t) = 2(B+β)τ(t)/(pM) + c₀(n,p)/√M and δ₂(t) = 3(B+β)τ(t)/M + c₀(n,p)/√M. We say Eq. (4) holds if

P{ ĝ_i(t) ≠ sign(∇_i F(w(t))) | w(t) } ≤ (1 − c)/2.   (4)

• (Sub-Gaussian noise): Suppose Assumptions 3 and 4 hold, and ϵ₀ > σ/B₀. Eq. (4) holds if |∇_i F(w(t))| ≥ δ₁(t) + 2(B+β) exp(−n/2) when the system adversary is adaptive, or static with τ(t) ≤ (2/p²) log(6/c); and if |∇_i F(w(t))| ≥ δ₂(t) + 2(B+β) exp(−n/2) when the adversary is static with τ(t) > (2/p²) log(6/c).
• (Heavy-tailed noise): Suppose Assumptions 3 and 5 hold, and ϵ₀ > M_{p′}^{1/p′}/B₀. For p′ ≥ 4, Eq. (4) holds if |∇_i F(w(t))| ≥ δ₁(t) + 4(B+β)/n^{p′/2} when the system adversary is adaptive, or static with τ(t) ≤ (2/p²) log(6/c); and if |∇_i F(w(t))| ≥ δ₂(t) + 4(B+β)/n^{p′/2} when the adversary is static with τ(t) > (2/p²) log(6/c).
Theorem 3 says that when |∇_i F(w(t))| is large enough, the sign estimate at the PS in each iteration is likely to be correct. This is crucial in ensuring the convergence established in Theorem 4. Unlike Jin et al. (2020), we neither assume that the sign-error distributions across clients are identical, nor require the average probability of sign error to be less than 1/2. Instead, we show that it suffices for the probability of population sign errors to be small when the magnitude of the gradient is large.
Theorem 4. Suppose Assumptions 1, 2, and 3 hold. Define δ₁(t) = 2(B+β)τ(t)/(pM) + c₀(n,p)/√M, δ₂(t) = 3(B+β)τ(t)/M + c₀(n,p)/√M, Ξ₁(n) = 2(B+β) exp(−n/2), and Ξ₂(n) = 4(B+β)/n^{p′/2}.
For any given T, B = (1 + ϵ₀)B₀, and c such that 0 < c < 3/5, set the learning rate as η = 1/√(dT) and c₀(n, p) := max{ √((8σ²/n) log(6/c)), √((8(B+β)²/p²) log(6/(3−5c))) }. Under Assumption 4 with ϵ₀ > σ/B₀, or Assumption 5 with ϵ₀ > M_{p′}^{1/p′}/B₀ (p′ ≥ 4), when the system adversary is adaptive, or static with τ(t) ≤ (2/p²) log(6/c),

(1/T) Σ_{t=0}^{T−1} E[∥∇F(w(t))∥₁] ≤ (1/c) [ (F(w(0)) − F*)√d/√T + L√d/(2√T) + 2dΞ(n) + (2d/T) Σ_{t=0}^{T−1} δ₁(t) ];   (5)

and when the system adversary is static with τ(t) > (2/p²) log(6/c),

(1/T) Σ_{t=0}^{T−1} E[∥∇F(w(t))∥₁] ≤ (1/c) [ (F(w(0)) − F*)√d/√T + L√d/(2√T) + 2dΞ(n) + (2d/T) Σ_{t=0}^{T−1} δ₂(t) ].   (6)

We present the convergence results under a unified framework, where Ξ(n) = Ξ₁(n) in the case of sub-Gaussian noise and Ξ(n) = Ξ₂(n) in the case of heavy-tailed noise.
Remark 2. (1) The convergence rates in Eq. (5) and Eq. (6) differ only in their Byzantine terms, by a multiplicative factor of 3p/2. As long as τ(t) is sufficiently large, the impact of p, the degree of partial client participation, on the convergence-rate upper bound is limited. The lower-bound requirement on τ(t) might be an artifact of our analysis in simplifying the boundary-case derivation. (2) If τ(t) = τ for every t, then the Byzantine terms in Eq. (5) and Eq. (6) become O(dτ(B+β)/(pM)) and O(dτ(B+β)/M), respectively; when τ = O(√M), these are of the same order as the term 2d c₀(n,p)/√M in Eq. (5) and Eq. (6), a consequence of the weak signal strength of the compressed gradients near a stationary point of the global objective F. We note that the impact is limited, as the contribution of each client is masked as a one-bit sign instead of an arbitrary value, which is also verified in the following experiments. On the other hand, if Σ_{t=0}^{T−1} τ(t) = O(√T), the Byzantine terms in Eq. (5) and Eq. (6) scale as O(1/√T), of the same order as the first two terms. In either case, due to the mobility of the Byzantine faults, it is possible that ∪_{t=0}^{T} B(t) = [M], i.e., every client is corrupted at least once.
(3) The residual term Ξ(n) is an immediate consequence of using mini-batch stochastic gradients rather than the true gradients as in Jin et al. (2020). It turns out that this term has minimal impact on the final convergence: as long as n = Ω(log M) (sub-Gaussian noise) or n = Ω(M^{1/p′}) (heavy-tailed noise), this term becomes non-dominating. (4) When τ(t) = τ = O(√M) and n is of the same order as in (3), the convergence rates become O(1/√T + 1/√M). The bounds in Theorem 4 can be tightened with more structured gradient noise. We defer our results on Gaussian-tailed and bounded stochastic gradients to Appendix C.2. All of the results have a similar form and differ only in the noise residual terms. In detail, the residual term O(exp(−n/2)) in the convergence results for Gaussian-tailed stochastic gradients is scaled by a constant 1/(4√(2π)), while no noise-tail term appears in the case of bounded stochastic gradients.
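The mini-batch requirements in Remark 2(3) can be sanity-checked numerically; the constants below (B, β, M, p′) are our own illustrative choices.

```python
import math

B, beta, M = 1.0, 0.1, 10_000
ref = 1.0 / math.sqrt(M)  # order of the c0(n,p)/sqrt(M) term

# Sub-Gaussian case: n = Ω(log M) keeps Ξ1(n) = 2(B+β)exp(-n/2) non-dominating,
# since exp(-n/2) <= M^{-1/2} once n >= log M.
n_sub = math.ceil(math.log(M))
xi1 = 2 * (B + beta) * math.exp(-n_sub / 2)

# Heavy-tailed case with p' = 4: n = Ω(M^{1/p'}) keeps Ξ2(n) = 4(B+β)/n^{p'/2}
# non-dominating, since n^{p'/2} >= M^{1/2}.
p_prime = 4
n_heavy = math.ceil(M ** (1 / p_prime))
xi2 = 4 * (B + beta) / n_heavy ** (p_prime / 2)
```

Both residuals land at or below the O(1/√M) scale of the concentration term, matching the claim that they are non-dominating under these batch sizes.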

7. EXPERIMENTAL EVALUATION

In this section, we evaluate the accuracy and convergence speed of our Algorithm 1 in terms of the impacts of the client sampling rate p, the differential privacy budget β, and Byzantine attack resilience. Additional experimental setups and comparisons with Byzantine baselines are deferred to Appendix E. We list the key elements of our experimental setup for comparisons with the benchmark algorithms below. We observe that the testing accuracy drops as β increases. This matches Theorem 4: increasing β pushes the bound farther away from a stationary point, implying that our variant is subject to the utility-privacy trade-off. For any given B, the testing accuracy is comparable when the privacy budgets β are relatively small, such as β ∈ {0.1B, B}, which is fortunate, as we need not sacrifice too much data utility while ensuring a privacy quantification.

β/B on MNIST: 0.1 (ϵ = 3.04d), 1.0 (ϵ = 1.1d), 5.0 (ϵ = 0.34d), 10.0 (ϵ = 0.18d).

We observe from Fig. 2 and Fig. 3 that Algorithm 1 is not sensitive to client sampling. In particular, the Byzantine-free peak accuracy reaches around 96% on MNIST and around 44% on CIFAR-10, the latter being a direct consequence of highly heterogeneous data distributions. The accuracy drops are almost negligible when the Byzantine fraction does not exceed 0.1; we see a sharper drop as the fraction increases. However, the testing accuracy remains stable during the final training stage even when Byzantine clients account for 50% of the reporting clients, except when the sampling rate p is small. Two baseline algorithms with no Byzantine clients are also evaluated for comparison. It is observed that signSGD slightly outperforms our variant in some cases with β > 0; however, our variant is provably d · log 3-differentially private. FedAvg is inferior to the other algorithms in all cases.
Assumption 7 (Gaussianity). For a given client m ∈ [M], at any query w ∈ R^d, the stochastic gradient g_m(w) is an independent unbiased estimate of ∇f_m(w) that is coordinate-wise related to the gradient as g_{mi}(w) = ∇f_{mi}(w) + ξ_{mi} for all i ∈ [d], where ξ_{mi} ∼ N(0, σ²_{mi}). Let σ² := max_{m∈[M], i∈[d]} σ²_{mi}.

C.2 ALTERNATIVE CONVERGENCE RATES

Corollary 2. Suppose Assumptions 3 and 7 hold. Choose B = (1 + ϵ₀)B₀ for ϵ₀ > σ/B₀ and c₀(n, p) = max{ √((8σ²/n) log(6/c)), √((8(B+β)²/p²) log(6/(3−5c))) }. Fix t ≥ 1 and i ∈ [d]. Let c > 0 be any given constant such that c < 3/5. When the system adversary is adaptive, or static with τ(t) ≤ (2/p²) log(6/c), Eq. (4) holds if

|∇_i F(w(t))| ≥ 2(B+β)τ(t)/(pM) + ((B+β)/(2√(2π))) exp(−n/2) + c₀(n,p)/√M.

When the system adversary is static with τ(t) > (2/p²) log(6/c), Eq. (4) holds if

|∇_i F(w(t))| ≥ 3(B+β)τ(t)/M + ((B+β)/(2√(2π))) exp(−n/2) + c₀(n,p)/√M.

Corollary 3. Suppose Assumptions 1, 2, 3, and 7 hold. For any given T, B = (1 + ϵ₀)B₀ with ϵ₀ > σ/B₀, and c such that 0 < c < 3/5, set the learning rate as η = 1/√(dT) and c₀(n, p) = max{ √((8σ²/n) log(6/c)), √((8(B+β)²/p²) log(6/(3−5c))) }. When the system adversary is adaptive, or static with τ(t) ≤ (2/p²) log(6/c), we have

(1/T) Σ_{t=0}^{T−1} E[∥∇F(w(t))∥₁] ≤ (1/c) [ (F(w(0)) − F*)√d/√T + L√d/(2√T) + (d/√(2π))(B+β) exp(−n/2) + 2d c₀(n,p)/√M + 4d(B+β) Σ_{t=0}^{T−1} τ(t)/(pTM) ].

On the other hand, when the system adversary is static with τ(t) > (2/p²) log(6/c), we have

(1/T) Σ_{t=0}^{T−1} E[∥∇F(w(t))∥₁] ≤ (1/c) [ (F(w(0)) − F*)√d/√T + L√d/(2√T) + (d/√(2π))(B+β) exp(−n/2) + 2d c₀(n,p)/√M + 6d(B+β) Σ_{t=0}^{T−1} τ(t)/(TM) ].

Corollary 4. Suppose Assumption 6 holds. Choose B to be the uniform bound on the stochastic gradients in Assumption 6, and c₀(n, p) = max{ √((8σ²/n) log(6/c)), √((8(B+β)²/p²) log(6/(3−5c))) }. Fix t ≥ 1 and i ∈ [d]. Let c > 0 be any given constant such that c < 3/5. When the system adversary is adaptive, or static with τ(t) ≤ (2/p²) log(6/c), Eq. (4) holds if |∇_i F(w(t))| ≥ 2(B+β)τ(t)/(pM) + c₀(n,p)/√M. When the system adversary is static with τ(t) > (2/p²) log(6/c), Eq. (4) holds if |∇_i F(w(t))| ≥ 3(B+β)τ(t)/M + c₀(n,p)/√M.

Corollary 5. Suppose Assumptions 1, 2, and 6 hold.
For any given T and c such that 0 < c < 3/5, set the learning rate as η = 1/√(dT) and c₀(n, p) := max{ √((8σ²/n) log(6/c)), √((8(B+β)²/p²) log(6/(3−5c))) }. When the system adversary is adaptive, or static with τ(t) ≤ (2/p²) log(6/c), we have

(1/T) Σ_{t=0}^{T−1} E[∥∇F(w(t))∥₁] ≤ (1/c) [ (F(w(0)) − F*)√d/√T + L√d/(2√T) + 2d c₀(n,p)/√M + 4d(B+β) Σ_{t=0}^{T−1} τ(t)/(pTM) ].

On the other hand, when the system adversary is static with τ(t) > (2/p²) log(6/c), we have

(1/T) Σ_{t=0}^{T−1} E[∥∇F(w(t))∥₁] ≤ (1/c) [ (F(w(0)) − F*)√d/√T + L√d/(2√T) + 2d c₀(n,p)/√M + 6d(B+β) Σ_{t=0}^{T−1} τ(t)/(TM) ].

Proof of Theorem 1. If g₁ ∈ (−B, B), then there exists g′ ∈ R^d such that g′ ≠ g, g′₁ ≥ B, and ∥g − g′∥₁ ≤ 1. We have

P{ĝ₁ = −1} / P{ĝ′₁ = −1} = [(B − clip{g₁, B})/(2B)] / [(B − clip{g′₁, B})/(2B)] = (B − clip{g₁, B}) / (B − clip{g′₁, B}) = (B − clip{g₁, B}) / (B − B) = ∞.

Since no finite differential-privacy quantification holds for this pair of gradients g and g′, no differential privacy is implied as per Definition 1, proving the first part of the theorem. When β > 0, for any g, g′ ∈ R^d such that g′ ≠ g and ∥g − g′∥₁ ≤ 1, and for each coordinate i ∈ [d], it holds that

P{ĝ′_i = −1} / P{ĝ_i = −1} = [(B + β − clip{g′_i, B})/(2B + 2β)] / [(B + β − clip{g_i, B})/(2B + 2β)] = (B + β − clip{g′_i, B}) / (B + β − clip{g_i, B}) ≤ (2B + β)/β.

Similarly, we can show the same upper bound for P{ĝ′_i = 1}/P{ĝ_i = 1}. That is, for the i-th coordinate, the compressor M_{B,β} is log((2B + β)/β)-differentially private. By Theorem 5, we conclude that the compressor M_{B,β} is d · log((2B + β)/β)-differentially private for the entire gradient.

Proof of Theorem 2 (Smaller Collection of Gradients). For each coordinate

i ∈ [d], it holds that

P{ĝ′_i = −1} / P{ĝ_i = −1} = [(B + β − clip{g′_i, B})/(2B + 2β)] / [(B + β − clip{g_i, B})/(2B + 2β)] = (B + β − clip{g′_i, B}) / (B + β − clip{g_i, B}) = [B + β − clip{g_i, B} + clip{g_i, B} − clip{g′_i, B}] / (B + β − clip{g_i, B}) ≤ 1 + |g_i − g′_i| / (B + β − clip{g_i, B}) ≤ 1 + ∆₁ / (B + β − clip{g_i, B}) ≤ 1 + ∆₁ / (β + dist(g_i, C_B)).   (7)

By Theorem 5, we conclude that the compressor M_{B,β} is max_{g∈G} Σ_{i=1}^{d} log(1 + ∆₁/(β + dist(g_i, C_B)))-differentially private for all gradients g ∈ G.

Proof of Corollary 1 (Bounded DP with Bounded Sensitivity). By Theorem 2, the compressor M_{B,β} is max_{g∈G} Σ_{i=1}^{d} log(1 + ∆₁/(β + dist(g_i, C_B)))-differentially private for all gradients g ∈ G. It turns out that this bound can be relaxed. Starting from the chain of inequalities in Eq. (7),

(7) ≤ 1 + |g_i − g′_i| / β.

Now, collecting the coordinates of the gradient pair, by Theorem 5 it remains to bound

Σ_{i=1}^{d} log(1 + |g_i − g′_i|/β) ≤ d log( (1/d) Σ_{i=1}^{d} (1 + |g_i − g′_i|/β) )   [Jensen's inequality]
≤ d log(1 + ∆₁/(dβ)) ≤ ∆₁/β.   [log(1 + x) < x for x > 0]

Proof of Proposition 2 (Equivalence as a Composition). Let g ∈ R^d be an arbitrary gradient. To show this proposition, it is enough to show that P{[M_{B,β}]_i(g) = 1} = P{[M_{B,flip} ∘ M_{B,0}]_i(g) = 1} holds for every i ∈ [d]. To see this,

P{[M_{B,flip} ∘ M_{B,0}]_i(g) = 1} = P{[M_{B,0}]_i(g) = 1 and M_{B,flip}(1) = 1} + P{[M_{B,0}]_i(g) = −1 and M_{B,flip}(−1) = 1}
= [(B + clip{g_i, B})/(2B)] · [(2B + β)/(2(B + β))] + [(B − clip{g_i, B})/(2B)] · [β/(2(B + β))]
= (B + β + clip{g_i, B}) / (2(B + β)) = P{[M_{B,β}]_i(g) = 1}.
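The last chain of equalities can be verified numerically; the sketch below (helper names are ours) compares the +1 probability of M_{B,β} against the composed probability of a base sign draw followed by an independent flip with probability β/(2(B + β)).

```python
def p_plus_beta(g, B, beta):
    """P{[M_{B,beta}]_i(g) = +1} = (B + beta + clip{g, B}) / (2(B + beta))."""
    c = max(-B, min(B, g))
    return (B + beta + c) / (2.0 * (B + beta))

def p_plus_composed(g, B, beta):
    """P{+1} of M_flip ∘ M_{B,0}: draw the base sign under M_{B,0}, then keep
    it with probability (2B + beta)/(2(B + beta)), i.e., flip otherwise."""
    p0 = p_plus_beta(g, B, 0.0)                 # base compressor M_{B,0}
    keep = (2 * B + beta) / (2.0 * (B + beta))
    return p0 * keep + (1 - p0) * (1 - keep)

B, beta = 0.1, 0.1
match = all(abs(p_plus_beta(g, B, beta) - p_plus_composed(g, B, beta)) < 1e-12
            for g in (-0.3, -0.05, 0.0, 0.1, 0.25))
```

The two probabilities agree coordinate-wise for every test gradient, including saturated ones, as the algebra in the proof predicts.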

D.3 CONVERGENCE RESULTS

Proposition 3 (Bounded Random Variable Variance Bound). Given a random variable X and a clipping threshold B > 0, if µ = E[X] ∈ [−B, B], then var(clip(X, B)) ≤ var(X) = σ².
Proof of Proposition 3. We have

var(clip(X, B)) := E[(clip(X, B) − E[clip(X, B)])²] = E[(clip(X, B) − E[X])²] − (E[clip(X, B) − X])² ≤ E[(clip(X, B) − E[X])²].   (8)

For ease of exposition, we assume X admits a probability density function f(x); general distributions of X can be handled analogously. It follows that

E[(clip(X, B) − E[X])²] = ∫_B^∞ (B − µ)² f(x) dx + ∫_{−B}^{B} (x − µ)² f(x) dx + ∫_{−∞}^{−B} (−B − µ)² f(x) dx ≤ ∫_B^∞ (x − µ)² f(x) dx + ∫_{−B}^{B} (x − µ)² f(x) dx + ∫_{−∞}^{−B} (x − µ)² f(x) dx = var(X) = σ²,   (9)

where the inequality uses µ ∈ [−B, B], so that |B − µ| ≤ |x − µ| for x ≥ B and |−B − µ| ≤ |x − µ| for x ≤ −B. Combining (8) and (9), we conclude var(clip(X, B)) ≤ var(X) = σ².
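Proposition 3 can be sanity-checked by Monte Carlo; the distribution, mean, and noise scales below are our own choices, with the mean kept inside [−B, B] as the proposition requires.

```python
import numpy as np

rng = np.random.default_rng(2)
B = 1.0
variance_never_increases = True
for scale in (0.5, 1.0, 3.0):
    X = rng.normal(loc=0.3, scale=scale, size=200_000)  # E[X] = 0.3 ∈ [-B, B]
    clipped = np.clip(X, -B, B)
    # Clipping pulls tail mass toward ±B, so the empirical variance shrinks.
    variance_never_increases = variance_never_increases and bool(
        clipped.var() <= X.var())
```

The inequality holds at every noise scale, including the mild-clipping regime where only a small fraction of mass lies outside [−B, B].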

D.3.1 SUB-GAUSSIAN AND HEAVY-TAILED DISTRIBUTIONS

Proof of Theorem 3 (Light- and Heavy-tailed Sign Error). Recall that

ĝ_{mi}(t) = [M_{B,β}]_i( (1/n) Σ_{j=1}^{n} g^j_{mi}(t) ) if m ∈ N(t), and ĝ_{mi}(t) = * if m ∈ B(t),

where * is an arbitrary value in {−1, 1}. For any client m ∈ [M] and any coordinate i ∈ [d], let

X_{mi} = 1{m ∈ S(t)} · 1{ĝ_{mi} ≠ sign((1/M) Σ_{m=1}^{M} g_{mi})},
X̃_{mi} = 1{m ∈ S(t)} · 1{[M_{B,β}]_i((1/n) Σ_{j=1}^{n} g^j_{mi}(t)) ≠ sign((1/M) Σ_{m=1}^{M} g_{mi})}.

Notably, if m ∈ B(t), then it is possible that X_{mi} ≠ X̃_{mi}; otherwise, X_{mi} = X̃_{mi}. Without loss of generality, we assume the true aggregate is negative, i.e., sign(∇_i F(w(t))) = −1; the case sign(∇_i F(w(t))) = 1 can be shown analogously. For ease of exposition, we drop the conditioning on w(t) in the conditional probability expressions unless otherwise noted. It holds that

P{ sign((1/M) Σ_{m=1}^{M} ĝ_{mi}) ≠ −1 } ≤ P{ Σ_{m=1}^{M} X_{mi} ≥ |S(t)|/2 }
= P{ Σ_{m∈N(t)} X̃_{mi} + Σ_{m∈B(t)} X_{mi} ≥ |S(t)|/2 }
= P{ Σ_{m∈N(t)} X̃_{mi} ≥ |S(t)|/2 − Σ_{m∈B(t)} X_{mi} }
≤ P{ Σ_{m=1}^{M} X̃_{mi} ≥ |S(t)|/2 − Σ_{m∈B(t)} X_{mi} }.

Next, we bound Σ_{m=1}^{M} X̃_{mi} and Σ_{m∈B(t)} X_{mi} separately. When the system adversary is static, i.e., it does not know S(t), it corrupts clients independently of S(t). Hence, Σ_{m∈B(t)} X_{mi} ≤ Σ_{m∈B(t)} 1{m ∈ S(t)}. We know that if τ(t) ≤ (2/p²) log(6/c), then Σ_{m∈B(t)} 1{m ∈ S(t)} ≤ (2/p²) log(6/c); otherwise, with probability at least 1 − c/6, it is true that Σ_{m∈B(t)} 1{m ∈ S(t)} ≤ (3/2) p τ(t). On the other hand, when the system adversary is adaptive, it chooses B(t) based on S(t): if |S(t)| ≤ τ(t), the adversary chooses B(t) = S(t); otherwise, i.e., |S(t)| > τ(t), the adversary chooses an arbitrary subset of S(t). In both cases, it holds that Σ_{m∈B(t)} X_{mi} ≤ Σ_{m∈B(t)} 1{m ∈ S(t)} ≤ min{τ(t), |S(t)|} ≤ τ(t). For ease of exposition, we first focus on the adaptive adversary and revisit the static adversary toward the end of this proof. Observe that |S(t)| = Σ_{m=1}^{M} 1{m ∈ S(t)}. Let Y_{mi} = X̃_{mi} − 1{m ∈ S(t)}/2.
Conditioning on the mini-batch stochastic gradients $g^1_{mi}, \dots, g^n_{mi}$, we have
$$\mathbb E\big[Y_{mi} \mid g^1_{mi}, \dots, g^n_{mi}\big] = \mathbb E\big[X_{mi} \mid g^1_{mi}, \dots, g^n_{mi}\big] - \frac p2 = \frac{p}{2B + 2\beta}\,\mathrm{clip}\Big(\frac1n\sum_{j=1}^n g^j_{mi},\, B\Big).$$
Taking expectation over $g^1_{mi}, \dots, g^n_{mi}$, we get
$$\mathbb E[Y_{mi}] = \mathbb E\Big[\mathbb E\big[Y_{mi} \mid g^1_{mi}, \dots, g^n_{mi}\big] - \frac{p\,\frac1n\sum_{j=1}^n g^j_{mi}}{2B + 2\beta}\Big] + \frac{p\,\bar g_{mi}}{2B + 2\beta}, \tag{13}$$
where $\bar g_{mi} = \mathbb E\big[\frac1n\sum_{j=1}^n g^j_{mi}\big]$. It turns out that the first term is small:
$$\frac1p\,\mathbb E\Big[\mathbb E\big[Y_{mi} \mid g^1_{mi}, \dots, g^n_{mi}\big] - \frac{p\,\frac1n\sum_j g^j_{mi}}{2B+2\beta}\Big] = \underbrace{\frac{B\,\mathbb P\{\frac1n\sum_j g^j_{mi} \ge B\} - B\,\mathbb P\{\frac1n\sum_j g^j_{mi} \le -B\}}{2B+2\beta}}_{(A)} + \underbrace{\frac{\mathbb E\big[-\frac1n\sum_j g^j_{mi}\,\mathbf 1_{\{|\frac1n\sum_j g^j_{mi}| \ge B\}}\big]}{2B+2\beta}}_{(B)}.$$
We bound (A) and (B) for sub-Gaussian and heavy-tailed noise separately. First, for sub-Gaussian distributions under Assumption 4, we have
$$(A) \le \frac{B}{2B+2\beta}\,\mathbb P\Big\{\frac1n\sum_j g^j_{mi} - \bar g_{mi} \ge B - \bar g_{mi}\Big\} \le \frac{B}{2B+2\beta}\exp\Big(-\frac{n(B - \bar g_{mi})^2}{2\sigma^2_{mi}}\Big) \le \frac{B}{2B+2\beta}\exp\Big(-\frac{n\epsilon_0^2 B_0^2}{2\sigma^2_{mi}}\Big) \le \frac12\exp\Big(-\frac n2\Big) \quad \Big[\text{since } \epsilon_0 > \frac{\sigma}{B_0}\Big],$$
and
$$(B) = \frac{\int_{-\infty}^{-B}\mathbb P\{\frac1n\sum_j g^j_{mi} < t\}\,dt - \int_B^{+\infty}\mathbb P\{\frac1n\sum_j g^j_{mi} > t\}\,dt}{2B+2\beta} \le \frac{\int_{-\infty}^{-B}\mathbb P\{\frac1n\sum_j g^j_{mi} - \bar g_{mi} < t - \bar g_{mi}\}\,dt}{2B+2\beta} \le \frac{\int_{-\infty}^{-B}\exp\big(-\frac{(t-\bar g_{mi})^2}{2\sigma^2_{mi}/n}\big)\,dt}{2B+2\beta} \quad [\text{Mill's ratio, Gordon (1941)}]$$
$$\le \frac{\sigma^2_{mi}/n}{(2B+2\beta)(B + \bar g_{mi})}\exp\Big(-\frac{n(B+\bar g_{mi})^2}{2\sigma^2_{mi}}\Big) \le \frac{\sigma^2_{mi}}{n\epsilon_0 B_0(2B+2\beta)}\exp\Big(-\frac{n\epsilon_0^2B_0^2}{2\sigma^2_{mi}}\Big) \le \frac{\sigma^2_{mi}}{2n\epsilon_0^2B_0^2}\exp\Big(-\frac{n\epsilon_0^2B_0^2}{2\sigma^2_{mi}}\Big) \quad [\beta > 0 \text{ and } B := (1+\epsilon_0)B_0 > \epsilon_0 B_0]$$
$$\le \frac1{2n}\exp\Big(-\frac n2\Big),$$
where the last inequality follows from the choice of $\epsilon_0 > \frac{\sigma}{B_0}$. Combining the bounds of (A) and (B), we get
$$\Big|\mathbb E\Big[\mathbb E\big[Y_{mi}\mid g^1_{mi},\dots,g^n_{mi}\big] - \frac{p\,\frac1n\sum_j g^j_{mi}}{2B+2\beta}\Big]\Big| \le p\exp\Big(-\frac n2\Big).$$
Hence, for sub-Gaussian noise,
$$\mathbb E[Y_{mi}] \le p\exp\Big(-\frac n2\Big) + \frac{p\,\bar g_{mi}}{2B+2\beta}.$$
Second, for heavy-tailed distributions under Assumption 5, we have
$$(A) \le \frac{B}{2B+2\beta}\,\mathbb P\Big\{\Big|\sum_{j=1}^n g^j_{mi} - \mathbb E\Big[\sum_{j=1}^n g^j_{mi}\Big]\Big|^{p'} \ge n^{p'}|B - \bar g_{mi}|^{p'}\Big\} \le \frac{B}{2B+2\beta}\cdot\frac{\mathbb E\big|\sum_{j=1}^n g^j_{mi} - \mathbb E[\sum_{j=1}^n g^j_{mi}]\big|^{p'}}{n^{p'}|B - \bar g_{mi}|^{p'}} \quad [\text{Markov's inequality}]$$
$$\le \frac{B\sum_{j=1}^n\mathbb E\big|g^j_{mi} - \mathbb E[g^j_{mi}]\big|^{p'} + B\big(\sum_{j=1}^n\mathbb E\big|g^j_{mi} - \mathbb E[g^j_{mi}]\big|^2\big)^{p'/2}}{(2B+2\beta)\,n^{p'}\,|B-\bar g_{mi}|^{p'}} \quad [\text{Rosenthal-type inequality, Merlevède \& Peligrad (2013)}]$$
$$\le \frac12\cdot\frac{nM_{p'} + n^{p'/2}M_{p'}}{n^{p'}|B - \bar g_{mi}|^{p'}} \quad \big[M_2^{1/2} \le M_{p'}^{1/p'} \text{ for } p' \ge 4\big] \le \frac{M_{p'}}{n^{p'/2}\epsilon_0^{p'}B_0^{p'}} \le \frac1{n^{p'/2}},$$
and
$$(B) \le \frac1{2B+2\beta}\int_{-\infty}^{-B}\frac{2M_{p'}}{n^{p'/2}|t - \bar g_{mi}|^{p'}}\,dt \quad [\text{similar argument as in (A)}] \le \frac1{2B+2\beta}\cdot\frac1{\epsilon_0^{p'-1}B_0^{p'-1}(p'-1)\,n^{p'/2}} \le \frac1{(p'-1)\,n^{p'/2}} \le \frac1{n^{p'/2}},$$
where the last inequality follows from the choice of $\epsilon_0 > \frac{M_{p'}^{1/p'}}{B_0}$. Combining the bounds of (A) and (B), we get
$$\Big|\mathbb E\Big[\mathbb E\big[Y_{mi}\mid g^1_{mi},\dots,g^n_{mi}\big] - \frac{p\,\frac1n\sum_j g^j_{mi}}{2B+2\beta}\Big]\Big| \le \frac{2p}{n^{p'/2}}, \qquad\text{hence}\qquad \mathbb E[Y_{mi}] \le \frac{2p}{n^{p'/2}} + \frac{p\,\bar g_{mi}}{2B+2\beta}. \tag{15}$$
Let us consider two mutually complementary events $\mathcal E_1$ and $\mathcal E_2$:
$$\mathcal E_1 := \Big\{\frac1{2(B+\beta)}\Big|\sum_{m=1}^M \mathrm{clip}\Big(\frac1n\sum_j g^j_{mi}, B\Big) - \mathbb E\Big[\sum_{m=1}^M \mathrm{clip}\Big(\frac1n\sum_j g^j_{mi}, B\Big)\Big]\Big| \le \frac{c_0(n,p)\sqrt M}{4(B+\beta)}\Big\},$$
and $\mathcal E_2$ its complement (with $>$ in place of $\le$). We have
$$\mathbb P\Big\{\sum_{m=1}^M X_{mi} \ge \frac{|\mathcal S(t)|}2 - \tau(t)\Big\} \le \mathbb P\Big\{\sum_{m=1}^M Y_{mi} \ge -\tau(t) \,\Big|\, \mathcal E_1\Big\} + \mathbb P\{\mathcal E_2\}. \tag{16}$$
By Proposition 3, we know that
$$\mathrm{var}\Big(\mathrm{clip}\Big(\frac1n\sum_{j=1}^n g^j_{mi}, B\Big)\Big) \le \mathrm{var}\Big(\frac1n\sum_{j=1}^n g^j_{mi}\Big) \le \frac1n\,\mathrm{var}\big(g^1_{mi}\big) = \frac{\sigma^2_{mi}}n \le \frac{\sigma^2}n.$$
In addition, $\mathrm{clip}(\frac1n\sum_j g^j_{mi}, B)$ is bounded and thus sub-Gaussian. Hence, we have
$$\mathbb P\{\mathcal E_2\} \le \exp\Big(-\frac{c_0^2(n,p)\,M/4}{2M\sigma^2/n}\Big) = \exp\Big(-\frac{n\,c_0^2(n,p)}{8\sigma^2}\Big).$$
Since $c_0(n,p) \ge \sqrt{\frac{8\sigma^2}{n}\log\frac6c}$, we have $\mathbb P\{\mathcal E_2\} \le \frac c6$. For the first term on the right-hand side of Eq. (16), we have
$$\mathbb P\Big\{\sum_{m=1}^M Y_{mi} \ge -\tau(t) \,\Big|\, \mathcal E_1\Big\} = \mathbb P\Big\{\sum_{m=1}^M Y_{mi} - \mathbb E\Big[\sum_{m=1}^M Y_{mi} \,\Big|\, g^1_{mi},\dots,g^n_{mi}\Big] \ge \underbrace{-\tau(t) - \mathbb E\Big[\sum_{m=1}^M Y_{mi} \,\Big|\, g^1_{mi},\dots,g^n_{mi}\Big]}_{(C)} \,\Big|\, \mathcal E_1\Big\}.$$
Recall that $\mathbb E[Y_{mi} \mid g^1_{mi},\dots,g^n_{mi}] = \frac p{2B+2\beta}\mathrm{clip}(\frac1n\sum_j g^j_{mi}, B)$. On $\mathcal E_1$,
$$(C) = -\tau(t) - \frac{p}{2B+2\beta}\sum_{m=1}^M \mathrm{clip}\Big(\frac1n\sum_j g^j_{mi}, B\Big) \ge -\tau(t) - \mathbb E\Big[\frac{p}{2B+2\beta}\sum_{m=1}^M \mathrm{clip}\Big(\frac1n\sum_j g^j_{mi}, B\Big)\Big] - \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)} = -\tau(t) - \sum_{m=1}^M \mathbb E[Y_{mi}] - \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)}$$
$$\ge \begin{cases} -\tau(t) - Mp\exp\big(-\frac n2\big) - \frac{pM}{2(B+\beta)}\nabla_i F(w(t)) - \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)} & [\text{sub-Gaussian noise}]\\[4pt] -\tau(t) - \frac{2Mp}{n^{p'/2}} - \frac{pM}{2(B+\beta)}\nabla_i F(w(t)) - \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)} & [\text{heavy-tailed noise}]\end{cases}$$
Recall that $\nabla_i F(w(t)) < 0$. When $\frac{pM}{2(B+\beta)}|\nabla_i F(w(t))| \ge \tau(t) + Mp\exp(-\frac n2) + \frac{p\,c_0(n,p)\sqrt M}{2(B+\beta)}$ (sub-Gaussian noise) or when $\frac{pM}{2(B+\beta)}|\nabla_i F(w(t))| \ge \tau(t) + \frac{2Mp}{n^{p'/2}} + \frac{p\,c_0(n,p)\sqrt M}{2(B+\beta)}$ (heavy-tailed noise), we get
$$\mathbb P\Big\{\sum_{m=1}^M Y_{mi} \ge -\tau(t) \,\Big|\, \mathcal E_1\Big\} \le \mathbb P\Big\{\sum_{m=1}^M Y_{mi} - \mathbb E\Big[\sum_m Y_{mi} \,\Big|\, g^1_{mi},\dots,g^n_{mi}\Big] \ge \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)} \,\Big|\, \mathcal E_1\Big\} \le \exp\Big(-\frac{p^2c_0^2(n,p)}{8(B+\beta)^2}\Big) \le \frac{3-5c}6,$$
where the last inequality holds because $c_0(n,p) \ge \sqrt{\frac{8(B+\beta)^2}{p^2}\log\frac6{3-5c}}$. Consequently, when $|\nabla_i F(w(t))| \ge \frac{2(B+\beta)}{pM}\tau(t) + \frac{c_0(n,p)}{\sqrt M} + 2(B+\beta)\exp(-\frac n2)$ (sub-Gaussian noise) or $|\nabla_i F(w(t))| \ge \frac{2(B+\beta)}{pM}\tau(t) + \frac{c_0(n,p)}{\sqrt M} + \frac{4(B+\beta)}{n^{p'/2}}$ (heavy-tailed noise), then
$$\mathbb P\Big\{\operatorname{sign}\Big(\frac1M\sum_{m=1}^M \tilde g_{mi}\Big) \ne \operatorname{sign}(\nabla_i F(w(t))) \,\Big|\, w(t)\Big\} \le \frac{1-c}2.$$
Otherwise, trivially, $\mathbb P\{\operatorname{sign}(\frac1M\sum_{m=1}^M \tilde g_{mi}) \ne \operatorname{sign}(\nabla_i F(w(t))) \mid w(t)\} \le 1$. It remains to show the case of the static adversary. When $\tau(t) \le \frac2{p^2}\log\frac6c$, we bound Eq. (10) as
$$\mathbb P\Big\{\sum_{m=1}^M X_{mi} \ge \frac{|\mathcal S(t)|}2 - \sum_{m\in\mathcal B(t)}\tilde X_{mi}\Big\} \le \mathbb P\Big\{\sum_{m=1}^M X_{mi} \ge \frac{|\mathcal S(t)|}2 - \tau(t)\Big\}.$$
When $\tau(t) > \frac2{p^2}\log\frac6c$, we bound Eq. (10) as
$$\mathbb P\Big\{\sum_{m=1}^M X_{mi} \ge \frac{|\mathcal S(t)|}2 - \sum_{m\in\mathcal B(t)}\tilde X_{mi}\Big\} \le \mathbb P\Big\{\sum_{m=1}^M X_{mi} \ge \frac{|\mathcal S(t)|}2 - \frac{3p}2\tau(t)\Big\} + \frac c6.$$
The remaining proof follows the above argument for the adaptive adversary.

Proof of Theorem 4 (Sub-Gaussian and Heavy-tailed Convergence Rate).
By Assumption 2, and since each coordinate of $w$ moves by exactly $\eta$ per step, we have
$$F(w(t+1)) - F(w(t)) \le \langle\nabla F(w(t)), w(t+1) - w(t)\rangle + \frac L2\|w(t+1) - w(t)\|^2 = -\eta\sum_{i=1}^d|\nabla F(w(t))_i|\,\mathbf 1_{\{\tilde g_i = \operatorname{sign}(\nabla_i F(w(t)))\}} + \eta\sum_{i=1}^d|\nabla F(w(t))_i|\,\mathbf 1_{\{\tilde g_i \ne \operatorname{sign}(\nabla_i F(w(t)))\}} + \frac{Ld}2\eta^2$$
$$= -\eta\|\nabla F(w(t))\|_1 + 2\eta\sum_{i=1}^d|\nabla F(w(t))_i|\,\mathbf 1_{\{\tilde g_i \ne \operatorname{sign}(\nabla_i F(w(t)))\}} + \frac{Ld}2\eta^2,$$
where $\nabla F(w(t))_i$ is the $i$-th coordinate of $\nabla F(w(t))$. Then, conditioning on the parameter $w(t)$, we get
$$\mathbb E\big[F(w(t+1)) - F(w(t)) \mid w(t)\big] \le -\eta\|\nabla F(w(t))\|_1 + \frac{Ld}2\eta^2 + 2\eta\sum_{i=1}^d|\nabla F(w(t))_i|\,\mathbb P\{\tilde g_i \ne \operatorname{sign}(\nabla F(w(t))_i)\}.$$
Recall that $\Xi_1(n) = 2(B+\beta)\exp(-\frac n2)$ and $\Xi_2(n) = \frac{4(B+\beta)}{n^{p'/2}}$. Define the events
$$\mathcal A_1 = \Big\{|\nabla_i F(w(t))| \ge \frac{2(B+\beta)}{pM}\tau(t) + \frac{c_0(n,p)}{\sqrt M} + 2(B+\beta)\exp\Big(-\frac n2\Big)\Big\}; \qquad \mathcal A_2 = \Big\{|\nabla_i F(w(t))| \ge \frac{2(B+\beta)}{pM}\tau(t) + \frac{c_0(n,p)}{\sqrt M} + \frac{4(B+\beta)}{n^{p'/2}}\Big\}.$$
In the following proof, we write $\mathcal A = \mathcal A_1$, $\Xi(n) = \Xi_1(n)$ for sub-Gaussian noise and $\mathcal A = \mathcal A_2$, $\Xi(n) = \Xi_2(n)$ for heavy-tailed noise. We now have two cases. First, when the system adversary is adaptive, or static with $\tau(t) \le \frac2{p^2}\log\frac6c$, then
$$\mathbb E\big[F(w(t+1)) - F(w(t)) \mid w(t)\big] \le -\eta\|\nabla F(w(t))\|_1 + \frac{Ld}2\eta^2 + 2\eta\sum_{i=1}^d|\nabla F(w(t))_i|\,\frac{1-c}2\,\mathbf 1_{\mathcal A} + 2\eta\sum_{i=1}^d\Big(\frac{2(B+\beta)\tau(t)}{pM} + \frac{c_0(n,p)}{\sqrt M} + \Xi(n)\Big)\mathbf 1_{\mathcal A^\complement}$$
$$\le -\eta c\|\nabla F(w(t))\|_1 + \frac{Ld}2\eta^2 + 2\eta d\,\frac{c_0(n,p)}{\sqrt M} + 4\eta d\,\frac{(B+\beta)\tau(t)}{pM} + 2\eta d\,\Xi(n).$$
Therefore, by Assumption 1, we have
$$F^* - F(w(0)) \le \mathbb E[F(w(T)) - F(w(0))] \le -\eta c\sum_{t=0}^{T-1}\mathbb E[\|\nabla F(w(t))\|_1] + \frac{\eta^2LdT}2 + 2\eta dT\,\frac{c_0(n,p)}{\sqrt M} + 2\eta dT\,\Xi(n) + 4\eta d(B+\beta)\sum_{t=0}^{T-1}\frac{\tau(t)}{pM}.$$

Rearranging the inequality and plugging in $\eta = \frac1{\sqrt{dT}}$, we get
$$\frac1T\sum_{t=0}^{T-1}\mathbb E[\|\nabla F(w(t))\|_1] \le \frac1c\Big(\frac{(F(w(0)) - F^*)\sqrt d}{\sqrt T} + \frac{L\sqrt d}{2\sqrt T} + \frac{2d\,c_0(n,p)}{\sqrt M} + \frac{4d(B+\beta)\sum_{t=0}^{T-1}\tau(t)}{pTM} + 2d\,\Xi(n)\Big).$$
Second, when the system adversary is static with $\tau(t) > \frac2{p^2}\log\frac6c$, a similar argument as above yields
$$\frac1T\sum_{t=0}^{T-1}\mathbb E[\|\nabla F(w(t))\|_1] \le \frac1c\Big(\frac{(F(w(0)) - F^*)\sqrt d}{\sqrt T} + \frac{L\sqrt d}{2\sqrt T} + \frac{2d\,c_0(n,p)}{\sqrt M} + \frac{6d(B+\beta)\sum_{t=0}^{T-1}\tau(t)}{TM} + 2d\,\Xi(n)\Big).$$

D.3.2 GAUSSIAN DISTRIBUTION

Proof of Corollary 2 (Gaussian Tail Sign Errors). Most of the proof is the same as that of Theorem 3; we start from Eq. (13).

It turns out that $\mathbb E\big[\mathbb E[Y_{mi} \mid g^1_{mi},\dots,g^n_{mi}] - \frac{p\,\frac1n\sum_j g^j_{mi}}{2B+2\beta}\big]$ is small. Writing $\bar\sigma = \sigma_{mi}/\sqrt n$,
$$\frac1p\,\mathbb E\Big[\mathbb E\big[Y_{mi}\mid g^1_{mi},\dots,g^n_{mi}\big] - \frac{p\,\frac1n\sum_j g^j_{mi}}{2B+2\beta}\Big] = \underbrace{\frac{(B - \bar g_{mi})\,\mathbb P\{\frac1n\sum_j g^j_{mi} \ge B\}}{2B+2\beta}}_{(A)} - \underbrace{\frac{(B + \bar g_{mi})\,\mathbb P\{\frac1n\sum_j g^j_{mi} \le -B\}}{2B+2\beta}}_{(B)} + \underbrace{\frac{\mathbb E\big[\big(-\frac1n\sum_j g^j_{mi} + \bar g_{mi}\big)\,\mathbf 1_{\{|\frac1n\sum_j g^j_{mi}| \ge B\}}\big]}{2B+2\beta}}_{(C)}. \tag{17}$$
We have
$$(2B+2\beta)(A) \le (B - \bar g_{mi})\cdot\frac{\bar\sigma}{B - \bar g_{mi}}\cdot\frac1{\sqrt{2\pi}}\exp\Big(-\frac{(B-\bar g_{mi})^2}{2\bar\sigma^2}\Big) = \frac{\bar\sigma}{\sqrt{2\pi}}\exp\Big(-\frac{(B-\bar g_{mi})^2}{2\bar\sigma^2}\Big);$$
$$(2B+2\beta)(B) \ge (B+\bar g_{mi})\cdot\frac{\frac{B+\bar g_{mi}}{\bar\sigma}}{\big(\frac{B+\bar g_{mi}}{\bar\sigma}\big)^2 + 1}\cdot\frac1{\sqrt{2\pi}}\exp\Big(-\frac{(B+\bar g_{mi})^2}{2\bar\sigma^2}\Big) = \Big(1 - \frac{\bar\sigma^2}{(B+\bar g_{mi})^2 + \bar\sigma^2}\Big)\frac{\bar\sigma}{\sqrt{2\pi}}\exp\Big(-\frac{(B+\bar g_{mi})^2}{2\bar\sigma^2}\Big);$$
$$(2B+2\beta)(C) = -\int_B^\infty\frac{x - \bar g_{mi}}{\sqrt{2\pi}\,\bar\sigma}\exp\Big(-\frac{(x-\bar g_{mi})^2}{2\bar\sigma^2}\Big)dx - \int_{-\infty}^{-B}\frac{x-\bar g_{mi}}{\sqrt{2\pi}\,\bar\sigma}\exp\Big(-\frac{(x-\bar g_{mi})^2}{2\bar\sigma^2}\Big)dx = \frac{\bar\sigma}{\sqrt{2\pi}}\Big[\exp\Big(-\frac{(B+\bar g_{mi})^2}{2\bar\sigma^2}\Big) - \exp\Big(-\frac{(B-\bar g_{mi})^2}{2\bar\sigma^2}\Big)\Big],$$
where (A) and (B) follow from Mill's ratio, Gordon (1941). Combining (A), (B), and (C), we get
$$(17) \le \frac{p\,\bar\sigma^3}{\sqrt{2\pi}\,(2B+2\beta)\big((B+\bar g_{mi})^2 + \bar\sigma^2\big)}\exp\Big(-\frac{(B+\bar g_{mi})^2}{2\bar\sigma^2}\Big) + \frac{p\,\bar g_{mi}}{2B+2\beta} \le \frac{p\,\bar\sigma^3}{\sqrt{2\pi}\,(2B+2\beta)\big(\epsilon_0^2B_0^2 + \bar\sigma^2\big)}\exp\Big(-\frac{\epsilon_0^2B_0^2}{2\bar\sigma^2}\Big) + \frac{p\,\bar g_{mi}}{2B+2\beta} \le \frac p{4\sqrt{2\pi}}\exp\Big(-\frac n2\Big) + \frac{p\,\bar g_{mi}}{2B+2\beta},$$
where the last inequality follows because $\epsilon_0 > \frac\sigma{B_0}$ and $B := B_0 + \epsilon_0B_0 > \epsilon_0B_0$. For the first term on the right-hand side of Eq. (16), we have
$$\mathbb P\Big\{\sum_{m=1}^M Y_{mi} \ge -\tau(t)\,\Big|\,\mathcal E_1\Big\} = \mathbb P\Big\{\sum_{m=1}^M Y_{mi} - \mathbb E\Big[\sum_m Y_{mi}\,\Big|\,g^1_{mi},\dots,g^n_{mi}\Big] \ge \underbrace{-\tau(t) - \mathbb E\Big[\sum_m Y_{mi}\,\Big|\,g^1_{mi},\dots,g^n_{mi}\Big]}_{(D)}\,\Big|\,\mathcal E_1\Big\}.$$
Recall that $\mathbb E[Y_{mi}\mid g^1_{mi},\dots,g^n_{mi}] = \frac p{2B+2\beta}\mathrm{clip}(\frac1n\sum_j g^j_{mi}, B)$.
On $\mathcal E_1$, we have
$$(D) = -\tau(t) - \frac p{2B+2\beta}\sum_{m=1}^M\mathrm{clip}\Big(\frac1n\sum_j g^j_{mi}, B\Big) \ge -\tau(t) - \mathbb E\Big[\frac p{2B+2\beta}\sum_{m=1}^M\mathrm{clip}\Big(\frac1n\sum_j g^j_{mi}, B\Big)\Big] - \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)} = -\tau(t) - \sum_{m=1}^M\mathbb E[Y_{mi}] - \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)}$$
$$\ge -\tau(t) - \frac{Mp}{4\sqrt{2\pi}}\exp\Big(-\frac n2\Big) - \frac p{2(B+\beta)}\sum_{m=1}^M\bar g_{mi} - \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)}.$$
Recall that $\nabla_i F(w(t)) < 0$. When $\frac{Mp}{2(B+\beta)}|\nabla_i F(w(t))| \ge \tau(t) + \frac{Mp}{4\sqrt{2\pi}}\exp(-\frac n2) + \frac{p\,c_0(n,p)\sqrt M}{2(B+\beta)}$, we get
$$\mathbb P\Big\{\sum_{m=1}^M Y_{mi} \ge -\tau(t)\,\Big|\,\mathcal E_1\Big\} \le \mathbb P\Big\{\sum_{m=1}^M Y_{mi} - \mathbb E\Big[\sum_m Y_{mi}\,\Big|\,g^1_{mi},\dots,g^n_{mi}\Big] \ge \frac{p\,c_0(n,p)\sqrt M}{4(B+\beta)}\,\Big|\,\mathcal E_1\Big\} \le \exp\Big(-\frac{p^2c_0^2(n,p)}{8(B+\beta)^2}\Big) \le \frac{3-5c}6,$$
where the last inequality holds because $c_0(n,p) \ge \sqrt{\frac{8(B+\beta)^2}{p^2}\log\frac6{3-5c}}$. The remaining proof follows the arguments in Theorem 3.

Proof of Corollary 3 (Gaussian Tail Convergence Rate). This proof follows from Theorem 4, and we again consider two cases: the system adversary is adaptive or static with $\tau(t) \le \frac2{p^2}\log\frac6c$, or the system adversary is static with $\tau(t) > \frac2{p^2}\log\frac6c$. In both cases, plugging the sign-error bound of Corollary 2 into the descent argument of Theorem 4 yields the claimed rate. For bounded gradients (Assumption 6), $\mathrm{clip}(\frac1n\sum_{j=1}^n g^j_{mi}, B) = \frac1n\sum_{j=1}^n g^j_{mi}$; thus, the bias introduced by the tail bound is gone, and for the first term on the right-hand side of Eq. (16) the same argument as in Theorem 3 applies with $\mathbb E[Y_{mi}\mid g^1_{mi},\dots,g^n_{mi}] = \frac p{2B+2\beta}\cdot\frac1n\sum_{j=1}^n g^j_{mi}$. The remaining proof also follows the arguments in Theorem 3.

Proof of Corollary 5 (Bounded Gradient Convergence Rate). This proof follows from Theorem 4. We also consider two cases here: the system adversary is adaptive or static with small $\tau(t)$, and the static adversary with large $\tau(t)$, treated exactly as in Theorem 4.

Implementation. We build our code upon PyTorch Paszke et al. (2019). We run all experiments on four GPUs of type Tesla P100 and one GPU of type RTX 3060.

E.2 PARAMETERS

Communication rounds: 500 for both datasets in the client-sampling section; for the other sections, 80 and 300 communication rounds for MNIST and CIFAR-10, respectively. Dataset partition: clients' local datasets are balanced in size, but their label distributions are non-IID, following a Dirichlet distribution with concentration parameter α. We use a constant learning rate in all cases, tuned by grid search over η ∈ {0.0001, 0.001, 0.01, 0.1} and B ∈ {0.001, 0.01, 0.1, 1}. Although our theory indicates the algorithm is not sensitive to the mini-batch size, we set a large batch size n = 256 for both datasets. In this section, we reuse the network model and parameter settings in Table 4 and set the number of local epochs to 1 for non-sign aggregation rules. We evaluate the algorithms on the MNIST dataset, partitioned in the same manner as in Appendix E.1. We consider a total of 100 clients under full participation, 20 of which are Byzantine. The aggregation-rule-specific parameters are described below. All experimental results are collected over 5 repetitions. We compare our β-stochastic sign compressor with Krum Blanchard et al. (2017), geometric median Chen et al. (2017), and centered clipping Karimireddy et al. (2021) under three adversary models: label flipping, inner product manipulation Xie et al. (2020), and "A little is enough" Baruch et al. (2019). Following Karimireddy et al. (2021), τ is set to 10 in centered clipping since momentum is switched off. We first describe the adversary models below:



The analysis of their Theorem 6 contains major flaws.



The DP-flip mechanism $\mathcal M_{B,\mathrm{flip}}: \{\pm1\} \to \{\pm1\}$ is defined as follows: for any $b \in \{\pm1\}$, $\mathcal M_{B,\mathrm{flip}}(b) = b$ with probability $\frac{2B+\beta}{2B+2\beta}$ and $\mathcal M_{B,\mathrm{flip}}(b) = -b$ with probability $\frac{\beta}{2B+2\beta}$. It is easily checked that $\mathcal M_{B,\mathrm{flip}}$ is $\log\frac{2B+\beta}{\beta}$-DP. Proposition 2. For any $\beta > 0$, $\mathcal M_{B,\beta} \stackrel{d.}{=} \mathcal M_{B,\mathrm{flip}} \circ \mathcal M_{B,0}$, where $\stackrel{d.}{=}$ denotes equality in distribution.
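To make the two mechanisms concrete, here is a minimal NumPy sketch of the β-stochastic sign compressor and the flip mechanism. The flip probability β/(2B+2β) is our reading of the (partially garbled) definition above, chosen so that both the stated DP level log((2B+β)/β) and the composition identity of Proposition 2 hold; the Monte-Carlo check compares the direct compressor against the composition:

```python
import numpy as np

rng = np.random.default_rng(1)

def m_beta(x, B, beta, size):
    """beta-stochastic sign compressor M_{B,beta}: outputs +1 with
    probability (B + beta + clip(x, B)) / (2B + 2*beta), else -1."""
    p = (B + beta + np.clip(x, -B, B)) / (2 * B + 2 * beta)
    return np.where(rng.random(size) < p, 1, -1)

def m_flip(bits, B, beta):
    """DP-flip mechanism: flips each sign bit with probability beta/(2B+2*beta)."""
    q = beta / (2 * B + 2 * beta)
    return np.where(rng.random(bits.shape) < q, -bits, bits)

B, beta, x, N = 1.0, 0.5, 0.3, 1_000_000
direct = m_beta(x, B, beta, N)
composed = m_flip(m_beta(x, B, 0.0, N), B, beta)

# Proposition 2: both pipelines emit +1 with the same probability.
p_direct = (direct == 1).mean()
p_composed = (composed == 1).mean()
print(p_direct, p_composed)
```

With B = 1, β = 0.5 and x = 0.3, both pipelines output +1 with probability (B + β + x)/(2B + 2β) = 0.6, which the two empirical frequencies should match up to Monte-Carlo error.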

Following the road-map used in Bernstein et al. (2019); Jin et al. (2020); Safaryan & Richtárik (2021), we first establish an upper bound on the probability of gradient sign errors.

Now consider the asymptotics in terms of $T$ and the client number $M$ only. If $\tau = O(\sqrt M)$, then both $\frac{4(B+\beta)\tau d}{pM}$ and $\frac{6(B+\beta)\tau d}{M}$ scale in $M$ with order $O\big(\frac1{\sqrt M}\big)$,

approaching the convergence rate $O\big(\frac1{\sqrt T}\big)$ of standard (centralized, non-private, and adversary-free) SGD as $M \to \infty$. Fortunately, $M$ in FL is often large McMahan et al. (2017).

• Datasets: MNIST LeCun et al. (2009) and CIFAR-10 Krizhevsky et al. (2009).
• Models: Simple Multi-Layer Perceptron (MLP) McMahan et al. (2017).
• Clients' data: 100 balanced workers with α = 1 Dirichlet distribution Hsu et al. (2019).
• Baseline algorithms: SignSGD Bernstein et al. (2018) and FedAvg McMahan et al. (2017).
• Byzantine baselines: Krum Blanchard et al. (2017), geometric median Chen et al. (2017), centered clipping Karimireddy et al. (2021).
• Adversary models: label flipping, inner product manipulation Xie et al. (2020), and "A little is enough" Baruch et al. (2019).
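For concreteness, the Dirichlet(α) non-IID partition above can be realized as in the following sketch (assuming NumPy; the function name and details are ours, following the recipe of Hsu et al. (2019)):

```python
import numpy as np

rng = np.random.default_rng(2)

def dirichlet_partition(labels, num_clients=100, alpha=1.0):
    """Split sample indices among clients with per-class proportions drawn
    from Dirichlet(alpha). Smaller alpha => more skewed (non-IID) local
    label distributions; alpha -> infinity approaches an IID split."""
    num_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # One Dirichlet draw per class decides how that class is shared.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

labels = rng.integers(0, 10, size=60_000)  # stand-in for MNIST labels
parts = dirichlet_partition(labels)
# Every sample is assigned to exactly one client.
assert sum(len(p) for p in parts) == 60_000
```

With α = 1 (the setting used in the experiments), local label histograms are moderately heterogeneous without degenerating into single-class clients.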

Figure 3: CIFAR-10 testing results under non-IID distribution with different Byzantine fractions and (a) no differential privacy protection (β = 0); (b) differential privacy protection (β = B). Parameters B and β. The peak performances are collected and listed in Table 2. We observe that the testing accuracy drops as β increases. This matches Theorem 4: increasing β pushes the bound farther from a stationary point, implying our variant is subject to a utility-privacy trade-off. For any given B, the testing accuracy is comparable when the differential privacy parameter β is relatively small, e.g., β ∈ {0.1B, B}, which is fortunate: we need not sacrifice much data utility while ensuring a privacy quantification.

$\log\frac6c$, plug in the threshold $|\nabla_i F(w(t))| \ge \frac{2(B+\beta)}{Mp}\tau(t) + \dots$ to obtain $\frac1T\sum_{t=0}^{T-1}\mathbb E[\|\nabla F(w(t))\|_1] \le \frac1c\big(\frac{(F(w(0)) - F^*)\sqrt d}{\sqrt T} + \dots\big)$; when the system adversary is static with $\tau(t) > \frac2{p^2}\log\frac6c$, plug in the corresponding threshold on $|\nabla_i F(w(t))|$. Proof of Corollary 4 (Bounded Gradient Sign Errors). This proof follows from Theorem 3. Notably, if we choose $B = \bar B$, the clipping in the compressor is inactive: $\mathrm{clip}\big(\frac1n\sum_{j=1}^n g^j_{mi}, B\big) = \frac1n\sum_{j=1}^n g^j_{mi}$.

Recall that $\nabla_i F(w(t)) < 0$. When $|\nabla_i F(w(t))| \ge 2(B+\beta)\tau(t)\,\dots$

$\log\frac6c$, plug in $|\nabla_i F(w(t))| \ge 2(B+\beta)\tau(t)\,\dots$; when the system adversary is static with $\tau(t) > \frac2{p^2}\log\frac6c$, plug in the corresponding threshold on $|\nabla_i F(w(t))|$.

• MNIST LeCun et al. (2009): contains 60,000 training images and 10,000 testing images from 10 classes.
• CIFAR-10 Krizhevsky et al. (2009): contains 50,000 training images and 10,000 testing images from 10 classes.

the strong assumption of bounded dissimilarity of local gradients, which often does not hold when data efficiency is taken into account Su et al. (2022). Privacy Preservation. FL is renowned for its capability to decouple model training from raw data collection by communicating only model/gradient parameters McMahan et al. (2017).

To capture this, following the literature Kairouz et al. (2021); Li et al. (2020b); Philippenko & Dieuleveut (2020), instead of full participation, we assume that in each iteration a client successfully uploads its local update with probability p, independently across rounds and independently of the PS and the other clients.

In expectation, Algorithm 1 pushes w(t) towards a stationary point of the global objective F. Additionally, a small |∇_i F(w(t))| implies that w(t) is already in the neighborhood of a stationary point. Different from Safaryan & Richtárik (2021) and Jin et al. (2020), …

We train the MLP under full client participation for 80 and 300 communication rounds in the first two comparisons for MNIST and CIFAR-10, respectively. For client sampling and Byzantine resilience, we extend the time horizon to 500 communication rounds for both datasets. Mini-batch size. We compare the peak performance of our variant under different mini-batch sizes. Table 1 shows that Algorithm 1 is not sensitive to the mini-batch size n. This matches Remark 2 (3), as n = Ω(log M) or n = Ω(M…

Testing results on two datasets with different combinations of B and β/B (ϵ).



A COMPARISONS WITH JIN ET AL. (2020)

Gradient batch size. Jin et al. (2020): full-batch (true gradient) [their Eq. (33)]. Our work: mini-batch stochastic gradients.

Gradient noise. Jin et al. (2020): not handled, since only bounded gradients are considered. Our work: $O(\exp(-\frac n2))$ for sub-Gaussian noise; $O(n^{-p'/2})$ ($p' \ge 4$) for heavy-tailed noise.

Differential privacy. Jin et al. (2020): none for the sto-sign compressor; flawed arguments for the dp-sign compressor. Our work: $d\log(1 + \frac{2B}\beta)$ for arbitrary gradients; a separate guarantee for gradient pairs with bounded $\ell_1$ sensitivity $\Delta_1$.

Partial client participation. Jin et al. (2020): no. Our work: theoretically and empirically verified, and adaptive Byzantine adversaries are built on it.

C.1 ALTERNATIVE ASSUMPTIONS

The following two alternative assumptions on the randomness of the stochastic gradients are of decreasing stringency. Assumption 6 (Boundedness). The ℓ∞ norm of every possible stochastic gradient is upper bounded. Formally, let $m \in [M]$ be an arbitrary client and $g$ an arbitrary stochastic gradient that client $m$ obtains; there exists $\bar B > 0$ such that $|g_i| \le \bar B$ for every coordinate $i \in [d]$. The following alternative assumption relaxes the boundedness requirement and allows the stochastic gradients to be supported on the entire $\mathbb R^d$.

D PROOFS D.1 AGGREGATION FUNCTIONS

Proof of Proposition 1 (Equivalence to Majority Vote). The intuition behind this proof is to show that, given $u_m \in \{\pm1\}^d$ for $m \in \mathcal S$, the signs of all the aggregation rules mentioned in the proposition statement coincide with the sign of the $k$-trimmed-mean aggregation rule.

We first show that for any $k < |\mathcal S|/2$, the signs of the outputs of $\mathrm{agg}_{\mathrm{trimmed},k}$ are the same. When $k < |\mathcal S|/2$, it holds that $\mathcal R_i \ne \emptyset$ for each $i \in [d]$; thus the aggregation rule $\mathrm{agg}_{\mathrm{trimmed},k}$ with $k < |\mathcal S|/2$ is deterministic.

Fix a coordinate $i \in [d]$. If the sign of $\mathrm{agg}_{\mathrm{trimmed},k}$ is 0, then by definition there are equal numbers of $1$ and $-1$ in $\{u_{mi} : m \in \mathcal R_i\}$, and the top (resp. bottom) $k$ elements removed from $\{u_{mi} : m \in \mathcal S\}$ are $1$ (resp. $-1$). That is, there are equal numbers of $1$ and $-1$ in $\{u_{mi} : m \in \mathcal S\}$. Hence, for any $k' \ne k$, as long as the remaining set $\mathcal R'_i$ after trimming is nonempty (which is ensured by the condition $k' < |\mathcal S|/2$), it holds that $\mathrm{agg}_{\mathrm{trimmed},k'}(\{u_{mi} : m \in \mathcal S\}) = 0$.

If the sign of $\mathrm{agg}_{\mathrm{trimmed},k}$ is $-1$, then there are more $-1$ than $1$ in $\{u_{mi} : m \in \mathcal R_i\}$, the bottom $k$ elements of $\{u_{mi} : m \in \mathcal S\}$ are all $-1$, and the number of $1$ among the top $k$ elements is at most $k$. That is, there are more $-1$ than $1$ in $\{u_{mi} : m \in \mathcal S\}$. Hence, for any $k'$, as long as the remaining set $\mathcal R'_i$ after trimming is nonempty, the sign of $\mathrm{agg}_{\mathrm{trimmed},k'}(\{u_{mi} : m \in \mathcal S\})$ is $-1$. The case where the sign of $\mathrm{agg}_{\mathrm{trimmed},k}$ is $1$ is symmetric.

The above argument, combined with the definition of $\mathrm{agg}_{\mathrm{maj}}$, immediately implies that when $k < |\mathcal S|/2$, the signs of $\mathrm{agg}_{\mathrm{trimmed},k}$ and $\mathrm{agg}_{\mathrm{maj}}$ are the same. Finally, since $\mathrm{agg}_{\mathrm{avg}} = \mathrm{agg}_{\mathrm{trimmed},0}$ and $\mathrm{agg}_{\mathrm{median}} = \mathrm{agg}_{\mathrm{trimmed},\lfloor(|\mathcal S|-1)/2\rfloor}$, the signs of $\mathrm{agg}_{\mathrm{avg}}$, $\mathrm{agg}_{\mathrm{median}}$, and $\mathrm{agg}_{\mathrm{trimmed},k}$ for $k < |\mathcal S|/2$ are all the same, proving the proposition.
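The equivalence can be exercised numerically: for sign-vector inputs, the simple mean (k = 0), every valid trimmed mean, the median (k = ⌊(|S|-1)/2⌋), and majority vote all produce the same coordinate-wise signs. A small NumPy sketch (helper names are ours; an odd |S| is used so that no coordinate ties at zero):

```python
import numpy as np

rng = np.random.default_rng(3)

def trimmed_mean(col, k):
    """Drop the k largest and k smallest values, then average the rest."""
    s = np.sort(col)
    return s[k:len(s) - k].mean() if k > 0 else s.mean()

# |S| = 9 clients, d = 50 coordinates, each message a sign vector in {-1, +1}^d.
S, d = 9, 50
U = rng.choice([-1, 1], size=(S, d))

signs = []
for k in range((S + 1) // 2):  # all valid trims: k = 0, ..., 4 < |S|/2
    agg = np.array([trimmed_mean(U[:, i], k) for i in range(d)])
    signs.append(np.sign(agg))
# k = 0 is the simple mean and k = (|S|-1)//2 is the median; majority vote
# is the sign of the coordinate-wise sum.
majority = np.sign(U.sum(axis=0))
assert all(np.array_equal(s, majority) for s in signs)
```

The design point this illustrates: once clients report sign bits, the choice among these aggregation rules is immaterial to the output, so the server may as well use the cheapest one (majority vote).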

D.2 PRIVACY PRESERVATION

Theorem 5. (Dwork et al., 2014, Corollary 3.15)

Proof of Theorem 1 (Necessity of β). We first consider the setting when β = 0. Let $g \in \mathcal G$. Without loss of generality, assume $|g_1 - B| \le 1$, where $g_1$ is the first entry of $g$. If $g_1 \ge B$, then there exists $g' \in \mathbb R^d$ such that $g' \ne g$, $g'_1 \in (-B, B)$, and $\|g - g'\|_1 \le 1$. Let $\tilde g_1$ and $\tilde g'_1$ be the compressed values of $g_1$ and $g'_1$ under our compressor in Eq. (2). It holds that …

• Label flipping: if the original label is x, the adversary replaces it with 9 - x.
• Inner product manipulation: the adversaries send $-\frac{\gamma}{|\mathcal N|}\sum_{i\in\mathcal N}\nabla f(w_i)$, instead of honest messages, to mislead the parameter server, where γ is the strength of the adversary. We let γ = 0.1.
• A little is enough: the adversaries estimate the benign clients' mean $\mu_{\mathcal N}$ and standard deviation $\sigma_{\mathcal N}$, then construct new messages $\mu_{\mathcal N} + z\sigma_{\mathcal N}$ and upload them to the parameter server, where z is the strength of the adversary. We choose z following Baruch et al. (2019), using $s = \lfloor\frac M2 + 1\rfloor - |\mathcal B(t)|$ and the cumulative distribution function Φ of the standard normal distribution; in our setting, z ≈ 0.5.

In Section 7, we present the performance of our compressor under sign-flipping attacks. For sign-bit messages this is the worst-case scenario, as the adversaries' messages cannot escape a binary value; otherwise they would be detected by the PS and filtered out. We consider a condition milder than sign-flipping attacks for fairness of comparison: we allow adversaries to manipulate the mini-batch stochastic gradient, but assume an honest compressor that sends out the correctly compressed corrupted messages to the PS. Throughout the experiments, it is observed that our β-stochastic sign compressor outperforms all other baseline algorithms when β = 0 or β = B. Notably, our compressor saves up to 31x communication and is differentially private when β > 0.
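The three adversary models can be sketched as follows (a minimal NumPy illustration with our own helper names; γ and z follow the settings above):

```python
import numpy as np

rng = np.random.default_rng(4)

def label_flip(y):
    """Label flipping on a 10-class task: label x becomes 9 - x."""
    return 9 - y

def inner_product_manipulation(honest_grads, gamma=0.1):
    """Byzantine clients send -gamma * (mean of honest gradients)."""
    return -gamma * honest_grads.mean(axis=0)

def a_little_is_enough(honest_grads, z=0.5):
    """Byzantine clients send mean + z * std of the honest gradients,
    staying within the benign spread to evade outlier filtering."""
    mu = honest_grads.mean(axis=0)
    sigma = honest_grads.std(axis=0)
    return mu + z * sigma

honest = rng.normal(size=(80, 16))  # 80 honest clients, gradient dim 16
atk = inner_product_manipulation(honest)
# The manipulated message is anti-aligned with the honest mean direction.
assert np.dot(atk, honest.mean(axis=0)) <= 0
assert np.array_equal(label_flip(np.arange(10)), np.arange(9, -1, -1))
```

Note the contrast: inner product manipulation is overtly adversarial in direction, while "A little is enough" hides inside the benign statistics, which is why distance-based filters such as Krum struggle against it.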

