z-SIGNFEDAVG: A UNIFIED STOCHASTIC SIGN-BASED COMPRESSION FOR FEDERATED LEARNING

Abstract

Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm but suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD (Bernstein et al., 2018) , have been proposed as a biased gradient compression technique for reducing the communication cost. However, sign-based algorithms could diverge under heterogeneous data, which thus motivated the development of advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates. Besides, none of them allows multiple local SGD updates like FedAvg (McMahan et al., 2017) . In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the tradeoff between gradient bias and convergence performance, but also provides a unified viewpoint to existing stochastic sign-based methods. More importantly, the unified noisy perturbation scheme enables the development of the very first sign-based FedAvg algorithm (z-SignFedAvg) to accelerate the convergence. Theoretically, we show that z-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under the uniformly distributed noise, can enjoy the same convergence rate as its uncompressed counterpart. Extensive experiments are conducted to demonstrate that the z-SignFedAvg can achieve competitive empirical performance on real datasets and outperforms existing schemes.

1. INTRODUCTION

We consider the Federated Learning (FL) network with one parameter server and $n$ clients (McMahan et al., 2017; Li et al., 2020a), with the focus on solving the following distributed learning problem

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where $f_i(\cdot)$ is the local objective function of the $i$-th client, for $i = 1, \ldots, n$. Throughout this paper, we assume that each $f_i$ is smooth and possibly non-convex. The local objective functions are generated from the local dataset owned by each client. When designing distributed algorithms to solve (1), a crucial aspect is communication efficiency, since a massive number of clients need to transmit their local gradients to the server frequently (Li et al., 2020a). As one of the most popular FL algorithms, the federated averaging (FedAvg) algorithm (McMahan et al., 2017; Konečnỳ et al., 2016) performs multiple local SGD updates with periodic communication to reduce the communication cost. Another way is to compress the local gradients before sending them to the server (Li et al., 2020a; Alistarh et al., 2017; Reisizadeh et al., 2020). Among the existing compression methods, a simple yet elegant technique is to take the sign of each coordinate of the local gradients, which requires only one bit per coordinate. For any $x \in \mathbb{R}$, we define the sign operator as $\mathrm{Sign}(x) = 1$ if $x \geq 0$ and $-1$ otherwise.

It has been shown recently that optimization algorithms with sign-based compression can enjoy great communication efficiency while still achieving empirical performance comparable to that of uncompressed algorithms (Bernstein et al., 2018; Karimireddy et al., 2019; Safaryan & Richtárik, 2021). However, for distributed learning, especially in scenarios with heterogeneous data, i.e., $f_i \neq f_j$ for every $i \neq j$, a naive application of the sign-based algorithm may end up with divergence (Karimireddy et al., 2019; Chen et al., 2020a; Safaryan & Richtárik, 2021).
A counterexample for sign-based distributed gradient descent. Consider the one-dimensional problem with two clients: $\min_{x \in \mathbb{R}} (x - A)^2 + (x + A)^2$, where $A > 0$ is some constant. For any $x \in [-A, A)$, the aggregated sign gradient at $x$ is $\mathrm{Sign}(x - A) + \mathrm{Sign}(x + A) = 0$, i.e., the algorithm never moves. Similar examples are also discussed by (Chen et al., 2020a; Safaryan & Richtárik, 2021). The fundamental reason for this undesirable result is the uncontrollable bias brought by the sign-based compression.

There are mainly two approaches to fixing this issue in the existing literature. The first one is the stochastic sign-based method, which introduces stochasticity into the sign operation (Jin et al., 2020; Safaryan & Richtárik, 2021; Chen et al., 2020a), and the second one is the Error-Feedback (EF) method (Karimireddy et al., 2019; Vogels et al., 2019; Tang et al., 2019). However, these works are still unsatisfactory. On one hand, both the theoretical convergence rates and the empirical performance of these algorithms are still worse than those of uncompressed algorithms like (Ghadimi & Lan, 2013; Yu et al., 2019). On the other hand, none of them allows the clients to perform multiple local SGD updates within one communication round as FedAvg does, which makes them less communication efficient. This work aims at addressing these issues and closing the gaps for sign-based methods.

Main contributions. Our contributions are summarized as follows.

(1) A unified family of stochastic sign operators. We show an intriguing fact: the bias brought by the sign-based compression can be flexibly controlled by injecting a proper amount of random noise before the sign operation. In particular, our analysis is based on a novel noisy perturbation scheme with a general symmetric noise distribution, which also provides a unified framework to understand existing stochastic sign-based methods including (Jin et al., 2020; Safaryan & Richtárik, 2021; Chen et al., 2020a).
(2) The first sign-based FedAvg algorithm. In contrast to the existing sign-based methods, which do not allow multiple local SGD updates within one communication round, we design, based on the proposed stochastic sign-based compression, a novel family of sign-based federated averaging algorithms (z-SignFedAvg) that can achieve the best of both worlds: high communication efficiency and fast convergence rate.

(3) New theoretical convergence rate analyses. By leveraging the asymptotic unbiasedness property of the stochastic sign-based compression, we derive a series of theoretical results for z-SignFedAvg and demonstrate its improved convergence rates over the existing sign-based methods. In particular, we show that by injecting a sufficiently large uniform noise, z-SignFedAvg can match the convergence rate of the uncompressed algorithms.

Organization. In Section 2, the proposed general noisy perturbation scheme for the sign-based compression and its key property, i.e., asymptotic unbiasedness, are presented. Inspired by this result, the main algorithms are devised in Section 3 together with their convergence analyses under different noise distribution parameters. We evaluate our proposed algorithms on real datasets and benchmark them against existing sign-based methods in Section 4. Finally, conclusions are drawn in Section 5.

Notations. For any $x \in \mathbb{R}^d$, we denote by $x(j)$ the $j$-th element of the vector $x$. We define the $\ell_p$-norm for $p \geq 1$ as $\|x\|_p = \left(\sum_{j=1}^{d} |x(j)|^p\right)^{1/p}$. We write $\|\cdot\| = \|\cdot\|_2$ and $\|x\|_\infty = \max_{j \in \{1,\ldots,d\}} |x(j)|$. For any function $f(x)$, we denote by $f^{(k)}(x)$ its $k$-th derivative, and for a vector $x = [x(1), \ldots, x(d)]^\top \in \mathbb{R}^d$, we define $\mathrm{Sign}(x) = [\mathrm{Sign}(x(1)), \ldots, \mathrm{Sign}(x(d))]^\top$.

1.1. RELATED WORKS

Stochastic sign-based method. Our proposed algorithm belongs to this category. Among the existing works (Safaryan & Richtárik, 2021; Jin et al., 2020; Chen et al., 2020a), the setting considered by (Safaryan & Richtárik, 2021) is closest to ours, since the latter two consider gradient compression not only in the uplink but also in the downlink. Despite this difference and the use of different convergence metrics, the algorithms therein achieve the same convergence rate $O(\tau^{-1/4})$, where $\tau$ is the total number of gradient queries to the local objective function. Compared to existing works, our proposed z-SignFedAvg requires a slightly stronger assumption on the mini-batch gradient noise, but achieves a faster convergence rate $O(\tau^{-1/3})$ or even $O(\tau^{-1/2})$, with the standard squared $\ell_2$-norm of gradients as the convergence metric.

Error-Feedback method. The error-feedback (EF) method was first proposed by (Seide et al., 2014) and later theoretically justified by (Karimireddy et al., 2019). Then, (Vogels et al., 2019; Tang et al., 2019; 2021a) further extended the EF method to distributed and adaptive gradient schemes. The key idea of the EF-based methods is to show that the sign operator scaled by the gradient norm is a contractive compressor, and that the error induced by the contractive compressor can be compensated. However, such EF-based methods cannot deal with partial client participation, since otherwise the error residuals cannot be correctly tracked. Besides, the EF-based methods have a convergence rate $O(\tau^{-1/2} + d^2\tau^{-1})$, where $d$ is the dimension of the gradients, and therefore are not competitive for high-dimensional problems.

Unbiased quantization method. Apart from the sign-based gradient compression, another popular way of compression is the unbiased stochastic quantization method adopted by (Alistarh et al., 2017; Reisizadeh et al., 2020; Haddadpour et al., 2021).
A key assumption made by this category of methods is that the quantization error is bounded by the norm of the input, which however does not hold for sign-based compression, and therefore the existing convergence results therein do not apply to sign-based methods. Besides, as shown in (Alistarh et al., 2017; Reisizadeh et al., 2020) , these methods usually have degraded convergence speed when fewer quantization bits are used. As mentioned, some of the existing sign-based methods like (Chen et al., 2020a; Safaryan & Richtárik, 2021) do not adopt the standard squared ℓ 2 -norm of gradients as the metric for the convergence rate analysis. Thus, it is tricky to make a fair comparison between them and the proposed z-SignFedAvg. In Appendix A, we provide a detailed discussion and summarize the convergence rates of some representative algorithms in Table 2 .

2. SIGN OPERATOR WITH SYMMETRIC AND ZERO-MEAN NOISE

In this section, we introduce a general noisy perturbation scheme for the sign-based compression and analyze the asymptotic unbiasedness of compressed gradients. The results serve as the foundation for the proposed algorithms in subsequent sections.

Key observation.

Let $\xi$ be a random variable that is symmetric, zero-mean and has the p.d.f. $p(t)$. If $p(0) \neq 0$ and $p(t)$ is continuous and uniformly bounded on $(-\infty, +\infty)$, then it holds that

$$\lim_{\sigma \to +\infty} \frac{\sigma}{2p(0)} \mathbb{E}[\mathrm{Sign}(x + \sigma\xi)] = \lim_{\sigma \to +\infty} \frac{\sigma}{p(0)} \int_0^{x/\sigma} p(t)\,dt = x. \qquad (2)$$

In other words, the perturbed sign operator is an asymptotically unbiased estimator of the input $x$ when $\sigma \to \infty$. Furthermore, assume that $p(t)$ is uniformly bounded on $(-\infty, +\infty)$ and differentiable to arbitrary order. Then, by Taylor expansion, we have

$$\frac{\sigma}{p(0)} \int_0^{x/\sigma} p(t)\,dt = x + \frac{1}{p(0)} \sum_{k=1}^{+\infty} \frac{p^{(k)}(0)\, x^{k+1}}{(k+1)!\,\sigma^{k}} = x + \sum_{k=1}^{+\infty} p^{(k)}(0)\, O(\sigma^{-k}).$$

Therefore, suppose that $K$ is the largest integer such that $p^{(1)}(0) = 0, \ldots, p^{(K)}(0) = 0$. The LHS of (2) will then converge to $x$ at the rate $O(\sigma^{-(K+1)})$. This observation motivates us to propose the following family of noise distributions parameterized by a positive integer $z \in \mathbb{Z}_+$.

Definition 1 (z-distribution). A random variable $\xi_z$ is said to follow the z-distribution if its p.d.f. is

$$p_z(t) = \frac{1}{2\eta_z} e^{-\frac{t^{2z}}{2}}, \qquad (3)$$

where $\eta_z = 2^{\frac{1}{2z}}\, \Gamma\!\left(1 + \frac{1}{2z}\right)$ and $\Gamma(a) = \int_0^{+\infty} t^{a-1} e^{-t}\,dt$ is the Gamma function.

It can be verified that $p_z(t)$ in (3) is a valid p.d.f. When $z = 1$, it corresponds to the standard Gaussian distribution. In addition, one can also show that $p_z(t)$ converges to the p.d.f. of the uniform random variable on the interval $[-1, 1]$ when $z \to +\infty$ (see Lemma 2 in Appendix B). This z-distribution has a nice property that can be leveraged to bound the bias caused by the sign-based compression, as stated in the following lemma.

Lemma 1. For any $x \in \mathbb{R}^d$ and $\sigma > 0$,

$$\left\| \eta_z \sigma\, \mathbb{E}[\mathrm{Sign}(x + \sigma\xi_z)] - x \right\|^2 \leq \frac{\|x\|_{4z+2}^{4z+2}}{4(2z+1)^2 \sigma^{4z}}, \qquad (4)$$

where $\xi_z(1), \ldots, \xi_z(d)$ are i.i.d. and follow the z-distribution.

Remark 1. One can see that the RHS of (4) involves the term $(\|x\|_{4z+2}/\sigma)^{4z}$. Thus, as long as $\sigma > \|x\|_\infty$, the LHS of (4) converges to zero when $z \to +\infty$.
Since Lemma 2 implies that ξ ∞ follows the i.i.d uniform distribution on [-1, 1], we obtain σE [Sign(x + σξ ∞ )] = x as long as σ > ∥x∥ ∞ . It is interesting to remark that the stochastic sign operators proposed in (Jin et al., 2020; Safaryan & Richtárik, 2021) are exactly the sign operator injected by the uniform noise, and (Chen et al., 2020a) also considered the use of a symmetric noise for gradient perturbation. Thus, sign-based compression with the z-distribution offers a unified perspective to understand the relationship among the existing stochastic sign-based methods.
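As a numerical sanity check of Lemma 1 in the scalar case, the following minimal sketch evaluates the scaled expectation $\eta_z \sigma\, \mathbb{E}[\mathrm{Sign}(x + \sigma\xi_z)]$ by integrating the p.d.f. in (3), using the identity $\mathbb{E}[\mathrm{Sign}(x + \sigma\xi_z)] = 2\int_0^{x/\sigma} p_z(t)\,dt$ from (2). The function names and the choice of Simpson's rule are illustrative assumptions and are not part of the analysis above.

```python
import math


def eta(z):
    """Normalizing constant eta_z = 2^(1/(2z)) * Gamma(1 + 1/(2z)) from Definition 1."""
    return 2.0 ** (1.0 / (2 * z)) * math.gamma(1.0 + 1.0 / (2 * z))


def pdf(t, z):
    """p.d.f. of the z-distribution, equation (3)."""
    return math.exp(-(t ** (2 * z)) / 2.0) / (2.0 * eta(z))


def expected_sign(x, sigma, z, steps=2000):
    """E[Sign(x + sigma * xi_z)] = 2 * int_0^{x/sigma} p_z(t) dt, via Simpson's rule.

    `steps` must be even; it is an illustrative accuracy knob.
    """
    a = x / sigma
    h = a / steps
    total = pdf(0.0, z) + pdf(a, z)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h, z)
    return 2.0 * total * h / 3.0


def scaled_bias(x, sigma, z):
    """eta_z * sigma * E[Sign(x + sigma * xi_z)] - x, the quantity bounded in Lemma 1."""
    return eta(z) * sigma * expected_sign(x, sigma, z) - x


def lemma1_bound(x, sigma, z):
    """Square root of the RHS of (4) for a scalar input x."""
    return abs(x) ** (2 * z + 1) / (2.0 * (2 * z + 1) * sigma ** (2 * z))


if __name__ == "__main__":
    for z in (1, 2, 4):
        for sigma in (2.0, 4.0, 8.0):
            print(z, sigma, abs(scaled_bias(1.0, sigma, z)), lemma1_bound(1.0, sigma, z))
```

Under these assumptions, the printed bias should shrink as either $\sigma$ or $z$ grows, and should stay below the bound in each row.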

3. z-SIGNFEDAVG ALGORITHM

In this section, based on the analysis in Section 2, we propose the following sign-based FedAvg algorithm, termed z-SignFedAvg. While FedAvg-type algorithms with gradient compression are also presented in (Haddadpour et al., 2021), they require unbiased compression and are not applicable to sign-based methods. The details of z-SignFedAvg are presented in Algorithm 1. A prominent difference between the proposed z-SignFedAvg and the existing sign-based methods is that the clients are allowed to perform multiple SGD updates per communication round ($E > 1$) before applying the stochastic sign-based compression. Like the FedAvg algorithm, it is anticipated that z-SignFedAvg can greatly benefit from this and has a significantly reduced communication cost. Note that in practice we only consider $z = 1$ and $z = +\infty$ for z-SignFedAvg, since they correspond to the Gaussian distribution and the uniform distribution, respectively. Nevertheless, we are interested in the convergence properties of z-SignFedAvg for a general positive integer $z$, as it provides better insights on the role of $z$ in the convergence rate.

Algorithm 1 z-SignFedAvg (or z-SignSGD when E = 1)
Require: Total communication rounds $T$, number of local steps $E$, number of clients $n$, client stepsize $\gamma$, server stepsize $\eta$, noise coefficient $\sigma$, parameter of noise distribution $z$.
1: Initialize $x_0$.
2: for $t = 1$ to $T$ do
3:   On Clients:
4:   for $i = 1$ to $n$ do
5:     $x^i_{t-1,0} = x_{t-1}$
6:     for $s = 1$ to $E$ do
7:       $g^i_{t-1,s} = g_i(x^i_{t-1,s-1})$, where $g_i(\cdot)$ is the mini-batch gradient oracle of the $i$-th client.
8:       $x^i_{t-1,s} = x^i_{t-1,s-1} - \gamma g^i_{t-1,s}$.
9:     end for
10:    $\Delta^i_{t-1} = \mathrm{Sign}\!\left(\frac{x_{t-1} - x^i_{t-1,E}}{\gamma} + \sigma \xi_z\right)$,

We first state some standard assumptions for problem (1).

Assumption 1. We assume that each $f_i(x)$ has the following properties:

A.1 We can access a mini-batch gradient oracle that is unbiased and has bounded variance, i.e., $\mathbb{E}[g_i(x)] = \nabla f_i(x)$ and $\mathbb{E}[\|g_i(x) - \nabla f_i(x)\|_2^2] \leq \zeta^2$.

A.2 Each $f_i$ is smooth, i.e., there exist non-negative constants $L_1, \ldots, L_d$ such that, for any $x, y \in \mathbb{R}^d$, $f(y) - f(x) \leq \langle \nabla f(x), y - x \rangle + \sum_{j=1}^{d} \frac{L_j (y(j) - x(j))^2}{2}$.

A.3 There exists some constant $f^*$ such that $f(x) \geq f^*$, $\forall x \in \mathbb{R}^d$.

A.4 There exists a constant $G \geq 0$ such that $\|\nabla f_i(x)\| \leq G$, $\forall i = 1, \ldots, n$, and $x \in \mathbb{R}^d$.

Assumption A.2 is a more fine-grained assumption on the function smoothness than the commonly used one and is also used by (Bernstein et al., 2018; Safaryan & Richtárik, 2021). For the convergence rate analysis, we consider two cases, namely, the case with $z < +\infty$ and the case of $z = +\infty$.
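To make the client-side steps of Algorithm 1 concrete, here is a minimal scalar sketch in plain Python, run on the two-client counterexample from the introduction. Two parts are assumptions rather than statements of the paper: the server update `x -= eta * gamma * mean(deltas)` is a plausible aggregation rule inferred from the server stepsize $\eta$ and the scaling in Lemma 1, and the sampler for $\xi_z$ relies on the standard fact that if $U \sim \mathrm{Gamma}(1/(2z), 1)$ then $(2U)^{1/(2z)}$ has the law of $|\xi_z|$. All function names are hypothetical.

```python
import math
import random


def sign(v):
    """Sign operator from the introduction: +1 if v >= 0, else -1."""
    return 1.0 if v >= 0 else -1.0


def sample_z_noise(z, rng):
    """Draw xi_z; z = float('inf') gives Uniform[-1, 1], z = 1 the standard Gaussian."""
    if math.isinf(z):
        return rng.uniform(-1.0, 1.0)
    # |xi_z| = (2U)^(1/(2z)) with U ~ Gamma(shape=1/(2z), scale=1), plus a random sign.
    magnitude = (2.0 * rng.gammavariate(1.0 / (2 * z), 1.0)) ** (1.0 / (2 * z))
    return magnitude if rng.random() < 0.5 else -magnitude


def z_sign_fedavg(grads, x0, T, E, gamma, eta, sigma, z, seed=0):
    """Scalar sketch of Algorithm 1; `grads` is a list of per-client gradient oracles."""
    rng = random.Random(seed)
    x = x0
    for _ in range(T):
        deltas = []
        for g in grads:
            x_local = x
            for _ in range(E):                      # steps 6-9: E local (S)GD updates
                x_local -= gamma * g(x_local)
            pseudo_grad = (x - x_local) / gamma     # argument of Sign in step 10
            noise = sigma * sample_z_noise(z, rng) if sigma > 0 else 0.0
            deltas.append(sign(pseudo_grad + noise))
        # ASSUMPTION: server averages the received signs and steps with eta * gamma.
        x -= eta * gamma * sum(deltas) / len(deltas)
    return x


if __name__ == "__main__":
    A = 1.0
    clients = [lambda x: 2.0 * (x - A), lambda x: 2.0 * (x + A)]
    # Deterministic sign (sigma = 0): the two signs cancel and x never moves.
    print(z_sign_fedavg(clients, 0.9, 200, 2, 1e-3, 1.0, 0.0, 1))
    # Uniform noise with eta = sigma: the iterate drifts toward the minimizer x = 0.
    print(z_sign_fedavg(clients, 0.9, 5000, 2, 1e-3, 10.0, 10.0, float("inf")))
```

The noisy run uses full gradients, so the only randomness is the injected noise; the noise scale 10 is chosen here to exceed the accumulated local gradient over $E = 2$ steps.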

3.1. CASE 1: z < +∞

As we can see from Lemma 1, there always exists some gradient bias when $z < +\infty$. In order to bound it, we further assume that a higher-order moment of the mini-batch gradient noise is bounded.

Assumption 2. There exists a constant $Q_z \geq 0$ such that for any $x \in \mathbb{R}^d$, we have

$$\mathbb{E}\left[\|g_i(x) - \nabla f_i(x)\|_{4z+2}^{4z+2}\right] \leq Q_z. \qquad (5)$$

Theorem 1. Suppose that Assumptions 1 and 2 hold. Denote $x_{t,s} = \frac{1}{n} \sum_{i=1}^{n} x^i_{t,s}$ and $L_{\max} = \max_j L_j$. Then, for $\eta = \eta_z \sigma$, $\gamma \leq \frac{1}{L_{\max}}$ and $z < +\infty$ in Algorithm 1, we have

$$\begin{aligned}
\mathbb{E}\left[\frac{1}{TE} \sum_{t=1}^{T} \sum_{s=1}^{E} \|\nabla f(x_{t-1,s-1})\|^2\right] \leq{} & \underbrace{\frac{2\mathbb{E}[f(x_0) - f^*]}{TE\gamma} + \frac{\gamma \zeta^2 L_{\max}}{n} + \frac{4\gamma^2 (E-1) E L_{\max}^2 (\zeta^2 + G^2)}{3}}_{\text{(a) Standard terms in FedAvg}} & (6a) \\
& + \underbrace{\frac{2^{2z+1} E^{2z} \sqrt{Q_z + G^{4z+2}}\, G}{\sqrt{2}\,(2z+1)\,\sigma^{2z}} + \frac{\gamma\, 2^{4z} E^{4z+1} (Q_z + G^{4z+2}) L_{\max}}{2(2z+1)^2 \sigma^{4z}}}_{\text{(b) Bias terms}} & (6b) \\
& + \underbrace{\frac{4\eta_z^2 \gamma \sigma^2 \sum_{j=1}^{d} L_j}{En}}_{\text{(c) Variance term}}. & (6c)
\end{aligned}$$

When is the bound non-trivial? Since we assume that the $\ell_2$-norm of the gradient is bounded by $G$, all the terms in the RHS of (6) should be no larger than $G^2$. For example, to have the first term in (6b) less than $G^2$, one requires $\sigma$ to be greater than $2^{1+\frac{1}{4z}} E \left(Q_z/G + G^{4z}\right)^{\frac{1}{4z}} / (2z+1)^{\frac{1}{2z}}$.

Bias-variance trade-off. An interesting observation from Theorem 1 is that there exists a trade-off between the bias and variance terms. The terms in (6b) are caused by the gradient bias of the sign operation (see (4)) and vanish at the rate $O(\sigma^{-2z})$ as $\sigma$ grows, while the term in (6c) is due to the injected noise and is of order $O(\gamma\sigma^2)$. Specifically, the first term in (6b) depends only on the noise scale $\sigma$ (not on the stepsize) and mostly affects the final objective value. Meanwhile, the variance term in (6c) mainly affects the convergence speed, because a smaller stepsize is required for it to diminish. Theoretically, we can choose an iteration-dependent noise scale $\sigma$ so as to make the algorithm converge to a stationary solution. To see this, let us denote by $\tau = TE$ the total number of gradient queries per client, and present the following corollary.

Corollary 1 (Informal).
Let $\gamma = \min\{n^{\frac{z}{2z+1}} \tau^{-\frac{z+1}{2z+1}}, L_{\max}^{-1}\}$ and $\sigma = (n\tau)^{\frac{1}{4z+2}}$ in Theorem 1, and let $E \leq n^{-\frac{3z}{4z+2}} \tau^{\frac{z+2}{4z+2}}$. We have

$$\mathbb{E}\left[\frac{1}{\tau} \sum_{t=1}^{T} \sum_{s=1}^{E} \|\nabla f(x_{t-1,s-1})\|^2\right] = O\left((n\tau)^{-\frac{z}{2z+1}}\right). \qquad (7)$$

Achieving linear speedup. From Corollary 1, we can see that z-SignFedAvg needs $(n\tau)^{\frac{3z}{4z+2}}$ communication rounds to achieve a linear-speedup convergence rate; this follows from $T = \tau/E$ and the upper bound on $E$. In particular, when $z = 1$, the corresponding convergence rate is $O((n\tau)^{-1/3})$ and the required number of communication rounds is $(n\tau)^{1/2}$. To the best of our knowledge, previous works have never shown that a sign-based method can achieve a linear-speedup convergence rate.

Relationship to (Chen et al., 2020a). The work (Chen et al., 2020a) also considered the use of a symmetric and zero-mean noise for the sign-based compression and proved that the algorithm has a convergence rate $O(\tau^{-1/4})$. However, their results differ from our z-SignFedAvg and Theorem 1 in three respects. First, (Chen et al., 2020a) considered gradient compression in both the uplink and downlink communications; in addition, the convergence metric they used is not the standard squared $\ell_2$-norm of gradients and is hard to interpret. Second, their analysis is rooted in the median-based algorithm, whereas we exploit the property of the sign operation and hence provide a general analysis framework for the stochastic sign-based methods. Last but not least, unlike our z-SignFedAvg, the algorithm of (Chen et al., 2020a) does not allow multiple local SGD updates.
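The parameter choices in Corollary 1 are plain power laws in $n$ and $\tau$, so they can be tabulated directly. The sketch below is only an illustrative calculator: the function name and the returned dictionary keys are hypothetical, constants hidden by $O(\cdot)$ are ignored, and `L_max` must be supplied by the user.

```python
def corollary1_params(n, tau, z, L_max):
    """Evaluate the stepsize, noise scale, and local-step cap suggested by Corollary 1.

    n: number of clients, tau: gradient queries per client (tau = T * E),
    z: noise-distribution parameter, L_max: largest coordinate-wise smoothness constant.
    """
    gamma = min(n ** (z / (2 * z + 1)) * tau ** (-(z + 1) / (2 * z + 1)), 1.0 / L_max)
    sigma = (n * tau) ** (1.0 / (4 * z + 2))
    e_max = n ** (-3 * z / (4 * z + 2)) * tau ** ((z + 2) / (4 * z + 2))
    return {
        "gamma": gamma,                                    # client stepsize
        "sigma": sigma,                                    # noise scale
        "E_max": e_max,                                    # upper bound on local steps E
        "min_rounds": (n * tau) ** (3 * z / (4 * z + 2)),  # tau / E_max
        "rate": (n * tau) ** (-z / (2 * z + 1)),           # order of the bound, constants dropped
    }


if __name__ == "__main__":
    for z in (1, 2, 4):
        print(z, corollary1_params(n=10, tau=10**6, z=z, L_max=1.0))
```

Since `min_rounds * E_max` equals $\tau$ by construction, the two quantities are two views of the same constraint: fewer rounds are possible only by taking more local steps, up to `E_max`.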

3.2. CASE 2: z = +∞

When $z = +\infty$, the injected noise $\xi_z$ in z-SignFedAvg is uniformly distributed on $[-1, 1]$. From Remark 1, we have learned that the gradient bias can vanish as long as the noise scale $\sigma$ is sufficiently large. To quantify this threshold, we need the following assumption, which is a limit form of Assumption 2.

Assumption 3. There exists a constant $Q_\infty \geq 0$ such that for any $x \in \mathbb{R}^d$, with probability 1,

$$\|g_i(x) - \nabla f_i(x)\|_\infty \leq Q_\infty. \qquad (8)$$

Theorem 2 (Informal). Suppose that Assumptions 1 and 3 hold. For $\gamma = \min\{n^{\frac{1}{2}} \tau^{-\frac{1}{2}}, L_{\max}^{-1}\}$, $\eta = \sigma$, $z = +\infty$, $E \leq n^{-\frac{3}{4}} \tau^{\frac{1}{4}}$ and $\sigma > E(G + Q_\infty)$ in Algorithm 1, we have

$$\mathbb{E}\left[\frac{1}{\tau} \sum_{t=1}^{T} \sum_{s=1}^{E} \|\nabla f(x_{t-1,s-1})\|^2\right] = O\left((n\tau)^{-\frac{1}{2}}\right). \qquad (9)$$

However, if $\sigma \leq E(G + Q_\infty)$, there exists a problem instance for which Algorithm 1 cannot converge.

Remark 2. Note that Theorem 2 implies that ∞-SignFedAvg matches the convergence rate of the uncompressed FedAvg. The reason why ∞-SignFedAvg cannot converge when $\sigma \leq E(G + Q_\infty)$ is simply that the uniform noise has a finite support and cannot always change the sign of the gradients. For example, if $\sigma < A$ for some $A > 0$, then we have $\mathrm{Sign}(x + \sigma\xi_\infty) = \mathrm{Sign}(x)$ for any $x \geq A$.

Relationship to (Jin et al., 2020; Safaryan & Richtárik, 2021). As mentioned in Remark 1, both the stochastic sign operators in (Jin et al., 2020; Safaryan & Richtárik, 2021) are equivalent to the sign operator injected with uniform noise. Nevertheless, there are still two distinctions when compared with our ∞-SignFedAvg. First, while (Safaryan & Richtárik, 2021) shows that their algorithm has an $O(\tau^{-1/4})$ convergence rate, it is based on the $\ell_2$-norm of gradients and cannot imply the same rate as that in (9) (see Appendix A). Second, although (Safaryan & Richtárik, 2021) does not need Assumption 3, it relies on an input-dependent noise scale which, unfortunately, often slows the algorithm convergence in practice, especially when the problem dimension is large.
Table 1: Comparison of Case 1 and Case 2.

Case | Convergence rate | Required noise scale σ | Additional assumption
z < +∞ | $O(\tau^{-\frac{z}{2z+1}})$ | $O\left((Q_z/G + G^{4z})^{\frac{1}{4z}}\right)$ | Assumption 2
z = +∞ | $O(\tau^{-\frac{1}{2}})$ | $O(Q_\infty + G)$ | Assumption 3

More theoretical results and proofs are relegated to Appendices B and C. Below, we make two more remarks.

Remark 3. (Bounded minibatch gradient noise) While both Assumptions 2 and 3 are slightly stronger than the commonly used second-order condition on the minibatch gradient noise, they are still justifiable, since unbounded minibatch gradient noise rarely occurs in practice.

Remark 4. (Minibatch gradient noise works as noise perturbation) When the minibatch gradient is used as the input of the sign operator in (2), the minibatch gradient noise itself may function as the perturbation noise. In particular, as shown in (Chen et al., 2020b), the minibatch gradient noise approximately follows a symmetric distribution. Therefore, in practice, one may not need to inject noise as large as that suggested by Theorem 2, since the minibatch gradient noise can also help mitigate the bias due to sign-based compression. This also explains why a small noise scale is sufficient for z-SignFedAvg to achieve good performance in the experiment section.
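The threshold behavior described in Remark 2 can be seen from a closed form in the scalar case: for $\xi_\infty$ uniform on $[-1, 1]$, $P(x + \sigma\xi_\infty \geq 0) = (1 + x/\sigma)/2$ whenever $|x| \leq \sigma$, so $\mathbb{E}[\mathrm{Sign}(x + \sigma\xi_\infty)]$ equals $x/\sigma$ clipped to $[-1, 1]$. The short sketch below encodes this; the function names are illustrative.

```python
def expected_uniform_sign(x, sigma):
    """E[Sign(x + sigma * xi)] for xi ~ Uniform[-1, 1], i.e. clip(x / sigma, -1, 1)."""
    return max(-1.0, min(1.0, x / sigma))


def uniform_sign_bias(x, sigma):
    """sigma * E[Sign(x + sigma * xi)] - x; zero exactly when sigma >= |x|."""
    return sigma * expected_uniform_sign(x, sigma) - x


if __name__ == "__main__":
    x = 0.3
    for sigma in (0.1, 0.2, 0.3, 1.0, 10.0):
        # Below the threshold sigma = |x| the estimate saturates at +/- sigma.
        print(sigma, uniform_sign_bias(x, sigma))
```

For `sigma` at or above `|x|` the compressor is unbiased after scaling by `sigma`, while for smaller `sigma` the sign is deterministic and the bias is `sigma - |x|` in magnitude.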

3.3. COMPARISON OF CASE 1 AND CASE 2

We summarize the results of Case 1 and Case 2 in Table 1, where $O(\cdot)$ hides some constants that do not affect the comparison. In particular, we can see that when the mini-batch gradient noise has a long tail such that $Q_z/G \ll Q_\infty^{4z}$, Case 1 requires less noise than Case 2 to guarantee convergence. Despite this difference in theory, we will see in Section 4 that z-SignFedAvg under Case 1 and Case 2 have almost the same behavior in practice.

3.4. IMPLICATION ON DIFFERENTIALLY PRIVATE FEDERATED LEARNING (DP-FL)

Beyond the convergence issue, we remark that adding Gaussian noise to the local gradients is also a common practice for privacy protection, especially in DP-FL (Geyer et al., 2017; Agarwal et al., 2021; 2018) . With this observation, it is straightforward to propose a differentially-private variant of z-SignFedAvg, which we term DP-SignFedAvg. More details and comparison results between DP-SignFedAvg and the uncompressed DP-FedAvg (Geyer et al., 2017; Kairouz et al., 2021) under different privacy budgets are given in Appendix F.

4. EXPERIMENTS

In this section, we present the experiment results on both synthetic and real problems. All the figures in this section are obtained from 10 independent runs and are visualized in the form of mean±std.

Noise scale as a hyperparameter. Although we explicitly characterize how the performance of z-SignFedAvg depends on the noise scale $\sigma$ in the previous section, we treat $\sigma$ as a tunable hyperparameter in the experiments. This is because, on one hand, the theoretical lower bound for $\sigma$ is difficult to compute, since the moment condition of the minibatch gradient noise cannot be accessed. On the other hand, as we have discussed in Remark 4, owing to the presence of the minibatch gradient noise, we can use a much smaller noise scale than the theoretical one in practice.

Aside from the experiments presented in this section, we also compare our algorithm to another popular family of unbiased stochastic compressed FL algorithms, namely, QSGD in (Alistarh et al., 2017) and FedPAQ in (Reisizadeh et al., 2020). For detailed results, we refer readers to Appendix E.

4.1. A SIMPLE CONSENSUS PROBLEM

In this section, we verify our theoretical results in Section 3 by considering the simple consensus problem with 10 clients:

$$\min_{x \in \mathbb{R}^d} \frac{1}{2} \sum_{i=1}^{10} \|x - y_i\|^2,$$

where $y_1, \ldots, y_{10} \in \mathbb{R}^d$ are generated from the i.i.d. standard Gaussian distribution, and $d$ is the problem dimension. We implemented the following algorithms: GD (gradient descent), Sto-SignSGD (Safaryan & Richtárik, 2021), SignSGD (Algorithm 1 with $z = 1$, $E = 1$ and $\sigma = 0$), 1-SignSGD (Algorithm 1 with $z = 1$ and $E = 1$), and ∞-SignSGD (Algorithm 1 with $z = +\infty$ and $E = 1$). For all the algorithms, we used the full gradient (no mini-batch SGD), the same stepsize 0.01, and initialization by a zero vector.

Results. As we can see from Figure 1, the vanilla SignSGD fails to converge to the optimal solution whereas the others can. Besides, 1-SignSGD and ∞-SignSGD have roughly the same convergence speed, which is slightly slower than that of the uncompressed GD. It is also observed that the input-dependent noise scale adopted by (Safaryan & Richtárik, 2021) could slow the convergence when the problem dimension is high, as discussed in Section 3.2.

Settings. For both the experiments on EMNIST and CIFAR-10, we followed a setting similar to (Reddi et al., 2020). We also considered the scenario with partial client participation.

From previous experiments, we have learned that the noise scale $\sigma$ has to be properly chosen for the algorithm to perform well. However, it could be time-consuming to select the optimal noise scale via grid search. Therefore, here we introduce a simple yet useful strategy that can tune the noise scale adaptively during the training process. Figure 2 indicates that the noise scale plays a role similar to that of the stepsize when training a neural network: a small noise scale leads to fast convergence at the beginning, while a large noise scale guarantees a better final performance. This suggests that we should use an increasing noise scale during the optimization process.
We can also see this from Corollary 1, since the noise scale prescribed there, $\sigma = (n\tau)^{\frac{1}{4z+2}}$, increases with $\tau$. Besides, it has been shown that the gradients of neural networks tend to become sparser during the training process (Karimireddy et al., 2019). Therefore, as studied in (Isik & Weissman, 2022), from the rate-distortion theoretic aspect, the noise scale should increase as the compression becomes more aggressive. Motivated by all of these insights, we propose the following Plateau criterion for adapting the noise scale.

Plateau criterion. We introduce a few parameters $\sigma_{\text{bound}} \geq \sigma_{\text{init}} > 0$, $\kappa \in \mathbb{Z}_+$, $\beta > 0$. We first start Algorithm 1 with a small noise scale $\sigma_{\text{init}}$, i.e., $\sigma = \sigma_{\text{init}}$, and then update the noise scale via $\sigma = \beta\sigma$, where $\beta \in [1.5, 2]$, whenever the objective function stops improving for $\kappa$ communication rounds. We stop updating $\sigma$ once it is greater than a relatively large number $\sigma_{\text{bound}}$.

Results. We demonstrate the efficacy of the Plateau criterion by comparing the performance of 1-SignSGD/1-SignFedAvg with the optimal noise scale found in previous experiments against the same algorithms with the Plateau criterion. Figure 6 shows the results under the three different settings used in Sections 4.2 and 4.3. We can see that the Plateau criterion can result in a slower convergence speed than the optimal noise scale in the middle phase of optimization, because it requires some time for the algorithm to adapt to a suitable noise scale. But eventually it can lead to the same objective value as that obtained by using the optimal noise scale. For more details, such as the hyperparameters for the Plateau criterion and the evolution of the noise scale, we refer readers to Appendix D.3.

For communication complexity, we focus on the uplink communication cost, i.e., the number of bits transmitted from the clients to the server in each communication round.
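The Plateau criterion for the noise scale described above can be sketched as a small stateful scheduler. This is a minimal illustration, not the implementation used in the experiments: the class name, the `update` method, and the choice to keep the best objective seen so far after each increase are all assumptions, since those details are deferred to Appendix D.3.

```python
class PlateauNoiseScale:
    """Multiply sigma by beta whenever the objective has not improved for kappa rounds."""

    def __init__(self, sigma_init, sigma_bound, beta=2.0, kappa=5):
        if not (0 < sigma_init <= sigma_bound):
            raise ValueError("need 0 < sigma_init <= sigma_bound")
        self.sigma = sigma_init
        self.sigma_bound = sigma_bound
        self.beta = beta
        self.kappa = kappa
        self.best = float("inf")      # best (lowest) objective seen so far
        self.stalled_rounds = 0       # consecutive rounds without improvement

    def update(self, objective):
        """Record the objective after one communication round and return the new sigma."""
        if objective < self.best:
            self.best = objective
            self.stalled_rounds = 0
        else:
            self.stalled_rounds += 1
        # Stop growing sigma once it is already greater than sigma_bound.
        if self.stalled_rounds >= self.kappa and self.sigma <= self.sigma_bound:
            self.sigma *= self.beta
            self.stalled_rounds = 0
        return self.sigma


if __name__ == "__main__":
    scheduler = PlateauNoiseScale(sigma_init=0.1, sigma_bound=1.0, beta=2.0, kappa=2)
    for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.85, 0.85, 0.85]:
        print(loss, scheduler.update(loss))
```

In a training loop, `update` would be called once per communication round with the current training objective, and its return value passed as the noise coefficient $\sigma$ of Algorithm 1 for the next round.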
We assume that all the uncompressed algorithms use 32 bits to represent a single float number, as it is the most common setting in Tensorflow (Abadi et al., 2016a) and Pytorch (Paszke et al., 2017). While most of the existing methods use the squared $\ell_2$-norm of gradients as the convergence metric, the work (Safaryan & Richtárik, 2021) adopts the $\ell_2$-norm of gradients. The work (Chen et al., 2020a) uses a convergence metric that mixes the squared $\ell_2$-norm and the $\ell_1$-norm of gradients, due to the compression in both uplink and downlink.

Table 2: Summary of representative stochastic sign-based methods.

Algorithm | Convergence rate | Metric | Uplink bits per round | Assumptions | |
Uncompressed | $O(\tau^{-1/2})$ | squared $\ell_2$ | $32d$ | Bounded gradient | ✓ | ✓
(Karimireddy et al., 2019) | $O(\tau^{-1/2} + d^2\tau^{-1})$ | squared $\ell_2$ | $d + 32$ | Bounded gradient | ✗ | ✗
(Safaryan & Richtárik, 2021) | $O(\tau^{-1/4})$ | $\ell_2$ | $d$ | No | ✗ | ✗
(Jin et al., 2020) | $O(\tau^{-1/4})$ | squared $\ell_2$ | $d$ | Bounded gradient; $n$ is an odd number | ✗ | ✗
(Chen et al., 2020a) | $O(\tau^{-1/4})$ | mixed | $d$ | Bounded gradient; $n$ is an odd number | ✗ | ✗
(Alistarh et al., 2017) | $O(\tau^{-1/2})$ | squared $\ell_2$ | $\approx sd + 32$ | No | ✓ | ✗
(Haddadpour et al., 2021) | $O(\tau^{-1/2})$ | squared $\ell_2$ | $\approx sd + 32$ | Bounded gradient dissimilarity | ✓ | ✓
1-SignFedAvg (ALG. 1), this work | $O(\tau^{-1/3})$ | squared $\ell_2$ | $d$ | Bounded gradient; bounded 6th moment of gradient noise | ✓ | ✓
∞-SignFedAvg (ALG. 1), this work | $O(\tau^{-1/2})$ | squared $\ell_2$ | $d$ | Bounded gradient; bounded support of gradient noise | ✓ | ✓

Among the works in Table 2, the setting considered by (Safaryan & Richtárik, 2021) is closest to ours. (Safaryan & Richtárik, 2021) proposed an algorithm that can achieve the convergence rate $O(\tau^{-1/4})$ with the $\ell_2$-norm of gradients as the metric. We remark that this is inferior to the convergence rate $O(\tau^{-1/2})$ with the squared $\ell_2$-norm as the metric. To illustrate this point, consider a sequence of vectors $\{\alpha_1, \ldots, \alpha_\tau, \ldots\}$ with $\alpha_i \in \mathbb{R}^d$. If $\frac{1}{\tau} \sum_{i=1}^{\tau} \|\alpha_i\| = O(\tau^{-1/4})$, then in the worst case we can only guarantee that

$$\frac{1}{\tau} \sum_{i=1}^{\tau} \|\alpha_i\|^2 \leq \tau \left(\frac{1}{\tau} \sum_{i=1}^{\tau} \|\alpha_i\|\right)^2 = O(\tau^{1/2}). \qquad (11)$$

As a simple example, the equality in (11) holds when at most one term in $\{\alpha_1, \ldots, \alpha_\tau\}$ is non-zero. On the contrary, if it holds that

$$\frac{1}{\tau} \sum_{i=1}^{\tau} \|\alpha_i\|^2 = O(\tau^{-1/2}), \qquad (12)$$

then we have

$$\frac{1}{\tau} \sum_{i=1}^{\tau} \|\alpha_i\| \leq \sqrt{\frac{1}{\tau} \sum_{i=1}^{\tau} \|\alpha_i\|^2} = O(\tau^{-1/4}). \qquad (13)$$

Thus, the convergence results in (Safaryan & Richtárik, 2021) cannot imply the rate in Theorem 2. Besides, the algorithm in (Safaryan & Richtárik, 2021) is equivalent to our Algorithm 1 with $z = \infty$, $E = 1$ and $\sigma = \|g^i_{t-1,s}\|$. This input-dependent noise scale increases linearly with the problem dimension and is too conservative for practical applications. From Figure 1 and Figure 3, we have already seen that this input-dependent noise scale could result in an extremely slow convergence for high-dimensional problems.

Apart from the previous sign-based compression methods, another type of compressed FL algorithms, such as (Alistarh et al., 2017) and (Haddadpour et al., 2021), adopt a unified unbiased compressor $Q(\cdot)$ that satisfies $\mathbb{E}[\|Q(x) - x\|^2] \leq C\|x\|^2$ for some constant $C > 0$. We remark that this property is not fulfilled by any of the existing sign-based compressors. Thus, the theoretical results therein cannot be applied to sign-based methods. A specific example of such an unbiased compressor is described below.

Definition 2 (Unbiased quantizer). For any variable $x \in \mathbb{R}^d$, the unbiased quantizer $Q(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ is defined as

$$Q(x) = \|x\|_2 \cdot \begin{bmatrix} \mathrm{Sign}(x_1)\,\xi(x_1, s) \\ \mathrm{Sign}(x_2)\,\xi(x_2, s) \\ \vdots \\ \mathrm{Sign}(x_d)\,\xi(x_d, s) \end{bmatrix}, \qquad (14)$$

where $\xi(x_i, s)$ is a random variable taking the value $\frac{l+1}{s}$ with probability $\frac{|x_i|}{\|x\|_2} s - l$ and $\frac{l}{s}$ otherwise. Here, the tuning parameter $s$ corresponds to the number of quantization levels and $l \in [0, s)$ is an integer such that $\frac{|x_i|}{\|x\|_2} \in [l/s, (l+1)/s)$.

In Table 2, we assume both (Alistarh et al., 2017) and (Haddadpour et al., 2021) adopt the quantizer in (14).
Generally speaking, this type of unbiased quantization usually requires many more bits than sign-based compression to obtain good performance, which is also verified empirically in Appendix E. It is also worth mentioning that FedPAQ (Reisizadeh et al., 2020) and FedCOM (Haddadpour et al., 2021) are algorithmically equivalent, but only the latter considers the heterogeneous scenario theoretically.
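To make Definition 2 concrete, the following is a minimal NumPy sketch of the unbiased quantizer, assuming the stochastic rounding rule stated above. The function name `unbiased_quantize` and the test vector are illustrative choices and do not come from the paper; the Monte Carlo average at the end is only a sanity check of unbiasedness.

```python
import numpy as np

def unbiased_quantize(x, s, rng):
    """Stochastic s-level quantizer in the spirit of Definition 2 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    ratio = np.abs(x) / norm * s      # |x_i| / ||x||_2 * s, a value in [0, s]
    lower = np.floor(ratio)           # integer l with ratio in [l, l + 1)
    prob_up = ratio - lower           # probability of rounding up to (l + 1) / s
    level = lower + (rng.random(x.shape) < prob_up)
    # np.sign(0) is 0, but the level is also 0 there, so the product is unaffected.
    return norm * np.sign(x) * level / s

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2, 0.0, 2.5])
samples = np.stack([unbiased_quantize(x, s=4, rng=rng) for _ in range(20000)])
mean_estimate = samples.mean(axis=0)  # should be close to x, since E[Q(x)] = x
```

Each coordinate is sent as one of the levels $0, 1/s, \dots, 1$ plus a sign, which is why the cost grows with $s$, in contrast to the single bit per coordinate of the sign-based compressor.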

B DETAILED THEORETICAL RESULTS

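We first state the result on the limit of the z-distribution.

Lemma 2. The z-distribution weakly converges to the uniform distribution on $[-1, 1]$ when $z \to +\infty$.

The following corollary is the formal version of Corollary 1.

Corollary 2 (Formal version of Corollary 1). For $\gamma = \min\{n^{\frac{z}{2z+1}}\tau^{-\frac{z+1}{2z+1}}, \frac{1}{L_{\max}}\}$ and $\sigma = (n\tau)^{\frac{1}{4z+2}}$ in Theorem 1, we have

$$\begin{aligned}
\mathbb{E}\left[\frac{1}{\tau}\sum_{t=1}^{T}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2\right] \le{}& \frac{2\,\mathbb{E}[f(x_0) - f^*]}{(n\tau)^{\frac{z}{2z+1}}} + \frac{\zeta^2 L_{\max}}{(n\tau)^{\frac{z+1}{2z+1}}} + \frac{4(E-1)E\,n^{\frac{2z}{2z+1}}L_{\max}^2(\zeta^2 + G^2)}{3\tau^{\frac{2z+2}{2z+1}}} \\
&+ \frac{2^{2z+1}E^{2z}\sqrt{Q_z + G^{4z+2}}\,G}{\sqrt{2}(2z+1)(n\tau)^{\frac{z}{2z+1}}} + \frac{2^{4z}E^{4z+1}(Q_z + G^{4z+2})L_{\max}}{2(2z+1)^2 n^{\frac{z}{2z+1}}\tau^{\frac{3z+1}{2z+1}}} + \frac{4\eta_z^2\sum_{j=1}^{d}L_j}{E(n\tau)^{\frac{z}{2z+1}}}.
\end{aligned} \tag{15}$$

Furthermore, if $E \le n^{-\frac{3z}{4z+2}}\tau^{\frac{z+2}{4z+2}}$, the upper bound above becomes

$$\begin{aligned}
\mathbb{E}\left[\frac{1}{\tau}\sum_{t=1}^{T}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2\right] \le{}& \frac{2\,\mathbb{E}[f(x_0) - f^*]}{(n\tau)^{\frac{z}{2z+1}}} + \frac{\zeta^2 L_{\max}}{(n\tau)^{\frac{z+1}{2z+1}}} + \frac{4L_{\max}^2(\zeta^2 + G^2)}{3(n\tau)^{\frac{z}{2z+1}}} \\
&+ \frac{2^{2z+1}E^{2z}\sqrt{Q_z + G^{4z+2}}\,G}{\sqrt{2}(2z+1)(n\tau)^{\frac{z}{2z+1}}} + \frac{2^{4z}E^{4z+1}(Q_z + G^{4z+2})L_{\max}}{2(2z+1)^2 n^{\frac{z}{2z+1}}\tau^{\frac{3z+1}{2z+1}}} + \frac{4\eta_z^2\sum_{j=1}^{d}L_j}{E(n\tau)^{\frac{z}{2z+1}}}.
\end{aligned} \tag{16}$$

The formal version of Theorem 2 is given below.

Theorem 3 (Formal version of Theorem 2). Suppose that Assumption 1 and 3 hold. For $\gamma \le \frac{1}{L_{\max}}$, $\eta = \sigma$, $z = +\infty$ and $\sigma > E(G + Q_\infty)$ in Algorithm 1, we have

$$\mathbb{E}\left[\frac{1}{TE}\sum_{t=1}^{T}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2\right] \le \underbrace{\frac{2\,\mathbb{E}[f(x_0) - f^*]}{TE\gamma} + \frac{\gamma\zeta^2 L_{\max}}{n} + \frac{4\gamma^2(E-1)E L_{\max}^2(\zeta^2 + G^2)}{3}}_{\text{Standard terms in FedAvg}} + \underbrace{\frac{4\gamma\sigma^2\sum_{j=1}^{d}L_j}{En}}_{\text{Variance term}}. \tag{17}$$

Otherwise, if $\sigma \le E(G + Q_\infty)$, there exists a problem instance for which the algorithm cannot converge. If we further choose $\gamma = \min\{n^{\frac{1}{2}}\tau^{-\frac{1}{2}}, \frac{1}{L_{\max}}\}$, we have

$$\mathbb{E}\left[\frac{1}{\tau}\sum_{t=1}^{T}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2\right] \le \frac{2\,\mathbb{E}[f(x_0) - f^*]}{(n\tau)^{\frac{1}{2}}} + \frac{\zeta^2 L_{\max}}{(n\tau)^{\frac{1}{2}}} + \frac{4(E-1)E\,n L_{\max}^2(\zeta^2 + G^2)}{3\tau} + \frac{4\sigma^2\sum_{j=1}^{d}L_j}{E(n\tau)^{\frac{1}{2}}}. \tag{18}$$

Furthermore, if $E \le n^{-\frac{3}{4}}\tau^{\frac{1}{4}}$, the upper bound above becomes

$$\mathbb{E}\left[\frac{1}{\tau}\sum_{t=1}^{T}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2\right] \le \frac{2\,\mathbb{E}[f(x_0) - f^*]}{(n\tau)^{\frac{1}{2}}} + \frac{\zeta^2 L_{\max}}{(n\tau)^{\frac{1}{2}}} + \frac{4L_{\max}^2(\zeta^2 + G^2)}{3(n\tau)^{\frac{1}{2}}} + \frac{4\sigma^2\sum_{j=1}^{d}L_j}{E(n\tau)^{\frac{1}{2}}}, \tag{19}$$

which recovers the convergence result of the uncompressed FedAvg algorithm (Yu et al., 2019). In particular, the third term in the RHS of (18) is $O(E^2 n\tau^{-1})$; hence, when $E \le n^{-\frac{3}{4}}\tau^{\frac{1}{4}}$, this term becomes $O((n\tau)^{-\frac{1}{2}})$.

C PROOFS

C.1 PROOF OF LEMMA 1

We first state a useful inequality on the c.d.f of the z-distribution:

Lemma 3. For any $x \in \mathbb{R}$, it holds that

$$|x| - \frac{|x|^{2z+1}}{2(2z+1)} \le |\Psi_z(x)| \le |x|, \tag{20}$$

where $\Psi_z(x) \overset{\text{def.}}{=} \int_0^x e^{-\frac{t^{2z}}{2}}\,dt$.

Similar to the sign operator, for any vector $x = [x(1), \dots, x(d)]^\top \in \mathbb{R}^d$, we define $\Psi_z(x) = [\Psi_z(x(1)), \dots, \Psi_z(x(d))]^\top$. With Lemma 3, we have

$$\big\|\eta_z\sigma\,\mathbb{E}[\mathrm{Sign}(x + \sigma\xi_z)] - x\big\|^2 = \left\|x - \sigma\Psi_z\!\left(\frac{x}{\sigma}\right)\right\|^2 = \sum_{j=1}^{d}\left|x(j) - \sigma\Psi_z\!\left(\frac{x(j)}{\sigma}\right)\right|^2 \le \sum_{j=1}^{d}\frac{(x(j))^{4z+2}}{4(2z+1)^2\sigma^{4z}} = \frac{\|x\|_{4z+2}^{4z+2}}{4(2z+1)^2\sigma^{4z}}. \tag{21}$$

Proof of Lemma 3. Without loss of generality, we consider $x \ge 0$. First,

$$\int_0^x e^{-\frac{t^{2z}}{2}}\,dt \le \int_0^x 1\,dt \le x. \tag{22}$$

Now we define $F(x) \overset{\text{def.}}{=} \int_0^x e^{-\frac{t^{2z}}{2}}\,dt - x + \frac{x^{2z+1}}{2(2z+1)}$. Note that $F(0) = 0$. Then, it suffices to show $F(x) \ge 0$ by

$$F'(x) = e^{-\frac{x^{2z}}{2}} - 1 + \frac{x^{2z}}{2} \ge 0. \tag{23}$$

This is true because $e^{-t} - 1 + t \ge 0$ for any $t \ge 0$.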
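As a numerical sanity check of Lemma 3 and of the resulting per-coordinate bias bound used for Lemma 1, the sketch below evaluates $\Psi_z$ with a plain trapezoidal rule. The helper names `psi_z`, `bias_gap` and `bias_bound`, the grid size, and the chosen values of $z$ and $\sigma$ are assumptions made for illustration only.

```python
import numpy as np

def psi_z(x, z, num=20001):
    """Psi_z(x) = integral from 0 to x of exp(-t^(2z) / 2) dt (trapezoidal rule)."""
    t = np.linspace(0.0, x, num)
    vals = np.exp(-t ** (2 * z) / 2.0)
    dt = t[1] - t[0]                  # negative when x < 0, which flips the sign as required
    return float(dt * (vals.sum() - 0.5 * (vals[0] + vals[-1])))

def bias_gap(x, z, sigma):
    """|x - sigma * Psi_z(x / sigma)|: bias of the scaled stochastic sign, one coordinate."""
    return abs(x - sigma * psi_z(x / sigma, z))

def bias_bound(x, z, sigma):
    """Square root of the per-coordinate bound: |x|^(2z+1) / (2 (2z+1) sigma^(2z))."""
    return abs(x) ** (2 * z + 1) / (2 * (2 * z + 1) * sigma ** (2 * z))

# Bias and bound for a fixed input as the noise scale grows (z = 1, Gaussian case).
for sigma in (0.5, 1.0, 2.0, 4.0):
    print(sigma, bias_gap(1.0, 1, sigma), bias_bound(1.0, 1, sigma))
```

The printed pairs illustrate the trade-off discussed after Theorem 1: a larger $\sigma$ shrinks the bias at the rate $\sigma^{-2z}$, at the price of a larger variance term.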

C.2 PROOF OF LEMMA 2

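Now we denote the p.d.f of the uniform distribution as

$$p_\infty(x) = \begin{cases} \frac{1}{2}, & |x| \le 1, \\ 0, & |x| > 1. \end{cases} \tag{24}$$

Without loss of generality, for any $x > 1$ and $z \in \mathbb{Z}_+$, we have

$$\left|\int_{-\infty}^{x}\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}}\,dt - \int_{-\infty}^{x}p_\infty(t)\,dt\right| = \left|\int_{0}^{x}\left(\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}} - p_\infty(t)\right)dt\right| \le \int_{0}^{1}\left|\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}} - \frac{1}{2}\right|dt + \int_{1}^{x}\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}}\,dt.$$

For any $0 < \epsilon < \min\{1, x - 1\}$, we have

$$\int_{0}^{1}\left|\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}} - \frac{1}{2}\right|dt = \int_{0}^{1-\epsilon}\left|\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}} - \frac{1}{2}\right|dt + \int_{1-\epsilon}^{1}\left|\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}} - \frac{1}{2}\right|dt \le \left|\frac{1}{2\eta_z}e^{-\frac{(1-\epsilon)^{2z}}{2}} - \frac{1}{2}\right| + \epsilon.$$

Since $\lim_{z\to\infty}\frac{1}{2\eta_z} = \lim_{z\to\infty}\frac{z}{2^{\frac{1}{2z}}\Gamma(\frac{1}{2z})} = \frac{1}{2}$ and $\lim_{z\to\infty}e^{-\frac{(1-\epsilon)^{2z}}{2}} = 1$, there exists an integer $Z_1 > 0$ such that if $z > Z_1$, we have

$$\left|\frac{1}{2\eta_z}e^{-\frac{(1-\epsilon)^{2z}}{2}} - \frac{1}{2}\right| \le \epsilon.$$

Similarly, we have

$$\int_{1}^{x}\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}}\,dt = \int_{1}^{1+\epsilon}\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}}\,dt + \int_{1+\epsilon}^{x}\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}}\,dt \le \epsilon + \frac{1}{2\eta_z}e^{-\frac{(1+\epsilon)^{2z}}{2}}(x - 1 - \epsilon).$$

Since $\lim_{z\to\infty}e^{-\frac{(1+\epsilon)^{2z}}{2}} = 0$, there exists an integer $Z_2 > 0$ such that if $z > Z_2$, we have

$$\int_{1}^{x}\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}}\,dt \le \epsilon.$$

In all, for any $0 < \epsilon < \min\{1, x - 1\}$, if $z$ is sufficiently large, we have

$$\left|\int_{-\infty}^{x}\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}}\,dt - \int_{-\infty}^{x}p_\infty(t)\,dt\right| \le 4\epsilon.$$

Taking $\epsilon \to 0$ and $z \to \infty$, we have

$$\lim_{z\to\infty}\left|\int_{-\infty}^{x}\frac{1}{2\eta_z}e^{-\frac{t^{2z}}{2}}\,dt - \int_{-\infty}^{x}p_\infty(t)\,dt\right| = 0.$$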
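Lemma 2 can also be observed numerically. The sketch below, a rough check rather than part of the proof, compares the c.d.f of the z-distribution with the uniform c.d.f on a grid for increasing $z$. It assumes the normalizer $\eta_z = 2^{\frac{1}{2z}}\Gamma(\frac{1}{2z})/(2z)$, which is what the limit expression in the proof implies; the function names and the grid are illustrative.

```python
import math
import numpy as np

def eta_z(z):
    """Normalizer eta_z = integral from 0 to infinity of exp(-t^(2z) / 2) dt."""
    return 2.0 ** (1.0 / (2 * z)) * math.gamma(1.0 / (2 * z)) / (2 * z)

def z_cdf(x, z, num=20001):
    """c.d.f of the z-distribution at x, using symmetry around 0 and a trapezoidal rule."""
    t = np.linspace(0.0, x, num)
    vals = np.exp(-t ** (2 * z) / 2.0)
    dt = t[1] - t[0]
    integral = dt * (vals.sum() - 0.5 * (vals[0] + vals[-1]))
    return 0.5 + integral / (2.0 * eta_z(z))

def uniform_cdf(x):
    """c.d.f of the uniform distribution on [-1, 1]."""
    return min(1.0, max(0.0, (x + 1.0) / 2.0))

grid = np.linspace(-2.0, 2.0, 41)
max_gap = {
    z: max(abs(z_cdf(x, z) - uniform_cdf(x)) for x in grid)
    for z in (1, 4, 16, 64)
}
print(max_gap)  # the largest gap on the grid shrinks as z grows
```

For $z = 1$ the c.d.f is that of a standard Gaussian, so the gap is large; by $z = 64$ the two curves are nearly indistinguishable on this grid.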

C.3 PROOF OF THEOREM 1

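We denote the aggregated update $\bar{x}_t = \bar{x}_{t-1,E}$. First, we state two technical lemmas:

Lemma 4. Suppose that Assumption 1 and 2 hold. For the $t$-th ($1 \le t \le T$) communication round in Algorithm 1, if $\eta = \eta_z\sigma$ and $z < +\infty$, we have

$$\mathbb{E}[f(x_t) - f(\bar{x}_t)] \le \frac{\gamma 2^{2z}E^{2z+1}\sqrt{Q_z + G^{4z+2}}\,G}{\sqrt{2}(2z+1)\sigma^{2z}} + \frac{\gamma^2 2^{4z}E^{4z+2}(Q_z + G^{4z+2})L_{\max}}{4(2z+1)^2\sigma^{4z}} + \frac{2\eta_z^2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n}. \tag{31}$$

Lemma 5. Suppose that Assumption 1 holds. For the $t$-th ($1 \le t \le T$) communication round in Algorithm 1, if $\gamma \le \frac{1}{L_{\max}}$, we have

$$\mathbb{E}[f(\bar{x}_t) - f(x_{t-1})] \le -\frac{\gamma}{2}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 + \frac{E\gamma^2\zeta^2 L_{\max}}{2n} + \frac{2\gamma^3(E-1)E^2 L_{\max}^2(\zeta^2 + G^2)}{3}. \tag{32}$$

By combining Lemma 4 and Lemma 5, we have

$$\begin{aligned}
\mathbb{E}[f(x_t) - f(x_{t-1})] ={}& \mathbb{E}[f(x_t) - f(\bar{x}_t)] + \mathbb{E}[f(\bar{x}_t) - f(x_{t-1})] \\
\le{}& -\frac{\gamma}{2}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 + \frac{E\gamma^2\zeta^2 L_{\max}}{2n} + \frac{2\gamma^3(E-1)E^2 L_{\max}^2(\zeta^2 + G^2)}{3} \\
&+ \frac{\gamma 2^{2z}E^{2z+1}\sqrt{Q_z + G^{4z+2}}\,G}{\sqrt{2}(2z+1)\sigma^{2z}} + \frac{\gamma^2 2^{4z}E^{4z+2}(Q_z + G^{4z+2})L_{\max}}{4(2z+1)^2\sigma^{4z}} + \frac{2\eta_z^2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n}.
\end{aligned} \tag{33}$$

Rearranging the inequality (33), we have

$$\begin{aligned}
\frac{1}{E}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 \le{}& \frac{2\,\mathbb{E}[f(x_{t-1}) - f(x_t)]}{E\gamma} + \frac{\gamma\zeta^2 L_{\max}}{n} + \frac{4\gamma^2(E-1)E L_{\max}^2(\zeta^2 + G^2)}{3} \\
&+ \frac{2^{2z+1}E^{2z}\sqrt{Q_z + G^{4z+2}}\,G}{\sqrt{2}(2z+1)\sigma^{2z}} + \frac{\gamma 2^{4z}E^{4z+1}(Q_z + G^{4z+2})L_{\max}}{2(2z+1)^2\sigma^{4z}} + \frac{4\eta_z^2\gamma\sigma^2\sum_{j=1}^{d}L_j}{En}.
\end{aligned} \tag{34}$$

Finally, by a telescopic sum, we obtain

$$\begin{aligned}
\mathbb{E}\left[\frac{1}{TE}\sum_{t=1}^{T}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2\right] \le{}& \frac{2\,\mathbb{E}[f(x_0) - f^*]}{TE\gamma} + \frac{\gamma\zeta^2 L_{\max}}{n} + \frac{4\gamma^2(E-1)E L_{\max}^2(\zeta^2 + G^2)}{3} \\
&+ \frac{2^{2z+1}E^{2z}\sqrt{Q_z + G^{4z+2}}\,G}{\sqrt{2}(2z+1)\sigma^{2z}} + \frac{\gamma 2^{4z}E^{4z+1}(Q_z + G^{4z+2})L_{\max}}{2(2z+1)^2\sigma^{4z}} + \frac{4\eta_z^2\gamma\sigma^2\sum_{j=1}^{d}L_j}{En}.
\end{aligned} \tag{35}$$

Proof of Lemma 4. First, we know from function smoothness that

$$f(x_t) - f(\bar{x}_t) \le \langle\nabla f(\bar{x}_t), x_t - \bar{x}_t\rangle + \sum_{j=1}^{d}\frac{L_j(x_t(j) - \bar{x}_t(j))^2}{2}. \tag{36}$$

As can be seen from (36), we need to study $x_t - \bar{x}_t$ in order to obtain the upper bound for $f(x_t) - f(\bar{x}_t)$. Note that

$$x_t - \bar{x}_t = \frac{\gamma}{n}\sum_{i=1}^{n}\left(\eta_z\sigma\,\mathrm{Sign}\!\left(\sum_{s=1}^{E}g^i_{t,s} + \sigma\xi_z\right) - \sum_{s=1}^{E}g^i_{t,s}\right). \tag{37}$$

For ease of presentation, we define

$$A^i_t \overset{\text{def.}}{=} \eta_z\sigma\,\mathrm{Sign}\!\left(\sum_{s=1}^{E}g^i_{t,s} + \sigma\xi_z\right). \tag{38}$$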
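By taking the expectation over the random vector $\xi_z$, for any $j = 1, \dots, d$, we have

$$\begin{aligned}
\mathbb{E}_{\xi_z}[(x_t(j) - \bar{x}_t(j))^2] &= \frac{\gamma^2}{n^2}\mathbb{E}_{\xi_z}\left[\left(\sum_{i=1}^{n}\Big(A^i_t - \sum_{s=1}^{E}g^i_{t,s}\Big)(j)\right)^2\right] && \text{(39a)} \\
&= \frac{\gamma^2}{n^2}\mathbb{E}_{\xi_z}\left[\left(\sum_{i=1}^{n}\Big(A^i_t(j) - \mathbb{E}_{\xi_z}A^i_t(j) + \mathbb{E}_{\xi_z}A^i_t(j) - \sum_{s=1}^{E}g^i_{t,s}(j)\Big)\right)^2\right] && \text{(39b)} \\
&\le \frac{\gamma^2}{n^2}\mathbb{E}_{\xi_z}\left[\left(\sum_{i=1}^{n}\Big(A^i_t(j) - \mathbb{E}_{\xi_z}A^i_t(j)\Big)\right)^2\right] + \frac{\gamma^2}{n^2}\mathbb{E}_{\xi_z}\left[\left(\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t(j) - \sum_{s=1}^{E}g^i_{t,s}(j)\Big)\right)^2\right], && \text{(39c)}
\end{aligned}$$

where the last inequality is obtained because $\sum_{i=1}^{n}\big(A^i_t(j) - \mathbb{E}_{\xi_z}A^i_t(j)\big)$ is zero-mean and independent of $\sum_{i=1}^{n}\big(\mathbb{E}_{\xi_z}A^i_t(j) - \sum_{s=1}^{E}g^i_{t,s}(j)\big)$. From (38) it is easy to check that $|A^i_t(j)|^2 \le \eta_z^2\sigma^2$. Hence, for the first term in the RHS of (39c), we have

$$\mathbb{E}_{\xi_z}\left[\left(\sum_{i=1}^{n}\Big(A^i_t(j) - \mathbb{E}_{\xi_z}A^i_t(j)\Big)\right)^2\right] \overset{(a)}{=} \sum_{i=1}^{n}\mathbb{E}_{\xi_z}\left[\Big(A^i_t(j) - \mathbb{E}_{\xi_z}A^i_t(j)\Big)^2\right] \le 2\sum_{i=1}^{n}\mathbb{E}_{\xi_z}\left[\big(A^i_t(j)\big)^2 + \big(\mathbb{E}_{\xi_z}A^i_t(j)\big)^2\right] \le 4n\eta_z^2\sigma^2, \tag{40}$$

where equality (a) is true because $A^1_t(j), \dots, A^n_t(j)$ are independent of each other. Therefore, from (39) and (40) we have

$$\begin{aligned}
\mathbb{E}_{\xi_z}\left[\sum_{j=1}^{d}L_j(x_t(j) - \bar{x}_t(j))^2\right] &= \sum_{j=1}^{d}L_j\,\mathbb{E}_{\xi_z}\left[(x_t(j) - \bar{x}_t(j))^2\right] && \text{(41a)} \\
&\le \frac{4\eta_z^2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n} + \frac{\gamma^2}{n^2}\sum_{j=1}^{d}L_j\,\mathbb{E}_{\xi_z}\left[\left(\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t(j) - \sum_{s=1}^{E}g^i_{t,s}(j)\Big)\right)^2\right] && \text{(41b)} \\
&\le \frac{4\eta_z^2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n} + \frac{\gamma^2 L_{\max}}{n^2}\mathbb{E}_{\xi_z}\left[\left\|\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t - \sum_{s=1}^{E}g^i_{t,s}\Big)\right\|^2\right]. && \text{(41c)}
\end{aligned}$$

To bound the RHS of (41c), we have

$$\mathbb{E}_{\xi_z}\left[\left\|\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t - \sum_{s=1}^{E}g^i_{t,s}\Big)\right\|^2\right] \le n\sum_{i=1}^{n}\mathbb{E}_{\xi_z}\left[\left\|\mathbb{E}_{\xi_z}A^i_t - \sum_{s=1}^{E}g^i_{t,s}\right\|^2\right] \le \frac{n}{4(2z+1)^2\sigma^{4z}}\sum_{i=1}^{n}\left\|\sum_{s=1}^{E}g^i_{t,s}\right\|_{4z+2}^{4z+2}, \tag{42}$$

where the last inequality is due to Lemma 1. Now we need to bound $\mathbb{E}\big[\big\|\sum_{s=1}^{E}g^i_{t,s}\big\|_{4z+2}^{4z+2}\big]$, where the expectation is taken over both $\xi_z$ and the mini-batch gradient noise. To this end, we need the following lemma about the $\ell_p$-norm.

Lemma 6. For any $M \in \mathbb{Z}_+$, $p > 1$ and $M$ vectors $x_1, \dots, x_M \in \mathbb{R}^d$, we have

$$\left\|\sum_{i=1}^{M}x_i\right\|_p^p \le M^{p-1}\sum_{i=1}^{M}\|x_i\|_p^p. \tag{43}$$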
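Lemma 6 is easy to probe numerically. The following sketch draws random vectors and compares the two sides of the inequality for $p = 4z + 2$ with $z = 1, 2, 3$, the exponents that appear in the proof of Lemma 4. The helper name `lemma6_sides` and the random test setup are illustrative assumptions.

```python
import numpy as np

def lemma6_sides(vectors, p):
    """Return (||sum_i x_i||_p^p, M^(p-1) * sum_i ||x_i||_p^p) for an (M, d) array."""
    vectors = np.asarray(vectors, dtype=float)
    m = vectors.shape[0]
    lhs = np.sum(np.abs(vectors.sum(axis=0)) ** p)
    rhs = m ** (p - 1) * np.sum(np.abs(vectors) ** p)
    return lhs, rhs

rng = np.random.default_rng(0)
checks = []
for p in (6, 10, 14):                 # p = 4z + 2 for z = 1, 2, 3
    for _ in range(100):
        lhs, rhs = lemma6_sides(rng.normal(size=(5, 8)), p)
        checks.append(lhs <= rhs * (1 + 1e-12))

# The bound is tight when all M vectors coincide.
same = np.tile(rng.normal(size=(1, 8)), (5, 1))
tight_lhs, tight_rhs = lemma6_sides(same, 6)
```

In the proof, $M$ plays the role of the number of local steps $E$, which is where the factor $E^{4z+1}$ in the bias term comes from.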
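As a direct application of Lemma 6, we obtain

$$\mathbb{E}\left[\left\|\sum_{s=1}^{E}g^i_{t,s}\right\|_{4z+2}^{4z+2}\right] \le \mathbb{E}\left[E^{4z+1}\sum_{s=1}^{E}\left\|g^i_{t,s}\right\|_{4z+2}^{4z+2}\right] = E^{4z+1}\sum_{s=1}^{E}\mathbb{E}\left[\left\|g^i_{t,s}\right\|_{4z+2}^{4z+2}\right]. \tag{44}$$

Then we can bound the RHS of (44) as

$$\begin{aligned}
\mathbb{E}\left[\left\|g^i_{t,s}\right\|_{4z+2}^{4z+2}\right] &= \mathbb{E}\left[\left\|g^i_{t,s} - \nabla f_i(x^i_{t,s-1}) + \nabla f_i(x^i_{t,s-1})\right\|_{4z+2}^{4z+2}\right] \\
&\overset{(a)}{\le} \mathbb{E}\left[2^{4z+1}\left\|g^i_{t,s} - \nabla f_i(x^i_{t,s-1})\right\|_{4z+2}^{4z+2} + 2^{4z+1}\left\|\nabla f_i(x^i_{t,s-1})\right\|_{4z+2}^{4z+2}\right] \\
&\overset{(b)}{\le} 2^{4z+1}Q_z + 2^{4z+1}\left\|\nabla f_i(x^i_{t,s-1})\right\|_{2}^{4z+2} \\
&\overset{(c)}{\le} 2^{4z+1}(Q_z + G^{4z+2}),
\end{aligned} \tag{45}$$

where inequality (a) follows from Lemma 6, inequality (b) is due to Assumption 2, and inequality (c) is due to A.4 of Assumption 1. Combining (41), (42), (44) and (45), we have

$$\mathbb{E}\left[\left\|\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t - \sum_{s=1}^{E}g^i_{t,s}\Big)\right\|\right] \le \sqrt{\mathbb{E}\left[\left\|\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t - \sum_{s=1}^{E}g^i_{t,s}\Big)\right\|^2\right]} \le \sqrt{\frac{n^2 2^{4z}E^{4z+2}(Q_z + G^{4z+2})}{2(2z+1)^2\sigma^{4z}}} \le \frac{n 2^{2z}E^{2z+1}\sqrt{Q_z + G^{4z+2}}}{\sqrt{2}(2z+1)\sigma^{2z}} \tag{46}$$

and

$$\begin{aligned}
\mathbb{E}\left[\sum_{j=1}^{d}L_j(x_t(j) - \bar{x}_t(j))^2\right] &\le \frac{4\eta_z^2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n} + \frac{\gamma^2 L_{\max}}{n^2}\mathbb{E}\left[\left\|\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t - \sum_{s=1}^{E}g^i_{t,s}\Big)\right\|^2\right] \\
&\le \frac{4\eta_z^2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n} + \frac{\gamma^2 2^{4z+1}E^{4z+2}(Q_z + G^{4z+2})L_{\max}}{4(2z+1)^2\sigma^{4z}}.
\end{aligned} \tag{47}$$

Hence, we have

$$\begin{aligned}
\mathbb{E}[f(x_t) - f(\bar{x}_t)] &\le \mathbb{E}\left[\left\langle\nabla f(\bar{x}_t), \frac{\gamma}{n}\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t - \sum_{s=1}^{E}g^i_{t,s}\Big)\right\rangle\right] + \mathbb{E}\left[\sum_{j=1}^{d}\frac{L_j(x_t(j) - \bar{x}_t(j))^2}{2}\right] \\
&\le \|\nabla f(\bar{x}_t)\|\,\mathbb{E}\left[\left\|\frac{\gamma}{n}\sum_{i=1}^{n}\Big(\mathbb{E}_{\xi_z}A^i_t - \sum_{s=1}^{E}g^i_{t,s}\Big)\right\|\right] + \mathbb{E}\left[\sum_{j=1}^{d}\frac{L_j(x_t(j) - \bar{x}_t(j))^2}{2}\right] \\
&\le \frac{\gamma 2^{2z}E^{2z+1}\sqrt{Q_z + G^{4z+2}}\,G}{\sqrt{2}(2z+1)\sigma^{2z}} + \frac{\gamma^2 2^{4z}E^{4z+2}(Q_z + G^{4z+2})L_{\max}}{4(2z+1)^2\sigma^{4z}} + \frac{2\eta_z^2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n}.
\end{aligned} \tag{48}$$

Proof of Lemma 6. To prove this lemma, we need a classical result on the monotonicity of the $\ell_p$ norm:

Lemma 7. (Kantorovich & Akilov, 2016) For any $x \in \mathbb{R}^d$ and $1 < r < p$, we have

$$\|x\|_p \le \|x\|_r \le d^{\frac{1}{r} - \frac{1}{p}}\|x\|_p. \tag{49}$$

Now from the definition of the $\ell_p$ norm we have

$$\begin{aligned}
\left\|\sum_{i=1}^{M}x_i\right\|_p^p &= \sum_{j=1}^{d}\left|\sum_{i=1}^{M}x_i(j)\right|^p \le \sum_{j=1}^{d}\left(\sum_{i=1}^{M}|x_i(j)|\right)^p = \sum_{j=1}^{d}\left\|[x_1(j), \dots, x_M(j)]^\top\right\|_1^p \\
&\overset{(a)}{\le} M^{p-1}\sum_{j=1}^{d}\left\|[x_1(j), \dots, x_M(j)]^\top\right\|_p^p = M^{p-1}\sum_{j=1}^{d}\sum_{i=1}^{M}|x_i(j)|^p = M^{p-1}\sum_{i=1}^{M}\|x_i\|_p^p,
\end{aligned} \tag{50}$$

where inequality (a) is due to Lemma 7.

Proof of Lemma 5. First we unroll the difference $f(\bar{x}_t) - f(x_{t-1})$ into a telescopic sum across $E$ local steps.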
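$$\begin{aligned}
f(\bar{x}_t) - f(x_{t-1}) &= f(\bar{x}_{t-1,E}) - f(\bar{x}_{t-1,0}) = \sum_{s=1}^{E}\big(f(\bar{x}_{t-1,s}) - f(\bar{x}_{t-1,s-1})\big) && \text{(51a)} \\
&\le \sum_{s=1}^{E}\left(-\langle\nabla f(\bar{x}_{t-1,s-1}), \bar{x}_{t-1,s-1} - \bar{x}_{t-1,s}\rangle + \frac{L_{\max}}{2}\|\bar{x}_{t-1,s} - \bar{x}_{t-1,s-1}\|^2\right) && \text{(51b)} \\
&= \sum_{s=1}^{E}\left(-\gamma\left\langle\nabla f(\bar{x}_{t-1,s-1}), \frac{1}{n}\sum_{i=1}^{n}g^i_{t-1,s}\right\rangle + \frac{\gamma^2 L_{\max}}{2}\left\|\frac{1}{n}\sum_{i=1}^{n}g^i_{t-1,s}\right\|^2\right), && \text{(51c)}
\end{aligned}$$

where the inequality is due to the smoothness assumption. Taking expectation over the mini-batch gradient noise $g^1_{t-1,s}, \dots, g^n_{t-1,s}$, for the first terms in (51c), we obtain

$$\begin{aligned}
\mathbb{E}\left[-\left\langle\nabla f(\bar{x}_{t-1,s-1}), \frac{1}{n}\sum_{i=1}^{n}g^i_{t-1,s}\right\rangle\right] &= -\left\langle\nabla f(\bar{x}_{t-1,s-1}), \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\rangle && \text{(52a)} \\
&= -\frac{1}{2}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 - \frac{1}{2}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2 && \text{(52b)} \\
&\quad + \frac{1}{2}\left\|\nabla f(\bar{x}_{t-1,s-1}) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2. && \text{(52c)}
\end{aligned}$$

For the second terms in (51c), we have

$$\begin{aligned}
\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}g^i_{t-1,s}\right\|^2\right] &= \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}g^i_{t-1,s} - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1}) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2\right] \\
&\overset{(a)}{=} \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}g^i_{t-1,s} - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2\right] + \left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2 \\
&\overset{(b)}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left[\left\|g^i_{t-1,s} - \nabla f_i(x^i_{t-1,s-1})\right\|^2\right] + \left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2 \\
&\overset{(c)}{\le} \frac{\zeta^2}{n} + \left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2,
\end{aligned} \tag{53}$$

where equalities (a) and (b) are true because the mini-batch gradient noise is zero-mean and independent across clients, and inequality (c) is due to A.1 of Assumption 1. Notice that owing to the function smoothness, we have for arbitrary $x, y \in \mathbb{R}^d$,

$$f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{L_{\max}}{2}\|y - x\|^2, \tag{54}$$

which is equivalent to

$$\|\nabla f(x) - \nabla f(y)\| \le L_{\max}\|y - x\|. \tag{55}$$

Now to bound the term in (52c), for every $s$, we have

$$\begin{aligned}
\left\|\nabla f(\bar{x}_{t-1,s-1}) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2 &= \left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\bar{x}_{t-1,s-1}) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2 \\
&\le \frac{L_{\max}^2}{n}\sum_{i=1}^{n}\|\bar{x}_{t-1,s-1} - x^i_{t-1,s-1}\|^2 \\
&= \frac{\gamma^2 L_{\max}^2}{n}\sum_{i=1}^{n}\left\|\sum_{q=1}^{s-1}\left(\frac{1}{n}\sum_{j=1}^{n}g^j_{t-1,q} - g^i_{t-1,q}\right)\right\|^2 \\
&\le \frac{(s-1)\gamma^2 L_{\max}^2}{n}\sum_{i=1}^{n}\sum_{q=1}^{s-1}\left\|\frac{1}{n}\sum_{j=1}^{n}g^j_{t-1,q} - g^i_{t-1,q}\right\|^2 \\
&\le \frac{2(s-1)\gamma^2 L_{\max}^2}{n}\sum_{i=1}^{n}\sum_{q=1}^{s-1}\left(\left\|\frac{1}{n}\sum_{j=1}^{n}g^j_{t-1,q}\right\|^2 + \left\|g^i_{t-1,q}\right\|^2\right) \\
&\le \frac{2(s-1)\gamma^2 L_{\max}^2}{n}\sum_{i=1}^{n}\sum_{q=1}^{s-1}\left(\frac{1}{n}\sum_{j=1}^{n}\left\|g^j_{t-1,q}\right\|^2 + \left\|g^i_{t-1,q}\right\|^2\right).
\end{aligned} \tag{56}$$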
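For any $t = 1, \dots, T$, $i = 1, \dots, n$ and $q = 1, \dots, s-1$, taking expectation over the mini-batch gradient noise, we have

$$\mathbb{E}\left[\left\|g^i_{t-1,q}\right\|^2\right] = \mathbb{E}\left[\left\|g^i_{t-1,q} - \nabla f_i(x^i_{t-1,q-1}) + \nabla f_i(x^i_{t-1,q-1})\right\|^2\right] \le \mathbb{E}\left[\left\|g^i_{t-1,q} - \nabla f_i(x^i_{t-1,q-1})\right\|^2\right] + \left\|\nabla f_i(x^i_{t-1,q-1})\right\|^2 \le \zeta^2 + G^2. \tag{57}$$

Substituting (57) into (56), we have

$$\left\|\nabla f(\bar{x}_{t-1,s-1}) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2 \le 4(s-1)^2\gamma^2 L_{\max}^2(\zeta^2 + G^2). \tag{58}$$

Further substituting (53), (52) and (58) into (51) and rearranging the terms, we obtain

$$\begin{aligned}
\mathbb{E}[f(\bar{x}_t) - f(x_{t-1})] \le{}& \sum_{s=1}^{E}\left(-\frac{\gamma}{2}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 + \frac{\gamma^2 L_{\max} - \gamma}{2}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2\right) \\
&+ \sum_{s=1}^{E}\left(\frac{\gamma^2\zeta^2 L_{\max}}{2n} + \frac{\gamma}{2}\left\|\nabla f(\bar{x}_{t-1,s-1}) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^i_{t-1,s-1})\right\|^2\right) \\
\overset{(a)}{\le}{}& -\frac{\gamma}{2}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 + \frac{E\gamma^2\zeta^2 L_{\max}}{2n} + \sum_{s=1}^{E}2(s-1)^2\gamma^3 L_{\max}^2(\zeta^2 + G^2),
\end{aligned} \tag{59}$$

where inequality (a) is by $(\gamma^2 L_{\max} - \gamma) \le 0$. Note that

$$\sum_{s=1}^{E}(s-1)^2 = \frac{(E-1)E(2E-1)}{6} \le \frac{(E-1)E^2}{3}. \tag{60}$$

By applying it to (59), we finally have

$$\mathbb{E}[f(\bar{x}_t) - f(x_{t-1})] \le -\frac{\gamma}{2}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 + \frac{E\gamma^2\zeta^2 L_{\max}}{2n} + \frac{2\gamma^3(E-1)E^2 L_{\max}^2(\zeta^2 + G^2)}{3}. \tag{61}$$

C.4 PROOF OF THEOREM 3

We need a lemma similar to Lemma 4.

Lemma 8. Suppose that Assumption 1 and 3 hold. For the $t$-th ($1 \le t \le T$) communication round in Algorithm 1, if $\eta = \sigma$, $z = +\infty$ and $\sigma > E(G + Q_\infty)$, then

$$\mathbb{E}[f(x_t) - f(\bar{x}_t)] \le \frac{2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n}. \tag{62}$$

Following a similar idea as in the proof of Theorem 1, we have

$$\begin{aligned}
\mathbb{E}[f(x_t) - f(x_{t-1})] &= \mathbb{E}[f(x_t) - f(\bar{x}_t)] + \mathbb{E}[f(\bar{x}_t) - f(x_{t-1})] \\
&\le -\frac{\gamma}{2}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 + \frac{E\gamma^2\zeta^2 L_{\max}}{2n} + \frac{2\gamma^3(E-1)E^2 L_{\max}^2(\zeta^2 + G^2)}{3} + \frac{2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n}.
\end{aligned} \tag{63}$$

Rearranging the terms, we have

$$\frac{1}{E}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2 \le \frac{2\,\mathbb{E}[f(x_{t-1}) - f(x_t)]}{E\gamma} + \frac{\gamma\zeta^2 L_{\max}}{n} + \frac{4\gamma^2(E-1)E L_{\max}^2(\zeta^2 + G^2)}{3} + \frac{4\gamma\sigma^2\sum_{j=1}^{d}L_j}{En}. \tag{64}$$

From the telescopic sum, we obtain

$$\mathbb{E}\left[\frac{1}{TE}\sum_{t=1}^{T}\sum_{s=1}^{E}\|\nabla f(\bar{x}_{t-1,s-1})\|^2\right] \le \frac{2\,\mathbb{E}[f(x_0) - f^*]}{TE\gamma} + \frac{\gamma\zeta^2 L_{\max}}{n} + \frac{4\gamma^2(E-1)E L_{\max}^2(\zeta^2 + G^2)}{3} + \frac{4\gamma\sigma^2\sum_{j=1}^{d}L_j}{En}. \tag{65}$$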
Here we provide a simple example to show that when $\sigma < E(G + Q_\infty)$, the algorithm cannot converge. Consider $E = 1$, $Q_\infty = 0$ and the problem

$$\min_{x \in \mathbb{R}}\ (x - A)^2 + (x + A)^2, \tag{66}$$

where $A > 0$ is some positive number. Suppose we choose the initial point to be $x_0 = \frac{A}{2}$. As one can see, the gradients at $x_0$ for the two parts of the objective function are $-A$ and $3A$, respectively. We denote by $\xi_\infty$ the random noise following the uniform distribution on $[-1, 1]$. If now $\sigma < A$, we have $\mathrm{Sign}(-A + \sigma\xi_\infty) + \mathrm{Sign}(3A + \sigma\xi_\infty) = 0$, i.e., the algorithm never updates the variable.

Proof of Lemma 8. We first note that, when $z = +\infty$, we have

$$\Psi_\infty(x) = \begin{cases} x, & x \in [-1, 1], \\ -1, & x < -1, \\ 1, & x > 1. \end{cases} \tag{67}$$

Again, from the smoothness assumption (A.2 in Assumption 1) we have

$$f(x_t) - f(\bar{x}_t) \le \langle\nabla f(\bar{x}_t), x_t - \bar{x}_t\rangle + \sum_{j=1}^{d}\frac{L_j(x_t(j) - \bar{x}_t(j))^2}{2}. \tag{68}$$

Taking expectation over $\xi_\infty$,

$$\begin{aligned}
\mathbb{E}_{\xi_\infty}[x_t - \bar{x}_t] &= \mathbb{E}_{\xi_\infty}\left[\frac{\gamma}{n}\sum_{i=1}^{n}\left(\sigma\,\mathrm{Sign}\!\left(\sum_{s=1}^{E}g^i_{t,s} + \sigma\xi_\infty\right) - \sum_{s=1}^{E}g^i_{t,s}\right)\right] \\
&\overset{(a)}{=} \frac{\gamma}{n}\sum_{i=1}^{n}\left(\sigma\Psi_\infty\!\left(\frac{\sum_{s=1}^{E}g^i_{t,s}}{\sigma}\right) - \sum_{s=1}^{E}g^i_{t,s}\right) \\
&\overset{(b)}{=} \frac{\gamma}{n}\sum_{i=1}^{n}\left(\sum_{s=1}^{E}g^i_{t,s} - \sum_{s=1}^{E}g^i_{t,s}\right) = 0,
\end{aligned} \tag{69}$$

where equality (a) is because for any $x \in \mathbb{R}^d$, $\mathbb{E}_{\xi_\infty}[\mathrm{Sign}(x + \sigma\xi_\infty)] = \Psi_\infty(x/\sigma)$, and equality (b) is due to $\sigma > \big\|\sum_{s=1}^{E}g^i_{t,s}\big\|_\infty$ almost surely and the property of the function $\Psi_\infty(\cdot)$ in (67). For ease of presentation, we define

$$B^i_t \overset{\text{def.}}{=} \sigma\,\mathrm{Sign}\!\left(\sum_{s=1}^{E}g^i_{t,s} + \sigma\xi_\infty\right). \tag{70}$$

From (69) we have learned that $\mathbb{E}_{\xi_\infty}[B^i_t] = \sum_{s=1}^{E}g^i_{t,s}$. Thus, for any $j = 1, \dots, d$, we have

$$\begin{aligned}
\mathbb{E}_{\xi_\infty}[(x_t(j) - \bar{x}_t(j))^2] &\le \frac{\gamma^2}{n^2}\mathbb{E}_{\xi_\infty}\left[\left(\sum_{i=1}^{n}\Big(B^i_t(j) - \mathbb{E}_{\xi_\infty}B^i_t(j)\Big)\right)^2\right] = \frac{\gamma^2}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\xi_\infty}\left[\Big(B^i_t(j) - \mathbb{E}_{\xi_\infty}B^i_t(j)\Big)^2\right] \\
&\le \frac{2\gamma^2}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\xi_\infty}\left[\big(B^i_t(j)\big)^2 + \big(\mathbb{E}_{\xi_\infty}B^i_t(j)\big)^2\right] \le \frac{4\gamma^2\sigma^2}{n}.
\end{aligned} \tag{71}$$

Finally, substituting (69) and (71) into (68), and taking the expectation over both $\xi_\infty$ and the mini-batch gradient noise, we have

$$\mathbb{E}[f(x_t) - f(\bar{x}_t)] \le \mathbb{E}[\langle\nabla f(\bar{x}_t), x_t - \bar{x}_t\rangle] + \mathbb{E}\left[\sum_{j=1}^{d}\frac{L_j(x_t(j) - \bar{x}_t(j))^2}{2}\right] \le \frac{2\gamma^2\sigma^2\sum_{j=1}^{d}L_j}{n}.$$

In Figure 7, we visualize the performance of 1-SignSGD and ∞-SignSGD under different noise scales. As we can see, the results for 1-SignSGD and ∞-SignSGD are almost the same, except that ∞-SignSGD is slightly better than 1-SignSGD when the noise scale is large. We denote the noiseless case, i.e., Algorithm 1 with σ = 0, as SignFedAvg.

EMNIST: For the experiment on EMNIST, we fixed the client stepsize as 0.05. We tuned the server stepsize and the noise scale via grid search: [1, 0.5, 0.1, 0.05, 0.01, 0.005] for the stepsize, and [0, 0.005, 0.02, 0.05, 0.01, 0.03, 0.05, 0.1, 0.2] for the noise scale. The comparison between 1-SignFedAvg and ∞-SignFedAvg on EMNIST is shown in Figure 8. The hyperparameters used in Figure 4 and Figure 8 are summarized in Table 4. We also visualize the performance of 1-SignFedAvg and ∞-SignFedAvg under various noise scales and local steps in Figure 9 and Figure 10.

CIFAR-10: For the experiment on CIFAR-10, we fixed the client stepsize as 0.1. We tuned the server stepsize and the noise scale via grid search: $[10^{0}, 10^{-0.5}, 10^{-1}, 10^{-1.5}, 10^{-2}, 10^{-2.5}, 10^{-3}]$ for the stepsize, and [0, 0.0001, 0.0005, 0.001, 0.005] for the noise scale. The comparison between 1-SignFedAvg and ∞-SignFedAvg on CIFAR-10 is displayed in Figure 11. The hyperparameters used in Figure 5 and Figure 11 are summarized in Table 5. We also visualize the performance of 1-SignFedAvg and ∞-SignFedAvg under various noise scales and different numbers of local steps in Figure 12 and Figure 13. An interesting phenomenon in Figure 12 and Figure 13 is that the more local steps there are, the less impact the additive noise has on the convergence performance.

For the experiment results shown in Figure 6, except for the noise scale, both 1-SignSGD/1-SignFedAvg and 1-SignSGD-plateau/1-SignFedAvg-plateau used the same hyperparameters found in previous experiments.
In Table 6, we show the hyperparameters of the Plateau criterion for the adaptive noise scale, which are chosen by a few rounds of trial and error. Besides, we also show the corresponding test accuracy in Figure 14, and how the noise scale evolves over communication rounds in Figure 15.

In this part, we compare our Algorithm 1 to QSGD (Alistarh et al., 2017) along with its extension to FedAvg, i.e., FedPAQ (Reisizadeh et al., 2020). As we have shown that z-SignSGD/z-SignFedAvg with Gaussian noise and with uniform noise behave very similarly, here we only consider 1-SignSGD/1-SignFedAvg for comparison. We use the unbiased quantizer in (14) for both QSGD and FedPAQ. The quantization level s plays a key role in the performance and communication efficiency of QSGD and FedPAQ. In a rough sense, s also represents the number of bits needed to transmit a single coordinate. Thus, we will compare our algorithms to them with different choices of s. We remark that, even in the most extreme case, i.e., s = 1, the quantizer still needs three symbols, -1, 1 and 0, for communication, while the sign-based method only uses -1 and 1.

Setting. Again, we consider the three different datasets used in Section 4.2 and 4.2. Specifically, we compare 1-SignSGD with QSGD on the non-i.i.d. MNIST dataset, and compare 1-SignFedAvg with FedPAQ on EMNIST and CIFAR-10. For all the algorithms, the client's stepsize and batchsize

The value ε is regarded as the privacy budget: the smaller it is, the stronger the privacy the algorithm provides. The quantity δ is usually set to $\frac{1}{n}$. The most popular mechanism to achieve DP is the Gaussian mechanism (Dwork et al., 2014). Specifically, similar to (Agarwal et al., 2021; Kairouz et al., 2021), here we consider the client-level DP guarantee for Federated Learning, i.e., we regard each client as a single data point in Definition 3.
Besides, we also adopt the local version of the DP guarantee, i.e., each dataset in Definition 3 contains only one data point. Such a DP guarantee does not assume that the server is trustworthy and hence is commonly used in practice (Agarwal et al., 2021; Kairouz et al., 2021). For more details on DP and its application to FL, we refer readers to (Dwork et al., 2014; Mironov, 2017; Abadi et al., 2016b; Geyer et al., 2017).

Here we describe the differentially private version of Algorithm 1, which we term DP-SignFedAvg (Algorithm 2). The only difference between DP-SignFedAvg and z-SignFedAvg is that z = 1 is chosen (Gaussian noise), and the norm of the local gradients is clipped before perturbing them with noise and applying the sign compression. To obtain the client-level privacy guarantee, we adopt the privacy accounting method in (Mironov et al., 2019).

5: for i in S do
6: $x^i_{t-1,0} = x_{t-1}$
7: for s = 1 to E do
8: $g^i_{t-1,s} = g_i(x^i_{t-1,s-1})$, where $g_i(\cdot)$ is the mini-batch gradient oracle of the i-th client.
9: $x^i_{t-1,s} = x^i_{t-1,s-1} - \gamma g^i_{t-1,s}$.
10: end for
11: $\Delta^i_{t-1} = \mathrm{Sign}\left(\frac{x_{t-1} - x^i_{t-1,E}}{\max\{1, \|x_{t-1} - x^i_{t-1,E}\|/C\}} + \mathcal{N}(0, \sigma^2 C^2 I)\right)$.
12: Send $\Delta^i_{t-1}$ to the server.
13: end for
14: On Server:
15: $x_t = x_{t-1} - \eta\frac{1}{n}\sum_{i=1}^{n}\Delta^i_{t-1}$.
16: Broadcast $x_t$ to clients.
17: end for
18: return $x_T$.

Now we investigate the empirical performance of DP-SignFedAvg on EMNIST, and compare it with the uncompressed DP-FedAvg used in (Agarwal et al., 2021; Kairouz et al., 2021).

Settings. We followed a setting similar to (Kairouz et al., 2021) for the experiment on EMNIST. We adopted client-level differential privacy, i.e., we treated each client as a single data point, and perturbed the local gradients before sending them to the server. We also used the technique of privacy amplification by client sub-sampling in (Kairouz et al., 2021; Geyer et al., 2017). For both DP-FedAvg and DP-SignFedAvg, the same CNN as in Section 4.2 was used, and the maximum norm for clipping was set to 0.01.
We sampled 100 clients at each communication round and ran both algorithms for 500 communication rounds. Similar to (Kairouz et al., 2021), we ran the experiments under the privacy budgets ε = [1, 2, 4, 6, 8, 10]. In Table 8, we provide the hyperparameters of DP-FedAvg and DP-SignFedAvg for all levels of privacy budget. Unlike previous experiments, the noise scales used in this experiment were determined by the privacy budget and the privacy accounting method in (Mironov et al., 2019).

Results. It can be seen from Figure 17 that DP-SignFedAvg is only slightly inferior to the uncompressed DP-FedAvg for various levels of privacy budget. It is worth noting that the work (Kairouz et al., 2021) conducted a similar experiment and showed that the compressed DP-FedAvg



Figure 1: Performance of tested algorithms under different problem dimension.

Figure 2: z-SignSGD under various noise scales.

Figure 2 displays the results of 1-SignSGD and ∞-SignSGD with various noise scales. We can see that there is a clear bias-variance trade-off for different noise scales, which corroborates our analysis after Theorem 1. It is also worth mentioning that the best choice of σ for Algorithm 1 shown in Figure 2 is much smaller than the one predicted by the theorems.

4.2 z-SIGNSGD ON NON-I.I.D MNIST

In this section, we consider an extremely non-i.i.d setting with the MNIST dataset (Deng, 2012). Specifically, we split the dataset into 10 parts based on the labels and each client has the data of one digit only. A simple two-layer convolutional neural network (CNN) from the PyTorch tutorial (Paszke et al., 2017) was used for the learning task. The following algorithms were implemented: SGDwM (Distributed SGD (Ghadimi & Lan, 2013) with momentum), EF-SignSGDwM (Distributed

Figure 4: Performance of FedAvg and 1-SignFedAvg on the EMNIST dataset.

Figure 5: Performance of FedAvg and 1-SignFedAvg on the CIFAR-10 dataset.

4.4 PLATEAU CRITERION FOR TUNING THE NOISE SCALE

Figure 6: Evaluating the efficacy of Plateau criterion on three different datasets.

Figure 7: z-SignFedAvg under different noise scales on non-i.i.d MNIST

Figure 8: Performance of 1-SignFedAvg and ∞-SignFedAvg on EMNIST dataset.

E = 2 (f) E = 5 (g) E = 10 (h) E = 20

Figure 9: EMNIST: 1-SignFedAvg under different noise scales and different numbers of local steps

Figure 12: CIFAR-10: 1-SignFedAvg under different noise scales and different numbers of local steps

Figure 14: The corresponding test accuracy to Figure 6.

Figure 15: The corresponding trends of noise scale to Figure 6.

DP-SignFedAvg
Require: Total communication rounds T, Number of local steps E, Number of clients n, Client sampling ratio q, Clients stepsize γ, Server stepsize η, Noise coefficient σ, Norm clipping coefficient C.
1: Initialize x 0 and for i = 1, ..., n.
2: for t = 1 to T do
3: Sample a set of clients S with size qn for current round.
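The client-side compression of DP-SignFedAvg (clip the model difference to norm C, add Gaussian noise with standard deviation σC, then keep only the coordinate-wise signs) and the server step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the server here averages over the messages it receives, and privacy accounting is omitted entirely.

```python
import numpy as np

def dp_sign_compress(model_delta, clip_norm, noise_coeff, rng):
    """Clip, perturb with Gaussian noise, and take coordinate-wise signs.

    model_delta is x_{t-1} - x^i_{t-1,E}; clip_norm is C; noise_coeff is sigma.
    """
    model_delta = np.asarray(model_delta, dtype=float)
    scale = max(1.0, np.linalg.norm(model_delta) / clip_norm)
    clipped = model_delta / scale     # norm is now at most clip_norm
    noise = rng.normal(0.0, noise_coeff * clip_norm, size=model_delta.shape)
    # Sign(0) is taken as +1, matching the sign operator defined in the paper.
    return np.where(clipped + noise >= 0, 1.0, -1.0)

def server_update(x_prev, client_messages, server_stepsize):
    """x_t = x_{t-1} - eta * average of the received sign vectors (assumed averaging)."""
    return x_prev - server_stepsize * np.mean(client_messages, axis=0)

rng = np.random.default_rng(0)
x_prev = np.zeros(4)
deltas = [np.array([0.5, -2.0, 0.0, 1.0]), np.array([0.1, 0.3, -0.2, 0.4])]
messages = np.stack(
    [dp_sign_compress(d, clip_norm=0.01, noise_coeff=1.0, rng=rng) for d in deltas]
)
x_next = server_update(x_prev, messages, server_stepsize=0.05)
```

Each message carries one bit per coordinate regardless of the clipping norm, and the noise needed for privacy is the same Gaussian perturbation that the 1-SignFedAvg analysis already accounts for.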

Comparison of Case 1 and Case 2.

Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19, 2018.

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693-5700, 2019.

Xinwei Zhang, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Jinfeng Yi. Understanding clipping for federated learning: Convergence and client-level differential privacy. arXiv preprint arXiv:2106.13673, 2021.

Qinqing Zheng, Shuxiao Chen, Qi Long, and Weijie Su. Federated f-differential privacy. In International Conference on Artificial Intelligence and Statistics, pp. 2251-2259. PMLR, 2021.

Shuai Zheng, Ziyue Huang, and James Kwok. Communication-efficient distributed blockwise momentum SGD with error-feedback. Advances in Neural Information Processing Systems, 32, 2019.



In Table 3, we provide the tuned hyperparameters for all the tested algorithms on non-i.i.d MNIST.

Hyperparameters used for FL on non-i.i.d MNIST.


are set to the same values used in Section 4.2 and 4.2. For 1-SignSGD/1-SignFedAvg, we reuse the previously found optimal hyperparameters. For QSGD, we tune the server stepsize via grid search on [0.1, 0.05, 0.01, 0.005]. For FedPAQ, we tune the server stepsize via grid search on [1, 0.5, 0.1, 0.05, 0.01, 0.005]. The chosen server stepsizes for QSGD and FedPAQ on the three datasets are presented in Table 7.

Algorithm | Non-i.i.d. MNIST | EMNIST | CIFAR-10

Table 7: The chosen server stepsizes for tested QSGD and FedPAQ on three datasets.

Results. From Figure 16, we can see that our proposed sign-based compressor is consistently superior to the unbiased stochastic quantization method in the low-precision region (1 bit to 8 bits), the only exception being that QSGD with s = 4 is slightly better than our 1-SignSGD on the non-i.i.d MNIST dataset. These results again show, as (Bernstein et al., 2018; Karimireddy et al., 2019) did, that the biased compressor, or more specifically the sign-based compressor, can be a strong competitor to unbiased quantizers due to its reduced variance. Our contribution in this work is to provide a generic framework that bridges the unbiased compressor and the biased one, which allows one to conveniently seek an optimal trade-off between the compression bias and variance.

Definition 3 (Approximate DP (Dwork et al., 2014)). A randomized algorithm $M$ that takes as input a dataset consisting of individuals is $(\varepsilon, \delta)$-differentially private if for any pair of datasets $S, S'$ that differ in the record of a single individual, and for any event $E$,

$$\Pr[M(S) \in E] \le e^{\varepsilon}\Pr[M(S') \in E] + \delta.$$

Privacy budget | η for DP-FedAvg | η for DP-SignFedAvg | Noise scale

with 12 bits for each gradient coordinate can be far worse than the uncompressed DP-FedAvg. This is in strong contrast to our DP-SignFedAvg, which uses only 1 bit for each coordinate.

