ANALYSIS OF ERROR FEEDBACK IN COMPRESSED FEDERATED NON-CONVEX OPTIMIZATION

Abstract

Communication cost between the clients and the central server can be a bottleneck in real-world Federated Learning (FL) systems. In classical distributed learning, Error Feedback (EF) has been a popular technique to remedy the downsides of biased gradient compression, but the literature on applying EF to FL is still very limited. In this work, we propose a compressed FL scheme equipped with error feedback, named Fed-EF, with two variants depending on the global optimizer. We provide theoretical analysis showing that Fed-EF matches the convergence rate of its full-precision FL counterparts in non-convex optimization under data heterogeneity. Moreover, we initiate the first analysis of EF under partial client participation, an important scenario in FL, and demonstrate that the convergence rate of Fed-EF exhibits an extra slow-down factor due to the "stale error compensation" effect. Experiments are conducted to validate the efficacy of Fed-EF in practical FL tasks and justify our theoretical findings.

• Under data heterogeneity, Fed-EF attains an asymptotic convergence rate of O(1/√(TKn)), where T is the number of communication rounds, K is the number of local training steps and n is the number of clients. Our new analysis matches the convergence rate of the full-precision FL counterparts, improving the previous convergence result [6] on error-compensated FL (see detailed comparisons in Section 3). Moreover, Fed-EF-AMS is the first compressed adaptive FL algorithm in the literature.
• Partial participation (PP) has not been considered for standard error feedback in distributed learning. We initiate a new analysis of Fed-EF in this setting, with local steps and non-iid data all considered at the same time. We prove that under PP, Fed-EF exhibits a slow-down factor of n/m compared with the best full-precision rate, where m is the number of active clients per round; this is caused mainly by the "stale error compensation" effect.
• Experiments are conducted to illustrate the effectiveness of the proposed methods: we show that Fed-EF matches the performance of full-precision FL with a significant reduction in communication, and compares favorably against algorithms using unbiased compression without error feedback. Numerical examples are also provided to justify our theory.

Distributed SGD with compressed gradients. In distributed SGD training systems, extensive work has considered compression of the communicated gradients. Unbiased stochastic compressors include stochastic rounding and QSGD [3; 66; 58; 38], and magnitude-based random sparsification [57]. The works [50; 7; 8; 28; 24] analyzed communication compression using only the sign (1-bit) information of the gradients. Unbiased compressors can be combined with variance reduction techniques for acceleration, e.g., [17]. On the other hand, examples of popular biased compressors include TopK [37; 54; 52], which only transmits the gradient coordinates with the largest magnitudes, and fixed (or learned) quantization [13; 66]. See [9] for a summary of more biased compressors.

1. INTRODUCTION

Federated Learning (FL) has seen numerous applications in, e.g., computer vision, language processing, public health, and the Internet of Things (IoT) [19; 44; 62; 39; 49; 29; 25]. A centralized FL system includes multiple clients, each with local data, and one central server that coordinates the training process. The goal of FL is for n clients to collaboratively find a global model, parameterized by θ, such that

θ* = argmin_{θ∈R^d} f(θ) := argmin_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ),   (1)

where f_i(θ) := E_{D∼D_i}[F_i(θ; D)] is a non-convex loss function for the i-th client w.r.t. the local data distribution D_i. Taking standard local SGD [42; 53] as an example, in each training round the server first broadcasts the model to the clients. Then, each client trains the model on its local data, after which the updated local models are transmitted back to the server and aggregated. The number of clients, n, can be either tens/hundreds (cross-silo FL [41; 21], e.g., clients are companies) or as large as millions (cross-device FL [26; 29], e.g., clients are personal devices). There are two primary benefits of FL: (i) the clients train the model simultaneously, which is efficient in terms of computational resources; (ii) each client's data are kept local throughout training and never transmitted to other parties, which promotes data privacy. However, the efficiency and broad application scenarios also bring challenges for FL method design:
• Communication cost: In most FL algorithms, clients are allowed to conduct multiple training steps (e.g., local SGD updates) in each round. Though this reduces the communication frequency, the per-round communication cost is still a challenge in real-world FL systems with limited bandwidth, e.g., portable devices at the wireless network edge [5; 61; 29].
• Data Heterogeneity: Unlike in classical distributed training, the local data distributions in FL (D_i in (1)) can be different (non-iid), reflecting many real-world scenarios where the local data held by different clients (e.g., app/website users) are highly personalized. When multiple local training steps are taken, the local models can become "biased" towards minimizing the local losses instead of the global loss. This data heterogeneity may hinder the global model from converging to a good solution [34; 67; 33].
• Partial participation (PP): Another practical issue, especially for cross-device FL, is partial participation (PP), where clients do not join training consistently, e.g., due to unstable connections or user change. That is, only a fraction of the clients are involved in each FL training round, which may also slow down the convergence of the global model [10; 12].

FL under compression.

In order to overcome the main challenge of the communication bottleneck, several works have considered federated learning with compressed message passing. Examples include FedPaQ [47], FedCOM [18] and FedZip [40]. All these algorithms are built upon directly compressing the model updates communicated from clients to server. In particular, [47; 18] proposed to use unbiased stochastic compressors such as stochastic quantization [3] and sparsification [57], showing that, with considerable communication saving, applying unbiased compression in FL can approach the learning performance of uncompressed FL algorithms. However, unbiased (stochastic) compressors typically require additional computation (sampling), which is less efficient in real-world large training systems. Biased gradients/compressors are also common in many applications [2].

Error feedback (EF) for distributed training. A simpler and popular type of compressor is the deterministic compressor, including fixed quantization [13], TopK sparsification [54; 52; 35], SignSGD [50; 7; 8], etc., which are biased compression operators. In the classical distributed learning literature, it has been shown that directly updating with the biased gradients may slow down convergence or even lead to divergence [28; 2]. A popular remedy is the so-called error feedback (EF) strategy [54]: in each iteration, the local worker sends a compressed gradient to the server and records the local compression error, which is subsequently used to adjust the gradient computed in the next iteration, conceptually "correcting the bias" due to compression. It is known that biased gradient compression with EF can achieve the same convergence rate as the full-precision counterparts [28; 35].

Our contributions. Despite the rich literature on EF in classical distributed training, it has not been well explored in the context of federated learning. In this paper, we provide a thorough analysis of EF in FL.
In particular, the three key features of FL, namely local steps, data heterogeneity and partial participation, pose interesting questions regarding the performance of EF in federated learning: (i) Can EF still achieve the same convergence rate as full-precision FL algorithms, possibly with highly non-iid local data distributions? (ii) How does partial participation change the situation and the results? We present new algorithms and results to address these questions:
• We study an FL framework with biased compression and error feedback, called Fed-EF, with two variants (Fed-EF-SGD and Fed-EF-AMS) depending on the global optimizer (SGD and the adaptive AMSGrad [46], respectively). Under data heterogeneity, Fed-EF has an asymptotic convergence rate of O(1/√(TKn)).

There exist other compression schemes, such as vector quantization [64; 40], low-rank approximation [56] and sketching [22], which will not be the focus of this paper.

Error Feedback (EF) for biased compression. It has been shown that directly implementing biased compression in distributed SGD leads to worse convergence and generalization [2; 9]. Error feedback (EF), as proposed in [54], can fix this issue [28]. With EF, distributed SGD under biased compression can match the convergence rate of full-precision distributed SGD, e.g., also achieving linear speedup (O(1/√(Tn))) w.r.t. the number of workers n in distributed SGD [23; 51; 4; 68; 55]. Recently, a variant of EF, called EF21, was proposed [48], which differs from EF in algorithm design. [15] applied EF21 to FL under several settings. Our work is different: we analyze the standard EF in FL, design new algorithms and derive faster convergence rates with a different analysis. Finally, in federated learning, as mentioned earlier, several works have applied compression to the client-to-server communication, e.g., [47; 40; 18].
Among the limited literature on applying EF to FL, the most relevant method is QSparse-local-SGD [6], which is a special case of the Fed-EF framework studied in this paper. In Section 3 and Section 4, we will compare our proposed Fed-EF framework with these related methods in terms of both algorithm and theory. Our algorithm also exploits the adaptive gradient method AMSGrad [46], which has been applied to distributed and federated learning [45; 35]. See, for instance, [14; 65; 30] for the series of works on adaptive gradient methods.
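As a concrete illustration of the classical EF mechanism discussed above, the following is a minimal sketch of distributed SGD with a biased TopK compressor and error feedback. The quadratic objectives, dimensions and learning rate are illustrative choices of ours, not from the paper:

```python
import numpy as np

def top_k(x, k):
    """Biased TopK compressor: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
d, n_workers, lr = 10, 4, 0.1
# toy local objectives f_i(x) = 0.5 x^T A_i x, all minimized at x = 0
A = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(n_workers)]
theta = rng.normal(size=d)
errors = [np.zeros(d) for _ in range(n_workers)]

for step in range(500):
    compressed = []
    for i in range(n_workers):
        g = A[i] @ theta                # local gradient
        corrected = g + errors[i]       # add back the residual from last step
        c = top_k(corrected, k=2)       # biased compression
        errors[i] = corrected - c       # record the compression error
        compressed.append(c)
    theta -= lr * np.mean(compressed, axis=0)  # server averages compressed grads

print(np.linalg.norm(theta))  # driven toward 0 despite the biased compressor
```

Without the `errors` bookkeeping, the same loop with TopK can stall or drift; the error accumulator is what eventually transmits every coordinate's contribution.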

3. FED-EF: COMPRESSED FEDERATED LEARNING WITH ERROR FEEDBACK

In this paper, we consider deterministic compressors, which are simple and computationally efficient. Throughout the paper, [n] denotes the integer set {1, ..., n}.

Definition 1 (q_C-deviate compressor). A compressor C : R^d → R^d is q_C-deviate if there exists 0 ≤ q_C < 1 such that ∥C(x) − x∥² ≤ q_C² ∥x∥² for all x ∈ R^d. In particular, two examples are [54; 68]:
• TopK: let S = {i ∈ [d] : |x_i| ≥ t}, where t is the (1 − k)-quantile of {|x_i|, i ∈ [d]}. The TopK compressor with compression rate k is defined as C(x)_i = x_i if i ∈ S, and C(x)_i = 0 otherwise.
• (Scaled) Sign: C(x) = (∥x∥₁/d) sign(x), applied blockwise (see Appendix A).
A larger q_C indicates heavier compression, and q_C = 0 implies no compression, i.e., C(x) = x. Additionally, these two compression operations can be combined to derive the so-called "heavy-Sign" compressor, where we first apply TopK and then Sign. This strategy is also q-deviate (see Appendix A for details) and will also be tested in our experiments in Section 5.

Can we simply use biased compressors in communication-efficient FL? As an example, in Figure 1 we report the test accuracy of a Multi-Layer Perceptron (MLP) trained on MNIST in a non-iid FL environment (see Section 5 for more description), comparing Fed-SGD [53] with full communication v.s. Sign compression. We observe a catastrophic performance loss when biased compression is used directly.

Fed-EF algorithm. To resolve this problem, error feedback (EF), a popular tool in distributed training, can be adapted to federated learning. In Algorithm 1, we present an FL framework with biased compression, named Fed-EF, whose main steps are summarized below.
In round t: 1) The server broadcasts the global model θ_t to all clients (line 5); 2) the i-th client performs K steps of local SGD to obtain the local model θ_t,i^(K+1), computes the compressed local model update ∆̃_t,i, updates the local error accumulator e_t,i, and sends ∆̃_t,i back to the server (lines 6-12); 3) the server receives ∆̃_t,i, i ∈ [n] from all clients, takes the average, and performs a global model update using the averaged compressed local model updates (lines 15-19). Depending on the global optimizer, we propose two variants: Fed-EF-SGD, which applies SGD global updates, and Fed-EF-AMS, whose global optimizer is AMSGrad [46].

Algorithm 1: Fed-EF
1: Input: learning rates η, η_l; initialize θ_1 and e_1,i = 0 for all i ∈ [n]
2: for t = 1, ..., T do
3:   Central server: broadcast θ_t to all clients
4:   parallel for worker i ∈ [n] do:
5:     Receive model parameter θ_t from the central server, set θ_t,i^(1) = θ_t
6:     for k = 1, ..., K do
7:       Compute stochastic gradient g_t,i^(k) at θ_t,i^(k)
8:       Local update θ_t,i^(k+1) = θ_t,i^(k) − η_l g_t,i^(k)
9:     end for
10:    Compute local model update ∆_t,i = θ_t − θ_t,i^(K+1)
11:    Compress ∆̃_t,i = C(∆_t,i + e_t,i) and send ∆̃_t,i to the server
12:    Update the error e_{t+1,i} = e_t,i + ∆_t,i − ∆̃_t,i
13:  end parallel
14:  Central server do:
15:    Global aggregation ∆̃_t = (1/n) Σ_{i=1}^n ∆̃_t,i
16:    Update the global model θ_{t+1} = θ_t − η ∆̃_t   ▷ Fed-EF-SGD
17:    m_t = β_1 m_{t−1} + (1 − β_1) ∆̃_t   ▷ Fed-EF-AMS
18:    v_t = β_2 v_{t−1} + (1 − β_2) ∆̃_t², v̂_t = max(v̂_{t−1}, v_t)
19:    Update the global model θ_{t+1} = θ_t − η m_t/(√v̂_t + ϵ)
20: end for
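A minimal runnable sketch of the Fed-EF-SGD loop in Algorithm 1, with TopK compression and toy quadratic client objectives standing in for real local data. The objective, shapes and learning rates are illustrative assumptions of ours:

```python
import numpy as np

def top_k(x, frac):
    """TopK compressor keeping a `frac` fraction of coordinates."""
    out = np.zeros_like(x)
    k = max(1, int(frac * x.size))
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
d, n, K, eta_l, eta = 20, 5, 5, 0.1, 0.5
theta = rng.normal(size=d)
e = np.zeros((n, d))                       # per-client error accumulators
targets = rng.normal(size=(n, d))          # toy heterogeneous local optima
opt = targets.mean(axis=0)                 # optimum of the averaged loss
init_dist = np.linalg.norm(theta - opt)

for t in range(100):
    tilde_deltas = []
    for i in range(n):                     # "parallel for worker i" (lines 4-13)
        th = theta.copy()
        for k in range(K):                 # K local SGD steps (lines 6-9)
            th -= eta_l * (th - targets[i])     # grad of 0.5||x - target_i||^2
        delta = theta - th                 # accumulated local update (line 10)
        tilde = top_k(delta + e[i], frac=0.5)   # compress update + error (line 11)
        e[i] += delta - tilde              # error feedback (line 12)
        tilde_deltas.append(tilde)
    theta -= eta * np.mean(tilde_deltas, axis=0)  # server SGD step (lines 15-16)

print(np.linalg.norm(theta - opt) / init_dist)  # distance to optimum shrinks
```

Note that only the compressed `tilde` vectors cross the (simulated) network; the residual `delta - tilde` stays on the client inside `e[i]`.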

In Fed-EF-AMS, by the nature of adaptive gradient methods, we incorporate momentum (m_t) with different implicit dimension-wise learning rates η/(√v̂_t + ϵ). Additionally, for conciseness, the presented algorithm employs one-way compression (clients-to-server). In Appendix D, we also provide a two-way compressed Fed-EF framework and demonstrate that adding server-to-clients compression does not affect the convergence rates.

Comparison with prior work. Compared with EF approaches in classical distributed training, e.g., [54; 28; 68; 38; 16; 35], our algorithm allows local steps (more communication-efficient) and uses two-sided learning rates. When η ≡ 1, Fed-EF-SGD reduces to QSparse-local-SGD [6]. In Section 4, we will demonstrate how the two-sided learning rate schedule improves upon the convergence analysis of the one-sided learning rate approach [6]. On the other hand, several recent works considered compressed FL using unbiased stochastic compressors (all of which use SGD as the global optimizer). FedPaQ [47] applied stochastic quantization without error feedback to local SGD, which was improved by [18] using a gradient tracking trick that, however, requires communicating an extra vector from server to clients and is thus less efficient than Fed-EF. [40] provided an empirical study on directly compressing the local updates using various compressors in Fed-SGD, while we use EF to compensate for the bias. [43] proposed FedLin, which only uses compression for synchronizing a local memory term but still requires transmitting full-precision updates. Finally, to our knowledge, Fed-EF-AMS is the first compressed adaptive FL method in the literature.
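The Fed-EF-AMS server step (lines 17-19 of Algorithm 1) can be sketched as follows; the β_1, β_2 and ϵ values are conventional AMSGrad defaults of ours, which the paper does not prescribe here:

```python
import numpy as np

def ams_server_step(theta, tilde_delta, m, v, v_hat,
                    eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update applied to the averaged compressed update."""
    m = beta1 * m + (1 - beta1) * tilde_delta          # momentum (line 17)
    v = beta2 * v + (1 - beta2) * tilde_delta ** 2     # second moment (line 18)
    v_hat = np.maximum(v_hat, v)                       # AMSGrad max step (line 18)
    theta = theta - eta * m / (np.sqrt(v_hat) + eps)   # dimension-wise rates (line 19)
    return theta, m, v, v_hat
```

Because the division by √v̂_t + ϵ is element-wise, coordinates with consistently small compressed updates receive larger effective steps, which is the adaptivity referred to in the text.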

4. THEORETICAL RESULTS

Assumption 1 (Smoothness). For all i ∈ [n], f_i is L-smooth: ∥∇f_i(x) − ∇f_i(y)∥ ≤ L∥x − y∥.
Assumption 2 (Bounded variance). For all t ∈ [T], i ∈ [n], k ∈ [K]: (i) the stochastic gradient is unbiased: E[g_t,i^(k)] = ∇f_i(θ_t,i^(k)); (ii) the local variance is bounded: E∥g_t,i^(k) − ∇f_i(θ_t,i^(k))∥² ≤ σ²; (iii) the global variance is bounded: (1/n) Σ_{i=1}^n ∥∇f_i(θ_t) − ∇f(θ_t)∥² ≤ σ_g².
Both assumptions are standard in the convergence analysis of stochastic gradient methods. The global variance bound σ_g² in Assumption 2 characterizes the difference among the local objective functions, which is mainly caused by the different local data distributions D_i in (1), i.e., data heterogeneity.

Assumption 3 (Compression discrepancy). There exists some q_A < 1 such that, in every round t ∈ [T],
E∥(1/n) Σ_{i=1}^n C(∆_t,i + e_t,i) − (1/n) Σ_{i=1}^n (∆_t,i + e_t,i)∥² ≤ q_A² E∥(1/n) Σ_{i=1}^n (∆_t,i + e_t,i)∥².

In Assumption 3, if we replace "the average of the compressions", (1/n) Σ_{i=1}^n C(∆_t,i + e_t,i), by "the compression of the average", C((1/n) Σ_{i=1}^n (∆_t,i + e_t,i)), the statement immediately holds by Definition 1 with q_A = q_C. Thus, Assumption 3 basically says that these two terms stay close during training. This is a common assumption in related work on compressed distributed learning; for example, a similar assumption is used in [4] analyzing sparsified SGD. In [18], for unbiased compression without EF, a similar condition is also assumed with an absolute bound. In Appendix B, we provide more discussion and empirical justification to validate this analytical assumption in practice.
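The quantity bounded in Assumption 3 is easy to probe numerically. The sketch below measures the ratio for a TopK compressor on random Gaussian vectors standing in for the (∆_t,i + e_t,i) terms; this is an illustrative stand-in of ours, not the paper's empirical protocol from Appendix B:

```python
import numpy as np

def top_k(x, frac):
    """TopK compressor keeping a `frac` fraction of coordinates."""
    out = np.zeros_like(x)
    k = max(1, int(frac * x.size))
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
d, n = 100, 10
u = rng.normal(size=(n, d))                 # stand-ins for Delta_{t,i} + e_{t,i}
avg_of_comp = np.mean([top_k(ui, 0.2) for ui in u], axis=0)
avg = u.mean(axis=0)
q_A_sq = np.sum((avg_of_comp - avg) ** 2) / np.sum(avg ** 2)
print(q_A_sq)   # Assumption 3 posits this stays below some q_A^2 < 1
```

Averaging helps here: the per-client compression residuals partially cancel, so the measured ratio is typically well below the worst-case single-vector bound of Definition 1.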

4.1. CONVERGENCE OF FED-EF: LINEAR SPEEDUP UNDER DATA HETEROGENEITY

Theorem 1 (Fed-EF-SGD). Let θ* = argmin f(θ), and denote q = max{q_C, q_A} and C_1 := 2 + 4q²/(1 − q²)². Under Assumptions 1 to 3, when η_l ≤ 1/(2KL · max{4, η(C_1 + 1)}), the squared gradient norm of the Fed-EF-SGD iterates in Algorithm 1 can be bounded by
(1/T) Σ_{t=1}^T E∥∇f(θ_t)∥² ≲ (f(θ_1) − f(θ*))/(η η_l T K) + (2η η_l C_1 L/n) σ² + 10η η_l³ C_1 K² L³ (σ² + 6Kσ_g²).

In Theorem 1, the LHS is the expected squared gradient norm at a uniformly chosen global model from t = 1, ..., T, which is a standard measure of convergence in non-convex optimization (i.e., "norm convergence"). The first term depends on the initialization, the second term comes from the local stochastic variance σ², and the last term represents the influence of data heterogeneity. In general, we see that a larger q (i.e., heavier compression) slows down the convergence. In our analysis of Fed-EF-AMS, we make the following additional assumption of bounded stochastic gradients, which is common in the convergence analysis of adaptive methods, e.g., [46; 69; 11; 35]. Note that this assumption is only used for Fed-EF-AMS, not for Fed-EF-SGD.
Assumption 4 (Bounded gradients). It holds that ∥g_t,i^(k)∥ ≤ G for all t > 0, i ∈ [n], k ∈ [K].
We provide the first convergence analysis of a compressed adaptive FL method below.
Theorem 2 (Fed-EF-AMS). With the same notation as in Theorem 1, let C_1 := β_1/(1 − β_1) + 2q/(1 − q²). Under Assumptions 1 to 4, if the learning rates satisfy η_l ≤ (√ϵ/(8KL)) · min{1/√ϵ, 2(1 − q²)L/((1 + q²)^{1.5} G), 1/(max{16, 32C_1²} η), 1/(3η^{1/3})}, the Fed-EF-AMS iterates in Algorithm 1 satisfy
(1/T) Σ_{t=1}^T E∥∇f(θ_t)∥² ≲ (f(θ_1) − f(θ*))/(η η_l T K) + 5η_l² K L²/(2√ϵ) + (η η_l³ (30 + 20C_1²) K² L³/ϵ)(σ² + 6Kσ_g²) + (η η_l L (6 + 4C_1²)/(nϵ)) σ² + (C_1 + 1) G² d/(T√ϵ) + 3η η_l C_1² L K G² d/(Tϵ).

With properly chosen learning rates, we have the following simplified results. Corollary 1 (Fed-EF, specific learning rates).
Suppose the conditions in Theorem 1 and Theorem 2 are satisfied, respectively. Choosing η_l = Θ(1/(K√T)) and η = Θ(√(Kn)), Fed-EF-SGD satisfies
(1/T) Σ_{t=1}^T E∥∇f(θ_t)∥² = O((f(θ_1) − f(θ*))/√(TKn) + σ²/√(TKn) + (√n/(T^{3/2}√K))(σ² + Kσ_g²)),
and for Fed-EF-AMS it holds that
(1/T) Σ_{t=1}^T E∥∇f(θ_t)∥² = O((f(θ_1) − f(θ*))/√(TKn) + σ²/√(TKn) + (1/(TK) + √n/(T^{3/2}√K))(σ² + Kσ_g²)).

Discussion. From Corollary 1, we see that when T ≥ K, Fed-EF-AMS and Fed-EF-SGD have the same asymptotic rate of convergence. Therefore, the following discussion applies to the general Fed-EF scheme with both variants. In Corollary 1, when T ≥ Kn, the global variance term σ_g² vanishes and the convergence rate becomes O(1/√(TKn)). Thus, the proposed Fed-EF enjoys linear speedup w.r.t. the number of clients n, i.e., it reaches a δ-stationary point (i.e., (1/T) Σ_{t=1}^T E∥∇f(θ_t)∥² ≤ δ) as long as TK = Θ(1/(nδ²)), which matches the recent results for the full-precision counterparts [60; 45] ([45] only analyzed the special case β_1 = 0, while our analysis is more general). The condition T ≥ Kn to reach linear speedup considerably improves the O(K³n³) requirement of the federated momentum SGD analysis in [63]. In terms of communication complexity, by setting K = Θ(1/(nδ)), Fed-EF only requires T = Θ(1/δ) rounds of communication to converge. This matches one of the state-of-the-art FL communication complexity results, that of SCAFFOLD [27].
Comparison with prior related results. As a special case of Fed-EF-SGD (η ≡ 1) and the most relevant previous work, the analysis of QSparse-local-SGD [6] did not consider data heterogeneity, and its convergence rate O(1/√(TK)) did not achieve linear speedup either. Our new analysis improves this result, showing that EF can also match the best rate of full communication in federated learning. For FL with unbiased compression (without EF), the convergence rate of FedPaQ [47] is also O(1/√(TK)).
[18] refined the analysis and algorithm of FedPaQ, matching our O(1/δ) communication complexity. To sum up, both Fed-EF-SGD and Fed-EF-AMS achieve the convergence rates of the corresponding full-precision FL counterparts, as well as the state-of-the-art rates of FL with unbiased compression.

4.2. ANALYSIS OF FED-EF UNDER PARTIAL CLIENT PARTICIPATION

Whilst being a popular strategy in classical distributed training, error feedback has rarely been analyzed under partial participation (PP), which is an important feature of FL. Next, we provide new analysis and results for EF in this setting, considering both local steps and data heterogeneity in federated learning. In each round t, assume only m randomly chosen clients (without replacement), indexed by M_t ⊆ [n], are active and participate in training (i.e., changing i ∈ [n] to i ∈ M_t at line 4 of Algorithm 1). For the remaining (n − m) inactive clients, we simply set e_t,i = e_{t−1,i} for all i ∈ [n] \ M_t. The convergence rate is given below.
Theorem 3 (Fed-EF, partial participation). In each round, suppose m randomly chosen clients in M_t participate in the training. Under Assumptions 1 to 3, suppose the learning rates satisfy η_l ≤ (1/(KL)) · min{1/6, m/(96C′η), m²/(53760(n − m)C_1η), 1/(4η), 1/(32C_1η)}. Fed-EF-SGD admits
(1/T) Σ_{t=1}^T E∥∇f(θ_t)∥² ≲ (f(θ_1) − f(θ*))/(η η_l T K) + (η η_l L/m + 8η η_l C_1 L n/m²) σ² + (3η η_l C′ K L/m) σ_g² + (5η_l² K L²/2 + 15η η_l³ C′ K² L³/m + 560η η_l C_1 (n − m)L/m²)(σ² + 6Kσ_g²),
where C_1 = q²/(1 − q²)³ and C′ = (n − m)/(n − 1). Choosing η = Θ(√(Km)) and η_l = Θ(√m/(K√(Tn))), we have
(1/T) Σ_{t=1}^T E∥∇f(θ_t)∥² = O((√n/√m) · (f(θ_1) − f(θ*))/√(TKm) + σ²/√(TKm) + (√K/√(Tm)) σ_g²).
Remark 1. We present Fed-EF-SGD for simplicity. With a more involved analysis, a similar result applies to Fed-EF-AMS, yielding the same asymptotic convergence rate as Fed-EF-SGD.
Remark 2. When m = n (full participation), Theorem 3 recovers the O(1/√(TKn)) rate of Corollary 1. When q = 0, we recover the O(√(K/(Tm))) rate of full-precision Fed-SGD under PP [60].
Effect of delayed error compensation. The convergence rate in Theorem 3 involves m in the denominator, instead of n as in Corollary 1, which is a result of the larger gradient estimation variance due to client sampling.
Compared with the O(√(K/(Tm))) rate of [60] for full-precision local SGD under PP, Theorem 3 is slower by a factor of n/m, which is a consequence of the mechanism of error feedback. Intuitively, with full participation, where each client is active in every round, EF itself can to a large extent be regarded as subtly "delaying" the untransmitted gradient information (∆_t − C(∆_t)) to the next iteration. Under partial participation, however, in each round t the error accumulator of a chosen client actually contains the latest information from round t − s, where s can be viewed as the "lag", which follows a geometric distribution with E[s] = n/m. In some sense, this shares a similar spirit with the problem of asynchronous distributed optimization with delayed gradients (e.g., [1; 36]). The delayed error information in Fed-EF under PP is likely to pull the model away from heading towards a stationary point (i.e., to slow down the norm convergence), especially for highly non-convex loss functions. In Section 5, we propose a simple strategy to empirically justify (and mitigate) the negative impact of this error staleness on the norm convergence.
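The lag described above is easy to verify by simulation: sampling m of n clients uniformly per round, the number of rounds since a sampled client last participated has mean close to n/m. The values of n, m and the round count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, rounds = 200, 20, 5000
last_seen = np.zeros(n, dtype=int)   # round of each client's last participation
lags = []
for t in range(1, rounds + 1):
    active = rng.choice(n, size=m, replace=False)  # sample m of n clients
    lags.extend(t - last_seen[active])             # staleness of their error info
    last_seen[active] = t
print(np.mean(lags))   # close to n/m = 10
```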

5. NUMERICAL STUDY

We provide numerical results to show the efficacy of Fed-EF in communication-efficient FL and to justify our theoretical analysis. Due to space limitations, we include representative results here and place more results and experimental details in Appendix A.
Datasets. We present experiments on two popular FL datasets. The MNIST dataset [32] contains 60000 training examples and 10000 test samples of 28 × 28 gray-scale hand-written digits from 0 to 9. The FMNIST dataset [59] has the same input size and train/test split as MNIST, but the samples are fashion products (e.g., clothes and bags). More results on the CIFAR dataset are included in Appendix A.
Federated setting. In our experiments, we use n = 200 clients. The clients' local data are set to be highly non-iid (heterogeneous): we restrict the local data samples of each client to come from at most two classes. We run T = 100 rounds, where one FL training round is finished after all the clients have performed one epoch of local training. The local mini-batch size is 32, which means that the clients conduct 10 local iterations per round. Regarding partial participation, we uniformly sample m clients at random in each round. We present results at multiple sampling proportions p = m/n (e.g., p = 0.1 means choosing 20 active clients per round). To measure the communication cost, we report the accumulated number of bits transmitted from clients to server (averaged over all clients), assuming that full-precision gradients are 32-bit encoded.
Methods and compressors. For both Fed-EF variants, we implement the Sign compressor, and the TopK compressor with compression rate k ∈ {0.001, 0.01, 0.05}. We also employ the more aggressive heavy-Sign strategy, where Sign is applied after TopK (i.e., a further 32x compression over TopK at the same sparsity). We test hv-Sign with k ∈ {0.01, 0.05, 0.1}.
We compare our method with the analogous FL approaches using full-precision updates, and with the analogous algorithms that replace Fed-EF's biased compression by unbiased stochastic quantization ("Stoc") without error feedback [3]. For this compressor, we test the parameter b ∈ {1, 2, 4}. With SGD as the global optimizer, this baseline is equivalent to FedCOM/FedPaQ [47; 18].
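For concreteness, a sketch of the "at most two classes per client" partition described in the federated setting above. The `labels` array is a synthetic stand-in (with a real dataset one would pass its label vector), and the particular pairing of classes to clients is our own illustrative choice, not necessarily the paper's:

```python
import numpy as np

def two_class_partition(labels, n_clients, rng):
    """Split sample indices into n_clients shards, each covering at most 2 classes."""
    classes = np.unique(labels)
    by_class = {c: rng.permutation(np.where(labels == c)[0]) for c in classes}
    shards = []
    for i in range(n_clients):
        # assign each client a pair of classes, cycling through all classes
        pair = [classes[(2 * i) % len(classes)],
                classes[(2 * i + 1) % len(classes)]]
        idx = np.concatenate([by_class[c][i::n_clients] for c in pair])
        shards.append(idx)
    return shards

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=6000)   # stand-in for MNIST labels
shards = two_class_partition(labels, n_clients=200, rng=rng)
assert all(len(np.unique(labels[s])) <= 2 for s in shards)
```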

5.1. FED-EF MATCHES FULL-PRECISION FL WITH SUBSTANTIALLY LESS COMMUNICATION

Firstly, we demonstrate the performance of Fed-EF in practical FL tasks. For both datasets, we train a ReLU-activated CNN with two convolutional layers followed by one max-pooling, one dropout and two fully-connected layers before the softmax output. In Figure 2 and Figure 3, we compare our method with Stoc (without EF) under p = 0.5 and p = 0.1, respectively. We have tested each compressor with multiple compression ratios; see Appendix A for the complete results. Here, we present curves (with their respective compression ratios) chosen by the following rule: for each method, we present the curve with the highest compression level that attains the full-precision test accuracy; if the method does not match the full-precision performance, we present the curve with the highest test accuracy. From Figure 2 and Figure 3, we see that:
• In general, a higher compression ratio leads to worse performance, as expected from the theory. The proposed Fed-EF (both variants) is able to achieve the same performance as the full-precision methods with substantial communication reduction, e.g., hv-Sign and TopK reduce the communication by more than 100x without losing accuracy. Sign also provides 30x compression while matching the accuracy of full-precision training.
• On MNIST, the loss and accuracy of Stoc (stochastic quantization without EF) tend to be slightly worse than those of Fed-EF-SGD with hv-Sign, while requiring more communication.
• With the more aggressive p = 0.1 and a proper compressor, Fed-EF still matches the performance of the full-precision algorithms. While Sign performs well on MNIST for both Fed-EF variants, we notice that the fixed sign-based compressors (Sign and heavy-Sign) are considerably outperformed by TopK for Fed-EF-AMS on FMNIST. We conjecture that this is because, with a small participation rate, sign-based compressors tend to assign the same implicit learning rate across coordinates (controlled by the second moment v), making the adaptive method less effective. In contrast, magnitude-preserving compressors (e.g., TopK and Stoc) may better exploit the adaptivity of AMSGrad.

Figure 2: Training loss and test accuracy v.s. communicated bits, participation rate p = 0.5. "sign", "topk" and "hv-sign" are applied with Fed-EF, while "Stoc" is the stochastic quantization without EF.
Figure 3: Training loss and test accuracy v.s. communicated bits, participation rate p = 0.1. Legend as in Figure 2.

5.2. ANALYSIS OF NORM CONVERGENCE AND DELAYED ERROR COMPENSATION

We empirically evaluate the norm convergence to verify the theoretical speedup properties of Fed-EF and the effect of delayed error compensation under partial participation (PP). Recall from Theorem 1 that, in the full participation case, reaching a δ-stationary point requires running Θ(1/(nδ²)) rounds of Fed-EF (i.e., linear speedup). From Theorem 3, when n is fixed, our result implies that the speedup should be super-linear in m, the number of participating clients, due to the additional stale error effect. In other words, altering m is expected to have more impact on the convergence under PP. We train an MLP (the same model as in Figure 1) with one hidden layer of 200 neurons. In Figure 4, we report the squared gradient norm and the training loss on MNIST under the same non-iid FL setting as above (the results on FMNIST are similar). In the full participation case, we run Fed-EF-SGD with n = 20, 40, 60, 100 clients; in the partial participation case, we fix n = 200 and alter m = 20, 40, 60, 100. Following our theory, we set η = 0.1√n (or 0.1√m) and η_l = 0.1. We see that: 1) in general, the convergence of Fed-EF is faster with increasing n or m, which confirms the speedup property; 2) the gaps among the curves in the PP setting (the 2nd and 4th plots) are larger than those in the full-participation case (the 1st and 3rd plots), which suggests that the acceleration brought by increasing m under PP is more significant than that of increasing n under full participation by the same proportion, consistent with our theoretical implications. To further examine the impact of delayed error compensation under PP, we test a simple strategy called "error restarting": for each client i in round t, if the error accumulator was last updated more than S (a threshold) rounds ago (i.e., before round t − S), we restart the error accumulator by setting e_t,i = 0, which effectively eliminates error information that is "too old".
In Figure 5, we first run Fed-EF for 50 rounds and then apply error restarting with threshold S = 10. As we see, after discarding heavily delayed error information, the gradient norm becomes smaller than that of continuing to run standard Fed-EF, i.e., the model finds a stationary point faster. These results illustrate the influence of stale error compensation in Fed-EF, and suggest that properly handling this staleness may be a promising direction for future improvement.
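The error-restarting bookkeeping described above can be sketched as follows; the array shapes and participation pattern are illustrative:

```python
import numpy as np

def maybe_restart_errors(errors, last_update, t, S):
    """Zero out (in place) error accumulators last refreshed more than S rounds ago."""
    stale = (t - last_update) > S
    errors[stale] = 0.0
    return stale

n, d, S = 200, 10, 10
errors = np.random.default_rng(0).normal(size=(n, d))
last_update = np.zeros(n, dtype=int)   # round of each client's last participation
last_update[:100] = 45                 # clients 0-99 participated recently
stale = maybe_restart_errors(errors, last_update, t=50, S=S)
print(stale.sum())                     # clients with stale accumulators restarted
```

In a full Fed-EF loop, `last_update[i]` would be set to the current round whenever client i is sampled, and the check would run before the client adds `e[i]` to its update.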

6. DISCUSSION AND CONCLUSION

We propose Fed-EF, a Federated Learning (FL) framework with compressed communication and Error Feedback (EF). Two variants, Fed-EF-SGD and Fed-EF-AMS, are designed based on the choice of the global optimizer. Theoretically, we present a convergence analysis for non-convex optimization showing that Fed-EF achieves the same convergence rate as the full-precision FL counterparts, which improves upon previous results. The Fed-EF-AMS variant is the first compressed adaptive FL method in the literature. Moreover, we develop a new analysis of error feedback under the partial participation setting, proving an additional slow-down factor related to the participation rate, caused by the delayed error compensation of the EF mechanism. Experiments validate that, compared with full-precision training, Fed-EF achieves significant communication reduction without a performance drop. We also present numerical results to justify the theory and provide intuition regarding the impact of delayed error feedback on the norm convergence of Fed-EF. Our work supports the effectiveness of error feedback in federated learning and provides insight into its convergence under the practical FL setting with partial participation. Our paper opens up several interesting future directions, e.g., improving Fed-EF by other techniques, especially under partial participation, and studying more closely the properties of different compressors. More mechanisms in FL (e.g., variance reduction, fairness) can also be incorporated into the Fed-EF scheme.

A EXPERIMENT DETAILS, ALGORITHMS AND MORE RESULTS

In this section, we provide more theoretical justification of the compressors, and more implementation details of the empirical results.

A.1 BIASED COMPRESSION OPERATORS

In our Fed-EF, the biased compressors are implemented as follows. Sign is implemented exactly following Definition 1. TopK is also applied in a "layer-wise" manner: let $k$ denote the proportion of coordinates selected; for each layer with $d_i$ parameters, we pick $\max(1, \lfloor k d_i \rfloor)$ gradient dimensions, where the maximum operator avoids the case in which a layer is never updated. heavy-Sign is implemented by first applying TopK (per layer) and then applying Sign. For completeness, we provide more theoretical details of the biased compression operators TopK, Sign and heavy-Sign. Recall from Definition 1 that $\|C(x) - x\|^2 \le q_C^2 \|x\|^2$ for some $q_C < 1$. In the sequel, $\|\cdot\|$ always denotes the $\ell_2$ norm and $\|\cdot\|_1$ the $\ell_1$ norm. We first justify that TopK and Sign are both valid compressors. Proposition A.1 is well-known (e.g., [54; 68]); we provide the proof for clarity. Again, note that in TopK, $k$ is the compression rate, i.e., the fraction, not the number, of selected coordinates.

Proposition A.1. For the TopK compressor which selects the top $k$-fraction of coordinates, we have $q_C^2 = 1 - k$. For the (Group) Sign compressor, $q_C^2 = 1 - \min_{i\in[M]} \frac{1}{d_i}$.

Proof. For TopK, the proof is immediate: since $C(x) - x$ only contains the $(1-k)d$ coordinates with lowest magnitudes, we know $\|C(x) - x\|^2 / \|x\|^2 \le 1 - k$. For Sign, recall that $I_i$ is the index set of block (group) $i$. By definition, for the $i$-th block $x_{I_i} \in \mathbb{R}^{d_i}$, we have
\[
\|C(x_{I_i}) - x_{I_i}\|^2 = \Big\|x_{I_i} - \frac{\|x_{I_i}\|_1}{d_i}\,\mathrm{sign}(x_{I_i})\Big\|^2 = \|x_{I_i}\|^2 + \frac{\|x_{I_i}\|_1^2}{d_i^2}\cdot d_i - \frac{2\|x_{I_i}\|_1^2}{d_i} = \|x_{I_i}\|^2 - \frac{\|x_{I_i}\|_1^2}{d_i}.
\]
Since we have $M$ blocks, concatenating them leads to
\[
\|C(x) - x\|^2 = \sum_{i=1}^M \Big(\|x_{I_i}\|^2 - \frac{\|x_{I_i}\|_1^2}{d_i}\Big) = \Big(1 - \sum_{i=1}^M \frac{\|x_{I_i}\|_1^2 / d_i}{\|x\|^2}\Big)\|x\|^2 \le \Big(1 - \min_{i\in[M]} \frac{\|x_{I_i}\|_1^2}{d_i\,\|x_{I_i}\|^2}\Big)\|x\|^2 \le \Big(1 - \min_{i\in[M]} \frac{1}{d_i}\Big)\|x\|^2,
\]
where the last inequality holds because the $\ell_1$ norm is no smaller than the $\ell_2$ norm. We now show that heavy-Sign is also a valid compressor.
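As implemented above, the three biased compressors can be sketched in NumPy on a single parameter block (a minimal illustration of ours, not the paper's released code; layer-wise application simply loops this over blocks):

```python
import numpy as np

def topk(x, k):
    """TopK: keep the max(1, floor(k*d)) largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(x)
    num = max(1, int(np.floor(k * x.size)))
    idx = np.argsort(np.abs(x))[-num:]
    out[idx] = x[idx]
    return out

def sign_compress(x):
    """(Group) Sign on one block of size d: (||x||_1 / d) * sign(x)."""
    return (np.abs(x).sum() / x.size) * np.sign(x)

def heavy_sign(x, k):
    """heavy-Sign: TopK followed by scaled Sign (sign(0) = 0 preserves the sparsity)."""
    return sign_compress(topk(x, k))
```

One can verify the contraction property $\|C(x)-x\|^2 \le q_C^2\|x\|^2$ of Definition 1 numerically for random inputs, with the $q_C^2$ values given by Propositions A.1 and A.2.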
Proposition A.2. The heavy-Sign compressor satisfies Definition 1 with $q_C^2 = 1 - \min_{i\in[M]} \frac{k}{d_i}$.

Proof. Let $C_k$ denote the TopK compressor and $C_s$ the Sign operator, so that heavy-Sign can be expressed as $C(x) = C_s(C_k(x))$. Since TopK zeros out the unpicked coordinates, it holds that
\[
\|C(x) - x\|^2 = \|C_s(C_k(x)) - C_k(x) + C_k(x) - x\|^2 = \|C_s(C_k(x)) - C_k(x)\|^2 + \|C_k(x) - x\|^2.
\]
By Proposition A.1, we continue to obtain
\[
\|C(x) - x\|^2 \le \Big(1 - \min_{i\in[M]} \frac{1}{d_i}\Big)\|C_k(x)\|^2 + \|C_k(x) - x\|^2 = \|x\|^2 - \min_{i\in[M]} \frac{1}{d_i}\,\|C_k(x)\|^2 \le \Big(1 - \min_{i\in[M]} \frac{k}{d_i}\Big)\|x\|^2,
\]
where we use the facts that $\|C_k(x)\|^2 + \|C_k(x) - x\|^2 = \|x\|^2$ and $\|C_k(x)\|^2 \ge k\|x\|^2$.

Algorithm 2: The Stoc baseline (unbiased compression without error feedback)
1: Input: global and local learning rates η, η_l; initial model θ_1
2: Initialize m_0 = 0, v_0 = 0, v̂_0 = 0
3: for t = 1, . . . , T do
4:   parallel for worker i ∈ [n] do:
5:     Receive model parameter θ_t from central server, set θ_{t,i}^{(1)} = θ_t
6:     for k = 1, . . . , K do
7:       Compute stochastic gradient g_{t,i}^{(k)} at θ_{t,i}^{(k)}
8:       Local update θ_{t,i}^{(k+1)} = θ_{t,i}^{(k)} − η_l g_{t,i}^{(k)}
9:     end for
10:    Compute the local model update ∆_{t,i} = θ_{t,i}^{(K+1)} − θ_t
11:    Send quantized local model update ∆̃_{t,i} = Q(∆_{t,i}) to central server using (2)
12:  end parallel
13:  Central server do:
14:    Aggregate ∆̃_t = (1/n) Σ_{i=1}^n ∆̃_{t,i}
15:    Option I (Stoc with SGD): update θ_{t+1} = θ_t − η ∆̃_t
16:    m_t = β_1 m_{t−1} + (1 − β_1) ∆̃_t    ▷ Stoc with AMSGrad
17:    v_t = β_2 v_{t−1} + (1 − β_2) ∆̃_t², v̂_t = max(v_t, v̂_{t−1})
18:    Update the global model θ_{t+1} = θ_t − η m_t / √(v̂_t + ϵ)
19: end for

In Algorithm 2, for completeness, we give the details of the competing method in our experiments, called Stoc. Instead of using the error feedback scheme, this method directly compresses the transmitted vector from clients to server with the unbiased stochastic quantization $Q(\cdot)$ proposed by [3]. For a vector $x \in \mathbb{R}^d$, the operator is defined as $Q_b(x) = \|x\| \cdot \mathrm{sign}(x) \cdot \xi(x, b)$, where $b \ge 1$ is the number of bits per non-zero entry of the compressed vector $Q_b(x)$, and $s = 2^{b-1}$ is the number of quantization levels. Suppose $0 \le l < 2^{b-1}$ is the integer such that $|x_i|/\|x\|$ falls in the interval $[l/s, (l+1)/s]$. The random variable $\xi(x, b)$ is defined coordinate-wise by
\[
\xi_i(x, b) = \begin{cases} l/s, & \text{with probability } 1 - g\big(\tfrac{|x_i|}{\|x\|}, b\big), \\ (l+1)/s, & \text{otherwise,} \end{cases} \qquad \text{with } g(a, b) = a \cdot 2^{b-1} - l \ \text{ for } a \in [0, 1].
\]
Simply, 0 is always quantized to 0. The Stoc quantizer is unbiased, i.e., $E[Q(x)\,|\,x] = x$. In addition, it introduces sparsity to the compressed vector in a probabilistic way, with $E[\|Q(x)\|_0] \le 2^b + 2^{b-1}\sqrt{d}$. Stoc also has two corresponding variants, one using SGD and one using AMSGrad as the global optimizer. For the SGD variant, Stoc is equivalent to the FedCOM method in [18], which is also the FedPAQ algorithm [47] with a tunable global learning rate. For the full-precision algorithms, we simply set $\tilde\Delta_{t,i} = \Delta_{t,i}$ in line 11 of Algorithm 2. With SGD, this becomes the method studied in [60], which is standard local SGD [42] with a global learning rate; with an adaptive optimizer, it becomes FedAdam [45]. Note that [45] used Adam, while we use AMSGrad (with the max operation); empirically, the performance of the two options is fairly similar.

A.3 RESULTS ON CIFAR-10 TRAINED BY RESNET-18

We present more experimental results of Fed-EF on the task of CIFAR-10 [31] image classification. This dataset contains 50000 natural images of size 32 × 32, each with 3 RGB channels, in 10 classes (e.g., airplanes, cars, cats).
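Returning to the Stoc quantizer $Q_b(\cdot)$ defined above, a small NumPy sketch makes the rounding rule concrete (our own illustrative implementation; the function name is ours):

```python
import numpy as np

def stoc_quantize(x, b, rng=None):
    """Unbiased b-bit stochastic quantizer (QSGD-style) with s = 2^(b-1) levels.

    Each rescaled magnitude |x_i|/||x|| is randomly rounded to one of its two
    neighboring grid points l/s and (l+1)/s, with probabilities chosen so that
    E[Q(x)] = x.  The zero vector is returned unchanged (0 maps to 0).
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    s = 2 ** (b - 1)
    r = np.abs(x) / norm * s                         # rescaled magnitudes in [0, s]
    l = np.floor(r)
    rounded = l + (rng.random(x.shape) < (r - l))    # round up with probability r - l
    return norm * np.sign(x) * rounded / s
```

Averaging many independent draws of `stoc_quantize(x, b)` recovers `x`, illustrating the unbiasedness property used in the analysis.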
We follow a standard strategy for the CIFAR-10 dataset to pre-process the training images: a random crop, a random horizontal flip, and normalization of the pixel values to zero mean and unit variance. For test images, we only apply the normalization step. For this experiment, we train a ResNet-18 [20] network for 200 rounds. The clients' local data are distributed in the same way as described in Section 5, which is highly non-iid.

[Figure 6 caption: Participation rate p = 0.5, non-iid data. "sign", "topk" and "hv-sign" are applied with Fed-EF, while "Stoc" is the stochastic quantization without EF. 1st row: Fed-EF-SGD. 2nd row: Fed-EF-AMS. The last column presents the corresponding curves that achieve the full-precision accuracy using the lowest communication.]

In Figure 6, we plot the test accuracy of Fed-EF with different compressors, and of Stoc without EF. Again, we see that Fed-EF (both variants) is able to attain the same accuracy level as the corresponding full-precision federated learning algorithms. For Fed-EF-SGD, the compression ratio is around 32x for Sign, 100x for TopK and ∼300x for heavy-Sign. For Fed-EF-AMS, the compression ratio can also reach the hundreds. Note that for Fed-EF-AMS, the training curve of TopK-0.001 is not stable; though it reaches a high accuracy, we plot TopK-0.01 in the third column for comparison instead.

[Figure 7 caption: "sign", "topk" and "hv-sign" are applied with Fed-EF, while "Stoc" is the stochastic quantization without EF. 1st row: Fed-EF-SGD. 2nd row: Fed-EF-AMS. The last column presents the corresponding curves that achieve the full-precision accuracy using the lowest communication.]

In Figure 7 we report the results for aggressive partial participation with p = 0.1. Similarly, for SGD, all three compressors are able to match the full-precision accuracy with a significantly reduced number of communicated bits.
For Fed-EF-AMS, similar to the observations on FMNIST, we see that TopK outperforms Sign and heavy-Sign, and matches the performance of the full-precision method at a 100x compression ratio; Sign also performs reasonably well. In conclusion, our results on CIFAR-10 with ResNet-18 again confirm that, compared with standard full-precision FL algorithms, the proposed Fed-EF scheme can provide significant communication reduction without performance drop, under data heterogeneity and partial participation.

A.4 MORE RESULTS ON MNIST AND FMNIST

We provide the complete set of experimental results for each method under various compression rates. In Table 1–Table 4, for completeness, we report the average test accuracy at the end of training and the standard deviations (over 5 independent runs), corresponding to the curves (compression parameters) in Figure 2 and Figure 3. Figure 8 to Figure 11 present the results for participation rate p = 0.5, and Figure 12 to Figure 15 report the results for p = 0.1. For the hyper-parameter of the compressors (i.e., the compression rate), we test $k \in \{0.001, 0.01, 0.05\}$ for TopK, $k \in \{0.01, 0.05, 0.1\}$ for heavy-Sign and $b \in \{1, 2, 4\}$ for Stoc. For the AMSGrad optimizer, we set $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ as the recommended defaults [46]. For each method, we tune $\eta$ over $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$ and $\eta_l$ over $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$. We found that the compressed methods usually share the same optimal learning rates as full-precision training. The best learning rate combinations achieving the highest test accuracy are given in Table 5.

[Figure caption: Participation rate p = 0.5, non-iid data. "sign", "topk" and "hv-sign" are applied with Fed-EF, while "Stoc" is the stochastic quantization without EF. 1st row: Fed-EF-SGD. 2nd row: Fed-EF-AMS.]

Table 5: Best learning rate combinations (η, η_l) for Fed-EF.

Dataset   | Fed-EF-SGD: η | Fed-EF-SGD: η_l | Fed-EF-AMS: η | Fed-EF-AMS: η_l
MNIST     | 10            | 10^{-3}         | 10^{-3}       | 10^{-2}
FMNIST    | 1             | 10^{-1}         | 10^{-2}       | 10^{-1}
CIFAR-10  | 1             | 10^{-1}         | 10^{-3}       | 10^{-2}

B COMPRESSION DISCREPANCY

In our theoretical analysis of Fed-EF, Assumption 3 is needed, which states that
\[
E\Big[\Big\|\frac{1}{n}\sum_{i=1}^n C(\Delta_{t,i} + e_{t,i}) - \frac{1}{n}\sum_{i=1}^n (\Delta_{t,i} + e_{t,i})\Big\|^2\Big] \le q_A^2\, E\Big[\Big\|\frac{1}{n}\sum_{i=1}^n (\Delta_{t,i} + e_{t,i})\Big\|^2\Big]
\]
for some $q_A < 1$ during training. In the following, we justify this assumption and demonstrate how it holds in practice. To study sparsified SGD, [4] also used a similar but stronger (uniform instead of in-expectation) analytical assumption; as a result, our analysis and theoretical results are also valid under their assumption. Please see more related discussion therein.

B.1 SIMULATED DATA

We first conduct a simulation to investigate how the two compressors, TopK and Sign, affect $q_A$. For conciseness, we present results with $n = 5$ clients and model dimensionality $d = 1100$; similar conclusions hold for much larger $n$ and $d$. We simulate two types of gradients, following a normal distribution and a (more heavy-tailed) Laplace distribution, respectively. Examples of the simulated gradients are visualized in Figure 16 and Figure 17. To mimic non-iid data, we assume that each client has strong signals (large gradients) in some coordinates, and we scale those gradients by a factor $s = 2, 10, 100$; conceptually, larger $s$ represents higher data heterogeneity.

[Figure 18 caption: The compression coefficient $q_A$ in Assumption 3 on simulated gradients, for Sign and TopK-0.1 (TopK applied with sparsity $k = 0.1$). Left: Gaussian distribution. Right: Laplace distribution. $q_A^2$ is computed as $\hat q^2 = \frac{\|\delta(x) - x\|^2}{\|x\|^2}$, where $\delta(x) = \frac{1}{n}\sum_{i=1}^n C(\Delta_{t,i} + e_{t,i})$ and $x = \frac{1}{n}\sum_{i=1}^n (\Delta_{t,i} + e_{t,i})$. The dashed curves are the compression coefficients $q_C^2$ from Definition 1, calculated by instead setting $\delta(x) = C\big(\frac{1}{n}\sum_{i=1}^n (\Delta_{t,i} + e_{t,i})\big)$. We see that in all cases, $q_A < 1$.]

We apply the TopK-0.1 and Sign compressors from Definition 1 to the simulated gradients and report in Figure 18 the average of $q_A^2$ over $10^5$ independent runs. The dashed curves are the "ideal" compression coefficients $q_C$ such that $E[\|C(\frac{1}{n}\sum_i (\Delta_{t,i} + e_{t,i})) - \frac{1}{n}\sum_i (\Delta_{t,i} + e_{t,i})\|^2] \le q_C^2\, E[\|\frac{1}{n}\sum_i (\Delta_{t,i} + e_{t,i})\|^2]$ from Definition 1. We see that in all cases, $q_A$ is indeed less than 1; this holds even when the data heterogeneity factor increases to as large as 100.
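The estimation of $q_A^2$ can be sketched as follows (an illustrative stand-in for the simulation, not the exact script: here clients share a common signal plus client-specific noise, which is one simple way to induce the cross-client correlation that makes $q_A < 1$):

```python
import numpy as np

def topk(x, k):
    """Keep the top k-fraction of coordinates by magnitude."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-max(1, int(k * x.size)):]
    out[idx] = x[idx]
    return out

def empirical_qA2(local_vecs, compressor):
    """Estimate q_A^2 = ||avg of compressed - avg||^2 / ||avg||^2 (Assumption 3)."""
    x = local_vecs.mean(axis=0)
    delta = np.mean([compressor(v) for v in local_vecs], axis=0)
    return np.sum((delta - x) ** 2) / np.sum(x ** 2)

rng = np.random.default_rng(0)
n, d = 5, 1100
base = rng.standard_normal(d)                    # signal shared across clients
vecs = base + 0.3 * rng.standard_normal((n, d))  # client-specific perturbations
qA2 = empirical_qA2(vecs, lambda v: topk(v, 0.1))
assert 0.0 < qA2 < 1.0   # Assumption 3 holds in this example
```

If the local vectors were instead independent across clients, averaging would shrink $\|x\|$ but not the correlated compression errors, and $q_A$ could exceed 1; the shared-signal structure above reflects the correlated updates observed in practice.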

B.2 REAL-WORLD DATA

We report the empirical $q_A$ values when training the CNN on the MNIST and FMNIST datasets. The experimental setup is the same as in Section 5. We present the results in Figure 19 with $\eta = 1$, $\eta_l = 0.01$ under the same heterogeneous setting where client data are highly non-iid. The plots for other learning rate combinations and for iid data are similar. In particular, we see that for both compressors and both datasets, the empirical $q_A$ is well-bounded below 1 throughout the training process.

C PROOF OF CONVERGENCE RESULTS

In this section, we provide the proofs of the convergence rates of Fed-EF. For illustrative purposes, we first present the proof for the more involved Fed-EF-AMS in Section C.1.

C.1 PROOF OF THEOREM 2: FED-EF-AMS

Proof. Recall the second moment sequences $v_t = \beta_2 v_{t-1} + (1-\beta_2)\tilde\Delta_t^2$ and $\hat v_t = \max\{\hat v_{t-1}, v_t\}$, and the first order moving average sequences $m_t = \beta_1 m_{t-1} + (1-\beta_1)\tilde\Delta_t$ and $m'_t = \beta_1 m'_{t-1} + (1-\beta_1)\bar\Delta_t$, where $m'_t$ represents the moving average of the uncompressed local model updates. By construction, $m'_t = (1-\beta_1)\sum_{\tau=1}^t \beta_1^{t-\tau}\bar\Delta_\tau$. Our proof will use the following auxiliary sequences: for round $t = 1, \ldots, T$,
\[
E_{t+1} := (1-\beta_1)\sum_{\tau=1}^{t+1} \beta_1^{t+1-\tau}\,\bar e_\tau, \qquad \theta'_{t+1} := \theta_{t+1} - \eta\,\frac{E_{t+1}}{\sqrt{\hat v_t + \epsilon}}.
\]
Then, we can write the evolution of $\theta'_t$ as
\begin{align*}
\theta'_{t+1} &= \theta_{t+1} - \eta\,\frac{E_{t+1}}{\sqrt{\hat v_t + \epsilon}}
= \theta_t - \eta\,\frac{(1-\beta_1)\sum_{\tau=1}^t \beta_1^{t-\tau}\tilde\Delta_\tau + (1-\beta_1)\sum_{\tau=1}^{t+1}\beta_1^{t+1-\tau}\bar e_\tau}{\sqrt{\hat v_t + \epsilon}} \\
&= \theta_t - \eta\,\frac{(1-\beta_1)\sum_{\tau=1}^t \beta_1^{t-\tau}(\tilde\Delta_\tau + \bar e_{\tau+1}) + (1-\beta_1)\beta_1^t\,\bar e_1}{\sqrt{\hat v_t + \epsilon}}
= \theta_t - \eta\,\frac{(1-\beta_1)\sum_{\tau=1}^t \beta_1^{t-\tau}\bar e_\tau}{\sqrt{\hat v_t + \epsilon}} - \eta\,\frac{m'_t}{\sqrt{\hat v_t + \epsilon}} \\
&= \theta_t - \eta\,\frac{E_t}{\sqrt{\hat v_{t-1} + \epsilon}} - \eta\,\frac{m'_t}{\sqrt{\hat v_t + \epsilon}} + \eta\Big(\frac{1}{\sqrt{\hat v_{t-1} + \epsilon}} - \frac{1}{\sqrt{\hat v_t + \epsilon}}\Big)E_t
\overset{(a)}{=} \theta'_t - \eta\,\frac{m'_t}{\sqrt{\hat v_t + \epsilon}} + \eta D_t E_t,
\end{align*}
where $D_t := \frac{1}{\sqrt{\hat v_{t-1} + \epsilon}} - \frac{1}{\sqrt{\hat v_t + \epsilon}}$, and (a) uses the error feedback identity that for every $i \in [n]$, $\tilde\Delta_{t,i} + e_{t+1,i} = \Delta_{t,i} + e_{t,i}$, with $e_{1,i} = 0$ at initialization. Further define the virtual iterates
\[
x_{t+1} := \theta'_{t+1} - \eta\,\frac{\beta_1}{1-\beta_1}\,\frac{m'_t}{\sqrt{\hat v_t + \epsilon}},
\]
which follow the recurrence
\begin{align*}
x_{t+1} &= \theta'_{t+1} - \eta\,\frac{\beta_1}{1-\beta_1}\,\frac{m'_t}{\sqrt{\hat v_t + \epsilon}}
= \theta'_t - \eta\,\frac{m'_t}{\sqrt{\hat v_t + \epsilon}} - \eta\,\frac{\beta_1}{1-\beta_1}\,\frac{m'_t}{\sqrt{\hat v_t + \epsilon}} + \eta D_t E_t \\
&= \theta'_t - \eta\,\frac{\beta_1 m'_{t-1} + (1-\beta_1)\bar\Delta_t + \frac{\beta_1^2}{1-\beta_1}m'_{t-1} + \beta_1\bar\Delta_t}{\sqrt{\hat v_t + \epsilon}} + \eta D_t E_t
= \theta'_t - \eta\,\frac{\beta_1}{1-\beta_1}\,\frac{m'_{t-1}}{\sqrt{\hat v_t + \epsilon}} - \eta\,\frac{\bar\Delta_t}{\sqrt{\hat v_t + \epsilon}} + \eta D_t E_t \\
&= x_t - \eta\,\frac{\bar\Delta_t}{\sqrt{\hat v_t + \epsilon}} + \eta\,\frac{\beta_1}{1-\beta_1}\,D_t m'_{t-1} + \eta D_t E_t.
\end{align*}
The general idea is to study the convergence of the sequence $x_t$ and to show that the difference between $x_t$ and the iterates of interest $\theta_t$ is small.
First, by the smoothness Assumption 1, we have $f(x_{t+1}) \le f(x_t) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2$. Taking expectation w.r.t. the randomness at round $t$ and re-arranging terms, we obtain
\begin{align*}
E[f(x_{t+1})] - f(x_t) &\le \underbrace{-\eta E\Big\langle\nabla f(\theta_t), \frac{\bar\Delta_t}{\sqrt{\hat v_t + \epsilon}}\Big\rangle}_{\mathrm{I}} + \underbrace{\eta E\Big\langle\nabla f(x_t), \frac{\beta_1}{1-\beta_1}D_t m'_{t-1} + D_t E_t\Big\rangle}_{\mathrm{II}} \\
&\quad + \underbrace{\frac{\eta^2 L}{2} E\Big\|\frac{\bar\Delta_t}{\sqrt{\hat v_t + \epsilon}} - \frac{\beta_1}{1-\beta_1}D_t m'_{t-1} - D_t E_t\Big\|^2}_{\mathrm{III}} + \underbrace{\eta E\Big\langle\nabla f(\theta_t) - \nabla f(x_t), \frac{\bar\Delta_t}{\sqrt{\hat v_t + \epsilon}}\Big\rangle}_{\mathrm{IV}}. \tag{3}
\end{align*}
Bounding term I. We have
\begin{align*}
\mathrm{I} &= -\eta E\Big\langle\nabla f(\theta_t), \frac{\bar\Delta_t}{\sqrt{\hat v_{t-1} + \epsilon}}\Big\rangle - \eta E\Big\langle\nabla f(\theta_t), \Big(\frac{1}{\sqrt{\hat v_t + \epsilon}} - \frac{1}{\sqrt{\hat v_{t-1} + \epsilon}}\Big)\bar\Delta_t\Big\rangle \\
&\le -\eta E\Big\langle\nabla f(\theta_t), \frac{\bar\Delta_t}{\sqrt{\hat v_{t-1} + \epsilon}}\Big\rangle + \eta\eta_l K G^2\, E[\|D_t\|_1], \tag{4}
\end{align*}
where we use Assumption 4 on the stochastic gradient magnitude. The last inequality holds by bounding the aggregated local model update as $\|\bar\Delta_t\| \le \frac{1}{n}\sum_{i=1}^n \|\eta_l \sum_{k=1}^K g^{(k)}_{t,i}\| \le \eta_l K G$, together with the fact that for any vector in $\mathbb{R}^d$, the $\ell_2$ norm is upper bounded by the $\ell_1$ norm. Regarding the first term in (4), we have
\begin{align*}
-\eta E\Big\langle\nabla f(\theta_t), \frac{\bar\Delta_t}{\sqrt{\hat v_{t-1} + \epsilon}}\Big\rangle
&= -\eta E\Big\langle\frac{\nabla f(\theta_t)}{\sqrt{\hat v_{t-1} + \epsilon}},\ \bar\Delta_t - \eta_l K\nabla f(\theta_t) + \eta_l K\nabla f(\theta_t)\Big\rangle \\
&= -\eta\eta_l K\, E\Big[\frac{\|\nabla f(\theta_t)\|^2}{\sqrt{\hat v_{t-1} + \epsilon}}\Big] + \eta E\Big\langle\frac{\nabla f(\theta_t)}{\sqrt{\hat v_{t-1} + \epsilon}},\ -\bar\Delta_t + \eta_l K\nabla f(\theta_t)\Big\rangle \\
&\overset{(a)}{\le} -\frac{\eta\eta_l K}{\sqrt{\frac{4\eta_l^2(1+q^2)^3 K^2 G^2}{(1-q^2)^2} + \epsilon}}\, E\|\nabla f(\theta_t)\|^2 + \eta\, \underbrace{E\Big\langle\frac{\sqrt{\eta_l}\,\nabla f(\theta_t)}{(\hat v_{t-1} + \epsilon)^{1/4}},\ \frac{\sqrt{\eta_l}}{n\,(\hat v_{t-1} + \epsilon)^{1/4}}\Big(-\sum_{i=1}^n\sum_{k=1}^K \nabla f_i(\theta^{(k)}_{t,i}) + nK\nabla f(\theta_t)\Big)\Big\rangle}_{\mathrm{V}},
\end{align*}
where (a) uses Lemma C.6 and, taking expectation inside the inner product, Assumption 2 that $g^{(k)}_{t,i}$ is an unbiased estimator of $\nabla f_i(\theta^{(k)}_{t,i})$. To bound term V, we use the inequality $\langle a, b\rangle \le \frac{\alpha}{2}\|a\|^2 + \frac{1}{2\alpha}\|b\|^2$ for any vectors $a, b$ and $\alpha > 0$.
Therefore, we have
\begin{align*}
\mathrm{V} &\le \frac{\eta_l K}{2\sqrt\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{\eta_l}{2K\sqrt\epsilon}\, E\Big\|\frac{1}{n}\sum_{i=1}^n\sum_{k=1}^K \big(\nabla f_i(\theta^{(k)}_{t,i}) - \nabla f_i(\theta_t)\big)\Big\|^2 \\
&\le \frac{\eta_l K}{2\sqrt\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{\eta_l}{2nK\sqrt\epsilon}\, E\sum_{i=1}^n\Big\|\sum_{k=1}^K \big(\nabla f_i(\theta^{(k)}_{t,i}) - \nabla f_i(\theta_t)\big)\Big\|^2 \\
&\le \frac{\eta_l K}{2\sqrt\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{\eta_l}{2n\sqrt\epsilon}\, E\sum_{i=1}^n\sum_{k=1}^K \|\nabla f_i(\theta^{(k)}_{t,i}) - \nabla f_i(\theta_t)\|^2
\le \frac{\eta_l K}{2\sqrt\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{\eta_l L^2}{2n\sqrt\epsilon}\, E\sum_{i=1}^n\sum_{k=1}^K \|\theta^{(k)}_{t,i} - \theta_t\|^2,
\end{align*}
where the last inequality is a result of the $L$-smoothness assumption on the local loss $f_i$. Applying Lemma C.1 to the consensus error, we can further bound term V by
\[
\mathrm{V} \le \frac{\eta_l K}{2\sqrt\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{\eta_l K L^2}{2\sqrt\epsilon}\Big(5\eta_l^2 K(\sigma^2 + 6K\sigma_g^2) + 30\eta_l^2 K^2\, E\|\nabla f(\theta_t)\|^2\Big) \le \frac{47\eta_l K}{64\sqrt\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{5\eta_l^3 K^2 L^2}{2\sqrt\epsilon}(\sigma^2 + 6K\sigma_g^2),
\]
when we choose $\eta_l \le \frac{1}{8KL}$. Further, if we set $\eta_l \le \frac{\sqrt{15}(1-q^2)\sqrt\epsilon}{14(1+q^2)^{1.5} K G}$, we have
\[
\frac{4\eta_l^2(1+q^2)^3 K^2 G^2}{(1-q^2)^2} + \epsilon \le \frac{60}{196}\epsilon + \epsilon = \frac{64}{49}\epsilon.
\]
Hence, we can establish from (4) that
\[
\mathrm{I} \le -\frac{\eta\eta_l K}{8\sqrt\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{5\eta\eta_l^3 K^2 L^2}{2\sqrt\epsilon}(\sigma^2 + 6K\sigma_g^2) + \eta\eta_l K G^2\, E[\|D_t\|_1]. \tag{5}
\]
Bounding term II. By Lemma C.5, we know that $\|E_t\| \le \frac{2\eta_l q K G}{1-q^2}$, and by Lemma C.3, $\|m'_t\| \le \eta_l K G$. Thus,
\begin{align*}
\mathrm{II} &\le \eta\Big(E\Big\langle\nabla f(\theta_t), \frac{\beta_1}{1-\beta_1}D_t m'_{t-1} + D_t E_t\Big\rangle + E\Big\langle\nabla f(x_t) - \nabla f(\theta_t), \frac{\beta_1}{1-\beta_1}D_t m'_{t-1} + D_t E_t\Big\rangle\Big) \\
&\le \eta\, E\Big[\|\nabla f(\theta_t)\|\,\Big\|\frac{\beta_1}{1-\beta_1}D_t m'_{t-1} + D_t E_t\Big\|\Big] + \eta^2 L\, E\Big[\Big\|\frac{\frac{\beta_1}{1-\beta_1}m'_{t-1} + E_t}{\sqrt{\hat v_{t-1} + \epsilon}}\Big\|\,\Big\|\frac{\beta_1}{1-\beta_1}D_t m'_{t-1} + D_t E_t\Big\|\Big] \\
&\le \eta\eta_l C_1 K G^2\, E[\|D_t\|_1] + \frac{\eta^2\eta_l^2 C_1^2 L K^2 G^2}{\sqrt\epsilon}\, E[\|D_t\|_1], \tag{6}
\end{align*}
where $C_1 := \frac{\beta_1}{1-\beta_1} + \frac{2q}{1-q^2}$, and the second inequality is due to the smoothness of $f$.
Bounding term III.
This term can be bounded as follows:
\begin{align*}
\mathrm{III} &\le \eta^2 L\, E\Big\|\frac{\bar\Delta_t}{\sqrt{\hat v_t + \epsilon}}\Big\|^2 + \eta^2 L\, E\Big\|\frac{\beta_1}{1-\beta_1}D_t m'_{t-1} - D_t E_t\Big\|^2
\le \frac{\eta^2 L}{\epsilon}\, E\|\bar\Delta_t\|^2 + \eta^2 L\, E\Big\|D_t\Big(\frac{\beta_1}{1-\beta_1}m'_{t-1} - E_t\Big)\Big\|^2 \\
&\le \frac{\eta^2 L(2\eta_l^2 K^2 + 120\eta_l^4 K^4 L^2)}{\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{4\eta^2\eta_l^2 K L}{n\epsilon}\sigma^2 + \frac{20\eta^2\eta_l^4 K^3 L^3}{\epsilon}(\sigma^2 + 6K\sigma_g^2) + \eta^2\eta_l^2 C_1^2 L K^2 G^2\, E[\|D_t\|^2], \tag{7}
\end{align*}
where we apply Lemma C.2 and use a similar argument as in bounding term II.
Bounding term IV. Lastly, for some $\rho > 0$,
\begin{align*}
\mathrm{IV} &= \eta E\Big\langle\nabla f(\theta_t) - \nabla f(x_t), \frac{\bar\Delta_t}{\sqrt{\hat v_{t-1} + \epsilon}}\Big\rangle + \eta E\Big\langle\nabla f(\theta_t) - \nabla f(x_t), \Big(\frac{1}{\sqrt{\hat v_t + \epsilon}} - \frac{1}{\sqrt{\hat v_{t-1} + \epsilon}}\Big)\bar\Delta_t\Big\rangle \\
&\overset{(a)}{\le} \frac{\eta\rho}{2\epsilon}\, E\|\bar\Delta_t\|^2 + \frac{\eta}{2\rho}\, E\|\nabla f(\theta_t) - \nabla f(x_t)\|^2 + \eta^2 L\, E\Big[\Big\|\frac{\frac{\beta_1}{1-\beta_1}m'_{t-1} + E_t}{\sqrt{\hat v_{t-1} + \epsilon}}\Big\|\,\|D_t\bar\Delta_t\|\Big] \\
&\overset{(b)}{\le} \frac{\rho\eta(\eta_l^2 K^2 + 60\eta_l^4 K^4 L^2)}{\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{2\rho\eta\eta_l^2 K}{\epsilon n}\sigma^2 + \frac{10\rho\eta\eta_l^4 K^3 L^2}{\epsilon}(\sigma^2 + 6K\sigma_g^2) + \frac{\eta^3 L^2}{2\rho}\, E\Big\|\frac{\frac{\beta_1}{1-\beta_1}m'_{t-1} + E_t}{\sqrt{\hat v_{t-1} + \epsilon}}\Big\|^2 + \frac{\eta^2\eta_l^2 C_1 L K^2 G^2}{\sqrt\epsilon}\, E[\|D_t\|_1] \\
&\le \frac{\rho\eta\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)}{\epsilon}\, E\|\nabla f(\theta_t)\|^2 + \frac{2\rho\eta\eta_l^2 K}{\epsilon n}\sigma^2 + \frac{10\rho\eta\eta_l^4 K^3 L^2}{\epsilon}(\sigma^2 + 6K\sigma_g^2) + \frac{\eta^3 L^2}{\rho\epsilon}\Big(\frac{\beta_1^2}{(1-\beta_1)^2}\, E\|m'_t\|^2 + E\|E_t\|^2\Big) + \frac{\eta^2\eta_l^2 C_1 L K^2 G^2}{\sqrt\epsilon}\, E[\|D_t\|_1], \tag{9}
\end{align*}
where (a) is a consequence of Young's inequality ($\rho$ will be specified later) and the smoothness Assumption 1, and (b) is based on Lemma C.2. Having bounded all four terms in (3), the next step is to gather the ingredients by taking the telescoping sum over $t = 1, \ldots, T$. Before moving on, for ease of presentation, we first do this for the third term in (9).
For this term, according to Lemma C.3 and Lemma C.5, summing over $t = 1, \ldots, T$ gives
\begin{align*}
\sum_{t=1}^T \frac{\eta^3 L^2}{\rho\epsilon}\Big(\frac{\beta_1^2}{(1-\beta_1)^2}\, E[\|m'_t\|^2] + E[\|E_t\|^2]\Big)
&\le \frac{\eta^3\beta_1^2 L^2}{\rho(1-\beta_1)^2\epsilon}\Big(2\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{4T\eta_l^2 K}{n}\sigma^2 + 20T\eta_l^4 K^3 L^2(\sigma^2 + 6K\sigma_g^2)\Big) \\
&\quad + \frac{\eta^3 q^2 L^2}{\rho(1-q^2)^2\epsilon}\Big(8\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{16T\eta_l^2 K}{n}\sigma^2 + 80T\eta_l^4 K^3 L^2(\sigma^2 + 6K\sigma_g^2)\Big) \\
&\le \frac{2\eta^3\eta_l^2 C_2 K^2 L^2}{\rho\epsilon}(60\eta_l^2 K^2 L^2 + 1)\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{4T\eta^3\eta_l^2 C_2 K L^2}{\rho n\epsilon}\sigma^2 + \frac{20T\eta^3\eta_l^4 C_2 K^3 L^4}{\rho\epsilon}(\sigma^2 + 6K\sigma_g^2), \tag{10}
\end{align*}
with $C_2 := \frac{\beta_1^2}{(1-\beta_1)^2} + \frac{4q^2}{(1-q^2)^2}$.

Putting everything together. We are now in the position to obtain our final result by integrating (5), (6), (7), (9) and (10) into (3) and taking the telescoping sum over $t = 1, \ldots, T$. After re-arranging terms, when $\eta_l \le \min\big\{\frac{1}{8KL}, \frac{(1-q^2)\sqrt\epsilon}{4(1+q^2)^{1.5} K G}\big\}$, we have
\begin{align*}
E[f(x_{T+1}) - f(x_1)] &\le -\frac{\eta\eta_l K}{8\sqrt\epsilon}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{5T\eta\eta_l^3 K^2 L^2}{2\sqrt\epsilon}(\sigma^2 + 6K\sigma_g^2) + \eta\eta_l K G^2\sum_{t=1}^T E[\|D_t\|_1] \\
&\quad + \eta\eta_l C_1 K G^2\sum_{t=1}^T E[\|D_t\|_1] + \frac{\eta^2\eta_l^2 C_1^2 L K^2 G^2}{\sqrt\epsilon}\sum_{t=1}^T E[\|D_t\|_1] \\
&\quad + \frac{\eta^2 L(2\eta_l^2 K^2 + 120\eta_l^4 K^4 L^2)}{\epsilon}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{4T\eta^2\eta_l^2 K L}{n\epsilon}\sigma^2 + \frac{20T\eta^2\eta_l^4 K^3 L^3}{\epsilon}(\sigma^2 + 6K\sigma_g^2) + \eta^2\eta_l^2 C_1^2 L K^2 G^2\sum_{t=1}^T E[\|D_t\|^2] \\
&\quad + \frac{\rho\eta\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)}{\epsilon}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{2T\rho\eta\eta_l^2 K}{\epsilon n}\sigma^2 + \frac{10T\rho\eta\eta_l^4 K^3 L^2}{\epsilon}(\sigma^2 + 6K\sigma_g^2) \\
&\quad + \frac{2\eta^3\eta_l^2 C_2 K^2 L^2}{\rho\epsilon}(60\eta_l^2 K^2 L^2 + 1)\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{4T\eta^3\eta_l^2 C_2 K L^2}{\rho n\epsilon}\sigma^2 + \frac{20T\eta^3\eta_l^4 C_2 K^3 L^4}{\rho\epsilon}(\sigma^2 + 6K\sigma_g^2) + \frac{\eta^2\eta_l^2 C_1 L K^2 G^2}{\sqrt\epsilon}\sum_{t=1}^T E[\|D_t\|_1] \\
&= \Upsilon_1\cdot\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \Upsilon_2\cdot(\sigma^2 + 6K\sigma_g^2) + \Upsilon_3\cdot\sigma^2 + \Upsilon_4\cdot\sum_{t=1}^T E[\|D_t\|_1] + \eta^2\eta_l^2 C_1^2 L K^2 G^2\sum_{t=1}^T E[\|D_t\|^2], \tag{11}
\end{align*}
where
\begin{align*}
\Upsilon_1 &= -\frac{\eta\eta_l K}{8\sqrt\epsilon} + \frac{\eta^2 L(2\eta_l^2 K^2 + 120\eta_l^4 K^4 L^2)}{\epsilon} + \frac{\rho\eta\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)}{\epsilon} + \frac{2\eta^3\eta_l^2 C_2 K^2 L^2}{\rho\epsilon}(60\eta_l^2 K^2 L^2 + 1) \\
&\le -\frac{\eta\eta_l K}{8\sqrt\epsilon} + \frac{2\eta^2\eta_l^2 K^2 L}{\epsilon} + \frac{120\eta^2\eta_l^4 K^4 L^3}{\epsilon} + \frac{2\rho\eta\eta_l^2 K^2}{\epsilon} + \frac{4\eta^3\eta_l^2 C_2 K^2 L^2}{\rho\epsilon}, \tag{12} \\
\Upsilon_2 &= \frac{5T\eta\eta_l^3 K^2 L^2}{2\sqrt\epsilon} + \frac{20T\eta^2\eta_l^4 K^3 L^3}{\epsilon} + \frac{10T\rho\eta\eta_l^4 K^3 L^2}{\epsilon} + \frac{20T\eta^3\eta_l^4 C_2 K^3 L^4}{\rho\epsilon}, \\
\Upsilon_3 &= \frac{4T\eta^2\eta_l^2 K L}{n\epsilon} + \frac{2T\rho\eta\eta_l^2 K}{n\epsilon} + \frac{4T\eta^3\eta_l^2 C_2 K L^2}{\rho n\epsilon}, \qquad
\Upsilon_4 = \eta\eta_l(C_1 + 1)K G^2 + \frac{\eta^2\eta_l^2 C_1^2 L K^2 G^2}{\sqrt\epsilon} + \frac{\eta^2\eta_l^2 C_1 L K^2 G^2}{\sqrt\epsilon},
\end{align*}
and to bound $\Upsilon_1$ we use the fact that $\eta_l \le \frac{1}{8KL}$ (so that $60\eta_l^2 K^2 L^2 + 1 < 2$). We now look at the upper bound (12) of $\Upsilon_1$, which contains five terms. In the following, we choose $\rho \equiv L\eta$ in (9) and (10), and suppose $\epsilon < 1$. Then, when the local learning rate satisfies
\[
\eta_l \le \frac{1}{K}\min\Big\{\frac{1}{8L},\ \frac{(1-q^2)\sqrt\epsilon}{4(1+q^2)^{1.5} G},\ \frac{\sqrt\epsilon}{128\eta L},\ \frac{\sqrt\epsilon}{256 C_2\eta L},\ \frac{\sqrt\epsilon}{7680^{1/3}\,\eta^{1/3} L}\Big\}
\le \frac{\sqrt\epsilon}{8KL}\min\Big\{\frac{1}{\sqrt\epsilon},\ \frac{2(1-q^2)L}{(1+q^2)^{1.5} G},\ \frac{1}{\max\{16, 32C_2\}\,\eta},\ \frac{1}{3\eta^{1/3}}\Big\},
\]
each of the last four terms in (12) can be bounded by $\frac{\eta\eta_l K}{48\sqrt\epsilon}$. Thus, under this learning rate setting, $\Upsilon_1 \le -\frac{\eta\eta_l K}{16\sqrt\epsilon}$. Taking the above into (11), we arrive at
\[
\frac{\eta\eta_l K}{16\sqrt\epsilon}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 \le f(x_1) - E[f(x_{T+1})] + \Upsilon_2\cdot(\sigma^2 + 6K\sigma_g^2) + \Upsilon_3\cdot\sigma^2 + \Upsilon_4\cdot\frac{d}{\sqrt\epsilon} + \eta^2\eta_l^2 C_1^2 L K^2 G^2\,\frac{d}{\epsilon},
\]
where Lemma C.7 on the difference sequence $D_t$ is applied. Consequently, we have
\begin{align*}
\frac{1}{T}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 &\lesssim \frac{f(x_1) - E[f(x_{T+1})]}{\eta\eta_l T K} + \Upsilon_2\cdot(\sigma^2 + 6K\sigma_g^2) + \Upsilon_3\cdot\sigma^2 + \frac{(C_1+1)G^2 d}{T\sqrt\epsilon} + \frac{2\eta\eta_l C_1^2 L K G^2 d}{T\epsilon} + \frac{\eta\eta_l C_1^2 L K G^2 d}{T\epsilon} \\
&\le \frac{f(x_1) - E[f(x_{T+1})]}{\eta\eta_l T K} + \Upsilon_2\cdot(\sigma^2 + 6K\sigma_g^2) + \Upsilon_3\cdot\sigma^2 + \frac{(C_1+1)G^2 d}{T\sqrt\epsilon} + \frac{3\eta\eta_l C_1^2 L K G^2 d}{T\epsilon},
\end{align*}
where we simplify at the second inequality using the fact that $C_1 \le C_1^2$ since $C_1 \ge 1$, and $\Upsilon_2$, $\Upsilon_3$ now denote the coefficients normalized by the telescoping factor (recall that we have chosen $\rho \equiv L\eta$):
\[
\Upsilon_2 = \frac{5\eta_l^2 K L^2}{2\sqrt\epsilon} + \frac{20\eta\eta_l^3 K^2 L^3}{\epsilon} + \frac{10\eta\eta_l^3 K^2 L^3}{\epsilon} + \frac{20\eta\eta_l^3 C_2 K^2 L^3}{\epsilon} \le \frac{5\eta_l^2 K L^2}{2\sqrt\epsilon} + \frac{\eta\eta_l^3(30 + 20C_2)K^2 L^3}{\epsilon}, \qquad
\Upsilon_3 = \frac{4\eta\eta_l L}{n\epsilon} + \frac{2\eta\eta_l L}{n\epsilon} + \frac{4\eta\eta_l C_2 L}{n\epsilon} \le \frac{\eta\eta_l L(6 + 4C_2)}{n\epsilon}.
\]
Finally, to connect the virtual iterates $x_t$ with the actual iterates $\theta_t$, note that $x_1 = \theta_1$, and $f(x_{T+1}) \ge f(\theta^*)$ since $\theta^* = \arg\min_\theta f(\theta)$. Replacing $\Upsilon_2$ and $\Upsilon_3$ with the above upper bounds and using $C_2 \le C_1^2$, this eventually leads to the bound
\[
\frac{1}{T}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 \lesssim \frac{f(\theta_1) - f(\theta^*)}{\eta\eta_l T K} + \Big(\frac{5\eta_l^2 K L^2}{2\sqrt\epsilon} + \frac{\eta\eta_l^3(30 + 20C_1^2)K^2 L^3}{\epsilon}\Big)(\sigma^2 + 6K\sigma_g^2) + \frac{\eta\eta_l L(6 + 4C_1^2)}{n\epsilon}\sigma^2 + \frac{(C_1+1)G^2 d}{T\sqrt\epsilon} + \frac{3\eta\eta_l C_1^2 L K G^2 d}{T\epsilon},
\]
which gives the desired result. This completes the proof.

C.2 PROOF OF THEOREM 1: FED-EF-SGD

Proof. We now prove the variant of Fed-EF with SGD as the central server update rule. The proof follows the same routine as the one for Fed-EF-AMS but is simpler, since there are no moving average terms to handle. Note that for this algorithm we do not need Assumption 4 that the stochastic gradients are uniformly bounded. For Fed-EF-SGD, consider the virtual sequence
\[
x_{t+1} = \theta_{t+1} - \eta\bar e_{t+1} = \theta_t - \eta\tilde\Delta_t - \eta\bar e_{t+1} = \theta_t - \frac{\eta}{n}\sum_{i=1}^n\big(\tilde\Delta_{t,i} + e_{t+1,i}\big) = \theta_t - \eta\bar\Delta_t - \eta\bar e_t = x_t - \eta\bar\Delta_t,
\]
where the second-to-last equality follows from the update rule that $\tilde\Delta_{t,i} + e_{t+1,i} = \Delta_{t,i} + e_{t,i}$ for all $i \in [n]$ and $t \in [T]$. By the smoothness Assumption 1, we have $f(x_{t+1}) \le f(x_t) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2$. Taking expectation w.r.t. the randomness at round $t$ gives
\[
E[f(x_{t+1})] - f(x_t) \le -\eta E\langle\nabla f(x_t), \bar\Delta_t\rangle + \frac{\eta^2 L}{2}E\|\bar\Delta_t\|^2 = -\eta E\langle\nabla f(\theta_t), \bar\Delta_t\rangle + \frac{\eta^2 L}{2}E\|\bar\Delta_t\|^2 + \eta E\langle\nabla f(\theta_t) - \nabla f(x_t), \bar\Delta_t\rangle. \tag{14}
\]
We can bound the first term in (14) using a similar technique as for term I in the proof of Fed-EF-AMS. Specifically,
\[
-\eta E\langle\nabla f(\theta_t), \bar\Delta_t\rangle = -\eta E\langle\nabla f(\theta_t), \bar\Delta_t - \eta_l K\nabla f(\theta_t) + \eta_l K\nabla f(\theta_t)\rangle = -\eta\eta_l K\, E\|\nabla f(\theta_t)\|^2 + \eta E\langle\nabla f(\theta_t), -\bar\Delta_t + \eta_l K\nabla f(\theta_t)\rangle.
\]
The second term above can be bounded in the same way as term V in the Fed-EF-AMS proof (without the $\sqrt\epsilon$ factor). Thus, with $\eta_l \le \frac{1}{8KL}$, we have
\[
-\eta E\langle\nabla f(\theta_t), \bar\Delta_t\rangle \le -\eta\eta_l K\, E\|\nabla f(\theta_t)\|^2 + \frac{3\eta\eta_l K}{4}E\|\nabla f(\theta_t)\|^2 + \frac{5\eta\eta_l^3 K^2 L^2}{2}(\sigma^2 + 6K\sigma_g^2) = -\frac{\eta\eta_l K}{4}E\|\nabla f(\theta_t)\|^2 + \frac{5\eta\eta_l^3 K^2 L^2}{2}(\sigma^2 + 6K\sigma_g^2).
\]
The second term in (14) can be bounded using Lemma C.2 as
\[
\frac{\eta^2 L}{2}E\|\bar\Delta_t\|^2 \le \eta^2\eta_l^2 K^2 L(60\eta_l^2 K^2 L^2 + 1)E\|\nabla f(\theta_t)\|^2 + \frac{2\eta^2\eta_l^2 K L}{n}\sigma^2 + 10\eta^2\eta_l^4 K^3 L^3(\sigma^2 + 6K\sigma_g^2).
\]
The last term in (14) can be bounded similarly to term IV in the Fed-EF-AMS proof:
\begin{align*}
\eta E\langle\nabla f(\theta_t) - \nabla f(x_t), \bar\Delta_t\rangle &\le \frac{\eta\rho}{2}E\|\bar\Delta_t\|^2 + \frac{\eta}{2\rho}E\|\nabla f(\theta_t) - \nabla f(x_t)\|^2 \overset{(a)}{\le} \frac{\eta^2}{2}E\|\bar\Delta_t\|^2 + \frac{\eta^2 L^2}{2}E\|\bar e_t\|^2 \\
&\overset{(b)}{\le} \frac{\eta^2}{2}\Big(2\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)E\|\nabla f(\theta_t)\|^2 + \frac{4\eta_l^2 K}{n}\sigma^2 + 20\eta_l^4 K^3 L^2(\sigma^2 + 6K\sigma_g^2)\Big) \\
&\quad + \frac{\eta^2 L^2}{2}\Big(\frac{4q^2\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)}{1-q^2}\sum_{\tau=1}^t\Big(\frac{1+q^2}{2}\Big)^{t-\tau}E\|\nabla f(\theta_\tau)\|^2 + \frac{16\eta_l^2 q^2 K}{(1-q^2)^2 n}\sigma^2 + \frac{80\eta_l^4 q^2 K^3 L^2}{(1-q^2)^2}(\sigma^2 + 6K\sigma_g^2)\Big),
\end{align*}
where (a) uses Young's inequality with $\rho = \eta$, and (b) uses Lemma C.2 and Lemma C.4. Taking the telescoping sum of this term over $t = 1, \ldots, T$ and again using the geometric-series summation trick, we further obtain
\[
\eta\sum_{t=1}^T E\langle\nabla f(\theta_t) - \nabla f(x_t), \bar\Delta_t\rangle \le \eta^2\eta_l^2 C_1 K^2 L(60\eta_l^2 K^2 L^2 + 1)\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{2T\eta^2\eta_l^2 C_1 K L}{n}\sigma^2 + 10T\eta^2\eta_l^4 C_1 K^3 L^3(\sigma^2 + 6K\sigma_g^2),
\]
where $C_1 = 1 + \frac{4q^2}{(1-q^2)^2}$. Now, taking the summation over all terms in (14), we get
\[
E[f(x_{T+1})] - f(x_1) \le \Big(-\frac{\eta\eta_l K}{4} + \eta^2\eta_l^2(C_1+1)K^2 L(60\eta_l^2 K^2 L^2 + 1)\Big)\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{5T\eta\eta_l^3 K^2 L^2}{2}(\sigma^2 + 6K\sigma_g^2) + \frac{2T\eta^2\eta_l^2(C_1+1)K L}{n}\sigma^2 + 10T\eta^2\eta_l^4(C_1+1)K^3 L^3(\sigma^2 + 6K\sigma_g^2).
\]
Since $\eta_l \le \frac{1}{8KL}$, we know that $60\eta_l^2 K^2 L^2 + 1 < 2$. Therefore, provided that the local learning rate satisfies $\eta_l \le \frac{1}{2KL\cdot\max\{4,\ \eta(C_1+1)\}}$, we have
\[
\frac{\eta\eta_l K}{8}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 \le f(x_1) - E[f(x_{T+1})] + \frac{5T\eta\eta_l^3 K^2 L^2}{2}(\sigma^2 + 6K\sigma_g^2) + \frac{2T\eta^2\eta_l^2(C_1+1)K L}{n}\sigma^2 + 10T\eta^2\eta_l^4(C_1+1)K^3 L^3(\sigma^2 + 6K\sigma_g^2),
\]
leading to
\[
\frac{1}{T}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 \lesssim \frac{f(x_1) - E[f(x_{T+1})]}{\eta\eta_l T K} + \frac{2\eta\eta_l(C_1+1)L}{n}\sigma^2 + \big(\eta_l^2 K L^2 + 10\eta\eta_l^3(C_1+1)K^2 L^3\big)(\sigma^2 + 6K\sigma_g^2) \le \frac{f(\theta_1) - f(\theta^*)}{\eta\eta_l T K} + \frac{2\eta\eta_l(C_1+1)L}{n}\sigma^2 + \big(\eta_l^2 K L^2 + 10\eta\eta_l^3(C_1+1)K^2 L^3\big)(\sigma^2 + 6K\sigma_g^2),
\]
which concludes the proof.
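The virtual-sequence identity used above, $x_{t+1} = \theta_{t+1} - \eta\bar e_{t+1} = x_t - \eta\bar\Delta_t$, can be checked numerically with a toy error feedback loop (our own sketch; TopK stands in for a generic biased compressor and the "local updates" are random placeholders):

```python
import numpy as np

def topk(x, k):
    """Keep the top k-fraction of coordinates by magnitude."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-max(1, int(k * x.size)):]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(1)
n, d, eta = 4, 50, 0.5
theta = rng.standard_normal(d)
errors = np.zeros((n, d))            # e_{t,i}, initialized to 0
x_virtual = theta.copy()             # x_1 = theta_1 since bar e_1 = 0

for t in range(20):
    deltas = rng.standard_normal((n, d))  # stand-in for local updates Delta_{t,i}
    compressed = np.stack([topk(deltas[i] + errors[i], 0.2) for i in range(n)])
    errors = deltas + errors - compressed           # EF: e_{t+1,i} = Delta + e - C(Delta + e)
    theta = theta - eta * compressed.mean(axis=0)   # server SGD step on compressed avg
    x_virtual = x_virtual - eta * deltas.mean(axis=0)
    # identity: theta_{t+1} - eta * bar e_{t+1} == x_{t+1}
    assert np.allclose(theta - eta * errors.mean(axis=0), x_virtual)
```

The assertion holds at every round: compression error never disappears from the virtual trajectory, it is merely deferred, which is exactly what the EF analysis exploits.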

C.3 INTERMEDIATE LEMMAS

In our analysis, we make use of the following lemma on the consensus error. Note that this is a general result holding for algorithms with local SGD steps (both Fed-EF-SGD and Fed-EF-AMS).

Lemma C.1 ([45]). For $\eta_l \le \frac{1}{8LK}$, for any round $t$, local step $k \in [K]$ and client $i \in [n]$, under Assumption 1 to Assumption 2, it holds that
\[
E\|\theta^{(k)}_{t,i} - \theta_t\|^2 \le 5\eta_l^2 K(\sigma^2 + 6K\sigma_g^2) + 30\eta_l^2 K^2\, E\|\nabla f(\theta_t)\|^2.
\]
We then state some results that bound several key ingredients of our analysis.

Lemma C.2. Recall $\bar\Delta_t = \frac{1}{n}\sum_{i=1}^n \Delta_{t,i}$. Under Assumption 1 to Assumption 2, for all $t$, the following bounds hold:
1. Bound by local gradients: $E\|\bar\Delta_t\|^2 \le \frac{\eta_l^2}{n^2}\, E\|\sum_{i=1}^n\sum_{k=1}^K \nabla f_i(\theta^{(k)}_{t,i})\|^2 + \frac{\eta_l^2 K}{n}\sigma^2$.
2. Bound by the global gradient: $E\|\bar\Delta_t\|^2 \le (2\eta_l^2 K^2 + 120\eta_l^4 K^4 L^2)\, E\|\nabla f(\theta_t)\|^2 + \frac{4\eta_l^2 K}{n}\sigma^2 + 20\eta_l^4 K^3 L^2(\sigma^2 + 6K\sigma_g^2)$.

Proof. By definition, we have
\[
E\|\bar\Delta_t\|^2 = E\Big\|\frac{1}{n}\sum_{i=1}^n\sum_{k=1}^K \eta_l g^{(k)}_{t,i}\Big\|^2 \le \frac{\eta_l^2}{n^2}\, E\Big\|\sum_{i=1}^n\sum_{k=1}^K \big(g^{(k)}_{t,i} - \nabla f_i(\theta^{(k)}_{t,i})\big)\Big\|^2 + \frac{\eta_l^2}{n^2}\, E\Big\|\sum_{i=1}^n\sum_{k=1}^K \nabla f_i(\theta^{(k)}_{t,i})\Big\|^2 \le \frac{\eta_l^2 K}{n}\sigma^2 + \frac{\eta_l^2}{n^2}\, E\Big\|\sum_{i=1}^n\sum_{k=1}^K \nabla f_i(\theta^{(k)}_{t,i})\Big\|^2,
\]
where the first inequality is due to the variance decomposition, and the second is a result of Assumption 2 that the stochastic gradients are independent and unbiased. This proves the first part. For the second part, note that
\begin{align*}
E\|\bar\Delta_t\|^2 &= E\Big\|\frac{1}{n}\sum_{i=1}^n\sum_{k=1}^K \eta_l g^{(k)}_{t,i} - K\eta_l\nabla f(\theta_t) + K\eta_l\nabla f(\theta_t)\Big\|^2
\le 2\eta_l^2 K^2\, E\|\nabla f(\theta_t)\|^2 + \frac{2\eta_l^2}{n^2}\, E\Big\|\sum_{i=1}^n\sum_{k=1}^K \big(g^{(k)}_{t,i} - \nabla f_i(\theta_t)\big)\Big\|^2 \\
&= 2\eta_l^2 K^2\, E\|\nabla f(\theta_t)\|^2 + \frac{2\eta_l^2}{n^2}\,\underbrace{E\Big\|\sum_{i=1}^n\sum_{k=1}^K \big(g^{(k)}_{t,i} - \nabla f_i(\theta^{(k)}_{t,i}) + \nabla f_i(\theta^{(k)}_{t,i}) - \nabla f_i(\theta_t)\big)\Big\|^2}_{A}.
\end{align*}
The expectation A can be further bounded as
\begin{align*}
A &\le 2E\Big\|\sum_{i=1}^n\sum_{k=1}^K \big(g^{(k)}_{t,i} - \nabla f_i(\theta^{(k)}_{t,i})\big)\Big\|^2 + 2E\Big\|\sum_{i=1}^n\sum_{k=1}^K \big(\nabla f_i(\theta^{(k)}_{t,i}) - \nabla f_i(\theta_t)\big)\Big\|^2 \\
&\overset{(a)}{\le} 2nK\sigma^2 + 2nK\sum_{i=1}^n\sum_{k=1}^K E\|\nabla f_i(\theta^{(k)}_{t,i}) - \nabla f_i(\theta_t)\|^2
\overset{(b)}{\le} 2nK\sigma^2 + 2nKL^2\sum_{i=1}^n\sum_{k=1}^K E\|\theta^{(k)}_{t,i} - \theta_t\|^2 \\
&\overset{(c)}{\le} 60\eta_l^2 n^2 K^4 L^2\, E\|\nabla f(\theta_t)\|^2 + 2nK\sigma^2 + 10\eta_l^2 n^2 K^3 L^2(\sigma^2 + 6K\sigma_g^2),
\end{align*}
where (a) is implied by Assumption 2 that each local stochastic gradient can be written as $g^{(k)}_{t,i} = \nabla f_i(\theta^{(k)}_{t,i}) + \xi^{(k)}_{t,i}$, with $\xi^{(k)}_{t,i}$ a zero-mean random noise with variance bounded by $\sigma^2$, all noises for $t \in [T]$, $i \in [n]$, $k \in [K]$ being independent. Inequality (b) is due to the smoothness Assumption 1, and (c) follows from Lemma C.1. Therefore, we obtain
\[
E[\|\bar\Delta_t\|^2] \le (2\eta_l^2 K^2 + 120\eta_l^4 K^4 L^2)\, E[\|\nabla f(\theta_t)\|^2] + \frac{4\eta_l^2 K}{n}\sigma^2 + 20\eta_l^4 K^3 L^2(\sigma^2 + 6K\sigma_g^2),
\]
which completes the proof of the second claim.

Lemma C.3. Under Assumption 1, Assumption 2 and Assumption 4, we have
\[
\|m'_t\| \le \eta_l K G \ \ \forall t, \qquad \sum_{t=1}^T E\|m'_t\|^2 \le (2\eta_l^2 K^2 + 120\eta_l^4 K^4 L^2)\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{4T\eta_l^2 K}{n}\sigma^2 + 20T\eta_l^4 K^3 L^2(\sigma^2 + 6K\sigma_g^2).
\]
Proof. For the first part, by Assumption 4 we know that
\[
\|m'_t\| = (1-\beta_1)\Big\|\sum_{\tau=1}^t \beta_1^{t-\tau}\bar\Delta_\tau\Big\| \le (1-\beta_1)\sum_{\tau=1}^t \beta_1^{t-\tau}\,\frac{\eta_l}{n}\sum_{i=1}^n\sum_{k=1}^K \|g^{(k)}_{\tau,i}\| \le \eta_l K G.
\]
For the second claim, let $\bar\Delta_{\tau,j}$ denote the $j$-th coordinate of $\bar\Delta_\tau$. By the updating rule of Fed-EF,
\[
E\|m'_t\|^2 = E\Big\|(1-\beta_1)\sum_{\tau=1}^t \beta_1^{t-\tau}\bar\Delta_\tau\Big\|^2 \le (1-\beta_1)^2\sum_{j=1}^d E\Big(\sum_{\tau=1}^t \beta_1^{t-\tau}\bar\Delta_{\tau,j}\Big)^2 \overset{(a)}{\le} (1-\beta_1)^2\sum_{j=1}^d E\Big(\sum_{\tau=1}^t \beta_1^{t-\tau}\Big)\Big(\sum_{\tau=1}^t \beta_1^{t-\tau}\bar\Delta_{\tau,j}^2\Big) \le (1-\beta_1)\sum_{\tau=1}^t \beta_1^{t-\tau} E\|\bar\Delta_\tau\|^2,
\]
where (a) is due to the Cauchy–Schwarz inequality.
Summing over $t = 1, \ldots, T$ and applying the bound on $E\|\bar\Delta_\tau\|^2$ from Lemma C.2, we obtain
\[
\sum_{t=1}^T E\|m'_t\|^2 \le (2\eta_l^2 K^2 + 120\eta_l^4 K^4 L^2)(1-\beta_1)\sum_{t=1}^T\sum_{\tau=1}^t \beta_1^{t-\tau}\, E\|\nabla f(\theta_\tau)\|^2 + \frac{4T\eta_l^2 K}{n}\sigma^2 + 20T\eta_l^4 K^3 L^2(\sigma^2 + 6K\sigma_g^2)
\le (2\eta_l^2 K^2 + 120\eta_l^4 K^4 L^2)\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{4T\eta_l^2 K}{n}\sigma^2 + 20T\eta_l^4 K^3 L^2(\sigma^2 + 6K\sigma_g^2),
\]
which concludes the proof.

Lemma C.4. Under Assumption 2 to Assumption 4, we have for all $t$ and each local worker $i \in [n]$,
\[
\|e_{t,i}\|^2 \le \frac{4\eta_l^2 q^2 K^2 G^2}{(1-q^2)^2}, \qquad
E[\|\bar e_{t+1}\|^2] \le \frac{4q^2\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)}{1-q^2}\sum_{\tau=1}^t\Big(\frac{1+q^2}{2}\Big)^{t-\tau} E\|\nabla f(\theta_\tau)\|^2 + \frac{16\eta_l^2 q^2 K}{(1-q^2)^2 n}\sigma^2 + \frac{80\eta_l^4 q^2 K^3 L^2}{(1-q^2)^2}(\sigma^2 + 6K\sigma_g^2).
\]
Proof. To prove the second claim, we start by using Assumption 3 and Young's inequality to get
\[
\|\bar e_{t+1}\|^2 = \Big\|\bar\Delta_t + \bar e_t - \frac{1}{n}\sum_{i=1}^n C(\Delta_{t,i} + e_{t,i})\Big\|^2 \le q^2\|\bar\Delta_t + \bar e_t\|^2 \le q^2(1+\rho)\|\bar e_t\|^2 + q^2\Big(1 + \frac{1}{\rho}\Big)\|\bar\Delta_t\|^2 \le \frac{1+q^2}{2}\|\bar e_t\|^2 + \frac{2q^2}{1-q^2}\|\bar\Delta_t\|^2, \tag{16}
\]
where the last step is derived by choosing $\rho = \frac{1-q^2}{2q^2}$ and using the fact that $q < 1$. Now, by recursion and the initialization $e_{1,i} = 0$ for all $i$, we have
\[
E\|\bar e_{t+1}\|^2 \le \frac{2q^2}{1-q^2}\sum_{\tau=1}^t\Big(\frac{1+q^2}{2}\Big)^{t-\tau} E\|\bar\Delta_\tau\|^2 \le \frac{4q^2\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)}{1-q^2}\sum_{\tau=1}^t\Big(\frac{1+q^2}{2}\Big)^{t-\tau} E\|\nabla f(\theta_\tau)\|^2 + \frac{16\eta_l^2 q^2 K}{(1-q^2)^2 n}\sigma^2 + \frac{80\eta_l^4 q^2 K^3 L^2}{(1-q^2)^2}(\sigma^2 + 6K\sigma_g^2),
\]
where we use Lemma C.2 to bound the averaged local model update; this proves the second claim. In addition, we know that $\|\Delta_{t,i}\| \le \eta_l K G$ by Assumption 4 for any $t$. The absolute bound $\|e_{t,i}\|^2 \le \frac{4\eta_l^2 q_C^2 K^2 G^2}{(1-q_C^2)^2}$ then follows from (16) by a similar recursion argument applied to the local error $e_{t,i}$, together with the fact that $q_C \le \max\{q_C, q_A\} = q$.

Lemma C.5. For the moving average error sequence $E_t$, it holds that
\[
\|E_t\|^2 \le \frac{4\eta_l^2 q^2 K^2 G^2}{(1-q^2)^2} \ \ \forall t, \qquad \sum_{t=1}^T E\|E_t\|^2 \le \frac{8q^2\eta_l^2 K^2(60\eta_l^2 K^2 L^2 + 1)}{(1-q^2)^2}\sum_{t=1}^T E\|\nabla f(\theta_t)\|^2 + \frac{16T\eta_l^2 q^2 K}{(1-q^2)^2 n}\sigma^2 + \frac{80T\eta_l^4 q^2 K^3 L^2}{(1-q^2)^2}(\sigma^2 + 6K\sigma_g^2).
\]
Proof.
The first claim follows directly from the definition of $E_t$:
\[
\|E_t\| = (1-\beta_1)\Big\|\sum_{\tau=1}^{t}\beta_1^{t-\tau}\bar e_\tau\Big\| \le \max_{\tau\le t}\|\bar e_\tau\| \le \frac{2\eta_l qKG}{1-q^2}.
\]
Denote $\mathcal{K}_t := \sum_{\tau=1}^{t}\big(\frac{1+q^2}{2}\big)^{t-\tau}\mathbb{E}\|\nabla f(\theta_\tau)\|^2$. By the same technique as in the proof of Lemma C.3, denoting $\bar e_{\tau,j}$ as the $j$-th coordinate of $\bar e_\tau$, we can bound the accumulated error sequence by
\[
\mathbb{E}\|E_t\|^2 = \mathbb{E}\Big\|(1-\beta_1)\sum_{\tau=1}^{t}\beta_1^{t-\tau}\bar e_\tau\Big\|^2 \le (1-\beta_1)^2\sum_{j=1}^{d}\mathbb{E}\Big(\sum_{\tau=1}^{t}\beta_1^{t-\tau}\bar e_{\tau,j}\Big)^2
\overset{(a)}{\le} (1-\beta_1)^2\sum_{j=1}^{d}\mathbb{E}\Big[\Big(\sum_{\tau=1}^{t}\beta_1^{t-\tau}\Big)\Big(\sum_{\tau=1}^{t}\beta_1^{t-\tau}\bar e_{\tau,j}^2\Big)\Big]
\le (1-\beta_1)\sum_{\tau=1}^{t}\beta_1^{t-\tau}\,\mathbb{E}\|\bar e_\tau\|^2
\]
\[
\overset{(b)}{\le} \frac{16\eta_l^2q^2K}{(1-q^2)^2 n}\sigma^2 + \frac{80\eta_l^4q^2K^3L^2}{(1-q^2)^2}(\sigma^2+6K\sigma_g^2) + \frac{4(1-\beta_1)q^2\eta_l^2K^2(60\eta_l^2K^2L^2+1)}{1-q^2}\sum_{\tau=1}^{t}\beta_1^{t-\tau}\mathcal{K}_\tau,
\]
where (a) is due to the Cauchy–Schwarz inequality and (b) is a result of Lemma C.4. Summing over $t=1,\dots,T$ and applying geometric series summation leads to
\[
\sum_{t=1}^{T}\mathbb{E}\|E_t\|^2 \le \frac{16T\eta_l^2q^2K}{(1-q^2)^2 n}\sigma^2 + \frac{80T\eta_l^4q^2K^3L^2}{(1-q^2)^2}(\sigma^2+6K\sigma_g^2) + \frac{4(1-\beta_1)q^2\eta_l^2K^2(60\eta_l^2K^2L^2+1)}{1-q^2}\sum_{t=1}^{T}\sum_{\tau=1}^{t}\beta_1^{t-\tau}\mathcal{K}_\tau
\]
\[
\le \frac{16T\eta_l^2q^2K}{(1-q^2)^2 n}\sigma^2 + \frac{80T\eta_l^4q^2K^3L^2}{(1-q^2)^2}(\sigma^2+6K\sigma_g^2) + \frac{4q^2\eta_l^2K^2(60\eta_l^2K^2L^2+1)}{1-q^2}\sum_{t=1}^{T}\sum_{\tau=1}^{t}\Big(\frac{1+q^2}{2}\Big)^{t-\tau}\mathbb{E}\|\nabla f(\theta_\tau)\|^2
\]
\[
\le \frac{16T\eta_l^2q^2K}{(1-q^2)^2 n}\sigma^2 + \frac{80T\eta_l^4q^2K^3L^2}{(1-q^2)^2}(\sigma^2+6K\sigma_g^2) + \frac{8q^2\eta_l^2K^2(60\eta_l^2K^2L^2+1)}{(1-q^2)^2}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta_t)\|^2.
\]
The desired result is obtained.

Lemma C.6. For all $t\in[T]$ and $i\in[d]$, it holds that $\hat v_{t,i} \le \frac{4\eta_l^2(1+q^2)^3K^2}{(1-q^2)^2}G^2$.

Proof. For any $t$, by Lemma C.4 and Assumption 4 we have
\[
\|\tilde\Delta_t\|^2 = \|C(\Delta_t+e_t)\|^2 \le 2\|C(\Delta_t+e_t)-(\Delta_t+e_t)\|^2 + 2\|\Delta_t+e_t\|^2 \le 2(q^2+1)\|\Delta_t+e_t\|^2 \le 4(q^2+1)\Big(\eta_l^2K^2G^2 + \frac{4\eta_l^2q^2K^2G^2}{(1-q^2)^2}\Big) = \frac{4\eta_l^2(1+q^2)^3K^2G^2}{(1-q^2)^2}.
\]
Consider the updating rule $\hat v_t = \max\{v_t, \hat v_{t-1}\}$. For each coordinate, there exists a $j\in[t]$ such that $\hat v_{t,i} = v_{j,i}$.
Thus, we have
\[
\hat v_{t,i} = (1-\beta_2)\sum_{\tau=1}^{j}\beta_2^{j-\tau}\tilde\Delta_{\tau,i}^2 \le \frac{4\eta_l^2(1+q^2)^3K^2G^2}{(1-q^2)^2},
\]
which proves the claim.

Lemma C.7. Let $D_t := \frac{1}{\sqrt{\hat v_{t-1}+\epsilon}} - \frac{1}{\sqrt{\hat v_t+\epsilon}}$ be defined coordinate-wise as above. Then,
\[
\sum_{t=1}^{T}\|D_t\|_1 \le \frac{d}{\sqrt{\epsilon}},\qquad \sum_{t=1}^{T}\|D_t\|^2 \le \frac{d}{\epsilon}.
\]

Proof. By the updating rule of Fed-EF-AMS, $\hat v_{t-1} \le \hat v_t$ for all $t$. Therefore, by the initialization $\hat v_0 = 0$, we have
\[
\sum_{t=1}^{T}\|D_t\|_1 = \sum_{t=1}^{T}\sum_{i=1}^{d}\Big(\frac{1}{\sqrt{\hat v_{t-1,i}+\epsilon}} - \frac{1}{\sqrt{\hat v_{t,i}+\epsilon}}\Big) = \sum_{i=1}^{d}\Big(\frac{1}{\sqrt{\hat v_{0,i}+\epsilon}} - \frac{1}{\sqrt{\hat v_{T,i}+\epsilon}}\Big) \le \frac{d}{\sqrt{\epsilon}}.
\]
For the sum of squared $\ell_2$ norms, note that for $a \ge b > 0$ it holds that $(a-b)^2 \le (a-b)(a+b) = a^2-b^2$. Thus,
\[
\sum_{t=1}^{T}\|D_t\|^2 \le \sum_{t=1}^{T}\sum_{i=1}^{d}\Big(\frac{1}{\hat v_{t-1,i}+\epsilon} - \frac{1}{\hat v_{t,i}+\epsilon}\Big) \le \frac{d}{\epsilon},
\]
which gives the desired result.
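The telescoping bounds of Lemma C.7 are easy to check numerically. The following sketch builds a random nondecreasing sequence $\hat v_0 = 0 \le \hat v_1 \le \dots$ (mimicking the AMSGrad-style maximum update) and verifies both bounds; the function name and the synthetic sequence are illustrative, not from the paper.

```python
import numpy as np

def check_dt_bounds(T=50, d=8, eps=1e-3, seed=0):
    """Numerically check the telescoping bounds of Lemma C.7 on random data."""
    rng = np.random.default_rng(seed)
    # Nondecreasing per-coordinate sequence v_hat with v_hat_0 = 0,
    # mimicking the AMSGrad update v_hat_t = max(v_t, v_hat_{t-1}).
    v_hat = np.cumsum(rng.random((T + 1, d)), axis=0)
    v_hat[0] = 0.0
    l1_sum, sq_sum = 0.0, 0.0
    for t in range(1, T + 1):
        D_t = 1.0 / np.sqrt(v_hat[t - 1] + eps) - 1.0 / np.sqrt(v_hat[t] + eps)
        l1_sum += np.abs(D_t).sum()      # accumulates ||D_t||_1
        sq_sum += (D_t ** 2).sum()       # accumulates ||D_t||^2
    return l1_sum <= d / np.sqrt(eps) + 1e-12, sq_sum <= d / eps + 1e-12

print(check_dt_bounds())
```

Both sums telescope because $\hat v$ is coordinate-wise nondecreasing, so every coordinate of $D_t$ is nonnegative.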

C.4 PROOF OF THEOREM 3: PARTIAL PARTICIPATION

Proof. We use a proof structure similar to the full-participation analysis of Fed-EF-SGD in Section C.2. As before, we first define the virtual iterates
\[
x_{t+1} = \theta_{t+1} - \eta\,\frac{1}{m}\sum_{i=1}^{n}e_{t+1,i}
= \theta_t - \eta\tilde{\bar\Delta}_{t,\mathcal{M}_t} - \eta\,\frac{1}{m}\sum_{i\in\mathcal{M}_t}e_{t+1,i} - \eta\,\frac{1}{m}\sum_{i\notin\mathcal{M}_t}e_{t+1,i}
= \theta_t - \eta\bar\Delta_{t,\mathcal{M}_t} - \eta\,\frac{1}{m}\sum_{i\in\mathcal{M}_t}e_{t,i} - \eta\,\frac{1}{m}\sum_{i\notin\mathcal{M}_t}e_{t,i} \quad (17)
\]
\[
= \theta_t - \eta\bar\Delta_{t,\mathcal{M}_t} - \eta\,\frac{1}{m}\sum_{i=1}^{n}e_{t,i}
= x_t - \eta\bar\Delta_{t,\mathcal{M}_t}.
\]
Here, (17) follows from the partial participation setup: inactive clients do not accumulate error, i.e., $e_{t+1,i} = e_{t,i}$ for $i\notin\mathcal{M}_t$, while for active clients $\tilde\Delta_{t,i} + e_{t+1,i} = \Delta_{t,i} + e_{t,i}$. The smoothness of the loss functions implies
\[
f(x_{t+1}) \le f(x_t) + \langle\nabla f(x_t), x_{t+1}-x_t\rangle + \frac{L}{2}\|x_{t+1}-x_t\|^2.
\]
Taking expectation with respect to the randomness at round $t$ and decomposing the resulting bound into terms I, II and III as in the full-participation analysis, term II can be bounded as
\[
\mathrm{II} \le \frac{\eta^2\eta_l^2KL}{2m}\sigma^2 + \frac{\eta^2\eta_l^2L}{2n(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n}\sum_{k=1}^{K}\nabla f_i(\theta_{t,i}^{(k)})\Big\|^2 + C'\,\frac{3\eta^2\eta_l^2K^2L(30\eta_l^2K^2L^2+1)}{2m}\,\mathbb{E}\|\nabla f(\theta_t)\|^2 + \frac{15\eta^2\eta_l^4K^3L^3}{2m}(\sigma^2+6K\sigma_g^2) + \frac{3\eta^2\eta_l^2K^2L}{2m}\sigma_g^2,
\]
with $C' = \frac{n-m}{n-1}$. Furthermore, we have
\[
\mathrm{III} \le 2\eta^2L\,\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{n}e_{t,i}\Big\|^2 + \frac{\eta^2L}{2}\,\mathbb{E}\|\bar\Delta_{t,\mathcal{M}_t}\|^2.
\]
The second term is handled in the same way as term II; we now bound the first term. Denote $\tilde e_{t,i} = e_{t,i} + \Delta_{t,i} - \tilde\Delta_{t,i}$. We have by Lemma C.1 that
\[
\mathbb{E}[\|e_{t,i}\|^2] \le \frac{20q^2\eta_l^2K}{(1-q^2)^2}(\sigma^2+6K\sigma_g^2) + \frac{60\eta_l^2q^2K^2}{1-q^2}\sum_{\tau=1}^{t}\Big(\frac{1+q^2}{2}\Big)^{t-\tau}\mathbb{E}\|\nabla f(\theta_\tau)\|^2,
\qquad
\mathbb{E}[\|\Delta_{t,i}\|^2] \le 5\eta_l^2K(\sigma^2+6K\sigma_g^2) + 30\eta_l^2K^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2,
\]
and consequently
\[
\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{n}\tilde e_{t,i}\Big\|^2 \le \frac{n-m}{m(n-1)}\Big[\frac{70q^2\eta_l^2K}{(1-q^2)^2}(\sigma^2+6K\sigma_g^2) + \frac{180\eta_l^2q^2K^2}{1-q^2}\sum_{\tau=1}^{t}\Big(\frac{1+q^2}{2}\Big)^{t-\tau}\mathbb{E}\|\nabla f(\theta_\tau)\|^2 + \frac{60\eta_l^2q^2K^2}{(1-q^2)^2}\,\mathbb{E}\|\nabla f(\theta_t)\|^2\Big].
\]
Recall $q = \max\{q_A, q_C\}$ and let $\gamma = (1-q^2)/2q^2$. We have
\[
\frac{m(1+\gamma)q^2 + (n-m)}{n} = 1 - \frac{(1-q^2)m}{2n} < 1,\qquad
\frac{m(1+1/\gamma)q^2}{n} = \frac{m(1+q^2)q^2}{n(1-q^2)} \le \frac{2mq^2}{n(1-q^2)}.
\]
By the recursion argument used before, applying Lemma C.2 (rescaled by a factor of $n^2/m^2$) we obtain
\[
\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{n}e_{t+1,i}\Big\|^2 \le \frac{2mq^2}{n(1-q^2)}\sum_{\tau=1}^{t}\Big(1-\frac{(1-q^2)m}{2n}\Big)^{t-\tau}\frac{n^2}{m^2}\Big[\frac{\eta_l^2}{n^2}\,\mathbb{E}\Big\|\sum_{i=1}^{n}\sum_{k=1}^{K}\nabla f_i(\theta_{\tau,i}^{(k)})\Big\|^2 + \frac{\eta_l^2K}{n}\sigma^2\Big]
\]
\[
\quad + \frac{2n(n-m)}{(1-q^2)m^2(n-1)}\Big[\frac{70q^2\eta_l^2K}{(1-q^2)^2}(\sigma^2+6K\sigma_g^2) + \frac{180\eta_l^2q^2K^2}{1-q^2}\sum_{\tau=1}^{t}\Big(\frac{1+q^2}{2}\Big)^{t-\tau}\mathbb{E}\|\nabla f(\theta_\tau)\|^2 + \frac{60\eta_l^2q^2K^2}{(1-q^2)^2}\,\mathbb{E}\|\nabla f(\theta_t)\|^2\Big]
\]
\[
\le \frac{2\eta_l^2q^2}{(1-q^2)mn}\sum_{\tau=1}^{t}\Big(1-\frac{(1-q^2)m}{2n}\Big)^{t-\tau}\mathbb{E}\Big\|\sum_{i=1}^{n}\sum_{k=1}^{K}\nabla f_i(\theta_{\tau,i}^{(k)})\Big\|^2 + \frac{4\eta_l^2q^2Kn}{(1-q^2)^2m^2}\sigma^2 + \frac{280\eta_l^2q^2(n-m)K}{(1-q^2)^3m^2}(\sigma^2+6K\sigma_g^2)
\]
\[
\quad + \frac{720\eta_l^2q^2(n-m)K^2}{(1-q^2)^2m^2}\sum_{\tau=1}^{t}\Big(\frac{1+q^2}{2}\Big)^{t-\tau}\mathbb{E}\|\nabla f(\theta_\tau)\|^2 + \frac{240\eta_l^2q^2(n-m)K^2}{(1-q^2)^3m^2}\,\mathbb{E}\|\nabla f(\theta_t)\|^2.
\]
Summing over $t=1,\dots,T$ gives
\[
\sum_{t=1}^{T}\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{n}e_{t+1,i}\Big\|^2 \le \frac{4\eta_l^2q^2}{(1-q^2)^2m^2}\sum_{t=1}^{T}\mathbb{E}\Big\|\sum_{i=1}^{n}\sum_{k=1}^{K}\nabla f_i(\theta_{t,i}^{(k)})\Big\|^2 + \frac{4T\eta_l^2q^2Kn}{(1-q^2)^2m^2}\sigma^2 + \frac{280T\eta_l^2q^2(n-m)K}{(1-q^2)^3m^2}(\sigma^2+6K\sigma_g^2) + \frac{1680\eta_l^2q^2(n-m)K^2}{(1-q^2)^3m^2}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta_t)\|^2.
\]
Now we return to (18). Taking the telescoping sum over $t=1,\dots,T$, we have
\[
\mathbb{E}[f(x_{T+1})] - f(x_1) \le -\frac{\eta\eta_lK}{4}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta_t)\|^2 + \frac{5T\eta\eta_l^3K^2L^2}{2}(\sigma^2+6K\sigma_g^2) - \frac{\eta\eta_l}{2Kn^2}\sum_{t=1}^{T}\mathbb{E}\Big\|\sum_{i=1}^{n}\sum_{k=1}^{K}\nabla f_i(\theta_{t,i}^{(k)})\Big\|^2 + \frac{T\eta^2\eta_l^2KL}{m}\sigma^2
\]
\[
\quad + \frac{\eta^2\eta_l^2L}{n(n-1)}\sum_{t=1}^{T}\mathbb{E}\Big\|\sum_{i=1}^{n}\sum_{k=1}^{K}\nabla f_i(\theta_{t,i}^{(k)})\Big\|^2 + \frac{3\eta^2\eta_l^2C'K^2L(30\eta_l^2K^2L^2+1)}{m}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta_t)\|^2 + \frac{15T\eta^2\eta_l^4C'K^3L^3}{m}(\sigma^2+6K\sigma_g^2) + \frac{3T\eta^2\eta_l^2C'K^2L}{m}\sigma_g^2
\]
\[
\quad + \frac{8\eta^2\eta_l^2q^2L}{(1-q^2)^2m^2}\sum_{t=1}^{T}\mathbb{E}\Big\|\sum_{i=1}^{n}\sum_{k=1}^{K}\nabla f_i(\theta_{t,i}^{(k)})\Big\|^2 + \frac{8T\eta^2\eta_l^2q^2KLn}{(1-q^2)^3m^2}\sigma^2 + \frac{560T\eta^2\eta_l^2q^2(n-m)KL}{(1-q^2)^3m^2}(\sigma^2+6K\sigma_g^2) + \frac{3360\eta^2\eta_l^2q^2(n-m)K^2L}{(1-q^2)^3m^2}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta_t)\|^2.
\]
When the learning rate satisfies
\[
\eta_l \le \min\Big\{\frac{1}{6},\ \frac{m}{96C'\eta},\ \frac{m^2}{53760(n-m)C_1\eta},\ \frac{1}{4\eta},\ \frac{1}{32C_1\eta}\Big\}\cdot\frac{1}{KL},
\]
we can get
\[
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta_t)\|^2 \lesssim \frac{f(\theta_1)-f(\theta^*)}{\eta\eta_lTK} + \Big(\frac{\eta\eta_lL}{m} + \frac{8\eta\eta_lC_1Ln}{m^2}\Big)\sigma^2 + \frac{3\eta\eta_lC'KL}{m}\sigma_g^2 + \Big(\frac{5\eta_l^2KL^2}{2} + \frac{15\eta\eta_l^3C'K^2L^3}{m} + \frac{560\eta\eta_lC_1(n-m)L}{m^2}\Big)(\sigma^2+6K\sigma_g^2),
\]
where $C_1 = q^2/(1-q^2)^3$. Denote $B = n/m$. Choosing $\eta = \Theta(\sqrt{Km})$ and $\eta_l = \Theta(\frac{1}{K\sqrt{TB}})$, the rate can be further bounded by
\[
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta_t)\|^2 = O\Big(\frac{\sqrt{B}(f(\theta_1)-f(\theta^*))}{\sqrt{TKm}} + \Big(\frac{1}{\sqrt{TKmB}} + \frac{\sqrt{B}}{\sqrt{TKm}}\Big)\sigma^2 + \frac{\sqrt{K}}{\sqrt{TmB}}\sigma_g^2 + \Big(\frac{1}{TKB} + \frac{1}{T^{3/2}B^{3/2}\sqrt{Km}} + \frac{\sqrt{B}}{\sqrt{TKm}}\Big)(\sigma^2+6K\sigma_g^2)\Big),
\]
which, ignoring the smaller terms, simplifies to
\[
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta_t)\|^2 = O\Big(\frac{\sqrt{n}}{\sqrt{m}}\cdot\frac{f(\theta_1)-f(\theta^*)}{\sqrt{TKm}} + \frac{1}{\sqrt{TKm}}\sigma^2 + \frac{\sqrt{K}}{\sqrt{Tm}}\sigma_g^2\Big).
\]
This completes the proof.

Here we again use the fact of error feedback that $\phi_{t+1} + \tilde H_t = \phi_t + \Delta_t$. We can then construct a sequence $x_t$ similar to (13), associated with $\tilde\theta_t$, by
\[
x_{t+1} = \tilde\theta_{t+1} - \eta\bar e_{t+1} = x_t - \eta\bar\Delta_t,
\]
and apply the same analysis as in Section C.2 to derive the convergence bound. The only difference is in (15), where the second term becomes
\[
\frac{\eta^2L^2}{2}\,\mathbb{E}\|\bar e_t + \phi_t\|^2 \le \eta^2L^2\,\mathbb{E}\|\bar e_t\|^2 + \eta^2L^2\,\mathbb{E}\|\phi_t\|^2. \quad (19)
\]
The first term can be bounded in the same way as in (15). For the second term, we use a trick similar to Lemma C.4:
\[
\|\phi_{t+1}\|^2 = \|\phi_t + \Delta_t - C(\phi_t+\Delta_t)\|^2 \le q_C^2\|\phi_t+\Delta_t\|^2 \le \frac{1+q_C^2}{2}\|\phi_t\|^2 + \frac{2q_C^2}{1-q_C^2}\|\Delta_t\|^2.
\]
Then, by recursion and the geometric sum, $\|\phi_{t+1}\|^2$ can be bounded by the second term above up to a constant. We can write
\[
\mathbb{E}\|\Delta_t\|^2 = \mathbb{E}\|\bar\Delta_t + \bar e_t - \bar e_{t+1}\|^2 \le 3\big(\mathbb{E}\|\bar\Delta_t\|^2 + \mathbb{E}\|\bar e_t\|^2 + \mathbb{E}\|\bar e_{t+1}\|^2\big).
\]
As a result, it holds that $\mathbb{E}\|\phi_t\|^2 \le O(\mathbb{E}\|\bar e_t\|^2)$, since $\mathbb{E}\|\bar e_t\|^2 = O(\mathbb{E}\|\bar\Delta_t\|^2)$ by Lemma C.2 and Lemma C.4 under our assumptions. Therefore, (19) has the same order as (15). Since the other parts of the proof are unchanged, we conclude that two-way compression does not asymptotically change the convergence rate of Fed-EF.
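To illustrate the setting analyzed in Theorem 3, the following is a minimal simulation sketch of Fed-EF-SGD with partial participation. It uses hypothetical quadratic local losses $f_i(\theta) = \tfrac12\|\theta-c_i\|^2$ and a TopK compressor; only sampled clients touch their error accumulators, which is exactly the source of the "stale error compensation" effect. All names and parameter choices here are illustrative, not from the paper's experiments.

```python
import numpy as np

def topk(x, k):
    """TopK compressor: keep the k largest-magnitude coordinates (biased)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def fed_ef_sgd_pp(n=20, m=5, d=50, K=5, T=400, eta=1.0, eta_l=0.01, k=5, seed=0):
    """Toy Fed-EF-SGD with partial participation on quadratics
    f_i(theta) = 0.5 * ||theta - c_i||^2 (hypothetical losses chosen for
    illustration). Returns (initial, final) distances to the global optimum."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=(n, d))            # client optima (heterogeneous data)
    theta = np.zeros(d)
    e = np.zeros((n, d))                   # per-client error accumulators
    for _ in range(T):
        active = rng.choice(n, size=m, replace=False)
        agg = np.zeros(d)
        for i in active:                   # only sampled clients update errors
            local = theta.copy()
            for _ in range(K):             # K local GD steps on f_i
                local -= eta_l * (local - c[i])
            delta = theta - local          # full-precision local update
            comp = topk(delta + e[i], k)   # compress update + carried error
            e[i] = e[i] + delta - comp     # error feedback
            agg += comp
        theta -= eta * agg / m             # global step
    opt = c.mean(axis=0)
    return np.linalg.norm(opt), np.linalg.norm(theta - opt)

print(fed_ef_sgd_pp())
```

Running it, the final distance to the average optimum is much smaller than the initial one, while the residual fluctuation reflects the client-sampling variance that the $n/m$ factor in the theorem captures.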



Figure 1: MLP trained by Fed-SGD on MNIST: full-precision vs. Sign compression, n = 200 non-iid clients.

Algorithm 1: Compressed Federated Learning with Error Feedback (Fed-EF)
Input: learning rates η, η_l; hyper-parameters β_1, β_2, ϵ
Initialize: central server parameter θ_1 ∈ R^d; error accumulator e_{1,i} = 0 for each worker; m_0 = 0, v_0 = 0, v̂_0 = 0
for t = 1, . . . , T do
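A compact sketch of one communication round of the Fed-EF-SGD variant of the algorithm above. The scaled-sign compressor and the helper `grads_fn(i, theta)` (returning a stochastic gradient of $f_i$) are illustrative assumptions; the core lines are the error-compensated compression and the accumulator update $e_{t+1,i} = e_{t,i} + \Delta_{t,i} - \tilde\Delta_{t,i}$.

```python
import numpy as np

def sign_compress(x):
    """Scaled sign compressor: 1 bit per coordinate plus one scalar (biased)."""
    return np.mean(np.abs(x)) * np.sign(x)

def fed_ef_round(theta, grads_fn, errors, eta=1.0, eta_l=0.05, K=5, C=sign_compress):
    """One round of Fed-EF-SGD with full participation (a sketch under
    assumed helper names; `errors` is the list of per-worker accumulators,
    mutated in place)."""
    n = len(errors)
    agg = np.zeros_like(theta)
    for i in range(n):
        local = theta.copy()
        for _ in range(K):
            local -= eta_l * grads_fn(i, local)   # K local steps
        delta = theta - local                     # local update Delta_{t,i}
        comp = C(delta + errors[i])               # compressed, error-compensated
        errors[i] = errors[i] + delta - comp      # error feedback update
        agg += comp
    return theta - eta * agg / n                  # global SGD step
```

On simple deterministic quadratics this round function drives the global model toward the average of the client optima despite the 1-bit compression, because the error accumulators re-inject what the compressor discards.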

Figure 4: MLP on MNIST with TopK-0.01 compressed Fed-EF: Squared gradient norm (left two) and train loss (right two) against the number of training rounds, averaged over 20 independent runs.

Figure 6: CIFAR-10 dataset trained by ResNet-18. Test accuracy of Fed-EF with TopK, Sign and heavy-Sign compressors. Participation rate p = 0.5, non-iid data. "sign", "topk" and "hv-sign" are applied with Fed-EF, while "Stoc" is the stochastic quantization without EF. 1st row: Fed-EF-SGD. 2nd row: Fed-EF-AMS. The last column presents the corresponding curves that reach the full-precision accuracy with the lowest communication.

Figure 7: CIFAR-10 dataset trained by ResNet-18. Test accuracy of Fed-EF with TopK, Sign and heavy-Sign compressors. Participation rate p = 0.1, non-iid data. "sign", "topk" and "hv-sign" are applied with Fed-EF, while "Stoc" is the stochastic quantization without EF. 1st row: Fed-EF-SGD. 2nd row: Fed-EF-AMS. The last column presents the corresponding curves that reach the full-precision accuracy with the lowest communication.

Test accuracy (%) with client participation rate p = 0.1, of Fed-EF-AMS with Sign, TopK and heavy-Sign compressors, and Stoc (stochastic quantization) without EF. The compression parameters (i.e., k and b) of the compressors are consistent with Figure 3.

A.5 DATA SPLIT AND PARAMETER TUNING

In our experiments, for n = 200 clients, we first split the training samples into 2n = 400 shards, where each shard contains samples from only one class. Each client is then uniformly assigned two shards at random. This way, in expectation, around 180 clients hold samples from two classes and about 20 clients hold samples from only one class, which corresponds to strong data heterogeneity among the clients.
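The shard-based split above can be sketched as follows. The function name is hypothetical, and the single-class-per-shard property assumes class counts divide the shards evenly, as in the balanced MNIST-style setups used here.

```python
import numpy as np

def shard_split(labels, n_clients=200, shards_per_client=2, seed=0):
    """Non-iid split: sort sample indices by label into 2n single-class
    shards, then assign each client two shards uniformly at random."""
    rng = np.random.default_rng(seed)
    n_shards = n_clients * shards_per_client
    order = np.argsort(labels, kind="stable")      # group sample indices by class
    shards = np.array_split(order, n_shards)       # each shard: one class (even split)
    perm = rng.permutation(n_shards)
    return [np.concatenate([shards[perm[c * shards_per_client + s]]
                            for s in range(shards_per_client)])
            for c in range(n_clients)]
```

Each client then holds at most two classes, matching the heterogeneity level described in the text.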

Figure 8: MNIST dataset trained by CNN. Training loss of Fed-EF with TopK, Sign and heavy-Sign compressors, and Stoc without EF. Participation rate p = 0.5, non-iid data. "sign", "topk" and "hv-sign" are applied with Fed-EF, while "Stoc" is the stochastic quantization without EF. 1st row: Fed-EF-SGD. 2nd row: Fed-EF-AMS.

Figure 9: FMNIST dataset trained by CNN. Training loss of Fed-EF with TopK, Sign and heavy-Sign compressors, and Stoc without EF. Participation rate p = 0.5, non-iid data. "sign", "topk" and "hv-sign" are applied with Fed-EF, while "Stoc" is the stochastic quantization without EF. 1st row: Fed-EF-SGD. 2nd row: Fed-EF-AMS.

Figure 16: The simulated gradients of 5 heterogeneous clients, drawn from N(0, γ²) with γ = 0.01. The gradient on each distinct client is scaled by s = 2, 10, 100 (left, mid, right), respectively. A larger s implies higher data heterogeneity.
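A minimal sketch of such a synthetic heterogeneous-gradient setup. The exact per-client scaling scheme behind Figure 16 is assumed here (we scale one designated client by s); only the qualitative behavior (larger s, larger disparity between clients) is the point.

```python
import numpy as np

def simulate_hetero_grads(n=5, d=1000, gamma=0.01, s=10, hetero_client=0, seed=0):
    """Draw client gradients from N(0, gamma^2) and scale one designated
    client by s to emulate data heterogeneity (hypothetical construction)."""
    rng = np.random.default_rng(seed)
    g = rng.normal(scale=gamma, size=(n, d))
    g[hetero_client] *= s
    return g
```

With d large, the scaled client's gradient norm concentrates around s times the others', so the heterogeneity level is directly controlled by s.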

Figure 19: The compression coefficient q_A in Assumption 3 in our experiments (Section 5), for the CNN trained on the MNIST and FMNIST datasets, averaged over multiple runs. η = 1, η_l = 0.01, non-iid client data distribution. Left: Sign compression. Mid: TopK compression with k = 0.01. Right: TopK compression with k = 0.1.
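A simplified analogue of this measurement can be obtained by evaluating the contraction ratio of Assumption 3, $\|C(x)-x\| \le q\|x\|$, on synthetic inputs. The snippet below estimates q for TopK on Gaussian vectors; the real measurement in Figure 19 uses the actual transmitted updates during training, so the Gaussian inputs here are an assumption for illustration only.

```python
import numpy as np

def topk(x, k):
    """TopK compressor: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def empirical_q(C, d=10000, trials=20, seed=0):
    """Estimate q with ||C(x) - x|| <= q ||x|| over random Gaussian inputs."""
    rng = np.random.default_rng(seed)
    return max(np.linalg.norm(C(x) - x) / np.linalg.norm(x)
               for x in rng.normal(size=(trials, d)))

q_topk = empirical_q(lambda x: topk(x, k=1000))   # TopK with k = 0.1 * d
print(round(q_topk, 3))
```

For any TopK fraction, q is strictly below 1, since the discarded coordinates never carry all of the squared norm.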

and $\|C(x)\|^2 \ge k\|x\|^2$ by Proposition A.1.

Test accuracy (%) with client participation rate p = 0.5, of Fed-EF-SGD with Sign, TopK and heavy-Sign compressors, and Stoc (stochastic quantization) without EF. The compression parameters (i.e., k and b) of the compressors are consistent with Figure 2.

Test accuracy (%) with client participation rate p = 0.5, of Fed-EF-AMS with Sign, TopK and heavy-Sign compressors, and Stoc (stochastic quantization) without EF. The compression parameters (i.e., k and b) of the compressors are consistent with Figure 2.

Test accuracy (%) with client participation rate p = 0.1, of Fed-EF-SGD with Sign, TopK and heavy-Sign compressors, and Stoc (stochastic quantization) without EF. The compression parameters (i.e., k and b) of the compressors are consistent with Figure 3.


Optimal global (η) and local (η_l) learning rate combinations to attain the highest test accuracy.

The proof of Fed-EF-SGD follows in Section C.2. Section C.3 contains intermediary lemmas, and Section C.4 provides the analysis of Fed-EF under partial participation.

C.1 PROOF OF THEOREM 2: FED-EF-AMS

Proof. We first clarify some notation. At round $t$, let the full-precision local model update of the $i$-th worker be $\Delta_{t,i}$, the error accumulator be $e_{t,i}$, and denote the compressed update $\tilde\Delta_{t,i} = C(\Delta_{t,i} + e_{t,i})$.
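A small numeric sanity check of the weighted Cauchy–Schwarz step that the moving-average lemmas (Lemmas C.3 and C.5) apply repeatedly, namely $(\sum_\tau \beta^{t-\tau} a_\tau)^2 \le (\sum_\tau \beta^{t-\tau})(\sum_\tau \beta^{t-\tau} a_\tau^2)$, i.e. Cauchy–Schwarz applied to the vectors $(\sqrt{w_\tau}a_\tau)$ and $(\sqrt{w_\tau})$:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9
for _ in range(100):
    a = rng.normal(size=20)
    w = beta ** np.arange(20)[::-1]    # weights beta^{t-tau}, tau = 1..t
    lhs = (w * a).sum() ** 2
    rhs = w.sum() * (w * a ** 2).sum()
    assert lhs <= rhs + 1e-9           # weighted Cauchy-Schwarz
print("ok")
```

The inequality is what lets the squared moving average of updates be bounded by a geometrically weighted sum of squared updates in the proofs.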

Taking expectation with respect to the randomness at round $t$,
\[
\mathbb{E}[f(x_{t+1})] - f(x_t) \le -\eta\,\mathbb{E}\langle\nabla f(x_t), \bar\Delta_{t,\mathcal{M}_t}\rangle + \frac{\eta^2L}{2}\,\mathbb{E}\|\bar\Delta_{t,\mathcal{M}_t}\|^2
= \underbrace{-\eta\,\mathbb{E}\langle\nabla f(\theta_t), \bar\Delta_{t,\mathcal{M}_t}\rangle}_{\mathrm{I}} + \eta\,\mathbb{E}\langle\nabla f(\theta_t)-\nabla f(x_t), \bar\Delta_{t,\mathcal{M}_t}\rangle + \frac{\eta^2L}{2}\,\mathbb{E}\|\bar\Delta_{t,\mathcal{M}_t}\|^2.
\]
Note that the expectation is also taken with respect to the randomness in the client sampling procedure. Since the client sampling is uniform, $\mathbb{E}[\bar\Delta_{t,\mathcal{M}_t}] = \mathbb{E}[\bar\Delta_t]$. Thus,
\[
\mathrm{I} = -\eta\,\mathbb{E}\langle\nabla f(\theta_t), \bar\Delta_{t,\mathcal{M}_t}\rangle = -\eta\,\mathbb{E}\langle\nabla f(\theta_t), \bar\Delta_t - \eta_lK\nabla f(\theta_t) + \eta_lK\nabla f(\theta_t)\rangle
= -\eta\eta_lK\,\mathbb{E}\|\nabla f(\theta_t)\|^2 - \eta\,\mathbb{E}\big\langle\nabla f(\theta_t),\ \bar\Delta_t - \eta_lK\nabla f(\theta_t)\big\rangle
\]
\[
\overset{(a),(b)}{\le} -\eta\eta_lK\,\mathbb{E}\|\nabla f(\theta_t)\|^2 + \frac{\eta\eta_lK}{2}\,\mathbb{E}\|\nabla f(\theta_t)\|^2
\]
when $\eta_l \le \frac{1}{8KL}$, where (a) is a result of the identity $2\langle z_1, z_2\rangle = \|z_1\|^2 + \|z_2\|^2 - \|z_1-z_2\|^2$, and (b) is because of Lemma C.1.

Denote $\tilde e_{t,i} = e_{t,i} + \Delta_{t,i} - \tilde\Delta_{t,i}$. By the updating rule of $e_{t,i}$, the inner expectation, conditional on $e_t = (e_{t,1},\dots,e_{t,n})^\top$, can be computed by expanding the cross terms $\tilde e_{t,i}^\top\tilde e_{t,j}$, $e_{t,i}^\top e_{t,j}$ and $\tilde e_{t,i}^\top e_{t,j}$, where each summand is bounded via Young's inequality by terms of the form $q^2(1+\gamma)\|e_{t,i}\|^2 + q^2(1+1/\gamma)\|\Delta_{t,i}\|^2$.



Algorithm 3: Fed-EF with two-way compression
for t = 1, . . . , T do
    parallel for worker i ∈ [n] do:
        Receive H̃_{t-1} from the server and set θ^{(1)}_{t,i} = θ_t
        for k = 1, . . . , K do
            Compute stochastic gradient g^{(k)}_{t,i} at θ^{(k)}_{t,i}
            Local update θ^{(k+1)}_{t,i} = θ^{(k)}_{t,i} − η_l g^{(k)}_{t,i}
        end for
        Send the compressed update ∆̃_{t,i} = C(∆_{t,i} + e_{t,i}) to the server
        Update the error e_{t+1,i} = e_{t,i} + ∆_{t,i} − ∆̃_{t,i}
    end parallel
    Central server do:
        Compress H̃_t = C(∆_t + ϕ_t), where ∆_t aggregates the received client updates
        Update global model θ_{t+1} = θ_t − η H̃_t and broadcast H̃_t to clients
        Update server error accumulator ϕ_{t+1} = ϕ_t + (θ_{t+1} − θ_t) − H̃_t
end for

D TWO-WAY COMPRESSION IN FED-EF

As discussed in the paper, our Fed-EF scheme can also be combined with two-way compression, applied to both the uplink (clients-to-server) and downlink (server-to-clients) channels, which leads to even greater communication reduction in practice. The steps can be found in Algorithm 3. Next, we briefly discuss its impact on the convergence analysis. For simplicity, we focus on two-way compressed Fed-EF-SGD; the same arguments hold for Fed-EF-AMS. The general approach is: 1) the clients transmit the compressed updates ∆̃_{t,i} to the server; 2) the server again compresses the aggregated update and broadcasts the compressed H̃_t to the clients, also using error feedback at the central node. Note that this approach requires the clients to additionally store the model from the beginning of each round. To study the convergence of Algorithm 3, we consider a series of virtual iterates analogous to (13).
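The two-way scheme can be sketched as a single round function with error feedback on both channels. The interfaces are assumptions for illustration (`deltas[i]` is client i's full-precision local update, TopK stands in for a generic contractive compressor); the essential lines are the client-side accumulator update and the server-side accumulator $\phi$.

```python
import numpy as np

def topk(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def two_way_round(theta, deltas, client_err, server_err, eta=1.0, k=5):
    """One round of two-way compressed Fed-EF-SGD (a sketch of Algorithm 3).

    Uplink: clients send C(delta_i + e_i) with error feedback.
    Downlink: the server compresses the aggregated step with its own
    accumulator phi and broadcasts the compressed step H_t."""
    n = len(deltas)
    agg = np.zeros_like(theta)
    for i in range(n):                        # uplink compression + EF
        comp = topk(deltas[i] + client_err[i], k)
        client_err[i] = client_err[i] + deltas[i] - comp
        agg += comp
    step = -eta * agg / n                     # intended global step
    H = topk(step + server_err, k)            # downlink compression + EF
    server_err[:] = server_err + step - H     # server accumulator phi update
    return theta + H                          # clients apply broadcast H_t
```

On toy quadratic updates this converges to the same neighborhood as one-way Fed-EF, consistent with the conclusion above that two-way compression does not change the asymptotic rate.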

