FEDSPEED: LARGER LOCAL INTERVAL, LESS COMMUNICATION ROUND, AND HIGHER GENERALIZATION ACCURACY

Abstract

Federated learning is an emerging distributed machine learning framework which jointly trains a global model via a large number of local devices with data privacy protections. Its performance suffers from the non-vanishing biases introduced by the inconsistent local optima and from the rugged client-drifts caused by local over-fitting. In this paper, we propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems. Concretely, FedSpeed applies a prox-correction term on the current local updates to efficiently reduce the biases introduced by the prox-term, a necessary regularizer to maintain strong local consistency. Furthermore, FedSpeed merges the vanilla stochastic gradient with a perturbation computed from an extra gradient ascent step in the neighborhood, thereby alleviating the issue of local over-fitting. Our theoretical analysis indicates that the convergence rate is related to both the number of communication rounds T and the local interval K, with an upper bound of $O(1/T)$ when a proper local interval is set. Moreover, we conduct extensive experiments on real-world datasets to demonstrate the efficiency of our proposed FedSpeed, which converges significantly faster and achieves state-of-the-art (SOTA) performance in general FL experimental settings compared with several baselines, including FedAvg, FedProx, FedCM, FedAdam, SCAFFOLD, FedDyn, FedADMM, etc.

1. INTRODUCTION

Since McMahan et al. (2017) proposed federated learning (FL), it has gradually evolved into an efficient paradigm for large-scale distributed training. Different from traditional deep learning methods, FL allows multiple local clients to jointly train a single global model without data sharing. However, FL is far from mature, as it still suffers from considerable performance degradation on heterogeneously distributed data, a very common setting in practical applications of FL. We recognize the main culprits behind this performance degradation as local inconsistency and local heterogeneous over-fitting. Specifically, for a canonical local-SGD-based FL method, e.g., FedAvg, the non-vanishing biases introduced by the local updates may eventually lead to inconsistent local solutions. The rugged client-drifts resulting from local over-fitting to these inconsistent local solutions may then make the obtained global model degrade into a mere average of the clients' local parameters. The non-vanishing biases have been studied in different forms by several previous works (Charles & Konečnỳ, 2021; Malinovskiy et al., 2020). The inconsistency due to local heterogeneous data compromises global convergence during training and eventually leads to serious client-drifts, which can be formulated as $x^* \neq \frac{1}{m}\sum_{i\in[m]} x_i^*$. Larger data heterogeneity may enlarge the drifts, thereby degrading both the practical convergence rate and the generalization performance.

To strengthen local consistency during local training and avoid the client-drifts resulting from local over-fitting, we propose a novel and practical algorithm, dubbed FedSpeed. Notably, FedSpeed incorporates two novel components to achieve SOTA performance. i) Firstly, FedSpeed inherits a penalized prox-term to force the local offset to stay closer to the initial point at each communication round.
However, recognizing from Hanzely & Richtárik (2020); Khaled et al. (2019) that the prox-term between global and local solutions may introduce an undesirable local training bias, we propose a prox-correction term to counteract this adverse impact. Indeed, in our theoretical analysis, the prox-correction term can be interpreted as a momentum-based term of the weighted local gradients. By utilizing historical gradient information, the bias brought by the prox-term can be effectively corrected. ii) Secondly, to avoid rugged local over-fitting, FedSpeed incorporates a local gradient perturbation by merging the vanilla stochastic gradient with an extra gradient, which can be viewed as taking an extra gradient ascent step at each local update. Based on the analysis in Zhao et al. (2022); van der Hoeven (2020), we demonstrate that the gradient perturbation term can be approximated as adding a penalized squared L2-norm of the stochastic gradients to the original objective function, which efficiently searches for flat local minima (Andriushchenko & Flammarion, 2022) and prevents local over-fitting. We also provide a theoretical analysis of our proposed FedSpeed and further demonstrate that its convergence can be accelerated by setting an appropriately large local interval K. Explicitly, in the non-convex and smooth case, FedSpeed with the extra gradient perturbation achieves a fast convergence rate of $O(1/T)$, which indicates that FedSpeed attains a tighter upper bound with a proper local interval K, without applying a specific global learning rate or assuming a precision for the local solutions (Durmus et al., 2021; Wang et al., 2022). Extensive experiments are conducted on the CIFAR-10/100 and TinyImagenet datasets with a standard ResNet-18-GN network under different heterogeneous settings, which show that our proposed FedSpeed is significantly better than several baselines, e.g.
FedAvg, FedProx, FedCM, FedPD, SCAFFOLD, and FedDyn, both in the stability to an enlarged local interval K and in the test generalization performance in actual training. We summarize the main contributions of this paper as follows:

• We propose a novel and practical federated optimization algorithm, FedSpeed, which applies a prox-correction term to significantly reduce the bias due to the local updates of the prox-term, and an extra gradient perturbation to efficiently avoid local over-fitting, thereby achieving a fast convergence speed with large local steps while maintaining high generalization.

• We provide a convergence upper bound in the non-convex and smooth case and prove that FedSpeed achieves a fast convergence rate of $O(1/T)$ by enlarging the local training interval as $K = O(T)$, without any other harsh assumptions or specific required conditions.

• Extensive experiments are conducted on the CIFAR-10/100 and TinyImagenet datasets to verify the performance of our proposed FedSpeed. To the best of our knowledge, both the convergence speed and the generalization performance achieve SOTA results under general federated settings. FedSpeed outperforms other baselines and is more robust to an enlarged local interval.

2. RELATED WORK

McMahan et al. (2017) propose the federated framework, which jointly trains over several unbalanced and non-IID local datasets while communicating at low cost during the training stage. General FL optimization involves a local client training stage and a global server update operation (Asad et al., 2020), and it has been proved to achieve a linear speedup property in Yang et al. (2021). With the fast development of FL, a series of efficient optimization methods have been applied in the federated framework. Li et al. (2020b) and Kairouz et al. (2021) provide detailed overviews of this field.
There are still many difficulties to be solved in practical scenarios; in this paper we highlight two main challenges, the locally inconsistent solutions and the client-drifts due to heterogeneous over-fitting, which are two acute limitations in federated optimization (Li et al., 2020a; Yang et al., 2019; Konečnỳ et al., 2016; Liu et al., 2022; Shi et al., 2023; Liu et al., 2023). Local consistency. Sahu et al. (2018) study the non-vanishing biases of the inconsistent solutions in experiments and apply a prox-term regularization. FedProx utilizes bounded local updates by penalizing parameters to provide a good guarantee of consistency. Liang et al. (2019) introduce local gradient tracking to reduce the local inconsistency in the local SGD method. Charles & Konečnỳ (2021); Malinovskiy et al. (2020) show that local learning rate decay can balance the trade-off between the convergence rate and the local inconsistency at the rate of $O(\eta_l(K-1))$. Furthermore, Wang et al. (2021; 2020b) use a simple counterexample to show that using adaptive optimizers or different hyper-parameters on local clients leads to additional gaps, and they propose a local correction technique to alleviate the biases. Wang et al. (2020a); Tan et al. (2022) consider different local settings and prove that, in the case of asynchronous aggregation, the inconsistency bias can no longer be eliminated by local learning rate decay. Durmus et al. (2021) propose FedDyn as a variant via averaging all the dual variables (the averaged quantity can then be viewed as the global gradient) under the partial participation setting, which can also achieve the same $O(1/T)$ rate under the assumption that an exact local solution can be found by the local optimizer. Wang et al. (2022); Gong et al. (2022) propose two other variants that apply different dual variable aggregation strategies under partial participation settings.
These methods benefit from applying the prox-term (Li et al., 2019; Chen & Chao, 2020) or more efficient optimization methods (Bischoff et al., 2021; Yang et al., 2022) to control the local consistency. Client-drifts. Karimireddy et al. (2020) first demonstrate the client-drifts in the federated learning framework, indicating the negative impact on the global model when each local client over-fits to its local heterogeneous dataset. They propose SCAFFOLD, a variance reduction technique, to mitigate these drifts. Yu et al. (2019) and Wang et al. (2019) introduce momentum instead of the gradient into the local and global updates respectively to improve generalization performance. To maintain the property of consistency, Xu et al. (2021) propose a novel client-level momentum term to improve the local training process. Ozfatura et al. (2021) incorporate the client-level momentum with local momentum to further control the biases. Recently, Gao et al. (2022); Kim et al. (2022) propose a drift correction term as a penalized loss on the original local objective functions with a global gradient estimation. Chen et al. (2020) and Chen et al. (2021; 2022) focus on adaptive methods to alleviate the biases and improve efficiency. Our proposed FedSpeed inherits the prox-term in the local update to guarantee local consistency during local training. Different from previous works, we adopt an extra prox-correction term to reduce the bias introduced during local training by the update direction toward the last global model parameters. This ensures that the local update can be corrected toward the global minimum. Furthermore, we incorporate a gradient perturbation update, which merges a gradient ascent step, to enhance the generalization performance of the local model.

3. METHODOLOGY

In this part, we introduce the preliminaries and our proposed method. We explain the implicit meaning of each variable and present the FedSpeed algorithm in detail. Notations and preliminary. Let m be the number of total clients. We denote $\mathcal{S}^t$ as the set of active clients at round t, K as the number of local updates, and T as the number of communication rounds. $(\cdot)_{i,k}^t$ denotes the variable $(\cdot)$ at the k-th iteration of the t-th round on the i-th client. $x$ is the model parameters. $g$ is the stochastic gradient computed on the sampled data. $\tilde{g}$ is the weighted quasi-gradient defined in Algorithm 1. $\hat{g}$ is the prox-correction term. We denote $\langle\cdot,\cdot\rangle$ as the inner product of two vectors and $\|\cdot\|$ as the Euclidean norm of a vector. Other symbols are detailed at their references. As in most FL frameworks, we consider minimizing the following finite-sum non-convex problem:
$$\min_x F(x) = \frac{1}{m}\sum_{i=1}^m F_i(x), \quad (1)$$
where $F: \mathbb{R}^d \to \mathbb{R}$, $F_i(x) := \mathbb{E}_{\varepsilon_i\sim\mathcal{D}_i} F_i(x,\varepsilon_i)$ is the objective function on client i, and $\varepsilon_i$ represents random data samples obeying the distribution $\mathcal{D}_i$. m is the total number of clients. In FL, $\mathcal{D}_i$ may differ across the local clients, which may introduce client drifts through the heterogeneous data.

Algorithm 1 FedSpeed Algorithm Framework
Input: model parameters $x^0$, total communication rounds T, local gradient controller $\hat{g}_i^{-1} = 0$, penalized weight $\lambda$.
Output: model parameters $x^T$.
 1: for t = 0, 1, 2, ..., T-1 do
 2:   select the active client set $\mathcal{S}^t$ at round t
 3:   for client $i \in \mathcal{S}^t$ in parallel do
 4:     communicate $x^t$ to local client i and set $x_{i,0}^t = x^t$
 5:     for k = 0, 1, 2, ..., K-1 do
 6:       sample a minibatch $\varepsilon_{i,k}^t$ and do
 7:       compute the unbiased stochastic gradient: $g_{i,k,1}^t = \nabla F_i(x_{i,k}^t; \varepsilon_{i,k}^t)$
 8:       update the extra step: $\breve{x}_{i,k}^t = x_{i,k}^t + \rho\, g_{i,k,1}^t$
 9:       compute the unbiased stochastic gradient: $g_{i,k,2}^t = \nabla F_i(\breve{x}_{i,k}^t; \varepsilon_{i,k}^t)$
10:       compute the quasi-gradient: $\tilde{g}_{i,k}^t = (1-\alpha)\, g_{i,k,1}^t + \alpha\, g_{i,k,2}^t$
11:       update the gradient descent step: $x_{i,k+1}^t = x_{i,k}^t - \eta_l\big(\tilde{g}_{i,k}^t - \hat{g}_i^{t-1} + \frac{1}{\lambda}(x_{i,k}^t - x^t)\big)$
12:     end for
13:     $\hat{g}_i^t = \hat{g}_i^{t-1} - \frac{1}{\lambda}(x_{i,K}^t - x^t)$
14:     communicate $\breve{x}_i^t = x_{i,K}^t - \lambda\hat{g}_i^t$ to the global server
15:   end for
16:   $x^{t+1} = \frac{1}{|\mathcal{S}^t|}\sum_{i\in\mathcal{S}^t} \breve{x}_i^t$
17: end for
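To make the control flow of Algorithm 1 concrete, the following is a minimal NumPy sketch of FedSpeed on a toy problem with quadratic local objectives $F_i(x) = \frac{1}{2}\|x - c_i\|^2$. The client count, hyper-parameter values, full participation, and full-batch gradients (standing in for the stochastic gradients $g$) are our simplifying assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heterogeneous quadratics: F_i(x) = 0.5 * ||x - c_i||^2, so grad F_i(x) = x - c_i;
# the global optimum of (1/m) * sum_i F_i is the mean of the centers c_i.
m, d = 4, 2
centers = rng.normal(size=(m, d))
grad = lambda i, x: x - centers[i]

eta_l, lam, rho, alpha, K = 0.05, 0.5, 0.1, 0.5, 20

def local_update(i, x_global, g_hat):
    """One client's K local steps (Algorithm 1, lines 5-14)."""
    x = x_global.copy()
    for _ in range(K):
        g1 = grad(i, x)                              # line 7: vanilla gradient
        g2 = grad(i, x + rho * g1)                   # lines 8-9: gradient after the ascent step
        g_tilde = (1 - alpha) * g1 + alpha * g2      # line 10: quasi-gradient
        # line 11: descent with prox-term and prox-correction term
        x = x - eta_l * (g_tilde - g_hat + (x - x_global) / lam)
    g_hat_new = g_hat - (x - x_global) / lam         # line 13: update prox-correction
    return x - lam * g_hat_new, g_hat_new            # line 14: amended parameters

x = np.zeros(d)
g_hats = [np.zeros(d) for _ in range(m)]
for t in range(200):                                 # full participation for simplicity
    sent = []
    for i in range(m):
        x_tilde, g_hats[i] = local_update(i, x, g_hats[i])
        sent.append(x_tilde)
    x = np.mean(sent, axis=0)                        # line 16: average aggregation

print(np.allclose(x, centers.mean(axis=0), atol=1e-4))  # → True
```

On this toy problem the iterate reaches the exact global optimum (the mean of the centers), illustrating that the prox-correction term removes the bias a plain prox-term would leave.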

3.1. FEDSPEED ALGORITHM

In this part, we introduce our proposed method, which alleviates the negative impact of heterogeneous data and reduces the number of communication rounds. We are inspired by the dynamic regularization (Durmus et al., 2021) for the local updates, which eliminates the client drifts as T approaches infinity. Our proposed FedSpeed is shown in Algorithm 1. At the beginning of each round t, a subset of clients $\mathcal{S}^t$ is required to participate in the current training process. The global server communicates the parameters $x^t$ to the active clients for local training. Each active local client performs three stages: (1) computing the unbiased stochastic gradient $g_{i,k,1}^t = \nabla F_i(x_{i,k}^t;\varepsilon_{i,k}^t)$ on a randomly sampled mini-batch $\varepsilon_{i,k}^t$ and executing a gradient ascent step in the neighbourhood to reach $\breve{x}_{i,k}^t$; (2) computing the unbiased stochastic gradient $g_{i,k,2}^t$ at $\breve{x}_{i,k}^t$ with the same sampled mini-batch as in (1), and merging $g_{i,k,1}^t$ with $g_{i,k,2}^t$ to introduce a basic perturbation into the vanilla descent direction; (3) executing the gradient descent step with the merged quasi-gradient $\tilde{g}_{i,k}^t$, the prox-term $\|x_{i,k}^t - x^t\|^2$, and the local prox-correction term $\hat{g}_i^{t-1}$. After K iterations of local training, the prox-correction term is updated as the weighted sum of the current local offset $(x_{i,K}^t - x_{i,0}^t)$ and the momentum of historical offsets. The client then communicates the amended model parameters $\breve{x}_i^t = x_{i,K}^t - \lambda\hat{g}_i^t$ to the global server for aggregation. On the global server, a simple average aggregation generates the global model parameters $x^{t+1}$ of round t. Prox-correction term. In general optimization, the prox-term $\|x_{i,k}^t - x^t\|^2$ is a penalty term for solving non-smooth problems, and it strengthens local consistency in the FL framework by introducing a penalized direction into the local updates, as proposed in Sahu et al. (2018).
However, as discussed in Hanzely & Richtárik (2020), it simply acts as a balance between the local and global solutions, and non-vanishing inconsistent biases still exist among the local solutions, i.e., the local solutions still deviate largely from each other, implying that local inconsistency is not eliminated, which limits the efficiency of the federated learning framework. To further strengthen local consistency, we utilize a prox-correction term $\hat{g}_i^t$, which can be considered a momentum of previous local offsets. According to the local update, we combine the $x_{i,k-1}^t$ term of the prox-term with the local state, with the weight $(1 - \frac{\eta_l}{\lambda})$ multiplied onto the basic local state. As shown in the local update of Algorithm 1 (Line 11), we have:
$$x_{i,K}^t - x^t = -\gamma\lambda\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t + \gamma\lambda\hat{g}_i^{t-1}, \quad (2)$$
where $\sum_{k=0}^{K-1}\gamma_k = \sum_{k=0}^{K-1}\frac{\eta_l}{\lambda}\big(1-\frac{\eta_l}{\lambda}\big)^{K-1-k} = \gamma$. Proof details are deferred to the Appendix. Letting $\hat{g}_i^{-1} = 0$, Equation (2) indicates that, when applying the prox-term, the local offset becomes an exponentially weighted average of the previous local gradients, and the form of the local offset is independent of the local learning rate $\eta_l$. This differs from vanilla SGD-based methods, e.g., FedAvg, which treat all local updates fairly. $\gamma_k$ changes the importance of the historical gradients: as K increases, earlier updates are weakened significantly by the exponential decay for $\eta_l < \lambda$. Thus, we apply the prox-correction term to balance the local offset. According to the iterative formula for $\hat{g}_i^t$ (Line 13 in Algorithm 1) and Equation (2), we can rewrite this update as:
$$\hat{g}_i^t = (1-\gamma)\hat{g}_i^{t-1} + \gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t, \quad (3)$$
where $\gamma$ and $\gamma_k$ are defined as in Equation (2). Proof details are deferred to the Appendix. Note that $\hat{g}_i^t$ acts as a momentum term over the historical local updates before round t, which can be considered an estimate of the local offset at round t.
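The closed form of the local offset in Equation (2) can be checked numerically by unrolling the recursion of Algorithm 1 (Line 11) with frozen quasi-gradients; the dimensions and random values below are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
eta_l, lam, K, d = 0.05, 0.5, 10, 3

g_tilde = rng.normal(size=(K, d))    # frozen quasi-gradients for one client and round
g_hat_prev = rng.normal(size=d)      # prox-correction term from the previous round

# Literal local recursion of Algorithm 1, Line 11.
x0 = rng.normal(size=d)
x = x0.copy()
for k in range(K):
    x = x - eta_l * (g_tilde[k] - g_hat_prev + (x - x0) / lam)

# Closed form: x_K - x_0 = -lam * sum_k gamma_k * g_tilde_k + gamma * lam * g_hat_prev,
# with gamma_k = (eta_l/lam) * (1 - eta_l/lam)^(K-1-k) and gamma = sum_k gamma_k.
gamma_k = (eta_l / lam) * (1 - eta_l / lam) ** (K - 1 - np.arange(K))
offset = -lam * (gamma_k[:, None] * g_tilde).sum(axis=0) + gamma_k.sum() * lam * g_hat_prev

print(np.allclose(x - x0, offset))   # → True
```

Because the recursion is linear, the identity holds exactly: each step shrinks the accumulated offset by $(1-\eta_l/\lambda)$, producing the exponential weights $\gamma_k$.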
At each local iteration k of round t, $\hat{g}_i^{t-1}$ provides a correction for the local update to balance the impact of the prox-term, enhancing the contribution of the descent steps executed early in each local stage. Note that $\hat{g}_i^{t-1}$ is different from the global momentum term mentioned in Wang et al. (2019), which aggregates the averaged local updates to improve generalization performance. After local training, the client updates the current information, subtracts the current $\hat{g}_i^t$ (scaled by $\lambda$) from the local model $x_{i,K}^t$ to counteract the prox-term's influence during the local stage, and finally sends the post-processed parameters $\breve{x}_i^t$ to the global server. Gradient perturbation. Gradient perturbations (Foret et al., 2020a; Mi et al.; Zhao et al., 2022; Zhong et al., 2022) significantly improve generalization for deep models. An extra gradient ascent step in the neighbourhood can effectively capture the curvature near the current parameters. Referring to the analysis in Zhao et al. (2022), we show that the quasi-gradient $\tilde{g}$, which merges the extra-ascent-step gradient with the vanilla gradient, can be approximated as penalizing a squared L2-norm of the gradient on the original function: solving for a stationary point of $\min_x\{F_i(x) + \beta\|\nabla F_i(x)\|^2\}$ on each local client searches for a flat minimum. Flatter loss landscapes further mitigate the local inconsistency caused by the averaged aggregation on the global server over heterogeneous datasets. Detailed discussions are deferred to the Appendix.
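Because the Taylor remainder behind this approximation vanishes for quadratics, the claimed equivalence between the merged quasi-gradient and the gradient of a gradient-norm-penalized objective can be checked exactly on a toy quadratic; the matrix, weights, and naming below are ours, chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
A_half = rng.normal(size=(4, 4))
A = A_half @ A_half.T                 # symmetric PSD Hessian of L_p(x) = 0.5 * x^T A x
x = rng.normal(size=4)

rho, beta = 0.05, 0.02                # ascent step size and penalty weight, 0 <= beta <= rho
alpha = beta / rho                    # the merge weight is the ratio beta / rho
grad_Lp = lambda v: A @ v             # gradient of L_p

# Exact gradient of the penalized objective L = L_p + (beta/2) * ||grad L_p||^2.
exact = grad_Lp(x) + beta * A @ grad_Lp(x)

# Quasi-gradient: merge of the vanilla gradient and the gradient after an ascent step.
quasi = (1 - alpha) * grad_Lp(x) + alpha * grad_Lp(x + rho * grad_Lp(x))

print(np.allclose(exact, quasi))      # → True (exact for quadratics)
```

For non-quadratic objectives the match is only approximate, with an error that shrinks as the ascent step ρ shrinks.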

4. CONVERGENCE ANALYSIS

In this part we demonstrate the theoretical analysis of our proposed FedSpeed and illustrate the convergence guarantees under specific hyperparameters. Due to space limitations, more details are deferred to the Appendix. Some standard assumptions are stated as follows.

Assumption 4.1 (L-Smoothness). The non-convex function $F_i$ is L-smooth for all $i \in [m]$, i.e., $\|\nabla F_i(x) - \nabla F_i(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

Assumptions 4.2 and 4.3 are the standard bounded local gradient variance and bounded global heterogeneity conditions, with bounds $\sigma_l^2$ and $\sigma_g^2$ respectively, commonly adopted in previous works (Reddi et al., 2020; Yang et al., 2021; Xu et al., 2021; Wang et al., 2021; Karimi et al., 2021). Our theoretical analysis depends on the above assumptions to explore the comprehensive properties of the local training process.

Proof sketch. To expose the essential structure of the updates in Algorithm 1, we introduce two auxiliary sequences. Let $u^t = \frac{1}{m}\sum_{i\in[m]} x_{i,K}^t$ be the mean of the last-iterate parameters of local training among the local clients. Based on $\{u^t\}$, we introduce the auxiliary sequence $\{z^t = u^t + \frac{1-\gamma}{\gamma}(u^t - u^{t-1})\}_{t>0}$. Combining the local update with Equation (3):
$$u^{t+1} = u^t - \lambda\,\frac{1}{m}\sum_{i\in[m]}\Big(\gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t + (1-\gamma)\hat{g}_i^{t-1}\Big). \quad (4)$$
If we introduce K virtual states $u_{i,k}^t$, this can be considered a momentum-based update with the prox-correction term $\hat{g}_i^{t-1}$ and coefficient $\gamma$, where the prox-correction term $\hat{g}_i^t = -\frac{1}{\lambda}(u_{i,K}^t - u_{i,0}^t)$ captures the global update direction in the local training process. Then $z^t$ updates as:
$$z^{t+1} = z^t - \lambda\,\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t. \quad (5)$$
Detailed proofs are deferred to the Appendix. After mapping $x^t$ to $u^t$, the local update can be considered a client-momentum-like method with normalized weights parameterized by $\gamma_k$. Further, after mapping $u^t$ to $z^t$, the entire update process simplifies to an SGD-type method with the quasi-gradients $\tilde{g}$. $z^t$ absorbs the penalized prox-term over the whole local training stage.
Though a prox-correction term is applied to eliminate the local biases, $x^t$ still benefits from the update of the penalized prox-term; the prox-correction term plays the role of an exponential average of the global offset. We now state the convergence rate of the FedSpeed algorithm:

Theorem 4.4 Under Assumptions 4.1-4.3, when the perturbation learning rate satisfies $\rho \le \frac{1}{\sqrt{6}\alpha L}$, the local learning rate satisfies $\eta_l \le \min\{\frac{1}{32\sqrt{3}KL}, 2\lambda\}$, and the local interval satisfies $K \ge \lambda/\eta_l$, let $\kappa = \frac{1}{2} - 3\alpha^2 L^2\rho^2 - 1536\eta_l^2 L^2 K$ be a positive constant under properly selected $\eta_l$ and $\rho$; then the auxiliary sequence $\{z^t\}$ in Equation (5) generated by executing Algorithm 1 satisfies:
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla F(z^t)\|^2 \le \frac{2(F(z^1) - F^*)}{\lambda\kappa T} + \frac{64\eta_l L^2 K}{\kappa m T}\sum_{i\in[m]}\mathbb{E}\|\hat{g}_i^0\|^2 + \frac{32\lambda^2 L^2}{\kappa T}\,\mathbb{E}\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^0\Big\|^2 + \Phi, \quad (6)$$
where F is the non-convex objective function and $F^*$ is the optimum of F. The term $\Phi$ is:
$$\Phi = \frac{1}{\kappa}\Big(32\lambda\eta_l^2 L^2 K(16\sigma_g^2 + \sigma_l^2) + \lambda\alpha^2 L^2\rho^2(3\sigma_g^2 + \sigma_l^2)\Big), \quad (7)$$
where $\alpha$ is the perturbation weight. More proof details are deferred to the Appendix.

Remark 4.6 Compared with other prox-based works, e.g., (Durmus et al., 2021; Wang et al., 2022; Gong et al., 2022), their proofs rely on the harsh assumption that each local client must approach an exact or $\epsilon$-inexact stationary point during local training at every round, which cannot be strictly satisfied in a practical federated learning framework under the current theoretical analysis of the last-iterate point in the non-convex case. We relax this assumption by enlarging the local interval and prove that federated prox-based methods can also achieve a convergence rate of $O(1/T)$.

Remark 4.7 Compared with other current methods, FedSpeed improves the convergence rate by increasing the local interval K, which is a desirable property for the practical federated learning framework.
For the analysis of FedAvg, we refer to Yang et al. (2021).

Figure 1: In each group, the left shows the performance on the IID dataset while the right shows the performance on the non-IID dataset, which is split by setting the heterogeneity weight of the Dirichlet distribution to 0.6.

5. EXPERIMENTS

In this part, we first introduce our experimental setups. We then present the convergence and generalization performance in Section 5.2, and study ablation experiments in Section 5.3.

5.1. SETUP

Dataset and backbones. We run the experiments on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and TinyImagenet. Due to space limitations, we introduce these datasets in the Appendix. We follow Hsu et al. (2019) to introduce heterogeneity by splitting the total dataset with label ratios sampled from the Dirichlet distribution. We train and test on the standard ResNet-18 (He et al., 2016) backbone with a 7×7 filter in the first convolution layer and with BN layers replaced by GN (Wu & He, 2018; Hsieh et al., 2020) to avoid invalid aggregation. Implementation details. We select each hyper-parameter within an appropriate range and report the best-performing combinations. To fairly compare the baseline methods, we fix most hyper-parameters for all methods under the same setting. For 10% participation of 100 total clients, we set the initial local learning rate to 0.1 and the global learning rate to 1.0 for all methods, except FedAdam, which uses 0.1 on the global server. The learning rate decays by a factor of 0.998 per communication round, except for FedDyn, FedADMM and FedSpeed, which use 0.9995. Each active local client trains 5 epochs with batch size 50. Weight decay is set to 1e-3 for all methods. The weight of the prox-term in FedProx, FedDyn, FedADMM and FedSpeed is set to 0.1. For 2% participation, the learning rate decay is adjusted to 0.9998 for FedDyn and FedSpeed, each active client trains 2 epochs with batch size 20, and the prox-term weight is set to 0.001. The other hyper-parameters specific to each method are introduced in the Appendix. CIFAR-10. Figure 1 (a) shows the results of 10% participation of 100 total clients. On the IID split, FedSpeed achieves 88.5%, 6.1% ahead of FedAvg. FedDyn suffers from instability when the learning rate is small, a phenomenon similar to that mentioned in Xu et al. (2021).
When heterogeneity is introduced, FedAdam clearly suffers from the increased variance, its accuracy dropping from 85.7% to 83.2%. Figure 1 (b) shows the impact of reducing participation. FedAdam is only lightly affected by this change, while the performance degradation of SCAFFOLD is significant, dropping from 85.3% to 80.1%. CIFAR-100 & TinyImagenet. As shown in Figure 1 (c) and (d), FedSpeed performs robustly on CIFAR-100 and TinyImagenet in the low-participation setting and achieves approximately 1.6% and 1.8% improvement over FedCM respectively. As the participation becomes very low, the impact of the heterogeneous data gradually weakens, yielding similar test accuracies. SCAFFOLD remains greatly affected by the low participation ratio, dropping about 3.3% below FedAdam. FedCM converges fast at the beginning of training owing to its strong consistency constraints. FedSpeed updates its prox-correction term estimate within several rounds, then converges faster and outperforms the other methods. Table 1 shows the accuracy under the low participation ratio of 2%. Our proposed FedSpeed outperforms the baselines on each dataset under both IID and non-IID settings. We observe results similar to those reported in Reddi et al. (2020); Xu et al. (2021). FedAdam and FedCM maintain low inconsistency in the local training stage and achieve robustly better performance than the others, while FedDyn is greatly affected by the number of training samples in the dataset and is sensitive to the partial participation ratio. Large local interval for the prox-term. From the IID case to the non-IID case, the heterogeneous dataset introduces local inconsistency and leads to severe client-drift problems.
Almost all the baselines suffer from this performance degradation. High local consistency usually supports a large interval, owing to bounded updates and limited offsets. Applying the prox-term guarantees local consistency, but it also has a negative impact on local training toward the target weighted between the local optimum and the global server model. FedDyn and FedADMM successfully apply the primal-dual method to alleviate this influence, as they change the local objective function whose target is reformed by a dual variable. These methods can mitigate the local offsets caused by the prox-term, and they improve about 3% over FedProx on CIFAR-10. However, the primal-dual method requires a local ϵ-close solution. An interesting experimental phenomenon is that the performance of SCAFFOLD gradually degrades under low participation ratios. Note that under the 10% participation case, SCAFFOLD performs as well as FedCM. It benefits from applying a global gradient estimation to correct the local updates, which weakens the client-drifts via a quasi-gradient toward the global optimum. In practice the estimation variance is related to the participation ratio, which means its efficiency relies on a sufficient number of participating clients. When the participation ratio becomes extremely low, the performance is greatly affected by the large biases in local training.

5.3. HYPERPARAMETERS SENSITIVITY

Local interval K. To explore the acceleration in T obtained by applying a large interval K, we fix the total number of training epochs E. Note that K counts iterations while E counts epochs. A larger local interval can theoretically accelerate convergence in many previous works, e.g., SCAFFOLD and FedAdam, while the empirical results are usually unsatisfactory. As shown in Figure 2, for FedAdam and FedCM, when K increases from 1 to 20, the accuracy drops by about 13.7% and 10.6% respectively. SCAFFOLD is only lightly affected, but its overall performance is much lower. In Figure 2 (d), FedSpeed applies a larger E to reduce the communication rounds T, both in the theoretical proofs and in the empirical results, and its accuracy fluctuates only slightly, within 3.8%. Learning rate ρ for gradient perturbation. In the simplified analysis, ρ can be selected as any proper value without affecting the convergence complexity; note that if α ̸= 0, ρ can be selected independently of $\eta_l$. To achieve better performance, we apply the ascent learning rate $\rho = \rho_0/\|\nabla F_i\|$ in the experiments, where $\rho_0$ is a constant selected from Table 2. This choice of ρ is consistent with sharpness-aware minimization (Foret et al., 2020b), which searches for a flat local minimum. Table 2 shows the performance with different $\rho_0$ on CIFAR-10 after 500 communication rounds under 10% participation of 100 total clients. Due to space limitations, more details are deferred to the Appendix.
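The normalized ascent learning rate $\rho = \rho_0/\|\nabla F_i\|$ can be sketched as below; the function name and toy gradient oracle are ours, and this mirrors the sharpness-aware-minimization-style normalization rather than reproducing the exact training implementation.

```python
import numpy as np

def perturbed_quasi_grad(grad_fn, x, rho0=0.1, alpha=0.5, eps=1e-12):
    """Quasi-gradient with the normalized ascent radius rho = rho0 / ||grad F_i(x)||,
    so the ascent step has a fixed length rho0 regardless of the gradient scale."""
    g1 = grad_fn(x)
    rho = rho0 / (np.linalg.norm(g1) + eps)   # adaptive ascent learning rate
    g2 = grad_fn(x + rho * g1)                # gradient at the perturbed point
    return (1 - alpha) * g1 + alpha * g2

# Example on F(x) = 0.5 * ||x||^2, where grad F(x) = x and ||x|| = 5 here:
g = perturbed_quasi_grad(lambda v: v, np.array([3.0, 4.0]))
print(np.round(g, 4))                         # → [3.03 4.04]
```

With the normalization, the perturbation length is $\rho_0$ everywhere, so the neighborhood probed by the ascent step does not shrink or blow up as the gradient magnitude changes during training.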

6. CONCLUSION

In this paper, we propose a novel and practical federated method, FedSpeed, which applies a prox-correction term to neutralize the bias due to the prox-term in each local training stage, and utilizes a perturbation gradient weighted by an extra gradient ascent step to improve the local generalization performance. We provide a theoretical analysis to guarantee its convergence and prove that FedSpeed benefits from a larger local interval K to achieve a fast convergence rate of $O(1/T)$ without any other harsh assumptions. We also conduct extensive experiments to highlight the significant improvement and efficiency of our proposed FedSpeed, which is consistent with the properties of our analysis. This work inspires FL framework design to focus on local consistency and higher local generalization performance in order to implement efficient federated learning methods.

In the appendix, we introduce the gradient perturbation, the proofs of the major equations and the theorem, and some extra experiments. In Section A, we provide an explanation for understanding the gradient perturbation step in our proposed FedSpeed. In Section C, we provide the full proofs of the major equations in the text, the main lemmas and the theorem. In Section B, we provide the details of the implementation of the experiments, including the setups, datasets, hyper-parameters and some extra experiments.

A GRADIENT PERTURBATION

A.1 UNDERSTANDING OF GRADIENT PERTURBATION

We propose the gradient perturbation in the local training stage instead of the traditional stochastic gradient: it merges an extra gradient ascent step with the vanilla gradient via a hyper-parameter α, where the ascent step approximates the worst point in the neighbourhood. This has been studied in many previous works, e.g., in the extra-gradient form and in sharpness-aware minimization. In our studies, we perform an extra gradient ascent step instead of the descent step of the extra-gradient method. It can also be considered a variant of sharpness-aware minimization that weight-averages the ascent-step gradient and the vanilla gradient, instead of using the normalized gradient. Here we illustrate the implicit meaning of the quasi-gradient $\tilde{g}$ in our proposed FedSpeed and explain its positive effect on local training from the perspective of objective functions. First, consider minimizing a non-convex problem $L_p(x)$. To approach a stationary point of $L_p$, we can simply introduce a penalized gradient term as an extra loss in $L_p$, which is to solve the problem
$$\min_x \Big\{L(x) \triangleq L_p(x) + \frac{\beta}{2}\|\nabla L_p(x)\|^2\Big\}.$$
The final optimization target is consistent with the vanilla target, while the penalized gradient term empirically approaches a flatter minimum. We compute the gradient as follows:
$$\nabla L(x) = \nabla L_p(x) + \frac{\beta}{2}\nabla\|\nabla L_p(x)\|^2 = \nabla L_p(x) + \beta\nabla^2 L_p(x)\nabla L_p(x). \quad (8)$$
The update in Equation (8) contains second-order Hessian information, which involves a huge number of parameters to compute. To further simplify the update, we consider an approximation of the gradient. We expand the function $L_p$ via Taylor expansion as:
$$L_p(x + \Delta) = L_p(x) + \nabla L_p(x)^\top\Delta + \frac{1}{2}\Delta^\top\nabla^2 L_p(x)\Delta + R_\Delta,$$
where $R_\Delta = o(\|\Delta\|^2)$ is infinitesimal relative to $\|\Delta\|^2$ and is directly omitted in our approximation. Thus we have the gradient at $x + \Delta$:
$$\nabla L_p(x + \Delta) \approx \nabla L_p(x) + \nabla^2 L_p(x)\Delta. \quad (9)$$
We set the ∆ = ρ∇L p (x) and then we have: ∇ 2 L p (x)∇L p (x) ≈ 1 ρ ∇L p x + ρ∇L p (x) -∇L p (x) . Thus we connect Equation (8) and Equation ( 9), we have: ∇L(x) = ∇L p (x) + β∇ 2 L p (x) • ∇L p (x) ≈ ∇L p (x) + β ρ ∇L p x + ρ∇L p (x) -∇L p (x) = 1 - β ρ ∇L p (x) + β ρ ∇L p x + ρ∇L p (x) = (1 -α)∇L p (x) + α∇L p x + ρ∇L p (x) . Here we can see that the balance weight α in our proposed method is actually the ratio of the gradient penalized weight β and the gradient ascent step size ρ. To fix the step size ρ, increasing α means increasing the gradient penalized weight β, which facilitates searching for a flatten stationary point to improve the generalization performance. While the second term of ∇L(x) can not be directly computed for its nested form, we approximate the second term with the chain rule as follows: ∇L p x + ρ∇L p (x) ≈ ∇L p (θ)| θ=x+ρ∇Lp(x) . Finally we have: ∇L(x) ≈ (1 -α)∇L p (x) + α∇L p (θ)| θ=x+ρ∇Lp(x) . (10) The Equation ( 10) provides an understanding for the weighted quasi gradient g on the local training stage in our proposed FedSpeed. We select an appropriate 0 ≤ β ≤ ρ to satisfy the update of perturbation gradient. It executes a gradient ascent step firstly with the step size ρ to x. Then it generates the stochastic gradient by the same sampled mini-batch data as the ascent step at x. The quasi-gradient is merged as Equation ( 10) to execute the gradient descent step. This is just a simple approximation for the gradient perturbation to help for understanding the implicit of the quasi-gradient and its performance in the training stage. Actually the error of the approximation depends a lot on ρ. The smaller ρ, the higher the accuracy of this estimation, but the smaller ρ, the less efficient the optimizer performs. Similar understanding can be referred in the (Qu et al., 2022; Caldarola et al., 2022; Andriushchenko & Flammarion, 2022) . Dataset Partitions. To fairly compare with the other baselines, we follow the Hsu et al. 
(2019) to introduce heterogeneity by splitting the total dataset, sampling the label ratios from a Dirichlet distribution. An additional parameter is used to control the level of heterogeneity of the entire data partition. To visualize the distribution of the heterogeneous data, we plot heat maps of the label distribution on the different datasets, as shown in Figure 3. Since the heat map of 500 clients cannot be displayed clearly, we show the 100-client case. It can be seen that when the heterogeneity weight equals 0.6, about 10% to 20% of the categories dominate each client, shown as the white blocks in Figure 3. The IID dataset is evenly distributed across clients.
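The partition scheme described above can be sketched as follows (a minimal illustration of the Hsu et al. (2019)-style Dirichlet split; `dirichlet_partition` and its arguments are our own naming, not the paper's released code):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Assign sample indices to clients: for each class, draw client
    proportions from Dirichlet(beta).  Smaller beta -> stronger
    heterogeneity; large beta approaches an IID split."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        p = rng.dirichlet(beta * np.ones(num_clients))   # class-c ratios
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx

# e.g. a toy 10-class label vector split over 5 clients with beta = 0.6
parts = dirichlet_partition(np.repeat(np.arange(10), 100), 5, beta=0.6)
assert sum(len(p) for p in parts) == 1000   # every sample assigned once
```

With a small `beta`, a few classes dominate each client, matching the white-block pattern described for the heat maps.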

B.2 EXPERIMENTS

B.3 HYPER-PARAMETERS

Hyper-parameters Selections. We fix the local learning rate as 0.1 and the global learning rate as 1.0 for averaging, except for FedAdam, which uses 0.1. The penalized weight of the prox-term in FedProx, FedDyn, FedADMM and FedSpeed is selected from [0.001, 0.01, 0.1, 0.5]. The learning rate decay is fixed as 0.998, except for FedDyn, FedADMM and FedSpeed, where it is selected from [0.998, 0.999, 0.9995, 0.99995]. The perturbation weight is selected from [0, 0.5, 0.75, 0.875, 0.9375, 1]. The batch size is selected from [20, 50]. The local interval K is selected from [1, 2, 5, 10, 20]. For the specific parameters in FedAdam, the momentum weight is set as 0.1 and the second-order momentum weight is set as 0.01. The minimal value is set as 0.001 to prevent division by 0. The client-level momentum weight of FedCM is set as 0.1. Here we briefly introduce the selection of the hyper-parameters in FedSpeed. (1) $\eta_l$ is the learning rate, a basic hyper-parameter in deep learning; we do not fine-tune it, for a fair comparison in the experiments, and simply adopt the same common settings as the previous works mentioned. (2) $\lambda$ is the coefficient of the prox-term, which was proposed in FedProx and is widely adopted by many prox-based federated methods in both personalized and centralized FL. The selection of this hyper-parameter has been studied in many previous works which verify its efficiency. Usually $\lambda$ is selected in {10, 100} on the CIFAR-10/100 datasets, and we verify that this also works on TinyImagenet. (3) $\rho$ is the ascent step learning rate. As in many extra gradient methods, the selection of $\rho$ is usually related to the local learning rate $\eta_l$. In order not to unduly affect the performance of the gradient descent, the learning rate $\rho$ for the extra gradient step is usually set not much larger than the learning rate $\eta_l$ for the gradient descent step.
Obviously, if $\rho$ is set very small, the updated state of the extra gradient step will be very limited, which makes the operation ineffective. Therefore, the selection of $\rho$ usually matches that of $\eta_l$. In our experiments, $\eta_l$ is set as 0.1, a common choice in previous works. We test $\rho$ in {0, 0.01, 0.05, 0.1, 0.2}, which corresponds to {no extra gradient, $0.1\eta_l$, $0.5\eta_l$, $1\eta_l$, $2\eta_l$}. The best-performing selection is $\rho = 1\eta_l$ on CIFAR-10 (details in Section 5.3, paragraph "Learning rate $\rho$ for the gradient perturbation"). We also test this selection on CIFAR-100 and TinyImagenet, and it also works well. We recommend keeping the selection of $\rho$ comparable to the learning rate $\eta_l$. (4) $\alpha$ is the ratio for merging the gradient of the extra ascent step. In FedSpeed, $\alpha$ is in the range [0, 1]. Likewise, if $\alpha$ is set very small, the gradient of the ascent step is barely merged. In our experiments, we test $\alpha$ in {0, 0.5, 0.75, 0.875, 0.9375, 1.0}. The best-performing selection is $\alpha = 0.9375$ on CIFAR-10; in fact $\alpha = 1$ also works well (details in Section 5.3, paragraph "Perturbation weight $\alpha$"). Thus, we recommend setting $\alpha$ close to 1.0, e.g. 0.9, 0.99 or 1.0. This also verifies the improvement brought by the ascent steps.
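The interaction between $\rho$ and $\alpha$ can be seen on a two-dimensional quadratic (a minimal sketch; the toy loss and all names here are our own, chosen so gradients are exact, with `grad` playing the role of $\nabla F_i$):

```python
import numpy as np

# Toy quadratic loss L(x) = 0.5 * x^T A x, so grad(x) = A x exactly.
A = np.diag([1.0, 10.0])           # one flat and one sharp direction

def grad(x):
    return A @ x

def quasi_gradient(x, rho=0.1, alpha=0.9):
    """g = (1 - alpha) * grad(x) + alpha * grad(x + rho * grad(x))."""
    g1 = grad(x)                   # vanilla gradient
    g2 = grad(x + rho * g1)        # gradient after one ascent step of size rho
    return (1 - alpha) * g1 + alpha * g2

x = np.array([1.0, 1.0])
g = quasi_gradient(x)
# For a quadratic this equals (I + alpha*rho*A) A x: the sharp direction
# (eigenvalue 10) is penalized more, matching the gradient-penalty view.
assert np.allclose(g, (np.eye(2) + 0.9 * 0.1 * A) @ (A @ x))
```

Increasing either $\alpha$ or $\rho$ scales up the curvature-dependent component $\alpha\rho A^2 x$, which also illustrates why an overly large $\rho$ degrades the first-order approximation.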

B.3.1 BEST PERFORMING HYPER-PARAMETERS.

For a fair comparison, the learning rate is fixed for all methods. For the CIFAR-10 dataset, we select the batch size as 50 for 100 clients and 20 for 500 clients. The total dataset contains 50,000 images, so there are 100 images per client in the 500-client setting. We select the local interval K as 5. The prox-term weight is selected as 0.1. The learning rate decay is selected as 0.9995 for the prox-term based methods. We train the total dataset for 1,500 communication rounds. For the CIFAR-100 dataset, we select 500 clients with a 2% participation ratio in the experiments, and fine-tune each hyper-parameter a little accordingly. The batch size is selected as 20 to avoid too few iterations per local epoch. The number of local epochs is set as 2 for the final comparison. The ablation study on the local interval K indicates that our proposed FedSpeed outperforms the other methods significantly when K is large; thus, to compare the performance more clearly, we select 2 local epochs. We decay the prox-term weight to 0.01 for the prox-term based methods. The learning rate decay is selected as 0.99995 for the prox-based methods. We train for 1,500 rounds and then test the performance. For the TinyImagenet dataset, most selections are the same as for CIFAR-100. The prox-term weight is selected as 0.1 and the learning rate decay is selected as 0.9995. A total of 3,000 communication rounds are run in the training stage.

B.3.2 SPEED COMPARISON.

Table 4 shows the communication rounds required to achieve the target test accuracy. At the beginning of training, FedCM performs faster than the others and usually achieves high accuracy at the end. FedSpeed is faster in the middle and late stages of training. We bold the top-2 results in each test; in general, FedCM and FedSpeed perform significantly better in training speed. Figure 4 shows the performance of different learning rate decays and prox-term weights for FedSpeed. We test the time on an A100-SXM4-40GB GPU and show the performance in setups with heterogeneity stronger than Dir-0.6. As the heterogeneity becomes stronger, FedSpeed can still maintain a stable generalization performance. The correction term helps to correct the biases during local training, while the gradient perturbation term helps to resist local over-fitting on the heterogeneous dataset. FedSpeed benefits from avoiding falling into the biased optima. From a practical training point of view, compared with the vanilla FedAvg, FedSpeed adds three main modules: (1) the prox-term, (2) the prox-correction term, and (3) the gradient perturbation. We test the performance after 500 communication rounds of the different combinations of the modules above on CIFAR-10, with a 10% participation ratio among 100 total clients. Table B.3.5 shows their performance.

B.3.5 ABLATION STUDIES

From the table above, we can clearly see the contribution of the different modules. The prox-term was proposed in FedProx, but due to the issues we point out in our paper, this term alone can also have a negative impact on performance in FL. When the prox-correction term is introduced, it improves the performance from 82.24% to 83.94%. When the gradient perturbation is introduced, it improves the performance from 82.24% to 83.88%. FedSpeed applies them together and achieves a 3.46% improvement.

Different performance of these modules:

As introduced in our paper, the prox-term simply acts as a balance between the local and global solutions, and there still exist non-vanishing inconsistent biases among the local solutions, i.e., the local solutions still deviate largely from each other, implying that local inconsistency is not eliminated. Thus we utilize the prox-correction term to correct the inconsistent biases during local training. For the function of the gradient perturbation, we refer to the theoretical explanation in the main text, whose proof is provided in the supplementary material due to space limitations. This perturbation is similar to adding a penalized gradient term to the objective function during the local optimization process. The additional penalty brings better properties to the local state, e.g. flatter minima and smoothness. For federated learning, the smoother the local minima are, the flatter the model merged on the server will be. FedSpeed benefits from these two modules to improve performance and achieves SOTA results.
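A compact sketch of how the three modules fit into one client's local round (the function and variable names are ours, not the authors' reference implementation; `grad_fn` stands in for the client's stochastic gradient oracle):

```python
import numpy as np

def fedspeed_local_update(x0, grad_fn, g_hat, K=5, eta=0.1, lam=0.1,
                          rho=0.1, alpha=0.9375):
    """One client's local round, combining the three modules above:
      (1) prox-term (x - x0) / lam pulling iterates back to the round's
          initial point x0,
      (2) prox-correction term g_hat correcting the prox-term bias,
      (3) gradient perturbation merging an extra ascent-step gradient."""
    x = x0.copy()
    for _ in range(K):
        g1 = grad_fn(x)                       # vanilla gradient
        g2 = grad_fn(x + rho * g1)            # gradient after ascent step
        g = (1 - alpha) * g1 + alpha * g2     # quasi-gradient
        x = x - eta * (g - g_hat + (x - x0) / lam)
    g_hat_new = g_hat - (x - x0) / lam        # prox-correction update
    return x, g_hat_new

# toy run on a quadratic loss 0.5*||x||^2, whose gradient is x itself
x0 = np.ones(2)
x, g_hat_new = fedspeed_local_update(x0, lambda v: v, np.zeros(2))
```

On this toy loss the iterates move from `x0` toward the minimizer while the prox-term keeps them near `x0`; the returned `g_hat_new` then biases the next round against that residual pull, which is exactly the role ascribed to the prox-correction term above.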

B.3.6 LOSS CURVE COMPARISON

According to Figure 5, in (a) and (b) we can see that the common FedAvg method fails to resist local over-fitting and does not reach a stable state during training, while FedSpeed converges stably and efficiently, far faster than the other baselines. (c) and (d) show empirical studies of increasing the local interval. FedCM is the second-best-performing method among our baselines, yet it still cannot speed up by increasing the local interval in practical training. As shown in (c), increasing the local interval brings almost no benefit to FedCM, and the communication rounds cannot be reduced. In (d), FedSpeed succeeds in applying a larger local interval to reduce the communication rounds: when K increases, the total communication rounds needed to achieve similar performance decrease nearly by a factor of K, a property that is very useful in practical training. In our paper, what we claim is that it is a good property if a federated learning method can adopt a large K. Unfortunately, most SGD-type algorithms cannot increase the convergence rate by increasing K. Some useful techniques are adopted in the framework to improve the performance, e.g. variance reduction, gradient tracking and regularization terms (mainly the prox-based methods). FedSpeed is a prox-based method which incorporates the correction term and the extra gradient ascent step to improve performance. In fact, it has been proven that prox-based methods have the potential to apply a larger local interval in local training, under the requirement of reaching a local minimal solution per communication round. We theoretically prove that FedSpeed achieves the fast rate without this harsh assumption and that it can apply a large K on the local clients. Yang et al.
(2021) have proven that if FedAvg changes from partial participation to full participation (Local-SGD-type), the dominant term of the convergence rate changes from $O(\frac{\sqrt{K}}{\sqrt{nT}})$ to $O(\frac{1}{\sqrt{mKT}})$, which is relaxed to be $K$ times faster. Full participation usually achieves a higher theoretical rate than partial participation. VRL-SGD Liang et al. (2019)

Perturbation weight $\alpha$. $\alpha$ determines the degree of influence of the perturbation gradient term on the vanilla stochastic gradient in the local training stage. It is a trade-off weight that balances the ratio of the perturbation term. We select $\alpha$ from 0 to 1 and find that FedSpeed can converge with any $\alpha \in [0, 1]$. The theoretical analysis demonstrates that applying $\alpha > 0$ in the term $\Phi$ does not introduce extra orders, and the experimental results shown in Table 8 indicate that the generalization performance improves as $\alpha$ increases.

C PROOFS FOR ANALYSIS

In this part we demonstrate the proofs of all the formulas mentioned in this paper. Each formula is presented in the form of a lemma.

C.1 PROOF OF EQUATION (2)

Equation (2) shows the update over the whole local training stage.

Lemma C.1 For any $x_{i,k}^t \in \mathbb{R}^d$ and $i \in \mathcal{S}^t$, denote $\delta_{i,k}^t = x_{i,k}^t - x_{i,k-1}^t$ with $\delta_{i,0}^t = 0$, and $\Delta_{i,K}^t = \sum_{k=0}^{K} \delta_{i,k}^t = x_{i,K}^t - x_{i,0}^t$. Under the update rule in Algorithm 1, we have:
$$\Delta_{i,K}^t = -\lambda\gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t + \gamma\lambda\hat{g}_i^{t-1},$$
where $\gamma_k = \frac{\eta_l}{\lambda}\big(1-\frac{\eta_l}{\lambda}\big)^{K-1-k}$ and $\sum_{k=0}^{K-1}\gamma_k = \gamma = 1-\big(1-\frac{\eta_l}{\lambda}\big)^K$.

Proof 1 According to the update rule of Line 11 in Algorithm 1, we have:
$$\delta_{i,k}^t = x_{i,k}^t - x_{i,k-1}^t = -\eta_l\Big(\tilde{g}_{i,k-1}^t - \hat{g}_i^{t-1} + \frac{1}{\lambda}(x_{i,k-1}^t - x_{i,0}^t)\Big) = -\eta_l\Big(\tilde{g}_{i,k-1}^t - \hat{g}_i^{t-1} + \frac{1}{\lambda}\Delta_{i,k-1}^t\Big).$$
Then we can formulate the iterative relationship of $\Delta_{i,k}^t$ as:
$$\Delta_{i,k}^t = \Delta_{i,k-1}^t - \eta_l\Big(\tilde{g}_{i,k-1}^t - \hat{g}_i^{t-1} + \frac{1}{\lambda}\Delta_{i,k-1}^t\Big) = \Big(1-\frac{\eta_l}{\lambda}\Big)\Delta_{i,k-1}^t - \eta_l\big(\tilde{g}_{i,k-1}^t - \hat{g}_i^{t-1}\big).$$
Unrolling the iteration over $k$, we have:
$$\begin{aligned}
x_{i,K}^t - x_{i,0}^t = \Delta_{i,K}^t &= \Big(1-\frac{\eta_l}{\lambda}\Big)^K\Delta_{i,0}^t - \eta_l\sum_{k=0}^{K-1}\Big(1-\frac{\eta_l}{\lambda}\Big)^{K-1-k}\big(\tilde{g}_{i,k}^t - \hat{g}_i^{t-1}\big)\\
&\overset{(a)}{=} -\eta_l\sum_{k=0}^{K-1}\Big(1-\frac{\eta_l}{\lambda}\Big)^{K-1-k}\big(\tilde{g}_{i,k}^t - \hat{g}_i^{t-1}\big)\\
&= -\lambda\sum_{k=0}^{K-1}\frac{\eta_l}{\lambda}\Big(1-\frac{\eta_l}{\lambda}\Big)^{K-1-k}\tilde{g}_{i,k}^t + \Big(1-\Big(1-\frac{\eta_l}{\lambda}\Big)^K\Big)\lambda\hat{g}_i^{t-1}\\
&= -\lambda\gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t + \gamma\lambda\hat{g}_i^{t-1}.
\end{aligned}$$
(a) applies $\Delta_{i,0}^t = \delta_{i,0}^t = 0$.
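As a numerical sanity check, the recursion and the closed form in Lemma C.1 (and the prox-correction update it implies in Lemma C.2 below) can be verified with arbitrary stand-in values for the quasi-gradients:

```python
import numpy as np

# Check Lemma C.1: Delta_K = -lam * sum_k gamma_k * g_k + gamma * lam * g_hat
rng = np.random.default_rng(0)
eta, lam, K, d = 0.1, 0.5, 5, 3
g = rng.standard_normal((K, d))        # quasi-gradients g~_{i,k}, k = 0..K-1
g_hat = rng.standard_normal(d)         # prox-correction term g-hat_i^{t-1}

# Unroll the recursion: Delta_k = (1 - eta/lam) Delta_{k-1} - eta (g_k - g_hat)
delta = np.zeros(d)
for k in range(K):
    delta = (1 - eta / lam) * delta - eta * (g[k] - g_hat)

# Closed form from the lemma
gamma_k = (eta / lam) * (1 - eta / lam) ** (K - 1 - np.arange(K))
gamma = gamma_k.sum()
assert np.isclose(gamma, 1 - (1 - eta / lam) ** K)   # sum of the weights
closed = -lam * (gamma_k @ g) + gamma * lam * g_hat
assert np.allclose(delta, closed)

# Lemma C.2: g-hat^t = g-hat^{t-1} - Delta/lam equals the convex combination
assert np.allclose(g_hat - delta / lam,
                   (1 - gamma) * g_hat + gamma_k @ g)
```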

C.2 PROOF OF EQUATION (3)

Equation (3) shows the update of the prox-correction term, which utilizes a weighted sum of the previous local offsets as a bias controller for eliminating the non-vanishing bias resulting from the prox-term.

Lemma C.2 Under the update rule in Algorithm 1, we have:
$$\hat{g}_i^t = (1-\gamma)\hat{g}_i^{t-1} + \gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t,$$
where $\sum_{k=0}^{K-1}\gamma_k = \sum_{k=0}^{K-1}\frac{\eta_l}{\lambda}\big(1-\frac{\eta_l}{\lambda}\big)^{K-1-k} = \gamma = 1-\big(1-\frac{\eta_l}{\lambda}\big)^K$.

Proof 2 According to the update rule of Line 13 in Algorithm 1, we have:
$$\begin{aligned}
\hat{g}_i^t &= \hat{g}_i^{t-1} - \frac{1}{\lambda}\big(x_{i,K}^t - x_{i,0}^t\big)
\overset{(a)}{=} \hat{g}_i^{t-1} + \frac{\eta_l}{\lambda}\sum_{k=0}^{K-1}\Big(1-\frac{\eta_l}{\lambda}\Big)^{K-1-k}\big(\tilde{g}_{i,k}^t - \hat{g}_i^{t-1}\big)\\
&= \hat{g}_i^{t-1} + \frac{\eta_l}{\lambda}\sum_{k=0}^{K-1}\Big(1-\frac{\eta_l}{\lambda}\Big)^{K-1-k}\tilde{g}_{i,k}^t - \frac{\eta_l}{\lambda}\cdot\frac{1-\big(1-\frac{\eta_l}{\lambda}\big)^K}{\frac{\eta_l}{\lambda}}\hat{g}_i^{t-1}\\
&= \Big(1-\frac{\eta_l}{\lambda}\Big)^K\hat{g}_i^{t-1} + \frac{\eta_l}{\lambda}\sum_{k=0}^{K-1}\Big(1-\frac{\eta_l}{\lambda}\Big)^{K-1-k}\tilde{g}_{i,k}^t
= (1-\gamma)\hat{g}_i^{t-1} + \gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t.
\end{aligned}$$
(a) applies Lemma C.1.

C.3 PROOF OF EQUATION (4) AND (5)

Lemma C.3 Considering $u^{t+1} = \frac{1}{m}\sum_{i\in[m]}x_{i,K}^t$, the averaged parameters over the last local iteration of the clients at round $t$, the auxiliary sequence $z^t = u^t + \frac{1-\gamma}{\gamma}(u^t - u^{t-1})$, $t>0$, satisfies the update rule:
$$z^{t+1} = z^t - \lambda\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t.$$

Proof 3 Firstly, according to Lemma C.1 and Lines 14 and 16 in Algorithm 1, we have:
$$\begin{aligned}
u^{t+1} - u^t &= \frac{1}{m}\sum_{i\in[m]}\big(x_{i,K}^t - x_{i,K}^{t-1}\big) = \frac{1}{m}\sum_{i\in[m]}\big(x_{i,K}^t - x_{i,0}^t - \lambda\hat{g}_i^{t-1}\big)\\
&= \frac{1}{m}\sum_{i\in[m]}\Big(-\lambda\gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t + \gamma\lambda\hat{g}_i^{t-1} - \lambda\hat{g}_i^{t-1}\Big)
= -\lambda\frac{1}{m}\sum_{i\in[m]}\Big(\gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t + (1-\gamma)\hat{g}_i^{t-1}\Big).
\end{aligned}$$
This can be considered as a momentum-like term with coefficient $\gamma$. Here we define a virtual observation sequence $\{u_{i,k}^t\}$ with the update rule:
$$u_{i,k+1}^t = u_{i,k}^t - \lambda\frac{\gamma_k}{\gamma}\Big(\gamma\tilde{g}_{i,k}^t + (1-\gamma)\hat{g}_i^{t-1}\Big), \qquad u_{i,0}^{t+1} = u^{t+1} = \frac{1}{m}\sum_{i\in[m]}u_{i,K}^t.$$
According to Lemma C.2 and the above update rule, we can get:
$$\hat{g}_i^t = (1-\gamma)\hat{g}_i^{t-1} + \gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t = -\frac{1}{\lambda}\big(u_{i,K}^t - u_{i,0}^t\big) - \gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t + \gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t = -\frac{1}{\lambda}\big(u_{i,K}^t - u_{i,0}^t\big).$$
This indicates that the virtual sequence $u^t$ can be considered as a momentum-based update with a global correction term guiding the local update, where the correction term is calculated from the offset of the virtual observation sequence during round $t$. Then we expand the auxiliary sequence $z^t$ as:
$$\begin{aligned}
z^{t+1} - z^t &= (u^{t+1} - u^t) + \frac{1-\gamma}{\gamma}(u^{t+1} - u^t) - \frac{1-\gamma}{\gamma}(u^t - u^{t-1})
= \frac{1}{\gamma}(u^{t+1} - u^t) - \frac{1-\gamma}{\gamma}(u^t - u^{t-1})\\
&= -\lambda\frac{1}{m}\sum_{i\in[m]}\Big(\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t + \frac{1-\gamma}{\gamma}\hat{g}_i^{t-1}\Big) - \frac{1-\gamma}{\gamma}(u^t - u^{t-1})\\
&= -\lambda\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t - \frac{1-\gamma}{\gamma}\frac{1}{m}\sum_{i\in[m]}\big(u^t - u^{t-1} + \lambda\hat{g}_i^{t-1}\big)\\
&= -\lambda\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t - \frac{1-\gamma}{\gamma}\frac{1}{m}\sum_{i\in[m]}\big(x_{i,K}^{t-1} - x_{i,K}^{t-2} + \lambda\hat{g}_i^{t-1}\big)\\
&= -\lambda\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t - \frac{1-\gamma}{\gamma}\frac{1}{m}\sum_{i\in[m]}\big(x_{i,K}^{t-1} - x_{i,0}^{t-1} + \lambda\hat{g}_i^{t-1} - \lambda\hat{g}_i^{t-2}\big)\\
&= -\lambda\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t,
\end{aligned}$$
where the last equality uses Line 13 of Algorithm 1, i.e. $x_{i,K}^{t-1} - x_{i,0}^{t-1} + \lambda\hat{g}_i^{t-1} - \lambda\hat{g}_i^{t-2} = 0$.

C.4 PROOF OF THEOREM 4.5

Firstly we state some important lemmas applied in the proof.

Lemma C.4 The averaged prox-correction term holds the upper bound:
$$\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2 \le \frac{1}{\gamma}\Big(\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2 - \mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t}\Big\|^2\Big) + \mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2.$$

Proof 4 According to Lemma C.2, we have:
$$\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^t = (1-\gamma)\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1} + \gamma\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t.$$
Taking the squared L2-norm and applying Jensen's inequality:
$$\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^t\Big\|^2 \le (1-\gamma)\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2 + \gamma\Big\|\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2.$$
Rearranging this inequality yields the claimed recursion.

Lemma C.5 (Bounded local update) The local prox-correction terms hold the upper bound:
$$\frac{1}{m}\sum_{i\in[m]}\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 \le \frac{P}{\gamma}\frac{1}{m}\sum_{i\in[m]}\Big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 - \mathbb{E}_t\|\hat{g}_i^{t}\|^2\Big) + \frac{24PL^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t - x^t\|^2 + 12P\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + P\big(12\sigma_g^2 + \sigma_l^2\big),$$
where $\frac{1}{P} = 1 - \frac{24\lambda^2L^2(1-2\gamma)^2}{\gamma^2}$.
Proof 5 According to Lemma C.2, we have $\hat{g}_i^t = (1-\gamma)\hat{g}_i^{t-1} + \gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t$. Taking the squared L2-norm:
$$\|\hat{g}_i^t\|^2 = \Big\|(1-\gamma)\hat{g}_i^{t-1} + \gamma\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2 \overset{(a)}{\le} (1-\gamma)\|\hat{g}_i^{t-1}\|^2 + \gamma\Big\|\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2 \overset{(b)}{\le} (1-\gamma)\|\hat{g}_i^{t-1}\|^2 + \sum_{k=0}^{K-1}\gamma_k\|\tilde{g}_{i,k}^t\|^2,$$
where (a) and (b) apply Jensen's inequality. Thus we have the following recursion:
$$\frac{1}{m}\sum_{i\in[m]}\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 \le \frac{1}{\gamma}\frac{1}{m}\sum_{i\in[m]}\big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 - \mathbb{E}_t\|\hat{g}_i^t\|^2\big) + \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\tilde{g}_{i,k}^t\|^2.$$
Here we provide a loose constant upper bound for the quasi-stochastic gradient. Writing $\tilde{g}_{i,k}^t = (1-\alpha)g_{i,k,1}^t + \alpha g_{i,k,2}^t = g_{i,k,1}^t + \alpha(g_{i,k,2}^t - g_{i,k,1}^t)$:
$$\begin{aligned}
\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\tilde{g}_{i,k}^t\|^2
&\le \frac{2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\Big(\mathbb{E}_t\|\nabla F_i(x_{i,k}^t)\|^2 + \alpha^2\mathbb{E}_t\|\nabla F_i(x_{i,k}^t + \rho g_{i,k,1}^t) - \nabla F_i(x_{i,k}^t)\|^2\Big) + \sigma_l^2\\
&\le \frac{2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\big(1 + \alpha^2L^2\rho^2\big)\mathbb{E}_t\|\nabla F_i(x_{i,k}^t)\|^2 + \sigma_l^2\\
&\le \frac{4}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\nabla F_i(x_{i,k}^t) - \nabla F_i(z^t) + \nabla F_i(z^t) - \nabla F(z^t) + \nabla F(z^t)\|^2 + \sigma_l^2\\
&\le \frac{12L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t - z^t\|^2 + 12\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + \big(12\sigma_g^2 + \sigma_l^2\big)\\
&= \frac{12L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t - x^t + x^t - u^t + u^t - z^t\|^2 + 12\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + \big(12\sigma_g^2+\sigma_l^2\big)\\
&\le \frac{24L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t - x^t\|^2 + 24L^2\,\mathbb{E}_t\|x^t - u^t + u^t - z^t\|^2 + 12\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + \big(12\sigma_g^2+\sigma_l^2\big)\\
&\le \frac{24L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t - x^t\|^2 + \frac{24L^2\lambda^2(1-2\gamma)^2}{\gamma^2}\frac{1}{m}\sum_{i\in[m]}\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 + 12\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + \big(12\sigma_g^2+\sigma_l^2\big).
\end{aligned}$$
Here we apply Jensen's inequality, the basic inequality $\|\sum_{i=1}^n a_i\|^2 \le n\sum_{i=1}^n\|a_i\|^2$, and the upper bound $\rho \le \frac{1}{\alpha L}$. Combining the inequalities above and letting $\frac{1}{P} = 1 - \frac{24L^2\lambda^2(1-2\gamma)^2}{\gamma^2}$, we have:
$$\frac{1}{m}\sum_{i\in[m]}\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 \le \frac{P}{\gamma}\frac{1}{m}\sum_{i\in[m]}\Big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 - \mathbb{E}_t\|\hat{g}_i^{t}\|^2\Big) + \frac{24PL^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t - x^t\|^2 + 12P\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + P\big(12\sigma_g^2 + \sigma_l^2\big).$$

C.4.1 L-SMOOTHNESS OF THE FUNCTION F

For the general non-convex case, according to the assumptions and the smoothness of $F$, we take the conditional expectation at round $t+1$ and expand $F(z^{t+1})$ as:
$$\begin{aligned}
\mathbb{E}_t[F(z^{t+1})] &\le F(z^t) + \mathbb{E}_t\langle\nabla F(z^t), z^{t+1}-z^t\rangle + \frac{L}{2}\mathbb{E}_t\|z^{t+1}-z^t\|^2\\
&= F(z^t) + \mathbb{E}_t\Big\langle\nabla F(z^t), -\lambda\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\rangle + \frac{L}{2}\mathbb{E}_t\|z^{t+1}-z^t\|^2\\
&= F(z^t) - \lambda\|\nabla F(z^t)\|^2 \underbrace{-\lambda\,\mathbb{E}_t\Big\langle\nabla F(z^t), \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t - \nabla F(z^t)\Big\rangle}_{R1} + \frac{L}{2}\underbrace{\mathbb{E}_t\|z^{t+1}-z^t\|^2}_{R2}.
\end{aligned}$$

C.4.2 BOUNDED R1

Note that R1 can be bounded as:
$$\begin{aligned}
R1 &= -\lambda\,\mathbb{E}_t\Big\langle\nabla F(z^t), \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t - \nabla F(z^t)\Big\rangle\\
&\overset{(a)}{=} -\lambda\,\mathbb{E}_t\Big\langle\nabla F(z^t), \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}[\tilde{g}_{i,k}^t] - \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\nabla F_i(z^t)\Big\rangle\\
&\overset{(b)}{=} \frac{\lambda}{2}\|\nabla F(z^t)\|^2 + \frac{\lambda}{2}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\big(\mathbb{E}[\tilde{g}_{i,k}^t] - \nabla F_i(z^t)\big)\Big\|^2 - \frac{\lambda}{2m^2}\mathbb{E}_t\Big\|\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}[\tilde{g}_{i,k}^t]\Big\|^2\\
&\overset{(c)}{\le} \frac{\lambda}{2}\|\nabla F(z^t)\|^2 + \frac{\lambda}{2}\underbrace{\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\big\|\mathbb{E}[\tilde{g}_{i,k}^t] - \nabla F_i(z^t)\big\|^2}_{R1.a} - \frac{\lambda}{2m^2}\mathbb{E}_t\Big\|\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}[\tilde{g}_{i,k}^t]\Big\|^2.
\end{aligned}$$
(a) applies the fact that $\frac{1}{m}\sum_{i\in[m]}\nabla F_i(z^t) = \nabla F(z^t)$. (b) applies $-\langle x, y\rangle = \frac{1}{2}\big(\|x\|^2 + \|y\|^2 - \|x+y\|^2\big)$. (c) applies Jensen's inequality and the fact that $\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma} = 1$. According to the update rule we have:
$$\mathbb{E}[\tilde{g}_{i,k}^t] = (1-\alpha)\mathbb{E}[g_{i,k,1}^t] + \alpha\mathbb{E}[g_{i,k,2}^t] = (1-\alpha)\mathbb{E}\big[\nabla F_i(x_{i,k}^t;\varepsilon_{i,k}^t)\big] + \alpha\mathbb{E}\big[\nabla F_i(x_{i,k}^t + \rho g_{i,k,1}^t;\varepsilon_{i,k}^t)\big] = (1-\alpha)\nabla F_i(x_{i,k}^t) + \alpha\nabla F_i\big(x_{i,k}^t + \rho g_{i,k,1}^t\big).$$
Let $\rho \le \frac{1}{\sqrt{3}\alpha L}$; then we can bound the term R1.a as follows:
$$\begin{aligned}
R1.a &= \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\big\|(1-\alpha)\nabla F_i(x_{i,k}^t) + \alpha\nabla F_i(x_{i,k}^t+\rho g_{i,k,1}^t) - \nabla F_i(z^t)\big\|^2\\
&= \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\big\|\nabla F_i(x_{i,k}^t) - \nabla F_i(z^t) + \alpha\big(\nabla F_i(x_{i,k}^t+\rho g_{i,k,1}^t) - \nabla F_i(x_{i,k}^t)\big)\big\|^2\\
&\le \frac{2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\nabla F_i(x_{i,k}^t)-\nabla F_i(z^t)\|^2 + \frac{2\alpha^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\nabla F_i(x_{i,k}^t+\rho g_{i,k,1}^t)-\nabla F_i(x_{i,k}^t)\|^2\\
&\le \frac{2L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t - z^t\|^2 + \frac{2\alpha^2L^2\rho^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|g_{i,k,1}^t\|^2\\
&= \frac{2L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t - x^t + x^t - u^t + u^t - z^t\|^2 + \frac{2\alpha^2L^2\rho^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|g_{i,k,1}^t\|^2\\
&\le \frac{4L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2 + 4L^2\,\mathbb{E}_t\|(x^t-u^t)+(u^t-z^t)\|^2\\
&\qquad + \frac{2\alpha^2L^2\rho^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|g_{i,k,1}^t - \nabla F_i(x_{i,k}^t)\|^2 + \frac{2\alpha^2L^2\rho^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\nabla F_i(x_{i,k}^t)\|^2\\
&\le \frac{4L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2 + \frac{2\alpha^2L^2\rho^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\nabla F_i(x_{i,k}^t)\|^2 + 2\alpha^2L^2\rho^2\sigma_l^2 + 4L^2\,\mathbb{E}_t\Big\|-\frac{1}{m}\sum_{i\in[m]}\lambda\hat{g}_i^{t-1} + \frac{\gamma-1}{\gamma}(u^t-u^{t-1})\Big\|^2\\
&= \frac{4L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2 + \frac{2\alpha^2L^2\rho^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\nabla F_i(x_{i,k}^t)\|^2 + 2\alpha^2L^2\rho^2\sigma_l^2\\
&\qquad + 4L^2\,\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\Big[\big(u^t-u^{t-1}+\lambda\hat{g}_i^{t-1}\big) - \frac{1}{\gamma}\big(u^t-u^{t-1}+\lambda\hat{g}_i^{t-1}\big) + \frac{1-2\gamma}{\gamma}\lambda\hat{g}_i^{t-1}\Big]\Big\|^2\\
&= \frac{4L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2 + 2\alpha^2L^2\rho^2\sigma_l^2 + \frac{4\lambda^2L^2(1-2\gamma)^2}{\gamma^2}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2\\
&\qquad + \frac{2\alpha^2L^2\rho^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|\nabla F_i(x_{i,k}^t)-\nabla F_i(z^t)+\nabla F_i(z^t)-\nabla F(z^t)+\nabla F(z^t)\|^2\\
&\overset{(a)}{\le} \frac{4L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2 + \frac{4\lambda^2L^2(1-2\gamma)^2}{\gamma^2}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2 + 2\alpha^2L^2\rho^2\sigma_l^2\\
&\qquad + \frac{2L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-z^t\|^2 + 6\alpha^2L^2\rho^2\sigma_g^2 + 6\alpha^2L^2\rho^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2\\
&\le \frac{8L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2 + \frac{8\lambda^2L^2(1-2\gamma)^2}{\gamma^2}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2 + 2\alpha^2L^2\rho^2\sigma_l^2 + 6\alpha^2L^2\rho^2\sigma_g^2 + 6\alpha^2L^2\rho^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2\\
&\overset{(b)}{\le} \frac{8L^2}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2 + \frac{8\lambda^2L^2(1-2\gamma)^2}{\gamma^3}\Big(\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2 - \mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t}\Big\|^2\Big)\\
&\qquad + \frac{8\lambda^2L^2(1-2\gamma)^2}{\gamma^2}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2 + 2\alpha^2L^2\rho^2\sigma_l^2 + 6\alpha^2L^2\rho^2\sigma_g^2 + 6\alpha^2L^2\rho^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2.
\end{aligned}$$
(a) applies the bound $\rho \le \frac{1}{\sqrt{3}\alpha L}$. (b) applies Lemma C.4. The other steps use the fact $\mathbb{E}\|x-\mathbb{E}[x]\|^2 = \mathbb{E}\|x\|^2 - \|\mathbb{E}[x]\|^2$ and $\|x+y\|^2 \le (1+a)\|x\|^2 + (1+\frac{1}{a})\|y\|^2$.

We denote $c^t = \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2$ as the local offset term. We first consider $c_k^t = \frac{1}{m}\sum_{i\in[m]}\mathbb{E}_t\|x_{i,k}^t-x^t\|^2$, which can be bounded as:
$$\begin{aligned}
c_k^t &= \frac{1}{m}\sum_{i\in[m]}\mathbb{E}_t\|x_{i,k}^t-x_{i,k-1}^t+x_{i,k-1}^t-x_{i,0}^t\|^2
= \frac{1}{m}\sum_{i\in[m]}\mathbb{E}_t\Big\|-\eta_l\big(\tilde{g}_{i,k-1}^t-\hat{g}_i^{t-1}\big) + \Big(1-\frac{\eta_l}{\lambda}\Big)\big(x_{i,k-1}^t-x_{i,0}^t\big)\Big\|^2\\
&\le (1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2 c_{k-1}^t + \Big(1+\frac{1}{a}\Big)\frac{\eta_l^2}{m}\sum_{i\in[m]}\mathbb{E}_t\|\tilde{g}_{i,k-1}^t-\hat{g}_i^{t-1}\|^2\\
&= (1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2 c_{k-1}^t + \Big(1+\frac{1}{a}\Big)\eta_l^2\sigma_l^2 + \Big(1+\frac{1}{a}\Big)\frac{\eta_l^2}{m}\sum_{i\in[m]}\mathbb{E}_t\big\|\nabla F_i(x_{i,k-1}^t)-\hat{g}_i^{t-1} + \alpha\big(\nabla F_i(x_{i,k-1}^t+\rho g_{i,k-1,1}^t)-\nabla F_i(x_{i,k-1}^t)\big)\big\|^2\\
&\le (1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2 c_{k-1}^t + \Big(1+\frac{1}{a}\Big)\eta_l^2\sigma_l^2 + \Big(1+\frac{1}{a}\Big)\frac{3\eta_l^2}{m}\sum_{i\in[m]}\Big(\mathbb{E}_t\|\nabla F_i(x_{i,k-1}^t)\|^2 + \mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 + \alpha^2L^2\rho^2\,\mathbb{E}_t\|\nabla F_i(x_{i,k-1}^t)\|^2\Big)\\
&\le (1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2 c_{k-1}^t + \Big(1+\frac{1}{a}\Big)\eta_l^2\sigma_l^2 + \Big(1+\frac{1}{a}\Big)\frac{4\eta_l^2}{m}\sum_{i\in[m]}\mathbb{E}_t\|\nabla F_i(x_{i,k-1}^t)\|^2 + \Big(1+\frac{1}{a}\Big)\frac{3\eta_l^2}{m}\sum_{i\in[m]}\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2\\
&\le (1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2 c_{k-1}^t + \Big(1+\frac{1}{a}\Big)\frac{16\eta_l^2L^2}{m}\sum_{i\in[m]}\mathbb{E}_t\|x_{i,k-1}^t-x^t\|^2 + \Big(1+\frac{1}{a}\Big)16\eta_l^2L^2\,\mathbb{E}_t\|x^t-z^t\|^2\\
&\qquad + \Big(1+\frac{1}{a}\Big)\eta_l^2\big(16\sigma_g^2+\sigma_l^2\big) + \Big(1+\frac{1}{a}\Big)16\eta_l^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + \Big(1+\frac{1}{a}\Big)\frac{3\eta_l^2}{m}\sum_{i\in[m]}\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2\\
&\le \Big((1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2 + \Big(1+\frac{1}{a}\Big)16\eta_l^2L^2\Big)c_{k-1}^t + \Big(1+\frac{1}{a}\Big)\eta_l^2\big(16\sigma_g^2+\sigma_l^2\big) + \Big(1+\frac{1}{a}\Big)16\eta_l^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2\\
&\qquad + \Big(1+\frac{1}{a}\Big)\eta_l^2\Big(3+\frac{16\lambda^2L^2(1-2\gamma)^2}{\gamma^2}\Big)\frac{1}{m}\sum_{i\in[m]}\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2,
\end{aligned}$$
where the fifth step expands $\nabla F_i(x_{i,k-1}^t)$ around $x^t$, $z^t$ and $\nabla F(z^t)$ as before. Substituting Lemma C.5 for the $\hat{g}_i^{t-1}$ term gives:
$$\begin{aligned}
c_k^t &\le \Big((1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2 + \Big(1+\frac{1}{a}\Big)16\eta_l^2L^2\Big)c_{k-1}^t + \Big(1+\frac{1}{a}\Big)\eta_l^2\big(16\sigma_g^2+\sigma_l^2\big) + \Big(1+\frac{1}{a}\Big)\eta_l^2L^2(88P-16)\,c^t\\
&\qquad + \Big(1+\frac{1}{a}\Big)\frac{2\eta_l^2(P-1)}{3}\big(12\sigma_g^2+\sigma_l^2\big) + \Big(1+\frac{1}{a}\Big)16\eta_l^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + \Big(1+\frac{1}{a}\Big)\eta_l^2(44P-8)\,\mathbb{E}_t\|\nabla F(z^t)\|^2\\
&\qquad + \Big(1+\frac{1}{a}\Big)\frac{2\eta_l^2(P-1)}{3\gamma}\frac{1}{m}\sum_{i\in[m]}\big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2 - \mathbb{E}_t\|\hat{g}_i^{t}\|^2\big).
\end{aligned}$$
When $P$ satisfies the condition $P \le 2$, i.e. $\frac{1}{P} = 1-\frac{24\lambda^2L^2(1-2\gamma)^2}{\gamma^2} \ge \frac{1}{2}$, we have the constant $\frac{2(P-1)}{3} \le \frac{2}{3} < 1$; enlarging the last $12\sigma_g^2$ to $16\sigma_g^2$ for convenience, we have:
$$c_k^t \le \Big((1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2+\Big(1+\frac{1}{a}\Big)16\eta_l^2L^2\Big)c_{k-1}^t + 2\Big(1+\frac{1}{a}\Big)\eta_l^2\big(16\sigma_g^2+\sigma_l^2\big) + 160\Big(1+\frac{1}{a}\Big)\eta_l^2L^2c^t + 96\Big(1+\frac{1}{a}\Big)\eta_l^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + 2\Big(1+\frac{1}{a}\Big)\frac{\eta_l^2}{\gamma}\frac{1}{m}\sum_{i\in[m]}\big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2-\mathbb{E}_t\|\hat{g}_i^t\|^2\big).$$
Here we get the recursion formula between $c_k^t$ and $c_{k-1}^t$.
Actually we need to upper bound $c^t = \sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}c_k^t$. Let the weights satisfy
$$(1+a)\Big(1-\frac{\eta_l}{\lambda}\Big)^2 + \Big(1+\frac{1}{a}\Big)16\eta_l^2L^2 \le \frac{\gamma_{K-2}}{\gamma_{K-1}} = \frac{\gamma_{K-3}}{\gamma_{K-2}} = \cdots = \frac{\gamma_0}{\gamma_1} = 1-\frac{\eta_l}{\lambda},$$
which holds for $\eta_l \le \lambda$. Then we can bound $c^t$ as:
$$\begin{aligned}
c^t = \sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}c_k^t &\le 2\Big(1+\frac{1}{a}\Big)\frac{\eta_l^2}{\gamma}\sum_{k'=0}^{K-1}\Big(\sum_{k=0}^{k'-1}\gamma_k\Big)\Big(16\sigma_g^2+\sigma_l^2+48\,\mathbb{E}_t\|\nabla F(z^t)\|^2+80L^2c^t+\frac{1}{m\gamma}\sum_{i\in[m]}\big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2-\mathbb{E}_t\|\hat{g}_i^t\|^2\big)\Big)\\
&\overset{(a)}{\le} 2\Big(1+\frac{1}{a}\Big)\eta_l^2\sum_{k'=0}^{K-1}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\Big(16\sigma_g^2+\sigma_l^2+48\,\mathbb{E}_t\|\nabla F(z^t)\|^2+80L^2c^t+\frac{1}{m\gamma}\sum_{i\in[m]}\big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2-\mathbb{E}_t\|\hat{g}_i^t\|^2\big)\Big)\\
&= 2\Big(1+\frac{1}{a}\Big)\eta_l^2K\Big(16\sigma_g^2+\sigma_l^2+48\,\mathbb{E}_t\|\nabla F(z^t)\|^2+80L^2c^t+\frac{1}{m\gamma}\sum_{i\in[m]}\big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2-\mathbb{E}_t\|\hat{g}_i^t\|^2\big)\Big).
\end{aligned}$$
Solving this inequality for $c^t$ (the coefficient $160(1+\frac{1}{a})\eta_l^2KL^2$ of $c^t$ on the right-hand side is at most $\frac{1}{2}$ under the step size conditions below), we obtain:
$$c^t \le 4\Big(1+\frac{1}{a}\Big)\eta_l^2K\Big(16\sigma_g^2+\sigma_l^2+48\,\mathbb{E}_t\|\nabla F(z^t)\|^2+\frac{1}{m\gamma}\sum_{i\in[m]}\big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2-\mathbb{E}_t\|\hat{g}_i^t\|^2\big)\Big).$$
Letting $a = 1$ for convenience, we summarize the extra terms above and bound the term R1.a as:
$$\begin{aligned}
R1.a &= \frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}_t\big\|\mathbb{E}[\tilde{g}_{i,k}^t]-\nabla F_i(z^t)\big\|^2\\
&\le 8L^2c^t + \frac{8\lambda^2L^2(1-2\gamma)^2}{\gamma^3}\Big(\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2-\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t}\Big\|^2\Big) + 2\alpha^2L^2\rho^2\sigma_l^2\\
&\qquad + \frac{8\lambda^2L^2(1-2\gamma)^2}{\gamma^2}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2 + 6\alpha^2L^2\rho^2\sigma_g^2 + 6\alpha^2L^2\rho^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2\\
&\le \frac{8\lambda^2L^2(1-2\gamma)^2}{\gamma^3}\Big(\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2-\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t}\Big\|^2\Big) + 2\alpha^2L^2\rho^2\sigma_l^2 + 6\alpha^2L^2\rho^2\sigma_g^2\\
&\qquad + \frac{8\lambda^2L^2(1-2\gamma)^2}{\gamma^2}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2 + \frac{64\eta_l^2L^2K}{m\gamma}\sum_{i\in[m]}\big(\mathbb{E}_t\|\hat{g}_i^{t-1}\|^2-\mathbb{E}_t\|\hat{g}_i^t\|^2\big)\\
&\qquad + 3072\eta_l^2L^2K\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + 6\alpha^2L^2\rho^2\,\mathbb{E}_t\|\nabla F(z^t)\|^2 + 64\eta_l^2L^2K\big(16\sigma_g^2+\sigma_l^2\big).
\end{aligned}$$
Thus we can bound R1 as follows:
$$\begin{aligned}
R1 &\le \frac{\lambda}{2}\mathbb{E}_t\|\nabla F(z^t)\|^2 + \frac{\lambda}{2}R1.a - \frac{\lambda}{2m^2}\mathbb{E}_t\Big\|\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}[\tilde{g}_{i,k}^t]\Big\|^2\\
&\le \Big(\frac{\lambda}{2}+3\lambda\alpha^2L^2\rho^2+1536\lambda\eta_l^2L^2K\Big)\mathbb{E}_t\|\nabla F(z^t)\|^2 + \frac{32\lambda\eta_l^2L^2K}{\gamma m}\sum_{i\in[m]}\big(\mathbb{E}\|\hat{g}_i^{t-1}\|^2-\mathbb{E}\|\hat{g}_i^t\|^2\big)\\
&\qquad + \frac{4\lambda^3L^2(1-2\gamma)^2}{\gamma^3}\Big(\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2-\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t}\Big\|^2\Big) + \lambda\alpha^2L^2\rho^2\big(3\sigma_g^2+\sigma_l^2\big)\\
&\qquad + \frac{4\lambda^3L^2(1-2\gamma)^2}{\gamma^2}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2 + 32\lambda\eta_l^2L^2K\big(16\sigma_g^2+\sigma_l^2\big) - \frac{\lambda}{2m^2}\mathbb{E}_t\Big\|\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\mathbb{E}[\tilde{g}_{i,k}^t]\Big\|^2.
\end{aligned}$$
We notice that R1 contains the quasi-gradient norm term with a negative weight, so we can set another constraint on $\lambda$ to eliminate the corresponding positive term. We will prove this in the next part.

C.4.3 BOUNDED GLOBAL GRADIENT

As we have bounded the terms R1 and R2, according to the smoothness inequality, we combine the inequalities above and get:
$$\begin{aligned}
\mathbb{E}_t[F(z^{t+1})] &\le F(z^t) - \lambda\|\nabla F(z^t)\|^2 + R1 + \frac{L}{2}R2\\
&\le F(z^t) - \Big(\frac{\lambda}{2}-3\lambda\alpha^2L^2\rho^2-1536\lambda\eta_l^2L^2K\Big)\|\nabla F(z^t)\|^2 + \lambda\alpha^2L^2\rho^2\big(3\sigma_g^2+\sigma_l^2\big)\\
&\qquad + \Big(\frac{4\lambda^3L^2(1-2\gamma)^2}{\gamma^2} + \frac{\lambda^2L}{2} - \frac{\lambda}{2}\Big)\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\sum_{k=0}^{K-1}\frac{\gamma_k}{\gamma}\tilde{g}_{i,k}^t\Big\|^2 + \frac{32\lambda\eta_l^2L^2K}{\gamma m}\sum_{i\in[m]}\big(\mathbb{E}\|\hat{g}_i^{t-1}\|^2-\mathbb{E}\|\hat{g}_i^t\|^2\big)\\
&\qquad + 32\lambda\eta_l^2L^2K\big(16\sigma_g^2+\sigma_l^2\big) + \frac{4\lambda^3L^2(1-2\gamma)^2}{\gamma^3}\Big(\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2-\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t}\Big\|^2\Big).
\end{aligned}$$
We follow Yang et al. (2021) to set $\lambda$ such that $\frac{4\lambda^3L^2(1-2\gamma)^2}{\gamma^2} + \frac{\lambda^2L}{2} - \frac{\lambda}{2} \le 0$; it is easy to verify that this inequality yields an upper bound for $\lambda$. Thus, the stochastic gradient term is eliminated by this choice of $\lambda$. We denote the constant $\lambda\kappa = \frac{\lambda}{2}-3\lambda\alpha^2L^2\rho^2-1536\lambda\eta_l^2L^2K$, where $\kappa$ can be considered a constant. We select two constants $c_1\in(0,\frac{1}{2})$ and $c_2\in(0,\frac{1}{2})$ satisfying $c_1+c_2\in(0,\frac{1}{2})$, and let $\frac{1}{2}-3\alpha^2L^2\rho^2>\frac{1}{2}-c_1$ and $\frac{1}{2}-1536\eta_l^2L^2K>\frac{1}{2}-c_2$, where $\rho$ and $\eta_l$ satisfy $\rho<\frac{1}{\alpha L}\sqrt{\frac{c_1}{3}}$ and $\eta_l<\frac{1}{L}\sqrt{\frac{c_2}{1536K}}$. Then we can bound $\kappa = \frac{1}{2}-3\alpha^2L^2\rho^2-1536\eta_l^2L^2K > \frac{1}{2}-c_1-c_2>0$, and $\frac{1}{\kappa}<\frac{2}{1-2c_1-2c_2}$ is a constant upper bound. We take the full expectation on the bounded global gradient:
$$\begin{aligned}
\lambda\kappa\,\mathbb{E}\|\nabla F(z^t)\|^2 &\le \mathbb{E}F(z^t)-\mathbb{E}F(z^{t+1}) + \frac{32\lambda\eta_l^2L^2K}{\gamma m}\sum_{i\in[m]}\big(\mathbb{E}\|\hat{g}_i^{t-1}\|^2-\mathbb{E}\|\hat{g}_i^t\|^2\big)\\
&\qquad + \frac{4\lambda^3L^2(1-2\gamma)^2}{\gamma^3}\Big(\mathbb{E}\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t-1}\Big\|^2-\mathbb{E}\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^{t}\Big\|^2\Big) + 32\lambda\eta_l^2L^2K\big(16\sigma_g^2+\sigma_l^2\big) + \lambda\alpha^2L^2\rho^2\big(3\sigma_g^2+\sigma_l^2\big).
\end{aligned}$$
Taking the full expectation, telescoping over $t$ in the inequality above, and applying the fact that $F^* \le F(x)$ for all $x\in\mathbb{R}^d$, we have:
$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(z^t)\|^2 &\le \frac{1}{\lambda\kappa T}\big(F(z^0)-\mathbb{E}[F(z^T)]\big) + \frac{32\eta_l^2L^2K}{\kappa\gamma mT}\sum_{i\in[m]}\big(\mathbb{E}\|\hat{g}_i^0\|^2-\mathbb{E}\|\hat{g}_i^T\|^2\big)\\
&\qquad + \frac{4\lambda^2L^2(1-2\gamma)^2}{\kappa\gamma^3T}\Big(\mathbb{E}\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^0\Big\|^2-\mathbb{E}\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^T\Big\|^2\Big) + \frac{1}{\kappa}\Big(32\eta_l^2L^2K\big(16\sigma_g^2+\sigma_l^2\big)+\alpha^2L^2\rho^2\big(3\sigma_g^2+\sigma_l^2\big)\Big)\\
&\le \frac{F(z^0)-F^*}{\lambda\kappa T} + \frac{32\eta_l^2L^2K}{\kappa\gamma mT}\sum_{i\in[m]}\mathbb{E}\|\hat{g}_i^0\|^2 + \frac{4\lambda^2L^2(1-2\gamma)^2}{\kappa\gamma^3T}\mathbb{E}\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^0\Big\|^2 + \frac{1}{\kappa}\Big(32\eta_l^2L^2K\big(16\sigma_g^2+\sigma_l^2\big)+\alpha^2L^2\rho^2\big(3\sigma_g^2+\sigma_l^2\big)\Big).
\end{aligned}$$
Here we summarize the conditions and constraints used in the above conclusion. Firstly, note that $\gamma = 1-(1-\frac{\eta_l}{\lambda})^K<1$ when $\eta_l\le 2\lambda$; thus $\frac{1}{\gamma}>1$. When $K$ satisfies $K\ge\frac{\lambda}{\eta_l}$, we have $(1-\frac{\eta_l}{\lambda})^K\le e^{-\frac{\eta_l}{\lambda}K}\le e^{-1}$, and then $\gamma>1-e^{-1}$ and $\frac{1}{\gamma}<\frac{e}{e-1}<2$. To let $\kappa = \frac{1}{2}-3\alpha^2L^2\rho^2-1536\eta_l^2L^2K>0$ hold, $\rho$ and $\eta_l$ satisfy $\rho<\frac{1}{\sqrt{6}\alpha L}$ and $\eta_l<\frac{1}{32\sqrt{3K}L}$. Applying $\frac{1}{\gamma}<2$ and $(1-2\gamma)^2\le1$, we obtain:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(z^t)\|^2 \le \frac{2\big(F(z^0)-F^*\big)}{\lambda\kappa T} + \frac{64\eta_l^2L^2K}{\kappa T}\frac{1}{m}\sum_{i\in[m]}\mathbb{E}\|\hat{g}_i^0\|^2 + \frac{32\lambda^2L^2}{\kappa T}\mathbb{E}\Big\|\frac{1}{m}\sum_{i\in[m]}\hat{g}_i^0\Big\|^2 + \frac{1}{\kappa}\Big(32\eta_l^2L^2K\big(16\sigma_g^2+\sigma_l^2\big)+\alpha^2L^2\rho^2\big(3\sigma_g^2+\sigma_l^2\big)\Big).$$



Let $\rho = O(1/\sqrt{T})$, respecting the upper bound $\rho \le \frac{1}{\sqrt{6}\alpha L}$, and let $\eta_l = O(1/K)$, respecting the lower bound $\eta_l \ge \lambda/K$. When the local interval $K$ is long enough, i.e., $K = O(T)$, the proposed FedSpeed achieves a fast convergence rate of $O(1/T)$.
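To see the claimed rate concretely, here is a hedged sketch (the constants `rho0` and `lam` are illustrative, not from the paper): with $\rho = \rho_0/\sqrt{T}$, $\eta_l = \lambda/K$, and $K = T$, the residual terms of the bound, which scale as $\eta_l^2 K$ and $\rho^2$ up to constants, both decay as $O(1/T)$:

```python
# residual terms of the convergence bound, up to constant factors
def residual(T, rho0=0.1, lam=0.01):
    K = T                    # local interval grows with T
    eta_l = lam / K          # eta_l = O(1/K), at its lower bound
    rho = rho0 / T**0.5      # rho = O(1/sqrt(T))
    return eta_l**2 * K + rho**2

# doubling T halves the residual, i.e. O(1/T) decay
assert abs(residual(1000) / residual(2000) - 2.0) < 1e-6
```

Both terms collapse to $(\lambda^2 + \rho_0^2)/T$ under this schedule, which is why the bound is dominated by the $O(1/T)$ terms.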

Under the same assumptions, it achieves $O(1/\sqrt{SKT} + K/T)$, which restricts the order of $K$ to not exceed that of $T$. Karimireddy et al. (2020) establish a convergence rate of $O(1/\sqrt{SKT})$ under a constant local interval, and Reddi et al. (2020) prove the same rate for FedAdam under the stricter assumption of coordinate-wise bounded variance for the global full gradient. Our experiments in Section 5.3 also verify this characteristic: most current algorithms degrade as $K$ increases during training, while FedSpeed remains stable under enlarged local intervals and reduced communication rounds.

Figure 1: Top-1 accuracy versus communication rounds for all compared methods on CIFAR-10/100 and TinyImagenet. Communication rounds are set to 1,500 for CIFAR-10/100 and 3,000 for TinyImagenet. In each group, the left plot shows performance on the IID dataset and the right on the non-IID dataset, which is split with the Dirichlet heterogeneity weight set to 0.6.

We compare several classical and efficient methods with the proposed FedSpeed in our experiments, focusing on local consistency and client-drifts, including FedAvg (McMahan et al., 2017), FedAdam (Reddi et al., 2020), SCAFFOLD (Karimireddy et al., 2020), FedCM (Xu et al., 2021), FedProx (Sahu et al., 2018), FedDyn (Durmus et al., 2021), and FedADMM (Wang et al., 2022). FedAdam applies an adaptive optimizer to improve the global updates. SCAFFOLD and FedCM utilize a global gradient estimation to correct the local updates. FedProx introduces the prox-term to alleviate local inconsistency. FedDyn and FedADMM employ different variants of the primal-dual method to reduce local inconsistency. Due to limited space, more detailed descriptions and discussions of these baselines are placed in the Appendix.
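To make the prox-term's role concrete, here is a minimal sketch (not the authors' implementation; function names and step sizes are illustrative) contrasting a plain FedAvg-style local step with a FedProx-style prox-regularized step:

```python
import numpy as np

def local_sgd_step(x, grad, lr=0.1):
    # plain FedAvg-style local update
    return x - lr * grad

def prox_sgd_step(x, grad, x_global, mu=0.1, lr=0.1):
    # FedProx-style update: the prox-term mu*(x - x_global) pulls the local
    # iterate back toward the round's initial global model, enforcing local
    # consistency and limiting client-drift
    return x - lr * (grad + mu * (x - x_global))

x_global = np.zeros(3)                  # global model at the round's start
x = np.array([1.0, -2.0, 0.5])          # drifted local iterate
g = np.array([0.2, 0.1, -0.3])          # local stochastic gradient
plain = local_sgd_step(x, g)
prox = prox_sgd_step(x, g, x_global)
# the prox step ends closer to the global model than the plain step
assert np.linalg.norm(prox - x_global) < np.linalg.norm(plain - x_global)
```

The same prox-term appears inside FedSpeed's local objective; FedSpeed additionally applies a prox-correction term to cancel the bias this regularizer introduces.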

Figure 2: Performance of FedAdam, FedCM, SCAFFOLD, and FedSpeed with local epochs E = 1, 2, 5, 10, 20 under 10% participation of 100 total clients on CIFAR-10. We fix T × E = 2500 so that the total number of training epochs is equal, illustrating the effect of increasing E while decreasing T.

Figure 3: Heat maps for different datasets under a Dirichlet heterogeneity weight of 0.6.

Thus we decay it to 20 for 5 iterations per local epoch. The number of local epochs is set to 5, the same as in the experiments of Karimireddy et al. (2020); Durmus et al. (2021); Xu et al. (2021).

Figure 4: Performance of different ascent step size ρ under different prox-term weights of [0.001, 0.01, 0.1, 0.5].

Figure 5: (a) and (b) show the loss curves on the CIFAR-10 IID/DIR-0.6 datasets; FedSpeed achieves the best and most stable performance during training. (c) and (d) show the loss curves of FedCM and FedSpeed on the CIFAR-10 DIR-0.6 dataset as the number of local epochs increases from E = 1 to 20.

$Kc^t$. (a) enlarges the sum from $k'$ to $K-1$, where $k' \le K-1$. Let $\eta_l$ satisfy the upper bound $\eta_l \le \frac{1}{\sqrt{320(1+1/a)}KL}$.

Haddadpour et al. (2021) compress the local offset and adopt a global correction to reduce the biases. Zhang et al. (2021) apply the primal-dual method instead of the primal method to solve a series of sub-problems on the local clients and alternately update the primal and dual variables, which can achieve a fast convergence rate of O(1/T).

Test accuracy (%) on CIFAR-10/100 and TinyImagenet under 2% participation of 500 clients with IID and non-IID datasets. The heterogeneity is applied as Dirichlet-0.6 (DIR).

In non-convex optimization it is difficult to determine the selection of the local training interval K under this requirement, though Durmus et al. (2021) claim that 5 local epochs

Performance of different ρ 0 with α = 1.


Dataset introductions. Extensive experiments are conducted on the CIFAR-10/100 datasets. We test two different settings: 10% participation of 100 total clients and 2% participation of 500 total clients. The CIFAR-10 dataset contains 50,000 training samples and 10,000 test samples in 10 classes; each sample is a 3×32×32 color image. CIFAR-100 (Krizhevsky et al., 2009) includes 50,000 training samples and 10,000 test samples in 100 classes, i.e., 500 training samples per class. TinyImagenet involves 100,000 training images and 10,000 test images in 200 classes of 3×64×64 color images, as shown in Table 3. To compare fairly with the other baselines, we train and test on the standard ResNet-18 (He et al., 2016) backbone with a 7×7 filter in the first convolution layer, as implemented in previous works, e.g., Karimireddy et al. (2020); Durmus et al. (2021); Xu et al. (2021). We follow Hsieh et al. (2020) in replacing the batch normalization layers with group normalization layers (Wu & He, 2018), which can be aggregated directly by averaging. These are all common setups in many previous works.
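A common recipe for the Dirichlet non-IID split mentioned above can be sketched as follows (a hedged sketch, not the authors' code; `dirichlet_split` and its defaults are illustrative): for each class, client proportions are drawn from $\mathrm{Dir}(\alpha)$ with $\alpha = 0.6$ matching the heterogeneity weight used in the experiments, and that class's samples are assigned accordingly.

```python
import numpy as np

def dirichlet_split(labels, n_clients, alpha=0.6, seed=0):
    """Label-wise Dirichlet partition: each class is divided among clients
    according to proportions sampled from Dir(alpha)."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx

labels = np.repeat(np.arange(10), 500)   # toy stand-in for CIFAR-10 labels
parts = dirichlet_split(labels, n_clients=100)
assert sum(len(p) for p in parts) == len(labels)  # every sample is assigned
```

Smaller $\alpha$ concentrates each class on fewer clients, i.e., stronger heterogeneity; the heat maps in Figure 3 visualize exactly such splits.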

Communication rounds required to achieve the target accuracy. CIFAR-10/100 are trained for 1,500 rounds and TinyImagenet for 3,000 rounds. "-" means the target accuracy cannot be achieved within the fixed number of training rounds. DIR represents the Dirichlet distribution with heterogeneity weight 0.6. The local interval K is set to 5 on CIFAR-10 (100-10%) and 2 on the others. Other hyper-parameters are introduced above.

Training wall-clock time comparison.



Comparison on different heterogeneous dataset.

Ablation studies on different modules.

can theoretically improve the efficiency by adopting a larger order of local interval K = O(T).

Performance of different α with ρ 0 = 0.1.

