ENHANCE LOCAL CONSISTENCY FOR FREE: A MULTI-STEP INERTIAL MOMENTUM APPROACH

Abstract

Federated learning (FL), a collaborative distributed training paradigm in which several edge computing devices train under the coordination of a centralized server, is plagued by inconsistent local stationary points caused by the heterogeneity of the partially participating local clients. This inconsistency causes the local client-drift problem and leads to unstable and slow convergence, especially on highly heterogeneous datasets. To address these issues, we propose a novel federated learning algorithm, named FedMIM, which adopts multi-step inertial momentum on the edge devices and enhances local consistency for free during training to improve robustness to heterogeneity. Specifically, we incorporate weighted global gradient estimates as inertial correction terms to guide both the local iterates and the stochastic gradient estimation, which naturally accounts for the global objective when optimizing on the edges' heterogeneous datasets and keeps the local iterations consistent. Theoretically, we show that FedMIM achieves an $O(1/\sqrt{SKT})$ convergence rate with a linear speedup property with respect to the number of selected clients S and a proper local interval K in each communication round under the nonconvex setting. Empirically, we conduct comprehensive experiments on various real-world datasets and demonstrate the efficacy of the proposed FedMIM against several state-of-the-art baselines.

1. INTRODUCTION

Federated Learning (FL) is an increasingly important distributed learning framework in which data distributed over a large number of clients, such as mobile phones, wearable devices or network sensors, is used for training (Kairouz et al., 2021). In contrast to traditional machine learning paradigms, FL uses a centralized server to coordinate the participating clients to train a model without collecting the client data, thereby achieving a basic level of data privacy and security (Li et al., 2020a). The common pipeline to achieve this goal includes three steps (Bonawitz et al., 2019): i) the server broadcasts the current model to the clients at the beginning of each communication iteration; ii) the clients synchronize their local models and update them based on their own data; iii) the server averages the latest local models and repeats these procedures until convergence. Despite the empirical success of past work, some key challenges remain for FL: expensive communication, privacy concerns and statistical diversity. The first two problems have been well addressed in past work (Konečnỳ et al., 2016; Sattler et al., 2019; Hamer et al., 2020; Truex et al., 2019; Xu et al., 2019), whereas the last one is still the main challenge that needs to be dealt with. Due to statistical diversity among clients within an FL system, client drift (Karimireddy et al., 2020a) leads to slow and unstable convergence during model training. In the case of heterogeneous data, each client's optimum is not well aligned with the global optimum. The conventional FL algorithm does not consider this data heterogeneity problem and simply applies stochastic gradient descent to the local update. As a consequence, the final converged solution of the clients may differ from the stationary point of the global objective function, since the average of the client updates moves towards the average of the clients' optima rather than the true optimum.
Because distribution drift exists across the clients' datasets, the model may overfit the local training data under empirical risk minimization, and it has been reported that the generalization performance on clients' local data may deteriorate when the clients' training and testing datasets follow different distributions (Liang et al., 2020). To overcome these problems, several solutions have been put forward in recent years. Generally, there are three types of methods: variance-reduction based (Karimireddy et al., 2020b), regularization based (Li et al., 2020b; Acar et al., 2021) and momentum based (Xu et al., 2021; Reddi et al., 2020). Although these works present effective methods to reduce client drift and improve generalization performance, the problem of local inconsistency is not fully considered. In a real experimental setting, the local interval K is finite and the local update cannot reach the local optimum. As the iterations proceed, the final points of the local iterations remain relatively stable and settle into a dynamic equilibrium. The stability of these points determines the effectiveness of an algorithm, and their positions change when different algorithms are applied. The variance among these points gives rise to the local inconsistency problem (Wang et al., 2021a). However, the analysis in these past works is not comprehensive, and experimental verification of the reduced local inconsistency is lacking. In particular, when data heterogeneity among clients rises, the local updates may conflict with one another, that is, the directions of the local gradients are no longer compatible. Thus, the weighted average of the local gradients at the aggregation stage is extremely small and the global iterate may stagnate, which leads to poor generalization performance.
To settle this problem, a federated learning algorithm needs to incorporate historical information about the full gradient into the clients' local updates in order to reduce the variance between the local dynamic equilibrium points. Furthermore, the historical full gradient information used to steer the local update ought to be applied carefully rather than simply added to the model weights. In this paper, we develop a new FL algorithm that enhances local consistency for free, the Federated Multi-step Inertial Momentum Algorithm (FedMIM), which mitigates client drift and reduces local inconsistency. From a high-level algorithmic perspective, we bring multi-step inertial momentum to the local update, that is, multi-step momentum is placed on both the weight (orange arrow shown in Figure 1) and the gradient (yellow arrow shown in Figure 1) to modify the local update. Rather than calculating the momentum updates on the server side and transmitting them through the down-link, all the clients compute the momentum term before the local iteration, while the historical momentum is kept in the client's storage. FedMIM has two major benefits that address the aforementioned deficiencies. Firstly, FedMIM does not require the server to broadcast the momentum between rounds, which reduces the communication burden. Secondly, in contrast to previous work that focuses on server-side momentum (Karimireddy et al., 2020b) or client-side momentum (Xu et al., 2021), FedMIM uses an inertial momentum term to introduce global information, which avoids mutually conflicting gradients in the local update when there is large data heterogeneity among the participating clients. Theoretically, we provide a detailed convergence analysis for FedMIM. With a proper local learning rate, FedMIM achieves an $O(1/\sqrt{SKT})$ convergence rate with a linear speedup property in the general non-convex setting, where S is the number of selected clients, K is the local interval and T is the number of communication rounds.
For non-convex functions under the PL condition, the convergence rate achieves $O(1/T)$ with a proper setting of the local learning rate. We test the FedMIM algorithm on three datasets (CIFAR-10, CIFAR-100 and TinyImagenet) with i.i.d. splits and different Dirichlet distributions in the empirical studies. The results show that our proposed FedMIM achieves the best performance among the state-of-the-art baselines. When the heterogeneity becomes extreme, the performance of the federated algorithms drops rapidly due to the negative impact of enlarging the local interval, while our proposed FedMIM efficiently maintains stability under the same experimental setups.

Contribution. We summarize the main contributions of this work as follows:
• The FedMIM algorithm uses a multi-step inertial momentum to guide the gradient updates. We show that FedMIM successfully handles heterogeneous datasets, which benefits cross-device implementation in practical applications.
• We present the convergence analysis of FedMIM for general non-convex functions and for non-convex functions under the PL condition. The theoretical analysis highlights the advantage of introducing multi-step inertial momentum and presents the hyperparameter conditions.

3. FEDMIM: FEDERATED MULTI-STEP MOMENTUM ALGORITHM

In this section, we describe how FedMIM works while reducing client drift and improving convergence. To begin with, we provide some preliminaries for FL and the notation adopted in this paper in Section 3.1. We then introduce the diagram of our proposed FedMIM method and the insights behind its improved performance and its resistance to local heterogeneity in Section 3.2.

3.1. PROBLEM SETUP

Consider an FL framework with $N$ local clients and a centralized server that handles the training process. Client $i$, for $i \in [N]$, holds a local private dataset $D_i$ that is not shared, and the data sample $\xi_i$ is randomly drawn from the local dataset $D_i$. The minimization problem can be formulated as:

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x),$$

where $f_i(x) := \mathbb{E}_{\xi_i \sim p_i}[f_i(x; \xi_i)]$.

The local update is performed with a multi-step momentum. It should be noted that $\alpha$ and $\beta$ are the averaging weights in the multi-step momentum term. Each client calculates an unbiased stochastic gradient $g^t_{i,k}$ and updates its state. When the local update stops, $x^t_{i,k}$ is transmitted to the server. The details of the iteration scheme of FedMIM are summarized in Algorithm 1.

Algorithm 1
  for client $i \in S_t$ in parallel do
    Local Update:
    $\delta_t = -(x_t - x_{t-1})/K$
    for $k = 0, 1, 2, \cdots, K-1$ do
      $y^t_{i,k,1} = x^t_{i,k} - \sum_{j \in I} \alpha_j \delta_{t-j}$ (with $\sum_{j \in I} \alpha_j < 1$)
      $y^t_{i,k,2} = x^t_{i,\ldots}$ …

Intuitive Justification. To build intuition for our method, we first highlight the multi-step inertial part. A lemma in the appendix shows that $\delta_t$ is an exponential moving average of past client gradients. The momentum term $\delta_t$ therefore acts as an approximation to the offset given by the gradient of the global loss function $\nabla f(x^t)$, that is, $\delta_t \approx \eta_l \nabla f(x^t)$. Thus, we have the local update:

$$\begin{aligned}
x^t_{i,k+1} &= x^t_{i,k} - (1-A)\eta_l \delta_{t-j} - A\eta_l \nabla f_i\Big(x^t_{i,k} - \eta_l \sum_{j \in I} \beta_j \delta_{t-j}\Big) \\
&\approx x^t_{i,k} - (1-A)\eta_l \nabla f(x^t) - A\eta_l \nabla f_i\big(x^t_{i,k} - \rho \nabla f(x^t)\big) \\
&= x^t_{i,k} - \eta_l \Big[\nabla f_i\big(x^t_{i,k} - \rho \nabla f(x^t)\big) + (1-A)\Big(\nabla f(x^t) - \nabla f_i\big(x^t_{i,k} - \rho \nabla f(x^t)\big)\Big)\Big]. \qquad (2)
\end{aligned}$$

For simplicity, we set the constant $A = 1 - \sum_{j \in I} \alpha_j$ and collect the step size $\eta_l$ and the weights $\beta_j$ into the constant $\rho$. The last line follows from the second by rearranging $A \nabla f_i + (1-A) \nabla f = \nabla f_i + (1-A)(\nabla f - \nabla f_i)$. This equation shows that FedMIM adds a correction term to the local gradient direction, and this correction term matches the difference between the global and local gradients.
The term $\nabla f_i(x^t_{i,k} - \rho \nabla f(x^t))$ in Eq. (2) behaves like a Nesterov gradient: a global momentum $\rho \nabla f(x^t)$ is applied to the local iterate $x^t_{i,k}$ when client $i$ computes the local gradient in the $k$-th local update of the $t$-th communication round. The added global momentum shifts the point at which the local gradient is evaluated, so that, compared with $\nabla f_i(x^t_{i,k})$, the local gradients of different clients point in more similar directions. This helps when the data distributions of the clients differ strongly, since the clients' updates would otherwise diverge from one another in a highly heterogeneous environment. Meanwhile, the added global momentum shrinks gradually as the full gradient $\nabla f(x)$ decreases, so its influence scales down over the course of training. Therefore, each participating client can reach its dynamic equilibrium at the end of training. The final correction term is controlled by the parameter $\alpha$. Note that the local gradient in the correction term is $\nabla f_i(x^t_{i,k} - \rho \nabla f(x^t))$ rather than $\nabla f_i(x^t_{i,k})$, because the direction of the local update in the leading term is $\nabla f_i(x^t_{i,k} - \rho \nabla f(x^t))$ and the correction term ought to be consistent with it.

Discussion. FedMIM saves communication bandwidth, which is a crucial issue in FL. During the broadcasting stage, the server only needs to send the current global state $x^t$ to the clients, rather than the aggregated gradient information as in FedCM. The historical global state is kept in the clients' storage, which is used efficiently: a client only needs to store $J$ steps of historical information, where $J$ is usually set to be very small in practical scenarios, so the client's storage requirement does not grow sharply. In addition, FedMIM calculates the gradient only once per local iteration, while FedSAM computes it twice. Thus, the local computation is shortened and the total training time is reduced.
FedMIM introduces multi-step inertial momentum, which is robust to high client heterogeneity, since the global gradient information prevents the average of the client update directions from becoming vanishingly small and forces the global iterate to move. The multi-step inertial momentum makes the gradient change more smoothly during local training, even when some outlier clients hold discordant data. Looking back over several steps makes the approximation more accurate and smoother for local training, which promotes communication efficiency and enhances the robustness of FedMIM to heterogeneity.
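The local update described in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `fedmim_local_steps`, the quadratic toy loss and all numeric values are hypothetical, and the sketch assumes that the gradient is evaluated at $x^t_{i,k} - \sum_j \beta_j \delta_{t-j}$ and that the step is taken from $y^t_{i,k,1}$ with step size $A\eta_l$, which is one reading of Algorithm 1 and Eq. (2).

```python
import numpy as np

def fedmim_local_steps(x_start, deltas, grad_fn, alphas, betas, eta_l, K):
    """K local steps with multi-step inertial corrections (hedged sketch).

    deltas[j] plays the role of delta_{t-j}, the stored global increments.
    ASSUMPTION: gradient point y2 = x - sum_j beta_j * delta_{t-j} and
    step x <- y1 - A * eta_l * g, with A = 1 - sum_j alpha_j.
    """
    A = 1.0 - float(np.sum(alphas))
    # The corrections depend only on stored history, so they stay fixed within a round.
    shift_iter = sum(a * d for a, d in zip(alphas, deltas))
    shift_grad = sum(b * d for b, d in zip(betas, deltas))
    x = np.array(x_start, dtype=float)
    for _ in range(K):
        y1 = x - shift_iter          # inertial correction on the iterate
        y2 = x - shift_grad          # look-ahead point for the gradient
        g = grad_fn(y2)              # (stochastic) local gradient at y2
        x = y1 - A * eta_l * g
    return x

# Toy client with loss 0.5 * ||x - c||^2 (hypothetical example data).
c = np.array([1.0, -2.0])
grad = lambda x: x - c
x_prev, x_curr, K = np.zeros(2), np.array([0.2, -0.4]), 5
delta_t = -(x_curr - x_prev) / K     # global increment, as in Algorithm 1

x_mim = fedmim_local_steps(x_curr, [delta_t], grad, [0.1], [0.1], eta_l=0.1, K=K)
x_avg = fedmim_local_steps(x_curr, [delta_t], grad, [0.0], [0.0], eta_l=0.1, K=K)
```

With all weights set to zero the sketch reduces to plain local gradient descent, which mirrors the statement in the ablation study that FedMIM degenerates to FedAvg when $\alpha_j = 0$ and $\beta_j = 0$.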

4. CONVERGENCE ANALYSIS

In this section, we provide the theoretical analysis for FedMIM, focusing on the general non-convex setting. Before presenting our convergence analysis, we first state several assumptions.

Assumption 1 For all $x, y \in \mathbb{R}^d$, the non-convex function $f_i$ is $L$-smooth for all $i \in [N]$, i.e.,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|.$$

Assumption 2 Let $f^* = f(x^*)$, where $x^*$ is a minimizer of $f$. The function $f$ satisfies the PL inequality if there exists a constant $\mu > 0$ such that, for all $x \in \mathbb{R}^d$,

$$\frac{1}{2} \|\nabla f(x)\|^2 \ge \mu \big(f(x) - f^*\big).$$

Assumption 3 For all $x \in \mathbb{R}^d$, the stochastic gradient $\nabla f_i(x, \xi)$, computed from the sampled data $\xi$ on the local client $i$, is an unbiased estimator of $\nabla f_i(x)$ with variance bounded by $\sigma_l^2$, i.e.,

$$\mathbb{E}_{\xi}[\nabla f_i(x, \xi)] = \nabla f_i(x), \qquad \mathbb{E}_{\xi} \|\nabla f_i(x, \xi) - \nabla f_i(x)\|^2 \le \sigma_l^2. \qquad (5)$$

Assumption 4 For all $x \in \mathbb{R}^d$, the local functions $f_i$ satisfy $(G, B)$-local dissimilarity with respect to $f$, i.e.,

$$\frac{1}{N} \sum_{i=1}^{N} \|\nabla f_i(x)\|^2 \le G^2 + B^2 \|\nabla f(x)\|^2. \qquad (6)$$

These assumptions are commonly used in federated optimization (Li et al., 2020b; Reddi et al., 2020; Karimireddy et al., 2020a;b). Assumption 1 states the smoothness of the local loss function $f_i$, that is, the gradient of $f_i$ is Lipschitz continuous with Lipschitz constant $L$. Assumption 2 states that the global function satisfies the PL condition. The PL inequality does not require $f$ to be convex, but it implies that every stationary point is a global minimum. The $\mu$-PL property is implied by $\mu$-strong convexity, but it allows for multiple minima and does not require convexity of any kind.

… and $\lambda \in (0, \frac{1}{2})$ in Algorithm 1, the sequence $\{u_t\}$ satisfies the following upper bound:

$$\min_{t \in [T]} \mathbb{E} \|\nabla f(u_t)\|^2 \le \frac{f_0 - f^*}{\eta_l \lambda K T} + \Psi,$$

where

$$\Psi = \frac{1}{\lambda} \Big( \frac{\eta_l L \sigma_l^2}{S} + \frac{4 \eta_l K L G^2}{S} + 9 \eta_l^2 A^2 K L^2 \sigma_l^2 + 72 \eta_l^2 A K^2 L^2 G^2 + 3 \eta_l^2 L^2 K^2 V C \Big).$$

$V$ and $C$ are two constants defined in the proof of the convergence analysis (details are stated in the Appendix).
Remark 4.1 If the number of total clients $N$ is large enough, the initial state point will affect the convergence upper bound to a great extent, which requires a larger local learning rate $\eta_l$ to diminish the negative impact. Specifically, when we fix $N$ as a constant and select a proper local interval $K$, let $\eta_l = O($ …

Theorem 4.2 Let Assumptions 1, 2, 3, and 4 hold and let all the conditions be the same as those required by Theorem 4.1. Given $\eta_l \ge \frac{1}{\mu \lambda K T}$, the output $u_{out}$ chosen randomly from the sequence $\{u_t\}$ satisfies:

$$\mathbb{E} \|\nabla f(u_{out})\|^2 \le 4 \mu (f_0 - f^*) e^{-\mu \eta_l \lambda K T} + \Psi,$$

where

$$\Psi = \frac{1}{\lambda} \Big( \frac{2 \eta_l L \sigma_l^2}{S} + \frac{8 \eta_l K L G^2}{S} + 18 \eta_l^2 K A^2 L^2 \sigma_l^2 + 144 \eta_l^2 A K^2 L^2 G^2 + 6 \eta_l^2 L^2 K^2 V C \Big).$$
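To make Assumption 2 concrete, the following sketch checks the PL inequality numerically on a scalar function that is not convex. The function $f(x) = x^2 + 3\sin^2(x)$, the grid and the candidate constant $\mu = 1/32$ are illustrative choices made here and do not come from the paper. The check only covers the sampled grid, so it is a sanity test and not a proof.

```python
import numpy as np

def f(x):
    return x**2 + 3.0 * np.sin(x)**2      # minimum value f* = 0 at x = 0

def df(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def d2f(x):
    return 2.0 + 6.0 * np.cos(2.0 * x)

xs = np.linspace(-10.0, 10.0, 20001)
xs = xs[np.abs(xs) > 1e-6]                # skip the minimizer to avoid 0/0

# Smallest observed ratio 0.5 * |f'(x)|^2 / (f(x) - f*) over the grid.
mu_hat = float(np.min(0.5 * df(xs)**2 / f(xs)))

# Negative curvature somewhere means the function is not convex.
is_nonconvex = bool(np.any(d2f(xs) < 0.0))

print(f"grid estimate of the PL ratio: {mu_hat:.3f}, non-convex: {is_nonconvex}")
```

On this grid the ratio stays well above $1/32$ even though the second derivative changes sign, which matches the remark that the PL condition does not require convexity.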

5. EXPERIMENTS

In this section, we demonstrate the efficacy of the proposed FedMIM. We test the generalization performance under different levels of the heterogeneity on the real-world dataset. To ensure a fair comparison, we fix all the common hyper-parameters and finetune the specific parameters unique to each algorithm to search for their best performance. We provide a brief introduction to the experimental setups in 5.1. We compare the proposed FedMIM with the baselines and report their performance in 5.2. Some ablation studies and hyper-parameters sensitivity studies are stated in 5.3.

5.1. SETUPS

Dataset. We conduct extensive experiments on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and TinyImagenet (Le & Yang, 2015). Both CIFAR-10 and CIFAR-100 contain 50K training samples and 10K test samples of images with the size of 32 × 32. TinyImagenet contains 200 categories with 100K training samples and 10K test samples of images with the size of 64 × 64 selected from Imagenet (Deng et al., 2009). We divide the training dataset into N parts and deploy them to the local clients without shared access. At the beginning of each communication round in the training, we randomly crop, horizontally flip, and normalize the local dataset as the common data augmentation.

Heterogeneity. For the IID setting, the local dataset is randomly sampled. For the non-IID setting, we follow Hsu et al. (2019) and introduce different levels of heterogeneity by sampling the label ratios from different Dirichlet distributions, which is a common federated setup in previous works (Reddi et al., 2020; Karimireddy et al., 2020b; Acar et al., 2021; Xu et al., 2021; Qu et al., 2022; Kim et al., 2022). In addition, we superimpose a color perturbation (Arjovsky et al., 2019) that is strongly correlated with the local clients to further induce heterogeneity. Specifically, we apply different brightness and saturation coefficients to the different local training data samples.

Baselines. We compare the performance of several SOTA baselines, including FedAvg (McMahan et al., 2017), FedAdam (Reddi et al., 2020), SCAFFOLD (Karimireddy et al., 2020b), FedDyn (Acar et al., 2021), FedCM (Xu et al., 2021), and FedSAM (Qu et al., 2022), on the backbone of the standard ResNet-18 network implemented in the PyTorch Model Zoo (7 × 7 filter in the 1st conv) (He et al., 2016) with group normalization (Wu & He, 2018; Hsieh et al., 2020). We summarize and discuss these methods in Section 3.2 to illustrate the respective improvements and practical performance.

Hyper-parameters selections. To ensure a fair comparison, we fix the common hyper-parameter setups. We set the local learning rate to 0.1 and decay it by a factor of 0.998 per round. The global learning rate is set to 1.0 to aggregate the local parameters without decay, except for FedAdam, which adopts 0.1. The local mini-batch size is selected from {20, 50}. The weight decay is set to 1e-3. The number of local training epochs is selected from {1, 2, 5, 10} to further show the impact of enlarging the local interval. The number of total clients is selected from {100, 500}, and the sampling probability of each client being activated per communication round is selected from {0.2, 0.1, 0.02}. The prox-weight in FedDyn and the client-level momentum weight in FedCM are both set to 0.1. We report the detailed hyper-parameter selections for each experimental result in the following figures and tables.
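The Dirichlet-based non-IID split can be sketched as follows. This is one simple way to realize "sampling the label ratios from a Dirichlet distribution" and is not taken from the paper's code: the function name `dirichlet_label_split`, the fixed number of samples per client and the sampling with replacement are all simplifying assumptions made for this illustration.

```python
import numpy as np

def dirichlet_label_split(labels, num_clients, concentration,
                          samples_per_client, seed=0):
    """Assign sample indices to clients with Dirichlet-distributed label ratios.

    ASSUMPTION: each client draws its own label-ratio vector from a symmetric
    Dirichlet(concentration) and then samples indices class by class, with
    replacement, so clients may share samples in this simplified sketch.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    pools = {c: np.flatnonzero(labels == c) for c in classes}
    client_indices = []
    for _ in range(num_clients):
        ratios = rng.dirichlet(concentration * np.ones(len(classes)))
        counts = rng.multinomial(samples_per_client, ratios)
        picked = [rng.choice(pools[c], size=n, replace=True)
                  for c, n in zip(classes, counts) if n > 0]
        client_indices.append(np.concatenate(picked))
    return client_indices

# Synthetic 10-class labels stand in for a real dataset (hypothetical data).
labels = np.random.default_rng(1).integers(0, 10, size=5000)
parts = dirichlet_label_split(labels, num_clients=100, concentration=0.1,
                              samples_per_client=50)
```

A smaller concentration (for example 0.1 instead of 0.3) concentrates each client's samples on fewer classes, which is the sense in which Dirichlet-0.1 is more heterogeneous than Dirichlet-0.3.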

5.2.1. COMPARISON ON INCREASING THE HETEROGENEITY.

To explore the impact of introducing heterogeneity, we select three splitting methods for the datasets: IID, Dirichlet-0.3 and Dirichlet-0.1. On the simple CIFAR-10/100 datasets, FedMIM achieves the top-1 performance in all three settings. In the IID case on CIFAR-10, FedMIM achieves 86.39%, which is 4.23% above the FedAvg baseline. The second-best method, FedCM, reaches 85.62%. When the heterogeneity is increased to DIR-0.3, FedCM drops from 85.62% to 82.39%, a loss of 3.23%, while FedMIM drops by only 2% and maintains the top-1 accuracy. Other methods such as SCAFFOLD, FedDyn and FedSAM are affected to different degrees. When we further enlarge the heterogeneity to DIR-0.1, FedAdam is the most affected, and its accuracy drops from 83.19% to 71.75%, a loss of approximately 12%. On the larger CIFAR-100 and TinyImagenet datasets, similar results can be observed. Our proposed FedMIM has very stable test accuracy. In particular, when the heterogeneity is increased to DIR-0.1 on CIFAR-100, FedMIM drops by only 1%, which is far better than the other methods, which lose at least 2%. Its performance is also better than the test accuracy of most other baselines on DIR-0.3. In the IID splitting of TinyImagenet, FedAdam can even achieve the top-1 performance, but once heterogeneity is introduced its performance drops rapidly and becomes even worse than that of FedAvg. FedMIM applies the inertial momentum to both the iteration points and the gradients in local training and enhances the local consistency, which efficiently resists heterogeneity. The multi-step momentum makes the gradient change more smoothly during training. Even when some badly sampled clients hold very different datasets, looking back over several steps keeps the approximation accurate and stable for local training, which improves the efficiency of FedMIM and its robustness to heterogeneity.

5.2.2. COMPARISON ON ENLARGING THE LOCAL INTERVAL.

To further explore the impact of enlarging the local interval K, we select three different local intervals to test the performance of our proposed FedMIM and the other baselines. We follow previous works (Acar et al., 2021; Xu et al., 2021; Kim et al., 2022) and compare the performance under different local epochs E = 1, 5, 10. The total number of training samples in CIFAR-10/100 is 50,000, and they are split equally into 100 parts with 500 samples per local client. For a fair comparison, we fix the batchsize to 50, which means the number of local iterations is TrainSamples/Batchsize = 10 per epoch. It should be noted that in the proof the local interval K corresponds to the number of iterations. When E = 1, the local interval is short and local training does not introduce more local heterogeneity to the global view. When the local epochs are enlarged to 10, the long local update exacerbates the inconsistency problem and has a negative impact on the test accuracy. Especially on the large TinyImagenet dataset, most algorithms fail to converge at T = 1000. Thus the test accuracy can be regarded as a proxy for the convergence rate of all the methods. FedMIM achieves the top-1 accuracy for both short and long local epochs. In the local iteration, the inertial momentum, which promotes the local consistency, plays an important role in the stochastic gradient estimation. FedMIM obtains an iterate closer to the global iterate by perturbing the point at which the local gradient is evaluated, which approximates the global direction, and then updates it by one gradient descent step. This allows the local update to be corrected not only in the gradient term, but also in the point where the gradient is calculated. In the next part, we discuss the local consistency of the baselines and some ablation studies, including the participation ratio, the selection of $\alpha_j$ and $\beta_j$, and the number of steps.
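The relation between the local epochs E and the local interval K used in the analysis is simple arithmetic, and a short helper makes it explicit. The function name `local_iterations` is introduced here for illustration only, and the sketch assumes an equal split of the training set and full batches.

```python
def local_iterations(num_train, num_clients, batch_size, local_epochs):
    """Number of local SGD iterations K per communication round.

    ASSUMPTION: the training set is split equally among the clients and the
    per-client sample count is a multiple of the batch size.
    """
    samples_per_client = num_train // num_clients
    iters_per_epoch = samples_per_client // batch_size
    return local_epochs * iters_per_epoch

# CIFAR-10/100 setup from the text: 50,000 samples, 100 clients, batchsize 50.
ks = [local_iterations(50_000, 100, 50, e) for e in (1, 5, 10)]
print(ks)  # -> [10, 50, 100]
```

So the three settings E = 1, 5, 10 correspond to K = 10, 50 and 100 local iterations in the notation of the convergence analysis.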

5.3.1. PARTICIPATION RATIO

We run experiments on the CIFAR-10 dataset under different participation ratios, selected from {5%, 20%}, to test the convergence rate with the local epochs fixed to 5 and the batchsize fixed to 50. The heterogeneity is set to DIR-0.1. As shown in Figure 2 (a), under this strong heterogeneity FedAvg suffers the largest loss in convergence speed. FedAdam, FedCM, and FedDyn show a high sensitivity to the participation ratio. SCAFFOLD performs well and maintains excellent generalization performance through the variance reduction technique at the higher participation ratio. Benefiting from the inertial momentum, FedMIM can estimate the global direction accurately, and the multi-step calculation further enhances the stability and smoothness of this estimate. Our proposed FedMIM shows very stable performance across different levels of heterogeneity and participation ratios, especially in the extremely heterogeneous settings.

We test different selections of the number of steps J and of $\alpha_j$, $\beta_j$. The experimental setup is: 5 local epochs, 500 total communication rounds and batchsize fixed to 50. When J = 2, FedMIM achieves the best generalization performance. As shown in Table 3, if we set $\alpha_j = 0$ and $\beta_j = 0$, FedMIM degenerates to the FedAvg method, and if we set $\alpha_2 = 0$ and $\beta_j = 0$, FedMIM degenerates to the FedCM method. The results show that a $\beta_j$ with a long history is not a good selection for the local clients, because information from far before the current time is outdated, and local updates are misled by the redundant, invalid offset. The most recent update is the most important. The choice of $\alpha_j$ is more relaxed and can be searched over the last two or three steps. Based on the empirical studies, we recommend deciding the selection by different indicators; a large $\alpha_j$ and a proper $\beta_j$ work better.

5.3.3. LOCAL CONSISTENCY

We measure the consistency during training as $\frac{1}{S} \sum_i \|x^t_{i,K} - x^{t+1}\|^2$, where $x^{t+1} = \frac{1}{S} \sum_i x^t_{i,K}$. In practical training, the local models cannot approach the true local optima due to the limited local interval K, so the $x^t_{i,K}$ represent the dispersion around the global model $x^{t+1}$. Keeping the $x^t_{i,K}$ close to each other improves the resistance to local heterogeneity (the idealized case is that all local clients always generate the same parameters in each round). Figure 2 (b) shows the empirical consistency on the different datasets. FedMIM maintains local similarity more effectively than the other baselines on both CIFAR-10 and CIFAR-100.
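The consistency measure defined above can be computed directly from the flattened local models at the end of a round. The sketch below is a minimal illustration; the function name `local_consistency` and the two toy parameter vectors are hypothetical.

```python
import numpy as np

def local_consistency(local_models):
    """Mean squared distance of the local models x^t_{i,K} to their average.

    local_models: array-like of shape (S, d), one flattened model per
    active client. The average plays the role of x^{t+1}.
    """
    X = np.asarray(local_models, dtype=float)
    x_next = X.mean(axis=0)
    return float(np.mean(np.sum((X - x_next) ** 2, axis=1)))

# Two toy "models" at distance 1 from their mean (hypothetical values).
value = local_consistency([[0.0, 0.0], [2.0, 0.0]])
print(value)  # -> 1.0
```

The measure is zero exactly when all active clients return the same parameters, which is the idealized case mentioned in the text.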

6. CONCLUSION

In this work, we propose a novel federated algorithm, named FedMIM, which adopts multi-step inertial momentum to guide the local training on heterogeneous clients, both in the gradient estimation and in the iterate at which the gradient is calculated. We also theoretically prove that, in the non-convex case, the proposed FedMIM achieves an $O(1/\sqrt{SKT})$ rate under the smoothness assumptions and $O(1/T)$ under the Polyak-Lojasiewicz (PL) inequality. FedMIM efficiently improves the local consistency to mitigate the influence of heterogeneous datasets. We conduct extensive experiments to demonstrate the strong performance of our proposed FedMIM on real-world datasets. Furthermore, we conduct ablation studies to verify its stability under different setups.



Figure 1: Local steps of FedAvg and FedMIM with 2 clients.

Another efficient approach is to add regularization terms to the local training process to correct the local objective so that it approaches the global optimum. Li et al. (2020b) first employ the prox-term in the FL framework and propose FedProx. Tran Dinh et al. (2021) propose FedDR with a Douglas-Rachford splitting in the training. Zhang et al. (2020) adopt the primal-dual method in FL. Acar et al. (2021) improve FedPD and propose a method with partially merged parameters and fully merged dual variables on the global server, named FedDyn. Wang et al. (2022) and Gong et al. (2022) adopt the alternating direction method of multipliers in FL to extend the federated primal-dual methods. Fallah et al. (2020) put forward a personalized federated framework with regularization to achieve better generalization. T Dinh et al. (2020) incorporate Moreau envelopes in the local training with a stage-wise prox-term. Huang et al. (2021) propose an adaptive weight for the regularization term to encourage the clients to aggregate more with similar neighbours. Efficient regularization methods are important to the FL field.

Global / Local momentum. Inspired by the success of the global correction technique, the exponential moving average term is introduced to the federated learning framework to correct the local training. Liu et al. (2020) adopt momentum-SGD on the local clients to improve the generalization performance, with a convergence analysis. Wang et al. (2019) propose a global momentum method to further improve the stability on the server side. Xu et al. (2021) incorporate the global offset into the local client as a client-level momentum to correct the heterogeneous drifts. Ozfatura et al. (2021) combine the global and local momentum updates and propose the FedADC algorithm to avoid local over-fitting. Reddi et al. (2020) set up a global ADAM optimizer with the momentum update and propose the adaptive federated optimizer in FL. Wang et al. (2021b) correct the pre-conditioner on the global server. Though momentum terms are biased estimates of the global information, they still contribute a lot to federated frameworks in practical empirical experiments.

… the convergence rate achieves at least $O(1/\sqrt{SKT})$, which indicates the linear speedup of FedMIM and shows that the stochastic variance dominates the upper bound of the convergence.

The term introduced by the initial point diminishes exponentially with the number of communication rounds $T$. Letting $\eta_l = O\big(\frac{\log(\mu^2 S T)}{\mu \lambda K T}\big) \ge \frac{1}{\mu \lambda K T}$, the convergence rate achieves at least $O\big(\frac{1}{\mu S T}\big)$.

Remark 4.3 The constant $B$ in Assumption 4 only weakly influences the convergence bounds in Theorems 4.1 and 4.2 in our proof, which indicates that the major negative impact of the heterogeneity comes from the constant upper bound $G$. If $G$ remains stable without large fluctuations during the training, letting $x = x^*$ and using $\nabla f(x^*) = 0$ we have $\frac{1}{N} \sum_{i=1}^{N} \|\nabla f_i(x^*)\|^2 \le G^2$, so $G$ measures the local inconsistency of all the clients. Enhancing the local consistency will further improve the performance in the FL framework.

Figure 2: (a) Comparison of different participation ratios on the CIFAR-10 dataset and (b) comparison of different methods in terms of local consistency. The other hyper-parameters are fixed. In (a), the left/right panels show participation ratios of 5%/20%. In (b), the left/right panels show the consistency on the CIFAR-10/100 datasets.

Here $f_i$ is the local loss function corresponding to client $i$ with the data distribution $p_i$. Note that $p_i$ may differ among the local clients, which introduces heterogeneity.

Notations. In this paper, we consider $K$ local iteration steps and $T$ total communication rounds in the training. $\|\cdot\|$ denotes the Euclidean $l_2$ norm if not otherwise specified. In the $t$-th communication round, a set of active clients $S_t$ with size $S$ is selected. The symbol $(\cdot)^t_{i,k}$ represents a vector at the $k$-th local step on the $i$-th client after $t$ communication rounds. For simplicity, $[N]$ represents the set $\{1, 2, \cdots, N\}$. $x^*$ and $x^*_i$ stand for the global optimum and the local optimum of client $i$, respectively.

… to the server, while the server aggregates the received local models into the updated global model. On the local clients, the global model increment $\delta_t$ is first calculated from the last global model $x_{t-1}$ kept in their own local storage. Then, the clients compute the momentum-updated $y^t_{i,k,1}$ and $y^t$ …

Assumption 3 bounds the variance of the stochastic gradient, and Assumption 4 bounds the level of heterogeneity among the local private datasets.

We now state our convergence results for FedMIM. The detailed proof is given in the Appendix. Let Assumptions 1, 3, and 4 hold. Assume the partial participation ratio is $|S_t|/N$, where $S_t$ is a uniformly sampled subset of the $N$ clients satisfying $|S_t| = S$, and let $u_t = x_t - $ …

Test accuracy (%) after 1000 communication rounds on the IID, Dirichlet-0.3 (DIR.3), and Dirichlet-0.1 (DIR.1) datasets. We set the number of clients to 100 and the participation ratio to 0.1. The local interval is fixed to 5 epochs and the batchsize is fixed to 50 for all algorithms.

Test accuracy (%) after the corresponding proper communication rounds and local epoch E = 1, 5, 10 respectively. We set the number of clients as 100 and the participation ratio as 0.1. The batchsize is fixed as 50 and the heterogeneous dataset is divided as the Dirichlet-0.6 distribution.

Selection of $\alpha_j$ and $\beta_j$.

