COMMUNICATION-EFFICIENT FEDERATED LEARNING WITH ACCELERATED CLIENT GRADIENT

Abstract

Federated learning often suffers from slow and unstable convergence due to the heterogeneous characteristics of participating client datasets. This tendency is aggravated when the client participation ratio is low, since the information collected from the clients is prone to large variations. To tackle this challenge, we propose a novel federated learning framework that improves the consistency across clients and facilitates the convergence of the server model. This is achieved by making the server broadcast a global model with a gradient acceleration. By adopting this strategy, the proposed algorithm conveys the projective global update information to participants effectively with no extra communication cost and relieves the clients from storing the previous models. We also regularize local updates by aligning each client with the overshot global model to reduce bias and improve the stability of our algorithm. We perform comprehensive empirical studies on real data under various settings and demonstrate remarkable performance gains of the proposed method in terms of accuracy and communication efficiency compared to the state-of-the-art methods, especially with low client participation rates. We will release our code to facilitate and disseminate our work.

1. INTRODUCTION

Federated learning (McMahan et al., 2017) is a large-scale machine learning framework that learns a shared model in a central server through collaboration with a large number of remote clients with separate datasets. This decentralized learning concept allows federated learning to achieve a basic level of data privacy since the server does not observe training data directly. On the other hand, remote clients such as mobile or IoT devices have limited communication bandwidths, so federated learning algorithms are particularly sensitive to communication costs. A baseline algorithm of federated learning, FedAvg (McMahan et al., 2017), updates a subset of its client models by gradient descent on their local data and then uploads the resulting models to the server, which computes the global model parameters via model averaging. As discussed extensively in the convergence analyses of FedAvg (Stich, 2019; Yu et al., 2019; Wang & Joshi, 2021; Stich & Karimireddy, 2019; Basu et al., 2020), performing multiple local updates before server-side aggregation provides theoretical support and practical benefit for federated learning by greatly reducing the communication cost.

Despite the initial success, federated learning faces two key challenges: high heterogeneity in the training data distributed over clients and limited participation rates of clients. Several studies (Zhao et al., 2018; Karimireddy et al., 2020) have shown that multiple local updates in clients with non-i.i.d. (independent and identically distributed) data lead to client model drift, in other words, diverging updates in the individual clients. Such a phenomenon introduces a high-variance issue in the FedAvg step for global model updates, which hampers the convergence to the optimal average loss over all clients (Li et al., 2020; Wang et al., 2019b; Khaled et al., 2019; Li et al., 2019b; Hsieh et al., 2020; Wang et al., 2020).
The challenge related to client model drift is exacerbated when the client participation rate per communication round is low, due to unstable client device operations and limited communication channels.

To properly address the client heterogeneity issue, we propose a novel optimization algorithm for federated learning, Federated averaging with Accelerated Client Gradient (FedACG), which conveys the momentum of the global gradient to clients and enables the momentum to be incorporated into the local updates in the individual clients. Specifically, we introduce an extra-gradient step on the global model via the global momentum, which allows each client to perform its local gradient step along the future gradient. This approach turns out to be effective for reducing the gap between global and local losses. Contrary to the existing methods that need to send additional bits to communicate the momentum, FedACG transmits the global model integrated with the momentum in the form of a single message and thus saves communication cost. In addition, FedACG adds a regularization term to the objective function of clients to make the local gradients more consistent across clients.

Although a growing number of works handle client heterogeneity in federated learning, FedACG has the following major advantages. Unlike existing approaches focusing on server-level optimization (Reddi et al., 2021; Wang et al., 2019a; Hsu et al., 2019) or client-level optimization (Xu et al., 2021; Acar et al., 2021; Karimireddy et al., 2020; Li et al., 2021; 2020; Zhang et al., 2020; Karimireddy et al., 2021; Li et al., 2019a; Liang et al., 2019), FedACG incorporates the momentum based on the global gradient information for client-side updates. This strategy allows the proposed algorithm to achieve the same level of task-specific performance with fewer communication rounds.
Moreover, while most existing methods have additional requirements compared to FedAvg, including full participation (Liang et al., 2019; Zhang et al., 2020; Khanduri et al., 2021), additional communication bandwidth (Xu et al., 2021; Karimireddy et al., 2020; Zhu et al., 2021; Karimireddy et al., 2021; Li et al., 2019a; Das et al., 2020; Gao et al., 2022), and memory budgets in clients to store local states or variables (Acar et al., 2021; Karimireddy et al., 2020; Li et al., 2021; Gao et al., 2022), FedACG is completely free from any additional communication and memory overhead, which ensures compatibility with large-scale and low-participation federated learning problems.

The main contributions of this paper are summarized as follows.

• We propose a communication-efficient federated optimization algorithm that deals with client heterogeneity effectively. The proposed approach employs the global momentum for the acceleration of client gradients to facilitate the optimization of local models.
• We also revise the objective function of clients by adding a regularization term on the local gradient direction, which further aligns the gradients of the server and the individual clients.
• We show that the proposed approach does not require any additional communication cost or memory overhead, which is desirable for real-world settings of federated learning.
• We demonstrate the outstanding performance of our optimization technique in terms of communication efficiency and robustness to client heterogeneity, especially when the participation ratio is low.

2. RELATED WORK

Federated learning was first introduced in McMahan et al. (2017), which formulates the problem and provides the FedAvg algorithm as a solution for its key challenges such as non-i.i.d. client data, massively distributed clients, and partial participation of clients. Many works explore the negative influence of heterogeneity in federated learning empirically (Zhao et al., 2018) and derive convergence rates depending on the level of heterogeneity (Li et al., 2020; Wang et al., 2019b; Khaled et al., 2019; Li et al., 2019b; Hsieh et al., 2020; Wang et al., 2020).

There exists a long line of research on client-side optimization to prevent the divergence of clients from the global model. FedProx (Li et al., 2020) penalizes the difference between the server and client parameters, while FedDyn (Acar et al., 2021) and FedPD (Zhang et al., 2020) use cumulative gradients of each client to dynamically regularize local updates. FedDC (Gao et al., 2022) introduces auxiliary drift variables for each client to reduce the impact of the local drift on the global objective. Another line of work adopts variance reduction techniques in client updates to eliminate inconsistent updates across clients. SCAFFOLD (Karimireddy et al., 2020) and Mime (Karimireddy et al., 2021) employ control variates for local updates, while FedDANE (Li et al., 2019a) and FedCM (Xu et al., 2021) add a gradient correction term based on the server gradient. FedPA (Al-Shedivat et al., 2021) de-biases client updates by estimating the global posterior on the client side. On the other hand, some approaches adopt a contrastive loss (Li et al., 2021), knowledge distillation (Kim et al., 2022), or a generative model (Zhu et al., 2021) [...] 2022) and ASAM (Caldarola et al., 2022) apply SAM (Foret et al., 2021) as a client-side optimizer for reducing the gap between global and local losses. However, most of these methods require full participation (Zhang et al., 2020; Khanduri et al., 2021), additional communication cost (Xu et al., 2021; Karimireddy et al., 2020; Zhu et al., 2021; Karimireddy et al., 2021; Li et al., 2019a; Das et al., 2020; Gao et al., 2022), or extra client storage (Acar et al., 2021; Karimireddy et al., 2020; Li et al., 2021; Gao et al., 2022), which can be problematic in realistic federated learning tasks.

Server-side optimization techniques have also been explored for the stability and speedup of convergence. These approaches adopt momentum SGD (Hsu et al., 2019) or an adaptive gradient-descent method (Reddi et al., 2021; Caldarola et al., 2022), while FedDF (Lin et al., 2020) utilizes the averaged representations of local models on proxy data for aggregation. STEM (Khanduri et al., 2021) and FedGLOMO (Das et al., 2020) apply the STORM algorithm (Cutkosky & Orabona, 2019) to both server-level and client-level SGD procedures to reduce the high variance in server model updates.

Meanwhile, another set of works aims to decrease the communication cost per round by compressing the transmitted model. FedPAQ (Reisizadeh et al., 2020), FedCAMS (Wang et al., 2022), and FedCOMGATE (Haddadpour et al., 2021) quantize the communicated message by using low bit precision, while FedPara (Nam et al., 2022) uses a low-rank Hadamard product to reparameterize the model's weights. These works are orthogonal to our approach, so they can be readily combined with our proposed method.

Algorithm 1 (recovered portion; the beginning of the listing is missing in the source)
    [...] $L_i(\theta) + \frac{\beta}{2} \|\theta - (\theta^{t-1} + \lambda m^{t-1})\|^2$
        Client sends $\theta_i^t$ back to the server.
      end
      In server:
        $\theta^t = \sum_{i \in S_t} \omega_i \theta_i^t$
        $m^t = \theta^t - \theta^{t-1}$
    end
    Return $\theta^t$

3. METHOD

In the federated learning setting, there are $N$ clients that optimize their local models based on the corresponding private datasets, as well as a central server that broadcasts the global model and then aggregates messages from the clients. Let $L_i(\theta) := \mathbb{E}_{(x,y) \sim D_i}[\ell_i((x, y); \theta)]$ be the loss function of the $i$-th client, $i \in \{1, \dots, N\}$, with a local dataset denoted by $D_i$. Then, our goal is to train a model that minimizes the average loss of all clients as follows:

$\min_\theta L(\theta) := \sum_{i=1}^{N} \omega_i L_i(\theta)$, (1)

where $\theta$ is the parameter of the global model and $\omega_i$ is the normalized weight of the $i$-th client, proportional to the size of its local dataset. We focus on the non-i.i.d. data setting, where local datasets have heterogeneous distributions. Note that the communication of training data between clients and the central server is strictly prohibited in principle due to privacy.
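As a small illustration of the weighted objective above, the following sketch computes the normalized weights $\omega_i$ from dataset sizes and evaluates the weighted average loss. The function names and the scalar toy losses are hypothetical and are not taken from the paper's code.

```python
def client_weights(dataset_sizes):
    """Normalized weights omega_i, proportional to each client's dataset size."""
    total = sum(dataset_sizes)
    return [n / total for n in dataset_sizes]


def global_loss(theta, local_losses, dataset_sizes):
    """Weighted average of the local losses L_i(theta).

    local_losses: hypothetical callables, one per client, mapping theta to a float.
    """
    weights = client_weights(dataset_sizes)
    return sum(w * loss(theta) for w, loss in zip(weights, local_losses))


# Toy usage: two clients with quadratic losses and unequal dataset sizes.
toy_losses = [lambda t: (t - 1.0) ** 2, lambda t: (t - 3.0) ** 2]
toy_value = global_loss(2.0, toy_losses, dataset_sizes=[100, 300])
```

Here the larger client contributes three times the weight of the smaller one, which mirrors the role of $\omega_i$ in the objective.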

3.1. FEDACG

To reduce the inconsistency between the local models and the consequent divergence of the global one, we incorporate the global momentum into local models for guiding local updates.

Overall framework Each round of FedACG starts from the server. In the $t$-th communication round, the server computes its momentum $m^{t-1} := \theta^{t-1} - \theta^{t-2}$ and broadcasts the accelerated global model $\theta^{t-1} + \lambda m^{t-1}$ as a single message to the active client set $S_t \subseteq \{1, \dots, N\}$, where $\lambda \in (0, 1]$ controls the importance of the global momentum. Each participating client optimizes its local model from the momentum-integrated initialization. The objective of each client is to minimize the sum of its empirical loss on the local data and a penalty on the difference between the local online model and the accelerated global model, which is given by

$\min_{\theta_i^t} F_i(\theta_i^t) := L_i(\theta_i^t) + \frac{\beta}{2} \|\theta_i^t - (\theta^{t-1} + \lambda m^{t-1})\|^2$, (2)

where $\beta$ controls the balance between the two terms. Each client uploads its trained model $\theta_i^t$ to the server, and then the server constructs the next server model $\theta^t$ via a simple aggregation, i.e., $\theta^t = \sum_{i \in S_t} \omega_i \theta_i^t$. Algorithm 1 presents the procedure of FedACG.
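The round described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are lists of floats, each client runs full-batch gradient descent instead of mini-batch SGD, the aggregation weights of the participating clients are assumed to sum to one, and the names `fedacg_round` and `grad_fns` as well as the default values of `lam`, `beta`, `lr`, and `local_steps` are hypothetical.

```python
def fedacg_round(theta_prev, m_prev, grad_fns, weights,
                 lam=0.5, beta=0.1, lr=0.1, local_steps=50):
    """One FedACG-style round on a list-of-floats parameter vector.

    grad_fns: hypothetical per-client callables returning the gradient of L_i.
    weights:  aggregation weights of the participating clients (assumed to sum to 1).
    """
    # Server: broadcast the accelerated model theta^{t-1} + lam * m^{t-1}.
    theta_acc = [p + lam * m for p, m in zip(theta_prev, m_prev)]

    local_models = []
    for grad_fn in grad_fns:
        theta_i = list(theta_acc)  # clients start from the accelerated model
        for _ in range(local_steps):
            g = grad_fn(theta_i)
            # Gradient of L_i plus gradient of beta/2 * ||theta - theta_acc||^2.
            theta_i = [p - lr * (g_j + beta * (p - a))
                       for p, g_j, a in zip(theta_i, g, theta_acc)]
        local_models.append(theta_i)

    # Server: weighted average of the client models, then momentum update.
    theta_new = [sum(w * model[j] for w, model in zip(weights, local_models))
                 for j in range(len(theta_prev))]
    m_new = [n - p for n, p in zip(theta_new, theta_prev)]
    return theta_new, m_new


# Toy usage: two clients with quadratic losses 0.5 * (theta - c_i)^2.
centers = [1.0, 3.0]
grad_fns = [lambda th, c=c: [th[0] - c] for c in centers]
theta, m = [0.0], [0.0]
for _ in range(50):
    theta, m = fedacg_round(theta, m, grad_fns, weights=[0.5, 0.5])
```

Note that only model-sized messages cross the network in this sketch: the server sends `theta_acc`, and each client returns its trained model, with no per-client state kept between rounds.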

Accelerated client gradient

The main idea of this work, accelerated client gradient, is to leverage the global momentum and allow clients to look ahead at the landscape of the global loss. The momentum $m^t$ serves as an approximate gradient of the global loss since it maintains past global updates even in the partial participation setting. Let the local update of each participating client be $\Delta_i^t = \theta_i^t - (\theta^{t-1} + \lambda m^{t-1})$. Then, $m^t$ is defined recursively with an exponential decay factor $\lambda$ as

$m^t = \theta^t - \theta^{t-1} = \Delta^t + (\theta^{t-1} + \lambda m^{t-1}) - \theta^{t-1} = \Delta^t + \lambda m^{t-1}$,

where $\Delta^t = \sum_{i \in S_t} \omega_i \Delta_i^t$ denotes the expected local update of all participating clients in the current round $t$. As illustrated in Figure 1, FedACG makes an anticipatory update by adding $\lambda m^{t-1}$ to the previous global model $\theta^{t-1}$. This strategy allows the updates of each client to be aligned with the trajectory of the global gradients, which improves the consistency of local updates in FedACG. Our approach has a similar motivation to meta-learning (Finn et al., 2017), where a meta-learner identifies the optimal point to facilitate the optimization of all target tasks.

Regularization with momentum-integrated model In addition to the initial acceleration for local training, the second term of our local objective function in Eq. (2) takes advantage of the global gradient information to reduce the variations of the client-specific gradients, $\Delta_i^t$. This regularization term keeps the local model from deviating from the accelerated point, preventing each client from falling into biased local minima.
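To make the exponential decay in the momentum recursion explicit, the recursion can be unrolled. The worked equation below assumes a zero initial momentum, $m^0 = 0$, which is an assumption made here for illustration rather than a detail stated in the text.

```latex
\[
m^{t} = \Delta^{t} + \lambda m^{t-1}
      = \Delta^{t} + \lambda \Delta^{t-1} + \lambda^{2} m^{t-2}
      = \sum_{k=1}^{t} \lambda^{t-k} \Delta^{k},
\qquad
\theta^{t} + \lambda m^{t} = \theta^{t} + \sum_{k=1}^{t} \lambda^{t-k+1} \Delta^{k}.
\]
```

Under this assumption, the broadcast model is the current global model shifted by a geometrically weighted sum of all past aggregated updates, so older rounds contribute less as $\lambda$ decreases.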

3.2. DISCUSSION

While our formulation has something in common with existing works that also address client heterogeneity by employing global gradient information for the local update, FedACG has the following major advantages. First, contrary to Karimireddy et al. (2020), Xu et al. (2021), and Gao et al. (2022), the server and clients only communicate model parameters, without the additional network overhead of transmitting gradients and other information; the server broadcasts $\theta^{t-1} + \lambda m^{t-1}$ as a single message and each client sends $\theta^t_{i,K}$ to the server. This is a critical benefit because an increase in communication cost is a challenge for many realistic federated learning applications involving clients with limited network bandwidths. Second, FedACG is robust to a low participation rate of clients and allows newly arriving clients to join the training process immediately without a warmup phase because, unlike in Karimireddy et al. (2020), Acar et al. (2021), Li et al. (2021), and Gao et al. (2022), the clients neither store local states nor use them for model updates.

3.3. CONVERGENCE ANALYSIS OF FEDACG

We now present the theoretical convergence rate of FedACG. We first state two assumptions on the local loss functions $\ell_i(\theta)$, which are commonly used in several previous works on federated optimization (Karimireddy et al., 2020; Reddi et al., 2021; Xu et al., 2021; Acar et al., 2021). First, the local function $\ell_i(\cdot)$ is assumed to be $L$-smooth for all $i \in \{1, \dots, N\}$, i.e.,

$\|\nabla \ell_i(x) - \nabla \ell_i(y)\| \le L \|x - y\| \quad \forall x, y.$

Second, if we additionally assume the convexity of the functions $\{\ell_i(\cdot)\}_{i=1}^{N}$, we have

$\forall x, \quad \frac{1}{2LN} \sum_{i=1}^{N} \|\nabla \ell_i(x) - \nabla \ell_i(x^*)\|^2 \le \ell(x) - \ell(x^*)$ and (5)

$\forall x, y, z, \quad \langle \nabla \ell_i(x), z - y \rangle \le -\ell_i(z) + \ell_i(y) + \frac{L}{2} \|z - x\|^2$,

where $\ell(x) = \frac{1}{N} \sum_{i=1}^{N} \ell_i(x)$ and $\nabla \ell(x^*) = 0$. Based on the above assumptions, we derive the following asymptotic convergence bound of FedACG. Note that we make no further assumptions such as the bounded variance and bounded gradients used in Karimireddy et al. (2020), Reddi et al. (2021), and Xu et al. (2021).

Theorem 1. Assuming the convexity and $L$-smoothness of $\{\ell_i(\cdot)\}_{i=1}^{N}$, for $\frac{1}{2} < \lambda < 1$, Algorithm 1 satisfies

$\mathbb{E}\left[\ell\left(\frac{1}{T} \sum_{t=1}^{T} \theta^{t-1}\right) - \ell(\theta^*)\right] \le \frac{\sqrt{\lambda(1-\lambda)}}{T} \left( L \|\theta^0 - \theta^*\|^2 + \frac{1}{LN} \sum_{i=1}^{N} \|\nabla \ell_i(\theta_i^0)\|^2 \right)$,

where $\theta^* = \arg\min_\theta \ell(\theta)$ and $\theta^t = \frac{1}{|S_t|} \sum_{i \in S_t} \theta_i^t$.

Theorem 1 implies that, for convex and smooth local functions, the global objective function is expected to converge at a rate of $O\left(\frac{\sqrt{\lambda(1-\lambda)}}{T}\right)$. The bounded quantity is the empirical loss averaged over all devices. The theorem further implies that a higher value of $\lambda$ improves the convergence rate under the convex setting, which will be empirically verified in our experiments. Please refer to the supplementary document for the full proof.
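The claim that a larger $\lambda$ improves the rate can be checked directly from the factor $\sqrt{\lambda(1-\lambda)}$ in the stated rate. The short worked example below uses two arbitrary values of $\lambda$ chosen for illustration only.

```latex
\[
\frac{d}{d\lambda}\,\lambda(1-\lambda) = 1 - 2\lambda < 0
\quad \text{for } \tfrac{1}{2} < \lambda < 1,
\qquad \text{e.g.} \quad
\sqrt{0.6 \cdot 0.4} \approx 0.49 \;>\; \sqrt{0.9 \cdot 0.1} = 0.30 .
\]
```

Since $\lambda(1-\lambda)$ is strictly decreasing on the admissible interval, the constant in front of $1/T$ shrinks as $\lambda$ approaches 1.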

4. EXPERIMENTS

This section presents empirical evaluations of FedACG and competing federated learning methods, to highlight the robustness to data heterogeneity of the proposed method in terms of performance and communication-efficiency.

Datasets and baselines

We conduct a set of experiments on CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) with various data heterogeneity levels and participation rates. Note that Tiny-ImageNet (200 classes with 10,000 samples) is more natural and realistic compared to the simple datasets, such as MNIST and CIFAR, [...] (Li et al., 2021), FedDC (Gao et al., 2022). We adopt a standard ResNet-18 (He et al., 2016) as the backbone network for all benchmarks, but we replace batch normalization with group normalization as suggested in Hsieh et al. (2020).

Evaluation metrics To evaluate the generalization performance of the methods on the global distribution, we use the entire test set of CIFAR-10, CIFAR-100, and Tiny-ImageNet. Since both the speed of learning and the final performance are important quantities for federated learning, we measure (i) the performance attained at a specified number of rounds, and (ii) the number of rounds needed for an algorithm to attain a desired level of target accuracy, following Al-Shedivat et al. (2021). For the selection of target accuracies, we first choose the median accuracy of all methods at round 1000 and another representative value lower than the median. For methods that could not achieve the target accuracy within the maximum number of communication rounds, we report the maximum communication round with a + sign appended.

[...] 2021) for evaluation protocol. For local updates, we use the SGD optimizer with a learning rate of 0.1 for all approaches on the three benchmarks. We apply exponential decay to the local learning rate, and the decay parameter is selected from {1.0, 0.998, 0.995}. We apply no momentum for local SGD, but apply a weight decay of 0.001 to prevent overfitting. We also use gradient clipping to increase the stability of the algorithms. The number of local training epochs per client update is set to 5, and the batch size is chosen so that the total number of local update iterations is 50 for all experiments. We set the global learning rate to 1 for all methods except FedADAM, for which it is set to 0.01. We list the details of the hyperparameters specific to FedACG and the compared algorithms in Appendix B.
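The local training schedule above can be sketched as two small helpers. This is an illustration under assumptions: the text does not state whether the exponential decay is applied per communication round or per local step, so the sketch assumes one decay step per round, and the function names are hypothetical.

```python
import math


def local_batch_size(num_samples, epochs=5, total_iterations=50):
    """Batch size giving at most `total_iterations` local steps over `epochs` epochs."""
    iters_per_epoch = total_iterations // epochs  # 10 steps per epoch in the text
    return math.ceil(num_samples / iters_per_epoch)


def local_learning_rate(round_idx, base_lr=0.1, decay=0.998):
    """Exponentially decayed local learning rate (assumed: one decay step per round)."""
    return base_lr * decay ** round_idx


# Toy usage: a client holding 500 samples trains with batches of 50,
# i.e. 10 iterations per epoch and 50 iterations over 5 epochs.
batch = local_batch_size(500)
lr_at_round_1000 = local_learning_rate(1000)
```

With a decay parameter of 1.0, the helper returns the constant base rate, which matches the first option in the search set listed above.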

4.2. MAIN RESULTS

Evaluation with standard federated learning scenarios We first present the performance of the proposed approach, FedACG, on CIFAR-10, CIFAR-100, and Tiny-ImageNet while varying the number of clients, the data heterogeneity, and the participation rate. Our experiments are performed in two settings: a moderate-scale setting, which involves 100 devices with a 5% participation rate per round, and a large-scale setting, which involves 500 clients with a 2% participation rate. Note that the number of clients in the large-scale setting is 5 times that of the moderate-scale experiment, which reduces the number of examples per client by 80%. For the large-scale setting, Table 1b illustrates the outstanding performance of FedACG on CIFAR-10, CIFAR-100, and Tiny-ImageNet, except for the accuracy at 1K rounds on CIFAR-10. One noticeable observation is that the overall performance is lower than in the case with a moderate number of clients. This is because the amount of training data for each client decreases and each client suffers more from the heterogeneous data distribution. Nevertheless, we observe that FedACG outperforms the other methods consistently in most cases; the accuracy gap between FedACG and its strongest competitor becomes larger in these more challenging scenarios. The results of the large-scale experiments demonstrate the robustness of FedACG to heterogeneity and limited participation of clients. We present more comprehensive results on the convergence of FedACG in Appendix D.1.

Effect of low participation rate Partial participation is a critical challenge that slows down the convergence of the global model in federated learning. To verify the robustness to a low participation rate of clients, we perform experiments when the total number of clients is 500 and the participation [...]. Table 2 again shows that FedACG has the best performance in most cases.
Note that the performance gap between FedACG and the second-best method, FedDC, becomes larger than when the participation rate is 2%: from -0.69%p to 1.39%p on CIFAR-10 and from 2.47%p to 6.34%p on CIFAR-100 at round 1000. This is partly because the local states managed by FedDC tend to become stale quickly in this scenario, so its convergence requires extra iterations. In contrast, our method does not rely on past information stored in local devices and is not affected by this issue.

Evaluation on dynamic client set Since FedACG is free from the requirement of storing local model history for local updates, it is conceptually better suited to scenarios with newly participating clients. To validate this property, we conduct an experiment on CIFAR-100 with 500 clients for Dirichlet(0.3) splits. We sample 250 clients every 100 rounds as a candidate client set, and then 10 randomly sampled clients (4% of clients) participate in the local training for each communication round. Table 3 shows that FedACG outperforms FedAvg and FedDyn. Note that FedDyn performs worse than FedAvg: new clients have no local states, or only non-informative ones, so the client models suffer from heterogeneity and divergence.

Contribution of individual components Table 4 presents the contribution of individual components in the experiment on CIFAR-10 for the large-scale federated learning setting. We observe that the accelerated client gradient for local training has the more critical impact on accuracy at 1000 rounds. Note that the proposed regularization term in the local loss function shows a larger performance gain when used with the accelerated client gradient, while employing the regularization term alone does not necessarily achieve performance gains on CIFAR-10 and CIFAR-100.
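The sampling protocol of the dynamic client set experiment (a candidate set of 250 out of 500 clients redrawn every 100 rounds, with 10 participants per round) can be sketched as follows. The function name, the seeding scheme, and the use of Python's `random` module are assumptions for illustration and do not come from the paper's code.

```python
import random


def sample_participants(round_idx, num_clients=500, pool_size=250,
                        per_round=10, refresh_every=100, seed=0):
    """Return (candidate_pool, participants) for one communication round."""
    # The candidate pool is redrawn once per block of `refresh_every` rounds,
    # so every round in the same block sees the same pool.
    block = round_idx // refresh_every
    pool_rng = random.Random(f"{seed}-pool-{block}")
    pool = pool_rng.sample(range(num_clients), pool_size)

    # Participants are drawn from the current pool, independently per round.
    round_rng = random.Random(f"{seed}-round-{round_idx}")
    participants = round_rng.sample(pool, per_round)
    return pool, participants


# Toy usage: rounds 0 and 99 share a candidate pool.
pool_a, picked_a = sample_participants(0)
pool_b, picked_b = sample_participants(99)
```

Because participants may be clients that have never trained before, a method relying on stored per-client state has nothing informative to use for them, which is the situation this experiment probes.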

Ablation study for hyperparameters Table 5 presents the accuracy of FedACG for Dirichlet(0.3) and IID splits when varying the values of λ and β, which control the momentum integration of the server model and the weight of the proximal term, respectively. As shown in Table 5a, low values of λ do not work well, supporting the benefit of the proposed accelerated client gradient strategy, while Table 5b shows that the accuracy is stable with respect to β.

4.4. EXPERIMENTS ON REALISTIC DATASETS

We conducted experiments on additional realistic datasets, FEMNIST and CelebA in LEAF (Caldas et al., 2019), which include other non-i.i.d. scenarios such as feature skewness and data imbalance between clients. For these experiments, the number of clients is set to 2000 with the data split following Caldas et al. (2019), and 10 randomly sampled clients participate in the training for each communication round. We use a simple CNN with group normalization, with two layers for FEMNIST and four layers for CelebA. Table 6 shows that FedACG also outperforms the other baselines on both datasets in most cases, which supports our claim about the strength of FedACG under dataset heterogeneity. Note that, while FedACG requires 20 more communication rounds than FedDC to reach the target accuracy on FEMNIST, it sends 1.76× fewer parameters than FedDC.

5. CONCLUSION

This paper tackles a realistic federated learning scenario, where a large number of clients with heterogeneous data and limited participation constraints hurt the convergence and performance of the model. To address this problem, we proposed a novel federated learning framework, which naturally aggregates previous global gradient information and incorporates it to guide client updates. The proposed algorithm transmits the global gradient information to clients without additional communication cost by simply adding the global information to the current model when broadcasting it to clients. We showed that the proposed method is well suited to realistic federated learning scenarios since it incurs no additional communication or memory overhead. We demonstrated the effectiveness of the proposed method in terms of robustness and communication efficiency in the presence of client heterogeneity through extensive evaluation on multiple benchmarks.

Ethics statement

We propose a communication-efficient federated learning framework that handles non-i.i.d. data distributions over remote clients. Because it does not access the raw data stored in remote devices, the proposed method provides a basic level of privacy. Also, unlike centralized training, which suffers from dataset bias and unfairness problems since collected data reflects the perspective of the person who collects it, our method opens the way to learning the real data distribution instead of the collected data distribution.

Reproducibility statement

We present the procedure of our proposed method in Algorithm 1, and the implementation details in Section 4.1. We also present algorithm-dependent hyperparameters in Appendix B. We have submitted the code and will make it publicly available.

A CONVERGENCE OF FEDACG

We now present the theoretical convergence rate of FedACG. We first state few assumptions for the local loss functions i (θ), which are commonly used in several previous works on federated optimization (Karimireddy et al., 2020; Reddi et al., 2021; Xu et al., 2021) . First, the local function L i (•) is assumed to be L-smooth for all i ∈ {1, . . . , N }, i.e., ∇L i (x) -∇L i (y) ≤ L x -y ∀x, y. (7) if local functions {L i (•)} N i=1 are convex, we additionally have ∀x, 1 2LN N i=1 ∇L i (x) -∇L i (x * ) 2 ≤ L(x) -L(x * ) and (8) ∀x, y, z, ∇L i (x), z -y ≤ -L i (z) + L i (y) + L 2 z -x 2 , ( ) where L(x) = 1 N N i=1 L i (x) and ∇L(x * ) = 0. Second, we assume the local loss functions L i (x) have bounded variance, i.e., E Di ∇ i (x) -∇L i (x) < σ 2 , and bounded gradients i.e., L i (x) 2 < G, for all x. Based on the above assumptions, we derive the following asymptotic convergence bound of FedACG.  E ∇F θ t + λm t 2 ≤ 2 (F (z 0 ) -F * ) (1 -λ) t + 1 max 2LK 1 -λ , √ t + 1 C + C √ t + 1 B , ( ) where B = 1 (1-λ)K (1 + L 2 K 4 3 )(1 -λ) + λ 4 LK 2 2(1-λ) 2 + (1 + 4N |St|(N -1) (1 -|St| N ))(LK 2 + Lλ 4 K 2 2(1-λ) 2 ) G 2 + LK 2 2 2 + λ 4 (1-λ 2 ) σ 2 . Proof. We start the proof from the result in Lemma 1, E[F(z t+1 ) -F(z t+1 )] ≤ -BE[ ∇F(θ t + λm t ) 2 ] + B , where B = ηK 1-λ 1-ηLK 1-λ , and B = η 2 2(1-λ) 2 (1+ L 2 K 4 3 )(1-λ)+ λ 4 LK 2 2(1-λ) 2 +(1+ 4N |St|(N -1) (1- |St| N ))(LK 2 + Lλ 4 K 2 2(1-λ) 2 ) G 2 + LK 2 2 2 + λ 4 (1-λ 2 ) σ 2 , respectively. By summing the above inequalities for t = 0, . . . , t and by noting that λ < 1-λ LK , B t k=0 E ∇F(θ t + λm t ) 2 ≤ E[F(z 0 ) -F(z t+1 )] + (t + 1)B ≤ E[F(z 0 ) -F * ] + (t + 1)B . Then min k=0,...,t E ∇F θ t + λm t 2 ≤ f (z 0 ) -f * (t + 1)B + B B (11) Assume η ≤ 1-λ 2LK , then B ≥ ηK 2(1-λ) . Then min k=0,...,t E ∇F θ t + λm t 2 ≤ 2 (F (z 0 ) -F * ) (1 -λ) ηK(t + 1) + 2(1 -λ) ηK B . 
( ) Noting that η = min 1-λ 2LK , C √ t+1 , we can have min k=0,...,t E ∇F θ t + λm t 2 ≤ 2 (F (z 0 ) -F * ) (1 -λ) t + 1 max 2LK 1 -λ , √ t + 1 C + C √ t + 1 B , ( ) where B = 1 (1-λ)K (1 + L 2 K 4 3 )(1 -λ) + λ 4 LK 2 2(1-λ) 2 + (1 + 4N |St|(N -1) (1 -|St| N ))(LK 2 + Lλ 4 K 2 2(1-λ) 2 ) G 2 + LK 2 2 2+ λ 4 (1-λ 2 ) σ 2 . We then complete the proof by noting that z 0 = θ 0 +λm 0 . Lemma 1. For proving Theorem 1, we first prove the key Lemma below. Let z t = θ t + λ 1-λ m t , ∆ t i = θ t i -(θ t-1 + λm t-1 ) = K-1 k=0 -η∇f i (θ t i,k ), δ t = 1 N i∈[N ] K-1 k=0 -η∇F i (θ t i,k ) , and e t = ∆ t -δ t . FedACG satisfies for any t ≥ 0 and 0 ≤ λ < 1, E [f (z k+1 ) -f (z k )] ≤ -BE ∇f (x k ) 2 + B , where B = ηK 1-λ 1-ηLK 1-λ and B = η 2 2(1-λ) 2 (1+ L 2 K 4 3 )(1-λ)+ λ 4 LK 2 2(1-λ) 2 +(1+ 4N |St|(N -1) (1- |St| N ))(LK 2 + Lλ 4 K 2 2(1-λ) 2 ) G 2 + LK 2 2 2 + λ 4 (1-λ 2 ) σ 2 Proof. F(z t+1 ) ≤ F(z t ) + ∇F(z t ), z t+1 -z t + L 2 z t+1 -z t 2 = F(z t ) + 1 1 -λ ∇F(z t ), ∆ t+1 + L 2(1 -λ) 2 ∆ t+1 2 = F(z t ) + 1 1 -λ ∇F(z t ), (e t+1 + δ t+1 ) + ηK∇F(θ t + λm t ) -ηK∇F(θ t + λm t ) + L 2(1 -λ) 2 e t+1 + δ t+1 2 = F(z t ) + 1 1 -λ ∇F(z t ), e t+1 + 1 1 -λ ∇F(z t ), (δ t+1 + ηK∇F(θ t + λm t )) - ηK 1 -λ ∇F(z t ), ∇F(θ t + λm t + L 2(1 -λ) 2 e t+1 + δ t+1 2 = F(z t ) + 1 1 -λ ∇F(z t ), e t+1 + 1 1 -λ ∇F(z t ), δ t + ηK∇F(θ t + λm t ) - ηK 1 -λ ∇F(z t ) -∇F(θ t + λm t ), ∇F(θ t + λm t ) - ηK 1 -λ ∇F(θ t + λm t ) 2 + L 2(1 -λ) 2 e t+1 + δ t+1 2 First inequality comes from the L-smoothness of the loss function F. By taking expectation on both sides, we get following equation. 
E(F(z t+1 ) -F(z t )) ≤ 1 1 -λ E[ ∇F(z t ), δ t + ηK∇F(θ t + λm t ) ] - ηK 1 -λ E[ ∇F(z t ) -∇F(θ t + λm t ), ∇F(θ t + λm t ) ] - ηK 1 -λ E[ ∇F(θ t + λm t ) 2 ] + L 2(1 -λ) 2 E[ e t+1 + δ t+1 2 ] ≤ 1 2(1 -λ) {E[ η∇F(z t ) 2 ] I * + E[ 1 η (δ t + ηK∇F(θ t + λm t )) 2 II * } + 1 4L E[ ∇F(z t ) -∇F(θ t + λm t ) 2 ] III * + L 2(1 -λ) 2 (E[ e t+1 2 ] IV * + E[ δ t+1 2 ] V * ) + {L( Kη 1 -λ ) 2 - Kη 1 -λ }E[ ∇F(θ t + λm t ) 2 ] First line holds because E[e t+1 ] = 0. Second inequality comes from the Lemma 5. Now we have to find the upper bound of five terms depicted in the last inequality. I * 's upper bound is, I * = η 2 E[ 1 N i∈[N ] ∇F i (z t ) 2 ] ≤ η 2 G 2 The upper bound of II * , III * , and IV * are handled in Lemma 2, Lemma 3, and Lemma 4, respectively. V * 's upper bound is, V * = E[ 1 N i∈[N ] K-1 k=0 -η∇F i (θ t i,k ) 2 ] ≤ η 2 K N i∈[N ] K-1 k=0 E[ ∇F i (θ t i,k ) 2 ] ≤ η 2 K 2 G 2 Substituting upper bound for five items yields the desired result. Lemma 2. E[ 1 η (δ t + ηK∇F(θ t + λm t )) 2 ] in the proof of Lemma 6 has following bound.

E[

1 η (δ t + ηK∇F(θ t + λm t )) 2 ] ≤ η 2 L 2 K 4 G 2 Proof. 1 η (δ t + ηK∇F(θ t + λm t )) 2 } = E[ 1 η (δ t + ηK∇F(θ t + λm t )) 2 = E[ 1 N i∈[N ] K-1 k=0 {-∇F i (θ t i,k ) + ∇F i (θ t + λm t ))} 2 ≤ K N i∈[N ] K-1 k=0 E[ {-∇F i (θ t i,k ) + ∇F i (θ t + λm t ))} 2 ≤ L 2 K N i∈[N ] K-1 k=0 E[ θ t i,k -θ t i,0 2 = L 2 K N i∈[N ] K-1 k=0 E[ k-1 τ =0 -η∇F i (θ t i,k ) 2 ≤ η 2 L 2 K N i∈[N ] K-1 k=0 k k-1 τ =0 E[ ∇F i (θ t i,k ) 2 ≤ η 2 L 2 K N i∈[N ] K-1 k=0 k 2 G 2 ≤ η 2 L 2 K 4 G 2 3 Inequality in the third and sixth line comes from Jensen's inequality. Inequality in the fourth line is derived by the smoothness of the objective function. Inequality in the seventh line use bounded gradient assumption. Lemma 3. E ∇F(z t ) -∇F(θ t + λm t ) 2 in the proof of Lemma 6 has the following bound for any 0 ≤ λ < 1, E ∇F(z t ) -∇F(θ t + λm t ) 2 ≤ η 2 λ 4 L 2 K 2 G 2 1 + 4N |St|(N -1) 1 -|St| N + η 2 λ 4 L 2 K 2 σ 2 (1 -λ) 4 Proof. E ∇F(z t ) -∇F(θ t + λm t ) 2 ≤ L 2 E λ 2 1 -λ m t 2 = λ 4 L 2 (1 -λ) 2 E m t 2 = λ 4 L 2 (1 -λ) 2 E t k=0 λ t-k ∆ k 2 , where the first inequality comes from the L-smoothness of the global loss function F, while the last equation comes from the unrolling the recursion of the momentum m t , i.e., m t = t k=0 λ t-k ∆ k . Let Γ t = t k=0 λ k = 1-λ t 1-λ . For 0 ≤ λ < 1, Γ t ≤ 1 1-λ .Then λ 4 L 2 (1 -λ) 2 E t k=0 λ t-k ∆ k 2 = λ 4 L 2 (1 -λ) 2 Γ 2 t 1 Γ t t k=1 λ t-k ∆ k 2 ≤ λ 4 L 2 (1 -λ) 2 Γ t t k=1 λ t-k E ∆ k 2 = λ 4 L 2 (1 -λ) 2 Γ 2 t E ∆ t 2 = η 2 λ 4 L 2 K 2 G 2 1 + 4N |St|(N -1) 1 -|St| N + η 2 λ 4 L 2 K 2 σ 2 (1 -λ) 4 Lemma 4. E e 2 in the proof of Lemma 6 has the following bound, E e t 2 ≤ η 2 K 2 σ 2 + 4η 2 N K 2 (N -1)|S t | (1 - |S t | N )G 2 Proof. We have E e t 2 = E ∆ t -δ t . 
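Note that, with $\delta^t = -\frac{\eta}{N}\sum_{i\in[N]}\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k})$:

$$
\begin{aligned}
\mathbb{E}\|\Delta^t - \delta^t\|^2
&= \mathbb{E}\Big\|\Delta^t + \frac{\eta}{|S_t|}\sum_{i\in S_t}\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k}) - \frac{\eta}{|S_t|}\sum_{i\in S_t}\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k}) + \frac{\eta}{N}\sum_{i\in[N]}\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k})\Big\|^2 \\
&= \mathbb{E}\Big\|\frac{\eta}{|S_t|}\sum_{i\in S_t}\sum_{k=0}^{K-1}\big(\nabla f_i(\theta^t_{i,k}) - \nabla F_i(\theta^t_{i,k})\big)\Big\|^2
 + \underbrace{\mathbb{E}\Big\|\frac{\eta}{|S_t|}\sum_{i\in S_t}\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k}) + \delta^t\Big\|^2}_{(A)} \\
&\le \frac{\eta^2 K}{|S_t|}\sum_{i\in S_t}\sum_{k=0}^{K-1}\mathbb{E}\big\|\nabla f_i(\theta^t_{i,k}) - \nabla F_i(\theta^t_{i,k})\big\|^2 + (A) \\
&\le \eta^2 K^2 \sigma^2 + (A).
\end{aligned}
$$

In (A), we take the expectation with respect to the sampled set $S_t$ out of the $N$ total clients. For that, we use Lemma 4 of Reisizadeh et al. (2020). Specifically, using Eq. (59) in Reisizadeh et al. (2020), we get:

$$
\begin{aligned}
(A) &\le \frac{\eta^2}{|S_t|^2}\Big(\frac{|S_t|}{N} - \frac{|S_t|(|S_t|-1)}{N(N-1)}\Big)\sum_{i\in[N]}\mathbb{E}\Big\|\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k}) + \frac{\delta^t}{\eta}\Big\|^2 \\
&\le \frac{\eta^2}{|S_t|^2}\Big(\frac{|S_t|}{N} - \frac{|S_t|(|S_t|-1)}{N(N-1)}\Big)\Big(2\sum_{i\in[N]}\mathbb{E}\Big\|\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k})\Big\|^2 + \frac{2}{N}\,\mathbb{E}\Big\|\sum_{i\in[N]}\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k})\Big\|^2\Big) \\
&\le \frac{4\eta^2}{|S_t|^2}\Big(\frac{|S_t|}{N} - \frac{|S_t|(|S_t|-1)}{N(N-1)}\Big)\sum_{i\in[N]}\mathbb{E}\Big\|\sum_{k=0}^{K-1}\nabla F_i(\theta^t_{i,k})\Big\|^2 \\
&\le \frac{4\eta^2}{|S_t|^2}\Big(\frac{|S_t|}{N} - \frac{|S_t|(|S_t|-1)}{N(N-1)}\Big) N K^2 G^2 .
\end{aligned}
$$

Since $\frac{1}{|S_t|^2}\big(\frac{|S_t|}{N} - \frac{|S_t|(|S_t|-1)}{N(N-1)}\big) N = \frac{N}{(N-1)|S_t|}\big(1 - \frac{|S_t|}{N}\big)$, this gives us the desired result.

Lemma 5. (Relaxed triangle inequality). For any $a > 0$,

$$
\|v_1 + v_2\|^2 \le (1 + a)\|v_1\|^2 + \Big(1 + \frac{1}{a}\Big)\|v_2\|^2 .
$$

Proof. Rearranging the terms, the inequality is equivalent to $0 \le \big\|\sqrt{a}\,v_1 - \frac{1}{\sqrt{a}}\,v_2\big\|^2$.

A.2 CONVEX ANALYSIS

Theorem 2. Suppose that local functions $\{f_i\}_{i=1}^{N}$ are convex and $L$-smooth. Then, for $\frac{1}{2} < \lambda < 1$ and $\frac{L}{1-\lambda} \le \beta$, FedACG satisfies

$$
\mathbb{E}\Big[f\Big(\frac{1}{T}\sum_{t=1}^{T}\theta^{t-1}\Big) - f(\theta^*)\Big]
\le \frac{1}{T}\Big(\beta(1-\lambda)\|\theta^0 - \theta^*\|^2 + \frac{\lambda}{\beta}\,\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(\theta^0_i)\|^2\Big) = O\Big(\frac{1}{T}\Big),
$$

where $\theta^* = \arg\min_\theta f(\theta)$ and $\theta^t = \frac{1}{|S_t|}\sum_{i\in S_t}\theta^t_i$. If $\beta = L\frac{\lambda}{1-\lambda}$, we get the statement in Theorem 1 in the main paper.

We utilize approaches similar to the SCAFFOLD (Karimireddy et al., 2020) and FedDyn (Acar et al., 2021) analyses throughout the proof. We define the momentum $m^t = \theta^t - \theta^{t-1}$ and a set of variables for the analysis. Following the analysis in FedDyn (Acar et al., 2021), we first define the virtual variables $\{\tilde{\theta}^t_i\}$ as

$$
\tilde{\theta}^t_i = \arg\min_{\theta}\ f_i(\theta) + \frac{\beta}{2}\big\|\theta - (\theta^{t-1} + \lambda m^{t-1})\big\|^2 . \tag{14}
$$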
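$\theta^t$ consists of the locally trained models from participating devices. We express the server model as the average over active devices, and its relation with the accelerated model $\gamma^t$, as

$$
\theta^t = \frac{1}{|S_t|}\sum_{i\in S_t}\theta^t_i ; \qquad \theta^t = \gamma^t - \lambda m^t . \tag{15}
$$

We also define $\epsilon_t$, which measures the difference between the local models and the average of device models from the previous round, as

$$
\epsilon_t = \frac{1}{N}\sum_{i\in\{1,\dots,N\}}\mathbb{E}\|\tilde{\theta}^t_i - \theta^{t-1}\|^2 . \tag{16}
$$

If the models converge to $\theta^*$, $\epsilon_t$ will be 0. After these definitions, Theorem 2 can be seen as a direct consequence of the following lemma.

Lemma 6. For convex and $L$-smooth functions $\{f_i\}_{i=1}^{N}$, if $\frac{1}{2} < \lambda < 1$ and $\frac{L}{1-\lambda} \le \beta$, FedACG satisfies

$$
\mathbb{E}\|\theta^t - \theta^*\|^2 + \kappa\,\epsilon_t \le \mathbb{E}\|\theta^{t-1} - \theta^*\|^2 + \kappa\,\epsilon_{t-1} - \kappa_0\,\mathbb{E}\big[f(\theta^{t-1}) - f(\theta^*)\big],
$$

where $\theta^* = \arg\min_\theta f(\theta)$,

$$
\kappa = \frac{4\beta\lambda^2 (L - \beta - \beta\lambda)}{(\beta^2 - 4L^2 - 4\beta^2\lambda^2)(1-\lambda)}, \qquad
\kappa_0 = \frac{2}{\beta(1-\lambda)}\Big(\frac{4L(L - \beta - \lambda\beta)}{\beta^2 - 4L^2 - 4\beta^2\lambda^2} - 1\Big).
$$

Lemma 6 can be telescoped in the following way:

$$
\begin{aligned}
\kappa_0\,\mathbb{E}\big[f(\theta^{t-1}) - f(\theta^*)\big] &\le \big(\mathbb{E}\|\theta^{t-1} - \theta^*\|^2 + \kappa\,\epsilon_{t-1}\big) - \big(\mathbb{E}\|\theta^{t} - \theta^*\|^2 + \kappa\,\epsilon_{t}\big), \\
\kappa_0\sum_{t=1}^{T}\mathbb{E}\big[f(\theta^{t-1}) - f(\theta^*)\big] &\le \big(\mathbb{E}\|\theta^{0} - \theta^*\|^2 + \kappa\,\epsilon_{0}\big) - \big(\mathbb{E}\|\theta^{T} - \theta^*\|^2 + \kappa\,\epsilon_{T}\big).
\end{aligned}
$$

If $\frac{1}{2} < \lambda < 1$ and $\frac{L}{1-\lambda} \le \beta$, $\kappa_0$ and $\kappa$ become positive. Eliminating the negative terms on the RHS gives

$$
\kappa_0\sum_{t=1}^{T}\mathbb{E}\big[f(\theta^{t-1}) - f(\theta^*)\big] \le \mathbb{E}\|\theta^{0} - \theta^*\|^2 + \kappa\,\epsilon_{0} .
$$

Applying Jensen's inequality on the LHS gives

$$
\mathbb{E}\Big[f\Big(\frac{1}{T}\sum_{t=1}^{T}\theta^{t-1}\Big) - f(\theta^*)\Big] \le \frac{1}{T}\,\frac{1}{\kappa_0}\big(\|\theta^{0} - \theta^*\|^2 + \kappa\,\epsilon_{0}\big) = O\Big(\frac{1}{T}\Big),
$$

which proves the statement in Theorem 2.

Similar to the convergence analysis of gradient descent, $\|\theta^t - \theta^*\|^2$ is expressed as $\|\theta^t - \theta^{t-1} + \theta^{t-1} - \theta^*\|^2$ and expanded in the proof of Lemma 6. To tackle the extra terms, we state the following lemmas and their proofs. We first bound $\|\theta^t - \theta^{t-1}\|^2$ with the following.

Lemma 7. Suppose that local functions $\{f_i\}_{i=1}^{N}$ are convex and $L$-smooth. Then we can bound the global model update as

$$
\mathbb{E}\|\theta^t - \theta^{t-1}\|^2 \le \epsilon_t . \tag{17}
$$

Proof. Note that

$$
\begin{aligned}
\mathbb{E}\|\theta^t - \theta^{t-1}\|^2
&= \mathbb{E}\Big\|\frac{1}{|S_t|}\sum_{i\in S_t}\big(\theta^t_i - \theta^{t-1}\big)\Big\|^2
\le \frac{1}{|S_t|}\,\mathbb{E}\sum_{i\in S_t}\|\theta^t_i - \theta^{t-1}\|^2 \\
&= \frac{1}{|S_t|}\,\mathbb{E}\sum_{i\in S_t}\|\tilde{\theta}^t_i - \theta^{t-1}\|^2
= \frac{1}{|S_t|}\,\frac{|S_t|}{N}\sum_{i=1}^{N}\mathbb{E}\|\tilde{\theta}^t_i - \theta^{t-1}\|^2 = \epsilon_t ,
\end{aligned}
$$

where the first equality comes from Eq. (15). The following inequality applies Jensen's inequality.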
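The remaining relations are due to $\tilde{\theta}^t_i = \theta^t_i$ if $i \in S_t$, taking the expectation conditioned on the randomness before time $t$, and the definition of $\epsilon_t$.

We introduce an additional lemma to further bound the term in Lemma 7. Before the proof, we first introduce a triangular inequality.

Lemma 8. For all $\{v_j\}_{j=1}^{n} \subset \mathbb{R}^d$, the triangular inequality satisfies

$$
\Big\|\sum_{j=1}^{n} v_j\Big\|^2 \le n\sum_{j=1}^{n}\|v_j\|^2 .
$$

Proof. With Jensen's inequality, $\big\|\frac{1}{n}\sum_{j=1}^{n} v_j\big\|^2 \le \frac{1}{n}\sum_{j=1}^{n}\|v_j\|^2$. Multiplying both sides by $n^2$ gives the inequality.

Lemma 9. For convex and $L$-smooth functions $\{f_i\}_{i=1}^{N}$, the updates of FedACG have bounded drift:

$$
\Big(1 - \frac{4L^2}{\beta^2}\Big)\epsilon_t \le 4\lambda^2\epsilon_{t-1} + \frac{8L}{\beta^2}\,\mathbb{E}\big[f(\theta^{t-1}) - f(\theta^*)\big].
$$

Proof.

$$
\begin{aligned}
\epsilon_t &= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\tilde{\theta}^t_i - \theta^{t-1}\|^2
= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\Big\|-\frac{1}{\beta}\nabla f_i(\tilde{\theta}^t_i) + \lambda m^{t-1}\Big\|^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\Big\|-\frac{1}{\beta}\nabla f_i(\theta^*) + \frac{1}{\beta}\nabla f_i(\theta^*) - \frac{1}{\beta}\nabla f_i(\theta^{t-1}) + \frac{1}{\beta}\nabla f_i(\theta^{t-1}) - \frac{1}{\beta}\nabla f_i(\tilde{\theta}^t_i) + \lambda m^{t-1}\Big\|^2 \\
&\le \frac{4}{\beta^2}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(\theta^{t-1}) - \nabla f_i(\theta^*)\|^2
 + \frac{4}{\beta^2}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(\theta^*)\|^2
 + \frac{4}{\beta^2}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\nabla f_i(\tilde{\theta}^t_i) - \nabla f_i(\theta^{t-1})\|^2
 + 4\lambda^2\epsilon_{t-1} \\
&\le \frac{4L^2}{\beta^2}\epsilon_t + 4\lambda^2\epsilon_{t-1} + \frac{8L}{\beta^2}\,\mathbb{E}\big[f(\theta^{t-1}) - f(\theta^*)\big],
\end{aligned}
$$

where the first and second equations come from Eq. (15) and the first-order condition of Eq. (14). The following inequalities come from Lemma 8, Lemma 7, and convexity.

Now, let us express the $\|\theta^t - \theta^*\|^2$ term as

$$
\begin{aligned}
\mathbb{E}\|\theta^t - \theta^*\|^2
&= \mathbb{E}\|\theta^{t-1} - \theta^* + \theta^t - \theta^{t-1}\|^2 \\
&= \mathbb{E}\|\theta^{t-1} - \theta^*\|^2 + 2\,\mathbb{E}\langle\theta^{t-1} - \theta^*,\ \theta^t - \theta^{t-1}\rangle + \mathbb{E}\|\theta^t - \theta^{t-1}\|^2 \\
&\approx \mathbb{E}\|\theta^{t-1} - \theta^*\|^2 + \frac{2}{(1-\lambda)\beta N}\sum_{i=1}^{N}\mathbb{E}\langle\theta^{t-1} - \theta^*,\ -\nabla f_i(\tilde{\theta}^t_i)\rangle + \mathbb{E}\|\theta^t - \theta^{t-1}\|^2 \\
&\le \mathbb{E}\|\theta^{t-1} - \theta^*\|^2 + \mathbb{E}\|\theta^t - \theta^{t-1}\|^2 + \frac{2}{(1-\lambda)\beta N}\sum_{i=1}^{N}\mathbb{E}\Big[f_i(\theta^*) - f_i(\theta^{t-1}) + \frac{L}{2}\|\tilde{\theta}^t_i - \theta^{t-1}\|^2\Big] \\
&= \mathbb{E}\|\theta^{t-1} - \theta^*\|^2 - \frac{2}{(1-\lambda)\beta}\,\mathbb{E}\big[f(\theta^{t-1}) - f(\theta^*)\big] + \frac{L}{(1-\lambda)\beta}\,\epsilon_t + \mathbb{E}\|\theta^t - \theta^{t-1}\|^2 ,
\end{aligned} \tag{19}
$$

where we approximately have $\mathbb{E}[m^t] \approx -\frac{1}{(1-\lambda)\beta N}\sum_{i=1}^{N}\mathbb{E}[\nabla f_i(\theta^t_i)]$, since the global gradient information is exponentially updated with a coefficient $\lambda$. The following inequality is due to the quadratic bound given by the convexity and $L$-smoothness of the local functions. Let us scale Eq. (19) with $\frac{(1-\lambda)(\beta^2 - 4L^2 - 4\lambda^2\beta^2)}{\beta(L - \beta - \beta\lambda)}$. Note that the coefficient is positive due to the condition on $\beta$ and $\lambda$.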
Summing the scaled version of Eq. (19) and Lemma 9 gives the statement in Lemma 6.
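The two elementary inequalities used repeatedly in the analysis above, the relaxed triangle inequality (Lemma 5) and the sum bound (Lemma 8), can be sanity-checked numerically. The following is only an illustrative sketch in plain Python; the helper names `relaxed_triangle_gap` and `sum_bound_gap` are hypothetical and not part of the paper's code.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(x * x for x in v)


def relaxed_triangle_gap(v1, v2, a):
    """RHS minus LHS of Lemma 5; non-negative for any a > 0."""
    lhs = sq_norm([x + y for x, y in zip(v1, v2)])
    rhs = (1 + a) * sq_norm(v1) + (1 + 1 / a) * sq_norm(v2)
    return rhs - lhs


def sum_bound_gap(vectors):
    """RHS minus LHS of Lemma 8: n * sum ||v_j||^2 - ||sum v_j||^2."""
    n = len(vectors)
    total = [sum(column) for column in zip(*vectors)]
    return n * sum(sq_norm(v) for v in vectors) - sq_norm(total)


if __name__ == "__main__":
    rng = random.Random(0)
    vs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
    print(relaxed_triangle_gap(vs[0], vs[1], a=0.5) >= 0)
    print(sum_bound_gap(vs) >= 0)
```

Both gaps vanish only in the equality cases: $a\,v_1 = v_2$ for Lemma 5, and all $v_j$ equal for Lemma 8.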

B HYPERPARAMETER SETTING

For hyperparameter selection, we assume a scenario in which the server can compute the validation accuracy through communication with clients at the early stages of training, which is common to all algorithms. For the experiments on CIFAR-10 and CIFAR-100, we choose 5 as the number of local training epochs (50 iterations) and 0.1 as the local learning rate. We set the batch size of the local update to 50 and 10 for the 100-client and 500-client settings, respectively. The learning rate decay parameter of each algorithm is selected from {0.995, 0.998, 1} to achieve the best performance. The global learning rate is set to 1, except for FedAdam, which is tested with 0.01. For the experiments on Tiny-ImageNet, we match the total number of local iterations with the other benchmarks by setting the batch size of local updates to 100 and 20 for the 100-client and 500-client settings, respectively. As for algorithm-dependent hyperparameters, α in FedCM is selected from {0.1, 0.3, 0.5}, α in FedDyn is selected from {0.001, 0.01, 0.1}, and α in FedDC is set to 0.01. τ in FedAdam is set to 0.001, while µ in MOON is set to 1. β in FedAvgM is selected from {0.4, 0.6, 0.8}, β in FedProx and FedACG is selected from {0.1, 0.01, 0.001}, and λ in FedACG is selected from {0.8, 0.85, 0.9}. We submitted and will release the source code to facilitate the reproduction of our results.
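The search space described above can be written down as a small configuration, which makes the size of the FedACG grid explicit. This is only a sketch: the dictionary keys (`local_epochs`, `lr_decay`, `beta`, `lam`, and so on) and the helper `expand_grid` are hypothetical names chosen for illustration, not identifiers from the released code.

```python
from itertools import product

# Settings shared by all methods on CIFAR-10 / CIFAR-100 (values from the text).
SHARED_CIFAR = {
    "local_epochs": 5,
    "local_lr": 0.1,
    "global_lr": 1.0,                    # FedAdam uses 0.01 instead
    "batch_size": {100: 50, 500: 10},    # keyed by the total number of clients
}

# Candidate values searched for FedACG (values from the text).
FEDACG_GRID = {
    "lr_decay": [0.995, 0.998, 1.0],
    "beta": [0.1, 0.01, 0.001],
    "lam": [0.8, 0.85, 0.9],
}


def expand_grid(grid):
    """Enumerate every combination of the candidate values in `grid`."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


if __name__ == "__main__":
    configs = expand_grid(FEDACG_GRID)
    print(len(configs))   # number of FedACG configurations in the grid
    print(configs[0])
```

Under this reading, FedACG has 3 × 3 × 3 = 27 candidate configurations per dataset and client setting.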

C EVALUATION ON VARIOUS DATA HETEROGENEITY

Tables 7 and 8 show that, in most cases, FedACG matches or outperforms the competing methods on CIFAR-10 when data heterogeneity is mild (Dirichlet 0.6) or absent (IID). Note that, while the compared methods degrade as the participation rate decreases, FedACG shows little degradation for both data splits. This implies that FedACG is more robust to low participation rates than the other baselines. This is partly because low client heterogeneity reduces noise in the momentum of the global gradient, which contributes to a smooth trajectory of the global update. Since FedACG effectively incorporates the momentum into local updates, it is relatively unaffected by partial participation.

4 show the convergence of FedACG and the compared algorithms on CIFAR-10, CIFAR-100, and Tiny-ImageNet for various federated learning settings: varying the total number of clients, the participation rate, and the data heterogeneity. FedACG consistently matches or exceeds the performance of the strongest competitor throughout most of training. Figure 5 shows the convergence plots in the setting with a massive number of clients and a 1% participation rate. The result shows that FedACG takes the lead throughout most of training, which also demonstrates the effectiveness of FedACG.

D.2 EVALUATION ON DYNAMIC CLIENT SET

Figure 6 shows a convergence plot when the entire client pool changes during training. The result shows that FedACG outperforms the baselines throughout most of training. Note that FedDyn performs worse than FedAvg for most of training and only reaches FedAvg's performance at the end. This is partly because FedDyn needs to store local states for local training in each client, which requires a warm-up period before newly participating clients hold useful information. In contrast, FedACG, which is free from this restriction, shows strength in a realistic federated learning scenario where the entire client pool changes during training.

) while FedAvgM (Hsu et al., 2020) broadcasts the current global model ($\theta^{t-1}$) to each client as the initial point of the local model. To clarify the novelty of FedACG, we provide pseudo-code for FedACG and for FedAvgM with a local regularization term in Algorithm 2 and Algorithm 3, respectively. Figure 7 and Figure 8 also illustrate the server broadcasting and aggregation process of FedACG and FedAvgM, respectively.

FedProx. FedACG differs from FedProx in three respects. First, FedACG utilizes the global momentum for the server update, as in Algorithm 2. Second, since FedACG uses the accelerated client gradient, the initial point of the local model is different from that of FedProx. Third, as a consequence, the objective function of FedACG regularizes the distance not between the local model and the previous global model (as in FedProx), but between the local model and the accelerated point.
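The third difference can be made concrete by comparing the anchor of the proximal penalty in the two methods. The sketch below is illustrative only and uses plain Python lists as parameter vectors; the function names `fedprox_anchor`, `fedacg_anchor`, and `proximal_penalty` are hypothetical, and the penalty form $(\beta/2)\|\theta - \text{anchor}\|^2$ follows the local objective written in Algorithm 2.

```python
def fedprox_anchor(theta_prev, momentum, lam):
    """FedProx regularizes toward the previous global model (momentum is unused)."""
    return list(theta_prev)


def fedacg_anchor(theta_prev, momentum, lam):
    """FedACG regularizes toward the accelerated point theta^{t-1} + lam * m^{t-1}."""
    return [t + lam * m for t, m in zip(theta_prev, momentum)]


def proximal_penalty(local, anchor, beta):
    """(beta / 2) * ||local - anchor||^2, the local regularization term."""
    return 0.5 * beta * sum((x - a) ** 2 for x, a in zip(local, anchor))


if __name__ == "__main__":
    theta_prev, momentum, lam, beta = [1.0, 2.0], [0.5, -1.0], 0.8, 0.1
    start = fedacg_anchor(theta_prev, momentum, lam)
    # A FedACG client starts at its own anchor, so its penalty is zero at step 0,
    # whereas the same point is already penalized under the FedProx anchor.
    print(proximal_penalty(start, start, beta))
    print(proximal_penalty(start, fedprox_anchor(theta_prev, momentum, lam), beta))
```

With zero momentum the two anchors coincide, so the difference only appears once the server has accumulated a non-zero $m^{t-1}$.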

F EFFECT OF LOCAL REGULARIZATION TERM

Table 9 shows the effect of the local regularization term in FedAvg, FedAvgM, and FedACG. Note that the role of the local regularization term is different in FedACG due to the acceleration term ($+\lambda m^{t-1}$) included in the message from the global model. We first observe that employing the accelerated client gradient, by adding the global momentum to the current model, plays a critical role in the performance gain. We also observe the effectiveness of the local regularization term in FedACG; adding the local regularization term to the other baselines does not necessarily achieve performance gains on CIFAR-10 and CIFAR-100.

Algorithm 2 FedACG
Input: $\beta$, $\lambda$, initial server model $\theta^0$, number of clients $N$, number of communication rounds $T$, number of local iterations $K$, local learning rate $\eta$
Initialize global momentum $m^0 = 0$
for each round $t = 1, 2, \dots, T$ do
    Sample subset of clients $S_t \subseteq \{1, \dots, N\}$
    Server sends $\theta^{t-1} + \lambda m^{t-1}$ to all clients $i \in S_t$
    for each client $i \in S_t$, in parallel do
        Initialize local model $\theta^t_{i,0} = \theta^{t-1} + \lambda m^{t-1}$
        for each local iteration $k = 1, 2, \dots, K$ do
            Compute mini-batch loss $f_i(\theta^t_{i,k-1}) = L_i(\theta^t_{i,k-1}) + \frac{\beta}{2}\|\theta^t_{i,k-1} - (\theta^{t-1} + \lambda m^{t-1})\|^2$
            $\theta^t_{i,k} = \theta^t_{i,k-1} - \eta \nabla f_i(\theta^t_{i,k-1})$
        end
        $\Delta^t_i = \theta^t_{i,K} - (\theta^{t-1} + \lambda m^{t-1})$
        Client sends $\Delta^t_i$ back to the server
    end
    In server:
        $\Delta^t = \sum_{i \in S_t} \omega_i \Delta^t_i$
        $m^t = \lambda m^{t-1} + \Delta^t$
        $\theta^t = \theta^{t-1} + m^t$
end
Return $\theta^T$
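The steps of Algorithm 2 can be sketched on a toy problem to make the data flow explicit. The code below is a minimal illustration, not the authors' implementation. It assumes quadratic client losses $L_i(\theta) = \frac{1}{2}\|\theta - c_i\|^2$, full-batch local gradients, and uniform aggregation weights $\omega_i = 1/|S_t|$; the function name `fedacg_toy` and all of its parameters are hypothetical.

```python
import random


def fedacg_toy(centers, rounds=200, local_steps=5, eta=0.1, beta=0.01,
               lam=0.85, clients_per_round=None, seed=0):
    """Run FedACG on toy clients whose loss is 0.5 * ||theta - c_i||^2."""
    rng = random.Random(seed)
    n, d = len(centers), len(centers[0])
    per_round = clients_per_round or n
    theta = [0.0] * d           # server model theta^0
    m = [0.0] * d               # global momentum m^0 = 0
    for _ in range(rounds):
        sampled = rng.sample(range(n), per_round)
        # Server broadcasts the accelerated point theta^{t-1} + lam * m^{t-1}.
        start = [th + lam * mo for th, mo in zip(theta, m)]
        delta = [0.0] * d
        for i in sampled:
            local = list(start)
            for _ in range(local_steps):
                # Gradient of L_i(x) + (beta / 2) * ||x - start||^2.
                local = [x - eta * ((x - c) + beta * (x - s))
                         for x, c, s in zip(local, centers[i], start)]
            # Client update is measured relative to the broadcast point.
            for j in range(d):
                delta[j] += (local[j] - start[j]) / per_round
        # Server: m^t = lam * m^{t-1} + Delta^t, theta^t = theta^{t-1} + m^t.
        m = [lam * mo + dl for mo, dl in zip(m, delta)]
        theta = [th + mo for th, mo in zip(theta, m)]
    return theta


if __name__ == "__main__":
    centers = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0]]
    print(fedacg_toy(centers))  # approaches the mean of the centers
```

With full participation the toy run is deterministic and the server model approaches the minimizer of the average loss, i.e. the mean of the $c_i$. Passing a smaller `clients_per_round` introduces the sampling noise discussed in the analysis. Note that clients only receive one vector and send one vector per round, so no previous models or momentum buffers are stored on the client side.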



https://www.kaggle.com/c/tiny-imagenet



Figure 1: An illustration of the proposed accelerated client gradient method. We first partially update the global model in the direction of the global momentum (orange) and then aggregate local updates (gray), resulting in the server model in the next round (blue). Through this anticipatory update, we make the individual local updates aligned with the global gradient and achieve speed-up of convergence.

NON-CONVEX ANALYSIS Theorem 1. (Convergence of FedACG) Suppose that local functions {f i } N i=1 are non-convex and L-smooth. Let z t = θ t + λ 1-λ m t for any 0 ≤ λ < 1. Then, by setting η = min 1

Figure 5: The convergence plots of FedACG and the baselines when the participation rate is low (1%) for 500 clients on CIFAR-10 and CIFAR-100. The Dirichlet parameter is set to 0.3 for the experiments.

FedACG
Input: $\beta$, $\lambda$, initial server model $\theta^0$, number of clients $N$, number of communication rounds $T$, local learning rate $\eta$
Initialize the global momentum, $m^0 = 0$.
for each round $t = 1, 2, \dots, T$ do
    Sample a subset of clients $S_t \subseteq \{1, \dots, N\}$.
    Server sends $\theta^{t-1} + \lambda m^{t-1}$ to initialize local models for all clients $i \in S_t$.
    for each client $i \in S_t$, in parallel do

Comparisons of FedACG with baselines on CIFAR-10, CIFAR-100 and Tiny-ImageNet for two different federated learning settings. For a moderate-scale experiment (a), the number of clients and the participation rate are set to 100 and 5%, respectively, while a large-scale experiment (b) has 500 clients with a 2% participation rate. The Dirichlet parameter is commonly set to 0.3. The accuracy at the target round and the communication round to reach the target test accuracy are based on a running exponential moving average with parameter 0.9. The arrows indicate whether higher (↑) or lower (↓) is better. The best performance in each column is denoted in bold. FedCM † and FedDC ‡ require 1.5× and 2× communication cost for each communication round, respectively.

                              CIFAR-10                     CIFAR-100                    Tiny-ImageNet
Method                        Accuracy (↑)   Rounds (↓)    Accuracy (↑)   Rounds (↓)    Accuracy (↑)   Rounds (↓)
(Reddi et al., 2021)          72.33  81.73   908   1000+   44.80  52.48   691   1000+   33.22  38.91   658   945
FedDyn (Acar et al., 2021)    84.82  88.10   392   646     48.38  55.79   424   883     37.35  41.18   344   573
MOON (Li et al., 2021)        83.32  86.30   371   686     53.15  58.37   284   640     36.62  40.33   410   627
FedCM † (Xu et al., 2021)     78.92  83.71   624   1000+   52.44  58.06   293   747     31.61  37.87   694   1000+
FedDC ‡ (Gao et al., 2022)    86.52  87.47   323   519     54.25  59.01   333   553     40.32  45.51   340   403
FedACG (ours)                 85.13  89.10   319   450     55.79  62.51   260   409     42.26  46.31   226   331
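The two reported metrics, accuracy at a target round and the number of rounds needed to reach a target accuracy, both rely on an exponentially smoothed accuracy curve. A minimal sketch of how such metrics could be computed is given below. It assumes the common convention $s_t = 0.9\,s_{t-1} + 0.1\,a_t$ with $s_1 = a_1$, which the caption does not spell out; the helper names `ema` and `rounds_to_target` are hypothetical.

```python
def ema(values, decay=0.9):
    """Running exponential moving average, initialized at the first value (an assumption)."""
    smoothed, state = [], None
    for v in values:
        state = v if state is None else decay * state + (1 - decay) * v
        smoothed.append(state)
    return smoothed


def rounds_to_target(accuracies, target, decay=0.9):
    """First 1-indexed round whose smoothed accuracy reaches `target`, or None if never reached."""
    for round_idx, acc in enumerate(ema(accuracies, decay), start=1):
        if acc >= target:
            return round_idx
    return None


if __name__ == "__main__":
    per_round_acc = [50.0, 60.0, 70.0, 80.0]
    print(ema(per_round_acc))
    print(rounds_to_target(per_round_acc, target=55.0))
```

A return value of `None` corresponds to entries such as "1000+" in the table, where the target accuracy is not reached within the training budget.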

Effect of low participation rate, 1% over 500 clients with Dirichlet (0.3) split, for FedACG and the baselines on CIFAR-10 and CIFAR-100. Accuracy at the target round and the communication round to reach target test accuracy are based on running exponential moving average with parameter 0.9. The arrows indicate whether higher (↑) or lower (↓) is better. FedCM † and FedDC ‡ require 1.5× and 2× communication cost for each communication round, respectively.

demonstrates that FedACG improves accuracy and convergence speed significantly and consistently compared with other federated learning methods in most cases. This is partly because FedACG enables each client to look ahead to the global update and aligns the local model updates with the global gradient trajectory. Note that FedCM and FedDC require 1.5× and 2× communication cost for each communication round, respectively, since they communicate the current model and the associated gradient information per round, while the others only communicate model parameters.

Results on CIFAR-100 when the client set changes dynamically: we sample 250 clients out of 500 clients as the candidate client set every 100 rounds over 10 stages on the Dirichlet (0.3) split. 10 clients out of the sampled client set participate in local training in each communication round. FedDC † requires 2× communication cost for each communication round.
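The dynamic client-set protocol described in this caption can be sketched as a two-level sampling schedule. The code below is an illustration under the stated numbers (500 clients, a pool of 250 resampled every 100 rounds for 10 stages, 10 participants per round); it assumes uniform sampling without replacement at both levels, and the function name `dynamic_client_schedule` is hypothetical.

```python
import random


def dynamic_client_schedule(total_clients=500, pool_size=250, stage_length=100,
                            num_stages=10, per_round=10, seed=0):
    """Return one (candidate_pool, participants) pair per communication round."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(num_stages):
        # A fresh candidate pool is drawn at the start of every stage.
        pool = rng.sample(range(total_clients), pool_size)
        for _ in range(stage_length):
            # Participants of each round are drawn from the current pool only.
            schedule.append((pool, rng.sample(pool, per_round)))
    return schedule


if __name__ == "__main__":
    schedule = dynamic_client_schedule()
    print(len(schedule))       # total number of communication rounds
    print(schedule[0][1])      # participants of the first round
```

Under this protocol a client may first appear late in training, which is the situation where methods that keep per-client state need a warm-up period.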

Contribution of individual components in FedACG at the 1000th round on CIFAR-10 and CIFAR-100 with 2% participation and 500 clients.

Ablation study of FedACG over its two hyperparameters, λ (a) and β (b), with respect to the accuracy at the 1000th round on CIFAR-10 with 2% participation and 500 clients.

Results on the realistic federated learning datasets which contain feature skewness and data imbalance between the clients. FedCM † and FedDC ‡ require 1.5× and 2× communication cost for each communication round, respectively.

Results with Dirichlet (0.6) data split on CIFAR-10 and CIFAR-100 for two different federated learning settings. Accuracy at the target round and the communication round to reach target test accuracy are based on running exponential moving average with parameter 0.9. The arrows indicate whether higher (↑) or lower (↓) is better. FedCM † and FedDC ‡ require 1.5× and 2× communication cost for each communication round, respectively.


