FEDGC: AN ACCURATE AND EFFICIENT FEDERATED LEARNING UNDER GRADIENT CONSTRAINT FOR HETEROGENEOUS DATA

Abstract

Federated Learning (FL) is an important paradigm in large-scale distributed machine learning, which enables multiple clients to jointly learn a unified global model without transmitting their local data to a central server. FL has attracted growing attention in many real-world applications, such as multi-center cardiovascular disease diagnosis and autonomous driving. In practice, the data across clients are often heterogeneous, i.e., not independently and identically distributed (Non-IID), which makes the local models suffer from catastrophic forgetting of the initial (or global) model. To mitigate this forgetting issue, existing FL methods may require additional regularization terms or generate pseudo data, resulting in 1) limited accuracy; 2) long training time and slow convergence, which are problematic for real-time applications; and 3) high communication cost. In this work, an accurate and efficient Federated Learning algorithm under Gradient Constraints (FedGC) is proposed, which provides three advantages: i) High accuracy is achieved by the proposed Client-Gradient-Constraint based projection method (CGC), which alleviates the forgetting issue at clients, and the proposed Server-Gradient-Constraint based projection method (SGC), which effectively aggregates the gradients of clients; ii) Short training time and a fast convergence rate are enabled by the proposed fast Pseudo-gradient-based mini-batch Gradient Descent (PGD) method and by SGC; iii) Low communication cost results from the fast convergence rate and from the fact that only gradients need to be transmitted between server and clients. In the experiments, four real-world image datasets with three Non-IID types are evaluated, and five popular FL methods are used for comparison. The experimental results demonstrate that our FedGC not only significantly improves the accuracy and convergence rate on Non-IID data, but also drastically decreases the training time.
Compared to the state-of-the-art FedReg, our FedGC improves the accuracy by up to 14.28% and speeds up local training by 15.5 times while decreasing the communication cost by 23%.

1. INTRODUCTION

Federated Learning (FL) enables multiple participants (clients) to collaboratively train a global model while keeping the training data local, motivated by concerns such as data privacy and real-time processing. FL has attracted growing attention in many real-world applications, such as multi-center cardiovascular disease diagnosis (Linardos et al., 2022), Homomorphic-Encryption-based healthcare systems (Zhang et al., 2022), real-time autonomous driving (Zhang et al., 2021a; Nguyen et al., 2022), privacy-preserving vehicular navigation (Kong et al., 2021), and automatic trajectory prediction (Majcherczyk et al., 2021; Wang et al., 2022). However, in practice, the data across clients are often heterogeneous, i.e., not independently and identically distributed (Non-IID) (Sattler et al., 2020; Zhang et al., 2021b), which hinders the optimization convergence and generalization performance of FL in real-world applications. At each communication round, a client first receives the aggregated knowledge of all clients from the server and then locally trains its model on its own data. If the data are Non-IID across clients, the local optimum of each client can be far from the others after local training, and the initial model parameters received from the server will be overridden. Hence, the clients forget the knowledge initially received from the server, i.e., they suffer from catastrophic forgetting of the knowledge learned from other clients (Shoham et al., 2019; Xu et al., 2022). In other words, there is a drastic performance drop (or loss increase) of the model on global data after local training (as detailed in Appendix A.10). Recently, several approaches have been proposed to mitigate catastrophic forgetting in FL, e.g., Federated Curvature (FedCurv) (Shoham et al., 2019) and FedReg (Xu et al., 2022).
FedCurv utilizes the continual learning method Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) to penalize clients for changing the most informative parameters, where the Fisher information matrix determines which parameters are informative. However, EWC is not effective at mitigating catastrophic forgetting in FL (Xu et al., 2022), and FedCurv needs to transmit the Fisher matrix between the server and clients in addition to the model parameters, which significantly increases the communication cost (2.5 times that of the baseline FedAvg (Xu et al., 2022)). In addition, the calculation of the Fisher matrix drastically increases the local training time. FedReg (Xu et al., 2022) is the most recently proposed FL method, inspired by the continual learning method Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017). GEM alleviates catastrophic forgetting by avoiding an increase of the loss on previous tasks. However, it requires an episodic memory containing representative samples from all previous tasks, which makes it unsuitable for FL due to data privacy concerns (Xu et al., 2022). To resolve this, each client in FedReg first generates pseudo data by encoding the knowledge of previous training data learned by the global model, and then regularizes its model parameters by avoiding an increase of the loss on the pseudo data after local training. Although the generated pseudo data protect data privacy and alleviate the forgetting issue in FL, the data generation substantially increases the computational and storage costs of clients, especially when clients have large-scale data. In addition, the generation of pseudo data and the parameter regularization also significantly increase the local training time. Therefore, these methods are not well suited to real-time applications that are sensitive to communication and computational costs.
In this work, we propose an accurate and efficient Federated Learning algorithm under Gradient Constraints (FedGC) to improve the performance of FL on Non-IID data and reduce the local training time. At the client, a fast Pseudo-gradient-based mini-batch Gradient Descent (PGD) algorithm is proposed to reduce the local training time while accelerating the convergence rate of FL. The pseudo gradient of a local model is obtained by computing its gradients over a few mini-batches of data with a gradient descent algorithm. In addition, to mitigate catastrophic forgetting, we propose an effective Client-Gradient-Constraint based projection method (CGC). Unlike GEM, which requires memorized data from other clients, and FedReg, which generates pseudo data at clients, our CGC only utilizes the server gradient (i.e., the gradient aggregated from all clients) and restricts the projected gradient to satisfy the constraint that the angle between these two gradients is less than 90°, so that the local model retains more of the knowledge received from the server. Meanwhile, the projected gradient is also forced to be as close as possible to the pseudo gradient, which enables the local model to learn new knowledge from the local data. At the server, we propose a Server-Gradient-Constraint based projection method (SGC) to obtain an optimal server gradient that incorporates the information of the clients participating in aggregation, while accelerating the convergence rate by restricting the angles between the server gradient and the gradients of participating clients to be less than 90°. Moreover, our FedGC only transmits gradients between the server and clients, and thus greatly saves communication costs.
The contributions are summarized as follows: i) High accuracy of our FedGC on Non-IID data is achieved by the proposed CGC, which mitigates the catastrophic forgetting occurring at clients, and by the proposed SGC, which effectively aggregates the gradients of clients; ii) Short training time and a fast convergence rate in our FedGC are enabled by the proposed fast PGD method and SGC; iii) Low communication cost is required in our FedGC due to the fast convergence rate and because only gradients are transmitted between server and clients; iv) Extensive experimental results illustrate that our FedGC not only improves the performance of FL on Non-IID data with a fast convergence rate but also significantly reduces the local training time.

2. RELATED WORKS

Federated learning is an important paradigm in large-scale distributed machine learning. It enables multiple clients to jointly learn a unified global model without transmitting their local data to a central server (McMahan et al., 2017; Bhagoji et al., 2019; Yang et al., 2019). FedAvg (McMahan et al., 2017) is the most popular FL algorithm. In FedAvg, clients first locally train models on local data, and then their model updates (e.g., parameters) are transmitted over the network to a central server, where the updates are aggregated. However, data in many real-world applications are often Non-IID, which degrades the performance of FedAvg and slows down its convergence (Li et al., 2020; 2018; Xu et al., 2022). SCAFFOLD (Karimireddy et al., 2020) corrects the client drift with a control variate $c_i$ for each client $i$; $c_i$ has the same dimension as the gradient and is transmitted between clients and server, which doubles the communication cost compared with FedAvg. Moreover, the average gradient of the previous communication round may not satisfy its assumptions that $c_j \approx g_j(y_i)$ and $c \approx \frac{1}{N}\sum_j g_j(y_i)$, especially when deep learning models are trained on image datasets at clients (Li et al., 2021). Notably, all the above FL methods rely on stochastic gradient descent (SGD) to train models at clients by performing multiple epochs over the full local data, which significantly increases the local training time. In addition, clients owning small training sets need to wait a long time for the clients with large-scale data to complete local training, which is not computationally efficient in practice and increases the latency between clients. Minibatch SGD (Woodworth et al., 2020) was recently proposed to perform local training on several mini-batches at the same model parameters and then compute the mean gradient by averaging the per-mini-batch gradients.
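As an illustration (not code from any of the cited papers), the mean gradient used in Minibatch SGD can be sketched in a few lines: all per-mini-batch gradients are evaluated at the same model parameters and then averaged.

```python
def minibatch_sgd_mean_gradient(grads):
    """Average B per-mini-batch gradients that were all evaluated at the SAME
    model parameters, as in Minibatch SGD (Woodworth et al., 2020).
    `grads` is a list of B gradient vectors (plain Python lists here)."""
    B = len(grads)
    return [sum(components) / B for components in zip(*grads)]
```

This mean gradient has roughly the same modulus as each per-batch gradient, which is exactly the property the pseudo gradient of Section 3.1 is designed to overcome.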

3. METHOD

Given $K$ clients, each client $k$ has a local dataset; $D_k = \{x_k, y_k\}$ denotes a few mini-batches of it, comprising $n_k$ samples. Let $T$ be the number of communication rounds and $B$ be the number of mini-batches for local training. At each communication round, a subset of clients $K \subseteq [K]$ is sampled uniformly, as in FedAvg. In the $t$-th communication round, $\theta^t$ denotes the parameters of the global model at the server and $\theta_k^t$ the parameters of the local model at client $k$. $g_k^t$ denotes the gradient of client $k$ on its local data after local training, and $g^t$ is the server gradient obtained by aggregating the local gradients of the clients. Notably, all gradients mentioned in this work are negative gradients, for convenience of calculation.

3.1. LOCAL TRAINING WITH PGD AND CGC

At the $t$-th communication round, clients first receive $g^{t-1}$ of the previous round and then synchronize the parameters of their local models with $g^{t-1}$, ensuring that all clients start from the same initial parameters. For client $k \in K$ with its mini-batch data $D_k$, the objective function is

$$\min_{\theta_k^t} \; L(f(x_k;\theta_k^t), y_k) \quad \text{s.t.} \quad \left\langle \frac{\partial L(f(x_k;\theta_k^t), y_k)}{\partial \theta_k^t},\, g^{t-1} \right\rangle \geq 0 \qquad (1)$$

where $\langle \cdot,\cdot \rangle$ denotes the inner product, $f(x_k;\theta_k^t)$ is the prediction of the local model at client $k$ with parameters $\theta_k^t$ and input $x_k$, and $L(f(x_k;\theta_k^t), y_k)$ is the loss of client $k$ on its local data $D_k$:

$$L(f(x_k;\theta_k^t), y_k) = \frac{1}{|D_k|} \sum_{(x_{k,i},\, y_{k,i}) \in D_k} L(f(x_{k,i};\theta_k^t), y_{k,i}) \qquad (2)$$

The constraint in problem (1) requires the angle between the current gradient $\partial L(f(x_k;\theta_k^t), y_k) / \partial \theta_k^t$ and the server gradient $g^{t-1}$ to be less than 90°. In this way, during local training, the update direction of the local model is not only learned from the local data but also restricted by the server gradient. The server gradient obtained by our SGC incorporates the update directions of the clients participating in aggregation (as detailed in Section 3.2). Hence, by solving problem (1), the local model converges toward the global optimum. However, (1) is an optimization problem with an inequality constraint, which cannot be solved directly by SGD. Hence, we divide the solution into two steps, as shown in Figure 1.

Step 1: Pseudo-gradient-based mini-batch Gradient Descent (PGD). At the $t$-th communication round, for client $k \in K$ with its local data $D_k$, the loss function in Eq. (3) is first minimized:

$$\min_{\theta_{k,B}^t} \; L(f(x_k;\theta_{k,B}^t), y_k) \qquad (3)$$

where $\theta_{k,B}^t$ are the model parameters after $B$ mini-batches.

Pseudo-gradient:

During this local training, the gradients $g_{k,1}^t, g_{k,2}^t, \ldots, g_{k,B}^t$ in Figure 1(b) are obtained. Different from the mean gradient $\bar{g}_k^t = \frac{1}{B}\sum_{b=1}^{B} g_{k,b}^t$ used in Minibatch SGD, which averages the gradients of several mini-batches at the same point (e.g., $\theta_{i,0}^t$), we compute a pseudo gradient $\tilde{g}_k^t$ to represent the overall gradient of this local training:

$$\tilde{g}_k^t = \frac{\theta_{k,B}^t - \theta_{k,0}^t}{\eta} \qquad (4)$$

where $\theta_{k,0}^t$ are the initial model parameters and $\eta$ is the learning rate. Compared with the mean gradient $\bar{g}_k^t$, whose modulus is similar to that of each gradient $g_{k,l}^t$, $1 \leq l \leq B$, the pseudo gradient $\tilde{g}_k^t$ has a larger modulus than $g_{k,l}^t$, as shown in Figure 1(b). In this way, $\tilde{g}_k^t$ promotes a large update of the local model even with only a few mini-batches of local training, which accelerates the convergence rate of FL.

Step 2: Client-Gradient-Constraint based projection method (CGC). Since the data across clients are Non-IID, the angle between the pseudo gradient $\tilde{g}_k^t$ and the server gradient $g^{t-1}$ may exceed 90°, as shown in Figure 1(b). This means that the update direction of the local model deviates from the global optimum, i.e., catastrophic forgetting occurs at the client. To mitigate this forgetting, the Client-Gradient-Constraint based projection method (CGC) is proposed in this subsection. At the $t$-th communication round, for client $k \in K$ with its local data $D_k$, the optimization problem of CGC is given by (5): the projected gradient $g_k^t$ should be as close as possible to the pseudo gradient $\tilde{g}_k^t$ (in squared L2 norm) while being at an acute angle to the server gradient $g^{t-1}$:

$$\min_{g_k^t} \; \frac{1}{2}\left\| \tilde{g}_k^t - g_k^t \right\|^2 \quad \text{s.t.} \quad \left\langle g_k^t, g^{t-1} \right\rangle - C \geq 0 \qquad (5)$$

where $C = 10^{-3}$ is a small positive constant that prevents $g_k^t$ and $g^{t-1}$ from being orthogonal after projection (i.e., it ensures $\langle g_k^t, g^{t-1} \rangle > 0$).
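A minimal pure-Python sketch of Step 1 (illustrative only; in real training each per-mini-batch gradient would come from backpropagation, not a precomputed list): take B gradient-descent steps and divide the total parameter displacement by η, as in Eq. (4).

```python
def pgd_pseudo_gradient(theta0, batch_grads, eta):
    """Step 1 (PGD): take B mini-batch steps theta_b = theta_{b-1} + eta * g_b
    (gradients are *negative* gradients, following the paper's convention),
    then return the pseudo gradient (theta_B - theta_0) / eta of Eq. (4)."""
    theta = list(theta0)
    for g in batch_grads:                 # one gradient per mini-batch, b = 1..B
        theta = [p + eta * gi for p, gi in zip(theta, g)]
    return [(tB - t0) / eta for tB, t0 in zip(theta, theta0)]
```

With fixed per-batch gradients the pseudo gradient reduces to their sum, i.e., B times the mean gradient, which is why it has a larger modulus and promotes larger updates per round.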
By solving problem (5), the projected gradient $g_k^t$ enables the model to retain more of the knowledge received from the server while learning new knowledge from the local data, i.e., the model balances preserving the server knowledge and adapting to the local data. By simplifying problem (5), we obtain the primal of the CGC Quadratic Program (QP) with an inequality constraint:

$$\min_{u} \; \frac{1}{2} u^{\top} u - h^{\top} u + \frac{1}{2} h^{\top} h \quad \text{s.t.} \quad C - z^{\top} u \leq 0 \qquad (6)$$

where $u = g_k^t \in \mathbb{R}^p$, $h = \tilde{g}_k^t \in \mathbb{R}^p$, $z = g^{t-1} \in \mathbb{R}^p$, and the constant term $\frac{1}{2} h^{\top} h$ can be discarded. Problem (6) is a QP on $p$ variables, where $p$ is the number of model parameters and can be in the millions. Hence, to solve problem (6) efficiently, we convert the primal problem into its dual and obtain the dual of the CGC QP:

$$\min_{v} \; \frac{1}{2} v^2 z^{\top} z + v (h^{\top} z - C) \quad \text{s.t.} \quad v \geq 0 \qquad (7)$$

where $v \in \mathbb{R}$ is a Lagrange multiplier, so problem (7) is a QP on $1 \ll p$ variable. To solve problem (7), the Python library quadprog is used, yielding the optimum $v^{\star}$. The details of deriving the dual of the CGC QP are provided in Appendix A.1. Finally, the optimal solution to problem (6) is $u^{\star} = h + v^{\star} z$, i.e., $g_k^t = \tilde{g}_k^t + v^{\star} g^{t-1}$ after CGC. In addition, since the whole concatenated gradient has a very large dimension, we perform the gradient projection iteratively layer by layer (i.e., in a layer-wise manner) to reduce the memory overhead. A theoretical analysis of the gradient projection is detailed in Appendix A.3.
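Because problem (7) has a single scalar variable, it even admits a closed form: minimizing ½v²zᵀz + v(hᵀz − C) over v ≥ 0 gives v⋆ = max(0, (C − hᵀz)/zᵀz). A small pure-Python sketch of the resulting projection (illustrative; the paper solves the dual with quadprog):

```python
def cgc_project(h, z, C=1e-3):
    """CGC projection, problems (5)-(7): h is the pseudo gradient, z the server
    gradient. Returns u* = h + v* z with the closed-form dual optimum
    v* = max(0, (C - <h, z>) / <z, z>), so that <u*, z> >= C."""
    hz = sum(a * b for a, b in zip(h, z))
    zz = sum(a * a for a in z)
    v = max(0.0, (C - hz) / zz)          # optimal Lagrange multiplier of (7)
    return [hi + v * zi for hi, zi in zip(h, z)]
```

When ⟨h, z⟩ ≥ C already holds (acute angle), v⋆ = 0 and the pseudo gradient is returned unchanged, matching the behavior of client i in Figure 1.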

3.2. SERVER AGGREGATION WITH SGC

At the $t$-th communication round, after local training, the local gradients $g_k^t$ of the clients are sent to the server for aggregation. At the server, the aggregated gradient $g^t$ (i.e., the server gradient) is sent back to the clients, and the parameters $\theta^t$ of the global model are updated by $\theta^t = \theta^{t-1} + \eta g^t$. For aggregation, most FL methods simply use the weighted average of the local gradients. When the data across clients are Non-IID, the weighted average may be effective for only a few clients, because the server gradient may point in the opposite direction of some local gradients (i.e., the angle between them exceeds 90°), which slows down the convergence of FL. To aggregate the local gradients effectively and accelerate convergence, we propose a Server-Gradient-Constraint based projection method (SGC). Through SGC, the projected gradient points in a positive direction with respect to the local gradients of all participating clients. The optimization problem of SGC is

$$\min_{g^t} \; \frac{1}{2}\left\| g^t - \bar{g}^t \right\|^2 \quad \text{s.t.} \quad \left\langle g^t, g_k^t \right\rangle - C \geq 0, \; \forall k \in K \qquad (8)$$

where $\bar{g}^t = \sum_{k \in K} \frac{n_k}{N} g_k^t$ is the weighted average of the local gradients, $N = \sum_{k \in K} n_k$, and $g_k^t$ is the local gradient of client $k$. The constraint in problem (8) restricts the angle between the projected gradient $g^t$ and the local gradients of the participating clients to be less than 90°. Similarly, we obtain the primal of the SGC QP with inequality constraints by simplifying (8):

$$\min_{z} \; \frac{1}{2} z^{\top} z - g^{\top} z + \frac{1}{2} g^{\top} g \quad \text{s.t.} \quad C - G z \leq 0 \qquad (9)$$

where $z = g^t \in \mathbb{R}^p$, $g = \bar{g}^t \in \mathbb{R}^p$, and $G = (\ldots, g_k^t, \ldots)^{\top} \in \mathbb{R}^{|K| \times p}$, $k \in K$. The constant term $\frac{1}{2} g^{\top} g$ can be ignored. Problem (9) is a QP on $p$ variables. To solve it efficiently, we also convert the primal problem (9) into its dual and obtain the dual of the SGC QP:

$$\min_{\lambda} \; (G g - C)^{\top} \lambda + \frac{1}{2} \lambda^{\top} G G^{\top} \lambda \quad \text{s.t.} \quad \lambda \geq 0$$

where $\lambda \in \mathbb{R}^{|K|}$ is the vector of Lagrange multipliers. Finally, the optimal solution to problem (9) is $z^{\star} = g + G^{\top} \lambda^{\star}$, i.e., $g^t = \bar{g}^t + G^{\top} \lambda^{\star}$ after SGC. When problem (9) is infeasible, we simply set $g^t = \bar{g}^t$. The pseudo-code of our method is provided in Algorithm 1, and the convergence analysis is detailed in Appendix A.4.
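The SGC dual involves only |K| ≪ p variables. As an illustration of the idea (the paper solves the dual with quadprog; the `iters` and `lr` values below are hypothetical tuning choices), the sketch solves the dual by projected gradient descent on λ ≥ 0 and returns gᵗ = ḡᵗ + Gᵀλ⋆:

```python
def sgc_aggregate(local_grads, weights, C=1e-3, iters=2000, lr=0.05):
    """SGC, problems (8)-(9): start from the weighted average g_bar of the
    local (negative) gradients and solve the dual
        min_{lam >= 0}  (G g_bar - C)^T lam + 1/2 lam^T G G^T lam
    by projected gradient descent; return z = g_bar + G^T lam."""
    K = len(local_grads)
    p = len(local_grads[0])
    g_bar = [sum(w * g[j] for w, g in zip(weights, local_grads)) for j in range(p)]
    Ggbar = [sum(g[j] * g_bar[j] for j in range(p)) for g in local_grads]
    GGt = [[sum(gi[j] * gk[j] for j in range(p)) for gk in local_grads]
           for gi in local_grads]
    lam = [0.0] * K
    for _ in range(iters):
        grad = [Ggbar[i] - C + sum(GGt[i][k] * lam[k] for k in range(K))
                for i in range(K)]
        lam = [max(0.0, lam[i] - lr * grad[i]) for i in range(K)]  # project onto lam >= 0
    return [g_bar[j] + sum(lam[i] * local_grads[i][j] for i in range(K))
            for j in range(p)]
```

If all constraints ⟨ḡᵗ, g_kᵗ⟩ ≥ C already hold, λ⋆ = 0 and SGC returns the plain weighted average, consistent with the fallback described above.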

4. EXPERIMENTS

We conduct extensive experiments to compare our FedGC with several popular approaches, including FedAvg, FedProx, FedCurv, SCAFFOLD and FedReg, on real image datasets. The data preparation and experimental details are described below. Performance is evaluated in three aspects: 1) overall test accuracy; 2) convergence rate and training time; 3) communication cost.

4.1. DATASETS

Algorithm 1: FedGC
Input: K, T, B, datasets D = ∪_{k∈[K]} D_k, learning rate η.
Output: the parameters θ^T of the global model.
Initialize: server θ^0; clients θ^0_k ← θ^0 for k ∈ [K].
for t = 1 to T do
  for k ∈ K in parallel do
    # Synchronize parameters
    if t > 1 then g^t_k ← g^{t-1}; θ^t_k ← θ^{t-1}_k + η g^t_k end
    # Local training (PGD) over B mini-batches from θ^t_{k,0} (a)
    for b = 1 to B do θ^t_{k,b} ← θ^t_{k,b-1} + η g^t_{k,b} end
    # Pseudo gradient
    g̃^t_k ← (θ^t_{k,B} − θ^t_{k,0}) / η;
    g^t_k ← CGC(g̃^t_k, g^{t-1}) in (6);
    θ^t_k ← θ^t_{k,0};
  end
  ḡ^t ← Σ_{k∈K} (n_k/N) g^t_k; G ← (…, g^t_k, …)^⊤, k ∈ K;
  g^t ← SGC(ḡ^t, G) in (9);
  θ^t ← θ^{t-1} + η g^t;
end
(a) If t = 1, θ^t_{k,0} = θ^{t-1}_k; otherwise θ^t_{k,0} = θ^t_k.

The experiments are conducted on three real image datasets: Handwritten-Digits, CIFAR-10 (Krizhevsky, 2009) and CIFAR-100 (Krizhevsky, 2009).

Handwritten-Digits. The Handwritten-Digits data combine four digit datasets, including USPS (Hull, 1994) and SVHN (Netzer et al., 2011); each dataset contains 10 classes. To feed these images into deep models sharing the same network architecture, all images in the four datasets (training and test sets) are reshaped to (32, 32, 3). The training data are split into 4 clients (named HWDigits-4) and 40 clients (named HWDigits-40), respectively. In HWDigits-4, each client owns one handwritten-digits dataset. In HWDigits-40, under the one-class setting, each client has images belonging to only one class of one dataset. HWDigits-4 suffers from the attribute-skew type of Non-IID issue, in which the attributes (i.e., data features) differ across clients. In HWDigits-40, the data across clients may differ in both labels and attributes, so it suffers from both attribute and label skew.

CIFAR-10 and CIFAR-100. Under the one-class setting, the training sets of CIFAR-10 and CIFAR-100 are split into 10 (named CIFAR10-10) and 100 (named CIFAR100-100) clients, respectively, i.e., each client owns samples of only one class. All clients share the test sets of the original CIFAR-10 and CIFAR-100, respectively. CIFAR10-10 and CIFAR100-100 both suffer from label skew under the one-class setting.
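To make Algorithm 1 concrete, here is a hypothetical, self-contained walk-through of one communication round with two clients, 2-D parameters, and stylized constant mini-batch gradients (in a real system each g^t_{k,b} would come from backpropagation on a mini-batch, and SGC would solve its dual; here the SGC constraints turn out to hold already, so λ⋆ = 0):

```python
C, eta = 1e-3, 0.1                              # constraint constant and learning rate

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cgc(h, z):
    """Closed-form CGC for the single-constraint dual (7)."""
    v = max(0.0, (C - dot(h, z)) / dot(z, z))
    return [hi + v * zi for hi, zi in zip(h, z)]

g_prev = [1.0, 0.0]                             # server gradient g^{t-1}
batches = {                                     # stylized per-mini-batch gradients, B = 2
    "client_i": [[1.0, 0.0], [1.0, 0.2]],       # roughly aligned with g_prev
    "client_k": [[-1.0, 0.1], [-1.0, 0.1]],     # opposed to g_prev (Non-IID drift)
}

local_grads = []
for mb_grads in batches.values():
    theta0 = [0.0, 0.0]                         # theta^t_{k,0} after synchronization
    theta = list(theta0)
    for g in mb_grads:                          # Step 1: PGD over B mini-batches
        theta = [p + eta * gi for p, gi in zip(theta, g)]
    pseudo = [(a - b) / eta for a, b in zip(theta, theta0)]   # Eq. (4)
    local_grads.append(cgc(pseudo, g_prev))     # Step 2: CGC projection

# Server: weighted average (equal client sizes here); since <g_bar, g_k> >= C
# already holds for both clients, lambda* = 0 and g^t = g_bar.
g_bar = [sum(col) / len(local_grads) for col in zip(*local_grads)]
theta_global = [p + eta * g for p, g in zip([0.0, 0.0], g_bar)]   # theta^t
```

Client k's pseudo gradient points against g^{t-1}, so CGC replaces it with roughly [0.001, 0.2]; the aggregated g^t is then approximately [1.0005, 0.2] and satisfies both clients' constraints.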

4.2. EXPERIMENTAL SETTING

We implement all methods in PyTorch-1.9.0 (Paszke et al., 2019). The learning rates η of all methods are tuned over {0.01, 0.05, 0.1, 1.0}. For the Handwritten-Digits experiments, the optimal η of our FedGC is 0.1, and that of all other compared methods is 0.01. For CIFAR10-10 and CIFAR100-100, a ResNet-9 network with Fixup initialization (Xu et al., 2022) is trained from scratch; the number of communication rounds T is set to 500 for CIFAR10-10 and 1000 for CIFAR100-100, with E = 1 and B = 1. On these datasets, the optimal η for our FedGC is 0.1 and that of all other compared methods is 0.05.

Table 1 shows the overall test accuracies of all compared methods on HWDigits-4, HWDigits-40, CIFAR10-10 and CIFAR100-100. In Table 1, our FedGC (denoted FedGC(w) if not specified) achieves higher test accuracies than all other compared methods. Specifically, for HWDigits-40, our FedGC outperforms the SOTA FedReg by 14.28% (= 77.33 − 63.05) in accuracy, which clearly shows the superiority of our FedGC in tackling Non-IID data. FedCurv is designed under the assumption that deep neural networks are sufficiently over-parameterized, so that there is a good probability of finding an optimal solution to task B in the neighborhood of the previously learned solution. However, on CIFAR100-100 there are 100 clients and each client contains samples of only one class; the over-parameterization assumption is thus not satisfied, and FedCurv achieves only 9.48% accuracy. FedProx imposes strong constraints on the model parameters and therefore lacks flexibility, which hinders the learning of new knowledge from the local data; hence, on CIFAR100-100 it also performs worse than the baseline FedAvg.
SCAFFOLD performs worst on most of the compared datasets, because the average gradient of the previous communication round may not satisfy its assumptions, especially when deep learning models are trained on image datasets at clients (Li et al., 2021). FedReg, the most recently proposed method, alleviates catastrophic forgetting in FL by generating pseudo data; the pseudo data are generated to guarantee that the loss of the local model on them is lower than that of the initial (global) model on them. In this way, the catastrophic forgetting issue can be alleviated. However, when the number of samples varies significantly between clients (e.g., HWDigits-4 and HWDigits-40), FedReg becomes biased toward the majority clients (i.e., the clients with many samples), because the global model in FedReg is obtained by weighted averaging of model parameters and is therefore biased toward majority clients. In local training, a client with few samples (i.e., a minority client) forgets the knowledge learned from its local data after regularizing its model parameters with the pseudo data. In other words, FedReg further magnifies the bias of the global model toward majority clients.

4.3. OVERALL TEST ACCURACY

Benefiting from the proposed CGC, our FedGC mitigates catastrophic forgetting in FL by constraining the local gradient to be at an acute angle to the server gradient while simultaneously minimizing the loss of the local model on its local data. Moreover, benefiting from the proposed SGC, the projected server gradient in our FedGC effectively aggregates the knowledge from clients. The gradients projected by both CGC and SGC reduce the bias toward majority clients. As shown in Table 1, FedGC(w) further improves the performance of FedGC(w/o). In particular, SGC can further accelerate the convergence rate of FL (as detailed in Section 4.4). Hence, FedGC(w) achieves the highest accuracy on all compared datasets, with improvements of up to 14.28% (HWDigits-40) over the state-of-the-art (SOTA) FedReg. In addition, our FedGC neither transmits extra data between the server and clients nor needs extra storage to keep generated data, and the computational costs of CGC and SGC are extremely small. Therefore, FedGC is very friendly to edge devices with limited computational resources.

4.4. CONVERGENCE RATE AND TRAINING TIME

Figure 2 illustrates that our FedGC quickly achieves higher accuracies than the other methods on all compared datasets and maintains them until the end of communication. This indicates that our method improves both the performance and the convergence rate of FL on Non-IID data. The high convergence rate mainly benefits from the proposed PGD and SGC. FedReg is also designed to accelerate the convergence of FL by alleviating the forgetting issue, but it is more biased toward majority clients, which may slow down its convergence on HWDigits-4 and HWDigits-40. For HWDigits-40, FedReg achieves its highest accuracy (63.05%) at round 476, as shown in Figure 2(d), while our FedGC reaches this accuracy at round 365. Meanwhile, our FedGC improves the accuracy by 14.28% while decreasing the communication cost by 23% (≈ 1 − (365 × |θ|)/(476 × |θ|)) compared to FedReg.
For CIFAR100-100, our FedGC also quickly reaches 53.02% accuracy (i.e., the accuracy of FedReg), which empirically verifies that the projected gradients at both the server and the clients are helpful for FL. Among the compared methods, SCAFFOLD is very unstable during training (as shown in Figures 2(b) and (d)). In addition, Figure 3 illustrates that our SGC can further improve the convergence rate of FL. Training time is another effective way to measure the practicality and efficiency of FL methods. The average training time of each client per communication round is summarized in Table 2, where FedAvg is considered the baseline. Due to the calculation of the proximal term in the objective function during local training, FedProx increases the training time by about 13s per round on each dataset. Similarly, SCAFFOLD also requires more time to calculate and transmit the control variate between clients and server. FedCurv dramatically increases the training time due to the time-consuming calculation of the Fisher information matrix in every mini-batch. FedReg needs to generate pseudo data during local training, so it also significantly increases the training time. On the contrary, in our FedGC, the gradient projections at both the clients and the server are very efficient, owing to the small number of variables in the dual problems.

4.5. COMMUNICATION COSTS

The comparison of communication costs is shown in Table 2. We calculate the communication cost as the number of communication rounds needed to reach a target accuracy (e.g., the highest accuracy of the baseline FedAvg) multiplied by the amount of transmitted data (e.g., the number of parameters |θ|). The entry 2.5|θ| indicates that FedCurv needs to transmit the Fisher matrix between the server and clients in addition to the model parameters. In Table 2, our FedGC requires the lowest communication cost on all four datasets when achieving the target accuracy. Specifically, on HWDigits-4, our FedGC decreases the communication cost by 96.6% (≈ 1 − (19 × |θ|)/(557 × |θ|)) compared to FedAvg, owing to the fewest communication rounds. Compared to the SOTA FedReg, the communication cost is decreased by 45% (≈ 1 − (287 × |θ|)/(531 × |θ|)) on CIFAR100-100.
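The percentages above follow directly from the rounds-to-target ratio, since every method here transmits |θ| per round (a sketch of the arithmetic, not code from the paper):

```python
def comm_saving(rounds_ours, rounds_base):
    """Relative saving 1 - (rounds_ours * |theta|) / (rounds_base * |theta|);
    |theta| cancels because both methods transmit the same payload per round."""
    return 1.0 - rounds_ours / rounds_base
```

For example, HWDigits-4 versus FedAvg is comm_saving(19, 557), about 0.966, i.e., the 96.6% figure quoted above.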

5. CONCLUSIONS

A.2 DUAL OF SGC QP

The Lagrangian of the SGC primal (9) is

$$L(z, \lambda) = \frac{1}{2} z^{\top} z - g^{\top} z + \frac{1}{2} g^{\top} g + \lambda^{\top}(C - G z) = \frac{1}{2} z^{\top} z - (g^{\top} + \lambda^{\top} G) z + \frac{1}{2} g^{\top} g + \lambda^{\top} C$$

Since $L(z, \lambda)$ is quadratic and convex in $z$, its minimum is obtained from

$$\nabla_z L(z, \lambda) = z - (g + G^{\top} \lambda) = 0 \;\Longrightarrow\; z = g + G^{\top} \lambda \qquad (22)$$

so that

$$\begin{aligned} L(g + G^{\top}\lambda, \lambda) &= \frac{1}{2}(g + G^{\top}\lambda)^{\top}(g + G^{\top}\lambda) - g^{\top}(g + G^{\top}\lambda) + \frac{1}{2} g^{\top} g + \lambda^{\top} C - \lambda^{\top} G (g + G^{\top}\lambda) \\ &= -\frac{1}{2}\lambda^{\top} G G^{\top} \lambda - \frac{1}{2} g^{\top} G^{\top} \lambda - \frac{1}{2} \lambda^{\top} G g + \lambda^{\top} C \\ &= -\frac{1}{2}\lambda^{\top} G G^{\top} \lambda - g^{\top} G^{\top} \lambda + \lambda^{\top} C \\ &= -(G g - C)^{\top} \lambda - \frac{1}{2}\lambda^{\top} G G^{\top} \lambda \end{aligned} \qquad (23)$$

Therefore, the dual of the SGC QP is

$$\min_{\lambda} \; (G g - C)^{\top} \lambda + \frac{1}{2} \lambda^{\top} G G^{\top} \lambda \quad \text{s.t.} \quad \lambda \geq 0$$

The optimum $\lambda^{\star}$ can be computed with the quadprog library, and the optimal solution of the SGC QP is $z^{\star} = g + G^{\top} \lambda^{\star}$.

A.3 THEORETICAL ANALYSIS OF GRADIENT PROJECTION

Assumption 1. In each local update, only small optimization steps are taken, and thus we can assume that the function $F$ is locally linear (i.e., convex).

To enable the local model to retain more of the knowledge received from the server after a local update, the loss of the local model on the global data $D$ should decrease after the local update, i.e., $F(\theta_k^t, D) < F(\theta^{t-1}, D)$, where $\theta_k^t$ are the model parameters at client $k$ after the local update and $\theta^{t-1}$ are the initial model parameters (i.e., the global model of the previous round). Since the global data $D$ cannot be accessed in FL, we use gradient projection to achieve this objective. For simplicity, $D$ is omitted in the following equations. From convexity we know that

$$\nabla F(\theta^{t-1})^{\top} (\theta_k^t - \theta^{t-1}) \geq 0 \;\Longrightarrow\; F(\theta_k^t) \geq F(\theta^{t-1}),$$

so the search direction in local training must satisfy

$$\nabla F(\theta^{t-1})^{\top} (\theta_k^t - \theta^{t-1}) < 0.$$

Thus, the update must make an acute angle with the negative gradient, i.e.,

$$(g^{t-1})^{\top} g_k^t > 0 \;\Longleftrightarrow\; \langle g^{t-1}, g_k^t \rangle > 0,$$

where $g^{t-1} = -\nabla F(\theta^{t-1})$ and $g_k^t = \theta_k^t - \theta^{t-1}$.
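The acute-angle condition can be checked numerically with a one-line first-order expansion (an illustrative sketch under Assumption 1, not code from the paper):

```python
def first_order_delta(neg_grad, update):
    """First-order loss change F(theta + d) - F(theta) ~ <grad F, d> = -<g, d>,
    where g = -grad F is the negative gradient used throughout the paper.
    A positive inner product <g, d> therefore means the loss decreases."""
    return -sum(a * b for a, b in zip(neg_grad, update))
```

For example, first_order_delta([1.0, 0.0], [0.5, 0.3]) is −0.5 (acute angle, loss decreases), while flipping the sign of the update's first coordinate gives +0.5 (obtuse angle, loss increases).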

A.4 CONVERGENCE ANALYSIS

Here we give a simple argument that our FedGC has a faster convergence rate than FedAvg. We analyze the convergence of FedGC by finding an upper bound $\xi$ on $\mathbb{E}[F(\theta^T)] - F^{\star}$, i.e., $\mathbb{E}[F(\theta^T)] - F^{\star} \leq \xi$, where $F(\theta^T)$ is the final global model with parameters $\theta^T$ and $F^{\star}$ is the optimal model on all clients' data (i.e., the upper bound of the global model in FL). From A.3, our FedGC can guarantee that the loss of the local model on the global data decreases after a local update (i.e., $F(\theta_k^t) < F(\theta^{t-1})$), while FedAvg cannot. Consequently, our FedGC can guarantee that the loss of the global model gradually decreases (i.e., $F(\theta^t) < F(\theta^{t-1})$), while FedAvg cannot. In detail,

$$F(\theta^t) = F\Big(\sum_{k=1}^{K} p_k \theta_k^t\Big) \qquad (28)$$

where the aggregation is a weighted average with $\sum_{k=1}^{K} p_k = 1$. If Assumption 1 holds, we obtain

$$F(\theta^t) = F\Big(\sum_{k=1}^{K} p_k \theta_k^t\Big) = \sum_{k=1}^{K} p_k F(\theta_k^t) < \sum_{k=1}^{K} p_k F(\theta^{t-1}) = F(\theta^{t-1})$$

Thus, after $T$ communication rounds (i.e., for a fixed $T$), the gap between $\mathbb{E}[F(\theta^T)]$ and $F^{\star}$ of our FedGC is smaller than that of FedAvg, and we conclude that $\xi_{GC} < \xi_{Avg}$.

A.5 PRIVACY-PRESERVING ANALYSIS

Recently, the Deep Leakage from Gradients (DLG) attack (Zhu et al., 2019) has attracted growing attention; it can completely recover real data from gradients. Under our FedGC, however, the real data cannot be recovered, as detailed below. In DLG, a pair of "dummy" input and label is first randomly generated to perform the usual forward and backward operations. To recover the real data, the dummy gradients on the dummy data are first derived; DLG then optimizes the dummy input and label by minimizing the Euclidean distance between the dummy gradients and the real gradients. However, matching the gradients cannot make the dummy data close to the real data when our FedGC is used.
In Figure 4, at the $t$-th communication round, an honest client $k$ samples a mini-batch $(x_{t,k}, y_{t,k})$ from its own data, and an evil client randomly initializes a dummy input $x'_{t,k}$ with a dummy label $y'_{t,k}$. The objective of the evil client is to find $(x'_{t,k}, y'_{t,k})$ whose dummy gradients match the gradients uploaded by client $k$. Since the gradients uploaded in FedGC are projected by CGC, matching them does not recover the real data. Therefore, we can conclude that real data cannot be recovered by gradient inversion attacks (e.g., DLG) in our FedGC, and thus FedGC helps preserve privacy. In addition, we conduct a gradient inversion experiment comparing FedAvg and our FedGC under DLG on the CIFAR10-10 dataset. The backbone is ResNet-9, and the number of DLG iterations is set to 300. As shown in Figure 5, the quality of the images recovered from our FedGC is significantly worse than that of the images recovered from FedAvg, exhibiting the better privacy protection capability of our FedGC.

CIFAR-10 and CIFAR-100 CIFAR-10 contains 60,000 32×32 color images of 10 classes, including 5,000 training images and 1,000 test images per class. CIFAR-100 consists of 60,000



Footnotes: https://github.com/quadprog/quadprog · The calculation is provided in Section 4.5 with FedReg as the baseline; |θ| denotes the number of model parameters; ≈ (67.99+65.15+64.65+82.41) − (58.63+44+44.74+78.93) · http://yaroslav.ganin.net/ · http://ufldl.stanford.edu/housenumbers/



Figure 1: Local training with PGD and CGC performed at two clients in the $t$-th round. Clients receive the server gradient $g^{t-1}$ before local training. At step 1, local models perform PGD on three mini-batches and compute the pseudo-gradients (i.e., $\hat{g}^t_i$ and $\hat{g}^t_k$) by (4). At step 2, clients obtain the projected gradients by performing CGC on the pseudo-gradients. Since the angle between $\hat{g}^t_i$ and $g^{t-1}$ is less than 90°, the projected gradient of client $i$ is $\hat{g}^t_i$ itself after CGC. In contrast, since the angle between $\hat{g}^t_k$ and $g^{t-1}$ is more than 90°, the projected gradient $g^t_k$ is obtained by performing CGC (red arrow) on $\hat{g}^t_k$ to satisfy the constraints of (5). $\bar{g}^t_i$ and $\bar{g}^t_k$ (green arrows) are the mean gradients used in mini-batch SGD. The projected gradients usually have a larger modulus than the mean gradients.

Mini-batch update: Instead of performing multiple epochs over the full local data in each communication round as in FedAvg, we prefer to use a few mini-batches to train the local model, which saves local training time and avoids latency across clients. Most importantly, when data are Non-IID, too many iterations in local training may cause local models to become biased towards their local data and enlarge the difference between the local model and the global model, which is not conducive to the convergence of the global model. The experiments on FedAvg with mini-batch updates in Appendix A.7 verify this statement.
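The mini-batch local update can be sketched as follows, assuming (as Figure 1 suggests) that the pseudo-gradient of (4) is the parameter displacement after a few mini-batch steps. The toy quadratic loss and the function name are ours, for illustration only.

```python
import numpy as np

def local_update_pgd(theta, batches, lr=0.1):
    """Run a few mini-batch SGD steps on a toy quadratic loss
    0.5 * ||theta - mean(batch)||^2, then return the
    pseudo-gradient g = theta_after - theta_before (cf. Eq. (4))."""
    theta_before = theta.copy()
    for x in batches:                  # a few mini-batches, not full epochs
        grad = theta - x.mean(axis=0)  # gradient of the toy loss
        theta = theta - lr * grad
    return theta - theta_before        # pseudo-gradient
```

Using only a few batches keeps the displacement small, which limits how far the local model can drift from the global model within one round.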

and $p$ indicates the number of parameters of the local model.

where $\lambda \in \mathbb{R}^{|K|}$ is the Lagrange multiplier, and problem (10) is a QP in $|K| \ll p$ variables. The solution $\lambda^{\star}$ of problem (10) is likewise obtained with the quadprog library. The details of deriving the dual of the SGC QP are provided in Appendix A.2.
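Because the dual has only $|K|$ variables, it can also be solved without quadprog for small $|K|$. The sketch below assumes a GEM-style formulation (find the $u$ closest to the aggregated gradient $g$ with $Gu \ge 0$, where the rows of $G$ are client gradients); the matrix names and the function name are ours, not the paper's.

```python
import numpy as np

def solve_sgc_dual(G, g, lr=0.01, steps=2000):
    """Solve  min_u 0.5*||u - g||^2  s.t.  G @ u >= 0  via projected
    gradient descent on the dual (a quadprog-free stand-in):
        min_{v >= 0}  0.5 * v^T (G G^T) v + v^T (G g),
    then recover the primal solution u* = G^T v* + g."""
    K = G.shape[0]
    Q = G @ G.T          # K x K dual Hessian
    c = G @ g            # linear term of the dual
    v = np.zeros(K)
    for _ in range(steps):
        v = np.maximum(0.0, v - lr * (Q @ v + c))  # project onto v >= 0
    return G.T @ v + g
```

The key point mirrors the text: the dual lives in $\mathbb{R}^{|K|}$, so its cost is independent of the model size $p$ except for the two matrix products with $G$.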

Figure 2: Test accuracy vs. communication rounds on four datasets. The black dotted line denotes the highest accuracy of FedReg. The red and blue dotted lines represent the round numbers of FedGC and FedReg reaching this accuracy, respectively.

Figure 3: Test accuracy vs. communication rounds on HWDigits-4 and CIFAR10-10. The black dotted line denotes the highest accuracy of FedReg. The blue, green and red dotted lines represent the round numbers of FedReg, FedGC(w/o) and FedGC(w) reaching this accuracy, respectively.

Figure 4: DLG in our FedGC.

Figure 5: Images recovered from updated gradients of FedAvg and FedGC on CIFAR10-10.

FedProx incurs the same communication cost as FedAvg because it does not transmit additional information besides model parameters. However, compared to FedCurv, it lacks flexibility in the model parameters, and this stiffness comes at the expense of accuracy. SCAFFOLD (Karimireddy et al., 2020) introduces control variates $c$ and $c_i$ to correct the client drift on Non-IID data during local training. The control variates $c$ and $c_i$ guide the local model to update along the average gradient of the participating clients from the previous communication round. However, the variate
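The SCAFFOLD correction described above can be sketched as a single local step; this is a schematic of the update rule from Karimireddy et al. (2020), not their reference implementation, and `grad_fn` and the function name are our placeholders.

```python
import numpy as np

def scaffold_local_step(theta, grad_fn, c_global, c_local, lr=0.1):
    """One SCAFFOLD-style local step: the client gradient is
    corrected by (c_global - c_local) to counteract client drift."""
    g = grad_fn(theta)
    return theta - lr * (g - c_local + c_global)
```

When the local and global control variates agree, the correction vanishes and the step reduces to plain SGD; the correction only acts when the client's update direction drifts from the average.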

training) in FedGC is 50. The client fraction is set to 1.0 and the batch size is 100.

Comparison results (%) on four datasets (↑). 'w/o' and 'w' denote the server aggregation in FedGC without and with SGC, respectively.

The training time per communication round and the communication costs (Comm. cost) (↓).

$(x'^{\star}_{t,k}, y'^{\star}_{t,k}) = \arg\min_{x'_{t,k},\, y'_{t,k}} \big\| \nabla_{\theta} F(x'_{t,k}, y'_{t,k}; \theta) - g^t_k \big\|^2$, where $g^t_k$ is the gradient uploaded by client $k$. However, in our FedGC the uploaded gradients are dependent on the global gradient $g^{t-1}$ through our CGC, and the CGC projection cannot be reversely derived. Thus, we obtain $(x'_{t,k}, y'_{t,k}) \ne (x_{t,k}, y_{t,k})$.
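The gradient-matching objective above can be made concrete on a toy model. The sketch below runs a DLG-style attack on a scalar-output linear model in numpy; the model, loss, and function name are our illustrative choices (the paper's experiments use ResNet-9), so this only demonstrates the matching mechanism, not the paper's setup.

```python
import numpy as np

def dlg_attack(w, g_real, steps=1000, lr=0.1, seed=0):
    """Toy DLG (Zhu et al., 2019) on a linear model with loss
    0.5*(w.x - y)^2, whose gradient w.r.t. w is (w.x - y)*x.
    Dummy (x', y') is optimized so its gradient matches g_real."""

    def mismatch(x, y):
        f = (w @ x - y) * x - g_real
        return float(f @ f)

    rng = np.random.default_rng(seed)
    x, y = rng.normal(size=w.shape), float(rng.normal())
    for _ in range(steps):
        r = w @ x - y
        f = r * x - g_real
        gx = 2.0 * ((f @ x) * w + r * f)   # d mismatch / dx'
        gy = -2.0 * (f @ x)                # d mismatch / dy'
        if mismatch(x - lr * gx, y - lr * gy) < mismatch(x, y):
            x, y = x - lr * gx, y - lr * gy
        else:
            lr *= 0.5                      # backtrack on overshoot
    return x, y, mismatch(x, y)
```

If `g_real` were a CGC-projected gradient instead of the raw one, the attacker would be matching against a vector that depends on $g^{t-1}$, so even a perfect match need not recover $(x_{t,k}, y_{t,k})$.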

A APPENDIX

A.1 THE DUAL OF CGC QP

Here we provide the procedure for converting the primal of the CGC QP into its dual problem. The primal problem is (11), where $u \in \mathbb{R}^p$, $h \in \mathbb{R}^p$, $z \in \mathbb{R}^p$, and $p$ indicates the number of parameters of the local model. To obtain the dual of (11), we first construct the Lagrange function $L(u, v)$ with a Lagrange multiplier $v \in \mathbb{R}$. Because (12) is a quadratic convex function, its minimum over $u$ can be obtained in closed form, which yields the minimal value of (12), the dual function, and hence the dual problem. Therefore, the dual of the CGC QP is (18). Problem (18) can be solved by the quadprog library, which yields the optimum $v^{\star}$; note that quadprog expects the QP in its own standard form. At last, the optimal solution of the primal CGC QP is calculated from $v^{\star}$.
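The display equations of this subsection were lost in extraction. For intuition, here is a sketch of the derivation under the assumption that the CGC primal is a GEM-style single-constraint QP (with $z$ the pseudo-gradient and $h$ the reference/server gradient); the exact form in the paper's (11)-(18) may differ.

```latex
% Assumed GEM-style single-constraint primal:
\min_{u \in \mathbb{R}^p} \; \tfrac{1}{2}\lVert u - z \rVert^2
\quad \text{s.t.} \quad h^{\top} u \ge 0 .
% Lagrangian with multiplier v >= 0:
L(u, v) = \tfrac{1}{2}\lVert u - z \rVert^2 - v\, h^{\top} u ,
\qquad \frac{\partial L}{\partial u} = 0
\;\Rightarrow\; u = z + v h .
% Substituting back yields the dual and its closed-form solution:
\min_{v \ge 0} \; \tfrac{1}{2} v^2\, h^{\top} h + v\, h^{\top} z ,
\qquad v^{\star} = \max\!\Big(0,\; -\frac{h^{\top} z}{h^{\top} h}\Big),
\qquad u^{\star} = z + v^{\star} h .
```

Under this assumed form, the dual is one-dimensional, which is why solving it (with quadprog or in closed form) is cheap regardless of the model size $p$.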

A.2 THE DUAL OF SGC QP

Here we provide the procedure for converting the primal of the SGC QP into its dual problem. The primal problem is:

In Table 3, we can observe that FedAvg achieves better performance with $B = 50$ while requiring fewer iterations in the local training stage compared to $E = 1$. These experimental results are consistent with the statement in Section 3.1.

A.8 LEARNING CURVES

The learning curves are drawn in Figure 6, where the x-axis is scaled by communication budget (i.e., in multiples of $|\theta|$). The figures show that our FedGC reaches the target accuracy (i.e., the highest accuracy of the baseline FedAvg) with the smallest communication budget.

The results in Table 4 illustrate that the performance of our FedGC degrades without CGC, in particular on the label-skew datasets. That is because the issue of catastrophic forgetting seriously

where $K$ is the number of clients, $\theta^{t-1}_k$ is the initial model before local training, and $D$ is the global data; $\theta^t_i$ is the local model after local training and $D_k$ is the local data.

