FEDGSNR: ACCELERATING FEDERATED LEARNING ON NON-IID DATA VIA MAXIMUM GRADIENT SIGNAL TO NOISE RATIO

Anonymous authors

Abstract

Federated learning (FL) allows participants to jointly train a model without direct data sharing. In such a process, the participants rather than the central server perform the local updates of stochastic gradient descent (SGD), and the central server aggregates the gradients from the participants to update the global model. However, non-iid training data at the participants significantly impair global model convergence. Most existing studies address this issue via variance reduction or regularization; however, focusing on specific datasets, they lack theoretical guarantees for efficient model training. In this paper, we provide a novel perspective on the non-iid issue by optimizing the Gradient Signal to Noise Ratio (GSNR) during model training. At each participant, we decompose the local gradients calculated on the non-iid training data into signal and noise components, and then speed up model convergence by maximizing GSNR. We prove that GSNR can be maximized by using the optimal number of local updates. Subsequently, we develop FedGSNR to compute the optimal number of local updates for each participant, which can be applied to existing gradient calculation algorithms to accelerate global model convergence. Moreover, according to the positive correlation between GSNR and the quality of shared information, FedGSNR allows the server to accurately evaluate the contributions of different participants (i.e., the quality of their local datasets) by utilizing GSNR. Extensive experimental evaluations demonstrate that FedGSNR achieves on average a 1.69× speedup with comparable accuracy.

1. INTRODUCTION

Federated learning (FL) McMahan et al. (2017) targets a practical scenario in which multiple participants collaboratively train a model without direct data sharing. Unlike the typical centralized optimization problem, FL decomposes the optimization problem into several sub-problems and distributes them to different participants, to be solved separately on the corresponding local datasets. Moreover, in reality these local datasets often follow non-iid distributions. During the training phase, each participant solves its sub-problem via stochastic gradient descent (SGD) and sends back the corresponding results for aggregation. One of the most popular FL algorithms is FedAvg McMahan et al. (2017), which typically accelerates global model convergence through multiple local updates. Although it has shown great performance in many practical applications, open questions remain, especially in non-iid settings, and many prior works Haddadpour & Mahdavi (2019); Khaled et al. (2020); Li et al. (2020b) analyze the convergence behavior and attempt to accelerate it. A key challenge in FL is how to train a model well on non-iid data spread over different participants. On the one hand, such imbalance breaks the unbiasedness of the optimization procedure when multiple local updates are used. On the other hand, due to the differences between local datasets, the non-iid information distribution makes it difficult to evaluate the contribution of each participant. The former slows down FL model convergence, a key factor of efficiency. The latter concerns contribution evaluation, which involves malicious data tampering detection, contribution-based profit distribution, incentive mechanism design, etc. Especially in the current data-driven age Sim et al. (2020), contribution evaluation is particularly important.
The two goals of accelerating model convergence and improving contribution evaluation accuracy are usually pursued separately, and they can even conflict with each other. To address both challenges, we propose a novel approach that speeds up model training by maximizing the Gradient Signal to Noise Ratio (GSNR). The intuition behind the design is two-fold. First, there is always a global optimal solution no matter how the data is distributed, so for each local dataset we can identify an optimal optimization direction, i.e., the global gradient. Second, from information theory, the SNR determines the channel capacity via Shannon's formula C = W · log(1 + SNR); a larger GSNR means we obtain more information within the same number of communication rounds, which accelerates model convergence. Thus, we decompose the local optimization direction (i.e., the local gradient) into mutually orthogonal signal and noise vectors. If we can obtain the global gradient, the signal vector is parallel to the global gradient, while the noise vector is orthogonal to it. Fig. 1 shows a typical example of this orthogonal decomposition for two participants: each local step is decomposed into a signal component, parallel to the update computed with global data, and a noise component, orthogonal to it. We prove that the number of local updates controls the GSNR value, and that GSNR can be maximized by computing the optimal number of local updates. To maximize GSNR, we utilize the gradients uploaded by the participants to estimate the global gradient, and propose the FedGSNR algorithm to compute the optimal number of local updates from the estimated global gradient. Moreover, based on the GSNR perspective, we develop a specific method to compute the GSNR for each dataset, which allows the server to evaluate each participant's contribution.
In addition to personalizing the number of local updates to improve convergence efficiency, the proposed FedGSNR strategy is orthogonal to existing methods, which mostly rely on modifying how gradients are calculated. Hence, FedGSNR can be combined with these methods to further improve them. In summary, our contributions in this paper are as follows:
• We prove that the optimal number of local updates determines the maximal GSNR, which leads to faster and more stable convergence.
• We analyze existing FL algorithms from the perspective of GSNR. Based on the viewpoint of GSNR maximization, we propose FedGSNR, which can be combined with most current FL algorithms to calculate their optimal numbers of local updates.
• We derive a function r(w) to calculate GSNR, which can be utilized to evaluate the local contributions of different participants.
• We confirm our theoretical results on the CIFAR-10 and CIFAR-100 datasets. Experiments indicate that FedGSNR achieves on average a 1.69× speedup over the original algorithms when information is unevenly distributed among participants, and that r(w) is a reasonable metric for local contributions.

Related Work

2020) proposes a specific gradient calculation method based on variance reduction. Li et al. (2020a) indicates that under non-iid FL conditions, a large number of local updates leads to divergence or instability, while Wang et al. (2020b) stabilizes the training procedure with a new averaging strategy. On the other hand, Wang et al. (2019) formulates a practical optimization problem with resource constraints and determines the number of local updates for each participant according to the corresponding constraints.

A similar work, Khaled et al. (2020), derives an upper bound on the number of local updates in terms of the total iterations T and the number of participants M, providing a theoretical analysis of local updates. However, they treat each participant equally and do not provide a method to calculate the optimal number of local updates directly from the heterogeneous data. More discussion of related work can be found in Appendix A.

2. PRELIMINARY

2.1 FEDERATED AVERAGING (FEDAVG)

In this work, we consider the following federated optimization problem:

$\min_w F(w) := \mathbb{E}_\xi[F(w, \xi)] = \mathbb{E}_C\big[\mathbb{E}_\xi[F(w, \xi) \mid C]\big] = \sum_{k=1}^K P(C = C_k) \cdot \mathbb{E}_\xi[F(w, \xi|C_k)]$,  (1)

where F(w, ·) is a specified loss function with model w, K is the number of participants, and P(C) is a discrete probability distribution correlated with the importance of the different datasets. Usually, P(C) is uniform or proportional to the local data quantity, and ξ|C_k is a random sample drawn from the dataset of the k-th participant, i.e., ξ|C_k ∼ p(x|C_k). In traditional machine learning, the global dataset is gathered from all participants, and the goal is to minimize

$F(w) = \mathbb{E}_\xi[F(w, \xi)]$,  (2)

where ξ is a random sample from the global, i.e., gathered, dataset. However, in most cases, due to privacy requirements, we cannot gather data from different participants. Thus, we separate the target function as in Eq. (1), send the initial model w_1 to each participant, let each participant optimize locally, and collect the corresponding results, which are finally combined according to Eq. (1). If each participant performs only one optimization step, then by the properties of conditional expectation, minimizing Eq. (1) is equivalent to minimizing Eq. (2). But this procedure incurs heavy communication overhead, so researchers proposed performing more local updates for efficiency. Hence, for the k-th participant, the optimization procedure of a typical round can be formalized as

$w^k_{i+1} \leftarrow w^k_i - \eta\,\mathbb{E}_{\xi|C_k}[\nabla_w F(w^k_i, \xi|C_k)]$,  $i = 1, \dots, n$.

Then the central server aggregates the local models $w^1_{n+1}, \dots, w^K_{n+1}$ to update the global model by

$w = \sum_{k=1}^K p_k w^k_{n+1}$,  (3)

where we write p_k for P(C = C_k) for convenience.
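As a concrete illustration (not the paper's code), the round structure above can be sketched with a toy quadratic loss, where participant k's hypothetical loss is F_k(w) = ½‖w − c_k‖² (so the gradient is w − c_k, and the weighted optimum is Σ p_k c_k):

```python
import numpy as np

def local_sgd(w, c, eta=0.1, n_steps=5):
    # n local gradient steps on F_k(w) = 0.5 * ||w - c||^2, gradient = w - c
    w = w.copy()
    for _ in range(n_steps):
        w -= eta * (w - c)
    return w

def fedavg_round(w_global, centers, weights, eta=0.1, n_steps=5):
    # each participant runs local SGD from the same w_global; server averages (Eq. (3))
    locals_ = [local_sgd(w_global, c, eta, n_steps) for c in centers]
    return sum(p * wl for p, wl in zip(weights, locals_))

rng = np.random.default_rng(0)
centers = [rng.normal(size=3) for _ in range(4)]   # per-participant optima (non-iid)
weights = np.array([0.4, 0.3, 0.2, 0.1])           # p_k
w = np.zeros(3)
for _ in range(50):
    w = fedavg_round(w, centers, weights)
target = sum(p * c for p, c in zip(weights, centers))
```

On this quadratic, each round contracts w toward the p_k-weighted average of the local optima, which is the minimizer of Eq. (1).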

2.2. WASSERSTEIN DISTANCE

Wasserstein distance (Villani (2009)) is a metric on probability spaces inspired by the optimal transport problem; it is a distance between probability distributions that takes geometric information into account. The general Wasserstein distance is defined as

$W_p(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x,y)\sim\gamma}[\|x - y\|_p]$,

which in general has no closed-form solution. However, if we choose the 2-norm as the geometric measure and restrict to Gaussian distributions, the distance has the analytic form

$d^2 = \|\mu_1 - \mu_2\|_2^2 + \mathrm{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\big)$,  (4)

where we define $d := W_2(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2))$.

3. MAXIMIZING GSNR VIA OPTIMAL LOCAL UPDATES

In this section, we investigate the relationship between GSNR and the local updates, and then propose a method to calculate the optimal number of local updates. As Section 4 will show, GSNR can be calculated from the ratio between the norm of the global gradient, which is a constant for all participants, and the distance between the global and local gradients. Therefore, maximizing GSNR is equivalent to minimizing the distance between the distributions of the global and local gradients, and the minimal distance, i.e., the maximum GSNR, is determined by the optimal number of local updates. On the other hand, as mentioned in Section 2.1, our target is to gather data so as to optimize Eq. (2) in a centralized manner, but in practice, due to real-world restrictions, we can only optimize Eq. (1) in a distributed manner. We therefore treat the former as an ideal optimization process. Based upon this idea, to accelerate convergence, our practical optimization problem requires minimizing the distance between the practical and ideal optimization paths, which is determined by the corresponding gradients. This distance, with the two paths denoted by the distributions $p_w(w)$ and $p_{w_g}(w)$ respectively, can be formalized as

$D = W_2(p_w(w), p_{w_g}(w))$.  (5)

According to Eq. (3), due to the conditional independence of the data from different participants, the random vector w is a convex combination of a set of independent random vectors; hence $p_w(w)$ is given by the convolution of the corresponding local distribution functions, and such a distribution contains all details of each participant.
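The Gaussian closed form of Eq. (4) is easy to evaluate; a minimal sketch using only NumPy (the PSD square root is computed via eigendecomposition, and the 1-D case reduces to (m₁ − m₂)² + (s₁ − s₂)²):

```python
import numpy as np

def psd_sqrt(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def w2_gaussian_sq(mu1, sig1, mu2, sig2):
    """Squared 2-Wasserstein distance between N(mu1, sig1) and N(mu2, sig2), Eq. (4)."""
    s1h = psd_sqrt(sig1)
    cross = psd_sqrt(s1h @ sig2 @ s1h)
    return float(np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
                 + np.trace(sig1 + sig2 - 2.0 * cross))

# 1-D sanity check: N(0, 2^2) vs N(3, 1^2) gives (0-3)^2 + (2-1)^2 = 10
d2 = w2_gaussian_sq(np.array([0.0]), np.array([[4.0]]),
                    np.array([3.0]), np.array([[1.0]]))
```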
Intuitively, if a specific participant attempts to minimize Eq. (5) directly, it has to gather all information from the others, which violates the privacy requirements. We therefore derive an upper bound of Eq. (5):

$D = W_2(p_w(w), p_{w_g}(w)) = \inf_{\gamma \in \Gamma(p_w, p_{w_g})} \mathbb{E}_{(x,y)\sim\gamma}[\|x - y\|_2] \le \sum_{k=1}^K p_k \inf_{\gamma_k \in \Gamma(p_{w^k_{n+1}}, p_{w_g})} \mathbb{E}_{(x,y)\sim\gamma_k}[\|x - y\|_2]$,  (6)

where we split $\|x - y\|_2$ as $\|\sum_{k=1}^K p_k(x_k - y)\|_2$ and each pair $(x_k, y)$ is supported on $\Gamma(p_{w^k_{n+1}}, p_{w_g})$. The inequality follows from the triangle inequality and the mutual independence of the $w^k_{n+1}$. The upper bound is intuitive: in optimal transport, jointly considering all mounds is never worse than the sum of separate considerations. By bound (6), the distance between the practical and ideal optimization paths is upper bounded; while minimizing bound (6), the target distance (5) is approximately minimized, and the gap vanishes as the upper bound approaches 0. Since Eq. (6) is the sum of the distances between each independent local distribution $p_{w^k_{n+1}}$ and the global distribution $p_{w_g}$, by rotational symmetry we can minimize bound (6) separately for each participant. Hence, without loss of generality, we consider a single participant in the rest of this paper. Based on the former analysis, to maximize GSNR each participant needs to greedily optimize its local gradient distribution so as to minimize $W_2(p_{w^k_{n+1}}(w), p_{w_g}(w))$. As the initial parameters $w^k_1$ are identical for all participants, the main task is to estimate the gradient distributions of the different participants. We make the following assumption.

Assumption 3.1. (Bounded variance) The variance of the stochastic gradients is uniformly bounded, i.e., $\mathbb{E}_{\xi|C_i}\|\nabla_w F(w, \xi|C_i) - \mu_i\|^2 \le \sigma^2$ for all i, w, where $\mu_i := \mathbb{E}_{\xi|C_i}[\nabla_w F(w, \xi|C_i)]$.
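Bound (6) can be checked numerically in a simple setting (my own illustrative example, not from the paper): with 1-D Gaussians, W₂(N(m₁, s₁²), N(m₂, s₂²)) = √((m₁−m₂)² + (s₁−s₂)²), and a convex combination of independent Gaussians w = Σ p_k w_k is again Gaussian:

```python
import numpy as np

def w2_1d(m1, s1, m2, s2):
    # closed-form 2-Wasserstein distance between 1-D Gaussians
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

p = np.array([0.5, 0.3, 0.2])                    # hypothetical p_k
means = np.array([0.0, 4.0, 1.0])                # hypothetical local means
stds = np.array([1.0, 2.0, 0.5])                 # hypothetical local stds
mg, sg = 2.0, 1.0                                # hypothetical global distribution

# distribution of w = sum_k p_k w_k (independent Gaussians):
m_mix = p @ means
s_mix = np.sqrt(np.sum((p * stds) ** 2))

lhs = w2_1d(m_mix, s_mix, mg, sg)                            # left side of (6)
rhs = sum(pk * w2_1d(mk, sk, mg, sg)
          for pk, mk, sk in zip(p, means, stds))             # right side of (6)
```

Here lhs ≤ rhs, matching the bound; the gap shows how loose the per-participant decomposition can be.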
Under such an assumption, mini-batch stochastic gradients converge to a joint normal distribution. Detailed proofs can be found in Appendix B.

Lemma 3.2. With Assumption 3.1, let $\{\xi_{i,b} \mid 1 \le i \le n,\ 1 \le b \le B\}$ be a set of iid samples from a specific dataset, and let $g = (g_1, \dots, g_n)$ be a finite-dimensional gradient vector with $g_i = \frac{1}{B}\sum_{b=1}^B \nabla_w F(w_i, \xi_{i,b})$, $i \in \{1, \dots, n\}$. Then $\sqrt{B}(g - \mathbb{E}[g])$ converges to a multivariate normal distribution.

Remark 3.3. Let $S = \sqrt{B}(g - \mathbb{E}[g])$ and $Z \sim \mathcal{N}(0, \Sigma)$, where Σ is the covariance matrix of S. Based on the Berry-Esseen theorem, for all convex sets $U \subseteq \mathbb{R}^d$ we have $|Pr(S \in U) - Pr(Z \in U)| \le C\,\mathrm{rank}(\Sigma)^{1/4}/B^{1/2}$, where C is a constant; this provides an upper bound on the estimation error for Lemma 3.2.

Lemma 3.2 implies that, with mini-batch stochastic gradient descent, the sum of local gradients $\bar{g} = \sum_{i=1}^n g_i = \mathbf{1}^T g$ is a linear transformation of a jointly Gaussian vector; thus, with a finite batch size B, it can be approximated by a Gaussian distribution. We then need to calculate the mean vector and the covariance matrix. For this purpose, we make a smoothness assumption.

Assumption 3.4. (Smoothness) The target function F(w, ·): $\mathbb{R}^m \to \mathbb{R}$ is twice differentiable, and the expected norm of the Hessian matrix H(F(w, ·)) is bounded, i.e., $\mathbb{E}_\xi\|H(F(w, \xi))\|^2 \le L^2$, where ξ is randomly sampled from a specific dataset. Note that Assumption 3.4 is weaker than the standard L-smoothness assumption: if F(w, ·) is L-smooth, it satisfies Assumption 3.4, but not vice versa. Assumption 3.4 holds for typical machine learning tasks, e.g., logistic regression and soft-max classification. With these assumptions, we have the following lemma.

Lemma 3.5.
If Assumptions 3.1 and 3.4 hold, let $\{\eta_r\}_{r=1}^{+\infty}$ be a sequence of real numbers with $\lim_{r\to+\infty}\eta_r = 0$, and $\{\varepsilon_r\}_{r=1}^{+\infty}$ a sequence of random vectors with $\varepsilon_r = \hat{g} - \bar{g}$, where $\hat{g} = \sum_{i=1}^n \nabla_w F(w_1, \xi_i)$ and $\bar{g} = \sum_{i=1}^n \nabla_w F(w_i, \xi_i)$ with $w_i = w_{i-1} - \eta_r g_{i-1}$, $i \in \{2, \dots, n\}$. Then $\varepsilon_r \xrightarrow{L} 0$, which implies $\hat{g} \xrightarrow{L} \bar{g}$.  (7)

Remark 3.6. Based on the proof of Lemma 3.5 in the appendix, the estimation error is $\mathbb{E}\|\bar{g} - \hat{g}\| \le \frac{n(n-1)}{2}\eta_r L G$, which implies that if we consider learning-rate decay, then for all ϵ, $\lim_{r\to+\infty} Pr(\|\bar{g} - \hat{g}\| > \epsilon) = 0$.

In a typical communication round r, the optimization process gives $w_{n+1} - w_1 = \eta_r\bar{g}$, and based on Eq. (7), we can use $\eta_r\hat{g} = \eta_r\sum_{i=1}^n \nabla_w F(w_1, \xi_i)$ to estimate $\eta_r\bar{g}$. Moreover, if we multiply Eq. (7) by $\eta_r$, the estimation error becomes $O((n\eta_r)^2 L G)$. Specifically, since $w_1$ is a constant vector and the $\xi_i$ are iid samples from the dataset across local steps, $\{\nabla_w F(w_1, \xi_i) \mid i \in \{1, \dots, n\}\}$ are also iid random vectors. Therefore, by the properties of sums of independent random variables, we have $\mu = \mathbb{E}[\eta_r\hat{g}] = n\eta_r\,\mathbb{E}_\xi[\nabla_w F(w_1, \xi)]$ and $\Sigma = \mathrm{Cov}(\eta_r\hat{g}, \eta_r\hat{g}) = n\eta_r^2\Sigma_1$. Additionally, Σ is a second-order term, and to simplify the analysis we convert the coefficient n to n². As mentioned before, $\Sigma_1$ is the covariance matrix of a mean-vector distribution, hence it depends on the batch size B. Let $\tilde{B} = B/n$; then we obtain $\Sigma = n^2\eta_r^2\Sigma_1/\tilde{B}$. For convenience, in the rest of this paper, we use $\mu_*$ and $\Sigma_*$ to denote the mean vector and covariance matrix of gradients estimated on a specific dataset. Similarly, since we can adjust the local batch size to simplify computation, we ignore the difference between B and $\tilde{B}$. According to Lemmas 3.2 and 3.5, the parameter distributions of one global step and n local steps can be estimated by $\mathcal{N}(\eta_r\mu_g, \eta_r^2\frac{\Sigma_g}{B})$ and $\mathcal{N}(n\eta_r\mu_l, n^2\eta_r^2\frac{\Sigma_l}{B})$ respectively.
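The error bound of Remark 3.6 can be checked on a toy quadratic loss F(w) = ½‖w − c‖² (Hessian = I, so L = 1), where the path gradient sum ḡ and the one-point estimate ĝ = n∇F(w₁) are exact; the values of c, w₁, η, and n below are illustrative assumptions:

```python
import numpy as np

eta, n = 0.01, 10
c = np.array([1.0, -2.0, 0.5])           # hypothetical local optimum
w1 = np.array([3.0, 0.0, 0.0])           # initial model w_1

# run n local steps on F(w) = 0.5 * ||w - c||^2 and accumulate the gradients
g_bar = np.zeros_like(w1)
wi = w1.copy()
for _ in range(n):
    g = wi - c                            # exact gradient at w_i
    g_bar += g
    wi -= eta * g

g_hat = n * (w1 - c)                      # one-point estimate at w_1 (Lemma 3.5)
G = np.linalg.norm(w1 - c)                # largest gradient norm along this path
bound = n * (n - 1) / 2 * eta * 1.0 * G   # Remark 3.6 with L = 1
err = np.linalg.norm(g_hat - g_bar)
```

The measured error stays below n(n−1)/2 · η · L · G, and shrinks linearly as η decays, matching ε_r → 0.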
Then the optimal number of local updates that maximizes GSNR is given by the following theorem.

Theorem 3.7. The minimal Wasserstein distance between the two multivariate Gaussian distributions $\mathcal{N}(\eta_r\mu_g, \eta_r^2\frac{\Sigma_g}{B})$ and $\mathcal{N}(n\eta_r\mu_l, n^2\eta_r^2\frac{\Sigma_l}{B})$ over the variable n is achieved at

$n^{opt}_1 = \max\Big(0,\ \frac{\mu_l^T\mu_g + \mathrm{tr}((\Sigma_l\Sigma_g)^{1/2})/B}{\|\mu_l\|^2 + \mathrm{tr}(\Sigma_l)/B}\Big)$,

and the minimum distance is $(\Delta^{opt}_1)^2 = \eta_r^2\Delta^2$ with

$\Delta^2 = \|\mu_g\|^2 + \frac{\mathrm{tr}(\Sigma_g)}{B} - \frac{\big(\mu_l^T\mu_g + \mathrm{tr}((\Sigma_l\Sigma_g)^{1/2})/B\big)^2}{\|\mu_l\|^2 + \mathrm{tr}(\Sigma_l)/B}$.

Corollary 3.8. For $\mathcal{N}(m\eta_r\mu_g, m^2\eta_r^2\Sigma_g)$ and $\mathcal{N}(n\eta_r\mu_l, n^2\eta_r^2\Sigma_l)$, where m is a constant, the optimal n minimizing the Wasserstein distance is $n^{opt}_m = m \cdot n^{opt}_1$ and the minimal distance is $(\Delta^{opt}_m)^2 = m^2 \cdot (\Delta^{opt}_1)^2$.

Algorithm 1 FedGSNR with FedAvg
for each round r do
  for each participant i do
    for k = 1 to $m \cdot n^{opt}_{1,i}$ do
      $w_i \leftarrow w_i - \eta_r\bar{g}_i$
    end for
  end for
  Server: $w_1 \leftarrow \sum_{i=1}^{|\tilde{S}|}\tilde{p}_i w_i$, where $\tilde{S} = \{i \mid i \in S, n^{opt}_{1,i} > 0\}$ and $\tilde{p}_i$ is the corresponding probability ratio, i.e., $p_i / \sum_{k\in\tilde{S}} p_k$
end for

Based on Corollary 3.8, if m is a constant, the minimum Wasserstein distance is achieved at $n = m \cdot n^{opt}_1$, which is the optimal number of local updates for maximizing GSNR, leading to maximal channel capacity for information communication. To estimate the optimal number of local updates, we need to compute the mean vectors and covariance matrices of both the local and global distributions from the samples of each participant. In practice, we use the sample mean and sample covariance to estimate the local parameters: for participant k with model $w_1$, the corresponding statistics are $\bar{g}_k = \frac{1}{B}\sum_{b=1}^B g_{k,b}$ and $\hat{\Sigma}_k = \frac{1}{B}\sum_{b=1}^B (g_{k,b} - \bar{g}_k)(g_{k,b} - \bar{g}_k)^T$, where $g_{k,b} = \nabla_w F(w_1, \xi_b|C_k)$. For the server, by the theorem of conditional random variables, the corresponding global statistics are $\bar{g} = \mathbb{E}_C[\bar{g}|C] = \sum_{k=1}^K p_k\bar{g}_k$ and $\hat{\Sigma} = \mathbb{E}_C[\hat{\Sigma}|C] + \mathrm{Cov}_C(\bar{g}|C) = \sum_{k=1}^K p_k\hat{\Sigma}_k + \sum_{k=1}^K p_k(\bar{g}_k - \bar{g})(\bar{g}_k - \bar{g})^T$.
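Theorem 3.7's closed form can be sketched as follows, using diagonal covariances (the simplification the paper adopts next, under which $\mathrm{tr}((\Sigma_l\Sigma_g)^{1/2})$ reduces to a sum over elementwise products); all input statistics are hypothetical:

```python
import numpy as np

def n_opt_1(mu_l, var_l, mu_g, var_g, B):
    """Optimal local updates (Theorem 3.7), diagonal covariance approximation."""
    num = mu_l @ mu_g + np.sum(np.sqrt(var_l * var_g)) / B
    den = mu_l @ mu_l + np.sum(var_l) / B
    return max(0.0, num / den)

mu = np.array([1.0, -0.5, 2.0])
var = np.array([0.3, 0.1, 0.2])
n_iid = n_opt_1(mu, var, mu, var, B=32)           # local stats match global stats
n_bad = n_opt_1(-mu, np.zeros(3), mu, var, B=32)  # local gradient opposes global
```

When the local distribution matches the global one the optimum is exactly 1 local step per global step, and a participant whose mean gradient opposes the global direction is clipped to 0 updates.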
In practice, since the covariance matrix increases the communication traffic and its computation is expensive, we need to simplify the procedure. Specifically, Theorem 3.7 mainly requires the trace of the covariance matrix. Meanwhile, according to Balduzzi et al. (2017), the covariance matrix of gradients is sparse, and the estimation error can be scaled by the batch size B. Therefore, we can instead utilize only the principal diagonal elements of $\hat{\Sigma}_k$ for efficiency. Based on the former analysis, we propose the FedGSNR algorithm to calculate the optimal number of local updates; Algorithm 1 is a typical example of FedGSNR in conjunction with FedAvg.

Partial participation. In federated scenarios, usually not all participants are active, so we cannot obtain perfect information about the global gradient. However, FedGSNR also adapts to this situation, because such imperfect information lets us transform Eq. (5) with the triangle inequality into $W_2(p_w(w), p_{w_g}(w)) \le W_2(p_w(w), p_{\hat{w}}(w)) + W_2(p_{\hat{w}}(w), p_{w_g}(w))$, where ŵ is the average parameter vector of the active clients (i.e., a subset of all clients). With a specific client set S, $\delta = W_2(p_{\hat{w}}(w), p_{w_g}(w))$ is a constant, and $W_2(p_w(w), p_{\hat{w}}(w))$ can be bounded by inequality (6). Thus, with some tolerance δ, we can similarly minimize Eq. (5) via the new upper bound, although the performance degrades as the ratio of active clients declines.

Convergence analysis. FedGSNR is a convergent algorithm, since we only change the number of local updates. Moreover, we can reduce the convergence analysis of FedGSNR to that of its original algorithm via the inequality $1 \le E_{min} \le E_{i,r} \le E_{max}$. We provide an example convergence analysis for FedGSNR with FedAvg in Appendix E.
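The server-side pooling of client statistics ($\bar{g} = \sum_k p_k\bar{g}_k$, $\hat{\Sigma} = \sum_k p_k\hat{\Sigma}_k + \sum_k p_k(\bar{g}_k - \bar{g})(\bar{g}_k - \bar{g})^T$) is the law of total covariance; a quick numerical check with synthetic per-client "gradient samples" (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(loc=mu, size=(50, 2))        # 50 gradient samples per client
          for mu in ([0.0, 0.0], [2.0, -1.0], [-1.0, 3.0])]
sizes = np.array([len(g) for g in groups])
p = sizes / sizes.sum()                            # p_k proportional to data quantity

means = [g.mean(axis=0) for g in groups]
covs = [np.cov(g, rowvar=False, ddof=0) for g in groups]

# server-side pooled statistics from per-client summaries only
g_bar = sum(pk * m for pk, m in zip(p, means))
sigma_pooled = (sum(pk * c for pk, c in zip(p, covs))
                + sum(pk * np.outer(m - g_bar, m - g_bar) for pk, m in zip(p, means)))

# reference: statistics of all samples gathered centrally
all_g = np.concatenate(groups)
```

The pooled mean and covariance match the centrally computed statistics exactly, so the server never needs the raw per-sample gradients.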

4. CALCULATE GSNR BY LOCAL GRADIENTS

In this section, we first analyze the optimal number of local updates and the corresponding optimal distance between the local and global gradient distributions, and then derive a method to calculate GSNR from the optimal distance. Due to limited space, the detailed analysis of GSNR and the optimization procedure can be found in Appendix C.


Figure 2: An overview of GSNR: the GSNR can be calculated from the statistics of the global and local gradient distributions.

Regarding the optimal number of local updates $n^{opt}_1$ and the corresponding optimal distance $\Delta^2$, let $L = \|\mu_g\|^2 + \frac{\mathrm{tr}(\Sigma_g)}{B}$, $M = \mu_l^T\mu_g + \frac{\mathrm{tr}((\Sigma_l\Sigma_g)^{1/2})}{B}$, and $N = \|\mu_l\|^2 + \frac{\mathrm{tr}(\Sigma_l)}{B}$; then we can rewrite $n^{opt}_1 = \max(0, \frac{M}{N})$ and $\Delta^2 = L - \frac{M^2}{N}$. We then define the matrix

$R_* = \begin{pmatrix} \mu_*^T \\ \frac{1}{\sqrt{B}}\Sigma_*^{1/2} \end{pmatrix}$,

where $\mu_*^i$ is the i-th component of $\mu_* = (\mu_*^1, \dots, \mu_*^t)$. On the one hand, $L = \|R_g\|_F^2$ relates to the global distribution, which is a constant for all participants. On the other hand, $N = \|R_l\|_F^2$ depends on the local distribution and thus acts as a normalization coefficient. Hence, the two quantities $n^{opt}_1$ and $\Delta^2$ mainly depend on the value of $M = \langle R_l, R_g\rangle_F$, the inner product of the two matrices, which represents their similarity. Based on the former analysis, we derive a method to calculate GSNR as Definition 4.1.

Definition 4.1. Gradient Signal to Noise Ratio (GSNR). For a local dataset $D_l$ and a global dataset $D_g$, with a loss function F(w, ·), the GSNR is a function of w defined as

$r(w) = \max\Big(0,\ \frac{\langle R_l, R_g\rangle_F}{\sqrt{\|R_l\|_F^2\|R_g\|_F^2 - \langle R_l, R_g\rangle_F^2}}\Big)$.

Specifically, Fig. 2 illustrates an example for Definition 4.1, imagining an analogous case in Euclidean space. There, the angle θ can be viewed as the similarity between the global and local gradient vectors. In Euclidean space, the minimum distance from a point to a line is attained along the segment perpendicular to the line, so the black dashed line is orthogonal to the local gradient vector. Hence $\frac{\eta^2(\|\mu_g\|^2 + \mathrm{tr}(\Sigma_g)/B)}{\eta^2\Delta^2} = \csc^2\theta$. As for GSNR, defined as the magnitude ratio between the parallel and orthogonal components, it can be viewed as $\cot\theta$. Then, by the trigonometric identity $\cot^2\theta = \csc^2\theta - 1$, we obtain Definition 4.1.
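Definition 4.1 can be sketched with diagonal covariances (hypothetical statistics; the matrix $R_*$ is represented implicitly through the three Frobenius quantities L, M, N):

```python
import numpy as np

def gsnr(mu_l, var_l, mu_g, var_g, B):
    """r(w) from Definition 4.1 under the diagonal covariance approximation."""
    M = mu_l @ mu_g + np.sum(np.sqrt(var_l * var_g)) / B   # <R_l, R_g>_F
    L = mu_g @ mu_g + np.sum(var_g) / B                    # ||R_g||_F^2
    N = mu_l @ mu_l + np.sum(var_l) / B                    # ||R_l||_F^2
    denom = L * N - M * M                                  # >= 0 by Cauchy-Schwarz
    if denom <= 0:
        return float("inf")   # noise component vanishes (local matches global)
    return max(0.0, M / np.sqrt(denom))

mu_g, var_g = np.array([1.0, 1.0]), np.array([0.2, 0.2])
r_iid = gsnr(mu_g, var_g, mu_g, var_g, B=32)               # identical distributions
r_orth = gsnr(np.array([1.0, -1.0]), np.zeros(2), mu_g, np.zeros(2), B=32)
```

An iid participant has unbounded GSNR (no noise component), while a participant whose mean gradient is orthogonal to the global one carries no signal at all.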

5. EXPERIMENT

We run our experiments on the well-known real-world datasets CIFAR-10 and CIFAR-100 Krizhevsky et al. (2009) to validate our design. Setup. For non-iid settings, we utilize three data-partition methods. First, we follow the settings in Hsu et al. (2019) and generate non-iid data across participants via a Dirichlet distribution, where the parameter α represents the level of non-iid-ness. Second, we propose NonBalance and Pareto for imbalanced partition, which simulate the imbalanced information distribution of practical scenarios. Due to limited space, the details of these methods can be found in Appendix D. For all experiments, we use LeNet for CIFAR-10 and VGG-16 for CIFAR-100. To make all methods comparable, we keep the total computation, i.e., the total number of local updates, equal. Thus, in FedGSNR we set the local updates of participant k to $E_k = N E_{const} \cdot \frac{n^{opt}_{1,k}}{\sum_{i=1}^K n^{opt}_{1,i}}$, where N and $E_{const}$ denote the number of active participants and the local updates of the baseline algorithms respectively. Note that $E_k$ is a redistribution of local steps. The necessity of optimal local updates. To understand the necessity of optimal local updates, we compute the entropy of the allocation, $-\sum_{k=1}^K p(C_k)\log p(C_k)$ with $p(C_k) = n^{opt}_{1,k}/\sum_{i=1}^K n^{opt}_{1,i}$, and the results are illustrated in Fig. 3. Specifically, the dashed blue line on top is the uniform distribution of local updates, the maximum-entropy distribution of a discrete variable, which represents equal local updates among all participants. When the degree of non-iid-ness increases, the computation is allocated more concentratedly, i.e., with smaller entropy. Conversely, the entropy converges to its maximum as the distributed data gets closer to iid. Additionally, the corresponding convergence rates are reported in Table 1, which indicates that the FedGSNR-based re-allocation of local updates accelerates model convergence. The impact on test accuracy.
Versus its original, FedGSNR achieves comparable test accuracy and even outperforms the original when the non-iid degree increases. For example, in the Pareto scenario, the accuracy of FedGSNR with FedProx increases by 6.43%. More detailed results can be found in Appendix D.2.

Figure 7: The GSNR of different clients. We observe that GSNR is larger and almost identical across participants when the data is iid distributed; it then gets smaller and more heterogeneous as the non-iid level grows. Finally, when the data partition method is Label 2, GSNR is small but similar across clients again, which probably indicates that the data is distributed with some symmetries in terms of information.

Evaluate local contributions with GSNR. Fig. 7 displays the variation of GSNR under the Dirichlet method with different α. The results demonstrate that as the level of non-iid-ness grows, the GSNR of different clients varies dramatically, which indicates that the contributions of different clients differ. Moreover, combining the GSNR values with the results in Table 1, model convergence is faster than the opponents' when we take such differences into account. Furthermore, to investigate the performance of data evaluation, we change the labels l of client 0 to (l + k) mod 10, so that the client performs a label-flipping attack as in Hitaj et al. (2017). Fig. 6 illustrates the change in GSNR when we flip the labels, where the red dashed box marks the original GSNR with unchanged labels. Specifically, we observe that GSNR dramatically decreases under this malicious change to the labels. Additionally, when we instead replace the data points with samples from a uniform distribution, we observe a similar phenomenon. For both malicious changes, we observe that FedGSNR is more robust. Due to limited space, we defer these experiments to Appendix D.
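The step-redistribution rule $E_k = N E_{const} \cdot n^{opt}_{1,k}/\sum_i n^{opt}_{1,i}$ and the allocation entropy from the setup above can be sketched as follows (all $n^{opt}$ values hypothetical):

```python
import numpy as np

def redistribute(n_opts, n_active, e_const):
    """Split the fixed local-update budget in proportion to each n_opt (E_k rule)."""
    p = np.asarray(n_opts, dtype=float)
    p = p / p.sum()
    e_k = n_active * e_const * p          # per-participant local updates E_k
    entropy = -np.sum(p * np.log(p))      # allocation entropy (assumes n_opts > 0)
    return e_k, entropy

e_iid, h_iid = redistribute([1.0, 1.0, 1.0, 1.0], n_active=4, e_const=5)
e_skew, h_skew = redistribute([4.0, 1.0, 0.5, 0.5], n_active=4, e_const=5)
```

Equal n_opt values reproduce the uniform baseline (entropy log K), while a skewed allocation lowers the entropy without changing the total budget, which is the behavior reported for Fig. 3.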

6. CONCLUSION

In this paper, we have investigated the FL problem via a new perspective, i.e., GSNR. Our theoretical analysis indicates that under non-iid scenarios, the local updates can be decomposed into signal and noise components, and we can maximize GSNR with the optimal local updates. Based on theoretical analysis, we further propose an algorithm FedGSNR to calculate the optimal local updates for different FL algorithms, which achieves faster global model convergence. Additionally, we derive a method to calculate GSNR directly from the local datasets, which can be utilized to evaluate the local contributions of different participants. Finally, extensive experimental results demonstrate the beneficial effect of optimizing FL from the new perspective of GSNR, and also open up a promising new direction for follow-up research.

A RELATED WORK

Gradient Diversity. Gradient diversity is a key ingredient of federated learning; it captures the differences between the datasets possessed by different participants. Yin et al. (2018) employs gradient diversity to investigate the relationship between batch size and convergence rate in parallel SGD. Yu et al. (2019) analyzes why periodic model averaging is suitable for deep learning and provides a deeper understanding of model averaging. Haddadpour et al. (2019) tries to mitigate gradient diversity by sharing a small batch of data among all participants, but this introduces a higher privacy risk. Acar et al. (2021) introduces a dynamic regularization term to resolve the problem of gradient divergence. Most previous works address gradient diversity through the gradient calculation itself, such as gradient prediction, regularization, or personalized target functions, while the influence of the number of local updates has received less attention. In this paper, we propose a new perspective for analyzing the optimization procedure via the Gradient Signal to Noise Ratio: it reduces the required communication rounds through an elaborate configuration of local updates and yields a method to evaluate the contributions of different participants. Personalization in federated learning. Another important problem in federated learning is personalization. Formally, personalization transforms the optimization problem from the global distribution p(ξ) to a specific local distribution p(ξ | C_i) on client i; it sacrifices global performance to gain more benefit in the local scenario. Kulkarni et al. (2020) reviews the investigations of personalization and divides the settings into three categories: device heterogeneity, data heterogeneity, and model heterogeneity, the last being the motivation for personalization. Mansour et al.
(2020) proposes three methods to achieve personalization that balance model performance on the global and local data distributions. Sim et al. (2019) proposes a method to optimize the global model and local model separately, to make the local model more personalized. Jiang et al. (2019) proposes three objectives to ease personalization. Personalization is an important topic in federated learning since different participants confront different problems; however, if we greedily utilize global information for personalization, a Prisoner's Dilemma is likely to appear: the collective benefit of all participants is not optimal, and therefore the profit of each participant can probably be further improved. Hence, cooperation is also an important problem, and a better goal of personalization is to search for the optimum on the conditional data distribution p(ξ | C_i) combined with cooperation.

B PROOFS OF LEMMA AND THEOREM

B.1 PROOF OF LEMMA 3.2

Proof. For any constant n, the gradient vector g can be rewritten as $g = \frac{1}{B}\sum_{b=1}^B(\nabla_w F(w_1, \xi_{1,b}), \dots, \nabla_w F(w_n, \xi_{n,b}))$. Let $\tilde{g}_b = (\nabla_w F(w_1, \xi_{1,b}), \dots, \nabla_w F(w_n, \xi_{n,b}))$, so that $g = \frac{1}{B}\sum_{b=1}^B\tilde{g}_b$. With Assumption 3.1 and n constant, $\tilde{g}_b$ follows some (possibly complex) distribution with a bounded covariance matrix. As the $\xi_{i,b}$ are iid samples from a specific dataset, g is the mean of the iid random vectors $\tilde{g}_1, \dots, \tilde{g}_B$. Therefore, by the classical Central Limit Theorem, as B grows large, $\sqrt{B}(g - \mathbb{E}[g])$ converges in distribution to $\mathcal{N}(0, \Sigma)$, where Σ is the covariance matrix of $\tilde{g}_b$.

B.2 PROOF OF LEMMA 3.5

Proof. First, we prove $\lim_{r\to+\infty}\mathbb{E}\|\varepsilon_r\| = 0$. Regarding the gradient $g_i$, $i \in \{1, \dots, n\}$, by the smoothness of F, we can expand $g_i$ via Lagrange's mean value theorem as

$g_i = \nabla_w F(w_i, \xi_i) = \nabla_w F(w_1, \xi_i) + H(F(\tilde{w}_i, \xi_i))(w_i - w_1) = g_{1,i} + H(F(\tilde{w}_i, \xi_i))(w_i - w_1)$,

where $\tilde{w}_i := \lambda w_i + (1-\lambda)w_1$, $\lambda \in [0,1]$, and $g_{1,i}$ denotes the gradient at $w_1$ with sample $\xi_i$, an unbiased estimator of $g_1$. Note that $g_{1,i}$ is independent of $g_{1,j}$ for $i \ne j$. Then

$\mathbb{E}\|g_i - g_{1,i}\| = \mathbb{E}\|H(F(\tilde{w}_i, \xi_i))(w_i - w_1)\| \overset{(a)}{\le} \mathbb{E}\big[\|H(F(\tilde{w}_i, \xi_i))\|\,\|w_i - w_1\|\big] \overset{(b)}{\le} \sqrt{\mathbb{E}\|H(F(\tilde{w}_i, \xi_i))\|^2\,\mathbb{E}\|w_i - w_1\|^2} \overset{(c)}{\le} L\sqrt{\mathbb{E}\big\|\eta_r\sum_{j=1}^{i-1} g_j\big\|^2} \overset{(d)}{\le} L\eta_r\sqrt{(i-1)\sum_{j=1}^{i-1}\mathbb{E}\|g_j\|^2} \overset{(e)}{\le} (i-1)\eta_r L G$,  (11)

where $G^2 := \sigma^2 + \mu^2$ and $\mu = \max(\{\|\mu_i\|\}_{i\in\{1,\dots,n\}})$. Hence

$\mathbb{E}\|\varepsilon_r\| = \mathbb{E}\|\hat{g} - \bar{g}\| = \mathbb{E}\Big\|\sum_{i=1}^n(g_i - g_{1,i})\Big\| \le \sum_{i=1}^n\mathbb{E}\|g_i - g_{1,i}\| \le \Big(\sum_{i=1}^n(i-1)\Big)\eta_r L G = \frac{n(n-1)}{2}\eta_r L G$,  (12)

where the first inequality follows from the triangle inequality and the second from Eq. (11). As n, the number of local steps, is a constant, $\mathbb{E}\|\varepsilon_r\|$ is upper bounded by $\eta_r \cdot M$ for some bounded value M. Therefore, $0 \le \lim_{r\to+\infty}\mathbb{E}\|\varepsilon_r\| \le \lim_{r\to+\infty}\eta_r \cdot M = 0$. Formula (12) implies that $\varepsilon_r$ converges to 0 in mean, i.e., $\varepsilon_r \xrightarrow{L} 0$, which immediately completes the proof.

B.3 PROOF OF THEOREM 3.7

Proof. According to Eq. (4), to minimize the distance between $\mathcal{N}\big(\eta_r\mu_g, \frac{\eta_r^2\Sigma_g}{B}\big)$ and $\mathcal{N}\big(n\eta_r\mu_l, \frac{n^2\eta_r^2\Sigma_l}{B}\big)$, we can build the optimization problem
$$\min_n\; d^2 = \|\eta_r\mu_g - n\eta_r\mu_l\|^2 + \operatorname{tr}(M^2) \quad (13)$$
$$\text{s.t.}\quad M = \Big(\frac{\eta_r^2\Sigma_g}{B}\Big)^{\frac{1}{2}} - \Big(\frac{n^2\eta_r^2\Sigma_l}{B}\Big)^{\frac{1}{2}}, \qquad n \ge 0.$$
Note that Eq. (13) is a quadratic function of $n$, which immediately completes the proof.

B.4 PROOF OF COROLLARY 3.8

Proof. In this case, we replace the distribution $\mathcal{N}\big(\eta_r\mu_g, \frac{\eta_r^2\Sigma_g}{B}\big)$ with $\mathcal{N}\big(m\eta_r\mu_g, \frac{m\eta_r^2\Sigma_g}{B}\big)$, and reformulate problem (13) as
$$\min_n\; d^2 = m^2\Big(\big\|\eta_r\mu_g - \tfrac{n}{m}\eta_r\mu_l\big\|^2 + \operatorname{tr}(M^2)\Big)$$
$$\text{s.t.}\quad M = \Big(\frac{\eta_r^2\Sigma_g}{B}\Big)^{\frac{1}{2}} - \Big(\frac{(\tfrac{n}{m})^2\eta_r^2\Sigma_l}{B}\Big)^{\frac{1}{2}}, \qquad n \ge 0.$$
Letting $\tilde{d} = \frac{d}{m}$ and $\tilde{n} = \frac{n}{m}$, the new problem reduces to problem (13), which concludes the proof immediately.
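Since Theorem 3.7 reduces the choice of $n$ to the vertex of a quadratic, the computation is a few lines. Below is a minimal sketch assuming diagonal covariances (so the matrix square roots are elementwise); the function name `optimal_local_steps` and the synthetic statistics are our own illustration, not the paper's implementation.

```python
import numpy as np

def optimal_local_steps(mu_g, var_g, mu_l, var_l, eta, B):
    """Minimize d^2(n) = ||eta*mu_g - n*eta*mu_l||^2 + tr(M^2) over n >= 0,
    with M = (eta^2 Sigma_g / B)^{1/2} - (n^2 eta^2 Sigma_l / B)^{1/2}.
    With diagonal covariances, d^2 is the quadratic a*n^2 - 2*b*n + const,
    so the constrained minimizer is max(b / a, 0)."""
    A = eta * np.sqrt(var_g / B)
    C = eta * np.sqrt(var_l / B)
    a = eta ** 2 * (mu_l @ mu_l) + C @ C     # coefficient of n^2
    b = eta ** 2 * (mu_g @ mu_l) + A @ C     # half the (negated) linear coefficient
    return max(b / a, 0.0)

rng = np.random.default_rng(1)
mu_g = rng.normal(size=5)
mu_l = mu_g + 0.1 * rng.normal(size=5)       # a mildly non-iid client
var_g = np.ones(5); var_l = np.ones(5)

n_opt = optimal_local_steps(mu_g, var_g, mu_l, var_l, eta=0.1, B=64)
# identical local and global statistics recover the maximum value 1
n_iid = optimal_local_steps(mu_g, var_g, mu_g, var_g, eta=0.1, B=64)
```

For identical local and global statistics, `n_iid` equals exactly 1, matching the discussion that $n_1^{opt}$ attains its maximum 1 when $\Delta^2 = 0$.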


Figure 8: A similar case in Euclidean space: since the ideal optimization path is discrete, there is an optimal distance between the global updates and multiple local updates. Furthermore, when the global updates converge to the optimum, the local updates converge to the point nearest to the optimum.

According to Fig. 8, the stochastic gradient descent algorithm converges to the ϵ-neighborhood of the optimum after a constant number of steps (usually more than O(1/ϵ); Nesterov (1998)). For convenience, we refer to this constant as the optimal m_opt. In practice, as Fig. 8 indicates, when the difference between the global and local distributions is fixed, i.e., the maximal GSNR is a constant during a specific round, setting the target number of global steps to m_opt is an optimal choice. However, a large m leads to a large gradient-estimation error, which is of order O(η²n²). Hence, in practice, deciding m requires a trade-off between the estimation error and the corresponding convergence rate. We now turn to Definition 4.1, the function used to calculate GSNR. Specifically, the Cauchy-Schwarz inequality implies that ∆² ≥ 0, with equality when the local distribution coincides with the global distribution, i.e., when the data is iid distributed among all participants. As ∆² decreases, which means the local data distribution approaches the global data distribution, n_1^opt gradually increases; when ∆² reaches its minimum 0, n_1^opt attains its maximum value 1. Based on this analysis, we conclude that the more similar the local dataset is to the global dataset, i.e., the larger the GSNR the local dataset achieves, the more local updates the optimization procedure needs, which matches the heuristic observations in Li et al. (2020b). To calculate GSNR, we derive r(w) as in Definition 4.1, where r(w) is positively related to n_1^opt.
On the one hand, when n_1^opt = 0, its definition gives ⟨R_l, R_g⟩_F = 0, hence the optimal distance ∆² attains its maximum ∥R_g∥²_F and r(w) attains its minimum 0. On the other hand, when the local distribution coincides with the global distribution, i.e., ∆² = 0, we have r(w) → +∞; we treat this scenario as a noiseless optimization procedure, with the data iid distributed across participants. Therefore, r(w) ∈ (0, +∞). On the one hand, when the data is iid distributed among all participants, i.e., the GSNR goes to +∞, the distributed optimization is a noiseless procedure, which means the local updates are unbiased. On the other hand, when the GSNR is 0, we have ⟨R_l, R_g⟩_F ≤ 0, so the angle between the local gradient and the global gradient is at least 90°. In other words, for the current optimization the local data distribution is independent of the global data distribution; for global optimization it is no better than a random guess, so its signal component is set to 0, which drives the GSNR to 0. As for the relationship between GSNR and the parameters w, Fig. 9 displays a representative scenario. Due to the randomness of SGD, the new parameters after w_1 following another aggregation can be either w_2 or w_2'. On the one hand, if the parameters become w_2, which means we move closer to the optimum of client C_2, the GSNR changes differently across clients: for C_1 the GSNR increases, while for C_2 the update vector is almost orthogonal to the global optimization direction, which implies its GSNR is closer to 0. On the other hand, if the parameters become w_2', which is closer to the optimum of C_1, the phenomenon is reversed: the GSNR of C_1 decreases and the GSNR of C_2 increases.
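The behavior at the two extremes can be illustrated with a small sketch of the signal/noise decomposition, under the assumption that GSNR is the squared-norm ratio of the projection of $R_l$ onto $R_g$ (signal) to its orthogonal residual (noise); the exact form of $r(w)$ in Definition 4.1 may differ, and the function name `gsnr` is ours.

```python
import numpy as np

def gsnr(R_l, R_g, eps=1e-12):
    """Decompose the local update R_l into a signal component parallel to the
    global update R_g and a noise component orthogonal to it (cf. Fig. 1),
    then return the squared-norm ratio under the Frobenius inner product."""
    inner = np.sum(R_l * R_g)                    # <R_l, R_g>_F
    if inner <= 0:                               # no better than a random guess:
        return 0.0                               # the signal component is set to 0
    signal = (inner / np.sum(R_g * R_g)) * R_g   # projection of R_l onto R_g
    noise = R_l - signal
    return np.sum(signal ** 2) / (np.sum(noise ** 2) + eps)

rng = np.random.default_rng(2)
R_g = rng.normal(size=(4, 4))
iid_case = gsnr(R_g.copy(), R_g)   # identical updates: noise ~ 0, GSNR blows up
opposed = gsnr(-R_g, R_g)          # opposing update: GSNR = 0
```

The two endpoint cases reproduce the analysis above: an iid client yields an effectively unbounded GSNR, while an update whose Frobenius inner product with the global direction is non-positive yields GSNR 0.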
Hence, during the training process, r(w) is a random variable correlated with the random process w; if we want to use GSNR to evaluate the contributions of different participants, we need to observe its statistics, e.g., the mean or the median.

D DETAILS OF EXPERIMENTS

D.1 DIFFERENT METHODS OF DATA PARTITION

Dirichlet partition. We follow the settings of Hsu et al. (2019) to generate non-iid data across participants via a Dirichlet distribution. Specifically, the prior distribution is set to be uniform, and the parameter α represents the level of concentration. As α → +∞, the data distributions of all participants tend to be identical, so the data is iid distributed among all clients; as α → 0, each participant possesses data chosen from just one class, i.e., one label per participant. As for Label 2, it is a specific partition method from Hsu et al. (2019) in which each client owns data sampled from 2 classes. NonBalance partition. With the NonBalance partition, we aim to simulate the practical scenario of imbalanced information distribution. Specifically, we divide all participants into three categories: abundant information, medium information, and less information, which represent clients possessing data drawn from different numbers of labels. First, clients with abundant information randomly draw data from all labels; they account for 10% of all clients. Second, clients with medium information each randomly choose 50% of the classes and then receive data randomly according to their chosen labels; they account for 40%. Finally, for clients with less information, the number of labels drops to 20% of the classes, and their share rises to 50%. Pareto partition. The Pareto distribution is common in practice: it captures the long-tailed distributions of real-world phenomena such as node degrees in complex networks, the distribution of social wealth, and follower counts in social networks. Hence, we design the Pareto partition to simulate the so-called Two-Eight (80/20) distribution. First, we sample N points from the Pareto distribution $p(x) = \frac{k\,x_{\min}^{k}}{x^{k+1}}$ if $x \ge x_{\min}$, and $0$ otherwise, where N represents the number of clients.
Denote the corresponding samples as X = {x_i}_{i=1}^N; we normalize each x_i as x̃_i = x_i / max(X) to guarantee all values lie in [0, 1]. Finally, we use x̃_i as the ratio of classes possessed by client i for random sampling, and set the minimum number of labels among all clients to 1.

As Fig. 10 shows, the decrease of GSNR for random inputs is larger than that for label changing. This is consistent with our intuition, since the malicious attack of label changing retains more information than random inputs. Simply put, after label changing the model still learns that such data belong to the same class. For example, if we change all labels of 'cat' to 'dog', we still know that the 'cat' data belong to a single class, even though we call them 'dog'. In contrast, randomized inputs cannot provide this information.

Turning to Table 1, it indicates that FedGSNR not only converges faster but also achieves accuracy comparable to its counterparts. As for the accuracy drop in Table 3, this is possibly because FedGSNR is a gradient-based algorithm: if the base method introduces gradient estimation (e.g., Scaffold), the performance of FedGSNR is correlated with the precision of that estimation, and a relatively low precision leads to the corresponding accuracy drop. Meanwhile, as illustrated in Fig. 7, the Label 2 method distributes the data with a certain symmetry with respect to information, i.e., the GSNRs of the participants are similar to one another; this is naturally compatible with identical local updates, hence the test accuracies of FedGSNR and its counterparts are close to each other.
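The Pareto partition can be sketched as follows. Note that NumPy's `Generator.pareto` samples the shifted (Lomax) form, so we add 1 and scale by $x_{\min}$ to recover the classical density above; the helper names are our own.

```python
import numpy as np

def pareto_class_ratios(num_clients, k=1.5, x_min=1.0, seed=0):
    """Sample one value per client from the Pareto density
    p(x) = k * x_min**k / x**(k+1) for x >= x_min, then normalize by the
    maximum so every ratio lies in (0, 1]."""
    rng = np.random.default_rng(seed)
    # numpy's pareto() is the shifted (Lomax) form; (1 + X) * x_min is classical
    x = x_min * (1.0 + rng.pareto(k, size=num_clients))
    return x / x.max()

def labels_per_client(ratios, num_classes=10):
    # each client draws data from ratio * num_classes classes, at least one
    return np.maximum(1, np.round(ratios * num_classes).astype(int))

ratios = pareto_class_ratios(100)
counts = labels_per_client(ratios)
```

With a tail index like k = 1.5, most clients end up with a small fraction of the classes while a few hold many, mirroring the long-tailed "Two-Eight" scenario described above.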

D.2 ADDITIONAL EXPERIMENTS

Lemma E.7 (Bounding the variance). Assume Assumption 3.1 holds; then $\mathbb{E}\|g_t - \bar{g}_t\|^2 \le \sum_{k=1}^{K} p_k^2\sigma_k^2 \le \sigma^2$.

We focus on Lemma 3 in Li et al. (2020b), which involves the number of local steps, and prove that the lemma still holds when our method decides the number of local steps.

Lemma E.8 (Bounding the divergence of $w_t^k$ when FedGSNR decides the number of local updates). Under the stated assumptions, $\mathbb{E}\big[\sum_{k=1}^{K} p_k\|\bar{w}_t - w_t^k\|^2\big] \le 4\eta_t^2 E_{const}^2\, B G^2\,\frac{L}{\mu}$, where $\bar{w}_t = \sum_k p_k w_t^k$.

Proof. Based on our strategy, the number of local steps is decided individually as $E_{k,r} = n_{1,k,r}^{opt}\cdot E_{const}$, where $E_{const}$ is a constant. Hence we define the local optimization process as
$$w_k^{t+1} = \begin{cases} w_k^t - \eta_t\nabla_w F(w_k^t), & 0 \le t - t_0 < E_{k,r},\\ w_k^t, & E_{k,r} \le t - t_0 < E_{max},\end{cases} \quad (14)$$
where $w_k^{t_0} = \bar{w}_{t_0}$ represents the aggregation step; without loss of generality, $t_0$ is the initial time of communication round $r$, and $E_{max} = \max\{E_{k,r}\mid 1\le k\le K,\, 1\le r\le R\}$. The situation then reduces to one with an identical number of local steps, and we prove that the upper bound is independent of $E_{max}$. We use the facts that $t - t_0 < E_{max}$ (where $t_0$ is the last aggregation step before $t$), that $\eta_t$ is non-increasing, and that $\eta_{t_0} \le 2\eta_t$ for all $t - t_0 \le E_{max}$; we have
$$\overset{(3)}{\le} \sum_{k=1}^{K} p_k\,\mathbb{E}\Big[\sum_{s=t_0}^{t_0+E_{k,r}-1} E_{k,r}\,\eta_s^2\,\|\nabla_w F(w_k^s)\|^2\Big] \overset{(4)}{\le} \sum_{k=1}^{K} p_k\,\eta_{t_0}^2\,\mathbb{E}\Big[\sum_{s=t_0}^{t_0+E_{k,r}-1} E_{const}\,\frac{\|R_g^{t_0}\|}{\|R_k^{t_0}\|}\,\|\nabla_w F(w_k^s)\|^2\Big] \overset{(5)}{\le} \underbrace{\sum_{k=1}^{K} p_k\,\eta_{t_0}^2\,E_{const}\,\mathbb{E}\Big[\sum_{s=t_0}^{t_0+E_{k,r}-1} \frac{\sqrt{B}\,G}{\mathbb{E}[\|\nabla_w F(w_k^{t_0})\|^2]}\,\|\nabla_w F(w_k^s)\|^2\Big]}_{A_1},$$
where inequality (1) depends on $\bar{w}_t = \sum_k p_k w_t^k$ and the fact $\|a+b\|^2 \le 2(\|a\|^2+\|b\|^2)$; inequality (2) is based on the local optimization process, Eq. (14); and inequality (3) is a consequence of $\|\sum_{i=1}^{n} a_i\|^2 \le n\sum_{i=1}^{n}\|a_i\|^2$.



w_g denotes the ideal optimization path based on the global data distribution, i.e., the ideal distribution of the data gathered from all participants, while w is the corresponding practical path defined by Eq. (3). For example, η_r = η_0 α^r refers to a widely used learning-rate decay method with decay rate α < 1. Note that our proposed FedGSNR is a compatible method, and the referenced FedAvg can also be replaced by other methods (e.g., FedProx). ⟨·, ·⟩_F and ∥·∥_F denote the Frobenius inner product and the Frobenius norm, respectively. Speedup (Karimireddy et al., 2020), i.e., S = T_old / T_new, measures the relative performance of two methods.



Figure 1: An example of Gradient Signal to Noise Ratio (GSNR). A local step can be decomposed into two components, signal and noise: the former is parallel to the update computed with global data, and the latter is orthogonal to it.

Related Work. A large body of literature is devoted to improving FL, including convergence (Karimireddy et al., 2020; Li et al., 2020b; Wang et al., 2020a; Reddi et al., 2021), robustness (Mohri et al., 2019; Fang et al., 2020; Li et al., 2021), and data privacy (Melis et al., 2019; Zhu et al., 2019; Bagdasaryan et al., 2018). Regarding GSNR, Rainforth et al. (2018) and Liu et al. (2020) analyze generalization and variational bounds with this concept. In this work, we focus on the relationship between GSNR and the optimal number of local updates in FL scenarios. To control the noise component (client drift), Karimireddy et al. (2020) proposed Scaffold.

Figure 3: The entropy of local steps for different partition methods.

Figure 4: Test accuracy of different algorithms with different local steps.

Figure 5: An imbalanced scenario with Pareto partition on CIFAR-100 dataset.

Figure 6: The variation of GSNR when we change the labels of a specific participant.

Model convergence of FedGSNR. In the experiments, we set µ = 0.01 for FedProx and compare the performance of each algorithm against its combination with FedGSNR; the results are reported in Table 1.

(a) follows from the sub-multiplicative property of the matrix norm, (b) is based on the Cauchy-Schwarz inequality, (c) is an immediate consequence of Assumption 3.4 and the local optimization process, (d) comes from the fact $\|\sum_{i=1}^{n} a_i\|^2 \le n\sum_{i=1}^{n}\|a_i\|^2$, and (e) is based on Assumption 3.1, where $G^2 := \sigma^2 + \mu^2$ and $\mu = \max(\{\|\mu_i\|\}_{i\in\{1,\cdots,n\}})$.

Figure 9: A representative scenario of GSNR: r(w) is a random variable with respect to w. If we move closer to the optimum of C_2, the GSNR of C_1 increases and the GSNR of C_2 decreases; if we move closer to the optimum of C_1, the situation changes in the opposite way.

Fig. 10 displays the change of GSNR and the corresponding test accuracy when we apply different malicious changes to a client. Interestingly, it is instructive to compare Fig. 10(a) and 10(b) with Fig. 10(c) and 10(d).

Figure 10: We apply different malicious changes to client 0 under different partition methods. In (a) and (b), we change each label l of client 0 to l + 3. In (c) and (d), the labels remain unchanged and we replace the input data with samples from the uniform distribution U(−1, 1). Panels (e)-(h) show the corresponding test accuracies for these scenarios.

Inequality (4) depends on the Cauchy-Schwarz inequality, i.e., $n_1^{opt} \le \frac{\|R_g\|\,\|R_l\|}{\|R_l\|^2} = \frac{\|R_g\|}{\|R_l\|}$, and inequality (5) is an immediate consequence of Lemmas E.1 and E.2. Then, based on $L$-smoothness, we have $\|\nabla_w F(w_k^s)\|^2 \le 2L\,(F(w_k^s) - F(w^*))$; similarly, according to $\mu$-strong convexity, we have $\|\nabla_w F(w_k^s)\|^2 \ge 2\mu\,(F(w_k^s) - F(w^*))$.

Example of FedGSNR in conjunction with FedAvg
Input: initial model $w_1$, learning rate $\eta_0$, sample size $B$, and chosen number of global steps $M$
for $r = 1$ to $R$ do
    Sample clients $S \subseteq \{1,\cdots,K\}$
    Server: send $w_1$ and $\eta_r$ to each client $i \in S$
    On each active client $i$ in parallel: initialize the local model $w_i \leftarrow w_1$, compute $\hat{g}_i$ and $\mathrm{diag}(\hat{\Sigma}_i)$, and send them to the server
    Server: compute $n_{1,i}^{opt}$ with Theorem 3.7 for each client $i$, and send it to client $i$
    On each active client $i$ in parallel: perform $n_{1,i}^{opt}\cdot E_{const}$ local updates
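Putting the pieces together, a toy sketch of one FedGSNR-with-FedAvg round on synthetic quadratic clients might look like the following. The function `fedgsnr_round`, the inline `optimal_local_steps` helper, and the diagonal-covariance simplification are our own assumptions, not the authors' implementation.

```python
import numpy as np

def optimal_local_steps(mu_g, var_g, mu_l, var_l, eta, B):
    # vertex of the quadratic d^2(n) from Theorem 3.7, diagonal covariances
    A = eta * np.sqrt(var_g / B)
    C = eta * np.sqrt(var_l / B)
    a = eta ** 2 * (mu_l @ mu_l) + C @ C
    b = eta ** 2 * (mu_g @ mu_l) + A @ C
    return max(b / a, 0.0)

def fedgsnr_round(w, clients, eta_r, B, E_const):
    """One communication round: clients report gradient statistics, the
    server computes n_opt per client, clients run n_opt * E_const local
    SGD steps, and the server averages the resulting models (FedAvg)."""
    stats = []
    for grad_fn in clients:
        g = np.stack([grad_fn(w) for _ in range(B)])    # B stochastic gradients
        stats.append((g.mean(axis=0), g.var(axis=0)))   # (mu_hat, diag Sigma_hat)
    mu_g = np.mean([mu for mu, _ in stats], axis=0)     # global statistics
    var_g = np.mean([v for _, v in stats], axis=0)

    new_ws = []
    for grad_fn, (mu_l, var_l) in zip(clients, stats):
        n_opt = optimal_local_steps(mu_g, var_g, mu_l, var_l, eta_r, B)
        w_i = w.copy()
        for _ in range(max(1, round(n_opt * E_const))):
            w_i = w_i - eta_r * grad_fn(w_i)            # local SGD step
        new_ws.append(w_i)
    return np.mean(new_ws, axis=0)                      # FedAvg aggregation

# two clients with noisy quadratic losses centered at different targets
rng = np.random.default_rng(3)
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
clients = [lambda w, t=t: (w - t) + 0.1 * rng.normal(size=2) for t in targets]

w = np.zeros(2)
for r in range(50):
    w = fedgsnr_round(w, clients, eta_r=0.1, B=16, E_const=3)
# w should settle near the average of the two targets, [0.5, 0.5]
```

In this toy setup the aggregated model converges to the average of the clients' optima, and the per-client `n_opt` shrinks as the local gradient directions decorrelate from the global one.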

Communication rounds to reach 0.5 accuracy and the corresponding speedup of FedGSNR on CIFAR-10. We distribute the data among 30 clients, use a batch size of 64, and set E_const = 20.

Communication rounds to reach 0.5 test accuracy for classification on NonBalance CIFAR-10 with 100 participants as we vary the number of active clients.

FedGSNR combined with different algorithms achieves faster convergence than the originals, reaching a 1.69× speedup on average with comparable accuracy; the corresponding accuracies can be found in Appendix D.2. Besides, Fig. 4 illustrates the accuracy for different values of E_const: FedGSNR with FedAvg converges faster and achieves better accuracy across different numbers of local steps (Scaffold fails to work when we set E_const = 25). Moreover, Pareto's Law is a common principle in practice, meaning a small number of participants possess a large share of the information. Fig. 5(a) indicates that FedGSNR with different algorithms converges faster and reaches comparable accuracy. Meanwhile, the GSNRs of the different participants resemble their label distributions (the histogram at the bottom of Fig. 5(b)), which demonstrates that GSNR can distinguish the information quality of different local datasets. Furthermore, Table 2 indicates that increasing the number of active clients speeds up the convergence of all algorithms. In particular, FedGSNR gains more benefit from global information, as its speedup increases from 1.4× to 1.8× as the number of active clients grows.

Best test accuracy on CIFAR-10. We distribute the data among 30 clients, use a batch size of 64, and set E_const = 20 for all algorithms.

Table 3 displays the corresponding test accuracy of the algorithms listed in Table 1.

E AN EXAMPLE OF CONVERGENCE ANALYSIS

As defined in Sec. 4, we have the bounds below, where µ* and Σ* are the corresponding mean vector and covariance matrix calculated from the samples drawn from D*, respectively, and where B ≥ 1.

Proof. First, we have the stated bound, which concludes the proof.

Proof. Similarly.

We then analyze the convergence of FedGSNR with FedAvg based on the proofs of Li et al. (2020b). For the analysis, we make additional assumptions.

Assumption E.3. The functions F_k are all L-smooth for all w and v, where F_k(w) is short for F(w; ξ | C_k).

Lemmas 1 and 2 in Li et al. (2020b) give the bound for one step of SGD and the bound for the variance of the gradients; as they are independent of the number of local steps, we can use them directly.

Lemma E.6 (Result of one step of SGD). Assume Assumptions E.3 and E.4 hold. If η_t ≤ 1/(4L), the stated bound holds.

Hence, we can further bound A_1 as above, where the second inequality depends on F(w*) = min_w F(w) and on the fact that the process is a non-increasing sequence; the third inequality depends on the upper bound of E_{k,r}, and the last inequality is a consequence of η_{t_0} ≤ 2η_t. Hence, we obtain a bound that is related to the constant E_const.

We can then prove Theorem 1 in Li et al. (2020b) by substituting the upper bound of the divergence.

Theorem E.9. Let the assumptions hold, with L, µ, G, σ defined therein. Choose κ = L/µ, γ ≥ max{8κ − 1, E_max}, and the learning rate η_t = 2/(µ(γ + t)). Then FedGSNR with FedAvg satisfies the corresponding convergence bound.
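The step-size facts used in the proofs (η_t non-increasing, and η_{t_0} ≤ 2η_t for t − t_0 ≤ E_max, which follows from γ ≥ E_max) can be checked numerically for Theorem E.9's schedule. This is a sanity check with illustrative constants of our choosing, not a proof.

```python
# Schedule from Theorem E.9: eta_t = 2 / (mu * (gamma + t)) with
# gamma >= max(8 * kappa - 1, E_max) and kappa = L / mu.
mu, L, E_max = 0.5, 4.0, 20          # illustrative constants, not from the paper
kappa = L / mu
gamma = max(8 * kappa - 1, E_max)

def eta(t):
    return 2.0 / (mu * (gamma + t))

# eta_t is non-increasing ...
assert all(eta(t + 1) <= eta(t) for t in range(1000))
# ... and eta_{t0} <= 2 * eta_t whenever t - t0 <= E_max, since gamma >= E_max
assert all(eta(t0) <= 2 * eta(t)
           for t0 in range(0, 500, 7)
           for t in range(t0, t0 + E_max + 1))
```

The second property is exactly the inequality invoked in the proof of Lemma E.8, and it holds for any t_0 because γ + t ≤ 2(γ + t_0) whenever t − t_0 ≤ E_max ≤ γ.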

