FEDGSNR: ACCELERATING FEDERATED LEARNING ON NON-IID DATA VIA MAXIMUM GRADIENT SIGNAL TO NOISE RATIO

Anonymous

Abstract

Federated learning (FL) allows participants to jointly train a model without direct data sharing. In such a process, the participants rather than the central server perform local updates of stochastic gradient descent (SGD), and the central server aggregates the gradients from the participants to update the global model. However, the non-iid training data held by the participants significantly impact global model convergence. Most existing studies address this issue via variance reduction or regularization. However, these studies focus on specific datasets and lack theoretical guarantees for efficient model training. In this paper, we provide a novel perspective on the non-iid issue by optimizing the Gradient Signal to Noise Ratio (GSNR) during model training. In each participant, we decompose the local gradients calculated on the non-iid training data into signal and noise components and then speed up model convergence by maximizing GSNR. We prove that GSNR can be maximized by using the optimal number of local updates. Subsequently, we develop FedGSNR to compute the optimal number of local updates for each participant, which can be applied to existing gradient calculation algorithms to accelerate global model convergence. Moreover, according to the positive correlation between GSNR and the quality of shared information, FedGSNR allows the server to accurately evaluate the contributions of different participants (i.e., the quality of their local datasets) by utilizing GSNR. Extensive experimental evaluations demonstrate that FedGSNR achieves on average a 1.69× speedup with comparable accuracy.

1. INTRODUCTION

Federated learning (FL) McMahan et al. (2017) targets a practical scenario in which multiple participants collaboratively train a model without direct data sharing. Different from the typical centralized optimization problem, FL decomposes the optimization problem into several sub-problems and distributes them to different participants, each solved separately on the corresponding local dataset. In reality, these local datasets often follow non-iid distributions. A key challenge in FL is therefore how to train a model well on non-iid data spread across participants. On the one hand, such imbalance breaks the unbiased optimization procedure when multiple local updates are used. On the other hand, due to the differences between local datasets, the non-iid information distribution makes it difficult to evaluate the contribution of each participant. The former slows down model convergence, a key factor of efficiency. The latter concerns contribution evaluation, which underlies malicious data tampering detection, contribution-based profit distribution, incentive mechanism design, etc. In the current data-driven age Sim et al. (2020), contribution evaluation is particularly important. Worse, beyond having to pursue them separately, the two goals of accelerating model convergence and improving contribution evaluation accuracy can even conflict with each other. To address these challenges, we propose a novel approach that speeds up model training by maximizing the Gradient Signal to Noise Ratio (GSNR). The intuition behind the design is two-fold. First, there is always a global optimal solution no matter how the data is distributed; for each local dataset, we can obtain an optimal optimization direction, i.e., the global gradient.
Second, from information theory, the signal-to-noise ratio determines the channel capacity via Shannon's formula C = W · log(1 + SNR); a larger GSNR thus means more information can be obtained within the same number of communication rounds, which accelerates model convergence. We therefore decompose the local optimization direction (i.e., the local gradient) into mutually orthogonal signal and noise vectors. Given the global gradient, the signal vector is parallel to it, while the noise vector is orthogonal to it. Fig. 1 shows a typical example of this orthogonal decomposition for two participants. We prove that the number of local updates controls the GSNR value, so GSNR can be maximized by computing the optimal number of local updates. To this end, we utilize the gradients uploaded by the participants to estimate the global gradient, and propose the FedGSNR algorithm to compute the optimal number of local updates from the estimated global gradient. Based on the GSNR perspective, we also develop a method to compute the GSNR for each dataset, which allows the server to evaluate each participant's contribution. Besides personalizing the number of local updates to optimize convergence efficiency, FedGSNR is orthogonal to existing methods, which mostly rely on modifying gradient calculation; hence, FedGSNR can be combined with these methods to further improve them.

In summary, our contributions in this paper are as follows:

• We prove that the optimal number of local updates yields the maximal GSNR, which leads to faster and more stable convergence.

• We analyze existing FL algorithms from the perspective of GSNR. Based on the viewpoint of GSNR maximization, we propose FedGSNR, an algorithm that can be combined with most current FL algorithms to calculate their optimal numbers of local updates.
• We derive a function r(w) to calculate GSNR, which can be utilized to evaluate the local contributions of different participants.

• We confirm our theoretical results on the CIFAR-10 and CIFAR-100 datasets. Experiments indicate that FedGSNR achieves on average a 1.69× speedup over the original algorithms when information is unevenly distributed among the participants, and that r(w) is a reasonable metric for local contributions.
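The orthogonal decomposition underlying GSNR can be sketched as follows. This is a minimal illustration, not the paper's algorithm: `gsnr_decompose` is a hypothetical helper, and it assumes access to the true global gradient, whereas FedGSNR only has the server-side estimate.

```python
import numpy as np

def gsnr_decompose(local_grad, global_grad):
    """Decompose a local gradient into a signal component (parallel to the
    global gradient) and a noise component (orthogonal to it), and return
    the resulting gradient signal-to-noise ratio."""
    g = global_grad / np.linalg.norm(global_grad)   # unit global direction
    signal = np.dot(local_grad, g) * g              # projection onto g
    noise = local_grad - signal                     # orthogonal remainder
    gsnr = np.linalg.norm(signal) ** 2 / np.linalg.norm(noise) ** 2
    return signal, noise, gsnr
```

For instance, a local gradient [3, 4] measured against a global direction [1, 0] splits into signal [3, 0] and noise [0, 4], giving a GSNR of 9/16.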



During the training phase, each participant solves its sub-problem via stochastic gradient descent (SGD) and sends back the result for aggregation. One of the most popular FL algorithms is FedAvg McMahan et al. (2017), which accelerates global model convergence through multiple local updates. Although it has shown great performance in many practical applications, open questions remain, especially in non-iid cases, and many previous works Haddadpour & Mahdavi (2019); Khaled et al. (2020); Li et al. (2020b) have made efforts to analyze its convergence or even to accelerate it.
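The FedAvg round structure described above can be sketched as follows. This is a toy illustration, assuming a least-squares objective in place of each participant's actual local loss and full-batch gradients in place of SGD minibatches; `fedavg_round` is an illustrative name, not part of any library.

```python
import numpy as np

def fedavg_round(global_w, local_datasets, local_steps, lr=0.1):
    """One FedAvg round: each participant runs `local_steps` gradient
    updates on its local least-squares objective, starting from the
    current global weights; the server then averages the results."""
    updated = []
    for X, y in local_datasets:
        w = global_w.copy()
        for _ in range(local_steps):
            grad = X.T @ (X @ w - y) / len(y)   # local gradient
            w -= lr * grad                      # one local update
        updated.append(w)
    return np.mean(updated, axis=0)             # server-side aggregation
```

Increasing `local_steps` saves communication rounds but, on non-iid data, lets each client drift toward its own optimum; choosing this number per participant is exactly what FedGSNR automates.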

Figure 1: An example of the Gradient Signal to Noise Ratio (GSNR). A local step can be decomposed into two components, signal and noise: the former is parallel to the update computed on global data, and the latter is orthogonal to it.

A large body of literature is devoted to improving FL, including convergence Karimireddy et al. (2020); Li et al. (2020b); Wang et al. (2020a); Reddi et al. (2021), robustness Mohri et al. (2019); Fang et al. (2020); Li et al. (2021), and data privacy Melis et al. (2019); Zhu et al. (2019); Bagdasaryan et al. (2018). Regarding GSNR, Rainforth et al. (2018); Liu et al. (2020) analyze generalization and variational bounds with this concept. In this work, we focus on the relationship between GSNR and the optimal number of local updates in FL scenarios. To control the noise component (client drift), Karimireddy et al. (2020) propose a gradient calculation method based on variance reduction. Li et al. (2020a) indicate that under non-iid FL conditions, a large number of local updates leads to divergence or instability, while Wang et al. (2020b) stabilize the training procedure with a new averaging strategy. On the other hand, Wang et al. (2019) formulate a practical optimization problem with resource constraints and determine the number of local updates for each participant according to the corresponding constraints.

