FEDSPEED: LARGER LOCAL INTERVAL, LESS COMMUNICATION ROUND, AND HIGHER GENERALIZATION ACCURACY

Abstract

Federated learning is an emerging distributed machine learning framework that jointly trains a global model across a large number of local devices while protecting data privacy. Its performance suffers from the non-vanishing biases introduced by inconsistent local optima and from the rugged client-drifts caused by local over-fitting. In this paper, we propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems. Concretely, FedSpeed applies a prox-correction term to the current local updates to efficiently reduce the bias introduced by the prox-term, a necessary regularizer for maintaining strong local consistency. Furthermore, FedSpeed merges the vanilla stochastic gradient with a perturbation computed from an extra gradient ascent step in the neighborhood, thereby alleviating local over-fitting. Our theoretical analysis indicates that the convergence rate is related to both the number of communication rounds T and the local interval K, with an upper bound of O(1/T) when a proper local interval is set. Moreover, we conduct extensive experiments on real-world datasets to demonstrate the efficiency of our proposed FedSpeed, which converges significantly faster and achieves state-of-the-art (SOTA) performance under general FL experimental settings compared to several baselines, including FedAvg, FedProx, FedCM, FedAdam, SCAFFOLD, FedDyn, FedADMM, etc.

1. INTRODUCTION

Since McMahan et al. (2017) proposed federated learning (FL), it has gradually evolved into an efficient paradigm for large-scale distributed training. Different from traditional deep learning methods, FL allows multiple local clients to jointly train a single global model without data sharing. However, FL is far from mature, as it still suffers from considerable performance degradation over heterogeneously distributed data, a very common setting in practical applications of FL. We identify the main culprits behind this performance degradation as local inconsistency and local heterogeneous over-fitting. Specifically, for canonical local-SGD-based FL methods, e.g., FedAvg, the non-vanishing biases introduced by the local updates may eventually lead to inconsistent local solutions. The rugged client-drifts resulting from local over-fitting to these inconsistent solutions may then degrade the obtained global model into a mere average of the clients' local parameters. The non-vanishing biases have been studied in different forms by several previous works (Charles & Konečnỳ, 2021; Malinovskiy et al., 2020). The inconsistency due to local heterogeneous data compromises the global convergence during training and eventually leads to serious client-drifts, which can be formulated as $x^* \neq \frac{1}{m}\sum_{i\in[m]} x_i^*$, where $x^*$ is the global optimum and $x_i^*$ is the optimum of client $i$'s local objective. Larger data heterogeneity may enlarge the drifts, thereby degrading the practical convergence rate and generalization performance.

In order to strengthen the local consistency during local training and avoid the client-drifts resulting from local over-fitting, we propose a novel and practical algorithm, dubbed FedSpeed. Notably, FedSpeed incorporates two novel components to achieve SOTA performance. i) Firstly, FedSpeed inherits a penalized prox-term to force the local offset to stay close to the initial point at each communication round. However, recognizing from Hanzely & Richtárik (2020) and Khaled et al. (2019) that the prox-term between the global and local solutions may introduce an undesirable local training bias, we propose and utilize a prox-correction term to counteract this adverse impact. Indeed, in our theoretical analysis, the prox-correction term can be interpreted as a momentum-based term of the weighted local gradients. By utilizing the historical gradient information, the bias brought by the prox-term can be effectively corrected. ii) Secondly, to avoid rugged local over-fitting, FedSpeed incorporates a local gradient perturbation by merging the vanilla stochastic gradient with an extra gradient, which can be viewed as taking an extra gradient ascent step for each local update. Based on the analysis in Zhao et al. (2022) and van der Hoeven (2020), we demonstrate that the gradient perturbation term can be approximated as adding a penalty on the squared L2-norm of the stochastic gradients to the original objective, which efficiently searches for flat local minima (Andriushchenko & Flammarion, 2022) and thus prevents local over-fitting. We also provide a theoretical analysis of our proposed FedSpeed and further demonstrate that its convergence can be accelerated by setting an appropriately large local interval K.
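To make the two components concrete, the following is a minimal sketch of one FedSpeed-style local round, based only on the description above (an ascent-step perturbation merged with the vanilla gradient, a prox-term toward the round-start point, and a prox-correction term). The step sizes, the merging weight, and the exact rule for updating the correction term are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch of one FedSpeed-style local round (assumed hyperparameters).
import numpy as np

def fedspeed_local_round(x_global, grad_fn, gamma, K=5, eta=0.1, rho=0.05, lam=0.1, alpha=0.5):
    """One client's local round.

    grad_fn(x): returns a stochastic gradient of the local objective at x.
    gamma:      the client's running prox-correction term (same shape as x_global).
    """
    x = x_global.copy()
    for _ in range(K):
        g = grad_fn(x)                                      # vanilla stochastic gradient
        x_adv = x + rho * g / (np.linalg.norm(g) + 1e-12)   # extra gradient *ascent* step
        g_adv = grad_fn(x_adv)                              # gradient at the perturbed point
        g_merged = alpha * g_adv + (1.0 - alpha) * g        # merge perturbed and vanilla gradients
        prox = (x - x_global) / lam                         # prox-term pulls x toward the round start
        x = x - eta * (g_merged + prox - gamma)             # prox-correction offsets the prox bias
    gamma = gamma - (x - x_global) / lam                    # illustrative correction-term update
    return x, gamma

# Example usage on a toy quadratic client objective f_i(x) = 0.5 * ||x - b||^2:
b = np.array([1.0, -2.0])
x_new, gamma_new = fedspeed_local_round(np.zeros(2), lambda x: x - b, np.zeros(2))
```

In this sketch the correction term accumulates the local offsets across rounds, which is what allows the prox-term to enforce local consistency without permanently biasing the local solutions.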
Explicitly, in the non-convex and smooth case, FedSpeed with an extra gradient perturbation achieves a fast convergence rate of O(1/T), which indicates that FedSpeed attains a tighter upper bound with a properly chosen local interval K, without applying a specific global learning rate or assuming a precision for the local solutions (Durmus et al., 2021; Wang et al., 2022). Extensive experiments are conducted on the CIFAR-10/100 and TinyImageNet datasets with a standard ResNet-18-GN network under different heterogeneous settings, showing that our proposed FedSpeed is significantly better than several baselines, e.g., FedAvg, FedProx, FedCM, FedPD, SCAFFOLD, and FedDyn, in terms of both the stability when enlarging the local interval K and the test generalization performance in practical training. In summary, the main contributions of this paper are as follows:

• We propose a novel and practical federated optimization algorithm, FedSpeed, which applies a prox-correction term to significantly reduce the bias due to the local updates of the prox-term, and an extra gradient perturbation to efficiently avoid local over-fitting, thereby achieving a fast convergence speed with large local steps while maintaining high generalization.

• We provide a convergence upper bound in the non-convex and smooth case and prove that FedSpeed achieves a fast convergence rate of O(1/T) by enlarging the local training interval to K = O(T), without any other harsh assumptions or specific required conditions (stated schematically after this list).

• Extensive experiments are conducted on the CIFAR-10/100 and TinyImageNet datasets to verify the performance of our proposed FedSpeed. To the best of our knowledge, both the convergence speed and the generalization performance achieve SOTA results under general federated settings. FedSpeed outperforms the other baselines and is more robust to enlarging the local interval.
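For concreteness, the rate claimed in the second contribution can be stated schematically as follows; this is only a paraphrase of the stated O(1/T) bound with K = O(T), with all constants and the smoothness/bounded-variance conditions of the formal analysis omitted:

```latex
% Schematic statement of the claimed rate (constants and formal conditions omitted).
\[
  \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\bigl\|\nabla f(x^{t})\bigr\|^{2}
  \;\le\; \mathcal{O}\!\left(\frac{1}{T}\right)
  \qquad \text{with local interval } K = \mathcal{O}(T),
\]
```

where f denotes the global objective and x^t the global model after round t; the exact conditions and constants are those given in the theoretical analysis.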

2. RELATED WORK

McMahan et al. (2017) propose the federated framework, which jointly trains a model over several unbalanced and non-iid local datasets while keeping the communication cost low throughout the training stage. The general FL optimization involves a local client training stage and a global server update operation (Asad et al., 2020), and it has been proved to achieve a linear speedup property (Yang et al., 2021). With the fast development of FL, a series of efficient optimization methods have been applied within the federated framework. Li et al. (2020b) and Kairouz et al. (2021) provide detailed overviews of this field. There are still many difficulties to be solved in practical scenarios, while in this paper we focus on two main challenges, local inconsistent solutions and client-drifts due to heterogeneous over-fitting, which are two acute limitations of the federated framework.
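As a point of reference for the local-training-plus-server-update structure mentioned above, the following is a generic sketch of one FedAvg-style communication round: local SGD on each participating client followed by a data-size-weighted server average. All names and hyperparameters here are illustrative and not tied to any particular implementation.

```python
# Generic sketch of one FedAvg-style round: local client training followed by a
# data-size-weighted server average. Names and hyperparameters are illustrative.
import numpy as np

def fedavg_round(x_global, clients, local_steps=5, lr=0.1):
    """clients: list of (grad_fn, num_samples) pairs for the participating devices."""
    local_models, weights = [], []
    for grad_fn, num_samples in clients:
        x = x_global.copy()
        for _ in range(local_steps):            # local client training stage (local SGD)
            x = x - lr * grad_fn(x)
        local_models.append(x)
        weights.append(num_samples)
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    # global server update: weighted average of the returned local models
    return sum(w * m for w, m in zip(weights, local_models))

# Example with two toy quadratic clients f_i(x) = 0.5 * ||x - b_i||^2:
clients = [(lambda x, b=np.array([1.0, 0.0]): x - b, 60),
           (lambda x, b=np.array([0.0, 2.0]): x - b, 40)]
x_next = fedavg_round(np.zeros(2), clients)
```

FedSpeed modifies the inner local-update loop of this template along the lines of the prox-regularized, perturbation-corrected update sketched in the introduction.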

