ENHANCE LOCAL CONSISTENCY FOR FREE: A MULTI-STEP INERTIAL MOMENTUM APPROACH

Abstract

Federated learning (FL), a collaborative distributed training paradigm in which edge computing devices train under the coordination of a centralized server, is plagued by inconsistent local stationary points arising from the heterogeneity of the partially participating clients. This inconsistency precipitates the client-drift problem and leads to unstable and slow convergence, especially on severely heterogeneous datasets. To address these issues, we propose a novel federated learning algorithm, named FedMIM, which adopts multi-step inertial momentum on the edge devices and enhances local consistency for free during training, improving robustness to heterogeneity. Specifically, we incorporate weighted global gradient estimations as inertial correction terms to guide both the local iterates and the stochastic gradient estimation, which naturally accounts for the global objective when optimizing on each edge's heterogeneous dataset and maintains the demanded local consistency. Theoretically, we show that FedMIM achieves an O(1/√(SKT)) convergence rate with a linear speedup property with respect to the number of selected clients S and a proper local interval K in each communication round under the nonconvex setting. Empirically, we conduct comprehensive experiments on various real-world datasets and demonstrate the efficacy of the proposed FedMIM against several state-of-the-art baselines.

1. INTRODUCTION

Federated Learning (FL) is an increasingly important distributed learning framework in which training data remains distributed over a large number of clients, such as mobile phones, wearable devices, or network sensors (Kairouz et al., 2021). In contrast to traditional machine learning paradigms, FL employs a centralized server to coordinate the participating clients in training a shared model without collecting the client data, thereby achieving a basic level of data privacy and security (Li et al., 2020a). The common pipeline toward this goal consists of three steps (Bonawitz et al., 2019): i) the server broadcasts the current model to the clients at the beginning of each communication round; ii) the clients synchronize with the received model and update their local models on their own data; iii) the server averages the latest local models, and these procedures repeat until convergence. Despite the empirical success of past work, several key challenges remain for FL: expensive communication, privacy concerns, and statistical diversity. The first two problems have been addressed well in prior work (Konečnỳ et al., 2016; Sattler et al., 2019; Hamer et al., 2020; Truex et al., 2019; Xu et al., 2019), while the last remains the main challenge to be dealt with. Due to the statistical diversity among clients in an FL system, client drift (Karimireddy et al., 2020a) leads to slow and unstable convergence during training. Under heterogeneous data, each client's optimum is not well aligned with the global optimum. The conventional FL algorithm does not consider this data heterogeneity and simply applies stochastic gradient descent to the local updates. As a consequence, the final converged solution may differ from the stationary point of the global objective function, since the average of the client updates moves towards the average of the clients' optima rather than the true optimum.
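The three-step pipeline above can be sketched in a few lines. The following is a minimal toy illustration (not the paper's implementation): `ToyClient` and all parameter names are hypothetical, and each client's objective is a simple quadratic so that its stochastic gradient oracle is easy to state.

```python
import numpy as np

class ToyClient:
    """Hypothetical client whose local objective is a quadratic centered at
    `target`, so its stochastic gradient at w is (w - target) plus noise."""
    def __init__(self, target, rng):
        self.target = np.asarray(target, dtype=float)
        self.rng = rng

    def stochastic_gradient(self, w):
        return (w - self.target) + 0.01 * self.rng.standard_normal(w.shape)

def fl_round(server_model, clients, sample_size, local_steps, lr, rng):
    """One communication round of the generic FL pipeline."""
    # i) Broadcast: the server samples clients and sends the current model.
    selected = rng.choice(len(clients), size=sample_size, replace=False)
    local_models = []
    for cid in selected:
        # ii) Local update: the client starts from the global model
        #     and runs K steps of local SGD on its own data.
        w = server_model.copy()
        for _ in range(local_steps):
            w = w - lr * clients[cid].stochastic_gradient(w)
        local_models.append(w)
    # iii) Aggregation: the server averages the returned local models.
    return np.mean(local_models, axis=0)
```

Note that with heterogeneous targets, the averaged iterate drifts toward the mean of the clients' optima, which is exactly the client-drift behavior discussed above.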
As distribution drift exists across clients' datasets, a model trained by empirical risk minimization may overfit the local training data, and it has been reported that the generalization performance on a client's local data may degrade when the client's training and testing distributions differ (Liang et al., 2020). To overcome these problems, several solutions have been put forward in recent years, which generally fall into three categories: variance-reduction based (Karimireddy et al., 2020b), regularization based (Li et al., 2020b; Acar et al., 2021), and momentum based (Xu et al., 2021; Reddi et al., 2020). Although these past works present effective methods to reduce client drift and improve generalization performance, the problem of local inconsistency is not fully considered. In practical settings, the local interval K is finite, so the local updates cannot reach the local optima. As the iterations proceed, the final points of the local iterations stabilize into a dynamic equilibrium. The stability of these equilibrium points determines the effectiveness of an algorithm, and their positions shift when different algorithms are applied. The variance among these points gives rise to the local inconsistency problem (Wang et al., 2021a). However, the analyses in these past works are not comprehensive, and experimental verification of the reduced local inconsistency is lacking. In particular, when data heterogeneity among clients rises, the local updates may conflict with each other, i.e., the directions of the local gradients are no longer compatible. The weighted average of the local gradients at the aggregation stage then becomes extremely small, and the moving global iterate may stagnate, which leads to poor generalization performance.
To settle this problem, a federated learning algorithm should incorporate historical information about the full gradient into the client local updates to scale down the variance between the local dynamic equilibrium points. Furthermore, this historical full-gradient information ought to be used judiciously to navigate the local updates rather than simply applied to the model weights. In this paper, we develop a new FL algorithm that enhances local consistency for free, the Federated Multi-step Inertial Momentum algorithm (FedMIM), which mitigates client drift and reduces local inconsistency. From a high-level algorithmic perspective, we bring multi-step inertial momentum into the local update; that is, multi-step momentum is placed in both the weights (orange arrow in Figure 1) and the gradients (yellow arrow in Figure 1) to modify the local update. Rather than computing the momentum updates on the server's side and transmitting them through the down-link, each client computes the momentum term before its local iterations, with the historical momentum kept in the client's storage. FedMIM has two major benefits in addressing the aforementioned deficiencies. First, FedMIM does not require the server to broadcast the momentum between rounds, which curtails the communication burden. Second, in contrast to previous work that focuses on server-side momentum (Karimireddy et al., 2020b) or client-side momentum (Xu et al., 2021), FedMIM employs inertial momentum terms to introduce global information, avoiding gradient conflicts in the local updates when there is large data heterogeneity among the participating clients. Theoretically, we provide a detailed convergence analysis for FedMIM. With a proper local learning rate, FedMIM achieves an O(1/√(SKT)) convergence rate with a linear speedup property in the general non-convex setting, where S is the number of selected clients, K the local interval, and T the number of communication rounds.
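To make the high-level description concrete, the following is an illustrative sketch (not the paper's exact recursion) of a FedMIM-style client update. The names `global_history`, `alpha`, and `beta` are our own notation: `global_history` is the client's stored list of recently received global models, newest first, and the differences between consecutive entries serve as cheap multi-step estimates of the global update direction, so the server need not broadcast any extra momentum.

```python
import numpy as np

def fedmim_local_update(w_global, global_history, alpha, beta,
                        stochastic_gradient, local_steps, lr):
    """Illustrative FedMIM-style local update under our assumed notation.

    global_history : recent global models received by this client, newest
                     first (kept in client storage; nothing extra is sent).
    alpha, beta    : weights of the multi-step inertial terms applied to the
                     iterate and to the gradient, respectively.
    """
    # Multi-step inertial terms: weighted differences of stored global models.
    diffs = [global_history[i] - global_history[i + 1]
             for i in range(len(global_history) - 1)]
    inertial = sum(a * d for a, d in zip(alpha, diffs))
    # Correction on the iterate (orange arrow in Figure 1):
    # start local training from an extrapolated point.
    w = w_global + inertial
    # Correction on the gradient (yellow arrow in Figure 1):
    # a fixed shift that steers every local step toward the global direction.
    grad_corr = sum(b * d for b, d in zip(beta, diffs))
    for _ in range(local_steps):
        g = stochastic_gradient(w)
        w = w - lr * g + grad_corr
    return w
```

When the stored global models are identical (e.g., in the first round), both inertial terms vanish and the update reduces to plain local SGD, which is the intended fallback behavior of this sketch.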
For non-convex functions under the PL condition, the convergence rate reaches O(1/T) with a proper local learning rate. We evaluate the FedMIM algorithm on three datasets (CIFAR-10, CIFAR-100, and TinyImagenet) with i.i.d. and different Dirichlet distributions in our empirical studies. The results show that our proposed FedMIM achieves the best performance among the state-of-the-art baselines. When the heterogeneity increases extremely, the performance of federated algorithms drops rapidly due to the negative impact of enlarging the local interval, while our proposed FedMIM can efficiently maintain stability under the same experimental setups.

Contribution. We summarize the main contributions of this work as follows:

• The FedMIM algorithm delivers a multi-step inertial momentum to guide the gradient updates. We show that FedMIM successfully alleviates the problems caused by heterogeneous datasets, which benefits cross-device implementation in practical applications.

• We present the convergence analysis of FedMIM for general non-convex functions and for non-convex functions under the PL condition. The theoretical analysis highlights the advantage of introducing multi-step inertial momentum and presents the corresponding hyperparameter conditions.



Figure 1: Local steps of FedAvg and FedMIM with 2 clients.

