COMMUNICATION-EFFICIENT FEDERATED LEARNING WITH ACCELERATED CLIENT GRADIENT

Abstract

Federated learning often suffers from slow and unstable convergence due to the heterogeneous characteristics of participating client datasets. This tendency is aggravated when the client participation ratio is low, since the information collected from the clients then exhibits large variations. To tackle this challenge, we propose a novel federated learning framework, which improves the consistency across clients and facilitates the convergence of the server model. This is achieved by making the server broadcast a global model with a gradient acceleration. With this strategy, the proposed algorithm conveys the projected global update information to participants effectively with no extra communication cost and relieves the clients from storing previous models. We also regularize local updates by aligning each client model with the overshot global model to reduce bias and improve the stability of our algorithm. We perform comprehensive empirical studies on real data under various settings and demonstrate remarkable performance gains of the proposed method in terms of accuracy and communication efficiency compared to state-of-the-art methods, especially with low client participation rates. We will release our code to facilitate and disseminate our work.

1. INTRODUCTION

Federated learning (McMahan et al., 2017) is a large-scale machine learning framework that learns a shared model in a central server through collaboration with a large number of remote clients holding separate datasets. This decentralized learning concept allows federated learning to achieve a basic level of data privacy, since the server does not observe training data directly. On the other hand, remote clients such as mobile or IoT devices have limited communication bandwidth, so federated learning algorithms are particularly sensitive to communication costs. The baseline algorithm of federated learning, FedAvg (McMahan et al., 2017), updates a subset of client models via gradient descent on their local data and then uploads the resulting models to the server, which computes the global model parameters via model averaging. As discussed extensively in the convergence analyses of FedAvg (Stich, 2019; Yu et al., 2019; Wang & Joshi, 2021; Stich & Karimireddy, 2019; Basu et al., 2020), performing multiple local updates before server-side aggregation provides both theoretical support and practical benefit for federated learning by greatly reducing communication cost. Despite this initial success, federated learning faces two key challenges: high heterogeneity in the training data distributed over clients and limited client participation rates. Several studies (Zhao et al., 2018; Karimireddy et al., 2020) have shown that multiple local updates in clients with non-i.i.d. (not independent and identically distributed) data lead to client model drift, i.e., diverging updates in the individual clients. Such a phenomenon introduces high variance into the FedAvg aggregation step for global model updates, which hampers convergence to the optimal average loss over all clients (Li et al., 2020; Wang et al., 2019b; Khaled et al., 2019; Li et al., 2019b; Hsieh et al., 2020; Wang et al., 2020).
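The FedAvg baseline described above can be sketched in a few lines. The least-squares client objective, the learning rate, and the uniform averaging over a toy set of clients below are illustrative assumptions for the demonstration, not the paper's experimental setup.

```python
import numpy as np

def fedavg_round(global_w, client_datasets, lr=0.1, local_steps=5):
    """One FedAvg communication round over illustrative least-squares clients."""
    client_models = []
    for X, y in client_datasets:
        w = global_w.copy()          # each client starts from the global model
        for _ in range(local_steps):
            grad = X.T @ (X @ w - y) / len(y)  # local gradient on client data
            w -= lr * grad
        client_models.append(w)
    # server aggregates the uploaded models by simple (uniform) averaging
    return np.mean(client_models, axis=0)

rng = np.random.default_rng(0)
d = 3
# four clients with their own (synthetic, hence heterogeneous) local data
clients = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(4)]
w = np.zeros(d)
for _ in range(10):                  # ten communication rounds
    w = fedavg_round(w, clients)
```

Multiple `local_steps` per round are exactly what cuts communication cost, and also what causes client drift when the per-client data distributions differ.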
The client model drift problem is exacerbated when the client participation rate per communication round is low, due to unstable client device operations and limited communication channels. To address the client heterogeneity issue, we propose a novel optimization algorithm for federated learning, Federated averaging with Accelerated Client Gradient (FedACG), which conveys the momentum of the global gradient to clients and incorporates this momentum into the local updates of the individual clients. Specifically, we introduce an extra-gradient step on the global model via the global momentum, which allows each client to perform its local gradient steps along the anticipated global update direction. This approach turns out to be effective for reducing the gap between global and local losses. In contrast to existing methods that need to send additional bits to communicate the momentum, FedACG transmits the global model integrated with the momentum as a single message and thus saves communication cost. In addition, FedACG adds a regularization term to the objective function of clients to make the local gradients more consistent across clients. Although a growing number of works handle client heterogeneity in federated learning, FedACG has the following major advantages. Unlike existing approaches focusing on server-level optimization (Reddi et al., 2021; Wang et al., 2019a; Hsu et al., 2019) or client-level optimization (Xu et al., 2021; Acar et al., 2021; Karimireddy et al., 2020; Li et al., 2021; 2020; Zhang et al., 2020; Karimireddy et al., 2021; Li et al., 2019a; Liang et al., 2019), FedACG incorporates the momentum based on the global gradient information into client-side updates. This strategy allows the proposed algorithm to achieve the same level of task-specific performance with fewer communication rounds.
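The server-side behavior described above can be sketched as follows. Since this chunk does not state the exact update equations, the momentum coefficient `lam` and the specific momentum recursion below are our illustrative reading of the text: the server broadcasts a momentum-accelerated (lookahead) model in a single message, and folds the aggregated client movement back into the global momentum.

```python
import numpy as np

def fedacg_server_round(w, m, client_round_fn, lam=0.85):
    """One FedACG-style server round (illustrative sketch; lam and the
    recursion are assumptions, not the paper's stated equations)."""
    lookahead = w + lam * m             # extra-gradient step with global momentum
    w_avg = client_round_fn(lookahead)  # clients train starting from the lookahead
    delta = w_avg - lookahead           # aggregated movement of the clients
    m_new = lam * m + delta             # fold this round's update into the momentum
    return w + m_new, m_new

# toy stand-in for the sampled clients: one averaged gradient step
# on f(w) = ||w||^2 / 2, so the "clients" pull the model toward zero
toy_clients = lambda v: v - 0.1 * v

w, m = np.ones(4), np.zeros(4)
for _ in range(20):
    w, m = fedacg_server_round(w, m, toy_clients)
```

Note that only `lookahead` crosses the network: the momentum is baked into the broadcast model, so no extra bits are sent and clients keep no state between rounds.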
Moreover, while most existing methods have additional requirements compared to FedAvg, including full participation (Liang et al., 2019; Zhang et al., 2020; Khanduri et al., 2021), additional communication bandwidth (Xu et al., 2021; Karimireddy et al., 2020; Zhu et al., 2021; Karimireddy et al., 2021; Li et al., 2019a; Das et al., 2020; Gao et al., 2022), or memory budgets in clients to store local states or variables (Acar et al., 2021; Karimireddy et al., 2020; Li et al., 2021; Gao et al., 2022), FedACG is completely free from any additional communication and memory overhead, which ensures compatibility with large-scale, low-participation federated learning problems. The main contributions of this paper are summarized as follows.
• We propose a communication-efficient federated optimization algorithm that deals with client heterogeneity effectively. The proposed approach employs the global momentum to accelerate client gradients and facilitate the optimization of local models.
• We revise the objective function of clients by augmenting it with a regularization term on the local gradient direction, which further aligns the gradients of the server and individual clients.
• We show that the proposed approach requires no additional communication cost or memory overhead, which is desirable for real-world federated learning settings.
• We demonstrate the outstanding performance of our optimization technique in terms of communication efficiency and robustness to client heterogeneity, especially when the participation ratio is low.
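The client-side regularization mentioned in the contributions can be sketched as below. The quadratic proximal form of the penalty, the coefficient `beta`, and the least-squares local loss are assumptions made for illustration; the paper's actual regularizer is specified later.

```python
import numpy as np

def local_update(X, y, w_broadcast, beta=0.01, lr=0.1, steps=5):
    """Client-side sketch: local least-squares loss plus an assumed proximal
    penalty (beta/2) * ||w - w_broadcast||^2 that pulls the local model
    toward the broadcast (momentum-accelerated) global model."""
    w = w_broadcast.copy()
    for _ in range(steps):
        grad_local = X.T @ (X @ w - y) / len(y)
        grad_prox = beta * (w - w_broadcast)  # alignment/regularization term
        w -= lr * (grad_local + grad_prox)
    return w
```

With a larger `beta`, the client stays closer to the broadcast model, which is the mechanism keeping local gradients consistent across heterogeneous clients; the client needs no persistent state, since `w_broadcast` arrives fresh each round.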

2. RELATED WORK

Federated learning was first introduced in McMahan et al. (2017), which formulates the problem and provides the FedAvg algorithm as a solution to its key challenges such as non-i.i.d. client data, massively distributed clients, and partial participation of clients. Many works explore the negative influence of heterogeneity in federated learning empirically (Zhao et al., 2018) and derive convergence rates depending on the level of heterogeneity (Li et al., 2020; Wang et al., 2019b; Khaled et al., 2019; Li et al., 2019b; Hsieh et al., 2020; Wang et al., 2020). There is a long line of research on client-side optimization to prevent the divergence of clients from the global model. FedProx (Li et al., 2020) penalizes the difference between the server and client parameters, while FedDyn (Acar et al., 2021) and FedPD (Zhang et al., 2020) use cumulative gradients of each client to dynamically regularize local updates. FedDC (Gao et al., 2022) introduces auxiliary drift variables for each client to reduce the impact of the local drift on the global objective. Another line of work adopts variance reduction techniques in the client update to eliminate inconsistent updates across clients. SCAFFOLD (Karimireddy et al., 2020) and Mime (Karimireddy et al., 2021) employ control variates for local updates, while FedDANE (Li et al., 2019a) and FedCM (Xu et al., 2021) add a gradient correction term based on the server gradient. FedPA (Al-Shedivat et al., 2021) de-biases client updates by estimating the global posterior on the client side. On the other hand, some approaches adopt a contrastive loss (Li et al., 2021), knowledge distillation (Kim et al., 2022), or a generative model (Zhu et al., 2021) to ensure the similarity of the representations between the global model and local networks. FedSAM (Qu et al.,

