WHEN TO TRUST AGGREGATED GRADIENTS: ADDRESSING NEGATIVE CLIENT SAMPLING IN FEDERATED LEARNING

Abstract

Federated Learning has become a widely used framework that allows learning a global model on decentralized local datasets while protecting local data privacy. However, federated learning faces severe optimization difficulty when training samples are not independently and identically distributed (non-i.i.d.). In this paper, we point out that the client sampling practice plays a decisive role in this optimization difficulty. We find that negative client sampling causes the merged data distribution of the currently sampled clients to be heavily inconsistent with that of all available clients, which in turn makes the aggregated gradient unreliable. To address this issue, we propose a novel learning rate adaptation mechanism that adaptively adjusts the server learning rate for the aggregated gradient in each round, according to the consistency between the merged data distribution of the currently sampled clients and that of all available clients. Specifically, through theoretical deduction we identify a meaningful and robust indicator that is positively related to the optimal server learning rate and effectively reflects the merged data distribution of the sampled clients, and we utilize it for server learning rate adaptation. Extensive experiments on multiple image and text classification tasks validate the effectiveness of our method.

1. INTRODUCTION

As tremendous amounts of data are produced on edge devices (e.g., mobile phones) every day, it becomes important to study how to effectively utilize this data without compromising personal privacy. Federated Learning (Konečnỳ et al., 2016; McMahan et al., 2017) was proposed to allow many clients to jointly train a well-behaved global model without exposing their private data. In each communication round, clients receive the global model from a server and train it locally on their own data for multiple steps. They then upload only the accumulated gradients to the server, which aggregates (averages) the collected gradients and updates the global model. By doing so, the training data never leaves the local devices.

It has been shown that federated learning algorithms perform poorly when training samples are not independently and identically distributed (non-i.i.d.) across clients (McMahan et al., 2017; Li et al., 2021), which is the common case in reality. Previous studies (Zhao et al., 2018; Karimireddy et al., 2020) mainly attribute this problem to the fact that the non-i.i.d. data distribution causes the directions of the local gradients to diverge. Thus, they aim to resolve the issue by making the directions of the local gradients more consistent (Li et al., 2018; Sattler et al., 2021; Acar et al., 2020).

However, we point out that the above studies overlook the negative impact of the client sampling procedure (McMahan et al., 2017; Fraboni et al., 2021b), which we argue is the main cause of the optimization difficulty of federated learning on non-i.i.d. data. Client sampling is widely applied on the server side to cope with its limited communication capacity relative to the large total number of clients: only a small fraction of clients is sampled to participate in each round. We find that it is client sampling that induces the negative effect of the non-i.i.d. data distribution on federated learning. For example, assume each client performs one full-batch gradient descent step and immediately uploads the local full-batch gradient (i.e., FedSGD in McMahan et al. (2017)): (1) if all clients participate in the current round, it is equivalent


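The distribution inconsistency described above can be made concrete with a small simulation. The sketch below is illustrative only, not the paper's implementation: each client is reduced to a label distribution, clients are made non-i.i.d. via a concentrated Dirichlet prior, and the gap between the merged distribution of a sampled subset and that of all clients is measured with total-variation distance (our choice of divergence here; the paper's actual indicator is derived differently).

```python
import numpy as np

# Toy non-i.i.d. federation: each of 100 clients holds a label
# distribution over 10 classes, concentrated on a few classes
# (small Dirichlet alpha -> highly skewed, i.e., non-i.i.d.).
rng = np.random.default_rng(0)
NUM_CLIENTS, NUM_CLASSES = 100, 10
client_dists = rng.dirichlet(alpha=[0.1] * NUM_CLASSES, size=NUM_CLIENTS)

# Merged label distribution over ALL available clients.
global_dist = client_dists.mean(axis=0)

def merged_dist(sampled_ids):
    """Merged label distribution of the currently sampled clients."""
    return client_dists[sampled_ids].mean(axis=0)

def tv_distance(p, q):
    """Total-variation distance between two label distributions."""
    return 0.5 * float(np.abs(p - q).sum())

# The server samples a small fraction of clients per round; the merged
# distribution of the sample can diverge sharply from the global one.
sampled = rng.choice(NUM_CLIENTS, size=5, replace=False)
gap = tv_distance(merged_dist(sampled), global_dist)
print(f"TV distance, sampled vs. all clients: {gap:.3f}")
```

With full participation the gap is exactly zero, which matches the intuition that client sampling, not non-i.i.d. data alone, creates the mismatch between the aggregated gradient and the one computed over all clients.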