DELTA: DIVERSE CLIENT SAMPLING FOR FASTER FEDERATED LEARNING

Abstract

Partial client participation has been widely adopted in Federated Learning (FL) to efficiently reduce the communication burden. However, an improper client sampling scheme will select unrepresentative subsets, which causes large variance in the model update and slows down convergence. Existing sampling methods are either biased or can be further improved to accelerate convergence. In this paper, we propose an unbiased sampling scheme, termed DELTA, to alleviate this problem. In particular, DELTA characterizes the impact of client diversity and local variance, and samples representative clients who carry valuable information for global model updates. Moreover, DELTA is a provably optimal unbiased sampling scheme that minimizes the variance caused by partial client participation and achieves better convergence than other unbiased sampling schemes. We corroborate our results with experiments on both synthetic and real data sets.

1. INTRODUCTION

Federated Learning (FL) has recently emerged as a critical distributed learning paradigm in which a number of clients collaborate with a central server to train a model. Edge clients perform updates locally without sharing any data, thus preserving client privacy. Communication can become the primary bottleneck of FL, since edge devices have limited bandwidth and connection availability (Wang et al., 2021). To reduce the communication burden, only a portion of clients is chosen for training in practice. However, an improper client sampling strategy, such as the uniform client sampling adopted in FedAvg (McMahan et al., 2017), may exacerbate the data heterogeneity issues of FL, as randomly selected unrepresentative subsets increase the variance introduced by client sampling and directly slow down convergence. Existing sampling strategies can usually be categorized into two classes: biased and unbiased. Among unbiased client sampling schemes, which preserve the optimization objective in expectation, only a few strategies have been proposed, e.g., multinomial distribution (MD) sampling and cluster sampling, including clustering based on sample size and clustering based on similarity. However, these sampling methods usually suffer from slow convergence, large variance, and computation overhead (Balakrishnan et al., 2021; Fraboni et al., 2021b). To accelerate the convergence of FL with partial client participation, Importance Sampling (IS), another unbiased sampling strategy, has been proposed in recent literature (Chen et al., 2020; Rizk et al., 2020). IS selects clients with large gradient norms, as shown in Figure 1(a). Another sampling method in Figure 1(a), cluster-based IS, first clusters the clients according to the gradient norm and then uses IS to select clients with large gradient norms within each cluster.
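The gradient-norm-based IS described above can be sketched as follows. This is a minimal illustration, not the exact procedure of any cited work: the function name, the with-replacement (MD-style) sampling, and the use of full local gradient norms are our assumptions. The key property is that reweighting each sampled update by 1/(m·p_i) keeps the aggregate unbiased.

```python
import numpy as np

def importance_sample_clients(grads, num_sampled, rng=None):
    """Sample clients with probability proportional to their local
    gradient norm (gradient-norm-based IS), and return indices plus
    the reweighting factors that make the aggregate unbiased."""
    rng = np.random.default_rng() if rng is None else rng
    norms = np.array([np.linalg.norm(g) for g in grads])
    probs = norms / norms.sum()
    # With-replacement (multinomial) sampling of m client indices.
    idx = rng.choice(len(grads), size=num_sampled, p=probs)
    # Weight 1/(m * p_i) gives E[sum_k w_k g_{i_k}] = sum_i g_i,
    # i.e., the estimator of the full aggregate is unbiased.
    weights = 1.0 / (num_sampled * probs[idx])
    return idx, weights
```

Note that when probabilities are proportional to the gradient norm, the reweighted updates g_i/p_i all have equal norm, which is exactly what minimizes the variance of the norm-only part of the estimator.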
Though IS and cluster-based IS have their advantages, (1) IS suffers from learning inefficiency due to the transmission of excessively important yet similar updates from clients to the server. This problem has been pointed out in recent works (Fraboni et al., 2021a; Shen et al., 2022), and efforts have been made to solve it. One of them is cluster-based IS, which avoids redundant sampling of clients by first clustering similar clients into groups. Though the clustering operation can somewhat alleviate this problem, (2) vanilla cluster-based IS does not work well, because the high-dimensional gradient is too complicated to be a good clustering feature and can yield poor clustering results, as pointed out by Shen et al. (2022). In addition, clustering is known to be susceptible to biased performance if the samples are chosen from a group that is clustered based on a biased criterion, as shown in Sharma (2017); Thompson (1990). In summary, though IS and cluster-based IS each have advantages, they also face their own limitations: IS exploits large gradient norms to accelerate convergence but suffers from redundant sampling due to excessively similar updates, while cluster-based IS alleviates the similar-update problem but converges slowly due to poor clustering quality and biased grouping. Figure 2 illustrates that both sampling methods perform poorly at some stage of training. To address these challenges of IS and cluster-based IS, namely excessively similar updates and poor performance caused by weak clustering and biased grouping, we propose a novel sampling method for Federated Learning, termed DivErse cLienT sAmpling (DELTA). To simplify notation, in this paper we refer to FL with IS as FedIS. Compared with FedIS and cluster-based IS, Figure 1(b) shows that DELTA tends to select clients with diverse gradients with respect to the global gradient.
In this way, DELTA not only utilizes the advantages of a large gradient norm for convergence acceleration but also overcomes the gradient similarity issue.
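As a rough illustration of this idea, one can bias the sampling probabilities toward clients whose local gradients differ most from the global gradient, mixed with a local-variance term. The exact weighting below, the `alpha` mixture, and the use of the mean local gradient as a proxy for the global gradient are our assumptions for illustration only; the paper derives the provably optimal sampling probabilities.

```python
import numpy as np

def delta_style_probs(grads, local_stds, alpha=0.5):
    """Illustrative sampling probabilities in the spirit of DELTA:
    score each client by a mix of gradient diversity (distance of its
    local gradient from the global gradient) and local variance, then
    normalize the scores into a probability distribution."""
    g_bar = np.mean(grads, axis=0)  # proxy for the global gradient
    diversity = np.array([np.linalg.norm(g - g_bar) for g in grads])
    score = alpha * diversity + (1 - alpha) * np.asarray(local_stds)
    return score / score.sum()
```

Under this scoring, two clients with large but nearly identical gradients no longer dominate the sampling, since their distance to the global gradient (and hence their diversity score) is small.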

1.1. CONTRIBUTIONS

In this paper, we propose an efficient unbiased sampling scheme based on gradient diversity and local variance, in the sense that (i) it effectively solves the excessive-gradient-similarity problem without an additional clustering operation, while retaining the convergence acceleration of gradient-norm-based IS, and (ii) it is provably better than uniform sampling or gradient-norm-based sampling.



Figure 1: Difference between IS, cluster-based IS, and our sampling scheme DELTA.

Figure 2: We use a logistic regression model to show the performance of different methods on non-iid MNIST. We sample 10 out of 200 clients and run 500 communication rounds. We report the average of the best 10 accuracies at 100, 300, and 500 rounds, which shows the accuracy performance from the initial training state to convergence.

The sampling scheme is completely generic and easily compatible with other advanced optimization methods, such as FedProx (Li et al., 2018) and momentum (Karimireddy et al., 2020a). As our key contributions:

• We present an unbiased sampling scheme for FL based on gradient diversity and local variance, termed DELTA. It retains the advantage of selecting clients with large gradient norms while avoiding the over-selection of clients with similar gradients at the beginning of training, when the gradient of the global model is relatively large. Compared with the SOTA rate of FedAvg, its convergence rate removes the O(1/T^{2/3}) term as well as a σ_G^2-related term in the numerator of the O(1/T^{1/2}) term.

• We provide a theoretical convergence proof for nonconvex FedIS. Unlike existing work, our analysis is based on a more relaxed assumption and yields results no worse than existing convergence rates; its rate removes the O(1/T^{2/3}) term from that of FedAvg.

2. RELATED WORK

FedAvg was proposed by McMahan et al. (2017) as the de facto algorithm of FL, in which multiple local SGD steps are executed on the available clients to alleviate the communication bottleneck. While communication-efficient, heterogeneity, such as system heterogeneity (Li et al., 2018; Wang et al.,

