DELTA: DIVERSE CLIENT SAMPLING FOR FASTING FEDERATED LEARNING

Abstract

Partial client participation has been widely adopted in Federated Learning (FL) to reduce the communication burden efficiently. However, an improper client sampling scheme selects unrepresentative subsets, which causes a large variance in the model update and slows down convergence. Existing sampling methods are either biased or can be further improved to accelerate convergence. In this paper, we propose an unbiased sampling scheme, termed DELTA, to alleviate this problem. In particular, DELTA characterizes the impact of client diversity and local variance and samples the representative clients that carry valuable information for global model updates. Moreover, DELTA is a provably optimal unbiased sampling scheme that minimizes the variance caused by partial client participation and achieves better convergence than other unbiased sampling schemes. We corroborate our results with experiments on both synthetic and real data sets.

1. INTRODUCTION

Federated Learning (FL) has recently emerged as a critical distributed learning paradigm in which a number of clients collaborate with a central server to train a model. Edge clients perform updates locally without sharing any data, thus preserving client privacy. Communication can become the primary bottleneck of FL since edge devices have limited bandwidth and connection availability (Wang et al., 2021). To reduce the communication burden, only a portion of the clients is chosen for training in practice. However, an improper client sampling strategy, such as the uniform client sampling adopted in FedAvg (McMahan et al., 2017), might exacerbate the issues of data heterogeneity in FL, as randomly selected unrepresentative subsets can increase the variance introduced by client sampling and directly slow down convergence.

Existing sampling strategies can usually be categorized into two classes: biased and unbiased. For unbiased client sampling, which is crucial because it can preserve the optimization objective, only a few strategies have been proposed, e.g., multinomial distribution (MD) sampling and cluster sampling, the latter including clustering based on sample size and clustering based on similarity. However, these sampling methods usually suffer from slow convergence with large variance and from computation overhead (Balakrishnan et al., 2021; Fraboni et al., 2021b). To accelerate the convergence of FL with partial client participation, Importance Sampling (IS), another unbiased sampling strategy, has been proposed in recent literature (Chen et al., 2020; Rizk et al., 2020). IS selects clients with a large gradient norm, as shown in Figure 1(a). The other sampling method in Figure 1(a), cluster-based IS, first clusters the clients according to the gradient norm and then uses IS to select the clients with a large gradient norm within each cluster.
Though IS and cluster-based IS have their advantages, both have limitations. 1) IS suffers from learning inefficiency due to the transmission of excessive important yet similar updates from clients to the server. This problem has been pointed out in recent works (Fraboni et al., 2021a; Shen et al., 2022), and some efforts have been made to solve it. One of them is cluster-based IS, which avoids redundant sampling of clients by first clustering similar clients into groups. 2) Though the clustering operation can somewhat alleviate this problem, vanilla cluster-based IS does not work well because the high-dimensional gradient is too complicated to be a good clustering feature and can bring about poor clustering results, as pointed out by Shen et al. (2022). In addition, clustering is known to be susceptible to biased performance if the samples are chosen from a group that is clustered based on a biased opinion, as shown in Sharma (2017) and Thompson (1990).

In summary, IS exploits large gradient norms to accelerate convergence but incurs redundant sampling due to excessive similar updates, while cluster-based IS alleviates the similar-update problem but converges slowly due to poor clustering and biased performance. Figure 2 illustrates that both sampling methods perform poorly at times. To address the above challenges of IS and cluster-based IS, namely excessive similar updates and poor performance due to poor clustering and biased grouping, we propose a novel sampling method for Federated Learning, termed DivErse cLienT sAmpling (DELTA). To simplify the notation, in this paper we term FL with IS as FedIS. Compared with FedIS and cluster-based IS methods, we show in Figure 1(b) that DELTA tends to select clients whose gradients are diverse w.r.t. the global gradient.
In this way, DELTA not only utilizes the advantages of a large gradient norm for convergence acceleration but also overcomes the gradient similarity issue.

1.1. CONTRIBUTIONS

In this paper, we propose an efficient unbiased sampling scheme based on gradient diversity and local variance, in the sense that (i) it effectively solves the excessive similar gradient problem without an additional clustering operation, while taking advantage of the accelerated convergence of gradient-norm-based IS, and (ii) it is provably better than uniform sampling or gradient-norm-based sampling. The sampling scheme is completely generic and is compatible with other advanced optimization methods, such as FedProx (Li et al., 2018) and momentum (Karimireddy et al., 2020a). Our key contributions are as follows:

• We present an unbiased sampling scheme for FL based on gradient diversity and local variance, a.k.a. DELTA. It takes advantage of clients with a large gradient norm and solves the problem of over-selecting clients with similar gradients at the beginning of training, when the gradient of the global model is relatively large. Compared with the SOTA rate of FedAvg, its convergence rate removes the term $O(1/T^{2/3})$ as well as a $\sigma_G^2$-related term in the numerator of $O(1/T^{1/2})$.

• We provide a theoretical proof of convergence for nonconvex FedIS. Unlike existing work, our analysis is based on a more relaxed assumption and yields results no worse than the existing convergence rates. Its rate removes the term $O(1/T^{2/3})$ from that of FedAvg.

2. RELATED WORK

FedAvg was proposed by McMahan et al. (2017) as a de facto algorithm of FL, in which multiple local SGD steps are executed on the available clients to alleviate the communication bottleneck. While FedAvg is communication efficient, heterogeneity, such as system heterogeneity (Li et al., 2018; Wang et al., 2020; Mitra et al., 2021; Diao et al., 2020) and statistical/objective heterogeneity (Lin et al., 2020; Karimireddy et al., 2020b; Li et al., 2018; Wang et al., 2020; Guo et al., 2021), results in inconsistent optimization objectives and drifted client models, impeding federated optimization considerably.

Objective inconsistency in FL. Objective inconsistency is not rare in FL due to the heterogeneity of clients' data and the difference in computing ability. For instance, Wang et al. (2020) first identify an objective inconsistency caused by heterogeneous local updates. Several works also encounter difficulties from the objective inconsistency caused by partial client participation (Li et al., 2019; Cho et al., 2020; Balakrishnan et al., 2021). Li et al. (2019) and Cho et al. (2020) use the local-global gap $f^* - \frac{1}{m}\sum_{i=1}^m F_i^*$ to measure the distance between the global optimum and the average of all local optima, where the local-global gap results from objective inconsistency at the final optimal point. In fact, objective inconsistency occurs in each training round, not only at the final optimal point. Balakrishnan et al. (2021) also encounter objective inconsistency caused by partial client participation. However, they use $\left\|\frac{1}{n}\sum_{i=1}^n \nabla F_i(x_t) - \nabla f(x_t)\right\| \le \epsilon$ as an assumption to describe the update inconsistency caused by objective inconsistency, without any analysis of it. So far, the objective inconsistency caused by partial client participation has not been analyzed, though it is prevalent in FL, even with homogeneous local updates.
Our work gives a fundamental convergence analysis of the influence of the objective inconsistency caused by partial client participation.

Client selection in FL. In general, sampling methods can be divided into biased and unbiased sampling. Unbiased sampling guarantees that the expected value of the client aggregation equals the global deterministic aggregation with all clients participating. In contrast, biased sampling leads to convergence to a sub-optimal point. The most famous unbiased sampling strategy in FL is multinomial distribution (MD) sampling, which samples according to the client data ratio (Wang et al., 2020; Fraboni et al., 2021a). Besides, IS, an unbiased sampling method, has recently been used in FL to reduce the convergence variance. Chen et al. (2020) use the update norm as the importance for sampling clients, Rizk et al. (2020) sample clients based on data variability, and Mohammed et al. (2021) use test accuracy as an estimate of importance. Meanwhile, many biased sampling strategies have been proposed for accelerating training, such as sampling clients with higher loss (Cho et al., 2020), sampling as many clients as possible under a threshold constraint (Qu et al., 2021), sampling clients with larger updates (Ribero & Vikalo, 2020), and greedy sampling according to gradient diversity (Balakrishnan et al., 2021). However, all these biased sampling methods can exacerbate the negative effects of objective inconsistency and are only guaranteed to converge to a neighborhood of the optimum. Recently, cluster-based client selection has drawn some attention in FL (Fraboni et al., 2021a; Xu et al., 2021; Muhammad et al., 2020; Shen et al., 2022). Though clustering requires an additional operation and causes computation and memory overhead, Fraboni et al. (2021a) and Shen et al. (2022) show that clustering helps sample diverse clients and reduces variance.
The proposed DELTA in Algorithm 1 can be viewed as a muted version of the diverse client clustering algorithm while promising to be unbiased.

3. THEORETICAL ANALYSIS AND AN IMPROVED FL SAMPLING STRATEGY

In FL, the objective of the global model is a sum-structured optimization problem:

$$f^* = \min_{x \in \mathbb{R}^d} f(x) := \sum_{i=1}^m w_i F_i(x),$$

where $F_i(x) = \mathbb{E}_{\xi_i \sim D_i}[F_i(x, \xi_i)]$. With a sampled client subset $S_t$ of size $n$, $\tilde{f}_{S_t}(x_t) = \frac{1}{n}\sum_{i \in S_t} F_i(x_t)$. To ease the theoretical analysis of our work, we use the following widely used assumptions.

3.1. ASSUMPTIONS

Assumption 1 (L-Smooth). The client's local objective function is Lipschitz smooth, i.e., there is a constant $L > 0$ such that $\|\nabla F_i(x) - \nabla F_i(y)\| \le L\|x - y\|$, $\forall x, y \in \mathbb{R}^d$, and $i = 1, 2, \ldots, m$.

Table 1

| Algorithm | Convexity | Partial participation | Unbiased sampling | Convergence rate | Assumption |
|---|---|---|---|---|---|
| | | | | $\frac{\sigma_L^2}{\mu m K \epsilon} + \frac{1}{\mu}$ / $\frac{\sigma_L^2}{m K \epsilon^2} + \frac{1}{\epsilon}$ | $\sigma_L$ bound |
| DELTA | Nonconvex | ✓ | ✓ | $\frac{\sigma_L^2}{nK\epsilon^2} + \frac{M^2}{K\epsilon}$ | Assumption 3 |
| FedIS (ours) | Nonconvex | ✓ | ✓ | $\frac{\sigma_L^2 + K\sigma_G^2}{nK\epsilon^2} + \frac{M^2}{K\epsilon}$ | Assumption 3 |
| FedIS (others) (Chen et al., 2020) | Nonconvex | ✓ | ✓ | $\frac{M^2}{nK\epsilon^2} + \frac{A^2+1}{\epsilon} + \frac{\sigma_G}{\epsilon^{3/2}}$ | Assumption 3 and ρ bound |
| Yang et al. (2021) | Nonconvex | ✓ | ✓ | $\frac{\sigma_L^2}{nK\epsilon^2} + \frac{4K\sigma_G^2}{nK\epsilon^2} + \frac{M^2}{K\epsilon} + \frac{K^{1/3}M^2}{n^{1/3}\epsilon^{2/3}}$ | $\sigma_G$ bound |
| Karimireddy et al. (2020b) | Nonconvex | ✓ | ✓ | $\frac{M^2}{nK\epsilon^2} + \frac{A^2+1}{\epsilon} + \frac{\sigma_G}{\epsilon^{3/2}}$ | Assumption 3 |
| Balakrishnan et al. (2021) | Strongly convex | ✓ | × | $\frac{1}{\epsilon} + \frac{1}{\varphi}$ | Heterogeneity gap |
| Cho et al. (2020) | Strongly convex | ✓ | × | $\frac{\sigma_L^2 + G^2}{\epsilon + \varphi} + \frac{\Gamma}{\mu}$ | Heterogeneity gap |
| Yang et al. (2021) | Nonconvex | × | ✓ | $\frac{\sigma_L^2}{mK\epsilon^2} + \frac{\sigma_L^2/(4K) + \sigma_G^2}{\epsilon}$ | $\sigma_G$ bound |
| Karimireddy et al. (2020b) | Strongly convex | × | ✓ | $\frac{\sigma_L^2 + \sigma_G^2}{\mu m K \epsilon} + \frac{\sigma_L + \sigma_G}{\mu\sqrt{\epsilon}} + \frac{m(A^2+1)}{\mu}$ | Assumption 3 |

$M = \sigma_L^2 + 4K\sigma_G^2$, $M^2 = \sigma_L^2 + K(1 - n/m)\sigma_G^2$, $M^2 = \sigma_L^2 + 6K\sigma_G^2$, $M^2 = \sigma_L^2 + 4K\zeta_G^2$.
ρ assumption: a bound on the similarity among local gradients in Chen et al. (2020). FedIS (others) (Chen et al., 2020) has the same convergence rate as Karimireddy et al. (2020b) under the ρ assumption, whereas FedIS (ours) uses the looser Assumption 3 and achieves a faster rate than Chen et al. (2020).

Algorithm 1 (local update steps)

for each worker $i \in S_t$, in parallel do
    $x_{t,0}^i = x_t$
    for $k = 0, \ldots, K-1$ do
        compute $g_{t,k}^i = \nabla F_i(x_{t,k}^i, \xi_{t,k}^i)$
        Local update: $x_{t,k+1}^i = x_{t,k}^i - \eta_L g_{t,k}^i$
    Let $\Delta_t^i = x_{t,K}^i - x_{t,0}^i = -\eta_L \sum_{k=0}^{K-1} g_{t,k}^i$

Assumption 2 (Unbiasedness and Bounded Local Variance). Each local gradient estimator is unbiased, i.e., $\mathbb{E}\left[\nabla F_i(x_t, \xi_t^i)\right] = \nabla F_i(x_t)$, $\forall i \in [m]$, where the expectation is over the local dataset samples. The function $F_i(x_t, \xi_t^i)$ has bounded local variance with $\sigma_{L,i} > 0$, i.e., $\mathbb{E}\left\|\nabla F_i(x_t, \xi_t^i) - \nabla F_i(x_t)\right\|^2 = \sigma_{L,i}^2 \le \sigma_L^2$.

Assumption 3 (Bounded Dissimilarity). There exist constants $\sigma_G \ge 0$ and $A \ge 0$ s.t. $\mathbb{E}\|\nabla F_i(x)\|^2 \le (A^2 + 1)\|\nabla f(x)\|^2 + \sigma_G^2$. When all local loss functions are identical, $A^2 = 0$ and $\sigma_G^2 = 0$.

The above assumptions are commonly used in both the nonconvex optimization and FL literature; see, e.g., Karimireddy et al. (2020b); Yang et al. (2021); Koloskova et al. (2020); Wang et al. (2020); Cho et al. (2020); Li et al. (2019).

3.2. CONVERGENCE RATE OF FEDIS

As discussed in the introduction, IS suffers from excessive gradient similarity, which may cause redundant sampling and hence training inefficiency, requiring us to design a new diversity sampling method. Before going into the details of our new sampling strategy, we first provide the convergence rate of FL under standard IS in this section; this analysis is itself not well explored, especially for the nonconvex setting.

Theorem 3.1 (Convergence rate of FedIS). Under Assumptions 1-3 and the FedIS sampling strategy $p_i^t = \frac{\|\hat{g}_i^t\|}{\sum_{j=1}^m \|\hat{g}_j^t\|}$, where $\hat{g}_i^t = \sum_{k=0}^{K-1} g_{t,k}^i = \sum_{k=0}^{K-1} \nabla F_i(x_{t,k}^i, \xi_{t,k}^i)$ is the sum of the gradients over the local updates, let the constant local and global learning rates $\eta_L$ and $\eta$ be chosen such that $\eta_L < \min(1/(8LK), C)$, where $C$ is obtained from the condition $\frac{1}{2} - 10L^2K^2(A^2+1)\eta_L^2 - \frac{L^2\eta K(A^2+1)}{2n}\eta_L > 0$, and $\eta \le 1/(\eta_L L)$. Then the expected gradient norm is bounded as follows:

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le O\left(\frac{f^0 - f^*}{\sqrt{nKT}}\right) + \underbrace{O\left(\frac{\sigma_L^2}{\sqrt{nKT}}\right) + O\left(\frac{M^2}{T}\right) + O\left(\frac{K\sigma_G^2}{\sqrt{nKT}}\right)}_{\text{order of } \Phi},$$

where $f^0 = f(x_0)$, $f^* = f(x^*)$, $M^2 = \sigma_L^2 + 4K\sigma_G^2$, and the expectation is over the local dataset samples among workers.

The FedIS sampling probability $p_i^t = \frac{\|\hat{g}_i^t\|}{\sum_{j=1}^m \|\hat{g}_j^t\|}$ is derived by minimizing the variance of convergence w.r.t. $p_i^t$. The variance is

$$\Phi = \frac{5\eta_L^2 K L^2}{2} M^2 + \frac{\eta\eta_L L}{2m}\sigma_L^2 + \frac{L\eta\eta_L}{2nK}\mathrm{Var}\left(\frac{1}{mp_i^t}\hat{g}_i^t\right),$$

where $\mathrm{Var}\left(\frac{1}{mp_i^t}\hat{g}_i^t\right)$ is called the update variance. The proof of Theorem 3.1 and the derivation of the FedIS sampling probability are detailed in Appendix C and Appendix E.1.

Remark 3.2. It is worth mentioning that although a few works provide a convergence upper bound for FedIS, several limitations exist in these analyses and results. 1) Rizk et al. (2020) and Luo et al. (2022) applied IS in FL to solve a convex/strongly convex problem, while we solve a nonconvex problem. 2) In Rizk et al. (2020), the analysis result and sampling probability rely on the assumption of knowing the optimum $x^*$, which is not feasible in practice. 3) Our analysis uses the common Assumptions 1-3, while Chen et al. (2020) provide the convergence rate of nonconvex FL under a stronger assumption of a gradient similarity bound. Compared with Chen et al. (2020), we prove a tighter convergence upper bound for FedIS. Specifically, our convergence rate for FedIS improves from $O\left(\frac{1}{\sqrt{nKT}} + \frac{1}{T} + \frac{1}{T^{2/3}}\right)$ to $O\left(\frac{1}{\sqrt{nKT}} + \frac{1}{T}\right)$ (c.f. Table 1).

Despite the success of FedIS in reducing the variance term in the convergence rate, it is far from optimal, due to the issue of high gradient similarity and the room left for further minimizing the variance term (i.e., the global variance $\sigma_G$ and local variance $\sigma_L$ in $\Phi$). We discuss how to address this challenging variance term in the next section.
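
As an illustration of the FedIS rule in Theorem 3.1, the following minimal Python sketch computes sampling probabilities proportional to the norms of the accumulated local gradients and forms the reweighted aggregate $\frac{1}{n}\sum_{i \in S_t} \frac{1}{m p_i^t}\hat{g}_i^t$. The function names (`fedis_probabilities`, `is_estimate`) and the toy vectors are assumptions made for this example and do not come from the paper; the uniform fallback for all-zero updates is likewise an assumption.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a plain Python list."""
    return math.sqrt(sum(v * v for v in vec))

def fedis_probabilities(updates):
    """p_i proportional to the norm of client i's accumulated local gradient."""
    norms = [l2_norm(u) for u in updates]
    total = sum(norms)
    if total == 0.0:
        # Assumed fallback: uniform probabilities when every update is zero.
        return [1.0 / len(updates)] * len(updates)
    return [nrm / total for nrm in norms]

def is_estimate(updates, probs, sampled):
    """Reweighted aggregate (1/n) * sum_{i in sampled} updates[i] / (m * p_i)."""
    m, n = len(updates), len(sampled)
    agg = [0.0] * len(updates[0])
    for i in sampled:
        weight = 1.0 / (n * m * probs[i])
        for d, value in enumerate(updates[i]):
            agg[d] += weight * value
    return agg

# Toy accumulated gradients for m = 4 clients (hypothetical values).
updates = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [1.0, 0.0]]
probs = fedis_probabilities(updates)

# Expectation of the single-draw estimator, computed exactly over all clients.
expected = [0.0, 0.0]
for i, p in enumerate(probs):
    est = is_estimate(updates, probs, [i])
    expected = [e + p * x for e, x in zip(expected, est)]

full_average = [sum(u[d] for u in updates) / len(updates) for d in range(2)]
print(probs, expected, full_average)
```

In this toy case the exact expectation of the reweighted single-draw estimate coincides with the full-participation average, which is the unbiasedness property that IS relies on.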

3.3. AN IMPROVED CONVERGENCE ANALYSIS

To clarify the theoretical difference between FedIS and DELTA and to better illustrate our design choice, we include an analysis flowchart in Figure 3. Specifically, based on the convergence variance of FedIS, we find it important to reduce the variance beyond $\mathrm{Var}\left(\frac{1}{mp_i^t}\hat{g}_i^t\right)$. Furthermore, we connect this variance with the convergence of the surrogate objective $\tilde{f}(x_t)$. Unlike FedIS, which analyzes the global objective, DELTA focuses on analyzing the surrogate objective and therefore obtains a different convergence variance and different sampling probabilities than FedIS.

The limitations of FedIS. As identified by Theorem 3.1 above, IS suffers from excessive similar gradient selection. The variance $\Phi$ in (4) shows that the standard IS strategy can only control the update variance $\mathrm{Var}\left(\frac{1}{mp_i^t}\hat{g}_i^t\right)$, while leaving the other terms in $\Phi$, i.e., $\sigma_L$ and $\sigma_G$, untouched. Thus, standard IS fails to handle the excessive similar gradient selection problem, which motivates us to give a new sampling strategy below to address the issue of $\sigma_L$ and $\sigma_G$.

The decomposition of the global objective. Inspired by the proof of Theorem 3.1 as well as the corresponding Lemma B.1 (stated in the Appendix) for unbiased sampling, the global objective can be decomposed into the surrogate objective and the update gap,

$$\mathbb{E}\|\nabla f(x_t)\|^2 = \mathbb{E}\left\|\nabla \tilde{f}_{S_t}(x_t)\right\|^2 + \chi_t^2,$$

where $\chi_t = \mathbb{E}\left\|\nabla \tilde{f}_{S_t}(x_t) - \nabla f(x_t)\right\|$ is the update gap. Intuitively, the surrogate objective is the practical objective of the participating clients in each round, while the update gap $\chi_t$ measures the update distance between partial client participation and full client participation.
The convergence behavior of the update gap $\chi_t^2$ corresponds to the update variance in $\Phi$, and the convergence of the surrogate objective $\mathbb{E}\left\|\nabla \tilde{f}_{S_t}(x_t)\right\|^2$ depends on the other variance terms in $\Phi$, i.e., the local variance and the global variance. Minimizing the surrogate objective allows us to further reduce the variance term in the convergence rate, and we focus on the convergence analysis of the surrogate objective below. For the purpose of analysis, we use the IS property to formulate the surrogate objective with an arbitrary unbiased sampling probability.

Surrogate objective formulation. The expression of the surrogate objective relies on the property of IS. In detail, IS aims to substitute the original sampling distribution $p(z)$ with another arbitrary sampling distribution $q(z)$ while keeping the expectation unchanged: $\mathbb{E}_{q(z)}[F_i(z)] = \mathbb{E}_{p(z)}\left[\frac{q_i(z)}{p_i(z)}F_i(z)\right]$. According to the Monte Carlo method, when $q(z)$ follows the uniform distribution, we can estimate $\mathbb{E}_{q(z)}[F_i(z)]$ by $\frac{1}{m}\sum_{i=1}^m F_i(z)$ and $\mathbb{E}_{p(z)}\left[\frac{q_i(z)}{p_i(z)}F_i(z)\right]$ by $\frac{1}{n}\sum_{i \in S_t}\frac{1}{mp_i}F_i(z)$, respectively, where $m$ and $|S_t| = n$ are the sample sizes. Based on the IS property, we formulate the surrogate objective as:

$$\tilde{f}_{S_t}(x_t) = \frac{1}{n}\sum_{i \in S_t}\frac{1}{mp_i^t}F_i(x_t),$$

where $m$ is the total number of clients, $|S_t| = n$ is the number of participating clients in each round, and $p_i^t$ is the probability that client $i$ is selected at round $t$.

An improved rate for the global objective. Following the fact (c.f. Lemma B.2 in the appendix) that

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 = \min_{t \in [T]} \mathbb{E}\|\nabla \tilde{f}(x_t)\|^2 + \mathbb{E}\|\chi_t^2\| \le \min_{t \in [T]} 2\mathbb{E}\|\nabla \tilde{f}(x_t)\|^2,$$

the convergence rate of the global objective can be formulated as follows:

Theorem 3.3 (Convergence rate).
Under Assumptions 1-3, let the local and global learning rates $\eta_L$ and $\eta$ satisfy $\eta_L < 1 \Big/ \left(\sqrt{20}\,KL\,\frac{1}{n}\sum_{l=1}^m \frac{1}{mp_l^t}\right)$ and $\eta\eta_L \le 1/(KL)$. Then the minimal gradient norm is bounded as:

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{f^0 - f^*}{c\eta\eta_L KT} + \frac{\tilde{\Phi}}{c},$$

where $f^0 = f(x_0)$, $f^* = f(x^*)$, $c$ is a constant, and the expectation is over the local dataset samples among all workers. The variance term $\tilde{\Phi}$ combines the local variance and the client gradient diversity. We derive the convergence rates for both sampling with replacement and sampling without replacement. For sampling without replacement:

$$\tilde{\Phi} = \frac{5L^2K\eta_L^2}{2mn}\sum_{i=1}^m \frac{1}{p_i^t}\left(\sigma_{L,i}^2 + 4K\zeta_{G,i}^2\right) + \frac{L\eta_L\eta}{2n}\sum_{i=1}^m \frac{1}{m^2p_i^t}\sigma_{L,i}^2.$$

For sampling with replacement:

$$\tilde{\Phi} = \frac{5L^2K\eta_L^2}{2m^2}\sum_{i=1}^m \frac{1}{p_i^t}\left(\sigma_{L,i}^2 + 4K\zeta_{G,i}^2\right) + \frac{L\eta_L\eta}{2n}\sum_{i=1}^m \frac{1}{m^2p_i^t}\sigma_{L,i}^2,$$

where $\zeta_{G,i} = \|\nabla F_i(x_t) - \nabla f(x_t)\|$, and $\zeta_G$ is an upper bound for all $i$, i.e., $\zeta_{G,i} \le \zeta_G$. The proof of Theorem 3.3 can be found in Appendix D.
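
To make the structure of the variance term in Theorem 3.3 concrete, the Python sketch below evaluates the with-replacement expression for a given probability vector. The function name `phi_with_replacement`, its argument names, and the toy constants are assumptions chosen for illustration; the inputs `sigma_sq` and `zeta_sq` stand for the per-client quantities $\sigma_{L,i}^2$ and $\zeta_{G,i}^2$, which in practice would have to be estimated.

```python
def phi_with_replacement(probs, sigma_sq, zeta_sq, L, K, eta_L, eta, n):
    """Evaluate the with-replacement variance term of Theorem 3.3.

    probs    : sampling probabilities p_i (all positive, summing to 1)
    sigma_sq : per-client local variances sigma_{L,i}^2
    zeta_sq  : per-client squared gradient diversities zeta_{G,i}^2
    """
    m = len(probs)
    # First term: local drift, weighted by 1 / p_i.
    drift = sum((s + 4 * K * z) / p for s, z, p in zip(sigma_sq, zeta_sq, probs))
    first = 5 * L**2 * K * eta_L**2 / (2 * m**2) * drift
    # Second term: local-variance contribution, weighted by 1 / (m^2 p_i).
    noise = sum(s / (m**2 * p) for s, p in zip(sigma_sq, probs))
    second = L * eta_L * eta / (2 * n) * noise
    return first + second

# Two identical clients (hypothetical values): uniform sampling vs. a skewed choice.
sigma_sq = [1.0, 1.0]
zeta_sq = [0.0, 0.0]
uniform_value = phi_with_replacement([0.5, 0.5], sigma_sq, zeta_sq,
                                     L=1.0, K=1, eta_L=1.0, eta=1.0, n=1)
skewed_value = phi_with_replacement([0.9, 0.1], sigma_sq, zeta_sq,
                                    L=1.0, K=1, eta_L=1.0, eta=1.0, n=1)
print(uniform_value, skewed_value)
```

With identical clients, the skewed probabilities give a larger value than uniform ones, reflecting the $1/p_i^t$ weighting: under-sampling any client inflates its contribution to the variance term.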

3.4. OUR PROPOSED SAMPLING STRATEGY: DELTA

The update difference between the surrogate objective and the global objective can be defined as objective inconsistency. As demonstrated in Figure 4, different sampling methods lead to different degrees of objective inconsistency, and such inconsistency can be alleviated by choosing clients with a small update gap. Figure 4(a) uses a toy example of square functions to illustrate the objective inconsistency when two out of three clients are selected for training.

Deriving our sampling strategy DELTA is equivalent to solving an optimization problem that minimizes the variance $\tilde{\Phi}$ w.r.t. the sampling probability $p_i^t$:

$$\min_{p_i^t} \tilde{\Phi} \quad \text{s.t.} \quad \sum_{i=1}^m p_i^t = 1,$$

where $\tilde{\Phi}$ is a linear combination of the local variance $\sigma_{L,i}$ and the gradient diversity $\zeta_{G,i}$ (cf. Theorem 3.3).

Corollary 3.4 (Optimal sampling probability for DELTA). By solving the above optimization problem, the optimal sampling probability can be formulated as:

$$p_i^t = \frac{\sqrt{\alpha_1\|\nabla F_i(x) - \nabla f(x)\|^2 + \alpha_2\sigma_{L,i}^2}}{\sum_{j=1}^m \sqrt{\alpha_1\|\nabla F_j(x) - \nabla f(x)\|^2 + \alpha_2\sigma_{L,j}^2}},$$

where $\alpha_1$ and $\alpha_2$ are constants defined as $\alpha_1 = 20K^2L\eta_L$ and $\alpha_2 = 5KL\eta_L + \frac{\eta}{n}$.

Let $\eta_L = O\left(\frac{1}{\sqrt{T}KL}\right)$, $\eta = O\left(\sqrt{Kn}\right)$, and substitute the optimal sampling probability (11) back into $\tilde{\Phi}$. Then for sufficiently large $T$, the iterates of Theorem 3.3 satisfy:

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le O\left(\frac{f^0 - f^*}{\sqrt{nKT}}\right) + \underbrace{O\left(\frac{\sigma_L^2}{\sqrt{nKT}}\right) + O\left(\frac{\sigma_L^2 + 4K\zeta_G^2}{KT}\right)}_{\text{order of } \tilde{\Phi}}.$$

3.5. DISCUSSIONS

Difference between DELTA and FedIS. The difference between DELTA and FedIS comes mainly from the difference between $\Phi$ and $\tilde{\Phi}$. FedIS aims to reduce the update variance term $\mathrm{Var}\left(\frac{1}{mp_i^t}\hat{g}_i^t\right)$ in $\Phi$, while DELTA aims to reduce the whole $\tilde{\Phi}$, which is composed of the gradient diversity and the local variance. Minimizing $\tilde{\Phi}$ corresponds to further minimizing the terms of $\Phi$ that cannot be minimized by FedIS. Solving different optimization problems leads to different sampling probability expressions.
As shown in Figure 4, DELTA selects the more diverse Client 1 and Client 3 for participation, while FedIS tends to select Client 2 and Client 3, which have large gradient norms. It can be seen that the selection of DELTA leads to a smaller bias than that of FedIS. Moreover, as shown in Table 1, based on our convergence rate results, DELTA achieves a convergence rate that is better, by a term of order $O(G^2/\epsilon^2)$, than those of other unbiased sampling algorithms.

Comparing DELTA with uniform sampling. According to the Cauchy-Schwarz inequality, DELTA is at least as good as uniform sampling in terms of variance:

$$\frac{\Phi_{\text{uniform}}}{\Phi_{\text{DELTA}}} = \frac{m\sum_{i=1}^m \left(\sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_L^2}\right)^2}{\left(\sum_{i=1}^m \sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_L^2}\right)^2} \ge 1.$$

This implies that DELTA does reduce the variance, especially when $\frac{\left(\sum_{i=1}^m \sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_L^2}\right)^2}{\sum_{i=1}^m \left(\sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_L^2}\right)^2} \ll m$.

Remark 3.5. DELTA ensures the convergence of FL with partial client participation to a stationary point without any gap. Our results can be considered a theoretical explanation for the heuristic of gradient-diversity sampling in FL, and DELTA encourages the global model to acquire more knowledge in each round. Specifically, the server gives more weight to the clients with larger gradient diversity and local variance. These clients are representative, and sampling them can accelerate training because their more diverse and informative data better reflect the global data distribution. However, in user attack scenarios DELTA may fail to identify the attacked clients and may even tend to select them. We leave the solution for this scenario to future work.
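
The comparison with uniform sampling can be checked numerically. The Python sketch below computes probabilities proportional to $\sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_{L,i}^2}$ and compares the weighted sum $\sum_i (\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_{L,i}^2)/p_i$ under these probabilities and under uniform ones. The function names, the values of `alpha1` and `alpha2`, and the per-client numbers are hypothetical and serve only to illustrate the Cauchy-Schwarz argument.

```python
import math

def delta_probabilities(zeta_sq, sigma_sq, alpha1, alpha2):
    """p_i proportional to sqrt(alpha1 * zeta_i^2 + alpha2 * sigma_i^2)."""
    scores = [math.sqrt(alpha1 * z + alpha2 * s) for z, s in zip(zeta_sq, sigma_sq)]
    total = sum(scores)
    if total == 0.0:
        # Assumed fallback: uniform probabilities when all scores vanish.
        return [1.0 / len(scores)] * len(scores)
    return [sc / total for sc in scores]

def variance_proxy(zeta_sq, sigma_sq, alpha1, alpha2, probs):
    """Sum of (alpha1 * zeta_i^2 + alpha2 * sigma_i^2) / p_i over clients."""
    total = 0.0
    for z, s, p in zip(zeta_sq, sigma_sq, probs):
        a = alpha1 * z + alpha2 * s
        if a > 0.0:  # clients with zero weight contribute nothing
            total += a / p
    return total

# Hypothetical per-client statistics for m = 3 clients.
zeta_sq = [4.0, 0.25, 1.0]    # squared gradient diversity
sigma_sq = [1.0, 1.0, 1.0]    # local variance
alpha1, alpha2 = 1.0, 1.0     # placeholder constants

delta_p = delta_probabilities(zeta_sq, sigma_sq, alpha1, alpha2)
uniform_p = [1.0 / 3.0] * 3
v_delta = variance_proxy(zeta_sq, sigma_sq, alpha1, alpha2, delta_p)
v_uniform = variance_proxy(zeta_sq, sigma_sq, alpha1, alpha2, uniform_p)
print(delta_p, v_delta, v_uniform, v_uniform / v_delta)
```

Under the square-root rule the proxy equals the squared sum of the per-client scores, while under uniform sampling it equals $m$ times the sum of their squares, so the printed ratio is at least 1, with equality only when all clients have the same score.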

4. PRACTICAL IMPLEMENTATION FOR DELTA AND FEDIS

Gradient-norm-based sampling methods require the computation of the full gradient in each iteration (Elvira & Martino, 2021; Zhao & Zhang, 2015). However, obtaining each client's gradient in advance is generally infeasible in FL. For practical purposes, a series of IS algorithms estimate the current round's gradient by the historical gradient (Cho et al., 2020; Katharopoulos & Fleuret, 2017). Similarly, we use the gradient from the previous training iteration to estimate the gradient of the current round and thereby reduce the computing resources required (Rizk et al., 2020), where the previous iteration refers to the last one in which the client participated. In particular, at iteration 0, all probabilities are set to $1/m$; then, during the $t$-th iteration, after the participating clients $i \in S_t$ send the server their updated gradients, the sampling probabilities are updated as:

$$p^*_{i,t+1} = \frac{\|\hat{g}_{i,t}\|}{\sum_{j \in S_t}\|\hat{g}_{j,t}\|}\left(1 - \sum_{j \in S_t^c} p^*_{j,t}\right),$$

where the multiplicative factor ensures that all the probabilities sum to 1. Specifically, for DELTA we use the average of the latest participating clients' gradients to approximate the true gradient of the global model. In this way, it is not necessary to obtain all clients' gradients in each round. The convergence analysis of our practical algorithm is provided in Appendix F.
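
The probability update above can be sketched in a few lines of Python. The function name `update_probabilities`, the dictionary-based interface, and the choice to leave probabilities unchanged when all reported norms are zero are assumptions made for this illustration, not details given in the paper.

```python
def update_probabilities(probs, sampled_norms):
    """Update sampling probabilities after one round.

    probs         : list of current probabilities p*_{i,t} for all m clients
    sampled_norms : dict mapping each participating client index to the norm
                    of the gradient it reported this round
    Participating clients share the probability mass not held by absent
    clients, in proportion to their gradient norms; absent clients keep
    their previous probabilities.
    """
    absent_mass = sum(p for i, p in enumerate(probs) if i not in sampled_norms)
    free_mass = 1.0 - absent_mass
    norm_total = sum(sampled_norms.values())
    new_probs = list(probs)
    if norm_total > 0.0:  # assumed: keep old values if all norms are zero
        for i, norm in sampled_norms.items():
            new_probs[i] = norm / norm_total * free_mass
    return new_probs

# Iteration 0: m = 4 clients start from the uniform distribution.
probs = [0.25, 0.25, 0.25, 0.25]
# Clients 0 and 2 participated and reported gradient norms 3.0 and 1.0.
probs = update_probabilities(probs, {0: 3.0, 2: 1.0})
print(probs)
```

Because only the mass previously assigned to the participating clients is redistributed, the probabilities still sum to 1 after every update without requiring gradients from absent clients.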

5. EXPERIMENTS

In this section, we use both a synthetic dataset and split FEMNIST to demonstrate our theoretical results. To show the validity of the practical algorithm, we run experiments on FEMNIST and CIFAR-10 and show that DELTA converges faster and achieves higher accuracy than other baselines.

Synthetic datasets. We first examine our theoretical results through logistic regression on synthetic datasets. In detail, we randomly generate $(x, y)$ by $y = \log\left((Ax - b)^2/2\right)$ with given $A_i$ and $b_i$ as training data for the clients, and each client's local dataset contains 1000 samples. In each round, 10 out of 20 clients are selected to participate in training (we also provide the results for 10 out of 200 clients in Appendix G). To simulate gradient noise, in each training step we calculate the gradient of client $i$ by $g_i = \nabla f_i(A_i, b_i, D_i) + \nu_i$, where $A_i$ and $b_i$ are model parameters, $D_i$ is the local dataset of client $i$, and $\nu_i$ is a zero-mean random variable that controls the heterogeneity of client $i$. The larger $\mathbb{E}\|\nu_i\|^2$ is, the larger the heterogeneity of client $i$.

Split FEMNIST. Here we consider the split FEMNIST. We let 10% of the clients own 90% of the data; the detailed split algorithm is provided in Appendix G. Figure 6 shows that when the data distribution is highly heterogeneous, our DELTA algorithm converges faster than other baselines.

FEMNIST and CIFAR-10. We also verify our practical algorithm on FEMNIST and CIFAR-10. We summarize our numerical results in Table 2: compared with other baselines, DELTA achieves higher accuracy and improves the convergence rate in terms of both the number of iterations and the wall-clock time. We also test different numbers of participating clients $n$ and different levels of heterogeneity $\alpha$, and observe a consistent improvement for DELTA. The detailed settings and additional experiments are in Appendix G.

6. CONCLUSION AND FUTURE WORK

In this work, we studied the optimal client sampling strategy that addresses data heterogeneity to accelerate the convergence of FL. We obtained a new tractable convergence rate for nonconvex FL algorithms with arbitrary client sampling probabilities. Based on the bound, we solved an optimization problem with respect to the sampling probability and thus developed a novel unbiased sampling scheme that characterizes the impact of client diversity and local variance on the sampling design. Experimental results validated the superiority of our theoretical and practical algorithms over several baselines. As we pointed out, when user attacks occur, DELTA requires some changes to be able to identify and avoid selecting the attacked users.

Xinwei Zhang, Mingyi Hong, Sairaj Dhople, Wotao Yin, and Yang Liu. FedPD: A federated learning framework with optimal rates and adaptivity to non-IID data. arXiv preprint arXiv:2005.11418, 2020.

Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pp. 1-9. PMLR, 2015.

A TOY CASE

In Figure 7, we give a detailed toy case to show that DELTA is more effective than FedIS.

Experiments for illustrating our observation. To illustrate the observation made in the introduction, we apply a logistic regression model to the non-iid MNIST dataset. In each round, 10 out of 200 clients are selected to participate in training. We set 2 cluster centers for cluster-based IS. For all methods, we set the mini-batch size to 32, the learning rate to 0.01, and the number of local updates to 5. We run 500 communication rounds for each algorithm. We report both the average and the minimum of the gradient norms of the clients selected in each round.

We report the gradient norm performance of cluster-based IS and IS to show that cluster-based IS selects clients with small gradients. As mentioned in the introduction, cluster-based IS always selects some clients from the cluster with small gradients, which slows convergence in some cases. Figure 8(a) compares the average gradient norm of IS and cluster-based IS, and Figure 8(b) compares the minimal gradient norm.

We also compare the accuracy and loss of vanilla cluster-based IS with those of cluster-based IS with the small-gradient cluster removed. Specifically, we consider the setting with two cluster centers. After 250 rounds, we replace the clients in the cluster containing the smaller gradients with clients from the cluster containing the larger gradients, while keeping the total number of participating clients the same. The result is shown in Figure 9. We observe that vanilla cluster-based IS performs worse than cluster-based IS without small gradients, which indicates that small gradients are one reason for the poor performance.
Figure 9 : An illustration that cluster-based IS sampling from the cluster with small gradients will slow convergence. When the small gradient-norm cluster's clients are replaced by the clients from the large gradient-norm cluster, we see the performance improvement of cluster-based IS.

B TECHNIQUES

Here we show some technical lemmas that are helpful in the theoretical proofs. We substitute $\frac{1}{m}$ for $\frac{n_i}{N}$ to simplify writing in all following proofs, where $\frac{n_i}{N}$ is the data ratio of client $i$. All our proofs can be easily extended from $f(x_t) = \frac{1}{m}\sum_{i=1}^m F_i(x_t)$ to $f(x_t) = \sum_{i=1}^m \frac{n_i}{N}F_i(x_t)$.

Lemma B.1 (Unbiased Sampling). Importance sampling is unbiased sampling:

$$\mathbb{E}\left(\frac{1}{n}\sum_{i \in S_t}\frac{1}{mp_i}\nabla F_i(x_t)\right) = \frac{1}{m}\sum_{i=1}^m \nabla F_i(x_t),$$

no matter whether the sampling is with replacement or without replacement.

Proof. With replacement:

$$\begin{aligned}
\mathbb{E}\left(\frac{1}{n}\sum_{i \in S_t}\frac{1}{mp_i^t}\nabla F_i(x_t)\right) &= \frac{1}{n}\sum_{i \in S_t}\mathbb{E}\left(\frac{1}{mp_i^t}\nabla F_i(x_t)\right) = \frac{1}{n}\sum_{i \in S_t}\mathbb{E}\left(\mathbb{E}\left(\frac{1}{mp_i^t}\nabla F_i(x_t) \,\Big|\, S\right)\right) \\
&= \frac{1}{n}\sum_{i \in S_t}\mathbb{E}\left(\sum_{l=1}^m p_l^t\frac{1}{mp_l^t}\nabla F_l(x_t)\right) = \frac{1}{n}\sum_{i \in S_t}\nabla f(x_t) = \nabla f(x_t).
\end{aligned}$$

Without replacement:

$$\begin{aligned}
\mathbb{E}\left(\frac{1}{n}\sum_{i \in S_t}\frac{1}{mp_i}\nabla F_i(x_t)\right) &= \frac{1}{n}\sum_{l=1}^m \mathbb{E}\left(I_m\frac{1}{mp_l^t}\nabla F_l(x_t)\right) = \frac{1}{n}\sum_{l=1}^m \mathbb{E}(I_m) \times \mathbb{E}\left(\frac{1}{mp_l^t}\nabla F_l(x_t)\right) \\
&= \frac{1}{n}\mathbb{E}\left(\sum_{l=1}^m I_m\right) \times \mathbb{E}\left(\frac{1}{mp_l^t}\nabla F_l(x_t)\right) = \frac{1}{n}\, n \times \sum_{l=1}^m p_l^t\frac{1}{mp_l^t}\nabla F_l(x_t) \\
&= \frac{1}{n}\sum_{l=1}^m np_l^t \times \frac{1}{mp_l^t}\nabla F_l(x_t) = \frac{1}{m}\sum_{l=1}^m \nabla F_l(x_t) = \nabla f(x_t),
\end{aligned}$$

where $I_m \triangleq 1$ if $x_l \in S_t$ and $0$ otherwise.

In the expectation, there are three sources of stochasticity: client sampling, local data SGD, and the filtration of $x_t$. Therefore, the expectation is over all of these sources of randomness. Here, $S$ represents the sources of stochasticity other than client sampling. Rigorously, $S$ represents the filtration of the stochastic process $\{x_j, j = 1, 2, 3, \ldots\}$ at time $t$ and the stochasticity of local SGD.

Lemma B.2 (Update gap bound).

$$\chi^2 = \mathbb{E}\left\|\frac{1}{n}\sum_{i \in S_t}\frac{1}{mp_i^t}\nabla F_i(x_t) - \nabla f(x_t)\right\|^2 = \mathbb{E}\|\nabla \tilde{f}(x_t)\|^2 - \|\nabla f(x_t)\|^2 \le \mathbb{E}\|\nabla \tilde{f}(x_t)\|^2,$$

where the first equation follows from $\mathbb{E}[x - \mathbb{E}(x)]^2 = \mathbb{E}[x^2] - [\mathbb{E}(x)]^2$ and Lemma B.1. To increase readability, we give a detailed derivation of Lemma B.2.
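$$\mathbb{E}\big(\|\nabla\tilde f(x_t) - \nabla f(x_t)\|^2 \mid S\big) = \mathbb{E}\big(\|\nabla\tilde f(x_t)\|^2 \mid S\big) - 2\,\mathbb{E}\big(\langle\nabla\tilde f(x_t), \nabla f(x_t)\rangle \mid S\big) + \mathbb{E}\big(\|\nabla f(x_t)\|^2 \mid S\big),$$
where $\mathbb{E}(x \mid S)$ denotes the expectation of $x$ over the sampling space. We use $\mathbb{E}(\nabla\tilde f(x_t) \mid S) = \nabla f(x_t)$ and $\mathbb{E}(\|\nabla f(x_t)\|^2 \mid S) = \|\nabla f(x_t)\|^2$ ($\|\nabla f(x_t)\|$ is a constant with respect to the stochasticity $S$, and the expectation of a constant is the constant itself), so the cross term equals $-2\|\nabla f(x_t)\|^2$. Therefore, we conclude
$$\mathbb{E}\big(\|\nabla\tilde f(x_t) - \nabla f(x_t)\|^2 \mid S\big) = \mathbb{E}\big(\|\nabla\tilde f(x_t)\|^2 \mid S\big) - \|\nabla f(x_t)\|^2 \le \mathbb{E}\big(\|\nabla\tilde f(x_t)\|^2 \mid S\big).$$
We can further take the expectation on both sides of the inequality as needed without changing the relationship.

The following lemma follows Lemma 4 of Reddi et al. (2020), but under the looser condition of Assumption 3 instead of a $\sigma_G^2$ bound. With some effort, we can derive the following lemma.

Lemma B.3 (Local updates bound). For any step size satisfying $\eta_L \le \frac{1}{8LK}$, we have
$$\mathbb{E}\|x^i_{t,k} - x_t\|^2 \le 5K(\eta_L^2\sigma_L^2 + 4K\eta_L^2\sigma_G^2) + 20K^2(A^2+1)\eta_L^2\|\nabla f(x_t)\|^2.$$

Proof.
$$\begin{aligned}
\mathbb{E}_t\|x^i_{t,k} - x_t\|^2
&= \mathbb{E}_t\|x^i_{t,k-1} - x_t - \eta_L g^i_{t,k-1}\|^2 \\
&= \mathbb{E}_t\|x^i_{t,k-1} - x_t - \eta_L\big(g^i_{t,k-1} - \nabla F_i(x^i_{t,k-1}) + \nabla F_i(x^i_{t,k-1}) - \nabla F_i(x_t) + \nabla F_i(x_t)\big)\|^2 \\
&\le \Big(1 + \frac{1}{2K-1}\Big)\mathbb{E}_t\|x^i_{t,k-1} - x_t\|^2 + \mathbb{E}_t\|\eta_L\big(g^i_{t,k-1} - \nabla F_i(x^i_{t,k-1})\big)\|^2 \\
&\quad + 4K\,\mathbb{E}_t\big[\|\eta_L\big(\nabla F_i(x^i_{t,k-1}) - \nabla F_i(x_t)\big)\|^2\big] + 4K\eta_L^2\,\mathbb{E}_t\|\nabla F_i(x_t)\|^2 \\
&\le \Big(1 + \frac{1}{2K-1}\Big)\mathbb{E}_t\|x^i_{t,k-1} - x_t\|^2 + \eta_L^2\sigma_L^2 + 4K\eta_L^2L^2\,\mathbb{E}_t\|x^i_{t,k-1} - x_t\|^2 \\
&\quad + 4K\eta_L^2\sigma_{G,i}^2 + 4K\eta_L^2(A^2+1)\|\nabla f(x_t)\|^2 \\
&\le \Big(1 + \frac{1}{K-1}\Big)\mathbb{E}\|x^i_{t,k-1} - x_t\|^2 + \eta_L^2\sigma_L^2 + 4K\eta_L^2\sigma_G^2 + 4K(A^2+1)\|\eta_L\nabla f(x_t)\|^2.
\end{aligned}$$
Unrolling the recursion, we get:
\begin{align*}
\mathbb{E}_t\|x^i_{t,k} - x_t\|^2
&\le \sum_{p=0}^{k-1}\Big(1 + \frac{1}{K-1}\Big)^{p}\Big[\eta_L^2\sigma_L^2 + 4K\eta_L^2\sigma_G^2 + 4K(A^2+1)\|\eta_L\nabla f(x_t)\|^2\Big] \tag{21} \\
&\le (K-1)\Big[\Big(1 + \frac{1}{K-1}\Big)^{K} - 1\Big]\Big[\eta_L^2\sigma_L^2 + 4K\eta_L^2\sigma_G^2 + 4K(A^2+1)\|\eta_L\nabla f(x_t)\|^2\Big] \tag{22} \\
&\le 5K(\eta_L^2\sigma_L^2 + 4K\eta_L^2\sigma_G^2) + 20K^2(A^2+1)\eta_L^2\|\nabla f(x_t)\|^2.
\end{align*}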
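The last step of the unrolled recursion in Lemma B.3 relies on the geometric-sum factor $(K-1)\big[(1 + \frac{1}{K-1})^{K} - 1\big]$ being at most $5K$. The following sketch checks this numerically for a range of $K \ge 2$; the function names are illustrative and not part of the paper, and the check is a sanity test rather than a proof.

```python
def geometric_sum(K: int, k: int) -> float:
    """Sum_{p=0}^{k-1} (1 + 1/(K-1))^p, the factor produced by unrolling k local steps."""
    r = 1.0 + 1.0 / (K - 1)
    return sum(r ** p for p in range(k))


def closed_form_bound(K: int) -> float:
    """(K-1) * ((1 + 1/(K-1))^K - 1), the closed form that upper-bounds the sum for k <= K."""
    r = 1.0 + 1.0 / (K - 1)
    return (K - 1) * (r ** K - 1.0)


def check_constant(max_K: int = 500) -> bool:
    """Return True if geometric_sum(K, K) <= closed_form_bound(K) <= 5K for 2 <= K <= max_K."""
    for K in range(2, max_K + 1):
        bound = closed_form_bound(K)
        if bound > 5 * K:
            return False
        # The sum grows with k, so k = K is the worst case; allow floating-point slack.
        if geometric_sum(K, K) > bound * (1 + 1e-9):
            return False
    return True


if __name__ == "__main__":
    print(check_constant())
```

Since $(1 + \frac{1}{K-1})^{K} \le e\,\frac{K}{K-1}$, the factor is at most $(e-1)K + 1$, which leaves room below $5K$.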
C CONVERGENCE OF FEDIS, PROOF OF THEOREM 3.1

We first restate the convergence theorem (Theorem 3.1) more formally, then prove the result for the nonconvex case.

Theorem C.1. Under Assumptions 1-3 and the sampling strategy FedIS, the expected gradient norm converges to a stationary point of the global objective. More specifically, if the number of communication rounds $T$ is predetermined and the learning rates $\eta$ and $\eta_L$ are constant, then the expected gradient norm is bounded as follows:
$$\min_{t\in[T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{F}{c\eta\eta_L KT} + \Phi,$$
where $F = f(x_0) - f(x_*)$, $M^2 = \sigma_L^2 + 4K\sigma_G^2$, and the expectation is over the local dataset samples among workers. Let $\eta_L < \min\left(1/(8LK), C\right)$, where $C$ is obtained from the condition $\frac{1}{2} - 10L^2K^2(A^2+1)\eta_L^2 - \frac{L\eta K(A^2+1)}{2n}\eta_L > 0$, and let $\eta \le 1/(\eta_L L)$. It then holds that
$$\Phi = \frac{1}{c}\Big[\frac{5\eta_L^2L^2K}{2m}\sum_{i=1}^{m}(\sigma_L^2 + 4K\sigma_G^2) + \frac{\eta\eta_L L}{2m}\sigma_L^2 + \frac{L\eta\eta_L}{2nK}V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big)\Big],$$
where $c$ is a constant satisfying $\frac{1}{2} - 10L^2K^2(A^2+1)\eta_L^2 - \frac{L\eta K(A^2+1)}{2n}\eta_L > c > 0$, and $V\big(\frac{1}{mp_i^t}\hat g_i^t\big) = \mathbb{E}\big\|\frac{1}{mp_i^t}\hat g_i^t - \frac{1}{m}\sum_{j=1}^{m}\hat g_j^t\big\|^2$.

Corollary C.2. Suppose $\eta_L$ and $\eta$ satisfy the conditions above, with $\eta_L = O\big(\frac{1}{\sqrt{T}KL}\big)$ and $\eta = O\big(\sqrt{Kn}\big)$, and let the sampling probability be FedIS (82). Then for sufficiently large $T$, the iterates of Theorem 3.1 satisfy:
$$\min_{t\in[T]} \mathbb{E}\|\nabla f(x_t)\|^2 = O\Big(\frac{\sigma_L^2}{\sqrt{nKT}} + \frac{K\sigma_G^2}{\sqrt{nKT}} + \frac{\sigma_L^2 + 4K\sigma_G^2}{KT}\Big).$$

Proof.
$$\begin{aligned}
\mathbb{E}_t[f(x_{t+1})]
&\overset{(a1)}{\le} f(x_t) + \langle\nabla f(x_t), \mathbb{E}_t[x_{t+1} - x_t]\rangle + \frac{L}{2}\mathbb{E}_t[\|x_{t+1} - x_t\|^2] \\
&= f(x_t) + \langle\nabla f(x_t), \mathbb{E}_t[\eta\Delta_t + \eta\eta_L K\nabla f(x_t) - \eta\eta_L K\nabla f(x_t)]\rangle + \frac{L}{2}\eta^2\mathbb{E}_t[\|\Delta_t\|^2] \\
&= f(x_t) - \eta\eta_L K\|\nabla f(x_t)\|^2 + \eta\underbrace{\langle\nabla f(x_t), \mathbb{E}_t[\Delta_t + \eta_L K\nabla f(x_t)]\rangle}_{A_1} + \frac{L}{2}\eta^2\underbrace{\mathbb{E}_t\|\Delta_t\|^2}_{A_2},
\end{aligned}$$
where (a1) follows from the Lipschitz continuity of the gradient. The expectation is conditioned on everything before the current step $k$ of round $t$. Specifically, it is over the client sampling, the local data sampling, and the current round's model $x_t$.
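First we consider $A_1$:
$$\begin{aligned}
A_1 &= \langle\nabla f(x_t), \mathbb{E}_t[\Delta_t + \eta_L K\nabla f(x_t)]\rangle \\
&= \Big\langle\nabla f(x_t), \mathbb{E}_t\Big[-\frac{1}{|S_t|}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\eta_L g^i_{t,k} + \eta_L K\nabla f(x_t)\Big]\Big\rangle \\
&\overset{(a2)}{=} \Big\langle\nabla f(x_t), \mathbb{E}_t\Big[-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\eta_L\nabla F_i(x^i_{t,k}) + \eta_L K\nabla f(x_t)\Big]\Big\rangle \\
&= \Big\langle\sqrt{\eta_L K}\,\nabla f(x_t), -\frac{\sqrt{\eta_L}}{\sqrt{K}}\mathbb{E}_t\Big[\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(\nabla F_i(x^i_{t,k}) - \nabla F_i(x_t)\big)\Big]\Big\rangle \\
&\overset{(a3)}{=} \frac{\eta_L K}{2}\|\nabla f(x_t)\|^2 + \frac{\eta_L}{2K}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(\nabla F_i(x^i_{t,k}) - \nabla F_i(x_t)\big)\Big\|^2 - \frac{\eta_L}{2K}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\overset{(a4)}{\le} \frac{\eta_L K}{2}\|\nabla f(x_t)\|^2 + \frac{\eta_L L^2}{2m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}_t\|x^i_{t,k} - x_t\|^2 - \frac{\eta_L}{2K}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\le \Big(\frac{\eta_L K}{2} + 10K^3L^2\eta_L^3(A^2+1)\Big)\|\nabla f(x_t)\|^2 + \frac{5L^2\eta_L^3}{2}K^2\sigma_L^2 + 10\eta_L^3L^2K^3\sigma_G^2 - \frac{\eta_L}{2K}\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2,
\end{aligned}$$
where (a2) follows from Assumption 2 and Lemma B.1, (a3) is due to $\langle x, y\rangle = \frac{1}{2}\big[\|x\|^2 + \|y\|^2 - \|x - y\|^2\big]$, (a4) comes from Assumption 1, and the last inequality applies Lemma B.3.

Next consider $A_2$. Let $\hat g_i^t = \sum_{k=0}^{K-1} g^i_{t,k} = \sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}, \xi^i_{t,k})$. Then
$$\begin{aligned}
A_2 &= \mathbb{E}_t\|\Delta_t\|^2 = \mathbb{E}_t\Big\|\eta_L\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}g^i_{t,k}\Big\|^2 \\
&= \eta_L^2\frac{1}{n}\mathbb{E}_t\Big\|\frac{1}{mp_i^t}\sum_{k=0}^{K-1}g^i_{t,k} - \frac{1}{m}\sum_{j=1}^{m}\sum_{k=0}^{K-1}g^j_{t,k}\Big\|^2 + \eta_L^2\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}g^i_{t,k}\Big\|^2 \\
&= \frac{\eta_L^2}{n}V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) + \eta_L^2\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big[g^i_{t,k} - \nabla F_i(x^i_{t,k}) + \nabla F_i(x^i_{t,k})\big]\Big\|^2 \\
&\le \frac{\eta_L^2}{n}V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) + \eta_L^2\frac{1}{m^2}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}\|g^i_{t,k} - \nabla F_i(x^i_{t,k})\|^2 + \eta_L^2\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\le \frac{\eta_L^2}{n}V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) + \eta_L^2\frac{K}{m}\sigma_L^2 + \eta_L^2\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2.
\end{aligned}$$
The third equality follows from independent sampling. Specifically, for sampling with replacement, every index is drawn independently, so for independent zero-mean terms we use $\mathbb{E}\|x_1 + \dots + x_n\|^2 = \mathbb{E}\big[\|x_1\|^2 + \dots + \|x_n\|^2\big]$.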
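For sampling without replacement:
\begin{align*}
&\mathbb{E}\Big\|\frac{1}{n}\sum_{i\in S_t}\Big(\frac{1}{mp_i^t}\hat g_i^t - \frac{1}{m}\sum_{j=1}^{m}\hat g_j^t\Big)\Big\|^2 \tag{30} \\
&= \frac{1}{n^2}\mathbb{E}\Big\|\sum_{i=1}^{m} I_i\Big(\frac{1}{mp_i^t}\hat g_i^t - \frac{1}{m}\sum_{j=1}^{m}\hat g_j^t\Big)\Big\|^2 \tag{31} \\
&= \frac{1}{n^2}\mathbb{E}\Big(\Big\|\sum_{i=1}^{m} I_i\Big(\frac{1}{mp_i^t}\hat g_i^t - \frac{1}{m}\sum_{j=1}^{m}\hat g_j^t\Big)\Big\|^2 \,\Big|\, I_i = 1\Big)\times P(I_i = 1) \tag{32} \\
&\quad + \frac{1}{n^2}\mathbb{E}\Big(\Big\|\sum_{i=1}^{m} I_i\Big(\frac{1}{mp_i^t}\hat g_i^t - \frac{1}{m}\sum_{j=1}^{m}\hat g_j^t\Big)\Big\|^2 \,\Big|\, I_i = 0\Big)\times P(I_i = 0) \tag{33} \\
&= \frac{1}{n}\sum_{i=1}^{m} p_i^t\Big\|\frac{1}{mp_i^t}\hat g_i^t - \frac{1}{m}\sum_{j=1}^{m}\hat g_j^t\Big\|^2 \tag{34} \\
&= \frac{1}{n}\mathbb{E}\Big\|\frac{1}{mp_i^t}\hat g_i^t - \frac{1}{m}\sum_{j=1}^{m}\hat g_j^t\Big\|^2.
\end{align*}
From the above, we observe that it is possible to gain a speedup by sampling from the distribution that minimizes $V\big(\frac{1}{mp_i^t}\hat g_i^t\big)$. Moreover, as discussed before, the optimal sampling probability is $p_i^* = \frac{\|\hat g_i^t\|}{\sum_{j=1}^{m}\|\hat g_j^t\|}$. For MD sampling (Li et al., 2019), which samples according to the data ratio, the optimal sampling probability is $p_{i,t}^* = \frac{q_i\|\hat g_i^t\|}{\sum_{j=1}^{m} q_j\|\hat g_j^t\|}$, where $q_i = \frac{n_i}{N}$.

Now substitute the expressions of $A_1$ and $A_2$:
$$\begin{aligned}
\mathbb{E}_t[f(x_{t+1})]
&\le f(x_t) - \eta\eta_L K\|\nabla f(x_t)\|^2 + \eta\langle\nabla f(x_t), \mathbb{E}_t[\Delta_t + \eta_L K\nabla f(x_t)]\rangle + \frac{L}{2}\eta^2\mathbb{E}_t\|\Delta_t\|^2 \\
&\le f(x_t) - \eta\eta_L K\Big(\frac{1}{2} - 10L^2K^2\eta_L^2(A^2+1)\Big)\|\nabla f(x_t)\|^2 + \frac{5\eta\eta_L^3L^2K^2}{2}(\sigma_L^2 + 4K\sigma_G^2) + \frac{\eta^2\eta_L^2KL}{2m}\sigma_L^2 \\
&\quad + \frac{L\eta^2\eta_L^2}{2n}V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) - \Big(\frac{\eta\eta_L}{2K} - \frac{L\eta^2\eta_L^2}{2}\Big)\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\le f(x_t) - c\eta\eta_L K\|\nabla f(x_t)\|^2 + \frac{5\eta\eta_L^3L^2K^2}{2}(\sigma_L^2 + 4K\sigma_G^2) + \frac{\eta^2\eta_L^2KL}{2m}\sigma_L^2 + \frac{L\eta^2\eta_L^2}{2n}V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big),
\end{aligned}$$
where the last inequality holds because $\frac{\eta\eta_L}{2K} - \frac{L\eta^2\eta_L^2}{2} \ge 0$ if $\eta\eta_L \le \frac{1}{KL}$, and because there exists a constant $c > 0$ (for some $\eta_L$) satisfying $\frac{1}{2} - 10L^2K^2\eta_L^2(A^2+1) > c > 0$.

Rearranging and summing from $t = 0, \dots, T-1$, we have:
$$\sum_{t=0}^{T-1} c\eta\eta_L K\,\mathbb{E}\|\nabla f(x_t)\|^2 \le f(x_0) - f(x_T) + T(\eta\eta_L K)\Phi,$$
which implies:
$$\min_{t\in[T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{f_0 - f_*}{c\eta\eta_L KT} + \Phi,$$
where
$$\Phi = \frac{1}{c}\Big[\frac{5\eta_L^2KL^2}{2}(\sigma_L^2 + 4K\sigma_G^2) + \frac{\eta\eta_L L}{2m}\sigma_L^2 + \frac{L\eta\eta_L}{2nK}V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big)\Big].$$

C.1 PROOF FOR CONVERGENCE OF FEDIS (THEOREM 3.1) UNDER ASSUMPTION 1-3.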
For comparison, we first provide the convergence result under Assumption 4, which is formally defined below.

Assumption 4 (Gradient bound). The expected squared norm of the stochastic gradient is uniformly bounded, i.e., $\mathbb{E}\|\nabla F_i(x_{t,k}, \xi_{k,t})\|^2 \le G^2$ for all $i$ and $k$.

First we show that Assumption 4 can be used to bound the update variance $V\big(\frac{1}{mp_i^t}\hat g_i^t\big)$ under the sampling probability FedIS (80):
$$V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) \le \frac{1}{m^2}\mathbb{E}\Big\|\sum_{i=1}^{m}\sum_{k=1}^{K}\nabla F_i(x_{t,k}, \xi_{k,t})\Big\|^2 \le \frac{1}{m}\sum_{i=1}^{m}K\sum_{k=1}^{K}\mathbb{E}\|\nabla F_i(x_{t,k}, \xi_{k,t})\|^2 \le K^2G^2.$$
Using Assumption 3 instead of the additional Assumption 4, we can also bound the update variance:
\begin{align*}
V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) &\le \frac{1}{m^2}\mathbb{E}\Big\|\sum_{i=1}^{m}\sum_{k=1}^{K}\nabla F_i(x_{t,k}, \xi_{k,t})\Big\|^2 \le \frac{1}{m}\sum_{i=1}^{m}K\sum_{k=1}^{K}\mathbb{E}\|\nabla F_i(x_{t,k}, \xi_{k,t})\|^2 \\
&\le K^2\sigma_G^2 + K^2(A^2+1)\|\nabla f(x_t)\|^2. \tag{41}
\end{align*}
We substitute this variance bound back into equation (36):
\begin{align*}
\mathbb{E}_t[f(x_{t+1})]
&\le f(x_t) - \eta\eta_L K\|\nabla f(x_t)\|^2 + \eta\langle\nabla f(x_t), \mathbb{E}_t[\Delta_t + \eta_L K\nabla f(x_t)]\rangle + \frac{L}{2}\eta^2\mathbb{E}_t\|\Delta_t\|^2 \\
&\le f(x_t) - \eta\eta_L K\Big(\frac{1}{2} - 10L^2K^2\eta_L^2(A^2+1)\Big)\|\nabla f(x_t)\|^2 + \frac{5\eta\eta_L^3L^2K^2}{2}(\sigma_L^2 + 4K\sigma_G^2) + \frac{\eta^2\eta_L^2KL}{2m}\sigma_L^2 \\
&\quad + \frac{L\eta^2\eta_L^2}{2n}V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) - \Big(\frac{\eta\eta_L}{2K} - \frac{L\eta^2\eta_L^2}{2}\Big)\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\le f(x_t) - \eta\eta_L K\Big(\frac{1}{2} - 10L^2K^2\eta_L^2(A^2+1) - \frac{L\eta\eta_L K(A^2+1)}{2n}\Big)\|\nabla f(x_t)\|^2 + \frac{5\eta\eta_L^3L^2K^2}{2}(\sigma_L^2 + 4K\sigma_G^2) \\
&\quad + \frac{\eta^2\eta_L^2KL}{2m}\sigma_L^2 + \frac{L\eta^2\eta_L^2}{2n}K^2\sigma_G^2 - \Big(\frac{\eta\eta_L}{2K} - \frac{L\eta^2\eta_L^2}{2}\Big)\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2. \tag{42}
\end{align*}
This shows that the requirement on $\eta_L$ is different: there must exist a constant $c > 0$ (for some $\eta_L$) satisfying $\frac{1}{2} - 10L^2K^2\eta_L^2(A^2+1) - \frac{L\eta\eta_L K(A^2+1)}{2n} > c > 0$. Such an $\eta_L$ still exists by the properties of quadratic functions. Specifically, for the quadratic $-10L^2K^2(A^2+1)\eta_L^2 - \frac{L\eta K(A^2+1)}{2n}\eta_L + \frac{1}{2}$ in $\eta_L$, we know that $-10L^2K^2(A^2+1) < 0$, $-\frac{L\eta K(A^2+1)}{2n} < 0$, and $\frac{1}{2} > 0$. The quadratic therefore equals $\frac{1}{2}$ at $\eta_L = 0$ and decreases for $\eta_L > 0$, so it has exactly one positive root and is positive for every $\eta_L$ between $0$ and that root.

We can then replace equation (36) with equation (42). Letting $\eta_L = O\big(\frac{1}{\sqrt{T}KL}\big)$ and $\eta = O\big(\sqrt{Kn}\big)$, we get the convergence rate of FedIS under Assumptions 1-3:
$$\min_{t\in[T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le O\Big(\frac{f_0 - f_*}{\sqrt{nKT}}\Big) + \underbrace{O\Big(\frac{\sigma_L^2}{\sqrt{nKT}}\Big) + O\Big(\frac{M^2}{T}\Big) + O\Big(\frac{K\sigma_G^2}{\sqrt{nKT}}\Big)}_{\text{order of } \Phi}.$$

D CONVERGENCE OF DELTA. PROOF OF THEOREM 3.3

D.1 CONVERGENCE RATE WITH IMPROVED ANALYSIS METHOD FOR GETTING DELTA

As we have seen, FedIS can only reduce the update variance term in $\Phi$. Since we want to reduce the convergence variance as much as possible, the other terms, $\sigma_L$ and $\sigma_G$, still need to be optimized. However, it is not straightforward to derive the optimization problem from $\Phi$. In order to further reduce the variance in $\Phi$ (cf. (4)), i.e., the local variance ($\sigma_L$) and the global variance ($\sigma_G$), we divide the convergence of the global objective into a surrogate objective and an update gap, and analyze each term separately. The analysis framework is shown in Figure 10. For the update gap, inspired by the form of the update variance, we formally define it below.

Definition D.1 (Update gap). In order to measure the update inconsistency, we define the update gap:
$$\chi_t = \mathbb{E}\big\|\nabla\tilde f(x_t) - \nabla f(x_t)\big\|.$$
Here the expectation is over the distribution of all clients. When all clients participate, we have $\chi_t^2 = 0$. The update inconsistency exists whenever only part of the clients participate. The update gap is a direct embodiment of the objective inconsistency in the update process. The existence of the update gap makes the analysis of the global objective different from the analysis of the surrogate objective. However, once we guarantee the convergence of the update gap, we can re-derive the convergence result for the global objective. Formally, the update gap connects global objective convergence and surrogate objective convergence as follows:
$$\mathbb{E}\|\nabla f(x_t)\|^2 = \mathbb{E}\|\nabla\tilde f(x_t)\|^2 + \chi_t^2.$$
The equation follows from the property of unbiasedness; see Lemma B.1. In order to deduce the convergence rate of the global objective, we start from the convergence analysis of the surrogate objective.

Theorem D.2. [...] and $\eta\eta_L \le \frac{1}{KL}$, the minimal gradient norm of the surrogate objective is bounded as follows:
$$\min_{t\in[T]} \mathbb{E}\big\|\nabla\tilde f(x_t)\big\|^2 \le \frac{f_0 - f_*}{c\eta\eta_L KT} + \frac{\Phi}{c},$$
where $f_0 = f(x_0)$, $f_* = f(x_*)$, and the expectation is over the local dataset samples among workers. $\Phi$ is the new combination of variance, representing combinations of the local variance and the client gradient diversity. For sampling without replacement:
$$\Phi = \frac{5L^2K\eta_L^2}{2mn}\sum_{i=1}^{m}\frac{1}{p_i^t}(\sigma_{L,i}^2 + 4K\zeta_{G,i}^2) + \frac{L\eta_L\eta}{2n}\sum_{i=1}^{m}\frac{1}{m^2p_i^t}\sigma_{L,i}^2.$$
For sampling with replacement:
$$\Phi = \frac{5L^2K\eta_L^2}{2m^2}\sum_{i=1}^{m}\frac{1}{p_i^t}(\sigma_{L,i}^2 + 4K\zeta_{G,i}^2) + \frac{L\eta_L\eta}{2n}\sum_{i=1}^{m}\frac{1}{m^2p_i^t}\sigma_{L,i}^2,$$
where $\zeta_{G,i}$ represents the client gradient diversity, $\zeta_{G,i} = \|\nabla F_i(x_t) - \nabla f(x_t)\|$, and $c$ is a constant. The proof of Theorem D.2 for sampling with replacement is given in Appendix D.2, and the proof for sampling without replacement is given in Appendix D.3.

Remark D.3. There is no update variance in $\Phi$, but the local variance and global variance remain in it. Furthermore, the new combination of variance $\Phi$ can be minimized by optimizing with respect to the sampling probability, as shown later.

Derive the convergence from surrogate objective to global objective. As shown in Lemma B.1, unbiased sampling guarantees that partial client updates are equal in expectation to the participation of all clients. With enough training rounds, unbiased sampling can guarantee that the update gap $\chi^2$ converges to zero. However, we still need the convergence speed of $\chi_t^2$ to recover the convergence rate of the global objective. Fortunately, we can bound the convergence behavior of $\chi_t^2$ by the convergence rate of the surrogate objective according to Definition D.1 and Lemma B.2. Therefore, the update gap achieves at least the same convergence rate as the surrogate objective.

Corollary D.4 (New convergence rate of global objective). Under Assumptions 1-3 and based on the above analysis that the update variance is bounded, the global objective converges to a stationary point. Its gradient is bounded as:
$$\min_{t\in[T]} \mathbb{E}\|\nabla f(x_t)\|^2 = \min_{t\in[T]} \mathbb{E}\|\nabla\tilde f(x_t)\|^2 + \mathbb{E}\|\chi_t^2\| \le \min_{t\in[T]} 2\,\mathbb{E}\|\nabla\tilde f(x_t)\|^2 \le \frac{f_0 - f_*}{c\eta\eta_L KT} + \frac{\Phi}{c}.$$

Theorem D.5 (Restate of Theorem 3.3).
Under Assumptions 1-3 and the same conditions as Theorem 3.1, the minimal gradient norm of the surrogate objective is bounded as follows by setting $\eta_L = \frac{1}{\sqrt{T}KL}$ and $\eta = \sqrt{Kn}$. Let the local and global learning rates $\eta_L$ and $\eta$ satisfy $\eta_L < \frac{1}{\sqrt{20(A^2+1)}\,KL\sqrt{\frac{1}{n}\sum_{l=1}^{m}\frac{1}{mp_l^t}}}$ and $\eta\eta_L \le \frac{1}{KL}$. Under Assumptions 1-3 and with partial worker participation, the sequence of outputs $x_k$ generated by Algorithm 1 satisfies:
$$\min_{t\in[T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{F}{c\eta\eta_L KT} + \frac{1}{c}\Phi,$$
where $F = f(x_0) - f(x_*)$, the expectation is over the local dataset samplings among workers, and $c$ is a constant. $\zeta_{G,i}$ is defined as the client gradient diversity: $\zeta_{G,i} = \|\nabla F_i(x_t) - \nabla f(x_t)\|$. For sampling with replacement:
$$\Phi = \frac{5L^2K\eta_L^2}{2m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_{L,l}^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L\eta}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_{L,l}^2.$$
For sampling without replacement:
$$\Phi = \frac{5L^2K\eta_L^2}{2mn}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_{L,l}^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L\eta}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_{L,l}^2.$$

Remark D.6 (Condition of $\eta_L$). Although the condition on $\eta_L$ depends on a dynamic sampling probability $p_l^t$, we can still guarantee that a constant $\eta_L$ satisfying this condition exists. Specifically, one can substitute the optimal sampling probability
$$\frac{1}{p_i^t} = \frac{\sum_{j=1}^{m}\sqrt{\alpha_1\zeta_{G,j}^2 + \alpha_2\sigma_{L,j}^2}}{\sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_{L,i}^2}}$$
back into the above inequality condition. As long as the gradient $\nabla F_i(x_t)$ is bounded, we can ensure
$$\frac{1}{m^2}\sum_{i=1}^{m}\frac{\sum_{j=1}^{m}\sqrt{\alpha_1\zeta_{G,j}^2 + \alpha_2\sigma_{L,j}^2}}{\sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_{L,i}^2}} \le \frac{\max_j\sqrt{\alpha_1\zeta_{G,j}^2 + \alpha_2\sigma_{L,j}^2}}{\min_i\sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_{L,i}^2}} \le G,$$
and therefore
$$\frac{1}{\sqrt{20(A^2+1)}\,KL\sqrt{\frac{1}{m^2}\sum_{i=1}^{m}\frac{\sum_{j=1}^{m}\sqrt{\alpha_1\zeta_{G,j}^2 + \alpha_2\sigma_{L,j}^2}}{\sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_{L,i}^2}}}} \ge \frac{1}{\sqrt{20(A^2+1)}\,KL\sqrt{G}} \ge C,$$
where $G$ and $C$ are positive constants. Thus, we can always find a constant $\eta_L$ that satisfies this inequality under the dynamic sampling probability $p_i^t$.

Corollary D.7. Suppose $\eta_L$ and $\eta$ satisfy the conditions above, with $\eta_L = O\big(\frac{1}{\sqrt{T}KL}\big)$ and $\eta = O\big(\sqrt{Kn}\big)$. Then for sufficiently large $T$, the iterates of Theorem 3.3 satisfy:
$$\min_{t\in[T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le O\Big(\frac{F}{\sqrt{nKT}}\Big) + O\Big(\frac{\sigma_L^2}{\sqrt{nKT}}\Big) + O\Big(\frac{\sigma_L^2 + 4K\zeta_G^2}{KT}\Big).$$

Lemma D.8.
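For any step size satisfying $\eta_L \le \frac{1}{8LK}$, we have
$$\mathbb{E}\|x^i_{t,k} - x_t\|^2 \le 5K(\eta_L^2\sigma_L^2 + 4K\eta_L^2\zeta_{G,i}^2) + 20K^2(A^2+1)\eta_L^2\|\nabla f(x_t)\|^2,$$
where $\zeta_{G,i} = \|\nabla F_i(x_t) - \nabla f(x_t)\|$, and the expectation is over local SGD and the filtration of $x_t$, without the stochasticity of client sampling.

Proof.
\begin{align*}
\mathbb{E}_t\|x^i_{t,k} - x_t\|^2
&= \mathbb{E}_t\|x^i_{t,k-1} - x_t - \eta_L g^i_{t,k-1}\|^2 \\
&= \mathbb{E}_t\|x^i_{t,k-1} - x_t - \eta_L\big(g^i_{t,k-1} - \nabla F_i(x^i_{t,k-1}) + \nabla F_i(x^i_{t,k-1}) - \nabla F_i(x_t) + \nabla F_i(x_t)\big)\|^2 \\
&\le \Big(1 + \frac{1}{2K-1}\Big)\mathbb{E}_t\|x^i_{t,k-1} - x_t\|^2 + \mathbb{E}_t\|\eta_L\big(g^i_{t,k-1} - \nabla F_i(x^i_{t,k-1})\big)\|^2 \\
&\quad + 4K\,\mathbb{E}_t\big[\|\eta_L\big(\nabla F_i(x^i_{t,k-1}) - \nabla F_i(x_t)\big)\|^2\big] + 4K\eta_L^2\,\mathbb{E}_t\|\nabla F_i(x_t)\|^2 \\
&\le \Big(1 + \frac{1}{2K-1}\Big)\mathbb{E}_t\|x^i_{t,k-1} - x_t\|^2 + \eta_L^2\sigma_L^2 + 4K\eta_L^2L^2\,\mathbb{E}_t\|x^i_{t,k-1} - x_t\|^2 \\
&\quad + 4K\eta_L^2\zeta_{G,i}^2 + 4K\eta_L^2(A^2+1)\|\nabla f(x_t)\|^2 \\
&\le \Big(1 + \frac{1}{K-1}\Big)\mathbb{E}\|x^i_{t,k-1} - x_t\|^2 + \eta_L^2\sigma_L^2 + 4K\eta_L^2\zeta_{G,i}^2 + 4K(A^2+1)\|\eta_L\nabla f(x_t)\|^2. \tag{54}
\end{align*}
Unrolling the recursion, we get:
\begin{align*}
\mathbb{E}_t\|x^i_{t,k} - x_t\|^2
&\le \sum_{p=0}^{k-1}\Big(1 + \frac{1}{K-1}\Big)^{p}\Big[\eta_L^2\sigma_L^2 + 4K\eta_L^2\zeta_{G,i}^2 + 4K(A^2+1)\|\eta_L\nabla f(x_t)\|^2\Big] \tag{55} \\
&\le (K-1)\Big[\Big(1 + \frac{1}{K-1}\Big)^{K} - 1\Big]\Big[\eta_L^2\sigma_L^2 + 4K\eta_L^2\zeta_{G,i}^2 + 4K(A^2+1)\|\eta_L\nabla f(x_t)\|^2\Big] \tag{56} \\
&\le 5K(\eta_L^2\sigma_L^2 + 4K\eta_L^2\zeta_{G,i}^2) + 20K^2(A^2+1)\eta_L^2\|\nabla f(x_t)\|^2.
\end{align*}
In Section D.2 and Section D.3, we provide the proof of Theorem D.2: the proof for sampling with replacement is given in Appendix D.2, and the proof for sampling without replacement is given in Appendix D.3.

D.2 SAMPLE WITH REPLACEMENT

$$\min_{t\in[T]} \mathbb{E}\|\nabla\tilde f(x_t)\|^2 \le \frac{f_0 - f_*}{c\eta\eta_L KT} + \frac{1}{c}\Phi,$$
where $\Phi = \frac{5L^2K\eta_L^2}{2m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L\eta}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2$.

Proof.
$$\begin{aligned}
f(x_{t+1})
&\overset{(a1)}{\le} f(x_t) + \big\langle\nabla\tilde f(x_t), \mathbb{E}_t[x_{t+1} - x_t]\big\rangle + \frac{L}{2}\mathbb{E}_t[\|x_{t+1} - x_t\|^2] \\
&= f(x_t) + \big\langle\nabla\tilde f(x_t), \mathbb{E}_t[\eta\Delta_t + \eta\eta_L K\nabla\tilde f(x_t) - \eta\eta_L K\nabla\tilde f(x_t)]\big\rangle + \frac{L}{2}\eta^2\mathbb{E}_t[\|\Delta_t\|^2] \\
&= f(x_t) - \eta\eta_L K\big\|\nabla\tilde f(x_t)\big\|^2 + \eta\underbrace{\big\langle\nabla\tilde f(x_t), \mathbb{E}_t[\Delta_t + \eta_L K\nabla\tilde f(x_t)]\big\rangle}_{A_1} + \frac{L}{2}\eta^2\underbrace{\mathbb{E}_t\|\Delta_t\|^2}_{A_2},
\end{aligned}$$
where (a1) follows from the Lipschitz continuity of the gradient.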
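Here the expectation is over local data SGD and the filtration of $x_t$. In the following analysis, however, the expectation is over all randomness, i.e., client sampling is included. First consider $A_1$:
$$\begin{aligned}
A_1 &= \big\langle\nabla\tilde f(x_t), \mathbb{E}_t[\Delta_t + \eta_L K\nabla\tilde f(x_t)]\big\rangle \\
&= \Big\langle\nabla\tilde f(x_t), \mathbb{E}_t\Big[-\frac{1}{|S_t|}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\eta_L g^i_{t,k} + \eta_L K\nabla\tilde f(x_t)\Big]\Big\rangle \\
&\overset{(a2)}{=} \Big\langle\nabla\tilde f(x_t), \mathbb{E}_t\Big[-\frac{1}{|S_t|}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\eta_L\nabla F_i(x^i_{t,k}) + \eta_L K\nabla\tilde f(x_t)\Big]\Big\rangle \\
&= \Big\langle\sqrt{K\eta_L}\,\nabla\tilde f(x_t), \frac{\sqrt{\eta_L}}{\sqrt{K}}\mathbb{E}_t\Big[-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}) + K\nabla\tilde f(x_t)\Big]\Big\rangle \\
&\overset{(a3)}{=} \frac{K\eta_L}{2}\|\nabla\tilde f(x_t)\|^2 + \frac{\eta_L}{2K}\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}) + K\nabla\tilde f(x_t)\Big\|^2 - \frac{\eta_L}{2K}\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2,
\end{aligned}$$
where (a2) follows from Assumption 2, and (a3) is due to $\langle x, y\rangle = \frac{1}{2}\big[\|x\|^2 + \|y\|^2 - \|x - y\|^2\big]$ for $x = \sqrt{K\eta_L}\,\nabla\tilde f(x_t)$ and $y = \sqrt{\frac{\eta_L}{K}}\big[-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}) + K\nabla\tilde f(x_t)\big]$.

In order to bound $A_1$, we need to bound the following part:
$$\begin{aligned}
&\mathbb{E}_t\Big\|\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}) - K\nabla\tilde f(x_t)\Big\|^2 \\
&= \mathbb{E}_t\Big\|\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}) - \frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x_t)\Big\|^2 \\
&\overset{(a4)}{\le} \frac{K}{n}\sum_{i\in S_t}\sum_{k=0}^{K-1}\mathbb{E}_t\Big\|\frac{1}{mp_i^t}\big(\nabla F_i(x^i_{t,k}) - \nabla F_i(x_t)\big)\Big\|^2 \\
&= \frac{K}{n}\sum_{i\in S_t}\sum_{k=0}^{K-1}\mathbb{E}_t\Big\{\mathbb{E}_t\Big(\Big\|\frac{1}{mp_i^t}\big(\nabla F_i(x^i_{t,k}) - \nabla F_i(x_t)\big)\Big\|^2 \,\Big|\, S\Big)\Big\} \\
&= \frac{K}{n}\sum_{i\in S_t}\sum_{k=0}^{K-1}\mathbb{E}_t\Big(\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\|\nabla F_l(x^l_{t,k}) - \nabla F_l(x_t)\|^2\Big) \\
&= K\sum_{k=0}^{K-1}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\mathbb{E}_t\|\nabla F_l(x^l_{t,k}) - \nabla F_l(x_t)\|^2 \\
&\overset{(a5)}{\le} \frac{K^2}{m^2}\sum_{l=1}^{m}\frac{L^2}{p_l^t}\mathbb{E}\|x^l_{t,k} - x_t\|^2 \\
&\overset{(a6)}{\le} \frac{L^2K^2}{m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}\Big(5K(\eta_L^2\sigma_L^2 + 4K\eta_L^2\zeta_{G,l}^2) + 20K^2(A^2+1)\eta_L^2\|\nabla f(x_t)\|^2\Big) \\
&= \frac{5L^2K^3\eta_L^2}{m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{20L^2K^4\eta_L^2(A^2+1)}{m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}\|\nabla f(x_t)\|^2,
\end{aligned}$$
where (a4) follows from the fact that $\mathbb{E}\|x_1 + \dots + x_n\|^2 \le n\,\mathbb{E}\big(\|x_1\|^2 + \dots + \|x_n\|^2\big)$, (a5) is due to Assumption 1, and (a6) is due to Lemma D.8.

Combining the above, we have:
$$A_1 \le \frac{K\eta_L}{2}\|\nabla\tilde f(x_t)\|^2 + \frac{\eta_L}{2K}\Big(\frac{5L^2K^3\eta_L^2}{m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{20L^2K^4\eta_L^2(A^2+1)}{m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}\|\nabla f(x_t)\|^2\Big) - \frac{\eta_L}{2K}\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2.$$

Next we bound $A_2$:
$$\begin{aligned}
A_2 &= \mathbb{E}_t\|\Delta_t\|^2 = \mathbb{E}_t\Big\|-\eta_L\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}g^i_{t,k}\Big\|^2 \\
&= \eta_L^2\mathbb{E}_t\Big\|\frac{1}{n}\sum_{i\in S_t}\sum_{k=0}^{K-1}\Big(\frac{1}{mp_i^t}g^i_{t,k} - \frac{1}{mp_i^t}\nabla F_i(x^i_{t,k})\Big)\Big\|^2 + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&= \eta_L^2\frac{1}{n^2}\sum_{i\in S_t}\sum_{k=0}^{K-1}\mathbb{E}_t\Big\|\frac{1}{mp_i^t}g^i_{t,k} - \frac{1}{mp_i^t}\nabla F_i(x^i_{t,k})\Big\|^2 + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&= \eta_L^2\frac{1}{n^2}\sum_{i\in S_t}\sum_{k=0}^{K-1}\mathbb{E}_t\Big(\mathbb{E}\Big(\Big\|\frac{1}{mp_i^t}\big(g^i_{t,k} - \nabla F_i(x^i_{t,k})\big)\Big\|^2 \,\Big|\, S\Big)\Big) + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&= \eta_L^2\frac{1}{n^2}\sum_{i\in S_t}\sum_{k=0}^{K-1}\mathbb{E}_t\Big(\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\big\|g^l_{t,k} - \nabla F_l(x^l_{t,k})\big\|^2\Big) + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\overset{(a7)}{\le} \eta_L^2\frac{K}{n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2 + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2,
\end{aligned}$$
where $S$ represents the whole sample space and (a7) is due to Assumption 2.

Now substitute the expressions of $A_1$ and $A_2$ and take the expectation over the client sampling distribution on both sides.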
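Note that the derivation of $A_1$ and $A_2$ above already considers the expectation over the sampling distribution:
$$\begin{aligned}
f(x_{t+1})
&\le f(x_t) - \eta\eta_L K\,\mathbb{E}_t\big\|\nabla\tilde f(x_t)\big\|^2 + \eta\,\mathbb{E}_t\big\langle\nabla\tilde f(x_t), \Delta_t + \eta_L K\nabla\tilde f(x_t)\big\rangle + \frac{L}{2}\eta^2\mathbb{E}_t\|\Delta_t\|^2 \\
&\le f(x_t) - K\eta\eta_L\Big(\frac{1}{2} - \frac{10K^2\eta_L^2L^2(A^2+1)}{m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}\Big)\mathbb{E}_t\big\|\nabla\tilde f(x_t)\big\|^2 + \frac{5L^2K^2\eta_L^3\eta}{2m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}\big(\sigma_L^2 + 4K\zeta_{G,l}^2\big) \\
&\quad + \frac{L\eta_L^2\eta^2K}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2 - \Big(\frac{\eta\eta_L}{2K} - \frac{L\eta^2\eta_L^2}{2}\Big)\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\overset{(a8)}{\le} f(x_t) - K\eta\eta_L\Big(\frac{1}{2} - \frac{10K^2\eta_L^2L^2(A^2+1)}{m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}\Big)\mathbb{E}_t\|\nabla\tilde f(x_t)\|^2 + \frac{5L^2K^2\eta_L^3\eta}{2m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L^2\eta^2K}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2 \\
&\overset{(a9)}{\le} f(x_t) - cK\eta\eta_L\,\mathbb{E}_t\|\nabla\tilde f(x_t)\|^2 + \frac{5L^2K^2\eta_L^3\eta}{2m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L^2\eta^2K}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2,
\end{aligned}$$
where (a8) follows from $\frac{\eta\eta_L}{2K} - \frac{L\eta^2\eta_L^2}{2} \ge 0$ if $\eta\eta_L \le \frac{1}{KL}$, and (a9) holds because there exists a constant $c > 0$ satisfying $\big(\frac{1}{2} - \frac{10K^2\eta_L^2L^2(A^2+1)}{m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}\big) > c > 0$ if $\eta_L < \frac{1}{\sqrt{20(A^2+1)}\,KL\sqrt{\frac{1}{m}\sum_{l=1}^{m}\frac{1}{mp_l^t}}}$.

Rearranging and summing from $t = 0, \dots, T-1$, we have:
$$\sum_{t=0}^{T-1} c\eta\eta_L K\,\mathbb{E}\|\nabla\tilde f(x_t)\|^2 \le f(x_0) - f(x_T) + T(\eta\eta_L K)\Big(\frac{5L^2K\eta_L^2}{2m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L\eta}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2\Big),$$
which implies:
$$\min_{t\in[T]} \mathbb{E}\|\nabla\tilde f(x_t)\|^2 \le \frac{f_0 - f_*}{c\eta\eta_L KT} + \frac{1}{c}\Phi,$$
where $\Phi = \frac{5L^2K\eta_L^2}{2m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L\eta}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2$.

D.3 SAMPLE WITHOUT REPLACEMENT

$$\min_{t\in[T]} \mathbb{E}\|\nabla\tilde f(x_t)\|^2 \le \frac{f_0 - f_*}{c\eta\eta_L KT} + \frac{1}{c}\Phi,$$
where $\Phi = \frac{5L^2K\eta_L^2}{2mn}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L\eta}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2$.

Proof.
$$\begin{aligned}
f(x_{t+1})
&\le f(x_t) + \big\langle\nabla\tilde f(x_t), \mathbb{E}_t[x_{t+1} - x_t]\big\rangle + \frac{L}{2}\mathbb{E}_t[\|x_{t+1} - x_t\|^2] \\
&= f(x_t) + \big\langle\nabla\tilde f(x_t), \mathbb{E}_t[\eta\Delta_t + \eta\eta_L K\nabla\tilde f(x_t) - \eta\eta_L K\nabla\tilde f(x_t)]\big\rangle + \frac{L}{2}\eta^2\mathbb{E}_t[\|\Delta_t\|^2] \\
&= f(x_t) - \eta\eta_L K\big\|\nabla\tilde f(x_t)\big\|^2 + \eta\underbrace{\big\langle\nabla\tilde f(x_t), \mathbb{E}_t[\Delta_t + \eta_L K\nabla\tilde f(x_t)]\big\rangle}_{A_1} + \frac{L}{2}\eta^2\underbrace{\mathbb{E}_t\|\Delta_t\|^2}_{A_2},
\end{aligned}$$
where the first inequality follows from the Lipschitz continuity of the gradient.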
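Here the expectation is over local data SGD and the filtration of $x_t$. In the following analysis, however, the expectation is over all randomness, i.e., client sampling is included. Similarly, we consider $A_1$ first:
$$\begin{aligned}
A_1 &= \big\langle\nabla\tilde f(x_t), \mathbb{E}_t[\Delta_t + \eta_L K\nabla\tilde f(x_t)]\big\rangle \\
&= \Big\langle\nabla\tilde f(x_t), \mathbb{E}_t\Big[-\frac{1}{|S_t|}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\eta_L g^i_{t,k} + \eta_L K\nabla\tilde f(x_t)\Big]\Big\rangle \\
&= \Big\langle\nabla\tilde f(x_t), \mathbb{E}_t\Big[-\frac{1}{|S_t|}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\eta_L\nabla F_i(x^i_{t,k}) + \eta_L K\nabla\tilde f(x_t)\Big]\Big\rangle \\
&= \Big\langle\sqrt{K\eta_L}\,\nabla\tilde f(x_t), \frac{\sqrt{\eta_L}}{\sqrt{K}}\mathbb{E}_t\Big[-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}) + K\nabla\tilde f(x_t)\Big]\Big\rangle \\
&= \frac{K\eta_L}{2}\big\|\nabla\tilde f(x_t)\big\|^2 + \frac{\eta_L}{2K}\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}) + K\nabla\tilde f(x_t)\Big\|^2 - \frac{\eta_L}{2K}\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2.
\end{aligned}$$
Since the clients in $S_t$ are sampled without replacement, pairs $x_{i_1}, x_{i_2}$ are no longer independent. We introduce the indicator function
$$I_l \triangleq \begin{cases} 1 & \text{if } x_l \in S_t, \\ 0 & \text{otherwise.} \end{cases}$$
Then we get the following bound:
$$\begin{aligned}
&\mathbb{E}_t\Big\|\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k}) - K\nabla\tilde f(x_t)\Big\|^2 \\
&= \mathbb{E}_t\Big\|\frac{1}{n}\sum_{l=1}^{m} I_l\frac{1}{mp_l^t}\sum_{k=0}^{K-1}\nabla F_l(x^l_{t,k}) - \frac{1}{n}\sum_{l=1}^{m} I_l\frac{1}{mp_l^t}\sum_{k=0}^{K-1}\nabla F_l(x_t)\Big\|^2 \\
&\overset{(b1)}{\le} \frac{m}{n^2}\sum_{l=1}^{m}\mathbb{E}_t\Big\|I_l\frac{1}{mp_l^t}\sum_{k=0}^{K-1}\big(\nabla F_l(x^l_{t,k}) - \nabla F_l(x_t)\big)\Big\|^2 \\
&\quad - \frac{1}{n^2}\sum_{l_1\ne l_2}\mathbb{E}_t\Big\|I_{l_1}\frac{1}{mp_{l_1}^t}\sum_{k=0}^{K-1}\big(\nabla F_{l_1}(x^{l_1}_{t,k}) - \nabla F_{l_1}(x_t)\big) - I_{l_2}\frac{1}{mp_{l_2}^t}\sum_{k=0}^{K-1}\big(\nabla F_{l_2}(x^{l_2}_{t,k}) - \nabla F_{l_2}(x_t)\big)\Big\|^2 \\
&\le \frac{m}{n^2}\sum_{l=1}^{m}\mathbb{E}_t\Big\|I_l\frac{1}{mp_l^t}\sum_{k=0}^{K-1}\big(\nabla F_l(x^l_{t,k}) - \nabla F_l(x_t)\big)\Big\|^2 \\
&= \frac{m}{n^2}\sum_{l=1}^{m}\Big[\mathbb{E}_t\Big(\Big\|I_l\frac{1}{mp_l^t}\sum_{k=0}^{K-1}\big(\nabla F_l(x^l_{t,k}) - \nabla F_l(x_t)\big)\Big\|^2 \,\Big|\, I_l = 1\Big)\times P(I_l = 1) \\
&\qquad\qquad + \mathbb{E}_t\Big(\Big\|I_l\frac{1}{mp_l^t}\sum_{k=0}^{K-1}\big(\nabla F_l(x^l_{t,k}) - \nabla F_l(x_t)\big)\Big\|^2 \,\Big|\, I_l = 0\Big)\times P(I_l = 0)\Big] \\
&= \frac{m}{n^2}\sum_{l=1}^{m} np_l^t\,\mathbb{E}\Big\|\frac{1}{mp_l^t}\sum_{k=0}^{K-1}\nabla F_l(x^l_{t,k}) - \frac{1}{mp_l^t}\sum_{k=0}^{K-1}\nabla F_l(x_t)\Big\|^2 \\
&\overset{(b2)}{\le} \frac{L^2K}{mn}\sum_{k=0}^{K-1}\sum_{l=1}^{m}\frac{1}{p_l^t}\mathbb{E}\|x^l_{t,k} - x_t\|^2 \\
&\overset{(b3)}{\le} \frac{L^2K^2}{n}\Big(\frac{5K\eta_L^2}{m}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + 20K^2(A^2+1)\eta_L^2\|\nabla f(x_t)\|^2\frac{1}{m}\sum_{l=1}^{m}\frac{1}{p_l^t}\Big),
\end{aligned}$$
where (b1) follows from $\|\sum_{i=1}^{m} t_i\|^2 = \sum_{i\in[m]}\|t_i\|^2 + \sum_{i\ne j}\langle t_i, t_j\rangle \overset{(c1)}{=} \sum_{i\in[m]} m\|t_i\|^2 - \frac{1}{2}\sum_{i\ne j}\|t_i - t_j\|^2$ (where (c1) is due to $\langle x, y\rangle = \frac{1}{2}\big[\|x\|^2 + \|y\|^2 - \|x - y\|^2\big]$), (b2) is due to $\mathbb{E}\|x_1 + \dots + x_n\|^2 \le n\,\mathbb{E}\big(\|x_1\|^2 + \dots + \|x_n\|^2\big)$, and (b3) is from Lemma D.8.

Therefore, we have the following bound on $A_1$:
$$A_1 \le \frac{K\eta_L}{2}\|\nabla\tilde f(x_t)\|^2 + \frac{\eta_L L^2K}{2n}\Big(\frac{5K\eta_L^2}{m}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + 20K^2(A^2+1)\eta_L^2\|\nabla f(x_t)\|^2\frac{1}{m}\sum_{l=1}^{m}\frac{1}{p_l^t}\Big) - \frac{\eta_L}{2K}\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2.$$
$A_2$ has the following expression:
$$\begin{aligned}
A_2 &= \mathbb{E}_t\|\Delta_t\|^2 = \mathbb{E}_t\Big\|-\eta_L\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}g^i_{t,k}\Big\|^2 \\
&= \eta_L^2\mathbb{E}_t\Big\|\frac{1}{n}\sum_{i\in S_t}\sum_{k=0}^{K-1}\Big(\frac{1}{mp_i^t}g^i_{t,k} - \frac{1}{mp_i^t}\nabla F_i(x^i_{t,k})\Big)\Big\|^2 + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&= \eta_L^2\frac{1}{n^2}\mathbb{E}_t\Big\|\sum_{l=1}^{m} I_l\sum_{k=0}^{K-1}\frac{1}{mp_l^t}\big(g^l_{t,k} - \nabla F_l(x^l_{t,k})\big)\Big\|^2 + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&= \eta_L^2\frac{1}{n^2}\sum_{l=1}^{m}\mathbb{E}_t\Big\|I_l\sum_{k=0}^{K-1}\frac{1}{mp_l^t}\big(g^l_{t,k} - \nabla F_l(x^l_{t,k})\big)\Big\|^2 + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&= \eta_L^2\frac{1}{n^2}\sum_{l=1}^{m} np_l^t\,\mathbb{E}_t\Big\|\sum_{k=0}^{K-1}\frac{1}{mp_l^t}\big(g^l_{t,k} - \nabla F_l(x^l_{t,k})\big)\Big\|^2 + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\le \eta_L^2\frac{K}{n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2 + \eta_L^2\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2.
\end{aligned}$$
Now substitute the expressions of $A_1$ and $A_2$ and take the expectation over the client sampling distribution on both sides. Note that the derivation of $A_1$ and $A_2$ above already considers the expectation over the sampling distribution:
$$\begin{aligned}
f(x_{t+1})
&\le f(x_t) - \eta\eta_L K\,\mathbb{E}_t\big\|\nabla\tilde f(x_t)\big\|^2 + \eta\,\mathbb{E}_t\big\langle\nabla\tilde f(x_t), \Delta_t + \eta_L K\nabla\tilde f(x_t)\big\rangle + \frac{L}{2}\eta^2\mathbb{E}_t\|\Delta_t\|^2 \\
&\overset{(b4)}{\le} f(x_t) - \eta\eta_L K\Big(\frac{1}{2} - \frac{10L^2K^2(A^2+1)\eta_L^2}{nm}\sum_{l=1}^{m}\frac{1}{p_l^t}\Big)\mathbb{E}_t\|\nabla\tilde f(x_t)\|^2 + \frac{5K^2\eta\eta_L^3L^2}{2nm}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) \\
&\quad + \frac{L\eta^2\eta_L^2K}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2 - \Big(\frac{\eta\eta_L}{2K} - \frac{L\eta^2\eta_L^2}{2}\Big)\mathbb{E}_t\Big\|-\frac{1}{n}\sum_{i\in S_t}\frac{1}{mp_i^t}\sum_{k=0}^{K-1}\nabla F_i(x^i_{t,k})\Big\|^2 \\
&\le f(x_t) - c\eta\eta_L K\,\mathbb{E}_t\|\nabla\tilde f(x_t)\|^2 + \frac{5K^2\eta\eta_L^3L^2}{2nm}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{L\eta^2\eta_L^2K}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2.
\end{aligned}$$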
For (b4), the step sizes need to satisfy $\frac{\eta\eta_L}{2K} - \frac{L\eta^2\eta_L^2}{2} \ge 0$, which holds if $\eta\eta_L \le \frac{1}{KL}$, and there exists a constant $c > 0$ satisfying $\big(\frac{1}{2} - \frac{10K^2\eta_L^2L^2(A^2+1)}{mn}\sum_{l=1}^{m}\frac{1}{p_l^t}\big) > c > 0$ if $\eta_L < \frac{1}{\sqrt{20(A^2+1)}\,KL\sqrt{\frac{1}{n}\sum_{l=1}^{m}\frac{1}{mp_l^t}}}$.

Rearranging and summing from $t = 0, \dots, T-1$, we have:
$$\sum_{t=0}^{T-1} c\eta\eta_L K\,\mathbb{E}\|\nabla\tilde f(x_t)\|^2 \le f(x_0) - f(x_T) + T(\eta\eta_L K)\Phi,$$
which implies:
$$\min_{t\in[T]} \mathbb{E}\|\nabla\tilde f(x_t)\|^2 \le \frac{f_0 - f_*}{c\eta\eta_L KT} + \frac{1}{c}\Phi,$$
where $\Phi = \frac{5L^2K\eta_L^2}{2mn}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_L^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L\eta}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_L^2$.

Solving the above optimization problem, we give the expression of the optimal sampling probability:
$$p_i^t = \frac{\|\hat g_i^t\|}{\sum_{j=1}^{m}\|\hat g_j^t\|},$$
where $\hat g_i^t = \sum_{k=0}^{K-1} g^i_k$ is the sum of the gradient updates over multiple local steps. Recall from Theorem 3.1 that only the last variance term in the convergence term $\Phi$ is affected by sampling. In other words, we need to minimize the variance term with respect to the sampling probability. We formalize this as:
$$\min_{p_i^t\in[0,1],\ \sum_{i=1}^{m} p_i^t = 1} V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) \iff \min_{p_i^t\in[0,1],\ \sum_{i=1}^{m} p_i^t = 1} \frac{1}{m^2}\sum_{i=1}^{m}\frac{1}{p_i^t}\|\hat g_i^t\|^2.$$
This problem can be solved in closed form using the KKT conditions. It is easy to verify that the solution is:
$$p_{i,t}^* = \frac{\big\|\sum_{k=0}^{K-1} g^i_{t,k}\big\|}{\sum_{j=1}^{m}\big\|\sum_{k=0}^{K-1} g^j_{t,k}\big\|}, \quad \forall i \in \{1, 2, \dots, m\}.$$
Under the optimal sampling probability, the variance is:
\begin{align*}
V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) \le \mathbb{E}\Big\|\frac{\sum_{i=1}^{m}\hat g_i^t}{m}\Big\|^2 = \frac{1}{m^2}\mathbb{E}\Big\|\sum_{i=1}^{m}\sum_{k=1}^{K}\nabla F_i(x_{t,k}, \xi_{k,t})\Big\|^2. \tag{83}
\end{align*}
Therefore, the variance term is bounded by:
$$V\Big(\frac{1}{mp_i^t}\hat g_i^t\Big) \le \frac{1}{m}\sum_{i=1}^{m}K\sum_{k=1}^{K}\mathbb{E}\|\nabla F_i(x_{t,k}, \xi_{k,t})\|^2 \le K^2G^2.$$
Remark: If the uniform distribution $p_i^t = \frac{1}{m}$ is adopted, it is easy to observe that the variance of the stochastic gradient is bounded by $\frac{\sum_{i=1}^{m}\|\hat g_i\|^2}{m}$. According to the Cauchy-Schwarz inequality,
$$\frac{\sum_{i=1}^{m}\|\hat g_i\|^2}{m}\Big/\Big(\frac{\sum_{i=1}^{m}\|\hat g_i\|}{m}\Big)^2 = \frac{m\sum_{i=1}^{m}\|\hat g_i\|^2}{\big(\sum_{i=1}^{m}\|\hat g_i\|\big)^2} \ge 1.$$
This implies that importance sampling does improve the convergence rate, especially when $\frac{(\sum_{i=1}^{m}\|\hat g_i\|)^2}{\sum_{i=1}^{m}\|\hat g_i\|^2} \ll m$.
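As an illustration of the FedIS probability $p_i \propto \|\hat g_i\|$ and the Cauchy-Schwarz remark above, the following sketch computes the update variance $V$ exactly (no Monte Carlo) for one sampled client under uniform and importance sampling. The synthetic gradients, the helper names, and the choice of $m = 8$ clients are assumptions made for this example and do not come from the paper's experiments.

```python
import numpy as np


def fedis_probabilities(g_hat: np.ndarray) -> np.ndarray:
    """Sampling probabilities proportional to the norm of each client's summed update."""
    norms = np.linalg.norm(g_hat, axis=1)
    return norms / norms.sum()


def expected_estimate(g_hat: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Exact expectation of g_i / (m p_i) when client i is drawn with probability p_i."""
    m = g_hat.shape[0]
    scaled = g_hat / (m * p[:, None])
    return (p[:, None] * scaled).sum(axis=0)


def update_variance(g_hat: np.ndarray, p: np.ndarray) -> float:
    """Exact V = E || g_i / (m p_i) - (1/m) sum_j g_j ||^2 under probabilities p."""
    m = g_hat.shape[0]
    scaled = g_hat / (m * p[:, None])
    full_mean = g_hat.mean(axis=0)
    sq_dist = np.sum((scaled - full_mean) ** 2, axis=1)
    return float(np.dot(p, sq_dist))


def make_example(m: int = 8, d: int = 5, seed: int = 0) -> np.ndarray:
    """Hypothetical client updates with heterogeneous norms (illustrative data only)."""
    rng = np.random.default_rng(seed)
    scales = np.linspace(0.1, 3.0, m)
    return rng.normal(size=(m, d)) * scales[:, None]


if __name__ == "__main__":
    g = make_example()
    m = g.shape[0]
    p_uniform = np.full(m, 1.0 / m)
    p_is = fedis_probabilities(g)
    print("unbiased under IS:", np.allclose(expected_estimate(g, p_is), g.mean(axis=0)))
    print("variance, uniform:", update_variance(g, p_uniform))
    print("variance, FedIS:  ", update_variance(g, p_is))
```

Both schemes give the same expected update, so the comparison isolates the variance term; the gap between the two variances grows as the client norms become more heterogeneous.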
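E.2 SAMPLING PROBABILITY OF DELTA

Our result is of the following form:
$$\min_{t\in[T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{f_0 - f_*}{c\eta\eta_L KT} + \Phi.$$
The sampling strategy only affects $\Phi$, so to improve the convergence rate we need to minimize $\Phi$ with respect to $p_l^t$. As shown above, the expressions of $\Phi$ with and without replacement are similar and differ only in the factors $n$ and $m$. Here we consider only the with-replacement case. Specifically, we need to solve the following optimization problem:
$$\min_{p_l^t} \Phi = \frac{1}{c}\Big(\frac{5L^2K\eta_L^2}{2m^2}\sum_{l=1}^{m}\frac{1}{p_l^t}(\sigma_{L,l}^2 + 4K\zeta_{G,l}^2) + \frac{L\eta_L\eta}{2n}\sum_{l=1}^{m}\frac{1}{m^2p_l^t}\sigma_{L,l}^2\Big) \quad \text{s.t.} \quad \sum_{l=1}^{m} p_l^t = 1.$$
Solving this optimization problem, we find the optimal sampling probability to be:
$$p_{i,t}^* = \frac{\sqrt{5KL\eta_L(\sigma_{L,i}^2 + 4K\zeta_{G,i}^2) + \frac{\eta}{n}\sigma_{L,i}^2}}{\sum_{l=1}^{m}\sqrt{5KL\eta_L(\sigma_{L,l}^2 + 4K\zeta_{G,l}^2) + \frac{\eta}{n}\sigma_{L,l}^2}}.$$
For simplicity, we rewrite the optimal sampling probability as:
$$p_{i,t}^* = \frac{\sqrt{\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_{L,i}^2}}{\sum_{l=1}^{m}\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}},$$
where $\alpha_1 = 20K^2L\eta_L$ and $\alpha_2 = 5KL\eta_L + \frac{\eta}{n}$.

Remark: We now compare with the uniform sampling strategy. Under DELTA:
$$\Phi_{\text{DELTA}} = \frac{L\eta_L}{2c}\Big(\frac{\sum_{l=1}^{m}\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}}{m}\Big)^2.$$
For uniform sampling, $p_l = \frac{1}{m}$:
$$\Phi_{\text{uniform}} = \frac{L\eta_L}{2c}\,\frac{\sum_{l=1}^{m}\Big(\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}\Big)^2}{m}.$$
According to the Cauchy-Schwarz inequality:
$$\frac{\sum_{l=1}^{m}\Big(\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}\Big)^2}{m}\Big/\Big(\frac{\sum_{l=1}^{m}\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}}{m}\Big)^2 = \frac{m\sum_{l=1}^{m}\Big(\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}\Big)^2}{\Big(\sum_{l=1}^{m}\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}\Big)^2} \ge 1.$$
This implies that importance sampling does improve the convergence rate (an importance sampling-based approach might be $n$ times faster in convergence than uniform sampling), especially when $\frac{\big(\sum_{l=1}^{m}\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}\big)^2}{\sum_{l=1}^{m}\big(\sqrt{\alpha_1\zeta_{G,l}^2 + \alpha_2\sigma_{L,l}^2}\big)^2} \ll m$.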
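The DELTA probability and the comparison with uniform sampling can be sketched numerically as follows. The sketch assumes the per-client quantities $\zeta_{G,l}$ and $\sigma_{L,l}$ are given as arrays; the numerical values, the helper names, and the hyperparameters are hypothetical, and the common factor $\frac{L\eta_L}{2c}$ is dropped because it does not affect the comparison.

```python
import numpy as np


def delta_probabilities(zeta_G, sigma_L, K, L, eta_L, eta, n):
    """Return (p, R) with R_l = sqrt(alpha1 * zeta_l^2 + alpha2 * sigma_l^2) and p_l = R_l / sum(R)."""
    alpha1 = 20 * K**2 * L * eta_L
    alpha2 = 5 * K * L * eta_L + eta / n
    R = np.sqrt(alpha1 * np.asarray(zeta_G) ** 2 + alpha2 * np.asarray(sigma_L) ** 2)
    return R / R.sum(), R


def phi_objective(p, R):
    """(1/m^2) * sum_l R_l^2 / p_l, i.e. Phi without the common factor L * eta_L / (2c)."""
    m = len(R)
    return float(np.sum(R**2 / p) / m**2)


def make_example():
    """Hypothetical per-client diversity and local-noise levels (illustrative values only)."""
    zeta_G = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 4.0])
    sigma_L = np.array([1.0, 0.5, 0.3, 0.8, 0.2, 1.5])
    hyper = dict(K=5, L=1.0, eta_L=0.01, eta=1.0, n=3)
    return zeta_G, sigma_L, hyper


if __name__ == "__main__":
    zeta_G, sigma_L, hyper = make_example()
    p_delta, R = delta_probabilities(zeta_G, sigma_L, **hyper)
    p_uniform = np.full(len(R), 1.0 / len(R))
    print("DELTA probabilities:", np.round(p_delta, 3))
    print("Phi (uniform):", phi_objective(p_uniform, R))
    print("Phi (DELTA):  ", phi_objective(p_delta, R))
```

With the DELTA probabilities the objective collapses to $(\sum_l R_l / m)^2$, which is the closed form stated in the remark, and it is never larger than the uniform value $\sum_l R_l^2 / m$.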

F CONVERGENCE ANALYSIS OF THE PRACTICAL ALGORITHM

For providing the convergence rate of applying the practical algorithm, we need an additional Assumption: Assumption 5 (Local gradient norm bound). The gradients ∇F i (x) are uniformly upper bounded (by a constant G > 0) ∥∇F i (x)∥ 2 ≤ G 2 , ∀i. Assumption 5 is a general assumption in IS community to bound the gradient norm (Zhao & Zhang, 2015; Elvira & Martino, 2021; Katharopoulos & Fleuret, 2018) , and it is also used in the FL community to analyze convergence (Balakrishnan et al., 2021; Zhang et al., 2020) . This assumption tells us a useful fact that will be used later: |∇F i (x t,k , ξ t,k )/∇F i (x s,k , ξ s,k )| ≤ U for all i and k, where subscribe s refers to the lasted participated round of client i, and U is a constant upper bound. It tells us that the client's gradient norm change is bounded. In general, the gradient norm tends to be smaller as training progresses, thus leading the |∇F i (x t,k , ξ t,k )/∇F i (x s,k , ξ s,k )| goes to zero. Even if there are some oscillations in the gradient norm, the gradient will vary within a limited range and will not appear to be infinite. Based on Assumption 5 and Assumption 3, we can re-derive the convergence analysis for both convergence variance Φ (4) and Φ (47). As for Assumption 3 (E∥∇F i (x)∥ 2 ≤ (A 2 + 1)∥∇f (x)∥ 2 + σ 2 G ), we use σ G,s and σ G,t instead of a unified σ G for the sake of comparison. Specifically , Φ = 1 c [ 5η 2 L L 2 K 2m m i=1 (σ 2 L + 4Kσ 2 G ) + ηη L L 2m σ 2 L + Lηη L 2nK V ( 1 mp t i ĝt i )], where ĝt i = K k=1 ∇F i (x k,s , ξ k,s ). With the practical sampling probability p s i of FedIS, the term V 1 mp s i ĝt i = E∥ 1 mp s i ĝt i - 1 m m i=1 ĝt i ∥ 2 ≤ E∥ 1 mp t i ĝt i ∥ 2 = E∥ 1 m ĝt i ĝs i m j=1 ĝs j ∥ 2 . ( ) According to Assumption 5, we know ∥ ĝt i ĝs i ∥ 2 = ∥ K k=1 ∇Fi(x i t,k ,ξ i t,k ) K k=1 ∇Fi(x i s,k ,ξ i s,k ) ∥ ≤ U 2 . 
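Then we get

$$V\!\left(\frac{1}{mp_i^s}\hat g_i^t\right) \le \mathbb{E}\left(\left\|\frac{1}{m}\right\|^2\left\|\frac{\hat g_i^t}{\hat g_i^s}\right\|^2\left\|\sum_{j=1}^{m}\hat g_j^s\right\|^2\right) \le \frac{1}{m^2}U^2\,\mathbb{E}\left\|\sum_{i=1}^{m}\sum_{k=1}^{K}\nabla F_i(x_{s,k}, \xi_{s,k})\right\|^2 \le \frac{1}{m^2}U^2\, m\sum_{i=1}^{m}K\sum_{k=1}^{K}\mathbb{E}\left\|\nabla F_i(x_{s,k}, \xi_{s,k})\right\|^2 .$$

Similar to the previous proof, based on Assumption 3, we obtain the new convergence rate:

$$\min_{t\in[T]}\mathbb{E}\|\nabla f(x_t)\|^2 \le O\!\left(\frac{f^0 - f^*}{\sqrt{nKT}}\right) + \underbrace{O\!\left(\frac{\sigma_L^2}{\sqrt{nKT}}\right) + O\!\left(\frac{M^2}{T}\right) + O\!\left(\frac{KU^2\sigma_{G,s}^2}{\sqrt{nKT}}\right)}_{\text{order of } \Phi},$$

where $M = \sigma_L^2 + 4K\sigma_{G,s}^2$.

Remark F.1. It is worth noting that $\left|\nabla F_i(x_{t,k}, \xi_{t,k}) / \nabla F_i(x_{s,k}, \xi_{s,k})\right|$ is usually relatively small because the gradient tends to zero as training progresses. This means $U$ can be relatively small; more specifically, $U < 1$ in the upper-bound term $O\!\left(\frac{KU^2\sigma_{G,s}^2}{\sqrt{nKT}}\right)$. However, this does not mean that the practical algorithm is better than the theoretical algorithm, because $\sigma_G$ is different, as we stated at the beginning. Usually, $\sigma_{G,s}$ of the practical algorithm is larger than $\sigma_{G,t}$, which again follows from the fact that the gradient tends to zero as training progresses. Besides, due to the summation over both $i$ and $k$, the gap between $\sigma_{G,s}$ and $\sigma_{G,t}$ is multiplied, and $\sigma_{G,s}/\sigma_{G,t} \sim m^2K^2\frac{1}{U^2}$. Thus, the practical algorithm leads to slower convergence than the theoretical algorithm.

Similarly, as long as the gradient is consistently bounded, we can assume $\left|\nabla F_i(x_t) - \nabla f(x_t)\right| / \left|\nabla F_i(x_s) - \nabla f(x_s)\right| \le \tilde U_1 \le \tilde U$ and $\left|\sigma_{L,t}/\sigma_{L,s}\right| \le \tilde U_2 \le \tilde U$, where $\sigma_{L,s}^2 = \mathbb{E}\left\|\nabla F_i(x_s, \xi_s^i) - \nabla F_i(x_s)\right\|^2$ for all $i$. Then we can reach a similar conclusion by following the same analysis on $\Phi$. Specifically,

$$\Phi = \frac{L\eta_L}{2m^2c}\sum_{i=1}^{m}\frac{1}{p_i^s}\left(\alpha_1\zeta_{G,i}^2 + \alpha_2\sigma_{L,i}^2\right),$$

where $\alpha_1$ and $\alpha_2$ are the constants defined in (11). To compare different participation rounds $s$ and $t$, we write the per-round quantities as $\zeta_{G,s}^i$ and $\sigma_{L,s}^i$.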
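To make the cost of stale information concrete, the sketch below evaluates the $\Phi$ expression above twice: once with probabilities built from stale per-client scores (round $s$) and once with probabilities built from the current scores (round $t$). It assumes that the practical probability is proportional to the stale score $\sqrt{\alpha_1(\zeta_{G,s}^i)^2 + \alpha_2(\sigma_{L,s}^i)^2}$; the score values, the drift range, and the constants are hypothetical and chosen only for illustration.

```python
import numpy as np

def phi_with_probs(p, fresh_scores, L=1.0, eta_L=0.01, c=0.5):
    """Phi = L * eta_L / (2 m^2 c) * sum_i fresh_scores_i^2 / p_i."""
    m = len(fresh_scores)
    return L * eta_L / (2.0 * m**2 * c) * np.sum(fresh_scores**2 / p)

# Hypothetical scores: stale values from round s and current values at round t.
rng = np.random.default_rng(1)
m = 50
stale_scores = rng.uniform(0.5, 2.0, size=m)   # assumed score of client i at round s
drift = rng.uniform(0.6, 1.0, size=m)          # assumed ratio (score at t) / (score at s)
fresh_scores = stale_scores * drift            # score of client i at round t

p_stale = stale_scores / stale_scores.sum()    # practical probabilities (stale information)
p_fresh = fresh_scores / fresh_scores.sum()    # theoretical probabilities (current information)

phi_practical = phi_with_probs(p_stale, fresh_scores)
phi_theoretical = phi_with_probs(p_fresh, fresh_scores)
u_tilde = drift.max()                          # empirical counterpart of the bound on the ratio

print(f"Phi (stale probs)   = {phi_practical:.6e}")
print(f"Phi (current probs) = {phi_theoretical:.6e}")
print(f"max score ratio     = {u_tilde:.3f}")
```

Because the current-round probabilities minimize $\Phi$, the stale probabilities can never give a smaller value, which is consistent with the practical algorithm converging no faster than the theoretical one.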
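Then, using the practical sampling probability $p_i^s$ of DELTA and letting $R_i^s = \sqrt{\alpha_1\left(\zeta_{G,s}^i\right)^2 + \alpha_2\left(\sigma_{L,s}^i\right)^2}$, we have

$$\begin{aligned}
\Phi &= \frac{L\eta_L}{2m^2c}\sum_{i=1}^{m}\frac{1}{p_i^s}\left(R_i^t\right)^2 = \frac{L\eta_L}{2m^2c}\sum_{i=1}^{m}\frac{\left(R_i^t\right)^2}{R_i^s}\sum_{j=1}^{m}R_j^s = \frac{L\eta_L}{2m^2c}\sum_{i=1}^{m}\left(\frac{R_i^t}{R_i^s}\right)^2 R_i^s\sum_{j=1}^{m}R_j^s \\
&\le \frac{L\eta_L}{2m^2c}\tilde U^2\sum_{i=1}^{m}R_i^s\sum_{j=1}^{m}R_j^s = \frac{L\eta_L}{2m^2c}\tilde U^2\left(\sum_{i=1}^{m}R_i^s\right)^2 \le \frac{L\eta_L}{2m^2c}\tilde U^2\, m\sum_{i=1}^{m}\left(R_i^s\right)^2 \\
&\le \frac{L\eta_L}{2c}\tilde U^2\left(5KL\eta_L\left(\sigma_{L,s}^2 + 4K\zeta_{G,s}^2\right) + \frac{\eta}{n}\sigma_{L,s}^2\right).
\end{aligned}$$

Therefore, compared with the theoretical algorithm of DELTA, the practical algorithm of DELTA has the following convergence rate:

$$\min_{t\in[T]}\mathbb{E}\|\nabla f(x_t)\|^2 \le O\!\left(\frac{f^0 - f^*}{\sqrt{nKT}}\right) + \underbrace{O\!\left(\frac{\tilde U^2\sigma_{L,s}^2}{\sqrt{nKT}}\right) + O\!\left(\frac{\tilde U^2\sigma_{L,s}^2 + 4K\tilde U^2\zeta_{G,s}^2}{KT}\right)}_{\text{order of } \Phi}.$$

The discussion of the effect of $\tilde U$ on the convergence rate is the same as that of $U$ in Remark F.1.

G EXPERIMENT DETAILS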

Synthetic dataset

We run the experiment on several objective functions defined by different A and b. For each function we use noise levels of 20, 30, and 40 to illustrate our theoretical results. To construct different functions, we set A = 8, 10 and b = 2, 1, respectively, and observe the convergence behavior of each function. We choose 10 out of 20 clients in each round. All algorithms run in the same environment with a fixed learning rate of 0.001. We train each experiment for 2000 rounds so that the global loss converges stably. We display the log of the global loss in Fig 11, where Power-of-Choice is a biased sampling strategy that selects clients with higher loss (Cho et al., 2020). We also show the convergence behavior of the different sampling algorithms under small noise in Fig 12. To be consistent with the cross-device scenario, we further increase the number of clients from 20 to 200 while keeping 10 clients selected in each round. The results in Fig 13 show the effectiveness of DELTA.

The implementation detail of different sampling algorithms. The power-of-choice sampling method is proposed by Cho et al. (2020). It first samples 20 clients uniformly at random from all clients and then selects the 10 of these 20 clients with the largest loss. FedAvg samples clients according to their data ratio and is therefore unbiased, as shown in Fraboni et al. (2021a); Li et al. (2019). For FedIS, the sampling strategy follows (82), and for DELTA, the sampling probability follows (11). For the practical implementation of FedIS and DELTA, the sampling probability follows the strategy described in Section 4.

Split FEMNIST. In this section, we consider the split FEMNIST. We let 10% of the clients own 90% of the data; the detailed data-splitting process is shown below.
• Divide the dataset by labels, for example, divide FEMNIST into 10 groups, and assign each client one label
• Randomly select one client
• Reshuffle the data in the selected client
• Divide the data equally among 100 clients

FEMNIST and CIFAR-10. Specifically, we train a two-layer MLP on split-FEMNIST and a ResNet-18 on split-CIFAR-10. CIFAR-10 consists of 60000 32x32 images with three RGB channels in 10 different classes. The "split" follows the idea introduced in Yu et al. (2019); Hsu et al. (2019), where we leverage Latent Dirichlet Allocation (LDA) to control the distribution drift with the Dirichlet parameter α. A larger α indicates smaller drift. Unless otherwise stated, we set the Dirichlet parameter α = 0.5.

Unless specifically mentioned otherwise, our studies use the following protocol. All datasets are split with parameter α = 0.5, and the server chooses n = 20 clients according to our proposed probability from the total of m = 300 clients.

For CIFAR-10, we report the mean of the best 10 test accuracies on the global test data. In Table 2 we compare the performance of DELTA, FedIS, and FedAvg on non-IID FEMNIST and CIFAR-10. Specifically, we use α = 0.1 for FEMNIST and α = 0.5 for CIFAR-10 to split the datasets. Multinomial Distribution (MD) sampling (Li et al., 2018) samples clients according to their data ratio and aggregates by simple averaging. It is symmetric with FedAvg in sampling and aggregation and has similar performance. DELTA achieves better accuracy than FedIS, while both DELTA and FedIS outperform FedAvg for the same number of communication rounds.

In Table 3, we demonstrate that DELTA and FedIS are compatible with other FL optimization algorithms, e.g., FedProx (Li et al., 2018) and FedMime (Karimireddy et al., 2020a). Moreover, DELTA keeps its superiority in this setting.

In Table 4, we demonstrate that DELTA and FedIS are compatible with other variance reduction algorithms, such as FedVARP (Jhunjhunwala et al., 2022).
It is worth noting that FedVARP uses historical updates to approximate the updates of clients that do not participate. In this setting, however, the improvement that the sampling strategy brings to the results is somewhat reduced, because the sampling strategy is slightly redundant when all users are involved. Therefore, when VARP and DELTA/FedIS are combined, instead of reassigning weights in the aggregation step, we use (82) or (11) to select the clients that update in the current round and then aggregate the updates of all clients by averaging. The combination of DELTA/FedIS and VARP still shows the advantages of sampling.

We also experiment with different choices of the heterogeneity parameter α on CIFAR-10, varying α from 0.1 to 0.5 to 1. We observe a consistent improvement of DELTA in Table 5.

In addition, we experiment with various numbers of participating clients to examine the efficiency of DELTA on the FEMNIST dataset. Here we set α = 1 and choose the number of participating clients from n = 10, 30, 50. As shown in Table 6, DELTA remains the best-performing method across the different numbers of participating clients.
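The Dirichlet-controlled split used for the non-IID experiments above can be sketched as follows. This is the commonly used per-class Dirichlet partition and may differ in detail from the exact procedure used for the reported results; the function name `dirichlet_split`, the synthetic labels, and the seed are hypothetical and serve only as an illustration.

```python
import numpy as np

def dirichlet_split(labels, num_clients, alpha, seed=0):
    """Partition sample indices across clients with per-class Dirichlet(alpha) proportions.

    Smaller alpha concentrates each class on fewer clients (larger drift);
    larger alpha spreads each class more evenly (smaller drift).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Fraction of this class assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cut_points)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Hypothetical usage: 6000 synthetic labels from 10 classes, 20 clients, alpha = 0.5.
labels = np.random.default_rng(42).integers(0, 10, size=6000)
parts = dirichlet_split(labels, num_clients=20, alpha=0.5)
sizes = [len(p) for p in parts]
print("client sizes:", sizes)
```

Every sample is assigned to exactly one client, so the client sizes sum to the dataset size; rerunning with α = 0.1 or α = 1 shows how the label distribution per client becomes more or less skewed.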



With a slight abuse of notation, we use $f(x_t)$ for $f_{S_t}(x_t)$ in this paper.



Figure 1: Difference between IS, cluster-based IS, and our sampling scheme DELTA.

Figure 2: We use a logistic regression model to show the performance of different methods on non-IID MNIST. We sample 10 out of 200 clients and run 500 communication rounds. We report the average of the best 10 accuracies at 100, 300, and 500 rounds, which shows the accuracy from the initial training state to convergence.

Figure 3: Sketch of theoretical analysis flow (Compared with FedIS). The left side represents the analysis flow of FedIS, while the analysis of DELTA is shown on the right. The sampling probability difference comes from the difference in variance.

Figure 4: (a) Overview of objective inconsistency and update gap. The plot shows three quadratic functions, $y = 10x^2$ and $y = 3(x \pm 8)^2$, with the gradient calculated at $x = -2$. The enlarged detail shows the objective inconsistency. (b) Illustration of the different sampling methods. Each client's update is shown by a grey arrow and the ideal global update by the black arrow. It shows that DELTA is better than FedIS and FedAvg.

Figure 5: Performance of different algorithms on the regression model. The loss is calculated by $f(x, y) = \left\| y - \log\left(\frac{(A_i x - b_i)^2}{2}\right) \right\|^2$, with $A = 10$, $b = 1$. We report the logarithm of the global loss under different degrees of gradient noise ν. All methods are well-tuned, and we report the best result of each algorithm under each setting.

Figure 5 demonstrates that these empirical results align with our theoretical analysis. Additional experiments with different functions and settings, as well as the detailed sampling strategies of the different sampling algorithms, can be found in Appendix G.

• DELTA and FedIS outperform other biased and unbiased methods in convergence speed. Both DELTA and FedIS converge faster than FedAvg and Power-of-Choice sampling. The larger the noise (variance), the more pronounced the convergence-speed advantage of DELTA and FedIS. For ν = 30, FedIS converges nearly twice as fast as FedAvg, and for ν = 40, DELTA converges nearly 4× faster than FedAvg.
• DELTA outperforms FedIS. In the experiments, DELTA converges about twice as fast as FedIS in Figure 5(a). As all results show, DELTA reduces more variance than FedIS and thus converges to a smaller loss.

Figure 6: Performance of different sampling methods on Split FEMNIST dataset

Figure 7: Overview of objective inconsistency. Objective inconsistency in FL is caused by client sampling. When Clients 1 and 2 are selected to participate in training, the model $x^{t+1}$ becomes $x^{t+1}_{FedAvg}$ instead of $x^{t+1}_{global}$, resulting in objective inconsistency. Different sampling strategies lead to different surrogate objectives and thus to different biases. From Fig 7(a) we can see that DELTA achieves the minimal bias among the three unbiased sampling methods.

Figure 8: The gradient norm comparison. Both results indicate that cluster-based IS selects clients with small gradients after about half of the training rounds compared to IS.

Figure 10: Sketch of theoretical analysis flow (Compared with FedIS). The left side represents the analysis flow of FedIS, while the analysis of DELTA is shown on the right. The sampling probability difference comes from the difference in variance.

usually relatively small because the gradient tends to zero as training progresses. This means $U$ can be relatively small; more specifically, $U < 1$ in the upper-bound term $O\!\left(\frac{KU^2\sigma_{G,s}^2}{\sqrt{nKT}}\right)$.

Figure 11: Performance of different algorithms on the regression model. The loss is calculated by $f(x, y) = \left\| y - \log\left(\frac{(A_i x - b_i)^2}{2}\right) \right\|^2$.

Figure 12: Performance of different algorithms on the regression model with different (small) noise setting.

Figure 14: Loss performance of DELTA, FedIS and FedAvg on FEMNIST.

There are m = 300 clients in total, and training runs for T = 500 communication rounds with K = 5 local epochs. The default local batch size is 32. The learning rates are the same for all algorithms, specifically $lr_{global} = 1$ and $lr_{local} = 0.01$. All algorithms use FedAvg as the backbone. We compare DELTA and FedIS with FedAvg on different datasets under different settings.

Loss performance on FEMNIST. We compare the loss of DELTA, FedIS and uniform sampling on the non-IID FEMNIST dataset in Fig 14. It shows that DELTA and FedIS converge faster than FedAvg, while DELTA achieves an even lower loss than FedIS.

represents the local objective function of client i over data distribution $D_i$, and $\xi_i$ denotes the sampled data of client i. m is the total number of clients and $w_i$ represents the weight of client i. With partial client participation, FedAvg (McMahan et al., 2017) randomly selects $|S_t| = n$ clients ($n \le m$) to communicate and update the model. Then the loss function of the users actually participating in each round can be expressed as:

Number of communication rounds required to reach ϵ or ϵ + φ accuracy for FL (ϵ for unbiased sampling and ϵ + φ for biased sampling, where φ is a non-convergent constant term). $\sigma_L$ is the local variance bound, and $\sigma_G$ is the bound $\mathbb{E}\|\nabla F_i(x) - \nabla f(x)\|^2 \le \sigma_G^2$. Γ is the distance between the global optimum and the average of the local optima (heterogeneity bound), and μ is the strong-convexity constant. G is the client's gradient bound, and $\zeta_G$ denotes the gradient diversity.

Performance of algorithms. We run 500 communication rounds on FEMNIST and CIFAR10 for each algorithm. We report the mean of maximum 5 accuracies for test datasets and the average number of communication rounds to reach the threshold accuracy.

Performance of algorithms with momentum and prox. We run 500 communication rounds on CIFAR10 for each algorithm. We report the mean of maximum 5 accuracies for test datasets and the number of communication rounds to reach the threshold accuracy.

Performance of DELTA/FedIS in combination with FedVARP. We run 500 communication rounds on FEMNIST with α = 0.1 for each algorithm. We report the mean of maximum 5 accuracies for test datasets and the number of communication rounds to reach the threshold accuracy.

Performance of algorithms under different α. We run 500 communication rounds on CIFAR10 for each algorithm (with momentum). We report the mean of maximum 5 accuracies for test datasets and the number of communication rounds to reach the threshold accuracy.

Performance of algorithms under different participated client number n. We run 500 communication rounds on FEMNIST for each algorithm. We report the mean of maximum 5 accuracies for test datasets and the number of communication rounds to reach the threshold accuracy.

