ADAPTIVE CLIENT SAMPLING IN FEDERATED LEARNING VIA ONLINE LEARNING WITH BANDIT FEEDBACK

Abstract

Due to the high cost of communication, federated learning (FL) systems need to sample a subset of clients that participate in each round of training. As a result, client sampling plays an important role in FL systems, as it affects the convergence rate of the optimization algorithms used to train machine learning models. Despite its importance, there is limited work on how to sample clients effectively. In this paper, we cast client sampling as an online learning task with bandit feedback, which we solve with an online stochastic mirror descent (OSMD) algorithm designed to minimize the sampling variance. We then theoretically show how our sampling method can improve the convergence speed of optimization algorithms. To handle the tuning parameters in OSMD that depend on unknown problem parameters, we use the online ensemble method and the doubling trick. We prove a dynamic regret bound relative to any sampling sequence. The regret bound depends on the total variation of the comparator sequence, which naturally captures the intrinsic difficulty of the problem. To the best of our knowledge, these theoretical contributions are new and the proof technique is of independent interest. Through both synthetic and real data experiments, we illustrate the advantages of the proposed client sampling algorithm over the widely used uniform sampling and existing online-learning-based sampling strategies. The proposed adaptive sampling procedure is applicable beyond the FL problem studied here and can be used to improve the performance of stochastic optimization procedures such as stochastic gradient descent and stochastic coordinate descent.

Footnotes:
1. We use $[M]$ to denote the set $\{1, \ldots, M\}$.
2. In this paper, we assume that all clients are available in each round and that the purpose of client sampling is to reduce the communication cost, which is also the setting considered by some previous research (Chen et al., 2020). In practice, however, it is possible that only a subset of clients is available at the beginning of each round due to physical constraints. In Appendix H.2, we discuss how to extend our proposed methods to deal with such situations. Analyzing such an extension is highly non-trivial and we leave it for future study.
3. Throughout the paper we do not discuss how $g^t_m$ is obtained. One possibility that the reader may keep in mind for concreteness is the LocalUpdate algorithm of Charles & Konečný (2020), which covers well-known algorithms such as mini-batch SGD and FedAvg (McMahan et al., 2017).

1. INTRODUCTION

Modern edge devices, such as personal mobile phones, wearable devices, and sensor systems in vehicles, collect large amounts of data that are valuable for training machine learning models. If each device only uses its local data to train a model, the resulting generalization performance will be limited by the number of samples available on that device. Traditional approaches, where data are transferred to a central server that trains a model on all available data, have fallen out of fashion due to privacy concerns and high communication costs. Federated Learning (FL) has emerged as a paradigm that allows different devices (clients) to collaborate in training a global model while keeping data local and only exchanging model updates (McMahan et al., 2017). In a typical FL process, we have clients that hold data and a central server that orchestrates the training process (Kairouz et al., 2021). The following process is repeated until the model is trained: (i) the server selects a subset of available clients; (ii) the server broadcasts the current model parameters and sometimes also a training program (e.g., a TensorFlow graph (Abadi et al., 2016)); (iii) the selected clients update the model parameters based on their local data; (iv) the local model updates are uploaded to the server; (v) the server aggregates the local updates and makes a global update of the shared model. In this paper, we focus on the first step and develop a practical strategy for selecting clients with provable guarantees. To train a machine learning model in an FL setting with $M$ clients, we would like to minimize the objective

$$F(w) = \sum_{m=1}^{M} \lambda_m \,\phi(w; \mathcal{D}_m), \qquad (1)$$

where $\mathcal{D}_m$ is the local data distribution of client $m$, $\phi(w; \mathcal{D}_m)$ is the local objective, and $\lambda_m \ge 0$ is the weight of client $m$. At the beginning of the $t$-th communication round, the server uses the sampling distribution $p^t = (p^t_1, \ldots, p^t_M)^\top$ to choose $K$ clients by sampling with replacement from $[M]$. Let $S^t \subseteq [M]$ denote the multiset of chosen clients, with $|S^t| = K$.
The server transmits the current model parameter vector $w^t$ to each client $m \in S^t$. Client $m$ computes the local update $g^t_m$ and sends it back to the server. After receiving the local updates from the clients in $S^t$, the server constructs a stochastic estimate of the global gradient as

$$g^t = \frac{1}{K} \sum_{m \in S^t} \frac{\lambda_m}{p^t_m}\, g^t_m, \qquad (2)$$

and makes the global update of the parameter $w^t$ using $g^t$. For example, $w^{t+1} = w^t - \mu_t g^t$ if the server uses stochastic gradient descent (SGD) with the stepsize sequence $\{\mu_t\}_{t \ge 1}$ (Bottou et al., 2018); however, the global update can be obtained using other procedures as well. The sampling distribution in FL is typically uniform over clients, $p^t = p_{\mathrm{unif}} = (1/M, \ldots, 1/M)^\top$. However, nonuniform sampling (also called importance sampling) can lead to faster convergence, both in theory and in practice, as has been illustrated in stochastic optimization (Zhao & Zhang, 2015; Needell et al., 2016). While the sampling distribution can be designed based on prior knowledge (Zhao & Zhang, 2015; Johnson & Guestrin, 2018; Needell et al., 2016; Stich et al., 2017), we cast the problem of choosing the sampling distribution as an online learning task and need no prior knowledge about equation 1. Existing approaches to designing a sampling distribution using online learning focus on estimating the best sampling distribution under the assumption that it does not change during training. However, the best sampling distribution changes across iterations, and a stationary target distribution does not capture the best sampling distribution in each round. In the existing literature, the best fixed distribution in hindsight is used as the comparator to measure the performance of the algorithm that designs the sampling distribution. Here, we instead measure the performance of the proposed algorithm against the best dynamic sampling distribution.
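To make the aggregation step concrete, the following minimal NumPy sketch (with toy numbers of our choosing; `lam`, `g`, and `p` are illustrative placeholders, not values from the paper) forms the estimate $g^t$ from a sampled multiset $S^t$ and verifies unbiasedness by computing the expectation of a single draw exactly:

```python
import numpy as np

# Toy instance: M clients, each with a local update g_m^t and weight lambda_m.
M, K = 5, 2
rng = np.random.default_rng(0)
g = rng.normal(size=(M, 3))                 # local updates g_m^t (one row per client)
lam = np.full(M, 1.0 / M)                   # client weights lambda_m
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])   # sampling distribution p^t

# Oracle update J^t = sum_m lambda_m g_m^t (never formed in practice).
J = (lam[:, None] * g).sum(axis=0)

# One realization of the stochastic estimate g^t (sampling with replacement).
S = rng.choice(M, size=K, p=p)
g_hat = (lam[S, None] / p[S, None] * g[S]).mean(axis=0)

# Unbiasedness, checked exactly: each of the K draws has expectation
# sum_m p_m * (lambda_m / p_m) * g_m = J, so E[g^t] = J^t.
E_g_hat = ((p * lam / p)[:, None] * g).sum(axis=0)
assert np.allclose(E_g_hat, J)
```

Note that the variance of $g^t$, unlike its mean, does depend on $p^t$, which is what the rest of the paper exploits.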
We use an online stochastic mirror descent (OSMD) algorithm to generate a sequence of sampling distributions and prove a regret bound, relative to any dynamic comparator sequence, that involves a total variation term characterizing the intrinsic difficulty of the problem. To the best of our knowledge, this is the first bound on the dynamic regret with such an intrinsic-difficulty characterization in importance sampling. Moreover, we theoretically show how our sampling method improves the convergence guarantees of optimization methods by reducing the dependence on the heterogeneity of the problem.

1.1. CONTRIBUTIONS

We develop an algorithm based on OSMD that generates a sequence of sampling distributions $\{p^t\}_{t \ge 1}$ based on the partial feedback available to the server from the sampled clients. We prove a bound on the regret relative to any dynamic comparator sequence, which allows us to consider the best sequence of sampling distributions as it changes over iterations. The bound includes a total variation term that characterizes the intrinsic difficulty of the problem by capturing the difficulty of following the best sequence of distributions. Such a characterization of problem difficulty is novel. Moreover, our theoretical result recovers the results of previous research as special cases and is thus strictly more general. We also theoretically improve the convergence guarantee of an optimization algorithm by using our sampling scheme instead of uniform sampling, showing that adaptive sampling can reduce the dependence on the heterogeneity level of the problem. Finally, we make contributions in experiments and in practical parameter-tuning strategies. See Appendix A.1 for a more detailed discussion of our contributions.

1.2. RELATED WORK

Our paper is related to client sampling in FL, importance sampling in stochastic optimization, and online convex optimization. We summarize only the most relevant literature, without attempting to provide an extensive survey. Due to space limits, we only summarize the first direction here and defer the discussion of importance sampling in stochastic optimization and online convex optimization to Appendix B. For client sampling, Chen et al. (2020) proposed to use the theoretically optimal sampling distribution to choose clients. However, their method requires all clients to compute local updates in each round, which is impractical due to stragglers. Ribero & Vikalo (2020) modelled the parameters of the model during training by an Ornstein-Uhlenbeck process, which was then used to derive an optimal sampling distribution. Cho et al. (2020b) developed a biased client selection strategy and analyzed its convergence; because the selection is biased, the algorithm has a non-vanishing bias and is not guaranteed to converge to the optimum. Moreover, it needs to involve more clients than our method and is thus more expensive in both communication and computation. Kim et al. (2020); Cho et al. (2020a); Yang et al. (2020) considered client sampling as a multi-armed bandit problem, but provided only limited theoretical results. Wang et al. (2020) used reinforcement learning for client sampling with the objective of maximizing accuracy while minimizing the number of communication rounds.

1.3. NOTATION

Let $\mathbb{R}^M_+ = [0, \infty)^M$ and $\mathbb{R}^M_{++} = (0, \infty)^M$. For $M \in \mathbb{N}_+$, let $\mathcal{P}_{M-1} := \{x \in \mathbb{R}^M_+ : \sum_{i=1}^M x_i = 1\}$ be the $(M-1)$-dimensional probability simplex. We use $p = (p_1, \ldots, p_M)^\top$ to denote a sampling distribution supported on $[M] := \{1, \ldots, M\}$, and $p^{1:T}$ to denote a sequence of sampling distributions $\{p^t\}_{t=1}^T$. Let $\Phi : \mathcal{D} \subseteq \mathbb{R}^M \to \mathbb{R}$ be a differentiable convex function defined on a convex open set $\mathcal{D}$, and let $\bar{\mathcal{D}}$ denote the closure of $\mathcal{D}$.
The Bregman divergence between any $x, y \in \mathcal{D}$ with respect to the function $\Phi$ is given as $D_\Phi(x \,\|\, y) = \Phi(x) - \Phi(y) - \langle \nabla\Phi(y), x - y \rangle$. The unnormalized negative entropy is denoted as $\Phi_e(x) = \sum_{m=1}^M x_m \log x_m - \sum_{m=1}^M x_m$ for $x = (x_1, \ldots, x_M)^\top \in \mathcal{D} = \mathbb{R}^M_+$, with $0 \log 0$ defined as $0$. We use $\| \cdot \|_p$ to denote the $L_p$-norm for $1 \le p \le \infty$: for $x \in \mathbb{R}^n$, $\|x\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$ when $1 < p < \infty$, $\|x\|_1 = \sum_{i=1}^n |x_i|$, and $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$. Given an $L_p$-norm $\| \cdot \|$, we define its dual norm as $\|z\|_\star := \sup\{z^\top x : \|x\| \le 1\}$. For two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ if there exists $C > 0$ such that $|a_n / b_n| \le C$ for all $n$ large enough, and $a_n = \Theta(b_n)$ if $a_n = O(b_n)$ and $b_n = O(a_n)$ simultaneously. Similarly, $a_n = \tilde{O}(b_n)$ if $a_n = O(b_n \log^k b_n)$ for some $k \ge 0$, and $a_n = \tilde{\Theta}(b_n)$ if $a_n = \tilde{O}(b_n)$ and $b_n = \tilde{O}(a_n)$ simultaneously.
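As a quick sanity check on these definitions, the following NumPy sketch (with toy vectors of our choosing) evaluates the Bregman divergence induced by the unnormalized negative entropy and confirms that it coincides with the generalized KL divergence, which is nonnegative on the simplex by Pinsker's inequality:

```python
import numpy as np

def phi_e(x):
    """Unnormalized negative entropy: sum x log x - sum x (with 0 log 0 := 0)."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)) - x.sum()

def bregman(x, y):
    """D_Phi(x || y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>; grad Phi_e(y) = log y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return phi_e(x) - phi_e(y) - np.dot(np.log(y), x - y)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

# For Phi_e the Bregman divergence equals the generalized KL divergence
# sum x log(x/y) - sum x + sum y.
gen_kl = np.sum(x * np.log(x / y)) - x.sum() + y.sum()
assert np.isclose(bregman(x, y), gen_kl)

# On the simplex this reduces to the usual KL divergence, and Pinsker's
# inequality gives the lower bound 0.5 * ||x - y||_1^2.
assert bregman(x, y) >= 0.5 * np.sum(np.abs(x - y)) ** 2
```

The Pinsker lower bound checked here is what makes $\Phi_e$ $1$-strongly convex on the simplex with respect to $\| \cdot \|_1$, a fact used later in the regret analysis.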

1.4. ORGANIZATION OF THE PAPER

We motivate importance sampling in FL, introduce an adaptive client sampling algorithm, and establish a bound on the dynamic regret in Section 2. We derive the optimization guarantee of mini-batch SGD with our sampling scheme in Section 3. Due to space limits, we defer additional content to the appendix. More specifically, we give an additional summary of related work in Appendix B. In Appendix C.3 and Appendix C.4, we design two extensions of the sampling algorithm that make it adaptive to the unknown problem parameters; detailed algorithm descriptions and theoretical properties of these adaptive methods are given in Appendix C. We provide experimental results on synthetic data in Appendix F and on real-world data in Appendix G. Sampling without replacement is discussed in Appendix C.5 and Appendix F.4. Finally, we give conclusions and discuss future directions in Appendix H.

2. ADAPTIVE CLIENT SAMPLING

We show how to cast the client sampling problem as an online learning task. Subsequently, we solve the online learning problem using an OSMD algorithm and provide a regret analysis for it.

2.1. CLIENT SAMPLING AS AN ONLINE LEARNING PROBLEM

Recall that at the beginning of the $t$-th communication round, the server uses a sampling distribution $p^t$ to choose a multiset of clients $S^t$, by sampling with replacement $K$ clients from $[M]$, to update the parameter vector $w^t$. For a chosen client $m \in S^t$, the local update is denoted $g^t_m$. For example, the local update $g^t_m = \nabla\phi(w^t; \mathcal{D}_m)$ may be the full gradient; when mini-batch SGD/FedSGD is used, $g^t_m = (1/B) \sum_{b=1}^B \nabla\phi(w^t; \xi^t_{m,b})$, where $\xi^t_{m,b} \overset{\text{i.i.d.}}{\sim} \mathcal{D}_m$ and $B$ is the batch size; when FedAvg is used, $g^t_m = (w^t_{m,B} - w^t)/\eta_{\mathrm{local}}$, where $w^t_{m,b} = w^t_{m,b-1} - \eta_{\mathrm{local}} \nabla\phi(w^t_{m,b-1}; \xi^t_{m,b-1})$ for $b \in [B]$, $w^t_{m,0} = w^t$, $\xi^t_{m,b} \overset{\text{i.i.d.}}{\sim} \mathcal{D}_m$, and $\eta_{\mathrm{local}}$ is the local stepsize. We define the aggregated oracle update at the $t$-th communication round as $J^t = \sum_{m=1}^M \lambda_m g^t_m$. The oracle update $J^t$ is constructed only for theoretical purposes and is not computed in practice. The stochastic estimate $g^t$, defined in equation 2, is an unbiased estimate of $J^t$, that is, $\mathbb{E}_{S^t}[g^t] = J^t$. Note that we only consider the randomness of $S^t$ and treat $g^t_m$ as given. The variance of $g^t$ is

$$\mathbb{V}_{S^t}[g^t] = \frac{1}{K}\left( \sum_{m=1}^M \frac{\lambda_m^2 \|g^t_m\|^2}{p^t_m} - \|J^t\|^2 \right). \qquad (3)$$

Our goal is to design the sampling distribution $p^t$, used to sample $S^t$, so as to minimize the variance in equation 3. In doing so, we can ignore the second term, as it does not depend on $p^t$. Let $a^t_m = \lambda_m^2 \|g^t_m\|^2$. For any sampling distribution $q = (q_1, \ldots, q_M)^\top$, the variance-reduction loss is defined as $l_t(q) = \frac{1}{K} \sum_{m=1}^M \frac{a^t_m}{q_m}$. Given a sequence of sampling distributions $q^{1:T}$, the cumulative variance-reduction loss is defined as $L(q^{1:T}) := \sum_{t=1}^T l_t(q^t)$; when the choice of $q^{1:T}$ is random, the expected cumulative variance-reduction loss is $\bar{L}(q^{1:T}) := \mathbb{E}[L(q^{1:T})]$. The variance-reduction loss appears in bounds on the sub-optimality of stochastic optimization algorithms. As a motivating example, suppose $F(\cdot)$ in equation 1 is $\sigma$-strongly convex.
Furthermore, suppose the local update $g^t_m = \nabla\phi(w^t; \mathcal{D}_m)$ is the full gradient of the local loss and the global update is made by SGD with stepsize $\mu_t = 2/(\sigma t)$. Theorem 3 of Salehi et al. (2017) then states that for any $T \ge 1$,

$$\mathbb{E}\left[ F\left( \frac{2}{T(T+1)} \sum_{t=1}^T t\, w^t \right) \right] - F(w^\star) \le \frac{2}{\sigma T(T+1)} \bar{L}(p^{1:T}),$$

where $w^\star$ is the minimizer of the objective in equation 1. Therefore, by choosing the sequence of sampling distributions $p^{1:T}$ to make $\bar{L}(p^{1:T})$ small, one can achieve faster convergence. This observation holds in other stochastic optimization problems as well. We develop an algorithm that creates a sequence of sampling distributions $p^{1:T}$ to minimize $\bar{L}(p^{1:T})$ using only the norms of the local updates, without imposing assumptions on the loss functions or on how the local and global updates are made. As a result, the algorithm can be applied to design sampling distributions for essentially any stochastic optimization procedure. In Section 3, we show how our sampling method improves the upper bound of mini-batch SGD for non-convex objectives. Suppose that at the beginning of the $t$-th communication round we knew all $\{a^t_m\}_{m=1}^M$. Then the optimal sampling distribution $p^t_\star = (p^t_{\star,1}, \ldots, p^t_{\star,M})^\top = \arg\min_{p \in \mathcal{P}_{M-1}} l_t(p)$ is given by $p^t_{\star,m} = \sqrt{a^t_m} \big/ \sum_{m'=1}^M \sqrt{a^t_{m'}}$. Computing the distribution $p^t_\star$ is impractical, as it requires the local updates of all clients, which would eradicate the need for client sampling. From the form of $p^t_\star$, we observe that clients with a large $a^t_m$ are more "important" and should have a higher probability of being selected. Since we do not know $\{a^t_m\}_{m=1}^M$, we need to explore the environment to learn about the importance of clients before we can exploit the best strategy. Finally, we note that the relative importance of clients changes over time, which makes the environment dynamic and challenging.
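The optimal distribution above can be verified numerically; here is a minimal sketch (toy values of $a^t_m$ chosen for illustration) that compares the variance-reduction loss of the optimal distribution against uniform sampling:

```python
import numpy as np

# Toy round: a_m^t = lambda_m^2 ||g_m^t||^2 for M clients, K sampled per round.
a = np.array([4.0, 1.0, 0.25, 9.0])
M, K = len(a), 2

def loss(q):
    """Variance-reduction loss l_t(q) = (1/K) * sum_m a_m / q_m."""
    return np.sum(a / q) / K

# Minimizing over the simplex (a Cauchy-Schwarz argument) gives
# p_m proportional to sqrt(a_m), with optimal value (sum_m sqrt(a_m))^2 / K.
p_star = np.sqrt(a) / np.sqrt(a).sum()
p_unif = np.full(M, 1.0 / M)

assert np.isclose(loss(p_star), np.sqrt(a).sum() ** 2 / K)
assert loss(p_star) <= loss(p_unif)
```

For these toy numbers the optimal loss is $(2 + 1 + 0.5 + 3)^2 / 2 = 21.125$ versus $28.5$ under uniform sampling, illustrating the variance saved by importance sampling.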
Based on the above discussion, we cast the problem of creating a sequence of sampling distributions as an online learning task with bandit feedback, where a game is played between the server and the environment. Let $p^1$ be the initial sampling distribution. At the beginning of iteration $t$, the server samples with replacement $K$ clients from $[M]$, denoted $S^t$, using $p^t$. The environment reveals $\{a^t_m\}_{m \in S^t}$ to the server, where $a^t_m = \lambda_m^2 \|g^t_m\|^2$. The environment also computes $l_t(p^t)$; however, this loss is not revealed to the server. The server then updates $p^{t+1}$ based on the feedback $\{\{a^u_m\}_{m \in S^u}\}_{u=1}^t$ and the sampling distributions $\{p^u\}_{u=1}^t$. Note that in this game, the server only receives information about the chosen clients and, based on this partial information, or bandit feedback, must update the sampling distribution. On the other hand, we would like to be competitive with an oracle that can calculate the cumulative variance-reduction loss. We design $p^t$ in a way that is agnostic to the generation mechanism of $\{a^t\}_{t \ge 1}$, and we treat the environment as deterministic, with randomness coming only from $\{S^t\}_{t \ge 1}$. We next describe an OSMD-based approach to solving this online learning problem.

2.2. OSMD SAMPLER

The variance-reduction loss $l_t$ is a convex function on $\mathcal{P}_{M-1}$ and

$$\nabla l_t(q) = -\frac{1}{K} \left( \frac{a^t_1}{q_1^2}, \ldots, \frac{a^t_M}{q_M^2} \right)^\top \in \mathbb{R}^M$$

for all $q = (q_1, \ldots, q_M)^\top \in \mathbb{R}^M_{++}$. Since we do not observe $a^t$, we cannot compute $l_t(\cdot)$ or $\nabla l_t(\cdot)$. Instead, we can construct unbiased estimates of them. For any $q \in \mathcal{P}_{M-1}$, let $\hat{l}_t(q; p^t)$ be an estimate of $l_t(q)$ defined as

$$\hat{l}_t(q; p^t) = \frac{1}{K^2} \sum_{m=1}^M \frac{a^t_m}{q_m\, p^t_m}\, N\{m \in S^t\}, \qquad (6)$$

and let $\nabla\hat{l}_t(q; p^t) \in \mathbb{R}^M$ have $m$-th entry

$$\left[\nabla\hat{l}_t(q; p^t)\right]_m = -\frac{1}{K^2} \cdot \frac{a^t_m}{q_m^2\, p^t_m}\, N\{m \in S^t\}, \qquad (7)$$

where the multiset $S^t$ is sampled with replacement from $[M]$ using $p^t$ and $N\{m \in S^t\}$ denotes the number of times client $m$ is chosen in $S^t$; thus, $0 \le N\{m \in S^t\} \le K$. Given $q$ and $p^t$, $\hat{l}_t(q; p^t)$ and $\nabla\hat{l}_t(q; p^t)$ are random variables in $\mathbb{R}$ and $\mathbb{R}^M$ that satisfy $\mathbb{E}_{S^t}[\hat{l}_t(q; p^t) \mid p^t] = l_t(q)$ and $\mathbb{E}_{S^t}[\nabla\hat{l}_t(q; p^t) \mid p^t] = \nabla l_t(q)$. Moreover, given $p^t \in \mathbb{R}^M_{++}$, $\hat{l}_t(q; p^t)$ is a convex function of $q$ on $\mathbb{R}^M_{++}$ and satisfies $\hat{l}_t(q; p^t) - \hat{l}_t(q'; p^t) \le \langle \nabla\hat{l}_t(q; p^t), q - q' \rangle$ for $q, q' \in \mathbb{R}^M_{++}$. The constructed estimates $\hat{l}_t(q; p^t)$ and $\nabla\hat{l}_t(q; p^t)$ are crucial for designing updates to the sampling distribution. To the best of our knowledge, while estimators similar to $\hat{l}_t(q; p^t)$ in equation 6 have been used in previous literature (Borsos et al., 2018), we are the first to propose $\nabla\hat{l}_t(q; p^t)$ in equation 7 and to use it for updating the sampling distribution. The OSMD Sampler is an online stochastic mirror descent algorithm for updating the sampling distribution, detailed in Algorithm 1; its inputs are a sequence of learning rates $\{\eta_t\}_{t \ge 1}$, a parameter $\alpha \in (0, 1]$, and an initial distribution $p^1 \in \mathcal{A}$. The sampling distribution is restricted to lie in the set $\mathcal{A} = \mathcal{P}_{M-1} \cap [\alpha/M, \infty)^M$, $\alpha \in (0, 1]$, to prevent the server from assigning small probabilities to devices. Let $\Phi : \mathcal{D} \subseteq \mathbb{R}^M \to \mathbb{R}$ be a continuously differentiable convex function defined on $\mathcal{D}$, with $\mathcal{A} \subseteq \mathcal{D}$.
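The unbiasedness of the estimates in equations 6 and 7 can be checked exactly, since sampling with replacement gives $\mathbb{E}[N\{m \in S^t\}] = K p^t_m$. Here is a minimal sketch with toy numbers (all values illustrative):

```python
import numpy as np

M, K = 4, 3
a = np.array([2.0, 0.5, 1.0, 4.0])       # a_m^t = lambda_m^2 ||g_m^t||^2
p = np.array([0.1, 0.2, 0.3, 0.4])       # current sampling distribution p^t
q = np.full(M, 0.25)                     # point at which loss/gradient are estimated

loss = np.sum(a / q) / K                 # l_t(q)
grad = -a / (K * q ** 2)                 # grad l_t(q), entrywise -a_m / (K q_m^2)

def loss_hat(N):
    """Estimate (1/K^2) sum_m a_m / (q_m p_m) * N{m in S^t}; N holds multiplicities."""
    return np.sum(a / (q * p) * N) / K ** 2

# With replacement, E[N_m] = K p_m, so plugging the expected counts into the
# (linear-in-N) estimators must recover the exact loss and gradient.
E_N = K * p
assert np.isclose(loss_hat(E_N), loss)

E_grad_hat = -a / (K ** 2 * q ** 2 * p) * E_N
assert np.allclose(E_grad_hat, grad)
```

Because both estimators are linear in the counts $N\{m \in S^t\}$, evaluating them at the expected counts is equivalent to taking the expectation, which is what the assertions confirm.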
The learning rates $\{\eta_t\}_{t \ge 1}$ are positive and nonincreasing. Line 7 of Algorithm 1 updates the sampling distribution using the mirror descent step

$$p^{t+1} = \arg\min_{p \in \mathcal{A}} \; \eta_t \left\langle \nabla\hat{l}_t(p^t; p^t), p \right\rangle + D_\Phi(p \,\|\, p^t),$$

in which the available feedback is used to construct an estimate of the loss, while the Bregman divergence between the next and the current sampling distribution acts as a regularizer, ensuring that the updated sampling distribution does not change too much. The update only uses the most recent information, while forgetting the history, which results in nonstationarity of the sequence of sampling distributions. In Line 5 of Algorithm 1, we choose $S^t$ by sampling with replacement; in Appendix C.5, we discuss how to extend the results to sampling without replacement. The mirror descent update in Line 7 is not available in closed form in general, and an iterative solver may be needed. However, when $\Phi(\cdot)$ is chosen as the negative entropy $\Phi_e(\cdot)$ (our default choice), an efficient closed-form solution can be obtained; an efficient implementation is shown in Algorithm 2 in Appendix C.1. The main cost comes from sorting the intermediate sequence $\{\tilde{p}^{t+1}_m\}_{m=1}^M$, which can be done with computational complexity $O(M \log M)$. However, note that only a few entries of $p^t$ are updated to obtain $\tilde{p}^{t+1}$, and $p^t$ is sorted; therefore, most entries of $\tilde{p}^{t+1}$ are also sorted. Using this observation, we can usually achieve a much faster running time, for example, by using an adaptive sorting algorithm (Estivill-Castro & Wood, 1992). Next, we provide a bound on the dynamic regret of the OSMD Sampler.
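A minimal sketch of one such update with the entropic mirror map is given below. The multiplicative update and the form of the KL projection onto $\mathcal{A}$ follow from the KKT conditions; for simplicity this sketch finds the projection's scaling constant by bisection rather than by the sorting-based routine of Algorithm 2, and all numeric values are toy choices:

```python
import numpy as np

def kl_project(p_tilde, alpha):
    """KL projection of a positive vector onto A = {p : sum p = 1, p_m >= alpha/M}.
    KKT conditions give p_m = max(alpha/M, gamma * p_tilde_m); find gamma by bisection."""
    M = len(p_tilde)
    floor = alpha / M
    lo, hi = 0.0, 2.0 / p_tilde.min()            # sum(gamma=0) = alpha <= 1 <= sum(gamma=hi)
    for _ in range(100):
        gamma = (lo + hi) / 2
        s = np.maximum(floor, gamma * p_tilde).sum()
        lo, hi = (gamma, hi) if s < 1 else (lo, gamma)
    return np.maximum(floor, (lo + hi) / 2 * p_tilde)

def osmd_step(p, a_obs, counts, eta, K, alpha):
    """One entropic mirror-descent step: multiplicative update, then projection onto A.
    a_obs[m] * counts[m] encodes the bandit feedback (counts = 0 for unsampled clients)."""
    grad_hat = -a_obs * counts / (K ** 2 * p ** 3)   # entry m of grad l_hat at q = p^t
    p_tilde = p * np.exp(-eta * grad_hat)            # unconstrained entropic update
    return kl_project(p_tilde, alpha)

M, K, alpha = 4, 2, 0.4
p = np.full(M, 0.25)
counts = np.array([0, 1, 0, 1])                      # multiplicities N{m in S^t}
a_obs = np.array([0.0, 0.5, 0.0, 4.0])               # a_m^t revealed for sampled clients
p_next = osmd_step(p, a_obs, counts, eta=0.01, K=K, alpha=alpha)

assert np.isclose(p_next.sum(), 1.0)
assert (p_next >= alpha / M - 1e-9).all()            # no client falls below alpha/M
assert p_next[3] > p_next[1] > p_next[0]             # larger observed a_m -> higher probability
```

The bisection keeps the sketch short; the sorting-based solution referenced in the text achieves the same projection in $O(M \log M)$ deterministic time.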

2.3. DYNAMIC REGRET OF OSMD SAMPLER

We first describe the dynamic regret used to measure the performance of an online algorithm that generates a sequence of sampling distributions $\{p^t\}_{t \ge 1}$ in a non-stationary environment. Given any comparator sequence $q^{1:T} \in \mathcal{P}^T_{M-1}$, the dynamic regret is defined as

$$\text{D-Regret}_T(q^{1:T}) = \bar{L}(\tilde{p}^{1:T}) - \bar{L}(q^{1:T}), \qquad (8)$$

where $\tilde{p}^{1:T}$ is the sequence generated by the algorithm. In contrast, the static regret measures the performance of an algorithm relative to the best fixed sampling distribution, that is, it restricts $q^1 = \cdots = q^T$ (Namkoong et al., 2017; Salehi et al., 2017; Borsos et al., 2018; 2019). When using a fixed comparator $q^1 = \cdots = q^T = q$, we write the regret as $\text{D-Regret}_T(q)$; we also write $\text{D-Regret}_T$ to denote $\text{D-Regret}_T(p^{1:T}_\star)$. The following quantity describes the dynamic complexity of a comparator sequence and appears in the regret bound below.

Definition 2.1 (Total Variation). The total variation of a comparator sequence $q^{1:T}$ with respect to the norm $\| \cdot \|$ on $\mathbb{R}^M$ is $\mathrm{TV}(q^{1:T}) = \sum_{t=1}^{T-1} \|q^{t+1} - q^t\|$.

The total variation measures how variable a sequence is: the larger $\mathrm{TV}(q^{1:T})$, the more variable $q^{1:T}$ is, and the harder such a comparator sequence is to match. To give an upper bound on the dynamic regret of the OSMD Sampler, we need the following assumptions.

Assumption 1. The function $\Phi(\cdot)$ is $\rho$-strongly convex with respect to $\| \cdot \|$ for some $\rho > 0$, that is, $\Phi(x) \ge \Phi(y) + \langle \nabla\Phi(y), x - y \rangle + \frac{\rho}{2}\|x - y\|^2$ for all $x, y \in \mathcal{D}$.

Together with Assumption 2 and the quantities $D_{\max}$, $H$, $Q_t$, $\omega(q^t, \alpha)$, and $\psi(q^t, \alpha)$ (see Appendix D.2), we define

$$\phi(q^t, \alpha) := \frac{\omega(q^t, \alpha)}{1 - \omega(q^t, \alpha)\left(1 - \frac{\alpha}{M}\right)}. \qquad (9)$$

We will use these quantities to characterize the projection error in the following theorem, which is the main result of this section.

Theorem 1. Suppose Assumptions 1-2 hold and we use $\| \cdot \| = \| \cdot \|_1$ to define the total variation. Assume that $\{\eta_t\}_{t \ge 1}$ is a nonincreasing sequence. Let $\tilde{p}^{1:T}$ be the sequence generated by Algorithm 1. For any comparator sequence $q^{1:T}$, where $q^t$ is allowed to be deterministic or random, we have

$$\text{D-Regret}_T(q^{1:T}) \le \underbrace{\frac{D_{\max}}{\eta_1} + \frac{2H}{\eta_T}\,\mathbb{E}\!\left[\mathrm{TV}(q^{1:T})\right] + \frac{2}{\rho} \sum_{t=1}^T \eta_t\, \mathbb{E}\!\left[Q_t^2\right]}_{\text{intrinsic regret}} + \underbrace{\frac{8H}{\eta_T} \sum_{t=1}^T \mathbb{E}\!\left[\psi(q^t, \alpha)\right] + \sum_{t=1}^T \mathbb{E}\!\left[\phi(q^t, \alpha)\, l_t(q^t)\right]}_{\text{projection error}}.$$
Proof. The major challenge of the proof is to construct a projection of the comparator sequence $q^{1:T}$ onto $\mathcal{A}^T$ and bound the projection error. To the best of our knowledge, this bound on the projection error of a dynamic sequence is novel. Another challenge is the dynamic comparator, which requires us to connect the cumulative regret with the total variation of the comparator sequence. See Appendix D.2 for more details.

From Theorem 1, we see that the bound on the dynamic regret consists of two parts. The first part is the intrinsic regret, quantifying the difficulty of tracking a comparator sequence in $\mathcal{A}^T$; the second part is the projection error, arising from projecting the comparator sequence onto $\mathcal{A}^T$. Note that the intrinsic regret depends on $\alpha$ through $D_{\max}$, $H$, and $\{Q_t\}_{t \ge 1}$. As shown in Appendix D.2, we have $0 \le \omega(q^t, \alpha) \le 1$ for all $\alpha \in [0, 1]$, which implies that $\phi(q^t, \alpha) \le M/\alpha$. Furthermore, $\psi(q^t, \alpha) \le \sum_{m=1}^M (\alpha/M)\, \mathbb{1}\{q^t_m < \alpha/M\} \le \alpha$. Therefore, the projection error can be upper bounded by $(8H\alpha)/\eta_T + (M/\alpha) \sum_{t=1}^T l_t(q^t)$. More importantly, when $q^t \in \mathcal{A}$, we have $\psi(q^t, \alpha) = \omega(q^t, \alpha) = \phi(q^t, \alpha) = 0$. Thus, when the comparator sequence belongs to $\mathcal{A}^T$, the projection error vanishes and only the intrinsic regret remains. As $\alpha$ decreases from one to zero, the intrinsic regret gets larger (it often tends to infinity, as shown in Corollary 3), while we allow a larger class of comparator sequences; on the other hand, the projection error decreases to zero, since the gap between $\mathcal{A}$ and $\mathcal{P}_{M-1}$ vanishes with $\alpha$. An optimal choice of $\alpha$ balances the two sources of regret. When $\| \cdot \| = \| \cdot \|_1$ and $\Phi$ is the unnormalized negative entropy, Step 7 of Algorithm 1 has a closed-form solution (see Proposition 1). By Pinsker's inequality, the unnormalized negative entropy $\Phi_e(x)$ is $1$-strongly convex on $\mathcal{P}_{M-1}$ with respect to $\| \cdot \|_1$. When $q^{1:T}$ is a deterministic sequence, we have $\mathbb{E}[\mathrm{TV}(q^{1:T})] = \mathrm{TV}(q^{1:T})$.
With Theorem 1 in hand, we have the following corollary.

Corollary 1. Suppose the conditions of Theorem 1 hold and let $\tilde{p}^{1:T}$ be the sequence generated by Algorithm 1. For any comparator sequence $q^{1:T}$, choose $\alpha$ such that $q^t \in \mathcal{A}$ for all $t \in [T]$, and let

$$\eta = \sqrt{ \frac{K^2 \alpha^3}{M^3} \cdot \frac{\log M + 2 \log(M/\alpha)\, \mathbb{E}\!\left[\mathrm{TV}(q^{1:T})\right]}{2 \sum_{t=1}^T \mathbb{E}\!\left[(\bar{a}^t)^2\right]} }. \qquad (10)$$

Then

$$\text{D-Regret}_T(q^{1:T}) \le 2 \sqrt{ \frac{2M^3}{K^2 \alpha^3} \left[ \log M + 2 \log(M/\alpha)\, \mathbb{E}\!\left[\mathrm{TV}(q^{1:T})\right] \right] \sum_{t=1}^T \mathbb{E}\!\left[(\bar{a}^t)^2\right] },$$

where $\bar{a}^t := \max_{1 \le m \le M} \lambda_m^2 \|g^t_m\|^2 = \max_{1 \le m \le M} a^t_m$ for all $t \in [T]$.

Proof. The proof follows directly from Corollary 3 in Appendix D.3.

Note that as training proceeds, the norms of the local updates decrease, so $\{\bar{a}^t\}_{t=1}^T$ is typically a decreasing sequence. A naive upper bound is therefore $\sum_{t=1}^T \mathbb{E}[(\bar{a}^t)^2] = O(T)$. However, in Appendix F.6 we empirically show that $\bar{a}^t$ decreases fast and the cumulative sum of squares converges to a constant, that is, $\sum_{t=1}^T \mathbb{E}[(\bar{a}^t)^2] = O(1)$. This further implies that the static regret with respect to the best fixed sampling distribution in hindsight (where $\mathrm{TV}(q^{1:T}) = 0$) is empirically $O(1)$, which is much better than the rates in previous research (Salehi et al., 2017; Borsos et al., 2018). Moreover, our regret analysis also allows for sampling distributions that change over time (where $\mathrm{TV}(q^{1:T}) > 0$), and is thus more general than previous results. The choice of learning rate in equation 10 depends on quantities that are unknown prior to training, and is thus impractical. In Appendix C.3 and Appendix C.4, we introduce practical automatic parameter-tuning strategies based on the online ensemble method and the doubling trick.
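The total-variation quantity that drives these bounds (Definition 2.1, with the $L_1$ norm) is straightforward to compute; a small sketch with toy comparator sequences:

```python
import numpy as np

def total_variation(q_seq):
    """TV(q^{1:T}) = sum_{t=1}^{T-1} ||q^{t+1} - q^t||_1 (Definition 2.1, L1 norm)."""
    q_seq = np.asarray(q_seq, dtype=float)
    return np.abs(np.diff(q_seq, axis=0)).sum()

# A fixed comparator (the static-regret setting) has zero total variation ...
fixed = [[0.5, 0.5]] * 4
assert total_variation(fixed) == 0.0

# ... while a drifting comparator pays for every move it makes.
drifting = [[0.5, 0.5], [0.6, 0.4], [0.8, 0.2]]
assert np.isclose(total_variation(drifting), 0.2 + 0.4)
```

Plugging $\mathrm{TV} = 0$ into Corollary 1 recovers a static-regret bound, while a positive total variation inflates both the optimal learning rate and the regret, matching the intuition that more variable comparators are harder to track.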

3. CONVERGENCE ANALYSIS OF MINI-BATCH SGD WITH OSMD SAMPLER

We illustrate how the OSMD Sampler can be used to provably improve the convergence rate of mini-batch SGD. The detailed algorithm is given in Algorithm 3 in Appendix C.2. We use mini-batch SGD as a motivating example to show how adaptive sampling improves the convergence guarantee of an optimization algorithm; the analysis in this section can be extended to other optimization algorithms as well. To simplify notation, we denote $F_m(w) := \phi(w; \mathcal{D}_m)$ and let $\lambda_m = 1/M$ for all $m \in [M]$. We assume that $w \in \mathcal{W} \subset \mathbb{R}^d$, where $\mathcal{W}$ is a compact set, and that the client objectives are differentiable and $L$-smooth.

Assumption 3. For all $m \in [M]$, $F_m(\cdot)$ is differentiable and $L$-smooth, that is, $\|\nabla F_m(x) - \nabla F_m(y)\| \le L \|x - y\|$ for all $x, y \in \mathcal{W}$.

Note that we allow $F_m(\cdot)$ to be non-convex. We also assume that the objective function $F(\cdot)$ is lower-bounded, that is, $F_\star := \inf_{w \in \mathcal{W}} F(w) > -\infty$. In addition, we make the following assumption on the local stochastic gradients.

Assumption 4. For all $m \in [M]$ and $w \in \mathcal{W}$, $\mathbb{E}_{\xi \sim \mathcal{D}_m} \|\nabla\phi(w; \xi) - \nabla F_m(w)\|^2 \le \sigma^2$.

Next, we introduce quantities that characterize the heterogeneity of the optimization problem, that is, how the objective functions of different clients differ from each other. In a federated learning problem, heterogeneity can be large and it is important to understand its effect on the convergence of algorithms. The following three quantities characterize the heterogeneity:

$$\zeta_0^2 := \sup_{w \in \mathcal{W}} \frac{1}{M} \sum_{m=1}^M \|\nabla F_m(w) - \nabla F(w)\|^2 = \sup_{w \in \mathcal{W}} \left\{ \frac{1}{M} \sum_{m=1}^M \|\nabla F_m(w)\|^2 - \|\nabla F(w)\|^2 \right\},$$

$$\zeta_1^2 := \min_{p \in \mathcal{P}_{M-1}} \sup_{w \in \mathcal{W}} \left\{ \frac{1}{M^2} \sum_{m=1}^M \frac{1}{p_m} \|\nabla F_m(w)\|^2 - \|\nabla F(w)\|^2 \right\},$$

$$\zeta_2^2 := \sup_{w \in \mathcal{W}} \left\{ \left( \frac{1}{M} \sum_{m=1}^M \|\nabla F_m(w)\| \right)^2 - \|\nabla F(w)\|^2 \right\}.$$

By Jensen's inequality, we have $\zeta_2 \le \zeta_1 \le \zeta_0$. When the best fixed sampling distribution $q_\star$ or the best dynamic sampling distributions $p^t_\star$ are used to sample clients, the heterogeneity levels become $\zeta_1$ and $\zeta_2$, respectively, where $q_\star = \arg\min_{p \in \mathcal{P}_{M-1}} \sup_{w \in \mathcal{W}} \frac{1}{M^2} \sum_{m=1}^M (1/p_m) \|\nabla F_m(w)\|^2$ is the best fixed sampling distribution.
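To make these heterogeneity quantities concrete, the following sketch evaluates them at a single point $w$ with toy client gradients (the supremum over $w$ is dropped for illustration; note that with a single $w$ the inner minimization in $\zeta_1$ is attained by the dynamic-optimal distribution, so $\zeta_1$ and $\zeta_2$ coincide):

```python
import numpy as np

# Toy client gradients grad F_m(w) at one point w (one row per client).
G = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0]])
M = len(G)
g_bar = G.mean(axis=0)                       # grad F(w) with lambda_m = 1/M

sq_norms = np.sum(G ** 2, axis=1)            # ||grad F_m(w)||^2
zeta0_sq = sq_norms.mean() - np.sum(g_bar ** 2)
# Optimal per-round sampling gives ((1/M) sum_m ||grad F_m||)^2 in place of the mean of squares.
zeta2_sq = np.sqrt(sq_norms).mean() ** 2 - np.sum(g_bar ** 2)

# Jensen: the squared mean of norms is at most the mean of squared norms,
# so zeta_2 <= zeta_0 -- nonuniform sampling can only shrink the heterogeneity term.
assert 0.0 <= zeta2_sq <= zeta0_sq + 1e-12
```

For these toy gradients, $\zeta_0^2 \approx 2.22$ while $\zeta_2^2 \approx 1.44$, so the gap $\Delta\zeta_2$ discussed in Appendix A.2 is strictly positive whenever the client gradient norms are unequal.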
Finally, the following quantities are useful in stating the convergence guarantee. Recall that $B$ is the local batch size in Algorithm 3, $K = |S^t|$, and $\zeta_0 \ge \zeta_1 \ge \zeta_2$. Let $D_F := F(w^0) - F_\star$ and

$$R_0 := \frac{D_F L}{T} + \frac{\sigma \sqrt{D_F L}}{\sqrt{TKB}} + \frac{\zeta_0 \sqrt{D_F L}}{\sqrt{TK}},$$

$$R_1 := \frac{D_F L}{T} + \frac{\sigma \sqrt{D_F L}}{\sqrt{TKB\alpha}} + \frac{\zeta_1 \sqrt{D_F L}}{\sqrt{TK}} + \frac{\sqrt{D_F L \cdot \text{D-Regret}_T(q_\star)}}{T},$$

$$R_2 := \frac{D_F L}{T} + \frac{\sigma \sqrt{D_F L}}{\sqrt{TKB\alpha}} + \frac{\zeta_2 \sqrt{D_F L}}{\sqrt{TK}} + \frac{\sqrt{D_F L \cdot \text{D-Regret}_T(p^{1:T}_\star)}}{T}.$$

We are now ready to state the convergence guarantee of Algorithm 3.

Theorem 2. Suppose Assumptions 3 and 4 hold. Let $\mu_t = \mu$ for $t \in [T]$, where $\mu$ is given in equation 33 in the appendix. Let $\{w^0, \ldots, w^{T-1}\}$ be the sequence of iterates generated by Algorithm 3 and let $w^R$ denote an element of that sequence chosen uniformly at random. When $q_\star \in \mathcal{A}$, we have $\mathbb{E}\|\nabla F(w^R)\|^2 \lesssim R_1$; when $p^t_\star \in \mathcal{A}$ for all $t \in [T]$, we have $\mathbb{E}\|\nabla F(w^R)\|^2 \lesssim R_2$; when both hold, we have $\mathbb{E}\|\nabla F(w^R)\|^2 \lesssim \min\{R_1, R_2\}$.

See the proof in Appendix D.6. We derive the different convergence rates $R_1$ and $R_2$ in Theorem 2 by choosing different comparators: $R_1$ is derived by comparing against $q_\star$ and $R_2$ by comparing against $p^{1:T}_\star$. The different notions of heterogeneity reveal that different sampling schemes can change the convergence speed of optimization algorithms through the change in heterogeneity level. Note that $R_0$ is the rate of mini-batch SGD under uniform sampling (Ghadimi & Lan, 2013). The OSMD Sampler obtains a tighter rate than uniform sampling whenever $R_1$ or $R_2$ is smaller than $R_0$. To have $R_1 \lesssim R_0$, we need $(\sigma/\sqrt{KB})(1/\sqrt{\alpha} - 1) + \text{D-Regret}_T(q_\star)/T \lesssim (1/\sqrt{K})(\zeta_0 - \zeta_1)$. By Corollary 1, the worst-case upper bound on $\text{D-Regret}_T(q_\star)$ is $O(\sqrt{T})$; the empirical evidence in Appendix F.6 suggests the tighter rate $O(1)$. With either rate, we always have $\text{D-Regret}_T(q_\star)/T = o(1)$. Thus, to have $R_1 \lesssim R_0$, we asymptotically only need $(\sigma/\sqrt{B})(1/\sqrt{\alpha} - 1) \ll \zeta_0 - \zeta_1$.
That is, we want the gap between the heterogeneity under the best fixed sampling distribution and under uniform sampling to be large compared to $(\sigma/\sqrt{B})(1/\sqrt{\alpha} - 1)$, which always holds when we use the full local gradients, i.e., when $\sigma = 0$. A similar argument applies to the condition $R_2 \lesssim R_0$. See a more detailed discussion in Appendix A.2.

A. ADDITIONAL DISCUSSION

In this section, we include additional discussion that was omitted from the main text due to space limits.

A.1 DETAILED DISCUSSION ABOUT CONTRIBUTIONS

We develop an algorithm based on OSMD that generates a sequence of sampling distributions $\{p^t\}_{t \ge 1}$ based on the partial feedback available to the server from the sampled clients. We prove a bound on the regret relative to any dynamic comparator sequence, which allows us to consider the best sequence of sampling distributions as it changes over iterations. The bound includes a total variation term that characterizes the intrinsic difficulty of the problem by capturing the difficulty of following the best sequence of distributions. Such a characterization of problem difficulty is novel. Moreover, our theoretical result recovers the results of previous research as special cases and is thus strictly more general. We also theoretically improve the convergence guarantee of an optimization algorithm by using our sampling scheme instead of uniform sampling, showing that adaptive sampling can reduce the dependence on the heterogeneity level of the problem. We demonstrate the empirical superiority of the proposed algorithm through synthetic and real data experiments. In addition to client sampling in FL, our proposed algorithm can have a broad impact on stochastic optimization: adapting it to any stochastic optimization procedure that chooses samples, such as SGD, or coordinates, such as stochastic coordinate descent, may improve their performance. For example, Zhao et al. (2022) illustrated the practical benefits of our algorithm for speeding up L-SVRG and L-Katyusha. The learning rate of the proposed algorithm depends on the total variation of the comparator sequence, which is generally unknown; it also depends on the total number of iterations, which can likewise be unknown. In the appendix, we make the algorithm practical by addressing these technical challenges. In particular, we adapt the follow-the-regularized-leader algorithm and the doubling trick to automatically choose a learning rate that asymptotically performs as well as the best learning rate.

A.2 DETAILED DISCUSSION ABOUT THEOREM 2

We derive the different convergence rates R_1 and R_2 in Theorem 2 by choosing different comparators; which rate applies depends on the tuning parameter α (or, equivalently, the space A). More specifically, R_1 is derived by comparing against q_⋆ and R_2 is derived by comparing against p_⋆^{1:T}. The different notions of heterogeneity reveal that different sampling schemes can change the convergence speed of optimization algorithms by changing the effective heterogeneity level. It is worth noting that one can always choose other comparator sequences besides these two and derive new upper bounds. A more flexible comparator yields a smaller heterogeneity term, but at the risk of higher regret and a larger space A (which requires a smaller α). A good choice of comparator balances these two concerns; we leave the choice of the optimal comparator sequence as a future research direction. Note that R_0 is the rate of mini-batch SGD under uniform sampling (Ghadimi & Lan, 2013). To see when the OSMD Sampler obtains tighter rates than uniform sampling, we only need R_1 or R_2 to be smaller than R_0. When will R_1 ≪ R_0? To have R_1 ≲ R_0, we need (σ/√(KB))(1/√α − 1) + D-Regret_T(q_⋆)/T ≲ (1/√K)(ζ_0 − ζ_1). By Corollary 1, we have a worst-case upper bound on D-Regret_T(q_⋆) of order O(√T); empirical evidence in Appendix F.6 suggests the tighter rate O(1). With either rate, we always have D-Regret_T(q_⋆)/T = o(1). Thus, to have R_1 ≲ R_0, we only need (σ/√B)(1/√α − 1) ≪ ζ_0 − ζ_1 asymptotically. That is, we want the gap between the heterogeneity under the best fixed sampling distribution and under uniform sampling to be large compared to (σ/√B)(1/√α − 1). When σ = 0, that is, if we use the full local gradient, this is always true. When will R_2 ≪ R_0? Similarly, to have R_2 ≪ R_0, we need (σ/√(KB))(1/√α − 1) + D-Regret_T(p_⋆^{1:T})/T ≲ (1/√K)(ζ_0 − ζ_2).
A worst-case upper bound on D-Regret_T(p_⋆^{1:T}) based on Corollary 1 is O(T); however, as shown in Appendix F.6, the empirical evidence suggests that D-Regret_T(p_⋆^{1:T}) may be O(√T) or even O(1), which would imply D-Regret_T(p_⋆^{1:T})/T = o(1). When that holds, to have R_2 ≲ R_0, we need (σ/√B)(1/√α − 1) ≪ ζ_0 − ζ_2 asymptotically. That is, we want the gap between the heterogeneity under the best dynamic sampling distribution and under uniform sampling to be large compared to (σ/√B)(1/√α − 1). When σ = 0, that is, if we use the full local gradient, this is always true. New characterization of heterogeneity. Let ∆ζ_i = ζ_0 − ζ_i for i = 1, 2. Based on the previous analysis, we see that ∆ζ_i, i = 1, 2, can also help measure the dissimilarity of the different clients. Compared with ζ_0, the new characterization reflects how much we can gain by using non-uniform sampling: ∆ζ_1 measures the advantage gained by the best fixed sampling distribution, and ∆ζ_2 measures the advantage gained by the best dynamic sampling distribution.

B MORE RELATED WORK

Our paper is also closely related to importance sampling in stochastic optimization. Zhao & Zhang (2015) and Needell et al. (2016) illustrated that, by sampling observations from a non-uniform distribution when using a gradient-based stochastic optimization method, one can achieve faster convergence. They designed a fixed sampling distribution using prior knowledge of upper bounds on the gradient norms. Csiba & Richtárik (2018) further developed importance sampling schemes for stochastic optimization. Salehi et al. (2017) designed the sampling distribution by solving a multi-armed bandit problem with the EXP3 algorithm (Lattimore & Szepesvári, 2020, Chapter 11). Borsos et al. (2018) used the follow-the-regularized-leader algorithm (Lattimore & Szepesvári, 2020, Chapter 28) to solve an online convex optimization problem and update the sampling distribution. Borsos et al. (2019) restricted the sampling distribution to a linear combination of distributions from a predefined set and used an online Newton step to update the mixture weights. The above approaches estimate a stationary distribution, while the best distribution changes with iterations and is therefore intrinsically dynamic. In addition to having suboptimal empirical performance, these papers provide theoretical results that only establish regret relative to the best fixed sampling distribution in hindsight. To address this problem, Hanchi & Stephens (2020) took a non-stationary approach where the most recent information for each client is kept; however, a decreasing stepsize sequence is required to establish a regret bound, and the resulting bound does not capture the intrinsic difficulty of the problem. In comparison, we establish a regret bound relative to a dynamic comparator, that is, a sequence of sampling distributions, without imposing assumptions on the stepsize sequence, and this bound depends on the total variation term characterizing the intrinsic difficulty of the problem. Our paper also contributes to the literature on online convex optimization.
We cast the client sampling problem as an online learning problem (Hazan, 2016) and adapt algorithms from the dynamic online convex optimization literature to solve it. Hall & Willett (2015); Yang et al. (2016); Daniely et al. (2015) proposed methods that achieve sublinear dynamic regret relative to dynamic comparator sequences. In particular, Hall & Willett (2015) used a dynamic mirror descent algorithm to achieve sublinear dynamic regret, with total variation characterizing the intrinsic difficulty of the environment; however, the optimal tuning parameters depend on the unknown total variation. On the other hand, van Erven & Koolen (2016) and Zhang et al. (2018) proposed online ensemble approaches to automatically choose the tuning parameters for online gradient descent. Compared with the problem settings in the above studies, there are two key new challenges that we need to address. First, we only have partial information, that is, bandit feedback, instead of full information about the loss functions. Second, the loss functions in our case are unbounded, which violates the boundedness assumption common in the online learning literature. To overcome the first difficulty, we construct unbiased estimators of the loss function and its gradient, which are then used to update the sampling distribution. We address the second challenge by first bounding the regret of our algorithm when the sampling distributions in the comparator sequence lie in a region of the simplex on which the loss is bounded, and subsequently analyzing the additional regret introduced by projecting the elements of the comparator sequence onto this region.

C ADDITIONAL ALGORITHMS

C.1 ALGORITHM TO SOLVE STEP 7 OF ALGORITHM 1 WHEN Φ IS UNNORMALIZED NEGATIVE ENTROPY

When Φ(•) is chosen as the unnormalized negative entropy Φ_e(•), a closed-form efficient solution can be obtained, as shown in Proposition 1.

Proposition 1. Suppose Φ = Φ_e is the unnormalized negative entropy in Algorithm 1. Let

p̄^{t+1}_m = p̃^t_m exp( 1{m ∈ S^t} η_t a^t_m / (K^2 (p̃^t_m)^3) ), m ∈ [M].

Let π : [M] → [M] be a permutation such that p̄^{t+1}_{π(1)} ≤ p̄^{t+1}_{π(2)} ≤ • • • ≤ p̄^{t+1}_{π(M)}. Let m^t_⋆ be the smallest integer m such that

p̄^{t+1}_{π(m)} (1 − ((m − 1)/M) α) > (α/M) Σ_{j=m}^{M} p̄^{t+1}_{π(j)}.

Then

p̃^{t+1}_{π(m)} = α/M if m < m^t_⋆, and p̃^{t+1}_{π(m)} = (1 − ((m^t_⋆ − 1)/M) α) p̄^{t+1}_{π(m)} / Σ_{j=m^t_⋆}^{M} p̄^{t+1}_{π(j)} otherwise.

Proof. See Appendix D.1.

An efficient implementation is shown in Algorithm 2.

Algorithm 2 Solver of Step 7 of Algorithm 1 when Φ is unnormalized negative entropy
1: Input: p̃^t, S^t, {a^t_m}_{m∈S^t}, and A = P^{M−1} ∩ [α/M, ∞)^M.
2: Output: p̃^{t+1}.
3: Let p̄^{t+1}_m = p̃^t_m exp( 1{m ∈ S^t} η_t a^t_m / (K^2 (p̃^t_m)^3) ) for m ∈ [M].
4: Sort {p̄^{t+1}_m}_{m=1}^{M} in non-decreasing order: p̄^{t+1}_{π(1)} ≤ p̄^{t+1}_{π(2)} ≤ • • • ≤ p̄^{t+1}_{π(M)}.
5: Let v_m = p̄^{t+1}_{π(m)} (1 − ((m − 1)/M) α) for m ∈ [M].
6: Let u_m = (α/M) Σ_{j=m}^{M} p̄^{t+1}_{π(j)} for m ∈ [M].
7: Find the smallest m such that v_m > u_m, denoted m^t_⋆.
8: Set p̃^{t+1}_{π(m)} = α/M if m < m^t_⋆, and p̃^{t+1}_{π(m)} = (1 − ((m^t_⋆ − 1)/M) α) p̄^{t+1}_{π(m)} / Σ_{j=m^t_⋆}^{M} p̄^{t+1}_{π(j)} otherwise.
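The closed form of Proposition 1 can be sketched as follows (hypothetical function name; `y` plays the role of the intermediate multiplicative weights from Step 3, and 0-indexed ranks replace the 1-indexed m of the statement):

```python
import numpy as np

def entropic_projection(y, alpha):
    """Closed-form KL projection of unnormalized weights y onto
    A = P^{M-1} intersected with [alpha/M, inf)^M, following
    Proposition 1 / Algorithm 2 (sketch)."""
    y = np.asarray(y, dtype=float)
    M = len(y)
    order = np.argsort(y)                    # permutation pi, ascending
    ys = y[order]
    suffix = np.cumsum(ys[::-1])[::-1]       # sum_{j >= m} y_{pi(j)}
    m_star = None
    for m in range(M):                       # rank m here = rank m+1 in the text
        v = ys[m] * (1.0 - m * alpha / M)    # v_{m+1}
        u = (alpha / M) * suffix[m]          # u_{m+1}
        if v > u:
            m_star = m
            break
    assert m_star is not None, "no threshold index found; check alpha <= 1"
    p = np.empty(M)
    scale = (1.0 - m_star * alpha / M) / suffix[m_star]
    for rank, idx in enumerate(order):
        # clients with small weight are pinned at the floor alpha/M,
        # the rest share the remaining mass proportionally
        p[idx] = alpha / M if rank < m_star else y[idx] * scale
    return p
```

The sort dominates the cost, so one projection is O(M log M).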

C.2 MINI-BATCH SGD WITH OSMD SAMPLER

In this section, we describe mini-batch SGD with the OSMD Sampler in Algorithm 3. Compared to classical mini-batch SGD, the key ingredients of Algorithm 3 are Line 13, where the server updates the sampling distribution via the OSMD Sampler, and Line 5, where the server samples the mini-batch of clients from a non-uniform sampling distribution.
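A minimal sketch of one such server round, assuming `local_grads[m](w)` stands in for client m's local mini-batch gradient (all names are illustrative, not from the paper's code); the demo below runs it on simple quadratic clients with a uniform sampler:

```python
import numpy as np

def federated_round(w, p, local_grads, lam, K, lr, rng):
    """One server round of mini-batch SGD with a non-uniform client
    sampler (sketch of Lines 5 and 12 of Algorithm 3)."""
    M = len(p)
    S = rng.choice(M, size=K, replace=True, p=p)   # Line 5: sample K clients
    g = np.zeros_like(w)
    for m in S:
        # importance weighting lam_m / p_m keeps the aggregate unbiased:
        # E[g] = sum_m lam_m * grad_m(w)
        g += (lam[m] / p[m]) * local_grads[m](w)
    g /= K
    return w - lr * g                              # Line 12: global update

# Demo: clients hold quadratics 0.5*||w - c_m||^2, so grad_m(w) = w - c_m.
rng = np.random.default_rng(0)
M, K = 5, 3
centers = np.linspace(0.0, 1.0, M)
local_grads = [(lambda w, c=c: w - c) for c in centers]
lam = np.full(M, 1.0 / M)      # equal client weights
p = np.full(M, 1.0 / M)        # uniform sampler, for simplicity
w = np.array([5.0])
for _ in range(500):
    w = federated_round(w, p, local_grads, lam, K, lr=0.2, rng=rng)
```

In the full Algorithm 3, `p` would be refreshed each round by the OSMD Sampler instead of staying uniform.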

C.3 ADAPTIVE-OSMD SAMPLER

The choice of the sequence of learning rates {η_t}_{t≥1} has a large effect on the performance of the OSMD Sampler. (For reference, the concluding steps of Algorithm 3 are:)
11: Server computes a^t_m = λ^2_m ∥g^t_m∥^2 for m ∈ S^t and g^t = (1/K) Σ_{m∈S^t} (λ_m / p̃^t_m) g^t_m.
12: Server updates the model parameter w^{t+1} ← w^t − µ_t g^t.
13: Server obtains the updated sampling distribution p̃^{t+1} by Algorithm 1.
14: end for
Similar to Corollary 1, we have the following corollary.

Corollary 2. Suppose the conditions of Theorem 1 hold and let p̃^{1:T} be the sequence generated by Algorithm 1. We choose α such that p^t_⋆ ∈ A for all t ∈ [T]. Let

η = (K^2 α^3 / M^3) √( (log M + 2 log(M/α) E[TV(p_⋆^{1:T})]) / (2T E[(ā^1)^2]) ).   (11)

Then

D-Regret_T(p_⋆^{1:T}) ≤ (2√2 M^3 / (K^2 α^3)) √( E[(ā^1)^2] [log M + 2 log(M/α) E[TV(p_⋆^{1:T})]] T ),   (12)

where ā^t := max_{1≤m≤M} a^t_m = max_{1≤m≤M} λ^2_m ∥g^t_m∥^2.

The choice in equation 11 still depends on unknown quantities such as E[ā^1], E[TV(p_⋆^{1:T})], and T. To estimate E[ā^1], we can add a pre-training phase in which we broadcast the initial model parameter w^0 to all devices before the start of training and collect the returned ∥g^0_m∥^2 from all responsive devices, denoted S^0; then â^1 := max_{m∈S^0} λ^2_m ∥g^0_m∥^2. On the other hand, E[TV(p_⋆^{1:T})] and T are hard to estimate in advance of training. We discuss how to use an online ensemble method to choose the learning rate without knowledge of E[TV(p_⋆^{1:T})], and how to remove the dependence on T using the doubling trick. The main idea is to run a set of expert algorithms, each with a different learning rate for Algorithm 1, and then use a prediction-with-expert-advice algorithm to track the best-performing expert algorithm. More specifically, we define the set of expert learning rates as

E := { 2^{e−1} (K^2 α^3 / M^3) √( log M / (2T E[(ā^1)^2]) ) : e = 1, 2, . . . , E },

where E = ⌊(1/2) log_2( 1 + (4 log(M/α) / log M)(T − 1) )⌋ + 1.
Then, for each η_e ∈ E, the Adaptive-OSMD Sampler runs an expert algorithm to generate a sequence of sampling distributions p̃_e^{1:T}. Meanwhile, it also runs a meta-algorithm that tracks the best-performing expert.

Algorithm 4 Adaptive-OSMD Sampler
1: Input: Meta learning rate γ; the set of expert learning rates E = {η_1 ≤ η_2 ≤ • • • ≤ η_E} with E = |E|; parameter α ∈ (0, 1]; A = P^{M−1} ∩ [α/M, ∞)^M.

From the computational perspective, the major cost comes from solving Step 9 of Algorithm 4, which needs to be run a total of T|E| = O(T ⌊log_2 T⌋) times. Compared with Algorithm 1, the computational complexity increases by only a log(T) factor. We have the following result on Algorithm 4.

Theorem 3. Suppose the conditions of Theorem 1 hold, and let p̃^{1:T} be the output of Algorithm 4 with the expert set E above and meta learning rate γ = (α/M) √( 8K / (T E[ā^1]) ). Then

D-Regret_T ≤ (3√2 M^3 / (K^2 α^3)) √( E[(ā^1)^2] T [log M + 2 log(M/α) E[TV(p_⋆^{1:T})]] ) + (M/α) √( T E[ā^1] / (8K) ) (1 + 2 log E).

Proof. See Appendix D.4.

Since the additional regret term is Õ((M/α)√(T/K)), which is asymptotically no larger than the first term, the bound on the regret is of the same order as equation 12; however, we do not need to know the total variation E[TV(p_⋆^{1:T})] to set the learning rate. Based on Theorem 3, the choice of α relies on prior knowledge about {p^t_⋆}_{t≥1}. Specifically, we need α to be small enough so that p^t_⋆ ∈ A for all t ∈ [T]. While this prior knowledge is not generally available, in Section F.3 we experimentally show that the proposed algorithm is robust to the choice of α: as long as α is neither too small nor too large, we obtain a reasonable solution. We always set α = 0.4 in experiments.
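The meta-algorithm's exponential-weights update over experts can be sketched as follows (hypothetical helper; for simplicity this sketch starts from a uniform prior over experts, whereas Algorithm 4 only requires prior weights θ^1_e ≥ 1/E^2):

```python
import numpy as np

def track_best_expert(losses, gamma):
    """Prediction-with-expert-advice meta-step (sketch): theta is the
    meta distribution over expert learning rates, updated
    multiplicatively from each expert's estimated loss.

    losses[t, e] : estimated loss of expert e at round t
    """
    T, E = losses.shape
    theta = np.full(E, 1.0 / E)      # uniform prior (simplification)
    for t in range(T):
        theta = theta * np.exp(-gamma * losses[t])
        theta /= theta.sum()         # renormalize to a distribution
    return theta
```

After enough rounds, the expert with the smallest cumulative loss dominates the meta distribution, which is exactly the tracking behavior the regret analysis relies on.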

C.4 ADAPTIVE-OSMD SAMPLER WITH DOUBLING TRICK (ADAPTIVE-DOUBLING-OSMD)

Algorithm 4 requires the total number of iterations T as input, which is not always available in practice. In those cases, we use the doubling trick (Cesa-Bianchi & Lugosi, 2006, Section 2.3) to avoid this requirement. The basic idea is to restart the Adaptive-OSMD Sampler at exponentially spaced time points T_b = 2^{b−1}, b ≥ 1. The learning rates of the experts in Algorithm 4 are reset at the beginning of each time interval, and the meta-algorithm learning rate γ is chosen optimally for the interval length. We set â_b at the beginning of each time interval using the maximum environment feedback from the previous interval. More specifically, at the time point T_b, we let

E_b := { 2^{e−b/2−1} K^2 α^3 √(log M) / (M^3 â_b) : e = 1, 2, . . . , E_b },

where E_b = ⌊(1/2) log_2( 1 + (4 log(M/α) / log M)(2^{b−1} − 1) )⌋ + 1, and γ_b = (α/M) √( 8K / (2^{b−1} â_b) ), b ≥ 1. We set â_b = max_{m∈S^{T_b−1}} a^{T_b−1}_m. In a practical implementation, at the time point t = T_b, instead of initializing all expert algorithms with the uniform distribution, we can initialize them with the output of the meta-algorithm at t = T_b − 1. To obtain â_1, the server uses a pre-training phase where the initial model parameter w^0 is broadcast to all devices before the start of training. Subsequently, the server collects the returned ∥g^0_m∥^2 from all responsive devices, denoted S^0; then â_1 := max_{m∈S^0} a^0_m, where a^0_m = λ^2_m ∥g^0_m∥^2. The Adaptive-Doubling-OSMD Sampler is detailed in Algorithm 5. From the computational perspective, by the proof of Theorem 4, Algorithm 5 needs to run Step 9 of Algorithm 4 a total of O(T ⌊log_2 T⌋) times. Therefore, the computational complexity of the Adaptive-Doubling-OSMD Sampler is asymptotically the same as that of the Adaptive-OSMD Sampler, while it increases by only a log(T) factor compared to the OSMD Sampler. The following theorem provides a bound on the dynamic regret of the Adaptive-Doubling-OSMD Sampler. Theorem 4.
Let Φ = Φ_e and use ∥•∥ = ∥•∥_1 to define the total variation. Let α be a small enough constant so that p_⋆^{1:T} ∈ A^T, and suppose the training is stopped after T iterations. Suppose there exists a constant C > 1 such that â_b ≤ C ā^{T_b} for all b = 1, 2, . . . , B, where B = ⌊log_2(T + 1)⌋. Let p̃^{1:T} be the output of Algorithm 5, where p_unif is used in Step 8. Then

D-Regret_T ≤ (√(2(T + 1)) / (√2 − 1)) [ (2C + 1) (√2 M^3 / (K^2 α^3)) √( E[(ā^1)^2] [log M + 2 log(M/α) E[TV(p_⋆^{1:T})]] ) + (M/α) √( E[ā^1] / (8K) ) (C + 2 log E) ].

Proof. See Appendix D.5. From Theorem 4, we observe that the asymptotic regret bound has the same order as those of the OSMD Sampler and the Adaptive-OSMD Sampler. However, the Adaptive-Doubling-OSMD Sampler does not need to know E[TV(p_⋆^{1:T})] or T in advance.
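The restart schedule of the doubling trick can be sketched as below (hypothetical helper; it reproduces the restart points T_b = 2^{b−1} and the grid sizes E_b from Section C.4 for a horizon that turns out to be T):

```python
import math

def doubling_schedule(T, M, alpha):
    """Restart schedule for Adaptive-Doubling-OSMD (sketch): the sampler
    restarts at T_b = 2^(b-1) and rebuilds the expert learning-rate grid
    with E_b entries at each restart."""
    schedule = []
    b = 1
    while 2 ** (b - 1) <= T:
        T_b = 2 ** (b - 1)
        ratio = 4.0 * math.log(M / alpha) / math.log(M)
        # E_b = floor(0.5 * log2(1 + ratio * (2^(b-1) - 1))) + 1
        E_b = math.floor(0.5 * math.log2(1.0 + ratio * (T_b - 1))) + 1
        schedule.append((T_b, E_b))
        b += 1
    return schedule
```

The grid size E_b grows with the interval length, so longer intervals try a wider range of learning rates while the total overhead stays logarithmic in T.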

C.5 ADAPTIVE SAMPLING WITHOUT REPLACEMENT

In the discussion so far, we have assumed that the set S^t is obtained by sampling with replacement from p̃^t. However, when K is relatively large compared to M and p̃^t is far from the uniform distribution, sampling without replacement can be more efficient than sampling with replacement. When sampling without replacement using p̃^t, however, the variance-reduction loss does not have a clean form as in equation 4; as a result, an online design of the sampling distribution is more challenging. In this section, we discuss how to use the sampling distribution obtained by the Adaptive-OSMD Sampler to sample clients without replacement, following the approach taken in Hanchi & Stephens (2020). Having drawn clients m^t_1, . . . , m^t_{k−1}, construct p̃^t_{(k)} by letting

p̃^t_{(k),m} = (1 − Σ_{l=1}^{k−1} p̃^t_{m^t_l})^{−1} p̃^t_m if m ∈ [M] \ {m^t_1, . . . , m^t_{k−1}}, and p̃^t_{(k),m} = 0 otherwise,

and draw m^t_k from p̃^t_{(k)}. Let S^t = {m^t_1, • • • , m^t_K}.
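This successive renormalization can be sketched as follows (hypothetical helper name; `p` is the distribution produced by the Adaptive-OSMD Sampler):

```python
import numpy as np

def sample_without_replacement(p, K, rng):
    """Draw K distinct clients from p by successive renormalization
    (sketch of the construction above): at step k, already-drawn clients
    get probability zero and the remaining mass is rescaled by
    (1 - sum of drawn probabilities)^{-1}."""
    p = np.asarray(p, dtype=float)
    drawn, mass = [], 0.0
    for _ in range(K):
        q = p.copy()
        q[drawn] = 0.0
        q = q / (1.0 - mass)   # p_(k),m = p_m / (1 - sum_{l<k} p_{m_l})
        m = int(rng.choice(len(p), p=q))
        drawn.append(m)
        mass += p[m]
    return drawn
```

Each inner step is O(M), so drawing the whole set S^t costs O(MK).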

11: The server broadcasts the model parameter w^t to the clients in S^t.
12: The clients in S^t compute and upload the set of local gradients {g^t_m}_{m∈S^t}.

The detailed sampling procedure is described in Algorithm 6. We still use the Adaptive-OSMD Sampler to update the sampling distribution; however, we use the designed sampling distribution in a way that ensures no client is chosen twice. In particular, Steps 15–16 of Algorithm 6 compute, for k = 2, • • • , K,

ĝ^t_{(k)} = λ_{m^t_k} g^t_{m^t_k} / p̃^t_{(k),m^t_k} + Σ_{l=1}^{k−1} λ_{m^t_l} g^t_{m^t_l},

and Step 18 of Algorithm 6 constructs the gradient estimate ĝ^t with the following properties.

Proposition 2 (Proposition 3 of Hanchi & Stephens (2020)). Let p̃^t = p and let ĝ^t be as in Step 18 of Algorithm 6. Note that ĝ^t = ĝ^t(p) depends on p. Recall that J^t = Σ_{m=1}^M λ_m g^t_m. We have E_{S^t}[ĝ^t] = J^t and argmin_{p∈P^{M−1}} E_{S^t} ∥ĝ^t − J^t∥^2_2 = argmin_{p∈P^{M−1}} l_t(p), where l_t(•) is defined in equation 4 and the expectation is taken over S^t.

From Proposition 2, we see that ĝ^t is an unbiased stochastic gradient. Furthermore, the variance of ĝ^t is minimized by the same sampling distribution that minimizes the variance-reduction loss in equation 4. Therefore, it is reasonable to use the sampling distribution generated by the Adaptive-OSMD Sampler to design ĝ^t.

D.1 PROOF OF PROPOSITION 1

The optimality condition for the unconstrained minimizer p̄^{t+1} implies that

η_t ∇l̂_t(p^t; p̃^t) + ∇Φ_e(p̄^{t+1}) − ∇Φ_e(p̃^t) = 0.   (17)

By Lemma 1, the optimality condition for p̃^{t+1} implies that ⟨p − p̃^{t+1}, ∇Φ_e(p̃^{t+1}) − ∇Φ_e(p̄^{t+1})⟩ ≥ 0 for all p ∈ A. Combining the last two displays, we have ⟨p − p̃^{t+1}, η_t ∇l̂_t(p^t; p̃^t) + ∇Φ_e(p̃^{t+1}) − ∇Φ_e(p̃^t)⟩ ≥ 0 for all p ∈ A. By Lemma 1, this is the optimality condition for p̃^{t+1} to be the solution in Step 7 of Algorithm 1. Note that equation 17 is equivalent to

−(η_t / K^2) (a^t_m / (p̃^t_m)^3) 1{m ∈ S^t} + log(p̄^{t+1}_m) − log(p̃^t_m) = 0, m ∈ [M].

Therefore, p̄^{t+1}_m = p̃^t_m exp( (η_t a^t_m / (K^2 (p̃^t_m)^3)) 1{m ∈ S^t} ), m ∈ [M], and the final result follows from Lemma 4.

D.2 PROOF OF THEOREM 1

We first state a proposition that will be used to prove Theorem 1. The key difference between Theorem 1 and Proposition 3 is that in Proposition 3 the comparator sequence lies in A, and, as a result, there is no projection error. Proposition 3. Suppose the conditions of Theorem 1 hold. For any comparator sequence q 1:T with q t ∈ A, t ∈ [T ], we have D-Regret T (q 1:T ) ≤ D max η 1 + 2H η T E TV q 1:T + 2 ρ T t=1 η t E Q 2 t . Proof. By Lemma 1 and the definition of pt+1 in Step 7 of Algorithm 1, we have ⟨p t+1 -q t , ∇ lt (p t ; pt )⟩ ≤ 1 η t ⟨∇Φ(p t ) -∇Φ(p t+1 ), pt+1 -q t ⟩. ( ) By the convexity of lt (•; pt ), we have lt (p t ; pt ) -lt (q t ; pt ) ≤ ⟨∇ lt (p t ; pt ), pt -q t ⟩ = ⟨∇ lt (p t ; pt ), pt+1 -q t ⟩ + ⟨∇ lt (p t ; pt ), pt -pt+1 ⟩. Then, by equation 18, we further have lt (p t ; pt ) -lt (q t ; pt ) ≤ 1 η t ⟨∇Φ(p t ) -∇Φ(p t+1 ), pt+1 -q t ⟩ + ⟨∇ lt (p t ; pt ), pt -pt+1 ⟩. From the definition of D, we have D Φ (x 1 ∥ x 2 ) = D Φ (x 3 ∥ x 2 ) + D Φ (x 1 ∥ x 3 ) + ⟨∇Φ(x 2 ) -∇Φ(x 3 ), x 3 -x 1 ⟩, x 1 , x 2 , x 3 ∈ D. Then lt (p t ; pt ) -lt (q t ; pt ) ≤ 1 η t D Φ q t ∥p t -D Φ q t ∥p t+1 -D Φ pt+1 ∥p t + ⟨∇ lt (p t ; pt ), pt -pt+1 ⟩ = 1 η t D Φ q t ∥p t -D Φ q t+1 ∥p t+1 + 1 η t D Φ q t+1 ∥p t+1 -D Φ q t ∥p t+1 - 1 η t D Φ pt+1 ∥p t + ⟨∇ lt (p t ; pt ), pt -pt+1 ⟩. ( ) We bound the second term in equation 19 as D Φ q t+1 ∥p t+1 -D Φ q t ∥p t+1 = Φ(q t+1 ) -Φ(q t ) -⟨∇Φ(p t+1 ), q t+1 -q t ⟩ (a) ≤ ⟨∇Φ(q t+1 ) -∇Φ(p t+1 ), q t+1 -q t ⟩ (b) ≤ ∥∇Φ(q t+1 ) -∇Φ(p t+1 )∥ ⋆ ∥q t+1 -q t ∥ (c) ≤ 2H∥q t+1 -q t ∥, where (a) follows from the convexity of Φ(•), (b) follows from the definition of the dual norm, and (c) follows from the definition of H. Since Φ(•) is ρ-strongly convex, we can bound the third and fourth term in equation 19 as - 1 η t D Φ pt+1 ∥p t + ⟨∇ lt (p t ; pt ), pt -pt+1 ⟩ ≤ - ρ 2η t ∥p t+1 -pt ∥ 2 + ∥∇ lt (p t ; pt )∥ ⋆ ∥p t -pt+1 ∥. 
Since ab ≤ a 2 /(2ϵ) + b 2 ϵ/2, a, b, ϵ > 0, we further have - 1 η t D Φ pt+1 ∥p t + ⟨∇ lt (p t ; pt ), pt -pt+1 ⟩ ≤ 2η t ρ ∥∇ lt (p t ; pt )∥ 2 ⋆ ≤ 2η t ρ Q 2 t . Combining equation 19-equation 21, we have lt (p t ; pt ) -lt (q t ; pt ) ≤ D Φ (q t ∥p t ) η t - D Φ q t+1 ∥p t+1 η t + 2H ∥q t+1 -q t ∥ η t + 2 ρ η t Q 2 t . This implies that T t=1 lt (p t ; pt ) -T t=1 lt (q t ; pt ) ≤ D Φ q 1 ∥p 1 η 1 - D Φ q T +1 ∥p T +1 η T +1 + 2H T t=1 ∥q t+1 -q t ∥ η t + 2 ρ T t=1 η t Q 2 t ≤ D Φ q 1 ∥p 1 η 1 + 2H η T T t=1 ∥q t+1 -q t ∥ + 2 ρ T t=1 η t Q 2 t ≤ D max η 1 + 2H η T TV q 1:T + 2 ρ T t=1 η t Q 2 t , ( ) since p1 is the uniform distribution. Finally, note that E T t=1 lt (p t ; pt ) - T t=1 lt (q t ; pt ) = T t=1 E E S t lt (p t ; pt ) -E S t lt (q t ; pt ) = T t=1 E l t (p t ) -l t (q t ) = D-Regret T (q 1:T ). The conclusion follows by taking expectation on both sides of equation 22. We are now ready to prove Theorem 1. Proof of Theorem 1. For any comparator sequence q 1:T with q t ∈ P M -1 , t ∈ [T ], we have D-Regret T (q 1:T ) = E T t=1 l t (p t ) - T t=1 l t (q t ) + T t=1 l t (q t ) - T t=1 l t (q t ) . ( ) By Proposition 3, we further have that E T t=1 l t (p t ) - T t=1 l t (q t ) ≤ D max η 1 + 2H η T E TV q 1:T + 2 ρ T t=1 η t E Q 2 t + 2H η T E TV q1:T -TV q 1:T . (24) Therefore, to prove Theorem 1, we design a suitable sequence q1:T , where qt ∈ A, and bound the terms T t=1 l t (q t ) -T t=1 l t (q t ) and TV q1:T -TV q 1:T . We define qt as qt m = α/M if q t m < α/M, q t m -ω(q t , α) q t m -α M if q t m ≥ α/M, where ω(q t , α) is defined in equation 9. We now show that qt ∈ A, t ∈ [T ], by showing that qt m ≥ α/M , m ∈ [M ], and m∈[M ] qt m = 1. For m ∈ [M ] such that q t m < α/M , we have from equation 25 that qt m = α/M . For m ∈ [M ] such that q t m ≥ α/M , by equation 25, we have qt m -α/M = (1 -ω(q t , α)) (q t m -α/M ). Thus, we proceed to show that ω(q t , α) ≤ 1. 
Since 1 = M m=1 q t m 1 q t m < α M + M m=1 q t m 1 q t m ≥ α M ≥ M m=1 α M 1 q t m < α M + M m=1 α M 1 q t m ≥ α M = α, we have M m=1 q t m - α M 1 q t m ≥ α M ≥ M m=1 α M -q t m 1 q t m < α M . Therefore, 0 ≤ ω(q t , α) ≤ 1. Furthermore, ω(q t , 0) = 1 and ω(q t , 1) = 1. Finally, we show that M m=1 qt m = 1. By equation 25 and the definition of ω(q t , α) in equation 9, we have M m=1 qt m = M m=1 α M 1 q t m < α M + M m=1 q t m 1 q t m ≥ α M -ω(q t , α) M m=1 q t m - α M 1 q t m ≥ α M = M m=1 α M 1 q t m < α M + M m=1 q t m 1 q t m ≥ α M - M m=1 α M -q t m 1 q t m < α M = M m=1 q t m 1 q t m ≥ α M + M m=1 q t m 1 q t m < α M = 1. Therefore, qt ∈ A for any t ∈ [T ]. We now bound T t=1 l t (q t ) -T t=1 l t (q t ). When q t m < α/M , then 1/q t m -1/q t m < 0; and when q t m ≥ α/M , then 1 qt m - 1 q t m = 1 q t m •   1 1 -ω(q t , α) 1 -α M q t m -1   = 1 q t m • ω(q t , α) 1 -α M q t m 1 -ω(q t , α) 1 -α M q t m . Since ω(q t , α) 1 -α M q t m ≤ ω(q t , α) and 1 -ω(q t , α) 1 -α M q t m ≥ 1 -ω(q t , α) + ω(q t , α)α M as q t m ≤ 1, we have 1 qt m - 1 q t m ≤ 1 q t m • ω(q t , α) 1 -ω(q t , α) 1 -α M = ϕ(q t , α) q t m . Thus, T t=1 l t (q t ) - T t=1 l t (q t ) = T t=1 M m=1 a t m 1 qt m - 1 q t m ≤ T t=1 M m=1 a t m 1 qt m - 1 q t m 1 q t m ≥ α M ≤ T t=1 ϕ(q t , α) M m=1 a t m q t m 1 q t m ≥ α M ≤ T t=1 ϕ(q t , α)l t (q t ). Next, we bound TV q1:T -TV q 1:T . Note that TV q1:T = T t=2 qt -qt-1 1 = T t=2 qt -q t + q t -q t-1 + q t-1 -qt-1 1 ≤ T t=2 qt -q t 1 + T t=2 q t -q t-1 1 + T t=2 q t-1 -qt-1 1 ≤ TV q 1:T + 2 T t=1 qt -q t 1 . We now upper bound T t=1 ∥q t -q t ∥ 1 . If q t m < α/M , then |q t m -q t m | = α/M -q t m . If q t m ≥ α/M , by equation 25, we have |q t m -q t m | = ω(q t , α) (q t m -α/M ). Therefore, recalling the definition of ψ(q t , α) in equation 9, we have qt -q t 1 = M m=1 α M -q t m 1 q t m < α M + ω(q t , α) M m=1 q t m - α M 1 q t m ≥ α M = 2 M m=1 α M -q t m 1 q t m < α M = 2ψ(q t , α) and TV q1:T -TV q 1:T ≤ 4 T t=1 ψ(q t , α). 
Combining equation 23, equation 24, equation 26, and equation 27, and taking expectation on both sides, we obtain the result.

D.3 COROLLARY 3 AND ITS PROOF

Corollary 3. Suppose the conditions of Theorem 1 hold and Φ = Φ_e. Then for any comparator sequence q^{1:T}, we have

D-Regret_T(q^{1:T}) ≤ log M / η_1 + (2 log(M/α) / η_T) TV(q^{1:T}) + (2M^6 / (K^4 α^6)) Σ_{t=1}^T η_t (ā^t)^2 + (8H / η_T) Σ_{t=1}^T ψ(q^t, α) + Σ_{t=1}^T ϕ(q^t, α) l_t(q^t),

where ā^t := max_{1≤m≤M} a^t_m = max_{1≤m≤M} λ^2_m ∥g^t_m∥^2.

Proof. When Φ = Φ_e, we have ρ = 1 from Pinsker's inequality. Furthermore, ∥ • ∥_⋆ = ∥ • ∥_∞, ∇Φ_e(p) = (log p_1, . . . , log p_M)^⊤, and p̃^t ∈ A = P^{M−1} ∩ [α/M, ∞)^M, t ∈ [T]. Then Q_t = M^3 ā^t / (K^2 α^3) and H = log(M/α) follow by checking the definitions. Finally, to show that D_{Φ_e}(q ∥ p_unif) ≤ log M for all q ∈ A, we note that D_{Φ_e}(q ∥ p_unif) = log M + Σ_{m=1}^M q_m log q_m ≤ log M.

D.4 PROOF OF THEOREM 3

The proof proceeds in two steps. First, we show that there exists an expert learning rate η e ∈ E such that the regret bound for p1:T e is close to equation 12. That is, we show that there exists η e ∈ E such that E T t=1 lt (p t e ; pt ) - T t=1 l t (p t ⋆ ) ≤ 3 √ 2M 3 E [(ā 1 ) 2 ] K 2 α 3 T [log M + 2 log (M/α) E [TV (p 1:T ⋆ )]]. (28) Note that S t ∼ pt . Second, we show that the output of meta-algorithm can track the best expert with small regret. That is, we show that E T t=1 l t (p t ) -E T t=1 lt (p t e ; pt ) ≤ M α T E [ā 1 ] 8K (1 + 2 log E), e ∈ [E]. The theorem follows by combining equation 28 and equation 29. We first prove equation 28. Since 0 ≤ E[TV p 1:T ⋆ ] ≤ 2(T -1), we have min E = K 2 α 3 M 3 ā1 log M 2T ≤ η ⋆ ≤ K 2 α 3 M 3 ā1 log M + 4 log(M/α)(T -1) 2T ≤ max E, where η ⋆ is defined as in equation 11. Thus, there exists η e ∈ E, such that η e ≤ η ⋆ ≤ 2η e . Repeating the proof of equation 22 and proof of Corollary 3, we show that T t=1 lt (p t e ; pt ) - T t=1 lt (p t ⋆ ; pt ) ≤ log M η e + 2 log(M/α) η e TV p 1:T ⋆ + 2η e M 6 K 4 α 6 T t=1 āt 2 , which then implies that E T t=1 lt (p t e ; pt ) - T t=1 lt (p t ⋆ ; pt ) ≤ log M η e + 2 log(M/α) η e E TV p 1:T ⋆ + 2η e M 6 K 4 α 6 T t=1 E āt 2 , Since η ⋆ /2 ≤ η e ≤ η ⋆ , we further have E T t=1 lt (p t e ; pt ) - T t=1 lt (p t ⋆ ; pt ) ≤ 2 log M η ⋆ + 4 log(M/α) η ⋆ E TV p 1:T ⋆ + 2η ⋆ M 6 K 4 α 6 T t=1 E āt 2 ≤ 2 log M η ⋆ + 4 log(M/α) η ⋆ E TV p 1:T ⋆ + 2η ⋆ M 6 T K 4 α 6 E ā1 2 = 3 √ 2M 3 E [(ā 1 ) 2 ] K 2 α 3 T [log M + 2 log (M/α) E [TV (p 1:T ⋆ )]]. Now, equation 28 follows, since E T t=1 lt (p t ⋆ ; pt ) = T t=1 E E S t lt (p t ⋆ ; pt ) = T t=1 E l t (p t ⋆ ) . We prove equation 29 next. Let Le t = t s=1 ls (p s e ; ps ) e ∈ [E], t ∈ [T ]. Recall the update for θ t e in Step 11 of Alg 4. We have θ t e = θ 1 e exp -γ Le t-1 E b=1 θ 1 b exp -γ Lb t-1 , t = 2, . . . T. Let Θ t = E b=1 θ 1 b exp -γ Lb t . 
Then log Θ 1 = log E b=1 θ 1 b exp -γ Lb 1 and, for t ≥ 2, log Θ t Θ t-1 = log   E b=1 θ 1 b exp -γ Lb t-1 exp -γ lt (p t b ; pt ) E b=1 θ 1 b exp -γ Lb t-1   = log E b=1 θ t b exp -γ lt (p t b ; pt ) . We have log Θ T = log Θ 1 + T t=1 log Θ t Θ t-1 = T t=1 log E b=1 θ t b exp -γ lt (p t b ; pt ) ≤ T t=1 -γ E b=1 θ t b lt (p t b ; pt ) + γ 2 M 2 āt 8Kα 2 (Lemma 3) ≤ -γ T t=1 lt (p t ; pt ) + γ 2 M 2 T t=1 āt 8Kα 2 (Jensen's inequality) and log (Θ T ) = log E b=1 θ 1 b exp -γ Lb T ≥ log max 1≤b≤E θ 1 b exp -γ Lb T = -γ min 1≤b≤E Lb T + 1 γ log 1 θ 1 b . Combining the last two displays, we have -γ min 1≤b≤E Lb T + 1 γ log 1 θ 1 b ≤ -γ T t=1 lt (p t ; pt ) + γ 2 M 2 T t=1 āt 8Kα 2 , which implies that T t=1 lt (p t ; pt ) -Le T ≤ γM 2 T t=1 āt 8Kα 2 + 1 γ log 1 θ 1 e ≤ γM 2 T ā1 8Kα 2 + 1 γ log 1 θ 1 e , e ∈ [E]. Taking expectation on both sides, we then have E T t=1 lt (p t ; pt ) -Le T ≤ γM 2 T E ā1 8Kα 2 + 1 γ log 1 θ 1 e Since θ 1 e ≥ 1 E 2 , log 1/θ 1 e ≤ 2 log E. Let γ = 8Kα 2 /(T M 2 E[ā 1 ] ) to minimize the right hand side of the above inequality with log 1/θ 1 e substituted by 1. Then E T t=1 lt (p t ; pt ) -Le T = E T t=1 lt (p t ; pt ) - T t=1 lt (p t e ; pt ) ≤ M α T E [ā 1 ] 8K (1 + 2 log E) , e ∈ [E].

D.5 PROOF OF THEOREM 4

Recall that T b = 2 b-1 and pT b is reinitialized as the uniform distribution. Let D-Regret b = E   T b+1 -1 t=T b l t (p t ) - T b+1 -1 t=T b l t (p t ⋆ )   . Similar to the proof of equation 28 and equation 29, we have D-Regret b ≤ (2C + 1) √ 2M 3 E [(ā T b ) 2 ] K 2 α 3 log M + 2 log (M/α) E TV p T b :(T b+1 -1) ⋆ (T b+1 -T b ) + M α (T b+1 -T b )E [ā T b ] 8K (C + 2 log E b ) ≤ (2C + 1) √ 2M 3 E [(ā 1 ) 2 ] K 2 α 3 [log M + 2 log (M/α) TV (p 1:T ⋆ )]( √ 2) b-1 + M α E [ā 1 ] 8K (C + 2 log E)( √ 2) b-1 , where E is defined in equation 14. Since B = ⌊log 2 (T + 1)⌋, we have T B ≤ T ≤ T B+1 -1, which implies that 1 ≤ T -T B + 1 ≤ T B+1 -T B = 2 B . Thus, we can similarly obtain E T t=T B l t (p t ) - T t=T B l t (p t ⋆ ) ≤ (2C + 1) √ 2M 3 E [(ā 1 ) 2 ] K 2 α 3 [log M + 2 log (M/α) TV (p 1:T ⋆ )]( √ 2) B + M α E [ā 1 ] 8K (C + 2 log E)( √ 2) B . The result follows combing the last two displays.

D.6 PROOF OF THEOREM 2

Our proof follows the similar technique used in the proof of Theorem 2.1 of Ghadimi & Lan (2013) . Our key novel technique is the construction of a ghost subset that is drawn from [M ] from the comparator sampling distribution. The ghost subset is only constructed for theoretical purpose and does not need to be computed in practice. We only show the proof of R 1 , R 0 and R 2 can be then derived in a similar fashion. Let δ t = g t -∇F (w t ). Under Assumption 3, by (1.6) of Ghadimi & Lan (2013) , we have F w t+1 ≤ F w t + ∇F w t , w t+1 -w t + L 2 µ 2 g t 2 = F w t -µ ∇F w t , g t + L 2 µ 2 g t 2 = F w t -µ ∇F w t 2 -µ ∇F w t , δ t + L 2 µ 2 ∇F w t 2 + 2 ∇F w t , δ t + δ t 2 = F w t -µ - L 2 µ 2 ∇F w t 2 -µ -Lµ 2 ∇F w t , δ t + L 2 µ 2 δ t 2 . ( ) Note that E [g t | w t , pt ] = ∇F (w t ), thus we have E [δ t | w t , pt ] = 0, and E ∇F w t , δ t = E E ∇F w t , δ t | w t , pt = 0. On the other hand, we can assume that there is a ghost subset of clients St with | St | = K, which is drawn from [M ] with sampling distribution q ⋆ . Besides, we let gt := 1 M K m∈ St g t m q ⋆ m . Then we have E S t δ t 2 | w t , pt = E S t g t -J t + J t -∇F (w t ) 2 | w t , pt = E S t g t -J t 2 | w t , pt + J t -∇F (w t ) 2 = l t pt -J t 2 + J t -∇F (w t ) 2 = l t (q ⋆ ) -J t 2 + J t -∇F (w t ) 2 + l t pt -l t (q ⋆ ) = E St gt -∇F (w t ) 2 | w t , pt + l t pt -l t (q ⋆ ) . Since E gt -∇F (w t ) 2 | w t , pt = E    1 M K m∈ St g t m q ⋆ m -∇F (w t ) 2 | w t , pt    ≤ 2E    1 M K m∈ St g t m q ⋆ m - 1 M K m∈ St ∇F m (w t ) q ⋆ m 2 | w t , pt    + 2E    1 M K m∈ St ∇F m (w t ) q ⋆ m -∇F (w t ) 2 | w t , pt    = 2 M 2 K 2 E   m∈ St E ∥g t m -∇F m (w t )∥ 2 (q ⋆ m ) 2 | w t , pt   + 2 K 1 M 2 M m=1 ∥∇F m (w t )∥ 2 q ⋆ m -∇F m (w t ) 2 ≤ 2σ 2 M 2 KB M m=1 1 q ⋆ m + 2ζ 2 1 K ≤ 2σ 2 KBα + 2ζ 2 1 K , where the penultimate line follows that E ∥g t m -∇F m (w t )∥ 2 ≤ σ 2 /B and the definition of ζ 1 , and the last line follows that q ⋆ m ≥ α/M . 
Thus, we have E δ t 2 | w t , pt ≤ 2σ 2 KBα + 2ζ 2 1 K + l t pt -l t (q ⋆ ) , which implies that T -1 t=0 E δ t 2 ≤ 2T σ 2 KBα + 2T ζ 2 1 K + E T -1 t=0 l t pt - T -1 t=0 l t (q ⋆ ) = 2T σ 2 KBα + 2T ζ 2 1 K + D-Regret T (q ⋆ ). Combine equation 30, equation 31 and equation 32, we have µ - L 2 µ 2 T -1 t=0 E ∇F w t 2 ≤ F (w 1 ) -F (w T ) + L 2 µ 2 2T σ 2 KBα + 2T ζ 2 1 K + D-Regret T (q ⋆ ) ≤ D F + L 2 µ 2 2T σ 2 KBα + 2T ζ 2 1 K + D-Regret T (q ⋆ ) . Since µ ≤ 1/L, thus (µ -L 2 µ 2 ) = µ(1 -L 2 µ) ≥ µ/2, thus 1 T T -1 t=0 E ∇F w t 2 ≤ 2D F T µ + Lµ 2σ 2 KBα + 2ζ 2 1 K + D-Regret T (q ⋆ ) T . Finally, let µ = min 1/L, (1/σ) D F KBα/(LT ), (1/ζ 1 ) D F K/(T L), D F /(LD-Regret T (q ⋆ )) , (33) then we have 1 T T -1 t=0 E ∇F w t 2 ≲ D F T max L, σ LT D F KBα , ζ 1 T L D F K , LD-Regret T (q ⋆ )) D F + σ √ D F L √ T KBα + ζ 1 √ D F L √ T K + √ D F L D-Regret T (q ⋆ )) T ≲ D F T L + σ LT D F KBα + ζ 1 T L D F K + LD-Regret T (q ⋆ )) D F + σ √ D F L √ T KBα + ζ 1 √ D F L √ T K + √ D F L D-Regret T (q ⋆ )) T ≲ D F L T + σ √ D F L √ T KBα + ζ 1 √ D F L √ T K + √ D F L D-Regret T (q ⋆ )) T .

E USEFUL LEMMAS

Lemma 1. Suppose that f is a differentiable convex function defined on dom f, and X ⊆ dom f is a closed convex set. Then x is the minimizer of f on X if and only if ∇f(x)^⊤(y − x) ≥ 0 for all y ∈ X.

Proof. See Section 4.2.3 of Boyd et al. (2004).

Lemma 2. For q ∈ P^{M−1} we have D_Φ(q ∥ p_unif) ≤ log M, where Φ is the unnormalized negative entropy.

Proof. Since Φ(q) = Σ_{m=1}^M q_m (log q_m − 1) ≤ −1, Φ(p_unif) = −log M − 1, and ⟨∇Φ(p_unif), q − p_unif⟩ = Σ_{m=1}^M (q_m − 1/M) log(1/M) = 0, we have D_Φ(q ∥ p_unif) ≤ log M.

Lemma 3 (Hoeffding's Inequality). Let X be a random variable with a ≤ X ≤ b for a, b ∈ R. Then for all s ∈ R, we have log E[e^{sX}] ≤ s E[X] + s^2 (b − a)^2 / 8.

Proof. See Section 2 of Wainwright (2019).

Lemma 4 (Based on Exercise 26.12 of Lattimore & Szepesvári (2020)). Let α ∈ [0, 1], A = P^{M−1} ∩ [α/M, 1]^M, and D = [0, ∞)^M. Let y ∈ D with y_1 ≤ y_2 ≤ • • • ≤ y_M, let x be the minimizer of Σ_{m=1}^M u_m log(u_m / y_m) over u ∈ A, and let m^⋆ be the smallest integer m such that y_m (1 − ((m − 1)/M) α) > (α/M) Σ_{j=m}^M y_j. Then

x_m = α/M if m < m^⋆, and x_m = (1 − ((m^⋆ − 1)/M) α) y_m / Σ_{n=m^⋆}^M y_n otherwise.

Proof. Consider the following constrained optimization problem:

min_{u∈[0,∞)^M} Σ_{m=1}^M u_m log(u_m / y_m), s.t. Σ_{m=1}^M u_m = 1, u_m ≥ α/M, m ∈ [M].

Since x is the solution to this problem, by the optimality conditions, there exist λ, ν_1, . . . , ν_M ∈ R such that

log(x_m / y_m) + 1 − λ − ν_m = 0, m ∈ [M], (34)
Σ_{m=1}^M x_m = 1, (35)
x_m − α/M ≥ 0, m ∈ [M], (36)
ν_m ≥ 0, m ∈ [M], (37)
ν_m (x_m − α/M) = 0, m ∈ [M]. (38)

By equation 34, we have x_m = y_m exp(−1 + λ + ν_m). By equation 37 and equation 38, when x_m = α/M, we have x_m = y_m exp(−1 + λ + ν_m) ≥ y_m exp(−1 + λ); when x_m > α/M, we have x_m = y_m exp(−1 + λ). Assume that x_1 = • • • = x_{m^⋆−1} = α/M < x_{m^⋆} ≤ • • • ≤ x_M. Then 1 = Σ_{m=1}^M x_m = (m^⋆ − 1) α/M + exp(−1 + λ) Σ_{m=m^⋆}^M y_m, which implies that exp(−1 + λ) = (1 − (m^⋆ − 1) α/M) / Σ_{m=m^⋆}^M y_m. (39) Thus, we have x_{m^⋆} = y_{m^⋆} exp(−1 + λ) = y_{m^⋆} (1 − (m^⋆ − 1) α/M) / Σ_{m=m^⋆}^M y_m > α/M.

We use the mean squared error loss defined as L(w) = (1/M) Σ_{m=1}^M L_m(w), where L_m(w) = (1/(2n_m)) Σ_{i=1}^{n_m} (y_{m,i} − ⟨w, x_{m,i}⟩)^2.
We use stochastic gradient descent to make global updates. At each round $t$, we choose a subset of $K = 5$ clients, denoted $S^t$. For each client $m \in S^t$, we choose a mini-batch of samples $B^t_m$ of size $B = 10$ and compute the mini-batch stochastic gradient. The parameter $w$ is updated as
$$w^{t+1} = w^t + \frac{\mu_{\text{SGD}}}{MKB}\sum_{m \in S^t}\frac{1}{\hat p^t_m}\sum_{i \in B^t_m}\left(y_{m,i} - \langle w^t, x_{m,i}\rangle\right)x_{m,i},$$
where $\mu_{\text{SGD}}$ is the learning rate, set as $\mu_{\text{SGD}} = 0.1$ in simulations. In all experiments, we set $\alpha$ in Adaptive-OSMD Sampler as $\alpha = 0.4$. The tuning parameters for MABS, VRB and Avare are set as in their original papers.

Computational resources and amount of compute. All the computation was done on a personal laptop. The synthetic data experiments were run on a CPU (Intel Core i7-9750H @ 2.60 GHz). Each run of all experiments in this section took less than 10 minutes.

For the training loss, we see that when the heterogeneity level is low (σ = 1.0), uniform sampling performs as well as Adaptive-OSMD Sampler and the theoretically optimal sampling; however, as the heterogeneity level increases, the performance of uniform sampling gradually suffers, and when σ = 10.0, uniform sampling performs poorly. On the other hand, Adaptive-OSMD Sampler performs well across all levels of heterogeneity and is very close to the theoretically optimal sampling. Similarly, for the cumulative regret, when the heterogeneity level is low, the cumulative regret of uniform sampling is close to that of Adaptive-OSMD Sampler; however, when the heterogeneity level increases, the cumulative regret of uniform sampling becomes much larger than that of Adaptive-OSMD Sampler. Based on the above results, we conclude that while the widely used choice of uniform sampling may be reasonable when heterogeneity is low, our proposed sampling strategy is robust across different levels of heterogeneity, and thus should be considered as the default option.
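One round of the importance-weighted update above can be sketched as follows. Dividing each sampled client's gradient by $M \hat p^t_m$ keeps the update unbiased for the average-loss gradient regardless of the sampling distribution. This is a minimal illustration with our own function and variable names, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_round(w, clients_X, clients_y, p, K=5, B=10, lr=0.1):
    """One round of mini-batch SGD with importance-weighted client sampling
    for the least-squares objective. The 1/(M * p_m) factor debiases the
    gradient estimate for any sampling distribution p."""
    M = len(clients_X)
    S = rng.choice(M, size=K, replace=True, p=p)      # sample K clients w.r.t. p
    g = np.zeros_like(w)
    for m in S:
        X, y = clients_X[m], clients_y[m]
        idx = rng.choice(len(y), size=min(B, len(y)), replace=False)
        resid = X[idx] @ w - y[idx]                   # least-squares residuals
        g_m = X[idx].T @ resid / len(idx)             # local mini-batch gradient
        g += g_m / (M * p[m])                         # importance weight 1/(M p_m)
    return w - lr * g / K                             # descent step
```

Taking expectation over both sampling stages recovers $\nabla L(w)$, which is why non-uniform $p$ changes only the variance of the update, not its mean.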

F.2 ADAPTIVE-OSMD SAMPLER VS MABS VS VRB VS AVARE

We compare Adaptive-OSMD Sampler to other bandit feedback online learning samplers: MABS (Salehi et al., 2017), VRB (Borsos et al., 2018) and Avare (Hanchi & Stephens, 2020). Training loss and cumulative regret are shown in Figure 2. While VRB and Avare perform better when the heterogeneity level is low and MABS performs better when the heterogeneity level is high, Adaptive-OSMD Sampler consistently achieves the best performance in both training loss and cumulative regret across all levels of heterogeneity. Thus, we conclude that Adaptive-OSMD Sampler is a better choice than the other online learning samplers.

F.3 ROBUSTNESS OF ADAPTIVE-OSMD SAMPLER TO THE CHOICE OF α

We examine the robustness of Adaptive-OSMD Sampler to the choice of α. We run Adaptive-OSMD Sampler separately for each α ∈ {0.01, 0.1, 0.4, 0.7, 0.9, 1.0}. Note that when α = 1.0, the Adaptive-OSMD Sampler outputs a uniform distribution. Training loss and cumulative regret are shown in Figure 3 . We observe that Adaptive-OSMD Sampler is robust to the choice of α, and performs well as long as α is not too close to zero or too close to one.

F.4 EXPERIMENTS ON SAMPLING WITH REPLACEMENT VS WITHOUT REPLACEMENT

We compare sampling with replacement and sampling without replacement when used together with Adaptive-OSMD sampler. Sampling without replacement is described in Section C.5. Training loss and cumulative regret are shown in Figure 4 . We observe that using sampling with replacement results in a slightly smaller cumulative regret and a slightly better training loss. However, these differences are not significant.
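The without-replacement variant compared above can be sketched as sequential draws with renormalization: after each draw, the chosen client is excluded and the remaining probabilities are rescaled. This is our illustrative sketch of the idea; the paper's Algorithm 6 in Section C.5 may differ in details:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_without_replacement(p, K):
    """Draw K distinct clients sequentially: after each draw, zero out the
    chosen client and renormalize the remaining probabilities. Assumes at
    least K clients have positive probability."""
    p = p.astype(float).copy()
    chosen = []
    for _ in range(K):
        p = p / p.sum()                 # renormalize over remaining clients
        m = int(rng.choice(len(p), p=p))
        chosen.append(m)
        p[m] = 0.0                      # exclude m from subsequent draws
    return chosen
```

Note that the marginal inclusion probabilities under this scheme no longer equal the original $p_m$, which is one reason the variance analysis becomes more delicate without replacement.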

F.5 DYNAMIC SAMPLING DISTRIBUTION VS FIXED SAMPLING DISTRIBUTION

In this paper, we allow both our sampling distribution and the competitor sampling distribution to change over time, while previous studies either use a fixed sampling distribution (Zhao & Zhang, 2015; Needell et al., 2016) or compare against a fixed sampling distribution (Namkoong et al., 2017; Salehi et al., 2017; Borsos et al., 2018; 2019). In this section, we show that under certain settings, a dynamic sampling distribution can achieve a significant advantage over a fixed sampling distribution, here the Lipschitz constant based importance sampling distribution $p^{\text{IS}}$ of Zhao & Zhang (2015); Needell et al. (2016). The scale factors $s_m$ are generated i.i.d. from $N(1.0, 0.1^2)$, and the entries of $w^\star$ are generated i.i.d. from $e^{N(0,\nu^2)}$; therefore, $\nu$ controls the variance of the entries of $w^\star$. In addition, we choose the optimal stepsize from the set {1.0, 0.5, 0.1, 0.05, 0.01} for each method separately. The final result is shown in Figure 5. We see that Adaptive-OSMD Sampler performs better than $p^{\text{IS}}$ across all levels of $\nu$. Note that, in practice, implementing $p^{\text{IS}}$ requires prior knowledge of the Lipschitz constants of the $L_m(\cdot)$'s, while Adaptive-OSMD Sampler requires no such prior information. Thus, our proposed method not only achieves better practical performance but also requires less prior information.

F.6 EMPIRICAL OBSERVATION OF REGRET

In Corollary 1, we see that the upper bound for the regret of the OSMD sampler with respect to any comparator sequence $q^{1:T}$ depends on two important quantities: $\sum_{t=1}^T (\bar a_t)^2$ and $\mathrm{TV}(q^{1:T})$. In this section, we first empirically show how $\bar a_t$ and $\sum_{l=1}^t (\bar a_l)^2$ grow with $t$ in practice. Then, we use $p^{1:T}_\star$ as the comparator and show empirically how $\mathrm{TV}(p^{1:t}_\star)$ grows with $t$. Finally, we also empirically show how the regret $\text{D-Regret}_T(p^{1:T}_\star)$ grows. The experimental setting of this section is the same as in Section F. In Figure 6, we plot both $\bar a_t$ and $\sum_{l=1}^t (\bar a_l)^2$ over $t$. We see that $\bar a_t$ drops to zero quickly. As a result, $\sum_{l=1}^t (\bar a_l)^2$ converges to a constant as $t \to \infty$, so empirically $\sum_{l=1}^t (\bar a_l)^2 = O(1)$. In Figure 7, we show how $\mathrm{TV}(p^{1:t}_\star)$ grows with $t$. We see that $\mathrm{TV}(p^{1:t}_\star) = O(t)$ empirically, which is consistent with the worst-case upper bound. Based on this result, the best upper bound on $\text{D-Regret}_T(p^{1:T}_\star)$ we can hope for in practice is $O(\sqrt{T})$. However, as shown in Figure 8, in practice $\text{D-Regret}_T(p^{1:T}_\star)$ converges to a constant. This observation indicates that the upper bound we obtained may still be loose compared with the empirical behavior.
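The total variation of a comparator sequence tracked above is simply the $\ell_1$ path length of the sequence of distributions. A minimal sketch for computing it empirically (our helper, matching the paper's $\ell_1$-based definition of TV):

```python
import numpy as np

def total_variation(p_seq):
    """Path length TV(p^{1:T}) = sum_t || p^{t+1} - p^t ||_1 of a sequence
    of sampling distributions, each given as a NumPy probability vector."""
    return sum(np.abs(p_seq[t + 1] - p_seq[t]).sum()
               for t in range(len(p_seq) - 1))
```

A constant sequence has TV equal to zero, so the quantity directly measures how much the comparator drifts over the horizon.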

G REAL DATA EXPERIMENT

We compare Adaptive-OSMD Sampler with uniform sampling and other online learning samplers, including MABS (Salehi et al., 2017), VRB (Borsos et al., 2018) and Avare (Hanchi & Stephens, 2020), on real data. We use three commonly used computer vision data sets: MNIST (LeCun & Cortes, 2010), KMNIST (Clanuwat et al., 2018), and FMNIST (Xiao et al., 2017). We set the number of devices to M = 500. To better simulate the situation where our method brings a significant convergence speed improvement, we create a highly skewed sample size distribution of the training set among clients: 65% of clients have only one training sample, 20% of clients have 5 training samples, 10% of clients have 30 training samples, and 5% of clients have 100 training samples. This setting illustrates a real-life situation where most of the data come from a small fraction of users, while most users have only a small number of samples. Such skewed sample size distributions are common in other FL data sets, such as LEAF (Caldas et al., 2018). The sample size distribution in the training set is shown in Figure 9. In addition, each client has 10 validation samples used to measure the prediction accuracy of the model over the training process. We use a multi-class logistic regression model. For a given grayscale picture with label $y \in \{1, 2, \dots, C\}$, we unroll its pixel matrix into a vector $x \in \mathbb{R}^p$. Given a parameter matrix $W \in \mathbb{R}^{C \times p}$, the training loss function defined in equation 1 is
$$\phi(W; x, y) := l_{\text{CE}}(\varsigma(Wx); y),$$
where $\varsigma(\cdot): \mathbb{R}^C \to \mathbb{R}^C$ is the softmax function defined as
$$[\varsigma(x)]_i = \frac{\exp(x_i)}{\sum_{j=1}^C \exp(x_j)}, \quad \text{for all } x \in \mathbb{R}^C,$$
and
$$l_{\text{CE}}(x; y) = -\sum_{i=1}^C \mathbb{1}(y = i)\log x_i, \quad x \in \mathbb{R}^C, \; y \in \{1, \dots, C\},$$
is the cross-entropy loss. We use the same algorithms and tuning parameters as in Section F. The learning rate in SGD is set to 0.075 for MNIST and KMNIST, and to 0.03 for FMNIST. The total number of communication rounds is set to 1,000.
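The per-sample loss just defined can be sketched as follows; a minimal, numerically stable implementation of the softmax cross-entropy (our code; labels are 0-indexed here, unlike the 1-indexed classes in the text):

```python
import numpy as np

def softmax(z):
    """Softmax with max-shift for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(W, x, y):
    """Multi-class logistic loss phi(W; x, y) = -log [softmax(W x)]_y,
    the per-sample training loss used in the real-data experiments."""
    return -np.log(softmax(W @ x)[y])
```

For instance, at initialization $W = 0$ the predicted distribution is uniform over the $C$ classes, so the loss equals $\log C$ for any input and label.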
In each round of communication, we choose K = 10 clients to participate (2% of the total number of clients). For a chosen client m, we compute its local mini-batch gradient with batch size min{5, n_m}, where n_m is the training sample size on client m. Figure 10 shows both the training loss and the validation accuracy. Each figure shows the average performance over 5 independent runs. We use the same random seed for both Adaptive-OSMD Sampler and competitors, and change random seeds across runs. The main focus is on minimizing the training loss; the validation accuracy is included only for completeness. We observe that Adaptive-OSMD Sampler performs better than uniform sampling and the other online learning samplers across all data sets. Given the cheap computational cost and the significant practical advantage, we recommend using Adaptive-OSMD Sampler as the default option in practice. In addition, for completeness, we also include results under the homogeneous setting where the sample sizes are balanced across clients. The results are shown in Figure 11, where we see that all methods perform similarly.

Computational resources and amount of compute. All the computation was done on a personal laptop. The real data experiments were run on a GPU (NVIDIA GeForce RTX 2070 with Max-Q Design). Each run of all experiments in this section took less than 15 minutes.

H DISCUSSION

We studied the client sampling problem in FL. We proposed an online learning with bandit feedback approach to tackle client sampling. We used online stochastic mirror descent to solve the online learning problem and applied the online ensemble method with the doubling trick to choose the tuning parameters. We established an upper bound on the dynamic regret relative to the theoretically optimal sequence of sampling distributions. The total variation of the comparator sequence is explicitly included in the upper bound as a measure of the difficulty of the problem. Extensive numerical experiments demonstrated the benefits of our approach over both the widely used uniform sampling and other competitors. In this paper, we have focused on sampling with replacement. However, sampling without replacement would ideally be more efficient. In Section C.5, we discussed a natural extension of Adaptive-OSMD Sampler to a setting where sampling without replacement is used. However, this approach does not directly minimize the variance of the gradient $g^t$. When sampling without replacement is used, the variance function becomes more complicated, and the design of an algorithm that directly minimizes the variance is an interesting future direction. Besides, in federated learning, privacy is a major concern. The non-uniform sampling distribution used in this paper may make protecting clients' privacy more challenging than uniform sampling. One possible solution is to add noise to the gradient feedback and protect the clients' privacy under the Differential Privacy (DP) framework (Dwork, 2008). However, the added noise may hurt the performance of our sampling design and increase the regret. Studying the trade-off between privacy protection and regret is an important direction for addressing societal concerns in real-world applications.
Other fruitful future directions include the design of sampling algorithms for minimizing personalized FL objectives and sampling under physical constraints in FL systems, which we discuss next. Decomposing the gradient allows us to better minimize the variance; we term this approach doubly variance reduction for personalized federated learning. The first part minimizes the variance of updates to the shared global parameter when the best local parameters are fixed, and the second part minimizes the variance of updates to the local parameters when the global part is fixed. While these two parts are related, any given machine will contribute differently to the two tasks. Adaptive-OSMD Sampler can be used to minimize the variance of both parts of the gradient. We note that this is a heuristic approach to the client sampling problem when minimizing a personalized FL objective. Personalized FL objectives have additional structure that should be exploited to design more efficient sampling strategies. Furthermore, designing sampling strategies that improve the statistical performance of trained models, rather than the computational speed, is important in the heterogeneous setting. Addressing these questions is an important area for future research.

H.2 SAMPLING WITH PHYSICAL CONSTRAINT IN FL SYSTEM

In this paper, we assume that all clients are available in each round. However, in practical FL applications, a subset of clients may be inactive due to physical constraints, and we then have to assign zero probability to them. In this section, we propose a simple extension of our sampling method to such a case. Specifically, denote the subset of clients that are active at the beginning of round $t$ by $I_t \subseteq [M]$. If $|I_t| \le K$, we can use all clients in $I_t$ to make updates in round $t$; otherwise, we would like to choose a smaller subset $S^t \subseteq I_t$ to participate. This can be achieved by rescaling the output sampling distribution of any of our proposed methods, which we denote $\hat p^t$: we let $\tilde p^t_m = \hat p^t_m / \sum_{i \in I_t} \hat p^t_i$ for $m \in I_t$ and $\tilde p^t_m = 0$ for all $m \notin I_t$, and then use $\tilde p^t$ to choose $S^t$ from $I_t$. However, analyzing such a method in terms of convergence and regret guarantees is highly nontrivial. Typically, for a general active-client sequence $\{I_t\}_{t=1}^T$, optimization algorithms are not guaranteed to converge even if we involve all clients in $I_t$ in each round. This can happen, for example, if a client is active only once during the whole training process. Thus, to ensure convergence, we need additional assumptions on $\{I_t\}_{t=1}^T$. Moreover, deriving a regret bound is also very challenging, as assigning zero probability to any client makes the variance-reduction loss unbounded, so the regret can be arbitrarily large. To achieve such theoretical results, one may need to appropriately redefine the notion of regret. Such an analysis is beyond the scope of this paper and we leave it for future research.
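The rescaling step described above is a one-line operation; here is a minimal sketch (our helper, assuming at least one active client has positive probability):

```python
import numpy as np

def restrict_to_active(p, active):
    """Rescale a sampling distribution to the active clients I_t: zero out
    inactive clients and renormalize the rest. `active` is a boolean mask
    over the M clients; assumes the active mass is positive."""
    q = np.where(active, p, 0.0)
    return q / q.sum()
```

The resulting $\tilde p^t$ is a valid distribution supported on $I_t$ and proportional to $\hat p^t$ there, matching the extension proposed in the text.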



The randomness comes from two sampling processes: the first happens at the client level, and the second happens locally when choosing samples. To ease understanding, one may treat $g^t_m$ as the full local gradient.

The variance-reduction loss $\ell_t(\cdot)$ should be distinguished from the training loss $\phi(\cdot)$: while the former is always convex, $\phi(\cdot)$ can be non-convex.

We use the term learning rate when discussing an online algorithm that learns a sampling distribution, while the term stepsize is used in the context of an optimization algorithm.

We refer the reader to Chapter 2 of Cesa-Bianchi & Lugosi (2006) for an overview of prediction-with-expert-advice algorithms.

Yann LeCun and Corinna Cortes hold the copyright of the MNIST dataset, which is a derivative work from the original NIST datasets. The MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license. The KMNIST dataset is licensed under a permissive CC BY-SA 4.0 license, except where specified within some benchmark scripts. The FMNIST dataset is under The MIT License (MIT), Copyright © 2017 Zalando SE, https://tech.zalando.com.



Here $\phi(w; D_m)$ is the loss function used to assess the quality of a machine learning model parameterized by the vector $w$, based on the local data $D_m$ on client $m \in [M]$. The parameter $\lambda_m$ denotes the weight of client $m$. Typically, $\lambda_m = n_m/n$, where $n_m = |D_m|$ is the number of samples on client $m$ and $n = \sum_{m=1}^M n_m$ is the total number of samples.

For all w ∈ W and m ∈ [M ], we have E ξ∼Dm [∇ϕ(w; ξ)] = ∇F m (w) and

we assume that $\zeta_0 < \infty$. The quantity $\zeta_0$ has been commonly used to quantify first-order heterogeneity (Karimireddy et al., 2020a;b), while $\zeta_1$ and $\zeta_2$ are variants of $\zeta_0$ corresponding to different sampling schemes. More specifically, when we use $q^\star$ and $p^{1:T}_\star$

extended importance sampling to mini-batches. Stich et al. (2017); Johnson & Guestrin (2018); Gopal (2016) developed adaptive sampling strategies that allow the sampling distribution to change over time. Nesterov (2012); Perekrestenko et al. (2017); Zhu et al. (2016); Salehi et al. (2018) discussed importance sampling in stochastic coordinate descent methods. Namkoong et al. (2017); Salehi et al. (2017); Borsos et al. (2018; 2019); Hanchi & Stephens (2020) illustrated how to design the sampling distribution by solving an online learning task with bandit feedback.

Let Φ = Φ e and we use ∥ • ∥ = ∥ • ∥ 1 to define total variation. Let α be small enough such that p t ⋆ ∈ A for all t ∈ [T ]. Let p1:T be the output of Algorithm 4 with γ = α M 8K T E[ā 1 ] , p init = p unif and E as in equation 13. Then D

Algorithm 5 Adaptive-OSMD Sampler with Doubling Trick (Adaptive-Doubling-OSMD)
1: Input: parameter α.
2: Output: $\hat p^t$ for $t = 1, \dots, T$.
3: Use $w^0$ to get $\{a^0_m\}_{m \in S^0}$, where $S^0$ is the set of responsive clients in the pre-training phase.
4: Initialize $\hat a_1 = \max_{m \in S^0} a^0_m$ and set $b = 1$.
5: while True do: obtain $\{\hat p^t\}_{t=2^{b-1}}^{2^b - 1}$ from Algorithm 4 with parameters $\gamma_b$, $\mathcal{E}_b$, α, the number of iterations $2^{b-1}$, and the initial distribution $p^{\text{unif}}$ (when $b = 1$) or $\hat p^{2^{b-1}-1}$ (when $b > 1$).

$\max_{m \in [M]} a^{T_b}_m$.
14: end while

Algorithm 6 Adaptive sampling without replacement
1: Input: $w^1$ and $\hat p^1$.
2: for $t = 1, 2, \dots, T$ do
/* the sampling distribution for sampling the k-th client in the t-th round */
6:

show that the solution $\hat p^{t+1}$ in Step 7 of Algorithm 1 can be found as
$$\tilde p^{t+1} = \arg\min_{p \in \mathcal{D}} \; \eta_t \langle p, \nabla \hat\ell_t(p^t; \hat p^t)\rangle + D_{\Phi_e}(p \,\|\, \hat p^t), \qquad \hat p^{t+1} = \arg\min_{p \in \mathcal{A}} D_{\Phi_e}(p \,\|\, \tilde p^{t+1}).$$
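The two-step update displayed above can be sketched in code: with the unnormalized negative entropy as mirror map, the first step is a multiplicative-weights update and the second is the water-filling KL projection of Lemma 4. This is our illustrative sketch (assuming $\alpha < 1$), not the paper's implementation:

```python
import numpy as np

def osmd_step(p, grad, eta, alpha):
    """One OSMD update: multiplicative-weights step p * exp(-eta * grad),
    followed by the KL projection onto the clipped simplex
    {q : sum(q) = 1, q_m >= alpha/M} via water-filling (Lemma 4)."""
    y = p * np.exp(-eta * grad)            # mirror step: unprojected iterate
    M = len(y)
    order = np.argsort(y)
    ys = y[order]
    suffix = np.cumsum(ys[::-1])[::-1]     # suffix sums: sum_{n >= m} y_n
    xs = np.empty(M)
    for m in range(M):                     # smallest unclipped index
        scale = (1.0 - m * alpha / M) / suffix[m]
        if ys[m] * scale > alpha / M:
            xs[:m] = alpha / M
            xs[m:] = ys[m:] * scale
            break
    out = np.empty(M)
    out[order] = xs                        # undo the sort
    return out
```

With a zero gradient the update is a no-op, while a large loss on one client shrinks its probability down to, at most, the floor $\alpha/M$.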

and $\Phi = \Phi_e$ is the unnormalized negative entropy on $\mathcal{D}$. For $y \in [0, \infty)^M$, let $x = \arg\min_{v \in \mathcal{A}} D_\Phi(v \,\|\, y)$. Suppose $y_1 \le y_2 \le \cdots \le y_M$. Let $m^\star$ be the smallest value such that

Figure 1: The training loss (top row) and the cumulative regret (bottom row) of Adaptive-OSMD Sampler vs Uniform vs Optimal with σ = 1.0, σ = 3.0, and σ = 10.0. The solid line denotes the mean and the shaded region covers mean ± standard deviation across independent runs. The results of the training process and the cumulative regret are shown in Figure 1.

Figure 2: The training loss (top row) and the cumulative regret (bottom row) of Adaptive-OSMD Sampler vs MABS vs VRB vs Avare with σ = 1.0, σ = 3.0 and σ = 10.0. The solid line denotes the mean and the shaded region covers mean ± standard deviation across independent runs.

Figure 3: The training loss (top row) and the cumulative regret (bottom row) of Adaptive-OSMD Sampler with different choices of α under σ = 1.0, σ = 3.0 and σ = 10.0. The solid line denotes the mean and the shaded region covers mean ± standard deviation across independent runs.

Figure 4: The training loss (top row) and the cumulative regret (bottom row) of Adaptive-OSMD Sampler with replacement vs without replacement across σ = 1.0, σ = 3.0 and σ = 10.0. The solid line denotes the mean and the shaded region covers mean ± standard deviation across independent runs.

Figure 5: The training loss (top row) and the cumulative regret (bottom row) of Adaptive-OSMD Sampler vs p IS across ν = 1.0, ν = 3.0 and ν = 10.0. The solid line denotes the mean and the shaded region covers mean ± standard deviation across independent runs.

Figure 6: Plots of how āt and t l=1 (ā l ) 2 grow with t.

Figure 7: Plots of how TV(p 1:t ⋆ ) grows with t.

Figure 8: Plots of how D-Regret T (p 1:T ⋆ ) grows with t.

Figure 9: The sample size distribution in the training set across clients.

Figure 10: Comparison between Adaptive-OSMD Sampler, uniform sampler and other online learning samplers on real data in terms of training loss (top row) and validation accuracy (bottom row).Different columns correspond to different data sets. The solid line represents the mean and the shadow area represents mean ± 0.5 × standard error. Adaptive-OSMD Sampler is both faster and more stable. The result is the average performance over 5 independent runs.

Figure 11: Comparison between Adaptive-OSMD Sampler, uniform sampler and other online learning samplers on real data in terms of training loss (top row) and validation accuracy (bottom row) under balanced samples size. Different columns correspond to different data sets. We see that all methods perform similarly.

Algorithm 3 Mini-batch SGD with OSMD Sampler
1: Input: number of communication rounds $T$, number of clients chosen in each round $K$, local batch size $B$, initial model parameter $w^0$, stepsizes $\{\mu_t\}_{t=0}^{T-1}$.
2: Output: the final model parameter $w^T$.
3: Initialize: $\hat p^0 = p^{\text{unif}}$.
4: for $t = 0, 1, \dots, T-1$ do
5: Sample $S^t$ with replacement from $[M]$ with probability $\hat p^t$, such that $|S^t| = K$.

$)^M$; number of iterations $T$; initial distribution $p^{\text{init}}$.
: for $t = 1, 2, \dots, T-1$ do
, which achieves performance close to the best expert. Algorithm 4 details Adaptive-OSMD Sampler. Note that since we can compute $\hat\ell_t(p^t$

More specifically, we compare Adaptive-OSMD Sampler with the Lipschitz constant based importance sampling distribution proposed by Zhao & Zhang (2015); Needell et al. (2016), which we denote $p^{\text{IS}}$.

We design $p^t_1$ and $p^t_2$ to choose $S^t_1$ and $S^t_2$ by minimizing the variance of the gradients. Note that $\mathbb{E}\,\mathbb{E}_{S^t}\big[\|g^t_1 - \nabla_1 h(w^t)\|^2 + \|g^t_2 - \nabla_2 h(w^t)\|^2\big]$


which then implies equation 41.

F SYNTHETIC EXPERIMENTS

In this section, we use synthetic data to demonstrate the performance of Adaptive-OSMD Sampler. We compare our method against uniform sampling in Section F.1 and against other bandit feedback online learning samplers in Section F.2. In addition, we examine the robustness of Adaptive-OSMD Sampler to the choice of α in Section F.3, while in Section F.4 we compare it with the sampling without replacement variant discussed in Section C.5. Finally, in Section F.6, we present some empirical observations related to the regret analysis. We generate data as follows. We set the number of clients to M = 100. Samples on each client are generated from a linear model whose coefficient vector $w^\star \in \mathbb{R}^d$ has elements generated i.i.d. from $N(10, 3)$; the feature vector $x_{m,i} \in \mathbb{R}^d$ is generated as $x_{m,i} \sim N(0, \Sigma_m)$, where $\Sigma_m = s_m \cdot \Sigma$ and $\Sigma$ is a diagonal matrix with $\Sigma_{jj} = \kappa^{(j-1)/(d-1)-1}$, with $\kappa > 0$ the condition number of $\Sigma$. We generate $\{s_m\}_{m=1}^M$ i.i.d. from $e^{N(0,\sigma^2)}$ and rescale them. In this setting, κ controls the difficulty of each problem when solved separately, while σ controls the level of heterogeneity across clients. In all experiments, we fix κ = 25, which corresponds to a hard problem, and vary σ to simulate different heterogeneity levels. We expect uniform sampling to suffer when the heterogeneity level is high. The dimension of the problem is set to d = 10. The results are averaged over 10 independent runs.
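The generation scheme above can be sketched as follows. This is our illustrative reconstruction: the per-client sample size $n$, the unit-mean rescaling of $s_m$, the unit observation noise, and reading $N(10, 3)$ as variance 3 are all assumptions filling details not fully specified in the text:

```python
import numpy as np

def generate_clients(M=100, n=50, d=10, kappa=25.0, sigma=1.0, seed=0):
    """Heterogeneous linear-regression clients as in Section F:
    Sigma_jj = kappa^{(j-1)/(d-1) - 1} gives Sigma condition number kappa,
    and per-client scales s_m ~ exp(N(0, sigma^2)) control heterogeneity."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(10.0, np.sqrt(3.0), size=d)   # N(10, 3), variance read
    diag = kappa ** (np.arange(d) / (d - 1) - 1.0)    # eigenvalues of Sigma
    s = np.exp(rng.normal(0.0, sigma, size=M))        # log-normal scales
    s /= s.mean()                                     # rescale (assumed)
    clients = []
    for m in range(M):
        X = rng.normal(size=(n, d)) * np.sqrt(s[m] * diag)   # x ~ N(0, s_m Sigma)
        y = X @ w_star + rng.normal(size=n)                  # assumed unit noise
        clients.append((X, y))
    return clients, w_star
```

Larger σ spreads the scales $s_m$ further apart, which is exactly the regime where uniform client sampling is expected to suffer.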

H.1 CLIENT SAMPLING WITH PERSONALIZED FL OBJECTIVE

Data distributions across clients are often heterogeneous. Personalized FL has emerged as one effective way to handle such heterogeneity (Kulkarni et al., 2020). Hanzely et al. (2021) illustrated how many existing approaches to personalization can be studied through a unified framework, and in this section we discuss a natural extension of Adaptive-OSMD Sampler to this personalized objective. Specifically, we study an optimization problem in which $w \in \mathbb{R}^{d_0}$ corresponds to the shared parameter and $\beta = (\beta_1, \dots, \beta_M)$, with $\beta_m \in \mathbb{R}^{d_m}$, corresponds to the local parameters. The objective in equation 45 covers a wide range of personalized federated learning problems (Hanzely et al., 2021). We further generalize the approach and study the following bilevel optimization problem. In what follows, we use $\nabla_w$ to denote the partial derivative with respect to $w$ with $\beta_m$ fixed, $\nabla_{\beta_m}$ to denote the partial derivative with respect to $\beta_m$ with $w$ fixed, and $\nabla$ to denote the derivative with respect to $w$ where $\beta_m(w)$ is treated as a function of $w$. Let $\nabla^2_{\beta_m\beta_m} G_m(w, \beta_m) \in \mathbb{R}^{d_m \times d_m}$ be the Hessian matrix of $G_m$ with respect to $\beta_m$ with $w$ fixed, and $\nabla^2_{w\beta_m} G_m(w, \beta_m) \in \mathbb{R}^{d_0 \times d_m}$ be the matrix of mixed second derivatives of $G_m$ with respect to $w$ and $\beta_m$, with entries indexed by $i = 1, 2, \dots, d_0$ and $j = 1, 2, \dots, d_m$. By the implicit function theorem, $\nabla h(w)$ can be expressed through these quantities. There are two parts to $\nabla h(w^t)$ and, therefore, instead of choosing a single subset of clients for computing both parts, we decouple $S^t$ into two subsets $S^t_1$ and $S^t_2$, with $S^t = S^t_1 \cup S^t_2$. We use clients in $S^t_1$ to compute local updates of the first part, and clients in $S^t_2$ to compute local updates of the second part. To get an estimate of $\nabla h(w)$, we can estimate $\nabla_1 h(w)$ and $\nabla_2 h(w)$ separately and then combine them. Assuming that $g^t_{1,m}$ is an estimate of $\nabla_1 F_m(w, \hat\beta_m(w))$ and $g^t_{2,m}$ is an estimate of $\nabla_2 F_m(w, \hat\beta_m(w))$, we can then construct estimates of $\nabla_1 h(w)$ and $\nabla_2 h(w)$.
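For reference, the implicit-function-theorem computation alluded to above takes the following standard form. This is our hedged reconstruction under the section's notation (the client weights $\lambda_m$ and the exact display in the paper may differ): differentiating the inner first-order condition $\nabla_{\beta_m} G_m(w, \hat\beta_m(w)) = 0$ with respect to $w$ gives the transposed Jacobian of the best response,

```latex
\nabla \hat\beta_m(w)
  \;=\; -\,\nabla^2_{w\beta_m} G_m\big(w, \hat\beta_m(w)\big)\,
          \big[\nabla^2_{\beta_m\beta_m} G_m\big(w, \hat\beta_m(w)\big)\big]^{-1}
  \;\in\; \mathbb{R}^{d_0 \times d_m},
```

so that, by the chain rule,

```latex
\nabla h(w)
  \;=\; \sum_{m=1}^{M} \lambda_m \Big(
      \nabla_w F_m\big(w, \hat\beta_m(w)\big)
      \;+\;
      \nabla \hat\beta_m(w)\, \nabla_{\beta_m} F_m\big(w, \hat\beta_m(w)\big)
    \Big).
```

The first summand is the "first part" of the gradient estimated on $S^t_1$ and the second summand the "second part" estimated on $S^t_2$, matching the decoupling described in the text.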

