ANCHOR SAMPLING FOR FEDERATED LEARNING WITH PARTIAL CLIENT PARTICIPATION

Abstract

In federated learning, partial client participation offers a flexible training strategy, but it degrades training efficiency. In this paper, we propose FedAMD, a framework that improves the convergence properties while maintaining this flexibility. Its core idea is anchor sampling, which disjoints the partial participants into an anchor group and a miner group. Each client in the anchor group aims at its local bullseye by computing the gradient on a large batch. Guided by the bullseyes, clients in the miner group steer multiple near-optimal local updates using small batches and drive the global model update. With the joint efforts of both groups, FedAMD accelerates the training process and improves the model performance. Measured by ϵ-approximation and compared to state-of-the-art first-order methods, FedAMD achieves convergence with up to O(1/ϵ) fewer communication rounds under non-convex objectives. In particular, it attains a linear convergence rate under the PL condition. Empirical studies on real-world datasets validate the effectiveness of FedAMD and demonstrate the superiority of the proposed algorithm: not only does it considerably save computation and communication costs, but it also significantly improves the test accuracy.

1. INTRODUCTION

Federated learning (FL) (Konečnỳ et al., 2015; 2016; McMahan et al., 2017) has attracted increasing interest over the past few years. As a distributed training paradigm, it enables a group of clients to collaboratively train a global model from decentralized data under the orchestration of a central server. By this means, sensitive information is largely protected because the raw data are not shared across the clients. Due to unreliable network connections and the rapid proliferation of FL clients, it is infeasible to require all clients to be simultaneously involved in the training. To address this issue, recent works (Li et al., 2019b; Philippenko & Dieuleveut, 2020; Gorbunov et al., 2021a; Karimireddy et al., 2020b; Yang et al., 2020; Li et al., 2020; Eichner et al., 2019; Yan et al., 2020; Ruan et al., 2021; Gu et al., 2021; Lai et al., 2021) introduce a practical setting where merely a portion of clients participates in the training. The partial-client scenario effectively avoids network congestion at the FL server and significantly shortens the idle time compared to traditional large-scale machine learning (Zinkevich et al., 2010; Bottou, 2010; Dean et al., 2012; Bottou et al., 2018). However, a model trained with partial client participation is much worse than one trained with full client participation (Yang et al., 2020). This phenomenon is attributable to two causes, namely, data heterogeneity (a.k.a. non-i.i.d. data) and the lack of inactive clients' updates. With data heterogeneity, the optimal model is subject to the local data distribution, and therefore, the local updates on the clients' models deviate greatly from the update towards the optimal global parameters (Karimireddy et al., 2020b; Malinovskiy et al., 2020; Pathak & Wainwright, 2020; Wang et al., 2020; 2021; Mitra et al., 2021; Rothchild et al., 2020; Zhao et al., 2018; Wu et al., 2021).
FedAvg (McMahan et al., 2017; Li et al., 2019b; Yu et al., 2019a; b; Stich, 2018), for example, is less likely to follow a correct update towards the global minimizer because the model aggregation over the active clients critically deviates from the aggregation over all clients, which is the expected direction towards the global minimizer (Yang et al., 2020). As a family of practical solutions to data heterogeneity, variance-reduced techniques (Karimireddy et al., 2020b; Gorbunov et al., 2021a; Wu et al., 2021; Gorbunov et al., 2021b; Liang et al., 2019; Shamir et al., 2014; Li et al., 2019a; 2021b; Karimireddy et al., 2020a; Murata & Suzuki, 2021) achieve an improved convergence rate over FedAvg. With multiple local updates, each client corrects its SGD steps with reference to an estimated global target, which is synchronized at the beginning of every round.

Table 1: Number of communication rounds that achieve E∥∇F(x̄_out)∥²₂ ≤ ϵ for non-convex objectives (or E F(x̄_out) − F* ≤ ϵ under the PL condition or strong convexity with parameter µ). We optimize an online scenario and set the small batch size to 1. The symbol ✓ or ✗ for "Partial Clients" is determined by footnote 1. (For FedAMD (Constant) (Corollary 3): Partial Clients ✓, with O(1/µ + M/(µ²A) + (M/A) log(1/ϵ)) rounds.)

Although, in each communication round, variance-reduced algorithms incur twice the communication overhead of FedAvg, their improved performance is likely to offset the added cost. Recent studies (Gorbunov et al., 2021a; Murata & Suzuki, 2021; Tyurin & Richtárik, 2022; Zhao et al., 2021) have demonstrated the great potential of using large batches under full client participation. Measured by ϵ-approximation, MARINA (Gorbunov et al., 2021a), for instance, realizes an O(1/(Mϵ^{1/2})) speedup when using large batches, where M denotes the number of clients. However, none of the prior studies address the drawbacks of using large batches.
Typically, a large-batch update involves many more gradient computations than a small-batch update. This increases the burden on FL clients, especially IoT devices such as smartphones, because their hardware can hardly accommodate all samples of a large batch simultaneously. Instead, they must partition the large batch into several small batches to obtain the final gradient. Furthermore, given the critical convergence differences between participation modes, the benefit of using large batches under partial client participation is not guaranteed. BVR-L-SGD (Murata & Suzuki, 2021) and FedPAGE (Zhao et al., 2021) claim to work under partial client participation, but they require the participation of all clients whenever the algorithm synchronizes using a large batch. Motivated by the observations above, we propose FedAMD, a federated learning framework with anchor sampling that disjoints the partial participants into two groups, i.e., an anchor group and a miner group. In the anchor group, clients (a.k.a. anchors) compute the gradient on a large batch, cached at the server, to estimate the global orientation. In the miner group, clients (a.k.a. miners) perform multiple local updates corrected according to the previous and current local parameters and the last local update. The latter group serves two purposes. First, multiple local updates without serious deviation effectively accelerate the training process. Second, the global model is updated using the local models from the miner group only. Since anchor sampling disjoints the clients with time-varying probabilities, we separately consider constant and sequential probability settings. Contributions. We summarize our contributions as follows: • Algorithmically, we propose a unified federated learning framework FedAMD that identifies each participant as an anchor or a miner.
Clients in the anchor group aim to obtain the bullseyes of their local data with a large batch, while the miners accelerate the training with multiple local updates using small batches. • Theoretically, we establish the convergence rate of FedAMD for non-convex objectives under both constant and sequential probability settings. To the best of our knowledge, this is the first work to analyze the effectiveness of large batches under partial client participation. Our theoretical results indicate that, with a proper setting of the probability, FedAMD achieves a convergence rate of O(M/(AT)) for non-convex objectives, and linear convergence under the Polyak-Łojasiewicz (PL) condition (Polyak, 1963; Lojasiewicz, 1963). Comprehensive comparisons with previous works are presented in Table 1. • Empirically, we conduct extensive experiments to compare FedAMD with the most representative approaches. The numerical results evidence the superiority of our proposed algorithm: to reach the same test accuracy, FedAMD uses less computational power as measured by the cumulative gradient complexity.

2. RELATED WORK

In this section, we discuss the state-of-the-art works that are most relevant to our research. A more comprehensive review is provided in Appendix A.

Mini-batch SGD vs. Local SGD. Distributed optimization is required to train large-scale deep learning systems. Local SGD (also known as FedAvg) (Stich, 2018; Dieuleveut & Patel, 2019; Haddadpour et al., 2019; Haddadpour & Mahdavi, 2019) performs multiple (i.e., K ≥ 1) local updates with K small batches, while mini-batch SGD computes the gradient averaged over K small batches (Woodworth et al., 2020b; a) (or a large batch (Shallue et al., 2019; You et al., 2018; Goyal et al., 2017)) on a given model. There has been a long discussion on which one is better (Lin et al., 2019; Woodworth et al., 2020a; b; Yun et al., 2021), but no existing work considers how to disjoint the nodes such that both can be trained at the same time.

Variance Reduction in FL. Variance-reduced techniques have critically driven the advent of FL algorithms (Karimireddy et al., 2020b; Wu et al., 2021; Liang et al., 2019; Karimireddy et al., 2020a; Murata & Suzuki, 2021; Mitra et al., 2021) by correcting each locally computed gradient with respect to an estimated global orientation. A key concern, however, is how to attain an accurate global orientation that mitigates the drift of local updates from the global model. Roughly, the estimation falls into two types, namely, precalculated and cached. The former methods (Murata & Suzuki, 2021; Mitra et al., 2021), which rely on precalculation, typically require full worker participation, which is infeasible in federated settings. As for the global orientation estimated from cached information, existing approaches (Karimireddy et al., 2020b; Wu et al., 2021; Liang et al., 2019; Karimireddy et al., 2020a) use small batches, which yields a biased estimate and misleads the training.
This work explores the effectiveness of large-batch estimation for the global orientation under partial client participation.

3. FEDAMD

In this section, we comprehensively describe the technical details of FedAMD, a federated learning framework with anchor sampling. Specifically, it disjoints the active participants into an anchor group and a miner group with time-varying probabilities. The pseudo-code is presented in Algorithm 1.

Problem Formulation. In an FL system with a total of M clients, the objective function is formalized as min_{x ∈ R^d} F(x) = (1/M) Σ_{m ∈ [M]} F_m(x) (Line 6).

After the initialization steps above, the algorithm comes to the model training (Lines 7-27). At the beginning of each round t, the server randomly picks an A-client subset A from the M clients (Line 8). Since each client is independently selected without replacement, under the setting of Equation (1), every client has an equal chance of being selected, with probability A/M. The main loop of Algorithm 1 reads:

7:  for t = 0, 1, 2, ... do
8:    Sample clients A ⊆ [M]
9:    Communicate the model x̄_t and the caching gradient ḡ_t = avg(v_t) to clients i ∈ A
10:   Initialize the subsequent caching gradient v_{t+1} = v_t
11:   for i ∈ A in parallel do
12:     if Bernoulli(p_t) == 1 then
13:       v^(i)_{t+1} = ∇f_i(x̄_t, B_{i,t}) using a large batch B_{i,t}
14:       Push v^(i)_{t+1} to the server
15:     else
16:       x^(i)_{t,-1} = x^(i)_{t,0} = x̄_t,  g^(i)_{t,0} = ḡ_t
17:       for k = 0, ..., K − 1 do
18:         Generate a random realization B′_{i,k} ∼ D_i with size b′
19:         g^(i)_{t,k+1} = g^(i)_{t,k} − ∇f_i(x^(i)_{t,k−1}, B′_{i,k}) + ∇f_i(x^(i)_{t,k}, B′_{i,k})
20:         x^(i)_{t,k+1} = x^(i)_{t,k} − η_l · g^(i)_{t,k+1}
21:       end for
22:       ∆x^(i)_t = x̄_t − x^(i)_{t,K}

Subsequently, the server communicates the model x̄_t and the caching gradient ḡ_t = avg(v_t), initialized as ḡ_0 = (1/M) Σ_{m∈[M]} v^(m)_0 (Line 9). With probability p_t, client i ∈ A is assigned to the anchor group (Lines 13-14); otherwise, it joins the miner group (Lines 16-23). The two groups have different objectives and focus on different tasks.

Anchor group. Clients in this group aim to discover the bullseyes of their local data distributions. According to Line 12, client i ∈ A becomes a member of this group with probability p_t. The client then uses a large batch B_{i,t} with b samples to obtain the gradient v^(i)_{t+1} (Line 13); following v^(i)_{t+1} thus leads to an optimal or near-optimal solution for client i. Next, the client pushes the gradient to the server, which updates the caching gradient (Line 14). Since some clients do not participate in the anchor group at round t, the server simply inherits their previous entries from v_t (Line 10). As a result, ḡ_t in Line 9 represents an approximate orientation towards the globally optimal parameters, which directs the local updates in the miner group and affects the final global update. Besides, v^(i)_{t+1} influences the training from round t + 1 until the next time client i is a member of the anchor group.

Miner group. Guided by the global bullseye, clients in the miner group perform multiple local updates and ultimately drive the update of the global model.
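The server-side bookkeeping described above (Lines 9, 10 and 14) can be sketched as follows. This is an illustrative re-implementation, not the authors' code; the function name and the dictionary layout are our own assumptions.

```python
import numpy as np

def server_round_caching(v, anchor_grads):
    """Sketch of the server's caching-gradient bookkeeping for one round.

    v: dict client_id -> cached gradient v_t^(i) (np.ndarray).
    anchor_grads: fresh large-batch gradients v_{t+1}^(i) pushed by this
    round's anchors.  The broadcast direction is the average of the
    *current* caches (Line 9); every client first inherits its previous
    cache (Line 10), then the anchors overwrite their entries (Line 14).
    """
    g_bar = np.mean(list(v.values()), axis=0)  # g_t = avg(v_t)
    v_next = dict(v)                           # v_{t+1} = v_t (inherit)
    v_next.update(anchor_grads)                # anchors refresh entries
    return g_bar, v_next
```

Note that non-anchors never touch their cache entries, which is exactly why ḡ_t only approximates the current global orientation.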
First, client i initializes the model with x̄_t and the target direction with ḡ_t (Line 16). Ideally, in the subsequent K updates (Line 17), client i would update the model with the gradient ∇F(x^(i)_{t,k}) for k ∈ {0, ..., K − 1}. This is impractical because a client cannot access the other clients' training sets to compute the noise-free gradients. Instead, at the k-th iteration, the client generates a b′-sample realization B′_{i,k} (Line 18) and calculates the update g^(i)_{t,k+1} via a variance-reduced technique, i.e., g^(i)_{t,k+1} = g^(i)_{t,k} − ∇f_i(x^(i)_{t,k−1}, B′_{i,k}) + ∇f_i(x^(i)_{t,k}, B′_{i,k}) (Line 19). The update g^(i)_{t,k+1} approximates ∇F(x^(i)_{t,k}) for two reasons: (i) the first term estimates the global update because g^(i)_{t,0} stores the global bullseye; and (ii) the remaining terms remove the perturbation of data heterogeneity and reflect the true update at x^(i)_{t,k}. Therefore, the local model update follows x^(i)_{t,k+1} = x^(i)_{t,k} − η_l · g^(i)_{t,k+1} (Line 20). After K local updates, the model change on client i is ∆x^(i)_t = x̄_t − x^(i)_{t,K}. The client then transmits ∆x^(i)_t to the server for the global model update. The proposed approach possesses three advantages over SCAFFOLD (Karimireddy et al., 2020b) and FedLin (Mitra et al., 2021), which use a fixed correction term, i.e., ḡ_t − v^(i)_t. Firstly, it is memory-efficient, as it is unnecessary to maintain the obsolete gradient. Secondly, it dynamically calibrates the local updates subject to the local model. Although BVR-L-SGD (Murata & Suzuki, 2021) also achieves such functionality, it requires all clients to jointly obtain the global direction at the beginning of each round, leading to considerable training time. This brings the third advantage of FedAMD: it avoids the precalculation of a global bullseye under partial-client scenarios. To the best of our knowledge, this is the first work to achieve dynamic calibration under partial-client scenarios.
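The miner's local loop (Lines 16-22) can be sketched as below. This is a minimal illustration with names of our own choosing; `grad_fn` stands in for the client's stochastic gradient oracle.

```python
import numpy as np

def miner_local_updates(x_bar, g_bar, grad_fn, batches, eta_l):
    """One miner's K local steps (Lines 16-22), as an illustrative sketch.

    grad_fn(x, batch) plays the role of the stochastic gradient
    f_i(x, B'); the recursion g_{k+1} = g_k - grad(x_{k-1}) + grad(x_k)
    stays anchored to the global bullseye g_{t,0} = g_bar.
    """
    x_prev, x = x_bar.copy(), x_bar.copy()  # x_{t,-1} = x_{t,0} = x_bar
    g = g_bar.copy()                        # g_{t,0} = g_bar
    for batch in batches:                   # k = 0, ..., K-1
        g = g - grad_fn(x_prev, batch) + grad_fn(x, batch)
        x_prev, x = x, x - eta_l * g
    return x_bar - x                        # model change sent to the server

# With a noise-free quadratic F(x) = ||x||^2 / 2 the correction terms make
# the loop reduce to plain gradient descent, so the change is easy to verify.
delta = miner_local_updates(np.array([1.0]), np.array([1.0]),
                            lambda x, _: x, [None] * 5, eta_l=0.1)
```

Because x_{t,-1} = x_{t,0}, the first correction cancels and the very first step follows the global bullseye exactly.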
Server (Lines 14 and 26). After the separate local training on the participants, the server merges the model changes from the miner group into ∆x_t (Line 26) and updates the caching gradients from the anchor group (Line 14). Note that the size of ∆x_t (i.e., |∆x_t|, the number of miners that report changes) ranges between 0 and A. When the size is 0, x̄_{t+1} = x̄_t; otherwise, x̄_{t+1} = x̄_t − η_s ∆x_t / |∆x_t| (Line 26). The reason we use only the changes from the miner group is that these clients perform multiple local updates regulated by the global target, so their model changes move towards the globally optimal solution. If we directly incorporated the new gradients from the anchor group, the global model would degrade, because anchors perform a single update that aims at the local bullseye, which deviates from the global target. Implicitly, clients in the miner group absorb the update of the caching gradients at iteration k = 0 to update their local models, which affects the next global parameters.

Previous Algorithms as Special Cases. The probabilities that disjoint the participants into the anchor group and the miner group can vary across rounds. By setting A = M and letting the probabilities {p_t} follow the sequence {1, 0, 1, 0, ...}, FedAMD reduces to distributed minibatch SGD (K = 1) or BVR-L-SGD (K > 1). Therefore, FedAMD subsumes these existing algorithms while also taking partial client participation into consideration. To obtain the best performance, we should tune the settings of {p_t} and K. However, owing to its generality, FedAMD faces substantial additional hurdles in its convergence analysis, which is one of our main contributions, as detailed in the following section.

Discussion on Communication Overhead. Since the anchors do not need the averaged caching gradient (i.e., ḡ_t) at the t-th round, the central server distributes ḡ_t to the miners only.
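The global step of Line 26 can be sketched as follows; this is an illustrative helper of our own, not the paper's code.

```python
import numpy as np

def server_global_update(x_bar, miner_deltas, eta_s):
    """Global step (Line 26): average only the miners' model changes.

    If no sampled client mined this round (|dx_t| = 0), the global model
    stays put; otherwise x_{t+1} = x_t - eta_s * (sum of deltas) / |dx_t|.
    """
    if len(miner_deltas) == 0:
        return x_bar
    return x_bar - eta_s * np.mean(miner_deltas, axis=0)
```

The anchors' fresh gradients never enter this average; they only refresh the server's cache for future rounds.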
Compared to FedAvg, the proposed algorithm requires (1 − p_t)/2 more communication cost per round, but it achieves convergence with at least O(1/ϵ) fewer communication rounds (see Table 1). Therefore, from the perspective of overall training progress, FedAMD is more communication-efficient than FedAvg.

Discussion on Massive-Client Settings. A typical example of this scenario is cross-device FL (Kairouz et al., 2019). In this setting, it is unwise for the server to preserve the caching gradients of all clients. Instead, the clients retain their own caching gradients while the server keeps only their average. At the t-th round, client i ∈ [M] first copies its caching gradient to the (t + 1)-th round, i.e., v^(i)_{t+1} = v^(i)_t. A client i in the anchor group then follows Line 13 of Algorithm 1 to update v^(i)_{t+1} and pushes δ^(i)_t = v^(i)_{t+1} − v^(i)_t to the server. After the server receives the updates of all local caching gradients, it performs v̄_{t+1} = v̄_t + (1/M) δ_t, where δ_t aggregates the δ^(i)_t over all clients i in the anchor group.
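Under these massive-client settings, the server's update of the average cache can be sketched as follows. This is a minimal illustration with our own names, writing `deltas` for the differences of cached gradients pushed by the anchors.

```python
import numpy as np

def merge_anchor_deltas(v_avg, deltas, M):
    """Massive-client variant: the server stores only the average cache.

    Each anchor pushes the difference between its new and old cached
    gradient; the server adds (1/M) times the sum of these differences
    to the stored average, so per-client caches never live on the server.
    """
    if not deltas:
        return v_avg
    return v_avg + np.sum(deltas, axis=0) / M
```

This is equivalent to recomputing avg(v_{t+1}) from scratch, since non-anchors contribute zero difference.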

4. THEORETICAL ANALYSIS

In this section, we analyze the convergence rate of FedAMD for non-convex objectives with respect to ϵ-approximation, i.e., min_{t∈[T]} ∥∇F(x̄_t)∥²₂ ≤ ϵ, and discuss the setting of {p_t} that yields the best performance. Before showing the convergence results, we make the following assumptions: the first two have been widely used in machine learning studies (Karimireddy et al., 2020b; Li et al., 2020), while the last one has been adopted in some recent works (Gorbunov et al., 2021a; Tyurin & Richtárik, 2022; Murata & Suzuki, 2021).

Assumption 1 (L-smooth). The local objective functions are Lipschitz smooth: for all u, v ∈ R^d, ∥∇F_i(u) − ∇F_i(v)∥₂ ≤ L ∥u − v∥₂, ∀i ∈ [M].

Assumption 2 (Bounded Noise). For all v ∈ R^d, there exists a scalar σ ≥ 0 such that E_{B∼D_i} ∥∇f_i(v, B) − ∇F_i(v)∥²₂ ≤ σ²/|B|, ∀i ∈ [M].

Assumption 3 (Average L-smooth). For all u, v ∈ R^d, there exists a scalar L_σ ≥ 0 such that E_{B∼D_i} ∥(∇f_i(u, B) − ∇f_i(v, B)) − (∇F_i(u) − ∇F_i(v))∥²₂ ≤ (L²_σ/|B|) ∥u − v∥²₂, ∀i ∈ [M].

Remark. Assumption 3 provides a tighter bound tailored to the patterns of variance reduction. In fact, with Assumption 2 alone, the term E_{B∼D_i} ∥(∇f_i(u, B) − ∇f_i(v, B)) − (∇F_i(u) − ∇F_i(v))∥²₂ can only be bounded by a constant; from it we can easily obtain a coefficient for ∥u − v∥²₂ with the same structure as the constant on the RHS of Assumption 2. Furthermore, if the loss function is Lipschitz smooth, e.g., the cross-entropy loss (Tewari & Chaudhuri, 2015), we can derive a structure similar to that of Assumption 3.
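The σ²/|B| scaling in Assumption 2 can be checked numerically on toy data. The sketch below is our own construction, not from the paper: it estimates the mean squared deviation of a minibatch gradient from the full gradient for two batch sizes and confirms it shrinks roughly in proportion to 1/|B|.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy per-sample "gradients"; the full gradient is the population mean,
# so the deviation of a minibatch mean plays the role of the noise term.
grads = rng.normal(0.0, 1.0, size=(10000, 5))
full = grads.mean(axis=0)

def mb_noise(batch_size, trials=2000):
    """Empirical E || f(v, B) - F(v) ||^2 for a given |B|."""
    errs = []
    for _ in range(trials):
        idx = rng.choice(len(grads), size=batch_size, replace=True)
        errs.append(np.sum((grads[idx].mean(axis=0) - full) ** 2))
    return float(np.mean(errs))

v8, v32 = mb_noise(8), mb_noise(32)
# Quadrupling |B| should divide the squared deviation by roughly 4.
assert 2.5 < v8 / v32 < 6.0
```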

4.1. SEQUENTIAL PROBABILITY SETTINGS

As mentioned in Section 3, a recursive pattern in the probability sequence {p_t ∈ {0, 1}}_{t≥0} can reduce FedAMD to the existing works. We assume that the caching gradient is updated every τ (≥ 2) rounds, such that

p_t = 1 if t mod τ == 0, and p_t = 0 otherwise.

We derive the following results under sequential probability settings. The corresponding proof is provided in Appendix D.

Theorem 1. Suppose that Assumptions 1, 2 and 3 hold. Let the number of local updates K ≥ 1, the minibatch size b = min{σ²/(Mϵ), n} and b′ < b. Additionally, let the local learning rate η_l and the global learning rate η_s satisfy the following two constraints: (1) η_s η_l = (1/(KL)) (1 + 2Mτ/A)^{−1}; and (2) η_l ≤ min{ 1/(2√6 KL), √(b′/K)/(4√3 L_σ) }. Then, to find an ϵ-approximation for non-convex objectives, i.e., min_{t∈[T]} ∥∇F(x̄_t)∥²₂ ≤ ϵ, the number of communication rounds T performed by FedAMD is

T = O( (1 + 2Mτ/A) · τ/(τ − 1) · 1/ϵ ),

where we treat F(x̄_0) − F* and L as constants.

Discussion on the selection of τ. According to Theorem 1, τ = 2 achieves min_{t∈[T]} ∥∇F(x̄_t)∥²₂ ≤ ϵ with the fewest communication rounds.

Comparison with BVR-L-SGD. As discussed in Section 3, FedAMD reduces to BVR-L-SGD (Murata & Suzuki, 2021) when τ = 2 and all clients participate in the training. In this case, Theorem 1 shows that a total of T = O(1/ϵ) communication rounds is needed. This result coincides with the complexity of BVR-L-SGD in Table 1 under the settings that (1) n/M ≤ 1/ϵ, and (2) K ≥ n/M. In other words, we theoretically prove that BVR-L-SGD still achieves min_{t∈[T]} ∥∇F(x̄_t)∥²₂ ≤ ϵ with T = O(1/ϵ) under looser constraints. As for computation overhead, our proposed method requires O(σ²/ϵ² + MK/ϵ), which is less than BVR-L-SGD (i.e., O(σ²/ϵ² + (σ² + MK)/ϵ + MK)).

4.2. CONSTANT PROBABILITY SETTINGS

Apparently, when we set the constant probability to 1, all participants are in the anchor group, so the model cannot be updated.
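The claim that τ = 2 minimizes the round complexity of Theorem 1 can be checked directly, since the leading factor (1 + 2Mτ/A) · τ/(τ − 1) is explicit. A small numeric sketch with illustrative values of M and A:

```python
def rounds_factor(tau, M, A):
    """Leading factor of T in Theorem 1: (1 + 2*M*tau/A) * tau/(tau - 1)."""
    return (1 + 2 * M * tau / A) * tau / (tau - 1)

# Illustrative sizes: M = 100 clients, A = 10 sampled per round.
M, A = 100, 10
best_tau = min(range(2, 50), key=lambda t: rounds_factor(t, M, A))
assert best_tau == 2  # tau = 2 yields the fewest communication rounds
```

Intuitively, increasing τ saves large-batch rounds (the τ/(τ − 1) factor shrinks) but stales the cached bullseye (the 1 + 2Mτ/A factor grows), and the latter dominates.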
Likewise, when the constant probability is 0, all participants are in the miner group, so the global target is never updated, leading to degraded performance. Therefore, we manually define a constant p ∈ (0, 1) such that {p_t = p}_{t≥0}. In this section, we derive the following results with partial client participation. Detailed proofs are provided in Appendix E; specifically, Appendix E.2 and Appendix E.3 prove Theorem 2 and Theorem 3, respectively.

Theorem 2. Suppose that Assumptions 1, 2 and 3 hold. Let the number of local updates K ≥ max{1, 2L²_σ/(b′L²)}, the minibatch size b = min{σ²/(Mϵ), n} and b′ < b, the local learning rate η_l = 1/(2√6 KL), and the global learning rate η_s = 2√6 / (1 + 2M/(Ap√(1 − p^A))). Then, to find an ϵ-approximation for non-convex objectives, i.e., min_{t∈[T]} ∥∇F(x̄_t)∥²₂ ≤ ϵ, the number of communication rounds T performed by FedAMD is

T = O( (1/ϵ) ( 1/(1 − p^A) + M/(Ap√(1 − p^A)) ) ),

where we treat F(x̄_0) − F* and L as constants.

As the constant probability p approaches 0 or 1, Theorem 2 shows that FedAMD requires a significantly larger number of communication rounds. Hence, there exists an optimal p for which FedAMD converges with the fewest communication rounds. In view that M ≥ A, the number of communication rounds is dominated by O( M/(Ap√(1 − p^A)) · 1/ϵ ). Based on this observation, the following corollary provides the setting of the constant probability p that leads to the optimal convergence result, and refines the settings of the other parameters accordingly. It considers b′ = 1, i.e., a small batch size of 1.

Corollary 2. Suppose that Assumptions 1, 2 and 3 hold. Let the constant probability p = (1/c) (2/(A + 2))^{1/A}, where c is a constant greater than or equal to 1, the number of local updates K ≥ max{1, 2L²_σ/L²}, the minibatch size b = min{σ²/(Mϵ), n} and b′ = 1, the local learning rate η_l = 1/(2√6 KL), and the global learning rate η_s = 2√6 A/(A + 3Mc).
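For c = 1, the closed-form p of Corollary 2 should maximize p²(1 − p^A), equivalently minimize the dominant term M/(Ap√(1 − p^A)) of Theorem 2. A quick grid search (our own check, with an illustrative A) agrees:

```python
def optimal_p(A, c=1.0):
    """Closed-form anchor probability of Corollary 2: (1/c)*(2/(A+2))^(1/A)."""
    return (1.0 / c) * (2.0 / (A + 2)) ** (1.0 / A)

A = 10
p_star = optimal_p(A)                 # c = 1 case
f = lambda p: p * p * (1.0 - p ** A)  # maximizing this minimizes M/(A p sqrt(1-p^A))
p_grid = max((i / 10000 for i in range(1, 10000)), key=f)
assert abs(p_star - p_grid) < 1e-3
```

Setting the derivative 2p − (A + 2)p^{A+1} to zero gives p^A = 2/(A + 2), which is exactly the c = 1 closed form.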
Then, after T = O(M/(Aϵ)) communication rounds, we have min_{t∈[T]} ∥∇F(x̄_t)∥²₂ ≤ ϵ. Therefore, the total number of samples called by all clients (i.e., the cumulative gradient complexity) is O(σ²/ϵ² + MK/ϵ) when optimizing an online scenario.

Discussion on the effectiveness of c. When c = 1, the constant p devised in Corollary 2 minimizes the term (p√(1 − p^A))^{−1}. As the number of participants (i.e., A) grows, the optimal p increases as well and tends to 1, indicating that most participants fall in the anchor group. When optimizing an online scenario, the anchors compute gradients over massive samples, so the computation overhead of a single round becomes unacceptable. By reducing the anchor sampling probability to 1/c of its optimal value, FedAMD consumes up to (c − 1)/c less computation overhead, while its convergence performance is preserved.

Comparison with FedAvg. As a classical algorithm, FedAvg (Yang et al., 2020) requires O(K/(Aϵ²) + 1/ϵ) communication rounds to achieve min_{t∈[T]} ∥∇F(x̄_t)∥²₂ ≤ ϵ, with a cumulative gradient consumption of O(K²/ϵ² + AK/ϵ). Apparently, FedAMD needs O(1/ϵ) fewer communication rounds. As mentioned in Section 3, FedAMD consumes (1 − p)/2 more communication overhead per round than FedAvg; as ϵ approaches 0, the total communication overhead of FedAMD is still far less than that of FedAvg. As for computation overhead, since (Yang et al., 2020) implicitly assumes K ≥ σ², FedAMD is also more computation-friendly than FedAvg.

In addition to generalized non-convex objectives, we investigate the convergence rate under the PL condition, a special case of non-convex objectives. The following assumption describes this case:

Assumption 4 (PL Condition (Karimi et al., 2016)). The objective function F satisfies the PL condition if there exists a scalar µ > 0 such that ∥∇F(v)∥²₂ ≥ 2µ (F(v) − F*), ∀v ∈ R^d.

Under the PL condition, the rest of this section establishes the convergence performance of FedAMD with partial client participation. Theorem 3.
Suppose that Assumptions 1, 2, 3 and 4 hold. Let the number of local updates K ≥ max{1, 2L²_σ/(b′L²)}, the minibatch size b = min{σ²/(Mµϵ), n} and b′ < b, the local learning rate η_l = 1/(2√6 KL), and the global learning rate η_s = min{ 2√6 LAp/(Mµ(1 − p^A)), 2√6/(1 + 16ML/(µAp)) }. Then, to find an ϵ-approximation under the PL condition, i.e., F(x̄_T) − F(x*) ≤ ϵ, the number of communication rounds T performed by FedAMD is

T = O( (1/(µ(1 − p^A))) ( 1 + M/(µAp) + M√(1 − p^A)/(Ap) ) log(1/ϵ) ),

where we treat F(x̄_0) − F* and L as constants.

Similar to Theorem 2, the number of communication rounds of FedAMD is dominated by O( M/(µ²Ap(1 − p^A)) + M/(Ap) ). Based on this approximation, we provide a mathematical expression for the setting of p in Corollary 3, and subsequently adjust the remaining hyper-parameters to obtain the best result for Theorem 3.

Corollary 3. Suppose that Assumptions 1, 2, 3 and 4 hold. Let the constant probability p = (1/c) ( 1 + (A + 1)/(2µ²) − √( ((A + 1)/(2µ²))² + A/µ² ) )^{1/A}, where c is a constant greater than or equal to 1, the number of local updates K ≥ max{1, 2L²_σ/L²}, the minibatch size b = min{σ²/(Mµϵ), n} and b′ = 1, the local learning rate η_l = 1/(2√6 KL), and the global learning rate η_s = min{ √6 AL/(Mµc), 2√6 (1 + 32Mc/(µA))^{−1} }. Then, after T = O( (1/µ + M/(µ²A) + M/A) log(1/ϵ) ) communication rounds, we have F(x̄_T) − F(x*) ≤ ϵ. Therefore, the total number of samples called by all clients (i.e., the cumulative gradient complexity) is O( (A/µ + M/µ² + M) (σ²/(Mµϵ) + K) log(1/ϵ) ) when optimizing an online scenario.

Remark. When c = 1, the probability p for anchor sampling approaches 100% as the number of participants increases. As before, it is advisable to use c ≥ 1 to reduce the computation consumption of each round. Besides, FedAMD achieves linear convergence under the PL condition. Since strongly-convex objectives satisfy the PL condition, i.e., the PL condition is the looser requirement, FedAMD also achieves linear convergence for strongly-convex objectives.

5. EXPERIMENTS

This section presents experiments on our proposed approach and the existing baselines most relevant to this work. We also investigate the effectiveness of the probability {p_t}_{t≥0}. Owing to limited space, numerical analyses of other factors, such as the number of local updates, are presented in the supplementary materials. Each client holds classes with a total of 600 samples, and each label is held by 20 clients. We set the number of local updates K to 10, the mini-batch size b′ to 64, and b to 600. For each experiment, unless the hyper-parameters have been explicitly defined, we use the best setting (e.g., learning rate) to obtain the best results. All numerical results in this section report the average performance over three runs with different random seeds. More empirical results are provided in Appendix F.

Effectiveness of the probability {p_t}_{t≥0}. Figure 1 demonstrates the performance of various probability settings under scenarios with different numbers of participants. In Figure 1a with 20 clients, the sequential probability setting and the constant probability setting both achieve the best performance. In Figures 1b and 1c, where 40 and 100 clients are selected in each round, the constant probability setting outperforms the sequential one. In all three scenarios, under sequential probability settings, the pattern {0, 0, 1} performs much worse than the pattern {0, 1}. This empirically validates Theorem 1, where τ = 2 is the best setting in terms of communication complexity. Similarly, under constant probability settings, the best performance is achieved when p is at or near its optimal value, which validates the statement in Corollary 2.

Comparison with the state-of-the-art works.
Federated Learning. FL was proposed to ensure data privacy and security (Kairouz et al., 2019), and it has since become a hot field in distributed systems (Yuan & Ma, 2020; Shamsian et al., 2021; Zhang et al., 2021; Avdiukhin & Kasiviswanathan, 2021; Yuan et al., 2021; Diao et al., 2020; Blum et al., 2021). FL training methods in the past few years usually required all trainers to participate in each training session (Kairouz et al., 2019), but this is clearly impractical in the face of a growing number of FL clients. To enhance the systems' feasibility, this work assumes that a fixed number of clients are sampled at each round, which is widely adopted in (Li et al., 2019b; Philippenko & Dieuleveut, 2020; Gorbunov et al., 2021a; Karimireddy et al., 2020b; Yang et al., 2020; Li et al., 2020; Eichner et al., 2019; Ruan et al., 2021). The server then collects the data from these participants at every synchronization to update the model parameters (Li et al., 2019b; Philippenko & Dieuleveut, 2020; Gorbunov et al., 2021a; Karimireddy et al., 2020b; Yang et al., 2020; Li et al., 2020; Eichner et al., 2019; Yan et al., 2020; Ruan et al., 2021; Lai et al., 2021; Gu et al., 2021).

Variance Reduction in Finite-sum Problems.
Variance reduction techniques (Johnson & Zhang, 2013; Defazio et al., 2014; Nguyen et al., 2017; Li et al., 2021a; Lan & Zhou, 2018a; b; Allen-Zhu & Hazan, 2016; Reddi et al., 2016; Lei et al., 2017; Zhou et al., 2018; Horváth & Richtárik, 2019; Horváth et al., 2020; Fang et al., 2018; Wang et al., 2018; Li, 2019; Roux et al., 2012; Lian et al., 2017; Zhang et al., 2016) were originally proposed for traditional centralized machine learning to optimize finite-sum problems (Bietti & Mairal, 2017; Bottou & Cun, 2003; Robbins & Monro, 1951) by mitigating the estimation gap between small batches (Bottou, 2012; Ghadimi et al., 2016; Khaled & Richtárik, 2020) and large batches (Nesterov, 2003; Ruder, 2016; Mason et al., 1999). SGD randomly samples a small batch and computes its gradient in order to approach the optimal solution; since the data are generally noisy, an insufficiently large batch results in a degraded convergence rate. By utilizing all data in every update, GD removes the noise affecting the training process. However, it is time-consuming, because the time needed for a single GD step can accommodate multiple SGD updates. Based on this trade-off, variance-reduced methods periodically perform GD steps while correcting the SGD updates with reference to the most recent GD step.

Variance Reduction in FL.

B USEFUL LEMMAS

Prior to giving detailed proofs of the theorems, we cover some technical lemmas in this section; all of them hold in general settings.

Lemma 1. Let $\varepsilon = \{\varepsilon_1, \ldots, \varepsilon_a\}$ be a set of $a$ random variables in $\mathbb{R}^d$, each independent of the others, where, for $i \in \{1, \ldots, a\}$,
$$\varepsilon_i = \begin{cases} e_i, & \text{with probability } q, \\ 0, & \text{otherwise,} \end{cases}$$
for a constant $q \in [0,1]$. Let $|\cdot|$ denote the cardinality of a set and $\varepsilon \setminus \{0\}$ the set of non-zero elements of $\varepsilon$; in particular, $|\varepsilon \setminus \{0\}| = 0$ with probability $(1-q)^a$. Let $\mathrm{avg}(\varepsilon)$ be the average over the non-zero elements, i.e.,
$$\mathrm{avg}(\varepsilon) = \begin{cases} \frac{1}{|\varepsilon \setminus \{0\}|} \sum_{i=1}^{a} \varepsilon_i, & |\varepsilon \setminus \{0\}| \neq 0, \\ 0, & |\varepsilon \setminus \{0\}| = 0. \end{cases} \tag{3}$$
Then, $\mathbb{E}[\mathrm{avg}(\varepsilon)]$ and $\mathbb{E}\|\mathrm{avg}(\varepsilon)\|_2^2$ satisfy
$$\mathbb{E}[\mathrm{avg}(\varepsilon)] = \left(1-(1-q)^a\right) \cdot \frac{1}{a}\sum_{i=1}^{a} e_i; \qquad \mathbb{E}\|\mathrm{avg}(\varepsilon)\|_2^2 \le \left(1-(1-q)^a\right) \cdot \frac{1}{a}\sum_{i=1}^{a} \|e_i\|_2^2. \tag{4}$$

Proof. When $q = 0$, Equation (4) obviously holds because $\mathbb{E}[\mathrm{avg}(\varepsilon)] = 0$ and $\mathbb{E}\|\mathrm{avg}(\varepsilon)\|_2^2 = 0$. When $q = 1$, since $\mathrm{avg}(\varepsilon) = \frac{1}{a}\sum_{i=1}^{a} e_i$, the Cauchy–Schwarz inequality gives $\mathbb{E}\|\mathrm{avg}(\varepsilon)\|_2^2 = \big\|\frac{1}{a}\sum_{i=1}^{a} e_i\big\|_2^2 \le \frac{1}{a}\sum_{i=1}^{a}\|e_i\|_2^2$, which is consistent with Equation (4). It remains to consider the general case $q \in (0,1)$. Firstly, we analyze $\mathbb{E}[\mathrm{avg}(\varepsilon)]$. For each $i \in \{1, \ldots, a\}$, conditioned on $\varepsilon_i$ being non-zero, the coefficient of $e_i$ is determined by the binomial distribution of the number of non-zero elements in $\{\varepsilon_1, \ldots, \varepsilon_{i-1}\} \cup \{\varepsilon_{i+1}, \ldots, \varepsilon_a\}$.
Therefore, since $\varepsilon_i = e_i$ with probability $q$, the coefficient of $e_i$ in the expectation is
$$q\Bigg(\underbrace{\frac{1}{a}\binom{a-1}{a-1}q^{a-1}}_{(a-1)\text{ non-zero elements}} + \cdots + \underbrace{\frac{1}{1}\binom{a-1}{0}(1-q)^{a-1}}_{0\text{ non-zero elements}}\Bigg).$$
The coefficient of $\frac{1}{a}e_i$ can thus be expressed and simplified as
$$q\left(\frac{a}{a}\binom{a-1}{a-1}q^{a-1} + \cdots + \frac{a}{1}\binom{a-1}{0}(1-q)^{a-1}\right) \tag{5}$$
$$= q\left(\binom{a}{a}q^{a-1} + \cdots + \binom{a}{1}(1-q)^{a-1}\right) \tag{6}$$
$$= \binom{a}{a}q^{a} + \cdots + \binom{a}{1}q(1-q)^{a-1} \tag{7}$$
$$= 1-(1-q)^{a}, \tag{8}$$
where Equation (6) applies the identity $\binom{\alpha}{\beta} = \frac{\alpha}{\beta}\binom{\alpha-1}{\beta-1}$ for all $\alpha \ge \beta > 0$, and Equation (8) follows from the binomial theorem $(q+(1-q))^a = \binom{a}{a}q^a + \cdots + \binom{a}{0}(1-q)^a$. Thus, the equation $\mathbb{E}[\mathrm{avg}(\varepsilon)] = (1-(1-q)^a)\cdot\frac{1}{a}\sum_{i=1}^{a} e_i$ holds.

Secondly, we analyze $\mathbb{E}\|\mathrm{avg}(\varepsilon)\|_2^2$. Based on the definition of $\mathrm{avg}(\varepsilon)$ in Equation (3), consider the case $|\varepsilon \setminus \{0\}| \neq 0$. By the Cauchy–Schwarz inequality,
$$\left\|\frac{1}{|\varepsilon\setminus\{0\}|}\sum_{i=1}^{a}\varepsilon_i\right\|_2^2 = \left\|\frac{1}{|\varepsilon\setminus\{0\}|}\sum_{i:\,\varepsilon_i\neq 0}\varepsilon_i\right\|_2^2 \le \frac{1}{|\varepsilon\setminus\{0\}|}\sum_{i:\,\varepsilon_i\neq 0}\|\varepsilon_i\|_2^2 = \frac{1}{|\varepsilon\setminus\{0\}|}\sum_{i=1}^{a}\|\varepsilon_i\|_2^2. \tag{9}$$
Therefore,
$$\|\mathrm{avg}(\varepsilon)\|_2^2 \le \begin{cases} \frac{1}{|\varepsilon\setminus\{0\}|}\sum_{i=1}^{a}\|\varepsilon_i\|_2^2, & |\varepsilon\setminus\{0\}| \neq 0, \\ 0, & |\varepsilon\setminus\{0\}| = 0. \end{cases} \tag{10}$$
Equation (10) has the same form as Equation (3), so the same argument as in the analysis of $\mathbb{E}[\mathrm{avg}(\varepsilon)]$ applies, and we conclude $\mathbb{E}\|\mathrm{avg}(\varepsilon)\|_2^2 \le (1-(1-q)^a)\cdot\frac{1}{a}\sum_{i=1}^{a}\|e_i\|_2^2$.

Lemma 2. Let $\varepsilon = \{\varepsilon_1, \ldots, \varepsilon_a\}$ be a set of $a$ random variables in $\mathbb{R}^d$, not necessarily independent, with $\mathbb{E}[\varepsilon_i] = e_i$ and bounded variance $\mathbb{E}\|\varepsilon_i - e_i\|_2^2 \le \sigma^2$. Then
$$\mathbb{E}\left\|\sum_{i=1}^{a}\varepsilon_i\right\|_2^2 \le \left\|\sum_{i=1}^{a}e_i\right\|_2^2 + a^2\sigma^2. \tag{11}$$
If we further suppose that the conditional mean satisfies $\mathbb{E}[\varepsilon_i \mid \varepsilon_{i-1}, \ldots, \varepsilon_1] = e_i$, then
$$\mathbb{E}\left\|\sum_{i=1}^{a}\varepsilon_i\right\|_2^2 \le 2\left\|\sum_{i=1}^{a}e_i\right\|_2^2 + 2a\sigma^2. \tag{12}$$

Proof. For any random variable $X$, $\mathbb{E}[X^2] = \mathbb{E}\big[(X-\mathbb{E}[X])^2\big] + (\mathbb{E}[X])^2$, implying
$$\mathbb{E}\left\|\sum_{i=1}^{a}\varepsilon_i\right\|_2^2 = \left\|\sum_{i=1}^{a}e_i\right\|_2^2 + \mathbb{E}\left\|\sum_{i=1}^{a}(\varepsilon_i - e_i)\right\|_2^2. \tag{13}$$
Expanding the last term using the relaxed triangle inequality,
$$\mathbb{E}\left\|\sum_{i=1}^{a}(\varepsilon_i - e_i)\right\|_2^2 \le a\sum_{i=1}^{a}\mathbb{E}\|\varepsilon_i - e_i\|_2^2 \le a^2\sigma^2, \tag{14}$$
which proves the first statement. For the second statement, $e_i$ depends on $\varepsilon_{i-1}, \ldots, \varepsilon_1$, so we first apply the relaxed triangle inequality
$$\mathbb{E}\left\|\sum_{i=1}^{a}\varepsilon_i\right\|_2^2 \le 2\left\|\sum_{i=1}^{a}e_i\right\|_2^2 + 2\,\mathbb{E}\left\|\sum_{i=1}^{a}(\varepsilon_i - e_i)\right\|_2^2, \tag{15}$$
and then use a much tighter expansion of the second term:
$$\mathbb{E}\left\|\sum_{i=1}^{a}(\varepsilon_i - e_i)\right\|_2^2 = \sum_{i,j}\mathbb{E}\big[(\varepsilon_i - e_i)^\top(\varepsilon_j - e_j)\big] = \sum_{i=1}^{a}\mathbb{E}\|\varepsilon_i - e_i\|_2^2 \le a\sigma^2, \tag{16}$$
since $\{\varepsilon_i - e_i\}$ forms a martingale difference sequence and the cross terms have zero mean.

Lemma 3. Suppose a sequence $\{y_t \in \mathbb{R}^d\}_{t\ge 0}$ satisfies the recursion $y_{t+1} = y_t - \eta\Delta y_t$, where $\eta > 0$ is a constant and $\Delta y_t \in \mathbb{R}^d$ is a vector. Given an $L$-smooth function $G$, the following inequality holds for any $\eta$, $\Delta y_t$, and any constant $\eta' > 0$:
$$G(y_{t+1}) \le G(y_t) - \frac{\eta\eta'}{2}\|\nabla G(y_t)\|_2^2 - \left(\frac{1}{2\eta\eta'} - \frac{L}{2}\right)\|y_{t+1}-y_t\|_2^2 + \frac{\eta}{2\eta'}\|\Delta y_t - \eta'\nabla G(y_t)\|_2^2. \tag{17}$$

Proof. Since $G$ is $L$-smooth, for any $v, v' \in \mathbb{R}^d$,
$$G(v') = G(v) + \int_0^1 \frac{\partial G(v+t(v'-v))}{\partial t}\,dt \tag{18}$$
$$= G(v) + \int_0^1 \nabla G(v+t(v'-v))\cdot(v'-v)\,dt \tag{19}$$
$$= G(v) + \nabla G(v)^\top(v'-v) + \int_0^1 \big(\nabla G(v+t(v'-v)) - \nabla G(v)\big)\cdot(v'-v)\,dt \tag{20}$$
$$\le G(v) + \nabla G(v)^\top(v'-v) + \int_0^1 L\|t(v'-v)\|_2\,\|v'-v\|_2\,dt \tag{21}$$
$$\le G(v) + \nabla G(v)^\top(v'-v) + \frac{L}{2}\|v'-v\|_2^2. \tag{22}$$
Based on the $L$-smoothness conclusion drawn in Equation (22), we derive Equation (17) step by step:
$$G(y_{t+1}) \le G(y_t) + \langle\nabla G(y_t),\, y_{t+1}-y_t\rangle + \frac{L}{2}\|y_{t+1}-y_t\|_2^2 \tag{23}$$
$$= G(y_t) + \langle\nabla G(y_t),\, -\eta\Delta y_t\rangle + \frac{L}{2}\|y_{t+1}-y_t\|_2^2 \tag{24}$$
$$= G(y_t) - \frac{\eta}{\eta'}\langle\eta'\nabla G(y_t),\, \Delta y_t\rangle + \frac{L}{2}\|y_{t+1}-y_t\|_2^2 \tag{25}$$
$$= G(y_t) - \frac{\eta}{2\eta'}\left(\eta'^2\|\nabla G(y_t)\|_2^2 + \|\Delta y_t\|_2^2 - \|\Delta y_t - \eta'\nabla G(y_t)\|_2^2\right) + \frac{L}{2}\|y_{t+1}-y_t\|_2^2 \tag{26}$$
$$= G(y_t) - \frac{\eta\eta'}{2}\|\nabla G(y_t)\|_2^2 - \left(\frac{1}{2\eta\eta'}-\frac{L}{2}\right)\|y_{t+1}-y_t\|_2^2 + \frac{\eta}{2\eta'}\|\Delta y_t - \eta'\nabla G(y_t)\|_2^2, \tag{27}$$
where Equation (26) is in accordance with $\langle\alpha,\beta\rangle = \frac{1}{2}\big(\|\alpha\|^2 + \|\beta\|^2 - \|\alpha-\beta\|^2\big)$, and Equation (27) follows $\|\Delta y_t\|_2^2 = \frac{1}{\eta^2}\|y_{t+1}-y_t\|_2^2$.

C PRELIMINARY FOR FEDAMD

For a miner client $m$, the overall local update in round $t$ is
$$\Delta x_t^{(m)} = -\big(x_{t,K}^{(m)} - \bar{x}_t\big) = -\sum_{k=0}^{K-1}\big(x_{t,k+1}^{(m)} - x_{t,k}^{(m)}\big) = \sum_{k=0}^{K-1}\eta_l\, g_{t,k+1}^{(m)}, \tag{29}$$
where the last equality is according to Line 20 in Algorithm 1. Next, with the recursive formula in Line 19, we have
$$g_{t,k+1}^{(m)} = g_{t,k}^{(m)} - \nabla f_m\big(x_{t,k-1}^{(m)}, \mathcal{B}'_{m,k}\big) + \nabla f_m\big(x_{t,k}^{(m)}, \mathcal{B}'_{m,k}\big) \tag{30}$$
$$= \bar{g}_t - \sum_{\kappa=0}^{k}\nabla f_m\big(x_{t,\kappa-1}^{(m)}, \mathcal{B}'_{m,\kappa}\big) + \sum_{\kappa=0}^{k}\nabla f_m\big(x_{t,\kappa}^{(m)}, \mathcal{B}'_{m,\kappa}\big). \tag{31}$$
Then, Equation (29) can be rewritten as
$$\Delta x_t^{(m)} = \eta_l K\bar{g}_t - \eta_l\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\nabla f_m\big(x_{t,\kappa-1}^{(m)}, \mathcal{B}'_{m,\kappa}\big) + \eta_l\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\nabla f_m\big(x_{t,\kappa}^{(m)}, \mathcal{B}'_{m,\kappa}\big). \tag{32}$$

D PROOFS UNDER SEQUENTIAL PROBABILISTIC SETTINGS

D.1 PRELIMINARY

Lemma 4. Suppose that Assumptions 1, 2 and 3 hold. Let the local learning rate satisfy $\eta_l \le \min\left\{\frac{1}{2\sqrt{3}KL},\ \frac{1}{2\sqrt{3KL_\sigma^2/b'}}\right\}$. Under FedAMD, $\sum_{k=0}^{K-1}\|x_{t,k}^{(m)} - x_{t,k-1}^{(m)}\|_2^2$ is the sum of the squared norms of the per-iteration differences, and its expectation is bounded as
$$\sum_{k=0}^{K-1}\mathbb{E}\left\|x_{t,k}^{(m)} - x_{t,k-1}^{(m)}\right\|_2^2 \le 6\eta_l^2 K\|\bar{g}_t - \nabla F(\bar{x}_t)\|_2^2 + 6\eta_l^2 K\|\nabla F(\bar{x}_t)\|_2^2. \tag{33}$$
Proof.
According to Equation (31), the update at the $(k-1)$-th iteration is
$$x_{t,k}^{(m)} - x_{t,k-1}^{(m)} = -\eta_l g_{t,k}^{(m)} = -\eta_l\left(\bar{g}_t - \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa-1}^{(m)}, \mathcal{B}'_{m,\kappa}\big) + \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa}^{(m)}, \mathcal{B}'_{m,\kappa}\big)\right). \tag{34}$$
To bound the expectation of its squared norm, we proceed as follows:
$$\mathbb{E}\left\|x_{t,k}^{(m)} - x_{t,k-1}^{(m)}\right\|_2^2 = \eta_l^2\,\mathbb{E}\left\|\bar{g}_t - \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa-1}^{(m)}, \mathcal{B}'_{m,\kappa}\big) + \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa}^{(m)}, \mathcal{B}'_{m,\kappa}\big)\right\|_2^2 \tag{36}$$
$$\le 3\eta_l^2\|\bar{g}_t - \nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\,\mathbb{E}\left\|\sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa-1}^{(m)}, \mathcal{B}'_{m,\kappa}\big) - \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa}^{(m)}, \mathcal{B}'_{m,\kappa}\big)\right\|_2^2 \tag{37}$$
$$= 3\eta_l^2\|\bar{g}_t - \nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\,\mathbb{E}\left\|\sum_{\kappa=0}^{k-1}\big(\nabla F_m(x_{t,\kappa-1}^{(m)}) - \nabla F_m(x_{t,\kappa}^{(m)})\big)\right\|_2^2 + 3\eta_l^2\,\mathbb{E}\left\|\sum_{\kappa=0}^{k-1}\big(\nabla f_m(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa}) - \nabla f_m(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa}) - \nabla F_m(x_{t,\kappa-1}^{(m)}) + \nabla F_m(x_{t,\kappa}^{(m)})\big)\right\|_2^2 \tag{38}$$
$$\le 3\eta_l^2\|\bar{g}_t - \nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2 KL^2\sum_{\kappa=0}^{k-1}\mathbb{E}\big\|x_{t,\kappa-1}^{(m)} - x_{t,\kappa}^{(m)}\big\|_2^2 + 3\eta_l^2\,\mathbb{E}\left\|\sum_{\kappa=0}^{k-1}\big(\nabla f_m(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa}) - \nabla f_m(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa}) - \nabla F_m(x_{t,\kappa-1}^{(m)}) + \nabla F_m(x_{t,\kappa}^{(m)})\big)\right\|_2^2 \tag{39}$$
$$\le 3\eta_l^2\|\bar{g}_t - \nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\left(KL^2 + \frac{L_\sigma^2}{b'}\right)\sum_{\kappa=0}^{k-1}\mathbb{E}\big\|x_{t,\kappa-1}^{(m)} - x_{t,\kappa}^{(m)}\big\|_2^2, \tag{40}$$
where Equation (37) is based on the Cauchy–Schwarz inequality; Equation (38) expands the variance of the third term of Equation (37); Equation (39) applies the Cauchy–Schwarz inequality and Assumption 1 to the third term of Equation (38); Equation (40) applies Lemma 2 and Assumption 3 to the fourth term of Equation (39). Therefore, summing Equation (40) over $k = 1, \ldots, K$, we have
$$\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)} - x_{t,k-1}^{(m)}\big\|_2^2 \le \sum_{k=0}^{K}\mathbb{E}\big\|x_{t,k}^{(m)} - x_{t,k-1}^{(m)}\big\|_2^2 \tag{41}$$
$$\le 3\eta_l^2 K\|\bar{g}_t - \nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2 K\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2 K\left(KL^2 + \frac{L_\sigma^2}{b'}\right)\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)} - x_{t,k-1}^{(m)}\big\|_2^2. \tag{42}$$
By the learning-rate condition stated above, the inequality $3\eta_l^2 K\big(KL^2 + \frac{L_\sigma^2}{b'}\big) \le \frac{1}{2}$ holds.
Therefore, rearranging Equation (42) under this condition yields the claimed bound on the sum of squared per-iteration differences, consistent with Equation (33).
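The local-update identity from Section C (Equations 29 to 32), which the bounds above build on, can be checked numerically on a toy quadratic objective. The convention $x_{t,-1} = \bar{x}_t$ and all constants below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, eta_l = 3, 5, 0.1

H = rng.standard_normal((d, d)); H = H @ H.T + np.eye(d)  # toy quadratic f(x) = x^T H x / 2
grad = lambda x: H @ x        # stands in for the mini-batch gradient ∇f_m(·, B')

x_bar = rng.standard_normal(d)     # global model x̄_t
g_bar = rng.standard_normal(d)     # cached anchor gradient ḡ_t

# Recursive form (Eq. 30 / Lines 19-20 of Algorithm 1):
# g_{k+1} = g_k - ∇f(x_{k-1}) + ∇f(x_k),  x_{k+1} = x_k - η_l g_{k+1}
xs = [x_bar, x_bar]                # convention: x_{-1} = x̄_t (assumption)
g = g_bar
for k in range(K):
    g = g - grad(xs[-2]) + grad(xs[-1])
    xs.append(xs[-1] - eta_l * g)
delta_recursive = x_bar - xs[-1]   # ∆x_t^(m) = -(x_{t,K} - x̄_t)

# Closed form (Eq. 32):
# ∆x = η_l K ḡ_t - η_l ΣΣ ∇f(x_{κ-1}) + η_l ΣΣ ∇f(x_κ)
delta_closed = eta_l * K * g_bar
for k in range(K):
    for kappa in range(k + 1):
        delta_closed -= eta_l * grad(xs[kappa])      # x_{κ-1} (index shifted by the x_{-1} entry)
        delta_closed += eta_l * grad(xs[kappa + 1])  # x_{κ}

assert np.allclose(delta_recursive, delta_closed)
```

The two expressions agree exactly, since Equation (32) is just the unrolled form of the recursion in Equations (30) and (31).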

D.2 FULL CLIENT PARTICIPATION

Theorem 4. Suppose that Assumptions 1, 2 and 3 hold, and all clients participate in the training, i.e., $A = M$. Let the number of local updates $K \ge 1$, and let the local and global learning rates satisfy $\eta_s\eta_l = \frac{1}{KL(1+2\tau)}$ with $\eta_l \le \min\left\{\frac{1}{2\sqrt{6}KL},\ \frac{\sqrt{b'/K}}{4\sqrt{3}L_\sigma}\right\}$. Then the convergence rate of FedAMD for non-convex objectives is
$$\min_{t\in[T]}\|\nabla F(\bar{x}_t)\|_2^2 \le O\left(\frac{1+2\tau}{T-\lfloor T/\tau\rfloor}\right) + O\left(\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}\right), \tag{43}$$
where we treat $F(\bar{x}_0)-F^*$ and $L$ as constants.

Proof. When $p_t = 1$, according to Algorithm 1, there is no model update between two consecutive rounds, i.e., $\bar{x}_{t+1} = \bar{x}_t$. Next, we consider the case $p_t = 0$. Based on Lemma 3, we have
$$\mathbb{E}F(\bar{x}_{t+1}) - F(\bar{x}_t) \le -\frac{\eta_s\eta_lK}{2}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + \frac{\eta_s}{2\eta_lK}\mathbb{E}\|\mathrm{avg}(\Delta x_t) - \eta_lK\nabla F(\bar{x}_t)\|_2^2. \tag{44}$$
Knowing that when $p_t = 0$ and all clients are involved in the training,
$$\mathrm{avg}(\Delta x_t) = \eta_lK\bar{g}_t - \frac{\eta_l}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\nabla f_m\big(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa}\big) + \frac{\eta_l}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\nabla f_m\big(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa}\big),$$
we bound $\mathbb{E}\|\mathrm{avg}(\Delta x_t)-\eta_lK\nabla F(\bar{x}_t)\|_2^2$ through the following derivation:
$$\mathbb{E}\|\mathrm{avg}(\Delta x_t)-\eta_lK\nabla F(\bar{x}_t)\|_2^2 \tag{46}$$
$$= \mathbb{E}\left\|\eta_lK(\bar{g}_t-\nabla F(\bar{x}_t)) + \frac{\eta_l}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\big(\nabla f_m(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa})-\nabla f_m(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa})\big)\right\|_2^2 \tag{47}$$
$$\le 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 2\,\mathbb{E}\left\|\frac{\eta_l}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\big(\nabla f_m(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa})-\nabla f_m(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa})\big)\right\|_2^2 \tag{48}$$
$$= 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 2\,\mathbb{E}\left\|\frac{\eta_l}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\big(\nabla F_m(x_{t,\kappa}^{(m)})-\nabla F_m(x_{t,\kappa-1}^{(m)})\big)\right\|_2^2 + 2\,\mathbb{E}\left\|\frac{\eta_l}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\big(\nabla f_m(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa})-\nabla f_m(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa})-\nabla F_m(x_{t,\kappa}^{(m)})+\nabla F_m(x_{t,\kappa-1}^{(m)})\big)\right\|_2^2 \tag{49}$$
$$\le 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \frac{2\eta_l^2KL^2}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}k\,\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2 + \frac{2\eta_l^2}{M^2}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\frac{L_\sigma^2}{b'}\,\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2 \tag{51}$$
$$\le 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \frac{2\eta_l^2K^3L^2}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 + \frac{2\eta_l^2KL_\sigma^2}{M^2b'}\sum_{m\in[M]}\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 \tag{52}$$
$$\le 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 12\eta_l^4K^2\left(\frac{L_\sigma^2}{Mb'}+K^2L^2\right)\big(\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2+\|\nabla F(\bar{x}_t)\|_2^2\big), \tag{53}$$
where Equation (48) follows $(\alpha+\beta)^2 \le 2\alpha^2+2\beta^2$; Equation (49) is based on variance expansion; Equation (51) applies the Cauchy–Schwarz inequality with Assumption 1 to the second term of (49), and Lemma 2 with Assumption 3 to the third term; Equation (53) is based on Lemma 4. According to the constraints on the local learning rate, we can further simplify Equation (53) as
$$\mathbb{E}\|\mathrm{avg}(\Delta x_t)-\eta_lK\nabla F(\bar{x}_t)\|_2^2 \le 4\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \frac{\eta_l^2K^2}{2}\|\nabla F(\bar{x}_t)\|_2^2. \tag{54}$$
Plugging Equation (54) into Equation (44), we have
$$\mathbb{E}F(\bar{x}_{t+1})-F(\bar{x}_t) \le -\frac{\eta_s\eta_lK}{4}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 2\eta_s\eta_lK\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2. \tag{55}$$
Let $\Lambda(t)$ denote the most recent round before $t$ with $p_{\Lambda(t)} = 1$ and $\Lambda(t) \neq t$; note that applying $\Lambda(\cdot)$ recursively eventually reaches $0$. Summing Equation (55) from $\Lambda(t)$ to $t-1$, we have
$$\mathbb{E}F(\bar{x}_t) - F(\bar{x}_{\Lambda(t)}) = \sum_{\theta=\Lambda(t)}^{t-1}\big(\mathbb{E}F(\bar{x}_{\theta+1}) - F(\bar{x}_\theta)\big) \tag{56}$$
$$\le -\frac{\eta_s\eta_lK}{4}\sum_{\theta=\Lambda(t)+1}^{t-1}\|\nabla F(\bar{x}_\theta)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\sum_{\theta=\Lambda(t)+1}^{t-1}\mathbb{E}\|\bar{x}_{\theta+1}-\bar{x}_\theta\|_2^2 + 2\eta_s\eta_lK\sum_{\theta=\Lambda(t)+1}^{t-1}\mathbb{E}\|\bar{g}_\theta-\nabla F(\bar{x}_\theta)\|_2^2. \tag{57}$$
The last term of Equation (57) is bounded as
$$\mathbb{E}\|\bar{g}_\theta-\nabla F(\bar{x}_\theta)\|_2^2 = \mathbb{E}\|\bar{g}_{\Lambda(\theta)}-\nabla F(\bar{x}_\theta)\|_2^2 \tag{58}$$
$$= \mathbb{E}\|\bar{g}_{\Lambda(\theta)}-\nabla F(\bar{x}_{\Lambda(\theta)})\|_2^2 + \mathbb{E}\|\nabla F(\bar{x}_{\Lambda(\theta)})-\nabla F(\bar{x}_\theta)\|_2^2 \tag{59}$$
$$\le \mathbb{E}\|\bar{g}_{\Lambda(\theta)}-\nabla F(\bar{x}_{\Lambda(\theta)})\|_2^2 + L^2\,\mathbb{E}\|\bar{x}_\theta-\bar{x}_{\Lambda(\theta)}\|_2^2 \tag{60}$$
$$\le \mathbb{E}\|\bar{g}_{\Lambda(\theta)}-\nabla F(\bar{x}_{\Lambda(\theta)})\|_2^2 + L^2\tau\sum_{\Xi=\Lambda(\theta)}^{\theta-1}\mathbb{E}\|\bar{x}_{\Xi+1}-\bar{x}_\Xi\|_2^2 \tag{61}$$
$$\le \mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + L^2\tau\sum_{\Xi=\Lambda(\theta)}^{\theta-1}\mathbb{E}\|\bar{x}_{\Xi+1}-\bar{x}_\Xi\|_2^2, \tag{62}$$
where Equation (59) is based on variance expansion; Equation (60) is based on Assumption 1; Equation (61) follows the Cauchy–Schwarz inequality and $\theta-\Lambda(\theta) \le \tau$; Equation (62) follows Assumption 2.
Based on the definition of $\Lambda(\cdot)$, $\Lambda(\theta) = \Lambda(t)$ for all $\theta \in \{\Lambda(t)+1, \ldots, t-1\}$. Therefore, with Equation (62), Equation (57) can be further simplified as
$$\mathbb{E}F(\bar{x}_t) - F(\bar{x}_{\Lambda(t)}) \tag{63}$$
$$\le -\frac{\eta_s\eta_lK}{4}\sum_{\theta=\Lambda(t)+1}^{t-1}\|\nabla F(\bar{x}_\theta)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\sum_{\theta=\Lambda(t)+1}^{t-1}\mathbb{E}\|\bar{x}_{\theta+1}-\bar{x}_\theta\|_2^2 + 2\eta_s\eta_lK(t-\Lambda(t)-1)\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 2\eta_s\eta_lKL^2\tau\sum_{\theta=\Lambda(t)+1}^{t-1}\sum_{\Xi=\Lambda(\theta)}^{\theta-1}\mathbb{E}\|\bar{x}_{\Xi+1}-\bar{x}_\Xi\|_2^2 \tag{64}$$
$$\le -\frac{\eta_s\eta_lK}{4}\sum_{\theta=\Lambda(t)+1}^{t-1}\|\nabla F(\bar{x}_\theta)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}-2\eta_s\eta_lKL^2\tau^2\right)\sum_{\theta=\Lambda(t)+1}^{t-1}\mathbb{E}\|\bar{x}_{\theta+1}-\bar{x}_\theta\|_2^2 + 2\eta_s\eta_lK(t-\Lambda(t)-1)\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{65}$$
Since $\eta_s\eta_l = \frac{1}{KL(1+2\tau)}$, the coefficient of the middle term is non-negative, so Equation (65) simplifies to
$$\mathbb{E}F(\bar{x}_t) - F(\bar{x}_{\Lambda(t)}) \le -\frac{\eta_s\eta_lK}{4}\sum_{\theta=\Lambda(t)+1}^{t-1}\|\nabla F(\bar{x}_\theta)\|_2^2 + 2\eta_s\eta_lK(t-\Lambda(t)-1)\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{66}$$
Telescoping this inequality over the anchor rounds $T+1, \Lambda(T+1), \Lambda(\Lambda(T+1)), \ldots$ down to $0$, we obtain
$$F^* - F(\bar{x}_0) \le \mathbb{E}F(\bar{x}_{T+1}) - F(\bar{x}_0) \tag{67}$$
$$\le -\frac{\eta_s\eta_lK}{4}\sum_{t=0;\ t\bmod\tau\neq 0}^{T}\|\nabla F(\bar{x}_t)\|_2^2 + 2\eta_s\eta_lK\big(T-\lfloor T/\tau\rfloor\big)\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{68}$$
Thus, we have
$$\frac{1}{T-\lfloor T/\tau\rfloor}\sum_{t=0;\ t\bmod\tau\neq 0}^{T}\|\nabla F(\bar{x}_t)\|_2^2 \le \frac{4\big(F(\bar{x}_0)-F^*\big)}{\eta_s\eta_lK\big(T-\lfloor T/\tau\rfloor\big)} + 8\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{69}$$
By using the settings of the local and global learning rates in the description, we obtain the desired result.

D.3 PARTIAL CLIENT PARTICIPATION

Theorem 5. Suppose that Assumptions 1, 2 and 3 hold. Let the number of local updates $K \ge 1$, and let the learning rates satisfy $\eta_s\eta_l = \frac{1}{KL}\big(1+\frac{2M\tau}{A}\big)^{-1}$ with $\eta_l \le \min\left\{\frac{1}{2\sqrt{6}KL},\ \frac{\sqrt{b'/K}}{4\sqrt{3}L_\sigma}\right\}$. Then the convergence rate of FedAMD for non-convex objectives is
$$\min_{t\in[T]}\|\nabla F(\bar{x}_t)\|_2^2 \le O\left(\frac{1}{T-\lfloor T/\tau\rfloor}\left(1+\frac{2M\tau}{A}\right)\right) + O\left(\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}\right), \tag{70}$$
where we treat $F(\bar{x}_0)-F^*$ and $L$ as constants.

Proof. When $p_t = 1$, according to Algorithm 1, there is no model update between two consecutive rounds, i.e., $\bar{x}_{t+1} = \bar{x}_t$. Next, we consider the case $p_t = 0$. Based on Lemma 3, we have
$$\mathbb{E}F(\bar{x}_{t+1}) - F(\bar{x}_t) \le -\frac{\eta_s\eta_lK}{2}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + \frac{\eta_s}{2\eta_lK}\mathbb{E}\|\mathrm{avg}(\Delta x_t)-\eta_lK\nabla F(\bar{x}_t)\|_2^2. \tag{71}$$
Knowing that when $p_t = 0$ and a set of clients $\mathcal{A}$ participates,
$$\mathrm{avg}(\Delta x_t) = \eta_lK\bar{g}_t - \frac{\eta_l}{A}\sum_{i\in\mathcal{A}}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\nabla f_i\big(x_{t,\kappa-1}^{(i)},\mathcal{B}'_{i,\kappa}\big) + \frac{\eta_l}{A}\sum_{i\in\mathcal{A}}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\nabla f_i\big(x_{t,\kappa}^{(i)},\mathcal{B}'_{i,\kappa}\big),$$
we bound $\mathbb{E}\|\mathrm{avg}(\Delta x_t)-\eta_lK\nabla F(\bar{x}_t)\|_2^2$ following the same steps as Equations (47) to (52), with the average now taken over $i\in\mathcal{A}$:
$$\mathbb{E}\|\mathrm{avg}(\Delta x_t)-\eta_lK\nabla F(\bar{x}_t)\|_2^2 \tag{73}$$
$$= \mathbb{E}\left\|\eta_lK(\bar{g}_t-\nabla F(\bar{x}_t)) + \frac{\eta_l}{A}\sum_{i\in\mathcal{A}}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\big(\nabla f_i(x_{t,\kappa}^{(i)},\mathcal{B}'_{i,\kappa})-\nabla f_i(x_{t,\kappa-1}^{(i)},\mathcal{B}'_{i,\kappa})\big)\right\|_2^2 \tag{74}$$
$$\le 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \frac{2\eta_l^2KL^2}{A}\sum_{i\in\mathcal{A}}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}k\,\mathbb{E}\big\|x_{t,\kappa}^{(i)}-x_{t,\kappa-1}^{(i)}\big\|_2^2 + \frac{2\eta_l^2}{A^2}\sum_{i\in\mathcal{A}}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\frac{L_\sigma^2}{b'}\,\mathbb{E}\big\|x_{t,\kappa}^{(i)}-x_{t,\kappa-1}^{(i)}\big\|_2^2 \tag{78}$$
$$= 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \frac{2\eta_l^2KL^2}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}k\,\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2 + \frac{2\eta_l^2}{AM}\sum_{m\in[M]}\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\frac{L_\sigma^2}{b'}\,\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2 \tag{79}$$
$$\le 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \frac{2\eta_l^2K^3L^2}{M}\sum_{m\in[M]}\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 + \frac{2\eta_l^2KL_\sigma^2}{AMb'}\sum_{m\in[M]}\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 \tag{80}$$
$$\le 2\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 12\eta_l^4K^2\left(\frac{L_\sigma^2}{Ab'}+K^2L^2\right)\big(\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2+\|\nabla F(\bar{x}_t)\|_2^2\big), \tag{81}$$
where Equation (78) combines $(\alpha+\beta)^2 \le 2\alpha^2+2\beta^2$, variance expansion, the Cauchy–Schwarz inequality with Assumption 1, and Lemma 2 with Assumption 3, exactly as in Equations (48) to (51); Equation (79) takes expectation over the client selection, where each client is selected with probability $A/M$; Equation (81) is based on Lemma 4. According to the constraints on the local learning rate, Equation (81) simplifies to
$$\mathbb{E}\|\mathrm{avg}(\Delta x_t)-\eta_lK\nabla F(\bar{x}_t)\|_2^2 \le 4\eta_l^2K^2\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \frac{\eta_l^2K^2}{2}\|\nabla F(\bar{x}_t)\|_2^2. \tag{82}$$
Plugging Equation (82) into Equation (71), we have
$$\mathbb{E}F(\bar{x}_{t+1})-F(\bar{x}_t) \tag{83}$$
$$\le -\frac{\eta_s\eta_lK}{4}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 2\eta_s\eta_lK\,\mathbb{E}\|\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 \tag{84}$$
$$= -\frac{\eta_s\eta_lK}{4}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 2\eta_s\eta_lK\big(\mathbb{E}\|\bar{g}_t-\mathbb{E}\bar{g}_t\|_2^2 + \mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2\big) \tag{85}$$
$$\le -\frac{\eta_s\eta_lK}{4}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 2\eta_s\eta_lK\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 2\eta_s\eta_lK\,\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2, \tag{86}$$
where Equation (85) is based on variance expansion, and Equation (86) is based on Assumption 2. Summing Equation (86) over all $t \in \{0, \ldots, T\}$,
$$F^* - F(\bar{x}_0) \le \mathbb{E}F(\bar{x}_{T+1}) - F(\bar{x}_0) = \sum_{t=0}^{T}\big(\mathbb{E}F(\bar{x}_{t+1})-F(\bar{x}_t)\big) \tag{87}$$
$$\le -\frac{\eta_s\eta_lK}{4}\sum_{t=0;\ t\bmod\tau\neq 0}^{T}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\sum_{t=0;\ t\bmod\tau\neq 0}^{T}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 2\eta_s\eta_lK\sum_{t=0;\ t\bmod\tau\neq 0}^{T}\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 2\eta_s\eta_lK\big(T-\lfloor T/\tau\rfloor\big)\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{88}$$
Let $\Lambda(t)$ denote the most recent round before $t$ with $p_{\Lambda(t)} = 1$ and $\Lambda(t) \neq t$; applying $\Lambda(\cdot)$ recursively eventually reaches $0$. To bound $\sum_{t}\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2$, the first step is to bound $\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2$. When $p_t = 1$, each client updates its cached gradient with probability $A/M$, and therefore $\mathbb{E}\bar{g}_t = \big(1-\frac{A}{M}\big)\mathbb{E}\bar{g}_{\Lambda(t)} + \frac{A}{M}\nabla F(\bar{x}_{\Lambda(t)})$. Based on this fact,
$$\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 = \mathbb{E}\left\|\left(1-\frac{A}{M}\right)\big(\mathbb{E}\bar{g}_{\Lambda(t)}-\nabla F(\bar{x}_{\Lambda(t)})\big) + \nabla F(\bar{x}_{\Lambda(t)})-\nabla F(\bar{x}_t)\right\|_2^2 \tag{90}$$
$$\le \left(1-\frac{A}{M}\right)\mathbb{E}\|\mathbb{E}\bar{g}_{\Lambda(t)}-\nabla F(\bar{x}_{\Lambda(t)})\|_2^2 + \frac{M}{A}\,\mathbb{E}\|\nabla F(\bar{x}_{\Lambda(t)})-\nabla F(\bar{x}_t)\|_2^2 \tag{91}$$
$$\le \sum_{\theta=0}^{\lfloor t/\tau\rfloor-1}\left(1-\frac{A}{M}\right)^{\lfloor t/\tau\rfloor-\theta}\cdot\frac{M}{A}L^2\,\mathbb{E}\|\bar{x}_{\theta\tau}-\bar{x}_{(\theta+1)\tau}\|_2^2 + \frac{M}{A}L^2\,\mathbb{E}\|\bar{x}_{\Lambda(t)}-\bar{x}_t\|_2^2, \tag{92}$$
where Equation (91) follows Young's inequality $\|\alpha+\beta\|^2 \le \big(1+\frac{1}{\gamma}\big)\|\alpha\|^2 + (1+\gamma)\|\beta\|^2$ with $\gamma = \frac{M-A}{A}$, and Equation (92) unrolls the recursion together with Assumption 1. Summing Equation (92) over all $t \in \{0, \ldots, T\}$, we obtain
$$\sum_{t=0}^{T}\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 \tag{93}$$
$$\le \sum_{t=0}^{T}\left(\sum_{\theta=0}^{\lfloor t/\tau\rfloor-1}\left(1-\frac{A}{M}\right)^{\lfloor t/\tau\rfloor-\theta}\cdot\frac{M}{A}L^2\,\mathbb{E}\|\bar{x}_{\theta\tau}-\bar{x}_{(\theta+1)\tau}\|_2^2 + \frac{M}{A}L^2\,\mathbb{E}\|\bar{x}_{\Lambda(t)}-\bar{x}_t\|_2^2\right) \tag{94}$$
$$\le \sum_{\theta=0}^{\lfloor T/\tau\rfloor-1}\frac{M(M-A)}{A^2}L^2\tau\,\mathbb{E}\|\bar{x}_{\theta\tau}-\bar{x}_{(\theta+1)\tau}\|_2^2 + \frac{M}{A}L^2\sum_{t=0}^{T}\mathbb{E}\|\bar{x}_t-\bar{x}_{\Lambda(t)}\|_2^2 \tag{95}$$
$$\le \sum_{\theta=0}^{\lfloor T/\tau\rfloor-1}\frac{M(M-A)}{A^2}L^2\tau^2\sum_{\Xi=\theta\tau+1}^{(\theta+1)\tau-1}\mathbb{E}\|\bar{x}_{\Xi+1}-\bar{x}_\Xi\|_2^2 + \frac{M}{A}L^2\sum_{t=0}^{T}(t-\Lambda(t)-1)\sum_{\Xi=\Lambda(t)+1}^{t-1}\mathbb{E}\|\bar{x}_{\Xi+1}-\bar{x}_\Xi\|_2^2 \tag{96}$$
$$\le \frac{M(M-A)}{A^2}L^2\tau^2\sum_{t=0;\ t\bmod\tau\neq 0}^{T-1}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + \frac{M}{A}L^2\tau^2\sum_{t=0;\ t\bmod\tau\neq 0}^{T-1}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2, \tag{97}$$
where Equation (95) follows since, for each $\theta \in \{0, \ldots, \lfloor T/\tau\rfloor-1\}$, the coefficient collects $\big(1-\frac{A}{M}\big), \ldots, \big(1-\frac{A}{M}\big)^{\lfloor T/\tau\rfloor-\theta}$, each appearing at most $\tau$ times, so the total coefficient is at most $\tau\big[\big(1-\frac{A}{M}\big)+\cdots+\big(1-\frac{A}{M}\big)^{\lfloor T/\tau\rfloor-\theta}\big] \le \tau\cdot\frac{M}{A}\big(1-\frac{A}{M}\big)$; Equation (96) follows the Cauchy–Schwarz inequality. Plugging Equation (97) back into Equation (88), we have
$$F^* - F(\bar{x}_0) \le -\frac{\eta_s\eta_lK}{4}\sum_{t=0;\ t\bmod\tau\neq 0}^{T}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}-2\eta_s\eta_lKL^2\tau^2\frac{M^2}{A^2}\right)\sum_{t=0;\ t\bmod\tau\neq 0}^{T}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 2\eta_s\eta_lK\big(T-\lfloor T/\tau\rfloor\big)\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{99}$$
Since $\eta_s\eta_l = \frac{1}{KL}\big(1+\frac{2M\tau}{A}\big)^{-1}$, we have $\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}-2\eta_s\eta_lKL^2\tau^2\frac{M^2}{A^2} \ge 0$, so the term $\sum_{t}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2$ can be omitted in Equation (99). Hence, we obtain
$$\frac{1}{T-\lfloor T/\tau\rfloor}\sum_{t=0;\ t\bmod\tau\neq 0}^{T}\|\nabla F(\bar{x}_t)\|_2^2 \le \frac{4\big(F(\bar{x}_0)-F^*\big)}{\eta_s\eta_lK\big(T-\lfloor T/\tau\rfloor\big)} + 8\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}.$$
By using the settings of the local and global learning rates in the description, we obtain the desired result.

E PROOFS UNDER CONSTANT PROBABILISTIC SETTINGS

E.1 PRELIMINARY

Lemma 5. Suppose that Assumption 1 holds and $p_t \in (0,1)$. Let $\bar{g}_t$ be as defined in Line 9 of Algorithm 1, i.e., the average of the cached gradients. Then the recursion for $\{\mathbb{E}\bar{g}_t\}_{t\ge 0}$ is
$$\mathbb{E}\bar{g}_t = \begin{cases} \left(1-\frac{A}{M}p_{t-1}\right)\mathbb{E}\bar{g}_{t-1} + \frac{A}{M}p_{t-1}\,\nabla F(\bar{x}_{t-1}), & t > 0, \\ \nabla F(\bar{x}_0), & t = 0. \end{cases} \tag{101}$$
Furthermore, for $t > 0$ we have
$$\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 \le \left(1-\frac{A}{M}p_{t-1}\right)\mathbb{E}\|\mathbb{E}\bar{g}_{t-1}-\nabla F(\bar{x}_{t-1})\|_2^2 + \frac{M}{Ap_{t-1}}L^2\,\mathbb{E}\|\bar{x}_t-\bar{x}_{t-1}\|_2^2, \tag{102}$$
while for $t = 0$, $\mathbb{E}\|\mathbb{E}\bar{g}_0-\nabla F(\bar{x}_0)\|_2^2 = 0$.

Proof. According to Line 9 of Algorithm 1, $\bar{g}_{t+1} = \mathrm{avg}(v_{t+1}) = \frac{1}{M}\sum_{m\in[M]}v_{t+1}^{(m)}$. Each element $v_{t+1}^{(m)}$, $m \in [M]$, retains its previous value with probability $1-\frac{A}{M}p_t$, and is otherwise refreshed by an anchor client using a large batch. Thus,
$$\mathbb{E}v_{t+1}^{(m)} = \frac{A}{M}p_t\,\mathbb{E}\big[\nabla f_m(\bar{x}_t,\mathcal{B}_{m,t})\big] + \left(1-\frac{A}{M}p_t\right)\mathbb{E}v_t^{(m)} \tag{103}$$
$$= \frac{A}{M}p_t\,\nabla F_m(\bar{x}_t) + \left(1-\frac{A}{M}p_t\right)\mathbb{E}v_t^{(m)}. \tag{104}$$
Therefore,
$$\mathbb{E}\bar{g}_{t+1} = \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}v_{t+1}^{(m)} = \frac{A}{M}p_t\,\nabla F(\bar{x}_t) + \left(1-\frac{A}{M}p_t\right)\mathbb{E}\bar{g}_t. \tag{105}$$
It is worth noting that $\mathbb{E}\bar{g}_0 = \nabla F(\bar{x}_0)$, since all caches are initialized at the beginning of the training (Lines 2 to 4 in Algorithm 1); therefore $\mathbb{E}\|\mathbb{E}\bar{g}_0-\nabla F(\bar{x}_0)\|_2^2 = 0$. Next, we find the recursive bound for $\mathbb{E}\|\mathbb{E}\bar{g}_{t+1}-\nabla F(\bar{x}_{t+1})\|_2^2$:
$$\mathbb{E}\|\mathbb{E}\bar{g}_{t+1}-\nabla F(\bar{x}_{t+1})\|_2^2 = \mathbb{E}\left\|\left(1-\frac{A}{M}p_t\right)\big(\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\big) + \nabla F(\bar{x}_t)-\nabla F(\bar{x}_{t+1})\right\|_2^2 \tag{107}$$
$$\le \left(1+\frac{Ap_t}{M-Ap_t}\right)\left(1-\frac{A}{M}p_t\right)^2\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \left(1+\frac{M-Ap_t}{Ap_t}\right)\mathbb{E}\|\nabla F(\bar{x}_t)-\nabla F(\bar{x}_{t+1})\|_2^2 \tag{108}$$
$$\le \left(1-\frac{A}{M}p_t\right)\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + \frac{M}{Ap_t}L^2\,\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2, \tag{109}$$
where Equation (108) follows Young's inequality $\|\alpha+\beta\|^2 \le \big(1+\frac{1}{\gamma}\big)\|\alpha\|^2 + (1+\gamma)\|\beta\|^2$, and Equation (109) follows Assumption 1.

Lemma 6. Suppose that Assumptions 1, 2 and 3 hold. Let the local learning rate satisfy $\eta_l \le \min\left\{\frac{1}{2\sqrt{3}KL},\ \frac{1}{2\sqrt{3KL_\sigma^2/b'}}\right\}$.
Under FedAMD, $\sum_{k=0}^{K-1}\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\|_2^2$ is the sum of the squared norms of the per-iteration differences, and its expectation is bounded as
$$\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 \le 2\eta_l^2(K{+}1)\,\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 6\eta_l^2(K{+}1)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 6\eta_l^2(K{+}1)\|\nabla F(\bar{x}_t)\|_2^2. \tag{110}$$

Proof. According to Equation (31), the update at the $(k-1)$-th iteration is
$$x_{t,k}^{(m)}-x_{t,k-1}^{(m)} = -\eta_l g_{t,k}^{(m)} = -\eta_l\left(\bar{g}_t - \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa}\big) + \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa}\big)\right). \tag{111}$$
To bound the expectation of its squared norm, we proceed as follows:
$$\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 = \eta_l^2\,\mathbb{E}\left\|\bar{g}_t - \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa}\big) + \sum_{\kappa=0}^{k-1}\nabla f_m\big(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa}\big)\right\|_2^2 \tag{113}$$
$$= \eta_l^2\,\mathbb{E}\left\|\bar{g}_t-\mathbb{E}\bar{g}_t - \sum_{\kappa=0}^{k-1}\big(\nabla f_m(x_{t,\kappa-1}^{(m)},\mathcal{B}'_{m,\kappa})-\nabla f_m(x_{t,\kappa}^{(m)},\mathcal{B}'_{m,\kappa})-\nabla F_m(x_{t,\kappa-1}^{(m)})+\nabla F_m(x_{t,\kappa}^{(m)})\big)\right\|_2^2 + \eta_l^2\,\mathbb{E}\left\|\mathbb{E}\bar{g}_t - \sum_{\kappa=0}^{k-1}\big(\nabla F_m(x_{t,\kappa-1}^{(m)})-\nabla F_m(x_{t,\kappa}^{(m)})\big)\right\|_2^2 \tag{114}$$
$$\le \eta_l^2\left(\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + \frac{L_\sigma^2}{b'}\sum_{\kappa=0}^{k-1}\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2\right) + \eta_l^2\,\mathbb{E}\left\|\mathbb{E}\bar{g}_t - \sum_{\kappa=0}^{k-1}\big(\nabla F_m(x_{t,\kappa-1}^{(m)})-\nabla F_m(x_{t,\kappa}^{(m)})\big)\right\|_2^2 \tag{115}$$
$$= \eta_l^2\left(\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + \frac{L_\sigma^2}{b'}\sum_{\kappa=0}^{k-1}\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2\right) + \eta_l^2\,\mathbb{E}\left\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t) + \nabla F(\bar{x}_t) - \sum_{\kappa=0}^{k-1}\big(\nabla F_m(x_{t,\kappa-1}^{(m)})-\nabla F_m(x_{t,\kappa}^{(m)})\big)\right\|_2^2 \tag{116}$$
$$\le \eta_l^2\,\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + \eta_l^2\frac{L_\sigma^2}{b'}\sum_{\kappa=0}^{k-1}\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2 + 3\eta_l^2\,\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2 K L^2\sum_{\kappa=0}^{k-1}\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2 \tag{117}$$
$$\le \eta_l^2\,\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 3\eta_l^2\,\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2\left(KL^2+\frac{L_\sigma^2}{b'}\right)\sum_{\kappa=0}^{k-1}\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2, \tag{118}$$
where Equation (114) is based on variance expansion of the first term of Equation (113); Equation (115) follows Assumption 2, Lemma 2 and Assumption 3; Equation (117) is based on the Cauchy–Schwarz inequality. Therefore, summing Equation (118) over $k = 1, \ldots, K$, we have
$$\mathbb{E}\sum_{k=0}^{K-1}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 \le \sum_{k=0}^{K}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 \tag{119}$$
$$\le \eta_l^2(K{+}1)\,\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 3\eta_l^2(K{+}1)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2(K{+}1)\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_l^2K\left(KL^2+\frac{L_\sigma^2}{b'}\right)\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2. \tag{120}$$
By the learning-rate condition stated above, $3\eta_l^2K\big(KL^2+\frac{L_\sigma^2}{b'}\big) \le \frac{1}{2}$ holds, and rearranging Equation (120) yields the claimed bound, consistent with Equation (110).
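The cache-average recursion of Lemma 5 can be checked by Monte Carlo simulation of the per-client refresh rule (each cached gradient is replaced with probability $\frac{A}{M}p_t$, otherwise retained). The cached values and gradient stand-ins below are illustrative, not produced by Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(1)
M, d, A, p_t = 50, 4, 10, 0.6
q = (A / M) * p_t                      # per-client cache-refresh probability (Lemma 5)

v_prev = rng.standard_normal((M, d))   # cached gradients v_t^(m) (illustrative values)
grad_F = rng.standard_normal((M, d))   # stand-ins for the fresh gradients ∇F_m(x̄_t)

trials, chunk, acc = 50_000, 5_000, np.zeros(d)
for _ in range(trials // chunk):
    refresh = rng.random((chunk, M, 1)) < q      # which clients act as anchors
    v_next = np.where(refresh, grad_F, v_prev)   # refreshed vs. retained caches
    acc += v_next.mean(axis=1).sum(axis=0)       # accumulate ḡ_{t+1} over trials
empirical = acc / trials

# Lemma 5 recursion: E[ḡ_{t+1}] = (A p_t / M) ∇F(x̄_t) + (1 - A p_t / M) E[ḡ_t]
predicted = q * grad_F.mean(axis=0) + (1 - q) * v_prev.mean(axis=0)
assert np.allclose(empirical, predicted, atol=5e-3)
```

The empirical mean of the cache average matches the predicted recursion within Monte Carlo error, as Equations (103) to (105) require.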

E.2 PROOFS FOR NON-CONVEX OBJECTIVES

The following lemma provides a recursive expression for $\mathbb{E}F(\bar{x}_{t+1})-F(\bar{x}_t)$ under time-varying probability settings.

Lemma 7. Suppose that Assumptions 1, 2 and 3 hold, and consider a time-varying probability sequence $\{p_t \in (0,1)\}_{t\ge 0}$. Let the number of local updates $K \ge 1$ and the local learning rate $\eta_l \le \min\left\{\frac{1}{2\sqrt{6}KL},\ \frac{\sqrt{b'/K}}{2\sqrt{3}L_\sigma}\right\}$. With model training using FedAMD, the recursion between $F(\bar{x}_{t+1})$ and $F(\bar{x}_t)$ in expectation is
$$\mathbb{E}F(\bar{x}_{t+1})-F(\bar{x}_t) \le -\frac{\eta_s\eta_lK}{4}\big(1-(p_t)^A\big)\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 4\eta_s\eta_lK\big(1-(p_t)^A\big)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 3\eta_s\eta_lK\big(1-(p_t)^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{121}$$

Proof. According to Lemma 3, we have
$$\mathbb{E}F(\bar{x}_{t+1})-F(\bar{x}_t) \le -\frac{\eta_s\eta_lK}{2}\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + \frac{\eta_s}{2\eta_lK}\mathbb{E}\|\Delta x_t-\eta_lK\nabla F(\bar{x}_t)\|_2^2. \tag{123}$$
In the notation of Lemma 1 (with $q = 1-p_t$ and $a = A$), $|\Delta x_t| = 0$ with probability $(p_t)^A$, and $|\Delta x_t| \neq 0$ with probability $1-(p_t)^A$. We next bound the third term of Equation (123), $\mathbb{E}\|\Delta x_t-\eta_lK\nabla F(\bar{x}_t)\|_2^2$. By Lemma 1,
$$\mathbb{E}\|\Delta x_t-\eta_lK\nabla F(\bar{x}_t)\|_2^2 \tag{124}$$
$$\le \big(1-(p_t)^A\big)\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\big\|\Delta x_t^{(m)}-\eta_lK\nabla F(\bar{x}_t)\big\|_2^2 + (p_t)^A\|\eta_lK\nabla F(\bar{x}_t)\|_2^2 \tag{125}$$
$$= \big(1-(p_t)^A\big)\frac{1}{M}\sum_{m=1}^{M}\left(\mathbb{E}\big\|\Delta x_t^{(m)}-\mathbb{E}\Delta x_t^{(m)}\big\|_2^2 + \mathbb{E}\big\|\mathbb{E}\Delta x_t^{(m)}-\eta_lK\nabla F(\bar{x}_t)\big\|_2^2\right) + (p_t)^A\eta_l^2K^2\|\nabla F(\bar{x}_t)\|_2^2, \tag{126}$$
where Equation (126) follows the variance equation. According to Section C, we have
$$\mathbb{E}\big\|\Delta x_t^{(m)}-\mathbb{E}\Delta x_t^{(m)}\big\|_2^2 \le 2\eta_l^2K^2\,\mathbb{E}\|\bar{g}_t-\mathbb{E}\bar{g}_t\|_2^2 + 2\eta_l^2(K-1)\frac{L_\sigma^2}{b'}\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 \tag{128}$$
$$\le 2\eta_l^2K^2\left(1+2\eta_l^2\frac{L_\sigma^2}{b'}\right)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 12\eta_l^4K^2\frac{L_\sigma^2}{b'}\big(\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2+\|\nabla F(\bar{x}_t)\|_2^2\big), \tag{129}$$
where Equation (129) follows Lemma 6, together with Assumption 2 for $\mathbb{E}\|\bar{g}_t-\mathbb{E}\bar{g}_t\|_2^2 \le \mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}$.
According to the local learning-rate setting in the description, we have
$$\mathbb{E}\big\|\Delta x_t^{(m)}-\mathbb{E}\Delta x_t^{(m)}\big\|_2^2 \le 4\eta_l^2K^2\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 12\eta_l^4K^2\frac{L_\sigma^2}{b'}\big(\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2+\|\nabla F(\bar{x}_t)\|_2^2\big). \tag{131}$$
Having bounded the first term of Equation (126), we now bound its second term, $\mathbb{E}\|\mathbb{E}\Delta x_t^{(m)}-\eta_lK\nabla F(\bar{x}_t)\|_2^2$:
$$\mathbb{E}\big\|\mathbb{E}\Delta x_t^{(m)}-\eta_lK\nabla F(\bar{x}_t)\big\|_2^2 = \mathbb{E}\left\|\eta_lK\big(\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\big) + \eta_l\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\big(\nabla F_m(x_{t,\kappa}^{(m)})-\nabla F_m(x_{t,\kappa-1}^{(m)})\big)\right\|_2^2 \tag{133}$$
$$\le 2\eta_l^2K^2\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 2\eta_l^2\,\mathbb{E}\left\|\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}\big(\nabla F_m(x_{t,\kappa}^{(m)})-\nabla F_m(x_{t,\kappa-1}^{(m)})\big)\right\|_2^2 \tag{134}$$
$$\le 2\eta_l^2K^2\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 2\eta_l^2KL^2\sum_{k=0}^{K-1}\sum_{\kappa=0}^{k}k\,\mathbb{E}\big\|x_{t,\kappa}^{(m)}-x_{t,\kappa-1}^{(m)}\big\|_2^2 \tag{135}$$
$$\le 2\eta_l^2K^2\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 2\eta_l^2K\frac{K(K-1)}{2}L^2\sum_{k=0}^{K-1}\mathbb{E}\big\|x_{t,k}^{(m)}-x_{t,k-1}^{(m)}\big\|_2^2 \tag{136}$$
$$\le 2\eta_l^4K^4L^2\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 2\eta_l^2K^2\big(1+3\eta_l^2K^2L^2\big)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 6\eta_l^4K^4L^2\|\nabla F(\bar{x}_t)\|_2^2, \tag{137}$$
where Equation (134) follows $(\alpha+\beta)^2 \le 2\alpha^2+2\beta^2$; Equation (135) follows the Cauchy–Schwarz inequality and Assumption 1; Equation (137) is based on Lemma 6.
Then, according to the local learning-rate setting in the description, Equation (137) can be further simplified:
$$\mathbb{E}\big\|\mathbb{E}\Delta x_t^{(m)}-\eta_lK\nabla F(\bar{x}_t)\big\|_2^2 \le 2\eta_l^4K^4L^2\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 4\eta_l^2K^2\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 6\eta_l^4K^4L^2\|\nabla F(\bar{x}_t)\|_2^2. \tag{138}$$
Plugging Equations (131) and (138) back into Equation (126), we primarily obtain
$$\mathbb{E}\|\Delta x_t-\eta_lK\nabla F(\bar{x}_t)\|_2^2 \le 2\eta_l^2K^2\big(1-(p_t)^A\big)\big(2+\eta_l^2K^2L^2\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 4\eta_l^2K^2\big(1-(p_t)^A\big)\left(1+3\eta_l^2\frac{L_\sigma^2}{b'}\right)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 6\eta_l^4K^2\big(1-(p_t)^A\big)\left(\frac{2L_\sigma^2}{b'}+K^2L^2\right)\|\nabla F(\bar{x}_t)\|_2^2 + (p_t)^A\eta_l^2K^2\|\nabla F(\bar{x}_t)\|_2^2. \tag{139}$$
With the setting described in the lemma, we have
$$\mathbb{E}\|\Delta x_t-\eta_lK\nabla F(\bar{x}_t)\|_2^2 \le 6\eta_l^2K^2\big(1-(p_t)^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} + 8\eta_l^2K^2\big(1-(p_t)^A\big)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 6\eta_l^4K^2\big(1-(p_t)^A\big)\left(\frac{2L_\sigma^2}{b'}+K^2L^2\right)\|\nabla F(\bar{x}_t)\|_2^2 + (p_t)^A\eta_l^2K^2\|\nabla F(\bar{x}_t)\|_2^2. \tag{141}$$
Therefore, according to the upper bounds derived above, Equation (123) can be reformulated as
$$\mathbb{E}F(\bar{x}_{t+1})-F(\bar{x}_t) \le -\frac{\eta_s\eta_lK}{2}\big(1-(p_t)^A\big)\left(1-6\eta_l^2\left(\frac{2L_\sigma^2}{b'}+K^2L^2\right)\right)\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 4\eta_s\eta_lK\big(1-(p_t)^A\big)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 3\eta_s\eta_lK\big(1-(p_t)^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{144}$$
By means of the setting in the description, we obtain the desired conclusion.

Theorem 6. Suppose that Assumptions 1, 2 and 3 hold. Let the number of local updates $K \ge 1$, and let the learning rates satisfy $\eta_s\eta_l = \frac{1}{KL}\big(1+\frac{2M}{Ap}(1-p^A)\big)^{-1}$ with $\eta_l \le \min\left\{\frac{1}{2\sqrt{6}KL},\ \frac{\sqrt{b'/K}}{4\sqrt{3}L_\sigma}\right\}$. Then the convergence rate of FedAMD for non-convex objectives is
$$\min_{t\in[T]}\|\nabla F(\bar{x}_t)\|_2^2 \le O\left(\frac{1}{T}\left(\frac{1}{1-p^A}+\frac{M}{Ap}\right)\right) + O\left(\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}\right), \tag{145}$$
where we treat $F(\bar{x}_0)-F^*$ and $L$ as constants.

Proof.
With Lemma 5 and Lemma 7, we obtain the following recursion under the constant-probability setting:
$$\mathbb{E}F(\bar{x}_{t+1}) + 4\eta_s\eta_lK\big(1-p^A\big)\frac{M}{Ap}\,\mathbb{E}\|\mathbb{E}\bar{g}_{t+1}-\nabla F(\bar{x}_{t+1})\|_2^2 \tag{146}$$
$$\le \mathbb{E}F(\bar{x}_t) + 4\eta_s\eta_lK\big(1-p^A\big)\frac{M}{Ap}\,\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 - \frac{\eta_s\eta_lK}{4}\big(1-p^A\big)\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}-4\eta_s\eta_lK\big(1-p^A\big)\frac{M^2}{A^2p^2}L^2\right)\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 3\eta_s\eta_lK\big(1-p^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{147}$$
Since $\eta_s\eta_l = \frac{1}{KL}\big(1+\frac{2M}{Ap}(1-p^A)\big)^{-1}$, the coefficient of $\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|_2^2$ is non-negative, and we have
$$F^* \le \mathbb{E}F(\bar{x}_T) \le \mathbb{E}F(\bar{x}_T) + 4\eta_s\eta_lK\big(1-p^A\big)\frac{M}{Ap}\,\mathbb{E}\|\mathbb{E}\bar{g}_T-\nabla F(\bar{x}_T)\|_2^2 \tag{148}$$
$$\le \mathbb{E}F(\bar{x}_{T-1}) + 4\eta_s\eta_lK\big(1-p^A\big)\frac{M}{Ap}\,\mathbb{E}\|\mathbb{E}\bar{g}_{T-1}-\nabla F(\bar{x}_{T-1})\|_2^2 - \frac{\eta_s\eta_lK}{4}\big(1-p^A\big)\|\nabla F(\bar{x}_{T-1})\|_2^2 + 3\eta_s\eta_lK\big(1-p^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} \tag{149}$$
$$\le F(\bar{x}_0) + 4\eta_s\eta_lK\big(1-p^A\big)\frac{M}{Ap}\|\mathbb{E}\bar{g}_0-\nabla F(\bar{x}_0)\|_2^2 - \frac{\eta_s\eta_lK}{4}\big(1-p^A\big)\sum_{t=0}^{T-1}\|\nabla F(\bar{x}_t)\|_2^2 + 3\eta_s\eta_lKT\big(1-p^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{150}$$
According to Lemma 5, $\|\mathbb{E}\bar{g}_0-\nabla F(\bar{x}_0)\|_2^2 = 0$. Therefore, based on the derivation above, we attain
$$\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(\bar{x}_t)\|_2^2 \le \frac{4\big(F(\bar{x}_0)-F^*\big)}{\eta_s\eta_lKT\big(1-p^A\big)} + 12\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}, \tag{151}$$
and the learning-rate settings give the desired result. Moreover, for $\mu$-PL objectives, i.e., objectives satisfying $\|\nabla F(x)\|_2^2 \ge 2\mu\big(F(x)-F^*\big)$, FedAMD converges linearly:
$$\mathbb{E}F(\bar{x}_T)-F^* \le \left(1-\frac{1}{2}\mu K\big(1-p^A\big)\min\left\{\frac{Ap}{MK\mu(1-p^A)},\ \frac{1}{KL\big(1+\frac{16M}{\mu Ap}L\big)}\right\}\right)^T\big(F(\bar{x}_0)-F^*\big) + O\left(\frac{1}{\mu}\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}\right). \tag{152}$$
Proof.
With Lemma 7, under the PL condition we have the following recursion for the time-varying probability setting:
$$\mathbb{E}F(\bar{x}_{t+1})-F(\bar{x}_t) \tag{153}$$
$$\le -\frac{\eta_s\eta_lK}{4}\big(1-(p_t)^A\big)\|\nabla F(\bar{x}_t)\|_2^2 - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 4\eta_s\eta_lK\big(1-(p_t)^A\big)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 3\eta_s\eta_lK\big(1-(p_t)^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb} \tag{154}$$
$$\le -\frac{\mu\eta_s\eta_lK}{2}\big(1-(p_t)^A\big)\big(F(\bar{x}_t)-F^*\big) - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 4\eta_s\eta_lK\big(1-(p_t)^A\big)\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 3\eta_s\eta_lK\big(1-(p_t)^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{155}$$
According to the description, we consider the constant probability $p_t = p$ and have
$$\mathbb{E}F(\bar{x}_{t+1})-F^* \le \left(1-\frac{\mu\eta_s\eta_lK}{2}\big(1-p^A\big)\right)\big(F(\bar{x}_t)-F^*\big) - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}\right)\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 4\eta_s\eta_lK\big(1-p^A\big)\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2 + 3\eta_s\eta_lK\big(1-p^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{157}$$
Since $\eta_s\eta_l \le \frac{Ap}{MK\mu(1-p^A)}$, we have
$$\mathbb{E}F(\bar{x}_{t+1})-F^* + \frac{8}{\mu}\mathbb{E}\|\mathbb{E}\bar{g}_{t+1}-\nabla F(\bar{x}_{t+1})\|_2^2 \le \left(1-\frac{\mu\eta_s\eta_lK}{2}\big(1-p^A\big)\right)\left(F(\bar{x}_t)-F^*+\frac{8}{\mu}\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2\right) - \left(\frac{1}{2\eta_s\eta_lK}-\frac{L}{2}-\frac{8}{\mu}\frac{M}{Ap}L^2\right)\|\bar{x}_{t+1}-\bar{x}_t\|_2^2 + 3\eta_s\eta_lK\big(1-p^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{159}$$
Since $\eta_s\eta_l \le \frac{1}{KL\big(1+\frac{16M}{\mu Ap}L\big)}$, the coefficient of $\|\bar{x}_{t+1}-\bar{x}_t\|_2^2$ is non-negative, and we have
$$\mathbb{E}F(\bar{x}_{t+1})-F^* \le \mathbb{E}F(\bar{x}_{t+1})-F^* + \frac{8}{\mu}\mathbb{E}\|\mathbb{E}\bar{g}_{t+1}-\nabla F(\bar{x}_{t+1})\|_2^2 \tag{160}$$
$$\le \left(1-\frac{\mu\eta_s\eta_lK}{2}\big(1-p^A\big)\right)\left(F(\bar{x}_t)-F^*+\frac{8}{\mu}\mathbb{E}\|\mathbb{E}\bar{g}_t-\nabla F(\bar{x}_t)\|_2^2\right) + 3\eta_s\eta_lK\big(1-p^A\big)\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}. \tag{161}$$
Unrolling this recursion over $t = 0, \ldots, T-1$ and using $\|\mathbb{E}\bar{g}_0-\nabla F(\bar{x}_0)\|_2^2 = 0$ yields the stated linear rate, with the accumulated noise bounded via the geometric series as $O\big(\frac{1}{\mu}\cdot\mathbb{1}_{\{b<n\}}\frac{\sigma^2}{Mb}\big)$.

F ADDITIONAL EXPERIMENTS

In the main text, we analyzed some experimental results in Section 5. In this part, we conduct more thorough experiments with different numbers of local updates and different secondary mini-batch sizes.

F.1 DETAILED EXPERIMENTAL SETUP

Training on Fashion MNIST. In Section 5, the experiments are conducted on Fashion MNIST (Xiao et al., 2017), an image classification task that categorizes a 28×28 greyscale image into 10 labels (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot). In the training dataset, each class owns 6K samples. We follow the setting of (Konečnỳ et al., 2016; Li et al., 2019b) and partition the dataset into 100 clients (M = 100) such that each client holds two classes with a total of 600 samples. By this means, we simulate the heterogeneous data setting. To obtain a recognizable model on the images in the test dataset, we utilize the convolutional neural network LeNet-5 (LeCun et al., 1989; 2015); its structure on Fashion MNIST is presented below.

Number of local updates K. Figures 5-7 present the performance of FedAMD under K = 10, K = 20, and K = 5, respectively. In Section 5, we present the results under K = 10 (Figure 5), which manifest that: (I) the setting {0, 1} is the most efficient under the sequential probability settings; (II) a setting near the optimal probability attains the best result under the constant probability settings. In this part, we verify whether these two statements still hold in two more examples. For both K = 20 and K = 5, the best performance is attained when the constant probability is set near the theoretical optimum. However, statement (I) does not always hold in these settings. Specifically, when all clients participate in the training, {0, 0, 1} even outperforms {0, 1}. A possible reason is that {0, 0, 1} has more rounds to update the global model, while the cached gradient does not change significantly compared to running for one more round.
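The shard-based split described above (two classes and 600 samples per client) can be sketched as follows. This helper is an illustrative reconstruction, not the authors' released code, and the shard assignment may differ in minor details from the original setup.

```python
import numpy as np

def partition_two_classes_per_client(labels, num_clients=100, samples_per_client=600, seed=0):
    """Shard-based non-i.i.d. split: sort indices by label, cut them into
    single-class shards, and hand every client two shards (at most two classes)."""
    rng = np.random.default_rng(seed)
    num_shards = 2 * num_clients                 # two shards per client
    shard_size = samples_per_client // 2         # 300 samples per shard
    order = np.argsort(labels, kind="stable")    # group sample indices by class
    shards = [order[i * shard_size:(i + 1) * shard_size] for i in range(num_shards)]
    perm = rng.permutation(num_shards)
    return [np.concatenate([shards[perm[2 * c]], shards[perm[2 * c + 1]]])
            for c in range(num_clients)]

# Synthetic stand-in for the 60K Fashion-MNIST training labels (6K per class).
labels = np.repeat(np.arange(10), 6000)
clients = partition_two_classes_per_client(labels)
```

Since 6K samples per class split into 20 shards of 300, every shard is pure, and each client observes at most two classes, which reproduces the heterogeneous setting.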

F.2.2 COMPARISON AMONG VARIOUS BASELINES

In Section 5, we present the comparison in a tabular format. In this part, we visualize the training progress and introduce more results under different values of K with the help of Figure 8. In specific, Figures 8a-8f are summarized into Table 2, while the rest explore the efficiency of FedAMD under more scenarios. As described in Table 2 and the first six figures, when K = 10, we can draw two conclusions: (I) the final test accuracy of FedAMD exceeds that of the baselines; (II) FedAMD is able to attain an accurate model with less communication and computation consumption. Next, we evaluate the performance of FedAMD when K = 20 and K = 5. • K = 20 (Figures 8g-8l): In this case, the gradient computation of a miner is around twice that of an anchor. Therefore, FedAMD may consume less computation overhead than FedAvg and SCAFFOLD. In terms of final test accuracy, the baselines achieve similar results in all cases, while FedAMD reaches that performance with less computation overhead. • K = 5 (Figures 8m-8r): FedAMD eventually achieves the best accuracy compared to the existing works. Additionally, we can obtain a well-performing model with less computational consumption. These two observations support the statements mentioned above.



In this paper, partial client participation refers to the case where only a portion of clients take part at every round during the entire training. For non-convex objectives, ϵ-approximation refers to $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(x_t)\|_2^2 \le \epsilon$. Specifically, when it comes to the PL condition, ϵ-approximation refers to $F(x_T) - F(x^*) \le \epsilon$. In the following discussion, we particularly highlight reaching ϵ-approximation together with the total computation cost.

CONCLUSION

In this work, we investigate a federated learning framework FedAMD that disjoints the partial participants into anchor and miner groups. We provide the convergence analysis of our proposed algorithm for constant and sequential probability settings. Under the partial-client scenario, FedAMD achieves sublinear speedup under non-convex objectives and linear speedup under the PL condition. To the best of our knowledge, this is the first work to analyze the effectiveness of large batches under partial client participation. Experimental results demonstrate that FedAMD is superior to state-of-the-art works. It would be interesting to explore anchor sampling in other scenarios of FL, e.g., arbitrary device unavailability.



where we define [M] as the set of M clients. F_m(·) denotes the local expected loss function of client m, which is unbiasedly estimated by the empirical loss f_m(·) on a random realization B_m drawn from the local training data D_m, i.e., E_{B_m∼D_m} f_m(x, B_m) = F_m(x). We denote by n the size of each client's local dataset, i.e., |D_m| = n for all m ∈ [M]; n can be infinitely large in the streaming/online case. F^* represents the minimum loss of Equation (1). Algorithm Description. In FedAMD, the global model is initialized with arbitrary parameters x_0 ∈ R^d. After the server distributes the model to all clients (Line 1), every client m ∈ [M] generates a b-sample batch B_{m,0} and computes the gradient v_0^{(m)} = ∇f_m(x_0, B_{m,0}) (Line 4); the server caches these gradients and spans them as the matrix v_0 = [v_0^{(m)}]_{m∈[M]}.
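For completeness, the objective of Equation (1) — the standard federated form consistent with the definitions above — reads:

```latex
\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{M} \sum_{m \in [M]} F_m(x),
\qquad
F_m(x) = \mathbb{E}_{B_m \sim D_m}\!\left[ f_m(x, B_m) \right].
```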

Algorithm 1 FedAMD
Input: local learning rate η_l, global learning rate η_s, mini-batch sizes b and b′ < b, number of local updates K, probabilities {p_t ∈ [0, 1]}_{t≥0}, initial model x_0.
1: Communicate the initial model x_0 with all clients m ∈ [M]
2: for m ∈ [M] in parallel do
3:   Compute v_0^{(m)} = ∇f_m(x_0, B_{m,0}) using B_{m,0} ∼ D_m with batch size b
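To make the control flow concrete, here is a minimal NumPy sketch of one FedAMD round. It is a simplified reading of Algorithm 1, not the authors' implementation: the participant sampling, the anchor probability `p`, and in particular the exact form of the miner's cached-gradient correction are assumptions for illustration.

```python
import numpy as np

def fedamd_round(x, v_cache, clients, rng, p=0.5, A=10, K=5,
                 eta_l=0.1, eta_s=1.0, b=256, b_small=32):
    """One communication round of FedAMD (illustrative sketch only).

    x        : global model parameters, shape (d,)
    v_cache  : cached per-client large-batch gradients, shape (M, d)
    clients  : list of callables grad(params, batch_size) -> gradient
    p        : probability that a participant acts as an anchor (assumed form)
    """
    M = len(clients)
    participants = rng.choice(M, size=A, replace=False)
    deltas = []
    for i in participants:
        if rng.random() < p:
            # Anchor: recompute the local "bullseye" with one large batch;
            # only the server-side cache is refreshed, no model update is sent.
            v_cache[i] = clients[i](x, b)
        else:
            # Miner: K local steps with small batches, steered by the cached
            # gradients (the correction term below is an assumption).
            y = x.copy()
            for _ in range(K):
                g = clients[i](y, b_small)
                y = y - eta_l * (g - v_cache[i] + v_cache.mean(axis=0))
            deltas.append(x - y)
    if deltas:  # Eq. (28): x_{t+1} = x_t - eta_s * avg(delta_x_t)
        x = x - eta_s * np.mean(deltas, axis=0)
    return x, v_cache
```

Note how the two roles differ: anchors only refresh the cached large-batch gradients, while miners alone produce the model updates that the server averages, matching the anchor/miner split described above.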

We particularly focus on reaching ϵ-approximation with the fewest communication rounds. The following corollary discloses the relation between the computation overhead and the value of τ.

Corollary 1. Under the setting of Theorem 1, FedAMD computes $O\big(\frac{Mb}{\epsilon} + \frac{\tau MKb'}{\epsilon}\big)$ gradients and consumes a communication overhead of $O\big(\frac{M\tau}{A\epsilon}\big)$ during the model training.

Remark. FedAMD requires increasing computation and communication costs as τ gets larger. Therefore, τ = 2 possesses the most outstanding performance in the sequential probability settings. In this case, FedAMD requires $O\big(\frac{M}{A\epsilon}\big)$ communication rounds, a communication overhead of $O\big(\frac{M}{A\epsilon}\big)$, and a computation cost of $O\big(\frac{\sigma^2}{\epsilon^2} + \frac{MK}{\epsilon}\big)$ when optimizing an online scenario where the size of the local dataset is infinitely large.
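Since Corollary 1 fixes the costs only up to constants, their dependence on τ can be tabulated with a throwaway helper. The function below is hypothetical (not from the paper), and its return values are merely proportional to the true costs:

```python
def fedamd_costs(tau, M, K, A, b, b_small, eps):
    """Order-of-magnitude costs from Corollary 1 (constants omitted)."""
    gradients = M * b / eps + tau * M * K * b_small / eps   # O(Mb/eps + tau*M*K*b'/eps)
    comm_overhead = M * tau / (A * eps)                     # O(M*tau/(A*eps))
    return gradients, comm_overhead
```

Both quantities grow linearly in τ, which is exactly why τ = 2, the smallest admissible choice, is preferred in the sequential probability settings.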

Figure 1: Comparison of different probability settings using test accuracy against the communication rounds for FedAMD.

and the variables {ε_i − e_i} form a martingale difference sequence whose variance is bounded by $\mathbb{E}\|\varepsilon_i - e_i\|_2^2 \le \sigma^2$. Hence, we can obtain a much tighter bound
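For completeness, the generic identity behind such a tighter bound for a martingale difference sequence (a textbook statement; the paper's lemma may carry different constants) is:

```latex
\mathbb{E}\Big\| \sum_{i=1}^{K} (\varepsilon_i - e_i) \Big\|_2^2
= \sum_{i=1}^{K} \mathbb{E}\|\varepsilon_i - e_i\|_2^2
\le K\sigma^2 ,
```

since the cross terms $\mathbb{E}\langle \varepsilon_i - e_i,\ \varepsilon_j - e_j\rangle$ vanish for $i \neq j$; a naive triangle-inequality bound would instead give $K^2\sigma^2$.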

By using the settings of the local learning rate and the global learning rate in the description, we can obtain the desired result.

E.3 PROOFS FOR PL CONDITION

Theorem 7. Suppose that Assumptions 1, 2, 3 and 4 hold. Let the number of local updates K ≥ 1, and let the local learning rate η_l and the global learning rate η_s satisfy $\eta_s\eta_l = \min\left\{\frac{Ap}{MK\mu(1-p^A)},\ \frac{1}{KL\big(1+\frac{16ML}{\mu Ap}\big)}\right\}$. Then, the convergence rate of FedAMD under the PL condition is

Figure 2: Comparison of test accuracy and training loss against the communication rounds for FedAMD with constant p = 0.9.

Figure 4: Comparison of test accuracy and training loss against the communication rounds for FedAMD with sequential {0, 1}.

Figure 5: Comparison of different probability settings using training loss and test accuracy against the communication rounds for FedAMD by setting K = 10.

Figure 6: Comparison of different probability settings using training loss and test accuracy against the communication rounds for FedAMD by setting K = 20.

Figure 8: Comparison of different algorithms using test accuracy against the communication rounds and training loss against gradient complexity.


Table 2: Comparison among baselines in terms of cumulative gradient complexity (×10^5 samples), communication cost (×32 Mbits), the number of rounds to reach 75% accuracy, and the final accuracy.

Table 2 compares FedAMD with the existing works under partial/full client participation. At first glance, FedAMD outperforms the other baselines, as the bold entries all belong to FedAMD. With 20-client participation, our proposed method surpasses the four baselines in every respect. With 40-client participation, FedAMD with constant probability saves at least 30% of the computation and communication costs, and the final accuracy improves by up to 6%. For the training with full client participation, FedAMD requires 10%-20% fewer communication rounds, and its final accuracy improves significantly, i.e., within the range of 0.7%-8.3%. It is also worth noting that BVR-L-SGD performs similarly to FedAMD with sequential probability in terms of communication rounds and test accuracy, but the former needs more computation and communication overhead. This is because BVR-L-SGD computes the bullseye using multiple b′-size batches rather than a single large batch.

However, a key concern is how to attain an accurate global orientation to mitigate the drift of local updates from the global model. Roughly, existing estimations fall into two types, namely, precalculated and cached. The precalculation-based methods (Murata & Suzuki, 2021; Mitra et al., 2021) typically require full client participation, which is infeasible in federated learning settings. As for the global orientation estimated from cached information, existing approaches (Karimireddy et al., 2020b; Wu et al., 2021; Liang et al., 2019; Karimireddy et al., 2020a) utilize small batches, which yields a biased estimation and misleads the training. This work explores the effectiveness of large-batch estimation of the global orientation under partial client participation.

Algorithm 1 describes FedAMD in detail. The objective of this part is to find the recursive function for the sequence of models, i.e., {x_t}_{t≥0}. As mentioned in Line 26 of Algorithm 1, let Δx_t aggregate the updates Δx_t^{(i)} of the clients i that update the model; then the difference between x_{t+1} and x_t follows the recursion

$$x_{t+1} = x_t - \eta_s \cdot \mathrm{avg}(\Delta x_t) \tag{28}$$

where avg(·) is the same as defined in Lemma 1. Note that the length of Δx_t changes over rounds but does not exceed the number of participants, i.e., |Δx_t| ≤ A. Then, suppose that Δx

Details for LeNet-5 on Fashion-MNIST.

Details for the 2-layer MLP on EMNIST digits. The training loss is computed on the clients as the average loss over all local SGD iterations. The test accuracy is evaluated by the server on the entire test dataset after each global model update. The gradient complexity is the total number of samples used for gradient computation by all clients throughout the training. The communication overhead is measured by the transmission volume between the server and the clients.


By using the settings of the local learning rate and the global learning rate in the description, we can obtain the desired result.

F.3 NUMERICAL RESULTS ON EMNIST DIGITS

In this section, Figure 9 analyzes our algorithm on one more dataset, i.e., EMNIST. The first three figures evaluate different probability settings, while the rest compare FedAMD with other baselines. With regard to the probability settings (Figures 9a-9c), we can still draw the two conclusions stated in Appendix F.2.1. As for the comparison among algorithms, our proposed algorithm outperforms the state-of-the-art works when the test accuracy and the computation overhead are taken into joint consideration.

