FEDX: FEDERATED LEARNING FOR COMPOSITIONAL PAIRWISE RISK OPTIMIZATION

Abstract

In this paper, we tackle a novel federated learning (FL) problem for optimizing a family of compositional pairwise risks, to which no existing FL algorithms are applicable. In particular, the objective has the form $\min_{\mathbf{w}} \frac{1}{|S_1|}\sum_{z\in S_1} f\big(\frac{1}{|S_2|}\sum_{z'\in S_2}\ell(\mathbf{w}; z, z')\big)$, where the two sets of data $S_1, S_2$ are distributed over multiple machines, $\ell(\cdot;\cdot,\cdot)$ is a pairwise loss that only depends on the prediction outputs of the input data pair $(z, z')$, and $f(\cdot)$ is a possibly non-linear, non-convex function. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss, and partial AUROC maximization with a compositional loss. The challenges of designing an FL algorithm lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. We propose two provable FL algorithms (FedX) for handling linear and non-linear $f$, respectively. To address the challenges, we decouple the gradient's components into two types, namely active parts and lazy parts: the active parts depend on local data and are computed with the local model, while the lazy parts depend on other machines and are communicated/computed based on historical models and samples. We develop a novel theoretical analysis to combat the latency of the lazy parts and the interdependency between the local model parameters and the involved data for computing local gradient estimators. We establish both iteration and communication complexities and show that using historical samples and models for computing the lazy parts does not degrade the complexities. We conduct empirical studies of FedX for deep AUROC and partial AUROC maximization, and demonstrate their performance compared with several baselines.

Under review as a conference paper at ICLR 2023

1. INTRODUCTION

This work is motivated by solving the following optimization problem arising in many ML applications in a federated learning (FL) setting:

$$\min_{\mathbf{w}\in\mathbb{R}^d}\ \frac{1}{|S_1|}\sum_{z\in S_1} f\bigg(\underbrace{\frac{1}{|S_2|}\sum_{z'\in S_2}\ell(\mathbf{w}; z, z')}_{g(\mathbf{w};\, z,\, S_2)}\bigg), \qquad (1)$$

where $S_1$ and $S_2$ denote two sets of data points that are distributed over many machines, $\mathbf{w}$ denotes the model parameter of a prediction function $h_{\mathbf{w}}(\cdot)\in\mathbb{R}^{d_o}$, $f(\cdot)$ is a deterministic function that could be linear or non-linear (possibly non-convex), and $\ell(\mathbf{w}; z, z') = \ell(h_{\mathbf{w}}(z), h_{\mathbf{w}}(z'))$ denotes a pairwise loss that only depends on the prediction outputs of the input data $z, z'$. We refer to the above problem as the compositional pairwise risk (CPR) minimization problem. When $f$ is a linear function, the above problem is the classic pairwise loss minimization problem, which has applications in AUROC (AUC) maximization (Gao et al., 2013; Zhao et al., 2011; Gao & Zhou, 2015; Calders & Jaroszewicz, 2007; Charoenphakdee et al., 2019; Yang et al., 2021b), bipartite ranking (Cohen et al., 1997; Clémençon et al., 2008; Kotlowski et al., 2011; Dembczynski et al., 2012), and distance metric learning (Radenović et al., 2016; Wu et al., 2017; Yang et al., 2021b). When $f$ is a non-linear function, the above problem is a special case of the finite-sum coupled compositional optimization problem (Wang & Yang, 2022a), which has found applications in various performance measure optimization problems such as partial AUC maximization (Zhu et al., 2022), average precision maximization (Qi et al., 2021; Wang et al., 2022), NDCG maximization (Qiu et al., 2022), and p-norm
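To make the CPR objective in (1) concrete, here is a minimal numpy sketch (our own illustration, not the paper's code; all function names are ours) that evaluates the risk for a linear $f$ and for the non-linear $f(\cdot)=\lambda\log(\cdot)$ used later for partial AUC:

```python
import numpy as np

def cpr_risk(scores_1, scores_2, pair_loss, f):
    """Compositional pairwise risk:
    (1/|S1|) * sum_{z in S1} f( (1/|S2|) * sum_{z' in S2} loss(h(z), h(z')) )."""
    inner = [np.mean([pair_loss(a, b) for b in scores_2]) for a in scores_1]
    return float(np.mean([f(g) for g in inner]))

# Pairwise square loss for AUC maximization: a = positive score, b = negative score.
sq_loss = lambda a, b: (1.0 - a + b) ** 2

pos = np.array([0.9, 0.8])   # h(w, z) for z in S1 (positives)
neg = np.array([0.1, 0.2])   # h(w, z') for z' in S2 (negatives)

linear_risk = cpr_risk(pos, neg, sq_loss, f=lambda g: g)          # linear f
lam = 2.0
pauc_risk = cpr_risk(pos, neg,
                     pair_loss=lambda a, b: np.exp(sq_loss(a, b) / lam),
                     f=lambda g: lam * np.log(g))                 # non-linear f
```

With linear $f$ the risk is the plain average pairwise loss; by Jensen's inequality the log-exp aggregation upper-bounds it, putting more weight on hard pairs.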

2. RELATED WORK

FL for ERM. The challenge of FL is how to utilize distributed data to learn an ML model with light communication cost without harming data privacy (Konečný et al., 2016; McMahan et al., 2017). To reduce the communication cost, many algorithms have been proposed to skip communications (Stich, 2018; Yu et al., 2019a;b; Yang, 2013; Karimireddy et al., 2020b) or compress the communicated statistics (Stich et al., 2018; Basu et al., 2019; Jiang & Agrawal, 2018; Wangni et al., 2018; Bernstein et al., 2018). Tight analysis has been performed in various studies (Kairouz et al., 2021; Yu et al., 2019a;b; Khaled et al., 2020; Woodworth et al., 2020b;a; Karimireddy et al., 2020b; Haddadpour et al., 2019). However, most of these works target ERM. FL for Non-ERM Problems. In (Guo et al., 2020; Yuan et al., 2021a; Deng & Mahdavi, 2021; Deng et al., 2020; Liu et al., 2020; Sharma et al., 2022), federated minimax optimization algorithms are studied, which are not applicable to our problem when $f$ is non-convex. Gao et al. (2022) have considered a much simpler federated compositional optimization problem of the form $\sum_k \mathbb{E}_{\zeta\sim\mathcal{D}_k^f} f_k(\mathbb{E}_{\xi\sim\mathcal{D}_k^g} g_k(\mathbf{w};\xi); \zeta)$, where $k$ denotes the machine index. Compared with our CPR risk, their objective does not involve interdependence between different machines. Li et al. (2022); Huang et al. (2022) have analyzed FL algorithms for bi-level problems where only the lower-level objective involves data distributed over many machines. Tarzanagh et al. (2022) considered another federated bilevel problem, where both the upper- and lower-level objectives are distributed over many machines, but the lower-level objective is not coupled with the data in the upper-level objective. Xing et al. (2022) studied federated bilevel optimization in a server-clients setting, where the central server solves an objective that depends on optimal solutions of local clients. Our problem cannot be mapped into these federated bilevel optimization problems.
Centralized Compositional Pairwise Risk Minimization. In the centralized setting, CPR minimization has been considered in recent works (Qi et al., 2021; Wang et al., 2022; Wang & Yang, 2022a; Qiu et al., 2022; Jiang et al., 2022). However, it is non-trivial to extend these algorithms to the FL setting due to the challenges mentioned earlier. We provide a summary of state-of-the-art sample complexities for solving ERM and CPR in both the centralized and FL settings in Appendix B.

3. FEDX FOR OPTIMIZING CPR

We assume $S_1, S_2$ are split into $N$ non-overlapping subsets that are distributed over $N$ clients, i.e., $S_1 = S_1^1\cup S_1^2\cup\ldots\cup S_1^N$ and $S_2 = S_2^1\cup S_2^2\cup\ldots\cup S_2^N$. We denote by $\mathbb{E}_{z\sim S} = \frac{1}{|S|}\sum_{z\in S}$ the average over a set $S$, and define the weights $\omega_{1i} = N|S_1^i|/|S_1|$ and $\omega_{2j} = N|S_2^j|/|S_2|$ for $i = 1,\ldots,N$, $j = 1,\ldots,N$. We assume that the quantities $\omega_1 = (\omega_{11},\ldots,\omega_{1N})$ and $\omega_2 = (\omega_{21},\ldots,\omega_{2N})$ are available on all clients. If not, they can be easily computed and communicated once among the $N$ clients. Denote by $\nabla_1\ell(\cdot,\cdot)$ and $\nabla_2\ell(\cdot,\cdot)$ the partial gradients with respect to the first and second arguments, respectively. Without loss of generality, we assume the dimensionality of $h(\mathbf{w}; z)$ is 1 (i.e., $d_o = 1$) in the following presentation. For our discussion of complexity, we simply assume $\omega_{1i}, \omega_{2j} \approx O(1)$.
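The weights $\omega$ make the client-wise decomposition exact even when shards have unequal sizes. A quick numerical check of the identity $\frac{1}{N}\sum_i \omega_{1i}\,\mathbb{E}_{z\in S_1^i} = \mathbb{E}_{z\in S_1}$ with $\omega_{1i} = N|S_1^i|/|S_1|$ (a sketch on synthetic shard values, not the paper's code):

```python
import numpy as np

# Verify the reweighting identity:
# (1/N) * sum_i omega_{1i} * mean(S_1^i)  ==  mean(S_1).
rng = np.random.default_rng(0)
shards = [rng.normal(size=n) for n in (3, 7, 5)]    # unevenly sized local shards
N = len(shards)
total = sum(len(s) for s in shards)                 # |S_1|
omega = [N * len(s) / total for s in shards]        # omega_{1i} = N * |S_1^i| / |S_1|
weighted = sum(w * s.mean() for w, s in zip(omega, shards)) / N
global_mean = np.concatenate(shards).mean()
```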

3.1. FEDX1 FOR OPTIMIZING CPR WITH LINEAR f

With linear $f$, we rewrite the CPR risk into an equivalent form that is tailored to the FL setting:

$$\min_{\mathbf{w}\in\mathbb{R}^d} F(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N \mathbb{E}_{z\in S_1^i}\,\frac{1}{N}\sum_{j=1}^N \mathbb{E}_{z'\in S_2^j}\,\ell_{ij}(h(\mathbf{w}, z), h(\mathbf{w}, z')),$$

where $\ell_{ij}(h(\mathbf{w},z), h(\mathbf{w},z')) = \omega_{1i}\,\omega_{2j}\,\ell(h(\mathbf{w},z), h(\mathbf{w},z'))$. To highlight the challenge and motivate FedX, we compute the gradient of the objective function and decompose it into two terms:

$$\nabla F(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N\underbrace{\mathbb{E}_{z\in S_1^i}\frac{1}{N}\sum_{j=1}^N\mathbb{E}_{z'\in S_2^j}\nabla_1\ell_{ij}(h(\mathbf{w},z), h(\mathbf{w},z'))\nabla h(\mathbf{w},z)}_{\Delta_{i1}} + \frac{1}{N}\sum_{i=1}^N\underbrace{\mathbb{E}_{z'\in S_2^i}\frac{1}{N}\sum_{j=1}^N\mathbb{E}_{z\in S_1^j}\nabla_2\ell_{ji}(h(\mathbf{w},z), h(\mathbf{w},z'))\nabla h(\mathbf{w},z')}_{\Delta_{i2}}.$$

With the above decomposition, we can see that the main task at local client $i$ is to estimate the gradient terms $\Delta_{i1}$ and $\Delta_{i2}$. Due to the symmetry between $\Delta_{i1}$ and $\Delta_{i2}$, below we only use $\Delta_{i1}$ as an illustration for explaining the proposed algorithm. The difficulty in computing $\Delta_{i1}$ lies in its reliance on data on other machines due to the presence of $\mathbb{E}_{z'\in S_2^j}$ for all $j$. To overcome this difficulty, we decouple the data-dependent factors in $\Delta_{i1}$ into two types, marked below as local and global:

$$\Delta_{i1} = \underbrace{\mathbb{E}_{z\in S_1^i}}_{\text{local}}\ \underbrace{\frac{1}{N}\sum_{j=1}^N\mathbb{E}_{z'\in S_2^j}}_{\text{global}}\ \nabla_1\ell_{ij}\big(\underbrace{h(\mathbf{w}, z)}_{\text{local}},\ \underbrace{h(\mathbf{w}, z')}_{\text{global}}\big)\ \underbrace{\nabla h(\mathbf{w}, z)}_{\text{local}}. \qquad (3)$$

It is notable that the three local factors can be estimated or computed based on the local data. In particular, the first local factor can be estimated by sampling data from $S_1^i$, and the other two can be computed from the sampled data $z$ and the local model parameter. The difficulty springs from estimating and computing the two global factors, which depend on data on all machines. We would like to avoid communicating $h(\mathbf{w}; z')$ at every iteration for estimating the global factors, as each communication would incur additional overhead. To tackle this, we propose to leverage the historical information computed in the previous round. To put this into the context of optimization, we consider the update at the $k$-th iteration during the $r$-th round, where $k = 0,\ldots,K-1$.
Let $\mathbf{w}^r_{i,k}$ denote the local model on the $i$-th client at the $k$-th iteration within the $r$-th round. Let $z^r_{i,k,1}\in S_1^i$, $z^r_{i,k,2}\in S_2^i$ denote the data sampled at the $k$-th iteration from $S_1^i$ and $S_2^i$, respectively. Each local machine computes $h(\mathbf{w}^r_{i,k}; z^r_{i,k,1})$ and $h(\mathbf{w}^r_{i,k}; z^r_{i,k,2})$, which will be used for computing the active parts. Across all iterations $k = 0,\ldots,K-1$, we accumulate the computed prediction outputs over sampled data and store them in two sets $H^r_{i,1} = \{h(\mathbf{w}^r_{i,k}; z^r_{i,k,1}),\ k = 0,\ldots,K-1\}$ and $H^r_{i,2} = \{h(\mathbf{w}^r_{i,k}; z^r_{i,k,2}),\ k = 0,\ldots,K-1\}$. At the end of round $r$, we communicate $\mathbf{w}^r_{i,K}$, $H^r_{i,1}$ and $H^r_{i,2}$ to the central server, which averages the local models to get a global model $\bar{\mathbf{w}}^r$ and also aggregates $H^r_1 = H^r_{1,1}\cup H^r_{2,1}\cup\ldots\cup H^r_{N,1}$ and $H^r_2 = H^r_{1,2}\cup H^r_{2,2}\cup\ldots\cup H^r_{N,2}$. This aggregated information is broadcast to each individual client. Then, at the $k$-th iteration in the $r$-th round, we estimate the global factor by sampling $h^{r-1}_{2,\xi}\in H^{r-1}_2$ without replacement and compute an estimator of $\Delta_{i1}$ by

$$G^r_{i,k,1} = \nabla_1\ell_{ij}\big(\underbrace{h(\mathbf{w}^r_{i,k}; z^r_{i,k,1})}_{\text{active}},\ \underbrace{h^{r-1}_{2,\xi}}_{\text{lazy}}\big)\ \underbrace{\nabla h(\mathbf{w}^r_{i,k}; z^r_{i,k,1})}_{\text{active}}, \qquad (4)$$

where $\xi = (j, t, z^{r-1}_{j,t,2})$ represents a random variable that captures the randomness in the sampled client $j\in\{1,\ldots,N\}$, the iteration index $t\in\{0,\ldots,K-1\}$, and the data sample $z^{r-1}_{j,t,2}\in S_2^j$; it is used for estimating the global factor in (3). We refer to the factors of $G^r_{i,k,1}$ marked "active" as the active parts and the factor marked "lazy" as the lazy part. Similarly, we can estimate $\Delta_{i2}$ by

$$G^r_{i,k,2} = \nabla_2\ell_{j'i}\big(\underbrace{h^{r-1}_{1,\zeta}}_{\text{lazy}},\ \underbrace{h(\mathbf{w}^r_{i,k}; z^r_{i,k,2})}_{\text{active}}\big)\ \underbrace{\nabla h(\mathbf{w}^r_{i,k}; z^r_{i,k,2})}_{\text{active}}, \qquad (5)$$

where $h^{r-1}_{1,\zeta}\in H^{r-1}_1$ is a randomly sampled prediction output from the previous round, with $\zeta = (j', t', z^{r-1}_{j',t',1})$ representing a random variable including a client sample $j'$, an iteration sample $t'$, and the data sample $z^{r-1}_{j',t',1}$.
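The active/lazy split in the estimator (4) can be sketched as follows, using a linear scorer and the pairwise square loss purely for illustration (the buffer contents and function names are our own assumptions, not the paper's implementation):

```python
import numpy as np
from collections import deque

def grad1_sq(a, b):
    """d/da of the pairwise square loss (1 - a + b)^2."""
    return -2.0 * (1.0 - a + b)

# Lazy buffer: prediction outputs h(w^{r-1}; z') received from the server,
# shuffled once per round; each local step just takes the next one.
lazy_buffer = deque([0.12, -0.3, 0.05])      # stale h-values from round r-1

def fedx1_grad_estimate(w, z):
    """Active parts: h(w; z) and grad h(w; z) computed on fresh local data.
    Lazy part: a stale prediction h^{r-1}_{2,xi} popped from the buffer."""
    a = float(w @ z)              # active: h(w; z) with a linear scorer
    b = lazy_buffer.popleft()     # lazy: h^{r-1}_{2,xi}
    return grad1_sq(a, b) * z     # G_{i,k,1} = grad_1 l(a, b) * grad h(w; z)

w = np.array([1.0, -1.0])
z = np.array([0.5, 0.2])
g = fedx1_grad_estimate(w, z)
```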
Then we update the local model parameter $\mathbf{w}^r_{i,k}$ using the gradient estimator $G^r_{i,k,1} + G^r_{i,k,2}$. We present the detailed steps of the proposed algorithm FedX1 in Algorithm 1. Several remarks follow: (i) at every round, the algorithm needs to communicate both the model parameters $\mathbf{w}^r_{i,K}$ and the historical prediction outputs $H^{r-1}_{i,1}$ and $H^{r-1}_{i,2}$, where $H^{r-1}_{i,*}$ is constructed by collecting all or sub-sampled computed predictions in the $(r-1)$-th round. The bottom line for constructing $H^{r-1}_{i,*}$ is to ensure that the per-round communication cost of a client is $O(d + Kd_o/N)$. Nevertheless, for simplicity, in Algorithm 1 we simply put all historical predictions into $H^{r-1}_{i,*}$. Similar to all other FL algorithms, FedX1 does not require communicating the raw input data, and hence protects the privacy of the data. However, compared with most FL algorithms for ERM, FedX1 for CPR has an additional communication overhead of at least $O(d_o K/N)$, which depends on the dimensionality of the prediction output $d_o$. For learning a high-dimensional model (e.g., a deep neural network with $d\gg 1$) with score-based pairwise losses ($d_o = 1$), the additional communication cost $O(K/N)$ can be marginal. For updating the buffers $B_{i,1}$ and $B_{i,2}$, we can simply flush the history, add the newly received $R^{r-1}_{i,1}$ with random shuffling to $B_{i,1}$, and add $R^{r-1}_{i,2}$ with random shuffling to $B_{i,2}$. However, we can keep the history up to a certain limit as long as the latency error can be well controlled, which is analyzed in Appendix E.

Algorithm 1 FedX1: Federated Learning for CPR with linear $f$
1: On Client $i$: Require parameters $\eta$, $K$
2: Initialize model $\mathbf{w}^0_{i,0}$ and initialize buffers $B_{i,1} = \emptyset$ and $B_{i,2} = \emptyset$
3: Sample $K$ points from $S_1^i$ and compute their predictions using model $\mathbf{w}^0_{i,0}$, denoted by $H^0_{i,1}$
4: Sample $K$ points from $S_2^i$ and compute their predictions using model $\mathbf{w}^0_{i,0}$, denoted by $H^0_{i,2}$
5: for $r = 1,\ldots,R$ do
6:   Send $H^{r-1}_{i,1}$, $H^{r-1}_{i,2}$ to the server
7:   Receive $R^{r-1}_{i,1}$, $R^{r-1}_{i,2}$ from the server
8:   Update buffers $B_{i,1}$, $B_{i,2}$ using $R^{r-1}_{i,1}$, $R^{r-1}_{i,2}$ with shuffling ⋄ see text for updating the buffer
9:   Set $H^r_{i,1} = \emptyset$, $H^r_{i,2} = \emptyset$
10:  for $k = 0,\ldots,K-1$ do
11:    Sample $z^r_{i,k,1}$ from $S_1^i$ and $z^r_{i,k,2}$ from $S_2^i$ ⋄ or sample two mini-batches of data
12:    Take the next $h^{r-1}_{2,\xi}$ and $h^{r-1}_{1,\zeta}$ from $B_{i,2}$ and $B_{i,1}$, respectively
13:    Compute $h(\mathbf{w}^r_{i,k}, z^r_{i,k,1})$ and $h(\mathbf{w}^r_{i,k}, z^r_{i,k,2})$
14:    Add $h(\mathbf{w}^r_{i,k}, z^r_{i,k,1})$ to $H^r_{i,1}$ and $h(\mathbf{w}^r_{i,k}, z^r_{i,k,2})$ to $H^r_{i,2}$
15:    Compute $G^r_{i,k,1}$ and $G^r_{i,k,2}$ according to (4) and (5)
16:    $\mathbf{w}^r_{i,k+1} = \mathbf{w}^r_{i,k} - \eta(G^r_{i,k,1} + G^r_{i,k,2})$
17:  end for
18:  Send $\mathbf{w}^r_{i,K}$ to the server
19:  Receive $\bar{\mathbf{w}}^r$ from the server and set $\mathbf{w}^{r+1}_{i,0} = \bar{\mathbf{w}}^r$
20: end for
21: On Server:
22: for $r = 0,\ldots,R-1$ do
23:  Collect $H^r_1 = H^r_{1,1}\cup H^r_{2,1}\cup\ldots\cup H^r_{N,1}$ and $H^r_2 = H^r_{1,2}\cup H^r_{2,2}\cup\ldots\cup H^r_{N,2}$
24:  Set $R^r_{i,1} = H^r_1$, $R^r_{i,2} = H^r_2$
25:  Send $R^r_{i,1}$, $R^r_{i,2}$ to client $i$ for all $i\in[N]$
26:  Receive $\mathbf{w}^{r+1}_{i,K}$ from client $i$, compute $\bar{\mathbf{w}}^{r+1} = \frac{1}{N}\sum_{i=1}^N \mathbf{w}^{r+1}_{i,K}$, and broadcast it to all clients
27: end for

Next, we present the theoretical results of FedX1, with more formal results given in the appendix.

Theorem 1. (Informal) Under appropriate conditions, by setting $\eta = O\big(\frac{N}{R^{2/3}}\big)$ and $K = O\big(\frac{R^{1/3}}{N}\big)$, Algorithm 1 ensures that $\mathbb{E}\big[\frac{1}{R}\sum_{r=1}^R \|\nabla F(\bar{\mathbf{w}}^r)\|^2\big] \le O\big(\frac{1}{R^{2/3}}\big)$.

Remark. To get $\mathbb{E}\big[\frac{1}{R}\sum_{r=1}^R\|\nabla F(\bar{\mathbf{w}}^r)\|^2\big]\le\epsilon^2$, we just need to set $R = O(\frac{1}{\epsilon^3})$, $\eta = N\epsilon^2$ and $K = \frac{1}{N\epsilon}$. The number of communications $R$ is then much less than the total number of iterations, i.e., $O(\frac{1}{N\epsilon^4})$, as long as $N\le O(\frac{1}{\epsilon})$. And the sample complexity on each machine is $\frac{1}{N\epsilon^4}$, which is linearly reduced by the number of machines $N$.
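The arithmetic behind the remark of Theorem 1 can be checked mechanically. The snippet below verifies that $R = 1/\epsilon^3$ together with $\eta = N/R^{2/3}$ and $K = R^{1/3}/N$ reproduces $\eta = N\epsilon^2$, $K = \frac{1}{N\epsilon}$, and a per-machine sample complexity $RK = \frac{1}{N\epsilon^4}$ (a sanity check, not part of the proof):

```python
# Sanity-check the parameter settings in the remark of Theorem 1:
# with R = 1/eps^3, the choices eta = N / R^(2/3) and K = R^(1/3) / N
# reduce to eta = N * eps^2 and K = 1 / (N * eps), and the per-machine
# sample complexity R * K equals 1 / (N * eps^4).
eps, N = 1e-2, 4
R = 1.0 / eps ** 3
eta = N / R ** (2.0 / 3.0)
K = R ** (1.0 / 3.0) / N
samples_per_machine = R * K
```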

Novelty of Analysis.

As the lazy parts are computed on different machines in a previous round, the gradient estimators $G^r_{i,k,1}$ and $G^r_{i,k,2}$ involve dependency between the local model parameter $\mathbf{w}^r_{i,k}$ and the historical data contained in $\xi, \zeta$ used for computing $G^r_{i,k,1}$ and $G^r_{i,k,2}$, which makes the analysis more involved. We need to make sure that using a gradient estimator based on them can still yield "good" results. To this end, we borrow an analysis technique from (Yang et al., 2021b) to decouple the dependence between the current model parameter and the data used for computing the current gradient estimator; there, the data from the previous iteration is paired with the data from the current iteration in order to compute a gradient of the pairwise loss $\ell(h(\mathbf{w}_t; z_t), h(\mathbf{w}_t; z_{t-1}))$. Nevertheless, in federated CPR, controlling the error brought by the lazy parts is more challenging, since the delay is much longer and they were computed on different machines. In our analysis, we replace $\mathbf{w}^r_{i,k}$ with $\bar{\mathbf{w}}^{r-1}$ to decouple the dependence between the model parameter $\bar{\mathbf{w}}^{r-1}$ and the historical data $\xi, \zeta$; then we need to control the latency error $\|\bar{\mathbf{w}}^{r-1} - \bar{\mathbf{w}}^r\|^2$ and the gap error between different machines $\sum_i\sum_k \mathbb{E}\|\bar{\mathbf{w}}^r - \mathbf{w}^r_{i,k}\|^2$ such that the complexities are not compromised.

3.2. FEDX2 FOR OPTIMIZING CPR WITH NONLINEAR f

Similarly, we rewrite the objective into an equivalent form tailored to the FL setting, i.e.,

$$F(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N \mathbb{E}_{z\in S_1^i}\, f_i\bigg(\frac{1}{N}\sum_{j=1}^N \mathbb{E}_{z'\in S_2^j}\,\ell_j(h(\mathbf{w}; z), h(\mathbf{w}; z'))\bigg),$$

where $f_i(\cdot) = \omega_{1i} f(\cdot)$ and $\ell_j(\cdot,\cdot) = \omega_{2j}\ell(\cdot,\cdot)$. We compute the gradient and decompose it into two terms:

$$\nabla F(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N\underbrace{\mathbb{E}_{z\in S_1^i}\frac{1}{N}\sum_{j=1}^N\mathbb{E}_{z'\in S_2^j}\nabla f_i(g(\mathbf{w}; z, S_2))\,\nabla_1\ell_j(h(\mathbf{w};z), h(\mathbf{w};z'))\nabla h(\mathbf{w};z)}_{\Delta_{i1}} + \frac{1}{N}\sum_{i=1}^N\underbrace{\mathbb{E}_{z'\in S_2^i}\frac{1}{N}\sum_{j=1}^N\mathbb{E}_{z\in S_1^j}\nabla f_j(g(\mathbf{w}; z, S_2))\,\nabla_2\ell_i(h(\mathbf{w};z), h(\mathbf{w};z'))\nabla h(\mathbf{w};z')}_{\Delta_{i2}}.$$
Compared to (3) for CPR with linear $f$, the $\Delta_{i1}$ term above involves another factor, $\nabla f_i(g(\mathbf{w}; z, S_2))$, which cannot be computed locally as it depends on $S_2$, distributed over all machines. Similarly, the $\Delta_{i2}$ term above involves another non-locally computable factor, $\nabla f_j(g(\mathbf{w}; z, S_2))$. To address the challenge of estimating $g(\mathbf{w}; z, S_2)$, we leverage a technique similar to the centralized setting (Wang & Yang, 2022b) by tracking it with a moving-average estimator based on random samples. In a centralized setting, one can maintain and update $u(z)$ for estimating $g(\mathbf{w}, z, S_2)$ by $u(z)\leftarrow(1-\gamma)u(z) + \gamma\,\ell(h(\mathbf{w},z), h(\mathbf{w},z'))$, where $z'$ is a random sample from $S_2$. However, this is not possible in an FL setting, as $S_2$ is distributed over many machines. To tackle this, we leverage the same delayed-communication technique used in the last subsection. In particular, at the $k$-th iteration in the $r$-th round, we update $u(z^r_{i,k,1})$ for a sampled $z^r_{i,k,1}$ by

$$u^r_{i,k}(z^r_{i,k,1}) \leftarrow (1-\gamma)\,u^r_{i,k}(z^r_{i,k,1}) + \gamma\,\ell_j(h(\mathbf{w}^r_{i,k}, z^r_{i,k,1}), h^{r-1}_{2,\xi}),$$

where $h^{r-1}_{2,\xi}$ is a random sample from $H^{r-1}_2$ and $\xi = (j, t, z^{r-1}_{j,t,2})$ captures the randomness in the client, iteration index and data sample from the last round. Then, we can use $\nabla f_i(u^r_{i,k}(z^r_{i,k,1}))$ in place of $\nabla f_i(g(\mathbf{w}^r_{i,k}; z^r_{i,k,1}))$ for estimating $\Delta_{i1}$. However, it is more nuanced to estimate $\nabla f_j(g(\mathbf{w}; z, S_2))$ in $\Delta_{i2}$, since $z\in S_1^j$ is not local random data. To address this, we propose to communicate $U^{r-1} = \{u^{r-1}_{i,k}(z^{r-1}_{i,k,1}),\ i\in[N],\ k\in\{0,\ldots,K-1\}\}$. Then, at the $k$-th iteration in the $r$-th round on the $i$-th client, we can estimate $\nabla f_j(g(\mathbf{w}; z, S_2))$ with a random sample from $U^{r-1}$, denoted by $u^{r-1}_\zeta$ with $\zeta = (j', t', z^{r-1}_{j',t',1})$, i.e., by using $\nabla f_{j'}(u^{r-1}_\zeta)$.
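The moving-average tracker is a simple exponential average. The following sketch (ours; synthetic one-sample estimates stand in for the $\ell_j(h(\mathbf{w},z), h^{r-1}_{2,\xi})$ values) shows $u$ converging to the inner value it tracks:

```python
import numpy as np

# Moving-average tracking of the inner value g(w; z, S_2) from one random
# (lazy) sample per step: u <- (1 - gamma) * u + gamma * sample.
rng = np.random.default_rng(1)
g_true = 0.7                                   # stationary inner value to track
samples = g_true + rng.normal(0, 0.1, 5000)    # noisy one-sample loss estimates
gamma, u = 0.05, 0.0
for s in samples:
    u = (1.0 - gamma) * u + gamma * s
```

The smaller $\gamma$ is, the lower the variance of $u$ but the slower it reacts when the model (and hence $g$) changes; Theorem 2 balances this by decreasing $\gamma$ with $R$.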
Then we estimate $\Delta_{i1}$ and $\Delta_{i2}$ by

$$G^r_{i,k,1} = \underbrace{\nabla f_i(u^r_{i,k}(z^r_{i,k,1}))}_{\text{active}}\ \nabla_1\ell_j\big(\underbrace{h(\mathbf{w}^r_{i,k}; z^r_{i,k,1})}_{\text{active}},\ \underbrace{h^{r-1}_{2,\xi}}_{\text{lazy}}\big)\ \underbrace{\nabla h(\mathbf{w}^r_{i,k}; z^r_{i,k,1})}_{\text{active}},$$
$$G^r_{i,k,2} = \underbrace{\nabla f_{j'}(u^{r-1}_\zeta)}_{\text{lazy}}\ \nabla_2\ell_i\big(\underbrace{h^{r-1}_{1,\zeta}}_{\text{lazy}},\ \underbrace{h(\mathbf{w}^r_{i,k}; z^r_{i,k,2})}_{\text{active}}\big)\ \underbrace{\nabla h(\mathbf{w}^r_{i,k}; z^r_{i,k,2})}_{\text{active}}, \qquad (9)$$

where $j, \xi, j', \zeta$ are random variables. Another difference from CPR with linear $f$ is that, even in the centralized setting, directly using $G^r_{i,k,1} + G^r_{i,k,2}$ leads to a worse complexity, because a non-linear $f$ makes the stochastic gradient estimator biased (Wang et al., 2017). Hence, in order to improve the convergence, we follow existing state-of-the-art algorithms for stochastic compositional optimization (Ghadimi et al., 2020; Wang & Yang, 2022b) and compute a moving-average estimator of the gradient at local machines, i.e., Step 17 in Algorithm 2. With these changes, we present the detailed steps of FedX2 for solving CPR with non-linear $f$ in Algorithm 2.

Algorithm 2 FedX2: Federated Learning for CPR with non-linear $f$
1: On Client $i$: Require parameters $\eta$, $K$
2: Initialize model $\mathbf{w}^0_{i,0}$, $U^0_i = \{u^0(z) = 0, z\in S_1^i\}$, $G^0_{i,0} = 0$, and buffers $B_{i,1}, B_{i,2}, C_i = \emptyset$
3: Sample $K$ points from $S_1^i$ and compute their predictions using model $\mathbf{w}^0_{i,0}$, denoted by $H^0_{i,1}$
4: Sample $K$ points from $S_2^i$ and compute their predictions using model $\mathbf{w}^0_{i,0}$, denoted by $H^0_{i,2}$
5: for $r = 1,\ldots,R$ do
6:   Send $H^{r-1}_{i,1}$, $H^{r-1}_{i,2}$, $U^{r-1}_i$ to the server
7:   Receive $R^{r-1}_{i,1}$, $R^{r-1}_{i,2}$, $P^{r-1}$ from the server
8:   Update the buffers $B_{i,1}$, $B_{i,2}$, $C_i$ using $R^{r-1}_{i,1}$, $R^{r-1}_{i,2}$, $P^{r-1}$ with shuffling, respectively
9:   Set $H^r_{i,1} = \emptyset$, $H^r_{i,2} = \emptyset$, $U^r_i = \emptyset$
10:  for $k = 0,\ldots,K-1$ do
11:    Sample $z^r_{i,k,1}$ from $S_1^i$ and $z^r_{i,k,2}$ from $S_2^i$
12:    Take the next $h^{r-1}_{2,\xi}$, $h^{r-1}_{1,\zeta}$ and $u^{r-1}_\zeta$ from $B_{i,2}$, $B_{i,1}$ and $C_i$, respectively
13:    Compute $h(\mathbf{w}^r_{i,k}, z^r_{i,k,1})$ and $h(\mathbf{w}^r_{i,k}, z^r_{i,k,2})$
14:    Add $h(\mathbf{w}^r_{i,k}, z^r_{i,k,1})$ to $H^r_{i,1}$ and $h(\mathbf{w}^r_{i,k}, z^r_{i,k,2})$ to $H^r_{i,2}$
15:    Update $u^r_{i,k}(z^r_{i,k,1})$ by the moving-average update above and add it to $U^r_i$
16:    Compute $G^r_{i,k,1}$ and $G^r_{i,k,2}$ according to (9)
17:    $G^r_{i,k} = (1-\beta)G^r_{i,k-1} + \beta(G^r_{i,k,1} + G^r_{i,k,2})$
18:    $\mathbf{w}^r_{i,k+1} = \mathbf{w}^r_{i,k} - \eta G^r_{i,k}$
19:  end for
20:  Send $\mathbf{w}^r_{i,K}$, $G^r_{i,K}$ to the server
21:  Receive $\bar{\mathbf{w}}^r$, $\bar{G}^r$ from the server and set $\mathbf{w}^{r+1}_{i,0} = \bar{\mathbf{w}}^r$, $G^{r+1}_{i,0} = \bar{G}^r$
22: end for
23: On Server:
24: for $r = 0,\ldots,R-1$ do
25:  Collect $H^r_* = H^r_{1,*}\cup H^r_{2,*}\cup\ldots\cup H^r_{N,*}$ for $* = 1, 2$, and $U^r = U^r_1\cup\ldots\cup U^r_N$
26:  Set $R^r_{i,1} = H^r_1$, $R^r_{i,2} = H^r_2$, $P^r = U^r$ and send them to client $i$ for all $i\in[N]$
27:  Receive $\mathbf{w}^{r+1}_{i,K}$, $G^{r+1}_{i,K}$ from client $i$, compute $\bar{\mathbf{w}}^{r+1} = \frac{1}{N}\sum_{i=1}^N\mathbf{w}^{r+1}_{i,K}$, $\bar{G}^{r+1} = \frac{1}{N}\sum_{i=1}^N G^{r+1}_{i,K}$, and broadcast them to all clients
28: end for

The buffers $B_{i,*}$ and $C_i$ are updated similarly to FedX1. Different from FedX1, there is an additional communication cost for communicating $U^{r-1}_i$ and an additional buffer $C_i$ at each local machine to store the received $P^{r-1}$ from the aggregated $U^{r-1}$. Nevertheless, these additional costs are marginal compared with communicating $H^{r-1}_*$ and maintaining the buffers $B_{i,*}$. We present the convergence result of FedX2 below, with more formal results given in the appendix.

Theorem 2. (Informal) Under appropriate conditions, denoting by $M = \max_i |S_1^i|$ the largest number of data points on a single machine, by setting $\gamma = O\big(\frac{M^{1/3}}{R^{2/3}}\big)$, $\beta = O\big(\frac{1}{M^{1/6}R^{2/3}}\big)$, $\eta = O\big(\frac{1}{M^{2/3}R^{2/3}}\big)$ and $K = O(M^{1/3}R^{1/3})$, Algorithm 2 ensures that $\mathbb{E}\big[\frac{1}{R}\sum_{r=1}^R\|\nabla F(\bar{\mathbf{w}}^r)\|^2\big]\le O\big(\frac{1}{R^{2/3}}\big)$.

Remark. To get $\mathbb{E}\big[\frac{1}{R}\sum_{r=1}^R\|\nabla F(\bar{\mathbf{w}}^r)\|^2\big]\le\epsilon^2$, we just set $R = O(\frac{M^{1/2}}{\epsilon^3})$, $\eta = O(\frac{\epsilon^2}{M})$, $\gamma = O(\epsilon^2)$, $\beta = \frac{\epsilon^2}{\sqrt{M}}$ and $K = \frac{M^{1/2}}{\epsilon}$. The number of communications $R = O(\frac{M^{1/2}}{\epsilon^3})$ is less than the total number of iterations, i.e., $O(\frac{M}{\epsilon^4})$, by a factor of $O(M^{1/2}/\epsilon)$. And the sample complexity on each machine is $\frac{M}{\epsilon^4}$, which is less than that of Wang & Yang (2022b), whose sample complexity is $O(\sum_{i=1}^N |S_1^i|/\epsilon^4)$. When the data are evenly distributed over different machines, we achieve a linear-speedup property. In an extreme case where all data are on one machine, the sample complexity of FedX2 matches that established in (Wang & Yang, 2022b), which is expected. Compared with FedX1, the analysis of FedX2 has to deal with several extra difficulties.
First, with non-linear $f$, the coupling between the inner and outer functions adds to the complexity of the interdependence between different rounds and different machines. Second, we have to deal with the error of the lazy part related to $u$. It is notable that our analysis of FedX2 with a moving-average gradient estimator for solving CPR differs from previous studies of local momentum methods (Yu et al., 2019a; Karimireddy et al., 2020a), which used a moving average with a fixed momentum parameter to compute a gradient estimator in local steps for the ERM problem. In contrast, in FedX2 the momentum parameter $\beta$ decreases as $R$ increases, which is similar to centralized algorithms for solving compositional problems (Ghadimi et al., 2020; Wang & Yang, 2022b).

4. EXPERIMENTS

To verify our algorithms, we run experiments on two tasks: federated deep partial AUC maximization and federated deep AUC maximization with a pairwise surrogate loss, which correspond to (1) with non-linear $f$ and linear $f$, respectively. Datasets and Neural Networks. We use four datasets: CIFAR-10, CIFAR-100 (Krizhevsky, 2009), CheXpert (Irvin et al., 2019), and ChestMNIST (Yang et al., 2021a), where the latter two are large-scale medical image datasets. The statistics of these datasets are reported in the Appendix. For CIFAR-10 and CIFAR-100, we sample 20% of the training data as a validation set and construct imbalanced binary versions with positive:negative = 1:5 in the training set, similar to (Yuan et al., 2021b). For CheXpert, we consider the task of predicting Consolidation, use the last 1000 images in the training set as the validation set, and use the original validation set as the testing set. For ChestMNIST, we consider the task of Mass prediction and use the provided train/valid/test split. We distribute the training data to $N = 16$ machines unless specified otherwise. To increase the heterogeneity of data on different machines, we add random Gaussian noise of $\mathcal{N}(\mu, 0.04)$ to all training images, where $\mu\in\{-0.08 : 0.01 : 0.08\}$ varies across machines, i.e., for the $i$-th machine out of the $N = 16$ machines, its $\mu = -0.08 + i\times 0.01$. We train ResNet18 from scratch for the CIFAR-10 and CIFAR-100 data, and initialize DenseNet121 with an ImageNet-pretrained model for the CheXpert and ChestMNIST data. All experiments use the PyTorch framework (Paszke et al., 2019). Baselines. We compare our algorithms with three local baselines: 1) Local SGD, which optimizes a cross-entropy loss using the classical local SGD algorithm; 2) CODASCA, a state-of-the-art FL algorithm for optimizing a min-max formulated AUC loss (Yuan et al., 2021a); and 3) Local Pair, which optimizes the CPR risk using only local pairs.
As a reference, we also compare with the Centralized methods, i.e., mini-batch SGD for CPR with linear $f$ and SOX for CPR with non-linear $f$. For each algorithm, we tune the initial step size in $[10^{-3}, 1]$ using grid search and decay it by a factor of 0.1 after every 5K iterations. All algorithms are run for 20K iterations. The mini-batch sizes $B_1, B_2$ (as in Step 11 of FedX1 and FedX2) are set to 32. The $\beta$ parameter of FedX2 (and the corresponding Local Pair and Centralized methods) is set to 0.1. For the Centralized method, we tune the batch sizes $B_1$ and $B_2$ over $\{32, 64, 128, 256, 512\}$ in an effort to benchmark the best performance of the centralized setting. For CODASCA and Local SGD, which do not use pairwise losses, we set the batch size to 64 for a fair comparison with FedX. For all non-centralized algorithms, we set the communication interval $K = 32$ unless specified otherwise. In every run of every algorithm, we use the validation set to select the best-performing model and then evaluate the selected model on the testing set. For each algorithm, we repeat 3 runs with different random seeds and report the averaged performance. FedX2 for Federated Deep Partial AUC Maximization. First, we consider the task of one-way partial AUC maximization, which refers to the area under the ROC curve with the false positive rate (FPR) restricted to be less than a threshold. We consider the KL-OPAUC loss function proposed in (Zhu et al., 2022), which is the formulation of (1) where $S_1^i$ denotes the set of positive data, $S_2^i$ denotes the set of negative data, $\ell(a, b) = \exp((b + 1 - a)_+^2/\lambda)$ and $f(\cdot) = \lambda\log(\cdot)$, where $\lambda$ is a parameter tuned in $[1, 5]$. The experimental results are reported in Table 1.
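A minimal sketch of the KL-OPAUC surrogate as described above (our own implementation, not the authors' code): the log-exp aggregation over negatives upper-bounds the plain pairwise average by Jensen's inequality, and approaches the per-positive worst-case margin as $\lambda\to 0$:

```python
import numpy as np

def kl_opauc(pos, neg, lam):
    """KL-OPAUC surrogate (our sketch of the formulation in the text):
    mean over positives a of  lam * log( mean over negatives b of
    exp( max(b + 1 - a, 0)^2 / lam ) )."""
    risks = []
    for a in pos:
        margins = np.maximum(neg + 1.0 - a, 0.0) ** 2
        risks.append(lam * np.log(np.mean(np.exp(margins / lam))))
    return float(np.mean(risks))

pos = np.array([2.0, 1.5])   # positive scores
neg = np.array([0.0, 0.9])   # negative scores
plain = np.mean([np.mean(np.maximum(neg + 1 - a, 0) ** 2) for a in pos])
soft = kl_opauc(pos, neg, lam=1.0)
```

Smaller $\lambda$ focuses the surrogate on the hardest negatives, which is what restricts attention to the low-FPR region.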
We have the following observations: (i) FedX2 is better than all local methods (i.e., Local SGD, Local Pair and CODASCA) and achieves competitive performance with the Centralized method, which indicates that our algorithm can effectively utilize data on all machines. The better performance of FedX2 on CIFAR-100 and CheXpert compared with the Centralized method is probably because the Centralized method overfits the training data; (ii) FedX2 is better than the Local Pair method, which implies that using data pairs from all machines helps improve performance in terms of partial AUC maximization; and (iii) FedX2 is better than CODASCA, which is not surprising, since CODASCA is designed to optimize the AUC loss while FedX2 optimizes the partial AUC loss. FedX1 for Federated Deep AUC Maximization with Corrupted Labels. Second, we consider the task of federated deep AUC maximization. Since deep AUC maximization by solving a min-max loss (an equivalent form of the pairwise square loss) has been developed in previous works (Yuan et al., 2021a), we aim to justify the benefit of using the general pairwise loss formulation. According to (Charoenphakdee et al., 2019), a symmetric loss can be more robust to data with corrupted labels for AUC maximization, where a symmetric loss is one such that $\ell(z)+\ell(-z)$ is a constant. Since the square loss is not symmetric, we conjecture that the min-max federated deep AUC maximization algorithm CODASCA is not robust to noise in the labels. In contrast, our algorithm FedX1 can optimize a symmetric pairwise loss; hence we expect FedX1 to outperform CODASCA in the presence of corrupted labels. To verify this hypothesis, we generate corrupted data by flipping the labels of 20% of both the positive and negative training data.
We use FedX1/Local Pair to optimize the symmetric pairwise sigmoid (PSM) loss (Calders & Jaroszewicz, 2007), which corresponds to (1) with linear $f(s) = s$ and $\ell(a, b) = (1 + \exp(a - b))^{-1}$, where $a$ is a positive-data score and $b$ is a negative-data score. The results are reported in Table 2. We observe that FedX1 is more robust to label noise compared with the other local methods, including Local SGD, Local Pair, and CODASCA, which optimizes a min-max AUC loss. As before, FedX1 has competitive performance with the Centralized method. Ablation Study. Third, we present an ablation study to further verify our theory. In particular, we show the benefit of using multiple machines and the lower communication complexity enabled by using $K > 1$ local updates between two communications. To verify the first effect, we fix $K$ and vary $N$; for the latter, we fix $N$ and vary $K$. We conduct experiments on the CIFAR-10 data for optimizing the CPR risk corresponding to the partial AUC loss, and the results are plotted in Figure 1. The left two figures demonstrate that our algorithm can tolerate a certain value of $K$ for skipping communications without harming performance, and the right two figures demonstrate the advantage of FL using FedX2, i.e., using data from more sources can dramatically improve performance.
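The symmetry property invoked above is easy to verify numerically: for the sigmoid loss on the margin $z = a - b$, $\ell(z) + \ell(-z) = 1$ for every $z$, whereas the square loss fails this test (a small check, ours):

```python
import math

# Pairwise sigmoid (PSM) loss on the margin z = a - b.
def psm(margin):
    return 1.0 / (1.0 + math.exp(margin))

# Symmetric: psm(z) + psm(-z) = 1 for every z.
checks = [psm(z) + psm(-z) for z in (-3.0, -0.5, 0.0, 0.5, 3.0)]

# Square loss on the margin: (1 - z)^2 + (1 + z)^2 = 2 + 2 z^2, not constant.
sq = lambda z: (1.0 - z) ** 2
sq_sums = [sq(z) + sq(-z) for z in (0.0, 1.0)]
```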

5. CONCLUSION

In this paper, we have considered federated learning (FL) for compositional pairwise risk minimization problems. We have developed communication-efficient FL algorithms to alleviate the interdependence between different machines. Novel convergence analysis is performed to address the technical challenges and to improve both the iteration and communication complexities of the proposed algorithms. We have conducted empirical studies of the proposed FL algorithms for deep partial AUC maximization and deep AUC maximization, achieving promising results compared with several baseline algorithms.

A APPLICATIONS OF CPR PROBLEMS

We now present some concrete applications of the CPR minimization problem, including AUROC maximization, partial AUROC maximization and AUPRC maximization. A more comprehensive list of CPR minimization problems is discussed in the Introduction and can also be found in a recent survey (Yang, 2022).

AUROC Maximization

The area under the ROC curve (AUROC) is defined (Hanley & McNeil, 1982) as

$$\text{AUROC}(\mathbf{w}) = \mathbb{E}\big[\mathbb{I}(h(\mathbf{w}, z)\ge h(\mathbf{w}, z'))\mid y = +1, y' = -1\big],$$

where $z, z'$ are a pair of data features and $y, y'$ are the corresponding labels. To maximize the AUROC, a number of surrogate losses $\ell(\cdot)$, e.g., $\ell(\mathbf{w}; z, z') = (1 - h(\mathbf{w}, z) + h(\mathbf{w}, z'))^2$, have been proposed in the literature (Gao et al., 2013; Zhao et al., 2011; Gao & Zhou, 2015; Calders & Jaroszewicz, 2007; Charoenphakdee et al., 2019; Yang et al., 2021b), which formulate the problem as

$$\min_{\mathbf{w}}\ \frac{1}{|S_1|}\sum_{z_i\in S_1}\frac{1}{|S_2|}\sum_{z_j\in S_2}\ell(\mathbf{w}, z_i, z_j),$$

where $S_1$ is the set of data with positive labels and $S_2$ is the set of data with negative labels. This is a CPR problem of (1) with $f(x) = x$.

Partial AUROC Maximization

In medical diagnosis, high false positive rates (FPR) and low true positive rates (TPR) may incur a large cost. To alleviate this, we also consider optimizing the partial AUC (pAUC). This task maximizes the area under the ROC curve with the restriction that the false positive rate be less than a certain level. In Zhu et al. (2022), it has been shown that the partial AUROC maximization problem can be solved via

$$\min_{\mathbf{w}}\ \frac{1}{|S_1|}\sum_{z_i\in S_1}\lambda\log\bigg(\frac{1}{|S_2|}\sum_{z_j\in S_2}\exp\Big(\frac{l(\mathbf{w}, z_i, z_j)}{\lambda}\Big)\bigg),$$

where $S_1$ is the set of positive data, $S_2$ is the set of negative data, $l(\cdot)$ is a surrogate loss, and $\lambda$ is associated with the tolerance level of the false positive rate. This is a CPR problem of (1) with $f(x) = \lambda\log(x)$ and $\ell(\mathbf{w}, z_i, z_j) = \exp\big(\frac{l(\mathbf{w}, z_i, z_j)}{\lambda}\big)$.

AUPRC Maximization

According to (Boyd et al., 2013), the area under the precision-recall curve (AUPRC) can be approximated by

$$\frac{1}{|S|}\sum_{(z_i,y_i)\in S}\mathbb{I}(y_i = 1)\,\frac{\sum_{(z_j,y_j)\in S}\mathbb{I}(y_j = 1)\,\mathbb{I}(h(\mathbf{w}, z_i)\ge h(\mathbf{w}, z_j))}{\sum_{(z_j,y_j)\in S}\mathbb{I}(h(\mathbf{w}, z_i)\ge h(\mathbf{w}, z_j))}.$$
Then, using a surrogate loss l(•), the AUPRC maximization problem becomes

min_w - (1/|S|) Σ_{(z_i,y_i) ∈ S} I(y_i = 1) [ Σ_{(z_j,y_j) ∈ S} I(y_j = 1) l(w, z_i, z_j) / Σ_{(z_j,y_j) ∈ S} l(w, z_i, z_j) ],

which is a CPR problem of (1) with ℓ(w, z_i, z_j) = [I(y_j = 1) l(w, z_i, z_j), l(w, z_i, z_j)] and f(x_1, x_2) = x_1/x_2 (Qi et al., 2021).
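To make the pairwise objectives above concrete, the sketch below evaluates the pairwise AUROC surrogate (the CPR objective with f(x) = x) and the compositional pAUC objective (f(x) = λ log x) on arrays of raw prediction scores. The squared surrogate l(w; z_i, z_j) = (1 - h(w, z_i) + h(w, z_j))^2 and all function names here are illustrative choices, not the paper's implementation.

```python
import numpy as np

def auroc_pairwise_loss(pos_scores, neg_scores):
    """CPR objective with f(x) = x: the squared surrogate
    (1 - h(z_i) + h(z_j))^2 averaged over all positive-negative pairs."""
    diff = 1.0 - pos_scores[:, None] + neg_scores[None, :]
    return float(np.mean(diff ** 2))

def pauc_compositional_loss(pos_scores, neg_scores, lam=1.0):
    """CPR objective with f(x) = lam * log(x) and inner loss
    exp(l(w, z_i, z_j) / lam), i.e. a per-positive soft-max over negatives."""
    inner = (1.0 - pos_scores[:, None] + neg_scores[None, :]) ** 2
    g = np.mean(np.exp(inner / lam), axis=1)  # g(w; z_i, S_2) per positive
    return float(np.mean(lam * np.log(g)))
```

As λ grows, the per-positive log-sum-exp tends to a plain average, so the pAUC objective interpolates toward the full AUROC surrogate.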

B COMPLEXITY FOR SOLVING CPR AND ERM PROBLEMS

In Table 3, we summarize state-of-the-art results for ERM and CPR problems, in both the centralized and federated settings, covering the cases where the data comes with or without a finite-sum structure. For the CPR problem, we consider the finite-sum form for both the inner function and the outer function. Here we focus on non-convex problems; for federated learning in convex/strongly-convex cases, please refer to (Shamir et al., 2014; Li et al., 2019; 2020; Khaled et al., 2020; Karimireddy et al., 2020b; 2021; Mishchenko et al., 2022; Khaled & Jin, 2022) and the references therein. In Table 3, the * notation indicates that an algorithm matches a known lower-bound complexity. The Spider algorithm (Fang et al., 2018) matches the lower bound in (Zhou & Gu, 2019) for the finite-sum setting, and the SGD algorithm (Ghadimi & Lan, 2013) matches the lower bound in (Arjevani et al., 2022) for the expectation setting. In the finite-sum setting, the federated ERM algorithms, i.e., PR-SGD, VRL-SGD and SCAFFOLD, match the lower bound in (Woodworth et al., 2020b;a; Glasgow et al., 2022), and BSpiderBoost matches the lower bound in (Hu et al., 2020). For CPR problems with a finite-sum structure on the outer function, tight lower bounds remain unclear. After submitting to ICLR 2023, we noticed that a later work (Jiang et al., 2022) proposed the MSVR algorithm (included in Table 3), which further improves the sample complexities by utilizing the variance reduction techniques SVRG and STORM. However, naively implementing MSVR in the federated setting would incur a much higher communication cost than our algorithm.
Indeed, even for those ERM algorithms that use similar variance reduction techniques, whether a communication-efficient federated counterpart is feasible remains an open problem.

C ANALYSIS OF FEDX1 FOR OPTIMIZING CPR WITH LINEAR f

In this section, we present the analysis of the FedX1 algorithm. For z ∈ S_1^i and z′ ∈ S_2^j, we define

G_1(w, z, w′, z′) = ∇_1 ℓ_{ij}(h(w; z), h(w′; z′))^⊤ ∇h(w; z),
G_2(w, z, w′, z′) = ∇_2 ℓ_{ij}(h(w; z), h(w′; z′))^⊤ ∇h(w′; z′).

Therefore, G^r_{i,k,1} = ∇_1 ℓ_{ij}(h(w^r_{i,k}; z^r_{i,k,1}), h^{r-1}_{2,ξ}) ∇h(w^r_{i,k}; z^r_{i,k,1}), defined in (3), is equivalent to G_1(w^r_{i,k}, z^r_{i,k,1}, w^{r-1}_{j,t}, z^{r-1}_{j,t,2}), where h^{r-1}_{2,ξ} = h(w^{r-1}_{j,t}; z^{r-1}_{j,t,2}) is the score of a randomly sampled data point computed in round r-1 at machine j and iteration t. Technically, the indices j and t are associated with i and k, but we omit this dependence when the context is clear to simplify notation. Similarly, G^r_{i,k,2} = ∇_2 ℓ_{j′i}(h^{r-1}_{1,ζ}, h(w^r_{i,k}; z^r_{i,k,2})) ∇h(w^r_{i,k}; z^r_{i,k,2}), defined in (5), is equivalent to G_2(w^{r-1}_{j′,t′}, z^{r-1}_{j′,t′,1}, w^r_{i,k}, z^r_{i,k,2}). Denote

∇F_i(w) := E_{z∈S_1^i} (1/N) Σ_{j=1}^N E_{z′∈S_2^j} ∇_1 ℓ_{ij}(h(w; z), h(w; z′)) ∇h(w; z) [=: ∆_{i1}] + E_{z′∈S_2^i} (1/N) Σ_{j=1}^N E_{z∈S_1^j} ∇_2 ℓ_{ji}(h(w; z), h(w; z′)) ∇h(w; z′) [=: ∆_{i2}].

We make the following assumptions regarding the CPR problem with linear f, i.e., problem (2).

Assumption 1.
• ℓ_{ij}(•) is differentiable, L_ℓ-smooth and C_ℓ-Lipschitz.
• h(•, z) is differentiable, L_h-smooth and C_h-Lipschitz in w for any z ∈ S_1 ∪ S_2.
• E_{z∈S_1^i, z′∈S_2} ∥∇_1 ℓ_{ij}(h(w; z), h(w; z′)) ∇h(w; z) + ∇_2 ℓ_{ji}(h(w; z), h(w; z′)) ∇h(w; z′) - ∇F_i(w)∥^2 ≤ σ^2.
• ∥∇F_i(w) - ∇F(w)∥^2 ≤ D^2.

Under Assumption 1, it follows that F(•) is L_F-smooth with L_F := 2(L_ℓ C_h + C_ℓ L_h). Similarly, G_1 and G_2 are also Lipschitz in w with some constant L_1 that depends on C_h, C_ℓ, L_ℓ, L_h. Let L := max{L_F, L_1}. Basically, we consider well-conditioned problems where ω_{i,1} = N|S_1^i|/|S_1| and ω_{i,2} = N|S_2^i|/|S_2| are of O(1), so the above constants are appropriate.
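To illustrate the active/lazy decoupling that G_1 and G_2 formalize, the following minimal sketch computes one FedX1-style gradient estimate for a hypothetical linear scorer h(w, z) = w·z with the squared surrogate l(a, b) = (1 - a + b)^2; the function name and the scalar historical scores are assumptions for illustration, not the paper's code.

```python
import numpy as np

def local_gradient(w, z_pos, z_neg, hist_neg_score, hist_pos_score):
    """One gradient estimate on machine i: the active parts use the current
    local model on local samples, while the lazy parts plug in historical
    prediction scores h^{r-1} communicated from other machines."""
    a = float(w @ z_pos)  # active score on a local positive sample
    b = float(w @ z_neg)  # active score on a local negative sample
    # G_1: d/da of (1 - a + b)^2 at (a, hist_neg_score), chained with grad_w h = z_pos
    g1 = -2.0 * (1.0 - a + hist_neg_score) * z_pos
    # G_2: d/db of (1 - a + b)^2 at (hist_pos_score, b), chained with grad_w h = z_neg
    g2 = 2.0 * (1.0 - hist_pos_score + b) * z_neg
    return g1 + g2
```

Note that only scalar prediction scores of other machines' samples enter the lazy parts, which is what keeps the communicated payload small.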
Nevertheless, we can also directly consider the FL objective where ω 1,i = 1 and ω 2,i = 1 similar to existing FL studies for the ERM problem. We re-present Theorem 1 as below. Theorem 3. Under Assumption 1, by setting η = O( N R 2/3 ) and K = O( R 1/3 N ), Algorithm 2 ensures that E[ 1 R R r=1 ∥∇F ( wr )∥ 2 ] ≤ O( 1 R 2/3 ). ( ) Proof. Denote η = ηK. Using the L-smoothness of F (w), we have F ( wr+1 ) -F ( wr ) ≤ ∇F ( wr ) ⊤ ( wr+1 -wr ) + L 2 ∥ wr+1 -wr ∥ 2 = -η∇F ( wr ) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) + L 2 ∥ wr+1 -wr ∥ 2 = -η(∇F ( wr ) -∇F ( wr-1 ) + ∇F ( wr-1 )) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) + L 2 ∥ wr+1 -wr ∥ 2 ≤ 1 2 L ∥∇F ( wr ) -∇F ( wr-1 )∥ 2 + 2η 2 L∥ 1 N K i k (G r i,k,1 + G r i,k,2 )∥ 2 -η∇F ( wr-1 ) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) + L 2 ∥ wr+1 -wr ∥ 2 ≤ 2η 2 L∥ 1 N K i k (G r i,k,1 + G r i,k,2 )∥ 2 + L∥ wr+1 -wr ∥ 2 -η∇F ( wr-1 ) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) , where -E η∇F ( wr-1 ) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) = -E η∇F ( wr-1 ) ⊤ 1 N K i k (G1(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) + G2(w r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , w r i,k , z r i,k,2 ) -G1( wr-1 , z r i,k,1 , w r-1 , z r-1 j,t,2 ) -G2( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k,2 ) + G1( wr-1 , z r i,k,1 , wr-1 , z r-1 j,t,2 ) + G2( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k,2 ))) ≤ η 4 E∥∇F ( wr-1 )∥ 2 + 8η L2 E∥ wr -wr-1 ∥ 2 + 8η L2 1 N K i k E∥ wr -w r i,k ∥ 2 -E[η∇F ( wr-1 ) ⊤ 1 N K i k (G1( wr-1 , z r i,k,1 , wr-1 , z r-1 j,t,2 ) + G2( wr-1 , z r-1 j ′ ,t ′ ,1 , w r-1 , z r i,k,2 ) -∇Fi(w r-1 )) + ∇F ( wr-1 ) ] = η 4 E∥∇F ( wr-1 )∥ 2 + 8η L2 E∥ wr -wr-1 ∥ 2 + 8η L2 1 N K i k E∥ wr -w r i,k ∥ 2 -ηE∥∇F ( wr-1 )∥ 2 ≤ -E η 2 ∥∇F ( wr-1 )∥ 2 + 8η L2 E∥ wr -wr-1 ∥ 2 + 8η L2 1 N K i k E∥ wr -w r i,k ∥ 2 , ( ) where the second equality holds because that data samples z r i,k,1 , z r-1 j,t , z r-1 j ′ ,t ′ ,1 , z r i,k,2 are independent samples after wr-1 , therefore E[(G 1 ( wr-1 , z r i,k,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z 
r i,k,2 ) -∇F i ( wr-1 )] = 0. (21) To bound the updates of wr after one round, we have E∥ wr+1 -wr ∥ 2 = η2 E∥ 1 N K i k (G r i,k,1 + G r i,k,2 )∥ 2 = η2 E∥ 1 N K i k (G1(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) + G2(w r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , w r i,k , z r i,k,2 ))∥ 2 ≤ 5η 2 E 1 N K i k [G1(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) + G2(w r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , w r i,k , z r i,k,2 )] - 1 N K i k [G1( wr , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) + G2(w r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , wr , z r i,k,2 )] 2 + 5η 2 E 1 N K i k [G1( wr , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) + G2(w r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , wr , z r i,k,2 )] - 1 N K i k [G1( wr , z r i,k,1 , wr-1 , z r-1 j,t,2 ) + G2( wr-1 , z r-1 j ′ ,t ′ ,1 , wr , z r i,k,2 )] + 5η 2 E 1 N K i k [G1( wr , z r i,k,1 , wr-1 , z r-1 j,t,2 ) + G2( wr-1 , z r-1 j ′ ,t ′ ,1 , wr , z r i,k,2 )] - 1 N K i k [G1( wr-1 , z r i,k,1 , wr-1 , z r-1 j,t,2 ) + G2( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k,2 )] + 5η 2 E 1 N K i k [G1( wr-1 , z r i,k,1 , wr-1 , z r-1 j,t,2 ) + G2( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k,2 ) -∇Fi( wr-1 )] 2 + 5η 2 E ∇F ( wr-1 ) 2 ≤ 10η 2 L2 N K i k E∥w r i,k -wr ∥ 2 + 10η 2 L2 N K i k E∥w r-1 i,k -wr-1 ∥ 2 + 10η 2 L2 E∥ wr -wr-1 ∥ 2 + 10η 2 σ 2 N K + 10η 2 E∥F ( wr-1 )∥ 2 . (22) Thus, 1 R r E∥ wr+1 -wr ∥ 2 ≤ 1 R r 40η 2 L2 1 N K i k E∥w r i,k -wr ∥ 2 + 20η 2 σ 2 N K + 20η 2 E∥F ( wr-1 )∥ 2 . 
( ) Then we bound the updates in one round and one machine as ∥ wr -w r i,k ∥ 2 = ∥w r i,k-1 -η(G 1 (w r i,k-1 , z r i,k-1,1 , w r-1 j,t , z r-1 j,t,2 ) + G 2 (w r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , w r i,k-1 , z r i,k-1,2 )) -wr ∥ 2 ≤ ∥w r i,k-1 -wr -η(G 1 ( wr-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k-1,2 )) + η([G 1 ( wr-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k-1,2 )] -[G 1 ( wr , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr , z r i,k-1,2 )]) + η([G 1 ( wr , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr , z r i,k-1,2 )] -[G 1 (w r i,k-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , w r i,k-1 , z r i,k-1,2 )]) + η([G 1 (w r i,k-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , w r i,k-1 , z r i,k-1,2 )] -[G 1 (w r i,k-1 , z r i,k-1,1 , w r-1 j,t , z r-1 j,t,2 ) + G 2 (w r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , w r i,k-1 , z r i,k-1,2 )])∥ 2 Using Young's inequality, we continue this inequality as E∥ wrw r i,k ∥ 2 ≤ (1 + 1 4K )E∥w r i,k-1 -wr -η(G 1 ( wr-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k-1,2 ))∥ 2 + (4K + 1)η 2 E∥([G 1 ( wr-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k-1,2 )] -[G 1 ( wr , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr , z r i,k-1,2 )]) + ([G 1 ( wr , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr , z r i,k-1,2 )] -[G 1 (w r i,k-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , w r i,k-1 , z r i,k-1,2 )]) + ([G 1 (w r i,k-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , w r i,k-1 , z r i,k-1,2 )] -[G 1 (w r i,k-1 , z r i,k-1,1 , w r-1 j,t , z r-1 j,t,2 ) + G 2 (w r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , w r i,k-1 , z r i,k-1,2 )])∥ 2 ≤ (1 + 1 4K )E∥w r i,k-1 -wr -η(G 1 ( 
wr-1 , z r i,k-1,1 , wr-1 , z r-1 j,t,2 ) + G 2 ( wr-1 , z r-1 j ′ ,t ′ ,1 , wr-1 , z r i,k-1,2 ))∥ 2 + 18Kη 2 L2 E(∥ wr-1 -wr ∥ 2 + ∥ wr -w r i,k-1 ∥ 2 + ∥ wr-1 -wr-1 j,t ∥ 2 ) ≤ (1 + 1 K )E∥w r i,k-1 -wr -η∇F i ( wr-1 )∥ 2 + 5η 2 Kσ 2 + 18Kη 2 L2 E(∥ wr-1 -wr ∥ 2 + ∥ wr -w r i,k-1 ∥ 2 + ∥ wr-1 -wr-1 j,t ∥ 2 ) ≤ (1 + 2 K )E∥w r i,k-1 -wr ∥ 2 + 4Kη 2 E∥∇F i ( wr-1 )∥ 2 + 5Kη 2 σ 2 + 18Kη 2 L2 E(∥ wr-1 -wr ∥ 2 + ∥ wr -w r i,k-1 ∥ 2 + ∥ wr-1 -wr-1 j,t ∥ 2 ) ≤ (1 + 2 K )E∥w r i,k-1 -wr ∥ 2 + 8Kη 2 E∥∇F ( wr-1 )∥ 2 + 8Kη 2 (D 2 + σ 2 ) + 18Kη 2 L2 E(∥ wr-1 -wr ∥ 2 + ∥ wr -w r i,k-1 ∥ 2 + ∥ wr-1 -wr-1 j,t ∥ 2 ) = (1 + 2 K + 18Kη 2 L2 )E∥w r i,k-1 -wr ∥ 2 + 8Kη 2 E∥∇F ( wr-1 )∥ 2 + 8Kη 2 (D 2 + σ 2 ) + 18Kη 2 L2 E(∥ wr-1 -wr ∥ 2 + ∥ wr-1 -wr-1 j,t ∥ 2 ). ( ) Thus, E∥ wr -w r i,k ∥ 2 ≤ 8Kη 2 E∥∇F ( wr-1 )∥ 2 + 8Kη 2 (D 2 + σ 2 ) + 18Kη 2 L2 E[∥ wr-1 -wr ∥ 2 + ∥ wr-1 -wr-1 j,t ∥ 2 ] k-1 m=0 (1 + 2 K + 18Kη 2 ) m ≤ (8Kη 2 E∥∇F ( wr-1 )∥ 2 + 8Kη 2 (D 2 + σ 2 ) + 18Kη 2 L2 E(∥ wr-1 -wr ∥ 2 + ∥ wr-1 -wr-1 j,t ∥ 2 ))5K ≤ 40K 2 η 2 E∥∇F ( wr-1 )∥ 2 + 40K 2 η 2 (D 2 + σ 2 ) + 100K 2 η 2 L2 E(∥ wr-1 -wr ∥ 2 + ∥ wr-1 -wr-1 j,t ∥ 2 ), where the second inequality is due to 18Kη 2 ≤ 1 K . Then, 1 RN K R r=1 N i=1 K k=1 E∥ wr -w r i,k ∥ 2 ≤ 80K 2 η 2 E∥∇F ( wr-1 )∥ 2 + 200K 2 η 2 (D 2 + σ 2 ), (27) and 1 R r E∥ wr+1 -wr ∥ 2 ≤ 80K 2 η 2 E∥∇F ( wr-1 )∥ 2 + 80η 2 K 2 η 2 (D 2 + σ 2 ) + 20η 2 σ 2 N K . ( ) Recalling ( 67) and ( 20), we obtain 1 R R r=1 E∥F ( wr )∥ 2 ≤ O 2(F ( w0 ) -F * ) ηR + η2 (D 2 + σ 2 ) + 40η σ 2 N K . ( ) If we set η = O(N ϵ 2 ), K = O(1/N ϵ), thus η = O(ϵ), to ensure 1 R R r=1 E∥F ( wr )∥ 2 ≤ ϵ 2 , it takes communication rounds of R = O( 1 ϵ 3 ), and sample complexity on each machine O( 1 N ϵ 4 ).

D FEDX2 FOR OPTIMIZING CPR WITH NON-LINEAR f

In this section, we define the following notations:

G_{i,1}(w_1, z_1, u, w_2, z_2) = ∇f_i(u) ∇_1 ℓ(h(w_1, z_1), h(w_2, z_2)) ∇h(w_1, z_1),
G_{i,2}(w_1, z_1, u, w_2, z_2) = -∇f_i(u) ∇_2 ℓ(h(w_1, z_1), h(w_2, z_2)) ∇h(w_2, z_2).

Denote

∇F_i(w) := E_{z∈S_1^i} (1/N) Σ_{j=1}^N E_{z′∈S_2^j} ∇f_i(g(w; z, S_2)) ∇_1 ℓ_j(h(w; z), h(w; z′)) ∇h(w; z) [=: ∆_{i1}] (31)
+ E_{z′∈S_2^i} (1/N) Σ_{j=1}^N E_{z∈S_1^j} ∇f_j(g(w; z, S_2)) ∇_2 ℓ_i(h(w; z), h(w; z′)) ∇h(w; z′) [=: ∆_{i2}].

We make the following assumptions regarding CPR with non-linear f, i.e., problem (6).

Assumption 2.
• ℓ_j(•) is differentiable, L_ℓ-smooth and C_ℓ-Lipschitz, with |ℓ(•)| ≤ C_0.
• f_i(•) is differentiable, L_f-smooth and C_f-Lipschitz.
• h(•, z) is differentiable, L_h-smooth and C_h-Lipschitz in w for any z ∈ S_1 ∪ S_2.
• E_{z∈S_1^i, z′∈S_2} ∥∇f_i(g(w; z, S_2)) ∇_1 ℓ_j(h(w; z), h(w; z′)) ∇h(w; z) + ∇f_i(g(w; z, S_2)) ∇_2 ℓ_j(h(w; z), h(w; z′)) ∇h(w; z′) - ∇F_i(w)∥^2 ≤ σ^2.
• ∥∇F_i(w) - ∇F(w)∥^2 ≤ D^2.

Based on this assumption, it follows that G_{i,1} and G_{i,2} are Lipschitz with some constant C_1 and are bounded by C_2, and that F is L_F-smooth, where C_1, C_2, L_F are proper constants depending on Assumption 2. We denote L = max{C_1, C_2, L_F} to simplify notation.

D.1 ANALYSIS OF THE MOVING AVERAGE ESTIMATOR u

For z 1 ∈ S i 1 , z 2 ∈ S j 2 , define g(w 1 , z 1 , w 2 , z 2 ) = ℓ j (h(w 1 ; z 1 ), h(w 2 , z 2 )) and for z 1 ∈ S i 1 , we define g(w 1 , z 1 , w 2 , S 2 ) = 1 N N j=1 E z ′ ∈S j 2 ℓ j (h(w 1 ; z 1 ), h(w 2 , z ′ )) Lemma 1. Under Assumption 2, the moving average estimator u satisfies 1 N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 ≤ (1 - γ 4|S i 1 | ) 1 N N i=1 1 |S i 1 | z∈|S i 1 | [E∥u r i,k-1 (z) -g( wr k-1 , z, wr k-1 , S 2 )∥ 2 + γβ 2 K 2 C 0 |S i 1 | + 2 γ 2 |S i 1 | (σ 2 + C 2 0 ) + (1 + 4|S i 1 | γ ) L2 ∥ wr k-1 -wr k ∥ 2 ] + 2γ 2 ∥ wr -wr-1 ∥ 2 + 2γ 2 ∥ wr -w r i,k ∥ 2 + 2∥ wr -wr k ∥ 2 . ( ) Proof. By update rules, we have u r i,k (z) = u r i,k-1 (z) -γ(u r i,k-1 (z) -ℓ(h(w r i,k ; z r i,k,1 ), h(w r-1 j,t ; z r-1 j,t,2 ))) z = z r i,k,1 u r i,k-1 (z) z ̸ = z r i,k,1 . Or equivalently, u r i,k (z) = u r i,k-1 (z) -γ(u r i,k-1 (z) -g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 )) z = z r i,k,1 u r i,k-1 (z) z ̸ = z r i,k,1 Define ūr k = (u r 1,k , u r 2,k , ..., u r N,k ), wr k = 1 N N i=1 w r i,k , and ϕ r k (ū r k ) = 1 2N N i=1 1 |S i | z∈S i 1 ∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 . 
( ) Then it follows that 1 2 ϕ r k (ū r k ) = 1 2N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k (z) -g( wr k , z, wr k , S2)∥ 2 = 1 N i 1 |Si| z∈|S i 1 | E 1 2 ∥u r i,k-1 (z) -g( wr k , z, wr k , S2)∥ 2 + ⟨u r i,k-1 (z) -g( wr k , z, wr k , S2), u r i,k (z) -u r i,k-1 (z)⟩ + 1 2 ∥u r i,k (z) -u r i,k-1 (z)∥ 2 = 1 N i 1 |Si| z∈S i 1 E 1 2 ∥u r i,k-1 (z) -g( wr k , z, wr k , S2)∥ 2 + 1 |S i 1 | ⟨u r i,k-1 (z r i,k,1 ) -g( wr k , z, wr k , S2), u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )⟩ + 1 2|S i 1 | ∥u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )∥ 2 = 1 N i 1 |Si| z∈S i 1 E 1 2 ∥u r i,k-1 (z) -g( wr k , z, wr k , S2)∥ 2 + 1 |S i 1 | E⟨u r i,k-1 (z r i,k,1 ) -g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ), u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )⟩ + 1 |S i 1 | E⟨g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) -g( wr k , z r i,k,1 , wr k , S2), u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )⟩ + 1 2|Si| E∥u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )∥ 2 , ( ) where ⟨u r i,k-1 (z r i,k,1 ) -g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ), u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )⟩ = ⟨u r i,k-1 (z r i,k,1 ) -g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ), g( wr k , z r i,k,1 , wr k , S 2 ) -u r i,k-1 (z r i,k,1 )⟩ + ⟨u r i,k-1 (z r i,k,1 ) -g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ), u r i,k (z r i,k,1 ) -g( wr k , z r i,k,1 , wr k , S 2 )⟩ = ⟨u r i,k-1 (z r i,k,1 ) -g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ), g( wr k , z r i,k,1 , wr k , S 2 ) -u r i,k-1 (z r i,k,1 )⟩ + 1 γ ⟨u r i,k-1 (z r i,k,1 ) -u r i,k (z r i,k,1 ), u r i,k (z r i,k,1 ) -g( wr k , z r i,k,1 , wr k , S 2 )⟩ ≤ ⟨u r i,k-1 (z r i,k,1 ) -g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ), g( wr k , z r i,k,1 , wr i,k , S 2 ) -u r i,k-1 (z r i,k,1 )⟩ + 1 2γ (∥u r i,k-1 (z r i,k,1 ) -g( wr k , z r i,k,1 , wr k , S 2 )∥ 2 -∥u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )∥ 2 -∥u r i,k (z r i,k,1 ) -g( wr k , z r i,k,1 , wr k , S 2 )∥ 2 ) (39) If γ ≤ 1 9 , we have - 1 2 1 γ -1 - γ + 1 4γ ∥u r 
i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )∥ 2 + ⟨g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) -g( wr k , z r i,k,1 , wr k , S 2 ), u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )⟩ ≤ - 1 4γ ∥u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )∥ 2 + γ∥g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) -g( wr k , z r i,k,1 , wr k , S 2 )∥ 2 + 1 4γ ∥u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )∥ 2 ≤ γ∥g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ) -g( wr k , z r i,k,1 , wr k , S 2 )∥ 2 ≤ 4γ∥g( wr-1 , z r i,k,1 , wr-1 , z r-1 j,t,2 ) -g( wr-1 , z r i,k,1 , wr-1 , S 2 )∥ 2 + 4γ L∥ wr -wr-1 ∥ 2 + 4γ L∥w r i,k -wr ∥ 2 + 4γ L∥w r-1 i,k -wr-1 ∥ 2 ≤ 4γσ 2 + 4γ L∥ wr -wr-1 ∥ 2 + 4γ L∥w r i,k -wr ∥ 2 + 4γ L∥w r-1 i,k -wr-1 ∥ 2 Then, we have 1 2N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 ≤ 1 2N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k-1 (z) -g( wr k , z, wr k , S 2 )∥ 2 + 1 N i 1 |S i 1 | 1 2γ ∥u r i,k-1 (z r i,k,1 ) -g(w r k , z r i,k,1 , w r k , S 2 )∥ 2 - 1 2γ ∥u r i,k (z r i,k,1 ) -g(w r k , z r i,k,1 , w r k , S 2 )∥ 2 - γ + 1 8γ ∥u r i,k (z r i,k,1 ) -u r i,k-1 (z r i,k,1 )∥ 2 + γ∥g( wr-1 , z r i,k,1 , wr-1 , z r-1 j,t,2 ) -g( wr-1 , z r i,k,1 , wr-1 , S 2 )∥ 2 + 4γ L∥ wr -wr-1 ∥ 2 + 4γ L∥w r i,k -wr ∥ 2 + 4γ L2 ∥w r-1 i,k -wr-1 ∥ 2 + ⟨u r i,k-1 (z r i,k,1 ) -g(w r i,k , z r i,k,1 , w r-1 j,t , z r-1 j,t,2 ), g( wr k , z r i,k,1 , wr k , S 2 ) -u r i,k-1 (z r i,k,1 )⟩ . Note that z̸ =z r i,k,1 ∥u r i,k (z) -g( wr k+1 , z, wr k+1 , S 2 )∥ 2 = z̸ =z r i,k,1 ∥u r i,k+1 (z) - g(w r k+1 , z, wr k+1 , S 2 )∥ 2 , which implies 1 2γ ∥u r i,k-1 (z r i,k,1 ) -g( wr k , z, wr k , S 2 )∥ 2 -∥u r i,k (z r i,k,1 ) -g( wr k , z, wr k , S 2 )∥ 2 = 1 2γ z∈S i 1 ∥u r i,k-1 (z) -g( wr k , z, wr k , S 2 )∥ 2 -∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 . 
( ) 1 4 E∥g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r-1 i,0 (z r i,k,1 )∥ 2 -E∥g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r-1 i,0 (z r i,k,1 )∥ 2 + 1 4 E∥g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r-1 i,0 (z r i,k,1 )∥ 2 + 4E∥u r i,k-1 (z r i,k,1 ) -u r-1 i,0 (z r i,k,1 )∥ 2 . (44) Noting -E∥g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r-1 i,0 (z r i,k,1 )∥ 2 = -E∥g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r i,k (z r i,k,1 ) + u r i,k (z r i,k,1 ) -u r-1 i,0 (z r i,k,1 )∥ 2 = -E∥g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r i,k (z r i,k,1 )∥ 2 -E∥u r i,k (z r i,k,1 ) -u r-1 i,0 (z r i,k,1 )∥ 2 + 2E⟨g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r i,k (z r i,k,1 ), u r i,k (z r i,k,1 ) -u r-1 i,0 (z r i,k,1 )⟩ ≤ - 1 2 E∥g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r i,k (z r i,k,1 )∥ 2 + 8∥u r i,k (z r i,k,1 ) -u r-1 i,0 (z r i,k,1 )∥ 2 ≤ - 1 2 E∥g( wr-1 , z r i,k,1 , wr-1 , S 2 ) -u r i,k (z r i,k,1 )∥ 2 + 8β 2 K 2 C 2 0 . Then, we can obtain γ + 1 2 1 N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 ≤ γ(1 -1 |S i 1 | ) + 1 2 1 N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k-1 (z) -g( wr k , z, wr k , S 2 )∥ 2 + γ 2 |S i 1 | (σ 2 + C 2 0 ) + γβ 2 K 2 C 2 0 |S i 1 | + γ 2 ∥ wr -wr-1 ∥ 2 + γ 2 1 N i ∥ wr -w r i,k ∥ 2 + ∥ wr -wr k ∥ 2 . 
( ) Dividing γ+1 2 on both sides gives 1 N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 ≤ γ(1 -1 |S i 1 | ) + 1 γ + 1 1 N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k-1 (z) -g( wr k , z, wr k , S 2 )∥ 2 + γβ 2 K 2 C 2 0 |S i 1 | + 2 γ 2 |S i 1 | (σ 2 + C 2 0 ) + 2γ 2 ∥ wr -wr-1 ∥ 2 + 2γ 2 ∥ wr -w r i,k ∥ 2 + 2∥ wr -wr k ∥ 2 (47) Using Young's inequality, 1 N N i=1 1 |S i 1 | z∈|S i 1 | E∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 ≤ (1 - γ 2|S i 1 | ) 1 N N i=1 1 |S i 1 | z∈|S i 1 | [(1 + γ 4|S i 1 | )E∥u r i,k-1 (z) -g( wr k-1 , z, wr k-1 , S 2 )∥ 2 + (1 + 4|S i 1 | γ ) L2 ∥ wr k-1 -wr k ∥ 2 ] + 2γ 2 ∥ wr -wr-1 ∥ 2 + 2γ 2 ∥ wr -w r i,k ∥ 2 + 2∥ wr -wr k ∥ 2 ≤ (1 - γ 4|S i 1 | ) 1 N N i=1 1 |S i 1 | z∈|S i 1 | [E∥u r i,k-1 (z) -g( wr k-1 , z, wr k-1 , S 2 )∥ 2 + γβ 2 K 2 C 0 |S i 1 | + 2 γ 2 |S i 1 | (σ 2 + C 2 0 ) + (1 + 4|S i 1 | γ ) L2 ∥ wr k-1 -wr k ∥ 2 ] + 2γ 2 ∥ wr -wr-1 ∥ 2 + 2γ 2 ∥ wr -w r i,k ∥ 2 + 2∥ wr -wr k ∥ 2 . ( ) D.2 ANALYSIS OF THE ESTIMATOR OF GRADIENT With update G r i,k = (1 -β)G r i,k-1 + β(G r i,k,1 + G r i,k,2 ). we define Ḡr k := 1 N N i=1 G r i,k , and ∆ r k := ∥ Ḡr k -∇F ( wr k )∥ 2 . Then it follows that Ḡr k = (1 -β) Ḡr k-1 + β 1 N i (G r i,k,1 + G r i,k,2 ). Lemma 2. Under Assumption 2, Algorithm 3 ensures that ∆ r k ≤(1 -β)∥ Ḡr k-1 -∇F ( wr k-1 )∥ 2 + 2 β 2 σ 2 N + 5β∥ wr -w r i,k ∥ 2 + 5β∥ wr-1 -wr ∥ 2 + 5β∥u r i,k (z r i,k,1 ) -g z r i,k,1 ( wr ))∥ 2 .
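The two recursions analyzed in this subsection — the coordinate-wise moving average u^r_{i,k}(z) and the momentum gradient estimator Ḡ — can be sketched as follows; the dictionary-based storage of u and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def update_u(u, z_id, loss_value, gamma):
    """Moving-average estimate of g(w; z, S_2): only the sampled coordinate
    z_id is refreshed; all other coordinates stay lazy (unchanged)."""
    u = dict(u)  # copy so the previous state stays intact
    u[z_id] = (1.0 - gamma) * u[z_id] + gamma * loss_value
    return u

def update_G(G_prev, grad_estimate, beta):
    """Momentum update G^r_{i,k} = (1 - beta) G^r_{i,k-1}
    + beta (G^r_{i,k,1} + G^r_{i,k,2})."""
    return (1.0 - beta) * G_prev + beta * grad_estimate
```

In both recursions, a smaller γ or β averages over a longer history, which is exactly what the variance terms in Lemmas 1 and 2 trade off against the drift terms.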

D.3 THEOREM 2

We re-present Theorem 2 as below. Theorem 4. Suppose Assumption 2 holds, denoting M = max i |S 1 i | as the largest number of data on a single machine, by setting γ = O( M 1/3 R 2/3 ), β = O( 1 M 1/6 R 2/3 ), η = O( 1 M 2/3 R 2/3 ) and K = O(M 1/3 R 1/3 ), Algorithm 2 ensures that E 1 R R r=1 ∥∇F ( wr )∥ 2 ≤ O( 1 R 2/3 ). Proof. By updating rules, ∥ wr -w r i,k ∥ 2 ≤ η 2 K 2 C 2 ℓ C 2 g , ∥ wr k -wr ∥ 2 = η2 ∥ 1 N K i k m=1 Ḡr k ∥ 2 ≤ η2 1 K k ∥ Ḡr k -∇F ( wr k ) + ∇F ( wr k )∥ 2 . ( ) By updating rule, we also have ∥ wr-1 -wr ∥ 2 = η2 ∥ 1 N K i k Ḡr-1 k ∥ 2 ≤ η2 1 K k ∥ Ḡr-1 k -∇F ( wr-1 k ) + ∇F ( wr-1 k )∥ 2 Lemma 2 gives that 1 RK r,k E∥ Ḡr k -∇F ( wr k )∥ 2 ≤ ∆ 0 0 βRK + 2βσ 2 N + 5β 1 RK r,k ∥ wr -w r i,k ∥ 2 + 5 1 R r ∥ wr-1 -wr ∥ 2 + 5 1 R r 1 N K i,k 1 |S i 1 | z∈S i 1 E∥u r i,k (z) -g( wr ; z, S2)∥ 2 + 5 1 R r 1 N K j ′ ,t ′ 1 |S i 1 | z∈S i 1 ∥u r-1 j ′ ,t ′ (z r-1 j ′ ,t ′ ,1 ) -g( wr-1 ; z r-1 j ′ ,t ′ ,1 , S2))∥ 2 (58) which by setting of η and β leads to 1 RK r,k E∥ Ḡr k -∇F ( wr k )∥ 2 ≤ 2∆ 0 0 βRK + 4βσ 2 N + 10β η2 C 2 ℓ C 2 g + 2η 2 1 R r ∥∇F ( wr-1 )∥ 2 + 5 1 R r 1 N K i,k 1 |S i 1 | z∈S i 1 E∥u r i,k (z) -g( wr ; z, S 2 )∥ 2 + 5 1 R r 1 N K j ′ ,t ′ 1 |S i 1 | z∈S i 1 ∥u r-1 j ′ ,t ′ (z r-1 j ′ ,t ′ ,1 ) -g( wr-1 ; z r-1 j ′ ,t ′ ,1 , S 2 ))∥ 2 , (59) Using Lemma 1 yields 1 R r 1 N K N i=1 K k=1 1 |S i 1 | z∈S i 1 E∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 ≤ 4M γ 1 R r 1 N K N i=1 1 |S i 1 | z∈S i 1 E∥u 0 i,0 (z) -g( w0 0 , z, w0 0 , S 2 )∥ 2 + 18M 2 γ 2 1 RK r,k L2 ∥ wr k-1 -wr k ∥ 2 + 4γβ 2 K 2 C 2 0 + 8γ(σ 2 + C 2 0 ) + 8γM 1 R r ∥ wr -wr-1 ∥ 2 + 8γ|S i 1 | 1 RN K r,i,k ∥ wr -w r i,k ∥ 2 + 8 |S i 1 | γ 1 RK r,k .∥ wr -wr k ∥ 2 . 
( ) Combining this with previous five inequalities, we obtain 1 R r 1 N K N i=1 K k=1 1 M z∈S i 1 E∥u r i,k (z) -g( wr k , z, wr k , S 2 )∥ 2 ≤ O M γRK + γβ 2 K 2 + γ + η 2 M 2 γ 2 + 8γM η2 + M γ η2 ( 1 βRK + β N ) + 1 R r η2 ∥∇F ( wr-1 )∥ 2 (61) and 1 RK r,k E∥ Ḡr k -∇F ( wr k )∥ 2 ≤ O M γRK + γβ 2 K 2 + γ + η 2 M 2 γ 2 + 8γ|S i 1 |η 2 + M γ η2 ( 1 βRK + β N ) + 1 R r η2 ∥∇F ( wr-1 )∥ 2 . (62) Then using the standard analysis of smooth function, we derive F ( wr+1 ) -F ( wr ) ≤ ∇F ( wr ) ⊤ ( wr+1 -wr ) + L 2 ∥ wr+1 -wr ∥ 2 = -η∇F ( wr ) ⊤ 1 N K i k G r i,k -∇F ( wr ) + ∇F ( wr ) + L 2 ∥ wr+1 -wr ∥ 2 = -η∥∇F ( wr )∥ 2 + 2 ∥∇F ( wr )∥ 2 + η 2 ∥ 1 N K i k G r i,k -∇F ( wr )∥ 2 + L 2 ∥ wr+1 -wr ∥ 2 ≤ - η 2 ∥∇F ( wr )∥ 2 + η∥ 1 N K i k (G r i,k -∇F ( wr k ))∥ 2 + η∥ 1 K k (∇F ( wr k ) -∇F ( wr ))∥ 2 + L 2 ∥ wr+1 -wr ∥ 2 ≤ - η 2 ∥∇F ( wr )∥ 2 + η 1 K k ∥ 1 N i (G r i,k -∇F ( wr k ))∥ 2 + η L2 K k ∥ wr k -wr ∥ 2 + L 2 ∥ wr+1 -wr ∥ 2 . ( ) Combining with Lemma 1 and Lemma 2, we derive 1 R r E∥∇F ( wr )∥ 2 ≤ O M γRK + γβ 2 K 2 + γ + η 2 M 2 γ 2 + 8γM η2 + M γ η2 ( 1 βRK + β N ) . (64) By setting parameters as in the theorem, we can conclude the proof. Further, to get 1 R r E∥∇F ( wr )∥ 2 ≤ ϵ 2 , we just need to set γ = ϵ 2 , β = ϵ 2 √ M , K = √ M ϵ , η = ϵ 2 M , R = √ M ϵ 3 . E ANALYSIS OF FEDX1 FOR OPTIMIZING CPR WITH LINEAR f WITH A LARGER BUFFER In this section, we present the analysis of FedX1 with a larger buffer, i.e,. B i,1 , B i,2 keeps the history of previous τ > 1 rounds instead of only keep the history of the one previous round. For z ∈ S i and z ′ ∈ S j , we define G 1 (w, z, w ′ , z ′ ) = ∇ℓ ij (h(w; z) -h(w ′ ; z ′ )) ⊤ ∇h(w; z) G 2 (w, z, w ′ , z ′ ) = -∇ℓ ij (h(w, z) -h(w ′ ; z ′ )) ⊤ ∇h(w ′ ; z ′ ), We use superscript {r -τ, r -1} to denote that a historical statistics sampled from the buffer is computed at some round in (r -τ, r -1) randomly. 
Therefore, the G r i,k,1 = ∇ 1 ℓ ij (h(w r i,k ; z r i,k,1 ), h r-τ,r-1 2,ξ )∇h(w r i,k ; z r i,k,1 ), defined similarly as ( 3) is equivalent to G 1 (w r i,k , z r i,k,1 , w r-τ,r-1 j,t , z r j,t,2 ), and the G r i,k,2 = ∇ 2 ℓ j ′ i (h r-τ,r-1 1,ζ , h(w r i,k ; z r i,k,2 ), )∇h(w r i,k ; z r i,k,2 ), defined in ( 5) is equivalent to G 2 (w r-τ,r-1 j ′ ,t ′ , z r-τ,r-1 j ′ ,t ′ ,1 , w r i,k , z r i,k,2 ). We use same assumption and notations as in Appendix C Under Assumption 1, it follows that F (•) is L F -smooth, with L F := 2(L ℓ C h + C ℓ L h ). Similarly, G 1 , G 2 are also Lipschtz in w with some constant modulus L that depend on C h , C ℓ , L ℓ , L h . We present the analysis in the theorem below. Theorem 5. Under Assumption 1, by setting η = O( N R 2/3 ) and K = O( 1 N R 1/3 ) and τ = O(1), Algorithm 2 with a larger buffer that keeps the history of last τ rounds ensures that E[ 1 R R r=1 ∥∇F ( wr-τ )∥ 2 ] ≤ O( 1 R 2/3 ). ( ) Proof. Denote η = ηK. Using the L-smoothness of F (w), we have F ( wr+1 ) -F ( wr ) ≤ ∇F ( wr ) ⊤ ( wr+1 -wr ) + L 2 ∥ wr+1 -wr ∥ 2 = -η(∇F ( wr ) -∇F ( wr-τ ) + ∇F ( wr-τ )) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) + L 2 ∥ wr+1 -wr ∥ 2 ≤ 1 2 L ∥∇F ( wr ) -∇F ( wr-τ )∥ 2 + 2η 2 L∥ 1 N K i k (G r i,k,1 + G r i,k,2 )∥ 2 -η∇F ( wr-τ ) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) + L 2 ∥ wr+1 -wr ∥ 2 ≤ 2η 2 L∥ 1 N K i k (G r i,k,1 + G r i,k,2 )∥ 2 + L∥ wr+1 -wr ∥ 2 -η∇F ( wr-τ ) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) , -E η∇F ( wr-τ ) ⊤ 1 N K i k (G r i,k,1 + G r i,k,2 ) = -E η∇F ( wr-τ ) ⊤ 1 N K i k (G1(w r i,k , z r i,k,1 , w r-τ,r-1 j,t , z r-1 j,t,2 ) + G2(w r-τ,r-1 j ′ ,t ′ , z r-1 j ′ ,t ′ ,1 , w r i,k , z r i,k,2 ) -(G1( wr-τ , z r i,k,1 , wr-τ , z r-τ,r-1 j,t,2 ) + G2( wr-τ , z r-τ,r-1 j ′ ,t ′ ,1 , wr-τ , z r i,k,2 )) + G1( wr-τ , z r i,k,1 , wr-τ , z r-1 j,t,2 ) + G2( wr-τ , z r-1 j ′ ,t ′ ,1 , wr-τ , z r i,k,2 ))) = η 4 E∥∇F ( wr-τ )∥ 2 + 8η L2 E∥ wr -wr-τ ∥ 2 + 8η L2 1 N K i k E∥ wr -w r i,k ∥ 2 -E η∇F ( wr-τ ) ⊤ 1 N K i k (G1( wr-τ , z r 
i,k,1 , wr-τ , z r-τ,r-1 j,t,2 ) + G2( wr-τ , z r-τ,r-1 j ′ ,t ′ ,1 , wr-τ , z r i,k,2 ) -∇F ( wr-τ ) + ∇F ( wr-τ )) ≤ η 4 E∥∇F ( wr-τ )∥ 2 + 8η L2 E∥ wr -wr-τ ∥ 2 + 8η L2 1 N K i k E∥ wr-τ -w r i,k ∥ 2 -ηE∥∇F ( wr-τ )∥ 2 . ( ) By the updating rule, E∥ wr+1 -wr∥ 2 = η2 E∥ 1 N K i k (G r i,k,1 + G r i,k,2 )∥ 2 = η2 E∥ 1 N K i k (G1(w r i,k , z r i,k,1 , w r-τ,r-1 j,t , z r-τ,r-1 j,t,2 ) + G2(w r-τ,r-1 j ′ ,t ′ , z r-τ,r-1 j ′ ,t ′ ,1 , w r i,k , z r i,k,2 ))∥ 2 ≤ 5η 2 E 1 N K i k [G1(w r i,k , z r i,k,1 , w r-τ,r-1 j,t , z r-τ,r-1 j,t,2 ) + G2(w r-τ,r-1 j ′ ,t ′ , z r-τ,r-1 j ′ ,t ′ ,1 , w r i,k , z r i,k,2 )] - 1 N K i k [G1( wr-τ , z r i,k,1 , wr-τ , z r-1 j,t,2 ) + G2( wr-τ , z r-1 j ′ ,t ′ ,1 , wr-τ , z r i,k,2 )] 2 + 5η 2 E 1 N K i k [G1( wr-τ , z r i,k,1 , wr-τ , z r-1 j,t,2 ) + G2( wr-τ , z r-1 j ′ ,t ′ ,1 , wr-τ , z r i,k,2 ) -∇Fi( wr-τ )] 2 + 5η 2 E ∇F ( wr-τ ) 2 ≤ 10η 2 L2 N K i k E∥w r i,k -wr ∥ 2 + 10η 2 L2 N K i k E∥w r-1 i,k -wr-1 ∥ 2 + 10η 2 E∥ wr -wr-1 ∥ 2 + 10η 2 σ 2 N K + 10η 2 E∥F ( wr-τ )∥ 2 . (69) Thus, 1 R r E∥ wr+1 -wr ∥ 2 ≤ 1 R r 40η 2 1 N K i k E∥w r i,k -wr ∥ 2 + 20η 2 σ 2 N K + 20η 2 E∥F ( wr-1 )∥ 2 . ( ) Since ℓ ij (•) is C ℓ Lipschitz, we have E∥ wr -w r i,k ∥ 2 ≤ η 2 K 2 C 2 ℓ . (71) and E∥ wr -w r-τ ∥ 2 ≤ η 2 K 2 τ 2 C 2 ℓ . (72) Thus, 1 R R r=1 E∥F ( wr-τ )∥ 2 ≤ O 2(F ( w0 ) -F * ) ηR + η2 (D 2 + σ 2 ) + η2 τ 2 C 2 ℓ + 40η σ 2 N K . Set η and K as in theorem, we conclude the proof. Further, to ensure 1 R R r=1 E∥F ( wr-τ )∥ 2 ≤ ϵ 2 , we just need to set η = N ϵ 2 , K = 1/N ϵ, η = ηK = ϵ and τ = O(1), then number of communication rounds is R = O( 1 ϵ 3 ), sample complexity on each machine is O( 1 N ϵ 4 ).

F FEDX2 FOR OPTIMIZING CPR WITH NON-LINEAR f WITH MEMORY BANK

In this section, we present the analysis of FedX2 with a larger buffer, i.e., B_{i,1}, B_{i,2} keep the history of the previous τ > 1 rounds instead of only the previous round. We use the same notations and assumptions as in Appendix D. The framework of the proof is similar to that in Appendix D, except that we need to handle the extra error caused by the larger buffer. Theorem 6. Suppose Assumption 2 holds. Denoting M = max_i |S_1^i| as the largest number of data points on a single machine, by setting γ = O(M^{1/3}/R^{2/3}), β = O(1/(M^{1/6} R^{2/3})), η = O(1/(M^{2/3} R^{2/3})), τ = O(M^{1/4}) and K = O(M^{1/3} R^{1/3}), Algorithm 2 ensures that E[(1/R) Σ_{r=1}^R ∥∇F(w̄^r)∥^2] ≤ O(1/R^{2/3}). Proof. First, we need to handle the u estimator. Denote g(w_1, p, w_2, q) = ℓ(h(w_1; p), h(w_2; q)). The update rule is

u^r_{i,k}(z) = u^r_{i,k-1}(z) - γ(u^r_{i,k-1}(z) - g(w^r_{i,k}, z^r_{i,k,1}, w^{r-τ,r-1}_{j,t}, z^{r-τ,r-1}_{j,t,2})) if z = z^r_{i,k,1}, and u^r_{i,k}(z) = u^r_{i,k-1}(z) otherwise.

Define ū^r_k = (u^r_{1,k}, u^r_{2,k}, ..., u^r_{N,k}) and w̄^r_k = (1/N) Σ_{i=1}^N w^r_{i,k}.
We have
\begin{align*}
&\frac{1}{2N}\sum_{i=1}^N\frac{1}{|S_1^i|}\sum_{z\in S_1^i}\mathbb E\big\|u^r_{i,k}(z)-g(\bar w^r_k,z,\bar w^r_k,S_2)\big\|^2\\
&=\frac1N\sum_i\frac{1}{|S_1^i|}\sum_{z\in S_1^i}\mathbb E\Big[\tfrac12\|u^r_{i,k-1}(z)-g(\bar w^r_k,z,\bar w^r_k,S_2)\|^2\\
&\qquad+\big\langle u^r_{i,k-1}(z)-g(\bar w^r_k,z,\bar w^r_k,S_2),\,u^r_{i,k}(z)-u^r_{i,k-1}(z)\big\rangle+\tfrac12\|u^r_{i,k}(z)-u^r_{i,k-1}(z)\|^2\Big].
\end{align*}
Since at step $k$ only the sampled point $z^r_{i,k,1}$ is updated, this equals
\begin{align*}
&\frac1N\sum_i\frac{1}{|S_1^i|}\sum_{z\in S_1^i}\mathbb E\Big[\tfrac12\|u^r_{i,k-1}(z)-g(\bar w^r_k,z,\bar w^r_k,S_2)\|^2\Big]\\
&\quad+\frac1N\sum_i\frac{1}{|S_1^i|}\mathbb E\Big[\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(w^r_{i,k},z^r_{i,k,1},w^{r-\tau:r-1}_{j,t},z^{r-\tau:r-1}_{j,t,2}),\,u^r_{i,k}(z^r_{i,k,1})-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&\qquad+\big\langle g(w^r_{i,k},z^r_{i,k,1},w^{r-\tau:r-1}_{j,t},z^{r-\tau:r-1}_{j,t,2})-g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2),\,u^r_{i,k}(z^r_{i,k,1})-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&\qquad+\tfrac12\|u^r_{i,k}(z^r_{i,k,1})-u^r_{i,k-1}(z^r_{i,k,1})\|^2\Big],
\end{align*}
where, using $u^r_{i,k}(z^r_{i,k,1})-u^r_{i,k-1}(z^r_{i,k,1})=\gamma\big(g(w^r_{i,k},z^r_{i,k,1},w^{r-\tau:r-1}_{j,t},z^{r-\tau:r-1}_{j,t,2})-u^r_{i,k-1}(z^r_{i,k,1})\big)$,
\begin{align*}
&\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(w^r_{i,k},z^r_{i,k,1},w^{r-\tau:r-1}_{j,t},z^{r-\tau:r-1}_{j,t,2}),\,u^r_{i,k}(z^r_{i,k,1})-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&=\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(w^r_{i,k},z^r_{i,k,1},w^{r-\tau:r-1}_{j,t},z^{r-\tau:r-1}_{j,t,2}),\,g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&\quad+\frac1\gamma\big\langle u^r_{i,k-1}(z^r_{i,k,1})-u^r_{i,k}(z^r_{i,k,1}),\,u^r_{i,k}(z^r_{i,k,1})-g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)\big\rangle\\
&\le\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(w^r_{i,k},z^r_{i,k,1},w^{r-\tau:r-1}_{j,t},z^{r-\tau:r-1}_{j,t,2}),\,g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&\quad+\frac{1}{2\gamma}\Big(\|u^r_{i,k-1}(z^r_{i,k,1})-g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)\|^2-\|u^r_{i,k}(z^r_{i,k,1})-u^r_{i,k-1}(z^r_{i,k,1})\|^2\\
&\qquad-\|u^r_{i,k}(z^r_{i,k,1})-g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)\|^2\Big).
\end{align*}
Furthermore, inserting $\pm u^{r-\tau}_{i,0}(z^r_{i,k,1})$ in both arguments,
\begin{align*}
&\mathbb E\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(\bar w^{r-\tau},z^r_{i,k,1},\bar w^{r-\tau},z^{r-\tau:r-1}_{j,t,2}),\,g(\bar w^{r-\tau},z^r_{i,k,1},\bar w^{r-\tau},S_2)-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&\le 4\,\mathbb E\|u^r_{i,k-1}(z^r_{i,k,1})-u^{r-\tau}_{i,0}(z^r_{i,k,1})\|^2+\tfrac14\,\mathbb E\|g(\bar w^{r-\tau},z^r_{i,k,1},\bar w^{r-\tau},S_2)-u^{r-\tau}_{i,0}(z^r_{i,k,1})\|^2\\
&\quad-\mathbb E\|g(\bar w^{r-\tau},z^r_{i,k,1},\bar w^{r-\tau},S_2)-u^{r-\tau}_{i,0}(z^r_{i,k,1})\|^2+\tfrac14\,\mathbb E\|g(\bar w^{r-\tau},z^r_{i,k,1},\bar w^{r-\tau},S_2)-u^{r-\tau}_{i,0}(z^r_{i,k,1})\|^2\\
&\quad+4\,\mathbb E\|u^r_{i,k-1}(z^r_{i,k,1})-u^{r-\tau}_{i,0}(z^r_{i,k,1})\|^2. \tag{82}
\end{align*}
Noting
\begin{align*}
-\mathbb E\|g(\bar w^{r-\tau},z^r_{i,k,1},\bar w^{r-\tau},S_2)-u^{r-\tau}_{i,0}(z^r_{i,k,1})\|^2
&\le-\tfrac12\,\mathbb E\|g(\bar w^{r-\tau},z^r_{i,k,1},\bar w^{r-\tau},S_2)-u^r_{i,k}(z^r_{i,k,1})\|^2+8\beta^2K^2\tau^2C_0^2,
\end{align*}
we can then obtain
$$\frac{\gamma+1}{2}\,\mathbb E\|u^r_k-g(\bar w^r_k)\|^2\le\frac{\gamma\big(1-\frac{1}{|S_1^i|}\big)+1}{2}\,\mathbb E\|u^r_{k-1}-g(\bar w^r_k)\|^2+\frac{\gamma^2\sigma^2}{|S_1^i|}+\frac{8\beta^2K^2\tau^2C_0^2}{|S_1^i|}+\frac\gamma2\|\bar w^r-\bar w^{r-1}\|^2+\frac\gamma2\|\bar w^r-w^r_{i,k}\|^2.$$
Dividing both sides by $\frac{\gamma+1}{2}$ gives
$$\mathbb E\|u^r_k-g(\bar w^r_k)\|^2\le\Big(1-\frac{\gamma}{4P_i}\Big)\mathbb E\|u^r_{k-1}-g(\bar w^r_{k-1})\|^2+\frac\gamma2\|\bar w^r-\bar w^{r-1}\|^2+\frac\gamma2\|\bar w^{r-1}-w^r_k\|^2+\gamma\eta^2K^2\tau^2C_0^2+\frac{\gamma^2\sigma^2}{|S_1^i|}.$$
Unrolling this recursion and averaging yields
$$\le O\Big(\frac{M}{\gamma RK}+\gamma\beta^2K^2+\gamma+\frac{\eta^2M^2}{\gamma^2}+8\gamma M\tilde\eta^2+\gamma\eta^2K^2\tau^2C_0^2+M\gamma\tilde\eta^2\tau^2\Big(\frac{1}{\beta RK}+\frac\beta N\Big)\Big).$$
By setting the parameters as in the theorem, we can conclude the proof.
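The inner-product bounds above repeatedly use two elementary facts, stated here for completeness:

```latex
% Young's inequality and the polarization identity:
% for any vectors a, b, c and any \gamma > 0,
\langle a, b\rangle \le \frac{1}{2\gamma}\,\|a\|^2 + \frac{\gamma}{2}\,\|b\|^2,
\qquad
\langle a-b,\; b-c\rangle
= \frac{1}{2}\Big(\|a-c\|^2 - \|a-b\|^2 - \|b-c\|^2\Big).
```

The first with a suitable $\gamma$ produces the $4\|\cdot\|^2+\frac14\|\cdot\|^2$ splittings in (82); the second turns the $\frac{1}{\gamma}$ cross term into the telescoping difference of squared distances.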

G.3 CONVERGENCE RESULT

Theorem 7. Suppose Assumption 2 holds and at least $|P|$ machines participate in each round. Denote by $M=\max_i|S_1^i|$ the largest number of data on a single machine. By setting $\gamma=O\big(\frac{M^{1/3}N^{2/3}}{(R|P|)^{2/3}}\big)$, $\beta=O\big(\frac{N^{2/3}}{M^{1/6}(R|P|)^{2/3}}\big)$, $\eta=O\big(\frac{N^{2/3}}{M^{2/3}(R|P|)^{2/3}}\big)$, and $K=O\big(\frac{M^{1/3}(R|P|)^{1/3}}{N^{1/3}}\big)$, Algorithm 3 ensures that
$$\mathbb E\Big[\frac1R\sum_{r=1}^R\|\nabla F(\bar w^r)\|^2\Big]\le O\Big(\frac{1}{R^{2/3}}\Big).$$

Proof. By the update rules, we have that for $i\in P^r$,
$$\|\bar w^r-w^r_{i,k}\|^2\le\eta^2K^2C_f^2C_\ell^2C_g^2. \tag{114}$$
Similarly, we also have
$$\|\bar w^{r-1}-\bar w^r\|^2=\tilde\eta^2\Big\|\frac{1}{|P^r|K}\sum_{i\in P^r}\sum_{k=1}^K\bar G^{r-1}_k\Big\|^2\le\tilde\eta^2\,\frac1K\sum_{k=1}^K\big\|\bar G^{r-1}_k-\nabla F(\bar w^{r-1}_k)+\nabla F(\bar w^{r-1}_k)\big\|^2.$$
Lemma 4 yields that
\begin{align*}
\frac{1}{RK}\sum_{r,k}\mathbb E\|\bar G^r_k-\nabla F(\bar w^r_k)\|^2
&\le\frac{\Delta_0}{\beta RK}+\frac{\beta\sigma^2}{|P^r|}\\
&\quad+2\Big[\frac{1}{|P^r|}\sum_{i\in P^r}4\tilde L^2\,\mathbb E\|w^r_{i,k}-\bar w^r\|^2+4\tilde L^2\,\mathbb E\|\bar w^r-\bar w^{r-1}\|^2+\frac{1}{|P^r|}\sum_{i\in P^r}4\tilde L^2\,\mathbb E\|w^{r-1}_{j',t'}-\bar w^{r-1}\|^2\Big]\\
&\quad+2\,\mathbb E\Big[\frac1R\sum_r\frac{1}{|P^r|K}\sum_{i\in P^r,k}\frac{1}{|S_1^i|}\sum_{z\in S_1^i}\|u^r_{i,k}(z)-g(\bar w^r,z,\bar w^r,S_2)\|^2\Big]\\
&\quad+2\,\mathbb E\Big[\frac1R\sum_r\frac{1}{|P^r|K}\sum_{j',t'}\frac{1}{|S_1^i|}\sum_{z\in S_1^i}\|u^{r-1}_{j',t'}(z)-g(\bar w^{r-1}_{t'},z,\bar w^{r-1}_{t'},S_2)\|^2\Big].
\end{align*}
Then, using the standard analysis of a smooth function, we derive
$$F(\bar w^{r+1})-F(\bar w^r)\le\nabla F(\bar w^r)^\top(\bar w^{r+1}-\bar w^r)+\frac{L}{2}\,\|\bar w^{r+1}-\bar w^r\|^2. \tag{118}$$
Combining with (117), (113), (114), and (115), we derive
$$\frac1R\sum_r\mathbb E\|\nabla F(\bar w^r)\|^2\le O\Big(\frac{MN}{\gamma RK|P|}+\frac{\eta^2M^2}{\gamma^2}+\gamma+\beta^2K^2+M\gamma\tilde\eta^2\Big(\frac{1}{\beta RK}+\frac{\beta}{|P|}\Big)+\gamma M\eta^2K^2\Big).$$
By setting the parameters as in the theorem, we can conclude the proof.
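The "standard analysis of a smooth function" invoked in the proof is the one-step descent inequality, recorded here for completeness:

```latex
% If F is L-smooth, i.e., \|\nabla F(x) - \nabla F(y)\| \le L \|x - y\| for all x, y,
% then for any consecutive iterates:
F(\bar w^{r+1})
\le F(\bar w^r)
+ \nabla F(\bar w^r)^\top(\bar w^{r+1}-\bar w^r)
+ \frac{L}{2}\,\|\bar w^{r+1}-\bar w^r\|^2 .
```

Substituting $\bar w^{r+1}-\bar w^r=-\tilde\eta\,\frac{1}{|P^r|K}\sum_{i,k}\bar G^r_k$ and telescoping over $r$ is what converts the per-round gradient-error bounds into the stated convergence rate.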

H STATISTICS OF DATASETS AND MORE EXPERIMENTS

The statistics of the datasets we use are listed in Table 4. Here we present experiment results verifying the effectiveness of larger buffers; the corresponding analysis is given in Appendices E and F. We focus on one-way partial AUC maximization optimized by the FedX2 algorithm, as in the experiment section. Recall that with larger buffers we keep the last $\tau$ rounds of communicated history in $B_{i,1}, B_{i,2}$, instead of keeping only the previous round's history. A larger buffer provides each machine with a larger pool to sample from when computing local gradients, which can help the performance of local steps. In Figure 2, $\tau=1$ denotes Algorithm 3, while larger $\tau$ refers to the algorithms with larger buffers. We can see that keeping a larger $\tau$ can improve the performance. We have further verified that FedX2 can tolerate skipping a large number of communications.
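To make the larger-buffer mechanism concrete, the following minimal sketch keeps the last $\tau$ rounds of received history and samples uniformly from the pooled items, as a local gradient step would. The class and method names are ours for illustration, not the paper's implementation:

```python
import random
from collections import deque

class HistoryBuffer:
    """Retains communicated history (e.g., prediction outputs) from the last
    `tau` communication rounds; older rounds are evicted automatically.
    Illustrative sketch only, not the paper's implementation."""

    def __init__(self, tau):
        self.rounds = deque(maxlen=tau)  # one entry per communication round

    def push_round(self, items):
        # called once per round with that round's received history
        self.rounds.append(list(items))

    def pool(self):
        # all retained items, i.e., the sampling pool for local steps
        return [x for rnd in self.rounds for x in rnd]

    def sample(self, rng=random):
        # uniformly sample one retained item
        return rng.choice(self.pool())

buf = HistoryBuffer(tau=2)
buf.push_round(["r1_a", "r1_b"])
buf.push_round(["r2_a"])
buf.push_round(["r3_a"])  # round 1 is evicted; only rounds 2 and 3 remain
```

With $\tau=1$ this reduces to keeping only the previous round's history; increasing $\tau$ enlarges the pool at the cost of staler samples, which is exactly the trade-off studied in Figure 2.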



We use clients and machines interchangeably. A round is defined as a sequence of local updates between two consecutive communications.



Denote by $\omega_{1i}=N|S_1^i|/|S_1|$ and $\omega_{2j}=N|S_2^j|/|S_2|$, $i,j=1,\ldots,N$.


Figure 1: Ablation study. Left two: fix $N$ and vary $K$; right two: fix $K$ and vary $N$.

Further, to get $\frac1R\sum_r\mathbb E\|\nabla F(\bar w^r)\|^2\le\epsilon^2$, we just need to set $\gamma=O(\epsilon^2)$, $\beta=O\big(\frac{\epsilon^2}{\sqrt M}\big)$, $\tau=O(M^{1/4})$, $K=O\big(\frac{\sqrt M}{\epsilon}\big)$, $\eta=O\big(\frac{\epsilon^2}{M}\big)$, and $R=O(\ldots)$.


Comparison for Federated Deep Partial AUC Maximization. All reported results are partial AUC scores on testing data.

Comparison for Federated Deep AUC maximization under corrupted labels. All reported results are AUC scores on testing data.

Peilin Zhao, Steven C. H. Hoi, Rong Jin, and Tianbao Yang. Online AUC maximization. In Lise Getoor and Tobias Scheffer (eds.), Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 -July 2, 2011, pp. 233-240. Omnipress, 2011. URL https://icml.cc/2011/papers/198_icmlpaper.pdf.

Comparison of the sample complexity on each machine for solving the ERM problem and the CPR problem to an $\epsilon$-stationary point, i.e., $\mathbb E[\|\nabla F(w)\|^2]\le\epsilon^2$. $N$ is the number of machines in the federated setting. $n$ is the number of finite-sum components in the outer finite-sum, which for ERM is the total number of data and for CPR is the number of data in the outer function. $n_{\text{in}}$ denotes the number of finite-sum components of the inner function $g$ when it has a finite-sum structure. In the federated learning setting with a finite-sum structure, each machine $i$ has $n_i$ data in the outer function. * indicates that the complexity matches a known lower bound.

Next, we deal with the moving average of gradients, i.e., $G^r_{i,k}$, with the update $G^r_{i,k}=(1-\beta)G^r_{i,k-1}+\beta\,(\cdots)$, whose error is controlled in terms of $\mathbb E\|u^r_{i,k}(z^r_{i,k,1})-g(\bar w^r,z^r_{i,k,1},\bar w^r,S_2)\|^2$. Finally, we can analyze the convergence of $\nabla F(w)$:
$$F(\bar w^{r+1})-F(\bar w^r)\le\nabla F(\bar w^r)^\top(\bar w^{r+1}-\bar w^r)+\frac{L}{2}\,\|\bar w^{r+1}-\bar w^r\|^2.$$
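The moving-average update can be illustrated with a toy one-dimensional sketch; the function and variable names below are ours, and the real estimator averages the vector-valued gradient estimators of the paper rather than scalar samples:

```python
import random

def ema_gradient(grads, beta):
    """Moving-average gradient estimator G_k = (1 - beta) * G_{k-1} + beta * g_k.
    A toy 1-D illustration of the estimator family analyzed here."""
    G = 0.0
    for g in grads:
        G = (1.0 - beta) * G + beta * g
    return G

random.seed(0)
true_grad = 3.0
# noisy unbiased gradient samples with a common mean
noisy = [true_grad + random.gauss(0.0, 1.0) for _ in range(5000)]
G = ema_gradient(noisy, beta=0.01)
# with small beta, the averaging shrinks the noise so G tracks true_grad closely
```

The stationary variance of such an estimator scales like $\beta\sigma^2$, which is why the analysis trades a small $\beta$ (low variance) against the latency terms $(1-\beta)(\nabla F(\bar w^r_{k-1})-\nabla F(\bar w^r_k))$ introduced by averaging.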


Table 4: Statistics of the datasets.


$$\mathbb E\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},z^{r-1}_{j,t,2}),\,g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},S_2)-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\le4\,\mathbb E\|u^r_{i,k-1}(z^r_{i,k,1})-u^{r-1}_{i,0}(z^r_{i,k,1})\|^2+\cdots$$


Besides, we have
\begin{align*}
&\mathbb E\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(w^r_{i,k},z^r_{i,k,1},w^{r-1}_{j,t},z^{r-1}_{j,t,2}),\,g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&=\mathbb E\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},z^{r-1}_{j,t,2}),\,g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&\quad+\mathbb E\big\langle g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},z^{r-1}_{j,t,2})-g(w^r_{i,k},z^r_{i,k,1},w^{r-1}_{j,t},z^{r-1}_{j,t,2}),\,g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&\le\mathbb E\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},z^{r-1}_{j,t,2}),\,g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)-g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},S_2)\big\rangle\\
&\quad+\mathbb E\big\langle u^r_{i,k-1}(z^r_{i,k,1})-g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},z^{r-1}_{j,t,2}),\,g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},S_2)-u^r_{i,k-1}(z^r_{i,k,1})\big\rangle\\
&\quad+\|\bar w^{r-1}-w^r_{i,k}\|^2+\tfrac14\|g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)-u^r_{i,k-1}(z^r_{i,k,1})\|^2. \tag{43}
\end{align*}
For the moving-average gradient estimator, denote by $\Pi_1=\frac1N\sum_i\big[G_1(w^r_{i,k},z^r_{i,k,1},u^r_{i,k}(z^r_{i,k,1}),w^{r-1}_{j,t},z^{r-1}_{j,t,2})+G_2(w^{r-1}_{j',t'},z^{r-1}_{j',t',1},u^{r-1}_{j',t'}(z^{r-1}_{j',t',1}),w^r_{i,k},z^r_{i,k,2})\big]$ the actual stochastic gradient, and define intermediate quantities $\Pi_2,\Pi_3,\Pi_4$ in which, in turn, the local models are replaced by $\bar w^{r-1}$, the historical models are replaced by $\bar w^{r-1}$, and the estimators $u$ are replaced by the exact inner values $g(\cdot,\cdot,\cdot,S_2)$. Then
\begin{align*}
\bar G^r_k-\nabla F(\bar w^r_k)
&=(1-\beta)\big(\bar G^r_{k-1}-\nabla F(\bar w^r_{k-1})\big)+(1-\beta)\big(\nabla F(\bar w^r_{k-1})-\nabla F(\bar w^r_k)\big)\\
&\quad+\beta(\Pi_1-\Pi_2)+\beta(\Pi_2-\Pi_3)+\beta(\Pi_3-\Pi_4)+\beta\big(\Pi_4-\nabla F(\bar w^r_k)\big).
\end{align*}
Denoting $g_z(w)=g(w,z,w,S_2)$ and using Young's inequality together with the $\tilde L$-Lipschitzness of $G_1$ and $G_2$, each of the differences above is bounded in turn; combining the resulting inequalities yields the stated bound.

G FEDX WITH PARTIAL CLIENT PARTICIPATION

Considering that not all client machines are available to work at each round, in this section we provide an algorithm that allows partial client participation in every round. The algorithm is given in Algorithm 3. We use the same assumptions as in Appendix D; the results are presented in Theorem 7.

Algorithm 3 (FedX2: Federated Learning for CPR with non-linear $f$, partial participation) proceeds as follows. Each client $i$ first samples $K$ points from $S_1^i$ and $K$ points from $S_2^i$ and computes their predictions under the initial model $w^0_{i,0}$, denoted $H^0_{i,1}$ and $H^0_{i,2}$. For $r=1,\ldots,R$: if $i\notin P^r$, the client skips this round; otherwise it receives $R^{r-1}_{i,1}$, $R^{r-1}_{i,2}$, $P^{r-1}$ from the server, updates its buffers, samples two mini-batches of data, computes $G^r_{i,k,1}$ and $G^r_{i,k,2}$ according to (9) at each local step $k$, sends $w^r_{i,K}$, $G^r_{i,k}$, $H^r_{i,1}$, $H^r_{i,2}$, $U^r_i$ to the server, receives $\bar w^r$, $\bar G^r$, and sets $w^{r+1}_{i,0}=\bar w^r$, $G^{r+1}_{i,0}=\bar G^r$. The server broadcasts $\bar w^r$ and $\bar G^r$ to the clients in $P^r$, sends $R^r_{i,1}$, $R^r_{i,2}$ to each client $i\in P^r$, and collects $H^{r+1}_*=\cup_{i\in P^r}H^r_{i,*}$ for $*=1,2$ and $U^{r+1}=\cup_{i\in P^r}U^r_i$.

G.1 ANALYSIS OF THE MOVING AVERAGE ESTIMATOR u

Lemma 3. Under Assumption 2, the moving average estimator $u$ satisfies the following bound.

Proof (sketch). Denote by $P^r$ the set of clients sampled to participate in the $r$-th round. By the update rule of $u$, for $i\in P^r$ the one-step change of $\|u^r_{i,k}(z)-g(\bar w^r_k,z,\bar w^r_k,S_2)\|^2$ is expanded as before, and the cross term $\langle u^r_{i,k-1}(z^r_{i,k,1})-g(w^r_{i,k},z^r_{i,k,1},w^{r-1}_{j,t},\hat z^{r-1}_{j,t,2}),\,\cdot\rangle$ is bounded by Young's inequality. Noting that for $i\in P^r$,
$$-\mathbb E\|g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},S_2)-u^{r-1}_{i,0}(z^r_{i,k,1})\|^2\le-\tfrac12\,\mathbb E\|g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},S_2)-u^r_{i,k}(z^r_{i,k,1})\|^2+\cdots,$$
and accounting for both the client sampling and the data sampling, we multiply every term by $\gamma$, rearrange using the setting $\gamma\le O(1)$, and obtain the recursion (107); dividing both sides by $\frac{\gamma+1}{2}$ gives the claimed bound. Finally, using Young's inequality and the $\tilde L$-Lipschitzness of $G_1$ and $G_2$, the remaining term
$$\frac1N\sum_i\Big(G_1\big(\bar w^{r-1},z^r_{i,k,1},g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},S_2),\bar w^{r-1},\hat z^{r-1}_{j,t,2}\big)+G_2\big(\bar w^{r-1},\hat z^{r-1}_{j',t',1},g(\bar w^{r-1},\hat z^{r-1}_{j',t',1},\bar w^{r-1},S_2),\bar w^{r-1},z^r_{i,k,2}\big)\Big)-\nabla F(\bar w^{r-1})$$
is controlled as in the full-participation analysis.
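To see the mechanism behind Lemma 3, consider the simplified case of a fixed target $g$ and an unbiased estimate $\hat g_k$ with variance at most $\sigma^2$; this special case is added here only for intuition (the lemma itself handles a moving target and stale samples):

```latex
% With u_k = (1-\gamma) u_{k-1} + \gamma \hat g_k,
% E[\hat g_k \mid u_{k-1}] = g, and E\|\hat g_k - g\|^2 \le \sigma^2:
\mathbb E\|u_k - g\|^2
= (1-\gamma)^2\,\mathbb E\|u_{k-1}-g\|^2 + \gamma^2\,\mathbb E\|\hat g_k - g\|^2
\le (1-\gamma)\,\mathbb E\|u_{k-1}-g\|^2 + \gamma^2\sigma^2 .
```

The cross term vanishes by unbiasedness, giving a geometric contraction at rate $(1-\gamma)$ plus a $\gamma^2\sigma^2$ noise floor; the lemma's additional $\|\bar w^r-\bar w^{r-1}\|^2$-type terms account for the fact that the target $g(\bar w^r_k,z,\bar w^r_k,S_2)$ moves with the model.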

The five error terms appearing in the bound are
\begin{align*}
&4\tilde L^2\,\mathbb E\|w^r_{i,k}-\bar w^r\|^2+4\tilde L^2\,\mathbb E\|\bar w^r-\bar w^{r-1}\|^2+\frac1N\sum_i4\tilde L^2\,\mathbb E\|w^{r-1}_{j',t'}-\bar w^{r-1}\|^2\\
&\quad+\tilde L^2\,\mathbb E\|u^r_{i,k}(z^r_{i,k,1})-g(\bar w^r_k,z^r_{i,k,1},\bar w^r_k,S_2)\|^2+\tilde L^2\,\mathbb E\|u^{r-1}_{j',t'}(\hat z^{r-1}_{j',t',1})-g(\bar w^{r-1}_{t'},\hat z^{r-1}_{j',t',1},\bar w^{r-1}_{t'},S_2)\|^2.
\end{align*}
By the fact that
$$\mathbb E\Big[\frac1N\sum_i\Big(G_1\big(\bar w^{r-1},z^r_{i,k,1},g(\bar w^{r-1},z^r_{i,k,1},\bar w^{r-1},S_2),\bar w^{r-1},\hat z^{r-1}_{j,t,2}\big)+G_2\big(\bar w^{r-1},\hat z^{r-1}_{j',t',1},g(\bar w^{r-1},\hat z^{r-1}_{j',t',1},\bar w^{r-1},S_2),\bar w^{r-1},z^r_{i,k,2}\big)\Big)-\nabla F(\bar w^{r-1})\Big]=0, \tag{111}$$
and that the corresponding second moment is bounded in terms of $\mathbb E\|u^{r-1}_{j',t'}(\hat z^{r-1}_{j',t',1})-g(\bar w^{r-1};\hat z^{r-1}_{j',t',1},S_2)\|^2$ and $\mathbb E\|\bar w^{r-1}-\bar w^{r-1}_{t'}\|^2$, using Lemma 3 and combining with the previous five inequalities under the stated parameter settings, we obtain the claimed bound.

Figure 2: Fix $N$; vary $K$ and $\tau$.

