FEDDA: FASTER FRAMEWORK OF LOCAL ADAPTIVE GRADIENT METHODS VIA RESTARTED DUAL AVERAGING

Abstract

Federated learning (FL) is an emerging learning paradigm for tackling massively distributed data. In FL, a set of clients jointly perform a machine learning task under the coordination of a server. The FedAvg algorithm is one of the most widely used methods to solve FL problems. In FedAvg, the learning rate is a constant rather than changing adaptively. Adaptive gradient methods show superior performance over constant learning rate schedules; however, there is still no general framework for incorporating adaptive gradient methods into the federated setting. In this paper, we propose FedDA, a novel framework for local adaptive gradient methods. The framework adopts a restarted dual averaging technique and is flexible with various gradient estimation methods and adaptive learning rate formulations. In particular, we analyze FedDA-MVR, an instantiation of our framework, and show that it achieves gradient complexity Õ(ϵ^{-1.5}) and communication complexity Õ(ϵ^{-1}) for finding an ϵ-stationary point. This matches the best known rate for first-order FL algorithms, and FedDA-MVR is the first adaptive FL algorithm to achieve it. We also perform extensive numerical experiments to verify the efficacy of our method.

1. INTRODUCTION

Federated Learning (FL) denotes the process in which a set of distributed clients jointly perform a machine learning task under the coordination of a central server over their privately-held data. A widely used method in FL is the FedAvg (Local-SGD) McMahan et al. (2017) algorithm. As indicated by its name, FedAvg performs (stochastic) gradient descent steps on each client and averages the local states periodically. This method can be shown to converge Stich (2018); Haddadpour & Mahdavi (2019); Woodworth et al. (2020) when the distributions of the clients are homogeneous or have bounded heterogeneity. Recently, a large amount of literature has focused on accelerating FedAvg. In particular, many works use momentum-based methods to accelerate FL, and significant progress has been made in this direction with improved gradient complexity and communication complexity Das et al. (2020); Karimireddy et al. (2019a); Khanduri et al. (2021a). However, another important category of methods, adaptive gradient methods, has received much less attention, and there is still no general framework to incorporate adaptive gradient methods into the federated setting. Adaptive gradient methods such as Adagrad Duchi et al. (2011), Adam Kingma & Ba (2014) and AMSGrad Reddi et al. (2018) are widely used in the non-distributed setting. The gradient descent method uses either a fixed learning rate or a fixed learning rate schedule. In contrast, adaptive gradient methods set the learning rate to be inversely proportional to the magnitude of the gradient, which incorporates the local curvature structure of the problem. Adaptive gradient methods perform well in practice; meanwhile, they also enjoy useful theoretical properties that make them outperform the vanilla gradient descent method Duchi et al. (2011); Guo et al. (2021). For example, a recent study Staib et al. (2019) showed that adaptive gradients help escape saddle points.
Furthermore, some studies Loshchilov & Hutter (2018); Chen et al. (2018a) showed that adaptive gradients improve the generalization performance of the model. Adaptive gradient methods can be viewed as a type of generalized mirror descent Huang et al. (2021), where the associated mirror map is defined according to the adaptive learning rates. However, the mirror map is dynamic and changes at every training step. As a special case, the gradient descent method can be viewed as a mirror descent method whose mirror map is the L2 distance function. Following the convention in the mirror descent literature, we call the parameter space the primal space and the gradient space the dual space. The primal and dual spaces differ in adaptive gradient methods, but they coincide in the gradient descent method.

Table 1: Comparisons of representative Federated Learning algorithms for finding an ϵ-stationary point of Objective equation 1, i.e., ∥∇f(x)∥² ≤ ϵ or its equivalent variants. Gc(f, ϵ) denotes the number of gradient queries w.r.t. f^(k)(x) for k ∈ [K]; Cc(f, ϵ) denotes the number of communication rounds; State means which state the algorithm maintains locally (Primal/Dual); Local-Adaptive means whether the algorithm performs adaptive gradient descent locally; Constrained means whether the algorithm can solve both constrained and unconstrained problems. The first three algorithms are not adaptive gradient methods, and the last four support some form of adaptive gradients.

| Algorithm | Gc(f, ϵ) | Cc(f, ϵ) | State | Local-Adaptive | Constrained |
| FedAvg McMahan et al. (2017) | O(ϵ^{-2}) | O(ϵ^{-1.5}) | Primal/Dual | ✗ | ✗ |
| FedCM Khanduri et al. (2021a) | Õ(ϵ^{-1.5}) | Õ(ϵ^{-1}) | Primal/Dual | ✗ | ✗ |
| STEM Khanduri et al. (2021a) | Õ(ϵ^{-1.5}) | Õ(ϵ^{-1}) | Primal/Dual | ✗ | ✗ |
| FedAdam Reddi et al. (2020) | O(ϵ^{-2}) | O(ϵ^{-1}) | Primal | ✗ | ✗ |
| Local-AMSGrad Chen et al. (2020b) | O(ϵ^{-2}) | O(ϵ^{-1.5}) | Primal | ✓ | ✗ |
| MIME-MVR Karimireddy et al. (2020a) | Õ(ϵ^{-1.5}) | O(ϵ^{-1.5}) | Primal | ✓ | ✗ |
| FedDA-MVR (Ours) | Õ(ϵ^{-1.5}) | Õ(ϵ^{-1}) | Dual | ✓ | ✓ |

We can exploit this primal-dual view to understand existing FL algorithms and to design new ones. FedAvg in fact exploits the usefulness of averaging dual states: the gradient average approximates the true gradient evaluated at the average of the client states, and the approximation error is upper-bounded by the difference between client states; therefore, clients can perform multiple local steps without communication. Although FedAvg does not distinguish the primal and dual spaces (they coincide), the dual state average and the primal state average are not equivalent for adaptive gradient methods. Current federated adaptive gradient methods in the literature either perform adaptive gradient steps only on the server side, or ignore this primal-dual nuance when supporting local adaptive gradient steps. An early work is Reddi et al. (2020), in which the authors proposed applying adaptive gradients in the server-average step while performing plain gradient descent updates locally.
This method is simple to implement and performs better than FedAvg, but the adaptive information is not exploited during local updates, which weakens the benefit of adaptive gradients. Recently, some works Karimireddy et al. (2020a); Chen et al. (2020b) exploited adaptive information during local update steps; however, a common characteristic of these methods is that they average the primal states (parameters) during the synchronization step. This causes problems. First, since the adaptive learning rates define the mirror map, updating them locally leaves the dual spaces misaligned, so we cannot average the primal states directly. Second, even if the adaptive learning rates are fixed locally, the primal space might be nonlinear w.r.t. the dual space, e.g., when we solve a constrained optimization problem. In summary, we propose two principles for applying adaptive gradients in FL: first, the local dual spaces should be aligned with each other; second, we should average dual states. More specifically, we propose the FL adaptive gradient framework FedDA, short for Federated Dual-averaging Adaptive-gradient. In each global round of FedDA, the clients aggregate gradients (dual states) locally, and the server averages the dual states of the clients in the synchronization step. Local weights (primal states) are used as gradient query points in local updates and are recovered through the inverse mirror map (defined by the adaptive gradients). The global primal state is updated on the basis of the averaged dual states and the inverse mirror map. In addition, we utilize a restarting technique to make sure that all clients share the same dual space during local updates; more precisely, we refresh the adaptive gradients at every global epoch and use a fixed one in the local updates. Our FedDA framework is general and can incorporate a large family of adaptive gradient methods into the FL setting.
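To make the primal-dual nuance concrete, the following toy NumPy sketch (values and diagonal adaptive matrices are made up for illustration, not taken from the paper) shows that averaging primal states recovered via the inverse mirror map x = H^{-1}z differs from mapping the averaged dual state through the averaged adaptive matrix:

```python
import numpy as np

# Toy illustration: two clients hold dual states z1, z2 and diagonal adaptive
# matrices H1, H2. The primal state is recovered through the inverse mirror
# map x = H^{-1} z (unconstrained case).
H1 = np.array([1.0, 4.0])
H2 = np.array([4.0, 1.0])
z1 = np.array([2.0, 2.0])
z2 = np.array([2.0, 2.0])

# Averaging primal states (what primal-averaging methods do):
avg_primal = 0.5 * (z1 / H1 + z2 / H2)

# Averaging dual states first, then mapping through the averaged matrix:
H_avg = 0.5 * (H1 + H2)
avg_dual_primal = 0.5 * (z1 + z2) / H_avg

print(avg_primal)       # [1.25 1.25]
print(avg_dual_primal)  # [0.8 0.8]
```

The two results disagree because the clients' mirror maps differ, which is exactly why FedDA keeps the local dual spaces aligned before averaging.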
In particular, FedDA-MVR, an instantiation of our framework, achieves the best-known gradient complexity and communication complexity in the FL setting. Finally, we highlight our contributions as follows: (i) We propose FedDA, a framework for federated adaptive gradient methods. The framework uses a restarted dual averaging technique and adapts a large family of adaptive gradient methods to the FL setting; (ii) FedDA-MVR, an instantiation of our framework, obtains a gradient complexity of Õ(ϵ^{-1.5}) and a communication complexity of Õ(ϵ^{-1}). This matches the optimal rate of non-adaptive federated algorithms and outperforms existing adaptive federated algorithms. FedDA-MVR uses momentum-based variance-reduced gradient estimation, and an exponential moving average of the squared gradient as adaptive learning rates; (iii) We empirically verify the efficacy of the FedDA framework on a colorectal cancer prediction task and on classification tasks over the CIFAR10 and FEMNIST datasets.

Notations. ∇f(x) (∇f^{(k)}(x)) denotes the first-order derivative of the function f(x) (f^{(k)}(x)) w.r.t. the variable x. ξ denotes a random sample, and ∇f(x; ξ) (∇f^{(k)}(x; ξ)) is the stochastic estimate of ∇f(x) (∇f^{(k)}(x)). O(•) is the big-O notation, and Õ(•) hides logarithmic terms. I_d denotes the d-dimensional identity matrix. Diag(x) denotes the matrix whose diagonal is the vector x. ∥•∥ denotes the ℓ2 norm for vectors and the spectral norm for matrices. ⟨•, •⟩ denotes the Euclidean inner product. [K] denotes the set {1, 2, ..., K}. For a random variable X, E[X] denotes its expectation.

2. RELATED WORKS

Optimization Algorithms in Federated Learning. The term Federated Learning was first coined in McMahan et al. (2017), where a task is learned from a set of distributed clients under the coordination of a server. In that paper, the authors proposed the FedAvg algorithm, in which each client performs multiple steps of gradient descent on its local data and then sends the updated model to the server for averaging. FedAvg resembles the Local-SGD algorithm, which has been studied in a more general distributed setting for a longer time Mangasarian & Solodov (1993). In Das et al. (2020); Khanduri et al. (2021b), momentum-based variance reduction is applied to the FL setting to control the noise of the stochastic gradients. In Das et al. (2020), the authors maintain a server momentum state and a client momentum state, while in Khanduri et al. (2021b), the authors maintain a momentum state that is averaged periodically, similar to the primal state. Adaptive gradient methods have also been studied in the FL setting. The 'Adaptive Federated Optimization' Reddi et al. (2020) method uses adaptive gradients on the server side, while the local gradients are used to update the states of the adaptive gradient method. In Chen et al. (2020b), the authors first showed the divergence of a naive local AMSGrad method that directly averages the primal states periodically; they then proposed Local-AMSGrad, a method in which clients update adaptive learning rates locally and average them at the synchronization step. Finally, another line of research Tang et al. (2020; 2021); Lu et al. (2022); Chen et al. (2020a) considers federated adaptive learning rates through a compression approach: these methods communicate local gradients at every step, but compression techniques are used to reduce the communication cost.

Adaptive Gradients in Non-distributed Learning.
Adaptive gradient methods are widely used in the non-distributed machine learning setting. The first adaptive gradient method, Adagrad, was proposed in (Duchi et al., 2011), where it was shown to outperform SGD in the sparse-gradient setting. Since Adagrad does not perform well in dense-gradient and non-convex settings, several variants have been proposed, such as SC-Adagrad Mukkamala & Hein (2017) and SADAGRAD Chen et al. (2018b). Furthermore, Adam Kingma & Ba (2014) and YOGI Zaheer et al. (2018) use an exponential moving average instead of the arithmetic average used in Adagrad. Adam/YOGI is widely used and very successful in deep learning applications; however, Adam diverges in some settings because past gradient information decays too quickly, so AMSGrad Reddi et al. (2018) was proposed, which adds an extra 'long-term memory' variable to preserve past gradient information and resolve the convergence issue of Adam. The convergence of Adam-type methods has also been studied in the literature Chen et al. (2019); Zhou et al. (2018); Liu et al. (2019); Guo et al. (2021); Huang et al. (2021). Adaptive gradient methods with good generalization performance have also been proposed, such as AdamW (Loshchilov & Hutter, 2018), Padam (Chen et al., 2018a), Adabound Luo et al. (2019), Adabelief Zhuang et al. (2020) and AdaGrad-Norm Ward et al. (2019).

3. PRELIMINARIES

In this section, we introduce some preliminaries before presenting our framework. First, we consider the following formulation of Federated Learning:

min_{x∈X⊂R^d} f(x) := (1/K) Σ_{k=1}^{K} f^{(k)}(x),  f^{(k)}(x) := E_{ξ^{(k)}∼D^{(k)}}[f^{(k)}(x; ξ^{(k)})],   (1)

which considers K clients. For the k-th client, we optimize the loss objective f^{(k)}(x): X → R, which is smooth and possibly non-convex, and x denotes the variable of interest. X ⊂ R^d is a compact and convex set. ξ^{(k)} ∼ D^{(k)} is a random sample that follows an unknown data distribution D^{(k)}. The formulation in equation 1 includes both the homogeneous case, i.e., f^{(k)}(x) = f^{(j)}(x) for any k, j ∈ [K], and the heterogeneous case, i.e., f^{(k)}(x) ̸= f^{(j)}(x) for some k, j ∈ [K].

Next, we introduce some basics of adaptive gradient methods from a mirror-descent perspective. Generally, mirror descent is associated with a mirror map Φ(x). Given the objective f(x) and the primal state x_t ∈ X at the t-th step, we first map the primal state to the mirror (dual) space as y_t = ∇Φ(x_t); then we perform the gradient descent step in the mirror space: y_{t+1} = y_t − η∇f(x_t), where η is the learning rate; finally, we map y_{t+1} back to the primal space as x_{t+1} = arg min_{x∈X} D_Φ(x, y_{t+1}), where D_Φ(x, y) denotes the Bregman divergence associated with Φ, i.e., D_Φ(x, y) = Φ(x) − Φ(y) − ⟨∇Φ(y), x − y⟩. In summary, the mirror descent step can be written as the following Bregman proximal gradient step:

x_{t+1} = arg min_{x∈X} η⟨∇f(x_t), x⟩ + D_Φ(x, x_t).   (2)

For adaptive gradient methods, we use the mirror map Φ(x) = (1/2) xᵀHx, where H is the adaptive matrix and is positive definite. Many adaptive gradient methods can then be written in the following proximal gradient descent form:

x_{t+1} = arg min_{x∈X} η⟨ν_t, x⟩ + (1/2)(x − x_t)ᵀ H_t (x − x_t),

where we replace the gradient ∇f(x_t) with the generalized gradient estimate ν_t; besides, we replace H with H_t because the adaptive matrix is updated at every step.
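In the unconstrained case with a diagonal adaptive matrix, the proximal step above has a simple closed form; the sketch below illustrates it (the function name and numeric values are ours, for illustration only):

```python
import numpy as np

def adaptive_prox_step(x_t, nu_t, h_t, eta):
    """One unconstrained proximal step with diagonal adaptive matrix H_t = Diag(h_t).

    Solves argmin_x eta * <nu_t, x> + 0.5 * (x - x_t)^T H_t (x - x_t),
    whose closed form is x_t - eta * nu_t / h_t.
    """
    return x_t - eta * nu_t / h_t

# Coordinates with larger adaptive entries take proportionally smaller steps.
x_next = adaptive_prox_step(np.array([1.0, 1.0]), np.array([2.0, 0.5]),
                            np.array([4.0, 1.0]), eta=0.1)
print(x_next)  # [0.95 0.95]
```

Here the first coordinate has a gradient four times larger but also an adaptive entry four times larger, so both coordinates move by the same amount, which is the per-coordinate rescaling effect of the adaptive matrix.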
Next, we show some examples of adaptive gradient methods that can be phrased in the above formulation. For the Adagrad Duchi et al. (2011) method, we set

ν_t = ∇f(x_t; ξ_t),  H_t = Diag(√μ_t),  μ_t = (1/t) Σ_{i=1}^{t} ν_i².   (3)

For Adam Kingma & Ba (2014), we have:

ν̃_t = (1 − β₁)∇f(x_t; ξ_t) + β₁ν̃_{t−1},  μ̃_t = (1 − β₂)∇f(x_t; ξ_t)² + β₂μ̃_{t−1},
ν_t = ν̃_t/(1 − γ₁ᵗ),  μ_t = μ̃_t/(1 − γ₂ᵗ),  H_t = Diag(√μ_t + ϵ),   (4)

where β₁, β₂, γ₁, γ₂ are constants. For other adaptive gradient methods, please refer to Huang et al. (2021).

Algorithm 1 FedDA-Server
1: Input: number of global epochs E, tuning parameters {β_τ}_{τ=1}^{E};
2: Initialize: choose x₀ ∈ X and compute ν₀ = (1/K) Σ_{j=1}^{K} ∇f^{(j)}(x₀; B₀^{(j)}), where {B₀^{(k)}}_{k=1}^{K} are mini-batches of random samples selected from each of the K clients;
3: for τ = 0 to E − 1 do
4:   Sample a subset S_τ of r clients;
5:   for each client k ∈ S_τ in parallel do
6:     (z^{(k)}_{τ+1,I}, ν^{(k)}_{τ+1,I}) = FedDA-Client(x_τ, ν_τ, H_τ);
7:   end for
8:   Compute z_{τ+1} = (1/r) Σ_{k∈S_τ} z^{(k)}_{τ+1,I};
9:   Compute x_{τ+1} = arg min_{x∈X} { −⟨x, z_{τ+1}⟩ + (1/2λ)(x − x_τ)ᵀ H_τ (x − x_τ) };
10:  Compute ν_{τ+1} = (1/r) Σ_{k∈S_τ} ν^{(k)}_{τ+1,I};
11:  Compute H_{τ+1} = V(H_τ, z_{τ+1});
12: end for

Algorithm 2 FedDA-Client(x_τ, ν_τ, H_τ)
1: Input: number of local steps I, tuning parameters {η_{τ+1,i}}_{i=0}^{I−1}, {α_{τ+1,i}}_{i=1}^{I};
2: Initialize: x^{(k)}_{τ+1,0} = x_τ; ν^{(k)}_{τ+1,0} = ν_τ; z^{(k)}_{τ+1,0} = 0;
3: for i = 0 to I − 1 do
4:   Compute z^{(k)}_{τ+1,i+1} = z^{(k)}_{τ+1,i} − η_{τ+1,i} ν^{(k)}_{τ+1,i};
5:   Compute x^{(k)}_{τ+1,i+1} = arg min_{x∈X} { −⟨x, z^{(k)}_{τ+1,i+1}⟩ + (1/2λ)(x − x^{(k)}_{τ+1,0})ᵀ H_τ (x − x^{(k)}_{τ+1,0}) };
6:   Compute ν^{(k)}_{τ+1,i+1} = U(ν^{(k)}_{τ+1,i}, x^{(k)}_{τ+1,i+1}, x^{(k)}_{τ+1,i}; α_{τ+1,i+1}, B^{(k)}_{τ+1,i+1}), where B^{(k)}_{τ+1,i+1} is a mini-batch of random samples from client k;
7: end for
8: Output: send (z^{(k)}_{τ+1,I}, ν^{(k)}_{τ+1,I}) to the server.
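As a rough illustration of the client side, the local loop of Algorithm 2 (lines 3-7) can be sketched in the unconstrained, diagonal-H case as follows; here the gradient-estimation rule U is stubbed out with a pre-supplied list of estimates, which is purely an assumption for illustration:

```python
import numpy as np

def fedda_client(x_tau, nu_tau, h_tau, local_grads, etas, lam=1.0):
    """Sketch of the local loop of Algorithm 2 (unconstrained case,
    H_tau = Diag(h_tau)). `local_grads` stubs out the update rule U with a
    pre-supplied list of I gradient estimates; in the real algorithm they
    are recomputed from fresh mini-batches at the recovered primal point x."""
    z = np.zeros_like(x_tau)   # z_{tau+1,0} = 0
    nu = nu_tau                # nu_{tau+1,0} = nu_tau
    for i, eta in enumerate(etas):
        z = z - eta * nu                 # line 4: accumulate the dual state
        x = x_tau + lam * z / h_tau      # line 5: recover the primal query point
        nu = local_grads[i]              # line 6: U(...) placeholder
    return z, nu                         # line 8: sent back to the server

z_out, nu_out = fedda_client(np.array([0.0]), np.array([1.0]), np.array([1.0]),
                             [np.array([2.0])], etas=[0.1])
print(z_out, nu_out)  # [-0.1] [2.]
```

Note that the primal recovery in line 5 uses the closed form of the proximal step for diagonal H_τ; with a constraint set X one would instead solve the projection subproblem of Algorithm 2, line 5.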

4. LOCAL ADAPTIVE GRADIENTS VIA DUAL AVERAGING

In this section, we introduce FedDA, a framework of federated adaptive gradient methods. The procedure is summarized in Algorithm 1: we perform E global steps, and at each global step we select a subset of clients for training. Every selected client runs Algorithm 2. In Algorithm 2, clients receive the current model weight x_τ, the gradient estimate ν_τ, and the adaptive matrix H_τ. The clients then perform I local training steps (lines 3-7 of Algorithm 2). In each step, we first accumulate the dual state in the variable z^{(k)}_{τ,i} (line 4); then we compute the local primal state x^{(k)}_{τ,i} (line 5), a proximal gradient step similar to equation 2. This step maps the aggregated dual state z^{(k)}_{τ,i} back to the primal space, and we use the resulting primal state as the query point to update the gradient estimate ν^{(k)}_{τ,i} (line 6). Note that we use a fixed adaptive matrix H_τ during local steps; this makes the clients share the same dual space. In line 6 of Algorithm 2, the update rule U(•) for the gradient estimate is general; examples include the momentum-based variance reduction update (equation 5) and the momentum update (equation 6) below (α_{τ,i} is some constant):

ν^{(k)}_{τ+1,i+1} = ∇f^{(k)}(x^{(k)}_{τ+1,i+1}; B^{(k)}_{τ+1,i+1}) + (1 − α_{τ+1,i+1})(ν^{(k)}_{τ+1,i} − ∇f^{(k)}(x^{(k)}_{τ+1,i}; B^{(k)}_{τ+1,i+1}))   (5)

and

ν^{(k)}_{τ+1,i+1} = α_{τ+1,i+1} ∇f^{(k)}(x^{(k)}_{τ+1,i+1}; B^{(k)}_{τ+1,i+1}) + (1 − α_{τ+1,i+1}) ν^{(k)}_{τ+1,i}.   (6)

After a client runs Algorithm 2, it returns the aggregated local dual state z^{(k)}_{τ+1,I} to the server. The server first averages the local dual states (line 8 of Algorithm 1) to get z_{τ+1}; we can average local dual states because all clients share a common dual space. The server then computes the new primal state x_{τ+1} as in line 9 of Algorithm 1. Next, the gradient estimate ν_τ is also updated by averaging local states (line 10 of Algorithm 1).
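The two update rules U in equations 5 and 6 can be sketched as follows (the function names are ours; in the algorithm, both gradients in the variance-reduction update are evaluated on the same fresh mini-batch):

```python
import numpy as np

def mvr_update(grad_new, grad_old, nu_prev, alpha):
    """Momentum-based variance reduction (equation 5). grad_new and grad_old
    are the stochastic gradients at the new and previous iterates, evaluated
    on the SAME mini-batch B_{tau+1,i+1}."""
    return grad_new + (1.0 - alpha) * (nu_prev - grad_old)

def momentum_update(grad_new, nu_prev, alpha):
    """Plain momentum (equation 6): exponential moving average of gradients."""
    return alpha * grad_new + (1.0 - alpha) * nu_prev

# Illustrative values:
print(mvr_update(np.array([1.0]), np.array([0.5]), np.array([0.5]), 0.5))  # [1.]
print(momentum_update(np.array([1.0]), np.array([0.5]), 0.5))              # [0.75]
```

The MVR rule corrects the previous estimate by the gradient difference between consecutive iterates, which is what drives the variance reduction; the momentum rule simply averages past gradients.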
Finally, we update the adaptive matrix H_τ (line 11 of Algorithm 1). The update rule V is general, e.g.,

μ_{τ+1} = β_{τ+1} z²_{τ+1}/η²_{τ+1,I−1} + (1 − β_{τ+1}) μ_τ,  H_{τ+1} = Diag(√μ_{τ+1} + ϵ)   (7)

and

μ_{τ+1} = β_{τ+1} ∥z_{τ+1}∥/η_{τ+1,I−1} + (1 − β_{τ+1}) μ_τ,  H_{τ+1} = (μ_{τ+1} + ϵ) I_d,   (8)

where we set μ₀ = 0 and ϵ is some constant. In summary, FedDA-MVR updates the gradient estimate ν^{(k)}_{τ,i} with momentum-based variance reduction (equation 5) and updates the adaptive matrix H_τ with an exponential moving average of the squared gradient (equation 7). In the subsequent discussion, we focus on this variant and perform both theoretical and empirical analysis.
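A minimal sketch of the two adaptive-matrix update rules V in equations 7 and 8, assuming a diagonal representation of H (the function names are ours):

```python
import numpy as np

def update_H_coordinatewise(mu, z, eta_last, beta, eps):
    """Equation 7: exponential moving average of the squared,
    step-size-rescaled dual state. Returns the new mu and the diagonal
    entries of H_{tau+1}."""
    mu_new = beta * (z / eta_last) ** 2 + (1.0 - beta) * mu
    return mu_new, np.sqrt(mu_new) + eps

def update_H_scalar(mu, z, eta_last, beta, eps):
    """Equation 8: scalar (norm-based) variant; H_{tau+1} = (mu_new + eps) * I_d."""
    mu_new = beta * np.linalg.norm(z) / eta_last + (1.0 - beta) * mu
    return mu_new, mu_new + eps

# Illustrative call with mu_0 = 0, as in the text:
mu1, h_diag = update_H_coordinatewise(np.zeros(2), np.array([1.0, 2.0]),
                                      eta_last=0.5, beta=0.999, eps=0.01)
```

Dividing z by the last local learning rate rescales the accumulated dual state back to gradient magnitude before it enters the moving average; the ϵ offset keeps H_{τ+1} positive definite, as required by Assumption 4.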

5. THEORETICAL ANALYSIS

In this section, we provide the theoretical analysis of our FedDA framework; more specifically, we focus on the analysis of FedDA-MVR, which uses equation 7 to update the adaptive matrix H_τ and equation 5 to update the gradient estimate ν^{(k)}_{τ,i}. We first state the assumptions needed in our analysis:

5.1. SOME MILD ASSUMPTIONS

Assumption 1 (Bounded Client Heterogeneity). The difference of gradients between different clients is bounded: ∥∇f^{(k)}(x) − ∇f^{(ℓ)}(x)∥² ≤ ζ², ∀k, ℓ ∈ [K].

We measure the heterogeneity of the clients in terms of gradient dissimilarity. The above assumption, or a similar form of it, is also used in the analysis of other Federated Learning algorithms, such as Khanduri et al. (2021a); Das et al. (2020).

Assumption 2. The function f(x) is bounded from below on X, i.e., f* = inf_{x∈X} f(x) > −∞.

Assumption 3 (Unbiased and Bounded-variance Stochastic Gradient). The stochastic gradients are unbiased with bounded variance, i.e., E[∇f^{(k)}(x; ξ^{(k)})] = ∇f^{(k)}(x), and there exists a constant σ such that E∥∇f^{(k)}(x; ξ^{(k)}) − ∇f^{(k)}(x)∥² ≤ σ², ∀ξ^{(k)} ∼ D^{(k)}, ∀k ∈ [K].

Assumption 2 guarantees the feasibility of the Federated Learning problem equation 1, and Assumption 3 is widely used in stochastic optimization analysis.

Assumption 4. The adaptive matrix H_τ is symmetric positive definite, i.e., there exists a constant ρ > 0 such that H_τ ⪰ ρI_d ≻ 0, ∀τ ≥ 1.

In our analysis, we assume the adaptive matrix is positive definite; this requirement is easily satisfied by many adaptive gradient methods. Most adaptive gradient methods have non-negative adaptive learning rates, such as equation 3 and equation 4, and to make them strictly positive we can add a bias term ϵ, as in the Adam update rule equation 4.

Assumption 5 (Sample Gradient Lipschitz Smoothness). The stochastic functions f^{(k)}(x; ξ^{(k)}) with ξ^{(k)} ∼ D^{(k)}, for all k ∈ [K], satisfy the mean-squared smoothness property, i.e., E∥∇f^{(k)}(x; ξ^{(k)}) − ∇f^{(k)}(y; ξ^{(k)})∥² ≤ L²∥x − y∥² for all x, y ∈ R^d.

This smoothness assumption is slightly stronger than the standard smoothness condition, but it is widely used in the analysis of variance reduction methods, such as SPIDER Fang et al. (2018) and STORM Cutkosky & Orabona (2019).

Assumption 6.
All clients participate in the training at each step, i.e., we choose r = K in Algorithm 1. We make the full participation assumption to simplify the exposition of the theoretical results; all the results presented can be easily generalized to the partial participation case.

5.2. CONVERGENCE PROPERTIES OF FEDDA-MVR

In this subsection, we provide the convergence properties of our FedDA-MVR variant. For convenience, we re-index steps as t = τI + i, i.e., step t denotes the i-th local step in the τ-th global round; similarly, we denote the total number of steps as T = EI. We analyze our algorithm through the following measure:

G_t = (ρ²/(λ²η_t²)) ∥x̄_t − x̄_{t+1}∥² + ∥ν̄_t − ∇f(x̄_t)∥²,   (9)

where ν̄_t denotes the average gradient estimate at step t and x̄_t denotes the virtual global primal state at step t (see Section 9.2 in the appendix for formal definitions). In Remark 7 of the appendix, we discuss the intuition behind the measure G_t. In particular, in the unconstrained case, i.e., when X = R^d, the measure upper-bounds the squared norm of the gradient; therefore, convergence of G_t implies convergence to a first-order stationary point. We are now ready to state our main convergence theorem.

Theorem 5.1. In Algorithm 1, choose the parameters κ = ρK^{2/3}/(λL), c = 96λ²L²/(Kρ²) + ρ/(72κ³λLI²), w_t = max{48³I⁶/K² − t − I, 14³K^{0.5}}, λ > 0, and η_t = κ/(w_t + t + I)^{1/3}; then we have:

(1/T) Σ_{t=0}^{T−1} E[G_t] ≤ (96LI²/T + 2L/(K^{2/3}T^{2/3}))(f(x₀) − f*) + (72I⁴/(bT) + 3I²/(2bK^{2/3}T^{2/3}))σ² + 192(96I²/T + 1/(K^{2/3}T^{2/3}))(σ²/(4b₁) + 2ζ²/21) log(T + 1).

Note that by choosing a proper number of local updates I and using a mini-batch of samples for the first iteration to reduce the noise, our result matches the best known convergence rate for stochastic federated gradient methods Khanduri et al. (2021a), i.e., our algorithm has gradient complexity Õ(ϵ^{-1.5}) and communication complexity Õ(ϵ^{-1}); moreover, we achieve linear speed-up w.r.t. the number of clients K. More formally, we have the following corollary: Corollary 1.
Suppose in Algorithm 1 we set I = O((T/K²)^{1/6}) and use an initial mini-batch of size O(I²); then we have (1/T) Σ_{t=1}^{T} E[G_t] = Õ(1/(K^{2/3}T^{2/3})), and to reach an ϵ-stationary point, we need Õ(ϵ^{-1.5}/K) steps and Õ(ϵ^{-1}) communication rounds.

6. NUMERICAL EXPERIMENTS

In this section, we perform numerical experiments to verify the efficacy of the proposed adaptive federated learning framework FedDA. More specifically, we consider the FedDA-MVR variant here and defer experiments for other variants to Section 8 of the appendix. We perform two sets of experiments. In the first, we consider a biomedical prediction task: predicting the survival of colorectal cancer. In this task, we impose an L1 sparsity constraint; the L1 constraint improves the explainability of the model, which is essential for biomedical applications. In the second experiment, we consider a federated multiclass image classification task on two datasets: CIFAR10 Krizhevsky et al. (2009) and FEMNIST Caldas et al. (2018). All experiments are run on a machine with an Intel Xeon Gold 6248 CPU and 4 Nvidia Tesla V100 GPUs. The code is written in PyTorch, and we simulate the Federated Learning environment through the torch.distributed package.

6.1. COLORECTAL CANCER SURVIVAL PREDICTION WITH SPARSE CONSTRAINTS

In this subsection, we consider a colorectal cancer prediction task on the PathMNIST dataset Yang et al. (2021); Kather et al. (2019), which contains 9 classes and 89996 training images; we randomly split the training set equally into 10 clients and use the original test set for evaluation. In this task, we impose an L1 sparsity constraint to improve the explainability of the model. We compare with the following baselines: FedAvg McMahan et al. (2017) and FedDualAvg Yuan et al. (2021). FedDualAvg is a recently proposed federated algorithm for composite optimization problems; in FedDualAvg, clients maintain dual states locally, but adaptive gradients are not applied. For our FedDA-MVR, we train with and without the L1 constraint. We tune the hyper-parameters for each method and choose the best setting. The results are summarized in Figure 1; the plots are averaged over 5 independent runs and then smoothed. In Figure 1, FedDualAvg and FedDA-MVR-L1 consider the L1 constraint, while FedAvg and FedDA-MVR do not. We show the train/test accuracy and also the number of non-zero (below a threshold) elements in the parameters (the rightmost plot in Figure 1). As shown in the plots, FedDA-MVR-L1 outperforms unconstrained FedDA-MVR in all metrics, which shows the importance of considering constrained problems in Federated Learning. Furthermore, FedDA-MVR-L1 also outperforms FedAvg and FedDualAvg in all metrics, showing that our algorithm can effectively exploit adaptive gradient information in the constrained case. For more details of this experiment, such as the hyper-parameter choices, please refer to Section 8 of the appendix.
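When the adaptive matrix is a multiple of the identity, as in equation 8, the L1-constrained proximal step reduces to a Euclidean projection of the unconstrained minimizer onto the L1 ball. The sketch below implements the standard sort-based projection (Duchi et al., 2008) as one illustrative way to carry out this step; it is not taken from the paper's code:

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {x : ||x||_1 <= radius} via the
    sort-based algorithm of Duchi et al. (2008)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]               # sorted magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)  # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

print(project_l1_ball(np.array([3.0, 0.0])))   # [1. 0.]
print(project_l1_ball(np.array([1.0, 1.0])))   # [0.5 0.5]
```

The soft-thresholding in the last line is what zeroes out small coordinates and produces the sparse models measured by the density metric.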

6.2. IMAGE CLASSIFICATION TASK WITH CIFAR10 AND FEMNIST

In this subsection, we consider an unconstrained image classification task in both homogeneous and heterogeneous cases. More specifically, we consider two datasets: CIFAR10 Krizhevsky et al. (2009) and FEMNIST Caldas et al. (2018). CIFAR10 is a widely used image classification benchmark with 50000 training images, and we construct both homogeneous and heterogeneous cases from it. For the homogeneous case, we uniformly randomly distribute the images among 10 clients; the heterogeneous case is deferred to Appendix 8.3. FEMNIST is a federated dataset of handwritten digits from 3550 users (we randomly sample 500 users in our experiments); its data distribution is heterogeneous due to the different writing styles of the users. In this task, we compare our method with the following baselines: the non-adaptive methods FedAvg McMahan et al. (2017), FedCM, and STEM Khanduri et al. (2021a), and the adaptive methods FedAdam Reddi et al. (2020), Local-AMSGrad Chen et al. (2020b), and MIME-MVR Karimireddy et al. (2020a). For all methods, we tune the hyper-parameters to find the best setting. The results are summarized in Figure 2 (CIFAR10) and Figure 3 (FEMNIST); the plots are averaged over 5 runs and then smoothed. As shown in the figures, our FedDA-MVR outperforms all baselines. In addition, the FedAvg algorithm has competitive training performance but tends to overfit the training data severely. We also observe that adaptive methods in general achieve better train and test performance. Finally, the superior performance of our method compared with the adaptive baselines shows that our method exploits adaptive information better; for example, MIME-MVR also exploits the momentum-based variance reduction technique, but it fixes all optimizer states during local updates, whereas we fix only the adaptive matrix and update the momentum ν^{(k)}_t, k ∈ [K], at every step. For more details, including the hyper-parameter selection, please refer to Section 8 of the appendix.
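A minimal sketch of the homogeneous CIFAR10 split described above (the function name and seed are ours; the paper does not specify this implementation):

```python
import numpy as np

def split_homogeneous(num_samples, num_clients, seed=0):
    """Shuffle sample indices uniformly at random and split them equally
    across clients, mirroring the homogeneous CIFAR10 setup above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_samples)
    return np.array_split(idx, num_clients)

shards = split_homogeneous(50000, 10)
print([len(s) for s in shards])  # ten disjoint shards of 5000 indices each
```

Because each client receives an i.i.d. sample of the full training set, the local objectives f^(k) are statistically identical, which is what makes this split the homogeneous case.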

7. CONCLUSION

In this paper, we proposed the FedDA framework to incorporate adaptive gradients into the Federated Learning setting. More specifically, we adopted the mirror descent view of adaptive gradients: we maintain and average the dual states during training, and we fix the adaptive matrix during local training so that the dual space is shared by all clients. We also analyzed the convergence of our framework: for the FedDA-MVR variant, we proved that it reaches an ϵ-stationary point with Õ(ϵ^{-1.5}) gradient queries and Õ(ϵ^{-1}) communication rounds; these results match the best known gradient complexity and communication complexity of stochastic federated algorithms in the non-convex case. Finally, we validated our algorithm on both constrained and unconstrained tasks; the numerical results show the superior performance of our algorithm compared to various baseline methods.

8. MORE EXPERIMENTAL DETAILS AND RESULTS

In this section, we provide additional experimental details and results. In Section 8.1, we consider more variants of FedDA besides FedDA-MVR. More specifically, we consider four variants: we denote the two update rules for the adaptive matrix H_τ in equation 7 and equation 8 as case 1 and case 2, and similarly denote equation 5 and equation 6 as case 1 and case 2 of the gradient estimation. This gives four variants, denoted FedDA-i-j for i, j ∈ {1, 2}, where i indicates the choice of gradient estimation and j the choice of adaptive matrix update rule. Note that FedDA-MVR corresponds to FedDA-1-1, since it uses case 1 of the gradient estimation and case 1 of the adaptive matrix update in Algorithm 1. We also provide more details such as the hyper-parameter choices. Then, in Section 8.2, we perform ablation studies and compare FedDA with the baselines in more detail; in Section 8.3, we include experiments on a heterogeneous dataset constructed from CIFAR10; finally, in Section 8.4, we show the form of FedDA when I = 1, i.e., with no local steps.

8.1.1. COLORECTAL CANCER SURVIVAL PREDICTION WITH SPARSE CONSTRAINTS

In this task, we use a 4-layer convolutional neural network with 32 filters at each layer. We have 10 clients and run 20000 steps (T), average states with interval 5 (I), and use a mini-batch size of 16. We calculate the density with threshold 0.01. For the other hyper-parameters, we perform a grid search and choose the best setting for each method. More specifically, for the SGD method, we use learning rate 0.01; for the FedDualAvg algorithm, we use local learning rate 0.1, global learning rate 0.1, and L1 constraint 0.01; for our FedDA-MVR, we use learning rate 0.01, w as 100000, c as 5000000, β as 0.999 and τ as 0.01; for the L1-regularized version FedDA-MVR-L1, we also add the L1 constraint 0.01.
For the other variants of FedDA: for FedDA-2-1, we use learning rate 0.001, α as 0.9, β as 0.999, and τ as 0.01; for FedDA-1-2, we use learning rate 1, w as 10000, c as 200, β as 0.999, τ as 0.001, and L1 constraint 0.01; for FedDA-2-2, we use learning rate 0.01, α as 0.9, β as 0.999, τ as 0.01, and L1 constraint 0.01. The experimental results for the different variants of FedDA are summarized in Figure 4. As shown by the plots, all variants of FedDA achieve good performance, but FedDA-MVR (FedDA-1-1) yields the sparsest model as measured by the density metric.

8.1.2. IMAGE CLASSIFICATION TASK WITH CIFAR10 AND FEMNIST

In this unconstrained federated image classification task, we use a 4-layer convolutional neural network with 64 filters at each layer. For the FEMNIST dataset, we randomly sample 50 users at each global round. We run 20000 steps (T), average states with interval 5 (I), and use a mini-batch size of 16. For the other hyper-parameters, we perform a grid search and choose the best setting for each method. In the CIFAR10 experiments, for the SGD method, we use learning rate 0.005; for the FedCM algorithm, we use learning rate 0.01 and momentum coefficient α as 0.9; for the FedAdam algorithm, we use local learning rate 0.001, global learning rate 0.002, momentum coefficient 0.9, and coefficient for the adaptive matrix β as 0.999; for the Local-Adapt algorithm, we use local learning rate 0.001, global learning rate 0.002, momentum coefficient 0.9, and coefficient for the adaptive matrix β as 0.999; for our FedDA-MVR, we use learning rate 0.02, w as 10000, c as 1000000, β as 0.999 and τ as 0.01. For the other variants of FedDA: for FedDA-2-1, we use learning rate 0.001, α as 0.9, β as 0.999, τ as 0.01; for FedDA-1-2, we use learning rate 1, w as 5000, c as 100, β as 0.999, τ as 0.01; for FedDA-2-2, we use learning rate 0.01, α as 0.9, β as 0.999, τ as 0.01.
Then, in the FEMNIST experiments, for the SGD method, we use learning rate 0.1; for the FedCM algorithm, we use learning rate 0.1 and momentum coefficient α as 0.9; for the FedAdam algorithm, we use local learning rate 0.02, global learning rate 0.04, momentum coefficient 0.9, and coefficient for the adaptive matrix β as 0.999; for the Local-Adapt algorithm, we use local learning rate 0.02, global learning rate 0.02, momentum coefficient 0.9, and coefficient for the adaptive matrix β as 0.999; for the Local-AMSGrad algorithm, we use learning rate 0.0005, momentum coefficient 0.9, and adaptive matrix coefficient 0.999; for the MIME-MVR algorithm, we use learning rate 1, w as 10000, and c as 400; for the STEM algorithm, we use learning rate 1, w as 10000, and c as 400; for our FedDA-MVR, we use learning rate 0.02, w as 10000, c as 1000000, β as 0.999 and τ as 0.01. For the other variants of FedDA: for FedDA-2-1, we use learning rate 0.001, α as 0.9, β as 0.999, τ as 0.01; for FedDA-1-2, we use learning rate 1, w as 5000, c as 100, β as 0.999, τ as 0.01; for FedDA-2-2, we use learning rate 0.01, α as 0.9, β as 0.999, τ as 0.01. The experimental results for the different variants of FedDA are summarized in Figures 5 and 6. As shown by the plots, all variants of FedDA achieve good performance. FedDA-MVR (FedDA-1-1) achieves the best performance on most metrics, though we observe that its test loss shows some degree of overfitting in the late training stage.

8.2. MORE DISCUSSION OF EXPERIMENTAL RESULTS

In this subsection, we make a more detailed comparison between our FedDA and the other baselines (the experiments are over the homogeneous CIFAR10 dataset). In Figure 7, we compare FedCM with FedDA-2-1 and FedDA-2-2 for different values of the local steps I. Since FedDA-2-1 and FedDA-2-2 do not use variance-reduction acceleration, their superior performance shows the effectiveness of using adaptive gradients in our framework. Next, Figure 8 presents additional comparisons.

8.3. EXPERIMENTS ON HETEROGENEOUS CIFAR10

For the hyper-parameters, we perform a grid search and choose the best setting for each method. For the SGD method, we use learning rate 0.01; for the FedCM algorithm, we use learning rate 0.01 and momentum coefficient α as 0.9; for the FedAdam algorithm, we use local learning rate 0.001, global learning rate 0.002, momentum coefficient 0.9, and coefficient for the adaptive matrix β as 0.999; for the Local-Adapt algorithm, we use local learning rate 0.001, global learning rate 0.002, momentum coefficient 0.9, and coefficient for the adaptive matrix β as 0.999; for the Local-AMSGrad algorithm, we use learning rate 0.001, momentum coefficient 0.9, and adaptive matrix coefficient 0.999; for the MIME-MVR algorithm, we use learning rate 0.1, w as 100, and c as 2000; for the STEM algorithm, we use learning rate 0.1, w as 100, and c as 2000; for our FedDA-MVR/FedDA-1-1, we use learning rate 0.02, w as 10000, c as 1000000, β as 0.999 and τ as 0.01. For the other variants of FedDA: for FedDA-2-1, we use learning rate 0.001, α as 0.9, β as 0.999, τ as 0.01; for FedDA-1-2, we use learning rate 1, w as 5000, c as 100, β as 0.999, τ as 0.01; for FedDA-2-2, we use learning rate 0.01, α as 0.9, β as 0.999, τ as 0.01.

8.4. A SPECIAL CASE OF FEDDA: I = 1

To better illustrate the structure of our FedDA, we give the form of a special case, i.e. I = 1, in this subsection. The pseudo-code is summarized in Algorithm 3: at each epoch, the server first obtains the new primal state through equation 2 (line 4); then each client (we assume full participation for simplicity) updates the gradient estimate ν_τ locally (line 6), and the server averages these estimates (line 8); the adaptive matrix is also updated by the server (line 8).

Algorithm 3 FedDA-Distributed
1: Input: number of global epochs $E$, tuning parameters $\{\alpha_\tau, \beta_\tau, \eta_\tau\}_{\tau=1}^{E}$;
2: Initialize: choose $x_0 \in \mathcal{X}$ and compute $\nu_0 = \frac{1}{K}\sum_{k=1}^{K} \nabla f^{(k)}(x_0; \mathcal{B}_0^{(k)})$, where $\{\mathcal{B}_0^{(k)}\}_{k=1}^{K}$ are mini-batches of random samples selected from each of the $K$ clients;
3: for $\tau = 0$ to $E-1$ do
4: &nbsp;&nbsp; Compute $x_{\tau+1} = \arg\min_{x \in \mathcal{X}} \big\{\eta_{\tau+1}\langle x, \nu_\tau\rangle + \frac{1}{2\lambda}(x - x_\tau)^T H_\tau (x - x_\tau)\big\}$;
5: &nbsp;&nbsp; for each client $k \in [K]$ in parallel do
6: &nbsp;&nbsp;&nbsp;&nbsp; Compute $\nu_{\tau+1}^{(k)} = \mathcal{U}(\nu_\tau, x_{\tau+1}, x_\tau; \alpha_{\tau+1}, \mathcal{B}_{\tau+1}^{(k)})$, where $\mathcal{B}_{\tau+1}^{(k)}$ is a mini-batch of random samples from client $k$;
7: &nbsp;&nbsp; end for
8: &nbsp;&nbsp; Compute $\nu_{\tau+1} = \frac{1}{K}\sum_{k\in[K]} \nu_{\tau+1}^{(k)}$ and $H_{\tau+1} = \mathcal{V}(H_\tau, \nu_\tau)$;
9: end for
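To make the structure concrete, the following Python sketch instantiates the loop above on a toy problem under simplifying assumptions we introduce for illustration: unconstrained domain X = R^d (so line 4 has a closed form), exact local gradients, an MVR-style estimator for U, and a diagonal Adam-style adaptive matrix for V with lower threshold τ. It is a sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, E = 5, 4, 200                      # dimension, clients, global epochs
lam, eta, alpha, beta, tau = 1.0, 0.05, 0.1, 0.999, 0.01

# Toy heterogeneous objectives: f^(k)(x) = 0.5 * ||x - b_k||^2, so the
# global minimizer is the mean of the b_k (exact gradients for clarity).
b = rng.normal(size=(K, d))
grad = lambda k, x: x - b[k]

x = np.zeros(d)
nu = np.mean([grad(k, x) for k in range(K)], axis=0)   # line 2: initial estimate
h = np.ones(d)                                         # diagonal adaptive matrix H

for _ in range(E):
    # Line 4: argmin of eta<x, nu> + (1/2 lam)(x - x_old)^T H (x - x_old)
    # over R^d is x_old - lam * eta * H^{-1} nu.
    x_new = x - lam * eta * nu / h
    # Line 6 (per client): MVR-style update U(nu, x_new, x; alpha, batch).
    local = [grad(k, x_new) + (1 - alpha) * (nu - grad(k, x)) for k in range(K)]
    # Line 8 (server): average the estimates, then update H via V(H, nu).
    nu, h = np.mean(local, axis=0), np.maximum(np.sqrt(beta * h**2 + (1 - beta) * nu**2), tau)
    x = x_new

print(np.linalg.norm(x - b.mean(axis=0)))   # distance to the minimizer, near 0
```

With exact gradients the averaged estimate tracks the true gradient, so the iterate contracts toward the minimizer of the average objective.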

9. PROOF OF THEOREMS

In this section, we provide the convergence analysis of our algorithm.

9.1. PRELIMINARY PROPOSITIONS

Proposition 1. Let $\{\theta_k\}$, $k \in [K]$, be $K$ vectors. Then the following are true: $\|\theta_i + \theta_j\|^2 \le (1+\lambda)\|\theta_i\|^2 + (1 + \frac{1}{\lambda})\|\theta_j\|^2$ for any $\lambda > 0$, and $\big\|\sum_{k=1}^{K} \theta_k\big\|^2 \le K \sum_{k=1}^{K} \|\theta_k\|^2$.

Proposition 2. For a finite sequence $z^{(k)} \in \mathbb{R}^d$, $k \in [K]$, define $\bar{z} := \frac{1}{K}\sum_{k=1}^{K} z^{(k)}$. We then have $\sum_{k=1}^{K} \|z^{(k)} - \bar{z}\|^2 \le \sum_{k=1}^{K} \|z^{(k)}\|^2$.

Proposition 3. Let $z_0 > 0$ and $z_1, z_2, \ldots, z_T \ge 0$. We have $\sum_{t=1}^{T} \frac{z_t}{z_0 + \sum_{i=1}^{t} z_i} \le \log\Big(1 + \frac{\sum_{i=1}^{T} z_i}{z_0}\Big)$.

These propositions are standard results. For proofs, the reader can refer to Lemma 3 of Karimireddy et al. (2019a) for Proposition 1, and to Lemma C.1 and Lemma C.2 in Khanduri et al. (2021a) for Propositions 2 and 3.
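As a quick sanity check, the three propositions can be verified numerically on random data (this snippet is our own illustration, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Proposition 1: Young's inequality and the norm-of-sum bound.
ti, tj, lam = rng.normal(size=8), rng.normal(size=8), 0.7
lhs = np.linalg.norm(ti + tj) ** 2
rhs = (1 + lam) * np.linalg.norm(ti) ** 2 + (1 + 1 / lam) * np.linalg.norm(tj) ** 2
assert lhs <= rhs

theta = rng.normal(size=(5, 8))                 # K = 5 vectors
assert np.linalg.norm(theta.sum(0)) ** 2 <= 5 * (np.linalg.norm(theta, axis=1) ** 2).sum()

# Proposition 3: sum_t z_t / (z_0 + sum_{i<=t} z_i) <= log(1 + sum_i z_i / z_0).
z0, z = 2.0, rng.uniform(0, 3, size=50)
csum = z0 + np.cumsum(z)
assert (z / csum).sum() <= np.log(1 + z.sum() / z0)
print("all propositions verified on random data")
```
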

9.2. PRELIMINARY LEMMAS IN LOCAL UPDATES

We first introduce some notation. For $0 \le i \le I$, we denote:
$$\psi_{\tau,i}^{(k)}(x) = -\langle x, z_{\tau,i}^{(k)}\rangle + \frac{1}{2\lambda}(x - x_{\tau,0}^{(k)})^T H_{\tau-1} (x - x_{\tau,0}^{(k)}),$$
then, by definition (Line 4 of Algorithm 2), we have $x_{\tau,i}^{(k)} = \arg\min_{x\in\mathcal{X}} \psi_{\tau,i}^{(k)}(x)$. We also define
$$\bar\psi_{\tau,i}(x) = -\langle x, \bar z_{\tau,i}\rangle + \frac{1}{2\lambda}(x - \bar x_{\tau,0})^T H_{\tau-1}(x - \bar x_{\tau,0}), \qquad \bar x_{\tau,i} = \arg\min_{x\in\mathcal{X}} \bar\psi_{\tau,i}(x),$$
where $\bar z_{\tau,i} = \frac{1}{K}\sum_{k=1}^{K} z_{\tau,i}^{(k)}$ is the virtual average of $z_{\tau,i}^{(k)}$.

Remark 4. In Algorithm 1, at each epoch $\tau$, we only sample $r$ clients from the $K$ clients to perform an update. For $k \notin S_\tau$, we define the relevant variables for convenience of analysis; they are not actually computed.

Remark 5. Note that the global primal state $\bar x_i$ is not the arithmetic mean of the local states $x_i^{(k)}$ in general.

Finally, we also define
$$\bar d_{\tau,i} = \frac{1}{\eta_{\tau,i}}(\bar x_{\tau,i} - \bar x_{\tau,i+1}), \qquad d_{\tau,i}^{(k)} = \frac{1}{\eta_{\tau,i}}(x_{\tau,i}^{(k)} - x_{\tau,i+1}^{(k)}), \quad k\in[K],\ i\in[I].$$
Furthermore, recall that by the procedure of Algorithm 2 (line 6), we have
$$\bar\nu_{\tau,i} = \frac{1}{\eta_{\tau,i}}(\bar z_{\tau,i} - \bar z_{\tau,i+1}), \qquad \nu_{\tau,i}^{(k)} = \frac{1}{\eta_{\tau,i}}(z_{\tau,i}^{(k)} - z_{\tau,i+1}^{(k)}), \quad k\in[K],\ i\in[I].$$

Remark 6. When it is clear from the context, we omit the global epoch $\tau$ in the subscript of the definitions, i.e. we use $\psi_i^{(k)}(x)$, $\bar\psi_i(x)$, $x_i^{(k)}$, $\bar x_i$, $\bar d_i$, $d_i^{(k)}$, $\bar\nu_i$, $\nu_i^{(k)}$ and $H$.

Next, we introduce the following lemma related to the local updates. We omit the global epoch number $\tau$ in the subscript.

Lemma 1. For any $i \in [I]$ and $k \in [K]$, the following inequalities are satisfied:
1. $\lambda\langle \nu_i^{(k)}, d_i^{(k)}\rangle \ge \rho\|d_i^{(k)}\|^2$ and $\lambda\|\nu_i^{(k)}\| \ge \rho\|d_i^{(k)}\|$;
2. $\lambda\langle \bar\nu_i, \bar d_i\rangle \ge \rho\|\bar d_i\|^2$ and $\lambda\|\bar\nu_i\| \ge \rho\|\bar d_i\|$;
3. $\lambda\|z_i^{(k)} - \bar z_i\| \ge \rho\|x_i^{(k)} - \bar x_i\|$.

Proof. The first and second claims follow similar derivations, so we provide only the derivation for the first claim.
First, if $i = 1$: we have $x_1^{(k)} = \arg\min_{x\in\mathcal{X}} -\langle x, z_1^{(k)}\rangle + \frac{1}{2\lambda}(x - x_0^{(k)})^T H (x - x_0^{(k)})$. By the first-order optimality condition, we have:
$$\Big\langle -z_1^{(k)} + \frac{1}{\lambda} H(x_1^{(k)} - x_0^{(k)}),\ u - x_1^{(k)}\Big\rangle \ge 0, \quad \forall\, u \in \mathcal{X}.$$
Choosing $u = x_0^{(k)}$ and using the fact that $z_1^{(k)} = -\eta_0 \nu_0^{(k)}$, we have:
$$\eta_0\|\nu_0^{(k)}\| \times \|x_0^{(k)} - x_1^{(k)}\| \ge \eta_0\langle \nu_0^{(k)}, x_0^{(k)} - x_1^{(k)}\rangle \ge \frac{1}{\lambda}(x_1^{(k)} - x_0^{(k)})^T H (x_1^{(k)} - x_0^{(k)}) \ge \frac{\rho}{\lambda}\|x_0^{(k)} - x_1^{(k)}\|^2,$$
where we use the Cauchy-Schwarz inequality in the leftmost inequality and the strong convexity assumption on the adaptive matrix ($H \succeq \rho I$) in the rightmost inequality; this gives the result in the lemma for the base case.

Next, for $i \ge 1$, by the definition of $\psi_i^{(k)}(x)$, we have:
$$\psi_i^{(k)}(x_{i+1}^{(k)}) - \psi_i^{(k)}(x_i^{(k)}) = -\langle z_i^{(k)}, x_{i+1}^{(k)} - x_i^{(k)}\rangle + \frac{1}{2\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H (x_{i+1}^{(k)} + x_i^{(k)} - 2x_0^{(k)}). \qquad(16)$$
By the definition of $x_i^{(k)}$ and the first-order optimality condition, we have $\langle -z_i^{(k)} + \frac{1}{\lambda}H(x_i^{(k)} - x_0^{(k)}),\ u - x_i^{(k)}\rangle \ge 0$ for all $u\in\mathcal{X}$. Picking $u = x_{i+1}^{(k)}$, we have $-\langle z_i^{(k)}, x_{i+1}^{(k)} - x_i^{(k)}\rangle \ge -\frac{1}{\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H (x_i^{(k)} - x_0^{(k)})$. Plugging this inequality into equation 16, we have:
$$\psi_i^{(k)}(x_{i+1}^{(k)}) - \psi_i^{(k)}(x_i^{(k)}) \ge -\frac{1}{\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H(x_i^{(k)} - x_0^{(k)}) + \frac{1}{2\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H (x_{i+1}^{(k)} + x_i^{(k)} - 2x_0^{(k)}) = \frac{1}{2\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H (x_{i+1}^{(k)} - x_i^{(k)}).$$
Similarly, for $\psi_{i+1}^{(k)}$, we have:
$$\psi_{i+1}^{(k)}(x_{i+1}^{(k)}) - \psi_{i+1}^{(k)}(x_i^{(k)}) = -\langle z_{i+1}^{(k)}, x_{i+1}^{(k)} - x_i^{(k)}\rangle + \frac{1}{2\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H (x_{i+1}^{(k)} + x_i^{(k)} - 2x_0^{(k)}),$$
and by the definition of $x_{i+1}^{(k)}$ and the first-order optimality condition, we get $\langle -z_{i+1}^{(k)} + \frac{1}{\lambda}H(x_{i+1}^{(k)} - x_0^{(k)}),\ u - x_{i+1}^{(k)}\rangle \ge 0$ for all $u \in \mathcal{X}$. Picking $u = x_i^{(k)}$, we have $-\langle z_{i+1}^{(k)}, x_{i+1}^{(k)} - x_i^{(k)}\rangle \le -\frac{1}{\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H (x_{i+1}^{(k)} - x_0^{(k)})$. Plugging this into the above equality, we have:
$$\psi_{i+1}^{(k)}(x_{i+1}^{(k)}) - \psi_{i+1}^{(k)}(x_i^{(k)}) \le -\frac{1}{\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H(x_{i+1}^{(k)} - x_0^{(k)}) + \frac{1}{2\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H(x_{i+1}^{(k)} + x_i^{(k)} - 2x_0^{(k)}) = -\frac{1}{2\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H (x_{i+1}^{(k)} - x_i^{(k)}).$$
Next, by the definitions of $\psi_i^{(k)}(x)$ and $\psi_{i+1}^{(k)}(x)$, and since $z_{i+1}^{(k)} = z_i^{(k)} - \eta_i \nu_i^{(k)}$, we have:
$$\psi_{i+1}^{(k)}(x_{i+1}^{(k)}) - \psi_{i+1}^{(k)}(x_i^{(k)}) = \psi_i^{(k)}(x_{i+1}^{(k)}) - \psi_i^{(k)}(x_i^{(k)}) + \eta_i \langle \nu_i^{(k)}, x_{i+1}^{(k)} - x_i^{(k)}\rangle.$$
Finally, combining the above relations, we have:
$$\eta_i \|\nu_i^{(k)}\| \times \|x_i^{(k)} - x_{i+1}^{(k)}\| \ge \eta_i \langle \nu_i^{(k)}, x_i^{(k)} - x_{i+1}^{(k)}\rangle \ge \frac{1}{\lambda}(x_{i+1}^{(k)} - x_i^{(k)})^T H(x_{i+1}^{(k)} - x_i^{(k)}) \ge \frac{\rho}{\lambda}\|x_i^{(k)} - x_{i+1}^{(k)}\|^2,$$
where we use the Cauchy-Schwarz inequality in the leftmost inequality and the strong convexity assumption on the adaptive matrix in the rightmost inequality. Dividing both sides by $\eta_i^2$ and recalling $d_i^{(k)} = \frac{1}{\eta_i}(x_i^{(k)} - x_{i+1}^{(k)})$ yields the first claim of the lemma.

Next, we prove the third claim. By the definition of $\psi_{i+1}^{(k)}$, we have:
$$\psi_{i+1}^{(k)}(x_{i+1}^{(k)}) - \psi_{i+1}^{(k)}(\bar x_{i+1}) = -\langle z_{i+1}^{(k)}, x_{i+1}^{(k)} - \bar x_{i+1}\rangle + \frac{1}{2\lambda}(x_{i+1}^{(k)} - \bar x_{i+1})^T H (x_{i+1}^{(k)} + \bar x_{i+1} - 2x_0^{(k)}).$$
By the definition of $x_{i+1}^{(k)}$ and the first-order optimality condition, we have $\langle -z_{i+1}^{(k)} + \frac{1}{\lambda}H(x_{i+1}^{(k)} - x_0^{(k)}),\ u - x_{i+1}^{(k)}\rangle \ge 0$ for all $u\in\mathcal{X}$. Picking $u = \bar x_{i+1}$, we have $-\langle z_{i+1}^{(k)}, x_{i+1}^{(k)} - \bar x_{i+1}\rangle \le -\frac{1}{\lambda}(x_{i+1}^{(k)} - \bar x_{i+1})^T H (x_{i+1}^{(k)} - x_0^{(k)})$. Plugging this back into the above equality, we have:
$$\psi_{i+1}^{(k)}(x_{i+1}^{(k)}) - \psi_{i+1}^{(k)}(\bar x_{i+1}) \le -\frac{1}{2\lambda}(x_{i+1}^{(k)} - \bar x_{i+1})^T H (x_{i+1}^{(k)} - \bar x_{i+1}).$$
Then, for $\bar\psi_{i+1}(x)$, we have:
$$\bar\psi_{i+1}(x_{i+1}^{(k)}) - \bar\psi_{i+1}(\bar x_{i+1}) = -\langle \bar z_{i+1}, x_{i+1}^{(k)} - \bar x_{i+1}\rangle + \frac{1}{2\lambda}(x_{i+1}^{(k)} - \bar x_{i+1})^T H(x_{i+1}^{(k)} + \bar x_{i+1} - 2\bar x_0).$$
By the definition of $\bar x_{i+1}$ and the first-order optimality condition, we have $\langle -\bar z_{i+1} + \frac{1}{\lambda}H(\bar x_{i+1} - \bar x_0),\ u - \bar x_{i+1}\rangle \ge 0$ for all $u \in\mathcal{X}$. Picking $u = x_{i+1}^{(k)}$, we have $-\langle \bar z_{i+1}, x_{i+1}^{(k)} - \bar x_{i+1}\rangle \ge -\frac{1}{\lambda}(x_{i+1}^{(k)} - \bar x_{i+1})^T H(\bar x_{i+1} - \bar x_0)$. Plugging this back into the above equality, we have:
$$\bar\psi_{i+1}(x_{i+1}^{(k)}) - \bar\psi_{i+1}(\bar x_{i+1}) \ge \frac{1}{2\lambda}(x_{i+1}^{(k)} - \bar x_{i+1})^T H (x_{i+1}^{(k)} - \bar x_{i+1}).$$
Next, since $x_0^{(k)} = \bar x_0$ (all clients start each epoch from the same state), we have:
$$\psi_{i+1}^{(k)}(x_{i+1}^{(k)}) - \psi_{i+1}^{(k)}(\bar x_{i+1}) = \bar\psi_{i+1}(x_{i+1}^{(k)}) - \bar\psi_{i+1}(\bar x_{i+1}) - \langle z_{i+1}^{(k)} - \bar z_{i+1},\ x_{i+1}^{(k)} - \bar x_{i+1}\rangle.$$
Combining the above relations, we have:
$$\|z_{i+1}^{(k)} - \bar z_{i+1}\| \times \|x_{i+1}^{(k)} - \bar x_{i+1}\| \ge \langle z_{i+1}^{(k)} - \bar z_{i+1},\ x_{i+1}^{(k)} - \bar x_{i+1}\rangle \ge \frac{1}{\lambda}(x_{i+1}^{(k)} - \bar x_{i+1})^T H (x_{i+1}^{(k)} - \bar x_{i+1}) \ge \frac{\rho}{\lambda}\|x_{i+1}^{(k)} - \bar x_{i+1}\|^2,$$
where the first inequality is by the Cauchy-Schwarz inequality and the last inequality is by $H \succeq \rho I$. Multiplying both sides by $\lambda$ gives the third claim. This concludes the proof.
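In the special case of an unconstrained domain and a diagonal adaptive matrix, the local argmin has the closed form $x_i^{(k)} = x_0^{(k)} + \lambda H^{-1} z_i^{(k)}$, and the first claim of Lemma 1 can be checked directly. The following snippet (our own illustration, not the paper's code) does so on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam, eta = 6, 0.5, 0.1
h = rng.uniform(1.0, 3.0, size=d)      # diagonal adaptive matrix with H >= rho*I
rho = h.min()

x0 = rng.normal(size=d)
z_i = rng.normal(size=d)               # dual state at step i
nu_i = rng.normal(size=d)              # gradient estimate at step i
z_next = z_i - eta * nu_i              # dual-averaging update of z

# With X = R^d and diagonal H, the argmin of psi_i is explicit.
x_i = x0 + lam * z_i / h
x_next = x0 + lam * z_next / h
d_i = (x_i - x_next) / eta             # equals lam * H^{-1} nu_i

assert lam * np.dot(nu_i, d_i) >= rho * np.dot(d_i, d_i) - 1e-12
assert lam * np.linalg.norm(nu_i) >= rho * np.linalg.norm(d_i) - 1e-12
print("Lemma 1, claim 1 holds in the unconstrained diagonal case")
```
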

9.3. STATE CONSENSUS ERROR

As each client performs local updates, the states $z_{\tau,i}^{(k)}$ and $\nu_{\tau,i}^{(k)}$ drift apart; the following lemmas bound this difference. We omit the global epoch number $\tau$ in the subscript.

Lemma 2. For each $0 \le i \le I$, suppose the iterates $z_i^{(k)}$, $k\in[K]$, are generated by Algorithm 2. We have:
$$\sum_{k=1}^{K} \mathbb{E}\|z_i^{(k)} - \bar z_i\|^2 \le (I-1)\sum_{\ell=1}^{i-1}\eta_\ell^2 \sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2,$$
where the expectation is w.r.t. the stochasticity of the algorithm.

Proof. If $i = 0$, based on Algorithm 2 we have $z_0^{(k)} = \bar z_0 = 0$, and the inequality in the lemma holds trivially. Otherwise, we have $z_i^{(k)} = -\sum_{\ell=0}^{i-1}\eta_\ell \nu_\ell^{(k)}$ and $\bar z_i = -\sum_{\ell=0}^{i-1}\eta_\ell \bar\nu_\ell$. So we have:
$$\sum_{k=1}^{K}\|z_i^{(k)} - \bar z_i\|^2 = \sum_{k=1}^{K}\Big\|\sum_{\ell=1}^{i-1}\big(\eta_\ell \nu_\ell^{(k)} - \eta_\ell\bar\nu_\ell\big)\Big\|^2 \le (I-1)\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2,$$
where the equality uses the fact that $\nu_0^{(k)} = \bar\nu_0$ for $k\in[K]$, and the inequality uses Proposition 1 and the fact that $i \le I$. We obtain the claim in the lemma by taking expectations on both sides. This completes the proof.

Lemma 3. For $i \in [I]$, we have:
$$\sum_{k=1}^{K}\mathbb{E}\|d_i^{(k)} - \bar d_i\|^2 \le \frac{4\lambda^2(I-1)}{\rho^2\eta_i^2}\sum_{\ell=1}^{i}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2,$$
where the expectation is w.r.t. the stochasticity of the algorithm.

Proof. First, when $i = 0$: $x_0^{(k)} = \bar x_0$ and $z_1^{(k)} = \bar z_1$, so $x_1^{(k)} = \bar x_1$ by Line 5 of Algorithm 2, and then $\eta_0 d_0^{(k)} = x_0^{(k)} - x_1^{(k)} = \bar x_0 - \bar x_1 = \eta_0 \bar d_0$; the inequality in the lemma holds trivially. Next, when $i > 0$, we have:
$$\eta_i^2\|d_i^{(k)} - \bar d_i\|^2 = \|x_i^{(k)} - x_{i+1}^{(k)} - (\bar x_i - \bar x_{i+1})\|^2 \le 2\|x_i^{(k)} - \bar x_i\|^2 + 2\|x_{i+1}^{(k)} - \bar x_{i+1}\|^2 \le \frac{2\lambda^2}{\rho^2}\big(\|z_i^{(k)} - \bar z_i\|^2 + \|z_{i+1}^{(k)} - \bar z_{i+1}\|^2\big).$$
The last inequality uses claim 3 of Lemma 1. Summing over $k\in[K]$ and using Lemma 2, we have:
$$\rho^2\eta_i^2\sum_{k=1}^{K}\mathbb{E}\|d_i^{(k)} - \bar d_i\|^2 \le 2\lambda^2(I-1)\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2 + 2\lambda^2(I-1)\sum_{\ell=1}^{i}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2 \le 4\lambda^2(I-1)\sum_{\ell=1}^{i}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2.$$
This completes the proof.

9.4. DESCENT LEMMA

In this subsection, we bound the descent of the function value over the virtual sequence $\bar x_{\tau,i}$.

Lemma 4. Suppose the sequence $\{\bar x_{\tau,i}\}_{i=0}^{I-1}$ is generated by Algorithm 2. Then we have:
$$f(x_{\tau+1}) \le f(x_\tau) - \sum_{i=0}^{I-1}\Big(\frac{3\rho\eta_{\tau+1,i}}{4\lambda} - \frac{\eta_{\tau+1,i}^2 L}{2}\Big)\|\bar d_{\tau+1,i}\|^2 + \sum_{i=0}^{I-1}\frac{\lambda\eta_{\tau+1,i}}{\rho}\|\bar e_{\tau+1,i}\|^2,$$
where $\bar e_{\tau,i} = \bar\nu_{\tau,i} - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(\bar x_{\tau,i})$.

Proof. Since the function $f(x)$ is $L$-smooth, we have (omitting the global epoch number $\tau$ for ease of notation):
$$f(\bar x_{i+1}) \le f(\bar x_i) + \langle\nabla f(\bar x_i), \bar x_{i+1} - \bar x_i\rangle + \frac{L}{2}\|\bar x_{i+1} - \bar x_i\|^2 = f(\bar x_i) - \eta_i\langle\nabla f(\bar x_i), \bar d_i\rangle + \frac{L\eta_i^2}{2}\|\bar d_i\|^2$$
$$= f(\bar x_i) - \eta_i\langle\bar\nu_i, \bar d_i\rangle - \eta_i\langle\nabla f(\bar x_i) - \bar\nu_i, \bar d_i\rangle + \frac{L\eta_i^2}{2}\|\bar d_i\|^2 \overset{(a)}{\le} f(\bar x_i) - \Big(\frac{\rho\eta_i}{\lambda} - \frac{L\eta_i^2}{2}\Big)\|\bar d_i\|^2 - \eta_i\langle\nabla f(\bar x_i) - \bar\nu_i, \bar d_i\rangle$$
$$\overset{(b)}{\le} f(\bar x_i) - \Big(\frac{\rho\eta_i}{\lambda} - \frac{\eta_i^2 L}{2}\Big)\|\bar d_i\|^2 + \frac{\rho\eta_i}{4\lambda}\|\bar d_i\|^2 + \frac{\lambda\eta_i}{\rho}\|\bar\nu_i - \nabla f(\bar x_i)\|^2 \overset{(c)}{=} f(\bar x_i) - \Big(\frac{3\rho\eta_i}{4\lambda} - \frac{\eta_i^2 L}{2}\Big)\|\bar d_i\|^2 + \frac{\lambda\eta_i}{\rho}\|\bar e_i\|^2.$$
In inequality (a), we use claim 2 of Lemma 1; inequality (b) uses Young's inequality; equality (c) uses the definition $\bar e_i = \bar\nu_i - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(\bar x_i) = \bar\nu_i - \nabla f(\bar x_i)$. For the $(\tau+1)$-th global epoch, summing over $i = 0$ to $I-1$, we have:
$$f(\bar x_{\tau+1,I}) \le f(\bar x_{\tau+1,0}) - \sum_{i=0}^{I-1}\Big(\frac{3\rho\eta_{\tau+1,i}}{4\lambda} - \frac{\eta_{\tau+1,i}^2 L}{2}\Big)\|\bar d_{\tau+1,i}\|^2 + \sum_{i=0}^{I-1}\frac{\lambda\eta_{\tau+1,i}}{\rho}\|\bar e_{\tau+1,i}\|^2.$$
Following the update rules in Algorithms 1 and 2, we have $\bar x_{\tau+1,0} = x_\tau$ and $\bar x_{\tau+1,I} = x_{\tau+1}$. This completes the proof.
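The two key steps of the proof — the L-smoothness descent inequality and the Young's inequality split with weights ρ/(4λ) and λ/ρ — can be verified numerically on a quadratic (an illustrative check, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam, rho, eta = 6, 0.5, 1.2, 0.3
A = rng.normal(size=(n, n)); A = A @ A.T / n      # f(x) = 0.5 x^T A x
L = np.linalg.eigvalsh(A).max()                   # smoothness constant of f
f = lambda x: 0.5 * x @ A @ x

x, dbar = rng.normal(size=n), rng.normal(size=n)
# L-smooth descent: f(x - eta*d) <= f(x) - eta<grad f(x), d> + (L eta^2/2)||d||^2.
assert f(x - eta * dbar) <= f(x) - eta * (A @ x) @ dbar + L * eta**2 / 2 * dbar @ dbar + 1e-10

# Young's step (b): -<g - nu, d> <= rho/(4 lam) ||d||^2 + lam/rho ||nu - g||^2.
g, nu = A @ x, rng.normal(size=n)
assert -np.dot(g - nu, dbar) <= rho / (4 * lam) * dbar @ dbar + lam / rho * (nu - g) @ (nu - g) + 1e-10
print("descent and Young's inequality steps verified")
```
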

9.5. GRADIENT ERROR CONTRACTION

In this subsection, we bound the gradient estimation error $\bar e_{\tau,i} = \bar\nu_{\tau,i} - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(\bar x_{\tau,i})$ defined in Lemma 4. Additionally, we define the global gradient estimation error $e_\tau = \nu_\tau - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(x_\tau) = \nu_\tau - \nabla f(x_\tau)$. Note that $e_\tau = \bar e_{\tau,I} = \bar e_{\tau+1,0}$. We first show a fact about $e_0$, the initial gradient estimation error.

Lemma 5. For $e_0 := \nu_0 - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(x_0)$, suppose we choose mini-batch sizes $|\mathcal{B}_0^{(k)}| = b$, $k\in[K]$. Then we have $\mathbb{E}\|e_0\|^2 \le \frac{\sigma^2}{bK}$.

Proof. By line 1 of Algorithm 1, we have:
$$\mathbb{E}\|e_0\|^2 = \mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(x_0;\mathcal{B}_0^{(k)}) - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(x_0)\Big\|^2 \overset{(a)}{=} \frac{1}{K^2}\sum_{k=1}^{K}\mathbb{E}\big\|\nabla f^{(k)}(x_0;\mathcal{B}_0^{(k)}) - \nabla f^{(k)}(x_0)\big\|^2 \overset{(b)}{\le} \frac{\sigma^2}{bK}.$$
Equality (a) follows from the following: by the unbiased gradient assumption, $\mathbb{E}\big[\nabla f^{(k)}(x_0;\mathcal{B}_0^{(k)})\big] = \nabla f^{(k)}(x_0)$ for all $k\in[K]$; moreover, the samples $\mathcal{B}_0^{(k)}$ and $\mathcal{B}_0^{(\ell)}$ at the $k$-th and $\ell$-th clients are chosen uniformly at random and independently of each other for all $k, \ell \in [K]$ with $k \ne \ell$, so the cross terms vanish:
$$\mathbb{E}\big\langle \nabla f^{(k)}(x_0;\mathcal{B}_0^{(k)}) - \nabla f^{(k)}(x_0),\ \nabla f^{(\ell)}(x_0;\mathcal{B}_0^{(\ell)}) - \nabla f^{(\ell)}(x_0)\big\rangle = 0.$$
Inequality (b) results from the bounded variance assumption. This completes the proof.

Lemma 6. Define $\bar e_{\tau,i} := \bar\nu_{\tau,i} - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(\bar x_{\tau,i})$. For every $\tau \ge 1$ and $i \ge 0$, suppose $\alpha_i < 1$ and the clients use batch size $b_1$ in training; then we have:
$$\mathbb{E}\|\bar e_{\tau,i}\|^2 \le (1-\alpha_{\tau,i})^2\,\mathbb{E}\|\bar e_{\tau,i-1}\|^2 + \frac{40\lambda^2(I-1)L^2}{\rho^2 K^2}\sum_{\ell=1}^{i-1}\eta_{\tau,\ell}^2\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,\ell}^{(k)} - \bar\nu_{\tau,\ell}\|^2 + \frac{8\eta_{\tau,i-1}^2 L^2}{K}\mathbb{E}\|\bar d_{\tau,i-1}\|^2 + \frac{4\alpha_{\tau,i}^2\sigma^2}{b_1 K},$$
where the expectation is w.r.t. the stochasticity of the algorithm.

Proof.
Consider the error term $\|\bar e_i\|^2$ for $i \ge 1$ (we omit the global epoch number $\tau$ for ease of notation). We have:
$$\mathbb{E}\|\bar e_i\|^2 = \mathbb{E}\Big\|\bar\nu_i - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(\bar x_i)\Big\|^2 = \mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}\Big[\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - (1-\alpha_i)\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)})\Big] + (1-\alpha_i)\bar\nu_{i-1} - \frac{1}{K}\sum_{k=1}^{K}\nabla f^{(k)}(\bar x_i)\Big\|^2$$
$$= \mathbb{E}\Big\|\frac{1}{K}\sum_{k=1}^{K}\Big[\big(\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(\bar x_i)\big) - (1-\alpha_i)\big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(\bar x_{i-1})\big)\Big] + (1-\alpha_i)\bar e_{i-1}\Big\|^2$$
$$= (1-\alpha_i)^2\,\mathbb{E}\|\bar e_{i-1}\|^2 + \frac{1}{K^2}\sum_{k=1}^{K}\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(\bar x_i) - (1-\alpha_i)\big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(\bar x_{i-1})\big)\Big\|^2,$$
where the first equality uses the definition of $\bar\nu_i$, and the last equality follows from expanding the norm using the inner products across $k\in[K]$ and noting that the cross terms are zero in expectation because the samples are drawn independently at different workers. Now we consider the second term above:
$$\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(\bar x_i) - (1-\alpha_i)\big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(\bar x_{i-1})\big)\Big\|^2$$
$$\le 2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_i^{(k)}) - (1-\alpha_i)\big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)})\big)\Big\|^2 + 2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)}) - \nabla f^{(k)}(\bar x_i) - (1-\alpha_i)\big(\nabla f^{(k)}(x_{i-1}^{(k)}) - \nabla f^{(k)}(\bar x_{i-1})\big)\Big\|^2.$$
For the first term of the above inequality, we have:
$$\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_i^{(k)}) - (1-\alpha_i)\big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)})\big)\Big\|^2$$
$$= \mathbb{E}\Big\|(1-\alpha_i)\Big[\big(\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_i^{(k)})\big) - \big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)})\big)\Big] + \alpha_i\big(\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_i^{(k)})\big)\Big\|^2$$
$$\le 2(1-\alpha_i)^2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \big(\nabla f^{(k)}(x_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)})\big)\Big\|^2 + 2\alpha_i^2\,\mathbb{E}\big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_i^{(k)})\big\|^2$$
$$\le 2(1-\alpha_i)^2\,\mathbb{E}\big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)})\big\|^2 + \frac{2\alpha_i^2\sigma^2}{b_1} \le 2(1-\alpha_i)^2 L^2\,\mathbb{E}\|x_i^{(k)} - x_{i-1}^{(k)}\|^2 + \frac{2\alpha_i^2\sigma^2}{b_1}$$
$$\le 2(1-\alpha_i)^2 L^2\eta_{i-1}^2\,\mathbb{E}\|d_{i-1}^{(k)}\|^2 + \frac{2\alpha_i^2\sigma^2}{b_1} \le 4(1-\alpha_i)^2 L^2\eta_{i-1}^2\,\mathbb{E}\|d_{i-1}^{(k)} - \bar d_{i-1}\|^2 + 4(1-\alpha_i)^2 L^2\eta_{i-1}^2\,\mathbb{E}\|\bar d_{i-1}\|^2 + \frac{2\alpha_i^2\sigma^2}{b_1},$$
where we use Proposition 1 in the first inequality and the bounded variance assumption in the second inequality. For the second term, we have:
$$\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)}) - \nabla f^{(k)}(\bar x_i) - (1-\alpha_i)\big(\nabla f^{(k)}(x_{i-1}^{(k)}) - \nabla f^{(k)}(\bar x_{i-1})\big)\Big\|^2 \overset{(a)}{\le} 2\,\mathbb{E}\big\|\nabla f^{(k)}(x_i^{(k)}) - \nabla f^{(k)}(\bar x_i)\big\|^2 + 2(1-\alpha_i)^2\,\mathbb{E}\big\|\nabla f^{(k)}(x_{i-1}^{(k)}) - \nabla f^{(k)}(\bar x_{i-1})\big\|^2$$
$$\le 2L^2\,\mathbb{E}\|x_i^{(k)} - \bar x_i\|^2 + 2L^2(1-\alpha_i)^2\,\mathbb{E}\|x_{i-1}^{(k)} - \bar x_{i-1}\|^2 \overset{(b)}{\le} \frac{2\lambda^2 L^2}{\rho^2}\,\mathbb{E}\|z_i^{(k)} - \bar z_i\|^2 + \frac{2\lambda^2 L^2(1-\alpha_i)^2}{\rho^2}\,\mathbb{E}\|z_{i-1}^{(k)} - \bar z_{i-1}\|^2,$$
where (a) uses Proposition 1 and (b) uses claim 3 of Lemma 1. Next, we combine the above inequalities to get:
$$\mathbb{E}\|\bar e_i\|^2 \le (1-\alpha_i)^2\,\mathbb{E}\|\bar e_{i-1}\|^2 + \frac{4\alpha_i^2\sigma^2}{b_1 K} + \frac{8(1-\alpha_i)^2\eta_{i-1}^2 L^2}{K^2}\sum_{k=1}^{K}\mathbb{E}\|d_{i-1}^{(k)} - \bar d_{i-1}\|^2 + \frac{8(1-\alpha_i)^2\eta_{i-1}^2 L^2}{K}\mathbb{E}\|\bar d_{i-1}\|^2 + \frac{4\lambda^2 L^2}{K^2\rho^2}\sum_{k=1}^{K}\mathbb{E}\|z_i^{(k)} - \bar z_i\|^2 + \frac{4\lambda^2 L^2(1-\alpha_i)^2}{K^2\rho^2}\sum_{k=1}^{K}\mathbb{E}\|z_{i-1}^{(k)} - \bar z_{i-1}\|^2$$
$$\le (1-\alpha_i)^2\,\mathbb{E}\|\bar e_{i-1}\|^2 + \frac{4\alpha_i^2\sigma^2}{b_1 K} + \frac{32\lambda^2(I-1)(1-\alpha_i)^2 L^2}{K^2\rho^2}\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2 + \frac{8(1-\alpha_i)^2\eta_{i-1}^2 L^2}{K}\mathbb{E}\|\bar d_{i-1}\|^2 + \frac{4\lambda^2(I-1)L^2}{K^2\rho^2}\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2 + \frac{4\lambda^2(I-1)L^2(1-\alpha_i)^2}{K^2\rho^2}\sum_{\ell=1}^{i-2}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2$$
$$\le (1-\alpha_i)^2\,\mathbb{E}\|\bar e_{i-1}\|^2 + \frac{4\alpha_i^2\sigma^2}{b_1 K} + \frac{40\lambda^2(I-1)L^2}{K^2\rho^2}\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2 + \frac{8\eta_{i-1}^2 L^2}{K}\mathbb{E}\|\bar d_{i-1}\|^2.$$
The second inequality uses Lemmas 2 and 3, and the last inequality uses the assumption $\alpha_i < 1$. This completes the proof.

Lemma 7. For $\tau \ge 0$.
Suppose we choose $\eta_{\tau,i} = \kappa/(w_i + i + \tau I)^{1/3}$, and additionally suppose that $\alpha_i < 1$, $w_i \le w_{i-1}$, $w_i \ge 2$ and $\eta_{\tau,i} \le \frac{\rho}{48\lambda L I^2}$ are satisfied. Then we have:
$$\frac{\rho K}{64\lambda L^2}\Big(\frac{\mathbb{E}\|e_{\tau+1}\|^2}{\eta_{\tau+1,I-1}} - \frac{\mathbb{E}\|e_\tau\|^2}{\eta_{\tau,I-1}}\Big) \le -\sum_{i=0}^{I-1}\frac{3\lambda\eta_{\tau+1,i}}{2\rho}\mathbb{E}\|\bar e_{\tau+1,i}\|^2 + \sum_{i=0}^{I-1}\frac{\eta_{\tau+1,i}\rho}{8\lambda}\mathbb{E}\|\bar d_{\tau+1,i}\|^2 + \sum_{i=0}^{I-1}\frac{\sigma^2 c^2\eta_{\tau+1,i}^3\rho}{16\lambda L^2 b_1} + \frac{5\lambda I(I-1)}{4K\rho}\sum_{\ell=1}^{I}\eta_{\tau+1,\ell}\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau+1,\ell}^{(k)} - \bar\nu_{\tau+1,\ell}\|^2.$$

Proof. Using Lemma 6, for $i \ge 0$ (we denote $\eta_{\tau,-1} = \eta_{\tau-1,I-1}$ for all $\tau \ge 1$), we have:
$$\frac{\mathbb{E}\|\bar e_{\tau,i+1}\|^2}{\eta_{\tau,i}} - \frac{\mathbb{E}\|\bar e_{\tau,i}\|^2}{\eta_{\tau,i-1}} \le \Big(\frac{(1-\alpha_{\tau,i+1})^2}{\eta_{\tau,i}} - \frac{1}{\eta_{\tau,i-1}}\Big)\mathbb{E}\|\bar e_{\tau,i}\|^2 + \frac{40\lambda^2(I-1)L^2}{\rho^2 K^2\eta_{\tau,i}}\sum_{\ell=1}^{i}\eta_{\tau,\ell}^2\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,\ell}^{(k)} - \bar\nu_{\tau,\ell}\|^2 + \frac{8L^2\eta_{\tau,i}}{K}\mathbb{E}\|\bar d_{\tau,i}\|^2 + \frac{4\alpha_{\tau,i+1}^2\sigma^2}{\eta_{\tau,i}\, b_1 K}$$
$$\overset{(a)}{\le} \big(\eta_{\tau,i}^{-1} - \eta_{\tau,i-1}^{-1} - c\,\eta_{\tau,i}\big)\mathbb{E}\|\bar e_{\tau,i}\|^2 + \frac{80\lambda^2(I-1)L^2}{\rho^2 K^2}\sum_{\ell=1}^{i}\eta_{\tau,\ell}\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,\ell}^{(k)} - \bar\nu_{\tau,\ell}\|^2 + \frac{8L^2\eta_{\tau,i}}{K}\mathbb{E}\|\bar d_{\tau,i}\|^2 + \frac{4\sigma^2 c^2\eta_{\tau,i}^3}{b_1 K},$$
where inequality (a) utilizes the facts that $(1-\alpha_{\tau,i})^2 \le 1-\alpha_{\tau,i} \le 1$ and $\alpha_{\tau,i+1} = c\,\eta_{\tau,i}^2$ for all $i\in[I]$, together with the following fact: for $0 \le \ell \le i < I$,
$$\frac{\eta_{\tau,\ell}}{\eta_{\tau,i}} = \frac{(w_i + i + \tau I)^{1/3}}{(w_\ell + \ell + \tau I)^{1/3}} = \Big(1 + \frac{w_i + i - w_\ell - \ell}{w_\ell + \ell + \tau I}\Big)^{1/3} \le \Big(1 + \frac{I-1}{w_\ell + \ell + \tau I}\Big)^{1/3} \le 1 + \frac{I-1}{3(w_\ell + \ell + \tau I)} \le 2. \qquad(17)$$
The first inequality is by the facts that $w_i \le w_\ell$ and $0 \le i - \ell \le I-1$; the second-to-last inequality uses the concavity of $x^{1/3}$, namely $(x+y)^{1/3} - x^{1/3} \le y/(3x^{2/3})$; and the last inequality uses the facts that $w_\ell \ge 2$, $I \ge 1$, $\ell \ge 0$, $\tau \ge 1$.

For the difference $\eta_{\tau,i}^{-1} - \eta_{\tau,i-1}^{-1}$, we have:
$$\frac{1}{\eta_{\tau,i}} - \frac{1}{\eta_{\tau,i-1}} = \frac{(w_i + i + \tau I)^{1/3}}{\kappa} - \frac{(w_{i-1} + i - 1 + \tau I)^{1/3}}{\kappa} \overset{(a)}{\le} \frac{(w_i + i + \tau I)^{1/3} - (w_i + i - 1 + \tau I)^{1/3}}{\kappa} \overset{(b)}{\le} \frac{1}{3\kappa(w_i + i - 1 + \tau I)^{2/3}} \overset{(c)}{\le} \frac{2^{2/3}}{3\kappa(w_i + i + \tau I)^{2/3}} \overset{(d)}{=} \frac{2^{2/3}}{3\kappa^3}\eta_{\tau,i}^2 \overset{(e)}{\le} \frac{\rho}{72\kappa^3\lambda L I^2}\eta_{\tau,i},$$
where (a) is because we choose $w_i \le w_{i-1}$; (b) results from the concavity of $x^{1/3}$, i.e. $(x+y)^{1/3} - x^{1/3} \le y/(3x^{2/3})$; (c) uses the fact that $w_i \ge 2$; and (d) and (e) utilize the definition of $\eta_{\tau,i}$ and the condition $\eta_{\tau,i} \le \frac{\rho}{48\lambda L I^2}$, respectively. So if we choose
$$c = \frac{96\lambda^2 L^2}{K\rho^2} + \frac{\rho}{72\kappa^3\lambda L I^2},$$
we have $\eta_{\tau,i}^{-1} - \eta_{\tau,i-1}^{-1} - c\,\eta_{\tau,i} \le -\frac{96\lambda^2 L^2}{K\rho^2}\eta_{\tau,i}$. Therefore, we have:
$$\frac{\mathbb{E}\|\bar e_{\tau,i+1}\|^2}{\eta_{\tau,i}} - \frac{\mathbb{E}\|\bar e_{\tau,i}\|^2}{\eta_{\tau,i-1}} \le -\frac{96\lambda^2 L^2\eta_{\tau,i}}{K\rho^2}\mathbb{E}\|\bar e_{\tau,i}\|^2 + \frac{80\lambda^2(I-1)L^2}{\rho^2 K^2}\sum_{\ell=1}^{i}\eta_{\tau,\ell}\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,\ell}^{(k)} - \bar\nu_{\tau,\ell}\|^2 + \frac{8L^2\eta_{\tau,i}}{K}\mathbb{E}\|\bar d_{\tau,i}\|^2 + \frac{4\sigma^2 c^2\eta_{\tau,i}^3}{b_1 K}.$$
Multiplying both sides by $\frac{\rho K}{64\lambda L^2}$, we have:
$$\frac{\rho K}{64\lambda L^2}\Big(\frac{\mathbb{E}\|\bar e_{\tau,i+1}\|^2}{\eta_{\tau,i}} - \frac{\mathbb{E}\|\bar e_{\tau,i}\|^2}{\eta_{\tau,i-1}}\Big) \le -\frac{3\lambda\eta_{\tau,i}}{2\rho}\mathbb{E}\|\bar e_{\tau,i}\|^2 + \frac{5\lambda(I-1)}{4K\rho}\sum_{\ell=1}^{i}\eta_{\tau,\ell}\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,\ell}^{(k)} - \bar\nu_{\tau,\ell}\|^2 + \frac{\eta_{\tau,i}\rho}{8\lambda}\mathbb{E}\|\bar d_{\tau,i}\|^2 + \frac{\sigma^2 c^2\eta_{\tau,i}^3\rho}{16\lambda L^2 b_1}.$$
Then we sum the above inequality over $i$ from $0$ to $I-1$ and get:
$$\frac{\rho K}{64\lambda L^2}\Big(\frac{\mathbb{E}\|\bar e_{\tau,I}\|^2}{\eta_{\tau,I-1}} - \frac{\mathbb{E}\|\bar e_{\tau,0}\|^2}{\eta_{\tau-1,I-1}}\Big) \le -\sum_{i=0}^{I-1}\frac{3\lambda\eta_{\tau,i}}{2\rho}\mathbb{E}\|\bar e_{\tau,i}\|^2 + \frac{5\lambda I(I-1)}{4K\rho}\sum_{\ell=1}^{I}\eta_{\tau,\ell}\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,\ell}^{(k)} - \bar\nu_{\tau,\ell}\|^2 + \sum_{i=0}^{I-1}\frac{\eta_{\tau,i}\rho}{8\lambda}\mathbb{E}\|\bar d_{\tau,i}\|^2 + \sum_{i=0}^{I-1}\frac{\sigma^2 c^2\eta_{\tau,i}^3\rho}{16\lambda L^2 b_1}.$$
By definition, we have $\bar e_{\tau,0} = e_{\tau-1}$ and $\bar e_{\tau,I} = e_\tau$; we obtain the result in the lemma by replacing $\tau$ with $\tau+1$. This completes the proof.
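The step-size schedule is easy to probe numerically. The snippet below (an illustrative check with κ = 1, constant $w_i = 2$, and τ = 3) verifies the factor-2 ratio bound of equation 17 and the $O(\eta_i^2)$ bound on the inverse-step-size increments used for choosing c:

```python
import numpy as np

# eta_{tau,i} = kappa / (w_i + i + tau*I)^(1/3) with non-increasing w_i >= 2.
kappa, I, tau = 1.0, 10, 3
w = np.full(I, 2.0)                              # constant w_i = 2 for illustration
t = w + np.arange(I) + tau * I
eta = kappa / t ** (1 / 3)

# Equation (17): step sizes within one epoch differ by at most a factor of 2.
assert eta.max() / eta.min() <= 2.0

# Inverse-step increments: 1/eta_i - 1/eta_{i-1} <= 2^(2/3)/(3 kappa^3) eta_i^2.
diff = 1 / eta[1:] - 1 / eta[:-1]
assert np.all(diff <= 2 ** (2 / 3) / (3 * kappa ** 3) * eta[1:] ** 2 + 1e-12)
print("step-size ratio and increment bounds hold")
```
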

9.6. DESCENT IN POTENTIAL FUNCTION

We define the potential function as follows:
$$\Phi_\tau := f(x_\tau) + \frac{\rho K}{64\lambda L^2}\cdot\frac{\|e_\tau\|^2}{\eta_{\tau-1,I-1}}.$$
Next, we characterize the descent of the potential function.

Lemma 8. For any $\tau \ge 0$, we have:
$$\mathbb{E}[\Phi_{\tau+1} - \Phi_\tau] \le -\sum_{i=0}^{I-1}\Big(\frac{5\rho\eta_{\tau+1,i}}{8\lambda} - \frac{\eta_{\tau+1,i}^2 L}{2}\Big)\mathbb{E}\|\bar d_{\tau+1,i}\|^2 - \frac{\lambda}{2\rho}\sum_{i=0}^{I-1}\eta_{\tau+1,i}\mathbb{E}\|\bar e_{\tau+1,i}\|^2 + \frac{\sigma^2 c^2\rho}{16\lambda L^2 b_1}\sum_{i=0}^{I-1}\eta_{\tau+1,i}^3 + \frac{5\lambda I(I-1)}{4K\rho}\sum_{i=1}^{I}\eta_{\tau+1,i}\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau+1,i}^{(k)} - \bar\nu_{\tau+1,i}\|^2,$$
where the expectation is w.r.t. the stochasticity of the algorithm.

Proof. We obtain the inequality in the lemma by combining Lemma 4 and Lemma 7.

9.7. ACCUMULATED GRADIENT ERROR

In this subsection, we bound the gradient consensus error given by the term $\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,i}^{(k)} - \bar\nu_{\tau,i}\|^2$.

Lemma 9. For $i \ge 1$ and $\alpha_i < 1$, we have:
$$\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,i}^{(k)} - \bar\nu_{\tau,i}\|^2 \le \Big(1 + \frac{1}{I}\Big)\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,i-1}^{(k)} - \bar\nu_{\tau,i-1}\|^2 + 8KIL^2\eta_{\tau,i-1}^2\,\mathbb{E}\|\bar d_{\tau,i-1}\|^2 + \frac{8KI\sigma^2 c^2\eta_{\tau,i-1}^4}{b_1} + 16KI\zeta^2 c^2\eta_{\tau,i-1}^4 + \frac{96\lambda^2 I^2 L^2}{\rho^2}\sum_{\ell=1}^{i-1}\eta_{\tau,\ell}^2\sum_{k=1}^{K}\mathbb{E}\|\nu_{\tau,\ell}^{(k)} - \bar\nu_{\tau,\ell}\|^2,$$
where the expectation is w.r.t. the stochasticity of the algorithm.

Proof. By the update rule of $\nu_i^{(k)}$ (we omit the global epoch index for convenience), we have:
$$\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 = \mathbb{E}\Big\|(1-\alpha_i)\big(\nu_{i-1}^{(k)} - \bar\nu_{i-1}\big) + \Big(\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_i^{(j)};\mathcal{B}_i^{(j)})\Big) - (1-\alpha_i)\Big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)})\Big)\Big\|^2$$
$$\le (1+\beta)(1-\alpha_i)^2\,\mathbb{E}\|\nu_{i-1}^{(k)} - \bar\nu_{i-1}\|^2 + \Big(1 + \frac{1}{\beta}\Big)\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_i^{(j)};\mathcal{B}_i^{(j)}) - (1-\alpha_i)\Big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)})\Big)\Big\|^2, \qquad(20)$$
where the last inequality uses Proposition 1.
Next, we consider the second term in equation 20:
$$\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_i^{(j)};\mathcal{B}_i^{(j)}) - (1-\alpha_i)\Big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)})\Big)\Big\|^2$$
$$\overset{(a)}{\le} 2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_i^{(j)};\mathcal{B}_i^{(j)}) - \Big(\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)})\Big)\Big\|^2 + 2\alpha_i^2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)})\Big\|^2$$
$$\overset{(b)}{\le} 2\,\mathbb{E}\big\|\nabla f^{(k)}(x_i^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)})\big\|^2 + 2\alpha_i^2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)})\Big\|^2$$
$$\overset{(c)}{\le} 2L^2\,\mathbb{E}\|x_i^{(k)} - x_{i-1}^{(k)}\|^2 + 2\alpha_i^2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)})\Big\|^2, \qquad(21)$$
where inequality (a) uses Proposition 1, inequality (b) uses Proposition 2, and inequality (c) uses the smoothness assumption.

Under review as a conference paper at ICLR 2023

Next, we consider the second term in equation 21 above. We have:
$$\mathbb{E}\Big\|\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)})\Big\|^2$$
$$\le 2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\big(\nabla f^{(j)}(x_{i-1}^{(j)};\mathcal{B}_i^{(j)}) - \nabla f^{(j)}(x_{i-1}^{(j)})\big)\Big\|^2 + 2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_{i-1}^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)})\Big\|^2$$
$$\overset{(a)}{\le} 2\,\mathbb{E}\big\|\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)})\big\|^2 + 2\,\mathbb{E}\Big\|\nabla f^{(k)}(x_{i-1}^{(k)}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)})\Big\|^2$$
$$\le 2\,\mathbb{E}\big\|\nabla f^{(k)}(x_{i-1}^{(k)};\mathcal{B}_i^{(k)}) - \nabla f^{(k)}(x_{i-1}^{(k)})\big\|^2 + 4\,\mathbb{E}\big\|\nabla f^{(k)}(\bar x_{i-1}) - \nabla f(\bar x_{i-1})\big\|^2 + 8\,\mathbb{E}\big\|\nabla f^{(k)}(x_{i-1}^{(k)}) - \nabla f^{(k)}(\bar x_{i-1})\big\|^2 + 8\,\mathbb{E}\Big\|\nabla f(\bar x_{i-1}) - \frac{1}{K}\sum_{j=1}^{K}\nabla f^{(j)}(x_{i-1}^{(j)})\Big\|^2$$
$$\overset{(b)}{\le} \frac{2\sigma^2}{b_1} + \frac{4}{K}\sum_{j=1}^{K}\mathbb{E}\|\nabla f^{(k)}(\bar x_{i-1}) - \nabla f^{(j)}(\bar x_{i-1})\|^2 + 8L^2\,\mathbb{E}\|x_{i-1}^{(k)} - \bar x_{i-1}\|^2 + \frac{8L^2}{K}\sum_{j=1}^{K}\mathbb{E}\|x_{i-1}^{(j)} - \bar x_{i-1}\|^2$$
$$\overset{(c)}{\le} \frac{2\sigma^2}{b_1} + 4\zeta^2 + 8L^2\,\mathbb{E}\|x_{i-1}^{(k)} - \bar x_{i-1}\|^2 + \frac{8L^2}{K}\sum_{j=1}^{K}\mathbb{E}\|x_{i-1}^{(j)} - \bar x_{i-1}\|^2, \qquad(22)$$
where inequality (a) uses Proposition 2, inequality (b) utilizes the bounded variance assumption, and inequality (c) uses the bounded heterogeneity assumption. Finally, substituting equations 22 and 21 into equation 20 and summing over all $K$ workers, we get:
$$\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le (1-\alpha_i)^2(1+\beta)\sum_{k=1}^{K}\mathbb{E}\|\nu_{i-1}^{(k)} - \bar\nu_{i-1}\|^2 + 2L^2\Big(1+\frac{1}{\beta}\Big)\sum_{k=1}^{K}\mathbb{E}\|x_i^{(k)} - x_{i-1}^{(k)}\|^2 + \frac{4K\sigma^2}{b_1}\Big(1+\frac{1}{\beta}\Big)\alpha_i^2 + 8K\zeta^2\Big(1+\frac{1}{\beta}\Big)\alpha_i^2 + 32L^2\Big(1+\frac{1}{\beta}\Big)\alpha_i^2\sum_{k=1}^{K}\mathbb{E}\|x_{i-1}^{(k)} - \bar x_{i-1}\|^2$$
$$\le (1-\alpha_i)^2(1+\beta)\sum_{k=1}^{K}\mathbb{E}\|\nu_{i-1}^{(k)} - \bar\nu_{i-1}\|^2 + 2L^2\eta_{i-1}^2\Big(1+\frac{1}{\beta}\Big)\sum_{k=1}^{K}\mathbb{E}\|d_{i-1}^{(k)}\|^2 + \frac{4K\sigma^2}{b_1}\Big(1+\frac{1}{\beta}\Big)\alpha_i^2 + 8K\zeta^2\Big(1+\frac{1}{\beta}\Big)\alpha_i^2 + \frac{32\lambda^2 L^2\alpha_i^2}{\rho^2}\Big(1+\frac{1}{\beta}\Big)\sum_{k=1}^{K}\mathbb{E}\|z_{i-1}^{(k)} - \bar z_{i-1}\|^2,$$
where the second inequality uses claim 3 of Lemma 1. Next, using Lemma 2, we have:
$$\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le (1-\alpha_i)^2(1+\beta)\sum_{k=1}^{K}\mathbb{E}\|\nu_{i-1}^{(k)} - \bar\nu_{i-1}\|^2 + 2L^2\eta_{i-1}^2\Big(1+\frac{1}{\beta}\Big)\sum_{k=1}^{K}\mathbb{E}\|d_{i-1}^{(k)}\|^2 + \frac{4K\sigma^2}{b_1}\Big(1+\frac{1}{\beta}\Big)\alpha_i^2 + 8K\zeta^2\Big(1+\frac{1}{\beta}\Big)\alpha_i^2 + \frac{32\lambda^2 L^2\alpha_i^2}{\rho^2}\Big(1+\frac{1}{\beta}\Big)(I-1)\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2. \qquad(23)$$
For the second term of the above inequality, we have:
$$2L^2\eta_{i-1}^2\Big(1+\frac{1}{\beta}\Big)\sum_{k=1}^{K}\mathbb{E}\|d_{i-1}^{(k)}\|^2 \le 4L^2\eta_{i-1}^2\Big(1+\frac{1}{\beta}\Big)\sum_{k=1}^{K}\mathbb{E}\|d_{i-1}^{(k)} - \bar d_{i-1}\|^2 + 4KL^2\eta_{i-1}^2\Big(1+\frac{1}{\beta}\Big)\mathbb{E}\|\bar d_{i-1}\|^2 \le \frac{16\lambda^2 L^2(I-1)}{\rho^2}\Big(1+\frac{1}{\beta}\Big)\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2 + 4KL^2\eta_{i-1}^2\Big(1+\frac{1}{\beta}\Big)\mathbb{E}\|\bar d_{i-1}\|^2,$$
where the first inequality uses Proposition 1 and the second inequality uses Lemma 3. Next, plugging the above inequality back into equation 23, we have:
$$\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le (1-\alpha_i)^2(1+\beta)\sum_{k=1}^{K}\mathbb{E}\|\nu_{i-1}^{(k)} - \bar\nu_{i-1}\|^2 + 4KL^2\eta_{i-1}^2\Big(1+\frac{1}{\beta}\Big)\mathbb{E}\|\bar d_{i-1}\|^2 + \frac{4K\sigma^2}{b_1}\Big(1+\frac{1}{\beta}\Big)\alpha_i^2 + 8K\zeta^2\Big(1+\frac{1}{\beta}\Big)\alpha_i^2 + \frac{16\lambda^2 L^2(1+2\alpha_i^2)(I-1)}{\rho^2}\Big(1+\frac{1}{\beta}\Big)\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2$$
$$\le \Big(1+\frac{1}{I}\Big)\sum_{k=1}^{K}\mathbb{E}\|\nu_{i-1}^{(k)} - \bar\nu_{i-1}\|^2 + 8KIL^2\eta_{i-1}^2\,\mathbb{E}\|\bar d_{i-1}\|^2 + \frac{8KI\sigma^2 c^2\eta_{i-1}^4}{b_1} + 16KI\zeta^2 c^2\eta_{i-1}^4 + \frac{96\lambda^2 I^2 L^2}{\rho^2}\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2.$$
In the last inequality, we choose $\beta = 1/I$, so that $(1+\frac{1}{\beta}) = (1+I) \le 2I$ and $(1-\alpha_i)^2(1+\beta) \le 1 + \frac{1}{I}$; we also use the facts that $(1-\alpha_i)^2 < 1$ and $\alpha_i = c\,\eta_{i-1}^2 < 1$. This completes the proof.

Lemma 10.
For $\eta_i \le \frac{\rho}{48\lambda L I^2}$, we have:
$$\frac{\lambda I^2}{\rho K}\sum_{i=1}^{I}\eta_i\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le \frac{\rho}{84\lambda}\sum_{i=0}^{I-1}\eta_i\,\mathbb{E}\|\bar d_i\|^2 + \Big(\frac{\rho\sigma^2 c^2}{84\lambda L^2 b_1} + \frac{\rho\zeta^2 c^2}{42\lambda L^2}\Big)\sum_{i=0}^{I-1}\eta_i^3.$$

Proof. By Lemma 9 (we omit the global epoch number for convenience), we have:
$$\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le \Big(1+\frac{1}{I}\Big)\sum_{k=1}^{K}\mathbb{E}\|\nu_{i-1}^{(k)} - \bar\nu_{i-1}\|^2 + 8KIL^2\eta_{i-1}^2\,\mathbb{E}\|\bar d_{i-1}\|^2 + \frac{8KI\sigma^2 c^2\eta_{i-1}^4}{b_1} + 16KI\zeta^2 c^2\eta_{i-1}^4 + \frac{96\lambda^2 I^2 L^2}{\rho^2}\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2$$
$$\le \Big(1+\frac{1}{I}\Big)\sum_{k=1}^{K}\mathbb{E}\|\nu_{i-1}^{(k)} - \bar\nu_{i-1}\|^2 + \frac{KL\rho\eta_{i-1}}{6\lambda I}\mathbb{E}\|\bar d_{i-1}\|^2 + \frac{K\rho\sigma^2 c^2\eta_{i-1}^3}{6\lambda I L b_1} + \frac{K\rho\zeta^2 c^2\eta_{i-1}^3}{3\lambda I L} + \frac{96\lambda^2 I^2 L^2}{\rho^2}\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2, \qquad(24)$$
where in the second inequality we use the condition $\eta_i \le \frac{\rho}{48\lambda L I^2}$. Applying equation 24 recursively from $1$ to $i$, we have:
$$\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le \frac{KL\rho}{6\lambda I}\sum_{\ell=0}^{i-1}\Big(1+\frac{1}{I}\Big)^{i-1-\ell}\eta_\ell\,\mathbb{E}\|\bar d_\ell\|^2 + \frac{K\rho\sigma^2 c^2}{6\lambda I L b_1}\sum_{\ell=0}^{i-1}\Big(1+\frac{1}{I}\Big)^{i-1-\ell}\eta_\ell^3 + \frac{K\rho\zeta^2 c^2}{3\lambda I L}\sum_{\ell=0}^{i-1}\Big(1+\frac{1}{I}\Big)^{i-1-\ell}\eta_\ell^3 + \frac{96\lambda^2 L^2 I^2}{\rho^2}\sum_{\ell=0}^{i-1}\Big(1+\frac{1}{I}\Big)^{i-1-\ell}\sum_{l=1}^{\ell}\eta_l^2\sum_{k=1}^{K}\mathbb{E}\|\nu_l^{(k)} - \bar\nu_l\|^2$$
$$\overset{(a)}{\le} \frac{KL\rho}{6\lambda I}\Big(1+\frac{1}{I}\Big)^{I}\sum_{\ell=0}^{i-1}\eta_\ell\,\mathbb{E}\|\bar d_\ell\|^2 + \frac{K\rho\sigma^2 c^2}{6\lambda I L b_1}\Big(1+\frac{1}{I}\Big)^{I}\sum_{\ell=0}^{i-1}\eta_\ell^3 + \frac{K\rho\zeta^2 c^2}{3\lambda I L}\Big(1+\frac{1}{I}\Big)^{I}\sum_{\ell=0}^{i-1}\eta_\ell^3 + \frac{96\lambda^2 L^2 I^3}{\rho^2}\Big(1+\frac{1}{I}\Big)^{I}\sum_{l=1}^{i-1}\eta_l^2\sum_{k=1}^{K}\mathbb{E}\|\nu_l^{(k)} - \bar\nu_l\|^2$$
$$\overset{(b)}{\le} \frac{KL\rho}{2\lambda I}\sum_{\ell=0}^{i-1}\eta_\ell\,\mathbb{E}\|\bar d_\ell\|^2 + \frac{K\rho\sigma^2 c^2}{2\lambda I L b_1}\sum_{\ell=0}^{i-1}\eta_\ell^3 + \frac{K\rho\zeta^2 c^2}{\lambda I L}\sum_{\ell=0}^{i-1}\eta_\ell^3 + \frac{288\lambda^2 L^2 I^3}{\rho^2}\sum_{\ell=1}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2, \qquad(25)$$
where inequality (a) is by the facts that $1 + 1/I > 1$ and $i-1-\ell \le I$ for $i\in[I]$ and $\ell\in[i]$, and inequality (b) is because $(1+1/I)^I \le e < 3$. Next, multiplying both sides of equation 25 by $\eta_i$ and summing over $i = 1$ to $I$:
$$\sum_{i=1}^{I}\eta_i\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le \frac{KL\rho}{2\lambda I}\sum_{i=1}^{I}\eta_i\sum_{\ell=0}^{i-1}\eta_\ell\,\mathbb{E}\|\bar d_\ell\|^2 + \Big(\frac{K\rho\sigma^2 c^2}{2\lambda I L b_1} + \frac{K\rho\zeta^2 c^2}{\lambda I L}\Big)\sum_{i=1}^{I}\eta_i\sum_{\ell=0}^{i-1}\eta_\ell^3 + \frac{288\lambda^2 L^2 I^3}{\rho^2}\sum_{i=1}^{I}\eta_i\sum_{\ell=0}^{i-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2$$
$$\overset{(a)}{\le} \frac{KL\rho}{2\lambda I}\sum_{i=1}^{I}\eta_i\sum_{\ell=0}^{I-1}\eta_\ell\,\mathbb{E}\|\bar d_\ell\|^2 + \Big(\frac{K\rho\sigma^2 c^2}{2\lambda I L b_1} + \frac{K\rho\zeta^2 c^2}{\lambda I L}\Big)\sum_{i=1}^{I}\eta_i\sum_{\ell=0}^{I-1}\eta_\ell^3 + \frac{288\lambda^2 L^2 I^3}{\rho^2}\sum_{i=1}^{I}\eta_i\sum_{\ell=0}^{I-1}\eta_\ell^2\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2$$
$$\overset{(b)}{\le} \frac{K\rho^2}{96\lambda^2 I^2}\sum_{i=0}^{I-1}\eta_i\,\mathbb{E}\|\bar d_i\|^2 + \Big(\frac{K\rho^2\sigma^2 c^2}{96\lambda^2 I^2 L^2 b_1} + \frac{K\rho^2\zeta^2 c^2}{48\lambda^2 I^2 L^2}\Big)\sum_{i=0}^{I-1}\eta_i^3 + \frac{1}{8}\sum_{\ell=1}^{I}\eta_\ell\sum_{k=1}^{K}\mathbb{E}\|\nu_\ell^{(k)} - \bar\nu_\ell\|^2,$$
where inequality (a) uses the fact that $i \le I$ and (b) uses the choice $\eta_i \le \rho/(48\lambda L I^2)$. Rearranging terms, we have:
$$\frac{7}{8}\sum_{i=1}^{I}\eta_i\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le \frac{K\rho^2}{96\lambda^2 I^2}\sum_{i=0}^{I-1}\eta_i\,\mathbb{E}\|\bar d_i\|^2 + \Big(\frac{K\rho^2\sigma^2 c^2}{96\lambda^2 I^2 L^2 b_1} + \frac{K\rho^2\zeta^2 c^2}{48\lambda^2 I^2 L^2}\Big)\sum_{i=0}^{I-1}\eta_i^3.$$
Multiplying both sides by $\frac{8\lambda I^2}{7K\rho}$, we have:
$$\frac{\lambda I^2}{K\rho}\sum_{i=1}^{I}\eta_i\sum_{k=1}^{K}\mathbb{E}\|\nu_i^{(k)} - \bar\nu_i\|^2 \le \frac{\rho}{84\lambda}\sum_{i=0}^{I-1}\eta_i\,\mathbb{E}\|\bar d_i\|^2 + \Big(\frac{\rho\sigma^2 c^2}{84\lambda L^2 b_1} + \frac{\rho\zeta^2 c^2}{42\lambda L^2}\Big)\sum_{i=0}^{I-1}\eta_i^3.$$
This completes the proof.
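The unrolling step in the proof controls geometric error amplification through $(1+1/I)^I \le e < 3$; a quick numeric check of both the constant and the unrolled recursion (our own illustration):

```python
import numpy as np

# Lemma 10 unrolls a recursion of the form S_i <= (1 + 1/I) S_{i-1} + u_{i-1},
# and controls the amplification with (1 + 1/I)^I <= e < 3.
for I in [1, 2, 5, 10, 100, 1000]:
    assert (1 + 1 / I) ** I < 3

# Unrolled form: S_I = sum_l (1 + 1/I)^(I-1-l) u_l <= 3 * sum_l u_l.
I = 8
u = np.random.default_rng(3).uniform(0, 1, size=I)
S = 0.0
for i in range(I):
    S = (1 + 1 / I) * S + u[i]
assert S <= 3 * u.sum()
print("geometric amplification bounded by e < 3")
```
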

9.8. PROOF OF THE MAIN CONVERGENCE THEOREM

In this subsection, we prove Theorem 5.1 and Corollary 5.7. To prove Theorem 5.1, we first show that the following theorem holds.

Theorem 9.1. Choose the parameters as $\kappa = \frac{\rho K^{2/3}}{\lambda L}$, $c = \frac{96\lambda^2L^2}{K\rho^2} + \frac{\rho}{72\kappa^3\lambda L I^2}$, $w_t = \max\big\{48^3I^6K^2 - t - I,\; 14^3K^{1/2}\big\}$, $\lambda > 0$, and $\eta_t = \frac{\kappa}{(w_t+t+I)^{1/3}}$. Then we have:
$$\frac1T\sum_{t=0}^{T-1}\Big(\mathbb{E}\|\bar d_t\|^2 + \frac{\lambda^2}{\rho^2}\mathbb{E}\|\bar e_t\|^2\Big) \le \Big(\frac{96\lambda^2LI^2}{\rho^2T}+\frac{2\lambda^2L}{\rho^2K^{2/3}T^{2/3}}\Big)(f(x_0)-f^*) + \Big(\frac{72\lambda^2I^4}{b\rho^2T}+\frac{3\lambda^2I^2}{2b\rho^2K^{2/3}T^{2/3}}\Big)\sigma^2 + \frac{192^2\lambda^2}{\rho^2}\Big(\frac{48I^2}{T}+\frac{1}{K^{2/3}T^{2/3}}\Big)\Big(\frac{\sigma^2}{4b_1}+\frac{2\zeta^2}{21}\Big)\log(T+1).$$

Proof. By definition, we have $\eta_t \le \eta_0 < \kappa/w_0^{1/3} = \frac{\rho K^{2/3}}{48\lambda L I^2 K^{2/3}} = \frac{\rho}{48\lambda L I^2}$. Then
$$c = \lambda^2L^2\Big(\frac{96}{K\rho^2}+\frac{1}{72K^2\rho^2I^2}\Big) \le \frac{192\lambda^2L^2}{K\rho^2}, \qquad c\eta_t^2 \le c\eta_0^2 < \frac{192\lambda^2L^2}{K\rho^2}\cdot\frac{\kappa^2}{w_0^{2/3}} = \frac{192\lambda^2L^2}{K\rho^2}\cdot\frac{\rho^2K^{4/3}}{\lambda^2L^2w_0^{2/3}} = \frac{192K^{1/3}}{w_0^{2/3}} \le \frac{192K^{1/3}}{196K^{1/3}} < 1.$$
So we have $\alpha_t<1$, and the conditions of Lemma 8-Lemma 10 are satisfied. Firstly, substituting the gradient consensus error bound of Lemma 10 into Lemma 8, we can write the descent of the potential function as:
$$\begin{aligned}
\mathbb{E}[\Phi_{\tau+1}-\Phi_\tau] &\le -\sum_{i=0}^{I-1}\Big(\frac{5\rho\eta_{\tau+1,i}}{8\lambda}-\frac{\eta_{\tau+1,i}^2L}{2}\Big)\mathbb{E}\|\bar d_{\tau+1,i}\|^2 - \frac{\lambda}{2\rho}\sum_{i=0}^{I-1}\eta_{\tau+1,i}\mathbb{E}\|\bar e_{\tau+1,i}\|^2 + \frac{\sigma^2c^2\rho}{16\lambda L^2b_1}\sum_{i=0}^{I-1}\eta_{\tau+1,i}^3 + \frac{\rho}{42\lambda}\sum_{i=0}^{I-1}\eta_{\tau+1,i}\mathbb{E}\|\bar d_{\tau+1,i}\|^2 + \Big(\frac{\rho\sigma^2c^2}{42\lambda L^2b_1}+\frac{\rho\zeta^2c^2}{21\lambda L^2}\Big)\sum_{i=0}^{I-1}\eta_{\tau+1,i}^3 \\
&\le -\sum_{i=0}^{I-1}\Big(\frac{3\rho\eta_{\tau+1,i}}{5\lambda}-\frac{\eta_{\tau+1,i}^2L}{2}\Big)\mathbb{E}\|\bar d_{\tau+1,i}\|^2 - \frac{\lambda}{2\rho}\sum_{i=0}^{I-1}\eta_{\tau+1,i}\mathbb{E}\|\bar e_{\tau+1,i}\|^2 + \Big(\frac{\rho\sigma^2c^2}{8\lambda L^2b_1}+\frac{\rho\zeta^2c^2}{21\lambda L^2}\Big)\sum_{i=0}^{I-1}\eta_{\tau+1,i}^3 \\
&\overset{(a)}{\le} -\sum_{i=0}^{I-1}\frac{\rho\eta_{\tau+1,i}}{2\lambda}\mathbb{E}\|\bar d_{\tau+1,i}\|^2 - \frac{\lambda}{2\rho}\sum_{i=0}^{I-1}\eta_{\tau+1,i}\mathbb{E}\|\bar e_{\tau+1,i}\|^2 + \Big(\frac{\rho\sigma^2c^2}{8\lambda L^2b_1}+\frac{\rho\zeta^2c^2}{21\lambda L^2}\Big)\sum_{i=0}^{I-1}\eta_{\tau+1,i}^3,
\end{aligned}$$
where (a) follows from the fact that $\eta_i \le \frac{\rho}{48\lambda LI^2} \le \frac{\rho}{48\lambda L}$. Suppose we denote $T=EI$ and $t=\tau I+i$ for $t\ge0$ and $\tau\ge0$. Then we have $\eta_t=\eta_{\tau+1,i}$, $\bar d_t=\bar d_{\tau+1,i}$, $\bar e_t=\bar e_{\tau+1,i}$. In particular, we denote $\eta_{-1}=\eta_0$ for convenience.
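As a sanity check on the parameter choices, the chain $\eta_0 \le \kappa/w_0^{1/3} = \rho/(48\lambda LI^2)$ and $c\eta_t^2 < 1$ can be verified numerically; the constants $\lambda,\rho,L,K,I$ below are arbitrary illustrative values, not from the paper:

```python
# Numeric sanity check of the parameter chain in Theorem 9.1 for
# illustrative sample constants (placeholders, not from the paper).
lam, rho, L, K, I = 2.0, 0.5, 4.0, 16, 5

kappa = rho * K ** (2 / 3) / (lam * L)
# w_t at t = 0: max{48^3 I^6 K^2 - t - I, 14^3 sqrt(K)}
w0 = max(48 ** 3 * I ** 6 * K ** 2 - 0 - I, 14 ** 3 * K ** 0.5)
c = 96 * lam ** 2 * L ** 2 / (K * rho ** 2) + rho / (72 * kappa ** 3 * lam * L * I ** 2)
eta0 = kappa / (w0 + 0 + I) ** (1 / 3)

# eta_0 <= rho / (48 lam L I^2); at t = 0 this holds with (near-)equality,
# so we allow a tiny floating-point tolerance.
assert eta0 <= rho / (48 * lam * L * I ** 2) + 1e-12
# c <= 192 lam^2 L^2 / (K rho^2), hence alpha_t = c * eta_t^2 < 1.
assert c <= 192 * lam ** 2 * L ** 2 / (K * rho ** 2)
assert c * eta0 ** 2 < 1
```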
Then we sum the above inequality for $\tau$ from $0$ to $E-1$, and get:
$$\mathbb{E}[\Phi_E-\Phi_0] \le -\sum_{t=0}^{T-1}\frac{\rho\eta_t}{2\lambda}\mathbb{E}\|\bar d_t\|^2 - \sum_{t=0}^{T-1}\frac{\lambda\eta_t}{2\rho}\mathbb{E}\|\bar e_t\|^2 + \Big(\frac{\rho\sigma^2c^2}{8\lambda L^2b_1}+\frac{\rho\zeta^2c^2}{21\lambda L^2}\Big)\sum_{t=0}^{T-1}\eta_t^3.$$
Rearranging terms, we get:
$$\begin{aligned}
\sum_{t=0}^{T-1}\Big(\frac{\rho\eta_t}{2\lambda}\mathbb{E}\|\bar d_t\|^2+\frac{\lambda\eta_t}{2\rho}\mathbb{E}\|\bar e_t\|^2\Big) &\le \mathbb{E}[\Phi_0-\Phi_E] + \Big(\frac{\rho\sigma^2c^2}{8\lambda L^2b_1}+\frac{\rho\zeta^2c^2}{21\lambda L^2}\Big)\sum_{t=0}^{T-1}\eta_t^3 \\
&\overset{(a)}{\le} f(x_0)-f^* + \frac{\rho K}{64\lambda L^2\eta_0}\mathbb{E}\|\bar e_0\|^2 + \Big(\frac{\rho\sigma^2c^2}{8\lambda L^2b_1}+\frac{\rho\zeta^2c^2}{21\lambda L^2}\Big)\sum_{t=0}^{T-1}\eta_t^3 \\
&\overset{(b)}{\le} f(x_0)-f^* + \frac{\sigma^2\rho}{64\lambda bL^2\eta_0} + \Big(\frac{\rho\sigma^2c^2}{8\lambda L^2b_1}+\frac{\rho\zeta^2c^2}{21\lambda L^2}\Big)\sum_{t=0}^{T-1}\eta_t^3, \qquad (26)
\end{aligned}$$
where (a) follows from the fact that $f^* \le \Phi_E$ and (b) results from an application of Lemma 5, with $b$ the minibatch size at the first iteration. Next, for the last term of equation 26 above, we have:
$$\sum_{t=0}^{T-1}\eta_t^3 = \sum_{t=0}^{T-1}\frac{\kappa^3}{w_t+t+I} \overset{(a)}{\le} \sum_{t=0}^{T-1}\frac{\kappa^3}{1+t} \overset{(b)}{\le} \kappa^3\ln(T+1), \qquad (27)$$
where inequality (a) follows from the fact that $w_t + I > 1$ and inequality (b) follows from an application of Proposition 3. Substituting equation 27 into equation 26, multiplying both sides by $2\lambda/(\rho\eta_T T)$ and using the fact that $\eta_t$ is non-increasing in $t$, we have
$$\frac1T\sum_{t=0}^{T-1}\Big(\mathbb{E}\|\bar d_t\|^2+\frac{\lambda^2}{\rho^2}\mathbb{E}\|\bar e_t\|^2\Big) \le \frac{2\lambda(f(x_0)-f^*)}{\rho\eta_TT} + \frac{1}{\eta_TT}\cdot\frac{\sigma^2}{32bL^2\eta_0} + \frac{\kappa^3}{\eta_TT}\Big(\frac{\sigma^2c^2}{4b_1L^2}+\frac{2\zeta^2c^2}{21L^2}\Big)\ln(T+1). \qquad (28)$$
Now consider each term of equation 28 above separately. For the first term:
$$\frac{1}{\eta_TT} = \frac{(w_T+T+I)^{1/3}}{\kappa T} \overset{(a)}{\le} \frac{(w_T+I)^{1/3}}{\kappa T} + \frac{T^{1/3}}{\kappa T} \overset{(b)}{\le} \frac{48\lambda LI^2}{\rho T}+\frac{\lambda L}{\rho K^{2/3}T^{2/3}}, \qquad (29)$$
where inequality (a) follows from the identity $(x+y)^{1/3}\le x^{1/3}+y^{1/3}$ and inequality (b) follows from the definition of $\kappa$ together with the definition of $w_T$, which gives $w_T + I \le 48^3I^6K^2$. Similarly, for the second term of equation 28, we have from the definitions of $\eta_0$ and $\eta_T$:
$$\frac{1}{\eta_TT}\cdot\frac{\sigma^2}{32bL^2\eta_0} \le \Big(\frac{48\lambda LI^2}{\rho T}+\frac{\lambda L}{\rho K^{2/3}T^{2/3}}\Big)\cdot\frac{\sigma^2}{32bL^2}\cdot\frac{(w_0+I)^{1/3}}{\kappa} \le \Big(\frac{48\lambda LI^2}{\rho T}+\frac{\lambda L}{\rho K^{2/3}T^{2/3}}\Big)\cdot\frac{\sigma^2}{32bL^2}\cdot\frac{48\lambda LI^2}{\rho} \le \frac{72\lambda^2I^4}{b\rho^2T}\sigma^2+\frac{3\lambda^2I^2}{2b\rho^2K^{2/3}T^{2/3}}\sigma^2. \qquad (30)$$
Finally, for the last term in equation 28 above, we have from the definition of the stepsize $\eta_t$ and the bound $c \le 192\lambda^2L^2/(K\rho^2)$:
$$\frac{\kappa^3}{\eta_TT}\Big(\frac{\sigma^2c^2}{4b_1L^2}+\frac{2\zeta^2c^2}{21L^2}\Big)\ln(T+1) \le \frac{192^2\lambda^2}{\rho^2}\Big(\frac{48I^2}{T}+\frac{1}{K^{2/3}T^{2/3}}\Big)\Big(\frac{\sigma^2}{4b_1}+\frac{2\zeta^2}{21}\Big)\log(T+1). \qquad (31)$$
Finally, substituting the bounds obtained in equation 29, equation 30 and equation 31 into equation 28, we get
$$\frac1T\sum_{t=0}^{T-1}\Big(\mathbb{E}\|\bar d_t\|^2+\frac{\lambda^2}{\rho^2}\mathbb{E}\|\bar e_t\|^2\Big) \le \Big(\frac{96\lambda^2LI^2}{\rho^2T}+\frac{2\lambda^2L}{\rho^2K^{2/3}T^{2/3}}\Big)(f(x_0)-f^*) + \Big(\frac{72\lambda^2I^4}{b\rho^2T}+\frac{3\lambda^2I^2}{2b\rho^2K^{2/3}T^{2/3}}\Big)\sigma^2 + \frac{192^2\lambda^2}{\rho^2}\Big(\frac{48I^2}{T}+\frac{1}{K^{2/3}T^{2/3}}\Big)\Big(\frac{\sigma^2}{4b_1}+\frac{2\zeta^2}{21}\Big)\log(T+1).$$
This completes the proof of the theorem.

Now we are ready to show Theorem 5.1. Firstly, notice that:
$$\frac{\lambda^2}{\rho^2}G_t = \frac{1}{\eta_t^2}\|x_t-\tilde x_{t+1}\|^2 + \frac{\lambda^2}{\rho^2}\|\bar\nu_t-\nabla f(x_t)\|^2 = \|\bar d_t\|^2+\frac{\lambda^2}{\rho^2}\|\bar e_t\|^2.$$
Combining this with Theorem 9.1, we have:
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}[G_t] \le \Big(\frac{96LI^2}{T}+\frac{2L}{K^{2/3}T^{2/3}}\Big)(f(x_0)-f^*) + \Big(\frac{72I^4}{bT}+\frac{3I^2}{2bK^{2/3}T^{2/3}}\Big)\sigma^2 + 192^2\Big(\frac{48I^2}{T}+\frac{1}{K^{2/3}T^{2/3}}\Big)\Big(\frac{\sigma^2}{4b_1}+\frac{2\zeta^2}{21}\Big)\log(T+1).$$

Remark 7. For the measure $G_t$, we discuss its intuition in both the unconstrained and the constrained case. First, in the unconstrained case, i.e. when $X=\mathbb{R}^d$, we have:
$$\frac{\|\nabla f(x_{\tau,i})\|}{\|H_\tau\|} = \frac{\|H_\tau H_\tau^{-1}\nabla f(x_{\tau,i})\|}{\|H_\tau\|} \le \|H_\tau^{-1}\nabla f(x_{\tau,i})\| = \|H_\tau^{-1}\nabla f(x_{\tau,i})-H_\tau^{-1}\bar\nu_{\tau,i}+H_\tau^{-1}\bar\nu_{\tau,i}\| \le \|H_\tau^{-1}(\nabla f(x_{\tau,i})-\bar\nu_{\tau,i})\| + \|H_\tau^{-1}\bar\nu_{\tau,i}\| \le \frac1\rho\|\bar\nu_{\tau,i}-\nabla f(x_{\tau,i})\| + \frac{1}{\lambda\eta_{\tau,i}}\|\tilde x_{\tau,i}-\tilde x_{\tau,i+1}\| \le \frac{\sqrt{2G_{\tau,i}}}{\rho}.$$
In the last inequality we use the Jensen inequality, and in the second-to-last inequality we use Assumption 4 and the facts that $\tilde x_{\tau,i+1}=x_{\tau,0}+\lambda H_\tau^{-1}\tilde z_{\tau,i+1}$, $\tilde x_{\tau,i}=x_{\tau,0}+\lambda H_\tau^{-1}\tilde z_{\tau,i}$ and $\eta_{\tau,i}\bar\nu_{\tau,i}=\tilde z_{\tau,i+1}-\tilde z_{\tau,i}$ in the unconstrained case. In other words, we have $\|\nabla f(x_t)\|^2 \le \frac{2\|H_\tau\|^2}{\rho^2}G_{\tau,i}$. Note that the coefficient on the right-hand side is an upper bound on the squared condition number of $H_\tau$. It is a common assumption in the analysis of adaptive gradient methods that $H_t$ has a finite condition number Huang et al. (2021).
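The chain in Remark 7 uses two elementary facts: $\|H^{-1}v\| \le \|v\|/\rho$ when every eigenvalue of $H$ is at least $\rho$ (Assumption 4), and the Jensen-type bound $a+b \le \sqrt2\sqrt{a^2+b^2}$. Both can be checked numerically; the diagonal $H$ below is an illustrative stand-in for the adaptive matrix, not the paper's construction:

```python
import math
import random

random.seed(0)
rho = 0.3  # assumed lower bound on the eigenvalues of H (Assumption 4)
norm = lambda x: math.sqrt(sum(t * t for t in x))

for _ in range(100):
    d = 5
    # Diagonal stand-in for the adaptive matrix H with eigenvalues >= rho.
    H = [rho + random.random() for _ in range(d)]
    v = [random.gauss(0, 1) for _ in range(d)]
    # ||H^{-1} v|| <= ||v|| / rho, since every eigenvalue of H is >= rho.
    Hinv_v = [vi / hi for vi, hi in zip(v, H)]
    assert norm(Hinv_v) <= norm(v) / rho + 1e-12
    # Jensen-type bound used in the last step: a + b <= sqrt(2) sqrt(a^2 + b^2).
    a, b = abs(random.gauss(0, 1)), abs(random.gauss(0, 1))
    assert a + b <= math.sqrt(2) * math.sqrt(a * a + b * b) + 1e-12
```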
In sum, convergence of our measure $G_t$ implies convergence to a first-order stationary point in the unconstrained case. Next, for the constrained case, our measure upper bounds the gradient mapping $\frac{1}{\eta_{\tau+1,i}}\|x_\tau-x^*_{\tau+1,i}\|$, where $x^*_{\tau+1,i}$ is defined as
$$x^*_{\tau+1,i} = \arg\min_{x\in X}\Big\{\langle z^*_{\tau+1,i}, x\rangle + \frac{1}{2\lambda}(x-x_\tau)^TH_\tau(x-x_\tau)\Big\}, \qquad z^*_{\tau+1,i}=\sum_{\ell=0}^{i}-\eta_\ell\nabla f(x_{\tau+1,\ell}),$$
i.e. $z^*_{\tau+1,i}$ is the accumulation of the true gradients. Following Lemma 1, we have:
$$\|x^*_{\tau+1,i}-\tilde x_{\tau+1,i}\| \le \frac{\lambda}{\rho}\|z^*_{\tau+1,i}-\tilde z_{\tau+1,i}\| = \frac{\lambda}{\rho}\Big\|\sum_{\ell=0}^{i-1}\eta_{\tau+1,\ell}\big(\nabla f(x_{\tau+1,\ell})-\bar\nu_{\tau+1,\ell}\big)\Big\| \overset{(a)}{\le} \frac{\lambda}{\rho}\sum_{\ell=0}^{i-1}\eta_{\tau+1,\ell}\|\nabla f(x_{\tau+1,\ell})-\bar\nu_{\tau+1,\ell}\|,$$
where inequality (a) is due to the triangle inequality. Next we have:
$$\|x_\tau-x^*_{\tau+1,i}\| = \|x_\tau-\tilde x_{\tau+1,i}+\tilde x_{\tau+1,i}-x^*_{\tau+1,i}\| \le \|x_\tau-\tilde x_{\tau+1,i}\| + \|\tilde x_{\tau+1,i}-x^*_{\tau+1,i}\| \le \sum_{\ell=0}^{i-1}\eta_{\tau+1,\ell}\|\bar d_{\tau+1,\ell}\| + \frac{\lambda}{\rho}\sum_{\ell=0}^{i-1}\eta_{\tau+1,\ell}\|\nabla f(x_{\tau+1,\ell})-\bar\nu_{\tau+1,\ell}\|.$$
By the Jensen inequality and the definition of the measure in equation 9, we have
$$\eta_t\|\bar d_t\| + \frac{\lambda\eta_t}{\rho}\|\nabla f(x_t)-\bar\nu_t\| \le \frac{\sqrt2\lambda\eta_t}{\rho}\sqrt{G_t},$$
so we have
$$\frac{1}{\eta_{\tau+1,i}}\|x_\tau-x^*_{\tau+1,i}\| \le \frac{\sqrt2\lambda}{\rho}\sum_{\ell=0}^{i-1}\frac{\eta_{\tau+1,\ell}}{\eta_{\tau+1,i}}\sqrt{G_{\tau+1,\ell}} \le \frac{2\sqrt2\lambda}{\rho}\sum_{\ell=0}^{i-1}\sqrt{G_{\tau+1,\ell}},$$
where the last inequality is because of equation 17. In all, when the measure $G_{\tau+1,\ell}\to0$, the gradient mapping $\frac{1}{\eta_{\tau+1,i}}\|x_\tau-x^*_{\tau+1,i}\|$ converges to $0$.

Corollary 2. With the hyper-parameters chosen as in Theorem 9.1, suppose we set $I=O((T/K^2)^{1/6})$ and use a sample minibatch of size $O(I^2)$ in the first step. Then we have:
$$\mathbb{E}[G_t] = O\Big(\frac{f(x_0)-f^*}{K^{2/3}T^{2/3}}\Big) + \tilde O\Big(\frac{\sigma^2}{K^{2/3}T^{2/3}}\Big) + \tilde O\Big(\frac{\zeta^2}{K^{2/3}T^{2/3}}\Big),$$
and to reach an $\epsilon$-stationary point, we need $\tilde O(\epsilon^{-1.5}/K)$ steps and $\tilde O(\epsilon^{-1})$ communication rounds.

Proof. It is straightforward to verify the expression for $\mathbb{E}[G_t]$ in the corollary by applying Theorem 9.1 with the stated choices of $I$ and $b$. As for the gradient and communication complexity of the algorithm, we have the following results. The total number of steps $T$ needed to achieve an $\epsilon$-stationary point, i.e. to make $\tilde O\big(\frac{1}{K^{2/3}T^{2/3}}\big)=\epsilon$, is $\tilde O\big(\frac{1}{K\epsilon^{3/2}}\big)$; this is the gradient complexity. The total number of communication rounds to achieve an $\epsilon$-stationary point is $E=T/I$; since $I=O((T/K^2)^{1/6})$, we have $T/I=\tilde O(K^{1/3}T^{5/6})$. Assume the number of clients is large relative to the number of steps; more specifically, assume $K\ge\sqrt T$.
Then we have $T/I=\tilde O(K^{1/3}T^{5/6})=\tilde O(K^{2/3}T^{2/3})$; in other words, we have $E=\tilde O(\epsilon^{-1})$. This completes the proof of the corollary.
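The complexity arithmetic in the corollary can be spot-checked numerically: solving $K^{2/3}T^{2/3}=1/\epsilon$ for $T$ gives $T=1/(K\epsilon^{3/2})$, and under $K\ge\sqrt T$ the round count $K^{1/3}T^{5/6}$ is dominated by $K^{2/3}T^{2/3}$. The values of $\epsilon$ and $K$ below are arbitrary illustrative choices:

```python
import math

# Spot-check of the complexity claims in Corollary 2 (illustrative numbers).
eps, K = 1e-3, 1000

# Gradient complexity: K^{2/3} T^{2/3} = 1/eps  =>  T = 1 / (K eps^{3/2}).
T = 1 / (K * eps ** 1.5)
assert abs(K ** (2 / 3) * T ** (2 / 3) - 1 / eps) / (1 / eps) < 1e-9

# Communication complexity: with K >= sqrt(T),
# K^{1/3} T^{5/6} <= K^{2/3} T^{2/3} = O(1/eps).
assert K >= math.sqrt(T)
assert K ** (1 / 3) * T ** (5 / 6) <= K ** (2 / 3) * T ** (2 / 3) + 1e-9
```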





Figure 1: Results for the PATHMNIST Yang et al. (2021) dataset. Plots show the Train Accuracy, Test Accuracy, and Density vs. the number of rounds (E in Algorithm 1), respectively. The postfix $L_1$ means we consider the $L_1$ constraint. I is chosen as 5.

Figure 2: Results for the CIFAR10 dataset. From left to right, we show Train Loss, Train Accuracy, Test Loss, and Test Accuracy w.r.t. the number of rounds (E in Algorithm 1), respectively. I is chosen as 5.

Figure 3: Results for the FEMNIST dataset. From left to right, we show Train Loss, Train Accuracy, Test Loss, and Test Accuracy w.r.t. the number of rounds (E in Algorithm 1), respectively. I is chosen as 5.

We compare with the non-adaptive methods FedAvg McMahan et al. (2017), FedCM Xu et al. (2021), STEM Khanduri et al. (2021a), and the adaptive methods FedAdam (Reddi et al., 2020), Local-Adapt Wang et al. (2021), Local-AMSGrad Chen et al. (2020b), and MIME-MVR Praneeth Karimireddy et al. (2020).

Figure 4: Results for the PATHMNIST dataset. Plots show the Train Accuracy, Test Accuracy, and Density vs. the number of rounds (E in Algorithm 1), respectively. The postfix $L_1$ means we consider the $L_1$ constraint.

Figure 5: Results for the CIFAR10 dataset. From left to right, we show Train Loss, Train Accuracy, Test Loss, and Test Accuracy w.r.t. the number of global rounds (E in Algorithm 1), respectively.

Figure 6: Results for the FEMNIST dataset. From left to right, we show Train Loss, Train Accuracy, Test Loss, and Test Accuracy w.r.t. the number of global rounds (E in Algorithm 1), respectively.

In Figure 8, we compare Local-AMSGrad vs. FedDA-2-1 for different values of I; FedDA-2-1 outperforms Local-AMSGrad for all I, with a greater margin for larger I. Note that both Local-AMSGrad and FedDA-2-1 use Adam-style adaptive gradients (equation 6 and equation 7) and have the same communication cost per epoch. In Figure 9, we compare FedAdam and Local-Adapt with FedDA-2-1. All methods use Adam-style adaptive gradients. FedAdam only performs adaptive gradients on the server, while Local-Adapt performs both local and global adaptive gradients, but the state of its local adaptive gradient is refreshed every epoch. We have two observations. First, Local-Adapt shows only a marginal improvement over FedAdam, which suggests that the restart strategy used by Local-Adapt is less effective than our method. Second, both FedAdam and Local-Adapt benefit little from increasing the value of I (compared to our FedDA-2-1); for FedAdam, this shows the limitation of applying adaptive gradients only at the server level. Finally, in Figure 10, we vary I for all four variants of our FedDA. As the figure shows, our framework benefits from more local steps.

Figure 7: Comparison between FedCM vs FedDA-2-1 and FedDA-2-2. From top to bottom, we show I = 5, 10, 20 respectively. The number inside the parentheses is the value of I.

Figure 8: Comparison between Local-AMSGrad vs FedDA-2-1. From top to bottom, we show I = 5, 10, 20 respectively. The number inside the parentheses is the value of I.

Figure 9: Comparison between FedAdam and Local-Adapt vs FedDA-2-1. The number inside the parentheses is the value of I.

Figure 10: Ablation study of local steps I. From top row to the bottom row, we show results for FedDA-1-1, FedDA-1-2, FedDA-2-1 and FedDA-2-2. The number inside the parentheses is the value of I.

Figure 11: Results for heterogeneous CIFAR10 dataset. From left to right, we show Train Loss, Train Accuracy, Test Loss, Test Accuracy w.r.t the number of rounds (E in Algorithm 1), respectively. I is chosen as 5.




Algorithm 1 aggregates and averages dual states at each global round. The adaptive matrix $H_\tau$ is fixed during local updates and is refreshed on the server side at each global round. Since the algorithm uses a new mirror map (adaptive gradient matrix) at each global round, we call our framework restarted dual averaging. Remark 1. In contrast to our dual-averaging strategy, some existing adaptive FL algorithms Praneeth Karimireddy et al. (2020) average the local primal states. In the unconstrained case, the primal and dual spaces are related linearly, but in the constrained case this linearity does not hold, and averaging in the primal space is not equivalent to averaging in the dual space. As we show in the subsequent theoretical analysis, dual averaging leads to convergence in the constrained case. Remark 2. Note that we use the averaged dual states $z_{\tau+1}$ when we update the adaptive matrix (line 11 of Algorithm 1). An alternative choice is to use the most recent gradient Praneeth Karimireddy et al. (2020). In comparison, the dual state aggregates information from the whole round and offers a smoother estimate of the problem's local curvature. Another possible choice, as used in the Local-AMSGrad method Chen et al. (2020b), is to update the state of the adaptive matrix $\mu_\tau$ (see equation 7) locally and then average it in the server synchronization step. The limitation of this approach is that $\mu_\tau$ is not linear w.r.t. the gradient, and thus averaging $\mu_\tau$ does not offer a linear speed-up w.r.t. the number of clients; in contrast, the dual state satisfies linearity. Remark 3. By choosing different update rules U and V, we can create many variants of FedDA. A representative is FedDA-MVR, in which we update ν
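To make Remark 1 concrete, the following toy sketch (hypothetical variable names; a minimal scalar model, not the paper's implementation) contrasts averaging in the dual space with averaging in the primal space: without a constraint the two coincide because the dual-to-primal map $x = x_0 - \lambda H^{-1}z$ is linear, while under a projection they generally differ:

```python
import random

random.seed(1)
lam, h = 0.5, 2.0  # lambda and a scalar stand-in for the adaptive matrix H
x0 = 0.0

def to_primal(z, lo=None, hi=None):
    """Dual-to-primal map x = x0 - lam * H^{-1} z, optionally projected onto [lo, hi]."""
    x = x0 - lam * z / h
    if lo is not None:
        x = min(max(x, lo), hi)
    return x

# Per-client dual states after a round of local dual-averaging updates.
z_clients = [random.gauss(0, 1) for _ in range(8)]
z_avg = sum(z_clients) / len(z_clients)

# Unconstrained: averaging duals then mapping == mapping then averaging primals,
# because the dual-to-primal map is linear.
primal_of_avg = to_primal(z_avg)
avg_of_primals = sum(to_primal(z) for z in z_clients) / len(z_clients)
assert abs(primal_of_avg - avg_of_primals) < 1e-12

# Constrained (projection onto [-0.1, 0.1]): the two aggregation rules generally
# differ, which is why FedDA averages dual states rather than projected primals.
primal_of_avg_c = to_primal(z_avg, -0.1, 0.1)
avg_of_primals_c = sum(to_primal(z, -0.1, 0.1) for z in z_clients) / len(z_clients)
print(primal_of_avg_c, avg_of_primals_c)
```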


