CONVERGENT ADAPTIVE GRADIENT METHODS IN DECENTRALIZED OPTIMIZATION

Abstract

Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks, in the past few years. Meanwhile, given the need for distributed training procedures, distributed optimization algorithms are at the center of attention. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In that regard, more and more attention is shifted from the traditional parameter server training paradigm to the decentralized one, which usually requires lower communication costs. In this paper, we rigorously incorporate adaptive gradient methods into decentralized training procedures and introduce novel convergent decentralized adaptive gradient methods. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed algorithmic framework and show that if a given adaptive gradient method converges, under some specific conditions, then its decentralized counterpart is also convergent.

1. INTRODUCTION

Distributed training of machine learning models has drawn growing attention in the past few years due to its practical benefits and necessities. Given the evolution of the computing capabilities of CPUs and GPUs, the running time in distributed settings is, in many circumstances, gradually dominated by communication rather than computation (Chilimbi et al., 2014; McMahan et al., 2017). As a result, a large number of recent works have focused on reducing the communication cost of distributed learning (Alistarh et al., 2017; Lin et al., 2018; Wangni et al., 2018; Stich et al., 2018; Wang et al., 2018; Tang et al., 2019). In the traditional parameter (central) server setting, where a parameter server is employed to manage communication in the whole network, many effective communication reductions have been proposed based on gradient compression (Aji & Heafield, 2017) and quantization (Chen et al., 2010; Ge et al., 2013; Jegou et al., 2010) techniques. Despite these communication reduction techniques, the communication cost usually still scales linearly with the number of workers. Due to this limitation and the sheer number of decentralized devices, the decentralized training paradigm (Duchi et al., 2011b), where the parameter server is removed and each node only communicates with its neighbors, is drawing attention. It has been shown in Lian et al. (2017) that decentralized training algorithms can outperform parameter server-based algorithms when the training bottleneck is the communication cost. The decentralized paradigm is also preferred when a central parameter server is not available. In light of recent advances in nonconvex optimization, an effective way to accelerate training is to use adaptive gradient methods like AdaGrad (Duchi et al., 2011a), Adam (Kingma & Ba, 2015) or AMSGrad (Reddi et al., 2018).
Their popularity is due to their practical benefits in training neural networks, namely faster convergence and easier parameter tuning compared with Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951). Despite a large number of studies in the distributed optimization literature, few works have considered bringing adaptive gradient methods into distributed training, largely due to the lack of understanding of their convergence behavior. Notably, Reddi et al. (2020) develop the first decentralized ADAM method for distributed optimization problems with a direct application to federated learning. An inner loop is employed to compute mini-batch gradients on each node and a global adaptive step is applied to update the global parameter at each outer iteration. Yet, in the setting of our paper, nodes can only communicate with their neighbors on a fixed communication graph, while server/worker communication is required in Reddi et al. (2020). Designing adaptive methods in such a setting is highly non-trivial due to the already complex update rules and to the interaction between adaptive learning rates and the decentralized communication protocol. This paper is an attempt at bridging the gap between both realms in nonconvex optimization. Our contributions are summarized as follows:
• We investigate the possibility of using adaptive gradient methods in the decentralized training paradigm, where nodes have only a local view of the whole communication graph. We develop a general technique that converts an adaptive gradient method from a centralized method to its decentralized variant.
• By using our proposed technique, we present a new decentralized optimization algorithm, called decentralized AMSGrad, as the decentralized counterpart of AMSGrad.
• We provide a theoretical verification interface, in Theorem 2, for analyzing the behavior of decentralized adaptive gradient methods obtained as a result of our technique. Thus, we characterize the convergence rate of decentralized AMSGrad, which is the first convergent decentralized adaptive gradient method, to the best of our knowledge.
A novel technique in our framework is a mechanism to enforce a consensus on adaptive learning rates at different nodes. We show the importance of consensus on adaptive learning rates by exhibiting a problem instance on which a recently proposed decentralized adaptive gradient method, namely DADAM (Nazari et al., 2019), a decentralized version of AMSGrad, fails to converge. Though consensus is performed on the model parameter, DADAM lacks consensus principles on adaptive learning rates. After presenting related work and important concepts of decentralized adaptive methods in Section 2, we develop in Section 3 our general framework for converting any adaptive gradient algorithm into its decentralized counterpart, along with a rigorous finite-time convergence analysis, and conclude with some illustrative examples of our framework's behavior in practice.
Notations: $x_{t,i}$ denotes variable $x$ at node $i$ and iteration $t$. $\|\cdot\|_{abs}$ denotes the entry-wise $L_1$ norm of a matrix, i.e. $\|A\|_{abs} = \sum_{i,j} |A_{i,j}|$. We introduce important notations used throughout the paper: for any $t > 0$, $G_t := [g_{t,N}]$ where $[g_{t,N}]$ denotes the matrix $[g_{t,1}, g_{t,2}, \cdots, g_{t,N}]$ (where $g_{t,i}$ is a column vector), $M_t := [m_{t,N}]$, $X_t := [x_{t,N}]$, $\nabla f(X_t) := \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x_{t,i})$, $U_t := [u_{t,N}]$, $\tilde U_t := [\tilde u_{t,N}]$, $V_t := [v_{t,N}]$, $\hat V_t := [\hat v_{t,N}]$, $\overline{X}_t := \frac{1}{N}\sum_{i=1}^{N} x_{t,i}$, $\overline{U}_t := \frac{1}{N}\sum_{i=1}^{N} u_{t,i}$ and $\overline{\tilde U}_t := \frac{1}{N}\sum_{i=1}^{N} \tilde u_{t,i}$.

2.1. RELATED WORK

Decentralized optimization: Traditional decentralized optimization methods include well-known algorithms such as ADMM (Boyd et al., 2011), Dual Averaging (Duchi et al., 2011b), and Distributed Subgradient Descent (Nedic & Ozdaglar, 2009). More recent algorithms include Extra (Shi et al., 2015), Next (Di Lorenzo & Scutari, 2016), Prox-PDA (Hong et al., 2017), GNSD (Lu et al., 2019), and Choco-SGD (Koloskova et al., 2019). While these algorithms are commonly used in applications other than deep learning, recent algorithmic advances in the machine learning community have shown that decentralized optimization can also be useful for training deep models such as neural networks. Lian et al. (2017) demonstrate that a stochastic version of Decentralized Subgradient Descent can outperform parameter server-based algorithms when the communication cost is high. Tang et al. (2018) propose the $D^2$ algorithm, improving the convergence rate over Stochastic Subgradient Descent. Assran et al. (2019) propose Stochastic Gradient Push, which is more robust to network failures when training neural networks. The study of decentralized training algorithms in the machine learning community is only at its initial stage. No existing work, to our knowledge, has seriously considered integrating adaptive gradient methods in the setting of decentralized learning. One noteworthy work (Nazari et al., 2019) proposes a decentralized version of AMSGrad (Reddi et al., 2018), which is proven to satisfy a non-standard regret bound.
Adaptive gradient methods: Adaptive gradient methods have been popular in recent years due to their superior performance in training neural networks. The most commonly used adaptive methods include AdaGrad (Duchi et al., 2011a), Adam (Kingma & Ba, 2015) and their variants. Key features of such methods lie in the use of momentum and adaptive learning rates (which means that the learning rate changes during the optimization and is anisotropic, i.e. depends on the dimension).
The method of reference, called Adam, has been analyzed in Reddi et al. (2018), where the authors point out an error in previous convergence analyses. Since then, a variety of papers have focused on analyzing the convergence behavior of the numerous existing adaptive gradient methods. Ward et al. (2019) and Li & Orabona (2019) derive convergence guarantees for a variant of AdaGrad without coordinate-wise learning rates. Chen et al. (2019) analyze the convergence behavior of a broad class of algorithms including AMSGrad and AdaGrad. Zou & Shen (2018) provide a unified convergence analysis for AdaGrad with momentum. Notable recent works on adaptive gradient methods can be found in Agarwal et al. (2019); Luo et al. (2019); Zaheer et al. (2018).

2.2. DECENTRALIZED OPTIMIZATION

In distributed optimization (with $N$ nodes), we aim at solving the following problem

$$\min_{x \in \mathbb{R}^d} \frac{1}{N} \sum_{i=1}^{N} f_i(x), \qquad (1)$$

where $x$ is the vector of parameters and $f_i$ is only accessible by the $i$th node. Through the prism of empirical risk minimization procedures, $f_i$ can be viewed as the average loss of the data samples located at node $i$, for all $i \in [N]$. Throughout the paper, we make the following mild assumptions required for analyzing the convergence behavior of the different decentralized optimization algorithms:
A1. For all $i \in [N]$, $f_i$ is differentiable and its gradient is $L$-Lipschitz, i.e., for all $x, y \in \mathbb{R}^d$, $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|$.
A2. We assume that, at iteration $t$, node $i$ accesses a stochastic gradient $g_{t,i}$. The stochastic gradients and the gradients of $f_i$ have bounded $L_\infty$ norms, i.e. $\|g_{t,i}\|_\infty \le G_\infty$, $\|\nabla f_i(x)\|_\infty \le G_\infty$.
A3. The gradient estimators are unbiased and each coordinate has bounded variance, i.e. $\mathbb{E}[g_{t,i}] = \nabla f_i(x_{t,i})$ and $\mathbb{E}[([g_{t,i} - \nabla f_i(x_{t,i})]_j)^2] \le \sigma^2$, $\forall t, i, j$.
Assumptions A1 and A3 are standard in the distributed optimization literature. A2 is slightly stronger than the traditional assumption that the estimator has bounded variance, but is commonly used for the analysis of adaptive gradient methods (Chen et al., 2019; Ward et al., 2019). Note that the bounded gradient estimator assumption in A2 implies the bounded variance assumption in A3. In decentralized optimization, the nodes are connected as a graph and each node only communicates with its neighbors. In such a case, one usually constructs an $N \times N$ matrix $W$ for information sharing when designing new algorithms. We denote by $\lambda_i$ its $i$th largest eigenvalue and define $\lambda := \max(|\lambda_2|, |\lambda_N|)$. The matrix $W$ cannot be arbitrary; its required key properties are listed in the following assumption:
A4. The matrix $W$ satisfies: (I) $\sum_{j=1}^{N} W_{i,j} = 1$, $\sum_{i=1}^{N} W_{i,j} = 1$, $W_{i,j} \ge 0$, (II) $\lambda_1 = 1$, $|\lambda_2| < 1$, $|\lambda_N| < 1$ and (III) $W_{i,j} = 0$ if node $i$ and node $j$ are not neighbors.
We now show that a current decentralized adaptive method can fail to converge, before introducing our proposed framework for general decentralized adaptive gradient methods.
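As a concrete illustration of assumption A4, the sketch below builds a mixing matrix for a cycle graph and checks parts (I) and (II). The uniform weight 1/3 on each node and its two neighbors, the helper names, and the use of the closed-form eigenvalues of a circulant matrix are assumptions made for this example; they are not prescribed by the analysis.

```python
import math

def ring_mixing_matrix(n):
    """Mixing matrix on a cycle: each node weights itself and its two neighbors by 1/3."""
    if n < 3:
        raise ValueError("a cycle with distinct left/right neighbors needs n >= 3")
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            w[i][j % n] = 1.0 / 3.0
    return w

def satisfies_a4_part_one(w, tol=1e-12):
    """Check A4 (I): rows and columns sum to one and all entries are non-negative."""
    n = len(w)
    rows_ok = all(abs(sum(w[i]) - 1.0) < tol for i in range(n))
    cols_ok = all(abs(sum(w[i][j] for i in range(n)) - 1.0) < tol for j in range(n))
    nonneg = all(w[i][j] >= 0.0 for i in range(n) for j in range(n))
    return rows_ok and cols_ok and nonneg

def ring_lambda(n):
    """lambda = max(|lambda_2|, |lambda_N|) for the matrix above.

    The matrix is circulant, so its eigenvalues are 1/3 + (2/3) cos(2 pi k / n).
    """
    eigs = sorted((1.0 / 3.0 + 2.0 / 3.0 * math.cos(2.0 * math.pi * k / n)
                   for k in range(n)), reverse=True)
    return max(abs(eigs[1]), abs(eigs[-1]))

W = ring_mixing_matrix(5)
lam = ring_lambda(5)
print(satisfies_a4_part_one(W), lam)
```

A smaller `lam` means faster mixing; the factor $1/(1-\lambda)$ that appears in the constants of the convergence results grows as the graph becomes more poorly connected.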

2.3. DIVERGENCE OF DADAM

DADAM (Algorithm 1) is essentially a decentralized version of ADAM, and the key modification is the use of a consensus step on the optimization variable $x$ to transmit information across the network, encouraging its convergence. The matrix $W$ is a doubly stochastic matrix (which satisfies A4) for achieving average consensus of $x$. Introducing such a mixing matrix is standard for decentralizing an algorithm, such as distributed gradient descent (Nedic & Ozdaglar, 2009; Yuan et al., 2016). It is proven in Nazari et al. (2019) that DADAM admits a non-standard regret bound in the online setting. Nevertheless, whether the algorithm can converge to stationary points in standard offline settings, such as training neural networks, is still unknown. The next theorem shows that DADAM may fail to converge in the offline setting.

Algorithm 1 DADAM (with N nodes)
1: Input: $\alpha$, current point $X_t$, $u_{1/2,i} = \hat v_{0,i} = \epsilon \mathbf{1}$, $m_0 = 0$ and mixing matrix $W$
2: for $t = 1, 2, \cdots, T$ do
3: for all $i \in [N]$ do in parallel
4: $g_{t,i} \leftarrow \nabla f_i(x_{t,i}) + \xi_{t,i}$
5: $m_{t,i} = \beta_1 m_{t-1,i} + (1 - \beta_1) g_{t,i}$
6: $v_{t,i} = \beta_2 v_{t-1,i} + (1 - \beta_2) g_{t,i}^2$
7: $\hat v_{t,i} = \beta_3 \hat v_{t-1,i} + (1 - \beta_3) \max(\hat v_{t-1,i}, v_{t,i})$
8: $x_{t+\frac{1}{2},i} = \sum_{j=1}^{N} W_{ij} x_{t,j}$
9: $x_{t+1,i} = x_{t+\frac{1}{2},i} - \alpha \frac{m_{t,i}}{\sqrt{\hat v_{t,i}}}$

Theorem 1. There exists a problem satisfying A1-A4 where DADAM fails to converge to a stationary point with $\nabla f(\overline{X}_t) = 0$.
Proof. Consider a two-node setting with objective function $f(x) = \frac{1}{2}\sum_{i=1}^{2} f_i(x)$ and $f_1(x) = \mathbb{1}[|x| \le 1]\, 2x^2 + \mathbb{1}[|x| > 1](4|x| - 2)$, $f_2(x) = \mathbb{1}[|x-1| \le 1](x-1)^2 + \mathbb{1}[|x-1| > 1](2|x-1| - 1)$. We set the mixing matrix $W = [0.5, 0.5; 0.5, 0.5]$. The optimal solution is $x^* = 1/3$. Both $f_1$ and $f_2$ are smooth and convex, with gradient norms bounded by 4 and 2, respectively. We also have $L = 4$ (defined in A1). If we initialize with $x_{1,1} = x_{1,2} = -1$ and run DADAM with $\beta_1 = \beta_2 = \beta_3 = 0$ and $\epsilon \le 1$, we get $\hat v_{1,1} = 16$ and $\hat v_{1,2} = 4$. Since $\beta_3 = 0$ makes $\hat v_{t,i}$ a running maximum and the squared gradients of $f_1$ and $f_2$ never exceed 16 and 4, these values persist for all $t$, so node 1 and node 2 rescale their gradients by $1/4$ and $1/2$, respectively. DADAM then coincides with decentralized gradient descent on $f'(x) = \sum_{i=1}^{2} f'_i(x)$ with $f'_1(x) = 0.25 f_1(x)$ and $f'_2(x) = 0.5 f_2(x)$, which has the unique optimum $x' = 0.5$.
Define $\bar x_t = (x_{t,1} + x_{t,2})/2$. Then, by Th. 2 in Yuan et al. (2016), when $\alpha < 1/4$ we have $f'(\bar x_t) - f'(x') = O(1/(\alpha t))$. Since $f'$ has a unique optimum $x'$, the above bound implies that $\bar x_t$ converges to $x' = 0.5$, where the original objective has non-zero gradient $\nabla f(0.5) = 0.5$.
Theorem 1 shows that, even though DADAM is proven to satisfy some regret bounds (Nazari et al., 2019), it can fail to converge to stationary points in the nonconvex offline setting (common for training neural networks). We conjecture that this inconsistency in the convergence behavior of DADAM is due to the definition of the regret in Nazari et al. (2019). The next section presents decentralized adaptive gradient methods that are guaranteed to converge to stationary points under our assumptions, and provides a characterization of that convergence in finite time, independently of the initialization.
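The counterexample in the proof of Theorem 1 can be reproduced numerically. The following is a minimal sketch, assuming noiseless gradients ($\xi_{t,i} = 0$), $\epsilon = 1$, a constant step size $\alpha = 0.01$ and 5000 iterations; the function names are hypothetical and chosen only for this illustration.

```python
import math

def grad_f1(x):
    # f1(x) = 2x^2 on |x| <= 1, and 4|x| - 2 outside
    return 4.0 * x if abs(x) <= 1.0 else math.copysign(4.0, x)

def grad_f2(x):
    # f2(x) = (x - 1)^2 on |x - 1| <= 1, and 2|x - 1| - 1 outside
    return 2.0 * (x - 1.0) if abs(x - 1.0) <= 1.0 else math.copysign(2.0, x - 1.0)

def grad_f(x):
    # Gradient of the global objective f = (f1 + f2) / 2
    return 0.5 * (grad_f1(x) + grad_f2(x))

def run_dadam(alpha=0.01, steps=5000, eps=1.0):
    """DADAM on two nodes with beta1 = beta2 = beta3 = 0 and W = [[.5, .5], [.5, .5]]."""
    grads = (grad_f1, grad_f2)
    x = [-1.0, -1.0]
    v_hat = [eps, eps]
    for _ in range(steps):
        g = [grads[i](x[i]) for i in range(2)]
        # With all betas zero: m = g, v = g^2, v_hat = max(previous v_hat, v)
        v_hat = [max(v_hat[i], g[i] ** 2) for i in range(2)]
        x_half = 0.5 * (x[0] + x[1])  # consensus step on x only
        x = [x_half - alpha * g[i] / math.sqrt(v_hat[i]) for i in range(2)]
    return 0.5 * (x[0] + x[1]), v_hat

x_bar, v_hat = run_dadam()
print(x_bar, v_hat, grad_f(x_bar))
```

Under these assumptions the average iterate settles near 0.5 rather than at the stationary point $x^* = 1/3$, and the local second-moment estimates stay at 16 and 4, which is exactly the mismatch in adaptive learning rates that the proof exploits.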

3. CONVERGENCE OF DECENTRALIZED ADAPTIVE GRADIENT METHODS

In this section, we discuss the difficulties of designing adaptive gradient methods in decentralized optimization and introduce an algorithmic framework that can turn some existing convergent adaptive gradient methods to their decentralized counterparts. We also develop the first convergent decentralized adaptive gradient method, converted from AMSGrad, as an instance of this framework.

3.1. IMPORTANCE AND DIFFICULTIES OF CONSENSUS ON ADAPTIVE LEARNING RATES

The divergent example in the previous section implies that we should synchronize the adaptive learning rates on different nodes. This can be easily achieved in the parameter server setting, where all the nodes send their gradients to a central server at each iteration. The parameter server can then exploit the received gradients to maintain a sequence of synchronized adaptive learning rates when updating the parameters, see Reddi et al. (2020). However, in our decentralized setting, every node can only communicate with its neighbors and no such central server exists. Under that setting, the information for updating the adaptive learning rates can only be shared locally instead of being broadcast over the whole network. This makes it impossible to obtain, in a single iteration, a synchronized adaptive learning rate update using all the information in the network.
Systemic Approach: On a systemic level, one way to alleviate this bottleneck is to design communication protocols that give each node access to the same aggregated gradients over the whole network, at least periodically if not at every iteration. The nodes can then update their individual adaptive learning rates based on the same shared information. However, such a solution may introduce an extra communication cost since it involves broadcasting the information over the whole network.
Algorithmic Approach: Our contributions being on an algorithmic level, another way to solve the aforementioned problem is to let the sequences of adaptive learning rates, present on different nodes, gradually reach consensus through the iterations. Intuitively, if the adaptive learning rates reach consensus fast enough, the difference among the adaptive learning rates on different nodes will not affect the convergence behavior of the algorithm. Consequently, no extra communication costs need to be introduced.
We now develop this exact idea within the existing adaptive methods, stressing the need for a relatively low-cost and easy-to-implement consensus of adaptive learning rates.

3.2. DECENTRALIZED ADAPTIVE GRADIENT UNIFYING FRAMEWORK

We now develop a method that implements consensus of adaptive learning rates. While each node can have a different $\hat v_{t,i}$ in DADAM (Algorithm 1), one can keep track of the min/max/average of these adaptive learning rates and use that quantity as the new adaptive learning rate. The predefinition of some convergent lower and upper bounds may also lead to a gradual synchronization of the adaptive learning rates on different nodes, as developed for AdaBound in Luo et al. (2019). In this paper, we present an algorithmic framework for decentralized adaptive gradient methods as Algorithm 2, which uses average consensus of $\hat v_{t,i}$ (see the consensus update in lines 8 and 11) to help convergence.

Algorithm 2 Decentralized Adaptive Gradient Method (with N nodes)
1: Input: $\alpha$, initial point $x_{1,i} = x_{init}$, $\tilde u_{\frac{1}{2},i} = \hat v_{0,i}$, $m_{0,i} = 0$, mixing matrix $W$
2: for $t = 1, 2, \cdots, T$ do
3: for all $i \in [N]$ do in parallel
4: $g_{t,i} \leftarrow \nabla f_i(x_{t,i}) + \xi_{t,i}$
5: $m_{t,i} = \beta_1 m_{t-1,i} + (1 - \beta_1) g_{t,i}$
6: $\hat v_{t,i} = r_t(g_{1,i}, \cdots, g_{t,i})$
7: $x_{t+\frac{1}{2},i} = \sum_{j=1}^{N} W_{ij} x_{t,j}$
8: $\tilde u_{t,i} = \sum_{j=1}^{N} W_{ij} \tilde u_{t-\frac{1}{2},j}$
9: $u_{t,i} = \max(\tilde u_{t,i}, \epsilon)$
10: $x_{t+1,i} = x_{t+\frac{1}{2},i} - \alpha \frac{m_{t,i}}{\sqrt{u_{t,i}}}$
11: $\tilde u_{t+\frac{1}{2},i} = \tilde u_{t,i} - \hat v_{t-1,i} + \hat v_{t,i}$
12: end for

Algorithm 2 can become different adaptive gradient methods by specifying $r_t$ as different functions. E.g., when we choose $\hat v_{t,i} = \frac{1}{t}\sum_{k=1}^{t} g_{k,i}^2$, Algorithm 2 becomes a decentralized version of AdaGrad. When one chooses $\hat v_{t,i}$ to be the adaptive learning rate of AMSGrad, we get decentralized AMSGrad (Algorithm 3). The intuition behind using average consensus is that for adaptive gradient methods such as AdaGrad or Adam, $\hat v_{t,i}$ approximates the second moment of the gradient estimator, so the average of the estimates of those second moments from different nodes is an estimate of the second moment on the whole network. Also, this design does not introduce any extra hyperparameters that could complicate the tuning process ($\epsilon$ in line 9 is important for numerical stability, as in vanilla Adam).
The main convergence result for the framework in Algorithm 2 reads as follows:
Theorem 2. Assume A1-A4. When $\alpha \le \frac{\epsilon^{0.5}}{16L}$, Algorithm 2 yields the following regret bound

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\left\|\frac{\nabla f(\overline{X}_t)}{\overline{U}_t^{1/4}}\right\|^2\right] \le C_1\left(\frac{1}{T\alpha}\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \alpha\frac{d\sigma^2}{N}\right) + C_2\alpha^2 d + C_3\alpha^3 d + \frac{1}{T\sqrt{N}}(C_4 + C_5\alpha)\,\mathbb{E}\left[\sum_{t=1}^{T}\|-\hat V_{t-2} + \hat V_{t-1}\|_{abs}\right] \qquad (2)$$

where $\|\cdot\|_{abs}$ denotes the entry-wise $L_1$ norm of a matrix (i.e. $\|A\|_{abs} = \sum_{i,j}|A_{ij}|$). The constants $C_1 = \max(4, 4L/\epsilon)$, $C_2 = 6\left((\beta_1/(1-\beta_1))^2 + 1/(1-\lambda)^2\right) L G_\infty^2/\epsilon^{1.5}$, $C_3 = 16L^2(1-\lambda)G_\infty^2/\epsilon^2$, $C_4 = 2/(\epsilon^{1.5}(1-\lambda))\,(\lambda + \beta_1/(1-\beta_1))\,G_\infty^2$, $C_5 = 2/(\epsilon^2(1-\lambda))\,L(\lambda + \beta_1/(1-\beta_1))\,G_\infty^2 + 4/(\epsilon^2(1-\lambda))\,L G_\infty^2$ are independent of $d$, $T$ and $N$. In addition, $\frac{1}{N}\sum_{i=1}^{N}\|x_{t,i} - \overline{X}_t\|^2 \le \alpha^2\left(\frac{1}{1-\lambda}\right)^2 d G_\infty^2 \frac{1}{\epsilon}$, which quantifies the consensus error.
Theorem 2 shows how the convergence guarantee is affected by different factors. In addition, one can specify $\alpha$ to show convergence in terms of $T$, $d$, and $N$. An immediate result is obtained by setting $\alpha = \sqrt{N}/\sqrt{Td}$, which is shown in Corollary 2.1.
Corollary 2.1. Assume A1-A4. Set $\alpha = \sqrt{N}/\sqrt{Td}$. When $\alpha \le \frac{\epsilon^{0.5}}{16L}$, Algorithm 2 yields the following regret bound

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\left\|\frac{\nabla f(\overline{X}_t)}{\overline{U}_t^{1/4}}\right\|^2\right] \le C_1\frac{\sqrt{d}}{\sqrt{TN}}\left(\mathbb{E}[f(Z_1)] - \min_x f(x) + \sigma^2\right) + C_2\frac{N}{T} + C_3\frac{N^{1.5}}{T^{1.5}d^{0.5}} + \left(C_4\frac{1}{T\sqrt{N}} + C_5\frac{1}{T^{1.5}d^{0.5}}\right)\mathbb{E}[\mathcal{V}_T] \qquad (3)$$

where $\mathcal{V}_T := \sum_{t=1}^{T}\|-\hat V_{t-2} + \hat V_{t-1}\|_{abs}$ and $C_1, C_2, C_3, C_4, C_5$ are defined in Theorem 2.
Corollary 2.1 shows that if $\mathbb{E}[\mathcal{V}_T] = o(T)$, the right-hand side of (3) vanishes as $T$ grows, so convergence is guaranteed. The intuition for why $\mathbb{E}[\mathcal{V}_T] = o(T)$ can guarantee convergence is the following. The correlation between $\hat v_{t,i}$ and $m_{t,i}$ (due to their shared dependency on historical gradients) can make the update direction negatively correlated with the true gradient in expectation, leading to a non-negligible bias in the updates. However, the total bias across $T$ iterations introduced by such a correlation is bounded by the term $\mathbb{E}[\mathcal{V}_T]$. Thus, if $\mathbb{E}[\mathcal{V}_T]$ grows sublinearly with $T$, convergence can still be guaranteed. Furthermore, Corollary 2.1 conveys the benefits of using more nodes in the graph.
When $T$ is large enough such that the term $O(\sqrt{d}/\sqrt{TN})$ dominates the RHS of (3), linear speedup can be achieved by increasing $N$. We now present, in Algorithm 3, a notable special case of our algorithmic framework, namely Decentralized AMSGrad, which is a decentralized variant of AMSGrad.

Algorithm 3 Decentralized AMSGrad (with N nodes)
1: Input: learning rate $\alpha$, initial point $x_{1,i} = x_{init}$, $\tilde u_{\frac{1}{2},i} = \hat v_{0,i} = \epsilon \mathbf{1}$ (with $\epsilon \ge 0$), $m_{0,i} = 0$, mixing matrix $W$
2: for $t = 1, 2, \cdots, T$ do
3: for all $i \in [N]$ do in parallel
4: $g_{t,i} \leftarrow \nabla f_i(x_{t,i}) + \xi_{t,i}$
5: $m_{t,i} = \beta_1 m_{t-1,i} + (1 - \beta_1) g_{t,i}$
6: $v_{t,i} = \beta_2 v_{t-1,i} + (1 - \beta_2) g_{t,i}^2$
7: $\hat v_{t,i} = \max(\hat v_{t-1,i}, v_{t,i})$
8: $x_{t+\frac{1}{2},i} = \sum_{j=1}^{N} W_{ij} x_{t,j}$
9: $\tilde u_{t,i} = \sum_{j=1}^{N} W_{ij} \tilde u_{t-\frac{1}{2},j}$
10: $u_{t,i} = \max(\tilde u_{t,i}, \epsilon)$
11: $x_{t+1,i} = x_{t+\frac{1}{2},i} - \alpha \frac{m_{t,i}}{\sqrt{u_{t,i}}}$
12: $\tilde u_{t+\frac{1}{2},i} = \tilde u_{t,i} - \hat v_{t-1,i} + \hat v_{t,i}$
13: end for

Compared with DADAM, the above algorithm exhibits a dynamic average consensus mechanism to keep track of the average of $\{\hat v_{t,i}\}_{i=1}^{N}$, stored as $\tilde u_{t,i}$ on the $i$th node, and uses $u_{t,i} := \max(\tilde u_{t,i}, \epsilon)$ for updating the adaptive learning rate of the $i$th node. As the number of iterations grows, even though $\hat v_{t,i}$ on different nodes can converge to different constants, the $u_{t,i}$ will converge to the same number $\lim_{t\to\infty}\frac{1}{N}\sum_{i=1}^{N}\hat v_{t,i}$ if the limit exists. This average consensus mechanism enables the consensus of adaptive learning rates on different nodes, which accordingly guarantees the convergence of the method to stationary points. The consensus of adaptive learning rates is the key difference between decentralized AMSGrad and DADAM and is the reason why decentralized AMSGrad is convergent while DADAM is not.
One may notice that decentralized AMSGrad does not reduce to AMSGrad for $N = 1$, since the quantity $u_{t,i}$ in line 10 is calculated based on $\hat v_{t-1,i}$ instead of $\hat v_{t,i}$. This design encourages the execution of gradient computation and communication in a parallel manner.
Specifically, lines 4-7 (lines 4-6) in Algorithm 3 (Algorithm 2) can be executed in parallel with lines 8-9 (lines 7-8) to overlap communication and computation time. If $u_{t,i}$ depended on $\hat v_{t,i}$, which in turn depends on $g_{t,i}$, the gradient computation would have to finish before the consensus step of the adaptive learning rate in line 9. This can slow down the per-iteration running time of the algorithm. To avoid such delayed adaptive learning rates, one option is to add $\tilde u_{t-\frac{1}{2},i} = \tilde u_{t-1,i} - \hat v_{t-1,i} + \hat v_{t,i}$ before line 9 and to remove line 12 in Algorithm 3 (respectively line 8 and line 11 in Algorithm 2). Similar convergence guarantees will hold, since one can easily modify our proof of Theorem 2 for such an update rule. As stated above, Algorithm 3 converges with the following rate:
Theorem 3. Assume A1-A4. Set $\alpha = \sqrt{N}/\sqrt{Td}$. When $\alpha \le \frac{\epsilon^{0.5}}{16L}$, Algorithm 3 yields the following regret bound

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\left\|\frac{\nabla f(\overline{X}_t)}{\overline{U}_t^{1/4}}\right\|^2\right] \le C'_1\frac{\sqrt{d}}{\sqrt{TN}}\left(D_f + \sigma^2\right) + C'_2\frac{N}{T} + C'_3\frac{N^{1.5}}{T^{1.5}d^{0.5}} + C'_4\frac{\sqrt{N}d}{T} + C'_5\frac{N d^{0.5}}{T^{1.5}}$$

where $D_f := \mathbb{E}[f(Z_1)] - \min_x f(x)$, $C'_1 = C_1$, $C'_2 = C_2$, $C'_3 = C_3$, $C'_4 = C_4 G_\infty^2$ and $C'_5 = C_5 G_\infty^2$. $C_1, C_2, C_3, C_4, C_5$ are constants independent of $d$, $T$ and $N$ defined in Theorem 2. In addition, the consensus of variables at different nodes is given by $\frac{1}{N}\sum_{i=1}^{N}\|x_{t,i} - \overline{X}_t\|^2 \le \frac{N}{T}\left(\frac{1}{1-\lambda}\right)^2 G_\infty^2\frac{1}{\epsilon}$.
Theorem 3 shows that Algorithm 3 converges with a rate of $O(\sqrt{d}/\sqrt{T})$ when $T$ is large, which is the best known convergence rate under the given assumptions. Note that in some related works, SGD admits a convergence rate of $O(1/\sqrt{T})$ without any dependence on the dimension of the problem. Such an improved convergence rate is derived under the assumption that the gradient estimator has a bounded $L_2$ norm, which can thus hide a dependency on $\sqrt{d}$ in the final convergence rate. Another remark is that the convergence measure can be converted to $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|\nabla f(\overline{X}_t)\|^2\right]$ using the fact that $\|\overline{U}_t\|_\infty \le G_\infty^2$ (by the update rule of Algorithm 3), for ease of comparison with the existing literature.
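The dynamic average consensus on the adaptive learning rates (lines 8 and 11 of Algorithm 2) can be checked in isolation. The sketch below is an illustration under assumptions: scalar $\hat v_{t,i}$ per node, a 5-node cycle with uniform weights 1/3, and synthetic non-decreasing $\hat v_{t,i}$ sequences that freeze after 50 iterations; all names are hypothetical.

```python
import random

def ring_mixing_matrix(n):
    """Cycle graph with weight 1/3 on each node and its two neighbors (assumed weights)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            w[i][j % n] = 1.0 / 3.0
    return w

def mix(w, values):
    """One consensus step: node i forms the W-weighted average of its neighbors' values."""
    n = len(values)
    return [sum(w[i][j] * values[j] for j in range(n)) for i in range(n)]

def track_average(w, v_hat_init, v_hat_seq):
    """Run lines 8 and 11 of Algorithm 2; return u_tilde_{t,i} for t = 1, 2, ..."""
    u_half = list(v_hat_init)   # u_tilde_{1/2,i} = v_hat_{0,i}
    v_prev = list(v_hat_init)
    history = []
    for v_now in v_hat_seq:
        u_tilde = mix(w, u_half)                                          # line 8
        history.append(u_tilde)
        u_half = [u - p + c for u, p, c in zip(u_tilde, v_prev, v_now)]   # line 11
        v_prev = list(v_now)
    return history

random.seed(0)
n = 5
w = ring_mixing_matrix(n)
v_init = [1e-6] * n
all_v = [v_init]                # all_v[t] plays the role of v_hat_{t,.}
for t in range(200):
    prev = all_v[-1]
    if t < 50:                  # node-dependent gradient scales, running maximum
        all_v.append([max(p, random.gauss(0.0, 1.0 + i) ** 2) for i, p in enumerate(prev)])
    else:                       # v_hat has converged on every node
        all_v.append(list(prev))
history = track_average(w, v_init, all_v[1:])
target = sum(all_v[-1]) / n
print(history[-1], target)
```

Because $W$ is doubly stochastic, the node average of $\tilde u_{t,i}$ equals the node average of $\hat v_{t-1,i}$ at every iteration, and once the $\hat v_{t,i}$ stop changing, every node's $\tilde u_{t,i}$ approaches that common average even though the local $\hat v_{t,i}$ differ.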
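To make Algorithm 3 concrete, here is a minimal sketch that runs it on the two-node problem from the proof of Theorem 1, for which DADAM's average iterate approaches 0.5. The choices below are assumptions for illustration only: noiseless gradients, $\beta_1 = \beta_2 = 0$, $\epsilon = 1$, $v_{0,i} = \epsilon$, a constant $\alpha = 0.01$, and hypothetical function names.

```python
import math

def grad_f1(x):
    # f1(x) = 2x^2 on |x| <= 1, and 4|x| - 2 outside
    return 4.0 * x if abs(x) <= 1.0 else math.copysign(4.0, x)

def grad_f2(x):
    # f2(x) = (x - 1)^2 on |x - 1| <= 1, and 2|x - 1| - 1 outside
    return 2.0 * (x - 1.0) if abs(x - 1.0) <= 1.0 else math.copysign(2.0, x - 1.0)

def grad_f(x):
    # Gradient of the global objective f = (f1 + f2) / 2
    return 0.5 * (grad_f1(x) + grad_f2(x))

def run_decentralized_amsgrad(alpha=0.01, steps=5000, eps=1.0, beta1=0.0, beta2=0.0):
    grads = (grad_f1, grad_f2)
    w = [[0.5, 0.5], [0.5, 0.5]]
    n = 2
    x = [-1.0] * n
    m = [0.0] * n
    v = [eps] * n          # assumption: v_{0,i} = eps
    v_hat = [eps] * n      # v_hat_{0,i} = eps
    u_half = [eps] * n     # u_tilde_{1/2,i} = v_hat_{0,i}
    u = [eps] * n
    for _ in range(steps):
        g = [grads[i](x[i]) for i in range(n)]                                    # line 4
        m = [beta1 * m[i] + (1.0 - beta1) * g[i] for i in range(n)]               # line 5
        v = [beta2 * v[i] + (1.0 - beta2) * g[i] ** 2 for i in range(n)]          # line 6
        v_hat_new = [max(v_hat[i], v[i]) for i in range(n)]                       # line 7
        x_half = [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]        # line 8
        u_tilde = [sum(w[i][j] * u_half[j] for j in range(n)) for i in range(n)]  # line 9
        u = [max(u_tilde[i], eps) for i in range(n)]                              # line 10
        x = [x_half[i] - alpha * m[i] / math.sqrt(u[i]) for i in range(n)]        # line 11
        u_half = [u_tilde[i] - v_hat[i] + v_hat_new[i] for i in range(n)]         # line 12
        v_hat = v_hat_new
    return sum(x) / n, u

x_bar, u = run_decentralized_amsgrad()
print(x_bar, u, grad_f(x_bar))
```

In this run both nodes end up with the same adaptive learning rate, the average of the local values 16 and 4, and the average iterate lands close to $x^* = 1/3$. With a constant step size a small residual gradient of order $\alpha$ remains, which is consistent with the $\alpha$-dependent terms in Theorem 2.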

3.3. CONVERGENCE ANALYSIS

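The detailed proofs of this section are reported in the supplementary material.
Proof of Theorem 2: We now present a proof sketch for our main convergence result of Algorithm 2.
Step 1: Reparameterization. Similarly to Yan et al. (2018); Chen et al. (2019) with SGD (with momentum) and centralized adaptive gradient methods, define the following auxiliary sequence:

$$Z_t = \overline{X}_t + \frac{\beta_1}{1-\beta_1}\left(\overline{X}_t - \overline{X}_{t-1}\right), \qquad (4)$$

with $\overline{X}_0 := \overline{X}_1$. Such an auxiliary sequence helps us deal with the bias brought by the momentum and simplifies the convergence analysis. An intermediary result needed to conduct our proof reads:
Lemma 1. For the sequence defined in (4), we have

$$Z_{t+1} - Z_t = \alpha\frac{\beta_1}{1-\beta_1}\frac{1}{N}\sum_{i=1}^{N} m_{t-1,i}\left(\frac{1}{\sqrt{u_{t-1,i}}} - \frac{1}{\sqrt{u_{t,i}}}\right) - \alpha\frac{1}{N}\sum_{i=1}^{N}\frac{g_{t,i}}{\sqrt{u_{t,i}}}.$$

Lemma 1 does not display any momentum term in $\frac{1}{N}\sum_{i=1}^{N}\frac{g_{t,i}}{\sqrt{u_{t,i}}}$. This simplification is convenient since it is directly related to the current gradients instead of the exponential average of past gradients.
Step 2: Smoothness. Using the smoothness assumption A1 involves the following scalar product term: $\kappa_t := \left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i})/\sqrt{\overline{U}_t}\right\rangle$, which can be lower bounded by:

$$\kappa_t \ge \frac{1}{2}\left\|\frac{\nabla f(\overline{X}_t)}{\overline{U}_t^{1/4}}\right\|^2 - \frac{3}{2}\left\|\frac{\nabla f(Z_t) - \nabla f(\overline{X}_t)}{\overline{U}_t^{1/4}}\right\|^2 - \frac{3}{2}\left\|\frac{\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) - \nabla f(\overline{X}_t)}{\overline{U}_t^{1/4}}\right\|^2.$$

The above inequality, substituted in the smoothness condition $f(Z_{t+1}) \le f(Z_t) + \langle\nabla f(Z_t), Z_{t+1} - Z_t\rangle + \frac{L}{2}\|Z_{t+1} - Z_t\|^2$, yields:

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{\nabla f(\overline{X}_t)}{\overline{U}_t^{1/4}}\right\|^2\right] \le \frac{2}{T\alpha}\mathbb{E}[\Delta_f] + \frac{L}{T\alpha}\sum_{t=1}^{T}\mathbb{E}\left[\|Z_{t+1} - Z_t\|^2\right] + \frac{2}{T}\frac{\beta_1 D_1}{1-\beta_1} + \frac{2D_2}{T} + \frac{3D_3}{T},$$

where $\Delta_f := \mathbb{E}[f(Z_1)] - \mathbb{E}[f(Z_{T+1})]$, and $D_1$, $D_2$ and $D_3$ are three terms, defined in the supplementary material, which can be tightly bounded from above. We first bound $D_3$ using the following quantities of interest:

$$\sum_{t=1}^{T}\|Z_t - \overline{X}_t\|^2 \le T\left(\frac{\beta_1}{1-\beta_1}\right)^2\alpha^2 d\frac{G_\infty^2}{\epsilon} \quad\text{and}\quad \sum_{t=1}^{T}\frac{1}{N}\sum_{i=1}^{N}\|x_{t,i} - \overline{X}_t\|^2 \le T\alpha^2\left(\frac{1}{1-\lambda}\right)^2 dG_\infty^2\frac{1}{\epsilon},$$

where $\lambda = \max(|\lambda_2|, |\lambda_N|)$, and recall that $\lambda_i$ is the $i$th largest eigenvalue of $W$.
Then, concerning the term $D_2$, a few derivations, not detailed here for simplicity, yield:

$$D_2 \le \frac{G_\infty^2}{N}\,\mathbb{E}\left[\sum_{t=1}^{T}\frac{1}{2\epsilon^{1.5}}\left\|-\sum_{l=2}^{N}\tilde U_t q_l q_l^T\right\|_{abs}\right],$$

where $q_l$ is the eigenvector corresponding to the $l$th largest eigenvalue of $W$ and $\|\cdot\|_{abs}$ is the entry-wise $L_1$ norm of matrices. We can also show that

$$\sum_{t=1}^{T}\left\|-\sum_{l=2}^{N}\tilde U_t q_l q_l^T\right\|_{abs} \le \sqrt{N}\sum_{o=0}^{T-1}\frac{\lambda}{1-\lambda}\left\|-\hat V_{o-1} + \hat V_o\right\|_{abs},$$

resulting in an upper bound for $D_2$ proportional to $\sum_{o=0}^{T-1}\|-\hat V_{o-1} + \hat V_o\|_{abs}$. Similarly:

$$D_1 \le G_\infty^2\frac{1}{2\epsilon^{1.5}}\frac{1}{\sqrt{N}}\,\mathbb{E}\left[\frac{1}{1-\lambda}\sum_{t=1}^{T}\left\|-\hat V_{t-2} + \hat V_{t-1}\right\|_{abs}\right].$$

Under review as a conference paper at ICLR 2021

Step 3: Bounding the drift term variance. An important term that needs upper bounding in our proof is the variance of the gradients multiplied (element-wise) by the adaptive learning rate:

$$\mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{g_{t,i}}{\sqrt{u_{t,i}}}\right\|^2\right] \le \mathbb{E}\left[\|\Gamma_u^f\|^2\right] + \frac{d}{N}\frac{\sigma^2}{\epsilon},$$

where $\Gamma_u^f := \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i})/\sqrt{u_{t,i}}$. Two consecutive and simple boundings of the above yield:

$$\sum_{t=1}^{T}\mathbb{E}\left[\|\Gamma_u^f\|^2\right] \le 2\sum_{t=1}^{T}\mathbb{E}\left[\|\Gamma_{\overline{U}}^f\|^2\right] + 2\sum_{t=1}^{T}\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}G_\infty^2\frac{1}{\sqrt{\epsilon}}\left\|\frac{1}{\sqrt{u_{t,i}}} - \frac{1}{\sqrt{\overline{U}_t}}\right\|_1\right] \qquad (5)$$

and

$$\sum_{t=1}^{T}\mathbb{E}\left[\|\Gamma_{\overline{U}}^f\|^2\right] \le 2\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{\nabla f(\overline{X}_t)}{\sqrt{\overline{U}_t}}\right\|^2\right] + 2\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{\nabla f_i(\overline{X}_t) - \nabla f_i(x_{t,i})}{\sqrt{\overline{U}_t}}\right\|^2\right]. \qquad (6)$$

Then, by plugging the LHS of (6) in (5), and further bounding as done for $D_2$ and $D_3$ (see supplement), we obtain the desired bound in Theorem 2.
Proof of Theorem 3: Recall the bound in (3) of Theorem 2. Since Algorithm 3 is a special case of Algorithm 2, the remainder of the proof consists in characterizing the growth rate of $\mathbb{E}\left[\sum_{t=1}^{T}\|-\hat V_{t-2} + \hat V_{t-1}\|_{abs}\right]$. By construction, $\hat V_t$ is non-decreasing, so it can be shown that $\mathbb{E}\left[\sum_{t=1}^{T}\|-\hat V_{t-2} + \hat V_{t-1}\|_{abs}\right] = \mathbb{E}\left[\sum_{i=1}^{N}\sum_{j=1}^{d}\left(-[\hat v_{0,i}]_j + [\hat v_{T-1,i}]_j\right)\right]$. Besides, since for all $t, i$, $\|g_{t,i}\|_\infty \le G_\infty$ and $v_{t,i}$ is an exponential moving average of $g_{k,i}^2$, $k = 1, 2, \cdots, t$, we have $|[v_{t,i}]_j| \le G_\infty^2$ for all $t, i, j$. By construction of $\hat V_t$, we also observe that each element of $\hat V_t$ cannot be greater than $G_\infty^2$, i.e. $|[\hat v_{t,i}]_j| \le G_\infty^2$ for all $t, i, j$.
Given that $[\hat v_{0,i}]_j \ge 0$, we have

$$\mathbb{E}\left[\sum_{t=1}^{T}\|-\hat V_{t-2} + \hat V_{t-1}\|_{abs}\right] = \mathbb{E}\left[\sum_{i=1}^{N}\sum_{j=1}^{d}\left(-[\hat v_{0,i}]_j + [\hat v_{T-1,i}]_j\right)\right] \le \sum_{i=1}^{N}\sum_{j=1}^{d}\mathbb{E}[G_\infty^2] = NdG_\infty^2.$$

Substituting into (3) yields the desired convergence bound for Algorithm 3.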
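The last two steps of the proof can be written out explicitly. The following worked derivation is a sketch under the convention $\hat V_{-1} := \hat V_0$, which is an assumption made here so that the $t = 1$ term of the sum vanishes; it uses only the monotonicity of $\hat V_t$ and the entry-wise bound $G_\infty^2$ stated above.

```latex
\begin{align*}
\sum_{t=1}^{T}\big\|-\hat V_{t-2}+\hat V_{t-1}\big\|_{abs}
  &= \sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{j=1}^{d}
     \big([\hat v_{t-1,i}]_j-[\hat v_{t-2,i}]_j\big)
     && \text{(entries are non-decreasing in } t\text{)} \\
  &= \sum_{i=1}^{N}\sum_{j=1}^{d}
     \big([\hat v_{T-1,i}]_j-[\hat v_{0,i}]_j\big)
     && \text{(telescoping sum)} \\
  &\le N d\, G_\infty^2 . \\[4pt]
\frac{C_4}{T\sqrt{N}}\cdot N d\, G_\infty^2
  &= C_4 G_\infty^2\,\frac{\sqrt{N}\,d}{T}
   = C'_4\,\frac{\sqrt{N}\,d}{T}, \\
\frac{C_5}{T^{1.5} d^{0.5}}\cdot N d\, G_\infty^2
  &= C_5 G_\infty^2\,\frac{N d^{0.5}}{T^{1.5}}
   = C'_5\,\frac{N d^{0.5}}{T^{1.5}} .
\end{align*}
```

These are the last two terms of the bound in Theorem 3; the first three terms carry over unchanged from Corollary 2.1.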

3.4. ILLUSTRATIVE NUMERICAL EXPERIMENTS

In this section, we conduct some experiments to test the performance of Decentralized AMSGrad, developed in Algorithm 3, on both homogeneous and heterogeneous data distributions (i.e. the data generating distributions on different nodes are assumed to be different). We compare against DADAM and the decentralized stochastic gradient descent (DGD) developed in Lian et al. (2017). We train a Convolutional Neural Network (CNN) with 3 convolution layers followed by a fully connected layer on MNIST (LeCun, 1998). We set $\epsilon = 10^{-6}$ for both Decentralized AMSGrad and DADAM. The learning rate is chosen from the grid $[10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}]$ based on validation accuracy for all algorithms. In the following experiments, the graph contains 5 nodes and each node can only communicate with its two adjacent neighbors, forming a cycle. Regarding the mixing matrix $W$, we set $W_{ij} = 1/3$ if nodes $i$ and $j$ are neighbors and $W_{ij} = 0$ otherwise. More details and experiments can be found in the supplementary material of our paper.
Homogeneous data: The whole dataset is shuffled and evenly split across the nodes. Such a setting is possible when the nodes are in a computer cluster. We see in Figure 1(a) that decentralized AMSGrad and DADAM perform quite similarly, while DGD is much slower both in terms of training loss and test accuracy. Despite the (possible) non-convergence of DADAM discussed in this paper, its performance is empirically good on homogeneous data. The reason is that the adaptive learning rates tend to be similar on different nodes in the presence of a homogeneous data distribution. We thus compare these algorithms under the heterogeneous regime.
Heterogeneous data: Here, each node only contains training data with two labels out of ten. Such a setting is common when data shuffling is prohibited, such as in federated learning. We can see that each algorithm converges significantly more slowly than with homogeneous data.
In particular, the performance of DADAM deteriorates significantly. Decentralized AMSGrad achieves the best training and testing performance in that setting, as observed in Figure 1(b).
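The heterogeneous split described above can be sketched as follows. The text only states that each of the 5 nodes holds two of the ten labels; the specific assignment used here (node $k$ receives labels $2k$ and $2k+1$) and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def partition_by_label(labels, num_nodes=5, labels_per_node=2):
    """Return {node: [sample indices]}; node k holds labels k*labels_per_node, ... (assumed mapping)."""
    shards = defaultdict(list)
    for idx, y in enumerate(labels):
        node = y // labels_per_node
        if node < num_nodes:
            shards[node].append(idx)
    return dict(shards)

# Toy label vector standing in for the MNIST training labels (digits 0-9).
toy_labels = [k % 10 for k in range(30)]
shards = partition_by_label(toy_labels)
labels_on_node = {node: sorted({toy_labels[i] for i in idxs}) for node, idxs in shards.items()}
print(labels_on_node)
```

With such a split the local gradients, and hence the local second-moment estimates $\hat v_{t,i}$, can differ substantially across nodes, which is the regime where consensus on the adaptive learning rates matters.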

4. EXTENSION TO ADAGRAD

In this section, we provide a decentralized version of AdaGrad (optionally with momentum) obtained through Algorithm 2, further supporting the usefulness of the decentralization framework. The only modification required to decentralize AdaGrad is to specify line 4 of Algorithm 2 as $\hat{v}_{t,i} = \frac{t-1}{t}\hat{v}_{t-1,i} + \frac{1}{t} g_{t,i}^2$, which is equivalent to $\hat{v}_{t,i} = \frac{1}{t}\sum_{k=1}^{t} g_{k,i}^2$. Throughout this section, we call this algorithm decentralized AdaGrad.

Two details of the algorithm are worth mentioning. First, the framework uses the momentum $m_{t,i}$ in its updates, while the original AdaGrad does not use momentum. The momentum can be turned off by setting $\beta_1 = 0$, and the convergence results still hold. Second, decentralized AdaGrad uses an average instead of a sum in $\hat{v}_{t,i}$, i.e., $\hat{v}_{t,i} = \frac{1}{t}\sum_{k=1}^{t} g_{k,i}^2$, whereas the original AdaGrad uses $\hat{v}_{t,i} = \sum_{k=1}^{t} g_{k,i}^2$. The reason is that the original AdaGrad uses a constant stepsize ($\alpha$ independent of $t$ or $T$) with $\hat{v}_{t,i} = \sum_{k=1}^{t} g_{k,i}^2$, which is equivalent to using the well-known diminishing stepsize sequence $\alpha_t = \frac{1}{\sqrt{t}}$ with $\hat{v}_{t,i} = \frac{1}{t}\sum_{k=1}^{t} g_{k,i}^2$. In our convergence analysis, presented below, we use a constant stepsize $\alpha = O(\frac{1}{\sqrt{T}})$ in place of the diminishing stepsize sequence $\alpha_t = O(\frac{1}{\sqrt{t}})$. Such a replacement is common in the analysis of SGD, since it simplifies the analysis and yields a better convergence rate. In addition, our theoretical framework is easily modified to handle diminishing stepsize sequences such as $\alpha_t = O(\frac{1}{\sqrt{t}})$. The convergence analysis for decentralized AdaGrad is given in Theorem 4.

Theorem 4. Assume A1-A4. Set $\alpha = \sqrt{N}/\sqrt{Td}$. When $\alpha \le \frac{\epsilon^{0.5}}{16L}$, decentralized AdaGrad yields the following regret bound

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right] \le C_1' \frac{\sqrt{d}}{\sqrt{TN}} D_f + C_2' \frac{N}{T} + C_3' \frac{N^{1.5}}{T^{1.5} d^{0.5}} + \frac{\sqrt{N}(1+\log(T))}{T}\left(d\, C_4' + \frac{\sqrt{d}}{T^{0.5}} C_5'\right),$$

where $D_f := \mathbb{E}[f(Z_1)] - \min_z f(z) + \sigma^2$, $C_1' = C_1$, $C_2' = C_2$, $C_3' = C_3$, $C_4' = C_4 G_\infty^2$ and $C_5' = C_5 G_\infty^2$. The constants $C_1, C_2, C_3, C_4, C_5$ are defined in Theorem 2 and are independent of $d$, $T$ and $N$. In addition, the consensus of variables at different nodes is given by

$$\frac{1}{N}\sum_{i=1}^{N} \left\|x_{t,i} - \bar{X}_t\right\|^2 \le \frac{N}{T}\left(\frac{1}{1-\lambda}\right)^2 G_\infty^2 \frac{1}{\epsilon}.$$
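The two equivalences used above (recursive versus averaged form of $\hat{v}_{t,i}$, and constant stepsize with a sum versus stepsize $\alpha/\sqrt{t}$ with an average) can be checked numerically. The following is a minimal single-node sketch, not the paper's implementation: the function names are hypothetical, the gradients are random placeholders, and the $\epsilon$ safeguard and the gossip step of Algorithm 2 are omitted.

```python
import math
import random

def recursive_second_moment(grads):
    """v_t = (t-1)/t * v_{t-1} + (1/t) * g_t^2, applied coordinate-wise."""
    v = [0.0] * len(grads[0])  # initial value is irrelevant: its weight (t-1)/t is 0 at t = 1
    history = []
    for t, g in enumerate(grads, start=1):
        v = [(t - 1) / t * vj + gj * gj / t for vj, gj in zip(v, g)]
        history.append(v)
    return history

def averaged_second_moment(grads):
    """v_t = (1/t) * sum_{k<=t} g_k^2, computed directly from the definition."""
    dim = len(grads[0])
    history = []
    for t in range(1, len(grads) + 1):
        history.append([sum(g[j] ** 2 for g in grads[:t]) / t for j in range(dim)])
    return history

def step_with_sum(alpha, grads, t, j):
    """Original AdaGrad scaling: constant alpha over sqrt of the SUM of squares."""
    return alpha / math.sqrt(sum(g[j] ** 2 for g in grads[:t]))

def step_with_average(alpha, grads, t, j):
    """Stepsize alpha/sqrt(t) over sqrt of the AVERAGE of squares."""
    avg = sum(g[j] ** 2 for g in grads[:t]) / t
    return (alpha / math.sqrt(t)) / math.sqrt(avg)

# Placeholder gradients: 50 steps, 3 coordinates, fixed seed for reproducibility.
rng = random.Random(0)
grads = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(50)]
```

Both pairs of functions agree up to floating-point error, which is why the analysis can work with the averaged $\hat{v}_{t,i}$ without changing the effective per-coordinate step.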

5. CONCLUSION

This paper studies the problem of designing adaptive gradient methods for decentralized training. We propose a unifying algorithmic framework that can convert existing adaptive gradient methods to decentralized settings. With rigorous convergence analysis, we show that if the original algorithm converges under some minor conditions, the converted algorithm obtained using our proposed framework is guaranteed to converge to stationary points of the regret function. By applying our framework to AMSGrad, we propose the first convergent decentralized adaptive gradient method, namely Decentralized AMSGrad. Experiments show that the proposed algorithm achieves better performance than the baselines.

A PROOF OF AUXILIARY LEMMAS

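Lemma 1. For the sequence defined in (10), we have

$$Z_{t+1} - Z_t = \alpha \frac{\beta_1}{1-\beta_1} \frac{1}{N}\sum_{i=1}^{N} m_{t-1,i} \odot \left(\frac{1}{\sqrt{u_{t-1,i}}} - \frac{1}{\sqrt{u_{t,i}}}\right) - \alpha \frac{1}{N}\sum_{i=1}^{N} \frac{g_{t,i}}{\sqrt{u_{t,i}}}.$$

Proof: By the update rule of Algorithm 2, we first have

$$\bar{X}_{t+1} = \frac{1}{N}\sum_{i=1}^{N} x_{t+1,i} = \frac{1}{N}\sum_{i=1}^{N}\left(x_{t+0.5,i} - \alpha\frac{m_{t,i}}{\sqrt{u_{t,i}}}\right) = \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{j=1}^{N} W_{ij} x_{t,j} - \alpha\frac{m_{t,i}}{\sqrt{u_{t,i}}}\right) \overset{(i)}{=} \left(\frac{1}{N}\sum_{j=1}^{N} x_{t,j}\right) - \frac{1}{N}\sum_{i=1}^{N}\alpha\frac{m_{t,i}}{\sqrt{u_{t,i}}} = \bar{X}_t - \frac{1}{N}\sum_{i=1}^{N}\alpha\frac{m_{t,i}}{\sqrt{u_{t,i}}},$$

where (i) is due to an interchange of summation and $\sum_{i=1}^{N} W_{ij} = 1$. Then, we have

$$\begin{aligned}
Z_{t+1} - Z_t &= \bar{X}_{t+1} - \bar{X}_t + \frac{\beta_1}{1-\beta_1}(\bar{X}_{t+1} - \bar{X}_t) - \frac{\beta_1}{1-\beta_1}(\bar{X}_t - \bar{X}_{t-1}) \\
&= \frac{1}{1-\beta_1}(\bar{X}_{t+1} - \bar{X}_t) - \frac{\beta_1}{1-\beta_1}(\bar{X}_t - \bar{X}_{t-1}) \\
&= \frac{1}{1-\beta_1}\left(-\frac{1}{N}\sum_{i=1}^{N}\alpha\frac{m_{t,i}}{\sqrt{u_{t,i}}}\right) - \frac{\beta_1}{1-\beta_1}\left(-\frac{1}{N}\sum_{i=1}^{N}\alpha\frac{m_{t-1,i}}{\sqrt{u_{t-1,i}}}\right) \\
&= \frac{1}{1-\beta_1}\left(-\frac{1}{N}\sum_{i=1}^{N}\alpha\frac{\beta_1 m_{t-1,i} + (1-\beta_1) g_{t,i}}{\sqrt{u_{t,i}}}\right) - \frac{\beta_1}{1-\beta_1}\left(-\frac{1}{N}\sum_{i=1}^{N}\alpha\frac{m_{t-1,i}}{\sqrt{u_{t-1,i}}}\right) \\
&= \alpha\frac{\beta_1}{1-\beta_1}\frac{1}{N}\sum_{i=1}^{N} m_{t-1,i} \odot \left(\frac{1}{\sqrt{u_{t-1,i}}} - \frac{1}{\sqrt{u_{t,i}}}\right) - \alpha\frac{1}{N}\sum_{i=1}^{N}\frac{g_{t,i}}{\sqrt{u_{t,i}}},
\end{aligned}$$

which is the desired result.

Proof of Lemma A.1: Without loss of generality, assume $a_i \le a_j$ when $i < j$, i.e., $a_i$ is a non-decreasing sequence. Define

$$h(r) = \sum_{i=1}^{n} |b_i(r) - \bar{b}(r)| = \sum_{i=1}^{n} \left|\max(a_i, r) - \frac{1}{n}\sum_{j=1}^{n}\max(a_j, r)\right|.$$

We need to prove that $h$ is a non-increasing function of $r$. First, it is easy to see that $h$ is a continuous function of $r$ with non-differentiable points $r = a_i$, $i \in [n]$; thus $h$ is a piecewise linear function. Next, we prove that $h(r)$ is non-increasing on each piece. Define $l(r)$ to be the largest index with $a_{l(r)} < r$, and $s(r)$ to be the largest index with $a_{s(r)} < \bar{b}(r)$. Note that for $i \le l(r)$ we have $b_i(r) = r$, and for $i \le s(r)$ we have $b_i(r) - \bar{b}(r) \le 0$, since $a_i$ is a non-decreasing sequence. Therefore, we have

$$h(r) = \sum_{i=1}^{l(r)} (\bar{b}(r) - r) + \sum_{i=l(r)+1}^{s(r)} (\bar{b}(r) - a_i) + \sum_{i=s(r)+1}^{n} (a_i - \bar{b}(r))$$

and

$$\bar{b}(r) = \frac{1}{n}\left(l(r)\, r + \sum_{i=l(r)+1}^{n} a_i\right).$$

Taking the derivative of the above form, the derivative of $h(r)$ at differentiable points is

$$h'(r) = l(r)\left(\frac{l(r)}{n} - 1\right) + (s(r) - l(r))\frac{l(r)}{n} - (n - s(r))\frac{l(r)}{n} = \frac{l(r)}{n}\Big((l(r) - n) + (s(r) - l(r)) - (n - s(r))\Big).$$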

B PROOF OF THEOREM 2

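To prove convergence of the algorithm, we first define an auxiliary sequence

$$Z_t = \bar{X}_t + \frac{\beta_1}{1-\beta_1}(\bar{X}_t - \bar{X}_{t-1}), \quad (10)$$

with $\bar{X}_0 \triangleq \bar{X}_1$. Since $\mathbb{E}[g_{t,i}] = \nabla f_i(x_{t,i})$ and $u_{t,i}$ is a function of $G_{1:t-1}$ (which denotes $G_1, G_2, \cdots, G_{t-1}$), we have

$$\mathbb{E}_{G_t|G_{1:t-1}}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{g_{t,i}}{\sqrt{u_{t,i}}}\right] = \frac{1}{N}\sum_{i=1}^{N}\frac{\nabla f_i(x_{t,i})}{\sqrt{u_{t,i}}}.$$

Assuming smoothness (A1), we have

$$f(Z_{t+1}) \le f(Z_t) + \langle \nabla f(Z_t), Z_{t+1} - Z_t \rangle + \frac{L}{2}\|Z_{t+1} - Z_t\|^2.$$

Applying Lemma 1 to the above inequality and taking expectation over $G_t$ given $G_{1:t-1}$, we have

$$\begin{aligned}
\mathbb{E}_{G_t|G_{1:t-1}}[f(Z_{t+1})] \le{}& f(Z_t) - \alpha\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\frac{\nabla f_i(x_{t,i})}{\sqrt{u_{t,i}}}\right\rangle + \frac{L}{2}\mathbb{E}_{G_t|G_{1:t-1}}\left[\|Z_{t+1} - Z_t\|^2\right] \\
&+ \alpha\frac{\beta_1}{1-\beta_1}\mathbb{E}_{G_t|G_{1:t-1}}\left[\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N} m_{t-1,i} \odot \left(\frac{1}{\sqrt{u_{t-1,i}}} - \frac{1}{\sqrt{u_{t,i}}}\right)\right\rangle\right].
\end{aligned}$$

Then, taking expectation over $G_{1:t-1}$ and rearranging, we have

$$\begin{aligned}
\alpha\,\mathbb{E}\left[\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\frac{\nabla f_i(x_{t,i})}{\sqrt{u_{t,i}}}\right\rangle\right] \le{}& \mathbb{E}[f(Z_t)] - \mathbb{E}[f(Z_{t+1})] + \frac{L}{2}\mathbb{E}\left[\|Z_{t+1} - Z_t\|^2\right] \\
&+ \alpha\frac{\beta_1}{1-\beta_1}\mathbb{E}\left[\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N} m_{t-1,i} \odot \left(\frac{1}{\sqrt{u_{t-1,i}}} - \frac{1}{\sqrt{u_{t,i}}}\right)\right\rangle\right]. \quad (11)
\end{aligned}$$

In addition, we have

$$\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\frac{\nabla f_i(x_{t,i})}{\sqrt{u_{t,i}}}\right\rangle = \left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\frac{\nabla f_i(x_{t,i})}{\sqrt{\bar{U}_t}}\right\rangle + \left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) \odot \left(\frac{1}{\sqrt{u_{t,i}}} - \frac{1}{\sqrt{\bar{U}_t}}\right)\right\rangle \quad (13)$$

and the first term on the RHS of the equality can be lower bounded as

$$\begin{aligned}
\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\frac{\nabla f_i(x_{t,i})}{\sqrt{\bar{U}_t}}\right\rangle ={}& \frac{1}{2}\left\|\frac{\nabla f(Z_t)}{\bar{U}_t^{1/4}}\right\|^2 + \frac{1}{2}\left\|\frac{\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i})}{\bar{U}_t^{1/4}}\right\|^2 - \frac{1}{2}\left\|\frac{\nabla f(Z_t) - \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i})}{\bar{U}_t^{1/4}}\right\|^2 \\
\ge{}& \frac{1}{4}\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 + \frac{1}{4}\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 - \frac{1}{2}\left\|\frac{\nabla f(Z_t) - \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i})}{\bar{U}_t^{1/4}}\right\|^2 \\
&- \frac{1}{2}\left\|\frac{\nabla f(Z_t) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 - \frac{1}{2}\left\|\frac{\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 \\
\ge{}& \frac{1}{2}\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 - \frac{3}{2}\left\|\frac{\nabla f(Z_t) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 - \frac{3}{2}\left\|\frac{\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2, \quad (14)
\end{aligned}$$

where the inequalities are all due to Cauchy-Schwarz.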
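Substituting (14) and (13) into (11), we get

$$\begin{aligned}
\frac{1}{2}\alpha\,\mathbb{E}\left[\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right] \le{}& \mathbb{E}[f(Z_t)] - \mathbb{E}[f(Z_{t+1})] + \frac{L}{2}\mathbb{E}\left[\|Z_{t+1} - Z_t\|^2\right] \\
&+ \alpha\frac{\beta_1}{1-\beta_1}\mathbb{E}\left[\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N} m_{t-1,i} \odot \left(\frac{1}{\sqrt{u_{t-1,i}}} - \frac{1}{\sqrt{u_{t,i}}}\right)\right\rangle\right] \\
&- \alpha\,\mathbb{E}\left[\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) \odot \left(\frac{1}{\sqrt{u_{t,i}}} - \frac{1}{\sqrt{\bar{U}_t}}\right)\right\rangle\right] \\
&+ \frac{3}{2}\alpha\,\mathbb{E}\left[\left\|\frac{\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 + \left\|\frac{\nabla f(Z_t) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right].
\end{aligned}$$

Then, summing the above inequality from $t = 1$ to $T$ and dividing both sides by $T\alpha/2$, we have

$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right] \le{}& \frac{2}{T\alpha}\left(\mathbb{E}[f(Z_1)] - \mathbb{E}[f(Z_{T+1})]\right) + \frac{L}{T\alpha}\sum_{t=1}^{T}\mathbb{E}\left[\|Z_{t+1} - Z_t\|^2\right] \\
&+ \frac{2}{T}\frac{\beta_1}{1-\beta_1}\underbrace{\sum_{t=1}^{T}\mathbb{E}\left[\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N} m_{t-1,i} \odot \left(\frac{1}{\sqrt{u_{t-1,i}}} - \frac{1}{\sqrt{u_{t,i}}}\right)\right\rangle\right]}_{D_1} \\
&+ \frac{2}{T}\underbrace{\sum_{t=1}^{T}\mathbb{E}\left[\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) \odot \left(\frac{1}{\sqrt{\bar{U}_t}} - \frac{1}{\sqrt{u_{t,i}}}\right)\right\rangle\right]}_{D_2} \\
&+ \frac{3}{T}\underbrace{\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 + \left\|\frac{\nabla f(Z_t) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right]}_{D_3}. \quad (15)
\end{aligned}$$

Now we need to upper bound all the terms on the RHS of the above inequality to get the convergence rate. For the terms composing $D_3$ in (15), we can upper bound them by

$$\left\|\frac{\nabla f(Z_t) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 \le \frac{1}{\min_{j\in[d]}[\bar{U}_t^{1/2}]_j}\left\|\nabla f(Z_t) - \nabla f(\bar{X}_t)\right\|^2 \le L\frac{1}{\min_{j\in[d]}[\bar{U}_t^{1/2}]_j}\underbrace{\left\|Z_t - \bar{X}_t\right\|^2}_{D_4} \quad (16)$$

and

$$\left\|\frac{\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) - \nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2 \le \frac{1}{\min_{j\in[d]}[\bar{U}_t^{1/2}]_j}\left\|\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) - \nabla f(\bar{X}_t)\right\|^2 \le L\frac{1}{\min_{j\in[d]}[\bar{U}_t^{1/2}]_j}\frac{1}{N}\underbrace{\sum_{i=1}^{N}\left\|x_{t,i} - \bar{X}_t\right\|^2}_{D_5}, \quad (17)$$

using Jensen's inequality, Lipschitz continuity of $\nabla f_i$, and the fact that $f = \frac{1}{N}\sum_{i=1}^{N} f_i$.

Next we need to bound $D_4$ and $D_5$. Recalling the update rule of $X_t$, we have

$$X_t = X_{t-1}W - \alpha\frac{M_{t-1}}{\sqrt{U_{t-1}}} = X_1 W^{t-1} - \alpha\sum_{k=0}^{t-2}\frac{M_{t-k-1}}{\sqrt{U_{t-k-1}}}W^k, \quad (18)$$

where we define $W^0 = I$. Since $W$ is a symmetric matrix, we can decompose it as $W = Q\Lambda Q^T$, where $Q$ is an orthonormal matrix and $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues of $W$ in descending order, i.e., $\Lambda_{ii} = \lambda_i$ with $\lambda_i$ being the $i$-th largest eigenvalue of $W$. In addition, because $W$ is a doubly stochastic matrix, we know $\lambda_1 = 1$ and $q_1 = \frac{\mathbf{1}_N}{\sqrt{N}}$.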
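With the eigen-decomposition of $W$, we can rewrite $D_5$ as

$$\sum_{i=1}^{N}\left\|x_{t,i} - \bar{X}_t\right\|^2 = \left\|X_t - \bar{X}_t\mathbf{1}_N^T\right\|_F^2 = \left\|X_t QQ^T - X_t\frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\right\|_F^2 = \sum_{l=2}^{N}\left\|X_t q_l\right\|^2. \quad (19)$$

In addition, we can rewrite (18) as

$$X_t = X_1 W^{t-1} - \alpha\sum_{k=0}^{t-2}\frac{M_{t-k-1}}{\sqrt{U_{t-k-1}}}W^k = X_1 - \alpha\sum_{k=0}^{t-2}\frac{M_{t-k-1}}{\sqrt{U_{t-k-1}}}Q\Lambda^k Q^T, \quad (20)$$

where the last equality is because $x_{1,i} = x_{1,j}$ for all $i, j$ and thus $X_1 W = X_1$. Then, when $l > 1$, we have

$$X_t q_l = \left(X_1 - \alpha\sum_{k=0}^{t-2}\frac{M_{t-k-1}}{\sqrt{U_{t-k-1}}}Q\Lambda^k Q^T\right)q_l = -\alpha\sum_{k=0}^{t-2}\frac{M_{t-k-1}}{\sqrt{U_{t-k-1}}}q_l\lambda_l^k, \quad (21)$$

since $Q$ is orthonormal and $X_1 q_l = x_{1,1}\mathbf{1}_N^T q_l = x_{1,1}\sqrt{N}q_1^T q_l = 0$ for all $l \neq 1$. Combining (19) and (21), we have

$$D_5 = \sum_{i=1}^{N}\left\|x_{t,i} - \bar{X}_t\right\|^2 = \sum_{l=2}^{N}\left\|X_t q_l\right\|^2 = \sum_{l=2}^{N}\alpha^2\left\|\sum_{k=0}^{t-2}\frac{M_{t-k-1}}{\sqrt{U_{t-k-1}}}\lambda_l^k q_l\right\|^2 \le \alpha^2\left(\frac{1}{1-\lambda}\right)^2 N d G_\infty^2\frac{1}{\epsilon}, \quad (22)$$

where the last inequality follows from the fact that $\|g_{t,i}\|_\infty \le G_\infty$, $\|q_l\| = 1$, and $|\lambda_l| \le \lambda < 1$.

Now let us turn to $D_4$, which can be rewritten as

$$\left\|Z_t - \bar{X}_t\right\|^2 = \left\|\frac{\beta_1}{1-\beta_1}(\bar{X}_t - \bar{X}_{t-1})\right\|^2 = \left(\frac{\beta_1}{1-\beta_1}\right)^2\alpha^2\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{m_{t-1,i}}{\sqrt{u_{t-1,i}}}\right\|^2 \le \left(\frac{\beta_1}{1-\beta_1}\right)^2\alpha^2 d\frac{G_\infty^2}{\epsilon}. \quad (23)$$

Now we know both $D_4$ and $D_5$ are in the order of $O(\alpha^2)$, and thus $D_3$ is in the order of $O(\alpha^2)$.

Next we will bound $D_2$ and $D_1$. Define $G_1 \triangleq \max_{t\in[T]}\max_{i\in[N]}\|\nabla f_i(x_{t,i})\|_\infty$, $G_2 \triangleq \max_{t\in[T]}\|\nabla f(Z_t)\|_\infty$, $G_3 \triangleq \max_{t\in[T]}\max_{i\in[N]}\|g_{t,i}\|_\infty$ and $G_\infty = \max(G_1, G_2, G_3)$. Then we have

$$\begin{aligned}
D_2 &= \sum_{t=1}^{T}\mathbb{E}\left[\left\langle \nabla f(Z_t), \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{t,i}) \odot \left(\frac{1}{\sqrt{\bar{U}_t}} - \frac{1}{\sqrt{u_{t,i}}}\right)\right\rangle\right] \\
&\le \sum_{t=1}^{T}\mathbb{E}\left[G_\infty^2\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\left|\frac{1}{\sqrt{[\bar{U}_t]_j}} - \frac{1}{\sqrt{[u_{t,i}]_j}}\right|\right] \\
&= \sum_{t=1}^{T}\mathbb{E}\left[G_\infty^2\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\left|\frac{1}{\sqrt{[\bar{U}_t]_j}} - \frac{1}{\sqrt{[u_{t,i}]_j}}\right|\frac{\left|\sqrt{[\bar{U}_t]_j} + \sqrt{[u_{t,i}]_j}\right|}{\left|\sqrt{[\bar{U}_t]_j} + \sqrt{[u_{t,i}]_j}\right|}\right] \\
&= \sum_{t=1}^{T}\mathbb{E}\left[G_\infty^2\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\left|\frac{[\bar{U}_t]_j - [u_{t,i}]_j}{[\bar{U}_t]_j\sqrt{[u_{t,i}]_j} + \sqrt{[\bar{U}_t]_j}[u_{t,i}]_j}\right|\right] \\
&\le \underbrace{\mathbb{E}\left[\sum_{t=1}^{T}G_\infty^2\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\frac{\left|[\bar{U}_t]_j - [u_{t,i}]_j\right|}{2\epsilon^{1.5}}\right]}_{D_6}, \quad (24)
\end{aligned}$$

where the last inequality is due to $[u_{t,i}]_j \ge \epsilon$ for all $t, i, j$.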
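The two properties of $W$ that drive the consensus bound (22), namely that mixing preserves the network average and contracts the disagreement by a factor of at most $\lambda^2$ per step, can be illustrated on a small example. The sketch below is only an illustration under stated assumptions: it uses a ring topology with uniform weights $1/3$ (a hypothetical choice, not taken from the paper), scalar node values, and the closed-form eigenvalues of a circulant matrix; all function names are made up for this example.

```python
import math
import random

def ring_mixing_matrix(n):
    """Symmetric doubly stochastic W for a ring (n >= 3): each node averages
    itself and its two neighbours with weight 1/3 each (assumed topology)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W

def mix(W, x):
    """One gossip step for scalar node values x (W is symmetric)."""
    n = len(x)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]

def disagreement(x):
    """sum_i (x_i - mean(x))^2, the scalar analogue of D_5."""
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x)

def ring_lambda(n):
    """lambda = max(|lambda_2|, |lambda_N|) from the closed-form eigenvalues
    1/3 + (2/3) cos(2 pi k / n), k = 1..n-1, of this circulant W."""
    eigs = [1.0 / 3.0 + 2.0 / 3.0 * math.cos(2.0 * math.pi * k / n) for k in range(1, n)]
    return max(abs(e) for e in eigs)

# Example with 5 nodes and random initial values (fixed seed).
rng = random.Random(0)
n_nodes = 5
W = ring_mixing_matrix(n_nodes)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(n_nodes)]
x1 = mix(W, x0)
lam = ring_lambda(n_nodes)
```

Here `sum(x1) / n_nodes` equals `sum(x0) / n_nodes` up to rounding, and `disagreement(x1)` is at most `lam ** 2 * disagreement(x0)`; summing this geometric contraction over past updates is what produces the $\left(\frac{1}{1-\lambda}\right)^2$ factor in (22).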
To simplify notations, define A abs = i,j |A ij | to be the entry-wise L 1 norm of a matrix A, then we obtain D 6 ≤ G 2 ∞ N T t=1 1 2 1.5 U t 1 T -U t abs ≤ G 2 ∞ N T t=1 1 2 1.5 Ũ t 1 T -Ũt abs = G 2 ∞ N T t=1 1 2 1.5 Ũt 1 N 1 N 1 T N -Ũt QQ T abs = G 2 ∞ N T t=1 1 2 1.5 - N l=2 Ũt q l q T l abs , where the second inequality is due to Lemma A.1, introduced Section A, and the fact that U t = max( Ũt , ) (element-wise max operator). Recall from update rule of U t , by defining V-1 V0 and U 0 U 1/2 , we have for all t ≥ 0, Ũt+1 = ( Ũt -Vt-1 + Vt )W . Thus, we obtain Ũt = Ũ0 W t + t k=1 (-Vt-1-k + Vt-k )W k = Ũ0 + t k=1 (-Vt-1-k + Vt-k )QΛ k Q T . Then we further obtain when l = 1, Ũt q l = ( Ũ0 + t k=1 (-Vt-1-k + Vt-k )QΛ k Q T )q l = t k=1 (-Vt-1-k + Vt-k )q l λ k l , where the last equality is due to the definition Ũ0 U 1/2 = 1 d 1 T N = √ N 1 d 1 T N (recall that q 1 = 1 √ N 1 T N ) and q T i q j = 0 when i = j. Note that by definition of • abs , we have for all A, B, A + B abs ≤ A abs + B abs , then D 6 ≤ G 2 ∞ N T t=1 1 2 1.5 - N l=2 Ũt q l q T l abs = G 2 ∞ N T t=1 1 2 1.5 - t k=1 (-Vt-1-k + Vt-k ) N l=2 q l λ k l q T l abs ≤ G 2 ∞ N T t=1 1 2 1.5 t k=1 d j=1 N l=2 q l λ k l q T l 1 (-Vt-1-k + Vt-k ) T e j 1 ≤ G 2 ∞ N T t=1 1 2 1.5 t k=1 d j=1 √ N N l=2 q l λ k l q T l 2 (-Vt-1-k + Vt-k ) T e j 1 ≤ G 2 ∞ N T t=1 1 2 1.5 t k=1 d j=1 (-Vt-1-k + Vt-k ) T e j 1 √ N λ k = G 2 ∞ N T t=1 1 2 1.5 t k=1 (-Vt-1-k + Vt-k ) abs √ N λ k = G 2 ∞ N 1 2 1.5 T -1 o=0 T t=o+1 (-Vo-1 + Vo ) abs √ N λ t-o ≤ G 2 ∞ √ N 1 2 1.5 T -1 o=0 λ 1 -λ (-Vo-1 + Vo ) abs , where λ = max(|λ 2 |, |λ N |). Combining ( 24) and ( 25), we have D 2 ≤ G 2 ∞ √ N 1 2 1.5 λ 1 -λ E T -1 o=0 (-Vo-1 + Vo ) abs . 
Now we need to bound D 1 , we have D 1 = T t=1 E ∇f (Z t ), 1 N N i=1 m t-1,i ( 1 √ u t-1,i - 1 √ u t,i ) ≤ T t=1 E   G 2 ∞ 1 N N i=1 d j=1 1 [u t-1,i ] j - 1 [u t,i ] j   = T t=1 E   G 2 ∞ 1 N N i=1 d j=1 1 [u t-1,i ] j - 1 [u t,i ] j [u t,i ] j + [u t-1,i ] j [u t,i ] j + [u t-1,i ] j   ≤ T t=1 E   G 2 ∞ 1 N N i=1 d j=1 1 2 1.5 ([u t-1,i ] j -[u t,i ] j )   (a) ≤ T t=1 E   G 2 ∞ 1 N N i=1 d j=1 1 2 1.5 |([ũ t-1,i ] j -[ũ t,i ] j )|   =G 2 ∞ 1 2 1.5 1 N E T t=1 Ũt-1 -Ũt abs , where (a) is due to [ũ t-1,i ] j = max([u t-1,i ] j , ) and the function max(•, ) is 1-Lipschitz. In addition, by update rule of U t , we have T t=1 Ũt-1 -Ũt abs = T t=1 Ũt-1 -( Ũt-1 -Vt-2 + Vt-1 )W abs = T t=1 Ũt-1 (QQ T -QΛQ T ) + (-Vt-2 + Vt-1 )W abs = T t=1 Ũt-1 ( N l=2 q l (1 -λ l )q T l ) + (-Vt-2 + Vt-1 )W abs ≤ T t=1 t-1 k=1 (-Vt-2-k + Vt-1-k ) N l=2 q l λ k l (1 -λ l )q T l abs + T t=1 (-Vt-2 + Vt-1 )W abs ≤ T t=1 t-1 k=1 -Vt-2-k + Vt-1-k abs √ N λ k + T t=1 (-Vt-2 + Vt-1 ) abs = T t=1 t-1 o=1 -Vo-2 + Vo-1 abs √ N λ t-o + T t=1 (-Vt-2 + Vt-1 ) abs = T -1 o=1 T t=o+1 -Vo-2 + Vo-1 abs √ N λ t-o + T t=1 (-Vt-2 + Vt-1 ) abs ≤ T -1 o=1 λ 1 -λ -Vo-2 + Vo-1 abs √ N + T t=1 (-Vt-2 + Vt-1 ) abs ≤ 1 1 -λ T t=1 (-Vt-2 + Vt-1 ) abs √ N . ( ) =2α 2 β 1 1 -β 1 2 G 2 ∞ 1 N 1 2 2 Ũt -Ũt-1 abs + 2α 2 1 N N i=1 g t,i √ u t,i 2 , ( ) where the last inequality is again due to the definition that [ũ t,i ] j = max([u t,i ] j , ) and the fact that max(•, ) is 1-Lipschitz. Then, we have T t=1 E[ Z t+1 -Z t 2 ] ≤2α 2 β 1 1 -β 1 2 G 2 ∞ 1 N 1 2 2 E T t=1 Ũt -Ũt-1 abs + 2α 2 T t=1 E   1 N N i=1 g t,i √ u t,i 2   ≤α 2 β 1 1 -β 1 2 G 2 ∞ √ N 1 2 1 1 -λ E T t=1 (-Vt-2 + Vt-1 ) abs + 2α 2 T t=1 E   1 N N i=1 g t,i √ u t,i 2   , where the last inequality is due to (27). We now bound the last term on RHS of the above inequality. 
A trivial bound can be T t=1 1 N N i=1 g t,i √ u t,i 2 ≤ T t=1 dG 2 ∞ 1 , due to g t,i ≤ G ∞ and [u t,i ] j ≥ , for all j (verified from update rule of u t,i and the assumption that [v t,i ] j ≥ , for all i). However, the above bound is independent of N , to get a better bound, we need a more involved analysis to show its dependency on N . To do this, we first notice that E Gt|G1:t-1   1 N N i=1 g t,i √ u t,i 2   =E Gt|G1:t-1   1 N 2 N i=1 N j=1 ∇f i (x t,i ) + ξ t,i √ u t,i , ∇f j (x t,j ) + ξ t,j √ u t,j   (a) = E Gt|G1:t-1   1 N N i=1 ∇f i (x t,i ) √ u t,i 2   + E Gt|G1:t-1 1 N 2 N i=1 ξ t,i √ u t,i 2 (b) = 1 N N i=1 ∇f i (x t,i ) √ u t,i 2 + 1 N 2 N i=1 d l=1 E Gt|G1:t-1 [[ξ t,i ] 2 l ] [u t,i ] l (c) ≤ 1 N N i=1 ∇f i (x t,i ) √ u t,i 2 + d N σ 2 , where (a) is due to E Gt|G1:t-1 [ξ t,i ] = 0 and ξ t,i is independent of x t,j , u t,j for all j, and ξ j , for all j = i, (b) comes from the fact that x t,i , u t,i are fixed given G 1:t , (c) is due to E Gt|G1:t-1 [[ξ t,i ] 2 l ≤ σ 2 and [u t.i ] l ≥ by definition. Then we have E   1 N N i=1 g t,i √ u t,i 2   =E G1:t-1   E Gt|G1:t-1   1 N N i=1 g t,i √ u t,i 2     ≤E G1:t-1   1 N N i=1 ∇f i (x t,i ) √ u t,i 2 + d N σ 2   =E   1 N N i=1 ∇f i (x t,i ) √ u t,i 2   + d N σ 2 . ( ) In traditional analysis of SGD-like distributed algorithms, the term corresponding to E 1 N N i=1 ∇fi(xt,i) √ ut,i will be merged with the first order descent when the stepsize is chosen to be small enough. However, in our case, the term cannot be merged because it is different from the first order descent in our algorithm. A brute-force upper bound is possible but this will lead to a worse convergence rate in terms of N . Thus, we need a more detailed analysis for the term in the following. 
E   1 N N i=1 ∇f i (x t,i ) √ u t,i 2   =E   1 N N i=1 ∇f i (x t,i ) U t + 1 N N i=1 ∇f i (x t,i ) 1 √ u t,i - 1 U t 2   ≤2E   1 N N i=1 ∇f i (x t,i ) U t 2   + 2E   1 N N i=1 ∇f i (x t,i ) 1 √ u t,i - 1 U t 2   ≤2E   1 N N i=1 ∇f i (x t,i ) U t 2   + 2E   1 N N i=1 ∇f i (x t,i ) 1 √ u t,i - 1 U t 2   ≤2E   1 N N i=1 ∇f i (x t,i ) U t 2   + 2E 1 N N i=1 G 2 ∞ 1 √ 1 √ u t,i - 1 U t 1 . Summing over T , we have T t=1 E   1 N N i=1 ∇f i (x t,i ) √ u t,i 2   ≤2 T t=1 E   1 N N i=1 ∇f i (x t,i ) U t 2   + 2 T t=1 E 1 N N i=1 G 2 ∞ 1 √ 1 √ u t,i - 1 U t 1 . ( ) For the last term on RHS of (31), we can bound it similarly as what we did for D 2 from ( 24) to ( 25), which yields T t=1 E 1 N N i=1 G 2 ∞ 1 √ 1 √ u t,i - 1 U t 1 ≤ T t=1 E 1 N N i=1 G 2 ∞ 1 √ 1 2 1.5 u t,i -U t 1 = T t=1 E 1 N G 2 ∞ 1 2 2 U t 1 T -U t abs ≤ T t=1 E 1 N G 2 ∞ 1 2 2 - N l=2 Ũt q l q T l abs ≤ 1 √ N G 2 ∞ 1 2 2 E T -1 o=0 λ 1 -λ (-Vo-1 + Vo ) abs . Further, we have T t=1 E   1 N N i=1 ∇f i (x t,i ) U t 2   ≤2 T t=1 E   1 N N i=1 ∇f i (X t ) U t 2   + 2 T t=1 E   1 N N i=1 ∇f i (X t ) -∇f i (x t,i ) U t 2   =2 T t=1 E   ∇f (X t ) U t 2   + 2 T t=1 E   1 N N i=1 ∇f i (X t ) -∇f i (x t,i ) U t 2   and the last term on RHS of the above inequality can be bounded following similar procedures from (17) to ( 22), as what we did for D 3 . Completing the procedures yields T t=1 E   1 N N i=1 ∇f i (X t ) -∇f i (x t,i ) U t 2   ≤ T t=1 E L 1 1 N N i=1 x t,i -X t 2 ≤ T t=1 E L 1 1 N α 2 1 1 -λ N dG 2 ∞ 1 =T L 1 2 α 2 1 1 -λ dG 2 ∞ . Finally, combining (30) to (33), we get T t=1 E   1 N N i=1 g t,i √ u t,i 2   ≤4 T t=1 E   ∇f (X t ) U t 2   + 4T L 1 2 α 2 1 1 -λ dG 2 ∞ + 2 1 √ N G 2 ∞ 1 2 2 E T -1 o=0 λ 1 -λ (-Vo-1 + Vo ) abs + T d N σ 2 ≤4 1 √ T t=1 E   ∇f (X t ) U 1/4 t 2   + 4T L 1 2 α 2 1 1 -λ dG 2 ∞ + 2 1 √ N G 2 ∞ 1 2 2 E T -1 o=0 λ 1 -λ (-Vo-1 + Vo ) abs + T d N σ 2 . 
where the last inequality is due to each element of U t is lower bounded by by definition. Combining all above, we obtain 1 T T t=1 E   ∇f (X t ) U 1/4 t 2   ≤ 2 T α (E[f (Z 1 )] -E[f (Z T +1 )]) + L T α β 1 1 -β 1 2 G 2 ∞ √ N 1 2 1 1 -λ E [V T ] + 8L T α 1 √ T t=1 E   ∇f (X t ) U 1/4 t 2   + 8L 2 α 1 2 α 2 1 1 -λ dG 2 ∞ (34) + 4L T α 1 √ N G 2 ∞ 1 2 2 E T -1 o=0 λ 1 -λ (-Vo-1 + Vo ) abs + 2Lα d N σ 2 + 2 T β 1 1 -β 1 G 2 ∞ 1 2 1.5 1 √ N E 1 1 -λ V T + 2 T G 2 ∞ √ N 1 2 1.5 λ 1 -λ E [V T ] + 3 T T t=1 L 1 1 -λ 2 α 2 dG 2 ∞ 1 1.5 + T t=1 L β 1 1 -β 1 2 α 2 d G 2 ∞ 1.5 = 2 T α (E[f (Z 1 )] -E[f (Z T +1 )]) + 2Lα d N σ 2 + 8Lα 1 √ 1 T T t=1 E   ∇f (X t ) U 1/4 t 2   + 3α 2 d β 1 1 -β 1 2 + 1 1 -λ 2 L G 2 ∞ 1.5 + 8α 3 L 2 1 1 -λ d G 2 ∞ 2 + 1 T 1.5 G 2 ∞ √ N 1 1 -λ Lα β 1 1 -β 1 2 1 0.5 + λ + β 1 1 -β 1 + 2Lα 1 0.5 λ E [V T ] . where V T := T t=1 (-Vt-2 + Vt-1 ) abs . Set α = 1 √ dT and when α ≤ 0.5 16L , we further have 1 T T t=1 E   ∇f (X t ) U 1/4 t 2   ≤ 4 T α (E[f (Z 1 )] -E[f (Z T +1 )]) + 4Lα d N σ 2 + 6α 2 d β 1 1 -β 1 2 + 1 1 -λ 2 L G 2 ∞ 1.5 + 16α 3 L 2 1 1 -λ d G 2 ∞ 2 + 2 T 1.5 G 2 ∞ √ N 1 1 -λ Lα β 1 1 -β 1 2 1 0.5 + λ + β 1 1 -β 1 + 2Lα 1 0.5 λ E [V T ] ≤ 4 T α (E[f (Z 1 )] -min x f (x)) + 4Lα d N σ 2 + 6α 2 d β 1 1 -β 1 2 + 1 1 -λ 2 L G 2 ∞ 1.5 + 16α 3 dL 2 1 1 -λ G 2 ∞ 2 + 2 T 1.5 G 2 ∞ √ N 1 1 -λ Lα β 1 1 -β 1 2 1 0.5 + λ + β 1 1 -β 1 + 2Lα 1 0.5 λ E [V T ] ≤C 1 1 T α (E[f (Z 1 )] -min x f (x)) + α dσ 2 N + C 2 α 2 d + C 3 α 3 d + 1 T √ N (C 4 + C 5 α)E [V T ] where the first inequality is obtained by moving the term 8Lα 1 √ 1 T T t=1 E ∇f (Xt) U 1/4 t 2 on the RHS of (34) to the LHS to cancel it using the assumption 8Lα 1 √ ≤ 1 2 followed by multiplying both sides by 2. The constants introduced in the last step are defined as following C 1 = max(4, 4L/ ) , C 2 =6 β 1 1 -β 1 2 + 1 1 -λ 2 L G 2 ∞ 1.5 , C 3 =16L 2 1 1 -λ G 2 ∞ 2 , C 4 = 2 1.5 1 1 -λ λ + β 1 1 -β 1 G 2 ∞ , C 5 = 2 2 1 1 -λ L β 1 1 -β 1 2 G 2 ∞ + 4 2 λ 1 -λ LG 2 ∞ . 
Substituting $Z_1 = \bar{X}_1$ completes the proof.

C PROOF OF THEOREM 3

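Under the assumptions stated in Corollary 2.1, we have that

$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right] \le{}& C_1\frac{\sqrt{d}}{\sqrt{TN}}\left(\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \sigma^2\right) + C_2\frac{N}{T} + C_3\frac{N^{1.5}}{T^{1.5}d^{0.5}} \\
&+ \left(C_4\frac{1}{T\sqrt{N}} + C_5\frac{1}{T^{1.5}d^{0.5}}\right)\mathbb{E}\left[\sum_{t=1}^{T}\left\|-\hat{V}_{t-2} + \hat{V}_{t-1}\right\|_{abs}\right], \quad (36)
\end{aligned}$$

where $\|\cdot\|_{abs}$ denotes the entry-wise $L_1$ norm of a matrix (i.e., $\|A\|_{abs} = \sum_{i,j}|A_{ij}|$) and $C_1, C_2, C_3, C_4, C_5$ are defined in Theorem 2. Since Algorithm 3 is a special case of Algorithm 2, building on the result of Theorem 2, we just need to characterize the growth speed of $\mathbb{E}\left[\sum_{t=1}^{T}\|-\hat{V}_{t-2} + \hat{V}_{t-1}\|_{abs}\right]$ to prove convergence of Algorithm 3. By the update rule of Algorithm 3, we know $\hat{V}_t$ is non-decreasing and thus

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T}\left\|-\hat{V}_{t-2} + \hat{V}_{t-1}\right\|_{abs}\right] &= \mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{j=1}^{d}\left|-[\hat{v}_{t-2,i}]_j + [\hat{v}_{t-1,i}]_j\right|\right] \\
&= \mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{j=1}^{d}\left(-[\hat{v}_{t-2,i}]_j + [\hat{v}_{t-1,i}]_j\right)\right] \\
&= \mathbb{E}\left[\sum_{i=1}^{N}\sum_{j=1}^{d}\left(-[\hat{v}_{-1,i}]_j + [\hat{v}_{T-1,i}]_j\right)\right] \\
&= \mathbb{E}\left[\sum_{i=1}^{N}\sum_{j=1}^{d}\left(-[\hat{v}_{0,i}]_j + [\hat{v}_{T-1,i}]_j\right)\right],
\end{aligned}$$

where the last equality is because we defined $\hat{V}_{-1} \triangleq \hat{V}_0$ previously.

Further, because $\|g_{t,i}\|_\infty \le G_\infty$ for all $t, i$ and $v_{t,i}$ is an exponential moving average of $g_{k,i}^2$, $k = 1, 2, \cdots, t$, we know $|[v_{t,i}]_j| \le G_\infty^2$ for all $t, i, j$. In addition, by the update rule of $\hat{V}_t$, each element of $\hat{V}_t$ also cannot be greater than $G_\infty^2$, i.e., $|[\hat{v}_{t,i}]_j| \le G_\infty^2$ for all $t, i, j$. Given the fact that $[\hat{v}_{0,i}]_j \ge 0$, we have

$$\mathbb{E}\left[\sum_{t=1}^{T}\left\|-\hat{V}_{t-2} + \hat{V}_{t-1}\right\|_{abs}\right] = \mathbb{E}\left[\sum_{i=1}^{N}\sum_{j=1}^{d}\left(-[\hat{v}_{0,i}]_j + [\hat{v}_{T-1,i}]_j\right)\right] \le \mathbb{E}\left[\sum_{i=1}^{N}\sum_{j=1}^{d}G_\infty^2\right] = NdG_\infty^2.$$

Substituting the above into (36), we have

$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right] \le{}& C_1\frac{\sqrt{d}}{\sqrt{TN}}\left(\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \sigma^2\right) + C_2\frac{N}{T} + C_3\frac{N^{1.5}}{T^{1.5}d^{0.5}} + \left(C_4\frac{1}{T\sqrt{N}} + C_5\frac{1}{T^{1.5}d^{0.5}}\right)NdG_\infty^2 \\
={}& C_1'\frac{\sqrt{d}}{\sqrt{TN}}\left(\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \sigma^2\right) + C_2'\frac{N}{T} + C_3'\frac{N^{1.5}}{T^{1.5}d^{0.5}} + C_4'\frac{\sqrt{N}d}{T} + C_5'\frac{Nd^{0.5}}{T^{1.5}},
\end{aligned}$$

where $C_1' = C_1$, $C_2' = C_2$, $C_3' = C_3$, $C_4' = C_4 G_\infty^2$ and $C_5' = C_5 G_\infty^2$, and we conclude the proof.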
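The key step above is a telescoping argument: because $\hat{v}$ never decreases, the sum of its absolute increments collapses to the difference between its last and first values, which is at most $G_\infty^2$ per coordinate. The sketch below checks this for a single node and a single coordinate. It is a simplified illustration, not Algorithm 3 itself: the gossip step on $\tilde{u}$ is omitted, and the function name, `beta2`, `v_hat0`, and the random gradients are assumptions made for the example.

```python
import random

def amsgrad_v_hat_sequence(grads, beta2=0.99, v_hat0=0.0):
    """Return [v_hat_0, v_hat_1, ..., v_hat_T] for one coordinate, where
    v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 and v_hat_t = max(v_hat_{t-1}, v_t)."""
    v = 0.0
    v_hat = v_hat0
    sequence = [v_hat]
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = max(v_hat, v)  # the max keeps v_hat non-decreasing
        sequence.append(v_hat)
    return sequence

def total_variation(sequence):
    """Sum of absolute increments, the scalar analogue of sum_t ||V_hat_{t-1} - V_hat_{t-2}||_abs."""
    return sum(abs(b - a) for a, b in zip(sequence, sequence[1:]))

# Bounded placeholder gradients with |g_t| <= G_inf (fixed seed).
G_inf = 2.0
rng = random.Random(1)
grads = [rng.uniform(-G_inf, G_inf) for _ in range(500)]
v_hat_seq = amsgrad_v_hat_sequence(grads)
tv = total_variation(v_hat_seq)
```

In this run `tv` coincides with `v_hat_seq[-1] - v_hat_seq[0]` up to rounding and stays below `G_inf ** 2`, independently of the number of steps, which is why the corresponding term in the bound does not grow with $T$.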

D PROOF OF THEOREM 4

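The proof follows the same flow as that of Theorem 3. Under the assumptions stated in Corollary 2.1, setting $\alpha = \sqrt{N}/\sqrt{Td}$, we have that

$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right] \le{}& C_1\frac{\sqrt{d}}{\sqrt{TN}}\left(\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \sigma^2\right) + C_2\frac{N}{T} + C_3\frac{N^{1.5}}{T^{1.5}d^{0.5}} \\
&+ \left(C_4\frac{1}{T\sqrt{N}} + C_5\frac{1}{T^{1.5}d^{0.5}}\right)\mathbb{E}\left[\sum_{t=1}^{T}\left\|-\hat{V}_{t-2} + \hat{V}_{t-1}\right\|_{abs}\right], \quad (39)
\end{aligned}$$

where $\|\cdot\|_{abs}$ denotes the entry-wise $L_1$ norm of a matrix (i.e., $\|A\|_{abs} = \sum_{i,j}|A_{ij}|$) and $C_1, C_2, C_3, C_4, C_5$ are defined in Theorem 2. Again, since decentralized AdaGrad is a special case of Algorithm 2, we can apply Corollary 2.1, and what we need is to upper bound $\mathbb{E}\left[\sum_{t=1}^{T}\|-\hat{V}_{t-2} + \hat{V}_{t-1}\|_{abs}\right]$ to derive the convergence rate. By the update rule of decentralized AdaGrad, we have $\hat{v}_{t,i} = \frac{1}{t}\left(\sum_{k=1}^{t} g_{k,i}^2\right)$ for $t \ge 1$ and $\hat{v}_{0,i} = \epsilon\mathbf{1}$. Then, using this expression for the terms with $t \ge 3$, we have

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T}\left\|-\hat{V}_{t-2} + \hat{V}_{t-1}\right\|_{abs}\right] &= \mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{j=1}^{d}\left|-[\hat{v}_{t-2,i}]_j + [\hat{v}_{t-1,i}]_j\right|\right] \\
&\le \mathbb{E}\left[\sum_{t=3}^{T}\sum_{i=1}^{N}\sum_{j=1}^{d}\left|-\frac{1}{t-2}\Big[\sum_{k=1}^{t-2}g_{k,i}^2\Big]_j + \frac{1}{t-1}\Big[\sum_{k=1}^{t-1}g_{k,i}^2\Big]_j\right|\right] + Nd(G_\infty^2 - \epsilon) \\
&\le \mathbb{E}\left[\sum_{t=3}^{T}\sum_{i=1}^{N}\sum_{j=1}^{d}\left|\Big(\frac{1}{t-1} - \frac{1}{t-2}\Big)\Big[\sum_{k=1}^{t-2}g_{k,i}^2\Big]_j + \frac{1}{t-1}[g_{t-1,i}^2]_j\right|\right] + NdG_\infty^2 \\
&= \mathbb{E}\left[\sum_{t=3}^{T}\sum_{i=1}^{N}\sum_{j=1}^{d}\left|-\frac{1}{(t-1)(t-2)}\Big[\sum_{k=1}^{t-2}g_{k,i}^2\Big]_j + \frac{1}{t-1}[g_{t-1,i}^2]_j\right|\right] + NdG_\infty^2 \\
&\le \mathbb{E}\left[\sum_{t=3}^{T}\sum_{i=1}^{N}\sum_{j=1}^{d}\max\left(\frac{1}{(t-1)(t-2)}\Big[\sum_{k=1}^{t-2}g_{k,i}^2\Big]_j, \frac{1}{t-1}[g_{t-1,i}^2]_j\right)\right] + NdG_\infty^2 \\
&\le \mathbb{E}\left[Nd\sum_{t=3}^{T}\frac{G_\infty^2}{t-1}\right] + NdG_\infty^2 \\
&\le NdG_\infty^2\log(T) + NdG_\infty^2 = NdG_\infty^2(\log(T) + 1),
\end{aligned}$$

where we used the definition $\hat{V}_{-1} \triangleq \hat{V}_0$ and $\|g_{k,i}\|_\infty \le G_\infty$ by assumption. Substituting the above into (39), we have

$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\frac{\nabla f(\bar{X}_t)}{\bar{U}_t^{1/4}}\right\|^2\right] \le{}& C_1\frac{\sqrt{d}}{\sqrt{TN}}\left(\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \sigma^2\right) + C_2\frac{N}{T} + C_3\frac{N^{1.5}}{T^{1.5}d^{0.5}} \\
&+ \left(C_4\frac{1}{T\sqrt{N}} + C_5\frac{1}{T^{1.5}d^{0.5}}\right)NdG_\infty^2(\log(T) + 1) \\
={}& C_1'\frac{\sqrt{d}}{\sqrt{TN}}\left(\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \sigma^2\right) + C_2'\frac{N}{T} + C_3'\frac{N^{1.5}}{T^{1.5}d^{0.5}} \\
&+ C_4'\frac{\sqrt{N}d(\log(T) + 1)}{T} + C_5'\frac{Nd^{0.5}(\log(T) + 1)}{T^{1.5}},
\end{aligned}$$

where $C_1' = C_1$, $C_2' = C_2$, $C_3' = C_3$, $C_4' = C_4 G_\infty^2$ and $C_5' = C_5 G_\infty^2$, and we conclude the proof.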
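Unlike the AMSGrad case, the averaged $\hat{v}_{t,i}$ of decentralized AdaGrad is not monotone, so the increments do not telescope; the proof instead bounds each increment by $G_\infty^2/(t-1)$, which sums to a logarithm. The sketch below checks the resulting per-coordinate bound $G_\infty^2(1 + \log T)$ numerically. It is a single-node, single-coordinate illustration under assumed values (`eps`, `G_inf`, and the random gradients are placeholders), not the paper's code.

```python
import math
import random

def adagrad_v_hat_sequence(grads, eps=1e-3):
    """Return [v_hat_0, ..., v_hat_T] with v_hat_0 = eps and
    v_hat_t = (1/t) * sum_{k<=t} g_k^2 for t >= 1 (one coordinate)."""
    sequence = [eps]
    running_sum = 0.0
    for t, g in enumerate(grads, start=1):
        running_sum += g * g
        sequence.append(running_sum / t)
    return sequence

def variation_up_to(sequence, T):
    """sum_{t=1}^{T} |v_hat_{t-1} - v_hat_{t-2}| with the convention v_hat_{-1} = v_hat_0,
    i.e. the sum of the first T-1 absolute increments of the sequence."""
    return sum(abs(sequence[s] - sequence[s - 1]) for s in range(1, T))

# Bounded placeholder gradients with |g_t| <= G_inf (fixed seed).
G_inf = 1.5
T = 1000
rng = random.Random(2)
grads = [rng.uniform(-G_inf, G_inf) for _ in range(T)]
v_hat_seq = adagrad_v_hat_sequence(grads)
variation = variation_up_to(v_hat_seq, T)
log_bound = G_inf ** 2 * (1.0 + math.log(T))
```

For this run `variation` stays below `log_bound`; multiplying the per-coordinate bound by the $Nd$ coordinates across all nodes gives the $NdG_\infty^2(\log(T) + 1)$ term used in the proof.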

E ADDITIONAL EXPERIMENTS AND DETAILS

In this section, we compare the training loss and testing accuracy of different algorithms, namely Decentralized Stochastic Gradient Descent (DGD), Decentralized Adam (DADAM) and our proposed Decentralized AMSGrad, with different stepsizes on a heterogeneous data distribution. We use 5 nodes, and the heterogeneous data distribution is created by assigning each node data from only two labels. Note that there are no overlapping labels between different nodes. For all algorithms, we compare stepsizes in the grid $[10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}]$. Figure 2 shows the training loss and test accuracy for the DGD algorithm. We observe that the stepsize $10^{-3}$ works best for DGD in terms of test accuracy, while $10^{-1}$ works best in terms of training loss. This difference is caused by the inconsistency among the parameter values on different nodes when the stepsize is large. The training loss is calculated as the average of the loss values of the different local models evaluated on their local training batches. Thus, while the training loss is small at a particular node, the test accuracy will be low when evaluating data with labels not seen by the node (recall that each node contains data with different labels since we are in the heterogeneous setting). Figure 3 shows the performance of decentralized AMSGrad with different stepsizes. We see that its best performance is better than that of DGD and that its performance is more stable (the test performance is less sensitive to stepsize tuning). Figure 4 displays the performance of the Decentralized Adam algorithm. As expected, the performance of DADAM is not as good as that of DGD or decentralized AMSGrad. Its divergence characteristic, highlighted in Section 2.3, coupled with the heterogeneity in the data, amplifies its non-convergence issue in our experiments.
From the experiments above, we can see the advantages of decentralized AMSGrad in terms of both performance and ease of parameter tuning, and the importance of ensuring the theoretical convergence of any newly proposed methods in the presented setting. 
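The heterogeneous split described above (5 nodes, two labels per node, no label shared between nodes) can be sketched as follows. This is a hypothetical reconstruction for illustration only: it assumes a 10-class dataset with integer labels 0 to 9 assigned to nodes in consecutive pairs, and the function names and toy samples are not taken from the paper's experimental code.

```python
def partition_labels(num_nodes=5, labels_per_node=2):
    """Assign disjoint consecutive label pairs to nodes: node 0 -> [0, 1], node 1 -> [2, 3], ..."""
    return {
        node: list(range(node * labels_per_node, (node + 1) * labels_per_node))
        for node in range(num_nodes)
    }

def split_by_label(samples, node_labels):
    """Route each (features, label) pair to the unique node that owns its label."""
    label_to_node = {
        label: node for node, labels in node_labels.items() for label in labels
    }
    shards = {node: [] for node in node_labels}
    for features, label in samples:
        shards[label_to_node[label]].append((features, label))
    return shards

# Toy stand-in for a labelled dataset: 100 samples cycling through labels 0..9.
toy_samples = [([float(k)], k % 10) for k in range(100)]
node_labels = partition_labels()
shards = split_by_label(toy_samples, node_labels)
```

With such a split, a local model trained only on its own shard never sees eight of the ten labels, which is consistent with the gap between local training loss and test accuracy reported for large stepsizes.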



et al. (2019) initiated an attempt to bring adaptive gradient methods into decentralized optimization with Decentralized ADAM (DADAM), shown in Algorithm 1.

Figure 1: Training loss and testing accuracy for homogeneous and heterogeneous data.

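Lemma A.1. Given a set of numbers $a_1, \cdots, a_n$, denote their mean by $\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$. Define $b_i(r) = \max(a_i, r)$ and $\bar{b}(r) = \frac{1}{n}\sum_{i=1}^{n} b_i(r)$. For any $r$ and $r'$ with $r' \ge r$, we have

$$\sum_{i=1}^{n}|b_i(r) - \bar{b}(r)| \ge \sum_{i=1}^{n}|b_i(r') - \bar{b}(r')| \quad (8)$$

and when $r \le \min_{i\in[n]} a_i$, we have

$$\sum_{i=1}^{n}|b_i(r) - \bar{b}(r)| = \sum_{i=1}^{n}|a_i - \bar{a}|. \quad (9)$$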

Since $s(r) \le n$, we know $(l(r) - n) + (s(r) - l(r)) - (n - s(r)) \le 0$ and thus $h'(r) \le 0$, which means $h(r)$ is non-increasing on each piece. Combined with the fact that $h(r)$ is continuous, this proves (8). When $r \le \min_{i\in[n]} a_i$, we have $b_i(r) = \max(a_i, r) = a_i$ for all $i \in [n]$ and $\bar{b}(r) = \frac{1}{n}\sum_{i=1}^{n} a_i = \bar{a}$, which proves (9).
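Lemma A.1 says that clipping a set of numbers from below by a threshold $r$ can only shrink their total absolute deviation from the mean, and leaves it unchanged when $r$ lies below all of them. The sketch below checks both claims numerically on random data; the function name and test values are chosen for this example only.

```python
import random

def clipped_deviation(a, r):
    """h(r) = sum_i |b_i(r) - mean(b(r))| with b_i(r) = max(a_i, r)."""
    b = [max(ai, r) for ai in a]
    mean_b = sum(b) / len(b)
    return sum(abs(bi - mean_b) for bi in b)

# Random numbers a_1..a_n and an increasing grid of thresholds r (fixed seed).
rng = random.Random(3)
a = [rng.uniform(-5.0, 5.0) for _ in range(20)]
thresholds = [-6.0 + 0.05 * k for k in range(241)]  # covers [-6, 6]
h_values = [clipped_deviation(a, r) for r in thresholds]

# Deviation of the unclipped numbers, the right-hand side of (9).
mean_a = sum(a) / len(a)
unclipped_deviation = sum(abs(ai - mean_a) for ai in a)
```

On this grid `h_values` is non-increasing, it starts at `unclipped_deviation` (the first threshold is below every $a_i$), and it reaches zero once $r$ exceeds every $a_i$. This is the property applied to $U_t = \max(\tilde{U}_t, \epsilon)$ in the proof of Theorem 2.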

5 d 0.5 E T t=1 (-Vt-2 + Vt-1 ) abs , (39)where • abs denotes the entry-wise L 1 norm of a matrix (i.e A abs = i,j |A ij |) and C 1 , C 2 , C 3 , C 4 , C 5 are defined in Theorem 2. Again, Since decentralized AdaGrad is a special case of 2, we can apply Corollary 2.1 and what we need is to upper bound E T t=1 (-Vt-2 + Vt-1 ) abs derive convergence rate. By the update rule of decentralized AdaGrad, we have vt,i = 1 t ( t k=1 g 2 k,i ) for t ≥ 1 and v0,i = 1. Then we have for t ≥ 3,

Performance comparison of different stepsizes for DADAM


Combining ( 26) and ( 27), we haveWhat remains is to bound. By update rule of Z t , we have

