CONVERGENT ADAPTIVE GRADIENT METHODS IN DECENTRALIZED OPTIMIZATION

Abstract

Adaptive gradient methods, including Adam, AdaGrad, and their variants, have been very successful for training deep learning models such as neural networks in the past few years. Meanwhile, given the need for distributed training procedures, distributed optimization algorithms have attracted considerable attention. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In that regard, attention is increasingly shifting from the traditional parameter-server training paradigm to the decentralized one, which usually incurs lower communication costs. In this paper, we rigorously incorporate adaptive gradient methods into decentralized training procedures and introduce novel convergent decentralized adaptive gradient methods. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed framework and show that if a given adaptive gradient method converges under some specific conditions, then its decentralized counterpart is also convergent.

1. INTRODUCTION

Distributed training of machine learning models has drawn growing attention in the past few years due to its practical benefits and necessity. Given the evolution of the computing capabilities of CPUs and GPUs, computation time in distributed settings is gradually dominated by communication time in many circumstances (Chilimbi et al., 2014; McMahan et al., 2017). As a result, a large number of recent works have focused on reducing the communication cost of distributed learning (Alistarh et al., 2017; Lin et al., 2018; Wangni et al., 2018; Stich et al., 2018; Wang et al., 2018; Tang et al., 2019). In the traditional parameter (central) server setting, where a parameter server manages communication in the whole network, many effective communication reductions have been proposed based on gradient compression (Aji & Heafield, 2017) and quantization (Chen et al., 2010; Ge et al., 2013; Jegou et al., 2010) techniques. Despite these techniques, the communication cost usually still scales linearly with the number of workers. Due to this limitation, and given the sheer number of distributed devices, the decentralized training paradigm (Duchi et al., 2011b), where the parameter server is removed and each node communicates only with its neighbors, is drawing attention. It has been shown in Lian et al. (2017) that decentralized training algorithms can outperform parameter server-based algorithms when the training bottleneck is the communication cost. The decentralized paradigm is also preferred when a central parameter server is not available. In light of recent advances in nonconvex optimization, an effective way to accelerate training is to use adaptive gradient methods such as AdaGrad (Duchi et al., 2011a), Adam (Kingma & Ba, 2015), or AMSGrad (Reddi et al., 2018).
Their popularity is due to their practical benefits in training neural networks, notably faster convergence and easier parameter tuning compared with Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951). Despite a large body of work within the distributed optimization literature, few works have considered bringing adaptive gradient methods into distributed training, largely due to the lack of understanding of their convergence behaviors. Notably, Reddi et al. (2020) develop an Adam-type method for distributed optimization problems with a direct application to federated learning. An inner loop is employed to compute mini-batch gradients on each node, and a global adaptive step is applied to update the global parameter at each outer iteration. Yet, in the setting of our paper, nodes can only communicate with their neighbors on a fixed communication graph, whereas server/worker communication is required in Reddi et al. (2020). Designing adaptive methods in such settings is highly non-trivial due to the already complex update rules and to the interaction between the effect of using adaptive learning rates and the decentralized communication protocols. This paper is an attempt at bridging the gap between both realms in nonconvex optimization. Our contributions are summarized as follows:
• We investigate the possibility of using adaptive gradient methods in the decentralized training paradigm, where nodes have only a local view of the whole communication graph. We develop a general technique that converts an adaptive gradient method from a centralized method to its decentralized variant.
• Using our proposed technique, we present a new decentralized optimization algorithm, called decentralized AMSGrad, as the decentralized counterpart of AMSGrad.
• We provide a theoretical verification interface, in Theorem 2, for analyzing the behavior of decentralized adaptive gradient methods obtained from our technique.
In particular, we characterize the convergence rate of decentralized AMSGrad, which is, to the best of our knowledge, the first convergent decentralized adaptive gradient method. A novel ingredient of our framework is a mechanism that enforces consensus on the adaptive learning rates at different nodes. We demonstrate the importance of this consensus by constructing a problem instance on which a recently proposed decentralized adaptive gradient method, namely DADAM (Nazari et al., 2019), a decentralized version of AMSGrad, diverges. Although DADAM performs consensus on the model parameters, it lacks a consensus principle on the adaptive learning rates. After presenting related work and important concepts of decentralized adaptive methods in Section 2, we develop in Section 3 our general framework for converting any adaptive gradient algorithm into its decentralized counterpart, along with a rigorous finite-time convergence analysis, concluding with illustrative examples of the framework's behavior in practice.
Notations: $x_{t,i}$ denotes the variable $x$ at node $i$ and iteration $t$. $\|\cdot\|_{\mathrm{abs}}$ denotes the entry-wise $\ell_1$ norm of a matrix, i.e., $\|A\|_{\mathrm{abs}} = \sum_{i,j} |A_{i,j}|$. We introduce important notations used throughout the paper: for any $t > 0$, $G_t := [g_{t,1}, g_{t,2}, \ldots, g_{t,N}]$ (where each $g_{t,i}$ is a column vector), and similarly $M_t := [m_{t,1}, \ldots, m_{t,N}]$, $X_t := [x_{t,1}, \ldots, x_{t,N}]$, $U_t := [u_{t,1}, \ldots, u_{t,N}]$, $\tilde{U}_t := [\tilde{u}_{t,1}, \ldots, \tilde{u}_{t,N}]$, $V_t := [v_{t,1}, \ldots, v_{t,N}]$, $\hat{V}_t := [\hat{v}_{t,1}, \ldots, \hat{v}_{t,N}]$. We also define the averages $\overline{\nabla f}(X_t) := \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(x_{t,i})$, $\bar{x}_t := \frac{1}{N}\sum_{i=1}^{N} x_{t,i}$, $\bar{u}_t := \frac{1}{N}\sum_{i=1}^{N} u_{t,i}$, and $\bar{\tilde{u}}_t := \frac{1}{N}\sum_{i=1}^{N} \tilde{u}_{t,i}$.
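The consensus-on-learning-rates mechanism can be illustrated with a small numerical sketch. The two-node example below is hypothetical and is not the paper's algorithm; it only shows why gossip-averaging the second-moment statistics makes the nodes' effective step sizes agree.

```python
import numpy as np

# Hypothetical 2-node example: without consensus, nodes with very
# different second-moment statistics v-hat use very different
# effective learning rates; one gossip round aligns them.
W = np.array([[0.5, 0.5],
              [0.5, 0.5]])       # doubly stochastic mixing matrix
vhat = np.array([4.0, 0.01])     # very different local statistics
lr = 0.1

# Without consensus, effective step sizes differ by a factor of 20:
steps_local = lr / np.sqrt(vhat)        # [0.05, 1.0]
# One consensus round makes both nodes use the same statistic:
steps_mixed = lr / np.sqrt(W @ vhat)    # both ~0.0706
```

Keeping these effective step sizes in agreement across nodes is exactly what DADAM lacks and what the proposed framework enforces.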

2.1. RELATED WORK

Decentralized optimization: Traditional decentralized optimization methods include well-known algorithms such as ADMM (Boyd et al., 2011), Dual Averaging (Duchi et al., 2011b), and Distributed Subgradient Descent (Nedic & Ozdaglar, 2009). More recent algorithms include EXTRA (Shi et al., 2015), NEXT (Di Lorenzo & Scutari, 2016), Prox-PDA (Hong et al., 2017), GNSD (Lu et al., 2019), and Choco-SGD (Koloskova et al., 2019). While these algorithms are commonly used in applications other than deep learning, recent algorithmic advances in the machine learning community have shown that decentralized optimization can also be useful for training deep models such as neural networks.
Adaptive gradient methods: Adaptive gradient methods have been popular in recent years due to their superior performance in training neural networks. The most commonly used adaptive methods include AdaGrad (Duchi et al., 2011a), Adam (Kingma & Ba, 2015), and their variants. The key features of such methods lie in the use of momentum and of adaptive learning rates, meaning that the learning rate changes during the optimization and is anisotropic, i.e., depends on the dimension. The method of reference, Adam, has been analyzed in Reddi et al. (2018), where the authors point out an error in previous convergence analyses. Since then, a variety of papers have proposed provably convergent variants.
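For concreteness, the single-node AMSGrad update discussed above can be sketched as follows, following the rule of Reddi et al. (2018); the hyperparameters and the one-dimensional objective f(x) = x² are illustrative choices to keep the example self-contained.

```python
import numpy as np

# Minimal single-node AMSGrad sketch on f(x) = x^2.
beta1, beta2, lr, eps = 0.9, 0.99, 0.1, 1e-8
x, m, v, vhat = 3.0, 0.0, 0.0, 0.0
for t in range(300):
    g = 2.0 * x                           # gradient of x^2
    m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment (as in Adam)
    vhat = max(vhat, v)                   # AMSGrad fix: v-hat never decreases
    x -= lr * m / (np.sqrt(vhat) + eps)   # anisotropic, adaptive step
```

The `max` step is AMSGrad's departure from Adam: it keeps the effective learning rate non-increasing, which is what restores the convergence guarantee pointed out as missing in Reddi et al. (2018).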

Lian et al. (2017) demonstrate that a stochastic version of Decentralized Subgradient Descent can outperform parameter server-based algorithms when the communication cost is high. Tang et al. (2018) propose the D² algorithm, which improves the convergence rate over Stochastic Subgradient Descent. Assran et al. (2019) propose Stochastic Gradient Push, which is more robust to network failures when training neural networks. The study of decentralized training algorithms in the machine learning community is only at its initial stage. No existing work, to our knowledge, has seriously considered integrating adaptive gradient methods into decentralized learning. One noteworthy work (Nazari et al., 2019) proposes a decentralized version of AMSGrad (Reddi et al., 2018), which is proven to satisfy a non-standard regret bound.
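The decentralized methods surveyed above all rely on gossip-style neighbor averaging. A minimal sketch (hypothetical 4-node ring topology, with no gradients involved) shows how repeated local mixing drives all nodes toward the network average:

```python
import numpy as np

# Gossip averaging on a hypothetical 4-node ring: each node mixes only
# with its two neighbors, yet all local values converge to the mean.
N = 4
W = np.zeros((N, N))                 # doubly stochastic mixing matrix
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

x = np.array([0.0, 4.0, 8.0, 12.0])  # one local scalar per node
for _ in range(50):
    x = W @ x                        # one communication round
# After 50 rounds, every node holds approximately the mean, 6.0.
```

Decentralized training algorithms interleave such mixing rounds with local gradient steps, so no parameter server is needed.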

