DADAO: DECOUPLED ACCELERATED DECENTRALIZED ASYNCHRONOUS OPTIMIZATION

Abstract

DADAO is a novel decentralized asynchronous stochastic first-order algorithm to minimize a sum of L-smooth and µ-strongly convex functions distributed over a time-varying connectivity network of size n. We model the local gradient updates and gossip communication procedures with separate, independent Poisson point processes, decoupling the computation and communication steps in addition to making the whole approach completely asynchronous. Our method employs primal gradients and does not use a multi-consensus inner loop nor other ad-hoc mechanisms such as Error Feedback, Gradient Tracking, or a proximal operator. By relating the inverse of the smallest positive eigenvalue χ₁* and the effective resistance χ₂* of our graph to a necessary minimal communication rate between the nodes of the network, we show how many communications our algorithm requires to reach a precision ε. If SGD with uniform noise σ² is used, we reach a precision ε at the same speed, up to a bias term in O(σ²/√(µL)). This improves upon the bounds obtained with current state-of-the-art approaches, and our simulations validate the strength of our relatively unconstrained method.

1. INTRODUCTION

With the rise of highly-parallelizable and connected hardware, distributed optimization for machine learning is a topic of significant interest holding many promises. In a typical distributed training framework, the goal is to minimize a sum of functions (f_i)_{i≤n} split across the n nodes of a computer network. A corresponding optimization procedure alternates local computation and communication rounds between the nodes; ideally, spreading the compute load in this way yields a linear speedup in the number of nodes. In the decentralized setting, there is no central machine aggregating the information sent by the workers: nodes are only allowed to communicate with their neighbors in the network. In this setup, optimal methods (Scaman et al., 2017; Kovalev et al., 2021a) have been derived for synchronous first-order algorithms, whose executions are blocked until a subset of (or all) nodes have reached a predefined state: the instructions must be performed in a specific order (e.g., all nodes must perform a local gradient step before the round of communication begins), which is one of the locks limiting their efficiency in practice. This work attempts to simultaneously address multiple limitations of existing decentralized algorithms while guaranteeing fast convergence rates. To tackle the synchronous lock, we rely on the continuized framework (Even et al., 2021a), introduced initially to allow asynchrony in a fixed-topology setting: iterates are labeled with a continuous-time index (as opposed to a global iteration count) and performed locally with no regard to a specific global ordering of events. This is more practical while being theoretically grounded and simplifying the analysis. However, in Even et al. (2021a), gradient and gossip operations are still coupled: each communication along an edge requires the computation of the gradients of the two functions locally stored on the corresponding nodes, and vice versa.
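The continuized event model described above can be made concrete with a short simulation. The sketch below is illustrative and ours, not the authors' implementation: it merges one gradient Poisson clock per node with one gossip Poisson clock per edge (the function name and rate values are arbitrary choices), exploiting the fact that a superposition of independent Poisson point processes is itself a Poisson process.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_event_stream(grad_rates, comm_rates, t_max):
    """Merge independent Poisson point processes: one 'grad' clock per node
    and one 'gossip' clock per edge. The superposition is itself a Poisson
    process, so the next event time is a single exponential draw, and its
    type is picked proportionally to the individual intensities."""
    n = len(grad_rates)                  # grad_rates[i]: intensity of node i's gradient clock
    edges = list(comm_rates)             # comm_rates[(i, j)]: intensity of edge (i, j)'s clock
    rates = np.concatenate([grad_rates, [comm_rates[e] for e in edges]])
    total = rates.sum()
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / total)          # inter-arrival time of the merged process
        if t > t_max:
            return events
        k = rng.choice(len(rates), p=rates / total)
        if k < n:
            events.append((t, "grad", int(k)))             # local gradient step at node k
        else:
            events.append((t, "gossip", edges[k - n]))     # gossip along one edge

events = simulate_event_stream(
    grad_rates=np.ones(3),                    # one unit-rate gradient clock per node
    comm_rates={(0, 1): 2.0, (1, 2): 2.0},    # gossip clocks ticking faster than gradients
    t_max=5.0,
)
```

Because the two families of clocks are independent, gradient and gossip events interleave in an arbitrary order, which is exactly what removes any need for a global synchronization barrier.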
As more communication steps than gradient computations are necessary to reach an ε precision, even in an optimal framework (Kovalev et al., 2021a; Scaman et al., 2017), the coupling directly implies an overload in terms of gradient steps. Another limitation is the restriction to a fixed topology: in a more practical setting, connections between nodes should be allowed to disappear or new ones to appear over time. The procedures of Kovalev et al. (2021c); Li & Lin (2021) are the first to obtain an optimal complexity in gradient steps while being robust to topological changes. Unfortunately, synchrony is mandatory in their frameworks as they rely on either the Error-Feedback mechanism (Stich & Karimireddy, 2020) or the Gradient Tracking one (Nedic et al., 2017). Moreover, they both use an inner loop to control the number of gradient steps at the cost of significantly increasing the number of activated communication edges. To our knowledge, this is the first work to tackle those locks simultaneously. In this paper, we propose a novel algorithm (DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization) based on a combination of formulations similar to those of Kovalev et al. (2021a); Even et al. (2021b); Hendrikx (2022) in the continuized framework of Even et al. (2021a). We study: inf_{x ∈ R^d} ∑_{i=1}^{n} f_i(x), where each f_i : R^d → R is a µ-strongly convex and L-smooth function computed in one of the n nodes of a network. We derive a first-order optimization algorithm that only uses primal gradients and relies on time-varying point-wise Poisson processes (P.P.P.s (Last & Penrose, 2017)) to model the communication and gradient occurrences, leading to accelerated communication and computation rates. Our framework is based on a simple fixed-point iteration and kept minimal: it only involves primal computations with an additional momentum term and works in both the Gradient and Stochastic Gradient Descent (SGD) settings.
Thus, we do not add other cumbersome designs such as the Error-Feedback or Forward-Backward mechanisms used in Kovalev et al. (2021a), which are intrinsically synchronous. While we do not consider the delays bound to appear in practice (we assume instantaneous communications and gradient computations), we show that the ordering of the gradient and gossip steps can vary arbitrarily, removing the coupling lock.
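To illustrate the decoupling, the toy simulation below runs independent per-node gradient clocks and per-edge gossip clocks on a sum of scalar quadratics. It is a sketch of ours, not the DADAO dynamic itself: there is no momentum or acceleration, the step size is constant, and all rates and the toy objective are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy objective: f_i(x) = 0.5 * (x - a_i)^2, so sum_i f_i is minimized at mean(a) = 0.5.
a = np.array([-1.0, 0.5, 2.0])
x = np.zeros(3)                  # one local iterate per node
edges = [(0, 1), (1, 2)]         # path graph; in the time-varying setting this set may change

lr, t, t_max = 0.05, 0.0, 400.0
grad_rate, comm_rate = 1.0, 2.0  # independent Poisson clocks; gossip ticks faster than gradients
total = len(x) * grad_rate + len(edges) * comm_rate

while t < t_max:
    t += rng.exponential(1.0 / total)             # next event of the merged process
    if rng.random() < len(x) * grad_rate / total:
        i = rng.integers(len(x))
        x[i] -= lr * (x[i] - a[i])                # local primal gradient step, no global ordering
    else:
        i, j = edges[rng.integers(len(edges))]
        x[i] = x[j] = 0.5 * (x[i] + x[j])         # pairwise gossip averaging along one edge

# Iterates fluctuate around the global minimizer mean(a) = 0.5
# (the constant step size leaves a small bias/variance).
```

Note that gradient steps and gossip steps never wait for one another: any interleaving produced by the clocks is valid, which is the property the coupling lock forbids.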


Our contributions are as follows: (1) we propose the first primal algorithm with provable guarantees in the context of asynchronous decentralized learning with time-varying connectivity. (2) This algorithm reaches accelerated rates of communication and computation while not requiring ad-hoc mechanisms obtained from an inner loop. (3) Our algorithm also leads to an accelerated rate with SGD after a minor modification. (4) We propose a simple theoretical framework compared to concurrent works, and (5) we demonstrate its optimality numerically. The structure of our paper is as follows: in Sec. 3.1, we describe our working hypotheses and our model of a decentralized environment, while Sec. 3.2 describes our dynamic. Sec. 3.3 states our convergence guarantees, whose proofs are fully detailed in the Appendix. Then, Sec. 4.1 compares our work with its competitors, Sec. 4.2 explains our implementation of this algorithm, and finally, Sec. 4.3 verifies our claims numerically. All our experiments are reproducible using PyTorch (Paszke et al., 2019), and our code can be found in the Appendix.

2. RELATED WORK

Tab. 1 compares our contribution with prior work to highlight the benefits of our method.

