DADAO: DECOUPLED ACCELERATED DECENTRALIZED ASYNCHRONOUS OPTIMIZATION

Abstract

DADAO is a novel decentralized asynchronous stochastic first-order algorithm to minimize a sum of L-smooth and µ-strongly convex functions distributed over a time-varying connectivity network of size n. We model the local gradient updates and gossip communication procedures with separate independent Poisson point processes, decoupling the computation and communication steps in addition to making the whole approach completely asynchronous. Our method employs primal gradients and does not use a multi-consensus inner loop nor other ad-hoc mechanisms such as Error Feedback, Gradient Tracking, or a proximal operator. By relating the inverse of the smallest positive eigenvalue χ₁* and the effective resistance χ₂* of our graph to a necessary minimal communication rate between nodes of the network, we show that our algorithm requires O(n √(L/µ) log(1/ε)) local gradients and only O(n √(χ₁* χ₂*) √(L/µ) log(1/ε)) communications to reach a precision ε. If SGD with uniform noise σ² is used, we reach a precision ε with the same speed, up to a bias term in O(σ²/√(µL)). This improves upon the bounds obtained with current state-of-the-art approaches, our simulations validating the strength of our relatively unconstrained method.

1. INTRODUCTION

With the rise of highly parallelizable and connected hardware, distributed optimization for machine learning is a topic of significant interest holding many promises. In a typical distributed training framework, the goal is to minimize a sum of functions (f_i)_{i≤n} split across the n nodes of a computer network. A corresponding optimization procedure alternates local computation and communication rounds between the nodes, and spreading the computational load ideally yields a linear speedup in the number of nodes. In the decentralized setting, there is no central machine aggregating the information sent by the workers: nodes are only allowed to communicate with their neighbors in the network. In this setup, optimal methods (Scaman et al., 2017; Kovalev et al., 2021a) have been derived for synchronous first-order algorithms, whose execution is blocked until a subset of (or all) nodes have reached a predefined state: the instructions must be performed in a specific order (e.g., all nodes must perform a local gradient step before the round of communication begins), which is one of the locks limiting their efficiency in practice.

This work attempts to simultaneously address multiple limitations of existing decentralized algorithms while guaranteeing fast convergence rates. To tackle the synchronous lock, we rely on the continuized framework (Even et al., 2021a), introduced initially to allow asynchrony in a fixed-topology setting: iterates are labeled with a continuous-time index (as opposed to a global iteration count) and updates are performed locally with no regard to a specific global ordering of events. This is more practical while being theoretically grounded and simplifying the analysis. However, in Even et al. (2021a), gradient and gossip operations are still coupled: each communication along an edge requires the computation of the gradients of the two functions locally stored on the corresponding nodes, and vice versa.

As more communication steps than gradient computations are necessary to reach an ε precision, even in an optimal framework (Kovalev et al., 2021a; Scaman et al., 2017), this coupling directly implies an overload in terms of gradient steps. Another limitation is the restriction to a fixed topology: in a more practical setting, connections between nodes should be allowed to disappear or new ones to appear over time. The procedures of Kovalev et al. (2021c); Li & Lin
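To make the decoupling concrete, the following toy sketch simulates a decentralized run in which local gradient steps and pairwise gossip exchanges are triggered by two independent Poisson point processes: by superposition, the merged event stream has exponential inter-arrival times, and each event is a gradient step or a gossip with probability proportional to the respective intensities. The topology (a ring), the quadratic local functions f_i(x) = (x - c_i)²/2, the step size, and the rates are all hypothetical choices for illustration; this is plain asynchronous gossip/gradient dynamics, not DADAO's accelerated scheme.

```python
import random

random.seed(0)

n = 8
edges = [(i, (i + 1) % n) for i in range(n)]  # hypothetical ring topology
c = [float(i) for i in range(n)]              # f_i(x) = (x - c_i)^2 / 2
x = [0.0] * n                                 # local iterates, one per node
step = 0.05                                   # illustrative gradient step size

grad_rate = float(n)      # total intensity of the gradient P.P.P. (unit rate per node)
gossip_rate = 4.0 * n     # total intensity of the gossip P.P.P. (communications occur
                          # more often than gradient steps, as the rates suggest)

t, horizon = 0.0, 200.0
while t < horizon:
    # Superposition of two independent Poisson processes: the next event arrives
    # after an Exp(grad_rate + gossip_rate) delay, and its type is drawn
    # proportionally to the intensities -- no global ordering is enforced.
    total = grad_rate + gossip_rate
    t += random.expovariate(total)
    if random.random() < grad_rate / total:
        i = random.randrange(n)               # a node wakes up...
        x[i] -= step * (x[i] - c[i])          # ...and takes a local gradient step
    else:
        i, j = random.choice(edges)           # an edge wakes up...
        x[i] = x[j] = 0.5 * (x[i] + x[j])     # ...and performs a pairwise gossip average

mean_c = sum(c) / n  # minimizer of the sum of the f_i
print(max(abs(xi - mean_c) for xi in x))
```

Because the two processes are independent, the gossip intensity can be raised (e.g., on poorly connected graphs) without forcing any additional gradient computations, which is exactly the freedom the coupled setting lacks.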

