DADAO: DECOUPLED ACCELERATED DECENTRALIZED ASYNCHRONOUS OPTIMIZATION

Abstract

DADAO is a novel decentralized asynchronous stochastic first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a time-varying connectivity network of size $n$. We model the local gradient updates and gossip communication procedures with separate independent Poisson point processes, decoupling the computation and communication steps in addition to making the whole approach completely asynchronous. Our method employs primal gradients and does not use a multi-consensus inner loop nor other ad-hoc mechanisms such as Error Feedback, Gradient Tracking, or a proximal operator. By relating the inverse of the smallest positive eigenvalue $\chi_1^*$ and the effective resistance $\chi_2^*$ of our graph to a necessary minimal communication rate between nodes of the network, we show that our algorithm requires $O(n\sqrt{\chi_1^*\chi_2^*}\sqrt{L/\mu}\log(1/\varepsilon))$ communications to reach a precision $\varepsilon$. If SGD with uniform noise $\sigma^2$ is used, we reach a precision $\varepsilon$ at the same speed, up to a bias term in $O(\sigma^2/\sqrt{\mu L})$. This improves upon the bounds obtained with current state-of-the-art approaches, and our simulations validate the strength of our relatively unconstrained method.

1. INTRODUCTION

With the rise of highly-parallelizable and connected hardware, distributed optimization for machine learning is a topic of significant interest holding many promises. In a typical distributed training framework, the goal is to minimize a sum of functions $(f_i)_{i\le n}$ split across the $n$ nodes of a computer network. A corresponding optimization procedure alternates local computation and communication rounds between the nodes, and spreading the compute load across nodes ideally yields a linear speedup in the number of nodes. In the decentralized setting, there is no central machine aggregating the information sent by the workers: nodes are only allowed to communicate with their neighbors in the network. In this setup, optimal methods (Scaman et al., 2017; Kovalev et al., 2021a) have been derived for synchronous first-order algorithms, whose execution is blocked until a subset of (or all) nodes have reached a predefined state: the instructions must be performed in a specific order (e.g., all nodes must perform a local gradient step before the round of communication begins), which is one of the locks limiting their efficiency in practice. This work attempts to simultaneously address multiple limitations of existing decentralized algorithms while guaranteeing fast convergence rates. To tackle the synchronous lock, we rely on the continuized framework (Even et al., 2021a), introduced initially to allow asynchrony in a fixed-topology setting: iterates are labeled with a continuous-time index (as opposed to a global iteration count) and performed locally with no regard to a specific global ordering of events. This is more practical while being theoretically grounded and simplifying the analysis. However, in Even et al. (2021a), gradient and gossip operations are still coupled: each communication along an edge requires the computation of the gradients of the two functions locally stored on the corresponding nodes, and vice-versa.
As more communication steps than gradient computations are necessary to reach an $\varepsilon$ precision, even in an optimal framework (Kovalev et al., 2021a; Scaman et al., 2017), this coupling directly implies an overload in terms of gradient steps. Another limitation is the restriction to a fixed topology: in a more practical setting, connections between nodes should be allowed to disappear, or new ones to appear, over time. The procedures of Kovalev et al. (2021c); Li & Lin (2021) are the first to obtain an optimal complexity in gradient steps while being robust to topological change. Unfortunately, synchrony is mandatory in their frameworks, as they rely either on the Error-Feedback mechanism (Stich & Karimireddy, 2020) or on Gradient Tracking (Nedic et al., 2017). Moreover, they both use an inner loop to control the number of gradient steps, at the cost of significantly increasing the number of activated communication edges. To our knowledge, this is the first work to tackle those locks simultaneously. In this paper, we propose a novel algorithm (DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization) based on a combination of formulations similar to Kovalev et al. (2021a); Even et al. (2021b); Hendrikx (2022), in the continuized framework of Even et al. (2021a). We study:

$$\inf_{x\in\mathbb{R}^d}\ \sum_{i=1}^n f_i(x), \quad (1)$$

where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is a $\mu$-strongly convex and $L$-smooth function computed on one of the $n$ nodes of a network. We derive a first-order optimization algorithm that only uses primal gradients and relies on a time-varying Poisson point process (P.P.P. (Last & Penrose, 2017)) model of the communication and gradient occurrences, leading to accelerated communication and computation rates. Our framework is based on a simple fixed-point iteration and is kept minimal: it only involves primal computations with an additional momentum term, and works in both the Gradient Descent and Stochastic Gradient Descent (SGD) settings.
Thus, we do not add other cumbersome designs such as the Error-Feedback or Forward-Backward mechanisms used in Kovalev et al. (2021a), which are intrinsically synchronous. While we do not consider the delays bound to appear in practice (we assume instantaneous communications and gradient computations), we show that the ordering of the gradient and gossip steps can be variable, removing the coupling lock. Our contributions are as follows: (1) first, we propose the first primal algorithm with provable guarantees in the context of asynchronous decentralized learning with time-varying connectivity. (2) This algorithm reaches accelerated rates of communication and computation while not requiring ad-hoc mechanisms obtained from an inner loop. (3) Our algorithm also leads to an accelerated rate with SGD, with a minor modification. (4) We propose a simple theoretical framework compared to concurrent works, and (5) we demonstrate its optimality numerically. The structure of our paper is as follows: in Sec. 3.1, we describe our working hypotheses and our model of a decentralized environment, while Sec. 3.2 describes our dynamic. Sec. 3.3 states our convergence guarantees, whose proofs are fully detailed in the Appendix. Then, Sec. 4.1 compares our work with its competitors, Sec. 4.2 explains our implementation of this algorithm, and finally, Sec. 4.3 verifies our claims numerically. All our experiments are reproducible, using PyTorch (Paszke et al., 2019), and our code can be found in the Appendix.

2. RELATED WORK

Tab. 1 compares our contribution with other references to highlight the benefits of our method. Continuized and asynchronous algorithms. We rely heavily on the elegant continuized framework (Even et al., 2021a), which allows obtaining simpler proofs and brings the flexibility of asynchronous algorithms. However, in our work, we significantly reduce the necessary amount of gradient steps compared to Even et al. (2021a) while maintaining the same amount of activated edges. Another type of asynchronous algorithm can also be found in Latz (2021), yet it fails to obtain Nesterov's accelerated rates for lack of momentum. We note that Leblond et al. (2018) studies the robustness to delays yet requires a shared memory, and thus applies to a different context than decentralized optimization. Hendrikx (2022) is a promising approach for modeling random communication on graphs, yet fails to obtain acceleration in a neat framework that does not use inner loops. Decentralized algorithms with fixed topology. Scaman et al. (2017) is the first work to derive an accelerated algorithm for decentralized optimization, and it links the convergence speed to the Laplacian eigengap. The corresponding algorithm uses a dual formulation and a Chebyshev acceleration (synchronous and for a fixed topology). Yet, as stated in Tab. 2, it still requires many edges to be activated. Furthermore, under a relatively flexible condition on the intensity of our P.P.P.s, we show that our work improves over bounds that depend on the spectral gap. An emerging line of work following this formulation employs the continuized framework (Even et al., 2020; 2021a;b), but is unfortunately not amenable to incorporating a time-varying topology by essence, as it relies on a coordinate descent scheme in the dual (Nesterov & Stich, 2017). Finally, we note that Even et al.
(2021b) incorporates delays in their model, using the same technique as our work, yet transferring this robustness to another setting remains unclear. Reducing the number of communications has been studied in Mishchenko et al. (2022), but only in the context of a constant topology and without obtaining accelerated rates. Hendrikx et al. (2021) allows for fast communication and gossip rates, yet requires a proximal step and synchrony between nodes to apply a momentum variable. Decentralized algorithms with varying topology. We highlight that Kovalev et al. (2021a); Li & Lin (2021); Koloskova et al. (2020) are some of the first works to propose a framework for decentralized learning in the context of varying topology. However, they rely on inner loops propagating variables multiple times through a network, which imposes complete synchrony and communication overheads. In addition, as noted empirically in Lin et al. (2015), inner loops lead to a plateau effect. Furthermore, we note that Kovalev et al. (2021b); Salim et al. (2021) employ a formulation derived from Salim et al. (2020); Condat et al. (2022), casting decentralized learning as a monotone inclusion and obtaining a linear rate thanks to a preconditioning step of a Forward-Backward-like algorithm. However, being sequential by nature, these types of algorithms are not amenable to a continuized framework. Error feedback/Gradient tracking. A major lock for asynchrony is the use of Gradient Tracking (Koloskova et al., 2021; Nedic et al., 2017; Li & Lin, 2021) or Error Feedback (Stich & Karimireddy, 2020; Kovalev et al., 2021b). Indeed, gradient operations are locally tracked by a running-mean variable which is updated at each gradient update. This is incompatible with an asynchronous framework, as it requires communication between nodes. Furthermore, a multi-consensus inner loop seems mandatory to obtain accelerated rates, which is again not desirable.
Decoupling procedures. Decoupling subsequent steps of an optimization procedure traditionally leads to speed-ups (Hendrikx et al., 2021; Hendrikx, 2022; Belilovsky et al., 2020; 2021). This contrasts with methods which couple gradient and gossip updates so that they happen in a predefined order, i.e., simultaneously (Even et al., 2021a) or sequentially (Kovalev et al., 2021a; Koloskova et al., 2020). In decoupled optimization procedures, inner loops are not desirable because they require an external procedure that can be potentially slow and need a block-barrier instruction during the algorithm's execution (e.g., Hendrikx et al. (2021; 2019)).

3.1. GOSSIP FRAMEWORK

We consider the problem defined by Eq. 1 in a distributed environment of $n$ nodes whose dynamic is indexed by a continuous time index $t \in \mathbb{R}^+$. Each node has a local memory and can compute a local gradient $\nabla f_i$, as well as elementary operations, in an instantaneous manner. As said above, having no delay is less realistic, yet adding delays also leads to significantly more difficult proofs whose adaptation to our framework remains largely unclear. Next, we will assume that our computations and gossips result from independent, in-homogeneous, piecewise-constant P.P.P.s with no delay. For the sake of simplicity, we assume that all nodes can compute a gradient at the same rate: the gradient events at node $i$ follow a P.P.P. $N_i(t)$ of intensity 1, and we write $N(t) \triangleq \sum_{i=1}^n N_i(t)$; communications along an edge $(i,j) \in E(t)$ follow a P.P.P. $M_{ij}(t)$ of intensity $\lambda_{ij}(t) \ge 0$. Given our notations, we note that if $(i, j) \notin E(t)$, then the connection between $i$ and $j$ can be thought of as a P.P.P. with intensity 0. We now introduce the instantaneous expected gossip matrix of our graph:

$$\Lambda(t) \triangleq \sum_{(i,j)\in E(t)} \lambda_{ij}(t)(e_i - e_j)(e_i - e_j)^T.$$

We also write $\mathbf{\Lambda}(t) \triangleq \sum_{(i,j)\in E(t)} \lambda_{ij}(t)(\mathbf{e}_i - \mathbf{e}_j)(\mathbf{e}_i - \mathbf{e}_j)^T$ for its tensorized counterpart, which will be useful for our proofs and for defining our Lyapunov potential, and $\Lambda^+(t)$ for its pseudo-inverse. Following Scaman et al. (2017), we will further compare this quantity to the centralized gossip matrix: $\pi \triangleq I - \frac{1}{n}\mathbf{1}\mathbf{1}^T = \frac{1}{2n}\sum_{i,j}(e_i - e_j)(e_i - e_j)^T$. We introduce the instantaneous connectivity of $\Lambda(t)$, as in Kovalev et al. (2021a):

$$\frac{1}{\chi_1(t)} \triangleq \inf_{x \perp \mathbf{1},\ \|x\|=1} x^T \Lambda(t)\, x.$$

We might also write $\chi_1[\Lambda(t)]$ to avoid confusion, depending on the context. Next, we introduce the maximal effective resistance of the network, as in Even et al. (2021a); Ellens et al. (2011):

$$\chi_2(t) \triangleq \frac{1}{2}\sup_{(i,j)\in E(t)} (e_i - e_j)^T \Lambda^+(t)(e_i - e_j).$$

We remind the following Lemma (proof in Appendix D.2), which will be useful to control $\chi_1(t)$, $\chi_2(t)$ and to compare our bounds with the ones making use of the spectral gap of a graph: Lemma 3.1 (Bound on the connectivity constants). The spectrum of $\Lambda(t)$ is non-negative.
Furthermore, we have $\chi_1(t) = +\infty$ iff $E(t)$ is not a connected graph. Also, if the graph is connected, then $\frac{n-1}{\operatorname{Tr}\Lambda(t)} \le \min(\chi_1(t), \chi_2(t))$. Furthermore, we also have $2\chi_2(t)\inf_{(i,j)\in E(t)}\lambda_{ij}(t) \le 1$. The last part of this Lemma allows bounding $\chi_2(t)$ when no degenerate behavior of the graph's connectivity happens: we assume the networks do not contain arbitrarily slow communication edges. The following assumption is necessary to avoid oscillatory effects due to the variations of $\Lambda(t)$: Assumption 3.2 (Slowly varying graphs). Assume that $\Lambda(t)$ is piecewise constant on time intervals. In particular, this implies that each $\lambda_{ij}(t)$ is piecewise constant. Next, following Kovalev et al. (2021a), we bound the connectivity uniformly to avoid degenerate effects: Assumption 3.3 (Strongly connected topology). Assume that there is $\chi_1^* > 0$ such that $\chi_1(t) \le \chi_1^*$. We might write this quantity $\chi_1^*[\Lambda]$ to stress the dependency on $\Lambda(t)$. From the above, it is clear that $\chi_2(t) \le \chi_1(t)$, so that under Assumption 3.3, $\chi_2(t)$ is upper bounded by some $\chi_2^*$ with $0 < \chi_2^* \le \chi_1^*$.
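To make these definitions concrete, the following sketch (our own illustration; the function name and the cycle example are not from the paper) builds $\Lambda(t)$ from the edge intensities $\lambda_{ij}(t)$ and evaluates $\chi_1(t)$ and $\chi_2(t)$:

```python
import numpy as np

def gossip_quantities(edges, lambdas, n):
    """Build the instantaneous expected gossip matrix
    Lambda = sum_{(i,j) in E} lambda_ij (e_i - e_j)(e_i - e_j)^T
    and return (chi_1, chi_2) as defined in Sec. 3.1."""
    Lam = np.zeros((n, n))
    for (i, j), lam in zip(edges, lambdas):
        e = np.zeros(n)
        e[i], e[j] = 1.0, -1.0
        Lam += lam * np.outer(e, e)
    # chi_1: inverse of the smallest eigenvalue on the hyperplane 1^perp
    chi1 = 1.0 / np.sort(np.linalg.eigvalsh(Lam))[1]
    # chi_2: half the maximal effective resistance over the active edges
    Lp = np.linalg.pinv(Lam)
    chi2 = 0.5 * max(Lp[i, i] + Lp[j, j] - 2 * Lp[i, j] for (i, j) in edges)
    return chi1, chi2

# Cycle on 4 nodes with unit rates
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
chi1, chi2 = gossip_quantities(edges, [1.0] * 4, 4)
```

On the 4-cycle with unit rates, $\chi_1 = 1/2$ and $\chi_2 = 3/8$, and the bound $\frac{n-1}{\operatorname{Tr}\Lambda(t)} \le \min(\chi_1, \chi_2)$ of Lemma 3.1 holds with equality.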

3.2. DYNAMIC TO OPTIMUM

Next, we follow a standard approach (Kovalev et al., 2021c;a; Salim et al., 2022; Hendrikx, 2022) for solving Eq. 1 (see Appendix B for details), leading, for $0 < \nu < \mu$, to:

$$\inf_{x\in\mathbb{R}^d}\sum_{i=1}^n f_i(x) = \inf_{x\in\mathbb{R}^{n\times d}}\ \sup_{y,z\in\mathbb{R}^{n\times d}}\ \sum_{i=1}^n f_i(x_i) - \frac{\nu}{2}\|x\|^2 - \langle x, y\rangle - \frac{1}{2\nu}\|\pi z + y\|^2.$$

For $F(x) \triangleq \sum_{i=1}^n f_i(x_i) - \frac{\nu}{2}\|x\|^2$, the saddle points $(x^*, y^*, z^*)$ of this Lagrangian are given by:

$$\begin{cases} \nabla F(x^*) - y^* = 0 \\ \frac{y^* + \pi z^*}{\nu} + x^* = 0 \\ \pi z^* + \pi y^* = 0 \end{cases} \quad (2)$$

Contrary to Kovalev et al. (2021a), we do not employ a Forward-Backward algorithm, which requires both an extra inversion step and additional regularity of the considered proximal operator. Not only does this condition not hold in this particular case, but it is also not desirable in a continuized framework, where iterates are not ordered in a predefined sequence and require a local descent at each instant. Another major difference is that our approach requires no Error Feedback, which unlocks asynchrony while simplifying the proofs and decreasing the required number of communications. Instead, we show it is enough to incorporate a standard fixed-point algorithm, without any specific preconditioning (see Condat et al. (2019)). We consider the following dynamic:

$$\begin{cases}
dx_t = \eta(\tilde{x}_t - x_t)\,dt - \gamma(\nabla F(x_t) - \tilde{y}_t)\,dN(t) \\
d\tilde{x}_t = \tilde{\eta}(x_t - \tilde{x}_t)\,dt - \tilde{\gamma}(\nabla F(x_t) - \tilde{y}_t)\,dN(t) \\
d\tilde{y}_t = -\theta(y_t + z_t + \nu \tilde{x}_t)\,dt + (\delta + \tilde{\delta})(\nabla F(x_t) - \tilde{y}_t)\,dN(t) \\
dy_t = \alpha(\tilde{y}_t - y_t)\,dt \\
dz_t = \tilde{\alpha}(\tilde{z}_t - z_t)\,dt - \beta \sum_{(i,j)\in E(t)} (e_i - e_j)(e_i - e_j)^T (y_t + z_t)\,dM_{ij}(t) \\
d\tilde{z}_t = \tilde{\alpha}(z_t - \tilde{z}_t)\,dt - \tilde{\beta} \sum_{(i,j)\in E(t)} (e_i - e_j)(e_i - e_j)^T (y_t + z_t)\,dM_{ij}(t)
\end{cases} \quad (3)$$

where $\nu, \eta, \tilde{\eta}, \gamma, \tilde{\gamma}, \alpha, \tilde{\alpha}, \theta, \delta, \tilde{\delta}, \beta, \tilde{\beta}$ are parameters yet to be determined. As in Nesterov (2003), variables are paired to obtain a Nesterov acceleration. The variables $(x, y)$ allow decoupling the gossip steps from the gradient steps using independent P.P.P.s.
Furthermore, the Lebesgue-integrable path of $\tilde{y}_t$ does not correspond to a standard momentum, as in a continuized framework (Even et al., 2021a); however, it turns out to be a crucial component of our method. Compared to Kovalev et al. (2021a), no extra multi-consensus step needs to be integrated. Our formulation of an asynchronous gossip step is similar to Even et al. (2021a), which introduces a stochastic variable on edges; however, contrary to this work, our gossip and gradient computations are decoupled. In fact, we can also consider SGD (Bottou, 2010), by replacing $\nabla F(x)$ with an estimate $\nabla F(x, \xi)$, for $\xi \in \Xi$, some measurable space. We will need the following assumption on the bias and variance of the gradient: Assumption 3.4 (Unbiased gradient with uniform additive noise). We assume that $\mathbb{E}_\xi[\nabla F(x,\xi)] = \nabla F(x)$, and that its quadratic error is uniformly bounded by some $\sigma > 0$: $\mathbb{E}_\xi\|\nabla F(x,\xi) - \nabla F(x)\|^2 \le \sigma^2$. Next, for SGD use, we modify the first three lines of Eq. 3, replacing the stochastic terms $(\nabla F(x_t) - \tilde{y}_t)\,dN(t)$ by $\int_\Xi (\nabla F(x_t, \xi) - \tilde{y}_t)\,dN(t, \xi)$; see Appendix C for the complete dynamic. Simulating those SDEs (Arnold, 1974) can be done efficiently, as explained in Sec. 4.2 and Appendix H.
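Since $N(t)$ and the $M_{ij}(t)$ are independent P.P.P.s, their event times can be drawn ahead of time as cumulative sums of exponential increments. A minimal sketch of this event generation (our own illustration, with hypothetical rates):

```python
import numpy as np

def simulate_events(t_max, n, edges, lambdas, grad_rate=1.0, seed=0):
    """Draw the event times of the decoupled P.P.P.s driving Eq. 3:
    one gradient process N_i per node (intensity grad_rate) and one
    gossip process M_ij per edge (intensity lambda_ij), all mutually
    independent.  Returns a time-sorted list of (time, kind, who)."""
    rng = np.random.default_rng(seed)
    events = []
    for i in range(n):  # gradient spikes: i.i.d. exponential increments
        t = rng.exponential(1.0 / grad_rate)
        while t < t_max:
            events.append((t, "grad", i))
            t += rng.exponential(1.0 / grad_rate)
    for (i, j), lam in zip(edges, lambdas):  # gossip spikes, per edge
        t = rng.exponential(1.0 / lam)
        while t < t_max:
            events.append((t, "gossip", (i, j)))
            t += rng.exponential(1.0 / lam)
    return sorted(events, key=lambda e: e[0])
```

The expected number of gradient events over $[0, t]$ is $n \cdot t \cdot$ grad_rate, matching the count $nt$ of Thm. 3.2 when grad_rate = 1, while each edge fires $\lambda_{ij} t$ times in expectation.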

3.3. THEORETICAL GUARANTEES

We follow the approach introduced in Even et al. (2021a) to study the convergence of the dynamic Eq. 3. To do so, we introduce $X \triangleq (x, \tilde{x}, \tilde{y})$, $Y \triangleq (y, z, \tilde{z})$ and the following Lyapunov potential:

$$\Phi(t, X, Y) \triangleq A_t\, d_F(x, x^*) + \tilde{A}_t \|\tilde{x} - x^*\|^2 + B_t\|y - y^*\|^2 + \tilde{B}_t\|\tilde{y} - y^*\|^2 + C_t\|z + y - z^* - y^*\|^2 + \tilde{C}_t\|\tilde{z} - z^*\|^2_{\mathbf{\Lambda}(t)^+},$$

where $A_t, \tilde{A}_t, B_t, \tilde{B}_t, C_t, \tilde{C}_t$ are non-negative functions to be defined. We will use this potential to control the trajectories of $X_t \triangleq (x_t, \tilde{x}_t, \tilde{y}_t)$, $Y_t \triangleq (y_t, z_t, \tilde{z}_t)$, leading to the equivalent dynamic:

$$dX_t = a_1(X_t, Y_t)\,dt + b_1(X_t)\,dN(t), \qquad dY_t = a_2(X_t, Y_t)\,dt + \sum_{(i,j)\in E(t)} b_2^{ij}(Y_t)\,dM_{ij}(t),$$

where $a_1$, $a_2$, $b_1 = (b_1^i)_i$, $(b_2^{ij})_{ij}$ are smooth functions. We prove the following in Appendix D.3.

Theorem 3.2 (Gradient Descent). Assume each $f_i$ is $\mu$-strongly convex and $L$-smooth. For any $\Lambda(t)$, assume 3.1-3.3, and that $\chi_1^*[\Lambda]\chi_2^*[\Lambda] \le \frac{1}{2}$. Then there exist parameters for the dynamic Eq. 3 (given in App. H.2) such that, for any initialization $x_0 \in \mathbb{R}^d$, and $\tilde{x}_0 = x_0$, $y_0 = \tilde{y}_0 = \nabla f(x_0)$, $z_0 = \tilde{z}_0 = -\pi\nabla f(x_0)$, we get for $t \in \mathbb{R}^+$ and $f(x) = \sum_{i=1}^n f_i(x_i)$:

$$\mathbb{E}[f(x_t)] - f(x^*) \le \Big(2(f(x_0) - f(x^*)) + \frac{\mu}{8}\|x_0 - x^*\|^2\Big)\, e^{-\frac{t}{8\sqrt{2}}\sqrt{\frac{\mu}{L}}}.$$

Also, the expected number of gradients is $nt$ and the expected number of edges activated is:

$$\mathbb{E}\Big[\int_0^t \sum_{(i,j)\in E(u)} \lambda_{ij}(u)\,du\Big] = \frac{1}{2}\int_0^t \mathbb{E}[\operatorname{Tr}\Lambda(u)]\,du.$$

We can obtain the following corollary with a minor modification of our current proof:

Corollary 3.2.1 (SGD with additive noise). Under the same assumptions as Thm. 3.2, as well as 3.4, for the SGD dynamic Eq. 8, the same parameters as in Thm. 3.2 allow obtaining, for $t \in \mathbb{R}^+$:

$$\mathbb{E}[f(x_t)] - f(x^*) \le \Big(2(f(x_0) - f(x^*)) + \frac{\mu}{8}\|x_0 - x^*\|^2\Big)\, e^{-\frac{t}{8\sqrt{2}}\sqrt{\frac{\mu}{L}}} + \frac{5\sigma^2}{\sqrt{\mu L}}.$$

Following Even et al. (2021a), $L$ allows adjusting the bias-variance trade-off of our descent.
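As a sanity check, the right-hand side of Thm. 3.2 and Cor. 3.2.1 can be evaluated directly; the helper below (an illustration of ours, not from the paper) makes the exponential decay and the additive SGD bias term explicit:

```python
import math

def dadao_bound(t, f0_gap, x0_dist2, mu, L, sigma2=0.0):
    """Right-hand side of Thm. 3.2 (sigma2 = 0) and Cor. 3.2.1:
    (2 (f(x0) - f*) + mu/8 ||x0 - x*||^2) exp(-t/(8 sqrt(2)) sqrt(mu/L))
    plus the SGD bias term 5 sigma^2 / sqrt(mu L)."""
    decay = math.exp(-t / (8.0 * math.sqrt(2.0)) * math.sqrt(mu / L))
    return (2.0 * f0_gap + mu / 8.0 * x0_dist2) * decay \
        + 5.0 * sigma2 / math.sqrt(mu * L)
```

The bias term does not vanish as $t$ grows: it is the price of the uniform noise $\sigma^2$, and it shrinks as $L$ increases, which is the bias-variance trade-off mentioned after Cor. 3.2.1.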

4.1. EXPECTED COMPUTATIONAL COMPLEXITY

For a given graph $E(t)$, multiple choices of $\Lambda(t)$ are possible and would still lead to accelerated and linear rates, as long as the condition $2\chi_1^*[\Lambda]\chi_2^*[\Lambda] \le 1$ is verified. Thus, we discuss how to choose our instantaneous expected gossip matrix to compare to concurrent work. To get a precision $\varepsilon$, $T = O(\sqrt{L/\mu}\log(1/\varepsilon))$ local gradient computations are required per machine. More details can be found in Appendix F on our methodology for comparing with other methods, and in particular on the way to recover the orders of magnitude we mention. In the following, each algorithm to which we compare ourselves is parameterized by a Laplacian matrix with various properties. Systematically, for an update frequency $f$ and a family of Laplacians $\{L_q\}_q$ (which can potentially be reduced to a single element) given by concurrent work, we will set:

$$\Lambda(t) = \sqrt{2\chi_1^*[L]\chi_2^*[L]}\ L_{\lfloor tf \rfloor} \triangleq \lambda^* L_{\lfloor tf \rfloor}, \quad (5)$$

where $\lambda^*$ can be understood as a lower bound on the instantaneous expected communication rate. In this case, $\Lambda(t)$ satisfies the conditions of Theorem 3.2 or Corollary 3.2.1. From a physical point of view, this allows us to relate the spatial quantities of our graphs to a necessary minimal communication rate between nodes of the network; see Appendix E for a discussion on this topic.

Comparison with ADOM+. In ADOM+ (Kovalev et al., 2021a), one picks $\chi_1^*[L] \ge 1$ and $f = \chi_1^*[L]$. Then, the number of gossip steps of our algorithm is at most:

$$\sqrt{\chi_1^*[L]\chi_2^*[L]}\ \sup_q \operatorname{Tr}(L_q)\, \sqrt{\tfrac{L}{\mu}}\log(\tfrac{1}{\varepsilon}) = O\Big(\sqrt{\chi_1^*[L]\chi_2^*[L]}\ n\, \sqrt{\tfrac{L}{\mu}}\log(\tfrac{1}{\varepsilon})\Big).$$

Then, the expected communication cost of ADOM+ is potentially substantially higher than ours (see Appendix F.1):

$$\sum_{t=1}^{T} \chi_1^*[L]\,|E(t)| \ge O\Big(\sqrt{\chi_1^*[L]\chi_2^*[L]}\ n\, \sqrt{\tfrac{L}{\mu}}\log(\tfrac{1}{\varepsilon})\Big).$$

Comparison with the continuized framework. In Even et al. (2021a), gradient and gossip steps are coupled, so that $O(\sqrt{\chi_1^*[L]\chi_2^*[L]}\operatorname{Tr} L\, \sqrt{L/\mu}\log(1/\varepsilon))$ gradient and gossip iterations are needed. The number of gossip iterations is the same as ours, yet, thanks to Lemma 3.1, the number of gradient iterations can be substantially higher without any additional assumptions, as $n - 1 \le 2\sqrt{\chi_1^*[L]\chi_2^*[L]}\operatorname{Tr} L$ (see Appendix F.2).

Comparison with methods that depend on the spectral gap. For instance, MSDA relies on a Chebyshev acceleration of the gossip steps (possible because Scaman et al. (2017) uses a fixed gossip matrix), which leads to a number of activated edges of about $\sqrt{\rho^*}\,|E|\,\sqrt{L/\mu}\log(1/\varepsilon)$, where $\rho^*$ is the spectral gap. For our algorithm, the number of gossip steps writes (with $f = 0$):

$$\sqrt{\chi_1^*[L]\chi_2^*[L]}\operatorname{Tr} L\, \sqrt{\tfrac{L}{\mu}}\log(\tfrac{1}{\varepsilon}) \le O\Big(\sqrt{\rho^*}\,|E|\,\sqrt{\tfrac{L}{\mu}}\log(\tfrac{1}{\varepsilon})\Big),$$

where the details of this bound can be found in Appendix F.3 and rely solely on an assumption on the ratio between the maximal and minimal weights in the Laplacian. We highlight that Scaman et al. (2017); Kovalev et al. (2021a) claimed that their respective algorithms are optimal because they study the number of computations and synchronized gossips on a worst-case graph; our claim is, by nature, different, as we are interested in the number of edges fired rather than the number of synchronized gossip rounds. Tab. 2 predicts the behavior of our algorithm for various classes of graphs encoded via the Laplacian of a stochastic matrix. It shows that, systematically, our algorithm leads to the best speed. We note that the graph classes depicted in Tab. 2 were used as worst-case examples in Scaman et al. (2017); Kovalev et al. (2021a). The next section implements and validates our ideas.

Table 2: Complexity for various graphs using a stochastic matrix. We have, respectively for a star / line or cyclic / complete graph / $d$-dimensional grid: $\chi_1^* = O(1)$, $\rho^* = O(n)$ / $\chi_1^* = O(n^2)$, $\rho^* = O(n^2)$, $\chi_2^* = O(1)$ / $\chi_1^* = O(1)$, $\rho^* = O(1)$ / $\chi_1^* = O(n^{2/d})$, $\rho^* = O(n^{2/d})$, $\chi_2^* = O(1)$.
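Under our reading of Eq. 5, rescaling any fixed Laplacian $L$ by $\lambda^* = \sqrt{2\chi_1^*[L]\chi_2^*[L]}$ makes the condition of Thm. 3.2 hold with equality, since $\chi_1$ and $\chi_2$ both scale as $1/\lambda^*$. A quick numerical check on a cycle graph (construction ours):

```python
import numpy as np

def chi(L):
    """chi_1 (inverse smallest positive eigenvalue) and chi_2 (half the
    maximal effective resistance over edges) of a Laplacian matrix L."""
    n = L.shape[0]
    chi1 = 1.0 / np.sort(np.linalg.eigvalsh(L))[1]
    Lp = np.linalg.pinv(L)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if L[i, j] != 0]
    chi2 = 0.5 * max(Lp[i, i] + Lp[j, j] - 2 * Lp[i, j] for (i, j) in edges)
    return chi1, chi2

# Cycle graph on n = 10 nodes with unit edge rates
n = 10
L = 2.0 * np.eye(n)
for i in range(n):
    L[i, (i + 1) % n] = L[i, (i - 1) % n] = -1.0

c1, c2 = chi(L)
lam_star = np.sqrt(2.0 * c1 * c2)   # rescaling factor (our reading of Eq. 5)
c1s, c2s = chi(lam_star * L)        # now chi_1 * chi_2 = 1/2 exactly
```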

4.2. ALGORITHM

We now describe the algorithm used to implement the dynamic of Eq. 3 and, in particular, our simulator of P.P.P.s. Let us write $T_1^{(i)} < T_2^{(i)} < \dots < T_k^{(i)} < \dots$ for the times of the events on the $i$-th node, each of which is either an edge activation or a gradient update. We remind that the spiking times of a specific event correspond to random variables with independent exponential increments and can thus be generated at the beginning of our simulation. They can also be generated on the fly and locally, to stress the locality and asynchronicity of our algorithm. Let us write $X_t = (X_t^{(i)})_i$ and $Y_t = (Y_t^{(i)})_i$; then, on the $i$-th node and at the $k$-th iteration, we integrate the linear Ordinary Differential Equation (ODE) on $[T_k^{(i)}; T_{k+1}^{(i)}]$ given by

$$dX_t = a_1(X_t, Y_t)\,dt, \qquad dY_t = a_2(X_t, Y_t)\,dt, \quad (6)$$

to define the values right before the spike; for $A$ the corresponding constant matrix, we thus have:

$$\begin{pmatrix} X^{(i)}_{T^{(i)-}_{k+1}} \\ Y^{(i)}_{T^{(i)-}_{k+1}} \end{pmatrix} = \exp\big((T^{(i)}_{k+1} - T^{(i)}_k)A\big) \begin{pmatrix} X^{(i)}_{T^{(i)}_k} \\ Y^{(i)}_{T^{(i)}_k} \end{pmatrix}.$$

Under review as a conference paper at ICLR 2023

Next, if the event is a gradient update, then: $X^{(i)}_{T^{(i)}_{k+1}} = X^{(i)}_{T^{(i)-}_{k+1}} + b_1\big(X^{(i)}_{T^{(i)-}_{k+1}}\big)$. Otherwise, if the edge $(i,j)$ or $(j,i)$ is activated, a communication bridge is created between nodes $i$ and $j$. In this case, the local update on $i$ writes: $Y^{(i)}_{T^{(i)}_{k+1}} = Y^{(i)}_{T^{(i)-}_{k+1}} + b_2\big(Y^{(i)}_{T^{(i)-}_{k+1}}, Y^{(j)}_{T^{(i)-}_{k+1}}\big)$. Note that, even if this event takes place along an edge $(i,j)$, we can write it separately for nodes $i$ and $j$ by making sure they both have the events $T^{(i)}_{k_i} = T^{(j)}_{k_j}$, for some $k_i, k_j \in \mathbb{N}$, corresponding to this communication. As advocated, all those operations are local, and we summarize in Alg. 1 the algorithmic block which corresponds to our implementation. See Appendix H for more details. Algorithm 1: This algorithm block describes our implementation on each local machine. The ODE routine is described by Eq. 6 and Ping is an instantaneous routine.
Input: on each machine $i \in \{1, \dots, n\}$: gradient oracle $\nabla f_i$, parameters $\mu$, $L$, $\chi_1^*$, $t_{\max}$.
Initialize on each machine $i \in \{1, \dots, n\}$: set $X^{(i)}, Y^{(i)}, T^{(i)}$ to 0 and set $A$ via Eq. 105; synchronize the clocks of all machines.
In parallel on workers $i \in \{1, \dots, n\}$, while $t < t_{\max}$, continuously do:
    $t \leftarrow$ clock();
    Ping surrounding machines and adjust $\lambda_{ij}(t)$;
    if there is an event at time $t$ then
        $(X^{(i)}, Y^{(i)}) \leftarrow$ ODE$(A, t - T^{(i)}, (X^{(i)}, Y^{(i)}))$;
        if the event is to take a gradient step then
            $X^{(i)} \leftarrow X^{(i)} + b_1(X^{(i)})$;
        else if the event is to communicate with $j$ then
            $Y^{(i)} \leftarrow Y^{(i)} + b_2(Y^{(i)}, Y^{(j)})$;  // happens at $j$ simultaneously
        $T^{(i)} \leftarrow t$;
return $(x^{(i)}_{t_{\max}})_{1\le i\le n}$, the estimate of $x^*$ on each worker $i$.
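The local event loop of Alg. 1 can be sketched as follows (a toy illustration: the jump maps b1 and b2 are placeholders, not the true maps of Eq. 3, and A stands for the constant drift matrix of Eq. 6); between two spikes, the linear ODE is integrated exactly with a matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

def run_local(A, b1, b2, events, state):
    """Toy version of the local loop of Alg. 1: between two spikes, the
    linear ODE d(state)/dt = A @ state is integrated exactly with a
    matrix exponential (Eq. 6); at a spike, the corresponding jump map
    is applied (b1 for a gradient event, b2 for a communication)."""
    t_prev = 0.0
    for t, kind in events:                      # time-sorted (time, kind)
        state = expm((t - t_prev) * A) @ state  # exact continuous part
        state = b1(state) if kind == "grad" else b2(state)
        t_prev = t
    return state
```

With $A = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$, the flow $e^{tA}$ contracts the pair of coordinates toward their average, mimicking the mixing part of the dynamic.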

4.3. NUMERICAL RESULTS

In this section, we study the behavior of our method and compare it to several settings inspired by Kovalev et al. (2021a); Even et al. (2021a). In our experiments, we perform empirical risk minimization for both the decentralized logistic and linear regression tasks, given either by:

$$f_i(x) = \frac{1}{m}\sum_{j=1}^m \log\big(1 + \exp(-b_{ij}\, a_{ij}^\top x)\big) + \frac{\mu}{2}\|x\|^2 \quad \text{or} \quad f_i(x) = \frac{1}{m}\sum_{j=1}^m \|a_{ij}^\top x - c_{ij}\|^2,$$

where $a_{ij} \in \mathbb{R}^d$, $b_{ij} \in \{-1, 1\}$ and $c_{ij} \in \mathbb{R}$ correspond to the $m$ local data points stored at node $i$. For both the varying and fixed topology settings, we follow a protocol similar to Kovalev et al. (2021a): we generate $n$ independent synthetic datasets with the make_classification and make_regression functions of scikit-learn (Pedregosa et al., 2011), each worker storing $m = 100$ data points. We recall that the metrics of interest are the total number of local gradient steps and the total number of individual messages exchanged (i.e., the number of edges that fired) to reach an $\varepsilon$-precision. We systematically used the hyper-parameters proposed in each reference paper for our implementation, without any specific fine-tuning. Comparison in the time-varying setting. We compare our method to ADOM+ (Kovalev et al., 2021a) on a sequence of 50 random geometric graphs of size $n = 20$ in Fig. 1. To construct the graphs, we sample $n$ points uniformly in $[0,1]^2 \subset \mathbb{R}^2$ and connect each of them to all points at a distance less than some user-specified radius, which allows controlling the constant $\chi_1^*$ (we consider values in $\{3, 33, 180, 233\}$). We ensure the connectedness of the graphs by adding a minimal number of edges, exactly as done in Kovalev et al. (2021b). We then use the instantaneous gossip matrix introduced in Eq. 5 with $f = \chi_1^*$. We compare ourselves to both versions of ADOM+: with and without Multi-Consensus (M.-C.). Thanks to its M.-C. procedure, ADOM+ can significantly reduce the number of necessary gradient steps. However, consistently with our analysis in Sec.
4.1, our method is systematically better in all settings in terms of communications. Comparison with accelerated methods in the fixed topology setting. Now, we fix the Laplacian matrix via Eq. 5 to compare simultaneously to the continuized framework (Even et al., 2021a) and MSDA (Scaman et al., 2017). We report in Fig. 2 results corresponding to the complete graph with $n = 250$ nodes and the line graph of size $n = 150$. While sharing the same asymptotic rate, we note that the continuized framework (Even et al., 2021a) and MSDA (Scaman et al., 2017) have better absolute constants than DADAO, giving them an advantage both in terms of the number of communications and gradient steps. However, in the continuized framework, the gradient and communication steps being coupled, the number of gradient computations can potentially be orders of magnitude worse than for our algorithm, which is reflected in Fig. 2 for the line graph. As for MSDA and ADOM+, Tab. 2 showed they do not have the best communication rates on certain classes of graphs, as confirmed on the right of Fig. 2 for MSDA and by the communication plots for ADOM+. In conclusion, while several methods can share similar convergence rates, ours is the only one to perform at least as well as its competitors in every setting, for different graph topologies and two distinct tasks, as predicted by Tab. 1.

[Figures 1 and 2: average distance to the optimum $\frac{1}{n}\sum_{i=1}^n \|x_i - x^*\|^2$ against the total number of gradient steps and the total number of communications, for ADOM+ (with and without M.-C.), Continuized, MSDA, and DADAO.]
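For concreteness, a minimal sketch of one node's logistic objective under the stated protocol (the dimension $d$ and the value of $\mu$ below are illustrative choices of ours, not taken from the paper):

```python
import numpy as np
from sklearn.datasets import make_classification

def local_logistic_loss(x, a, b, mu):
    """f_i of the logistic case: mean logistic loss over the m local
    samples (a, b) plus the l2 term (mu/2) ||x||^2."""
    return np.mean(np.log1p(np.exp(-b * (a @ x)))) + 0.5 * mu * (x @ x)

# One node's synthetic dataset, following the protocol of Sec. 4.3
# (d and mu are illustrative values)
m, d, mu = 100, 10, 1e-2
a, y = make_classification(n_samples=m, n_features=d, random_state=0)
b = 2.0 * y - 1.0                # labels mapped to {-1, 1}
loss0 = local_logistic_loss(np.zeros(d), a, b, mu)   # equals log(2) at x = 0
```

Each of the $n$ workers holds such a dataset generated independently, and the global objective of Eq. 1 is the sum of the $n$ local losses.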

A NOTATIONS

For a positive semi-definite matrix $A$, $\|x\|_A^2 \triangleq x^T A x$; $f = O(g)$ means there is a constant $C > 0$ such that $|f| \le C|g|$; $\{e_i\}_{i\le d}$ is the canonical basis of $\mathbb{R}^d$, $d \in \mathbb{N}$; $\mathbf{1}$ is the vector of ones; $I$ the identity; $A^+$ is the pseudo-inverse of $A$; and, for a smooth convex function $F$, $d_F(x, y) \triangleq F(x) - F(y) - \langle\nabla F(y), x - y\rangle$ is its Bregman divergence. We further write $\mathbf{e}_i \triangleq e_i \otimes I$.

B SADDLE POINT REFORMULATION

With $0 < \nu < \mu$ and introducing an extra variable $\tilde{x}$, we get:

$$\begin{aligned}
\inf_{x\in\mathbb{R}^d}\sum_{i=1}^n f_i(x) &= \inf_{\substack{x,\tilde{x}\in\mathbb{R}^{n\times d} \\ x = \tilde{x},\ \pi\tilde{x} = 0}}\ \sum_{i=1}^n f_i(x_i) - \frac{\nu}{2}\|x\|^2 + \frac{\nu}{2}\|\tilde{x}\|^2 \\
&= \inf_{x,\tilde{x}\in\mathbb{R}^{n\times d}}\ \sup_{y,z\in\mathbb{R}^{n\times d}}\ \sum_{i=1}^n f_i(x_i) - \frac{\nu}{2}\|x\|^2 + \frac{\nu}{2}\|\tilde{x}\|^2 + \langle y, \tilde{x} - x\rangle + \langle z, \pi\tilde{x}\rangle \\
&= \inf_{x\in\mathbb{R}^{n\times d}}\ \sup_{y,z\in\mathbb{R}^{n\times d}}\ \inf_{\tilde{x}\in\mathbb{R}^{n\times d}}\ \sum_{i=1}^n f_i(x_i) - \frac{\nu}{2}\|x\|^2 + \frac{\nu}{2}\|\tilde{x}\|^2 + \langle y, \tilde{x} - x\rangle + \langle z, \pi\tilde{x}\rangle \\
&= \inf_{x\in\mathbb{R}^{n\times d}}\ \sup_{y,z\in\mathbb{R}^{n\times d}}\ \sum_{i=1}^n f_i(x_i) - \frac{\nu}{2}\|x\|^2 - \langle x, y\rangle - \frac{1}{2\nu}\|\pi z + y\|^2.
\end{aligned}$$

C SGD DYNAMIC

The dynamic considered when using stochastic gradients is given by:

$$\begin{cases}
dx_t = \eta(\tilde{x}_t - x_t)\,dt - \gamma \int_\Xi (\nabla F(x_t, \xi) - \tilde{y}_t)\,dN(t, \xi) \\
d\tilde{x}_t = \tilde{\eta}(x_t - \tilde{x}_t)\,dt - \tilde{\gamma} \int_\Xi (\nabla F(x_t, \xi) - \tilde{y}_t)\,dN(t, \xi) \\
d\tilde{y}_t = -\theta(y_t + z_t + \nu \tilde{x}_t)\,dt + (\delta + \tilde{\delta}) \int_\Xi (\nabla F(x_t, \xi) - \tilde{y}_t)\,dN(t, \xi) \\
dy_t = \alpha(\tilde{y}_t - y_t)\,dt \\
dz_t = \tilde{\alpha}(\tilde{z}_t - z_t)\,dt - \beta \sum_{(i,j)\in E(t)} (e_i - e_j)(e_i - e_j)^T (y_t + z_t)\,dM_{ij}(t) \\
d\tilde{z}_t = \tilde{\alpha}(z_t - \tilde{z}_t)\,dt - \tilde{\beta} \sum_{(i,j)\in E(t)} (e_i - e_j)(e_i - e_j)^T (y_t + z_t)\,dM_{ij}(t)
\end{cases} \quad (8)$$

D PROOF OF THE THEOREM

D.1 PROPERTIES AND ASSUMPTIONS

The following properties will be used throughout the proofs of the Lemmas and Theorems and are related to the communications of our nodes.

Lemma D.1. Under the assumptions of Theorem 3.2, if $z_0, \tilde{z}_0 \in \operatorname{span}(\pi)$, then $z_t, \tilde{z}_t \in \operatorname{span}(\pi)$ almost surely.

Proof. It is clear that for any $i, j$, we get: $\pi(e_i - e_j)(e_i - e_j)^T = (e_i - e_j)(e_i - e_j)^T$. Thus, the variations of $(z_t, \tilde{z}_t)$ belong to $\operatorname{span}(\pi)$, and so does the trajectory.

We derive the following Lemma, similar to a result from Boyd et al. (2006):

Lemma D.2 (Spiking contraction). Under the assumptions of Theorem 3.2, we have:

$$\sum_{(i,j)\in E(t)} \lambda_{ij}(t)\Big(\big\|(e_i - e_j)(e_i - e_j)^T x - \pi x\big\|^2 - \|\pi x\|^2\Big) = -x^T\Lambda(t)x \le -\frac{1}{\chi_1^*}\|\pi x\|^2.$$

Proof. If $i = j$, then $\lambda_{ii} = 0$. For a given $(i,j)$ with $i \ne j$, we get:

$$\big\|(e_i - e_j)(e_i - e_j)^T x - \pi x\big\|^2 = \|\pi x\|^2 + \|x_i - x_j\|^2 - 2\big\langle \pi x, (e_i - e_j)(e_i - e_j)^T x\big\rangle = \|\pi x\|^2 - \big\langle x, (e_i - e_j)(e_i - e_j)^T x\big\rangle, \quad (9)\text{-}(10)$$

and this allows concluding by summation.

Lemma D.3 (Effective resistance contraction). For any $i, j$ and any $x \in \mathbb{R}^d$, we have:

$$\big\|(e_i - e_j)(e_i - e_j)^T x\big\|^2_{\Lambda(t)^+} \le \chi_2^*\, \big\|(e_i - e_j)(e_i - e_j)^T x\big\|^2.$$

Proof. Indeed, we note that:

$$\big\|(e_i - e_j)(e_i - e_j)^T x\big\|^2_{\Lambda(t)^+} = x^T(e_i - e_j)(e_i - e_j)^T\Lambda(t)^+(e_i - e_j)(e_i - e_j)^T x \le 2\chi_2^*\, x^T(e_i - e_j)(e_i - e_j)^T x = \chi_2^*\big\|(e_i - e_j)(e_i - e_j)^T x\big\|^2.$$

Lemma D.4 (Min and max voltage values; see, e.g., Klein & Randic (1993)). For any $i, j, k$, we have $e_i^T\Lambda(t)^+(e_i - e_j) \ge e_k^T\Lambda(t)^+(e_i - e_j)$.

Proof. Let us call $v \triangleq \Lambda(t)^+(e_i - e_j)$ and, for a vertex $k$, $N(k)$ the set of its neighbors in $E(t)$. Note that $k \notin N(k)$. We want to prove that $v_i \triangleq e_i^T v$ is greater than $v_k$. In fact, we will prove that $\forall k$, $v_i \ge v_k \ge v_j$. Recall that $\Lambda(t)\Lambda(t)^+ = \pi$, meaning that $\Lambda(t)v = \pi(e_i - e_j) = e_i - e_j$. Thus, $\forall k \notin \{i, j\}$, we have $(\Lambda(t)v)_k = 0$, leading to:

$$0 = v_k\sum_{k'\in N(k)}(\lambda_{kk'}(t) + \lambda_{k'k}(t)) - \sum_{k'\in N(k)}(\lambda_{kk'}(t) + \lambda_{k'k}(t))\,v_{k'}.$$
This allows us to write:
\[
v_k = \frac{\sum_{k' \in N(k)} \left(\lambda_{kk'}(t) + \lambda_{k'k}(t)\right) v_{k'}}{\sum_{k' \in N(k)} \left(\lambda_{kk'}(t) + \lambda_{k'k}(t)\right)},
\]
meaning that v_k is a convex combination of the values of v over its neighbors. As such, for any k ∉ {i, j}, the value v_k lies inside the convex hull of {v_{k'}, k' ∈ N(k)} and can be neither strictly greater than the maximal nor strictly smaller than the minimal value of v among its neighbors; v_k is maximal or minimal only when the {v_{k'}, k' ∈ N(k)} are all equal. This means that the only two coordinates of v that are allowed to be strictly maximal or minimal among their neighbors are i and j. Thus, the graph being connected, from any k there is a path leading to i or to j, obtained by iterating arg max_{k' ∈ N(k)} v_{k'} or arg min_{k' ∈ N(k)} v_{k'} steps; hence max_k v_k and min_k v_k are attained in {v_i, v_j}. But we have v_i - v_j = (e_i - e_j)^T Λ(t)^+ (e_i - e_j) ≥ 0, so v_i ≥ v_j. Thus, max_k v_k = v_i and min_k v_k = v_j.

Lemma D.5 (Bounded effective resistance). For any i, j, we have λ_ij(t)(e_i - e_j)^T Λ(t)^+(e_i - e_j) ≤ 1.

Proof. As Λ(t)Λ(t)^+ = I - π, for any i, j, we write:
\[
2 = (e_i - e_j)^T \Lambda(t)\Lambda(t)^+ (e_i - e_j) = \sum_{(k,l)\in E(t)} \lambda_{kl}(t)\,(e_i - e_j)^T (e_k - e_l)(e_k - e_l)^T \Lambda(t)^+ (e_i - e_j) .
\]
As we have, for any i, j, k, l:
\[
(e_i - e_j)^T (e_k - e_l) =
\begin{cases}
1 & \text{if } k = i \text{ and } l \neq j \\
1 & \text{if } k \neq i \text{ and } l = j \\
-1 & \text{if } k = j \text{ and } l \neq i \\
-1 & \text{if } k \neq j \text{ and } l = i \\
2 & \text{if } k = i \text{ and } l = j \\
-2 & \text{if } k = j \text{ and } l = i \\
0 & \text{otherwise,}
\end{cases}
\]
and applying the fact that λ_kl(t) = 0 if (k, l) ∉ E(t), we can expand the equation above as:
\[
\begin{aligned}
2 ={}& \sum_{l \neq j} \lambda_{il}(t) \left( e_i^T \Lambda(t)^+ (e_i - e_j) - e_l^T \Lambda(t)^+ (e_i - e_j) \right) \\
&+ \sum_{k \neq i} \lambda_{kj}(t) \left( e_k^T \Lambda(t)^+ (e_i - e_j) - e_j^T \Lambda(t)^+ (e_i - e_j) \right) \\
&- \sum_{l \neq i} \lambda_{jl}(t) \left( e_j^T \Lambda(t)^+ (e_i - e_j) - e_l^T \Lambda(t)^+ (e_i - e_j) \right) \\
&- \sum_{k \neq j} \lambda_{ki}(t) \left( e_k^T \Lambda(t)^+ (e_i - e_j) - e_i^T \Lambda(t)^+ (e_i - e_j) \right) \\
&+ 2\lambda_{ij}(t)(e_i - e_j)^T \Lambda(t)^+ (e_i - e_j) - 2\lambda_{ji}(t)(e_j - e_i)^T \Lambda(t)^+ (e_i - e_j) .
\end{aligned}
\]
Using Lemma D.4, the fact that λ_ji(t) ≥ 0 and that Λ(t)^+ is positive semi-definite, all the terms in the expansion above are non-negative, giving:
\[ 2\lambda_{ij}(t)(e_i - e_j)^T \Lambda(t)^+ (e_i - e_j) \le 2 . \]

Next, we set ν = µ/2 such that:
\[ \frac{1}{2L}\|\nabla F(x) - \nabla F(y)\|^2 \le d_F(x, y) \le \frac{L}{2}\|x - y\|^2, \qquad \frac{\nu}{2}\|x - y\|^2 \le d_F(x, y) \le \frac{1}{2\nu}\|\nabla F(x) - \nabla F(y)\|^2, \]
and we recall that E_ξ[d_{F(·,ξ)}(x, y)] = d_{E_ξ F(·,ξ)}(x, y).

D.2 PROOF OF LEMMA 3.1

Proof of Lemma 3.1. First, we note that Λ(t) is symmetric with a non-negative spectrum, as:
\[ x^T \Lambda(t) x = \sum_{(i,j)\in E(t)} \lambda_{ij}(t)\,\|x_i - x_j\|^2 . \]
From this, we also clearly see that χ₁(t) = +∞ iff the graph is disconnected. Next, assuming that the graph is connected, 0 is an eigenvalue of Λ(t) with multiplicity 1, and by definition of χ₁(t) we have Tr Λ(t) ≥ (n-1)/χ₁(t). As we also have:
\[ \sum_{(i,j)\in E(t)} \lambda_{ij}(t)\,(e_i - e_j)^T \Lambda^+(t)(e_i - e_j) = \mathrm{Tr}\left(\Lambda^+(t)\Lambda(t)\right) = n - 1, \]
we can write n - 1 ≤ 2χ₂(t) Σ_{(i,j)∈E(t)} λ_ij(t) = χ₂(t) Tr Λ(t), and get (n-1)/Tr Λ(t) ≤ min(χ₁(t), χ₂(t)). Finally, for any (i, j) ∈ E(t), using Lemma D.5, we get that λ_ij(t)(e_i - e_j)^T Λ^+(t)(e_i - e_j) ≤ 1. Thus:
\[ \inf_{(i,j)\in E(t)} \lambda_{ij}(t) \; \sup_{(i,j)\in E(t)} (e_i - e_j)^T \Lambda^+(t)(e_i - e_j) \le 1, \]
leading to:
\[ 2\chi_2(t) \le \frac{1}{\inf_{(i,j)\in E(t)} \lambda_{ij}(t)} . \]

D.3 PROOF OF THEOREM 3.2 AND COROLLARY 3.2.1

Proof of Theorem 3.2. Because Φ is smooth and E(t) is piecewise constant, Itô's formula (Last & Penrose, 2017) applied to the semi-martingale (X_t, Y_t), gluing the intervals on which E(t) (as well as the weights λ_ij(t)) is constant, gives:
\[
\begin{aligned}
\Phi(T, X_T, Y_T) ={}& \Phi(0, X_0, Y_0) + \int_0^T \Big\langle \nabla\Phi(t, X_t, Y_t), \begin{pmatrix} 1 \\ a_1(X_t, Y_t) \\ a_2(X_t, Y_t) \end{pmatrix} \Big\rangle\, dt \\
&+ \sum_{i=1}^n \int_0^T \left( \Phi(t, X_t + b_1^i(X_t), Y_t) - \Phi(t, X_t, Y_t) \right) dt \\
&+ \sum_{(i,j)\in E(t)} \int_0^T \left( \Phi(t, X_t, Y_t + b_2^{ij}(Y_t)) - \Phi(t, X_t, Y_t) \right) \lambda_{ij}(t)\, dt + \Theta_T,
\end{aligned}
\]
where:
\[
\begin{aligned}
\Theta_T \triangleq{}& \sum_{i=1}^n \int_0^T \left( \Phi(t, X_{t^-} + b_1^i(X_{t^-}), Y_{t^-}) - \Phi(t, X_{t^-}, Y_{t^-}) \right) \left(dN_i(t) - dt\right) \\
&+ \sum_{(i,j)\in E(t)} \int_0^T \left( \Phi(t, X_{t^-}, Y_{t^-} + b_2^{ij}(Y_{t^-})) - \Phi(t, X_{t^-}, Y_{t^-}) \right) \left(dM_{ij}(t) - \lambda_{ij}(t)\, dt\right) .
\end{aligned}
\]
We will use the following technical Lemma, whose (rather involved) proof is deferred to Appendix D.4:

Lemma D.6. There exist parameters ν, η, η̃, γ, γ̃, α, α̃, θ, δ, δ̃, β, β̃ and c > 0 such that:
\[
\Big\langle \nabla\Phi(t, X_t, Y_t), \begin{pmatrix} 1 \\ a_1(X_t, Y_t) \\ a_2(X_t, Y_t) \end{pmatrix} \Big\rangle + \sum_{i=1}^n \left( \Phi(t, X_t + b_1^i(X_t), Y_t) - \Phi(t, X_t, Y_t) \right) + \sum_{(i,j)\in E(t)} \lambda_{ij}(t)\left( \Phi(t, X_t, Y_t + b_2^{ij}(Y_t)) - \Phi(t, X_t, Y_t) \right) \le 0 \quad a.s.,
\]
with A'_t = c√(µ/L) A_t and A_0 = 1.

Following the Lemma above, and taking expectations (the compensated term Θ_T is a martingale), we get that:
\[ 0 \le \mathbb{E}[\Phi(t, X_t, Y_t)] \le \mathbb{E}[\Phi(0, X_0, Y_0)] . \]
We thus know that A_t = e^{c√(µ/L) t}, which implies that:
\[ \mathbb{E}[A_t \, d_F(x_t, x^*)] \le \mathbb{E}[\Phi(0, X_0, Y_0)], \]
and we will obtain the conclusion of our theorem by making all the constants explicit in the following. We note that, by properties of the Poisson processes, the expected number of edges activated during [0, T] is ∫₀^T Tr(Λ(t)) dt, and, given that the gradients fire at rate 1, the expected number of gradients computed is nT.

Proof of Corollary 3.2.1. We recall the SGD version of our Lemma:

Lemma D.7. There exist parameters ν, η, η̃, γ, γ̃, α, α̃, θ, δ, δ̃, β, β̃ and c > 0, C > 0 such that:
\[
\Big\langle \nabla\Phi(t, X_t, Y_t), \begin{pmatrix} 1 \\ a_1(X_t, Y_t) \\ a_2(X_t, Y_t) \end{pmatrix} \Big\rangle + \sum_{i=1}^n \left( \Phi(t, X_t + b_1^i(X_t), Y_t) - \Phi(t, X_t, Y_t) \right) + \sum_{(i,j)\in E(t)} \lambda_{ij}(t)\left( \Phi(t, X_t, Y_t + b_2^{ij}(Y_t)) - \Phi(t, X_t, Y_t) \right) \le C A_t \frac{\sigma^2}{L} \quad a.s.,
\]
with A'_t = c√(µ/L) A_t and A_0 = 1.

The proof follows the same path, except that we have an extra bias term which, for any T > 0, writes:
\[ \frac{C\sigma^2}{A_T}\int_0^T \frac{A_t}{L}\, dt \le \frac{C\sigma^2}{c\sqrt{\mu L}}, \]
which leads to the conclusion along an identical path. The constants will be made explicit in the next Lemma.

D.4 PROOF OF LEMMA D.6 AND LEMMA D.7

We first state a couple of inequalities that we will combine to obtain a bound on our Lyapunov function.

Proposition D.8.
First:
\[
\begin{aligned}
\varphi_A \triangleq{}& A_t\left(d_F(x^+, x^*) - d_F(x, x^*)\right) + \tilde{A}_t\left(\|\tilde{x}^+ - x^*\|^2 - \|\tilde{x} - x^*\|^2\right) \\
&+ \eta A_t \langle \tilde{x} - x, \nabla F(x) - \nabla F(x^*) \rangle + 2\tilde{\eta}\tilde{A}_t \langle x - \tilde{x}, \tilde{x} - x^* \rangle \\
\le{}& \|\nabla F(x) - \tilde{y}\|^2 \left( \frac{A_t L \gamma^2}{2} - A_t\gamma + \tilde{A}_t\tilde{\gamma}^2 \right) + A_t\gamma \langle \nabla F(x) - \tilde{y}, y^* - \tilde{y} \rangle + 2\tilde{\gamma}\tilde{A}_t \langle \tilde{y} - y^*, \tilde{x} - x^* \rangle \\
&- 2\tilde{\gamma}\tilde{A}_t\left(d_F(\tilde{x}, x^*) + d_F(x^*, x) - d_F(\tilde{x}, x)\right) - \eta A_t\left(d_F(\tilde{x}, x) + d_F(x, x^*) - d_F(\tilde{x}, x^*)\right) \\
&- \tilde{\eta}\tilde{A}_t\|\tilde{x} - x^*\|^2 + \tilde{\eta}\tilde{A}_t\|x - x^*\|^2 .
\end{aligned}
\]

Proof. First, we have to use the optimality conditions and smoothness:
\[
\begin{aligned}
d_F(x^+, x^*) - d_F(x, x^*) &= d_F(x^+, x) - \langle x^+ - x, \nabla F(x^*) - \nabla F(x) \rangle \\
&\le \frac{L}{2}\|x^+ - x\|^2 - \langle x^+ - x, \nabla F(x^*) - \nabla F(x) \rangle \\
&= \frac{L\gamma^2}{2}\|\tilde{y} - \nabla F(x)\|^2 - \gamma\|\nabla F(x) - \tilde{y}\|^2 + \gamma \langle \nabla F(x) - \tilde{y}, y^* - \tilde{y} \rangle .
\end{aligned}
\]
Next, we note that, again using the optimality conditions:
\[
\begin{aligned}
\|\tilde{x}^+ - x^*\|^2 - \|\tilde{x} - x^*\|^2 &= 2\langle \tilde{x}^+ - \tilde{x}, \tilde{x} - x^* \rangle + \|\tilde{x}^+ - \tilde{x}\|^2 \\
&= -2\tilde{\gamma} \langle \nabla F(x) - \tilde{y}, \tilde{x} - x^* \rangle + \tilde{\gamma}^2\|\nabla F(x) - \tilde{y}\|^2 \\
&= -2\tilde{\gamma} \langle \nabla F(x) - \nabla F(x^*), \tilde{x} - x^* \rangle + 2\tilde{\gamma} \langle \tilde{y} - y^*, \tilde{x} - x^* \rangle + \tilde{\gamma}^2\|\nabla F(x) - \tilde{y}\|^2 \\
&= -2\tilde{\gamma}\left(d_F(\tilde{x}, x^*) + d_F(x^*, x) - d_F(\tilde{x}, x)\right) + 2\tilde{\gamma} \langle \tilde{y} - y^*, \tilde{x} - x^* \rangle + \tilde{\gamma}^2\|\nabla F(x) - \tilde{y}\|^2 .
\end{aligned}
\]
The momentum in x associated with the term d_F(x, x^*) gives:
\[ \eta \langle \tilde{x} - x, \nabla F(x) - \nabla F(x^*) \rangle = -\eta\left(d_F(\tilde{x}, x) + d_F(x, x^*) - d_F(\tilde{x}, x^*)\right), \]
and the momentum in x̃ associated with ‖x̃ - x^*‖² leads to:
\[ 2\tilde{\eta} \langle x - \tilde{x}, \tilde{x} - x^* \rangle = -2\tilde{\eta}\|\tilde{x} - x^*\|^2 + 2\tilde{\eta} \langle x - x^*, \tilde{x} - x^* \rangle \le -\tilde{\eta}\|\tilde{x} - x^*\|^2 + \tilde{\eta}\|x - x^*\|^2 . \]

Corollary D.8.1. Under Assumption 3.4, the bound of Proposition D.8 holds for E_ξ[φ_A], up to the additional term σ²(A_t Lγ²/2 - A_tγ + Ã_tγ̃²) on the right-hand side.

Proof. Using the same computations, we next note that:
\[ \mathbb{E}_\xi\left[\|\nabla F(x, \xi) - \tilde{y}\|^2\right] = \mathbb{E}_\xi\left[\|\nabla F(x, \xi)\|^2\right] - 2\langle \nabla F(x), \tilde{y} \rangle + \|\tilde{y}\|^2 \le \|\nabla F(x) - \tilde{y}\|^2 + \sigma^2 . \]

Proposition D.9.
Next, we show that if αB_t = (δ̃/2)B̃_t:
\[
\begin{aligned}
\varphi_B \triangleq{}& B_t\left(\|y^+ - y^*\|^2 - \|y - y^*\|^2\right) + \tilde{B}_t\left(\|\tilde{y}^+ - y^*\|^2 - \|\tilde{y} - y^*\|^2\right) + 2\alpha B_t \langle y - y^*, \tilde{y} - y \rangle \\
&- 2\theta\tilde{B}_t \langle y + z + \nu\tilde{x}, \tilde{y} - y^* \rangle + 2\alpha C_t \langle \tilde{y} - y, z + y - y^* - z^* \rangle \\
\le{}& -\frac{\tilde{\delta}}{2}\tilde{B}_t\|\tilde{y} - y^*\|^2 - \frac{\tilde{\delta}}{2}\tilde{B}_t\|y - y^*\|^2 - 2\tilde{\delta}\tilde{B}_t \langle \nabla F(x) - \tilde{y}, y^* - \tilde{y} \rangle + \delta\tilde{B}_t\|\nabla F(x) - \nabla F(x^*)\|^2 \\
&+ \left((\delta + \tilde{\delta})^2 - \delta\right)\tilde{B}_t\|\nabla F(x) - \tilde{y}\|^2 - 2\theta\tilde{B}_t \langle y + z - y^* - z^*, \tilde{y} - y^* \rangle - 2\theta\nu\tilde{B}_t \langle \tilde{x} - x^*, \tilde{y} - y^* \rangle \\
&+ 2\alpha C_t \langle \tilde{y} - y, z + y - y^* - z^* \rangle .
\end{aligned}
\]

Proof. Using the optimality conditions:
\[
\begin{aligned}
\|\tilde{y}^+ - y^*\|^2 - \|\tilde{y} - y^*\|^2 &= 2\langle \tilde{y} - y^*, \tilde{y}^+ - \tilde{y} \rangle + \|\tilde{y}^+ - \tilde{y}\|^2 \\
&= 2\delta \langle \nabla F(x) - \tilde{y}, \tilde{y} - y^* \rangle + 2\tilde{\delta} \langle \nabla F(x) - \tilde{y}, \tilde{y} - y^* \rangle + (\delta + \tilde{\delta})^2\|\nabla F(x) - \tilde{y}\|^2 \\
&\le -2\tilde{\delta} \langle \nabla F(x) - \tilde{y}, y^* - \tilde{y} \rangle + \delta\|\nabla F(x) - \nabla F(x^*)\|^2 - \delta\|\tilde{y} - y^*\|^2 + \left((\delta + \tilde{\delta})^2 - \delta\right)\|\nabla F(x) - \tilde{y}\|^2 .
\end{aligned}
\]
The momentum in ỹ associated with the term ‖ỹ - y^*‖² gives:
\[ -2\theta\tilde{B}_t \langle y + z + \nu\tilde{x}, \tilde{y} - y^* \rangle = -2\theta\tilde{B}_t \langle y + z - y^* - z^*, \tilde{y} - y^* \rangle - 2\theta\nu\tilde{B}_t \langle \tilde{x} - x^*, \tilde{y} - y^* \rangle . \]
The momentum in y associated with the term ‖y - y^*‖² gives:
\[ 2\alpha B_t \langle \tilde{y} - y, y - y^* \rangle = -\alpha B_t\|y - y^*\|^2 - \alpha B_t\|\tilde{y} - y\|^2 + \alpha B_t\|\tilde{y} - y^*\|^2, \]
and the one associated with ‖y + z - y^* - z^*‖²:
\[ 2\alpha C_t \langle \tilde{y} - y, z + y - y^* - z^* \rangle . \]

Proof of Lemma D.7. All the previous computations hold, except that the term in front of σ² is given by:
\[
\left(\delta^2 + (\delta + \tilde{\delta})^2\right)\tilde{B}_t + \left(\frac{A_t L\gamma^2}{2} - A_t\gamma + \tilde{A}_t\tilde{\gamma}^2\right) = \left(\delta^2 + (\delta + \tilde{\delta})^2\right)\frac{A_t}{8L} + \left(\frac{A_t L\gamma^2}{2} - A_t\gamma + \frac{\nu A_t}{4}\tilde{\gamma}^2\right) \le \frac{5}{8L}A_t .
\]
Thus, we obtain, by integration of the potential, that:
\[ \Phi(t, X_t, Y_t) \le \Phi(0, X_0, Y_0) + \sigma^2 \int_0^t \frac{5}{8L} A_u \, du . \]
Now, with A_t, Ã_t, B_t, B̃_t, C_t, C̃_t as above and all the constants as above, we get the result, since:
\[ \int_0^t \frac{5}{8L} A_u \, du \le \frac{5\,A_t}{8c\sqrt{L\mu}} . \]
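The spectral facts used throughout this appendix lend themselves to a quick numerical check. The following sketch (our own illustration, not code from the paper) verifies the inequalities of Lemma 3.1 and Lemma D.5 on a small weighted Laplacian, with χ₁ taken as the inverse of the smallest positive eigenvalue and χ₂ as half the largest effective resistance over the edges:

```python
# Numerical sanity check (our own illustration, not the paper's code) of the
# spectral inequalities of Lemma 3.1 and Lemma D.5 on a small weighted
# Laplacian  Λ = Σ λ_ij (e_i - e_j)(e_i - e_j)^T  of a connected graph.
import numpy as np

rng = np.random.default_rng(0)
n = 6
# A cycle plus two chords, with random communication rates λ_ij.
edges = [(i, (i + 1) % n) for i in range(n)] + [(0, 3), (1, 4)]
rates = {e: rng.uniform(0.5, 2.0) for e in edges}

L = np.zeros((n, n))
for (i, j), lam in rates.items():
    d = np.zeros(n)
    d[i], d[j] = 1.0, -1.0
    L += lam * np.outer(d, d)

chi1 = 1.0 / np.linalg.eigvalsh(L)[1]   # inverse of smallest positive eigenvalue
Lp = np.linalg.pinv(L)                  # Λ^+, with Λ Λ^+ = I - π

res = {}                                # effective resistances over the edges
for (i, j) in rates:
    d = np.zeros(n)
    d[i], d[j] = 1.0, -1.0
    res[(i, j)] = d @ Lp @ d
chi2 = 0.5 * max(res.values())

# Lemma 3.1: (n - 1)/Tr Λ ≤ min(χ1, χ2)  and  2 χ2 ≤ 1 / inf λ_ij.
assert (n - 1) / np.trace(L) <= min(chi1, chi2) + 1e-12
assert 2 * chi2 <= 1.0 / min(rates.values()) + 1e-12
# Lemma D.5: λ_ij (e_i - e_j)^T Λ^+ (e_i - e_j) ≤ 1 on every edge.
assert all(rates[e] * res[e] <= 1 + 1e-12 for e in rates)
print("all spectral inequalities hold")
```

The graph, rates and seed are arbitrary; any connected weighted graph should satisfy the same assertions.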

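As a short complement to the proof of Corollary 3.2.1, the bias integral can be evaluated in closed form (our own computation, using only A_t = e^{c√(µ/L)t} with A_0 = 1):

```latex
\frac{1}{A_T}\int_0^T \frac{A_t}{L}\,dt
= \frac{1}{L\,e^{c\sqrt{\mu/L}\,T}}\cdot\frac{e^{c\sqrt{\mu/L}\,T}-1}{c\sqrt{\mu/L}}
\le \frac{1}{c\,L\sqrt{\mu/L}}
= \frac{1}{c\sqrt{\mu L}} ,
```

which is the origin of the O(σ²/√(µL)) bias term announced in the abstract.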
E PHYSICAL INTERPRETATION

To gain more insight on the condition 2χ*₁[Λ]χ*₂[Λ] ≤ 1, we can write Λ(t) as the product of two more interpretable quantities:
\[
\Lambda(t) = \underbrace{\sum_{(i,j)\in E(t)} \lambda_{ij}(t)}_{\lambda(t)} \;\; \underbrace{\frac{2\Lambda(t)}{\mathrm{Tr}\,\Lambda(t)}}_{\hat{\Lambda}(t)} .
\]
In this setting, λ(t) is the instantaneous expected rate of communication over the whole graph at time t, while Λ̂(t) can be interpreted as the Laplacian of E(t) with each edge weighted by its probability of having spiked at this instant, given that an edge fired at time t. Being normalized, Λ̂(t) only contains information about the graph's connectivity at time t, while λ(t) is the global rate of communication. We have:
\[ \chi_1[\Lambda(t)] = \frac{\chi_1[\hat{\Lambda}(t)]}{\lambda(t)}, \qquad \chi_2[\Lambda(t)] = \frac{\chi_2[\hat{\Lambda}(t)]}{\lambda(t)} . \]
If we make the following assumptions,

Assumption E.1. There is a λ* > 0 such that, at all times t, λ(t) ≥ λ*.

Assumption E.2. There are χ̂*₁ > 0, χ̂*₂ > 0 such that, for all t, χ₁[Λ̂(t)] ≤ χ̂*₁ and χ₂[Λ̂(t)] ≤ χ̂*₂,

meaning we assume bounds on the worst rate of communication and on the worst graph connectivity, we immediately have χ₁[Λ(t)] ≤ χ̂*₁/λ* and χ₂[Λ(t)] ≤ χ̂*₂/λ*, leading to χ*₁ ≤ χ̂*₁/λ* and χ*₂ ≤ χ̂*₂/λ*. Then, if the following condition on the worst rate of communication is met:
\[ \sqrt{2\,\hat{\chi}_1^*\,\hat{\chi}_2^*} \le \lambda^*, \]
meaning that the instantaneous global communication rate is always larger than a spectral quantity quantifying the graph's connectivity, it directly implies 2χ*₁[Λ]χ*₂[Λ] ≤ 1 and the convergence of our method.
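The decomposition above can be checked numerically. The following sketch (our own illustration, on an arbitrary small graph) verifies Λ = λ · Λ̂ and the induced scaling of χ₁:

```python
# Sketch (our own illustration) of the decomposition Λ = λ · Λ̂ of Appendix E,
# where λ = Σ λ_ij = Tr Λ / 2 is the global communication rate and
# Λ̂ = 2Λ / Tr Λ is the normalized "pure connectivity" Laplacian.
import numpy as np

def laplacian(n, edges, rates):
    """Weighted graph Laplacian  Σ λ_ij (e_i - e_j)(e_i - e_j)^T."""
    L = np.zeros((n, n))
    for (i, j), lam in zip(edges, rates):
        d = np.zeros(n)
        d[i], d[j] = 1.0, -1.0
        L += lam * np.outer(d, d)
    return L

def chi1(L):
    """Inverse of the smallest positive eigenvalue of a connected Laplacian."""
    return 1.0 / np.linalg.eigvalsh(L)[1]

n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
rates = [0.7, 1.3, 0.4, 2.0, 1.1, 0.9]
L = laplacian(n, edges, rates)

lam = sum(rates)                  # instantaneous global communication rate λ(t)
Lhat = 2 * L / np.trace(L)        # normalized Laplacian Λ̂(t)

assert np.isclose(np.trace(L), 2 * lam)       # Tr Λ = 2 Σ λ_ij
assert np.allclose(L, lam * Lhat)             # Λ = λ · Λ̂
assert np.isclose(chi1(L), chi1(Lhat) / lam)  # χ1[Λ] = χ1[Λ̂] / λ
print("decomposition verified")
```

The same scaling holds for χ₂, since the pseudo-inverse of λΛ̂ is Λ̂⁺/λ.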

F COMPARISON WITH OTHER WORKS

We now explain the results of Sec. 4.1. Recalling that Tr(Λ(t)) = 2 Σ_{(ij)∈E(t)} λ_ij(t) ≤ 2|E(t)| sup_{(i,j)∈E(t)} λ_ij(t) and that Tr(Λ(t)) ≤ (n-1)‖Λ(t)‖, we obtain:
\[
\begin{aligned}
\sqrt{\chi_2}\,\mathrm{Tr}(\Lambda(t)) &\le \frac{1}{\sqrt{2\inf_{(i,j)\in E(t)} \lambda_{ij}(t)}}\,\sqrt{\mathrm{Tr}\,\Lambda(t)}\,\sqrt{\mathrm{Tr}\,\Lambda(t)} \\
&\le \frac{1}{\sqrt{2\inf_{(i,j)\in E(t)} \lambda_{ij}(t)}}\,\sqrt{2|E(t)| \sup_{(i,j)\in E(t)} \lambda_{ij}(t)}\,\sqrt{\mathrm{Tr}\,\Lambda(t)} \\
&\le \sqrt{\frac{\sup_{(i,j)\in E(t)} \lambda_{ij}(t)}{\inf_{(i,j)\in E(t)} \lambda_{ij}(t)}}\,\sqrt{|E(t)|\,(n-1)\,\|\Lambda(t)\|} \\
&\le \sqrt{\frac{\sup_{(i,j)\in E(t)} \lambda_{ij}(t)}{\inf_{(i,j)\in E(t)} \lambda_{ij}(t)}}\;|E(t)|\,\sqrt{\|\Lambda(t)\|} .
\end{aligned}
\]

G FURTHER EXPERIMENTS

In this section, we present additional numerical results comparing our method DADAO to ADOM+ (Kovalev et al., 2021a) in the time-varying setting, and we report our results using SGD.

G.1 TIME-VARYING SETTING

In this section, we study the effect of the parameter χ*₁ on the convergence speed of ADOM+ (Kovalev et al., 2021a) and DADAO by varying it in χ*₁ ∈ {3, 33, 180, 233} for random geometric graphs of size n = 20 on the decentralized linear regression task with time-varying topology. To visualize the difference in connectivity these changes in χ*₁ represent, we plot four such graphs with varying values of χ*₁ in Fig. 3. In Fig. 4, we show the different convergence speeds this entails. As expected, we observe in Fig. 4 that varying χ*₁ does not affect the number of gradient computations of either ADOM+ M.-C. or DADAO, but the smaller χ*₁, the better the slope for ADOM+ in terms of gradient steps. We also confirm, for all three methods, that the smaller χ*₁, the less communication is needed to reach an ǫ-precision.

G.2 STOCHASTIC GRADIENT DESCENT WITH DADAO

In the SGD setting, we randomly sample a mini-batch of B data points on each worker and compute the losses and stochastic gradients ∇f_i(x_i, ξ) w.r.t. these samples. To study the effect of the quadratic error σ² of our gradients on the resulting bias of our parameters, we fix both the data (for linear regression) and the communication network (a star graph of size n = 20) and try different values of B. To monitor our results, we plot the mean distance to x* of the running average over time of our local parameters. Taking the notations introduced in Sec. 4.2, we thus plot:
\[ \frac{1}{n}\sum_{i=1}^n \Big\| \frac{1}{k_i}\sum_{j=1}^{k_i} x_j^{(i)} - x^* \Big\|^2, \]
where k_i designates a local event counter. We report our results in Fig. 5. We confirm that the less variance on our stochastic gradients, the less biased our estimates are.

G.3 COMPARISON BETWEEN DADAO AND MSDA ON THE STAR GRAPH

For star graphs of size n ∈ {10, 20, 70, 200, 300, 1000, 2000}, we ran DADAO and MSDA on the task of distributed linear regression. We considered the evolution of the average distance to the optimum with the number of gradient steps and communication steps in log scale for each run, and computed the slope of each line. For each graph size, we report in Fig. 6 the ratio between the slope for MSDA and the slope for DADAO. We remark that the ratio between the gradient complexities of DADAO and MSDA is O(1), while MSDA requires O(√n) more communications on the star graph.

For the sake of completeness, we also specify the matrix A describing the linear ODE (Eq. 6), with the state ordered as (x, x̃, y, ỹ, z, z̃):
\[
A = \begin{pmatrix}
-\eta & \eta & 0 & 0 & 0 & 0 \\
\tilde{\eta} & -\tilde{\eta} & 0 & 0 & 0 & 0 \\
0 & 0 & -\alpha & \alpha & 0 & 0 \\
0 & -\theta\nu & -\theta & 0 & -\theta & 0 \\
0 & 0 & 0 & 0 & -\tilde{\alpha} & \tilde{\alpha} \\
0 & 0 & 0 & 0 & \tilde{\alpha} & -\tilde{\alpha}
\end{pmatrix} .
\]
As described in Appendix H.1, we call PPPspikes the process mentioned above, returning the ordered sequence of events and spike times of the two P.P.Ps. We can then write the pseudo-code of our implementation of the DADAO optimizer in Algorithm 2.

Algorithm 2: Pseudo-code of our implementation of DADAO on a single machine.
Input: On each machine i ∈ {1, ..., n}, an oracle able to evaluate ∇f_i; parameters µ, L, χ*₁, t_max, n, λ*; the sequence of time-varying graphs E(t).
Initialize on each machine i ∈ {1, ..., n}:
    Set X^(i) = (x_i, x̃_i, ỹ_i) and Y^(i) = (y_i, z_i, z̃_i) to 0;
    Set the constants ν, η, η̃, γ, γ̃, α, α̃, θ, δ, δ̃, β, β̃ using µ, L, χ*₁;
    Set A; set the local clock T^(i) ← 0.
ListEvents, ListTimes ← PPPspikes(…);
for k = 1, ..., |ListEvents| do
    if ListEvents[k] is to take a gradient step then
        i ∼ U({1, ..., n});
        (X^(i), Y^(i)) ← exp((ListTimes[k] − T^(i)) A) (X^(i), Y^(i));
        x_i ← x_i − γ (∇f_i(x_i) − ν x_i − ỹ_i);
        x̃_i ← x̃_i − γ̃ (∇f_i(x_i) − ν x_i − ỹ_i);
        ỹ_i ← ỹ_i + (δ + δ̃) (∇f_i(x_i) − ν x_i − ỹ_i);
        T^(i) ← ListTimes[k];
    else if ListEvents[k] is to take a communication step then
        (i, j) ∼ U(E(ListTimes[k]));
        (X^(i), Y^(i)) ← exp((ListTimes[k] − T^(i)) A) (X^(i), Y^(i));
        (X^(j), Y^(j)) ← exp((ListTimes[k] − T^(j)) A) (X^(j), Y^(j));
        m_ij ← (y_i + z_i − y_j − z_j);   // Message exchanged.
        z_i ← z_i − β m_ij;   z̃_i ← z̃_i − β̃ m_ij;
        z_j ← z_j + β m_ij;   z̃_j ← z̃_j + β̃ m_ij;
        T^(i) ← ListTimes[k];   T^(j) ← ListTimes[k];
return (x_i)_{1≤i≤n}, the estimate of x* on each worker i.
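The "lazy" continuous-time update at the heart of Algorithm 2 can be sketched as follows. Note that the state ordering and the constants below are illustrative placeholders (not the tuned values of Appendix D.4): the point is only that, between two events involving node i, its state follows a linear ODE and can therefore be advanced exactly with a matrix exponential at event times.

```python
# Sketch of the "lazy" update of Algorithm 2 (our own illustration): between
# two events involving node i, its local state follows the linear ODE with
# matrix A, so it can be advanced exactly via a matrix exponential.
# The constants below are illustrative placeholders, not the tuned values
# of Appendix D.4.
import numpy as np
from scipy.linalg import expm

eta, eta_t = 0.5, 0.5       # η, η~ (placeholders)
alpha, alpha_t = 0.3, 0.3   # α, α~
theta, nu = 0.2, 0.1        # θ, ν

# State ordering: (x, x~, y, y~, z, z~).
A = np.array([
    [-eta,    eta,       0.0,    0.0,   0.0,      0.0],      # dx  = η (x~ - x)
    [ eta_t, -eta_t,     0.0,    0.0,   0.0,      0.0],      # dx~ = η~(x - x~)
    [ 0.0,    0.0,      -alpha,  alpha, 0.0,      0.0],      # dy  = α (y~ - y)
    [ 0.0,   -theta*nu, -theta,  0.0,  -theta,    0.0],      # dy~ = -θ(y + z + ν x~)
    [ 0.0,    0.0,       0.0,    0.0,  -alpha_t,  alpha_t],  # dz  = α~(z~ - z)
    [ 0.0,    0.0,       0.0,    0.0,   alpha_t, -alpha_t],  # dz~ = α~(z - z~)
])

state0 = np.array([1.0, 0.0, 0.5, 0.5, -0.2, 0.2])
state = expm(0.7 * A) @ state0          # advance the node state by Δt = 0.7

# The semigroup property exp((s+t)A) = exp(sA) exp(tA) is what lets each node
# update lazily, only when one of its own events fires.
two_hops = expm(0.3 * A) @ (expm(0.4 * A) @ state0)
assert np.allclose(state, two_hops)
print("lazy matrix-exponential update verified")
```

This is why each node only needs to store its last update time T^(i): any pending continuous evolution can be applied in one shot when the next event arrives.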



For the case of the 2-grid, a logarithmic term should appear, yet we decided to neglect it.

CONCLUSION

In this work, we have proposed a novel stochastic algorithm for the decentralized optimization of a sum of smooth and strongly convex functions. We have demonstrated, theoretically and empirically, that this algorithm systematically leads to a substantial acceleration compared to state-of-the-art works. Furthermore, our algorithm is asynchronous, decoupled, primal, and does not rely on an extra inner loop while being amenable to time-varying topology settings: each of these properties makes it suitable for real applications. In future work, we would like to explore the robustness of such algorithms to more challenging variabilities occurring in real-life applications.



Comparison with standard Continuized. If L is a Laplacian picked such that Tr L = 2 (thus f = 0), as in Even et al. (2021a), then Even et al. (2021a) claims that at least O

Furthermore, the computations of Even et al. (2021a) still use the dual gradients and are for a fixed topology.

Figure 1: Comparison between ADOM+ (Kovalev et al., 2021a) and DADAO, using the same data (from left to right: binary classification, linear regression) and the same sequence of random connected graphs with χ*₁ = 180 linking n = 20 workers.

Figure 2: Comparison between ADOM+(Kovalev et al., 2021a), the continuized framework (Even et al., 2021a), MSDA(Scaman et al., 2017) and DADAO, using the same data for the linear regression task, and the same graphs (from left to right: line with n = 150, complete with n = 250).

Figure 3: Examples of random geometric graphs of size n = 20 with χ * 1 taking values in, from left to right, χ * 1 ∈ {3, 33, 180, 233}.

Figure 4: Comparison between ADOM+(Kovalev et al., 2021a)  and DADAO, using the same data for linear regression on n = 20 workers and the same sequence of random connected graphs with varying topology and χ * 1 taking values in, from the left to the right column, χ * 1 ∈ {3, 33, 180, 233}.

Figure 5: Effect of the batch size B on the convergence of our method DADAO. Recall that the full batch size m equals 100.


This table shows the strength of DADAO compared to concurrent works. n is the number of nodes, |E| the number of edges, 1/χ₁ the smallest positive eigenvalue of a fixed stochastic gossip matrix, ρ the eigengap, and χ₂ ≤ χ₁ the effective resistance. Note that, under reasonable assumptions, √χ


Corollary D.9.1. Under Assumption 3.4, the bound of Proposition D.9 holds in expectation over ξ, up to the additional term σ²(δ² + (δ + δ̃)²)B̃_t on the right-hand side.

Proof. Exactly as above.

Proposition D.10. Finally, assuming θB̃_t = β̃C̃_t = αC_t, letting 1 ≥ τ̃ > 0, z⁺_ij = β(e_i − e_j)(e_i − e_j)^T(y + z) and z̃⁺_ij = β̃(e_i − e_j)(e_i − e_j)^T(y + z), we bound the gossip contributions to the potential, which involve the quantities λ_ij(t)‖(e_i − e_j)(e_i − e_j)^T(y + z)‖².

Proof. Having in mind that π(y* + z*) = 0 and Λ(t)⁺Λ(t) = π, we obtain the first bound by using Lemma D.1 and Lemma D.3 on inequality 52. We also use, as y⁺ = y, Lemma D.2, applied to the term β Σ_{(i,j)∈E(t)} λ_ij(t)⟨(e_i − e_j)(e_i − e_j)^T(y + z), y + z − y* − z*⟩. The momentum in z associated with ‖z − z*‖²_{Λ(t)⁺} and the one in z̃ associated with ‖y + z − y* − z*‖² are then collected. Assuming θB̃_t = β̃C̃_t = αC_t, we split the negative term (62) in two halves, upper-bounding one of the halves by remembering that ν/L ≤ 1 and introducing 1 ≥ τ̃ > 0. Keeping in mind that θB̃_t = β̃C̃_t = αC_t and (δ̃/2)B̃_t = αB_t, we put everything together.

Resolution (GD).

Proof of Lemma D.6. Our goal is to set to zero all the terms appearing next to scalar products, and to make the factors of the positive quantities (norms or divergences) non-positive. Given our relations, we guess that each exponential has the same rate. Thus, with τ > 0, we fix δ̃/2 = η = η̃ = α = τ√(ν/L), which leads to γ̃ = 2τ/√(νL) using Eq. 70. Next, from Eq. 72, Eq. 80 and Eq. 82, combined with Eq. 81 and the relation 4Ã_t = νA_t, we get δ = 4Lγ. We thus pick γ = 1/(4L) and τ = 1/8, so that δ = 1. Via Eq. 76, we fix τ̃ = 1/8 < 1. With Eq. 75, we also put α̃ = 2τ√(ν/L), and only one last equation, Eq. 74, needs to be satisfied, for which we pick β = 1/2. Finally, it is clear that all the equations are satisfied if we consider A_t, Ã_t, B_t, B̃_t, C_t, C̃_t as exponentials proportional to e^{τ√(ν/L) t}.
Let us pick A_0 = 1. If x̃_0 = x_0, y_0 = ỹ_0 = ∇f(x_0) and z_0 = z̃_0 = −π∇f(x_0), then, given the linear relations between A_t, Ã_t, B_t, B̃_t, C_t, C̃_t, the L-smoothness and the fact that π is an orthogonal projection, we can bound the initial potential Φ(0, X_0, Y_0). Now, using ν = µ/2 ≤ L and χ₁(0) ≤ χ*₁, and given that d_F(x, x*) = f(x) − f(x*), this implies in particular the guarantee of the Theorem.

Using the notations of Kovalev et al. (2021a), we know that gossip matrices satisfy, for q ∈ ℕ, a spectral condition governed by a parameter χ ≥ 1. It implies a contraction factor of 1 − √(1 − 1/χ), and, for χ large enough, 1 − √(1 − 1/χ) ≈ 1/(2χ). Consequently, up to a renormalization factor, we have χ*₁[W] ≈ 2χ and: Tr(W(q)) ≤ 2n.

F.2 ACCELERATION OF THE CONTINUIZED FRAMEWORK

Under the notations of Even et al. (2021a), we note that an additional simplification holds: θ'_ARG = θ_ARG. We recall that L = AA^T and that Ae_{vw} = √(P_vw)(e_v − e_w). Next, we note that, by definition, we can directly relate their bounds to ours. Then, as LL⁺ = I − π, we observe that this, together with the fact that χ₁[L] ≥ (n−1)/Tr L (see Lemma 3.1), leads to the claimed comparison in the setting of Even et al. (2021a), where Tr L = 2.

F.3 COMPARISON WITH METHODS THAT USE THE SPECTRAL GAP

We assume that sup_{(i,j)∈E(t)} λ_ij(t) / inf_{(i,j)∈E(t)} λ_ij(t) = O(1) as n grows. This condition can be understood as a way to prevent degenerate behaviors in the network's connectivity: the worst-case communication rate should always be greater than some fraction of the largest rate, with a fraction that does not grow with the network's size. This condition is always met if we assume there are both a lower and an upper bound on the communication rates of the channels linking the nodes, which seems reasonable in a physical setting. Then, using Lemma 3.1, the ratio between the gradient complexities of DADAO and MSDA is indeed O(1) (with a constant value of ≃ 14), while MSDA is indeed O(√n) worse than DADAO for communications on the star graph, as stated in Tab. 2.

Figure 6: Ratio between the slopes of MSDA and DADAO in log scale for star graphs of size n ∈ {10, 20, 70, 200, 300, 1000, 2000}.

H PRACTICAL IMPLEMENTATION

In this section, we describe in more detail the implementation of our algorithm. As we did not physically execute our method on a computer network but carried it out on a single machine, all the asynchronous computations and communications had to be simulated. Thus, we first discuss the method we followed to simulate our asynchronous framework before detailing the practical steps of our algorithm through pseudo-code.

H.1 SIMULATING THE POISSON POINT PROCESSES

To emulate the asynchronous setting, before running our algorithm, we generate two independent sequences of jump times at the graph's scale: one for the computations and one for the communications. As we consider independent P.P.Ps, the time increments between consecutive spikes follow an exponential distribution. At the graph's scale, each node spiking at rate 1, the Poisson parameter for the gradient-step process is n. Following the experimental setting of the continuized framework (Even et al., 2021a), we considered that all edges in E(t) had the same probability of spiking between t and t + dt. Thus, given the sequence of graphs E(t) and their corresponding Laplacians L(t), we computed the parameter λ* of the communication process accordingly. Having generated the two sequences of spiking times at the graph's scale, we run our algorithm by playing the events in order of appearance, attributing the location of each event by sampling uniformly one node if the event is a gradient step, or uniformly one edge in E(t) if it is a communication.
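The procedure above can be sketched as follows; the function name PPPspikes is taken from the text, but its interface and the rate values are our own illustrative assumptions:

```python
# Minimal sketch (our illustration, not the paper's code) of simulating the
# two independent Poisson point processes: inter-event times are exponential,
# with rate n for gradient events and rate lam_star for communication events;
# the merged, time-ordered list is then replayed sequentially.
import random

def ppp_spikes(rate, t_max, rng):
    """Jump times of a homogeneous Poisson process of given rate on [0, t_max]."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)   # exponential inter-arrival times
        if t > t_max:
            return times
        times.append(t)

rng = random.Random(0)
n, lam_star, t_max = 20, 35.0, 10.0
grad_times = ppp_spikes(n, t_max, rng)          # one stream for computations
comm_times = ppp_spikes(lam_star, t_max, rng)   # one stream for communications

# Merge the two streams into one ordered event list, tagging each event,
# to be replayed in order of appearance as in Algorithm 2.
events = sorted([(t, "grad") for t in grad_times] +
                [(t, "comm") for t in comm_times])

assert all(events[k][0] <= events[k + 1][0] for k in range(len(events) - 1))
# Sanity check: the counts concentrate around rate * t_max.
assert abs(len(grad_times) - n * t_max) < 5 * (n * t_max) ** 0.5
assert abs(len(comm_times) - lam_star * t_max) < 5 * (lam_star * t_max) ** 0.5
print(len(events), "events generated")
```

Replaying the merged list then only requires sampling a uniform node for each "grad" event and a uniform edge of E(t) for each "comm" event, exactly as described above.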

H.2 PSEUDO CODE

We keep the notations introduced in Eq. 3 and recall the following constant values specified in Appendix D.4:

