DIGEST: FAST AND COMMUNICATION EFFICIENT DECENTRALIZED LEARNING WITH LOCAL UPDATES

Abstract

Decentralized learning advocates the elimination of centralized parameter servers (aggregation points) for potentially better utilization of underlying resources, delay reduction, and resiliency against parameter server unavailability and catastrophic failures. Gossip-based decentralized algorithms, where each node in a network keeps its own local model and learns by exchanging models with its neighbors, have received a lot of attention recently. Despite their potential, Gossip algorithms introduce huge communication costs. In this work, we show that nodes do not need to communicate as frequently as in Gossip for fast convergence; in fact, a sporadic exchange of a global model is sufficient. Thus, we design DIGEST, a fast and communication-efficient decentralized learning mechanism, focusing in particular on stochastic gradient descent (SGD). DIGEST is a decentralized algorithm that builds on local-SGD algorithms, which were originally designed for communication-efficient centralized learning. We show through analysis and experiments that DIGEST significantly reduces the communication cost without hurting convergence time for both iid and non-iid data.

1. INTRODUCTION

Emerging applications such as the Internet of Things (IoT), mobile healthcare, and self-driving cars dictate that learning be performed on data predominantly originating at edge and end user devices (Gubbi et al., 2013; Li et al., 2018a). A growing body of research work, e.g., federated learning (McMahan et al., 2016; Kairouz et al., 2021; Konecný et al., 2015; McMahan et al., 2017; Li et al., 2020a; b), has focused on engaging the edge in the learning process, along with the cloud, by allowing the data to be processed locally instead of being shipped to the cloud. Learning beyond the cloud can be advantageous in terms of better utilization of network resources, delay reduction, and resiliency against cloud unavailability and catastrophic failures. However, the proposed solutions, like federated learning, predominantly rely on a critical centralized component, referred to as the Parameter Server (PS), which organizes and aggregates the devices' computations. Decentralized learning emerges as a promising solution to this problem.

Decentralized algorithms have been extensively studied in the literature, with Gossip algorithms receiving the lion's share of research attention (Boyd et al., 2006b; Nedic & Ozdaglar, 2009a; Koloskova et al., 2019; Aysal et al., 2009; Duchi et al., 2012a; Kempe et al., 2003; Xiao & Boyd, 2003; Boyd et al., 2006a). In Gossip algorithms, each node (edge or end user device) keeps its own local model and learns by exchanging models with its neighbors. This makes Gossip attractive from a failure-tolerance perspective. However, this comes at the expense of high network resource utilization. As shown in Fig. 1a, all nodes in a synchronous Gossip algorithm perform a model update and then wait to receive model updates from their neighbors. When a node has received all the updates from its neighbors, it aggregates them.
As seen, there must be data communication among all nodes after each model update, which is a significant communication overhead. Furthermore, some nodes may become a bottleneck for the synchronization, as these nodes (also called stragglers) can be delayed by computation and/or communication, which increases the convergence time.

Asynchronous Gossip algorithms, where nodes communicate asynchronously and without waiting for others, are promising for reducing idle time and eliminating stragglers, i.e., delayed nodes (Lian et al., 2018; Li et al., 2018b; Avidor & Tal-Israel, 2022). Indeed, asynchronous algorithms significantly reduce the idle times of nodes by performing model updates and model exchanges simultaneously, as illustrated in Fig. 1b. For example, node 1 can still update its model from $x_t^1$ to $x_{t+1}^1$ and $x_{t+2}^1$ while receiving model updates from its neighbors. When it has received updates from all (or a majority) of its neighbors, it performs model aggregation. However, asynchronous Gossip does not reduce communication overhead as compared to synchronous Gossip. Furthermore, the delayed updates in asynchronous Gossip, also referred to as gradient staleness, may lead to high error floors (Dutta et al., 2021), or require very strict assumptions to converge to the optimum solution (Lian et al., 2018).

Under review as a conference paper at ICLR 2023

Figure 1: DIGEST in perspective as compared to existing decentralized learning algorithms; (a) synchronous Gossip, (b) asynchronous Gossip, and (c) random walk. Note that "∇" represents a model update. "Xmit" represents the transmission of a model from a node to one of its neighbors. "Recv" represents the communication duration while receiving model updates from all of a node's neighbors. "A" represents model aggregation. $x_t^v$ shows the local model of node $v$ at iteration $t$. For the random-walk algorithm, the global model iterates are denoted as $x_t$.
Figure 2: Local-SGD with H sequential SGD steps in node v.

If Gossip algorithms are on one side of the spectrum of decentralized learning algorithms, the other side is random-walk based decentralized learning (Bertsekas, 1996; Ayache & Rouayheb, 2021; Sun et al., 2018; Needell et al., 2014). Random-walk algorithms activate one node at a time, which updates the global model with its local data, as illustrated in Fig. 1c. Then, the node selects one of its neighbors randomly and sends it the updated global model. The selected neighbor becomes the newly activated node, so it updates the global model using its local data. This continues until convergence. Random-walk algorithms significantly reduce the communication cost as well as computation and power utilization in the network, at the cost of increased convergence time.

Our key intuitions in this work are that (i) nodes do not need to communicate as frequently as in Gossip for fast convergence; in fact, a sporadic exchange of a model is sufficient, and (ii) nodes do not need to wait idle as in random walk. Thus, we design DIGEST, a fast and communication-efficient decentralized learning mechanism, focusing in particular on stochastic gradient descent (SGD). DIGEST is a decentralized algorithm that builds on local-SGD algorithms, which were originally designed for communication-efficient centralized learning (Stich, 2019; Wang & Joshi, 2021; Lin et al., 2020). In local-SGD, each node performs multiple model updates before sending the model to the PS, as illustrated in Fig. 2. The PS aggregates the updates received from multiple nodes and transmits the updated global model back to the nodes. The sporadic communication between nodes and the PS reduces the communication overhead. Our goal in this work is to exploit this idea for decentralized learning.
The following are our contributions.

• Design of DIGEST. We design DIGEST, a fast and communication-efficient decentralized learning mechanism, focusing in particular on stochastic gradient descent (SGD). DIGEST works as follows. DIGEST supports multiple streams of global model updates. For example, node 1 may transmit its semi-global model to node 2 while node 3 transmits its semi-global model to node 6, as illustrated in Fig. 3. We use the term semi-global model in multi-stream DIGEST because the global model can be obtained only after the semi-global models are aggregated. The motivation behind introducing multiple streams is to further improve the convergence time as compared to single-stream DIGEST. We note that the communication overhead increases with the number of streams, so there is a trade-off between convergence and communication overhead.

• Convergence analysis of DIGEST. We analyze the convergence of single- and multi-stream DIGEST, and prove that both algorithms approach the optimal solution asymptotically. Our convergence proof is novel in the sense that it does not require symmetric communication capabilities among nodes, which are needed for the Gossip convergence proof (Koloskova et al., 2020). Furthermore, our convergence proof holds for (i) any non-iid data distribution across nodes, and (ii) any (possibly time-varying) network topology as long as the underlying graph is connected.

• Evaluation of DIGEST. We evaluate the performance of DIGEST for (i) two data sets, w8a (Platt, 1999) and MNIST (Lecun et al., 1998), (ii) iid and non-iid data, and (iii) network topologies with different numbers of nodes. The simulation results confirm that the communication cost of DIGEST is low as compared to the baselines, and that its convergence time is better than or comparable to the baselines.

2. RELATED WORK

Decentralized optimization has been studied at least since Tsitsiklis (1984). Decentralized optimization algorithms, in which nodes interact with their neighbors to solve an optimization problem, are designed in Nedic & Ozdaglar (2009b); Chen & Sayed (2012); Duchi et al. (2012b). Despite their potential, these algorithms suffer from a bias with non-iid data (Yuan et al., 2016), and they require synchronization and orchestration among nodes, which is costly in terms of communication overhead.

Decentralized algorithms based on Gossip usually involve a mixing step where nodes compute their new models by mixing their own and their neighbors' models (Koloskova et al., 2020; Scaman et al., 2019; Xiao & Boyd, 2003). However, this is costly in terms of communication, as every node requires O(deg(G)) data exchanges for every model update. Also, some existing Gossip-based approaches require symmetric data exchanges, i.e., if node i sends to node j, then node i should also be able to receive from node j (Koloskova et al., 2020; Lian et al., 2018). Our goal in this paper is to reduce the communication cost in decentralized learning for any network topology and data distribution.

It is discussed in Giaretta & Girdzijauskas (2019) that existing Gossip-based algorithms usually make strong assumptions on the data distribution, the communication power of the nodes, and the connectivity among them. Violation of these assumptions may lead to slow convergence and/or bias in the final model (Giaretta & Girdzijauskas, 2019). To address such problems, Gossip SGD with periodic global averaging is proposed in Chen et al. (2021), a method for accelerating convergence on large and sparse networks by adding periodic global averaging to Gossip. For scenarios like wireless sensor networks, where global averaging is prohibitively expensive, it is suggested to occasionally use multiple Gossip communication steps in succession with no computations in between (Berahas et al., 2019). An asynchronous decentralized parallel stochastic gradient descent algorithm is designed in Lian et al. (2018), where nodes do not wait for all other nodes and only communicate in a decentralized manner. However, it has the same limitations as Gossip-based algorithms, since it uses a similar model exchange policy, and it additionally suffers from gradient staleness.
A random walk-based decentralized learning algorithm is proposed in Ayache & Rouayheb (2021), which is similar to work on random walk data sampling for stochastic gradient descent, e.g., (Sun et al., 2018; Needell et al., 2014). [...] $\mathcal{D}_v$ at node $v$ is $f_v(x) = \frac{1}{D_v} \sum_{i \in \mathcal{D}_v} f_i(x)$. We design DIGEST to solve for $x^*$.

3.2. SINGLE-STREAM DIGEST

DIGEST has two functionalities: (i) local model update at each node, and (ii) global model update and exchange among nodes. We first provide an overview of these functionalities and then describe the DIGEST algorithms in detail.

3.2.1. OVERVIEW

Local Model Update. We assume that time is slotted, and that a local model is updated at each slot (iteration). However, computing a gradient may take more than one slot, its duration may vary over time, and it may not align with slot boundaries. Thus, at each iteration $t$, all gradients that have been delayed up to iteration $t$ and have not been used in previous local updates are used to update the local model. We note that time slots across nodes do not need to be synchronized in DIGEST, as each node can have its own iteration sequence and update local and global models over its own sequence. The only assumption we make is that the slot sizes are the same across nodes, which can be decided a priori.

Let $L_T^v = \{l_t^v\}_{0 \le t < T}$ be the set of delayed gradient calculations at node $v$, where $l_t^v$ indicates that the local-SGD update of iteration $t$ is delayed until iteration $l_t^v$. For instance, $l_{t'}^v = t$ means that the local-SGD update of iteration $t'$ lags behind and is performed in iteration $t$, $t \ge t'$. Then, we define $u_t^v = \{t' \mid l_{t'}^v = t\}$ as the set of all updates completed at iteration $t$ at node $v$. If there is no global update at node $v$, the local model is updated as

$$x_{t+1}^v = x_t^v - \sum_{z \in u_t^v} \eta_z \nabla f_{i_z^v}(x_z^v),$$

where $\eta_z$ is the learning rate, $i_z^v$ is a sample chosen uniformly from $\mathcal{D}_v$ in iteration $z$, and $\nabla f_{i_z^v}(x_z^v)$ is the gradient. However, there may be global model updates [...] $s_t^v = 1$. Otherwise, i.e., when node $v$ does not receive the global model from its neighbors, the indicator is set to $s_t^v = 0$. If $s_t^v = 0$, then node $v$ updates its model locally according to the mechanism described above under Local Model Update. If $s_t^v = 1$, i.e., when node $v$ receives a global model from one of its neighbors, then the global model should be incorporated in the calculations. DIGEST sets the local model to the global model when there is a global model update, as follows.
$$x_t^v = \begin{cases} x_{t-1}^v - \sum_{z \in u_{t-1}^v} \eta_z \nabla f_{i_z^v}(x_z^v) & \text{if } s_t^v = 0, \\ \bar{x}_t & \text{if } s_t^v = 1. \end{cases} \quad (1)$$

The global model is updated as

$$\bar{x}_t = \bar{x}_{t-1} + \frac{D_v}{D} \Big( x_{t-1}^v - \sum_{z \in u_{t-1}^v} \eta_z \nabla f_{i_z^v}(x_z^v) - x_{\tau_{t-1}^v}^v \Big), \quad (2)$$

where $\bar{x}_{t-1}$ is the global model received by node $v$ at slot $t-1$. The global model $\bar{x}_t$ is updated by using $\bar{x}_{t-1}$ as well as the local updates of node $v$. We use $\tau_t^v$ to denote the last time slot up to $t$ at which node $v$'s model was updated with the global model, i.e., $\tau_t^v = \max\{t' \mid t' \le t, s_{t'}^v = 1\}$. The equivalent of (2) is

$$\bar{x}_t = x_0 - \sum_{v=1}^{V} \sum_{t'=0}^{\tau_t^v - 1} \sum_{z \in u_{t'}^v} \frac{D_v}{D} \eta_z \nabla f_{i_z^v}(x_z^v),$$

where $x_0$ is the initial model. As seen, the global model is updated across all nodes by taking into account all delayed gradient calculations. We use the ratio $\frac{D_v}{D}$ to give more weight to the gradients of nodes with larger data sets. Having provided an overview of DIGEST, we next detail how the DIGEST algorithms operate.
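The update rules (1) and (2) can be sketched for a single node as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `NodeState`, its field names, and the helper `axpy` are hypothetical, models are plain Python lists, and delayed gradients are stored already scaled by their learning rate.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Vector = List[float]


def axpy(a: float, x: Vector, y: Vector) -> Vector:
    """Return a * x + y element-wise."""
    return [a * xi + yi for xi, yi in zip(x, y)]


@dataclass
class NodeState:
    """Hypothetical per-node state for the single-stream update rules."""
    x: Vector            # current local model x_t^v
    x_last_sync: Vector  # local model right after the last global update
    weight: float        # D_v / D
    # completion slot l -> list of (eta_z * gradient) finishing at slot l
    pending: Dict[int, List[Vector]] = field(default_factory=dict)

    def schedule(self, done_at: int, lr: float, grad: Vector) -> None:
        """Register a gradient whose computation completes at slot done_at."""
        self.pending.setdefault(done_at, []).append([lr * g for g in grad])

    def local_step(self, t: int) -> None:
        """Apply every delayed gradient that completes at slot t (case s = 0)."""
        for scaled in self.pending.pop(t, []):
            self.x = axpy(-1.0, scaled, self.x)

    def global_step(self, x_global: Vector) -> Vector:
        """Fold local progress since the last sync into the global model (Eq. 2),
        then overwrite the local model with the result (case s = 1 of Eq. 1)."""
        progress = axpy(-1.0, self.x_last_sync, self.x)  # x - x_last_sync
        new_global = axpy(self.weight, progress, x_global)
        self.x = list(new_global)
        self.x_last_sync = list(new_global)
        return new_global
```

Calling `local_step(t)` and then `global_step(received_model)` in the same slot mirrors the case where a global model arrives; calling only `local_step(t)` mirrors a purely local slot.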

3.2.2. ALGORITHM DESIGN

Algorithm 1 Local and global model update of DIGEST at node $v \in \mathcal{V}$.
1: Initialization: $x_0^v = x_0$, $x_{-1}^v = x_0$, $\bar{x}_0 = x_0$, visited = {}, pre_node = $v$, $S_T^v = \{0\}_{0 < t \le T}$, $s_1^{v_0} = 1$.
2: for $t$ in $0, \ldots, T-1$ do
3:   Sample $i_t^v$ uniformly from $\mathcal{D}_v$.
4:   Compute the gradient $\nabla f_{i_t^v}(x_t^v)$.
5:   $x_{t+1}^v = x_t^v - \sum_{z \in u_t^v} \eta_z \nabla f_{i_z^v}(x_z^v)$ ▷ Local model update.
6:   if received new message from another node then
7:     ($\bar{x}_t$, visited, pre_node, 0) ← message
8:     $s_{t+1}^v = 1$
9:   if $s_{t+1}^v = 1$ then
10:     $\bar{x}_{t+1} = \bar{x}_t + \frac{D_v}{D}(x_{t+1}^v - x_{-1}^v)$
11:     $x_{t+1}^v = \bar{x}_{t+1}$ ▷ Local model is updated using global model.
12:     $x_{-1}^v = x_{t+1}^v$
13:     if mod$(t, H) = 0$ or visited $\ne \mathcal{V}$

We define visited as the set of nodes that have recently been visited for global model updates. It is initialized as an empty set at node $v$. We define a period of time during which all the nodes in $\mathcal{V}$ are visited at least once as a synchronization round. During a synchronization round, all nodes update their local models with a global model, as each of them is visited at least once. More details regarding the visited set are provided as part of Alg. 2. The node from which node $v$ receives the global model is denoted by pre_node; its initial value is set to $v$, as there is no previous node at the start. The set of global model update indicators, $S_T^v = \{s_t^v\}_{0 < t \le T}$, is initialized with all indicators set to 0, where $T$ is the number of slots for which Alg. 1 runs. Assuming that $v_0$ is the node where the global model update starts, $s_1^{v_0}$ is set to 1, i.e., $s_1^{v_0} = 1$.

Algorithm 2 Sending global model from node $v \in \mathcal{V}$.
Input: message = ($\bar{x}_{t+1}$, visited, pre_node, $r$)
1: if visited $= \mathcal{V}$ then
2:   visited = {}
3: if $v \notin$ visited then
4:   visited = visited $\cup \{v\}$
5:   $p_v$ = pre_node
6: $C = \{v' \in \mathcal{N}_v \mid v' \notin \text{visited}\}$
7: if $C \ne \emptyset$ then
8:   select $v'$ randomly from $C$.
9:   Send message = ($\bar{x}_{t+1}$, visited, $v$, $r$) to node $v'$.
10: else
11:   Send message = ($\bar{x}_{t+1}$, visited, $v$, $r$) to node $p_v$.
At every iteration $t$, node $v$ first draws one data sample randomly from its local dataset (line 3), and computes a stochastic local gradient (line 4) based on the selected data sample and the current model at node $v$, i.e., $x_t^v$. Then, node $v$ performs the local model update using all the gradients whose computations are delayed until iteration $t$ and that have not yet been used in local model updates (line 5). If node $v$ receives a "message" from one of its neighbors at slot $t$, then it should update the global model. Each message contains the global model $\bar{x}_t$, the set of visited nodes, i.e., visited, the id of the node that sends this message to node $v$, e.g., $v'$, and a parameter $r$, which is always set to 0 in single-stream DIGEST, but may take different values in multi-stream DIGEST. After the message is extracted (line 7), the global model update indicator is set to 1 (line 8), and the global model is updated (lines 10 - 12). In particular, the global model is updated using the most recent local model of node $v$ (line 10). The local model is updated with the global model (line 11). The current local model is stored at node $v$ and will be used in the next global update (line 12).

Algorithm 3 DIGEST on node $v \in \mathcal{V}$ with $R$ synchronization streams.
1: Initialization: $x_0^v = x_0$, $x_{-1}^v = x_0$, queue = ().
2: for $r$ in $0, \ldots, R-1$ do
3:   $\bar{x}_0[r] = x_0$, $\bar{x}_{-1}[r] = x_0$, visited[r] = {}, pre_node[r] = $v$, $S_T^v[r] = \{0\}_{0 < t \le T}$, $s_1^{v_r}[r] = 1$.
4: for $t$ in $0, \ldots, T-1$ do
5:   Sample $i_t^v$ uniformly from $\mathcal{D}_v$.
6:   Start computing the gradient $\nabla f_{i_t^v}(x_t^v)$.
7:   $x_{t+1}^v = x_t^v - \sum_{z \in u_t^v} \eta_z \nabla f_{i_z^v}(x_z^v)$
8:   if queue $\ne$ () then
9:     for any message in queue do
10:       ($\bar{x}_t[r]$, visited[r], pre_node[r], $r$) ← message
11:       $s_{t+1}^v[r] = 1$
12:       Remove message from queue
13:   for $r$ in $0, \ldots, R-1$ do
14:     if $s_{t+1}^v[r] = 1$ then
15:       $\bar{x}_{t+1}[r] = \bar{x}_t[r] + \frac{D_v}{D}(x_{t+1}^v - x_{-1}^v)$
16:       $x_{t+1}^v = x_{-1}^v + (\bar{x}_{t+1}[r] - \bar{x}_{-1}[r])$ ▷ Local model update.
17:       $x_{-1}^v = x_{t+1}^v$ ▷ Last updated model at node $v$
18:       $\bar{x}_{-1}[r] = \bar{x}_{t+1}[r]$ ▷ Last updated model at node $v$ corresponding to stream $r$
19:       if mod$(t, H) = 0$ or visited[r] $\ne \mathcal{V}$ then
20:         Send message = ($\bar{x}_{t+1}[r]$, visited[r], pre_node[r], [...]
        [...] $s_{t+2}^v[r] = 1$

If the global model is updated at node $v$, i.e., if $s_{t+1}^v = 1$, then node $v$ creates a message and sends it to one of its neighbors if (i) visited $\ne \mathcal{V}$, i.e., not all nodes have been visited in the current synchronization round; or (ii) mod$(t, H) = 0$, which indicates the start of a new synchronization round and occurs periodically every $H$ iterations. In other words, global model synchronization continues until all nodes in $\mathcal{V}$ are visited. Then, the global model update is paused until a new synchronization round (satisfied by line 16), which starts every $H$ iterations. We will describe how $H$ should be selected later in the paper as part of our convergence analysis and evaluations. If one of the conditions in line 13 is satisfied, then node $v$ sends the global model to one of its neighbors by calling Alg. 2.

Sending Global Model. Alg. 2 describes the logic of DIGEST at node $v$ for sending a global model to a neighboring node. Alg. 2 implements a Depth-First Search (DFS) to traverse all the nodes in the network in a synchronization round. If all nodes have been visited, i.e., at the end of a synchronization round, visited is reset to an empty set (line 2). If $v$ has not been visited before in this synchronization round, it is added to visited (line 4) and its parent node $p_v$ is set to pre_node (line 5). The parent node is the node from which node $v$ receives the global model for the first time in this synchronization round. $C$ is the set of nodes to which node $v$ can transmit; it includes all of the neighboring nodes that are not in the visited set. If $C$ is not empty, one of its elements $v'$ is chosen randomly (line 8) and a message including the global model is transmitted from node $v$ to node $v'$.
If C is empty, i.e., all the neighbors of node v are visited in the current synchronization round, the message is sent to the parent of node v (p v ) (line 11). We note that if all the nodes are visited in the network, Alg. 1 pauses global model update (line 16), and Alg. 2 is not called. We also add that Alg. 2 and Alg. 1 operate simultaneously; one does not need to stop and wait for the other as also illustrated in Fig. 1d .
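The forwarding rule of Alg. 2 can be sketched as below. This is an illustrative sketch rather than the authors' code: the function names `next_hop` and `synchronization_round`, the adjacency-dictionary representation with integer node ids, and the driver loop that stops once every node has been visited (mirroring the pause in Alg. 1) are assumptions made for this example.

```python
import random
from typing import Dict, List, Set, Tuple


def next_hop(v: int, neighbors: Dict[int, List[int]], all_nodes: Set[int],
             visited: Set[int], parent: Dict[int, int], pre_node: int,
             rng: random.Random) -> Tuple[int, Set[int]]:
    """One invocation of the forwarding rule at node v.

    Returns the destination of the message and the updated visited set.
    """
    # A completed round resets the visited set (lines 1-2).
    visited = set() if visited == all_nodes else set(visited)
    if v not in visited:              # first visit in this round (lines 3-5)
        visited.add(v)
        parent[v] = pre_node
    candidates = sorted(n for n in neighbors[v] if n not in visited)  # line 6
    if candidates:                    # go deeper (lines 7-9)
        return rng.choice(candidates), visited
    return parent[v], visited         # backtrack to the parent (lines 10-11)


def synchronization_round(start: int, neighbors: Dict[int, List[int]],
                          rng: random.Random) -> List[int]:
    """Sequence of nodes holding the model during one synchronization round."""
    all_nodes = set(neighbors)
    visited: Set[int] = set()
    parent: Dict[int, int] = {}
    path, v, pre = [start], start, start
    while True:
        dest, visited = next_hop(v, neighbors, all_nodes, visited, parent, pre, rng)
        if visited == all_nodes:      # round complete: sending is paused
            return path
        path.append(dest)
        pre, v = v, dest
```

For example, on a star graph with center 0 and leaves 1 and 2, `synchronization_round(0, {0: [1, 2], 1: [0], 2: [0]}, random.Random(0))` visits one leaf, backtracks to the center, and then visits the other leaf.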

3.3. MULTI-STREAM DIGEST

The extended version of the local and global model update algorithm of DIGEST, which supports multiple streams, is summarized in Alg. 3. The differences between Algs. 3 and 1 are as follows. There are multiple semi-global models in different streams, i.e., $\bar{x}_t[r]$ corresponds to the semi-global model in stream $r$ out of $R$ streams. Each node stores $R$ models, i.e., $\bar{x}_{-1}[r]$ represents the semi-global model corresponding to the last synchronization of stream $r$ at node $v$. We define visited[r], pre_node[r], and $s_t^v[r]$ for each stream $r$. Each node $v$ has a queue to store all the messages that it receives from its neighbors. The queue is initially empty. Whenever node $v$ receives a message from one of its neighbors, the message is added to the queue. Each node can receive up to $R$ messages related to different streams, so the size of the queue is $R$. Each message carries a stream index $r$ (line 10). Node $v$ extracts all the messages in its queue (lines 9-12). Then, it updates its semi-global and local models as in Alg. 1 if $s_{t+1}^v[r] = 1$. The semi-global models are accumulated in the local models and add up to the global model. In particular, the local model is updated using the semi-global models (line 16), and just one semi-global model is updated for every specific local update (line 15).
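The per-stream update in lines 15-18 of Alg. 3 can be sketched as follows. The class `MultiStreamNode` and its attribute names are hypothetical and chosen for this illustration only; message queues, gradient computation and forwarding are left out, and models are plain Python lists.

```python
from typing import List

Vector = List[float]


def add(x: Vector, y: Vector) -> Vector:
    return [a + b for a, b in zip(x, y)]


def sub(x: Vector, y: Vector) -> Vector:
    return [a - b for a, b in zip(x, y)]


def scale(c: float, x: Vector) -> Vector:
    return [c * a for a in x]


class MultiStreamNode:
    """Hypothetical node state for R synchronization streams."""

    def __init__(self, x0: Vector, weight: float, num_streams: int) -> None:
        self.x = list(x0)              # local model x_{t+1}^v
        self.x_prev = list(x0)         # local model at the last update, x_{-1}^v
        self.weight = weight           # D_v / D
        # Semi-global model seen at the last synchronization of each stream.
        self.stream_prev = [list(x0) for _ in range(num_streams)]

    def apply_stream(self, r: int, semi_global: Vector) -> Vector:
        """Process one received semi-global model of stream r."""
        # Line 15: add this node's local progress to the semi-global model.
        new_semi = add(semi_global, scale(self.weight, sub(self.x, self.x_prev)))
        # Line 16: add what stream r gathered since its last visit to the local model.
        self.x = add(self.x_prev, sub(new_semi, self.stream_prev[r]))
        self.x_prev = list(self.x)             # line 17
        self.stream_prev[r] = list(new_semi)   # line 18
        return new_semi
```

Because the local progress `x - x_prev` is zero immediately after a stream has been processed, a second stream handled in the same slot leaves its own semi-global model unchanged and only contributes to the local model, which is consistent with the statement that one semi-global model is updated per local update.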

4. CONVERGENCE ANALYSIS OF DIGEST

We use the following assumptions for the convergence analysis of DIGEST.

1. Smooth local loss. The function $f_v$ is continuously differentiable and its gradient is $L$-Lipschitz for $1 \le v \le V$, i.e., $\|\nabla f_v(y) - \nabla f_v(x)\| \le L \|y - x\|$, $\forall x, y \in \mathbb{R}^d$.
2. Convexity. The function $f$ is $\mu$-strongly convex, i.e., $\forall x, y \in \mathbb{R}^d$, $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2$.
3. Bounded local variance. The variance of the stochastic gradient is bounded for all nodes, i.e., $\mathbb{E}_{i_t^v} \|\nabla f_{i_t^v}(x_t^v) - \nabla f_v(x_t^v)\|^2 \le \sigma^2$ for $0 \le t < T$, $1 \le v \le V$.
4. Bounded second moment. The expected squared norm of the stochastic gradient is bounded, i.e., $\mathbb{E}_{i_t^v} \|\nabla f_{i_t^v}(x_t^v)\|^2 \le G^2$ for $0 \le t < T$, $1 \le v \le V$.
5. Bounded lag. We assume bounded lag, i.e., $\max\{l_t^v - t\} \le E$ for $0 \le t < T$, $1 \le v \le V$.
6. Bounded synchronization interval. We assume that the interval between two subsequent synchronizations is bounded, i.e., $\text{gap}(S_T^v) \le H$ for $1 \le v \le V$, where $\text{gap}(S_T^v)$ is the maximum gap between two subsequent 1s in $S_T^v$.

Theorem 4.1 (Asymptotic result for single-stream DIGEST). Let assumptions 1-6 hold, and let the learning rate be $\eta_t = \frac{4}{\mu(a+t)}$ with $a > \max\{\frac{16L}{\mu}, H + E, 2\}$. The convergence rate of single-stream DIGEST is

$$\mathbb{E} f(\hat{x}_T) - f^* \le O\Big(\frac{1}{T} + \frac{H+E}{T^2}\Big) \rho \sigma^2 + O\Big(\frac{(H+E)^3}{T^3}\Big) \|x_0 - x^*\|^2 + O\Big(\frac{V \rho (H+E)^2}{T^2} \Big(1 + \frac{\ln(T+H+E)}{T}\Big)\Big) G^2, \quad (3)$$

where $\hat{x}_T = \frac{1}{D S_T} \sum_{v=1}^{V} \sum_{t=0}^{T-1} D_v \omega_t x_t^v$, $\omega_t = (a+t)^2$, $S_T = \sum_{t=0}^{T-1} \omega_t$, and $\rho = \sum_{v=1}^{V} (\frac{D_v}{D})^2$.

Remark. The convergence rate to the optimum value $f^*$ is $O(\frac{\rho}{T})$ if $H + E \le O(\sqrt{T/V})$, so the optimality gap approaches zero asymptotically. Here $\rho = \sum_{v=1}^{V} (\frac{D_v}{D})^2$ is a data concentration coefficient that takes values $\frac{1}{V} \le \rho \le 1$. If all the nodes have the same amount of data, i.e., $\rho = \frac{1}{V}$, then a linear speedup in the convergence rate, $O(\frac{1}{VT})$, is observed. On the other hand, in the extreme scenario $\rho = 1$, where one node has all the data, the rate is $O(\frac{1}{T})$.

Sketch of Proof of Theorem 4.1.
(The details of the proof are provided in the supplementary material.) We define a virtual sequence $\{\tilde{x}_t\}_{t \ge 0}$ as $\tilde{x}_t = x_0 - \sum_{v=1}^{V} \sum_{z=0}^{t-1} \frac{D_v}{D} \eta_z \nabla f_{i_z^v}(x_z^v)$, following a similar idea in Stich (2019). We also define

$$g_t = \sum_{v=1}^{V} \frac{D_v}{D} \nabla f_{i_t^v}(x_t^v), \quad \bar{g}_t = \sum_{v=1}^{V} \frac{D_v}{D} \nabla f_v(x_t^v), \quad g_t^* = \sum_{v=1}^{V} \frac{D_v}{D} \nabla f(x_t^v).$$

Let $i_t = \{i_t^1, \ldots, i_t^V\}$ denote the data samples selected randomly during time slot $t$ in all nodes. It can be seen that $\bar{g}_t = \mathbb{E}_{i_t} g_t$, and $g_t^*$ is the real direction of optimal convergence at every step. The virtual sequence is updated as $\tilde{x}_{t+1} = \tilde{x}_t - \eta_t g_t$. We first illustrate how the virtual sequence $\{\tilde{x}_t\}_{t \ge 0}$ approaches the optimal solution in Lemma 4.2.

Lemma 4.2. If assumptions 1-2 hold, and $\eta_t \le \frac{1}{4L}$, then

$$\mathbb{E} \|\tilde{x}_{t+1} - x^*\|^2 \le \Big(1 + \frac{\mu}{5} \eta_t\Big)(1 - \mu \eta_t) \mathbb{E} \|\tilde{x}_t - x^*\|^2 - \frac{\eta_t}{2} (\mathbb{E} f(\tilde{x}_t) - f^*) + 2 \eta_t^2 \mathbb{E} \|g_t - \bar{g}_t\|^2 + \Big[\Big(1 + \frac{\mu}{5} \eta_t\Big) 2 \eta_t L + \Big(\frac{5}{\mu} \eta_t + 2 \eta_t^2\Big) 4 L^2\Big] \sum_{v=1}^{V} \frac{D_v}{D} \mathbb{E} \|\tilde{x}_t - x_t^v\|^2.$$

Lemma 4.2 indicates how the convergence criterion $\mathbb{E} f(\tilde{x}_t) - f^*$ is related to $\mathbb{E} \|\tilde{x}_{t+1} - x^*\|^2$ and $\mathbb{E} \|\tilde{x}_t - x^*\|^2$, which can be handled with a telescoping sum. $\mathbb{E} \|g_t - \bar{g}_t\|^2$ is related to the local variance and is bounded in Lemma 4.3. $\mathbb{E} \|\tilde{x}_t - x_t^v\|^2$ is the deviation between the virtual and actual sequences, and we find an upper bound for this term in Lemma 4.4.

Lemma 4.3 (Bounding variance). If assumption 3 holds, then $\mathbb{E} \|g_t - \bar{g}_t\|^2 \le \rho \sigma^2$.

Lemma 4.4 (Bounding deviation). If assumptions 4-6 hold, and $\eta_t \le 2 \eta_{t+H+E}$ for $0 \le t \le T-1$, $1 \le v \le V$, then $\sum_{v=1}^{V} \frac{D_v}{D} \mathbb{E} \|\tilde{x}_t - x_t^v\|^2 \le 64 V \rho \eta_t^2 (H+E)^2 G^2$.

Now, we focus on the convergence of multi-stream DIGEST. We make the following assumptions.

7. Strongly bounded synchronization interval. We assume that the interval between two subsequent synchronizations is bounded for all streams, i.e., $\text{gap}(S_T^v[r]) \le H$, $1 \le v \le V$, $1 \le r \le R$. The duration between two subsequent synchronizations in node $v$ by any two streams satisfies $\text{gap}(\vee_{1 \le r \le R} S_T^v[r]) \le \frac{H}{R} + \delta$, where $\delta$ is a constant that handles special cases in which the duration is longer due to an uneven arrangement of streams. Note that $\vee_i A_i$ denotes the element-wise logical OR of all $A_i$.

8. Efficient covering. We assume that

$$\mathbb{E} \sum_{r=1}^{R} \sum_{v' \in B_r^v(t)} \Big(\frac{D_{v'}}{D}\Big)^2 \le c \rho, \quad 0 \le t < T, \ 1 \le v \le V,$$

where $c$ is a constant. We define $B_r^v(t) = [v' \mid s_{t'}^{v'}[r] = 1, \tau_t^v[r] \le t' \le t]$ as the list of nodes that are visited by stream $r$ after the last visit of this stream at node $v$ until $t$ (repeated nodes may appear in the list).

Theorem 4.5 (Asymptotic result for multi-stream DIGEST). Let assumptions 1-5 and 7-8 hold, and let the learning rate be $\eta_t = \frac{4}{\mu(a+t)}$ with $a > \max\{\frac{16L}{\mu}, H + E, 2\}$. The convergence rate of multi-stream DIGEST (and hence of single-stream DIGEST as a special case) is

$$\mathbb{E} f(\hat{x}_T) - f^* \le O\Big(\frac{1}{T} + \frac{H+E}{T^2}\Big) \rho \sigma^2 + O\Big(\frac{(H+E)^3}{T^3}\Big) \|x_0 - x^*\|^2 + O\Big(\frac{\rho (\frac{H}{R} + \delta + E)^2 (V + c R h_{\max})}{T^2} \Big(1 + \frac{\ln(T+H+E)}{T}\Big)\Big) G^2, \quad (4)$$

where $h_{\max}$ is the maximum value of $h(u, v)$, which is defined as the expected number of steps for a random walk between $u$ and $v$. The details of the proof are provided in the supplementary material.

Remark. The convergence rate to the optimum value $f^*$ with $R$ streams is $O(\frac{\rho}{T})$ if $H + R(E + \delta) \le O\Big(\sqrt{\frac{T R^2}{V + c R h_{\max}}}\Big)$.

We examine the convergence performance of logistic regression, i.e., $f(x) = \frac{1}{D} \sum_{i=1}^{D} \text{CrossEntropy}(\text{softmax}(x a_i), b_i) + \frac{\lambda}{2} \|x\|^2$, where $a_i \in \mathbb{R}^d$ and $b_i$ are the feature and label of data sample $i$. The regularization parameter is set to $\lambda = \frac{1}{D}$. We consider two network topologies: Erdős-Rényi graphs with $V = 10$ and $V = 100$ nodes, each with a connection probability of 0.3. We use the datasets w8a (Platt, 1999) and MNIST (Lecun et al., 1998). We use two different data distributions over nodes: (i) iid-balanced, and (ii) non-iid-unbalanced. In the iid-balanced case, the data set is shuffled and equally divided over the nodes. In the non-iid-unbalanced case, we first sort the data samples based on their labels. Then, the sizes of the local datasets follow a geometric series. For each run, we measure the global loss $f(x)$ during the optimization.
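The objective and the unbalanced partition can be sketched in plain Python as follows. This is a sketch under assumptions, not the experimental code: the model `x` is represented as one weight row per class, the geometric ratio is a free parameter that the text does not specify, and the helper names `objective`, `geometric_sizes` and `concentration` are hypothetical. The last helper evaluates the data concentration coefficient $\rho$ used in the convergence analysis.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def objective(x: Sequence[Sequence[float]], features: Sequence[Sequence[float]],
              labels: Sequence[int], lam: float) -> float:
    """Mean softmax cross-entropy plus (lam / 2) * squared norm of x.

    x holds one weight row per class; labels are class indices.
    """
    total = 0.0
    for a, b in zip(features, labels):
        logits = [sum(w * f for w, f in zip(row, a)) for row in x]
        total += -math.log(softmax(logits)[b])
    sq_norm = sum(w * w for row in x for w in row)
    return total / len(features) + 0.5 * lam * sq_norm


def geometric_sizes(total: int, num_nodes: int, ratio: float) -> List[int]:
    """Local dataset sizes proportional to ratio**k (ratio is an assumed parameter)."""
    weights = [ratio ** k for k in range(num_nodes)]
    norm = sum(weights)
    sizes = [int(total * w / norm) for w in weights]
    sizes[0] += total - sum(sizes)  # give the rounding remainder to the first node
    return sizes


def concentration(sizes: Sequence[int]) -> float:
    """Data concentration coefficient rho = sum_v (D_v / D)^2."""
    total = sum(sizes)
    return sum((s / total) ** 2 for s in sizes)
```

With equal local dataset sizes, `concentration` returns $1/V$; the more skewed the geometric partition, the closer the value moves to 1.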
We calculate the loss for different weighted averages of the models over iterations: the last model, the uniform average, the average with linear weights, and the average with quadratic weights (as in Theorem 4.1). Finally, the minimum is reported. In practice, the final model may be adequate, and an auxiliary sequence can track the weighted average of the iterates without having to store the models of all previous iterations; some examples can be seen in Table 1 of (Stich, 2019). We run the optimization using $\eta_t = \frac{1}{10 + t/1000}$. To derive the plots of convergence over time, we assume that each iteration of local SGD takes 1 millisecond. The communication delay between every two neighbors is assumed to follow an exponential distribution whose mean is randomly chosen between 0 and 10 milliseconds. The numerical experiments were run on Ubuntu 20.04 using 36 Intel Core i9-10980XE processors. We repeat each experiment 50 times and present error bars associated with the randomness of the optimization. In every figure, we show the average with error bars of 3 standard deviations.

Fig. 4 shows the convergence behavior on the w8a dataset in the 10-node and 100-node topologies. We see that PGA algorithms do not perform well in such an environment, where global averaging and several Gossip steps take a long time to complete. Sync-Gossip with H = 1 does not perform well, as performing Gossip communications in every iteration increases the communication cost and hence the convergence time. This observation is supported by the fact that the results are significantly better when we execute more local-SGD steps by raising H to 200 in Sync-Gossip. One link-Gossip and URW have similar performance. This observation suggests that for w8a, performing simultaneous computations in all nodes (as in One link-Gossip) without proper communication does not improve the convergence speed. DIGEST, Sync-Gossip with H = 200, and Async-Gossip have similar performance in Fig. 4a.
On the other hand, we observe that Gossip-based algorithms and URW suffer from slow convergence due to the data distribution in Figs. 4b and 4c, while DIGEST performs better as it (i) supports non-iid data, and (ii) has less communication overhead (and hence better wall-clock convergence time), an advantage that is amplified in Fig. 4c, where there are more nodes. Fig. 5a shows the convergence time of multi-stream DIGEST for the non-iid-unbalanced data distribution over the 100-node topology with the MNIST dataset. Using multi-stream DIGEST (Alg. 3), we have simulated the results for different values of R, i.e., the number of streams in the network. Note that even after increasing the number of streams, the overall communication overhead is still low, as illustrated in Fig. 5b, thanks to the local-SGD and periodic global model updates of DIGEST.
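The weighted averages of iterates used in the evaluation can be tracked incrementally, without storing earlier models. The sketch below is an assumption about how such an auxiliary sequence might be kept; the class `WeightedAverage` and its parameters are hypothetical. The weight of iterate $t$ is $(a + t)^p$, so $p = 0$, $1$, $2$ give the uniform, linear and quadratic averages, and `learning_rate` is the step size stated in the experimental setup.

```python
from typing import List, Sequence


class WeightedAverage:
    """Running weighted mean of iterates with weights (a + t) ** power."""

    def __init__(self, dim: int, a: float, power: int) -> None:
        self.a = a
        self.power = power
        self.total_weight = 0.0
        self.mean: List[float] = [0.0] * dim
        self.t = 0

    def update(self, x: Sequence[float]) -> List[float]:
        """Fold iterate x_t into the running mean and return the new mean."""
        w = (self.a + self.t) ** self.power
        self.total_weight += w
        if self.total_weight > 0:  # skip the degenerate case of a zero first weight
            step = w / self.total_weight
            self.mean = [m + step * (xi - m) for m, xi in zip(self.mean, x)]
        self.t += 1
        return self.mean


def learning_rate(t: int) -> float:
    """Step size eta_t = 1 / (10 + t / 1000) from the experimental setup."""
    return 1.0 / (10.0 + t / 1000.0)
```

Each update costs one pass over the model, so several averages (uniform, linear, quadratic) can be maintained side by side and compared at the end of a run.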

6. CONCLUSION

We designed DIGEST, a fast and communication-efficient decentralized learning mechanism, focusing in particular on stochastic gradient descent (SGD). We designed single- and multi-stream DIGEST to exploit the trade-off between convergence rate and communication overhead. We analyzed the convergence of single- and multi-stream DIGEST, and proved that both algorithms approach the optimal solution asymptotically. The simulation results confirm that the communication cost of DIGEST is low as compared to the baselines, and that its convergence time is better than or comparable to the baselines.

7. APPENDIX: NOTATION

$f_i(x)$: Loss function of model $x$ associated with the data sample $i$
$f(x)$: Global loss function of model $x$
$f_v(x)$: Local loss function of model $x$ at node $v$
$f^*$: $\min_{x \in \mathbb{R}^d} f(x)$
$x^*$: $\arg\min_{x \in \mathbb{R}^d} f(x)$
$x_0$: Initial model
$\eta_t$: Learning rate at iteration $t$
$l_t^v$: Completion time of the local-SGD update started at $t$
$L_T^v$: Set of $\{l_t^v\}_{0 \le t < T}$
$x_t^v$: Local model in node $v$ at $t$
$s_t^v$: Binary variable that shows if node $v$ receives the global model at $t$ in single-stream DIGEST
$S_T^v$: Set of $\{s_t^v\}_{0 < t \le T}$
$s_t^v[r]$: Binary variable that shows if node $v$ receives the semi-global model at $t$ from stream $r$ in multi-stream DIGEST
$S_T^v[r]$: Set of $\{s_t^v[r]\}_{0 < t \le T}$
$\bar{x}_t[r]$: The semi-global model received by node $v$ at $t$ from stream $r$ in multi-stream DIGEST
visited: Set of nodes that are visited for the global model update in the most recent synchronization round for single-stream DIGEST
pre node: The node that node $v$ receives the global model from in single-stream DIGEST
pre node[r]: The node that node $v$ receives the semi-global model from in stream $r$ for multi-stream DIGEST
$p_v$: The node that node $v$ receives the (semi-)global model from for the first time in the current synchronization round
$h(u, v)$: The expected number of steps for the random walk between $u$ and $v$
$h_{\max}$: Maximum value of $h(u, v)$ over all ordered pairs of nodes
$\delta$: Constant that bounds the intervals between two subsequent visits of a node by all streams
$c$: Constant that determines how efficiently the multiple streams are covering the whole network
$B_r^v(t)$: List of nodes that stream $r$ visits after the last visit of node $v$ until $t$

7.2. SINGLE-STREAM DIGEST

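Motivated by Stich (2019), a virtual sequence $\{\tilde{x}_t\}_{t \ge 0}$ is defined as follows:

$$\tilde{x}_t = x_0 - \sum_{v=1}^{V} \sum_{z=0}^{t-1} \frac{D_v}{D}\, \eta_z \nabla f_{i_z^v}(x_z^v).$$

This sequence does not need to be computed explicitly by the algorithm; it is used only for the analysis. We also define

$$g_t = \sum_{v=1}^{V} \frac{D_v}{D} \nabla f_{i_t^v}(x_t^v), \qquad \bar{g}_t = \sum_{v=1}^{V} \frac{D_v}{D} \nabla f_v(x_t^v), \qquad g_t^* = \sum_{v=1}^{V} \frac{D_v}{D} \nabla f(x_t^v),$$

where $f(x)$ and $f_v(x)$ are the global loss function and the local loss function in node $v$, respectively. Let $i_t = \{i_t^1, \dots, i_t^V\}$ denote the data samples selected randomly during time slot $t$ in all nodes. Then, observe that $\bar{g}_t = \mathbb{E}_{i_t} g_t$. The vector $g_t^*$ is the true direction whose opposite should be followed in each step. We have $\tilde{x}_{t+1} = \tilde{x}_t - \eta_t g_t$. First, we show in Lemmas 7.1 and 7.2 how the virtual sequence $\{\tilde{x}_t\}_{t \ge 0}$ approaches the optimum. Second, we show in Lemma 7.3 that the actual iterates $x_t^v$ deviate only slightly from the virtual sequence. Finally, the convergence rate is proved in Section 7.2.1.

Lemma 7.1. If $f$ is $L$-smooth and $\mu$-strongly convex and $\eta_t \le \frac{1}{4L}$, then

$$\mathbb{E}\|\tilde{x}_{t+1} - x^*\|^2 \le \Big(1 + \frac{\mu}{5}\eta_t\Big)(1 - \mu\eta_t)\, \mathbb{E}\|\tilde{x}_t - x^*\|^2 - \frac{\eta_t}{2}\big(\mathbb{E} f(\tilde{x}_t) - f^*\big) + 2\eta_t^2\, \mathbb{E}\|g_t - \bar{g}_t\|^2 + \Big[\Big(1 + \frac{\mu}{5}\eta_t\Big) 2\eta_t L + \Big(\frac{5}{\mu}\eta_t + 2\eta_t^2\Big) 4L^2\Big] \sum_{v=1}^{V} \frac{D_v}{D}\, \mathbb{E}\|\tilde{x}_t - x_t^v\|^2. \tag{7}$$

Proof. We have

$$\|\tilde{x}_{t+1} - x^*\|^2 = \|\tilde{x}_t - \eta_t g_t - x^*\|^2 = \|\tilde{x}_t - \eta_t g_t - x^* - \eta_t g_t^* + \eta_t g_t^*\|^2 \tag{8}$$
$$= \|\tilde{x}_t - \eta_t g_t^* - x^*\|^2 + \eta_t^2 \|g_t - \bar{g}_t + \bar{g}_t - g_t^*\|^2 + 2\eta_t \langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - g_t \rangle \tag{9}$$
$$\le \|\tilde{x}_t - \eta_t g_t^* - x^*\|^2 + 2\eta_t^2 \big(\|g_t - \bar{g}_t\|^2 + \|\bar{g}_t - g_t^*\|^2\big) + 2\eta_t \langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - g_t \rangle, \tag{10}$$

where (10) is based on the following inequality:

$$\Big\|\sum_{i=1}^{n} a_i\Big\|^2 \le n \sum_{i=1}^{n} \|a_i\|^2. \tag{11}$$

Then we take the expectation to obtain $\mathbb{E}_{i_0, \dots, i_t} \|\tilde{x}_{t+1} - x^*\|^2$. By the law of total expectation, for every two random variables $\alpha, \beta$ and a function $y$, $\mathbb{E}_\alpha\, y(\alpha) = \mathbb{E}_\beta\, \mathbb{E}_\alpha [y(\alpha) \mid \beta]$. Considering $\alpha = i_0, \dots, i_t$ and $\beta = i_0, \dots, i_{t-1}$, the steps (12) to (14) yield

$$\mathbb{E}_{i_0, \dots, i_t} \langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - g_t \rangle = \mathbb{E}_{i_0, \dots, i_t} \langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - \bar{g}_t \rangle. \tag{15}$$

In (13), we used the fact that once $i_0, \dots, i_{t-1}$ are known, the values of $x_t^v$, $1 \le v \le V$, and therefore $\tilde{x}_t$ and $g_t^*$, are no longer random.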
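From now on, we drop the subscript $i_0, \dots, i_t$ for ease of notation. Thus,

$$\mathbb{E}\|\tilde{x}_{t+1} - x^*\|^2 \le \Big(1 + \frac{\mu}{5}\eta_t\Big) \mathbb{E}\|\tilde{x}_t - \eta_t g_t^* - x^*\|^2 + 2\eta_t^2\, \mathbb{E}\|g_t - \bar{g}_t\|^2 + \Big(\frac{5}{\mu}\eta_t + 2\eta_t^2\Big) \mathbb{E}\|\bar{g}_t - g_t^*\|^2, \tag{16}$$

where we used (15) in (10) and the fact that, for $\lambda > 0$,

$$2\langle a, b \rangle \le \lambda \|a\|^2 + \frac{1}{\lambda}\|b\|^2. \tag{17}$$

We obtain

$$\|\tilde{x}_t - \eta_t g_t^* - x^*\|^2 = \|\tilde{x}_t - x^*\|^2 + \eta_t^2 \|g_t^*\|^2 - 2\eta_t \langle \tilde{x}_t - x^*,\; g_t^* \rangle \tag{18}$$
$$= \|\tilde{x}_t - x^*\|^2 + \eta_t^2 \|g_t^*\|^2 - 2\eta_t \sum_{v=1}^{V} \frac{D_v}{D} \langle \tilde{x}_t - x_t^v + x_t^v - x^*,\; \nabla f(x_t^v) \rangle \tag{19}$$
$$\le \|\tilde{x}_t - x^*\|^2 + \eta_t^2 \sum_{v=1}^{V} \frac{D_v}{D} \|\nabla f(x_t^v)\|^2 - 2\eta_t \sum_{v=1}^{V} \frac{D_v}{D} \langle x_t^v - x^*,\; \nabla f(x_t^v) \rangle - 2\eta_t \sum_{v=1}^{V} \frac{D_v}{D} \langle \tilde{x}_t - x_t^v,\; \nabla f(x_t^v) \rangle, \tag{20}$$

where in (20) we have used the convexity of $\|\cdot\|^2$, which gives

$$\eta_t^2 \|g_t^*\|^2 \le \eta_t^2 \sum_{v=1}^{V} \frac{D_v}{D} \|\nabla f(x_t^v)\|^2. \tag{21}$$

By $L$-smoothness we have

$$\|\nabla f(x_t^v) - \nabla f(x^*)\|^2 \le 2L \big(f(x_t^v) - f^*\big). \tag{22}$$

Since $\nabla f(x^*) = 0$, we can bound the second term in (20) as

$$\eta_t^2 \sum_{v=1}^{V} \frac{D_v}{D} \|\nabla f(x_t^v)\|^2 \le 2L \eta_t^2 \sum_{v=1}^{V} \frac{D_v}{D} \big(f(x_t^v) - f^*\big). \tag{23}$$

$\mu$-strong convexity provides us with

$$-\langle x_t^v - x^*,\; \nabla f(x_t^v) \rangle \le -\big(f(x_t^v) - f^*\big) - \frac{\mu}{2} \|x_t^v - x^*\|^2. \tag{24}$$

Following (17) to bound the last term in (20), we have

$$-2\langle \tilde{x}_t - x_t^v,\; \nabla f(x_t^v) \rangle \le 2L \|\tilde{x}_t - x_t^v\|^2 + \frac{1}{2L} \|\nabla f(x_t^v) - \nabla f(x^*)\|^2 \tag{25}$$
$$\le 2L \|\tilde{x}_t - x_t^v\|^2 + \big(f(x_t^v) - f^*\big), \tag{26}$$

where (22) is used in (26). We obtain the following result by applying these three estimates to (20):

$$\|\tilde{x}_t - \eta_t g_t^* - x^*\|^2 \le \|\tilde{x}_t - x^*\|^2 + 2\eta_t L \sum_{v=1}^{V} \frac{D_v}{D} \|\tilde{x}_t - x_t^v\|^2 + 2\eta_t \sum_{v=1}^{V} \frac{D_v}{D} \Big[\Big(\eta_t L - \frac{1}{2}\Big)\big(f(x_t^v) - f^*\big) - \frac{\mu}{2} \|x_t^v - x^*\|^2\Big]. \tag{27}$$

We have $\eta_t L - \frac{1}{2} \le -\frac{1}{4}$, as we assumed $\eta_t \le \frac{1}{4L}$. Using the concavity of $\alpha\big(f(x_t^v) - f^*\big) + \beta \|x_t^v - x^*\|^2$ for $\alpha, \beta \le 0$, we get

$$2\eta_t \sum_{v=1}^{V} \frac{D_v}{D} \Big[\Big(\eta_t L - \frac{1}{2}\Big)\big(f(x_t^v) - f^*\big) - \frac{\mu}{2} \|x_t^v - x^*\|^2\Big] \le -\frac{\eta_t}{2}\big(f(\tilde{x}_t) - f^*\big) - \mu \eta_t \|\tilde{x}_t - x^*\|^2. \tag{28}$$

By applying the last inequality in (27),

$$\|\tilde{x}_t - \eta_t g_t^* - x^*\|^2 \le (1 - \mu\eta_t) \|\tilde{x}_t - x^*\|^2 + 2\eta_t L \sum_{v=1}^{V} \frac{D_v}{D} \|\tilde{x}_t - x_t^v\|^2 - \frac{\eta_t}{2}\big(f(\tilde{x}_t) - f^*\big). \tag{29}$$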
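We obtain

$$\|\bar{g}_t - g_t^*\|^2 = \Big\|\sum_{v=1}^{V} \frac{D_v}{D} \big(\nabla f_v(x_t^v) - \nabla f(x_t^v)\big)\Big\|^2 \tag{30}$$
$$= \Big\|\sum_{v=1}^{V} \frac{D_v}{D} \big(\nabla f_v(x_t^v) - \nabla f_v(\tilde{x}_t) + \nabla f_v(\tilde{x}_t) - \nabla f(x_t^v)\big)\Big\|^2 \tag{31}$$
$$\le 2\Big(\Big\|\sum_{v=1}^{V} \frac{D_v}{D} \big(\nabla f_v(x_t^v) - \nabla f_v(\tilde{x}_t)\big)\Big\|^2 + \Big\|\sum_{v=1}^{V} \frac{D_v}{D} \big(\nabla f_v(\tilde{x}_t) - \nabla f(x_t^v)\big)\Big\|^2\Big) \tag{32}$$
$$\le 2\Big(\Big\|\sum_{v=1}^{V} \frac{D_v}{D} \big(\nabla f_v(x_t^v) - \nabla f_v(\tilde{x}_t)\big)\Big\|^2 + \Big\|\sum_{v=1}^{V} \frac{D_v}{D} \big(\nabla f(\tilde{x}_t) - \nabla f(x_t^v)\big)\Big\|^2\Big) \tag{33}$$
$$\le 2 \sum_{v=1}^{V} \frac{D_v}{D} \|\nabla f_v(x_t^v) - \nabla f_v(\tilde{x}_t)\|^2 + 2 \sum_{v=1}^{V} \frac{D_v}{D} \|\nabla f(x_t^v) - \nabla f(\tilde{x}_t)\|^2 \tag{34}$$
$$\le 2L^2 \sum_{v=1}^{V} \frac{D_v}{D} \|x_t^v - \tilde{x}_t\|^2 + 2L^2 \sum_{v=1}^{V} \frac{D_v}{D} \|x_t^v - \tilde{x}_t\|^2 \tag{35}$$
$$= 4L^2 \sum_{v=1}^{V} \frac{D_v}{D} \|x_t^v - \tilde{x}_t\|^2, \tag{36}$$

where in (32) we use (11). In (33) we have used the fact that $\sum_{v=1}^{V} \frac{D_v}{D} f_v(x) = f(x)$. Steps (34) and (35) are due to the convexity of $\|\cdot\|^2$ and $L$-smoothness, respectively. Taking the expectation of (29) and (36) and applying them to (16) provides

$$\mathbb{E}\|\tilde{x}_{t+1} - x^*\|^2 \le \Big(1 + \frac{\mu}{5}\eta_t\Big)(1 - \mu\eta_t)\, \mathbb{E}\|\tilde{x}_t - x^*\|^2 - \frac{\eta_t}{2}\big(\mathbb{E} f(\tilde{x}_t) - f^*\big) + 2\eta_t^2\, \mathbb{E}\|g_t - \bar{g}_t\|^2 + \Big[\Big(1 + \frac{\mu}{5}\eta_t\Big) 2\eta_t L + \Big(\frac{5}{\mu}\eta_t + 2\eta_t^2\Big) 4L^2\Big] \sum_{v=1}^{V} \frac{D_v}{D}\, \mathbb{E}\|\tilde{x}_t - x_t^v\|^2. \tag{37}$$

Lemma 7.2 (Bounding variance). If $\mathbb{E}_{i_t^v} \|\nabla f_{i_t^v}(x_t^v) - \nabla f_v(x_t^v)\|^2 \le \sigma^2$ for $0 \le t \le T-1$, $1 \le v \le V$, then $\mathbb{E}\|g_t - \bar{g}_t\|^2 \le \rho\sigma^2$.

Proof. We have by definition that

$$\mathbb{E}\|g_t - \bar{g}_t\|^2 = \mathbb{E}\Big\|\sum_{v=1}^{V} \frac{D_v}{D} \big(\nabla f_{i_t^v}(x_t^v) - \nabla f_v(x_t^v)\big)\Big\|^2 \tag{38}$$
$$= \sum_{v=1}^{V} \Big(\frac{D_v}{D}\Big)^2 \mathbb{E}\|\nabla f_{i_t^v}(x_t^v) - \nabla f_v(x_t^v)\|^2 \tag{39}$$
$$\le \sigma^2 \sum_{v=1}^{V} \Big(\frac{D_v}{D}\Big)^2 \tag{40}$$
$$\le \rho\sigma^2, \tag{41}$$

where (39) is based on the fact that the variance of a sum of independent random variables equals the sum of their variances.

Lemma 7.3 (Bounding deviation single-stream). If $\mathrm{gap}(S_T^v) \le H$, $\max\{l_t^v - t\} \le E$, $\mathbb{E}_i \|\nabla f_i(x_t^v)\|^2 \le G^2$, and $\eta_t \le 2\eta_{t+H+E}$ for $0 \le t \le T-1$, $1 \le v \le V$, then

$$\sum_{v=1}^{V} \frac{D_v}{D}\, \mathbb{E}\|\tilde{x}_t - x_t^v\|^2 \le 64 V \rho\, \eta_t^2 (H+E)^2 G^2.$$

Proof. For every $v$ there exists a $\tau_t^v$ such that $x_{\tau_t^v}^v = \bar{x}_{\tau_t^v}$. Considering $\tau_0 = \min\{\tau_t^1, \dots, \tau_t^V\}$, we have $t - \tau_0 \le H$. We know that all the updates of all of the nodes up to iteration $\tau_0$ are aggregated in $\bar{x}_t$.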
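We have

$$\bar{x}_{\tau_t^v} = \bar{x}_{\tau_0} - \sum_{h \in H^v} \sum_{t'=\tau_0}^{\tau_t^h - 1} \sum_{z \in u_{t'}^h} \frac{D_h}{D}\, \eta_z \nabla f_{i_z^h}(x_z^h),$$

where $H^v = \{h \mid \tau_t^h \le \tau_t^v\}$, and

$$\bar{x}_{\tau_0} = x_0 - \sum_{v=1}^{V} \sum_{t'=0}^{\tau_0 - 1} \sum_{z \in u_{t'}^v} \frac{D_v}{D}\, \eta_z \nabla f_{i_z^v}(x_z^v).$$

Let us use (11) to decompose the deviation term as follows:

$$\|\tilde{x}_t - x_t^v\|^2 \le 4\big(\|x_t^v - \bar{x}_{\tau_t^v}\|^2 + \|\bar{x}_{\tau_t^v} - \bar{x}_{\tau_0}\|^2 + \|\tilde{x}_{\tau_0} - \bar{x}_{\tau_0}\|^2 + \|\tilde{x}_t - \tilde{x}_{\tau_0}\|^2\big).$$

Based on the fact that $t - \tau_t^v \le H$, we can obtain

$$\mathbb{E}\|x_t^v - \bar{x}_{\tau_t^v}\|^2 = \mathbb{E}\|x_t^v - x_{\tau_t^v}^v\|^2 \tag{44}$$
$$= \mathbb{E}\Big\|\sum_{t'=\tau_t^v}^{t-1} \sum_{z \in u_{t'}^v} \eta_z \nabla f_{i_z^v}(x_z^v)\Big\|^2 \tag{45}$$
$$= \mathbb{E}\Big\|\sum_{z \in \bigcup_{t'=\tau_t^v}^{t-1} u_{t'}^v} \eta_z \nabla f_{i_z^v}(x_z^v)\Big\|^2 \tag{46}$$
$$\le \eta_{\tau_t^v - E}^2\, \Big|\bigcup_{t'=\tau_t^v}^{t-1} u_{t'}^v\Big| \sum_{z \in \bigcup_{t'=\tau_t^v}^{t-1} u_{t'}^v} \mathbb{E}\|\nabla f_{i_z^v}(x_z^v)\|^2 \tag{47}$$
$$\le \eta_{\tau_t^v - E}^2 \big(t - (\tau_t^v - E)\big)^2 G^2 \tag{48}$$
$$\le \eta_{\tau_0 - E}^2 (H+E)^2 G^2, \tag{49}$$

where we have used $\eta_{\tau_t^v - E} \le \eta_{\tau_0 - E}$. For the second term, using the same approach, we have

$$\mathbb{E}\|\bar{x}_{\tau_t^v} - \bar{x}_{\tau_0}\|^2 = \mathbb{E}\Big\|\sum_{h \in H^v} \sum_{t'=\tau_0}^{\tau_t^h - 1} \sum_{z \in u_{t'}^h} \frac{D_h}{D}\, \eta_z \nabla f_{i_z^h}(x_z^h)\Big\|^2 \tag{50}$$
$$\le |H^v| \sum_{h \in H^v} \Big(\frac{D_h}{D}\Big)^2 \mathbb{E}\Big\|\sum_{t'=\tau_0}^{\tau_t^h - 1} \sum_{z \in u_{t'}^h} \eta_z \nabla f_{i_z^h}(x_z^h)\Big\|^2 \tag{51}$$
$$\le |H^v| \sum_{h \in H^v} \Big(\frac{D_h}{D}\Big)^2 \eta_{\tau_0 - E}^2 (H+E)^2 G^2 \tag{52}$$
$$\le \eta_{\tau_0 - E}^2 (H+E)^2 G^2\, V \sum_{v=1}^{V} \Big(\frac{D_v}{D}\Big)^2 \tag{53}$$
$$\le \eta_{\tau_0 - E}^2 (H+E)^2 G^2\, V \rho. \tag{54}$$

The third term can be bounded as

$$\mathbb{E}\|\tilde{x}_{\tau_0} - \bar{x}_{\tau_0}\|^2 = \mathbb{E}\Big\|\sum_{v=1}^{V} \sum_{z=0}^{\tau_0 - 1} \frac{D_v}{D}\, \eta_z \nabla f_{i_z^v}(x_z^v) - \sum_{v=1}^{V} \sum_{t'=0}^{\tau_0 - 1} \sum_{z \in u_{t'}^v} \frac{D_v}{D}\, \eta_z \nabla f_{i_z^v}(x_z^v)\Big\|^2 \tag{55}$$
$$\le \mathbb{E}\Big\|\sum_{v=1}^{V} \sum_{z \notin \bigcup_{t'=0}^{\tau_0 - 1} u_{t'}^v} \frac{D_v}{D}\, \eta_z \nabla f_{i_z^v}(x_z^v)\Big\|^2 \tag{56}$$
$$\le V \sum_{v=1}^{V} \Big(\frac{D_v}{D}\Big)^2 \mathbb{E}\Big\|\sum_{z \notin \bigcup_{t'=0}^{\tau_0 - 1} u_{t'}^v} \eta_z \nabla f_{i_z^v}(x_z^v)\Big\|^2 \tag{57}$$
$$\le V \sum_{v=1}^{V} \Big(\frac{D_v}{D}\Big)^2 \eta_{\tau_0 - E}^2\, E \sum_{z \notin \bigcup_{t'=0}^{\tau_0 - 1} u_{t'}^v} \mathbb{E}\|\nabla f_{i_z^v}(x_z^v)\|^2 \tag{58}$$
$$\le \eta_{\tau_0 - E}^2\, E^2 G^2\, V \rho. \tag{59}$$

For the last term, using the same logic, we can obtain

$$\|\tilde{x}_t - \tilde{x}_{\tau_0}\|^2 \le \eta_{\tau_0}^2 H^2 G^2\, V \rho. \tag{60}$$

Considering that $\eta_{\tau_0 - E} \le 2\eta_t$ and adding up the previous four estimates, we have

$$\|\tilde{x}_t - x_t^v\|^2 \le 64 V \rho\, \eta_t^2 (H+E)^2 G^2. \tag{61}$$

Observe that Lemmas 7.1 and 7.2 hold regardless of how the nodes are synchronized.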
Lemma 7.3, which limits how far the local sequences can deviate from the virtual average, also remains valid with multiple synchronization streams. This is clear at first sight, as having multiple streams further reduces the gap between the local sequences and the virtual iterates, $\|\tilde{x}_t - x_t^v\|^2$.
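As a small numerical illustration of the sum $\sum_{v=1}^{V} (D_v/D)^2$ that appears in (40) and is upper-bounded by $\rho$ in (41), the following Python sketch evaluates it for given local dataset sizes. The function name and the example sizes are hypothetical; the definition of $\rho$ itself is not restated in this appendix, so the sketch computes only the sum, not $\rho$.

```python
def sum_squared_data_weights(sizes):
    """Return sum_v (D_v / D)^2 for local dataset sizes D_1, ..., D_V."""
    total = sum(sizes)
    return sum((d / total) ** 2 for d in sizes)


# Balanced data over V nodes gives 1 / V; unbalanced data gives a larger value.
balanced = sum_squared_data_weights([25, 25, 25, 25])
unbalanced = sum_squared_data_weights([97, 1, 1, 1])
```

Under this reading, a more unbalanced split makes the variance term in Lemma 7.2 larger, since a single node's stochastic gradient noise dominates the weighted sum.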

7.2.1. COMPLETING THE PROOF OF THEOREM 4.1

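By substituting the results of Lemmas 7.2 and 7.3 into Lemma 7.1, we obtain

$$\mathbb{E}\|\tilde{x}_{t+1} - x^*\|^2 \le \Big(1 + \frac{\mu}{5}\eta_t\Big)(1 - \mu\eta_t)\, \mathbb{E}\|\tilde{x}_t - x^*\|^2 - \frac{\eta_t}{2}\big(\mathbb{E} f(\tilde{x}_t) - f^*\big) + A_1 \eta_t^2 + A_2 \eta_t^3 + A_3 \eta_t^4, \tag{62}$$

where $A_1 = 2\rho\sigma^2$, $A_2 = 128 V \rho L (H+E)^2 G^2 \big(1 + \frac{10L}{\mu}\big)$, and $A_3 = 128 V \rho L (H+E)^2 G^2 \big(\frac{\mu}{5} + 4L\big)$. Observe that

$$\frac{\omega_t}{\eta_t}\Big(1 + \frac{\mu}{5}\eta_t\Big)(1 - \mu\eta_t) = \frac{\mu}{4}\Big((a+t)^3 - \frac{16}{5}(a+t)^2 - \frac{16}{5}(a+t)\Big) \tag{63}$$
$$\le \frac{\mu}{4}\Big((a+t)^3 - 3(a+t)^2 + 3(a+t) - 1\Big) \tag{64}$$
$$= \frac{\omega_{t-1}}{\eta_{t-1}}, \tag{65}$$

where (64) is correct for $a \ge 2$. Multiplying (62) by $\frac{\omega_t}{\eta_t}$ and using the last inequality, we have

$$\frac{\omega_t}{\eta_t}\, \mathbb{E}\|\tilde{x}_{t+1} - x^*\|^2 \le \frac{\omega_{t-1}}{\eta_{t-1}}\, \mathbb{E}\|\tilde{x}_t - x^*\|^2 - \frac{\omega_t}{2}\big(\mathbb{E} f(\tilde{x}_t) - f^*\big) + A_1 \omega_t \eta_t + A_2 \omega_t \eta_t^2 + A_3 \omega_t \eta_t^3. \tag{66}$$

So we can recursively substitute the first term on the right-hand side of the inequality to get

$$\frac{\omega_{T-1}}{\eta_{T-1}}\, \mathbb{E}\|\tilde{x}_T - x^*\|^2 \le \frac{\omega_0}{\eta_0}\Big(1 + \frac{\mu}{5}\eta_0\Big)(1 - \mu\eta_0) \|x_0 - x^*\|^2 - \sum_{t=0}^{T-1} \frac{\omega_t}{2}\big(\mathbb{E} f(\tilde{x}_t) - f^*\big) + A_1 \sum_{t=0}^{T-1} \omega_t \eta_t + A_2 \sum_{t=0}^{T-1} \omega_t \eta_t^2 + A_3 \sum_{t=0}^{T-1} \omega_t \eta_t^3. \tag{67}$$

By rearranging the terms and considering that $\big(1 + \frac{\mu}{5}\eta_0\big)(1 - \mu\eta_0) \le 1$, we have

$$\sum_{t=0}^{T-1} \omega_t \big(\mathbb{E} f(\tilde{x}_t) - f^*\big) \le \frac{2\omega_0}{\eta_0} \|x_0 - x^*\|^2 + 2A_1 \sum_{t=0}^{T-1} \omega_t \eta_t + 2A_2 \sum_{t=0}^{T-1} \omega_t \eta_t^2 + 2A_3 \sum_{t=0}^{T-1} \omega_t \eta_t^3. \tag{68}$$

Based on the convexity of $f$, for the weighted average $\hat{x}_T = \frac{1}{S_T} \sum_{t=0}^{T-1} \omega_t \tilde{x}_t$ we have

$$\mathbb{E} f(\hat{x}_T) - f^* \le \frac{1}{S_T} \sum_{t=0}^{T-1} \omega_t \big(\mathbb{E} f(\tilde{x}_t) - f^*\big) \tag{69}$$
$$\le \frac{2\omega_0}{S_T \eta_0} \|x_0 - x^*\|^2 + \frac{2A_1}{S_T} \sum_{t=0}^{T-1} \omega_t \eta_t + \frac{2A_2}{S_T} \sum_{t=0}^{T-1} \omega_t \eta_t^2 + \frac{2A_3}{S_T} \sum_{t=0}^{T-1} \omega_t \eta_t^3. \tag{70}$$

We next bound the terms on the right-hand side of the inequality. In particular,

$$S_T = \sum_{t=0}^{T-1} \omega_t = \frac{T}{6}\big(2T^2 + 6aT - 3T + 6a^2 - 6a + 1\big) \ge \frac{T^3}{3}, \tag{74}$$

where (74) is correct due to $a \ge 2$. Using the above bounds, we can write (70) as

$$\mathbb{E} f(\hat{x}_T) - f^* \le \frac{3\mu a^3}{2T^3} \|x_0 - x^*\|^2 + \frac{12(2a + T - 1)}{\mu T^2} A_1 + \frac{96}{T^2 \mu^2} A_2 + \frac{384 \ln(T + a - 2)}{\mu^3 T^3} A_3. \tag{75}$$

This completes the proof of Theorem 4.1.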
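The telescoping step (63) to (65) and the closed form of $S_T$ in (74) can be checked numerically. The Python sketch below is only a sanity check under an assumption: the choices $\eta_t = \frac{4}{\mu(a+t)}$ and $\omega_t = (a+t)^2$ are inferred from the right-hand sides of (63) and (74), since the statement of Theorem 4.1 is not part of this appendix. All function names are hypothetical.

```python
def eta(t, mu, a):
    """Assumed step size eta_t = 4 / (mu * (a + t)), inferred from (63)."""
    return 4.0 / (mu * (a + t))


def omega(t, a):
    """Assumed weight omega_t = (a + t)^2, inferred from (74)."""
    return float((a + t) ** 2)


def s_closed_form(T, a):
    """Closed form of S_T = sum_{t=0}^{T-1} omega_t as written in (74)."""
    return T * (2 * T**2 + 6 * a * T - 3 * T + 6 * a**2 - 6 * a + 1) / 6.0


def telescoping_gap(t, mu, a):
    """omega_{t-1}/eta_{t-1} minus the left-hand side of (63); positive when (64) holds."""
    e = eta(t, mu, a)
    lhs = omega(t, a) / e * (1 + mu / 5 * e) * (1 - mu * e)
    rhs = omega(t - 1, a) / eta(t - 1, mu, a)
    return rhs - lhs
```

For example, comparing `sum(omega(t, a) for t in range(T))` against `s_closed_form(T, a)` for a few values of `T` and `a >= 2` confirms the closed form under the assumed weights.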

7.3. MULTI-STREAM DIGEST

Notice that Lemmas 7.1 and 7.2 hold for the multi-stream scenario. Hence, we need a modified version of Lemma 7.3 that limits how far the local sequences can depart from the virtual sequence in multi-stream DIGEST.

Lemma 7.4 (Bounding deviation multi-stream). If $\mathrm{gap}(S_T^v[r]) \le H$, $\mathrm{gap}\big(\vee_{1 \le r \le R}\, S_T^v[r]\big) \le \frac{H}{R} + \delta$, $\sum_{r=1}^{R} \sum_{v' \in B_r^v(t)} \big(\frac{D_{v'}}{D}\big)^2 \le c\rho$, $\max\{l_t^v - t\} \le E$, $\mathbb{E}_i \|\nabla f_i(x_t^v)\|^2 \le G^2$, and $\eta_t \le 2\eta_{t+H+E}$ for $0 \le t \le T-1$, $1 \le v \le V$, $1 \le r \le R$, then

$$\sum_{v=1}^{V} \frac{D_v}{D}\, \mathbb{E}\|\tilde{x}_t - x_t^v\|^2 \le 4\Big(\frac{H}{R} + \delta + E\Big)^2 \eta_t^2 G^2 \rho\, (6V + 8cRh_{\max}).$$

Proof. We use $\tau_t^v[r]$ to denote the last time slot up to $t$ at which node $v$'s model was updated with stream $r$, i.e., $\tau_t^v[r] = \max\{t' \mid t' \le t,\ s_{t'}^v[r] = 1\}$. Let us use (11) to decompose the deviation term as follows:

$$\|\tilde{x}_t - x_t^v\|^2 \le 2\Big(\Big\|x_t^v - \Big(\sum_{r=1}^{R} \bar{x}_{\tau_t^v[r]}[r] - (R-1)x_0\Big)\Big\|^2 + \Big\|\tilde{x}_t - \Big(\sum_{r=1}^{R} \bar{x}_{\tau_t^v[r]}[r] - (R-1)x_0\Big)\Big\|^2\Big). \tag{76}$$

Let $\tau_l^v(t) = \max\{\tau_t^v[1], \dots, \tau_t^v[R]\}$. For the first term we can obtain

$$\mathbb{E}\Big\|x_t^v - \Big(\sum_{r=1}^{R} \bar{x}_{\tau_t^v[r]}[r] - (R-1)x_0\Big)\Big\|^2 = \mathbb{E}\Big\|\sum_{t'=\tau_l^v(t)}^{t-1} \sum_{z \in u_{t'}^v} \eta_z \nabla f_{i_z^v}(x_z^v)\Big\|^2 \tag{77}$$
$$\le \Big(\frac{H}{R} + \delta + E\Big)^2 \eta_{\tau_l^v(t) - E}^2\, G^2 \tag{78}$$
$$\le \Big(\frac{H}{R} + \delta + E\Big)^2 \eta_{t-H-E}^2\, G^2. \tag{79}$$

For the second term in (76), we again use (11) to expand it into two terms:

$$\Big\|\tilde{x}_t - \Big(\sum_{r=1}^{R} \bar{x}_{\tau_t^v[r]}[r] - (R-1)x_0\Big)\Big\|^2 \le 2\Big(\Big\|\tilde{x}_t - \Big(\sum_{r=1}^{R} \bar{x}_t[r] - (R-1)x_0\Big)\Big\|^2 + \Big\|\sum_{r=1}^{R} \big(\bar{x}_t[r] - \bar{x}_{\tau_t^v[r]}[r]\big)\Big\|^2\Big). \tag{80}$$

We now bound the two terms on the right-hand side of (80). The first term is the difference between the virtual sequence and the sum of the updates in all nodes aggregated in the global models. In fact, this difference consists of all the updates that have not been seen by any stream plus the updates that are lagged. We define $\tau_0 = \min\{\tau_l^1(t), \dots, \tau_l^V(t)\}$. Then

$$\mathbb{E}\Big\|\tilde{x}_t - \Big(\sum_{r=1}^{R} \bar{x}_t[r] - (R-1)x_0\Big)\Big\|^2 = \mathbb{E}\Big\|\sum_{v=1}^{V} \sum_{t'=\tau_l^v(t)}^{t-1} \sum_{z \in u_{t'}^v} \frac{D_v}{D}\, \eta_z \nabla f_{i_z^v}(x_z^v)\Big\|^2 \tag{81}$$
$$\le \Big(\frac{H}{R} + \delta + E\Big)^2 \eta_{\tau_0 - E}^2\, G^2\, V\rho \tag{82}$$
$$\le \Big(\frac{H}{R} + \delta + E\Big)^2 \eta_{t-H-E}^2\, G^2\, V\rho, \tag{83}$$

where (83) can be found with the same approach as (54).
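Here we define $B_r^v(t) = [\,h \mid s_{t'}^h[r] = 1,\ \tau_t^v[r] \le t' \le t\,]$ as the list of nodes that are visited by stream $r$ after node $v$ (repeated nodes may appear in the list). Note that $\mathbb{E}|B_r^v(t)| \le 2h_{\max}$. We have

$$\mathbb{E}\Big\|\sum_{r=1}^{R} \big(\bar{x}_t[r] - \bar{x}_{\tau_t^v[r]}[r]\big)\Big\|^2 \le R\, \mathbb{E} \sum_{r=1}^{R} \big\|\bar{x}_t[r] - \bar{x}_{\tau_t^v[r]}[r]\big\|^2 \tag{84}$$
$$\le R\, \mathbb{E} \sum_{r=1}^{R} \Big\|\sum_{h \in B_r^v(t)} \sum_{t'=\tau_l^h(\tau_t^h[r])}^{\tau_t^h[r]} \sum_{z \in u_{t'}^h} \frac{D_h}{D}\, \eta_z \nabla f_{i_z^h}(x_z^h)\Big\|^2 \tag{85}$$
$$\le 2Rh_{\max}\, \mathbb{E} \sum_{r=1}^{R} \sum_{h \in B_r^v(t)} \Big\|\sum_{t'=\tau_l^h(\tau_t^h[r])}^{\tau_t^h[r]} \sum_{z \in u_{t'}^h} \frac{D_h}{D}\, \eta_z \nabla f_{i_z^h}(x_z^h)\Big\|^2 \tag{86}$$
$$\le 2Rh_{\max} \Big(\frac{H}{R} + \delta + E\Big)^2 \mathbb{E} \sum_{r=1}^{R} \sum_{h \in B_r^v(t)} \Big\|\frac{D_h}{D}\, \eta_z \nabla f_{i_z^h}(x_z^h)\Big\|^2 \tag{87}$$
$$\le 2Rh_{\max} \Big(\frac{H}{R} + \delta + E\Big)^2 \eta_{t-H-E}^2\, \mathbb{E} \sum_{r=1}^{R} \sum_{h \in B_r^v(t)} \Big\|\frac{D_h}{D} \nabla f_{i_z^h}(x_z^h)\Big\|^2 \tag{88}$$
$$\le 2Rh_{\max} \Big(\frac{H}{R} + \delta + E\Big)^2 \eta_{t-H-E}^2\, G^2\, \mathbb{E} \sum_{r=1}^{R} \sum_{h \in B_r^v(t)} \Big(\frac{D_h}{D}\Big)^2 \tag{89}$$
$$\le 2Rh_{\max} \Big(\frac{H}{R} + \delta + E\Big)^2 \eta_{t-H-E}^2\, G^2\, c\rho, \tag{90}$$

where (84), (85), (86), and (87) are based on (11) and the fact that the duration between two subsequent visits of node $v$ by different streams is at most $\frac{H}{R} + \delta$. Step (90) follows from the assumption that there are not too many streams in comparison to $V$. By using (79), (80), (83), and (90)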



Figure 3: Example multi-stream DIGEST.

We consider a setup where nodes have access to a subset of data samples $\mathcal{D}$. Each node $v$ has a local dataset $\mathcal{D}_v$, where $D_v = |\mathcal{D}_v|$ is the size of the local dataset and $D = \sum_{v=1}^{V} D_v$. The distribution of data across nodes is not identical and independently distributed (non-iid). Stochastic Optimization. We assume that the nodes in the network jointly minimize a $d$-dimensional function $f : \mathbb{R}^d \to \mathbb{R}$. The goal of the nodes is to converge on a model $x^*$ that minimizes the empirical loss over the $D$ samples, i.e., $x^* := \arg\min_{x \in \mathbb{R}^d} \big[ f(x) := \frac{1}{D} \sum_{i=1}^{D} f_i(x) \big]$, where $f_i : \mathbb{R}^d \to \mathbb{R}$ is the loss function of model $x$ associated with the data sample $i$. The optimum value is denoted by $f^*$. The loss function on local dataset

Figure 5: Convergence results and communication overhead for MNIST dataset in 100-nodes / non-iid / unbalanced setting with multiple streams.

visited: Set of nodes that are visited for the global model update in the most recent synchronization round for single-stream DIGEST
visited[r]: Set of nodes that are visited for the semi-global model update in the most recent synchronization round in stream $r$ for multi-stream DIGEST
$\bar{x}_t$: The global model received by node $v$ at $t$ in single-stream DIGEST
$\bar{x}_t[r]$: The semi-global model received by node $v$ at $t$ from stream $r$ in multi-stream DIGEST

When node 2 receives the global model from node 1, it aggregates it with its local model. The aggregated global model is then transmitted to node 3. We note that the exchanged models are global models, as each node adds its own local updates to the received model. A node that has the global model selects the next node for global model transmission randomly among its neighbors. After all the nodes have updated their models with a global model, DIGEST pauses the global model exchange, while local-SGD computations continue. The global model exchange is repeated every H iterations. DIGEST reduces the communication overhead compared to both synchronous and asynchronous Gossip, as there is no need to exchange models among all nodes after every model update. DIGEST improves the convergence time compared to random-walk, as it eliminates idle times at nodes by employing local-SGD updates. To summarize, DIGEST gets the best of both Gossip and random-walk algorithms by exploiting local-SGD. Furthermore, DIGEST is designed to support both iid and non-iid data distributed over nodes.
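The traversal described above, where the holder of the global model picks a random neighbor until every node has been visited, can be sketched as follows. This Python sketch is a simplified illustration, not the paper's algorithm: it models only the random selection of the next node within one synchronization round, omits the model aggregation and the local-SGD steps, and assumes a connected graph. The function name `global_model_round` and the dictionary-based `neighbors` structure are hypothetical.

```python
import random


def global_model_round(neighbors, start, rng):
    """Simulate one synchronization round of the global model walk.

    neighbors: dict mapping each node to a list of its neighbors (assumed connected).
    start:     node that holds the global model at the start of the round.
    rng:       a random.Random instance used to pick the next node.
    Returns the sequence of nodes that held the global model during the round.
    """
    visited = {start}
    path = [start]
    current = start
    # The round ends once every node has received the global model at least once.
    while len(visited) < len(neighbors):
        current = rng.choice(neighbors[current])
        visited.add(current)
        path.append(current)
    return path
```

The length of the returned path minus one is the number of model transmissions in the round, which is the quantity that the periodic exchange keeps small relative to Gossip.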

We model the underlying network topology with a directed graph G = (V, E), where V is the set of vertices (nodes) and E is the set of directed edges. The vertex set contains V nodes, i.e., |V| = V , and |.| shows the size of the set. The computing capabilities of nodes are arbitrary and heterogeneous. If node i is connected to node j through a communication link and can transmit data, then link (i, j) is in the edge set, i.e., (i, j) ∈ E. The set of the nodes that node i is connected to and can transmit data is called the neighbors of node i, and the neighbor set of node i is denoted by N i . We do not make any assumptions about the behavior of the communication links; there can be an arbitrary, but finite amount of delay over the links.
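A minimal sketch of this network model, assuming nodes are labeled `0, ..., V-1` and edges are given as ordered pairs, is shown below in Python. The function name `neighbor_sets` and the input format are hypothetical and serve only to illustrate how the neighbor set $N_i$ follows from the directed edge set.

```python
def neighbor_sets(num_nodes, edges):
    """Build the neighbor set N_i of every node i from directed edges (i, j).

    A directed edge (i, j) means node i can transmit data to node j,
    so j is added to N_i but i is not added to N_j.
    """
    neighbors = {i: set() for i in range(num_nodes)}
    for i, j in edges:
        neighbors[i].add(j)
    return neighbors


# Example: node 0 can transmit to nodes 1 and 2, node 1 to node 2, node 2 to node 0.
example = neighbor_sets(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```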

at node v, i.e., node v could receive a global model update from one of its neighbors at iteration t. Such a global model reception should be reflected in local model updates, which we discuss next. Global Model Update and Exchange. Let $\bar{x}_t$ be the global model that is being transferred from one node to another at time slot t. If node v receives the global model $\bar{x}_t$ from one of its neighbors, a global model update indicator $s_t^v$ is set to s

, which is a copy of the local model in the latest global model update at node v. xt is the global model. All of these models are initialized with the same initial model x 0 . We note that only one of the nodes, let us say node v 0 , has the global model xt at the start of the algorithm.

r to a neighboring node by calling Alg.2.

and asymptotically approaches zero. Note that if $cRh_{\max} < O(V)$ we get H + R(E + δ) ≤ O( T R 2

evaluate DIGEST in terms of convergence time as well as communication cost as compared to the following baselines: (i) One link-Gossip (Koloskova et al., 2020): At every slot, only one directed communication link is activated randomly, and a model is sent from a sender to a receiver. The receiver's model is updated with the received model; (ii) Uniform Random-Walk (URW) (Ayache & Rouayheb, 2021): This is the random walk-based learning algorithm described in Fig. 1c; (iii) Real Avg-Gossip-PGA (Chen et al., 2021): It adds Periodic Global Averaging (PGA) to Gossip. To perform one global averaging step, the whole graph is traversed twice: first to collect all the models, and then to return the averaged model. P is used to show the period, i.e., the global averaging happens every

TABLE AND PROOF OF THEOREMS 4.1 AND 4.5

$\mathcal{D}_v$: Subset of $\mathcal{D}$ at node $v$ with size $D_v$

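Considering $\alpha = i_0, \dots, i_t$ and $\beta = i_0, \dots, i_{t-1}$, we get that

$$\mathbb{E}_{i_0, \dots, i_t} \langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - g_t \rangle = \mathbb{E}_{i_0, \dots, i_{t-1}}\, \mathbb{E}_{i_0, \dots, i_t} \big[\langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - g_t \rangle \mid i_0, \dots, i_{t-1}\big] \tag{12}$$
$$= \mathbb{E}_{i_0, \dots, i_{t-1}} \langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - \mathbb{E}_{i_t} g_t \rangle \tag{13}$$
$$= \mathbb{E}_{i_0, \dots, i_t} \langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - \mathbb{E}_{i_t} g_t \rangle \tag{14}$$
$$= \mathbb{E}_{i_0, \dots, i_t} \langle \tilde{x}_t - x^* - \eta_t g_t^*,\; g_t^* - \bar{g}_t \rangle. \tag{15}$$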

in (76) we get∥x t -x v t ∥ 2 ≤ (

