DIGEST: FAST AND COMMUNICATION EFFICIENT DECENTRALIZED LEARNING WITH LOCAL UPDATES

Abstract

Decentralized learning advocates the elimination of centralized parameter servers (aggregation points) for potentially better utilization of underlying resources, delay reduction, and resiliency against parameter server unavailability and catastrophic failures. Gossip-based decentralized algorithms, where each node in a network keeps its own local model and learns by talking to its neighbors, have recently received significant attention. Despite their potential, Gossip algorithms introduce high communication costs. In this work, we show that nodes do not need to communicate as frequently as in Gossip for fast convergence; in fact, a sporadic exchange of a global model is sufficient. Thus, we design a fast and communication-efficient decentralized learning mechanism, DIGEST, by particularly focusing on stochastic gradient descent (SGD). DIGEST is a decentralized algorithm building on local-SGD algorithms, which were originally designed for communication-efficient centralized learning. We show through analysis and experiments that DIGEST significantly reduces the communication cost without hurting convergence time for both iid and non-iid data.

1. INTRODUCTION

Emerging applications such as the Internet of Things (IoT), mobile healthcare, and self-driving cars dictate that learning be performed on data predominantly originating at edge and end-user devices (Gubbi et al., 2013; Li et al., 2018a). A growing body of research, e.g., federated learning (McMahan et al., 2016; Kairouz et al., 2021; Konecný et al., 2015; McMahan et al., 2017; Li et al., 2020a;b), has focused on engaging the edge in the learning process, along with the cloud, by allowing data to be processed locally instead of being shipped to the cloud. Learning beyond the cloud can be advantageous in terms of better utilization of network resources, delay reduction, and resiliency against cloud unavailability and catastrophic failures. However, the proposed solutions, like federated learning, predominantly suffer from having a critical centralized component, referred to as the Parameter Server (PS), that organizes and aggregates the devices' computations. Decentralized learning emerges as a promising solution to this problem. Decentralized algorithms have been extensively studied in the literature, with Gossip algorithms receiving the lion's share of research attention (Boyd et al., 2006b; Nedic & Ozdaglar, 2009a; Koloskova et al., 2019; Aysal et al., 2009; Duchi et al., 2012a; Kempe et al., 2003; Xiao & Boyd, 2003; Boyd et al., 2006a). In Gossip algorithms, each node (edge or end-user device) keeps its own local model and learns by talking to its neighbors. This makes Gossip attractive from a failure-tolerance perspective. However, it comes at the expense of high network resource utilization. As shown in Fig. 1a, each node in a synchronous Gossip algorithm performs a model update and then waits to receive model updates from its neighbors. Once a node has received all the updates from its neighbors, it aggregates them.
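The synchronous Gossip round described above (a local update at every node, followed by waiting for and averaging all neighbors' models) can be sketched as follows. This is a minimal illustration, not the paper's algorithm; it assumes uniform averaging weights, and the dictionary-based bookkeeping is purely illustrative:

```python
import numpy as np

def synchronous_gossip_round(models, grads, neighbors, lr=0.1):
    """One synchronous Gossip round: every node takes a local SGD step,
    then waits for its neighbors' updated models and averages them.

    models:    dict mapping node id -> model parameters (np.ndarray)
    grads:     dict mapping node id -> stochastic gradient at that node
    neighbors: dict mapping node id -> list of neighbor node ids
    """
    # Local SGD step at every node.
    updated = {v: models[v] - lr * grads[v] for v in models}
    # Synchronous aggregation: each node averages its own model with
    # all of its neighbors' models (uniform weights for simplicity).
    aggregated = {}
    for v in models:
        group = [updated[v]] + [updated[u] for u in neighbors[v]]
        aggregated[v] = np.mean(group, axis=0)
    return aggregated
```

Note that every node must exchange a full model with every neighbor in every round, which is exactly the communication overhead the paper targets.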
As a result, data must be exchanged among all nodes after each model update, which is a significant communication overhead. Furthermore, some nodes may become a bottleneck for the synchronization, as these nodes (also called stragglers) can be delayed due to computation and/or communication delays, which increases the convergence time. Asynchronous Gossip algorithms, where nodes communicate asynchronously and without waiting for others, are promising to reduce idle times and eliminate stragglers, i.e., delayed nodes (Lian et al., 2018; Li et al., 2018b; Avidor & Tal-Israel, 2022). Indeed, asynchronous algorithms significantly reduce the idle times of nodes by performing model updates and model exchanges simultaneously, as illustrated in Fig. 1b. For example, node 1 can still update its model from x_t^1 to x_{t+1}^1 and x_{t+2}^1 while receiving model updates from its neighbors. When it has received updates from all (or a majority) of its neighbors, it performs model aggregation. However, asynchronous Gossip does not reduce communication overhead as compared to synchronous Gossip. Furthermore, the delayed updates, also referred to as gradient staleness, in asynchronous Gossip may lead to high error floors (Dutta et al., 2021), or require very strict assumptions to converge to the optimum solution (Lian et al., 2018).

Under review as a conference paper at ICLR 2023

Figure 2: Local-SGD with H sequential SGD steps in node v.

If Gossip algorithms are one side of the spectrum of decentralized learning algorithms, the other side is random-walk based decentralized learning (Bertsekas, 1996; Ayache & Rouayheb, 2021; Sun et al., 2018; Needell et al., 2014). Random-walk algorithms activate one node at a time, which updates the global model with its local data, as illustrated in Fig. 1c.
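The random-walk scheme just introduced, in which a single global model hops from node to node and is updated by whichever node currently holds it, can be sketched as follows. This is a minimal illustration under simplifying assumptions (uniform random neighbor selection, a plain SGD step); `local_grad` is a hypothetical per-node gradient oracle, not part of the paper's formulation:

```python
import random
import numpy as np

def random_walk_sgd(x, neighbors, local_grad, start, steps, lr=0.1):
    """Random-walk decentralized SGD sketch: a single global model x is
    passed from node to node; the active node updates it with a gradient
    computed on its local data, then hands it to a random neighbor.

    x:          current global model (np.ndarray)
    neighbors:  dict mapping node id -> list of neighbor node ids
    local_grad: callable (node id, model) -> stochastic gradient on
                that node's local data
    """
    v = start
    for _ in range(steps):
        x = x - lr * local_grad(v, x)    # active node updates the global model
        v = random.choice(neighbors[v])  # pass the model to a random neighbor
    return x
```

Only one model transmission happens per update, so communication is minimal, but all nodes other than the active one sit idle, which is the convergence-time cost noted in the text.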
Then, the node selects one of its neighbors at random and sends it the updated global model. The selected neighbor becomes the newly activated node and updates the global model using its local data. This continues until convergence. Random-walk algorithms significantly reduce the communication cost, as well as the computation and power utilization in the network, at the cost of increased convergence time.

Our key intuitions in this work are that (i) nodes do not need to communicate as frequently as in Gossip for fast convergence; in fact, a sporadic exchange of a model is sufficient, and (ii) nodes do not need to sit idle as in random walk. Thus, we design a fast and communication-efficient decentralized learning mechanism, DIGEST, by particularly focusing on stochastic gradient descent (SGD). DIGEST is a decentralized algorithm building on local-SGD algorithms, which were originally designed for communication-efficient centralized learning (Stich, 2019; Wang & Joshi, 2021; Lin et al., 2020). In local-SGD, each node performs multiple model updates before sending the model to the PS, as illustrated in Fig. 2. The PS aggregates the updates received from multiple nodes and transmits the updated global model back to the nodes. The sporadic communication between nodes and the PS reduces the communication overhead. Our goal in this work is to exploit this idea for decentralized learning. The following are our contributions.

• Design of DIGEST. We design a fast and communication-efficient decentralized learning mechanism, DIGEST, by particularly focusing on stochastic gradient descent (SGD). DIGEST works as follows. Each node keeps updating its local model all the time, as in local-SGD. Meanwhile, there is an ongoing stream of global model updates among nodes, Fig. 1d. For example, node 1



Figure 1: DIGEST in perspective as compared to existing decentralized learning algorithms: (a) synchronous Gossip, (b) asynchronous Gossip, and (c) random-walk. Note that "∇" represents a model update. "Xmit" represents the transmission of a model from a node to one of its neighbors. "Recv" represents the communication duration while receiving model updates from all of a node's neighbors. "A" represents model aggregation. x_t^v denotes the local model of node v at iteration t. For the random-walk algorithm, the global model iterates are denoted as x_t.

