MULTI-LEVEL LOCAL SGD: DISTRIBUTED SGD FOR HETEROGENEOUS HIERARCHICAL NETWORKS

Abstract

We propose Multi-Level Local SGD, a distributed stochastic gradient method for learning a smooth, non-convex objective in a multi-level communication network with heterogeneous workers. Our network model consists of a set of disjoint sub-networks, each with a single hub and multiple workers; further, workers may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete, communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We illustrate the effectiveness of our algorithm in a multi-level network with slow workers via simulation-based experiments.

1. INTRODUCTION

Stochastic Gradient Descent (SGD) is a key algorithm in modern Machine Learning and optimization (Amari, 1993). To support distributed data as well as reduce training time, Zinkevich et al. (2010) introduced a distributed form of SGD. Traditionally, distributed SGD is run within a hub-and-spoke network model: a central parameter server (hub) coordinates with worker nodes. At each iteration, the hub sends a model to the workers. The workers each train on their local data, taking a gradient step, then return their locally trained models to the hub to be averaged. Distributed SGD can be an efficient training mechanism when message latency between the hub and workers is low, allowing gradient updates to be transmitted quickly at each iteration. However, as noted in Moritz et al. (2016), message transmission latency is often high in distributed settings, which causes a large increase in overall training time. A practical way to reduce this communication overhead is to allow the workers to take multiple local gradient steps before communicating their local models to the hub. This form of distributed SGD is referred to as Local SGD (Lin et al., 2018; Stich, 2019). There is a large body of work that analyzes the convergence of Local SGD and the benefits of multiple local training rounds (McMahan et al., 2017; Wang & Joshi, 2018; Li et al., 2019). Local SGD is not applicable to all scenarios, however. Workers may be heterogeneous in terms of their computing capabilities, and thus the time required for local training is not uniform. For this reason, it can be either costly or impossible for workers to train in a fully synchronous manner, as stragglers may hold up global computation. Nevertheless, the vast majority of previous work uses a synchronous model, in which all clients train for the same number of rounds before sending updates to the hub (Dean et al., 2012; Ho et al., 2013; Cipar et al., 2013).
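The Local SGD scheme described above can be summarized in a minimal sketch. The function names, the toy quadratic objective, and the noise model for stochastic gradients are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def local_sgd(grad, w0, num_workers=4, rounds=10, local_steps=5, lr=0.1, seed=0):
    """Hub-and-spoke Local SGD sketch: each worker takes `local_steps`
    gradient steps on its own stochastic gradients, then the hub
    averages the worker models and redistributes the result."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(rounds):
        worker_models = []
        for _ in range(num_workers):
            v = w.copy()  # each worker starts from the hub's model
            for _ in range(local_steps):
                # toy stochastic gradient: true gradient plus small noise
                v -= lr * (grad(v) + 0.01 * rng.standard_normal(v.shape))
            worker_models.append(v)
        w = np.mean(worker_models, axis=0)  # hub averages worker models
    return w

# toy objective f(w) = ||w||^2 / 2, whose gradient is w
w_final = local_sgd(lambda v: v, w0=[5.0, -3.0])
```

Setting `local_steps=1` recovers classical synchronous distributed SGD; larger values trade communication rounds for local computation.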
Further, most works assume a hub-and-spoke model, but this does not capture many real world settings. For example, devices in an ad-hoc network may not all be able to communicate with a central hub in a single hop due to network or communication range limitations. In such settings, a multi-level communication network model may be beneficial. In flying ad-hoc networks (FANETs), a network architecture has been proposed to improve scalability by partitioning the UAVs into mission areas (Bekmezci et al., 2013). Here, clusters of UAVs have their own clusterheads, or hubs, and these hubs communicate through an upper level network, e.g., via satellite. Multi-level networks have also been utilized in Fog and Edge computing, a paradigm designed to improve data aggregation and analysis in wireless sensor networks, autonomous vehicles, power systems, and more (Bonomi et al., 2012; Laboratory, 2017; Satyanarayanan, 2017). Motivated by these observations, we propose Multi-Level Local SGD (MLL-SGD), a distributed learning algorithm for heterogeneous multi-level networks. Specifically, we consider a two-level network structure. The lower level consists of a disjoint set of hub-and-spoke sub-networks, each with a single hub server and a set of workers. The upper level network consists of a connected, but not necessarily complete, hub network by which the hubs communicate. For example, in a Fog Computing application, the sub-network workers may be edge devices connected to their local data center, and the data centers act as hubs communicating over a decentralized network. Each sub-network runs one or more Local SGD rounds, in which its workers train for a local training period, followed by model averaging at the sub-network's hub. Periodically, the hubs average their models with neighbors in the hub network.
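The periodic averaging among neighboring hubs can be sketched as one mixing step over the hub graph. The doubly stochastic mixing matrix and the three-hub example below are illustrative assumptions; the paper's analysis covers general connected hub networks:

```python
import numpy as np

def hub_mixing_step(hub_models, W):
    """One upper-level round: each hub replaces its model with a weighted
    average of its neighbors' models, given by a doubly stochastic mixing
    matrix W over the (connected, not necessarily complete) hub graph."""
    X = np.stack(hub_models)  # shape: (num_hubs, model_dim)
    return list(W @ X)        # row i is hub i's new model

# hypothetical network of 3 hubs; W is symmetric and doubly stochastic,
# so mixing preserves the global average of the hub models
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
models = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
mixed = hub_mixing_step(models, W)
```

Because `W` is doubly stochastic, repeated mixing steps drive the hub models toward their common average while each step only requires neighbor-to-neighbor communication.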
We model heterogeneous workers using a stochastic approach: each worker executes a local training iteration in each time step with a probability proportional to its computational resources. Thus, different workers may take different numbers of gradient steps within each local training period. Note that since MLL-SGD averages at the end of every local training period, regardless of how many gradient steps each worker takes, slow workers do not slow down algorithm execution. We prove the convergence of MLL-SGD for smooth and potentially non-convex loss functions. We assume data is distributed in an IID manner across all workers. Further, we analyze the relationship between the convergence error and algorithm parameters and find that, for a fixed step size, the error is quadratic in the number of local training iterations and in the number of sub-network training iterations, and linear in the average worker operating rate. Our algorithm and analysis are general enough to encompass several variations of SGD as special cases, including classical SGD (Amari, 1993), SGD with weighted workers (McMahan et al., 2017), and Decentralized Local SGD with an arbitrary hub communication network (Wang & Joshi, 2018). Our work provides a novel analysis of a distributed learning algorithm in a multi-level network model with heterogeneous workers. The specific contributions of this paper are as follows. 1) We formalize the multi-level network model with heterogeneous workers, and we define the MLL-SGD algorithm for training models in such a network. 2) We provide theoretical analysis of the convergence guarantees of MLL-SGD with heterogeneous workers. 3) We present an experimental evaluation that highlights our theoretical convergence guarantees. The experiments show that in multi-level networks, MLL-SGD achieves a marked improvement in convergence rate over algorithms that do not exploit the network hierarchy.
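The stochastic worker model described above can be illustrated with a short sketch of one local training period. The specific operating rates, step budget, and toy objective are hypothetical choices for illustration:

```python
import numpy as np

def heterogeneous_local_period(grad, w, rates, slots=10, lr=0.05, seed=1):
    """One local training period of `slots` time steps: worker i takes a
    gradient step in each slot with probability rates[i] (its operating
    rate), so faster workers take more steps. The hub then averages all
    worker models regardless of how many steps each took, so slow
    workers never stall the round."""
    rng = np.random.default_rng(seed)
    worker_models = []
    for p in rates:
        v = np.array(w, dtype=float)
        for _ in range(slots):
            if rng.random() < p:  # worker is active in this time step
                v -= lr * grad(v)
        worker_models.append(v)
    return np.mean(worker_models, axis=0)

# toy objective f(w) = ||w||^2 / 2 (gradient is w); one fast, one medium,
# and one slow worker with hypothetical operating rates
w_next = heterogeneous_local_period(lambda v: v, [4.0], rates=[0.9, 0.5, 0.2])
```

In expectation, a worker with rate `p` contributes `p * slots` gradient steps per period, which is why the analysis exposes a linear dependence on the average worker operating rate.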
Further, when workers have heterogeneous operating rates, MLL-SGD converges more quickly than algorithms that require all workers to execute the same number of training steps in each local training period. The rest of the paper is structured as follows. In Section 2, we discuss related work. Section 3 introduces the system model and problem formulation. We describe MLL-SGD in Section 4, and we present our main theoretical results in Section 5. Proofs of these results are deferred to the appendix. We provide experimental results in Section 6. Finally, we conclude in Section 7.

2. RELATED WORK

Distributed SGD is a well-studied subject in Machine Learning. Zinkevich et al. (2010) introduced parallel SGD in a hub-and-spoke model. Variations on Local SGD in the hub-and-spoke model have been studied in several works (Moritz et al., 2016; Zhang et al., 2016; McMahan et al., 2017). Many works have provided convergence bounds for SGD within this model (Wang et al., 2019b; Li et al., 2019). There is also a large body of work on decentralized approaches for optimization using gradient-based methods, dual averaging, and deep learning (Tsitsiklis et al., 1986; Jin et al., 2016; Wang et al., 2019a). These previous works, however, do not address a multi-level network structure. In practice, workers may be heterogeneous in nature, which means that they may execute training iterations at different rates. Lian et al. (2017) addressed this heterogeneity by defining a gossip-based asynchronous SGD algorithm. In Stich (2019), workers are modeled to take gradient steps at an arbitrary subset of all iterations. However, neither of these works addresses a multi-level network model. Grouping-SGD (Jiang et al., 2019) considers a scenario where workers can be clustered into groups, for example, based on their operating rates. Workers within a group train in a synchronous manner, while the training across different groups may be asynchronous. The system model differs

