MULTI-LEVEL LOCAL SGD: DISTRIBUTED SGD FOR HETEROGENEOUS HIERARCHICAL NETWORKS

Abstract

We propose Multi-Level Local SGD, a distributed stochastic gradient method for learning a smooth, non-convex objective in a multi-level communication network with heterogeneous workers. Our network model consists of a set of disjoint subnetworks, each with a single hub and multiple workers; further, workers may operate at different rates. The hubs exchange information with one another via a connected, but not necessarily complete, communication network. In our algorithm, subnetworks execute a distributed SGD algorithm using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, the hub network topology, and the number of local, subnetwork, and global iterations. We illustrate the effectiveness of our algorithm in a multi-level network with slow workers via simulation-based experiments.

1. INTRODUCTION

Stochastic Gradient Descent (SGD) is a key algorithm in modern Machine Learning and optimization (Amari, 1993). To support distributed data and reduce training time, Zinkevich et al. (2010) introduced a distributed form of SGD. Traditionally, distributed SGD is run within a hub-and-spoke network model: a central parameter server (hub) coordinates with worker nodes. At each iteration, the hub sends a model to the workers. The workers each train on their local data, taking a gradient step, and then return their locally trained models to the hub to be averaged. Distributed SGD can be an efficient training mechanism when message latency between the hub and workers is low, allowing gradient updates to be transmitted quickly at each iteration. However, as noted in Moritz et al. (2016), message transmission latency is often high in distributed settings, which causes a large increase in overall training time. A practical way to reduce this communication overhead is to allow the workers to take multiple local gradient steps before communicating their local models to the hub. This form of distributed SGD is referred to as Local SGD (Lin et al., 2018; Stich, 2019). There is a large body of work that analyzes the convergence of Local SGD and the benefits of multiple local training rounds (McMahan et al., 2017; Wang & Joshi, 2018; Li et al., 2019).

Local SGD is not applicable to all scenarios. Workers may be heterogeneous in terms of their computing capabilities, and thus the time required for local training is not uniform. For this reason, it can be either costly or impossible for workers to train in a fully synchronous manner, as stragglers may hold up global computation. However, the vast majority of previous work uses a synchronous model, where all clients train for the same number of rounds before sending updates to the hub (Dean et al., 2012; Ho et al., 2013; Cipar et al., 2013).
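To make the hub-and-spoke Local SGD pattern described above concrete, the following is a minimal simulation sketch on a toy least-squares problem. All names (`make_shard`, `local_steps`, the choice of four workers, step size, and batch size) are illustrative assumptions, not part of the algorithm as specified in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_shard(n=50, d=5):
    """Generate one worker's local data: noisy linear measurements of an all-ones model."""
    A = rng.normal(size=(n, d))
    b = A @ np.ones(d) + 0.01 * rng.normal(size=n)
    return A, b

shards = [make_shard() for _ in range(4)]  # four heterogeneous workers' data

def local_steps(x, A, b, H=10, lr=0.01):
    """One worker takes H local stochastic gradient steps before communicating."""
    for _ in range(H):
        i = rng.integers(len(b), size=8)           # sample a mini-batch of rows
        g = A[i].T @ (A[i] @ x - b[i]) / len(i)    # stochastic gradient of least squares
        x = x - lr * g
    return x

x = np.zeros(5)  # the hub's global model
for _ in range(100):  # communication rounds
    # Hub sends x to all workers; each trains locally on its own shard.
    local_models = [local_steps(x.copy(), A, b) for A, b in shards]
    # Workers return their models; the hub averages them.
    x = np.mean(local_models, axis=0)
```

With `H=1` this reduces to classical synchronous distributed SGD (communication every step); larger `H` trades per-round communication for local computation, which is the trade-off Local SGD exploits.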
Further, most works assume a hub-and-spoke model, but this does not capture many real-world settings. For example, devices in an ad-hoc network may not all be able to communicate with a central hub in a single hop due to network or communication range limitations. In such settings, a multi-level communication network model may be beneficial. In flying ad-hoc networks (FANETs), a network architecture has been proposed to improve scalability by partitioning the UAVs into mission areas (Bekmezci et al., 2013). Here, clusters of UAVs have their own clusterheads, or hubs, and these hubs communicate through an upper-level network, e.g., via satellite. Multi-level networks have also been utilized in Fog and Edge computing, a paradigm de-

