MULTI-LEVEL LOCAL SGD: DISTRIBUTED SGD FOR HETEROGENEOUS HIERARCHICAL NETWORKS

Abstract

We propose Multi-Level Local SGD, a distributed stochastic gradient method for learning a smooth, non-convex objective in a multi-level communication network with heterogeneous workers. Our network model consists of a set of disjoint subnetworks, with a single hub and multiple workers; further, workers may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete, communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We illustrate the effectiveness of our algorithm in a multi-level network with slow workers via simulation-based experiments.

1. INTRODUCTION

Stochastic Gradient Descent (SGD) is a key algorithm in modern Machine Learning and optimization (Amari, 1993). To support distributed data as well as reduce training time, Zinkevich et al. (2010) introduced a distributed form of SGD. Traditionally, distributed SGD is run within a hub-and-spoke network model: a central parameter server (hub) coordinates with worker nodes. At each iteration, the hub sends a model to the workers. The workers each train on their local data, taking a gradient step, then return their locally trained models to the hub to be averaged. Distributed SGD can be an efficient training mechanism when message latency between the hub and workers is low, allowing gradient updates to be transmitted quickly at each iteration. However, as noted in Moritz et al. (2016), message transmission latency is often high in distributed settings, which causes a large increase in overall training time. A practical way to reduce this communication overhead is to allow the workers to take multiple local gradient steps before communicating their local models to the hub. This form of distributed SGD is referred to as Local SGD (Lin et al., 2018; Stich, 2019). There is a large body of work that analyzes the convergence of Local SGD and the benefits of multiple local training rounds (McMahan et al., 2017; Wang & Joshi, 2018; Li et al., 2019). Local SGD is not applicable to all scenarios, however. Workers may be heterogeneous in terms of their computing capabilities, and thus the time required for local training is not uniform. For this reason, it can be either costly or impossible for workers to train in a fully synchronous manner, as stragglers may hold up global computation. Yet the vast majority of previous work uses a synchronous model, where all clients train for the same number of rounds before sending updates to the hub (Dean et al., 2012; Ho et al., 2013; Cipar et al., 2013).
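To make the hub-and-spoke round concrete, the following is a minimal sketch of Local SGD on a toy least-squares problem; the quadratic loss, worker count, step size, and all variable names are our illustrative choices, not from the paper:

```python
import numpy as np

def local_sgd_round(hub_model, worker_data, tau=4, eta=0.1):
    """One hub-and-spoke Local SGD round on a toy least-squares problem.

    Each worker copies the hub model, takes tau full-batch gradient steps
    on its own data, and the hub averages the returned models.
    """
    updated = []
    for A, b in worker_data:
        x = hub_model.copy()
        for _ in range(tau):
            grad = A.T @ (A @ x - b) / len(b)   # gradient of (1/2m)||Ax - b||^2
            x -= eta * grad
        updated.append(x)
    return np.mean(updated, axis=0)             # hub averages the worker models

rng = np.random.default_rng(0)
x_true = rng.normal(size=3)
workers = [(A, A @ x_true) for A in (rng.normal(size=(20, 3)) for _ in range(4))]

x = np.zeros(3)
for _ in range(50):                             # 50 communication rounds
    x = local_sgd_round(x, workers)
```

Because every worker's data is consistent with the same `x_true`, the averaged model converges to it; with heterogeneous worker speeds, the inner loop would not finish simultaneously across workers, which is the straggler problem discussed above.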
Further, most works assume a hub-and-spoke model, but this does not capture many real-world settings. For example, devices in an ad-hoc network may not all be able to communicate with a central hub in a single hop due to network or communication range limitations. In such settings, a multi-level communication network model may be beneficial. In flying ad-hoc networks (FANETs), a network architecture has been proposed to improve scalability by partitioning the UAVs into mission areas (Bekmezci et al., 2013). Here, clusters of UAVs have their own clusterheads, or hubs, and these hubs communicate through an upper-level network, e.g., via satellite. Multi-level networks have also been utilized in Fog and Edge computing, a paradigm designed to improve data aggregation and analysis in wireless sensor networks, autonomous vehicles, power systems, and more (Bonomi et al., 2012; Laboratory, 2017; Satyanarayanan, 2017). Motivated by these observations, we propose Multi-Level Local SGD (MLL-SGD), a distributed learning algorithm for heterogeneous multi-level networks. Specifically, we consider a two-level network structure. The lower level consists of a disjoint set of hub-and-spoke sub-networks, each with a single hub server and a set of workers. The upper level consists of a connected, but not necessarily complete, hub network by which the hubs communicate. For example, in a Fog Computing application, the sub-network workers may be edge devices connected to their local data center, and the data centers act as hubs communicating over a decentralized network. Each sub-network runs one or more Local SGD rounds, in which its workers train for a local training period, followed by model averaging at the sub-network's hub. Periodically, the hubs average their models with neighbors in the hub network.
We model heterogeneous workers using a stochastic approach; each worker executes a local training iteration in each time step with a probability proportional to its computational resources. Thus, different workers may take different numbers of gradient steps within each local training period. Note that since MLL-SGD averages after every local training period, regardless of how many gradient steps each worker takes, slow workers do not slow down algorithm execution. We prove the convergence of MLL-SGD for smooth and potentially non-convex loss functions. We assume data is distributed in an IID manner to all workers. Further, we analyze the relationship between the convergence error and the algorithm parameters and find that, for a fixed step size, the error is quadratic in the number of local training iterations and the number of sub-network training iterations, and linear in the average worker operating rate. Our algorithm and analysis are general enough to encompass several variations of SGD as special cases, including classical SGD (Amari, 1993), SGD with weighted workers (McMahan et al., 2017), and Decentralized Local SGD with an arbitrary hub communication network (Wang & Joshi, 2018). Our work provides a novel analysis of a distributed learning algorithm in a multi-level network model with heterogeneous workers. The specific contributions of this paper are as follows. 1) We formalize the multi-level network model with heterogeneous workers, and we define the MLL-SGD algorithm for training models in such a network. 2) We provide theoretical analysis of the convergence guarantees of MLL-SGD with heterogeneous workers. 3) We present an experimental evaluation that highlights our theoretical convergence guarantees. The experiments show that in multi-level networks, MLL-SGD achieves a marked improvement in convergence rate over algorithms that do not exploit the network hierarchy.
Further, when workers have heterogeneous operating rates, MLL-SGD converges more quickly than algorithms that require all workers to execute the same number of training steps in each local training period. The rest of the paper is structured as follows. In Section 2, we discuss related work. Section 3 introduces the system model and problem formulation. We describe MLL-SGD in Section 4, and we present our main theoretical results in Section 5. Proofs of these results are deferred to the appendix. We provide experimental results in Section 6. Finally, we conclude in Section 7.

2. RELATED WORK

Distributed SGD is a well-studied subject in Machine Learning. Zinkevich et al. (2010) introduced parallel SGD in a hub-and-spoke model. Variations on Local SGD in the hub-and-spoke model have been studied in several works (Moritz et al., 2016; Zhang et al., 2016; McMahan et al., 2017). Many works have provided convergence bounds for SGD within this model (Wang et al., 2019b; Li et al., 2019). There is also a large body of work on decentralized approaches to optimization using gradient-based methods, dual averaging, and deep learning (Tsitsiklis et al., 1986; Jin et al., 2016; Wang et al., 2019a). These previous works, however, do not address a multi-level network structure. In practice, workers may be heterogeneous in nature, which means that they may execute training iterations at different rates. Lian et al. (2017) addressed this heterogeneity by defining a gossip-based asynchronous SGD algorithm. In Stich (2019), workers are modeled as taking gradient steps at an arbitrary subset of all iterations. However, neither of these works addresses a multi-level network model. Grouping-SGD (Jiang et al., 2019) considers a scenario where workers can be clustered into groups, for example, based on their operating rates. Workers within a group train in a synchronous manner, while the training across different groups may be asynchronous. The system model differs from ours.

3. SYSTEM MODEL AND PROBLEM FORMULATION

Let the model parameters be denoted by x ∈ R^n. Our goal is to find an x that minimizes the following objective function over the training set:

F(x) = (1/|S|) Σ_{s∈S} f(x; s)

where f(·) is the loss function. The workers collaboratively minimize this loss function, in part by executing local iterations of SGD over their training sets. For each executed local iteration, a worker samples a mini-batch of data uniformly at random from its local data. Let ξ be a randomly sampled mini-batch of data and let g(x; ξ) = (1/|ξ|) Σ_{s∈ξ} ∇f(x; s) be the mini-batch gradient. For simplicity, we use g(x) instead of g(x; ξ) from here on.
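The unbiasedness of the mini-batch gradient g(x; ξ) as an estimate of ∇F(x) can be checked numerically; in the sketch below the scalar toy loss f(x; s), the dataset, and all parameter choices are our own, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=(200, 2))        # dataset: each row s = (feature a, label y)
x = np.array([0.5])

def grad_f(x, s):
    """Gradient of the toy loss f(x; s) = 0.5 * (x*a - y)^2 with respect to x."""
    a, y = s
    return np.array([(x[0] * a - y) * a])

# Full gradient: the average of per-sample gradients over the whole set S.
full_grad = np.mean([grad_f(x, s) for s in S], axis=0)

# Mini-batch gradient g(x; xi): average over a uniformly sampled mini-batch.
# Averaged over many draws of xi, it should match the full gradient.
est, n_trials = np.zeros(1), 5000
for _ in range(n_trials):
    xi = rng.choice(len(S), size=20, replace=False)
    est += np.mean([grad_f(x, S[i]) for i in xi], axis=0)
est /= n_trials
```

Since each mini-batch is drawn uniformly, every sample is equally likely to appear, which is exactly what the unbiasedness assumption (1c below) formalizes.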
Assumption 1. The objective function and the mini-batch gradients satisfy the following:
1a. The objective function F : R^n → R is continuously differentiable, and the gradient is Lipschitz with constant L > 0, i.e., ‖∇F(x) − ∇F(y)‖₂ ≤ L‖x − y‖₂ for all x, y ∈ R^n.
1b. The function F is lower bounded, i.e., F(x) ≥ F_inf > −∞ for all x ∈ R^n.
1c. The mini-batch gradients are unbiased, i.e., E_{ξ|x}[g(x)] = ∇F(x) for all x ∈ R^n.
1d. There exist scalars β ≥ 0 and σ ≥ 0 such that E_{ξ|x}‖g(x) − ∇F(x)‖₂² ≤ β‖∇F(x)‖₂² + σ² for all x ∈ R^n.

Assumption 1a requires that the gradients do not change too rapidly, and Assumption 1b requires that our objective function is lower bounded by some F_inf. Assumptions 1c and 1d assume that the local data at each worker can be used as an unbiased estimate for the full dataset with the same bounded variance. These assumptions are common in convergence analysis of SGD algorithms (e.g., Bottou et al. (2018)).

Algorithm 1: Multi-Level Local SGD
4:  parallel for i ∈ M^(d) do
5:      x^(i)_k ← y^(d)_k                              ▷ Workers receive updated model from hub
6:      for j = k, ..., k + τ − 1 do
7:          x^(i)_{j+1} ← x^(i)_j − η g^(i)_j          ▷ Local iteration (probabilistic)
8:      end for
9:  end parallel for
10: z^(d) ← Σ_{i∈M^(d)} v^(i) x^(i)_{k+τ}             ▷ Hub d computes weighted average of its workers' models
11: if k mod q·τ = 0 then
12:     y^(d)_{k+1} ← Σ_{j∈N(d)} H_{j,d} z^(j)         ▷ Hub d averages its model with neighboring hubs
13: else
14:     y^(d)_{k+1} ← z^(d)
15: end if
16: end parallel for
17: end for

4. ALGORITHM

We now present our Multi-Level Local SGD (MLL-SGD) algorithm. The pseudocode is shown in Algorithm 1. Each sub-network trains in parallel and, periodically, the hubs average their models with neighboring hubs. The steps corresponding to Local SGD are shown in lines 5-10. Each hub and worker stores a copy of the model. For worker i ∈ M^(d), we denote its copy of the local model by x^(i). We denote the model at hub d by y^(d). The hub first sends its model to its workers, and the workers update their local models to match their hub's model. Workers then execute multiple local training iterations, shown in line 7, to refine their local models independently. To represent the different rates of computation at each worker, we use a probabilistic approach. We assume that, in expectation, a worker i executes τ^(i) local iterations for every τ time steps (τ^(i) ≤ τ). We thus define the N-vector p, where each entry p_i = τ^(i)/τ is the probability with which worker i executes a local gradient step in each iteration k. Worker i updates its local model at iteration k as follows:

x^(i)_{k+1} = x^(i)_k − η g^(i)_k    (2)

where η is the step size and g^(i)_k is a random variable such that g^(i)_k = g(x^(i)_k) with probability p_i and g^(i)_k = 0 with probability 1 − p_i. After τ time steps, the hub updates its model based on the models of its workers (line 10). For each worker i, we assign a positive weight w^(i). Let v^(i) be the weight for worker i normalized within its sub-network: v^(i) = w^(i) / Σ_{j∈M^(d(i))} w^(j), where d(i) denotes the sub-network of worker i. Each hub updates its model to be a weighted average over the workers' models in its sub-network:

y^(d) = Σ_{i∈M^(d)} v^(i) x^(i).    (3)

Weights may be assigned for different reasons. If all worker gradients are treated equally, then w^(i) = 1 and v^(i) = 1/N^(d(i)). We may also weight a worker's gradient proportional to its local dataset size, in which case w^(i) = |S^(i)| and v^(i) = |S^(i)| / Σ_{r∈M^(d(i))} |S^(r)|.
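The probabilistic local iterations and the weighted sub-network average (lines 5-10 of Algorithm 1) might be simulated as in the sketch below; the quadratic toy gradient, parameter values, and function names are our illustrative choices:

```python
import numpy as np

def subnetwork_round(y_hub, grads, p, v, tau=8, eta=0.05, rng=None):
    """One Local SGD round inside a sub-network (lines 5-10 of Algorithm 1).

    Each worker i starts from the hub model y_hub; in each of the tau time
    steps it takes a gradient step with probability p[i] (otherwise g = 0).
    The hub then forms the v-weighted average of the worker models.
    """
    rng = rng or np.random.default_rng()
    models = [y_hub.copy() for _ in p]
    for _ in range(tau):
        for i, x in enumerate(models):
            if rng.random() < p[i]:                  # worker i active this step
                x -= eta * grads[i](x)
    return sum(vi * x for vi, x in zip(v, models))   # weighted hub average

rng = np.random.default_rng(0)
y0 = np.array([4.0, 4.0])
grads = [lambda x: x, lambda x: x]   # toy quadratic loss: minimum at the origin
out = subnetwork_round(y0, grads, p=[1.0, 0.5], v=[0.5, 0.5], rng=rng)
```

Note that the round always takes τ time steps regardless of how many gradient steps the slower worker completes, which is why stragglers do not delay MLL-SGD.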
The latter approach is used in Federated Averaging (McMahan et al., 2017). After q iterations of Local SGD in each sub-network (q·τ time steps), the hubs average their models with their neighbors in the hub communication network (line 12). The weight assigned to each hub's model is defined by a D × D matrix H so that:

y^(d) = Σ_{j∈N(d)} H_{j,d} y^(j).    (4)

Define the total weight in the network to be w_tot = Σ_{i∈M} w^(i). Let b be a D-vector with each component d given by b_d = (Σ_{i∈M^(d)} w^(i)) / w_tot. We assume H meets the following requirements.

Assumption 2. The matrix H satisfies the following:
2a. If (i, j) ∈ E, then H_{i,j} > 0. Otherwise, H_{i,j} = 0.
2b. H is column stochastic, i.e., Σ_{i=1}^D H_{i,j} = 1.
2c. For all hubs i and j, we have b_i H_{i,j} = b_j H_{j,i}.

Assumption 2 implies that H has one as a simple eigenvalue, with corresponding right eigenvector b and left eigenvector 1_D, and that all of its other eigenvalues have magnitude strictly less than 1 (since G is connected) (Rotaru & Nägeli, 2004). By defining H in this way, we ensure that the contributions from the workers' gradients at each hub are incorporated in proportion to the workers' weights. This weighted averaging approach allows us to naturally extend Federated Averaging to the multi-level network model.
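When the hub weights b are uniform, one standard way to build an H satisfying Assumption 2 is the Metropolis construction sketched below; the paper does not prescribe a specific construction, so this is one illustrative choice:

```python
import numpy as np

def metropolis_matrix(adj):
    """Hub matrix H from Metropolis weights for a given hub adjacency matrix.

    With uniform hub weights b, Assumption 2 asks for a column-stochastic H
    that is positive on edges and satisfies b_i H_ij = b_j H_ji; any
    symmetric doubly stochastic matrix with positive edge entries works.
    """
    D = len(adj)
    deg = adj.sum(axis=1)
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            if i != j and adj[i, j]:
                H[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        H[i, i] = 1.0 - H[i].sum()   # self-weight makes each row sum to one
    return H

# Path graph on 4 hubs: the sparse topology used in Section 6.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
H = metropolis_matrix(adj)
```

Symmetry of H together with uniform b gives the balance condition 2c directly; connectivity of the path graph keeps all non-principal eigenvalue magnitudes strictly below one.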

5. ANALYSIS

We note that hubs are essentially stateless in MLL-SGD, as the hub models are copied to all workers after each sub-network or hub averaging step. Thus, our analysis focuses on how worker models evolve. We first present an equivalent formulation of the MLL-SGD algorithm in terms of the evolution of the worker models. We then present our main result on the convergence of MLL-SGD. The system behavior can be summarized by the following update rule for the worker models:

X_{k+1} = (X_k − η G_k) T_k    (5)

where X_k = [x^(1)_k, ..., x^(N)_k] is an n × N matrix, G_k = [g^(1)_k, ..., g^(N)_k] is an n × N matrix, and T_k is a time-varying N × N operator that captures the three stages in MLL-SGD: local iterations, hub-and-spoke averaging within each sub-network, and averaging across the hub network. We define T_k as follows:

T_k = Z if k mod qτ = 0; V if k mod τ = 0 and k mod qτ ≠ 0; I otherwise.    (6)

For local iterations, T_k = I, as there are no interactions between workers or hubs. For sub-network averaging, V is an N × N block diagonal matrix, with each block V^(d) corresponding to a single sub-network d. The matrix V^(d) is an N^(d) × N^(d) matrix where each entry is V^(d)_{i,j} = v^(i). Finally, we define an N × N matrix Z that captures the sub-network averaging and hub network averaging in one operation that involves all workers. The components of Z are given by:

Z_{i,j} = H_{d(i),d(j)} v^(i).    (7)

Let a be an N-vector with each component a_i = w^(i)/w_tot representing the weight of worker i, normalized over all worker weights. We observe that Z and V satisfy the following: each has a right eigenvector a and left eigenvector 1^T_N with eigenvalue 1, and all other eigenvalues have magnitude strictly less than 1. The proof of these properties can be found in the appendix. These properties are necessary (but not sufficient) to ensure that the worker models converge to a consensus model in which each worker's updates have been incorporated according to the worker's weight.
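The operator T_k and the stated eigenvector properties of Z and V can be checked on a toy configuration; the two-hub, two-workers-per-hub setup below, and all its numbers, are our illustrative choices:

```python
import numpy as np

# Toy configuration: two hubs, two workers per hub, uniform worker weights.
H = np.array([[0.5, 0.5],
              [0.5, 0.5]])                 # complete 2-hub network
d = [0, 0, 1, 1]                           # d(i): sub-network of worker i
v = np.array([0.5, 0.5, 0.5, 0.5])         # weights normalized per sub-network
a = np.array([0.25, 0.25, 0.25, 0.25])     # weights normalized over all workers
N = len(d)

Z = np.array([[H[d[i], d[j]] * v[i] for j in range(N)] for i in range(N)])
V = np.array([[v[i] if d[i] == d[j] else 0.0 for j in range(N)]
              for i in range(N)])

def T(k, q=2, tau=4):
    """Mixing operator T_k: Z at hub-averaging steps, V at sub-network steps."""
    if k % (q * tau) == 0:
        return Z
    if k % tau == 0:
        return V
    return np.eye(N)
```

Numerically, Z a = a and 1ᵀ Z = 1ᵀ (and likewise for V), matching the eigenvector properties claimed above.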
As is common, we study an averaged model over all workers in the system (Yuan et al., 2016; Wang & Joshi, 2018). Specifically, we define a weighted average model u_k = X_k a. We identify the recurrence relation for u_k by multiplying both sides of (5) by a:

X_{k+1} a = (X_k − η G_k) T_k a    (9)
u_{k+1} = u_k − η G_k a    (10)
u_{k+1} = u_k − η Σ_{i=1}^N a_i g^(i)_k    (11)

where (10) follows from a being a right eigenvector of V and Z with eigenvalue 1. We note that u_k is updated via a stochastic gradient descent step using a weighted average of several mini-batch gradients. Since F(·) may be non-convex, SGD may converge to a local minimum or saddle point. Thus, we study the gradients of u_k as k increases. We next provide the main theoretical result of the paper.

Theorem 1. Under Assumptions 1 and 2, if η satisfies the following for all i ∈ M:

(4p_i − p_i² − 2) ≥ ηL(a_i p_i (β + 1) − a_i p_i² + p_i²) + 8L²η²q²τ²Γ    (12)

where Γ = ζ/(1 − ζ²) + 2/(1 − ζ) + ζ/(1 − ζ)² and ζ = max{|λ₂(H)|, |λ_D(H)|}, then the expected square norm of the average model gradient, averaged over K iterations, is bounded as follows:

E[(1/K) Σ_{k=1}^K ‖∇F(u_k)‖₂²] ≤ 2(F(x₁) − F_inf)/(ηK) + σ²ηL Σ_{i=1}^N a_i² p_i
  + 4L²η²σ² q³τ³ (1/(qτ) − 1/K) (ζ²/(1 − ζ²) + 2ζ/(1 − ζ) + 1/(1 − ζ)²) P
  + 4L²η²σ² ((2 − ζ)/(1 − ζ)) (τ²(q − 1)(2q + 1)/6 + (τ − 1)(2τ + 1)/6) P    (13)

which, as K → ∞, converges to

σ²ηL Σ_{i=1}^N a_i² p_i
  + 4L²η²σ² q²τ² (ζ²/(1 − ζ²) + 2ζ/(1 − ζ) + 1/(1 − ζ)²) P
  + 4L²η²σ² ((2 − ζ)/(1 − ζ)) (τ²(q − 1)(2q + 1)/6 + (τ − 1)(2τ + 1)/6) P    (14)

where P = Σ_{i=1}^N a_i p_i. The proof of Theorem 1 is provided in the appendix. The first term in (13) is the same as in centralized SGD (Bottou et al., 2018). As K → ∞, this term goes to zero. The second term is similar to centralized SGD as well: if the stochastic gradients have high variance, then the convergence error will be larger.
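The network-dependent quantities ζ and Γ from the theorem are straightforward to compute for a given hub matrix; the sketch below contrasts a complete and a path hub network (the hub count, the uniform hub weights, and the particular path matrix are our toy choices):

```python
import numpy as np

def zeta(H):
    """zeta: second-largest eigenvalue magnitude of the hub matrix H."""
    mags = np.sort(np.abs(np.linalg.eigvals(H)))
    return mags[-2]

def gamma(z):
    """Gamma appearing in the step-size condition of Theorem 1."""
    return z / (1 - z**2) + 2 / (1 - z) + z / (1 - z)**2

# Complete vs. path hub networks on 4 hubs with uniform hub weights.
complete = np.full((4, 4), 0.25)
path = np.array([[2/3, 1/3, 0.0, 0.0],
                 [1/3, 1/3, 1/3, 0.0],
                 [0.0, 1/3, 1/3, 1/3],
                 [0.0, 0.0, 1/3, 2/3]])
```

A fully connected hub network gives ζ = 0, while the path graph gives a ζ strictly between 0 and 1, so Γ (and hence the step-size restriction and error terms) grows as the hub network becomes sparser.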
This term is also related to the convergence error in distributed SGD (Bottou et al., 2018), which is equivalent to MLL-SGD when there is one sub-network, q = τ = 1, a_i = 1/N, and p_i = 1 for all i. MLL-SGD has a dependence on the probabilities of gradient steps and the worker weights, replacing the 1/N in the equivalent term in distributed SGD. The third and fourth terms in (13) are additive errors that depend on the topology of the hub network. The value of ζ is given by the second largest eigenvalue of H, by magnitude, which is an indication of the sparsity of the hub network. When worker weights are uniform, a fully connected hub graph G will have ζ = 0, while a sparse G will typically have ζ close to 1. It is interesting to note that ζ depends only on H, and not on Z or V, meaning the convergence error does not depend on how worker weights are distributed within sub-networks. We also note that the third and fourth terms depend on P, the weighted average probability of the workers. The convergence error increases as the average worker operating rate increases. This relation is expected, as more local iterations will increase convergence error (Wang & Joshi, 2018). It is also interesting to note that the convergence error does not depend on the distribution of p, meaning that a skewed distribution and a uniform distribution with the same average probability would have the same convergence error. We observe that the condition on η in (12) cannot always be satisfied for certain probabilities. Specifically, when there exists a p_i ≤ 2 − √2 ≈ 0.59, the left-hand side will be non-positive, and the inequality can no longer be satisfied. Although this may be a conservative bound, intuitively, when the p_i's are below this threshold, the algorithm may not make sufficient progress in each time step to guarantee convergence. The third and fourth terms also grow with q, the number of sub-network rounds between hub averaging steps, and τ, the number of local iterations between sub-network averaging steps.
The longer workers train locally without reconciling their models, the more their models will diverge, leading to larger convergence error. We can see that τ plays a slightly larger role in convergence error than q. For a given q·τ, meaning a given number of time steps between hub averaging steps, a larger τ leads to higher convergence error than a larger q would. Thus, there is a slight penalty for performing more local iterations between sub-network averaging steps. We explore this more in Section 6. We note that when setting a_i = 1/N and p_i = 1 for all workers i, and setting q = 1, MLL-SGD reduces to Cooperative SGD. However, the bound in Theorem 1 differs from that of Cooperative SGD. Specifically, Theorem 1 has error terms dependent on τ² as opposed to τ in Cooperative SGD. This discrepancy is due to accommodating all possible values of p_i. More details can be found in Appendix C.4. In the following corollary, we analyze the convergence rate of Algorithm 1 when η = 1/(L√K).

Corollary 1. Let η = 1/(L√K) and let q²τ² ≤ √K. If qτ < K, then

E[(1/K) Σ_{k=1}^K ‖∇F(u_k)‖₂²] ≤ O((L/√K)(F(x₁) − F_inf)) + O(σ²/√K).

Under the conditions given in Corollary 1, MLL-SGD achieves the same asymptotic convergence rate as Local SGD and HL-SGD.

6. EXPERIMENTS

In this section, we show the performance of MLL-SGD compared to algorithms that do not account for hierarchy and heterogeneous worker rates. We also explore the impact of the different algorithm parameters that appear in Theorem 1. We use the EMNIST (Cohen et al., 2017) and CIFAR-10 (Krizhevsky et al., 2009) datasets. For all experiments, we provide results for training a simple Convolutional Neural Network (CNN) on EMNIST and training ResNet-18 on CIFAR-10. The CNN has two convolutional layers and two fully connected layers. We train the CNN with a step size of 0.01. For ResNet, we use a standard approach of changing the step size from 0.1 to 0.01 to 0.001 over the course of training (He et al., 2016). We conduct experiments using PyTorch 1.4.0 and Python 3. We compare MLL-SGD with Distributed SGD, Local SGD, and HL-SGD. Distributed SGD is equivalent to MLL-SGD when there is one hub, q = τ = 1, and a_i = 1/N and p_i = 1 for all i, which means Distributed SGD averages all worker models at every iteration. Thus, we use Distributed SGD as a baseline for convergence error and accuracy in some experiments. Local SGD is equivalent to MLL-SGD when a_i = 1/N and p_i = 1 for all i, the hub network is fully connected, and q = 1. HL-SGD extends Local SGD to allow q > 1. For all experiments, we let τ = 32 for Local SGD. We let qτ = 32 for all HL-SGD and MLL-SGD variations to be comparable with Local SGD. In all experiments, we measure the training loss and test accuracy of the averaged model u_k every 32 iterations. We first explore the effect of different values of τ and q in MLL-SGD. We configure a multi-level network with a fully connected hub network and with 10 hubs, each with 10 workers. We use two configurations for MLL-SGD, one with τ = 8 and q = 4, and one with τ = 4 and q = 8. Distributed SGD and Local SGD treat the hubs as pass-throughs, averaging all workers every iteration and every τ iterations, respectively.
Workers are split into five groups of 20 workers each. Each group is assigned a percentage of the full dataset: 5%, 10%, 20%, 25%, and 40%. Workers within a group partition the data evenly. The worker weights are assigned based on dataset sizes. In Figures 1a and 1c we plot the training loss, and in Figures 1b and 1d we plot the test accuracy for the CNN and ResNet, respectively. We observe that as q increases, while keeping qτ = 32, MLL-SGD improves and approaches the Distributed SGD baseline. Thus, increasing the number of sub-network training rounds improves the convergence behavior of MLL-SGD. The benefit is more pronounced in training ResNet on CIFAR. We next investigate how the number and sizes of the sub-networks impact the convergence of MLL-SGD. We distribute a pool of 100 workers across 5, 10, and 20 sub-networks. The hub network is a path graph, which yields the largest ζ while keeping the network connected. This hub network topology represents the worst-case scenario in terms of the convergence bound. Note that ζ becomes larger as the number of hubs increases. We let a_i = 1/N and p_i = 1 for all workers i. We set q = 4 and τ = 8. We also include results using Local SGD with 1 hub and 100 workers. The results of this experiment are shown in Figures 2 and 3. In the case of the CNN, the difference in training loss is minimal among the MLL-SGD variations. In the case of ResNet, we can see that as the number of hubs increases, the convergence rate decreases. This is in line with Theorem 1, since an increased number of hubs corresponds to an increased ζ. Interestingly, despite the low hub network connectivity, MLL-SGD outperforms Local SGD. This shows that MLL-SGD still benefits from a hierarchy even when hub connectivity is sparse. Next, we explore the impact of different distributions of worker operating rates. According to Theorem 1, the average probability across workers plays a role in the error bound.
To see if this holds in practice, we compare four different MLL-SGD setups, all of which include a complete hub network with 10 hubs, each with 10 workers, a_i = 1/N, and an average probability amongst workers of 0.55: (i) all workers with p_i = 0.55 (Fixed); (ii) workers in each sub-network with probabilities ranging from 0.1 to 1 in steps of 0.1 (Uniform Distribution); (iii) 90 workers with p_i = 0.5 and 10 workers with p_i = 1 (Skewed 1); (iv) 90 workers with p_i = 0.6 and 10 workers with p_i = 0.1 (Skewed 2). We include a case where all workers have p_i = 1 as a baseline (Prob=1). In Figures 4 and 5 we can see that in all cases except the baseline, the convergence rate is similar in both models. This is in line with our theoretical results, since all cases have the same average worker probability. Finally, we compare the convergence time of MLL-SGD against algorithms that wait for slower workers: Local SGD and HL-SGD. We simulate real time with time slots. In every time slot, each worker takes a gradient step with probability p_i. Note that when p_i = 1 for a worker i, the number of gradient steps taken will match the number of time slots T. Otherwise, the number of gradient steps taken will be T·p_i in expectation. MLL-SGD waits τ time slots before averaging worker models in a sub-network, regardless of the number of gradient steps taken, while Local SGD and HL-SGD wait for all workers to take τ gradient steps. This approach allows us to compare the progress of each algorithm over time. In this experiment, we set p_i = 0.9 for 90% of the workers and p_i = 0.6 for 10% of the workers. As in the previous experiments, we use a multi-level network with a fully connected hub network and with 10 hubs, each with 10 workers. We study MLL-SGD with two parameter settings, τ = 32, q = 1 and τ = 8, q = 4. We also include results for Local SGD and HL-SGD.
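The time-slot model can be sketched directly: in each slot, worker i steps with probability p_i, so after T slots it has completed about T·p_i steps. The simulation parameters below mirror the rates used in this experiment; the slot count and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 10_000                                # simulated time slots
p = np.array([0.9] * 9 + [0.6])           # nine fast workers, one slow worker

# In each time slot, worker i takes a gradient step with probability p[i];
# after T slots, the number of completed steps concentrates around T * p[i].
steps = (rng.random((T, len(p))) < p).sum(axis=0)
```

Under this model, MLL-SGD closes each local training period after τ slots no matter how many steps each worker completed, whereas Local SGD and HL-SGD must wait for the slowest worker to finish τ steps, which is what the wall-clock comparison in this section measures.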
By comparing MLL-SGD with τ = 32, q = 1 against Local SGD, we can evaluate the impact of using a local training period based on time rather than on a number of worker iterations. By comparing MLL-SGD with τ = 8, q = 4 against HL-SGD, we can evaluate this impact in a multi-level network. In Figures 6a and 6c, we plot the training loss, and in Figures 6b and 6d, we plot the test accuracy for the CNN and ResNet, respectively. We can see that MLL-SGD with q = 1 converges more quickly, in both loss and accuracy, than Local SGD, and that MLL-SGD with q = 4 converges more quickly than HL-SGD. These trends hold in both the CNN and ResNet models. The results show that in this experimental setup, waiting for slow workers is detrimental to the overall convergence time.

7. CONCLUSION

We have introduced MLL-SGD, a variation of Distributed SGD in a multi-level network model. Our algorithm incorporates the heterogeneity of worker devices using a stochastic approach. We provide theoretical analysis of the algorithm's convergence, and we show how the convergence error depends on the average worker rate, the hub network topology, and the number of local, sub-network averaging, and hub averaging steps. Finally, we provide experimental results that illustrate the effectiveness of MLL-SGD over Local SGD and HL-SGD. In future work, we plan to analyze the effects of non-IID data on convergence error.

A CODE REPOSITORY

The code used in our experiments can be found at: https://github.com/rpi-nsl/MLL-SGD. This code simulates a multi-level network with heterogeneous workers, and trains a model using MLL-SGD.

B ADDITIONAL EXPERIMENTS

Our experiments in Section 6 explore how changing MLL-SGD parameters affects training on a non-convex function. In this section, we show the results of the same experiments on a convex loss function. We train a logistic regression model on the MNIST dataset (Bottou et al., 1994). We train a binary classification model, with half the classes labeled 0 and the other half labeled 1, and use a step size of 0.2. We run all experiments for 32,000 iterations. We rerun our first experiment from Figure 1 with logistic regression trained on MNIST. Figures 7a and 7b show the training loss and test accuracy, respectively. As with the non-convex functions, we can see that MLL-SGD with larger q approaches the Distributed SGD baseline. We rerun our second experiment, comparing different hub and worker distributions, with logistic regression trained on MNIST. Figure 8 shows the training loss. The three variations of MLL-SGD do not show much difference in terms of convergence rate, indicating that ζ has little effect in this case. However, they still outperform Local SGD due to q being larger. We rerun our third experiment, comparing different distributions of worker operating rates, with logistic regression trained on MNIST. Figure 9 shows the training loss. As with the non-convex functions, all MLL-SGD variations with the same average probability have similar convergence rates. We rerun our experiment from Figure 6 with logistic regression trained on MNIST. Figures 10a and 10b show the training loss and test accuracy, respectively. We can see an improvement in the convergence rate of MLL-SGD over both Local SGD and HL-SGD.

C PROOF OF THEOREM 1

For our proof we adopt a similar approach to that in Wang & Joshi (2018) . This section is structured as follows. We first define some notation and make some observations in Section C.1. Our supporting lemmas are stated in Section C.2. We close with the full proof of Theorem 1 in Section C.3.

C.1 PRELIMINARIES

For simplicity of notation, we let ‖·‖ denote the ℓ₂ vector norm. Let the weighted Frobenius norm of an N × M matrix X with respect to an N-vector a be defined as follows:

‖X‖²_{F_a} = Tr((diag(a))^{1/2} X X^T (diag(a))^{1/2}) = Σ_{i=1}^N Σ_{j=1}^M a_i |x_{i,j}|².    (15)

The matrix operator norm for a square matrix Q is defined as ‖Q‖_op = √(λ_max(Q^T Q)). We define the set of Bernoulli random variables Θ_k = {θ¹_k, ..., θ^N_k}, where θ^i_k = 1 with probability p_i and θ^i_k = 0 with probability 1 − p_i. Let Ξ_k = {ξ^(1)_k, ..., ξ^(N)_k} be the set of mini-batches used by the N workers at time step k. Without loss of generality, we assign a mini-batch to each worker, even if it does not execute a gradient step in that iteration. An equivalent definition of g^(i)_k is then g^(i)_k = θ^i_k g(x^(i)_k; ξ^(i)_k). For simplicity of notation, let E_k be equivalent to E_{Θ_k, Ξ_k | X_k}. We note that Assumption 1c implies:

E_k[g^(i)_k] = p_i E_k[g(x^(i)_k)] = p_i ∇F(x^(i)_k).

Further, when i ≠ j:

E_k[(g^(i)_k)^T g^(j)_k] = p_i p_j E_k[g(x^(i)_k)]^T E_k[g(x^(j)_k)] = p_i p_j ∇F(x^(i)_k)^T ∇F(x^(j)_k).

We also note that Assumption 1d implies:

E_k ‖g^(i)_k − ∇F(x^(i)_k)‖² = E_k [‖g^(i)_k‖² + ‖∇F(x^(i)_k)‖² − 2(g^(i)_k)^T ∇F(x^(i)_k)]    (23)
= E_k ‖g^(i)_k‖² + ‖∇F(x^(i)_k)‖² − 2 E_k[g^(i)_k]^T ∇F(x^(i)_k)    (24)
= p_i E_k ‖g(x^(i)_k)‖² + ‖∇F(x^(i)_k)‖² − 2 p_i E_k[g(x^(i)_k)]^T ∇F(x^(i)_k)    (25)
= p_i E_k ‖g(x^(i)_k)‖² + p_i ‖∇F(x^(i)_k)‖² − 2 p_i E_k[g(x^(i)_k)]^T ∇F(x^(i)_k) + (1 − p_i) ‖∇F(x^(i)_k)‖²    (26)
= p_i E_k ‖g(x^(i)_k) − ∇F(x^(i)_k)‖² + (1 − p_i) ‖∇F(x^(i)_k)‖²    (27)
≤ p_i β ‖∇F(x^(i)_k)‖² + p_i σ² + (1 − p_i) ‖∇F(x^(i)_k)‖²    (28)
= (p_i(β − 1) + 1) ‖∇F(x^(i)_k)‖² + p_i σ².    (29)

Finally, we define the weighted average stochastic gradient and the weighted average batch gradient as:

G_k = Σ_{i=1}^N a_i g^(i)_k,    H_k = Σ_{i=1}^N a_i ∇F(x^(i)_k).
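The trace form and the entry-wise form of the weighted Frobenius norm can be checked against each other numerically; the matrix dimensions and weight vector below are our toy choices:

```python
import numpy as np

def weighted_fro_sq(X, a):
    """Weighted squared Frobenius norm of X, computed via the trace form."""
    Da = np.diag(np.sqrt(a))
    return np.trace(Da @ X @ X.T @ Da)

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 5))
a = np.array([0.1, 0.2, 0.3, 0.4])

# Entry-wise form: sum over i, j of a_i |x_ij|^2.
entrywise = sum(a[i] * X[i, j] ** 2 for i in range(4) for j in range(5))
```

The agreement follows because conjugating XXᵀ by diag(a)^{1/2} scales the i-th diagonal entry of XXᵀ, which is the i-th row's squared norm, by a_i.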

C.2 LEMMAS AND PROPOSITIONS

Next, we state our supporting lemmas and propositions.

Proposition 1. The matrices $Z$ and $V$ satisfy the following properties:
1. $Z$ and $V$ each have a right eigenvector $a$ with eigenvalue 1.
2. $Z$ and $V$ each have a left eigenvector $\mathbf{1}_N^T$ with eigenvalue 1.
3. All other eigenvalues of $Z$ and $V$ have magnitude strictly less than 1.

Proof. Assumption 2 indicates that $H$ is a Generalized Diffusion Matrix as defined in Rotaru & Nägeli (2004).

Recall Assumption 2:

Assumption 2. The matrix $H$ satisfies the following:
2a. If $(i,j) \in E$, then $H_{i,j} > 0$. Otherwise, $H_{i,j} = 0$.
2b. $H$ is column stochastic, i.e., $\sum_{i=1}^{D} H_{i,j} = 1$.
2c. For all hubs $i, j$, we have $H_{i,j}\, b_j = H_{j,i}\, b_i$.

If we show that these properties imply that $Z$ and $V$ are Generalized Diffusion Matrices, with the analogous properties holding for the weight vector $a$, then the properties in the proposition are satisfied. Since $H$ and $b$ are non-negative, $Z$ is also non-negative. It is also clear that $Z$ is column stochastic by construction. It remains to prove that:
$$Z_{i,j}\, a_j = Z_{j,i}\, a_i. \qquad (31)$$
Applying the definition of $Z$ to the left-hand side, we have $Z_{i,j}\, a_j = H_{d(i),d(j)}\, v^{(i)} a_j$. Since $H$ is a Generalized Diffusion Matrix with vector $b$, we know that
$$H_{i,j}\, b_j = H_{j,i}\, b_i, \qquad\text{i.e.,}\qquad H_{i,j} = H_{j,i}\, \frac{b_i}{b_j}. \qquad (32\text{–}33)$$
Substituting this for $H_{d(i),d(j)}$, we have:
$$\begin{aligned}
Z_{i,j}\, a_j &= H_{d(j),d(i)}\, \frac{b_{d(i)}}{b_{d(j)}}\, v^{(i)} a_j & (34)\\
&= H_{d(j),d(i)}\, \frac{\sum_{r \in M(d(i))} w^{(r)}}{w_{tot}} \cdot \frac{w_{tot}}{\sum_{r \in M(d(j))} w^{(r)}} \cdot \frac{w^{(i)}}{\sum_{r \in M(d(i))} w^{(r)}} \cdot \frac{w^{(j)}}{w_{tot}} & (35)\\
&= H_{d(j),d(i)}\, \frac{w^{(i)}}{w_{tot}} \cdot \frac{w^{(j)}}{\sum_{r \in M(d(j))} w^{(r)}} & (36)\\
&= H_{d(j),d(i)}\, v^{(j)} a_i & (37)\\
&= Z_{j,i}\, a_i. & (38)
\end{aligned}$$
Therefore, $Z$ is a Generalized Diffusion Matrix. We can show that $V$ is also a Generalized Diffusion Matrix with vector $a$. $V$ is non-negative and column stochastic by construction. It remains to prove that $V_{i,j}\, a_j = V_{j,i}\, a_i$. When $i$ and $j$ lie in different blocks $V^{(d)}$, then $V_{i,j} = V_{j,i} = 0$, so the equation is trivially satisfied. When $i$ and $j$ lie within the same block, in terms of $w$, the claim $V_{i,j}\, a_j = V_{j,i}\, a_i$ (40) reads:
$$\frac{w^{(i)}}{\sum_{r \in M(d(i))} w^{(r)}} \cdot \frac{w^{(j)}}{w_{tot}} = \frac{w^{(j)}}{\sum_{r \in M(d(j))} w^{(r)}} \cdot \frac{w^{(i)}}{w_{tot}}. \qquad (41)$$
Noting that within a block $d(i) = d(j)$, both sides reduce to the same quantity. Therefore, $V$ is a Generalized Diffusion Matrix.

Proposition 2. Given a diffusion matrix $H$ with the properties in Assumption 2, if $Z$ is constructed as
$$Z_{i,j} = H_{d(i),d(j)}\, v^{(i)}, \qquad (43)$$
then the non-zero eigenvalues of $Z$ are exactly the non-zero eigenvalues of $H$, with the same multiplicities, and all remaining eigenvalues of $Z$ are zero.

Proof.
To prove the relationship between the eigenvalues of $Z$ and $H$, we prove the following two points separately:
1. $Z$ and $H$ have the same rank.
2. All non-zero eigenvalues of $H$ are eigenvalues of $Z$ with the same multiplicity.

For the rank of $Z$, consider how each column is constructed. Column $j$ of $Z$ is
$$Z_j = \big[H_{1,d(j)}\, v^{(1)}, \ldots, H_{1,d(j)}\, v^{(N^{(1)})}, H_{2,d(j)}\, v^{(N^{(1)}+1)}, \ldots, H_{D,d(j)}\, v^{(N)}\big]^T. \qquad (44)$$
For two columns $i$ and $j$ with $d(i) = d(j)$, these columns are identical. Therefore, the rank of $Z$ is at most the number of hubs, $D$. Further, the elements of column $j$ of $Z$ are simply scaled elements of column $d(j)$ of $H$, so any linearly dependent columns in $H$ are also linearly dependent in $Z$. Therefore, the two matrices have the same rank.

For the second point, we show that there is a bijective mapping from eigenpairs of $H$ to eigenpairs of $Z$. Let $(\lambda, y)$ be an eigenpair of $H$ with $\lambda \neq 0$, i.e.,

$$H y = \lambda y. \qquad (45)$$

Define the $N$-vector $x$ with components $x_i = v^{(i)} y_{d(i)}$. We will show that $Zx = \lambda x$. Looking at the $i$-th entry of the vector $Zx$, we have
$$(Zx)_i = \sum_{j=1}^N Z_{i,j}\, x_j. \qquad (46)$$
Applying the definitions of $Z$ and $x$, we obtain
$$\begin{aligned}
(Zx)_i &= \sum_{j=1}^N H_{d(i),d(j)}\, v^{(i)} v^{(j)} y_{d(j)} & (47)\\
&= v^{(i)} \sum_{l=1}^D H_{d(i),l}\, y_l \sum_{k \in M(l)} v^{(k)} & (48)\\
&= v^{(i)} \sum_{l=1}^D H_{d(i),l}\, y_l, & (49)
\end{aligned}$$
where (49) follows since $\sum_{k \in M(l)} v^{(k)} = 1$. Note that the $m$-th entry of the vector $Hy$ equals $\sum_{l=1}^D H_{m,l}\, y_l = \lambda y_m$. Applying this equality, we obtain
$$(Zx)_i = v^{(i)} \lambda y_{d(i)} = \lambda x_i. \qquad (50\text{–}51)$$
Therefore, for any eigenpair $(\lambda, y)$ of $H$ with $\lambda \neq 0$, we obtain an eigenpair $(\lambda, x)$ of $Z$. It remains to show that this mapping is a bijection. Suppose eigenvalue $\lambda$ of $H$ has multiplicity $k > 1$, and consider any two of the $k$ eigenpairs, $(\lambda, c)$ and $(\lambda, d)$. Let the corresponding eigenpairs of $Z$ be $(\lambda, e)$ and $(\lambda, f)$. We have $e \neq f$ because $c$ and $d$ are distinct, so there must exist an index $i$ such that $v^{(i)} c_{d(i)} \neq v^{(i)} d_{d(i)}$. Therefore, the mapping from eigenpairs of $H$ to eigenpairs of $Z$ is a bijection.

Proposition 3. Given the definitions of $Z$ and $V$ in Proposition 1, it is the case that
$$Z V = V Z = Z. \qquad (52)$$

Proof. First, we prove that $VZ = Z$. Note that row $i$ of $V$ contains $v^{(i)}$ in the columns of block $d(i)$ and zeros elsewhere. Looking at an arbitrary entry $(i,j)$ of $VZ$, we have:
$$(VZ)_{i,j} = v^{(i)} \sum_{r \in M(d(i))} Z_{r,j} = v^{(i)}\, H_{d(i),d(j)} \sum_{r \in M(d(i))} v^{(r)} = v^{(i)}\, H_{d(i),d(j)} = Z_{i,j}. \qquad (53\text{–}55)$$
Next, we prove that $ZV = Z$. Note that for any row $i$ of $Z$, $Z_{i,j} = Z_{i,k}$ whenever $d(j) = d(k)$, and $V_{r,j} = 0$ unless $d(r) = d(j)$. Therefore:
$$(ZV)_{i,j} = Z_{i,j} \sum_{r=1}^N V_{r,j}. \qquad (56)$$
Since $V$ is column stochastic, $(ZV)_{i,j} = Z_{i,j}$. (57)

Proposition 4. Let $A = a\mathbf{1}^T$. Given our definition of $T_k$ in (6),
$$T_k A = A T_k = A \qquad (58)$$
for all $k$.

Proof. We prove each of the three cases of $T_k$: $I$, $V$, and $Z$. Clearly, $IA = AI = A$. It remains to prove $VA = AV = A$ and $ZA = AZ = A$. We have $ZA = A$ since $a$ is a right eigenvector of $Z$ with eigenvalue 1: $ZA = Za\mathbf{1}^T = a\mathbf{1}^T = A$. Similarly, $AZ = a\mathbf{1}^T Z = a\mathbf{1}^T = A$, since $\mathbf{1}^T$ is a left eigenvector of $Z$. The same holds for $V$.

Lemma 1.
Under Assumptions 1c and 1d, the variance of the weighted average stochastic gradient is bounded as follows:
$$\mathbb{E}_k[\|G_k - H_k\|^2] \le \sum_{i=1}^N a_i^2\big[(p_i(\beta-1)+1)\|\nabla F(x_k^{(i)})\|^2 + p_i\sigma^2\big] + \sum_{l=1}^N \sum_{j \neq l} a_l a_j (1-p_j)(1-p_l)\, \nabla F(x_k^{(j)})^T \nabla F(x_k^{(l)}). \qquad (59)$$

Proof.
$$\begin{aligned}
\mathbb{E}_k[\|G_k - H_k\|^2] &= \mathbb{E}_k\Big\|\sum_{i=1}^N a_i\big(g_k^{(i)} - \nabla F(x_k^{(i)})\big)\Big\|^2 & (60)\\
&= \mathbb{E}_k\Big[\sum_{i=1}^N a_i^2\|g_k^{(i)} - \nabla F(x_k^{(i)})\|^2 + \sum_{l=1}^N \sum_{j \neq l} a_l a_j\big\langle g_k^{(j)} - \nabla F(x_k^{(j)}),\; g_k^{(l)} - \nabla F(x_k^{(l)})\big\rangle\Big] & (61)\\
&= \sum_{i=1}^N a_i^2\, \mathbb{E}_k\|g_k^{(i)} - \nabla F(x_k^{(i)})\|^2 + \sum_{l=1}^N \sum_{j \neq l} a_l a_j\, \mathbb{E}_k\big\langle g_k^{(j)} - \nabla F(x_k^{(j)}),\; g_k^{(l)} - \nabla F(x_k^{(l)})\big\rangle. & (62)
\end{aligned}$$
Looking at the cross-terms in (62):
$$\begin{aligned}
\mathbb{E}_k\big\langle g_k^{(j)} - \nabla F(x_k^{(j)}),\; g_k^{(l)} - \nabla F(x_k^{(l)})\big\rangle &= \mathbb{E}_k[(g_k^{(j)})^T g_k^{(l)}] - \mathbb{E}_k[g_k^{(j)}]^T \nabla F(x_k^{(l)}) - \nabla F(x_k^{(j)})^T\, \mathbb{E}_k[g_k^{(l)}] + \nabla F(x_k^{(j)})^T \nabla F(x_k^{(l)}) & (63)\\
&= (p_j p_l - p_j - p_l + 1)\, \nabla F(x_k^{(j)})^T \nabla F(x_k^{(l)}) & (64)\\
&= (1-p_j)(1-p_l)\, \nabla F(x_k^{(j)})^T \nabla F(x_k^{(l)}). & (65)
\end{aligned}$$
Plugging (65) into (62) we have:
$$\begin{aligned}
\mathbb{E}_k[\|G_k - H_k\|^2] &= \sum_{i=1}^N a_i^2\, \mathbb{E}_k\|g_k^{(i)} - \nabla F(x_k^{(i)})\|^2 + \sum_{l=1}^N \sum_{j \neq l} a_l a_j (1-p_j)(1-p_l)\, \nabla F(x_k^{(j)})^T \nabla F(x_k^{(l)}) & (66)\\
&\le \sum_{i=1}^N a_i^2\big[(p_i(\beta-1)+1)\|\nabla F(x_k^{(i)})\|^2 + p_i\sigma^2\big] + \sum_{l=1}^N \sum_{j \neq l} a_l a_j (1-p_j)(1-p_l)\, \nabla F(x_k^{(j)})^T \nabla F(x_k^{(l)}), & (67)
\end{aligned}$$
where (67) follows from Assumption 1d and (29).

Lemma 2. Under Assumptions 1c and 1d, the squared norm of the weighted average stochastic gradient is bounded by:
$$\mathbb{E}_k\|G_k\|^2 \le \sum_{i=1}^N \big[a_i^2\big(p_i(\beta+1) - p_i^2\big) + a_i p_i^2\big]\|\nabla F(x_k^{(i)})\|^2 + \sum_{i=1}^N a_i^2 p_i \sigma^2.$$

Proof.
$$\begin{aligned}
\mathbb{E}_k\|G_k\|^2 &= \mathbb{E}_k\|G_k - \mathbb{E}_k[G_k]\|^2 + \|\mathbb{E}_k[G_k]\|^2 & (69)\\
&= \mathbb{E}_k\Big\|G_k - \sum_{i=1}^N a_i p_i \nabla F(x_k^{(i)})\Big\|^2 + \Big\|\sum_{i=1}^N a_i p_i \nabla F(x_k^{(i)})\Big\|^2 & (70)\\
&= \mathbb{E}_k\Big\|G_k - \sum_{i=1}^N a_i \nabla F(x_k^{(i)}) + \sum_{i=1}^N a_i(1-p_i)\nabla F(x_k^{(i)})\Big\|^2 + \Big\|\sum_{i=1}^N a_i p_i \nabla F(x_k^{(i)})\Big\|^2. & (71)
\end{aligned}$$
Applying the definition of $G_k$ to (71), we get:
$$\begin{aligned}
\mathbb{E}_k\|G_k\|^2 ={}& \mathbb{E}_k\Big\|\sum_{i=1}^N \big(a_i g_k^{(i)} - a_i\nabla F(x_k^{(i)}) + a_i(1-p_i)\nabla F(x_k^{(i)})\big)\Big\|^2 + \Big\|\sum_{i=1}^N a_i p_i\nabla F(x_k^{(i)})\Big\|^2 & (72)\\
={}& \mathbb{E}_k\sum_{i=1}^N \big\|a_i g_k^{(i)} - a_i\nabla F(x_k^{(i)}) + a_i(1-p_i)\nabla F(x_k^{(i)})\big\|^2 + \Big\|\sum_{i=1}^N a_i p_i\nabla F(x_k^{(i)})\Big\|^2\\
&+ \mathbb{E}_k\sum_{j=1}^N\sum_{l=1, l\neq j}^N a_l a_j\big\langle (g_k^{(j)} - \nabla F(x_k^{(j)})) + (1-p_j)\nabla F(x_k^{(j)}),\; (g_k^{(l)} - \nabla F(x_k^{(l)})) + (1-p_l)\nabla F(x_k^{(l)})\big\rangle. & (73)
\end{aligned}$$
Let the cross-terms in (73) be
$$CR = \mathbb{E}_k\sum_{j=1}^N\sum_{l=1, l\neq j}^N a_l a_j\big\langle (g_k^{(j)} - \nabla F(x_k^{(j)})) + (1-p_j)\nabla F(x_k^{(j)}),\; (g_k^{(l)} - \nabla F(x_k^{(l)})) + (1-p_l)\nabla F(x_k^{(l)})\big\rangle. \qquad (74)$$
We can simplify $CR$ as follows:
$$\begin{aligned}
CR ={}& \sum_{j=1}^N\sum_{l=1, l\neq j}^N a_l a_j\, \mathbb{E}_k\Big[(g_k^{(j)} - \nabla F(x_k^{(j)}))^T(g_k^{(l)} - \nabla F(x_k^{(l)})) + (g_k^{(j)} - \nabla F(x_k^{(j)}))^T(1-p_l)\nabla F(x_k^{(l)})\\
&\qquad + (1-p_j)\nabla F(x_k^{(j)})^T(g_k^{(l)} - \nabla F(x_k^{(l)})) + (1-p_j)(1-p_l)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)})\Big] & (75)\\
={}& \sum_{j=1}^N\sum_{l=1, l\neq j}^N a_l a_j\Big[\mathbb{E}_k\big[(g_k^{(j)} - \nabla F(x_k^{(j)}))^T(g_k^{(l)} - \nabla F(x_k^{(l)}))\big] + \big(\mathbb{E}_k[g_k^{(j)}] - \nabla F(x_k^{(j)})\big)^T(1-p_l)\nabla F(x_k^{(l)})\\
&\qquad + (1-p_j)\nabla F(x_k^{(j)})^T\big(\mathbb{E}_k[g_k^{(l)}] - \nabla F(x_k^{(l)})\big) + (1-p_j)(1-p_l)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)})\Big]. & (76)
\end{aligned}$$
Applying Assumption 1c to (76), we get:
$$\begin{aligned}
CR ={}& \sum_{j=1}^N\sum_{l=1, l\neq j}^N a_l a_j\Big[\mathbb{E}_k\big[(g_k^{(j)} - \nabla F(x_k^{(j)}))^T(g_k^{(l)} - \nabla F(x_k^{(l)}))\big] + (p_j - 1)(1-p_l)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)})\\
&\qquad + (p_l - 1)(1-p_j)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)}) + (1-p_j)(1-p_l)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)})\Big] & (77)\\
={}& \sum_{j=1}^N\sum_{l=1, l\neq j}^N a_l a_j\Big[\mathbb{E}_k\big[(g_k^{(j)} - \nabla F(x_k^{(j)}))^T(g_k^{(l)} - \nabla F(x_k^{(l)}))\big] - (1-p_l)(1-p_j)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)})\Big]. & (78)
\end{aligned}$$
Applying (78) to (73), we have:
$$\begin{aligned}
\mathbb{E}_k\|G_k\|^2 ={}& \mathbb{E}_k\sum_{i=1}^N \big\|a_i g_k^{(i)} - a_i\nabla F(x_k^{(i)}) + a_i(1-p_i)\nabla F(x_k^{(i)})\big\|^2\\
&+ \sum_{j=1}^N\sum_{l=1, l\neq j}^N a_l a_j\Big[\mathbb{E}_k\big[(g_k^{(j)} - \nabla F(x_k^{(j)}))^T(g_k^{(l)} - \nabla F(x_k^{(l)}))\big] - (1-p_l)(1-p_j)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)})\Big] + \Big\|\sum_{i=1}^N a_i p_i\nabla F(x_k^{(i)})\Big\|^2. & (79)
\end{aligned}$$
Expanding the first term in (79), we have:
$$\begin{aligned}
\mathbb{E}_k\|G_k\|^2 ={}& \sum_{i=1}^N \mathbb{E}_k\big\|a_i g_k^{(i)} - a_i\nabla F(x_k^{(i)})\big\|^2 + \sum_{i=1}^N a_i^2(1-p_i)^2\|\nabla F(x_k^{(i)})\|^2\\
&+ \sum_{i=1}^N \underbrace{\mathbb{E}_k\big\langle a_i g_k^{(i)} - a_i\nabla F(x_k^{(i)}),\; a_i(1-p_i)\nabla F(x_k^{(i)})\big\rangle}_{CR1} + \sum_{i=1}^N \underbrace{\mathbb{E}_k\big\langle a_i(1-p_i)\nabla F(x_k^{(i)}),\; a_i g_k^{(i)} - a_i\nabla F(x_k^{(i)})\big\rangle}_{CR2}\\
&+ \sum_{j=1}^N\sum_{l\neq j} a_l a_j\Big[\mathbb{E}_k\big[(g_k^{(j)} - \nabla F(x_k^{(j)}))^T(g_k^{(l)} - \nabla F(x_k^{(l)}))\big] - (1-p_l)(1-p_j)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)})\Big] + \Big\|\sum_{i=1}^N a_i p_i\nabla F(x_k^{(i)})\Big\|^2. & (80)
\end{aligned}$$
We simplify $CR1$:
$$CR1 = \big\langle a_i p_i\nabla F(x_k^{(i)}) - a_i\nabla F(x_k^{(i)}),\; a_i(1-p_i)\nabla F(x_k^{(i)})\big\rangle = \big\langle a_i(p_i - 1)\nabla F(x_k^{(i)}),\; a_i(1-p_i)\nabla F(x_k^{(i)})\big\rangle = -a_i^2(1-p_i)^2\|\nabla F(x_k^{(i)})\|^2. \qquad (81\text{–}83)$$
Similarly, for $CR2$:
$$CR2 = -a_i^2(1-p_i)^2\|\nabla F(x_k^{(i)})\|^2. \qquad (84)$$
Plugging (83) and (84) back into (80):
$$\begin{aligned}
\mathbb{E}_k\|G_k\|^2 ={}& \sum_{i=1}^N \mathbb{E}_k\big\|a_i g_k^{(i)} - a_i\nabla F(x_k^{(i)})\big\|^2 - \sum_{i=1}^N a_i^2(1-p_i)^2\|\nabla F(x_k^{(i)})\|^2\\
&+ \sum_{j=1}^N\sum_{l\neq j} a_l a_j\Big[\mathbb{E}_k\big[(g_k^{(j)} - \nabla F(x_k^{(j)}))^T(g_k^{(l)} - \nabla F(x_k^{(l)}))\big] - (1-p_l)(1-p_j)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)})\Big] + \Big\|\sum_{i=1}^N a_i p_i\nabla F(x_k^{(i)})\Big\|^2. & (85)
\end{aligned}$$
We can simplify by observing that:
$$\mathbb{E}_k\|G_k - H_k\|^2 = \sum_{i=1}^N \mathbb{E}_k\big\|a_i g_k^{(i)} - a_i\nabla F(x_k^{(i)})\big\|^2 + \sum_{j=1}^N\sum_{l\neq j} a_l a_j\, \mathbb{E}_k\big[(g_k^{(j)} - \nabla F(x_k^{(j)}))^T(g_k^{(l)} - \nabla F(x_k^{(l)}))\big], \qquad (86)$$
which gives us:
$$\mathbb{E}_k\|G_k\|^2 = \mathbb{E}_k\|G_k - H_k\|^2 - \sum_{i=1}^N a_i^2(1-p_i)^2\|\nabla F(x_k^{(i)})\|^2 + \Big\|\sum_{i=1}^N a_i p_i\nabla F(x_k^{(i)})\Big\|^2 - \sum_{j=1}^N\sum_{l\neq j} a_l a_j(1-p_l)(1-p_j)\nabla F(x_k^{(j)})^T\nabla F(x_k^{(l)}). \qquad (87)$$
Applying Lemma 1 to (87):
$$\begin{aligned}
\mathbb{E}_k\|G_k\|^2 &\le \sum_{i=1}^N a_i^2\big[(p_i(\beta-1)+1)\|\nabla F(x_k^{(i)})\|^2 + p_i\sigma^2\big] - \sum_{i=1}^N a_i^2(1-p_i)^2\|\nabla F(x_k^{(i)})\|^2 + \Big\|\sum_{i=1}^N a_i p_i\nabla F(x_k^{(i)})\Big\|^2 & (88)\\
&\le \sum_{i=1}^N a_i^2\big[(p_i(\beta-1)+1)\|\nabla F(x_k^{(i)})\|^2 + p_i\sigma^2\big] - \sum_{i=1}^N a_i^2(1-p_i)^2\|\nabla F(x_k^{(i)})\|^2 + \sum_{i=1}^N a_i p_i^2\|\nabla F(x_k^{(i)})\|^2 & (89)\\
&= \sum_{i=1}^N \big[a_i^2\big(p_i(\beta+1) - p_i^2\big) + a_i p_i^2\big]\|\nabla F(x_k^{(i)})\|^2 + \sum_{i=1}^N a_i^2 p_i\sigma^2, & (90)
\end{aligned}$$
where (89) follows from Jensen's inequality.

Lemma 3.
Under Assumption 1c, the expected inner product of the batch gradient and the weighted average stochastic gradient is equal to:
$$\mathbb{E}_k[\langle \nabla F(u_k), G_k \rangle] = \frac{1}{2}\|\nabla F(u_k)\|^2 + \sum_{i=1}^N \frac{a_i}{2}\|p_i \nabla F(x_k^{(i)})\|^2 - \sum_{i=1}^N \frac{a_i}{2}\|\nabla F(u_k) - p_i \nabla F(x_k^{(i)})\|^2. \qquad (91)$$

Proof.
$$\begin{aligned}
\mathbb{E}_k[\langle \nabla F(u_k), G_k \rangle] &= \mathbb{E}_k\Big\langle \nabla F(u_k),\; \sum_{i=1}^N a_i g_k^{(i)} \Big\rangle & (92)\\
&= \Big\langle \nabla F(u_k),\; \sum_{i=1}^N a_i p_i \nabla F(x_k^{(i)}) \Big\rangle & (93)\\
&= \sum_{i=1}^N a_i \big\langle \nabla F(u_k),\; p_i \nabla F(x_k^{(i)}) \big\rangle & (94)\\
&= \sum_{i=1}^N \frac{a_i}{2}\Big(\|\nabla F(u_k)\|^2 + \|p_i \nabla F(x_k^{(i)})\|^2 - \|\nabla F(u_k) - p_i \nabla F(x_k^{(i)})\|^2\Big) & (95)\\
&= \frac{1}{2}\|\nabla F(u_k)\|^2 + \sum_{i=1}^N \frac{a_i}{2}\|p_i \nabla F(x_k^{(i)})\|^2 - \sum_{i=1}^N \frac{a_i}{2}\|\nabla F(u_k) - p_i \nabla F(x_k^{(i)})\|^2, & (96)
\end{aligned}$$
where (93) follows from (20), and (95) follows from the fact that, for arbitrary vectors $y$ and $z$, $2y^T z = \|y\|^2 + \|z\|^2 - \|y - z\|^2$.

Lemma 4. Under Assumption 1, following the update rule given in (5), if all model parameters are initialized at the same $x_1$, the expected weighted average gradient is bounded as follows:
$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{K}\sum_{k=1}^K \|\nabla F(u_k)\|^2\Big] \le{}& \frac{2(F(x_1) - F_{\inf})}{\eta K} + \sigma^2 \eta L \sum_{i=1}^N a_i^2 p_i + \frac{2L^2}{K}\sum_{k=1}^K \mathbb{E}\|X_k(I - A)\|_{F_a}^2\\
&- \frac{1}{K}\sum_{k=1}^K \sum_{i=1}^N a_i\Big[(4p_i - p_i^2 - 2) - \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big)\Big]\mathbb{E}\|\nabla F(x_k^{(i)})\|^2, & (97)
\end{aligned}$$
where $A = a\mathbf{1}^T$.

Proof. By Assumption 1a,
$$\mathbb{E}_k[F(u_{k+1})] - F(u_k) \le \mathbb{E}_k\langle \nabla F(u_k),\; u_{k+1} - u_k \rangle + \frac{L}{2}\mathbb{E}_k\|u_{k+1} - u_k\|^2 = -\eta\, \mathbb{E}_k[\langle \nabla F(u_k), G_k \rangle] + \frac{\eta^2 L}{2}\mathbb{E}_k\|G_k\|^2. \qquad (98\text{–}99)$$
Plugging in Lemmas 2 and 3, we get:
$$\begin{aligned}
\mathbb{E}_k[F(u_{k+1})] - F(u_k) \le{}& -\eta\Big(\frac{1}{2}\|\nabla F(u_k)\|^2 + \sum_{i=1}^N \frac{a_i}{2} p_i^2 \|\nabla F(x_k^{(i)})\|^2 - \sum_{i=1}^N \frac{a_i}{2}\|\nabla F(u_k) - p_i \nabla F(x_k^{(i)})\|^2\Big)\\
&+ \frac{\eta^2 L}{2}\sum_{i=1}^N \big(a_i^2 p_i(\beta+1) - a_i^2 p_i^2 + a_i p_i^2\big)\|\nabla F(x_k^{(i)})\|^2 + \frac{\sigma^2 \eta^2 L}{2}\sum_{i=1}^N a_i^2 p_i & (100)\\
={}& -\frac{\eta}{2}\|\nabla F(u_k)\|^2 + \frac{\eta}{2}\sum_{i=1}^N a_i \|\nabla F(u_k) - p_i \nabla F(x_k^{(i)})\|^2 + \frac{\sigma^2 \eta^2 L}{2}\sum_{i=1}^N a_i^2 p_i\\
&- \frac{\eta}{2}\sum_{i=1}^N a_i\Big[p_i^2 - \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big)\Big]\|\nabla F(x_k^{(i)})\|^2. & (101)
\end{aligned}$$
After some rearranging, we obtain:
$$\|\nabla F(u_k)\|^2 \le \frac{2(F(u_k) - \mathbb{E}_k[F(u_{k+1})])}{\eta} + \sigma^2 \eta L \sum_{i=1}^N a_i^2 p_i + \sum_{i=1}^N a_i \|\nabla F(u_k) - p_i \nabla F(x_k^{(i)})\|^2 - \sum_{i=1}^N a_i\Big[p_i^2 - \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big)\Big]\|\nabla F(x_k^{(i)})\|^2. \qquad (102)$$
Taking the total expectation, averaging over all iterations, and telescoping the first term:
$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{K}\sum_{k=1}^K \|\nabla F(u_k)\|^2\Big] \le{}& \frac{2(F(x_1) - F_{\inf})}{\eta K} + \sigma^2 \eta L \sum_{i=1}^N a_i^2 p_i + \frac{1}{K}\sum_{k=1}^K \sum_{i=1}^N a_i\, \mathbb{E}\|\nabla F(u_k) - p_i \nabla F(x_k^{(i)})\|^2\\
&- \frac{1}{K}\sum_{k=1}^K \sum_{i=1}^N a_i\Big[p_i^2 - \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big)\Big]\mathbb{E}\|\nabla F(x_k^{(i)})\|^2. & (103)
\end{aligned}$$
The third term in (103) can be bounded as:
$$\begin{aligned}
\sum_{i=1}^N a_i\, \mathbb{E}\|\nabla F(u_k) - p_i \nabla F(x_k^{(i)})\|^2 &= \sum_{i=1}^N a_i\, \mathbb{E}\|\nabla F(u_k) - \nabla F(x_k^{(i)}) + (1 - p_i)\nabla F(x_k^{(i)})\|^2 & (104)\\
&\le \sum_{i=1}^N 2a_i\, \mathbb{E}\|\nabla F(u_k) - \nabla F(x_k^{(i)})\|^2 + 2a_i(1 - p_i)^2\, \mathbb{E}\|\nabla F(x_k^{(i)})\|^2 & (105)\\
&\le \sum_{i=1}^N 2a_i L^2\, \mathbb{E}\|u_k - x_k^{(i)}\|^2 + \sum_{i=1}^N 2a_i(1 - p_i)^2\, \mathbb{E}\|\nabla F(x_k^{(i)})\|^2, & (106)
\end{aligned}$$
where (105) follows from the fact that $\|y + z\|^2 \le 2\|y\|^2 + 2\|z\|^2$, and (106) follows from (105) by Assumption 1a. Recalling the definition of the weighted Frobenius norm and the definition of $u_k$, we can simplify the first term in (106):
$$\sum_{i=1}^N 2a_i L^2\, \mathbb{E}\|u_k - x_k^{(i)}\|^2 = 2L^2\, \mathbb{E}\|u_k \mathbf{1}^T - X_k\|_{F_a}^2 = 2L^2\, \mathbb{E}\|X_k a \mathbf{1}^T - X_k\|_{F_a}^2 = 2L^2\, \mathbb{E}\|X_k(I - A)\|_{F_a}^2. \qquad (107\text{–}109)$$
Plugging (106) and (109) back into (103), we obtain:
$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{K}\sum_{k=1}^K \|\nabla F(u_k)\|^2\Big] \le{}& \frac{2(F(x_1) - F_{\inf})}{\eta K} + \sigma^2 \eta L \sum_{i=1}^N a_i^2 p_i + \frac{2L^2}{K}\sum_{k=1}^K \mathbb{E}\|X_k(I - A)\|_{F_a}^2 + \frac{1}{K}\sum_{k=1}^K \sum_{i=1}^N 2a_i(1 - p_i)^2\, \mathbb{E}\|\nabla F(x_k^{(i)})\|^2\\
&- \frac{1}{K}\sum_{k=1}^K \sum_{i=1}^N a_i\Big[p_i^2 - \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big)\Big]\mathbb{E}\|\nabla F(x_k^{(i)})\|^2 & (110)\\
={}& \frac{2(F(x_1) - F_{\inf})}{\eta K} + \sigma^2 \eta L \sum_{i=1}^N a_i^2 p_i + \frac{2L^2}{K}\sum_{k=1}^K \mathbb{E}\|X_k(I - A)\|_{F_a}^2\\
&- \frac{1}{K}\sum_{k=1}^K \sum_{i=1}^N a_i\Big[(4p_i - p_i^2 - 2) - \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big)\Big]\mathbb{E}\|\nabla F(x_k^{(i)})\|^2. & (111)
\end{aligned}$$

Lemma 5. Given the properties of $Z$ and $V$ in Propositions 1 and 2, it is the case that:
$$\|Z^j - A\|_{op} = \zeta^j, \qquad \|V - A\|_{op} = 1, \qquad \|I - A\|_{op} = 1, \qquad (112)$$
where $A = a\mathbf{1}^T$ and $\zeta = \max\{|\lambda_2(H)|, |\lambda_D(H)|\}$.

Proof.
According to the definition of the matrix operator norm,
$$\begin{aligned}
\|Z^j - A\|_{op}^2 &= \lambda_{\max}\big((Z^j - A)^T(Z^j - A)\big) = \lambda_{\max}\big(Z^{2j} - A Z^j - Z^j A + A\big) & (113\text{–}114)\\
&= \lambda_{\max}\big(Z^{2j} - A\big) & (115)\\
&= \lambda_{\max}\big(Z^{2j} - A^{2j}\big) & (116)\\
&= \lambda_{\max}\big((Z - A)^{2j}\big), & (117)
\end{aligned}$$
where (114) follows from $A^j = A$, (115) follows from $A Z = Z A = A$, and (117) follows from the commutability of $Z$ and $A$. Based on Proposition 2, the non-zero eigenvalues of $Z$ are the same as those of $H$. As shown in Lemma 6 of Rotaru & Nägeli (2004), for a matrix $Z$ with the properties in Proposition 1, the spectral norm of $Z - A$ is equal to $\zeta$. Therefore:
$$\|Z^j - A\|_{op} = \sqrt{\zeta^{2j}} = \zeta^j. \qquad (118\text{–}119)$$
Similarly, for $V$:
$$\|V - A\|_{op}^2 = \lambda_{\max}\big((V - A)^T(V - A)\big) = \lambda_{\max}\big(V - A V - V A + A\big) = \lambda_{\max}(V - A), \qquad (120\text{–}122)$$
where the second equality follows from $A^j = A$ and $V^j = V$, and the third from $A V = V A = A$. Note that the eigenvalues of each block $V^{(d)}$ are $N^{(d)} - 1$ zeros and a single one, so the set of eigenvalues of $V$ includes $D$ ones. If $D > 1$, then based on Lemma 6 of Rotaru & Nägeli (2004) and Proposition 1, the spectral norm of $V - A$ is 1, so
$$\|V - A\|_{op} = \sqrt{1} = 1. \qquad (124\text{–}125)$$
Since the eigenvalues of $I$ are all 1, and $I$ commutes with $A$, we can similarly say that $\|I - A\|_{op} = 1$.

Lemma 6. Given two matrices $C \in \mathbb{R}^{N \times M}$ and $D \in \mathbb{R}^{M \times N}$, and an $N$-vector $a$,
$$\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}\, C D\, (\operatorname{diag}(a))^{1/2}\big) \le \|C\|_{F_a} \|D\|_{F_a}.$$

Proof. We denote the $i$-th row of $C$ by $c_i^T$ and the $i$-th column of $D$ by $d_i$. We can rewrite the trace as:
$$\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}\, C D\, (\operatorname{diag}(a))^{1/2}\big) = \sum_{i=1}^N \sum_{j=1}^M a_i C_{i,j} D_{j,i} = \sum_{i=1}^N a_i\, c_i^T d_i. \qquad (129)$$
Squaring (129) and applying the Cauchy–Schwarz inequality:
$$\Big(\sum_{i=1}^N a_i\, c_i^T d_i\Big)^2 \le \Big(\sum_{i=1}^N a_i \|c_i\|^2\Big)\Big(\sum_{i=1}^N a_i \|d_i\|^2\Big) = \Big(\sum_{i=1}^N \sum_{j=1}^M a_i C_{i,j}^2\Big)\Big(\sum_{i=1}^N \sum_{j=1}^M a_i D_{j,i}^2\Big) = \|C\|_{F_a}^2 \|D\|_{F_a}^2. \qquad (130\text{–}132)$$

Lemma 7. Given two matrices $C \in \mathbb{R}^{M \times N}$ and $D \in \mathbb{R}^{N \times N}$, and an $N$-vector $a$,
$$\|C D\|_{F_a} \le \|C\|_{F_a} \|D\|_{op}.$$

Proof. We denote the $i$-th row of $C$ by $c_i^T$ and define the index set $I = \{i \in [1, M] : c_i^T \neq 0\}$.
We can rewrite the squared Frobenius norm as:
$$\begin{aligned}
\|C D\|_{F_a}^2 &= \sum_{i=1}^M \big\|c_i^T D (\operatorname{diag}(a))^{1/2}\big\|^2 = \sum_{i \in I} \big\|c_i^T D (\operatorname{diag}(a))^{1/2}\big\|^2 & (134\text{–}135)\\
&= \sum_{i \in I} \big\|c_i^T (\operatorname{diag}(a))^{1/2}\big\|^2\, \frac{\big\|c_i^T D (\operatorname{diag}(a))^{1/2}\big\|^2}{\big\|c_i^T (\operatorname{diag}(a))^{1/2}\big\|^2} & (136)\\
&\le \sum_{i \in I} \big\|c_i^T (\operatorname{diag}(a))^{1/2}\big\|^2\, \|D\|_{op}^2 = \|C\|_{F_a}^2 \|D\|_{op}^2. & (137\text{–}138)
\end{aligned}$$

C.3 PROOF OF THEOREM 1

We recall Theorem 1.

Theorem 1. Under Assumptions 1 and 2, if $\eta$ satisfies the following for all $i \in M$:
$$(4p_i - p_i^2 - 2) \ge \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big) + 8L^2 \eta^2 q^2 \tau^2 \Gamma, \qquad (12)$$
where $\Gamma = \frac{1}{1-\zeta^2} + \frac{2}{1-\zeta} + \frac{\zeta}{(1-\zeta)^2}$ and $\zeta = \max\{|\lambda_2(H)|, |\lambda_D(H)|\}$, then the expected squared norm of the average model gradient, averaged over $K$ iterations, is bounded as follows:
$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{K}\sum_{k=1}^K \|\nabla F(u_k)\|^2\Big] \le{}& \frac{2(F(x_1) - F_{\inf})}{\eta K} + \sigma^2 \eta L \sum_{i=1}^N a_i^2 p_i + 4L^2 \eta^2 \sigma^2 q^3 \tau^3 \Big(\frac{1}{q\tau} - \frac{1}{K}\Big)\Big(\frac{\zeta^2}{1-\zeta^2} + \frac{2\zeta}{1-\zeta} + \frac{1}{(1-\zeta)^2}\Big) P\\
&+ 4L^2 \eta^2 \sigma^2\, \frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^2(q-1)(2q+1)}{6} + \frac{(\tau-1)(2\tau+1)}{6}\Big) P & (13)\\
\xrightarrow{K \to \infty}\;{}& \sigma^2 \eta L \sum_{i=1}^N a_i^2 p_i + 4L^2 \eta^2 \sigma^2 q^2 \tau^2 \Big(\frac{\zeta^2}{1-\zeta^2} + \frac{2\zeta}{1-\zeta} + \frac{1}{(1-\zeta)^2}\Big) P\\
&+ 4L^2 \eta^2 \sigma^2\, \frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^2(q-1)(2q+1)}{6} + \frac{(\tau-1)(2\tau+1)}{6}\Big) P, & (14)
\end{aligned}$$
where $P = \sum_{i=1}^N a_i p_i$.

We now give the proof of Theorem 1, using Lemmas 2–7.

Proof. Starting from the intermediate result in Lemma 4, we decompose $X_k(I - A)$ using the recursive definition of $X_k$:
$$\begin{aligned}
X_k(I - A) &= (X_{k-1} - \eta G_{k-1})\, T_{k-1}\, (I - A) & (139)\\
&= X_{k-1}(I - A)\, T_{k-1} - \eta G_{k-1}(T_{k-1} - A) & (140)\\
&= \big[X_{k-2}(I - A)\, T_{k-2} - \eta G_{k-2}(T_{k-2} - A)\big] T_{k-1} - \eta G_{k-1}(T_{k-1} - A) & (141\text{–}142)\\
&= X_{k-2}(I - A)\, T_{k-2} T_{k-1} - \eta G_{k-2}(T_{k-2} T_{k-1} - A) - \eta G_{k-1}(T_{k-1} - A), & (143)
\end{aligned}$$
where (140) follows from the commutability of $T_k$ and $A$ by Proposition 4. Continuing this recursion, we end up with:
$$X_k(I - A) = X_1(I - A) \prod_{l=1}^{k-1} T_l - \eta \sum_{s=1}^{k-1} G_s \Big(\prod_{l=s}^{k-1} T_l - A\Big). \qquad (144)$$
Since all workers initialize their models to the same vector, $X_1(I - A) \prod_{l=1}^{k-1} T_l = 0$, and thus we have:
$$\mathbb{E}\|X_k(I - A)\|_{F_a}^2 = \eta^2\, \mathbb{E}\Big\|\sum_{s=1}^{k-1} G_s \Big(\prod_{l=s}^{k-1} T_l - A\Big)\Big\|_{F_a}^2. \qquad (145)$$
Let $k = jq\tau + l\tau + f$, where $j$ is the number of hub network averaging rounds, $l$ is the number of sub-network averaging rounds since the last hub network averaging round, and $f$ is the number of local iterations since the last sub-network averaging round. Define $\Phi_{s,k-1} = \prod_{l'=s}^{k-1} T_{l'}$. Noting that $V^j = V$, and $V Z = Z V = Z$ by Proposition 3, $\Phi_{s,k-1}$ can be expressed as:
$$\Phi_{s,k-1} = \begin{cases} I & jq\tau + l\tau < s < jq\tau + l\tau + f\\ V & jq\tau < s \le jq\tau + l\tau\\ Z & (j-1)q\tau < s \le jq\tau\\ Z^2 & (j-2)q\tau < s \le (j-1)q\tau\\ \;\;\vdots &\\ Z^j & 1 \le s \le q\tau. \end{cases} \qquad (146)$$
For $r < j$, let
$$Y_r = \sum_{s=rq\tau+1}^{(r+1)q\tau} G_s, \qquad Q_r = \sum_{s=rq\tau+1}^{(r+1)q\tau} \nabla F(X_s).$$
We also let
$$Y_{j_1} = \sum_{s=jq\tau+1}^{jq\tau+l\tau} G_s, \qquad Y_{j_2} = \sum_{s=jq\tau+l\tau+1}^{jq\tau+l\tau+f} G_s, \qquad Q_{j_1} = \sum_{s=jq\tau+1}^{jq\tau+l\tau} \nabla F(X_s), \qquad Q_{j_2} = \sum_{s=jq\tau+l\tau+1}^{jq\tau+l\tau+f} \nabla F(X_s).$$
With this in mind, we can split the sum in (145) into batches for each hub network averaging period:
$$\begin{aligned}
\sum_{s=1}^{q\tau} G_s(\Phi_{s,k-1} - A) &= Y_0 (Z^j - A) & (147)\\
\sum_{s=q\tau+1}^{2q\tau} G_s(\Phi_{s,k-1} - A) &= Y_1 (Z^{j-1} - A) & (148)\\
&\;\;\vdots &\\
\sum_{s=(j-1)q\tau+1}^{jq\tau} G_s(\Phi_{s,k-1} - A) &= Y_{j-1} (Z - A) & (149)\\
\sum_{s=jq\tau+1}^{jq\tau+l\tau+f} G_s(\Phi_{s,k-1} - A) &= Y_{j_1}(V - A) + Y_{j_2}(I - A). & (150)
\end{aligned}$$
Summing these together, we get:
$$\sum_{s=1}^{k-1} G_s(\Phi_{s,k-1} - A) = \sum_{r=0}^{j-1} Y_r (Z^{j-r} - A) + Y_{j_1}(V - A) + Y_{j_2}(I - A). \qquad (151)$$
Plugging (151) into (145):
$$\begin{aligned}
\mathbb{E}\|X_k(I - A)\|_{F_a}^2 ={}& \eta^2\, \mathbb{E}\Big\|\sum_{r=0}^{j-1} Y_r (Z^{j-r} - A) + Y_{j_1}(V - A) + Y_{j_2}(I - A)\Big\|_{F_a}^2 & (152)\\
={}& \eta^2\, \mathbb{E}\Big\|\sum_{r=0}^{j-1} (Y_r - Q_r)(Z^{j-r} - A) + (Y_{j_1} - Q_{j_1})(V - A) + (Y_{j_2} - Q_{j_2})(I - A)\\
&\qquad + \sum_{r=0}^{j-1} Q_r (Z^{j-r} - A) + Q_{j_1}(V - A) + Q_{j_2}(I - A)\Big\|_{F_a}^2 & (153)\\
\le{}& \underbrace{2\eta^2\, \mathbb{E}\Big\|\sum_{r=0}^{j-1} (Y_r - Q_r)(Z^{j-r} - A) + (Y_{j_1} - Q_{j_1})(V - A) + (Y_{j_2} - Q_{j_2})(I - A)\Big\|_{F_a}^2}_{T_1}\\
&+ \underbrace{2\eta^2\, \mathbb{E}\Big\|\sum_{r=0}^{j-1} Q_r (Z^{j-r} - A) + Q_{j_1}(V - A) + Q_{j_2}(I - A)\Big\|_{F_a}^2}_{T_2}, & (154)
\end{aligned}$$
where (154) follows from the fact that $\|y + z\|^2 \le 2\|y\|^2 + 2\|z\|^2$.
We first bound $T_1$. Expanding the squared norm:
$$\begin{aligned}
T_1 ={}& 2\eta^2 \sum_{r=0}^{j-1} \mathbb{E}\|(Y_r - Q_r)(Z^{j-r} - A)\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|(Y_{j_1} - Q_{j_1})(V - A)\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|(Y_{j_2} - Q_{j_2})(I - A)\|_{F_a}^2\\
&+ \underbrace{2\eta^2 \sum_{n=0}^{j-1} \sum_{l=0, l \neq n}^{j-1} \mathbb{E}\,\underbrace{\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}(Z^{j-n} - A)^T(Y_n - Q_n)^T(Y_l - Q_l)(Z^{j-l} - A)(\operatorname{diag}(a))^{1/2}\big)}_{TR}}_{TR_0}\\
&+ \underbrace{4\eta^2 \sum_{l=0}^{j-1} \mathbb{E}\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}(V - A)^T(Y_{j_1} - Q_{j_1})^T(Y_l - Q_l)(Z^{j-l} - A)(\operatorname{diag}(a))^{1/2}\big)}_{TR_1}\\
&+ \underbrace{4\eta^2 \sum_{l=0}^{j-1} \mathbb{E}\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}(I - A)^T(Y_{j_2} - Q_{j_2})^T(Y_l - Q_l)(Z^{j-l} - A)(\operatorname{diag}(a))^{1/2}\big)}_{TR_2}\\
&+ \underbrace{4\eta^2\, \mathbb{E}\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}(V - A)^T(Y_{j_1} - Q_{j_1})^T(Y_{j_2} - Q_{j_2})(I - A)(\operatorname{diag}(a))^{1/2}\big)}_{TR_3}. & (155\text{–}156)
\end{aligned}$$
The inner trace $TR$ can be bounded as:
$$\begin{aligned}
TR &\le \|(Z^{j-n} - A)^T(Y_n - Q_n)^T\|_{F_a}\, \|(Y_l - Q_l)(Z^{j-l} - A)\|_{F_a} & (157)\\
&\le \|Z^{j-n} - A\|_{op}\, \|Y_n - Q_n\|_{F_a}\, \|Y_l - Q_l\|_{F_a}\, \|Z^{j-l} - A\|_{op} & (158)\\
&\le \zeta^{2j-n-l}\, \|Y_n - Q_n\|_{F_a}\, \|Y_l - Q_l\|_{F_a} & (159)\\
&\le \tfrac{1}{2}\zeta^{2j-n-l}\big(\|Y_n - Q_n\|_{F_a}^2 + \|Y_l - Q_l\|_{F_a}^2\big), & (160)
\end{aligned}$$
where (157) follows from Lemma 6, (158) follows from Lemma 7, and (159) follows from Lemma 5. We can similarly bound $TR_1$ and $TR_3$:
$$\begin{aligned}
TR_1 &\le 2\eta^2 \sum_{l=0}^{j-1} \zeta^{j-l}\big(\mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 + \mathbb{E}\|Y_l - Q_l\|_{F_a}^2\big) & (161)\\
TR_2 &\le 2\eta^2 \sum_{l=0}^{j-1} \zeta^{j-l}\big(\mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2 + \mathbb{E}\|Y_l - Q_l\|_{F_a}^2\big) & (162)\\
TR_3 &\le 2\eta^2\big(\mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 + \mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2\big). & (163)
\end{aligned}$$
Summing $TR_0$ through $TR_3$:
$$\begin{aligned}
\sum_{t=0}^3 TR_t \le{}& \eta^2 \sum_{n=0}^{j-1}\sum_{l=0, l \neq n}^{j-1} \zeta^{2j-n-l}\big(\mathbb{E}\|Y_n - Q_n\|_{F_a}^2 + \mathbb{E}\|Y_l - Q_l\|_{F_a}^2\big) + 2\eta^2 \sum_{l=0}^{j-1} \zeta^{j-l}\big(\mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 + \mathbb{E}\|Y_l - Q_l\|_{F_a}^2\big)\\
&+ 2\eta^2 \sum_{l=0}^{j-1} \zeta^{j-l}\big(\mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2 + \mathbb{E}\|Y_l - Q_l\|_{F_a}^2\big) + 2\eta^2\, \mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2 & (164)\\
\le{}& 2\eta^2 \sum_{n=0}^{j-1}\sum_{l=0, l \neq n}^{j-1} \zeta^{2j-n-l}\, \mathbb{E}\|Y_n - Q_n\|_{F_a}^2 + 4\eta^2 \sum_{l=0}^{j-1} \zeta^{j-l}\, \mathbb{E}\|Y_l - Q_l\|_{F_a}^2 + 2\eta^2 \sum_{l=0}^{j} \zeta^{j-l}\, \mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 + 2\eta^2 \sum_{l=0}^{j} \zeta^{j-l}\, \mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2 & (165)\\
={}& 2\eta^2 \sum_{n=0}^{j-1} \zeta^{j-n}\, \mathbb{E}\|Y_n - Q_n\|_{F_a}^2 \sum_{l=0, l \neq n}^{j-1} \zeta^{j-l} + 4\eta^2 \sum_{l=0}^{j-1} \zeta^{j-l}\, \mathbb{E}\|Y_l - Q_l\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 \sum_{l=0}^{j} \zeta^{j-l} + 2\eta^2\, \mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2 \sum_{l=0}^{j} \zeta^{j-l}, & (166)
\end{aligned}$$
where (165) follows from the symmetry of the indices $n$ and $l$.
Plugging (166) back into (156):
$$\begin{aligned}
T_1 \le{}& 2\eta^2 \sum_{r=0}^{j-1} \mathbb{E}\|Y_r - Q_r\|_{F_a}^2\, \|Z^{j-r} - A\|_{op}^2 + 2\eta^2\, \mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2\, \|V - A\|_{op}^2 + 2\eta^2\, \mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2\, \|I - A\|_{op}^2 + \sum_{t=0}^3 TR_t & (167)\\
\le{}& 2\eta^2 \sum_{r=0}^{j-1} \zeta^{2(j-r)}\, \mathbb{E}\|Y_r - Q_r\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2\\
&+ 2\eta^2 \sum_{n=0}^{j-1} \zeta^{j-n}\, \mathbb{E}\|Y_n - Q_n\|_{F_a}^2 \sum_{l=0, l \neq n}^{j-1} \zeta^{j-l} + 4\eta^2 \sum_{l=0}^{j-1} \zeta^{j-l}\, \mathbb{E}\|Y_l - Q_l\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 \sum_{l=0}^{j} \zeta^{j-l} + 2\eta^2\, \mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2 \sum_{l=0}^{j} \zeta^{j-l}, & (168)
\end{aligned}$$
where (167) follows from Lemma 7, and (168) follows from Lemma 5. We further bound $T_1$ using the summation formulae of a power series,
$$\sum_{l=0}^{j} \zeta^{j-l} \le \sum_{l=-\infty}^{j} \zeta^{j-l} \le \frac{1}{1-\zeta}, \qquad \sum_{l=0}^{j-1} \zeta^{j-l} \le \sum_{l=-\infty}^{j-1} \zeta^{j-l} \le \frac{\zeta}{1-\zeta}, \qquad (171)$$
which yields:
$$T_1 \le 2\eta^2 \sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big)\mathbb{E}\|Y_r - Q_r\|_{F_a}^2 + 2\eta^2\, \frac{2-\zeta}{1-\zeta}\, \mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 + 2\eta^2\, \frac{2-\zeta}{1-\zeta}\, \mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2. \qquad (169\text{–}170)$$
Taking a closer look at $\mathbb{E}\|Y_r - Q_r\|_{F_a}^2$ for $0 \le r < j$:
$$\begin{aligned}
\mathbb{E}\|Y_r - Q_r\|_{F_a}^2 &= \mathbb{E}\Big\|\sum_{s=rq\tau+1}^{(r+1)q\tau} \big(G_s - \nabla F(X_s)\big)\Big\|_{F_a}^2 = \sum_{i=1}^N a_i\, \mathbb{E}\Big\|\sum_{s=rq\tau+1}^{(r+1)q\tau} \big(g_s^{(i)} - \nabla F(x_s^{(i)})\big)\Big\|^2 & (172\text{–}173)\\
&\le \sum_{i=1}^N a_i\, q\tau \sum_{s=rq\tau+1}^{(r+1)q\tau} \mathbb{E}\big\|g_s^{(i)} - \nabla F(x_s^{(i)})\big\|^2 & (174)\\
&\le q\tau \sum_{i=1}^N a_i \sum_{s=rq\tau+1}^{(r+1)q\tau} (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2 + q^2\tau^2\sigma^2 \sum_{i=1}^N a_i p_i & (175)\\
&= q\tau \sum_{i=1}^N a_i \sum_{s=rq\tau+1}^{(r+1)q\tau} (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2 + q^2\tau^2\sigma^2 P, & (176)
\end{aligned}$$
where (174) follows from Jensen's inequality, and (175) follows from Assumption 1d and (29).
Similarly, for the partial periods $j_1$ and $j_2$:
$$\mathbb{E}\|Y_{j_1} - Q_{j_1}\|_{F_a}^2 \le l\tau \sum_{i=1}^N a_i \sum_{s=jq\tau+1}^{jq\tau+l\tau} (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2 + l^2\tau^2\sigma^2 P \qquad (177)$$
$$\mathbb{E}\|Y_{j_2} - Q_{j_2}\|_{F_a}^2 \le (f-1) \sum_{i=1}^N a_i \sum_{s=jq\tau+l\tau+1}^{jq\tau+l\tau+f-1} (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2 + (f-1)^2\sigma^2 P. \qquad (178)$$
Plugging (176), (177), and (178) into (170), we can bound $T_1$ as follows:
$$\begin{aligned}
T_1 \le{}& 2\eta^2\sigma^2 q^2\tau^2 \sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) P + 2\eta^2\sigma^2\, \frac{2-\zeta}{1-\zeta}\big(l^2\tau^2 + (f-1)^2\big) P\\
&+ 2\eta^2 q\tau \sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) \sum_{s=rq\tau+1}^{(r+1)q\tau} \sum_{i=1}^N a_i (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2\\
&+ 2\eta^2\, \frac{2-\zeta}{1-\zeta}\, l\tau \sum_{i=1}^N a_i \sum_{s=jq\tau+1}^{jq\tau+l\tau} (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2 + 2\eta^2\, \frac{2-\zeta}{1-\zeta}\, (f-1) \sum_{i=1}^N a_i \sum_{s=jq\tau+l\tau+1}^{jq\tau+l\tau+f-1} (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2. & (179)
\end{aligned}$$
Referring back to Lemma 4, our goal is to sum $T_1$ over iterations $k = 1, \ldots, K$. First, we sum over the $j$-th sub-network update period up to the $j$-th hub network averaging, i.e., over $l = 0, \ldots, q-1$ and $f = 1, \ldots, \tau$:
$$\begin{aligned}
\sum_{l=0}^{q-1}\sum_{f=1}^{\tau} T_1 \le{}& 2\eta^2\sigma^2 q^3\tau^3 \sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) P + 2\eta^2\sigma^2\, \frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^3 q(q-1)(2q+1)}{6} + \frac{q\tau(\tau-1)(2\tau+1)}{6}\Big) P\\
&+ 2\eta^2 q^2\tau^2 \sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) \sum_{s=rq\tau+1}^{(r+1)q\tau} \sum_{i=1}^N a_i (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2\\
&+ \eta^2\, \frac{2-\zeta}{1-\zeta}\, q(q-1)\tau^2 \sum_{i=1}^N a_i \sum_{s=jq\tau+1}^{(j+1)q\tau} (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2 + \eta^2\, \frac{2-\zeta}{1-\zeta}\, q^2\tau(\tau-1) \sum_{i=1}^N a_i \sum_{s=jq\tau+1}^{(j+1)q\tau} (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2. & (180)
\end{aligned}$$
Let:
$$\Gamma_r = \zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}. \qquad (181)$$
Note that $\Gamma_j = \frac{3-2\zeta}{1-\zeta} > \frac{2-\zeta}{1-\zeta}$. Using this inequality, we can absorb the last two terms of (180) into a sum of the form $2\eta^2 q^2\tau^2 \sum_{r=0}^{j} \Gamma_r(\cdot)$:
$$\begin{aligned}
\sum_{l=0}^{q-1}\sum_{f=1}^{\tau} T_1 \le{}& 2\eta^2\sigma^2 q^3\tau^3 \sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) P + 2\eta^2\sigma^2\, \frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^3 q(q-1)(2q+1)}{6} + \frac{q\tau(\tau-1)(2\tau+1)}{6}\Big) P\\
&+ 2\eta^2 q^2\tau^2 \sum_{r=0}^{j} \Gamma_r \sum_{s=rq\tau+1}^{(r+1)q\tau} \sum_{i=1}^N a_i (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2. & (182)
\end{aligned}$$

Published as a conference paper at ICLR 2021

Summing (182) over the hub network averaging periods $j = 0, \ldots, K/(q\tau) - 1$, we obtain:
$$\begin{aligned}
\sum_{j=0}^{K/(q\tau)-1}\sum_{l=0}^{q-1}\sum_{f=1}^{\tau} T_1 \le{}& 2\eta^2\sigma^2 q^3\tau^3 \sum_{j=0}^{K/(q\tau)-1}\sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) P + 2\eta^2\sigma^2 K\, \frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^2(q-1)(2q+1)}{6} + \frac{(\tau-1)(2\tau+1)}{6}\Big) P\\
&+ 2\eta^2 q^2\tau^2 \sum_{j=0}^{K/(q\tau)-1}\sum_{r=0}^{j} \Gamma_r \sum_{s=rq\tau+1}^{(r+1)q\tau}\sum_{i=1}^N a_i (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2 & (183)\\
={}& 2\eta^2\sigma^2 q^3\tau^3 \sum_{r=0}^{K/(q\tau)-2}\sum_{j=r+1}^{K/(q\tau)-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) P + 2\eta^2\sigma^2 K\, \frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^2(q-1)(2q+1)}{6} + \frac{(\tau-1)(2\tau+1)}{6}\Big) P\\
&+ 2\eta^2 q^2\tau^2 \sum_{r=0}^{K/(q\tau)-1}\Big(\sum_{j=r}^{K/(q\tau)-1}\Gamma_j\Big)\Big(\sum_{s=rq\tau+1}^{(r+1)q\tau}\sum_{i=1}^N a_i (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_s^{(i)})\|^2\Big). & (184)
\end{aligned}$$
Summing over $\Gamma_j$ using the summation formula
$$\sum_{j=r}^{K/(q\tau)-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) \le \sum_{j=r}^{\infty}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) \le \frac{1}{1-\zeta^2} + \frac{2}{1-\zeta} + \frac{\zeta}{(1-\zeta)^2}, \qquad (185\text{–}186)$$
we let $\Gamma = \frac{1}{1-\zeta^2} + \frac{2}{1-\zeta} + \frac{\zeta}{(1-\zeta)^2}$. We can apply the analogous formula to the first term in (184):
$$\sum_{j=r+1}^{K/(q\tau)-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big) \le \frac{\zeta^2}{1-\zeta^2} + \frac{2\zeta}{1-\zeta} + \frac{1}{(1-\zeta)^2}. \qquad (187\text{–}188)$$
Applying the summation formula in (188), plugging in $\Gamma$, and re-indexing the iterations in terms of $k$, we bound (184) as:
$$\begin{aligned}
\sum_{k=1}^{K} T_1 \le{}& 2\eta^2\sigma^2 q^3\tau^3\Big(\frac{K}{q\tau} - 1\Big)\Big(\frac{\zeta^2}{1-\zeta^2} + \frac{2\zeta}{1-\zeta} + \frac{1}{(1-\zeta)^2}\Big) P + 2\eta^2\sigma^2 K\, \frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^2(q-1)(2q+1)}{6} + \frac{(\tau-1)(2\tau+1)}{6}\Big) P\\
&+ 2\eta^2 q^2\tau^2\, \Gamma \sum_{k=1}^{K}\sum_{i=1}^N a_i (p_i(\beta-1)+1)\, \mathbb{E}\|\nabla F(x_k^{(i)})\|^2. & (189)
\end{aligned}$$
Now we bound $T_2$. Expanding as before:
$$\begin{aligned}
T_2 ={}& 2\eta^2\, \mathbb{E}\Big\|\sum_{r=0}^{j-1} Q_r(Z^{j-r} - A) + Q_{j_1}(V - A) + Q_{j_2}(I - A)\Big\|_{F_a}^2 & (190)\\
={}& 2\eta^2 \sum_{r=0}^{j-1}\mathbb{E}\|Q_r(Z^{j-r} - A)\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|Q_{j_1}(V - A)\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|Q_{j_2}(I - A)\|_{F_a}^2\\
&+ \underbrace{2\eta^2 \sum_{n=0}^{j-1}\sum_{l=0, l\neq n}^{j-1}\mathbb{E}\,\underbrace{\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}(Z^{j-n} - A)^T Q_n^T Q_l (Z^{j-l} - A)(\operatorname{diag}(a))^{1/2}\big)}_{TR'}}_{TR'_0}\\
&+ \underbrace{4\eta^2 \sum_{l=0}^{j-1}\mathbb{E}\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}(V - A)^T Q_{j_1}^T Q_l (Z^{j-l} - A)(\operatorname{diag}(a))^{1/2}\big)}_{TR'_1}\\
&+ \underbrace{4\eta^2 \sum_{l=0}^{j-1}\mathbb{E}\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}(I - A)^T Q_{j_2}^T Q_l (Z^{j-l} - A)(\operatorname{diag}(a))^{1/2}\big)}_{TR'_2}\\
&+ \underbrace{4\eta^2\, \mathbb{E}\operatorname{Tr}\big((\operatorname{diag}(a))^{1/2}(V - A)^T Q_{j_1}^T Q_{j_2}(I - A)(\operatorname{diag}(a))^{1/2}\big)}_{TR'_3}. & (191)
\end{aligned}$$
The inner trace $TR'$ can be bounded as:
$$TR' \le \|(Z^{j-n} - A)^T Q_n^T\|_{F_a}\, \|Q_l(Z^{j-l} - A)\|_{F_a} \le \|Z^{j-n} - A\|_{op}\, \|Q_n\|_{F_a}\, \|Q_l\|_{F_a}\, \|Z^{j-l} - A\|_{op} \le \tfrac{1}{2}\zeta^{2j-n-l}\big(\|Q_n\|_{F_a}^2 + \|Q_l\|_{F_a}^2\big), \qquad (192\text{–}194)$$
where (192) follows from Lemma 6. We can similarly bound $TR'_1$ through $TR'_3$:
$$\begin{aligned}
TR'_1 &\le 2\eta^2 \sum_{l=0}^{j-1}\zeta^{j-l}\big(\mathbb{E}\|Q_{j_1}\|_{F_a}^2 + \mathbb{E}\|Q_l\|_{F_a}^2\big) & (195)\\
TR'_2 &\le 2\eta^2 \sum_{l=0}^{j-1}\zeta^{j-l}\big(\mathbb{E}\|Q_{j_2}\|_{F_a}^2 + \mathbb{E}\|Q_l\|_{F_a}^2\big) & (196)\\
TR'_3 &\le 2\eta^2\, \mathbb{E}\|Q_{j_1}\|_{F_a}^2 + 2\eta^2\, \mathbb{E}\|Q_{j_2}\|_{F_a}^2. & (197)
\end{aligned}$$
Summing $TR'_0$ through $TR'_3$, we get:
$$\begin{aligned}
\sum_{t=0}^{3} TR'_t &\le \eta^2\sum_{n=0}^{j-1}\sum_{l=0,l\neq n}^{j-1}\zeta^{2j-n-l}\big(\mathbb{E}\|Q_n\|_{F_a}^2 + \mathbb{E}\|Q_l\|_{F_a}^2\big) + 4\eta^2\sum_{l=0}^{j-1}\zeta^{j-l}\,\mathbb{E}\|Q_l\|_{F_a}^2 + 2\eta^2\sum_{l=0}^{j}\zeta^{j-l}\,\mathbb{E}\|Q_{j_1}\|_{F_a}^2 + 2\eta^2\sum_{l=0}^{j}\zeta^{j-l}\,\mathbb{E}\|Q_{j_2}\|_{F_a}^2 & (198)\\
&\le 2\eta^2\sum_{n=0}^{j-1}\zeta^{j-n}\,\mathbb{E}\|Q_n\|_{F_a}^2\sum_{l=0,l\neq n}^{j-1}\zeta^{j-l} + 4\eta^2\sum_{l=0}^{j-1}\zeta^{j-l}\,\mathbb{E}\|Q_l\|_{F_a}^2 + 2\eta^2\,\mathbb{E}\|Q_{j_1}\|_{F_a}^2\sum_{l=0}^{j}\zeta^{j-l} + 2\eta^2\,\mathbb{E}\|Q_{j_2}\|_{F_a}^2\sum_{l=0}^{j}\zeta^{j-l}, & (199)
\end{aligned}$$
where (199) follows from the symmetry of the indices $n$ and $l$. Plugging (199) back into (191):
$$\begin{aligned}
T_2 &\le 2\eta^2\sum_{r=0}^{j-1}\mathbb{E}\|Q_r\|_{F_a}^2\,\|Z^{j-r} - A\|_{op}^2 + 2\eta^2\,\mathbb{E}\|Q_{j_1}\|_{F_a}^2\,\|V - A\|_{op}^2 + 2\eta^2\,\mathbb{E}\|Q_{j_2}\|_{F_a}^2\,\|I - A\|_{op}^2 + \sum_{t=0}^3 TR'_t & (201)\\
&\le 2\eta^2\sum_{r=0}^{j-1}\zeta^{2(j-r)}\,\mathbb{E}\|Q_r\|_{F_a}^2 + 2\eta^2\,\mathbb{E}\|Q_{j_1}\|_{F_a}^2 + 2\eta^2\,\mathbb{E}\|Q_{j_2}\|_{F_a}^2 + \sum_{t=0}^3 TR'_t, & (202)
\end{aligned}$$
where (201) follows from Lemma 7, and (202) follows from Lemma 5. We further bound $T_2$ using the summation formulae of a power series in (171):
$$T_2 \le 2\eta^2\sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big)\mathbb{E}\|Q_r\|_{F_a}^2 + 2\eta^2\,\frac{2-\zeta}{1-\zeta}\,\mathbb{E}\|Q_{j_1}\|_{F_a}^2 + 2\eta^2\,\frac{2-\zeta}{1-\zeta}\,\mathbb{E}\|Q_{j_2}\|_{F_a}^2. \qquad (203\text{–}204)$$
After applying the definition of $Q$ to (204), we obtain:
$$T_2 \le 2\eta^2\sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big)\mathbb{E}\Big\|\sum_{s=1}^{q\tau}\nabla F(X_{rq\tau+s})\Big\|_{F_a}^2 + 2\eta^2\,\frac{2-\zeta}{1-\zeta}\,\mathbb{E}\Big\|\sum_{s=1}^{l\tau}\nabla F(X_{jq\tau+s})\Big\|_{F_a}^2 + 2\eta^2\,\frac{2-\zeta}{1-\zeta}\,\mathbb{E}\Big\|\sum_{s=1}^{f-1}\nabla F(X_{jq\tau+l\tau+s})\Big\|_{F_a}^2. \qquad (205)$$
Bounding each squared sum via Jensen's inequality (206), and then summing over all iterates in the $j$-th sub-network update period ($l = 0, \ldots, q-1$ and $f = 1, \ldots, \tau$), we obtain:
$$\begin{aligned}
\sum_{l=0}^{q-1}\sum_{f=1}^{\tau} T_2 &\le 2\eta^2 q^2\tau^2\sum_{r=0}^{j-1}\Big(\zeta^{2(j-r)} + 2\zeta^{j-r} + \frac{\zeta^{j-r+1}}{1-\zeta}\Big)\sum_{s=1}^{q\tau}\mathbb{E}\|\nabla F(X_{rq\tau+s})\|_{F_a}^2\\
&\quad + \eta^2 q\tau(q-1)\,\frac{2-\zeta}{1-\zeta}\sum_{s=1}^{q\tau}\mathbb{E}\|\nabla F(X_{jq\tau+s})\|_{F_a}^2 + \eta^2 q\tau(\tau-1)\,\frac{2-\zeta}{1-\zeta}\sum_{s=1}^{\tau-1}\mathbb{E}\|\nabla F(X_{jq\tau+q\tau+s})\|_{F_a}^2 & (207)\\
&\le 2\eta^2 q^2\tau^2\sum_{r=0}^{j}\Gamma_r\sum_{s=1}^{q\tau}\mathbb{E}\|\nabla F(X_{rq\tau+s})\|_{F_a}^2. & (208)
\end{aligned}$$
Summing over all iterations and applying the summation bound in (186) to (208):
$$\sum_{j=0}^{K/(q\tau)-1}\sum_{l=0}^{q-1}\sum_{f=1}^{\tau} T_2 \le 2\eta^2 q^2\tau^2\,\Gamma\sum_{k=1}^{K}\mathbb{E}\|\nabla F(X_k)\|_{F_a}^2. \qquad (209)$$
Summing $T_1$ and $T_2$, we obtain:
$$\begin{aligned}
\frac{2L^2}{K}\sum_{k=1}^K \mathbb{E}\|X_k(I - A)\|_{F_a}^2 &\le \frac{2L^2}{K}\sum_{k=1}^K T_1 + \frac{2L^2}{K}\sum_{k=1}^K T_2 & (210)\\
&\le 4L^2\eta^2\sigma^2 q^3\tau^3\Big(\frac{1}{q\tau} - \frac{1}{K}\Big)\Big(\frac{\zeta^2}{1-\zeta^2} + \frac{2\zeta}{1-\zeta} + \frac{1}{(1-\zeta)^2}\Big)P + 4L^2\eta^2\sigma^2\,\frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^2(q-1)(2q+1)}{6} + \frac{(\tau-1)(2\tau+1)}{6}\Big)P\\
&\quad + 8L^2\eta^2 q^2\tau^2\,\Gamma\,\frac{1}{K}\sum_{k=1}^K\sum_{i=1}^N a_i(p_i(\beta-1)+1)\,\mathbb{E}\|\nabla F(x_k^{(i)})\|^2. & (211)
\end{aligned}$$
Plugging the bounds on $T_1$ and $T_2$ back into Lemma 4, we arrive at:
$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{K}\sum_{k=1}^K\|\nabla F(u_k)\|^2\Big] \le{}& \frac{2(F(x_1) - F_{\inf})}{\eta K} + \sigma^2\eta L\sum_{i=1}^N a_i^2 p_i + \frac{2L^2}{K}\sum_{k=1}^K T_1 + \frac{2L^2}{K}\sum_{k=1}^K T_2\\
&- \frac{1}{K}\sum_{k=1}^K\sum_{i=1}^N a_i\Big[(4p_i - p_i^2 - 2) - \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big)\Big]\mathbb{E}\|\nabla F(x_k^{(i)})\|^2 & (212)\\
={}& \frac{2(F(x_1) - F_{\inf})}{\eta K} + \sigma^2\eta L\sum_{i=1}^N a_i^2 p_i + 4L^2\eta^2\sigma^2 q^3\tau^3\Big(\frac{1}{q\tau} - \frac{1}{K}\Big)\Big(\frac{\zeta^2}{1-\zeta^2} + \frac{2\zeta}{1-\zeta} + \frac{1}{(1-\zeta)^2}\Big)P\\
&+ 4L^2\eta^2\sigma^2\,\frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^2(q-1)(2q+1)}{6} + \frac{(\tau-1)(2\tau+1)}{6}\Big)P\\
&- \frac{1}{K}\sum_{k=1}^K\sum_{i=1}^N a_i\Big[(4p_i - p_i^2 - 2) - \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big) - 8L^2\eta^2 q^2\tau^2\Gamma\Big]\mathbb{E}\|\nabla F(x_k^{(i)})\|^2. & (213)
\end{aligned}$$
If $\eta$ satisfies the following for $i = 1, \ldots, N$:
$$(4p_i - p_i^2 - 2) \ge \eta L\big(a_i p_i(\beta+1) - a_i p_i^2 + p_i^2\big) + 8L^2\eta^2 q^2\tau^2\Gamma, \qquad (214)$$
then the final term of (213) is non-positive and can be dropped, which simplifies (213) to:
$$\mathbb{E}\Big[\frac{1}{K}\sum_{k=1}^K\|\nabla F(u_k)\|^2\Big] \le \frac{2(F(x_1) - F_{\inf})}{\eta K} + \sigma^2\eta L\sum_{i=1}^N a_i^2 p_i + 4L^2\eta^2\sigma^2 q^3\tau^3\Big(\frac{1}{q\tau} - \frac{1}{K}\Big)\Big(\frac{\zeta^2}{1-\zeta^2} + \frac{2\zeta}{1-\zeta} + \frac{1}{(1-\zeta)^2}\Big)P + 4L^2\eta^2\sigma^2\,\frac{2-\zeta}{1-\zeta}\Big(\frac{\tau^2(q-1)(2q+1)}{6} + \frac{(\tau-1)(2\tau+1)}{6}\Big)P, \qquad (215)$$
which completes the proof.

We note that when setting $a_i = 1/N$ and $p_i = 1$ for all workers $i$, and setting $q = 1$, MLL-SGD reduces to Cooperative SGD (Wang & Joshi, 2018). However, the bound in Theorem 1 differs from the bound for Cooperative SGD: specifically, Theorem 1 has error terms that depend on $\tau^2$ rather than $\tau$. This is due to our formulation of $g_k^{(i)}$; namely, $\mathbb{E}_k[g_k^{(i)}] = p_i\,\mathbb{E}_k[g(x_k^{(i)})] = p_i\nabla F(x_k^{(i)})$.
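Propositions 1–3, and the role of $\zeta$ in Lemma 5, can be sanity-checked numerically on a toy instance. The sketch below is purely illustrative (the worker weights and hub matrix $H$ are our own choices, not from the paper's experiments): it builds $Z$, $V$, and $A$ for a three-hub network with two workers per hub and verifies the claimed spectral properties. Since each hub's total worker weight is equal here, $b$ is uniform, so any symmetric column-stochastic $H$ satisfies Assumption 2c.

```python
import numpy as np

# Toy two-level network: 3 hubs, 2 workers each (illustrative values only).
w = np.array([1.0, 2.0, 2.0, 1.0, 1.5, 1.5])   # worker weights
d = np.array([0, 0, 1, 1, 2, 2])               # worker -> hub assignment
D, N = 3, len(w)

a = w / w.sum()                                 # a_i = w^(i) / w_tot
block = np.array([w[d == h].sum() for h in range(D)])
b = block / w.sum()                             # b_d: relative weight of hub d
v = w / block[d]                                # v^(i) = w^(i) / (weight of hub d(i))

# Hub mixing matrix: symmetric and column stochastic, so with uniform b
# it satisfies H_ij * b_j = H_ji * b_i (Assumption 2c).
H = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

Z = np.array([[H[d[i], d[j]] * v[i] for j in range(N)] for i in range(N)])
V = np.array([[v[i] if d[i] == d[j] else 0.0 for j in range(N)] for i in range(N)])
A = np.outer(a, np.ones(N))

# Proposition 1: a is a right eigenvector, 1^T a left eigenvector, of Z and V.
assert np.allclose(Z @ a, a) and np.allclose(np.ones(N) @ Z, np.ones(N))
assert np.allclose(V @ a, a) and np.allclose(np.ones(N) @ V, np.ones(N))

# Proposition 2: the non-zero eigenvalues of Z are exactly those of H.
assert np.allclose(np.sort_complex(np.linalg.eigvals(Z))[-D:],
                   np.sort_complex(np.linalg.eigvals(H)))

# Proposition 3: ZV = VZ = Z.
assert np.allclose(Z @ V, Z) and np.allclose(V @ Z, Z)

# zeta = max{|lambda_2(H)|, |lambda_D(H)|} is the spectral radius of Z - A.
zeta = np.sort(np.abs(np.linalg.eigvals(H)))[-2]
assert np.isclose(np.max(np.abs(np.linalg.eigvals(Z - A))), zeta)
print("zeta =", zeta)
```

Varying $H$ (e.g., making the hub graph a sparser ring) changes $\zeta$, and through $\Gamma$ the size of the error terms in Theorem 1, which matches the intuition that better-connected hub networks yield tighter bounds.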



The set of workers in sub-network d is denoted M^{(d)}; the set of all workers is M = ∪_{d=1}^{D} M^{(d)}, with |M| = N. Each worker i holds a set S^{(i)} of local training data, and S = ∪_{i=1}^{N} S^{(i)}. The set of all D hubs is denoted C. The hubs communicate with one another via an undirected, connected communication graph G = (C, E). Let N_d = {j | e_{d,j} ∈ E} denote the set of neighbors of the hub of sub-network d in the hub graph G.
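For concreteness, the neighbor sets N_d can be built directly from the hub edge list E. A minimal sketch follows; the ring topology in the example (connected but not complete) is an illustrative assumption, not a topology from the paper.

```python
from collections import defaultdict

def neighbor_sets(num_hubs, edges):
    """Build N_d = {j : e_{d,j} in E} for an undirected hub graph G = (C, E)."""
    nbrs = defaultdict(set)
    for d, j in edges:
        nbrs[d].add(j)   # undirected: each edge contributes to
        nbrs[j].add(d)   # both endpoints' neighbor sets
    return {d: nbrs[d] for d in range(num_hubs)}

# Example: D = 4 hubs connected in a ring.
E = [(0, 1), (1, 2), (2, 3), (3, 0)]
N = neighbor_sets(4, E)
print(N[0])  # -> {1, 3}
```

Since G is undirected, membership is symmetric: j ∈ N_d if and only if d ∈ N_j.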

1: . . . for d = 1, . . . , D
2: for k = 1, . . . , K do
3:   parallel for d = 1, . . . , D do
4:     . . .
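The algorithm listing above is fragmentary. As a reading aid, the following is a minimal sketch of the MLL-SGD control flow as described in the paper: workers take local SGD steps (each active with probability p_i per time slot), each hub averages its workers' models every τ local steps, and every q sub-network rounds the hubs average with their neighbors in the hub graph. The quadratic loss, uniform neighbor averaging, and all numeric parameters are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def mll_sgd(hub_nbrs, p, eta=0.1, tau=5, q=2, K=20, dim=4):
    """Sketch of MLL-SGD on F(x) = 0.5*||x||^2 (so grad F(x) = x).
    hub_nbrs: dict hub -> set of neighboring hubs in the hub graph.
    p: p[i] is the per-slot operating rate of worker i (same in every hub)."""
    D, W = len(hub_nbrs), len(p)
    x = {d: [np.ones(dim) for _ in range(W)] for d in range(D)}  # initial models
    for _ in range(K):                 # K global (hub-graph averaging) rounds
        for _ in range(q):             # q sub-network rounds per global round
            for _ in range(tau):       # tau local steps per sub-network round
                for d in range(D):
                    for i in range(W):
                        if rng.random() < p[i]:  # heterogeneous worker rates
                            grad = x[d][i] + 0.01 * rng.normal(size=dim)
                            x[d][i] = x[d][i] - eta * grad
            for d in range(D):         # hub d averages its workers' models
                avg = np.mean(x[d], axis=0)
                x[d] = [avg.copy() for _ in range(W)]
        hub_models = {d: x[d][0].copy() for d in range(D)}
        for d in range(D):             # hubs average with their graph neighbors
            group = [hub_models[d]] + [hub_models[j] for j in hub_nbrs[d]]
            avg = np.mean(group, axis=0)
            x[d] = [avg.copy() for _ in range(W)]
    return x

ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
final = mll_sgd(ring, p=[1.0, 0.7, 0.4])
print(np.linalg.norm(final[0][0]))  # small: models approach the minimizer x* = 0
```

Even with workers operating at different rates, all models contract toward the minimizer, and within each sub-network the workers hold identical models after each averaging step.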

Figure 1: Effect of a hierarchy with different values of τ and q. (a) Training loss of CNN trained on EMNIST. (b) Test accuracy of CNN trained on EMNIST. (c) Training loss of ResNet-18 trained on CIFAR-10. (d) Test accuracy of ResNet-18 trained on CIFAR-10.

Figure 2: Effect of worker distribution on CNN trained on EMNIST.

Figure 3: Effect of worker distribution on ResNet-18 trained on CIFAR-10.

Figure 4: Effect of heterogeneous operating rates on CNN trained on EMNIST.

Figure 5: Effect of heterogeneous operating rates on ResNet-18 trained on CIFAR-10.

Figure 6: Comparing convergence time of Local SGD, HL-SGD, and MLL-SGD. (a) Training loss of CNN with respect to time slots. (b) Test accuracy of CNN with respect to time slots. (c) Training loss of ResNet with respect to time slots. (d) Test accuracy of ResNet with respect to time slots.

Figure 7: Effect of a hierarchy with different values of τ and q. (a) Training loss of logistic regression trained on MNIST. (b) Test accuracy of logistic regression trained on MNIST.

Figure 8: Effect of worker distribution on logistic regression trained on MNIST.

Figure 9: Effect of heterogeneous operating rates on logistic regression trained on MNIST.

Figure 10: Comparing convergence time of Local SGD and MLL-SGD. (a) Training loss of logistic regression with respect to time slots. (b) Test accuracy of logistic regression with respect to time slots.

The inequality above follows from (205) by Jensen's inequality.

ACKNOWLEDGMENTS

This work is supported by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons), and by the National Science Foundation under grants CNS 1553340 and CNS 1816307.

ANNEX

Because we cannot assume p_i = 1, there are cross terms in the expressions in equations (156) and (173) that do not cancel out. Thus, we need a more conservative analysis at these steps of the proof. This is why simply plugging in p_i = 1 is not enough to recover the same bound as in Cooperative SGD. A similar discrepancy can be observed when comparing with Koloskova et al. (2020).
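One way to see why these cross terms survive: let b_i be the indicator that worker i applies an update in a given slot, with E[b_i] = p_i (a Bernoulli model consistent with the p_i appearing in the analysis). Then E[b_i²] = p_i while (E[b_i])² = p_i², and the two coincide only when p_i = 1, so squared-norm expansions pick up extra terms. A quick numeric check, with an illustrative rate:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.4
b = (rng.random(1_000_000) < p).astype(float)  # Bernoulli(p) indicators

print(b.mean())         # ~ p       (E[b] = p)
print((b ** 2).mean())  # ~ p, too  (E[b^2] = p, since b is 0/1)
print(p ** 2)           # (E[b])^2 = p^2, which differs unless p = 1
```

The gap between E[b²] = p and (E[b])² = p² is exactly what forces the more conservative bounds at (156) and (173).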

