FEDAVG CONVERGES TO ZERO TRAINING LOSS LINEARLY FOR OVERPARAMETERIZED MULTI-LAYER NEURAL NETWORKS

Abstract

Federated Learning (FL) is a distributed learning paradigm that allows multiple clients to learn a joint model by utilizing privately held data at each client. Significant research effort has been devoted to developing advanced algorithms that deal with the situation where the data at individual clients have heterogeneous distributions. In this work, we show that data heterogeneity can be dealt with from a different perspective: by utilizing a certain overparameterized multi-layer neural network at each client, even the vanilla FedAvg (a.k.a. Local SGD) algorithm can accurately optimize the training problem. When each client has a neural network with one wide layer of size N (where N is the total number of training samples), followed by layers of smaller widths, FedAvg converges linearly to a solution that achieves (almost) zero training loss, without requiring any assumptions on the clients' data distributions. To our knowledge, this is the first work that demonstrates such resilience to data heterogeneity for FedAvg when training multi-layer neural networks. Our experiments also confirm that neural networks of large size can achieve better and more stable performance for FL problems.

Related Work: Federated Learning (FL). FL algorithms were first proposed in McMahan et al. (2017), where within each communication round the clients utilize their private data to update the model parameters using multiple SGD steps. Earlier works analyzed the performance of FedAvg in the homogeneous data setting Zhou and Cong (2018); Stich (2018); Lin et al. (2020); Woodworth et al. (2020b); Wang and Joshi (2021), i.e., when the local data at each client follow the same underlying distribution. Motivated by practical applications, recent works have analyzed FedAvg for heterogeneous client data distributions Yu et al. (2019b;a); Haddadpour and Mahdavi (2019); Woodworth et al. (2020a), and it was observed that the performance of FedAvg degrades as the data heterogeneity increases. To address the data heterogeneity issue among clients, many works have focused on developing more sophisticated algorithms Karimireddy et al. (2020b);

1. INTRODUCTION

In Federated Learning (FL), multiple clients collaborate with the help of a server to learn a joint model McMahan et al. (2017). The privacy guarantees of FL have made it a popular distributed learning paradigm, as each client holds a private data set and aims to learn a global model without leaking its data to other nodes or the server. The performance of FL algorithms is known to degrade when the training data at individual nodes originate from different distributions, referred to as the heterogeneous data setting Yu et al. (2019a); Woodworth et al. (2020a). In the past few years, a substantial research effort has been devoted to developing algorithms that can better deal with data heterogeneity, e.g., Karimireddy et al. (2020b); Zhang et al. (2021); Li et al. (2018); Acar et al. (2020); Khanduri et al. (2021). However, a number of recent works have observed that, in spite of data heterogeneity, the simple vanilla FedAvg algorithm (a.k.a. Local SGD) still offers competitive performance compared to the state of the art; see, e.g., Table 2 in Karimireddy et al. (2020a), Table 1 in Reddi et al. (2020), and Table 2 in Yang et al. (2021) for performance comparisons of FedAvg on popular FL tasks. Motivated by these observations, we ask: Is it possible to handle the data heterogeneity issue from a different perspective, without modifying the vanilla FedAvg algorithm? To answer this question, in this work we show that FedAvg can indeed perform very well regardless of the heterogeneity conditions, provided the models to be learned are nice enough. Specifically, FedAvg finds solutions that achieve almost zero training loss (i.e., an almost globally optimal solution) very quickly (i.e., linearly), when the FL model to be trained is a certain overparameterized multi-layer neural network.
To the best of our knowledge, this is the first result that shows (linear) convergence of FedAvg in the overparameterized regime for training multi-layer neural networks. The major contributions of our work are listed below. • Under certain assumptions on the neural network architecture, we prove some key properties of the clients' (stochastic) gradients during the training phase (Lemmas 1 and 2). These results allow us to establish convergence of FedAvg for training overparameterized neural networks without imposing restrictive heterogeneity assumptions on the gradients of the local loss functions. • We design a special initialization strategy for training the network using FedAvg. The initialization is designed such that the singular values of the model parameters and of the first-layer outputs, for both the local and the aggregated models, stay bounded away from zero during training. This property, combined with overparameterization, enables FedAvg to converge linearly to a (near) optimal solution. • We conduct experiments on the CIFAR-10 and MNIST datasets in both i.i.d. and heterogeneous data settings to compare the performance of FedAvg on network architectures of different sizes. To our knowledge, this is the first work that shows linear convergence of FedAvg (both the SGD and GD versions) to the optimal solution when training overparameterized multi-layer neural networks.

Suppose each client trains a fully-connected neural network with $L$ layers and activation function $\sigma : \mathbb{R} \to \mathbb{R}$. We denote the vectorized parameters at each node $k \in \{1, \ldots, K\}$ as $\theta_k = [\mathrm{vec}(W_{1,k}), \ldots, \mathrm{vec}(W_{L,k})] \in \mathbb{R}^D$, where $W_{l,k} \in \mathbb{R}^{n_{l-1} \times n_l}$ represents the weight matrix of layer $l \in \{1, \ldots, L\}$ and $n_l$ represents the width of layer $l$. Note that each layer inputs a (feature) vector of dimension $n_{l-1}$ and outputs a (feature) vector of dimension $n_l$.
For simplicity, define $n_0 = d_{\mathrm{in}}$ and $n_L = d_{\mathrm{out}}$ as the input and output dimensions of the neural network. We define $F_{l,k}$ as the local output of layer $l$ at client $k$; using the above notation, we have

$$F_{l,k} = \begin{cases} X_k, & l = 0, \\ \sigma(F_{l-1,k} W_{l,k}), & l \in \{1, \ldots, L-1\}, \\ F_{L-1,k} W_{L,k}, & l = L. \end{cases} \quad (1)$$

We further define the vectorized output of each layer and the labels at each client as $f_{l,k} = \mathrm{vec}(F_{l,k}) \in \mathbb{R}^{N_k n_l}$ and $y_k = \mathrm{vec}(Y_k) \in \mathbb{R}^{N_k n_L}$. Similar to the above setup, we also define the notation for a single network that takes the full data $(X, Y)$, with $X \in \mathbb{R}^{N \times d_{\mathrm{in}}}$ and $Y \in \mathbb{R}^{N \times d_{\mathrm{out}}}$, as input. This "centralized" network will be useful later in the analysis. Given parameter $\theta = [\mathrm{vec}(W_1), \ldots, \mathrm{vec}(W_L)]$, the output at each layer of this network is defined as

$$F_l = \begin{cases} X, & l = 0, \\ \sigma(F_{l-1} W_l), & l \in \{1, \ldots, L-1\}, \\ F_{L-1} W_L, & l = L. \end{cases}$$

Next, we define the local and global loss functions. Each client $k \in \{1, \ldots, K\}$ has a local loss function given by

$$\Phi_k(\theta) := \frac{1}{2N_k} \| f_{L,k}(\theta) - y_k \|_2^2, \quad (2)$$

where $\|\cdot\|_2$ denotes the standard $\ell_2$-norm. The global loss function is the weighted sum of the local loss functions,

$$\Phi(\theta) := \sum_{k=1}^{K} \frac{N_k}{N} \Phi_k(\theta) = \frac{1}{2N} \| f_L(\theta) - y \|_2^2. \quad (3)$$

Additionally, define the gradient of (3) as $g := [\mathrm{vec}(\nabla_{W_1} \Phi(\theta)), \ldots, \mathrm{vec}(\nabla_{W_L} \Phi(\theta))]$, the stacked gradient of the loss w.r.t. the parameters of layers $1$ through $L$; define the gradient of the loss at each client $k \in [K]$ as $g_k := [g_{1,k}, \ldots, g_{L,k}]$ with $g_{l,k} := \mathrm{vec}(\nabla_{W_{l,k}} \Phi_k(\theta))$ for all $l \in [L]$. Next, we define the optimality criterion for solving (3) with an overparameterized neural network.

Definition 1 ($\epsilon$-optimal solution). Consider an overparameterized problem $\min_\theta \Phi(\theta)$ for which there exists $\theta^*$ such that $\Phi(\theta^*) = 0$. A solution $\theta$ is called an $\epsilon$-optimal solution if it satisfies $\Phi(\theta) \le \epsilon$. Moreover, if $\theta$ is a random variable, then we use $\mathbb{E}[\Phi(\theta)] \le \epsilon$ to denote an $\epsilon$-optimal solution, where the expectation is taken w.r.t. the randomness of $\theta$.
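As a concrete illustration, here is a minimal NumPy sketch of the layer-wise forward recursion and the local loss $\Phi_k$ defined above. The function names, dimensions, and the use of tanh as a placeholder activation are our own illustrative choices (tanh does not satisfy the lower bound on $\sigma'$ required later by Assumption 3):

```python
import numpy as np

def forward(X, weights, sigma):
    """Forward pass: F_0 = X, F_l = sigma(F_{l-1} W_l) for the hidden
    layers, and a linear last layer F_L = F_{L-1} W_L."""
    F = X
    for W in weights[:-1]:
        F = sigma(F @ W)
    return F @ weights[-1]

def local_loss(X_k, Y_k, weights, sigma):
    """Local loss Phi_k = ||f_{L,k} - y_k||_2^2 / (2 N_k), with the
    vectorized output/labels realized as a sum over matrix entries."""
    N_k = X_k.shape[0]
    resid = forward(X_k, weights, sigma) - Y_k
    return np.sum(resid ** 2) / (2 * N_k)
```

The global loss is then the $N_k/N$-weighted sum of `local_loss` over the clients.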

3. THE FEDAVG ALGORITHM

A classical algorithm for solving problem (3) is FedAvg McMahan et al. (2017). In FedAvg, each client performs multiple local updates before sharing its updated parameters with the server. We refer to the algorithm as FedAvg-SGD (resp. FedAvg-GD) if the clients employ SGD (resp. GD) for the local updates. The detailed steps of FedAvg-SGD are listed in Algorithm 1. We execute the algorithm for a total of $T$ communication rounds; within each communication round every client performs $r$ local updates. At the end of communication round $t$, the server aggregates the clients' local parameters $\theta_k^{rt+r}$ into $\bar\theta^{rt+r}$ and shares it with the clients. The clients use the aggregated parameter $\bar\theta^{rt+r}$ as the initial value for the next round of local updates. For each $v \in \{0, 1, \ldots, r-1\}$, to update the local parameters the clients compute an (unbiased) stochastic gradient using $m$ samples drawn from their private data set $(X_k, Y_k)$. We denote the random sample drawn at the $v$-th local step of the $t$-th communication round by $(\tilde X_k^{rt+v}, \tilde Y_k^{rt+v})$. Using the stochastic gradient estimate, the clients update their parameters locally via an SGD step. After $r$ local SGD steps, each client shares its updated parameters with the server and receives the aggregated parameters before starting the next round of updates. Note that if we choose the batch size $m = N_k$ for all $k \in \{1, \ldots, K\}$, FedAvg-SGD becomes FedAvg-GD. Denoting $\tilde y_k^{rt+v}$ as the vectorized labels of the stochastic samples at each local step, we define the mini-batch stochastic loss as

$$\tilde\Phi_k(\theta_k^{rt+v}) := \frac{1}{2m} \| \tilde f_{L,k}^{rt+v} - \tilde y_k^{rt+v} \|_2^2, \quad (4)$$

and the stochastic gradient as $\tilde g_k^{rt+v} := [\tilde g_{1,k}^{rt+v}, \ldots, \tilde g_{L,k}^{rt+v}]$, where $\tilde g_{l,k}^{rt+v}$ is the stochastic gradient w.r.t. the $l$-th layer of the network evaluated at the $k$-th client:

$$\tilde g_{l,k}^{rt+v} := \mathrm{vec}\big(\nabla_{W_{l,k}} \tilde\Phi_k(\theta_k^{rt+v})\big) \in \mathbb{R}^{n_{l-1} n_l}. \quad (5)$$

For each communication round, let us define the aggregated parameters as

$$\bar\theta^{rt} := \big[\mathrm{vec}(\bar W_1^{rt}), \ldots, \mathrm{vec}(\bar W_L^{rt})\big], \qquad \bar W_l^{rt} = \sum_{k=1}^{K} \frac{N_k}{N} W_{l,k}^{rt}. \quad (6)$$

For FedAvg-GD, we denote $g_k^{rt+v} := [g_{1,k}^{rt+v}, \ldots, g_{L,k}^{rt+v}]$ as the full gradient of the $k$-th client's loss function, where, similar to (5), $g_{l,k}^{rt+v}$ denotes the gradient of the loss function w.r.t. the $l$-th layer's parameters. Throughout, we make the following standard assumption Ghadimi and Lan (2013).

Assumption 1. The stochastic gradients at each client are unbiased, i.e., $\mathbb{E}[\tilde g_k^{rt+v}] = g_k^{rt+v}$ for all $k \in [K]$.

Next, we analyze the performance of FedAvg for an overparameterized neural network.
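To make the update-and-aggregate pattern of Algorithm 1 concrete, here is a minimal sketch of FedAvg-SGD. For brevity it uses a linear least-squares model at each client rather than the multi-layer network analyzed in this paper, so the gradient line is our illustrative simplification; the communication structure ($r$ local SGD steps on $m$-sample mini-batches, followed by the $N_k/N$-weighted average in (6)) follows the description above:

```python
import numpy as np

def fedavg_sgd(clients, theta0, eta, rounds, r, m, rng):
    """FedAvg-SGD skeleton. `clients` is a list of (X_k, y_k) pairs.
    Each round: every client runs r local SGD steps from the shared
    parameter, then the server takes the N_k/N-weighted average."""
    N = sum(X_k.shape[0] for X_k, _ in clients)
    theta = theta0.copy()
    for _ in range(rounds):
        local_params = []
        for X_k, y_k in clients:
            th = theta.copy()
            for _ in range(r):
                # m-sample mini-batch stochastic gradient (least-squares model)
                idx = rng.choice(X_k.shape[0], size=m, replace=False)
                Xb, yb = X_k[idx], y_k[idx]
                grad = Xb.T @ (Xb @ th - yb) / m
                th -= eta * grad
            local_params.append((X_k.shape[0], th))
        # server aggregation, cf. (6)
        theta = sum(n_k * th for n_k, th in local_params) / N
    return theta
```

Setting `m` equal to the local sample size turns every local step into a full-gradient step, i.e., FedAvg-GD, exactly as noted above.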

4. CONVERGENCE ANALYSIS

We present the convergence guarantees of FedAvg when training an overparameterized neural network. We first present a set of assumptions on the network architecture and the activation function.

Assumption 2. The widths of the hidden layers satisfy: $n_1 \ge N$, $n_2 \ge n_3 \ge \ldots \ge n_L \ge 1$.

Assumption 3. The activation function $\sigma(\cdot)$ in (1) satisfies: 1) $\sigma'(x) \in [\gamma, 1]$; 2) $|\sigma(x)| \le |x|$ for all $x \in \mathbb{R}$; 3) $\sigma'$ is $\beta$-Lipschitz; where $\gamma \in (0, 1)$ and $\beta > 0$.

Remark 1. Assumptions 2 and 3 play an important role in our analysis. They ensure that the local and global loss functions and their (stochastic) gradients are well behaved. Note that Assumption 2 only requires the first layer to be wide, while the rest of the layers can be of constant width. Assumption 2 is required to establish a PL-like property for the global and local loss functions Nguyen and Hein (2018); Nguyen and Mondelli (2020). Assumption 3 is also standard in the analysis of overparameterized neural networks. Similar assumptions on the smoothness of the activation functions have been made in the past Jacot et al. (2018); Du et al. (2019); Nguyen and Mondelli (2020); Huang and Yau (2020) and are utilized to control the behavior of the gradients of the loss functions. Importantly, as demonstrated in Nguyen and Mondelli (2020), activation functions satisfying Assumption 3 can uniformly approximate the ReLU function to arbitrary accuracy.

Remark 2. We do not impose any assumptions on the distribution of the individual clients' local data sets. In contrast, a majority of works on FL impose restrictive assumptions on the gradients (and/or Hessians) of each client's local loss function to guarantee algorithm convergence Yu et al. (2019b); Li et al. (2018); Yu et al. (2019a); Karimireddy et al. (2020a). Below, we list the two most popular heterogeneity assumptions (from Yu et al. (2019a) and Koloskova et al. (2020), respectively):

$$\|\nabla\Phi_k(\theta) - \nabla\Phi(\theta)\| \le \delta, \quad \forall\,\theta \in \mathbb{R}^D,\ \forall\, k \in [K], \text{ for some } \delta > 0; \quad (7)$$

$$\frac{1}{K}\sum_{k=1}^{K} \|\nabla\Phi_k(\theta)\|^2 \le \delta_1 + \delta_2 \|\nabla\Phi(\theta)\|^2, \quad \forall\,\theta \in \mathbb{R}^D, \text{ for some } \delta_1, \delta_2 > 0. \quad (8)$$

Both conditions impose strong restrictions on the gradients of the local clients, and they fail to hold even for simple quadratic losses Khaled et al. (2019); Zhang et al. (2021). We will see shortly that, as long as the neural network is large enough, the local (stochastic) gradients will be well behaved, thereby eliminating the need to impose any additional assumptions on the data distributions. In the following, we present the convergence guarantees achieved by FedAvg. Our analysis roughly follows the four steps below.

[Step 1] We first show a key result: the ratio between the local stochastic gradients and the local full gradients stays bounded (Lemma 1). This result is crucial for the FedAvg-SGD analysis, as it allows us to work with the full local gradients directly, and it helps bound the gradient drift across local updates within each communication round.

[Step 2] Using the result of Step 1, we bound the summation of the (stochastic) gradients and the gradient drift during the local updates within each communication round (Lemma 2). This result ensures that, irrespective of the data heterogeneity, the size of the gradients does not change too much from its value at the beginning of each round.

[Step 3] We then show that the adopted network architecture allows us to derive bounds on the size of the gradients and ensures that the loss function satisfies a PL condition during each communication round (Lemma 3). Utilizing this together with the results of Steps 1 and 2, we show that the expected loss (3) converges linearly to zero (Proposition 1).

[Step 4] Finally, we construct a special initialization strategy so that all the conditions imposed on the network are satisfied during the entire training process.

Next, let us begin with Step 1.
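Assumption 3 admits simple concrete activations. The blended, shifted softplus below is one choice we construct purely for illustration (it is not specified in the paper): since $\sigma(0) = 0$ and $\sigma'(x) = \gamma + (1-\gamma)\,\mathrm{sigmoid}(x) \in (\gamma, 1)$, the mean value theorem gives $|\sigma(x)| \le |x|$, and $\sigma'$ is Lipschitz with $\beta = (1-\gamma)/4$ since $|\sigma''(x)| = (1-\gamma)\,\mathrm{sigmoid}(x)(1-\mathrm{sigmoid}(x)) \le (1-\gamma)/4$:

```python
import numpy as np

def sigma(x, gamma=0.5):
    """Candidate activation for Assumption 3:
    sigma(x) = gamma*x + (1-gamma)*(softplus(x) - log 2).
    The log 2 shift makes sigma(0) = 0, so |sigma(x)| <= |x| follows
    from sigma' < 1."""
    return gamma * x + (1 - gamma) * (np.logaddexp(0.0, x) - np.log(2.0))

def sigma_prime(x, gamma=0.5):
    """Derivative: gamma + (1-gamma)*sigmoid(x), which lies in (gamma, 1)."""
    return gamma + (1 - gamma) / (1.0 + np.exp(-x))
```

Here $\gamma$ interpolates between a nearly linear map ($\gamma \to 1$) and a shifted softplus ($\gamma \to 0$); any $\gamma \in (0,1)$ satisfies the assumption.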
We need the following definition.

Definition 2. Given parameter $\theta_k^{rt+v}$, define, for each $k \in [K]$, $t \in \{0, 1, \ldots, T-1\}$ and $v \in \{0, 1, \ldots, r-1\}$:

$$\rho(\theta_k^{rt+v}) := \|\tilde g_k^{rt+v}\|_2 \,/\, \|g_k^{rt+v}\|_2.$$

Clearly, $\rho(\theta_k^{rt+v})$ measures the ratio between the norms of the stochastic and full gradients of the local loss functions. In the following, we show that if the model parameters at each client satisfy certain conditions, then $\rho(\theta_k^{rt+v})$ is uniformly bounded. Define $\sigma_{\max}(\cdot)$ and $\sigma_{\min}(\cdot)$ as the largest and smallest singular values of a matrix, respectively.

Lemma 1. Let Assumptions 2 and 3 hold. Suppose that in every iteration $rt+v$, $v \in \{0, 1, \ldots, r-1\}$, for $\theta_k^{rt+v} = [\mathrm{vec}(W_{1,k}^{rt+v}), \ldots, \mathrm{vec}(W_{L,k}^{rt+v})]$, there exist constants $\bar\Lambda_l, \underline\Lambda_l, \Lambda_F > 0$ such that the singular values of $W_{l,k}^{rt+v}$ and $F_{1,k}^{rt+v}$ satisfy

$$\begin{cases} \sigma_{\max}(W_{l,k}^{rt+v}) \le \bar\Lambda_l, & l \in [L],\ k \in [K], \\ \sigma_{\min}(W_{l,k}^{rt+v}) \ge \underline\Lambda_l, & l \in \{3, \ldots, L\},\ k \in [K], \\ \sigma_{\min}(F_{1,k}^{rt+v}) \ge \Lambda_F, & k \in [K]. \end{cases}$$

With the shorthand $\lambda_{i \to j} := \prod_{l=i}^{j} \lambda_l$ for a given layer-wise parameter $\lambda_l$, we then have

$$\rho(\theta_k^{rt+v}) \le \frac{L N \|X\|_F\, \bar\Lambda_{1\to L}}{m\, \gamma^{L-2}\, \underline\Lambda_{3\to L}\, \Lambda_F\, \min_{l\in[L]} \bar\Lambda_l}.$$

As discussed in Step 1, this lemma is crucial to our analysis as it allows us to work with the full gradients of the individual clients. Before proceeding to Step 2, we need the following definitions:

$$\bar g^{rt+v} := \sum_{k=1}^{K} \frac{N_k}{N}\, g_k^{rt+v} \quad \text{and} \quad \tilde{\bar g}^{rt+v} := \sum_{k=1}^{K} \frac{N_k}{N}\, \tilde g_k^{rt+v}.$$

Here $\bar g^{rt+v}$ and $\tilde{\bar g}^{rt+v}$ are the weighted averages of the full and stochastic gradients, respectively. Next, in Step 2 (Lemma 2) we first bound the sum of $\tilde{\bar g}^{rt+v}$ over the local updates within each communication round. Then we bound the change in $\bar g^{rt+v}$ from $v = 0$ to any $v \in \{0, 1, \ldots, r-1\}$; this quantity measures the drift of the averaged gradients from the start of each communication round.

Lemma 2. For FedAvg-SGD, given step size $\eta > 0$, $v \in \{0, 1, \ldots, r-1\}$ and $q \in \{0, 1, \ldots, v-1\}$, suppose there exist constants $\bar\Lambda_l$, $\rho$, and $A > 0$ such that

$$\bar\Lambda_l \ge \sup_{k\in[K]} \sigma_{\max}(W_{l,k}^{rt+q}), \qquad \rho \ge \sup_{k\in[K]} \rho(\theta_k^{rt+q}), \qquad \Phi_k(\theta_k^{rt+q}) \le A^q \cdot \Phi_k(\bar\theta^{rt}),\ k \in [K].$$

Then we have

$$\sum_{q=0}^{v} \|\tilde{\bar g}^{rt+q}\|_2 \le \frac{\rho L \|X\|_F}{N} \cdot \frac{A^{\frac{v+1}{2}} - 1}{\sqrt{A} - 1} \cdot \frac{\bar\Lambda_{1\to L}}{\min_{l\in[L]} \bar\Lambda_l}\, \|f_L(\bar\theta^{rt}) - y\|_2. \quad (10)$$

Further, for all $k \in [K]$ there exists $Q_k > 0$ such that

$$\|\bar g^{rt+v} - \bar g^{rt}\|_2 \le \frac{\eta \rho L}{N} \cdot \frac{\bar\Lambda_{1\to L}}{\min_{l\in[L]} \bar\Lambda_l} \cdot \frac{A^{\frac{v+1}{2}} - 1}{\sqrt{A} - 1} \sqrt{\sum_{k=1}^{K} Q_k^2 \|X_k\|_F^2}\; \|f_L(\bar\theta^{rt}) - y\|_2. \quad (11)$$

Next, we show Step 3: the averaged parameter $\bar\theta^{rt}$ defined in (6) has good properties after the $t$-th communication round. Towards this end, we define the full gradient at $\bar\theta^{rt}$ as $g^{rt} := [\mathrm{vec}(\nabla_{W_1}\Phi(\bar\theta^{rt})), \ldots, \mathrm{vec}(\nabla_{W_L}\Phi(\bar\theta^{rt}))]$.

Lemma 3. Let Assumptions 2 and 3 hold. At communication round $rt$, suppose there exist constants $\bar\Omega_l, \underline\Omega_l, \Omega_F$ such that

$$\begin{cases} \sigma_{\max}(\bar W_l^{rt}) \le \bar\Omega_l, & l \in [L], \\ \sigma_{\min}(\bar W_l^{rt}) \ge \underline\Omega_l, & l \in \{3, \ldots, L\}, \\ \sigma_{\min}(F_1(\bar\theta^{rt})) \ge \Omega_F, \end{cases} \quad (13)$$

where $\bar\theta^{rt}$ and $\bar W_l^{rt}$ are defined in (6). Then we have

$$\|g(\bar\theta^{rt})\|_2 \ge \|\mathrm{vec}(\nabla_{W_2}\Phi(\bar\theta^{rt}))\|_2 \ge \frac{\gamma^{L-2}}{N}\, \underline\Omega_{3\to L}\, \Omega_F\, \|f_L(\bar\theta^{rt}) - y\|_2, \quad (14)$$

$$\|g(\bar\theta^{rt})\|_2 \le \frac{L \|X\|_F}{N} \cdot \frac{\bar\Omega_{1\to L}}{\min_{l\in[L]} \bar\Omega_l}\, \|f_L(\bar\theta^{rt}) - y\|_2. \quad (15)$$

Remark 3. Note that (14) is a PL-type inequality Karimi et al. (2016) and requires the special network structure guaranteed by Assumption 2 Nguyen and Hein (2018); Nguyen and Mondelli (2020). Also, (15) can be proven using Assumption 3. Now, we utilize the results of Steps 1-2 and Lemma 3 to derive the convergence of FedAvg.

Proposition 1. Use Algorithm 1 to minimize (3). Suppose Assumptions 1, 2 and 3 are satisfied; for each iteration $rt+v$, $v \in \{0, 1, \ldots, r-1\}$, $\theta_k^{rt+v}$ satisfies the conditions in Lemmas 1 and 2; and for each communication round $rt$, $\bar\theta^{rt}$ satisfies the conditions in Lemma 3. Then there exists $\eta > 0$ such that

$$\mathbb{E}[\Phi(\bar\theta^{rt})] \le \left(1 - \frac{r\eta}{N}\, \gamma^{2(L-2)}\, \underline\Omega_{3\to L}^2\, \Omega_F^2\right)^t \Phi(\bar\theta^0). \quad (16)$$

Remark 4.
Proposition 1 above shows that, if the conditions in Lemmas 1, 2 and 3 are satisfied, i.e., the gradients are well behaved (Lemmas 1 and 2) and a PL condition holds (Lemma 3), we achieve linear convergence of the expected loss when solving (3) with FedAvg-SGD. We now outline the major steps in the proof of Proposition 1.

Proof Sketch. Consider the $t$-th communication round, and suppose the singular values of the parameters satisfy (13); then it is easy to show that $\Phi$ is Lipschitz smooth around $\bar\theta^{rt}$ with some constant $Q > 0$. Using this Lipschitz smoothness, we get

$$\Phi(\bar\theta^{r(t+1)}) \le \Phi(\bar\theta^{rt}) - \eta\,\big\langle g^{rt},\, \tilde{\bar g}^{rt} + \ldots + \tilde{\bar g}^{rt+r-1}\big\rangle + \frac{Q}{2}\eta^2 \big\| \tilde{\bar g}^{rt} + \ldots + \tilde{\bar g}^{rt+r-1} \big\|_2^2.$$

Taking expectation on both sides, conditioned on $\bar\theta^{rt}$ and the past, we get

$$\begin{aligned} \mathbb{E}[\Phi(\bar\theta^{r(t+1)})] &\le \mathbb{E}\Big[\Phi(\bar\theta^{rt}) - \eta\,\langle g^{rt},\, \bar g^{rt} + \ldots + \bar g^{rt+r-1}\rangle + \frac{Q}{2}\eta^2 \|\tilde{\bar g}^{rt} + \ldots + \tilde{\bar g}^{rt+r-1}\|_2^2\Big] \\ &= \mathbb{E}\Big[\Phi(\bar\theta^{rt}) - \eta\,\langle g^{rt},\, r\bar g^{rt}\rangle - \eta\,\Big\langle g^{rt},\, \sum_{v=1}^{r-1} \big(\bar g^{rt+v} - \bar g^{rt}\big)\Big\rangle + \frac{Q}{2}\eta^2 \|\tilde{\bar g}^{rt} + \ldots + \tilde{\bar g}^{rt+r-1}\|_2^2\Big] \\ &\le \mathbb{E}\Big[\Phi(\bar\theta^{rt}) - \eta r \|g^{rt}\|_2^2 + \eta \|g^{rt}\|_2 \Big\|\sum_{v=1}^{r-1} \big(\bar g^{rt+v} - \bar g^{rt}\big)\Big\|_2 + \frac{Q}{2}\eta^2 \|\tilde{\bar g}^{rt} + \ldots + \tilde{\bar g}^{rt+r-1}\|_2^2\Big]. \end{aligned} \quad (17)$$

Now we bound each term in (17) using Lemmas 2 and 3. To bound the second term on the right-hand side (rhs) of (17), we use the PL inequality (14) of Lemma 3:

$$\|g^{rt}\|_2 \ge \frac{\gamma^{L-2}}{N}\, \underline\Omega_{3\to L}\, \Omega_F\, \|f_L(\bar\theta^{rt}) - y\|_2. \quad (18)$$

We bound the gradient norm in the third term using the upper bound (15) of Lemma 3:

$$\|g^{rt}\|_2 \le \frac{L \|X\|_F}{N} \cdot \frac{\bar\Omega_{1\to L}}{\min_{l\in[L]} \bar\Omega_l}\, \|f_L(\bar\theta^{rt}) - y\|_2 =: T_1. \quad (19)$$

Additionally, we use (11) of Lemma 2 (with $v = r-1$) to bound the gradient drift in the third term:

$$\Big\|\sum_{v=1}^{r-1} \big(\bar g^{rt+v} - \bar g^{rt}\big)\Big\|_2 \le \eta\, \frac{\rho L}{N} \cdot \frac{\bar\Lambda_{1\to L}}{\min_{l\in[L]} \bar\Lambda_l} \cdot \frac{A^{\frac{r}{2}} - 1}{\sqrt{A} - 1} \sqrt{\sum_{k=1}^{K} Q_k^2 \|X_k\|_F^2}\; \|f_L(\bar\theta^{rt}) - y\|_2 =: T_2. \quad (20)$$

Next, using (10) of Lemma 2 (again with $v = r-1$), we bound the last term, i.e., the sum of the stochastic gradients:

$$\|\tilde{\bar g}^{rt} + \ldots + \tilde{\bar g}^{rt+r-1}\|_2 \le \frac{\rho L \|X\|_F}{N} \cdot \frac{A^{\frac{r}{2}} - 1}{\sqrt{A} - 1} \cdot \frac{\bar\Lambda_{1\to L}}{\min_{l\in[L]} \bar\Lambda_l}\, \|f_L(\bar\theta^{rt}) - y\|_2 =: T_3.$$

Finally, plugging these bounds into (17), using the definition of the loss function $\Phi(\bar\theta^{rt}) = \frac{1}{2N}\|f_L(\bar\theta^{rt}) - y\|_2^2$ along with the step-size choice $\eta < \frac{\gamma^{2(L-2)}\, N^2\, \underline\Omega_{3\to L}^2\, \Omega_F^2}{2T_1T_2 + Q T_3^2}$, we get

$$\mathbb{E}[\Phi(\bar\theta^{r(t+1)})] \le \left(1 - \frac{r\eta}{N}\, \gamma^{2(L-2)}\, \underline\Omega_{3\to L}^2\, \Omega_F^2\right) \mathbb{E}[\Phi(\bar\theta^{rt})].$$

Applying this inequality recursively yields the statement of Proposition 1.

This completes Step 3, and we move on to the initialization strategy of Step 4. It is important to note that Proposition 1 relies on Lemmas 1-3, all of which impose conditions on the singular values of the model parameters and of the first-layer outputs at each client throughout the training phase. We therefore design an initialization strategy that ensures the conditions of Lemmas 1-3 are satisfied almost surely. Define

$$\underline\lambda_l := \sigma_{\min}(W_l^0), \qquad \bar\lambda_l := \begin{cases} \frac{2}{3}\big(1 + \sigma_{\max}(W_l^0)\big), & l \in \{1, 2\}, \\ \sigma_{\max}(W_l^0), & l \in \{3, \ldots, L\}. \end{cases} \quad (22)$$

We also define the smallest singular value of the first-layer output at initialization at each client as $\alpha_{0,k} := \sigma_{\min}\big(\sigma(X_k W_{1,k}^0)\big)$. Similarly, for the centralized setting, when all the clients share the same parameter and the full data, we define $\alpha_0 := \sigma_{\min}\big(\sigma(X W_1^0)\big)$.

Initialization Strategy: Given any $\epsilon < \Phi(\theta^0)$, we initialize the model weights such that, for some constants $M_1, M_2, M_3 > 0$, the following are satisfied:

$$M_1\, \frac{\|X\|_F\, \bar\lambda_{1\to L}}{\min_{l\in[L]} \bar\lambda_l} \cdot \frac{\Phi(\theta^0)^{\frac{3}{2}}}{\epsilon} \le \begin{cases} \frac{1}{2}\underline\lambda_l, & l \in \{3, \ldots, L\}, \\ 1, & l \in \{1, 2\}, \end{cases} \quad (23)$$

$$M_2\, \frac{\bar\lambda_{1\to L}}{\min_{l\in[L]} \bar\lambda_l} \cdot \frac{\Phi(\theta^0)^{\frac{3}{2}}}{\epsilon} \le \min\Big\{\alpha_0,\ \min_{k\in[K]} \alpha_{0,k}\Big\}, \qquad M_3\, \underline\lambda_{3\to L}\, \alpha_0 \ge \frac{\bar\lambda_{1\to L}}{\min_{l\in[L]} \bar\lambda_l}. \quad (24)$$

To satisfy the required initialization, we follow the initialization strategy of Nguyen and Mondelli (2020). First, randomly initialize $[W_1^0]_{ij} \sim \mathcal{N}(0, 1/d_{\mathrm{in}}^2)$. Broadcast $[W_1^0]_{ij}$ to each client and collect $F_{1,k}$, the output of the first layer at each client, as well as the norms of the local data $\|X_k\|_F$. With $F_{1,k}$, both $\alpha_0$ and $\alpha_{0,k}$ can be computed.
Since Assumption 2 gives $n_1 \ge N$, the quantities $\alpha_0$ and $\alpha_{0,k}$ are strictly positive. It is then easy to verify that, given $\epsilon > 0$, (23) and the second relation in (24) will be satisfied if $\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]} \bar\lambda_l}$ is large enough, which can be realized by choosing arbitrarily large $\bar\lambda_l$, $l \in \{3, \ldots, L\}$. In order to satisfy the first relation in (24), we need $\bar\lambda_l$ and $\underline\lambda_l$ to be close to each other. Intuitively, one way is to construct $\{W_l^0\}_{l=3}^{L}$ such that $\underline\lambda_l = \bar\lambda_l = \zeta > 1$, where $\zeta$ can be chosen to be any large number such that (23) and the second relation in (24) are satisfied. We also need to upper bound $\Phi(\theta^0)$, which can be done by choosing a small $W_2^0$: randomly initialize $[W_2^0]_{ij} \sim \mathcal{N}(0, \kappa)$ with $\kappa$ arbitrarily small; then $\Phi(\theta^0)$ is bounded by $\frac{2}{N}\|y\|_2^2$ with high probability (see (10) in Nguyen and Mondelli (2020)). Note that the desired error $\epsilon$ is another key constant in the initialization: the smaller the target error, the more stringent (23) and the second relation in (24) become. This is not an issue, however, since we can choose a larger $\zeta$ so that the initial conditions are satisfied. The detailed initialization strategy that ensures the conditions of Lemmas 1, 2 and 3 are satisfied is given in Appendix B.2. Next, we state our main result, which establishes the linear convergence of Local SGD to any $\epsilon$-optimal solution (see Definition 1). The proof is given in Appendix B.3.

Theorem 1. Use FedAvg-SGD (Algorithm 1) to minimize (3). Suppose Assumptions 1, 2 and 3 are satisfied. Then there exists an initialization strategy such that, for any $\epsilon < \Phi(\theta^0)$, there exists a step size $\eta > 0$ such that

$$\mathbb{E}[\Phi(\bar\theta^{rt})] \le (1 - \mu'\eta)^t\, \Phi(\theta^0), \quad t \in \{0, \ldots, T-1\},$$

where $\mu' := \frac{r}{2N}\, \gamma^{2(L-2)}\, \frac{1}{2^{2(L-1)}}\, \underline\lambda_{3\to L}^2\, \alpha_0^2$ and $\eta\mu' < 1$.

Theorem 1 shows that, for any $\epsilon > 0$, we can always find an initialization such that FedAvg-SGD achieves $\epsilon$ accuracy within $O(\log(1/\epsilon))$ rounds of communication.
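The initialization recipe above can be sketched numerically. The dimensions, $\zeta$, and $\kappa$ below are illustrative placeholders, and using a scaled column-orthogonal matrix for the layers $l \ge 3$ is one concrete way (our choice) to realize $\underline\lambda_l = \bar\lambda_l = \zeta$, since all singular values of such a matrix are equal:

```python
import numpy as np

def init_weights(d_in, widths, d_out, zeta=2.0, kappa=1e-4, rng=None):
    """Sketch of the initialization strategy:
    - [W_1]_ij ~ N(0, 1/d_in^2)  (std 1/d_in),
    - [W_2]_ij ~ N(0, kappa) with kappa small, bounding Phi(theta^0),
    - W_l (l >= 3) = zeta * Q with Q column-orthonormal, so that
      sigma_min(W_l) = sigma_max(W_l) = zeta."""
    if rng is None:
        rng = np.random.default_rng(0)
    dims = [d_in] + list(widths) + [d_out]
    weights = [rng.normal(0.0, 1.0 / d_in, size=(dims[0], dims[1]))]
    weights.append(rng.normal(0.0, np.sqrt(kappa), size=(dims[1], dims[2])))
    for l in range(2, len(dims) - 1):
        # requires dims[l] >= dims[l+1], which holds under Assumption 2
        Q, _ = np.linalg.qr(rng.normal(size=(dims[l], dims[l + 1])))
        weights.append(zeta * Q)
    return weights
```

Enlarging `zeta` tightens nothing structurally but makes the conditions (23) and the second relation in (24) easier to satisfy for a smaller target error $\epsilon$, as discussed above.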
Notice that there is no heterogeneity assumption on the data (see Remark 2), and no Lipschitz-gradient assumption on the loss function.

Remark 5. We comment on the key novelties of this work compared to Nguyen and Mondelli (2020). (1) Our work requires a careful analysis to deal with multiple local updates at each client. In contrast to Nguyen and Mondelli (2020), for our algorithm there is no guarantee that the overall objective always decreases during the local updates. In fact, our analysis shows that the overall objective can increase after each local iteration; we show that this increase is compensated by the descent in the objective value between communication rounds. (2) Our algorithm and analysis can handle stochastic gradients in the local updates, while Nguyen and Mondelli (2020) only considered gradient descent in a centralized setting. A key step in our analysis is characterizing the relationship between the stochastic and full gradient updates, as captured by Lemma 1.

Remark 6. We comment on the choice of parameters and the convergence rate. As shown in Appendix B.3, by utilizing our initialization strategy we can choose $\eta = c/\mu'$ for some constant $c \in (0, 1)$ (independent of $\epsilon$). This implies $\mu'\eta = c < 1$, so that $(1 - \mu'\eta) < 1$ in Theorem 1, ensuring linear convergence of FedAvg-SGD. Finally, we present the convergence guarantee for the case when FedAvg-GD is utilized.

Corollary 1. Use FedAvg-GD (Algorithm 1) to minimize (3). Suppose Assumptions 2 and 3 are satisfied. Then there exist an initialization strategy and a step size $\eta > 0$ such that

$$\Phi(\bar\theta^{rt}) \le (1 - \mu'\eta)^t\, \Phi(\theta^0), \quad \forall\, t \in \{0, \ldots, T-1\}. \quad (25)$$

Remark 7. Corollary 1 implies that FedAvg-GD achieves linear convergence when optimizing (3).
We note that the result of Corollary 1 is stronger than that of Theorem 1, in that the initialization for FedAvg-GD is independent of $\epsilon$, unlike the one for FedAvg-SGD (shown in Appendix B.2). We next evaluate FedAvg in both homogeneous (i.i.d.) and heterogeneous (non-i.i.d.) data settings. Through our experiments we establish that larger networks uniformly outperform smaller networks under different settings. Next, we discuss the data and model settings for our experiments.

Results and Discussion

For each setting, we compare the training loss and test accuracy of FedAvg on smaller and larger networks. To analyze the effect of network size on the stability of FedAvg, we also plot the performance of FedAvg averaged over 10 runs in the non-i.i.d. client data setting for all the network architectures. From our experiments, we make a few observations. First, we observe from Figures 1 and 2 that in all cases the i.i.d. setting yields more stable performance (lower variance) than the non-i.i.d. setting. Second, we note that the larger network uniformly outperforms the smaller network under all settings. Third, we note from the box plots in Figures 1 and 2 that the performance of the larger networks has lower variance, hence is more stable, compared with what can be achieved by the smaller networks. Finally, Figure 3 compares random initialization with the special initialization strategy satisfying (23) and (24).

Related Work: Overparameterized neural networks. The seminal work Jacot et al. (2018) showed that an infinite-width neural network trained using gradient descent (GD) behaves like a kernel method, with the kernel defined as the neural tangent kernel (NTK). Using this NTK parameterization, Li and Liang (2018) showed that deep neural networks trained using GD require $\Omega(N^4)$ width to find the global optimum. This result was later improved to $\Omega(N^3)$ in Huang and Yau (2020). The authors of Du et al. (2018) and Du et al. (2019) also analyzed the performance of GD on overparameterized neural networks under different settings. Under standard parameterization, Allen-Zhu et al. (2019) studied the convergence of SGD and showed that a network width of $\Omega(N^{24})$ suffices to guarantee linear convergence. Recently, Nguyen and Mondelli (2020) and Nguyen (2021) improved the dependence on the width and showed that GD requires only $\Omega(N)$ width to achieve linear convergence. All the works mentioned above focus on the centralized setting, and therefore do not deal with the data heterogeneity problem.

B PROOF OF MAIN RESULT

B.1 PROOF OF LEMMAS

We define some additional notation before stating the lemmas needed in the proof. Let $\otimes$ denote the Kronecker product, and define

$$\Sigma_l := \mathrm{diag}\big[\mathrm{vec}(\sigma'(F_{l-1} W_l))\big] \in \mathbb{R}^{N n_l \times N n_l}, \qquad \Sigma_{l,k} := \mathrm{diag}\big[\mathrm{vec}(\sigma'(F_{l-1,k} W_{l,k}))\big] \in \mathbb{R}^{N_k n_l \times N_k n_l},$$

$$\tilde\Sigma_{l,k} := \mathrm{diag}\big[\mathrm{vec}(\sigma'(\tilde F_{l-1,k} W_{l,k}))\big] \in \mathbb{R}^{m n_l \times m n_l}.$$

Define $f_{L,k}^{rt+v} := f_{L,k}(\theta_k^{rt+v})$, $F_{L,k}^{rt+v} := F_{L,k}(\theta_k^{rt+v})$, $f_L^{rt} := f_L(\bar\theta^{rt})$, $F_L^{rt} := F_L(\bar\theta^{rt})$, and $f_L(\theta^{rt+v}) := \mathrm{vec}(F_L^{rt+v})$.

Lemma 4 (Nguyen and Mondelli (2020)). Suppose Assumptions 2 and 3 are satisfied. Then for $l \in [L]$ the following hold:

1. $g_{l,k} = \frac{1}{N_k}\big(I_{n_l} \otimes F_{l-1,k}^T\big) \prod_{p=l+1}^{L} \Sigma_{p-1,k}\big(W_{p,k} \otimes I_{N_k}\big)\, (f_{L,k} - y_k)$;

2. $\frac{\partial f_{L,k}}{\partial\, \mathrm{vec}(W_{l,k})} = \Big[\prod_{p=0}^{L-l-1} \big(W_{L-p,k}^T \otimes I_{N_k}\big)\, \Sigma_{L-p-1,k}\Big]\big(I_{n_l} \otimes F_{l-1,k}\big)$; (27)

3. $\|g_{2,k}\|_2 \ge \frac{1}{N_k}\, \sigma_{\min}(F_{1,k}) \prod_{p=3}^{L} \sigma_{\min}(\Sigma_{p-1,k})\, \sigma_{\min}(W_{p,k})\, \|f_{L,k} - y_k\|_2$; (28)

4. $\|F_{l,k}\|_F \le \|X_k\|_F \prod_{p=1}^{l} \sigma_{\max}(W_{p,k})$; (29)

5. $\|\nabla_{W_{l,k}} \Phi_k\|_F \le \frac{1}{N_k}\, \|X_k\|_F \prod_{p=1,\, p\neq l}^{L} \sigma_{\max}(W_{p,k})\, \|f_{L,k} - y_k\|_2$; (30)

6. $\|g_k\|_2 \le \frac{L \|X_k\|_F}{N_k} \cdot \frac{\prod_{l=1}^{L} \sigma_{\max}(W_{l,k})}{\min_{l\in[L]} \sigma_{\max}(W_{l,k})} \prod_{l=2}^{L} \sigma_{\max}(\Sigma_{l-1,k})\, \|f_{L,k} - y_k\|_2$. (31)

Furthermore, given $\theta_k^a$ and $\theta_k^b$, suppose $\bar\Lambda_l \ge \max\{\sigma_{\max}(W_{l,k}^a),\, \sigma_{\max}(W_{l,k}^b)\}$ for some scalars $\bar\Lambda_l$, and let $R = \prod_{p=1}^{L} \max\{1, \bar\Lambda_p\}$. Then, for $l \in [L]$:

7. $\|F_{L,k}^a - F_{L,k}^b\|_F \le \sqrt{L}\, \|X_k\|_F\, \frac{\prod_{l=1}^{L} \bar\Lambda_l}{\min_{l\in[L]} \bar\Lambda_l}\, \|\theta_k^a - \theta_k^b\|_2$; (32)

8. $\Big\|\frac{\partial f_L(\theta_k^a)}{\partial\, \mathrm{vec}(W_l^a)} - \frac{\partial f_L(\theta_k^b)}{\partial\, \mathrm{vec}(W_l^b)}\Big\|_2 \le \sqrt{L}\, \|X_k\|_F\, R\, (1 + L\beta \|X_k\|_F R)\, \|\theta_k^a - \theta_k^b\|_2$. (33)

The above lemma follows from Lemma 4.1 of Nguyen and Mondelli (2020).

Lemma 5. Suppose $f$ is differentiable and $\|\nabla f(z) - \nabla f(x)\|_2 \le C \|z - x\|_2$ for every $z = x + t(y - x)$ with $t \in [0, 1]$. Then

$$f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{C}{2}\|x - y\|^2.$$

Lemma 6. For constants $C, \mu, \rho$, we have

$$\lim_{\eta \to 0}\, \big(\sqrt{1 + 3\rho C \eta}\big)^{\frac{1}{\log\frac{1}{1 - \mu C \eta}}} = e^{\frac{3\rho}{2\mu}}.$$

Furthermore, given $\epsilon < \Phi(\theta^0)$, let $T = \Big\lceil \frac{\log(\Phi(\theta^0)/\epsilon)}{\log\frac{1}{1 - \mu C \eta}} \Big\rceil + 1$. Then there exists a constant $\xi \ge e^{\frac{3\rho}{2\mu}}$, depending on $\rho$ and $\mu$, such that

$$\sup_{0 < \eta < \min\left(\frac{1}{\rho C}, \frac{1}{\mu C}\right)} \big(\sqrt{1 + 3\rho C \eta}\big)^{T} \le \frac{\xi\, \Phi(\theta^0)}{\epsilon}.$$

Proof. Taking the logarithm, we get

$$\log \big(\sqrt{1 + 3\rho C \eta}\big)^{\frac{1}{\log\frac{1}{1 - \mu C \eta}}} = -\frac{\log\big(\sqrt{1 + 3\rho C \eta}\big)}{\log(1 - \mu C \eta)} = -\frac{1}{2} \cdot \frac{\log(1 + 3\rho C \eta)}{\log(1 - \mu C \eta)}.$$

Letting $\eta \to 0$ and applying L'Hôpital's rule (differentiating numerator and denominator w.r.t. $\eta$), we have

$$\lim_{\eta \to 0}\, -\frac{1}{2} \cdot \frac{\log(1 + 3\rho C \eta)}{\log(1 - \mu C \eta)} = \lim_{\eta \to 0}\, \frac{1}{2} \cdot \frac{3\rho C}{\mu C} \cdot \frac{1 - \mu C \eta}{1 + 3\rho C \eta} = \frac{3\rho}{2\mu}.$$

Next, if the function $\eta \mapsto \big(\sqrt{1 + 3\rho C \eta}\big)^{1/\log\frac{1}{1 - \mu C \eta}}$ has a limit as $\eta \to \min\big(\frac{1}{\rho C}, \frac{1}{\mu C}\big)$, then by continuity it has an upper bound on $\big(0, \min\big(\frac{1}{\rho C}, \frac{1}{\mu C}\big)\big)$; denote this bound by $\xi$. It is easy to derive that

$$\lim_{\eta \to \min\left(\frac{1}{\rho C}, \frac{1}{\mu C}\right)} \big(\sqrt{1 + 3\rho C \eta}\big)^{\frac{1}{\log\frac{1}{1 - \mu C \eta}}} = \begin{cases} 2^{\frac{1}{\log\frac{1}{1 - \mu/\rho}}}, & \rho > \mu, \\ 1, & \rho \le \mu. \end{cases}$$

Then, by continuity of the function, $\big(\sqrt{1 + 3\rho C \eta}\big)^{1/\log\frac{1}{1 - \mu C \eta}}$ is bounded by some constant $\xi$. We can further derive

$$\sup_{\eta \in \left(0,\, \min\left(\frac{1}{\rho C}, \frac{1}{\mu C}\right)\right)} \big(\sqrt{1 + 3\rho C \eta}\big)^{T} \ge \lim_{\eta \to 0}\, \big(\sqrt{1 + 3\rho C \eta}\big)^{\frac{\log(\Phi(\theta^0)/\epsilon)}{\log\frac{1}{1 - \mu C \eta}}} = e^{\frac{3\rho}{2\mu}} \cdot \frac{\Phi(\theta^0)}{\epsilon}, \quad (38)$$

and hence there exists some constant $\xi \ge e^{\frac{3\rho}{2\mu}}$ such that

$$\sup_{\eta \in \left(0,\, \min\left(\frac{1}{\rho C}, \frac{1}{\mu C}\right)\right)} \big(\sqrt{1 + 3\rho C \eta}\big)^{T} \le \frac{\xi\, \Phi(\theta^0)}{\epsilon}. \quad (39)$$

Lemma 7. Let Assumptions 2 and 3 hold. For $\theta_k$, suppose there exist constants $\bar\Lambda_l, \underline\Lambda_l, \Lambda_F$ such that

$$\begin{cases} \sigma_{\max}(W_{l,k}) \le \bar\Lambda_l, & l \in [L],\ k \in [K], \\ \sigma_{\min}(W_{l,k}) \ge \underline\Lambda_l, & l \in \{3, \ldots, L\},\ k \in [K], \\ \sigma_{\min}(F_{1,k}) \ge \Lambda_F, & k \in [K]. \end{cases}$$

Then we have

$$\rho(\theta_k) \le \frac{L N \|X\|_F\, \bar\Lambda_{1\to L}}{m\, \gamma^{L-2}\, \underline\Lambda_{3\to L}\, \Lambda_F\, \min_{l\in[L]} \bar\Lambda_l}.$$

Proof. By definition, we have

$$\rho(\theta_k) = \frac{\|\tilde g_k\|_2}{\|g_k\|_2} \le \frac{\|\tilde g_k\|_2}{\|g_{2,k}\|_2}.$$

By (31) and (28) in Lemma 4, we have

$$\|\tilde g_k\|_2 \le \frac{L \|\tilde X_k\|_F}{m} \cdot \frac{\bar\Lambda_{1\to L}}{\min_{l\in[L]} \bar\Lambda_l}\, \|\tilde f_{L,k}(\theta) - \tilde y_k\|_2, \qquad \|g_{2,k}\|_2 \ge \frac{1}{N_k}\, \gamma^{L-2}\, \underline\Lambda_{3\to L}\, \Lambda_F\, \|f_{L,k}(\theta) - y_k\|_2,$$

where $\tilde X_k$ is the sampled data at $\theta_k$. So we can derive

$$\rho(\theta_k) \le \frac{\frac{L \|\tilde X_k\|_F}{m} \cdot \frac{\bar\Lambda_{1\to L}}{\min_{l\in[L]} \bar\Lambda_l}\, \|\tilde f_{L,k}(\theta) - \tilde y_k\|_2}{\frac{1}{N_k}\, \gamma^{L-2}\, \underline\Lambda_{3\to L}\, \Lambda_F\, \|f_{L,k}(\theta) - y_k\|_2} \le \frac{L N \|X\|_F\, \bar\Lambda_{1\to L}}{m\, \gamma^{L-2}\, \underline\Lambda_{3\to L}\, \Lambda_F\, \min_{l\in[L]} \bar\Lambda_l},$$

where the last inequality uses $\|\tilde X_k\|_F \le \|X\|_F$, $N_k \le N$, and $\|\tilde f_{L,k}(\theta) - \tilde y_k\|_2 \le \|f_{L,k}(\theta) - y_k\|_2$.

Lemma 8. For the FedAvg-SGD algorithm, given step size $\eta > 0$, $v \in \{0, 1, \ldots, r-1\}$ and $q \in \{0, 1, \ldots, v-1\}$.
Suppose the following conditions hold:
1. $\tilde\Lambda_l \ge \sup_{k\in[K]}\sigma_{\max}\big(W^{rt+q}_{l,k}\big)$,
2. $\rho \ge \sup_{k\in[K]}\rho\big(\theta^{rt+q}_k\big)$,
3. $\Phi_k(\theta^{rt+q}) \le A^q\cdot\Phi_k(\theta^{rt})$, $k\in[K]$.

Then we have
$$\Big\|\sum_{q=0}^{v}\tilde{\bar g}^{rt+q}\Big\|_2 \le \sum_{q=0}^{v}\big\|\tilde{\bar g}^{rt+q}\big\|_2 \le \frac{\rho L\|X\|_F}{N}\cdot\frac{A^{\frac{v+1}{2}}-1}{\sqrt A-1}\cdot\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\,\big\|f^{rt}_L-y\big\|_2.\qquad(49)$$
Further, there exist constants $Q_k$ such that for all $k\in[K]$ we have
$$\big\|g^{rt+q+1}_k-g^{rt+q}_k\big\|_2 \le Q_k\,\big\|\theta^{rt+q+1}_k-\theta^{rt+q}_k\big\|_2,\qquad(50)$$
and
$$\big\|\bar g^{rt+v}-\bar g^{rt}\big\|_2 \le \sum_{q=0}^{v-1}\big\|\bar g^{rt+q+1}-\bar g^{rt+q}\big\|_2 \le \frac{\eta\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\cdot\frac{A^{\frac{v+1}{2}}-1}{\sqrt A-1}\,\sqrt{\sum_{k=1}^{K}Q_k^2\|X_k\|_F^2}\;\big\|f^{rt}_L-y\big\|_2.\qquad(51)$$
Proof. First, let us show (49):
$$\Big\|\sum_{q=0}^{v}\tilde{\bar g}^{rt+q}\Big\|_2 \overset{(i)}{\le} \sum_{q=0}^{v}\big\|\tilde{\bar g}^{rt+q}\big\|_2 \overset{(ii)}{\le} \sum_{q=0}^{v}\sum_{k=1}^{K}\frac{N_k}{N}\big\|\tilde g^{rt+q}_k\big\|_2 \overset{(iii)}{\le} \rho\sum_{q=0}^{v}\sum_{k=1}^{K}\frac{N_k}{N}\big\|g^{rt+q}_k\big\|_2\qquad(52)$$
$$\overset{(iv)}{\le} \frac{\rho L}{N}\sum_{q=0}^{v}\sum_{k=1}^{K}\|X_k\|_F\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\,\big\|f^{rt+q}_{L,k}-y_k\big\|_2\qquad(53)$$
$$= \frac{\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\sum_{k=1}^{K}\|X_k\|_F\sum_{q=0}^{v}\big\|f^{rt+q}_{L,k}-y_k\big\|_2\qquad(54)$$
$$\overset{(v)}{\le} \frac{\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\sum_{k=1}^{K}\|X_k\|_F\sum_{q=0}^{v}A^{\frac q2}\,\big\|f^{rt}_{L,k}-y_k\big\|_2\qquad(55)$$
$$= \frac{\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\cdot\frac{A^{\frac{v+1}{2}}-1}{\sqrt A-1}\sum_{k=1}^{K}\|X_k\|_F\,\big\|f^{rt}_{L,k}-y_k\big\|_2\qquad(56)$$
$$\overset{(vi)}{\le} \frac{\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\cdot\frac{A^{\frac{v+1}{2}}-1}{\sqrt A-1}\,\sqrt{\sum_{k=1}^{K}\|X_k\|_F^2}\,\sqrt{\sum_{k=1}^{K}\big\|f^{rt}_{L,k}-y_k\big\|_2^2}\qquad(57)$$
$$= \frac{\rho L\|X\|_F}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\cdot\frac{A^{\frac{v+1}{2}}-1}{\sqrt A-1}\,\big\|f^{rt}_L-y\big\|_2,$$
so (49) follows. Next, we show (50). Denote $Jf^{rt+q}_{L,k} := \Big[\frac{\partial f^{rt+q}_{L,k}}{\partial\,\mathrm{vec}(W_{1,k})},\dots,\frac{\partial f^{rt+q}_{L,k}}{\partial\,\mathrm{vec}(W_{L,k})}\Big]$. By the triangle inequality,
$$\big\|g^{rt+q+1}_k-g^{rt+q}_k\big\|_2 = \big\|Jf^{rt+q+1}_{L,k}\big(f^{rt+q+1}_{L,k}-y_k\big)-Jf^{rt+q}_{L,k}\big(f^{rt+q}_{L,k}-y_k\big)\big\|_2$$
$$\le \big\|f^{rt+q+1}_{L,k}-f^{rt+q}_{L,k}\big\|_2\,\big\|Jf^{rt+q+1}_{L,k}\big\|_2 + \big\|Jf^{rt+q+1}_{L,k}-Jf^{rt+q}_{L,k}\big\|_2\,\big\|f^{rt+q}_{L,k}-y_k\big\|_2.\qquad(59)$$
Now we bound each term in (59). Since $\max\big\{\sigma_{\max}(W^{rt+q+1}_{l,k}),\sigma_{\max}(W^{rt+q}_{l,k})\big\}\le\tilde\Lambda_l$, by (32) in Lemma 4 we get
$$\big\|f^{rt+q+1}_{L,k}-f^{rt+q}_{L,k}\big\|_2 \le \sqrt L\,\|X_k\|_F\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\,\big\|\theta^{rt+q+1}_k-\theta^{rt+q}_k\big\|_2.\qquad(60)$$
Further, by (27) we have
$$\big\|Jf^{rt+q+1}_{L,k}\big\|_2 \le \sum_{l=1}^{L}\Big\|\frac{\partial f^{rt+q+1}_{L,k}}{\partial\,\mathrm{vec}(W_{l,k})}\Big\|_2 \le L\,\|X_k\|_F\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}.\qquad(61)$$
Using (33) in Lemma 4, we have
$$\big\|Jf^{rt+q+1}_{L,k}-Jf^{rt+q}_{L,k}\big\|_2 \le \sum_{l=1}^{L}\Big\|\frac{\partial f^{rt+q+1}_{L,k}}{\partial\,\mathrm{vec}(W_{l,k})}-\frac{\partial f^{rt+q}_{L,k}}{\partial\,\mathrm{vec}(W_{l,k})}\Big\|_2 \le L^{\frac32}\|X_k\|_F\,R'\,(1+L\beta\|X_k\|_F R')\,\big\|\theta^{rt+q+1}_k-\theta^{rt+q}_k\big\|_2,\qquad(62)$$
where $R' = \prod_{p=1}^{L}\max\{1,\tilde\Lambda_p\}$. Plugging the above bounds into (59) and setting the Lipschitz constant
$$Q_k = \frac{L\sqrt L}{N_k}\,\|X_k\|_F^2\,\frac{\tilde\Lambda_{1\to L}^2}{\min_{l\in[L]}\tilde\Lambda_l^2} + \frac{L\sqrt L}{N_k}\,\|X_k\|_F\,(1+L\beta\|X_k\|_F R')\,R'\,\big\|f^0_{L,k}-y_k\big\|_2,\qquad(63)$$
we derive
$$\big\|g^{rt+q+1}_k-g^{rt+q}_k\big\|_2 \le Q_k\,\big\|\theta^{rt+q+1}_k-\theta^{rt+q}_k\big\|_2.\qquad(64)$$
Now (50) is proved. Last, we prove (51). We have
$$\big\|\bar g^{rt+v}-\bar g^{rt}\big\|_2 \le \sum_{q=0}^{v-1}\big\|\bar g^{rt+q+1}-\bar g^{rt+q}\big\|_2 \overset{(i)}{\le} \sum_{q=0}^{v-1}\sum_{k=1}^{K}\frac{N_k}{N}\big\|g^{rt+q+1}_k-g^{rt+q}_k\big\|_2 \overset{(ii)}{\le} \sum_{q=0}^{v-1}\sum_{k=1}^{K}\frac{N_k}{N}\,Q_k\,\big\|\theta^{rt+q+1}_k-\theta^{rt+q}_k\big\|_2$$
$$= \sum_{q=0}^{v-1}\sum_{k=1}^{K}\frac{N_k}{N}\,Q_k\cdot\eta\,\big\|\tilde g^{rt+q}_k\big\|_2 \overset{(iii)}{\le} \eta\rho\sum_{q=0}^{v-1}\sum_{k=1}^{K}\frac{Q_k}{N}\,L\,\|X_k\|_F\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\,\big\|f^{rt+q}_{L,k}-y_k\big\|_2$$
$$\overset{(iv)}{\le} \frac{\eta\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\sum_{k=1}^{K}Q_k\|X_k\|_F\sum_{q=0}^{v-1}A^{\frac q2}\,\big\|f^{rt}_{L,k}-y_k\big\|_2$$
$$\le \frac{\eta\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\cdot\frac{A^{\frac{v+1}{2}}-1}{\sqrt A-1}\sum_{k=1}^{K}Q_k\|X_k\|_F\,\big\|f^{rt}_{L,k}-y_k\big\|_2$$
$$\overset{(v)}{\le} \frac{\eta\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\cdot\frac{A^{\frac{v+1}{2}}-1}{\sqrt A-1}\,\sqrt{\sum_{k=1}^{K}Q_k^2\|X_k\|_F^2}\,\sqrt{\sum_{k=1}^{K}\big\|f^{rt}_{L,k}-y_k\big\|_2^2}$$
$$= \frac{\eta\rho L}{N}\,\frac{\tilde\Lambda_{1\to L}}{\min_{l\in[L]}\tilde\Lambda_l}\cdot\frac{A^{\frac{v+1}{2}}-1}{\sqrt A-1}\,\sqrt{\sum_{k=1}^{K}Q_k^2\|X_k\|_F^2}\;\big\|f^{rt}_L-y\big\|_2,$$
where (i) uses the triangle inequality; (ii) uses the Lipschitz-gradient bound (50); (iii) comes from condition 2 and (31) in Lemma 4; (iv) uses condition 3; (v) is the Cauchy-Schwarz inequality.

B.2 INITIALIZATION STRATEGY

Detailed initialization for FedAvg-SGD: Denote
$$P := \frac{L\|X\|_F}{N}\Big(\frac74\Big)^{L-1}(2^r-1),\qquad C := P\,L\|X\|_F\Big(\frac32\Big)^{L-1}\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2},$$
$$\rho := \frac{7^{L-1}LN\|X\|_F\,\bar\lambda_{1\to L}}{m\,\gamma^{L-2}\,\underline\lambda_{3\to L}\,\min\big(\alpha_0,\min_{k\in[K]}\alpha_{0,k}\big)\,\min_{l\in[L]}\bar\lambda_l},\qquad \mu := \frac{r}{NC}\,\gamma^{2(L-2)}\Big(\frac12\Big)^{2(L-1)}\underline\lambda^2_{3\to L}\,\alpha_0^2.\qquad(65)$$
Suppose that, for any given small $\epsilon$ with $\epsilon<\Phi(\theta^0)$, the initialized weights satisfy the following conditions:
$$2N^{\frac32}L\|X\|_F\Big(\frac32\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}\,\sqrt{2\Phi(\theta^0)} \le \begin{cases}\frac12\underline\lambda_l, & l\in\{3,\dots,L\},\\ 1, & l\in\{1,2\},\end{cases}\qquad(68)$$
$$2N^{\frac32}L\Big(\frac32\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}\,\sqrt{2\Phi(\theta^0)} \le \frac12\min\Big(\alpha_0,\min_{k\in[K]}\alpha_{0,k}\Big),\qquad(69)$$
where $\xi\ge e^{\frac{3\rho}{2\mu}}$ is a constant depending only on $\rho$ and $\mu$.

Now we describe a concrete way to realize the above initialization conditions, following the strategy of Nguyen and Mondelli (2020). First, randomly initialize $[W^0_1]_{ij}\sim\mathcal N(0,1/d_{\mathrm{in}}^2)$. Broadcast $W^0_1$ to each client and collect $F_{1,k}$ (the output of the first layer at each client) as well as the norm of the local data $\|X_k\|_F$ and the norm of the local labels $\|y_k\|_2$. With $F_{1,k}$, the quantities $\alpha_0$ and $\alpha_{0,k}$ can be computed. For (23), since $n_1>N$, $\alpha_0$ and $\alpha_{0,k}$ are strictly positive with probability 1. It is then easy to verify that, given $\epsilon>0$, (23) and the second relation in (24) are satisfied if we choose $\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}$ large enough; this can be realized by choosing arbitrarily large $\underline\lambda_l$, $l\in\{3,\dots,L\}$. However, notice that by Lemma 6 the constant $\xi$ defined in (39) depends only on $\rho$ and $\mu$, with $\xi\ge e^{\frac{3\rho}{2\mu}}$. So if we can fix $\rho$ and $\mu$ as constants, $\xi$ is a bounded constant. Notice in (66) and (67) that, for $l\in\{3,\dots,L\}$, if we can make $\bar\lambda_l$ and $\underline\lambda_l$ close to each other, then $\rho$ and $\mu$ are also close, so $\frac{3\rho}{2\mu}$ is not large. This is equivalent to the first relation in (24) in the main text.
To satisfy the above conditions, one way is to construct $\{W^0_l\}_{l=3}^{L}$ such that $\underline\lambda_l=\bar\lambda_l=\zeta>1$, where $\zeta$ can be chosen as any large number such that (23) and the second relation in (24) are satisfied. Specifically, we can use the following construction: initialize $W^0_l$ so that its top block is a scaled identity matrix and the remaining entries are zero,
$$W^0_l = \big[\,\zeta\cdot I_{n_l}\ \ 0\,\big]\in\mathbb{R}^{n_l\times n_{l-1}},\qquad l=3,\dots,L,$$
so that all nonzero singular values of $W^0_l$ equal $\zeta$. We also need to upper bound $\Phi(\theta^0)$. This can be done by choosing a small $W^0_2$: randomly initialize $[W^0_2]_{ij}\sim\mathcal N(0,\kappa)$ with $\kappa$ arbitrarily small. Similar to (10) in Nguyen and Mondelli (2020), we can then bound $\Phi(\theta^0)$ with high probability:
$$\sqrt{2N\Phi(\theta^0)} = \big\|F_L(\theta^0)-y\big\|_F \le \|y\|_2+\big\|F_L(\theta^0)\big\|_F \le \|y\|_2+\prod_{l=1}^{L}\sigma_{\max}(W^0_l)\,\|X\|_F \le 2\|y\|_2,\qquad(71)$$
so the loss at initialization satisfies $\sqrt{2N\Phi(\theta^0)}\le2\|y\|_2$.

Initialization for FedAvg-GD: the initialized weight matrices satisfy the following conditions:
$$\frac{2N\big[\big(\frac32\big)^{L-1}+2^{L-1}(r-1)\big]\,\|X\|_F}{r\,\gamma^{2(L-2)}\big(\frac12\big)^{2(L-1)}\underline\lambda^2_{3\to L}\,\alpha_0^2}\cdot\frac{\bar\lambda_{1\to L}}{\bar\lambda_l} \le \begin{cases}\frac12\underline\lambda_l, & l\in\{3,\dots,L\},\\ 1, & l\in\{1,2\},\end{cases}$$
$$\frac{2N\big[\big(\frac32\big)^{L-1}+2^{L-1}(r-1)\big]\,\|X\|_F^2}{r\,\gamma^{2(L-2)}\big(\frac12\big)^{2(L-1)}\underline\lambda^2_{3\to L}\,\alpha_0^2}\cdot\bar\lambda_{2\to L} \le \frac12\alpha_0.$$
The initialization strategy is similar to that of FedAvg-SGD, so we omit the discussion here.
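The construction above is straightforward to implement. The sketch below (hypothetical helper name; numpy; using an `(out, in)` weight-shape convention, which is an assumption) draws $W^0_1\sim\mathcal N(0,1/d_{\mathrm{in}}^2)$, $W^0_2\sim\mathcal N(0,\kappa)$ with small $\kappa$, and sets layers $3,\dots,L$ to the zero-padded scaled identity, so that $\sigma_{\min}=\sigma_{\max}=\zeta$ for those layers:

```python
import numpy as np

def init_fedavg_weights(widths, zeta=2.0, kappa=1e-4, seed=0):
    """Sketch of the initialization strategy above (helper name hypothetical).

    widths = [d_in, n1, n2, ..., nL] with decreasing n_l after the wide layer n1.
    - W1 ~ N(0, 1/d_in^2): shared random first layer.
    - W2 ~ N(0, kappa) with small kappa, to keep the initial loss small.
    - Wl (l >= 3): scaled identity block, zeros elsewhere, so every nonzero
      singular value equals zeta.
    """
    rng = np.random.default_rng(seed)
    d_in = widths[0]
    Ws = [rng.normal(0.0, 1.0 / d_in**2, size=(widths[1], d_in))]          # W1
    Ws.append(rng.normal(0.0, np.sqrt(kappa), size=(widths[2], widths[1])))  # W2
    for l in range(3, len(widths)):
        W = np.zeros((widths[l], widths[l - 1]))
        np.fill_diagonal(W, zeta)  # top-left block = zeta * I
        Ws.append(W)
    return Ws
```

With decreasing widths, the singular values of each $W^0_l$, $l\ge3$, are all exactly $\zeta$, which is what makes $\underline\lambda_l=\bar\lambda_l=\zeta$ attainable.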

B.3 PROOF OF THEOREM 1

Theorem 1. Use FedAvg-SGD (Algorithm 1) to minimize (3). Suppose Assumptions 1, 2 and 3 are satisfied. Then there exists an initialization strategy such that, for any $\epsilon<\Phi(\theta^0)$, there exists a step size $\eta>0$ such that
$$\mathbb E\big[\Phi(\bar\theta^{r(t+1)})\big] \le (1-\mu'\eta)^t\,\Phi(\theta^0),\qquad t\in\{0,\dots,T-1\},$$
where $\mu' = \frac rN\,\gamma^{2(L-2)}\big(\frac12\big)^{2(L-1)}\underline\lambda^2_{3\to L}\,\alpha_0^2$.

Proof. First, we outline the structure of the proof. We will show the following recursively at each communication round: 1) the averaged weights are bounded; 2) the divergence of the loss function (3) is bounded; 3) the expected loss (3) decreases linearly. Further, we will show that within each local epoch of a fixed communication round: 1) the weights of each client are bounded; 2) the divergence of each client's local loss $\Phi_k$ is bounded.

Now set $T = \Big\lceil\frac{\log(\Phi(\theta^0)/\epsilon)}{\log\big(\frac{1}{1-\mu C\eta}\big)}\Big\rceil+1$. If we can show (75) holds for $t=0,\dots,T$, then it is easy to show that $\mathbb E[\Phi(\bar\theta^{rT})]\le(1-\mu C\eta)^T\Phi(\theta^0)\le\epsilon$.

We prove Theorem 1 by induction. Define
$$\rho^{rt+v} := \sup_{k\in[K],\,q\in\{0,1,\dots,v\}}\rho\big(\theta^{rt+q}_k\big),\qquad \rho := \frac{7^{L-1}LN\|X\|_F\,\bar\lambda_{1\to L}}{m\,\gamma^{L-2}\,\underline\lambda_{3\to L}\,\min\big(\alpha_0,\min_{k\in[K]}\alpha_{0,k}\big)\,\min_{l\in[L]}\bar\lambda_l}.$$
We show that for all $t\le T$ we have
$$\begin{cases}
\sigma_{\max}\big(\bar W^{ru}_l\big)\le\frac32\bar\lambda_l, & u\in\{0,\dots,t\},\ l\in[L],\\
\sigma_{\min}\big(\bar W^{ru}_l\big)\ge\frac12\underline\lambda_l, & u\in\{0,\dots,t\},\ l\in\{3,\dots,L\},\\
\sigma_{\min}\big(F^{ru}_1\big)\ge\frac12\alpha_0, & u\in\{0,\dots,t\},\\
\sigma_{\min}\big(F^{ru}_{1,k}\big)\ge\frac12\alpha_{0,k}, & u\in\{0,\dots,t\},\ k\in[K],\\
\rho^{rt}\le\rho, &\\
\Phi\big(\bar\theta^{ru}\big)\le(1+3\rho C\eta)^u\,\Phi(\theta^0), & u\in\{0,\dots,t\},\\
\mathbb E\big[\Phi\big(\bar\theta^{ru}\big)\big]\le(1-\mu C\eta)^u\,\Phi(\theta^0), & u\in\{0,\dots,t\},
\end{cases}\qquad(77)$$
where $\bar\lambda_l$ is defined in (22), $\underline\lambda_l$ is the smallest singular value of the corresponding weight matrix, $C,\mu,\rho$ are defined in B.2, and $\mu C=\mu'$. The above recursion describes the weight matrices and the loss function in each communication round.
To prove (77), we decompose the recursion into two steps, as follows.

Step 1: For fixed $t$ and $v\in[r-1]$, given
$$\begin{cases}
\sigma_{\max}\big(\bar W^{ru}_l\big)\le\frac32\bar\lambda_l, & u\in\{0,\dots,t\},\ l\in[L],\\
\sigma_{\min}\big(\bar W^{ru}_l\big)\ge\frac12\underline\lambda_l, & u\in\{0,\dots,t\},\ l\in[L],\\
\sigma_{\min}\big(F^{ru}_1\big)\ge\frac12\alpha_0, & u\in\{0,\dots,t\},\\
\rho^{rt}\le\rho, &\\
\Phi\big(\bar\theta^{ru}\big)\le(1+3\rho C\eta)^u\,\Phi(\theta^0), & u\in\{0,\dots,t\},\\
\mathbb E\big[\Phi\big(\bar\theta^{ru}\big)\big]\le(1-\mu C\eta)^u\,\Phi(\theta^0), & u\in\{0,\dots,t\},\\
\Phi_k\big(\theta^{rt+q}_k\big)\le(1+3\rho C'\eta)^q\,\Phi_k\big(\theta^{rt}_k\big), & q\in\{0,\dots,v-1\},\ k\in[K],\\
\sigma_{\max}\big(W^{rt+q}_{l,k}\big)\le\frac74\bar\lambda_l, & q\in\{0,\dots,v-1\},\ l\in[L],\ k\in[K],\\
\sigma_{\min}\big(W^{rt+q}_{l,k}\big)\ge\frac14\underline\lambda_l, & q\in\{0,\dots,v-1\},\ l\in[L],\ k\in[K],\\
\sigma_{\min}\big(F^{rt+q}_{1,k}\big)\ge\frac14\alpha_{0,k}, & q\in\{0,\dots,v-1\},\ k\in[K],\\
\rho^{rt+v-1}\le\rho, &
\end{cases}\qquad(78)$$
we aim to show
$$\begin{cases}
\sigma_{\max}\big(W^{rt+q}_{l,k}\big)\le\frac74\bar\lambda_l, & q\in\{0,\dots,v\},\ l\in[L],\ k\in[K],\\
\sigma_{\min}\big(W^{rt+q}_{l,k}\big)\ge\frac14\underline\lambda_l, & q\in\{0,\dots,v\},\ l\in[L],\ k\in[K],\\
\sigma_{\min}\big(F^{rt+q}_{1,k}\big)\ge\frac14\alpha_{0,k}, & q\in\{0,\dots,v\},\ k\in[K],\\
\rho^{rt+v}\le\rho, &\\
\Phi_k\big(\theta^{rt+q}_k\big)\le(1+3\rho C'\eta)^q\,\Phi_k\big(\theta^{rt}_k\big), & q\in\{0,1,\dots,v\},\ k\in[K],
\end{cases}\qquad(79)$$
where $C' = \max_k\frac{1}{N_k}\big(\frac74\big)^{2(L-1)}\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2}$.

Step 2: Given (78) and (79), we show
$$\begin{cases}
\sigma_{\max}\big(\bar W^{ru}_l\big)\le\frac32\bar\lambda_l, & u\in\{0,\dots,t+1\},\ l\in[L],\\
\sigma_{\min}\big(\bar W^{ru}_l\big)\ge\frac12\underline\lambda_l, & u\in\{0,\dots,t+1\},\ l\in[L],\\
\sigma_{\min}\big(F^{ru}_1\big)\ge\frac12\alpha_0, & u\in\{0,\dots,t+1\},\\
\sigma_{\min}\big(F^{ru}_{1,k}\big)\ge\frac12\alpha_{0,k}, & u\in\{0,\dots,t+1\},\ k\in[K],\\
\rho^{r(t+1)}\le\rho, &\\
\Phi\big(\bar\theta^{ru}\big)\le(1+3\rho C\eta)^u\,\Phi(\theta^0), & u\in\{0,\dots,t+1\},\\
\mathbb E\big[\Phi\big(\bar\theta^{ru}\big)\big]\le(1-\mu C\eta)^u\,\Phi(\theta^0), & u\in\{0,\dots,t+1\}.
\end{cases}\qquad(80)$$

Now we show Step 1. (1) We first show
$$\sigma_{\max}\big(W^{rt+q}_{l,k}\big)\le\frac74\bar\lambda_l,\qquad \sigma_{\min}\big(W^{rt+q}_{l,k}\big)\ge\frac14\underline\lambda_l,\qquad q\in\{0,\dots,v\},\ l\in[L],\ k\in[K].\qquad(81)$$
We have
$$\big\|W^{rt+v}_{l,k}-\bar W^{rt}_l\big\|_F \le \sum_{q=0}^{v-1}\big\|W^{rt+q+1}_{l,k}-W^{rt+q}_{l,k}\big\|_F \le \eta\sum_{q=0}^{v-1}\big\|\tilde g^{rt+q}_{l,k}\big\|_2 \overset{(i)}{\le} \eta\sum_{q=0}^{v-1}\big\|\tilde g^{rt+q}_k\big\|_2 \overset{(ii)}{\le} \eta\rho\sum_{q=0}^{v-1}\big\|g^{rt+q}_k\big\|_2\qquad(82)$$
$$\overset{(iii)}{\le} \frac{\eta\rho L}{N_k}\,\|X_k\|_F\Big(\frac74\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\sum_{q=0}^{v-1}\big\|f^{rt+q}_{L,k}-y_k\big\|_2\qquad(83)$$
$$\overset{(iv)}{\le} \frac{\eta\rho L}{N_k}\,\|X_k\|_F\Big(\frac74\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\sum_{q=0}^{v-1}(1+3\rho C'\eta)^{\frac q2}\,\big\|f^{rt}_{L,k}-y_k\big\|_2,\qquad(84)$$
where (i) holds because the norm of the concatenated gradient is no smaller than the norm of any one layer's gradient; (ii) results from Lemma 1; (iii) comes from (31) in Lemma 4; (iv) follows from the induction assumption. Let $\eta<\frac{1}{\rho C'}$. Then
$$\big\|W^{rt+v}_{l,k}-\bar W^{rt}_l\big\|_F \overset{(i)}{\le} \frac{\eta\rho L}{N_k}\,\|X_k\|_F\Big(\frac74\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\sum_{q=0}^{v-1}2^q\,\big\|f^{rt}_{L,k}-y_k\big\|_2 \overset{(ii)}{\le} \frac{\eta\rho L(2^r-1)}{N_k}\,\|X_k\|_F\Big(\frac74\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^{rt}_L-y\big\|_2\qquad(86)$$
$$\le \frac{\eta\rho L(2^r-1)}{N_k}\,\|X_k\|_F\Big(\frac74\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^0_L-y\big\|_2\,\big(\sqrt{1+3\rho C\eta}\big)^{T} \le \frac{\eta\rho L(2^r-1)}{N_k}\,\|X_k\|_F\Big(\frac74\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^0_L-y\big\|_2\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}$$
$$\le \begin{cases}\frac14\underline\lambda_l, & l\in\{3,\dots,L\},\\ \frac16, & l\in\{1,2\},\end{cases}\qquad(87)$$
where (i) uses $\eta<\frac{1}{\rho C'}$; (ii) uses $\|f^{rt}_{L,k}-y_k\|_2\le\|f^{rt}_L-y\|_2$; the last inequality holds if we choose $\eta$ small enough. To be specific, we can choose
$$\eta < \frac{\min\big(\min_{l\in[L]}\frac14\underline\lambda_l,\ \frac16\big)}{\frac{\rho L(2^r-1)}{N_k}\,\|X_k\|_F\big(\frac74\big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^0_L-y\big\|_2\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}}.\qquad(88)$$
By Weyl's inequality, we have
$$\begin{cases}
\sigma_{\min}\big(W^{rt+v}_{l,k}\big)\ge\sigma_{\min}\big(\bar W^{rt}_l\big)-\frac14\underline\lambda_l\ge\frac14\underline\lambda_l, & l\in\{3,\dots,L\},\ k\in[K],\\
\sigma_{\max}\big(W^{rt+v}_{l,k}\big)\le\sigma_{\max}\big(\bar W^{rt}_l\big)+\frac14\bar\lambda_l\le\frac74\bar\lambda_l, & l\in\{3,\dots,L\},\ k\in[K],\\
\sigma_{\max}\big(W^{rt+v}_{1,k}\big)\le\frac16+\sigma_{\max}\big(\bar W^{rt}_1\big)\le\frac74\bar\lambda_1, & k\in[K],\\
\sigma_{\max}\big(W^{rt+v}_{2,k}\big)\le\frac16+\sigma_{\max}\big(\bar W^{rt}_2\big)\le\frac74\bar\lambda_2, & k\in[K].
\end{cases}\qquad(89)$$
(2) We next show that $\sigma_{\min}\big(F^{rt+q}_{1,k}\big)\ge\frac14\alpha_{0,k}$, $q\in\{0,\dots,v\}$, $k\in[K]$. It is sufficient to show $\sigma_{\min}\big(F^{rt+v}_{1,k}\big)\ge\frac14\alpha_{0,k}$, $k\in[K]$.
$$\big\|F^{rt+v}_{1,k}-F^{rt}_{1,k}\big\|_F = \big\|\sigma\big(X_kW^{rt+v}_{1,k}\big)-\sigma\big(X_kW^{rt}_{1,k}\big)\big\|_F\qquad(90)$$
$$\overset{(i)}{\le} \sigma_{\max}(X_k)\,\big\|W^{rt+v}_{1,k}-W^{rt}_{1,k}\big\|_F\qquad(91)$$
$$\overset{(ii)}{\le} \sigma_{\max}(X_k)\cdot\frac{\eta\rho(2^r-1)}{N_k}\,\|X_k\|_F\Big(\frac74\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^0_L-y\big\|_2\cdot\frac{\xi\Phi(\theta^0)}{\epsilon},\qquad(92)$$
where (i) results from the Lipschitz continuity of $\sigma$ in Assumption 3 and (ii) comes from (86). If we choose $\eta$ small enough, namely
$$\eta < \frac{\frac14\alpha_{0,k}\,N_k}{\sigma_{\max}(X_k)\,\rho(2^r-1)\,\|X_k\|_F\big(\frac74\big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^0_L-y\big\|_2\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}},\qquad(93)$$
then we have $\big\|F^{rt+v}_{1,k}-F^{rt}_{1,k}\big\|_F\le\frac14\alpha_{0,k}$.  (94)

(3) Next, we show that $\rho^{rt+v}\le\rho$.  (95) Since we have already shown in (81) that $\sigma_{\max}\big(W^{rt+v}_{l,k}\big)\le\frac74\bar\lambda_l$ and $\sigma_{\min}\big(W^{rt+v}_{l,k}\big)\ge\frac14\underline\lambda_l$, and we have shown in (89) that $\sigma_{\min}\big(F^{rt+v}_{1,k}\big)\ge\frac14\alpha_{0,k}$, by Lemma 1 we have
$$\rho\big(\theta^{rt+v}_k\big) \le \frac{\big(\frac74\big)^{L-1}LN\|X\|_F\,\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l\,\big(\frac14\big)^{L-1}m\,\gamma^{L-2}\,\underline\lambda_{3\to L}\,\min_{k\in[K]}\alpha_{0,k}} \le \rho.\qquad(96)$$
(4) Next, we prove
$$\Phi_k\big(\theta^{rt+q}_k\big)\le(1+3\rho C'\eta)^q\,\Phi_k\big(\theta^{rt}_k\big),\qquad q\in\{0,\dots,v\},\ k\in[K].\qquad(97)$$
It suffices to show
$$\Phi_k\big(\theta^{rt+v}_k\big)\le(1+3\rho C'\eta)^v\,\Phi_k\big(\theta^{rt}_k\big).\qquad(98)$$
First, we need to show that $\Phi_k$ has a Lipschitz gradient on the segment $[\theta^{rt+v-1}_k,\theta^{rt+v}_k]$. This is similar to the proof of (50) in Lemma 8, so we omit the details. It is easy to show that, for $\theta^{rt+v-1,s}_k := \theta^{rt+v-1}_k+s\big(\theta^{rt+v}_k-\theta^{rt+v-1}_k\big)$, we have $\max\big\{\sigma_{\max}\big(W^{rt+v-1,s}_{l,k}\big),\sigma_{\max}\big(W^{rt+v-1}_{l,k}\big)\big\}\le\frac74\bar\lambda_l$. So we can similarly derive the Lipschitz constant
$$Q_k = \frac{L\sqrt L}{N_k}\Big(\frac74\Big)^{2(L-1)}\|X_k\|_F^2\,\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2} + \frac{L\sqrt L}{N_k}\,\|X_k\|_F\,(1+L\beta\|X_k\|_F R')\,R'\,\big\|f^0_{L,k}-y_k\big\|_2,\qquad(99)$$
such that for all $s\in[0,1]$,
$$\big\|g^{rt+v-1,s}_k-g^{rt+v-1}_k\big\|_2 \le Q_k\,\big\|\theta^{rt+v-1,s}_k-\theta^{rt+v-1}_k\big\|_2.\qquad(100)$$
With a Lipschitz gradient on $[\theta^{rt+v-1}_k,\theta^{rt+v}_k]$, by Lemma 5 we have
$$\Phi_k\big(\theta^{rt+v}_k\big) \le \Phi_k\big(\theta^{rt+v-1}_k\big)+\big\langle\nabla\Phi_k\big(\theta^{rt+v-1}_k\big),\,\theta^{rt+v}_k-\theta^{rt+v-1}_k\big\rangle+\frac{Q_k}{2}\big\|\theta^{rt+v}_k-\theta^{rt+v-1}_k\big\|_2^2\qquad(101)$$
$$= \Phi_k\big(\theta^{rt+v-1}_k\big)+\big\langle g^{rt+v-1}_k,\,-\eta\tilde g^{rt+v-1}_k\big\rangle+\frac{Q_k}{2}\big\|\eta\tilde g^{rt+v-1}_k\big\|_2^2\qquad(102)$$
$$\le \Phi_k\big(\theta^{rt+v-1}_k\big)+\eta\,\big\|g^{rt+v-1}_k\big\|_2\,\big\|\tilde g^{rt+v-1}_k\big\|_2+\frac{Q_k}{2}\eta^2\big\|\tilde g^{rt+v-1}_k\big\|_2^2\qquad(103)$$
$$\le \Phi_k\big(\theta^{rt+v-1}_k\big)+\eta\rho\,\big\|g^{rt+v-1}_k\big\|_2^2+\frac{Q_k}{2}\eta^2\rho^2\big\|g^{rt+v-1}_k\big\|_2^2.\qquad(104)$$
Let $\eta<\frac{1}{Q_k\rho}$. Then the above inequality gives
$$\Phi_k\big(\theta^{rt+v}_k\big) \le \Phi_k\big(\theta^{rt+v-1}_k\big)+\frac32\rho\eta\,\big\|g^{rt+v-1}_k\big\|_2^2\qquad(105{,}106)$$
$$\le \Phi_k\big(\theta^{rt+v-1}_k\big)+\frac{3\rho\eta L}{N_k}\Big(\frac74\Big)^{2(L-1)}\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2}\,\Phi_k\big(\theta^{rt+v-1}_k\big),\qquad(107)$$
where the last inequality comes from (31) in Lemma 4. Recalling $C' := \max_k\Big(\frac{1}{N_k}\big(\frac74\big)^{2(L-1)}\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2}\Big)$, we have
$$\Phi_k\big(\theta^{rt+v}_k\big)\le\Phi_k\big(\theta^{rt+v-1}_k\big)\,(1+3\rho C'\eta).$$
Now Step 1 is proved. Next we show Step 2. (1) Show
$$\sigma_{\max}\big(\bar W^{ru}_l\big)\le\frac32\bar\lambda_l,\ u\in\{0,1,\dots,t+1\},\ l\in[L];\qquad \sigma_{\min}\big(\bar W^{ru}_l\big)\ge\frac12\underline\lambda_l,\ u\in\{0,1,\dots,t+1\},\ l\in\{3,\dots,L\}.\qquad(108)$$
Let $\tilde\nabla_{W_l,k}\Phi_k(\theta^{rt}_k)$ denote client $k$'s stochastic gradient with respect to layer $l$, and denote $\tilde{\bar g}^{rt+v}_l := \sum_{k=1}^{K}\frac{N_k}{N}\,\tilde\nabla_{W_l,k}\Phi_k\big(\theta^{rt+v}_k\big)$. We have
$$\big\|\bar W^{r(t+1)}_l-W^0_l\big\|_F = \eta\,\Big\|\sum_{u=0}^{t}\big(\tilde{\bar g}^{ru}_l+\tilde{\bar g}^{ru+1}_l+\dots+\tilde{\bar g}^{ru+r-1}_l\big)\Big\|_2\qquad(110)$$
$$\le \eta\sum_{u=0}^{t}\sum_{v=0}^{r-1}\big\|\tilde{\bar g}^{ru+v}_l\big\|_2\qquad(111)$$
$$\le \eta\sum_{u=0}^{t}\sum_{v=0}^{r-1}\big\|\tilde{\bar g}^{ru+v}\big\|_2.\qquad(112)$$
By Step 1, we know that for $v\in\{0,1,\dots,r-1\}$ we have $\rho^{rt+v}\le\rho$, so by the definition of $\rho^{rt+v}$ we have $\|\tilde g^{ru+v}\|_2\le\rho\|g^{ru+v}\|_2$. It is then easy to verify that the assumptions of Lemma 8 are satisfied with $\tilde\Lambda_l=\frac74\bar\lambda_l$, $Q_k$ defined in (63), and $A=1+3\rho C'\eta$.
Then, by Lemma 8, if $\eta<\frac{1}{\rho C'}$, we have
$$\eta\sum_{u=0}^{t}\sum_{v=0}^{r-1}\big\|\tilde{\bar g}^{ru+v}\big\|_2 \le \eta\sum_{u=0}^{t}\frac{\rho L\|X\|_F}{N}\Big(\frac74\Big)^{L-1}(2^r-1)\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^{ru}_L-y\big\|_2.\qquad(113)$$
Using the definition $P = \frac{L\|X\|_F}{N}\big(\frac74\big)^{L-1}(2^r-1)$, we have
$$\eta\sum_{u=0}^{t}\sum_{v=0}^{r-1}\big\|\tilde{\bar g}^{ru+v}\big\|_2 \le \eta\rho P\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\sum_{u=0}^{t}\big\|f^{ru}_L-y\big\|_2\qquad(115)$$
$$\le \eta\rho P\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\sum_{u=0}^{t}(1+3\rho C\eta)^{\frac u2}\,\big\|f^0_L-y\big\|_2,\qquad(116)$$
where the last inequality comes from the induction assumption. Now let $S=\sqrt{1+3\rho C\eta}$. If we choose $\eta<\frac{1}{\rho C}$, we get
$$\eta\sum_{u=0}^{t}\sum_{v=0}^{r-1}\big\|\tilde{\bar g}^{ru+v}\big\|_2 \le \eta\rho P\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\sum_{u=0}^{t}S^u\,\big\|f^0_L-y\big\|_2 \le \eta\rho P\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\frac{S^{T+1}}{S^2-1}(S+1)\,\big\|f^0_L-y\big\|_2$$
$$\le \eta\rho P\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,S^{T+1}\cdot\frac{3}{3\rho C\eta}\,\big\|f^0_L-y\big\|_2.$$
By Lemma 6, $S^T\le\frac{\xi\Phi(\theta^0)}{\epsilon}$; additionally, $S\le2$. Therefore,
$$\eta\sum_{u=0}^{t}\sum_{v=0}^{r-1}\big\|\tilde{\bar g}^{ru+v}\big\|_2 \le \eta\rho P\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot2S^{T}\cdot\frac{3}{3\rho C\eta}\,\big\|f^0_L-y\big\|_2\qquad(119)$$
$$\le \frac{P}{C}\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot\frac{2\xi\Phi(\theta^0)}{\epsilon}\,\big\|f^0_L-y\big\|_2 = 2L\|X\|_F\Big(\frac32\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}\,\big\|f^0_L-y\big\|_2\qquad(120)$$
$$\le \begin{cases}\frac12\underline\lambda_l, & l\in\{3,\dots,L\},\\ 1, & l\in\{1,2\},\end{cases}$$
where the last inequality follows from (68). So, by Weyl's inequality, we have
$$\begin{cases}
\sigma_{\min}\big(\bar W^{r(t+1)}_l\big)\ge\sigma_{\min}\big(W^0_l\big)-\frac12\underline\lambda_l\ge\frac12\underline\lambda_l, & l\in\{3,\dots,L\},\\
\sigma_{\max}\big(\bar W^{r(t+1)}_l\big)\le\sigma_{\max}\big(W^0_l\big)+\frac12\bar\lambda_l\le\frac32\bar\lambda_l, & l\in\{3,\dots,L\},\\
\sigma_{\max}\big(\bar W^{r(t+1)}_1\big)\le1+\sigma_{\max}\big(W^0_1\big)\le\frac32\bar\lambda_1,\\
\sigma_{\max}\big(\bar W^{r(t+1)}_2\big)\le1+\sigma_{\max}\big(W^0_2\big)\le\frac32\bar\lambda_2.
\end{cases}\qquad(121)$$
(2) Show $\sigma_{\min}\big(F^{ru}_1\big)\ge\frac12\alpha_0$, $u\in\{0,\dots,t+1\}$. Similarly, we have
$$\big\|F^{r(t+1)}_1-F^0_1\big\|_F = \big\|\sigma\big(X\bar W^{r(t+1)}_1\big)-\sigma\big(XW^0_1\big)\big\|_F\qquad(123)$$
$$\overset{(i)}{\le} \sigma_{\max}(X)\,\big\|\bar W^{r(t+1)}_1-W^0_1\big\|_F\qquad(124)$$
$$\overset{(ii)}{\le} \sigma_{\max}(X)\cdot2L\|X\|_F\Big(\frac32\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}\,\big\|f^0_L-y\big\|_2\qquad(125)$$
$$\le \|X\|_F\cdot2L\|X\|_F\Big(\frac32\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}\,\big\|f^0_L-y\big\|_2\qquad(126)$$
$$\overset{(iii)}{\le} \frac12\alpha_0,\qquad(127)$$
where (i) holds because $\sigma$ is 1-Lipschitz; (ii) comes from (120); (iii) follows from (69) in B.2.
So, similarly by Weyl's inequality, we have
$$\sigma_{\min}\big(F^{r(t+1)}_1\big) \ge \sigma_{\min}\big(F^0_1\big)-\frac12\alpha_0 = \alpha_0-\frac12\alpha_0 = \frac12\alpha_0.\qquad(128)$$
(3) Show $\sigma_{\min}\big(F^{ru}_{1,k}\big)\ge\frac12\alpha_{0,k}$, $u\in\{0,\dots,t+1\}$, $k\in[K]$. Similarly, we have
$$\big\|F^{r(t+1)}_{1,k}-F^0_{1,k}\big\|_F = \big\|\sigma\big(X_k\bar W^{r(t+1)}_1\big)-\sigma\big(X_kW^0_1\big)\big\|_F\qquad(129)$$
$$\overset{(i)}{\le} \sigma_{\max}(X_k)\,\big\|\bar W^{r(t+1)}_1-W^0_1\big\|_F\qquad(130)$$
$$\overset{(ii)}{\le} \sigma_{\max}(X_k)\cdot2L\|X\|_F\Big(\frac32\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}\,\big\|f^0_L-y\big\|_2\qquad(131)$$
$$\le \|X_k\|_F\cdot2L\|X\|_F\Big(\frac32\Big)^{L-1}\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\cdot\frac{\xi\Phi(\theta^0)}{\epsilon}\,\big\|f^0_L-y\big\|_2\qquad(132)$$
$$\overset{(iii)}{\le} \frac12\alpha_{0,k},\qquad(133)$$
where (i) holds because $\sigma$ is 1-Lipschitz; (ii) comes from (120); (iii) follows from (69) in B.2.

First, similar to the proof of (50), we can derive
$$Q = \frac{L\sqrt L}{N}\Big(\frac32\Big)^{2(L-1)}\|X\|_F^2\,\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2} + \frac{L\sqrt L}{N}\,\|X\|_F\,(1+L\beta\|X\|_F R)\,R\,\big\|f^0_L-y\big\|_2,\qquad(136)$$
where $R = \prod_{p=1}^{L}\max\big\{1,\frac32\bar\lambda_p\big\}$. Applying Lemma 5 with this $Q$ and using (49), we obtain
$$\Phi\big(\bar\theta^{r(t+1)}\big) \le \Phi\big(\bar\theta^{rt}\big)+\eta\rho P\,\frac{L\|X\|_F}{N}\Big(\frac32\Big)^{L-1}\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2}\,\big\|f^{rt}_L-y\big\|_2^2+\frac Q2\,\eta^2\rho^2P^2\,\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2}\,\big\|f^{rt}_L-y\big\|_2^2,$$
which yields $\Phi\big(\bar\theta^{r(t+1)}\big)\le(1+3\rho C\eta)\,\Phi\big(\bar\theta^{rt}\big)$ for small enough $\eta$. Next, by (51) in Lemma 8, the drift term satisfies
$$\Big\|\sum_{v=1}^{r-1}\big(\bar g^{rt+v}-\bar g^{rt}\big)\Big\|_2 \le \frac{\eta\rho L(2^r-1)\|X\|_F}{N}\Big(\frac74\Big)^{L-1}\sqrt{\sum_{k=1}^{K}Q_k^2\|X_k\|_F^2}\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^{rt}_L-y\big\|_2.\qquad(148)$$
Plugging (148) into (147), we get
$$\mathbb E\big[\Phi\big(\bar\theta^{r(t+1)}\big)\big] \le \mathbb E\Big[\Phi\big(\bar\theta^{rt}\big)-\eta r\|g^{rt}\|_2^2+\eta\|g^{rt}\|_2\cdot\frac{\eta\rho L(2^r-1)\|X\|_F}{N}\Big(\frac74\Big)^{L-1}\sqrt{\sum_kQ_k^2\|X_k\|_F^2}\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\|f^{rt}_L-y\|_2+\frac Q2\eta^2\rho^2P^2\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2}\|f^{rt}_L-y\|_2^2\Big]$$
$$\overset{(i)}{\le} \mathbb E\Big[\Phi\big(\bar\theta^{rt}\big)-\eta r\|g^{rt}\|_2^2+\eta^2\Big(\frac{L^2\rho(2^r-1)\|X\|_F}{N^2}\Big(\frac{21}8\Big)^{L-1}\sqrt{\sum_kQ_k^2\|X_k\|_F^2}+\frac Q2\rho^2P^2\Big)\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2}\,\|f^{rt}_L-y\|_2^2\Big]$$
$$\overset{(ii)}{\le} \mathbb E\big[\Phi\big(\bar\theta^{rt}\big)\big]-\eta\underbrace{\frac rN\,\gamma^{2(L-2)}\Big(\frac12\Big)^{2(L-1)}\underline\lambda^2_{3\to L}\,\alpha_0^2}_{=:\,\mu'}\,\mathbb E\|f^{rt}_L-y\|_2^2+\eta^2\underbrace{\Big(\frac{L^2\rho(2^r-1)\|X\|_F\big(\frac{21}8\big)^{L-1}\sqrt{\sum_kQ_k^2\|X_k\|_F^2}}{N^2}+\frac Q2\rho^2P^2\Big)\frac{\bar\lambda^2_{1\to L}}{\min_{l\in[L]}\bar\lambda_l^2}}_{=:\,B}\,\mathbb E\|f^{rt}_L-y\|_2^2,$$
where (i) uses (31) to upper bound $\|g^{rt}\|_2$ and (ii) uses (28) to lower bound $\|g^{rt}\|_2$. Let $\eta<\frac{\mu'}{2B}$; then
$$\mathbb E\big[\Phi\big(\bar\theta^{r(t+1)}\big)\big] \le \mathbb E\big[\Phi\big(\bar\theta^{rt}\big)\big]-\eta\mu'\,\mathbb E\|f^{rt}_L-y\|_2^2+\eta^2B\,\mathbb E\|f^{rt}_L-y\|_2^2\qquad(149)$$
$$\le \mathbb E\big[\Phi\big(\bar\theta^{rt}\big)\big]\,(1-\eta\mu')\qquad(150)$$
$$= \mathbb E\big[\Phi\big(\bar\theta^{rt}\big)\big]\Big(1-\eta\,\frac rN\,\gamma^{2(L-2)}\Big(\frac12\Big)^{2(L-1)}\underline\lambda^2_{3\to L}\,\alpha_0^2\Big)\qquad(151)$$



A model is generally referred to as overparameterized if the number of (trainable) parameters exceeds the number of training samples N.

NUMERICAL EXPERIMENTS

In this section, we analyze the effect of increasing the network size on popular image classification tasks with the MNIST, Fashion MNIST, and CIFAR-10 data sets. We compare the performance of FedAvg on networks of different sizes.
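To make the definition concrete with the MLP widths used in the experiments of Section C.1 (hidden width 32 vs. 1,000 on MNIST, where N = 60,000), one can simply count weights and biases; the inclusion of bias terms is our assumption here:

```python
def mlp_param_count(d_in, hidden, d_out):
    """Trainable parameters of a one-hidden-layer MLP (weights + biases)."""
    return (d_in * hidden + hidden) + (hidden * d_out + d_out)

# MNIST: 784-dimensional inputs, 10 classes, N = 60,000 training samples.
small = mlp_param_count(784, 32, 10)    # narrow network: 25,450 parameters
large = mlp_param_count(784, 1000, 10)  # wide network: 795,010 parameters
```

By this count, the wide network (795,010 > 60,000) is overparameterized, while the narrow one (25,450 < 60,000) is not.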



The FedAvg-SGD Algorithm

Initialize: parameters $\theta^0_k=\theta^0$, step size $\eta$, number of communication rounds $T$, number of local updates $r$.
for $t=0,1,\dots,T-1$ do
  for each client $k\in\{1,\dots,K\}$ do
    Set $\theta^{rt}_k=\bar\theta^{rt}$
    for $v=0,1,\dots,r-1$ do
      Sample a mini-batch of size $m$ and update $\theta^{rt+v+1}_k=\theta^{rt+v}_k-\eta\,\tilde g^{rt+v}_k$

For each communication round $t\in\{0,1,\dots,T-1\}$ and local step $v\in\{0,1,\dots,r-1\}$, we define the vector $\theta^{rt+v}_k$ and the vectorized output of each hidden layer $l$, respectively, when the input to client $k$'s local network is the stochastic (mini-batch) sample.
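A minimal executable sketch of the loop above, assuming linear least-squares clients as a stand-in for the paper's multi-layer network (helper name and data layout are hypothetical):

```python
import numpy as np

def fedavg_sgd(Xs, ys, T=50, r=5, m=8, eta=0.1, seed=0):
    """Minimal FedAvg-SGD sketch on a linear least-squares model.

    Xs[k], ys[k]: client k's private data. Each round, every client starts
    from the current average theta, runs r local mini-batch SGD steps of
    size m, and the server averages the results weighted by N_k / N.
    """
    rng = np.random.default_rng(seed)
    K, d = len(Xs), Xs[0].shape[1]
    Ns = np.array([len(X) for X in Xs])
    theta = np.zeros(d)
    for _ in range(T):                       # communication rounds
        local_models = []
        for k in range(K):                   # each client
            th = theta.copy()                # start from the averaged model
            for _ in range(r):               # r local SGD steps
                idx = rng.choice(Ns[k], size=min(m, Ns[k]), replace=False)
                Xb, yb = Xs[k][idx], ys[k][idx]
                grad = Xb.T @ (Xb @ th - yb) / len(idx)  # mini-batch gradient
                th -= eta * grad
            local_models.append(th)
        theta = np.average(local_models, axis=0, weights=Ns / Ns.sum())
    return theta
```

With noiseless data generated from a single ground-truth model, the averaged iterate converges to that model even though each client takes multiple local steps between synchronizations.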

Figure 1: CIFAR-10 with CNN: FedAvg-SGD on large and small size CNN.

Figure 2: CIFAR-10 with ResNet: Comparison of FedAvg-SGD on ResNet18 and ResNet50.

Figure 3: Test accuracy for MNIST (left) and Fashion MNIST (right) datasets. We compare the performance of (standard) random initialization and the proposed initialization strategy for both iid and non-iid settings. Legends 'iid_ini' and 'noniid_ini' denote the proposed initialization strategy.

Data set: For the MNIST, Fashion MNIST and CIFAR-10 data sets, we split the data among K = 100 clients. For the homogeneous (i.i.d.) setting, we randomly distribute the complete data set of 60,000 samples across the clients. To model the heterogeneous (non-i.i.d.) setting, we split the clients into two sets: one set of clients receives randomly drawn samples, while the second set receives data from only two out of ten labels McMahan et al. (2017). For our experiments on MNIST and Fashion MNIST, 70% of the clients receive non-i.i.d. samples; for CIFAR-10, the fraction is 20%.

Results and Discussion: For each setting, we compare the training loss and test accuracy of FedAvg on smaller and larger networks. To analyze the effect of network size on the stability of FedAvg, we also plot the performance of FedAvg averaged over 10 runs in the non-i.i.d. client data setting for all network architectures. From our experiments, we make a few observations. First, we observe from Figures 1 and 2 that in all cases the i.i.d. setting has more stable performance (lower variance) than the non-i.i.d. setting. Second, the larger network uniformly outperforms the smaller network under all settings. Third, the box plots in Figures 1 and 2 show that the larger networks have lower variance, and hence more stable performance, than the smaller networks. Finally, we compare random initialization with the special initialization strategy satisfying (23) and (24); Figure 3 shows that the two initializations achieve similar test performance.


So, similarly by Weyl's inequality, we have $\sigma_{\min}\big(F^{r(t+1)}_{1,k}\big)\ge\frac12\alpha_{0,k}$. (4) Next, we show $\Phi\big(\bar\theta^{ru}\big)\le(1+3\rho C\eta)^u\,\Phi(\theta^0)$, $u\in\{0,\dots,t+1\}$.

such that for all $\bar\theta^{rt,s}=\bar\theta^{rt}+s\big(\bar\theta^{r(t+1)}-\bar\theta^{rt}\big)$, $s\in[0,1]$, we have
$$\big\|g^{rt,s}-g^{rt}\big\|_2 \le Q\,\big\|\bar\theta^{rt,s}-\bar\theta^{rt}\big\|_2.\qquad(138)$$
Note that the aggregated update can be written as $\bar\theta^{r(t+1)} = \bar\theta^{rt}-\eta\tilde{\bar g}^{rt}-\dots-\eta\tilde{\bar g}^{rt+r-1}$.

(5) Show $\mathbb E\big[\Phi\big(\bar\theta^{ru}\big)\big]\le(1-\mu C\eta)^u\,\Phi(\theta^0)$, $u\in\{0,1,\dots,t+1\}$. By (138), we have
$$\Phi\big(\bar\theta^{r(t+1)}\big) \le \Phi\big(\bar\theta^{rt}\big)-\eta\big\langle g^{rt},\,\tilde{\bar g}^{rt}+\dots+\tilde{\bar g}^{rt+r-1}\big\rangle+\frac Q2\eta^2\big\|\tilde{\bar g}^{rt}+\dots+\tilde{\bar g}^{rt+r-1}\big\|_2^2.\qquad(146)$$
Given $\bar\theta^{rt}$, taking expectation over the stochastic gradients on both sides conditioned on $\bar\theta^{rt}$ and the past, we get
$$\mathbb E\big[\Phi\big(\bar\theta^{r(t+1)}\big)\big] \le \mathbb E\Big[\Phi\big(\bar\theta^{rt}\big)-\eta\big\langle g^{rt},\,\bar g^{rt}+\dots+\bar g^{rt+r-1}\big\rangle+\frac Q2\eta^2\big\|\tilde{\bar g}^{rt}+\dots+\tilde{\bar g}^{rt+r-1}\big\|_2^2\Big]$$
$$= \mathbb E\Big[\Phi\big(\bar\theta^{rt}\big)-\eta\big\langle g^{rt},\,r\bar g^{rt}\big\rangle-\eta\Big\langle g^{rt},\sum_{v=1}^{r-1}\big(\bar g^{rt+v}-\bar g^{rt}\big)\Big\rangle+\frac Q2\eta^2\big\|\tilde{\bar g}^{rt}+\dots+\tilde{\bar g}^{rt+r-1}\big\|_2^2\Big]$$
$$\le \mathbb E\Big[\Phi\big(\bar\theta^{rt}\big)-\eta r\|g^{rt}\|_2^2+\eta\|g^{rt}\|_2\Big\|\sum_{v=1}^{r-1}\big(\bar g^{rt+v}-\bar g^{rt}\big)\Big\|_2+\frac Q2\eta^2\big\|\tilde{\bar g}^{rt}+\dots+\tilde{\bar g}^{rt+r-1}\big\|_2^2\Big],\qquad(147)$$
where we used that $\bar g^{rt}=g^{rt}$, since every client starts the round at $\bar\theta^{rt}$. It is easy to verify that the assumptions of Lemma 8 are satisfied with $\tilde\Lambda_l=\frac74\bar\lambda_l$, $Q$ defined in (136), and $A=1+3\rho C'\eta\le2$; then the drift term can be bounded by (51) in Lemma 8.

(26) gives the expression of the vectorized gradient; (27) provides the vectorized Jacobian matrix of the network output; (28) gives a lower bound on the norm of the gradient, which holds under Assumption 2; (29) provides an upper bound on the norm of the output of each layer, while (30) gives an upper bound on the norm of the gradient of each layer; (32) derives the Lipschitz constant of the network, and (33) provides the Lipschitz constant for the Jacobian of each layer. Similar results can be derived for the centralized optimization problem, so we do not include them here.

By (49) in Lemma 8, if $\eta<\frac{1}{\rho C'}$, we have $\sqrt A\le2$, and hence
$$\big\|\tilde{\bar g}^{rt}+\dots+\tilde{\bar g}^{rt+r-1}\big\|_2 \le \rho P\,\frac{\bar\lambda_{1\to L}}{\min_{l\in[L]}\bar\lambda_l}\,\big\|f^{rt}_L-y\big\|_2.$$


where $\mu C=\mu'$. Now we summarize the choice of $\eta$: it should be smaller than all of the quantities appearing in the proof, namely the bounds in (88) and (93) together with $\frac{1}{\rho C'}$, $\frac{1}{Q_k\rho}$, $\frac{1}{\rho C}$, and $\frac{\mu'}{2B}$. (153)

Remark 8. To satisfy the initialization assumptions defined in (68) and (69), we initialize the neural network coefficients as described in B.2. Note that this follows from the fact that $\eta$ is smaller than each quantity in (153) above. Thus we have $\mu'\eta=O(1)$, and we can always choose $\eta=c/\mu'$ for some $c\in(0,1)$, which guarantees linear convergence of the objective in each communication round (see Theorem 1).

C EXPERIMENT SETTINGS AND RESULTS

C.1 MODEL AND PARAMETER SETTINGS

To analyze the performance of FedAvg-SGD on the MNIST data set, we use a single-hidden-layer fully-connected neural network (MLP) with ReLU activation. We set the hidden-layer size to 32 (resp. 1,000) for the small (resp. large) network, and choose the mini-batch size $m=10$ and the number of local steps $r=10$. Using this network, we also compare random initialization with the special initialization strategy in (23), (24) on the MNIST and Fashion MNIST data sets. For the CIFAR-10 data set, we analyze the performance of FedAvg-SGD on two network architectures: a convolutional neural network (CNN) and a ResNet. The smaller CNN uses two $5\times5$ convolutional layers with 6 and 16 channels, each followed by $2\times2$ max pooling, connected to two fully-connected layers with 120 and 84 hidden neurons. The larger CNN uses three $3\times3$ convolutional layers, each with 128 channels, followed by $2\times2$ max pooling. The ReLU activation function is used after each hidden layer for both the small and large CNN. For ResNet, we compare the ResNet18 and ResNet50 architectures. For both the CNN and ResNet experiments, we use a mini-batch size of $m=32$ and $r=5$ local steps, and we randomly sample 10 clients in each communication round for more efficient training.
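As a rough sense of the size gap between the two CNNs, one can count their convolutional parameters with a small helper (the fully-connected layer sizes depend on padding and input resolution, which the text does not fully specify, so they are omitted here; bias terms are an assumption):

```python
def conv2d_params(in_ch, out_ch, k):
    """Parameters of a k x k convolution layer, including bias terms."""
    return (k * k * in_ch + 1) * out_ch

# Smaller CNN: two 5x5 conv layers with 6 and 16 channels (RGB input).
small_conv = conv2d_params(3, 6, 5) + conv2d_params(6, 16, 5)
# Larger CNN: three 3x3 conv layers, 128 channels each.
large_conv = (conv2d_params(3, 128, 3) + conv2d_params(128, 128, 3)
              + conv2d_params(128, 128, 3))
```

Even counting only convolutions, the larger CNN has roughly two orders of magnitude more parameters than the smaller one.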

