WOULD DECENTRALIZATION HURT GENERALIZATION?

Abstract

Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices without the control of a central server. Existing theory suggests that decentralization degrades generalizability, which conflicts with experimental results showing that, in large-batch settings, D-SGD generalizes better than centralized SGD (C-SGD). This work presents a new theory that reconciles the two perspectives. We prove that D-SGD introduces an implicit regularization that simultaneously penalizes (1) the sharpness of the learned minima and (2) the consensus distance between the global averaged model and the local models. We then prove that this implicit regularization is amplified in large-batch settings when the linear scaling rule is applied. We further analyze the escaping efficiency of D-SGD and show that D-SGD favors super-quadratic flat minima. Experiments are in full agreement with our theory. The code will be released publicly. To the best of our knowledge, this is the first work on the implicit regularization and escaping efficiency of D-SGD.

1. INTRODUCTION

Decentralized stochastic gradient descent (D-SGD) enables simultaneous model training on massive workers without the control of a central server, where every worker communicates only with its directly connected neighbors (Xiao & Boyd, 2004; Lopes & Sayed, 2008; Nedic & Ozdaglar, 2009; Lian et al., 2017; Koloskova et al., 2020). This decentralization avoids the need for a costly central server with heavy communication and computation burdens. Despite the absence of a central server, existing theoretical results demonstrate that the massive models on the edge converge to a unique steady consensus model (Shi et al., 2015; Lian et al., 2017; Lu et al., 2011), with asymptotic linear speedup in convergence rate (Lian et al., 2017), as distributed centralized SGD (C-SGD) does (Dean et al., 2012; Li et al., 2014). Consequently, D-SGD offers a promising distributed learning solution with significant advantages in privacy (Nedic, 2020), scalability (Lian et al., 2017), and communication efficiency (Ying et al., 2021b). However, existing theoretical studies show that the decentralized nature of D-SGD introduces an additional positive term into the generalization error bounds, which suggests that decentralization may hurt generalization (Sun et al., 2021; Zhu et al., 2022). This sharply conflicts with the empirical results of Zhang et al. (2021), which show that D-SGD generalizes better than C-SGD by a large margin in large-batch settings; see Figure 1. This conflict signals that major characteristics of D-SGD have been overlooked in the existing literature. Therefore, would decentralization hurt generalization? This work reconciles the conflict. We prove that decentralization introduces an implicit regularization in D-SGD, which promotes generalization. To the best of our knowledge, this is the first paper to show the generalization advantages of D-SGD, redressing the former misunderstanding.
Specifically, our contributions are twofold. • We prove that the mean iterate of D-SGD closely follows the path of C-SGD on a regularized loss, which is the sum of the original loss and a regularization term introduced by decentralization. This regularization term penalizes the largest eigenvalue of the Hessian matrix, as well as the consensus distance (see Theorem 1). These regularization effects are shown to be considerably amplified in large-batch settings (see Theorem 2), which is consistent with our visualization (see Figure 4) and the empirical results in (Zhang et al., 2021). To prove the above results, we apply a second-order multivariate Taylor approximation (Königsberger, 2013) to the gradient diversity (see Equation (5)) to derive the regularized loss. Then, we prove that the regularization term contained in the regularized loss scales positively with the largest Hessian eigenvalue, which suggests that D-SGD implicitly minimizes the sharpness of the learned minima (see Lemma C.2). • We prove the first result on the expected escaping speed of D-SGD from local minima (see Theorem 3). Our results show that D-SGD prefers super-quadratic flat minima to sub-quadratic minima with higher probability (see Proposition 4). The proof is based on the construction of a stochastic differential equation (SDE) approximation (Jastrzebski et al., 2017; M et al., 2017; Li et al., 2021) of D-SGD.

2. RELATED WORK

Flatness and generalization. The flatness of a minimum is a commonly used concept in the optimization and machine learning literature and has long been regarded as a proxy for generalization (Hochreiter & Schmidhuber, 1997; Izmailov et al., 2018; Jiang et al., 2020). Intuitively, the loss around a flat minimum varies slowly in a large neighborhood, while the loss around a sharp minimum increases rapidly in a small neighborhood (Hochreiter & Schmidhuber, 1997). Through the lens of minimum description length theory (Rissanen, 1983), flat minimizers tend to generalize better than sharp minimizers, since they are specified with lower precision (Keskar et al., 2017). From a Bayesian perspective, sharp minimizers have posterior distributions highly concentrated around them, indicating that they are more specialized to the training set and thus less robust to data perturbations than flat minimizers (MacKay, 1992; Chaudhari et al., 2019).

Generalization of large-batch training. Large-batch training is of significant interest for deep learning deployment, as it can contribute to a significant speed-up in training neural networks (Goyal et al., 2017; You et al., 2018; Shallue et al., 2019). Unfortunately, it is widely observed that in the centralized learning setting, large-batch training often suffers from a drastic generalization degradation, even with fine-tuned hyper-parameters, from both empirical (Chen & Huo, 2016; Keskar et al., 2017; Hoffer et al., 2017; Shallue et al., 2019; Smith et al., 2020) and theoretical (Li et al., 2021) perspectives. One explanation of this phenomenon is that large-batch training leads to "sharper" minima (Keskar et al., 2017), which are more sensitive to perturbations (Hochreiter & Schmidhuber, 1997).

Development of D-SGD. The earliest work on classical decentralized optimization can be traced back to Tsitsiklis (1984), Tsitsiklis et al. (1986) and Nedic & Ozdaglar (2009).
D-SGD, a typical decentralized optimization algorithm, has been extended to various settings in deep learning, including time-varying topologies (Lu & Wu, 2020; Koloskova et al., 2020), asynchronous settings (Lian et al., 2018; Xu et al., 2021; Nadiradze et al., 2021), directed topologies (Assran et al., 2019; Taheri et al., 2020), and data-heterogeneous scenarios (Tang et al., 2018; Vogels et al., 2021). Zhang et al. (2021) demonstrate that D-SGD introduces an "additional" landscape-dependent noise, which improves the convergence of D-SGD. However, the direction, magnitude, and shape of the noise remain unexplored. In contrast, we rigorously prove that the additional noise of D-SGD (i.e., the gradient diversity in Equation (4)) biases the trajectory of D-SGD towards flatter minima, which may play a distinct role in shaping the generalizability of D-SGD.

3. PRELIMINARIES

Suppose that X ⊆ R^{d_x} and Y ⊆ R are the input and output spaces, respectively. We denote the training set as µ = {z_1, ..., z_N}, drawn from the data distribution D, and define the empirical and population risks as

L_µ(w) = (1/N) Σ_{ζ=1}^N L(w; z_ζ),  L(w) = E_{z∼D}[L(w; z)].

Distributed learning. Distributed learning jointly trains a learning model w on multiple workers (Shamir & Srebro, 2014). In this framework, the j-th worker (j = 1, ..., m) can access |µ_j| independent and identically distributed (i.i.d.) training examples µ_j = {z_{j,1}, ..., z_{j,|µ_j|}}, drawn from the data distribution D. In this case, the global empirical risk of w is L_µ(w) = (1/m) Σ_{j=1}^m L_{µ_j}(w), where L_{µ_j}(w) = (1/|µ_j|) Σ_{ζ=1}^{|µ_j|} L(w; z_{j,ζ}) denotes the local empirical risk on the j-th worker.

Distributed centralized stochastic gradient descent (C-SGD). In C-SGD, there is only one centralized model w_a(t). C-SGD (Dean et al., 2012; Li et al., 2014) updates the model by

w_a(t+1) = w_a(t) - (η/m) Σ_{j=1}^m ∇L_{µ_j(t)}(w_a(t))  [local gradient computation],   (1)

where η denotes the learning rate and µ_j(t) = {z_{j,1}, ..., z_{j,|µ_j(t)|}} denotes the mini-batch sampled by the j-th worker at iteration t. In the next section, we will show that C-SGD equals single-worker SGD with a larger batch size.

Decentralized stochastic gradient descent (D-SGD). The goal of D-SGD is to learn a consensus model w_a(t) = (1/m) Σ_{j=1}^m w_j(t) on m workers, where w_j(t) stands for the d-dimensional local model on the j-th worker. We denote P = [P_{j,k}] ∈ R^{m×m} as a doubly stochastic gossip matrix (see Definition A.1) that characterizes the underlying topology G. The vanilla Adapt-While-Communicate (AWC) version of mini-batch D-SGD (Nedic & Ozdaglar, 2009; Lian et al., 2017) updates the model on the j-th worker by

w_j(t+1) = Σ_{k=1}^m P_{j,k} w_k(t)  [communication]  - η ∇L_{µ_j(t)}(w_j(t))  [local gradient computation].   (2)

For a more detailed background on D-SGD, please refer to Appendix A.
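The two update rules can be sketched in a few lines of NumPy. The ring gossip matrix, the per-worker gradient oracle `grad(j, w)`, and the step sizes below are illustrative assumptions for exposition, not the paper's experimental configuration. Note that in Equation (2) the local gradient is evaluated at the pre-communication model w_j(t):

```python
import numpy as np

def ring_gossip_matrix(m):
    """Doubly stochastic gossip matrix for a ring topology:
    each worker averages itself with its two neighbours."""
    P = np.zeros((m, m))
    for j in range(m):
        P[j, j] = 1 / 3
        P[j, (j - 1) % m] = 1 / 3
        P[j, (j + 1) % m] = 1 / 3
    return P

def dsgd_step(W, P, grad, lr):
    """One D-SGD step (Equation (2)): gossip-average the local models,
    then subtract a local gradient step taken at the pre-mix model.
    W is an (m, d) array whose j-th row is the local model w_j(t)."""
    W_mix = P @ W                                        # communication
    grads = np.stack([grad(j, W[j]) for j in range(W.shape[0])])
    return W_mix - lr * grads                            # local computation

def csgd_step(w_a, grad, lr, m):
    """One C-SGD step (Equation (1)): a single centralized model
    updated with the average of the workers' gradients."""
    g = np.mean([grad(j, w_a) for j in range(m)], axis=0)
    return w_a - lr * g
```

Running `dsgd_step` repeatedly with a connected gossip matrix drives the rows of `W` towards their average, which is the consensus behavior the theory above assumes.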

4. THEORETICAL RESULTS

This section shows the implicit regularization effect and the escaping efficiency of D-SGD. We start by showing that D-SGD can be interpreted as C-SGD on a regularized loss. Then we prove that the regularization term in the new loss scales positively with the largest Hessian eigenvalue (see Theorem 1), which suggests that D-SGD implicitly minimizes the sharpness. Next, we prove that the regularization effect will increase with the total batch size if we apply the linear scaling rule (see Theorem 2), which justifies the superiority of D-SGD in large-batch settings. Finally, we prove the escaping efficiency of D-SGD beyond the quadratic assumption (see Theorem 3) and show that D-SGD favors super-quadratic minima (see Proposition 4).

4.1. D-SGD IS EQUIVALENT TO C-SGD ON A REGULARIZED LOSS

In this subsection, we theoretically compare D-SGD and C-SGD. We prove that D-SGD is equivalent to C-SGD on a regularized loss with an extra positive regularization term, as shown in the following theorem.

Theorem 1 (Implicit regularization of D-SGD). Given that the loss L is continuous and has fourth-order partial derivatives, denote the weight diversity matrix as Ξ(t) = (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))(w_j(t) - w_a(t))^T, its diagonal part as Ξ*(t), and the d-dimensional all-ones vector as 1. With probability greater than 1 - O(η), the mean iterate of D-SGD satisfies

E_{µ_j(t)∼D, j=1,...,m}[w_a(t+1)] = w_a(t) - η ∇[ L(w_a(t)) + (1/2) Tr(H(w_a(t)) Ξ*(t)) ] + O(η² 1) + O(η ‖w_j(t) - w_a(t)‖₂³ 1),   (3)

where L(w_a(t)) + (1/2) Tr(H(w_a(t)) Ξ*(t)) is the regularized loss. Under the mild assumptions in Lemma C.2, D-SGD implicitly regularizes

reg({w_j(t)}_{j=1,...,m}) = λ_{H(w_a(t)),1} [maximum Hessian eigenvalue] · Tr(Ξ(t)) [consensus distance].

The first term λ_{H(w_a(t)),1} is commonly regarded as a sharpness measure (Jastrzebski et al., 2017; Wen et al., 2020). It is related to the (C_ϵ, A)-sharpness (i.e., max_{w′∈C_ϵ} L(w + Aw′) - L(w)) in Keskar et al. (2017) and is an equivalent measure to the Sharpness-Aware Minimization (SAM) loss proposed by Foret et al. (2021) at a local minimum (Zhuang et al., 2022). Theorem 1 shows that decentralization navigates D-SGD towards flatter directions in order to lower the regularization term λ_{H(w_a(t)),1}. The second term, the trace of Ξ(t), equals the consensus distance, a key quantity measuring the overall effect of decentralized learning (Kong et al., 2021):

consensus distance = (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))^T (w_j(t) - w_a(t)).

Consequently, Theorem 1 also suggests that D-SGD implicitly controls the discrepancy between the global averaged model w_a(t) and the local models w_j(t) (j = 1, ..., m) during training.
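For intuition, the regularizer in Theorem 1 can be evaluated directly when the local models and the Hessian at the averaged model are available in explicit form. The toy sizes below are illustrative assumptions; at neural-network scale, λ_{H,1} would instead be estimated, e.g., by power iteration on Hessian-vector products:

```python
import numpy as np

def consensus_distance(W):
    """Tr(Xi(t)) = (1/m) sum_j ||w_j - w_a||^2 for local models W (m, d)."""
    w_a = W.mean(axis=0)
    return float(np.mean(np.sum((W - w_a) ** 2, axis=1)))

def implicit_regularizer(H, W):
    """reg = lambda_max(H) * Tr(Xi): the top Hessian eigenvalue at the
    averaged model times the consensus distance (Theorem 1)."""
    lam_max = np.linalg.eigvalsh(H)[-1]   # eigvalsh returns ascending order
    return lam_max * consensus_distance(W)
```

For example, with H = diag(3, 1) and two workers at w_1 = (1, 0), w_2 = (-1, 0), the consensus distance is 1 and the regularizer equals 3, so sharpening H or spreading the workers both increase the penalty.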
Our derived implicit regularization on the sharpness of learned minima is similar to how label noise (Blanc et al., 2020; Damian et al., 2021) and artificial noise (Orvieto et al., 2022) smooth the loss function in centralized gradient methods, including distributed centralized gradient methods (C-SGD) and single-worker gradient methods. To the best of our knowledge, this is the first work showing that D-SGD is equivalent to C-SGD on a regularized loss with implicit sharpness regularization. In the existing literature, initial efforts have viewed D-SGD as C-SGD in a higher-dimensional space that penalizes the weight seminorm ‖W‖²_{I-P}, where W = [w_1, ..., w_m]^T ∈ R^{m×d} stands for all local models across the network (Yuan et al., 2021; Gurbuzbalaban et al., 2022). We summarize the proof sketch below. The full proof is given in Appendix C.
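A minimal sketch of the stacked-space view mentioned above: the seminorm ‖W‖²_{I-P} = Tr(W^T (I - P) W) vanishes exactly at consensus for a connected, symmetric, doubly stochastic P. The fully connected three-worker gossip matrix used in the example is an illustrative assumption:

```python
import numpy as np

def seminorm_I_minus_P(W, P):
    """||W||^2_{I-P} = Tr(W^T (I - P) W): the seminorm penalized when
    D-SGD is viewed as C-SGD in the stacked (m*d)-dimensional space.
    It is zero iff all rows of W are equal (exact consensus), for a
    connected, symmetric, doubly stochastic P."""
    m = P.shape[0]
    return float(np.trace(W.T @ (np.eye(m) - P) @ W))

# Fully connected 3-worker gossip matrix (every entry 1/3).
P = np.full((3, 3), 1.0 / 3.0)
print(seminorm_I_minus_P(np.ones((3, 2)), P))                      # consensus: 0
print(seminorm_I_minus_P(np.array([[1.0, 0.0],
                                   [0.0, 1.0],
                                   [-1.0, -1.0]]), P))             # positive
```

Since I - P is positive semi-definite with null space spanned by the all-ones direction, the penalty measures exactly the deviation of the local models from their average, matching the consensus-distance interpretation above.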

Proof sketch.

(1) Deriving the dynamics of the global averaged model. We first rewrite the update of the global averaged model w_a(t) of D-SGD as

w_a(t+1) = w_a(t) - η [ ∇L(w_a(t)) [unbiased gradient] + (∇L(w_a(t)) - ∇L_{µ(t)}(w_a(t))) [gradient noise over the super-batch µ(t)] + (1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L_{µ_j(t)}(w_a(t))) [gradient diversity among workers] ].   (4)

Remark. The equality shows that decentralization introduces an additional noise term, which characterizes the gradient diversity between the global averaged model w_a(t) and the local models w_j(t) (j = 1, ..., m). It implies that distributed centralized SGD, whose gradient diversity is constantly zero, is equivalent to standard single-worker SGD with a larger batch size. Note that the gradient diversity also equals zero on a quadratic loss L (see Corollary C.1). Consequently, the quadratic approximation used in the analysis of mini-batch SGD (Zhu et al., 2019b; Ibayashi & Imaizumi, 2021; Liu et al., 2021) fails to capture how decentralization affects the training dynamics of D-SGD.

(2) Performing a Taylor expansion on the gradient diversity. Analyzing the effect of the gradient diversity on the training dynamics of D-SGD on general non-convex losses is highly non-trivial. Technically, we perform a second-order Taylor expansion of the gradient diversity around w_a(t), collecting the high-order terms in a residual R:

(1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L_{µ_j(t)}(w_a(t))) = (1/m) Σ_{j=1}^m H_{µ_j(t)}(w_a(t)) (w_j(t) - w_a(t)) + (1/(2m)) Σ_{j=1}^m T_{µ_j(t)}(w_a(t)) ⊗ [(w_j(t) - w_a(t))(w_j(t) - w_a(t))^T] + R.   (5)

Here H_{µ_j(t)}(w_a(t)) ≜ (1/|µ_j(t)|) Σ_{ζ(t)=1}^{|µ_j(t)|} H(w_a(t); z_{j,ζ(t)}) stands for the empirical Hessian at w_a(t), and T_{µ_j(t)}(w_a(t)) ≜ (1/|µ_j(t)|) Σ_{ζ(t)=1}^{|µ_j(t)|} T(w_a(t); z_{j,ζ(t)}) denotes the empirical third-order partial derivative tensor at w_a(t), where µ_j(t) and z_{j,ζ(t)} follow the notation in Equation (1).
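The remark on Equation (4) can be checked numerically. The sketch below uses toy one-dimensional full-batch losses (an illustrative assumption): the diversity term vanishes identically for a quadratic loss but not for a quartic one, which is why a quadratic approximation cannot capture the effect of decentralization:

```python
import numpy as np

def gradient_diversity(grad, W):
    """(1/m) sum_j [grad(w_j) - grad(w_a)]: the extra term decentralization
    adds to the averaged iterate (Equation (4)). Here every worker shares
    the same full-batch gradient oracle `grad`."""
    w_a = W.mean(axis=0)
    return np.mean([grad(w) - grad(w_a) for w in W], axis=0)

quad_grad = lambda w: 2.0 * w            # L(w) = ||w||^2   (quadratic)
quartic_grad = lambda w: 4.0 * w ** 3    # L(w) = sum(w^4)  (non-quadratic)

W = np.array([[1.0], [0.0]])             # two workers, averaged model at 0.5
print(gradient_diversity(quad_grad, W))     # → [0.]   (Corollary C.1)
print(gradient_diversity(quartic_grad, W))  # → [1.5]  (nonzero third-order effect)
```

The quartic value 1.5 is exactly the third-derivative term of the expansion in Equation (5): (1/2)·T·Ξ with T = 24·w_a = 12 and Ξ = 1/4, i.e., 12/8 = 1.5.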
Analogous to works investigating the SGD dynamics (M et al., 2017; Zhu et al., 2019b; Ziyin et al., 2022; Wu et al., 2022), we calculate the expectation and covariance of the gradient diversity. We calculate the expectation first and defer the analysis of the covariance to Subsection 4.3. Taking expectation over all local mini-batches µ_j(t) (j = 1, ..., m) provides

E_{µ_j(t)∼D, j=1,...,m} [ (1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L_{µ_j(t)}(w_a(t))) ] = H(w_a(t)) (1/m) Σ_{j=1}^m (w_j(t) - w_a(t)) [= 0] + (1/2) T(w_a(t)) ⊗ [ (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))(w_j(t) - w_a(t))^T ] + R.

The i-th entry of the above expression is

E_{µ_j(t)∼D, j=1,...,m} [ (1/m) Σ_{j=1}^m (∂_i L_{µ_j(t)}(w_j(t)) - ∂_i L_{µ_j(t)}(w_a(t))) ] = (1/2) Σ_{k,l} ∂³_{ikl} L(w_a(t)) (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))_k (w_j(t) - w_a(t))_l + O(‖w_j(t) - w_a(t)‖₂³),

where (w_j(t) - w_a(t))_k denotes the k-th entry of the vector w_j(t) - w_a(t), and the identity ∂³_{ikl} L = ∂_i ∂²_{kl} L follows from Clairaut's theorem (Rudin et al., 1976). Then we prove that, with probability greater than 1 - O(η), the iterate of D-SGD can be written as

E_{µ_j(t)∼D, j=1,...,m} [w_a(t+1)] = w_a(t) - η ∇[ L(w_a(t)) + (1/2) Tr(H(w_a(t)) Ξ*(t)) ] + O(η² 1) + O(η ‖w_j(t) - w_a(t)‖₂³ 1).

(3) Controlling the top Hessian eigenvalue with Tr(H(w_a(t)) Ξ*(t)). According to Lemma C.2, we obtain

0 ≤ Tr(H(w_a(t)) Ξ*(t)) ≤ λ_{H(w_a(t)),1} [sharpness] · Tr(Ξ(t)) [consensus distance] ≤ d₁ Tr(H(w_a(t)) Ξ*(t)),

where λ_{H(w_a(t)),1} denotes the largest eigenvalue of H(w_a(t)) and d₁ stands for the marginal contribution of λ_{H(w_a(t)),1} to the full spectrum of H(w_a(t)) (i.e., λ_{H(w_a(t)),1} = (d₁/d) Tr(H(w_a(t)))). Therefore, combined with Equation (3), we conclude that D-SGD also implicitly regularizes λ_{H(w_a(t)),1} · Tr(Ξ(t)).
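The first two inequalities of the chain in step (3) can be verified on a random instance. The sketch below checks 0 ≤ Tr(HΞ*) ≤ λ_{H,1}·Tr(Ξ) for a positive semi-definite Hessian; the third inequality involving d₁ relies on the eigenvalue assumption of Lemma C.2 and is not checked here. The dimensions and random seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 4

A = rng.standard_normal((d, d))
H = A @ A.T                               # random PSD "Hessian"
W = rng.standard_normal((m, d))           # random local models
w_a = W.mean(axis=0)

Xi = sum(np.outer(w - w_a, w - w_a) for w in W) / m   # weight diversity matrix
Xi_star = np.diag(np.diag(Xi))                        # its diagonal part

lhs = np.trace(H @ Xi_star)
mid = np.linalg.eigvalsh(H)[-1] * np.trace(Xi)        # lambda_1 * consensus distance

# 0 <= Tr(H Xi*) <= lambda_1(H) * Tr(Xi)  (first part of Lemma C.2)
assert 0 <= lhs <= mid + 1e-9
print(lhs, mid)
```

Both bounds follow elementwise: for PSD H, its diagonal entries are non-negative and bounded by λ_{H,1}, and Tr(HΞ*) only touches the diagonals of H and Ξ.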

4.2. AMPLIFIED REGULARIZATION OF D-SGD IN LARGE-BATCH SETTING

In practice, decentralization (and distribution in general) ordinarily implies an equivalently large total batch size, since a massive number of workers are involved in the system in many practical scenarios. Moreover, large-batch training can improve the utilization of super-computing facilities and further speed up the entire training process. Thus, studying the large-batch setting is of significant interest for fully understanding the application of D-SGD. Despite this importance, the theoretical understanding of the generalization of large-batch training in D-SGD remains an open problem. This subsection examines how the total batch size affects the sharpness reduction effect of D-SGD when the linear scaling rule, presented below, is applied.

Linear scaling rule (LSR). The linear scaling rule is a widely used hyper-parameter-free rule for deep learning (Krizhevsky, 2014; He et al., 2016a; Goyal et al., 2017; Bottou et al., 2018; Smith et al., 2020), which states that a fixed learning-rate-to-total-batch-size ratio allows maintaining generalization performance when the total batch size increases.

Theorem 2. Suppose that the averaged gradient norm satisfies (1/m) Σ_{j=1}^m ‖∇L(w_j(t))‖² ≤ (1 + (1-λ)/4) (1/m) Σ_{j=1}^m ‖∇L(w_j(t+1))‖², where 1-λ denotes the spectral gap (see Definition A.2). The sharpness regularization coefficient of D-SGD (i.e., Tr(Ξ(t))) at the t-th iteration is O(|µ(t)|² (1 + (1/m) Σ_{j=1}^m 1/|µ_j(t)|)), which increases with the total batch size |µ(t)| if we apply the linear scaling rule.

Theorem 2 states that the sharpness regularization effect of D-SGD is amplified in large-batch settings if we apply the linear scaling rule. It is worth noting that this amplified sharpness regularization requires no additional communication or computation, which verifies that significant advantages in generalizability surprisingly exist in large-batch D-SGD. The proof is included in Appendix C.
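A back-of-envelope sketch of the amplification, based on the order-of-magnitude consensus-distance bound of Lemma C.3. All constants, the gradient-norm value, and the 1/|µ| noise-variance model are illustrative assumptions; only the scaling with the total batch size matters:

```python
import numpy as np

def consensus_bound(eta, lam, grad_norm_sq, noise_var):
    """Order-of-magnitude bound on Tr(Xi(t)) from Lemma C.3, constants
    dropped: lam * eta^2 * (G^2/(1-lam)^2 + sigma^2/(1-lam))."""
    return lam * eta ** 2 * (grad_norm_sq / (1 - lam) ** 2
                             + noise_var / (1 - lam))

# Linear scaling rule: learning rate proportional to the total batch size.
base_eta, base_batch = 0.1, 128
lam, G2 = 0.9, 1.0
for batch in [128, 512, 2048]:
    eta = base_eta * batch / base_batch      # LSR
    sigma2 = 1.0 / batch                     # per-worker noise variance shrinks
    print(batch, consensus_bound(eta, lam, G2, sigma2))
```

The printed coefficient grows roughly quadratically in the total batch size, matching the O(|µ(t)|²(...)) rate of Theorem 2: the η² factor dominates the shrinking noise-variance term.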

4.3. ESCAPING EFFICIENCY OF D-SGD FROM LOCAL MINIMA

This subsection presents an analysis of the escaping efficiency of D-SGD, based on the construction of a stochastic differential equation (SDE) approximation (Jastrzebski et al., 2017; M et al., 2017; Li et al., 2021) of D-SGD. The analysis shows that D-SGD favors super-quadratic minima. To construct the SDE approximation of D-SGD, we combine Equation (3) and Equation (4) and write the iterate of D-SGD as

w_a(t+1) = w_a(t) - η ∇[ L(w_a(t)) + (1/2) Tr(H(w_a(t)) Ξ*(t)) ] + η ϵ_0(t) + O(η² 1) + O(η ‖w_j(t) - w_a(t)‖₂³ 1),   (6)

where ϵ_0(t) denotes the zero-mean noise in D-SGD. Applying Lemma C.4, Equation (6) can be viewed as the discretization of the following SDE:

dw_a(t) = -[ ∇L(w_a(t)) + (1/2) T(w_a(t)) ⊗ Ξ*(t) ] dt + [η Σ_D(t)]^{1/2} dW(t),

where ⊗ denotes the tensor product (see Appendix A.2), Σ_D(t) denotes the covariance matrix of the total noise ϵ_D(t) = (1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L(w_a(t))), and W(t) is a standard Brownian motion (Feynman, 1964) in R^d. We then utilize the SDE approximation to study the escaping efficiency of D-SGD, defined as follows.

Definition 1 (Escaping efficiency). Let w* denote a local minimum of the loss function L. We call E_{w_a(t)}[L(w_a(t)) - L(w*)] the escaping efficiency of the dynamics w_a(t) from w*, where E_{w_a(t)} denotes the expectation with respect to the distribution of w_a(t). Suppose that w_a(t) is stuck at a minimum w*. The escaping efficiency characterizes the probability that the dynamics escapes from w*, since Markov's inequality guarantees that for every δ > 0,

P( L(w_a(t+1)) - L(w*) ≥ δ ) ≤ E_{w_a(t)}[ L(w_a(t+1)) - L(w*) ] / δ.

We then have the following theorem on the escaping efficiency of D-SGD.

Theorem 3 (Escaping efficiency of D-SGD).
If the loss L is continuous and has fourth-order partial derivatives, the escaping efficiency of D-SGD from a minimum w* satisfies

E_{w_a(t)}[L(w_a(t)) - L(w*)] = -∫₀ᵗ E_{w_a(s)}[ ∇L(w_a(s))^T ∇L(w_a(s)) - (1/2) grandsum((T(w_a(s)) ∇L(w_a(s))) ⊙ Ξ*(s)) ] ds + ∫₀ᵗ (η/2) Tr(H(w_a(s)) Σ_D(s)) ds,

where ⊙ denotes the Hadamard product (Davis, 1962), and grandsum(·) (Merikoski, 1984) of a matrix M satisfies grandsum(M) = Σ_{i,j} M_{ij}.

A detailed proof and the escaping efficiency of C-SGD (see Proposition C.5) are given in Appendix C. Comparing Theorem 3 and Proposition C.5, the main difference between the escaping efficiency of D-SGD and that of C-SGD lies in the integral of grandsum((T(w_a(t)) ∇L(w_a(t))) ⊙ Ξ*(t)), which correlates with the gradient diversity in Equation (4). We then study how this term affects the escaping efficiency of D-SGD on super-quadratic minima, a typical class of minima defined below.

Definition 2 (Super-quadratic minimum). Given that the loss L is continuous and has second-order partial derivatives, we call a minimum w* of L δ-locally super-quadratic if for any w in the open punctured neighborhood Ů(w*, δ) of w*, the following conditions hold: (1) H(w*) ≼ H(w); and (2) there exist α(w), β(w) ∈ R₊ such that H(w)(w - w*) = α(w) ‖w - w*‖₂^{β(w)} (w - w*).

Super-quadratic growth implies that the loss becomes flatter as the parameters get closer to the minimum. We now present the intuition behind the second condition in Definition 2. A second-order Taylor approximation of L around w* reads

L(w) - L(w*) = ∇L(w)^T (w - w*) + (w - w*)^T H(w)(w - w*),

and the second condition in Definition 2 further guarantees that

L(w) - L(w*) = ∇L(w)^T (w - w*) + α(w) ‖w - w*‖₂^{β(w)} (w - w*)^T (w - w*),

which suggests that the growth of L(w) is δ-locally super-quadratic as long as α(w), β(w) > 0. A related study by Ma et al.
(2022) observes that the minima learned by centralized gradient descent methods obey a "sub-quadratic growth" (i.e., the loss becomes sharper as the parameters get closer to the minimum). We give a formalization of sub-quadratic minima in Definition C.1. Intuitively, a super-quadratic minimum is flatter than a sub-quadratic minimum of the same depth, as illustrated in Figure 3. The following proposition studies the sign of grandsum((T(w_a(t)) ∇L(w_a(t))) ⊙ Ξ*(t)) at super-quadratic and sub-quadratic minima.

Proposition 4. Suppose that w_a(t) is sufficiently close to a local minimum w*. Then grandsum((T(w_a(t)) ∇L(w_a(t))) ⊙ Ξ*(t)) is (1) zero if w* is a quadratic minimum, (2) positive if w* is a δ-locally super-quadratic minimum, and (3) negative if w* is a δ-locally sub-quadratic minimum.

Combined with Theorem 3, Proposition 4 shows that D-SGD favors super-quadratic minima over sub-quadratic minima with a higher probability. The proof is included in Appendix C. Theorem 1 and Proposition 4 indicate that the additional noise (i.e., the gradient diversity in Equation (4)) of D-SGD may play a distinct role in shaping the generalizability of D-SGD.
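Proposition 4 can be illustrated in one dimension, where grandsum((T ∇L) ⊙ Ξ*) reduces to L'''(w)·L'(w)·ξ with a scalar weight diversity ξ ≥ 0. The two toy losses below, w⁴ (super-quadratic at 0) and |w|^{3/2} (sub-quadratic at 0), are illustrative assumptions:

```python
import numpy as np

def sign_term(dL, d3L, w, xi=1.0):
    """1-D version of grandsum((T(w) grad L(w)) ⊙ Xi*): third derivative
    times gradient, weighted by the scalar weight diversity xi >= 0."""
    return d3L(w) * dL(w) * xi

# Super-quadratic minimum at 0: L(w) = w^4 (flatter than quadratic near 0).
dL_super = lambda w: 4 * w ** 3
d3L_super = lambda w: 24 * w

# Sub-quadratic minimum at 0: L(w) = |w|^(3/2) (sharper than quadratic).
dL_sub = lambda w: 1.5 * np.sign(w) * np.abs(w) ** 0.5
d3L_sub = lambda w: -0.375 * np.sign(w) * np.abs(w) ** -1.5

for w in [0.5, -0.5]:
    assert sign_term(dL_super, d3L_super, w) > 0   # positive: escape slowed
    assert sign_term(dL_sub, d3L_sub, w) < 0       # negative: escape sped up
```

Near both minima L''' and L' always carry matching (resp. opposite) signs, so the product is positive for the super-quadratic loss and negative for the sub-quadratic one on either side of w*, consistent with cases (2) and (3) of the proposition.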

5. EMPIRICAL RESULTS

This section empirically validates our theory. We first introduce the experimental setup and then study how decentralization favors flat minima.

Experimental setup. The network architectures follow He et al. (2016a). We apply the linear scaling rule to avoid the confounding effect of different total batch sizes caused by different local batch sizes (see Subsection 4.2). To isolate the effect of decentralization on the flatness of minima, all other training techniques are strictly controlled. The code is written based on PyTorch (Paszke et al., 2019).

Hardware environment. The experiments are conducted on a computing facility with NVIDIA® Tesla™ V100 16GB GPUs and Intel® Xeon® Gold 6140 CPUs @ 2.30GHz. We plot the minima learned by C-SGD and D-SGD in Figure 4.

6. DISCUSSION AND FUTURE WORK

Scalability to complex or sparse topologies. Our theory holds for arbitrary topologies (see Definition A.1). We also conduct experiments on grid-like and static exponential topologies (Ying et al., 2021a) and obtain results similar to Figure 4 and Figure B.1. For sparse topologies, which have a very small spectral gap, the regularization term in Theorem 1 would be extremely large during training, which may hinder optimization and lead to a large total excess risk of D-SGD. Can we design a new decentralized training algorithm that alleviates the optimization issue on sparse topologies while maintaining the generalization advantage in large-batch settings?

Non-IIDness and the flatness of minima. In real-world settings, a fundamental challenge in distributed learning is that data may not be i.i.d. across workers (Tang et al., 2018; Vogels et al., 2021; Mendieta et al., 2022). In this case, different workers may collect distinct or even contradictory samples (i.e., data heterogeneity) (Criado et al., 2021). It is widely observed that non-IIDness hurts the generalizability of D-SGD. Can we rigorously analyze how the degree of data heterogeneity affects the flatness of minima and design theoretically motivated algorithms to promote the generalizability of D-SGD in non-IID settings?

7. CONCLUSION

This work provides a new theory that reconciles the conflict between the empirical observations showing that D-SGD can generalize better than centralized SGD (C-SGD) in large-batch settings and the existing generalization theories of D-SGD which suggest that decentralization degrades generalizability. We prove that D-SGD introduces an implicit regularization that penalizes the sharpness of the learned minima, and that this effect is amplified in large-batch settings if we apply the linear scaling rule. We further analyze the escaping efficiency of D-SGD, which shows that D-SGD favors super-quadratic flat minima. To the best of our knowledge, this is the first work on the implicit sharpness regularization and escaping efficiency of D-SGD.

A ADDITIONAL BACKGROUND

A.1 DECENTRALIZED LEARNING

To handle an increasing amount of data and model parameters, distributed learning across multiple computing workers has emerged. A traditional distributed learning system usually follows a centralized setup. However, such a central-server-based learning scheme suffers from two main issues: (1) a centralized communication protocol significantly slows down training since central servers are easily overloaded, especially in low-bandwidth or high-latency cases (Lian et al., 2017); (2) there exists potential information leakage through privacy attacks on model parameters, despite decentralizing data using Federated Learning (Zhu et al., 2019a; Geiping et al., 2020; Yin et al., 2021). As an alternative, decentralized training allows workers to balance the load on the central server through the gossip technique (Lian et al., 2017), as well as maintain confidentiality (Warnat-Herresthal et al., 2021). We then summarize some commonly used notions in decentralized learning.

Definition A.1 (Doubly Stochastic Matrix). Let G = (V, E) stand for the decentralized communication topology, where V denotes the set of m computational nodes and E represents the edge set. For any given topology G = (V, E), the doubly stochastic gossip matrix P = [P_{j,k}] ∈ R^{m×m} is defined on the edge set E and satisfies
• P = P^T (symmetric);
• if j ≠ k and (j, k) ∉ E, then P_{j,k} = 0 (disconnected), and otherwise P_{j,k} > 0 (connected);
• P_{j,k} ∈ [0, 1] for all j, k, and Σ_k P_{j,k} = Σ_j P_{j,k} = 1 (standard weight matrix for an undirected graph).

In the following we illustrate some commonly used communication topologies.

Definition A.2 (Spectral gap). Denote by λ the second-largest absolute eigenvalue of the gossip matrix P. The spectral gap of P is defined as 1 - λ.

According to the definition of the doubly stochastic matrix (Definition A.1), we have 0 ≤ λ < 1. The spectral gap measures the connectivity of the communication topology: it is close to 0 for sparse topologies and approaches 1 for well-connected topologies.

Assumption A.1.
We assume that, in expectation, the sum of the off-diagonal entries of Ξ(t) is at most d - 1 times the sum of its diagonal entries:

E[ Σ_{k≠l} (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))_k (w_j(t) - w_a(t))_l ] ≤ E[ (d - 1) Σ_k (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))_k² ],

where d stands for the dimensionality of w_j(t) - w_a(t) and the expectations are taken over the local mini-batches µ_j(τ) ∼ D (j = 1, ..., m; τ = 1, ..., t).
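The gossip-matrix conditions of Definition A.1 and the behavior of the spectral gap can be checked concretely. The ring topology below is an illustrative choice; its gap shrinks as the ring grows, matching the remark that sparser connectivity gives a smaller gap:

```python
import numpy as np

def ring_gossip(m):
    """Gossip matrix for a ring topology: weight 1/3 on the node itself
    and on each of its two neighbours. Symmetric and doubly stochastic,
    hence a valid matrix per Definition A.1 for the ring edge set."""
    P = np.zeros((m, m))
    for j in range(m):
        for k in (j, (j - 1) % m, (j + 1) % m):
            P[j, k] += 1 / 3
    return P

def spectral_gap(P):
    """1 - lambda, with lambda the second-largest absolute eigenvalue of P."""
    eigs = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
    return 1.0 - eigs[1]

P = ring_gossip(8)
assert np.allclose(P, P.T)                        # symmetric
assert np.allclose(P.sum(axis=0), 1.0)            # column sums
assert np.allclose(P.sum(axis=1), 1.0)            # row sums
print(spectral_gap(P))                            # small for a sparse ring
```

A fully connected topology (all entries 1/m) would instead have spectral gap 1, the well-connected extreme.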

B ADDITIONAL MINIMA VISUALIZATION

We plot the minima learned by C-SGD and D-SGD as follows, using the 2D loss-landscape visualization tool of Li et al. (2018).

C PROOFS

Corollary C.1. On a quadratic loss with Hessian H, we have

(1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L_{µ_j(t)}(w_a(t))) = (1/m) Σ_{j=1}^m (H w_j(t) - H w_a(t)) = H (1/m) Σ_{j=1}^m (w_j(t) - w_a(t)) = 0.

In distributed centralized SGD, where w_j(t) = w_a(t) for all j, the gradient diversity satisfies (1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L_{µ_j(t)}(w_a(t))) = 0.

Lemma C.2. Denote by Ξ(t) = (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))(w_j(t) - w_a(t))^T the weight diversity matrix and by Ξ*(t) = Σ_{i=1}^d ⟨e_i e_i^T, Ξ(t)⟩_F e_i e_i^T its diagonal part. We assume that d₁ (d₁ < d), the marginal contribution of λ_{H(w_a(t)),1} to the full spectrum of H(w_a(t)), is non-negative and satisfies λ_{H(w_a(t)),1} = (d₁/d) Tr(H(w_a(t))). Then the product of Tr(Ξ(t)) and the maximum eigenvalue of H(w_a(t)) is upper and lower bounded as

0 ≤ Tr(H(w_a(t)) Ξ*(t)) ≤ λ_{H(w_a(t)),1} [sharpness] · Tr(Ξ(t)) [consensus distance] ≤ d₁ Tr(H(w_a(t)) Ξ*(t)).

Proof of Lemma C.2. On the one hand, von Neumann's trace inequality (Von Neumann, 1937) guarantees

Tr(H(w_a(t)) Ξ*(t)) ≤ Σ_{r=1}^d λ_{H(w_a(t)),r} · λ_{Ξ*(t),r} ≤ λ_{H(w_a(t)),1} · Tr(Ξ(t)),   (C.1)

where λ_{H(w_a(t)),r} and λ_{Ξ*(t),r} represent the r-th largest eigenvalues of H(w_a(t)) and Ξ*(t), respectively. On the other hand, using λ_{H(w_a(t)),1} = (d₁/d) Tr(H(w_a(t))) and Assumption A.1, we obtain

0 ≤ λ_{H(w_a(t)),1} · Tr(Ξ(t)) ≤ d₁ Tr(H(w_a(t)) Ξ*(t)).

Note that we can also obtain 0 ≤ Tr(H(w_a(t)) Ξ*(t)) ≤ Tr(H(w_a(t))) · Tr(Ξ(t)) ≤ d₁ Tr(H(w_a(t)) Ξ*(t)), which shows that D-SGD also implicitly regularizes Tr(H(w_a(t))).

Lemma C.3 (Kong et al., 2021).
Suppose that the averaged gradient norm satisfies (1/m) Σ_{j=1}^m ‖∇L(w_j(t))‖² ≤ (1 + (1-λ)/4) (1/m) Σ_{j=1}^m ‖∇L(w_j(t+1))‖². Then the consensus distance of D-SGD satisfies

Tr(Ξ(t)) = (1/m) Σ_{j=1}^m ‖w_j(t) - w_a(t)‖² = λη² · O( (1/m) Σ_{j=1}^m ‖∇L(w_j(t))‖² / (1-λ)² + (1/m) Σ_{j=1}^m E_{µ_j(t)∼D} ‖∇L_{µ_j(t)}(w_j(t)) - ∇L(w_j(t))‖₂² / (1-λ) ),

where λ equals 1 minus the spectral gap (see Definition A.2).

Lemma C.4. D-SGD is approximated by the following SDE:

dw_a(t) = -[ ∇L(w_a(t)) + (1/2) T(w_a(t)) ⊗ Ξ*(t) ] dt + [η Σ_D(t)]^{1/2} dW(t),

where ⊗ denotes the tensor product (see Appendix A.2), Σ_D(t) denotes the covariance matrix of the zero-mean noise ϵ_D(t), and W(t) is a standard Brownian motion (Feynman, 1964) in R^d.

Proof of Lemma C.4. If we omit the residual terms, the iterate of D-SGD becomes

w_a(t+1) = w_a(t) - η ∇[ L(w_a(t)) + (1/2) Tr(H(w_a(t)) Ξ*(t)) ] + η ϵ_D(t) = w_a(t) - [ ∇L(w_a(t)) + (1/2) T(w_a(t)) ⊗ Ξ*(t) ] η + [η Σ_D(t)]^{1/2} √η ϵ*,

where ϵ_D(t) ∼ N(0, Σ_D(t)) (Gaussian approximation) and ϵ* is a standard Gaussian random variable. For a small enough constant learning rate η, we arrive at

dw_a(t) = -[ ∇L(w_a(t)) + (1/2) T(w_a(t)) ⊗ Ξ*(t) ] dt + [η Σ_D(t)]^{1/2} dW(t).

This stochastic process models D-SGD as a continuous-time evolution (i.e., an SDE) without ignoring the role of mini-batch noise when the learning rate is infinitesimal.

Proof of Theorem 1. We start by rewriting the update of the global averaged model w_a(t) of D-SGD as

w_a(t+1) = w_a(t) - η [ ∇L(w_a(t)) + (∇L(w_a(t)) - ∇L_{µ(t)}(w_a(t))) + (1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L_{µ_j(t)}(w_a(t))) ],

where the three terms are the unbiased gradient, the gradient noise over the super-batch µ(t), and the gradient diversity among workers, respectively. Analyzing the effect of the gradient diversity on the training dynamics of D-SGD on general non-convex losses is highly non-trivial.
Technically, we perform a second-order Taylor expansion (see Appendix A.2) of the gradient diversity around w_a(t), collecting the high-order terms in a residual R:

(1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L_{µ_j(t)}(w_a(t))) = (1/m) Σ_{j=1}^m H_{µ_j(t)}(w_a(t)) (w_j(t) - w_a(t)) + (1/(2m)) Σ_{j=1}^m T_{µ_j(t)}(w_a(t)) ⊗ [(w_j(t) - w_a(t))(w_j(t) - w_a(t))^T] + R.

Here H_{µ_j(t)}(w_a(t)) ≜ (1/|µ_j(t)|) Σ_{ζ(t)=1}^{|µ_j(t)|} H(w_a(t); z_{j,ζ(t)}) stands for the empirical Hessian at w_a(t), and T_{µ_j(t)}(w_a(t)) ≜ (1/|µ_j(t)|) Σ_{ζ(t)=1}^{|µ_j(t)|} T(w_a(t); z_{j,ζ(t)}) denotes the tensor containing all empirical third-order partial derivatives at w_a(t), where µ_j(t) and z_{j,ζ(t)} follow the notation in Equation (1). Analogous to works investigating the SGD dynamics (M et al., 2017; Zhu et al., 2019b; Ziyin et al., 2022; Wu et al., 2022), we calculate the expectation and covariance of the gradient diversity; the covariance analysis is deferred to Subsection 4.3. Taking expectation over all local mini-batches µ_j(t) (j = 1, ..., m) provides

E_{µ_j(t)∼D, j=1,...,m} [ (1/m) Σ_{j=1}^m (∇L_{µ_j(t)}(w_j(t)) - ∇L_{µ_j(t)}(w_a(t))) ] = H(w_a(t)) (1/m) Σ_{j=1}^m (w_j(t) - w_a(t)) [= 0] + (1/2) T(w_a(t)) ⊗ [ (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))(w_j(t) - w_a(t))^T ] + R.

The i-th entry of the above expression is

E_{µ_j(t)∼D, j=1,...,m} [ (1/m) Σ_{j=1}^m (∂_i L_{µ_j(t)}(w_j(t)) - ∂_i L_{µ_j(t)}(w_a(t))) ] = (1/2) Σ_{k,l} ∂³_{ikl} L(w_a(t)) (1/m) Σ_{j=1}^m (w_j(t) - w_a(t))_k (w_j(t) - w_a(t))_l + O(‖w_j(t) - w_a(t)‖₂³),

where (w_j(t) - w_a(t))_k denotes the k-th entry of the vector w_j(t) - w_a(t), and the identity ∂³_{ikl} L = ∂_i ∂²_{kl} L follows from Clairaut's theorem (Rudin et al., 1976). According to Markov's inequality and Assumption A.1, with probability greater than 1 - O(η), the iterate of D-SGD can be written as

E_{µ_j(t)∼D, j=1,...,m} [w_a(t+1)] = w_a(t) - η ∇[ L(w_a(t)) + (1/2) Tr(H(w_a(t)) Ξ*(t)) ] + O(η² 1) + O(η ‖w_j(t) - w_a(t)‖₂³ 1).

Applying Lemma C.2,

0 ≤ Tr(H(w_a(t)) Ξ*(t)) ≤ λ_{H(w_a(t)),1} · Tr(Ξ(t)) ≤ d₁ Tr(H(w_a(t)) Ξ*(t)),

where λ_{H(w_a(t)),1} denotes the largest eigenvalue of H(w_a(t)) and d₁ stands for the marginal contribution of λ_{H(w_a(t)),1} to the full spectrum of H(w_a(t)) (i.e., λ_{H(w_a(t)),1} = (d₁/d) Tr(H(w_a(t)))).
Therefore, combined with Equation (3), we conclude that D-SGD also implicitly regularizes $\lambda_{H(w_a(t)),1}\cdot\operatorname{Tr}(\Xi(t))$. The proof is complete.

Here $\odot$ denotes the Hadamard product (Davis, 1962), and the $\operatorname{grandsum}(\cdot)$ (Merikoski, 1984) of a matrix $M$ satisfies $\operatorname{grandsum}(M)=\sum_{i,j}M_{ij}$.

E PROOF OF THEOREM 3

Proof of Theorem 3. Since $L$ is continuous and has second-order partial derivatives, we can write
$$\mathrm{d}L\left(w_a(t)\right) = -\Bigg(\nabla L\left(w_a(t)\right)^{\mathrm{T}}\nabla L\left(w_a(t)\right)-\frac{1}{2}\underbrace{\nabla L\left(w_a(t)\right)^{\mathrm{T}}\left(T\left(w_a(t)\right)\otimes\Xi^*(t)\right)}_{\operatorname{grandsum}\left(\left(T\left(w_a(t)\right)\nabla L\left(w_a(t)\right)\right)\odot\Xi^*(t)\right)}\Bigg)\mathrm{d}t$$



The word "centralized" indicates that in C-SGD, there is a central server receiving gradient information from local workers (see Figure 2); in D-SGD, there is no central server. In the following, we analyze the training dynamics of the global averaged model $w_a(t)$ of D-SGD, which has been proved to be close to the individual models $w_j(t)$ ($j=1,\dots,m$) (Yuan et al., 2016; Fallah et al., 2022). Taking expectation over $\mu_j(t)$ means taking expectation over all $z_{j,\zeta(t)}$ ($\zeta(t)=1,\dots,|\mu_j|$). Recall that Theorem 1 implies that the loss function D-SGD optimizes is close to the original loss $L$ plus $\frac{1}{2}\operatorname{Tr}(\Xi(t))\cdot\lambda_{H(w_a(t)),1}$. The second factor, $\lambda_{H(w_a(t)),1}$, is a sharpness measure, and the first factor, $\operatorname{Tr}(\Xi(t))$, is the "regularization coefficient" that characterizes the strength of the sharpness regularization. Note that there is no guarantee that D-SGD converges to a local minimum in non-convex settings.
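The interplay between the sharpness measure $\lambda_{H(w_a(t)),1}$, the consensus matrix $\Xi(t)$, and its diagonal $\Xi^*(t)$ can be verified numerically. The sketch below uses a synthetic positive semi-definite Hessian and random local models (both our assumptions) and checks $\operatorname{Tr}(\Xi^*(t))=\operatorname{Tr}(\Xi(t))$ together with the sandwich bound $0\le\operatorname{Tr}(H\Xi^*)\le\lambda_{H,1}\operatorname{Tr}(\Xi)$ invoked in the proof of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 16, 8

# Synthetic positive semi-definite Hessian standing in for H(w_a(t))
A = rng.normal(size=(d, d))
H = A @ A.T

# Random local models w_j(t) and their deviations from the average w_a(t)
w = rng.normal(size=(m, d))
diffs = w - w.mean(axis=0)

# Consensus covariance Xi(t) and its diagonal part Xi*(t)
Xi = diffs.T @ diffs / m
Xi_star = np.diag(np.diag(Xi))

lam_1 = np.linalg.eigvalsh(H)[-1]  # largest Hessian eigenvalue (sharpness)
coeff = np.trace(Xi)               # consensus distance Tr(Xi(t))
val = np.trace(H @ Xi_star)

assert np.isclose(np.trace(Xi_star), coeff)  # Tr(Xi*) = Tr(Xi)
assert 0 <= val <= lam_1 * coeff + 1e-9      # 0 <= Tr(H Xi*) <= lam_1 * Tr(Xi)
```

The lower bound holds because both $H$ and $\Xi^*$ have non-negative diagonals; the upper bound holds because every diagonal entry of a PSD matrix is at most its largest eigenvalue.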



Figure 1: Comparison of the validation accuracy of C-SGD and D-SGD on CIFAR-10. The number of workers (one GPU per worker) is set as 16, and the local batch size is set as 64 and 512 per worker (1024 and 8192 total batch size, respectively). The training setting is included in Section 5.

of D-SGD. Recently, Sun et al. (2021) and Zhu et al. (2022) have established generalization bounds of D-SGD and have shown that decentralized training hurts generalization.

Figure 2: An illustration of C-SGD and D-SGD.

$L_{\mu_j}(w)\triangleq\frac{1}{|\mu_j|}\sum_{\zeta=1}^{|\mu_j|}L(w;z_{j,\zeta})$ denotes the local empirical risk on the $j$-th worker, and $z_{j,\zeta}\in\mu_j$ ($\zeta=1,\dots,|\mu_j|$) stands for the local training data.

Figure 3: An illustration of super-quadratic and sub-quadratic minimum.

Figure 4: Minima 3D visualization of C-SGD and D-SGD with ResNet-18 on CIFAR-10.

Figure A.1: An illustration of some commonly used topologies.

Figure B.1: Minima 2D visualization of C-SGD and D-SGD with ResNet-18 on CIFAR-10.

is the lower bound of $\frac{1}{m}\sum_{j=1}^{m}\left(w_j(t)-w_a(t)\right)_k^2$ ($k=1,\dots,d$). Knowing that $\operatorname{Tr}(\Xi^*(t))=\operatorname{Tr}(\Xi(t))$, we can bound the right-hand side of Equation (C.1) as follows:

$$= \frac{1}{2}\partial_i\partial^2_{kl}L\left(z_n\right)\cdot\frac{1}{m}\sum_{j=1}^{m}\left(w_j(t)-w_a(t)\right)_k\left(w_j(t)-w_a(t)\right)_l+O\left(\left\|w_j(t)-w_a(t)\right\|^3\right)$$

By Markov's inequality,
$$\mathbb{P}\left(\frac{1}{m}\sum_{j=1}^{m}\left(w_j(t)-w_a(t)\right)_k\left(w_j(t)-w_a(t)\right)_l>\eta\right)\leq\frac{1}{\eta}\,\mathbb{E}\left[\frac{1}{m}\sum_{j=1}^{m}\left(w_j(t)-w_a(t)\right)_k\left(w_j(t)-w_a(t)\right)_l\right]=O(\eta),$$
where $d$ stands for the dimensionality of $w_j(t)-w_a(t)$ and the penultimate equality is due to Lemma C.3. For sufficiently small $\eta=o(d^{-2})$, the term $\frac{1}{2}\partial_i\partial^2_{kl}L\left(z_n\right)\cdot\frac{1}{m}\sum_{j=1}^{m}\left(w_j(t)-w_a(t)\right)_k\left(w_j(t)-w_a(t)\right)_l$ in Equation (C.2) is of the order $O(\eta)$.

Theorem 2 (Sharpness regularization coefficient under the linear scaling rule). Suppose that the averaged gradient norm satisfies $\frac{1}{m}\sum_{j=1}^{m}\left\|\nabla L\left(w_j(t)\right)\right\|^2\leq\left(1+\frac{1-\lambda}{4}\right)\frac{1}{m}\sum_{j=1}^{m}\left\|\nabla L\left(w_j(t+1)\right)\right\|^2$, where $1-\lambda$ denotes the spectral gap (see Definition A.2). Then the sharpness regularization coefficient of D-SGD at the $t$-th iteration is $O(|\mu(t)|^3)$ up to the gradient-norm and gradient-noise factors in Lemma C.3, which increases with the total batch size $|\mu(t)|$ if we apply the linear scaling rule.

Proof of Theorem 2. Theorem 1 states that the regularization coefficient of $\lambda_{H(w_a(t)),1}$ is $\eta\operatorname{Tr}(\Xi(t))$. According to Lemma C.3, $\operatorname{Tr}(\Xi(t))$ satisfies
$$\operatorname{Tr}(\Xi(t))=\eta^2\cdot O\left(\frac{\frac{1}{m}\sum_{j=1}^{m}\left\|\nabla L\left(w_j(t)\right)\right\|^2}{(1-\lambda)^2}+\frac{\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}_{\mu_j(t)\sim\mathcal{D}}\left\|\nabla L_{\mu_j(t)}\left(w_j(t)\right)-\nabla L\left(w_j(t)\right)\right\|_2^2}{1-\lambda}\right).$$
If we apply the linear scaling rule (see Subsection 4.2), we have $\eta=O(|\mu(t)|)$, which completes the proof.

Theorem 3 (Escaping efficiency of D-SGD). If the loss $L$ is continuous and has fourth-order partial derivatives, the escaping efficiency of D-SGD from a minimum $w^*$ satisfies
$$\mathbb{E}_{w_a(t)}\left[L\left(w_a(t)\right)-L\left(w^*\right)\right]=-\int_0^t\mathbb{E}_{w_a(s)}\left[\nabla L\left(w_a(s)\right)^{\mathrm{T}}\nabla L\left(w_a(s)\right)-\frac{1}{2}\operatorname{grandsum}\left(\left(T\left(w_a(s)\right)\nabla L\left(w_a(s)\right)\right)\odot\Xi^*(s)\right)\right]\mathrm{d}s+\frac{\eta}{2}\int_0^t\mathbb{E}_{w_a(s)}\left[\operatorname{Tr}\left(H\left(w_a(s)\right)\Sigma_D(s)\right)\right]\mathrm{d}s.$$
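Theorem 2 hinges on the linear scaling rule $\eta=O(|\mu(t)|)$. A minimal sketch of the rule follows; the base learning rate and base batch size are hypothetical values, not the paper's settings.

```python
def linear_scaling_lr(base_lr, base_batch, total_batch):
    """Linear scaling rule: the learning rate grows proportionally
    with the total batch size |mu(t)|."""
    return base_lr * total_batch / base_batch

# Enlarging the total batch enlarges eta by the same factor, and hence
# strengthens the eta-dependent implicit sharpness regularization of D-SGD.
lr_1k = linear_scaling_lr(0.1, 1024, 1024)
lr_8k = linear_scaling_lr(0.1, 1024, 8192)
assert lr_1k == 0.1
assert lr_8k == 0.8
```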

The goal of supervised learning is to learn a predictor (hypothesis) g(•; w), parameterized by w = w(z 1 , z 2 , . . . , z N ) ∈ R d , to approximate the mapping between the input variable x ∈ X and the output variable y ∈ Y, based on the training set µ. Let c : Y × Y → R + be a function that evaluates the prediction performance of hypothesis g.
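The supervised-learning objects above can be made concrete with a short empirical-risk sketch; the linear predictor $g$ and squared loss $c$ below are our illustrative choices, not the paper's models.

```python
import numpy as np

def empirical_risk(w, X, y, g, c):
    """L_mu(w) = (1/N) * sum_n c(g(x_n; w), y_n) over the training set mu."""
    return float(np.mean(c(g(X, w), y)))

# Illustrative hypothesis g and performance measure c
g = lambda X, w: X @ w               # linear predictor g(x; w) = <w, x>
c = lambda pred, y: (pred - y) ** 2  # squared loss c: Y x Y -> R_+

rng = np.random.default_rng(3)
X = rng.normal(size=(32, 3))         # inputs x in R^3
w_star = np.array([1.0, -2.0, 0.5])  # ground-truth parameters
y = g(X, w_star)                     # noiseless labels

assert np.isclose(empirical_risk(w_star, X, y, g, c), 0.0)  # perfect fit
assert empirical_risk(w_star + 1.0, X, y, g, c) > 0.0       # any other w pays a loss
```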

$\nabla L_{\mu_j(t)}\left(w_j(t)\right)$ stands for the local mini-batch gradient of $L$ w.r.t. the first argument $w$. The total batch size of C-SGD at the $t$-th iteration is $|\mu(t)|=\sum_{j=1}^{m}|\mu_j(t)|$.

Implementation settings. Vanilla D-SGD and C-SGD are employed to train image classifiers on CIFAR-10 (Krizhevsky et al., 2009) with three popular neural networks: AlexNet (Krizhevsky et al., 2017), ResNet-18, and ResNet-34 (He et al., 2016b). Batch normalization (Ioffe & Szegedy, 2015) is employed in training AlexNet. The number of workers (one GPU per worker) is set as 16, and the local batch size is set as 8, 64, and 512 per worker in three different cases. For the case of local batch size 64, the initial learning rate is set as 0.1 for ResNet-18 and 0.01 for AlexNet. The learning rate is divided by 10 when the model has passed 2/5 and 4/5 of the total number of iterations.
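The step schedule described above (divide the learning rate by 10 after 2/5 and again after 4/5 of the iterations) can be sketched as:

```python
def step_lr(base_lr, iteration, total_iters):
    """Divide base_lr by 10 once 2/5 of the iterations have passed,
    and by 10 again after 4/5 of the iterations."""
    if iteration < 2 * total_iters / 5:
        return base_lr
    if iteration < 4 * total_iters / 5:
        return base_lr / 10
    return base_lr / 100

total = 1000
assert step_lr(0.1, 0, total) == 0.1                    # warm phase
assert abs(step_lr(0.1, 500, total) - 0.01) < 1e-15     # after 2/5 of training
assert abs(step_lr(0.1, 900, total) - 0.001) < 1e-15    # after 4/5 of training
```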

$$\mathbb{E}_{\substack{\mu_j(t)\sim\mathcal{D}\\ j=1,\dots,m}}\left[w_a(t+1)\right]=w_a(t)-\eta\nabla\left[L\left(w_a(t)\right)+\frac{1}{2}\operatorname{Tr}\left(H\left(w_a(t)\right)\Xi^*(t)\right)\right],$$
where $\Xi^*(t)=\sum_i\langle e_ie_i^{\mathrm{T}},\Xi(t)\rangle_F\,e_ie_i^{\mathrm{T}}$ is the diagonal of $\Xi(t)$. According to Lemma C.2, $\lambda_{H(w_a(t)),1}\cdot\operatorname{Tr}(\Xi(t))$ scales positively with $\operatorname{Tr}\left(H\left(w_a(t)\right)\Xi^*(t)\right)$:
$$0\leq\operatorname{Tr}\left(H\left(w_a(t)\right)\Xi^*(t)\right)\leq\lambda_{H(w_a(t)),1}\operatorname{Tr}\left(\Xi(t)\right).$$

A.2 EXPLANATION OF TENSOR PRODUCT

The tensor product between a third-order tensor $T\in\mathbb{R}^{d\times d\times d}$ and a second-order tensor (matrix) $M\in\mathbb{R}^{d\times d}$ in this paper is defined slice-wise as
$$\left(T\otimes M\right)_i=\operatorname{grandsum}\left(T_i\odot M\right),\quad i=1,\dots,d,$$
where $T_i\in\mathbb{R}^{d\times d}$ is a second-order tensor (matrix), $\odot$ denotes the Hadamard product (Davis, 1962), and the $\operatorname{grandsum}(\cdot)$ (Merikoski, 1984) of a second-order tensor (matrix) $M$ satisfies $\operatorname{grandsum}(M)=\sum_{i,j}M_{ij}$.

The preceding identity follows from Itô's lemma (Øksendal, 2003). The term $\nabla L\left(w_a(t)\right)^{\mathrm{T}}\Sigma_D(t)\,\mathrm{d}W(t)$ vanishes when we take the expectation with respect to the distribution of $w_a(t)$. Finally, integrating over $t$ provides the stated identity, which completes the proof.

Proposition C.5 (Escaping efficiency of C-SGD). If the loss $L$ has second-order partial derivatives, the escaping efficiency of C-SGD from a minimum $w^*$ satisfies
$$\mathbb{E}_{w(t)}\left[L\left(w(t)\right)-L\left(w^*\right)\right]=-\int_0^t\mathbb{E}_{w(s)}\left[\nabla L\left(w(s)\right)^{\mathrm{T}}\nabla L\left(w(s)\right)\right]\mathrm{d}s+\frac{\eta}{2}\int_0^t\mathbb{E}_{w(s)}\left[\operatorname{Tr}\left(H\left(w(s)\right)\Sigma_C(s)\right)\right]\mathrm{d}s,$$
where $\Sigma_C(t)$ denotes the covariance matrix of the gradient noise of C-SGD (Equation (1)). The proof is analogous to that of Theorem 3.

Proposition 4. $\operatorname{grandsum}\left(\left(T\left(w_a(t)\right)\nabla L\left(w_a(t)\right)\right)\odot\Xi^*(t)\right)$ is (1) zero on quadratic minima, (2) positive on super-quadratic minima, and (3) negative on sub-quadratic minima.

Proof of Proposition 4. (1) Quadratic minima. On a quadratic loss, $\operatorname{grandsum}\left(\left(T\left(w_a(t)\right)\nabla L\left(w_a(t)\right)\right)\odot\Xi^*(t)\right)=0$ due to zero gradient diversity (see Corollary C.1).

(2) Super-quadratic minima. Perform a Taylor expansion of $H(w)-H(w^*)$ around $w^*$. According to the definition of super-quadratic minima, the third-order term obtained this way is positive along the escape direction. Another Taylor expansion, of $\nabla L(w)-\nabla L(w^*)$ around $w^*$, then yields $\operatorname{grandsum}\left(\left(T\left(w_a(t)\right)\nabla L\left(w_a(t)\right)\right)\odot\Xi^*(t)\right)>0$, since $\Xi^*(t)$ is a diagonal matrix with all positive entries.

(3) Sub-quadratic minima. By the same token, we can prove that $\operatorname{grandsum}\left(\left(T\left(w_a(t)\right)\nabla L\left(w_a(t)\right)\right)\odot\Xi^*(t)\right)<0$ on sub-quadratic minima.
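The slice-wise definition of the tensor product translates directly into code; the sketch below checks that $\operatorname{grandsum}(T_i\odot M)$ agrees with the contraction $\sum_{k,l}T_{ikl}M_{kl}$.

```python
import numpy as np

def grandsum(M):
    """grandsum(M) = sum of all entries M_ij (Merikoski, 1984)."""
    return M.sum()

def tensor_product(T, M):
    """(T x M)_i = grandsum(T_i ⊙ M) for T in R^{d x d x d}, M in R^{d x d},
    where ⊙ is the Hadamard (entrywise) product."""
    return np.array([grandsum(T_i * M) for T_i in T])

rng = np.random.default_rng(4)
d = 4
T = rng.normal(size=(d, d, d))
M = rng.normal(size=(d, d))

out = tensor_product(T, M)
# The slice-wise definition equals the contraction sum_{k,l} T_{ikl} M_{kl}
assert np.allclose(out, np.einsum("ikl,kl->i", T, M))
```

With $M=\Xi^*(t)$ and $T$ the third-order derivative tensor, $\nabla L^{\mathrm T}(T\otimes\Xi^*)$ then equals $\operatorname{grandsum}((T\nabla L)\odot\Xi^*)$, the quantity appearing in Theorem 3.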

