WOULD DECENTRALIZATION HURT GENERALIZATION?

Abstract

Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices without the control of a central server. Existing theory suggests that decentralization degrades generalizability, which conflicts with experimental results showing that, in large-batch settings, D-SGD generalizes better than centralized SGD (C-SGD). This work presents a new theory that reconciles the conflict between the two perspectives. We prove that D-SGD introduces an implicit regularization that simultaneously penalizes (1) the sharpness of the learned minima and (2) the consensus distance between the global averaged model and the local models. We then prove that this implicit regularization is amplified in large-batch settings when the linear scaling rule is applied. We further analyze the escaping efficiency of D-SGD and show that D-SGD favors super-quadratic flat minima. Experiments are in full agreement with our theory. The code will be released publicly. To the best of our knowledge, this is the first work on the implicit regularization and escaping efficiency of D-SGD.

1. INTRODUCTION

Decentralized stochastic gradient descent (D-SGD) enables simultaneous model training on massive workers without the control of a central server, where every worker communicates only with its directly connected neighbors (Xiao & Boyd, 2004; Lopes & Sayed, 2008; Nedic & Ozdaglar, 2009; Lian et al., 2017; Koloskova et al., 2020). This decentralization avoids the requirement of a costly central server with heavy communication and computation burdens. Despite the absence of a central server, existing theoretical results demonstrate that the massive models on the edge converge to a unique steady consensus model (Shi et al., 2015; Lian et al., 2017; Lu et al., 2011), with asymptotic linear speedup in convergence rate (Lian et al., 2017), as distributed centralized SGD (C-SGD) does (Dean et al., 2012; Li et al., 2014). Consequently, D-SGD offers a promising distributed learning solution with significant advantages in privacy (Nedic, 2020), scalability (Lian et al., 2017), and communication efficiency (Ying et al., 2021b).

However, existing theoretical studies show that the decentralized nature of D-SGD introduces an additional positive term into generalization error bounds, which suggests that decentralization may hurt generalization (Sun et al., 2021; Zhu et al., 2022). This stands in stark conflict with the empirical results of Zhang et al. (2021), which show that D-SGD generalizes better than C-SGD by a large margin in large-batch settings; see Figure 1. The conflict indicates that major characteristics of D-SGD were overlooked in the existing literature. Therefore, would decentralization hurt generalization?

This work reconciles the conflict. We prove that decentralization introduces an implicit regularization into D-SGD, which promotes generalization. To the best of our knowledge, this is the first paper to show the (perhaps surprising) advantages of D-SGD in generalizability, redressing the former misunderstanding.
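To make the algorithm concrete, the D-SGD update described above can be sketched in a few lines of NumPy. Everything below — the toy quadratic objective, the ring topology, and the variable names — is an illustrative assumption of ours, not the paper's experimental setup. Each worker gossip-averages its model with its directly connected neighbors through a doubly stochastic mixing matrix, then takes a local stochastic gradient step.

```python
import numpy as np

def dsgd_step(params, grads, W, lr):
    """One D-SGD iteration: gossip-average with neighbors (rows of the
    doubly stochastic mixing matrix W), then take a local gradient step.

    params: (n_workers, dim) local models x_i
    grads:  (n_workers, dim) local stochastic gradients
    """
    return W @ params - lr * grads

# Hypothetical toy problem: worker i holds f_i(x) = 0.5 * ||x - b_i||^2,
# so the global minimizer is the mean of the b_i.
n, dim, lr = 4, 3, 0.1
rng = np.random.default_rng(0)
b = rng.normal(size=(n, dim))   # per-worker data
x = rng.normal(size=(n, dim))   # initial local models

# Ring topology: each worker averages itself with its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

for _ in range(200):
    x = dsgd_step(x, x - b, W, lr)   # here grad f_i(x_i) = x_i - b_i

# The global averaged model converges to the global minimizer mean(b).
assert np.allclose(x.mean(axis=0), b.mean(axis=0), atol=1e-6)
```

With a constant learning rate and heterogeneous local data, the local models retain a residual consensus error, but the global averaged model still tracks the centralized trajectory — the quantity the analysis in this paper centers on.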
Specifically, our contributions are twofold.

• We prove that the mean iterate of D-SGD closely follows the path of C-SGD on a regularized loss, which is the sum of the original loss and a regularization term introduced by decentralization. This regularization term penalizes the largest eigenvalue of the Hessian matrix, as well as the consensus distance (see Theorem 1). These regularization effects are shown to be considerably amplified in large-batch settings (see Theorem 2), which is consistent with our visualization (see Figure 4) and the empirical results in Zhang et al. (2021). To prove the above results, we apply a second-order multivariate Taylor approximation (Königsberger, 2013) to the gradient diversity (see Equation (5)) to derive the regularized loss. Then, we prove that the regularization term contained in the regularized loss scales positively with the largest Hessian eigenvalue, which suggests that D-SGD implicitly minimizes the sharpness of the learned minima (see Lemma C.2).

• We prove the first result on the expected escaping speed of D-SGD from local minima (see Theorem 3). Our results show that D-SGD prefers super-quadratic flat minima over sub-quadratic minima with higher probability (see Proposition 4). The proof is based on the construction of a stochastic differential equation (SDE) approximation (Jastrzebski et al., 2017; M et al., 2017; Li et al., 2021) of D-SGD.
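The consensus distance penalized alongside sharpness in Theorem 1 — the average squared deviation of the local models from the global averaged model — is straightforward to compute. A minimal sketch (the function name and toy inputs are ours, not the paper's):

```python
import numpy as np

def consensus_distance(params):
    """(1/n) * sum_i ||x_i - x_bar||^2, where x_bar is the global
    averaged model over the n workers (rows of params)."""
    x_bar = params.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum((params - x_bar) ** 2, axis=1)))

# Two workers at [1, 0] and [3, 0]: x_bar = [2, 0], and each worker sits
# at squared distance 1 from it, so the consensus distance is 1.
x = np.array([[1.0, 0.0],
              [3.0, 0.0]])
assert np.isclose(consensus_distance(x), 1.0)
```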

2. RELATED WORK

Flatness and generalization. The flatness of a minimum is a commonly used concept in the optimization and machine learning literature and has long been regarded as a proxy for generalization (Hochreiter & Schmidhuber, 1997; Izmailov et al., 2018; Jiang et al., 2020). Intuitively, the loss around a flat minimum varies slowly in a large neighborhood, while the loss around a sharp minimum increases rapidly in a small neighborhood (Hochreiter & Schmidhuber, 1997). Through the lens of minimum description length theory (Rissanen, 1983), flat minimizers tend to generalize better than sharp minimizers, since they can be specified with lower precision (Keskar et al., 2017). From a Bayesian perspective, sharp minimizers have posterior distributions highly concentrated around them, indicating that they are more specialized to the training set and thus less robust to data perturbations than flat minimizers (MacKay, 1992; Chaudhari et al., 2019).

Generalization of large-batch training. Large-batch training is of significant interest for deep learning deployment, as it can contribute to a significant speed-up in training neural networks (Goyal et al., 2017; You et al., 2018; Shallue et al., 2019). Unfortunately, it is widely observed that, in the centralized learning setting, large-batch training often suffers from a drastic generalization degradation, even with fine-tuned hyper-parameters, from both empirical (Chen & Huo, 2016; Keskar et al., 2017; Hoffer et al., 2017; Shallue et al., 2019; Smith et al., 2020) and theoretical (Li et al., 2021) aspects. One explanation of this phenomenon is that large-batch training leads to "sharper" minima (Keskar et al., 2017), which are more sensitive to perturbations (Hochreiter & Schmidhuber, 1997).

Development of D-SGD. The earliest work on classical decentralized optimization can be traced back to Tsitsiklis (1984), Tsitsiklis et al. (1986), and Nedic & Ozdaglar (2009).
D-SGD, a typical decentralized optimization algorithm, has been extended to various settings in deep learning, including time-varying topologies (Lu & Wu, 2020; Koloskova et al., 2020), asynchronous settings (Lian et al., 2018; Xu et al., 2021; Nadiradze et al., 2021), directed topologies (Assran et al., 2019; Taheri et al., 2020), and data-heterogeneous scenarios (Tang et al., 2018; Vogels et al., 2021).
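The sharpness proxy recurring throughout the flatness literature above — the largest Hessian eigenvalue at a minimum — can be estimated without ever forming the Hessian, via power iteration on finite-difference Hessian-vector products. The sketch below is our own illustration on a toy quadratic, not code from any of the cited works:

```python
import numpy as np

def largest_hessian_eigenvalue(grad_fn, x, n_iters=100, eps=1e-4, seed=0):
    """Power iteration for lambda_max of the Hessian at x, using the
    finite-difference product H v ~= (g(x + eps*v) - g(x - eps*v)) / (2*eps)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iters):
        hv = (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)
        lam = float(v @ hv)          # Rayleigh quotient with the unit vector v
        v = hv / np.linalg.norm(hv)  # re-normalize for the next iteration
    return lam

# Toy quadratic f(x) = 0.5 * x @ diag(1, 5) @ x: the sharpness is 5.
grad = lambda x: np.array([1.0, 5.0]) * x
lam = largest_hessian_eigenvalue(grad, np.array([0.3, -0.2]))
assert abs(lam - 5.0) < 1e-3
```

Power iteration converges to the eigenvalue of largest magnitude, which coincides with the largest eigenvalue at a local minimum, where the Hessian is positive semi-definite.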



Figure 1: Comparison of the validation accuracy of C-SGD and D-SGD on CIFAR-10. The number of workers (one GPU per worker) is set to 16, and the local batch size is set to 64 and 512 per worker (1024 and 8192 total batch size, respectively). The training setting is described in Section 5.

Generalization of D-SGD. Recently, Sun et al. (2021) and Zhu et al. (2022) have established generalization bounds for D-SGD and have shown that decentralized training hurts generalization.

