SCALE-INVARIANT BAYESIAN NEURAL NETWORKS WITH CONNECTIVITY TANGENT KERNEL

Abstract

Studying the loss landscapes of neural networks is critical to understanding their generalization and avoiding overconfident predictions. Flatness, which measures the resilience of loss values to perturbations of pre-trained parameters, is widely acknowledged as an important predictor of generalization. While the concept of flatness has been formalized as a PAC-Bayes bound, it has been observed that such generalization bounds can vary arbitrarily with the scale of the model parameters. Despite previous attempts to address this issue, generalization bounds remain vulnerable to function-preserving scaling transformations or are limited to impractical network structures. In this paper, we introduce new PAC-Bayes prior and posterior distributions that are invariant to scaling transformations, achieved by decomposing perturbations into scale and connectivity components. This approach expands the range of networks to which the resulting generalization bound applies, including those subject to practical transformations such as weight decay with batch normalization. Moreover, we demonstrate that the scale-dependency of flatness can adversely affect the uncertainty calibration of the Laplace approximation, and we propose a solution using our invariant posterior. The proposed invariant posterior enables effective measurement of flatness and calibration at low computational cost, remains invariant to practical parameter transformations, and serves as a reliable predictor of neural network generalization.

1. INTRODUCTION

Neural networks (NNs) have succeeded tremendously, but understanding their generalization mechanism in real-world scenarios remains challenging (Kendall & Gal, 2017; Ovadia et al., 2019). Although it is widely recognized that NNs naturally generalize well and avoid overfitting, the underlying reasons are not well understood (Neyshabur et al., 2015b; Zhang et al., 2017; Arora et al., 2018). Recent studies on the loss landscapes of NNs attempt to address these issues. For example, Hochreiter & Schmidhuber (1995) proposed the flat minima (FM) hypothesis, which states that loss stability under parameter perturbations correlates positively with network generalizability, as empirically demonstrated by Jiang et al. (2020). However, the FM hypothesis still has limitations. According to Dinh et al. (2017), rescaling two successive layers can arbitrarily degrade a flatness measure while preserving the generalizability of the NN. Meanwhile, Li et al. (2018) argued that weight decay (WD) contradicts the FM hypothesis in practice: although WD sharpens pre-trained NNs (i.e., decreases loss resilience), it generally improves generalization. In short, these works suggest that transformations of network parameters (e.g., rescaling layers and WD) may contradict the FM hypothesis. A thorough discussion can be found in Appendix E. To resolve this contradiction, we investigate PAC-Bayesian prior and posterior distributions to derive a new scale-invariant generalization bound. As a result, our bound guarantees invariance under a general class of function-preserving scale transformations for a broad class of networks. Specifically, our bound is more general than existing works (Tsuzuku et al., 2020; Kwon et al., 2021), both in the transformations for which it guarantees invariance (e.g., activation-wise rescaling (Neyshabur et al., 2015a) and WD with batch normalization (BN; Ioffe & Szegedy (2015))) and in the NN architectures it covers.
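The rescaling issue raised by Dinh et al. (2017) can be made concrete with a toy example. The following numpy sketch (function names and the perturbation-based flatness proxy are ours, chosen for illustration only) rescales two successive layers of a small ReLU network by $\alpha$ and $1/\alpha$: the function is preserved exactly, yet the loss change under a fixed-radius perturbation of the second layer grows by a factor of $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def f(x, W1, W2):
    # Two-layer ReLU network: f(x) = W2 relu(W1 x)
    return W2 @ np.maximum(W1 @ x, 0.0)

alpha = 10.0
# Function-preserving rescaling (Dinh et al., 2017): relu(a z) = a relu(z) for a > 0
W1s, W2s = alpha * W1, W2 / alpha
assert np.allclose(f(x, W1, W2), f(x, W1s, W2s))  # same function

def sharpness_proxy(x, W1, W2, eps=1e-2):
    """Naive flatness proxy: output change under a fixed-size
    perturbation of the second-layer weights."""
    d = np.full_like(W2, eps)
    return np.linalg.norm(f(x, W1, W2 + d) - f(x, W1, W2))

print(sharpness_proxy(x, W1, W2))    # original network
print(sharpness_proxy(x, W1s, W2s))  # alpha times larger after rescaling
```

Since the perturbation response of the rescaled network scales exactly with $\alpha$, any such naive flatness measure can be made arbitrarily large or small without changing the learned function, which is precisely the FM contradiction this paper targets.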
Therefore, our bound is, to the best of our knowledge, the first to guarantee freedom from FM contradictions for practical NNs, including ResNet (He et al., 2016) and Transformer (Vaswani et al., 2017). Our generalization bound is derived from the scale invariance of the prior and posterior distributions, which guarantees not only the invariance of the bound itself but also that of its core quantity, a Kullback-Leibler (KL) divergence-based kernel. We call this kernel the empirical Connectivity Tangent Kernel (CTK), a modification of the empirical Neural Tangent Kernel (Jacot et al., 2018) with the scale-invariance property. Moreover, we define a new sharpness metric, named Connectivity Sharpness (CS), as the trace of the CTK. We show via empirical studies that CS predicts NN generalization performance better than existing sharpness measures (Liang et al., 2019; Neyshabur et al., 2017).

In the Bayesian NN regime, we connect the contradictions of the FM hypothesis with the issue of amplified predictive uncertainty. We then alleviate this issue using a Bayesian NN based on the posterior distribution of our PAC-Bayesian analysis. We name this Bayesian NN Connectivity Laplace (CL), as it can be seen as a variant of the Laplace approximation (LA; MacKay (1992)) using a different Jacobian. Specifically, we demonstrate the major pitfalls of WD with BN in LA and show how to remedy them using CL. We summarize our contributions as follows:

• Our novel PAC-Bayes generalization bound guarantees invariance under general function-preserving scale transformations for a broad class of networks (Sec. 2.2 and 2.3). We empirically verify that this bound gives non-vacuous results for ResNet with 11M parameters (Sec. 2.4).

• Based on our bound, we propose CS, a low-complexity sharpness metric (Sec. 2.5), which empirically shows a stronger correlation with generalization error than other metrics (Sec. 4.1).
• To prevent overconfident predictions, we show how our scale-invariant Bayesian NN resolves the pitfalls of WD with BN, demonstrating its practicality (Sec. 3 and 4.2).
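The scale-invariance claim behind CS can be previewed with a toy finite-difference sketch. Here we assume, for illustration, that the connectivity component multiplies parameters elementwise ($\theta \mapsto \theta \odot (1 + c)$), so the single-input CTK trace is the squared norm of the Jacobian of $f$ with respect to $c$; all names are ours and this is not the paper's implementation. Under the layer-rescaling transformation of Dinh et al. (2017), each $\partial f / \partial \theta_i$ scales inversely to $\theta_i$, so the products $\theta_i \, \partial f / \partial \theta_i$, and hence CS, are unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def f(x, theta):
    # Two-layer ReLU network with flattened parameter vector theta
    W1 = theta[:12].reshape(4, 3)
    W2 = theta[12:].reshape(2, 4)
    return W2 @ np.maximum(W1 @ x, 0.0)

def connectivity_sharpness(x, theta, eps=1e-5):
    """Single-input CTK trace: sum of squared derivatives of f w.r.t.
    connectivity c, where theta -> theta * (1 + c), evaluated at c = 0."""
    cs = 0.0
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        # df/dc_i = theta_i * df/dtheta_i, via a central difference
        g = (f(x, theta * (1 + d)) - f(x, theta * (1 - d))) / (2 * eps)
        cs += np.sum(g ** 2)
    return cs

theta = np.concatenate([W1.ravel(), W2.ravel()])
alpha = 7.0  # function-preserving rescaling of successive layers
theta_s = np.concatenate([(alpha * W1).ravel(), (W2 / alpha).ravel()])

print(connectivity_sharpness(x, theta))
print(connectivity_sharpness(x, theta_s))  # same value: invariant to rescaling
```

In contrast to the naive parameter-space flatness proxies that Dinh et al. (2017) showed to be manipulable, this quantity is unchanged by the rescaling, consistent with the invariance the paper establishes for the CTK.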

2. PAC-BAYES BOUND WITH SCALE-INVARIANCE

This section introduces a data-dependent PAC-Bayes generalization bound without scale-dependency issues. To this end, we introduce our setup in Sec. 2.1, construct the scale-invariant PAC-Bayes prior and posterior in Sec. 2.2, and present the detailed bound in Sec. 2.3. Then, we demonstrate the effectiveness of this bound for ResNet-18 with CIFAR in Sec. 2.4. An efficient proxy of this bound without complex hyper-parameter optimization is provided in Sec. 2.5.

2.1. BACKGROUND

Setup and Definitions. We consider a neural network (NN) $f(\cdot, \cdot): \mathbb{R}^D \times \mathbb{R}^P \to \mathbb{R}^K$, given input $x \in \mathbb{R}^D$ and network parameter $\theta \in \mathbb{R}^P$. Hereafter, for simplicity, we treat vectors as single-column matrices unless otherwise stated. We use the NN output $f(x, \theta)$ as the prediction for input $x$. Let $S := \{(x_n, y_n)\}_{n=1}^N$ denote the independent and identically distributed (i.i.d.) training data drawn from the true data distribution $\mathcal{D}$, where $x_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}^K$ are the input and output representations of the $n$-th training instance, respectively. For brevity, we denote the concatenated inputs and outputs of all instances as $\mathcal{X} \in \mathbb{R}^{ND}$ and $\mathcal{Y} \in \mathbb{R}^{NK}$, respectively, and $f(\mathcal{X}, \theta) \in \mathbb{R}^{NK}$ as the concatenation of $\{f(x_n, \theta)\}_{n=1}^N$. Given a prior distribution over network parameters $p(\theta)$ and a likelihood function $p(S|\theta) := \prod_{n=1}^N p(y_n | f(x_n, \theta))$, Bayesian inference defines the posterior distribution of the network parameter $\theta$ as $p(\theta|S) := \exp(-\mathcal{L}(S, \theta))/Z(S)$, where $\mathcal{L}(S, \theta) := -\log p(\theta) - \sum_{n=1}^N \log p(y_n | x_n, \theta)$ is the training loss and $Z(S) := \int p(\theta) p(S|\theta)\, d\theta$ is the normalizing constant. For example, the likelihood function for a regression task is Gaussian: $p(y|x, \theta) = \mathcal{N}(y \mid f(x, \theta), \sigma^2 I_K)$, where $\sigma$ is the (homoscedastic) observation noise scale. For a classification task, we treat it as a one-hot regression task, following Lee et al. (2019a) and He et al. (2020). While we adopt this modification for theoretical tractability, Hui & Belkin (2021) showed that it offers performance competitive with the cross-entropy loss. Details on this modification are given in Appendix C.
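The one-hot regression treatment of classification amounts to placing an isotropic Gaussian likelihood on one-hot targets. The following minimal numpy sketch (labels, outputs, and $\sigma$ are illustrative values of ours) computes the resulting negative log-likelihood term $-\sum_n \log \mathcal{N}(y_n \mid f(x_n, \theta), \sigma^2 I_K)$ of the training loss $\mathcal{L}(S, \theta)$:

```python
import numpy as np

# Toy one-hot regression: N=3 instances, K=3 classes, illustrative values
labels = np.array([0, 2, 1])
K, sigma = 3, 0.1
Y = np.eye(K)[labels]           # one-hot targets y_n in R^K
F = np.array([[0.9, 0.1, 0.0],  # stand-in network outputs f(x_n, theta)
              [0.0, 0.2, 0.8],
              [0.1, 0.7, 0.2]])

# -sum_n log N(y_n | f(x_n, theta), sigma^2 I_K) for an isotropic Gaussian:
# squared-error term plus the log-normalizer over all N*K output dimensions
nll = np.sum((Y - F) ** 2) / (2 * sigma ** 2) \
      + Y.size / 2 * np.log(2 * np.pi * sigma ** 2)
print(nll)
```

Up to the additive constant and the scaling by $\sigma^2$, this is simply the sum-of-squares loss, which is what makes the one-hot regression view analytically tractable for the kernel-based analysis that follows.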



1 Code is available at https://github.com/sungyubkim/connectivity-tangent-kernel

