SCALE-INVARIANT BAYESIAN NEURAL NETWORKS WITH CONNECTIVITY TANGENT KERNEL

Abstract

Studying the loss landscapes of neural networks is critical to understanding generalization and avoiding overconfident predictions. Flatness, which measures the perturbation resilience of pre-trained parameters with respect to loss values, is widely acknowledged as an essential predictor of generalization. While the concept of flatness has been formalized as a PAC-Bayes bound, it has been observed that such generalization bounds can vary arbitrarily depending on the scale of the model parameters. Despite previous attempts to address this issue, generalization bounds remain vulnerable to function-preserving scaling transformations or are limited to impractical network structures. In this paper, we introduce new PAC-Bayes prior and posterior distributions invariant to scaling transformations, achieved by decomposing perturbations into scale and connectivity components. This approach expands the range of networks to which the resulting generalization bound can be applied, including those subject to practical transformations such as weight decay with batch normalization. Moreover, we demonstrate that the scale-dependency issues of flatness can adversely affect the uncertainty calibration of the Laplace approximation, and we propose a solution using our invariant posterior. The proposed invariant posterior enables effective measurement of flatness and calibration at low complexity while remaining invariant to practical parameter transformations, making it a reliable predictor of neural network generalization.

1. INTRODUCTION

Neural networks (NNs) have succeeded tremendously, but understanding their generalization mechanism in real-world scenarios remains challenging (Kendall & Gal, 2017; Ovadia et al., 2019). Although it is widely recognized that NNs naturally generalize well and avoid overfitting, the underlying reasons are not well understood (Neyshabur et al., 2015b; Zhang et al., 2017; Arora et al., 2018). Recent studies on the loss landscapes of NNs attempt to address these issues. For example, Hochreiter & Schmidhuber (1995) proposed the flat minima (FM) hypothesis, which states that loss stability under parameter perturbations positively correlates with network generalizability, as empirically demonstrated by Jiang et al. (2020). However, the FM hypothesis still has limitations. According to Dinh et al. (2017), rescaling two successive layers can arbitrarily degrade a flatness measure while maintaining the generalizability of the NN. Meanwhile, Li et al. (2018) argued that weight decay (WD) leads to a contradiction of the FM hypothesis in practice: although WD sharpens pre-trained NNs (i.e., decreases loss resilience), it generally improves generalization. In short, these works suggest that transformations of network parameters (e.g., rescaling layers and WD) may lead to contradictions of the FM hypothesis. A thorough discussion can be found in Appendix E. To resolve this contradiction, we investigate PAC-Bayesian prior and posterior distributions to derive a new scale-invariant generalization bound. As a result, our bound guarantees invariance for a general class of function-preserving scale transformations with a broad class of networks. Specifically, our bound is more general than existing works (Tsuzuku et al., 2020; Kwon et al., 2021), both in terms of the transformations for which invariance is guaranteed (e.g., activation-wise rescaling (Neyshabur et al., 2015a) and WD with batch normalization (BN; Ioffe & Szegedy (2015))) and in terms of NN architectures.
Therefore, our bound is, for the first time, free of the FM contradiction, which should not occur in practical NNs, including ResNet (He et al., 2016) and Transformer (Vaswani et al., 2017). Our generalization bound is derived from the scale invariance of the prior and posterior distributions, guaranteeing not only its own scale invariance but also that of its substance, the Kullback-Leibler (KL) divergence-based kernel. We call this kernel the empirical Connectivity Tangent Kernel (CTK), a modification of the empirical Neural Tangent Kernel (Jacot et al., 2018) with the scale-invariance property. Moreover, we define a new sharpness metric as the trace of the CTK, named Connectivity Sharpness (CS). We show via empirical studies that CS predicts NN generalization performance better than existing sharpness measures (Liang et al., 2019; Neyshabur et al., 2017). In Bayesian NN regimes, we connect the contradictions of the FM hypothesis with the issue of amplified predictive uncertainty. We then alleviate this issue using a Bayesian NN based on the posterior distribution of our PAC-Bayesian analysis. We name this Bayesian NN Connectivity Laplace (CL), as it can be seen as a variation of the Laplace approximation (LA; MacKay (1992)) using a different Jacobian. Specifically, we demonstrate the major pitfalls of WD with BN in LA and show how to remedy this issue using CL. We summarize our contributions as follows:
• Our novel PAC-Bayes generalization bound guarantees invariance under general function-preserving scale transformations for a broad class of networks (Sec. 2.2 and 2.3). We empirically verify that this bound gives non-vacuous results for ResNet with 11M parameters (Sec. 2.4).
• Based on our bound, we propose a low-complexity sharpness metric, CS (Sec. 2.5), which empirically shows a stronger correlation with generalization error than other metrics (Sec. 4.1).
• To prevent overconfident predictions, we show how our scale-invariant Bayesian NN can be used to solve pitfalls of WD with BNs, proving its practicality (Sec. 3 and 4.2).

2. PAC-BAYES BOUND WITH SCALE-INVARIANCE

This section introduces a data-dependent PAC-Bayes generalization bound without scale-dependency issues. To this end, we introduce our setup in Sec. 2.1, construct the scale-invariant PAC-Bayes prior and posterior in Sec. 2.2, and present the detailed bound in Sec. 2.3. Then, we demonstrate the effectiveness of this bound for ResNet-18 with CIFAR in Sec. 2.4. An efficient proxy of this bound without complex hyper-parameter optimization is provided in Sec. 2.5.

2.1. BACKGROUND

Setup and Definitions. We consider a neural network (NN) $f(\cdot,\cdot) : \mathbb{R}^D \times \mathbb{R}^P \to \mathbb{R}^K$, given input $x \in \mathbb{R}^D$ and network parameter $\theta \in \mathbb{R}^P$. Hereafter, for simplicity, we consider vectors as single-column matrices unless otherwise stated. We use the output of the NN $f(x, \theta)$ as a prediction for input $x$. Let $S := \{(x_n, y_n)\}_{n=1}^N$ denote the independently and identically distributed (i.i.d.) training data drawn from the true data distribution $\mathcal{D}$, where $x_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}^K$ are the input and output representations of the $n$-th training instance, respectively. For simplicity, we denote the concatenated inputs and outputs of all instances as $\mathcal{X} \in \mathbb{R}^{ND}$ and $\mathcal{Y} \in \mathbb{R}^{NK}$, respectively, and $f(\mathcal{X}, \theta) \in \mathbb{R}^{NK}$ as the concatenation of $\{f(x_n, \theta)\}_{n=1}^N$. Given a prior distribution of network parameters $p(\theta)$ and a likelihood function $p(S|\theta) := \prod_{n=1}^N p(y_n | f(x_n, \theta))$, Bayesian inference defines the posterior distribution of the network parameter $\theta$ as $p(\theta|S) := \exp(-\mathcal{L}(S, \theta))/Z(S)$, where $\mathcal{L}(S, \theta) := -\log p(\theta) - \sum_{n=1}^N \log p(y_n | x_n, \theta)$ is the training loss and $Z(S) := \int p(\theta)\, p(S|\theta)\, d\theta$ is the normalizing factor. For example, the likelihood function for a regression task will be Gaussian: $p(y|x,\theta) = \mathcal{N}(y \mid f(x,\theta), \sigma^2 I_K)$, where $\sigma$ is the (homoscedastic) observation noise scale. For a classification task, we treat it as a one-hot regression task following Lee et al. (2019a); He et al. (2020). While we adopt this modification for theoretical tractability, Hui & Belkin (2021) showed that it offers performance competitive with the cross-entropy loss. Details on this modification are given in Appendix C. Laplace approximation. In general, the exact computation of the Bayesian posterior of a network parameter is intractable.
The Laplace approximation (LA; MacKay (1992)) approximates the posterior distribution with a Gaussian distribution defined as $p_{\mathrm{LA}}(\psi|S) \sim \mathcal{N}(\psi \mid \theta^*, (\nabla^2_\theta \mathcal{L}(S, \theta^*))^{-1})$, where $\theta^* \in \mathbb{R}^P$ is a parameter pre-trained with the training loss and $\nabla^2_\theta \mathcal{L}(S, \theta^*) \in \mathbb{R}^{P \times P}$ is the Hessian matrix of the loss function w.r.t. the parameter at $\theta^*$. Recent works on LA replace the Hessian matrix with the (generalized) Gauss-Newton matrix to ease computation (Khan et al., 2019; Immer et al., 2021). With this approximation, the LA posterior of the regression problem can be represented as
$$p_{\mathrm{LA}}(\psi|S) \sim \mathcal{N}\Big(\psi \,\Big|\, \theta^*, \big(\underbrace{I_P/\alpha^2}_{\text{damping}} + \underbrace{J_\theta^\top J_\theta/\sigma^2}_{\text{curvature}}\big)^{-1}\Big) \quad (1)$$
where $\alpha, \sigma > 0$, $I_P \in \mathbb{R}^{P \times P}$ is the identity matrix, and $J_\theta \in \mathbb{R}^{NK \times P}$ is the concatenation of $J_\theta(x, \theta^*) \in \mathbb{R}^{K \times P}$ (the Jacobian of $f$ w.r.t. $\theta$ at input $x$ and parameter $\theta^*$) along the training inputs $\mathcal{X}$. Inference with LA requires a further sub-curvature approximation for modern NN architectures (e.g., ResNet (He et al., 2016) and Transformer (Vaswani et al., 2017)) because of the prohibitively large covariance matrix. Such approximations include the diagonal, Kronecker-factored approximate curvature (KFAC), last-layer, and subnetwork approximations (Ritter et al., 2018; Kristiadi et al., 2020; Daxberger et al., 2021). Meanwhile, it is well known that a proper selection of the prior scale $\alpha$ is needed to balance the dilemma between overconfidence and underfitting in LA. PAC-Bayes bound with data-dependent prior (Perez-Ortiz et al. (2021)). Let $\mathcal{P}$ be a PAC-Bayes prior distribution over $\mathbb{R}^P$ independent of the training dataset $S$, and let $\mathrm{err}(\cdot,\cdot) : \mathbb{R}^K \times \mathbb{R}^K \to [0, 1]$ be an error function defined separately from the loss function. For any constant $\delta \in (0, 1]$ and any PAC-Bayes posterior distribution $\mathcal{Q}$ over $\mathbb{R}^P$, the following holds with probability at least $1-\delta$:
$$\mathrm{err}_{\mathcal{D}}(\mathcal{Q}) \le \mathrm{err}_S(\mathcal{Q}) + \sqrt{\frac{\mathrm{KL}[\mathcal{Q} \| \mathcal{P}] + \log(2\sqrt{N}/\delta)}{2N}}$$
where $\mathrm{err}_{\mathcal{D}}(\mathcal{Q}) := \mathbb{E}_{(x,y)\sim\mathcal{D},\, \theta\sim\mathcal{Q}}[\mathrm{err}(f(x,\theta), y)]$, $\mathrm{err}_S(\mathcal{Q}) := \mathbb{E}_{(x,y)\sim S,\, \theta\sim\mathcal{Q}}[\mathrm{err}(f(x,\theta), y)]$, and $N$ denotes the cardinality of $S$.
That is, $\mathrm{err}_{\mathcal{D}}(\mathcal{Q})$ and $\mathrm{err}_S(\mathcal{Q})$ are the generalization error and empirical error, respectively. The only restriction on $\mathcal{P}$ here is that it cannot depend on the dataset $S$. Following the recent discussion in Perez-Ortiz et al. (2021), one can construct data-dependent PAC-Bayes bounds by (i) randomly partitioning the dataset $S$ into $S_{\mathcal{Q}}$ and $S_{\mathcal{P}}$ so that they are independent, (ii) pre-training a PAC-Bayes prior distribution $\mathcal{P}_D$ dependent only on $S_{\mathcal{P}}$ (i.e., $\mathcal{P}_D$ is a valid PAC-Bayes prior due to its independence of $S_{\mathcal{Q}}$), (iii) fine-tuning a PAC-Bayes posterior distribution $\mathcal{Q}$ dependent on the entire dataset $S$, and (iv) computing the empirical error $\mathrm{err}_{S_{\mathcal{Q}}}(\mathcal{Q})$ with the target subset $S_{\mathcal{Q}}$ (not the entire dataset $S$). In summary, we modify the aforementioned original PAC-Bayes bound through a data-dependent prior $\mathcal{P}_D$ as
$$\mathrm{err}_{\mathcal{D}}(\mathcal{Q}) \le \mathrm{err}_{S_{\mathcal{Q}}}(\mathcal{Q}) + \sqrt{\frac{\mathrm{KL}[\mathcal{Q} \| \mathcal{P}_D] + \log(2\sqrt{N_{\mathcal{Q}}}/\delta)}{2 N_{\mathcal{Q}}}} \quad (2)$$
where $N_{\mathcal{Q}}$ is the cardinality of $S_{\mathcal{Q}}$. We denote the input and output sets of the partitioned datasets $(S_{\mathcal{P}}, S_{\mathcal{Q}})$ by $\mathcal{X}_{\mathcal{P}}, \mathcal{Y}_{\mathcal{P}}, \mathcal{X}_{\mathcal{Q}}, \mathcal{Y}_{\mathcal{Q}}$ for simplicity.
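As a concrete illustration, the bound in equation 2 is cheap to evaluate once the empirical error on the held-out subset and the KL term are estimated. The following is a minimal sketch with hypothetical numbers; the helper `pac_bayes_bound` and all of its inputs are our own illustration, not values from the paper:

```python
import math

def pac_bayes_bound(emp_err, kl, n_q, delta=0.05):
    # err_D(Q) <= err_SQ(Q) + sqrt((KL[Q||P_D] + log(2*sqrt(N_Q)/delta)) / (2*N_Q))
    complexity = (kl + math.log(2.0 * math.sqrt(n_q) / delta)) / (2.0 * n_q)
    return emp_err + math.sqrt(complexity)

# hypothetical numbers: a 5K held-out subset (as in Sec. 2.4) and a KL of 250 nats
bound = pac_bayes_bound(emp_err=0.08, kl=250.0, n_q=5000)
```

A bound below 1.0 is non-vacuous for 0-1 error; note how the bound tightens as the KL term shrinks or the held-out set grows.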

2.2. SCALE-INVARIANT PRIOR AND POSTERIOR FROM LINEARIZATION W.R.T. CONNECTIVITY

Our goal is to construct scale-invariant $\mathcal{P}_D$ and $\mathcal{Q}$. To this end, we assume a parameter $\theta^* \in \mathbb{R}^P$ pre-trained on the prior dataset $S_{\mathcal{P}}$. This parameter can be attained with standard NN optimization procedures (e.g., stochastic gradient descent (SGD) with momentum). Then, we consider a linearized NN at the pre-trained parameter with an auxiliary variable $c \in \mathbb{R}^P$ as
$$g^{\mathrm{lin}}_{\theta^*}(x, c) := f(x, \theta^*) + J_\theta(x, \theta^*)\,\mathrm{diag}(\theta^*)\, c \quad (3)$$
where $\mathrm{diag}$ is the vector-to-matrix diagonal operator. Note that equation 3 is the first-order Taylor approximation of the NN with perturbation $\theta^* \odot c$ given input $x$ and parameter $\theta^*$: $g^{\mathrm{pert}}_{\theta^*}(x, c) := f(x, \theta^* + \theta^* \odot c) \approx g^{\mathrm{lin}}_{\theta^*}(x, c)$, where $\odot$ denotes element-wise multiplication of two vectors. Here we express the perturbation in parameter space as $\theta^* \odot c$ instead of a single variable such as $\delta \in \mathbb{R}^P$. By decomposing the perturbation into scale and connectivity, this linearization design matches the scale of the perturbation (i.e., $\theta^* \odot c$) to the scale of $\theta^*$ in a component-wise manner. Note that a similar decomposition was proposed in pruning at initialization (Lee et al., 2019c;b) to measure the importance of each connection independently of its weight. However, they only consider this form to predict the effect of each connection before pre-training. Based on equation 3, we define a data-dependent prior ($\mathcal{P}_D$) over connectivity as
$$\mathcal{P}_{\theta^*}(c) := \mathcal{N}(c \mid 0_P, \alpha^2 I_P). \quad (4)$$
This distribution can be translated into a distribution over parameters by considering the distribution of the perturbed parameter ($\psi := \theta^* + \theta^* \odot c$): $\mathcal{P}_{\theta^*}(\psi) := \mathcal{N}(\psi \mid \theta^*, \alpha^2\, \mathrm{diag}(\theta^*)^2)$. We now define the PAC-Bayes posterior over connectivity $\mathcal{Q}(c)$ as follows:
$$\mathcal{Q}_{\theta^*}(c) := \mathcal{N}(c \mid \mu_{\mathcal{Q}}, \Sigma_{\mathcal{Q}}), \quad (5)$$
$$\mu_{\mathcal{Q}} := \frac{\Sigma_{\mathcal{Q}}\, J_c^\top (\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2} = \frac{\Sigma_{\mathcal{Q}}\, \mathrm{diag}(\theta^*)\, J_\theta^\top (\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2}, \quad (6)$$
$$\Sigma_{\mathcal{Q}} := \left(\frac{I_P}{\alpha^2} + \frac{J_c^\top J_c}{\sigma^2}\right)^{-1} = \left(\frac{I_P}{\alpha^2} + \frac{\mathrm{diag}(\theta^*)\, J_\theta^\top J_\theta\, \mathrm{diag}(\theta^*)}{\sigma^2}\right)^{-1} \quad (7)$$
where $J_c \in \mathbb{R}^{NK \times P}$ is the concatenation of $J_c(x, 0_P) := J_\theta(x, \theta^*)\,\mathrm{diag}(\theta^*) \in \mathbb{R}^{K \times P}$ (i.e., the Jacobian of the perturbed NN $g^{\mathrm{pert}}_{\theta^*}(x, c)$ w.r.t. $c$ at input $x$ and connectivity $0_P$) along the training inputs $\mathcal{X}$. Indeed, $\mathcal{Q}_{\theta^*}$ is the posterior of Bayesian linear regression w.r.t. the connectivity $c$. We refer to Appendix D for detailed derivations. Again, it is equivalent to the posterior distribution over parameters $\mathcal{Q}_{\theta^*}(\psi) = \mathcal{N}\big(\psi \mid \theta^* + \theta^* \odot \mu_{\mathcal{Q}},\ (\mathrm{diag}(\theta^*)^{-2}/\alpha^2 + J_\theta^\top J_\theta/\sigma^2)^{-1}\big)$, where $\mathrm{diag}(\theta^*)^{-2} := (\mathrm{diag}(\theta^*)^{-1})^2$, assuming all components of $\theta^*$ are non-zero. This assumption can easily be satisfied by considering the prior and posterior distributions over only the non-zero components of the NN. Although we choose this restriction for theoretical tractability, future work can relax it to achieve diverse predictions by considering the distribution of the zero coordinates. In summary, a data-dependent PAC-Bayes bound can be computed with our PAC-Bayes distributions. The validity of this data-dependent PAC-Bayes bound is ensured as follows: our PAC-Bayes prior depends on $S_{\mathcal{P}}$ through $\theta^*$, but is independent of $S_{\mathcal{Q}}$, which measures the errors. Note that the two-phase training (i.e., pre-training with $S_{\mathcal{P}}$ and fine-tuning with $S$) explained in Sec. 2.1 is used to attain our PAC-Bayes posterior. Similar ideas of two-phase training with linearization were proposed in the context of transfer learning in Achille et al. (2021); Maddox et al. (2021). In transfer learning, there is a distribution shift between $S_{\mathcal{P}}$ and $S_{\mathcal{Q}}$; therefore, $S_{\mathcal{P}}$ cannot be used in their fine-tuning phase, in contrast to our PAC-Bayes posterior. We now provide an invariance property of our prior and posterior distributions w.r.t. function-preserving scale transformations. The main intuition behind the following proposition is that the Jacobian w.r.t. connectivity is invariant to function-preserving scaling transformations, i.e., $J_\theta(x, T(\theta^*))\,\mathrm{diag}(T(\theta^*)) = J_\theta(x, \theta^*)\,\mathrm{diag}(\theta^*)$.
Proposition 2.1 (Scale-invariance of PAC-Bayes prior and posterior). Let $T : \mathbb{R}^P \to \mathbb{R}^P$ be an invertible diagonal linear transformation such that $f(x, T(\psi)) = f(x, \psi)$ for all $x \in \mathbb{R}^D$ and all $\psi \in \mathbb{R}^P$. Then, both the PAC-Bayes prior and posterior are invariant under $T$: $\mathcal{P}_{T(\theta^*)}(c) \stackrel{d}{=} \mathcal{P}_{\theta^*}(c)$ and $\mathcal{Q}_{T(\theta^*)}(c) \stackrel{d}{=} \mathcal{Q}_{\theta^*}(c)$. Furthermore, the generalization and empirical errors are also invariant to $T$. Representative cases of $T$ in Proposition 2.1 are presented in Appendix E to highlight the theoretical implications; these include the case of WD applied to general networks, including those with BN.
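The key identity behind Proposition 2.1, $J_\theta(x, T(\theta^*))\,\mathrm{diag}(T(\theta^*)) = J_\theta(x, \theta^*)\,\mathrm{diag}(\theta^*)$, can be checked numerically. The sketch below is our own toy example: a two-layer ReLU network under the layer-wise rescaling of Dinh et al. (2017) (first layer multiplied by $s$, second divided by $s$), which preserves the function while changing the parameter scale:

```python
import numpy as np

def f(x, w1, w2):
    # tiny two-layer ReLU net: f(x) = w2 . relu(W1 x)
    return w2 @ np.maximum(w1 @ x, 0.0)

def connectivity_jacobian(x, w1, w2):
    # J_theta(x, theta) @ diag(theta), computed analytically for this toy net
    pre = w1 @ x
    act = (pre > 0).astype(float)
    d_w1 = np.outer(w2 * act, x)     # df/dW1
    d_w2 = np.maximum(pre, 0.0)      # df/dw2
    return np.concatenate([(d_w1 * w1).ravel(), d_w2 * w2])

rng = np.random.default_rng(1)
x = rng.normal(size=3)
w1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4)

s = 10.0  # function-preserving rescaling of two successive ReLU layers
jc = connectivity_jacobian(x, w1, w2)
jc_scaled = connectivity_jacobian(x, s * w1, w2 / s)
```

Here `jc` and `jc_scaled` agree componentwise, even though the parameter-space Jacobian itself changes by factors of $s$ and $1/s$; this is exactly why the CTK is scale-invariant while the NTK is not.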

2.3. RESULTING PAC-BAYES BOUND

Now we plug our prior and posterior into the modified PAC-Bayes generalization error bound in equation 2. As a result, we obtain a novel generalization error bound, named PAC-Bayes-CTK, which is guaranteed to be invariant to scale transformations (hence free of the contradiction of the FM hypothesis mentioned in Sec. 1).

Theorem 2.2 (PAC-Bayes-CTK and its invariance). Assume a parameter $\theta^*$ pre-trained with data $S_{\mathcal{P}}$. By applying $\mathcal{P}_{\theta^*}$ and $\mathcal{Q}_{\theta^*}$ to the data-dependent PAC-Bayes bound (equation 2), we get
$$\mathrm{err}_{\mathcal{D}}(\mathcal{Q}_{\theta^*}) \le \mathrm{err}_{S_{\mathcal{Q}}}(\mathcal{Q}_{\theta^*}) + \sqrt{\underbrace{\underbrace{\frac{\mu_{\mathcal{Q}}^\top \mu_{\mathcal{Q}}}{4\alpha^2 N_{\mathcal{Q}}}}_{\text{(average) perturbation}} + \underbrace{\frac{\sum_{i=1}^P h(\beta_i)}{4 N_{\mathcal{Q}}}}_{\text{sharpness}}}_{\text{KL divergence}} + \frac{\log(2\sqrt{N_{\mathcal{Q}}}/\delta)}{2 N_{\mathcal{Q}}}} \quad (8)$$
where $\{\beta_i\}_{i=1}^P$ are the eigenvalues of $(I_P + \frac{\alpha^2}{\sigma^2} J_c^\top J_c)^{-1}$ and $h(x) := x - \log(x) - 1$. This upper bound is invariant to any function-preserving scale transformation $T$ by Proposition 2.1.

Note that recent works on the FM contradiction focus only on the scale-invariance of sharpness metrics: indeed, their generalization bounds are not invariant to scale transformations due to scale-dependent terms (equation (34) in Tsuzuku et al. (2020) and equation (6) in Kwon et al. (2021)). Specifically, these terms are proportional to the norm of the pre-trained parameters. On the other hand, the generalization bound in Petzka et al. (2021) (Theorem 11 in their paper) only holds for single-layer NNs, whereas our bound has no restrictions on network structure. As a result, our PAC-Bayes bound is, to the best of our knowledge, the first scale-invariant PAC-Bayes bound. The following corollary explains why we name the bound in Theorem 2.2 PAC-Bayes-CTK.

Corollary 2.3 (Relation between CTK and PAC-Bayes-CTK). Let us define the empirical Connectivity Tangent Kernel (CTK) of $S$ as $C^{\theta^*}_{\mathcal{X}} := J_c J_c^\top = J_\theta\, \mathrm{diag}(\theta^*)^2 J_\theta^\top \in \mathbb{R}^{NK \times NK}$.
Note that the empirical CTK has $T\ (\le NK)$ non-zero eigenvalues $\{\lambda_i\}_{i=1}^T$; then the following holds for the $\{\beta_i\}_{i=1}^P$ in Theorem 2.2: (i) $\beta_i = \sigma^2/(\sigma^2 + \alpha^2 \lambda_i) < 1$ for $i = 1, \dots, T$ and (ii) $\beta_i = 1$ for $i = T+1, \dots, P$. Since $h(1) = 0$, $P - T$ terms of the summation in the sharpness part of PAC-Bayes-CTK vanish. Furthermore, the sharpness term of PAC-Bayes-CTK is a monotonically increasing function of each eigenvalue of the empirical CTK. Corollary 2.3 clarifies why $\sum_{i=1}^P h(\beta_i)/4N_{\mathcal{Q}}$ in Theorem 2.2 is called the sharpness term of PAC-Bayes-CTK: a sharply pre-trained parameter has large CTK eigenvalues (since the eigenvalues of the CTK measure the sensitivity of the output w.r.t. connectivity), increasing the sharpness term and thus the generalization gap. Finally, Proposition 2.4 shows that the empirical CTK itself is scale-invariant.

Proposition 2.4 (Scale-invariance of empirical CTK). Let $T : \mathbb{R}^P \to \mathbb{R}^P$ be a function-preserving scale transformation as in Proposition 2.1. Then the empirical CTK at parameter $\psi$ is invariant under $T$: $C^{T(\psi)}_{xy} = C^{\psi}_{xy}$ for all $x, y \in \mathbb{R}^D$ and all $\psi \in \mathbb{R}^P$.

Remark 2.5 (Connections to empirical NTK). The empirical CTK $C^{\psi}_{xy}$ resembles the existing empirical Neural Tangent Kernel (NTK) at parameter $\psi$ (Jacot et al., 2018): $\Theta^{\psi}_{xy} := J_\theta(x, \psi) J_\theta(y, \psi)^\top \in \mathbb{R}^{K \times K}$. Note that the deterministic NTK in Jacot et al. (2018) is the infinite-width limiting kernel at initialized parameters, while the empirical NTK can be defined for any (finite-width) NN. We focus on empirical kernels for finite pre-trained parameters throughout the paper and leave deterministic kernels for future study. Comparing the empirical kernels, the main difference between the empirical CTK and the existing empirical NTK is the definition of the Jacobian: in the CTK, the Jacobian is computed w.r.t. connectivity $c$, while the empirical NTK uses the Jacobian w.r.t. parameters $\theta$.
Therefore, another PAC-Bayes bound can be derived from the linearization $f^{\mathrm{lin}}_{\theta^*}(x, \delta) := f(x, \theta^*) + J_\theta(x, \theta^*)\delta$. As this bound is related to the eigenvalues of $\Theta^{\theta^*}_{\mathcal{X}}$, we call it PAC-Bayes-NTK and provide its derivation in Appendix B. Note that PAC-Bayes-NTK is scale-variant, as $\Theta^{T(\psi)}_{xy} \ne \Theta^{\psi}_{xy}$ in general.
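The spectral correspondence of Corollary 2.3 between the CTK eigenvalues and the $\{\beta_i\}$ of Theorem 2.2 can be verified directly on a random stand-in Jacobian (toy sizes and values are our own, not from the paper): since $J_c J_c^\top$ has at most $NK$ non-zero eigenvalues, $P - NK$ of the $\beta_i$ equal $1$ and contribute nothing to the sharpness term.

```python
import numpy as np

rng = np.random.default_rng(2)
P, NK = 8, 3                      # more parameters than stacked outputs
alpha, sigma = 2.0, 0.5
J_c = rng.normal(size=(NK, P))    # stand-in connectivity Jacobian (toy, not a real net)

# {beta_i}: eigenvalues of (I_P + (alpha^2/sigma^2) J_c^T J_c)^{-1}  (Theorem 2.2)
betas = np.linalg.eigvalsh(np.linalg.inv(np.eye(P) + (alpha / sigma) ** 2 * J_c.T @ J_c))

# {lambda_i}: eigenvalues of the empirical CTK  C = J_c J_c^T  (at most NK non-zero)
lams = np.linalg.eigvalsh(J_c @ J_c.T)

# Corollary 2.3: beta_i = sigma^2 / (sigma^2 + alpha^2 lambda_i) for the non-zero
# lambdas, and beta_i = 1 (so h(beta_i) = 0) for the remaining P - NK indices
h = lambda b: b - np.log(b) - 1.0
sharpness_sum = h(betas).sum()    # the un-normalized sharpness term of eq. 8
```

Only the $NK$ kernel eigenvalues matter, which is what makes the sharpness term computable from the $NK \times NK$ kernel rather than the $P \times P$ covariance.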

2.4. COMPUTING APPROXIMATE BOUND IN REAL WORLD PROBLEMS

To verify that the PAC-Bayes bound in Theorem 2.2 is non-vacuous, we compute it for real-world problems. We use the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009), where the 50K training instances are randomly partitioned into $S_{\mathcal{P}}$ of cardinality 45K and $S_{\mathcal{Q}}$ of cardinality 5K. We refer to Appendix H for detailed experimental settings. To compute equation 8, one needs (i) the $\mu_{\mathcal{Q}}$-based perturbation term, (ii) the $C^{\theta^*}_{\mathcal{X}}$-based sharpness term, and (iii) samples from the PAC-Bayes posterior $\mathcal{Q}_{\theta^*}$. $\mu_{\mathcal{Q}}$ in equation 6 can be obtained by minimizing
$$\mathcal{L}(c) = \frac{1}{2N} \left\| \mathcal{Y} - f(\mathcal{X}, \theta^*) - J_c c \right\|^2 + \frac{\sigma^2}{2\alpha^2 N}\, c^\top c$$
over $c \in \mathbb{R}^P$, by the first-order optimality condition. Note that this is a convex optimization problem w.r.t. $c$, since $c$ is the parameter of a linear regression problem. We use the Adam optimizer (Kingma & Ba, 2014) with a fixed learning rate of 1e-4 to solve it. For the sharpness term, we apply the Lanczos algorithm to approximate the eigenspectrum of $C^{\theta^*}_{\mathcal{X}}$, following Ghorbani et al. (2019), using 100 Lanczos iterations based on their setting. Lastly, we estimate the empirical and test errors with 8 samples from the CL/LL implementation of the Randomize-Then-Optimize (RTO) framework (Bardsley et al., 2014; Matthews et al., 2017). The pseudo-code and computational complexity of the RTO implementation can be found in Appendix F. Table 1 provides the bounds and related terms of PAC-Bayes-CTK (Theorem 2.2) and PAC-Bayes-NTK (Theorem B.1). First, we found that our estimated PAC-Bayes-CTK and NTK bounds are non-vacuous (i.e., better than guessing at random) for ResNet-18 with 11M parameters. Note that deriving a non-vacuous bound is challenging in PAC-Bayes analysis: only a few PAC-Bayes works (Dziugaite & Roy, 2017; Zhou et al., 2018; Perez-Ortiz et al., 2021) verified the non-vacuous property of their bounds, and other PAC-Bayes works (Foret et al., 2020; Tsuzuku et al., 2020) did not.
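For intuition on step (i), the convex objective for $\mu_{\mathcal{Q}}$ can be minimized on a toy problem by plain gradient descent (the paper uses Adam on the network's linearization; the random stand-in Jacobian, sizes, and step count below are our own) and recovers the closed-form ridge solution implied by first-order optimality:

```python
import numpy as np

rng = np.random.default_rng(6)
P, N = 6, 4
alpha, sigma = 1.0, 0.5
J_c = rng.normal(size=(N, P))          # stand-in connectivity Jacobian
residual = rng.normal(size=N)          # stand-in for Y - f(X, theta*)

# closed-form ridge solution (first-order optimality of the convex objective)
mu_closed = np.linalg.solve(J_c.T @ J_c + (sigma / alpha) ** 2 * np.eye(P),
                            J_c.T @ residual)

# gradient descent on L(c) = ||residual - J_c c||^2/(2N) + sigma^2 c^T c/(2 alpha^2 N)
c = np.zeros(P)
lr = 0.05
for _ in range(20000):
    grad = (-J_c.T @ (residual - J_c @ c) + (sigma / alpha) ** 2 * c) / N
    c -= lr * grad
```

The regularizer weight $\sigma^2/\alpha^2$ is exactly the damping that appears in $\Sigma_{\mathcal{Q}}$, so the optimum coincides with $\mu_{\mathcal{Q}}$ of equation 6.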
To check the invariance property of PAC-Bayes-CTK, we scale the scale-invariant parameters of ResNet-18 (i.e., parameters preceding BN layers) by fixed constants {0.5, 1.0, 2.0, 4.0}. Due to the BN layers, these transformations do not affect the function represented by the NN, so the error bounds should be preserved for scale-invariant bounds. Table 1 shows that the PAC-Bayes-CTK bound is stable under these transformations. On the other hand, the PAC-Bayes-NTK bound is very sensitive to the parameter scale.

2.5. CONNECTIVITY SHARPNESS AND ITS EFFICIENT COMPUTATION

Now, we focus on the fact that the trace of the CTK is also invariant to the parameter scale by Proposition 2.4. Unlike PAC-Bayes-CTK and NTK, the traces of the CTK and NTK do not require an onerous hyper-parameter selection of $\delta, \alpha, \sigma$. Therefore, we simply define $\mathrm{CS}(\theta^*) := \mathrm{tr}(C^{\theta^*}_{\mathcal{X}})$ as a practical sharpness measure at $\theta^*$, named Connectivity Sharpness (CS), to detour the complex computation of PAC-Bayes-CTK. This metric can easily be applied to find NNs with better generalization, similar to other sharpness metrics (e.g., the trace of the Hessian), as shown in Jiang et al. (2020). We evaluate the detection performance of CS in Sec. 4.1. The following corollary shows, conceptually, how CS can explain the generalization performance of NNs.

Corollary 2.6 (Connectivity Sharpness, informal). Assume the CTK and the KL divergence term of PAC-Bayes-CTK as defined in Theorem 2.2. Then, if CS vanishes to zero or diverges to infinity, the KL divergence term of PAC-Bayes-CTK does so as well.

As the trace of a matrix can be efficiently estimated by Hutchinson's method (Hutchinson, 1989), one can compute CS without explicitly computing the entire CTK. We refer to Appendix F for the detailed procedure for computing CS. As CS is invariant to function-preserving scale transformations by Proposition 2.4, it does not contradict the FM hypothesis.
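A minimal sketch of Hutchinson's estimator for CS. For a real network one would implement `vjp_c` with reverse-mode autodiff (vector-Jacobian products w.r.t. connectivity); here an explicit toy Jacobian, of our own construction, stands in for that routine:

```python
import numpy as np

def connectivity_sharpness_hutchinson(vjp_c, dim_out, n_probes=2000, seed=0):
    """Estimate CS = tr(C) = tr(J_c J_c^T) with Hutchinson's method.
    vjp_c(v) must return J_c^T v; then v^T C v = ||J_c^T v||^2."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=dim_out)   # Rademacher probe
        total += np.sum(vjp_c(v) ** 2)              # v^T J_c J_c^T v
    return total / n_probes

# toy check against the exact trace tr(J_c J_c^T) = ||J_c||_F^2
rng = np.random.default_rng(3)
J_theta, theta = rng.normal(size=(5, 12)), rng.normal(size=12)
J_c = J_theta * theta                               # J_theta @ diag(theta)
cs_est = connectivity_sharpness_hutchinson(lambda v: J_c.T @ v, dim_out=5)
cs_exact = np.sum(J_c ** 2)
```

The estimator needs only matrix-free products, so the $NK \times NK$ kernel is never materialized; this is what keeps CS low-complexity for large models.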

3. BAYESIAN NNS WITH SCALE-INVARIANCE

In this section, we discuss a practical implication of the posterior distribution (equation 5) used in our PAC-Bayes analysis. To this end, we first interpret our PAC-Bayes posterior as a modified result of LA (MacKay, 1992). Then, we demonstrate that this modification improves existing LA when WD is applied to NNs with normalization layers (Proposition 3.1). One can view the parameter-space version of $\mathcal{Q}_{\theta^*}$ as a modified version of the LA posterior (equation 1) obtained by (i) substituting parameter-dependent damping ($\mathrm{diag}(\theta^*)^{-2}$) for isotropic damping and (ii) adding the perturbation $\theta^* \odot \mu_{\mathcal{Q}}$ to the mean of the Gaussian distribution. Here, we focus on the effect of replacing the damping term of LA in batch-normalized NNs in the presence of WD. We refer to Antoran et al. (2021; 2022) for a discussion of the effect of adding a perturbation to LA with linearized NNs. The main difference between the covariance terms of LA in equation 1 and equation 7 is the definition of the Jacobian (i.e., parameter or connectivity), similar to the difference between the empirical CTK and NTK in Remark 2.5. Therefore, we name
$$p_{\mathrm{CL}}(\psi|S) \sim \mathcal{N}\big(\psi \mid \theta^*,\ (\mathrm{diag}(\theta^*)^{-2}/\alpha^2 + J_\theta^\top J_\theta/\sigma^2)^{-1}\big) \quad (9)$$
the Connectivity Laplace (CL) approximated posterior. To compare CL posteriors against existing LAs, we explain how WD with BN can produce the unexpected side effect of amplifying uncertainty. This side effect can be quantified if we consider a linearized NN for LA, called Linearized Laplace (LL; Foong et al. (2019)). Assuming $\sigma^2 \ll \alpha^2$, the predictive distributions of LL and CL are
$$f^{\mathrm{lin}}_{\theta^*}(x, \psi) \mid p_{\mathrm{LA}}(\psi|S) \sim \mathcal{N}\big(f(x, \theta^*),\ \alpha^2 \Theta^{\theta^*}_{xx} - \alpha^2 \Theta^{\theta^*}_{x\mathcal{X}} (\Theta^{\theta^*}_{\mathcal{X}})^{-1} \Theta^{\theta^*}_{\mathcal{X}x}\big) \quad (10)$$
$$f^{\mathrm{lin}}_{\theta^*}(x, \psi) \mid p_{\mathrm{CL}}(\psi|S) \sim \mathcal{N}\big(f(x, \theta^*),\ \alpha^2 C^{\theta^*}_{xx} - \alpha^2 C^{\theta^*}_{x\mathcal{X}} (C^{\theta^*}_{\mathcal{X}})^{-1} C^{\theta^*}_{\mathcal{X}x}\big) \quad (11)$$
for any input $x \in \mathbb{R}^D$, where $\mathcal{X}$ in the subscript denotes concatenation along the training inputs. We refer to Appendix G for the detailed derivations. The following proposition illustrates how WD with BN can increase the predictive uncertainty in equation 10.
Proposition 3.1 (Uncertainty-amplifying effect for LL). Assume $W_\gamma : \mathbb{R}^P \to \mathbb{R}^P$ applies WD to the scale-invariant parameters (e.g., parameters preceding BN layers) by multiplying them by $\gamma < 1$, while all non-scale-invariant parameters are fixed. Then the predictive uncertainty of LL is amplified by $1/\gamma^2 > 1$, while the predictive uncertainty of CL is preserved:
$$\mathrm{Var}_{\psi \sim p_{\mathrm{LA}}(\psi|S)}\big(f^{\mathrm{lin}}_{W_\gamma(\theta^*)}(x, \psi)\big) = \mathrm{Var}_{\psi \sim p_{\mathrm{LA}}(\psi|S)}\big(f^{\mathrm{lin}}_{\theta^*}(x, \psi)\big)/\gamma^2$$
$$\mathrm{Var}_{\psi \sim p_{\mathrm{CL}}(\psi|S)}\big(f^{\mathrm{lin}}_{W_\gamma(\theta^*)}(x, \psi)\big) = \mathrm{Var}_{\psi \sim p_{\mathrm{CL}}(\psi|S)}\big(f^{\mathrm{lin}}_{\theta^*}(x, \psi)\big)$$
where $\mathrm{Var}(\cdot)$ is the variance of a random variable. Since the primary regularization effect of WD actually occurs when combined with BN, as experimentally shown in Zhang et al. (2019), Proposition 3.1 describes a real-world issue. Recently, Antoran et al. (2021; 2022) observed pitfalls similar to Proposition 3.1. However, their solution requires a more complicated hyper-parameter search: an independent prior selection for each normalized parameter group. In contrast, CL does not increase the number of hyper-parameters to be optimized compared to LL. We believe this difference makes CL more attractive to practitioners.
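Proposition 3.1 can be illustrated with a one-output toy model $f(x, w) = \langle w, x \rangle / \|w\|$ (our own stand-in for a parameter group preceding a BN layer: it is homogeneous of degree zero in $w$, hence scale-invariant). Shrinking $w$ by $\gamma$ amplifies the NTK-based variance by $1/\gamma^2$, while the CTK-based one is unchanged:

```python
import numpy as np

def f(x, w):
    # toy scale-invariant unit: f(x, w) = dot(w, x) / ||w||
    return w @ x / np.linalg.norm(w)

def grad_w(x, w):
    # analytic gradient of f w.r.t. w (homogeneous of degree -1 in w)
    n = np.linalg.norm(w)
    return x / n - (w @ x) * w / n**3

rng = np.random.default_rng(4)
x, w = rng.normal(size=5), rng.normal(size=5)
gamma = 0.5                                  # weight decay shrinks the parameters

ntk = grad_w(x, w) @ grad_w(x, w)            # Theta_xx = J J^T (scalar output)
ntk_wd = grad_w(x, gamma * w) @ grad_w(x, gamma * w)

ctk = np.sum((grad_w(x, w) * w) ** 2)        # C_xx = J diag(w)^2 J^T
ctk_wd = np.sum((grad_w(x, gamma * w) * (gamma * w)) ** 2)
```

Since the LL predictive variance is built from $\Theta$ and the CL one from $C$, the $1/\gamma^2$ blow-up appears only for LL, matching the proposition.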

4. EXPERIMENTS

Here we describe experiments demonstrating (i) the effectiveness of Connectivity Sharpness (CS) as a generalization measurement metric and (ii) the usefulness of Connectivity Laplace (CL) as a general-purpose Bayesian NN: with CS and CL, we can resolve the contradiction in the FM hypothesis concerning the generalization of NNs and attain stable calibration performance across a wide range of prior scales.

4.1. CONNECTIVITY SHARPNESS AS A GENERALIZATION MEASUREMENT METRIC

Based on the CIFAR-10 dataset, we evaluate three correlation metrics to determine whether CS correlates with generalization performance more strongly than existing sharpness measures: (a) Kendall's rank-correlation coefficient ($\tau$; Kendall (1938)), (b) granulated Kendall's coefficients and their average ($\Psi$; Jiang et al. (2020)), and (c) the conditional independence test ($\mathcal{K}$; Jiang et al. (2020)). In all correlation metrics, a higher value indicates a stronger relationship between sharpness and generalization. We compare CS to the following baseline sharpness measures: the trace of the Hessian (tr(H); Keskar et al. (2017)), the trace of the empirical Fisher (tr(F)), adaptive sharpness (AS), the Fisher-Rao metric (FR; Liang et al. (2019)), and Sharpness-Orig. (SO), PAC-Bayes-Orig. (PO), Sharpness-Mag. (SM), and PAC-Bayes-Mag. (PM), which are eq. (52), (49), (62), (61) in Jiang et al. (2020), respectively. We compute granulated Kendall's correlation using five hyper-parameters (network depth, network width, learning rate, weight decay, and mini-batch size) with three options for each; thus, we train models with $3^5 = 243$ different training configurations. We vary the depth and width of the NN based on VGG-13 (Simonyan & Zisserman, 2015). Further experimental details can be found in Appendix H. In Table 2, CS shows the best results for $\tau$, $\Psi$, and $\mathcal{K}$ compared to all other sharpness measures. Additionally, the granulated Kendall coefficient of CS is higher than those of the other sharpness measures for 3 out of 5 hyper-parameters and competitive for the remaining ones. The main difference between CS and the other sharpness measures lies in the correlations with weight decay and network width: we found that SO and PM capture the correlation with weight decay, and we hypothesize that this is due to the weight-norm term of SO/PM. As this weight-norm term interferes with capturing the sharpness-generalization correlation related to the number of parameters (i.e., width/depth), SO/PM fail to capture the correlation with network width in Table 2. On the other hand, CS and AS do not suffer from this problem.
It is also notable that FR only weakly captures this correlation despite its invariance property. For network width, we found that all sharpness measures except CS, tr(F), and AS/FR fail to capture a strong correlation. In summary, only CS and AS detect clear correlations with all hyper-parameters; between them, CS captures the clearer correlations.
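For reference, metric (a) is simple to reproduce. Below is a minimal implementation of Kendall's $\tau$ between a sharpness measure and the generalization gap (our own sketch, without the tie correction used by standard statistics packages):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's rank-correlation coefficient between a sharpness measure (xs)
    and generalization gap (ys): (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# perfectly monotone toy relation between sharpness and gap: tau = 1
tau = kendall_tau([1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.35, 0.5])
```

Granulated Kendall's $\Psi$ then averages this coefficient over subsets of runs in which only one hyper-parameter varies at a time.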

4.2. CONNECTIVITY LAPLACE AS AN EFFICIENT GENERAL-PURPOSE BAYESIAN NN

We evaluate CL's effectiveness as a general-purpose Bayesian NN using the UCI and CIFAR datasets. We refer to Appendix H for detailed experimental settings. We use the Randomize-Then-Optimize (RTO) implementation of LL/CL in Appendix F. We measure the negative log-likelihood (NLL), expected calibration error (ECE; Guo et al. (2017)), and Brier score (Brier.) for ensemble predictions. We also measure the area under the receiver operating curve (AUC) for OOD detection, where we set the SVHN (Netzer et al., 2011) dataset as the OOD dataset. Robustness to the selection of prior scale. Figure 1 shows the uncertainty calibration results over various $\alpha$ values for LL, CL, and the deterministic (Det.) baseline. As mentioned in previous works (Ritter et al., 2018; Kristiadi et al., 2020), the uncertainty calibration results of LL are extremely sensitive to the selection of $\alpha$; in particular, LL shows severe under-fitting in the large-$\alpha$ (i.e., small damping) regime. On the other hand, CL shows stable performance across a wide range of $\alpha$.

5. CONCLUSION

In this work, we proposed a new approach to enhancing the robustness of generalization bounds using scale-invariant PAC-Bayes prior and posterior distributions. By separating scales and connectivities, our approach achieves invariance to function-preserving scale transformations, which is not addressed by existing generalization error bounds. As a result, our method successfully resolves the contradiction in the FM hypothesis caused by general scale transformations. In addition, our posterior distribution for PAC-Bayes analysis improves the Laplace approximation, without significant drawbacks, when dealing with weight decay under BN. To further improve our understanding of NN generalization, future research could extend the prior and posterior distributions beyond Gaussians to more task-specific distributions, which could help bridge the gap between theory and practice.

A NOTATIONS

Table 5: Notations used in the main paper

- $x_n \in \mathbb{R}^D$, $y_n \in \mathbb{R}^K$ ≜ training inputs/outputs
- $\theta, \psi \in \mathbb{R}^P$ ≜ parameters of NNs
- $f(x, \theta)$ ≜ output of the NN given $x, \theta$
- $f^{\mathrm{lin}}_{\theta^*}(x, \delta) := f(x, \theta^*) + J_\theta(x, \theta^*)\delta$ ≜ linearization of the NN w.r.t. parameter
- $g^{\mathrm{lin}}_{\theta^*}(x, c) := f(x, \theta^*) + J_\theta(x, \theta^*)\,\mathrm{diag}(\theta^*)c$ ≜ linearization of the NN w.r.t. connectivity
- $N, N_{\mathcal{P}}, N_{\mathcal{Q}}$ ≜ cardinality of $S, S_{\mathcal{P}}, S_{\mathcal{Q}}$
- $\mathcal{X}, \mathcal{Y} \,/\, \mathcal{X}_{\mathcal{P}}, \mathcal{Y}_{\mathcal{P}} \,/\, \mathcal{X}_{\mathcal{Q}}, \mathcal{Y}_{\mathcal{Q}}$ ≜ concatenated inputs and outputs of all instances of $S / S_{\mathcal{P}} / S_{\mathcal{Q}}$
- $J_\theta(x, \theta^*), J_c(x, \theta^*) \in \mathbb{R}^{K \times P}$ ≜ Jacobian of outputs w.r.t. parameter/connectivity given $x, \theta^*$
- $f(\mathcal{X}, \theta) \in \mathbb{R}^{NK}$ ≜ concatenated outputs of the NN along $\mathcal{X}$
- $J_\theta, J_c \in \mathbb{R}^{NK \times P}$ ≜ concatenated Jacobians w.r.t. parameter/connectivity along $\mathcal{X}$
- $\mathcal{P}_{\theta^*}(c), \mathcal{Q}_{\theta^*}(c)$ ≜ our PAC-Bayes prior/posterior
- $\mathrm{err}_{\mathcal{D}}(\mathcal{Q}), \mathrm{err}_S(\mathcal{Q})$ ≜ generalization/empirical error of the PAC-Bayes posterior
- $\alpha$ ≜ standard deviation of our PAC-Bayes prior
- $\sigma$ ≜ scale (standard deviation) of the observation noise
- $\mu_{\mathcal{Q}} \in \mathbb{R}^P, \Sigma_{\mathcal{Q}} \in \mathbb{R}^{P \times P}$ ≜ mean and covariance of our PAC-Bayes posterior
- $\Theta^{\psi}_{xx'}, C^{\psi}_{xx'}$ ≜ empirical NTK/CTK of the NN given $\psi$ and input pair $x, x' \in \mathbb{R}^D$
- $\Theta^{\psi}_{\mathcal{X}}, C^{\psi}_{\mathcal{X}}$ ≜ empirical NTK/CTK of the NN for the training inputs ($\mathcal{X}$) given $\psi$
- $\{\lambda_i\}$ ≜ eigenvalues of the empirical CTK
- $\{\beta_i\}_{i=1}^P$ ≜ eigenvalues of $(I_P + \frac{\alpha^2}{\sigma^2} J_c^\top J_c)^{-1}$
- $h(x) := x - \log(x) - 1$ ≜ non-negative convex function w.r.t. $\beta_i$ (see Fig. 2)
- $(h \circ s)(x)$ ≜ non-negative concave function w.r.t. $\lambda_i$ (see Fig. 2)

B PROOFS

B.1 PROOF OF PROPOSITION 2.1

Proof. Since the prior $\mathcal{P}_{\theta^*}(c)$ is independent of the parameter scale, $\mathcal{P}_{\theta^*}(c) \stackrel{d}{=} \mathcal{P}_{T(\theta^*)}(c)$ is trivial. For the Jacobian w.r.t. parameters, we have
$$J_\theta(x, T(\psi)) = \frac{\partial f(x, T(\psi))}{\partial T(\psi)} = \frac{\partial f(x, \psi)}{\partial T(\psi)} = \frac{\partial f(x, \psi)}{\partial \psi} \frac{\partial \psi}{\partial T(\psi)} = J_\theta(x, \psi)\, T^{-1}.$$
Then, the Jacobian of the NN w.r.t. connectivity at $T(\psi)$ satisfies
$$J_\theta(x, T(\psi))\,\mathrm{diag}(T(\psi)) = J_\theta(x, \psi)\, T^{-1} T\, \mathrm{diag}(\psi) = J_\theta(x, \psi)\,\mathrm{diag}(\psi) \quad (12)$$
where the first equality holds from the identity above and the fact that $T$ is a diagonal linear transformation. Therefore, the covariance of the posterior is invariant to $T$:
$$\left(\frac{I_P}{\alpha^2} + \frac{\mathrm{diag}(T(\theta^*))\, J_\theta(\mathcal{X}, T(\theta^*))^\top J_\theta(\mathcal{X}, T(\theta^*))\,\mathrm{diag}(T(\theta^*))}{\sigma^2}\right)^{-1} = \left(\frac{I_P}{\alpha^2} + \frac{\mathrm{diag}(\theta^*)\, J_\theta(\mathcal{X}, \theta^*)^\top J_\theta(\mathcal{X}, \theta^*)\,\mathrm{diag}(\theta^*)}{\sigma^2}\right)^{-1} = \left(\frac{I_P}{\alpha^2} + \frac{\mathrm{diag}(\theta^*)\, J_\theta^\top J_\theta\,\mathrm{diag}(\theta^*)}{\sigma^2}\right)^{-1}.$$

Moreover, the mean of the posterior is also invariant to $T$:

$$\frac{\Sigma_Q\,\mathrm{diag}(T(\theta^*))\, J_\theta(\mathcal{X}, T(\theta^*))^\top (\mathcal{Y} - f(\mathcal{X}, T(\theta^*)))}{\sigma^2} = \frac{\Sigma_Q\,\mathrm{diag}(T(\theta^*))\, J_\theta(\mathcal{X}, T(\theta^*))^\top (\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2} = \frac{\Sigma_Q\,\mathrm{diag}(\theta^*)\, J_\theta(\mathcal{X}, \theta^*)^\top (\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2} = \frac{\Sigma_Q\,\mathrm{diag}(\theta^*)\, J_\theta^\top (\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2}.$$

Therefore, equation 6 and equation 7 are invariant to function-preserving scale transformations. The remaining part of the proposition follows from the definition of a function-preserving scale transformation $T$. For the generalization error, the following holds:

$$\mathrm{err}_D(Q_{T(\theta^*)}) = \mathbb{E}_{(x,y)\sim\mathcal{D},\,\psi\sim Q_{T(\theta^*)}}[\mathrm{err}(f(x, \psi), y)] = \mathbb{E}_{(x,y)\sim\mathcal{D},\,c\sim Q_{T(\theta^*)}}[\mathrm{err}(g^{\mathrm{pert}}_{T(\theta^*)}(x, c), y)] = \mathbb{E}_{(x,y)\sim\mathcal{D},\,c\sim Q_{\theta^*}}[\mathrm{err}(g^{\mathrm{pert}}_{\theta^*}(x, c), y)] = \mathbb{E}_{(x,y)\sim\mathcal{D},\,\psi\sim Q_{\theta^*}}[\mathrm{err}(f(x, \psi), y)] = \mathrm{err}_D(Q_{\theta^*}),$$

where the third equality uses $Q_{T(\theta^*)}(c) \overset{d}{=} Q_{\theta^*}(c)$ together with $g^{\mathrm{pert}}_{T(\theta^*)}(x, c) = f(x, T(\theta^* + \theta^* \odot c)) = f(x, \theta^* + \theta^* \odot c) = g^{\mathrm{pert}}_{\theta^*}(x, c)$. Without loss of generality, this proof extends to the empirical error $\mathrm{err}_{S_Q}$. □

B.2 PROOF OF THEOREM 2.2

Proof. (Construction of KL divergence) To construct PAC-Bayes-CTK, we arrange the KL divergence between the posterior and the prior as follows:

$$\mathrm{KL}[Q\|P] = \frac{1}{2}\left[\mathrm{tr}\left(\Sigma_P^{-1}(\Sigma_Q + (\mu_Q - \mu_P)(\mu_Q - \mu_P)^\top)\right) + \log|\Sigma_P| - \log|\Sigma_Q| - P\right]$$
$$= \frac{1}{2}\mathrm{tr}\left(\Sigma_P^{-1}(\mu_Q - \mu_P)(\mu_Q - \mu_P)^\top\right) + \frac{1}{2}\left[\mathrm{tr}(\Sigma_P^{-1}\Sigma_Q) + \log|\Sigma_P| - \log|\Sigma_Q| - P\right]$$
$$= \frac{1}{2}(\mu_Q - \mu_P)^\top\Sigma_P^{-1}(\mu_Q - \mu_P) + \frac{1}{2}\left[\mathrm{tr}(\Sigma_P^{-1}\Sigma_Q) - \log|\Sigma_P^{-1}\Sigma_Q| - P\right]$$
$$= \underbrace{\frac{\mu_Q^\top\mu_Q}{2\alpha^2}}_{\text{perturbation}} + \underbrace{\frac{1}{2}\left[\mathrm{tr}(\Sigma_P^{-1}\Sigma_Q) - \log|\Sigma_P^{-1}\Sigma_Q| - P\right]}_{\text{sharpness}},$$

where the first equality uses the KL divergence between two Gaussian distributions, the third uses the trace properties ($\mathrm{tr}(AB) = \mathrm{tr}(BA)$ and $\mathrm{tr}(a) = a$ for scalar $a$), and the last uses the definition of the PAC-Bayes prior ($P_{\theta^*}(c) = \mathcal{N}(c|0_P, \alpha^2 I_P)$). For the sharpness term, we first compute

$$\Sigma_P^{-1}\Sigma_Q = \left(I_P + \frac{\alpha^2}{\sigma^2} J_c^\top J_c\right)^{-1}.$$

Since $\alpha^2, \sigma^2 > 0$ and $J_c^\top J_c$ is positive semi-definite, the matrix $\Sigma_P^{-1}\Sigma_Q$ has positive eigenvalues $\{\beta_i\}_{i=1}^P$. Since the trace is the sum of the eigenvalues and the log-determinant is the sum of the logs of the eigenvalues, we have

$$\mathrm{KL}[Q\|P] = \frac{\mu_Q^\top\mu_Q}{2\alpha^2} + \frac{1}{2}\sum_{i=1}^P(\beta_i - \log(\beta_i) - 1) = \frac{\mu_Q^\top\mu_Q}{2\alpha^2} + \frac{1}{2}\sum_{i=1}^P h(\beta_i),$$

where $h(x) = x - \log(x) - 1$. Plugging this KL divergence into equation 2 gives equation 8.

(Eigenvalues of $\Sigma_P^{-1}\Sigma_Q$) To show the scale-invariance of PAC-Bayes-CTK, it suffices to show that the KL divergence between posterior and prior is scale-invariant, since the term $\log(2N_Q/\delta)/2N_Q$ is independent of the PAC-Bayes prior/posterior. We already showed the invariance of the empirical/generalization error terms in Proposition 2.1. To show the invariance of the KL divergence, write the singular value decomposition of the Jacobian w.r.t. connectivity $J_c \in \mathbb{R}^{NK \times P}$ as $J_c = U\Sigma V^\top$, where $U \in \mathbb{R}^{NK \times NK}$ and $V \in \mathbb{R}^{P \times P}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{NK \times P}$ is a rectangular diagonal matrix with singular values in descending order.
Then the following holds for $\Sigma_P^{-1}\Sigma_Q$:

$$\Sigma_P^{-1}\Sigma_Q = \left(I_P + \frac{\alpha^2}{\sigma^2} J_c^\top J_c\right)^{-1} = \left(I_P + \frac{\alpha^2}{\sigma^2} V\Sigma^\top\Sigma V^\top\right)^{-1} = V\left(I_P + \frac{\alpha^2}{\sigma^2}\Lambda\right)^{-1} V^\top,$$

where $\Lambda = \Sigma^\top\Sigma \in \mathbb{R}^{P \times P}$ is a diagonal matrix with $\lambda_i := \Lambda_{ii} = 0$ for $i > NK$. Therefore, the eigenvalues of $\Sigma_P^{-1}\Sigma_Q$ are $\frac{1}{1 + \alpha^2\lambda_i/\sigma^2} = \frac{\sigma^2}{\sigma^2 + \alpha^2\lambda_i}$. Now, consider the Connectivity Tangent Kernel (CTK) as defined in equation 2.3: $C^{\theta^*}_{\mathcal{X}} := J_c J_c^\top = J_\theta\,\mathrm{diag}(\theta^*)^2\, J_\theta^\top \in \mathbb{R}^{NK \times NK}$. Similar to $J_c^\top J_c$, the CTK can be expressed as

$$C^{\theta^*}_{\mathcal{X}} = J_c J_c^\top = U\Sigma V^\top V\Sigma^\top U^\top = U\Sigma\Sigma^\top U^\top = U\Lambda' U^\top,$$

where $\Lambda' = \Sigma\Sigma^\top \in \mathbb{R}^{NK \times NK}$. As the smallest $P - NK$ eigenvalues of $\Lambda = \Sigma^\top\Sigma$ are zeros, $\Lambda'$ is simply the reduced diagonal matrix of $\Lambda$ with these zero eigenvalues removed. As a result, $\{\lambda_i\}_{i=1}^{NK}$ are the eigenvalues of the CTK.

(Scale invariance of CTK) The scale-invariance property of the CTK is a simple application of equation 13:

$$C^{T(\psi)}_{xy} = J_{T(\theta)}(x, T(\psi))\,\mathrm{diag}(T(\psi))^2\, J_{T(\theta)}(y, T(\psi))^\top = J_\theta(x, \psi)\, T^{-1} T\,\mathrm{diag}(\psi)\,\mathrm{diag}(\psi)\, T\, T^{-1} J_\theta(y, \psi)^\top = J_\theta(x, \psi)\,\mathrm{diag}(\psi)^2\, J_\theta(y, \psi)^\top = C^\psi_{xy}, \quad \forall x, y \in \mathbb{R}^D,\ \forall\psi \in \mathbb{R}^P.$$

Therefore, the CTK is invariant to any function-preserving scale transformation $T$, and so are its eigenvalues. This guarantees the invariance of $\Sigma_P^{-1}\Sigma_Q$ and its eigenvalues. In summary, we showed the scale-invariance of the sharpness term of the KL divergence. It remains to show the invariance of the perturbation term, which was already proved in the proof of Proposition 2.1. Therefore, PAC-Bayes-CTK is invariant to any function-preserving scale transformation. □
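As a numerical sanity check (our own illustration, not part of the paper), the invariance can be verified for a toy two-layer scalar linear network $f(x, \theta) = \theta_2\theta_1 x$, whose Jacobian, empirical NTK, and empirical CTK are available in closed form:

```python
import numpy as np

# Toy two-layer linear network f(x, theta) = theta2 * (theta1 * x) with scalar
# input/output, so J_theta(x) = [theta2 * x, theta1 * x].
def jac_theta(x, theta):
    t1, t2 = theta
    return np.array([t2 * x, t1 * x])

def ntk(x, y, theta):
    # empirical NTK: J_theta(x) J_theta(y)^T
    return jac_theta(x, theta) @ jac_theta(y, theta)

def ctk(x, y, theta):
    # empirical CTK: J_theta(x) diag(theta)^2 J_theta(y)^T
    return jac_theta(x, theta) @ np.diag(np.asarray(theta) ** 2) @ jac_theta(y, theta)

theta = np.array([1.5, -0.7])
gamma = 10.0
# function-preserving rescaling T: (t1, t2) -> (gamma * t1, t2 / gamma)
theta_T = np.array([gamma * theta[0], theta[1] / gamma])

x, y = 0.3, -1.2
assert np.isclose(theta.prod() * x, theta_T.prod() * x)      # f is preserved
assert np.isclose(ctk(x, y, theta), ctk(x, y, theta_T))      # CTK is invariant
assert not np.isclose(ntk(x, y, theta), ntk(x, y, theta_T))  # NTK is not
```

Here the CTK equals $2\theta_1^2\theta_2^2 xy$, which depends only on the product $\theta_1\theta_2$ preserved by $T$, whereas the NTK $(\theta_1^2 + \theta_2^2)xy$ changes under rescaling.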

B.3 PROOF OF COROLLARY 2.3

Proof. In the proof of Theorem 2.2, we showed that the eigenvalues of $\Sigma_P^{-1}\Sigma_Q$ can be represented as $\beta_i = \frac{\sigma^2}{\sigma^2 + \alpha^2\lambda_i}$ with $\lambda_i = \Lambda_{ii} = \Sigma_{ii}^2 \geq 0$. Therefore, the eigenvalues of the CTK are squares of the singular values of $J_c$, so $\lambda_i \geq 0$ for all $i$. As a result, $\beta_i \leq 1$ for all $i = 1, \ldots, P$, with equality when $\lambda_i = 0$ (i.e., for $i > NK$). It remains to show that the sharpness term of PAC-Bayes-CTK is a monotonically increasing function of each eigenvalue of the CTK. To see this, note that $s(x) := \frac{\sigma^2}{\sigma^2 + \alpha^2 x}$ is monotonically decreasing for $x \geq 0$ with $s(x) \in (0, 1]$, and $h(x) := x - \log(x) - 1$ is monotonically decreasing on $(0, 1]$. Since the sharpness term of the KL divergence is

$$\frac{\sum_{i=1}^P h(\beta_i)}{4N_Q} = \frac{\sum_{i=1}^P (h \circ s)(\lambda_i)}{4N_Q},$$

and the composition of two decreasing maps is increasing, the sharpness term is monotonically increasing in each $\lambda_i \geq 0$. We plot $y = h(x)$ and $y = (h \circ s)(x)$ in Figure 2. □
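A quick numerical check of this monotonicity (our own sketch; the values of $\alpha$ and $\sigma$ are arbitrary illustrative choices):

```python
import numpy as np

# h(x) = x - log(x) - 1 and s(lambda) = sigma^2 / (sigma^2 + alpha^2 * lambda),
# as defined in the proofs above.
alpha, sigma = 0.1, 1.0
h = lambda x: x - np.log(x) - 1.0
s = lambda lam: sigma**2 / (sigma**2 + alpha**2 * lam)

lams = np.linspace(0.0, 1e4, 100001)   # CTK eigenvalues are non-negative
vals = h(s(lams))                      # per-eigenvalue sharpness contribution

assert np.all((s(lams) > 0.0) & (s(lams) <= 1.0))  # s maps [0, inf) into (0, 1]
assert vals[0] == 0.0                              # (h o s)(0) = h(1) = 0
assert np.all(np.diff(vals) >= -1e-12)             # (h o s) is non-decreasing
```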

B.4 PROOF OF PROPOSITION 2.4

Proof. We refer to the "Scale invariance of CTK" part of the proof of Theorem 2.2; the claim is a direct application of the scale-invariance of the Jacobian w.r.t. connectivity. □

B.5 PROOF OF COROLLARY 2.6

Proof. Since CS is the trace of the CTK, it is the sum of the eigenvalues of the CTK. As shown in the proof of Corollary 2.3, the eigenvalues of the CTK are the squares of the singular values of the Jacobian w.r.t. connectivity $c$. Therefore, the eigenvalues of the CTK are non-negative and vanish when CS vanishes:

$$\sum_{i=1}^P \lambda_i = 0 \;\Rightarrow\; \lambda_i = 0 \;\Rightarrow\; \beta_i = s(\lambda_i) = 1 \;\Rightarrow\; h(\beta_i) = 0, \quad \forall i = 1, \ldots, P.$$

This means the sharpness term of the KL divergence vanishes. Furthermore, the singular values of the Jacobian w.r.t. $c$ also vanish in this case, so $\mu_Q$ vanishes as well. Conversely, if CS diverges to infinity, at least one eigenvalue of the CTK diverges to infinity. In this case, the following holds:

$$\lambda_i \to \infty \;\Rightarrow\; \beta_i = s(\lambda_i) \to 0 \;\Rightarrow\; h(\beta_i) \to \infty \quad \text{for some } i.$$

Therefore, the KL divergence term of PAC-Bayes-CTK also diverges to infinity. □

B.6 PROOF OF PROPOSITION 3.1

Proof. By assumption, we fix all non-scale-invariant parameters. This means we exclude these parameters from the sampling procedures of CL and LL. In terms of predictive distributions, this can be written as

$$f^{\mathrm{lin}}_{\theta^*}(x, \psi)\,|\,p_{\mathrm{LA}}(\psi|S) \sim \mathcal{N}\!\left(f(x, \theta^*),\; \alpha^2\hat\Theta^{\theta^*}_{xx} - \alpha^2\hat\Theta^{\theta^*}_{x\mathcal{X}}(\hat\Theta^{\theta^*}_{\mathcal{X}})^{-1}\hat\Theta^{\theta^*}_{\mathcal{X}x}\right)$$
$$f^{\mathrm{lin}}_{\theta^*}(x, \psi)\,|\,p_{\mathrm{CL}}(\psi|S) \sim \mathcal{N}\!\left(f(x, \theta^*),\; \alpha^2\hat C^{\theta^*}_{xx} - \alpha^2\hat C^{\theta^*}_{x\mathcal{X}}(\hat C^{\theta^*}_{\mathcal{X}})^{-1}\hat C^{\theta^*}_{\mathcal{X}x}\right)$$

where $\hat\Theta^\psi_{xx'} := \sum_{i\in\mathcal{P}} \frac{\partial f(x,\psi)}{\partial\theta_i}\frac{\partial f(x',\psi)}{\partial\theta_i}$ and $\hat C^\psi_{xx'} := \sum_{i\in\mathcal{P}} \frac{\partial f(x,\psi)}{\partial\theta_i}\frac{\partial f(x',\psi)}{\partial\theta_i}(\psi_i)^2$ for the scale-invariant parameter set $\mathcal{P}$. Thereby, we mask the gradients of the non-scale-invariant parameters to zero, which can be arranged as

$$\hat\Theta^\psi_{xx'} = J_\theta(x, \psi)\,\mathrm{diag}(1_{\mathcal{P}})\, J_\theta(x', \psi)^\top, \qquad \hat C^\psi_{xx'} = J_\theta(x, \psi)\,\mathrm{diag}(\psi)\,\mathrm{diag}(1_{\mathcal{P}})\,\mathrm{diag}(\psi)\, J_\theta(x', \psi)^\top,$$

where $1_{\mathcal{P}} \in \mathbb{R}^P$ is a masking vector (one for included components and zero for excluded components). Then, WD on the scale-invariant parameters can be represented as

$$W_\gamma(\psi)_i = \begin{cases} \gamma\psi_i, & \text{if } \psi_i \in \mathcal{P}, \\ \psi_i, & \text{if } \psi_i \notin \mathcal{P}. \end{cases}$$
Therefore, we get

$$\hat\Theta^{W_\gamma(\psi)}_{xx'} = J_\theta(x, W_\gamma(\psi))\,\mathrm{diag}(1_{\mathcal{P}})\, J_\theta(x', W_\gamma(\psi))^\top = J_\theta(x, \psi)\, W_\gamma^{-1}\,\mathrm{diag}(1_{\mathcal{P}})\, W_\gamma^{-1} J_\theta(x', \psi)^\top = J_\theta(x, \psi)\,\mathrm{diag}(1_{\mathcal{P}}/\gamma^2)\, J_\theta(x', \psi)^\top = \frac{1}{\gamma^2}\hat\Theta^\psi_{xx'}$$

for the empirical NTK, and

$$\hat C^{W_\gamma(\psi)}_{xx'} = J_\theta(x, W_\gamma(\psi))\,\mathrm{diag}(W_\gamma(\psi))\,\mathrm{diag}(1_{\mathcal{P}})\,\mathrm{diag}(W_\gamma(\psi))\, J_\theta(x', W_\gamma(\psi))^\top = J_\theta(x, \psi)\, W_\gamma^{-1} W_\gamma\,\mathrm{diag}(\psi)\,\mathrm{diag}(1_{\mathcal{P}})\,\mathrm{diag}(\psi)\, W_\gamma W_\gamma^{-1} J_\theta(x', \psi)^\top = \hat C^\psi_{xx'}$$

for the empirical CTK. Therefore,

$$f^{\mathrm{lin}}_{W_\gamma(\theta^*)}(x, \psi)\,|\,p_{\mathrm{LA}}(\psi|S) \sim \mathcal{N}\!\left(f(x, \theta^*),\; \frac{\alpha^2}{\gamma^2}\hat\Theta^{\theta^*}_{xx} - \frac{\alpha^2}{\gamma^2}\hat\Theta^{\theta^*}_{x\mathcal{X}}(\hat\Theta^{\theta^*}_{\mathcal{X}})^{-1}\hat\Theta^{\theta^*}_{\mathcal{X}x}\right)$$
$$f^{\mathrm{lin}}_{W_\gamma(\theta^*)}(x, \psi)\,|\,p_{\mathrm{CL}}(\psi|S) \sim \mathcal{N}\!\left(f(x, \theta^*),\; \alpha^2\hat C^{\theta^*}_{xx} - \alpha^2\hat C^{\theta^*}_{x\mathcal{X}}(\hat C^{\theta^*}_{\mathcal{X}})^{-1}\hat C^{\theta^*}_{\mathcal{X}x}\right)$$

This gives

$$\mathrm{Var}_{\psi\sim p_{\mathrm{LA}}(\psi|S)}(f^{\mathrm{lin}}_{W_\gamma(\theta^*)}(x, \psi)) = \mathrm{Var}_{\psi\sim p_{\mathrm{LA}}(\psi|S)}(f^{\mathrm{lin}}_{\theta^*}(x, \psi))/\gamma^2, \qquad \mathrm{Var}_{\psi\sim p_{\mathrm{CL}}(\psi|S)}(f^{\mathrm{lin}}_{W_\gamma(\theta^*)}(x, \psi)) = \mathrm{Var}_{\psi\sim p_{\mathrm{CL}}(\psi|S)}(f^{\mathrm{lin}}_{\theta^*}(x, \psi)). \;\square$$

B.7 DERIVATION OF PAC-BAYES-NTK

Theorem B.1 (PAC-Bayes-NTK). Assume a pre-trained parameter $\theta^*$ obtained with data $S_P$. Define the PAC-Bayes prior and posterior as

$$P'_{\theta^*}(\delta) := \mathcal{N}(\delta|0_P, \alpha^2 I_P), \quad Q'_{\theta^*}(\delta) := \mathcal{N}(\delta|\mu_{Q'}, \Sigma_{Q'}), \quad \mu_{Q'} := \frac{\Sigma_{Q'} J_\theta^\top(\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2}, \quad \Sigma_{Q'} := \left(\frac{I_P}{\alpha^2} + \frac{J_\theta^\top J_\theta}{\sigma^2}\right)^{-1}.$$

Applying $P'_{\theta^*}, Q'_{\theta^*}$ to the data-dependent PAC-Bayes bound (equation 2), we get

$$\mathrm{err}_D(Q'_{\theta^*}) \leq \mathrm{err}_{S_Q}(Q'_{\theta^*}) + \underbrace{\frac{\mu_{Q'}^\top\mu_{Q'}}{4\alpha^2 N_Q}}_{\text{(average) perturbation}} + \underbrace{\frac{\sum_{i=1}^P h(\beta'_i)}{4N_Q}}_{\text{sharpness}} + \frac{\log(2N_Q/\delta)}{2N_Q}, \tag{18}$$

where $\{\beta'_i\}_{i=1}^P$ are the eigenvalues of $(I_P + \frac{\alpha^2}{\sigma^2}J_\theta^\top J_\theta)^{-1}$ and $h(x) := x - \log(x) - 1$. This upper bound is not scale-invariant in general.

Proof. The main difference between PAC-Bayes-CTK and PAC-Bayes-NTK is the definition of the Jacobian: PAC-Bayes-CTK uses the Jacobian w.r.t. connectivity, while PAC-Bayes-NTK uses the Jacobian w.r.t. parameters. Therefore, the "Construction of KL divergence" part of the proof of Theorem 2.2 is preserved, except that $\Sigma_{P'}^{-1}\Sigma_{Q'} = (I_P + \frac{\alpha^2}{\sigma^2}J_\theta^\top J_\theta)^{-1}$ and the $\beta'_i$ are the eigenvalues of $\Sigma_{P'}^{-1}\Sigma_{Q'}$.
Note that these eigenvalues satisfy $\beta'_i = \frac{\sigma^2}{\sigma^2 + \alpha^2\lambda'_i}$, where the $\lambda'_i$ are the eigenvalues of $J_\theta^\top J_\theta$. □

Remark B.2 (Function-preserving scale transformations and the NTK). In contrast to the CTK, the scale-invariance property does not hold for the NTK because of the Jacobian w.r.t. parameters:

$$J_\theta(x, T(\psi)) = \frac{\partial}{\partial T(\psi)} f(x, T(\psi)) = \frac{\partial}{\partial T(\psi)} f(x, \psi) = J_\theta(x, \psi)\, T^{-1}.$$

If we assume all parameters are scale-invariant (or, equivalently, mask the Jacobian of all non-scale-invariant parameters as in the proof of Proposition 3.1), the scale of the NTK is proportional to the inverse squared scale of the parameters (e.g., $\Theta^{T(\psi)}_{xx'} = \Theta^\psi_{xx'}/\gamma^2$ for the uniform scaling $T(\psi) = \gamma\psi$).

C DETAILS OF SQUARED LOSS FOR CLASSIFICATION TASKS

For the classification tasks in Sec. 4.2, we use the squared loss instead of the cross-entropy loss, since our theoretical results are built on the squared loss. Here, we describe how we use the squared loss to mimic the cross-entropy loss. Several works (Lee et al., 2020; Hui & Belkin, 2021) utilized the squared loss for classification tasks instead of the cross-entropy loss. Specifically, Lee et al. (2020) used

$$L(S, \theta) = \frac{1}{2NK}\sum_{(x_i, y_i)\in S}\|f(x_i, \theta) - y_i\|^2,$$

where $K$ is the number of classes, and Hui & Belkin (2021) used

$$\ell((x, c), \theta) = \frac{1}{2K}\left[k(f_c(x, \theta) - M)^2 + \sum_{i=1, i\neq c}^K f_i(x, \theta)^2\right]$$

for the single-data loss, where $\ell((x, c), \theta)$ is the sample loss given input $x$, target $c$, and parameter $\theta$, $f_i(x, \theta) \in \mathbb{R}$ is the $i$-th component of $f(x, \theta) \in \mathbb{R}^K$, and $k$ and $M$ are dataset-specific hyper-parameters. These works used the mean to reduce the vector-valued loss to a scalar loss. However, this can be problematic when the number of classes is large: as the number of classes increases, the denominator of the mean (the number of classes) grows while the target value remains 1 (one-hot label). As a result, the scale of the gradient for the target class becomes smaller. To avoid this unfavorable effect, we use the sum instead of the mean to reduce the vector-valued loss to a scalar loss, i.e.,

$$\ell((x, c), \theta) = \frac{1}{2}\left[(f_c(x, \theta) - 1)^2 + \sum_{i=1, i\neq c}^K f_i(x, \theta)^2\right].$$

This analysis is consistent with the hyper-parameter selection in Hui & Belkin (2021): they used larger $k$ and $M$ as the number of classes increases (e.g., $k = 1, M = 1$ for CIFAR-10 (Krizhevsky, 2009), but $k = 15, M = 30$ for ImageNet (Deng et al., 2009)), which amounts to manually compensating the scale of the gradient w.r.t. the target class label.
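The effect of the reduction on the target-class gradient can be seen in a few lines (our own sketch; logits are evaluated at zero for simplicity):

```python
import numpy as np

# Gradient of the squared loss w.r.t. the target-class logit f_c, evaluated at
# f = 0 (all-zero logits) with a one-hot target. Mean reduction scales the
# gradient by 1/K; sum reduction (with the 1/2 factor) keeps it at -1.
def grad_target_logit(K, reduction):
    f = np.zeros(K)   # logits
    c = 0             # target class index
    g = f[c] - 1.0    # d/df_c of (1/2) * (f_c - 1)^2
    return g / K if reduction == "mean" else g

for K in (10, 100, 1000):
    assert grad_target_logit(K, "sum") == -1.0        # independent of K
    assert grad_target_logit(K, "mean") == -1.0 / K   # vanishes as K grows
```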

D DERIVATION OF PAC-BAYES POSTERIOR

Derivation of $Q_{\theta^*}(c)$. For Bayesian linear regression, we compute the posterior of $\beta \in \mathbb{R}^P$ in

$$y_i = x_i^\top\beta + \epsilon_i, \quad \text{for } i = 1, \ldots, M,$$

where the $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ are i.i.d. and the prior of $\beta$ is $\beta \sim \mathcal{N}(0_P, \alpha^2 I_P)$. Concatenating, we get $y = X\beta + \varepsilon$, where $y \in \mathbb{R}^M$ and $X \in \mathbb{R}^{M \times P}$ are the concatenations of the $y_i, x_i$, respectively, and $\varepsilon \sim \mathcal{N}(0_M, \sigma^2 I_M)$. It is well known (Bishop, 2006; Murphy, 2012) that the posterior of $\beta$ for this problem is

$$\beta \sim \mathcal{N}(\beta|\mu, \Sigma), \quad \mu := \frac{\Sigma X^\top y}{\sigma^2}, \quad \Sigma := \left(\frac{I_P}{\alpha^2} + \frac{X^\top X}{\sigma^2}\right)^{-1}.$$

Similarly, we define the Bayesian linear regression problem

$$y_i = f(x_i, \theta^*) + J_\theta(x_i, \theta^*)\,\mathrm{diag}(\theta^*)\,c + \epsilon_i, \quad \text{for } i = 1, \ldots, N,$$

where stacking the $K$-dimensional outputs gives $M = NK$ scalar observations and the regression coefficient is $\beta = c$. Thus, we treat $y_i - f(x_i, \theta^*)$ as the target and $J_\theta(x_i, \theta^*)\,\mathrm{diag}(\theta^*)$ as the input of the linear regression problem. Concatenating, we get

$$\mathcal{Y} = f(\mathcal{X}, \theta^*) + J_c c + \varepsilon \;\Rightarrow\; \mathcal{Y} - f(\mathcal{X}, \theta^*) = J_c c + \varepsilon.$$

Plugging this into the Bayesian linear regression posterior, we get

$$Q_{\theta^*}(c) := \mathcal{N}(c|\mu_Q, \Sigma_Q), \quad \mu_Q := \frac{\Sigma_Q J_c^\top(\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2} = \frac{\Sigma_Q\,\mathrm{diag}(\theta^*)\, J_\theta^\top(\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2},$$
$$\Sigma_Q := \left(\frac{I_P}{\alpha^2} + \frac{J_c^\top J_c}{\sigma^2}\right)^{-1} = \left(\frac{I_P}{\alpha^2} + \frac{\mathrm{diag}(\theta^*)\, J_\theta^\top J_\theta\,\mathrm{diag}(\theta^*)}{\sigma^2}\right)^{-1}.$$

Derivation of $Q_{\theta^*}(\psi)$. We define the perturbed parameter $\psi := \theta^* + \theta^* \odot c$. Since $\psi$ is affine in $c$, the distribution of $\psi$ is

$$Q_{\theta^*}(\psi) := \mathcal{N}(\psi|\mu^\psi_Q, \Sigma^\psi_Q), \quad \mu^\psi_Q := \theta^* + \theta^* \odot \mu_Q, \quad \Sigma^\psi_Q := \mathrm{diag}(\theta^*)\,\Sigma_Q\,\mathrm{diag}(\theta^*) = \left(\frac{\mathrm{diag}(\theta^*)^{-2}}{\alpha^2} + \frac{J_\theta^\top J_\theta}{\sigma^2}\right)^{-1}.$$
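The posterior above can be checked numerically (our own sketch; `X`, `beta_true`, and the noise scales are arbitrary toy values standing in for the quantities defined above):

```python
import numpy as np

rng = np.random.default_rng(0)
P, M = 5, 20
alpha, sigma = 1.0, 0.5

X = rng.normal(size=(M, P))          # design matrix (plays the role of J_c)
beta_true = rng.normal(size=P)
y = X @ beta_true + sigma * rng.normal(size=M)

# Posterior of beta under prior N(0, alpha^2 I) and Gaussian noise N(0, sigma^2):
Sigma = np.linalg.inv(np.eye(P) / alpha**2 + X.T @ X / sigma**2)
mu = Sigma @ X.T @ y / sigma**2

# Sanity check: the posterior mean equals the ridge-regression solution with
# ridge coefficient sigma^2 / alpha^2.
mu_ridge = np.linalg.solve(X.T @ X + (sigma**2 / alpha**2) * np.eye(P), X.T @ y)
assert np.allclose(mu, mu_ridge)
```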

E REPRESENTATIVE CASES OF FUNCTION-PRESERVING SCALING TRANSFORMATIONS

Activation-wise rescaling transformation (Tsuzuku et al., 2020; Neyshabur et al., 2015a). For NNs with positively homogeneous (e.g., ReLU) activations, the following holds for all $x \in \mathbb{R}^d$ and $\gamma > 0$: $f(x, \theta) = f(x, R_{\gamma,l,k}(\theta))$, where the rescaling transformation $R_{\gamma,l,k}(\cdot)$¹ is defined as

$$(R_{\gamma,l,k}(\theta))_i = \begin{cases} \gamma\cdot\theta_i, & \text{if } \theta_i \in \{\text{parameters connecting as input edges to the } k\text{-th activation of the } l\text{-th layer}\}, \\ \theta_i/\gamma, & \text{if } \theta_i \in \{\text{parameters connecting as output edges to the } k\text{-th activation of the } l\text{-th layer}\}, \\ \theta_i, & \text{otherwise}. \end{cases} \tag{19}$$

Note that $R_{\gamma,l,k}(\cdot)$ is a finer-grained rescaling transformation than the layer-wise rescaling (i.e., a common $\gamma$ for all activations in layer $l$) discussed in Dinh et al. (2017). Dinh et al. (2017) showed that even layer-wise rescaling transformations can sharpen pre-trained solutions in terms of the trace of the Hessian (i.e., contradicting the FM hypothesis). This contradiction also occurs in previous PAC-Bayes bounds (Tsuzuku et al., 2020; Kwon et al., 2021) due to their scale-dependent terms. For example, consider a linear NN

$$f(x, (\theta_1, \theta_2, \theta_3, \theta_4)) = (\theta_3, \theta_4)(\theta_1, \theta_2)^\top x = \theta_1\theta_3 x + \theta_2\theta_4 x.$$

The Jacobian of this NN is

$$J_\theta(x, \theta) = (\theta_3 x,\; \theta_4 x,\; \theta_1 x,\; \theta_2 x).$$

From the above definition, $(2, 1, 0.5, 1) = R_{2,1,1}((1, 1, 1, 1))$ changes the Jacobian while maintaining the predictions. To see this, $f(x, (2, 1, 0.5, 1)) = f(x, (1, 1, 1, 1)) = x + x = 2x$ for all $x \in \mathbb{R}$, but

$$J_\theta(x, (2, 1, 0.5, 1)) = (0.5x,\; x,\; 2x,\; x) \neq (x,\; x,\; x,\; x) = J_\theta(x, (1, 1, 1, 1))$$

in general. Therefore, the activation-wise rescaling transformation $R_{2,1,1}$ is a valid function-preserving scale transformation.

WD with BN layers (Ioffe & Szegedy, 2015). For parameters $W \in \mathbb{R}^{n_l \times n_{l-1}}$ preceding a BN layer, $\mathrm{BN}((\mathrm{diag}(\gamma)W)u) = \mathrm{BN}(Wu)$ for any input $u \in \mathbb{R}^{n_{l-1}}$ and positive vector $\gamma \in \mathbb{R}^{n_l}_+$.
This implies that scaling transformations on these parameters preserve the function represented by the NN: for all $x \in \mathbb{R}^d$ and $\gamma \in \mathbb{R}^{n_l}_+$, $f(x, \theta) = f(x, S_{\gamma,l,k}(\theta))$, where the scaling transformation $S_{\gamma,l,k}(\cdot)$ is defined for $i = 1, \ldots, P$ as

$$(S_{\gamma,l,k}(\theta))_i = \begin{cases} \gamma_k\cdot\theta_i, & \text{if } \theta_i \in \{\text{parameters connecting as input edges to the } k\text{-th activation of the } l\text{-th layer}\}, \\ \theta_i, & \text{otherwise}. \end{cases} \tag{21}$$

Note that WD (Loshchilov & Hutter, 2019; Zhang et al., 2019) can be implemented as a realization of $S_{\gamma,l,k}(\cdot)$ (e.g., $\gamma = 0.9995$ for all activations preceding BN layers). Therefore, thanks to Proposition 2.1 and Theorem 2.4, our CTK-based bound is invariant to WD applied to parameters before BN layers. We also refer to Van Laarhoven (2017) and Lobacheva et al. (2021) for the optimization perspective of WD with BN. For example, consider a BN layer with a single activation, $\mathrm{BN}(\theta x)$, where $x, \theta \in \mathbb{R}$ and the batch of inputs is $0, 0, 2, 2$. Then $f(x, \theta) = \theta x/|\theta|$ and the Jacobian of the NN is $J_\theta(x, \theta) = x/|\theta|$, since the denominator is detached from the backpropagation computation in auto-differentiation packages (e.g., TensorFlow, PyTorch, and JAX). From the above discussion, $0.9995 = S_{0.9995,1,1}(1)$ changes the Jacobian while maintaining the predictions. To see this, $f(x, 1) = f(x, 0.9995) = x$ for all inputs, while $J_\theta(x, 0.9995) = x/0.9995 \neq x = J_\theta(x, 1)$. Therefore, WD with a BN layer, $S_{0.9995,1,1}$, is a valid function-preserving scale transformation.
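A minimal sketch of the BN example above (our own illustration; the stop-gradient on the normalizer is modeled by simply excluding it from the hand-written Jacobian):

```python
import numpy as np

# Toy BN-style function with a detached normalizer, as in the example above:
# f(x, theta) = theta * x / |theta|, where |theta| is treated as a constant in
# backprop (stop-gradient), so J_theta(x, theta) = x / |theta|.
def f(x, theta):
    return theta * x / abs(theta)

def jac(x, theta):
    return x / abs(theta)   # denominator detached from differentiation

x, theta, gamma = 2.0, 1.0, 0.9995
assert np.isclose(f(x, theta), f(x, gamma * theta))              # WD preserves f
assert np.isclose(jac(x, gamma * theta), jac(x, theta) / gamma)  # Jacobian rescales
```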

F IMPLEMENTATIONS

Algorithm 1 RTO implementation of Connectivity Laplace
Require: training data $S$, pre-trained parameter $\theta^*$, number of samples $M$, prior scale $\alpha$, observational noise scale $\sigma$
for $m = 1, \ldots, M$ do
  Sample parameter noise $c_0^m \sim \mathcal{N}(c_0|0_P, \alpha^2 I_P)$ and label noise $\varepsilon^m \sim \mathcal{N}(\varepsilon|0_{NK}, \sigma^2 I_{NK})$.
  for $t = 1, \ldots, T$ do
    Randomly draw a mini-batch $B$ from $S$.
    Define $g^{\mathrm{lin}}_{\theta^*}(\mathcal{X}_B, c^m) = f(\mathcal{X}_B, \theta^*) + J_\theta(\mathcal{X}_B, \theta^*)\,\mathrm{diag}(\theta^*)\,c^m$ for mini-batch input $\mathcal{X}_B$.
    Define $L(c^m) = \frac{1}{2|B|\sigma^2}\|\mathcal{Y}_B + \varepsilon^m_B - g^{\mathrm{lin}}_{\theta^*}(\mathcal{X}_B, c^m)\|_2^2 + \frac{1}{2|B|\alpha^2}\|c^m - c_0^m\|_2^2$.
    Compute the backpropagation of $L(c^m)$ w.r.t. connectivity $c^m$.
    Update $c^m$ with the SGD optimizer.
  end for
end for
return samples from Connectivity Laplace $\{c^m\}_{m=1}^M$.

To estimate the empirical/generalization bound in Sec. 2.4 and calibrate uncertainty in Sec. 4.2, we need to sample $c$ from the posterior $Q_{\theta^*}(c)$. For this, we sample perturbations $\delta$ in connectivity space,

$$\delta \sim \mathcal{N}\!\left(\delta\,\Big|\,0_P,\ \left(\frac{I_P}{\alpha^2} + \frac{J_c^\top J_c}{\sigma^2}\right)^{-1}\right),$$

so that $c = \mu_Q + \delta$ for equation 6. To sample this, we provide a novel approach to sample from LA/CL without curvature approximation. To this end, consider the following optimization problem:

$$\arg\min_c L(c) := \arg\min_c \frac{1}{2N\sigma^2}\|\mathcal{Y} - f(\mathcal{X}, \theta^*) - J_c c + \varepsilon\|^2 + \frac{1}{2N\alpha^2}\|c - c_0\|_2^2,$$

where $\varepsilon \sim \mathcal{N}(\varepsilon|0_{NK}, \sigma^2 I_{NK})$ and $c_0 \sim \mathcal{N}(c_0|0_P, \alpha^2 I_P)$. By the first-order optimality condition, we have

$$N\nabla_c L(c^\star) = -\frac{J_c^\top(\mathcal{Y} - f(\mathcal{X}, \theta^*) - J_c c^\star + \varepsilon)}{\sigma^2} + \frac{c^\star - c_0}{\alpha^2} = 0_P.$$

Rearranging w.r.t. the optimizer $c^\star$, we get

$$c^\star = \left(J_c^\top J_c + \frac{\sigma^2}{\alpha^2} I_P\right)^{-1}\left(J_c^\top(\mathcal{Y} - f(\mathcal{X}, \theta^*)) + \frac{\sigma^2}{\alpha^2} c_0 + J_c^\top\varepsilon\right)$$
$$= \underbrace{\left(\frac{I_P}{\alpha^2} + \frac{J_c^\top J_c}{\sigma^2}\right)^{-1}\frac{J_c^\top(\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2}}_{\text{deterministic}} + \underbrace{\left(\frac{I_P}{\alpha^2} + \frac{J_c^\top J_c}{\sigma^2}\right)^{-1}\left(\frac{c_0}{\alpha^2} + \frac{J_c^\top\varepsilon}{\sigma^2}\right)}_{\text{stochastic}}$$


Since both $\varepsilon$ and $c_0$ are sampled from independent Gaussian distributions, we have

$$z := \frac{c_0}{\alpha^2} + \frac{J_c^\top\varepsilon}{\sigma^2} \sim \mathcal{N}\!\left(z\,\Big|\,0_P,\ \frac{I_P}{\alpha^2} + \frac{J_c^\top J_c}{\sigma^2}\right).$$

Therefore, the optimal solution of the randomized optimization problem $\arg\min_c L(c)$ satisfies

$$c \sim \mathcal{N}\!\left(c\,\Big|\,\left(\frac{I_P}{\alpha^2} + \frac{J_c^\top J_c}{\sigma^2}\right)^{-1}\frac{J_c^\top(\mathcal{Y} - f(\mathcal{X}, \theta^*))}{\sigma^2},\ \left(\frac{I_P}{\alpha^2} + \frac{J_c^\top J_c}{\sigma^2}\right)^{-1}\right) = \mathcal{N}(c|\mu_Q, \Sigma_Q).$$

Similarly, sampling the zero-mean perturbation $\delta$ from CL can be implemented with the optimization problem

$$\arg\min_c L(c) := \arg\min_c \frac{1}{2N\sigma^2}\|J_c c - \varepsilon\|^2 + \frac{1}{2N\alpha^2}\|c - c_0\|_2^2,$$

where $\varepsilon \sim \mathcal{N}(\varepsilon|0_{NK}, \sigma^2 I_{NK})$ and $c_0 \sim \mathcal{N}(c_0|0_P, \alpha^2 I_P)$. Since we sample the noise of the data/perturbation and then optimize the perturbation, this can be interpreted as a Randomize-Then-Optimize (RTO) implementation of Laplace approximation and Connectivity Laplace (Bardsley et al., 2014; Matthews et al., 2017). In Algorithm 1, we provide pseudo-code for the RTO implementation of CL. Note that both the time and memory complexity of computing the linearized NN for a mini-batch $B$ are comparable to a forward propagation, as shown in Novak et al. (2022), using the jax.jvp function in JAX (Bradbury et al., 2018). Therefore, the time/memory complexity of a mini-batch JVP is $O(|B|LW^2)$/$O(|B|W + LW^2 + NK)$ for MLPs with width $W$ and depth $L$.

For the VGG-style networks used in our experiments, we add BN layers after the convolution layer of each block. Specifically, the number of convolution layers in each conv block is the depth, and the number of channels of the convolution layers in the first conv block is the width. For the subsequent conv blocks, we follow the original VGG width multipliers ($\times 2$, $\times 4$, $\times 8$). An example with depth 1 and width 128 is depicted in Table 7.
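For toy sizes where $J_c$ fits in memory, the RTO sampler reduces to a regularized least-squares solve, which can be checked against the closed-form $\mu_Q, \Sigma_Q$ (our own sketch; `Jc` and `resid` are random stand-ins for the Jacobian and residual):

```python
import numpy as np

rng = np.random.default_rng(1)
P, NK = 4, 6
alpha, sigma = 1.0, 0.3

Jc = rng.normal(size=(NK, P))    # Jacobian w.r.t. connectivity (explicit, toy-sized)
resid = rng.normal(size=NK)      # stands in for Y - f(X, theta*)

Sigma_Q = np.linalg.inv(np.eye(P) / alpha**2 + Jc.T @ Jc / sigma**2)
mu_Q = Sigma_Q @ Jc.T @ resid / sigma**2

def rto_sample(c0, eps):
    # minimizer of ||resid + eps - Jc c||^2/(2 sigma^2) + ||c - c0||^2/(2 alpha^2)
    A = Jc.T @ Jc + (sigma**2 / alpha**2) * np.eye(P)
    return np.linalg.solve(A, Jc.T @ (resid + eps) + (sigma**2 / alpha**2) * c0)

# Zero noise recovers the posterior mean exactly:
assert np.allclose(rto_sample(np.zeros(P), np.zeros(NK)), mu_Q)

# Gaussian noise yields samples with mean mu_Q and covariance Sigma_Q:
samples = np.array([rto_sample(alpha * rng.normal(size=P), sigma * rng.normal(size=NK))
                    for _ in range(20000)])
assert np.allclose(samples.mean(0), mu_Q, atol=0.05)
assert np.allclose(np.cov(samples.T), Sigma_Q, atol=0.05)
```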

F.2 COMPUTING CONNECTIVITY SHARPNESS

We use an SGD optimizer with momentum 0.9. We train each model for 200 epochs and use a cosine learning rate scheduler (Loshchilov & Hutter, 2016) with 30% of the initial epochs as warm-up. The standard data augmentations for CIFAR-10 (padding, random crop, random horizontal flip, and normalization) are used for the training data. For the analysis, we only use models with training accuracy above 99%, following Jiang et al. (2020). As a result, we use 200 out of 243 trained models for our correlation analysis. For every experiment, we use 8 NVIDIA RTX 3090 GPUs.

UCI regression

We train an MLP with a single hidden layer. We fix $\sigma = 1$ and choose $\alpha$ from $\{0.01, 0.1, 1, 10, 100\}$ using the log-likelihood on a validation dataset, since the optimal $\alpha$ varies across regression datasets. We use 8 random seeds to compute the average and standard error of the test negative log-likelihoods.

Image classification task

We pre-train models for 200 epochs on the CIFAR-10/100 datasets (Krizhevsky, 2009) with ResNet-18 (He et al., 2016), as mentioned in Section 2.4. We set the ensemble size $M$ to 8, except for Deep Ensemble (Lakshminarayanan et al., 2017) and Batch Ensemble (Wen et al., 2020), for which we use 4 ensemble members due to computational cost. We use 4 random seeds to compute the standard errors, except for Deep Ensemble, which ensembles the NNs from the 4 different random seeds. For evaluation, we define the prediction of a single member as the one-hot representation of the network output with label smoothing. We select the label smoothing coefficient as 0.01 for CIFAR-10 and 0.1 for CIFAR-100. We define the ensemble prediction as the average of the single-member predictions. For OOD detection, we use the variance of the prediction in output space, which is competitive with recent OOD detection methods (Ren et al., 2019; Van Amersfoort et al., 2020). We use 0.01 for $\sigma$ and select the best $\alpha$ with cross-validation. For every experiment, we use 8 NVIDIA RTX 3090 GPUs.



https://github.com/sungyubkim/connectivity-tangent-kernel

¹For a simple two-layer linear NN $f(x) := W_2\sigma(W_1 x)$ with weight matrices $W_1, W_2$, the first case of equation 19 corresponds to the $k$-th row of $W_1$ and the second case of equation 19 corresponds to the $k$-th column of $W_2$.



Baseline sharpness measures: trace of Hessian (tr(H); Keskar et al. (2017)), trace of empirical Fisher (tr(F); Jastrzebski et al. (2021)), trace of empirical NTK at $\theta^*$, Fisher-Rao metric (FR; Liang et al. (2019)), Adaptive Sharpness (AS; Kwon et al. (2021)), and four PAC-Bayes-bound-based measures: Sharpness-Orig. (SO), PAC-Bayes-Orig. (PO), Sharpness-Mag. (SM), and PAC-Bayes-Mag. (PM).

Figure 1: Sensitivity to α. Expected calibration error (ECE), Negative Log-likelihood (NLL), and Brier score results on corrupted CIFAR-100 for ResNet-18. Showing the mean (line) and standard deviation (shaded area) across four different seeds.

Figure 2: Functions used in proofs

Comparison between PAC-Bayes-CTK and PAC-Bayes-NTK for ResNet-18

Correlation analysis of sharpness measures with the generalization gap. We refer to Sec. 4.1 for details of the sharpness measures (rows) and the correlation metrics for the sharpness-generalization relationship (columns).

Test negative log-likelihood on two UCI variants (Hernández-Lobato & Adams, 2015; Foong et al., 2019). We mark the best method among the four in bold and the best method among LL/CL in italics.

We implement full-curvature versions of LL and CL and evaluate them on the 9 UCI regression datasets (Hernández-Lobato & Adams, 2015) and their GAP variants (Foong et al., 2019) to compare calibration performance on in-between uncertainty. We measure test NLL for LL/CL and 2 baselines (Deep Ensemble (Lakshminarayanan et al., 2017) and Monte-Carlo DropOut (MCDO; Gal & Ghahramani (2016))). Eight ensemble members are used in Deep Ensemble, and 32 MC samples are used in LL, CL, and MCDO. Table 3 shows that CL performs better than LL on 6 of 9 datasets. Even though LL produces better calibration results on 3 of the datasets for both settings, the performance gaps between LL and CL are not as severe as on the other 6 datasets.

Uncertainty calibration results on CIFAR-100 (Krizhevsky, 2009) for ResNet-18 (He et al., 2016).





Example of network configuration with respect to depth 1 and width 128 in the style of (Simonyan & Zisserman, 2015).

Results for experiments on PAC-Bayes-CTK/NTK estimation with $N_P = 5{,}000$ and $N_Q = 45{,}000$. We use 4 random seeds to compute error bars.

Uncertainty calibration results on CIFAR-10 (Krizhevsky, 2009) for ResNet-18 (He et al., 2016).

Uncertainty calibration results on CIFAR-10 (Krizhevsky, 2009) for VGG-13 (Simonyan & Zisserman, 2015).

Uncertainty calibration results on CIFAR-100 (Krizhevsky, 2009) for VGG-13 (Simonyan & Zisserman, 2015).


ACKNOWLEDGEMENTS

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075 / No.2017-0-01779) and the National Research Foundation of Korea (NRF) grants (No.2018R1A5A1059921) funded by the Korea government (MSIT). This work was also supported by Samsung Electronics Co., Ltd (No.IO201214-08133-01).


Algorithm 2 Hutchinson's method for computing Connectivity Sharpness
Require: training data $S$, pre-trained parameter $\theta^*$, number of samples $M$
Initialize $x \leftarrow 0$.
for $m = 1, \ldots, M$ do
  Sample $z^m \in \mathbb{R}^{NK}$ with $\mathrm{cov}(z^m) = I_{NK}$ (e.g., from the Rademacher distribution).
  for each mini-batch $B$ of $S$ do
    Compute the vector-Jacobian product $v^m_B = (z^m_B)^\top J_\theta(\mathcal{X}_B, \theta^*)\,\mathrm{diag}(\theta^*)$.
    Update $x \leftarrow x + \|v^m_B\|_2^2/M$.
  end for
end for
return the estimated Connectivity Sharpness $x$.

It is well known that the empirical NTK or Jacobian is intractable for modern NN architectures (e.g., ResNet (He et al., 2016) or BERT (Devlin et al., 2018)). Therefore, one might wonder how Connectivity Sharpness (CS) can be computed for these architectures. Since CS is defined as the trace of the empirical CTK, one can compute it with Hutchinson's method (Hutchinson, 1989; Meyer et al., 2021). According to Hutchinson's method, the trace of a matrix $A \in \mathbb{R}^{m \times m}$ is

$$\mathrm{tr}(A) = \mathbb{E}_z[z^\top A z],$$

where $z \in \mathbb{R}^m$ is a random variable with $\mathrm{cov}(z) = I_m$ (e.g., standard normal or Rademacher distribution). Since $A = C^{\theta^*}_{\mathcal{X}} = J_c J_c^\top \in \mathbb{R}^{NK \times NK}$ in our case, we further use a mini-batch approximation to compute $z^\top A z$: (i) sample $z^m_B \in \mathbb{R}^{|B|K}$ from the Rademacher distribution for a mini-batch $B$, (ii) compute $v^m_B := J_c(\mathcal{X}_B, 0_P)^\top z^m_B \in \mathbb{R}^P$ with the vector-Jacobian product of JAX (Bradbury et al., 2018) (or via standard backprop), and (iii) compute $x^m_B = \|v^m_B\|_2^2$. Then, the sum of the $x^m_B$ over all mini-batches in the training dataset is a Monte-Carlo approximation of CS with sample size 1, and averaging over $M$ samples gives the final estimate. Empirically, we found that this approximation is sufficiently stable to capture the correlation between sharpness and generalization, as shown in Sec. 4.1. In Algorithm 2, we provide pseudo-code for the implementation. Note that both the time and memory complexity of computing $v^m_B$ are comparable to a backward propagation, as shown in Novak et al. (2022), using the jax.vjp function in JAX (Bradbury et al., 2018). Therefore, the time/memory complexity of a mini-batch VJP is $O(|B|LW^2)$/$O(|B|LW + LW^2 + NKW)$ for MLPs with width $W$ and depth $L$.
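A numpy sketch of the estimator (our own illustration; the Jacobian is materialized explicitly only because the example is tiny — in practice only VJPs are computed):

```python
import numpy as np

rng = np.random.default_rng(0)
NK, P = 8, 30
Jc = rng.normal(size=(NK, P))        # Jacobian w.r.t. connectivity (toy-sized)

cs_exact = np.trace(Jc @ Jc.T)       # Connectivity Sharpness = tr(CTK)

# Hutchinson: tr(A) = E[z^T A z] for cov(z) = I. With A = Jc Jc^T this equals
# E[||Jc^T z||^2], so only vector-Jacobian products are needed, never the CTK.
M = 20000
est = 0.0
for _ in range(M):
    z = rng.choice([-1.0, 1.0], size=NK)   # Rademacher probe
    v = Jc.T @ z                           # VJP
    est += (v @ v) / M

assert np.isclose(est, cs_exact, rtol=0.05)
```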

G PREDICTIVE UNCERTAINTY OF CONNECTIVITY/LINEARIZED LAPLACE

In this section, we derive the predictive uncertainty of Linearized Laplace (LL) and Connectivity Laplace (CL). By the matrix inversion lemma (Woodbury, 1950), the weight covariance of LL is

$$\left(\frac{I_P}{\alpha^2} + \frac{J_\theta^\top J_\theta}{\sigma^2}\right)^{-1} = \alpha^2 I_P - \alpha^2 J_\theta^\top\left(\sigma^2 I_{NK} + \alpha^2 J_\theta J_\theta^\top\right)^{-1} J_\theta\,\alpha^2.$$

Therefore, if $\sigma^2/\alpha^2 \to 0$, the weight covariance of LL converges to

$$\alpha^2\left(I_P - J_\theta^\top(J_\theta J_\theta^\top)^{-1} J_\theta\right).$$

With this weight covariance and the linearized NN, the predictive uncertainty of LL is

$$\mathrm{Var}_{\psi\sim p_{\mathrm{LA}}(\psi|S)}(f^{\mathrm{lin}}_{\theta^*}(x, \psi)) = \alpha^2\Theta^{\theta^*}_{xx} - \alpha^2\Theta^{\theta^*}_{x\mathcal{X}}(\Theta^{\theta^*}_{\mathcal{X}})^{-1}\Theta^{\theta^*}_{\mathcal{X}x}.$$

Similarly, the predictive uncertainty of CL is

$$\mathrm{Var}_{\psi\sim p_{\mathrm{CL}}(\psi|S)}(f^{\mathrm{lin}}_{\theta^*}(x, \psi)) = \alpha^2 C^{\theta^*}_{xx} - \alpha^2 C^{\theta^*}_{x\mathcal{X}}(C^{\theta^*}_{\mathcal{X}})^{-1} C^{\theta^*}_{\mathcal{X}x}.$$

H DETAILED EXPERIMENTAL SETTINGS

H.1 BOUND ESTIMATION

We pre-train ResNet-18 (He et al., 2016) with a mini-batch size of 1K on $S_P$ using SGD with initial learning rate 0.4 and momentum 0.9. We use cosine annealing for learning rate scheduling (Loshchilov & Hutter, 2016) with a warm-up for the initial 10% of training steps. We fix $\delta = 0.1$, $\alpha = 0.1$, and $\sigma = 1.0$ to compute equation 8. These values are chosen so that the PAC-Bayes-CTK and NTK bounds fall within the non-vacuous range. We use 3 random seeds to compute the standard errors. Additional results with few pre-training data ($N_P = 5{,}000$ and $N_Q = 45{,}000$) are presented in Appendix I. To verify that CS correlates better with generalization performance than existing sharpness measures, we evaluate three metrics: (a) Kendall's rank-correlation coefficient $\tau$ (Kendall, 1938), which measures the consistency between a sharpness measure and the generalization gap (i.e., whether higher sharpness implies a higher generalization gap); (b) the granulated Kendall's coefficient (Jiang et al., 2020), which examines Kendall's rank-correlation coefficient w.r.t. individual hyper-parameters to separately evaluate the effect of each hyper-parameter on the generalization gap; and (c) the conditional independence test (Jiang et al., 2020), which captures the causal relationship between a measure and generalization.
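Metric (a) can be sketched in a few lines (our own illustration; the granulated variant applies the same statistic within groups of models sharing all but one hyper-parameter):

```python
from itertools import combinations

# Kendall's tau-a: (number of concordant pairs - discordant pairs) / total pairs.
def kendall_tau(xs, ys):
    pairs = list(combinations(range(len(xs)), 2))
    s = sum(
        ((xs[i] > xs[j]) - (xs[i] < xs[j])) * ((ys[i] > ys[j]) - (ys[i] < ys[j]))
        for i, j in pairs
    )
    return s / len(pairs)

sharpness = [0.1, 0.4, 0.2, 0.9]
gen_gap = [0.05, 0.30, 0.10, 0.80]               # same ordering as sharpness
assert kendall_tau(sharpness, gen_gap) == 1.0    # perfectly consistent measure
assert kendall_tau(sharpness, [-g for g in gen_gap]) == -1.0
```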

H.2 CONNECTIVITY SHARPNESS

The three metrics are compared with the following baselines: trace of Hessian (tr(H); Keskar et al. (2017)), trace of Fisher information matrix (tr(F); Jastrzebski et al. (2021)), trace of empirical NTK at $\theta^*$ (tr($\Theta^{\theta^*}$)), and four PAC-Bayes-bound-based measures, Sharpness-Orig. (SO), PAC-Bayes-Orig. (PO), $1/\alpha'$ Sharpness-Mag. (SM), and $1/\sigma'$ PAC-Bayes-Mag. (PM), which are eqs. (52), (49), (62), and (61) in Jiang et al. (2020), respectively. For the granulated Kendall's coefficient, we use 5 hyper-parameters: network depth, network width, learning rate, WD, and mini-batch size, with 3 options for each hyper-parameter as in Table 6. We use VGG-13 as a base model and adjust the depth and width of each conv block, adding BN layers after each convolution layer.

Eigenvalue densities of empirical CTK and NTK. In this section, we follow Algorithm 1 of Ghorbani et al. (2019) to visualize the eigenvalue densities of the empirical CTK and NTK. We use 100 Lanczos iterations (Appendix F) with 4 realizations and fix the bandwidth of the RBF kernel as the difference between the maximum and minimum eigenvalues, following the implementation of Ghorbani et al. (2019). We present the results in Figure 3. Figure 3 shows that empirical CTKs have positively skewed (i.e., right-tailed) eigenspectra with modes close to 0. In other words, many of the non-zero eigenvalues of the CTK are close to zero. Therefore, the corresponding $h(\beta_i)$ for these eigenvalues are also close to zero, as shown in Corollary 2.3 (see Fig. 2b for the visualization). As with the empirical CTKs, the empirical NTKs are positively skewed, but their eigenvalue scales are much larger than those of the CTKs. In summary, our empirical study demonstrates that although an empirical CTK can have up to $NK$ non-zero eigenvalues, only a few eigenvalues are critical to the scale of $\sum_{i=1}^P h(\beta_i)$.

