DEMYSTIFYING THE OPTIMIZATION AND GENERALIZATION OF DEEP PAC-BAYESIAN LEARNING

Abstract

PAC-Bayes has long been a framework for generalization analysis, in which the expected population error is bounded by the sum of the training error and the divergence between the posterior and prior distributions. Beyond its success as an analysis tool, the PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, a procedure we refer to simply as PAC-Bayesian learning. PAC-Bayesian learning has been proven to achieve a competitive expected test error numerically while providing a tight generalization bound in practice, through gradient descent training. Despite this empirical success, theoretical analysis of deep PAC-Bayesian learning for neural networks remains scarce. To this end, this paper proposes a theoretical convergence and generalization analysis for PAC-Bayesian learning. For a deep and wide probabilistic neural network, we show that when PAC-Bayesian learning is applied, the convergence result corresponds to solving a kernel ridge regression with the probabilistic neural tangent kernel (PNTK) as its kernel. Based on this finding, we obtain, for the first time, an analytic and guaranteed PAC-Bayesian generalization bound, which improves over the Rademacher complexity-based bound for deterministic neural networks. Finally, drawing insight from our theoretical results, we propose a proxy measure for efficient hyperparameter selection, which is proven to be time-saving on various benchmarks.

PAC-Bayesian analysis. The Probably Approximately Correct (PAC) Bayes framework (McAllester, 1999a;b) incorporates knowledge about the learning algorithm and a probability distribution over a set of hypotheses, thus providing a test-performance (generalization) guarantee. Subsequently, the PAC-Bayesian method was adopted to analyze the generalization bound of probabilistic neural networks (Langford & Caruana, 2002b).
While the original PAC-Bayes theory only works with a bounded loss function, Haddouche et al. (2021) expanded the PAC-Bayesian theory to learning problems with unbounded loss functions. Furthermore, several improved PAC-Bayesian bounds suited to different scenarios were introduced by Bégin et al. (2014; 2016). As a result of the flexibility and generalization properties of PAC-Bayes, it is widely used to analyze complex, non-convex, and over-parameterized optimization problems, especially over-parameterized neural networks (Guedj, 2019). Neyshabur et al. (2017b) presented a generalization bound for feedforward neural networks with ReLU activations in terms of the product of the spectral norms of the layers and the Frobenius norms of the weights.

1. INTRODUCTION

Deep learning has demonstrated powerful learning capability due to its over-parameterized structure, and various network architectures have been responsible for its significant leaps in performance (LeCun et al., 2015). Over-fitting and complex hyperparameters are two of the major challenges in deep learning, hence designing generalization guarantees for deep networks is an important research goal (Zhang et al., 2021). Recently, a learning framework that trains a probabilistic neural network with a PAC-Bayesian bound as the objective function has been proposed (Bégin et al., 2016; Dziugaite & Roy, 2017; Neyshabur et al., 2017b; Raginsky et al., 2017; Neyshabur et al., 2017a; London, 2017; Smith & Le, 2017; Pérez-Ortiz et al., 2020; Guan & Lu, 2022), which is known as PAC-Bayesian learning. While providing a tight generalization bound, PAC-Bayesian learning has been proven to achieve a competitive expected test error (Ding et al., 2022). Furthermore, because this generalization bound is computed from the training data, it can obviate the need to split data into training, testing, and validation sets, which is highly applicable for training a deep network with scarce data (Pérez-Ortiz et al., 2020; Grünwald & Mehta, 2020). Meanwhile, these advances in PAC-Bayesian bounds have been widely adapted to different deep neural network structures, including convolutional neural networks (Zhou et al., 2018; Pérez-Ortiz et al., 2020), binary activated multilayer networks (Letarte et al., 2019), partially aggregated neural networks (Biggs & Guedj, 2020), and graph neural networks (Liao et al., 2020). Due to the impressive empirical success of PAC-Bayesian learning, there is increasing interest in understanding its theoretical properties.
However, existing analysis is either restricted to a specific variant, such as Entropy-SGD, which minimizes an objective indirectly by approximating stochastic gradient ascent on the so-called local entropy (Dziugaite & Roy, 2018a), or to differential privacy (Dziugaite & Roy, 2018b), or relies heavily on empirical exploration (Neyshabur et al., 2017a; Dziugaite et al., 2020). To the best of our knowledge, there has so far been no investigation into why the training of PAC-Bayesian learning succeeds and why the PAC-Bayesian bound is tight on unseen data after training. For example, when applying gradient descent to PAC-Bayesian learning, it is still unclear: Q1: How effective is gradient descent training on a training set? Q2: How tight is the generalization bound compared to those of learning frameworks using non-probabilistic neural networks? The answers to these questions can be highly non-trivial due to the inherently non-convex problem of over-parameterization (Jain & Kar, 2017), the additional randomness introduced by probabilistic neural networks (Specht, 1990), and the additional challenge brought by the divergence between posterior/prior distribution pairs, known as the Kullback-Leibler (KL) divergence. Nevertheless, this paper shows that it is possible to answer the above questions by leveraging recent advances in deep learning theory in the over-parameterized setting. It has been shown that wide networks optimized with gradient descent can achieve near-zero training error, and that the critical factor governing the training process is the neural tangent kernel (NTK), which can be proven to remain unchanged during gradient descent training (Jacot et al., 2018), thus guaranteeing convergence to a global minimum (Du et al., 2019; Allen-Zhu et al., 2019). Under the PAC-Bayesian framework, the NTK is no longer calculated from the derivatives with respect to the weights directly, but instead from the gradients with respect to the distribution parameters of the weights.
We call this the Probabilistic NTK (PNTK), based on which we build a convergence analysis to characterize the optimization process of PAC-Bayesian learning. Thanks to the explicit solution obtained from the optimization analysis, we further formulate the generalization bound of PAC-Bayesian learning for the first time, and demonstrate its advantage by comparing it with the theoretical generalization bound of learning frameworks with non-stochastic neural networks (Arora et al., 2019a; Cao & Gu, 2019; Hu et al., 2019). We summarize our contributions as follows:

• With a detailed characterization of gradient descent training of the PAC-Bayes objective function, we derive that the final solution is kernel ridge regression with the PNTK as its kernel.

• Based on the optimization solution, we derive an analytic and guaranteed PAC-Bayesian bound for deep networks for the first time. Moreover, our bound differs from other PAC-Bayes bounds: recent papers require computing the posterior distribution, while our bound is completely independent of it.

• The performance of PAC-Bayesian learning depends on the selection of a large number of hyperparameters. We design a training-free proxy based on our theoretical bound and show that it is effective and time-saving.

• Our technique for analyzing the optimization and generalization of probabilistic neural networks through over-parameterization has a wide range of applications, such as variational auto-encoders (Kingma & Welling, 2013; Rezende et al., 2014) and deep Bayesian networks (MacKay, 1992; Neal, 2012); we believe it can provide the basis for the analysis of over-parameterized probabilistic neural networks.

2. RELATED WORK

PAC-Bayesian learning. In addition to obtaining theoretical analyses of the generalization properties of deep learning, it is important to achieve a numerical bound on generalization for practical deep learning algorithms. Langford & Caruana (2002a) introduced a method to train a Bayesian neural network and used a refined PAC-Bayesian bound to compute an upper bound on the error. Later, Neyshabur et al. (2017a) extended Langford & Caruana (2002a)'s work by developing a training objective function derived from a relaxed PAC-Bayesian bound. In the standard application of PAC-Bayes, the prior is typically chosen to be a spherical Gaussian centered at the origin. However, without incorporating information from the data, the KL divergence might be unreasonably large, limiting the performance of the PAC-Bayes method. To address this gap, a large volume of literature proposes to obtain localized PAC-Bayes bounds via distribution-dependent priors learned from data (Ambroladze et al., 2007; Negrea et al., 2019; Dziugaite et al., 2020; Perez-Ortiz et al., 2021). Furthermore, Dziugaite & Roy (2018b); Tinsi & Dalalyan (2022) showed how a differentially private data-dependent prior yields a valid PAC-Bayes bound when the data distribution is presumed to be unknown. More recently, research has focused on providing PAC-Bayesian bounds for more realistic architectures, such as convolutional neural networks (Zhou et al., 2018), binary activated multilayer networks (Letarte et al., 2019), partially aggregated neural networks (Biggs & Guedj, 2020), and graph neural networks (Liao et al., 2020). We denote the practical use of the PAC-Bayesian algorithm to train over-parameterized neural networks as PAC-Bayesian learning; the goal of this work is to demystify the success behind deep learning trained via the PAC-Bayesian bound through the PNTK.

3. PRELIMINARY

Notation. We use bold-faced letters for vectors and matrices and non-bold-faced letters for scalars. We use ∥·∥_2 to denote the Euclidean norm of a vector or the spectral norm of a matrix, and ∥·∥_F to denote the Frobenius norm of a matrix. For a neural network, we denote by σ(x) the activation function. We denote [n] = {1, 2, . . . , n}. The least eigenvalue of a matrix A is denoted λ_0(A) = λ_min(A).

3.1. DEEP PROBABILISTIC NEURAL NETWORK

In PAC-Bayesian learning we use probabilistic neural networks (PNNs) instead of deterministic networks, where the weights always follow a certain distribution. In this work, we adopt the Gaussian distribution for the weights, and define an L-layer probabilistic neural network by the following recursive expression:

x^{(l)} = (1/√m) σ(W^{(l)} x^{(l-1)}), 1 ≤ l ≤ L;  f = v^⊤ x^{(L)},   (1)

where x^{(0)} = x ∈ R^d is the input, W^{(1)} ∈ R^{m×d} is the weight matrix at the first layer, W^{(l)} ∈ R^{m×m} is the weight matrix at the l-th layer for 2 ≤ l ≤ L, and v ∈ R^m is the weight vector at the output layer. To keep the weights Gaussian during gradient descent training, we introduce the re-parameterization trick (Kingma & Welling, 2013; Kingma et al., 2015):

W^{(l)} = W^{(l)}_µ + W^{(l)}_σ ⊙ ξ^{(l)}, ξ^{(l)} ∼ N(0, I), 1 ≤ l ≤ L;  v = v_µ + v_σ ⊙ ξ^{(v)}, ξ^{(v)} ∼ N(0, I),   (2)

where ⊙ denotes the element-wise product; thus ξ^{(l)} for 1 ≤ l ≤ L and ξ^{(v)} share the same size as their corresponding weight matrix or vector. The key insight of re-parameterization is to sample ξ^{(l)} for 1 ≤ l ≤ L and ξ^{(v)} from a normal distribution N(0, I), leaving W^{(l)}_µ, W^{(l)}_σ, v_µ, and v_σ deterministic. We adopt random initialization for the mean weights, where W^{(l)}_µ, v_µ ∼ N(0, c²_µ · I) for l ∈ [L]. With an abuse of notation, we omit the sizes of the mean 0 and variance I, which match their corresponding weight matrix or vector. On the other hand, we initialize the variance weights with an absolute constant, namely W^{(l)}_σ = v_σ = c²_σ · 1, where 1 is a matrix or vector with all elements equal to 1.
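A minimal numpy sketch of the forward pass defined by Eqs. (1)–(2) may clarify the re-parameterization trick. The ReLU activation, the layer sizes, and the helper names `init_pnn`/`forward` are our own illustrative choices, not the paper's experimental code:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_pnn(d, m, L, c_mu=1.0, c_sigma=0.1):
    """Initialize mean and variance parameters of an L-layer probabilistic net."""
    shapes = [(m, d)] + [(m, m)] * (L - 1)
    W_mu = [rng.normal(0.0, c_mu, s) for s in shapes]        # W_mu ~ N(0, c_mu^2 I)
    W_sig = [np.full(s, c_sigma**2) for s in shapes]         # variance weights start at c_sigma^2
    v_mu = rng.normal(0.0, c_mu, m)
    v_sig = np.full(m, c_sigma**2)
    return W_mu, W_sig, v_mu, v_sig

def forward(x, params):
    """One stochastic forward pass via the re-parameterization trick of Eq. (2):
    W = W_mu + W_sig * xi with xi ~ N(0, I) resampled each call."""
    W_mu, W_sig, v_mu, v_sig = params
    m = W_mu[0].shape[0]
    h = x
    for mu, sig in zip(W_mu, W_sig):
        xi = rng.standard_normal(mu.shape)
        W = mu + sig * xi
        h = np.maximum(W @ h, 0.0) / np.sqrt(m)  # relu activation with 1/sqrt(m) scaling
    v = v_mu + v_sig * rng.standard_normal(v_mu.shape)
    return v @ h

params = init_pnn(d=5, m=256, L=3)
x = rng.standard_normal(5)
# The output is a random variable: repeated passes give different samples.
samples = np.array([forward(x, params) for _ in range(20)])
```

Because only the deterministic parameters (W_µ, W_σ, v_µ, v_σ) carry gradients, this sampling scheme keeps the weights Gaussian throughout training.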

3.2. PAC-BAYESIAN LEARNING

Suppose data S = {(x_i, y_i)}_{i=1}^n are i.i.d. samples from a non-degenerate distribution D. Define H to be the hypothesis space and h(x) to be the prediction of hypothesis h ∈ H for x. Let R_D(h) = E_{(x,y)∼D}[ℓ(y, h(x))] denote the population (generalization) error of classifier h and R_S(h) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i)) denote its empirical error, where ℓ(·) is the loss function. In PAC-Bayes, the prior Q(0) ∈ H is the distribution over H at initialization, i.e., before training, and the posterior Q ∈ H is the distribution of the parameters after training. To make the evaluation of predictions based on the weight parameters W^{(l)} feasible for l ∈ [L], we adopt the Gaussian distribution for the parameters, and the expected population risk and empirical error are

R_D(Q) = E_{(x,y)∼D, h∼Q}[ℓ(y, h(x))] = E_{h∼Q}[R_D(h)],  R_S(Q) = E_{h∼Q}[R_S(h)].

The PAC-Bayes theory (Langford & Seeger, 2001; Seeger, 2002; Maurer, 2004) gives the following theorem:

Theorem 3.1. Let Q(0) ∈ H be some prior distribution over H. Then for any δ ∈ (0, 1], the following inequality holds uniformly for all posterior distributions Q ∈ H with probability at least 1 − δ:

kl(R_S(Q) ∥ R_D(Q)) ≤ (KL(Q∥Q(0)) + log(2√n/δ)) / n,   (3)

where KL(Q∥Q(0)) = E_Q[ln(Q/Q(0))] is the Kullback-Leibler (KL) divergence and kl(q∥q′) = q log(q/q′) + (1 − q) log((1 − q)/(1 − q′)) is the binary KL divergence. Furthermore, combining with Pinsker's inequality for the binary KL divergence, kl(p̂∥p) ≥ (p − p̂)²/(2p) for p̂ < p, yields

R_D(Q) − R_S(Q) ≤ √( 2 R_D(Q) (KL(Q∥Q(0)) + log(2√n/δ)) / n ).   (4)

Equation (4) is a classical result. It can be further combined with the inequality √(ab) ≤ (1/2)(λa + b/λ), valid for all λ > 0, which leads to the PAC-Bayes-λ bound in Theorem 3.2, as proposed by Thiemann et al. (2017):

Theorem 3.2. Let Q(0) ∈ H be some prior distribution over H.
Then for any δ ∈ (0, 1], the following inequality holds uniformly for all posterior distributions Q ∈ H and all λ ∈ (0, 2) with probability at least 1 − δ:

R_D(Q) ≤ R_S(Q)/(1 − λ/2) + (KL(Q∥Q(0)) + log(2√n/δ)) / (n λ (1 − λ/2)).   (5)

In this work, inspired by Catoni (2007); Rivasplata et al. (2019), we promote the PAC-Bayes bound to a training objective and choose Equation (5) for this purpose. We highlight that the original interest of Theorem 3.2 in Thiemann et al. (2017) is to allow the optimization of a quasi-convex objective in both λ and Q. However, since our main goal is to study the optimization and generalization properties of PNNs, we directly set λ = 1, omit the resulting factor of two, and express the objective function as

L(Q) = R_S(Q) + λ KL(Q∥Q(0))/n = E_{h∼Q}[ (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i)) ] + λ KL(Q∥Q(0))/n,   (6)

where λ is a hyperparameter introduced in a heuristic manner to make the method more flexible. Since the term involving δ is a constant, we omit it in the objective function. We set ℓ to be the squared loss in the training objective because it has the nice property that the final solution of the output function is explicit in the infinite-width limit; the global convergence can be extended to the cross-entropy loss as in existing works (Ji & Telgarsky, 2019; Chen et al., 2019). Instead of optimizing W^{(l)} and v directly, gradient descent with the re-parameterization trick leads to

W^{(l)}_µ(t+1) = W^{(l)}_µ(t) − η ∂L(Q)/∂W^{(l)}_µ(t);  W^{(l)}_σ(t+1) = W^{(l)}_σ(t) − η ∂L(Q)/∂W^{(l)}_σ(t),   (7)

where η is the learning rate. For simplicity, we omit the gradient descent expressions for v_µ and v_σ, and will omit the corresponding terms in the following text unless otherwise specified. To simplify the theoretical analysis, this work considers gradient flow instead; the same results can be extended to the gradient descent case with a careful analysis.
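Theorem 3.1 is typically turned into an explicit risk bound by numerically inverting the binary KL divergence. The sketch below illustrates this; the bisection helper `kl_inverse` and the example numbers for the empirical risk and complexity term are our own, not values from the paper:

```python
import numpy as np

def binary_kl(q, qp):
    """kl(q || q') for Bernoulli parameters, with clipping to avoid log(0)."""
    eps = 1e-12
    q, qp = np.clip(q, eps, 1 - eps), np.clip(qp, eps, 1 - eps)
    return q * np.log(q / qp) + (1 - q) * np.log((1 - q) / (1 - qp))

def kl_inverse(q_hat, c, tol=1e-9):
    """Largest p >= q_hat with kl(q_hat || p) <= c, found by bisection.
    Inverting Theorem 3.1 this way turns the kl inequality into an upper
    bound on the population risk R_D(Q)."""
    lo, hi = q_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q_hat, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# Assumed example: empirical risk R_S(Q) = 0.05 and complexity term
# (KL(Q||Q(0)) + log(2 sqrt(n)/delta)) / n = 0.02.
bound = kl_inverse(0.05, 0.02)
```

The resulting `bound` is the tightest population-risk certificate implied by Eq. (3), which is never looser than the relaxation in Eq. (4).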
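The objective in Eq. (6) can be estimated by Monte-Carlo sampling of the re-parameterized weights. The following toy sketch does this for a linear predictor with a diagonal Gaussian posterior; the function names, the closed-form Gaussian KL, and the synthetic data are all our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_diag_gauss(mu, sig2, mu0, sig02):
    """KL( N(mu, diag(sig2)) || N(mu0, diag(sig02)) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(sig02 / sig2) + (sig2 + (mu - mu0) ** 2) / sig02 - 1.0)

def pac_bayes_objective(mu, sig2, mu0, sig02, X, y, lam, n_mc=200):
    """L(Q) = E_{h~Q}[R_S(h)] + lam * KL(Q || Q(0)) / n for a linear predictor
    h(x) = w @ x with w ~ N(mu, diag(sig2)); the expectation is Monte-Carlo
    estimated via the re-parameterization w = mu + sqrt(sig2) * xi."""
    n = len(y)
    risks = []
    for _ in range(n_mc):
        w = mu + np.sqrt(sig2) * rng.standard_normal(mu.shape)
        risks.append(np.mean((y - X @ w) ** 2))      # squared loss, as in the paper
    return np.mean(risks) + lam * kl_diag_gauss(mu, sig2, mu0, sig02) / n

d, n = 3, 50
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
mu0, sig02 = np.zeros(d), np.ones(d)             # prior Q(0) = N(0, I)

obj_prior = pac_bayes_objective(mu0, sig02, mu0, sig02, X, y, lam=1.0)
obj_fit = pac_bayes_objective(w_true, 0.01 * np.ones(d), mu0, sig02, X, y, lam=1.0)
```

A posterior concentrated near the true weights pays a KL cost but reduces the empirical risk enough to lower the overall objective, which is exactly the trade-off Eq. (6) encodes.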

4. MAIN THEORETICAL RESULTS

In this section, Theorem 4.2 gives a precise characterization of how the objective function without KL divergence decreases to zero. We then extend the convergence characterization to the full objective, and find the final solution is a kernel ridge regression, as demonstrated by Theorem 4.3. As a consequence, we are able to establish an analytic generalization bound through Theorem 4.4.

4.1. OPTIMIZATION ANALYSIS

To simplify the analysis, we first consider the optimization of probabilistic neural networks of the form (1) with objective R_S(Q). In other words, we neglect the KL divergence term at this stage and show in the next section that the corresponding results extend to the objective with the KL divergence. Given this premise, we show that for an L-layer probabilistic neural network, the output function under gradient flow admits the following dynamics:

df(X; t)/dt = (∂f(X; t)/∂θ_µ)(∂θ_µ/∂t) + (∂f(X; t)/∂θ_σ)(∂θ_σ/∂t) = (y − f(X; t))(Θ_µ(X, X; t) + Θ_σ(X, X; t)),   (8)

where θ_µ ≡ ({W^{(l)}_µ}_{l=1}^L, v_µ) and θ_σ ≡ ({W^{(l)}_σ}_{l=1}^L, v_σ) are the collections of mean weights and variance weights. Here Θ_µ(X, X; t) ∈ R^{n×n} and Θ_σ(X, X; t) ∈ R^{n×n} are probabilistic neural tangent kernels (PNTKs), defined as follows.

Definition 4.1 (Probabilistic Neural Tangent Kernel). The tangent kernels associated with the output function f(X; t) at parameters θ_µ and θ_σ are defined as

Θ_µ(X, X; t) = (∂f(X; t)/∂θ_µ)(∂f(X; t)/∂θ_µ)^⊤ = Σ_{l=1}^L ∇_{W^{(l)}_µ} f(X; t) ∇_{W^{(l)}_µ} f(X; t)^⊤ + ∇_{v_µ} f(X; t) ∇_{v_µ} f(X; t)^⊤,

Θ_σ(X, X; t) = (∂f(X; t)/∂θ_σ)(∂f(X; t)/∂θ_σ)^⊤ = Σ_{l=1}^L ∇_{W^{(l)}_σ} f(X; t) ∇_{W^{(l)}_σ} f(X; t)^⊤ + ∇_{v_σ} f(X; t) ∇_{v_σ} f(X; t)^⊤.

Different from standard (deterministic) neural networks, the probabilistic network consists of two sets of parameters θ_µ and θ_σ, so the PNTK comprises two corresponding tangent kernels. One of the key findings of this work is that, at initialization and during training, the PNTKs Θ_µ(X, X) and Θ_σ(X, X) both converge to a limiting deterministic kernel Θ^∞(X, X) if m is sufficiently large, namely lim_{m→∞} Θ_µ(X, X) = Θ^∞(X, X) and lim_{m→∞} Θ_σ(X, X) = Θ^∞(X, X).
As a result, in the infinite-width limit, the dynamics of the output function under gradient flow are linear:

df(X; t)/dt = 2(y − f(X; t)) Θ^∞(X, X).   (9)

By leveraging this insight, we arrive at our main convergence theorem for deep probabilistic neural networks, stated formally as follows.

Theorem 4.2 (Convergence of probabilistic networks with large width). Suppose σ(·) is H-Lipschitz, λ_0(K^{(L)}_∞) > 0, and the network's width is

m = Ω( 2^{O(L)} max{ n² log(Ln/δ)/λ²_0(K^{(L)}_∞), n/δ, n⁵ log(2/δ)/λ^{10}_0(K^{(L)}_∞) } )

with the initialization above. Then, with probability at least 1 − δ over the random initialization, we have

R_S(Q(t)) ≤ exp(−λ_0(K^{(L)}_∞) t) R_S(Q(0)),   (10)

where we define K^{(l)}(x_i, x_j) ≡ (x^{(l)}_i)^⊤ x^{(l)}_j and K^{(l)}_∞(x_i, x_j) ≡ lim_{m→∞} (x^{(l)}_i)^⊤ x^{(l)}_j.

Our theorem establishes that if m is large enough, the expected training error converges to zero at a linear rate; in particular, the least eigenvalue of the PNTK governs the convergence rate. Moreover, we find that the change of the weights is bounded during training, which is consistent with the requirement of PAC-Bayes theory that the loss function be bounded.
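To make Definition 4.1 concrete, the sketch below forms the empirical PNTKs of a one-hidden-layer PNN from per-sample gradients and checks that Θ_µ and Θ_σ are already close at moderate width, in line with their common infinite-width limit. The architecture, width, and constants are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)

def pntk_kernels(X, m=2000, c_sigma=0.1):
    """Empirical PNTKs Theta_mu and Theta_sigma of a one-hidden-layer PNN
    f(x) = v^T relu(W x) / sqrt(m) with W = W_mu + W_sig * xi (Eq. (2))."""
    n, d = X.shape
    W_mu = rng.standard_normal((m, d))
    W_sig = np.full((m, d), c_sigma**2)
    xi = rng.standard_normal((m, d))
    W = W_mu + W_sig * xi
    v = rng.standard_normal(m)

    J_mu = np.zeros((n, m * d))      # per-sample gradients w.r.t. mean weights
    J_sig = np.zeros((n, m * d))     # per-sample gradients w.r.t. variance weights
    for a in range(n):
        act = (W @ X[a] > 0).astype(float)                     # relu'(W x)
        g = (v * act)[:, None] * X[a][None, :] / np.sqrt(m)    # df/dW_mu
        J_mu[a] = g.ravel()
        J_sig[a] = (g * xi).ravel()                            # df/dW_sig = df/dW_mu ⊙ xi
    return J_mu @ J_mu.T, J_sig @ J_sig.T                      # Gram matrices (Def. 4.1)

X = rng.standard_normal((8, 10))
T_mu, T_sig = pntk_kernels(X)
rel_gap = np.linalg.norm(T_mu - T_sig) / np.linalg.norm(T_mu)
```

Since E[ξ²] = 1, each entry of Θ_σ averages the corresponding entry of Θ_µ over the ξ noise, so the relative gap shrinks as the width grows, consistent with both kernels sharing the limit Θ^∞.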

4.2. TRAINING WITH KL DIVERGENCE

According to Equation (6), there is a KL divergence term in the objective function. We expand the KL divergence between two Gaussian distributions P(t) ≡ N(µ_t, σ²_t) and P(0) ≡ N(µ_0, σ²_0):

KL(P(t)∥P(0)) = (1/2) ( log(σ²_0/σ²_t) + (µ_t − µ_0)²/σ²_0 + σ²_t/σ²_0 − 1 ).   (12)

We then compare the gradients with respect to the mean weights w^{(l)}_µ and the variance weights w^{(l)}_σ. With a direct calculation, we have

∂f(x_i)/∂w^{(l)}_µ = (1/√m) (∂f(x_i)/∂x^{(l)}) σ′(w^{(l)} x^{(l−1)}) x^{(l−1)},  ∂f(x_i)/∂w^{(l)}_σ = (1/√m) (∂f(x_i)/∂x^{(l)}) σ′(w^{(l)} x^{(l−1)}) x^{(l−1)} ⊙ ξ^{(l)}.

There is one additional random variable ξ^{(l)} associated with the gradient of the variance weights, which results in the expected gradient being zero. Therefore, it is equivalent to fixing w_σ during gradient descent training, and we arrive at the conclusion that the probabilistic neural network performs kernel ridge regression in the infinite-width limit:

Theorem 4.3. Consider gradient descent on the objective function (6). Suppose m ≥ poly(n, 1/λ_0, 1/δ, 1/E). Then, with probability at least 1 − δ over the random initialization, we have

f(x, Q(t))|_{t=∞} = Θ^∞_µ(x, X) (Θ^∞_µ(X, X) + (λ/c²_σ) I)^{−1} y ± E,   (13)

where f(x, Q(t)) = E_{f∼Q(t)}[f(x; t)] aligns with the definition of the empirical loss function. Theorem 4.3 reveals the regularization effect of the KL term in PAC-Bayesian learning and presents an explicit expression for the convergence result of the output function.
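The limiting prediction in Eq. (13) is ordinary kernel ridge regression and can be sketched directly. Here an RBF kernel on synthetic 1-D data stands in for the limiting kernel Θ^∞_µ; that substitution, along with the function names, is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)

def krr_predict(K_test_train, K_train, y, lam, c_sigma):
    """Infinite-width prediction of Theorem 4.3:
    f(x) = Theta(x, X) (Theta(X, X) + (lam / c_sigma^2) I)^{-1} y."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + (lam / c_sigma**2) * np.eye(n), y)
    return K_test_train @ alpha

def rbf(A, B):
    """Toy RBF kernel standing in for the limiting kernel Theta_mu^infty."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

X = rng.uniform(-1, 1, (40, 1))
y = np.sin(3 * X[:, 0])
preds = krr_predict(rbf(X, X), rbf(X, X), y, lam=1e-3, c_sigma=1.0)
train_mse = np.mean((preds - y) ** 2)
```

Note how the KL weight λ and the variance initialization c_σ enter only through the ridge term λ/c²_σ, which is exactly the regularization effect Theorem 4.3 attributes to the KL divergence.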

4.3. GENERALIZATION ANALYSIS

We use the squared loss to train the probabilistic neural network but adopt a general loss ℓ ∈ [0, 1] to evaluate the PNN's generalization. Recall that Theorem 3.2 gives the PAC-Bayesian bound in terms of the distributions at initialization and after optimization. Therefore, combined with the results of Theorem 4.3, we provide a generalization bound for PAC-Bayesian learning under the ultra-wide condition.

Theorem 4.4 (PAC-Bayesian bound with NTK). Suppose data S = {(x_i, y_i)}_{i=1}^n are i.i.d. samples from a non-degenerate distribution D, and m ≥ poly(n, λ_0^{−1}, δ^{−1}). Consider any loss function ℓ : R × R → [0, 1] that is 1-Lipschitz in the first argument such that ℓ(y, y) = 0. Then with probability at least 1 − δ over the random initialization and the training samples, the probabilistic neural network trained by gradient descent for T ≥ Ω((1/(ηλ_0)) log(n/δ)) iterations has population risk R_D(Q) bounded as

R_D(Q) ≤ y^⊤(Θ^∞_µ(X, X) + (λ/c²_σ)I)^{−1} y / (n c²_σ) + (λ/c²_σ) √( y^⊤(Θ^∞_µ(X, X) + (λ/c²_σ)I)^{−2} y / n ) + O( log(2√n/δ) / n ).   (14)

The proof can be found in Appendix C. This theorem establishes a meaningful generalization bound for the PAC-Bayesian learning framework, thus providing a theoretical guarantee. Compared to the PAC-Bayes bound (5), our bound is analytic and computable. We further demonstrate the advantage of PAC-Bayesian learning by comparing it with the Rademacher complexity-based generalization bound for deterministic neural networks with a kernel ridge regression solution.

Theorem 4.5 (Rademacher bound with NTK). Suppose data S = {(x_i, y_i)}_{i=1}^n are i.i.d. samples from a non-degenerate distribution D, and m ≥ poly(n, λ_0^{−1}, δ^{−1}). Consider any loss function ℓ : R × R → [0, 1] that is 1-Lipschitz in the first argument such that ℓ(y, y) = 0.
Then with probability at least 1 − δ over the random initialization and the training samples, the deterministic neural network trained by gradient descent for T ≥ Ω((1/(ηλ_0)) log(n/δ)) iterations has population risk R_D bounded as

R_D ≤ √( y^⊤(Θ^∞_µ(X, X) + (λ/c²_σ)I)^{−1} y / n ) + (λ/c²_σ) √( y^⊤(Θ^∞_µ(X, X) + (λ/c²_σ)I)^{−2} y / n ) + O( √( log(n/(λ_0 δ)) / n ) ).   (15)

Theorem 4.5 follows Theorem 5.1 in Hu et al. (2019), which presents a Rademacher complexity-based generalization bound for ultra-wide neural networks with a kernel ridge regression solution. A similar analysis for kernel regression without regularization based on the NTK can be found in Arora et al. (2019a); Cao & Gu (2019). The main difference between the two generalization bounds lies in the first term: y^⊤(Θ^∞_µ(X, X) + (λ/c²_σ)I)^{−1} y/(n c²_σ) in the PAC-Bayesian bound versus √( y^⊤(Θ^∞_µ(X, X) + (λ/c²_σ)I)^{−1} y / n ) in the Rademacher bound. This is because the PAC-Bayesian bound counts the KL divergence while the Rademacher bound computes the reproducing kernel Hilbert space (RKHS) norm. The convergence rates of these terms differ: one is O(1/n) and the other is O(1/√n). Therefore, we conclude that the PAC-Bayesian bound yields a numerical improvement over the Rademacher complexity-based bound when n is large.
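The different first terms of Theorems 4.4 and 4.5 can be compared numerically. In the sketch below, an RBF kernel and the target y_i = sin(3x_i) stand in for Θ^∞_µ and the labels (both are illustrative assumptions); the PAC-Bayes term then shrinks roughly like 1/n while the Rademacher term shrinks like 1/√n:

```python
import numpy as np

def bound_terms(n, lam=1e-3, c_sigma=1.0):
    """First terms of the PAC-Bayes (Thm 4.4) and Rademacher (Thm 4.5) bounds,
    with an RBF kernel standing in for Theta_mu^infty and y_i = sin(3 x_i)."""
    X = np.linspace(-1.0, 1.0, n)
    y = np.sin(3 * X)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2)
    # quad = y^T (K + lam/c^2 I)^{-1} y, roughly the (regularized) RKHS norm of y
    quad = y @ np.linalg.solve(K + (lam / c_sigma**2) * np.eye(n), y)
    pac_bayes = quad / (n * c_sigma**2)   # first term of Eq. (14): O(1/n)
    rademacher = np.sqrt(quad / n)        # first term of Eq. (15): O(1/sqrt(n))
    return pac_bayes, rademacher

pb_100, rad_100 = bound_terms(100)
pb_400, rad_400 = bound_terms(400)
```

Since the quadratic form stays roughly constant as n grows (it approaches the RKHS norm of the target), quadrupling n shrinks the PAC-Bayes term by about 4× but the Rademacher term by only about 2×.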

5. PROOF SKETCH

To prove Theorem 4.2, we first show that the PNTKs at initialization are close to the limiting kernel provided the width is large enough. We then prove, by induction, that the distance between the PNTKs and the limiting kernel remains bounded during training, while the loss enjoys a linear convergence rate. Our proof framework is similar to that of Du et al. (2019); Arora et al. (2019a). However, the main difference is that our network architecture is considerably more complex (e.g., the probabilistic network contains two sets of parameters), and each set involves its own randomness, which requires bounding many terms more elaborately. The detailed proof can be found in Appendix A. The proof of Theorem 4.3 uses a linearization argument for the network model in the infinite-width limit. This allows us to obtain an ordinary differential equation for the output function whose solution is kernel ridge regression. The details are given in Appendix B. For the generalization analysis, we defer the proof of Theorem 4.4 to Appendix C. It is based on a characterization of the empirical error and the KL divergence term via the explicit solution found in Theorem 4.3.

6. EXPERIMENTS

As an extension of our PAC-Bayesian bound in Theorem 4.4, we provide a training-free metric to approximate the PAC-Bayesian bound via the PNTK, which can be used to select the best hyperparameters without any training, eliminating excessive computation time. Besides, we provide an empirical verification of our theory in Appendix D.1 and a comparison of theoretical bounds with empirical bounds in Appendix D.2.

6.1. EXPERIMENTAL SETUP

In all experiments, the parameters are initialized with the NTK parameterization, following Equation (1). Specifically, the initial mean weights θ_µ are sampled from a truncated Gaussian distribution with mean zero and standard deviation 1, truncated at two standard deviations. To ensure that the variance is positive, the initial variance weight is obtained from a given value ρ_0 through the softplus formula c_σ = log(1 + exp(ρ_0)). In Section 6.2, we use both fully connected and convolutional neural network structures to perform experiments on the MNIST and CIFAR10 datasets, demonstrating the effectiveness of our training-free PAC-Bayesian bound for searching hyperparameters across different datasets and network structures. In particular, we build a 3-layer fully connected neural network with 600 neurons in each layer, while the convolutional architecture has a total of 13 layers with around 10 million learnable parameters. We adopt a data-dependent prior since it is a practical and popular method (Perez-Ortiz et al., 2021; Fortuin, 2022). Specifically, this data-dependent prior is pre-trained on a subset of the training data with empirical risk minimization; the networks for posterior training are then initialized with the weights learned for the prior. Finally, the generalization bound is computed using Equation (5). The relevant settings follow the work of Pérez-Ortiz et al. (2020), such as the confidence parameters for the risk certificate and the Chernoff bound, and the 150,000 Monte Carlo samples used to estimate the risk certificate. The PAC-Bayesian learning framework provides competitive performance with non-vacuous generalization bounds. However, the tightness of this generalization bound depends on the hyperparameters used, such as the proportion of data used for the prior, the initialization ρ_0, and the KL penalty weight λ.
Since these three values do not change during training, we refer to them as hyperparameters. Choosing the right hyperparameters via grid search is prohibitive, as each attempt to compute the generalization bound can consume significant computational resources. Another plausible approach is to design a predictive, "training-free" metric so that we can approximate the error bound without going through an expensive training process. In light of this goal, we have already developed a generalization bound in Theorem 4.4 via the NTK. Since the NTK remains essentially unchanged during training, we can predict the generalization bound by the following proxy metric:

PA = Tr[ (Θ̂ + (λ/c²_σ)I)^{−1} · yy^⊤ ] / (c²_σ · n) + (λ/c²_σ) √( Tr[ (Θ̂ + (λ/c²_σ)I)^{−2} · yy^⊤ ] / n ),   (16)

where Θ̂ is the empirical NTK associated with the mean weights, measured on a finite-width neural network at initialization, yy^⊤ is an n × n label-similarity matrix (an entry is one if the two data points share the same label and zero otherwise), and n is the number of data points used. Note that the proposed proxy metric in Eq. (16) shares the same spirit as kernel alignment, a label-similarity metric that has been widely used in deep active learning (Wang et al., 2021), model selection for fine-tuning (Deshpande et al., 2021), and neural architecture search (NAS) (Mok et al., 2022). To demonstrate the computational practicality of this training-free metric, we compute PA using only a subset of the data for each class (325 per class for FCN and 75 per class for CNN). We should also mention that training-free methods for searching neural architectures are not new; examples can be found in NAS (Chen et al., 2021; Deshpande et al., 2021), MAE Random Sampling (Camero et al., 2021), and pruning at initialization (Abdelfattah et al., 2021).
To the best of our knowledge, there is currently no training-free method for selecting hyperparameters in the PAC-Bayesian framework, which we consider to be one of the novelties of this paper. Figure 1 demonstrates a strong correlation between PA and the actual generalization bound. Finally, we demonstrate that by searching through all possible combinations of hyperparameters using PA, it is possible to select hyperparameters that lead to a result comparable to the best generalization bound, but without excessive computation. To put things in perspective, in Table 1 we compare the risk certificates and computation time of three hyperparameter search methods (exhaustive search, Bayesian search, and PA) on the two architectures (FCN and CNN) and two datasets (MNIST and CIFAR10). Unlike exhaustive search, where the best set of hyperparameters is selected from 648 different combinations (9 data-dependent priors trained on different subsets of data, 9 values of the KL penalty, and 8 values of ρ_0), Bayesian search takes only 36 iterations to find the lowest bound, since it evaluates the information from past iterations and efficiently selects the next set of hyperparameters based on this knowledge. Yet reducing the number of search iterations cannot sufficiently reduce the overall computation time when training a large and complex model; for instance, on the CIFAR10 dataset, it takes 45 hours to train a CNN with the bound. In contrast, the training-free PA method reduces the computation time by a factor of 83.33 while finding a bound close to the lowest risk certificate.
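The proxy in Eq. (16) is cheap to compute once a kernel is in hand. In the sketch below, the two toy kernels (a label-aligned one and a noisy identity) are our own stand-ins for empirical NTKs arising from different hyperparameter settings, chosen so that better label alignment yields a lower PA score:

```python
import numpy as np

rng = np.random.default_rng(4)

def pa_score(K, labels, lam=0.1, c_sigma=1.0):
    """Training-free proxy of Eq. (16):
    PA = Tr[(K + lam/c^2 I)^{-1} Y Y^T] / (c^2 n)
         + (lam/c^2) * sqrt( Tr[(K + lam/c^2 I)^{-2} Y Y^T] / n ),
    where Y Y^T is the 0/1 label-similarity matrix."""
    n = K.shape[0]
    Y = (labels[:, None] == labels[None, :]).astype(float)
    A = np.linalg.inv(K + (lam / c_sigma**2) * np.eye(n))
    t1 = np.trace(A @ Y) / (c_sigma**2 * n)
    t2 = (lam / c_sigma**2) * np.sqrt(np.trace(A @ A @ Y) / n)
    return t1 + t2

labels = np.array([0] * 10 + [1] * 10)
# Kernel well aligned with the labels vs. an unstructured noisy kernel.
aligned = (labels[:, None] == labels[None, :]).astype(float) + np.eye(20)
noisy = np.eye(20) + 0.1 * np.abs(rng.standard_normal((20, 20)))
noisy = (noisy + noisy.T) / 2

score_aligned = pa_score(aligned, labels)
score_noisy = pa_score(noisy, labels)
```

In practice one would plug in the empirical NTK of each candidate hyperparameter setting and keep the setting with the smallest PA, mirroring the grid/Bayesian-search comparison in Table 1 at a fraction of the cost.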

7. DISCUSSION

In this work, we theoretically prove that the learning dynamics of deep probabilistic neural networks trained with objectives derived from PAC-Bayes bounds are exactly described by the NTK in the over-parameterized setting. Empirical investigation reveals that this agrees well with the actual training process. Furthermore, the expected output function trained with a PAC-Bayesian bound converges to kernel ridge regression under a mild assumption. Based on this finding, we obtain an explicit generalization bound with respect to the NTK for PAC-Bayesian learning, which improves over the generalization bound obtained through the NTK for a non-probabilistic neural network. Finally, we show that the PAC-Bayesian bound score, a training-free method, can effectively select hyperparameters that lead to a lower generalization bound without the excessive computation time that a brute-force grid search would incur. In summary, we establish our theoretical analysis of PAC-Bayes with a randomly initialized prior. Note that the neural tangent kernel cannot characterize the feature-learning process in deep learning (Damian et al., 2022; Ba et al., 2022). Given the NTK techniques used, this paper does not attempt to capture feature learning for probabilistic neural networks, but it does provide important new convergence and generalization analyses for PAC-Bayesian learning. One promising direction would be to study PAC-Bayesian learning with data-dependent priors via the NTK.

8. REPRODUCIBILITY STATEMENT

To ensure the results and conclusions of our paper are reproducible, we make the following efforts. Theoretically, we state the full set of assumptions and include complete proofs of our theoretical results in Section 4 and Appendices A, B, and C. Experimentally, we provide our code and the instructions needed to reproduce the main experimental results, and we specify all training and implementation details in Section 6 and Appendix D.

A PROOF OF THEOREM 4.2

Theorem A.1 (Restatement of Theorem 4.2). Suppose $\sigma(\cdot)$ is $H$-Lipschitz and the network width satisfies
$$m = \Omega\left(2^{O(L)} \max\left\{ \frac{n^2 \log(Ln/\delta)}{\lambda_0^2(K^{(L)}_\infty)},\ \frac{n}{\delta},\ \frac{n^5 \log(2/\delta)}{10\,\lambda_0^2(K^{(L)}_\infty)} \right\}\right)$$
with the initialization. Then, with probability at least $1-\delta$ over the random initialization, we have
$$R_S(Q(t)) \le \exp\big(-\lambda_0(K^{(L)}_\infty)\, t\big)\, R_S(Q(0)),$$
where we define $K^{(l)}(x_i, x_j) \equiv (x_i^{(l)})^\top x_j^{(l)}$ and $K^{(l)}_\infty(x_i, x_j) \equiv \lim_{m\to\infty} (x_i^{(l)})^\top x_j^{(l)}$.

Proof Sketch of Theorem A.1. To study the behavior of the output function under gradient flow, we first write down its dynamics:
$$\frac{df(X;t)}{dt} = \frac{\partial f(X;t)}{\partial \theta_\mu}\frac{\partial \theta_\mu}{\partial t} + \frac{\partial f(X;t)}{\partial \theta_\sigma}\frac{\partial \theta_\sigma}{\partial t} = (y - f(X;t))\big(\Theta_\mu(X,X;t) + \Theta_\sigma(X,X;t)\big),$$
where $\Theta_\mu$ and $\Theta_\sigma$ are the PNTKs of the whole network, composed of the NTKs of each layer. We observe that if $\Theta_\mu$ and $\Theta_\sigma$ converge to a deterministic kernel, then the dynamics of the output function admit a linear system, which is tractable during evolution. Before presenting the main steps, we introduce the Neural Network Gaussian Process (NNGP) of the studied neural network in the infinite-width limit (Lee et al., 2017), defined as
$$K^{(l)}(x_i, x_j) \equiv (x_i^{(l)})^\top x_j^{(l)}, \qquad K^{(l)}_\infty(x_i, x_j) \equiv \lim_{m\to\infty} (x_i^{(l)})^\top x_j^{(l)},$$
where the subscripts $i, j$ denote the indices of input samples. Instead of showing that $\Theta_\mu$ and $\Theta_\sigma$ are close to $\Theta_\infty$ in the infinite-width limit, we use $K^{(L)}$ as an anchor kernel. With the relation between the NTK and the NNGP, we can simplify our proof.
Therefore, to prove Theorem A.1, the three core steps are:

Step 1. Show that at initialization, $\lambda_{\min}(\Theta_\mu(0)), \lambda_{\min}(\Theta_\sigma(0)) \ge \frac{\lambda_{\min}(K^{(L)})}{2}$, and derive the required condition on $m$.

Step 2. Show that during training, $\lambda_{\min}(\Theta_\mu(t)), \lambda_{\min}(\Theta_\sigma(t)) \ge \frac{\lambda_{\min}(K^{(L)})}{2}$, and derive the required condition on $m$.

Step 3. Show that during training the empirical loss converges at a linear rate.

In our proof, we mainly focus on deriving the condition on $m$ by analyzing $\lambda_{\min}(\Theta_\mu(0))$ and $\lambda_{\min}(\Theta_\sigma(0))$ at initialization through Lemma A.2 and Lemma A.3. For Step 2, we construct Lemma A.5 and Lemma A.6 to demonstrate that $\lambda_{\min}(\Theta_\mu(t)), \lambda_{\min}(\Theta_\sigma(t)) \ge \frac{\lambda_{\min}(K^{(L)})}{2}$, which yields the required condition on $m$ during training. Finally, we combine the preceding lemmas and conclude through Lemma A.7 that the training error converges at a linear rate.
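The least-eigenvalue condition in Step 1 is easy to illustrate numerically. The sketch below forms the empirical tangent kernel of a toy two-layer ReLU network $f(x) = \frac{1}{\sqrt{m}} v^\top \sigma(Wx)$ with respect to $W$ and checks that its smallest eigenvalue is strictly positive; the architecture and sizes are assumptions for illustration, not the networks studied in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 5, 4096  # samples, input dim, width (wide regime)

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs

# Empirical NTK w.r.t. W for f(x) = (1/sqrt(m)) v^T relu(W x):
# Theta_ij = x_i^T x_j * (1/m) sum_r v_r^2 1{w_r^T x_i>0} 1{w_r^T x_j>0}
W = rng.normal(size=(m, d))
v = rng.normal(size=m)

act = (X @ W.T > 0).astype(float)            # n x m activation indicators
Theta = (X @ X.T) * ((act * v**2) @ act.T) / m

lam_min = np.linalg.eigvalsh(Theta).min()
print(lam_min)  # strictly positive for generic data with high probability
```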

A.1 STEP 1. BOUNDING LEAST EIGENVALUE OF PNTK AT INITIALIZATION

We first study the behavior of the tangent kernels under an ultra-wide condition, namely $m = \mathrm{poly}(n, 1/\lambda_0, 1/\delta)$, at initialization. Lemmas A.2 and A.3 demonstrate that if $m$ is large, then the features of each layer are approximately normalized, and $\Theta_\mu(0)$ and $\Theta_\sigma(0)$ have smallest eigenvalues bounded from below with high probability.

Lemma A.2 (Feature norm at initialization). Suppose $\sigma(\cdot)$ is $H$-Lipschitz. If $m = \Omega\big(\frac{nL\, g_C(L)^2}{\delta}\big)$, where $C \equiv (c_\mu^2 + c_\sigma^2)H\big(2|\sigma(0)|\sqrt{2/\pi} + 2H\big)$, and $W_\mu$ and $W_\sigma$ are initialized in the form described in Section 3.1, then with probability at least $1-\delta$ over the random initialization, for each $l \in [L]$ and $i \in [n]$ we have
$$\frac{1}{2} \le \|x_i^{(l)}(0)\|_2 \le 2,$$
where the geometric series function is $g_C(l) = \sum_{i=0}^{l-1} C^i$.

Lemma A.3 (PNTK at initialization). Suppose $\sigma(\cdot)$ is $H$-Lipschitz. If $m = \Omega\big(\frac{n^2 \log(Ln/\delta)\, 2^{O(L)}}{\lambda_{\min}^2(K^{(L)})}\big)$, and $W_\mu$ and $W_\sigma$ are initialized in the form described in Section 3.1, then with probability at least $1-\delta$ we have
$$\lambda_{\min}\big(\Theta^{(L)}_\mu(0)\big) \ge \tfrac{3}{4}\lambda_{\min}(K^{(L)}), \qquad \lambda_{\min}\big(\Theta^{(L)}_\sigma(0)\big) \ge \tfrac{3}{4}\lambda_{\min}(K^{(L)}).$$

Proof of Lemma A.2. The proof is by induction. The induction hypothesis is that with probability at least $1 - (l-1)\frac{\delta}{nL}$ over $W^{(1)}(0), \ldots, W^{(l-1)}(0)$, for every $1 \le l' \le l-1$ we have
$$\frac{1}{2} \le 1 - \frac{g_C(l')}{2 g_C(L)} \le \|x_i^{(l')}(0)\|_2 \le 1 + \frac{g_C(l')}{2 g_C(L)} \le 2.$$
Note that there are two sources of randomness in each $W^{(l)}$ for $l \in [L]$, as can be seen from the expression
$$W^{(l)} = W^{(l)}_\mu + W^{(l)}_\sigma \odot \xi^{(l)}.$$
The first randomness comes from the initialization of $W^{(l)}_\mu$, and the second from the random variable $\xi$.
We then unify the two sources of randomness into one, namely $W^{(l)} \sim \mathcal{N}\big(0, (c_\mu^2 + c_\sigma^2)\cdot I\big)$, through the following argument:
$$P(W^{(l)}_{ij}) = \int \frac{1}{\sqrt{2\pi}\,c_\sigma}\, e^{-\frac{(W^{(l)}_{ij} - W^{(l)}_{\mu,ij})^2}{2c_\sigma^2}}\, P(W^{(l)}_{\mu,ij})\, dW^{(l)}_{\mu,ij}.$$
Plugging the density function $P(W^{(l)}_{\mu,ij})$ into the above expression, we obtain
$$P(W^{(l)}_{ij}) = \int \frac{1}{\sqrt{2\pi}\,c_\sigma}\, e^{-\frac{(W^{(l)}_{ij} - W^{(l)}_{\mu,ij})^2}{2c_\sigma^2}}\, \frac{1}{\sqrt{2\pi}\,c_\mu}\, e^{-\frac{(W^{(l)}_{\mu,ij})^2}{2c_\mu^2}}\, dW^{(l)}_{\mu,ij} = \frac{1}{\sqrt{2\pi(c_\mu^2 + c_\sigma^2)}}\, e^{-\frac{(W^{(l)}_{ij})^2}{2(c_\mu^2 + c_\sigma^2)}}.$$
With this density at hand, we continue to bound $\|x_i^{(l)}\|_2^2$, taking the expectation over the randomness of $W^{(l)}(0)$. From the feed-forward expression we know that
$$\|x_i^{(l)}(0)\|_2^2 = \frac{1}{m}\sum_{r=1}^m \sigma\big(w_r^{(l)}(0)^\top x_i^{(l-1)}(0)\big)^2.$$
Then we have
$$\mathbb{E}\big[\|x_i^{(l)}(0)\|_2^2\big] = \mathbb{E}\big[\sigma\big(w_r^{(l)}(0)^\top x_i^{(l-1)}(0)\big)^2\big] = (c_\mu^2 + c_\sigma^2)\, \mathbb{E}_{Z\sim\mathcal{N}(0,1)}\big[\sigma(\|x^{(l-1)}\|_2 Z)^2\big].$$
Because $\sigma(\cdot)$ is $H$-Lipschitz, for $\frac{1}{2} \le \alpha \le 2$ we have
$$\begin{aligned}
\big|\mathbb{E}_{Z}[\sigma(\alpha Z)^2] - \mathbb{E}_{Z}[\sigma(Z)^2]\big|
&\le \mathbb{E}_{Z}\big[|\sigma(\alpha Z)^2 - \sigma(Z)^2|\big] \le H|\alpha - 1|\cdot \mathbb{E}_{Z}\big[|Z(\sigma(\alpha Z) + \sigma(Z))|\big] \\
&\le H|\alpha - 1|\cdot \mathbb{E}_{Z}\big[|Z|\,(2|\sigma(0)| + H|(\alpha+1)Z|)\big] \\
&\le H|\alpha - 1|\cdot \big(2|\sigma(0)|\,\mathbb{E}_{Z}[|Z|] + H|\alpha+1|\,\mathbb{E}_{Z}[Z^2]\big) \\
&= H|\alpha - 1|\cdot \big(2|\sigma(0)|\sqrt{2/\pi} + H|\alpha+1|\big) \le \frac{C}{c_\mu^2 + c_\sigma^2}\,|\alpha - 1|,
\end{aligned}$$
where we define $C \equiv (c_\mu^2 + c_\sigma^2)H\big(2|\sigma(0)|\sqrt{2/\pi} + 2H\big)$. For the variance we have
$$\mathrm{Var}\big[\|x_i^{(l)}(0)\|_2^2\big] = \frac{(c_\sigma^2 + c_\mu^2)^2}{m}\,\mathrm{Var}\big[\sigma(w_r^{(l)}(0)^\top x_i^{(l)}(0))^2\big] \le \frac{(c_\sigma^2 + c_\mu^2)^2}{m}\,\mathbb{E}\big[\sigma(w_r^{(l)}(0)^\top x_i^{(l)}(0))^4\big] \le \frac{(c_\sigma^2 + c_\mu^2)^2}{m}\,\mathbb{E}\big[\big(|\sigma(0)| + H|w_r^{(l)}(0)^\top x_i^{(l)}(0)|\big)^4\big] \le \frac{C_2}{m},$$
where $C_2 \equiv \sigma(0)^4 + 8|\sigma(0)|^3 H\sqrt{2/\pi} + 24\sigma(0)^2 H^2 + 64\sigma(0)H^3\sqrt{2/\pi} + 512H^4$, and in the last inequality we used the formula for the first four absolute moments of a Gaussian. Applying Chebyshev's inequality and plugging in our assumption on $m$, with probability $1 - \frac{\delta}{nL}$ over $W^{(l)}$,
$$\Big|\|x_i^{(l)}(0)\|_2^2 - \mathbb{E}\big[\|x_i^{(l)}(0)\|_2^2\big]\Big| \le \frac{1}{2 g_C(L)}.$$
Thus with probability $1 - l\frac{\delta}{nL}$ over $W^{(1)}, \ldots, W^{(l)}$,
$$\Big|\|x_i^{(l)}(0)\|_2 - 1\Big| \le \Big|\|x_i^{(l)}(0)\|_2^2 - 1\Big| \le \frac{C g_C(l-1)}{2 g_C(L)} + \frac{1}{2 g_C(L)} = \frac{g_C(l)}{2 g_C(L)}.$$
Using a union bound over $[n]$, we prove the lemma.

Proof of Lemma A.3. For a weight matrix, we decompose it into $m$ weight vectors, namely $W^{(l)} = [w^{(l)}_1, w^{(l)}_2, \cdots, w^{(l)}_m]$. The derivatives of the output with respect to the parameters $w^{(l)}_{\mu,r}$ and $w^{(l)}_{\sigma,r}$ can be expressed as
$$\frac{\partial f(x_i)}{\partial w^{(l)}_{\mu,r}} = \frac{1}{\sqrt{m}}\,\frac{\partial f(x_i)}{\partial x^{(l)}}\,\sigma'\big(w_r^{(l)\top} x^{(l-1)}\big)\, x^{(l-1)}, \qquad \frac{\partial f(x_i)}{\partial w^{(l)}_{\sigma,r}} = \frac{1}{\sqrt{m}}\,\frac{\partial f(x_i)}{\partial x^{(l)}}\,\sigma'\big(w_r^{(l)\top} x^{(l-1)}\big)\, x^{(l-1)} \odot \xi^{(l)}_r.$$
According to the definition of the PNTK for each layer,
$$\Theta^{(l)}_\mu = \nabla_{W^{(l)}_\mu} f(X;t)\, \nabla_{W^{(l)}_\mu} f(X;t)^\top, \qquad \Theta^{(l)}_\sigma = \nabla_{W^{(l)}_\sigma} f(X;t)\, \nabla_{W^{(l)}_\sigma} f(X;t)^\top,$$
a standard calculation shows that the PNTKs can be expressed as
$$\Theta^{(L)}_{\mu,ij} = (x_i^{(L-1)})^\top x_j^{(L-1)} \cdot \frac{1}{m}\sum_{r=1}^m v_r^2\, \sigma'\big((w_r^{(L)})^\top x_i^{(L-1)}\big)\, \sigma'\big((w_r^{(L)})^\top x_j^{(L-1)}\big),$$
$$\Theta^{(L)}_{\sigma,ij} = (x_i^{(L-1)})^\top x_j^{(L-1)} \cdot \frac{1}{m}\sum_{r=1}^m v_r^2\, \sigma'\big((w_r^{(L)})^\top x_i^{(L-1)}\big)\, \sigma'\big((w_r^{(L)})^\top x_j^{(L-1)}\big)\cdot \xi_r^2.$$
Note that the differences between $\Theta^{(L)}_\mu$, $\Theta^{(L)}_\sigma$ and $K^{(L)}$ can be decomposed as
$$\Theta^{(L)}_\mu - K^{(L)} = \big(\Theta^{(L)}_\mu(0) - K^{(L)}_\infty\big) + \big(K^{(L)}_\infty - K^{(L)}\big), \qquad \Theta^{(L)}_\sigma - K^{(L)} = \big(\Theta^{(L)}_\sigma(0) - K^{(L)}_\infty\big) + \big(K^{(L)}_\infty - K^{(L)}\big).$$
We split the proof into two phases:
- First, we use a concentration inequality to show that if $m = \Omega\big(\frac{n^2\log(n^2/\delta)}{\lambda_{\min}^2(K^{(L)})}\big)$, then $\|\Theta^{(L)}_\mu(0) - K^{(L)}_\infty\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{4}$ and $\|\Theta^{(L)}_\sigma(0) - K^{(L)}_\infty\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{4}$.
- Second, we show that if $m = \Omega\big(\frac{n^2\log(Ln/\delta)\,2^{O(L)}}{\lambda_{\min}^2(K^{(L)})}\big)$, then $\|K^{(L)}_\infty - K^{(L)}\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{2}$.

Phase 1, bounding $\Theta^{(L)}_\mu(0)$. Plugging the derivative with respect to the mean weights (Equation 17) into the definition of the PNTK (Equation 9) yields
$$\Theta^{(L)}_{\mu,ij}(0) = (x_i^{(L-1)})^\top x_j^{(L-1)} \cdot \frac{1}{m}\sum_{r=1}^m v_r^2\, \sigma'\big((w_r^{(L)})^\top x_i^{(L-1)}\big)\, \sigma'\big((w_r^{(L)})^\top x_j^{(L-1)}\big).$$
We find that for all pairs $i,j$, $\Theta^{(L)}_{\mu,ij}(0)$ is the average of $m$ i.i.d.
random variables, with the expectation
$$K^{(L)}_{\infty,ij} = (c_\mu^2 + c_\sigma^2)\cdot \mathbb{E}_{w\sim\mathcal{N}(0,I)}\Big[(x_i^{(L-1)})^\top x_j^{(L-1)}\, \sigma'\big(w^\top x_i^{(L-1)}\big)\, \sigma'\big(w^\top x_j^{(L-1)}\big)\Big].$$
Then by Hoeffding's inequality, the following holds with probability at least $1-\delta'$:
$$\big|\Theta^{(L)}_{\mu,ij}(0) - K^{(L)}_{\infty,ij}\big| \le \sqrt{\frac{\log(2/\delta')}{2m}}.$$
Because the NTK matrix is of size $n\times n$, we apply a union bound over all $i,j\in[n]$ (setting $\delta' = \delta/n^2$) and obtain
$$\big|\Theta^{(L)}_{\mu,ij}(0) - K^{(L)}_{\infty,ij}\big| \le \sqrt{\frac{\log(2n^2/\delta)}{2m}}.$$
Thus we have
$$\big\|\Theta^{(L)}_\mu(0) - K^{(L)}_\infty\big\|_2^2 \le \big\|\Theta^{(L)}_\mu(0) - K^{(L)}_\infty\big\|_F^2 \le \sum_{i,j}\big|\Theta^{(L)}_{\mu,ij}(0) - K^{(L)}_{\infty,ij}\big|^2 = O\Big(\frac{n^2\log(2n^2/\delta)}{m}\Big).$$
Finally, if $\sqrt{\frac{n^2\log(2n^2/\delta)}{m}} \le \frac{\lambda_{\min}(K^{(L)})}{4}$, which implies $m = \Omega\big(\frac{n^2\log(n^2/\delta)}{\lambda_{\min}^2(K^{(L)})}\big)$, then with probability at least $1-\delta$,
$$\big\|\Theta^{(L)}_\mu(0) - K^{(L)}_\infty\big\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{4}.$$

Phase 1, bounding $\Theta^{(L)}_\sigma(0)$. Plugging the derivative with respect to the standard-deviation weights (Equation 17) into the definition of the PNTK (Equation 9) yields
$$\Theta^{(L)}_{\sigma,ij}(0) = (x_i^{(L-1)})^\top x_j^{(L-1)} \cdot \frac{1}{m}\sum_{r=1}^m v_r^2\, \sigma'\big((w_r^{(L)})^\top x_i^{(L-1)}\big)\, \sigma'\big((w_r^{(L)})^\top x_j^{(L-1)}\big)\cdot \xi_r^2.$$
Note that the tangent kernel $\Theta^{(L)}_{\sigma,ij}(0)$ differs from $\Theta^{(L)}_{\mu,ij}(0)$ by the additional factor $\xi_r^2$, where $\xi_r^2 \sim \chi^2_1$ independently of $\sigma'\big((w_r^{(L)})^\top x_i^{(L-1)}\big)\,\sigma'\big((w_r^{(L)})^\top x_j^{(L-1)}\big)$. Because $\mathbb{E}[\chi^2_1]=1$, the expectation of $\Theta^{(L)}_{\sigma,ij}(0)$ equals that of $\Theta^{(L)}_{\mu,ij}(0)$. Thus for all pairs $i,j$, $\Theta^{(L)}_{\sigma,ij}(0)$ is the average of $m$ i.i.d. random variables with expectation $\mathbb{E}\big[\Theta^{(L)}_\sigma(0)\big] = K^{(L)}_\infty$.

Now we derive the concentration bound. Since $\xi_r^2$ is independent and sub-exponential, by the sub-exponential tail bound the following holds with probability at least $1-\delta'$:
$$\big|\Theta^{(L)}_{\sigma,ij}(0) - K^{(L)}_{\infty,ij}\big| \le \sqrt{\frac{\log(8/\delta')}{2m}}.$$
This bound is of the same order as the concentration bound for $\Theta^{(L)}_{\mu,ij}(0)$, so the arguments above carry over to finalize the proof: if $\sqrt{\frac{n^2\log(8n^2/\delta)}{m}} \le \frac{\lambda_{\min}(K^{(L)})}{4}$, which implies $m = \Omega\big(\frac{n^2\log(n^2/\delta)}{\lambda_{\min}^2(K^{(L)})}\big)$, then with probability at least $1-\delta$,
$$\big\|\Theta^{(L)}_\sigma(0) - K^{(L)}_\infty\big\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{4}.$$

Phase 2, bounding $\|K^{(L)}_\infty - K^{(L)}\|_2$. We show that with probability $1-\delta$ over the $W^{(l)}$, for any $1\le l\le L-1$ and $1\le i,j\le n$,
$$\Big|(x_i^{(l)})^\top x_j^{(l)} - K^{(l)}_{\infty,ij}\Big| \le E\,\sqrt{\frac{\log(Ln/\delta)}{m}}.$$
The error constant $E$ depends on the choice of activation function and satisfies $E \le C\cdot 2^{O(L)}$ for a positive constant $C$; the $2^{O(L)}$ term comes from perturbation propagation through the neural network. The proof is by induction, and the detailed argument can be found in the proof of Theorem E.1 in Du et al. (2019). Applying a union bound over all pairs concludes the claim, and the condition on $m$ follows:
$$m = \Omega\Big(\frac{n^2\log(Ln/\delta)\,2^{O(L)}}{\lambda_{\min}^2(K^{(L)})}\Big).$$
Remark A.1. The concentration bound is over two sources of randomness: the initialization of $W_\mu$ and the Gaussian variable $\xi$.

A.2 STEP 2. BOUNDING LEAST EIGENVALUE OF PNTK DURING TRAINING

The next problem is that the PNTKs are time-dependent matrices and thus vary during training. To address this, we establish the following lemmas, stating that if the weights $W^{(l)}(t)$ stay close to $W^{(l)}(0)$ during gradient descent training, then the corresponding PNTKs $\Theta_\mu(t)$ and $\Theta_\sigma(t)$ stay close to their initial values. Importantly, we introduce an auxiliary weight matrix $\widetilde{W}^{(l)}(t) \equiv W^{(l)}_\mu(t) + W^{(l)}_\sigma(t) \odot \xi^{(l)}(0)$ and an auxiliary weight vector $\widetilde{v}(t) \equiv v_\mu(t) + v_\sigma(t) \odot \xi_v(0)$, where $\xi^{(l)}(0)$ and $\xi_v(0)$ are the realized values of the random variables at initialization. We then state the lemmas of Step 2 as follows.

Lemma A.4. Suppose $W_\mu(0)$ and $W_\sigma(0)$ are initialized in the form described in Section 3.1, and suppose that for every $l\in[L]$, $\|W^{(l)}(0)\|_2 \le c_{w,0}\sqrt{m}$, $\|x^{(l)}(0)\|_2 \le c_{x,0}$, and $\|W^{(l)}(t) - W^{(l)}(0)\|_F \le \sqrt{m}R$ for some constants $c_{w,0}, c_{x,0} > 0$ and $R \le c_{w,0}$. If $\sigma(\cdot)$ is $H$-Lipschitz, then with probability at least $1-\delta$ we have
$$\|x^{(l)}(t) - x^{(l)}(0)\|_2 \le H R c_{x,0}\, g_{c_x}(l)\,(1 + \log(2/\delta)),$$
where $c_x = 2\sqrt{c_\sigma}H c_{w,0}$.

Lemma A.5. Suppose $W_\mu(0)$ and $W_\sigma(0)$ are initialized in the form described in Section 3.1, and suppose $\sigma(\cdot)$ is $H$-Lipschitz and $\beta$-smooth.
Suppose for $l\in[L]$: $\|W^{(l)}(0)\|_2 \le c_{w,0}\sqrt{m}$, $\|v(0)\|_2 \le v_{2,0}\sqrt{m}$, $\|v(0)\|_4 \le a_{4,0}\, m^{1/4}$, and $\frac{1}{c_{x,0}} \le \|x^{(l)}(0)\|_2 \le c_{x,0}$. If $\|W^{(l)}(t) - W^{(l)}(0)\|_F,\ \|v(t) - v(0)\|_2 \le \sqrt{m}R$, where $R \le c\, g_{c_x}(L)^{-1}\lambda_{\min}(K^{(L)})\, n^{-1}(1+\log(2/\delta))^{-2}$, $R \le c\, g_{c_x}(L)^{-1}\lambda_{\min}(K^{(L)})\, n^{-1}(1+\log(2/\delta))^{-3}$, and $R \le c\, g_{c_x}(L)^{-1}$ for some small constant $c$, with $c_x = 2\sqrt{c_\sigma}H c_{w,0}$, then with probability at least $1-\delta$ we have
$$\big\|\Theta^{(L)}_\mu(t) - \Theta^{(L)}_\mu(0)\big\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{4}, \qquad \big\|\Theta^{(L)}_\sigma(t) - \Theta^{(L)}_\sigma(0)\big\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{4}.$$

Lemma A.6. If $R_S(Q, t') \le \exp\big(-\lambda_{\min}(K^{(L)})\, t'\big)\, R_S(Q, 0)$ holds for $0 \le t' \le t$, then for any $0 \le s \le t$,
$$\|W^{(l)}(s) - W^{(l)}(0)\|_F,\ \|v(s) - v(0)\|_2 \le R'\sqrt{m}, \qquad \text{where } R' = \frac{16(1+\log(2/\delta))^2\, c_{x,0}\, v_{2,0}\, (c_x)^L \sqrt{n}\, \|y - f(X, Q(0))\|_2}{\lambda_0 \sqrt{m}}$$
for some small constant $c$, with $c_x = \max\{2\sqrt{c_\sigma}L c_{w,0}, 1\}$.

Proof of Lemma A.4. The proof is by induction. For $l = 0$, the target is the input, which is fixed, so the hypothesis holds. Now suppose the induction hypothesis holds for $l' = 0, \ldots, l-1$; we consider $l' = l$:
$$\begin{aligned}
\|x^{(l)}(t) - x^{(l)}(0)\|_2 &= \frac{1}{\sqrt{m}}\big\|\sigma(W^{(l)}(t)x^{(l-1)}(t)) - \sigma(W^{(l)}(0)x^{(l-1)}(0))\big\|_2 \\
&\le \frac{1}{\sqrt{m}}\big\|\sigma(W^{(l)}(t)x^{(l-1)}(t)) - \sigma(W^{(l)}(t)x^{(l-1)}(0))\big\|_2 + \frac{1}{\sqrt{m}}\big\|\sigma(W^{(l)}(t)x^{(l-1)}(0)) - \sigma(W^{(l)}(0)x^{(l-1)}(0))\big\|_2 \\
&\le \frac{1}{\sqrt{m}}\, H\Big(\|W^{(l)}(0)\|_2 + \|W^{(l)}(t) - \widetilde{W}^{(l)}(t)\|_2 + \|\widetilde{W}^{(l)}(t) - W^{(l)}(0)\|_F\Big)\cdot \|x^{(l-1)}(t) - x^{(l-1)}(0)\|_2 \\
&\quad + \frac{1}{\sqrt{m}}\, H\Big(\|W^{(l)}(t) - \widetilde{W}^{(l)}(t)\|_2 + \|\widetilde{W}^{(l)}(t) - W^{(l)}(0)\|_F\Big)\,\|x^{(l-1)}(0)\|_2 \\
&\le \frac{1}{\sqrt{m}}\, H\big(c_{w,0}\sqrt{m} + R\sqrt{m}(1+\log(2/\delta))\big)\cdot H R c_{x,0}\, g_{c_x}(l-1) + \frac{1}{\sqrt{m}}\, H\sqrt{m}R(1+\log(2/\delta))\, c_{x,0} \\
&\le H R c_{x,0}\big(c_x\, g_{c_x}(l-1) + 1\big)(1+\log(2/\delta)) \le H R c_{x,0}\, g_{c_x}(l)(1+\log(2/\delta)).
\end{aligned}$$

Proof of Lemma A.5.
For simplicity, we define $z_{i,r}(t) \equiv \big(w_r^{(L)}(t)\big)^\top x_i^{(L-1)}(t)$. We first bound the distance between $\Theta^{(L)}_{\mu,ij}(t)$ and $\Theta^{(L)}_{\mu,ij}(0)$ through the following inequality:
$$\begin{aligned}
\big|\Theta^{(L)}_{\mu,ij}(t) - \Theta^{(L)}_{\mu,ij}(0)\big|
&= \Big|x_i^{(L-1)}(t)^\top x_j^{(L-1)}(t)\,\frac{1}{m}\sum_{r=1}^m v_r(t)^2 \sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t)) - x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\,\frac{1}{m}\sum_{r=1}^m v_r(0)^2 \sigma'(z_{i,r}(0))\,\sigma'(z_{j,r}(0))\Big| \\
&\le \Big|x_i^{(L-1)}(t)^\top x_j^{(L-1)}(t) - x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\Big|\,\frac{1}{m}\sum_{r=1}^m v_r(0)^2 \sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t)) \\
&\quad + \Big|x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\Big|\,\frac{1}{m}\Big|\sum_{r=1}^m v_r(0)^2\big(\sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t)) - \sigma'(z_{i,r}(0))\,\sigma'(z_{j,r}(0))\big)\Big| \\
&\quad + \Big|x_i^{(L-1)}(t)^\top x_j^{(L-1)}(t)\Big|\,\frac{1}{m}\sum_{r=1}^m \big|v_r(t)^2 - v_r(0)^2\big|\,\sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t)) \\
&\le H^2 v_{2,0}^2\,\Big|x_i^{(L-1)}(t)^\top x_j^{(L-1)}(t) - x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\Big| + c_{x,0}^2\,\frac{1}{m}\Big|\sum_{r=1}^m v_r(0)^2\big(\sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t)) - \sigma'(z_{i,r}(0))\,\sigma'(z_{j,r}(0))\big)\Big| \\
&\quad + 4H^2 c_{x,0}^2\,\frac{1}{m}\sum_{r=1}^m \big|v_r(t)^2 - v_r(0)^2\big| \\
&\equiv I^{i,j}_1 + I^{i,j}_2 + I^{i,j}_3.
\end{aligned}$$
For $I^{i,j}_1$, by Lemma A.4 we have
$$\begin{aligned}
I^{i,j}_1 &= H^2 v_{2,0}^2\,\Big|x_i^{(L-1)}(t)^\top x_j^{(L-1)}(t) - x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\Big| \\
&\le H^2 v_{2,0}^2\,\Big|\big(x_i^{(L-1)}(t) - x_i^{(L-1)}(0)\big)^\top x_j^{(L-1)}(t)\Big| + H^2 v_{2,0}^2\,\Big|x_i^{(L-1)}(0)^\top \big(x_j^{(L-1)}(t) - x_j^{(L-1)}(0)\big)\Big| \\
&\le v_{2,0}^2 H^3 c_{x,0}\, g_{c_x}(L) R (1+\log(2/\delta))\cdot\big(c_{x,0} + H c_{x,0}\, g_{c_x}(L) R (1+\log(2/\delta))\big) + v_{2,0}^2 H^3 c_{x,0}\, g_{c_x}(L) R\, c_{x,0}(1+\log(2/\delta)) \\
&\le 3 v_{2,0}^2 c_{x,0}^2 H^3\, g_{c_x}(L)\, R\, (1+\log(2/\delta))^2.
\end{aligned}$$
For $I^{i,j}_2$, we have
$$\begin{aligned}
I^{i,j}_2 &= c_{x,0}^2\,\frac{1}{m}\Big|\sum_{r=1}^m v_r(0)^2\big(\sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t)) - \sigma'(z_{i,r}(0))\,\sigma'(z_{j,r}(0))\big)\Big| \\
&\le c_{x,0}^2\,\frac{1}{m}\sum_{r=1}^m v_r(0)^2\Big(\big|\sigma'(z_{i,r}(t)) - \sigma'(z_{i,r}(0))\big|\,\big|\sigma'(z_{j,r}(t))\big| + \big|\sigma'(z_{j,r}(t)) - \sigma'(z_{j,r}(0))\big|\,\big|\sigma'(z_{i,r}(0))\big|\Big) \\
&\le \frac{\beta H c_{x,0}^2}{m}\sum_{r=1}^m v_r(0)^2\Big(\big|z_{i,r}(t) - z_{i,r}(0)\big| + \big|z_{j,r}(t) - z_{j,r}(0)\big|\Big) \\
&\le \frac{\beta H v_{4,0}^2 c_{x,0}^2}{\sqrt{m}}\Bigg(\sqrt{\sum_{r=1}^m \big|z_{i,r}(t) - z_{i,r}(0)\big|^2} + \sqrt{\sum_{r=1}^m \big|z_{j,r}(t) - z_{j,r}(0)\big|^2}\Bigg).
\end{aligned}$$
Using the same argument as in the proof of Lemma A.4, it is easy to see that
$$\sum_{r=1}^m \big|z_{i,r}(t) - z_{i,r}(0)\big|^2 \le c_{x,0}^2\, g_{c_x}(L)^2\, m R^2\, (1+\log(2/\delta))^2.$$
Thus $I^{i,j}_2 \le 2\beta v_{4,0}^2 c_{x,0}^3 L\, g_{c_x}(L)\, R\, (1+\log(2/\delta))$. For $I^{i,j}_3$,
$$I^{i,j}_3 = 4H^2 c_{x,0}^2\,\frac{1}{m}\sum_{r=1}^m \big|v_r(t)^2 - v_r(0)^2\big| \le 4H^2 c_{x,0}^2\,\frac{1}{m}\sum_{r=1}^m \Big(|v_r(t) - v_r(0)|\,|v_r(t)| + |v_r(t) - v_r(0)|\,|v_r(0)|\Big) \le 12 H^2 c_{x,0}^2\, v_{2,0}\, R\, (1+\log(2/\delta)).$$
Therefore we can bound the perturbation:
$$\big\|\Theta^{(L)}_\mu(t) - \Theta^{(L)}_\mu(0)\big\|_F = \sqrt{\sum_{i,j=1}^n \big|\Theta^{(L)}_{\mu,ij}(t) - \Theta^{(L)}_{\mu,ij}(0)\big|^2} \le \Big(\big(2\beta c_{x,0} v_{4,0}^2 + 3H^2\big)H c_{x,0}^2 v_{2,0}^2\, g_{c_x}(L)(1+\log(2/\delta))^2 + 12H^2 c_{x,0}^2 v_{2,0}(1+\log(2/\delta))\Big)\, nR.$$
Recalling the bound $R \le c\, g_{c_x}(L)^{-1}\lambda_{\min}(K^{(L)})\, n^{-1}(1+\log(2/\delta))^{-2}$, we obtain the desired result for $\Theta^{(L)}_\mu$:
$$\big\|\Theta^{(L)}_\mu(t) - \Theta^{(L)}_\mu(0)\big\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{4}.$$
Then we bound the distance between $\Theta^{(L)}_{\sigma,ij}(t)$ and $\Theta^{(L)}_{\sigma,ij}(0)$ through the following inequality:
$$\begin{aligned}
\big|\Theta^{(L)}_{\sigma,ij}(t) - \Theta^{(L)}_{\sigma,ij}(0)\big|
&= \Big|x_i^{(L-1)}(t)^\top x_j^{(L-1)}(t)\,\frac{1}{m}\sum_{r=1}^m v_r(t)^2 \sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t))\,\xi_r^2(t) - x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\,\frac{1}{m}\sum_{r=1}^m v_r(0)^2 \sigma'(z_{i,r}(0))\,\sigma'(z_{j,r}(0))\,\xi_r^2(0)\Big| \\
&\le \Big|x_i^{(L-1)}(t)^\top x_j^{(L-1)}(t) - x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\Big|\,\frac{1}{m}\sum_{r=1}^m v_r(0)^2 \sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t))\,\xi_r^2 \\
&\quad + \Big|x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\Big|\,\frac{1}{m}\Big|\sum_{r=1}^m v_r(0)^2\big(\sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t)) - \sigma'(z_{i,r}(0))\,\sigma'(z_{j,r}(0))\big)\,\xi_r^2\Big| \\
&\quad + \Big|x_i^{(L-1)}(t)^\top x_j^{(L-1)}(t)\Big|\,\frac{1}{m}\sum_{r=1}^m \big|v_r(t)^2 - v_r(0)^2\big|\,\sigma'(z_{i,r}(t))\,\sigma'(z_{j,r}(t))\,\xi_r^2(t) \\
&\quad + \Big|x_i^{(L-1)}(0)^\top x_j^{(L-1)}(0)\Big|\,\frac{1}{m}\Big|\sum_{r=1}^m v_r(0)^2\,\sigma'(z_{i,r}(0))\,\sigma'(z_{j,r}(0))\,\big(\xi_r^2(t) - \xi_r^2(0)\big)\Big| \\
&\le (1+\log(2/\delta))\big(I^{i,j}_1 + I^{i,j}_2 + I^{i,j}_3\big) + 4H^2 c_{x,0}^2 v_{2,0}^2\,\frac{1}{m}\Big|\sum_{r=1}^m \big(\xi_r(t)^2 - \xi_r(0)^2\big)\Big| \\
&\equiv (1+\log(2/\delta))\big(I^{i,j}_1 + I^{i,j}_2 + I^{i,j}_3\big) + I^{i,j}_4.
\end{aligned}$$
For $I^{i,j}_4$, by the tail bound for a chi-square variable, we have
$$I^{i,j}_4 \le 4H^2 c_{x,0}^2 v_{2,0}^2\,\sqrt{\frac{1+\log(2/\delta)}{m}}.$$
Therefore we can bound the perturbation:
$$\big\|\Theta^{(L)}_\sigma(t) - \Theta^{(L)}_\sigma(0)\big\|_F = \sqrt{\sum_{i,j=1}^n \big|\Theta^{(L)}_{\sigma,ij}(t) - \Theta^{(L)}_{\sigma,ij}(0)\big|^2} \le \bigg(\big(2\beta c_{x,0} v_{4,0}^2 + 3H^2\big)H c_{x,0}^2 v_{2,0}^2\, g_{c_x}(L)(1+\log(2/\delta))^3 + 12H^2 c_{x,0}^2 v_{2,0}(1+\log(2/\delta))^2 + 4H^2 c_{x,0}^2 v_{2,0}^2\sqrt{\frac{1+\log(2/\delta)}{m}}\bigg)\, nR.$$
Recalling the bound $R \le c\, g_{c_x}(L)^{-1}\lambda_{\min}(K^{(L)})\, n^{-1}(1+\log(2/\delta))^{-3}$, we obtain the desired result for $\Theta^{(L)}_\sigma$:
$$\big\|\Theta^{(L)}_\sigma(t) - \Theta^{(L)}_\sigma(0)\big\|_2 \le \frac{\lambda_{\min}(K^{(L)})}{4}.$$

Proof of Lemma A.6. We first consider the derivative of $W^{(l)}_\mu$:
$$\Big\|\frac{d}{ds}W^{(l)}_\mu(s)\Big\|_F = \eta\,\bigg\|\frac{1}{m^{\frac{L-l+1}{2}}}\sum_{i=1}^n (y_i - f(x_i;s))\, x_i^{(l-1)}(s)\Big(v(s)^\top \prod_{k=l+1}^L J^{(k)}_i(s)\, W^{(k)}(s)\Big) J^{(l)}_i(s)\bigg\|_F \le \eta\,\frac{1}{m^{\frac{L-l+1}{2}}}\,\|v(s)\|_2 \sum_{i=1}^n |y_i - f(x_i;s)|\,\|x_i^{(l-1)}(s)\|_2 \prod_{k=l+1}^L \|W^{(k)}(s)\|_2 \prod_{k=l}^L \|J^{(k)}(s)\|_2,$$
$$\Big\|\frac{d}{ds}v_\mu(s)\Big\|_2 = \eta\,\Big\|\sum_{i=1}^n (y_i - f(x_i;s))\, x_i^{(L)}(s)\Big\|_2,$$
where $J^{(l')} \equiv \mathrm{diag}\big(\sigma'((w^{(l')}_1)^\top x^{(l'-1)}), \ldots, \sigma'((w^{(l')}_m)^\top x^{(l'-1)})\big) \in \mathbb{R}^{m\times m}$ are the derivative matrices induced by the activation function. To bound $\|x_i^{(l-1)}(s)\|_2$, we apply Lemma A.4 and get
$$\|x_i^{(l-1)}(s)\|_2 \le H c_{x,0}\, g_{c_x}(l)\, R\, (1+\log(2/\delta)) + c_{x,0} \le 2 c_{x,0}(1+\log(2/\delta)).$$
To bound $\|W^{(k)}(s)\|_2$, we use our assumption:
$$\prod_{k=l+1}^L \|W^{(k)}(s)\|_2 \le \prod_{k=l+1}^L \big(\|W^{(k)}(0)\|_2 + \|W^{(k)}(s) - W^{(k)}(0)\|_2\big) \le (c_{w,0} + R')^{L-l}\, m^{\frac{L-l}{2}}(1+\log(2/\delta)) \le (2c_{w,0})^{L-l}\, m^{\frac{L-l}{2}}(1+\log(2/\delta)).$$
Note that $\|J^{(k)}(s)\|_2 \le H$. Plugging these two bounds back in, we obtain a bound of the form
$$\Big\|\frac{d}{ds}W^{(l)}_\mu(s)\Big\|_F \le e^{-\lambda_0 s}\cdot\tfrac{1}{4}\,\eta\lambda_0 R'\sqrt{m},$$
where we have used the definition of the loss $R_S(Q) = \mathbb{E}_{f\sim Q} R_S(f)$ and interchanged integration and differentiation. Similarly, for $v_\sigma$ we have
$$\Big\|\frac{d}{ds}v_\sigma(s)\Big\|_2 = \eta\,\Big\|\mathbb{E}_{Q(s)}\sum_{i=1}^n (y_i - f(x_i;s))\, x_i^{(L)}(s) \odot \xi^{(v)}\Big\|_2 = 0.$$
Integrating the derivatives of the weights, we obtain
$$\|W^{(l)}(s) - W^{(l)}(0)\|_F \le \|W^{(l)}_\mu(s) - W^{(l)}_\mu(0)\|_F + \big\|\big(W^{(l)}_\sigma(s) - W^{(l)}_\sigma(0)\big) \odot \xi^{(l)}(0)\big\|_F \le R'\sqrt{m}.$$
Recall from Lemma A.5 and Lemma A.6 that $R \le c\, g_{c_x}(L)^{-1}\lambda_{\min}(K^{(L)})\, n^{-1}(1+\log(2/\delta))^{-3}$ and $R' = \frac{16(1+\log(2/\delta))^2 c_{x,0} v_{2,0} (c_x)^L \sqrt{n}\,\|y - f(X, Q(0))\|_2}{\lambda_0\sqrt{m}}$, so that $R' < R$ for sufficiently large $m$. The dynamics of the loss can then be calculated:
$$\frac{d}{dt}R_S(Q(t)) = \frac{1}{2}\frac{d}{dt}\,\mathbb{E}_{f\sim Q(t)}\|f(X;t) - y\|_2^2 \le -\big(y - f(X, Q(t))\big)^\top\big(\Theta^{(L)}_\mu(t) + \Theta^{(L)}_\sigma(t)\big)\big(y - f(X, Q(t))\big) \le -\lambda_0\,\big\|y - \mathbb{E}_{f\sim Q(t)}f(X;t)\big\|_2^2,$$
where we have used the condition $R' < R$. Therefore, we have the desired result:
$$R_S(Q, t) \le \exp(-\lambda_0 t)\, R_S(Q, 0).$$
Finally, we provide a bound on $R_S(Q(0))$:
$$\|y - f(X, Q(0))\|_2^2 = \sum_{i=1}^n \big[y_i^2 - 2 y_i f(x_i, Q(0)) + f(x_i, Q(0))^2\big] = \sum_{i=1}^n (1 + O(1)) = O(n).$$

B PROOF OF THEOREM 4.3

Theorem B.1 (Restatement of Theorem 4.3). Consider gradient descent on the objective function (6). Suppose $m \ge \mathrm{poly}(n, 1/\lambda_0, 1/\delta, 1/E)$. Then, with probability at least $1-\delta$ over the random initialization, we have
$$f\big(x, Q(t)\big)\big|_{t=\infty} = \Theta^\infty_\mu(x, X)\,\big(\Theta^\infty_\mu(X, X) + (\lambda/c_\sigma^2)\, I\big)^{-1} y \pm E,$$
where $f(x, Q(t)) = \mathbb{E}_{f\sim Q(t)} f(x;t)$ aligns with the definition of the empirical loss function.

Proof of Theorem B.1. To proceed with the proof, we first establish the result of kernel ridge regression in the infinite-width limit, and then bound the perturbation of the prediction. According to the linearization rules for infinitely wide networks (Lee et al., 2019), the output function can be expressed as a linear model in the parameters around initialization.

We also conduct an experiment to compare the gradient norms with respect to $\theta_\mu$ and $\theta_\sigma$. The result is shown in Figure 4: the gradient norm of $\nabla_{\theta_\mu} f(x)$ is much larger than that of $\nabla_{\theta_\sigma} f(x)$, which implies that $\theta_\sigma$ is effectively fixed during gradient descent training.
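The converged predictor in Theorem B.1 is kernel ridge regression with ridge coefficient $\lambda/c_\sigma^2$. A minimal numerical sketch of this predictor follows; the RBF kernel is a stand-in for the limiting PNTK $\Theta^\infty_\mu$, and the values of $\lambda$ and $c_\sigma$ are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Stand-in PSD kernel (an RBF) in place of the limiting PNTK Theta_inf_mu.
def kernel(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

lam, c_sigma = 0.1, 1.0  # assumed KL penalty and prior standard deviation
K = kernel(X, X)
# alpha solves (K + (lam / c_sigma^2) I) alpha = y
alpha = np.linalg.solve(K + (lam / c_sigma**2) * np.eye(n), y)

x_test = rng.normal(size=(2, d))
f_test = kernel(x_test, X) @ alpha  # expected output at t -> infinity
print(f_test.shape)
```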

D.4 CORRELATION BETWEEN GENERALIZATION BOUND PROXY METRIC AND GENERALIZATION BOUND

In Figure 1, we observe a significant positive correlation between PA and the generalization bound across different values of each selected hyperparameter while the other hyperparameters are fixed. Furthermore, Figure 5 presents the correlation for aggregated values of ρ0 and λ, in the setting where 50% of the data is used for prior training. We can clearly see that a lower PA corresponds to a lower bound, with a strong positive Kendall-tau correlation of 0.7.
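The Kendall-tau statistic reported above only counts concordant versus discordant pairs, so a proxy that is a noisy monotone function of the bound already produces a strong positive value. A self-contained toy check (synthetic numbers, not the paper's measurements):

```python
import random
from itertools import combinations

random.seed(0)
bound = [0.1 * i for i in range(20)]                 # synthetic bound values
pa = [b + random.gauss(0, 0.15) for b in bound]      # noisy monotone proxy

def kendall_tau(a, b):
    # Plain tau: (concordant - discordant) / total, assuming no ties.
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

tau = kendall_tau(pa, bound)
print(round(tau, 2))  # strongly positive despite the noise
```

In practice `scipy.stats.kendalltau` computes the same statistic (with tie handling); the hand-rolled version above just makes the pair-counting explicit.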

D.5 GRID SEARCH

For hyperparameter selection, we conduct a grid search over ρ 0, the proportion of prior data, and the KL penalty λ. Notably, we sweep the proportion of data used for prior training over [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], since 0.2 is the minimum proportion required to obtain a reasonably low generalization bound (Dziugaite et al., 2020). For the rest, we sweep ρ 0 over [0.03, 0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7] for FCN ([0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7, 0.9] for CNN) and the KL penalty over [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1] for both architectures.
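Enumerating this grid is a one-liner with `itertools.product`; the lists below are the FCN values quoted in this subsection (8 prior proportions x 8 values of ρ0 x 9 KL penalties = 576 configurations for this sweep):

```python
from itertools import product

prior_fracs = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rho0_fcn = [0.03, 0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7]
kl_pens = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]

# Every (prior fraction, rho0, KL penalty) combination in the FCN sweep.
grid = list(product(prior_fracs, rho0_fcn, kl_pens))
print(len(grid))  # 8 * 8 * 9 = 576
```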



Figure 1: The first row shows correlation results for the FCN architecture on the MNIST dataset. The Kendall-tau correlations between the generalization bound and the proportion of prior data, the coefficient of the KL penalty, and ρ 0 are 0.89, 0.89, and 0.93, all significant at the 1% level. Similar results hold for the CNN architecture on the CIFAR10 dataset, where the Kendall-tau correlations are 0.89, 0.83, and 0.57, as shown in the second row.


Figure 2: Relative Frobenius norm change in µ and σ, respectively, during training with the MSE loss derived from the classic PAC-Bayesian bound, where m is the width of the network.

Figure 3: Comparing NTK Rademacher bound, NTK PAC-Bayesian bound and Empirical Bound with different datasets and network structures.

Figure 4: Comparison between the gradient of mean µ and standard deviations σ.

Figure 5: Correlation between aggregated proxy PA and generalization bound.

Table 1: Performance, i.e., risk certificates (cross-entropy ℓ x-e and accuracy ℓ 01), and computation time for the three hyperparameter search methods (exhaustive search, Bayesian search, and PA, the training-free method). The lowest risk certificate and computation time are highlighted in boldface, and the second best are underlined.

A.3 STEP 3. TOWARDS LINEAR CONVERGENCE RATE OF EMPIRICAL LOSS

Now we proceed to analyze the convergence rate of the empirical error. Combined with the facts that the least eigenvalues of the PNTKs and the change of the weights are bounded during training, the behavior of the loss is traceable. To finalize the proof of Theorem 4.2, we show the following.

Proof of Lemma A.7. According to the gradient flow of the output function, the dynamics of the loss can be calculated. The gradient flow equation for θ µ becomes an ordinary differential equation in θ µ(t), which admits a closed-form solution. Plugging this solution into the linearization of the expected output function and taking the time to infinity yields the kernel ridge regression predictor. The next step is to show that the network prediction stays close to this predictor. The proof relies on a careful analysis of the trajectories induced by gradient flow when optimizing the neural network and the NTK predictor, respectively; the detailed argument follows the proof of Theorem 3.2 in Arora et al. (2019b), with the kernel regression there replaced by kernel ridge regression.
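The linear convergence rate established in this step can be sanity-checked by integrating the linearized dynamics $df/dt = \Theta\,(y - f)$ with a fixed PSD matrix standing in for $\Theta_\mu + \Theta_\sigma$; this is a toy simulation under assumed sizes, not the network itself:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
# Fixed PSD matrix standing in for Theta_mu + Theta_sigma.
A = rng.normal(size=(n, n))
Theta = A @ A.T / n + 0.5 * np.eye(n)
y = rng.normal(size=n)
f = np.zeros(n)

lam0 = np.linalg.eigvalsh(Theta).min()  # least eigenvalue, >= 0.5 here
dt, steps = 0.01, 2000
loss0 = 0.5 * np.sum((y - f) ** 2)
for _ in range(steps):
    f += dt * Theta @ (y - f)           # Euler step of df/dt = Theta (y - f)
loss_T = 0.5 * np.sum((y - f) ** 2)

T = steps * dt
print(loss_T, np.exp(-lam0 * T) * loss0)  # loss decays at least this fast
```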

C PROOFS OF SECTION 4.3

C.1 PROOF OF THEOREM 4.4

Theorem C.1 (Restatement of Theorem 4.4). Suppose the data $S = \{(x_i, y_i)\}_{i=1}^n$ are i.i.d. samples from a non-degenerate distribution $D$, and $m \ge \mathrm{poly}(n, \lambda_0^{-1}, \delta^{-1})$. Consider any loss function $\ell: \mathbb{R}\times\mathbb{R} \to [0, 1]$ that is 1-Lipschitz in the first argument and satisfies $\ell(y, y) = 0$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the probabilistic neural network (PNN) trained by gradient descent for $T \ge \Omega\big(\frac{1}{\eta\lambda_0}\log\frac{n}{\delta}\big)$ iterations has population risk $R_D(Q)$ bounded as stated in Theorem 4.4.

Proof of Theorem C.1. The generalization bound consists of two terms: the empirical error and the KL divergence. (1) We first bound the empirical error, which can then be further bounded by the linear convergence result of Theorem A.1. (2) The next step is to calculate the KL divergence; according to the solution of the differential equation in Theorem B.1, the KL term admits a closed form in terms of the kernel. Finally, by Equation 5, we obtain the PAC-Bayesian generalization bound.

C.2 PROOF OF THEOREM 4.5

Theorem C.2 (Restatement of Theorem 4.5). Suppose the data $S = \{(x_i, y_i)\}_{i=1}^n$ are i.i.d. samples from a non-degenerate distribution $D$, and $m \ge \mathrm{poly}(n, \lambda_0^{-1}, \delta^{-1})$. Consider any loss function $\ell: \mathbb{R}\times\mathbb{R} \to [0, 1]$ that is 1-Lipschitz in the first argument and satisfies $\ell(y, y) = 0$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the deterministic neural network trained by gradient descent for $T \ge \Omega\big(\frac{1}{\eta\lambda_0}\log\frac{n}{\delta}\big)$ iterations has population risk $R_D$ bounded as stated in Theorem 4.5.

Proof of Theorem C.2. In this proof, we use a Rademacher-complexity analysis. Let $\mathcal{H}$ be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel $k(\cdot,\cdot)$. The RKHS norm of the learned function can be bounded in terms of $y^\top K^{-1} y$, and the Rademacher complexity can in turn be bounded as in Arora et al. (2019a). Combining this with the standard generalization bound from Rademacher complexity, which holds with probability at least $1-\delta$, yields the claim.
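The data-dependent quantity $y^\top K^{-1} y$ driving such NTK-based complexity bounds is straightforward to evaluate once a kernel matrix is fixed. A sketch with a stand-in PSD kernel (an RBF, not the actual limiting NTK; exact constants in the bound are omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 4
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
y = np.sign(X[:, 0])                           # a simple labeling

# Stand-in PSD kernel matrix; the small ridge keeps it well conditioned.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq) + 1e-6 * np.eye(n)

# Complexity term of the form sqrt(y^T K^{-1} y / n): smaller values
# indicate a labeling that is easier to fit in the kernel's RKHS.
complexity = np.sqrt(y @ np.linalg.solve(K, y) / n)
print(complexity)
```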

D ADDITIONAL EXPERIMENTS

This section contains additional experimental results. Training is performed on a server with a 32 GB NVIDIA Quadro V100 GPU (5,120 CUDA cores).

D.1 VALIDATION OF THEORETICAL RESULTS

We first provide empirical support showing that the training dynamics of wide probabilistic neural networks using the training objective derived from a PAC-Bayes bound are captured by the PNTK, which validates Lemma A.6. We consider a three-hidden-layer ReLU fully-connected network trained with the objective derived from the PAC-Bayesian lambda bound in Equation (5), using an ordinary MSE function as the loss. The network is trained with full-batch gradient descent and a learning rate of one on a fixed subset of MNIST (|D| = 128) with ten classes. A randomly initialized prior with no connection to the data is used, since this is in line with our theoretical setting and we only intend to observe the change in parameters rather than the performance of the actual bound. After T = 2^17 gradient descent updates from different random initializations, we plot the changes of W^(l)_µ and W^(l)_σ of the input/output/hidden layers with respect to the width m in Figure 2. We observe that the relative Frobenius norm change of the input/output layers' weights scales as 1/√m, while that of the hidden layers scales as 1/m during training, which verifies Lemma A.6.
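The relative-change measurement behind Figure 2 can be sketched as follows. The O(1/m) per-entry drift injected here is an assumption chosen to mimic the hidden-layer scaling, not output of an actual training run; the point is only that such a drift produces a relative Frobenius change proportional to 1/m:

```python
import numpy as np

rng = np.random.default_rng(4)

def relative_change(W0, Wt):
    # ||W(t) - W(0)||_F / ||W(0)||_F, the quantity plotted in Figure 2
    return np.linalg.norm(Wt - W0) / np.linalg.norm(W0)

rates = {}
for m in [256, 1024]:
    W0 = rng.normal(size=(m, m))
    Wt = W0 + rng.normal(size=(m, m)) / m  # assumed O(1/m) per-entry drift
    rates[m] = relative_change(W0, Wt)

ratio = rates[256] / rates[1024]
print(ratio)  # close to 1024/256 = 4, i.e. a 1/m scaling
```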

D.2 COMPARISON BETWEEN THEORETICAL BOUNDS AND EMPIRICAL BOUNDS

We compare the theoretical bounds (Equations 14, 15) with empirical bounds. The experiments are performed on two different network structures, a fully-connected neural network and a convolutional neural network, on the MNIST and CIFAR10 datasets. In particular, we build a 3-layer fully-connected network with 600 neurons per layer; the convolutional architecture has a total of 13 layers with around 10 million learnable parameters. We adopt the same hyperparameters for both the theoretical and empirical bounds. The results are shown in Figure 3. First, among the theoretical bounds, we find that the PAC-Bayes bound is smaller than the Rademacher bound. Second, both theoretical bounds are larger than the empirical bounds, which meets our expectations.

