DEMYSTIFYING THE OPTIMIZATION AND GENERALIZATION OF DEEP PAC-BAYESIAN LEARNING

Abstract

PAC-Bayes has long been a framework for generalization analysis in which the expected population error is bounded by the sum of the training error and the divergence between the posterior and prior distributions. Beyond being a successful tool for generalization bound analysis, the PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, a procedure we refer to simply as PAC-Bayesian learning. Through gradient descent training, PAC-Bayesian learning has been shown to achieve a competitive expected test set error numerically while providing a tight generalization bound in practice. Despite this empirical success, the theoretical analysis of deep PAC-Bayesian learning for neural networks is rarely explored. To this end, this paper proposes a convergence and generalization analysis for PAC-Bayesian learning. For a deep and wide probabilistic neural network, we show that PAC-Bayesian learning converges to the solution of a kernel ridge regression whose kernel is the probabilistic neural tangent kernel (PNTK). Based on this finding, we further obtain an analytic and guaranteed PAC-Bayesian generalization bound for the first time, which improves over the Rademacher-complexity-based bound for deterministic neural networks. Finally, drawing insight from our theoretical results, we propose a proxy measure for efficient hyperparameter selection, which is shown to be time-saving on various benchmarks.

1. INTRODUCTION

Deep learning has demonstrated powerful learning capability due to its over-parameterized structure, and various network architectures have been responsible for its significant leaps in performance (LeCun et al., 2015). Over-fitting and complex hyperparameters are two of the major challenges in deep learning, hence designing generalization guarantees for deep networks is an important research goal (Zhang et al., 2021). Recently, a learning framework that trains a probabilistic neural network with a PAC-Bayesian bound as the objective function has been proposed (Bégin et al., 2016; Dziugaite & Roy, 2017; Neyshabur et al., 2017b; Raginsky et al., 2017; Neyshabur et al., 2017a; London, 2017; Smith & Le, 2017; Pérez-Ortiz et al., 2020; Guan & Lu, 2022), which is known as PAC-Bayesian learning. While providing a tight generalization bound, PAC-Bayesian learning has been shown to achieve a competitive expected test set error (Ding et al., 2022). Furthermore, this generalization bound, computed from the training data alone, can obviate the need to split data into training, testing, and validation sets, which is highly applicable for training a deep network with scarce data (Pérez-Ortiz et al., 2020; Grünwald & Mehta, 2020). Meanwhile, these advances in PAC-Bayesian bounds have been widely adapted to different deep neural network structures, including convolutional neural networks (Zhou et al., 2018; Pérez-Ortiz et al., 2020), binary activated multilayer networks (Letarte et al., 2019), partially aggregated neural networks (Biggs & Guedj, 2020), and graph neural networks (Liao et al., 2020). Due to the impressive empirical success of PAC-Bayesian learning, there is increasing interest in understanding its theoretical properties.
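To make the training setup described above concrete, the following toy sketch (our illustration, not the paper's code; the factorized Gaussian posterior, the prior N(0, I), and all variable names are assumptions) trains a probabilistic linear model by gradient descent on a simplified PAC-Bayes-style objective — empirical training error plus a scaled KL divergence between posterior and prior — using the reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical, for illustration only).
X = rng.normal(size=(64, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=64)

d, n = X.shape[1], len(X)
mu = np.zeros(d)               # posterior mean
rho = np.full(d, -2.0)         # posterior log-std; sigma = exp(rho)
mu0, sigma0 = np.zeros(d), 1.0 # fixed Gaussian prior N(mu0, sigma0^2 I)
lam, lr = 1e-2, 0.05           # KL weight and learning rate

def kl_diag_gauss(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(mu0, sigma0^2 I) ) for diagonal Gaussians.
    return 0.5 * np.sum((sigma**2 + (mu - mu0)**2) / sigma0**2
                        - 1.0 - 2.0 * np.log(sigma / sigma0))

def objective(mu, rho, eps):
    sigma = np.exp(rho)
    w = mu + sigma * eps            # reparameterization: w ~ N(mu, sigma^2)
    err = np.mean((X @ w - y)**2)   # empirical (training) error
    return err + lam * kl_diag_gauss(mu, sigma) / n

losses = []
for step in range(300):
    eps = rng.normal(size=d)
    sigma = np.exp(rho)
    w = mu + sigma * eps
    resid = X @ w - y
    # Gradients of the sampled objective w.r.t. the distribution parameters.
    g_w = 2.0 * X.T @ resid / n
    g_mu = g_w + lam * (mu - mu0) / (sigma0**2 * n)
    g_rho = g_w * eps * sigma + lam * (sigma**2 / sigma0**2 - 1.0) / n
    mu -= lr * g_mu
    rho -= lr * g_rho
    losses.append(objective(mu, rho, np.zeros(d)))
```

Under this sketch, the objective evaluated at the posterior mean decreases steadily, mirroring the empirical observation that gradient descent on a PAC-Bayes objective trains effectively.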
However, existing analysis is either restricted to a specific technique variant, such as Entropy-SGD, which minimizes its objective indirectly by approximating stochastic gradient ascent on the so-called local entropy (Dziugaite & Roy, 2018a), and differentially private PAC-Bayes (Dziugaite & Roy, 2018b), or relies heavily on empirical exploration (Neyshabur et al., 2017a; Dziugaite et al., 2020). To the best of our knowledge, there has so far been no investigation into why the training of PAC-Bayesian learning succeeds and why the PAC-Bayesian bound remains tight on unseen data after training. For example, it is still unclear, when applying gradient descent to PAC-Bayesian learning: Q1: How effective is gradient descent training on a training set? Q2: How tight is the generalization bound compared to those of learning frameworks using non-probabilistic neural networks? The answers to these questions can be highly non-trivial due to the inherently non-convex problem of over-parameterization (Jain & Kar, 2017; Arora et al., 2019a; Cao & Gu, 2019; Hu et al., 2019), the additional randomness introduced by probabilistic neural networks (Specht, 1990), as well as the additional challenges brought by the divergence between posterior/prior distribution pairs, known as the Kullback-Leibler (KL) divergence.

Nevertheless, this paper shows that it is possible to answer the above questions by leveraging recent advances in deep learning theory in the over-parameterized setting. It has been shown that wide networks optimized with gradient descent can achieve near-zero training error, and that the critical factor governing the training process is the neural tangent kernel (NTK), which can be proven to remain unchanged during gradient descent training (Jacot et al., 2018), thus providing a guarantee of reaching a global minimum (Du et al., 2019; Allen-Zhu et al., 2019). Under the PAC-Bayesian framework, the NTK is no longer calculated from the derivatives with respect to the weights directly, but instead from the gradients with respect to the distribution parameters of the weights. We call this kernel the probabilistic NTK (PNTK), and build on it a convergence analysis that characterizes the optimization process of PAC-Bayesian learning. Owing to the explicit solution obtained from the optimization analysis, we further formulate the generalization bound of PAC-Bayesian learning for the first time, and demonstrate its advantage by comparing it with the theoretical generalization bounds of learning frameworks with non-stochastic neural networks.

We summarize our contributions as follows:

• With a detailed characterization of gradient descent training of the PAC-Bayes objective function, we show that the final solution is kernel ridge regression with the PNTK as its kernel.

• Based on the optimization solution, we derive an analytical and guaranteed PAC-Bayesian bound for deep networks for the first time. Moreover, our bound differs from other PAC-Bayes bounds: recent works require computing the posterior distribution, whereas our bound is completely independent of it.

• The performance of PAC-Bayesian learning depends on the selection of a large number of hyperparameters. We design a training-free proxy based on our theoretical bound and show that it is effective and time-saving.

• Our technique of analyzing the optimization and generalization of probabilistic neural networks through over-parameterization has a wide range of applications, such as variational auto-encoders (Kingma & Welling, 2013; Rezende et al., 2014) and deep Bayesian networks (MacKay, 1992; Neal, 2012); we believe it can provide a basis for the analysis of over-parameterized probabilistic neural networks.
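As a minimal illustration of the PNTK idea (ours, not the paper's construction; a one-layer probabilistic model stands in for the deep, wide network, and all names are assumptions), the sketch below forms the kernel from gradients of the output with respect to the distribution parameters (mu, rho) rather than the weights themselves, and then solves the associated kernel ridge regression.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with a linear target (hypothetical, for illustration only).
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 0.5])
X_test = rng.normal(size=(10, 3))

d = X.shape[1]
mu = rng.normal(size=d) * 0.1
rho = np.full(d, -1.0)
sigma = np.exp(rho)
eps = rng.normal(size=d)   # one fixed weight sample: w = mu + sigma * eps

def pntk_features(Z):
    # Gradients of f(x) = (mu + sigma * eps) . x w.r.t. the distribution
    # parameters: d f / d mu = x, and d f / d rho = sigma * eps * x.
    return np.hstack([Z, Z * (sigma * eps)])

# Empirical PNTK as the Gram matrix of the parameter gradients.
Phi, Phi_test = pntk_features(X), pntk_features(X_test)
K = Phi @ Phi.T
k_test = Phi_test @ Phi.T

# Kernel ridge regression with the PNTK as its kernel.
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
pred = k_test @ alpha
```

With a small ridge term, the kernel ridge regression solution nearly interpolates the training labels, matching the near-zero training error predicted by the convergence analysis.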

Haddouche et al. (2021) extended the PAC-Bayesian theory to learning problems with unbounded loss functions. Furthermore, several improved PAC-Bayesian bounds suitable for different scenarios were introduced by Bégin et al. (2014; 2016). As a result of the flexibility and generality of PAC-Bayes, it is widely used to analyze complex, non-convex, and over-parameterized optimization problems, especially over-parameterized neural networks (Guedj, 2019). Neyshabur et al. (2017b) presented a generalization bound for feedforward neural networks with ReLU activations in terms of the product of the spectral norms of the layers and the Frobenius norms of the weights.
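A hedged sketch of the layerwise quantity appearing in such spectral-norm bounds (our illustration with arbitrary random weights, not the authors' code): the product of the per-layer spectral norms, scaled by the summed squared ratios of Frobenius to spectral norms.

```python
import numpy as np

rng = np.random.default_rng(2)
# Weight matrices of a hypothetical 3-layer ReLU network (shapes arbitrary).
weights = [rng.normal(size=s) / np.sqrt(s[1])
           for s in [(64, 10), (64, 64), (1, 64)]]

spec = [np.linalg.norm(W, 2) for W in weights]      # spectral norms
frob = [np.linalg.norm(W, 'fro') for W in weights]  # Frobenius norms

# Capacity term in the spirit of Neyshabur et al. (2017b): product of
# spectral norms times the square root of the summed squared stable ranks.
capacity = np.prod(spec) * np.sqrt(sum((f / s)**2 for f, s in zip(frob, spec)))
```

Such a quantity is computable from the trained weights alone, which is what makes spectral-norm bounds a common point of comparison for PAC-Bayesian bounds.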

