DEMYSTIFYING THE OPTIMIZATION AND GENERALIZATION OF DEEP PAC-BAYESIAN LEARNING

Abstract

PAC-Bayes has long served as a framework for generalization analysis, in which the expected population error can be bounded by the sum of the training error and the divergence between the posterior and prior distributions. Beyond its success as a tool for generalization bound analysis, the PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, a procedure we refer to simply as PAC-Bayesian learning. PAC-Bayesian learning has been shown empirically to achieve competitive expected test-set error through gradient descent training, while providing a tight generalization bound in practice. Despite this empirical success, the theoretical analysis of deep PAC-Bayesian learning for neural networks is rarely explored. To this end, this paper develops a theoretical convergence and generalization analysis for PAC-Bayesian learning. For a deep and wide probabilistic neural network, we show that when PAC-Bayesian learning is applied, the converged solution corresponds to kernel ridge regression with the probabilistic neural tangent kernel (PNTK) as its kernel. Based on this finding, we further obtain, for the first time, an analytic and guaranteed PAC-Bayesian generalization bound, which improves on the Rademacher complexity-based bound for deterministic neural networks. Finally, drawing on our theoretical results, we propose a proxy measure for efficient hyperparameter selection, which is shown to save substantial time on various benchmarks.
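The bound structure described above can be made concrete with a small numerical sketch. The snippet below is an illustration only, not code from this paper: the function names are ours, and the choice of a McAllester-style bound with diagonal Gaussian posterior and prior is an assumption made for concreteness.

```python
import math

def kl_diag_gaussians(mu_q, sig_q, mu_p, sig_p):
    """KL(N(mu_q, diag(sig_q^2)) || N(mu_p, diag(sig_p^2))), summed over coordinates."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sig_q, mu_p, sig_p)
    )

def pac_bayes_bound(train_error, kl, n, delta=0.05):
    """McAllester-style bound: with probability >= 1 - delta over n i.i.d. samples,
    population error <= train_error + sqrt((KL + ln(2*sqrt(n)/delta)) / (2n))."""
    return train_error + math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))
```

The complexity term shrinks as the sample size grows or as the posterior stays closer to the prior, which is why a small divergence yields a tight bound.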

1. INTRODUCTION

Deep learning has demonstrated powerful learning capability due to its over-parameterized structure, and various network architectures have been responsible for its significant leap in performance (LeCun et al., 2015). Over-fitting and complex hyperparameters are two of the major challenges in deep learning, so establishing generalization guarantees for deep networks is an important research goal (Zhang et al., 2021). Recently, a learning framework that trains a probabilistic neural network with a PAC-Bayesian bound as the objective function has been proposed (Bégin et al., 2016; Dziugaite & Roy, 2017; Neyshabur et al., 2017b; Raginsky et al., 2017; Neyshabur et al., 2017a; London, 2017; Smith & Le, 2017; Pérez-Ortiz et al., 2020; Guan & Lu, 2022), known as PAC-Bayesian learning. While providing a tight generalization bound, PAC-Bayesian learning has been shown to achieve a competitive expected test-set error (Ding et al., 2022). Furthermore, because this generalization bound is computed from the training data alone, it can obviate the need to split data into training, testing, and validation sets, which is highly desirable for training a deep network on scarce data (Pérez-Ortiz et al., 2020; Grünwald & Mehta, 2020). Meanwhile, these advancements in PAC-Bayesian bounds have been widely adapted to different deep neural network architectures, including convolutional neural networks (Zhou et al., 2018; Pérez-Ortiz et al., 2020), binary activated multilayer networks (Letarte et al., 2019), partially aggregated neural networks (Biggs & Guedj, 2020), and graph neural networks (Liao et al., 2020). Given the impressive empirical success of PAC-Bayesian learning, there is increasing interest in understanding its theoretical properties.
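As a rough illustration of how a PAC-Bayes bound can serve directly as a training objective, the following sketch trains a one-parameter probabilistic model by gradient descent. This is our own minimal construction, not code from any of the cited works: we assume a Gaussian posterior optimized via the reparameterization trick, a McAllester-style penalty, and finite-difference gradients with common random numbers to keep the example dependency-free.

```python
import math
import random

def pac_bayes_objective(mu, sigma, data, noise, mu0=0.0, sigma0=1.0, delta=0.05):
    """PAC-Bayes training objective for a model y ~ w*x with posterior N(mu, sigma^2)
    and prior N(mu0, sigma0^2): Monte Carlo empirical risk plus complexity penalty."""
    n = len(data)
    risk = 0.0
    for z in noise:                 # reuse fixed noise draws (common random numbers)
        w = mu + sigma * z          # reparameterization trick
        risk += sum((w * x - y) ** 2 for x, y in data) / n
    risk /= len(noise)
    kl = math.log(sigma0 / sigma) + (sigma ** 2 + (mu - mu0) ** 2) / (2 * sigma0 ** 2) - 0.5
    return risk + math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

def train_posterior_mean(data, steps=200, lr=0.05, eps=1e-3, n_mc=16, seed=0):
    """Minimize the objective over the posterior mean mu by finite-difference
    gradient descent (sigma is held fixed for simplicity)."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 0.1
    for _ in range(steps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_mc)]
        g = (pac_bayes_objective(mu + eps, sigma, data, noise)
             - pac_bayes_objective(mu - eps, sigma, data, noise)) / (2 * eps)
        mu -= lr * g
    return mu
```

On data generated as y = 2x, the learned posterior mean lands close to 2, slightly shrunk toward the prior mean by the KL penalty; the same objective evaluated after training yields the certified risk bound.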
However, existing theoretical analysis is either restricted to a specific technique variant, such as Entropy-SGD, which indirectly minimizes its objective by approximating stochastic gradient ascent on the so-called local entropy (Dziugaite & Roy, 2018a), and differential privacy (Dziugaite & Roy, 2018b), or relies heavily on empirical exploration (Neyshabur et al., 2017a; Dziugaite et al., 2020). To the best of our knowledge, there has been no investigation so far into why the training of

