SCALE-INVARIANT BAYESIAN NEURAL NETWORKS WITH CONNECTIVITY TANGENT KERNEL

Abstract

Studying the loss landscapes of neural networks is critical to understanding generalization and avoiding overconfident predictions. Flatness, which measures the resilience of the loss to perturbations of pre-trained parameters, is widely acknowledged as an important predictor of generalization. Although the concept of flatness has been formalized as a PAC-Bayes bound, it has been observed that such generalization bounds can vary arbitrarily with the scale of the model parameters. Despite previous attempts to address this issue, existing generalization bounds remain vulnerable to function-preserving scaling transformations or are limited to impractical network structures. In this paper, we introduce new PAC-Bayes prior and posterior distributions that are invariant to scaling transformations, achieved by decomposing perturbations into scale and connectivity components. This approach broadens the class of networks to which the resulting generalization bound applies, including those subject to practical transformations such as weight decay with batch normalization. Moreover, we demonstrate that the scale-dependency of flatness can adversely affect the uncertainty calibration of the Laplace approximation, and we propose a solution using our invariant posterior. The proposed invariant posterior enables effective measurement of flatness and calibration at low complexity while remaining invariant to practical parameter transformations, yielding a reliable predictor of neural network generalization.

1. INTRODUCTION

Neural networks (NNs) have achieved tremendous success, but understanding their generalization in real-world scenarios remains challenging (Kendall & Gal, 2017; Ovadia et al., 2019). Although it is widely recognized that NNs naturally generalize well and avoid overfitting, the underlying reasons are not well understood (Neyshabur et al., 2015b; Zhang et al., 2017; Arora et al., 2018). Recent studies of the loss landscapes of NNs attempt to address these issues. For example, Hochreiter & Schmidhuber (1995) proposed the flat minima (FM) hypothesis, which states that loss stability under parameter perturbations correlates positively with network generalizability, as empirically demonstrated by Jiang et al. (2020). However, the FM hypothesis still has limitations. According to Dinh et al. (2017), rescaling two successive layers can arbitrarily degrade a flatness measure while preserving the function computed by the NN, and hence its generalizability. Meanwhile, Li et al. (2018) argued that weight decay (WD) leads to a contradiction of the FM hypothesis in practice: although WD sharpens pre-trained NNs (i.e., decreases loss resilience), it generally improves generalization. In short, these works suggest that transformations of network parameters (e.g., rescaling layers and WD) can produce contradictions to the FM hypothesis. A thorough discussion can be found in Appendix E. To resolve this contradiction, we investigate PAC-Bayesian prior and posterior distributions to derive a new scale-invariant generalization bound. As a result, our bound guarantees invariance for a general
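The layer-rescaling issue raised by Dinh et al. (2017) can be illustrated concretely: for a two-layer ReLU network, scaling the first layer's weights by α > 0 and the second layer's by 1/α leaves the computed function unchanged (by the positive homogeneity of ReLU), yet a naive scale-dependent sharpness proxy changes arbitrarily. The sketch below, written for illustration only (the network shape, the choice of α, and the squared-weight-norm proxy are our assumptions, not the paper's construction), demonstrates both effects with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # first-layer weights (hypothetical toy network)
W2 = rng.normal(size=(2, 8))   # second-layer weights
x = rng.normal(size=(4,))      # a sample input

def f(W1, W2, x):
    """Two-layer ReLU network: f(x) = W2 * relu(W1 * x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

alpha = 10.0
out_orig = f(W1, W2, x)
out_scaled = f(alpha * W1, W2 / alpha, x)

# Function-preserving: relu(a*z) = a*relu(z) for a > 0, so outputs agree.
assert np.allclose(out_orig, out_scaled)

# A naive, scale-dependent sharpness proxy (squared weight norm) is NOT
# invariant: it blows up under the same function-preserving transformation.
def proxy(A, B):
    return (A ** 2).sum() + (B ** 2).sum()

print(proxy(W1, W2), proxy(alpha * W1, W2 / alpha))
```

Any flatness measure built on such scale-dependent quantities can therefore be made arbitrarily sharp without changing the network's predictions, which is the failure mode the scale-invariant posterior in this paper is designed to avoid.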

