SAM AS AN OPTIMAL RELAXATION OF BAYES

Abstract

Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. This connection enables a new Adam-like extension of SAM that automatically obtains reasonable uncertainty estimates, while sometimes also improving accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.

1. INTRODUCTION

Sharpness-aware minimization (SAM) (Foret et al., 2021) and related adversarial methods (Zheng et al., 2021; Wu et al., 2020; Kim et al., 2022) have been shown to improve generalization, calibration, and robustness in various applications of deep learning (Chen et al., 2022; Bahri et al., 2022), but the reasons behind their success are not fully understood. The original proposal of SAM was geared towards biasing training trajectories towards flat minima, and the effectiveness of such minima has various Bayesian explanations, for example, those relying on the optimization of description lengths (Hinton & Van Camp, 1993; Hochreiter & Schmidhuber, 1997), PAC-Bayes bounds (Dziugaite & Roy, 2017; 2018; Jiang et al., 2020; Alquier, 2021), or marginal likelihoods (Smith & Le, 2018). However, SAM is not known to directly optimize any such Bayesian criterion, even though some connections to PAC-Bayes do exist (Foret et al., 2021). The issue is that the 'max-loss' used in SAM fundamentally departs from a Bayesian-style 'expected-loss' under the posterior; see Fig. 1(a). The two methodologies are distinct, and little is known about their relationship.

Here, we establish a connection by using a relaxation of the Bayes objective in which the expected negative-loss is replaced by its tightest convex lower bound. The bound is optimal and is obtained through Fenchel biconjugates, which naturally yield the maximum loss used in SAM (Fig. 1(a)). From this viewpoint, SAM can be seen as optimizing the relaxed Bayes objective to find the mean of an isotropic Gaussian posterior while keeping the variance fixed. Higher variances lead to smoother objectives, which biases the solution towards flatter regions (Fig. 1(b)). Essentially, the result connects SAM and Bayes through a Fenchel biconjugate that replaces the expected loss in Bayes by a maximum loss.

What do we gain from this connection? The generality of our result makes it possible to easily combine the complementary strengths of SAM and Bayes. For example, we show that the relaxed-Bayes objective can be used to learn the variance parameter, which yields an Adam-like extension of SAM (Alg. 1). The variances are cheaply obtained from the vector that adapts the learning rate, and SAM's hyperparameter is adjusted for each parameter dimension via the variance vector. The extension improves the performance of SAM on standard benchmarks while giving uncertainty estimates comparable to those of the best Bayesian methods.

Our work complements similar extensions of SGD, RMSprop, and Adam from the Bayesian deep-learning community (Gal & Ghahramani, 2016; Mandt et al., 2017; Khan et al., 2018; Osawa et al., 2019). So far, there is no work on such connections between SAM and Bayes, except for a recent empirical study by Kaddour et al. (2022). Husain & Knoblauch (2022) give an adversarial interpretation of Bayes, but it does not cover methods like SAM. Our work focuses on connections to approximate Bayesian methods, which can have both theoretical and practical impact. We discuss a new path to robustness by connecting the two fields of adversarial and Bayesian methodologies.
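To make the contrast described above concrete, the following schematic (in our notation, with the regularization terms of the Bayes objective omitted and the precise correspondence between the radius and the fixed variance left to the later derivation) juxtaposes the Bayes-style expected loss under an isotropic Gaussian posterior with the SAM-style maximum loss produced by the Fenchel-biconjugate relaxation:
\[
\min_{m}\; \mathbb{E}_{\theta\sim\mathcal{N}(m,\,\sigma^{2}I)}\big[\ell(\theta)\big]
\quad\rightsquigarrow\quad
\min_{m}\; \max_{\|\epsilon\|\le\rho}\; \ell(m+\epsilon),
\]
where $\ell$ is the training loss, $m$ is the posterior mean, the variance $\sigma^{2}$ is kept fixed, and the perturbation radius $\rho$ plays the role of SAM's hyperparameter.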
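As a rough illustration of the Adam-like extension described above (and not a reproduction of Alg. 1), the sketch below shows a SAM-style step in which an Adam-style second-moment estimate both adapts the step size and scales the per-dimension perturbation; the function names, the exact scaling, and all default values are assumptions made for illustration only.

```python
import numpy as np

def adaptive_sam_step(theta, grad_fn, v, lr=0.1, rho=0.05, beta2=0.999, eps=1e-8):
    """One illustrative update; grad_fn(theta) returns the loss gradient at theta."""
    g = grad_fn(theta)
    v = beta2 * v + (1 - beta2) * g ** 2    # Adam-style running "variance" vector
    scale = np.sqrt(v) + eps                # also used to adapt the learning rate
    perturbation = rho * g / scale          # per-dimension adversarial perturbation
    g_adv = grad_fn(theta + perturbation)   # gradient at the perturbed parameters
    theta = theta - lr * g_adv / scale      # SAM-style update with an adapted step
    return theta, v

# Toy usage on the quadratic loss 0.5 * ||w||^2, whose gradient is w.
# v is initialized to ones so the first steps of this sketch are well-scaled.
theta, v = np.ones(3), np.ones(3)
for _ in range(200):
    theta, v = adaptive_sam_step(theta, lambda w: w, v)
```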

