SYMMETRIES, FLAT MINIMA AND THE CONSERVED QUANTITIES OF GRADIENT FLOW

Abstract

Empirical studies of the loss landscape of deep networks have revealed that many local minima are connected through low-loss valleys. Yet, little is known about the theoretical origin of such valleys. We present a general framework for finding continuous symmetries in the parameter space, which carve out low-loss valleys. Our framework uses equivariances of the activation functions and can be applied to different layer architectures. To generalize this framework to nonlinear neural networks, we introduce a novel set of nonlinear, data-dependent symmetries. These symmetries can transform a trained model such that it performs similarly on new samples, which allows ensemble building that improves robustness under certain adversarial attacks. We then show that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. The conserved quantities help reveal that using common initialization methods, gradient flow only explores a small part of the global minimum. By relating conserved quantities to convergence rate and sharpness of the minimum, we provide insights on how initialization impacts convergence and generalizability.

1. INTRODUCTION

Training deep neural networks (NNs) is a highly non-convex optimization problem. The loss landscape of a NN, which is shaped by the model architecture and the dataset, is generally very rugged, with the number of local minima growing rapidly with model size (Bray & Dean, 2007; Şimşek et al., 2021). Despite this complexity, recent work has revealed many interesting structures in the loss landscape. For example, NN loss landscapes often contain approximately flat directions along which the loss does not change significantly (Freeman & Bruna, 2017; Garipov et al., 2018). Flat minima have been used to build ensemble or mixture models by sampling different parameter configurations that yield similar loss values (Garipov et al., 2018; Benton et al., 2021). However, finding such flat directions has mostly been done empirically, with few theoretical results. One source of flat directions is parameter transformations that keep the loss invariant (i.e., symmetries). Specifically, moving in parameter space from a minimum in the direction of a symmetry takes us to another minimum. Motivated by the fact that continuous symmetries of the loss result in flat directions at local minima, we derive a general class of such symmetries in this paper. Our key insight is to focus on equivariances of the nonlinear activation functions; most known continuous symmetries can be derived using this framework. Models related by exact symmetries cannot behave differently on different inputs. Hence, for ensembling or robustness tasks, we need to find data-dependent symmetries. Indeed, aside from the familiar "linear symmetries" of NNs, the framework of equivariance allows us to introduce a novel class of symmetries which act nonlinearly on the parameters and are data-dependent. These nonlinear symmetries cover a much larger class of continuous symmetries than their linear counterparts, as they apply to almost any activation function.
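As a toy illustration of how activation-function equivariance produces a parameter-space symmetry (a sketch of ours, not code from the paper; all shapes are arbitrary): ReLU satisfies σ(gz) = gσ(z) for any positive diagonal g, so rescaling adjacent layers by g and g⁻¹ leaves a two-layer ReLU network's output, and hence any loss computed from it, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Two-layer ReLU network f(x) = W2 @ relu(W1 @ x); shapes are illustrative.
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=(3, 10))

# Positive diagonal G commutes with ReLU: relu(G @ z) = G @ relu(z).
# Hence W1 -> G W1, W2 -> W2 G^{-1} is a symmetry of the network function.
g = rng.uniform(0.5, 2.0, size=5)
G = np.diag(g)

out = W2 @ relu(W1 @ x)
out_g = (W2 @ np.diag(1.0 / g)) @ relu(G @ W1 @ x)

assert np.allclose(out, out_g)
```

Because the transformation is exact, the two parameter settings are indistinguishable on every input, which is precisely why data-dependent symmetries are needed for ensembling.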
We provide preliminary experimental evidence that ensembles built using these nonlinear symmetries are more robust to adversarial attacks. Extended flat minima arise frequently in the loss landscape of NNs; we show that symmetry-induced flat minima can be parametrized using conserved quantities. Furthermore, we provide a method for deriving explicit conserved quantities (CQ) for different continuous symmetries of NN parameter spaces. CQ had previously been derived from symmetries for one-parameter groups (Kunin et al., 2021; Tanaka & Kunin, 2021). Using a similar approach, we derive the CQ for general continuous symmetries. This approach fails to find CQ for rotational symmetries. Nevertheless, we find that the conservation law resulting from the symmetry implies a cancellation of angular momenta between layers. To summarize, our contributions are:

1. A general framework based on equivariance for finding symmetries in NN loss landscapes.

2. A derivation of the dimensions of minima induced by symmetries.

3. A new class of nonlinear, data-dependent symmetries of NN parameter spaces.

4. An expansion of prior work on deriving conserved quantities (CQ) associated with symmetries, and a discussion of its failure for rotation symmetries.

5. A cancellation-of-angular-momenta result between layers for rotation symmetries.

6. A parameterization of symmetry-induced flat minima via the associated CQ.

This paper is organized as follows. First, we review existing literature on flat minima, continuous symmetries of parameter space, and conserved quantities. In Section 3, we define continuous symmetries and flat minima, and show how linear symmetries lead to extended minima. We illustrate our constructions through examples of linear symmetries of NN parameter spaces. In Section 4, we define nonlinear, data-dependent symmetries. In Section 5, we use infinitesimal symmetries to derive conserved quantities for parameter space symmetries, extending the results in Kunin et al. (2021) to larger groups and more activation functions. Additionally, we show how CQ can be used to define coordinates along flat minima. We close with experiments involving nonlinear symmetries and conserved quantities, and a discussion of potential use cases.
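To make the conserved-quantity idea concrete before the formal treatment, the following numerical sketch (ours, with hypothetical shapes) tracks the layer imbalance Q = UᵀU − VVᵀ for the two-layer linear loss L = ∥Y − UVX∥² under a forward-Euler discretization of gradient flow. Q is exactly conserved by the continuous flow, so small Euler steps preserve it up to O(lr²) drift per step.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical shapes for a 2-layer linear network with loss L = ||Y - U V X||^2
U = 0.1 * rng.normal(size=(4, 3))
V = 0.1 * rng.normal(size=(3, 5))
X = rng.normal(size=(5, 20))
Y = rng.normal(size=(4, 20))

loss = lambda U, V: np.sum((Y - U @ V @ X) ** 2)
Q = lambda U, V: U.T @ U - V @ V.T   # layer "imbalance"

Q0, loss0 = Q(U, V), loss(U, V)
lr = 1e-4
for _ in range(1000):                 # forward-Euler steps approximating gradient flow
    R = U @ V @ X - Y                 # residual
    gU = 2 * R @ (V @ X).T            # dL/dU
    gV = 2 * U.T @ R @ X.T            # dL/dV
    U, V = U - lr * gU, V - lr * gV

# Q is exactly conserved along the continuous flow; Euler adds only O(lr^2) drift.
assert np.allclose(Q(U, V), Q0, atol=1e-3)
assert loss(U, V) < loss0
```

The cancellation is immediate from the gradients above: Uᵀ(dU/dt) = −2UᵀRXᵀVᵀ = (dV/dt)Vᵀ, so dQ/dt = 0 term by term.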

2. RELATED WORK

Continuous symmetry in parameter space. Overparametrization in neural networks leads to symmetries in the parameter space (Głuch & Urbanke, 2021). Continuous symmetry has been identified in fully-connected linear networks (Tarmoun et al., 2021), homogeneous neural networks (Badrinarayanan et al., 2015; Du et al., 2018), radial neural networks (Ganev et al., 2022), and softmax and batchnorm functions (Kunin et al., 2021). We provide a unified framework that generalizes previous findings, and identify nonlinear group actions that have not been studied before.

Conserved quantities. The imbalance between layers in linear or homogeneous networks is known to be invariant during gradient flow and related to convergence rate (Saxe et al., 2014; Du et al.,



Figure 1: Visualization of the extended minimum in a 2-layer linear network with loss L = ∥Y − UVX∥². Points along the minimum are related to each other by the scaling symmetry U → Ug⁻¹ and V → gV. Conserved quantities, Q, associated with the scaling symmetry parametrize points along the minimum.
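The symmetry in the caption can be checked directly. A minimal numerical sketch (ours; shapes chosen arbitrarily) constructs a global minimum of L = ∥Y − UVX∥² and moves along it with an invertible g, leaving the loss at zero:

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.normal(size=(4, 3))
V = rng.normal(size=(3, 5))
X = rng.normal(size=(5, 10))
Y = U @ V @ X  # choose Y so that (U, V) sits at a global minimum: L = 0

loss = lambda U, V: np.sum((Y - U @ V @ X) ** 2)

# Any invertible g maps this minimum to another point on the same valley:
g = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)   # generically invertible
U2, V2 = U @ np.linalg.inv(g), g @ V

assert np.isclose(loss(U2, V2), loss(U, V))
# Q = U.T @ U - V @ V.T generally changes under g, so it labels
# where along the valley the transformed model sits.
```

This is exactly the sense in which the conserved quantity Q serves as a coordinate along the flat minimum: the loss is constant along the g-orbit while Q varies.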

