SYMMETRIES, FLAT MINIMA AND THE CONSERVED QUANTITIES OF GRADIENT FLOW

Abstract

Empirical studies of the loss landscape of deep networks have revealed that many local minima are connected through low-loss valleys. Yet, little is known about the theoretical origin of such valleys. We present a general framework for finding continuous symmetries in the parameter space, which carve out low-loss valleys. Our framework uses equivariances of the activation functions and can be applied to different layer architectures. To generalize this framework to nonlinear neural networks, we introduce a novel set of nonlinear, data-dependent symmetries. These symmetries can transform a trained model such that it performs similarly on new samples, which allows ensemble building that improves robustness under certain adversarial attacks. We then show that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. The conserved quantities help reveal that using common initialization methods, gradient flow only explores a small part of the global minimum. By relating conserved quantities to convergence rate and sharpness of the minimum, we provide insights on how initialization impacts convergence and generalizability.

1. INTRODUCTION

Training deep neural networks (NNs) is a highly non-convex optimization problem. The loss landscape of a NN, which is shaped by the model architecture and the dataset, is generally very rugged, with the number of local minima growing rapidly with model size (Bray & Dean, 2007; Şimşek et al., 2021). Despite this complexity, recent work has revealed many interesting structures in the loss landscape. For example, NN loss landscapes often contain approximately flat directions along which the loss does not change significantly (Freeman & Bruna, 2017; Garipov et al., 2018). Flat minima have been used to build ensemble or mixture models by sampling different parameter configurations that yield similar loss values (Garipov et al., 2018; Benton et al., 2021). However, finding such flat directions is mostly done empirically, with few theoretical results. One source of flat directions is parameter transformations that keep the loss invariant (i.e. symmetries). Specifically, moving in the parameter space from a minimum in the direction of a symmetry takes us to another minimum. Motivated by the fact that continuous symmetries of the loss result in flat directions at local minima, we derive a general class of such symmetries in this paper.

Figure 1: Visualization of the extended minimum in a 2-layer linear network with loss L = ∥Y − UVX∥². Points along the minimum are related to each other by the scaling symmetry U → Ug⁻¹ and V → gV. Conserved quantities Q associated with the scaling symmetry parametrize points along the minimum.

Our key insight is to focus on equivariances of the nonlinear activation functions; most known continuous symmetries can be derived using this framework. Models related by exact equivalence cannot behave differently on different inputs. Hence, for ensembling or robustness tasks, we need to find data-dependent symmetries.
Indeed, aside from the familiar "linear symmetries" of NNs, the framework of equivariance allows us to introduce a novel class of symmetries which act nonlinearly on the parameters and are data-dependent. These nonlinear symmetries cover a much larger class of continuous symmetries than their linear counterparts, as they apply to almost any activation function. We provide preliminary experimental evidence that ensembles built using these nonlinear symmetries are more robust to adversarial attacks. Extended flat minima arise frequently in the loss landscape of NNs; we show that symmetry-induced flat minima can be parametrized using conserved quantities. Furthermore, we provide a method for deriving explicit conserved quantities (CQ) for different continuous symmetries of NN parameter spaces. CQ had previously been derived from symmetries for one-parameter groups (Kunin et al., 2021; Tanaka & Kunin, 2021). Using a similar approach, we derive the CQ for general continuous symmetries. This approach fails to find CQ for rotational symmetries. Nevertheless, we find that the conservation law resulting from the symmetry implies a cancellation of angular momenta between layers. To summarize, our contributions are:
1. A general framework based on equivariance for finding symmetries in NN loss landscapes.
2. A derivation of the dimensions of minima induced by symmetries.
3. A new class of nonlinear, data-dependent symmetries of NN parameter spaces.
4. An expansion of prior work on deriving conserved quantities (CQ) associated with symmetries, and a discussion of its failure for rotation symmetries.
5. A cancellation-of-angular-momenta result between layers for rotation symmetries.
6. A parameterization of symmetry-induced flat minima via the associated CQ.
This paper is organized as follows. First, we review existing literature on flat minima, continuous symmetries of parameter space, and conserved quantities.
In Section 3, we define continuous symmetries and flat minima, and show how linear symmetries lead to extended minima. We illustrate our constructions through examples of linear symmetries of NN parameter spaces. In Section 4, we define nonlinear, data-dependent symmetries. In Section 5, we use infinitesimal symmetries to derive conserved quantities for parameter space symmetries, extending the results in Kunin et al. (2021) to larger groups and more activation functions. Additionally, we show how CQ can be used to define coordinates along flat minima. We close with experiments involving nonlinear symmetries and conserved quantities, and a discussion of potential use cases.

2. RELATED WORK

Continuous symmetry in parameter space. Overparametrization in neural networks leads to symmetries in the parameter space (Głuch & Urbanke, 2021). Continuous symmetry has been identified in fully-connected linear networks (Tarmoun et al., 2021), homogeneous neural networks (Badrinarayanan et al., 2015; Du et al., 2018), radial neural networks (Ganev et al., 2022), and softmax and batchnorm functions (Kunin et al., 2021). We provide a unified framework that generalizes previous findings, and identify nonlinear group actions that have not been studied before.

Conserved quantities. The imbalance between layers in linear or homogeneous networks is known to be invariant during gradient flow and related to convergence rate (Saxe et al., 2014; Du et al., 2018; Arora et al., 2018a;b; Tarmoun et al., 2021; Min et al., 2021). Huh (2020) discovered similar conservation laws in natural gradient descent. Kunin et al. (2021) develop a more general approach for finding conserved quantities for certain one-parameter symmetry groups. Tanaka & Kunin (2021) relate continuous symmetries to dynamics of conserved quantities using an approach similar to Noether's theorem (Noether, 1918). We develop a procedure that determines conserved quantities from infinitesimal symmetries, which is closely related to Noether's theorem.

Topology of the minimum. The global minima of overparametrized neural networks are connected spaces instead of isolated points. We show that parameter space symmetries lead to extended flat minima. Previously, Cooper (2018) proved that the global minimum is usually a manifold with dimension equal to the number of parameters minus the number of data points. We derive the dimensionality of the symmetry-induced flat minima and show that it is related to the number of infinitesimal symmetry generators and the dimensions of the weight matrices. Şimşek et al. (2021) study permutation symmetry and show that in certain overparametrized networks, the minima related by permutations are connected. Entezari et al. (2022) hypothesize that SGD solutions can be permuted to points on the same connected minimum. Ainsworth et al. (2023) develop algorithms that find such permutations. Additional discussion on mode connectivity, sharpness of minima, and the role of symmetry in optimization can be found in Appendix A.

3. CONTINUOUS SYMMETRIES IN DEEP LEARNING

In this section, we first summarize our notation for basic neural network constructions (see Appendix C for more details). Then we consider transformations on the parameter space that leave the loss invariant and demonstrate how they lead to extended flat minima.

3.1. THE PARAMETER SPACE AND LOSS FUNCTION

The parameters of a neural network consist of weights W_i ∈ R^{n_i×m_i} for each layer i, where n_i and m_i are the layer output and input dimensions, respectively. For feedforward networks, successive output and input dimensions match: m_i = n_{i−1}. We group the widths into a tuple n = (n_L, ..., n_1, n_0), and the parameter space becomes Param = R^{n_L×n_{L−1}} × ··· × R^{n_1×n_0}. We denote an element therein as a tuple of matrices θ = (W_i ∈ R^{n_i×n_{i−1}})_{i=1}^L. The activation of the i-th layer is a piecewise differentiable function σ_i : R^{n_i} → R^{n_i}, which may or may not be pointwise. For θ ∈ Param and input x ∈ R^{n_0}, the feature vector of the i-th layer in a feedforward network is Z_{i+1}(x) = W_{i+1} σ_i(Z_i(x)), where the juxtaposition 'Wσ(Z)' denotes an arbitrary linear operation depending on the context, for example a matrix product or a convolution. For simplicity, we largely focus on the case of multilayer perceptrons (MLPs). We denote the final output by F_θ : R^{n_0} → R^{n_L}, defined as F_θ(x) = σ_L(Z_L(x)). The loss function L of our model is defined as:

L : Param × Data → R, L(θ, (x, y)) = Cost(y, F_θ(x)), (1)

where Data = R^{n_0} × R^{n_L} is the space of data and Cost : R^{n_L} × R^{n_L} → R is a differentiable cost function, such as mean square error or cross-entropy. In the case of multiple samples, we have matrices X ∈ R^{n_0×k} and Y ∈ R^{n_L×k} whose columns are the k samples, and we retain the same notation for the feedforward function, namely F_θ : R^{n_0×k} → R^{n_L×k}. Most of our results concern properties of L that hold for any training data. Hence, unless specified otherwise, we take a fixed batch of data {(x_i, y_i)}_{i=1}^k ⊆ Data and consider the loss as a function of the parameters only.
Example 3.1. Two-layer network with MSE. Consider a network with n = (n, h, m), the identity output activation (σ_L(x) = x), and no biases. The parameter space is Param(n) = R^{n×h} × R^{h×m} and we denote an element as θ = (U, V).
Taking the mean square error cost function, the loss function for data (X, Y) ∈ R^{m×k} × R^{n×k} takes the form L(θ, (X, Y)) = (1/k)∥Y − U σ(V X)∥².
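As a concrete illustration, the loss of Example 3.1 can be sketched in a few lines of NumPy (a minimal sketch; the tanh activation, widths, and random data are our own illustrative choices):

```python
import numpy as np

def two_layer_loss(U, V, X, Y, sigma=np.tanh):
    """MSE loss of Example 3.1:  L(theta, (X, Y)) = (1/k) ||Y - U sigma(V X)||^2."""
    k = X.shape[1]                                  # number of samples
    return np.sum((Y - U @ sigma(V @ X)) ** 2) / k

rng = np.random.default_rng(0)
n, h, m, k = 3, 4, 2, 5                             # widths n = (n, h, m), k samples
U, V = rng.normal(size=(n, h)), rng.normal(size=(h, m))
X, Y = rng.normal(size=(m, k)), rng.normal(size=(n, k))

L_val = two_layer_loss(U, V, X, Y)                  # nonnegative scalar
perfect = two_layer_loss(U, V, X, U @ np.tanh(V @ X))  # zero when Y = F_theta(X)
```

Setting Y = F_θ(X) drives the loss to zero, matching the definition above.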

3.2. ACTION OF CONTINUOUS GROUPS AND FLAT MINIMA

Let G be a group. An action of G on the parameter space Param is a function • : G × Param → Param, written as g • θ, that satisfies the unit and multiplication axioms of the group, meaning id • θ = θ where id is the identity of G, and g 1 • (g 2 • θ) = (g 1 g 2 ) • θ for all g 1 , g 2 ∈ G . Definition 3.1 (Parameter space symmetry). The action G × Param → Param is a symmetry of L if it leaves the loss function invariant, that is: L(g • θ) = L(θ), ∀θ ∈ Param, g ∈ G. (2) We describe examples of parameter space symmetries in the next section. Before doing so, we show how a parameter space symmetry leads to flat minima (see Appendix C.6): Proposition 3.2. Suppose G × Param → Param is a symmetry of L. If θ * is a critical point (resp. local minimum) of L, then so is g • θ * for any g ∈ G. The proof of this result relies on using the differential of the action of g to relate the gradient of L at θ * with the gradient at g • θ * . We see that, if θ * is a local minimum, then so is every element of the set {g • θ * | g ∈ G}. This set is known as the orbit of θ * under the action of G. The orbits of different parameter values may be of different dimensions. However, in many cases, there is a "generic" or most common dimension, which is the orbit dimension of any randomly chosen θ.
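The scaling symmetry of Figure 1 gives a concrete instance of Proposition 3.2: for a linear network, transforming (U, V) ↦ (Ug⁻¹, gV) maps every point of an orbit to a point with the same loss. A minimal numerical sketch (shapes and random data are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, m, k = 3, 4, 2, 6
U, V = rng.normal(size=(n, h)), rng.normal(size=(h, m))
X, Y = rng.normal(size=(m, k)), rng.normal(size=(n, k))

def loss(U, V):
    """Linear-network MSE, L = (1/k) ||Y - U V X||^2."""
    return np.sum((Y - U @ V @ X) ** 2) / k

g = rng.normal(size=(h, h))                  # generic g in GL_h (invertible a.s.)
U_g, V_g = U @ np.linalg.inv(g), g @ V       # the action g • (U, V)
assert np.allclose(loss(U, V), loss(U_g, V_g))   # L(g • theta) = L(theta)
```

Since Ug⁻¹gV = UV, the invariance holds exactly; in particular, every point of the orbit of a minimum is itself a minimum.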

3.3. EQUIVARIANCE OF THE ACTIVATION FUNCTION

In this section, we describe a large class of linear symmetries of L using an equivariance property of the activations between layers. For accessibility, we focus on the example of two layers with output F(x) = U σ(V x) for (U, V) ∈ Param = R^{m×h} × R^{h×n} and x ∈ R^n. All results generalize to multiple layers by letting U = W_i and V = W_{i−1} be weights of two successive layers in a deep neural network (see Appendix C.5). Let G ⊆ GL_h(R) be a subgroup of the general linear group, and let π : G → GL_h(R) be a representation (the simplest example is π(g) = g). We consider the following action of the group G on the parameter space Param:

g • U = U π(g⁻¹), g • V = gV. (3)

This action is a symmetry of L if and only if the following identity holds:

σ(gz) = π(g)σ(z), ∀g ∈ G, ∀z ∈ R^h. (4)

We now turn our attention to examples. To ease notation, we write GL_h instead of GL_h(R).
Example 3.2. Linear networks. A simple example of (4) is that of linear networks, where σ is the identity function: σ(x) = x. One can take π(g) = g and G = GL_h.
Example 3.3. Homogeneous activations. Suppose the activation σ : R^h → R^h is homogeneous, meaning that (1) σ is applied pointwise in the standard basis and (2) there exists α > 0 such that σ(cz) = c^α σ(z) for all c ∈ R_{>0} and z ∈ R^h. Such an activation is equivariant under the positive scaling group G ⊂ GL_h consisting of diagonal matrices with positive diagonal entries. Explicitly, the group G consists of diagonal matrices g = diag(c) with c = (c_1, ..., c_h) ∈ R^h_{>0}. For z = (z_1, ..., z_h) ∈ R^h and g ∈ G, we have σ(gz)_j = σ(c_j z_j) = c_j^α σ(z_j) for each j, so σ(gz) = g^α σ(z). Hence, the equivariance equation is satisfied with π(g) = g^α.
Example 3.4. LeakyReLU. This is a special case of a homogeneous activation, defined as σ(z) = max(z, 0) + s min(z, 0), with s ∈ R_{≥0}. We have α = 1, and π(g) = g.
Example 3.5.
Radial rescaling activations. A less trivial example of continuous symmetries is the case of a radial rescaling activation (Ganev et al., 2022), where for z ∈ R^h we have σ(z) = f(∥z∥)z for some function f : R → R. Radial rescaling activations are equivariant under rotations of the input: for any orthogonal transformation g ∈ O(h) (that is, g^T g = I), we have σ(gz) = gσ(z) for all z ∈ R^h. Indeed, σ(gz) = f(∥gz∥)(gz) = g(f(∥z∥)z) = gσ(z), where we use the fact that ∥gz∥ = √(z^T g^T g z) = √(z^T z) = ∥z∥ for g ∈ O(h). Hence, (4) is satisfied with π(g) = g. We arrive at our first novel result, whose proof appears in Appendix C.6.
Theorem 3.3. The dimension of a generic orbit in Param under the appropriate symmetry group is given as follows. The cases are divided based on whether h ≤ max(n, m) or not.

Activation        | Symmetry group     | Orbit dim, h ≤ max(n, m) | Orbit dim, h ≥ max(n, m)
Identity          | GL_h(R)            | h²                       | h(n + m) − nm
Homogeneous       | Positive rescaling | h                        | h
Radial rescaling  | O(h)               | h(h−1)/2                 | h(h−1)/2 − (h − max(m, n))(h − max(m, n) − 1)/2

As an aside, we note that a familiar example where (4) is satisfied involves the permutation of neurons. More precisely, suppose σ is pointwise and let G be the finite group of h × h permutation matrices. Then (4) holds with π(g) = g. However, the permutation group is finite (0-dimensional), and so does not imply the presence of flat minima.
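The equivariance relation (4) underlying Examples 3.3–3.5 is easy to check numerically (a sketch; the leak parameter, f(r) = tanh(r)/r, and random data are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
h = 5
z = rng.normal(size=h)

# Homogeneous activation (Example 3.4): LeakyReLU, degree alpha = 1.
leaky = lambda z, s=0.1: np.maximum(z, 0) + s * np.minimum(z, 0)
c = rng.uniform(0.5, 2.0, size=h)                # g = diag(c), positive rescaling group
assert np.allclose(leaky(c * z), c * leaky(z))   # sigma(g z) = pi(g) sigma(z)

# Radial rescaling activation (Example 3.5): sigma(z) = f(||z||) z with f(r) = tanh(r)/r.
radial = lambda z: (np.tanh(np.linalg.norm(z)) / np.linalg.norm(z)) * z
Q, _ = np.linalg.qr(rng.normal(size=(h, h)))     # random orthogonal g in O(h)
assert np.allclose(radial(Q @ z), Q @ radial(z)) # sigma(g z) = g sigma(z)
```

Both identities hold exactly (up to floating-point error), in line with π(g) = g^α for the rescaling group and π(g) = g for O(h).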

3.4. INFINITESIMAL SYMMETRIES

Deriving conserved quantities from symmetries requires the infinitesimal versions of parameter space symmetries. Recall that any smooth action of a matrix Lie group G ⊆ GL_h induces an action of the infinitesimal generators of the group, i.e., elements of its Lie algebra. Concretely, let g = Lie(G) = T_I G be the Lie algebra, which can be identified with a certain subspace of matrices in gl_h = R^{h×h}. For every M ∈ g, we have an exponential map exp_M : R → G defined as exp_M(t) = Σ_{k=0}^∞ (tM)^k / k!. If ρ : G → GL_h is a (linear) representation, then the infinitesimal action dρ : g → gl_h is given by dρ(M) = (d/dt)|_{t=0} ρ(exp_M(t)). In the case of the action appearing in (3), the corresponding infinitesimal action of the Lie algebra g is given by:

M • U = −U dπ(M), M • V = M V. (5)

More generally, suppose G acts linearly on the parameter space (see Appendix C for non-linear versions). Set d to be the dimension of the parameter space, and make the identification Param ≃ R^d by flattening matrices into column vectors. The general linear group GL(Param) ≃ GL_d(R) consists of all invertible linear transformations of Param. Suppose G is a subgroup of GL(Param), so its Lie algebra g is a Lie subalgebra of gl_d = R^{d×d}. For M ∈ g and θ ∈ Param, the infinitesimal action is given simply by matrix multiplication: M • θ. In the case of a parameter space symmetry, the invariance of L translates into the following orthogonality condition, where the inner product ⟨·, ·⟩ : Param × Param → R is calculated by contracting all indices, e.g. ⟨A, B⟩ = Σ_{ijk...} A_{ijk...} B_{ijk...}.
Proposition 3.4. Let G be a matrix Lie group and a symmetry of L. Then the gradient vector field is pointwise orthogonal to the action of any M ∈ g:

⟨∇_θ L, M • θ⟩ = 0, ∀θ ∈ Param. (6)

4. NONLINEAR, DATA-DEPENDENT SYMMETRIES

For common activation functions, the equivariance σ(gz) = π(g)σ(z) of (4) holds only for g belonging to a relatively small subgroup of GL_h.
For ReLU, g must be in the positive scaling group, while for the usual sigmoid activation, the equation only holds for trivial g = id. However, under certain conditions, it is possible to define a nonlinear action of the full GL_h which applies to many different activations. The subtlety of such an action is that it is data-dependent, which means that, for any g ∈ GL_h, the transformation of the parameter space depends on the input data x.
The nonlinear action. For any nonzero vector z ∈ R^h, let (r, α_1, ..., α_{h−1}) be the spherical coordinates of z, and define the following h × h matrix:

(R_z)_{ij} =
  z_i cos(α_{j−1}) (∏_{k=1}^{j−1} sin(α_k))^{−1}   if j ≤ i and ∏_{k=1}^{i−1} sin(α_k) ≠ 0,
  −r sin(α_i)                                       if j = i + 1,
  0                                                 otherwise,

where α_0 = 0 by convention. We observe that R_z is the product of a rotation matrix and a rescaling by |z|. Moreover, since z ≠ 0, the first column of R_z is z itself (so that R_z e_1 = z), and R_z has inverse given by R_z^{−1} = (1/|z|²) R_z^T. Using these facts, one arrives at the following result, stated in the case of a two-layer neural network with the notation of Section 3.3, and proven in Appendix D:
Theorem 4.1. Suppose σ(z) is nonzero for any z ∈ R^h. Then there is an action GL_h × (Param × R^n) → Param × R^n given by

g • (U, V, x) = (U R_{σ(Vx)} R_{σ(gVx)}^{−1}, gV, x). (7)

The evaluation of the feedforward function at x is unchanged: F_{(U,V)}(x) = F_{(U R_{σ(Vx)} R_{σ(gVx)}^{−1}, gV)}(x).
We emphasize that a necessary and sufficient condition for the particular action of Theorem 4.1 to be well-defined is that σ(z) be nonzero for any z ∈ R^h; this is the case for the usual sigmoid. Moreover, in Appendix D.2, we provide a generalization to the case where σ(z) is only required to be nonzero for any nonzero z ∈ R^h, a condition satisfied by hyperbolic tangent, leaky ReLU, and many other activations. The cost of such a generalization is a restriction to a 'non-degenerate locus' of Param × R^n where V x ≠ 0.
Theorem 4.1 also generalizes to multi-layer networks, as explained in Appendix D.3. We have the following explicit algorithm to compute the action of Theorem 4.1:

0. Input: weight matrices (U, V), input vector x ∈ R^n, matrix g ∈ GL_h.
1. Determine the spherical coordinates of σ(V x) and σ(gV x), and construct the matrices R_{σ(Vx)} and R_{σ(gVx)}.
2. Compute the inverse R_{σ(gVx)}^{−1} = (1/|σ(gVx)|²) R_{σ(gVx)}^T.
3. Set U′ = U R_{σ(Vx)} R_{σ(gVx)}^{−1} and V′ = gV.
4. Output: the transformed weights (U′, V′). The data x ∈ R^n remains unchanged.

Lipschitz bounds. Unlike the exact symmetries of Section 3, a data-dependent action may alter the loss in function space. This is evident from (7): while the transformed and original feedforward functions have the same value at x, they will differ at other points. That is, if x̃ ∈ R^n is an input value different from x, then F_{(U,V)}(x̃) ≠ F_{(U R_{σ(Vx)} R_{σ(gVx)}^{−1}, gV)}(x̃) in general. However, the transformed feedforward function will differ from the original one in a controlled way. More precisely, when σ is Lipschitz continuous, we show that there is a bound on how much the Lipschitz bound of the feedforward function changes after the nonlinear action. The relevance of such a bound originates in the fact that we expect the distance between data points to encode important information about shared features. To be more specific, fix weight matrices (U, V), which provide the feedforward function F(x) = U σ(V x). For any input vector x ∈ R^n and matrix g ∈ GL_h, the transformed weight matrices (U R_{σ(Vx)} R_{σ(gVx)}^{−1}, gV) provide a new feedforward function given by:

F^{(g,x)}_{(U,V)} : R^n → R^m, F^{(g,x)}_{(U,V)}(x̃) = U R_{σ(Vx)} R_{σ(gVx)}^{−1} σ(gV x̃).

Proposition 4.2 (Lipschitz bounds from equivariance). Let σ be Lipschitz continuous with Lipschitz constant η. Then F^{(g,x)}_{(U,V)} is Lipschitz continuous with bound η∥U∥∥V∥ · |σ(Vx)| ∥g∥ / |σ(gVx)|. In particular, the Lipschitz bound of the original feedforward function is η∥U∥∥V∥. Thus, if it happens that |σ(Vx)| ∥g∥ < |σ(gVx)|, then the Lipschitz bound decreases when transforming the parameters. Additionally, we observe that the nonlinear action does not disrupt the latent distribution of the data significantly.
See Appendix D.5 for the proof of Proposition 4.2, which relies on iterative applications of the Cauchy-Schwarz inequality, as well as the fact that ∥R_z^{±1}∥ = |z|^{±1}.
General equivariance. The action described is an instance of a more general framework of equivariance. Specifically, a map c : GL_h × R^h → GL_h is said to be an equivariance if it satisfies (1) c(id_h, z) = id_h for all z, and (2) c(g_1, g_2 z)c(g_2, z) = c(g_1 g_2, z) for all g_1, g_2 ∈ GL_h and z. These two conditions on c translate directly into the unit and multiplication axioms of a group, generalizing π(g_1 g_2) = π(g_1)π(g_2) and π(id_h) = id_h. Every equivariance gives rise to a nonlinear action of GL_h on Param × R^h given by g • (U, V, x) = (U c(g, Vx)^{−1}, gV, x). This action is a symmetry preserving the loss if and only if the following generalization of (4) holds:

General Equivariance: σ(gz) = c(g, z)σ(z), ∀g ∈ GL_h, ∀z ∈ R^h. (9)

An explicit example of such an equivariance is c(g, z) = R_{σ(gz)} R_{σ(z)}^{−1}, and Proposition 4.2 generalizes to any general equivariance by replacing |σ(Vx)| ∥g∥ / |σ(gVx)| with ∥c(g, Vx)^{−1}∥.
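The nonlinear action of Theorem 4.1 can be sketched in NumPy. The proof only uses two properties of R_z, namely R_z e_1 = z and R_z^{−1} = R_z^T/|z|²; any matrix of the form |z|Q with Q orthogonal and first column z/|z| has them, so for brevity we build R_z from a Householder reflection rather than the paper's spherical-coordinate formula (a sketch; the sigmoid activation, shapes, and random data are our own choices):

```python
import numpy as np

def R(z):
    """|z| * Q with Q a Householder reflection whose first column is z/|z|.
    Then R(z) @ e1 = z and inv(R(z)) = R(z).T / |z|^2, the two properties
    the proof of Theorem 4.1 relies on."""
    r = np.linalg.norm(z)
    e1 = np.zeros_like(z); e1[0] = 1.0
    v = z / r - e1
    if np.linalg.norm(v) < 1e-12:                   # z already along e1
        return r * np.eye(len(z))
    Q = np.eye(len(z)) - 2.0 * np.outer(v, v) / (v @ v)
    return r * Q

def nonlinear_action(g, U, V, x, sigma=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """g • (U, V, x) = (U R_{sigma(Vx)} inv(R_{sigma(gVx)}), g V, x), eq. (7)."""
    a, b = sigma(V @ x), sigma(g @ V @ x)           # sigmoid is nonzero everywhere
    U_new = U @ R(a) @ (R(b).T / (b @ b))           # inv(R_b) = R_b.T / |b|^2
    return U_new, g @ V, x

rng = np.random.default_rng(3)
n, h, m = 3, 4, 2
U, V, x = rng.normal(size=(n, h)), rng.normal(size=(h, m)), rng.normal(size=m)
g = rng.normal(size=(h, h))                          # generic element of GL_h
U2, V2, _ = nonlinear_action(g, U, V, x)

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
assert np.allclose(U @ sig(V @ x), U2 @ sig(V2 @ x))  # F unchanged at x
```

The final assertion is exactly the invariance statement of Theorem 4.1: the transformed network agrees with the original at the chosen input x, though in general not at other inputs.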

5. CONSERVED QUANTITIES OF GRADIENT FLOW

We have shown that continuous symmetries lead us along extended flat minima in the loss landscape. In this section, we identify quantities that (partially) parameterize these minima. We first show that certain real-valued functions on the parameter space remain constant during gradient flow. We refer to such functions as conserved quantities. Applying symmetries changes the value of a conserved quantity; therefore, conserved quantities can be used to parameterize flat minima.
Gradient flow (GF). Recall that GD proceeds in discrete steps with the update rule θ_{t+1} = θ_t − ε∇L(θ_t), where ε is the learning rate (which in general can be a symmetric matrix) and t = 0, 1, 2, ... are the time steps. In gradient flow, we can define a smooth curve in the parameter space from a choice of initial values to the limiting local minimum without discretizing over time. The curve is a function of a continuous time variable t ∈ R, and the velocity of this curve at any point is equal to the gradient of the loss function, scaled by the negative of the learning rate. In other words, the dynamics of the parameters under GF are given by:

θ̇(t) = dθ(t)/dt = −ε∇_{θ(t)}L. (10)

From an initialization θ(0) at t = 0, GF defines a trajectory θ(t) ∈ Param for t ∈ R_{>0}, which limits to a critical point. In this way, GF is a continuous version of GD.
Conserved quantities. A conserved quantity of GF is a function Q : Param → R such that the value of Q at any two time points s, t ∈ R_{>0} along a GF trajectory is the same: Q(θ(s)) = Q(θ(t)). In other words, we have dQ(θ(t))/dt = 0. Note that, if f : R → R is any function and Q is a conserved quantity, then the composition f ∘ Q is also a conserved quantity. Several conserved quantities of GF have appeared in the literature, most notably the layer imbalance Q_imb ≡ ∥W_i∥² − ∥W_{i−1}∥² (Du et al., 2018) for each pair of successive feedforward linear layers (σ(x) = x), and its full matrix version Q_i = W_i^T W_i − W_{i−1} W_{i−1}^T.
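The conservation of the matrix imbalance can be observed numerically by integrating gradient flow with a small Euler step (a sketch; the two-layer linear setup, step size, and random data are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, h, m, k = 3, 4, 2, 6
U, V = 0.1 * rng.normal(size=(n, h)), 0.1 * rng.normal(size=(h, m))
X, Y = rng.normal(size=(m, k)), rng.normal(size=(n, k))

Q0 = V @ V.T - U.T @ U                      # matrix imbalance between the two layers
eps = 1e-4                                  # small step approximates continuous-time GF
for _ in range(1000):
    E = Y - U @ V @ X                       # residual of L = (1/k) ||Y - U V X||^2
    gU = -(2.0 / k) * E @ (V @ X).T         # gradient of L wrt U
    gV = -(2.0 / k) * U.T @ E @ X.T         # gradient of L wrt V
    U, V = U - eps * gU, V - eps * gV
Q1 = V @ V.T - U.T @ U

drift = np.max(np.abs(Q1 - Q0))             # O(eps^2) per step, tiny in total
assert drift < 1e-3
```

The first-order change in the imbalance cancels exactly along the gradient direction, so the observed drift comes only from the O(ε²) discretization error of gradient descent, consistent with the discussion of Appendix E referenced below.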
We now propose a generalization of the layer imbalance by associating a conserved quantity to any infinitesimal symmetry. As in Section 3.2, suppose a matrix Lie group G acts linearly on the parameter space. Then, from (6), we have the identity ⟨∇_θ L, M • θ⟩ = 0 for any element M in the Lie algebra g. Using the gradient flow dynamics (10), this identity becomes:

⟨ε^{−1} θ̇, M • θ⟩ = 0. (11)

In other words, the velocity at any point of a gradient flow curve is orthogonal to the infinitesimal action. For simplicity, we set the learning rate to the identity: ε = I (all results generalize to symmetric ε). The following proposition (whose proof is elementary and well-known) provides a way of 'integrating' equation (11), in the appropriate sense, in order to obtain conserved quantities:
Proposition 5.1. Suppose the action of G on Param is linear and leaves L invariant. For any M ∈ g, there is a conserved quantity Q_M : Param → R given by Q_M(θ) = ⟨θ, M • θ⟩.
While Proposition 5.1 directly links the infinitesimal action to conserved quantities, it has the limitation that the conserved quantity corresponding to an anti-symmetric matrix M = −M^T in g is constantly zero, and we do not obtain meaningful conserved quantities. Instead, we can only conclude that flow curves satisfy the differential equation (11). Fixing a basis (θ_1, ..., θ_d) for Param ≃ R^d, this equation becomes Σ_{i<j} M_{ij} r_{ij}² φ̇_{ij} ≡ 0, where (r_{ij}, φ_{ij}) are the polar coordinates of the point (θ_i, θ_j) ∈ R² (see Appendix C.9.5). In summary, we find:

M ∈ g            | differential equation              | conserved quantity
symmetric M      | θ̇^T M θ = 0                       | Q_M(θ) = θ^T M θ
anti-symmetric M | Σ_{i<j} M_{ij} r_{ij}² φ̇_{ij} ≡ 0 | (none)

Conserved quantities parametrize symmetry-induced flat directions. We observe that applying a symmetry changes the values of the conserved quantities Q_M (Figure 1).
Indeed, for M ∈ g and g ∈ G, we have Q_M(g • θ) = Q_{g^T M g}(θ) for all θ ∈ Param, so applying the group action transforms the conserved quantity Q_M to Q_{g^T M g}. As discussed in Section 3, applying g to a minimum θ* of L yields another minimum g • θ*; hence applying symmetries leads to a partial parameterization of flat minima. Note that, in general, we may lack a sufficient number of Q_M to fully parameterize a flat minimum. For example, in the linear network UVx with G = GL_h, flat minima generically have h² dimensions, whereas the number of independent nonzero Q_M is h(h + 1)/2, which is the dimension of the space of symmetric matrices M = M^T in gl_h. In gradient descent, the values of these conserved quantities may change due to the time discretization. However, the change in Q is expected to be small. For example, in two-layer linear networks, the change in Q per step is bounded by the square of the learning rate. Appendix E contains derivations and empirical observations of the magnitude of the change in Q.
Relation to Noether's theorem. In physics, Noether's theorem (Noether, 1918) states that continuous symmetries give rise to conserved quantities. Recently, Tanaka & Kunin (2021) showed that Noether's theorem can also be applied to GD by approximating it as a second-order GF. We show that in the limit where the second-order GF reduces to the first-order GF (10), results from Noether's theorem reduce to our conservation law ⟨Mθ, ∇L⟩ = 0 (6). In short, using Noether's theorem, the conserved Noether current is J_M = e^{t/τ} J_{0M} with J_{0M} = ⟨Mθ, ε^{−1}θ̇⟩. In the limit τ → 0, using (10), J_{0M} = −⟨Mθ, ∇L⟩, and the conservation dJ_M/dt = 0 implies J_{0M} = 0, meaning we recover (6). Details appear in Appendix B.
Examples. We present examples of conserved quantities for two-layer neural networks, all of which directly generalize to the multi-layer case. See Appendix C.9 for full derivations (which heavily rely on properties of the trace).
We adopt the notation of Section 3.3.
Example 5.1. General equivariant activation. Suppose σ is equivariant under a linear action of a subgroup G ⊆ GL_h(R), so that π(g)σ(z) = σ(gz). Then the two-layer network F(z) = U σ(V z) is invariant under G, as is the loss function. For symmetric M ∈ g, Proposition 5.1 yields the following conserved quantity:

Q_M : Param → R, Q_M(U, V) = Tr[V^T M V] − Tr[U^T U dπ(M)].

Indeed, this follows from the fact that M • (U, V) = (−U dπ(M), M V), as in (5).
Example 5.2. Imbalance in linear layers. Suppose the network is linear. Then σ(z) = z and the loss is invariant under GL_h(R). For symmetric M we have the conserved quantity Q_M(U, V) = Tr[(V V^T − U^T U)M]. Moreover, each component of the matrix V V^T − U^T U is conserved.
Example 5.3. Homogeneous activation under scaling. Suppose σ is a homogeneous activation of degree α. Let G = (R_{>0})^h be the positive rescaling group, so that σ(gz) = g^α σ(z) for any g ∈ G and z ∈ R^h. Note that the Lie algebra of G consists of all diagonal matrices in gl_h, so that, in particular, each M ∈ g is symmetric. Since dπ(M) = αM for any M ∈ g, we obtain the conserved quantity Q_M(U, V) = Tr[(V V^T − αU^T U)M]. Using the basis M = E_{kk}, we see that Q = diag[V V^T − αU^T U] is conserved (here, diag[A] is the leading diagonal). Special cases of this are LeakyReLU and ReLU with α = 1.
Example 5.4. Radial rescaling activations. Let σ be such a radial rescaling activation. As in Section 3.3, the orthogonal group G = O(h) is a symmetry of L. The Lie algebra g = so_h comprises anti-symmetric matrices, and so Proposition 5.1 yields no non-trivial conserved quantities. However, using the canonical basis of g = so_h given by E_{[kl]} = E_{kl} − E_{lk} (so [kl] indicates anti-symmetrized indices), one uses equation (11) to deduce the following novel result (see Appendix C.9):
Theorem 5.2.
When σ is a radial rescaling activation, we have:

V V̇^T − V̇ V^T + U^T U̇ − U̇^T U = 0 (13)

for any (U, V) ∈ Param, where the dots indicate derivatives with respect to gradient flow. Expanding the (k, l) entry of the matrix on the left-hand side of (13), we obtain Σ_{s=1}^n r²_{U,s;kl} φ̇_{U,s;kl} + Σ_{s=1}^m r²_{V^T,s;kl} φ̇_{V^T,s;kl} = 0, where (r_{U,s;kl}, φ_{U,s;kl}) and (r_{V^T,s;kl}, φ_{V^T,s;kl}) are the 2D polar coordinates of the points (U_{sk}, U_{sl}) and (V^T_{sk}, V^T_{sl}). This is analogous to the angular momentum in 2D, that is, x ∧ ẋ = r²φ̇. Intuitively, Theorem 5.2 implies that in every 2D plane (k, l), the angular momenta of the rows of U and the columns of V sum to zero. These results also apply to linear networks F(x) = UVx, since rotational symmetries are linear.
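Since the result applies to linear networks, the cancellation in Theorem 5.2 can be checked directly from the gradient-flow velocities U̇ = −∇_U L and V̇ = −∇_V L of the two-layer linear loss (a sketch; the shapes and random data are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, h, m, k = 3, 4, 2, 6
U, V = rng.normal(size=(n, h)), rng.normal(size=(h, m))
X, Y = rng.normal(size=(m, k)), rng.normal(size=(n, k))

E = Y - U @ V @ X                  # residual of L = (1/k) ||Y - U V X||^2
dU = (2.0 / k) * E @ (V @ X).T     # GF velocity: dot(U) = -grad_U L
dV = (2.0 / k) * U.T @ E @ X.T     # GF velocity: dot(V) = -grad_V L

# Left-hand side of (13): the angular momenta of U's columns and V's rows cancel.
lhs = V @ dV.T - dV @ V.T + U.T @ dU - dU.T @ U
assert np.allclose(lhs, np.zeros((h, h)))
```

Expanding the four terms shows that they cancel pairwise for any (U, V) and any data, which is exactly the content of equation (13).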

6. APPLICATIONS

We present a set of experiments aimed at assessing the utility of the nonlinear group action and conserved quantities. A summary of the results is shown in Figure 2. We show that the value of conserved quantities can impact convergence rate and generalizability. We also find the nonlinear action to be viable for ensemble building to improve robustness under certain adversarial attacks.
Exploration of the minimum. While Q is often unbounded, common initialization methods such as that of Glorot & Bengio (2010) limit the values of Q to a small range (Appendix F). As a result, only a small part of the minimum is reachable by the models. Symmetries allow us to explore portions of flat minima that gradient descent rarely reaches.
Convergence rate and generalizability. Conserved quantities are by definition unchanged during gradient flow. By relating the values of conserved quantities to convergence rate and model generalizability, we have access to properties of the trajectory and the final model before the gradient flow starts. This knowledge allows us to choose good conserved quantity values at initialization. In Appendix G, we derive the relation between Q and convergence rate for two example optimization problems, and provide numerical evidence that initializing parameters with certain conserved quantity values accelerates convergence. In Appendix H, we derive the relation between conserved quantities and the sharpness of minima in a simple two-layer network, and show empirically that Q values affect the eigenvalues of the Hessian (and possibly generalizability) in larger networks.
Ensemble models. Applying the nonlinear group action allows us to obtain an ensemble without any retraining or searching. We show that even with stochasticity in the data, the loss is approximately unchanged under the group action. The ensemble has the potential to improve robustness under adversarial attacks (Appendix I).

7. DISCUSSION

In this paper, we present a general framework of equivariance and introduce a new class of nonlinear, data-dependent symmetries of neural network parameter spaces. These symmetries give rise to conserved quantities in gradient flows, with important implications for improving the optimization and robustness of neural networks. While our work sheds new light on the link between symmetries and flat minima, it contains several limitations, which merit further investigation. First, we have not been able to determine conserved quantities in the radial rescaling case, only a differential equation that gradient flow curves must satisfy. Second, one major contribution of this paper is the nonlinear group action of Section 4. However, our formulation only guarantees full GL_h equivariance for batch size k = 1. In future work, we plan to explore more consequences and variations of this nonlinear group action, with the hope of generalizing to larger batch sizes. Finally, in many cases, parameter space symmetries lead to model compression: i.e., finding a lower-dimensional space of parameters with the same expressivity as the original space.


A ADDITIONAL RELATED WORKS

Mode connectivity / flat regions / ensembles. In neural networks, the optima of the loss function are connected by curves or volumes on which the loss is almost constant (Freeman & Bruna, 2017; Garipov et al., 2018; Draxler et al., 2018; Benton et al., 2021; Izmailov et al., 2018). These works propose various algorithms for finding such low-cost curves, which provide an inexpensive way to create an ensemble of models from a single trained model. Other related work includes linear mode connectivity (Frankle et al., 2020), using mode connectivity for loss landscape analysis (Gotmare et al., 2018), and studying flat regions by removing symmetry (Pittorino et al., 2022).

Sharpness of minima and generalization

Recent theory and empirical studies suggest that sharp minima do not generalize well (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017; Petzka et al., 2021). Explicitly searching for flat minima has been shown to improve generalization bounds and model performance (Chaudhari et al., 2017; Foret et al., 2020; Kim et al., 2022). The sharpness of a minimum can be defined using the largest loss value in a neighborhood of the minimum (Keskar et al., 2017; Foret et al., 2020; Kim et al., 2022), visualization of the change in loss under weight perturbations of various magnitudes (Izmailov et al., 2018), the singularity of the Hessian (Sagun et al., 2017), or the volume of the basin that contains the minimum (approximated by a Radon measure (Zhou et al., 2020) or the product of the eigenvalues of the Hessian (Wu et al., 2017)). Under most of these metrics, however, equivalent models can be built whose minima have different sharpness but the same generalization ability (Dinh et al., 2017). Applications include explaining the good generalization of SGD by examining asymmetric minima (He et al., 2019), and new pruning algorithms that search for minimizers close to flat regions (Chao et al., 2020).

Parameter space symmetry and optimization. While sets of parameter values related by symmetries produce the same output, the gradients at these points are different, resulting in different learning dynamics (Kunin et al., 2021; Van Laarhoven, 2017). This insight has led to a number of new advances in optimization. Neyshabur et al. (2015) and Meng et al. (2019) propose optimization algorithms that are invariant to symmetry transformations on parameters. Armenta et al. (2023) and Zhao et al. (2022) apply loss-invariant transformations to parameters to improve the magnitude of gradients, and consequently the convergence speed. The structures encoded in known symmetries have also led to new optimization methods and insights into the loss landscape. Bamler & Mandt (2018) improve the convergence speed by optimizing in the direction of weakly broken symmetry. Zhang et al. (2020) discuss how symmetry helps in obtaining global minimizers for a class of nonconvex problems. The potential relevance of continuous symmetries in optimization problems was also discussed in Leake & Vishnoi (2021), which also provides an overview of Lie groups.
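The observation that symmetry-related parameters give identical outputs but different gradients is easy to see numerically; a small sketch for a two-layer linear network (the rescaling g and all dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, h, n, k = 2, 3, 4, 5
U = rng.normal(size=(m, h)); V = rng.normal(size=(h, n))
X = rng.normal(size=(n, k)); Y = rng.normal(size=(m, k))

def loss_and_grads(U, V):
    """Loss ||Y - U V X||^2 and its gradients with respect to U and V."""
    R = Y - U @ V @ X
    return np.sum(R * R), -2 * R @ (V @ X).T, -2 * U.T @ R @ X.T

g = np.diag([2.0, 0.5, 3.0])                   # a rescaling symmetry in GL_h
L1, gU1, gV1 = loss_and_grads(U, V)
L2, gU2, gV2 = loss_and_grads(U @ np.linalg.inv(g), g @ V)

print(abs(L1 - L2))                            # same loss at both points
norm = lambda a, b: np.sqrt(np.sum(a * a) + np.sum(b * b))
print(norm(gU1, gV1), norm(gU2, gV2))          # but different gradient magnitudes
```

This difference in gradient magnitude at equal-loss points is exactly what the teleportation-style methods cited above exploit.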

B RELATION TO NOETHER'S THEOREM

We will now show how the approach in Tanaka & Kunin (2021) relates to our conservation law $dQ/dt = \langle \dot\theta, M\theta\rangle = 0$. Assuming a small time-step τ ≪ 1, we can write GD as θ(t + τ) − θ(t) = −ε∇L(θ(t)). Expanding the l.h.s. to second order in τ and discarding O(τ³) terms yields the 2nd order GF equation:

2nd order GF: $\frac{d\theta}{dt} + \frac{\tau}{2}\frac{d^2\theta}{dt^2} = -\bar\varepsilon\,\nabla L$ (14)

Here $\bar\varepsilon = \varepsilon/\tau$. To use Noether's theorem, the dynamics (i.e. GF) must be a variational (Euler-Lagrange (EL)) equation derived from an "action" S(θ) (objective functional), which for (14) is the time integral of a Bregman Lagrangian (Wibisono & Wilson, 2015) $\mathcal{L}$:

$S(\theta) = \int dt\, \mathcal{L}(\theta(t), \dot\theta(t); t) = \int dt\, e^{t/\tau}\left(\frac{\tau}{2}\langle\dot\theta,\, \bar\varepsilon^{-1}\dot\theta\rangle - L(\theta)\right)$ (15)

where θ: R → Param is a trajectory (flow path) in Param, parametrized by t. The variational EL equations find the paths γ* which extremize the action, meaning ∂S/∂γ|_{γ*} = 0. Noether's theorem states that if M ∈ g is a symmetry of the action S(θ) (15) (not just the loss L(θ)), then the Noether current J_M is conserved:

Noether current: $J_M = \left\langle M\theta,\, \frac{\partial\mathcal{L}}{\partial\dot\theta}\right\rangle \propto e^{t/\tau}\langle M\theta,\, \bar\varepsilon^{-1}\dot\theta\rangle = e^{t/\tau} J_{0M},$

Conservation: $\frac{dJ_M}{dt} = e^{t/\tau}\left(\frac{1}{\tau}J_{0M} + \frac{dJ_{0M}}{dt}\right) = 0 \;\Rightarrow\; J_{0M}(t) = J_{0M}(0)\, e^{-t/\tau}$ (16)

Tanaka & Kunin (2021) also derived the Noether current (16), but conclude that because $\mathcal{L}(\theta, \dot\theta) \neq L(\theta)$, the symmetries are "broken" and therefore do not derive conserved charges for the types of symmetries we discussed above. However, while Tanaka & Kunin (2021) focus on 2nd order GF, we note that our conserved Q were derived for first order GF, which is found from the τ → 0 limit of 2nd order GF. In this limit $\mathcal{L} \to -e^{t/\tau}L$ and thus symmetries of L also become symmetries of $\mathcal{L}$. When τ → 0, 2nd order GF reduces to $\bar\varepsilon^{-1}\dot\theta = -\nabla L$, and the conserved charge becomes

$\lim_{\tau\to 0} J_{0M} = -\langle M\theta,\, \nabla L\rangle = J_{0M}(0)\lim_{\tau\to 0} e^{-t/\tau} = 0,$

which means that we recover the invariance under the infinitesimal action (6). In fact, for linear symmetries and symmetric M ∈ g, $J_{0M} \propto dQ_M/dt = 0$.
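For a linear symmetry with symmetric M, the final statement can be checked in two lines, using $Q_M(\theta) = \theta^T M\theta$ and first-order GF $\dot\theta = -\bar\varepsilon\nabla L$ (treating $\bar\varepsilon$ as a scalar learning rate; this is a sketch of the link, not a full derivation):

```latex
\frac{dQ_M}{dt}
  = \dot\theta^{T} M \theta + \theta^{T} M \dot\theta
  = 2\,\langle M\theta,\, \dot\theta\rangle
  = 2\,\bar\varepsilon\,\langle M\theta,\, \bar\varepsilon^{-1}\dot\theta\rangle
  = 2\,\bar\varepsilon\, J_{0M}
  = -2\,\bar\varepsilon\,\langle M\theta,\, \nabla L\rangle = 0,
```

using $M = M^T$ in the second equality and the infinitesimal invariance $\langle M\theta, \nabla L\rangle = 0$ in the last; so the vanishing Noether charge $J_{0M}$ is exactly the statement that $Q_M$ is conserved.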

C NEURAL NETWORKS: LINEAR GROUP ACTIONS

In this appendix, we provide an extended discussion of the topics of Section 3, including full proofs of all results. Specifically, after some technical background material on Jacobians and differentials, we specify our conventions for neural network parameter space symmetries. In contrast to the discussion of the main text, we (1) assume that neural networks have biases, and (2) focus on the multi-layer case rather than just the two-layer case. We then turn our attention to group actions of the parameter space that leave the loss invariant, and the resulting infinitesimal symmetries. The groups we consider are all subgroups of a large group of change-of-basis transformations of the hidden feature spaces; we call this group the 'hidden symmetry group'. We also compute the dimensions of generic extended flat minima in various relevant examples. Finally, we explore consequences of invariant group actions for conserved quantities.

C.1 JACOBIANS AND DIFFERENTIALS

In this section, we summarize background material on Jacobians and differentials. We adopt notation and conventions from differential geometry. Let U ⊂ R^n be an open subset of Euclidean space R^n, and let F: U → R^m be a differentiable function. Let F_1, ..., F_m: U → R be the components of F, so that F(u) = (F_1(u), ..., F_m(u)). The Jacobian of F, also known as the differential of F, at u ∈ U is the following matrix of partial derivatives evaluated at u:

$dF_u = \begin{bmatrix} \frac{\partial F_1}{\partial x_1}\big|_u & \frac{\partial F_1}{\partial x_2}\big|_u & \cdots & \frac{\partial F_1}{\partial x_n}\big|_u \\ \frac{\partial F_2}{\partial x_1}\big|_u & \frac{\partial F_2}{\partial x_2}\big|_u & \cdots & \frac{\partial F_2}{\partial x_n}\big|_u \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial F_m}{\partial x_1}\big|_u & \frac{\partial F_m}{\partial x_2}\big|_u & \cdots & \frac{\partial F_m}{\partial x_n}\big|_u \end{bmatrix}$

The differential dF_u defines a linear map from R^n to R^m, that is, an element of R^{m×n}. Observe that if F itself is linear, then, as matrices, dF_u = F for all points u ∈ U. If G: R^m → R^p is another differentiable map, then the chain rule implies that, for all u ∈ R^n, we have: d(G ∘ F)_u = dG_{F(u)} ∘ dF_u. In the special case m = 1, the differential is a 1 × n row vector, and the gradient ∇_u F of F at u ∈ R^n is defined as the transpose of the Jacobian dF_u:

$\nabla_u F = (dF_u)^T = \left[\frac{\partial F}{\partial x_1}\big|_u \;\cdots\; \frac{\partial F}{\partial x_n}\big|_u\right]^T = \left(\frac{\partial F}{\partial x_i}\Big|_u\right)_{i=1}^{n} \in \mathbb{R}^n$

C.2 NEURAL NETWORK PARAMETER SPACES

Consider a neural network with L layers, input dimension n_0 = n, output dimension n_L = m, and hidden dimensions given by n_1, ..., n_{L-1}. For convenience, we group the dimensions into a tuple n = (n_0, n_1, ..., n_L). The parameter space is given by:

$\mathrm{Param}(\mathbf{n}) = \mathbb{R}^{n_L\times n_{L-1}} \times \mathbb{R}^{n_{L-1}\times n_{L-2}} \times \cdots \times \mathbb{R}^{n_2\times n_1} \times \mathbb{R}^{n_1\times n_0} \times \mathbb{R}^{n_L} \times \mathbb{R}^{n_{L-1}} \times \cdots \times \mathbb{R}^{n_1}$

We write an element therein as a pair θ = (W, b) of tuples W = (W_L, ..., W_1) and b = (b_L, ..., b_1), so that W_i is an n_i × n_{i-1} matrix and b_i is a vector in R^{n_i} for i = 1, ..., L. When n is clear from context, we write simply Param for the parameter space. Fix a piecewise differentiable function σ_i: R^{n_i} → R^{n_i} for each i = 1, ..., L.
The activations can be pointwise (as is conventionally the case), but are not necessarily so. The feedforward function F = F_θ: R^n → R^m corresponding to parameters θ = (W, b) ∈ Param with activations σ_i is defined in the usual recursive way. To be explicit, we define the partial feedforward function F_{θ,i} = F_i: R^n → R^{n_i} to be the map taking x ∈ R^n to σ_i(W_i F_{i-1}(x) + b_i), for i = 1, ..., L, with F_0 = id_{R^{n_0}}. Then the feedforward function is F = F_L. The loss function L of our model is defined as:

$L: \mathrm{Param} \times \mathrm{Data} \to \mathbb{R}, \qquad L(\theta, (x, y)) = \mathrm{Cost}(y, F_\theta(x))$

where Data = R^{n_0} × R^{n_L} is the space of data (i.e., possible training data pairs), and Cost: R^{n_L} × R^{n_L} → R is a differentiable cost function, such as mean square error or cross-entropy. Many, if not most, of our results involve properties of the loss function that hold for any training data. Hence, unless specified otherwise, we take a fixed batch of training data {(x_i, y_i)}_{i=1}^{k} ⊆ Data, and consider the loss to be a function of the parameters only. The above constructions generalize to multiple samples. Specifically, instead of x ∈ R^{n_0} and y ∈ R^{n_L}, one has matrices X ∈ R^{n_0×k} and Y ∈ R^{n_L×k} whose columns are the k samples. Additionally, one uses the Frobenius norm of n_L × k matrices to compute the loss function. The i-th partial feedforward function is F_{θ,i}: R^{n_0×k} → R^{n_i×k}, and we have the feedforward function F_θ: R^{n_0×k} → R^{n_L×k}. (We use the same notation as in the case k = 1; the number of samples will be understood from context.)

Example C.1. Consider the case L = 2 with no biases and no output activation. The dimension vector is (n_0, n_1, n_2) = (n, h, m), so the parameter space is Param(n, h, m) = R^{h×n} × R^{m×h}. Taking the mean square error cost function, the loss function for data (X, Y) ∈ R^{n×k} × R^{m×k} takes the form $L(\theta, (X, Y)) = \frac{1}{k}\|Y - U\sigma(VX)\|^2$, where θ = (W_2, W_1) = (U, V) ∈ Param.
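The recursive definition transcribes directly into code; the sketch below specializes to Example C.1, with ReLU used as a stand-in for the unspecified activation σ:

```python
import numpy as np

def feedforward(weights, biases, activations, X):
    """Partial feedforward maps F_i(X) = sigma_i(W_i F_{i-1}(X) + b_i), F_0 = id."""
    Z = X
    for W, b, sigma in zip(weights, biases, activations):
        Z = sigma(W @ Z + b[:, None])
    return Z

# Example C.1: L = 2, no biases, no output activation, mean-square-error loss.
rng = np.random.default_rng(3)
n, h, m, k = 4, 3, 2, 5
V, U = rng.normal(size=(h, n)), rng.normal(size=(m, h))
X, Y = rng.normal(size=(n, k)), rng.normal(size=(m, k))
relu = lambda z: np.maximum(z, 0)

F = feedforward([V, U], [np.zeros(h), np.zeros(m)], [relu, lambda z: z], X)
loss = np.linalg.norm(Y - F) ** 2 / k
print(loss)                      # equals (1/k) ||Y - U relu(V X)||_F^2
```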

C.3 ACTION OF CONTINUOUS GROUPS AND INFINITESIMAL SYMMETRIES

Let G be a group. An action of G on the parameter space Param is a function •: G × Param → Param, written as g • θ, that satisfies the unit and multiplication axioms of the group, meaning I • θ = θ where I is the identity of G, and g_1 • (g_2 • θ) = (g_1 g_2) • θ for all g_1, g_2 ∈ G. Recall that we say an action G × Param → Param is a symmetry of Param with respect to L if it leaves the loss function invariant, that is:

L(g • θ) = L(θ), ∀θ ∈ Param, g ∈ G (19)

The groups within the scope of this paper are all matrix Lie groups, which are topologically closed subgroups G ⊆ GL_n(R) of the general linear group of invertible n × n real matrices. Any smooth action of such a group induces an action of the infinitesimal generators of the group, i.e., elements of its Lie algebra. Concretely, let g = Lie(G) = T_I G be the Lie algebra, which can be thought of as a certain subspace of matrices in gl_n = R^{n×n}, or (equivalently) as the tangent space at the identity I of G. For every matrix M ∈ g, we have an exponential map exp_M: R → G defined as

$\exp_M(t) = \sum_{k=0}^{\infty} \frac{(tM)^k}{k!}.$

The invariance of the loss then implies:

$dL_\theta\left(\frac{d}{dt}\Big|_0 \exp_M(t)\cdot\theta\right) = \frac{d}{dt}\Big|_0 L(\exp_M(t)\cdot\theta) = \frac{d}{dt}\Big|_0 L(\theta) = 0$

where the first equality follows by the chain rule, the second equality uses the invariance of L, and the third equality follows since L(θ) does not depend on t. Next, we comment on the case of a linear action. Observe that the parameter space is a vector space of dimension $d = \dim(\mathrm{Param}) = \sum_{i=1}^{L} n_i(1 + n_{i-1})$. Hence, there is an isomorphism Param ≃ R^d which flattens any tuple θ = (W, b) into a vector in R^d. We can identify the group GL(Param) of all invertible linear transformations of the parameter space with the group GL_d(R) of invertible d × d matrices, and the Lie algebra of GL(Param) with gl_d = R^{d×d}. Suppose G acts linearly on the parameter space. Then we can identify G with a subgroup of GL(Param), acting on Param with matrix multiplication.
Similarly, we can identify the Lie algebra g of G with a Lie subalgebra of gl_d = R^{d×d}. In this case, the infinitesimal action is given by matrix multiplication: M • θ = Mθ, and we recover Equation (6).
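For a linear network the whole hidden symmetry group leaves the loss invariant, so the loss must be constant along every curve t ↦ exp_M(t) • θ. A small numerical check of this (the matrix exponential is implemented via its Taylor series to keep the sketch self-contained; dimensions and seed are arbitrary):

```python
import numpy as np

def expm(M, terms=30):
    """Matrix exponential via its Taylor series (fine for small matrices)."""
    out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(4)
m, h, n, k = 2, 3, 4, 5
U, V = rng.normal(size=(m, h)), rng.normal(size=(h, n))
X, Y = rng.normal(size=(n, k)), rng.normal(size=(m, k))
loss = lambda U, V: np.sum((Y - U @ V @ X) ** 2)

M = rng.normal(size=(h, h))          # any element of gl_h works for linear nets
vals = []
for t in (0.0, 0.1, 0.5):
    g = expm(t * M)
    vals.append(loss(U @ np.linalg.inv(g), g @ V))
print(vals)                          # constant in t, up to floating-point error
```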

C.4 THE HIDDEN SYMMETRY GROUP

Consider a neural network with L layers and dimensions n. The hidden symmetry group corresponding to dimensions n is defined as the following product of general linear groups:

$GL^{\mathrm{hidden}}_{\mathbf{n}} = GL_{n_1} \times \cdots \times GL_{n_{L-1}}$

An element is a tuple of invertible matrices g = (g_1, ..., g_{L-1}), where g_i ∈ GL_{n_i}. Consider the action of the hidden symmetry group on the parameter space given by:

$GL^{\mathrm{hidden}}_{\mathbf{n}} \curvearrowright \mathrm{Param}(\mathbf{n}), \qquad g \cdot (W, b) = \left(g_i W_i g_{i-1}^{-1},\; g_i b_i\right)_{i=1}^{L}$ (21)

where g_0 = id_{n_0} and g_L = id_{n_L}. This action amounts to changing the basis at each hidden feature space. The Lie algebra of $GL^{\mathrm{hidden}}_{\mathbf{n}}$ is $\mathfrak{gl}^{\mathrm{hidden}}_{\mathbf{n}} = \mathfrak{gl}_{n_1} \times \cdots \times \mathfrak{gl}_{n_{L-1}}$, and the infinitesimal action of the tuple $M \in \mathfrak{gl}^{\mathrm{hidden}}_{\mathbf{n}}$ is given by:

$M \cdot (W, b) = \left(M_i W_i - W_i M_{i-1},\; M_i b_i\right)_{i=1}^{L}$

where we set M_0 and M_L to be the zero matrices.

Example C.2. Consider the case L = 2, with dimension vector n = (n, h, m) and no biases. The hidden symmetry group is GL_h with Lie algebra gl_h. On Param(n, h, m) = R^{h×n} × R^{m×h}, the action of the group and the infinitesimal action of the Lie algebra are given by:

$g \cdot (V, U) = (gV,\; Ug^{-1}), \qquad M \cdot (V, U) = (MV,\; -UM)$

C.5 LINEAR SYMMETRIES

Consider a feedforward fully-connected neural network with widths n = (n_0, ..., n_L), so that the parameter space consists of tuples of weights and biases $\theta = (W_i \in \mathbb{R}^{n_i\times n_{i-1}},\, b_i \in \mathbb{R}^{n_i})_{i=1}^{L}$. For each hidden layer 0 < i < L, let G_i be a subgroup of GL_{n_i}, and let π_i: G_i → GL_{n_i}(R) be a representation (in many cases, we take π_i(g) = g). Hence the product G = G_1 × ⋯ × G_{L-1} is a subgroup of the hidden symmetry group $GL^{\mathrm{hidden}}_{\mathbf{n}}$. Define an action of G on Param via:

$g \cdot W_i = g_i W_i\, \pi_{i-1}(g_{i-1}^{-1}), \qquad g \cdot b_i = g_i b_i \qquad \forall g = (g_1, \ldots, g_{L-1}) \in G$ (22)

where g_0 and g_L are the identity matrices id_{n_0} and id_{n_L}, respectively. This is a version of the action defined in (21), with the addition of the twists resulting from the representations π_i.
We now consider the resulting infinitesimal action. For each i, the representation π_i induces a Lie algebra representation d(π_i): g_i → gl_{n_i}. The infinitesimal action of the Lie algebra g = g_1 × ⋯ × g_{L-1} induced by (22) is given by:

$M \cdot W_i = M_i W_i - W_i\, d(\pi_{i-1})(M_{i-1}), \qquad M \cdot b_i = M_i b_i \qquad \forall M = (M_1, \ldots, M_{L-1}) \in \mathfrak{g}$ (23)

The proof of the first part of the following Proposition proceeds by induction, where the key computation is that of (4). The second part relies on (6).

Proposition C.1. Suppose that, for each i = 1, ..., L, the activation σ_i intertwines the two actions of G_i, that is, σ_i(g_i z_i) = π_i(g_i)σ_i(z_i) for all g_i ∈ G_i, z_i ∈ R^{n_i}. Then:

1. (Combined equivariance of activations) The action of G = G_1 × ⋯ × G_{L-1} defined in (22) leaves the feedforward function, and hence the loss, invariant.

2. The infinitesimal action (23) satisfies the invariance relation (6), i.e., ⟨∇_θ L, M • θ⟩ = 0 for all M ∈ g and θ ∈ Param.

Proof. Let $g = (g_i)_{i=1}^{L-1} \in G$, so that g_i ∈ G_i ⊆ GL_{n_i}. As usual, set g_0 = id_{n_0} and g_L = id_{n_L}. Also set π_0 and π_L to be the identity maps on GL_{n_0} and GL_{n_L}, respectively. Fix parameters θ and an input value x ∈ R^n. We show by induction that the following relation between the partial feedforward functions holds: F_{g•θ,i}(x) = π_i(g_i)(F_{θ,i}(x)) for i = 0, ..., L. The base step is trivial. For the induction step, we use the recursive definition of the partial feedforward functions:

$F_{g\cdot\theta,i}(x) = \sigma_i\big(g_i W_i \pi_{i-1}(g_{i-1}^{-1}) F_{g\cdot\theta,i-1}(x) + g_i b_i\big) = \sigma_i\big(g_i (W_i \pi_{i-1}(g_{i-1}^{-1})\pi_{i-1}(g_{i-1}) F_{\theta,i-1}(x) + b_i)\big) = \pi_i(g_i)\,\sigma_i\big(W_i F_{\theta,i-1}(x) + b_i\big) = \pi_i(g_i)\, F_{\theta,i}(x)$

Hence θ and g • θ define the same feedforward function. Since the loss function depends on the parameters only through the feedforward function, the first claim follows. The second claim is a consequence of Proposition 3.4. Note that the two-layer case of the above result amounts to (4) (the argument can be simplified in that case). While Proposition C.1 is stated for feedforward networks, it can easily be adapted to more general settings, such as networks with skip connections and quiver neural networks.
Denoting the Jacobian of σ_i: R^{n_i} → R^{n_i} at z ∈ R^{n_i} by d(σ_i)_z ∈ R^{n_i×n_i}, the infinitesimal version of σ_i(gz) = π_i(g)σ_i(z) is:

$d(\sigma_i)_z\,(Mz) = d\pi_i(M)\,\sigma_i(z) \qquad \forall M \in \mathfrak{g}_i$ (24)

When the activation is pointwise, we have $Mz \odot \sigma_i'(z) = d\pi_i(M)\,\sigma_i(z)$, where ⊙ denotes elementwise multiplication. We now illustrate Proposition C.1 through examples in the two-layer case with input dimension n, hidden dimension h, hidden activation σ, and output dimension m.

Example C.3 (Linear networks). For linear networks, we have σ(x) = x. One can take π(g) = g and G = GL_h(R).

Example C.4 (Homogeneous activations). Suppose the activation σ: R^h → R^h is homogeneous, so that (1) σ is applied pointwise in the standard basis, and (2) there exists α > 0 such that σ(cz) = c^α σ(z) for all c ∈ R_{>0} and z ∈ R^h. These σ are equivariant under the positive scaling group G ⊂ GL_h consisting of diagonal matrices with positive diagonal entries. For g ∈ G, we have g = diag(c) for some c = (c_1, ..., c_h) ∈ R^h_{>0}. For z = (z_1, ..., z_h) ∈ R^h, we have, in each coordinate j: σ(gz)_j = σ(c_j z_j) = c_j^α σ(z_j), so that σ(gz) = g^α σ(z). Hence, the equivariance condition holds with π(g) = g^α. Since dπ(M) = αM for any element M of the Lie algebra g of G, the infinitesimal version of rescaling invariance of homogeneous σ becomes Mz ⊙ σ′(z) = αMσ(z).

Example C.5 (LeakyReLU). This is a special case of a homogeneous activation, defined as σ(z) = max(z, 0) − s min(z, 0), with s ∈ R_{>0}. We have α = 1 and π(g) = g. Since σ(z) = z ⊙ σ′(z), infinitesimal equivariance becomes Mz ⊙ σ′(z) = Mσ(z).

Example C.6 (Radial rescaling activations). A less trivial example of continuous symmetries is the case of a radial rescaling activation (Ganev et al., 2022), where, for z ∈ R^h \ {0}, σ(z) = f(∥z∥)z for some function f: R → R.
Radial rescaling activations are equivariant under rotations of the input: for any orthogonal transformation g ∈ O(h) (that is, g^T g = I), we have σ(gz) = gσ(z) for all z ∈ R^h. Indeed, σ(gz) = f(∥gz∥)(gz) = g(f(∥z∥)z) = gσ(z), where we use the fact that $\|gz\| = \sqrt{z^T g^T g z} = \sqrt{z^T z} = \|z\|$ for g ∈ O(h). Hence, (4) is satisfied with π(g) = g.
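This equivariance is easy to confirm numerically; the sketch below uses f(r) = tanh(r)/r as one arbitrary choice of radial rescaling function and a random orthogonal g obtained from a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(5)
h = 4

# Radial rescaling activation sigma(z) = f(||z||) z with f(r) = tanh(r)/r.
sigma = lambda z: np.tanh(np.linalg.norm(z)) / np.linalg.norm(z) * z

g, _ = np.linalg.qr(rng.normal(size=(h, h)))   # random orthogonal matrix, g^T g = I
z = rng.normal(size=h)

print(np.abs(sigma(g @ z) - g @ sigma(z)).max())   # ≈ 0: sigma(gz) = g sigma(z)
```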

C.6 LINEAR SYMMETRIES LEAD TO EXTENDED, FLAT MINIMA

In this section, we show that, in the case of a linear group action, applying the action of any element of the group to a local minimum yields another local minimum. This fact is a corollary of a more general result; in order to describe it and remove ambiguity, we include the following clarifications. Let G be a matrix Lie group acting as a linear symmetry. Fix a basis (θ_1, ..., θ_d) of the parameter space. The gradient ∇_θL of the loss L at a point θ ∈ Param is regarded as another vector in Param ≃ R^d, whose i-th coordinate is the partial derivative ∂L/∂θ_i|_θ. Hence, it makes sense to apply the group action to the gradient: g • ∇_θL. We regard vectors in Param ≃ R^d as column vectors with d rows. Thus, the transpose of any vector is a row vector with d columns. In the case of the gradient, its transpose at θ matches the Jacobian dL_θ ∈ R^{1×d} of L (see Appendix C.1), that is: dL_θ = (∇_θL)^T. Alternative notation for the Jacobian is dL_{θ_0} = ∂L/∂θ|_{θ_0}, where we now use θ as a dummy variable and θ_0 ∈ Param as a specific value. As noted above, we are interested in matrix Lie groups G ⊆ GL_d(R) = GL(Param), and assume that the matrix transpose g^T belongs to G for any g ∈ G. These assumptions hold in all examples of interest. We have the following reformulation of Proposition 3.2:

Proposition C.2. Suppose the action of G on the parameter space is linear and leaves the loss invariant. Then the gradients of L at any θ_0 and g • θ_0 are related as follows:

$g^T \cdot \nabla_{g\cdot\theta_0} L = \nabla_{\theta_0} L \qquad \forall g \in G,\; \forall \theta_0 \in \mathrm{Param}$ (25)

If θ* is a critical point (resp. local minimum) of L, then so is g • θ*.

Sketch of proof. Let T_g: Param → Param be the linear transformation corresponding to g ∈ G. The Jacobian dL_{θ_0} is given by:

$dL_{\theta_0} = \frac{\partial L}{\partial\theta}\Big|_{\theta_0} = \frac{\partial (L\circ T_g)}{\partial\theta}\Big|_{\theta_0} = \frac{\partial L}{\partial\theta}\Big|_{g\cdot\theta_0}\cdot\frac{\partial T_g}{\partial\theta}\Big|_{\theta_0} = dL_{g\cdot\theta_0}\circ T_g$

where we use the definition of the Jacobian, the invariance of the loss (L ∘ T_g = L), the chain rule, and the linearity of the action.
The result follows from applying T g -1 on the right to both sides, and taking transposes (see Appendix C.1). The last statement follows from the invariance of L under the action of G, and the fact that ∇ θ * L = 0 at a critical point θ * of L. We conclude that, if θ * is a critical point, then the set {g • θ | g ∈ G} belongs to the critical locus. This set is known as the orbit of θ under the action of G, and is isomorphic to the quotient G/Stab G (θ), where Stab G (θ) = {g ∈ G | g • θ = θ} is the stabilizer subgroup of θ in G. In the case of a linear action, the orbit is a smooth manifold. While the results above imply that the critical locus is a union of G-orbits, they do not imply, in general, that the critical locus is a single G-orbit. They also do not rule out the case that the stabilizer is a somewhat 'large' subgroup of G, in which case the orbit would have low dimension. However, in many cases, there is a topologically dense subset of parameter values θ ∈ Param whose orbits all have the same dimension. We call such an orbit a 'generic' orbit. We now turn our attention to examples of two-layer networks where such a generic orbit exists.
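The last statement of Proposition C.2 can be checked numerically: starting from an explicit global minimum of a two-layer linear network (taking h = m, so that U = id together with the least-squares solution for V is a minimum), the gradient also vanishes at g • θ* for a generic invertible g. A sketch, with arbitrary dimensions and seed:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 3, 4, 10
h = m                                  # hidden dim = output dim for simplicity
X, Y = rng.normal(size=(n, k)), rng.normal(size=(m, k))

# A global minimum of L = ||Y - U V X||^2: U = id, V = least-squares solution.
W = Y @ X.T @ np.linalg.inv(X @ X.T)
U, V = np.eye(m), W

def grad_norm(U, V):
    """Largest entry (in absolute value) of the gradient of L at (U, V)."""
    R = Y - U @ V @ X
    return max(np.abs(2 * R @ (V @ X).T).max(), np.abs(2 * U.T @ R @ X.T).max())

g = rng.normal(size=(h, h)) + 3 * np.eye(h)      # a generic invertible matrix
print(grad_norm(U, V))                           # ≈ 0: critical point
print(grad_norm(U @ np.linalg.inv(g), g @ V))    # ≈ 0: g . theta* is also critical
```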

C.6.1 FLAT DIRECTIONS IN THE TWO-LAYER CASE

Recall that the parameter space of a two-layer network is Param = R^{m×h} × R^{h×n}, where the dimension vector is (m, h, n), and we write elements as (U, V). The action of GL_h is g • (U, V) = (Ug^{-1}, gV). Let Param° ⊆ Param be the subset of pairs (U, V) where each of U and V has full rank. This is an open dense subset of Param, and is preserved by the GL_h-action.

Proposition C.3. The GL_h-orbit of each element of Param° has dimension

dim(Orbit) = h² − max(0, h − n) max(0, h − m).

Proof. Fix (U, V) ∈ Param°. Suppose h ≤ n, so that h² − max(0, h − n) max(0, h − m) = h². Then V ∈ R^{h×n} defines a surjective linear map, so it has a right inverse V† ∈ R^{n×h}. If g ∈ GL_h belongs to the stabilizer of (U, V), then we have gV = V. Applying V† on the right to both sides, we obtain g = id_h. Thus, the stabilizer of (U, V) is trivial, and the orbit has dimension equal to the dimension of the group, namely h². The case h ≤ m is similar. Now suppose h > max(n, m). In this case, h² − max(0, h − n) max(0, h − m) = h(n + m) − nm. Set $U_0 = \begin{bmatrix}\mathrm{id}_m & 0\end{bmatrix} \in \mathbb{R}^{m\times h}$ and $V_0 = \begin{bmatrix}\mathrm{id}_n \\ 0\end{bmatrix} \in \mathbb{R}^{h\times n}$, so that the last h − m columns of U_0 are zero, and the last h − n rows of V_0 are zero. Then, by the rank assumption, there exists g_1 ∈ GL_h such that Ug_1 = U_0, and there exists g_2 ∈ GL_h such that g_2^{-1}V = V_0. Without loss of generality, we can take g_1 and g_2 such that det(g_1) > 0 and det(g_2) > 0. Thus, both g_1 and g_2 belong to the connected component of the identity in GL_h. Next, consider the action of G = GL_h on full rank matrices in R^{m×h} and R^{h×n} individually. We have that Stab_G(U) = g_1 Stab_G(U_0) g_1^{-1} and Stab_G(V) = g_2 Stab_G(V_0) g_2^{-1}.
The stabilizer in GL_h of the pair (U, V) ∈ Param° can be written as:

$\mathrm{Stab}_G(U, V) = \{g \in G \mid Ug^{-1} = U \text{ and } gV = V\}$ (26)
$= \mathrm{Stab}_G(U) \cap \mathrm{Stab}_G(V)$ (27)
$= (g_1\,\mathrm{Stab}_G(U_0)\,g_1^{-1}) \cap (g_2\,\mathrm{Stab}_G(V_0)\,g_2^{-1})$

Since g_1 and g_2 belong to the connected component of the identity, the dimension of $(g_1\,\mathrm{Stab}_G(U_0)\,g_1^{-1}) \cap (g_2\,\mathrm{Stab}_G(V_0)\,g_2^{-1})$ is equal to the dimension of $\mathrm{Stab}_G(U_0) \cap \mathrm{Stab}_G(V_0)$.

Hence we reduce the problem to computing the dimension of Stab_G(U_0) ∩ Stab_G(V_0). To this end, observe that a matrix g belongs to Stab_G(U_0) (resp. Stab_G(V_0)) if and only if it is of the form:

$g = \begin{bmatrix}\mathrm{id}_m & 0\\ * & *\end{bmatrix} \qquad \left(\text{resp. } g = \begin{bmatrix}\mathrm{id}_n & *\\ 0 & *\end{bmatrix}\right)$

where the lower left * ∈ R^{(h−m)×m} and the lower right * ∈ GL_{h−m} (resp. upper right * ∈ R^{n×(h−n)} and lower right * ∈ GL_{h−n}) are arbitrary. If m ≥ n, taking the intersection amounts to considering matrices of the form:

$g = \begin{bmatrix}\mathrm{id}_n & 0 & 0\\ 0 & \mathrm{id}_{m-n} & 0\\ 0 & * & *\end{bmatrix}$

where the rows and columns are divided according to the partition h = n + (m − n) + (h − m). If n ≥ m, taking the intersection amounts to considering matrices of the form:

$g = \begin{bmatrix}\mathrm{id}_m & 0 & 0\\ 0 & \mathrm{id}_{n-m} & *\\ 0 & 0 & *\end{bmatrix}$

where the rows and columns are divided according to h = m + (n − m) + (h − n). In both cases, the dimension of the intersection is (h − n)(h − m) = h² − hn − hm + nm. We obtain the dimension of the orbit as: h² − (h − n)(h − m) = h(n + m) − nm.

Recall that the symmetry group for homogeneous activations is the coordinate-wise positive rescaling subgroup of GL_h, consisting of diagonal matrices with positive entries along the diagonal. We denote this subgroup by T_+(h). Similarly, the symmetry group for radial rescaling activations is the orthogonal group O(h). For linear networks, the activation is the identity function, so the symmetry group is all of GL_h.

Corollary C.4. The orbit of a point in Param° under the appropriate symmetry group is given by:

Type of activation | Symmetry group | Dimension of generic orbit
Linear | GL_h | h² − max(0, h − n) max(0, h − m)
Homogeneous | T_+(h) | min(h, max(n, m))
Radial rescaling | O(h) | $\binom{h}{2}$ if h ≤ max(n, m); $\binom{h}{2} - \binom{h-\max(m,n)}{2}$ otherwise

Proof. Adopt the notation of the proof of the above Proposition. The stabilizer in T_+(h) of (U_0, V_0) is the intersection of the stabilizer in GL_h of (U_0, V_0) with T_+(h). This intersection is easily seen to have dimension max(0, h − max(n, m)). Subtracting this from dim(T_+(h)) = h, we obtain the result for the homogeneous case. For the orthogonal case, the stabilizer in O(h) of (U_0, V_0) is the intersection of the stabilizer in GL_h of (U_0, V_0) with O(h). This intersection has dimension 0 if h ≤ max(n, m) and $\binom{h-\max(n,m)}{2}$ otherwise. Subtracting from $\dim(O(h)) = \binom{h}{2}$, we obtain the result for the radial rescaling case.
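The linear-network row of the table, i.e. Proposition C.3, can also be checked numerically: the orbit dimension equals the rank of the infinitesimal action M ↦ (−UM, MV) of gl_h at (U, V), which is straightforward to compute for random full-rank parameters. A sketch:

```python
import numpy as np

def orbit_dim(U, V):
    """Rank of the infinitesimal GL_h action M -> (-U M, M V) at (U, V)."""
    h = U.shape[1]
    cols = []
    for a in range(h):
        for b in range(h):
            E = np.zeros((h, h)); E[a, b] = 1.0      # basis element of gl_h
            cols.append(np.concatenate([(-U @ E).ravel(), (E @ V).ravel()]))
    return np.linalg.matrix_rank(np.array(cols).T)

rng = np.random.default_rng(7)
results = []
for m, h, n in [(2, 3, 4), (3, 5, 2), (4, 2, 6)]:
    U, V = rng.normal(size=(m, h)), rng.normal(size=(h, n))
    predicted = h * h - max(0, h - n) * max(0, h - m)
    results.append((orbit_dim(U, V), predicted))
print(results)    # each pair matches the formula of Proposition C.3
```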

C.7 CONSERVED QUANTITIES

We now turn our attention to gradient flow and conserved quantities. In this section, we give a formal definition of a conserved quantity. Let V = R^d be the standard vector space of dimension d. Suppose L: V → R is a differentiable function. Let Flow_t: V → V be the flow for time t along the reverse gradient vector field, so that:

$\frac{d}{dt}\Big|_0\, \mathrm{Flow}_t(v) = -\nabla_v L$

Note that Flow_0 is the identity on V, and, for any s, t, the composition of Flow_s and Flow_t is Flow_{s+t}. We will write v(t) for Flow_t(v), so that $\dot{v}(s) = -\nabla_{v(s)}L$. A conserved quantity is a function Q: V → R that satisfies the following equivalent conditions:

1. For any t, we have Q(v(t)) = Q(v).
2. Let $\dot{Q}(v) = \frac{d}{dt}\big|_0\, Q(v(t))$ be the derivative of Q along the flow. Then $\dot{Q} \equiv 0$.
3. The gradients of Q and L are pointwise orthogonal, that is, ⟨∇_vQ, ∇_vL⟩ = 0 for all v ∈ V.

The equivalence of (1) and (2) is immediate. To show the equivalence of the second and third statements, let v ∈ V and compute:

$\langle \nabla_v Q, \nabla_v L\rangle = dQ_v(\nabla_v L) = -dQ_v\Big(\frac{d}{dt}\Big|_0\, v(t)\Big) = -\frac{d}{dt}\Big|_0\, Q(v(t))$

where we use the definition of the flow in the second equality, and the chain rule in the third. We note that, if f: R → R is any differentiable function and Q is a conserved quantity, then f ∘ Q is also a conserved quantity. Additionally, any linear combination of conserved quantities is again a conserved quantity. Let Conserv(V, L) denote the vector space of conserved quantities for the gradient flow of L: V → R. For any v ∈ V, there is a map:

$\nabla_v : \mathrm{Conserv}(V, L) \to T_v V = \mathbb{R}^d, \qquad Q \mapsto \nabla_v Q$

taking a conserved quantity to the value of its gradient at v. By the above discussion, the map ∇_v is valued in the kernel of the differential dL_v.
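Condition 3 is the easiest to test numerically. For the two-layer linear network and the quantity Q = ∥V∥² − ∥U∥² (which, as shown later in Appendix C.9, is the conserved quantity corresponding to M = id_h), the two gradients are orthogonal at every point, not only at minima. A sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(8)
m, h, n, k = 2, 3, 4, 5
U, V = rng.normal(size=(m, h)), rng.normal(size=(h, n))
X, Y = rng.normal(size=(n, k)), rng.normal(size=(m, k))

# Q(U, V) = ||V||^2 - ||U||^2, so grad Q = (-2U, 2V);
# the gradient of L = ||Y - U V X||^2 is computed in closed form.
R = Y - U @ V @ X
gU, gV = -2 * R @ (V @ X).T, -2 * U.T @ R @ X.T

inner = np.sum(-2 * U * gU) + np.sum(2 * V * gV)
print(inner)        # ≈ 0: grad Q is orthogonal to grad L at every point
```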

C.8 CONSERVED QUANTITIES FROM A GROUP ACTION

Let G be a subgroup of the general linear group GL_d(R). Thus, there is a linear action of G on V = R^d. Suppose the function L is invariant under the action of G, that is, L(g • v) = L(v) for all v ∈ V and g ∈ G. Let g = Lie(G) be the Lie algebra of G, which is a Lie subalgebra of gl_d = R^{d×d}. The infinitesimal action of g on V is given by g × V → V, taking (M, v) to Mv.

Proposition C.5. Let L: V → R be a G-invariant function, and let M ∈ g. Then:

1. For any v ∈ V, the gradient of L and the infinitesimal action of M are orthogonal: ⟨∇_vL, Mv⟩ = 0.
2. Suppose γ: (a, b) → V is a gradient flow curve for L. Then $(\dot\gamma(t))^T M \gamma(t) = 0$ for all t ∈ (a, b).
3. Suppose M^T belongs to g. Then the function $Q_M: V \to \mathbb{R},\; v \mapsto v^T M v$ is a conserved quantity for the gradient flow of L.

Proof. For the first claim, consider the composition

$G \xrightarrow{\;\mathrm{inc}\;} \mathbb{R}^{d\times d} \xrightarrow{\;\mathrm{ev}_v\;} \mathbb{R}^d \xrightarrow{\;L\;} \mathbb{R}$

where inc: G → R^{d×d} is the inclusion (which passes through the inclusion of G into GL_d(R)) and ev_v: R^{d×d} → R^d is the evaluation map at v. The invariance of L implies that this composition, g ↦ L(gv), is equal to the constant map with value L(v). Taking Jacobians at the identity I of G, the chain rule yields the composition

$\mathfrak{g} \xrightarrow{\;d\,\mathrm{inc}_I\;} \mathbb{R}^{d\times d} \xrightarrow{\;d(\mathrm{ev}_v)\;} \mathbb{R}^d \xrightarrow{\;dL_v\;} \mathbb{R}$

which is the zero map, since the derivative of a constant map vanishes. Here g is the Lie algebra of G, identified with the tangent space of G at the identity; the tangent space of the vector space R^d at v is canonically identified with R^d. The derivative of the inclusion map is the inclusion g → gl_d = R^{d×d}, while the derivative of the evaluation map is itself, as it is a linear map. Hence, for M ∈ g, we have:

$0 = dL_v \circ d(\mathrm{ev}_v) \circ d\,\mathrm{inc}_I(M) = dL_v(Mv) = \langle \nabla_v L, Mv\rangle.$

The first claim follows. The second claim is a consequence of the first claim, together with the definition of a gradient flow curve. For the third claim, we take the derivative of the composition of Q_M with a gradient flow curve γ:

$\frac{d}{dt}(Q_M \circ \gamma) = (\dot\gamma(t))^T (M + M^T)\gamma(t) = -\langle \nabla_{\gamma(t)}L,\, M\gamma(t)\rangle - \langle \nabla_{\gamma(t)}L,\, M^T\gamma(t)\rangle$

Both terms in the last expression are identically zero by the first claim, applied to M and to M^T (both of which belong to g). Hence Q_M is constant on any gradient flow curve, and so it is a conserved quantity.
We summarize some of the results and constructions of this section. Let g_sym denote the vector space of symmetric matrices in g (this is not a Lie subalgebra in general). Observe that g ∩ g^T is the set of all M ∈ g such that M^T ∈ g. Let Infin_G(V, L) denote the vector space of infinitesimal-action conserved quantities for the gradient flow of the G-invariant function L: V → R; by definition, Infin_G(V, L) is the image of the map M ↦ Q_M defined on g ∩ g^T. The map g ∩ g^T → g_sym takes M to its symmetric part ½(M + M^T), while the map g_sym → g ∩ g^T is the natural inclusion. We note that g ∩ g^T is the Lie algebra of the group G ∩ G^T, while g_sym is in general not a Lie algebra. It is straightforward to verify the following result:

Corollary C.6. The map M ↦ Q_M establishes an isomorphism of vector spaces: $\mathfrak{g}_{\mathrm{sym}} \xrightarrow{\;\sim\;} \mathrm{Infin}_G(V, L)$.

As discussed in Section 3.2, applying a symmetry g to a minimum θ* of L yields another minimum g • θ*. Using the flattened θ ∈ R^d, it is easy to show that acting with g changes some Q_M(θ) = θ^TMθ. Let $g = \exp_{M'}(t) \approx I + tM'$, with M′ ∈ g and 0 < t ≪ 1, be an element of G close to the identity. We have

$Q_M(g\cdot\theta) = Q_M(\theta) + t\,\theta^T\big(M'^T M + M M'\big)\theta + O(t^2).$

Thus, whenever M′^TM + MM′ ≠ 0, applying g changes the value of Q_M. Therefore, Q_M can be used to parameterize the flat minima. However, for anti-symmetric M, we could not find nonzero Q explicitly.
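For anti-symmetric M the quadratic form θᵀMθ vanishes identically, but the constraint γ̇ᵀMγ = 0 of Proposition C.5 still restricts the flow; in 2D it says that gradient flow of a rotation-invariant loss moves only radially. A toy check with L(v) = (∥v∥² − 1)² (an arbitrary rotation-invariant loss, discretized by small gradient-descent steps):

```python
import numpy as np

# Gradient descent on the rotation-invariant loss L(v) = (||v||^2 - 1)^2 in 2D:
# the angle theta(t) stays constant while the radius flows to 1.
v = np.array([2.0, 0.5])
theta0 = np.arctan2(v[1], v[0])
for _ in range(200):
    v = v - 0.01 * 4 * (v @ v - 1) * v      # grad L = 4 (||v||^2 - 1) v
print(np.arctan2(v[1], v[0]) - theta0)      # ≈ 0: angle is conserved
print(np.linalg.norm(v))                    # ≈ 1: only the radius moved
```

The update is always a positive scalar multiple of v here, so the direction is preserved exactly; for a generic rotation-invariant loss the same conclusion follows from the vanishing angular velocity.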

C.8.1 ANTI-SYMMETRIC CASE

Suppose M ∈ g is anti-symmetric, so M = −M^T. Let γ : (a, b) → R^d be a gradient flow curve, written in coordinates as γ = (γ_1, ..., γ_d). Proposition C.5 implies that (γ̇(t))^T M γ(t) = 0. Hence we have:

0 = Σ_{i<j} m_ij (γ̇_i γ_j − γ_i γ̇_j) = −Σ_{i<j} m_ij γ_i² (γ_j / γ_i)′ = −Σ_{i<j} m_ij r_ij² θ̇_ij,

where r_ij = r_ij(t) is equal to (γ_i(t)² + γ_j(t)²)^{1/2} and θ_ij = θ_ij(t) is the angle between the i-th coordinate axis and the ray from the origin to the projection of γ(t) to the (i, j)-plane. One verifies the last equality using the definition of θ_ij as the arctangent of the quotient γ_j / γ_i. We see that (r_ij(t), θ_ij(t)) are the polar coordinates of the point (γ_i(t), γ_j(t)) ∈ R².

Case d = 2. Then M = [[0, a], [−a, 0]] for some nonzero a ∈ R, and so:

0 = γ̇^T M γ = a (γ̇_1(t) γ_2(t) − γ_1(t) γ̇_2(t)) = −a r(t)² θ̇(t),

where (r(t), θ(t)) are the polar coordinates of γ(t). Setting the final expression equal to zero, we obtain that θ(t) is constant along any flow line γ(t) that begins away from the origin.

Case d = 3. Then M = [[0, a, b], [−a, 0, c], [−b, −c, 0]] for some a, b, c ∈ R, and so:

0 = γ̇^T M γ = −(a r_12² θ̇_12 + b r_13² θ̇_13 + c r_23² θ̇_23).

C.9 EXAMPLES OF CONSERVED QUANTITIES FOR NEURAL NETWORKS

We now compute conserved quantities for gradient flow on neural network parameter spaces in the case of linear, homogeneous, and radial networks. In each case, we state results first for a general multi-layer network, and then for the running example of a two-layer network. Throughout, ⊙ denotes the Hadamard product of matrices, defined by entrywise multiplication. We also set τ(M) to be the sum of all entries of a matrix M. We note that, for matrices M and N of the same size, τ(M ⊙ N) = Tr(M^T N), which is the same as the inner product of the flattened versions of M and N. The notation for the running example of a two-layer network is as follows. We set the input and output dimensions both equal to one, the hidden dimension equal to two, and use no bias vectors.
The hidden layer activation is σ : R² → R². The parameter space is Param = R^{1×2} × R^{2×1}, with elements written as a pair of matrices:

(U, V) = ([u_1  u_2], [v_1; v_2]),   U ∈ R^{1×2},  V ∈ R^{2×1}.

The hidden symmetry group is GL_2(R), with action given by:

GL_2(R) × Param → Param,   (g, (U, V)) ↦ g • (U, V) = (U g^{-1}, g V).

The Lie algebra gl_2 of GL_2(R) consists of all two-by-two matrices.

C.9.1 CONSERVED QUANTITIES FOR LINEAR NETWORKS

Suppose a neural network with L layers has the identity activation σ_i = id_{n_i} in each layer, so that the resulting network is linear. Then it is straightforward to verify that the networks with parameters g • (W, b) and (W, b) have the same feedforward function. Consequently, the loss is invariant under the group action: its value at the original and transformed parameters is the same for any choice of training data. (As we will see below, for more sophisticated activations, one needs to restrict to a subgroup of the hidden symmetry group to achieve such invariance.) Suppose M ∈ gl_{n_hidden} is such that M_i ∈ gl_{n_i} is symmetric for each i. The conserved quantity implied by Proposition C.5 is:

Q_M(W, b) = Σ_{i=1}^{L-1} ( τ(W_i ⊙ M_i W_i) + τ(b_i ⊙ M_i b_i) − τ(W_{i+1} ⊙ W_{i+1} M_i) )
          = Σ_{i=1}^{L-1} Tr[ (W_i W_i^T + b_i b_i^T − W_{i+1}^T W_{i+1}) M_i ].

We examine these conserved quantities in the following convenient basis for the space of symmetric matrices in gl_{n_hidden}. For j = 1, ..., L − 1 and {k, ℓ} ⊆ {1, ..., n_j}, set:

E^{(j)}_{{k,ℓ}} := E^{(n_j)}_{kk} if k = ℓ;   (1/2)(E^{(n_j)}_{kℓ} + E^{(n_j)}_{ℓk}) if k ≠ ℓ,

where E^{(n_j)}_{kℓ} is the elementary n_j × n_j matrix with the entry in the k-th row and ℓ-th column equal to one, and all other entries equal to zero.
Then one computes:

Q_{E^{(j)}_{{k,ℓ}}}(W, b) = b^{(j)}_k b^{(j)}_ℓ + Σ_{t=1}^{n_{j-1}} w^{(j)}_{kt} w^{(j)}_{ℓt} − Σ_{r=1}^{n_{j+1}} w^{(j+1)}_{rk} w^{(j+1)}_{rℓ}.

In other words, we take the product of the k-th and ℓ-th entries of the bias vector b_j, plus the dot product of the k-th and ℓ-th rows of W_j, minus the dot product of the k-th and ℓ-th columns of W_{j+1}. In particular, we see that every entry of the matrix

μ_i(W, b) := W_i W_i^T + b_i b_i^T − W_{i+1}^T W_{i+1} ∈ gl_{n_i}

is a conserved quantity; that is, μ_i is a conserved quantity valued in gl_{n_i} rather than in R. Additionally, we have a moment map:

Q : Param → gl*_{n_hidden},   (W, b) ↦ ( M ↦ Σ_{i=1}^{L-1} ⟨μ_i(W, b), M_i⟩ ).

C.9.2 CONSERVED QUANTITIES FOR LINEAR NETWORKS: TWO-LAYER CASE

In the two-layer case of a linear network, the single hidden activation is the identity: σ = id_2 : R² → R². The hidden symmetry group is GL_2(R), whose Lie algebra is all of gl_2. The space of symmetric matrices in gl_2 is spanned by the matrices:

E_11 = [[1, 0], [0, 0]],   E_22 = [[0, 0], [0, 1]],   E_{(1,2)} = (1/2) [[0, 1], [1, 0]].

The corresponding conserved quantities are:

Q_{E_11}(U, V) = v_1² − u_1²,   Q_{E_22}(U, V) = v_2² − u_2²,   Q_{E_(1,2)}(U, V) = v_1 v_2 − u_1 u_2.

Thus, we obtain a three-dimensional space of conserved quantities. (Since GL_2 also contains the orthogonal group O(2), Equation 29 below holds along any gradient flow curve.)
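These three quantities are easy to monitor numerically. The sketch below is our own illustration (the target y = 1.7, the initialization, and the learning rate are arbitrary): small-step gradient descent on the running example L = (y − u·v)² approximates gradient flow, and all three conserved quantities drift only slightly:

```python
import numpy as np

y = 1.7
u = np.array([0.6, -0.4])   # U = [u1 u2]
v = np.array([0.5, 0.9])    # V = [v1; v2]

def qs(u, v):
    # (Q_E11, Q_E22, Q_E(1,2)) = (v1^2 - u1^2, v2^2 - u2^2, v1 v2 - u1 u2)
    return np.array([v[0]**2 - u[0]**2, v[1]**2 - u[1]**2, v[0]*v[1] - u[0]*u[1]])

q0 = qs(u, v)
eta = 1e-3
for _ in range(3000):
    r = y - u @ v                           # residual of L = (y - u.v)^2
    u, v = u + 2*eta*r*v, v + 2*eta*r*u     # simultaneous gradient-descent update

drift = np.abs(qs(u, v) - q0)
print(drift)   # all entries remain close to zero while the loss converges
```

Under exact gradient flow the drift would be identically zero; with discrete steps it is O(η²) per step, as quantified in Appendix E.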

C.9.3 CONSERVED QUANTITIES FOR RELU NETWORKS

The pointwise ReLU activation commutes with positive rescaling, so we consider the subgroup of the hidden symmetry group consisting of tuples of diagonal matrices with positive diagonal entries, that is:

G = { g ∈ GL_{n_hidden} | g_i = Diag(s_1, ..., s_{n_i}), s_j > 0 }.

This subgroup, also known as the positive coordinate-wise rescaling subgroup, is isomorphic to the product (R_{>0})^{Σ_{i=1}^{L-1} n_i}. Its Lie algebra is spanned by the elements E^{(j)}_{kk} defined above, for j = 1, ..., L − 1 and k = 1, ..., n_j. The conserved quantity implied by Proposition C.5 is:

Q_{E^{(j)}_{kk}}(W, b) = (b^{(j)}_k)² + Σ_{t=1}^{n_{j-1}} (w^{(j)}_{kt})² − Σ_{r=1}^{n_{j+1}} (w^{(j+1)}_{rk})².

In other words, we take the square of the k-th entry of the bias vector b_j, plus the squared norm of the k-th row of W_j, minus the squared norm of the k-th column of W_{j+1}.
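This per-neuron quantity can be checked with small-step gradient descent. The sketch below is our own illustration (the widths, data, and learning rate are arbitrary), with hand-written gradients for L = ½‖W₂ max(W₁X + b₁, 0) − Y‖²_F:

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n1, n2 = 4, 6, 3                   # input, hidden, output widths
W1 = rng.standard_normal((n1, n0)) * 0.5
b1 = rng.standard_normal(n1) * 0.5
W2 = rng.standard_normal((n2, n1)) * 0.5
X = rng.standard_normal((n0, 20))
Y = rng.standard_normal((n2, 20)) * 0.5

def Q(W1, b1, W2):
    # per hidden neuron k: b_k^2 + ||row_k(W1)||^2 - ||col_k(W2)||^2
    return b1**2 + (W1**2).sum(axis=1) - (W2**2).sum(axis=0)

q0 = Q(W1, b1, W2)
eta = 1e-4
for _ in range(4000):
    Z = W1 @ X + b1[:, None]
    H = np.maximum(Z, 0.0)             # ReLU
    R = W2 @ H - Y                     # residual of L = 0.5 ||W2 H - Y||_F^2
    gW2 = R @ H.T
    gZ = (W2.T @ R) * (Z > 0)          # backprop through ReLU
    gW1 = gZ @ X.T
    gb1 = gZ.sum(axis=1)
    W1 -= eta * gW1; b1 -= eta * gb1; W2 -= eta * gW2

drift = np.abs(Q(W1, b1, W2) - q0)
print(drift.max())   # small: the rescaling charges are nearly conserved
```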

C.9.4 CONSERVED QUANTITIES FOR RELU NETWORKS: TWO-LAYER CASE

In the two-layer case, the positive rescaling group is:

G = { [[g_1, 0], [0, g_2]] ∈ GL_2(R) | g_1, g_2 > 0 }.

The Lie algebra of G is the two-dimensional space of diagonal matrices in gl_2 (with not necessarily positive diagonal entries). In other words, g is spanned by the matrices E_11 = [[1, 0], [0, 0]] and E_22 = [[0, 0], [0, 1]]. One computes the conserved quantities corresponding to these elements as:

Q_{E_11}(U, V) = v_1² − u_1²,   Q_{E_22}(U, V) = v_2² − u_2².

Hence there is a two-dimensional space of conserved quantities coming from the infinitesimal action.

C.9.5 CONSERVED ANGULAR MOMENTUM FOR RADIAL RESCALING NETWORKS

Suppose each σ_i is a radial rescaling activation σ_i(z) = λ_i(|z|) z, where λ_i : R → R is the rescaling factor. Each such activation commutes with orthogonal transformations, so we consider the subgroup of the hidden symmetry group consisting of tuples of orthogonal matrices:

G = { g ∈ GL_{n_hidden} | g_i g_i^T = id_{n_i} for all i }.

The Lie algebra of this subgroup consists only of anti-symmetric matrices, and so there are no infinitesimal-action conserved quantities. However, given an anti-symmetric matrix M_i ∈ gl_{n_i} for each i, any gradient flow curve satisfies the following differential equation (encoding conservation of angular momentum):

Σ_{i=1}^{L-1} ( τ(Ẇ_i ⊙ M_i W_i) + τ(ḃ_i ⊙ M_i b_i) − τ(Ẇ_{i+1} ⊙ W_{i+1} M_i) ) = 0

(cf. Section C.8.1). An equivalent way to write this equation is:

Σ_{i=1}^{L-1} Tr[ (W_i Ẇ_i^T + b_i ḃ_i^T + W_{i+1}^T Ẇ_{i+1}) M_i ] = 0.

Indeed, one uses the facts that τ(A ⊙ B) = Tr(A^T B), Tr(A^T) = Tr(A), and Tr(AB) = Tr(BA), for any two matrices A, B of the appropriate size in each case. Using a basis of anti-symmetric matrices, one can show that the matrix

ν_i(W, b) := W_i Ẇ_i^T − Ẇ_i W_i^T + b_i ḃ_i^T − ḃ_i b_i^T + W_{i+1}^T Ẇ_{i+1} − Ẇ_{i+1}^T W_{i+1} ∈ gl_{n_i}

is equal to zero along any gradient flow curve: ν_i(W, b) = 0. Note that ν_i depends on taking derivatives along the flow.
In fact, ν_i is more properly formulated as a function on the tangent bundle T(Param) of Param, which is then evaluated on the gradient flow vector field. Similarly, we have a moment map T(Param) → gl*_{n_hidden}, and the gradient flow vector field is contained in the preimage of zero. We omit the details. A basis for the space of anti-symmetric matrices in gl_{n_hidden} is given by:

E^{(j)}_{k<ℓ} := E^{(n_j)}_{kℓ} − E^{(n_j)}_{ℓk},

where j = 1, ..., L − 1, and k, ℓ ∈ {1, ..., n_j} satisfy k < ℓ. The differential equation corresponding to E^{(j)}_{k<ℓ} is given by:

r²_{b_j;k,ℓ} θ̇_{b_j;k,ℓ} + Σ_{t=1}^{n_{j-1}} r²_{W_j;kt,ℓt} θ̇_{W_j;kt,ℓt} + Σ_{r=1}^{n_{j+1}} r²_{W_{j+1};rk,rℓ} θ̇_{W_{j+1};rk,rℓ} = 0,

where (r_{b_j;k,ℓ}, θ_{b_j;k,ℓ}) are the polar coordinates of the image of b_j under the projection R^{n_j} → R² which selects the k-th and ℓ-th coordinates. Similarly, for any pair of matrix entries we have a projection R^{n_j × n_{j-1}} → R², and we take the polar coordinates of the image of W_j under this projection.

C.9.6 CONSERVED ANGULAR MOMENTUM FOR RADIAL RESCALING NETWORKS:

TWO-LAYER CASE

In the two-layer radial rescaling case, suppose the dimension vector is (n, h, m), and that there are no bias vectors. For U ∈ R^{m×h}, V ∈ R^{h×n}, and M ∈ so(h), we have:

⟨θ̇, M • θ⟩ = ⟨(U̇, V̇), (−UM, MV)⟩
           = −Tr(U̇^T U M) + Tr(V̇^T M V)
           = Tr(V V̇^T M) + Tr(U^T U̇ M)
           = Tr[ (V V̇^T + U^T U̇) M ].

Hence we obtain the differential equation:

Tr[ (V V̇^T + U^T U̇) M ] = 0.

In the case where (n, h, m) = (1, 2, 1), we have the two-by-two orthogonal group:

G = O(2) = { g ∈ GL_2(R) | g^T g = id }.

The Lie algebra of G consists of the anti-symmetric matrices in gl_2, and contains no non-zero symmetric matrices. Hence, we do not obtain any conserved quantities from the infinitesimal action in this case. However, using the element [[0, 1], [−1, 0]] ∈ g, we obtain that the following differential equation holds along any gradient flow curve:

r_U² θ̇_U + r_V² θ̇_V = 0    (29)

where (r_U, θ_U) are the polar coordinates of (u_1, u_2) ∈ R², and similarly for (r_V, θ_V). Note that the left-hand side of Equation 29 is a function of t; so if γ : (a, b) → V is a gradient flow curve, then a more precise version of the equation is

(r_U ∘ γ)(t)² (θ_U ∘ γ)′(t) + (r_V ∘ γ)(t)² (θ_V ∘ γ)′(t) = 0 for all t.
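The angular-momentum balance r_U² θ̇_U + r_V² θ̇_V = 0 can be checked to first order with a single small gradient-descent step. The sketch below is our own illustration: the radial rescaling factor λ(r) = 1/(1 + r), the data (x, y), and the initialization are all arbitrary choices, and gradients are taken by central finite differences:

```python
import numpy as np

def sigma(z):
    # radial rescaling activation: sigma(z) = z / (1 + |z|)
    return z / (1.0 + np.linalg.norm(z))

x, y = 1.3, 0.7
p = np.array([0.8, -0.3, 0.5, 0.6])   # p = (u1, u2, v1, v2), dims (n,h,m) = (1,2,1)

def loss(p):
    u, v = p[:2], p[2:]
    return (y - u @ sigma(v * x))**2   # V x = v * x for scalar input x

def grad(p):
    # central-difference gradient of the loss
    g = np.zeros_like(p); h = 1e-6
    for i in range(len(p)):
        e = np.zeros(len(p)); e[i] = h
        g[i] = (loss(p + e) - loss(p - e)) / (2 * h)
    return g

eta = 1e-4
q = p - eta * grad(p)                  # one small gradient-descent step

def polar(w):
    return np.hypot(w[0], w[1]), np.arctan2(w[1], w[0])

rU, thU = polar(p[:2]); rV, thV = polar(p[2:])
_, thU2 = polar(q[:2]); _, thV2 = polar(q[2:])
resid = rU**2 * (thU2 - thU) + rV**2 * (thV2 - thV)
print(resid)   # O(eta^2): the first-order terms cancel
```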

C.10 JACOBIANS: SPECIAL CASES

We conclude this appendix with a side remark on special cases of the Jacobian formalism.

Manifolds. Suppose M and N are smooth manifolds, and suppose F : M → N is a smooth map. The differential of F at m ∈ M is a linear map between the tangent spaces:

dF_m : T_m M → T_{F(m)} N.

The map dF_m is computed in local coordinate charts as the Jacobian matrix of partial derivatives. If G : N → L is another smooth map, then the chain rule becomes d(G ∘ F)_m = dG_{F(m)} ∘ dF_m, for any m ∈ M.

Matrix case. Suppose L : R^{m×n} → R is a differentiable function. In this case, we regard the Jacobian at W ∈ R^{m×n} as the n × m matrix

dL_W = ( ∂L/∂w_ij |_W )_{ji} ∈ R^{n×m},

whose (j, i) entry is the partial derivative of L with respect to the matrix coordinate w_ij, evaluated at W. If F : R → R^{m×n} is a differentiable function, we regard its Jacobian at s ∈ R as the m × n matrix

dF_s = ( dF_ij/dt |_s )_{ij} ∈ R^{m×n},

where F_ij : R → R are the coordinates of F. Then the chain rule becomes:

d/dt |_s (L ∘ F) = Σ_{i=1}^m Σ_{j=1}^n (∂L/∂w_ij)|_{F(s)} (dF_ij/dt)|_s = Tr(dL_{F(s)} dF_s).

In other words, the derivative of the composition L ∘ F at s ∈ R is the trace of the product of the matrices dL_{F(s)} ∈ R^{n×m} and dF_s ∈ R^{m×n}.
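The trace form of the chain rule can be sanity-checked numerically. In the sketch below (our own illustration), L is an arbitrary cubic polynomial of the matrix entries and F is an affine path, with the transposed-Jacobian convention dL_W ∈ R^{n×m}:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
C = rng.standard_normal((m, n))
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

L = lambda W: np.sum(C * W**3)       # L(W) = sum_ij c_ij w_ij^3
dL = lambda W: (3 * C * W**2).T      # Jacobian of L as an n x m matrix
F = lambda t: A + t * B              # path in R^{m x n}; dF_t = B for all t

t0, h = 0.7, 1e-6
lhs = (L(F(t0 + h)) - L(F(t0 - h))) / (2 * h)  # d/dt L(F(t)) at t0, numerically
rhs = np.trace(dL(F(t0)) @ B)                  # Tr(dL_{F(t0)} dF_{t0})
print(lhs, rhs)                                # agree up to finite-difference error
```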

D NEURAL NETWORKS: NON-LINEAR GROUP ACTIONS

In this section, we consider a non-linear action of the hidden symmetry group on the parameter space. This action has the advantage that it exists for a wider variety of activation functions (such as the usual sigmoid, which has no linear equivariance properties), and that it is defined for the full general linear group. However, in contrast to the linear action, the non-linear action is data-dependent: the transformation of the weights and biases depends on the input data.

D.1 ROTATIONS

We first define certain orthogonal matrices.

Definition D.1. For any tuple of real numbers β = (β_1, ..., β_n), define an (n + 1) × (n + 1) matrix R(β) as follows:

(R(β))_ij = cos(β_{j-1}) ( Π_{k=j}^{i-1} sin(β_k) ) cos(β_i)   if j ≤ i,
          = −sin(β_i)                                          if j = i + 1,
          = 0                                                  if j > i + 1,

where, by convention, we set β_0 = β_{n+1} = 0. For example, when n = 1, 2, we have:

R(β) = [[cos β, −sin β], [sin β, cos β]],

R(β_1, β_2) = [[cos β_1,          −sin β_1,          0       ],
               [sin β_1 cos β_2,  cos β_1 cos β_2,  −sin β_2 ],
               [sin β_1 sin β_2,  cos β_1 sin β_2,   cos β_2 ]].

Lemma D.2. For any tuple of real numbers β = (β_1, ..., β_n), we have:
1. Σ_{i=1}^{n} cos²(β_i) Π_{k=1}^{i-1} sin²(β_k) + Π_{k=1}^{n} sin²(β_k) = 1.
2. The matrix R(β) is orthogonal.

Sketch of proof. The first identity follows from a straightforward induction argument, while the proof of the second claim amounts to a computation that invokes the identity of the first claim.

Proposition D.3. There is a continuous map R : R^h \ {0} → GL_h, written z ↦ R_z, such that:
1. For any z ∈ R^h \ {0}, the first column of R_z is z. Hence R_z e_1 = z, where e_1 = (1, 0, ..., 0) is the first standard basis vector.
2. The operator norm of R_z is ‖R_z‖ = |z|.
3. If |z| = 1, then R_z is an orthogonal matrix.

Proof. Let z ∈ R^h \ {0}, and let (r, α_1, ..., α_{h-1}) be the (reverse) h-spherical coordinates of z. Hence, r = |z| is the norm of z, and the i-th coordinate of z is z_i = r ( Π_{k=1}^{i-1} sin(α_k) ) cos(α_i), where α_h = 0 by convention. Now set R_z = |z| R(α_1, ..., α_{h-1}). Using Lemma D.2, one concludes that R_z is invertible with inverse (1/|z|²) R_z^T, that R_z has operator norm |z|, and that R_z is orthogonal if |z| = 1. It is also clear that the first column of R_z is equal to z.

We note that the matrix in Definition D.1 has a form similar, but not identical, to the Jacobian matrix of the transformation to n-spherical coordinates. Euler angles provide another way to construct a map R^h \ {0} → GL_h with the same properties as in Proposition D.3.
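Definition D.1 can be checked directly in code. The sketch below (our own implementation) builds R(β) from the piecewise formula, using the conventions β_0 = β_{n+1} = 0, and verifies that R(β) is orthogonal and that its first column carries the spherical-coordinate pattern:

```python
import numpy as np

def R(beta):
    beta = np.asarray(beta, dtype=float)
    n = len(beta)
    b = np.concatenate(([0.0], beta, [0.0]))   # b[0] = beta_0, b[n+1] = beta_{n+1}
    A = np.zeros((n + 1, n + 1))
    for i in range(1, n + 2):          # 1-based row index
        for j in range(1, n + 2):      # 1-based column index
            if j <= i:
                # cos(beta_{j-1}) * prod_{k=j}^{i-1} sin(beta_k) * cos(beta_i)
                A[i-1, j-1] = np.cos(b[j-1]) * np.prod(np.sin(b[j:i])) * np.cos(b[i])
            elif j == i + 1:
                A[i-1, j-1] = -np.sin(b[i])
    return A

beta = [0.4, -1.1, 0.7]
M = R(beta)
print(np.allclose(M @ M.T, np.eye(4)))   # orthogonality (Lemma D.2)
# first column: z_i = (prod_{k<i} sin beta_k) * cos(beta_i), with beta_4 = 0
col = [np.prod(np.sin(beta[:i])) * np.cos((beta + [0.0])[i]) for i in range(4)]
print(np.allclose(M[:, 0], col))
```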

D.2 NON-LINEAR ACTION: TWO-LAYER CASE

Consider a two-layer network with dimension vector (n, h, m), no bias vectors, and no output activation. The parameter space is Param = R^{m×h} × R^{h×n}. Define the non-degenerate locus as:

(Param × R^n)• = { (U, V, x) ∈ R^{m×h} × R^{h×n} × R^n | V x ≠ 0 }.

Let F : Param × R^n → R^m be the extended feedforward function, taking (U, V, x) to F_{(U,V)}(x) = U σ(V x). We now state and prove a more general version of Theorem 4.1.

Theorem D.4. Suppose σ(z) ≠ 0 for all z ∈ R^h \ {0}.
1. There is an action:

GL_h × (Param × R^n)• → (Param × R^n)•,   g • (U, V, x) = (U R_{σ(Vx)} R^{-1}_{σ(gVx)}, gV, x).

2. Suppose, in addition, that σ(0) ≠ 0, so that σ is nonzero on all of R^h. Then the same formula defines an action:

GL_h × (Param × R^n) → (Param × R^n).

In both cases, the extended feedforward function is invariant under this action, that is: F(g • (U, V, x)) = F(U, V, x).

Proof. We first verify that the action is well-defined. In the second case, σ(gVx) ≠ 0 for all (U, V, x), and hence R_{σ(gVx)} is defined and invertible for any g ∈ GL_h. For the first case, let (U, V, x) be in the non-degenerate locus. The non-degeneracy condition Vx ≠ 0 guarantees that gVx ≠ 0 for all g ∈ GL_h. The hypothesis on σ in turn implies that R_{σ(gVx)} is defined and invertible for any g ∈ GL_h. Hence the action is well-defined in both cases. To check the unit axiom, observe that, when g = id_h is the identity of GL_h, we have R_{σ(Vx)} R^{-1}_{σ(gVx)} = R_{σ(Vx)} R^{-1}_{σ(Vx)} = id_h and gV = V. It follows that id • (U, V, x) = (U, V, x). To check the multiplication axiom, let g_1, g_2 ∈ GL_h and set v = Vx. Then:

(R_{σ(g_1 g_2 v)} R^{-1}_{σ(g_2 v)}) (R_{σ(g_2 v)} R^{-1}_{σ(v)}) = R_{σ(g_1 g_2 v)} R^{-1}_{σ(v)}.

It follows that g_1 • (g_2 • (U, V, x)) = (g_1 g_2) • (U, V, x).
For the last claim, we compute:

F(g • (U, V, x)) = F(U R_{σ(Vx)} R^{-1}_{σ(gVx)}, gV, x)
                = U R_{σ(Vx)} R^{-1}_{σ(gVx)} σ(gVx)
                = U R_{σ(Vx)} e_1
                = U σ(Vx),

where the first equality follows from the definition of the action; the second from the definition of the extended feedforward function F; and the third and fourth follow from Proposition D.3 (since R_w e_1 = w implies R_w^{-1} w = e_1). From the proof, we see that a key property of the matrices R_z is that:

R_{σ(gz)} R^{-1}_{σ(z)} σ(z) = σ(gz).

This can be interpreted as a data-dependent generalization of the equivariance condition appearing in Equation (4). We emphasize that a sufficient condition for the existence of such an action is that σ(z) is nonzero for any nonzero z ∈ R^h; this is the case for the usual sigmoid, hyperbolic tangent, leaky ReLU, and many other activations. Finally, we remark on a differential-geometric interpretation of the construction of this section. One can regard σ as a section of the trivial bundle on R^h \ {0} with fiber R^h. The map (g, z) ↦ R_{σ(gz)} R^{-1}_{σ(z)} defines a GL_h-equivariant structure on this bundle such that σ is an equivariant section. Indeed, the action of GL_h on the total space (R^h \ {0}) × R^h is given by g • (z, a) = (gz, R_{σ(gz)} R^{-1}_{σ(z)} a), and the equivariance of σ is precisely the condition R_{σ(gz)} R^{-1}_{σ(z)} σ(z) = σ(gz).
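For h = 2, the construction |z| R(α) of Proposition D.3 reduces to the closed form R_z = [[z_1, −z_2], [z_2, z_1]], which makes Theorem D.4 easy to verify numerically. The sketch below is our own illustration (sigmoid activation and all dimensions are arbitrary choices); it applies a random g ∈ GL_2 via the data-dependent action and checks that the feedforward output is unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def R(z):
    # h = 2: first column is z, R_z R_z^T = |z|^2 I, so R_z^{-1} = R_z^T / |z|^2
    return np.array([[z[0], -z[1]], [z[1], z[0]]])

def R_inv(z):
    return R(z).T / (z @ z)

rng = np.random.default_rng(3)
m, h, n = 3, 2, 4
U = rng.standard_normal((m, h))
V = rng.standard_normal((h, n))
x = rng.standard_normal(n)
g = rng.standard_normal((h, h)) + 2 * np.eye(h)   # generically invertible

# data-dependent action: (U, V, x) -> (U R_{sigma(Vx)} R^{-1}_{sigma(gVx)}, gV, x)
U_new = U @ R(sigmoid(V @ x)) @ R_inv(sigmoid(g @ V @ x))
V_new = g @ V

F_old = U @ sigmoid(V @ x)
F_new = U_new @ sigmoid(V_new @ x)
print(np.allclose(F_old, F_new))   # the feedforward function is invariant
```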

D.3 NON-LINEAR ACTION: MULTI-LAYER CASE

We adopt the notation of Section C.2. In particular, consider a neural network with L layers and widths n = (n_0, n_1, ..., n_L). The parameter space is given by:

Param(n) = R^{n_L × n_{L-1}} × R^{n_{L-1} × n_{L-2}} × ... × R^{n_2 × n_1} × R^{n_1 × n_0} × R^{n_L} × R^{n_{L-1}} × ... × R^{n_1}.

So for each layer i, we have a matrix W_i ∈ R^{n_i × n_{i-1}} and a vector b_i ∈ R^{n_i}. We write θ = (W_i, b_i)_{i=1}^{L} for a choice of parameters. Fix activations σ_i : R^{n_i} → R^{n_i} for each i = 1, ..., L. Let F = F_θ : R^{n_0} → R^{n_L} be the feedforward function corresponding to parameters θ = (W, b) ∈ Param with activations σ_i. Taking the parameters into account, we form the extended feedforward function:

F : Param × R^{n_0} → R^{n_L},   F(θ, x) = F_θ(x).

One can also define the extension of the partial feedforward function F_i : Param × R^{n_0} → R^{n_i} as (θ, x) ↦ F_{θ,i}(x). Furthermore, let Z_i : Param × R^{n_0} → R^{n_i} be the pre-activation function, defined recursively as:

Z_i(θ, x) = x                                   if i = 0,
          = W_1 x + b_1                         if i = 1,
          = W_i σ_{i-1}(Z_{i-1}(θ, x)) + b_i    for i = 2, ..., L.

We have F_i = σ_i ∘ Z_i for i = 1, ..., L, and the extended feedforward function is F = σ_L ∘ Z_L. Define the non-degenerate locus as:

(Param × R^{n_0})• = { (θ, x) | Z_i(θ, x) ≠ 0 for i = 1, ..., L − 1 }.

Proposition D.5. Suppose that, for i = 1, ..., L − 1, the activation σ_i : R^{n_i} → R^{n_i} satisfies σ_i^{-1}(0) ⊆ {0}. Then there is an action of the hidden symmetry group GL_{n_hidden} on the non-degenerate locus, given by:

GL_{n_hidden} × (Param × R^{n_0})• → (Param × R^{n_0})•,
g • (θ, x) = ( ( g_i W_i R_{F_{i-1}(θ,x)} R^{-1}_{σ_{i-1}(g_{i-1} Z_{i-1}(θ,x))}, g_i b_i )_{i=1}^{L}, x ),

where by convention g_0 = id_{n_0} and g_L = id_{n_L}. Moreover, this action preserves the extended feedforward function.

Proof. The fact that the action is well-defined follows from the assumption on each σ_i and the non-degeneracy condition. The unit and multiplication axioms are shown in the same way as in the proof of Theorem D.4. For the last claim, one first verifies by induction that Z_i(g • (θ, x)) = g_i Z_i(θ, x) for i = 0, 1, ..., L.
Hence,

F(g • (θ, x)) = σ_L(Z_L(g • (θ, x))) = σ_L(g_L Z_L(θ, x)) = σ_L(Z_L(θ, x)) = F(θ, x),

using the fact that g_L is the identity. So the extended feedforward function is preserved under this action.

D.4 DISCUSSION: INCREASING THE BATCH SIZE

In this section, we discuss difficulties in adapting the construction of the previous sections to batch sizes greater than one. Fix a batch size k, so that the feature space of the hidden layer is R^{h×k}. By abuse of notation, we write σ : R^{h×k} → R^{h×k} for the map applying σ column-wise. We say that σ preserves full-rank matrices if σ(Z) is full-rank for any full-rank matrix Z ∈ R^{h×k}. As a final piece of notation, let R^{h×k}_• ⊆ R^{h×k} be the subset of full-rank matrices.

Lemma D.6. Suppose that k ≤ h, and that σ preserves full-rank matrices. Then there exists a map c : GL_h × (R^{h×k} \ {0}) → GL_h satisfying the following identities for any nonzero Z ∈ R^{h×k} and g, g_1, g_2 ∈ GL_h:

c(id_h, Z) = id_h    (31)
c(g_1, g_2 Z) c(g_2, Z) = c(g_1 g_2, Z)    (32)
c(g, Z) σ(Z) = σ(gZ)

We omit a proof of this lemma. A key tool is the fact that, for k ≤ h, any two matrices in R^{h×k}_• are related by an element of GL_h. This lemma implies that, for a multi-layer network, if σ_i preserves full-rank matrices in R^{n_i×k} for each i, then there is a non-linear group action as in Proposition D.5, where the appropriate version of the non-degenerate locus is:

(Param × R^{n_0×k})• = { (θ, X) | Z_i(θ, X) ∈ R^{n_i×k} is of full rank for i = 1, ..., L − 1 }.

However, as the following examples show, the condition that σ preserves full-rank matrices is not satisfied by common activation functions.

Example D.1.
1. For k > 1, the column-wise application of the usual sigmoid activation does not preserve full-rank matrices. For example, for k = 2, take:

Z = [[σ^{-1}(1/5), σ^{-1}(2/5)], [σ^{-1}(2/5), σ^{-1}(4/5)]] ≈ [[−1.3863, −0.4055], [−0.4055, 1.3863]].

Then det(σ(Z)) = 0 while det(Z) ≈ −2.0862.

2. For k > 1, the column-wise application of the hyperbolic tangent does not preserve full-rank matrices. To see this, set k = 2 and consider:

Z = [[tanh^{-1}(1/5), tanh^{-1}(2/5)], [tanh^{-1}(2/5), tanh^{-1}(4/5)]] ≈ [[0.2027, 0.4236], [0.4236, 1.0986]].

Then det(tanh(Z)) = 0 while det(Z) ≈ 0.0432.

3.

Let s be a real number with 0 < s < 1. The corresponding leaky ReLU activation function is given by σ(z) = s min(0, z) + max(0, z). For k > 1, the column-wise application of leaky ReLU does not preserve full-rank matrices. Indeed, for k = 2, set:

Z = [[s, −1], [−1, s]].

Then det(σ(Z)) = det([[s, −s], [−s, s]]) = 0, while det(Z) = s² − 1 ≠ 0.

Finally, in the case k > h, the action of GL_h on full-rank h × k matrices is not transitive. Hence, there will generally be no matrix in GL_h taking σ(Z) to σ(gZ), even if both are full-rank.
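The determinant computations in Example D.1 are easy to reproduce. The sketch below (our own code; the slope s = 0.5 is an arbitrary choice in (0, 1)) checks the sigmoid and leaky ReLU cases:

```python
import numpy as np

# sigmoid case: sigma(Z) = [[0.2, 0.4], [0.4, 0.8]] is rank one
logit = lambda p: np.log(p / (1 - p))        # inverse sigmoid
sigmoid = lambda z: 1 / (1 + np.exp(-z))
Z = np.array([[logit(0.2), logit(0.4)],
              [logit(0.4), logit(0.8)]])
print(np.linalg.det(Z), np.linalg.det(sigmoid(Z)))   # ~ -2.0862 and ~ 0

# leaky ReLU case with slope s: sigma(Z2) = [[s, -s], [-s, s]] is rank one
s = 0.5
leaky = lambda z: np.where(z < 0, s * z, z)
Z2 = np.array([[s, -1.0], [-1.0, s]])
print(np.linalg.det(Z2), np.linalg.det(leaky(Z2)))   # s^2 - 1 = -0.75 and 0
```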

D.5 LIPSCHITZ BOUNDS

Proof of Proposition 4.2. Let (U, V, x) be in the non-degenerate locus, let g ∈ GL_h, and let x_1, x_2 ∈ R^n. Using the Lipschitz constant η of σ and the definition of the operator norm, we compute:

|F^{(g,x)}_{(U,V)}(x_1 − x_2)| ≤ |U R_{σ(Vx)} R^{-1}_{σ(gVx)} σ(gV(x_1 − x_2))|
  ≤ η ‖U‖ ‖R_{σ(Vx)}‖ ‖R^{-1}_{σ(gVx)}‖ ‖g‖ ‖V‖ |x_1 − x_2|
  = η ‖U‖ |σ(Vx)| ‖(1/|σ(gVx)|²) R^T_{σ(gVx)}‖ ‖g‖ ‖V‖ |x_1 − x_2|
  = η ‖U‖ ( |σ(Vx)| ‖g‖ / |σ(gVx)| ) ‖V‖ |x_1 − x_2|.

The result follows.

E GRADIENT DESCENT AND DRIFTING CONSERVED QUANTITIES

While gradient flows are well approximated by gradient descent (Elkabetz & Cohen, 2021), the conserved quantities of gradient flow are no longer exactly conserved in gradient descent, due to the non-infinitesimal time steps. However, with a small learning rate, we expect the change in the conserved quantities to be small. In this section, we first prove that the change in Q per step is bounded by the square of the learning rate for two-layer linear networks, and then show empirically that the change in Q is small for nonlinear networks.

E.1 CHANGE IN Q IN GRADIENT DESCENT (LINEAR LAYERS)

Proposition E.1. Consider the two-layer linear network, where U ∈ R^{m×h} and V ∈ R^{h×n} are the only parameters, and the loss function L is a function of UV. In gradient descent with learning rate η, the change in the conserved quantity Q = Tr[U^T U − V V^T] at step t is bounded by

|Q_{t+1} − Q_t| ≤ η² |dL(t)/dt|.

Proof. Let U_t and V_t be the values of U and V at step t of gradient descent. The update rule is

U_{t+1} = U_t − η ∂L/∂U_t,   V_{t+1} = V_t − η ∂L/∂V_t.

Consider the two-layer linear reparametrization W = UV, and write ∇L = ∂L/∂W. We have

Q_t = Tr[U_t^T U_t − V_t V_t^T],   Q_{t+1} = Tr[U_{t+1}^T U_{t+1} − V_{t+1} V_{t+1}^T].

Substituting in U_{t+1} and V_{t+1}, expanding Q_{t+1}, and subtracting Q_t, we have

Q_{t+1} − Q_t = Tr[ η² (∂L/∂U_t)^T (∂L/∂U_t) − η (∂L/∂U_t)^T U_t − η U_t^T (∂L/∂U_t)
               − η² (∂L/∂V_t)(∂L/∂V_t)^T + η (∂L/∂V_t) V_t^T + η V_t (∂L/∂V_t)^T ].    (37)

Note that (∂L/∂U_t)^T U_t = (∇L V_t^T)^T U_t = V_t ∇L^T U_t = V_t (∂L/∂V_t)^T, and similarly U_t^T (∂L/∂U_t) = (∂L/∂V_t) V_t^T.
Therefore, (37) simplifies to

Q_{t+1} − Q_t = η² Tr[ (∂L/∂U_t)^T (∂L/∂U_t) − (∂L/∂V_t)(∂L/∂V_t)^T ]
             = η² ( Tr[(∂L/∂U_t)^T (∂L/∂U_t)] − Tr[(∂L/∂V_t)^T (∂L/∂V_t)] ),

and the variation of Q in each step is bounded by the convergence rate:

|Q_{t+1} − Q_t| = η² | Tr[(∂L/∂U_t)^T (∂L/∂U_t)] − Tr[(∂L/∂V_t)^T (∂L/∂V_t)] |
              ≤ η² ( Tr[(∂L/∂U_t)^T (∂L/∂U_t)] + Tr[(∂L/∂V_t)^T (∂L/∂V_t)] )
              = η² |dL/dt|,

where dL/dt = −(‖∂L/∂U‖²_F + ‖∂L/∂V‖²_F) is the instantaneous rate of change of the loss under gradient flow.
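The cancellation of the O(η) terms in this proof can be observed exactly in code. In the sketch below (our own illustration, with L = ½‖Y − UV‖²_F), a single gradient-descent step changes Q = Tr[UᵀU − VVᵀ] by exactly η²(‖∂L/∂U‖²_F − ‖∂L/∂V‖²_F), up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(4)
m, h, n = 3, 5, 4
U = rng.standard_normal((m, h))
V = rng.standard_normal((h, n))
Y = rng.standard_normal((m, n))

Q = lambda U, V: np.trace(U.T @ U - V @ V.T)

R = U @ V - Y          # gradient of L = 0.5 ||Y - UV||_F^2 w.r.t. W = UV
gU = R @ V.T           # dL/dU
gV = U.T @ R           # dL/dV

eta = 0.01
U1, V1 = U - eta * gU, V - eta * gV   # one gradient-descent step

lhs = Q(U1, V1) - Q(U, V)
rhs = eta**2 * (np.sum(gU**2) - np.sum(gV**2))
print(lhs, rhs)   # equal: the first-order terms cancel exactly
```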

E.2 EMPIRICAL OBSERVATIONS

In gradient flow, the conserved quantity Q is constant by definition. In gradient descent, Q varies with time. To see how applicable our theoretical results are to gradient descent, we investigate the amount of variation in Q during gradient descent on two-layer neural networks. Since Q is the difference of the two terms f_1(U) = (1/2) Tr[U^T U] and f_2(V) = Σ_{a,j} ∫_{x_0}^{V_{aj}} (σ(x)/σ′(x)) dx, we normalize Q by the initial magnitudes of f_1(U) and f_2(V), i.e.,

Q̄ = ( (1/2) Tr[U^T U] − Σ_{a,j} ∫_{x_0}^{V_{aj}} (σ(x)/σ′(x)) dx ) / ( (1/2) Tr[U_0^T U_0] + Σ_{a,j} ∫_{x_0}^{(V_0)_{aj}} (σ(x)/σ′(x)) dx ),

and denote the amount of change in Q̄ as ∆Q̄(t) = Q̄(t) − Q̄(0). We run gradient descent on two-layer networks with whitened input, with the objective

argmin_{U,V} { L(U, V) = ‖Y − U σ(V^T)‖²_F },

where σ is the identity function, ReLU, sigmoid, or tanh; Y ∈ R^{5×10}, U ∈ R^{5×50}, and V ∈ R^{10×50} have random Gaussian initialization with zero mean. We repeat the gradient descent with learning rates 0.1, 0.01, and 0.001. The variation ∆Q̄(t) and the loss are shown in Fig. 3. The amount of change in Q̄ is small relative to the magnitudes of f_1(U) and f_2(V), indicating that conserved quantities of gradient flow are approximately conserved in gradient descent. The error in Q̄ grows with step size: ∆Q̄(t) is largest for the largest learning rate we used, although it has the same order of magnitude as for the smaller learning rates. We also observe that Q̄ stays constant after the loss converges.

F DISTRIBUTION OF Q UNDER XAVIER INITIALIZATION

We first consider a linear two-layer neural network UVX, where U ∈ R^{m×h}, V ∈ R^{h×n}, and X ∈ R^{n×k}. We choose the following form of the conserved quantity:

Q = (1/2) Tr[U^T U − V V^T].

Xavier initialization (Glorot & Bengio, 2010) keeps the variance of each layer's output the same as the variance of its input: each element in a given layer is initialized independently, with mean 0 and variance equal to the inverse of the layer's input dimension:

U_ij ~ N(0, 1/h),   V_ij ~ N(0, 1/n).

The expected value of Q is

E[Q] = (1/2) ( Var(U_ij) × m × h − Var(V_ij) × h × n ) = (m − h)/2.    (46)

Figure 4 shows the distribution of Q for a 2-layer linear NN with different layer dimensions. For each dimension tuple (m, h, n), we constructed 1000 sets of parameters using Xavier initialization. The centers of the distributions of Q match Eq. (46). Next, we consider the nonlinear two-layer neural network U σ(V X), where σ : R → R is an element-wise activation function. For simplicity, we assume whitened input (X = I). We choose the following form of the conserved quantity:

Q = (1/2) Tr[U^T U] − Σ_{a,j} ∫_0^{V_{aj}} (σ(x)/σ′(x)) dx.

Figure 5 shows the distribution of Q for 2-layer NNs with different nonlinearities, each with 1000 sets of parameters created under Xavier initialization. The shapes of the distributions are similar to that of linear networks. The value of Q is usually concentrated around a small range of values. Since the range of Q is unbounded, Xavier initialization limits the model to a small part of the global minimum.
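The concentration of Q in the linear case can be simulated directly. The sketch below is our own illustration (the dimensions and number of trials are arbitrary); for Q = ½Tr[UᵀU − VVᵀ] under Xavier initialization, the sample mean works out to ½(Var(U_ij)·mh − Var(V_ij)·hn) = (m − h)/2:

```python
import numpy as np

rng = np.random.default_rng(5)
m, h, n = 20, 10, 15
trials = 2000

qs = []
for _ in range(trials):
    U = rng.normal(0.0, np.sqrt(1.0 / h), size=(m, h))   # Xavier: Var = 1/fan_in
    V = rng.normal(0.0, np.sqrt(1.0 / n), size=(h, n))
    qs.append(0.5 * (np.sum(U**2) - np.sum(V**2)))       # Q = (1/2) Tr[U^T U - V V^T]
qs = np.array(qs)

# E[Q] = (Var(U_ij)*m*h - Var(V_ij)*h*n)/2 = (m - h)/2
print(qs.mean(), (m - h) / 2)
```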

G CONSERVED QUANTITY AND CONVERGENCE RATE

The values of conserved quantities are unchanged throughout gradient flow. Since the conserved quantities parameterize trajectories, initializing parameters with certain values of the conserved quantities can accelerate convergence. For the two-layer linear reparametrization, Tarmoun et al. (2021) derived an explicit relation between layer imbalance and convergence rate. We derive the relation between conserved quantities and convergence rate for two example optimization problems, and provide numerical evidence that initializing parameters with optimal conserved-quantity values accelerates convergence.

G.1 EXAMPLE 1: ELLIPSE

We first show that the convergence rate is related to the conserved quantity in a toy optimization problem. Consider the following loss function with a ∈ R:

L(w_1, w_2) = w_1² + a w_2²,   ∇L = (2w_1, 2a w_2).    (48)

Assuming gradient flow,

dw_1/dt = −∇_{w_1} L = −2w_1,   dw_2/dt = −∇_{w_2} L = −2a w_2.

Then w_1, w_2 are governed by the following solutions:

w_1(t) = w_{10} e^{−2t},   w_2(t) = w_{20} e^{−2at},    (50)

where w_{10}, w_{20} are the initial values of w_1 and w_2. We can find conserved quantities by using an ansatz Q = f(w_1^i w_2^k) and solving ∇Q · ∇L = 0 for i, k. Below we use the following form of the conserved quantity:

Q = w_1^{2a} / w_2² = w_{10}^{2a} / w_{20}².    (51)

To show the effect of Q on the convergence rate, we fix L(0) and derive how Q affects L(t). Let L(0) = w_{10}² + a w_{20}² = L_0, and let w_{20} remain an independent variable, so that w_{10}² = L_0 − a w_{20}². Substituting in w_{10}², the loss at time t is

L(t) = w_1(t)² + a w_2(t)² = (L_0 − a w_{20}²) e^{−4t} + a w_{20}² e^{−4at},    (52)

and Q becomes

Q = w_{10}^{2a} / w_{20}² = (L_0 − a w_{20}²)^a / w_{20}².    (53)

The derivative of L in the direction of Q is

∂_Q L(t) = (dL(t)/dw_{20}) (dQ/dw_{20})^{-1}
        = ( −2a w_{20} e^{−4t} + 2a w_{20} e^{−4at} ) ( [a (L_0 − a w_{20}²)^{a−1} (−2a w_{20}) w_{20}² − 2 w_{20} (L_0 − a w_{20}²)^a] / w_{20}^4 )^{-1}
        = 2a w_{20}^5 ( e^{−4at} − e^{−4t} ) / ( 2 w_{20} (L_0 − a w_{20}²)^{a−1} ( −a² w_{20}² − (L_0 − a w_{20}²) ) ).

In general, ∂_Q L(t) ≠ 0, meaning that the loss at time t depends on Q. Since we have fixed the initial loss, the convergence behavior L(t) − L(0) also depends on Q. Special cases where ∂_Q L(t) = 0 include a = 1 (circle), a = 0 (collapsed dimension), and certain initializations such as w_{20} = 0 (local maximum of gradient magnitude).
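Both defining properties of this conserved quantity can be verified numerically. The sketch below is our own illustration (a = 3 and the initial values are arbitrary): with Q in the form w_1^{2a}/w_2², Q is constant along the explicit solution and ∇Q · ∇L vanishes identically:

```python
import numpy as np

a = 3.0
w10, w20 = 0.8, 0.5

# explicit gradient-flow solution
t = np.linspace(0.0, 2.0, 50)
w1 = w10 * np.exp(-2 * t)
w2 = w20 * np.exp(-2 * a * t)

Q = w1**(2 * a) / w2**2          # Q = w1^{2a} / w2^2
print(np.allclose(Q, Q[0]))      # constant along the flow

# orthogonality check at an arbitrary point: grad Q . grad L = 0
w1p, w2p = 0.37, 0.91
gL = np.array([2 * w1p, 2 * a * w2p])
gQ = np.array([2 * a * w1p**(2*a - 1) / w2p**2, -2 * w1p**(2*a) / w2p**3])
print(gL @ gQ)                   # 0 up to rounding: the two 4a w1^{2a}/w2^2 terms cancel
```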

G.2 EXAMPLE 2: RADIAL ACTIVATION FUNCTIONS

In this example, we find the conserved quantities and their relation to the convergence rate for the two-layer reparametrization with radial activation functions under spectral initialization. Define the radial function g : R^{m×n} → R^{m×n} by g(W)_ij = h(|W_i|) W_ij, where |W_i| = (Σ_k W_ik²)^{1/2} is the norm of the i-th row of W, and h : R → R outputs a scalar. Consider the following objective:

argmin_{U,V} { L(U, V) = (1/2) ‖Y − U g(V^T)‖²_F }

with spectral initializations U_0 = Φ Ū_0, V_0 = Ψ V̄_0, where Φ, Ψ come from the singular value decomposition Y = Φ Σ_Y Ψ^T, and Ū_0, V̄_0 are random diagonal matrices.

Proposition G.1. Under the gradient flow U̇ = −∇_U L and V̇ = −∇_V L, the following quantity is an invariant:

Q = (1/2) Tr[Ū^T Ū] − Σ_i ∫_{x_0}^{V̄_ii} (g(x)/g′(x)) dx.

Proof. Since g is a radial function on rows and Ψ^T is an orthogonal matrix, g(V̄^T Ψ^T) = g(V̄^T) Ψ^T. With spectral initialization, the loss function can be reduced to one involving only diagonal matrices:

L = (1/2) ‖Y − U g(V^T)‖²_F = (1/2) ‖Φ Σ Ψ^T − Φ Ū g[(Ψ V̄)^T]‖²_F = (1/2) ‖Φ Σ Ψ^T − Φ Ū g(V̄^T) Ψ^T‖²_F = (1/2) ‖Φ (Σ − Ū g(V̄^T)) Ψ^T‖²_F = (1/2) ‖Σ − Ū g(V̄^T)‖²_F.

Since V̄ is a diagonal matrix, g is now an element-wise function of V̄. Let W = Ū g(V̄^T). The gradients with respect to Ū and V̄ are

∂L/∂Ū = ∇_W L g(V̄)^T,   ∂L/∂V̄ = (∇_W L^T Ū) ⊙ g′(V̄),

where g′(x) = dg(x)/dx is the derivative of the nonlinearity. Additionally, since L does not depend on Φ and Ψ,

∂L/∂Φ = ∂L/∂Ψ = 0.

Since the rows of Φ and Ψ are orthogonal,

∂L/∂U = (∂L/∂Ū) Φ^T = ∇_W L g(V̄)^T Φ^T,   ∂L/∂V = (∂L/∂V̄) Ψ^T = ( (∇_W L^T Ū) ⊙ g′(V̄) ) Ψ^T.

Φ and Ψ are not changed under gradient flow, so ∂Q/∂U = (∂Q/∂Ū) Φ^T and ∂Q/∂V = (∂Q/∂V̄) Ψ^T. Define the inner product on matrices as ⟨X, Y⟩ = Tr[X^T Y].
For Q to be a conserved quantity, we need ⟨∇L, ∇Q⟩ = 0:

⟨∇L, ∇Q⟩ = ⟨∂L/∂U, ∂Q/∂U⟩ + ⟨∂L/∂V, ∂Q/∂V⟩
        = ⟨∇_W L g(V̄)^T Φ^T, (∂Q/∂Ū) Φ^T⟩ + ⟨( (∇_W L^T Ū) ⊙ g′(V̄) ) Ψ^T, (∂Q/∂V̄) Ψ^T⟩
        = ⟨∇_W L g(V̄)^T, ∂Q/∂Ū⟩ + ⟨(∇_W L^T Ū) ⊙ g′(V̄), ∂Q/∂V̄⟩
        = Tr[ (∂_Ū Q)^T ∇_W L g(V̄)^T + Ū^T ∇_W L (∂_V̄ Q ⊙ g′(V̄)) ] = 0.    (62)

Following the same procedure as for element-wise activations, to have a Q which satisfies (62) it is sufficient to have

∂Q/∂Ū_ia = f(Ū, V̄) Ū_ia,   (∂Q/∂V̄_aj) g′(V̄)_aj = −f(Ū, V̄) g(V̄)_aj,   f(Ū, V̄) ∈ R.    (63)

For simplicity, let f(Ū, V̄) = 1. Then (63) is satisfied by

Q = (1/2) Tr[Ū^T Ū] − Σ_i ∫_{x_0}^{V̄_ii} (g(x)/g′(x)) dx.

Tarmoun et al. (2021) show that the conserved quantity Q appears as a term in the convergence rate of the matrix factorization gradient flow. We observe a similar relationship between Q and the convergence rate when the loss function is augmented with a radial activation function, as shown in the following proposition.

Proposition G.2. Consider the objective function and spectral initialization defined in Proposition G.1. Let h(|W_i|) = |W_i|^{−2}, and X = U g(V^T) = Φ Σ_X Ψ^T. Then the eigencomponents of X approach the corresponding eigencomponents of Y at the rate

σ̇^X_i = (1/λ_i) (σ^Y_i − σ^X_i) (σ^X_i² + 1)²,

where σ^X_i = diag(Σ_X)_i, σ^Y_i = diag(Σ_Y)_i, and λ_i = Ū_ii² + V̄_ii² are conserved quantities.

Proof. Similar to Tarmoun et al. (2021), the components can be decoupled, and we have a set of differential equations on scalars:

u̇_i = [σ^Y_i − u_i g(v_i)] g(v_i),   v̇_i = [σ^Y_i − u_i g(v_i)] u_i (dg(v_i)/dv_i).

We also have

ġ(v_i) = (dg/dv_i)(dv_i/dt) = [σ^Y_i − u_i g(v_i)] u_i (dg(v_i)/dv_i)².    (67)

Let σ^X_i = u_i g(v_i). Then

σ̇^X_i = u̇_i g(v_i) + u_i ġ(v_i) = (σ^Y_i − u_i g(v_i)) [ g(v_i)² + u_i² (dg(v_i)/dv_i)² ].    (68)

Since V̄ is a diagonal matrix, g is now an element-wise function on V̄; specifically, g(v_i) = 1/v_i.
According to Proposition G.1, the following quantity is invariant:
$$\frac{1}{2}u_i^2 - \int^{v_i} \frac{g(x)}{g'(x)}\,dx = \frac{1}{2}u_i^2 - \int^{v_i} \frac{x^{-1}}{-x^{-2}}\,dx = \frac{1}{2}u_i^2 + \frac{1}{2}v_i^2.$$
Since any function of an invariant is also invariant, we use the following form:
$$Q = \bar U^T \bar U + \bar V^T \bar V, \qquad \lambda_i = Q_{ii} = u_i^2 + v_i^2.$$
Using the $g$ that we defined, $\sigma_{X_i} = u_i\, g(v_i) = u_i v_i^{-1}$. To relate $\sigma_X$ and $Q$, we first write $u_i$ and $v_i$ as functions of $\sigma_{X_i}$ and $\lambda_i$ using the two relations above:
$$u_i^2 = \frac{\lambda_i\, \sigma_{X_i}^2}{\sigma_{X_i}^2 + 1}, \qquad v_i^2 = \frac{\lambda_i}{\sigma_{X_i}^2 + 1}.$$
Then, substituting $u_i$, $v_i$, $g(v_i)$, and $dg(v_i)/dv_i$ into the expression for $\dot\sigma_{X_i}$ above, we have
$$\dot\sigma_{X_i} = \big[\sigma_{Y_i} - u_i\, g(v_i)\big]\left[g(v_i)^2 + u_i^2\left(\frac{dg(v_i)}{dv_i}\right)^2\right] = \big[\sigma_{Y_i} - u_i\, g(v_i)\big]\left[\frac{1}{v_i^2} + \frac{u_i^2}{v_i^4}\right]$$
$$= \big(\sigma_{Y_i} - \sigma_{X_i}\big)\left[\frac{\sigma_{X_i}^2 + 1}{\lambda_i} + \frac{\lambda_i\, \sigma_{X_i}^2}{\sigma_{X_i}^2 + 1}\cdot\frac{(\sigma_{X_i}^2 + 1)^2}{\lambda_i^2}\right] = \big(\sigma_{Y_i} - \sigma_{X_i}\big)\,\frac{\sigma_{X_i}^4 + 2\sigma_{X_i}^2 + 1}{\lambda_i} = \frac{1}{\lambda_i}\big(\sigma_{Y_i} - \sigma_{X_i}\big)\big(\sigma_{X_i}^2 + 1\big)^2.$$

Proposition G.2 relates the rate of change $\dot\sigma_{X_i}$ to the conserved quantity $\lambda_i$. To obtain a more explicit expression of how $\lambda_i$ affects the convergence rate, we derive a bound for $|\sigma_{Y_i} - \sigma_{X_i}|$, which describes the distance between the trainable parameters and their desired values.

Proposition G.3. The difference between the singular values of $U g(V^T)$ and $Y$ is bounded by
$$|\sigma_{X_i} - \sigma_{Y_i}| \le |\sigma_{X_i}(0) - \sigma_{Y_i}|\,e^{-t/\lambda_i}.$$

Proof. Note that
$$\dot\sigma_{X_i} = \frac{1}{\lambda_i}\big(\sigma_{Y_i} - \sigma_{X_i}\big)\big(\sigma_{X_i}^2 + 1\big)^2 \ \ \ \text{with}\ \ \ \big(\sigma_{X_i}^2 + 1\big)^2 \ge 1.$$
Consider the following two differential equations with the same initialization $a(0) = b(0)$:
$$\dot a = \frac{1}{\lambda}(\sigma - a)(a^2 + 1)^2, \qquad \dot b = \frac{1}{\lambda}(\sigma - b).$$
In these equations, both $a$ and $b$ move from $a(0) = b(0)$ toward $\sigma$ monotonically. Since $|\dot a| \ge |\dot b|$ whenever $a = b$, $a$ always stays at least as close to $\sigma$ as $b$ does.
We can explicitly solve for $b$, which yields $b(t) = \sigma + (b(0) - \sigma)e^{-t/\lambda}$, so $|b - \sigma| = |b(0) - \sigma|e^{-t/\lambda}$, and hence
$$|\sigma_{X_i} - \sigma_{Y_i}| \le |\sigma_{X_i}(0) - \sigma_{Y_i}|\,e^{-t/\lambda_i}.$$
Since $\lambda_i$ is a conserved quantity, its value set at initialization remains unchanged throughout the gradient flow. Therefore, we are able to optimize the convergence rate by choosing a favorable value of $\lambda_i$ at initialization. In this example, smaller $\lambda_i$'s lead to faster convergence.
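As a sanity check, the decoupled flow above can be integrated numerically. The sketch below is our own illustration (not the paper's code): it applies forward-Euler integration to the scalar ODEs for $g(v) = 1/v$, i.e. $h(r) = r^{-2}$, with $\sigma_Y = 2$ and two initializations that share $\sigma_X(0) = u/v = 1$ but carry different conserved $\lambda = u^2 + v^2$.

```python
import numpy as np

def flow(u, v, sigma_y=2.0, t_end=0.5, dt=1e-4):
    """Euler-integrate the decoupled gradient flow for g(v) = 1/v,
    so that g'(v) = -1/v**2 (radial weight h(r) = r^-2)."""
    lam0 = u**2 + v**2                    # conserved quantity of Prop. G.2
    for _ in range(int(t_end / dt)):
        err = sigma_y - u / v             # sigma_Y - u g(v)
        du = err * (1.0 / v)              # u' = (sigma_Y - u g(v)) g(v)
        dv = err * u * (-1.0 / v**2)      # v' = (sigma_Y - u g(v)) u g'(v)
        u, v = u + dt * du, v + dt * dv
    return u / v, u**2 + v**2, lam0       # sigma_X, final lambda, initial lambda

# Same initial singular value sigma_X(0) = 1, different conserved lambda:
sx_a, lam_a, lam_a0 = flow(0.5, 0.5)      # lambda = 0.5 (small)
sx_b, lam_b, lam_b0 = flow(2.0, 2.0)      # lambda = 8.0 (large)
```

Consistent with Proposition G.3, the $\lambda = 0.5$ run is essentially converged to $\sigma_Y$ by $t = 0.5$, while the $\lambda = 8$ run is still far away; in both runs $\lambda$ drifts only by the Euler discretization error.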

G.3 EXPERIMENTS

We compare the convergence rates of two-layer networks initialized with different values of $Q$. We run gradient descent on two-layer networks with whitened input and the following objective:
$$\operatorname*{argmin}_{U,V}\ \Big\{ L(U,V) = \|Y - U\sigma(V^T)\|_F^2 \Big\}, \quad (80)$$
where $\sigma$ is the identity function, ReLU, sigmoid, or tanh. The matrices $Y \in \mathbb{R}^{5\times 10}$, $U \in \mathbb{R}^{5\times 50}$, and $V \in \mathbb{R}^{10\times 50}$ have random Gaussian initializations with zero mean. We repeat gradient descent with learning rates 0.1, 0.01, and 0.001, and report results for the learning rate $10^{-3}$, as we do not observe significant changes in the shape of the learning curves at smaller learning rates. $U$ and $V$ are initialized with different variances, which leads to different initial values of $Q$. As shown in Fig. 6, the number of steps required for the loss curves to drop to near the convergence level is correlated with $Q$ in both linear and elementwise nonlinear networks. This result provides empirical evidence that initializing parameters with favorable values of $Q$ accelerates convergence.

We then demonstrate the effect of the conserved quantity's value on the convergence rate of radial neural networks. Fig. 7 shows the training curves for the loss function defined in Proposition G.2. We initialize the parameters $U \in \mathbb{R}^{5\times 5}$ and $V \in \mathbb{R}^{10\times 5}$ with 4 different values of $Q$, with the learning rate set to $10^{-5}$. As predicted by Proposition G.3, convergence is faster when $Q = \operatorname{Tr}[U^T U + V^T V]$ is small.
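For the linear case $\sigma = \mathrm{id}$, the conservation of $Q = U^T U - V^T V$ under (discrete) gradient descent can be checked directly. The sketch below is our own minimal reproduction; the matrix dimensions follow the text, while the seed, initialization scale, learning rate, and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 10))
U = 0.1 * rng.normal(size=(5, 50))
V = 0.1 * rng.normal(size=(10, 50))

Q0 = U.T @ U - V.T @ V                 # conserved under gradient flow (linear case)
loss0 = np.sum((Y - U @ V.T) ** 2)

lr = 5e-4
for _ in range(10000):
    E = Y - U @ V.T                    # residual of L = ||Y - U V^T||_F^2
    # GD step; gradients are dL/dU = -2 E V, dL/dV = -2 E^T U
    U, V = U + 2 * lr * E @ V, V + 2 * lr * E.T @ U

loss = np.sum((Y - U @ V.T) ** 2)
drift = np.linalg.norm(U.T @ U - V.T @ V - Q0)   # O(lr) discretization drift
```

The loss drops by orders of magnitude while $Q$ moves only by the $O(\eta)$ discretization drift discussed in Appendix E.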

H CONSERVED QUANTITY AND GENERALIZATION ABILITY

Conserved quantities parameterize the minima of neural networks and are related to the eigenvalues of the Hessian at the minimum. Recent theoretical and empirical studies suggest that sharp minima do not generalize well (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017; Petzka et al., 2021). Explicitly searching for flat minima has been shown to improve generalization bounds and model performance (Chaudhari et al., 2017; Foret et al., 2020; Kim et al., 2022). We derive this relationship for the simplest two-layer network, and show empirically that conserved-quantity values affect sharpness. As with convergence rate, a systematic study of the relationship between conserved quantities and the generalization ability of the solution is an interesting future direction.

We again consider the two-layer linear network with loss $L = \frac{1}{2}\|Y - UVX\|^2$. For simplicity, we work with one-dimensional parameters $U, V \in \mathbb{R}$ and assume $X = Y = 1$ in this example. We show that at the point to which the gradient flow converges, the eigenvalues of the Hessian are related to the value of the conserved quantity. The gradient and Hessian of $L$ are
$$\nabla L = \begin{bmatrix} -(Y - UVX)VX \\ -(Y - UVX)UX \end{bmatrix}, \qquad H = \begin{bmatrix} V^2X^2 & -YX + 2UVX^2 \\ -YX + 2UVX^2 & U^2X^2 \end{bmatrix}.$$
At the minimum, $U, V$ are related by $UVX = Y$. Recall that $Q = U^2 - V^2$ is a conserved quantity. From these two equations, we can write $U, V$ as functions of $Q$. Taking the solution $U^2 = \frac{1}{2}\big(Q + \sqrt{Q^2 + 4}\big)$, $V^2 = \frac{1}{2}\big(-Q + \sqrt{Q^2 + 4}\big)$ and substituting $X = Y = 1$, we have
$$H = \begin{bmatrix} \frac{1}{2}\big(-Q + \sqrt{Q^2 + 4}\big) & 1 \\ 1 & \frac{1}{2}\big(Q + \sqrt{Q^2 + 4}\big) \end{bmatrix},$$
and the eigenvalues of $H$ are
$$\lambda_1 = 0, \qquad \lambda_2 = \sqrt{Q^2 + 4}.$$
We have shown that $Q$ determines the eigenvalues of the Hessian at the minimum. Since the eigenvalues determine the curvature, $Q$ also determines the sharpness of the minimum, which is believed to be related to the model's generalization ability. The result in this example can also be observed in Figure 1, where the minimum of the $Q = 0$ trajectory lies at the least sharp point of the loss valley.
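The closed-form eigenvalues above can be verified numerically. The snippet below is an illustration we added (with $X = Y = 1$): it builds the Hessian at the minimum parameterized by $Q$ and checks its spectrum.

```python
import numpy as np

def hessian_at_min(Q):
    """Hessian of L = 0.5*(1 - U V)^2 at the minimum U V = 1
    with conserved quantity Q = U^2 - V^2 (here X = Y = 1)."""
    s = np.sqrt(Q**2 + 4.0)
    U2 = (Q + s) / 2.0                    # U^2 at the minimum
    V2 = (s - Q) / 2.0                    # V^2 at the minimum
    U, V = np.sqrt(U2), np.sqrt(V2)
    H = np.array([[V2,            2*U*V - 1.0],   # off-diagonal: -Y X + 2 U V X^2
                  [2*U*V - 1.0,   U2         ]])
    return np.linalg.eigvalsh(H)          # ascending eigenvalues

eigs = {Q: hessian_at_min(Q) for Q in [0.0, 1.0, 3.0]}
```

In every case the smaller eigenvalue vanishes (the flat direction along the valley) and the larger one equals $\sqrt{Q^2 + 4}$, so the minimum sharpens as $|Q|$ grows.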



For clarity, we suppress the bias vectors; all results can be extended to include bias (see Appendix C). We use capital letters for matrix data and small letters for individual samples. In terms of the widths, we have $d = \sum_{i=1}^{L} n_i n_{i-1}$. That is, rather than being a map $GL_h \times \mathrm{Param} \to \mathrm{Param}$ satisfying the group action axioms, a data-dependent action is a map $GL_h \times (\mathrm{Param} \times \mathbb{R}^n) \to \mathrm{Param} \times \mathbb{R}^n$ satisfying the same axioms. Hence, $r = |z|$ is the norm, and the $i$-th coordinate of $z$ is $z_i = r\cos(\alpha_i)\prod_{k=1}^{i-1}\sin(\alpha_k)$, where $\alpha_h = 0$. In fact, $c$ defines a $GL_h$-equivariant structure on the tangent bundle of $\mathbb{R}^h$. For simplicity, we also assume that $G$ is closed under taking transposes and acts faithfully on the parameter space. These assumptions generally hold in practice; see Appendix C for a version with fewer assumptions. Note that this procedure only works if $g^T M g$ belongs to $\mathfrak{g}$, which is the case in the examples we consider. Our code is available at https://github.com/Rose-STL-Lab/Gradient-Flow-Symmetry. Hence, elements of the Lie algebra are 'velocities' at the identity of $G$. More precisely, for every Lie algebra element $\xi$, there is a path $\gamma_\xi : (-\epsilon, \epsilon) \to G$ whose value at 0 is the identity of $G$ and whose derivative (i.e., velocity) at zero is $\xi$. Modding out by the kernel, if necessary. Explicitly, fix a continuous path $\gamma_i : [0, 1] \to G$ such that $\gamma_i(0)$ is the identity in $G$ and $\gamma_i(1) = g_i$, for $i = 1, 2$. The dimension of $\big(\gamma_1(t)\mathrm{Stab}_G(U_0)\gamma_1(t)^{-1}\big) \cap \big(\gamma_2(t)\mathrm{Stab}_G(V_0)\gamma_2(t)^{-1}\big)$ is constant along this path.



Figure 2: Overview of empirical observations with more details in Appendix G, H, and I. (a) In a two-layer neural network, the convergence rate depends on the conserved quantity Q. (b) The distribution of the eigenvalues of the Hessian at the minimum is related to the value of Q. (c) The ensemble created by group actions has similar loss values when ε is small. (d) The ensemble model improves robustness against fast gradient sign method attacks.


Given an action of $G$ on Param, the infinitesimal action of $M \in \mathfrak{g}$ is a vector field $\overline{M}$:
$$\overline{M}_\theta := \left.\frac{d}{dt}\right|_{t=0}\big(\exp_M(t) \cdot \theta\big), \qquad \forall\theta \in \mathrm{Param}. \quad (20)$$
Hence, the value of the vector field $\overline{M}$ at the parameter value $\theta$ is given by the derivative at zero of the function $t \mapsto \exp_M(t) \cdot \theta$. In the case of a parameter space symmetry, the invariance of $L$ translates into the orthogonality condition in Proposition 3.4, where the inner product $\langle\,,\,\rangle : \mathrm{Param} \times \mathrm{Param} \to \mathbb{R}$ is calculated by contracting all indices, e.g. $\langle A, B\rangle = \sum_{ijk\ldots} A_{ijk\ldots} B_{ijk\ldots}$.

Proof of Proposition 3.4. The gradient is the transpose of the Jacobian (see Section C.1), so the left-hand side becomes $dL_\theta\Big(\left.\frac{d}{dt}\right|_{0}\big(\exp_M(t) \cdot \theta\big)\Big)$. We compute:
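The orthogonality condition $\langle \nabla_\theta L, \overline{M}_\theta\rangle = 0$ can be checked numerically for the two-layer linear network $L = \frac{1}{2}\|Y - UVX\|^2$ under the scaling action $U \mapsto U\exp(-tM)$, $V \mapsto \exp(tM)V$, whose infinitesimal action is $(-UM, MV)$. The dimensions and seed below are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
m, h, n, k = 4, 3, 5, 7
U = rng.normal(size=(m, h)); V = rng.normal(size=(h, n))
X = rng.normal(size=(n, k)); Y = rng.normal(size=(m, k))

E = Y - U @ V @ X                  # residual of L = 0.5*||Y - U V X||^2
grad_U = -E @ (V @ X).T            # dL/dU
grad_V = -U.T @ E @ X.T            # dL/dV

M = rng.normal(size=(h, h))        # arbitrary Lie algebra element in gl(h)
vec_U, vec_V = -U @ M, M @ V       # infinitesimal action at (U, V)

# <gradient, infinitesimal symmetry direction> should vanish identically
inner = np.sum(grad_U * vec_U) + np.sum(grad_V * vec_V)
```

The inner product vanishes up to floating-point rounding for any choice of $M$, reflecting that the gradient is orthogonal to every symmetry direction.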

1. The action defined in (22) is a symmetry of the parameter space. 2. (Infinitesimal equivariant action) The action of $\mathfrak{g} = \mathfrak{g}_1 \times \cdots \times \mathfrak{g}_L$ defined in (23) satisfies $\langle \nabla_\theta L, \overline{M} \cdot \theta\rangle = 0$ for all $\theta \in \mathrm{Param}$ and all $M \in \mathfrak{g}$.

Figure 3: Dynamics of conserved quantities in GD. The amount of change in Q is small relative to its magnitude, and Q converges when the loss converges.

m, h, n = 200, 100, 100

Figure 4: Distribution of Q for 2-layer linear NN with different layer dimensions.

Figure 5: Distribution of Q for 2-layer linear NN with different nonlinearities, with parameter dimensions m = h = n = 100.


Figure 6: Training curves of two-layer networks initialized with different Q. The value of Q affects convergence rate.

Figure 7: Training curve for the loss function defined in Proposition G.2. Smaller value of Q = Tr[U T U + V T V ] at initialization leads to faster convergence.

Figure 8: Gradient flow for L(U, V ) = 1 2 ∥Y -U V X∥ 2 , where U, V ∈ R, Y = 2, and X = 1. Trajectories corresponding to different values of Q intersect the minima at different points.


ACKNOWLEDGMENTS

This work was supported in part by U.S. Department Of Energy, Office of Science grant DE-SC0022255, U. S. Army Research Office grant W911NF-20-1-0334, and NSF grants #2134274 and #2146343. I. Ganev was supported by the NWO under the CORTEX project (NWA.1160.18.316). R. Walters was supported by the Roux Institute and the Harold Alfond Foundation and NSF grants #2107256 and #2134178.

H.2 EXPERIMENTS: TWO-LAYER NETWORKS

The goal of this section is to explore the relation between Q and the sharpness of the trained model. We measure sharpness by the magnitudes of the eigenvalues of the Hessian, which are related to the curvature at the minimum. We use the same loss function (80) as in Section G.3. The parameters are U ∈ R 10×50 and V ∈ R 5×50, each initialized with zero mean and various standard deviations that lead to different Q's. We first train the models using gradient descent. We then use the vectorized parameters of the trained model to compute the eigenvalues of the Hessian. The linear model extends the example in Section H.1 to higher-dimensional parameter spaces. 700 out of the 750 eigenvalues are around 0 (with magnitude ≤ 10 -3 ), which verifies the dimension of the minimum in Proposition C.3. After removing the small eigenvalues, we find that the center of the eigenvalue distribution correlates positively with the value of Q (Figure 9(a)). In models with nonlinear activations, Q is still related to the eigenvalue distribution, although the relation appears to be more complicated.
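A scaled-down version of this computation can be sketched as follows. Instead of the 10×50 and 5×50 factors used in the experiment, this illustration of ours uses 2×3 factors (12 parameters) so that a finite-difference Hessian is cheap. For $L = \|Y - UV^T\|_F^2$ with $Y = I_2$, the minimum $\{UV^T = Y\}$ is an 8-dimensional manifold, so 8 of the 12 Hessian eigenvalues should vanish, mirroring the 700-out-of-750 count above.

```python
import numpy as np

rng = np.random.default_rng(0)

def unpack(theta):
    return theta[:6].reshape(2, 3), theta[6:].reshape(2, 3)

def grad(theta, Y):
    U, V = unpack(theta)
    E = Y - U @ V.T                               # residual
    return np.concatenate([(-2 * E @ V).ravel(), (-2 * E.T @ U).ravel()])

# Start near a minimum of L = ||Y - U V^T||_F^2 and refine with GD
U0 = np.array([[1., 0., 0.], [0., 1., 0.]])
Y = U0 @ U0.T                                     # identity, so (U0, U0) is a minimum
theta = np.concatenate([U0.ravel(), U0.ravel()]) + 0.01 * rng.normal(size=12)
for _ in range(1000):
    theta -= 0.05 * grad(theta, Y)

# Hessian via central finite differences of the analytic gradient
eps, H = 1e-5, np.zeros((12, 12))
for j in range(12):
    e = np.zeros(12); e[j] = eps
    H[:, j] = (grad(theta + e, Y) - grad(theta - e, Y)) / (2 * eps)
eigs = np.linalg.eigvalsh((H + H.T) / 2)
n_flat = int(np.sum(np.abs(eigs) < 1e-3))         # dimension of the flat valley
```

All remaining eigenvalues are strictly positive, so the minimum is flat only along the symmetry directions.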

I ENSEMBLE MODELS

In neural networks, the optima of the loss function are connected by curves or volumes on which the loss is almost constant (Freeman & Bruna, 2017; Garipov et al., 2018; Draxler et al., 2018; Benton et al., 2021; Izmailov et al., 2018). Various algorithms have been proposed to find these low-cost curves, which provides a low-cost way to create an ensemble of models from a single trained model. Using our group actions, we propose a new way of constructing models with similar loss values. We show that even with stochasticity in the data, the loss is approximately unchanged under the group action (Appendix I). This provides an efficient alternative for building ensemble models, since the transformation only requires random elements of the symmetry group, without any searching or additional optimization.

We implement our group actions by modifying the activation function between two consecutive layers. Let $H = VX$ be the output of the previous layer. The group action on the weights $U, V$ is $(U, V) \mapsto (U\pi(g, H), gV)$, where $\pi(g, H) = \sigma(H)\sigma(gH)^\dagger$. The new activation implements the symmetry group action by wrapping the transformations around the activation function $\sigma$.

We test the group action on CIFAR-10. The model contains a convolution layer with kernel size 3, followed by max pooling, a fully connected layer, a leaky ReLU activation, and another fully connected layer. The group action is applied to the last two fully connected layers. After training a single model, we create transformed models using $g = I + \varepsilon M$, where $M \in \mathbb{R}^{32\times 32}$ is a random matrix and $\varepsilon$ controls the magnitude of the movement in parameter space. We then use the mode of the transformed models' predictions as the final output.

We compare the ensemble formed by group actions to four ensembles formed by various random transformations. Let $g = I + \varepsilon M$. The random baselines are:

• 'group': $(U, V) \mapsto (U\pi(g, H), gV)$. This is the model created by group actions.
• 'g⁻¹': $(U, V) \mapsto (Ug^{-1}, gV)$.
• 'random': $(U, V) \mapsto (Ug', gV)$, where $g' = I + \varepsilon D$ and $D$ is a random diagonal matrix.
• 'shuffle': $(U, V) \mapsto (U\pi'(g, H), gV)$, where $\pi'(g, H)$ is constructed by randomly shuffling $\pi(g, H)$.
• 'interpolated permute' or 'perm interp': $(U, V) \mapsto (U\ldots)$, where $S \in \mathbb{R}^{32\times 32}$ is a random permutation matrix.

Figure 10 shows the accuracy of the ensembles compared to single models. The ensemble formed by group actions preserves the model accuracy for small $\varepsilon$ and has a smaller accuracy drop at larger $\varepsilon$. The ensemble model also improves robustness against Fast Gradient Sign Method (FGSM) attacks (Figure 11). Under FGSM attacks of various strengths, the ensemble model created using group actions consistently performs better than the baselines with random transformations. However, the same improvement is not observed under Projected Gradient Descent (PGD) attacks.
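The data-dependent action $(U, V) \mapsto (U\pi(g, H), gV)$ with $\pi(g, H) = \sigma(H)\sigma(gH)^\dagger$ can be sketched in a few lines. The snippet below uses toy dimensions and a seed of our own choosing, with a leaky-ReLU slope of 0.1, and checks that the transformed weights leave the layer's output on the training batch approximately unchanged for small $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(1)
leaky = lambda Z: np.where(Z > 0, Z, 0.1 * Z)      # leaky ReLU

V = rng.normal(size=(4, 8))                        # first layer weights
X = rng.normal(size=(8, 20))                       # batch of 20 samples
U = rng.normal(size=(3, 4))                        # second layer weights
H = V @ X                                          # pre-activation, H = V X

eps = 1e-3
g = np.eye(4) + eps * rng.normal(size=(4, 4))      # group element near identity
pi = leaky(H) @ np.linalg.pinv(leaky(g @ H))       # pi(g, H) = sigma(H) sigma(gH)^+
U_new, V_new = U @ pi, g @ V                       # transformed weights

out_old = U @ leaky(V @ X)
out_new = U_new @ leaky(V_new @ X)
rel_err = np.linalg.norm(out_new - out_old) / np.linalg.norm(out_old)
```

The relative change in the output is of order $\varepsilon$ rather than zero, since $\sigma(gH)^\dagger\sigma(gH)$ is only a projection onto the row space of $\sigma(gH)$; this matches the paper's claim that the loss is approximately (not exactly) preserved on new samples.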

