NEURAL MECHANICS: SYMMETRY AND BROKEN CONSERVATION LAWS IN DEEP LEARNING DYNAMICS

Abstract

Understanding the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space proceeds in discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such symmetry imposes stringent geometric constraints on gradients and Hessians, leading to an associated conservation law in the continuous-time limit of stochastic gradient descent (SGD), akin to Noether's theorem in physics. We further show that the finite learning rates used in practice can actually break these symmetry-induced conservation laws. We apply tools from finite difference methods to derive modified gradient flow, a differential equation that better approximates the numerical trajectory taken by SGD at finite learning rates. We combine modified gradient flow with our framework of symmetries to derive exact integral expressions for the dynamics of certain parameter combinations. We empirically validate our analytic expressions for learning dynamics on VGG-16 trained on Tiny ImageNet. Overall, by exploiting symmetry, our work demonstrates that we can analytically describe the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state-of-the-art architectures trained on any dataset.

1. INTRODUCTION

Just like the fundamental laws of classical and quantum mechanics taught us how to control and optimize the physical world for engineering purposes, a better understanding of the laws governing neural network learning dynamics can have a profound impact on the optimization of artificial neural networks. This raises a foundational question: what, if anything, can we quantitatively understand about the learning dynamics of large-scale, non-linear neural network models driven by real-world datasets and optimized via stochastic gradient descent with a finite batch size, learning rate, and with or without momentum? In order to make headway on this extremely difficult question, existing works have made major simplifying assumptions on the network, such as restricting to identity activation functions Saxe et al. (2013), infinite width layers Jacot et al. (2018), or single hidden layers Saad & Solla (1995). Many of these works have also ignored the complexity introduced by stochasticity and discretization by only focusing on the learning dynamics under gradient flow. In the present work, we make the first step in an orthogonal direction. Rather than introducing unrealistic assumptions on the model or learning dynamics, we uncover restricted, but meaningful, combinations of parameters with simplified dynamics that can be solved exactly without introducing a major assumption (see Fig. 1). To find the parameter combinations, we use the lens of symmetry to show that if the training loss doesn't change under some transformation of the parameters, then the gradient and Hessian for those parameters have associated geometric constraints. We systematically apply this approach to modern neural networks to derive exact integral expressions and verify our predictions empirically on large-scale models and datasets. We believe our work is the first step towards a foundational understanding of deep learning dynamics. Our main contributions are:
1. We leverage continuous differentiable symmetries in the loss to unify and generalize geometric constraints on neural network gradients and Hessians (section 3).
2. We prove that each of these differentiable symmetries has an associated conservation law under the learning dynamics of gradient flow (section 4).
3. We construct a more realistic continuous model for stochastic gradient descent by modeling weight decay, momentum, stochastic batches, and finite learning rates (section 5).
4. We show that under this more realistic model the conservation laws of gradient flow are broken, yielding simple ODEs governing the dynamics of the previously conserved parameter combinations (section 6).
5. We solve these ODEs to derive exact learning dynamics for the parameter combinations, which we validate empirically on VGG-16 trained on Tiny ImageNet with and without batch normalization (section 6).
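To make the flavor of these contributions concrete, consider a toy loss L(w1, w2) = (w1·w2 − 1)²/2, which is invariant under the rescaling w1 → a·w1, w2 → w2/a. Differentiating this invariance forces w1·∂L/∂w1 = w2·∂L/∂w2, so gradient flow conserves Q = w1² − w2², while gradient descent at a finite learning rate breaks the conservation. The following sketch is our own hypothetical illustration (the toy loss, initial values, and step counts are not from the paper's experiments):

```python
# Toy loss with a rescaling symmetry: L(w1, w2) = 0.5 * (w1 * w2 - 1)**2
# is invariant under w1 -> a * w1, w2 -> w2 / a. The invariance implies
# w1 * dL/dw1 = w2 * dL/dw2, so under gradient flow
# d/dt (w1**2 - w2**2) = -2 * (w1 * dL/dw1 - w2 * dL/dw2) = 0,
# i.e. Q = w1**2 - w2**2 is conserved.

def grads(w1, w2):
    r = w1 * w2 - 1.0
    return r * w2, r * w1  # dL/dw1, dL/dw2

def run_gd(w1, w2, lr, steps):
    for _ in range(steps):
        g1, g2 = grads(w1, w2)
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2

w1_0, w2_0 = 2.0, 0.25
Q0 = w1_0**2 - w2_0**2

# Tiny learning rate approximates gradient flow: Q barely moves.
w1, w2 = run_gd(w1_0, w2_0, lr=1e-4, steps=20000)
Q_flow = w1**2 - w2**2

# Finite learning rate: expanding one update shows Q changes by exactly
# lr**2 * (g1**2 - g2**2) per step (the first-order term cancels by
# symmetry), so the conservation law is broken and Q drifts.
w1, w2 = run_gd(w1_0, w2_0, lr=0.2, steps=20000)
Q_finite = w1**2 - w2**2

print(abs(Q_flow - Q0), abs(Q_finite - Q0))
```

The drift at finite learning rate is second order in the step size, which is exactly the regime that modified gradient flow is designed to capture.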

2. RELATED WORK

Geometry of the loss. A wide range of literature has discussed constraints on gradients originating from specific architectural building blocks of networks. For the first part of our work, we simplify, unify, and generalize this literature through the lens of symmetry. The earliest works recognizing the importance of invariances in neural networks come from the loss landscape literature Baldi & Hornik (1989) and the characterization of critical points in the presence of such invariances.



Figure 1: Neuron-level dynamics are simpler than parameter dynamics. We plot the per-parameter dynamics (left) and per-channel squared Euclidean norm dynamics (right) for the convolutional layers of a VGG-16 model (with batch normalization) trained on Tiny ImageNet with SGD with learning rate η = 0.1, weight decay λ = 10^-4, and batch size S = 256. While the parameter dynamics are noisy and chaotic, the neuron dynamics are smooth and patterned.

The goal of this work is to construct a theoretical framework to better understand the learning dynamics of state-of-the-art neural networks trained on real-world datasets. Existing works have made progress towards this goal through major simplifying assumptions on the architecture and learning rule. Saxe et al. (2013; 2019) and Lampinen & Ganguli (2018) considered linear neural networks with specific orthogonal initializations, deriving exact solutions for the learning dynamics under gradient flow. The theoretical tractability of linear networks has further enabled analyses of the properties of loss landscapes Kawaguchi (2016), convergence Arora et al. (2018a); Du & Hu (2019), and implicit acceleration by overparameterization Arora et al. (2018b). Saad & Solla (1995) and Goldt et al. (2019) studied single-hidden-layer architectures with non-linearities in a student-teacher setup, deriving a set of complex ODEs describing the learning dynamics. Such shallow neural networks have also catalyzed recent major advances in understanding convergence properties of neural networks Du et al. (2018b); Mei et al. (2018). Jacot et al. (2018) considered infinitely wide neural networks with non-linearities, demonstrating that the network's prediction becomes linear in its parameters. This setting allows for an insightful mathematical formulation of the network's learning dynamics as a form of kernel regression where the kernel is defined by the initialization (though see also Fort et al. (2020)). Arora et al. (2019) extended these results to convolutional networks and Lee et al. (2019) demonstrated how this understanding also allows for predictions of parameter dynamics.
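The per-channel statistic plotted in Fig. 1 is straightforward to compute: for a convolutional layer with a weight tensor of shape (out_channels, in_channels, kH, kW), sum the squares of all parameters feeding each output channel. A minimal numpy sketch follows; the random weights stand in for a real VGG-16 checkpoint, and whether the plotted statistic also folds in bias or batch-normalization parameters is our assumption, not stated in the caption:

```python
import numpy as np

# Per-channel squared Euclidean norm for a conv layer with weights of
# shape (out_channels, in_channels, kH, kW). Random weights stand in
# for a real checkpoint; the shape matches the first VGG-16 convolution.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 3, 3, 3))

# Sum the squared parameters over every axis except the output-channel
# axis, giving one scalar per channel. The right panel of Fig. 1 tracks
# these per-channel scalars over training steps.
sq_norms = (weights ** 2).sum(axis=(1, 2, 3))

print(sq_norms.shape)  # (64,)
```

Tracking these 64 scalars per layer, rather than the thousands of individual parameters, is what turns the noisy left panel of Fig. 1 into the smooth right panel.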

