A QUASISTATIC DERIVATION OF OPTIMIZATION ALGORITHMS' EXPLORATION ON THE MINIMA MANIFOLD

Abstract

A quasistatic approach is proposed to derive the effective dynamics of optimization algorithms on the manifold of minima when the iterator oscillates around the manifold. Compared with existing rigorous analyses, our derivation method is simple and intuitive, applies broadly, and produces easy-to-interpret results. As examples, we derive the manifold dynamics for SGD, SGD with momentum (SGDm), and Adam under different noise covariances, and verify through numerical experiments that the derived manifold dynamics stay close to the true dynamics. We then use the minima manifold dynamics to study and compare the properties of optimization algorithms. For SGDm, we show that scaling up the learning rate and batch size simultaneously accelerates exploration without affecting generalization, which confirms a benefit of large-batch training. For Adam, we show that the speed of its manifold dynamics changes with the direction of the manifold, because Adam is not rotationally invariant. This may cause slow exploration in high-dimensional parameter spaces.

1. INTRODUCTION

The ability of stochastic optimization algorithms to explore among (global) minima is believed to be one of the essential mechanisms behind the good generalization performance of stochastically trained over-parameterized neural networks. Until recently, research on this topic focused on how the iterator jumps between the attraction basins of many isolated minima and settles down around the flattest one (Xie et al., 2020; Nguyen et al., 2019; Dai & Zhu, 2020; Mori et al., 2021). However, for over-parameterized models, the picture of isolated minima is not accurate, since global minima usually form manifolds of connected minima (Cooper, 2018). In addition to crossing barriers and jumping out of the attraction basin of one minimum, the optimizer also moves along the minima manifold and searches for better solutions (Wang et al., 2021). Hence, understanding how optimization algorithms explore along the minima manifold is crucial to understanding how stochastic optimization algorithms find generalizing solutions for over-parameterized neural networks. Some recent works have begun to examine the exploration dynamics of Stochastic Gradient Descent (SGD) along minima manifolds. Many of these works have identified how a change of flatness along the minima manifold adds a driving force to SGD as it oscillates around the minima. Li et al. (2021b) show that (when the learning rate tends to zero) the changing flatness can exert a force on the SGD iterator along the minima manifold and induce a slow dynamics on the manifold that helps SGD move to the vicinity of flatter minima. In this work, we study the same question of flatness-driven exploration along the minima manifold. Instead of searching for a rigorous proof, we focus on simple and intuitive ways to derive the manifold dynamics. Specifically, we propose a quasistatic approach to derive the manifold dynamics for different optimization algorithms and stochastic noises.
The main technique of our derivation is a time-scale decomposition of the motions perpendicular and parallel to the minima manifold, which we call the normal component and the tangent component, respectively. We treat the normal component as infinitely faster than the tangent component, so that it is always at equilibrium given the tangent component. The effective dynamics of the tangent component, i.e. the manifold dynamics, is obtained by taking the expectation over the equilibrium distribution of the normal component. The main step in our analysis is deriving the equilibrium covariance of an SDE. Compared with the theoretical analysis in Li et al. (2021b), our derivation and results are simpler and easier to interpret, and clearly identify the roles played by each component of the optimization algorithm (noise covariance, learning rate, momentum). The following simple example demonstrates the main idea of our derivations.

A simple illustrative example: Consider the loss function $f(x,y) = h(x)y^2$, where $h(x) > 0$ is a differentiable function of $x$. The global minima of this function lie on the $x$-axis, forming a flat manifold, and $h(x)$ controls the flatness of the loss function at any minimum $(x,0)$. Let $z = [x, y]^T$. We consider an SGD approximated by the SDE (Li et al., 2017; Li et al., 2021a)
$$dz_t = -\nabla f(z_t)\,dt + \sqrt{\eta}\, D(z_t)\, dW_t, \qquad (1)$$
where $\eta$ is the learning rate, $D$ is the square root of the covariance matrix of the gradient noise, and $W_t$ is a Brownian motion. For convenience of presentation, for points $(x,y)$ close to the $x$-axis, we assume the noise covariance aligns with the Hessian of the loss function at $(x,0)$, i.e.
$$D^2(z) = \frac{\sigma^2}{2}\, Hf(x,0) = \begin{pmatrix} 0 & 0 \\ 0 & \sigma^2 h(x) \end{pmatrix},$$
where $\sigma > 0$ is a scalar. Then the SDE (1) can be written as
$$dx_t = -h'(x_t)\, y_t^2\, dt, \qquad dy_t = -2 h(x_t)\, y_t\, dt + \sigma\sqrt{\eta\, h(x_t)}\, dW_t,$$
with $W_t$ a one-dimensional Brownian motion.
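For completeness, the noise covariance above follows from computing the Hessian of $f(x,y) = h(x)y^2$ directly and evaluating it on the minima manifold $y = 0$:

```latex
Hf(x,y) =
\begin{pmatrix}
h''(x)\,y^2 & 2h'(x)\,y \\
2h'(x)\,y & 2h(x)
\end{pmatrix},
\qquad
Hf(x,0) =
\begin{pmatrix}
0 & 0 \\
0 & 2h(x)
\end{pmatrix},
\qquad
\frac{\sigma^2}{2}\, Hf(x,0) =
\begin{pmatrix}
0 & 0 \\
0 & \sigma^2 h(x)
\end{pmatrix}.
```

In particular, the noise acts only in the normal ($y$) direction, with magnitude proportional to the local flatness $h(x)$.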
When $y_t$ is close to $0$, $x_t$ moves much more slowly than $y_t$, because the $y_t^2$ in the dynamics of $x$ is much smaller than the $y_t$ in the dynamics of $y$. When this separation of speeds is large, the dynamics above can be approximated by the following quasistatic dynamics
$$dx_t = -\lim_{\tau\to\infty} \mathbb{E}_{y_\tau}\!\left[ h'(x_t)\, y_\tau^2 \right] dt, \qquad dy_\tau = -2 h(x_t)\, y_\tau\, d\tau + \sigma\sqrt{\eta\, h(x_t)}\, dW_\tau, \qquad (2)$$
which assumes $y$ is always at equilibrium given $x_t$. Solving the Ornstein-Uhlenbeck process (2) (for $dy = -a y\, d\tau + b\, dW_\tau$ the stationary variance is $b^2/(2a)$; here $a = 2h(x_t)$ and $b = \sigma\sqrt{\eta h(x_t)}$), the equilibrium distribution of $y_\tau$ is $\mathcal{N}(0, \eta\sigma^2/4)$, independent of $h$, and hence the manifold dynamics is
$$\frac{dx_t}{dt} = -\frac{\eta\sigma^2\, h'(x_t)}{4}.$$
This derivation shows that the slow effective dynamics along the manifold is a gradient flow minimizing the flatness $h(x)$. This simple quasistatic derivation reveals the flatness-driven motion of SGD along the minima manifold, and recovers the same dynamics as given by Li et al. (2021b) in this specific case. On the left panel of Figure 1, we show an SGD trajectory for $f(x,y) = (1+x^2)y^2$, illustrating the exploration along the manifold due to the oscillation in the normal space. On the right panel we verify the closeness of the manifold dynamics to the true SGD trajectories for the same objective function. "Hessian noise" and "isotropic noise" denote noises whose covariances are the Hessian matrix of $f$ (as analyzed above) and the identity matrix (covered by the analysis in Section 2), respectively.

Theoretical applications of the manifold dynamics: The minima manifold dynamics of optimization algorithms can be used as a tool to study and compare the behaviors of optimization algorithms. In Section 3, we illustrate how our derivations can be applied to study the behavior of SGD on a matrix factorization problem. Two more interesting applications are discussed in Sections 4 and 5. In Section 4, we focus on SGDm, and study the roles played by the learning rate, batch size, and momentum coefficient in its manifold dynamics.
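The quasistatic prediction can be checked numerically with a short Euler-Maruyama simulation. The sketch below, for $h(x) = 1 + x^2$, integrates the full two-dimensional SDE and compares the slow coordinate $x_t$ against the manifold dynamics, whose closed-form solution in this case is $x(t) = x_0 e^{-\eta\sigma^2 t/2}$. The parameter values (`eta`, `sigma`, `x0`, `dt`, `T`) are illustrative choices, not values taken from the paper's experiments.

```python
import numpy as np

# Euler-Maruyama simulation of the SDE
#   dx = -h'(x) y^2 dt
#   dy = -2 h(x) y dt + sigma * sqrt(eta * h(x)) dW
# for h(x) = 1 + x^2, compared against the manifold dynamics
#   dx/dt = -eta * sigma^2 * h'(x) / 4,
# which for this h integrates to x(t) = x0 * exp(-eta * sigma^2 * t / 2).
# All parameter values below are illustrative, not from the paper.

rng = np.random.default_rng(0)
eta, sigma = 0.1, 1.0
h = lambda x: 1.0 + x**2
dh = lambda x: 2.0 * x          # h'(x)

x0, dt, T = 2.0, 1e-3, 60.0
n = int(T / dt)
x, y = x0, 0.0
xs = np.empty(n)
for i in range(n):
    dW = rng.normal(0.0, np.sqrt(dt))
    x += -dh(x) * y**2 * dt
    y += -2.0 * h(x) * y * dt + sigma * np.sqrt(eta * h(x)) * dW
    xs[i] = x

t = np.arange(1, n + 1) * dt
x_manifold = x0 * np.exp(-eta * sigma**2 * t / 2.0)  # quasistatic prediction

# x drifts down the flatness gradient (toward x = 0) while y only oscillates;
# the simulated trajectory should roughly track the manifold prediction.
print(xs[-1], x_manifold[-1])
```

Note that $x_t$ never receives noise directly: all of its motion comes from the fast oscillation of $y_t$ rectified through the $y_t^2$ term, which is exactly the mechanism the quasistatic derivation captures.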
Based on this analysis, we explore approaches to reliably accelerate the manifold dynamics, which may help accelerate training. In particular, we show that scaling up the learning rate and batch size simultaneously accelerates the manifold dynamics without affecting generalization.



For example, Damian et al. (2021) considered SGD training a neural network with label noise, and showed that the optimizer can find the flattest minimum among all global minima. A more recent work, Li et al. (2021b), derived an effective stochastic dynamics for SGD on the manifold.

Figure 1: (left) The trajectory of SGD (real dynamics) with Hessian noise initialized from (5, 0). (middle) The x-coordinate of the real dynamics and the manifold dynamics for SGD with Hessian and isotropic noises.

