A QUASISTATIC DERIVATION OF OPTIMIZATION ALGORITHMS' EXPLORATION ON THE MINIMA MANIFOLD

Abstract

A quasistatic approach is proposed to derive an optimization algorithm's effective dynamics on the manifold of minima when the iterator oscillates around the manifold. Compared with existing rigorous analyses, our derivation method is simple and intuitive, applies broadly, and produces easy-to-interpret results. As examples, we derive the manifold dynamics for SGD, SGD with momentum (SGDm), and Adam under different noise covariances, and verify through numerical experiments that the derived manifold dynamics closely track the true dynamics. We then use the minima manifold dynamics to study and compare the properties of optimization algorithms. For SGDm, we show that simultaneously scaling up the learning rate and batch size accelerates exploration without affecting generalization, which confirms a benefit of large-batch training. For Adam, we show that the speed of its manifold dynamics changes with the direction of the manifold, because Adam is not rotationally invariant. This may cause slow exploration in high-dimensional parameter spaces.

1. INTRODUCTION

The ability of stochastic optimization algorithms to explore among (global) minima is believed to be one of the essential mechanisms behind the good generalization performance of stochastically trained over-parameterized neural networks. Until recently, research on this topic focused on how the iterator jumps between the attraction basins of many isolated minima and settles down around the flattest one (Xie et al., 2020; Nguyen et al., 2019; Dai & Zhu, 2020; Mori et al., 2021). However, for over-parameterized models, the picture of isolated minima is not accurate, since global minima usually form manifolds of connected minima (Cooper, 2018). In addition to crossing barriers and jumping out of the attraction basin of one minimum, the optimizer also moves along the minima manifold and searches for better solutions (Wang et al., 2021). Hence, understanding how optimization algorithms explore along the minima manifold is crucial to understanding how stochastic optimization algorithms find generalizing solutions for over-parameterized neural networks.

Some recent works have begun to examine the exploration dynamics of stochastic gradient descent (SGD) along minima manifolds. Many of these works have identified how a change of flatness along the minima manifold adds a driving force to SGD as it oscillates around the minima. For example, Damian et al. (2021) considered SGD training a neural network with label noise, and showed that the optimizer can find the flattest minimum among all global minima. A more recent work, Li et al. (2021b), derived an effective stochastic dynamics for SGD on the manifold. The results in Li et al. (2021b) show that (when the learning rate tends to zero) the changing flatness exerts a force on the SGD iterator along the minima manifold and induces a slow dynamics on the manifold that helps SGD move to the vicinity of flatter minima.

In this work, we study the same question of flatness-driven exploration along the minima manifold. Instead of searching for a rigorous proof, we focus on simple and intuitive ways to derive the manifold dynamics. Specifically, we propose a quasistatic approach to derive the manifold dynamics for different optimization algorithms and stochastic noises. The main technique of our derivation is a time-scale decomposition of the motions perpendicular to and parallel with the minima manifold, which we call the normal component and the tangent component, respectively. We treat the normal component as infinitely faster than the tangent component, and thus it is always at equilibrium given the current tangent component.
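The flatness-driven drift along a minima manifold can be illustrated with a minimal toy experiment (our own construction, not an example from the paper): take the loss L(x, y) = ½(1 + y²)x², whose minima form the manifold {x = 0} with normal curvature h(y) = 1 + y² varying along the manifold. Noisy gradient steps keep x oscillating around the manifold, and the resulting tangent update −η·y·x² slowly drives the iterate toward the flattest point y = 0:

```python
import numpy as np

# Toy loss L(x, y) = 0.5 * (1 + y^2) * x^2 (hypothetical example, not from the paper).
# Minima manifold: the line x = 0. The normal Hessian eigenvalue h(y) = 1 + y^2
# is smallest (flattest) at y = 0.
rng = np.random.default_rng(0)
eta, sigma, steps = 0.05, 1.0, 50_000
x, y = 0.0, 2.0                      # start on the manifold, away from the flattest point

for _ in range(steps):
    h = 1.0 + y * y                  # local curvature in the normal direction
    # Noisy gradient step in the normal direction: x oscillates around 0
    # with stationary variance roughly eta * sigma^2 / (2 h).
    x -= eta * (h * x + sigma * rng.standard_normal())
    # Tangent gradient 0.5 * h'(y) * x^2 = y * x^2: a slow drift toward flatter minima.
    y -= eta * y * x * x

print(f"final y = {y:.4f} (drifted from 2.0 toward the flattest point y = 0)")
```

The normal oscillation equilibrates within a few steps, while y evolves on a much slower time scale, which is exactly the separation the quasistatic approach exploits.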

