SPHERICAL MOTION DYNAMICS: LEARNING DYNAMICS OF NEURAL NETWORKS WITH NORMALIZATION, WEIGHT DECAY, AND SGD

Abstract

In this work, we comprehensively reveal the learning dynamics of neural networks trained with normalization, weight decay (WD), and SGD (with momentum), which we name Spherical Motion Dynamics (SMD). Most related works study SMD by focusing on the "effective learning rate" under the "equilibrium" condition, i.e. assuming the convergence of the weight norm. However, their discussions on why this equilibrium condition can be reached in SMD are either absent or less convincing. Our work investigates SMD by directly exploring the cause of the equilibrium condition. Specifically, 1) we introduce assumptions that lead to the equilibrium condition in SMD, and prove that under these assumptions the weight norm approaches its theoretical value at a linear rate; 2) we propose the "angular update" as a substitute for the effective learning rate to measure the evolution of neural networks in SMD, and prove that the angular update also approaches its theoretical value at a linear rate; 3) we verify our assumptions and theoretical results on various computer vision tasks, including ImageNet and MSCOCO, with standard settings. Experimental results show our theoretical findings agree well with empirical observations.

1. INTRODUCTION AND BACKGROUND

Normalization techniques (e.g. Batch Normalization (Ioffe & Szegedy, 2015) and its variants) are among the most commonly adopted techniques for training deep neural networks (DNN). A typical normalization can be formulated as follows: consider a single unit in a neural network with input X and linear-layer weight w (bias included in w); its output is

y(X; w, γ, β) = g(γ · (Xw − μ(Xw)) / σ(Xw) + β),

where μ(·) and σ(·) denote the mean and standard deviation of the pre-activation Xw, and g is the activation function. A weight-normalized counterpart can be written as

y(X; w, γ, β) = g(γ · Xw / ||w||_2 + β),

where || · ||_2 denotes the l2 norm of a vector.

Characterizing the evolution of networks during training. Though formulated in different manners, all normalization techniques mentioned above share an interesting property: the weight w affiliated with a normalized unit is scale-invariant, i.e. ∀α ∈ R+, y(X; αw; γ, β) = y(X; w; γ, β). Due to this scale-invariance of the weight, the Euclidean distance defined in weight space completely fails to measure the evolution of a DNN during the learning process. As a result, the original definition of the learning rate η cannot sufficiently represent the update efficiency of a normalized DNN. To deal with this issue, van Laarhoven (2017); Hoffer et al. (2018); Zhang et al. (2019) propose the "effective learning rate" as a substitute for the learning rate to measure the update efficiency of normalized
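The scale-invariance property above can be checked numerically. The following sketch (a minimal illustration, not the paper's implementation; the helper `normalized_unit` and all shapes are our own assumptions) builds a single standardized unit and verifies that rescaling the weight w by any α > 0 leaves the output unchanged:

```python
import numpy as np

def normalized_unit(X, w, gamma, beta):
    """A single linear unit followed by standardization (as in Batch
    Normalization): the pre-activation Xw is shifted by its mean and
    scaled by its standard deviation before the affine (gamma, beta)
    transform and a ReLU nonlinearity g."""
    z = X @ w
    z_hat = (z - z.mean()) / z.std()
    return np.maximum(gamma * z_hat + beta, 0.0)  # g = ReLU

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # a toy mini-batch of 8 inputs
w = rng.normal(size=4)

y1 = normalized_unit(X, w, gamma=1.5, beta=0.1)
y2 = normalized_unit(X, 10.0 * w, gamma=1.5, beta=0.1)  # alpha = 10
assert np.allclose(y1, y2)  # y(X; alpha*w; gamma, beta) == y(X; w; gamma, beta)
```

Since scaling w by α scales both the pre-activation and its standard deviation by α, the normalization cancels the factor exactly, which is why only the direction of w (and hence the angular update studied in this work) matters for the function the network computes.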

