SPHERICAL MOTION DYNAMICS: LEARNING DYNAMICS OF NEURAL NETWORKS WITH NORMALIZATION, WEIGHT DECAY, AND SGD

Abstract

In this work, we comprehensively reveal the learning dynamics of neural networks with normalization, weight decay (WD), and SGD (with momentum), which we name Spherical Motion Dynamics (SMD). Most related works study SMD by focusing on the "effective learning rate" under the "equilibrium" condition, i.e. assuming the convergence of the weight norm. However, their discussions of why this equilibrium condition can be reached in SMD are either absent or less convincing. Our work investigates SMD by directly exploring the cause of the equilibrium condition. Specifically, 1) we introduce assumptions that lead to the equilibrium condition in SMD, and prove that under these assumptions the weight norm approaches its theoretical value at a linear rate; 2) we propose the "angular update" as a substitute for the effective learning rate to measure the evolution of a neural network in SMD, and prove that the angular update also approaches its theoretical value at a linear rate; 3) we verify our assumptions and theoretical results on various computer vision tasks, including ImageNet and MSCOCO, with standard settings. Experimental results show our theoretical findings agree well with empirical observations.

1. INTRODUCTION AND BACKGROUND

Normalization techniques (e.g. Batch Normalization (Ioffe & Szegedy, 2015) or its variants) are among the most commonly adopted techniques for training deep neural networks (DNN). A typical normalization can be formulated as follows: consider a single unit in a neural network whose input is X and whose linear-layer weight is w (bias included in w); its output is

y(X; w, γ, β) = g(γ (Xw − µ(Xw))/σ(Xw) + β),   (1)

where g is a nonlinear activation function like ReLU or sigmoid, and µ, σ are the mean and standard deviation computed across a specific dimension of Xw (as in Batch Normalization (Ioffe & Szegedy, 2015), Layer Normalization (Ba et al., 2016), Group Normalization (Wu & He, 2018), etc.). γ, β are learnable parameters that remedy the limited range of the normalized feature map. Aside from normalizing the feature map, Salimans & Kingma (2016) normalize the weight by its l2 norm instead:

y(X; w, γ, β) = g(γ Xw/||w||_2 + β),   (2)

where ||·||_2 denotes the l2 norm of a vector.

Characterizing the evolution of networks during training. Though formulated in different manners, all normalization techniques mentioned above share an interesting property: the weight w affiliated with a normalized unit is scale-invariant: ∀α ∈ R+, y(X; αw; γ, β) = y(X; w; γ, β). Due to this scale-invariance, the Euclidean distance in weight space completely fails to measure the evolution of a DNN during the learning process; consequently, the learning rate η by itself cannot represent the update efficiency of a normalized DNN. To deal with this issue, van Laarhoven (2017); Hoffer et al. (2018); Zhang et al. (2019) propose the "effective learning rate" as a substitute for the learning rate, to measure the update efficiency of a normalized neural network trained with stochastic gradient descent (SGD), defined as

η_eff = η/||w||_2^2.   (3)

Joint effects of normalization and weight decay. van Laarhoven (2017) explores the joint effect of normalization and weight decay (WD) and obtains the magnitude of the weight by assuming its convergence: if w_t = w_{t+1}, the weight norm can be approximated as ||w_t||_2 = O((η/λ)^{1/4}), where λ is the WD coefficient. Combining with Eq. (3), we have η_eff = O(√(ηλ)). A more intuitive demonstration of the relationship between normalization and weight decay is presented in Chiley et al. (2019) (see Figure 1): since the gradient ∂L/∂w of a scale-invariant weight (L is the loss function of the normalized network without the WD term) is always perpendicular to w, the gradient component ∂L/∂w always tends to increase the weight norm, while the gradient component contributed by WD always tends to reduce it. Thus if the weight norm remains unchanged, i.e. "equilibrium has been reached"¹, one can obtain

(w_t − w_{t+1})/||w_t||_2 = √(2ηλ) · (∂L/∂w_t)/(E||∂L/∂w_t||_2).   (4)

Eq. (4) implies the magnitude of the update is invariant to the scale of the gradients, and the effective learning rate should be √(2ηλ). Li & Arora (2020) estimate the magnitude of the update in the SGD-with-momentum (SGDM) case; their result is presented in a limit and time-averaged manner: if both lim_{T→∞} R_T with R_T = (1/T) Σ_{t=0}^{T} ||w_t||_2 and lim_{T→∞} D_T with D_T = (1/T) Σ_{t=0}^{T} ||w_t − w_{t+1}||_2 exist, then

lim_{T→∞} D_T/R_T = √(2ηλ/(1 + α)),   (5)

where α is the momentum coefficient. Though not rigorous, one can easily speculate from Eq. (5) that the magnitude of the update in the SGDM case should be √(2ηλ/(1 + α)) under the equilibrium condition. But the proof of Eq. (5) requires stronger assumptions: not only convergence of the weight norm, but also convergence of the update norm ||w_{t+1} − w_t||_2 (both in the time-averaged sense).

Figure 1: Illustration of the optimization behavior with BN and WD. The angular update ∆_t is the angle between the updated weight w_{t+1} and its former value w_t.

As discussed above, all previous qualitative results about the "effective learning rate" (van Laarhoven, 2017; Chiley et al., 2019; Li & Arora, 2020) rely heavily on the equilibrium condition, but none of them explores why this equilibrium condition can be achieved. Only van Laarhoven (2017) briefly interprets the occurrence of equilibrium as a natural result of the convergence of optimization, i.e. when optimization is close to finished, w_t = w_{t+1}, resulting in the equilibrium condition.

However, this interpretation contains an apparent contradiction: according to Eq. (4) and (5), when the equilibrium condition is reached, the magnitude of the update is a constant determined only by the hyper-parameters, which means the optimization process has not yet converged. Li & Arora (2020) also notice the non-convergence of SGD with BN and WD, so they do not discuss the reasonableness of the assumptions adopted for Eq. (5). In short, previous results about the "effective learning rate" under the equilibrium condition provide only vague insights, and they are difficult to connect with empirical observations. In this work, we comprehensively reveal the learning dynamics of normalized neural networks trained with stochastic gradient descent without/with momentum (SGD/SGDM) and weight decay, which we name Spherical Motion Dynamics (SMD). Our investigation aims to answer the following question:

¹ "Weight norm remains unchanged" means ||w_t||_2 ≈ ||w_{t+1}||_2. Chiley et al. (2019) call this condition "equilibrium", a term we also use in the rest of this paper. Note the equilibrium condition is not mathematically rigorous; we use it only for intuitive analysis.
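The scale-invariance and gradient-perpendicularity properties of normalized weights can be checked numerically. Below is a minimal NumPy sketch (the batch size, width, and the γ, β values are arbitrary illustrative choices, not taken from the paper): it verifies that rescaling w leaves a normalized unit's output unchanged, and that the finite-difference gradient of a scale-invariant loss is perpendicular to w.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_unit(X, w, gamma=1.5, beta=0.2):
    """y(X; w, gamma, beta) = g(gamma * (Xw - mu) / sigma + beta), with g = ReLU.
    mu and sigma are computed over the batch dimension of Xw."""
    z = X @ w
    z = (z - z.mean()) / z.std()
    return np.maximum(gamma * z + beta, 0.0)

X = rng.normal(size=(32, 8))  # a batch of 32 inputs
w = rng.normal(size=8)

# Scale invariance: y(X; alpha * w) == y(X; w) for any alpha > 0.
y1, y2 = normalized_unit(X, w), normalized_unit(X, 3.7 * w)
print(np.allclose(y1, y2))  # True

# Perpendicularity: for a scale-invariant loss L, <w, dL/dw> = 0,
# checked here via central finite differences.
loss = lambda w: normalized_unit(X, w).sum()
eps = 1e-6
grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                 for e in np.eye(8)])
print(np.dot(w, grad))  # ~0 (up to finite-difference error)
```

Both checks follow directly from the fact that the normalization divides out the scale of Xw, so any rescaling of w cancels.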

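The equilibrium argument for plain SGD can be probed on a toy problem. The sketch below is an illustrative construction (not the paper's experimental setup): it runs SGD with weight decay on a synthetic scale-invariant stochastic loss whose gradient is, by construction, perpendicular to w and scales as 1/||w||_2, and measures the relative update ||w_t − w_{t+1}||_2/||w_t||_2, which should settle near √(2ηλ) regardless of the gradient scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def sph_grad(w, u):
    """Gradient of the toy scale-invariant loss L_t(w) = <w/||w||_2, u_t>:
    project u onto the subspace perpendicular to w, then divide by ||w||_2,
    so the gradient is orthogonal to w and scales as 1/||w||_2."""
    w_hat = w / np.linalg.norm(w)
    return (u - np.dot(u, w_hat) * w_hat) / np.linalg.norm(w)

eta, lam, d, steps = 0.1, 0.1, 16, 3000  # illustrative hyper-parameters
w = rng.normal(size=d)

rel_updates = []
for _ in range(steps):
    u = rng.normal(size=d)                         # fresh random loss direction
    w_next = w - eta * (sph_grad(w, u) + lam * w)  # SGD step with weight decay
    rel_updates.append(np.linalg.norm(w - w_next) / np.linalg.norm(w))
    w = w_next

measured = float(np.mean(rel_updates[-1000:]))  # average after equilibration
predicted = float(np.sqrt(2 * eta * lam))       # sqrt(2*eta*lambda) ~ 0.1414
print(measured, predicted)
```

Note that the relative update stabilizes even though the optimization never converges: the gradient direction keeps changing every step, which is exactly the non-convergence behavior discussed above.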

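The same style of toy simulation can probe the SGDM prediction of Eq. (5). A heavy-ball parameterization is assumed here, v_{t+1} = αv_t + ∂L/∂w_t + λw_t and w_{t+1} = w_t − ηv_{t+1} (an assumption for illustration; Li & Arora's exact formulation may differ), under which the average relative update should settle near √(2ηλ/(1 + α)).

```python
import numpy as np

rng = np.random.default_rng(1)

def sph_grad(w, u):
    # Perpendicular gradient of the toy scale-invariant loss <w/||w||_2, u>.
    w_hat = w / np.linalg.norm(w)
    return (u - np.dot(u, w_hat) * w_hat) / np.linalg.norm(w)

eta, lam, alpha, d, steps = 0.1, 0.1, 0.9, 16, 6000  # illustrative values
w = rng.normal(size=d)
v = np.zeros(d)

rel_updates = []
for _ in range(steps):
    u = rng.normal(size=d)
    v = alpha * v + sph_grad(w, u) + lam * w  # heavy-ball momentum buffer
    w_next = w - eta * v
    rel_updates.append(np.linalg.norm(w - w_next) / np.linalg.norm(w))
    w = w_next

measured = float(np.mean(rel_updates[-2000:]))           # post-equilibration mean
predicted = float(np.sqrt(2 * eta * lam / (1 + alpha)))  # ~0.103 for these values
print(measured, predicted)
```

With momentum the equilibrium update is smaller than the plain-SGD value √(2ηλ) by the factor √(1 + α), consistent with Eq. (5).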