SPHERICAL MOTION DYNAMICS: LEARNING DYNAMICS OF NEURAL NETWORKS WITH NORMALIZATION, WEIGHT DECAY, AND SGD

Abstract

In this work, we comprehensively reveal the learning dynamics of neural networks with normalization, weight decay (WD), and SGD (with momentum), which we name Spherical Motion Dynamics (SMD). Most related works study SMD by focusing on the "effective learning rate" under the "equilibrium" condition, i.e. assuming the convergence of the weight norm. However, their discussions of why this equilibrium condition can be reached in SMD are either absent or less convincing. Our work investigates SMD by directly exploring the cause of the equilibrium condition. Specifically, 1) we introduce assumptions that lead to the equilibrium condition in SMD, and prove that under these assumptions the weight norm approaches its theoretical value in a linear-rate regime; 2) we propose the "angular update" as a substitute for the effective learning rate to measure the evolution of neural networks in SMD, and prove that the angular update also approaches its theoretical value in a linear-rate regime; 3) we verify our assumptions and theoretical results on various computer vision tasks, including ImageNet and MSCOCO, with standard settings. Experimental results show our theoretical findings agree well with empirical observations.

1. INTRODUCTION AND BACKGROUND

Normalization techniques (e.g. Batch Normalization (Ioffe & Szegedy, 2015) and its variants) are among the most commonly adopted techniques for training deep neural networks (DNN). A typical normalization can be formulated as follows: consider a single unit in a neural network whose input is $X$ and whose linear-layer weight is $w$ (bias included in $w$); its output is
$$y(X; w, \gamma, \beta) = g\Big(\gamma\,\frac{Xw - \mu(Xw)}{\sigma(Xw)} + \beta\Big),$$
where $g$ is a nonlinear activation function such as ReLU or sigmoid, and $\mu, \sigma$ are the mean and standard deviation computed across a specific dimension of $Xw$ (as in Batch Normalization (Ioffe & Szegedy, 2015), Layer Normalization (Ba et al., 2016), Group Normalization (Wu & He, 2018), etc.). $\beta, \gamma$ are learnable parameters that remedy the limited range of the normalized feature map. Aside from normalizing the feature map, Salimans & Kingma (2016) instead normalize the weight by its $l_2$ norm: $y(X; w, \gamma, \beta) = g(\gamma\,Xw/\|w\|_2 + \beta)$, where $\|\cdot\|_2$ denotes the $l_2$ norm of a vector.

Characterizing the evolution of networks during training. Though formulated in different manners, all normalization techniques mentioned above share an interesting property: the weight $w$ affiliated with a normalized unit is scale-invariant, i.e. $\forall \alpha \in \mathbb{R}^+$, $y(X; \alpha w, \gamma, \beta) = y(X; w, \gamma, \beta)$. Due to this scale-invariance, the Euclidean distance defined in weight space completely fails to measure the evolution of a DNN during the learning process. As a result, the original definition of the learning rate $\eta$ cannot sufficiently represent the update efficiency of a normalized DNN. To deal with this issue, van Laarhoven (2017); Hoffer et al. (2018); Zhang et al. (2019) propose the "effective learning rate" as a substitute for the learning rate to measure the update efficiency of a normalized neural network trained with stochastic gradient descent (SGD), defined as
$$\eta_{\mathrm{eff}} = \frac{\eta}{\|w\|_2^2}. \quad (3)$$
Joint effects of normalization and weight decay.
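To make the scale-invariance property concrete, here is a minimal numerical sketch (not the paper's code; `normalize` and `unit_output` are hypothetical helpers standing in for a batch-normalized unit) checking that scaling the weight leaves the output unchanged:

```python
import math

def normalize(z):
    # standardize a vector of pre-activations (a stand-in for BN over a batch)
    mu = sum(z) / len(z)
    var = sum((v - mu) ** 2 for v in z) / len(z)
    return [(v - mu) / math.sqrt(var) for v in z]

def unit_output(X, w, gamma=1.0, beta=0.0):
    # y = g(gamma * (Xw - mu)/sigma + beta) with g = ReLU
    z = [sum(x_i * w_i for x_i, w_i in zip(row, w)) for row in X]
    return [max(0.0, gamma * v + beta) for v in normalize(z)]

X = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [2.0, 2.0]]
w = [0.3, -0.7]
alpha = 5.0
y1 = unit_output(X, w)
y2 = unit_output(X, [alpha * wi for wi in w])   # same unit, rescaled weight
same = all(abs(a - b) < 1e-9 for a, b in zip(y1, y2))
```

Scaling `w` scales the pre-activations, their mean, and their standard deviation by the same factor, so the normalized output is identical — the Euclidean scale of `w` carries no information about the function computed.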
van Laarhoven (2017) explores the joint effect of normalization and weight decay (WD), and obtains the magnitude of the weight norm by assuming convergence of the weight: if $w_t = w_{t+1}$, the weight norm can be approximated as $\|w_t\|_2 = O(\sqrt[4]{\eta/\lambda})$, where $\lambda$ is the WD coefficient. Combining this with Eq.(3), we have $\eta_{\mathrm{eff}} = \sqrt{\eta\lambda}$. A more intuitive demonstration of the relationship between normalization and weight decay is presented in Chiley et al. (2019) (see Figure 1): since the gradient $\partial L/\partial w$ of a scale-invariant weight ($L$ is the loss function of the normalized network without the WD term) is always perpendicular to the weight $w$, the gradient component $\partial L/\partial w$ always tends to increase the weight norm, while the component contributed by WD always tends to reduce it. Thus, if the weight norm remains unchanged, i.e. "equilibrium has been reached", one can obtain
$$\frac{w_t - w_{t+1}}{\|w_t\|_2} = \sqrt{2\eta\lambda}\cdot\frac{\partial L/\partial w}{\mathbb{E}\|\partial L/\partial w\|_2}. \quad (4)$$
Eq.(4) implies the magnitude of the update is invariant to the scale of the gradients, and the effective learning rate should be $\sqrt{2\eta\lambda}$. Li & Arora (2020) manage to estimate the magnitude of the update in SGDM; their result (Eq.(5)) is stated in a limit-and-accumulation manner as $T \to \infty$. Though not rigorously, one can easily speculate from Eq.(5) that the magnitude of the update in the SGDM case should be $\sqrt{2\eta\lambda/(1+\alpha)}$ in the equilibrium condition. But the proof of Eq.(5) requires stronger assumptions: not only convergence of the weight norm, but also convergence of the update norm $\|w_{t+1} - w_t\|_2$ (both in an accumulation manner).

Figure 1: Illustration of the optimization behavior with BN and WD. The angular update $\Delta_t$ is the angle between the weight $w_t$ and its updated value $w_{t+1}$.

As discussed above, all previous qualitative results about the "effective learning rate" (van Laarhoven, 2017; Chiley et al., 2019; Li & Arora, 2020) rely heavily on the equilibrium condition, but none of them explores why such an equilibrium condition can be achieved.
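The geometric fact driving Figure 1 — that the loss gradient of a scale-invariant weight is orthogonal to the weight, so a pure gradient step can only inflate the norm — can be checked numerically (a toy sketch; all names are hypothetical):

```python
import random

random.seed(0)
p = 8
w = [random.gauss(0, 1) for _ in range(p)]
nw2 = sum(wi * wi for wi in w)

# build a "gradient" orthogonal to w, as scale-invariance forces <g, w> = 0
g = [random.gauss(0, 1) for _ in range(p)]
dot = sum(gi * wi for gi, wi in zip(g, w))
g = [gi - (dot / nw2) * wi for gi, wi in zip(g, w)]

eta = 0.1
w_next = [wi - eta * gi for wi, gi in zip(w, g)]   # gradient step without WD
# Pythagoras: ||w_{t+1}||^2 = ||w_t||^2 + eta^2 * ||g||^2 > ||w_t||^2
norm_grows = sum(v * v for v in w_next) > nw2
```

Only the weight-decay term $-\eta\lambda w_t$ points back toward the origin, which is why an equilibrium between the two forces is even conceivable.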
Only van Laarhoven (2017) briefly interprets the occurrence of equilibrium as a natural result of the convergence of optimization: when optimization is close to finished, $w_t = w_{t+1}$, resulting in the equilibrium condition. However, this interpretation contains an apparent contradiction: according to Eq.(4) and (5), when the equilibrium condition is reached, the magnitude of the update is a nonzero constant determined only by hyper-parameters, which means the optimization process has not converged yet. Li & Arora (2020) also notice the non-convergence of SGD with BN and WD, so they do not discuss the reasonableness of the assumptions adopted by Eq.(5). In a word, previous results about the "effective learning rate" under the equilibrium condition can only provide vague insights; they are difficult to connect with empirical observations.

In this work, we comprehensively reveal the learning dynamics of normalized neural networks trained with stochastic gradient descent without/with momentum (SGD/SGDM) and weight decay, which we name Spherical Motion Dynamics (SMD). Our investigation aims to answer the following question: why and how can the equilibrium condition be reached in Spherical Motion Dynamics? Specifically, our contributions are:

• We introduce assumptions that lead to the equilibrium condition in SMD, and justify their reasonableness with extensive experiments. We also prove that under the given assumptions, the equilibrium condition is reached as the weight norm approaches its theoretical value in a linear-rate regime. Our assumptions show the equilibrium condition can occur long before the whole optimization finishes;

• We define a novel index, the angular update, to measure the change of a normalized neural network within a single iteration, and derive its theoretical value under the equilibrium condition in SMD. We also prove the angular update approaches its theoretical value in a linear-rate regime along with the weight norm. Our results imply that the update efficiency of SGD/SGDM on a normalized neural network depends only on pre-defined hyper-parameters, in both the SGD and SGDM cases;

• We verify our theorems on different computer vision tasks (including the challenging ImageNet (Russakovsky et al., 2015) and MSCOCO (Lin et al., 2014) datasets) with various network structures and normalization techniques. Experiments show the theoretical values of the angular update and weight norm agree well with empirical observations.

Recently, a parallel work (Li et al., 2020b) studies closely related questions via stochastic differential equations. Our theory on the equilibrium condition in Spherical Motion Dynamics implies that the equilibrium condition mostly relies on the update rules of SGD/SGDM with WD and on the scale-invariance property. The cause of the equilibrium condition is independent of the decrease of the loss or the trajectory of optimization, yet the equilibrium condition turns out to significantly affect the update efficiency of normalized networks by controlling the relative update (Eq.(4)). We believe the equilibrium condition is one of the key reasons why the learning dynamics of normalized neural networks are not consistent with traditional optimization theory (Li et al., 2020b). We think it has great potential to study the learning dynamics of normalized networks, or to develop novel efficient learning strategies, under the view of Spherical Motion Dynamics.

2. RELATED WORK

Normalization techniques. Batch Normalization (BN) (Ioffe & Szegedy, 2015) was proposed to deal with gradient vanishing/explosion and to accelerate the training of DNNs. BN rapidly became widely used in almost all kinds of deep learning tasks. Aside from BN, more normalization techniques have been proposed to remedy the defects of BN (Ioffe, 2017; Wu & He, 2018; Chiley et al., 2019; Yan et al., 2020) or to achieve better performance (Ba et al., 2016; Ulyanov et al., 2016; Salimans & Kingma, 2016; Shao et al., 2019; Singh & Krishnan, 2020).

Weight decay. Weight decay (WD) is well known as $l_2$ regularization, or ridge regression, in statistics. WD has also been found extremely effective in deep learning tasks. Krizhevsky & Geoffrey (2009) show WD can sometimes even improve training accuracy, not just generalization performance; Zhang et al. (2019) show WD can regularize the input-output Jacobian norm and reduce the effective damping coefficient; Li et al. (2020a) discuss the disharmony between WD and weight normalization. A more recent work, Lewkowycz & Gur-Ari (2020), empirically finds that the number of SGD steps $T$ until a model achieves maximum performance satisfies $T \propto \frac{1}{\lambda\eta}$, where $\lambda, \eta$ are the weight decay factor and learning rate respectively; they interpret this phenomenon through the lens of the Neural Tangent Kernel (Jacot et al., 2018), showing that weight decay can accelerate the training process. Note that their result has no connection with the equilibrium condition discussed in this work. Our results show the equilibrium condition can be reached long before the neural network attains its highest performance.

Effective learning rate. Due to the scale-invariance caused by normalization, researchers have studied the behavior of the effective learning rate. van Laarhoven (2017); Chiley et al. (2019) estimate the magnitude of the effective learning rate under equilibrium assumptions in the SGD case; Hoffer et al. (2018) quantify the effective learning rate without equilibrium assumptions, so their results are much weaker; Arora et al. (2019) prove that without WD, a normalized DNN can still converge with a fixed/decaying learning rate in the GD/SGD case respectively; Zhang et al. (2019) show WD can increase the effective learning rate; Li & Arora (2020) prove that a standard multi-stage learning rate schedule with BN and WD is equivalent to an exponentially increasing learning rate schedule without WD. As a proposition, Li & Arora (2020) quantify the magnitude of the effective learning rate in the SGDM case. But none of these works discusses why the equilibrium condition can be reached. A recent work, Li et al. (2020b), studies the convergence of the effective learning rate via stochastic differential equations (SDE), proving that the convergence time is of order $O(1/(\lambda\eta))$, where $\lambda, \eta$ are the weight decay factor and learning rate respectively. Their result can only provide intuitive understanding, and is limited to the SGD case.

3. PRELIMINARY ON SPHERICAL MOTION DYNAMICS

First of all, we review the properties of scale-invariant weights and depict Spherical Motion Dynamics (SMD) in the SGD case. Note that except for the definitions, the intuitive statements and derivations in this section mostly come from previous literature and are not mathematically rigorous; we summarize them to provide background on our topic and preliminary knowledge for readers.

Lemma 1. If $w$ is scale-invariant with respect to $L(w)$, then for all $k > 0$ we have
$$\Big\langle w_t,\ \frac{\partial L}{\partial w}\Big|_{w=w_t}\Big\rangle = 0, \quad (6)$$
$$\frac{\partial L}{\partial w}\Big|_{w=kw_t} = \frac{1}{k}\cdot\frac{\partial L}{\partial w}\Big|_{w=w_t}. \quad (7)$$
The proof can be seen in Appendix B.1. Lemma 1 is also discussed in Hoffer et al. (2018); van Laarhoven (2017); Li & Arora (2020). Eq.(7) implies the gradient norm is influenced by the weight norm, yet the weight norm does not affect the output of the DNN; we therefore define the unit gradient to eliminate the effect of the weight norm.

Definition 1 (Unit Gradient). The unit gradient of a scale-invariant weight $w_t$ is the gradient evaluated at the unit vector $\hat{w}_t = w_t/\|w_t\|_2$; by Eq.(7), $\frac{\partial L}{\partial w}\big|_{w=\hat{w}_t} = \|w_t\|_2\cdot\frac{\partial L}{\partial w}\big|_{w=w_t}$.

Eq.(10) shows the effective learning rate can be viewed as the learning rate of SGD on the unit sphere $S^{p-1}$ (Hoffer et al., 2018). But the effective learning rate still cannot properly represent the magnitude of the update, since the unit gradient norm is unknown. Therefore we propose the angular update, defined below.

Definition 2 (Angular Update). Assuming $w_t$ is a scale-invariant weight from a neural network at iteration $t$, the angular update $\Delta_t$ is defined as
$$\Delta_t = \angle(w_t, w_{t+1}) = \arccos\frac{\langle w_t, w_{t+1}\rangle}{\|w_t\|\cdot\|w_{t+1}\|}, \quad (11)$$
where $\angle(\cdot,\cdot)$ denotes the angle between two vectors and $\langle\cdot,\cdot\rangle$ denotes the inner product.

According to Eq.(6), $\partial L/\partial w$ is perpendicular to the weight $w$. Therefore, if the angular update $\Delta_t$ is small enough, it can be approximated by the first-order Taylor expansion of $\tan\Delta_t$, revealing its connection to the effective learning rate and the unit gradient norm:
$$\Delta_t \approx \tan\Delta_t = \frac{\eta}{\|w_t\|}\cdot\Big\|\frac{\partial L}{\partial w}\Big|_{w=w_t}\Big\|_2 = \eta_{\mathrm{eff}}\cdot\Big\|\frac{\partial L}{\partial w}\Big|_{w=\hat{w}_t}\Big\|_2. \quad (12)$$
Another deduction from Eq.(6) is that, without WD, the weight norm always increases, because
$$\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + \Big(\eta\Big\|\frac{\partial L}{\partial w}\Big|_{w=w_t}\Big\|_2\Big)^2 > \|w_t\|_2^2. \quad (13)$$
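Definition 2 translates directly into code. A minimal sketch (the helper name `angular_update` is hypothetical) computing $\Delta_t$, and checking the small-angle connection of Eq.(12) for a perpendicular gradient step:

```python
import math

def angular_update(w_t, w_next):
    # Delta_t = arccos( <w_t, w_{t+1}> / (||w_t|| * ||w_{t+1}||) ), Definition 2
    dot = sum(a * b for a, b in zip(w_t, w_next))
    n1 = math.sqrt(sum(a * a for a in w_t))
    n2 = math.sqrt(sum(b * b for b in w_next))
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))

# a gradient step with g perpendicular to w, as Eq.(6) guarantees
w = [1.0, 0.0]
g = [0.0, 1.0]
eta = 0.1
w_next = [wi - eta * gi for wi, gi in zip(w, g)]
delta = angular_update(w, w_next)   # here exactly atan(eta * ||g|| / ||w||)
```

For this orthogonal step the angle is exactly $\arctan(\eta\|g\|/\|w\|)$, which the first-order formula of Eq.(12) approximates as $\eta\|g\|/\|w\|$ when the angle is small.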
From Eq.(7) we can infer that an increasing weight norm leads to a smaller gradient norm if the unit gradient norm is unchanged. Zhang et al. (2019) state the potential risk that GD/SGD with BN but without WD will converge to a stationary point not by reducing the loss but by reducing the effective learning rate through an increasing weight norm. Arora et al. (2019) prove that full gradient descent avoids this risk and converges to a stationary point defined on $S^{p-1}$, but their result still requires a sophisticated learning rate decay schedule in the SGD case. Besides, practical experience suggests that training a DNN without WD usually suffers from poor generalization (Zhang et al., 2019; Bengio & LeCun, 2007; Lewkowycz & Gur-Ari, 2020).

Now consider the update rule of SGD with WD:
$$w_{t+1} = w_t - \eta\Big(\frac{\partial L}{\partial w}\Big|_{w=w_t} + \lambda w_t\Big). \quad (14)$$
We can approximate the update of the weight norm by
$$\|w_{t+1}\|_2 \approx \|w_t\|_2 - \lambda\eta\|w_t\|_2 + \frac{\eta^2}{2\|w_t\|_2^3}\cdot\Big\|\frac{\partial L}{\partial w}\Big|_{w=\hat{w}_t}\Big\|_2^2. \quad (15)$$
The derivation of Eq.(15) is presented in Appendix A.1. Eq.(15) implies WD provides a direction that reduces the weight norm, hence Chiley et al. (2019); Zhang et al. (2019) point out the possibility that the weight norm can be steady, but do not explain it clearly. Here we demonstrate the mechanism in more depth (see Figure 1): if the unit gradient norm remains unchanged, the "centripetal force" ($-\lambda\eta\|w_t\|_2$) is proportional to the weight norm, while the "centrifugal force" ($\frac{\eta^2}{2\|w_t\|_2^3}\cdot\|\frac{\partial L}{\partial w}|_{w=\hat{w}_t}\|_2^2$) is inversely proportional to the cube of the weight norm. As a result, the dynamics of the weight norm resemble a spherical motion in physics: an overly large weight norm makes the centripetal force larger than the centrifugal force, so the weight norm decreases, while too small a weight norm makes the centripetal force smaller than the centrifugal force, so the weight norm increases. Intuitively, the equilibrium condition tends to be reached when the number of iterations is sufficiently large.
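The "centripetal vs. centrifugal" mechanism can be illustrated by iterating the scalar norm recursion of Eq.(15) under the idealized assumption of a constant unit gradient norm (toy values; all names are hypothetical). Both an overly large and an overly small initial norm are driven to the same equilibrium:

```python
eta, lam, L = 0.1, 1e-3, 1.0   # toy values; L plays the role of a constant ||unit grad||^2

def step(n):
    # Eq.(15): ||w_{t+1}|| ~= ||w_t|| - lam*eta*||w_t|| + eta^2/(2*||w_t||^3) * L
    return n - lam * eta * n + eta ** 2 / (2 * n ** 3) * L

n_hi, n_lo = 5.0, 0.5            # start above and below equilibrium
for _ in range(200000):
    n_hi, n_lo = step(n_hi), step(n_lo)

# balancing the two forces gives n^4 = eta*L/(2*lam), the exact fixed point of the map
n_star = (L * eta / (2 * lam)) ** 0.25
```

The fixed point $(L\eta/(2\lambda))^{1/4}$ is exactly the theoretical weight norm $w^*$ that Theorem 1 below establishes for the stochastic setting.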
Notice that the core assumption above is that "the unit gradient norm is unchanged". In fact, this assumption resolves the contradiction presented in Section 1: the convergence of the weight norm is not equivalent to the convergence of the weight itself. A steady unit gradient norm can make the weight norm converge while the relative update $\|w_{t+1} - w_t\|_2/\|w_t\|_2$ remains nonzero, and a steady unit gradient norm does not rely on the optimization having reached an optimal solution. But a problem with the unit gradient assumption arises: in practice the unit gradient norm cannot remain unchanged throughout training, so as stated it is not a reasonable assumption. In the next section, we formulate this assumption in a reasonable manner and rigorously prove the existence of the equilibrium condition in the SGD case. The discussion of SGD cannot easily be extended to the SGDM case, because the momentum, unlike the unit gradient, is not always perpendicular to the weight. But we can still prove the existence of the equilibrium condition in the SGDM case under modified assumptions.

4. MAIN RESULTS

First of all, we prove the existence of the equilibrium condition in the SGD case and provide the approaching rate of the weight norm.

Theorem 1 (Equilibrium condition in SGD). Assume the loss function is $L(X; w)$ with scale-invariant weight $w$; denote $g_t = \frac{\partial L}{\partial w}\big|_{X_t, w_t}$, $\hat{g}_t = g_t\cdot\|w_t\|_2$. Consider the update rule of SGD with weight decay, $w_{t+1} = w_t - \eta(g_t + \lambda w_t)$, where $\lambda, \eta \in (0, 1)$. If the following assumptions hold:
1) $\lambda\eta \ll 1$;
2) $\exists L, V \in \mathbb{R}^+$ such that $\mathbb{E}[\|\hat{g}_t\|_2^2\,|\,w_t] = L$ and $\mathbb{E}[(\|\hat{g}_t\|_2^2 - L)^2\,|\,w_t] \le V$;
3) $\exists l \in \mathbb{R}^+$ such that $\|\hat{g}_t\|_2^2 > l$ and $l > 2\big[\frac{2\lambda\eta}{1-2\lambda\eta}\big]^2 L$;
then $\exists B > 0$ and $w^* = \sqrt[4]{L\eta/(2\lambda)}$ such that
$$\mathbb{E}\big[\|w_T\|_2^2 - (w^*)^2\big]^2 \le (1 - 4\lambda\eta)^T B + \frac{V\eta^2}{l}. \quad (17)$$

Remark 1. The theoretical value of the weight norm $w^*$ in Theorem 1 is consistent with the derivation of the weight norm in equilibrium in van Laarhoven (2017); however, van Laarhoven (2017) assumes the equilibrium condition has been reached in advance, and hence cannot provide the approaching rate or the scale of the bias/variance. The vanishing term $(1 - 4\lambda\eta)^T B$ in Eq.(17) is consistent with the convergence time $O(1/(\lambda\eta))$ presented in Li et al. (2020b).

The proof of Theorem 1 can be seen in Appendix B.2. It shows the square of the weight norm approaches its theoretical value in a linear-rate regime (when the vanishing term dominates the noise term in Eq.(17)), and its variance is bounded by $\frac{V\eta^2}{l}$, which is empirically small. Now we discuss the reasonableness of the assumptions in Theorem 1. Assumption 1 is consistent with commonly used settings; assumptions 2 and 3 imply that $\mathbb{E}\|\hat{g}_t\|_2^2$ remains unchanged within several iterations, and that its lower bound cannot be far from its expectation. We clarify assumption 2 further: we do not require $\mathbb{E}\|\hat{g}_t\|_2^2$ to be constant across the whole training process, only that it remains unchanged locally.
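A quick sanity check of Theorem 1 (a toy simulation, not the paper's experiments): iterate the squared-norm recursion from the proof, Eq.(42), with a noisy unit gradient norm of mean $L$, and compare against $(w^*)^2 = \sqrt{L\eta/(2\lambda)}$. All parameter values here are illustrative:

```python
import random
random.seed(1)

eta, lam, L = 0.1, 1e-3, 1.0
x = 10.0                                   # x_t = ||w_t||_2^2, start away from equilibrium
for _ in range(100000):
    L_t = L * random.uniform(0.8, 1.2)     # noisy ||unit gradient||^2 with E[L_t] = L
    x = (1 - 2 * lam * eta) * x + L_t * eta ** 2 / x   # recursion Eq.(42)

x_star = (L * eta / (2 * lam)) ** 0.5      # (w*)^2 = sqrt(L*eta/(2*lam))
rel_err = abs(x - x_star) / x_star
```

Despite the per-step noise, the iterate settles within a few percent of $(w^*)^2$, and the residual fluctuation scales like the $V\eta^2/l$ term of Eq.(17).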
On one hand, a small learning rate $\eta$ guarantees that $\mathbb{E}\|\hat{g}_t\|_2^2$ changes slowly; on the other hand, when $\mathbb{E}\|\hat{g}_t\|_2^2$ changes, the square of the weight norm approaches its new theoretical value as Theorem 1 describes. The experimental results in Figure 2 strongly support this analysis. We also conduct extensive experiments to verify our claim further; please refer to Appendix C.1.

Now we extend Theorem 1 to the SGDM case. SGDM is more complex than SGD since the momentum is not always perpendicular to the weight, so we need to modify the assumptions.

Theorem 2 (Equilibrium condition in SGDM). Consider the update rule of SGDM (the heavy-ball method (Polyak, 1964)):
$$v_t = \alpha v_{t-1} + g_t + \lambda w_t, \quad (18)$$
$$w_{t+1} = w_t - \eta v_t, \quad (19)$$
where $\lambda, \eta, \alpha \in (0, 1)$. If the following assumptions hold:
4) $\lambda\eta \ll 1$, $\lambda\eta < (1 - \sqrt{\alpha})^2$;
5) define $h_t = \|g_t\|_2^2 + 2\alpha\langle v_{t-1}, g_t\rangle$ and $\hat{h}_t = h_t\cdot\|w_t\|_2^2$; $\exists L, V \in \mathbb{R}^+$ such that $\mathbb{E}[\hat{h}_t\,|\,w_t] = L$ and $\mathbb{E}[(\hat{h}_t - L)^2\,|\,w_t] \le V$;
6) $\exists l \in \mathbb{R}^+$ such that $\hat{h}_t > l$ and $l > 2\big[\frac{6\lambda\eta}{(1-\alpha)^3(1+\alpha) - 8\lambda\eta(1-\alpha)}\big]^2 L$;
then $\exists B > 0$ and $w^* = \sqrt[4]{L\eta/\big(\lambda(1-\alpha)(2 - \lambda\eta/(1+\alpha))\big)}$ such that
$$\mathbb{E}\big[\|w_T\|_2^2 - (w^*)^2\big]^2 \le 3B\Big(1 - \frac{4\lambda\eta}{1-\alpha}\Big)^T + \frac{3V\eta^2(1 + 4\alpha^2 + \alpha^4)}{l(1-\alpha)^4}. \quad (20)$$

Remark 2. So far, no other work rigorously proves that the equilibrium condition can be reached in the SGDM case. Even the most relevant work (Li et al., 2020b) only provides a conjecture on the approaching rate of the weight norm in SGDM: they speculate that the time to approach equilibrium should be $O(1/(\lambda\eta))$, the same order as in the SGD case, which provides no further insight. In contrast, our results (the vanishing terms in Eq.(17) and (20) respectively) clearly reflect the difference: the approaching rate of SGDM is $1/(1-\alpha)$ times faster than that of SGD with the same $\eta\lambda$. Since $\alpha$ is usually set to $0.9$, SGDM can reach the equilibrium condition much faster than SGD. The proof can be seen in Appendix B.3. Like assumption 1, assumption 4 is satisfied by commonly used hyper-parameter settings.
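Theorem 2 can likewise be probed with a toy SGDM simulation (an illustrative sketch under assumed settings: gradients drawn as fresh random directions perpendicular to $w$, rescaled so that $\|\hat{g}_t\|_2^2 = L$ exactly; all names and values are hypothetical). The time-averaged squared norm should sit near $(w^*)^2$:

```python
import math, random
random.seed(2)

eta, lam, alpha, L, p = 0.1, 1e-3, 0.9, 1.0, 8
w = [random.gauss(0, 1) for _ in range(p)]
v = [0.0] * p

xs = []
for t in range(80000):
    nw2 = sum(wi * wi for wi in w)
    # fresh random direction, projected orthogonal to w (Eq. 6), scaled so ||g||^2 * ||w||^2 = L
    d = [random.gauss(0, 1) for _ in range(p)]
    c = sum(di * wi for di, wi in zip(d, w)) / nw2
    d = [di - c * wi for di, wi in zip(d, w)]
    nd = math.sqrt(sum(di * di for di in d))
    scale = math.sqrt(L / nw2) / nd
    g = [scale * di for di in d]
    v = [alpha * vi + gi + lam * wi for vi, gi, wi in zip(v, g, w)]   # Eq.(18)
    w = [wi - eta * vi for wi, vi in zip(w, v)]                      # Eq.(19)
    if t >= 40000:                     # average after the transient has died out
        xs.append(sum(wi * wi for wi in w))

x_avg = sum(xs) / len(xs)
# theoretical (w*)^2 from Theorem 2
x_star = math.sqrt(L * eta / (lam * (1 - alpha) * (2 - lam * eta / (1 + alpha))))
rel_err = abs(x_avg - x_star) / x_star
```

Note the fluctuation around equilibrium is visibly larger than in the SGD toy run, matching the larger noise term of Eq.(20).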
Besides, $\lambda\eta < (1 - \sqrt{\alpha})^2$ is also mentioned in Li & Arora (2020) for another purpose. Assumption 5 shows that it is not the unit gradient norm $\|\hat{g}_t\|_2^2$ but an adjusted value $\hat{h}_t$ that dominates the expectation and variance of the squared weight norm. We empirically find that the expectation of $\langle v_{t-1}, g_t\rangle$ is very close to $0$, so the behavior of $\hat{h}_t$ is similar to that of $\|\hat{g}_t\|_2^2$ (see (d) in Figure 2), making the squared weight norm approach its theoretical value in the SGDM case. We leave a theoretical analysis of $\hat{h}_t$ as future work. As for assumption 6, commonly used settings ($\eta\lambda \ll 1$) make $2\big[\frac{6\lambda\eta}{(1-\alpha)^3(1+\alpha) - 8\lambda\eta(1-\alpha)}\big]^2$ a very small lower bound for $l/L$. The experiments justifying assumptions 4, 5, 6 can be seen in Figure 2 and Appendix C.1. Comparing Eq.(17) and Eq.(20), with the same $\eta, \lambda$, SGDM reaches the equilibrium condition much faster than SGD, but may have larger variance; our experiments also verify this (see (b), (e) in Figure 2).

Since we have proven the weight norm approaches its theoretical value in SMD in a linear-rate regime, we can derive the theoretical value of the angular update $\Delta_t$ and its variance.

Theorem 3 (Theoretical value of the Angular Update). In the SGD case, if the assumptions of Theorem 1 hold, then $\exists C > 0$ such that
$$\mathbb{E}\big(\Delta_T - \sqrt{2\lambda\eta}\big)^2 < (1 - 4\eta\lambda)^T C + \frac{V}{Ll}. \quad (21)$$
In the SGDM case, if the assumptions of Theorem 2 hold, then $\exists C > 0$ such that
$$\mathbb{E}\Big(\Delta_T - \sqrt{\frac{2\lambda\eta}{1+\alpha}}\Big)^2 < \Big(1 - \frac{4\eta\lambda}{1-\alpha}\Big)^T C + \frac{(1-\alpha^2)(1 + 4\alpha^2 + \alpha^4)V}{4Ll\alpha^4}. \quad (22)$$
The proof of Theorem 3 is shown in Appendix B.4. According to Theorem 3, the theoretical value of the angular update and its approaching rate depend almost only on hyper-parameters: the weight decay factor $\lambda$, the learning rate $\eta$ (and the momentum factor $\alpha$). This implies that in the equilibrium condition, the update efficiency of a scale-invariant weight within a single step is totally controlled by predefined hyper-parameters, regardless of other attributes of the weight (shape, size, position in the network structure, or effects from other weights). That is the key reason we propose the angular update (Eq.(11)) to replace the effective learning rate (Eq.(3)): the effective learning rate can only reflect the influence of the weight norm and cannot reveal how the unit gradient norm affects the relative update (see Eq.(10)), while the angular update in the equilibrium condition shows that neither the weight norm nor the unit gradient norm affects the scale of the update within a single iteration; only the hyper-parameters (learning rate, weight decay factor, momentum factor) matter. Our experimental results strongly support this claim (see Figure 3). Moreover, inspired by Li et al. (2020b), we suspect that $\Delta_t \propto \sqrt{\lambda\eta}$ is highly correlated with the performance of the DNN: a smaller angular update in the equilibrium condition may lead to better performance. According to Theorem 3, with the same $\lambda\eta$, the momentum method has a faster approaching rate ($\lambda\eta/(1-\alpha)$) and a smaller angular update ($\sqrt{2\lambda\eta/(1+\alpha)}$) than plain SGD. This means the momentum method can simultaneously accelerate the training process and improve the final performance compared with plain SGD. We leave this conjecture as future work.
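The theoretical angular updates and approach rates of Theorem 3 are plain hyper-parameter arithmetic; for one typical setting ($\eta = 0.1$, $\lambda = 10^{-4}$, $\alpha = 0.9$, chosen here only for illustration):

```python
import math

eta, lam, alpha = 0.1, 1e-4, 0.9
delta_sgd = math.sqrt(2 * lam * eta)                 # theoretical angular update, SGD
delta_sgdm = math.sqrt(2 * lam * eta / (1 + alpha))  # SGDM: smaller by sqrt(1 + alpha)
# per-step approach-rate exponents from the vanishing terms of Eq.(17), (20)
rate_ratio = (4 * lam * eta / (1 - alpha)) / (4 * lam * eta)
```

With $\alpha = 0.9$, SGDM approaches equilibrium ten times faster per step while taking a smaller angular update each iteration, which is the quantitative content of the conjecture above.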

5. EXPERIMENTS

In this section, we show that our theorems agree well with empirical observations on ImageNet (Russakovsky et al., 2015) and MSCOCO (Lin et al., 2014). We conduct experiments in two cases. In the first case, we train neural networks with a fixed learning rate to verify our theorems in SGD and SGDM, respectively; in the second case, we investigate SMD with the more commonly used multi-stage learning rate schedule.

Under review as a conference paper at ICLR 2021.

5.1. FIXED LEARNING RATE

With a fixed learning rate, we train ResNet50 (He et al., 2016) with SGD/SGDM on ImageNet. The learning rate is fixed at $\eta = 0.2$; the WD factor is $\lambda = 10^{-4}$; with SGDM, the momentum factor is $\alpha = 0.9$. Figure 2 presents the squared unit gradient norm, the weight norm, and the angular update of the weights from LAYER.2.0.CONV2 of ResNet50 in the SGD and SGDM cases, respectively. It can be inferred from Figure 2 that the behavior of $\|\hat{g}_t\|_2^2$, $\hat{h}_t$ and the hyper-parameter settings satisfy the assumptions of Theorems 1 and 2, so the theoretical values of the weight norm and angular update agree very well with the empirical values. We also observe that SGDM achieves equilibrium more quickly than SGD. According to Eq.(17), (20), the underlying reason is that with the same learning rate $\eta$ and WD factor $\lambda$, the approaching factor of SGDM ($1 - \frac{4\lambda\eta}{1-\alpha}$) is smaller than that of SGD ($1 - 4\lambda\eta$).

5.2. MULTI-STAGE LEARNING RATE SCHEDULE

Now we turn to the behavior of the angular update with SGDM and a multi-stage learning rate schedule on ImageNet (Russakovsky et al., 2015) and MSCOCO (Lin et al., 2014). In the ImageNet classification task, we still adopt ResNet50 as the baseline, as it is a widely recognized network structure. The training settings rigorously follow Goyal et al. (2017): the learning rate is initialized as 0.1 and divided by 10 at the 30th, 60th, and 80th epochs; the WD factor is $10^{-4}$; the momentum factor is 0.9. In the MSCOCO experiment, we conduct experiments on the Mask R-CNN (He et al., 2017) benchmark using a Feature Pyramid Network (FPN) (Lin et al., 2017), a ResNet50 backbone, and SyncBN (Peng et al., 2018), following the 4x setting in He et al. (2019): the total number of iterations is 360,000; the learning rate is initialized as 0.02 and divided by 10 at iterations 300,000 and 340,000; the WD coefficient is $10^{-4}$.

There appears to be some mismatch between the theorems and the empirical observations in (a), (b) of Figure 3: the angular update in the last two stages is smaller than its theoretical value. This phenomenon is well explained by our theory: according to Theorems 1 and 2, when the equilibrium condition is reached, the theoretical value of the weight norm satisfies $\|w_t\|_2 \propto \sqrt[4]{\eta/\lambda}$; therefore, when the learning rate is divided by $k$, the equilibrium condition is broken, and the theoretical value of the weight norm in the new equilibrium condition becomes $\sqrt[4]{1/k}$ times the old one. But the new equilibrium condition cannot be reached immediately (see (c), (f) in Figure 3); the following corollary gives the least number of iterations needed to reach the new equilibrium condition.

Corollary 3.1. In the SGD case with learning rate $\eta$ and WD coefficient $\lambda$, if the learning rate is divided by $k$ and the unit gradient norm remains unchanged, then at least $\log(k)/(4\lambda\eta)$ iterations are required to reach the new equilibrium condition. In the SGDM case with momentum coefficient $\alpha$, at least $\log(k)(1-\alpha)/(4\lambda\eta)$ iterations are required to reach the new equilibrium condition.
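Corollary 3.1 gives a concrete iteration count. A small sketch (assuming, as seems natural from the corollary, that the post-decay learning rate is the $\eta$ entering the formula; the function name is hypothetical):

```python
import math

def iters_to_new_equilibrium(k, lam, eta, alpha=None):
    # least iterations after dividing the LR by k (Corollary 3.1);
    # eta is the post-decay learning rate, which governs the new dynamics
    if alpha is None:                                    # SGD
        return math.log(k) / (4 * lam * eta)
    return math.log(k) * (1 - alpha) / (4 * lam * eta)   # SGDM

# ImageNet-style setting after the first decay: eta 0.1 -> 0.01, lam = 1e-4, alpha = 0.9
n_sgdm = iters_to_new_equilibrium(10, 1e-4, 0.01, alpha=0.9)   # tens of thousands of iterations
```

With these values the SGDM bound is roughly 5.8e4 iterations — comparable to a full 30-epoch stage on ImageNet, which is consistent with the lag visible in (c), (f) of Figure 3.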
Corollary 3.1 suggests a simple strategy; our experiments (see Figure 3) show it can keep the angular update approximately equal to its theoretical value across the whole training process, even though the learning rate changes.

6. CONCLUSION

In this paper, we comprehensively reveal the learning dynamics of DNNs with normalization, WD, and SGD/SGDM, which we name Spherical Motion Dynamics (SMD). Different from most related works (van Laarhoven, 2017; Hoffer et al., 2018; Chiley et al., 2019; Li & Arora, 2020), we directly explore the cause of equilibrium. Specifically, we introduce assumptions that lead to the equilibrium condition and show these assumptions are easily satisfied in practical implementations of DNNs; under the given assumptions, we prove the equilibrium condition can be reached at a linear rate, long before optimization has converged. Most importantly, we show our theorems are widely valid: they can be verified on some of the most challenging computer vision tasks, not just synthetic datasets. We believe our theorems on SMD can bridge the gap between current theoretical progress and the practical usage of deep learning techniques.

A.1 APPROXIMATION OF WEIGHT NORM UPDATE

Now consider the update rule of SGD with WD:
$$w_{t+1} = w_t - \eta\Big(\frac{\partial L}{\partial w}\Big|_{w=w_t} + \lambda w_t\Big). \quad (23)$$
Since $\frac{\partial L}{\partial w}\big|_{w=w_t}$ is perpendicular to $w_t$, we have
$$\|w_{t+1}\|_2^2 = \Big\|w_t - \eta\Big(\frac{\partial L}{\partial w}\Big|_{w=w_t} + \lambda w_t\Big)\Big\|_2^2 \quad (24)$$
$$= (1-\lambda\eta)^2\|w_t\|_2^2 + \eta^2\Big\|\frac{\partial L}{\partial w}\Big|_{w=w_t}\Big\|_2^2. \quad (25)$$
Therefore,
$$\|w_{t+1}\|_2 - \|w_t\|_2 = \sqrt{(1-\lambda\eta)^2\|w_t\|_2^2 + \eta^2\Big\|\tfrac{\partial L}{\partial w}\big|_{w=w_t}\Big\|_2^2} - \|w_t\|_2 \quad (26)$$
$$= \frac{(1-\lambda\eta)^2\|w_t\|_2^2 + \eta^2\big\|\tfrac{\partial L}{\partial w}\big|_{w=w_t}\big\|_2^2 - \|w_t\|_2^2}{\sqrt{(1-\lambda\eta)^2\|w_t\|_2^2 + \eta^2\big\|\tfrac{\partial L}{\partial w}\big|_{w=w_t}\big\|_2^2} + \|w_t\|_2}. \quad (27)$$
Now assume $\eta, \lambda$ are extremely small, so that
$$(1-\lambda\eta)^2\|w_t\|_2^2 + \eta^2\Big\|\frac{\partial L}{\partial w}\Big|_{w=w_t}\Big\|_2^2 - \|w_t\|_2^2 \approx -2\lambda\eta\|w_t\|_2^2 + \eta^2\Big\|\frac{\partial L}{\partial w}\Big|_{w=w_t}\Big\|_2^2, \quad (28)$$
$$\sqrt{(1-\lambda\eta)^2\|w_t\|_2^2 + \eta^2\Big\|\tfrac{\partial L}{\partial w}\big|_{w=w_t}\Big\|_2^2} + \|w_t\|_2 \approx 2\|w_t\|_2. \quad (29)$$
Therefore, we have
$$\|w_{t+1}\|_2 - \|w_t\|_2 \approx -\lambda\eta\|w_t\|_2 + \frac{\eta^2}{2\|w_t\|_2}\Big\|\frac{\partial L}{\partial w}\Big|_{w=w_t}\Big\|_2^2 = -\lambda\eta\|w_t\|_2 + \frac{\eta^2}{2\|w_t\|_2^3}\Big\|\frac{\partial L}{\partial w}\Big|_{w=\hat{w}_t}\Big\|_2^2, \quad (30)$$
where $\frac{\partial L}{\partial w}\big|_{w=\hat{w}_t}$ is the unit gradient defined in Definition 1.
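The quality of the first-order approximation derived above can be checked numerically by comparing Eq.(25) (exact) against Eq.(30) (first order), using assumed toy values:

```python
import math

def exact_next_norm(n, g, eta, lam):
    # ||w_{t+1}||_2 from Eq.(25), using that the gradient is perpendicular to w_t
    return math.sqrt((1 - lam * eta) ** 2 * n ** 2 + eta ** 2 * g ** 2)

def approx_next_norm(n, g, eta, lam):
    # first-order approximation of Eq.(30)
    return n - lam * eta * n + eta ** 2 * g ** 2 / (2 * n)

# toy values: ||w_t|| = 3, ||grad|| = 0.5, eta = 0.1, lam = 1e-4
n, g, eta, lam = 3.0, 0.5, 0.1, 1e-4
err = abs(exact_next_norm(n, g, eta, lam) - approx_next_norm(n, g, eta, lam))
```

The discrepancy is second order in the small quantities $\lambda\eta$ and $\eta\|g\|/\|w\|$, hence negligible at practical hyper-parameter scales.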

B PROOF OF THEOREMS

Remark 4. In the following context, we use the following approximations multiple times: $\forall \delta, \varepsilon \in \mathbb{R}$ with $|\delta| \ll 1$, $|\varepsilon| \ll 1$,
$$(1+\delta)^2 \approx 1+2\delta, \quad \sqrt{1+\delta} \approx 1+\frac{\delta}{2}, \quad \frac{1}{1+\delta} \approx 1-\delta, \quad (1+\delta)(1+\varepsilon) \approx 1+\delta+\varepsilon. \quad (31)$$

B.1 PROOF OF LEMMA 1

Proof. Given $w_0 \in \mathbb{R}^p\setminus\{0\}$, since $\forall k > 0$, $L(w_0) = L(kw_0)$, we have
$$\frac{\partial L(w)}{\partial w}\Big|_{w=w_0} = \frac{\partial L(kw)}{\partial w}\Big|_{w=w_0} = k\cdot\frac{\partial L(w)}{\partial w}\Big|_{w=kw_0}, \quad (32)$$
$$0 = \frac{\partial L(kw)}{\partial k}\Big|_{w=w_0} = \Big\langle\frac{\partial L(w)}{\partial w}\Big|_{w=kw_0},\, w_0\Big\rangle = \frac{1}{k}\Big\langle\frac{\partial L(w)}{\partial w}\Big|_{w=w_0},\, w_0\Big\rangle. \quad (33)$$

B.2 PROOF OF THEOREM 1

Lemma 2. If the sequence $\{x_t\}_{t=1}^{\infty}$ satisfies
$$x_t \ge \alpha x_{t-1} + \frac{L}{x_{t-1}}, \quad (34)$$
where $x_1 > 0$, $L > 0$, $\alpha \in (0,1)$, then we have
$$x_t \ge \sqrt{\frac{L}{1-\alpha}} - \alpha^{t-1}\Big|\sqrt{\frac{L}{1-\alpha}} - x_1\Big|. \quad (35)$$

Proof. If $x_t \ge \sqrt{L/(1-\alpha)}$: since $\alpha x + L/x$ is increasing in $x$ for $x \ge \sqrt{L/\alpha}$, and $\sqrt{L/(1-\alpha)} \ge \sqrt{L/\alpha}$ when $\alpha \ge 1/2$ (which holds in our application, where $\alpha = 1-2\lambda\eta$), we have
$$x_{t+1} \ge \alpha x_t + \frac{L}{x_t} \ge \alpha\sqrt{\frac{L}{1-\alpha}} + \frac{L}{\sqrt{L/(1-\alpha)}} = \sqrt{\frac{L}{1-\alpha}}, \quad (36)$$
which means $\forall k > t$, $x_k \ge \sqrt{L/(1-\alpha)}$. If $x_t < \sqrt{L/(1-\alpha)}$, then we have
$$\sqrt{\frac{L}{1-\alpha}} - x_{t+1} \le \Big(\alpha - \frac{\sqrt{L(1-\alpha)}}{x_t}\Big)\Big(\sqrt{\frac{L}{1-\alpha}} - x_t\Big) < \alpha\Big(\sqrt{\frac{L}{1-\alpha}} - x_t\Big). \quad (37)$$
Therefore, by induction, if $x_T < \sqrt{L/(1-\alpha)}$, then $\forall t \in [1, T-1]$ we have
$$0 < x_t < x_{t+1} < x_T < \sqrt{\frac{L}{1-\alpha}}, \qquad \sqrt{\frac{L}{1-\alpha}} - x_t < \alpha^{t-1}\Big(\sqrt{\frac{L}{1-\alpha}} - x_1\Big). \quad (38, 39)$$
In summary, we have
$$x_t \ge \sqrt{\frac{L}{1-\alpha}} - \alpha^{t-1}\Big|\sqrt{\frac{L}{1-\alpha}} - x_1\Big|. \quad (40)$$

Proof of Theorem 1. Since $\langle w_t, g_t\rangle = 0$, we have
$$\|w_{t+1}\|_2^2 = (1-\eta\lambda)^2\|w_t\|_2^2 + \frac{\|\hat{g}_t\|_2^2\,\eta^2}{\|w_t\|_2^2}. \quad (41)$$
Denote $x_t = \|w_t\|_2^2$, $L_t = \|\hat{g}_t\|_2^2$, and omit the $O((\eta\lambda)^2)$ part; then $L_t > l$, $\mathbb{E}[L_t|x_t] = L$, $\mathrm{Var}(L_t|x_t) = \mathbb{E}[(L_t-L)^2|x_t] < V$, and Eq.(41) is equivalent to
$$x_{t+1} = (1-2\lambda\eta)x_t + \frac{L_t\eta^2}{x_t}. \quad (42)$$
According to Lemma 2, we have
$$x_t > \sqrt{\frac{l\eta}{2\lambda}} - (1-2\lambda\eta)^{t-1}\Big|x_0 - \sqrt{\frac{l\eta}{2\lambda}}\Big|. \quad (43)$$
Eq.(43) implies that when $t > 1 + \frac{\log\big((\sqrt{2}-1)\sqrt{l\eta/(4\lambda)}\big) - \log\big(|x_0 - \sqrt{l\eta/(2\lambda)}|\big)}{\log(1-2\lambda\eta)}$, we have $x_t > \sqrt{\frac{l\eta}{4\lambda}}$.
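The lower bound of Lemma 2 can be checked numerically on the equality case of the recursion (toy values with $\alpha \ge 1/2$, matching the regime used in the theorem; all names are hypothetical):

```python
import math

alpha, L = 0.99, 1.0
fp = math.sqrt(L / (1 - alpha))      # fixed point sqrt(L/(1-alpha)), here ~10
x1 = 0.2                             # start well below the fixed point
x = x1
ok = True
for t in range(1, 500):
    bound = fp - alpha ** (t - 1) * abs(fp - x1)   # Lemma 2's lower bound at step t
    ok = ok and (x >= bound - 1e-12)
    x = alpha * x + L / x            # equality case of recursion (34)
```

The iterate stays above the exponentially tightening bound and converges to the fixed point, mirroring how $\|w_t\|_2^2$ is pulled up toward its equilibrium value in the proof above.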
Now denote $x^* = \sqrt{\frac{L\eta}{2\lambda}}$; then we have
$$\mathbb{E}[(x_{t+1}-x^*)^2|x_t] = \mathbb{E}\Big[\Big(\Big(1-2\lambda\eta-\frac{L\eta^2}{x_tx^*}\Big)(x_t-x^*) + \frac{(L_t-L)\eta^2}{x_t}\Big)^2\Big|x_t\Big] = \Big(1-2\lambda\eta-\frac{L\eta^2}{x_tx^*}\Big)^2(x_t-x^*)^2 + \frac{\mathbb{E}[(L_t-L)^2|x_t]\,\eta^4}{x_t^2}. \quad (44)$$
If $t > T(\lambda,\eta,l,x_0) = 1 + \frac{\log((\sqrt{2}-1)\sqrt{l\eta/(4\lambda)}) - \log(|x_0-\sqrt{l\eta/(2\lambda)}|)}{\log(1-2\lambda\eta)}$, Eq.(43) implies
$$1-2\lambda\eta-\frac{2L}{l}\cdot 2\lambda\eta < 1-2\lambda\eta-\frac{L\eta^2}{x^*x_t}. \quad (45)$$
Combining with assumption 3 of Theorem 1, we have $1-2\lambda\eta-\frac{2L}{l}\cdot 2\lambda\eta > 0$, which means
$$0 < 1-2\lambda\eta-\frac{L\eta^2}{x^*x_t}. \quad (46)$$
Combining Eq.(44), (43), (46), if $t > T(\lambda,\eta,l,x_0)$, we have
$$\mathbb{E}[(x_{t+1}-x^*)^2|x_t] < (1-2\lambda\eta)^2(x_t-x^*)^2 + \frac{4V\eta^3\lambda}{l}. \quad (47)$$
Taking the expectation with respect to the distribution of $x_t$, we have
$$\mathbb{E}(x_{t+1}-x^*)^2 < (1-2\lambda\eta)^2\,\mathbb{E}(x_t-x^*)^2 + \frac{4V\eta^3\lambda}{l}. \quad (48)$$
Approximating $(1-2\lambda\eta)^2 \approx 1-4\lambda\eta$ and iterating Eq.(48) for $t-T(\lambda,\eta,l,x_0)$ steps, we have
$$\mathbb{E}(x_t-x^*)^2 < (1-4\lambda\eta)^{t-T(\lambda,\eta,l,x_0)}\,\mathbb{E}(x_{T(\lambda,\eta,l,x_0)}-x^*)^2 + \frac{V\eta^2}{l}. \quad (49)$$
$T(\lambda,\eta,l,x_0)$ is finite and depends only on $\lambda, \eta, l, x_0$. Hence we can set $B = \max\{(1-4\lambda\eta)^{-t}\,\mathbb{E}(x_t-x^*)^2 \mid t = 0, 1, 2, \ldots, T(\lambda,\eta,l,x_0)\}$; $B$ is finite because
$$\mathbb{E}x_{t+1}^2 = \mathbb{E}\Big[(1-2\lambda\eta)^2x_t^2 + 2(1-2\lambda\eta)L_t\eta^2 + \frac{L_t^2\eta^4}{x_t^2}\Big] < (1-4\lambda\eta)\,\mathbb{E}x_t^2 + 2(1-2\lambda\eta)L\eta^2 + \frac{(V+L^2)\eta^4}{\min(x_0^2,\ l\eta/(2\lambda))}. \quad (50)$$
Therefore, $\forall t > 0$, we have
$$\mathbb{E}(x_t-x^*)^2 < (1-4\lambda\eta)^tB + \frac{V\eta^2}{l}. \quad (51)$$

B.3 PROOF OF THEOREM 2

Lemma 3. Assume $\alpha, \beta, \varepsilon \in (0,1)$ with $\beta \ll 1$. Denote $\Lambda = \mathrm{diag}\big(1-\frac{2\beta}{1-\alpha},\ \alpha,\ \alpha^2+\frac{2\alpha^2}{1-\alpha}\beta\big)$, $k = \big(\frac{1}{(1-\alpha)^2},\ \frac{-2\alpha}{(1-\alpha)^2},\ \frac{\alpha^2}{(1-\alpha)^2}\big)^T$, $e = (1,1,1)^T$. If $\varepsilon < \frac{1}{3}\big[\frac{1-\alpha^2}{\beta}-\frac{8}{1-\alpha}\big]$, then $\forall d \in \mathbb{R}^3$ we have
$$\big\|\big(\Lambda - \varepsilon\beta(1-\alpha)^2ke^T\big)d\big\|_2^2 < \Big(1-\frac{4\beta}{1-\alpha}\Big)\|d\|_2^2. \quad (52)$$
Proof.
Omitting the $O(\beta^2)$ part, we have
$$\big\|\big(\Lambda-\varepsilon\beta(1-\alpha)^2ke^T\big)d\big\|_2^2 = \Big(1-\frac{4\beta}{1-\alpha}\Big)d_1^2 + \alpha^2d_2^2 + \alpha^4\Big(1+\frac{4\beta}{1-\alpha}\Big)d_3^2 - 2\varepsilon\beta(d_1+d_2+d_3)(d_1-2\alpha^2d_2+\alpha^4d_3). \quad (53)$$
Since
$$(d_1+d_2+d_3)(d_1-2\alpha^2d_2+\alpha^4d_3) = \frac{[d_1+(1-2\alpha^2)d_2]^2}{2} + \frac{[d_1+(1+\alpha^4)d_3]^2}{2} - \Big(\frac{1}{2}+2\alpha^4\Big)d_2^2 - \frac{1+\alpha^8}{2}d_3^2 + (\alpha^4-2\alpha^2)d_2d_3 \ge -\Big(\frac{1}{2}+2\alpha^4+\frac{\alpha^4}{2}\Big)d_2^2 - \Big(\frac{\alpha^8}{2}+\frac{\alpha^4}{2}-2\alpha^2+\frac{5}{2}\Big)d_3^2 \ge -3d_2^2-\frac{5}{2}d_3^2, \quad (54)$$
we have
$$\big\|\big(\Lambda-\varepsilon\beta(1-\alpha)^2ke^T\big)d\big\|_2^2 \le \Big(1-\frac{4\beta}{1-\alpha}\Big)d_1^2 + (\alpha^2+3\beta\varepsilon)d_2^2 + \Big(\alpha^4+\frac{4\beta\alpha^4}{1-\alpha}+\frac{5\beta\varepsilon}{2}\Big)d_3^2. \quad (55)$$
Since $\varepsilon < \frac{1}{3}\big[\frac{1-\alpha^2}{\beta}-\frac{8}{1-\alpha}\big]$, we have
$$\alpha^2+3\beta\varepsilon < 1-\frac{4\beta}{1-\alpha}, \qquad \alpha^4+\frac{4\beta\alpha^4}{1-\alpha}+\frac{5\beta\varepsilon}{2} < 1-\frac{4\beta}{1-\alpha}. \quad (56, 57)$$
Hence
$$\big\|\big(\Lambda-\varepsilon\beta(1-\alpha)^2ke^T\big)d\big\|_2^2 < \Big(1-\frac{4\beta}{1-\alpha}\Big)\|d\|_2^2. \quad (58)$$

Proof of Theorem 2. The update rule is
$$w_{t+1} = w_t - \eta v_t = w_t - \eta\Big(\alpha\,\frac{w_{t-1}-w_t}{\eta} + \frac{\hat{g}_t}{\|w_t\|_2} + \lambda w_t\Big) = (1-\eta\lambda+\alpha)w_t - \alpha w_{t-1} - \eta g_t. \quad (59)$$
Then, using $w_{t-1} = w_t + \eta v_{t-1}$ and $\langle w_t, g_t\rangle = 0$, so that $\|g_t\|_2^2\eta^2 + 2\alpha\eta\langle w_{t-1}, g_t\rangle = (\|g_t\|_2^2 + 2\alpha\langle v_{t-1}, g_t\rangle)\eta^2 = \hat{h}_t\eta^2/\|w_t\|_2^2$, we have
$$\|w_{t+1}\|_2^2 = (1-\eta\lambda+\alpha)^2\|w_t\|_2^2 - 2\alpha(1+\alpha-\eta\lambda)\langle w_t, w_{t-1}\rangle + \alpha^2\|w_{t-1}\|_2^2 + \frac{\hat{h}_t\eta^2}{\|w_t\|_2^2}, \quad (60)$$
$$\langle w_{t+1}, w_t\rangle = (1+\alpha-\lambda\eta)\|w_t\|_2^2 - \alpha\langle w_t, w_{t-1}\rangle. \quad (61)$$
Let
$$X_t = \begin{pmatrix} a_t\\ b_t\\ c_t \end{pmatrix} = \begin{pmatrix}\|w_t\|_2^2\\ \langle w_t, w_{t-1}\rangle\\ \|w_{t-1}\|_2^2\end{pmatrix}, \quad A = \begin{pmatrix}(1+\alpha-\lambda\eta)^2 & -2\alpha(1+\alpha-\lambda\eta) & \alpha^2\\ 1+\alpha-\lambda\eta & -\alpha & 0\\ 1 & 0 & 0\end{pmatrix}, \quad e_1 = (1,0,0)^T.$$
Then Eq.(60), (61) can be formulated as an iterative map:
$$X_{t+1} = AX_t + \frac{\hat{h}_t\eta^2}{e_1^TX_t}\,e_1. \quad (66)$$
When $\lambda\eta < (1-\sqrt{\alpha})^2$, the eigenvalues of $A$ are all real:
$$\lambda_1 = \frac{(1+\alpha-\lambda\eta)^2 + (1+\alpha-\lambda\eta)\sqrt{(1+\alpha-\lambda\eta)^2-4\alpha}}{2} - \alpha = 1-\frac{2\lambda\eta}{1-\alpha} + O(\lambda^2\eta^2),$$
$$\lambda_2 = \alpha,$$
$$\lambda_3 = \frac{(1+\alpha-\lambda\eta)^2 - (1+\alpha-\lambda\eta)\sqrt{(1+\alpha-\lambda\eta)^2-4\alpha}}{2} - \alpha = \alpha^2+\frac{2\alpha^2}{1-\alpha}\lambda\eta + O(\lambda^2\eta^2),$$
and they satisfy $0 < \lambda_3 < \lambda_2 = \alpha < \lambda_1 < 1$.
Therefore, $A$ can be diagonalized as $S^{-1}AS = \Lambda$, where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues of $A$, and the columns of $S$ are the corresponding eigenvectors of $A$; note the formulation of $S$, $\Lambda$ is not unique. Specifically, we set
\[
\Lambda = \begin{pmatrix}\lambda_1 & 0 & 0\\ 0 & \lambda_2 & 0\\ 0 & 0 & \lambda_3\end{pmatrix},\qquad
S = \begin{pmatrix} 1 & 1 & 1\\ \frac{1+\alpha-\lambda\eta}{\alpha+\lambda_1} & \frac{1+\alpha-\lambda\eta}{\alpha+\lambda_2} & \frac{1+\alpha-\lambda\eta}{\alpha+\lambda_3} \\ \frac{1}{\lambda_1} & \frac{1}{\lambda_2} & \frac{1}{\lambda_3}\end{pmatrix}. \tag{72, 73}
\]
Moreover, the inverse of $S$ exists and can be explicitly expressed as
\[
S^{-1} = \begin{pmatrix}
\frac{(\alpha+\lambda_1)\lambda_1}{(\lambda_1-\alpha)(\lambda_1-\lambda_3)} & \frac{-2\lambda_1(\alpha+\lambda_1)(\alpha+\lambda_3)}{(\lambda_1-\lambda_3)(\lambda_1-\alpha)(1+\alpha-\lambda\eta)} & \frac{\lambda_1\lambda_3(\alpha+\lambda_1)}{(\lambda_1-\alpha)(\lambda_1-\lambda_3)} \\
\frac{-2\alpha^2}{(\lambda_1-\alpha)(\alpha-\lambda_3)} & \frac{2\alpha(\alpha+\lambda_1)(\alpha+\lambda_3)}{(\lambda_1-\alpha)(\alpha-\lambda_3)(1+\alpha-\lambda\eta)} & \frac{-2\alpha\lambda_1\lambda_3}{(\lambda_1-\alpha)(\alpha-\lambda_3)} \\
\frac{(\alpha+\lambda_3)\lambda_3}{(\alpha-\lambda_3)(\lambda_1-\lambda_3)} & \frac{-2\lambda_3(\alpha+\lambda_3)(\alpha+\lambda_1)}{(\lambda_1-\lambda_3)(\alpha-\lambda_3)(1+\alpha-\lambda\eta)} & \frac{(\alpha+\lambda_3)\lambda_1\lambda_3}{(\alpha-\lambda_3)(\lambda_1-\lambda_3)}
\end{pmatrix}. \tag{74}
\]
Let $Y_t = S^{-1}X_t$. Combining with Eq.(66), we have
\[
Y_{t+1} = \Lambda Y_t + \frac{h_t\eta^2}{(S^Te_1)^TY_t}\,S^{-1}e_1. \tag{75}
\]
Combining with Eq.(73) and Eq.(74), and setting $Y_t = (\tilde a_t, \tilde b_t, \tilde c_t)^T$, we rewrite Eq.(75) as
\[
\tilde a_{t+1} = \lambda_1\tilde a_t + \frac{h_t\eta^2}{\tilde a_t+\tilde b_t+\tilde c_t}\cdot\frac{(\alpha+\lambda_1)\lambda_1}{(\lambda_1-\alpha)(\lambda_1-\lambda_3)}, \tag{76}
\]
\[
\tilde b_{t+1} = \alpha\tilde b_t - \frac{h_t\eta^2}{\tilde a_t+\tilde b_t+\tilde c_t}\cdot\frac{2\alpha^2}{(\lambda_1-\alpha)(\alpha-\lambda_3)}, \tag{77}
\]
\[
\tilde c_{t+1} = \lambda_3\tilde c_t + \frac{h_t\eta^2}{\tilde a_t+\tilde b_t+\tilde c_t}\cdot\frac{(\alpha+\lambda_3)\lambda_3}{(\alpha-\lambda_3)(\lambda_1-\lambda_3)}. \tag{78}
\]
Note $\|w_t\|_2^2 = \tilde a_t+\tilde b_t+\tilde c_t$. Now we prove the following inequalities by mathematical induction:
\[
\tilde b_t < 0, \tag{79}
\]
\[
\tilde c_t > 0, \tag{80}
\]
\[
(\alpha-\lambda_1)\tilde b_t > (\lambda_1-\lambda_3)\tilde c_t, \tag{81}
\]
\[
\tilde a_t+\tilde b_t+\tilde c_t > 0, \tag{82}
\]
\[
\tilde a_{t+1}+\tilde b_{t+1}+\tilde c_{t+1} > \lambda_1(\tilde a_t+\tilde b_t+\tilde c_t) + \frac{h_t\eta^2}{\tilde a_t+\tilde b_t+\tilde c_t}. \tag{83}
\]
Since the start point is $X_1 = (a_1, a_1, a_1)^T$ ($a_1 > 0$), the start point is $Y_1 = S^{-1}X_1$. Combining with Eq.(74), we have
\[
\tilde b_1 = -\frac{2\alpha^2\lambda\eta}{(\lambda_1-\alpha)(\alpha-\lambda_3)}a_1,\qquad
\tilde c_1 = \frac{\lambda_3(\lambda_3+\alpha)(1-\alpha+\lambda\eta)}{(\lambda_3-\alpha)(\lambda_1-\lambda_3)(1+\alpha-\lambda\eta)}\Big(\frac{1-\alpha-\lambda\eta}{1-\alpha+\lambda\eta}-\lambda_1\Big)a_1, \tag{84, 85}
\]
by which we have $(\alpha-\lambda_1)\tilde b_1 > (\lambda_1-\lambda_3)\tilde c_1$. Besides, $\tilde a_1+\tilde b_1+\tilde c_1 = e_1^TSY_1 = e_1^TX_1 = a_1 > 0$. Suppose for $t = T$, Eq.
(79), (80), (81), (82) hold. Combining with Eq.(77), (78), we can derive $\tilde b_{T+1} < 0$, $\tilde c_{T+1} > 0$, so Eq.(79), (80) hold for $t = T+1$. Besides, we have
\[
(\alpha-\lambda_1)\tilde b_{T+1} = \alpha(\alpha-\lambda_1)\tilde b_T + \frac{h_T\eta^2}{\tilde a_T+\tilde b_T+\tilde c_T}\cdot\frac{2\alpha^2}{\alpha-\lambda_3} > \lambda_3(\lambda_1-\lambda_3)\tilde c_T + \frac{h_T\eta^2}{\tilde a_T+\tilde b_T+\tilde c_T}\cdot\frac{(\alpha+\lambda_3)\lambda_3}{\alpha-\lambda_3} = (\lambda_1-\lambda_3)\tilde c_{T+1},
\]
thus Eq.(81) holds for $t = T+1$. Summing Eq.(76), (77), (78), due to Eq.(81) we have
\[
\tilde a_{T+1}+\tilde b_{T+1}+\tilde c_{T+1} = \lambda_1\tilde a_T+\alpha\tilde b_T+\lambda_3\tilde c_T+\frac{h_T\eta^2}{\tilde a_T+\tilde b_T+\tilde c_T} > \lambda_1(\tilde a_T+\tilde b_T+\tilde c_T)+\frac{h_T\eta^2}{\tilde a_T+\tilde b_T+\tilde c_T}, \tag{87}
\]
so Eq.(83) holds for $t = T+1$; combining with the fact that $\tilde a_T+\tilde b_T+\tilde c_T > 0$, we have $\tilde a_{T+1}+\tilde b_{T+1}+\tilde c_{T+1} > 0$. According to Lemma 2, we can estimate the lower bound of $\tilde a_t+\tilde b_t+\tilde c_t$: when $t > 1 + \frac{\log((\sqrt2-1)\sqrt{l\eta/(4\lambda)})-\log(|\|w_0\|_2^2-\sqrt{l\eta/(4\lambda)}|)}{\log(1-2\lambda\eta)}$,
\[
\tilde a_t+\tilde b_t+\tilde c_t \geq \sqrt{\frac{l\eta^2}{2(1-\lambda_1)}} \approx \sqrt{\frac{l\eta(1-\alpha)}{4\lambda}}. \tag{88}
\]
Let $Y^* = (\tilde a^*, \tilde b^*, \tilde c^*)^T$ denote the fixed point of Eq.(89), and denote $x_t = \tilde a_t+\tilde b_t+\tilde c_t$, $x^* = \tilde a^*+\tilde b^*+\tilde c^* > 0$. Then
\[
Y_{t+1}-Y^* = \Big(\Lambda - \frac{L\eta^2}{x_tx^*}ke^T\Big)(Y_t-Y^*) + \frac{(h_t-L)\eta^2}{x_t}k, \tag{90}
\]
where $k = (k_1,k_2,k_3)^T = S^{-1}e_1$ and $e = (1,1,1)^T$. In the following context, we will omit the $O(\lambda^2\eta^2)$ part since $\lambda\eta \ll 1$. $k_1, k_2, k_3$ can be approximated as
\[
k_1 = \frac{1}{(1-\alpha)^2}+O(\beta),\qquad k_2 = -\frac{2\alpha}{(1-\alpha)^2}+O(\beta),\qquad k_3 = \frac{\alpha^2}{(1-\alpha)^2}+O(\beta), \tag{91–93}
\]
where $\beta = \lambda\eta$. Then we have
\[
\mathbb{E}[\|Y_{t+1}-Y^*\|_2^2\mid Y_t] = \Big\|\Big(\Lambda-\frac{L\eta^2}{x_tx^*}ke^T\Big)(Y_t-Y^*)\Big\|_2^2 + \mathbb{E}[(h_t-L)^2\mid Y_t]\,\frac{\eta^4}{x_t^2}\|k\|_2^2. \tag{94}
\]
The fixed point of Eq.(89) satisfies
\[
(x^*)^2 = \frac{L\eta}{\lambda(1-\alpha)\big(2-\frac{\lambda\eta}{1+\alpha}\big)},
\]
and we have known that if $t > 1 + \frac{\log((\sqrt2-1)\sqrt{l\eta/(4\lambda)})-\log(|\|w_0\|_2^2-\sqrt{l\eta/(4\lambda)}|)}{\log(1-2\lambda\eta)}$, then $x_t > \sqrt{\frac{l\eta(1-\alpha)}{4\lambda}}$; therefore we have
\[
\frac{L\eta^2}{x_tx^*} < \frac{2L}{l}\cdot2\lambda\eta,\qquad \mathbb{E}[(h_t-L)^2\mid Y_t]\,\frac{\eta^4}{x_t^2} < \frac{4V\eta^3\lambda}{l(1-\alpha)}. \tag{95, 96}
\]
According to assumption 3, we can prove $\frac{2L}{l} < \frac{(1-\alpha)^2}{3}\big[\frac{1-\alpha}{2\beta}-\frac{8}{1-\alpha}\big]$; combining with Lemma 3 (with $\beta = \lambda\eta$), we have
\[
\mathbb{E}[\|Y_{t+1}-Y^*\|_2^2\mid Y_t] < \Big(1-\frac{4\lambda\eta}{1-\alpha}\Big)\|Y_t-Y^*\|_2^2 + \frac{4V\eta^3\lambda\|k\|_2^2}{l(1-\alpha)}, \tag{97}
\]
which implies
\[
\mathbb{E}\|Y_{t+1}-Y^*\|_2^2 < \Big(1-\frac{4\lambda\eta}{1-\alpha}\Big)^{t-T}\mathbb{E}\|Y_T-Y^*\|_2^2 + \frac{V\eta^2(1+4\alpha^2+\alpha^4)}{l(1-\alpha)^4}, \tag{100}
\]
where $T = 1 + \frac{\log((\sqrt2-1)\sqrt{l\eta/(4\lambda)})-\log(|\|w_0\|_2^2-\sqrt{l\eta/(4\lambda)}|)}{\log(1-2\lambda\eta)}$.
Therefore, similar to the proof of Theorem 1, $\exists B > 0$ such that
\[
\mathbb{E}\|Y_{t+1}-Y^*\|_2^2 < \Big(1-\frac{4\lambda\eta}{1-\alpha}\Big)^tB + \frac{V\eta^2(1+4\alpha^2+\alpha^4)}{l(1-\alpha)^4}. \tag{101}
\]
Recall $\|w_t\|_2^2 = e^TY_t$; therefore
\[
\mathbb{E}\big[\|w_t\|_2^2-(w^*)^2\big]^2 \leq 3\,\mathbb{E}\|Y_t-Y^*\|_2^2 < 3\Big(1-\frac{4\lambda\eta}{1-\alpha}\Big)^tB + \frac{3V\eta^2(1+4\alpha^2+\alpha^4)}{l(1-\alpha)^4}. \tag{102}
\]
Remark 5. By Eq.(101), the variance of $\|w_t\|_2^2$ is bounded by $\frac{3V\eta^2(1+4\alpha^2+\alpha^4)}{l(1-\alpha)^4}$, which is not tight enough. But we can reduce it somewhat: according to Eq.(96), if $x_t$ is close to its theoretical value $x^*$, then $\mathbb{E}[(h_t-L)^2\mid Y_t]\frac{\eta^4}{x_t^2} < \frac{2V\eta^3\lambda(1-\alpha)}{L}$, hence the variance of $\|w_t\|_2^2$ can be bounded by $\frac{3V\eta^2(1+4\alpha^2+\alpha^4)}{2L(1-\alpha)^2}$.
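The fixed-point value $(x^*)^2 = \frac{L\eta}{\lambda(1-\alpha)(2-\lambda\eta/(1+\alpha))}$ can be sanity-checked by iterating the deterministic norm recursion of SGDM (Eqs.(60), (61)) with a constant $h_t = L$. A minimal sketch with toy values (all of $L$, $\lambda$, $\eta$, $\alpha$, and the initialization are our own choices, satisfying $\lambda\eta < (1-\sqrt{\alpha})^2$):

```python
import math

# Toy values (our choice); lam*eta must satisfy lam*eta < (1 - sqrt(alpha))**2
L, lam, eta, alpha = 1.0, 0.01, 0.1, 0.9
s = 1 + alpha - lam * eta                 # shorthand for 1 + alpha - lambda*eta
a, b, c = 25.0, 25.0, 25.0                # (||w_t||^2, <w_t, w_{t-1}>, ||w_{t-1}||^2)
for _ in range(20000):
    a, b, c = (s * s * a - 2 * alpha * s * b + alpha * alpha * c + L * eta * eta / a,  # Eq.(60), h_t = L
               s * a - alpha * b,                                                      # Eq.(61)
               a)

# Fixed point of ||w_t||^2 from the proof of Theorem 2
x_star = math.sqrt(L * eta / (lam * (1 - alpha) * (2 - lam * eta / (1 + alpha))))
```

After the transient, `a` settles at `x_star`, and the ratio `b / a` settles at $(1+\alpha-\lambda\eta)/(1+\alpha)$, consistent with the fixed point $b^*$ used in the proof of Theorem 3.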

B.4 PROOF OF THEOREM 3

Proof. In the SGD case, we have $\langle w_{t+1}, w_t\rangle = (1-\lambda\eta)\|w_t\|_2^2$; then we have
\[
\cos^2\Delta_t = \frac{\langle w_{t+1},w_t\rangle^2}{\|w_t\|_2^2\cdot\|w_{t+1}\|_2^2} = (1-\lambda\eta)^2\frac{\|w_t\|_2^2}{\|w_{t+1}\|_2^2}. \tag{103}
\]
According to the definition of $\Delta_t$, $\Delta_t \geq 0$, and $\Delta_t$ is very close to 0, hence we have
\[
\Delta_t \approx \sin\Delta_t = \sqrt{1-\cos^2\Delta_t} = \sqrt{1-(1-\lambda\eta)^2\frac{\|w_t\|_2^2}{\|w_{t+1}\|_2^2}} = \sqrt{1-(1-\lambda\eta)^2\frac{x_t}{x_{t+1}}}, \tag{104–106}
\]
where $x_t, x_{t+1}$ denote $\|w_t\|_2^2, \|w_{t+1}\|_2^2$ respectively, as in Eq.(42). Assume $t$ is sufficiently large so that $x_t, x_{t+1}$ are close to $x^* = \sqrt{L\eta/(2\lambda)}$. The first-order Taylor expansion of Eq.(106) at $x_t = x_{t+1} = x^*$ is
\[
\Delta_t = \sqrt{2\lambda\eta} + \frac{(1-\lambda\eta)^2}{2\sqrt{2\lambda\eta}}\cdot\frac{1}{x^*}\cdot[(x_{t+1}-x^*)-(x_t-x^*)]. \tag{107}
\]
Reorganizing Eq.(107) and applying the Cauchy inequality, we have
\[
|\Delta_t-\sqrt{2\lambda\eta}|^2 = \frac{(1-\lambda\eta)^4}{8\lambda\eta}\cdot\frac{1}{(x^*)^2}\cdot[(x_{t+1}-x^*)-(x_t-x^*)]^2 \leq \frac{1}{8\lambda\eta}\cdot\frac{2\lambda}{L\eta}\cdot 2[(x_t-x^*)^2+(x_{t+1}-x^*)^2] = \frac{1}{2L\eta^2}[(x_t-x^*)^2+(x_{t+1}-x^*)^2]. \tag{108–110}
\]
Combining with Eq.(51) and Eq.(108), we have
\[
\mathbb{E}|\Delta_t-\sqrt{2\lambda\eta}|^2 = \frac{1}{2L\eta^2}[\mathbb{E}(x_t-x^*)^2+\mathbb{E}(x_{t+1}-x^*)^2] \leq (1-4\lambda\eta)^tC + \frac{V}{Ll}, \tag{111}
\]
where $C = \frac{1}{2L\eta^2}\cdot B$, and $B$ is defined in Eq.(51). In the SGDM case, the angular update can be computed by
\[
\Delta_t \approx \sin\Delta_t = \sqrt{1-\cos^2\Delta_t} = \sqrt{1-\frac{\langle w_t,w_{t+1}\rangle^2}{\|w_t\|_2^2\cdot\|w_{t+1}\|_2^2}} = \sqrt{1-\frac{b_t^2}{a_tc_t}}, \tag{112–115}
\]
where $(a_t,b_t,c_t) = (\|w_t\|_2^2,\ \langle w_t,w_{t+1}\rangle,\ \|w_{t+1}\|_2^2)$. According to the proof of Theorem 2, when $t$ is sufficiently large, $(a_t,b_t,c_t)$ will be close to the fixed point $(a^*,b^*,c^*)$ of Eq.(66), where
\[
a^* = c^* = \sqrt{\frac{L\eta}{\lambda(1-\alpha)\big(2-\frac{\lambda\eta}{1+\alpha}\big)}},\qquad b^* = \frac{1+\alpha-\lambda\eta}{1+\alpha}a^*.
\]
Then the first-order Taylor expansion of Eq.(115) at $(a_t,b_t,c_t) = (a^*,b^*,c^*)$ is
\[
\Delta_t = \sqrt{1-\frac{b_t^2}{a_tc_t}} \approx \sqrt{\frac{2\lambda\eta}{1+\alpha}} + \frac{\sqrt{1+\alpha}}{2\sqrt{2\lambda\eta}}\Big(1-\frac{\lambda\eta}{1+\alpha}\Big)^2\Big[\frac{a_t-a^*}{a^*}-\frac{2(b_t-b^*)}{b^*}+\frac{c_t-c^*}{c^*}\Big]. \tag{116, 117}
\]
Now substituting $(a_t,b_t,c_t)$ with $(\tilde a_t,\tilde b_t,\tilde c_t)$ defined in Eq.
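The SGD part of this proof can also be checked empirically with a toy spherical-motion simulation: the weight receives a random unit gradient orthogonal to itself (mimicking scale invariance) plus weight decay, and the measured angle between consecutive weights should settle at $\sqrt{2\lambda\eta}$. All parameter values below (dimension, $\eta$, $\lambda$, iteration counts) are our own toy choices:

```python
import math
import random

random.seed(0)
d, eta, lam, T = 50, 0.1, 0.01, 4000
w = [1.0] * d  # initial weight, far from equilibrium

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

angles = []
for t in range(T):
    g = [random.gauss(0, 1) for _ in range(d)]
    # project g orthogonal to w (scale invariance makes the true gradient perpendicular to w)
    coef = dot(g, w) / dot(w, w)
    g = [gi - coef * wi for gi, wi in zip(g, w)]
    gn = math.sqrt(dot(g, g))
    g = [gi / gn for gi in g]           # unit gradient with ||g|| = 1, i.e. L = 1
    wn = math.sqrt(dot(w, w))
    # SGD with WD on a scale-invariant weight: w <- (1 - lam*eta) w - eta * g / ||w||
    w_new = [(1 - lam * eta) * wi - eta * gi / wn for wi, gi in zip(w, g)]
    cosine = dot(w, w_new) / (wn * math.sqrt(dot(w_new, w_new)))
    angles.append(math.acos(min(1.0, cosine)))
    w = w_new

final_norm_sq = dot(w, w)
delta_emp = sum(angles[-500:]) / 500
delta_theory = math.sqrt(2 * lam * eta)   # Theorem 3, SGD case
```

With these values the averaged late-phase angle lands within a fraction of a percent of $\sqrt{2\lambda\eta}$, and $\|w\|_2^2$ settles near $\sqrt{L\eta/(2\lambda)}$, matching Theorem 1.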
(75, 76, 77, 78), Eq.(117) is rewritten as
\[
\Delta_t = \sqrt{\frac{2\lambda\eta}{1+\alpha}} + \frac{\sqrt{1+\alpha}}{2\sqrt{2\lambda\eta}}\Big(1-\frac{\lambda\eta}{1+\alpha}\Big)^2\frac{1}{a^*}\Big[\frac{(1-\alpha)^2}{\alpha^2}(\tilde c_t-\tilde c^*)+O(\lambda\eta)\Big], \tag{118}
\]
where $(\tilde a^*, \tilde b^*, \tilde c^*)$ is the fixed point of Eq.(89). Omitting $O(\lambda\eta)$, we have
\[
\Big|\Delta_t-\sqrt{\tfrac{2\lambda\eta}{1+\alpha}}\Big|^2 = \frac{1+\alpha}{8\lambda\eta}\Big(1-\frac{\lambda\eta}{1+\alpha}\Big)^4\frac{\lambda(1-\alpha)\big(2-\frac{\lambda\eta}{1+\alpha}\big)}{L\eta}\Big[\frac{(1-\alpha)^4}{\alpha^4}(\tilde c_t-\tilde c^*)^2\Big] \leq \frac{(1-\alpha^2)(1-\alpha)^4}{4L\eta^2\alpha^4}\|Y_t-Y^*\|_2^2, \tag{119, 120}
\]
where $Y_t, Y^*$ are defined in Eq.(75), (89) respectively. According to Eq.(100), the mean square error $\mathbb{E}|\Delta_t-\sqrt{\frac{2\lambda\eta}{1+\alpha}}|^2$ can be bounded by
\[
\mathbb{E}\Big|\Delta_t-\sqrt{\tfrac{2\lambda\eta}{1+\alpha}}\Big|^2 \leq \frac{(1-\alpha^2)(1-\alpha)^4}{4L\eta^2\alpha^4}\,\mathbb{E}\|Y_t-Y^*\|_2^2 \leq \Big(1-\frac{4\lambda\eta}{1-\alpha}\Big)^tC + \frac{(1-\alpha^2)(1-\alpha)^4}{4L\eta^2\alpha^4}\cdot\frac{V\eta^2(1+4\alpha^2+\alpha^4)}{l(1-\alpha)^4} = \Big(1-\frac{4\lambda\eta}{1-\alpha}\Big)^tC + \frac{V(1-\alpha^2)(1+4\alpha^2+\alpha^4)}{4Ll\alpha^4}, \tag{121–123}
\]
where $C = \frac{(1-\alpha^2)(1-\alpha)^4}{4L\eta^2\alpha^4}\cdot B$, and $B$ is defined in Eq.(101).

B.5 PROOF OF COROLLARY 3.1

Proof of Corollary 3.1. In the SGD case, by Eq.(42), we have
\[
\|w_{t+1}\|_2^2 > (1-2\lambda\eta)\|w_t\|_2^2,
\]
which means
\[
\|w_{t+T}\|_2^2 > (1-2\lambda\eta)^T\|w_t\|_2^2. \tag{125}
\]
On the other hand, we know that when $\eta$ is divided by $k$, $\|w_t\|_2^2$ should be divided by $\sqrt{k}$ to reach the new equilibrium condition; therefore we have
\[
\frac{\|w_{t+T}\|_2^2}{\|w_t\|_2^2} = \frac{1}{\sqrt{k}} > (1-2\lambda\eta)^T. \tag{126}
\]
Since $\lambda\eta \ll 1$, $\log(1-2\lambda\eta) \approx -2\lambda\eta$, thus
\[
T > \frac{\log(k)}{4\lambda\eta}.
\]
In the SGDM case, by Eq.(87), we have
\[
\|w_{t+1}\|_2^2 > \Big(1-\frac{2\lambda\eta}{1-\alpha}\Big)\|w_t\|_2^2.
\]
Similar to the SGD case, we have
\[
T > \frac{\log(k)(1-\alpha)}{4\lambda\eta}.
\]
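Corollary 3.1's lower bound on the number of iterations needed to travel from the old equilibrium to the new one is straightforward to evaluate numerically. A small sketch (the helper name and the example values are ours; $\lambda$, $\eta$ here denote the post-decay values):

```python
import math

def transition_iters(lam, eta, k, alpha=0.0):
    """Lower bound on the number of iterations to reach the new equilibrium
    after the learning rate is divided by k (Corollary 3.1).
    alpha = 0 gives the SGD bound log(k)/(4*lam*eta);
    alpha > 0 gives the SGDM bound log(k)*(1-alpha)/(4*lam*eta)."""
    return math.log(k) * (1 - alpha) / (4 * lam * eta)

# Example (values are ours): lambda = 1e-4, post-decay eta = 0.01, lr divided by k = 10
t_sgd = transition_iters(1e-4, 0.01, 10)         # roughly 5.8e5 iterations
t_sgdm = transition_iters(1e-4, 0.01, 10, 0.9)   # momentum shrinks the bound by (1 - alpha)
```

This makes concrete how long the transition between equilibria can take relative to a typical training schedule, which motivates the rescaling trick discussed in the experiments.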

C EXPERIMENTS

C.1 EXPERIMENTS ON SYNTHETIC DATA

In this section we conduct experiments on synthetic data to support our claim in Section 4: Theorem 1 only requires the expected square norm of the unit gradient to be locally steady; the expectation does not need to remain unchanged across the whole training process. The proof of the theorem implies that the squared weight norm is determined by the following iterative map:
\[
x_{t+1} = (1-2\lambda\eta)x_t + \frac{L_t\eta^2}{x_t}, \tag{130}
\]
where $\lambda, \eta \in (0,1)$, and $L_t$ denotes the squared norm of the unit gradient. Hence we simulate $x_t$ with different types of $\{L_t\}_{t=1}^{\infty}$. Results in Figure 4 show that as long as the local variance of the squared unit gradient norm is not too large, and the expectation of $L_t$ changes smoothly, the weight norm quickly converges to its theoretical value based on the expectation of the squared unit gradient norm. We also simulate the SGDM case by the following iterative map:
\[
X_{t+1} = AX_t + \frac{L_t\eta^2}{X_t[0]}e,
\]
where $A$, $X_t$, $e$ are defined as in Eq.(62), (63), (64), and $X_t[0]$ denotes the first component of $X_t$, i.e. $\|w_t\|_2^2$. Simulation results are shown in Figure 5.
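As a concrete instance of Eq.(130), the following minimal script reproduces the convergence of the simulated squared weight norm to its theoretical value $\sqrt{L\eta/(2\lambda)}$. Parameters follow the Figure 4 setting where given ($\eta = 0.1$, $\lambda = 0.001$, $x_0 = 10$, noise $\sim U(-3,3)$); the base level $L = 10$ of the noisy $L_t$ is our own choice:

```python
import math
import random

random.seed(0)
eta, lam, L, x0, T = 0.1, 0.001, 10.0, 10.0, 30000
x = x0
trace = []
for t in range(T):
    L_t = L + random.uniform(-3, 3)           # locally steady, noisy squared unit-gradient norm
    x = (1 - 2 * lam * eta) * x + L_t * eta**2 / x   # Eq.(130)
    trace.append(x)

x_emp = sum(trace[-2000:]) / 2000             # average over the last iterations
x_star = math.sqrt(L * eta / (2 * lam))       # theoretical equilibrium of ||w_t||^2
```

Despite the per-step noise in $L_t$, the simulated trajectory settles within a few percent of `x_star`, in line with the variance bound $V\eta^2/l$ from Theorem 1.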

C.2 COMPLEMENTARY IN MULTI-STAGE LEARNING RATE SCHEDULE

In this section we present complementary results for the multi-stage learning rate schedule experiment. The plots of weight norm (empirical and predicted values) in each learning rate stage are shown in Figure 6. We also present the test performance of ResNet50/Mask R-CNN on ImageNet/MSCOCO with the multi-stage learning rate schedule mentioned in Section 5.2. We provide these complementary results for reference only; we do not intend to prove the advantages or disadvantages of the rescaling strategy here, as that is beyond the discussion of this paper. The experiment results (Figure 7, 8) suggest that the assumption of LSP does not always hold in practice, for three reasons. First, the approximate equivalence between a single iteration in the large batch setting and multiple iterations in the small batch setting can only hold in the pure SGD formulation, but the momentum method is far more commonly used. Second, according to Theorem 2, the enlargement ratio of the angular update is only determined by the increase factor of the learning rate; Figure 7 shows that in practice the accumulated angular update $\angle(w_t, w_{t+k})$ in the small batch setting is much larger than the angular update $\angle(w_t, w_{t+1})$ of a single iteration in the larger batch setting when using the Linear Scaling Principle. Third, even in pure SGD cases, the enlargement of the angular update still relies on the increase of the learning rate, and has no obvious connection to the enlargement of the gradient's norm when the equilibrium condition is reached (see Figure 8). In conclusion, though LSP usually works well in practical applications, SMD suggests we can find more sophisticated and reasonable schemes to tune the learning rate when the batch size increases.
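The second point can be made concrete with the equilibrium angular update from Theorem 3: under LSP ($\eta \to k\eta$) the theoretical per-iteration angular update grows by exactly $\sqrt{k}$, independent of batch size and gradient noise. A small sketch (the function name and the example values $\lambda = 10^{-4}$, $\eta = 0.1$ are ours):

```python
import math

def angular_update(lam, eta, alpha=None):
    # Equilibrium angular update from Theorem 3:
    # sqrt(2*lam*eta) for SGD, sqrt(2*lam*eta/(1+alpha)) for SGDM.
    if alpha is None:
        return math.sqrt(2 * lam * eta)
    return math.sqrt(2 * lam * eta / (1 + alpha))

lam, eta = 1e-4, 0.1
# LSP multiplies the learning rate by k when the batch size is multiplied by k
ratios = [angular_update(lam, k * eta) / angular_update(lam, eta) for k in (1, 4, 16)]
# the enlargement ratio is sqrt(k), regardless of batch size or gradient norm
```

The same $\sqrt{k}$ ratio holds for the SGDM formula, since $\alpha$ enters both numerator and denominator identically.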

C.4 SPHERICAL MOTION DYNAMICS WITH DIFFERENT NETWORK STRUCTURES

We also verify our theory on other commonly used network structures (MobileNet-V2 (Sandler et al., 2018), ShuffleNet-V2+ (Ma et al., 2018)) with standard training settings. The results are shown in Figure 9.



"Weight norm remains unchanged" means $\|w_t\|_2 \approx \|w_{t+1}\|_2$. Chiley et al. (2019) call this condition "equilibrium", and we also use this term in the rest of this paper. Note the equilibrium condition is not mathematically rigorous; we only use it for intuitive analysis.



$\frac{1}{T}\sum_{t=0}^{T}\|w_t\|$ and $\lim_{T\to\infty} D_T$, where $D_T = \frac{1}{T}\sum_{t=0}^{T}\|w_t-w_{t+1}\|$, exist, then we have $\lim$

Remark 3. The theoretical value of the angular update in Theorem 3 is consistent with Eq.(4, 5) from Chiley et al. (2019); Li & Arora (2020) respectively. Notice the variance term in Eq.(21, 22) is of order $O(V/(Ll))$; it is too large compared with its empirical value, and we leave improving the bound of the variance term as future work. Though the connection between the performance of a neural network and the equilibrium condition is beyond the main discussion of this paper, the findings of a parallel work (Li et al., 2020b) hint at a possible advantage of momentum: Li et al. (2020b) interpret that a smaller $\lambda\eta$ can reach higher performance when training DNN, but the approaching rate ($O(\lambda\eta)$) is slow, while a larger $\lambda\eta$ has a faster approaching rate but only reaches worse performance. Inspired by Li et al.

(see Figure 3(a), 3(b), 3(d), 3(e)).

Figure 2: Performance of layer.2.0.conv2 from ResNet50 in SGD and SGDM, respectively. In (a), (d), semitransparent lines represent the raw value of $\|g_t\|_2^2$ or $h_t$, while solid lines represent the value averaged within 200 consecutive iterations to estimate the expectation of $\|g_t\|_2^2$ or $h_t$. In (b), (e), blue solid lines show the raw value of the weight norm $\|w_t\|_2$, while dashed lines represent the theoretical value of the weight norm computed in Theorem 1, 2 respectively; the expectations $\mathbb{E}\|g\|_2^2$ and $\mathbb{E}h$ are estimated as the solid lines in (a) and (d) respectively. In (c), (f), red lines represent the raw value of the angular update during training, and dashed lines represent the theoretical value of the angular update computed by $\sqrt{2\lambda\eta}$ and $\sqrt{2\lambda\eta/(1+\alpha)}$ respectively.

Figure 3: In (a), (b), (d), (e), solid lines with different colors represent the raw value of the angular update of weights from all convolutional layers in the model. In (a), (d), the training settings rigorously follow Goyal et al. (2017); He et al. (2019) respectively. In (b), (e), the weight norm is divided by $\sqrt[4]{10}$ whenever the learning rate is divided by 10. In (c), (f), the weight norm is computed on layer.1.0.conv2 in the ResNet50 backbone; blue lines represent the original settings, orange lines represent the rescaled settings.

Now we can analyze the expectation of the distance ($l_2$ norm) between $Y_t = (\tilde a_t, \tilde b_t, \tilde c_t)^T$ and the fixed point $Y^* = (\tilde a^*, \tilde b^*, \tilde c^*)^T$, which satisfies
\[
Y^* = \Lambda Y^* + \frac{L\eta^2}{(S^Te_1)^TY^*}S^{-1}e_1. \tag{89}
\]
Denote $x_t = \tilde a_t+\tilde b_t+\tilde c_t$, $x^* = \tilde a^*+\tilde b^*+\tilde c^* > 0$; then we have:

Figure 4: Simulation of SGD (Eq.(130)), $\eta = 0.1$, $\lambda = 0.001$, $x_0 = 10$, noise term $\sim U(-3, 3)$. Orange lines represent the squared norm of the unit gradient; blue solid lines represent the simulated value of the squared weight norm; black dashed lines represent the theoretical value of the squared weight norm.

(a) B = 256 versus B = 1024 (b) B = 256 versus B = 4096

Figure 7: Angular update of weights from layer1.0.conv2 in ResNet50. The blue lines represent the angular update of weights within a single iteration when the batch size is B = 1024 (4096); the red lines represent the accumulated angular update within 4 (16) iterations in the smaller batch setting (B = 256).

Figure 8: Enlargement ratio of the gradient norm of weights from layer1.0.conv2 when the batch size increases. $\|g^{(k)}\|$ represents the gradient norm computed using $k$ samples (not averaged).

Definition 1 (Unit Gradient). If $w_t \neq 0$, let $\hat w = w/\|w\|_2$; the unit gradient of $\partial\mathcal{L}/\partial w|_{w=w_t}$ is $\partial\mathcal{L}/\partial\hat w|_{\hat w=\hat w_t}$.

Based on our theorem, we can bridge the gap by skipping the intermediate process from the old equilibrium to the new one. Specifically, when the learning rate is divided by $k$, the norm of each scale-invariant weight is also divided by $\sqrt[4]{k}$.
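This rescaling rule can be sketched as a tiny helper (hypothetical function name; `weights` stands for the collection of scale-invariant weight tensors, represented here as plain nested lists):

```python
def rescale_at_decay(weights, k):
    """When the learning rate is divided by k, divide each scale-invariant
    weight by k**0.25, so its norm lands directly on the new equilibrium
    (the equilibrium ||w||^2 scales with sqrt(eta), hence ||w|| with eta**0.25).
    Sketch only: `weights` is a list of flat weight vectors (lists of floats)."""
    factor = k ** 0.25
    return [[wi / factor for wi in w] for w in weights]
```

In a real training loop this would be applied to every normalized layer's weight at each learning rate drop, leaving scale-dependent parameters (e.g. the last linear layer, $\gamma$, $\beta$) untouched.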

Performance of ResNet50 on ImageNet with multi-stage learning rate scheduler.

C.3 RETHINKING LINEAR SCALING PRINCIPLE IN SPHERICAL MOTION DYNAMICS

In this section, we discuss the effect of the Linear Scaling Principle (LSP) from the viewpoint of SMD. The Linear Scaling Principle was proposed by Goyal et al. (2017) to tune the learning rate $\eta$ with batch size $B$ by $\eta \propto B$. The intuition of LSP is that if the weights do not change too much within $k$ iterations, then $k$ iterations of SGD with learning rate $\eta$ and minibatch size $B$ (Eq.(132)) can be approximated by a single iteration of SGD with learning rate $k\eta$ and minibatch size $kB$ (Eq.(133)). Goyal et al. (2017) show that, combined with gradual warmup, LSP can enlarge the batch size up to 8192 (256 × 32) without severe degradation in ImageNet experiments. LSP has proven extremely effective in a wide range of applications. However, from the perspective of SMD, the angular update mostly relies on the pre-defined hyper-parameters, and it is hardly affected by batch size. To clarify the connection between LSP and SMD, we explore the learning dynamics of DNN with different batch sizes by conducting extensive experiments with ResNet50 on ImageNet; the training settings rigorously follow Goyal et al. (2017): momentum coefficient is $\alpha = 0.9$; WD coefficient is $\lambda = 10^{-4}$; batch size is denoted by $B$; learning rate is initialized as $\frac{B}{256}\cdot 0.1$; total training length is 90 epochs, and the learning rate is divided by 10 at epoch 30, 60, 80 respectively.


Figure 9: The angular update $\Delta_t$ of MobileNet-V2 (Sandler et al., 2018) and ShuffleNet-V2+ (Ma et al., 2018). The solid lines with different colors represent all scale-invariant weights from the model; the dashed black line represents the theoretical value of the angular update, computed by $\sqrt{2\lambda\eta/(1+\alpha)}$. Learning rate $\eta$ is initialized as 0.5, and divided by 10 at epoch 30, 60, 80 respectively; WD coefficient $\lambda$ is $4\times10^{-5}$; momentum parameter $\alpha$ is set as 0.9.

