HOW NORMALIZATION AND WEIGHT DECAY CAN AFFECT SGD? INSIGHTS FROM A SIMPLE NORMALIZED MODEL

Abstract

Recent works (Li

1. INTRODUCTION

Normalization (Ioffe & Szegedy, 2015; Wu & He, 2018) is one of the most widely used deep learning techniques and has become an indispensable part of almost all popular deep neural network architectures. Though the success of normalization techniques is indubitable, their underlying mechanism remains mysterious and has become a hot topic in deep learning. Many works have contributed to figuring out the mechanism of normalization from different aspects. While some works (Ioffe & Szegedy, 2015; Santurkar et al., 2018; Hoffer et al., 2018; Bjorck et al., 2018; Summers & Dinneen, 2019; De & Smith, 2020) focus on intuitive reasoning or empirical study, others (Dukler et al., 2020; Kohler et al., 2019; Cai et al., 2019; Arora et al., 2018; Yang et al., 2018; Wu et al., 2020) focus on establishing a theoretical foundation. A series of works (Van Laarhoven, 2017; Chiley et al., 2019; Kunin et al., 2021; Li et al., 2020; Wan et al., 2021; Lobacheva et al., 2021; Li & Arora, 2019) have noted that, in practical implementations, the gradient of normalized models is usually computed in a straightforward manner, which results in a scale-invariant property during training. The gradient of a scale-invariant weight is always orthogonal to the weight, which makes the training trajectory behave like motion on a sphere. Besides, in practice, many models are trained using SGD with Weight Decay (WD); hence normalization and WD in SGD can cause a so-called "equilibrium" state, in which the effects of the gradient and WD on the weight norm cancel out (see Fig. 1(a)). Although the concept of equilibrium was proposed long ago (Van Laarhoven, 2017), both theoretical justification and experimental evidence were lacking until recently.
Recent works (Li et al., 2020; Wan et al., 2021) justify the existence of equilibrium both theoretically and empirically, and characterize the underlying mechanism that yields equilibrium, named "Spherical Motion Dynamics" (SMD). In Wan et al. (2021) the authors further show that SMD exists in a wide range of computer vision tasks, including ImageNet (Deng et al., 2009) and MSCOCO (Lin et al., 2014). A more detailed review can be found in Appendix A. Though the existence of SMD, as well as some of its characteristics, has been confirmed both theoretically and empirically, we notice that no previous work has theoretically justified how SMD affects the evolution of the loss of normalized models. Although Li et al. (2020) and Wan et al. (2021) have made some attempts to explore the role of SMD in training through conjectures and empirical studies, their findings still lack theoretical justification. In hindsight, the main challenge in theoretically analyzing the effect of SMD is that SMD comes from the joint effect of normalization and WD, which can significantly distort the loss landscape (see Figure 1 (b)) and thus dramatically weaken some commonly used assumptions such as (local) convexity and Lipschitz continuity. Exploring the optimization trajectory on such a distorted loss landscape is already very challenging, let alone additionally taking SMD into account. In this paper, as the first significant attempt to overcome this challenge and study the effect of SMD on the evolution of the loss, we propose a simple yet representative normalized model and theoretically analyze how SMD influences the optimization trajectory. We adopt the SDE framework of Li et al. (2020) to derive the analytical results on the evolution of NRQ, and the concepts of Wan et al. (2021) to interpret the theoretical results we obtain in this paper.
Our contributions are:

• We design a simple normalized model, named Noisy Rayleigh Quotient (NRQ). NRQ possesses all the properties necessary to induce SMD, consistent with those in real neural networks. NRQ provides a powerful tool for analyzing how normalization affects first-order optimization algorithms;

• We derive analytical results on the limiting dynamics and the stationary distribution of NRQ. Our results show that the influence of SMD is mainly reflected in how the angular update (AU), a crucial feature of SMD, affects the convergence rate and the limiting risk. We discuss the influence of AU within equilibrium and beyond equilibrium respectively, and work out the association between the evolution of AU and the evolution of the optimization trajectory in NRQ;

• We show that the insights drawn from the theoretical results on NRQ can adequately interpret typical observations in deep learning experiments. Specifically, we confirm that the learning rate and WD play equivalent roles for scale-invariant weights in SGD. We show that Gaussian-type initialization can affect the training process only because it changes the evolution of AU at the beginning. We also confirm that, under certain conditions, SMD may induce "escape" behavior of the optimization trajectory, resulting in the "pseudo overfitting" phenomenon in practice.

2. NOISY RAYLEIGH QUOTIENT

2.1. PROBLEM SET UP

We use the Rayleigh Quotient (Horn & Johnson, 2012) as the objective function, defined as

$$L(X) = \frac{X^T A X}{2 X^T X}, \tag{1}$$

where $X \in \mathbb{R}^p \setminus \{0\}$ and $A \in \mathbb{R}^{p \times p}$ is positive semi-definite. Based on its form, the Rayleigh Quotient is equivalent to a quadratic function using weight normalization (Salimans & Kingma, 2016).

Now consider the following optimization problem:

$$\min_{X \in \mathbb{R}^p \setminus \{0\}} L(X). \tag{2}$$

Obviously, the solutions to Eq. (2) are $X^* = \{\alpha v \mid \alpha \in \mathbb{R}^+,\ v \in S^{p-1},\ Av = \lambda_1 v\}$, where $\lambda_1$ is the smallest eigenvalue of $A$. Consider solving Eq. (2) by Stochastic Gradient Descent (SGD) with constant learning rate (LR) $\eta > 0$ and Weight Decay (WD). The update rule at step $n$ is

$$X_{n+1} = X_n - \eta g_n - \lambda\eta X_n, \tag{3}$$

where $-\lambda\eta X_n$ is the weight decay part with a positive factor $\lambda$, and $g_n$ is the stochastic gradient of Eq. (1) at step $n$. Inspired by Zhang et al. (2019), the gradient noise is constructed as Gaussian noise to simulate "mini-batch training". Specifically, $g_n$ is constructed as

$$g_n = \frac{1}{\|X_n\|_2} P_n A \tilde{X}_n + \frac{1}{\|X_n\|_2} P_n \Sigma^{1/2} \varepsilon_n, \tag{4}$$

where $\tilde{X}_n \triangleq X_n / \|X_n\|_2$; $P_n \triangleq I - \tilde{X}_n \tilde{X}_n^T$; $\Sigma \in \mathbb{R}^{p \times p}$ is a positive definite matrix; and $\varepsilon_n \sim \mathcal{N}(0, I)$. Then we have

$$\mathbb{E} g_n = \nabla_X L(X_n), \qquad \mathrm{Cov}(g_n) = \frac{1}{\|X_n\|_2^2} P_n \Sigma P_n. \tag{5}$$

$g_n$ defined in Eq. (4) can simulate the mini-batch stochastic gradient of a scale-invariant loss because

$$\langle g_n, X_n \rangle = 0, \qquad g_n\big|_{X = kX_n} = \frac{1}{k}\, g_n, \quad \forall k > 0, \tag{6}$$

which are necessary conditions for SMD to occur during the evolution of SGD (Li et al., 2020; Wan et al., 2021). We call the process that optimizes the Rayleigh Quotient via the update rule Eq. (3) with the stochastic gradient Eq. (4) the Noisy Rayleigh Quotient (NRQ).

Remark 1. The stochastic gradient of NRQ can be regarded as the normalized form of the Noisy Quadratic Model (NQM) (Zhang et al., 2019), in which the objective function is

$$L(X) = \frac{1}{2} X^T A X \tag{7}$$

and the stochastic gradient is defined as

$$g_n = A X_n + \Sigma^{1/2} \varepsilon_n. \tag{8}$$

But the dynamics of NQM and NRQ are quite different: NQM is basically a convex model with only one optimal solution, $0$, while NRQ is a nonconvex problem with an infinite number of solutions ($X^*$), which makes it much more difficult to analyze than NQM.
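The construction in Eqs. (3)-(6) can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation: the function names `nrq_stochastic_grad` and `sgd_wd_step`, the diagonal matrix `A`, and the noise scale `0.3` are hypothetical choices made only for this example. It draws one stochastic gradient and checks the two properties in Eq. (6), using the same noise draw at $X$ and $kX$ so that the two gradients are directly comparable.

```python
import numpy as np

def nrq_stochastic_grad(x, A, sigma_sqrt, rng):
    """Stochastic gradient g_n of Eq. (4), evaluated at the point x."""
    norm = np.linalg.norm(x)
    x_unit = x / norm                                  # normalized weight
    P = np.eye(x.size) - np.outer(x_unit, x_unit)      # projector orthogonal to x
    eps = rng.standard_normal(x.size)                  # eps_n ~ N(0, I)
    return (P @ (A @ x_unit) + P @ (sigma_sqrt @ eps)) / norm

def sgd_wd_step(x, A, sigma_sqrt, lr, wd, rng):
    """One update X_{n+1} = X_n - lr * g_n - wd * lr * X_n, as in Eq. (3)."""
    g = nrq_stochastic_grad(x, A, sigma_sqrt, rng)
    return x - lr * g - wd * lr * x

p = 5
A = np.diag(np.arange(1.0, p + 1))     # hypothetical positive semi-definite matrix
sigma_sqrt = 0.3 * np.eye(p)           # hypothetical Sigma^{1/2}
x = np.random.default_rng(0).standard_normal(p)

# Re-seeding gives the same eps at x and at k * x.
k = 3.0
g = nrq_stochastic_grad(x, A, sigma_sqrt, np.random.default_rng(1))
g_scaled = nrq_stochastic_grad(k * x, A, sigma_sqrt, np.random.default_rng(1))

orthogonality = float(g @ x)                              # <g_n, X_n>, zero up to round-off
scaling_error = float(np.max(np.abs(g_scaled - g / k)))   # g(kX) - g(X)/k, zero up to round-off
x_next = sgd_wd_step(x, A, sigma_sqrt, lr=0.1, wd=0.01, rng=np.random.default_rng(2))
```

Because the gradient is orthogonal to the weight, only the weight decay term shrinks the norm while the gradient step can only enlarge it, which is the mechanism behind the equilibrium discussed in the introduction.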

2.2. APPROXIMATE SGD AS STOCHASTIC DIFFERENTIAL EQUATION

Though a thorough analysis of the characteristics of SMD has been established in the discrete form (Wan et al., 2021), directly analyzing the evolution dynamics of Eq. (3) in discrete form while taking SMD into account is still too complex. On the other hand, we can tackle the problem using the SDE approximation introduced in Li et al. (2020), which has established the continuous form of SMD. Using the SDE approximation, the evolution dynamics of Eq. (3) can be approximated by

$$dX_t = -\eta\left(\frac{1}{\|X_t\|_2} P_t A \tilde{X}_t + \lambda X_t\right)dt + \eta\,\frac{P_t \Sigma^{1/2}}{\|X_t\|_2}\, dB_t, \tag{9}$$

where $B_t$ is a $p$-dimensional Brownian motion. Here we follow the form of SDE used in Li et al. (2020) instead of the commonly used form proposed in Li et al. (2017), by extracting an LR factor $\eta$ from the differential time $dt$. The extracted factor is useful in connecting the characteristics of SMD in the discrete setting and the continuous setting. Due to the scale-invariant property, $\|X_t\|_2$ cannot influence the Rayleigh Quotient $L(X_t)$ at all, so the intrinsic domain of NRQ is the unit sphere $S^{p-1}$ (Li et al., 2020). But $\|X_t\|_2$ may affect the evolution dynamics of NRQ since it is involved in the system Eq. (9). To decouple the evolution of $X_t$ on its intrinsic domain (i.e. the evolution of $\tilde{X}_t$) from the evolution of $\|X_t\|_2$, following Li et al. (2020), let $M_t \triangleq \|X_t\|_2^2$; then Eq. (9) can be rewritten as the following two SDEs:

$$d\tilde{X}_t = -\left[\frac{\eta}{M_t} P_t A \tilde{X}_t + \frac{\eta^2}{2M_t^2}\,\mathrm{Tr}(P_t \Sigma P_t)\,\tilde{X}_t\right]dt - \frac{\eta}{M_t} P_t \Sigma^{1/2}\, dB_t, \tag{10}$$

$$dM_t = \left[-2\lambda\eta M_t + \frac{\eta^2}{M_t}\,\mathrm{Tr}(P_t \Sigma)\right]dt. \tag{11}$$

Note that the diffusion part is missing in Eq. (11), so it is possible to derive the explicit solution of Eq. (11), computed as

$$M_t^2 = e^{-4\lambda\eta t} M_0^2 + 2\eta^2 \int_0^t e^{-4\lambda\eta(t-\tau)}\,\mathrm{Tr}(P_\tau \Sigma)\, d\tau. \tag{12}$$

The angular update (AU) at step $n$ is defined as $\Delta_n = \angle(X_n, X_{n+1})$. When the equilibrium of SMD is reached, AU satisfies $\mathbb{E}\Delta_n = \sqrt{2\lambda\eta}$. In NRQ, the (discrete) AU can be computed by

$$\Delta_n = \angle(X_n, X_{n+1}) = \arctan\left(\frac{\|g_n\|\eta}{\|X_n\|_2}\right) \approx \frac{\|g_n\|\eta}{\|X_n\|_2}. \tag{13}$$

Then we have

$$\mathbb{E}\Delta_n^2 = \frac{\left[\|\nabla_X L(\tilde{X}_n)\|_2^2 + \mathrm{Tr}(P_n \Sigma)\right]\eta^2}{\|X_n\|_2^4}. \tag{14}$$
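The equilibrium value $\sqrt{2\lambda\eta}$ of the discrete AU can be checked with a short simulation of the update rule Eq. (3). The sketch below is an illustration under stated assumptions rather than a reproduction of the paper's experiments: the spectrum of `A`, the noise level `sigma`, the dimension, the step count, and the function name `simulate_angular_updates` are all hypothetical, and the noise is taken as $\Sigma = \sigma^2 I$ with $\sigma$ large enough that the noise term dominates the full gradient.

```python
import numpy as np

def simulate_angular_updates(p=20, lr=0.1, wd=0.05, sigma=1.0, steps=4000, seed=0):
    """Run SGD with weight decay on NRQ and record the angle between X_n and X_{n+1}."""
    rng = np.random.default_rng(seed)
    A = np.diag(np.linspace(0.0, 1.0, p))            # hypothetical spectrum of A
    x = rng.standard_normal(p)
    angles = np.empty(steps)
    for n in range(steps):
        norm = np.linalg.norm(x)
        x_unit = x / norm
        v = A @ x_unit + sigma * rng.standard_normal(p)
        g = (v - x_unit * (x_unit @ v)) / norm       # P_n v / ||X_n||_2, as in Eq. (4)
        x_new = x - lr * g - wd * lr * x             # Eq. (3)
        cos = (x @ x_new) / (norm * np.linalg.norm(x_new))
        angles[n] = np.arccos(np.clip(cos, -1.0, 1.0))
        x = x_new
    return angles

lr, wd = 0.1, 0.05
angles = simulate_angular_updates(lr=lr, wd=wd)
measured = float(np.sqrt(np.mean(angles[-2000:] ** 2)))   # RMS angular update late in training
predicted = float(np.sqrt(2 * wd * lr))                   # equilibrium value sqrt(2 * lambda * eta)
```

Under these settings the measured root-mean-square AU over the last steps is expected to sit close to `predicted`; the two are not identical because of the arctan approximation in Eq. (13) and the finite sample.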
Though AU has a specific geometric meaning in the discrete form, as it represents the geodesic distance between $\tilde{X}_n$ and $\tilde{X}_{n+1}$ on the unit sphere $S^{p-1}$, its definition cannot be applied directly in the continuous setting. To connect the concept of SMD in the discrete setting and the continuous setting, a continuous version of AU in NRQ is defined as follows.

Definition 1 (AU). If $\mathbb{E}\|\tilde{X}_\tau - \tilde{X}_t\|_2^2$ is differentiable on $[t, +\infty)$, then AU at $t$ is defined as

$$\Delta_t = \lim_{\tau \to t} \sqrt{\frac{\mathbb{E}_t \|\tilde{X}_\tau - \tilde{X}_t\|_2^2}{\tau - t}}. \tag{15}$$

Remark 2. The definition of AU in the continuous setting is inspired by the concept of "angular velocity" used in Kunin et al. (2021), in which the authors formulated the equilibrium of SMD using gradient flow. This definition relies on the fact that $\mathbb{E}\|\tilde{X}_\tau - \tilde{X}_t\|_2^2$ is differentiable at $t$. From the definition we can derive the following properties of AU in NRQ:

Lemma 1. In the evolution of Eq. (10) and Eq. (11), we have $\Delta_t^2 = \frac{\mathrm{Tr}(P_t \Sigma)\eta^2}{M_t^2}$. If $\mathrm{Tr}(P_t \Sigma)$ is constant, then $\lim_{t \to \infty} \Delta_t = \sqrt{2\lambda\eta}$.

The proof is in Appendix B.1. Comparing Eq. (14) with Lemma 1, the theoretical value $\mathbb{E}\Delta_n^2$ in the discrete form is similar to its continuous form except for an additional term $\|\nabla_X L(\tilde{X}_n)\|_2^2$.

Assumption 1. $A$ is a diagonal matrix with diagonal elements in ascending order, i.e. $A = \mathrm{diag}(a_1, a_2, \ldots, a_p)$, $a_1 < a_2 \le a_3 \le \cdots \le a_p$.

Assumption 2. $\exists \sigma > 0$, $\Sigma = \sigma^2 I$.

In Assumption 1, $A$ is taken to be diagonal to simplify the derivation, as Zhang et al. (2019) do in the quadratic model. We further assume $a_1 < a_2$ to ensure that NRQ has at most 2 solutions $\pm e_1$, where $e_1^{(1)} = 1$ and $e_1^{(i)} = 0$ for $2 \le i \le p$. Assumption 2 is used to ensure that $\mathrm{Tr}(P_t \Sigma)$ is constant during the whole process. This constant-variance assumption on the gradient noise is also used in Zhu et al. (2019) and Li et al. (2020). Note that even under Assumption 1, NRQ still has two different globally optimal solutions $\pm e_1$ on $S^{p-1}$, so directly analyzing the convergence behavior by the distance to the optimal points is inappropriate.
Therefore, to properly track the optimization trajectory, we analyze the evolution of $f_t \triangleq (e_1^T \tilde{X}_t)^2 = (\tilde{X}_t^{(1)})^2$, where $\tilde{X}_t^{(1)}$ denotes the first element of $\tilde{X}_t$. Note that $f_t$ is an ideal index to reflect the optimization trajectory because $f_t = 1$ if and only if $\tilde{X}_t = \pm e_1$. Besides, $f_t$ can (roughly) bound the evolution of the loss $L_t$ by $a_1 + (a_2 - a_1)(1 - f_t) \le L_t \le a_1 + (a_p - a_1)(1 - f_t)$. Using Itô's lemma, the SDE of $f_t$ can be derived from Eq. (10) and Eq. (11), written as

$$df_t = \left[\frac{\eta(L_t - a_1)}{M_t} f_t + \frac{\eta^2\sigma^2}{M_t^2}(1 - p f_t)\right]dt + \frac{2\eta\sigma}{M_t}\sqrt{f_t(1 - f_t)}\, dB_t, \tag{16}$$

$$M_t^2 = e^{-4\lambda\eta t}\left(M_0^2 - \frac{(p-1)\sigma^2\eta}{2\lambda}\right) + \frac{(p-1)\sigma^2\eta}{2\lambda}. \tag{17}$$

Remark 3. It is possible to directly explore the evolution of the loss $L_t$ by deriving the SDE of $L_t$ using Itô's lemma, but some terms in the SDE of $L_t$ are hard to handle compared with the SDE of $f_t$.

Now we can define the risk of NRQ as $r_t \triangleq 1 - \mathbb{E}f_t$. Our first theorem depicts the convergence behavior of NRQ by giving bounds on the risk.

Theorem 1. The solution of Eq. (16) and Eq. (17) satisfies

$$r_t \ge e^{-G_1(t)}\left[1 - f_0 + \int_0^t \Delta_\tau^2\, e^{G_1(\tau)}\, d\tau\right], \tag{18}$$

where

$$G_1(t) \triangleq \int_0^t \left[\frac{(a_p - a_1)\eta}{\sqrt{p-1}\,\sigma}\Delta_\tau + \frac{p}{p-1}\Delta_\tau^2\right]d\tau. \tag{19}$$

Furthermore, given $\xi \in (0, 1)$, define $\varepsilon(t) = \mathbb{P}(f_t < \xi)$ as the tail probability of the $f_t$ dynamics. Then for any $t \ge 0$, we have

$$r_t \le e^{-\tilde{G}_1(t)}\left[1 - f_0 + \int_0^t \left(\frac{(a_2 - a_1)\eta\,\xi\,\varepsilon(t)}{\sqrt{p-1}\,\sigma}\Delta_\tau + \Delta_\tau^2\right)e^{\tilde{G}_1(\tau)}\, d\tau\right], \tag{20}$$

where

$$\tilde{G}_1(t) \triangleq \int_0^t \left[\frac{(a_2 - a_1)\eta\,\xi}{\sqrt{p-1}\,\sigma}\Delta_\tau + \frac{p}{p-1}\Delta_\tau^2\right]d\tau. \tag{21}$$

The proof is in Appendix C. Theorem 1 implies that the evolution of the risk is mostly determined by the evolution of AU. Note that the upper bound Eq. (20) relies on $\xi$ and $\varepsilon(t)$. To make this bound tighter, ideally $\xi$ should be close to 1 while $\varepsilon(t)$ should be close to 0. We discuss $\xi$ and $\varepsilon(t)$ in detail later.
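Since Theorem 1 expresses the risk bounds through $\Delta_\tau$, it can help to see how AU itself evolves. The sketch below evaluates the closed form Eq. (17) and the expression of Lemma 1 with $\mathrm{Tr}(P_t\Sigma) = (p-1)\sigma^2$ (Assumption 2). The numerical values and the function names `weight_norm_sq` and `angular_update` are hypothetical and chosen only for illustration.

```python
import numpy as np

def weight_norm_sq(t, m0, lr, wd, sigma, p):
    """M_t^2 from Eq. (17), assuming Sigma = sigma^2 I."""
    m_eq_sq = (p - 1) * sigma**2 * lr / (2 * wd)     # equilibrium value of M_t^2
    return np.exp(-4 * wd * lr * t) * (m0**2 - m_eq_sq) + m_eq_sq

def angular_update(t, m0, lr, wd, sigma, p):
    """Delta_t from Lemma 1 with Tr(P_t Sigma) = (p - 1) * sigma^2."""
    return np.sqrt(p - 1) * sigma * lr / np.sqrt(weight_norm_sq(t, m0, lr, wd, sigma, p))

lr, wd, sigma, p = 0.1, 0.01, 1.0, 50     # hypothetical values
t = np.array([0.0, 1e2, 1e3, 1e4])

au_large_init = angular_update(t, m0=100.0, lr=lr, wd=wd, sigma=sigma, p=p)  # M_0 above equilibrium
au_small_init = angular_update(t, m0=1.0, lr=lr, wd=wd, sigma=sigma, p=p)    # M_0 below equilibrium
au_limit = np.sqrt(2 * wd * lr)                                              # sqrt(2 * lambda * eta)
```

With a large initial $M_0$ the AU starts below its limit and increases toward it; with a small $M_0$ it starts above and decreases. Both trajectories approach the same value $\sqrt{2\lambda\eta}$, on a time scale of order $1/(4\lambda\eta)$.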

3.1. EQUILIBRIUM STATE OF SMD

Though an analytical result is shown in Theorem 1, the global picture of the dynamics is still not clear, since the evolution of $r_t$ is associated with the evolution of AU $\Delta_t$. Fortunately, it is known that the equilibrium of SMD must be reached, in which $\Delta_t = \sqrt{2\lambda\eta}$ regardless of the evolution of $\tilde{X}_t$. Hence, we can directly explore the evolution of $r_t$ in equilibrium, as the following corollary shows:

Corollary 1 (Equilibrium state dynamics). Assume $M_0 = \sqrt{\frac{\eta(p-1)\sigma^2}{2\lambda}}$, $\lambda\eta \ll 1$, and $\exists \varepsilon > 0$, $\lim_{t \to +\infty} \varepsilon(t) \le \varepsilon$ in Theorem 1. Then $\exists C > 0$ such that

$$r^* + (1 - f_0 - r^*)e^{-g_1^* t} \le r_t \le \tilde{r}^* + \varepsilon + C e^{-\tilde{g}_1^* t}, \tag{22}$$

where

$$g_1^* = \frac{a_p - a_1}{\sqrt{p-1}\,\sigma}\sqrt{2\lambda\eta} + O(\lambda\eta), \qquad \tilde{g}_1^* = \frac{\xi(a_2 - a_1)}{\sqrt{p-1}\,\sigma}\sqrt{2\lambda\eta} + O(\lambda\eta), \tag{23}$$

$$r^* = \frac{\sqrt{p-1}\,\sigma}{a_p - a_1}\sqrt{2\lambda\eta} + O(\lambda\eta), \qquad \tilde{r}^* = \frac{\sqrt{p-1}\,\sigma}{\xi(a_2 - a_1)}\sqrt{2\lambda\eta} + O(\lambda\eta). \tag{24}$$

The proof is in Appendix D.1. Corollary 1 shows that in the equilibrium state of SMD, when $\varepsilon \ll r^*$, NRQ still converges in a linear rate regime, where by Eq. (23) the convergence rate is (roughly) proportional to the AU, which is determined only by LR $\eta$ and WD factor $\lambda$. However, by Eq. (24) the limiting risk is also (roughly) proportional to the AU. Thus, a trade-off exists between the convergence rate and the limiting risk when tuning AU: a large AU makes the loss decrease more quickly at the beginning, but enlarges the limiting risk in the end, resulting in a larger steady loss (see Figure 2 (a)-(d)). This can explain why a decreasing LR strategy is always necessary to obtain the best performance when training models in practice.

Remark 4. Such a trade-off between convergence rate and limiting risk also exists in the convergence behavior of NQM (Zhang et al., 2019). Zhang et al. (2019) claim that the trade-off in NQM can be adjusted by tuning LR or gradient noise; in NRQ, the trade-off can be adjusted not only by LR or gradient noise, but also by the WD factor.
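The trade-off described by Corollary 1 can be made concrete by evaluating the leading-order terms of Eq. (23) and Eq. (24). The sketch below uses only the lower-bound quantities $g_1^*$ and $r^*$ and drops the $O(\lambda\eta)$ corrections; the function name `equilibrium_rate_and_risk` and all numerical values are hypothetical.

```python
import math

def equilibrium_rate_and_risk(lr, wd, sigma, p, a_first, a_last):
    """Leading-order g_1^* (Eq. 23) and r^* (Eq. 24), ignoring O(lambda * eta) terms."""
    au = math.sqrt(2 * wd * lr)                                  # equilibrium angular update
    rate = (a_last - a_first) / (math.sqrt(p - 1) * sigma) * au  # convergence rate
    risk = math.sqrt(p - 1) * sigma / (a_last - a_first) * au    # limiting risk
    return rate, risk

# Hypothetical settings: p = 50, sigma = 1, a_1 = 0, a_p = 10.
rate_small, risk_small = equilibrium_rate_and_risk(0.1, 0.01, 1.0, 50, 0.0, 10.0)
rate_large, risk_large = equilibrium_rate_and_risk(0.4, 0.01, 1.0, 50, 0.0, 10.0)

# Swapping the values of LR and WD keeps lambda * eta fixed.
rate_swapped, risk_swapped = equilibrium_rate_and_risk(0.01, 0.1, 1.0, 50, 0.0, 10.0)
```

Quadrupling the LR doubles the AU, so both the rate and the limiting risk double: faster initial progress is paid for by a larger steady-state risk. To leading order the product of the two equals $2\lambda\eta$, and any pair $(\eta, \lambda)$ with the same product $\lambda\eta$ gives the same rate and risk.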
Besides, the association between AU and the convergence rate/limiting risk is consistent with the conjectures about the relation between AU and the dynamics of normalized DNNs in Wan et al. (2021), in which the authors suppose that AU is correlated with the steady loss when training normalized DNNs.

Stationary distribution of $f_t$. Corollary 1 only presents a bound for the risk. With additional assumptions, the stationary distribution of $f_t$ and the limiting risk $r^*$ can be explicitly derived using the Fokker-Planck equation.

Theorem 2. Assume the spectrum of $A$ takes only 2 distinct real values $a_1 = a_l < a_h = a_2 = \cdots = a_p$. Denote the stationary distribution of $f_t$ by $\rho^*(f)$. Then

$$\rho^*(f) \propto e^{2\kappa f} f^{-\frac{1}{2}} (1 - f)^{\frac{p-3}{2}}, \quad f \in [0, 1], \tag{25}$$

where $\kappa = \frac{\sqrt{p-1}}{2\sigma\sqrt{2\eta\lambda}}$. In addition, the limit of the risk $r_t$ exists and is given by

$$r^* \triangleq \lim_{t \to \infty} r_t = \frac{\sqrt{p-1}\,\sigma}{a_h - a_l}\sqrt{2\lambda\eta} + o(\sqrt{2\lambda\eta}). \tag{26}$$

Moreover, there exist $\mu, C > 0$ such that for any $\xi \in (0, 1)$, the tail probability $\varepsilon(t) \triangleq \mathbb{P}(f_t < \xi)$ is governed by

$$|\varepsilon(t) - \varepsilon^*| \le C e^{-\mu t}, \tag{27}$$

in which $\varepsilon^*$ is the stationary tail probability $\varepsilon^* \triangleq \int_0^\xi \rho^*(f)\, df$.

The proof is in Appendix D.2. Eq. (26) supports our insight drawn from Theorem 1 that the limiting risk should be proportional to the theoretical value of AU. Besides, Eq. (27) implies that it is reasonable to assume that $\xi$ is close to 1 while the upper limit of $\varepsilon(t)$ is close to 0, as we state in Theorem 1 and Corollary 1.
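The stationary density in Eq. (25) can be integrated numerically to see how the limiting risk shrinks as $\kappa$ grows. The sketch below treats $\kappa$ simply as an input parameter and assumes $p \ge 3$ so that the density is bounded near $f = 1$. The substitution $f = u^2$ is a choice made here to remove the integrable singularity $f^{-1/2}$ at $f = 0$, and the factor $e^{-2\kappa}$ is pulled out to avoid overflow; the function name `stationary_risk` and the test values are hypothetical.

```python
import numpy as np

def stationary_risk(kappa, p, grid=200001):
    """E[1 - f] under rho*(f) of Eq. (25), by the trapezoid rule after substituting f = u^2."""
    u = np.linspace(0.0, 1.0, grid)
    # rho*(f) df is proportional to 2 * exp(2*kappa*u^2) * (1 - u^2)^((p-3)/2) du;
    # the constant factors 2 and exp(2*kappa) cancel in the ratio below.
    w = np.exp(2.0 * kappa * (u**2 - 1.0)) * (1.0 - u**2) ** ((p - 3) / 2)
    du = u[1] - u[0]

    def trapezoid(y):
        return du * (y.sum() - 0.5 * (y[0] + y[-1]))

    return trapezoid((1.0 - u**2) * w) / trapezoid(w)

risk_kappa_50 = stationary_risk(kappa=50.0, p=10)
risk_kappa_100 = stationary_risk(kappa=100.0, p=10)
large_kappa_estimate = (10 - 1) / (4 * 50.0)     # (p - 1) / (4 * kappa)
```

For large $\kappa$ the mass concentrates near $f = 1$, where $1 - f$ is approximately Gamma distributed with shape $(p-1)/2$ and rate $2\kappa$, so the risk is roughly $(p-1)/(4\kappa)$. Since $\kappa$ scales as $1/\sqrt{2\eta\lambda}$, this is again proportional to the equilibrium AU.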

3.2. BEYOND EQUILIBRIUM OF SMD

We have presented a detailed analysis of the evolution of NRQ in the equilibrium of SMD, showing that the convergence rate and limiting risk are mainly controlled by AU. Based on the insights drawn from the equilibrium case, we can infer how the evolution of AU dominates the evolution of NRQ beyond equilibrium.

"Escape" by the autonomous increase of AU. The following corollary shows that the evolution of $\Delta_t$ can lead to an "escape" behavior of the optimization trajectory, resulting in a temporary "decreasing, then increasing" risk:

Corollary 2 (A sufficient condition for "escaping" behavior). Given $\eta, \lambda$, if the following conditions hold: 1) $\exists \varepsilon > 0$, $\forall t > 0$, $\varepsilon(t) < \varepsilon < r^*$; 2) $f_0 = 1 - r^*$; 3) $M_0 > \frac{(p-1)\sigma^2\eta}{(r^* - \varepsilon)(a_2 - a_1)\xi}$. Then $\exists T > 0$ such that

$$r_T < r_0 \le \lim_{t \to \infty} r_t. \tag{28}$$

The proof is in Appendix E. Eq. (28) indicates a kind of "escape" behavior, because it implies that the trajectory of $\tilde{X}_t$ will first approach an optimal point $X^*$ and then depart from $X^*$. Intuitively, this "escape" behavior is caused by increasing AU (which is also the main idea of the proof of Corollary 2): the initial weight norm $\sqrt{M_0}$ is sufficiently large, so $\Delta_t$ is smaller than $\sqrt{2\lambda\eta}$ at the beginning, which allows the risk $r_t$ to fall below the infimum of the limiting risk given $\sqrt{2\lambda\eta}$ for a while. But when equilibrium is reached and AU increases to $\sqrt{2\lambda\eta}$, it forces the risk back to its limit value. See the blue lines in Figure 2 (e)-(h). Even though Corollary 2 only gives a sufficient condition for the "escape" behavior in NRQ, such an "escape" phenomenon can be seen in real data experiments in practice. Wan et al. (2021) exhibit a so-called "pseudo overfitting" phenomenon observed in CIFAR10 experiments with commonly used settings. Based on empirical observations, they speculate that "pseudo overfitting" is caused by temporarily increasing AU after LR decay.
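The increase of AU after an LR decay can be traced with the closed forms of Eq. (17) and Lemma 1. The sketch below assumes the system sits at equilibrium for $(\eta, \lambda)$ and that the LR is divided by $k$ at $t = 0$; the function name `au_after_lr_decay` and the numerical values are hypothetical illustrations, not the settings of Figure 2.

```python
import numpy as np

def au_after_lr_decay(t, lr, wd, sigma, p, k):
    """Delta_t after dividing the LR by k at t = 0, starting from the old equilibrium."""
    m_old_sq = (p - 1) * sigma**2 * lr / (2 * wd)        # equilibrium M^2 before the decay
    lr_new = lr / k
    m_new_sq = (p - 1) * sigma**2 * lr_new / (2 * wd)    # equilibrium M^2 after the decay
    m_sq = np.exp(-4 * wd * lr_new * t) * (m_old_sq - m_new_sq) + m_new_sq   # Eq. (17)
    return np.sqrt(p - 1) * sigma * lr_new / np.sqrt(m_sq)                   # Lemma 1

lr, wd, sigma, p, k = 0.5, 1e-3, 1.0, 50, 5.0    # hypothetical values
t = np.linspace(0.0, 2e4, 5)
au = au_after_lr_decay(t, lr, wd, sigma, p, k)

au_before_decay = np.sqrt(2 * wd * lr)           # old equilibrium AU
au_right_after = au_before_decay / k             # AU immediately after the decay
au_new_limit = np.sqrt(2 * wd * lr / k)          # new equilibrium AU
```

Right after the decay the weight norm is still at its old, larger equilibrium, so AU drops by a factor $k$; it then climbs by a factor $\sqrt{k}$ to the new equilibrium value. This temporary undershoot followed by an increase is the kind of AU evolution that Corollary 2 associates with the "escape" behavior.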
Corollary 2 offers strong theoretical evidence for their conjecture, showing that increasing AU can indeed lead to "escape" behavior under specific conditions. We also apply the "rescale" strategy proposed in Wan et al. (2021) to NRQ, in which the weight norm is divided by $k^{1/4}$ when LR is divided by $k$. The "rescale" strategy can indeed eliminate the "escape" behavior (see Figure 2 (e)-(h), red lines).

Initialized value of weight norm. The evolution of AU can provide new interpretations of how the initialization strategy affects the training of a normalized model. The weights of a neural network are usually initialized as Gaussian $\mathcal{N}(0, \sigma^2 I)$, where $\sigma$ needs to be carefully tuned. In mean field theory and NTK theory, the standard deviation of the Gaussian initialization strategy is crucial for obtaining good performance. Experiments on real datasets seem to support these theories, where Gaussian initialization strategies with carefully tuned $\sigma$, such as Kaiming (He et al., 2015) or Xavier (Glorot & Bengio, 2010), indeed outperform the naive Gaussian initialization strategy. However, when initializing a normalized model, $\sigma$ only influences the initialized value of the weight norm, by the law of large numbers: $\|X_0\|_2^2 = \sum_{i=1}^{p} (X_0^{(i)})^2 \approx p\sigma^2$. The following corollary implies that $\sigma$ is not so crucial for normalized models:

Corollary 3. $\forall k > 0$, if $X_0$ is multiplied by $k$ and $\eta$, $\lambda$ are multiplied by $k^2$ and $\frac{1}{k^2}$ respectively, then $r_t$ remains unchanged.

The proof is in Appendix E.2. Corollary 3 shows that no matter how $\sigma$ is set, as long as LR $\eta$ and WD factor $\lambda$ are adjusted accordingly, the evolution dynamics of NRQ does not change at all, because the adjustment in Corollary 3 preserves the evolution Eq. (10) by maintaining the evolution of $\Delta_t$ (compare the blue and yellow lines in Figure 2 (i)-(l)).

Although in NRQ the conclusion that "the same limiting AU leads to similar limiting risk" is true, the same conclusion does not necessarily hold in real data experiments.
The two local minima of the Rayleigh Quotient have exactly the same geometric characteristics, but in real data experiments the loss landscape may have multiple local minima with different geometric characteristics. Even though the theoretical value of AU is fixed, different evolutions of AU may bring the optimization trajectory close to different local minima, resulting in different final performance. This is why, in practice, even with fixed LR and WD factor, $\sigma$ in Gaussian initialization may still influence the performance of a neural network.
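Both the adjustment in Corollary 3 and the "rescale" strategy can be checked at the level of AU using Eq. (17) and Lemma 1. The sketch below is an illustration under Assumption 2, with $M_0 = \|X_0\|_2^2$; the function name `angular_update`, the parameter name `noise_sigma` (the gradient noise level, not the initialization standard deviation), and all numerical values are hypothetical.

```python
import numpy as np

def angular_update(t, x0_norm, lr, wd, noise_sigma, p):
    """Delta_t from Lemma 1 and Eq. (17), with M_0 = ||X_0||_2^2."""
    m0_sq = x0_norm**4                                   # M_0^2 = ||X_0||_2^4
    m_eq_sq = (p - 1) * noise_sigma**2 * lr / (2 * wd)   # equilibrium value of M_t^2
    m_sq = np.exp(-4 * wd * lr * t) * (m0_sq - m_eq_sq) + m_eq_sq
    return np.sqrt(p - 1) * noise_sigma * lr / np.sqrt(m_sq)

t = np.linspace(0.0, 5e3, 6)
lr, wd, noise_sigma, p, k = 0.1, 0.01, 1.0, 50, 3.0

# Corollary 3: scale X_0 by k, LR by k^2, and WD by 1 / k^2.
base = angular_update(t, 2.0, lr, wd, noise_sigma, p)
adjusted = angular_update(t, k * 2.0, k**2 * lr, wd / k**2, noise_sigma, p)
unadjusted = angular_update(t, k * 2.0, lr, wd, noise_sigma, p)   # only X_0 is scaled

# Rescale strategy: at an LR decay by `decay`, divide the weight norm by decay^(1/4).
decay = 10.0
eq_norm = ((p - 1) * noise_sigma**2 * lr / (2 * wd)) ** 0.25      # equilibrium ||X||_2
au_rescaled = angular_update(0.0, eq_norm / decay**0.25, lr / decay, wd, noise_sigma, p)
au_new_limit = np.sqrt(2 * wd * lr / decay)
```

The adjusted run reproduces the baseline AU trajectory, whereas scaling only the initialization changes the early AU while leaving its limit untouched. With the rescale, AU lands on its new equilibrium value immediately after the decay, so there is no subsequent increase of AU.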

4. REAL DATA EXPERIMENTS

Aside from the NRQ experiments, we also conduct experiments on CIFAR10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009) to verify the insights drawn from NRQ. We use pure SGD without momentum in all real data experiments to eliminate the possible effect of momentum, though Wan et al. (2021) claim that SGD with momentum has a similar SMD mechanism. In the CIFAR10 experiments, we adopt Resnet18 (He et al., 2015) as the baseline model; the total number of epochs is 200; LR is divided by 5 at epochs 60, 120, and 160; the batch size is 256. In the ImageNet1000 (Deng et al., 2009) experiments, most settings follow Goyal et al. (2017), except that LR is initialized as 1 and momentum is 0. Smith et al. (2020) show that using pure SGD with a larger LR can obtain performance similar to the standard SGDM setting. Our experimental results (Figures 3 and 4) show that the insights from NRQ also hold in real data experiments. In the CIFAR10 experiments, when $\eta\lambda$ is fixed, AU in the equilibrium of SMD remains unchanged, and so do the steady loss and accuracy (Figure 3 (a)-(d)), but different LR and WD factors can affect the evolution at the beginning. Wan et al. (2021) show similar observations in ImageNet experiments. Figure 3 (e)-(h) exhibits the "pseudo overfitting" phenomenon shortly after the first LR decay (epoch 60). The rescaling strategy can avoid such "pseudo overfitting" by eliminating the increase of AU after LR decay. From Figure 3 (i)-(l), when the initialized weight is enlarged by a factor of 10, the AU is smaller at the beginning, hence the training loss decreases, and the test accuracy increases, more slowly. When the LR and WD factor are adjusted according to Corollary 3, the evolution of AU remains unchanged, and so does the evolution of the training loss and test accuracy. The ImageNet experiments show similar phenomena (Figure 4 (a)-(d)).
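The step schedule used in the CIFAR10 experiments, together with the theoretical equilibrium AU $\sqrt{2\lambda\eta}$ at each stage, can be sketched as follows. The function names `step_lr` and `equilibrium_au` are hypothetical, the convention that the decayed LR applies from the milestone epoch onward is an assumption, and the base LR and WD factor are the values quoted for panels (e)-(h) of Figure 3.

```python
import math

def step_lr(epoch, base_lr, milestones=(60, 120, 160), factor=5.0):
    """LR at a given epoch when it is divided by `factor` at each milestone epoch."""
    n_decays = sum(epoch >= m for m in milestones)   # assumes decay takes effect at the milestone
    return base_lr / factor**n_decays

def equilibrium_au(lr, wd):
    """Theoretical equilibrium angular update sqrt(2 * lambda * eta)."""
    return math.sqrt(2 * wd * lr)

base_lr, wd = 0.5, 1e-3
schedule = [
    (epoch, step_lr(epoch, base_lr), equilibrium_au(step_lr(epoch, base_lr), wd))
    for epoch in (0, 60, 120, 160)
]
```

Each division of the LR by 5 lowers the equilibrium AU by a factor $\sqrt{5}$, while the AU immediately after the decay is lower by a factor 5 unless the weights are rescaled.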

5. CONCLUSION

In this paper, we propose a simple yet representative normalized quadratic model, named Noisy Rayleigh Quotient (NRQ), to study the effect of SMD on the evolution of SGD with WD. Our theoretical results show that SMD influences the evolution of SGD by controlling AU, and that AU can dominate the convergence rate as well as the limiting risk of NRQ. Our real data experiments show that the insights drawn from NRQ are consistent with empirical observations. We believe our theorems can deepen the understanding of the underlying mechanism of deep neural networks.



The additional term $\|\nabla_X L(\tilde{X}_n)\|_2^2$ in the numerator of Eq. (14) is the squared norm of the full gradient. Note that when $\tilde{X}_n$ is close to its optimal point, this term is usually close to zero, and hence can be ignored compared with the magnitude of the gradient noise $\mathrm{Tr}(P_n \Sigma)$. This is called the noise-dominated regime (Zhang et al., 2019; Smith et al., 2020), which commonly holds in practice, especially in large-scale dataset tasks (Smith et al., 2020; Wan et al., 2021), and happens to be the case where the discretization error can be controlled (Li et al., 2021). Besides, the limit of $\Delta_t$ is $\sqrt{2\lambda\eta}$, exactly the same as the theoretical value of AU in the discrete form. In summary, SMD in the continuous form is fundamentally equivalent to SMD in the discrete form in the noise-dominated regime, where they share the same characteristics of AU. In what follows, we adopt the unified concept of SMD and AU, without distinguishing between the discrete and continuous forms.

3. ANALYTICAL RESULTS ON EVOLUTION OF NRQ

First of all, the following two assumptions are introduced to simplify the derivation and highlight the insights of the analytical results:



Figure 1: (a) Illustration of Spherical Motion Dynamics; (b) Loss landscape of a Rayleigh Quotient with WD ($l_2$ regularization): $(x^2 + 2y^2)/(x^2 + y^2) + (x^2 + y^2)$.

Figure 2: Experiments of NRQ. We exhibit the averaged results of 100 trials. (a)-(d): Evolution of NRQ in the equilibrium state; (e)-(h): Escape behavior of NRQ after LR decay. LR is divided by 10 when $t = 10^4$; "Rescale" means $X_t$ is divided by $10^{1/4}$ when the learning rate is reduced; (i)-(l): Evolution of NRQ with different LR, WD factor, and initialized standard deviation, denoted by $(\eta, \lambda, \sigma)$. Blue lines are masked by yellow lines in (i), (j), (k) since they have exactly the same evolution.

Furthermore, combining Eq. (17) and Eq. (11), we can interpret why initialization affects the evolution of NRQ: when $\lambda, \eta$ are given, the initialized value $M_0$ can change the evolution of AU $\Delta_t$ by changing Eq. (17), resulting in a different training curve at the beginning. But the limiting risk remains unchanged, since the theoretical value of AU does not change (compare the blue and red lines in Figure 2 (i)-(l)); the same phenomenon occurs when $M_0$ and $\lambda\eta$ are fixed but $\lambda, \eta$ change.

Figure 3: Resnet18 on CIFAR10. We exhibit the averaged results of 10 trials. (a)-(d): Training curves with different LR and WD factor, denoted as $(\eta, \lambda)$; (e)-(h): Pseudo overfitting in CIFAR10 experiments. LR is 0.5, WD factor is $10^{-3}$; "rescale" means all weights are divided by $5^{1/4}$ at epoch 60; (i)-(l): Training with different LR, WD factors, and initialized standard deviation in convolution layers, denoted by $(\eta, \lambda, \sigma)$. 2p denotes Kaiming Init (He et al., 2015).

Figure 4: Resnet50 on ImageNet. Training with different LR, WD factors, and initialized standard deviation in convolution layers, denoted by $(\eta, \lambda, \sigma)$.

Prior works (Chiley et al., 2019; Kunin et al., 2021; Li et al., 2020) usually regard the convergence of the weight norm as the sign of the equilibrium state in SMD. However, Wan et al. (2021) argue that the equilibrium of SMD in practice is actually a dynamic state, where the convergence of the weight norm may not hold when the variance of the gradient noise on the intrinsic domain varies dramatically during the whole training process. Notwithstanding, Wan et al. (2021) reveal another essential characteristic of equilibrium regardless of the convergence of the norm: the AU, defined as $\Delta_n = \angle(X_n, X_{n+1})$.

A APPENDIX


