HOW NORMALIZATION AND WEIGHT DECAY CAN AFFECT SGD? INSIGHTS FROM A SIMPLE NORMALIZED MODEL

Abstract

Recent works (Li

1. INTRODUCTION

Normalization (Ioffe & Szegedy, 2015; Wu & He, 2018) is one of the most widely used deep learning techniques and has become an indispensable component of almost all popular deep neural network architectures. Though the success of normalization is indubitable, its underlying mechanism remains mysterious and has become a hot topic in deep learning research. Many works have contributed to elucidating the mechanism of normalization from different aspects. While some (Ioffe & Szegedy, 2015; Santurkar et al., 2018; Hoffer et al., 2018; Bjorck et al., 2018; Summers & Dinneen, 2019; De & Smith, 2020) focus on intuitive reasoning or empirical study, others (Dukler et al., 2020; Kohler et al., 2019; Cai et al., 2019; Arora et al., 2018; Yang et al., 2018; Wu et al., 2020) focus on establishing theoretical foundations. A series of works (Van Laarhoven, 2017; Chiley et al., 2019; Kunin et al., 2021; Li et al., 2020; Wan et al., 2021; Lobacheva et al., 2021; Li & Arora, 2019) have noted that, in practical implementations, the gradient of a normalized model is usually computed in a straightforward manner that makes the model scale-invariant with respect to its weights during training. The gradient of a scale-invariant weight is always orthogonal to the weight, so the training trajectory behaves as motion on a sphere. Moreover, since in practice many models are trained using SGD with Weight Decay (WD), normalization and WD together can drive SGD into a so-called "equilibrium" state, in which the effects of the gradient and of WD on the weight norm cancel out (see Fig. 1(a)). Although the concept of equilibrium was proposed long ago (Van Laarhoven, 2017), both theoretical justification and experimental evidence remained lacking until recently.
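The orthogonality claim above can be checked numerically. The snippet below is an illustrative sketch, not part of the paper's formal development: it uses a hypothetical scale-invariant loss (a weight-normalized linear model on a fixed input) and a finite-difference gradient, and verifies that the gradient is orthogonal to the weight and that the loss is invariant to rescaling the weight.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)  # fixed input (arbitrary)
y = 1.0                 # fixed target (arbitrary)

def loss(w):
    # Scale-invariant loss: loss(c * w) == loss(w) for any c > 0,
    # because only the direction w / ||w|| enters the prediction.
    w_hat = w / np.linalg.norm(w)
    return (w_hat @ x - y) ** 2

def num_grad(f, w, eps=1e-6):
    # Central finite-difference gradient.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = rng.normal(size=5)
g = num_grad(loss, w)
# Gradient of a scale-invariant weight is orthogonal to the weight.
assert abs(g @ w) < 1e-6
# Rescaling the weight leaves the loss unchanged.
assert abs(loss(w) - loss(3.7 * w)) < 1e-10
```

Because of this orthogonality, a plain gradient step can only rotate the weight, never change its norm to first order; only WD (or the second-order effect of the step) moves the norm, which is what makes the equilibrium state possible.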
Recent works (Li et al., 2020; Wan et al., 2021) justify the existence of equilibrium in both theoretical and empirical aspects, and characterize the underlying mechanism that yields equilibrium, named "Spherical Motion Dynamics" (SMD). Wan et al. (2021) further show that SMD arises in a wide range of computer vision tasks, including ImageNet (Deng et al., 2009) and MS-COCO (Lin et al., 2014). A more detailed review can be found in Appendix A. Our contributions are as follows:

• We design a simple normalized model, named Noisy Rayleigh Quotient (NRQ). NRQ possesses all the properties necessary to induce SMD, consistent with those of real neural networks, and thus provides a powerful tool for analyzing how normalization affects first-order optimization algorithms;

• We derive analytical results on the limiting dynamics and the stationary distribution of NRQ. Our results show that the influence of SMD is mainly reflected in how the angular update (AU), a crucial feature of SMD, affects the convergence rate and the limiting risk. We discuss the influence of AU within equilibrium and beyond equilibrium respectively, identifying the association between the evolution of AU and the evolution of the optimization trajectory of NRQ;

• We show that the insights drawn from the theoretical results on NRQ can adequately interpret typical observations in deep learning experiments. Specifically, we confirm that the role of the learning rate and WD is equivalent to that of the scale-invariant weight in SGD. We show that a Gaussian-type initialization strategy can affect the training process only because it changes the evolution of AU at the beginning of training. We also confirm that under certain conditions, SMD may induce an "escape" behavior of the optimization trajectory, resulting in a "pseudo-overfitting" phenomenon in practice.

2. NOISY RAYLEIGH QUOTIENT

2.1. PROBLEM SETUP

We use the Rayleigh Quotient (Horn & Johnson, 2012) as the objective function, defined as L(X) = (X^T A X) / (2 X^T X), where X ∈ R^p \ {0} and A ∈ R^{p×p} is positive semi-definite. By its form, the Rayleigh Quotient is equivalent to a quadratic function under weight normalization (Salimans & Kingma, 2016).
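A minimal numerical check of these two properties (scale invariance of L, and its equivalence to a quadratic loss of the normalized weight X/||X||) might look as follows; the matrix A here is a random positive semi-definite instance chosen for illustration, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
B = rng.normal(size=(p, p))
A = B @ B.T  # random positive semi-definite matrix

def rayleigh(X):
    # L(X) = X^T A X / (2 X^T X)
    return (X @ A @ X) / (2 * X @ X)

X = rng.normal(size=p)
# Scale invariance: L(cX) = L(X) for any c != 0.
assert np.isclose(rayleigh(X), rayleigh(10.0 * X))
# Equivalence to a quadratic loss of the normalized weight X/||X||.
X_hat = X / np.linalg.norm(X)
assert np.isclose(rayleigh(X), 0.5 * X_hat @ A @ X_hat)
# The infimum of L over X is half the smallest eigenvalue of A.
assert rayleigh(X) >= 0.5 * np.linalg.eigvalsh(A)[0] - 1e-9
```

The last assertion reflects the classical fact that the Rayleigh Quotient is minimized along the eigenvector of A with the smallest eigenvalue, which is what makes this objective a clean surrogate for a normalized quadratic model.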

Now consider the following optimization problem:

min_{X ∈ R^p \ {0}} L(X),
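To preview the dynamics studied later, the sketch below runs SGD with WD on the Rayleigh Quotient with artificial isotropic gradient noise. The noise model, step size, and WD value are assumptions of this illustration, standing in for the noise in NRQ. Under SMD, the weight norm should settle into equilibrium and the angular update should concentrate near sqrt(2 * lr * wd), the equilibrium value derived in Wan et al. (2021).

```python
import numpy as np

rng = np.random.default_rng(1)
p = 10
B = rng.normal(size=(p, p))
A = B @ B.T / p  # positive semi-definite objective matrix (arbitrary)

def grad(X):
    # Gradient of L(X) = X^T A X / (2 X^T X); always orthogonal to X.
    L = (X @ A @ X) / (2 * X @ X)
    return (A @ X - 2.0 * L * X) / (X @ X)

lr, wd, sigma = 0.1, 5e-3, 1.0  # illustrative hyperparameters
X = rng.normal(size=p)
norms, angles = [], []
for t in range(5000):
    # Hypothetical noise scaled by 1/||X||, mimicking the scaling of
    # gradients of a scale-invariant model.
    noise = sigma * rng.normal(size=p) / np.linalg.norm(X)
    X_new = X - lr * (grad(X) + noise + wd * X)
    cos = (X @ X_new) / (np.linalg.norm(X) * np.linalg.norm(X_new))
    angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    norms.append(np.linalg.norm(X_new))
    X = X_new

au = float(np.mean(angles[-1000:]))  # empirical angular update
theory = np.sqrt(2.0 * lr * wd)      # equilibrium prediction
```

With these (arbitrary) hyperparameters, the empirical angular update stays within a small factor of the predicted equilibrium value, while the weight norm fluctuates around a constant level instead of shrinking to zero: the growth of the norm caused by the orthogonal gradient steps balances the shrinkage caused by WD.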



Figure 1: (a) Illustration of Spherical Motion Dynamics; (b) Loss landscape of a Rayleigh Quotient with WD (l2 regularization): (x^2 + 2y^2)/(x^2 + y^2) + (x^2 + y^2).

Though the existence of SMD, as well as some of its characteristics, has been confirmed both theoretically and empirically, we notice that no previous work has theoretically justified how SMD affects the evolution of the loss of normalized models. Although some attempts have been made in Li et al. (2020) and Wan et al. (2021) to explore the role of SMD in training through conjectures and empirical studies, their findings still lack theoretical justification. In hindsight, the main challenge in theoretically analyzing the effect of SMD is that SMD arises from the joint effect of normalization and WD, which can significantly distort the loss landscape (see Figure 1(b)) and thus dramatically weaken commonly used assumptions such as (local) convexity, Lipschitz continuity, etc. Exploring the optimization trajectory on such a distorted loss landscape is already very challenging, let alone taking SMD into account in addition. In this paper, as the first significant attempt to overcome the challenge of studying the effect of SMD on the evolution of the loss, we propose a simple yet representative normalized model and theoretically analyze how SMD influences the optimization trajectory. We adopt the SDE framework of Li et al. (2020) to derive the analytical results on the evolution of NRQ, and the concepts of Wan et al. (2021) to interpret the theoretical results we obtain. Our contributions are listed above.
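The distortion caused by WD can be made concrete on the two-dimensional example from Figure 1(b): the Rayleigh part depends only on the direction of the weight and the l2 term only on its norm, so the radial derivative of the combined loss equals 2r > 0 at every point, and the loss has no stationary point on R^2 \ {0}. The check below verifies this numerically; the sample points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(v):
    # Rayleigh Quotient part + l2 regularization, as in Figure 1(b).
    x, y = v
    r2 = x * x + y * y
    return (x * x + 2.0 * y * y) / r2 + r2

for _ in range(100):
    # Random point in polar coordinates, away from the origin.
    r = rng.uniform(0.5, 3.0)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    v = r * np.array([np.cos(theta), np.sin(theta)])
    u = v / r  # unit radial direction
    eps = 1e-6
    # Finite-difference derivative of f along the radial direction:
    # the Rayleigh part is constant along the ray, so only the l2 term
    # contributes, giving d(r^2)/dr = 2r > 0.
    radial = (f(v + eps * u) - f(v - eps * u)) / (2.0 * eps)
    assert np.isclose(radial, 2.0 * r, atol=1e-3)
```

Since the full gradient has a strictly positive radial component everywhere, no point of the regularized landscape is a local minimum in the usual sense, which is precisely why standard convexity- or stationarity-based analyses break down and a dedicated treatment such as SMD is needed.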

