HOW NORMALIZATION AND WEIGHT DECAY CAN AFFECT SGD? INSIGHTS FROM A SIMPLE NORMALIZED MODEL

Abstract

Recent works (Li

1. INTRODUCTION

Normalization (Ioffe & Szegedy, 2015; Wu & He, 2018) is one of the most widely used techniques in deep learning and has become an indispensable part of almost all popular deep neural network architectures. Although the success of normalization techniques is indubitable, their underlying mechanism remains mysterious and has become a hot topic in deep learning. Many works have contributed to understanding the mechanism of normalization from different aspects. While some works (Ioffe & Szegedy, 2015; Santurkar et al., 2018; Hoffer et al., 2018; Bjorck et al., 2018; Summers & Dinneen, 2019; De & Smith, 2020) focus on intuitive reasoning or empirical study, others (Dukler et al., 2020; Kohler et al., 2019; Cai et al., 2019; Arora et al., 2018; Yang et al., 2018; Wu et al., 2020) focus on establishing a theoretical foundation. A series of works (Van Laarhoven, 2017; Chiley et al., 2019; Kunin et al., 2021; Li et al., 2020; Wan et al., 2021; Lobacheva et al., 2021; Li & Arora, 2019) have noted that, in practical implementations, the gradient of normalized models is usually computed in a straightforward manner, which gives the model its scale-invariant property during training. The gradient with respect to a scale-invariant weight is always orthogonal to the weight, so the training trajectory behaves like motion on a sphere. Moreover, in practice many models are trained using SGD with Weight Decay (WD); together, normalization and WD in SGD can produce a so-called "equilibrium" state, in which the effects of the gradient and WD on the weight norm cancel out (see Fig. 1(a)). Although the concept of equilibrium was proposed long ago (Van Laarhoven, 2017), both theoretical justification and experimental evidence were lacking until recently.
Recent works (Li et al., 2020; Wan et al., 2021) justify the existence of equilibrium both theoretically and empirically, and characterize the underlying mechanism that yields equilibrium, named "Spherical Motion Dynamics" (SMD). Wan et al. (2021) further show that SMD exists in a wide range of computer vision tasks, including ImageNet (Deng et al., 2009) and MSCOCO (Lin et al., 2014). A more detailed review can be found in Appendix A.
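
The orthogonality and equilibrium behaviour described above can be illustrated numerically. The sketch below is a toy illustration only, not the model analyzed in this paper: it assumes a hypothetical scale-invariant loss L(w) = -<a, w/||w||>, where a is a noisy sample around a fixed vector `target`, and runs plain SGD with weight decay. The names `scale_invariant_grad`, `sgd_with_wd`, and `target`, as well as all hyperparameter values, are assumptions chosen for illustration.

```python
import numpy as np

def scale_invariant_grad(w, a):
    """Gradient of the toy loss L(w) = -<a, w / ||w||> with respect to w."""
    norm = np.linalg.norm(w)
    u = w / norm
    # Project a onto the tangent space at u, then rescale by 1 / ||w||.
    return -(a - (a @ u) * u) / norm

def sgd_with_wd(w0, target, lr=0.1, wd=0.01, steps=20000, noise=1.0, seed=0):
    """Plain SGD with weight decay on noisy samples a = target + noise."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    norms = np.empty(steps)
    for t in range(steps):
        a = target + noise * rng.standard_normal(w.shape)
        g = scale_invariant_grad(w, a)
        w = w - lr * (g + wd * w)  # gradient step plus weight decay
        norms[t] = np.linalg.norm(w)
    return w, norms

dim = 50  # hypothetical dimension for this toy problem
rng = np.random.default_rng(1)
target = rng.standard_normal(dim)
target /= np.linalg.norm(target)
w_init = rng.standard_normal(dim)

# Orthogonality: <grad, w> vanishes up to floating-point error.
g = scale_invariant_grad(w_init, target)
print("inner product <g, w>:", g @ w_init)

# Same initial direction, very different initial norms.
unit = w_init / np.linalg.norm(w_init)
_, norms_small = sgd_with_wd(0.5 * unit, target)
_, norms_large = sgd_with_wd(20.0 * unit, target)
print("late-time mean norm (small init):", norms_small[-1000:].mean())
print("late-time mean norm (large init):", norms_large[-1000:].mean())
```

Because the gradient is orthogonal to the weight, each gradient step can only increase the weight norm, while weight decay shrinks it. In this toy setting the two runs are expected to settle at nearly the same norm regardless of initialization, which is the kind of balance the term "equilibrium" refers to.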

