A SPHERICAL ANALYSIS OF ADAM WITH BATCH NORMALIZATION

Abstract

Paper under double-blind review

Batch Normalization (BN) is a prominent deep learning technique. In spite of its apparent simplicity, its implications for optimization are yet to be fully understood. While previous studies mostly focus on the interaction between BN and stochastic gradient descent (SGD), we develop a geometric perspective which allows us to precisely characterize the relation between BN and Adam. More precisely, we leverage the radial invariance of groups of parameters, such as filters for convolutional neural networks, to translate the optimization steps onto the L2 unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. Firstly, we use it to derive the first effective learning rate expression of Adam. Then we show that, in the presence of BN layers, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, our analysis outlines phenomena that previous variants of Adam act on, and we experimentally validate their importance in the optimization process.

1. INTRODUCTION

Figure 1: Illustration of the spherical perspective for SGD. The loss function L of a NN w.r.t. the parameters x_k ∈ R^d of a neuron followed by a BN is radially invariant. The neuron update x_k → x_{k+1} in the original space, with velocity η_k ∇L(x_k), corresponds to an update u_k → u_{k+1} of its projection through an exponential map on the unit hypersphere S^{d-1} with velocity η_k^e ∇L(u_k) at order 2 (see details in Section 2.3).

The optimization process of deep neural networks is still poorly understood. Their training involves minimizing a high-dimensional non-convex function, which has been proved to be an NP-hard problem (Blum & Rivest, 1989). Yet, elementary gradient-based methods show good results in practice. To improve the quality of reached minima, numerous methods have emerged in recent years and become common practice. One of the most prominent is Batch Normalization (BN) (Ioffe & Szegedy, 2015), which significantly improves both optimization stability and prediction performance; it is now used in most deep learning architectures. However, the interaction of BN with optimization and its link to regularization remain open research topics. Previous studies have highlighted mechanisms of the interaction between BN and SGD, both empirically (Santurkar et al., 2018) and theoretically (Arora et al., 2019; Bjorck et al., 2018; Hoffer et al., 2018b). None of them studied the interaction between BN and one of the most common adaptive schemes for neural networks (NNs), Adam (Kingma & Ba, 2015), except van Laarhoven (2017), which tackled it only in the asymptotic regime. In this work, we provide an extensive analysis of the relation between BN and Adam during the whole training procedure. One of the key effects of BN is to make NNs invariant to positive scalings of groups of parameters.
The core idea of this paper is precisely to focus on these groups of radially-invariant parameters and to analyze their optimization projected on the L2 unit hypersphere (see Fig. 1), which is topologically equivalent to the quotient manifold of the parameter space by the scaling action. One could directly optimize parameters on the hypersphere, as in Cho & Lee (2017); yet, most optimization methods are still performed successfully in the original parameter space. Here we propose to study an optimization scheme for a given group of radially-invariant parameters through its image scheme on the unit hypersphere. This geometric perspective sheds light on the interaction between normalization layers and Adam, and also outlines an interesting link between standard SGD and a variant of Adam adapted and constrained to the unit hypersphere: AdamG (Cho & Lee, 2017). We believe this kind of analysis is an important step towards a better understanding of the effect of BN on NN optimization. Please note that, although our discussion and experiments focus on BN, our analysis can be applied to any radially-invariant model. The paper is organized as follows. In Section 2, we introduce our spherical framework to study the optimization of radially-invariant models. We also define a generic optimization scheme that encompasses methods such as SGD with momentum (SGD-M) and Adam. We then derive its image step on the unit hypersphere, leading to definitions and expressions of an effective learning rate and an effective learning direction. This new definition is explicit and has a clear interpretation, whereas the definition of van Laarhoven (2017) is asymptotic and the definitions of Arora et al. (2019) and of Hoffer et al. (2018b) are variational. In Section 3, we leverage the tools of our spherical framework to demonstrate that, in the presence of BN layers, SGD has an adaptive behaviour.
Formally, we show that SGD is equivalent to AdamG, a variant of Adam adapted and constrained to the hypersphere, without momentum. In Section 4, we analyze the effective learning direction for Adam. The spherical framework highlights phenomena that previous variants of Adam (Loshchilov & Hutter, 2017; Cho & Lee, 2017) act on. We perform an empirical study of these phenomena and show that they play a significant role in the training of convolutional neural networks (CNNs). In Section 5, these results are put in perspective with related work. Our main contributions are the following:
• A framework to analyze and compare order-1 optimization schemes of radially-invariant models;
• The first explicit expression of the effective learning rate for Adam;
• The demonstration that, in the presence of BN layers, standard SGD has an adaptive behaviour;
• The identification and study of geometrical phenomena that occur with Adam and significantly impact the training of CNNs with BN.

2. SPHERICAL FRAMEWORK AND EFFECTIVE LEARNING RATE

In this section, we provide background on radial invariance and introduce a generic optimization scheme. Projecting the scheme update on the unit hypersphere leads to the formal definitions of effective learning rate and learning direction. This geometric perspective leads to the first explicit expression of the effective learning rate for Adam. The main notations are summarized in Figure 1 .

2.1. RADIAL INVARIANCE

We consider a family of parametric functions φ_x : R^in → R^out, parameterized by a group of radially-invariant parameters x ∈ R^d \ {0}, i.e., ∀ρ > 0, φ_ρx = φ_x (possible other parameters of φ_x are omitted for clarity), a dataset D ⊂ R^in × R^out, a loss function ℓ : R^out × R^out → R, and a training loss function L : R^d → R defined as:

L(x) := (1/|D|) Σ_{(s,t)∈D} ℓ(φ_x(s), t).   (1)

It verifies: ∀ρ > 0, L(ρx) = L(x). In the context of NNs, the group of radially-invariant parameters x can be the parameters of a single neuron in a linear layer or the parameters of a whole filter in a convolutional layer, followed by BN (see Appendix A for details, and Appendix B for the application to other normalization schemes such as InstanceNorm (Ulyanov et al., 2016), LayerNorm (Ba et al., 2016) or GroupNorm (Wu & He, 2018)). The quotient of the parameter space by the equivalence relation associated to radial invariance is topologically equivalent to a sphere. We consider here the L2 sphere S^{d-1} = {u ∈ R^d : ||u||_2 = 1}, whose canonical metric corresponds to angles: d_S(u_1, u_2) = arccos(⟨u_1, u_2⟩). This choice of metric is relevant to study NNs since filters in CNNs or neurons in MLPs are applied through scalar products to input data. Besides, normalization in BN layers is also performed using the L2 norm. Our framework relies on the decomposition of vectors into radial and tangential components. During optimization, we write the radially-invariant parameters at step k ≥ 0 as x_k = r_k u_k, where r_k = ||x_k|| and u_k = x_k/||x_k||. For any quantity q_k ∈ R^d at step k, we write q_k^⊥ = q_k − ⟨q_k, u_k⟩ u_k its tangential component relative to the current direction u_k. The following lemma states that the gradient of a radially-invariant loss function is tangential and (−1)-homogeneous:

Lemma 1 (Gradient of a function with radial invariance).
If L : R^d → R is radially invariant and almost everywhere differentiable, then, for all ρ > 0 and all x ∈ R^d where L is differentiable:

⟨∇L(x), x⟩ = 0   and   ∇L(x) = ρ ∇L(ρx).   (2)
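As a sanity check, Lemma 1 can be verified numerically on a toy radially-invariant function. The sketch below is our own illustration (not code from the paper): it uses the assumed toy loss L(x) = ⟨w, x/||x||⟩ for a fixed vector w, whose gradient has the closed form (w − ⟨w, u⟩u)/r.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
w = rng.normal(size=d)                 # fixed vector defining the toy loss

def loss(x):
    # Radially invariant toy loss: depends on x only through u = x/||x||.
    return w @ x / np.linalg.norm(x)

def grad(x):
    # Closed-form gradient of the toy loss: (w - <w, u> u) / r.
    r = np.linalg.norm(x)
    u = x / r
    return (w - (w @ u) * u) / r

x = rng.normal(size=d)
g = grad(x)
rho = 3.7

print(abs(g @ x) < 1e-12)                   # True: the gradient is tangential
print(np.allclose(g, rho * grad(rho * x)))  # True: (-1)-homogeneity
print(np.isclose(loss(x), loss(rho * x)))   # True: radial invariance of L
```

Both properties of Eq. 2 hold exactly here, up to floating-point error.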

2.2. GENERIC OPTIMIZATION SCHEME

There is a large body of literature on optimization schemes (Sutskever et al., 2013; Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2015; Loshchilov & Hutter, 2019). We focus here on two of the most popular ones, namely SGD and Adam (Kingma & Ba, 2015). Yet, to establish general results that may apply to a variety of other schemes, we introduce here a generic optimization update:

x_{k+1} = x_k − η_k a_k ⊘ b_k,   (3)
a_k = β a_{k-1} + ∇L(x_k) + λ x_k,   (4)

where x_k ∈ R^d is the group of radially-invariant parameters at iteration k, L is the group's loss estimated on a batch of input data, a_k ∈ R^d is a momentum, b_k ∈ R^d is a division vector that can depend on the trajectory (x_i, ∇L(x_i))_{i∈[0,k]}, η_k ∈ R is the scheduled trajectory-independent learning rate, ⊘ denotes the Hadamard element-wise division, β is the momentum parameter, and λ is the L2-regularization parameter. We show how it encompasses several known optimization schemes.

Stochastic gradient descent (SGD) has proven to be an effective optimization method in deep learning. It can include L2 regularization (also called weight decay) and momentum. Its updates are:

x_{k+1} = x_k − η_k m_k,   m_k = β m_{k-1} + ∇L(x_k) + λ x_k,   (5-6)

where m_k is the momentum, β is the momentum parameter, and λ is the L2-regularization parameter. It corresponds to our generic scheme (Eqs. 3-4) with a_k = m_k and b_k = [1 ⋯ 1]^T.

Adam is likely the most common adaptive scheme for NNs. Its updates are:

x_{k+1} = x_k − η_k (m_k/(1 − β_1^{k+1})) ⊘ (√(v_k/(1 − β_2^{k+1})) + ε),   (7)
m_k = β_1 m_{k-1} + (1 − β_1)(∇L(x_k) + λ x_k),   v_k = β_2 v_{k-1} + (1 − β_2)(∇L(x_k) + λ x_k)²,   (8)

where m_k is the momentum with parameter β_1, v_k is the second-order moment with parameter β_2, and ε prevents division by zero. (Here and in the following, the square and the square root of a vector are to be understood as element-wise.) It corresponds to our generic scheme (Eqs. 3-4) with β = β_1 and:

a_k = m_k/(1 − β_1),   b_k = ((1 − β_1^{k+1})/(1 − β_1)) (√(v_k/(1 − β_2^{k+1})) + ε).   (9)
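To make the generic scheme concrete, the sketch below (our own illustration; names such as `adam_factors` are not from the paper) instantiates Eq. 9 for Adam and cross-checks one short trajectory against a textbook Adam implementation on an assumed toy quadratic loss.

```python
import numpy as np

def generic_step(x, eta, a, b):
    # Generic update (Eq. 3): x_{k+1} = x_k - eta_k * (a_k ⊘ b_k).
    # SGD is recovered with a_k = m_k and b_k = (1, ..., 1).
    return x - eta * a / b

def adam_factors(g, state, k, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam as an instance of the generic scheme (Eq. 9):
    #   a_k = m_k / (1 - beta1),
    #   b_k = (1 - beta1^{k+1})/(1 - beta1) * (sqrt(v_k/(1 - beta2^{k+1})) + eps).
    # Here g is assumed to already include the L2 term, grad L(x_k) + lam * x_k.
    m = beta1 * state.get("m", np.zeros_like(g)) + (1 - beta1) * g
    v = beta2 * state.get("v", np.zeros_like(g)) + (1 - beta2) * g**2
    state["m"], state["v"] = m, v
    a = m / (1 - beta1)
    b = (1 - beta1 ** (k + 1)) / (1 - beta1) \
        * (np.sqrt(v / (1 - beta2 ** (k + 1))) + eps)
    return a, b

# Cross-check against a textbook Adam step on the toy loss ||x||^2.
rng = np.random.default_rng(1)
x_gen = rng.normal(size=8)
x_ref = x_gen.copy()
state, m, v = {}, np.zeros(8), np.zeros(8)
eta, b1, b2, eps = 1e-2, 0.9, 0.999, 1e-8
for k in range(5):
    a, b = adam_factors(2 * x_gen, state, k, b1, b2, eps)
    x_gen = generic_step(x_gen, eta, a, b)
    g = 2 * x_ref
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    x_ref = x_ref - eta * (m / (1 - b1 ** (k + 1))) \
                        / (np.sqrt(v / (1 - b2 ** (k + 1))) + eps)

print(np.allclose(x_gen, x_ref))   # True: the two formulations coincide
```

The factorization a_k ⊘ b_k reproduces the usual bias-corrected Adam step exactly, which is what Eq. 9 states.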

2.3. IMAGE OPTIMIZATION ON THE HYPERSPHERE

The radial invariance implies that the radial part of the parameter update x does not change the function φ_x encoded by the model, nor does it change the loss L(x). The goal of training is to find the best possible function encodable by the network. Due to radial invariance, the parameter space projected on the unit hypersphere is topologically closer to the functional space of the network than the full parameter space. It hints that looking at the optimization behaviour on the unit hypersphere might be interesting. Thus, we need to separate the quantities that can (tangential part) and cannot (radial part) change the model function. Theorem 2 formulates the spherical decomposition of the generic scheme (Eqs. 3-4) in simple terms. It relates the update of radially-invariant parameters in the parameter space R^d and their update on S^{d-1} through an exponential map.

Theorem 2 (Image step on S^{d-1}). The update of a group of radially-invariant parameters x_k at step k corresponds to an update of its projection u_k on S^{d-1} through an exponential map at u_k with velocity η_k^e c_k^⊥, at order 3:

u_{k+1} = Exp_{u_k}( −(1 + O(||η_k^e c_k^⊥||²)) η_k^e c_k^⊥ ),   (10)

where Exp_{u_k} is the exponential map on S^{d-1}, and with

c_k := d^{-1/2}||b_k|| (r_k a_k ⊘ b_k),   η_k^e := (η_k/(r_k² d^{-1/2}||b_k||)) (1 − η_k⟨c_k, u_k⟩/(r_k² d^{-1/2}||b_k||))^{-1}.   (11)

More precisely:

u_{k+1} = (u_k − η_k^e c_k^⊥)/√(1 + ||η_k^e c_k^⊥||²).   (12)

The proof is given in Appendix C.1.1 and the theorem is illustrated in the case of SGD in Figure 1. Note that with typical values in CNN training we have 1 − η_k⟨c_k, u_k⟩/(r_k² d^{-1/2}||b_k||) > 0, which is a property needed for the proof. Another hypothesis is that steps on the hypersphere are shorter than π. These hypotheses are discussed and empirically verified in Appendix C.1.2.
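For plain SGD (no momentum, no regularization), the image step of Eq. 12 is in fact exact: normalizing the ambient SGD iterate gives the same point as the spherical update with η^e = η/r² and c^⊥ = ∇L(u). The sketch below checks this on an assumed toy radially-invariant loss L(x) = ⟨w, x/||x||⟩ (our illustration, not code from the paper).

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
w = rng.normal(size=d)

def grad(x):
    # Gradient of the radially invariant toy loss L(x) = <w, x/||x||>.
    r = np.linalg.norm(x)
    u = x / r
    return (w - (w @ u) * u) / r

eta = 0.1
x = 2.5 * rng.normal(size=d)         # x_k = r_k u_k
r = np.linalg.norm(x)
u = x / r

# Ambient SGD step, then projection onto the sphere.
x_next = x - eta * grad(x)
u_ambient = x_next / np.linalg.norm(x_next)

# Image step of Eq. 12 for SGD: eta_e = eta / r^2 and c_perp = grad L(u_k).
step = (eta / r**2) * grad(u)        # grad(u) is tangential by Lemma 1
u_sphere = (u - step) / np.sqrt(1 + step @ step)

print(np.allclose(u_ambient, u_sphere))   # True: exact match for plain SGD
```

For schemes with momentum or element-wise division the correspondence is no longer exact, which is precisely why the theorem is stated at order 3.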

2.4. EFFECTIVE LEARNING RATE FOR ADAM

In Theorem 2, the normalized parameter update in Eq. 10 can be read u_{k+1} ≈ Exp_{u_k}(−η_k^e c_k^⊥), where η_k^e and c_k^⊥ can then be respectively interpreted as the learning rate and the direction of an optimization step constrained to S^{d-1}: a_k is the momentum and, with Lemma 1, the quantity r_k a_k in c_k can be seen as a momentum on the hypersphere. Due to the radial invariance, only the change of parameters on the unit hypersphere corresponds to a change of model function. Hence we can interpret η_k^e and c_k^⊥ as an effective learning rate and an effective learning direction. In other words, these quantities correspond to the learning rate and direction on the hypersphere that reproduce the function update of the optimization step. Using Theorem 2, we can derive actual effective learning rates for any optimization scheme that fits our generic framework, summarized in Table 1 (k omitted, with ν = r d^{-1/2}||b||):

Scheme     | η^e                              | c^⊥
SGD        | η/r²                             | ∇L(u)
SGD + L2   | η/(r²(1 − ηλ))                   | ∇L(u)
SGD-M      | (η/r²)(1 − η⟨c, u⟩/r²)^{-1}      | c^⊥
Adam       | (η/(rν))(1 − η⟨c, u⟩/(rν))^{-1}  | c^⊥

These expressions are explicit and have a clear interpretation, in contrast to the learning rates in (van Laarhoven, 2017), which are approximate and asymptotic, and in (Hoffer et al., 2018a; Arora et al., 2019), which are variational and restricted to SGD without momentum. In particular, we provide the first explicit expression of the effective learning rate for Adam:

η_k^e = (η_k/(r_k ν_k)) (1 − η_k⟨c_k, u_k⟩/(r_k ν_k))^{-1},   (13)

where ν_k = r_k d^{-1/2}||b_k|| is homogeneous to the norm of a gradient on the hypersphere and can be related to a second-order moment on the hypersphere (see Appendix C.1.3 for details). This notation also simplifies the in-depth analysis in Section 4, allowing a better interpretation of the formulas. The expression of the effective learning rate of Adam, i.e., the amplitude of the step taken on the hypersphere, reveals a dependence on the dimension d (through ν) of the considered group of radially-invariant parameters.
In the case of an MLP or CNN that stacks layers with neurons or filters of different dimensions, the learning rate is thus tuned differently from one layer to another. We can also see that for all schemes the learning rate is tuned by the dynamics of the radii r_k, which follow:

r_{k+1}/r_k = (1 − η_k⟨c_k, u_k⟩/(r_k² d^{-1/2}||b_k||)) √(1 + ||η_k^e c_k^⊥||²).   (14)

In contrast to previous studies (Arora et al., 2019; van Laarhoven, 2017), this result demonstrates that for momentum methods, ⟨c_k, u_k⟩, which involves accumulated gradient terms in the momentum as well as L2 regularization, tunes the learning rate (cf. Fig. 1).
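For plain SGD, ⟨c_k, u_k⟩ = 0 by Lemma 1, so the radius recurrence reduces to r_{k+1} = r_k √(1 + ||η_k^e ∇L(u_k)||²): the radius can only grow, which in turn shrinks the effective learning rate η/r². The sketch below checks this on an assumed toy radially-invariant loss (our own illustration, not code from the paper).

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
w = rng.normal(size=d)

def grad(x):
    # Gradient of the radially invariant toy loss L(x) = <w, x/||x||>.
    r = np.linalg.norm(x)
    u = x / r
    return (w - (w @ u) * u) / r

eta = 0.1
x = rng.normal(size=d)
radii, eff_lrs = [], []
for _ in range(50):
    r = np.linalg.norm(x)
    u = x / r
    radii.append(r)
    eff_lrs.append(eta / r**2)           # effective LR of plain SGD (Table 1)
    # Predicted radius ratio for plain SGD, where <c_k, u_k> = 0:
    g_u = grad(u)
    predicted = np.sqrt(1 + (eta / r**2) ** 2 * (g_u @ g_u))
    x = x - eta * grad(x)                # ambient SGD step
    assert np.isclose(np.linalg.norm(x) / r, predicted)

print(radii[-1] > radii[0])      # True: the radius only grows, ...
print(eff_lrs[-1] < eff_lrs[0])  # True: ... so eta/r^2 automatically decays
```

The inline assertion verifies Eq. 14 at every step; the two printed facts are the adaptive behaviour discussed in Section 3.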

3. SGD IS A VARIATION OF ADAM ON THE HYPERSPHERE

We leverage the tools introduced in the spherical framework to find a scheme constrained to the hypersphere that is equivalent to SGD. It shows that, for radially-invariant models, SGD is actually an adaptive optimization method. Formally, SGD is equivalent to a version of AdamG, a variant of Adam adapted and constrained to the unit hypersphere, without momentum.

3.1. EQUIVALENCE BETWEEN TWO OPTIMIZATION SCHEMES

Due to the radial invariance, the functional space of the model is encoded by S^{d-1}. In other words, two schemes with the same sequence of groups of radially-invariant parameters on the hypersphere (u_k)_{k≥0} encode the same sequence of model functions. Two optimization schemes S and S̃ are equivalent iff ∀k ≥ 0, u_k = ũ_k. By using Eq. 12, we obtain the following lemma, which is useful to prove the equivalence of two given optimization schemes:

Lemma 3 (Sufficient condition for the equivalence of optimization schemes).

u_0 = ũ_0 and ∀k ≥ 0, η_k^e = η̃_k^e, c_k^⊥ = c̃_k^⊥  ⟹  ∀k ≥ 0, u_k = ũ_k.   (15)

3.2. A HYPERSPHERE-CONSTRAINED SCHEME EQUIVALENT TO SGD

We now study, within our spherical framework, SGD with L2 regularization, i.e., the update x_{k+1} = x_k − η_k(∇L(x_k) + λ_k x_k). From the effective learning rate expression, we know that SGD yields an adaptive behaviour because it is scheduled by the radius dynamics, which depends on gradients. In fact, the tools in our framework allow us to find a scheme constrained to the unit hypersphere that is equivalent to SGD: AdamG (Cho & Lee, 2017). More precisely, it is AdamG with a null momentum factor β_1 = 0, a non-null initial second-order moment v_0, an offset of the scalar second-order moment (k+1 → k), and without the bias correction term 1 − β_2^{k+1}. Dubbed AdamG*, this scheme reads:

(AdamG*):  x̃_{k+1} = x_k − η_k ∇L(x_k)/√(v_k),
           x_{k+1} = x̃_{k+1}/||x̃_{k+1}||,
           v_{k+1} = β v_k + ||∇L(x_k)||².

Starting from SGD, we first use Lemma 3 to find an equivalent scheme with a simpler radius dynamic. We resolve this radius dynamic with a Taylor expansion at order 2 in (η_k||∇L(u_k)||)²/r_k². A second use of Lemma 3 finally leads to the scheme equivalence in Theorem 4 (see proof in Appendix C.1.4). If we call "equivalent at order 2 in the step" a scheme equivalence that holds when we use for r_k an expression that satisfies the radius dynamic up to a Taylor expansion at order 2, we have the following theorem:

Theorem 4 (SGD equivalent scheme on the unit hypersphere). For any λ > 0, η > 0, r_0 > 0, the scheme

(SGD):    x_0 = r_0 u_0,  λ_k = λ,  η_k = η

is scheme-equivalent at order 2 in the step to

(AdamG*): x_0 = u_0,  β = (1 − ηλ)^4,  η_k = (2β)^{-1/2},  v_0 = r_0^4 (2η² β^{1/2})^{-1}.

This result is unexpected, because SGD, which is not adaptive by itself, is equivalent to a second-order moment adaptive method. The scheduling performed by the radius dynamics actually replicates the effect of dividing the learning rate by the second-order moment of the gradient norm, v_k. First, the only assumption for this equivalence is to neglect the error term in the Taylor expansion at order 2 of the radius, which is largely verified in practice (order of magnitude of 1e-4; see Appendix C.1.5). Second, with standard values of the hyper-parameters, i.e., learning rate η < 1 and weight decay λ < 1, we have β ≤ 1, which corresponds to a standard value for a moment factor. Interestingly, the L2 regularization parameter λ controls the memory of the past gradient norms: if β = 1 (i.e., λ = 0), there is no attenuation and each gradient norm contributes equally to the second-order moment; if λ > 0, there is a decay factor (β < 1) on past gradient norms in the second-order moment.
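Theorem 4 lends itself to a direct numerical probe: running SGD with L2 in the ambient space and AdamG* on the sphere with the mapped hyper-parameters should produce nearly identical trajectories (u_k). The sketch below is our own illustration on an assumed toy radially-invariant loss; the tolerance accounts for the order-2 approximation.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
w = rng.normal(size=d)

def grad(x):
    # Gradient of the radially invariant toy loss L(x) = <w, x/||x||>.
    r = np.linalg.norm(x)
    u = x / r
    return (w - (w @ u) * u) / r

eta, lam, r0, n_steps = 0.01, 0.1, 1.0, 10

u0 = rng.normal(size=d)
u0 /= np.linalg.norm(u0)

# SGD with L2 regularization in the ambient space, x_0 = r_0 u_0.
x = r0 * u0
for _ in range(n_steps):
    x = x - eta * (grad(x) + lam * x)
u_sgd = x / np.linalg.norm(x)

# AdamG* on the unit hypersphere, with the hyper-parameters of Theorem 4.
beta = (1 - eta * lam) ** 4
eta_g = (2 * beta) ** -0.5
v = r0 ** 4 / (2 * eta ** 2 * beta ** 0.5)   # v_0
u = u0.copy()
for _ in range(n_steps):
    g = grad(u)
    u_tilde = u - eta_g * g / np.sqrt(v)     # gradient step, scalar 2nd moment
    u = u_tilde / np.linalg.norm(u_tilde)    # projection back onto the sphere
    v = beta * v + g @ g                     # v_{k+1} = beta v_k + ||grad||^2

# The two trajectories coincide up to the order-2 approximation of Theorem 4.
print(np.allclose(u_sgd, u, atol=1e-4))      # True
```

At step 0 the correspondence is exact by construction of v_0; the small drift over later steps comes only from the truncated Taylor expansion of the radius.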

4. GEOMETRIC PHENOMENA IN ADAM

Our framework, with its geometrical interpretation, reveals intriguing behaviors occurring in Adam. The unit hypersphere is enough to represent the functional space encoded by the network. From the perspective of manifold optimization, the optimization direction would only depend on the trajectory on that manifold. In the case of Adam, the effective direction not only depends on the trajectory on the hypersphere, but also on deformed gradients and additional radial terms. These terms are thus likely to play a role in Adam optimization. In order to understand their role, we describe these geometrical phenomena in Section 4.1. Interestingly, previous variants of Adam, AdamW (Loshchilov & Hutter, 2017) and AdamG (Cho & Lee, 2017), are related to these phenomena. To study their importance empirically, we consider in Section 4.2 variants of Adam that first provide a direction intrinsic to the unit hypersphere, without deformation of the gradients, and then where radial terms are decoupled from the direction. The empirical study of these variants over a variety of datasets and architectures suggests that these behaviors do play a significant role in CNN training with BN.

4.1. IDENTIFICATION OF GEOMETRICAL PHENOMENA IN ADAM

Here, we perform an in-depth analysis of the effective learning direction of Adam.

(a) Deformed gradients. Considering the quantities defined for a generic scheme in Eq. 11, b_k has a deformation effect on a_k, due to the Hadamard division by b_k/(d^{-1/2}||b_k||), and a scheduling effect d^{-1/2}||b_k|| on the effective learning rate. In the case where the momentum factor is null (β_1 = 0), the direction of the update at step k is ∇L(u_k) ⊘ (b_k/(d^{-1/2}||b_k||)) (Eq. 11), and this deformation may push the direction of the update outside the tangent space of S^{d-1} at u_k, whereas the gradient itself lies in the tangent space. This deformation is in fact not isotropic: the displacement of the gradient from the tangent space depends on the position of u_k on the sphere. We illustrate this anisotropy in Fig. 2(b).

(b) Additional radial terms. In the momentum on the sphere c_k, quantities that are radial (resp. orthogonal) at a point on the sphere may not be radial (resp. orthogonal) at another point. To clarify the contribution of c_k to the effective learning direction c_k^⊥, we perform the following decomposition (cf. Appendix D.1):

c_k = (c_k^grad + λ r_k² c_k^L2) ⊘ (b_k/(d^{-1/2}||b_k||)),   (16)

with

c_k^grad := ∇L(u_k) + Σ_{i=0}^{k-1} β^{k-i} (r_k/r_i) ∇L(u_i)   and   c_k^L2 := u_k + Σ_{i=0}^{k-1} β^{k-i} (r_i/r_k) u_i.   (17)

1. Contribution of c_k^grad. At step k, the contribution of each past gradient corresponds to its orthogonal part ∇L(u_i) − ⟨∇L(u_i), u_k⟩ u_k. It impacts the effective learning direction depending on its orientation relative to u_k. Two past points, although equally distant from u_k on the sphere and with equal gradient amplitude, may thus contribute differently to c_k^⊥ due to their orientation (cf. Fig. 2(c)).

2. Contribution of c_k^L2. Naturally, the current point u_k does not contribute to the effective learning direction c_k^⊥, unlike the history of points in Σ_{i=0}^{k-1} β^{k-i} (r_i/r_k) u_i, which does.
This dependency can be avoided if we decouple the L2 regularization, in which case we do not accumulate L2 terms in the momentum. This shows that the decoupling proposed in AdamW (Loshchilov & Hutter, 2019) actually removes the contribution of L2 regularization to the effective learning direction.

(c) Radius ratios. The radius ratio r_k/r_i, present in both c_k^grad and c_k^L2 (in inverse proportion), impacts the effective learning direction c_k^⊥: it can differ for identical sequences (u_i)_{i≤k} on the sphere but with distinct radius histories (r_i)_{i≤k}. Since the radius is closely related to the effective learning rate, it means that the effective learning direction c_k^⊥ is adjusted according to the learning rate history. Note that AdamG (Cho & Lee, 2017), by constraining the optimization to the unit hypersphere and thus removing L2 regularization, neutralizes all the above phenomena. However, this method loses the scheduling effect induced by the radius dynamics (cf. Eq. 14), since the radius is kept constant during training.
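These phenomena admit a compact numeric illustration. The sketch below is our own toy example (not code from the paper): it uses small hand-picked vectors, the deformation ψ of Fig. 2(b), and the standard closed-form parallel transport along a great circle, which we assume matches the canonical transport Γ used by the variant of Section 4.2.

```python
import numpy as np

d = 4
u = np.array([0.6, 0.8, 0.0, 0.0])   # current point u_k on the unit sphere
g = np.array([0.8, -0.6, 1.0, 1.0])  # tangential gradient at u_k: <g, u> = 0

# (a) The deformation of Fig. 2(b), psi(g) = (g ⊘ |g|) d^{-1/2} ||g||,
# i.e. a rescaled sign vector: it leaves the tangent space of the sphere.
psi = np.sign(g) * d ** -0.5 * np.linalg.norm(g)
print(abs(g @ u) < 1e-12)   # True: tangential before deformation
print(psi @ u)              # ~ -0.173: no longer tangential afterwards

# (b) Canonical parallel transport along the great circle from u to v
# (standard closed form, assumed here; the Section 4.2 variant uses such
# a transport to move the momentum between consecutive iterates):
def transport(m, u, v):
    return m - (m @ v) / (1 + u @ v) * (u + v)

v = np.array([0.0, 0.0, 0.6, 0.8])   # next point u_{k+1}
m_t = transport(g, u, v)
print(abs(m_t @ v) < 1e-12)                                # True: tangential at v
print(np.isclose(np.linalg.norm(m_t), np.linalg.norm(g)))  # True: norm preserved
```

The deformed vector ψ(g) acquires a radial component (here ⟨ψ(g), u⟩ ≈ −0.17), whereas the transported momentum stays tangential with its norm intact, which is exactly the property the Section 4.2 variants exploit.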

4.2. EMPIRICAL STUDY

To study the importance of the identified geometric phenomena empirically, we perform an ablation study: we compare the performance (accuracy and training-loss speed) of Adam and of variants that neutralize each of them. We recall that AdamW neutralizes (b2) and that AdamG neutralizes all of the above phenomena but loses the scheduling effect identified in Eq. 14. To complete our analysis, we use geometrical tools to design variants of Adam which sequentially neutralize each phenomenon while preserving the natural scheduling effect in Theorem 2. We neutralize (a) by replacing the element-wise second-order moment, (b1) and (b2) by transporting the momentum from the current point to the new one, and (c) by re-scaling the momentum at step k. The details are in Appendix D.2. The final scheme reads:

x_{k+1} = x_k − η_k (m_k/(1 − β_1^{k+1})) / (√(v_k/(1 − β_2^{k+1})) + ε),
m_k = β_1 (r_{k-1}/r_k) Γ_{u_{k-1}}^{u_k}(m_{k-1}) + (1 − β_1)(∇L(x_k) + λ x_k),
v_k = β_2 (r_{k-1}²/r_k²) v_{k-1} + (1 − β_2) d^{-1} ||∇L(x_k) + λ x_k||²,

where Γ_{u_{k-1}}^{u_k} is the canonical transport on the hypersphere from u_{k-1} to u_k. Implementation details are in Appendix D.3.

Protocol. For evaluation, we conduct experiments on two architectures: VGG16 (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016), more precisely ResNet20, a simple variant designed for small images (He et al., 2016), and ResNet18, a popular variant for image classification. We consider three datasets: SVHN (Netzer et al., 2011), CIFAR10 and CIFAR100 (Krizhevsky et al., 2009). Since our goal is to evaluate the significance of phenomena on radially-invariant parameters, i.e., the convolution filters followed by BN, we only apply variants of Adam, including AdamG and AdamW, to convolution layers. For comparison consistency, we keep standard Adam on the remaining parameters. We also use a fixed grid hyperparameter search budget and frequency for each method and each architecture (see Appendix D.3 for details). Results.
In Table 2, we report quantitative results of Adam variants across architectures and datasets. In addition, we compare the evolution of the training loss in Fig. 3. We observe that each phenomenon displays a specific trade-off between generalization (accuracy on the test set) and training speed, as follows. Neutralizing (a) has little effect on speed over Adam, yet achieves better accuracy. Although it slows down training, neutralizing (ab) leads to minima with the overall best accuracy on the test set. Note that AdamW† neutralizes (b2) with its decoupling and is the fastest method, but finds minima with the overall worst generalization properties. By constraining the optimization to the hypersphere, AdamG† speeds up training over the other variants. Finally, neutralizing (c) with Adam […]

5. RELATED WORK

Batch normalization. The mechanism originally proposed to explain the success of BN (Ioffe & Szegedy, 2015) has been challenged and shown to be secondary to the smoothing of the optimization landscape (Santurkar et al., 2018; Ghorbani et al., 2019), to its modification of the objective function (Lian & Liu, 2019), or to the enabling of high learning rates through improved conditioning (Bjorck et al., 2018). Arora et al. (2019) demonstrate that (S)GD with BN is robust to the choice of the learning rate, with guaranteed asymptotic convergence, while a similar finding for GD with BN is made by Cai et al. (2019).

Invariances in neural networks. Cho & Lee (2017) propose optimizing over the Grassmann manifold using Riemannian GD. Liu et al. (2017) project weights and activations on the unit hypersphere and compute a function of the angle between them instead of inner products, and subsequently generalize these operators by scaling the angle (Liu et al., 2018). In (Li & Arora, 2020), the radial invariance is leveraged to prove that weight decay (WD) can be replaced by an exponential learning-rate schedule for SGD, with or without momentum. Arora et al.
(2019) investigate the radial invariance and show that the radius dynamics depends on the past gradients, offering an adaptive behavior to the learning rate. Here we go further and show that SGD projected on the unit hypersphere corresponds to a variant of Adam constrained to the hypersphere, and we give an accurate definition of this adaptive behavior.

Effective learning rate. Due to its scale invariance, BN can adaptively adjust the learning rate (van Laarhoven, 2017; Cho & Lee, 2017; Arora et al., 2019; Li & Arora, 2020). van Laarhoven (2017) shows that in BN-equipped networks, WD increases the effective learning rate by reducing the norm of the weights. Conversely, without WD, the norm grows unbounded (Soudry et al., 2018), decreasing the effective learning rate. Zhang et al. (2019) bring additional evidence supporting the hypothesis of van Laarhoven (2017), while Hoffer et al. (2018a) find an exact formulation of the effective learning rate for SGD in normalized networks. In contrast with prior work, we give generic definitions of the effective learning rate, with exact expressions for SGD and Adam.

6. CONCLUSION

The spherical framework introduced in this study provides a powerful tool to analyse the Adam optimization scheme through its projection on the L2 unit hypersphere. It allows us to give a precise definition and expression of the effective learning rate for Adam, to relate SGD to a variant of Adam, and to identify geometric phenomena which empirically impact training. The framework also sheds light on existing variants of Adam, such as L2-regularization decoupling. This approach could be extended to other invariances in CNNs, such as filter permutation.



Figure 2: (a) Effect of the radial part of c_k on the displacement on S^{d-1}; (b) Example of anisotropy and sign instability for the deformation ψ(∇L(u_k)) = (∇L(u_k) ⊘ |∇L(u_k)|) d^{-1/2} ||∇L(u_k)|| (where |·| is the element-wise absolute value) occurring in Adam's first optimization step; (c) Different contributions to c_k^⊥ of two past gradients ∇_1 and ∇_2 of equal norm, depending on their orientation, and illustration of the transport of ∇_1 from u_{k-1} to u_k: Γ_{u_{k-1}}^{u_k}(∇_1) (cf. Appendix D.2 for details).

Figure 3: Training speed comparison with ResNet20 on CIFAR10. Left: Mean training loss over all training epochs (averaged across 5 seeds) for different Adam variants. Right: Zoom-in on the last epochs. Please refer to Table 2 for the corresponding accuracies.

Table 1: Effective learning rate and direction for optimization schemes (k omitted), with ν = r d^{-1/2}||b||.

Table 2: Accuracy of Adam and its variants. The figures in this table are the mean top-1 accuracy ± the standard deviation over 5 seeds, on the test set for CIFAR10 and CIFAR100, and on the validation set for SVHN.

