ADAMP: SLOWING DOWN THE SLOWDOWN FOR MOMENTUM OPTIMIZERS ON SCALE-INVARIANT WEIGHTS

Abstract

Normalization techniques, such as batch normalization (BN), are a boon for modern deep learning. They let weights converge more quickly, often with better generalization performance. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters (e.g. more than 90% of the weights in ResNet are scale-invariant due to BN). In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance. We propose a simple and effective remedy, SGDP and AdamP: remove the radial component, i.e. the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus preserving the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks such as classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings consistent performance gains across these benchmarks.

1. INTRODUCTION

Normalization techniques, such as batch normalization (BN) (Ioffe & Szegedy, 2015), layer normalization (LN) (Ba et al., 2016), instance normalization (IN) (Ulyanov et al., 2016), and group normalization (GN) (Wu & He, 2018), have become standard tools for training deep neural network models. Originally proposed to reduce the internal covariate shift (Ioffe & Szegedy, 2015), normalization methods have proven to encourage several desirable properties in deep neural networks, such as better generalization (Santurkar et al., 2018) and scale invariance (Hoffer et al., 2018). Prior studies have observed that the normalization-induced scale invariance of weights stabilizes the convergence of neural network training (Hoffer et al., 2018; Arora et al., 2019; Kohler et al., 2019; Dukler et al., 2020). We provide a sketch of the argument here. Given weights $w$ and an input $x$, we observe that the normalization makes the weights scale-invariant:

$$\mathrm{Norm}(w^\top x) = \mathrm{Norm}(c\,w^\top x) \quad \forall c > 0. \tag{1}$$

The resulting equivalence relation among the weights lets us consider the weights only in terms of their $\ell_2$-normalized vectors $\hat{w} := w / \|w\|_2$ on the sphere $S^{d-1} = \{v \in \mathbb{R}^d : \|v\|_2 = 1\}$. We refer to $S^{d-1}$ as the effective space, as opposed to the nominal space $\mathbb{R}^d$ where the actual optimization algorithms operate. The mismatch between these spaces results in a discrepancy between the gradient descent steps on $\mathbb{R}^d$ and their effective steps on $S^{d-1}$. Specifically, for the gradient descent updates, the effective step sizes $\|\Delta \hat{w}_{t+1}\|_2 := \|\hat{w}_{t+1} - \hat{w}_t\|_2$ are the nominal step sizes $\|\Delta w_{t+1}\|_2 := \|w_{t+1} - w_t\|_2$ scaled by the factor $1/\|w_t\|_2$ (Hoffer et al., 2018). Since $\|w_t\|_2$ increases during training (Soudry et al., 2018; Arora et al., 2019), the effective step sizes $\|\Delta \hat{w}_t\|_2$ decrease as the optimization progresses. This automatic decrease in step sizes stabilizes the convergence of gradient descent algorithms applied to models with normalization layers: even if the nominal learning rate is set to a constant, the theoretically optimal convergence rate is guaranteed (Arora et al., 2019).

In this work, we show that widely used momentum-based gradient descent optimizers (e.g. SGD and Adam (Kingma & Ba, 2015)) decrease the effective step sizes $\|\Delta \hat{w}_t\|_2$ even more rapidly than the momentum-less counterparts considered in Arora et al. (2019). This leads to slower effective convergence of $\hat{w}_t$ and potentially sub-optimal model performance. We illustrate this effect on a 2D toy optimization problem in Figure 1. Compared to "GD", "GD+momentum" is much faster in the nominal space $\mathbb{R}^2$, but the norm growth slows down the effective convergence on $S^1$, reducing the acceleration effect of momentum. This phenomenon is not confined to the toy setup: for example, 95.5% and 91.8% of the parameters of the widely-used ResNet18 and ResNet50 (He et al., 2016) are scale-invariant due to BN, and the majority of deep models nowadays are trained with SGD or Adam with momentum. And yet, our paper is the first to delve into the issue arising from this widely-used combination of scale-invariant parameters and momentum-based optimizers.

We propose a simple solution that slows down the decay of effective step sizes while maintaining the step directions of the original optimizer in the effective space. At each iteration of a momentum-based gradient descent optimizer, we propose to project out the radial component (i.e. the component parallel to $w$) from the update, thereby reducing the increase in the weight norm over time.
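The norm-growth mechanism behind the scale-invariance argument above can be checked numerically. The following sketch is not from the paper: it uses an illustrative cosine-similarity objective (`cosine_loss`) and a finite-difference helper (`num_grad`), both hypothetical names, to verify that the gradient of a scale-invariant loss is orthogonal to $w$ (so every GD step can only increase $\|w\|_2$) and that rescaling $w$ by $c$ scales the gradient by $1/c$ (so effective steps shrink as the norm grows).

```python
import numpy as np

# Illustrative scale-invariant objective: negative cosine similarity to a fixed target.
def cosine_loss(w, w_star):
    return -np.dot(w, w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))

def num_grad(f, w, eps=1e-6):
    # Central-difference gradient; accurate enough for a sanity check.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w_star = rng.normal(size=8)
w = rng.normal(size=8)

g = num_grad(lambda v: cosine_loss(v, w_star), w)
g_scaled = num_grad(lambda v: cosine_loss(v, w_star), 3.0 * w)  # rescale weights by c = 3

# Gradient is orthogonal to w, so a GD step can only increase ||w||_2 ...
print("g . w ~ 0:", abs(np.dot(g, w)) < 1e-5)
# ... and rescaling w by c scales the gradient norm by 1/c, shrinking effective steps.
print("||g(3w)|| ~ ||g(w)||/3:", abs(np.linalg.norm(g_scaled) - np.linalg.norm(g) / 3) < 1e-5)
```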
Because of the scale invariance, the proposed projection does not alter the update direction in the effective space; it only changes the effective step sizes. We can observe the benefit of our optimizer in the toy setting in Figure 1. "Ours" suppresses the norm growth and thus slows down the effective learning rate decay, allowing the momentum-accelerated convergence in $\mathbb{R}^2$ to be transferred to the effective space $S^1$. "Ours" converges most quickly and achieves the best terminal objective value. We do not discourage the use of momentum-based optimizers; momentum is often an indispensable ingredient that enables the best performance of deep neural networks. Instead, we propose our method to help momentum realize its full potential by letting the acceleration operate on the effective space, rather than squandering it on increasing the weight norms to no avail. The projection algorithm is simple and readily applicable to various optimizers for deep neural networks. We apply this technique to SGD and Adam (SGDP and AdamP, respectively) and verify the slower decay of effective learning rates as well as the resulting performance boosts over a diverse set of practical machine learning tasks, including image classification, image retrieval, object detection, robustness benchmarks, audio classification, and language modelling. As a side note, we have identified certain similarities between our approaches and Cho & Lee (2017), who considered performing the optimization steps for the scale-invariant parameters on the spherical manifold. We argue that our approaches are conceptually different, as ours operate on the ambient Euclidean space, and are more practical. See Appendix §G.1 for a more detailed argument based on conceptual and empirical comparisons.
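As a minimal sketch of this idea, one projected heavy-ball momentum step for a single scale-invariant weight tensor could look as follows. This is an illustration of the projection described above, not the authors' reference implementation (the official SGDP/AdamP code is in the repository linked at the end of this section), and the hyperparameter values are placeholders.

```python
import numpy as np

def projected_momentum_step(w, grad, buf, lr=0.1, momentum=0.9, eps=1e-12):
    """One momentum-SGD step where the radial component (parallel to w)
    is projected out of the update, so the weight norm grows far more slowly."""
    buf = momentum * buf + grad                  # accumulate heavy-ball momentum
    w_hat = w / (np.linalg.norm(w) + eps)        # unit vector along w
    update = buf - np.dot(buf, w_hat) * w_hat    # remove the norm-increasing direction
    return w - lr * update, buf
```

In practice such a projection only makes sense for the scale-invariant parameters (e.g. weights directly followed by a normalization layer); parameters that are not scale-invariant would receive the ordinary momentum update.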

2. PROBLEM

Widely-used normalization techniques (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016; Ulyanov et al., 2016; Wu & He, 2018) in deep networks result in the scale invariance for



Figure 1. Optimizer trajectories. Shown are the iterates $w_t$ for the optimization problem $\max_w \frac{w^\top w^\star}{\|w\|_2 \|w^\star\|_2}$. Trajectories start from $w_0$ towards the optimal solution $w^\star$. The problem is invariant to the scale of $w$. Video version in the attached code.
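A rough reconstruction of this toy experiment is sketched below, assuming a 2D setup with illustrative starting point and hyperparameters (the exact settings of Figure 1 are in the attached code). It compares plain momentum GD with the projected variant sketched above and reports the terminal cosine similarity and weight norm.

```python
import numpy as np

def grad_neg_cosine(w, w_star):
    # Gradient of -cos(w, w_star) with respect to w.
    wn, sn = np.linalg.norm(w), np.linalg.norm(w_star)
    cos = np.dot(w, w_star) / (wn * sn)
    return -(w_star / (wn * sn) - cos * w / wn ** 2)

def run(project_radial, steps=100, lr=0.1, momentum=0.9):
    w_star = np.array([1.0, 0.0])   # optimal direction w* (illustrative)
    w = np.array([-0.5, 1.0])       # starting point w0 (illustrative)
    buf = np.zeros_like(w)
    for _ in range(steps):
        buf = momentum * buf + grad_neg_cosine(w, w_star)
        update = buf
        if project_radial:          # "Ours": drop the component parallel to w
            w_hat = w / np.linalg.norm(w)
            update = update - np.dot(update, w_hat) * w_hat
        w = w - lr * update
    cos = np.dot(w, w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))
    return cos, np.linalg.norm(w)

for name, flag in [("GD+momentum", False), ("projected (ours-like)", True)]:
    cos, norm = run(flag)
    print(f"{name:22s} terminal cosine = {cos:.4f}, ||w||_2 = {norm:.2f}")
```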


Source code is available at https://github.com/clovaai

