ADAMP: SLOWING DOWN THE SLOWDOWN FOR MOMENTUM OPTIMIZERS ON SCALE-INVARIANT WEIGHTS

Abstract

Normalization techniques, such as batch normalization (BN), are a boon for modern deep learning. They let weights converge more quickly, often with better generalization performance. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters (e.g. more than 90% of the weights in ResNet are scale-invariant due to BN). In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performances. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform performance gains across those benchmarks.

1. INTRODUCTION

Normalization techniques, such as batch normalization (BN) (Ioffe & Szegedy, 2015), layer normalization (LN) (Ba et al., 2016), instance normalization (IN) (Ulyanov et al., 2016), and group normalization (GN) (Wu & He, 2018), have become standard tools for training deep neural network models. Originally proposed to reduce the internal covariate shift (Ioffe & Szegedy, 2015), normalization methods have proven to encourage several desirable properties in deep neural networks, such as better generalization (Santurkar et al., 2018) and scale invariance (Hoffer et al., 2018). Prior studies have observed that the normalization-induced scale invariance of weights stabilizes the convergence of neural network training (Hoffer et al., 2018; Arora et al., 2019; Kohler et al., 2019; Dukler et al., 2020). We provide a sketch of the argument here. Given weights $w$ and an input $x$, we observe that normalization makes the weights scale-invariant:

$$\mathrm{Norm}(w^\top x) = \mathrm{Norm}(c\,w^\top x) \quad \forall c > 0. \qquad (1)$$

The resulting equivalence relation among the weights lets us consider the weights only in terms of their $\ell_2$-normalized vectors $\widehat{w} := w / \|w\|_2$ on the sphere $S^{d-1} = \{v \in \mathbb{R}^d : \|v\|_2 = 1\}$. We refer to $S^{d-1}$ as the effective space, as opposed to the nominal space $\mathbb{R}^d$ where the actual optimization algorithms operate. The mismatch between these spaces results in a discrepancy between the gradient descent steps on $\mathbb{R}^d$ and their effective steps on $S^{d-1}$. Specifically, for gradient descent updates, the effective step sizes $\|\Delta \widehat{w}_{t+1}\|_2 := \|\widehat{w}_{t+1} - \widehat{w}_t\|_2$ are the nominal step sizes $\|\Delta w_{t+1}\|_2 := \|w_{t+1} - w_t\|_2$ scaled by the factor $1/\|w_t\|_2$ (Hoffer et al., 2018). Since $\|w_t\|_2$ increases during training (Soudry et al., 2018; Arora et al., 2019), the effective step sizes $\|\Delta \widehat{w}_t\|_2$ decrease as the optimization progresses.
The automatic decrease in step sizes stabilizes the convergence of gradient descent algorithms applied to models with normalization layers: even if the nominal learning rate is set to a constant, the theoretically optimal convergence rate is guaranteed (Arora et al., 2019). In this work, we show that the widely used momentum-based gradient descent optimizers (e.g. SGD and Adam (Kingma & Ba, 2015)) decrease the effective step size $\|\Delta \widehat{w}_t\|_2$ even more rapidly than the momentum-less counterparts considered in Arora et al. (2019). This leads to slower effective convergence for $\widehat{w}_t$ and potentially sub-optimal model performances. We illustrate this effect on a 2D toy optimization problem in Figure 1. Compared to "GD", "GD+momentum" is much faster in the nominal space $\mathbb{R}^2$, but the norm growth slows down the effective convergence in $S^1$, reducing the acceleration effect of momentum. This phenomenon is not confined to the toy setup: for example, 95.5% and 91.8% of the parameters of the widely-used ResNet18 and ResNet50 (He et al., 2016) are scale-invariant due to BN, and the majority of deep models nowadays are trained with SGD or Adam with momentum. And yet, our paper is the first to delve into this issue arising from the widely-used combination of scale-invariant parameters and momentum-based optimizers. We propose a simple solution that slows down the decay of effective step sizes while maintaining the step directions of the original optimizer in the effective space. At each iteration of a momentum-based gradient descent optimizer, we propose to project out the radial component (i.e. the component parallel to $w$) from the update, thereby reducing the increase in the weight norm over time. Because of the scale invariance, the procedure does not alter the update direction in the effective space; it only changes the effective step sizes. We can observe the benefit of our optimizer in the toy setting in Figure 1.
"Ours" suppresses the norm growth and thus slows down the effective learning rate decay, allowing the momentum-accelerated convergence in $\mathbb{R}^2$ to be transferred to the effective space $S^1$. "Ours" converges most quickly and achieves the best terminal objective value. We do not discourage the use of momentum-based optimizers; momentum is often an indispensable ingredient that enables the best performances of deep neural networks. Instead, we propose a method that helps momentum realize its full potential by letting the acceleration operate on the effective space, rather than squandering it on increasing norms to no avail. The projection algorithm is simple and readily applicable to various optimizers for deep neural networks. We apply this technique to SGD and Adam (SGDP and AdamP, respectively) and verify the slower decay of effective learning rates, as well as the resulting performance boosts, over a diverse set of practical machine learning tasks including image classification, image retrieval, object detection, robustness benchmarks, audio classification, and language modelling. As a side note, we have identified certain similarities between our approaches and Cho & Lee (2017), who have considered performing the optimization steps for scale-invariant parameters on the spherical manifold. We argue that our approaches are conceptually different, as ours operate on the ambient Euclidean space, and are more practical. See Appendix §G.1 for a more detailed argument based on conceptual and empirical comparisons.

2. PROBLEM

Widely-used normalization techniques (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016; Ulyanov et al., 2016; Wu & He, 2018) in deep networks result in scale invariance for the weights. We show that the introduction of momentum in gradient descent (GD) optimizers, when applied to such scale-invariant parameters, decreases the effective learning rate much more rapidly. This phenomenon has not yet been studied in the literature, despite its ubiquity. We suspect the resulting early convergence may have introduced sub-optimality in many SGD- and Adam-trained models across machine learning tasks. The analysis motivates our optimizer in §3.

2.1. NORMALIZATION LAYER AND SCALE INVARIANCE

For a tensor $x \in \mathbb{R}^{n_1 \times \cdots \times n_r}$ of rank $r$, we define the normalization operation along the axes $k$ as

$$\mathrm{Norm}_k(x) = \frac{x - \mu_k(x)}{\sigma_k(x)} \qquad (2)$$

where $\mu_k, \sigma_k$ are the mean and standard deviation functions along the axes $k$, without axes reduction (to allow broadcasted operations with $x$). Depending on $k$, $\mathrm{Norm}_k$ includes special cases like batch normalization (BN) (Ioffe & Szegedy, 2015). For a function $g(u)$, we say that $g$ is scale-invariant if $g(cu) = g(u)$ for any $c > 0$. We then observe that $\mathrm{Norm}(\cdot)$ is scale-invariant. In particular, in the context of neural networks, $\mathrm{Norm}(w^\top x) = \mathrm{Norm}((cw)^\top x)$ for any $c > 0$, leading to scale invariance with respect to the weights $w$ preceding the normalization layer. The norm $\|w\|_2$ of such weights does not affect the forward $f_w(x)$ or the backward $\nabla_w f_w(x)$ computations of a neural network layer $f_w$ parameterized by $w$. We may represent the scale-invariant weights via their $\ell_2$-normalized vectors $\widehat{w} := w / \|w\|_2 \in S^{d-1}$ (i.e. $c = 1/\|w\|_2$).
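As a quick sanity check, the scale invariance of equation 2 can be reproduced numerically. The following is a minimal numpy sketch (the layer shapes and seed are illustrative, not from the paper):

```python
import numpy as np

def norm_layer(x, axis=0):
    # Normalization along the given axes, without axis reduction,
    # as in equation (2): (x - mean) / std.
    mu = x.mean(axis=axis, keepdims=True)
    sigma = x.std(axis=axis, keepdims=True)
    return (x - mu) / sigma

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 4))   # a batch of 32 four-dimensional inputs
w = rng.normal(size=(4, 3))    # weights of a linear layer

y1 = norm_layer(x @ w)         # Norm(w^T x)
y2 = norm_layer(x @ (10 * w))  # Norm((cw)^T x) with c = 10
print(np.allclose(y1, y2))     # True: the weight scale cancels out
```

Rescaling `w` rescales the pre-normalization activations, their mean, and their standard deviation by the same factor, so the normalized output is unchanged.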

2.2. NOTATIONS FOR THE OPTIMIZATION STEPS

See the illustration on the right for a summary of the notations describing an optimization step. We write a gradient descent (GD) algorithm as:

$$w_{t+1} \leftarrow w_t - \eta p_t \qquad (3)$$

where $\eta > 0$ is the user-defined learning rate. The norm of the difference, $\|\Delta w_{t+1}\|_2 := \|w_{t+1} - w_t\|_2 = \eta \|p_t\|_2$, is referred to as the step size. When $p = \nabla_w f(w)$, equation 3 is the vanilla GD algorithm. Momentum-based variants have more complex forms for $p$. In this work, we study the optimization problem in terms of the $\ell_2$-normalized weights in $S^{d-1}$, as opposed to the nominal space $\mathbb{R}^d$. As a result of equation 3, an effective optimization step takes place in $S^{d-1}$: $\Delta \widehat{w}_{t+1} := \widehat{w}_{t+1} - \widehat{w}_t$. We refer to $\|\Delta \widehat{w}_{t+1}\|_2$ as the effective step size.
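The gap between nominal and effective step sizes is easy to observe numerically. Below is a small illustrative sketch (not from the paper): the same nominal update applied to $w$ and to $2w$ produces identical nominal step sizes, but the effective step size on the unit sphere shrinks by the factor $1/\|w\|_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 8, 0.01
w = rng.normal(size=d)
p = rng.normal(size=d)
p -= (w @ p) / (w @ w) * w   # update orthogonal to w, as in equation 4

def steps(w, p, eta):
    """Return (nominal step size, effective step size on the sphere)."""
    w_new = w - eta * p
    nominal = np.linalg.norm(w_new - w)
    effective = np.linalg.norm(w_new / np.linalg.norm(w_new)
                               - w / np.linalg.norm(w))
    return nominal, effective

# The same update applied to w and to 2w: identical nominal step sizes,
# but the effective step size shrinks by the factor 1/||w||_2.
n1, e1 = steps(w, p, eta)
n2, e2 = steps(2 * w, p, eta)
print(abs(n1 - n2), e1 / e2)   # ratio ~2: doubling the norm halves the step
```

Doubling the weight norm leaves the nominal step untouched while halving the angular motion, which is exactly the effective step size decay described above.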

2.3. EFFECTIVE STEP SIZES FOR VANILLA GRADIENT DESCENT (GD)

We approximate the effective step sizes for the scale-invariant $w$ under the vanilla GD algorithm. We observe that the scale invariance $f(cw) \equiv f(w)$ leads to orthogonality:

$$0 = \frac{\partial f(cw)}{\partial c} = w^\top \nabla_w f(w). \qquad (4)$$

For example, the vanilla GD update step $p = \nabla_w f(w)$ is always perpendicular to $w$. Based on this, we establish the effective step size for $\widehat{w}$ on $S^{d-1}$:

$$\|\Delta \widehat{w}_{t+1}\|_2 := \left\| \frac{w_{t+1}}{\|w_{t+1}\|_2} - \frac{w_t}{\|w_t\|_2} \right\|_2 \approx \left\| \frac{w_{t+1}}{\|w_{t+1}\|_2} - \frac{w_t}{\|w_{t+1}\|_2} \right\|_2 = \frac{\|\Delta w_{t+1}\|_2}{\|w_{t+1}\|_2} \qquad (5)$$

where the approximation assumes $\frac{1}{\|w_{t+1}\|_2} - \frac{1}{\|w_t\|_2} = o(\eta)$, which holds when $p_t \perp w_t$ as in the vanilla GD. We have thus derived that the effective step size on $S^{d-1}$ is inversely proportional to the weight norm, in line with the results in Hoffer et al. (2018). Having established the relationship between the effective step sizes and the weight norm of a scale-invariant parameter (s.i.p.), we derive the formula for its growth under the vanilla GD optimization.

Lemma 2.1 (Norm growth by GD; Lemma 2.4 in Arora et al. (2019)). For a s.i.p. $w$ and the vanilla GD, where $p_t = \nabla_w f(w_t)$,

$$\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + \eta^2 \|p_t\|_2^2. \qquad (6)$$

The lemma follows from the orthogonality in equation 4. It follows that the norm of a scale-invariant parameter $\|w\|_2$ is monotonically increasing, which in turn decreases the effective step size for $\widehat{w}$. Arora et al. (2019) have further shown that GD with the above adaptive step sizes converges to a stationary point at the theoretically optimal convergence rate $O(T^{-1/2})$ under a fixed learning rate.
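Both the orthogonality of equation 4 and the exact norm recursion of Lemma 2.1 can be checked numerically on any scale-invariant objective. The sketch below uses an illustrative scale-invariant loss $f(w) = -\cos(w, u)$ for a fixed unit vector $u$ (our choice for illustration; any scale-invariant $f$ would do):

```python
import numpy as np

rng = np.random.default_rng(2)
d, eta = 5, 0.1
u = rng.normal(size=d)
u /= np.linalg.norm(u)           # fixed target direction

def grad(w):
    # Gradient of the scale-invariant loss f(w) = -cos(w, u); by
    # equation 4 it is orthogonal to w, and it scales as 1/||w||.
    w_hat = w / np.linalg.norm(w)
    return -(u - (w_hat @ u) * w_hat) / np.linalg.norm(w)

w0 = rng.normal(size=d)
w = w0.copy()
for _ in range(100):
    p = grad(w)
    assert abs(w @ p) < 1e-10                    # orthogonality (equation 4)
    lhs = np.linalg.norm(w - eta * p) ** 2
    rhs = np.linalg.norm(w) ** 2 + eta ** 2 * np.linalg.norm(p) ** 2
    assert abs(lhs - rhs) < 1e-10                # Lemma 2.1, exactly
    w = w - eta * p

print(np.linalg.norm(w0), np.linalg.norm(w))     # the norm only grows
```

Because every update is perpendicular to the current weights, the Pythagorean identity of Lemma 2.1 holds exactly at each step, and the weight norm increases monotonically.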

2.4. RAPID DECAY OF EFFECTIVE STEP SIZES FOR MOMENTUM-BASED GD

Momentum is designed to accelerate the convergence of gradient-based optimization by letting $w$ escape high-curvature regions and cope with small and noisy gradients. It has become an indispensable ingredient for training modern deep neural networks. A momentum update follows:

$$w_{t+1} \leftarrow w_t - \eta p_t, \qquad p_t \leftarrow \beta p_{t-1} + \nabla_w f(w_t) \qquad (7)$$

for steps $t \ge 0$, where $\beta \in (0, 1)$ and $p_{-1}$ is initialized at $0$. Note that the step direction $p_t$ and the parameter $w_t$ may not be perpendicular anymore. We show below that momentum increases the weight norm under the scale invariance, even more so than does the vanilla GD.

Lemma 2.2 (Norm growth by momentum). For a s.i.p. $w$ updated via equation 7, we have

$$\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + \eta^2 \|p_t\|_2^2 + 2\eta^2 \sum_{k=0}^{t-1} \beta^{t-k} \|p_k\|_2^2. \qquad (8)$$

The proof is in Appendix §A. Comparing Lemmas 2.1 and 2.2, we notice that the formulations are identical, except for the last term on the right-hand side of Lemma 2.2. This term is not only non-negative, but also an accumulation of the past updates. It results in a significantly accelerated increase of the weight norms when momentum is used. We derive a more precise asymptotic ratio of the weight norms for GD with and without momentum below.

Corollary 2.3 (Asymptotic norm growth comparison). Let $\|w^{\mathrm{GD}}_t\|_2$ and $\|w^{\mathrm{GDM}}_t\|_2$ be the weight norms at step $t \ge 0$, following the recursive formulas in Lemma 2.1 and 2.2, respectively. We assume that the norms of the updates $\|p_t\|_2$ for GD with and without momentum are identical for every $t \ge 0$. We further assume that the sum of the update norms is non-zero and bounded: $0 < \sum_{t \ge 0} \|p_t\|_2^2 < \infty$. Then, the asymptotic ratio between the two norms is given by:

$$\frac{\|w^{\mathrm{GDM}}_t\|_2^2 - \|w_0\|_2^2}{\|w^{\mathrm{GD}}_t\|_2^2 - \|w_0\|_2^2} \longrightarrow 1 + \frac{2\beta}{1-\beta} \quad \text{as } t \to \infty. \qquad (9)$$

The proof is in Appendix §A. While the identity assumption on $\|p_t\|_2$ between GD with and without momentum is strong, the theory is designed to illustrate an approximate norm growth ratio between the algorithms.
For a popular choice of $\beta = 0.9$, the factor is as high as $1 + 2\beta/(1-\beta) = 19$. Our observations are also applicable to Nesterov momentum and momentum-based adaptive optimizers like Adam. We later verify that momentum induces the increase in weight norms and thus rapidly reduces the effective learning rates in many realistic setups of practical relevance (§3.2 and §4).
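The asymptotic factor of Corollary 2.3 can be simulated directly from the two recursions, under the corollary's assumption of identical, summable update norms (the sequence $\|p_t\|_2^2 = 1/t^2$ below is an arbitrary illustrative choice):

```python
import numpy as np

beta, eta, T = 0.9, 0.1, 4000
A = 1.0 / np.arange(1, T + 1) ** 2      # summable update norms ||p_t||_2^2

gd = 0.0    # total norm growth under Lemma 2.1
gdm = 0.0   # total norm growth under Lemma 2.2
S = 0.0     # running value of sum_{k<t} beta^(t-k) * A_k
for t in range(T):
    gd += eta ** 2 * A[t]
    gdm += eta ** 2 * A[t] + 2 * eta ** 2 * S
    S = beta * (S + A[t])               # shift the geometric weights by one

ratio = gdm / gd
print(ratio)   # approaches 1 + 2 * beta / (1 - beta) = 19 for beta = 0.9
```

The running sum `S` implements the accumulation term of Lemma 2.2 in $O(1)$ per step; as $t$ grows, each past update is re-counted with total weight $2\sum_{l \ge 1}\beta^l = 2\beta/(1-\beta)$, reproducing the factor of 19.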

3. METHOD

We have studied the accelerated decay of effective learning rates for scale-invariant weights (e.g. those preceding a normalization layer) under momentum. In this section, we propose a projection-based solution that prevents the momentum-induced effective step size decrease while not changing the update directions in the effective weight space $S^{d-1}$.

3.1. OUR METHOD: PROJECTED UPDATES

We remove the accumulated error term in Lemma 2.2, while retaining the benefits of momentum, through a simple modification. Let $\Pi_w(\cdot)$ be the projection onto the tangent space of $\widehat{w}$:

$$\Pi_w(x) := x - (\widehat{w} \cdot x)\, \widehat{w}. \qquad (10)$$

We apply $\Pi_w(\cdot)$ to the momentum update $p$ (equation 7) to remove the radial component, which accumulates the weight norm without contributing to the optimization. Our modified update rule is:

$$w_{t+1} = w_t - \eta q_t, \qquad q_t = \begin{cases} \Pi_{w_t}(p_t) & \text{if } \cos(w_t, \nabla_w f(w_t)) < \delta / \sqrt{\dim(w)} \\ p_t & \text{otherwise} \end{cases} \qquad (11)$$

where $\cos(a, b) := \frac{|a^\top b|}{\|a\|\, \|b\|}$ is the cosine similarity. Instead of requiring users to manually register the weights preceding normalization layers, our algorithm automatically detects scale invariance via the cosine similarity. In all experiments considered, we found $\delta = 0.1$ to be sufficiently small to precisely detect orthogonality and sufficiently large to recall all scale-invariant weights (Appendix §C); we suggest future users adopt the same value. The proposed update rule makes a scale-invariant parameter $w$ perpendicular to its update step $q$. It follows that the rapid weight norm accumulation shown in Lemma 2.2 is alleviated back to the vanilla gradient descent growth rate of Lemma 2.1, due to the orthogonality:

$$\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + \eta^2 \|q_t\|_2^2 \le \|w_t\|_2^2 + \eta^2 \|p_t\|_2^2 \qquad (12)$$

where the inequality follows from the fact that $q_t = \Pi_{w_t}(p_t)$ and $\Pi_{w_t}$ is a projection operation.
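Equations 10 and 11 amount to a few lines of code. The sketch below is illustrative rather than our released implementation; for brevity it tests orthogonality on the update $p$ itself, whereas equation 11 tests the raw gradient:

```python
import numpy as np

def project(w, p, delta=0.1):
    """Sketch of equations 10-11: drop the radial component of the
    update p when w looks scale-invariant (near-orthogonal to p)."""
    cos = abs(w @ p) / (np.linalg.norm(w) * np.linalg.norm(p))
    if cos < delta / np.sqrt(w.size):        # scale-invariance test
        w_hat = w / np.linalg.norm(w)
        return p - (w_hat @ p) * w_hat       # tangential part only (eq. 10)
    return p                                 # scale-variant weight: unchanged

rng = np.random.default_rng(3)
w = rng.normal(size=100)
p = rng.normal(size=100)
p -= (w @ p) / (w @ w) * w    # update orthogonal to w, as for a s.i.p.
p += 0.001 * w                # plus a small spurious radial component

q = project(w, p)
print(abs(w @ q), np.linalg.norm(q) < np.linalg.norm(p))
```

The radial contamination is removed, so `q` is perpendicular to `w` and never longer than `p`; passing `w` itself as the update (cosine 1) leaves it untouched, which is how scale-variant weights bypass the projection.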
Although the updates $p_t$ are not identical between equation 8 and equation 12, we observe that, after our orthogonal projection, the updates no longer accumulate as in the last term of equation 8. We emphasize that this modification only alters the effective learning rate while not changing the effective update directions, as shown in the proposition below.

Proposition 3.1 (Effective update direction after projection). Let $w^o_{t+1} := w_t - \eta p_t$ and $w^p_{t+1} := w_t - \eta \Pi_{w_t}(p_t)$ be the original and projected updates, respectively. Then, the effective update after the projection $\widehat{w}^p_{t+1}$ lies on the geodesic on $S^{d-1}$ defined by $\widehat{w}_t$ and $\widehat{w}^o_{t+1}$.

The proof is in Appendix §A. As such, we expect our algorithm to inherit the convergence guarantees of GD. As to the convergence rate, we conjecture that an analysis similar to Arora et al. (2019) may be applicable. The full algorithms are given below.

Algorithm 1: SGDP
Require: Learning rate η > 0, momentum 0 < β < 1, threshold δ > 0.
1: while w_t not converged do
2:   p_t ← βp_{t-1} + ∇_w f_t(w_t)
3:   Compute q_t with equation 11.
4:   w_{t+1} ← w_t − ηq_t
5: end while

Algorithm 2: AdamP
Require: Learning rate η > 0, momentum 0 < β_1, β_2 < 1, thresholds δ, ε > 0.
1: while w_t not converged do
2:   m_t ← β_1 m_{t-1} + (1 − β_1)∇_w f_t(w_t)
3:   v_t ← β_2 v_{t-1} + (1 − β_2)(∇_w f_t(w_t))^2
4:   p_t ← m_t / (√v_t + ε)
5:   Compute q_t with equation 11.
6:   w_{t+1} ← w_t − ηq_t
7: end while
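A minimal numpy rendering of the SGDP algorithm on an illustrative scale-invariant loss $f(w) = -\cos(w, u)$ (a sketch, not our released implementation) makes the norm-suppression effect visible:

```python
import numpy as np

def grad_cos(w, u):
    # Gradient of the illustrative scale-invariant loss f(w) = -cos(w, u).
    w_hat = w / np.linalg.norm(w)
    return -(u - (w_hat @ u) * w_hat) / np.linalg.norm(w)

def train(w0, u, eta=0.1, beta=0.9, delta=0.1, steps=300, project=False):
    w, p = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        g = grad_cos(w, u)
        p = beta * p + g                              # momentum (equation 7)
        q = p
        if project:
            cos = abs(w @ g) / (np.linalg.norm(w) * np.linalg.norm(g))
            if cos < delta / np.sqrt(w.size):         # equation 11
                w_hat = w / np.linalg.norm(w)
                q = p - (w_hat @ p) * w_hat
        w = w - eta * q
    return w

rng = np.random.default_rng(4)
u = rng.normal(size=10)
u /= np.linalg.norm(u)
w0 = rng.normal(size=10)

w_sgdm = train(w0, u)                  # SGD with momentum
w_sgdp = train(w0, u, project=True)    # SGDP
print(np.linalg.norm(w_sgdm), np.linalg.norm(w_sgdp))
```

Both runs take the same effective directions, but the projected run ends with a visibly smaller weight norm, i.e. larger terminal effective step sizes.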

3.2. EMPIRICAL ANALYSIS OF EFFECTIVE STEP SIZES AND THE PROPOSED PROJECTION

So far, we have studied the problem (§2), namely that momentum accelerates the decay of effective step sizes for scale-invariant weights, as well as the corresponding solution (§3.1), only at a conceptual level. Here, we verify that the problem does indeed exist in practice and that our method successfully addresses it on both synthetic and real-world optimization problems.

Synthetic simulation.

While examples of scale-invariant optimization problems abound in modern deep learning (e.g. BN), they are scant in the popular toy optimization objectives designed for sanity checks. We use the Rosenbrock function (Rosenbrock, 1960): $h(x_1, x_2) = (1 - x_1)^2 + 300(x_2 - x_1^2)^2$ (Figure 3). As our verification requires scale invariance, we define a 3D Rosenbrock function by adding a redundant radial axis $r$, while treating the original coordinates $(x_1, x_2)$ as the polar angles $(\psi, \phi)$ of the spherical coordinates, resulting in the function $\bar{h}(r, \psi, \phi) = h(\psi, \phi)$. $\bar{h}$ is optimized in the 3D space with the Cartesian coordinates. We describe the full details in Appendix §B. In Figure 3, we compare the trajectories of the optimizers on the spherical coordinates $(\psi, \phi)$: the baseline momentum GD and our projection solution. We additionally examine the impact of weight decay (WD), since a careful choice of WD is another way to regularize the norm growth. We observe that the momentum GD does not converge sufficiently to the optimum. The slowdown is explained by the decreased effective step sizes on $S^2$ due to the increase in the parameter norm ($r = 1.0 \to 3.2$). Careful tuning of WD partially addresses the slow convergence (momentum GD + WD trajectory) by regularizing the norm growth, but WD is still unable to preclude the initial surge in the weight norm ($r = 1.0 \to 2.53$). In practice, addressing the problem with WD is even less attractive because WD is often a sensitive hyperparameter (see the following experiments). On the other hand, our projection solution successfully subdues the weight norm growth ($r = 1.0 \to 1.12$), ensuring undiminished effective step sizes and a faster convergence to the optimum.

Real-world experiments. We verify the surge of weight norms and the suboptimality of model performances in momentum-trained deep networks on real-world datasets: ImageNet classification with ResNet-18 and music tagging (Law et al., 2009) with Harmonic CNN (Won et al., 2020a).
See Table 1 for the analysis. In all experiments, our projection solutions (SGDP, AdamP) restrain the weight norm growth much better than the vanilla momentum methods. For example, the Adam-induced norm increase (+4.21) is 13.2 times greater than that of AdamP (+0.32) in the music tagging task. In the ImageNet SGD experiments, we observe that a careful choice of weight decay (WD) mitigates the norm increases, but the final norm values and performances are sensitive to the WD value. On the other hand, our SGDP results in stable final norm values and performances across different WD values, even at WD = 0. We observe that, under the same learning setup, models with smaller terminal weight norms tend to obtain improved performances. Though it is difficult to elicit a causal relationship, we verify in §4 that SGDP and AdamP bring about performance gains in a diverse set of real-world tasks. More analyses of the norm growth and the learning curves are in Appendix §F. We also analyze the momentum coefficient in Appendix §H.3.

4. EXPERIMENTS

In this section, we demonstrate the effectiveness of our projection module for training scale-invariant weights with momentum-based optimizers. We experiment over various real-world tasks and datasets. From the image domain, we show results on ImageNet classification (§4.1, §D.2, §D.3), object detection (§4.2), and robustness benchmarks (§4.3, §D.1). From the audio domain, we study music tagging, speech recognition, and sound event detection (§4.4). We further show the results when the scale invariance is artificially introduced to a network with no scale-invariant parameters (e.g. Transformer-XL (Dai et al., 2019)) in §4.5. To diversify the root cause of scale invariances, we consider the case where it stems from the $\ell_2$ projection of the features, as opposed to the statistical normalization done in e.g. BN, in the image retrieval experiments (§4.6). In the above set of experiments, totaling more than 10 setups, our proposed modifications (SGDP and AdamP) bring about consistent performance gains against the baselines (SGD (Sutskever et al., 2013) and AdamW (Loshchilov & Hutter, 2019)). We provide the implementation details in Appendix §E and the standard deviation values for the experiments in Appendix §H.2.

4.1. IMAGE CLASSIFICATION

Batch normalization (BN) and momentum-based optimizers are standard techniques for training state-of-the-art image classification models (He et al., 2016; Han et al., 2017; Sandler et al., 2018; Tan & Le, 2019; Han et al., 2020). We evaluate the proposed method with ResNet (He et al., 2016), one of the most popular and powerful architectures on ImageNet, and MobileNetV2 (Sandler et al., 2018), a relatively lightweight model with ReLU6 and depthwise convolutions, on the ImageNet-1K benchmark (Russakovsky et al., 2015). For ResNet, we employ the training hyperparameters in He et al. (2016). For MobileNetV2, we have searched for the best hyperparameters, as it is generally difficult to train with the usual settings. Recent studies have identified better training setups (Cubuk et al., 2020) that use cosine-annealed learning rates and longer training schedules (100 or 150 epochs) than the 90 epochs of He et al. (2016). We use those setups for all experiments in this subsection. Our optimizers are compared against their corresponding baselines in Table 2. Note that AdamP is compared against AdamW (Loshchilov & Hutter, 2019), which has closed the gap between Adam and SGD performances on large-scale benchmarks. Across the spectrum of network sizes, our optimizers outperform the baselines. Even when the state-of-the-art CutMix (Yun et al., 2019) regularization is applied, our optimizers introduce further gains. We provide three additional experiments on EfficientNet (Tan & Le, 2019) (§D.2), the large-batch training scenario (§D.3), and a comparison at the same computational cost (§H.4) to demonstrate the benefit of AdamP in diverse training setups. There, again, our methods outperform the baselines.

4.2. OBJECT DETECTION

Object detection is another widely-used real-world task where the models often include normalization layers and are trained with momentum-based optimizers.
We study two detectors, CenterNet (Zhou et al., 2019) and SSD (Liu et al., 2016a), to verify that the proposed optimizers are also applicable to objective functions beyond the classification task. The detectors are either initialized with ImageNet-pretrained networks (official PyTorch models) or trained from scratch, in order to separate the effect of our method from that of the pretraining. ResNet18 (He et al., 2016) and VGG16-BN (Simonyan & Zisserman, 2015) are used as the CenterNet and SSD backbones, respectively. In Table 3, we report average precision performances based on the MS-COCO (Lin et al., 2014) evaluation protocol. We observe that AdamP boosts the performance against the baselines, demonstrating the versatility of our optimizers. Model robustness is an emerging problem in real-world applications, and training robust models often involves complex optimization problems (e.g. minimax). We examine how our optimizers stabilize such complex optimization. We consider two types of robustness: adversarial (below) and cross-bias (Bahng et al., 2020) (Appendix §D.1).

4.3. ROBUSTNESS

Adversarial training alternately optimizes a minimax problem where the inner optimization is an adversarial attack and the outer optimization is the standard classification problem. Adam is commonly employed to handle the complexity of this optimization (Tsipras et al., 2019; Chun et al., 2019). We consider solving the minimax optimization problem with our proposed optimizers. We train Wide-ResNet (Zagoruyko & Komodakis, 2016) with projected gradient descent (PGD) (Madry et al., 2018) on CIFAR-10. We use 10 inner PGD iterations and $\varepsilon = 80/255$ for the $L_2$ PGD and $\varepsilon = 4/255$ for the $L_\infty$ PGD. Figure 4 shows the learning curves of Adam and AdamP. By handling the effective step sizes, AdamP achieves a faster convergence than Adam (less than half the epochs required). AdamP brings more than a +9.3 pp performance gain in all settings. We also experimented with $\varepsilon = 8/255$; the results are reported in Appendix §E.4.1.
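For reference, the inner attack of adversarial training follows the standard PGD template: ascend the loss with sign-gradient steps, then project the perturbation back onto the $\varepsilon$-ball. The toy sketch below uses a linear score in place of a network loss (an illustrative stand-in, not our training setup):

```python
import numpy as np

def linf_pgd(x, grad_fn, eps, step, iters=10):
    """Minimal L_inf PGD: ascend the loss with sign-gradient steps,
    clipping the perturbation back into the eps-ball after each step."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the ball
    return x_adv

# Toy example: the "loss" is a linear score a @ x, so the attack should
# push every coordinate to the boundary of the eps-ball.
a = np.array([1.0, -2.0, 3.0])
x = np.zeros(3)
x_adv = linf_pgd(x, grad_fn=lambda z: a, eps=4 / 255, step=1 / 255)
print(np.max(np.abs(x_adv - x)))  # == eps: perturbation saturates the ball
```

With 10 iterations of step 1/255 against a fixed gradient direction, the perturbation saturates at the $\varepsilon = 4/255$ boundary, matching the projection step of the attack.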

4.4. AUDIO CLASSIFICATION

We evaluate the proposed optimizers on three audio classification tasks with different physical properties: music clips, verbal audio, and acoustic signals. For automatic music tagging, we use the MagnaTagATune (MTAT) benchmark (Law et al., 2009) with 21k samples and 50 tags. Each clip contains multiple tags. We use the Speech Commands dataset (Warden, 2018) for the keyword spotting task (106k samples, 35 classes, single label). For acoustic signals, we use the DCASE sound event detection benchmark (Mesaros et al., 2017) (53k samples, 17 tags, multi-labeled). We train the Harmonic CNN (Won et al., 2020a) on the three benchmarks. Harmonic CNN consists of data-driven harmonic filters and stacked convolutional filters with BN. Audio datasets are usually smaller than image datasets and are multi-labeled, posing another set of challenges for the optimization. Won et al. (2020a) have trained the network with a mixture of Adam and SGD (Won et al., 2019b). Instead of the mixture solution, we have searched for the best hyperparameters for Adam and AdamP on a validation set. The results are given in Table 4. AdamP shows better performances than the baselines, without having to adopt the complex mixture solution. The results signify the superiority of AdamP for training scale-invariant weights in the audio domain.

4.5. LANGUAGE MODELLING

Models in language domains, such as the Transformer (Vaswani et al., 2017), often do not have any scale-invariant weights. The use of layer normalization (LN) in the Transformer does not result in scale invariance because LN is applied right after the skip connection: $f(w, x) := \mathrm{LN}(g_w(x) + x)$ (note the difference from equation 2). The skip connection makes $f$ scale-variant with respect to $w$. To allow our optimizers to operate effectively on the Transformer, we introduce scale invariance artificially through weight normalization (WN) (Salimans & Kingma, 2016). We have trained Transformer-XL (Dai et al., 2019) on WikiText-103 (Merity et al., 2016).
As shown in Table 5, our optimizers again outperform the baselines when the scale invariance is introduced via WN.
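Why WN creates scale invariance is immediate from its definition $w = g\,v/\|v\|_2$: the direction parameter $v$ can be rescaled freely. A minimal numpy sketch (shapes and values are illustrative):

```python
import numpy as np

def weight_norm_linear(v, g, x):
    # Weight normalization (Salimans & Kingma, 2016): w = g * v / ||v||_2.
    # The output is invariant to the scale of the direction vector v,
    # which makes v a scale-invariant parameter our projection can act on.
    return x @ (g * v / np.linalg.norm(v))

rng = np.random.default_rng(5)
v = rng.normal(size=16)   # direction parameter (scale-invariant)
g = 2.5                   # magnitude parameter (scale-variant)
x = rng.normal(size=(4, 16))

y1 = weight_norm_linear(v, g, x)
y2 = weight_norm_linear(7.0 * v, g, x)   # rescaling v leaves output unchanged
print(np.allclose(y1, y2))  # True
```

Only the magnitude parameter `g` carries the scale, so reparameterizing a scale-variant layer with WN hands our projection a well-defined scale-invariant direction to operate on.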

4.6. RETRIEVAL

In the previous experiments, we have examined the scale invariances induced by statistical normalization methods (e.g. BN). Here, we consider another source of scale invariance: the $\ell_2$ projection of features, which induces scale invariance in the preceding weights. It is widely used in retrieval tasks for more efficient distance computations and better performances. We fine-tune the ImageNet-pretrained ResNet-50 network on the CUB (Wah et al., 2011), Cars-196 (Krause et al., 2013), In-Shop Clothes (Liu et al., 2016b), and Stanford Online Products (SOP) (Oh Song et al., 2016) benchmarks with the triplet (Schroff et al., 2015) and ProxyAnchor (Kim et al., 2020) losses. In Table 6, we observe that AdamP outperforms Adam on all four image retrieval datasets. The results support the superiority of AdamP for networks with $\ell_2$-normalized embeddings.

5. CONCLUSION

Momentum-based optimizers induce an excessive growth of the scale-invariant weight norms. The growth of the weight norms prematurely decays the effective optimization steps, leading to sub-optimal performances. The phenomenon is prevalent in many commonly-used setups: momentum-based optimizers (e.g. SGD and Adam) are used for training the vast majority of deep models, and the widespread use of normalization layers makes a large proportion of network weights scale-invariant (e.g. ResNet). We propose a simple and effective solution: project out the radial component from the optimization updates. The resulting SGDP and AdamP successfully suppress the weight norm growth and train models at an unobstructed speed. Empirically, our optimizers have demonstrated their superiority over the baselines on more than 10 real-world learning tasks.

A PROOFS FOR THE CLAIMS

We provide proofs for Lemma 2.2, Corollary 2.3, and Proposition 3.1 in the main paper.

Lemma A.1 (Monotonic norm growth by momentum). For a scale-invariant parameter $w$ updated via equation 7, we have

$$\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + \eta^2 \|p_t\|_2^2 + 2\eta^2 \sum_{k=0}^{t-1} \beta^{t-k} \|p_k\|_2^2. \qquad (A.1)$$

Proof. From equation 7, we have

$$\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + \eta^2 \|p_t\|_2^2 - 2\eta\, w_t \cdot p_t. \qquad (A.2)$$

It remains to prove that $w_t \cdot p_t = -\eta \sum_{k=0}^{t-1} \beta^{t-k} \|p_k\|_2^2$. We proceed by induction on $t \ge 0$. First, when $t = 0$, we have $w_0 \cdot p_0 = w_0 \cdot \nabla_w f(w_0) = 0$ because of equation 4. Now, assuming that $w_\tau \cdot p_\tau = -\eta \sum_{k=0}^{\tau-1} \beta^{\tau-k} \|p_k\|_2^2$, we have

$$w_{\tau+1} \cdot p_{\tau+1} = w_{\tau+1} \cdot (\beta p_\tau + \nabla_w f(w_{\tau+1})) = \beta\, w_{\tau+1} \cdot p_\tau = \beta (w_\tau - \eta p_\tau) \cdot p_\tau \qquad (A.3)$$

$$= -\beta\eta \sum_{k=0}^{\tau-1} \beta^{\tau-k} \|p_k\|_2^2 - \beta\eta \|p_\tau\|_2^2 = -\eta \sum_{k=0}^{\tau} \beta^{\tau-k+1} \|p_k\|_2^2, \qquad (A.4)$$

which completes the proof.

Corollary A.2 (Asymptotic norm growth comparison). Let $\|w^{\mathrm{GD}}_t\|_2$ and $\|w^{\mathrm{GDM}}_t\|_2$ be the weight norms at step $t \ge 0$, following the vanilla gradient descent growth (Lemma 2.1) and the momentum-based gradient descent growth (Lemma 2.2), respectively. We assume that the norms of the updates $\|p_t\|_2$ for GD with and without momentum are identical for every $t \ge 0$. We further assume that the sum of the update norms is non-zero and bounded: $0 < \sum_{t \ge 0} \|p_t\|_2^2 < \infty$. Then, the asymptotic ratio between the two norms is given by:

$$\frac{\|w^{\mathrm{GDM}}_t\|_2^2 - \|w_0\|_2^2}{\|w^{\mathrm{GD}}_t\|_2^2 - \|w_0\|_2^2} \longrightarrow 1 + \frac{2\beta}{1-\beta} \quad \text{as } t \to \infty. \qquad (A.5)$$

Proof. From Lemma 2.1 and Lemma 2.2, we obtain

$$\|w^{\mathrm{GD}}_t\|_2^2 - \|w_0\|_2^2 = \eta^2 \sum_{k=0}^{t-1} \|p_k\|_2^2, \qquad (A.6)$$

$$\|w^{\mathrm{GDM}}_t\|_2^2 - \|w_0\|_2^2 = \eta^2 \sum_{k=0}^{t-1} \|p_k\|_2^2 + 2\eta^2 \sum_{k=0}^{t-1} \sum_{l=1}^{t-1-k} \beta^l \|p_k\|_2^2. \qquad (A.7)$$

Thus, the corollary boils down to the claim that

$$F_t := \frac{\sum_{k=0}^{t} \sum_{l=1}^{t-k} \beta^l A_k}{\sum_{k=0}^{t} A_k} \longrightarrow \frac{\beta}{1-\beta} \quad \text{as } t \to \infty, \qquad (A.8)$$

where $A_k := \|p_k\|_2^2$. Let $\epsilon > 0$. We will find a large-enough $t$ that bounds $F_t$ around $\frac{\beta}{1-\beta}$ by a constant multiple of $\epsilon$. We first let $T$ be large enough such that

$$\sum_{k \ge T+1} A_k \le \epsilon, \qquad (A.9)$$

which is possible because $\sum_{t \ge 0} A_t < \infty$. We then let $T'$ be large enough such that

$$\frac{\beta}{1-\beta} - \sum_{l=1}^{T'} \beta^l \le \frac{\epsilon}{T \max_k A_k}, \qquad (A.10)$$

which is possible due to the convergence of the geometric sum and the boundedness of $A_k$ (because its infinite sum is bounded). We then define $t = T + T'$ and break down the sums in $F_t$ as follows:

$$F_t = \frac{\sum_{k=0}^{T} \sum_{l=1}^{T+T'-k} \beta^l A_k + \sum_{k=T+1}^{T+T'} \sum_{l=1}^{T+T'-k} \beta^l A_k}{\sum_{k=0}^{T} A_k + \sum_{k=T+1}^{T+T'} A_k} \qquad (A.11)$$

$$= \frac{\sum_{k=0}^{T} \left( \frac{\beta}{1-\beta} + r_1(\epsilon) \right) A_k + r_2(\epsilon)}{\sum_{k=0}^{T} A_k + r_3(\epsilon)} \qquad (A.12)$$

$$\le \frac{\frac{\beta}{1-\beta} \sum_{k=0}^{T} A_k + T \max_k A_k\, |r_1(\epsilon)| + r_2(\epsilon)}{\sum_{k=0}^{T} A_k + r_3(\epsilon)}, \qquad (A.13)$$

where $r_1$, $r_2$, and $r_3$ are the residual terms. It follows that $|r_1(\epsilon)| \le \frac{\epsilon}{T \max_k A_k}$ by equation A.10, and $r_2(\epsilon), r_3(\epsilon) \ge 0$ are bounded via equation A.9 (A.14, A.15). We then have

$$\left| F_t - \frac{\beta}{1-\beta} \right| \le \frac{\frac{\beta}{1-\beta}\, r_3(\epsilon) + T \max_k A_k\, |r_1(\epsilon)| + r_2(\epsilon)}{\sum_{k=0}^{T} A_k + r_3(\epsilon)} \qquad (A.16)$$

$$\le \frac{\epsilon}{\sum_{k=0}^{T} A_k} \left( \frac{\beta}{1-\beta} + \frac{1-\beta}{\beta} + 1 \right) \qquad (A.17)$$

$$\le \frac{\epsilon}{M} \left( \frac{\beta}{1-\beta} + \frac{1-\beta}{\beta} + 1 \right) \qquad (A.18)$$

due to the triangle inequality and the positivity of $r_3$, where $M > 0$ is a suitable constant independent of $T$.

Proposition A.3 (Effective update direction after projection). Let $w^o_{t+1} := w_t - \eta p_t$ and $w^p_{t+1} := w_t - \eta \Pi_{w_t}(p_t)$ be the original and projected updates, respectively. Then, the effective update after the projection $\widehat{w}^p_{t+1}$ lies on the geodesic on $S^{d-1}$ defined by $\widehat{w}_t$ and $\widehat{w}^o_{t+1}$.

Proof. The geodesic defined by $\widehat{w}_t$ and $\widehat{w}^o_{t+1}$ can be written as $S^{d-1} \cap \mathrm{span}(w_t, p_t)$. Thus, it suffices to show that $\widehat{w}^p_{t+1} \in \mathrm{span}(w_t, p_t)$. Indeed, we observe that $\widehat{w}^p_{t+1} := \frac{w_t - \eta \Pi_{w_t}(p_t)}{\|w_t - \eta \Pi_{w_t}(p_t)\|_2} \in \mathrm{span}(w_t, p_t)$ because $\Pi_{w_t}(p_t) = p_t - (\widehat{w}_t \cdot p_t)\, \widehat{w}_t$.

B TOY EXAMPLE DETAILS

2D toy example in Figure 1. We describe the details of the toy example in Figure 1. We solve the following optimization problem:
$$\min_{w} \; -\frac{w^{*}}{\|w^{*}\|_2}\cdot\frac{w}{\|w\|_2}, \qquad (B.1)$$
where $w$ and $w^{*}$ are 2-dimensional vectors. The problem is identical to the maximization of the cosine similarity between $w$ and $w^{*}$. We set $w^{*}$ to $(0, -1)$ and the initial $w$ to $(0.001, 1)$. This toy example has two interesting properties. First, the normalization makes the optimal $w$ non-unique: if $w$ is optimal, then $cw$ is optimal for any $c > 0$. In fact, the cost function is scale-invariant. Second, the cost function is not convex. As demonstrated in Figure 1 and the videos attached in our submitted code, the momentum gradient method fails to optimize equation B.1 because of the excessive norm increases. In particular, our simulation results show that a larger momentum induces a larger norm increase (maximum norm 2.93 when the momentum is 0.9, and 27.87 when the momentum is 0.99), as shown in the main paper §2.4. On the other hand, our method converges most quickly among the compared methods by taking advantage of the momentum-induced accelerated convergence, while avoiding the excessive norm increase.

3D spherical toy example in Figure 3. We employ the 3D Rosenbrock function (Rosenbrock, 1960), $h(r, \psi, \phi) = (1-\psi)^2 + 300(\phi - \psi^2)^2$, in the spherical coordinates $(r, \psi, \phi)$. Since we are mapping a 2D space to a spherical surface, $(\psi, \phi)$ is defined on the hemisphere: $-\pi/2 \leq \psi, \phi \leq \pi/2$. Based on the required range of the problem $f$, a scalar $c$ can be multiplied to each angle, $f(c\psi, c\phi)$, to adjust the angle scale to the problem scale; we choose $c = 1.5$ in the experiments. The initial point is $(c\psi, c\phi) = (-2, 2)$ on the unit sphere $r = 1$, and the minimum point is $(c\psi, c\phi) = (1, 1)$. Instead of optimizing $h$ in the spherical coordinates, we optimize the toy example in the Cartesian coordinates $(x, y, z)$ by computing $\min_{x,y,z} h(r, \psi, \phi) = \min_{x,y,z} h(T(x, y, z))$.
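The qualitative behavior of the 2D toy problem in equation B.1 can be reproduced in a few lines of NumPy. This is a minimal sketch of ours (the function names and hyperparameters below are arbitrary and are not those used to produce Figure 1), comparing plain momentum SGD against a projected variant that removes the radial component of each update:

```python
import numpy as np

# Toy problem (B.1): maximize the cosine similarity between w and w* = (0, -1),
# i.e. minimize f(w) = -w* . w / (||w*|| ||w||). We compare momentum SGD with
# a projected variant (SGDP-style) that drops the norm-increasing radial part.

w_star = np.array([0.0, -1.0])

def grad_f(w):
    n = np.linalg.norm(w)
    w_hat = w / n
    return -(w_star - (w_star @ w_hat) * w_hat) / n  # tangent to w

def run(eta=0.1, beta=0.9, steps=300, projected=False):
    w = np.array([0.001, 1.0])
    p = np.zeros(2)
    max_norm = np.linalg.norm(w)
    for _ in range(steps):
        p = beta * p + grad_f(w)
        step = p
        if projected:  # remove the radial (norm-increasing) component
            w_hat = w / np.linalg.norm(w)
            step = p - (p @ w_hat) * w_hat
        w = w - eta * step
        max_norm = max(max_norm, np.linalg.norm(w))
    cos = (w @ w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))
    return cos, max_norm

cos_m, norm_m = run(projected=False)  # momentum: norm grows substantially
cos_p, norm_p = run(projected=True)   # projected: norm stays near its start
print(f"momentum:  cos={cos_m:.4f}, max ||w||={norm_m:.2f}")
print(f"projected: cos={cos_p:.4f}, max ||w||={norm_p:.2f}")
```

With these (arbitrary) settings, the projected run keeps the weight norm far smaller than the momentum run while still converging towards $w^{*}$, mirroring the behavior described above.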
We employ the spherical transform $T: (x, y, z)\to(r, \psi, \phi)$ as follows:
$$r = \sqrt{x^2 + y^2 + z^2}, \quad \psi = \cos^{-1}(x/z), \quad \phi = \sin^{-1}(y/r). \qquad (B.2)$$
For all optimizers, we set the momentum to 0.9, and we have exhaustively searched for the optimal initial learning rates between 0.001 and 0.1. The learning rates are decayed linearly at every iteration.

C δ SENSITIVITY

Based on ImageNet training with ResNet18, we analyzed the cosine similarity between the weights and their gradients for scale-invariant and scale-variant parameters, and verified the sensitivity of δ. We first measured the cosine similarity between the gradient and the weights (Eq. 11). As a result, the cosine similarities for scale-variant weights lie in $[0.0028, 0.5660]$, compared to $[5.5\times 10^{-10}, 4.4\times 10^{-6}]$ for scale-invariant ones (99% confidence intervals). Because of this large gap, our methods are stable over a wide range of δ values. We also measured the scale-variant and scale-invariant parameter detection accuracy based on Eq. 11 with various δ values. The results are shown in Table B.1. Over a wide range of δ values, AdamP perfectly discriminates the scale-variant and scale-invariant parameters, and AdamP consistently shows high performance. Therefore, AdamP is not sensitive to the δ value, and δ = 0.1 is suitable for separating scale-variant and scale-invariant parameters.

D.1 ROBUSTNESS AGAINST REAL-WORLD BIASES

We follow the two cross-bias generalization benchmarks proposed by Bahng et al. (2020). The first benchmark is Biased MNIST, a dataset synthesized by injecting colors into the MNIST background pixels. Each sample is colored according to a pre-defined class-color mapping with probability ρ; the color is selected at random with chance $1-\rho$. For example, ρ = 1.0 leads to a completely biased dataset and ρ = 0.1 leads to an unbiased dataset. Each model is trained on the ρ-biased MNIST and tested on the unbiased MNIST. We train a stacked convolutional network with BN and ReLU.
The second benchmark is the 9-Class ImageNet, representing real-world biases such as textures (Geirhos et al., 2019). The unbiased accuracy is measured with pseudo-labels generated by texture clustering. We also report the performance on ImageNet-A (Hendrycks et al., 2019), a collection of failure samples of existing CNNs. In Table D.1, we observe that AdamP outperforms Adam in all the benchmarks. AdamP is thus a good alternative to Adam for difficult optimization problems involving scale-invariant parameters.

D.2 TRAINING WITH VARIOUS TECHNIQUES

EfficientNet (Tan & Le, 2019) and ReXNet (Han et al., 2020) are recently proposed high-performance networks that are trained using various techniques such as data augmentation (Cubuk et al., 2018; 2020), stochastic depth (Huang et al., 2016), and dropout (Srivastava et al., 2014).

D.3 LARGE-BATCH TRAINING

ResNet (He et al., 2016) is trained over 100 epochs with various batch sizes on ImageNet (Russakovsky et al., 2015), comparing AdamP against AdamW (Loshchilov & Hutter, 2019). In order to efficiently use multiple machines and huge computational resources, large-batch training is essential. However, general optimizers suffer from a significant performance decrease in large-batch training, so large-batch training is another challenge for a deep learning optimizer (You et al., 2017; 2019; Goyal et al., 2017). We conducted experiments to verify the performance of AdamP in such large-batch training; the results are shown in Table D.3. The performance improvement of AdamP in large-batch training is greater than that at the regular batch size (Table 2). Therefore, the decrease of the effective learning rate due to momentum can be considered one of the causes of performance degradation in large-batch training. However, AdamP does not perform as well as dedicated large-batch optimizers (You et al., 2017; 2019; Goyal et al., 2017); applying AdamP to large-batch optimizers should be studied as future work.

E EXPERIMENTS SETTINGS

We describe the experimental settings in full detail for reproducibility.

E.1 COMMON SETTINGS

All experiments are conducted based on PyTorch. SGDP and AdamP are implemented to handle channel-wise (e.g. batch normalization (Ioffe & Szegedy, 2015) and instance normalization (Ulyanov et al., 2016)) and layer-wise normalization (Ba et al., 2016). Based on the empirical measurement of the inner product between the weight vector and the corresponding gradient vector for scale-invariant parameters (they are supposed to be orthogonal), we set the δ in Algorithms 1 and 2 to 0.1. We use the decoupled weight decay (Loshchilov & Hutter, 2019) for SGDP and AdamP in order to separate the gradient due to the weight decay from the gradient due to the loss function. Please refer to the attached code (sgdp.py and adamp.py) for further details.
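The core of both optimizers can be summarized in a few lines. The sketch below is our simplified reading of the projection step (the function name `projection` and the NumPy formulation are ours; sgdp.py and adamp.py are the actual implementations): if the cosine similarity between a weight and its gradient falls below $\delta/\sqrt{\dim}$, the weight is treated as scale-invariant and the update loses its radial component.

```python
import numpy as np

def projection(w, g, p, delta=0.1, eps=1e-8):
    """Project the update p onto the tangent space of w if w is detected
    as scale-invariant, i.e. if cos(w, g) < delta / sqrt(dim) (Eq. 11).
    Simplified sketch; see sgdp.py / adamp.py for the real implementation."""
    w_flat, g_flat = w.ravel(), g.ravel()
    cos = abs(w_flat @ g_flat) / (np.linalg.norm(w_flat) * np.linalg.norm(g_flat) + eps)
    if cos < delta / np.sqrt(w_flat.size):  # scale-invariance test
        w_hat = w_flat / (np.linalg.norm(w_flat) + eps)
        # remove the radial (norm-increasing) component of the update
        return (p.ravel() - (p.ravel() @ w_hat) * w_hat).reshape(p.shape)
    return p  # scale-variant parameter: leave the update untouched
```

For a gradient orthogonal to the weight (the scale-invariant case), the radial component of `p` is removed; otherwise the update is returned unchanged, so scale-variant parameters follow the vanilla optimizer exactly.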

E.2 IMAGE CLASSIFICATION

Experiments on ResNet (He et al., 2016) are conducted with the standard settings: learning rate 0.1, weight decay $10^{-4}$, batch size 256, and momentum 0.9 with Nesterov (Sutskever et al., 2013) for SGD and SGDP. For the Adam series, we use learning rate 0.001, weight decay $10^{-4}$, batch size 256, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. For training MobileNetV2 (Sandler et al., 2018), we have additionally used label smoothing and a large batch size of 1024, and have searched for the best learning rates and weight decay values for each optimizer. The training sessions are run for 100 epochs (ResNet18, ResNet50) or 150 epochs (MobileNetV2, ResNet50 + CutMix) with the cosine learning rate schedule (Loshchilov & Hutter, 2016) on a machine with four NVIDIA V100 GPUs.

E.3 OBJECT DETECTION

Object detection performances have been measured on the MS-COCO dataset (Lin et al., 2014) with two popular object detectors: CenterNet (Zhou et al., 2019) and SSD (Liu et al., 2016a). We adopt CenterNet with a ResNet18 (He et al., 2016) backbone and SSD with a VGG16-BN (Simonyan & Zisserman, 2015) backbone as the baseline detectors. CenterNet has been trained for 140 epochs with learning rate $2.5\times 10^{-4}$, weight decay $10^{-5}$, batch size 64, and the cosine learning rate schedule. SSD has been trained for 110 epochs with learning rate $10^{-4}$, weight decay $10^{-5}$, batch size 64, and the step learning rate schedule, which decays learning rates by 1/10 at 70% and 90% of training.

E.4 ROBUSTNESS BENCHMARKS

We follow the two cross-bias generalization benchmarks proposed by Bahng et al. (2020), to which we refer interested readers. For all experiments, the batch size is 256 for Biased MNIST and 128 for 9-Class ImageNet. For Biased MNIST, the initial learning rate is 0.001, decayed by a factor of 0.1 every 20 epochs. For 9-Class ImageNet, the learning rate is 0.001, decayed by cosine annealing. We train the fully convolutional network and ResNet18 for 80 and 120 epochs, respectively. The weight decay is $10^{-3}$ for all experiments.

E.5 AUDIO CLASSIFICATION

Dataset. Three datasets with different physical properties are employed as the audio benchmarks; we list their statistics in Table E.1. Music tagging is a multi-label classification task for the prediction of user-generated tags, e.g., genres, moods, and instruments. We use a subset of the MagnaTagATune (MTAT) dataset (Law et al., 2009), which contains ≈21k audio clips and 50 tags. The averages of the tag-wise Area Under the Receiver Operating Characteristic Curve (ROC-AUC) and the Area Under the Precision-Recall Curve (PR-AUC) are used as the evaluation metrics. Keyword spotting is a primitive speech recognition task where an audio clip containing a keyword is categorized within a limited vocabulary. We use the Speech Commands dataset (Warden, 2018), which contains ≈106k samples and 35 command classes such as "yes", "no", "left", and "right". The accuracy metric is used for evaluation. Acoustic sound detection is a multi-label classification task with non-music and non-verbal audio. We use the "large-scale weakly supervised sound event detection for smart cars" dataset from the DCASE 2017 challenge (Mesaros et al., 2017). It has ≈53k audio clips with 17 events such as "Car", "Fire truck", and "Train horn". For evaluation, we use the F1-score with the prediction threshold set to 0.1.

Training setting. We use the 16kHz sampling rate for all experiments, and all hyperparameters, e.g., the number of harmonics and trainable parameters, are set the same as in Won et al. (2020a), whose official implementation is used for all the experiments. We compare three different optimizers: Adam, AdamP (ours), and the complex mixture of Adam and SGD proposed by Won et al. (2019b). For the mixture of Adam and SGD, we adopt the same hyperparameters as in the previous papers (Won et al., 2019b; c; a; 2020a). The mixed optimization algorithm first runs Adam for 60 epochs with learning rate $10^{-4}$.
After 60 epochs, the model with the best validation performance is selected as the initialization for the second phase. During the second phase, the model is trained using SGD for 140 epochs with learning rate $10^{-4}$, decayed by 1/10 at epochs 20 and 40. We use weight decay $10^{-4}$ for the optimizers. With these hyperparameters, we reproduce the ROC-AUC score of 91.27 on the MTAT dataset reported by the recent benchmark paper (Won et al., 2020b). To show the effectiveness of our method, we have searched for the best hyperparameters for the Adam optimizer on the MTAT validation dataset and have transferred them to the AdamP experiments. As a result of our search, we set the weight decay to 0 and the initial learning rate to 0.0001, decayed by the cosine annealing scheduler. The number of training epochs is set to 100 for the MTAT dataset and 30 for the Speech Commands and DCASE datasets. As a result, we observe that AdamP shows superior performance compared to the complex mixture, with far fewer training epochs (200 → 30).

E.6 RETRIEVAL

Dataset. We use four retrieval benchmark datasets. For the CUB (Wah et al., 2011) dataset, which contains bird images in 200 classes, we use 100 classes for training and the rest for evaluation. For evaluation, we query every test image against the test dataset and measure the recall@1 metric. The same protocol is applied to the Cars-196 (Krause et al., 2013) (196 classes) and SOP (Oh Song et al., 2016) (22,634 classes) datasets. For the InShop (Liu et al., 2016b) experiments, we follow the official benchmark setting proposed by Liu et al. (2016b). We summarize the dataset statistics in Table E.1.

Training setting. For all experiments, we use the same backbone network and the same training setting except for the optimizer and the loss function. The official implementation by Kim et al. (2020) is used for all experiments. We use the PyTorch official ImageNet-pretrained ResNet50 model as the initialization. During training, we freeze the BN statistics to the ImageNet statistics (eval mode in PyTorch). We replace the global average pooling (GAP) layer of ResNet with the summation of GAP and global max pooling layers, as in the implementation provided by Kim et al. (2020). Pooled features are linearly mapped to the 512-dimensional embedding space and $\ell_2$-normalized. We set the initial learning rate to $10^{-4}$, decayed by a factor of 0.5 every 5 epochs. Every mini-batch contains 120 randomly chosen samples. For better stability, we train only the last linear layer for the first 5 epochs and update all the parameters for the remaining steps. The weight decay is set to 0.

G RELATED WORK

We provide a brief overview of related prior work. A line of work is dedicated to the development of general and effective optimizers, such as Adagrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015), and RMSprop. Researchers have sought strategies to improve Adam through e.g. improved convergence (Reddi et al., 2018), warmup learning rates (Liu et al., 2020), moving averages (Zhang et al., 2019b), Nesterov momentum (Dozat, 2016), rectified weight decay (Loshchilov & Hutter, 2019), and the variance of gradients (Zhuang et al., 2020). Another line of research studies existing optimization algorithms in greater depth. For example, Hoffer et al. (2018), Arora et al. (2019), and Zhang et al. (2019a) have delved into the effective learning rates on scale-invariant weights. This paper lies at the intersection of the two: we study the issues when momentum-based optimizers are applied to scale-invariant weights, and we then propose a new optimization method to address the problem.

G.1 COMPARISON WITH CHO & LEE (2017)

Cho & Lee (2017) have proposed optimizers that are similarly motivated by the scale invariance of certain parameters in a neural network. They have also proposed a solution that reduces the radial component of the optimization steps. Despite the apparent similarities, the crucial difference between Cho & Lee (2017) and ours lies in the space where the optimization is performed. Cho & Lee (2017) perform the gradient steps on a Riemannian manifold. Ours projects the updates onto the tangent planes of the manifold; thus, ours operates in the same Euclidean space as SGD and Adam. From the theory point of view, Cho & Lee (2017) have contributed to the optimization theory on Riemannian manifolds. Our contributions are along a different axis: we focus on the norm growth when the updates are projected onto the tangent spaces (§2.4, §3.1, and §3.2), and we present theoretical findings that are not covered by Cho & Lee (2017). From the practicality point of view, we note that changing the very nature of the space from Euclidean to Riemannian requires users to find the sensible ranges for many optimization hyperparameters again. For example, Cho & Lee (2017) have "used different learning rates for the weights in Euclidean space and on Grassmann [Riemannian] manifolds" (page 7 of (Cho & Lee, 2017)), while in our case the hyperparameters are largely compatible between scale-invariant and scale-variant parameters, for they are both accommodated in the same kind of space. We have shown that SGDP and AdamP outperform the SGD and Adam baselines with exactly the same optimization hyperparameters (Section E.2). The widely used Xavier (Glorot & Bengio, 2010) or Gaussian initializations are no longer available on the spherical Riemannian manifold, necessitating changes in the code defining parameters and their initialization schemes: e.g. Cho & Lee (2017) have employed a dedicated initialization based on truncated normals.
Finally, Cho & Lee (2017) require users to manually register scale-invariant parameters. This procedure is not scalable, as networks nowadays are becoming deeper and more complex, and architectures are becoming more machine-designed than handcrafted. Our optimizers automatically detect scale invariances through the orthogonality test (Equation 11), so users do not need to register anything themselves. The shift from linear to curved coordinates introduces non-trivial changes in the optimization settings; our optimizers do not introduce such a shift, and it is much easier to apply our method to a new kind of model from a user's perspective. In addition to the above conceptual considerations, we compare the performances between ours and the Grassmann optimizers (Cho & Lee, 2017). We first compare them on the 3D scale-invariant Rosenbrock example (see §3.2 for a description) using the Grassmann optimizers SGDG and AdamG. It can be seen that the SGDG and AdamG optimizers do not introduce any norm increase, by definition: they operate on a spherical space. However, their rate of convergence is slower than that of our SGDP and AdamP (the rightmost plots in each figure). We note that SGDG and AdamG include a gradient clipping operation that restricts the magnitude of the projected gradients on the Grassmann manifold to ν, where ν is empirically set to 0.1 in Cho & Lee (2017). We have identified an adverse effect of this gradient clipping, at least on our toy example. As the clipping is removed, SGDG and AdamG come closer to the fast convergence of ours (the middle plots in each figure). We may conclude for the toy example that the Grassmann optimizers also address the unnecessary norm increase and converge as quickly as our optimizers do. For a more practical setup, we present experiments on ImageNet with the Grassmann optimizers (SGDG and AdamG) and report the top-1 accuracies. We have conducted the experiments with ResNet18 following the settings in Table 2.
In addition to the learning rate of the optimizers (lr_Euclidean), Cho & Lee (2017) introduce three more hyperparameters: (1) the learning rate on the Grassmann manifold (lr_Grassmann); (2) the degree of the regularization on the Grassmann manifold (α), which replaces the $L_2$ regularization in Euclidean space; and (3) the gradient clipping threshold (ν). Since Cho & Lee (2017) have not reported results on ImageNet training, we have tuned the hyperparameters ourselves, following the guidelines of Cho & Lee (2017). Table G.1 shows the exploration of these hyperparameters. The first rows show the performance with the hyperparameters recommended in the paper (α = ν = 0.1 and lr_Euclidean = 0.01 for SGDG and AdamG; lr_Grassmann = 0.2 for SGDG and lr_Grassmann = 0.05 for AdamG). We first tested the effects of the regularization (α) and the gradient clipping (ν), which were additionally introduced in the Grassmann optimizers; the first block of Table G.1 shows the result. The gradient clipping (ν) did not have a significant effect, but the regularization (α) decreased the performance. Therefore, we turned off the regularization (α = 0) in the optimizers and started the learning rate search. In Table 2, the baseline optimizers and our optimizers are compared at fixed learning rates (SGD: 0.1; Adam: 0.001).

H.2 STANDARD DEVIATION OF EXPERIMENTS

Most experiments in the paper were performed three times with different seeds. The mean values were reported in the main paper, and the standard deviation values are shown in Tables H.1, H.2, H.3, H.4, and H.5, respectively. In the case of ImageNet classification, the mean values are shown in Table 2 and the standard deviation values in Table H.1. In most cases, the improvement of our optimizer is significantly larger than the standard deviation. Even in the worst case (SGDP on ResNet50), the performance increases are at the level of the standard deviations. For the audio classification results (Tables 4 and H.2), the performance increases by AdamP in music tagging (ROC-AUC) and keyword spotting are much greater than the standard deviation values; for the other entries, the performance increases are at the level of the standard deviations. In language modelling, all experiments have similar standard deviation values, as shown in Table H.3; in all cases, AdamP's improvement (Table 5) is significantly larger than the standard deviation. We observe a different tendency for the image retrieval tasks (Tables 6 and H.4). In many cases, the standard deviation values are large, so the performance boost is not clearly attributable to AdamP. However, the performance increases for InShop-PA and SOP-PA are still greater than the standard deviation values. Finally, in the results for robustness against real-world biases (Tables D.1 and H.5), our optimization algorithm brings about far greater performance gains than the randomness among trials. In most experiments, the performance improvement of our optimizers is much greater than the standard deviation values.

H.3 ANALYSIS WITH MOMENTUM COEFFICIENT

Since our method is deeply involved with the momentum of the optimizers, we measure and analyze its effect at several momentum coefficients. The experiment was conducted using ResNet18 on ImageNet with the setting of Section 4.1, and weight decay was not used, to exclude the norm decrease caused by weight decay.
The results are shown in Table H.6. According to the difference between equation 8 and equation 12, the norm-growth prevention of our optimizer depends on the momentum coefficient, and this is observed in the experiment: the larger the momentum coefficient, the greater the norm difference between SGD and SGDP. In addition, the accuracy improvement of SGDP also increases with the momentum coefficient. The experiment shows that our optimizer can be used with most momentum coefficients; especially when the momentum coefficient is large, the effect of our optimizer is significant and essential for high performance.

H.4 COMPARISON AT THE SAME COMPUTATION COST

As specified in Section 3.1, our optimizers require an additional computation, which increases the training cost by 8%. Optimizers are generally compared at the same number of iterations, and we followed this convention in the other experiments. However, the training cost is also an important issue, so we further verify our optimizers through a comparison at the same training budget. The experimental setup is simple: we performed the ImageNet classification of Section 4.1 with only 92% of the epochs for our optimizers (SGDP and AdamP), so that the training budgets of our optimizers and the baseline optimizers are the same. The results are shown in Table H.7. With the reduced number of training iterations, the performance of our optimizers drops slightly, but they still outperform the baseline optimizers. Thus, our optimizers outperform the baselines not only at the same number of iterations, but also at the same training budget.



https://github.com/rwightman/pytorch-image-models https://github.com/louis2889184/pytorch-adversarial-training https://github.com/MadryLab/cifar10_challenge https://github.com/minzwon/data-driven-harmonic-filters https://github.com/tjddus9597/Proxy-Anchor-CVPR2020



Figure 1. Optimizer trajectories. Shown is $w_t$ for the optimization problem $\max_w \frac{w\cdot w^{*}}{\|w\|_2\|w^{*}\|_2}$. Trajectories start from $w_0$ towards the optimal solution $w^{*}$. The problem is invariant to the scale of $w$. Video version in the attached code.

Figure 2. Vector directions of the gradient, momentum, and ours.

Figure 3. 3D scale-invariant Rosenbrock. Three optimization algorithms are compared. Upper row: loss surface and optimization steps. Lower row: norm r of parameters over the iterations. Results for Adam variants in Appendix §H.1.


Figure 4. Adversarial training. Learning curves by Adam and AdamP.

The cross-bias generalization problem (Bahng et al., 2020) tackles the scenario where the training and test distributions have different real-world biases. This often occurs when the training-set biases provide an easy shortcut to solve the problem. Bahng et al. (2020) have proposed the ReBias scheme based on a minimax optimization, where the inner problem maximizes the independence between an intentionally biased representation and the target model of interest, and the outer optimization solves the standard classification problem. As for the adversarial training, Bahng et al. (2020) have employed Adam to handle the complex optimization.

The adversarial training results have been reproduced using the unofficial PyTorch implementation of the adversarial training of Wide-ResNet (Zagoruyko & Komodakis, 2016) for the CIFAR-10 attack challenge. Projected gradient descent (PGD) attack variants (Madry et al., 2018) have been used as the threat model for all the experiments. We employed 10 inner PGD iterations and ε = 80/255 for the $L_2$ PGD attack and ε = 4/255 for the $L_\infty$ PGD attack. We additionally test a stronger threat model, $L_\infty$ PGD with ε = 8/255 and 20 iterations, as in Madry et al. (2018). Following Madry et al. (2018), we employ 7 inner PGD iterations for the training threat model and 20 inner PGD iterations for the test threat model. In all the experiments, Wide-ResNet-34-10 has been trained with the PGD threat model. The models have been trained for 200 epochs with learning rate 0.01, weight decay 0.0002, batch size 128, and the step learning rate schedule which decays learning rates by 1/10 at epochs 100 and 150. Table E.2 shows the detailed results.

Figure F.6. Norm value analysis for high weight decay: AdamW + cosine learning rate decay.

Figures G.1 and G.2 show the results for the Grassmann optimizers SGDG and AdamG, respectively.

Figure G.1. 3D scale-invariant Rosenbrock with SGDG optimizers. 3D toy experiments based on SGDG optimizers. Upper row: loss surface and optimization steps. Lower row: norm r of parameters over the iterations.

Figure G.2. 3D scale-invariant Rosenbrock with AdamG optimizers. 3D toy experiments based on AdamG optimizers. Upper row: loss surface and optimization steps. Lower row: norm r of parameters over the iterations.

Figure H.1. 3D scale-invariant Rosenbrock with Adam optimizers. 3D toy experiments based on Adam optimizers. Upper row: loss surface and optimization steps. Lower row: norm r of parameters over the iterations.


Analysis of optimizers on real-world tasks. The norm values and final performances for different tasks and optimizers are shown. Norm$_1$: norm at the first epoch. Norm$_{\mathrm{last}}$: norm at the last epoch. Score: accuracy for ImageNet and AUC for music tagging.

Table 2. ImageNet classification. Accuracies of state-of-the-art networks trained with SGDP and AdamP.

Table 4. Audio classification. Results on three audio tasks with Harmonic CNN (Won et al., 2020a).

Table 5. Language modeling. Perplexity on WikiText-103. Lower is better.

AdamP does not significantly outperform AdamW (23.33→23.26) without the explicit enforcement of scale invariance. WN on its own harms the performance (23.33→23.90 for Adam). When AdamP is applied on Transformer-XL + WN, it significantly improves over the AdamW baseline (23.90→22.73), beating the original performance 23.33 by Adam on the vanilla network.

Table 6. Image retrieval. Optimizers are tested on networks with $\ell_2$-induced scale-invariant weights. Recall@1.





We measured the performance of AdamP on EfficientNets and ReXNets to verify that it can be used with other training techniques. The experiments were conducted on a well-known image classification codebase. Table D.2 shows the performance of the original papers, our reproduced results, and AdamP. The results show that AdamP remains effective when training with various techniques and can contribute to improving the best performance of the networks.

Table D.2. Training with various techniques. EfficientNet and ReXNet were trained using the latest training techniques, and the results show that AdamP also contributes to training with these techniques.

Table E.1. Dataset statistics. Summary of the dataset specs used in the experiments. ImageNet-1k (Russakovsky et al., 2015), MS-COCO (Lin et al., 2014), CIFAR-10 (Krizhevsky, 2009), Biased-MNIST (LeCun et al., 1998; Bahng et al., 2020), ImageNet-A (Hendrycks et al., 2019), 9-Class ImageNet (Bahng et al., 2020), 9-Class ImageNet-A (Bahng et al., 2020), MagnaTagATune (Law et al., 2009), Speech Commands (Warden, 2018), DCASE 2017 task 4 (Mesaros et al., 2017), CUB (Wah et al., 2011), In-Shop Clothes (Liu et al., 2016b), and SOP (Oh Song et al., 2016) are used in the experiments.

Table E.2. Adversarial training. Standard and attacked accuracies of the PGD-adversarially trained Wide-ResNet on CIFAR-10. For each threat model scenario, we report the perturbation size ε and the number of PGD iterations n. * denotes results from the original paper.

Table G.1. Hyperparameter search of the Grassmann optimizers. This table shows the performance of ResNet18 on ImageNet trained with the Grassmann optimizers (SGDG and AdamG) under various hyperparameters. Except for the hyperparameters specified in the table, the setting is the same as in the experiment of Table 2. (a) SGDG (SGDP accuracy: 70.70).

We searched the learning rates with four options: 1) use the learning rates of the paper (Cho & Lee, 2017); 2) & 3) use a fixed learning rate for lr_euclidean or lr_grassmann and adjust the other following the learning rate ratio of the paper; 4) use the baseline learning rates for both (lr_euclidean and lr_grassmann). The result is reported in the second block of Table G.1. SGDG shows the best performance with option 4) and AdamG with option 1). However, the performances of the Grassmann optimizers are still much lower than those of our optimizers: SGDP (70.70) and AdamP (70.82). We further tuned the learning rates of the Grassmann optimizers, as shown in the last block of Table G.1. After the learning rate tuning, the Grassmann optimizers show performance comparable to the baseline optimizers (SGD: 70.47, Adam: 68.05, AdamW: 70.39). However, this learning rate tuning is essential for the performance, and our optimizers reach higher performance; hence, our optimizers are more practical and effective for ImageNet training than the Grassmann optimizers.

H.1 TOY EXPERIMENT WITH ADAM

We also evaluated our 3D toy experiment with the Adam optimizer. The results are shown in Figure H.1. The Adam optimizer takes quite different steps in the early stage; however, the fact that norm growth reduces the rate of late convergence is the same as for SGD. The weight decay and our projection mitigate the norm growth and help fast convergence.

Table H.1. Standard deviation for ImageNet classification. Standard deviations of the accuracies of state-of-the-art networks trained with SGDP and AdamP (Table 2).

Table H.2. Standard deviation for audio classification. Standard deviations for the results on the audio tasks with Harmonic CNN (Won et al., 2020a) (Table 4).

Table H.3. Standard deviation for language modeling. Standard deviations of the perplexity values on WikiText-103 (Table 5).

Table H.4. Standard deviation for image retrieval. Standard deviations of Recall@1 on the retrieval tasks (Table 6).

Table H.5. Standard deviation for robustness against real-world biases. Standard deviations for the ReBias (Bahng et al., 2020) performances on the Biased MNIST and 9-Class ImageNet benchmarks (Table D.1).

Table H.6. Analysis with the momentum coefficient. We measure the difference between our SGDP and SGD at various momentum coefficients.

Table H.7. ImageNet classification comparison at the same computation cost. Accuracies of state-of-the-art networks trained with SGDP and AdamP. We also train SGDP and AdamP over 92% of the epochs for a comparison at the same computation cost.

ACKNOWLEDGEMENT

We thank NAVER AI LAB colleagues for discussion and advice, especially Junsuk Choe for the internal review. Naver Smart Machine Learning (NSML) platform (Kim et al., 2018) has been used in the experiments.

AVAILABILITY

Source code is available at https://github.com/clovaai

APPENDIX

This document provides additional materials for the main paper. The contents include the proofs (§A), detailed experimental setups (§B and §E), and additional analyses of the learning rate scheduling and weight decay (§F).

F ANALYSIS WITH LEARNING RATE SCHEDULE AND WEIGHT DECAY

We analyze the norm growth of the scale-invariant parameters and the corresponding change in the effective step size. We provide extended results of this experiment by measuring the norm growth and the effective step size for SGD, SGDP, Adam, and AdamP under various weight decay values. The experiment is based on ResNet18 trained on ImageNet for 100 epochs in the standard setting of §E.2. We have analyzed the impact of the learning rate schedule and weight decay on the scale-invariant parameters. In all considered settings, SGDP and AdamP effectively prevent the norm growth, which in turn prevents the rapid decrease of the effective step sizes, and they show better performance than the baselines. Another way to prevent the norm growth is to control the weight decay. However, this way of norm adjustment is sensitive to the weight decay value and results in poor performance as soon as non-optimal weight decay values are used.
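The link between the weight norm and the effective step size can be illustrated numerically. The sketch below is ours (an arbitrary scale-invariant cost; not from the submitted code): one gradient step with a fixed $\eta$ moves the normalized weight by roughly $\eta/\|w\|_2^2$, so a growing norm silently shrinks the effective step size.

```python
import numpy as np

# For a scale-invariant cost f(w) = g(w/||w||), a gradient step with learning
# rate eta moves the *normalized* weight by ~ eta/||w||^2 times the spherical
# gradient magnitude: the "effective step size" decays as the norm grows.

u = np.array([0.0, 0.0, 1.0])

def f_grad(w):
    """Gradient of f(w) = -u.(w/||w||) with respect to w."""
    n = np.linalg.norm(w)
    w_hat = w / n
    return -(u - (u @ w_hat) * w_hat) / n

eta = 1e-4
for scale in (1.0, 2.0, 4.0):
    w = scale * np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
    w_new = w - eta * f_grad(w)
    # displacement of the normalized weight after one step
    move = np.linalg.norm(w_new / np.linalg.norm(w_new) - w / np.linalg.norm(w))
    print(f"||w||={scale:.0f}: step of w/||w|| = {move:.2e} "
          f"(~ eta/||w||^2 = {eta / scale ** 2:.2e})")
```

Doubling the norm quarters the movement of the normalized weight, which is exactly the $\eta/\|w\|_2^2$ scaling analyzed in this section.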

F.1 ANALYSIS AT HIGH WEIGHT DECAY

In the previous figures, we only showed cases where the weight decay is at most $10^{-4}$. This is because, when the weight decay is large, the scale of the graph changes and it is difficult to demonstrate the differences among small weight decay values. Therefore, we report the large weight decay cases separately in Figures F.5 and F.6. A high weight decay further reduces the weight norm and increases the effective step size, but it does not lead to an improvement in performance. This result also shows that the weight decay used in Section 4.1 ($10^{-4}$) is the best value for the baseline optimizers.

