DIMENSION-REDUCED ADAPTIVE GRADIENT METHOD

Abstract

Adaptive gradient methods, such as Adam, have shown faster convergence than SGD across many kinds of network models. However, adaptive algorithms often generalize worse than SGD. Although much effort has been invested in combining Adam and SGD to solve this issue, adaptive methods still fail to generalize as well as SGD. In this work, we propose a Dimension-Reduced Adaptive Gradient Method (DRAG) to eliminate the generalization gap. DRAG combines SGD and Adam within a trust-region-like framework. We observe that 1) Adam adjusts the stepsize for each gradient coordinate, and thereby decomposes the n-dimensional gradient into n independent search directions; 2) SGD scales all gradient coordinates uniformly and thus has only one descent direction. Accordingly, DRAG reduces the high degree of freedom of Adam and improves the flexibility of SGD by optimizing the loss along k (≪ n) descent directions, e.g. the gradient direction and the momentum direction used in this work. At each iteration, DRAG finds the best stepsizes for the k descent directions by solving a trust-region subproblem whose computational overhead is negligible because the subproblem is low-dimensional, e.g. k = 2 in this work. DRAG is compatible with the common deep learning training pipeline, introduces no extra hyper-parameters, and adds negligible extra computation. Moreover, we prove the convergence of DRAG for non-convex stochastic problems that often occur in deep learning training. Experimental results on representative benchmarks demonstrate the fast convergence and superior generalization of DRAG.

1. INTRODUCTION

SGD (Robbins & Monro, 1951) and its variant with momentum (Sutskever et al., 2013) are widely used in training deep neural networks. They perform well empirically and have theoretical guarantees (Szegedy et al., 2015; He et al., 2016; Lee et al., 2016; Hardt et al., 2016). However, SGD suffers from two issues. It often converges slowly, since it adopts a single learning rate for all gradient coordinates. Moreover, this single learning rate is hard to tune (Wilson et al., 2017), since not all gradient coordinates share the same optimization properties. To resolve this problem, several adaptive gradient methods have been proposed that adopt different learning rates for different gradient coordinates. Typical examples include Adagrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), and Adam (Kingma & Ba, 2014). Empirically, these methods converge faster and ease the burden of carefully tuning the learning rate across many kinds of networks. However, their generalization performance is often worse than that of SGD (Wilson et al., 2017; Zhou et al., 2020).

Some algorithms have been proposed to combine the fast convergence of adaptive gradient methods with the good generalization of SGD. Instances include SWATS (Keskar & Socher, 2017), which automatically switches from Adam to SGD, ND-Adam (Zhang et al., 2017), which uses a vector learning rate and normalization to control direction and stepsize, and AMSGrad (Reddi et al., 2018), which maintains a monotonically increasing second moment. Unfortunately, these methods only slightly narrow the generalization gap between SGD and Adam; they do not generalize as well as SGD, let alone reach state-of-the-art test performance. Accordingly, these algorithms are rarely used to train deep networks in practice.

To combine the merits of Adam and SGD, i.e. fast convergence speed in Adam and excellent generalization in SGD, we propose a Dimension-Reduced Adaptive Gradient Method (DRAG for short), which minimizes the loss along several descent directions as a trade-off between the whole-space search of Adam and the minimization along a single gradient direction in SGD. For Adam, adjusting the stepsize of each gradient coordinate effectively transforms the n-dimensional gradient into n independent search directions, each of which inherits one coordinate of the gradient and sets the remaining coordinates to zero. In contrast, SGD uses a single learning rate for all gradient coordinates and minimizes the loss along one descent direction. Although a per-coordinate adaptive learning rate converges faster than a single learning rate, as shown in many works (Wilson et al., 2017; Zhou et al., 2018), it also leads to the inferior generalization of Adam, since minimizing along n independent directions means searching the whole parameter space and can result in overfitting. It is therefore natural to trade off the number of descent directions. To this end, motivated by DRSOM (Zhang et al., 2022), we update the parameters along the gradient direction and the momentum direction through a trust-region-like approach, which greatly reduces the high adaptivity of Adam while adding flexibility to SGD. At each iteration, DRAG searches for the optimal update along the gradient and the momentum, both widely used in accelerated algorithms (Polyak, 1964; Nesterov, 2003), by solving a two-dimensional trust-region subproblem that yields the best stepsizes for these two directions. In the trust-region subproblem, we use a quadratic approximation of the loss in which the Hessian is estimated by the second moment of Adam; this estimate is a diagonal matrix and greatly reduces the computational cost. Moreover, we heuristically design a simple and effective trust-region radius for the subproblem.
Beyond the algorithm design, we also prove that on non-convex problems DRAG converges and enjoys a stochastic gradient complexity of $O(\epsilon^{-4})$ to find an $\epsilon$-approximate first-order stationary point. To summarize, our contributions are as follows:

• We propose the DRAG algorithm, which optimizes the loss along several descent directions to balance the whole-space search of Adam against the optimization along a single gradient direction in SGD. Moreover, we formulate the search for the optimal stepsizes along these descent directions as a low-dimensional trust-region problem whose computational cost is negligible compared with the base cost of adaptive gradient algorithms.

• We prove that, to find an $\epsilon$-approximate stationary point on non-convex stochastic problems, DRAG has a stochastic gradient complexity of $O(\epsilon^{-4})$, which matches the lower bound $\Omega(\epsilon^{-4})$ of Arjevani et al. (2022) under the same non-convex optimization setting.

• Experimental results show that on several representative benchmarks, DRAG converges faster than SGD and also achieves state-of-the-art generalization performance.

2. RELATED WORK

Adaptive gradient methods, e.g. Adam (Kingma & Ba, 2014), Adagrad (Duchi et al., 2011), and RMSprop (Tieleman et al., 2012), adopt different stepsizes for different gradient coordinates so as to speed up training. Although Adam and its variants are widely used for training deep neural networks, their poor generalization performance keeps SGD dominant in some areas, such as training CNNs (He et al., 2016). Many works have tried to improve the generalization capacity of Adam, including SWATS (Keskar & Socher, 2017), which switches automatically from Adam to SGD during training, ND-Adam (Zhang et al., 2017), which controls the stepsizes and update direction more precisely, AMSGrad (Reddi et al., 2019), which ensures a monotonically increasing second moment, and Padam (Chen et al., 2021), which introduces a partially adaptive parameter to control the adaptivity of the stepsizes.

Our solution for improving generalization is to confine the parameter update to a two-dimensional subspace of the parameter space. Specifically, we solve a trust-region subproblem to determine the optimal stepsizes along the gradient direction and the momentum direction at each iteration. The idea of using the gradient and momentum directions to update the variable traces back to Polyak's heavy ball method (Polyak, 1964):

$$x_t = x_{t-1} - \alpha_1 \nabla f(x_{t-1}) + \alpha_2 d_{t-1}.$$

SGD with momentum (Sutskever et al., 2013) can also be written in this form after replacing the gradient $\nabla f(x_{t-1})$ with the stochastic gradient $g_{t-1}$. Unlike SGD, which adopts constant stepsizes, DRAG takes its stepsizes from the solution of a two-dimensional quadratic model with a spherical constraint, and thus adaptively learns the optimal stepsize for each direction. Trust-region methods have been widely used in optimization, and some works try to use them for machine learning (Martens et al., 2010; Dudar et al., 2017; Erway et al., 2020).
The closest related work is the Dimension Reduced Second Order Method (DRSOM) proposed by Zhang et al. (2022). It uses a finite-difference method to approximate the Hessian-vector products arising in the trust-region subproblem and achieves a higher convergence rate than first-order methods.

3. METHOD

Training neural networks can be seen as solving the following non-convex optimization problem

$$\min_{x \in \mathbb{R}^n} f(x), \qquad (1)$$

where $f$ is the loss function and $x \in \mathbb{R}^n$ is the variable. Among all optimizers, Adam (Kingma & Ba, 2014) is one of the most popular algorithms for solving problem (1). At each training iteration, Adam maintains exponential moving averages (EMA) $v_t$ and $u_t$ of the first and second moments of the stochastic gradient,

$$v_t = \beta_1 v_{t-1} + (1-\beta_1) g_{t-1}, \qquad u_t = \beta_2 u_{t-1} + (1-\beta_2) g_{t-1}^2,$$

where $\beta_1, \beta_2 \in [0, 1]$ are constants and $g_{t-1} := \nabla f(x_{t-1})$ is the stochastic gradient. Adam adaptively scales the learning rate of each gradient coordinate, and in effect minimizes the loss function along n descent directions:

$$x_t = x_{t-1} - \eta \frac{\hat v_t}{\sqrt{\hat u_t} + \nu} = x_{t-1} - \sum_{i=1}^{n} \frac{\eta}{\sqrt{\hat u_{t,i}} + \nu}\,(\hat v_{t,i} e_i), \qquad (2)$$

where $\hat v_t, \hat u_t$ are the bias-corrected $v_t, u_t$, and $e_i$ is the standard basis vector with 1 in dimension $i$ and 0 in all other dimensions. Specifically, Adam adopts a stepsize of $\eta / (\sqrt{\hat u_{t,i}} + \nu)$ for the $i$-th descent direction $\hat v_{t,i} e_i$. While the adaptive stepsize speeds up the convergence of Adam, it weakens generalization because of noise and overfitting. In contrast, SGD generalizes well because it uses a single stepsize for all gradient coordinates and optimizes the loss function only along the gradient direction. One interpretation of this difference in generalization is that Adam's update direction no longer falls into the subspace spanned by all stochastic gradients, $\mathrm{span}\{g_0, \cdots, g_t\}$ (Wilson et al., 2017; Zhang et al., 2017), while SGD's does. In fact, Wilson et al. (2017) proved that on a binary classification problem, SGD converges to the max-margin solution because its update at each step is a linear combination of stochastic gradients, while adaptive gradient methods converge to solutions that generalize poorly because adaptivity makes the algorithm susceptible to noise and therefore causes overfitting.

To overcome this issue, our DRAG algorithm optimizes the loss function in (1) along the gradient direction and the momentum direction. It keeps flexibility in the update direction while inheriting the generalization capacity of SGD. At each step, it searches for the optimal stepsizes along these two directions by solving a two-dimensional trust-region subproblem. From the optimization perspective, it therefore performs the optimal update within the two-dimensional subspace spanned by the gradient direction and the momentum direction. Moreover, although DRAG adopts the trust-region framework, it is compatible with the dominant deep learning training pipeline and introduces no extra hyperparameters.
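The contrast between Adam's n coordinate directions and SGD's single direction can be checked numerically. The following is a minimal sketch, assuming NumPy arrays for the bias-corrected moments; the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def adam_step(v_hat, u_hat, eta, nu=1e-8):
    """Adam's update written in vector form, as in the left side of (2)."""
    return -eta * v_hat / (np.sqrt(u_hat) + nu)

def adam_step_by_coordinates(v_hat, u_hat, eta, nu=1e-8):
    """The same update written as a sum over n directions v_hat[i] * e_i,
    each with its own stepsize eta / (sqrt(u_hat[i]) + nu)."""
    n = v_hat.size
    step = np.zeros(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        step -= eta / (np.sqrt(u_hat[i]) + nu) * (v_hat[i] * e_i)
    return step

def sgd_step(v_hat, eta):
    """SGD-style update: one stepsize, one direction."""
    return -eta * v_hat
```

Both Adam variants return the same vector, while the SGD step is always parallel to `v_hat`.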

3.1. DETAILS OF THE ALGORITHM

Details of our algorithm are described in Algorithm 1. At each training iteration, DRAG first computes the stochastic gradient $g_{t-1}$ and uses it to update the first moment $v_t$ and second moment $u_t$ of the stochastic gradient, as Adam does. Then we use the bias-corrected second moment $\hat u_t$ to approximate the Hessian. In this way, DRAG constructs the trust-region subproblem in line 9 of Algorithm 1. While solving this trust-region subproblem in the high-dimensional parameter space is computationally expensive, DRAG solves it in the two-dimensional subspace spanned by the bias-corrected first moment direction $\hat v_t$ and the momentum direction $d_{t-1}$, which makes the computational overhead negligible. We set the trust-region radius to $\eta\|\hat v_t\|$; the benefits of this setting are described in Section 3.2. After calculating the solution $\alpha_{1t}$ and $\alpha_{2t}$ of the subproblem, we obtain the optimal update $p = -\alpha_{1t} \hat v_t + \alpha_{2t} d_{t-1}$ in the two-dimensional subspace. Finally, we follow Loshchilov & Hutter (2018) and perform a decoupled weight decay step. This is the overall framework of DRAG.

Algorithm 1 Dimension-Reduced Adaptive Gradient Method (DRAG)
1: Input: total number of training iterations $m$, learning rate $\eta$, exponential moving average coefficients $\beta_1, \beta_2$, weight decay scale $\gamma$, margin coefficient $\nu$.
2: Initialize: set $x_0$, $v_0 = 0$, $u_0 = 0$.
3: for $t = 1, \cdots, m$ do
4:   Compute the stochastic gradient $g_{t-1} = \nabla f(x_{t-1})$.
5:   $v_t = \beta_1 v_{t-1} + (1-\beta_1) g_{t-1}$, $\hat v_t = v_t / (1-\beta_1^t)$
6:   $u_t = \beta_2 u_{t-1} + (1-\beta_2) g_{t-1}^2$, $\hat u_t = u_t / (1-\beta_2^t)$
7:   $H_t = \mathrm{diag}(\sqrt{\hat u_t} + \nu)$
8:   $d_{t-1} = x_{t-1} - x_{t-2}$ if $t \ge 2$, else $d_{t-1} = 0$.
9:   $(\alpha_{1t}, \alpha_{2t}) = \arg\min \{ \langle \hat v_t, p\rangle + \frac{1}{2}\langle p, H_t p\rangle \;:\; \|p\| \le \eta\|\hat v_t\|,\ p = -\alpha_1 \hat v_t + \alpha_2 d_{t-1} \}$
10:  $x_t = x_{t-1} - \alpha_{1t} \hat v_t + \alpha_{2t} d_{t-1}$
11:  $x_t = x_t - \eta\gamma x_{t-1}$ (weight decay)
12: end for
13: Output: $x_1, \cdots, x_m$

The only extra computational overhead of DRAG compared with Adam is solving the two-dimensional trust-region subproblem in line 9 of Algorithm 1.
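One iteration of Algorithm 1 can be sketched as below. This is an illustrative outline rather than the authors' implementation: the function name `drag_step` and the `solver` callback (which stands in for line 9 and must return the pair of stepsizes) are assumptions introduced here, and parameters are flat NumPy arrays.

```python
import numpy as np

def drag_step(x, x_prev, v, u, grad, t, solver,
              eta=0.1, beta1=0.9, beta2=0.999, gamma=0.0, nu=1e-8):
    """One DRAG iteration (lines 4-11 of Algorithm 1), with t starting at 1.

    solver(v_hat, d_prev, u_hat, eta, nu) -> (alpha1, alpha2) is a
    hypothetical callback that solves the two-dimensional subproblem.
    Returns the new iterate, the previous iterate, and the updated moments.
    """
    v = beta1 * v + (1.0 - beta1) * grad           # line 5: first moment
    u = beta2 * u + (1.0 - beta2) * grad ** 2      # line 6: second moment
    v_hat = v / (1.0 - beta1 ** t)                 # bias corrections
    u_hat = u / (1.0 - beta2 ** t)
    d_prev = x - x_prev if t >= 2 else np.zeros_like(x)    # line 8
    alpha1, alpha2 = solver(v_hat, d_prev, u_hat, eta, nu)  # line 9
    x_new = x - alpha1 * v_hat + alpha2 * d_prev   # line 10
    x_new = x_new - eta * gamma * x                # line 11: decoupled decay
    return x_new, x, v, u
```

Passing a trivial solver such as `lambda v_hat, d, u_hat, eta, nu: (eta, 0.0)` reduces the step to a bias-corrected momentum update, which is a convenient sanity check before plugging in a real subproblem solver.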
The trust-region subproblem can be formally stated as follows:

$$\begin{aligned}
\min_{\alpha_1, \alpha_2}\ & \langle \hat v_t, -\alpha_1 \hat v_t + \alpha_2 d_{t-1}\rangle + \frac{1}{2}\langle -\alpha_1 \hat v_t + \alpha_2 d_{t-1},\ H_t(-\alpha_1 \hat v_t + \alpha_2 d_{t-1})\rangle \\
=\ & \begin{bmatrix}\alpha_1 & \alpha_2\end{bmatrix} \begin{bmatrix} -\hat v_t^T \hat v_t \\ \hat v_t^T d_{t-1} \end{bmatrix} + \frac{1}{2} \begin{bmatrix}\alpha_1 & \alpha_2\end{bmatrix} \begin{bmatrix} \hat v_t^T H_t \hat v_t & -\hat v_t^T H_t d_{t-1} \\ -\hat v_t^T H_t d_{t-1} & d_{t-1}^T H_t d_{t-1} \end{bmatrix} \begin{bmatrix}\alpha_1 \\ \alpha_2\end{bmatrix} \\
\text{s.t.}\ & \|-\alpha_1 \hat v_t + \alpha_2 d_{t-1}\| \le \eta \|\hat v_t\|,
\end{aligned}$$

where $H_t = \mathrm{diag}(\sqrt{\hat u_t} + \nu)$ as defined in Algorithm 1. This two-dimensional subproblem can be solved efficiently by using its global optimality condition. In Appendix A, we transform it into a standard trust-region subproblem, for which an $\epsilon$-global primal-dual solution satisfying the KKT conditions can be found in $O(\log\log(1/\epsilon))$ time (Luenberger et al., 1984).

If we set $p$ in line 9 of Algorithm 1 to $p = -\alpha_1 \hat v_t + \sum_{j=1}^{k-1} \alpha_{j+1} d_{t-j}$, DRAG generalizes to a subproblem with any number $k$ of search directions. Although, according to our experimental results, one search direction or more than two search directions usually performs worse than two search directions (DRAG), these variants give some intuition about the flexibility of our method and the optimality of its update within the subspace. We omit $\nu$ in the discussion below for simplicity of notation.

One-dimensional subspace

Suppose we update the variable in the one-dimensional subspace spanned by the gradient direction, $x_t = x_{t-1} - \alpha_{1t} \hat v_t$, where $\alpha_{1t}$ is calculated by solving the trust-region subproblem

$$\min_{\alpha_1} \left\{ \langle \hat v_t, -\alpha_1 \hat v_t\rangle + \frac{1}{2}\langle -\alpha_1 \hat v_t,\ \mathrm{diag}(\sqrt{\hat u_t})(-\alpha_1 \hat v_t)\rangle \;:\; |\alpha_1| \le \eta \right\}.$$

In this case, the subproblem has the explicit solution

$$\alpha_{1t} = \min\left\{ \eta,\ \frac{\langle \hat v_t, \hat v_t\rangle}{\langle \hat v_t, \mathrm{diag}(\sqrt{\hat u_t})\, \hat v_t\rangle} \right\}.$$

As the solution shows, unlike SGD, which uses the externally set learning rate as its stepsize, our algorithm updates the parameters with an adaptive stepsize bounded by the learning rate. Furthermore, this adaptive stepsize is optimal with respect to the quadratic approximation of the loss function.
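The closed-form one-dimensional stepsize is easy to evaluate directly. The sketch below assumes NumPy arrays for the bias-corrected moments; the function name is hypothetical, and the guard for zero curvature is an added safety check that is not part of the formula in the text.

```python
import numpy as np

def one_dim_stepsize(v_hat, u_hat, eta):
    """alpha_1t = min(eta, <v, v> / <v, diag(sqrt(u)) v>) with nu omitted."""
    curvature = np.dot(v_hat, np.sqrt(u_hat) * v_hat)
    if curvature <= 0.0:
        # Assumption: with no curvature information, take the full radius.
        return eta
    return min(eta, np.dot(v_hat, v_hat) / curvature)
```

For example, with `v_hat = [1, 2]` and `u_hat = [4, 4]` the unconstrained ratio is 5 / 10 = 0.5, so a learning rate of 0.1 is the binding constraint while a learning rate of 1.0 is not.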
Full-dimensional parameter space

Suppose we update the variable along n directions in the whole parameter space, $x_t = x_{t-1} + p$. Then the trust-region subproblem becomes

$$\min_{p} \left\{ \langle \hat v_t, p\rangle + \frac{1}{2}\langle p, \mathrm{diag}(\sqrt{\hat u_t})\, p\rangle \;:\; \|p\| \le \eta\|\hat v_t\| \right\}.$$

In this scenario, the subproblem has the solution

$$p = -\frac{\hat v_t}{\sqrt{\hat u_t} + \lambda} = -\sum_{i=1}^{n} \frac{1}{\sqrt{\hat u_{t,i}} + \lambda}\,(\hat v_{t,i} e_i),$$

where $\lambda \ge 0$ satisfies $\lambda(\|p\| - \eta\|\hat v_t\|) = 0$. When $\|\hat v_t / \sqrt{\hat u_t}\| \le \eta\|\hat v_t\|$, we have $\lambda = 0$ and the solution coincides with Adam's update direction in (2), up to the factor $\eta$. The form of the solution shows that all gradient coordinates have adaptive stepsizes, which means the method optimizes the loss function along n directions in the whole parameter space. This update is also optimal with respect to the quadratic approximation of the loss function.
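The multiplier $\lambda$ in the full-dimensional case has no closed form in general, but $\|p(\lambda)\|$ decreases monotonically in $\lambda$, so a simple bisection suffices. The following is a minimal sketch under that observation; the function name, the bisection scheme, and the iteration count are illustrative assumptions, not the procedure used in the paper.

```python
import numpy as np

def full_space_step(v_hat, u_hat, eta, nu=1e-8, iters=60):
    """Solve min <v,p> + 0.5 <p, diag(h) p> s.t. ||p|| <= eta * ||v||,
    with h = sqrt(u_hat) + nu. Returns (p, lam)."""
    h = np.sqrt(u_hat) + nu
    radius = eta * np.linalg.norm(v_hat)
    p = -v_hat / h
    if np.linalg.norm(p) <= radius:
        return p, 0.0                      # interior solution, lam = 0
    # ||p(lam)|| <= ||v|| / lam, so lam = 1 / eta is always feasible.
    lo, hi = 0.0, 1.0 / eta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(v_hat / (h + mid)) > radius:
            lo = mid
        else:
            hi = mid
    return -v_hat / (h + hi), hi
```

When the constraint is inactive, the returned step is the Adam direction without the factor $\eta$; when it is active, every coordinate is shrunk by the same additive shift $\lambda$ in the denominator.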

3.2. BENEFITS OF OUR ALGORITHM

Flexibility of update

As in Algorithm 1, DRAG updates the variable $x$ along the EMA of the gradient direction, $\hat v_t$, and the momentum direction $d_{t-1}$. This choice of update directions acts as a trade-off between the whole-space search of Adam and the one-direction search of SGD. Specifically, Adam adjusts the stepsize of each gradient coordinate as $x_t = x_{t-1} - \sum_{i=1}^{n} \frac{\eta}{\sqrt{\hat u_{t,i}} + \nu} (\hat v_{t,i} e_i)$, while SGD scales all coordinates of the gradient uniformly as $x_t = x_{t-1} - \eta \hat v_t$. DRAG, on the other hand, searches along two important directions as $x_t = x_{t-1} - \alpha_{1t} \hat v_t + \alpha_{2t} d_{t-1}$. This choice keeps the update direction flexible while alleviating overfitting and excessive noise. Moreover, the update of DRAG lies in the subspace $\mathrm{span}\{\hat v_t, d_{t-1}\} \subseteq \mathrm{span}\{g_0, \cdots, g_{t-1}\}$, so the parameter update direction is always a linear combination of stochastic gradients. According to Wilson et al. (2017), this is the property that makes SGD converge to the max-margin solution of the binary classification problem studied there, which has the best generalization capacity. This helps to explain DRAG's excellent generalization performance in practice.

Optimal stepsizes

DRAG solves the dimension-reduced subproblem at each training iteration and finds the best update along the gradient direction and the momentum direction. Optimality is measured by the quadratic approximation of the loss function, where the Hessian is approximated by the second moment $\sqrt{\hat u_t}$ and the gradient is approximated by the first moment $\hat v_t$. Since DRAG performs the optimal update along the gradient and momentum directions within the learning rate we set, it converges faster than SGD on the training set and is comparable with adaptive gradient methods.

Heuristic trust-region radius

We set the trust-region radius of the subproblem to $\eta\|\hat v_t\|$. The intuition is that when the gradient is large, we want the algorithm to take a larger step and reduce the loss significantly, whereas when the gradient is small, we want the method to be stable and not change the parameters too much. This heuristic design not only frees us from adjusting the radius at each step, as a classical trust-region method does, but also makes our algorithm compatible with the dominant deep learning training pipeline without introducing extra hyperparameters. Other deep learning optimizers that adopt a trust-region-like framework, such as the Hessian-free optimization method (Martens et al., 2010), L-SSR1-TR (Erway et al., 2020), and DRSOM (Zhang et al., 2022), introduce extra hyperparameters and may incur high extra computational overhead. To our knowledge, this is the first time that a trust-region-like method is fully compatible with the dominant deep learning training setting without extra hyperparameters.

4. CONVERGENCE ANALYSIS IN NON-CONVEX STOCHASTIC OPTIMIZATION

For the analysis of the stochastic non-convex algorithm, we follow Zhuang et al. (2020) and Guo et al. (2021) and make the following definitions and mild assumptions.

Definition 1. For a differentiable function $f$, $x$ is said to be an $\epsilon$-approximate first-order stationary point if it satisfies $\|\nabla f(x)\| \le \epsilon$.

Definition 2. A differentiable function $f(x)$ is called $L$-Lipschitz smooth if it satisfies $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for a constant $L > 0$ and any $x, y$ in the domain of $f$.

Based on these definitions, we make the following assumption.

Assumption 1. For the non-convex problem $\min_{x \in \mathbb{R}^n} f(x)$, we assume the loss $f(x)$ satisfies:
• $f$ is $L$-Lipschitz smooth.
• The gradient estimate $g$ is unbiased, namely $\mathbb{E}[g_t] = \nabla f(x_t)$, and its variance is bounded as $\mathbb{E}[\|g_t - \nabla f(x_t)\|^2] \le \sigma^2$.

We can then derive the convergence of our proposed algorithm and its stochastic gradient complexity for finding an $\epsilon$-approximate first-order stationary point.

Theorem 1. Suppose Assumption 1 holds. Let $\beta_t = \beta$ and $\eta_t = \eta$ for all $t$. Assume there exist constants $\alpha, G > 0$ such that $\alpha \le \min_t \alpha_{1t}$ and $\alpha_{1t} \le \eta G$, $|\alpha_{2t}| \le \eta G$. In addition, let

$$\eta \le \min\left\{ \frac{1}{2LG},\ \left(\frac{(1-\beta)^2 \alpha}{8GL^2}\right)^{\frac{1}{3}},\ \left(\frac{\alpha^2}{96G^2}\right)^{\frac{1}{4}},\ \left(\frac{\alpha}{48LG^2}\right)^{\frac{1}{4}},\ \left(\frac{\alpha}{192L^2G^3}\right)^{\frac{1}{5}} \right\}.$$

Then, if $1-\beta \le \frac{\epsilon^2}{3C_2\sigma^2}$ and $T \ge \max\left\{ \frac{3C_1}{\alpha\epsilon^2},\ \frac{3C_3}{(1-\beta)\epsilon^2} \right\}$, DRAG achieves

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le \epsilon^2, \qquad \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[\|v_t\|^2\right] \le 8\epsilon^2,$$

where $C_1 = 4(f(x_0) - f(x^*))$, $C_2 = \frac{4\eta G}{\alpha}$ and $C_3 = \frac{2\eta G\, \mathbb{E}[\|\nabla f(x_0) - (1-\beta_0) g_0\|^2]}{\alpha}$.

Remark 1. Theorem 1, whose proof is in Appendix C, shows that with a properly chosen constant trust-region radius parameter $\eta_t$ and constant momentum parameter $\beta_t$ (corresponding to $\beta_1$ in Algorithm 1), DRAG converges to an $\epsilon$-approximate first-order stationary point of the non-convex stochastic problem with stochastic gradient complexity $O(\epsilon^{-4})$. Note that the assumptions on $\alpha_{1t}$ and $\alpha_{2t}$ are satisfied naturally by the design of DRAG; see Appendix B for details.
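The $O(\epsilon^{-4})$ rate in Remark 1 follows from balancing the three terms of the bound in Lemma 4 (Appendix C) at $\epsilon^2/3$ each. The short derivation below uses only the constants $C_1, C_2, C_3$ of Theorem 1; taking the largest admissible value of $1-\beta$ is an illustrative choice, and $\eta$, $\alpha$ and $G$ are treated as constants.

```latex
\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\|\nabla f(x_t)\|^2\right]
  &\le \frac{C_1}{T\alpha} + C_2(1-\beta)\sigma^2 + \frac{C_3}{T(1-\beta)}, \\
1-\beta = \frac{\epsilon^2}{3C_2\sigma^2}
  &\;\Longrightarrow\; C_2(1-\beta)\sigma^2 = \frac{\epsilon^2}{3}, \\
T \ge \frac{3C_1}{\alpha\epsilon^2}
  &\;\Longrightarrow\; \frac{C_1}{T\alpha} \le \frac{\epsilon^2}{3}, \\
T \ge \frac{3C_3}{(1-\beta)\epsilon^2} = \frac{9C_2C_3\sigma^2}{\epsilon^4}
  &\;\Longrightarrow\; \frac{C_3}{T(1-\beta)} \le \frac{\epsilon^2}{3}.
\end{align*}
```

Adding the three terms gives the bound $\epsilon^2$. Since each iteration uses one stochastic gradient, the dominant requirement $T \ge 9C_2C_3\sigma^2/\epsilon^4$ yields the stated $O(\epsilon^{-4})$ complexity.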
The complexity of DRAG is of the same order as the lower bound provided by Arjevani et al. (2022). A similar complexity has also been obtained for, for example, LAMB (You et al., 2019) and the Adam family (Guo et al., 2021). In the analysis of DRAG, we only need an unbiased stochastic gradient with bounded variance, without the large mini-batch size requirement of LARS (You et al., 2017) and LAMB (You et al., 2019). In addition, some previous works (Luo et al., 2018; Zaheer et al., 2018; Liu et al., 2019; Shi et al., 2020) require the momentum parameter $\beta_t$ to be very close to zero or to decrease to zero. In contrast, DRAG requires $\beta_t$ to be close to one, which is more consistent with practice.

Theorem 2. Suppose Assumption 1 holds. Assume there exist constants $\delta, G > 0$ such that $0 < \delta \le \frac{\alpha_{1t}}{\eta_t} \le G$ and $\frac{|\alpha_{2t}|}{\eta_t} \le G$. Set $\eta_t = \frac{c_\eta}{\sqrt{t+2}}$ and $1-\beta_t = \frac{C c_\eta}{\sqrt{t+1}}$, for any $c_\eta$ and $C$ satisfying

$$C \ge L\sqrt{\frac{8G}{\delta}}, \qquad c_\eta \le \min\left\{ \frac{1}{\sqrt{2}LG},\ \left(\frac{\delta^2}{96G^2}\right)^{\frac{1}{2}},\ \left(\frac{\delta}{48LG^2}\right)^{\frac{1}{3}},\ \left(\frac{\delta}{192L^2G^3}\right)^{\frac{1}{4}} \right\}.$$

Then there exist two constants $C_1$ and $C_2$, independent of $T$, such that

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le \frac{C_1}{\sqrt{T}} + \frac{C_2 \log T}{\sqrt{T}}, \qquad \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[\|v_{t+1}\|^2\right] \le \frac{8C_1}{\sqrt{T}} + \frac{8C_2 \log T}{\sqrt{T}}.$$

Given a tolerance $\epsilon > 0$, if $T \ge \tilde{O}(\frac{1}{\epsilon^4})$, we have

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le \epsilon^2, \qquad \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[\|v_{t+1}\|^2\right] \le 8\epsilon^2.$$

[…] 2019) has restrictions on the second-moment momentum parameter $\beta_2$. In Theorem 2, we only need $\beta_t$ (corresponding to $\beta_1$ in Algorithm 1) to increase to one.
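The decaying schedules of Theorem 2 can be tabulated with a few lines of code. This is a small sketch; the function name is hypothetical, and the values of `c_eta` and `C` in the usage line are arbitrary placeholders, since the theorem only constrains them through the problem constants $L$, $G$ and $\delta$.

```python
import math

def theorem2_schedule(t, c_eta, C):
    """Return (eta_t, beta_t) with eta_t = c_eta / sqrt(t + 2)
    and 1 - beta_t = C * c_eta / sqrt(t + 1)."""
    eta_t = c_eta / math.sqrt(t + 2)
    beta_t = 1.0 - C * c_eta / math.sqrt(t + 1)
    return eta_t, beta_t

# Placeholder constants, chosen only to show the trend.
schedule = [theorem2_schedule(t, c_eta=0.1, C=2.0) for t in range(5)]
```

As $t$ grows, $\eta_t$ shrinks to zero while $\beta_t$ increases toward one, which is the regime the theorem describes.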

5. EXPERIMENTS

We conduct experiments on several representative benchmarks, including VGG (Simonyan & Zisserman, 2014), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017) on the CIFAR10 and CIFAR100 datasets (Krizhevsky et al., 2009), ResNet18 on ImageNet (Deng et al., 2009), and LSTM (Hochreiter & Schmidhuber, 1997) on the Penn Treebank dataset (Marcinkiewicz, 1994). We compare DRAG with popular deep learning optimizers, including SGD (Robbins & Monro, 1951), Adam (Kingma & Ba, 2014), AdamW (Loshchilov & Hutter, 2018), AdaBound (Luo et al., 2018), AdaBelief (Zhuang et al., 2020), RAdam (Liu et al., 2019), Yogi (Zaheer et al., 2018), and Padam (Chen et al., 2021). Experimental results show that DRAG converges faster than SGD and achieves state-of-the-art generalization performance. We also conduct an ablation study to show that 1) two search directions (DRAG) perform better than one direction or more than two directions and 2) DRAG is robust to different learning rate schedules. At the end of the ablation study, we give some advice for practitioners on using DRAG.

5.1. CNNS ON IMAGE CLASSIFICATION

We conducted experiments with VGG16 with Batch Normalization, ResNet34, and DenseNet121 on the CIFAR10 and CIFAR100 datasets. The experimental setting is borrowed from AdaBelief (Zhuang et al., 2020), and we also use their default setting for all hyperparameters. For DRAG, we choose the same learning rate as for SGD, which is 0.1, and a weight decay factor of 0.0015 for CIFAR10 and 0.0025 for CIFAR100. The other hyperparameters of DRAG follow the default setting ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$). As Figure 1 shows, DRAG converges at a speed comparable with adaptive gradient methods and attains the best generalization performance. Specifically, DRAG obtains a test accuracy gain of more than 0.5% over AdaBelief (Zhuang et al., 2020) on most tasks. The detailed test accuracy is summarized in Table 1. We also train ResNet18 on ImageNet under the official training setting of He et al. (2016), and compare the top-1 test accuracy of DRAG with the best results of other optimizers in the literature. Experimental results show that DRAG has the best generalization capacity. Details are in Table 2. The possible reasons for the improvement in convergence speed and generalization capacity are that 1) DRAG searches for the optimal update along two directions and thus converges faster, and 2) DRAG confines the search for the update to the two-dimensional subspace spanned by the gradient and momentum directions, which avoids overfitting and alleviates the influence of noise, and therefore generalizes better.

5.2. LSTMS ON LANGUAGE MODELING

We experimented with LSTM on the Penn Treebank dataset and recorded the perplexity (lower is better). We follow the exact experimental setting of AdaBelief (Zhuang et al., 2020) and use their default hyperparameters except for SGD. For SGD, we use the same hyperparameters as for DRAG to make a fair comparison between the two. For SGD and DRAG, we set the learning rate to 25, 75, 75 for the 1-, 2-, and 3-layer LSTM respectively, and the weight decay factor to $2.5 \times 10^{-6}$. SGD's generalization performance in our setting is better than the results reported by Zhuang et al. (2020). As shown in Figure 2, for the 1-layer, 2-layer, and 3-layer LSTM, DRAG converges faster than SGD and at a speed comparable to adaptive gradient methods. Table 3 shows that DRAG attains a perplexity more than 0.5 lower than that of the other optimizers. The fast convergence may be attributed to the optimal update DRAG takes, and the good generalization performance may be due to DRAG's two-direction search: the gradient direction inherits SGD's good generalization property, and the extra momentum direction further improves performance.

5.3. ABLATION STUDY

Different search directions

We compare the performance of algorithms that solve the trust-region subproblem in one-dimensional, two-dimensional (DRAG), and three-dimensional subspaces, as described in Section 3.1. As shown in Table 4 in Appendix E, DRAG generalizes better than its counterparts with one and three search directions. The reason is that DRAG updates along more directions than the one-direction counterpart, while its subproblem can be solved more accurately than that of the three-direction counterpart, since a lower-dimensional subproblem incurs smaller numerical errors in single-precision arithmetic on a GPU.

Robustness to learning rate schedule

DRAG is robust to different choices of learning rate schedule. Besides decaying the learning rate at epoch 150 as in Section 5.1, we also conduct experiments that decay the learning rate at epoch 120 and that adopt a cosine annealing learning rate schedule. The only change in hyperparameters from Section 5.1 is that we increase the learning rate of DRAG from 0.1 to 0.12 for the cosine annealing schedule. The intuition is that when the trust-region radius decreases during training, we need a larger initial radius to converge to a better local minimum. We compare DRAG's test performance with that of other optimizers using VGG16 on CIFAR10. Details are presented in Table 5 in Appendix E, which shows that DRAG enjoys the best generalization performance under all the learning rate schedules.

6. CONCLUSION

In this paper we propose the DRAG algorithm, which finds the optimal update of the parameters along the gradient and momentum directions at each iteration. Compared with Adam, DRAG reduces the flexibility of the update direction from searching the whole parameter space to updating in a two-dimensional subspace; it is therefore less susceptible to overfitting and generalizes better. Compared with SGD, DRAG inherits the gradient update direction and also updates along an extra momentum direction; it thus converges faster and has comparable generalization capacity. Theoretically, we prove that DRAG has the same order of stochastic gradient complexity as the lower bound for non-convex stochastic optimization (Arjevani et al., 2022). Experimentally, we show that DRAG converges faster than SGD and attains state-of-the-art generalization performance. Our algorithm can be further generalized to any number of search directions and any choice of Hessian approximation.

A SOLVE THE TRUST-REGION SUBPROBLEM

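Recall the trust-region subproblem

$$\min_{\alpha}\ \langle \alpha, C_t\rangle + \frac{1}{2}\langle \alpha, Q_t \alpha\rangle \quad \text{s.t.}\quad \|\alpha\|_{G_t} := \sqrt{\langle \alpha, G_t \alpha\rangle} \le \eta\|\hat v_t\|,$$

where

$$\alpha := \begin{bmatrix}\alpha_1 \\ \alpha_2\end{bmatrix}, \quad C_t := \begin{bmatrix} -\hat v_t^T \hat v_t \\ \hat v_t^T d_{t-1} \end{bmatrix}, \quad Q_t := \begin{bmatrix} \hat v_t^T H_t \hat v_t & -\hat v_t^T H_t d_{t-1} \\ -d_{t-1}^T H_t \hat v_t & d_{t-1}^T H_t d_{t-1} \end{bmatrix}, \quad G_t := \begin{bmatrix} \hat v_t^T \hat v_t & -\hat v_t^T d_{t-1} \\ -d_{t-1}^T \hat v_t & d_{t-1}^T d_{t-1} \end{bmatrix},$$

and $H_t = \mathrm{diag}(\sqrt{\hat u_t} + \nu)$. To solve this trust-region subproblem, we transform it into a standard trust-region subproblem with an $L_2$-norm constraint. When the matrix $G_t$ is positive definite, it has the Cholesky decomposition $G_t = L_t L_t^T$, so that

$$\alpha^T G_t \alpha = (L_t^T \alpha)^T (L_t^T \alpha) = \|L_t^T \alpha\|^2,$$

and the constraint reads $\|L_t^T \alpha\| \le \eta\|\hat v_t\|$. We therefore let $y = L_t^T \alpha$, so that $\alpha = L_t^{-T} y$, and the subproblem becomes

$$\min_{y}\ \langle C_t, L_t^{-T} y\rangle + \frac{1}{2}\langle L_t^{-T} y, Q_t L_t^{-T} y\rangle \ \ \text{s.t.}\ \|y\| \le \eta\|\hat v_t\| \quad\Longleftrightarrow\quad \min_{y}\ \langle L_t^{-1} C_t, y\rangle + \frac{1}{2}\langle y, L_t^{-1} Q_t L_t^{-T} y\rangle \ \ \text{s.t.}\ \|y\| \le \eta\|\hat v_t\|.$$

In this way, the trust-region subproblem is transformed into a standard quadratic optimization problem with a spherical constraint, which can be solved efficiently (Wright et al., 1999). When $|G_t| = 0$, $\hat v_t$ and $d_{t-1}$ are linearly dependent. In this case, we solve the one-dimensional subproblem as described in Section 3.1.

B HOW α 1t , α 2t SATISFY THE ASSUMPTIONS NATURALLY

The trust-region subproblem solved in Algorithm 1 has the global optimality condition (Luenberger et al., 1984)

$$\begin{cases} (Q_t + \lambda G_t)\alpha + C_t = 0, \\ Q_t + \lambda G_t \succeq 0, \\ \lambda(\|\alpha\|_{G_t} - \eta\|\hat v_t\|) = 0, \quad \lambda \ge 0. \end{cases}$$

By construction, $G_t$ is positive semidefinite. In practice, numerical issues sometimes make it indefinite, leaving the trust-region subproblem unsolvable. We therefore adjust $G_t$ as

$$G_t = \begin{cases} G_t & \text{if } \lambda_{\min} \ge \varepsilon_0 \text{ or } |G_t| = 0, \\ \varepsilon_0 I & \text{otherwise,} \end{cases}$$

where $\lambda_{\min}$ is the smallest eigenvalue of $G_t$. In this way, when $|G_t| \ne 0$, we have

$$\|\alpha\| \le \|G_t^{-1/2}\|\,\|\alpha\|_{G_t} \le \eta\,\|G_t^{-1/2}\|\,\|\hat v_t\| \le \eta\frac{\|\hat v_t\|}{\sqrt{\varepsilon_0}},$$

which means $\left|\frac{\alpha_{1t}}{\eta}\right|, \left|\frac{\alpha_{2t}}{\eta}\right| \le \frac{\|\hat v_t\|}{\sqrt{\varepsilon_0}}$.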
Under the common additional assumption that the stochastic gradient $g_t = \nabla f(x_t)$ has bounded $L_\infty$ norm, i.e. $\|g_t\|_\infty \le G_\infty$, the moving average $\hat v_t$ of $g_t$ also has bounded norm $\|\hat v_t\|$. Therefore, $\left|\frac{\alpha_{1t}}{\eta}\right|$ and $\left|\frac{\alpha_{2t}}{\eta}\right|$ are upper bounded by a constant. When $|G_t| = 0$, $d_{t-1}$ is parallel to $\hat v_t$, and we only need to find the optimal update within the trust region along the gradient direction $\hat v_t$. In this case, we manually set $\alpha_{2t} = 0$ in our implementation of DRAG, and $\alpha_1$ then satisfies $|\alpha_1| \le \eta$. From the discussion above, the assumption in Theorem 1 and Theorem 2 that $\left|\frac{\alpha_{1t}}{\eta}\right|$ and $\left|\frac{\alpha_{2t}}{\eta}\right|$ are upper bounded is satisfied under the common assumption that the stochastic gradient $g_t = \nabla f(x_t)$ has bounded $L_\infty$ norm. For simplicity of notation, we state the assumptions directly in terms of $\alpha_{1t}$ and $\alpha_{2t}$ in Theorem 1 and Theorem 2.

For the assumption that $\alpha_{1t}$ is positive and $\frac{\alpha_{1t}}{\eta}$ is lower bounded by a constant, we give an explanation based on intuition and empirical results. We consider the gradient direction the most important local update direction, because in the training pipeline of neural networks, the stochastic gradients of the trainable parameters are the new information gained at each iteration. The update should therefore at least move along the gradient descent direction rather than the gradient ascent direction. Moreover, in all our experimental settings, the observed $\alpha_{1t}$ is always positive and $\frac{\alpha_{1t}}{\eta}$ is always larger than 0.1. This assumption on $\alpha_{1t}$ is therefore reasonable and holds in practice.
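The Cholesky transformation of Appendix A, together with a safeguard for the degenerate case, can be sketched as follows. This is an illustrative solver, not the authors' implementation: the function name, the bisection on the multiplier, and the threshold `eps0` are assumptions, and the safeguard is simplified so that any nearly singular $G_t$ falls back to the one-dimensional closed form instead of being replaced by $\varepsilon_0 I$.

```python
import numpy as np

def solve_drag_subproblem(v_hat, d_prev, u_hat, eta, nu=1e-8,
                          eps0=1e-12, iters=80):
    """Return (alpha1, alpha2) for p = -alpha1 * v_hat + alpha2 * d_prev."""
    h = np.sqrt(u_hat) + nu                      # diagonal of H_t
    radius = eta * np.linalg.norm(v_hat)         # eta * ||v_hat||
    M = np.stack([-v_hat, d_prev], axis=1)       # p = M @ alpha
    C = M.T @ v_hat                              # C_t
    Q = M.T @ (h[:, None] * M)                   # Q_t
    G = M.T @ M                                  # G_t

    # Simplified safeguard (assumption): nearly singular G_t is treated as
    # the degenerate case, solved along v_hat only with alpha2 = 0.
    if np.linalg.eigvalsh(G)[0] < eps0:
        alpha1 = eta if Q[0, 0] <= 0.0 else min(eta, G[0, 0] / Q[0, 0])
        return alpha1, 0.0

    L = np.linalg.cholesky(G)                    # G_t = L L^T
    b = np.linalg.solve(L, C)                    # L^{-1} C_t
    A = np.linalg.solve(L, np.linalg.solve(L, Q).T)   # L^{-1} Q_t L^{-T}
    w, V = np.linalg.eigh(A)
    bt = V.T @ b

    def y_of(lam):
        # Stationary point of the shifted model: (A + lam I) y = -b.
        return -V @ (bt / (w + lam))

    if w[0] > 0.0 and np.linalg.norm(y_of(0.0)) <= radius:
        y = y_of(0.0)                            # interior solution
    else:
        lo, hi = 0.0, np.linalg.norm(b) / radius # y_of(hi) is feasible
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if np.linalg.norm(y_of(mid)) > radius:
                lo = mid
            else:
                hi = mid
        y = y_of(hi)
    alpha = np.linalg.solve(L.T, y)              # alpha = L^{-T} y
    return float(alpha[0]), float(alpha[1])
```

Because $H_t$ is a positive diagonal matrix, $Q_t$ is positive semidefinite and the transformed problem is convex, so the so-called hard case of general trust-region solvers does not need separate handling in this sketch.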

C PROOF OF THEOREM 1

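One key ingredient in our analysis is an existing variance recursion for the stochastic estimator based on a moving average, which is given by the following lemma.

Lemma 3 (Variance Recursion (Wang et al., 2017)). Suppose Assumption 1 holds. Then

$$\mathbb{E}_t[\|v_{t+1} - \nabla f(x_t)\|^2] \le \beta\|v_t - \nabla f(x_{t-1})\|^2 + 2(1-\beta)^2\,\mathbb{E}_t[\|g_t - \nabla f(x_t)\|^2] + \frac{L^2\|d_t\|^2}{1-\beta},$$

where $\mathbb{E}_t[\cdot]$ denotes the conditional expectation with respect to all randomness before $g_t$.

Before proving Theorem 1, we need the following auxiliary lemma.

Lemma 4. Suppose Assumption 1 holds. Assume there exist $\alpha, \eta, \delta, G > 0$ such that $\alpha \le \min_t \alpha_{1t}$, $\max_t \eta_t \le \eta$, and

$$0 < \delta \le \frac{\alpha}{\eta} \le \frac{\alpha_{1t}}{\eta_t} \le G, \qquad \frac{|\alpha_{2t}|}{\eta_t} \le G,$$

where $\delta$ and $G$ are constants independent of $t$. In addition, assume

$$\eta \le \min\left\{\frac{1}{2LG},\ \frac{1-\beta}{2L}\sqrt{\frac{\delta}{2G}},\ \frac{\delta}{4\sqrt{6}\,G},\ \left(\frac{\delta}{48LG^2}\right)^{1/3},\ \left(\frac{\delta}{192L^2G^3}\right)^{1/4}\right\}.$$

Then there exist positive constants $C_1$, $C_2$ and $C_3$, all independent of $T$, such that the following estimates hold:

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{C_1}{T\alpha} + C_2(1-\beta)\sigma^2 + \frac{C_3}{T(1-\beta)}, \qquad \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|v_t\|^2 \le \frac{8C_1}{T\alpha} + 8C_2(1-\beta)\sigma^2 + \frac{8C_3}{T(1-\beta)}. \tag{4}$$

Proof. Since $F$ is $L$-smooth, we have

$$\begin{aligned} f(x_{t+1}) &\le f(x_t) + \langle\nabla f(x_t), -\alpha_{1t}v_{t+1} + \alpha_{2t}d_t\rangle + \frac{L}{2}\|-\alpha_{1t}v_{t+1} + \alpha_{2t}d_t\|^2\\ &= f(x_t) - \alpha_{1t}\langle\nabla f(x_t), v_{t+1}\rangle + \alpha_{2t}\langle\nabla f(x_t), d_t\rangle + \frac{L\alpha_{1t}^2}{2}\|v_{t+1}\|^2 + \frac{L\alpha_{2t}^2}{2}\|d_t\|^2 - L\alpha_{1t}\alpha_{2t}\langle v_{t+1}, d_t\rangle\\ &= f(x_t) + \frac{\alpha_{1t}}{2}\|\nabla f(x_t) - v_{t+1}\|^2 - \frac{\alpha_{1t}(1-L\alpha_{1t})}{2}\|v_{t+1}\|^2 - \frac{\alpha_{1t}}{2}\|\nabla f(x_t)\|^2\\ &\quad + \alpha_{2t}\langle\nabla f(x_t), d_t\rangle + \frac{L\alpha_{2t}^2}{2}\|d_t\|^2 - L\alpha_{1t}\alpha_{2t}\langle v_{t+1}, d_t\rangle. \end{aligned} \tag{5}$$

By Lemma 3, we can obtain

$$\sum_{t=1}^{T}\mathbb{E}[\|\nabla f(x_{t-1}) - v_t\|^2] \le \frac{1}{1-\beta}\mathbb{E}[\|\nabla f(x_0) - v_1\|^2] + 2(1-\beta)T\sigma^2 + \frac{L^2}{(1-\beta)^2}\,\mathbb{E}\left[\sum_{t=1}^{T}\|d_t\|^2\right]. \tag{6}$$

Taking expectations on both sides of (5), summing over $t = 0, \ldots, T-1$, and combining with (6) together with $\alpha_{1t} \le \eta G$, we have

$$\begin{aligned} \mathbb{E}[f(x_T) - f(x_0)] &\le \frac{\eta G}{2}\left(\frac{\mathbb{E}[\|\nabla f(x_0) - v_1\|^2]}{1-\beta} + 2(1-\beta)T\sigma^2 + \frac{L^2}{(1-\beta)^2}\sum_{t=1}^{T}\mathbb{E}[\|d_t\|^2]\right)\\ &\quad - \sum_{t=0}^{T-1}\frac{\alpha_{1t}}{2}\mathbb{E}[\|\nabla f(x_t)\|^2] - \sum_{t=0}^{T-1}\frac{\alpha_{1t}(1-L\alpha_{1t})}{2}\mathbb{E}[\|v_{t+1}\|^2]\\ &\quad + \sum_{t=0}^{T-1}\mathbb{E}\left[\alpha_{2t}\langle\nabla f(x_t), d_t\rangle + \frac{L\alpha_{2t}^2}{2}\|d_t\|^2 - L\alpha_{1t}\alpha_{2t}\langle v_{t+1}, d_t\rangle\right]. \end{aligned}$$

By the AM-GM inequality,

$$\alpha_{2t}\langle\nabla f(x_t), d_t\rangle \le \frac{\alpha_{1t}}{4}\|\nabla f(x_t)\|^2 + \frac{\alpha_{2t}^2}{\alpha_{1t}}\|d_t\|^2, \qquad -L\alpha_{1t}\alpha_{2t}\langle v_{t+1}, d_t\rangle \le \frac{\alpha_{1t}(1-L\alpha_{1t})}{4}\|v_{t+1}\|^2 + \frac{L^2\alpha_{1t}\alpha_{2t}^2}{1-L\alpha_{1t}}\|d_t\|^2. \tag{7}$$

Combining all together, we have

$$\begin{aligned} \sum_{t=0}^{T-1}\frac{\alpha_{1t}}{4}\mathbb{E}[\|\nabla f(x_t)\|^2] &\le f(x_0) - f(x^*) + \frac{\eta G}{2(1-\beta)}\mathbb{E}[\|\nabla f(x_0) - v_1\|^2] + \sum_{t=1}^{T}\frac{\eta GL^2}{2(1-\beta)^2}\mathbb{E}[\|d_t\|^2] + \eta G(1-\beta)T\sigma^2\\ &\quad + \sum_{t=0}^{T-1}\left(\frac{\alpha_{2t}^2}{\alpha_{1t}} + \frac{L\alpha_{2t}^2}{2} + \frac{L^2\alpha_{1t}\alpha_{2t}^2}{1-L\alpha_{1t}}\right)\mathbb{E}[\|d_t\|^2] - \sum_{t=0}^{T-1}\frac{\alpha_{1t}(1-L\alpha_{1t})}{4}\mathbb{E}[\|v_{t+1}\|^2], \end{aligned} \tag{8}$$

where $x^*$ is one of the global minimizers of $F$. Since $\alpha_{1t} \le \eta G \le \frac{1}{2L}$, we have $\frac{\alpha_{1t}(1-L\alpha_{1t})}{4} \ge \frac{\alpha_{1t}}{8} \ge \frac{\alpha}{8}$. Moreover, by the conditions on $\eta$ and $\alpha \ge \delta\eta$,

$$\frac{\alpha_{1t}}{16} \ge \frac{\alpha}{16} \ge \frac{\eta^3GL^2}{2(1-\beta)^2} \ge \frac{\eta GL^2\eta_{t+1}^2}{2(1-\beta)^2}, \qquad \frac{\alpha_{1t}}{96} \ge \frac{\alpha}{96} \ge \frac{\eta^4G^2}{\alpha} \ge \frac{\alpha_{2,t+1}^2\eta_{t+1}^2}{\alpha_{1,t+1}},$$

$$\frac{\alpha_{1t}}{96} \ge \frac{\alpha}{96} \ge \frac{L\eta^4G^2}{2} \ge \frac{L\alpha_{2,t+1}^2\eta_{t+1}^2}{2}, \qquad \text{and} \qquad \frac{\alpha_{1t}}{96} \ge \frac{\alpha}{96} \ge 2L^2\eta^5G^3 \ge \frac{L^2\alpha_{1,t+1}\alpha_{2,t+1}^2\eta_{t+1}^2}{1-L\alpha_{1,t+1}}.$$

Using $\|d_t\| \le \eta_t\|v_t\|$ and $v_0 = 0$, we have

$$\begin{aligned} &\sum_{t=0}^{T-1}\left(\frac{\alpha_{2t}^2}{\alpha_{1t}} + \frac{L\alpha_{2t}^2}{2} + \frac{L^2\alpha_{1t}\alpha_{2t}^2}{1-L\alpha_{1t}}\right)\|d_t\|^2 + \sum_{t=1}^{T}\frac{\eta GL^2}{2(1-\beta)^2}\|d_t\|^2 - \sum_{t=0}^{T-1}\frac{\alpha_{1t}(1-L\alpha_{1t})}{4}\|v_{t+1}\|^2\\ &\quad\le -\frac{\alpha}{8}\sum_{t=0}^{T-1}\|v_{t+1}\|^2 + \sum_{t=0}^{T-1}\left(\frac{\alpha_{2t}^2\eta_t^2}{\alpha_{1t}} + \frac{L\alpha_{2t}^2\eta_t^2}{2} + \frac{L^2\alpha_{1t}\alpha_{2t}^2\eta_t^2}{1-L\alpha_{1t}}\right)\|v_t\|^2 + \sum_{t=0}^{T-1}\frac{\eta GL^2\eta_{t+1}^2}{2(1-\beta)^2}\|v_{t+1}\|^2\\ &\quad\le -\frac{\alpha}{32}\sum_{t=0}^{T-1}\|v_{t+1}\|^2. \end{aligned} \tag{9}$$

Combining (8) and (9), we can obtain

$$\sum_{t=0}^{T-1}\frac{\alpha_{1t}}{4}\mathbb{E}[\|\nabla f(x_t)\|^2] \le f(x_0) - f(x^*) + \frac{\eta G}{2(1-\beta)}\mathbb{E}[\|\nabla f(x_0) - v_1\|^2] + \eta G(1-\beta)T\sigma^2,$$

$$\frac{\alpha}{32}\sum_{t=0}^{T-1}\mathbb{E}[\|v_{t+1}\|^2] \le f(x_0) - f(x^*) + \frac{\eta G}{2(1-\beta)}\mathbb{E}[\|\nabla f(x_0) - v_1\|^2] + \eta G(1-\beta)T\sigma^2.$$

Dividing the above two inequalities by $\frac{\alpha T}{4}$ and $\frac{\alpha T}{32}$ respectively, and using $\alpha_{1t} \ge \alpha$ and $\frac{\eta}{\alpha} \le \frac{1}{\delta}$, we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla f(x_t)\|^2] \le \frac{4(f(x_0) - f(x^*))}{T\alpha} + \frac{2G\,\mathbb{E}[\|\nabla f(x_0) - v_1\|^2]}{\delta(1-\beta)T} + \frac{4G(1-\beta)\sigma^2}{\delta},$$

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|v_{t+1}\|^2] \le \frac{32(f(x_0) - f(x^*))}{T\alpha} + \frac{16G\,\mathbb{E}[\|\nabla f(x_0) - v_1\|^2]}{\delta(1-\beta)T} + \frac{32G(1-\beta)\sigma^2}{\delta},$$

which completes the proof by letting $C_1 = 4(f(x_0) - f(x^*))$, $C_2 = \frac{4G}{\delta}$, and $C_3 = \frac{2G\,\mathbb{E}[\|\nabla f(x_0) - v_1\|^2]}{\delta}$.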
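The five-term bound on $\eta$ in Lemma 4 is exactly what makes the four absorption inequalities used before (9) hold in the worst case $\alpha = \delta\eta$. The following sketch checks this numerically; the function names and the sample values of $L$, $G$, $\delta$ and $\beta$ are illustrative assumptions, not values taken from the paper.

```python
import math

def eta_upper_bound(L, G, delta, beta):
    """Largest eta allowed by the five-term minimum in Lemma 4."""
    return min(
        1.0 / (2.0 * L * G),
        (1.0 - beta) / (2.0 * L) * math.sqrt(delta / (2.0 * G)),
        delta / (4.0 * math.sqrt(6.0) * G),
        (delta / (48.0 * L * G ** 2)) ** (1.0 / 3.0),
        (delta / (192.0 * L ** 2 * G ** 3)) ** 0.25,
    )

def absorption_margins(L, G, delta, beta, eta):
    """Left side minus right side of the four inequalities used before (9).

    Uses the worst case alpha = delta * eta; every margin should be >= 0.
    """
    alpha = delta * eta
    return {
        "momentum": alpha / 16 - eta ** 3 * G * L ** 2 / (2 * (1 - beta) ** 2),
        "gradient_cross": alpha / 96 - eta ** 4 * G ** 2 / alpha,
        "quadratic": alpha / 96 - L * eta ** 4 * G ** 2 / 2,
        "velocity_cross": alpha / 96 - 2 * L ** 2 * eta ** 5 * G ** 3,
    }

# Illustrative constants (assumed, not from the paper).
L_, G_, delta_, beta_ = 10.0, 2.0, 0.1, 0.9
eta_max = eta_upper_bound(L_, G_, delta_, beta_)
for name, margin in absorption_margins(L_, G_, delta_, beta_, 0.5 * eta_max).items():
    print(f"{name}: margin = {margin:.3e}")
```

Each term of the minimum corresponds to one requirement: the first gives $\alpha_{1t} \le \frac{1}{2L}$, and the remaining four are obtained by solving the respective inequality for $\eta$ after substituting $\alpha = \delta\eta$.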

Proof of Theorem 1

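Proof. By the selections of $\alpha$ and $\eta_t$ in Theorem 1, let $\delta = \alpha/\eta$. By Lemma 4, we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{C_1}{T\alpha} + C_2(1-\beta)\sigma^2 + \frac{C_3}{T(1-\beta)}.$$

The conditions

$$1 - \beta \le \frac{\epsilon^2}{3C_2\sigma^2} \qquad \text{and} \qquad T \ge \max\left\{\frac{3C_1}{\alpha\epsilon^2},\ \frac{3C_3}{(1-\beta)\epsilon^2}\right\}$$

lead to $\frac{C_1}{T\alpha} \le \frac{\epsilon^2}{3}$, $C_2(1-\beta)\sigma^2 \le \frac{\epsilon^2}{3}$, and $\frac{C_3}{T(1-\beta)} \le \frac{\epsilon^2}{3}$, so the right-hand side is at most $\epsilon^2$. This completes the proof.

D PROOF OF THEOREM 2

Proof. From (5) in Lemma 4, we have

$$f(x_{t+1}) \le f(x_t) + \frac{\alpha_{1t}}{2}\|\nabla f(x_t) - v_{t+1}\|^2 - \frac{\alpha_{1t}(1-L\alpha_{1t})}{2}\|v_{t+1}\|^2 - \frac{\alpha_{1t}}{2}\|\nabla f(x_t)\|^2 + \alpha_{2t}\langle\nabla f(x_t), d_t\rangle + \frac{L\alpha_{2t}^2}{2}\|d_t\|^2 - L\alpha_{1t}\alpha_{2t}\langle v_{t+1}, d_t\rangle. \tag{10}$$

By Lemma 3, we have

$$(1-\beta_t)\|v_t - \nabla f(x_{t-1})\|^2 \le \|v_t - \nabla f(x_{t-1})\|^2 - \mathbb{E}_t[\|v_{t+1} - \nabla f(x_t)\|^2] + 2(1-\beta_t)^2\,\mathbb{E}_t[\|\nabla f(x_t) - g_t\|^2] + \frac{L^2\|d_t\|^2}{1-\beta_t}.$$

Taking expectations and summing over $t = 1, \ldots, T$, we get

$$\sum_{t=0}^{T-1}\mathbb{E}[(1-\beta_{t+1})\|\nabla f(x_t) - v_{t+1}\|^2] \le \mathbb{E}[\|v_1 - \nabla f(x_0)\|^2] + \sum_{t=1}^{T}\left(2(1-\beta_t)^2\sigma^2 + \frac{L^2\,\mathbb{E}[\|d_t\|^2]}{1-\beta_t}\right). \tag{11}$$

Note that $1 - \beta_{t+1} = C\eta_t$, so $\frac{\alpha_{1t}}{2} \le \frac{G}{2}\eta_t = \frac{G}{2C}(1-\beta_{t+1})$. Taking expectations on both sides of (10), summing over $t = 0, \ldots, T-1$, and combining with (11), we can obtain

$$\begin{aligned} \mathbb{E}[f(x_T) - f(x_0)] &\le \frac{G}{2C}\left(\mathbb{E}[\|v_1 - \nabla f(x_0)\|^2] + \sum_{t=1}^{T}\left(2(1-\beta_t)^2\sigma^2 + \frac{L^2\,\mathbb{E}[\|d_t\|^2]}{1-\beta_t}\right)\right)\\ &\quad - \sum_{t=0}^{T-1}\frac{\alpha_{1t}(1-L\alpha_{1t})}{2}\mathbb{E}[\|v_{t+1}\|^2] - \sum_{t=0}^{T-1}\frac{\alpha_{1t}}{2}\mathbb{E}[\|\nabla f(x_t)\|^2]\\ &\quad + \sum_{t=0}^{T-1}\mathbb{E}\left[\alpha_{2t}\langle\nabla f(x_t), d_t\rangle + \frac{L\alpha_{2t}^2}{2}\|d_t\|^2 - L\alpha_{1t}\alpha_{2t}\langle v_{t+1}, d_t\rangle\right]. \end{aligned} \tag{12}$$

Under review as a conference paper at ICLR 2023

From (7) and (12), we can get

$$\begin{aligned} \sum_{t=0}^{T-1}\frac{\alpha_{1t}}{4}\mathbb{E}[\|\nabla f(x_t)\|^2] &\le f(x_0) - f(x^*) + \frac{G}{2C}\mathbb{E}[\|\nabla f(x_0) - v_1\|^2] + \sum_{t=1}^{T}\frac{G}{C}(1-\beta_t)^2\sigma^2 + \sum_{t=1}^{T}\frac{GL^2\,\mathbb{E}[\|d_t\|^2]}{2C(1-\beta_t)}\\ &\quad + \sum_{t=0}^{T-1}\left(\frac{\alpha_{2t}^2}{\alpha_{1t}} + \frac{L\alpha_{2t}^2}{2} + \frac{L^2\alpha_{1t}\alpha_{2t}^2}{1-L\alpha_{1t}}\right)\mathbb{E}[\|d_t\|^2] - \sum_{t=0}^{T-1}\frac{\alpha_{1t}(1-L\alpha_{1t})}{4}\mathbb{E}[\|v_{t+1}\|^2]. \end{aligned} \tag{13}$$

By the conditions on $c_\eta$ and $C$, we have $\alpha_{1t} \le \eta_t G \le \frac{1}{2L}$, and the coefficients of the $\|d_t\|^2$ terms can be absorbed as in the proof of Lemma 4. Using $\|d_t\| \le \eta_t\|v_t\|$ and $v_0 = 0$, we can get

$$\begin{aligned} &\sum_{t=0}^{T-1}\left(\frac{\alpha_{2t}^2}{\alpha_{1t}} + \frac{L\alpha_{2t}^2}{2} + \frac{L^2\alpha_{1t}\alpha_{2t}^2}{1-L\alpha_{1t}}\right)\|d_t\|^2 + \sum_{t=1}^{T}\frac{GL^2}{2C(1-\beta_t)}\|d_t\|^2 - \sum_{t=0}^{T-1}\frac{\alpha_{1t}(1-L\alpha_{1t})}{4}\|v_{t+1}\|^2\\ &\quad\le -\sum_{t=0}^{T-1}\frac{\alpha_{1t}}{8}\|v_{t+1}\|^2 + \sum_{t=0}^{T-1}\left(\frac{\alpha_{2t}^2\eta_t^2}{\alpha_{1t}} + \frac{L\alpha_{2t}^2\eta_t^2}{2} + \frac{L^2\alpha_{1t}\alpha_{2t}^2\eta_t^2}{1-L\alpha_{1t}}\right)\|v_t\|^2 + \sum_{t=0}^{T-1}\frac{GL^2\eta_{t+1}^2}{2C(1-\beta_{t+1})}\|v_{t+1}\|^2\\ &\quad\le -\sum_{t=0}^{T-1}\frac{\alpha_{1t}}{32}\|v_{t+1}\|^2. \end{aligned} \tag{14}$$

Combining (13) and (14), we can obtain

$$\sum_{t=0}^{T-1}\frac{\delta\eta_t}{4}\mathbb{E}[\|\nabla f(x_t)\|^2] \le \sum_{t=0}^{T-1}\frac{\alpha_{1t}}{4}\mathbb{E}[\|\nabla f(x_t)\|^2] \le f(x_0) - f(x^*) + \frac{G\,\mathbb{E}[\|\nabla f(x_0) - v_1\|^2]}{2C} + \sum_{t=1}^{T}\frac{G\sigma^2}{C}(1-\beta_t)^2,$$

$$\sum_{t=0}^{T-1}\frac{\delta\eta_t}{32}\mathbb{E}[\|v_{t+1}\|^2] \le \sum_{t=0}^{T-1}\frac{\alpha_{1t}}{32}\mathbb{E}[\|v_{t+1}\|^2] \le f(x_0) - f(x^*) + \frac{G\,\mathbb{E}[\|\nabla f(x_0) - v_1\|^2]}{2C} + \sum_{t=1}^{T}\frac{G\sigma^2}{C}(1-\beta_t)^2.$$

Then the final assertion follows from $\sum_{t=1}^{T}\frac{1}{t+1} = O(\log T)$. This completes the proof.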
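To see how the last display yields a rate of order $\log T/\sqrt{T}$, one can divide it by $\frac{\delta}{4}\sum_{t}\eta_t$ and evaluate both sums numerically. The sketch below does this under assumptions that are not stated in this section: it takes $\eta_t = c_\eta/\sqrt{t+1}$, models the noise term as $B\sum_{t=1}^{T}\frac{1}{t+1}$, and uses placeholder constants `A` and `B` for the problem-dependent quantities. It illustrates the order of the bound only, not the exact statement of Theorem 2.

```python
import math

def rate_bound(T, A=1.0, B=1.0, c_eta=1.0, delta=1.0):
    """Bound on the eta-weighted average of E||grad f(x_t)||^2 over T steps.

    Assumes eta_t = c_eta / sqrt(t + 1); A and B are placeholder constants.
    """
    noise = sum(1.0 / (t + 1) for t in range(1, T + 1))        # grows like log T
    weight = sum(c_eta / math.sqrt(t + 1) for t in range(T))   # at least c_eta * sqrt(T)
    return (A + B * noise) / (0.25 * delta * weight)

def closed_form(T, A=1.0, B=1.0, c_eta=1.0, delta=1.0):
    """Explicit O(log T / sqrt(T)) envelope of rate_bound."""
    return (A + B * math.log(T + 1)) / (0.25 * delta * c_eta * math.sqrt(T))

for T in (10, 100, 10_000):
    print(T, rate_bound(T), closed_form(T))
```

The envelope uses two elementary estimates: $\sum_{t=1}^{T}\frac{1}{t+1} \le \log(T+1)$ by comparison with the integral of $1/x$, and $\sum_{t=0}^{T-1}\frac{1}{\sqrt{t+1}} \ge \sqrt{T}$ because each of the $T$ terms is at least $1/\sqrt{T}$.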



Figure 1: Training and test accuracy of CNNs on CIFAR10 dataset.

(a) 1-layer LSTM. (b) 2-layer LSTM. (c) 3-layer LSTM.

Figure 2: Training and test perplexity of 1,2,3-layer LSTM on Penn Treebank dataset.

Top-1 test accuracy (%) of VGG16, ResNet34, and DenseNet121 on CIFAR10 and CIFAR100.

Remark 2. Theorem 2, with its proof in Appendix D, establishes an $O(\log T/\sqrt{T})$ sub-linear convergence rate for DRAG by choosing a decreasing $\eta_t$ and $1 - \beta_t$ of order $O(1/\sqrt{t})$. Similar sub-linear convergence rates are also established by Zou et al. (2019) for Adam and by Guo et al. (2021) for Adam-type optimizers. While Zou et al. (

Top-1 test accuracy (%) of ResNet18 on ImageNet. All results except DRAG are reported by Zhuang et al. (2020) and Chen et al. (2021).

Test perplexity (lower is better) of 1-layer, 2-layer, and 3-layer LSTM on the PTB dataset. All results except DRAG, SGD, and Padam are reported by AdaBelief (Zhuang et al., 2020).

For practitioners, any task that can use SGD can use DRAG to achieve faster convergence and comparable generalization performance with negligible extra computational overhead. The user only needs to set the learning rate to the same value as in SGD or slightly larger. For a new task, if one values good generalization performance, one can use DRAG instead of SGD and enjoy easier hyperparameter tuning. DRAG is also more robust than SGD when a large learning rate is used. For instance, when training VGG16 on the CIFAR10 dataset with the learning rate set to 0.5, DRAG still attains over 90 percent test accuracy, whereas SGD diverges and training fails.

(8) where $x^*$ is one of the global minimizers of $F$. Since $\alpha_{1t} \le \eta G \le \frac{1}{2L}$, we have $\alpha_{1t}(1 - L\alpha_{1t})$

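This completes the proof.

Proof. From (5) in Lemma 4, we have

$$f(x_{t+1}) \le f(x_t) + \frac{\alpha_{1t}}{2}\|\nabla f(x_t) - v_{t+1}\|^2 - \frac{\alpha_{1t}(1-L\alpha_{1t})}{2}\|v_{t+1}\|^2 - \frac{\alpha_{1t}}{2}\|\nabla f(x_t)\|^2 + \alpha_{2t}\langle\nabla f(x_t), d_t\rangle + \frac{L\alpha_{2t}^2}{2}\|d_t\|^2 - L\alpha_{1t}\alpha_{2t}\langle v_{t+1}, d_t\rangle.$$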

Test accuracy of algorithms solving the trust-region subproblem with one, two, and three search directions on CIFAR10.

Test accuracy of VGG16 on CIFAR-10 with three different learning rate schedules.

