DIMENSION-REDUCED ADAPTIVE GRADIENT METHOD

Abstract

Adaptive gradient methods, such as Adam, converge faster than SGD across various kinds of network models. However, adaptive algorithms often generalize worse than SGD. Although much effort has been invested in combining Adam and SGD to resolve this issue, adaptive methods still fail to attain generalization as good as SGD's. In this work, we propose a Dimension-Reduced Adaptive Gradient Method (DRAG) to eliminate the generalization gap. DRAG combines SGD and Adam elegantly by adopting a trust-region-like framework. We observe that 1) Adam adjusts the stepsize of each gradient coordinate, and thus effectively decomposes the n-dimensional gradient into n independent search directions; 2) SGD scales the gradient uniformly over all coordinates and thus has only one descent direction along which to minimize. Accordingly, DRAG reduces the high degree of freedom of Adam and also improves the flexibility of SGD by optimizing the loss along k (≪ n) descent directions, e.g. the gradient direction and the momentum direction used in this work. At each iteration, DRAG finds the best stepsizes for the k descent directions by solving a trust-region subproblem whose computational overhead is negligible since the subproblem is low-dimensional, e.g. k = 2 in this work. DRAG is compatible with the common deep learning training pipeline, introduces no extra hyper-parameters, and incurs negligible extra computation. Moreover, we prove the convergence of DRAG on the non-convex stochastic problems that commonly arise in deep learning. Experimental results on representative benchmarks testify to the fast convergence and superior generalization of DRAG.

1. INTRODUCTION

SGD (Robbins & Monro, 1951) and its variant with momentum (Sutskever et al., 2013) are widely used in training deep neural networks. They perform well empirically and enjoy theoretical guarantees (Szegedy et al., 2015; He et al., 2016; Lee et al., 2016; Hardt et al., 2016). However, SGD suffers from two issues. It often converges slowly since it adopts a single learning rate for all gradient coordinates. Moreover, this single learning rate is hard to tune (Wilson et al., 2017), since not all gradient coordinates share the same optimization properties. To resolve this problem, several adaptive gradient methods have been proposed that adopt a different learning rate for each gradient coordinate. Typical examples of such methods include Adagrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), and Adam (Kingma & Ba, 2014). Empirically, these methods converge faster and ease the burden of carefully tuning the learning rate of SGD across many kinds of networks. However, their generalization performance is often worse than that of SGD in many scenarios (Wilson et al., 2017; Zhou et al., 2020).

Several algorithms have been proposed to combine the fast convergence of adaptive gradient methods with the good generalization of SGD. Instances of this type of algorithm include SWATS (Keskar & Socher, 2017), which automatically switches from Adam to SGD; ND-Adam (Zhang et al., 2017), which utilizes a vector learning rate and normalization to control direction and stepsize; and AMSGrad (Reddi et al., 2018), which maintains a monotonically increasing second moment. Unfortunately, these methods only slightly narrow the generalization gap between SGD and Adam, and do not attain generalization as good as SGD's, let alone state-of-the-art test performance. Accordingly, these algorithms are rarely used to train deep networks in practice. To combine the merits of Adam and SGD, i.e.
fast convergence from Adam and excellent generalization from SGD, we propose a Dimension-Reduced Adaptive Gradient Method (DRAG for short) that minimizes the loss along several descent directions, trading off the whole-space search of Adam against the minimization along a single gradient direction of SGD. For Adam, adjusting the stepsize of each gradient coordinate effectively transforms the n-dimensional gradient into n independent directions to optimize, where each direction inherits one coordinate of the gradient and sets the remaining coordinates to zero. In contrast, SGD uses a single learning rate for all gradient coordinates and minimizes the loss along one descent direction. Although the per-coordinate adaptive learning rate yields faster convergence than a single learning rate for all coordinates, as shown in many works (Wilson et al., 2017; Zhou et al., 2018), it also leads to the inferior generalization of Adam, since minimizing along n independent directions amounts to searching the whole parameter space and can result in overfitting. It is therefore natural to trade off the number of descent directions. To this end, motivated by DRSOM (Zhang et al., 2022), we update the parameters along the gradient direction and the momentum direction through a trust-region-like approach, which greatly reduces the high adaptivity of Adam while adding flexibility to SGD. At each iteration, DRAG searches for the optimal update along the gradient and the momentum, both widely used in accelerated algorithms (Polyak, 1964; Nesterov, 2003), by solving a two-dimensional trust-region subproblem to find the best stepsizes for these two directions. For the trust-region subproblem, we use a quadratic approximation to the loss whose Hessian is estimated by the second moment in Adam; since this estimate is a diagonal matrix, it greatly reduces the computational cost. Moreover, we heuristically design a simple and effective trust-region radius for the subproblem.
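To make the mechanism concrete, the following sketch illustrates one DRAG-style step. This is our own illustration under stated assumptions, not the authors' implementation: all function and variable names are ours, the trust region is enforced by simple scaling rather than an exact subproblem solver, and `v_hat` stands in for Adam's second-moment estimate.

```python
import numpy as np

def drag_step(x, grad, momentum, v_hat, radius=1.0, eps=1e-8):
    """One DRAG-style step (illustrative sketch): choose stepsizes for the
    gradient and momentum directions by minimizing a quadratic model over a
    2-D subspace, with the Hessian approximated by the diagonal matrix
    diag(v_hat) as suggested by Adam's second moment."""
    D = np.stack([-grad, momentum], axis=1)   # n x 2 matrix of search directions
    H = v_hat + eps                           # diagonal Hessian estimate
    c = D.T @ grad                            # linear term of the model: d_i^T g
    Q = D.T @ (H[:, None] * D)                # 2x2 quadratic term: d_i^T H d_j
    # Unconstrained minimizer of the model c^T a + 0.5 a^T Q a.
    alpha = np.linalg.solve(Q + eps * np.eye(2), -c)
    # Enforce the trust region ||alpha|| <= radius by scaling; an exact
    # solver would instead solve the secular equation of the subproblem.
    norm = np.linalg.norm(alpha)
    if norm > radius:
        alpha *= radius / norm
    return x + D @ alpha
```

Because the subproblem lives in only two dimensions, the 2×2 linear solve adds negligible cost on top of the usual per-iteration work of an adaptive method.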
Beyond the delicate design of our algorithm, we also theoretically prove that on non-convex problems, DRAG converges and enjoys a stochastic gradient complexity of O(ϵ⁻⁴) to find an ϵ-approximate first-order stationary point. To summarize, our contributions are as follows:

• We propose the DRAG algorithm, which optimizes the loss along several descent directions to balance the whole-space search of Adam and the optimization along a single gradient direction of SGD. Moreover, we formulate the search for the optimal stepsizes along these descent directions as a low-dimensional trust-region problem whose computational cost is negligible compared with the vanilla cost of adaptive gradient algorithms.

• We theoretically prove that to find an ϵ-approximate stationary point on non-convex stochastic problems, DRAG has a stochastic gradient complexity of O(ϵ⁻⁴), which matches the lower bound Ω(ϵ⁻⁴) of (Arjevani et al., 2022) under the same non-convex optimization setting.

• Experimental results show that on several representative benchmarks, DRAG achieves faster convergence than SGD and also state-of-the-art generalization performance.

et al., 2021), which introduces a partially adaptive parameter to control the adaptivity of stepsizes. Our solution to improve the generalization performance is to confine the update of the parameters to a two-dimensional subspace of the parameter space. Specifically, we solve a trust-region subproblem to determine the optimal stepsizes along the gradient direction and momentum direction at each iteration.
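One natural way to write the two-dimensional trust-region subproblem described above is the following (this is our assumed formulation from the surrounding text; the paper's exact model may differ). With gradient g, search matrix D = [−g, m] ∈ R^{n×2} collecting the negative gradient and the momentum m, diagonal Hessian estimate H from Adam's second moment, and trust-region radius Δ, the stepsizes α are chosen by

```latex
\min_{\alpha \in \mathbb{R}^2} \; g^\top D\alpha + \tfrac{1}{2}\, \alpha^\top \big(D^\top H D\big)\, \alpha
\quad \text{s.t.} \quad \|\alpha\| \le \Delta,
```

and the update is x_t = x_{t-1} + Dα*. Since D^T H D is only 2×2, solving this subproblem costs essentially nothing compared with computing the gradient itself.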

2. RELATED WORK

The idea of utilizing the gradient and momentum directions to update the variable traces back to Polyak's heavy-ball method (Polyak, 1964):

x_t = x_{t-1} − α₁ ∇f(x_{t-1}) + α₂ d_{t-1}
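As a minimal sketch of this update rule (our own illustration; the function name and default stepsizes are assumptions, and d_{t-1} is taken as the previous displacement, the standard choice for heavy ball):

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha1=0.1, alpha2=0.5, steps=100):
    """Polyak's heavy-ball iteration: x_t = x_{t-1} - a1 * grad + a2 * d_{t-1},
    where d_{t-1} = x_{t-1} - x_{t-2} is the previous displacement."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(steps):
        d = x - x_prev                                   # momentum direction
        x_prev, x = x, x - alpha1 * grad_f(x) + alpha2 * d
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is simply x.
x_star = heavy_ball(lambda x: x, np.ones(3))
```

The momentum term α₂ d_{t-1} lets the iterate retain velocity across steps, which is exactly the second search direction DRAG exploits alongside the gradient.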

