DIMENSION-REDUCED ADAPTIVE GRADIENT METHOD

Abstract

Adaptive gradient methods, such as Adam, converge faster than SGD across various kinds of network models. However, adaptive algorithms often generalize worse than SGD. Although much effort has been invested in combining Adam and SGD to address this issue, adaptive methods still fail to attain generalization as good as SGD's. In this work, we propose a Dimension-Reduced Adaptive Gradient Method (DRAG) to eliminate the generalization gap. DRAG combines SGD and Adam elegantly by adopting a trust-region-like framework. We observe that 1) Adam adjusts the stepsize of each gradient coordinate individually, and thus effectively decomposes the n-dimensional gradient into n independent search directions; 2) SGD scales all gradient coordinates uniformly, and thus has only one descent direction along which to minimize. Accordingly, DRAG reduces the high degree of freedom of Adam while improving the flexibility of SGD by optimizing the loss along k (≪ n) descent directions, e.g. the gradient direction and the momentum direction used in this work. At each iteration, DRAG then finds the best stepsizes for the k descent directions by solving a trust-region subproblem whose computational overhead is negligible, since the subproblem is low-dimensional, e.g. k = 2 in this work. DRAG is compatible with the common deep learning training pipeline, introduces no extra hyper-parameters, and incurs negligible extra computation. Moreover, we prove the convergence of DRAG for the non-convex stochastic problems that often arise in deep learning training. Experimental results on representative benchmarks demonstrate the fast convergence and superior generalization of DRAG.

1. INTRODUCTION

SGD (Robbins & Monro, 1951) and its momentum variant (Sutskever et al., 2013) are widely used for training deep neural networks. They perform well empirically and enjoy theoretical guarantees (Szegedy et al., 2015; He et al., 2016; Lee et al., 2016; Hardt et al., 2016). However, SGD suffers from two issues. First, it often converges slowly, since it adopts a single learning rate for all gradient coordinates. Second, this single learning rate is hard to tune (Wilson et al., 2017), since not all gradient coordinates share the same optimization properties. To resolve these problems, several adaptive gradient methods have been proposed that adopt a different learning rate for each gradient coordinate. Typical examples include Adagrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), and Adam (Kingma & Ba, 2014). Empirically, these methods converge faster and ease the burden of carefully tuning the learning rate of SGD across many kinds of networks. However, their generalization performance is often worse than that of SGD (Wilson et al., 2017; Zhou et al., 2020).

Several algorithms have been proposed to combine the fast convergence of adaptive gradient methods with the good generalization of SGD. Instances include SWATS (Keskar & Socher, 2017), which automatically switches from Adam to SGD; ND-Adam (Zhang et al., 2017), which uses a vector learning rate and normalization to control direction and stepsize; and AMSGrad (Reddi et al., 2018), which maintains a monotonically increasing second moment. Unfortunately, these methods only slightly narrow the generalization gap between SGD and Adam; they do not attain generalization performance as good as SGD's, let alone state-of-the-art performance on the test set. Accordingly, these algorithms are rarely used to train deep networks in practice.

