ADADQH OPTIMIZER: EVOLVING FROM STOCHASTIC TO ADAPTIVE BY AUTO SWITCH OF PRECONDITION MATRIX

Anonymous authors

Abstract

Adaptive optimizers (e.g., Adam) have achieved tremendous success in deep learning. The key component of such an optimizer is the precondition matrix, which provides additional gradient information and adjusts the step size in each gradient direction. Intuitively, the closer the precondition matrix approximates the Hessian, the faster convergence and better generalization the optimizer can achieve in terms of iterations. However, this performance improvement is usually accompanied by a huge increase in the amount of computation. In this paper, we propose a new optimizer called AdaDQH to achieve better generalization with acceptable computational overhead. The intuitions are the trade-off of the precondition matrix between computation time and approximation of the Hessian, and the auto switch of the precondition matrix from Stochastic Gradient Descent (SGD) to the adaptive optimizer. We evaluate AdaDQH on public datasets of Computer Vision (CV), Natural Language Processing (NLP) and Recommendation Systems (RecSys). The experimental results reveal that, compared to the State-Of-The-Art (SOTA) optimizers, AdaDQH can achieve significantly better or highly competitive performance. Furthermore, we analyze how AdaDQH is able to auto switch from stochastic to adaptive and the actual effects in different scenarios. The code is available in the supplemental material.

1. INTRODUCTION

Consider the following empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^n} f(w) := \frac{1}{M} \sum_{k=1}^{M} \ell(w; x_k), \qquad (1)$$

where $w \in \mathbb{R}^n$ is the vector of parameters to be optimized, $\{x_1, \ldots, x_M\}$ is a training set, and $\ell(w; x)$ is a loss function measuring the performance of the parameters $w$ on the example $x$. Since it is inefficient to calculate the exact gradient in each optimization iteration when $M$ is large, we usually adopt a mini-batched stochastic gradient $g(w) = \frac{1}{|B|} \sum_{k \in B} \nabla \ell(w; x_k)$, where $B \subset \{1, \ldots, M\}$ is a sample set of size $|B| \ll M$. Obviously, we have $\mathbb{E}_{p(x)}[g(w)] = \nabla f(w)$, where $p(x)$ is the distribution of the training data. Equation 1 is usually solved iteratively. Assume $w_t$ is already known and let $\Delta w = w_{t+1} - w_t$; then

$$\begin{aligned} \arg\min_{w_{t+1} \in \mathbb{R}^n} f(w_{t+1}) &= \arg\min_{\Delta w \in \mathbb{R}^n} f(\Delta w + w_t) \\ &\approx \arg\min_{\Delta w \in \mathbb{R}^n} f(w_t) + (\Delta w)^T \nabla f(w_t) + \frac{1}{2} (\Delta w)^T \nabla^2 f(w_t) \Delta w \\ &\approx \arg\min_{\Delta w \in \mathbb{R}^n} \underbrace{f(w_t) + (\Delta w)^T \nabla f(w_t) + \frac{1}{2} (\Delta w)^T B_t \Delta w}_{h(\Delta w)}, \end{aligned} \qquad (2)$$

where the first approximation comes from the Taylor expansion. Solving Equation 2 and using $m_t$ in place of $\nabla f(w_t)$ yields the general update formula

$$w_{t+1} = w_t - \alpha_t B_t^{-1} m_t, \quad t \in \{1, 2, \ldots, T\}, \qquad (3)$$

where $\alpha_t$ is the step size for avoiding divergence, $m_t \approx \mathbb{E}_{p(x)}[g_t]$ is the first moment term, i.e., a weighted average of the gradients $g_t$, and $B_t$ is the so-called precondition matrix, which incorporates additional information and adjusts the update velocity of the variable $w_t$ in each direction. Most gradient descent algorithms can be summarized by Equation 3, such as SGD (Robbins & Monro, 1951), MOMENTUM (Polyak, 1964), ADAGRAD (Duchi et al., 2011), ADADELTA (Zeiler, 2012), ADAM (Kingma & Ba, 2015), AMSGRAD (Reddi et al., 2018), ADABELIEF (Zhuang et al., 2020) and ADAHESSIAN (Yao et al., 2020). Intuitively, the closer $B_t$ approximates the Hessian, the closer $h(\Delta w)$ approximates $f(w_{t+1})$, so we can achieve a more accurate solution in terms of iterations. However, this is usually untrue in terms of runtime.
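The update formula in Equation 3 can be made concrete with a minimal sketch. The snippet below (illustrative values only; not the paper's AdaDQH update) shows how SGD and an Adam-style method are both instances of $w_{t+1} = w_t - \alpha_t B_t^{-1} m_t$ with different diagonal choices of $B_t$:

```python
import numpy as np

def preconditioned_step(w, m_t, b_diag, alpha):
    """One step of w_{t+1} = w_t - alpha * B_t^{-1} m_t for a diagonal B_t,
    stored as the vector b_diag of its diagonal entries."""
    return w - alpha * m_t / b_diag

w = np.array([1.0, -2.0])            # current parameters
g = np.array([0.5, 0.5])             # first moment estimate m_t (here: raw gradient)
v = np.array([0.25, 0.04])           # running second moment (illustrative values)
eps = 1e-8

# SGD: B_t = I, so every coordinate moves with the same scaling.
w_sgd = preconditioned_step(w, g, np.ones_like(w), alpha=0.1)

# Adam-style: B_t = diag(sqrt(v_t) + eps), per-coordinate adaptive scaling.
w_adam = preconditioned_step(w, g, np.sqrt(v) + eps, alpha=0.1)
```

Note how the Adam-style step moves the second coordinate much further than SGD does, because its small second-moment estimate shrinks the corresponding diagonal entry of $B_t$.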
For instance, ADAHESSIAN, which approximates the diagonal of the Hessian, consumes 2.91× more computation time than ADAM for ResNet32 on Cifar10 (Yao et al., 2020). Therefore, the key factor in designing the precondition matrix is how to trade off the approximation degree of the Hessian against the computational complexity. In this paper, we propose AdaDQH (Adaptive optimizer with Diagonal Quasi-Hessian), whose precondition matrix is closely related to the Hessian but computationally efficient. Furthermore, AdaDQH can auto switch the precondition matrix from SGD to the adaptive optimizer through the hyperparameter threshold δ. Our contributions can be summarized as follows.

• We propose AdaDQH, which originates from a new design of the precondition matrix. We establish theoretically proven convergence guarantees in both convex and non-convex stochastic settings.

• We evaluate AdaDQH on public datasets of CV, NLP and RecSys. The experimental results reveal that AdaDQH can outperform or be on a par with the SOTA optimizers.

• We analyze how AdaDQH is able to auto switch from stochastic to adaptive, and assess the effect of the hyperparameter δ, which controls the auto-switch process, in different scenarios.
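The exact AdaDQH update is defined later in the paper; as a rough, hypothetical illustration of what threshold-based switching can look like, the sketch below floors the per-coordinate preconditioner at δ (the `max(·, δ)` rule and all names here are our assumption, not the paper's formula):

```python
import numpy as np

def auto_switch_precondition(v_hat, delta):
    """Hypothetical thresholding: coordinates whose second-moment estimate is
    below delta are floored, so their update degenerates to an SGD-like
    constant 1/delta scaling; coordinates above delta keep their adaptive,
    per-coordinate scaling."""
    return np.maximum(np.sqrt(v_hat), delta)

v_hat = np.array([1e-12, 4e-2, 9.0])   # tiny, moderate, large curvature estimates
B = auto_switch_precondition(v_hat, delta=0.1)
# First coordinate is floored at delta (SGD-like); the other two stay adaptive.
```

With δ large, every coordinate is floored and the method behaves like SGD with learning rate α/δ; with δ → 0, it is fully adaptive. This is consistent with the role the paper assigns to δ as the knob controlling the stochastic-to-adaptive switch.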

RELATED WORK

By choosing different $B_t$ and $m_t$ in Equation 3, different optimizers are obtained, ranging from the standard second order optimizer, i.e., the Gauss-Newton method, to the standard first order optimizer, i.e., SGD, where $m_t$ is usually designed for noise reduction and $B_t$ for solving ill-conditioned problems; see Table 1. Kunstner et al. (2019) show that the Fisher information matrix can be a reasonable approximation of the Hessian whereas the empirical Fisher cannot. Furthermore, they propose the concept of variance adaptation to explain the practical success of empirical Fisher preconditioning. Hybrid optimization methods that switch from an adaptive optimizer to SGD have been proposed for improving generalization performance, such as ADABOUND (Luo et al., 2019) and SWATS (Keskar & Socher, 2017).

NOTATION

We use lowercase letters to denote scalars, boldface lowercase letters to denote vectors, and uppercase letters to denote matrices. We denote a sequence of vectors by subscripts, that is, $x_1, \ldots, x_t$ where $t \in [T] := \{1, 2, \ldots, T\}$, and entries of each vector by an additional subscript, e.g., $x_{t,i}$. For any vectors $x, y \in \mathbb{R}^n$, we write $x^T y$ or $x \cdot y$ for the standard inner product, $xy$ for element-wise multiplication, $x/y$ for element-wise division, $\sqrt{x}$ for element-wise square root, $x^2$ for element-wise square, and $\max(x, y)$ for element-wise maximum. For the standard Euclidean norm, $\|x\| = \|x\|_2 = \sqrt{\langle x, x \rangle}$. We also use $\|x\|_\infty = \max_i |x^{(i)}|$ to denote the $\ell_\infty$-norm, where $x^{(i)}$ is the $i$-th element of $x$. Let $e_i$ denote the unit vector whose $i$-th element is one and $\nabla_i f$ denote the $i$-th element of $\nabla f$.
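These element-wise conventions map directly onto array operations; as a quick sanity check (illustrative values only):

```python
import numpy as np

x = np.array([4.0, 9.0])
y = np.array([2.0, 3.0])

inner = x @ y                    # x^T y, standard inner product
hadamard = x * y                 # xy, element-wise multiplication
quotient = x / y                 # x/y, element-wise division
root = np.sqrt(x)                # sqrt(x), element-wise square root
norm = np.linalg.norm(x)         # ||x|| = ||x||_2, Euclidean norm
inf_norm = np.max(np.abs(x))     # ||x||_inf = max_i |x^(i)|
```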



Luo et al. (2019) adopt clipping on the learning rate of ADAM, whose upper and lower bounds are a non-increasing and a non-decreasing function of the iteration, respectively, both converging to the learning rate of SGD. A similar clipping method is mentioned in Keskar & Socher (2017), whose upper and lower bounds are constants.
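This clipping scheme can be sketched as follows. The bound schedules below (the `gamma` decay rate and `final_lr` names are our assumptions for illustration, not the exact functions of Luo et al. (2019)) shrink toward a constant SGD learning rate as $t$ grows:

```python
import numpy as np

def bounded_lr(alpha, v_hat, t, final_lr=0.1, gamma=1e-3, eps=1e-8):
    """AdaBound-style clipping (sketch): Adam's per-coordinate learning rate
    alpha / sqrt(v_hat) is clipped into [lower(t), upper(t)], where lower is
    non-decreasing, upper is non-increasing, and both converge to final_lr
    (the SGD learning rate) as t grows."""
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    return np.clip(alpha / (np.sqrt(v_hat) + eps), lower, upper)

# Early in training the bounds are loose and the adaptive rate passes through;
# late in training both bounds squeeze the rate toward final_lr.
lr_early = bounded_lr(1e-3, np.array([1e-8]), t=1)
lr_late = bounded_lr(1e-3, np.array([1e-8]), t=10**7)
```

With constant bounds instead of these schedules, the same function recovers the variant attributed to Keskar & Socher (2017) above.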

