ADADQH OPTIMIZER: EVOLVING FROM STOCHASTIC TO ADAPTIVE BY AUTO SWITCH OF PRECONDITION MATRIX

Anonymous

Abstract

Adaptive optimizers (e.g., Adam) have achieved tremendous success in deep learning. The key component of these optimizers is the precondition matrix, which provides enhanced gradient information and adjusts the step size along each gradient direction. Intuitively, the closer the precondition matrix approximates the Hessian, the faster convergence and better generalization the optimizer can achieve in terms of iterations. However, this performance improvement usually comes with a huge increase in computation. In this paper, we propose a new optimizer, AdaDQH, which achieves better generalization with acceptable computational overhead. Its design rests on two intuitions: the trade-off between the computation time of the precondition matrix and how well it approximates the Hessian, and the auto switch of the precondition matrix from that of Stochastic Gradient Descent (SGD) to that of an adaptive optimizer. We evaluate AdaDQH on public datasets in Computer Vision (CV), Natural Language Processing (NLP) and Recommendation Systems (RecSys). The experimental results reveal that, compared to State-Of-The-Art (SOTA) optimizers, AdaDQH achieves significantly better or highly competitive performance. Furthermore, we analyze how AdaDQH auto switches from stochastic to adaptive, and its actual effects in different scenarios. The code is available in the supplemental material.

1. INTRODUCTION

Consider the following empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^n} f(w) := \frac{1}{M} \sum_{k=1}^{M} \ell(w; x_k), \tag{1}$$

where $w \in \mathbb{R}^n$ is a vector of parameters to be optimized, $\{x_1, \ldots, x_M\}$ is a training set, and $\ell(w; x)$ is a loss function measuring the performance of the parameter $w$ on the example $x$. Since it is inefficient to calculate the exact gradient in each optimization iteration when $M$ is large, we usually adopt a mini-batched stochastic gradient $g(w) = \frac{1}{|B|} \sum_{k \in B} \nabla \ell(w; x_k)$, where $B \subset \{1, \ldots, M\}$ is a sample set of size $|B| \ll M$. Obviously, we have $\mathbb{E}_{p(x)}[g(w)] = \nabla f(w)$, where $p(x)$ is the distribution of the training data. Equation 1 is usually solved iteratively. Assume $w_t$ is already known and let $\Delta w = w_{t+1} - w_t$; then

$$\arg\min_{w_{t+1} \in \mathbb{R}^n} f(w_{t+1}) = \arg\min_{\Delta w \in \mathbb{R}^n} f(\Delta w + w_t) \approx \arg\min_{\Delta w \in \mathbb{R}^n} f(w_t) + (\Delta w)^T \nabla f(w_t) + \frac{1}{2} (\Delta w)^T \nabla^2 f(w_t) \Delta w.$$
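The two ingredients above, a mini-batch stochastic gradient and a second-order (Hessian-preconditioned) update obtained by minimizing the quadratic model, can be sketched numerically. This is a minimal illustration under an assumed toy per-example loss $\ell(w; x) = \frac{1}{2}(w^T x)^2$, chosen only because its gradient and Hessian are available in closed form; it is not the paper's experimental setup.

```python
import numpy as np

def minibatch_gradient(w, X, batch_idx):
    # g(w) = (1/|B|) * sum_{k in B} grad l(w; x_k);
    # for l(w; x) = 0.5 * (w @ x)**2 the gradient is (w @ x) * x.
    B = X[batch_idx]
    return (B * (B @ w)[:, None]).mean(axis=0)

def newton_step(w, X):
    # Minimizing the quadratic model f(w_t) + dw^T grad + 0.5 dw^T H dw
    # gives dw = -H^{-1} grad; here H = (1/M) sum_k x_k x_k^T exactly.
    g = minibatch_gradient(w, X, np.arange(len(X)))
    H = X.T @ X / len(X)
    return w - np.linalg.solve(H, g)

rng = np.random.default_rng(0)
M, n = 1000, 5
X = rng.normal(size=(M, n))   # training set {x_1, ..., x_M}
w = rng.normal(size=n)

# Mini-batch gradient with |B| << M approximates the full gradient in expectation.
batch = rng.choice(M, size=32, replace=False)
g_batch = minibatch_gradient(w, X, batch)

# For this quadratic loss, a single Newton step lands on the minimizer w = 0.
w_next = newton_step(w, X)
```

Replacing the exact Hessian $H$ with a cheaper approximation is precisely the trade-off between computation time and Hessian fidelity that motivates the choice of precondition matrix.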

