ADADQH OPTIMIZER: EVOLVING FROM STOCHASTIC TO ADAPTIVE BY AUTO SWITCH OF PRECONDITION MATRIX

Anonymous

Abstract

Adaptive optimizers (e.g., Adam) have achieved tremendous success in deep learning. The key component of the optimizer is the precondition matrix, which provides more gradient information and adjusts the step size in each gradient direction. Intuitively, the closer the precondition matrix approximates the Hessian, the faster convergence and better generalization the optimizer can achieve in terms of iterations. However, this performance improvement usually comes with a huge increase in computation. In this paper, we propose a new optimizer called AdaDQH to achieve better generalization with acceptable computational overhead. The intuitions are a precondition matrix that trades off computation time against approximation of the Hessian, and an auto switch of the precondition matrix from Stochastic Gradient Descent (SGD) to the adaptive optimizer. We evaluate AdaDQH on public datasets from Computer Vision (CV), Natural Language Processing (NLP) and Recommendation Systems (RecSys). The experimental results reveal that, compared to State-Of-The-Art (SOTA) optimizers, AdaDQH achieves significantly better or highly competitive performance. Furthermore, we analyze how AdaDQH is able to auto switch from stochastic to adaptive and the actual effects in different scenarios. The code is available in the supplemental material.

1. INTRODUCTION

Consider the following empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^n} f(w) := \frac{1}{M} \sum_{k=1}^{M} \ell(w; x_k), \tag{1}$$

where $w \in \mathbb{R}^n$ is a vector of parameters to be optimized, $\{x_1, \ldots, x_M\}$ is a training set, and $\ell(w; x)$ is a loss function measuring the performance of the parameter $w$ on the example $x$. Since it is inefficient to calculate the exact gradient in each optimization iteration when $M$ is large, we usually adopt a mini-batched stochastic gradient $g(w) = \frac{1}{|B|} \sum_{k \in B} \nabla \ell(w; x_k)$, where $B \subset \{1, \ldots, M\}$ is a sample set of size $|B| \ll M$. Obviously, we have $\mathbb{E}_{p(x)}[g(w)] = \nabla f(w)$, where $p(x)$ is the distribution of the training data. Equation 1 is usually solved iteratively. Assume $w_t$ is already known and let $\Delta w = w_{t+1} - w_t$; then

$$\arg\min_{w_{t+1} \in \mathbb{R}^n} f(w_{t+1}) = \arg\min_{\Delta w \in \mathbb{R}^n} f(\Delta w + w_t) \approx \arg\min_{\Delta w \in \mathbb{R}^n} f(w_t) + (\Delta w)^T \nabla f(w_t) + \frac{1}{2}(\Delta w)^T \nabla^2 f(w_t) \Delta w \approx \arg\min_{\Delta w \in \mathbb{R}^n} \underbrace{f(w_t) + (\Delta w)^T \nabla f(w_t) + \frac{1}{2}(\Delta w)^T B_t \Delta w}_{h(\Delta w)}, \tag{2}$$

where the first approximation follows from the Taylor expansion. By solving Equation 2 and using $m_t$ in place of $\nabla f(w_t)$, the general update formula is

$$w_{t+1} = w_t - \alpha_t B_t^{-1} m_t, \quad t \in \{1, 2, \ldots, T\}, \tag{3}$$

where $\alpha_t$ is the step size for avoiding divergence, $m_t \approx \mathbb{E}_{p(x)}[g_t]$ is the first moment term, a weighted average of the gradients $g_t$, and $B_t$ is the so-called precondition matrix, which incorporates additional information and adjusts the update velocity of the variable $w_t$ in each direction. Most gradient descent algorithms can be summarized by Equation 3, such as SGD (Robbins & Monro, 1951), MOMENTUM (Polyak, 1964), ADAGRAD (Duchi et al., 2011), ADADELTA (Zeiler, 2012), ADAM (Kingma & Ba, 2015), AMSGRAD (Reddi et al., 2018), ADABELIEF (Zhuang et al., 2020) and ADAHESSIAN (Yao et al., 2020). Intuitively, the closer $B_t$ approximates the Hessian, the closer $h(\Delta w)$ approximates $f(w_{t+1})$. Consequently, we can achieve a more accurate solution in terms of iterations. However, this is usually untrue in terms of runtime.
For instance, ADAHESSIAN, which approximates the diagonal Hessian, consumes 2.91× more computation time than ADAM for ResNet32 on Cifar10 (Yao et al., 2020). Therefore, the key to designing the precondition matrix is how to trade off the degree of Hessian approximation against computational complexity. In this paper, we propose AdaDQH (Adaptive optimizer with Diagonal Quasi-Hessian), whose precondition matrix is closely related to the Hessian yet computationally efficient. Furthermore, AdaDQH can auto switch the precondition matrix from SGD to the adaptive optimizer through the hyperparameter threshold δ. Our contributions can be summarized as follows.

• We propose AdaDQH, which originates from a new design of the precondition matrix. We establish theoretically proven convergence guarantees in both convex and non-convex stochastic settings.

• We validate AdaDQH on a total of six public datasets: two from CV (Cifar10 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015)), two from NLP (IWSLT14 (Cettolo et al., 2014) and PTB (Marcus et al., 1993)) and the rest from RecSys (Criteo (Criteo, 2014) and Avazu (Avazu, 2015)). The experimental results reveal that AdaDQH can outperform or be on a par with the SOTA optimizers.

• We analyze how AdaDQH is able to auto switch from stochastic to adaptive, and assess the precise effect of the hyperparameter δ, which controls the auto-switch process, in different scenarios.
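To make the role of $B_t$ concrete, consider the quadratic test function used later in Figure 1, $f(x, y) = (x + y)^2 + (x - y)^2/10$, whose Hessian has condition number 10. The sketch below (our own illustration, not code from the paper) contrasts plain gradient descent ($B_t = I$) with a full Newton step ($B_t$ equal to the Hessian): the better $B_t$ approximates the Hessian, the faster the iterates approach the optimum.

```python
import numpy as np

# f(x, y) = (x + y)**2 + (x - y)**2 / 10 = 0.5 * w^T H w with the constant
# Hessian below, whose eigenvalues are 4 and 0.4 (condition number 10).
H = np.array([[2.2, 1.8],
              [1.8, 2.2]])

def grad(w):
    return H @ w  # gradient of the quadratic

w0 = np.array([2.0, -3.0])

# B_t = H (Newton): a single preconditioned step lands on the optimum.
w_newton = w0 - np.linalg.solve(H, grad(w0))

# B_t = I (plain gradient descent): after 50 steps a visible error remains
# along the low-curvature direction, because the step size must stay small
# enough for the high-curvature direction.
w_gd = w0.copy()
for _ in range(50):
    w_gd = w_gd - 0.1 * grad(w_gd)
```

This is exactly the iteration-count versus per-step-cost trade-off discussed above: the Newton step needs a linear solve, while the SGD step is nearly free.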

RELATED WORK

By choosing different $B_t$ and $m_t$ in Equation 3, different optimizers are obtained, ranging from the standard second-order optimizer, i.e., the Gauss-Newton method, to the standard first-order optimizer, i.e., SGD, where $m_t$ is usually designed for noise reduction and $B_t$ for solving ill-conditioned problems; see Table 1. Kunstner et al. (2019) show that the Fisher information matrix can be a reasonable approximation of the Hessian whereas the empirical Fisher cannot. Furthermore, they propose the concept of variance adaptation to explain the practical success of empirical Fisher preconditioning. Hybrid optimization methods that switch from an adaptive optimizer to SGD have been proposed to improve generalization performance, such as ADABOUND (Luo et al., 2019) and SWATS (Keskar & Socher, 2017). Luo et al. (2019) adopt clipping on the learning rate of ADAM, whose upper and lower bounds are non-increasing and non-decreasing functions, respectively, which converge to the learning rate of SGD. The clipping method is also mentioned in Keskar & Socher (2017), whose upper and lower bounds are constants.

NOTATION

We use lowercase letters to denote scalars, boldface lowercase letters to denote vectors, and uppercase letters to denote matrices. We denote a sequence of vectors by subscripts, i.e., $x_1, \ldots, x_t$ where $t \in [T] := \{1, 2, \ldots, T\}$, and entries of each vector by an additional subscript, e.g., $x_{t,i}$. For any vectors $x, y \in \mathbb{R}^n$, we write $x^T y$ or $x \cdot y$ for the standard inner product, $xy$ for element-wise multiplication, $x/y$ for element-wise division, $\sqrt{x}$ for element-wise square root, $x^2$ for element-wise square, and $\max(x, y)$ for element-wise maximum. For the standard Euclidean norm, $\|x\| = \|x\|_2 = \sqrt{\langle x, x \rangle}$. We also use $\|x\|_\infty = \max_i |x^{(i)}|$ to denote the $\ell_\infty$-norm, where $x^{(i)}$ is the $i$-th element of $x$. Let $e_i$ denote the unit vector whose $i$-th element is one, and $\nabla_i f$ denote the $i$-th element of $\nabla f$. (In Table 1, $H$ is the Hessian, $F$ is the Fisher information matrix, and $F_{\mathrm{emp}}$ is the empirical Fisher information matrix.) Let $f_t(w)$ be the loss function of the model at step $t$, where $w \in \mathbb{R}^n$. We consider $m_t$ as an Exponential Moving Average (EMA) of $g_t$ throughout this paper, i.e.,

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{i-1} g_{t-i+1}, \quad t \ge 1,$$

where $\beta_1 \in [0, 1)$ is the exponential decay rate.
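As a quick check of the bias-correction factor used throughout the paper, the following sketch (ours, not from the paper) shows that for a constant gradient the corrected estimate $m_t/(1-\beta_1^t)$ recovers the gradient exactly at every step, while the raw EMA starts near zero:

```python
import numpy as np

def ema_with_bias_correction(grads, beta1=0.9):
    """Return the bias-corrected EMA m_t / (1 - beta1**t) at every step,
    where m_t = beta1 * m_{t-1} + (1 - beta1) * g_t and m_0 = 0."""
    m = np.zeros_like(grads[0])
    corrected = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        corrected.append(m / (1 - beta1**t))
    return corrected

# For a constant gradient g, the raw EMA equals (1 - beta1**t) * g and only
# slowly approaches g, while the corrected estimate equals g at every step.
grads = [np.array([1.0, 2.0])] * 5
out = ema_with_bias_correction(grads)
```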

2. ALGORITHM

2.1. DETAILS AND INTUITIONS OF THE ADADQH OPTIMIZER

The algorithm is listed in Algorithm 1. The design of AdaDQH comes from two intuitions: Hessian approximation and auto switch, for fast convergence and good generalization across tasks.

Algorithm 1 AdaDQH
1: Input: parameters $\beta_1, \beta_2, \delta$, $w_1 \in \mathbb{R}^n$, step size $\alpha_t$; initialize $m_0 = 0$, $b_0 = 0$
2: for $t = 1$ to $T$ do
3:   $g_t = \nabla f_t(w_t)$
4:   $m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1) g_t$
5:   $s_t = \begin{cases} m_1/(1-\beta_1) & t = 1 \\ m_t/(1-\beta_1^t) - m_{t-1}/(1-\beta_1^{t-1}) & t > 1 \end{cases}$
6:   $b_t \leftarrow \beta_2 b_{t-1} + (1-\beta_2) s_t^2$
7:   $w_{t+1} = w_t - \alpha_t \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \cdot \frac{m_t}{\max\left(\sqrt{b_t},\ \delta\sqrt{1-\beta_2^t}\right)}$
8: end for

HESSIAN APPROXIMATION. Let $\Delta w = -\alpha_t B_t^{-1} m_t$ in Equation 3. Then we have

$$\mathbb{E}[g_{t,i} - g_{t-1,i}] = \nabla_i f(w_t) - \nabla_i f(w_{t-1}) = \nabla \nabla_i f(w_{t-1} + \theta \Delta w) \cdot \Delta w, \ \theta \in (0, 1) \overset{\theta=1}{\approx} \nabla \nabla_i f(w_t) \cdot \Delta w \overset{\Delta w = e_i}{\approx} \nabla_i \nabla_i f(w_t),$$

where the second equality follows from the mean value theorem, and in the second approximation we assume that $w_t$ is not updated except in the $i$-th direction. Therefore, we can see that $\mathbb{E}[g(w_t) - g(w_{t-1})]$ is closely related to $\mathrm{diag}(H(w_t))$. Similar to Kingma & Ba (2015), we use $m_t/(1-\beta_1^t)$ to approximate $\mathbb{E}[g(w_t)]$, where the denominator provides bias correction. Denote

$$s_t = \begin{cases} m_1/(1-\beta_1) & t = 1, \\ m_t/(1-\beta_1^t) - m_{t-1}/(1-\beta_1^{t-1}) & t > 1. \end{cases}$$

We therefore choose the precondition matrix $B_t$ satisfying

$$B_t^2 = \mathrm{diag}\left(\mathrm{EMA}(s_1 s_1^T, s_2 s_2^T, \cdots, s_t s_t^T)\right)/(1-\beta_2^t),$$

where $\beta_2$ is the parameter of the EMA and the denominator is also for bias correction.

AUTO SWITCH. Normally, a small value $\epsilon$ is added to $B_t$ for numerical stability, giving $B_t + \epsilon I$. We instead replace it with $\max(B_t, \delta I)$, where the different notation $\delta$ indicates its essential role in the auto switch. When $\sqrt{\hat{b}_t}$, with $\hat{b}_t := b_t/(1-\beta_2^t)$, is relatively larger than $\delta$, AdaDQH takes a confident step in the adaptive way. Otherwise, the update is the EMA $m_t$ with a constant scale $\alpha_t/(1-\beta_1^t)$, similar to SGD with momentum. Moreover, AdaDQH can auto switch modes in a per-parameter manner as training progresses.
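Algorithm 1 can be sketched in a few lines of NumPy. This is our own illustrative re-implementation under the paper's notation, not the authors' released code; `step` performs lines 3–7 for a single parameter vector.

```python
import numpy as np

class AdaDQH:
    """Sketch of Algorithm 1 (AdaDQH). Hyperparameter names follow the
    paper; this is an illustrative re-implementation, not released code."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, delta=1e-5):
        self.lr, self.beta1, self.beta2, self.delta = lr, beta1, beta2, delta
        self.m = self.b = self.prev_mhat = None
        self.t = 0

    def step(self, w, g):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(w)
            self.b = np.zeros_like(w)
        b1, b2, t = self.beta1, self.beta2, self.t
        self.m = b1 * self.m + (1 - b1) * g
        mhat = self.m / (1 - b1**t)          # bias-corrected first moment
        # s_t approximates the gradient difference E[g_t - g_{t-1}],
        # which is closely related to the diagonal of the Hessian.
        s = mhat if t == 1 else mhat - self.prev_mhat
        self.prev_mhat = mhat
        self.b = b2 * self.b + (1 - b2) * s**2
        # Auto switch: max(sqrt(b_t), delta * sqrt(1 - beta2**t)) replaces
        # the usual additive epsilon; entries where b_t is small fall back
        # to an SGD-like step of constant scale.
        denom = np.maximum(np.sqrt(self.b), self.delta * np.sqrt(1 - b2**t))
        return w - self.lr * np.sqrt(1 - b2**t) / (1 - b1**t) * self.m / denom
```

Note the two regimes of line 7: when $\sqrt{\hat{b}_t} \ge \delta$ the step is the bias-corrected adaptive update $\alpha_t \hat{m}_t/\sqrt{\hat{b}_t}$; when $\sqrt{\hat{b}_t} < \delta$ it reduces to $\alpha_t m_t/(\delta(1-\beta_1^t))$, i.e., momentum SGD with effective learning rate proportional to $\alpha_t/\delta$, so $\alpha$ and $\delta$ are naturally tuned jointly.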
Compared to the additive method, AdaDQH can eliminate the noise caused by $\epsilon$ in adaptive updates. Another major benefit is that AdaDQH can generalize across different tasks by tuning $\delta$, without empirically choosing from candidates of obscure optimizers. The effect of $\delta$ is discussed experimentally in Section 3.5.

2.2. CONVERGENCE ANALYSIS

Using the framework developed in Reddi et al. (2018); Yang et al. (2016); Chen et al. (2019); Zhou et al. (2018), we have the following theorems that provide the convergence in convex and non-convex settings. In particular, we use $\beta_{1,t}$ in place of $\beta_1$, where $\beta_{1,t}$ is non-increasing.

Theorem 1. (Convergence in convex settings) Let $\{w_t\}$ be the sequence obtained by AdaDQH (Algorithm 1), $\alpha_t = \alpha/\sqrt{t}$, $\beta_{1,t} \le \beta_1 \in [0,1)$, $\beta_2 \in [0,1)$, $b_{t,i} \le b_{t+1,i}\ \forall i \in [n]$ and $\|g_t\|_\infty \le G_\infty$, $\forall t \in [T]$. Suppose $f_t(w)$ is convex for all $t \in [T]$, $w^*$ is an optimal solution of $\min_w \sum_{t=1}^{T} f_t(w)$, i.e., $w^* = \arg\min_{w \in \mathbb{R}^n} \sum_{t=1}^{T} f_t(w)$, and there exists a constant $D_\infty$ such that $\max_{t \in [T]} \|w_t - w^*\|_\infty \le D_\infty$. Then we have the following bound on the regret:

$$\sum_{t=1}^{T}\big(f_t(w_t)-f_t(w^*)\big) < \frac{1}{1-\beta_1}\left[\frac{n(2G_\infty+\delta)D_\infty^2}{2\alpha\sqrt{1-\beta_2}(1-\beta_1)^2}\sqrt{T} + \sum_{t=1}^{T}\frac{\beta_{1,t}}{2\alpha_t}nD_\infty^2 + \frac{n\alpha G_\infty^2}{(1-\beta_1)^3}\left(1+\frac{1}{\delta\sqrt{1-\beta_2}}\right)\sqrt{T}\right].$$

The proof of Theorem 1 is given in Appendix A.

Corollary 1. Suppose $\beta_{1,t} = \beta_1/t$; then we have

$$\sum_{t=1}^{T}\big(f_t(w_t)-f_t(w^*)\big) < \frac{1}{1-\beta_1}\left[\frac{n(2G_\infty+\delta)D_\infty^2}{2\alpha\sqrt{1-\beta_2}(1-\beta_1)^2}\sqrt{T} + \frac{nD_\infty^2\beta_1}{\alpha}\sqrt{T} + \frac{n\alpha G_\infty^2}{(1-\beta_1)^3}\left(1+\frac{1}{\delta\sqrt{1-\beta_2}}\right)\sqrt{T}\right].$$

The proof of Corollary 1 is given in Appendix B. Corollary 1 implies that the regret is $O(\sqrt{T})$, i.e., we can achieve the convergence rate $O(1/\sqrt{T})$ in convex settings.

Theorem 2. (Convergence in non-convex settings) Suppose that the following assumptions are satisfied:
1. $f$ is differentiable and lower bounded, i.e., $f(w^*) > -\infty$ where $w^*$ is an optimal solution. $f$ is also $L$-smooth, i.e., $\forall u, v \in \mathbb{R}^n$, we have $f(u) \le f(v) + \langle\nabla f(v), u-v\rangle + \frac{L}{2}\|u-v\|^2$.
2. At step $t$, the algorithm can access a bounded noisy gradient, and the true gradient is also bounded, i.e., $\|g_t\|_\infty \le G_\infty$, $\|\nabla f(w_t)\|_\infty \le G_\infty$, $\forall t \in [T]$. Without loss of generality, we assume $G_\infty \ge \delta$.
3. The noisy gradient is unbiased and the noise is independent, i.e., $g_t = \nabla f(w_t) + \zeta_t$, $\mathbb{E}[\zeta_t] = 0$, and $\zeta_i$ is independent of $\zeta_j$ if $i \ne j$.
4. $\alpha_t = \alpha/\sqrt{t}$, $\beta_{1,t} \le \beta_1 \in [0,1)$, $\beta_2 \in [0,1)$, and $b_{t,i} \le b_{t+1,i}\ \forall i \in [n]$.

Then Algorithm 1 yields

$$\min_{t\in[T]}\mathbb{E}\big[\|\nabla f(w_t)\|^2\big] < \frac{C_1}{\sqrt{T}-\sqrt{2}} + \frac{C_2\log T}{\sqrt{T}-\sqrt{2}} + \frac{C_3\sum_{t=1}^{T}\alpha_t(\beta_{1,t}-\beta_{1,t+1})}{\sqrt{T}-\sqrt{2}},$$

where $C_1$, $C_2$ and $C_3$ are defined as follows:

$$C_1 = \frac{G_\infty}{\alpha(1-\beta_1)^2(1-\beta_2)^2}\big(f(w_1)-f(w^*)\big) + \frac{nG_\infty^2\alpha}{(1-\beta_1)^8\delta^2}(\delta + 8L\alpha) + \frac{\alpha\beta_1 nG_\infty^2}{(1-\beta_1)^3\delta},$$
$$C_2 = \frac{15LnG_\infty^3\alpha}{2(1-\beta_2)^2(1-\beta_1)^{10}\delta^2}, \qquad C_3 = \frac{nG_\infty^3}{\alpha(1-\beta_1)^5(1-\beta_2)^2\delta}.$$

The proof of Theorem 2 is given in Appendix C. Note that we can let $b_{t+1} = \max(b_{t+1}, b_t)$, which is usually called the AMSGrad condition (Reddi et al., 2018), to make sure the assumption $b_{t,i} \le b_{t+1,i}\ \forall i \in [n]$ always holds, though it could degrade performance in practice. A more detailed analysis is given in Appendix E.3. From Theorem 2, we have the following corollaries.

Corollary 2. Suppose $\beta_{1,t} = \beta_1/\sqrt{t}$; then we have

$$\min_{t\in[T]}\mathbb{E}\big[\|\nabla f(w_t)\|^2\big] < \frac{C_4}{\sqrt{T}-\sqrt{2}} + \frac{C_5\log T}{\sqrt{T}-\sqrt{2}},$$

where $C_4$ and $C_5$ are defined as follows:

$$C_4 = \frac{G_\infty}{\alpha(1-\beta_1)^2(1-\beta_2)^2}\big(f(w_1)-f(w^*)\big) + \frac{nG_\infty^2\alpha}{(1-\beta_1)^8\delta^2}(2\delta + 8L\alpha) + \frac{\alpha\beta_1 nG_\infty^2}{(1-\beta_1)^3\delta},$$
$$C_5 = \frac{nG_\infty^3}{(1-\beta_2)^2(1-\beta_1)^{10}\delta^2}\left(\frac{15}{2}L\alpha + \delta\right).$$

The proof of Corollary 2 is given in Appendix D.

Corollary 3. Suppose $\beta_{1,t} = \beta_1$, $\forall t \in [T]$; then we have

$$\min_{t\in[T]}\mathbb{E}\big[\|\nabla f(w_t)\|^2\big] < \frac{C_1}{\sqrt{T}-\sqrt{2}} + \frac{C_2\log T}{\sqrt{T}-\sqrt{2}},$$

where $C_1$ and $C_2$ are the same as in Theorem 2. Corollaries 2 and 3 imply that the convergence (to a stationary point) rate of AdaDQH is $O(\log T/\sqrt{T})$ in non-convex settings.
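To see the shape of this rate concretely, the following sketch (ours, with illustrative constants $C_1 = C_2 = 1$) evaluates the non-convex bound $C_1/(\sqrt{T}-\sqrt{2}) + C_2\log T/(\sqrt{T}-\sqrt{2})$ and confirms it decays roughly like $\log T/\sqrt{T}$:

```python
import math

def bound(T, C1=1.0, C2=1.0):
    """Shape of the non-convex bound with illustrative constants C1 = C2 = 1.
    The real constants depend on n, G_inf, L, alpha, beta1, beta2 and delta."""
    return (C1 + C2 * math.log(T)) / (math.sqrt(T) - math.sqrt(2))

# The bound is monotonically decreasing over these horizons, dominated by
# the log(T)/sqrt(T) term for large T.
vals = [bound(T) for T in (10, 100, 1000, 10000)]
```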

2.3. NUMERICAL ANALYSIS

In this section, we compare AdaDQH against several SOTA optimizers on three test functions. We adopt the parameter settings from Zhuang et al. (2020). The learning rate is set to 1e-3 for all adaptive optimizers, along with the same epsilon/delta (1e-8) and betas ($\beta_1 = 0.9$, $\beta_2 = 0.999$). For SGD, momentum is set to 0.9 and the learning rate is 1e-6 for numerical stability. AdaDQH shows promising results, as it reaches the optimal points in all of the experiments, as shown in Figure 1. Furthermore, we search for the best learning rate for each optimizer on the Beale function. AdaDQH is still the strongest competitor. The details are provided in Appendix E.1.

3. EXPERIMENTS

3.1. EXPERIMENTAL SETUP

We experimentally compare the performance of different optimizers on a wide range of learning tasks, including CV, NLP and RecSys. The details of the tasks are as follows.

CV: We experiment with ResNet20 and ResNet32 on the Cifar10 (Krizhevsky et al., 2009) dataset, and ResNet18 on the ImageNet (Russakovsky et al., 2015) dataset. The details of the datasets are listed in Table 2. We train 160 epochs and decay the learning rate by a factor of 10 at epochs 80 and 120 for Cifar10, and train 90 epochs and decay the learning rate by a factor of 10 every 30 epochs for ImageNet. The batch size is 256 for both datasets.

NLP:

We experiment with Neural Machine Translation (NMT) on IWSLT14 German-to-English (De-En) (Cettolo et al., 2014), and Language Modeling (LM) on Penn TreeBank (Marcus et al., 1993). For the NMT task, the transformer small architecture is adopted. We use the same settings and pre-processing method as in Yao et al. (2020), as well as the same length penalty (1.0), beam size (5) and max tokens (4096). We train 55 epochs and average the last 5 checkpoints for inference. For the LM task, we train 1-, 2- and 3-layer LSTMs with a batch size of 20 for 200 epochs. The details are listed in Table 2. Additionally, we keep settings like the learning rate scheduler and warm-up steps identical for the same task.

RecSys: We experiment on two common datasets, Avazu (Avazu, 2015) and Criteo (Criteo, 2014), which consist of display ads logs for the purpose of predicting the Click-Through Rate (CTR). For Avazu, the samples from the first nine days are used for training, while the rest are for testing. We use the basic Multilayer Perceptron (MLP) structure of most deep CTR models. Specifically, the model maps each categorical feature to a 16-dimensional embedding vector, followed by 4 fully connected layers of dimensions 64, 32, 16 and 1, respectively. For Criteo, we take the early 6/7 of all samples as the train set. We adopt the Deep & Cross Network (DCN) (Wang et al., 2017) with the embedding size set to 8, along with 2 deep layers of size 64 and 2 cross layers. The details are listed in Table 2. We train 1 epoch with a batch size of 512 for both datasets.

The optimizers to be compared include SGD (Robbins & Monro, 1951), Adam (Kingma & Ba, 2015), AdamW (Loshchilov & Hutter, 2019), AdaBelief (Zhuang et al., 2020) and AdaHessian (Yao et al., 2020). The choices of the hyperparameters are given in Appendix E.2. The experiments of CV and NLP are conducted in the PyTorch framework (Paszke et al., 2019), and the experiments of RecSys are conducted with 3 parameter servers and 5 workers in the TensorFlow framework (Abadi et al., 2016).
3.2. CV

We run all the experiments 5 times with random seeds and report the statistical results. Note that the results of SGD, AdaBelief and AdaHessian on ImageNet are lower than the numbers reported in the original papers (Chen et al., 2020; Zhuang et al., 2020; Yao et al., 2020), which were run a single time. More discussions are given in Appendix E.4. We also report the accuracy of AdaDQH for ResNet18 on Cifar10 for comparison with the SOTA results, which is listed in Appendix E.5. In addition, it is worth mentioning that although AdaDQH is considered quasi-Hessian, its runtime cost is comparable with SGD and much lower than AdaHessian, as shown in Table 4.

3.3. NLP

We report the case-insensitive BiLingual Evaluation Understudy (BLEU, higher is better) score and the perplexity (PPL, lower is better) on the test set for the NMT and LM tasks, respectively. The results are shown in Table 5. For the NMT task on IWSLT14, AdaDQH achieves a similar result to AdaBelief, outperforming the other optimizers. For the LM task on PTB, AdaDQH obtains the lowest PPL in all 1-, 2- and 3-layer LSTM experiments, as demonstrated in Figure 3. Furthermore, we report the relative training time in Table 4, which is similar to CV.

3.4. RECSYS

We adopt Area Under the receiver-operator Curve (AUC) as the evaluation criterion which is a good measurement in CTR estimation (Graepel et al., 2010) . Table 6 shows that compared to other optimizers, AdaDQH can achieve significantly better or highly competitive performance on the AUC metric.

3.5. THE EFFECT OF δ

In this section, we analyze the precise effect of δ, i.e., what exact percentage of $\hat{b}_t$ is truncated by δ in Algorithm 1. Figure 4 depicts the distribution of $\hat{b}_t$ during the training process on different tasks in the best configuration we found. The black dot gives the exact percentage of $\hat{b}_t$ that exceeds δ (i.e., takes adaptive updates) in the task. A lower percentage means more SGD-like updates than adaptive steps, which is controlled by the choice of δ.
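The percentage at the black dot can be computed directly from the optimizer state. The sketch below is our own illustration (the `b_hat` values are hypothetical); it counts the per-parameter entries whose denominator in Algorithm 1 is not truncated and whose update is therefore adaptive:

```python
import numpy as np

def adaptive_percentage(b_hat, delta):
    """Fraction of entries where sqrt(b_hat) >= delta, i.e. parameters whose
    denominator max(sqrt(b_hat), delta) is not truncated and whose update is
    therefore adaptive. A lower value means more SGD-like updates.
    b_hat is the bias-corrected second moment b_t / (1 - beta2**t)."""
    return float(np.mean(np.sqrt(b_hat) >= delta))

# Hypothetical state: after the square root, three of the four entries fall
# below delta = 1e-5, so only a quarter of the parameters step adaptively.
b_hat = np.array([1e-12, 4e-12, 9e-12, 1e-8])
pct = adaptive_percentage(b_hat, delta=1e-5)  # 0.25
```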

4. CONCLUSION

In this paper, we propose the AdaDQH optimizer, which can evolve from stochastic to adaptive via an auto switch of the precondition matrix, and which has better generalization compared to the SOTA optimizers. We theoretically prove the convergence rate in both convex and non-convex stochastic settings, and conduct empirical evaluations on real-world datasets from different scenarios. The results clearly demonstrate the advantages of our optimizer in achieving significantly better performance. Finally, we analyze how it is able to auto switch from stochastic to adaptive, and the precise effect of the hyperparameter δ, which controls the auto-switch process.



https://paperswithcode.com/sota/stochastic-optimization-on-cifar-10-resnet-18




Figure 1: Trajectories of different optimizers on three test functions, where $f(x, y) = (x + y)^2 + (x - y)^2/10$. We also provide animated versions in the supplemental material.

Figure 2: Testing accuracy curves of different optimizers for ResNet20/32 on Cifar10 and ResNet18 on ImageNet. The solid line represents the mean of the results and the shaded area represents the standard deviation.

Figure 4: The distribution of $\hat{b}_t$ at different epochs/steps. The colored area denotes the ratio of $\hat{b}_t$ in the corresponding interval. The values of δ for ResNet18 on ImageNet, Transformer on IWSLT14, 2-layer LSTM on PTB, and DCN on Criteo are 1e-5, 1e-14, 1e-5 and 1e-8, respectively.

Figure 4 reveals how the auto switch in AdaDQH works in different tasks. As shown in Figure 4a, AdaDQH behaves more like SGD in the early stage of training (before the first learning rate decay at the 30th epoch) and switches to adaptive updates for fine-tuning, since SGD generally outperforms the adaptive optimizers in CNN tasks. Figure 4b indicates that the parameters taking adaptive updates are dominant, which is expected because adaptive optimizers like AdamW are preferred for transformers. As indicated in Figure 4c, most parameters update stochastically, which explains why AdaDQH has a curve similar to SGD in Figure 3b before the 100th epoch. The ratio grows from 3% to 5% afterwards, resulting in a better PPL in the fine-tuning stage. As for Figure 4d, the model of the RecSys task is trained for only one epoch, and AdaDQH gradually switches to adaptive updates for a better fit to the data.

Table 1: Different optimizers obtained by choosing different $B_t$.

Table 2: Experiments overview.

Table 3: Top-1 accuracy for different optimizers when trained on Cifar10 and ImageNet.

Table 4: Relative training time for AdaDQH (baseline), SGD and AdaHessian. Additionally, minutes of training one epoch with AdaDQH are provided. † is measured on one Nvidia P100 GPU, §/‡ on one/four Nvidia V100 GPUs. Note that * results from the limitations of PyTorch running RNN models with second-order optimizers.

Table 3 reports the top-1 accuracy for different optimizers when trained on Cifar10 and ImageNet. It is remarkable that AdaDQH outperforms the other optimizers on both Cifar10 and ImageNet. The testing accuracy (µ ± σ) curves of different optimizers for ResNet20/32 on Cifar10 and ResNet18 on ImageNet are plotted in Figure 2.

Table 5: Test BLEU score and PPL for the NMT and LM tasks. † is reported in Yao et al. (2020).

Table 6: Test AUC for different optimizers when trained on Avazu and Criteo.

