IMPROVED GRADIENT DESCENT OPTIMIZATION ALGORITHM BASED ON INVERSE MODEL-PARAMETER DIFFERENCE

Anonymous authors
Paper under double-blind review

Abstract

A majority of deep learning models implement first-order optimization algorithms like stochastic gradient descent (SGD) or its adaptive variants for training large neural networks. However, slow convergence due to the complicated geometry of the loss function is one of the major challenges faced by SGD. The currently popular optimization algorithms incorporate an accumulation of past gradients to improve convergence, via either an accelerated gradient scheme (including Momentum, NAG, etc.) or an adaptive learning-rate scheme (including Adam, AdaGrad, etc.). Despite their general popularity, these algorithms often display suboptimal convergence owing to extreme scaling of the learning-rate caused by the accumulation of past gradients. In this paper, a novel approach to gradient descent optimization is proposed which utilizes the difference in the model-parameter values from the preceding iterations to adjust the learning-rate of the algorithm. More specifically, the learning-rate for each model-parameter is adapted inversely proportional to the displacement of the model-parameter from the previous iterations. As the algorithm utilizes the displacement of the model-parameters, the poor convergence caused by the accumulation of past gradients is avoided. A convergence analysis based on the regret-bound approach is performed and the theoretical bounds for a stable convergence are determined. An empirical analysis evaluates the proposed algorithm on the CIFAR-10/100 and ImageNet datasets and compares it with the currently popular optimizers. The experimental results demonstrate that the proposed algorithm yields significant improvement over the popular optimization algorithms.



Introduction

Deep learning involves implementing an optimization algorithm to train the model-parameters by minimizing an objective function, and has gained tremendous popularity in fields like computer vision, image processing, and many other areas of artificial intelligence Dong et al. (2016); Simonyan & Zisserman (2015); Dong et al. (2014). Fundamentally, optimization algorithms are categorized into first-order Robbins & Monro (1951); Jain et al. (2018), higher-order Dennis & Moré (1977); Martens (2010), and derivative-free algorithms Rios & Sahinidis (2013); Berahas et al. (), based on their use of the gradient of the objective function. First-order algorithms use the first derivative to locate the optimum of the objective function through gradient descent. Second-order algorithms, on the other hand, use second-order derivative information to approximate the gradient direction as well as the step-size to attain the optimum. A major disadvantage of these methods is the large computation required to determine the inverse of the Hessian matrix. Quasi-Newton algorithms like L-BFGS address this problem by approximating the Hessian matrix, and have gained significant popularity in many optimization problems Kochenderfer & Wheeler (2019).

Over the years, first-order optimization algorithms have become the primary choice for training deep neural network models. One of the most widely used first-order optimization algorithms is stochastic gradient descent (SGD), which is (a) easy to implement and (b) known to perform well across many applications Agarwal et al. (2012); Nemirovski et al. (2009). However, despite its ease of implementation and generalization, SGD often shows slow convergence because it scales the gradient uniformly in all dimensions when updating the model-parameters. This is disadvantageous, particularly in the case of loss functions with uneven geometry, i.e., having regions of steep and shallow slope in different dimensions simultaneously, and often leads to slow convergence Sutton (1986).

A number of optimization algorithms have been developed over the years to improve the convergence speed of gradient descent. Some expedite the convergence in the direction of the gradient descent; these include the Momentum and Nesterov's Accelerated Gradient (NAG) algorithms Qian (1999); Botev et al. (2017). Others dynamically adapt the learning-rate based on an accumulation or exponentially decaying average of the past gradients; these include the AdaGrad, RMSProp, and Adadelta algorithms Duchi et al. (2011); Tieleman & Hinton (2012); Zeiler (2012). This category of algorithms implements the learning-rate as a vector, each element of which corresponds to one model-parameter and is dynamically adapted based on the gradient of the loss function with respect to the corresponding model-parameter. Further, adaptive gradient algorithms like the Adam algorithm and its variants, such as Adamax, RAdam, and Nadam, simultaneously incorporate the expedition of the gradient direction and the adaptation of the learning-rate based on the past gradients Kingma & Ba (2015); Dozat (2016); Reddi et al. (2018). Apart from these, recent advances in first-order gradient descent methods include signSGD Bernstein et al. (2018) and variance-reduction methods like SAGA Roux et al. (2012), SVRG Johnson & Zhang (2013), and their improved variants Allen-Zhu & Hazan (2016); Reddi et al. (2016); Defazio et al. (2014).

The above-mentioned adaptive learning-rate methods are amongst the most widely implemented optimizers for machine learning. However, despite their increasing popularity and relevance, these methods have limitations. Wilson et al. (2017) showed that the non-uniform scaling of the past gradients, in some cases, leads to unstable and extreme values of the learning-rate, causing suboptimal convergence of the algorithms. A variety of algorithms have been developed over the last few years which improve the Adam-type algorithms further: AdaBound, which employs dynamic bounds on the learning-rate Luo et al. (2019); AdaBelief, which incorporates the variance of the gradient to adjust the learning-rate Zhuang et al. (2020); LookAhead, which considers the trajectories of the fast and the slow model-parameters Zhang et al. (2019); RAdam, which rectifies the variance of the adaptive learning-rate Liu et al. (2020); and AdamW, which decouples the model-parameter decay Loshchilov & Hutter (2019).

In this paper, a new approach to gradient descent optimization is proposed, in which the learning-rate is dynamically adapted according to the change in the model-parameter values from the preceding iterations. The algorithm updates the learning-rate individually for each model-parameter, inversely proportional to the difference in the model-parameter value from the past iterations. This speeds up the convergence of the algorithm, especially in loss-function regions shaped like a ravine: a model-parameter converging with small steps on a gentle slope is updated with a larger learning-rate according to the inverse proportionality, thereby speeding up the overall convergence towards the optimum. Further in the paper, a theoretical analysis determining the lower and upper bounds on the adaptive learning-rate and a convergence analysis using the regret-bound approach are performed. Additionally, an empirical analysis of the proposed algorithm is carried out by training different CNN models on standard datasets such as CIFAR-10/100 and ImageNet. The major contributions of the paper are as follows:

• A new approach to gradient descent optimization, which updates the learning-rate dynamically, inversely proportional to the difference in the model-parameter values from the preceding iterations, is presented. The novelty of the algorithm lies in the fact that it does not employ an accumulation of past gradients, and thus the suboptimal convergence due to very large or very small scaling of the learning-rate is avoided.

• A theoretical analysis determining the bounds on the adaptation of the learning-rate is presented, and a convergence analysis based on the regret-bound approach is derived.

• An empirical analysis implementing the proposed algorithm on benchmark classification tasks and comparing it with the currently popular optimization algorithms is performed.
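The core idea described above, a per-parameter learning-rate scaled inversely to that parameter's recent displacement, can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact formulation: the smoothing constant `eps` and the clipping bounds `lower`/`upper` are assumptions standing in for the stability bounds derived in the theoretical analysis.

```python
import numpy as np

def inverse_difference_step(theta, theta_prev, grad,
                            base_lr=0.01, eps=1e-8,
                            lower=1e-4, upper=0.5):
    """One illustrative update step (a sketch, not the paper's formulation).

    The learning-rate is a vector with one element per model-parameter,
    scaled inversely to how far that parameter moved on the previous
    iteration, so a parameter taking small steps on a shallow slope of a
    ravine receives a larger step. `eps`, `lower`, and `upper` are assumed
    constants that keep the rate finite and bounded.
    """
    displacement = np.abs(theta - theta_prev)
    lr = base_lr / (displacement + eps)   # inverse proportionality
    lr = np.clip(lr, lower, upper)        # bounded adaptation for stability
    return theta - lr * grad
```

For example, iterating this step on the quadratic loss f(θ) = ½‖θ‖² (whose gradient is simply θ) drives θ toward the origin: whenever a parameter's displacement shrinks, its learning-rate grows on the next iteration, which is the ravine behavior motivating the method.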

