IMPROVED GRADIENT DESCENT OPTIMIZATION ALGORITHM BASED ON INVERSE MODEL-PARAMETER DIFFERENCE

Anonymous authors
Paper under double-blind review

Abstract

A majority of deep learning models implement first-order optimization algorithms such as stochastic gradient descent (SGD) or its adaptive variants for training large neural networks. However, slow convergence due to the complicated geometry of the loss function is one of the major challenges faced by SGD. The currently popular optimization algorithms incorporate an accumulation of past gradients to improve gradient descent convergence via either an accelerated gradient scheme (including Momentum, NAG, etc.) or an adaptive learning-rate scheme (including Adam, AdaGrad, etc.). Despite their general popularity, these algorithms often display suboptimal convergence owing to extreme scaling of the learning-rate caused by the accumulation of past gradients. In this paper, a novel approach to gradient descent optimization is proposed which utilizes the difference in the model-parameter values from the preceding iterations to adjust the learning-rate of the algorithm. More specifically, the learning-rate for each model-parameter is adapted to be inversely proportional to the displacement of the model-parameter from the previous iterations. As the algorithm utilizes the displacement of model-parameters, the poor convergence caused by the accumulation of past gradients is avoided. A convergence analysis based on the regret-bound approach is performed and the theoretical bounds for a stable convergence are determined. An empirical analysis evaluates the proposed algorithm on the CIFAR-10/100 and ImageNet datasets and compares it with the currently popular optimizers. The experimental results demonstrate that the proposed algorithm yields a significant improvement over the popular optimization algorithms.
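For concreteness, the following is a minimal sketch of one possible per-parameter update in the spirit described above, where the step-size shrinks as the parameter's displacement from the previous iterate grows. The exact scaling rule, the bounding constant in the denominator, and the function name are illustrative assumptions, not the definitive formulation developed in the paper.

```python
import numpy as np

def inverse_displacement_step(theta, theta_prev, grad, base_lr=0.01):
    """Illustrative descent step with a displacement-adapted learning-rate.

    Assumed, simplified scaling: the per-parameter rate is inversely
    proportional to (1 + displacement), so parameters that moved far in the
    previous iteration take smaller steps, while the rate stays bounded by
    base_lr when the displacement is near zero.
    """
    displacement = np.abs(theta - theta_prev)   # |theta_t - theta_{t-1}|
    lr = base_lr / (1.0 + displacement)         # element-wise adapted rate
    return theta - lr * grad                    # standard gradient descent step

# Example usage on a toy quadratic objective f(theta) = 0.5 * ||theta||^2,
# whose gradient is simply theta.
theta_prev = np.array([1.0, -2.0])
theta = np.array([0.9, -1.5])
theta = inverse_displacement_step(theta, theta_prev, grad=theta)
```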



Introduction

Deep learning involves implementing an optimization algorithm to train the model-parameters by minimizing an objective function, and has gained tremendous popularity in fields like computer vision, image processing, and many other areas of artificial intelligence Dong et al. (2016); Simonyan & Zisserman (2015); Dong et al. (2014). Fundamentally, optimization algorithms are categorized into first-order Robbins & Monro (1951); Jain et al. (2018), higher-order Dennis & Moré (1977); Martens (2010) and derivative-free algorithms Rios & Sahinidis (2013); Berahas et al., based on the use of the gradient of the objective function. First-order algorithms use the first derivative to locate the optimum of the objective function through gradient descent. Second-order algorithms, on the other hand, use second-order derivative information to approximate the gradient direction as well as the step-size needed to attain the optimum. A major disadvantage of these methods is the large computational cost of determining the inverse of the Hessian matrix. Quasi-Newton algorithms like L-BFGS address this problem by approximating the Hessian matrix, and have gained significant popularity in many optimization problems Kochenderfer & Wheeler (2019). Over the years, first-order optimization algorithms have become the primary choice for training deep neural network models. One of the widely popular first-order optimization algorithms is stochastic gradient descent (SGD), which is (a) easy to implement and (b) known to perform well across many applications Agarwal et al. (2012); Nemirovski et al. (2009). However, despite the ease
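The contrast between the two families is visible even in a minimal sketch: a first-order SGD step needs only a stochastic gradient, while a plain second-order step must solve a d x d linear system with the Hessian, which is what motivates quasi-Newton approximations such as L-BFGS. The function names and signatures below (grad_fn, hess_fn) are illustrative assumptions.

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, lr=0.1):
    # First-order update: one stochastic gradient, no curvature information.
    g = grad_fn(theta, batch)
    return theta - lr * g

def newton_step(theta, grad_fn, hess_fn, batch):
    # Second-order update: solving with the d x d Hessian costs O(d^3) per
    # step for dense problems, hence the appeal of quasi-Newton methods.
    g = grad_fn(theta, batch)
    H = hess_fn(theta, batch)
    return theta - np.linalg.solve(H, g)  # solve H p = g rather than forming H^{-1}
```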

