IMPROVED GRADIENT DESCENT OPTIMIZATION ALGORITHM BASED ON INVERSE MODEL-PARAMETER DIFFERENCE

Anonymous authors
Paper under double-blind review

Abstract

A majority of deep learning models use first-order optimization algorithms such as stochastic gradient descent (SGD) or its adaptive variants for training large neural networks. However, slow convergence due to the complicated geometry of the loss function is one of the major challenges faced by SGD. The currently popular optimization algorithms incorporate an accumulation of past gradients to improve the gradient descent convergence via either the accelerated gradient scheme (including Momentum, NAG, etc.) or the adaptive learning-rate scheme (including Adam, AdaGrad, etc.). Despite their general popularity, these algorithms often display suboptimal convergence owing to extreme scaling of the learning-rate caused by the accumulation of past gradients. In this paper, a novel approach to gradient descent optimization is proposed which utilizes the difference in the model-parameter values from the preceding iterations to adjust the learning-rate of the algorithm. More specifically, the learning-rate for each model-parameter is adapted inversely proportional to the displacement of that model-parameter over the previous iterations. As the algorithm utilizes the displacement of model-parameters, poor convergence caused by the accumulation of past gradients is avoided. A convergence analysis based on the regret bound approach is performed and the theoretical bounds for a stable convergence are determined. An empirical analysis evaluates the proposed algorithm on the CIFAR-10/100 and ImageNet datasets and compares it with the currently popular optimizers. The experimental results demonstrate that the proposed algorithm shows significant improvement over the popular optimization algorithms.

1. INTRODUCTION

Machine learning essentially involves implementing an optimization algorithm that trains the model-parameters by minimizing an objective function, and has gained tremendous popularity in fields like computer vision, image processing, and many other areas of artificial intelligence Dong et al. (2016); Simonyan & Zisserman (2015); Dong et al. (2014). Fundamentally, optimization algorithms are categorized into first-order Robbins & Monro (1951); Jain et al. (2018), higher-order Dennis & Moré (1977); Martens (2010) and derivative-free algorithms Rios & Sahinidis (2013); Berahas et al. (2019), based on their use of the gradient of the objective function. First-order algorithms use the first derivative to locate the optimum of the objective function through gradient descent. The second-order algorithms, on the other hand, use second-order derivative information to approximate the gradient direction as well as the step-size to attain the optimum. A major disadvantage of these methods is the large computation required to determine the inverse of the Hessian matrix. Quasi-Newton algorithms like L-BFGS solve this problem by approximating the Hessian matrix, and have gained significant popularity in many optimization problems Kochenderfer & Wheeler (2019). Over the years, first-order optimization algorithms have become the primary choice for training deep neural network models. One of the most widely used first-order optimization algorithms is stochastic gradient descent (SGD), which is (a) easy to implement and (b) known to perform well across many applications Agarwal et al. (2012); Nemirovski et al. (2009). However, despite the ease of implementation and generalization, SGD often shows slow convergence because it scales the gradient uniformly in all dimensions when updating the model-parameters.
This is disadvantageous, particularly in the case of loss functions with uneven geometry, i.e., having regions of steep and shallow slope in different dimensions simultaneously, and often leads to slow convergence Sutton (1986). A number of optimization algorithms have been developed over the years to improve the convergence speed of gradient descent in general. There are algorithms that expedite convergence in the direction of the gradient descent, including the Momentum and Nesterov's Accelerated Gradient (NAG) algorithms Qian (1999); Botev et al. (2017). Then there are algorithms that dynamically adapt the learning-rate based on an exponentially decaying average of the past gradients, including the AdaGrad, RMSProp, and Adadelta algorithms Duchi et al. (2011); Tieleman & Hinton (2012); Zeiler (2012). This category of algorithms implements the learning-rate as a vector, each element of which corresponds to one model-parameter and is dynamically adapted based on the gradient of the loss function relative to that model-parameter. Further, there are adaptive gradient algorithms like the Adam algorithm and its variants Adamax, RAdam, Nadam, etc., which simultaneously incorporate the acceleration of the gradient direction and the adaptation of the learning-rate based on past gradients Kingma & Ba (2015); Dozat (2016); Reddi et al. (2018). Apart from that, some recent advances in first-order gradient descent methods include signSGD Bernstein et al. (2018) and variance reduction methods like SAG Roux et al. (2012), SVRG Johnson & Zhang (2013) and their improved variants Allen-Zhu & Hazan (2016); Reddi et al. (2016); Defazio et al. (2014). The above-mentioned adaptive learning-rate methods are amongst the most widely implemented optimizers for machine learning. However, despite increasing popularity and relevance, these methods have their limitations. Wilson et al. (2017) in their study showed that the non-uniform scaling of the past gradients, in some cases, leads to unstable and extreme values of the learning-rate, causing suboptimal convergence of the algorithms. A variety of algorithms have been developed over the last few years which improve the Adam-type algorithms further: AdaBound, which employs dynamic bounds on the learning-rate Luo et al. (2019); AdaBelief, which incorporates the variance of the gradient to adjust the learning-rate Zhuang et al. (2020); LookAhead, which considers the trajectories of the fast and the slow model-parameters Zhang et al. (2019); RAdam, which rectifies the variance of the adaptive learning-rate Liu et al. (2020); and AdamW, which decouples the weight decay Loshchilov & Hutter (2019). In this paper, a new approach for gradient descent optimization is proposed, where the learning-rate is dynamically adapted according to the change in the model-parameter values from the preceding iterations. The algorithm updates the learning-rate individually for each model-parameter, inversely proportional to the difference in the model-parameter value from the past iterations. This speeds up the convergence of the algorithm, especially in loss-function regions shaped like a ravine, as a model-parameter converging with small steps on a gentle slope is updated with a larger learning-rate according to the inverse proportionality, thereby speeding up the overall convergence towards the optimum. Further in the paper, a theoretical analysis determining the lower and upper bounds on the adaptive learning-rate and a convergence analysis using the regret bound approach are performed. Additionally, an empirical analysis of the proposed algorithm is carried out by training on standard datasets like the CIFAR-10/100 and ImageNet datasets with different CNN models.
The major contributions of the paper are as follows:

• A new approach to gradient descent optimization, which updates the learning-rate dynamically, inversely proportional to the difference in the model-parameter values from the preceding iterations, is presented. The novelty of the algorithm lies in the fact that it does not employ an accumulation of past gradients, and thus the suboptimal convergence due to very large or very small scaling of the learning-rate is avoided.

• A theoretical analysis determining the bounds on the adaptation of the learning-rate is presented and a convergence analysis based on the regret bound approach is derived.

• An empirical analysis implementing the proposed algorithm on benchmark classification tasks and comparing it with the currently popular optimization algorithms is performed.

The remainder of the paper is organized as follows: Section 2 gives a short background and formulates the problem at hand. In Section 3, the proposed algorithm is introduced and described in detail, and the theoretical bounds on the adaptive learning-rate for a stable convergence are determined. In Section 4, the proposed algorithm is implemented on benchmark neural networks and its performance is evaluated in comparison with the currently popular optimization algorithms, like SGD, Adam, AdaBelief, LookAhead, RAdam, AdamW, etc., which is followed by a conclusion and a brief outlook on future aspects of the work in Section 5.

2. PROBLEM FORMULATION

2.1. NOTATIONS

Given a vector p ∈ R^n, p^(i)_k denotes the i-th element of the vector at the k-th time instance, ∥p∥_2 denotes the L2-norm of the vector, and |p| gives the element-wise absolute value of the vector. Given two vectors p, q ∈ R^n, p • q denotes the dot product, p ⊙ q denotes the element-wise product, and p/q denotes the element-wise division of the two vectors.
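In code, these element-wise conventions map directly onto NumPy semantics (a minimal illustration with arbitrary values):

```python
import numpy as np

p = np.array([2.0, -3.0, 4.0])
q = np.array([1.0, 3.0, 2.0])

dot = p @ q               # dot product p . q
had = p * q               # element-wise (Hadamard) product p ⊙ q
div = p / q               # element-wise division p / q
l2 = np.linalg.norm(p)    # L2-norm ||p||_2
absval = np.abs(p)        # element-wise absolute value |p|
```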

2.2. BACKGROUND

Supervised learning involves training a neural network, initially modelled with parameters θ ∈ R^n, that maps an input dataset x to a predicted output ŷ = f(θ, x); the gradient of a given loss function with respect to the model-parameters is computed, and an optimization problem is solved to adapt the parameters θ such that the loss function is minimized. The optimization (minimization) problem is represented as

min_θ (1/N) Σ_{i=1}^{N} J(y^(i), f(θ, x^(i)))    (1)

where J is the loss function, N is the number of training examples, x^(i) is the i-th training example, and y^(i) is the corresponding labelled output. The SGD optimization recursively updates the model-parameters θ by subtracting the gradient of the loss function scaled by a certain factor (the learning-rate) via the following update rule

θ_{k+1} = θ_k − µ_k · ∇_k(θ_k)    (2)

where θ_k represents the model-parameters at instance k, ∇_k is the unbiased estimate of the exact gradient of the loss function, and µ_k is the learning-rate, which is also generally adapted at every iteration Zinkevich (2003); Haykin (2014). Slow convergence is one of the major challenges faced by the SGD optimization; SGD convergence can be particularly slow due to the complex geometry of the loss function. A variety of optimization algorithms, which dynamically adjust the learning-rate by considering an accumulation of past gradients, have been developed and gained popularity in the past years Soydaner (2020); Dogo et al. (2018). The adaptive gradient algorithms, like Adam and its variants (as listed in Section 1), despite their widespread popularity, display suboptimal convergence in many scenarios. This results from the non-uniform scaling of the learning-rate due to the accumulation of past gradients Wilson et al. (2017); Chen et al. (2019). Therefore, a novel approach to SGD optimization, which adjusts the learning-rate considering the difference in the model-parameter values instead of an accumulation of past gradients, is proposed in this paper.
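The update rule of Eq. 2 can be sketched in a few lines of NumPy (our own toy illustration, using a simple quadratic loss J(θ) = ½∥θ∥² whose gradient is θ itself):

```python
import numpy as np

def sgd_step(theta, grad, mu):
    """Plain SGD update of Eq. 2: theta_{k+1} = theta_k - mu_k * grad_k."""
    return theta - mu * grad

# Toy quadratic loss J(theta) = 0.5 * ||theta||^2, so the gradient is theta itself.
theta = np.array([4.0, -2.0])
mu = 0.1
for _ in range(100):
    theta = sgd_step(theta, grad=theta, mu=mu)
# theta contracts geometrically toward the minimum at the origin
```

Note that the same scalar µ multiplies every coordinate of the gradient, which is precisely the uniform scaling that the adaptive methods discussed above try to avoid.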

3. PROPOSED ALGORITHM

The proposed algorithm is based on dynamic adjustment of the learning-rate according to the change in the values of the model-parameters between the preceding consecutive iterations. More precisely, the learning-rate for each model-parameter is dynamically adjusted by scaling it inversely proportional to the difference in the values of that model-parameter from the immediately preceding iterations. The proposed algorithm follows the principle that model-parameters in regions of a gentle slope should be adapted with a larger learning-rate than parameters converging quickly over regions of steeper slope, thereby boosting the overall convergence of the algorithm towards the optimum. The difference coefficient, denoted by D^(i)_k, is a function of the absolute difference between the values of the i-th model-parameter from the immediately preceding iterations, and is given by

D^(i)_k = K / (1 + |∆θ^(i)_k|)    (3)

where D^(i)_k is the difference coefficient of the i-th model-parameter at the k-th iteration and ∆θ^(i)_k = θ^(i)_k − θ^(i)_{k−1} gives the difference between the values of the i-th model-parameter between the k-th and the (k−1)-th iterations. The term 1 + |∆θ^(i)_k| limits the range of the denominator to [1, ∞), so that extreme scaling of µ_k in the case of a very small ∆θ_k is avoided. The learning-rate µ^(i)_k for the i-th model-parameter is thus updated according to the following relation:

µ^(i)_k = µ_0 · D^(i)_k = µ_0 · K / (1 + |∆θ^(i)_k|)    (4)

The characteristic of D^(i)_k determines the behavior of the dynamic learning-rate. Note that for K = 1, D^(i)_k ∈ (0, 1] ∀ ∆θ^(i)_k ∈ (−∞, ∞). It can be observed that a large difference in the model-parameter values results in a smaller D^(i)_k, and thus a smaller adaptive learning-rate, while a smaller difference results in a larger D^(i)_k, i.e., a larger adaptive learning-rate.
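The behavior of the difference coefficient in Eq. 3 is easy to verify numerically (a minimal sketch; the displacement values are arbitrary):

```python
import numpy as np

def difference_coefficient(delta_theta, K=1.0):
    """D_k = K / (1 + |delta_theta|), Eq. 3, applied element-wise."""
    return K / (1.0 + np.abs(delta_theta))

# A small displacement yields a coefficient near K (larger step);
# a large displacement yields a coefficient near 0 (smaller step).
D = difference_coefficient(np.array([0.0, 0.01, 10.0]))
```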
The dynamic learning-rate, in this algorithm, is represented by a vector M ∈ R^L, every element of which gives the learning-rate corresponding to one model-parameter:

M = [µ^(1)_k, µ^(2)_k, ..., µ^(L)_k]^T    (5)

where µ^(i)_k = µ_0 · D^(i)_k and L is the number of model-parameters. Thus, at the k-th iteration, the update of the model-parameters is given by

θ_{k+1} = θ_k − M ⊙ ∇_k(θ_k)    (6)

The algorithm defined by Eq. 6 above adjusts the learning-rate for each model-parameter as an inverse function of the change in the model-parameter values. Algorithm 1 depicts the generic framework of the proposed algorithm.

Algorithm 1: Proposed algorithm
Input: x ∈ R^n
Parameters: initial model-parameters θ_0, initial learning-rate µ_0
Iterate: for k = 1, 2, 3, ... until convergence
1. compute gradient ∇_k = ∂J(θ_k)/∂θ_k
2. compute difference coefficient D^(i)_k = K / (1 + |∆θ^(i)_k|)
3. compute adaptive learning-rate µ^(i)_k = µ_0 · D^(i)_k, M = [µ^(1)_k, µ^(2)_k, ..., µ^(L)_k]^T
4. update model-parameters θ_{k+1} = θ_k − M ⊙ ∇_k(θ_k)
end for

The idea of utilizing the difference in the model-parameters is also used in the L-BFGS algorithm, which belongs to the category of quasi-Newton methods and uses the history of the past m parameter updates and their gradients to estimate the Hessian of the objective function H(θ_k) = ∇²J(θ_k) Liu & Nocedal (1989); Kochenderfer & Wheeler (2019). In comparison, the proposed algorithm utilizes the difference in model-parameters to update the learning-rate of the gradient descent. Also, the BFGS algorithm requires computation costs and storage capacity of the order O(n²), whereas the proposed algorithm requires both computation and storage of the order O(n) Mokhtari & Ribeiro (2015).
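Algorithm 1 can be condensed into a short NumPy sketch (our own reimplementation for illustration, not the authors' code; the quadratic test loss and the hyperparameter values are assumptions):

```python
import numpy as np

def proposed_update(theta, theta_prev, grad, mu0=0.01, K=1.0):
    """One iteration of Algorithm 1.

    The per-parameter learning-rate mu0 * K / (1 + |theta - theta_prev|)
    is inversely proportional to the last displacement (Eqs. 3-4),
    and the update is theta - M ⊙ grad (Eq. 6).
    """
    D = K / (1.0 + np.abs(theta - theta_prev))  # difference coefficient, Eq. 3
    M = mu0 * D                                 # adaptive learning-rate vector, Eq. 5
    return theta - M * grad

# Toy convex loss J(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta_prev = np.array([3.0, -3.0])
theta = theta_prev.copy()
for _ in range(2000):
    theta, theta_prev = proposed_update(theta, theta_prev, grad=theta), theta
```

In contrast to Adam-type methods, the only state carried between iterations is the previous parameter vector, so the memory overhead stays O(n), as noted above.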

3.1. CONVERGENCE ANALYSIS

In this section, a convergence analysis of the proposed algorithm based on the regret bound approach is performed, and the conditions for a guaranteed convergence, based on the bounds on the adaptive learning-rate and the range of the parameter K, are determined. Considering {J(θ)} = {J(θ_0), J(θ_1), J(θ_2), ..., J(θ_k)} as the sequence of loss function values, the regret bound R_J(T) is the sum of the differences between the optimum value of the loss function, i.e., J(θ*), and the series of function values from J(θ_0) to J(θ_k), and is given by

R_J(T) = Σ_{k=1}^{T} [J(θ_k) − J(θ*)]    (7)

where θ* = arg min_θ J(θ). For a convex and twice differentiable loss function J(θ), whose gradient is β-Lipschitz continuous, i.e., ∥∇J(θ_2) − ∇J(θ_1)∥_2 ≤ β∥θ_2 − θ_1∥_2 ∀ θ_1, θ_2, the regret bound for the proposed algorithm (proof in Appendix A) is given by

R_J(T) ≤ (1 / (2M_k)) ∥θ_0 − θ*∥²_2    (8)

where M_k = [µ^(1)_k, µ^(2)_k, ..., µ^(L)_k]^T and the bound is understood element-wise per model-parameter. Further, the algorithm converges with a rate O(1/(µ_k k)), which means the rate of convergence depends inversely on the number of iterations as well as on the instantaneous learning-rate µ_k, which in turn is determined from the change in the model-parameter values ∆θ_k. This implies that model-parameters on a flatter trajectory will converge at a faster rate.

3.1.1. CONVERGENCE BOUNDS AND RANGE OF PARAMETER K

In this section, the lower and upper bounds for the adaptive learning-rate in the case of a very small or a very large change in the model-parameters are determined. The algorithm will converge with a rate O(1/k) if the adaptive learning-rate is bounded by

0 < µ_k < 2/β    (9)

where k is the number of iterations and β is the maximum eigenvalue of the input correlation matrix R_xx, as determined from the classical theoretical analysis of the SGD convergence Boyd & Vandenberghe (2004).

Case I: For a model-parameter converging over a very steep slope, i.e., where the value of ∆θ^(i)_k is very large (∆θ^(i)_k → ∞), the lower bound on the adaptive learning-rate is given by

µ^(i)_{k,min} = lim_{∆θ→∞} [µ_0 · K / (1 + |∆θ^(i)_k|)] = 0    (10)

The minimum adaptive learning-rate µ^(i)_{k,min}, for a convergence with a rate O(1/k), must be bounded by µ^(i)_{k,min} > 0. Thus, the lower bound of the parameter K is given by K > 0.

Case II: For a model-parameter following a very slow convergence over a long, gradual slope (∆θ^(i)_k ≈ 0), the theoretical maximum of the learning-rate is given by

µ^(i)_{k,max} = lim_{∆θ→0} [µ_0 · K / (1 + |∆θ^(i)_k|)] = K · µ_0    (11)

and for the gradient descent to have a global convergence, the upper bound must be limited to µ^(i)_{k,max} ≤ 2/β. Thus, the range of the parameter K, computed by substituting the minimum and the maximum values respectively, is given by K ∈ (0, 2/(µ_0 β)].

In this section, the proposed algorithm was described and the bounds on the adaptive learning-rate for a guaranteed convergence were determined. The proposed algorithm provides a novel approach to improve the convergence speed of the gradient descent algorithm while simultaneously guaranteeing a stable convergence.
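These limits can be checked numerically (a sketch with the illustrative values µ_0 = 0.001 and K = 10, matching the settings used later in the experiments):

```python
import numpy as np

mu0, K = 0.001, 10.0
delta = np.logspace(-8, 8, 200)       # sweep |Δθ| from near 0 to very large
mu_k = mu0 * K / (1.0 + delta)        # adaptive learning-rate of Eq. 4

# Eq. 10: mu_k -> 0 as Δθ -> ∞;  Eq. 11: mu_k -> K * mu0 as Δθ -> 0.
assert np.all(mu_k > 0) and np.all(mu_k <= K * mu0)
```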

4. EXPERIMENTS

In this section, an extensive experimental analysis is carried out to evaluate the efficacy and performance of the proposed algorithm. The performance of the proposed algorithm was compared with a number of state-of-the-art optimizers like SGDM, Adam, AdaBelief, LookAhead, RAdam, and AdamW. The first part of the experiments involved implementing the proposed algorithm on a standard two-parameter loss function, the Rosenbrock function, to visualize and compare its convergence with SGDM and Adam. In the next part, further experiments were carried out implementing the above-mentioned optimizers for training on two benchmark datasets and comparing their performance based on the training loss and accuracy. The experiments involved training (i) the CIFAR-10 and CIFAR-100 datasets on the ResNet-18 architecture, and (ii) the Tiny-ImageNet dataset on the EfficientNet architecture.

4.1. PERFORMANCE ASSESSMENT ON ROSENBROCK FUNCTION

For the experiments in this section, the above-mentioned algorithms were implemented on a two-dimensional optimization function and their convergence characteristics were compared. The Rosenbrock function is widely used as a performance evaluation benchmark for optimization algorithms Emiola & Adem (2021). The Rosenbrock function for the two-parameter setup implemented in the experiments is given by

f(w_1, w_2) = κ(w_2 − w_1²)² + (w_1 − 1)²    (12)

where w_1, w_2 ∈ R, f(w_1, w_2) is smooth and differentiable, and the constant κ determines the steepness of the valley of the Rosenbrock function. The minimum of the function lies at (1, 1), inside a long, parabolic, flat valley, and achieving fast convergence to the minimum is difficult in some cases. Fig. 1(a) shows the Rosenbrock function with κ = 100. The experiment was set up as follows: the model-parameters w_1 and w_2 were initialized to (0, 5) and the initial learning-rate for all the algorithms was set to µ_0 = 0.001. The hyperparameters for the Adam optimizer were set to β_1 = 0.9 and β_2 = 0.999. The performance of the above-mentioned algorithms was analyzed and compared based on the speed of convergence, the accuracy, and the run time of the algorithm. Figure 1 shows the convergence comparison of the proposed algorithm with the SGD and Adam optimizers. Plots 1(a) and 1(b) show the 3D surface plot and the contour plot of the convergence of the above-mentioned algorithms, while plot 1(c) shows the convergence of the function against the number of iterations. It can be seen from plot 1(c) that the proposed algorithm converges to the minimum in about 25 iterations. In comparison, the SGD algorithm converged in about 40 iterations, while the Adam algorithm showed the slowest convergence for this experiment and required about 160 iterations to converge to the minimum.
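The setup of this experiment can be sketched as follows (our own minimal reimplementation of Eq. 12 and the proposed update; the starting point, iteration count, and K = 1 are illustrative and plotting is omitted, so this is not an exact replication of the reported runs):

```python
import numpy as np

def rosenbrock_grad(w, kappa=100.0):
    """Gradient of f(w1, w2) = kappa*(w2 - w1^2)^2 + (w1 - 1)^2 (Eq. 12)."""
    w1, w2 = w
    g1 = -4.0 * kappa * w1 * (w2 - w1**2) + 2.0 * (w1 - 1.0)
    g2 = 2.0 * kappa * (w2 - w1**2)
    return np.array([g1, g2])

def run_proposed(w0, mu0=1e-3, K=1.0, iters=20000):
    """Minimize the Rosenbrock function with the proposed adaptive update."""
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(iters):
        M = mu0 * K / (1.0 + np.abs(w - w_prev))  # per-parameter learning-rate
        w, w_prev = w - M * rosenbrock_grad(w), w
    return w

# With a small mu0 the iterates settle into the parabolic valley and
# drift along it toward the minimum at (1, 1).
w_final = run_proposed(np.array([0.0, 0.5]))
```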
It can be observed that the proposed algorithm outperforms the SGD and the Adam algorithms in terms of convergence speed, i.e., the number of iterations required for convergence.

4.2. PERFORMANCE ASSESSMENT ON THE CIFAR DATASETS

For the experiments in this section, image classification on the standard CIFAR-10 and CIFAR-100 datasets was performed using the ResNet-18 architecture He et al. (2015). The CIFAR-10 and CIFAR-100 datasets consist of 32 × 32 colour images belonging to 10 and 100 classes respectively. Each dataset is split into 50000 images for training and 10000 images for test Krizhevsky et al. (2009). The SGDM optimizer was implemented with a step-decay scheduler with an initial learning-rate of 0.1, decreased by a factor of 10 after every 20 epochs. For the Adam and AdaBelief optimizers, the initial learning-rate was chosen as 0.001 with hyperparameters β_1 = 0.9, β_2 = 0.999. The LookAhead optimizer was implemented with a learning-rate of 0.001 and the hyperparameters κ = 5 and α = 0.5. The RAdam and AdamW optimizers were applied with a learning-rate of 0.001, hyperparameters β_1 = 0.9, β_2 = 0.999, and, in the case of AdamW, a weight decay of 0.0001. For the proposed algorithm, the initial learning-rate was set to 0.001 and the parameter K was chosen as K = 10; this value was chosen after extensive simulations with different parameter values (refer Section 4.2.3). The CNN models were coded in the Google Colaboratory environment, trained on a 25 GB RAM GPU, and run for 100 epochs for each optimization algorithm with a batch-size of 128.

A APPENDIX

The proof of convergence of the algorithm is carried out considering the regret bound of the algorithm. Consider J(θ) as the objective function to be optimized and {J(θ)} = {J(θ_0), J(θ_1), J(θ_2), ..., J(θ_k)} as the sequence of loss function values. The regret bound, given by R_J(T), is the sum of the differences between the optimum value of the loss function, i.e., J(θ*), and the series of function values from J(θ_0) to J(θ_k).
The regret bound is given by

R_J(T) = Σ_{k=1}^{T} [J(θ_k) − J(θ*)]    (13)

Assumptions: For the convergence analysis, the following assumptions are made:

Assumption 1: The function J(θ) is convex, i.e.,

J(x_2) ≥ J(x_1) + ⟨∇J(x_1), x_2 − x_1⟩  ∀ x_1, x_2 ∈ R^n    (14)

Assumption 2: ∇J(θ) is L-Lipschitz continuous, i.e.,

∥∇J(x_2) − ∇J(x_1)∥ ≤ L∥x_2 − x_1∥  ∀ x_1, x_2 ∈ R^n    (15)

and consequently

J(x_2) ≤ J(x_1) + ⟨∇J(x_1), x_2 − x_1⟩ + (L/2)∥x_2 − x_1∥²_2  ∀ x_1, x_2 ∈ R^n    (16)

Proof: The convergence analysis shows that the upper bound of the regret R_J(T) decays with the number of iterations. From the Lipschitz continuity, we have

J(θ_2) ≤ J(θ_1) + ⟨∇J(θ_1), θ_2 − θ_1⟩ + (L/2)∥θ_2 − θ_1∥²_2  ∀ θ_1, θ_2 ∈ R^n    (17)

For the i-th model-parameter, the above equation can be written as

J(θ^(i)_{k+1}) ≤ J(θ^(i)_k) + ⟨∇J(θ^(i)_k), θ^(i)_{k+1} − θ^(i)_k⟩ + (L/2)∥θ^(i)_{k+1} − θ^(i)_k∥²_2    (18)

Substituting the update θ^(i)_{k+1} − θ^(i)_k = −µ^(i)_k ∇J(θ^(i)_k) gives

J(θ^(i)_{k+1}) ≤ J(θ^(i)_k) + ⟨∇J(θ^(i)_k), −µ^(i)_k ∇J(θ^(i)_k)⟩ + (L/2)∥µ^(i)_k ∇J(θ^(i)_k)∥²_2    (19)

≤ J(θ^(i)_k) − µ^(i)_k ∥∇J(θ^(i)_k)∥²_2 + ((µ^(i)_k)² L/2)∥∇J(θ^(i)_k)∥²_2    (21)

≤ J(θ^(i)_k) − µ^(i)_k (1 − µ^(i)_k L/2) ∥∇J(θ^(i)_k)∥²_2    (22)



The maximum eigenvalue β of the input correlation matrix R_xx is determined from the trace of the matrix E[xᵀx]; under practical considerations, the trace E[xᵀx] is computed from the average power of the input signal x Haykin (2014).



Figure 1: (a) 3D Surface plot of the convergence. (b) 2D contour plot of the convergence. (c) Convergence of the function w.r.t. number of iterations.

Figure 2(a) shows the training results for the classification task on the CIFAR-10 dataset, and Figure 2(b) for the CIFAR-100 dataset. It can be observed that all algorithms achieve an accuracy approaching almost 99%. The proposed algorithm performs slightly better on both the CIFAR-10 and the CIFAR-100 datasets. The performance evaluation based on the training accuracy and loss for each optimizer is compiled in Table 1.

Figure 2: (a) Convergence comparison on the CIFAR-10 dataset (b) Convergence comparison on the CIFAR-100 dataset

Figure 3: Convergence comparison on the Tiny-ImageNet dataset

J(θ^(i)_k) gives the value of the loss function at instance k with respect to the model-parameter i. ∇J(θ^(i)_k) gives the gradient of the loss function at instance k with respect to the model-parameter i.

Table 1: Comparison of the training loss and accuracy for the CIFAR datasets

Table 2: Training and test loss and accuracy on the Tiny-ImageNet dataset

Table 3: Training and test loss and accuracy for different values of parameter K


Choosing µ^(i)_k ≤ 1/L and plugging it into Eq. 22 gives

J(θ^(i)_{k+1}) ≤ J(θ^(i)_k) − (µ^(i)_k / 2) ∥∇J(θ^(i)_k)∥²_2    (23)

It can be observed from Eq. 23 that, since ∥∇J(θ^(i)_k)∥²_2 is always positive, the value of the loss function J(θ^(i)_k) decreases after every iteration until it reaches the minimum J(θ^(i)_*), where the gradient of the loss function ∇J(θ^(i)_*) = 0. Now considering the convexity of the loss function (Assumption 1), the following can be written

J(θ^(i)_k) ≤ J(θ^(i)_*) + ⟨∇J(θ^(i)_k), θ^(i)_k − θ^(i)_*⟩

Plugging the above into Eq. 23 gives

J(θ^(i)_{k+1}) ≤ J(θ^(i)_*) + ⟨∇J(θ^(i)_k), θ^(i)_k − θ^(i)_*⟩ − (µ^(i)_k / 2) ∥∇J(θ^(i)_k)∥²_2

Rearranging the above equation gives

J(θ^(i)_{k+1}) − J(θ^(i)_*) ≤ (1 / (2µ^(i)_k)) (2µ^(i)_k ⟨∇J(θ^(i)_k), θ^(i)_k − θ^(i)_*⟩ − (µ^(i)_k)² ∥∇J(θ^(i)_k)∥²_2)

which, on simplification (completing the square), gives

J(θ^(i)_{k+1}) − J(θ^(i)_*) ≤ (1 / (2µ^(i)_k)) (∥θ^(i)_k − θ^(i)_*∥²_2 − ∥θ^(i)_{k+1} − θ^(i)_*∥²_2)

The above inequality holds for every iteration of the sequence. Thus, summing over all iterations gives

Σ_{k=1}^{T} [J(θ^(i)_k) − J(θ^(i)_*)] ≤ (1 / (2µ^(i)_k)) Σ_{k=1}^{T} (∥θ^(i)_{k−1} − θ^(i)_*∥²_2 − ∥θ^(i)_k − θ^(i)_*∥²_2)

The sum on the right-hand side of the equation is a telescopic sum, and hence all the intermediate terms cancel, resulting in the following

Σ_{k=1}^{T} [J(θ^(i)_k) − J(θ^(i)_*)] ≤ (1 / (2µ^(i)_k)) (∥θ^(i)_0 − θ^(i)_*∥²_2 − ∥θ^(i)_T − θ^(i)_*∥²_2)

The second term on the right-hand side, ∥θ^(i)_T − θ^(i)_*∥²_2, is always positive, and hence the following inequality holds true as well

Σ_{k=1}^{T} [J(θ^(i)_k) − J(θ^(i)_*)] ≤ (1 / (2µ^(i)_k)) ∥θ^(i)_0 − θ^(i)_*∥²_2

This relation shows that for a step-size µ^(i)_k ≤ 1/L, the regret bound of the gradient descent algorithm is bounded. It was established in the above section that the adaptive learning-rate at any instance is bounded by µ^(i)_k ∈ (0, Kµ_0], and it can be inferred that the regret bound of the algorithm for the i-th model-parameter is given by

R_J^(i)(T) ≤ (1 / (2µ^(i)_k)) ∥θ^(i)_0 − θ^(i)_*∥²_2

The regret bound of the complete loss function can be computed by aggregating the above relation across all dimensions, i.e., for all the model-parameters:

R_J(T) ≤ (1 / (2M_k)) ∥θ_0 − θ*∥²_2

where M_k = [µ^(1)_k, µ^(2)_k, ..., µ^(L)_k]^T. Further, the algorithm converges with a rate O(1/(µ_k k)), which means the rate of convergence depends inversely on the number of iterations as well as on the instantaneous learning-rate µ_k, which in turn is determined from the change in the model-parameter values ∆θ_k. This implies that, nearing convergence, when the change in model-parameter values is very small, the algorithm converges faster than vanilla GD (or vanilla SGD). Since the loss function J(θ) is assumed convex and decreases at every iteration for each model-parameter, i.e., J(θ_{k+1}) < J(θ_k) < J(θ_{k−1}) < ..., it can be concluded that the algorithm converges with a rate O(1/k), given by

J(θ_T) − J(θ*) ≤ (1 / (2M_k T)) ∥θ_0 − θ*∥²_2

