ADAPTIVE HIERARCHICAL HYPER-GRADIENT DESCENT

Abstract

In this study, we investigate learning rate adaptation at different levels within the hyper-gradient descent framework and propose a method that adaptively learns the optimizer parameters by combining adaptations at different levels. We also show the relationship between regularizing over-parameterized learning rates and building combinations of adaptive learning rates at different levels. Experiments on several network architectures, including feed-forward networks, LeNet-5, and ResNet-18/34, show that the proposed multi-level adaptive approach can significantly outperform baseline adaptive methods in a variety of circumstances.

1. INTRODUCTION

The basic optimization algorithm for training deep neural networks is the gradient descent method (GD), which includes stochastic gradient descent (SGD), mini-batch gradient descent, and batch gradient descent. The model parameters are updated according to the first-order gradients of the empirical risk with respect to the parameters being optimized, with back-propagation used to calculate those gradients (Ruder, 2016). Naïve gradient descent methods apply fixed learning rates without any adaptation mechanism. However, given how the available information changes during the learning process, SGD with a fixed learning rate can be inefficient and requires a large amount of computing resources for hyper-parameter search. One solution is to introduce learning rate adaptation. This idea can be traced back to work on gain adaptation for connectionist learning methods (Sutton, 1992) and related extensions to non-linear cases (Schraudolph, 1999; Yu et al., 2006). In recent years, optimizers with adaptive updating rules were developed in the context of deep learning, while the learning rates remain fixed during training. These methods include AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), and Adam (Kingma and Ba, 2015). In addition, there are optimizers that address the convergence issue in Adam (Reddi et al., 2018; Luo et al., 2018) and that rectify the variance of the adaptive learning rate (Liu et al., 2019). Other techniques, such as Lookahead, can also achieve variance reduction and stability improvement with negligible extra computational cost (Zhang et al., 2019). Even though adaptive optimizers with fixed learning rates can converge faster than SGD in a wide range of tasks, their updating rules are designed manually and introduce additional hyper-parameters. Another idea is to use objective function information and update the learning rates as trainable parameters.
These methods build on automatic differentiation, where the hyper-parameters can be optimized with back-propagation (Maclaurin et al., 2015; Baydin et al., 2018). As gradient-based hyper-parameter optimization methods, they can be implemented in an online fashion (Franceschi et al., 2017). With the idea of auto-differentiation, learning rates can be updated in real time with the corresponding derivatives of the empirical risk (Almeida et al., 1998), which can be generalized to all types of optimizers for deep neural networks (Baydin et al., 2017). Another step-size adaptation approach, called "L4", is based on a linearized expansion of the loss function and rescales the gradient to make fixed predicted progress on the loss (Rolinek and Martius, 2018). Furthermore, to address the poor generalization performance of adaptive methods, dynamic bounds on gradient methods were introduced to build a gradual transition between adaptive approaches and SGD (Luo et al., 2018). Another set of approaches trains an RNN (recurrent neural network) agent to generate the optimal learning rate for the next step given the historical training information, known as "learning to learn" (Andrychowicz et al., 2016). This approach empirically outperforms hand-designed optimizers in a variety of learning tasks, but another study has shown that it may not be effective over long horizons (Lv et al., 2017). Its generalization ability can be improved by using meta training samples and hierarchical LSTMs (long short-term memory) (Wichrowska et al., 2017). Beyond adaptive learning rates, learning rate schedules can also improve the convergence of optimizers, including time-based decay, step decay, and exponential decay (Li and Arora, 2019). The most fundamental and widely applied one is the piece-wise step-decay learning rate schedule, which can vastly improve the convergence of SGD and even of adaptive optimizers (Luo et al., 2018; Liu et al., 2019).
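To make the hyper-gradient mechanism concrete, the following sketch applies the scalar hyper-gradient descent rule (in the spirit of Baydin et al., 2017) to a toy quadratic objective. The function name and constants are illustrative, not the authors' code: the learning rate is itself updated by gradient descent on the loss, using the dot product of two consecutive gradients as its (negated) hyper-gradient.

```python
import numpy as np

def hypergradient_sgd(grad_f, theta0, alpha0=0.01, beta=1e-4, steps=100):
    """Scalar hyper-gradient descent sketch: the learning rate alpha is
    trained by gradient descent on the loss, using
    d f(theta_t) / d alpha = -grad_f(theta_t) . grad_f(theta_{t-1})."""
    theta = np.asarray(theta0, dtype=float)
    alpha, prev_grad = alpha0, np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        # move alpha opposite to its hyper-gradient: alpha grows when the
        # last two gradients point in a consistent direction
        alpha += beta * float(np.dot(g, prev_grad))
        theta = theta - alpha * g
        prev_grad = g
    return theta, alpha

# toy objective f(theta) = 0.5 * ||theta||^2, so grad_f(theta) = theta
theta_end, alpha_end = hypergradient_sgd(lambda th: th, [5.0, -3.0], steps=500)
```

On this objective the consecutive gradients stay aligned, so the learning rate increases from its initial value until the iterates are driven toward the minimum, at which point the hyper-gradient vanishes and alpha stabilizes.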
It can be further improved by introducing a statistical test to determine when to apply step decay (Lang et al., 2019; Zhang et al., 2020). There are also works on warm restarts (O'Donoghue and Candes, 2015; Loshchilov and Hutter, 2017), which can improve the anytime performance of SGD when training deep neural networks. We find that existing gradient- or model-based learning rate adaptation methods, including hyper-gradient descent, L4, and learning to learn, focus only on global adaptation, and could be further extended to multi-level cases. Such an extension introduces locally shared adaptive learning rates, such as layer-wise and parameter-wise learning rates, and considers information from all levels in determining the updating step size for each parameter. The main contributions of our study can be summarized as follows:
• We introduce hierarchical learning rate structures for neural networks and apply hyper-gradient descent to obtain adaptive learning rates at different levels.
• We introduce a set of regularization techniques for learning rates to address the balance of global and local adaptations and show the relationship with weighted combinations.
• We propose an algorithm implementing the combination of adaptive learning rates at multiple levels for model parameter updating.

2.1. LAYER-WISE, UNIT-WISE AND PARAMETER-WISE ADAPTATION

In the paper on hyper-gradient descent (Baydin et al., 2017), the learning rate is a scalar. However, to make the most of learning rate adaptation, in this study we introduce layer-wise or even parameter-wise updating rules, where the learning rate α_t at each iteration time step is a vector (layer-wise) or even a list of matrices (parameter-wise). For the sake of simplicity, we collect all the learning rates in a vector: α_t = (α_{1,t}, ..., α_{N,t})^T. Correspondingly, the objective f(θ) is a function of θ = (θ_1, θ_2, ..., θ_N)^T, which collects all the model parameters. In this case, the derivative of the objective function f with respect to each learning rate can be written as

∂f(θ_{t-1}) / ∂α_{i,t-1} = ∂f(θ_{1,t-1}, ..., θ_{N,t-1}) / ∂α_{i,t-1} = Σ_{j=1}^{N} [∂f(θ_{1,t-1}, ..., θ_{N,t-1}) / ∂θ_{j,t-1}] · [∂θ_{j,t-1} / ∂α_{i,t-1}],  (1)

where N is the total number of model parameters. Eq. (1) can be generalized to group-wise updating, where we associate a learning rate with a particular group of parameters, and each parameter group is updated according to its own learning rate. Notice that although there is a dependency between α_{t-1} and θ_{t-2} through α_{t-1} = α_{t-2} - β∇f(θ_{t-2}), where β is the updating rate of hyper-gradient descent, we consider that α_{t-1} is calculated after θ_{t-2}, so a change of α_{t-1} does not result in a change of θ_{t-2}. Assume θ_t = u(Θ_{t-1}, α) is the updating rule, where Θ_t = {θ_s}_{s=0}^{t} and α is the learning rate; then the basic gradient descent method for each group i gives θ_{i,t} = u(Θ_{t-1}, α_{i,t-1}) = θ_{i,t-1} - α_{i,t-1} ∇_{θ_i} f(θ_{t-1}). Hence for gradient descent

∂f(θ_{t-1}) / ∂α_{i,t-1} = ∇_{θ_i} f(θ_{t-1})^T ∇_{α_{i,t-1}} u(Θ_{t-1}, α_{i,t-1}) = -∇_{θ_i} f(θ_{t-1})^T ∇_{θ_i} f(θ_{t-2}).  (2)

Here α_{i,t-1} is a scalar with index i at time step t-1, corresponding to the learning rate of the ith group, while the shape of ∇_{θ_i} f(θ) is the same as the shape of θ_i.
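In this formulation, Eq. (2) says that the hyper-gradient for group i is simply the negative inner product of that group's current and previous gradients. A minimal sketch of one group-wise update, assuming plain gradient descent as the underlying rule (the helper name and values are illustrative):

```python
import numpy as np

def groupwise_hypergrad_step(params, grads, prev_grads, alphas, beta):
    """One combined update: first move each alpha_i down its hyper-gradient
    (Eq. (2)), then apply the usual gradient step for that group."""
    new_params, new_alphas = [], []
    for theta_i, g_i, pg_i, a_i in zip(params, grads, prev_grads, alphas):
        # Eq. (2): df/dalpha_i = -<grad_i at step t-1, grad_i at step t-2>
        hyper_grad = -float(np.sum(g_i * pg_i))
        a_i = a_i - beta * hyper_grad            # hyper-gradient descent on alpha_i
        new_alphas.append(a_i)
        new_params.append(theta_i - a_i * g_i)   # gradient descent on theta_i
    return new_params, new_alphas

# two parameter groups, e.g. a weight matrix and a bias vector
params = [np.ones((2, 2)), np.ones(2)]
grads = [0.5 * np.ones((2, 2)), 0.5 * np.ones(2)]
new_params, new_alphas = groupwise_hypergrad_step(
    params, grads, prev_grads=grads, alphas=[0.1, 0.1], beta=1e-3)
```

Because the two consecutive gradients here have a positive inner product, both group learning rates increase before the parameter step is taken; gradients of opposite direction would shrink them instead.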
We particularly consider two special cases: (1) In layer-wise adaptation, θ_i is the weight matrix of the ith layer, and α_i is the learning rate for that layer. (2) In parameter-wise adaptation, θ_i corresponds to a single parameter of the model, which can be an element of the weight matrix in a certain layer.
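As an illustration of these granularities (not tied to any particular architecture), the layer-wise, unit-wise, and parameter-wise cases differ only in the shape of the learning rate attached to a layer's weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # weight matrix of one layer (4 units, 3 inputs)
g = rng.standard_normal((4, 3))   # gradient of the loss w.r.t. W

# layer-wise: one scalar learning rate shared by the whole layer
W_layer = W - 0.01 * g

# unit-wise: one learning rate per output unit (row), broadcast across the row
alpha_unit = np.full((4, 1), 0.01)
W_unit = W - alpha_unit * g

# parameter-wise: alpha has the same shape as W; the update is element-wise
alpha_param = np.full_like(W, 0.01)
W_param = W - alpha_param * g

# with identical initial rates, all three granularities give the same step
assert np.allclose(W_layer, W_unit) and np.allclose(W_layer, W_param)
```

The three variants only diverge once the hyper-gradient updates push the individual entries of the learning rate array in different directions.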

