ADAPTIVE HIERARCHICAL HYPER-GRADIENT DESCENT

Abstract

In this study, we investigate learning rate adaptation at different levels based on the hyper-gradient descent framework and propose a method that adaptively learns the optimizer parameters by combining adaptations at different levels. We also show the relationship between regularizing over-parameterized learning rates and building combinations of adaptive learning rates at different levels. Experiments on several network architectures, including feed-forward networks, LeNet-5, and ResNet-18/34, show that the proposed multi-level adaptive approach can significantly outperform baseline adaptive methods in a variety of circumstances.

1. INTRODUCTION

The basic optimization algorithm for training deep neural networks is the gradient descent method (GD), which includes stochastic gradient descent (SGD), mini-batch gradient descent, and batch gradient descent. The model parameters are updated according to the first-order gradients of the empirical risk with respect to the parameters being optimized, while back-propagation is used to calculate those gradients (Ruder, 2016). Naïve gradient descent methods apply fixed learning rates without any adaptation mechanism. However, given how the available information changes during the learning process, SGD with a fixed learning rate can be inefficient and requires substantial computing resources for hyper-parameter search. One solution is to introduce learning rate adaptation. This idea can be traced back to work on gain adaptation for connectionist learning methods (Sutton, 1992) and related extensions to non-linear cases (Schraudolph, 1999; Yu et al., 2006). In recent years, optimizers with adaptive updating rules were developed in the context of deep learning, although their base learning rates remain fixed during training. These methods include AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), and Adam (Kingma and Ba, 2015). In addition, there are optimizers aiming to address the convergence issue in Adam (Reddi et al., 2018; Luo et al., 2018) and to rectify the variance of the adaptive learning rate (Liu et al., 2019). Other techniques, such as Lookahead, can also achieve variance reduction and stability improvement with negligible extra computational cost (Zhang et al., 2019). Even though adaptive optimizers with fixed learning rates can converge faster than SGD on a wide range of tasks, their updating rules are designed manually and introduce additional hyper-parameters. Another idea is to use objective function information and update the learning rates as trainable parameters.
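To make the contrast between a fixed-learning-rate update and a manually designed adaptive rule concrete, the following is a minimal NumPy sketch of the Adam updating rule (Kingma and Ba, 2015) on a toy quadratic loss. The loss, variable names, and hyper-parameter values are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss f(theta) = 0.5 * ||theta||^2 (illustrative only).
    return theta

theta = np.array([2.0, -1.0])
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8  # commonly used defaults
m = np.zeros_like(theta)  # first-moment (mean) estimate
v = np.zeros_like(theta)  # second-moment (uncentered variance) estimate

for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the moment estimates
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter effective step size: alpha is fixed, but the update is
    # rescaled by the moment estimates, which is the "adaptive" part.
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Note that even here the global step size `alpha` stays constant; only the per-parameter rescaling adapts, which is exactly the limitation the trainable-learning-rate line of work targets.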
These methods build on automatic differentiation, through which hyper-parameters can be optimized with backpropagation (Maclaurin et al., 2015; Baydin et al., 2018). As gradient-based hyper-parameter optimization methods, they can be implemented in an online fashion (Franceschi et al., 2017). With the idea of auto-differentiation, learning rates can be updated in real time using the corresponding derivatives of the empirical risk (Almeida et al., 1998), which can be generalized to all types of optimizers for deep neural networks (Baydin et al., 2017). Another step-size adaptation approach, called "L4", is based on a linearized expansion of the loss function and rescales the gradient to make fixed predicted progress on the loss (Rolinek and Martius, 2018). Furthermore, to address the poor generalization performance of adaptive methods, dynamic bounds on learning rates were introduced to build a gradual transition between adaptive methods and SGD (Luo et al., 2018). Another set of approaches trains an RNN (recurrent neural network) agent to generate the optimal learning rate for the next step given the historical training information, an idea known as "learning to learn".
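The online learning-rate update described above can be sketched as follows, in the spirit of the hyper-gradient rule of Baydin et al. (2017): the derivative of the loss with respect to the learning rate reduces to the dot product of consecutive gradients, so the learning rate itself takes a gradient step. The toy loss and hyper-parameter values are illustrative assumptions, not from this paper.

```python
import numpy as np

def f(theta):
    # Toy quadratic loss (illustrative only).
    return 0.5 * np.sum(theta ** 2)

def grad_f(theta):
    return theta

theta = np.array([5.0, -3.0])
alpha = 0.01   # initial learning rate, now a trainable quantity
beta = 0.001   # hyper-learning-rate: the step size for updating alpha
prev_grad = np.zeros_like(theta)

for _ in range(100):
    g = grad_f(theta)
    # Hyper-gradient: d(loss)/d(alpha) = -g . prev_grad, so gradient
    # descent on alpha adds beta * (g . prev_grad).
    alpha += beta * (g @ prev_grad)
    theta -= alpha * g  # ordinary SGD step with the adapted learning rate
    prev_grad = g
```

The only extra cost over plain SGD is storing the previous gradient and one dot product per step, which is what makes the online formulation practical.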

