MLR-SNET: TRANSFERABLE LR SCHEDULES FOR HETEROGENEOUS TASKS

Abstract

The learning rate (LR) is one of the most important hyper-parameters in stochastic gradient descent (SGD) for training and generalization of deep neural networks (DNNs). However, current hand-designed LR schedules must manually pre-specify a fixed form, which limits their ability to adapt to non-convex optimization problems whose training dynamics vary significantly. Meanwhile, a proper LR schedule must always be searched from scratch for each new task. To address these issues, we propose to parameterize LR schedules with an explicit mapping formulation, called MLR-SNet. This learnable structure gives MLR-SNet the flexibility to learn an LR schedule that complies with the training dynamics of the DNN. Experiments on image and text classification benchmarks substantiate the capability of our method to achieve proper LR schedules. Moreover, the meta-learned MLR-SNet is plug-and-play and generalizes to new heterogeneous tasks. We transfer our meta-trained MLR-SNet to tasks with different training epochs, network architectures, and datasets, including the large-scale ImageNet dataset, and achieve performance comparable to hand-designed LR schedules. Finally, MLR-SNet achieves better robustness when training data are biased with corrupted noise.

1. INTRODUCTION

Stochastic gradient descent (SGD) and its many variants (Robbins & Monro, 1951; Duchi et al., 2011; Zeiler, 2012; Tieleman & Hinton, 2012; Kingma & Ba, 2015) have served as the cornerstone of modern machine learning with big data. It has been empirically shown that DNNs achieve state-of-the-art generalization performance on a wide variety of tasks when trained with SGD (Zhang et al., 2017). Several recent studies observe that SGD tends to select so-called flat minima (Hochreiter & Schmidhuber, 1997a; Keskar et al., 2017), which seem to generalize better in practice. Scheduling the learning rate (LR) is one of the most widely studied ways to improve SGD training for DNNs. Specifically, it has been experimentally studied how the LR influences the minima found by SGD (Jastrzebski et al., 2017). Theoretically, Wu et al. (2018a) show that the LR plays an important role in minima selection from a dynamical-stability perspective, and He et al. (2019) provide a PAC-Bayes generalization bound for DNNs trained by SGD that is correlated with the LR. In short, a proper LR schedule strongly influences the generalization performance of DNNs, and finding one has been widely studied recently (Bengio, 2012; Schaul et al., 2013; Nar & Sastry, 2018). There mainly exist three kinds of hand-designed LR schedules: (1) Pre-defined LR policies are most common in current DNN training, such as decaying or cyclic LRs (Gower et al., 2019; Loshchilov & Hutter, 2017), and bring large improvements in training efficiency. Some theoretical works suggest that decaying schedules can yield faster convergence (Ge et al., 2019; Davis et al., 2019) or avoid strict saddles (Lee et al., 2019; Panageas et al., 2019) under mild conditions.
(2) LR search methods from traditional convex optimization (Nocedal & Wright, 2006) can be extended to DNN training by searching the LR adaptively at each step, such as Polyak's update rule (Rolinek & Martius, 2018), the Frank-Wolfe algorithm (Berrada et al., 2019), and Armijo line search (Vaswani et al., 2019). (3) Adaptive gradient methods (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2015), such as Adam, adapt the LR for each parameter separately according to gradient information. Although the above LR schedules (as depicted in Fig. 1(a) and 1(b)) can achieve competitive results on their learning tasks, they still have evident deficiencies in practice. On the one hand, these policies must manually pre-specify the form of the LR schedule, limiting their flexibility to adapt to non-convex optimization problems whose training dynamics vary significantly. On the other hand, when solving new heterogeneous tasks, a proper LR schedule must always be searched from scratch, and its associated hyper-parameters re-tuned. This process is expensive in time and computation, which further raises the difficulty of applying these methods to real problems. To alleviate the aforementioned issues, this paper presents a model that learns a plug-and-play LR schedule. The main idea is to parameterize the LR schedule as an LSTM network (Hochreiter & Schmidhuber, 1997b), which is capable of capturing the long-term information dependencies involved. As shown in Fig. 1(c), the proposed Meta-LR-Schedule-Net (MLR-SNet) learns an explicit loss-LR mapping. In a nutshell, this paper makes the following three-fold contributions. (1) We propose MLR-SNet to learn an adaptive LR schedule, which adjusts the LR based on the current training loss as well as information from past training history stored in the MLR-SNet.
Due to its parameterized form, MLR-SNet is more flexible than hand-designed policies in finding a proper LR schedule for a specific learning task. Fig. 1(d) and 1(e) show our learned LR schedules, which follow similar tendencies to pre-defined policies but exhibit more local variation. This validates the efficacy of our method in adaptively adjusting the LR according to training dynamics. (2) With its explicit parameterized structure, the meta-trained MLR-SNet can be transferred to new heterogeneous tasks (the meta-test stage), including different training epochs, network architectures, and datasets. Experimental results verify that our plug-and-play LR schedules achieve comparable performance while requiring no hyper-parameters, unlike traditional LR schedules. This potentially saves substantial labor and computation cost in real-world applications. (3) MLR-SNet is meta-learned to improve generalization performance on unseen data. We validate that, with the guidance of clean data, MLR-SNet achieves better robustness than hand-designed LR schedules when training data are biased with corrupted noise.
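To make the loss-to-LR mapping of contribution (1) concrete, the sketch below implements a tiny LSTM cell in plain NumPy that consumes the current training loss L_t and emits an LR α_t in (0, max_lr). The class name `MLRSNetSketch`, the hidden size, the random initialization, and the sigmoid output squashing are illustrative assumptions, not the paper's exact architecture (which is a meta-trained LSTM).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLRSNetSketch:
    """Minimal LSTM-based LR scheduler: loss L_t in -> LR alpha_t out."""

    def __init__(self, hidden=10, max_lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n = hidden
        # One LSTM cell; the input is the scalar loss, so weights are (4n, n+1).
        self.W = rng.normal(0.0, 0.1, size=(4 * n, n + 1))
        self.b = np.zeros(4 * n)
        self.v = rng.normal(0.0, 0.1, size=n)  # linear output head
        self.h = np.zeros(n)                   # hidden state (stores training history)
        self.c = np.zeros(n)                   # cell state
        self.max_lr = max_lr
        self.n = n

    def step(self, loss):
        """Map the current training loss to a learning rate."""
        x = np.concatenate(([loss], self.h))
        z = self.W @ x + self.b
        n = self.n
        i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
        g = np.tanh(z[3 * n:])
        self.c = f * self.c + i * g
        self.h = o * np.tanh(self.c)
        # Squash to (0, max_lr) so the predicted LR is always a valid step size.
        return self.max_lr * sigmoid(self.v @ self.h)
```

In a training loop one would call `alpha = sched.step(loss)` after each mini-batch and apply the SGD update `w -= alpha * grad`; the meta-training of the weights `W`, `b`, `v` on validation data is omitted here.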

2. RELATED WORK

Figure 1: Pre-set LR schedules for (a) image and (b) text classification. (c) Visualization of how we input the current loss L_t to MLR-SNet, which then outputs a proper LR α_t to help SGD find a better minimum. LR schedules learned by MLR-SNet on (d) image and (e) text classification. (f) We transfer LR schedules learned on CIFAR-10 to image (CIFAR-100) and text (Penn Treebank) classification; the subfigure shows the predicted LR during training.

Meta learning for optimization. Meta learning has a long history in psychology (Ward, 1937; Lake et al., 2017). Meta learning for optimization dates back to the 1980s-1990s (Schmidhuber, 1992; Bengio et al., 1991), aiming to meta-learn the optimization process of learning itself. Recently, Lv et al. (2017) have attempted to scale this idea to larger DNN optimization problems. The main idea is to construct a meta-learner as the optimizer, which takes the gradients as input and outputs the updating rules. These approaches aim to select appropriate training algorithms, schedule the LR, and tune other hyper-parameters in an automatic way. Beyond continuous optimization, some works apply these ideas to other optimization problems, such as black-box functions (Chen et al., 2017), few-shot learning (Li et al., 2017), model curvature (Park & Oliva, 2019), evolution strategies (Houthooft et al., 2018), and combinatorial functions (Rosenfeld et al., 2018). Though faster at decreasing the training loss than traditional optimizers in some cases, the learned optimizers may not always generalize well to diverse problems, especially longer horizons (Lv et al., 2017) and large-scale optimization problems (Wichrowska et al., 2017). Moreover, they cannot be readily transferred to new heterogeneous tasks.
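To illustrate the "learning to optimize" scheme described above, where a meta-learner takes gradients as input and outputs the updating rules, here is a minimal coordinate-wise sketch. In the cited works the learned function is an LSTM trained by meta-learning; for brevity this sketch substitutes a two-layer MLP with hypothetical weights `theta`, and the gradient preprocessing is simplified to a running average.

```python
import numpy as np

def learned_optimizer_step(params, grads, theta, state):
    """One step of a coordinate-wise learned optimizer (sketch).

    A small learned function g_theta maps each coordinate's gradient and a
    running per-coordinate state to an additive parameter update. The weights
    theta = (W1, b1, w2, b2) are hypothetical; in practice they would be
    meta-trained on the final loss of many training runs.
    """
    W1, b1, w2, b2 = theta
    # Running average of gradients acts as a tiny per-coordinate memory.
    state = 0.9 * state + 0.1 * grads
    feats = np.stack([grads, state], axis=-1)  # (n_params, 2)
    hidden = np.tanh(feats @ W1 + b1)          # (n_params, h)
    update = hidden @ w2 + b2                  # (n_params,)
    return params + update, state
```

This replaces the hand-designed update rule (e.g. `params - lr * grads`) entirely, which is precisely why such optimizers can struggle to generalize beyond the problem distribution they were meta-trained on.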

