MLR-SNET: TRANSFERABLE LR SCHEDULES FOR HETEROGENEOUS TASKS

Abstract

The learning rate (LR) is one of the most important hyper-parameters of stochastic gradient descent (SGD) for deep neural network (DNN) training and generalization. However, current hand-designed LR schedules must manually pre-specify a fixed form, which limits their ability to adapt to non-convex optimization problems due to the significant variation of training dynamics. Meanwhile, a proper LR schedule always needs to be searched from scratch for new tasks. To address these issues, we propose to parameterize LR schedules with an explicit mapping formulation, called MLR-SNet. The learnable structure gives MLR-SNet more flexibility to learn a proper LR schedule that complies with the training dynamics of DNNs. Experiments on image and text classification benchmarks substantiate the capability of our method to achieve proper LR schedules. Moreover, the meta-learned MLR-SNet is plug-and-play and generalizes to new heterogeneous tasks. We transfer our meta-trained MLR-SNet to tasks with different training epochs, network architectures, and datasets, including the large-scale ImageNet dataset, and achieve performance comparable with hand-designed LR schedules. Finally, MLR-SNet achieves better robustness when training data are corrupted with noise.

1. INTRODUCTION

Stochastic gradient descent (SGD) and its many variants (Robbins & Monro, 1951; Duchi et al., 2011; Zeiler, 2012; Tieleman & Hinton, 2012; Kingma & Ba, 2015) have served as the cornerstone of modern machine learning with big data. It has been empirically shown that DNNs achieve state-of-the-art generalization performance on a wide variety of tasks when trained with SGD (Zhang et al., 2017). Several recent studies observe that SGD tends to select so-called flat minima (Hochreiter & Schmidhuber, 1997a; Keskar et al., 2017), which seem to generalize better in practice. Scheduling the learning rate (LR) for SGD is one of the most widely studied ways to improve SGD training of DNNs. Specifically, Jastrzebski et al. (2017) experimentally studied how the LR influences the minima found by SGD. Theoretically, Wu et al. (2018a) show that the LR plays an important role in minima selection from a dynamical-stability perspective, and He et al. (2019) provide a PAC-Bayes generalization bound for DNNs trained by SGD that is correlated with the LR. In short, finding a proper LR schedule strongly influences the generalization performance of DNNs, which has been widely studied recently (Bengio, 2012; Schaul et al., 2013; Nar & Sastry, 2018). There mainly exist three kinds of hand-designed LR schedules: (1) Pre-defined LR policies are mostly used in current DNN training, such as decaying or cyclic LR (Gower et al., 2019; Loshchilov & Hutter, 2017), and bring large improvements in training efficiency. Some theoretical works suggest that decaying schedules can yield faster convergence (Ge et al., 2019; Davis et al., 2019) or avoid strict saddles (Lee et al., 2019; Panageas et al., 2019) under mild conditions.
(2) LR search methods from traditional convex optimization (Nocedal & Wright, 2006) can be extended to DNN training by searching the LR adaptively at each step, such as Polyak's update rule (Rolinek & Martius, 2018), the Frank-Wolfe algorithm (Berrada et al., 2019), and Armijo line search (Vaswani et al., 2019). (3) Adaptive gradient methods like Adam (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2015) adapt the LR for each parameter separately according to gradient information. Although the above LR schedules (as depicted in Fig. 1(a) and 1(b)) achieve competitive results on their learning tasks, they still have evident deficiencies in practice. On the one hand, these policies must manually pre-specify the form of the LR schedule, and thus have limited flexibility to adapt to non-convex optimization problems with significantly varying training dynamics. On the other hand, when solving new heterogeneous tasks, a proper LR schedule always needs to be searched from scratch, and the associated hyper-parameters need to be tuned. This process is time- and computation-expensive, which further raises the application difficulty in real problems. To alleviate these issues, this paper presents a model that learns a plug-and-play LR schedule. The main idea is to parameterize the LR schedule as an LSTM network (Hochreiter & Schmidhuber, 1997b), which is capable of dealing with such long-term information dependent problems. As shown in Fig. 1(c), the proposed Meta-LR-Schedule-Net (MLR-SNet) learns an explicit loss-LR dependent relationship. In a nutshell, this paper makes the following three-fold contributions. (1) We propose MLR-SNet to learn an adaptive LR schedule, which adjusts the LR based on the current training loss as well as information from past training history stored in the MLR-SNet.
Due to its parameterized form, MLR-SNet can be more flexible than hand-designed policies in finding a proper LR schedule for the specific learning task. Fig. 1(d) and 1(e) show our learned LR schedules, which have a similar tendency to pre-defined policies but more local variations. This validates the efficacy of our method in adaptively adjusting the LR according to training dynamics. (2) With an explicit parameterized structure, the meta-trained MLR-SNet can be transferred to new heterogeneous tasks (the meta-test stage), including different training epochs, network architectures, and datasets. Experimental results verify that our plug-and-play LR schedules achieve comparable performance while requiring no hyper-parameters, unlike traditional LR schedules. This potentially saves substantial labor and computation cost in real-world applications. (3) MLR-SNet is meta-learned to improve generalization performance on unseen data. We validate that, with the guidance of clean data, our MLR-SNet achieves better robustness than hand-designed LR schedules when training data are corrupted with noise.

2. RELATED WORK

Meta learning for optimization. Meta learning has a long history in psychology (Ward, 1937; Lake et al., 2017). Meta learning for optimization dates back to the 1980s-1990s (Schmidhuber, 1992; Bengio et al., 1991), aiming to meta-learn the optimization process of learning itself. Recently, Andrychowicz et al. (2016); Ravi & Larochelle (2017); Chen et al. (2017); Wichrowska et al. (2017); Li & Malik (2017); Lv et al. (2017) have attempted to scale this idea to larger DNN optimization problems. The main idea is to construct a meta-learner as the optimizer, which takes the gradients as input and outputs the whole update rule. These approaches aim to automate the selection of training algorithms, the scheduling of LR, and the tuning of other hyper-parameters. Beyond continuous optimization, some works apply these ideas to other problems, such as black-box functions (Chen et al., 2017), few-shot learning (Li et al., 2017), model curvature (Park & Oliva, 2019), evolution strategies (Houthooft et al., 2018), and combinatorial functions (Rosenfeld et al., 2018). Though faster at decreasing the training loss than traditional optimizers in some cases, the learned optimizers may not generalize well to diverse problems, especially longer horizons (Lv et al., 2017) and large-scale optimization problems (Wichrowska et al., 2017). Moreover, they cannot be guaranteed to output a proper descent direction at each iteration of DNN training, since they assume all parameters share one small net and ignore the relationships between parameters. Our proposed method attempts to learn an adaptive LR schedule rather than the whole update rule. This makes it easy to learn, and the meta-learned LR schedule can be transferred to new heterogeneous tasks. HPO and LR schedule adaptation.
Hyper-parameter optimization (HPO) has historically been investigated by selecting proper values of algorithm hyper-parameters to obtain better performance on a validation set (see (Hutter et al., 2019) for an overview). Typical methods include grid search, random search (Bergstra & Bengio, 2012), Bayesian optimization (Snoek et al., 2012), and gradient-based methods (Franceschi et al., 2017; Shu et al., 2020a; b). Recently, some works attempt to find a proper LR schedule under the gradient-based HPO framework, which can be solved by bilevel optimization (Franceschi et al., 2017; Baydin et al., 2018). However, most HPO techniques tend to suffer from short-horizon bias and easily find a bad minimum (Wu et al., 2018b). Our MLR-SNet has an explicit functional form, which makes the optimization of LR schedules more robust and effective. Transfer to heterogeneous tasks. Transfer learning (Pan & Yang, 2009) aims to transfer knowledge obtained from a source task to help learning on a target task. Most transfer learning methods assume the source and target tasks share the same instance, feature, or model spaces (Yang et al., 2020), which greatly limits their applications. Recently, meta learning (Finn et al., 2017) aims to learn common knowledge shared over a distribution of tasks, such that the learned knowledge can transfer to unseen heterogeneous tasks. Most meta learning approaches focus on the few-shot learning framework, while we attempt to extend it to a standard learning framework. Hand-designed LR schedules and HPO methods only try to find a proper LR schedule for a given task and need to be learned from scratch for new tasks. In contrast, our meta-learned MLR-SNet is plug-and-play, directly transferring how to schedule the LR for SGD to heterogeneous tasks without additional learning.

3. THE PROPOSED META-LR-SCHEDULE-NET (MLR-SNET) METHOD

The problem of training a DNN can be formulated as the following non-convex optimization problem:

\min_{w \in \mathbb{R}^n} L^{Tr}(D^{Tr}; w) := \frac{1}{N} \sum_{i=1}^{N} L_i^{Tr}(w),   (1)

where L_i^{Tr} is the training loss for sample i \in D^{Tr} = \{1, 2, \cdots, N\}, which characterizes the deviation of the model prediction from the data, and w \in \mathbb{R}^n represents the parameters of the model (e.g., the weight matrices of a DNN) to be optimized. SGD (Robbins & Monro, 1951; Polyak, 1964) and its variants, including Momentum (Tseng, 1998), Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2015), are often used for training DNNs. In general, these algorithms can be summarized as

w_{t+1} = w_t + \Delta w_t, \quad \Delta w_t = \mathcal{O}_t(\nabla L^{Tr}(w_t), H_t; \Theta_t),   (2)

where w_t denotes the model parameters at the t-th update, \nabla L^{Tr}(w_t) denotes the gradient of L^{Tr} at w_t, H_t represents historical gradient information, and \Theta_t is the hyper-parameter of the optimizer \mathcal{O} (e.g., the LR). To present our method, we focus on the following vanilla SGD formulation:

w_{t+1} = w_t - \alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla L_i^{Tr}(w_t),   (3)

where B_t \subset D^{Tr} denotes a mini-batch randomly sampled from the training set, |B_t| denotes its size, \nabla L_i^{Tr}(w_t) denotes the gradient of sample i computed at w_t, and \alpha_t is the LR at the t-th iteration.
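The vanilla SGD update of Eq. (3) can be sketched on a toy problem. The following NumPy snippet is illustrative only: the least-squares objective, batch size, and decaying schedule are our own choices, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: L_i(w) = 0.5 * (x_i @ w - y_i)^2 (illustrative choice).
X = rng.normal(size=(256, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def batch_grad(w, idx):
    """Mini-batch gradient (1/|B_t|) * sum_{i in B_t} grad L_i(w)."""
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

w = np.zeros(5)
for t in range(1000):
    alpha_t = 0.1 * 0.995 ** t              # a hand-designed decaying LR schedule
    idx = rng.choice(len(X), size=32)       # B_t: random mini-batch
    w = w - alpha_t * batch_grad(w, idx)    # Eq. (3): w_{t+1} = w_t - alpha_t * mean grad

print(np.linalg.norm(w - w_true))  # should be small
```

A pre-defined schedule like the `0.1 * 0.995 ** t` above is exactly the kind of fixed-form policy that MLR-SNet replaces with a learned mapping.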

3.1. EXISTING LR SCHEDULE STRATEGIES

As Bengio (2012) demonstrated, the choice of LR remains central to effective DNN training with SGD. As mentioned in Section 1, a variety of hand-designed LR schedules have been proposed. Though they achieve competitive results on some learning tasks, they share several drawbacks: (1) pre-defined LR schedules have limited flexibility to adapt to the significantly changing training dynamics of non-convex optimization problems; (2) a proper LR schedule needs to be searched from scratch for each new task, which raises their application difficulty in real problems. Inspired by recent meta-learning developments (Finn et al., 2017; Shu et al., 2018; 2019), some works propose to learn a generic optimizer from data (Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Chen et al., 2017; Wichrowska et al., 2017; Li & Malik, 2017; Lv et al., 2017). The main idea is to learn a meta-learner as the optimizer that guides the whole update rule. For example, Andrychowicz et al. (2016) replace Eq. (2) with

w_{t+1} = w_t + g_t, \quad [g_t, h_{t+1}]^T = m(\nabla_t, h_t; \phi),   (4)

where g_t is the output of an LSTM net m, parameterized by \phi, whose state is h_t. This strategy selects appropriate training algorithms, schedules the LR, and tunes other hyper-parameters in a unified and automatic way. Though faster at decreasing the training loss than traditional optimizers in some cases, the learned optimizer may not generalize well to more varied and diverse problems, such as longer horizons (Lv et al., 2017) and large-scale optimization problems (Wichrowska et al., 2017). Moreover, it cannot be guaranteed to output a proper descent direction at each iteration of network training. This tends to further increase their application difficulty in real problems.
Recently, some methods (Franceschi et al., 2017; Baydin et al., 2018) consider the following constrained optimization problem to search for the optimal LR schedule \alpha^* such that the produced model attains a small validation error:

\min_{\alpha = \{\alpha_0, \cdots, \alpha_{T-1}\}} L^{Val}(D^{Val}; w_T), \quad \text{s.t.} \quad w_{t+1} = \phi_t(w_t, \alpha_t), \quad t = 0, 1, \cdots, T-1,   (5)

where L^{Val} denotes the validation loss, D^{Val} = \{1, 2, \cdots, M\} denotes the hold-out validation set, \alpha is the hyper-parameter to be solved, \phi_t : \mathbb{R}^n \times \mathbb{R}^+ \to \mathbb{R}^n is a stochastic weight update dynamic such as the update rule in Eq. (2) or the vanilla SGD in Eq. (3), and T is the maximum number of iterations. Though achieving results comparable to hand-designed LR schedules on some tasks, these methods cannot directly transfer to new tasks, since they do not have an explicit, transferable structural form.

3.2. PROPOSED META-LR-SCHEDULE-NET (MLR-SNET) METHOD

To address the aforementioned issues, our main idea is to design a meta-learner with an explicit mapping formulation to parameterize LR schedules, called MLR-SNet, as shown in Fig. 1(c). The parameterized structure brings two benefits: 1) it gives more flexibility to learn a proper LR schedule that complies with the significantly changing training dynamics of DNNs; 2) it makes the meta-learned LR schedules transferable and plug-and-play, so they can be applied to new heterogeneous tasks. Formulation of MLR-SNet. The computational graph of MLR-SNet is depicted in Fig. 2(a). Let \mathcal{A}(\cdot; \theta) denote the MLR-SNet; the SGD update in Eq. (3) can then be rewritten as

w_{t+1} = w_t - \mathcal{A}(L_t; \theta_t) \frac{1}{|B_t|} \sum_{i \in B_t} \nabla L_i^{Tr}(w_t), \quad L_t = \frac{1}{|B_t|} \sum_{i \in B_t} L_i^{Tr}(w_t),   (6)

where \theta_t is the parameter of MLR-SNet at the t-th iteration (t = 0, \cdots, T-1). At any iteration, \mathcal{A}(\cdot; \theta) learns an explicit loss-LR dependent relationship, so that the net can adaptively predict the LR according to the current input loss L_t as well as the historical information stored in the net. At each iteration, the whole forward computation is (as shown in Fig. 2(b)):

[i_t, f_t, o_t, g_t]^T = [\sigma, \sigma, \sigma, \tanh]^T \left( W_2 \, \mathrm{ReLU}(W_1 [h_{t-1}, L_t]^T) \right), \quad c_t = f_t \odot c_{t-1} + i_t \odot g_t, \quad h_t = o_t \odot \tanh(c_t), \quad p_t = \sigma(W_3 h_t), \quad \alpha_t = \gamma \cdot p_t,   (7)

where i_t, f_t, o_t denote the input, forget, and output gates, respectively. Different from a vanilla LSTM, the input h_{t-1} and the training loss L_t are first processed by a fully-connected layer W_1 with a ReLU activation. The cell then operates as an LSTM and produces the output h_t. After that, the predicted value p_t is obtained by a linear transform W_3 on h_t followed by a Sigmoid activation. Finally, we introduce a scale factor \gamma to guarantee that the final predicted LR lies in the interval [0, \gamma].
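The forward pass of Eq. (7) can be sketched in NumPy as follows. The hidden size, scale factor γ, and random initialization below are our own illustrative assumptions; the paper's exact hyper-parameters may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MLRSNetSketch:
    """Illustrative sketch of the MLR-SNet forward pass in Eq. (7).
    Sizes and initialization are assumptions, not the paper's settings."""

    def __init__(self, hidden=16, gamma=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.gamma = gamma
        # W1: preprocess [h_{t-1}; L_t] with ReLU; W2: produce the 4 LSTM gates; W3: output head.
        self.W1 = rng.normal(scale=0.1, size=(hidden, hidden + 1))
        self.W2 = rng.normal(scale=0.1, size=(4 * hidden, hidden))
        self.W3 = rng.normal(scale=0.1, size=(1, hidden))
        self.h = np.zeros(hidden)   # h_{t-1}
        self.c = np.zeros(hidden)   # c_{t-1}

    def step(self, loss):
        """Predict the LR alpha_t from the current training loss L_t."""
        x = np.maximum(0.0, self.W1 @ np.append(self.h, loss))       # ReLU(W1 [h_{t-1}; L_t])
        i, f, o, g = np.split(self.W2 @ x, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)  # gate activations
        self.c = f * self.c + i * g
        self.h = o * np.tanh(self.c)
        p = sigmoid(self.W3 @ self.h)[0]                             # p_t in (0, 1)
        return self.gamma * p                                        # alpha_t in (0, gamma)

net = MLRSNetSketch()
alpha = net.step(loss=2.3)
assert 0.0 < alpha < net.gamma
```

Because the cell state (`self.h`, `self.c`) persists across calls, successive `step` calls condition the predicted LR on the training history, as described in the text.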
Albeit simple, this net is known for handling such long-term information dependent problems, and is thus capable of finding a proper LR schedule that complies with the complex variations of training dynamics. Meta-Train: adapting to the training dynamics of DNNs. The MLR-SNet can be meta-trained to improve the generalization performance of DNN training on unseen validation data as follows:

\min_\theta L^{Val}(D^{Val}; w_T), \quad \text{s.t.} \quad w_{t+1} = \phi_t(w_t, \mathcal{A}(L_t; \theta)), \quad t = 0, 1, \cdots, T-1.   (8)

The important question now is how to efficiently meta-learn the parameter \theta of the MLR-SNet. We employ the online approximation technique of (Shu et al., 2019) to jointly update \theta and the model parameter w, so as to explore a proper LR schedule with better generalization for DNN training. However, step-wise optimization of \theta is still expensive for large-scale datasets and DNNs, so we update \theta only after updating w for several steps (T_{val}), as summarized in Algorithm 1. Updating \theta. When the updating condition is not met, \theta is fixed; otherwise, \theta is updated using the model parameter w_t and the MLR-SNet parameter \theta_t obtained in the last step, by minimizing the validation loss defined in Eq. (8). Adam can be employed to optimize the validation loss, i.e.,

\theta_{t+1} = \theta_t + \mathrm{Adam}(\nabla_\theta L^{Val}(D_m; \hat{w}_{t+1}(\theta)); \eta_t),   (9)

where the input of Adam is the gradient of the validation loss with respect to the MLR-SNet parameter \theta on a mini-batch D_m of m samples from D^{Val}, and \eta_t denotes the LR of Adam. \hat{w}_{t+1}(\theta) is formulated on a mini-batch D_n of training samples from D^{Tr} as

\hat{w}_{t+1}(\theta) = w_t - \mathcal{A}(L^{Tr}(D_n; w_t); \theta) \cdot \nabla_w L^{Tr}(D_n; w)\big|_{w_t}.   (10)

Updating w. The updated \theta_{t+1} is then employed to ameliorate the model parameter w, i.e.,

w_{t+1} = w_t - \mathcal{A}(L^{Tr}(D_n; w_t); \theta_{t+1}) \cdot \nabla_w L^{Tr}(D_n; w)\big|_{w_t}.   (11)

The whole meta-train learning algorithm is summarized in Algorithm 1.
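The alternating update of Eqs. (9)-(11) can be illustrated on a toy problem. To keep the hyper-gradient analytic, the sketch below replaces the LSTM meta-learner by a single learnable scalar θ with LR α = γ·sigmoid(θ), and plain gradient descent stands in for Adam; it shows the structure of the meta-train loop (virtual step, θ update, w update), not the full method. All functions and constants are our own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
target = rng.normal(size=4)

# Toy quadratic losses (illustrative): train and validation share the same optimum.
def grad_tr(w):  return w - target          # grad of 0.5 * ||w - target||^2
def loss_val(w): return 0.5 * np.sum((w - target) ** 2)
def grad_val(w): return w - target

gamma, eta = 1.0, 0.5
w = np.zeros(4)
theta = 0.0                                  # scalar stand-in for the MLR-SNet parameter

for t in range(100):
    g = grad_tr(w)
    s = sigmoid(theta)
    w_hat = w - gamma * s * g                # virtual step w_hat(theta), cf. Eq. (10)
    # Hyper-gradient d L_val(w_hat) / d theta via the chain rule (analytic here).
    dtheta = grad_val(w_hat) @ (-g) * gamma * s * (1.0 - s)
    theta -= eta * dtheta                    # update theta, cf. Eq. (9) (SGD in place of Adam)
    w = w - gamma * sigmoid(theta) * grad_tr(w)  # update w with the new LR, cf. Eq. (11)

assert loss_val(w) < 1e-3
```

In the paper, the virtual step (10) is differentiated through by automatic differentiation instead of by hand, and θ is only updated every T_val iterations.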
All gradient computations can be efficiently implemented by automatic differentiation libraries such as PyTorch (Paszke et al., 2019). Meta-Test: transferring to heterogeneous tasks. Once the meta-learned MLR-SNet is obtained, it can easily be applied to new tasks. The model parameter u for the new task is updated by

u_{t+1} = u_t - \mathcal{A}(L^{Tr}(D_n; u_t); \theta^*) \cdot \nabla_u L^{Tr}(D_n; u)\big|_{u_t},   (12)

where \theta^* is the parameter of the meta-learned MLR-SNet, which is fixed during the meta-test stage.

4. EXPERIMENTAL RESULTS

To evaluate the proposed MLR-SNet, we first conduct experiments to show that our method is capable of finding proper LR schedules compared with baseline methods. Then we transfer the learned LR schedules to various tasks to show their superiority in generalization. Finally, we show that our method behaves robustly and stably when training data contain different corruptions. Baselines. For image classification tasks, the compared methods include SGD with hand-designed LR schedules: 1) Fixed LR, 2) Exponential decay, 3) MultiStep decay, 4) SGD with restarts (SGDR) (Loshchilov & Hutter, 2017). We also compare with SGD with Momentum (SGDM) under the above four LR schedules, with momentum fixed at 0.9. Meanwhile, we compare with the adaptive gradient method 5) Adam, the LR search method 6) L4 (Rolinek & Martius, 2018), and current LR schedule adaptation methods: 7) hyper-gradient descent (HD) (Baydin et al., 2018) and 8) real-time hyper-parameter optimization (RTHO) (Franceschi et al., 2017). For text classification tasks, we compare with 1) SGD and 2) Adam with the LR tuned using a validation set; they drop the LR by a factor of 4 when the validation loss stops decreasing. We also compare with 3) L4, 4) HD, and 5) RTHO. We run all experiments with 3 different seeds and report accuracy. Detailed experimental settings and more experimental results are presented in Appendix B. Image tasks. Extra tuning is necessary for better performance. 6) L4 greedily searches the LR locally to decrease the loss, but the complex DNN training dynamics cannot guarantee that it reaches a good minimum. 7) HD and RTHO achieve performance similar to hand-designed LR schedules. The LR schedules learned by L4, HD, and RTHO can be found in the supplementary material. Though not using extra historical gradient information to help optimization, our method achieves results comparable with the baselines by finding a proper LR schedule for SGD. Text tasks. Fig. 4(c) and 4(f) show the test perplexity on Penn Treebank with 2-layer and 3-layer LSTMs, respectively. Adam and SGD heuristically drop the LR when the validation loss stops decreasing. In contrast, our MLR-SNet predicts the LR according to the training dynamics by minimizing the validation loss, which is a more intelligent way to employ the validation set; thus our method achieves comparable or even better performance than Adam and SGD. The learned LR schedules of MLR-SNet are presented in Fig. 1(b) and have a similar shape to the hand-designed policies. L4 often falls into a bad minimum since it greedily searches the LR locally. HD and RTHO directly optimize the LR to improve validation performance, obtaining results similar to Adam and SGD. With an explicit structure, our method behaves more robustly and efficiently than HD and RTHO. Remark. The performance of the hand-designed LR schedules can be regarded as an upper performance bound, since these strategies have been verified to work well for the specific tasks and are written into standard deep learning libraries. For different image and text tasks, our MLR-SNet achieves similar or even slightly better performance compared with the best baselines, demonstrating the effectiveness and generality of our method.

4.2. META-TEST: TRANSFERABILITY OF PLUG-AND-PLAY LR SCHEDULES

The learned MLR-SNet is transferable and plug-and-play. Here we validate whether MLR-SNet can transfer to new heterogeneous tasks. Since the methods L4, HD, and RTHO in Section 4.1 are not able to generalize, we do not compare with them here. As our results already show superiority on image tasks over baselines trained with SGD, here we present stronger baselines trained with SGDM. We use the MLR-SNet meta-learned on CIFAR-10 with ResNet-18 in Section 4.1 as the plug-and-play LR schedule for the following experiments. Transfer to different epochs. The plug-and-play MLR-SNet is meta-trained for 200 epochs, and we transfer it to other training budgets, e.g., 100, 400, and 1200 epochs. As shown in Fig. 5, our MLR-SNet is able to train for longer horizons and achieves almost the same performance as MultiStep LR. The slight oscillations at epoch 1200 may be because our MLR-SNet locally learns LR patterns similar to SGDR. The Exponential LR shows a slight performance drop for longer epochs. Transfer to different datasets. We transfer the LR schedules meta-learned on CIFAR-10 to SVHN (Netzer et al., 2011), TinyImageNet, and Penn Treebank (Marcus & Marcinkiewicz). As shown in Fig. 6, though the datasets vary from image to text, our method still obtains relatively stable and comparable generalization performance across tasks with respect to the baseline method. Transfer to different net architectures. We also transfer the LR schedules meta-learned on ResNet-18 to the light-weight nets ShuffleNetV2 (Ma et al., 2018), MobileNetV2 (Sandler et al., 2018), and NASNet. Transfer to large-scale optimization. Training on ImageNet (Deng et al., 2009) is rarely attempted in the existing learning-to-optimize literature: a learned optimizer can typically only be executed for thousands of steps before its loss begins to increase dramatically, far from the optimization process in practice. We transfer the LR schedule meta-trained on CIFAR-10 with ResNet-18 to the ImageNet dataset with ResNet-50. As shown in Fig. 8, the validation accuracy of our method is competitive with hand-designed LR schedule methods. This implies that our method is capable of dealing with such large-scale optimization problems, moving learning-to-optimize ideas towards more practical applications. Table 1 shows the mean test accuracy of 15 models (±std). As can be seen, our proposed MLR-SNet achieves better generalization performance on clean test data than the baseline methods, which implies that our method behaves more robustly and stably than pre-set LR schedules when the learning tasks change. This is because our MLR-SNet has more flexibility to adapt to variations of the data distribution than pre-set LR schedules, and it can find a proper LR schedule by minimizing the generalization error based on the knowledge conveyed by the given validation data. Detailed experimental settings and the transferability experiment of the meta-learned MLR-SNet are presented in the supplementary material.

5. SOME ANALYSIS OF MLR-SNET

5.1. CONVERGENCE ANALYSIS OF MLR-SNET

The preliminary experimental evaluations show that our method achieves good convergence performance on various tasks. We find that the meta-learned LR schedules in our experiments follow a consistent trajectory, as shown in Fig. 1, sharing a similar tendency with exponential LR schedules. To enable a theoretical convergence analysis, we roughly assume that the LR predicted by MLR-SNet obeys an exponential LR form. The convergence analysis of the DNN training itself can refer to (Li et al., 2020). Here, we provide a convergence analysis of the MLR-SNet training; the proof is given in Appendix A:

\min_{0 \le t \le T} \mathbb{E}\left[ \|\nabla L^{Val}(\hat{w}_t(\theta_t))\|_2^2 \right] \le O\left( \frac{C \ln(T)}{T} + \sigma^2 \right),

where C is a constant independent of the convergence process, and \sigma^2 is the variance induced by drawing mini-batches uniformly at random.
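The exponential LR form assumed in the analysis (see Theorem 2 in Appendix A) is \alpha_t = \alpha_0 \beta^t with \beta = (\Gamma/T)^{1/T}, \Gamma \ge 1. A quick sketch makes its endpoints concrete; the numerical values below are illustrative, not from the paper.

```python
# Exponential LR form assumed in the analysis: alpha_t = alpha_0 * beta^t,
# beta = (Gamma / T) ** (1 / T), Gamma >= 1 (values here are illustrative).
alpha_0, Gamma, T = 0.1, 1.0, 1000
beta = (Gamma / T) ** (1.0 / T)
schedule = [alpha_0 * beta ** t for t in range(T + 1)]

# The schedule decays monotonically from alpha_0 down to alpha_0 * Gamma / T,
# since beta^T = Gamma / T by construction.
assert abs(schedule[0] - alpha_0) < 1e-12
assert abs(schedule[T] - alpha_0 * Gamma / T) < 1e-12
```

The choice of \beta thus ties the final LR to the horizon T: with \Gamma = 1, the LR decays from \alpha_0 to \alpha_0 / T over the run.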

5.2. THE STRUCTURE OF THE MLR-SNET

We regard LR scheduling as a long-term information dependent problem, and thus parameterize the LR schedule as an LSTM network. As is known, an MLP network can also learn an explicit mapping, but it ignores temporal information. Here we compare the performance of the two types of meta-learners. As shown in Fig. 9, in the early training stage both achieve similar performance, while in the later training stage the LSTM meta-learner brings a notable performance increase over the MLP meta-learner. This may be because the temporal information accumulated by the LSTM meta-learner helps find a more proper LR for DNN training.

5.3. COMPUTATIONAL COMPLEXITY ANALYSIS

In the meta-training stage, our MLR-SNet learning algorithm can be roughly regarded as requiring two extra full forward and backward passes of the network (step 6 in Algorithm 1) in addition to the normal network parameter update (step 8 in Algorithm 1), together with the forward passes of MLR-SNet for every LR prediction. Compared to normal training, our method therefore needs about 3x the computation time for an iteration that updates the MLR-SNet. Since we only update the MLR-SNet periodically after several iterations, this does not substantially increase the computational complexity compared with normal network training. In the meta-test stage, our transferred LR schedule predicts the LR at each iteration with a small MLR-SNet, whose computational cost is significantly less than that of normal network training. To empirically show the differences between hand-designed LR schedules and our method, we conduct experiments with ResNet-18 on CIFAR-10 and report the running time of all methods. All experiments are implemented on a computer with an Intel Xeon(R) CPU E5-2686 v4 and an NVIDIA GeForce RTX 2080 8GB GPU. We follow the corresponding settings in Section 4.1; results are shown in Figure 10. Except for RTHO, which costs significantly more time, all methods, including MLR-SNet training and testing, have similar time consumption. Our MLR-SNet takes barely any extra time to complete the meta-training and meta-testing phases compared to hand-designed LR schedules; our method is therefore well suited to practical applications.
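The amortized overhead implied by the paragraph above can be worked out with a small cost model. The "1 unit per normal iteration, ~3 units per meta-update iteration" accounting is our own simplification of the passage, with illustrative values of T_val.

```python
# Rough amortized cost model (illustrative): a normal iteration costs 1 unit;
# an iteration that also updates MLR-SNet costs ~3 units (two extra full
# forward/backward passes), and MLR-SNet is updated once every T_val iterations.
def amortized_cost(t_val):
    """Average cost per iteration relative to normal training."""
    return (3 + (t_val - 1) * 1) / t_val

assert amortized_cost(1) == 3.0                 # updating theta every step: ~3x normal training
assert abs(amortized_cost(10) - 1.2) < 1e-12    # every 10 steps: only ~20% overhead
```

This is why updating θ only every T_val iterations keeps meta-training close to the cost of normal training.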

6. CONCLUSION AND DISCUSSION

In this paper, we have proposed to learn an adaptive and transferable LR schedule in a meta-learning manner. To this aim, we design an LSTM-type meta-learner (MLR-SNet) to parameterize LR schedules, which gives more flexibility to adaptively learn a proper LR schedule that complies with the significantly complex training dynamics of DNNs. Meanwhile, the meta-learned LR schedules are plug-and-play and transferable: they can transfer how to schedule the LR for SGD to new heterogeneous tasks. Comprehensive experiments substantiate the superiority of our method on various image and text benchmarks in adaptability, transferability, and robustness, compared with current LR schedule policies. MLR-SNet is highly practical, as it requires a negligible increase in parameter size and computation time, and no transfer cost for new tasks. We believe the proposed method has the potential to become a new tool for studying how to design LR schedules that improve current DNN training, as well as for more practical applications. Recently, Keskar et al. (2017); Dinh et al. (2017) suggested that the width of a local optimum is related to generalization: wider optima lead to better generalization. We use the visualization technique in (Izmailov et al., 2018) to visualize the "width" of the solutions found by different LR schedules on CIFAR-100 with ResNet-18. As shown in Fig. 3, our method lies in a wide, flat region of the training loss, which could explain its better generalization compared with pre-set LR schedules. Deeper understanding of this point will be further investigated.

A CONVERGENCE ANALYSIS OF THE MLR-SNET

Lemma 1 Suppose the loss function is Lipschitz smooth with respect to the model parameter w with constant L, and have ρ-bounded gradients with respect to training/validation data. And the A(θ) is differential with a δ-bounded gradient and twice differential with its Hessian bounded by B. Then the gradient of MLR-SNet parameter θ with respect to loss is Lipschitz smooth. Proof The gradient of MLR-SNet parameter θ with respect to loss ∇ θ j ( ŵt (θ))| θt = ∂ j ( ŵt (θ)) ∂ ŵt (θ) ∂ ŵt (θ) ∂A(θ) ∂A(θ) ∂θ = -α t n n i=1 ∂ j ( ŵt (θ)) ∂ ŵt (θ) ∂ i (w t ) ∂w t ∂A(θ) ∂θ θt , Let G ij = ∂ j ( ŵt(θ)) ∂ ŵt(θ) ∂ i(wt) ∂wt , and then take gradient of θ in both sides of above equallity, we have ∇ 2 θ 2 j ( ŵt (θ))| θt = -α t n n i=1 ∂G ij ∂θ ∂A(θ) ∂θ + G ij ∂A 2 (θ) ∂θ 2 . ( ) For the first term in the right hand side, we have that ∂G ij ∂θ ∂A(θ) ∂θ ≤δ ∂ j ( ŵt (θ)) ∂ ŵt (θ)∂θ ∂ i (w t ) ∂w t =δ ∂ ∂ ŵt (θ) -α t n n i=1 ∂ j ( ŵt (θ)) ∂ ŵt (θ) ∂ i (w t ) ∂w t ∂A(θ) ∂θ θt ∂ i (w t ) ∂w t =δ -α t n n i=1 ∂ 2 j ( ŵt (θ)) ∂ ŵ2 t (θ) ∂ i (w t ) ∂w t ∂A(θ) ∂θ θt ∂ i (w t ) ∂w t ≤α t Lρ 2 δ 2 . ( ) For the second term in the right hand side, we have that G ij ∂A 2 (θ) ∂θ 2 ≤ Bρ 2 (16) Combining the above two inequalities Eq.( 15)(16), we have ∇ θ j ( ŵt (θ))| θt ≤ αρ 2 (α t Lδ 2 + B). Define L A = αρ 2 (α t Lδ 2 + B), and based on the Lagrange mean value theorem, we have: ∇L V al ( ŵt (θ 1 )) -L V al ( ŵt (θ 2 )) ≤ L A θ 1 -θ 2 . ( ) Thus the conclusion holds. Theorem 2 Suppose the loss function is Lipschitz smooth with respect to the model parameter w with constant L, and have ρ-bounded gradients with respect to training/validation data. And the A(θ) is differential with a δ-bounded gradient and twice differential with its Hessian bounded by B. Let the learning rate α t = A(θ t ) predicted by MLR-SNet obey the exponential LR, i.e., α t = α 0 β t , β = (Γ/T ) 1/T , Γ ≥ 1. Let η t = η for all t ∈ [T ]. 
If we use Adam algorithm to update MLR-SNet, we choose η satisfied η ≤ 2L and 1 -β 2 ≤ 2 16ρ 2 , where β 2 , are the hyperparameter of the Adam algorithm. Then for θ t generated using Adam, we have the following bound: min 0≤t≤T E[ ∇L V al ( ŵt (θ t )) 2 2 ] ≤ O( C ln(T ) T + σ 2 ), ( ) where C is some constant independent of the convergence process, σ is the variance of drawing uniformly mini-batch sample at random. Proof Suppose we have a small validation set with M samples {x 1 , x 2 , • • • , x M }, each associating with a validation loss function i (w(θ)), where w is the parameter of the model, and θ is the parameter of the MLR-SNet. The overall validation loss would be, L V al (w) = 1 M M i=1 i (w(θ)). ( ) According to the updating algorithm 1, we have: L V al ( ŵt+1 (θ t+1 )) -L V al ( ŵt (θ t )) = {L V al ( ŵt+1 (θ t+1 )) -L V al ( ŵt (θ t+1 ))} (a) + {L V al ( ŵt (θ t+1 )) -L V al ( ŵt (θ t ))} (b) (21) For term (a), L V al ( ŵt+1 (θ t+1 )) -L V al ( ŵt (θ t+1 )) ≤ ∇ w L V al ( ŵt+1 (θ t+1 )), ŵt+1 (θ t+1 ) -ŵt (θ t+1 ) + L 2 ŵt+1 (θ t+1 ) -ŵt (θ t+1 ) 2 2 (22) According to Eq (6), we have ŵt+1 (θ t+1 ) -ŵt (θ t+1 ) = -α t ∇ w L B T r ( ŵt (θ t+1 )) where α t = A(L B T r ( ŵt (θ t+1 ); θ t ), L B T r (w t ) = 1 |Bt| i∈Bt ∇L T r i (w t ). This can be written as ŵt+1 (θ t+1 ) -ŵt (θ t+1 ) = -α t ∇ w L T r ( ŵt (θ t+1 )) + ξ (t) , where ξ (t) = ∇ w L B T r ( ŵt (θ t+1 )) -∇ w L T r ( ŵt (θ t+1 )) . Since B t is the mini-batch samples drawn uniformly from the entire data set, we have E[ξ (t) ] = 0. Furthermore, ξ (t) are i.i.d random variable with finite variance, since B t are drawn i.i.d with a finite number of samples. 
Then Eq. (22) can be written as
$$
(a) \le \left\langle \nabla_w L^{Val}(\hat{w}_t(\theta_{t+1})),\, -\alpha_t \nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1})) + \xi^{(t)}\right\rangle
+ \frac{L}{2}\left\|-\alpha_t \nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1})) + \xi^{(t)}\right\|_2^2
$$
$$
= \left\langle \nabla_w L^{Val}(\hat{w}_t(\theta_{t+1})),\, -\alpha_t \nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1})) + \xi^{(t)}\right\rangle
+ \frac{L}{2}\left(\alpha_t^2\left\|\nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1}))\right\|_2^2 + \left\|\xi^{(t)}\right\|_2^2
- 2\alpha_t\left\langle \nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1})),\, \xi^{(t)}\right\rangle\right)
$$
$$
\le \left\langle \nabla_w L^{Val}(\hat{w}_t(\theta_{t+1})),\, -\alpha_t \nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1})) + \xi^{(t)}\right\rangle
+ \frac{L}{2}\left(\alpha_t^2\rho^2 + \left\|\xi^{(t)}\right\|_2^2\right)
- L\alpha_t\left\langle \nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1})),\, \xi^{(t)}\right\rangle.
$$
For term (b), according to Lemma 1 the validation loss is Lipschitz smooth with respect to the MLR-SNet parameter $\theta$; for brevity we still denote the Lipschitz constant by $L$. Then
$$
L^{Val}(\hat{w}_t(\theta_{t+1})) - L^{Val}(\hat{w}_t(\theta_t))
\le \left\langle \nabla_\theta L^{Val}(\hat{w}_t(\theta_t)),\, \theta_{t+1} - \theta_t\right\rangle
+ \frac{L}{2}\left\|\theta_{t+1} - \theta_t\right\|_2^2. \quad (25)
$$
If we adopt Adam to update the parameter of MLR-SNet, $\theta_{t+1} - \theta_t$ in Eq. (25) is given coordinate-wise by
$$
\theta_{t+1,i} = \theta_{t,i} - \eta_t \frac{g_{t,i}}{\sqrt{v_{t,i}} + \epsilon}, \quad (26)
$$
where $g_{t,i}$ is the $i$-th coordinate of $\nabla_\theta L^{Val}(\hat{w}_t(\theta_t))$ and $v_t$ is Adam's second-moment estimate. Now we have
$$
L^{Val}(\hat{w}_t(\theta_{t+1})) - L^{Val}(\hat{w}_t(\theta_t))
\le -\eta_t \sum_{i=1}^{d} g_{t,i}\,\frac{g_{t,i}}{\sqrt{v_{t,i}} + \epsilon}
+ \frac{L\eta_t^2}{2}\sum_{i=1}^{d} \frac{g_{t,i}^2}{(\sqrt{v_{t,i}} + \epsilon)^2}. \quad (27)
$$
Based on the proof process in Zaheer et al. (2018) (Eq. 4, p. 13),
$$
L^{Val}(\hat{w}_t(\theta_{t+1})) - L^{Val}(\hat{w}_t(\theta_t))
\le -\frac{\eta_t}{2(\sqrt{\beta_2}\,\rho + \epsilon)}\left\|\nabla_\theta L^{Val}(\hat{w}_t(\theta_t))\right\|_2^2
+ \left(\frac{\eta_t \rho \sqrt{1-\beta_2}}{\epsilon^2} + \frac{L\eta^2}{2\epsilon^2}\right)\frac{\sigma^2}{M}. \quad (28)
$$
Now Eq. (21) becomes
$$
L^{Val}(\hat{w}_{t+1}(\theta_{t+1})) - L^{Val}(\hat{w}_t(\theta_t))
\le \left\langle \nabla_w L^{Val}(\hat{w}_t(\theta_{t+1})),\, -\alpha_t \nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1})) + \xi^{(t)}\right\rangle
+ \frac{L}{2}\left(\alpha_t^2\rho^2 + \left\|\xi^{(t)}\right\|_2^2\right)
- L\alpha_t\left\langle \nabla_w L^{Tr}(\hat{w}_t(\theta_{t+1})),\, \xi^{(t)}\right\rangle
- \frac{\eta_t}{2(\sqrt{\beta_2}\,\rho + \epsilon)}\left\|\nabla_\theta L^{Val}(\hat{w}_t(\theta_t))\right\|_2^2
+ \left(\frac{\eta_t \rho \sqrt{1-\beta_2}}{\epsilon^2} + \frac{L\eta^2}{2\epsilon^2}\right)\frac{\sigma^2}{M}. \quad (29)
$$
Taking expectations with respect to $\xi^{(t)}$ on both sides of Eq. (29) and rearranging (the terms linear in $\xi^{(t)}$ vanish since $\mathbb{E}[\xi^{(t)}]=0$, while $\mathbb{E}\|\xi^{(t)}\|_2^2 \le \alpha_t^2\sigma^2$ and $\langle \nabla_w L^{Val}, -\alpha_t\nabla_w L^{Tr}\rangle \le \alpha_t\rho^2$), we can obtain
$$
\mathbb{E}_\xi\left[\frac{\eta_t}{2(\sqrt{\beta_2}\,\rho + \epsilon)}\left\|\nabla_\theta L^{Val}(\hat{w}_t(\theta_t))\right\|_2^2\right]
\le \alpha_t\rho^2 + \frac{L}{2}\alpha_t^2(\rho^2 + \sigma^2)
- L^{Val}(\hat{w}_{t+1}(\theta_{t+1})) + L^{Val}(\hat{w}_t(\theta_t))
+ \left(\frac{\eta_t \rho \sqrt{1-\beta_2}}{\epsilon^2} + \frac{L\eta^2}{2\epsilon^2}\right)\frac{\sigma^2}{M}.
$$
Using a telescoping sum, we obtain
$$
\sum_{t=1}^{T} \frac{\eta_t}{2(\sqrt{\beta_2}\,\rho + \epsilon)}\,\mathbb{E}\left[\left\|\nabla_\theta L^{Val}(\hat{w}_t(\theta_t))\right\|_2^2\right]
\le L^{Val}(\hat{w}_1(\theta_1)) - L^{Val}(\hat{w}_{T+1}(\theta_{T+1}))
+ \rho^2\sum_{t=1}^{T}\alpha_t + \frac{L}{2}(\rho^2+\sigma^2)\sum_{t=1}^{T}\alpha_t^2
+ \left(\frac{\eta \rho \sqrt{1-\beta_2}}{\epsilon^2} + \frac{L\eta^2}{2\epsilon^2}\right)\frac{\sigma^2 T}{M}
$$
$$
\le L^{Val}(\hat{w}_1(\theta_1))
+ \rho^2\sum_{t=1}^{T}\alpha_t + \frac{L}{2}(\rho^2+\sigma^2)\sum_{t=1}^{T}\alpha_t^2
+ \left(\frac{\eta \rho \sqrt{1-\beta_2}}{\epsilon^2} + \frac{L\eta^2}{2\epsilon^2}\right)\frac{\sigma^2 T}{M}. \quad (30)
$$
Therefore,
$$
\min_t \mathbb{E}_\xi\left[\left\|\nabla_\theta L^{Val}(\hat{w}_t(\theta_t))\right\|_2^2\right]
\le \frac{\sum_{t=1}^{T}\frac{\eta_t}{2(\sqrt{\beta_2}\,\rho+\epsilon)}\,\mathbb{E}_\xi\left[\left\|\nabla_\theta L^{Val}(\hat{w}_t(\theta_t))\right\|_2^2\right]}
       {\sum_{t=1}^{T}\frac{\eta_t}{2(\sqrt{\beta_2}\,\rho+\epsilon)}}
\le \frac{2(\sqrt{\beta_2}\,\rho+\epsilon)}{T\eta}
\left[L^{Val}(\hat{w}_1(\theta_1)) + \rho^2\sum_{t=1}^{T}\alpha_t + \frac{L}{2}(\rho^2+\sigma^2)\sum_{t=1}^{T}\alpha_t^2
+ \left(\frac{\eta \rho \sqrt{1-\beta_2}}{\epsilon^2} + \frac{L\eta^2}{2\epsilon^2}\right)\frac{\sigma^2 T}{M}\right]
\le O\!\left(\frac{\ln T}{T} + \sigma^2\right).
$$
This completes the proof.

For L4 on the image tasks we use its recommended $\alpha = 0.15$, and for HD and RTHO we search over hyper-lrs in $\{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}\}$, reporting the best-performing hyper-lr.

Penn Treebank. We use 2-layer and 3-layer LSTM networks, each following a word-embedding layer, and the output is fed into a linear layer to compute the probability of each word in the vocabulary. The hidden size of the LSTM cells is set to 512, as is the word-embedding size. We tie the weights of the word-embedding layer and the final linear layer.
Dropout with a rate of 0.5 is applied to the output of the word-embedding layer and to both the first and second LSTM layers. The LSTM is trained for 150 epochs with a batch size of 32 and a sequence length of 35. The base optimizer SGD uses an initial LR of 20 without momentum; for Adam, the initial LR is set to 0.01 and the weight for the moving average of the gradient is set to 0. We apply a weight decay of 5e-6 to both base optimizers, and all experiments clip the network gradient norm at 0.25. For both SGD and Adam, we decrease the LR by a factor of 4 when performance on the validation set shows no progress. For L4, we try different values of α in {0.1, 0.05, 0.01, 0.005} and report the best test perplexity among them. For both HD and RTHO, we search the hyper-lr in {1, 0.5, 0.1, 0.05} and report the best results. MLR-SNet architecture and parameter setting. The architecture of MLR-SNet is illustrated in Section 3.2. In our experiments, the hidden size is set to 40. The initialization of MLR-SNet follows the default setting in PyTorch, and its PyTorch implementation is listed above. We employ the Adam optimizer to train MLR-SNet, using the originally recommended parameters with an LR of 1e-3 and a weight decay of 1e-4, which avoids extra hyper-parameter tuning. For image classification tasks, the input of MLR-SNet is the training loss of a mini-batch of samples; the LR of every data batch is predicted by MLR-SNet, which we update twice per epoch according to the loss on the validation data. For text classification tasks, we instead take L^Tr / log(vocabulary size) as the input of MLR-SNet to account for the large number of classes in text data, and we update MLR-SNet every 100 batches because each epoch contains far more batches than in the image datasets. Results. Due to the space limitation, we only present the test accuracy in the main paper.
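To make the design above concrete, the following is a framework-free sketch of the MLR-SNet idea (a hypothetical re-implementation, not the paper's PyTorch code): an LSTM cell reads the current training loss and emits an LR inside (0, γ) through a scaled sigmoid on a linear read-out of the hidden state. Sizes are reduced for illustration; the paper uses a hidden size of 40.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyMLRSNet:
    """Hypothetical sketch: an LSTM cell mapping a scalar loss to an LR in (0, gamma)."""

    def __init__(self, hidden=4, gamma=1.0, seed=0):
        rng = random.Random(seed)
        self.h = [0.0] * hidden          # hidden state, carried across calls
        self.c = [0.0] * hidden          # cell state
        self.gamma = gamma
        # One small weight vector [w_x, w_h, b] per gate and per hidden unit.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(3)]
                      for _ in range(hidden)]
                  for g in ("i", "f", "o", "g")}
        self.out_w = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]

    def __call__(self, loss):
        new_h, new_c = [], []
        for j in range(len(self.h)):
            def gate(name, act):
                w = self.W[name][j]
                return act(w[0] * loss + w[1] * self.h[j] + w[2])
            i, f, o = (gate(n, sigmoid) for n in ("i", "f", "o"))
            g = gate("g", math.tanh)
            c = f * self.c[j] + i * g    # standard LSTM cell update
            new_c.append(c)
            new_h.append(o * math.tanh(c))
        self.h, self.c = new_h, new_c
        # Scaled sigmoid keeps the predicted LR strictly inside (0, gamma).
        return self.gamma * sigmoid(sum(w * h for w, h in zip(self.out_w, self.h)))
```

In use, `net = TinyMLRSNet(gamma=1.0); lr = net(current_loss)` would replace the hand-designed schedule inside the SGD loop, with the statefulness of the LSTM letting the predicted LR depend on the loss history rather than on the current loss alone.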
Here, we present the training loss and test accuracy of our method and all compared methods on image and text tasks, as shown in Fig. 11. For image tasks, all methods except Adam and SGD with a fixed LR can decrease the training loss to almost 0. Though these methods all reach local minima, the generalization ability of these minima differs substantially, as can be seen from the test accuracy curves. As shown in Fig. 11(a), 11(b), 11(g), 11(h), when using SGD to train DNNs, the compared methods (SGD with exponential LR, L4, HD, RTHO) fail to find solutions that generalize well. In particular, L4 greedily searches for an LR that decreases the loss to 0, making it fairly hard to adapt to the complex training dynamics of DNNs and obtain a good minimum, while our method adjusts the LR to comply with the significant variations of the training dynamics, leading to a solution that generalizes better. As shown in Fig. 11(d), 11(e), 11(j), 11(k), when the baseline methods are trained with SGDM, they make great progress in escaping from bad minima. In spite of this, our method still shows superiority in finding a solution with better generalization compared with these competitive training strategies. In the third column of Fig. 11, we plot the LR schedules learned by the compared methods and by ours. As can be seen, our method learns LR schedules that approximate the hand-designed ones while varying more in locality. HD and RTHO often follow the same trajectory while producing lower values or a faster downward trend than ours. This tends to explain why our final performance on the test set is better than HD and RTHO, since our method adaptively adjusts the LR by explicitly exploiting the past training history. L4 greedily searches for an LR that decreases the loss, which often leads to a value that is too large, causing fluctuations or even divergence (Fig. 11(l)), too small, causing slow progress (Fig. 11(r)), or both (Fig. 11(c), 11(f), 11(i), 11(o)). Such LR schedules often result in bad minima.
Moreover, all compared methods treat the LR as a hyper-parameter to learn without a transferable formulation, so the learned LR schedules cannot generalize directly to other learning tasks; in general, they have to search for a proper LR schedule from scratch for each new task. In contrast, our meta-learned MLR-SNet is plug-and-play and transferable, and can directly transfer how to schedule the LR for SGD to heterogeneous tasks without additional learning.

Ablation study.

(1) The architecture of MLR-SNet. Fig. 12(a) shows the test accuracy on CIFAR-10 with ResNet-18 for different architectures of MLR-SNet. As can be seen, our algorithm is not sensitive to the choice of the MLR-SNet architecture, which implies that it is robust and stable for helping improve DNN training. (2) The global LR of the meta optimizer. To further validate whether our MLR-SNet is robust to the meta optimizer, we adopt the Adam optimizer to search for proper LR schedules. Fig. 12(b) shows that MLR-SNet achieves similar performance even for different global LRs. This implies that MLR-SNet does not require carefully tuning the LR of the meta optimizer, which makes it easy to reproduce and apply to various problems. (3) The different γ values of MLR-SNet. One important hyper-parameter of MLR-SNet is γ; here we verify that our method is not sensitive to its choice. We test γ values from 0.1 to 10 for DNN training. As shown in Fig. 12(c), even with different learning scales, our method still helps DNNs achieve almost the same performance. This implies that MLR-SNet is robust to the choice of γ, which makes it easy to apply in practice.

C EXPERIMENTAL DETAILS AND ADDITIONAL RESULTS IN SECTION 4.2

We investigate the transferability of the learned LR schedules when applied to various tasks in Section 4.2 of the main paper. We use the MLR-SNet meta-learned on CIFAR-10 with ResNet-18 in Section 4.1 to directly predict the LR for the SGD algorithm on new heterogeneous tasks. We save the learned MLR-SNet at different epochs over one whole meta-training run, as shown in Fig. 13.

Our method trains the ResNet-18 by SGD with a weight decay of 5e-4, and the MLR-SNet is learned under the guidance of a small validation set without corruptions; we randomly choose 10 clean images per class as the validation set. The experimental results are listed in Table 1 in the main paper.

Additional robustness results of transferable LR schedules on different data corruptions. Furthermore, we explore the robustness of our transferable LR schedules across different tasks. Different from the above experiments, where all 15 models are trained under the guidance of a small validation set, here we train a ResNet-18 only on the Gaussian Noise corruption to meta-learn the MLR-SNet, and then transfer the meta-learned LR schedules to the other 14 corruptions. We report the average accuracy of the 14 models on the test data to show the robustness of our transferred LR schedules. All methods are meta-tested with a ResNet-18 for 100 epochs with batch size 128, and the hyper-parameter settings of the hand-designed LR schedules are kept the same as above. Table 7 shows the mean test accuracy of the 14 models. As can be seen, our transferable LR schedules obtain the best final performance compared with the hand-designed LR schedules. This implies that our transferable LR schedules also perform more robustly and stably than pre-set LR schedules when the learning tasks change; moreover, they are plug-and-play and have no additional hyper-parameters to tune when transferred to new heterogeneous tasks.

F APPLYING MLR-SNET ON TOP OF ADAM
To further demonstrate the versatility of our method, we apply MLR-SNet on top of the Adam algorithm. Fig. 16 shows that our method can substantially improve the performance of the original Adam algorithm.

G EXPERIMENTAL RESULTS OF ADDITIONAL COMPARED METHOD LR CONTROLLER

In this section, we present the experimental results of LR Controller (Xu et al., 2019), a work related to ours but under the reinforcement learning framework. Because its learning algorithm is relatively computationally expensive and not easy to optimize, we show that our method has an advantage in finding LR schedules that scale and generalize well. For a fair comparison, we follow all the training settings and the structure of the LR Controller proposed in Xu et al. (2019), except that we modify the batch size to 128 and increase the number of training steps to cover 200 epochs of data to match our setup in Section 4.1. First, we train the LR Controller on CIFAR-10 with ResNet-18 and on CIFAR-100 with WideResNet-28-10, as we do in Section 4.1. As shown in Fig. 17, our method demonstrates an evident advantage in finding solutions with better generalization compared with the LR Controller. The LR Controller performs steadily in the early training phase, but soon fluctuates significantly and fails to make progress. This suggests that the LR Controller suffers from a severe stability issue as the number of training steps increases, especially compared to our MLR-SNet. We then transfer the LR schedules learned on CIFAR-10 by our method and by the LR Controller to CIFAR-100 to verify their transferability; the test settings are the same as those described in Section 4.2. As shown in Fig. 18, the LR Controller makes comparatively slower progress throughout training, while our method achieves competitive performance, which indicates its capability of transferring to other tasks.



As we know, the performance of hand-designed LR schedules and HPO methods is very sensitive to the initial LR. To avoid carefully tuning the initial LR, we learn the LR schedules from an interval [0, γ], so the initial LR is determined by the output of the MLR-SNet. We set γ = 1 for image tasks and γ = 40 for text tasks in all our experiments to eliminate the influence of the different loss magnitudes of the two kinds of tasks. Notice that ŵ_{t+1}(θ) here is kept as a function of θ, which guarantees that the gradient in Eq. (9) can be computed.

Footnotes:
- TinyImageNet can be downloaded at https://tiny-imagenet.herokuapp.com.
- The PyTorch code of all these networks can be found at https://github.com/weiaicunzai/pytorch-cifar100.
- The ImageNet training code can be found at https://github.com/pytorch/examples/tree/master/imagenet.
- We use the original 50,000 training images of CIFAR-10/100 as test data.
- Code for the LR Controller can be found at https://github.com/nicklashansen/adaptive-learning-rate-schedule.
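The note above that ŵ_{t+1}(θ) is kept as a function of θ is what makes the meta-gradient computable by the chain rule. A toy sketch under strong simplifying assumptions (scalar model parameter, scalar θ, and A(θ) = γ·sigmoid(θ) standing in for the full MLR-SNet; the quadratic losses are purely illustrative):

```python
import math

GAMMA = 1.0

def A(theta):
    # Predicted LR, constrained to (0, GAMMA) as in the paper's interval trick.
    return GAMMA / (1.0 + math.exp(-theta))

def dA(theta):
    s = 1.0 / (1.0 + math.exp(-theta))
    return GAMMA * s * (1.0 - s)

def grad_train(w):                 # toy training loss: L_train(w) = (w - 1)^2
    return 2.0 * (w - 1.0)

def grad_val(w):                   # toy validation loss: L_val(w) = (w - 2)^2
    return 2.0 * (w - 2.0)

def meta_step(w, theta, eta=0.1):
    g = grad_train(w)
    w_hat = w - A(theta) * g       # one-step lookahead: a function of theta
    # Chain rule: d L_val(w_hat(theta)) / d theta = L_val'(w_hat) * (-g) * A'(theta)
    theta = theta - eta * grad_val(w_hat) * (-g) * dA(theta)
    w = w - A(theta) * g           # actual SGD step using the updated LR
    return w, theta

w, theta = 0.0, 0.0
for _ in range(50):
    w, theta = meta_step(w, theta)
```

The alternating structure (update θ on the lookahead validation loss, then take the real SGD step with the new LR) mirrors the bilevel scheme sketched in Eq. (6) and Eq. (9), with automatic differentiation replaced by a hand-derived chain rule on this scalar example.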



Figure 1: Pre-set LR schedules for (a) image and (b) text classification. (c) Visualization of how we input the current loss L_t to MLR-SNet, which then outputs a proper LR α_t to help SGD find a better minimum. LR schedules learned by MLR-SNet on (d) image and (e) text classification. (f) We transfer LR schedules learned on CIFAR-10 to image (CIFAR-100) and text (Penn Treebank) classification; the subfigure shows the predicted LR during training.

Figure 2: The structure of our proposed MLR-SNet.

Figure 4: Test accuracy of our method (train) and compared baselines on different datasets.

Fig. 4(a) and 4(b) show the classification accuracy on the CIFAR-10 and CIFAR-100 test sets, respectively. It can be observed that: 1) our algorithm outperforms all other competing methods, and the LR schedules learned by MLR-SNet, presented in Fig. 1(d), have similar shapes to the hand-designed policies while showing more elaborate local variations for adapting to the training dynamics; 2) the Fixed LR performs similarly to the other baselines early in training, but falls into fluctuations later on, implying that a fixed LR cannot finely adapt to such DNN training dynamics; 3) the MultiStep LR drops the LR at certain epochs, and this simple strategy overcomes the issue of the Fixed LR, obtaining higher and more stable performance later in training; 4) the Exponential LR improves test performance faster than the other baselines early in training, but makes slow progress later due to its smaller LR; 5) SGDR uses a cyclic LR, which needs more epochs to obtain a stable result; 6) though Adam has an adaptive coordinate-specific LR, it behaves worse than the MultiStep and Exponential LR, as demonstrated in Wilson et al.
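For concreteness, the pre-set baseline policies discussed above can be sketched as simple functions of the epoch index. The milestones and decay factors below are illustrative assumptions, not the exact values used in the experiments; the exponential form matches the α_t = α_0·β^t, β = (Γ/T)^{1/T} schedule assumed in Theorem 2.

```python
import math

def multistep_lr(lr0, epoch, milestones=(60, 120, 160), factor=0.2):
    # Drop the LR by `factor` at each milestone epoch (values are assumptions).
    return lr0 * factor ** sum(epoch >= m for m in milestones)

def exponential_lr(lr0, epoch, total_epochs=200, gamma_cap=1.0):
    # alpha_t = alpha_0 * beta^t with beta = (Gamma / T)^(1/T), Gamma >= 1,
    # so the LR decays smoothly from alpha_0 down to alpha_0 * Gamma / T.
    beta = (gamma_cap / total_epochs) ** (1.0 / total_epochs)
    return lr0 * beta ** epoch

def sgdr_lr(lr_min, lr_max, epoch, period=50):
    # Cosine annealing with warm restarts every `period` epochs
    # (Loshchilov & Hutter, 2017).
    t = epoch % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))
```

These closed forms make the contrast with MLR-SNet explicit: each baseline depends only on the epoch counter, whereas MLR-SNet conditions the LR on the observed training loss.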

Figure 6: Test accuracy of transferred LR schedules on different datasets.

4.3 ROBUSTNESS ON DIFFERENT DATA CORRUPTIONS

In this section, we validate whether our MLR-SNet behaves robustly against corrupted training data. To this end, we design the following experiments: we take CIFAR-10-C and CIFAR-100-C (Hendrycks & Dietterich, 2019) as our training sets, consisting of 15 types of generated corruptions applied to the test images of CIFAR-10/CIFAR-100, and use the original training sets of CIFAR-10/100 as test sets. Though the original images of CIFAR-10/100-C are the same as those of the CIFAR-10/100 test sets, the different corruptions change the data distributions. To guarantee that the learned models generalize well to the test set, we choose 10 clean images per class as the validation set. Each corruption can be roughly regarded as a task, and thus we obtain 15 models trained on CIFAR-10/100-C.

Figure 11: Train loss (perplexity), test accuracy (perplexity), and learned LR schedules of our method (train) and compared baselines on different tasks.

Figure 12: Ablation study. (a) Test accuracy on CIFAR-10 with ResNet-18 of different architectures of MLR-SNet. 'a-b' denotes the configuration of MLR-SNet, where 'a' represents the number of layers and 'b' the number of hidden nodes. (b) Test accuracy on CIFAR-10 with ResNet-18 of different LRs of the meta optimizer Adam. (c) Test accuracy on CIFAR-10 with ResNet-18 of different γ values of MLR-SNet.

THE PRELIMINARY EXPLORATION OF THE INFLUENCE OF DIFFERENT META-TRAINING TASKS ON META-TEST TASKS

In this section, we study how different meta-training tasks influence performance on the target task. Here we fix the target task as training ResNet-18 on TinyImageNet, and choose three different meta-training tasks: 1) training ResNet-18 on CIFAR-10; 2) training WideResNet-28-10 on CIFAR-100; 3) training a 3-layer LSTM on Penn Treebank. Fig. 15 shows the meta-test performance for the three meta-training tasks. It can be seen that a meta-training task more related to the target task yields better transferable performance on the target task.

Figure 15: The meta-test performance on TinyImageNet (target task) of different meta-training tasks.

Figure 17: Train loss, test accuracy, and learned LR schedules of our method (train) and LR Controller (train) on CIFAR-10 and CIFAR-100.

Figure 18: Train loss and test accuracy of our method (test) and LR Controller (test) on CIFAR-100.

Test accuracy (%) on CIFAR-10 and CIFAR-100 training set of different methods trained on CIFAR-10-C and CIFAR-100-C. Best and Last denote the results of the best and the last epoch.


In this section, we attempt to evaluate the capability of MLR-SNet to learn LR schedules compared with baseline methods. Here, we provide implementation details of all experiments.

Test accuracy (%) on the CIFAR datasets with SGD baselines.

Test accuracy (%) of CIFAR dataset with SGDM baselines.

Test perplexity on the Penn Treebank dataset.

Test accuracy (%) of CIFAR-10 dataset with different networks.

Validation accuracies on the ImageNet dataset.

Transfer to different net architectures. We transfer the learned LR schedules to the training of different network architectures. All methods are trained on CIFAR-10 with different architectures, and the hyper-parameters of all methods are the same as in the CIFAR-10 ResNet-18 setting. We test the meta-learned LR schedules on different configurations of DenseNet (Huang et al., 2017). As shown in Fig. 14, our method performs slightly more stably than the MultiStep strategy at about 75-125 epochs, which tends to show the superiority of an adaptive LR for training DenseNets. We also transfer the LR schedules to several novel networks; the results are presented in Fig. 8 in the main paper.

Transfer to large-scale optimization. We transfer the learned LR schedules to the training of large-scale optimization problems. The LR prediction by MLR-SNet does not substantially increase the complexity of DNN training compared with hand-designed LR schedules, which makes it feasible and reliable to transfer our meta-learned LR schedules to such large-scale problems. We train a ResNet-50 on ImageNet with hand-designed LR schedules and with our transferred LR schedules. The training code can be found at https://github.com/pytorch/examples/tree/master/imagenet, and the parameter settings are kept unchanged except for the LR. All compared hand-designed LR schedules are trained by SGDM with a momentum of 0.9, a weight decay of 5e-4, an initial LR of 0.1, 90 epochs, and batch size 256; the Fixed LR baseline uses a constant LR of 0.1.

Test accuracy (%) on the CIFAR-10 and CIFAR-100 training sets of different methods trained on CIFAR-10-C and CIFAR-100-C. Best and Last denote the results of the best and the last epoch; bold and underlined bold denote the first and second best results, respectively. Each dataset contains 15 types of algorithmically generated corruptions from the noise, blur, weather, and digital categories: Gaussian Noise, Shot Noise, Impulse Noise, Defocus Blur, Frosted Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic, Pixelate, and JPEG. All corruptions are generated on the 10,000 test-set images, and each corruption dataset contains 50,000 images since each type of corruption has five levels of severity. We treat the CIFAR-10-C or CIFAR-100-C dataset as the training set and train a ResNet-18 model for each corruption dataset, finally obtaining 15 models for CIFAR-10/100-C. Each corruption can be roughly regarded as a task, and the average accuracy of the 15 models on the test data is used to evaluate the robustness of each LR scheduling strategy.
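The data protocol above (corrupted training inputs plus a small clean validation split per class) can be sketched as follows. This is a hypothetical helper, not the authors' code; the Gaussian noise severity scales are assumptions, and the noise is seeded deterministically for illustration.

```python
import random

def gaussian_corrupt(image, severity=1, seed=0):
    """Add clipped Gaussian noise to an image given as nested lists in [0, 1]."""
    rng = random.Random(seed)
    scale = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]  # assumed severity levels
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, scale))) for px in row]
            for row in image]

def split_clean_validation(dataset, per_class=10):
    """Reserve `per_class` clean examples per label; corrupt the rest for training."""
    train, val = [], []
    seen = {}
    for image, label in dataset:
        if seen.get(label, 0) < per_class:
            seen[label] = seen.get(label, 0) + 1
            val.append((image, label))            # clean images guide MLR-SNet
        else:
            train.append((gaussian_corrupt(image), label))
    return train, val
```

The clean validation split is what the meta-update of MLR-SNet is computed against, while the corrupted split plays the role of one of the 15 corruption "tasks".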

