MLR-SNET: TRANSFERABLE LR SCHEDULES FOR HETEROGENEOUS TASKS

Abstract

The learning rate (LR) is one of the most important hyper-parameters in stochastic gradient descent (SGD) for deep neural network (DNN) training and generalization. However, current hand-designed LR schedules must pre-specify a fixed form, which limits their ability to adapt to non-convex optimization problems whose training dynamics vary significantly. Moreover, a proper LR schedule always needs to be searched from scratch for each new task. To address these issues, we propose to parameterize LR schedules with an explicit mapping formulation, called MLR-SNet. This learnable structure gives MLR-SNet the flexibility to learn a proper LR schedule that complies with the training dynamics of a DNN. Experiments on image and text classification benchmarks substantiate the capability of our method to achieve proper LR schedules. Moreover, the meta-learned MLR-SNet is plug-and-play and generalizes to new heterogeneous tasks. We transfer the meta-trained MLR-SNet to tasks with different training epochs, network architectures, and datasets, including the large-scale ImageNet dataset, and achieve performance comparable to hand-designed LR schedules. Finally, MLR-SNet achieves better robustness when the training data are biased with corrupted noise.

1. INTRODUCTION

Stochastic gradient descent (SGD) and its many variants (Robbins & Monro, 1951; Duchi et al., 2011; Zeiler, 2012; Tieleman & Hinton, 2012; Kingma & Ba, 2015) have served as the cornerstone of modern machine learning with big data. It has been empirically shown that DNNs achieve state-of-the-art generalization performance on a wide variety of tasks when trained with SGD (Zhang et al., 2017). Several recent studies observe that SGD tends to select so-called flat minima (Hochreiter & Schmidhuber, 1997a; Keskar et al., 2017), which appear to generalize better in practice. Scheduling the learning rate (LR) for SGD is one of the most widely studied ways to improve SGD training of DNNs. Specifically, it has been experimentally studied how the LR influences the minima found by SGD (Jastrzebski et al., 2017). Theoretically, Wu et al. (2018a) show that the LR plays an important role in minima selection from a dynamical-stability perspective, and He et al. (2019) provide a PAC-Bayes generalization bound for DNNs trained by SGD that is correlated with the LR. In short, finding a proper LR schedule strongly influences the generalization performance of DNNs and has been widely studied recently (Bengio, 2012; Schaul et al., 2013; Nar & Sastry, 2018). There mainly exist three kinds of hand-designed LR schedules: (1) Pre-defined LR policies, such as decaying or cyclic LR (Gower et al., 2019; Loshchilov & Hutter, 2017), are the most commonly used in current DNN training and bring large improvements in training efficiency. Some theoretical works suggest that decaying schedules can yield faster convergence (Ge et al., 2019; Davis et al., 2019) or avoid strict saddles (Lee et al., 2019; Panageas et al., 2019) under mild conditions.
(2) LR search methods from traditional convex optimization (Nocedal & Wright, 2006) can be extended to DNN training by searching for an LR adaptively at each step, e.g., Polyak's update rule (Rolinek & Martius, 2018), the Frank-Wolfe algorithm (Berrada et al., 2019), and Armijo line search (Vaswani et al., 2019). (3) Adaptive gradient methods like Adam (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2015) adapt the LR for each parameter separately according to gradient information. Although the above LR schedules (as depicted in Fig. 1(a) and 1(b)) can achieve competitive results on their learning tasks, they still have evident deficiencies in practice. On the one hand, these policies need to manually pre-specify the form of the LR schedule, suffering from limited flexibility to adapt to non-convex optimization problems due to the significant variation of training dynamics. On the other hand, when solving new heterogeneous tasks, one always needs to search for a proper LR

