AUTOLRS: AUTOMATIC LEARNING-RATE SCHEDULE BY BAYESIAN OPTIMIZATION ON THE FLY

Abstract

The learning rate (LR) schedule is one of the most important hyper-parameters needing careful tuning in training DNNs. However, it is also one of the least automated parts of machine learning systems and usually costs significant manual effort and computation. Though there are pre-defined LR schedules and optimizers with adaptive LR, they introduce new hyperparameters that need to be tuned separately for different tasks/datasets. In this paper, we consider the question: Can we automatically tune the LR over the course of training without human involvement? We propose an efficient method, AutoLRS, which automatically optimizes the LR for each training stage by modeling training dynamics. AutoLRS aims to find, for every τ steps, an LR that minimizes the resulting validation loss. We solve this black-box optimization on the fly by Bayesian optimization (BO). However, collecting training instances for BO requires a system to evaluate each LR queried by BO's acquisition function for τ steps, which is prohibitively expensive in practice. Instead, we apply each candidate LR for only τ′ ≪ τ steps and train an exponential model to predict the validation loss after τ steps. This mutual-training process between BO and the loss-prediction model allows us to limit the training steps invested in the BO search. We demonstrate the advantages and the generality of AutoLRS through extensive experiments training DNNs for tasks from diverse domains using different optimizers. The LR schedules auto-generated by AutoLRS lead to a speedup of 1.22×, 1.43×, and 1.5× when training ResNet-50, Transformer, and BERT, respectively, compared to the LR schedules in their original papers, and an average speedup of 1.31× over state-of-the-art heavily-tuned LR schedules.

1. INTRODUCTION

In the regime of deep learning, the success of training largely depends on the choice of the learning rate (LR) schedule, since most optimizers will have difficulty traversing a non-smooth and non-convex loss landscape with multiple local minima and possibly saddle points (Kawaguchi, 2016; Jin et al., 2017; Goodfellow et al., 2016; Li et al., 2018a). To achieve stable and fast convergence towards a solution with good generalization performance, one has to tune the LR schedules carefully for different tasks (Nar & Sastry, 2018; Jastrzębski et al., 2017). This tuning is usually non-trivial and requires many trial-and-error iterations that are computationally expensive. Moreover, the randomness of the widely-used mini-batch stochastic gradient descent (SGD) may introduce more uncertainty and difficulty in the tuning process. For the same reasons, it is also hard to directly formulate the search of the LR schedule as a well-posed optimization problem and address it through standard optimization. The broadly-adopted strategy is to either pick one from a family of pre-defined LR schedules or apply an optimizer that has a built-in mechanism that changes the LR adaptively. However, we have a limited number of choices for pre-defined LR schedules, most of which are simple functions such as exponential or cosine and thus cannot perfectly align with the non-smooth loss landscape. The latter set of adaptive optimizers, e.g., Adam (Kingma & Ba, 2015) and Adadelta (Zeiler, 2012), are extended from convex optimization and rely on strong assumptions to make the convergence properties hold. Moreover, the methods in both categories introduce new hyper-parameters that have to be tuned separately for different tasks or datasets, requiring significant human involvement. In this paper, we study the question: can we automatically tune the LR over the course of training without human involvement?
At the beginning of every τ steps (i.e., a "stage" in our method), we seek to identify an LR that optimizes the validation loss (i.e., an empirical estimate of the generalization error) at the end of the stage. To do so, we employ Bayesian optimization (BO) that treats the validation loss as a black-box function of LR. BO simultaneously updates a posterior estimate of the black-box function and searches for the best LR with respect to the posterior. This approach is, however, computationally expensive since estimating the posterior needs many (input, output) instances of the function, and acquiring each instance costs τ steps of training. We, therefore, develop a simple yet efficient approximation: for every LR that BO decides to evaluate, we train the model by using the LR for only τ′ ≪ τ steps and use the validation loss over the τ′ steps to train a time-series forecasting model that provides a prediction of the validation loss after τ steps. As we will show later, an exponential model suffices to produce accurate predictions when using a small τ′ = τ/10. AutoLRS can then let BO explore ten different LRs in each stage while still bounding the total running time to approximately twice the training cost associated with the generated schedule, i.e., the time spent to find the stage-specific LRs is roughly equal to the time spent training the model with the identified LRs. AutoLRS does not depend on a pre-defined LR schedule, dataset, or a specified task and is compatible with almost all optimizers. Hence, it can be generally deployed across a broad range of ML tasks without much human involvement or expensive tuning over choices of LR schedules and their hyperparameters. Moreover, since it directly minimizes the validation loss, it not only accelerates convergence but also improves generalization compared to just minimizing the training loss.
Furthermore, AutoLRS only needs to update two extremely light-weight models, i.e., the BO posterior and the exponential forecasting model, and it is efficient in exploring the loss landscape. Hence, it does not result in notable extra costs in either memory or computation. Note that AutoLRS searches for better LRs based on the training dynamics, which can be seen as a form of self-supervision. The interaction between BO and the forecasting model is an example of mutual learning, where one produces training data for the other. In experiments, we apply AutoLRS to train three representative DNNs widely used in practice, i.e., ResNet-50 (He et al., 2016a) on ImageNet classification (Russakovsky et al., 2015); Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019) for NLP tasks. Though they have been extensively studied and have hand-tuned LR schedules, the LR schedules computed by AutoLRS are faster than the original, hand-tuned LR schedules by 1.22×, 1.43×, and 1.5× for training ResNet-50, Transformer, and BERT, respectively, in terms of the training steps used to update the DNN (i.e., excluding the costs of the LR/hyperparameter search). Meanwhile, it achieves test-set performance better than or on par with state-of-the-art results. We also carefully hand-tuned two state-of-the-art learning rate schedules, CLR (Smith, 2017) and SGDR (Loshchilov & Hutter, 2017), and conducted more than ten experiments with different CLR/SGDR hyperparameters on each model. AutoLRS still has an average speedup of 1.29× and 1.34× across the three models, in terms of training steps, compared to the best CLR and SGDR LR schedules, respectively. The AutoLRS implementation is available at https://github.com/YuchenJin/autolrs.
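
To make the forecasting step described above more concrete, the following is a minimal sketch, not the AutoLRS implementation. It assumes the exponential model has the form loss(t) ≈ a·exp(−k·t) + c, which is one plausible parameterization and is not confirmed by the text above. The function names `fit_exponential` and `forecast_loss`, the grid of decay rates, and the synthetic loss curve are all hypothetical and chosen only for illustration. The sketch fits the model to validation losses observed over τ′ steps and extrapolates to step τ.

```python
import math


def fit_exponential(steps, losses, decay_grid=None):
    """Fit loss(t) ~ a * exp(-k * t) + c (assumed form, see lead-in).

    For each candidate decay rate k, the best a and c follow in closed
    form from simple linear regression of the losses on exp(-k * t).
    The k with the smallest squared error is returned with its a and c.
    """
    if decay_grid is None:
        # Hypothetical search grid: 201 log-spaced rates from 1e-4 to 1.
        decay_grid = [10 ** (-4 + 4 * i / 200) for i in range(201)]
    n = len(steps)
    mean_y = sum(losses) / n
    best = None
    for k in decay_grid:
        xs = [math.exp(-k * t) for t in steps]
        mean_x = sum(xs) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        if var_x < 1e-18:
            continue  # feature is numerically constant; skip this rate
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
        a = cov / var_x
        c = mean_y - a * mean_x
        sse = sum((a * x + c - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, a, k, c)
    if best is None:
        raise ValueError("no usable decay rate in decay_grid")
    _, a, k, c = best
    return a, k, c


def forecast_loss(steps, losses, horizon):
    """Predict the loss at step `horizon` from a short observed prefix."""
    a, k, c = fit_exponential(steps, losses)
    return a * math.exp(-k * horizon) + c


# Synthetic example: observe tau_prime = tau / 10 steps, forecast at tau.
tau_prime, tau = 100, 1000
steps = list(range(tau_prime))
losses = [2.0 * math.exp(-0.01 * t) + 0.5 for t in steps]
predicted = forecast_loss(steps, losses, horizon=tau)
print(f"forecast validation loss at step {tau}: {predicted:.4f}")
```

In a setup like the one described above, each candidate LR proposed by BO would contribute one such short loss curve, and the forecast at step τ would serve as the observed function value for that LR. With ten candidates of τ/10 steps each, the search consumes about τ steps per stage, which matches the roughly twofold total cost stated in the text.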

2. RELATED WORK

Learning rate scheduling: In contrast to traditional LR schedules with a monotonically decreasing sequence of LRs and to multi-step LR schedules, a recent class of LR schedules proposes applying multiple cycles of LR decay. Cyclical Learning Rate (CLR) changes the LR from a maximal LR (η_max) to a minimal LR (η_min) at a pre-defined frequency and achieves faster convergence for some DNNs (Smith, 2017). The approach requires an "LR range test" to estimate the minimal and maximal LR. The LR range test trains the model with a linearly-increasing LR between a low LR

