REVISITING THE TRAIN LOSS: AN EFFICIENT PERFORMANCE ESTIMATOR FOR NEURAL ARCHITECTURE SEARCH

Abstract

Reliable yet efficient evaluation of the generalisation performance of a proposed architecture is crucial to the success of neural architecture search (NAS). Traditional approaches face a variety of limitations: training each architecture to completion is prohibitively expensive, early-stopping estimates may correlate poorly with fully trained performance, and model-based estimators require large training sets. Instead, motivated by recent results linking training speed and generalisation with stochastic gradient descent, we propose to estimate the final test performance based on the sum of training losses. Our estimator is inspired by the marginal likelihood, which is used for Bayesian model selection. Our model-free estimator is simple, efficient, and cheap to implement, and does not require hyperparameter tuning or surrogate training before deployment. We demonstrate empirically that our estimator consistently outperforms other baselines under various settings and can achieve a rank correlation of 0.95 with final test accuracy on the NAS-Bench-201 dataset within 50 epochs.

1. INTRODUCTION

Reliably estimating the generalisation performance of a proposed architecture is crucial to the success of Neural Architecture Search (NAS) but has always been a major bottleneck in NAS algorithms (Elsken et al., 2018). The traditional approach of training each architecture for a large number of epochs and evaluating it on validation data (full evaluation) provides a reliable performance measure, but requires prohibitively high computational resources on the order of thousands of GPU days (Zoph & Le, 2017; Real et al., 2017; Zoph et al., 2018; Real et al., 2019; Elsken et al., 2018). This motivates the development of methods for speeding up performance estimation to make NAS practical for limited computing budgets. A popular and simple approach is early stopping, which offers a low-fidelity approximation of generalisation performance by training for fewer epochs (Li et al., 2016; Falkner et al., 2018; Li & Talwalkar, 2019). However, if we stop training early, after a small number of epochs, and evaluate the model on validation data, the resulting performance ranking may not correlate well with the ranking obtained by full evaluation (Zela et al., 2018). Another line of work focuses on learning curve extrapolation (Domhan et al., 2015; Klein et al., 2016b; Baker et al., 2017), which trains a surrogate model to predict the final generalisation performance from the initial learning curve and/or meta-features of the architecture. However, training the surrogate often requires hundreds of fully evaluated architectures to achieve satisfactory extrapolation performance, and the hyperparameters of the surrogate must also be optimised. Alternatively, the idea of weight sharing is adopted in one-shot NAS methods to speed up evaluation (Pham et al., 2018; Liu et al., 2019; Xie et al., 2019b).
Despite leading to significant cost savings, weight sharing heavily underestimates the true performance of good architectures and is unreliable in predicting the relative ranking among architectures (Yang et al., 2020; Yu et al., 2020). In view of the above limitations, we propose a simple model-free method which provides a reliable yet computationally cheap estimate of the generalisation performance ranking of architectures: the Sum over Training Losses (SoTL). Our method harnesses the training losses of the commonly used SGD optimiser during training, and is motivated by recent empirical and theoretical results linking training speed and generalisation (Hardt et al., 2016; Lyle et al., 2020). We ground our method in the Bayesian update setting, where we show that the SoTL estimator computes a lower bound to the model evidence, a quantity with sound theoretical justification for model selection (MacKay, 1992). We show empirically that our estimator can outperform a number of strong existing approaches in predicting the relative performance ranking among architectures, while speeding up different NAS approaches significantly.
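The quality of a performance estimator is commonly judged by how well it ranks architectures, summarised by the Spearman rank correlation between estimator scores and final test accuracies. The sketch below is a minimal pure-Python version of that evaluation protocol; the score and accuracy lists are made-up illustrative numbers, not results from the paper.

```python
def _ranks(xs):
    """1-based ranks of xs; tied values receive the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the group of values tied with xs[order[i]]
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Example (made-up numbers): a lower summed training loss should pair with a
# higher final accuracy, so we correlate the *negated* score with accuracy.
sotl_scores = [3.2, 1.1, 2.4, 0.9]
test_acc = [88.1, 93.5, 90.2, 94.0]
rho = spearman([-s for s in sotl_scores], test_acc)  # perfect rank agreement here
```

In practice one would use an established implementation such as `scipy.stats.spearmanr`; the hand-rolled version above only serves to make the protocol explicit.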

2. METHOD

We propose a simple metric that estimates the generalisation performance of a deep neural network model via the Sum of its Training Losses (SoTL). After training a deep neural network whose prediction is f θ (•) for T epochs (note that T can be far smaller than the total number of training epochs T end used in complete training), we sum the training losses collected so far:

$$\mathrm{SoTL} = \sum_{t=1}^{T} \frac{1}{B} \sum_{i=1}^{B} \ell\big(f_{\theta_{t,i}}(X_i), y_i\big), \qquad (1)$$

where ℓ is the training loss of a mini-batch (X i , y i ) at epoch t and B is the number of training steps within an epoch. If we use the first few epochs as a burn-in phase for θ t,i to converge to a certain distribution P (θ) and start the sum from epoch t = T - E + 1 instead of t = 1, we obtain the variant SoTL-E. In the case where E = 1, the sum starts at t = T and our estimator corresponds to the sum over training losses within epoch t = T. We discuss SoTL's theoretical interpretation, based on the Bayesian marginal likelihood and training speed, in Section 3, and empirically show in Section 5 that SoTL, despite its simple form, can reliably estimate the generalisation performance of neural architectures.

If the sum over training losses is a useful indicator of generalisation performance, one might expect the sum over validation losses (SoVL) to be a similarly effective performance estimator. However, SoVL lacks the link to the Bayesian model evidence, so its theoretical motivation differs from that of SoTL. Instead, the validation loss sum can be viewed as performing a bias-variance trade-off: the parameters at epoch t can be viewed as a potentially high-variance sample from a noisy SGD trajectory, so summation reduces the variance of the validation loss estimate at the expense of incorporating some bias, because the relative ranking of models' test performance changes during training. We show in Section 5 that SoTL clearly outperforms SoVL in estimating the true test performance.
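As a concrete illustration, the estimator in Eq. (1) reduces to simple bookkeeping once per-mini-batch losses are logged. The sketch below assumes a hypothetical `losses` record where `losses[t][i]` is the loss of mini-batch i in epoch t; it is not the paper's implementation, just a minimal rendering of the formula.

```python
def sotl(losses):
    """Sum over Training Losses (Eq. 1): sum of per-epoch mean mini-batch losses."""
    return sum(sum(epoch) / len(epoch) for epoch in losses)

def sotl_e(losses, E=1):
    """SoTL-E variant: start the sum at epoch T - E + 1, i.e. keep only the last E epochs."""
    return sotl(losses[-E:])

# Example: T = 3 epochs, B = 2 training steps per epoch (made-up loss values).
losses = [[1.0, 0.8], [0.6, 0.4], [0.3, 0.1]]
sotl(losses)         # sums the per-epoch means: 0.9 + 0.5 + 0.2
sotl_e(losses, E=1)  # mean loss of the final epoch only
```

Note that summing the per-epoch means (rather than averaging them) matters for the Bayesian interpretation in Section 3, where the sum plays the role of an area under the training loss curve.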

3. THEORETICAL MOTIVATION

The SoTL metric is a direct measure of training speed and draws inspiration from two lines of work: the first is a Bayesian perspective that connects training speed with the marginal likelihood in the model selection setting, and the second is the link between training speed and generalisation (Hardt et al., 2016). In this section, we summarise recent results that demonstrate the connection between SoTL and generalisation, and further show that in Bayesian updating regimes, the SoTL metric corresponds to an estimate of a lower bound on the model's marginal likelihood, under certain assumptions.

3.1. TRAINING SPEED AND THE MARGINAL LIKELIHOOD

We motivate the SoTL estimator by a connection to the model evidence, also called the marginal likelihood, which is the basis for Bayesian model selection. The model evidence quantifies how likely a dataset D is to have been generated by a model, and so can be used to update a prior belief distribution over which model from a given set is most likely to have generated D. Given a model with parameters θ, prior π(θ), and likelihood P (D|θ) for a training data set D = {D 1 , . . . , D n } with data points D i = (x i , y i ), the (log) marginal likelihood is expressed as follows:

$$\log P(D) = \log \mathbb{E}_{\pi(\theta)}\big[P(D \mid \theta)\big] = \sum_{i=1}^{n} \log P(D_i \mid D_{<i}) = \sum_{i=1}^{n} \log \mathbb{E}_{P(\theta \mid D_{<i})}\big[P(D_i \mid \theta)\big],$$

where D <i = {D 1 , . . . , D i-1 } and the second equality follows from the chain rule of probability. Interpreting the negative log posterior predictive probability -log P (D i |D <i ) of each data point as a 'loss' function, the log evidence then corresponds to the area under a training loss curve, where each data point is seen exactly once, in sequence.
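The identity between the marginal likelihood and the sum of log posterior predictives can be verified numerically in a toy conjugate model. The sketch below uses a hypothetical Beta(1,1)-Bernoulli model (not one of the paper's experiments), where both sides of the identity are available in closed form.

```python
import math

def log_evidence_direct(data):
    """Log marginal likelihood of a 0/1 sequence under a Beta(1,1)-Bernoulli model.

    log P(D) = log B(1+h, 1+t) - log B(1,1) = log( h! t! / (h+t+1)! ).
    """
    h = sum(data)          # number of ones ("heads")
    t = len(data) - h      # number of zeros ("tails")
    return math.lgamma(1 + h) + math.lgamma(1 + t) - math.lgamma(2 + h + t)

def log_evidence_sequential(data):
    """Sum of log posterior predictives: sum_i log P(D_i | D_<i)."""
    h = t = 0
    total = 0.0
    for d in data:
        # Posterior predictive of the next point (Laplace's rule of succession).
        p_one = (1 + h) / (2 + h + t)
        total += math.log(p_one if d == 1 else 1 - p_one)
        h += d
        t += 1 - d
    return total
```

Running both functions on any binary sequence gives the same value, which is the content of the decomposition above: the evidence is exactly the (negated, exponentiated) sum of sequential prediction "losses".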

