REVISITING THE TRAIN LOSS: AN EFFICIENT PERFORMANCE ESTIMATOR FOR NEURAL ARCHITECTURE SEARCH

Abstract

Reliable yet efficient evaluation of the generalisation performance of a proposed architecture is crucial to the success of neural architecture search (NAS). Traditional approaches face a variety of limitations: training each architecture to completion is prohibitively expensive, early-stopping estimates may correlate poorly with fully trained performance, and model-based estimators require large training sets. Instead, motivated by recent results linking training speed and generalisation under stochastic gradient descent, we propose to estimate the final test performance from the sum of training losses. Our estimator is inspired by the marginal likelihood, which is used for Bayesian model selection. Our model-free estimator is simple, efficient, and cheap to implement, and does not require hyperparameter tuning or surrogate training before deployment. We demonstrate empirically that our estimator consistently outperforms other baselines under various settings and can achieve a rank correlation of 0.95 with final test accuracy on the NAS-Bench-201 dataset within 50 epochs.

1. INTRODUCTION

Reliably estimating the generalisation performance of a proposed architecture is crucial to the success of Neural Architecture Search (NAS) but has always been a major bottleneck in NAS algorithms (Elsken et al., 2018). The traditional approach of training each architecture for a large number of epochs and evaluating it on validation data (full evaluation) provides a reliable performance measure, but requires prohibitively high computational resources on the order of thousands of GPU days (Zoph & Le, 2017; Real et al., 2017; Zoph et al., 2018; Real et al., 2019; Elsken et al., 2018). This motivates the development of methods for speeding up performance estimation to make NAS practical for limited computing budgets. A popular simple approach is early stopping, which offers a low-fidelity approximation of generalisation performance by training for fewer epochs (Li et al., 2016; Falkner et al., 2018; Li & Talwalkar, 2019). However, if we stop the training early at a small number of epochs and evaluate the model on validation data, the relative performance ranking may not correlate well with the performance ranking of the full evaluation (Zela et al., 2018). Another line of work focuses on learning curve extrapolation (Domhan et al., 2015; Klein et al., 2016b; Baker et al., 2017), which trains a surrogate model to predict the final generalisation performance based on the initial learning curve and/or meta-features of the architecture. However, training the surrogate often requires hundreds of fully evaluated architectures to achieve satisfactory extrapolation performance, and the hyper-parameters of the surrogate also need to be optimised. Alternatively, the idea of weight sharing is adopted in one-shot NAS methods to speed up evaluation (Pham et al., 2018; Liu et al., 2019; Xie et al., 2019b).
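The rank correlation between an estimator's scores and the fully trained performance is the usual yardstick for comparing such performance estimators. As a minimal illustration (with made-up accuracy numbers, and a no-ties Spearman implementation written here for clarity, not taken from the paper), one can check how well an early-stopping estimate preserves the final ranking:

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two score lists.

    Assumes no tied values, so the simple closed-form
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) applies.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical validation accuracies for four architectures:
# after a few epochs (early stopping) vs. after full training.
early_acc = [0.62, 0.55, 0.70, 0.58]
final_acc = [0.91, 0.90, 0.93, 0.88]

rho = spearman_rho(early_acc, final_acc)  # 1.0 would mean identical rankings
```

With these illustrative numbers the early-stopped ranking only partially agrees with the final one (rho = 0.8), which is exactly the failure mode Zela et al. (2018) point to.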
Despite leading to significant cost savings, weight sharing heavily underestimates the true performance of good architectures and is unreliable in predicting the relative ranking among architectures (Yang et al., 2020; Yu et al., 2020). In view of the above limitations, we propose a simple model-free method which provides a reliable yet computationally cheap estimation of the generalisation performance ranking of architectures: the Sum over Training Losses (SoTL). Our method harnesses the training losses of the commonly used SGD optimiser during training, and is motivated by recent empirical and theoretical results linking training speed and generalisation (Hardt et al., 2016; Lyle et al., 2020). We ground our method in the Bayesian update setting, where we show that the SoTL estimator computes a lower bound to the

