NAS-BENCH-301 AND THE CASE FOR SURROGATE BENCHMARKS FOR NEURAL ARCHITECTURE SEARCH

Abstract

The most significant barrier to the advancement of Neural Architecture Search (NAS) is its demand for large computational resources, which hinders scientifically sound empirical evaluations. As a remedy, several tabular NAS benchmarks were proposed to simulate runs of NAS methods in seconds. However, all existing tabular NAS benchmarks are limited to extremely small architectural spaces since they rely on exhaustive evaluations of the space. This leads to unrealistic results that do not transfer to larger search spaces. To overcome this fundamental limitation, we propose NAS-Bench-301, the first surrogate NAS benchmark, using a search space containing 10^18 architectures, many orders of magnitude larger than any previous tabular NAS benchmark. After motivating the benefits of a surrogate benchmark over a tabular one, we fit various regression models on our dataset, which consists of ∼60k architecture evaluations, and build surrogates via deep ensembles to also model uncertainty. We benchmark a wide range of NAS algorithms using NAS-Bench-301 and obtain comparable results to the true benchmark at a fraction of the real cost. Finally, we show how NAS-Bench-301 can be used to generate new scientific insights.

1. INTRODUCTION

Neural Architecture Search (NAS) promises to advance representation learning by automatically finding architectures that facilitate the learning of strong representations for a given dataset. NAS has already achieved state-of-the-art performance on many tasks (Real et al., 2019; Liu et al., 2019a; Saikia et al., 2019; Elsken et al., 2020) and has been used to create resource-aware architectures (Tan et al., 2018; Elsken et al., 2019a; Cai et al., 2020). For a review, we refer to Elsken et al. (2019b). Despite many advancements in terms of both efficiency and performance, empirical evaluations in NAS are still problematic. Different NAS papers often use different training pipelines, different search spaces and different hyperparameters, do not evaluate other methods under comparable settings, and cannot afford enough runs for testing significance. This practice impedes assertions about the statistical significance of reported results, as recently brought into focus by several authors (Yang et al., 2019; Lindauer & Hutter, 2019; Shu et al., 2020; Yu et al., 2020). To circumvent these issues and enable scientifically sound evaluations in NAS, several tabular benchmarks (Ying et al., 2019; Zela et al., 2020b; Dong & Yang, 2020; Klyuchnikov et al., 2020) have been proposed recently (see also Appendix A.1 for more details). However, all these benchmarks rely on an exhaustive evaluation of all architectures in a search space, which limits them to unrealistically small search spaces (so far containing only between 6k and 423k architectures). This is a far cry from standard spaces used in the NAS literature, which contain more than 10^18 architectures (Zoph & Le, 2017; Liu et al., 2019b). This discrepancy can cause results obtained on existing tabular NAS benchmarks not to generalize to realistic search spaces; e.g., promising anytime results of local search on existing tabular NAS benchmarks were shown not to transfer to realistic search spaces (White et al., 2020b).
To address these problems, we make the following contributions:

1. We present NAS-Bench-301, a surrogate NAS benchmark that is the first to cover a realistically-sized search space (namely the cell-based search space of DARTS (Liu et al., 2019b)), containing more than 10^18 possible architectures. This is made possible by estimating architecture performance via a surrogate model, removing the constraint to exhaustively evaluate the entire search space.
2. We empirically demonstrate that a surrogate fitted on a subset of architectures can in fact model the true performance of architectures better than a tabular benchmark (Section 2).
3. We analyze and release the NAS-Bench-301 training dataset consisting of ∼60k fully trained and evaluated architectures, which will also be publicly available in the Open Graph Benchmark (Hu et al., 2020) (Section 3).
4. Using this dataset, we thoroughly evaluate a variety of regression models as surrogate candidates, showing that strong generalization performance is possible even in large spaces (Section 4).
5. We utilize NAS-Bench-301 as a benchmark for running various NAS optimizers and show that the resulting search trajectories closely resemble the ground-truth trajectories. This enables sound simulations of thousands of GPU hours in a few seconds on a single CPU machine (Section 5).
6. We demonstrate that NAS-Bench-301 can help in generating new scientific insights by studying a previous hypothesis on the performance of local search in the DARTS search space (Section 6).

To foster reproducibility, we open-source all our code and data in a public repo: https://anonymous.4open.science/r/3f99ef91-c472-4394-b666-5d464e099aca/

2. MOTIVATION - CAN WE DO BETTER THAN A TABULAR BENCHMARK?

We start by motivating the use of surrogate benchmarks by exposing an issue of tabular benchmarks that has largely gone unnoticed. Tabular benchmarks are built around a costly, exhaustive evaluation of all possible architectures in a search space, and when an architecture's performance is queried, the tabular benchmark simply returns the respective table entry. The issue with this process is that the stochasticity of mini-batch training is also reflected in the performance of an architecture i, making it a random variable Y_i. Therefore, the table only contains the results of a few draws y_i ∼ Y_i (existing NAS benchmarks feature up to 3 runs per architecture). Given the variance in these evaluations, a tabular benchmark acts as a simple estimator that assumes independent random variables, and thus estimates the performance of an architecture based only on previous evaluations of the same architecture. From a machine learning perspective, knowing that similar architectures tend to yield similar performance, and that the variance of individual evaluations can be high (both shown to be the case by Ying et al. (2019)), it is natural to assume that better estimators exist. In the remainder of this section, we empirically verify this hypothesis and show that surrogate benchmarks can provide better performance estimates than tabular benchmarks based on less data.

Table 1: MAE between the performance predicted by a tabular/surrogate benchmark fitted on one seed each and the true performance of evaluations with the two other seeds (test seeds in brackets).

Train seed [test seeds]   1, [2, 3]    2, [1, 3]    3, [1, 2]
Tabular                   4.534e-3     4.546e-3     4.539e-3
Surrogate                 3.446e-3     3.455e-3     3.441e-3

Setup. We choose NAS-Bench-101 (Ying et al., 2019) as the tabular benchmark for our analysis and a Graph Isomorphism Network (GIN; Xu et al., 2019a) as our surrogate model.[1] Each architecture x_i in NAS-Bench-101 comes with 3 validation accuracies y_i^1, y_i^2, y_i^3 from training x_i with 3 different seeds. We excluded all diverged models with less than 50% validation accuracy on any of the three evaluations in NAS-Bench-101. We split this dataset to train the GIN surrogate model on one of the seeds, e.g., D_train = {(x_i, y_i^1)}_i, and evaluate on the mean of the other two, e.g., D_test = {(x_i, ȳ_i^23)}_i, where ȳ_i^23 = (y_i^2 + y_i^3)/2. We emphasize that training a surrogate to model a search space is not a typical inductive regression task but rather a transductive one: by definition of the search space, the set of possible architectures is known ahead of time (although it may be very large), so a surrogate model does not have to generalize to out-of-distribution data if the training data covers the space well.
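The core intuition behind this setup can be illustrated with a toy simulation (this is a synthetic sketch, not the paper's GIN surrogate or NAS-Bench-101 data; all names and the nearest-neighbour smoother are illustrative assumptions): if similar architectures perform similarly, an estimator that pools information across neighbours can beat the tabular single-draw lookup even when both see only one seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a search space: each "architecture" is a point x in [0, 1],
# its true accuracy is a smooth function of x, and each training seed adds
# independent noise (similar architectures -> similar performance).
n_archs, noise = 500, 0.01
x = rng.uniform(0, 1, n_archs)
true_acc = 0.90 + 0.05 * np.sin(4 * np.pi * x)
y1, y2, y3 = (true_acc + rng.normal(0, noise, n_archs) for _ in range(3))

y_test = (y2 + y3) / 2                      # ȳ^23: mean of the two held-out seeds

# Tabular "estimator": simply return the single seed-1 draw per architecture.
mae_tab = np.mean(np.abs(y1 - y_test))

# Minimal surrogate stand-in (NOT a GIN): a k-nearest-neighbour smoother over
# the seed-1 data, exploiting smoothness of performance across the space.
k = 15
order = np.argsort(x)
xs, ys = x[order], y1[order]
y_hat = np.empty(n_archs)
for i, xi in enumerate(x):
    j = int(np.searchsorted(xs, xi))
    lo, hi = max(0, j - k // 2), min(n_archs, j + k // 2)
    y_hat[i] = ys[lo:hi].mean()             # average over nearby architectures
mae_surr = np.mean(np.abs(y_hat - y_test))

print(f"tabular MAE:   {mae_tab:.4f}")
print(f"surrogate MAE: {mae_surr:.4f}")     # smoothing beats the single-draw lookup
```

With this seed the smoother's MAE is clearly below the tabular one, mirroring the pattern in Table 1: averaging over neighbours reduces the seed noise faster than it introduces bias, as long as performance varies smoothly across the space.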

Results

We compute the mean absolute error MAE = (1/n) Σ_i |ŷ_i − ȳ_i^23| of the surrogate model trained on D_train = {(x_i, y_i^1)}_i, where ŷ_i is the predicted validation accuracy and n = |D_test|. Table 1 shows that the surrogate model yields a lower MAE than the tabular benchmark, whose error is MAE = (1/n) Σ_i |y_i^1 − ȳ_i^23|.

[1] We used the GIN implementation by Errica et al. (2020); see Appendix B for details on training the GIN.
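The two error measures compared in Table 1 can be written down directly (the accuracy values below are hypothetical placeholders, not NAS-Bench-101 entries):

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error: (1/n) * sum_i |pred_i - target_i|."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

# Hypothetical per-architecture validation accuracies:
# y1      = draw from the training seed,
# y_bar23 = (y2 + y3) / 2 over the two held-out seeds,
# y_hat   = surrogate predictions.
y1      = np.array([0.941, 0.928, 0.935])
y_bar23 = np.array([0.938, 0.931, 0.939])
y_hat   = np.array([0.939, 0.930, 0.937])

mae_tabular   = mae(y1, y_bar23)     # tabular benchmark: reuse the seed-1 draw
mae_surrogate = mae(y_hat, y_bar23)  # surrogate benchmark: model prediction
```

Both estimators are scored against the same held-out target ȳ^23, so the comparison isolates how well each predicts the (noisy) expected performance of an architecture.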

