A SURGERY OF THE NEURAL ARCHITECTURE EVALUATORS

Abstract

Neural architecture search (NAS) has recently received extensive attention due to its effectiveness in automatically designing neural architectures. A major challenge in NAS is to conduct fast and accurate evaluation (i.e., performance estimation) of neural architectures. Commonly used fast architecture evaluators include parameter-sharing and predictor-based ones. Despite their high evaluation efficiency, their evaluation correlation (especially among the well-performing architectures) is still questionable. In this paper, we conduct an extensive assessment of both parameter-sharing and predictor-based evaluators on the NAS-Bench-201 search space, and break down how and why different configurations and strategies influence the fitness of the evaluators. Specifically, we develop a set of NAS-oriented criteria to understand the behavior of fast architecture evaluators at different training stages. Based on our experimental findings, we offer insights and suggestions to guide NAS applications and motivate further research.

1. INTRODUCTION

Studies have shown that architectures automatically discovered by NAS can outperform hand-crafted architectures in various applications, such as classification (Nayman et al., 2019; Zoph & Le, 2017), detection (Ghiasi et al., 2019; Chen et al., 2019b), video understanding (Ryoo et al., 2019), text modeling (Zoph & Le, 2017), etc. Early NAS algorithms (Zoph & Le, 2017) suffer from an extremely heavy computational burden, since the evaluation of neural architectures is slow. Thus, estimating the performance of a neural architecture in a fast and accurate way is vital for addressing the computational challenge of NAS. A neural architecture evaluator outputs a score for an architecture that indicates its quality. The straightforward solution is to train an architecture from scratch to convergence and then test it on the validation dataset, which is extremely time-consuming. Instead of exactly evaluating architectures on the target task, researchers usually construct a proxy model with fewer layers or fewer channels (Pham et al., 2018; Real et al., 2019; Wu et al., 2019), and train this model on a proxy task of smaller scale (Cai et al., 2018a; Elsken et al., 2018; Klein et al., 2017; Wu et al., 2019), e.g., a smaller dataset or a subset of the dataset, or training or finetuning for fewer epochs. Traditional evaluators conduct a separate training phase to acquire weights suitable for each architecture. In contrast, one-shot evaluation amortizes the training cost across architectures through parameter sharing or a global hypernetwork, thus significantly reducing the evaluation cost. Pham et al. (2018) construct an over-parametrized super network (supernet) such that all architectures in the search space are sub-architectures of the supernet.
Throughout the search process, the shared parameters in the supernet are updated on the training dataset split, and each architecture is evaluated by directly using the corresponding subset of the weights. Afterwards, the parameter-sharing technique has been widely used for architecture search in different search spaces (Wu et al., 2019; Cai et al., 2020) and incorporated with different search strategies (Liu et al., 2018b; Nayman et al., 2019; Xie et al., 2018; Yang et al., 2019; Cai et al., 2020). Hypernetwork-based evaluation (Brock et al., 2018; Zhang et al., 2018) is another type of one-shot evaluation strategy, in which a hypernetwork is trained to generate proper weights for each architecture. Since hypernetwork solutions are currently not generic, this paper concentrates on the evaluation of parameter-sharing evaluators. Whether one-shot strategies can provide highly correlated architecture evaluation results is essential for the efficacy of the NAS process. Many recent studies have focused on assessing the evaluation correlation of one-shot architecture evaluators (Bender et al., 2018; Sciuto et al., 2019; Zela et al., 2020).
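To make the parameter-sharing idea concrete, the following is a minimal sketch (plain Python with toy scalar "weights" and a synthetic ground truth; all names and sizes are illustrative, not taken from any paper's code) of scoring every sub-architecture of a tiny supernet with shared weights and then measuring how well the one-shot ranking agrees with ground truth via Kendall's tau:

```python
import math
import random

random.seed(0)

# Toy "supernet": 3 layers, 2 candidate ops per layer. Each op owns a single
# shared scalar weight that every sub-architecture reuses (parameter sharing),
# so no per-architecture training is needed.
N_LAYERS, N_OPS = 3, 2
shared_w = [[random.gauss(0.0, 1.0) for _ in range(N_OPS)] for _ in range(N_LAYERS)]

def one_shot_score(arch, x=1.0):
    """Evaluate a sub-architecture (a tuple of op indices, one per layer)
    using its slice of the shared weights; the output is a toy stand-in
    for validation accuracy."""
    for layer, op in enumerate(arch):
        x = math.tanh(shared_w[layer][op] * x)
    return x

def kendall_tau(xs, ys):
    """Rank correlation: (#concordant - #discordant pairs) / #pairs."""
    n, num = len(xs), 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            num += (s > 0) - (s < 0)
    return num / (n * (n - 1) / 2)

# Enumerate all 2^3 = 8 sub-architectures and score them one-shot.
archs = [(a, b, c) for a in range(N_OPS) for b in range(N_OPS) for c in range(N_OPS)]
one_shot = [one_shot_score(arch) for arch in archs]

# Synthetic "ground truth" (one-shot score + noise), only to show how the
# evaluation correlation would be computed against real trained accuracies.
ground_truth = [s + random.gauss(0.0, 0.05) for s in one_shot]
print("Kendall tau:", round(kendall_tau(one_shot, ground_truth), 2))
```

In a real NAS-Bench-201 assessment, `ground_truth` would be the benchmarked final accuracies, and the same correlation computation quantifies the fitness of the shared-weight evaluator.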


Besides one-shot evaluation strategies, predictor-based evaluation strategies (Luo et al., 2018; Liu et al., 2018a; Deng et al., 2017; Sun et al., 2019; Wang et al., 2018; Xu et al., 2019; Ning et al., 2020) use a performance predictor that takes the architecture description as input and outputs a predicted performance score. The performance predictor must be trained using "ground-truth" architecture performances. This paper utilizes the same set of criteria to evaluate and compare different performance predictors. The fast neural architecture evaluators (i.e., performance estimators) are summarized in Fig. 1, including parameter-sharing, hypernetwork-based, and predictor-based ones. This paper aims to systematically reveal the status of current architecture evaluation strategies. Specifically, we develop a set of NAS-oriented criteria to understand the behavior of fast architecture evaluators at different training stages, and based on our experimental findings, we offer insights and suggestions to guide NAS applications and motivate further research. The knowledge revealed by this paper includes: 1) Layer proxy brings a larger evaluation gap than channel proxy; thus, channel proxy can be utilized to reduce the computational cost, while proxy-less search w.r.t. the layer number is worth studying. 2) The convergence rates of different criteria vary during one-shot supernet training, which shows that good architectures are distinguished from bad ones in the early stage. 3) As training goes on, the one-shot performances of isomorphic sub-architectures become closer. 4) De-isomorphic sampling or post-hoc de-isomorphism handling can help avoid the over-estimation of simple architectures. 5) The parameter-sharing evaluator tends to over-estimate smaller architectures, and is better at comparing smaller models than larger ones.
6) One should use ranking losses rather than regression losses to train predictors, since they are more stable. 7) Different predictors under- or over-estimate different architectures, and even the current best predictor might still have trouble comparing large architectures. 8) As expected, architecture predictors can distinguish good architectures better after multiple stages of training, as the training data become more and more concentrated on the good architectures.
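As a concrete illustration of point 6), here is a minimal sketch (plain Python; the 3-bit architecture encoding, synthetic accuracy model, margin, and learning rate are all hypothetical choices, not taken from any particular predictor in this paper) of training a linear performance predictor with a pairwise hinge ranking loss, which only asks the predictor to order architecture pairs correctly rather than regress their exact accuracies:

```python
import random

random.seed(1)

# Hypothetical encoding: an architecture is a 3-bit op-choice vector, and its
# synthetic "ground-truth" accuracy is a noisy linear function of the choices.
def true_acc(arch):
    return 0.60 + 0.10 * arch[0] + 0.05 * arch[1] - 0.02 * arch[2] \
        + random.gauss(0.0, 0.01)

archs = [(a, b, c) for a in range(2) for b in range(2) for c in range(2)]
accs = [true_acc(arch) for arch in archs]

# Linear predictor trained with the pairwise hinge ranking loss
#   loss(i, j) = max(0, margin - (score_i - score_j))   when acc_i > acc_j,
# i.e., the truly better architecture must out-score the worse one by a margin.
w = [0.0, 0.0, 0.0]

def predict(arch):
    return sum(wi * ai for wi, ai in zip(w, arch))

MARGIN, LR = 0.1, 0.05
for _ in range(300):
    i, j = random.sample(range(len(archs)), 2)
    if accs[i] < accs[j]:
        i, j = j, i                      # make i the truly better architecture
    if MARGIN - (predict(archs[i]) - predict(archs[j])) > 0:
        # subgradient step on the hinge: push score_i up, score_j down
        for k in range(3):
            w[k] += LR * (archs[i][k] - archs[j][k])
```

After training, the predictor should rank the strongest encoding under this synthetic model, (1, 1, 0), above clearly weaker ones, even though its raw scores need not match the accuracies themselves; this indifference to the absolute scale is what makes ranking losses more stable than regression losses for this task.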



Figure 1: An overview of fast neural architecture evaluators (i.e., performance estimators).

