WEAK NAS PREDICTOR IS ALL YOU NEED

Abstract

Neural Architecture Search (NAS) finds the best network architecture by exploring the architecture-to-performance manifold. It often trains and evaluates a large number of architectures, causing tremendous computation cost. Recent predictor-based NAS approaches attempt to address this problem with two key steps: sampling some architecture-performance pairs and fitting a proxy accuracy predictor. Existing predictors attempt to model the performance distribution over the whole architecture space, which could be too challenging given limited samples. Instead, we envision that this ambitious goal may not be necessary if the final aim is to find the best architecture. We present a novel framework that estimates weak predictors progressively. Rather than expecting a single strong predictor to model the whole space, we seek a progressive sequence of weak predictors that connect a path towards the best architecture, thus greatly simplifying the learning task of each predictor. Our method relies on a key property of these predictors: their probability of sampling better architectures keeps increasing. We thus sample only a few well-performing architectures, guided by the predictive model, to estimate the next, better weak predictor. Through this coarse-to-fine iteration, the ranking of the sampling space is refined gradually, which eventually helps find the optimal architectures. Experiments demonstrate that our method requires fewer samples to find top-performing architectures on NAS-Bench-101 and NAS-Bench-201, and it achieves state-of-the-art ImageNet performance on the NASNet search space.

1. INTRODUCTION

Neural Architecture Search (NAS) has become a central topic in recent years with great progress (Liu et al., 2018b; Luo et al., 2018; Wu et al., 2019; Howard et al., 2019; Ning et al., 2020; Wei et al., 2020; Wen et al., 2019; Chau et al., 2020; Luo et al., 2020). Methodologically, all existing NAS methods try to find the best network architecture by exploring the architecture-to-performance manifold, using reinforcement-learning-based (Zoph & Le, 2016), evolution-based (Real et al., 2019), or gradient-based (Liu et al., 2018b) approaches. In order to cover the whole space, they often train and evaluate a large number of architectures, thus causing tremendous computation cost. Recently, predictor-based NAS methods have alleviated this problem with two key steps: a sampling step that collects some architecture-performance pairs, and a performance modeling step that fits the performance distribution by training a proxy accuracy predictor. An in-depth analysis (Luo et al., 2018) finds that most of these methods (Ning et al., 2020; Wei et al., 2020; Luo et al., 2018; Wen et al., 2019; Chau et al., 2020; Luo et al., 2020) attempt to model the performance distribution over the whole architecture space. However, since the architecture space is often exponentially large and highly non-convex, modeling the whole space is very challenging, especially given limited samples. Meanwhile, the different types of predictors in these methods demand handcrafted architecture representations to improve performance. In this paper, we envision that the ambitious goal of modeling the whole space may not be necessary if the final goal is to find the best architecture. Intuitively, we assume the whole space can be divided into different sub-spaces, some of which are relatively good while others are relatively bad.
We tend to choose the good sub-spaces while neglecting the bad ones, ensuring that more samples are used to model the good sub-spaces precisely and thereby find the best architecture. From another perspective, instead of optimizing the predictor by sampling the whole space as existing methods do, we propose to jointly optimize the sampling strategy and the predictor learning, which improves sample efficiency and prediction accuracy simultaneously.
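The joint sampling-and-fitting loop described above can be sketched in a few lines. The following is a minimal toy illustration, not the paper's implementation: the binary feature vectors, the hidden linear evaluation function, and the ridge-regression "weak predictor" are all stand-in assumptions chosen only to make the iterative coarse-to-fine behavior concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy search space: each "architecture" is a binary feature vector.
# (A stand-in for a real NAS encoding such as a NAS-Bench cell spec.)
DIM = 16
space = rng.integers(0, 2, size=(500, DIM)).astype(float)

# Hidden ground-truth performance, unknown to the search. A noisy linear
# function is used purely for illustration; in real NAS this would be the
# validation accuracy obtained by training the architecture.
w_true = rng.normal(size=DIM)
def evaluate(archs):
    return archs @ w_true + 0.1 * rng.normal(size=len(archs))

# Initialize with a few random architecture-performance pairs.
sampled_idx = list(rng.choice(len(space), size=10, replace=False))
scores = {i: float(evaluate(space[i:i + 1])[0]) for i in sampled_idx}

# Progressive weak-predictor loop: fit a cheap predictor on the samples so
# far, sample only the architectures it ranks highest, refit, and repeat.
for _ in range(5):
    X = space[sampled_idx]
    y = np.array([scores[i] for i in sampled_idx])
    # Weak predictor: a ridge-regularized linear fit (a stand-in for the
    # MLP / random-forest / GP predictors one might actually use).
    w = np.linalg.solve(X.T @ X + 1e-2 * np.eye(DIM), X.T @ y)
    preds = space @ w
    # Evaluate a few top-ranked, not-yet-sampled architectures; these
    # samples concentrate in the currently promising sub-space.
    ranked = np.argsort(-preds)
    new = [i for i in ranked if i not in scores][:5]
    for i in new:
        scores[i] = float(evaluate(space[i:i + 1])[0])
    sampled_idx.extend(new)

best = max(scores, key=scores.get)
print("best architecture found, score:", scores[best])
```

Each predictor only needs to rank well locally, near the good sub-space where its training samples concentrate, rather than fit the whole manifold; the sequence of such weak predictors is what refines the ranking coarse-to-fine.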

