WEAK NAS PREDICTOR IS ALL YOU NEED

Abstract

Neural Architecture Search (NAS) finds the best network architecture by exploring the architecture-to-performance manifold. It often trains and evaluates a large number of architectures, causing tremendous computation cost. Recent predictor-based NAS approaches attempt to address this problem with two key steps: sampling some architecture-performance pairs and fitting a proxy accuracy predictor. Existing predictors attempt to model the performance distribution over the whole architecture space, which could be too challenging given limited samples. Instead, we envision that this ambitious goal may not be necessary if the final aim is to find the best architecture. We present a novel framework that estimates weak predictors progressively. Rather than expecting a single strong predictor to model the whole space, we seek a progressive line of weak predictors that can connect a path to the best architecture, thus greatly simplifying the learning task of each predictor. Our approach rests on the key property that the predictors' probability of sampling better architectures keeps increasing. We thus sample only a few well-performing architectures, guided by the predictive model, to estimate another, better weak predictor. Through this coarse-to-fine iteration, the ranking of the sampling space is refined gradually, which eventually helps find the optimal architectures. Experiments demonstrate that our method requires fewer samples to find top-performing architectures on NAS-Bench-101 and NAS-Bench-201, and it achieves state-of-the-art ImageNet performance on the NASNet search space.

1. INTRODUCTION

Neural Architecture Search (NAS) has become a central topic in recent years with great progress (Liu et al., 2018b; Luo et al., 2018; Wu et al., 2019; Howard et al., 2019; Ning et al., 2020; Wei et al., 2020; Wen et al., 2019; Chau et al., 2020; Luo et al., 2020). Methodologically, all existing NAS methods try to find the best network architecture by exploring the architecture-to-performance manifold, using reinforcement-learning-based (Zoph & Le, 2016), evolution-based (Real et al., 2019), or gradient-based (Liu et al., 2018b) approaches. In order to cover the whole space, they often train and evaluate a large number of architectures, thus incurring tremendous computation cost. Recently, predictor-based NAS methods have alleviated this problem with two key steps: a sampling step that collects some architecture-performance pairs, and a performance modeling step that fits the performance distribution by training a proxy accuracy predictor. An in-depth analysis of existing methods (Luo et al., 2018) finds that most of them (Ning et al., 2020; Wei et al., 2020; Luo et al., 2018; Wen et al., 2019; Chau et al., 2020; Luo et al., 2020) attempt to model the performance distribution over the whole architecture space. However, since the architecture space is often exponentially large and highly non-convex, modeling the whole space is very challenging, especially given limited samples. Meanwhile, the different types of predictors used in these methods demand handcrafted designs of the architecture representation to improve performance. In this paper, we envision that the ambitious goal of modeling the whole space may not be necessary if the final goal is to find the best architecture. Intuitively, we assume the whole space can be divided into different sub-spaces, some of which are relatively good and some relatively bad.
We tend to choose the good sub-spaces and neglect the bad ones, which ensures that more samples are used to model the good sub-spaces precisely and thus find the best architecture. From another perspective, instead of optimizing the predictor by sampling the whole space as existing methods do, we propose to jointly optimize the sampling strategy and the predictor learning, which improves sample efficiency and prediction accuracy simultaneously. Based on this motivation, we present a novel framework that estimates a series of weak predictors progressively. Rather than expecting a strong predictor to model the whole space, we instead seek a progressive evolution of weak predictors that can connect a path to the best architecture. In this way, the learning task of each predictor is greatly simplified. To ensure the path moves toward the best architecture, we increase the sampling probability of better architectures guided by the weak predictor at each iteration; the next weak predictor is then trained on the better samples in the following iteration. We iterate until we arrive at an embedding subspace where the best architectures reside. The weak predictor obtained at the final iteration becomes a dedicated predictor focused on this fine subspace, from which the best-performing architecture can be easily predicted. Compared to existing predictor-based NAS, our method has several merits. First, since only weak predictors are required to locate the good subspace, it yields better sample efficiency. On NAS-Bench-101 and NAS-Bench-201, it costs significantly fewer samples to find the top-performing architecture than existing predictor-based NAS methods. Second, it is much less sensitive to the architecture representation (e.g., different architecture embeddings) and the predictor formulation (e.g., MLP, Gradient Boosting Regression Tree, Random Forest); experiments show superior robustness across all their combinations.
Third, it generalizes to other search spaces: given a limited sample budget, it achieves state-of-the-art ImageNet performance on the NASNet search space.
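The progressive scheme described above can be sketched in a few lines. The following is a minimal, self-contained illustration and not the paper's actual implementation: the bit-string search space, the hidden performance function, and the nearest-neighbour weak predictor (names such as `SPACE`, `true_perf`, and `weak_predictor`) are all assumptions made for this toy sketch. The essential structure, though, mirrors the text: each iteration fits a cheap weak predictor on the architectures evaluated so far, ranks the remaining space with it, and spends the next batch of evaluations only on the top-ranked candidates.

```python
import random

random.seed(0)

# Toy search space: architectures encoded as 8-bit tuples. The hidden
# performance is the number of 1-bits plus small noise (a stand-in for
# the expensive train-and-evaluate step in real NAS).
SPACE = [tuple((i >> b) & 1 for b in range(8)) for i in range(256)]

def true_perf(arch):
    return sum(arch) + 0.1 * random.random()

def weak_predictor(history):
    """A deliberately weak predictor: score an architecture by the measured
    performance of its nearest evaluated neighbour (Hamming distance)."""
    def predict(arch):
        nearest = min(history, key=lambda h: sum(a != b for a, b in zip(arch, h[0])))
        return nearest[1]
    return predict

# Start from a small random sample of architecture-performance pairs.
history = [(a, true_perf(a)) for a in random.sample(SPACE, 8)]

# Progressive loop: each iteration samples only among the candidates that
# the current weak predictor ranks highest, then refits on the new pairs.
for _ in range(4):
    predict = weak_predictor(history)
    seen = {h[0] for h in history}
    ranked = sorted((a for a in SPACE if a not in seen), key=predict, reverse=True)
    history += [(a, true_perf(a)) for a in ranked[:8]]

best_arch, best_perf = max(history, key=lambda h: h[1])
```

Only 40 of the 256 architectures are ever evaluated; the point is that the later batches concentrate in the good sub-space, so no single predictor ever needs to rank the whole space accurately.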

2.1. REVISIT PREDICTOR-BASED NEURAL ARCHITECTURE SEARCH

Neural Architecture Search (NAS) finds the best network architecture by exploring the architecture-to-performance manifold. It can be formulated as an optimization problem: given a search space of network architectures $X$ and a discrete architecture-to-performance mapping function $f : X \to P$ from architecture set $X$ to performance set $P$, the objective is to find the best neural architecture $x^*$ with the highest performance $f(x)$ in the search space $X$:

$$x^* = \arg\max_{x \in X} f(x) \quad (1)$$

A naive solution is to estimate the performance mapping $f(x)$ over the full search space; however, this is prohibitively expensive since all architectures would have to be exhaustively trained from scratch. To address this problem, predictor-based NAS learns a proxy predictor $\tilde{f}(x)$ to approximate $f(x)$ using some architecture-performance pairs, which significantly reduces the training cost. In general, predictor-based NAS can be formulated as:

$$x^* = \arg\max_{x \in X} \tilde{f}(x)$$



Figure 1: Comparison between iterative weak predictors and a non-iterative strong predictor on the NAS-Bench-201 ImageNet subset. Our method significantly reduces the number of samples needed to reach the optimal architecture.

$$\tilde{f} = \arg\min_{\tilde{f} \in \tilde{F},\; S \in \mathcal{S}} \sum_{s \in S} L(\tilde{f}(s), f(s)) \quad (2)$$

where $L$ is the loss function for the predictor $\tilde{f}$, $\tilde{F}$ is the set of all possible approximations to $f$, and $\mathcal{S} := \{S \subseteq X \mid |S| \leq C\}$ is the family of candidate training sets for the predictor $\tilde{f}$ given a sample budget $C$. Here, $C$ is directly correlated with the total training cost. Our objective is to minimize the loss $L$ of the predictor $\tilde{f}$ on some sampled architectures $S$.
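The two coupled steps of this formulation, fitting a proxy $\tilde{f}$ on $C$ sampled pairs and then maximizing $\tilde{f}$ over $X$, can be illustrated on a toy problem. Everything below is a hypothetical stand-in chosen for this sketch, not the method's real search space or predictor: the search space is a range of integers, the hidden mapping `f` is a noisy linear function, and the predictor family $\tilde{F}$ is the set of lines fit by least squares (i.e., $L$ is squared error).

```python
import random

random.seed(1)

X = list(range(100))                 # toy search space X

def f(x):                            # hidden mapping f : X -> P (expensive in real NAS)
    return 3 * x + random.random()

# Sampling step: draw |S| <= C architecture-performance pairs.
C = 10
S = random.sample(X, C)
pairs = [(s, f(s)) for s in S]

# Fitting step (Eq. 2): choose f_tilde from a linear family by minimizing
# squared-error loss L -- here via the closed-form least-squares solution.
n = len(pairs)
mx = sum(s for s, _ in pairs) / n
my = sum(p for _, p in pairs) / n
slope = (sum((s - mx) * (p - my) for s, p in pairs)
         / sum((s - mx) ** 2 for s, _ in pairs))
intercept = my - slope * mx

def f_tilde(x):
    return slope * x + intercept

# Search step (Eq. 1 with the proxy): maximize the cheap f_tilde over all of X.
x_star = max(X, key=f_tilde)
```

Only the $C$ sampled architectures are ever evaluated with the expensive $f$; the final $\arg\max$ runs entirely on the cheap proxy, which is the source of the training-cost reduction.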

