HOW DO PREDICTORS AFFECT SEARCH STRATEGIES IN NEURAL ARCHITECTURE SEARCH?

Abstract

Predictor-based Neural Architecture Search (NAS) is an important topic since it can efficiently reduce the computational cost of evaluating candidate architectures. Most existing predictor-based NAS algorithms aim to design different predictors to improve prediction performance. Unfortunately, even a promising performance predictor may suffer from accuracy decline due to long-term and continuous usage, leading to degraded performance of the search strategy. This naturally raises two questions: how do predictors affect search strategies, and how should the predictor be used appropriately? In this paper, we take a reinforcement learning (RL) based search strategy to study, both theoretically and empirically, the impact of predictors on search strategies. We first formulate a predictor-RL-based NAS algorithm as model-based RL and analyze it with a guarantee of monotonic improvement at each trial. Then, based on this analysis, we propose a simple procedure for predictor usage, named mixed batch, in which each update batch contains both ground-truth data and prediction data. The proposed procedure efficiently reduces the impact of predictor errors on search strategies while maintaining performance growth. Our algorithm, Predictor-based Neural Architecture Search with Mixed batch (PNASM), outperforms traditional NAS algorithms and prior state-of-the-art predictor-based NAS algorithms on three NAS-Bench-201 tasks and one NAS-Bench-ASR task.

1. INTRODUCTION

Neural Architecture Search (NAS) aims to automatically find effective architectures in a pre-defined search space for a given dataset (Baker et al., 2016; Zoph & Le, 2016), and has been shown to generate architectures that achieve promising results in many domains (Zoph et al., 2018; Tan & Le, 2019; Howard et al., 2019; Chen et al., 2020). However, due to the high computational cost of evaluating generated architectures, traditional NAS methods are prohibitively costly for real-world deployment. Recently, many approaches have been proposed to reduce the evaluation cost; they can be categorized into training-free predictors (Pham et al., 2018; Mellor et al., 2021) and training-based predictors (Wei et al., 2022; Springenberg et al., 2016; Shi et al., 2020; White et al., 2021a; Lu et al., 2021; Wen et al., 2020; Tang et al., 2020; Luo et al., 2018). Training-based methods, which train a performance predictor to predict final validation accuracy from architecture features, have received much more attention due to their better generalization ability. Recent efforts on training-based methods focus on improving prediction performance by designing models that precisely capture features of network architectures, e.g., GCNs and Transformers. Several works demonstrate robust predictions and combine them with traditional search strategies such as Bayesian Optimization (BO) (Springenberg et al., 2016; Shi et al., 2020; White et al., 2021a) and Evolutionary Algorithms (EA) (Wei et al., 2022; Lu et al., 2021). Unfortunately, even a promising performance predictor may suffer from accuracy decline due to long-term and continuous usage (Fig. 1), leading to performance collapse. Most existing works barely consider the impact of predictor usage on the search strategy. Inappropriate usage of the predictor may perform asymptotically worse than predictor-free counterparts.
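As a concrete illustration of the training-based approach, the sketch below fits a toy predictor on a few (architecture, accuracy) pairs and then queries it cheaply. It uses ridge regression over one-hot cell encodings purely as a stand-in for the GCN/Transformer predictors cited above; the encoding sizes, synthetic accuracies, and all names here are illustrative assumptions, not part of any cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(arch, num_ops=5, num_edges=6):
    """One-hot encode a cell given as a list of op indices, one per edge
    (a NAS-Bench-201-style cell has 6 edges and 5 candidate ops)."""
    x = np.zeros(num_edges * num_ops)
    for e, op in enumerate(arch):
        x[e * num_ops + op] = 1.0
    return x

class RidgePredictor:
    """Toy training-based predictor: ridge regression from architecture
    encodings to validation accuracy (stand-in for learned GCN/Transformer
    predictors)."""
    def __init__(self, lam=1e-2):
        self.lam = lam
        self.w = None

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        A = X.T @ X + self.lam * np.eye(X.shape[1])
        self.w = np.linalg.solve(A, X.T @ y)

    def predict(self, X):
        return np.asarray(X) @ self.w

# Usage: train on a few evaluated architectures, then query candidates
# without training them (this is where the evaluation cost is saved).
archs = [rng.integers(0, 5, size=6) for _ in range(50)]
accs = [0.6 + 0.05 * np.sum(a == 0) / 6 + 0.01 * rng.standard_normal()
        for a in archs]  # synthetic "ground-truth" accuracies
X = np.stack([encode(a) for a in archs])
pred = RidgePredictor()
pred.fit(X, accs)
est = pred.predict(encode(archs[0])[None, :])[0]
```

Any surrogate with a `fit`/`predict` interface slots into the same role; the paper's point is about how such a surrogate is *used* by the search strategy, not about this particular model.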
This leads to two natural questions: how do predictors affect search strategies, and how should the predictor be used to improve search efficiency? In this paper, we take an RL-based search strategy to study the impact of predictors on search strategies both theoretically and empirically. We first formulate a predictor-RL-based NAS algorithm as model-based RL and analyze a class of predictor-based NAS algorithms with improvement guarantees. The derivation indicates that if the predictor is used for a long time, predictor error compounding with policy error will lead to performance collapse. Then, based on this analysis, we propose a simple procedure for predictor usage, named mixed batch, which updates the search strategy with batches containing both ground-truth data and prediction data. The prediction data, on the one hand, greatly improves sample efficiency and, on the other hand, encourages policy exploration. The ground-truth data keeps the updated parameters close to the previous ones in parameter space and prevents a bad update from accidentally causing performance collapse (Fig. 2). We empirically demonstrate that the proposed procedure achieves pronounced improvements in performance compared to other predictor-based NAS approaches. To summarize, our contributions in this work are as follows:
• We conduct the first study of the impact of predictors on NAS search strategies, both theoretically and empirically.
• We formulate and analyze a category of predictor-RL-based NAS algorithms with improvement guarantees based on predictor error and policy error. Our theoretical analysis indicates that long-term use of the predictor degrades the performance of the search strategy.
• We propose a novel predictor-based NAS framework, PNASM (Predictor-based Neural Architecture Search with Mixed batch), which makes limited use of the performance predictor and improves search performance.
• Our proposed method outperforms both traditional and predictor-based NAS methods and achieves state-of-the-art results on CIFAR-10, CIFAR-100, and ImageNet-16-120 of NAS-Bench-201, and on TIMIT of NAS-Bench-ASR.
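The mixed-batch idea described above can be sketched as a REINFORCE-style update whose batch combines a few expensively evaluated (ground-truth) architectures with cheaper predictor-scored ones. This is a minimal sketch assuming a toy softmax policy over per-edge operations; all function names, the batch composition, and the learning rate are illustrative assumptions, not the exact PNASM procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_EDGES, NUM_OPS = 6, 5  # NAS-Bench-201-style cell

def softmax(logits):
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def sample_arch(logits):
    """Sample an architecture (one op index per edge) from the policy."""
    probs = softmax(logits)
    return np.array([rng.choice(NUM_OPS, p=probs[e]) for e in range(NUM_EDGES)])

def log_prob_grad(logits, arch):
    """Gradient of log pi(arch) w.r.t. the logits for a softmax policy."""
    g = -softmax(logits)
    g[np.arange(NUM_EDGES), arch] += 1.0
    return g

def mixed_batch_update(logits, true_evals, pred_evals, lr=0.1):
    """One REINFORCE step on a mixed batch of (arch, reward) pairs.

    true_evals: few ground-truth rewards (anchors the update, limits the
    damage a stale predictor can do); pred_evals: cheaper predictor-scored
    rewards (improve sample efficiency, encourage exploration)."""
    batch = list(true_evals) + list(pred_evals)
    baseline = np.mean([r for _, r in batch])  # variance-reduction baseline
    grad = np.zeros_like(logits)
    for arch, r in batch:
        grad += (r - baseline) * log_prob_grad(logits, arch)
    return logits + lr * grad / len(batch)

# Usage: one update step mixing one ground-truth reward with one
# predictor-estimated reward (real batches would be larger).
logits = np.zeros((NUM_EDGES, NUM_OPS))
arch_hi = np.zeros(NUM_EDGES, dtype=int)   # evaluated for real, reward 1.0
arch_lo = np.ones(NUM_EDGES, dtype=int)    # scored by the predictor, 0.0
new_logits = mixed_batch_update(logits, [(arch_hi, 1.0)], [(arch_lo, 0.0)])
```

The ratio of true to predicted samples in the batch is the knob this sketch exposes: more ground truth keeps updates conservative at higher evaluation cost, more predictions do the opposite, which is the trade-off the paper's analysis is about.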

2.1. NEURAL ARCHITECTURE SEARCH

Traditional NAS methods, such as reinforcement learning (Zoph & Le, 2016; Baker et al., 2016), evolutionary search (Real et al., 2019), and gradient-based search (Liu et al., 2019), have been shown to generate networks that outperform manually designed ones. However, these algorithms incur enormous search costs due to the high cost of evaluating generated architectures. To reduce



Figure 1: Cumulative error between true and predicted validation accuracy over sampled architectures. REINFORCE+Predictor denotes long-term, continuous use of the predictor without updating it.

Figure 2: Policy parameters under different update schemes. Left: policy parameters deviate far from the optimum due to the compounding error of long-term predictor usage. Right: limited, mixed usage of the predictor balances performance and computational cost.

