HOW PREDICTORS AFFECT SEARCH STRATEGIES IN NEURAL ARCHITECTURE SEARCH?

Abstract

Predictor-based Neural Architecture Search (NAS) is an important topic because it can substantially reduce the computational cost of evaluating candidate architectures. Most existing predictor-based NAS algorithms focus on designing different predictors to improve prediction performance. Unfortunately, even a promising performance predictor may suffer an accuracy decline under long-term, continuous usage, which degrades the performance of the search strategy. This naturally raises two questions: how do predictors affect search strategies, and how should the predictor be used? In this paper, we take a reinforcement learning (RL) based search strategy to study, theoretically and empirically, the impact of predictors on search strategies. We first formulate a predictor-RL-based NAS algorithm as model-based RL and analyze it with a guarantee of monotonic improvement at each trial. Then, based on this analysis, we propose a simple procedure for predictor usage, named mixed batch, in which each batch contains both ground-truth data and prediction data. The proposed procedure can efficiently reduce the impact of predictor errors on search strategies while maintaining performance growth. Our algorithm, Predictor-based Neural Architecture Search with Mixed batch (PNASM), outperforms traditional NAS algorithms and prior state-of-the-art predictor-based NAS algorithms on three NAS-Bench-201 tasks and one NAS-Bench-ASR task.

1. INTRODUCTION

Neural Architecture Search (NAS) aims to automatically find effective architectures in a pre-defined search space for a given dataset (Baker et al., 2016; Zoph & Le, 2016), and has been shown to generate architectures that achieve promising results in many domains (Zoph et al., 2018; Tan & Le, 2019; Howard et al., 2019; Chen et al., 2020). However, due to the high computational cost of evaluating the performance of generated architectures, traditional NAS methods are prohibitively costly in real-world deployment. Recently, many approaches have been proposed to reduce the evaluation cost, which can be categorized into training-free predictors (Pham et al., 2018; Mellor et al., 2021) and training-based predictors (Wei et al., 2022; Springenberg et al., 2016; Shi et al., 2020; White et al., 2021a; Lu et al., 2021; Wen et al., 2020; Tang et al., 2020; Luo et al., 2018). Training-based methods, which train a performance predictor to predict the final validation accuracy from the features of an architecture, have received much more attention due to their better generalization ability. Recent efforts on training-based methods focus on improving prediction performance by designing models that precisely capture features of network architectures, e.g., GCN and Transformer. Several works demonstrate their robust predictions and combine them with traditional search strategies such as Bayesian Optimization (BO) (Springenberg et al., 2016; Shi et al., 2020; White et al., 2021a) and Evolutionary Algorithms (EA) (Wei et al., 2022; Lu et al., 2021). Unfortunately, even a promising performance predictor may suffer an accuracy decline under long-term, continuous usage (Fig. 1), leading to performance collapse. Most existing works barely consider the impact of predictor usage on the search strategy. Inappropriate usage of the predictor may lead to asymptotically worse performance than predictor-free counterparts.

That leads to two natural questions: how do predictors affect search strategies, and how should the predictor be used to improve search efficiency?

Figure 2: Parameters of the policy updated in different ways. Left: the policy parameters deviate far from the optimal ones due to the compounding error of long-term predictor usage. Right: limited, mixed usage of the predictor can balance performance and computational cost.

In this paper, we take an RL-based search strategy to study the impact of predictors on search strategies both theoretically and empirically. We first formulate a predictor-RL-based NAS algorithm as model-based RL and analyze a class of predictor-based NAS algorithms with improvement guarantees. Our derivation indicates that if the predictor is used for a long time, the growing predictor error compounds with the policy error and leads to performance collapse. Then, based on the analysis, we propose a simple procedure for predictor usage, named mixed batch, to update the search strategy with batches that contain both ground-truth data and prediction data. The prediction data, on the one hand, greatly improves sample efficiency and, on the other hand, encourages policy exploration. The ground-truth data keeps the updated parameters close in parameter space and prevents a bad update from accidentally causing performance collapse (Fig. 2). We empirically demonstrate that the proposed procedure achieves pronounced improvements in performance compared to other predictor-based NAS approaches.

To summarize, our contributions in this work are as follows:
• We conduct the first study of the impact of predictors on NAS search strategies both theoretically and empirically.
• We formulate and analyze a category of predictor-RL-based NAS algorithms with improvement guarantees based on predictor error and policy error. The theoretical analysis indicates that long-term use of the predictor degrades the performance of the search strategy.
• We propose a novel predictor-based NAS framework, namely PNASM (Predictor-based Neural Architecture Search with Mixed batch), to make limited usage of the performance predictor and improve the search performance.
• Our proposed method outperforms both traditional and predictor-based NAS methods and achieves state-of-the-art results on CIFAR-10, CIFAR-100, and ImageNet-16-120 of NAS-Bench-201, and TIMIT of NAS-Bench-ASR.

2.1. NEURAL ARCHITECTURE SEARCH

Traditional NAS methods, such as reinforcement learning (Zoph & Le, 2016; Baker et al., 2016), evolutionary search (Real et al., 2019), and gradient-based search (Liu et al., 2019), have been shown to generate networks that outperform manually designed networks. However, these algorithms require enormous search costs due to the high evaluation cost of generated architectures. To reduce the search costs, researchers have proposed predictor-based methods to quickly estimate the performance of architectures instead of training them from scratch (Wei et al., 2022; Springenberg et al., 2016; Shi et al., 2020; White et al., 2021a; Lu et al., 2021; Wen et al., 2020; Tang et al., 2020; Luo et al., 2018). There are two classes of predictor-based NAS methods:

Training-based predictors. Training-based predictors follow a supervised learning paradigm to learn the correlation between network architectures and their corresponding performance. These predictors are usually used within BO frameworks (Springenberg et al., 2016; Shi et al., 2020; White et al., 2021a), within evolutionary frameworks (Wei et al., 2022; Lu et al., 2021), or without any search strategy (Wen et al., 2020) to conduct NAS. BONAS (Shi et al., 2020) adopts a GCN-based accuracy predictor as the surrogate function of BO to search for architectures. Similarly, BANANAS (White et al., 2021a) also uses BO to perform NAS and provides a thorough analysis of the relation between BO and the neural predictor. NPENAS (Wei et al., 2022) develops two kinds of neural predictors to guide the evolutionary strategy and boost its exploration ability. NPNAS (Wen et al., 2020) directly uses a regression model to predict the validation accuracy of a large number of random architectures and chooses the top-k architectures to obtain the best one. SemiNAS (Tang et al., 2020) proposes a semi-supervised predictor to capture the intrinsic similarities of labeled and unlabeled architectures. TNASP (Lu et al., 2021) uses a Transformer-based predictor and evolutionary algorithms to perform NAS. Like these algorithms, our model PNASM adopts a training-based predictor. Unlike them, we propose a novel update scheme that combines ground-truth data and prediction data to optimize the reinforcement learning strategy, which can reduce the impact of predictor error on the optimization strategy and improve search efficiency.

Training-free predictors. Recently, several works have proposed to compute statistics from a single minibatch of data with a single forward and backward propagation. NASWOT (Mellor et al., 2021) evaluates randomly initialized architectures based on the binary activation codes of ReLU units. TE-NAS (Chen et al., 2021a) analyzes the spectrum of the neural tangent kernel and the number of linear regions in the input space to rank architectures. Zero-Cost NAS (Abdelfattah et al., 2021) compares six conventional reduced-training proxies to compute a model's score. Although these training-free predictors achieve satisfying results on some datasets, their performance cannot be guaranteed in practice due to their limited generalization ability (Lu et al., 2021; White et al., 2021b).

2.2. MODEL-BASED REINFORCEMENT LEARNING

Model-based Reinforcement Learning (MBRL) methods have shown great success on real-world sequential decision problems due to their sample efficiency (Kaelbling et al., 1996). MBRL learns a model of the environment that predicts state transitions and rewards, so it is widely used for problems in which data are hard to collect from real-world physical systems. The dynamics of the environment are usually modeled by Gaussian processes (Deisenroth & Rasmussen, 2011), local linear models (Levine & Koltun, 2013; Kumar et al., 2016), and neural network function approximators (Draeger et al., 1995; Gal et al., 2016; Nagabandi et al., 2018; Janner et al., 2019; Yu et al., 2020; Shen et al., 2020). If we consider the evaluation of candidate architectures as an RL environment, we can formulate predictor-RL-based NAS as MBRL. The difference between our formulation and traditional MBRL is that our predictor is trained to predict rewards rather than state transitions.

3. METHOD

3.1. PRELIMINARY

NAS problem. Given a dataset $D$ and a search space $O$ of neural architectures, the RL-based optimization strategy searches for the best architecture $A^* \in O$ that maximizes the expected accuracy on the validation set $D_{valid}$, which is defined by:

$$A^* = \arg\max_{A \in O} \; \mathbb{E}_{(D_{train}, D_{valid}) \sim D}\left[R(A_{w^*}, D_{valid})\right] \quad \text{s.t.} \quad w^* = \arg\min_{w} L(A_w, D_{train}) \qquad (1)$$

where $R(A_{w^*}, D_{valid})$ measures the accuracy of an architecture $A$ with parameters $w^*$ on the validation data $D_{valid}$, $w$ denotes the parameters of the architecture, and $L$ represents the loss of the architecture on the training data $D_{train}$.

Predictor-based NAS. Since evaluating an architecture in Eq. 1 typically takes hours, many NAS methods use performance predictors to speed up this process. A performance predictor $f_\phi$ generally consists of an encoder $f_E$ and a regressor $f_R$, which encode the information of discrete architectures into continuous feature representations and learn the correlation between the network features and the network performance, respectively. Most training-based predictors are trained by supervised learning on a database containing neural architectures $A$ and their corresponding performances $R(A_{w^*}, D_{valid})$ (Luo et al., 2018; Wen et al., 2020; Chen et al., 2021b; Lu et al., 2021). That is, the predictor $f_\phi$ is trained to minimize the MSE loss between the predicted accuracy and the true accuracy of the architectures sampled from the database:

$$\phi^* = \arg\min_{\phi} \sum_{A} \left(R(A_{w^*}, D_{valid}) - f_\phi(A)\right)^2 \qquad (2)$$

where $f_\phi(A)$ denotes the predicted performance of $A$. After the above process, the performance predictor can quickly predict the final accuracy or ranking of unseen architectures.
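As a rough illustration of the objective in Eq. 2, the sketch below fits a predictor to a small database of architecture-accuracy pairs by least squares. The scalar architecture feature, the linear form f(x) = w·x + b, the toy database, and the names `mse` and `fit_linear_predictor` are all assumptions made for illustration; the predictors discussed in this paper use learned encoders and regressors instead.

```python
def mse(pairs, predict):
    """Mean squared error between true and predicted accuracies.

    Eq. 2 uses the sum of squared errors; dividing by the number of
    pairs does not change the minimizer.
    """
    return sum((r - predict(x)) ** 2 for x, r in pairs) / len(pairs)


def fit_linear_predictor(pairs):
    """Closed-form least squares for f(x) = w * x + b on scalar features."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_r = sum(r for _, r in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0.0:
        w = 0.0  # all features identical: fall back to predicting the mean
    else:
        w = sum((x - mean_x) * (r - mean_r) for x, r in pairs) / var_x
    b = mean_r - w * mean_x
    return lambda x: w * x + b


# Hypothetical database of (architecture feature, validation accuracy) pairs.
database = [(0.0, 0.50), (1.0, 0.60), (2.0, 0.70), (3.0, 0.80)]
predictor = fit_linear_predictor(database)
train_error = mse(database, predictor)
```

Once fitted, `predictor(x)` returns an estimate for an unseen feature value without any network training, which is the source of the speed-up that predictor-based NAS relies on.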

3.2. MBRL-BASED OPTIMIZATION STRATEGY

Predictor-RL-based NAS can be formulated as an MBRL problem with a tuple $\langle S, A, T, R, \gamma \rangle$, where $S$, $A$, $T$, $R$ and $\gamma$ denote the state space, the action space, the state transition dynamics, the reward function and the discount factor (please see Section A of the Suppl. for more details). Normally, MBRL corresponds to recovering the state transition dynamics $T$ and the reward function $R$. Following the RL-based NAS framework (Zoph & Le, 2016; Baker et al., 2016; Zoph et al., 2018), we only need to recover the reward function $R$, which corresponds to the predictor $f_\phi$. In this framework, an agent, also called the controller, samples a $T$-step trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ at each episode, which corresponds to the description of a neural architecture $A = a_{1:T}$. The performance of the generated architecture $A$ is then evaluated either by training from scratch or by the predictor. The evaluation result $R(\tau)$ is used as a reward signal to update the parameters $\theta$ of the policy $\pi$. After several iterations, the agent learns to generate, with high probability, an architecture with high reward (accuracy). The goal of the agent is to maximize the expected reward:

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] = \sum_{\tau} R(\tau)\, p(\tau|\theta) \qquad (3)$$

where $R(\tau)$ denotes the evaluated performance of the generated architecture $\tau$ and $p(\tau|\theta)$ denotes the probability of a trajectory $\tau$. As the reward signal $R$ is non-differentiable, one common approach is to use REINFORCE (Williams, 1992) to update the parameters $\theta$ of the policy:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k} \qquad (4)$$

where $\nabla_\theta J(\pi_\theta)$ is given by:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \nabla_\theta \log p(\tau|\theta)\right] \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla_\theta \log p(\tau^n|\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^n|s_t^n)\, R(\tau^n) \qquad (5)$$

where $N$ is the number of neural architectures that the agent generates at each iteration (equivalent to the batch size) and $T$ is the number of candidate operations (actions) of a neural architecture.
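The update in Eqs. 4 and 5 can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions: the policy is a table of independent per-step softmax logits rather than an LSTM controller, the reward is a toy function, and the names `softmax`, `sample_trajectory`, `reinforce_gradient` and `reinforce_step` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(theta, rng):
    """Sample one action per step from the per-step softmax policy."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0] for row in theta]


def reinforce_gradient(theta, trajectories, rewards):
    """Monte Carlo estimate of Eq. 5 for a tabular per-step softmax policy."""
    n = len(trajectories)
    grad = [[0.0] * len(row) for row in theta]
    for tau, reward in zip(trajectories, rewards):
        for t, action in enumerate(tau):
            probs = softmax(theta[t])
            for a, p in enumerate(probs):
                # d log pi(action) / d logit_a = 1[a == action] - p_a
                grad[t][a] += reward * ((1.0 if a == action else 0.0) - p) / n
    return grad


def reinforce_step(theta, grad, lr):
    """Gradient ascent update of Eq. 4."""
    return [[v + lr * g for v, g in zip(row, grow)] for row, grow in zip(theta, grad)]


rng = random.Random(0)
theta = [[0.0, 0.0, 0.0] for _ in range(4)]  # T = 4 steps, 3 candidate operations
batch = [sample_trajectory(theta, rng) for _ in range(20)]  # N = 20 architectures
# Toy reward (assumption): fraction of steps that chose operation 0.
rewards = [tau.count(0) / len(tau) for tau in batch]
grad = reinforce_gradient(theta, batch, rewards)
theta_new = reinforce_step(theta, grad, lr=0.1)
```

In this sketch the rewards could come either from real training or from a predictor; the update rule itself does not distinguish between the two, which is why predictor errors can flow directly into the policy parameters.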

3.3. MONOTONIC PREDICTOR-BASED IMPROVEMENT

In this section, we give a monotonic improvement guarantee for a general predictor-RL-based NAS algorithm, described in Algorithm 1, in which the policy is optimized on data provided by the predictor. The performance of the policy is affected by predictor usage, since errors in the predictor may be exploited by the policy optimization, leading to a large gap between the true performance of the policy and its performance under the predictor.

Algorithm 1 Monotonic Predictor-Based Policy Optimization
Input: number of initial samples S; batch size N
1: Collect S architectures by running policy $\pi_\theta$
2: Evaluate the S architectures by training from scratch and store the samples in $D_t$
3: Initialize predictor $f_\phi$ and policy $\pi_\theta$ with the collected samples
4: while time limit not exceeded do
5:     Collect N architectures by running policy $\pi_\theta$
6:     Evaluate the N architectures with the predictor $f_\phi$ and store the samples in B
7:     Update the policy $\pi_\theta$ via Eq. 4 with B
8: end while

Our goal is to build a performance guarantee for predictor-RL-based NAS. Motivated by MBPO (Janner et al., 2019), we wish to construct a lower bound of the following form:

$$\eta(\pi) \ge \hat{\eta}(\pi) - C \qquad (6)$$

where $\eta(\pi)$ represents the expected true performance of policy $\pi$ when it is updated with the reward signal obtained by training architectures from scratch, i.e., under the true dynamics, whereas $\hat{\eta}(\pi)$ denotes the expected performance of policy $\pi$ when it is updated with the reward signal provided by the predictor. Such a statement guarantees that, as long as we improve by at least $C$ under the performance predictor, we also improve the true performance $\eta$. The difference $C$ between the true performance and that under the predictor comes from two error quantities: the generalization error due to the limited prediction ability of the predictor, and the policy error (distribution shift) due to the updated policy receiving the reward signal provided by the predictor.

Since the performance predictor is trained using supervised learning, we define the generalization error $\epsilon_m$ by:

$$\max_{\tau \sim \pi_D} \left|R_1(\tau) - R_2(\tau)\right| \le \epsilon_m \qquad (7)$$

where $R_1(\tau)$ denotes the true performance of the architecture described by $\tau$, $R_2(\tau)$ denotes its predicted performance under the performance predictor, i.e., $R_2 := f_\phi$, and $\pi_D$ denotes the data-collecting policy. This error $\epsilon_m$ can be estimated in practice by measuring the difference between the true reward and the predicted reward on the same trajectory $\tau$, obtained under the data-collecting policy $\pi_D$. We define the policy error by the maximum total-variation distance of the policy between iterations:

$$\max_{s} D_{TV}\left(\pi(a|s) \,\|\, \pi_D(a|s)\right) \le \epsilon_\pi \qquad (8)$$

In practice, we can measure the KL divergence between policies. Based on these two sources of error (generalization error $\epsilon_m$ and policy error $\epsilon_\pi$), we now give our bound:

Theorem 1 Let the generalization error between the true reward and the predicted reward be bounded on each trajectory by $\epsilon_m$ and the policy divergence be bounded by $\epsilon_\pi$. Then the expected true reward and the expected predicted reward of the policy are bounded as:

$$\eta(\pi) \ge \hat{\eta}(\pi) - \underbrace{\left(\sum_{\tau=1}^{N} 2 R_{\max} \epsilon_\pi + \sum_{\tau=1}^{N} \epsilon_m\, p(\tau|\theta)\right)}_{C(\epsilon_m, \epsilon_\pi)} \qquad (9)$$

Proof. See Appendix Theorem B.1.

This bound implies that as long as we improve the expected reward $\hat{\eta}(\pi)$ under the predictor by more than $C(\epsilon_m, \epsilon_\pi)$, we can guarantee improvement of the expected true reward.
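To make the size of the gap term concrete, the following sketch evaluates C(ϵ_m, ϵ_π) from Theorem 1 for a batch of N trajectories. The function names and the numeric values are hypothetical and chosen only for illustration; in practice ϵ_m and ϵ_π would have to be estimated as described above.

```python
def predictor_gap(traj_probs, r_max, eps_m, eps_pi):
    """C(eps_m, eps_pi) from Theorem 1.

    traj_probs holds p(tau | theta) for each of the N trajectories in the batch.
    """
    n = len(traj_probs)
    policy_term = n * 2.0 * r_max * eps_pi      # sum over tau of 2 * R_max * eps_pi
    predictor_term = eps_m * sum(traj_probs)    # sum over tau of eps_m * p(tau | theta)
    return policy_term + predictor_term


def true_improvement_guaranteed(gain_under_predictor, gap):
    """True improvement is guaranteed once the predictor-side gain exceeds the gap."""
    return gain_under_predictor > gap


# Hypothetical numbers: N = 4 trajectories, rewards bounded by 1.
gap = predictor_gap([0.1, 0.1, 0.1, 0.1], r_max=1.0, eps_m=0.05, eps_pi=0.01)
```

Because the policy term grows linearly with N, the required predictor-side gain grows as more predictor-scored samples are used, which is the effect the next section aims to limit.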

3.4. MIXING REAL-BASED AND PREDICTOR-BASED UPDATES

Theorem 1 provides a useful relationship between true rewards and predicted rewards. However, if the predictor error $\epsilon_m$ is too high, there may not exist a policy that can guarantee the improvement. Besides, the analysis of Theorem 1 relies on using the predicted reward to update the policy continuously, i.e., equivalent to increasing $N$, which allows the model error to compound with the policy error and results in a large gap value $C$. Thus, we can improve the algorithm by relying less on the performance predictor when it is inaccurate and relying more on real data obtained by training neural architectures.

To address these issues, we introduce a simple procedure, mixed batch, to reduce the influence of the two errors on the policy. A policy with mixed batch, denoted as $\pi^{mix}$, means that a batch of $N$ samples is collected in two steps: first, run policy $\pi$ to generate the first $k$ architectures, which are evaluated by training from scratch (under the true environment); then, generate $N-k$ architectures that are evaluated by the learned performance predictor $f_\phi$ (Algorithm 2). Under this scheme, the expected reward can be bounded as follows:

Theorem 2 Given the expected reward $\eta(\pi^{mix})$ from the $k$-step mixed batch method, we have

$$\eta(\pi) \ge \eta(\pi^{mix}) - \left(\sum_{\tau=1}^{N} R_{\max} \epsilon_\pi + \sum_{\tau=k+1}^{N} R_{\max} \epsilon_\pi + \sum_{\tau=k+1}^{N} \epsilon_m\, p(\tau|\theta)\right) \qquad (10)$$

Proof. See Appendix Theorem B.2.

This bound implies that as long as we mix the true data and the prediction data into one batch, we can reduce the error caused by long-term use of the performance predictor.

Algorithm 2 (steps 7 to 12)
7:     Collect N - k architectures by running policy $\pi_\theta$
8:     Evaluate the N - k architectures with the predictor $f_\phi$ and store the samples in $D_p$
9:     Select k and N - k pairs from $D_t$ and $D_p$, respectively, to form a mini-batch B
10:    Update the policy $\pi_\theta$ via Eq. 4 with B
11:    Retrain the predictor $f_\phi$ via Eq. 2 with $D_t$
12: end while
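The mixed batch idea can be sketched as follows. This is an illustrative simplification, not the full Algorithm 2: it only shows how one batch of N samples is split between real evaluation and predictor evaluation, and how the gap term of Theorem 2 shrinks as k grows. The toy architecture sampler, the toy reward, the biased toy predictor, and the names `collect_mixed_batch` and `mixed_gap` are all assumptions.

```python
import random


def collect_mixed_batch(sample_arch, true_eval, predictor, n, k):
    """Score k sampled architectures by real evaluation and n - k by the predictor."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    real = [(a, true_eval(a)) for a in (sample_arch() for _ in range(k))]
    predicted = [(a, predictor(a)) for a in (sample_arch() for _ in range(n - k))]
    return real, predicted


def mixed_gap(traj_probs, k, r_max, eps_m, eps_pi):
    """Gap term of Theorem 2; traj_probs[i] is p(tau | theta) of the (i+1)-th sample."""
    n = len(traj_probs)
    return (n * r_max * eps_pi
            + (n - k) * r_max * eps_pi
            + eps_m * sum(traj_probs[k:]))


rng = random.Random(0)


def sample_arch():
    """Toy architecture: 4 operation choices out of 3 candidates (assumption)."""
    return tuple(rng.randrange(3) for _ in range(4))


def true_eval(arch):
    """Toy stand-in for training from scratch (assumption)."""
    return arch.count(0) / len(arch)


def toy_predictor(arch):
    """Toy predictor with a constant bias of 0.05 (assumption)."""
    return true_eval(arch) + 0.05


real, predicted = collect_mixed_batch(sample_arch, true_eval, toy_predictor, n=20, k=5)
batch = real + predicted  # mini-batch used for the policy update of Eq. 4

probs = [1.0 / 20] * 20
gap_predictor_only = mixed_gap(probs, k=0, r_max=1.0, eps_m=0.05, eps_pi=0.01)
gap_mixed = mixed_gap(probs, k=5, r_max=1.0, eps_m=0.05, eps_pi=0.01)
```

With k = 0 the gap term coincides with C(ϵ_m, ϵ_π) of Theorem 1, and every ground-truth sample added to the batch removes one policy-error term and one predictor-error term from the bound.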

4. EXPERIMENTS

We evaluate our model on four datasets: CIFAR-10, CIFAR-100, and ImageNet-16-120 from NAS-Bench-201, and TIMIT from NAS-Bench-ASR. We split our experiments into three categories: selecting the best predictor for the search spaces, evaluating the performance of our model and other popular algorithms on the two NAS benchmarks, and performing ablation experiments. Additional experimental results, implementation details, and information about the baselines are given in Appendix D.

4.1. CHOOSE PREDICTOR FOR NAS

To obtain a high-performance predictor-RL-based NAS algorithm, we first choose a high-performance predictor among the currently most popular performance predictors, including MLP, GCN, BANANAS, BONAS, NAO, SemiNAS, and Transformer. We perform 20 random runs for each predictor on CIFAR-10 and report the mean and variance of the Kendall's Tau correlation coefficient on the test samples. Kendall's Tau is a common indicator of the correlation between the ranking of the predicted values and that of the true labels; a higher value indicates more accurate prediction. The training and testing samples are sampled randomly from the collected architecture-accuracy pairs. Table 1 presents the comparison results, from which we can make the following observations: 1) As the number of training samples increases, the performance of the predictor improves. Therefore, it is important for all predictors to have enough initial training samples. 2) SemiNAS and Transformer perform well even with a small number of training samples. For example, their Kendall's Tau reaches around 0.550 when the number of training samples is 100 or 200. According to the experimental results, we choose SemiNAS as our performance predictor on NAS-Bench-201, since it adopts semi-supervised learning (Tang et al., 2020) to train the predictor, which makes full use of the unlabeled architecture information as the number of training samples increases and thus allows it to outperform Transformer.

Performance on NAS-Bench-201. Table 2 shows the comparison between the proposed method and strong baselines. According to the time budget, we perform 100 and 20 random runs for traditional and predictor-based methods, respectively. "search" means the total search time (time budget), including the time for initializing the predictor. "+B" and "+E" denote that we combine the predictors with the optimization strategies of Bayesian Optimization and Evolutionary Algorithm, respectively. The best result on each dataset is in boldface.
We can easily see from the experimental results that: 1) Our method (PNASM) outperforms both traditional and predictor-based methods and achieves 94.33%, 72.89%, and 46.44% on the three datasets, respectively, close to the optimal performance. 2) Compared with REINFORCE, our method improves the test accuracy on CIFAR-10, CIFAR-100, and ImageNet-16-120 by 0.31%, 0.54%, and 0.7%, respectively, which demonstrates that appropriate usage of the predictor can help the optimization strategy explore more promising regions of the search space. 3) Compared with the advanced SemiNAS+E, our PNASM (SemiNAS+RL) improves the validation accuracy on CIFAR-10, CIFAR-100, and ImageNet-16-120 by 0.06%, 0.24%, and 0.39%, respectively. The experimental results suggest that, compared with EA, the advantage of the RL-based optimization strategy lies in the proper usage of the predictor, i.e., the mixed batch, which contains both true architecture-accuracy pairs and predicted ones and thus leads to improved performance. 4) The performance of all predictors combined with BO is worse than that of predictors combined with EA and of BOHB, which is consistent with the experimental results in White et al. (2021b).

Performance on NAS-Bench-ASR. Table 3 shows the comparison between the proposed method and traditional NAS algorithms. Since there is no information about the time cost of each architecture on NAS-Bench-ASR, we terminate the search process when the number of sampled true architectures reaches 300. We perform 20 random runs for each method. The best result is in boldface. We can easily see from the experimental results that: 1) Our model (PNASM) achieves the best results. 2) Compared with REINFORCE, our method improves the validation PER by 0.12%. 3) REA performs well compared to other traditional NAS algorithms. As these results show, our model still performs well on other NAS benchmarks, which indicates that mixed batch does help the RL-based search strategy make full use of the predictor.
Table 4 shows the results of PNASM with different k values over 20 runs with different seeds, from which we can see that: 1) The model that uses no predicted data (k=all) performs well on the three datasets but incurs a large computational cost. Conversely, if we use the predictor all the time (k=0), the model has the lowest computational cost but performs poorly on the three datasets, which demonstrates that long-term usage of the predictor amplifies both the predictor error $\epsilon_m$ and the policy error $\epsilon_\pi$, leading to policy collapse. 2) A specific value of k can achieve performance comparable to that of k=all. For example, with k=5, the model achieves 91.50% on the validation set of CIFAR-10. The same holds for k=2 on CIFAR-100 and k=2 on ImageNet-16-120, which demonstrates that mixed batch is an effective way of using the predictor and allows the search strategy to maintain excellent performance at a lower computational cost. Compared to the model with k=all, the model with k=5 achieves around a 2× speedup on CIFAR-10, and the model with k=2 brings around a 4× speedup on CIFAR-100 and ImageNet-16-120. 3) Models with k=5, 8, and 15 outperform that with k=all on CIFAR-100. We speculate that the reason is that the usage of the predictor injects noise into the parameter space of the policy. Parameter noise limited to a reasonable range allows the policy to better explore the search space, as indicated by Fortunato et al. (2018) and Plappert et al. (2018).

To further study how k affects the search strategy during the search process, Fig. 3 presents the validation accuracy of the current best architecture, from which we can make the following observations: 1) The performance of k=0 (long-term and continuous predictor usage) is inferior to the others in most cases, which demonstrates that long-term usage of an unreliable predictor further exacerbates the policy error.
2) As the number of ground-truth samples increases (k from 2 to 15), the performance curve gets close to that of k=all (blue line). In particular, there is a specific value of k, e.g., k=5, that achieves validation accuracy comparable to that of k=all at a low computational cost. Therefore, k=5 is a good value that balances performance and computational cost. We recommend using 5 true samples in the mixed batch for other datasets.

Adaptive Method. Although a suitable k achieves a good trade-off between performance and time cost, choosing it requires careful tuning. To simplify the setting, we propose an adaptive method (PNASM-A), which dynamically adjusts k by measuring the difference between the policies in two consecutive iterations:

$$k = \alpha \times N \qquad (11)$$

where $\alpha$ is given by:

$$\alpha = \begin{cases} D_{KL}(\pi_{i-1}, \pi_i), & 0 \le D_{KL}(\pi_{i-1}, \pi_i) < 1 \\ 1, & 1 \le D_{KL}(\pi_{i-1}, \pi_i) \end{cases} \qquad (12)$$

where $\pi_{i-1}$ represents the policy after the $(i-1)$-th iteration and $\pi_i$ denotes the policy after the $i$-th iteration. For the $(i+1)$-th iteration, $k = \alpha N$. As shown in Table 2, the adaptive method (PNASM-A) still outperforms the other baselines on all three datasets. Besides, the performance of PNASM-A is close to that of PNASM, which demonstrates the effectiveness of our adaptive variant.
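The adaptive rule of Eqs. 11 and 12 can be sketched as below. Two points are assumptions rather than details from the paper: each policy is represented here by a single categorical distribution over operations, and the real-valued product αN is rounded up to an integer. The names `kl_divergence` and `adaptive_k` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def adaptive_k(prev_policy, new_policy, n):
    """Number of ground-truth samples for the next batch (Eqs. 11 and 12)."""
    alpha = min(kl_divergence(prev_policy, new_policy), 1.0)  # Eq. 12
    # Rounding up is an assumption; Eq. 11 gives k = alpha * N without a rounding rule.
    return math.ceil(alpha * n)


k_unchanged = adaptive_k([0.5, 0.5], [0.5, 0.5], n=20)       # identical policies
k_moderate = adaptive_k([0.5, 0.5], [0.9, 0.1], n=20)        # moderate policy shift
k_large = adaptive_k([0.99, 0.01], [0.01, 0.99], n=20)       # large policy shift
```

The rule spends more real evaluations exactly when the policy has moved far between iterations, which is when the policy error term in the bounds above is largest.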

5. CONCLUSION AND FUTURE WORK

In this paper, we investigate the role of predictor usage in neural architecture search procedures both theoretically and empirically. We first formulate predictor-RL-based NAS as a model-based RL problem and provide it with monotonic improvement guarantees, which suggest that long-term and continuous usage of the predictor degrades performance because the model error is exploited by the search policy. Motivated by this analysis, we then propose a novel framework, PNASM, that uses a special procedure, mixed batch, to control predictor usage, which can mitigate the impact of predictor errors on search strategies and reduce the computational cost. Extensive experiments on NAS-Bench-201 and NAS-Bench-ASR have shown the effectiveness of the proposed method. In the future, we plan to investigate how to appropriately use the predictor with other search strategies.

Under review as a conference paper at ICLR 2023

A WORKFLOW OF GENERATING ARCHITECTURE BY RL-AGENT

Figure 4 presents the main structure of the controller. The controller consists of two multilayer perceptron (MLP) layers, which serve as the input-embedding layer and the output-embedding layer, and an LSTM network, which is the core of the controller and remembers previous decisions. At each episode, the controller unrolls $T$ time steps to sample a trajectory $\tau = \{s_1, a_1, \ldots, s_T, a_T\}$, which describes the representation of an architecture. At each step $t$, it works as follows.

The input $s_t$ is fed into the input-embedding layer, which converts $s_t$ into a high-dimensional embedding $e_t$ so that the agent can better observe the state:

$$e_t = W_{in} \cdot s_t + b_{in} \qquad (13)$$

where $W_{in}$ and $b_{in}$ are the embedding parameters of the input layer. Then, $e_t$ is fed to the core network, which consists of an LSTM layer and helps the agent explore the correlation between decisions:

$$o_t, h_t = \mathrm{LSTM}(e_t, h_{t-1}) \qquad (14)$$

Next, the LSTM output $o_t$ is fed to the output layer, which converts it into a low-dimensional representation $y_t = [\mu_t, \sigma_t]$ that denotes the distribution over the candidate operations:

$$y_t = W_{out} \cdot o_t + b_{out} \qquad (15)$$

where $W_{out}$ and $b_{out}$ are the embedding parameters of the output layer. Finally, given the distribution over candidate operations, the agent samples one operation by the same sampling technique:

$$a_t = \mathrm{Sample}(\mathcal{N}(\mu_t, \sigma_t)) \qquad (16)$$

$y_t$ is fed in as the state at the next time step $t+1$:

$$s_{t+1} = y_t \qquad (17)$$

The initial state $s_1$ is a zero embedding vector. Under this workflow of the controller, the probability of a trajectory is:

$$p(\tau|\theta) = p(s_1)\, \pi_\theta(a_1|s_1)\, p(s_2|s_1, a_1)\, \pi_\theta(a_2|s_2)\, p(s_3|s_2, a_2) \cdots \pi_\theta(a_T|s_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t) \qquad (18)$$

Since the next state $s_{t+1}$ is equal to $y_t$, which is converted by the previous action $a_t$, both $p(s_1)$ and $p(s_{t+1}|s_t, a_t)$ are equal to one. Thus, $p(\tau|\theta)$ simplifies to:

$$p(\tau|\theta) = \prod_{t=1}^{T} \pi_\theta(a_t|s_t) \qquad (19)$$

Therefore, RL-based NAS can be formulated as an MDP with fixed transition probabilities but an unknown reward function.
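Because the transitions are deterministic, the trajectory probability reduces to a product of per-step action probabilities. The sketch below computes it in log space for numerical stability. It assumes, purely for illustration, that each step exposes an explicit categorical distribution over operations; the controller described above parameterizes its distribution through μ_t and σ_t instead, and the name `trajectory_log_prob` is hypothetical.

```python
import math


def trajectory_log_prob(step_probs, actions):
    """log p(tau | theta) = sum over t of log pi_theta(a_t | s_t).

    step_probs[t][a] is the probability of choosing operation a at step t.
    """
    return sum(math.log(step_probs[t][a]) for t, a in enumerate(actions))


# Hypothetical two-step policy over two candidate operations.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
log_p = trajectory_log_prob(step_probs, actions=[0, 1])
p = math.exp(log_p)  # equals 0.5 * 0.75
```

The same quantity appears in the REINFORCE gradient, where differentiating the sum of log-probabilities gives the per-step terms of the policy gradient.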

B THEOREMS

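Theorem B.1 Monotonic predictor-based improvement:

$$\eta(\pi) \ge \hat{\eta}(\pi) - \left(\sum_{\tau=1}^{N} 2 R_{\max} \epsilon_\pi + \sum_{\tau=1}^{N} \epsilon_m\, p(\tau|\theta)\right)$$

Proof. Let $\pi_D$ denote the data-collecting policy (the old policy under the true environment). Since the performance predictor relies on the training data collected by the policy $\pi_D$, we introduce $\pi_D$ by adding and subtracting $\eta(\pi_D)$, to get:

$$\eta(\pi) - \hat{\eta}(\pi) = \underbrace{\eta(\pi) - \eta(\pi_D)}_{L_1} + \underbrace{\eta(\pi_D) - \hat{\eta}(\pi)}_{L_2}$$

We can bound both $L_1$ and $L_2$ using Lemma C.2. For $L_1$, there is no predictor error (generalization error), since $\pi$ and $\pi_D$ both run under the true environment:

$$L_1 \ge -\sum_{\tau=1}^{N} R_{\max} \epsilon_\pi$$

For $L_2$, policy $\pi$ runs under the predictor model, which incurs both predictor error and policy error. Thus, we have:

$$L_2 \ge -\sum_{\tau=1}^{N} R_{\max} \epsilon_\pi - \sum_{\tau=1}^{N} \epsilon_m\, p(\tau|\theta)$$

The desired result is obtained by adding the two bounds together.

Theorem B.2 Mixed batch bound:

$$\eta(\pi) \ge \eta(\pi^{mix}) - \left(\sum_{\tau=1}^{N} R_{\max} \epsilon_\pi + \sum_{\tau=k+1}^{N} R_{\max} \epsilon_\pi + \sum_{\tau=k+1}^{N} \epsilon_m\, p(\tau|\theta)\right)$$

Proof. Let $\pi^{mix} := (\pi_D, \pi)$ denote the policy with the mixed batch, which runs the old policy $\pi_D$ under the true dynamics for the first $k$ samples and then executes the new policy $\pi$ under the predictor for the last $N-k$ samples. As in the proof of Theorem B.1, we add and subtract the reference quantity for $\pi_D$, which can also be written as $\pi_D := (\pi_D, \pi_D)$:

$$\eta(\pi) - \eta(\pi^{mix}) = \eta(\pi, \pi) - \eta(\pi_D, \pi) = \eta(\pi, \pi) - \eta(\pi_D, \pi_D) + \eta(\pi_D, \pi_D) - \eta(\pi_D, \pi) = \underbrace{\eta(\pi, \pi) - \eta(\pi_D, \pi_D)}_{L_1} + \underbrace{\eta(\pi_D, \pi_D) - \eta(\pi_D, \pi)}_{L_3}$$

$L_1$ is bounded as in the proof of Theorem B.1. For $L_3$, the two policies differ only on the last $N-k$ samples, which are evaluated under the predictor, so:

$$L_3 \ge -\sum_{\tau=k+1}^{N} R_{\max} \epsilon_\pi - \sum_{\tau=k+1}^{N} \epsilon_m\, p(\tau|\theta)$$

Adding the two bounds on $L_1$ and $L_3$ together yields the result.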

C LEMMAS

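Lemma C.1 Policy error:

$$\max_{\tau} \left|\prod_{t=1}^{T} \pi_1(a_t|s_t) - \prod_{t=1}^{T} \pi_2(a_t|s_t)\right| \le \epsilon_\pi$$

Proof. Considering that $p(s_1)$ and $p(s_{t+1}|s_t, a_t)$ in Eq. (18) are equal to one, we make the following approximation:

$$D_{TV}\left(\pi_1(a|s) \,\|\, \pi_2(a|s)\right) = \frac{1}{2} \sum_{s,a} \left|\pi_1(a|s) - \pi_2(a|s)\right| \approx \frac{1}{2} \sum_{\tau} \left|\prod_{t=1}^{T} \pi_1(a_t|s_t) - \prod_{t=1}^{T} \pi_2(a_t|s_t)\right| = D_{TV}\left(p(\tau|\theta_1) \,\|\, p(\tau|\theta_2)\right)$$

Thus,

$$\max_{s} D_{TV}\left(\pi_1(a|s) \,\|\, \pi_2(a|s)\right) \approx \max_{\tau} D_{TV}\left(p(\tau|\theta_1) \,\|\, p(\tau|\theta_2)\right) = \max_{\tau} \left|\prod_{t=1}^{T} \pi_1(a_t|s_t) - \prod_{t=1}^{T} \pi_2(a_t|s_t)\right| \le \epsilon_\pi$$

Lemma C.2 Expected reward bound:

$$\left|\eta(\pi_1) - \eta(\pi_2)\right| \le \sum_{\tau} R_{\max} \epsilon_\pi + \sum_{\tau} \epsilon_m\, p(\tau|\theta)$$

Proof. Here, $\eta(\pi_1)$ denotes the expected true reward of $\pi_1$ with the reward function $R_1$, and $\eta(\pi_2)$ denotes the expected reward of $\pi_2$ with the reward function $R_2$. Let $\max_{\tau \sim \pi_1} |R_1(\tau) - R_2(\tau)| \le \delta_m$ and $\max_{s} D_{TV}(\pi_1(a|s) \,\|\, \pi_2(a|s)) \le \delta_\pi$. According to Eq. 3, we have:

$$\left|\eta(\pi_1) - \eta(\pi_2)\right| = \left|\sum_{\tau} \left(R_1(\tau)\, p(\tau|\theta_1) - R_2(\tau)\, p(\tau|\theta_2)\right)\right|$$

Using Lemma C.1, we consider two cases.

a) $\eta(\pi_1) \ge \eta(\pi_2)$. Since $\max_{\tau \sim \pi_1} |R_1(\tau) - R_2(\tau)| \le \delta_m$, we get $-\delta_m + R_1(\tau) \le R_2(\tau) \le \delta_m + R_1(\tau)$. Then,

$$\begin{aligned}
\left|\eta(\pi_1) - \eta(\pi_2)\right| &= \left|\sum_{\tau} \left(R_1(\tau)\, p(\tau|\theta_1) - R_2(\tau)\, p(\tau|\theta_2)\right)\right| \\
&\le \left|\sum_{\tau} \left(R_1(\tau)\, p(\tau|\theta_1) - R_2(\tau)_{\min}\, p(\tau|\theta_2)\right)\right| \\
&= \left|\sum_{\tau} \left(R_1(\tau)\, p(\tau|\theta_1) - (R_1(\tau) - \delta_m)\, p(\tau|\theta_2)\right)\right| \\
&= \left|\sum_{\tau} R_1(\tau)\, p(\tau|\theta_1) - \sum_{\tau} R_1(\tau)\, p(\tau|\theta_2) + \sum_{\tau} \delta_m\, p(\tau|\theta_2)\right| \\
&\le \sum_{\tau} R_1(\tau) \left|p(\tau|\theta_1) - p(\tau|\theta_2)\right| + \sum_{\tau} \delta_m\, p(\tau|\theta_2) \\
&= \sum_{\tau} R_1(\tau) \left|\prod_{t=1}^{T} \pi_1(a_t|s_t) - \prod_{t=1}^{T} \pi_2(a_t|s_t)\right| + \sum_{\tau} \delta_m\, p(\tau|\theta_2) \\
&\le \sum_{\tau} R_{\max} \left|\prod_{t=1}^{T} \pi_1(a_t|s_t) - \prod_{t=1}^{T} \pi_2(a_t|s_t)\right| + \sum_{\tau} \delta_m\, p(\tau|\theta_2) \\
&\le \sum_{\tau} R_{\max} \delta_\pi + \sum_{\tau} \delta_m\, p(\tau|\theta_2)
\end{aligned}$$

b) $\eta(\pi_1) < \eta(\pi_2)$. Since $\max_{\tau \sim \pi_1} |R_1(\tau) - R_2(\tau)| \le \delta_m$, we obtain $-\delta_m + R_2(\tau) \le R_1(\tau) \le \delta_m + R_2(\tau)$.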
Then,

$$\begin{aligned}
|\eta(\pi_1) - \eta(\pi_2)| &= |\eta(\pi_2) - \eta(\pi_1)| \\
&= \left| \sum_{\tau} \big( R_2(\tau)\, p(\tau \mid \theta_2) - R_1(\tau)\, p(\tau \mid \theta_1) \big) \right| \\
&\le \left| \sum_{\tau} \big( R_2(\tau)\, p(\tau \mid \theta_2) - R_1(\tau)_{\min}\, p(\tau \mid \theta_1) \big) \right| \\
&= \left| \sum_{\tau} \big( R_2(\tau)\, p(\tau \mid \theta_2) - (R_2(\tau) - \delta_m)\, p(\tau \mid \theta_1) \big) \right| \\
&= \left| \sum_{\tau} R_2(\tau)\, p(\tau \mid \theta_2) - \sum_{\tau} R_2(\tau)\, p(\tau \mid \theta_1) + \sum_{\tau} \delta_m\, p(\tau \mid \theta_1) \right| \\
&\le \sum_{\tau} R_2(\tau) \left| p(\tau \mid \theta_2) - p(\tau \mid \theta_1) \right| + \sum_{\tau} \delta_m\, p(\tau \mid \theta_1) \\
&= \sum_{\tau} R_2(\tau) \left| p(\tau \mid \theta_1) - p(\tau \mid \theta_2) \right| + \sum_{\tau} \delta_m\, p(\tau \mid \theta_1) \\
&= \sum_{\tau} R_2(\tau) \left| \prod_{t=1}^{T} \pi_1(a_t \mid s_t) - \prod_{t=1}^{T} \pi_2(a_t \mid s_t) \right| + \sum_{\tau} \delta_m\, p(\tau \mid \theta_1) \\
&\le \sum_{\tau} R_{\max} \left| \prod_{t=1}^{T} \pi_1(a_t \mid s_t) - \prod_{t=1}^{T} \pi_2(a_t \mid s_t) \right| + \sum_{\tau} \delta_m\, p(\tau \mid \theta_1) \\
&\le \sum_{\tau} R_{\max}\, \delta_\pi + \sum_{\tau} \delta_m\, p(\tau \mid \theta_1)
\end{aligned}$$

In summary, we have $|\eta(\pi_1) - \eta(\pi_2)| \le \sum_{\tau} R_{\max}\, \delta_\pi + \sum_{\tau} \delta_m\, p(\tau \mid \theta)$, where $\theta = \theta_2$ if $\eta(\pi_1) \ge \eta(\pi_2)$; otherwise, $\theta = \theta_1$.

D EXPERIMENTAL DETAILS AND RESULTS

Baselines. We compare our method with two types of state-of-the-art methods: traditional NAS algorithms and predictor-based NAS algorithms.

1. Traditional NAS algorithms include: random search (RS) (Bergstra & Bengio, 2012), REA (Real et al., 2019), REINFORCE (Williams, 1992), and BOHB (Falkner et al., 2018). We use the code provided by Dong et al. (2021) to implement these algorithms. To ensure a fair comparison, we also require the search strategies of the baselines to sample unique architectures, as our method does.
2. Predictor-based NAS algorithms include: MLP (White et al., 2021a), GCN (Wen et al., 2020), BANANAS (White et al., 2021a), BONAS (Shi et al., 2020), NAO (Luo et al., 2018), SemiNAS (Tang et al., 2020), Transformer (Lu et al., 2021), and XGBoost (Chen & Guestrin, 2016). We compare the most representative performance predictors and select the best one as our predictor. In addition, we combine these performance predictors, except Transformer, with two widely used search strategies, BO and EA, to form the predictor-based NAS algorithms. We use the code from White et al. (2021b) and NAS-BENCH-SUITE (Mehta et al., 2022) to implement these algorithms.

Implementation details. Our method consists of two modules: an RL agent as the search strategy and a performance predictor. The agent consists of an input layer, an LSTM model, and an output layer. The input layer is an embedding layer, and the size of each embedding vector is 32. The LSTM model is a two-layer LSTM with 35 hidden units in each layer. The output layer is a linear layer with 32 hidden units. The agent is trained with the Adam optimizer with a learning rate of 0.001. The weights of the agent are initialized uniformly between -0.1 and 0.1. We use a tanh constant of 2.5 and a temperature of 5.0 for the sampling logits (Bello et al., 2017) to prevent premature convergence, and we add the controller's sample entropy to the reward, weighted by 0.0001. Notably, the batch size N is 20, which differs from the setting of Zoph & Le (2016). We set k to 2, 5, 2, and 2 for CIFAR-10, CIFAR-100, ImageNet-16-120, and TIMIT, respectively.
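The logit shaping and entropy bonus described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the order of operations (divide by the temperature, then apply the scaled tanh) follows a common controller convention that the text does not spell out.

```python
import math
import random

TANH_CONSTANT = 2.5    # value stated in the implementation details
TEMPERATURE = 5.0      # value stated in the implementation details
ENTROPY_WEIGHT = 1e-4  # weight of the sample entropy in the reward


def shape_logits(logits, temperature=TEMPERATURE, tanh_constant=TANH_CONSTANT):
    """Flatten the logits: temperature scaling, then a bounded tanh (assumed order)."""
    return [tanh_constant * math.tanh(x / temperature) for x in logits]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_step(raw_logits, rng):
    """Sample one action from the shaped logits and report the step entropy."""
    probs = softmax(shape_logits(raw_logits))
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, entropy(probs)


def entropy_augmented_reward(accuracy, step_entropies, weight=ENTROPY_WEIGHT):
    """Add the weighted sample entropy of a trajectory to its reward."""
    return accuracy + weight * sum(step_entropies)


rng = random.Random(0)
action, step_entropy = sample_step([0.3, -1.2, 4.0, 0.0, 2.2], rng)
print(action, round(step_entropy, 4))
```

Because the shaped logits are confined to the interval [-2.5, 2.5], no single operation can dominate the sampling distribution early in the search, which is the intended guard against premature convergence.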
The number of initial architecture-accuracy pairs for PNASM and PNASM-A is 100 across all experiments. (Note that, following Dong et al. (2021), we train candidate architectures for 12 epochs and retrieve the best architecture by training for the full number of epochs on NAS-Bench-201.)

D.2 EFFECTIVENESS STUDY ON COMPONENTS OF PNASM

To figure out which part of PNASM drives the improvement, we conduct a series of ablation experiments on PNASM, which mainly consists of an LSTM-based agent, a predictor, and the batch strategy. Table 5 shows the comparison results of different PNASM variants on NAS-Bench-201 (we also conduct the same experiment on NAS-Bench-ASR; see Table 7). The five variants are:

1. "LSTM" denotes that we run the search using only the LSTM-based agent, without a predictor. The agent (policy) is updated after every candidate.
2. "LSTM+Batch" denotes that we update the LSTM-based agent (without the predictor) after a batch of candidates is found.
3. "LSTM+P" denotes that we run the search with the LSTM-based agent combined with a predictor. The agent is updated after every architecture is found.
4. "LSTM+Batch+P" denotes that we use the LSTM-based agent and a predictor to search, but perform batch-wise updates of the agent, without evaluating any additional architectures from scratch.
5. PNASM denotes that we use the LSTM-based agent and a predictor, but perform mixed batch updates of the agent.

Predictors in all variants are initialized with 100 sampled architectures. We run each variant 20 times and use the search time as the time budget. Since "LSTM+P" and "LSTM+Batch+P" use the predictor for every candidate, their time cost of evaluating architectures is low. To ensure fair experiments, we allow "LSTM+P" and "LSTM+Batch+P" to sample 2000 architectures, which is more than the other variants sample. From Table 5, we can make the following observations:

1) The batch strategy does not improve the LSTM-based agent's performance on CIFAR-10, whereas it does on TIMIT.
2) Although the predictor can reduce the search time, its long and continuous usage introduces large errors into the search strategy, as indicated by the results of "LSTM+P" and "LSTM+Batch+P".
3) The mixed batch helps the LSTM-based agent make better use of the predictor, since the ground-truth data corrects the error caused by long-term usage.

Additionally, to better demonstrate the search process of all variants, Figure 5 plots the current best validation accuracy of all variants over the search time on CIFAR-10.
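The "current best validation accuracy" curve mentioned above is a running maximum over the sampled architectures. A minimal sketch follows; the function name and the accuracy values are made up for illustration and are not taken from the experiments.

```python
from itertools import accumulate


def best_so_far(accuracies):
    """Running maximum: the best accuracy seen up to each sample."""
    return list(accumulate(accuracies, max))


# Hypothetical validation accuracies in sampling order.
history = [90.1, 89.5, 91.0, 90.7]
print(best_so_far(history))
```

The resulting curve is non-decreasing by construction, so a late bad sample cannot lower it; differences between variants show up as how quickly the curve rises.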



Figure 1: Cumulative error between true val and predicted val over sampled architectures. REINFORCE+Predictor means long and continuous usage of predictor without updating it.

ABLATION STUDY

Impact of Number of True Samples k. The value of k determines the ratio of true samples to predicted ones in a batch, which balances performance against computational cost. To study the impact of k on the final performance, we conduct a series of experiments on the three datasets with different values of k. Each experiment samples 1000 architectures. "search" means the total time cost, including the time of predictor initialization.
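The role of k in a mixed batch can be sketched as follows. This is a simplified illustration, not the authors' code: `build_mixed_batch`, `true_eval`, and `predictor` are hypothetical names, and the two evaluators are stand-in callables. The sketch assumes, as in the mixed batch analysis, that the first k samples of a batch receive ground-truth rewards and the remaining N - k samples receive predicted rewards.

```python
def build_mixed_batch(architectures, k, true_eval, predictor):
    """Label the first k architectures with true rewards, the rest with predictions."""
    if not 0 <= k <= len(architectures):
        raise ValueError("k must lie between 0 and the batch size")
    batch = []
    for index, arch in enumerate(architectures):
        if index < k:
            # Expensive ground-truth evaluation (e.g. a benchmark lookup).
            batch.append((arch, true_eval(arch), "true"))
        else:
            # Cheap evaluation by the performance predictor.
            batch.append((arch, predictor(arch), "predicted"))
    return batch


# Hypothetical usage with N = 20 and k = 2.
archs = [f"arch_{i}" for i in range(20)]
batch = build_mixed_batch(archs, k=2,
                          true_eval=lambda a: 0.91,   # stand-in value
                          predictor=lambda a: 0.90)   # stand-in value
print(sum(1 for _, _, source in batch if source == "true"), "true samples")
```

A larger k spends more of the budget on ground-truth evaluations and shrinks the part of the batch exposed to predictor error; k = 0 reduces to the pure predictor setting.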

Figure 3: Validation accuracy vs. number of architectures on different settings of k.

Figure 4: Left. Internal structure of the controller. Right. Workflow of the controller. It unrolls T steps to output a trajectory τ = {s 1 , a 1 , . . . , s T , a T }, which describes an architecture.

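For the term $L_3 = \eta(\pi_D, \pi_D) - \eta(\pi_D, \pi)$ in the proof of Theorem B.2: after $k$ samples, $L_3$ differs in both model and policy, so it incorporates both the predictor error $\epsilon_m^{N-k}$ and the policy error $\epsilon_\pi^{N-k}$ of the last $N-k$ samples. It can be bounded by Lemma C.2 by setting $\epsilon_m^{N-k} = \epsilon_m$ and $\epsilon_\pi^{N-k} = \epsilon_\pi$, which results in the bound on $L_3$ used in that proof.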

Figure 5: Performance of different variants of PNASM on CIFAR-10. The left figure compares "LSTM", "LSTM+Batch", and PNASM. The right figure compares "LSTM+P" and "LSTM+Batch+P" separately, because the search time of these two variants is inconsistent with that of the others.

Performance Comparisons of Predictors on CIFAR-10.



Performance Comparisons of NAS methods on NAS-Bench-ASR.

Comparison of PNASM with different k values on NAS-Bench-201.

CIFAR-10 val: … ± 0.20 | 91.37 ± 0.22 | 91.50 ± 0.06 | 91.50 ± 0.05 | 91.40 ± 0.14 | 91.50 ± 0.04
CIFAR-10 test: 93.98 ± 0.22 | 94.06 ± 0.26 | 94.28 ± 0.16 | 94.26 ± 0.16 | 94.19 ± 0.19 | 94.31 ± 0.05
CIFAR-100 val: … ± 0.68 | 72.99 ± 0.36 | 73.09 ± 0.12 | 73.14 ± 0.16 | 73.02 ± 0.13 | 72.99 ± 0.10
CIFAR-100 test: 72.14 ± 0.52 | 72.84 ± 0.41 | 72.79 ± 0.42 | 72.77 ± 0.40 | 72.44 ± 0.32 | 72.36 ± 0.
ImageNet-16-120 val: … ± 0.82 | 46.30 ± 0.23 | 46.21 ± 0.29 | 46.21 ± 0.16 | 46.32 ± 0.00 | 46.31 ± 0.02
ImageNet-16-120 test: 45.84 ± 0.92 | 46.31 ± 0.35 | 46.41 ± 0.36 | 45.93 ± 0.66 | 46.47 ± 0.00 | 46.47 ± 1.

NAS-Bench-ASR (Mehrotra et al., 2020) is a tabular NAS benchmark for automatic speech recognition. The search space consists of 8242 unique models trained on the TIMIT dataset. Each model comes with a range of measurements, such as the per-epoch validation and final test metrics, Phoneme Error Rate (PER), and CTC loss. The search space consists of four nodes, with three main edges that can each take one of six operations, and six skip-connection edges, each of which can be set to on or off.
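As a quick sanity check on the size of this search space, the raw number of encodings can be counted directly from the description above. The sketch assumes that every edge is chosen independently; the variable names are hypothetical.

```python
from itertools import product

NUM_MAIN_EDGES = 3       # main edges, each taking one of six operations
NUM_OPERATIONS = 6
NUM_SKIP_EDGES = 6       # skip-connection edges, each on or off

# Enumerate every combination of main-edge operations and skip-edge states.
main_choices = product(range(NUM_OPERATIONS), repeat=NUM_MAIN_EDGES)
skip_choices = list(product((0, 1), repeat=NUM_SKIP_EDGES))
total_encodings = sum(len(skip_choices) for _ in main_choices)

# Closed form: 6^3 * 2^6 = 216 * 64.
assert total_encodings == NUM_OPERATIONS ** NUM_MAIN_EDGES * 2 ** NUM_SKIP_EDGES
print(total_encodings)
```

The raw count of 13824 encodings is larger than the 8242 unique models stated above, presumably because different encodings can describe the same model.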

Comparisons of different Variants of PNASM on CIFAR-10.

D.3 ADDITIONAL EXPERIMENTS ON NAS-BENCH-ASR

D.3.1 CHOOSING A PREDICTOR FOR NAS-BENCH-ASR

Since a predictor usually generalizes poorly across search spaces, we select a high-performance predictor for TIMIT on NAS-Bench-ASR separately. Table 6 compares the Spearman correlation of PER for four predictors: MLP, NAO, SemiNAS, and XGBoost. We run each predictor 20 times. From Table 6, we make the following observations: 1) Predictors show poor generalization ability across search spaces. For example, SemiNAS performs well on NAS-Bench-201 in Table 1, but fails on NAS-Bench-ASR. 2) MLP performs well if it has sufficient initial training data. 3) XGBoost performs well across the different initialization settings. According to these results, we choose XGBoost as the predictor on NAS-Bench-ASR. Table 7 shows the comparison of different variants of PNASM on NAS-Bench-ASR. Predictors are initialized with 100 true architecture-val pairs. The observations are similar to those from Table 5, except for the batch strategy: on NAS-Bench-ASR, the batch strategy helps the LSTM-based agent achieve a lower validation PER. We speculate that the effect of the batch strategy is closely related to the statistics of the dataset in the NAS benchmark.
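The Spearman correlation used to compare predictors is the Pearson correlation of the ranks of the true and predicted values. The following dependency-free sketch shows the computation; the helper names and the PER values are illustrative assumptions, and in practice a library routine would normally be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for index in order[start:end + 1]:
            ranks[index] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(true_values, predicted_values):
    """Spearman rank correlation between true and predicted values."""
    return pearson(average_ranks(true_values), average_ranks(predicted_values))


# Hypothetical true and predicted PER values for five architectures.
true_per = [0.22, 0.25, 0.21, 0.30, 0.27]
pred_per = [0.23, 0.24, 0.20, 0.33, 0.29]
print(round(spearman(true_per, pred_per), 3))
```

Because only the ranking matters, a predictor can score a high Spearman correlation even when its absolute PER estimates are biased, which is why this metric suits predictors that are used to rank candidate architectures.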

