GRAPHPNAS: LEARNING DISTRIBUTIONS OF GOOD NEURAL ARCHITECTURES VIA DEEP GRAPH GENERATIVE MODELS

Abstract

Neural architectures can be naturally viewed as computational graphs. Motivated by this perspective, we, in this paper, study neural architecture search (NAS) through the lens of learning random graph models. In contrast to existing NAS methods, which largely focus on searching for a single best architecture, i.e., point estimation, we propose GraphPNAS, a deep graph generative model that learns a distribution of well-performing architectures. Relying on graph neural networks (GNNs), our GraphPNAS can better capture the topologies of good neural architectures and the relations between operators therein. Moreover, our graph generator leads to a learnable probabilistic search method that is more flexible and efficient than the commonly used RNN generator and random search methods. Finally, we learn our generator via an efficient reinforcement learning formulation for NAS. To assess the effectiveness of our GraphPNAS, we conduct extensive experiments on three search spaces: the challenging RandWire on Tiny-ImageNet, ENAS on CIFAR10, and NAS-Bench-101. The complexity of RandWire is significantly larger than that of other search spaces in the literature. We show that our proposed graph generator consistently outperforms the RNN-based one and achieves performance better than or comparable to state-of-the-art NAS methods.

1. INTRODUCTION

In recent years, we have witnessed a rapidly growing list of successful neural architectures that underpin deep learning, e.g., VGG, LeNet, ResNets (He et al., 2016), and Transformers (Dosovitskiy et al., 2020). Designing these architectures requires researchers to go through time-consuming trial and error. Neural architecture search (NAS) (Zoph & Le, 2016; Elsken et al., 2018b) has emerged as an increasingly popular research area which aims to automatically find state-of-the-art neural architectures without a human in the loop. NAS methods typically have two components: a search module and an evaluation module. The search module is expressed by a machine learning model, such as a deep neural network, designed to operate in a high-dimensional search space. The search space of all admissible architectures is often designed by hand in advance. The evaluation module takes an architecture as input and outputs a reward, e.g., the performance of this architecture after training, measured by some metric. The learning process of NAS methods typically iterates between the following two steps: 1) the search module produces candidate architectures and sends them to the evaluation module; 2) the evaluation module evaluates these architectures to obtain rewards and sends the rewards back to the search module. Ideally, based on the feedback from the evaluation module, the search module should learn to produce better and better architectures. Unsurprisingly, this learning paradigm of NAS fits well with reinforcement learning (RL). Most NAS methods (Liu et al., 2018b; White et al., 2020; Cai et al., 2019) only return a single best architecture (i.e., a point estimate) after the learning process. This point estimate could be heavily biased as it typically under-explores the search space. Further, a given search space may contain multiple (equally) good architectures, a feature that a point estimate cannot capture.
Even worse, since the learning problem of NAS is essentially a discrete optimization where multiple local minima exist, many local-search-style NAS methods (Ottelander et al., 2020) tend to get stuck in local minima. From the Bayesian perspective, modeling the distribution of architectures is inherently better than point estimation, e.g., it enables ensemble methods that work better in practice. Moreover, modeling the distribution of architectures naturally caters to probabilistic search methods, which are better suited for avoiding local optima, e.g., simulated annealing. Finally, modeling the distribution of architectures allows us to capture complex structural dependencies between operations that characterize good architectures capable of efficient learning and generalization. Motivated by the above observations and the fact that neural architectures can be naturally viewed as attributed graphs, we propose a probabilistic graph generator which models the distribution over good architectures using graph neural networks (GNNs). Our generator excels at generating topologies with complicated structural dependencies between operations. From the Bayesian inference perspective, our generator returns a distribution over good architectures rather than a single point estimate, allowing us to capture the multi-modal nature of the posterior distribution over good architectures and to effectively average or ensemble architecture (sample) estimates. Different from Bayesian deep learning (Neal, 2012; Blundell et al., 2015; Gal & Ghahramani, 2016), which models distributions of weights/hidden units, we model distributions of neural architectures. Lastly, our probabilistic generator is less prone to the issue of local minima, since multiple random architectures are generated at each step during learning. In summary, our key contributions are as follows.
• We propose a GNN-based graph generator for neural architectures which empowers a learnable probabilistic search method. To the best of our knowledge, we are the first to explore learning deep graph generative models as generators in NAS.
• We explore a significantly larger search space (e.g., graphs with 32 operators) than the literature (e.g., graphs with up to 12 operators) and propose to evaluate architectures in a low-data regime, which together boost the effectiveness and efficiency of our NAS system.
• Extensive experiments on three different search spaces show that our method consistently outperforms RNN-based generators and is slightly better than or comparable to state-of-the-art NAS methods. It also generalizes well across different NAS system setups.

2. RELATED WORKS

Neural Architecture Search. The main challenges in NAS are 1) the hardness of discrete optimization, 2) the high cost of evaluating neural networks, and 3) the lack of principles in search space design. First, to tackle the discrete optimization, evolution strategies (ES) (Elsken et al., 2019; Real et al., 2019a), reinforcement learning (RL) (Baker et al., 2017; Zhong et al., 2018; Pham et al., 2018; Liu et al., 2018a), Bayesian optimization (Bergstra et al., 2013; White et al., 2019), and continuous relaxations (Liu et al., 2018b) have been explored in the literature. We follow the RL path as it is principled, flexible in injecting prior knowledge, achieves state-of-the-art performance (Tan & Le, 2019), and can be naturally applied to our graph generator. Second, evaluation requires training individual neural architectures, which is notoriously time consuming (Zoph & Le, 2016). Pham et al. (2018) and Liu et al. (2018b) propose a weight-sharing supernet to reduce the training time. Baker et al. (2018) use a machine learning model to predict the performance of fully-trained architectures conditioned on early-stage performance. Brock et al. (2018) and Zhang et al. (2018) directly predict weights for the searched architectures via hypernetworks. Since our graph generator does not rely on a specific choice of evaluation method, we experiment with both oracle training (training from scratch) and supernet settings for completeness. Third, the search space of NAS largely determines the optimization landscape and bounds the best-possible performance. Obviously, the larger the search space, the better the best-possible performance and the higher the search cost would likely be. Beyond this trade-off, few principles are known about designing the search space. Previous work (Pham et al., 2018; Liu et al., 2018b; Ying et al., 2019; Li et al., 2020) mostly focuses on cell-based search spaces.
A cell is defined as a small (e.g., up to 8 operators) computational graph whose nodes (i.e., operators like 3×3 convolution) are connected following some topology. Once the search is done, one often stacks up multiple cells with the same topology but different weights to build the final neural network. Other works (Tan et al., 2019; Cai et al., 2019; Tan & Le, 2019) typically fix the topology, e.g., a sequential backbone, and search for layer-wise configurations (e.g., operator types like 3×3 vs. 5×5 convolution and the number of filters). In our method, to demonstrate our graph generator's ability to explore large topology search spaces, we first explore a challenging large cell space (32 operators), after which we experiment on ENAS Macro (Pham et al., 2018) and NAS-Bench-101 (Ying et al., 2019) for more comparisons with previous methods.

Neural Architecture as Graph for NAS. Recently, a line of NAS research works propose to view neural architectures as graphs and encode them using graph neural networks (GNNs). In (Zhang et al., 2020; Luo et al., 2018), graph auto-encoders are used to map neural architectures to and back from a continuous space for gradient-based optimization. Shi et al. (2020) use Bayesian optimization (BO), where GNNs are used to obtain embeddings of neural architectures. Despite the extensive use of GNNs as encoders, few works focus on building graph generative models for NAS. Closely related to our work, Xie et al. (2019) explore different topologies of a similar cell space using non-learnable random graph models. You et al. (2020) subsequently investigate the relationship between topologies and performance. Following this, Ru et al. (2020) propose a hierarchical search space modeled by random graph generators and optimize hyper-parameters using BO. These works differ from ours in that we learn the graph generator to automatically explore the cell space.

Deep Graph Generative Models.
Graph generative models date back to the Erdős-Rényi model (Erdös & Rényi, 1959), in which each edge is generated independently with the same probability. Other well-known graph generative models include the stochastic block model (Holland et al., 1983), the small-world model (Watts & Strogatz, 1998), and the preferential attachment model (Barabási & Albert, 1999). Recently, deep graph generative models instead parameterize the probabilities of generating edges and nodes using deep neural networks, e.g., in an auto-regressive fashion (Li et al., 2018; You et al., 2018; Liao et al., 2019) or a variational autoencoder fashion (Kipf & Welling, 2016; Grover et al., 2018; Liu et al., 2019). These models are highly flexible and can model complicated distributions of real-world graphs, e.g., molecules (Jin et al., 2018), road networks (Chu et al., 2019), and program structures (Brockschmidt et al., 2018). Our graph generator builds on top of the state-of-the-art deep graph generative model in (Liao et al., 2019) with several important distinctions. First, instead of only generating nodes and edges, we also generate node attributes (e.g., operator types in neural architectures). Second, since good neural architectures are latent, our learning objective maximizes the expected reward (e.g., validation accuracy) rather than a simple log-likelihood, which is more challenging.

3. METHODS

The architecture of any feedforward neural network can be naturally represented as a directed acyclic graph (DAG), a.k.a. a computational graph. There exist two equivalent ways to define the computational graph: one can denote operations (e.g., convolutions) as nodes and operands (e.g., tensors) as edges indicating how the computation flows, or denote operands as nodes and operators as edges. We adopt the first view. In particular, a neural network G with N operations is defined as a tuple (A, X), where A ∈ {0, 1}^{N×N} is an N × N adjacency matrix in which A_ij = 1 indicates that the output of the j-th operator is used as the input of the i-th operator. For operators with multiple inputs, the inputs are combined (e.g., by a sum or average operator) before being fed into the operator. X is an N-dimensional attribute vector encoding operation types. For any operation i, its operation type X_i is chosen from a pre-defined list of length D, e.g., 1×1, 3×3, or 5×5 convolutions. Note that for any valid feedforward architecture, G cannot contain cycles. One sufficient condition to satisfy this requirement is to constrain A to be a lower triangular matrix with zero diagonal (i.e., excluding self-loops). This formalism creates a search space of D^N · 2^{N(N-1)/2} possible architectures, which is huge even for moderately large numbers of operators N and operation types D. The goal of NAS is to find an architecture or a set of architectures within this search space that perform well. For practical reasons, we search for cell graphs (e.g., N = 32) and then replicate the cell several times to build a deep neural architecture. We also experiment on the ENAS Macro search space, where G defines an entire network. More details of the corresponding search spaces can be found in Section 4.
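The (A, X) encoding and the search-space count above can be illustrated with a toy example. This is our own sketch (not from the paper), with made-up sizes N = 4 operators and D = 3 operation types; A must be strictly lower triangular so that the computational graph is a DAG.

```python
import numpy as np

# Toy architecture with N = 4 operators and D = 3 operation types.
A = np.array([[0, 0, 0, 0],
              [1, 0, 0, 0],   # op 1 consumes the output of op 0
              [1, 1, 0, 0],   # op 2 combines the outputs of ops 0 and 1
              [0, 0, 1, 0]])  # op 3 consumes the output of op 2
X = np.array([0, 2, 1, 2])    # operation type index of each node, in [0, D)
N, D = X.size, 3

# Strictly lower triangular: no self-loops or forward edges, hence acyclic.
assert np.all(np.triu(A) == 0)

# Search-space size: D^N type assignments times 2^(N(N-1)/2) possible wirings.
num_architectures = D**N * 2**(N * (N - 1) // 2)
print(num_architectures)  # 3^4 * 2^6 = 5184
```

With a single fixed operation type (D = 1) and N = 32, the same formula gives 2^496 ≈ 2×10^149 wirings, which is the scale of the RandWire space explored in Section 4.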

3.1. NEURAL ARCHITECTURE SEARCH SYSTEM

Before delving into details, we first give an overview of our NAS system, which consists of two parts: a generator and an evaluator. The system diagram is shown in Fig. 1 . At each step, the probabilistic graph generator samples a set of cell graphs, which are further translated to neural architectures by replicating the cell graph multiple times and stacking them up. Then the evaluator evaluates these architectures, obtains rewards, and sends architecture-reward pairs to the replay buffer. The replay buffer is then used to improve the generator, effectively forming a reinforcement learning loop. 

3.1.1. PROBABILISTIC GENERATORS FOR NEURAL ARCHITECTURES

Now we introduce our probabilistic graph generator, which is based on the state-of-the-art deep auto-regressive graph generative model of (Liao et al., 2019).

Auto-Regressive Generation. Specifically, we decompose the distribution of a cell graph along with its attributes (operation types) in an auto-regressive fashion,

P(A, X) = ∏_{i=1}^{N} P(A_{i,:} | G_{<i}) P(X_i | G_{<i}),   where G_{<i} = (A_{1,:}, X_1, …, A_{i-1,:}, X_{i-1}),   (1)

where A_{i,:} and X_i denote the i-th row of the adjacency matrix A and the i-th operation type respectively. To ensure the generated graphs are DAGs, we constrain A to be lower triangular by adding a binary mask, i.e., the i-th node can only be reached from the first i-1 nodes. We omit the masks in the equations for better readability. We further model the conditional distributions as follows,

P(A_{i,:} | G_{<i}) = Σ_{k=1}^{K} α_k ∏_{1≤j<i} θ_{k,i,j}^{A_{ij}} (1 - θ_{k,i,j})^{1 - A_{ij}},   (2)
P(X_i | G_{<i}) = Categorical(β_1, …, β_D),   (3)
(α_1, …, α_K) = Softmax( Σ_{1≤j<i} MLP_α(h_i^S - h_j^S) ),   (4)
(β_1, …, β_D) = Softmax( MLP_β(h_i^S) ),   (5)
(θ_{1,i,j}, …, θ_{K,i,j}) = Sigmoid( MLP_θ(h_i^S - h_j^S) ),   (6)

where the distributions of the operation type and edges are categorical and a K-component mixture of Bernoullis respectively. D is again the number of operation types. MLP_α, MLP_β, and MLP_θ are different instances of two-layer MLPs with ReLU activations. Here h_i^S is the representation of the i-th node returned by a GNN after S steps of message passing at each generation step. This auto-regressive construction breaks the nice property of permutation invariance of graph generation. However, we do not find this to be an issue in practice, partly because graph isomorphism becomes less likely when considering both topology and operation types.

Message Passing GNNs. Each generation step n ≤ N in the auto-regressive generation above relies on representations of nodes up to and including n itself (see Eq.
(4)-(6)). To obtain these node representations {h_i^S}, we exploit message passing GNNs (Gilmer et al., 2017) with an attention mechanism similar to (Liao et al., 2019). In particular, the s-th message passing step executes the following equations successively,

m_{ij}^s = f([h_i^s - h_j^s, 1_{ij}]),   (7)
h̃_i^s = [h_i^s, u_i],   (8)
a_{ij}^s = Sigmoid( g(h̃_i^s - h̃_j^s) ),   (9)
h_i^{s+1} = GRU( h_i^s, Σ_{j∈N(i)} a_{ij}^s m_{ij}^s ),   (10)

where N(i) is the set containing node i and its neighboring nodes, and m_{ij}^s is the message sent from node i to node j at the s-th message passing step. The connectivity for propagation in the GNN is given by A_{1:i-1,1:i-1}, with the last node (for which A_{i,:} has not been generated yet) being fully connected. Note that a message passing step is different from a generation step; we run multiple message passing steps per generation step in order to capture the structural dependencies among nodes and edges. f and g are two-layer MLPs. Since graphs in our case are DAGs rather than undirected graphs as in (Liao et al., 2019), we add 1_{ij} in Eq. (7), a one-hot vector indicating the direction of the edge. We initialize the node representations h_i^0 (for i < n) as the corresponding one-hot encoded operation type vectors; h_n^0 is initialized to a special one-hot vector. Here u_i is an additional feature vector that helps distinguish the i-th node from others. We found that using the one-hot-encoded incoming neighbors of the i-th node and a positional encoding of the node index i works well in practice. We encourage readers to refer to Fig. 4 for a detailed visualization of the graph generation process.
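One message passing step can be sketched in a few lines of numpy. This is a simplified illustration under our own assumptions, not the authors' implementation: single linear layers stand in for the two-layer MLPs f and g, the auxiliary features u_i and the GRU reset gate are omitted, and all weights and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (made-up; the paper uses 128)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random weights standing in for the learned MLPs f, g and the GRU parameters.
Wf = rng.standard_normal((d + 2, d)) * 0.1   # f: message function (Eq. 7)
Wg = rng.standard_normal((d, 1)) * 0.1       # g: attention function (Eq. 9)
Wz, Uz, Wn, Un = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))

def message_passing_step(h, edges):
    """One simplified step of Eqs. (7)-(10). h is an (n, d) array of node
    states; edges is a list of directed (i, j) pairs in the current graph."""
    agg = np.zeros_like(h)
    for i, j in edges:
        one_ij = np.array([1.0, 0.0])                       # direction flag 1_ij
        m_ij = np.concatenate([h[i] - h[j], one_ij]) @ Wf   # message (Eq. 7)
        a_ij = sigmoid((h[i] - h[j]) @ Wg)                  # attention (Eq. 9)
        agg[i] += a_ij * m_ij                               # attended aggregation
    z = sigmoid(agg @ Wz + h @ Uz)                          # GRU-style update gate
    n = np.tanh(agg @ Wn + h @ Un)                          # candidate state
    return (1 - z) * n + z * h                              # h^{s+1} (Eq. 10)
```

Running this function S times on the partially generated graph yields the node states h_i^S used by the output heads in Eqs. (4)-(6).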

Sampling.

To sample from our generator, we follow standard ancestral sampling, where each step involves drawing random samples from a categorical distribution and a mixture of Bernoulli distributions. At each step, this sampling process adds a new operator with a certain operation type and wires it to previously sampled operators.
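The control flow of this ancestral sampling can be sketched as follows. This is a toy version under our own assumptions: the mixture weights, type probabilities, and edge probabilities are fixed constants standing in for the learned outputs of Eqs. (4)-(6), whereas the real model recomputes them from GNN embeddings at every step.

```python
import numpy as np

def sample_architecture(N, D, K, rng):
    """Ancestrally sample a toy (A, X): at step i, draw node i's operation
    type, then draw its incoming edges from a K-component Bernoulli mixture."""
    alpha = np.full(K, 1.0 / K)   # mixture weights (stand-in for Eq. 4)
    beta = np.full(D, 1.0 / D)    # operation-type probs (stand-in for Eq. 5)
    theta = rng.random(K)         # per-component edge probs (stand-in for Eq. 6)
    A = np.zeros((N, N), dtype=int)
    X = np.zeros(N, dtype=int)
    for i in range(N):
        X[i] = rng.choice(D, p=beta)              # draw operation type
        if i > 0:
            k = rng.choice(K, p=alpha)            # pick a mixture component
            A[i, :i] = rng.random(i) < theta[k]   # wire node i to earlier nodes
    return A, X
```

Because row i is only ever filled in columns j < i, every sample is a valid DAG by construction.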

3.1.2. EVALUATOR

Our generator design and NAS pipeline do not rely on a specific choice of evaluator. Motivated by (Mnih et al., 2013), we use a replay buffer to store the evaluated architectures. In this paper, depending on the dataset, we explore three types of evaluators, namely the oracle evaluator, the supernet evaluator, and the benchmark evaluator, which are briefly introduced as follows. Oracle evaluator. Given a sample from the generator, an oracle evaluator trains the corresponding network from scratch and tests it to obtain the validation performance. To reduce computational overhead, a common approach is early stopping (training with fewer epochs) as in (Tan et al., 2019; Tan & Le, 2019). In our experiments, we instead use a low-data evaluator, similar to few-shot learning, where we keep the same number of classes but use fewer samples per class for training. SuperNet evaluator. Aiming to further reduce the amount of compute, this evaluator uses a weight-sharing strategy where each graph is a sub-graph of the supernet. We follow the single-path supernet setup used in (Pham et al., 2018) to compare with previous methods. Benchmark evaluator. NAS benchmarks, e.g., (Ying et al., 2019), provide accurate evaluations for architectures within the search space, and can be seen as oracle evaluators with full training budgets on target datasets.
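The low-data split used by the oracle evaluator (same classes, fewer samples per class) could be constructed as in the following sketch. The function name, signature, and default fraction are our own assumptions for illustration, not the paper's code.

```python
import numpy as np

def low_data_split(labels, fraction=0.1, rng=None):
    """Keep every class but only `fraction` of the samples per class for
    training; the remaining indices form the validation split."""
    rng = rng or np.random.default_rng()
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)              # all samples of class c
        n_keep = max(1, int(len(idx) * fraction))      # at least one per class
        train_idx.extend(rng.choice(idx, size=n_keep, replace=False).tolist())
    val_idx = sorted(set(range(len(labels))) - set(train_idx))
    return sorted(train_idx), val_idx
```

Sampling per class (rather than globally) preserves the class distribution, which mirrors the few-shot-style setup described above.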

3.2. LEARNING METHOD

Since we are dealing with discrete latent variables, i.e., good architectures in our case, we train our NAS system using the REINFORCE (Williams, 1992) algorithm with a control variate (a.k.a. baseline) to reduce variance. In particular, the gradient of the loss, i.e., the negative expected reward L, w.r.t. the generator parameters ϕ is

∇L(ϕ) = E_{P(G)}[ -(∂ log P(G) / ∂ϕ) R̃(G) ],   (11)

where the reward is standardized as R̃(G) = (R(G) - C)/σ. Here the baseline C is the average reward of the architectures in the replay buffer and σ is the standard deviation of the rewards in the replay buffer. The expectation in Eq. (11) is approximated by Monte Carlo estimation. However, the score function (i.e., the gradient of the log likelihood w.r.t. the parameters) in the above equation may differ numerically by a large amount across architectures. For example, if a negative sample, i.e., an architecture with a reward lower than the baseline, has a low probability P(G), it is highly likely to have a score function with an extremely large absolute value, thus leading to a negative reward term of extremely large magnitude. Therefore, in order to balance positive and negative rewards, we propose to use the following reweighted log likelihood,

log P̃(G) = β 1[R̃(G) ≤ 0] log(1 - P(G)) + 1[R̃(G) > 0] log P(G),   (12)

where β is a hyperparameter that controls the weighting between negative and positive rewards, and P(G) is the original probability given by our generator.

Exploration vs. Exploitation. Similar to many RL approaches, our NAS system faces the exploration vs. exploitation dilemma. We found that our NAS system may quickly collapse (i.e., overly exploit) to a few good architectures due to the powerful graph generative model, thus losing diversity and reducing to a point estimate. Inspired by the epsilon-greedy algorithm (Sutton & Barto, 2018) used in multi-armed bandit problems, we design a random explorer to encourage more exploration in the early stage.
Specifically, at each search step, our generator samples either from itself or, with probability ϵ, from a prior graph distribution such as the Watts-Strogatz model. As the search goes on, ϵ is gradually annealed to 0 so that the generator shifts from exploration to exploitation. What's more, we design our replay buffer to keep only a small portion of candidates. As training goes on, bad samples are gradually replaced by good samples for training our generator, which encourages the model to exploit more.
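The annealed epsilon-greedy explorer above can be sketched as follows. The callables `generator_sample` and `prior_sample` are placeholders for the learned generator and the prior graph model (e.g., Watts-Strogatz), and the multiplicative decay schedule is our own assumption for illustration.

```python
import random

def sample_with_explorer(step, generator_sample, prior_sample,
                         eps0=0.6, decay=0.2, period=10):
    """With probability eps, sample from the prior graph model (explore);
    otherwise sample from the learned generator (exploit). eps is annealed
    toward 0 as the search progresses."""
    eps = eps0 * decay ** (step // period)   # anneal epsilon every `period` steps
    if random.random() < eps:
        return prior_sample()    # explore: draw from the prior graph model
    return generator_sample()    # exploit: draw from the learned generator
```

Early in the search most samples come from the prior, which keeps the replay buffer diverse; late in the search nearly all samples come from the generator itself.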

4. EXPERIMENTS

In this section, we extensively investigate our NAS system on three different search spaces to verify its effectiveness. First, we adopt the challenging RandWire search space (Xie et al., 2019), which is significantly larger than common ones; to the best of our knowledge, we are the first to explore learning NAS systems in this space. Then we search on the ENAS Macro (Pham et al., 2018) and NAS-Bench-101 (Ying et al., 2019) search spaces to further compare with the previous literature. For all experiments, we set the number of mixture-of-Bernoulli components K to 10, the number of message passing steps S to 7, and the hidden sizes of the node representations h_i^s and messages m_ij^s to 128. For RNN-based baselines, we follow the design in (Zoph et al., 2018) unless otherwise specified.
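For reference, the generator hyperparameters above can be collected into a plain config dict. The key names are our own; the paper does not define a config format.

```python
# Generator hyperparameters shared across all experiments (key names are ours).
GENERATOR_CONFIG = {
    "num_mix_components": 10,  # K, mixture-of-Bernoulli components (Eq. 2)
    "num_mp_steps": 7,         # S, message passing steps per generation step
    "hidden_size": 128,        # size of node states h_i^s and messages m_ij^s
}
```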

4.1. RANDWIRE SEARCH SPACE ON TINY-IMAGENET

RandWire Search Space. Originally proposed in (Xie et al., 2019), a randomly wired neural network is a ResNet-like four-stage network in which the cell graph G defines the connectivity of N convolution layers within each stage. At the end of each stage, the resolution is downsampled by a 3×3 convolution with stride 2 and the number of channels is doubled. While following the RandWire small regime in (Xie et al., 2019), we share the cell graph G among the last three stages for simplification. To keep the number of parameters roughly the same, we fix the node type to separable 3×3 convolution. The number of nodes N within the cell graph G is set to 32, excluding the input and output nodes. This yields a search space of 2.1 × 10^149 valid adjacency matrices, which is extremely large and renders the neural architecture search challenging. More details of the RandWire search space can be found in Appendix E.1. Tiny-ImageNet w. Oracle Evaluator. To enable search on the RandWire space, we exploit the oracle evaluator on the Tiny-ImageNet dataset (Chrabaszcz et al., 2017). To save computation, we employ a low-data oracle evaluator where, at each search step, we sample 1/10 of the Tiny-ImageNet training set for training and use the rest for validation. Similar to few-shot learning, we keep the number of classes unchanged but reduce the number of samples per class. After the search, we retrain the found architectures on the full training set and evaluate them on the original validation set. Specifically, for each model, the oracle evaluator trains for 300 epochs and uses the average validation accuracy of the last 3 epochs as the reward. Our total search budget is around 16 GPU days, which amounts to approximately 320 model evaluations, e.g., 40 search steps with 8 samples evaluated per step. For random search baselines, we choose the Erdős-Rényi (ER) and Watts-Strogatz (WS) models.
Specifically, we first randomly draw their hyperparameters from certain ranges (i.e., 0.1 ≤ …). We set the reweight coefficient β to 0.05. For the random explorer, we choose the WS model with the same hyperparameter range as the prior distribution, set ϵ = 0.6 in the beginning, and decay it by a factor of 0.2 every 10 search steps. We also find that gradually shrinking the replay buffer to keep the top-performing 30% to 10% of architectures helps stabilize the training of the generator. At search time, we reject samples that already appear in the replay buffer to avoid duplicates. We apply the same settings to the RNN generator for a fair comparison. Results. As shown in Table 1, we compare our NAS system with other random search methods and learning-to-search methods. Our method outperforms the RNN-based generator and other random search methods in terms of average validation accuracy on the full dataset. Our generator also has lower variance than the RNN-based one. Moreover, we observed that the RNN-based generator sometimes degenerates, frequently sampling densely-connected graphs. This is probably because the RNN-based generator does not effectively utilize topology information. We can also see that a high search reward (i.e., low-data validation accuracy) does not necessarily lead to better performance in full-data training, which indicates a bias of the oracle evaluator in the low-data regime. Random search methods are prone to this bias as they select architectures solely based on the search reward. Nevertheless, our generator is less affected by the bias and is able to learn a distribution of good architectures that perform well in full-data training.


SuperNet Evaluator. For the ENAS Macro search space, we experiment on the CIFAR10 (Krizhevsky et al., 2009) dataset. For our generator, we use the ER model with p = 0.4 as our explorer, where ϵ decays from 1 to 0 in the first 100 search steps. For the RNN-based generator, we follow the setup in (Pham et al., 2018). We also adopt the weight-sharing mechanism in (Pham et al., 2018) to obtain a SuperNet evaluator that efficiently evaluates a model's performance. We use a budget of 300 search steps with around 100 architectures evaluated per step for all methods. After the search, we use a short training of 100 epochs to evaluate the performances of 8 sampled architectures, after which the top-4 performing ones are chosen for a 600-epoch full training. The best validation error rate among these 4 architectures is reported. For simplicity and a fair comparison, we do not use the additional tricks (e.g., adding an entropy regularizer) of (Pham et al., 2018). More details are provided in Appendix F. In Table 3, we compare the error rates and variances of different NAS methods. Note that this variance reflects the uncertainty of the distribution of architectures, as it is computed over sampled architectures. It is clear that our GraphPNAS achieves both lower error rates and lower variances than the RNN-based generator and is on par with state-of-the-art NAS methods on other search spaces. We also see that the best architecture found by our generator outperforms that of the RNN-based generator by a significant margin. This verifies that our GraphPNAS is able to learn a distribution of well-performing neural architectures. Given that we only sample 8 architectures, the performance could be further improved with a larger computational budget.

4.3. NAS BENCHMARKS

NAS-Bench-101 (Ying et al., 2019) is a tabular benchmark containing 423K cell graphs, each of which is a DAG with up to 7 nodes and 9 edges, including input and output nodes. We compare the performance of our GraphPNAS to open-source implementations of random search methods, local search methods, and BANANAS (White et al., 2019); the latter two are the best algorithms found by White et al. (2020) on NAS-Bench-101. For the GCN prediction and evolution methods, we use the scores reported in (White et al., 2020). We give each NAS method the same budget of 300 queries and plot the lowest test error as a function of the number of evaluated architectures. As shown in Fig. 2, our GraphPNAS is able to quickly find well-performing architectures. We also report the average error rate over 10 runs. NAS-Bench-201 (Dong & Yang, 2020) is defined on a smaller search space where up to 4 nodes and 6 edges are allowed; experimental results can be found in Appendix A.
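The benchmark-evaluator comparison amounts to a simple query loop: propose an architecture per query and track the lowest test error seen so far. The sketch below is our own schematic; `propose` and `lookup_error` are placeholders for a generator sample and the tabular benchmark lookup.

```python
def search_curve(propose, lookup_error, budget=300):
    """Run `budget` benchmark queries and return the running-best test error
    after each query (the quantity plotted against the number of evaluated
    architectures)."""
    best, curve = float("inf"), []
    for _ in range(budget):
        best = min(best, lookup_error(propose()))  # query the tabular benchmark
        curve.append(best)                         # record the best-so-far error
    return curve
```

Because every method gets the same query budget, comparing these curves isolates how quickly each search strategy finds good architectures.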

5. DISCUSSION & CONCLUSION

Qualitative Comparisons between RNN and GraphPNAS. In (You et al., 2020), the clustering coefficient and the average path length were used to investigate distributions of graphs. Here we adopt the same metrics to visualize architectures (graphs) sampled by the RNN-based generator and ours in the RandWire experiments. Points in Fig. 3 correspond to architectures sampled from both generators in the last 15 search steps, where random explorers are disabled; validation performance is color-coded. We can see that our GraphPNAS samples a set of graphs with better validation accuracies, whereas those of the RNN generator vary widely in performance. Moreover, the graphs in our case concentrate on smaller clustering coefficients and are thus less likely to be densely connected. On the contrary, the RNN generator tends to sample graphs that are more likely to be densely connected. While RNNs have been widely used for NAS, our experiments show that our graph generator consistently outperforms the RNN over three search spaces on two different datasets. This is likely because our graph generator better leverages graph topology and is thus more expressive in learning distributions of graphs.

Bias in Evaluator.

In our experiments, we use the SuperNet evaluator and the low-data and full-data oracle evaluators to efficiently evaluate models. From the computational efficiency perspective, one would prefer the SuperNet evaluator. However, it tends to give high rewards to those architectures used for training the SuperNet. Although the low-data evaluator is more efficient than the full-data one, its reward is biased, as discussed in Section 4.1. This bias is caused by the discrepancy between the data distributions in the low-data and full-data regimes. We also tried following (Tan et al., 2019) in using early stopping to reduce the cost of the full-data evaluator. However, we found that it assigns higher rewards to shallow networks, which converge much faster in the early stage of training. We show detailed results in Appendix E.4. Search Space Design. The design of the search space largely affects the performance of NAS methods. Our GraphPNAS successfully learns good architectures on the challenging RandWire search space. However, this search space is still limited, as the cell graph is shared across different stages. A promising direction is to learn to generate graphs in a hierarchical fashion: for example, one can first generate a macro graph and then generate individual cell graphs (each cell being a node in the macro graph) conditioned on the macro graph. This would significantly enrich the search space by including the macro graph and untying the cell graphs.

Conclusion.

In this paper, we propose a GNN-based graph generator for NAS, called GraphPNAS. Our graph generator naturally captures the topologies and operation dependencies of well-performing neural architectures. It can be learned efficiently through reinforcement learning. We extensively study its performance on the challenging RandWire search space as well as two widely used search spaces. Experimental results show that our GraphPNAS consistently outperforms RNN-based generators on all datasets. Future work includes exploring ensemble methods based on our GraphPNAS and hierarchical graph generation on even larger search spaces.

A EXPERIMENTS ON THE NAS-BENCH-201 SEARCH SPACE

Here we compare our method on NAS-Bench-201 (Dong & Yang, 2020) with random search (RANDOM) (Bergstra & Bengio, 2012), random search with parameter sharing (RSPS) (Li & Talwalkar, 2020), REA (Real et al., 2019b), REINFORCE (Williams, 1992), ENAS (Pham et al., 2018), first-order DARTS (DARTS 1st) (Liu et al., 2018b), second-order DARTS (DARTS 2nd), GDAS (Dong & Yang, 2019c), SETN (Dong & Yang, 2019b), TAS (Dong & Yang, 2019a), FBNet-V2 (Wan et al., 2020), TuNAS (Bender et al., 2020), and BOHB (Falkner et al., 2018). To fairly compare with the scores reported in (Dong et al., 2021), we fix a search budget of 20000s on CIFAR10 and CIFAR100 and 30000s on ImageNet-16-120, which corresponds to approximately 150, 80, and 40 oracle evaluations (with 1 sample evaluated per step) on CIFAR10, CIFAR100, and ImageNet-16-120 respectively. Specifically for NAS-Bench-201, we use a random explorer in the first 10 steps and keep the top 15 architectures in the replay buffer. We found that our model outperforms previous methods on CIFAR10 and CIFAR100 and is on par with state-of-the-art methods on ImageNet-16-120. On ImageNet-16-120, after extending the search budget from 40 to 60 steps, we significantly boost the performance to 45.57 (+0.57) and 45.79 (+0.4) on the validation and test sets respectively. This indicates that the search process of our model had not converged under the limited budget, and that a reasonable number of search steps is needed for our model to reach its full potential.
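The search schedule described above (random explorer for the first 10 steps, then the learned generator, with a top-15 replay buffer) can be sketched as follows. The generator interface (`sample`, `sample_random`, `fit`) is a hypothetical stand-in, not the paper's exact API:

```python
import heapq

def search(generator, evaluate, budget, warmup=10, topk=15):
    """Sketch of the search loop: warm-up with a random explorer, then sample
    from the learned generator, keeping a top-k replay buffer throughout."""
    buffer = []  # min-heap of (reward, arch), trimmed to the top-k entries
    for step in range(budget):
        # Random explorer for the first `warmup` steps, learned generator after.
        arch = generator.sample_random() if step < warmup else generator.sample()
        heapq.heappush(buffer, (evaluate(arch), arch))
        if len(buffer) > topk:
            heapq.heappop(buffer)  # drop the worst-performing architecture
        generator.fit(buffer)      # REINFORCE update on the replay buffer
    return max(buffer)[1]          # best architecture found
```

The min-heap makes trimming to the top k an O(log k) operation per step, which is negligible next to the cost of each oracle evaluation.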

B COMPARISON WITH NAGO

Following Xie et al. (2019)'s work on random graph models, Ru et al. (2020) propose to learn the parameters of random graph models using Bayesian optimization. We compare with the RandWire search space (referred to as RANG) in (Ru et al., 2020). Since the original search space in (Xie et al., 2019) does not reuse cell graphs across stages, we train conditionally independent graph generators for each stage, i.e., three conditionally independent generators for the conv3, conv4, and conv5 stages in Table 9. We perform a search on the CIFAR10 dataset, where each model is evaluated for 100 epochs, and restrict the search budget to 600 oracle evaluations. We align with the retraining settings in (Ru et al., 2020) and report the sampled architectures' test accuracy and standard deviation in Table 6. We can see that our method learns a distribution of graphs that outperforms previous methods.

C EXPERIMENTS ON THE DARTS SEARCH SPACE

DARTS is a small cell-graph-based search space (Liu et al., 2018b). It is defined on a DAG with up to 6 nodes and 8 possible operators, where each node can have no more than two inputs; like NAS-Bench-201, operations are defined on the edges, and nodes represent fuse operations over their incoming edges. Despite the fact that NAS performance on CIFAR10 with DARTS is already saturated, we experiment on it to compare with previous NAS methods. To be consistent with DARTS, we split CIFAR10 into subsets containing 40000 training and 10000 validation samples and train architectures using SGD for 50 epochs with an initial learning rate of 0.025 (annealed to 0 via cosine decay), momentum of 0.9, and weight decay of 0.0003, as well as auxiliary towers, drop path, and cutout. We allow a budget of 96 oracle evaluations, or roughly 72 hours on 4 NVIDIA TITAN Xp GPUs. We follow the DARTS final evaluation pipeline and run SGD for 600 epochs with an initial learning rate of 0.025 annealed to 0 by cosine decay, momentum 0.9, and weight decay 0.0003; we also perform data augmentation (cropping/horizontal flipping) and adopt auxiliary towers, drop path, and cutout. From Table 7, we can see that our method is on par with state-of-the-art methods. The computation cost is largely due to the use of oracle evaluation; future work using a performance predictor or a SuperNet-based evaluator could reduce this cost. Due to limits on available computation, we reuse the best hyperparameters found on NAS-Bench-101; we expect further hyperparameter tuning would further boost performance. We also provide search results on CIFAR100.
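The cosine schedule used above (initial learning rate 0.025 annealed to 0 over the training run) corresponds to lr(t) = 0.5 · lr₀ · (1 + cos(πt/T)):

```python
import math

def cosine_lr(epoch, total_epochs=600, lr_init=0.025):
    """Cosine decay from lr_init down to 0 over total_epochs."""
    return 0.5 * lr_init * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

At the halfway point the learning rate is exactly half the initial value, and it reaches 0 at the final epoch.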

D MORE DETAILS ON GRAPH GENERATOR

To illustrate the sampling process of our generator more clearly, we detail its probabilistic sampling steps in Fig. 4.
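As a rough illustration (not the exact model), the auto-regressive process in Fig. 4 can be sketched with two stand-in probability heads: `op_dist` and `edge_prob` are hypothetical placeholders for the GNN's outputs, the former giving operator probabilities for the new node and the latter the probability of an edge from an existing node:

```python
import random

def generate_architecture(op_dist, edge_prob, n_nodes, n_ops, rng=random):
    """Auto-regressive sampling sketch: add one node at a time, sample its
    operator, then sample Bernoulli edges from every earlier node."""
    ops, edges = [], []
    for v in range(n_nodes):
        probs = op_dist(ops, edges)  # categorical over the n_ops operators
        ops.append(rng.choices(range(n_ops), weights=probs)[0])
        for u in range(v):           # one Bernoulli per candidate edge (u, v)
            if rng.random() < edge_prob(ops, edges, u, v):
                edges.append((u, v))
    return ops, edges
```

In the actual model, both heads are conditioned on GNN embeddings of the partial graph, so earlier decisions influence later ones.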

F MORE DETAILS ON ENAS MACRO EXPERIMENTS

For the ENAS Macro search space, we use a PyTorch-based open-source implementation¹ and follow the detailed parameters provided in (Pham et al., 2018) for the RNN generator. Specifically, we follow (Pham et al., 2018) and train the SuperNet and update the generator in an alternating style. At each search step, two sets of samples, G_train and G_eval, are sampled from the generator. G_train is used to learn the SuperNet's weights by back-propagating the training loss. The updated SuperNet is used for evaluating G_eval, which is then used for updating the generator. For our generator, we evaluate 100 architectures per step and update our generator every 5 epochs of SuperNet training. Instead of evaluating on a single batch, we reduce the number of models evaluated per step and evaluate on the full test set. We found this stabilizes the training of our generator while keeping evaluation costs the same. The replay buffer keeps the top 20% of architectures. For training the SuperNet and the RNN generator, we follow the same hyperparameter settings as (Pham et al., 2018), except that the learning rate decay policy is changed to cosine decay. For training our generator, we use the same hyperparameters as in Table 10 with the graph batch size changed to 32. For retraining the best found architecture, we use a budget of 600 epochs with a learning rate of 0.1, batch size 256, and weight decay 2e-4. We also apply cutout with a probability of 0.4 to all models when retraining.

Figure 7: The best architectures found by GraphPNAS and the RNN generator (Pham et al., 2018), corresponding to the scores reported in Table 3. To obtain these architectures, we pre-evaluate 8 samples for both methods and select the top-performing one.
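The alternating scheme above can be sketched as follows. The helper names (`sample`, `train_step`, `evaluate`, `reinforce_update`) are assumptions standing in for the actual implementation:

```python
def enas_style_search(generator, supernet, train_loader, test_loader, epochs,
                      eval_every=5, n_eval=100):
    """Alternate between training shared SuperNet weights on generated
    architectures (G_train) and updating the generator on rewards from a
    fresh evaluation set (G_eval)."""
    for epoch in range(epochs):
        # 1) Train the shared SuperNet weights on architectures sampled
        #    from the generator (the G_train set).
        for batch in train_loader:
            arch = generator.sample()
            supernet.train_step(arch, batch)
        # 2) Every `eval_every` epochs, evaluate G_eval on the full test
        #    set and update the generator via REINFORCE.
        if (epoch + 1) % eval_every == 0:
            g_eval = [generator.sample() for _ in range(n_eval)]
            rewards = [supernet.evaluate(a, test_loader) for a in g_eval]
            generator.reinforce_update(g_eval, rewards)
```

Evaluating on the full test set rather than a single batch trades a larger per-evaluation cost for lower reward variance, which is compensated by evaluating fewer models per step.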

F.1 VISUALIZATION OF BEST ARCHITECTURE FOUND

Here we visualize the best architectures found by GraphPNAS and the RNN generator for the ENAS Macro search space in Fig. 7.

G MORE DETAILS ON NAS-BENCH-101

For sampling on NAS-Bench-101 (Ying et al., 2019), we first sample a 7-node DAG and then remove any node that is not connected to the input or output node. We reject samples that have more than 9 edges or lack a path from input to output. To train our generator on NAS-Bench-101, we use an Erdős–Rényi prior with p = 0.25; ϵ is set to 1 at the beginning and decreased to 0 after 30 search steps. The replay buffer keeps the top 30 architectures. Our model is updated every 10 model evaluations, training for 70 epochs on the replay buffer at each update. The learning rate is set to 1e-3 with a batch size of 2 on NAS-Bench-101.
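The sampling, pruning, and rejection rules above can be sketched as a stand-alone procedure. Here the Erdős–Rényi prior stands in for the learned generator, so this is a simplified sketch rather than the trained model:

```python
import random

def sample_nasbench101_dag(n=7, p=0.25, max_edges=9, seed=None):
    """Sample a DAG under an Erdős–Rényi prior, prune nodes off every
    input->output path, and apply the rejection rules. Node 0 is the input,
    node n-1 the output. Returns a set of (u, v) edges with u < v, or None."""
    rng = random.Random(seed)
    edges = {(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < p}

    def reachable(src, adj):
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            for y in adj.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
    # Keep only nodes that lie on some path from input to output.
    keep = reachable(0, fwd) & reachable(n - 1, bwd)
    edges = {(u, v) for (u, v) in edges if u in keep and v in keep}

    # Reject: too many edges, or no input->output path at all.
    if len(edges) > max_edges or (n - 1) not in reachable(0, fwd):
        return None
    return edges
```

Rejection sampling like this keeps the generator's output space aligned with the benchmark's constraints without changing the generator itself.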



Operator set: 1 × 1 and 5 × 5 convolution, 1 × 1 and 5 × 5 separable convolution, max pooling, and average pooling.

¹ https://github.com/microsoft/nni/tree/v1.6/examples/nas/enas



Figure 1: Figure (a) is the pipeline of our NAS system. The core part is a GNN-based graph generator from which we sample graph representations G of neural networks. The corresponding model for each G is then sent to the evaluator. The evaluation result is first stored in a replay buffer and then used for learning the graph generator through reinforcement learning. Figure (b) shows one generation step of the proposed probabilistic graph generator.

Figure 3: Visualization of the architecture exploration space of GraphPNAS vs. RNN. Each point in the figure denotes a model evaluation. The color of each node denotes its validation accuracy as returned by the low-data Oracle evaluator.

Figure 4: Detailed steps of auto-regressive generation with our graph generator.

Figure 6: Visualization of Top 3 architectures sampled by each method. We observe that around 50% of samples from RNN generators are densely connected graphs or even fully connected graphs.

Comparisons on Tiny-ImageNet. The top and bottom blocks include random search and learning-to-search methods respectively. ER-TopK and WS-TopK refer to the top (K=4) architectures found by all WS and ER models during search. ER-BEST and WS-BEST refer to the best ER and WS models found during search, i.e., WS(k=4, p=0.75) and ER(p=0.1). Here the average and standard deviation of accuracies are computed over 4 architectures sampled from the generators.

Comparisons on CIFAR10 dataset. The bottom and top blocks include NAS methods with ENAS Macro and other search spaces respectively.



Our GraphPNAS again outperforms the RNN-based generator by a significant margin and beats strong baselines such as local search methods and BANANAS. Notably, our GraphPNAS has a much lower variance than other methods, thus being more stable across multiple runs.

Best searched architecture performance on NAS-Bench-201. We run our method 10 times to obtain the mean and standard deviation.

Comparison of the searched results on CIFAR10. Mean test accuracy and the standard deviation are calculated over 8 samples from the searched generator. We align the search space design and retraining setting for a fair comparison.

Comparison of results on CIFAR10 with DARTS search space.

Due to time constraints, we only compare with a few open-source NAS methods. For a fair comparison, we reuse the final training hyperparameters from Chen et al. (2019).

Comparison of results on CIFAR100 with DARTS search space.

Table 9 and Fig. 5: RandWire search space with base and large settings. Base is the default setting for search, while Large refers to the architecture of the scaled-up models in Table 2. conv denotes a ReLU-SepConv-BN triplet. The input size is 224×224 pixels. A change of output size implies a stride of 2 (omitted in the table) in the convolutions placed at the end of each block. G is the shared cell graph with N = 32 nodes.

E.2 DETAILS FOR RANDWIRE EXPERIMENTS

For experiments on Tiny-ImageNet, we resize images to 224 × 224 as shown in Table 9. We apply basic data augmentation of horizontal random flips and random cropping with padding size 4.

Visualization of the RandWire base search space used in this paper. Different from (Xie et al., 2019), G here is shared across the three stages.
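For reference, the WS-style random graphs used as priors in this search space can be approximated as below. This is a simplified sketch; RandWire's exact construction may differ in details such as the rewiring rule:

```python
import random

def ws_dag(n=32, k=4, p=0.75, seed=0):
    """Watts-Strogatz-style random graph turned into a DAG: build a ring
    lattice, rewire each edge with probability p, then orient every edge
    from lower to higher node index (which guarantees acyclicity)."""
    rng = random.Random(seed)
    edges = set()
    # Ring lattice: each node connects to k/2 neighbours on each side.
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.add(tuple(sorted((i, (i + j) % n))))
    # Rewire each edge with probability p to a random new endpoint.
    rewired = set()
    for (u, v) in edges:
        if rng.random() < p:
            w = rng.randrange(n)
            while w == u or tuple(sorted((u, w))) in rewired:
                w = rng.randrange(n)
            rewired.add(tuple(sorted((u, w))))
        else:
            rewired.add((u, v))
    # Orienting by node index makes the undirected graph a DAG.
    return {(min(u, v), max(u, v)) for (u, v) in rewired}
```

The WS(k=4, p=0.75) setting reported in our tables corresponds to the defaults here.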

Hyperparameter settings for the oracle evaluator and for training our graph generator.

With early-stopping training, the generator generates shallower architectures with a shorter path from input to output. The corresponding average final validation accuracy also drops by a large margin compared to the low-data evaluator counterpart.

In the table, we show an ablation on the choice of oracle evaluator with our graph generator. The average path and longest path are computed as the average and longest path lengths from input to output over 8 samples from the corresponding generator.
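Both path statistics can be computed exactly by dynamic programming over the node ordering, assuming (as in our DAG representation) that every edge (u, v) satisfies u < v:

```python
def path_stats(edges, n):
    """Average and longest path length from input (node 0) to output
    (node n-1) in a DAG with index-ordered edges; None if no path exists."""
    preds = {v: [] for v in range(n)}
    for u, v in edges:
        preds[v].append(u)
    npaths = [0] * n   # number of distinct 0->v paths
    total = [0] * n    # sum of lengths of all 0->v paths
    longest = [0] * n  # length of the longest 0->v path
    npaths[0] = 1
    for v in range(1, n):
        for u in preds[v]:
            if npaths[u]:
                npaths[v] += npaths[u]
                total[v] += total[u] + npaths[u]  # each path grows by one edge
                longest[v] = max(longest[v], longest[u] + 1)
    if npaths[n - 1] == 0:
        return None
    return total[n - 1] / npaths[n - 1], longest[n - 1]
```

Processing nodes in increasing index order is a valid topological order here, so a single pass suffices.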

