GRAPHPNAS: LEARNING DISTRIBUTIONS OF GOOD NEURAL ARCHITECTURES VIA DEEP GRAPH GENERATIVE MODELS

Abstract

Neural architectures can be naturally viewed as computational graphs. Motivated by this perspective, we, in this paper, study neural architecture search (NAS) through the lens of learning random graph models. In contrast to existing NAS methods, which largely focus on searching for a single best architecture, i.e., point estimation, we propose GraphPNAS, a deep graph generative model that learns a distribution of well-performing architectures. Relying on graph neural networks (GNNs), our GraphPNAS can better capture the topologies of good neural architectures and the relations between operators therein. Moreover, our graph generator leads to a learnable probabilistic search method that is more flexible and efficient than the commonly used RNN generator and random search methods. Finally, we learn our generator via an efficient reinforcement learning formulation for NAS. To assess the effectiveness of GraphPNAS, we conduct extensive experiments on three search spaces, including the challenging RandWire on Tiny-ImageNet, ENAS on CIFAR10, and NAS-Bench-101. The complexity of RandWire is significantly larger than that of other search spaces in the literature. We show that our proposed graph generator consistently outperforms the RNN-based one and achieves performance better than or comparable to state-of-the-art NAS methods.

1. INTRODUCTION

In recent years, we have witnessed a rapidly growing list of successful neural architectures that underpin deep learning, e.g., VGG, LeNet, ResNets (He et al., 2016), Transformers (Dosovitskiy et al., 2020). Designing these architectures requires researchers to go through time-consuming trial and error. Neural architecture search (NAS) (Zoph & Le, 2016; Elsken et al., 2018b) has emerged as an increasingly popular research area that aims to automatically find state-of-the-art neural architectures without a human in the loop. NAS methods typically have two components: a search module and an evaluation module. The search module is expressed by a machine learning model, such as a deep neural network, designed to operate in a high-dimensional search space. The search space of all admissible architectures is often designed by hand in advance. The evaluation module takes an architecture as input and outputs its reward, e.g., the performance of this architecture after it is trained and evaluated with a metric. The learning process of NAS methods typically iterates between the following two steps: 1) the search module produces candidate architectures and sends them to the evaluation module; 2) the evaluation module evaluates these architectures to obtain rewards and sends the rewards back to the search module. Ideally, based on the feedback from the evaluation module, the search module should learn to produce better and better architectures. Unsurprisingly, this learning paradigm of NAS fits well with reinforcement learning (RL). Most NAS methods (Liu et al., 2018b; White et al., 2020; Cai et al., 2019) only return a single best architecture (i.e., a point estimate) after the learning process. This point estimate can be heavily biased, as it typically under-explores the search space. Further, a given search space may contain multiple (equally) good architectures, a feature that a point estimate cannot capture.
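The two-step loop above can be sketched as follows. This is a minimal toy illustration, not the paper's actual system: the operator names, the categorical per-layer generator, and the proxy reward are all hypothetical stand-ins (a real NAS system would train and validate each candidate to obtain its reward).

```python
import random

# Toy search space: an "architecture" is one operator choice per layer.
# All names here are illustrative, not the paper's interfaces.
OPS = ["conv3x3", "conv5x5", "maxpool"]
NUM_LAYERS = 4

def sample_architecture(weights):
    """Search module: sample one operator per layer from per-layer
    categorical distributions (a stand-in for a learned generator)."""
    return [random.choices(OPS, weights=w)[0] for w in weights]

def evaluate(arch):
    """Evaluation module: return a reward in [0, 1]. Here a toy proxy
    that prefers conv3x3 layers; a real system trains the network."""
    return sum(op == "conv3x3" for op in arch) / len(arch)

def search(steps=1000, lr=0.5, seed=0):
    random.seed(seed)
    # Unnormalized per-layer preference weights over operators.
    weights = [[1.0] * len(OPS) for _ in range(NUM_LAYERS)]
    for _ in range(steps):
        arch = sample_architecture(weights)  # 1) propose candidates
        reward = evaluate(arch)              # 2) evaluate, get reward
        for layer, op in enumerate(arch):    # feedback: reinforce choices
            weights[layer][OPS.index(op)] += lr * reward
    return weights

learned = search()
```

Because choices that co-occur with high reward get reinforced, the preference mass drifts toward better operators over iterations; RL-based NAS generators follow the same propose/evaluate/reinforce pattern with far richer generators and rewards.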
Even worse, since the learning problem of NAS is essentially a discrete optimization in which multiple local minima exist, many local-search-style NAS methods (Ottelander et al., 2020) tend to get stuck in local minima. From the Bayesian perspective, modeling the distribution of architectures is inherently better than point estimation, e.g., it enables ensemble methods that work better in practice. Moreover, modeling the distribution of architectures naturally caters to probabilistic search methods, such as simulated annealing, which are better suited for avoiding local optima. Finally, modeling the distribution of architectures makes it possible to capture the complex structural dependencies between operations that characterize good architectures, i.e., those capable of more efficient learning and generalization. Motivated by the above observations and the fact that neural architectures can be naturally viewed as attributed graphs, we propose a probabilistic graph generator which models the distribution over good architectures using graph neural networks (GNNs). Our generator excels at generating topologies with complicated structural dependencies between operations. From the Bayesian inference perspective, our generator returns a distribution over good architectures rather than a single point estimate, allowing us to capture the multi-modal nature of the posterior distribution of good architectures and to effectively average or ensemble architecture (sample) estimates. Different from Bayesian deep learning (Neal, 2012; Blundell et al., 2015; Gal & Ghahramani, 2016), which models distributions of weights/hidden units, we model distributions of neural architectures. Lastly, our probabilistic generator is less prone to the issue of local minima, since multiple random architectures are generated at each step during learning. In summary, our key contributions are as follows.
• We propose a GNN-based graph generator for neural architectures which empowers a learnable probabilistic search method. To the best of our knowledge, we are the first to explore learning deep graph generative models as generators in NAS.
• We explore a significantly larger search space (e.g., graphs with 32 operators) than the literature (e.g., graphs with up to 12 operators) and propose to evaluate architectures in a low-data regime, which together boost the effectiveness and efficiency of our NAS system.
• Extensive experiments on three different search spaces show that our method consistently outperforms RNN-based generators and is slightly better than or comparable to state-of-the-art NAS methods. It also generalizes well across different NAS system setups.
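The view of an architecture as an attributed graph, which underlies the contributions above, can be made concrete with a toy encoding. This is an illustrative sketch, not the paper's exact representation: each node carries an operator label, and a binary upper-triangular adjacency matrix guarantees the computational graph is a DAG (a common convention in NAS-Bench-101-style spaces).

```python
# Toy attributed-graph encoding of an architecture (illustrative only):
# one operator label per node plus a binary adjacency matrix.
arch = {
    "ops": ["input", "conv3x3", "maxpool", "output"],
    # edges[i][j] == 1 means node i feeds node j; only i < j is allowed,
    # so the computational graph is acyclic by construction.
    "edges": [
        [0, 1, 1, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ],
}

def is_valid(a):
    """Check that the attributed graph is a well-formed architecture."""
    n = len(a["ops"])
    # Upper-triangular adjacency <=> edges only go "forward" => DAG.
    if any(a["edges"][i][j] for i in range(n) for j in range(n) if j <= i):
        return False
    # Every non-input node must receive input; every non-output node
    # must feed some later node (no dangling operators).
    indeg = [sum(a["edges"][i][j] for i in range(n)) for j in range(n)]
    outdeg = [sum(row) for row in a["edges"]]
    return all(indeg[j] > 0 for j in range(1, n)) and all(
        outdeg[i] > 0 for i in range(n - 1)
    )
```

A graph generator for NAS produces exactly such pairs of node labels and adjacency structure; validity checks like `is_valid` filter malformed samples before evaluation.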

2. RELATED WORK

Neural Architecture Search. The main challenges in NAS are 1) the hardness of discrete optimization, 2) the high cost of evaluating neural networks, and 3) the lack of principles for search space design. First, to tackle the discrete optimization, evolution strategies (ES) (Elsken et al., 2019; Real et al., 2019a), reinforcement learning (RL) (Baker et al., 2017; Zhong et al., 2018; Pham et al., 2018; Liu et al., 2018a), Bayesian optimization (Bergstra et al., 2013; White et al., 2019), and continuous relaxations (Liu et al., 2018b) have been explored in the literature. We follow the RL path as it is principled, flexible in injecting prior knowledge, achieves state-of-the-art performance (Tan & Le, 2019), and can be naturally applied to our graph generator. Second, evaluation requires training individual neural architectures, which is notoriously time-consuming (Zoph & Le, 2016). Pham et al. (2018); Liu et al. (2018b) propose a weight-sharing supernet to reduce the training time. Baker et al. (2018) use a machine learning model to predict the performance of fully trained architectures conditioned on early-stage performance. Brock et al. (2018); Zhang et al. (2018) directly predict weights for the searched architectures via hypernetworks. Since our graph generator does not rely on a specific choice of evaluation method, we experiment with both oracle training (i.e., training from scratch) and supernet settings for completeness. Third, the search space of NAS largely determines the optimization landscape and bounds the best-possible performance. The larger the search space, the better the best-possible performance and the higher the search cost is likely to be. Beyond this trade-off, few principles are known about designing the search space. Previous work (Pham et al., 2018; Liu et al., 2018b; Ying et al., 2019; Li et al., 2020) mostly focuses on cell-based search spaces. A cell is defined as a small (e.g., up to 8 operators) computational graph whose nodes (i.e., operators such as 3×3 convolution) are connected following some topology. Once the search is done, one often stacks multiple cells with the same topology but different weights to build the final neural network. Other works (Tan et al., 2019; Cai et al., 2019; Tan & Le, 2019) typically fix the topology, e.g., a sequential backbone, and search for layer-wise configurations (e.g., operator types such as 3×3 vs. 5×5 convolution, and the number of filters).

In our method, to demonstrate our graph generator's ability to explore large topology search spaces, we first experiment on a challenging large cell space (32 operators), and then on ENAS Macro (Pham et al., 2018) and NAS-Bench-101 (Ying et al., 2019) for further comparison with previous methods.
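The cell-stacking convention used by cell-based search spaces can be sketched as follows. This is a minimal illustration of the design choice only; `make_cell`, `build_network`, and the toy topology are hypothetical names, not part of any NAS library.

```python
# Sketch of the cell-stacking convention: one searched cell topology is
# repeated to form the full network, each copy with its own independent
# weights. All names and the toy topology are illustrative.

def make_cell(topology, weight_init):
    """Instantiate one cell: shared topology, fresh weights per operator."""
    return {"topology": topology,
            "weights": [weight_init() for _ in topology]}

def build_network(topology, num_cells, weight_init):
    """Stack identical-topology cells with independent weights."""
    return [make_cell(topology, weight_init) for _ in range(num_cells)]

searched_cell = ["conv3x3", "conv3x3", "maxpool"]  # a toy searched topology
net = build_network(searched_cell, num_cells=3, weight_init=lambda: 0.0)
```

The design rationale is that searching a small cell keeps the search space tractable while the stacking depth restores model capacity; the 32-operator cell space explored in this paper relaxes exactly that size restriction.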

