DIFFERENTIABLE GRAPH OPTIMIZATION FOR NEURAL ARCHITECTURE SEARCH

Abstract

In this paper, we propose Graph Optimized Neural Architecture Learning (GOAL), a novel gradient-based method for Neural Architecture Search (NAS) that finds better architectures with fewer evaluated samples. Popular NAS methods usually employ black-box optimization approaches such as reinforcement learning, evolutionary algorithms, or Bayesian optimization, which can be inefficient over huge combinatorial NAS search spaces. In contrast, we explicitly model the NAS search space as graphs and then perform gradient-based optimization to learn graph structures with efficient exploitation. To this end, we learn a differentiable graph neural network as a surrogate model to rank candidate architectures, which enables us to obtain gradients w.r.t. the input architectures. To cope with the difficulty of gradient-based optimization on discrete graph structures, we propose to leverage proximal gradient descent to find potentially better architectures. Our empirical results show that GOAL outperforms mainstream black-box methods on existing NAS benchmarks in terms of search efficiency.

1. INTRODUCTION

Neural Architecture Search (NAS) methods have achieved great success and outperform hand-crafted models in many deep learning applications, such as image recognition, object detection, and natural language processing (Zoph et al., 2017; Liu et al., 2019; Ghiasi et al., 2019; Chen et al., 2020). Due to the expensive cost of training and evaluating a neural architecture, the key challenge of NAS is to explore promising candidates effectively. To cope with this challenge, various methods have been proposed to perform efficient search, including reinforcement learning (RL), evolutionary algorithms (EA), Bayesian optimization (BO), and weight-sharing strategies (WS) (Zoph & Le, 2016; Real et al., 2019; Hutter et al., 2011; Liu et al., 2019; Guo et al., 2019). While the weight-sharing strategy improves overall efficiency by reusing trained weights to reduce the total training cost, zeroth-order algorithms like RL, EA, and BO employ black-box optimization, aiming to find optimal solutions with fewer samples. However, the search space of NAS grows exponentially with the number of choices. As a result, such huge combinatorial search spaces lead to insufficient exploitation in black-box learning frameworks (Luo et al., 2018). Another line of research formulates the NAS search space as graph structures, typically directed acyclic graphs (DAGs), and casts the search target as choosing an optimal combination of nodes and edges in the graph (Pham et al., 2018; Liu et al., 2019; Xie et al., 2019). However, existing methods tend to perform this optimization indirectly using black-box optimization. In contrast, we aim to explicitly model the search space as graphs and optimize graph structures directly. We thus propose Graph Optimized Neural Architecture Learning (GOAL), a novel NAS approach combined with graph learning for efficient exploitation, as briefly shown in Fig. 1.
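As a concrete illustration (not the paper's exact encoding), a cell in such a DAG search space can be represented by an adjacency matrix over nodes plus a one-hot operation choice per node; the operation names below are hypothetical:

```python
import numpy as np

# Hypothetical cell encoding: a DAG over 4 nodes as an upper-triangular
# adjacency matrix, plus a one-hot operation choice per node.
NUM_NODES = 4
OPS = ["conv3x3", "conv1x1", "maxpool"]          # illustrative op set
adj = np.triu(np.ones((NUM_NODES, NUM_NODES), dtype=int), k=1)  # fully connected DAG
ops = np.eye(len(OPS), dtype=int)[[0, 2, 1, 0]]  # node i uses OPS[choice[i]]
print(adj.sum(), ops.shape)                      # 6 edges, (4, 3) one-hot matrix
```

Searching over this space then amounts to choosing which edges to keep and which operation each node performs, which is exactly the discrete structure GOAL optimizes directly.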
Unlike other black-box approaches, we use a differentiable surrogate model to directly optimize the graph structures. The surrogate model takes as input a graph structure corresponding to a neural architecture and predicts a relative ranking score as the search signal. We then apply gradient descent on the input graph structure to optimize the corresponding architecture toward a better predicted ranking score. As we optimize the surrogate model and the architectures iteratively, optimal architectures can typically be obtained after a few iterations. In particular, to cope with the difficulty of using gradient-based optimization on discrete graph structures, we adapt the proximal algorithm, which allows us to optimize discrete variables in a gradient-based framework.

The main contributions of this paper are summarized as follows:

• We propose a differentiable surrogate model for ranking neural architectures based on GNNs, which takes advantage of the graph structure of neural architectures and guides the architecture search efficiently.

• We present GOAL, a novel gradient-based, sample-efficient NAS approach built on the proposed surrogate model. Compared to existing algorithms with GNN surrogates, GOAL makes full use of the learned representation by jointly optimizing the GNN model and candidate architectures and performing efficient exploitation within graph-structured search spaces.

• Our empirical results demonstrate that GOAL significantly outperforms existing state-of-the-art methods in various search space settings.
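A minimal sketch of the proximal update, assuming a toy linear surrogate whose gradient is simply its weight matrix (the paper uses a learned GNN): each continuous gradient-ascent step is followed by a projection back onto valid one-hot choices.

```python
import numpy as np

def proximal_step(alpha, grad, lr=0.5):
    """One update: continuous ascent on the surrogate score, then project
    each row back onto the nearest one-hot vector (the proximal operator
    for the one-choice-per-edge constraint)."""
    relaxed = alpha + lr * grad
    proj = np.zeros_like(alpha)
    proj[np.arange(alpha.shape[0]), relaxed.argmax(axis=1)] = 1.0
    return proj

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                      # toy surrogate: score = <W, alpha>
alpha = np.eye(3)[rng.integers(0, 3, size=4)]    # random one-hot architecture
for _ in range(3):
    alpha = proximal_step(alpha, W)              # grad of a linear surrogate is W
print(alpha.sum(axis=1))                         # each edge still picks exactly one op
```

Because the projection only switches a row's choice when the relaxed score gain outweighs the current assignment, the predicted score is non-decreasing across steps while the iterate always stays a valid discrete architecture.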

2. RELATED WORK

2.1. NEURAL ARCHITECTURE SEARCH

From the earliest explorations of automatic neural network design to recent NAS trends, the NAS problem has developed from hyper-parameter optimization into a more challenging task due to the inherent complexity of its search space (Bergstra et al., 2013; Elsken et al., 2018). Existing popular approaches include various kinds of algorithms: reinforcement learning (Zoph & Le, 2016; Zoph et al., 2017; Pham et al., 2018), evolutionary algorithms (Real et al., 2019), Bayesian optimization (Falkner et al., 2018; White et al., 2019), Monte Carlo tree search (Wang et al., 2019b;a), gradient-based methods (Liu et al., 2019; Luo et al., 2018), etc. Some works also employ surrogate models to predict the performance of architectures before training, reducing the cost of architecture evaluation (Liu et al., 2018; Wen et al., 2019; Wang et al., 2019b). Some recent parallel works even try to improve black-box optimization methods like Bayesian optimization with efficient surrogate predictors (White et al., 2019; Shi et al., 2019). Existing gradient-based methods usually employ weight-sharing-based relaxation or use an encoder-decoder to optimize a continuous hidden space (Liu et al., 2019; Luo et al., 2018). These approximations cause biased model criteria and generation, which can harm the final performance (Yu et al., 2019; Yang et al., 2020). In contrast, our method directly optimizes the discrete architectures, avoiding the biased model criterion.



Figure 1: Overview of the GOAL steps. A GNN-based surrogate model f predicts the ranking score ỹ of the current architecture α_t; ỹ is then back-propagated through the GNN model to compute the gradient w.r.t. α_t. A better architecture α_{t+1} is obtained by proximal gradient descent.
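The loop in the figure can be sketched end-to-end. In this toy sketch, a least-squares linear model stands in for the GNN surrogate and a synthetic score function stands in for training an architecture; both stand-ins, and all names below, are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_EDGES, NUM_OPS = 4, 3
true_w = rng.normal(size=(NUM_EDGES, NUM_OPS))   # hidden toy "accuracy" landscape

def evaluate(alpha):
    """Stand-in for training/evaluating an architecture."""
    return float((true_w * alpha).sum())

def fit_surrogate(archs, scores):
    """Toy linear surrogate fit by least squares; GOAL trains a GNN instead."""
    X = np.stack([a.ravel() for a in archs])
    w, *_ = np.linalg.lstsq(X, np.array(scores), rcond=None)
    return w.reshape(NUM_EDGES, NUM_OPS)

def proximal_ascent(alpha, grad, lr=0.5, steps=3):
    """Gradient ascent on the surrogate score with one-hot projection."""
    for _ in range(steps):
        relaxed = alpha + lr * grad
        alpha = np.zeros_like(alpha)
        alpha[np.arange(NUM_EDGES), relaxed.argmax(axis=1)] = 1.0
    return alpha

# alternate between fitting the surrogate and proposing a new architecture
archs = [np.eye(NUM_OPS)[rng.integers(0, NUM_OPS, NUM_EDGES)] for _ in range(8)]
scores = [evaluate(a) for a in archs]
for t in range(5):
    grad = fit_surrogate(archs, scores)          # for a linear model, grad == weights
    cand = proximal_ascent(archs[int(np.argmax(scores))], grad)
    archs.append(cand)
    scores.append(evaluate(cand))
print(round(max(scores), 3))
```

Each iteration refits the surrogate on all evaluated architectures and then applies proximal gradient ascent starting from the current best, mirroring the α_t → α_{t+1} step in the figure.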

