INTERPRETABLE NEURAL ARCHITECTURE SEARCH VIA BAYESIAN OPTIMISATION WITH WEISFEILER-LEHMAN KERNELS

Abstract

Current neural architecture search (NAS) strategies focus only on finding a single good architecture. They offer little insight into why a specific network performs well, or how we should modify the architecture if we want further improvements. We propose a Bayesian optimisation (BO) approach for NAS that combines the Weisfeiler-Lehman graph kernel with a Gaussian process surrogate. Our method optimises the architecture in a highly data-efficient manner: it captures the topological structures of the architectures and is scalable to large graphs, thus making the high-dimensional, graph-like search spaces amenable to BO. More importantly, our method affords interpretability by discovering useful network features and their corresponding impact on network performance. Indeed, we demonstrate empirically that our surrogate model can identify useful motifs which guide the generation of new architectures. Finally, we show that our method outperforms existing NAS approaches to achieve the state of the art on both closed- and open-domain search spaces.

1. INTRODUCTION

Neural architecture search (NAS) aims to automate the design of good neural network architectures for a given task and dataset. Although different NAS strategies have led to state-of-the-art neural architectures, outperforming human experts' designs on a variety of tasks (Real et al., 2017; Zoph and Le, 2017; Cai et al., 2018; Liu et al., 2018a;b; Luo et al., 2018; Pham et al., 2018; Real et al., 2018; Zoph et al., 2018a; Xie et al., 2018), these strategies behave in a black-box fashion, returning little design insight beyond the final architecture for deployment. In this paper, we introduce the idea of interpretable NAS, extending the learning scope from simply the optimal architecture to interpretable features. These features can help explain the performance of the networks searched and guide future architecture design. We make the first attempt at interpretable NAS by proposing a new NAS method, NAS-BOWL; our method combines a Gaussian process (GP) surrogate with the Weisfeiler-Lehman (WL) subtree graph kernel (we term this surrogate GPWL) and applies it within the Bayesian optimisation (BO) framework to efficiently query the search space. During search, we harness the interpretable architecture features extracted by the WL kernel and learn their corresponding effects on the network performance from the surrogate gradient information. Besides offering a new perspective on interpretability, our method also improves over existing BO-based NAS approaches. To accommodate the popular cell-based search spaces, which are non-continuous and graph-like (Zoph et al., 2018a; Ying et al., 2019; Dong and Yang, 2020), current approaches either rely on encoding schemes (Ying et al., 2019; White et al., 2019) or manually designed similarity metrics (Kandasamy et al., 2018), neither of which scales to large architectures, and both of which ignore the important topological structure of architectures.
Another line of work employs graph neural networks (GNNs) to construct the BO surrogate (Ma et al., 2019; Zhang et al., 2019; Shi et al., 2019); however, the GNN design introduces additional hyperparameter tuning, and training the GNN also requires a large amount of architecture data, which is particularly expensive to obtain in NAS. Our method, instead, uses the WL graph kernel to naturally handle the graph-like search spaces and capture the topological structure of architectures. Meanwhile, our surrogate preserves the merits of GPs in data efficiency, uncertainty computation and automated hyperparameter treatment. In summary, our main contributions are as follows:

• We introduce a GP-based BO strategy for NAS, NAS-BOWL, which is highly query-efficient and amenable to the graph-like NAS search spaces. Our proposed surrogate model combines a GP with the WL graph kernel (GPWL) to exploit the implicit topological structure of architectures. It is scalable to large architecture cells (e.g. 32 nodes) and achieves better prediction performance than competing methods.

2. PRELIMINARIES

Graph Representation of Neural Networks. Architectures in popular NAS search spaces can be represented as directed acyclic graphs (Elsken et al., 2018; Zoph et al., 2018b; Ying et al., 2019; Dong and Yang, 2020; Xie et al., 2019), where each graph node represents an operation unit or layer (e.g. a conv3×3-bn-relu in Ying et al. (2019)) and each edge defines the information flow from one layer to another. With this representation, NAS can be formulated as an optimisation problem: find the directed graph and its corresponding node operations (i.e. the directed attributed graph $G$) that give the best architecture validation performance $y(G)$:
$$G^* = \arg\max_{G} y(G).$$
Bayesian Optimisation and Gaussian Processes. To solve the above optimisation, we adopt BO, which is a query-efficient technique for optimising a black-box, expensive-to-evaluate objective (Brochu et al., 2010). BO uses a statistical surrogate to model the objective and builds an acquisition function based on the surrogate. The next query location is recommended by optimising the acquisition function, which balances exploitation and exploration. We use a GP as the surrogate model in this work, as it can achieve competitive modelling performance with a small amount of query data (Williams and Rasmussen, 2006) and gives an analytic predictive posterior mean $\mu(G_t \mid \mathcal{D}_{t-1})$ and variance $k(G_t, G_t \mid \mathcal{D}_{t-1})$ on a heretofore unseen graph $G_t$ given $t-1$ observations:
$$\mu(G_t \mid \mathcal{D}_{t-1}) = \mathbf{k}(G_t, G_{1:t-1}) \mathbf{K}_{1:t-1}^{-1} \mathbf{y}_{1:t-1}$$
$$k(G_t, G_t \mid \mathcal{D}_{t-1}) = k(G_t, G_t) - \mathbf{k}(G_t, G_{1:t-1}) \mathbf{K}_{1:t-1}^{-1} \mathbf{k}(G_{1:t-1}, G_t)$$
where $G_{1:t-1} = \{G_1, \ldots, G_{t-1}\}$ and $\mathbf{y}_{1:t-1} = [y_1, \ldots, y_{t-1}]^T$ are the $t-1$ observed graphs and objective function values, respectively, and $\mathcal{D}_{t-1} = \{G_{1:t-1}, \mathbf{y}_{1:t-1}\}$. $[\mathbf{K}_{1:t-1}]_{i,j} = k(G_i, G_j)$ is the $(i,j)$-th element of the Gram matrix induced on the training samples by $k(\cdot, \cdot)$, the graph kernel function.
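The posterior equations above reduce to a few linear-algebra operations once the kernel values between graphs have been computed. The snippet below is a minimal illustration of that computation (not the paper's implementation); the function name, the noise jitter, and the precomputed-kernel interface are our own assumptions for the sketch:

```python
import numpy as np

def gp_posterior(K_train, k_star, k_star_star, y_train, noise=1e-6):
    """GP posterior mean/variance at a test graph from precomputed kernel values.

    K_train:     (n, n) Gram matrix k(G_i, G_j) over the observed graphs
    k_star:      (n,)   kernel vector k(G_t, G_i) between test and observed graphs
    k_star_star: scalar k(G_t, G_t)
    y_train:     (n,)   observed objective values (e.g. validation accuracies)
    """
    # Add a small jitter to the diagonal for numerical stability.
    K = K_train + noise * np.eye(len(y_train))
    # Posterior mean: k(G_t, G_{1:t-1}) K^{-1} y_{1:t-1}
    mean = k_star @ np.linalg.solve(K, y_train)
    # Posterior variance: k(G_t, G_t) - k(G_t, G_{1:t-1}) K^{-1} k(G_{1:t-1}, G_t)
    var = k_star_star - k_star @ np.linalg.solve(K, k_star)
    return mean, max(var, 0.0)  # clamp tiny negative values from round-off
```

In practice one would factorise `K` once (e.g. via a Cholesky decomposition) and reuse it across test graphs, but the direct solves keep the correspondence with the equations explicit.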
We use Expected Improvement (Mockus et al., 1978) as the acquisition function in this work, though our approach is compatible with alternative choices.
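Under a Gaussian posterior, Expected Improvement has a well-known closed form; the sketch below implements the standard EI formula for maximisation (generic textbook EI, not code from this paper):

```python
import math

def expected_improvement(mu, sigma, y_best):
    """EI(G) = E[max(0, f(G) - y_best)] for f(G) ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # Degenerate posterior with no uncertainty: improvement is deterministic.
        return max(mu - y_best, 0.0)
    z = (mu - y_best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal phi(z)
    return (mu - y_best) * cdf + sigma * pdf
```

The first term rewards candidates whose posterior mean already exceeds the incumbent `y_best` (exploitation), while the second rewards high posterior uncertainty (exploration).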



Graph kernels are kernel functions defined over graphs to compute their level of similarity. A generic graph kernel may be represented by a function $k(\cdot, \cdot)$ over a pair of graphs $G$ and $G'$ (Kriege et al., 2020):
$$k(G, G') = \langle \phi(G), \phi(G') \rangle_{\mathcal{H}} \quad (2.1)$$
where $\phi(\cdot)$ is some feature representation of the graph extracted by the graph kernel and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product in the associated reproducing kernel Hilbert space (RKHS) (Nikolentzos et al., 2019; Kriege et al., 2020). For more detailed reviews of graph kernels, the reader is referred to Nikolentzos et al. (2019), Ghosh et al. (2018) and Kriege et al. (2020).
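To make the feature map $\phi(\cdot)$ concrete for the WL subtree kernel used here, the sketch below counts WL subtree features by iteratively relabelling each node with its own label plus the sorted labels of its neighbours, then takes a dot product of the resulting feature counts. It is a simplified illustration under our own assumptions (undirected adjacency dicts, string concatenation instead of hashed label compression), not the paper's GPWL implementation:

```python
from collections import Counter

def wl_features(adj, labels, h=2):
    """Weisfeiler-Lehman subtree features of a labelled graph.

    adj:    dict node -> list of neighbour nodes
    labels: dict node -> initial label (e.g. the node's operation name)
    h:      number of WL relabelling iterations
    Returns a Counter of feature -> count, pooled over all iterations.
    """
    feats = Counter(labels.values())  # iteration 0: raw node labels
    cur = dict(labels)
    for _ in range(h):
        new = {}
        for v in adj:
            # Augment each node's label with its neighbours' sorted labels.
            neigh = sorted(cur[u] for u in adj[v])
            new[v] = cur[v] + "|" + ",".join(neigh)
        cur = new
        feats.update(cur.values())  # pool features from this iteration
    return feats

def wl_kernel(fa, fb):
    """Linear kernel: dot product of two WL feature count vectors."""
    return sum(c * fb.get(f, 0) for f, c in fa.items())
```

Each feature corresponds to a subtree pattern rooted at some node, which is what later allows individual features (network motifs) to be inspected for their effect on performance.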

• We propose the idea of interpretable NAS based on the graph features extracted by the WL kernel and their corresponding surrogate derivatives. We show that this interpretability helps explain the performance of the searched neural architectures. As a single concrete application, we propose a simple yet effective motif-based transfer learning baseline to warm-start search on new image tasks.

• We demonstrate that our surrogate model achieves superior prediction performance with far fewer observations in search spaces of different sizes, and that our strategy achieves state-of-the-art performance on both the NAS-Bench datasets and open-domain experiments while being much more efficient than comparable methods.

CODE AVAILABILITY

Our code is available at https://github.com/xingchenwan/nasbowl

