TRANSFER NAS WITH META-LEARNED BAYESIAN SURROGATES

Abstract

While neural architecture search (NAS) is an intensely-researched area, approaches typically still suffer from either (i) high computational costs or (ii) a lack of robustness across datasets and experiments. Furthermore, most methods start searching for an optimal architecture from scratch, ignoring prior knowledge. This is in contrast to the manual design process of researchers and engineers, who leverage previous deep learning experience by, e.g., transferring architectures from previously solved, related problems. We propose to adopt this human design strategy and introduce a novel surrogate for NAS that is meta-learned on prior architecture evaluations across different datasets. We utilize Bayesian Optimization (BO) with deep-kernel Gaussian Processes, graph neural networks for obtaining architecture embeddings, and a transformer-based dataset encoder. As a result, our method consistently achieves state-of-the-art results on six computer vision datasets, while being as fast as one-shot NAS methods.

1. INTRODUCTION

While deep learning has removed the need for manual feature engineering, it has shifted this manual work to the meta-level, introducing the need for manual architecture engineering. The natural next step is to also remove the need to manually define the architecture. This is the problem tackled by the field of neural architecture search (NAS). Even though NAS is an intensely-researched area, there is still no NAS method that is both generally robust and efficient. Blackbox optimization methods, such as reinforcement learning (Zoph & Le, 2017), evolutionary algorithms (Real et al., 2019), and Bayesian optimization (Ru et al., 2021; White et al., 2021), work reliably but are slow. On the other hand, one-shot methods (Liu et al., 2019; Dong & Yang, 2019b) often have problems with robustness (Zela et al., 2020), and the newest trend of zero-cost proxies often does not provide more information about an architecture's performance than simple statistics, such as the architecture's number of parameters (White et al., 2022).

An understudied path towards efficiency in NAS is to transfer information across datasets. This idea is naturally motivated by how researchers and engineers tackle new deep learning problems: they leverage the knowledge they obtained from previous experimentation and, e.g., re-use architectures designed for one task and apply or adapt them to a novel task. While a few NAS approaches in this direction exist (Wong et al., 2018; Lian et al., 2020; Elsken et al., 2020; Wistuba, 2021; Lee et al., 2021; Ru et al., 2021), they typically come with one or more of the following limitations: (i) they are only applicable to settings with little data, (ii) they only explore a fairly limited search space or can even just choose from a handful of pre-selected architectures, or (iii) they cannot adapt to data seen at test time.
One approach to obtaining efficient NAS methods that has been overlooked in the literature so far is to exploit the common formulation of NAS as a hyperparameter optimization (HPO) problem (Bergstra et al., 2013; Domhan et al., 2015; Awad et al., 2021) and draw on the extensive literature on transfer HPO (Wistuba et al., 2016; Feurer et al., 2018a; Perrone & Shen, 2019; Salinas et al., 2020; Wistuba & Grabocka, 2021). In contrast to standard transfer HPO methods that meta-learn parametric surrogates from a pool of source datasets (Wistuba et al., 2016; Feurer et al., 2018a; Wistuba & Grabocka, 2021), in this work we explore the direction of meta-learning surrogates by contextualizing them on the dataset characteristics (a.k.a. meta-features) (Vanschoren, 2018; Jomaa et al., 2021a; Rivolli et al., 2022). Concretely, we propose an efficient BO method for NAS with a novel deep-kernel surrogate that leverages dataset contextualization for transfer learning, yielding a NAS method which combines the best of both worlds: the reliability of blackbox optimization at a computational cost in the same order of magnitude as one-shot approaches. Following Lee et al. (2021), we use a graph encoder (Zhang et al., 2019) to encode neural architectures and an attention-based dataset encoder (Lee et al., 2019) to obtain context features. We then use deep kernel learning (Wilson et al., 2016) to obtain meta-learned kernels for the joint space of architectures and datasets, allowing us to use the full power of BO for efficient NAS. This approach solves two key issues of Lee et al. (2021), which is closest to our work: (i) the lack of trading off exploration vs. exploitation, and (ii) the lack of exploiting new function evaluations on a test task, instead blindly following what has been observed during meta-training.
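As a purely illustrative sketch of the deep-kernel surrogate idea described above: a learned feature extractor maps (architecture, dataset) inputs into a latent space, and a standard GP kernel is evaluated on those embeddings instead of the raw inputs. The two-layer MLP below is a hypothetical stand-in for the paper's graph and dataset encoders, and all names, dimensions, and data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in feature extractor: in the paper this is a graph neural network
# for the architecture plus a transformer-based dataset encoder; here it
# is a fixed random two-layer MLP purely for illustration.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def embed(x):
    """Map raw (architecture, dataset) feature vectors to a latent space."""
    return np.tanh(x @ W1) @ W2

def rbf(A, B, ls=1.0):
    """RBF kernel on the embeddings -- the 'deep kernel'."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-4):
    """Standard GP posterior mean/variance, with the kernel applied
    to embed(x) rather than to x itself."""
    Z, Zs = embed(X_train), embed(X_test)
    K = rbf(Z, Z) + noise * np.eye(len(Z))
    Ks = rbf(Zs, Z)
    Kss = rbf(Zs, Zs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = Ks @ alpha                       # posterior mean
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(Kss) - (v**2).sum(0)    # posterior variance
    return mu, var

# Toy data: 5 observed (architecture, dataset) pairs, 3 candidates.
X_tr = rng.normal(size=(5, 8))
y_tr = rng.normal(size=5)
mu, var = gp_posterior(X_tr, y_tr, rng.normal(size=(3, 8)))
```

In the actual method, the extractor's weights are meta-learned jointly with the GP hyperparameters across source tasks, so the kernel encodes similarity between (architecture, dataset) pairs rather than between architectures alone.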
As a result, our surrogates are optimized for efficiently transferring architectures to a new target dataset based on its meta-features. To sum up, our contributions are as follows:
• Inspired by manual architecture design, we treat NAS as a transfer or few-shot learning problem. We leverage ideas from transfer HPO to meta-learn a kernel for Bayesian Optimization that encodes both architecture and dataset information.
• We are the first to combine deep-kernel Gaussian Processes (GPs) with a graph neural network encoder and a transformer-based dataset encoder, the first to apply BO with deep GPs to NAS, and the first to do all of this in a transfer NAS setting.
• Our resulting method outperforms both state-of-the-art blackbox NAS methods and state-of-the-art one-shot methods across six computer vision benchmarks.
To foster reproducibility, we make our code available at https://github.com/TNAS-DCS/TNAS-DCS. We address the points in the "NAS Best Practices Checklist" in Appendix F.

2. RELATED WORK

NAS is an intensely-researched field, with over 1000 papers published in the last two years alone¹. We therefore limit our discussion of NAS to the most closely related areas: Bayesian optimization for NAS and meta-learning approaches for NAS. For a full discussion of the NAS literature, we refer the interested reader to the surveys by Elsken et al. (2019), Wistuba et al. (2019) and Ren et al. (2020), and for an introduction to BO to Shahriari et al. (2016); Hutter et al. (2019).

Bayesian optimization (BO) for NAS. As BO is commonly used in hyperparameter optimization (HPO), one can simply treat architectural choices as categorical hyperparameters and re-use, e.g., tree-based HPO methods that natively handle categorical choices well (Bergstra et al., 2013; Domhan et al., 2015; Falkner et al., 2018). While Gaussian Processes (GPs) are more typically applied to continuous hyperparameters, they can also be used for NAS by creating an appropriate kernel; such kernels for GP-based BO can be manually engineered (Swersky et al., 2013; Kandasamy et al., 2018; Ru et al., 2021). A recent alternative is to exploit (Bayesian) neural networks for BO (Snoek et al., 2015; Springenberg et al., 2016; White et al., 2021). However, while these neural networks are very expressive, they require more data to fit well than GPs and are thus outperformed by GP-based approaches when only a few function evaluations can be afforded. In this work, we combine the sample efficiency of GPs with the expressive power of neural networks by using deep GPs combined with a graph neural network encoder.

Meta-learning for NAS. To mitigate the computational infeasibility of starting NAS methods from scratch for each new task, several approaches have been proposed along the lines of meta- and transfer learning. Most of these warm-start the weights of architectures in a target task (Wong et al., 2018; Lian et al., 2020; Elsken et al., 2020; Wistuba, 2021). Ru et al. (2021) extract architectural motifs that can be reused on other datasets. Most related to our work is MetaD2A (Lee et al., 2021), where the authors propose to generate candidate architectures and rank them conditioned directly on a task, utilizing a meta-feature extractor (Lee et al., 2019). However, there are two key differences in our

¹ See list at: https://www.automl.org/automl/literature-on-neural-architecture-search
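The exploration-exploitation trade-off that GP-based BO provides, and that pure ranking approaches lack, comes from the acquisition function. The snippet below is the textbook Expected Improvement criterion for minimization (not code from the paper); all values are illustrative.

```python
import math

def expected_improvement(mu, sigma, best, minimize=True):
    """Expected Improvement for one candidate under a Gaussian
    posterior N(mu, sigma^2), where `best` is the incumbent value.
    Balances exploitation (good predicted value) against
    exploration (high predictive uncertainty)."""
    if sigma <= 0:
        return 0.0
    imp = (best - mu) if minimize else (mu - best)
    z = imp / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return imp * cdf + sigma * pdf

# Two candidates predicted exactly at the incumbent (best = 0.10):
# the uncertain one is worth evaluating, the certain one is not.
ei_certain = expected_improvement(mu=0.10, sigma=0.001, best=0.10)
ei_uncertain = expected_improvement(mu=0.10, sigma=0.05, best=0.10)
```

A surrogate that only ranks architectures by predicted performance would score both candidates identically; EI assigns far more value to the uncertain one, which is precisely the behavior the exploration term contributes.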

