TRANSFER NAS WITH META-LEARNED BAYESIAN SURROGATES

Abstract

While neural architecture search (NAS) is an intensely researched area, approaches typically still suffer from either (i) high computational costs or (ii) a lack of robustness across datasets and experiments. Furthermore, most methods search for an optimal architecture from scratch, ignoring prior knowledge. This is in contrast to the manual design process of researchers and engineers, who leverage previous deep learning experience by, e.g., transferring architectures from previously solved, related problems. We propose to adopt this human design strategy and introduce a novel surrogate for NAS that is meta-learned from prior architecture evaluations across different datasets. We combine Bayesian optimization (BO) with deep-kernel Gaussian processes, using graph neural networks to obtain architecture embeddings and a transformer-based encoder to represent datasets. As a result, our method consistently achieves state-of-the-art results on six computer vision datasets while being as fast as one-shot NAS methods.

1. INTRODUCTION

While deep learning has removed the need for manual feature engineering, it has shifted this manual work to the meta-level, introducing the need for manual architecture engineering. The natural next step is to also remove the need to manually define the architecture. This is the problem tackled by the field of neural architecture search (NAS). Even though NAS is an intensely researched area, there is still no NAS method that is both generally robust and efficient. Black-box optimization methods, such as reinforcement learning (Zoph & Le, 2017), evolutionary algorithms (Real et al., 2019), and Bayesian optimization (Ru et al., 2021; White et al., 2021), work reliably but are slow. On the other hand, one-shot methods (Liu et al., 2019; Dong & Yang, 2019b) often have problems with robustness (Zela et al., 2020), and the newest trend of zero-cost proxies often does not provide more information about an architecture's performance than simple statistics, such as the architecture's number of parameters (White et al., 2022). An understudied path towards efficiency in NAS is to transfer information across datasets. This idea is naturally motivated by how researchers and engineers tackle new deep learning problems: they leverage the knowledge obtained from previous experimentation and, e.g., re-use architectures designed for one task and apply or adapt them to a novel task. While a few NAS approaches in this direction exist (Wong et al., 2018; Lian et al., 2020; Elsken et al., 2020; Wistuba, 2021; Lee et al., 2021; Ru et al., 2021), they typically come with one or more of the following limitations: (i) they are only applicable to settings with little data, (ii) they only explore a fairly limited search space or can even just choose from a handful of pre-selected architectures, or (iii) they cannot adapt to data seen at test time.
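To make the surrogate idea concrete, the following is a minimal, self-contained sketch of a deep-kernel Gaussian process posterior of the kind used as a BO surrogate for predicting architecture performance. All names and shapes here are illustrative assumptions: the fixed random MLP stands in for a learned feature extractor (the actual method uses graph neural network architecture embeddings, a transformer-based dataset encoder, and meta-learned kernel parameters, none of which are reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical architecture encodings: each row is a flat feature vector
# standing in for an architecture embedding (illustrative placeholder).
X_train = rng.normal(size=(20, 8))
y_train = np.sin(X_train.sum(axis=1))  # toy "validation accuracy" signal

# A tiny fixed MLP acting as the deep-kernel feature extractor phi(x).
# In a real deep-kernel GP these weights are trained (here: meta-learned
# across datasets); random weights are used purely for the sketch.
W1 = rng.normal(size=(8, 16)) / np.sqrt(8)
W2 = rng.normal(size=(16, 4)) / np.sqrt(16)

def phi(X):
    return np.tanh(X @ W1) @ W2

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def deep_kernel(A, B):
    # k(x, x') = k_RBF(phi(x), phi(x')): a GP kernel on learned features.
    return rbf(phi(A), phi(B))

def gp_posterior(X_tr, y_tr, X_te, noise=1e-3):
    # Standard GP regression posterior via a Cholesky factorization.
    K = deep_kernel(X_tr, X_tr) + noise * np.eye(len(X_tr))
    K_s = deep_kernel(X_te, X_tr)
    K_ss = deep_kernel(X_te, X_te)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(K_ss) - (v**2).sum(axis=0)
    return mean, var

# Predictive mean and variance for five unseen "architectures"; BO would
# feed these into an acquisition function to pick the next evaluation.
X_test = rng.normal(size=(5, 8))
mean, var = gp_posterior(X_train, y_train, X_test)
print(mean.shape, var.shape)  # (5,) (5,)
```

The predictive variance is what distinguishes a GP surrogate from a plain regressor: it lets the BO loop trade off exploring uncertain architectures against exploiting promising ones.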
One approach to obtain efficient NAS methods that has been overlooked in the literature so far is to exploit the common formulation of NAS as a hyperparameter optimization (HPO) problem (Bergstra et al., 2013; Domhan et al., 2015; Awad et al., 2021) and draw on the extensive literature on transfer HPO (Wistuba et al., 2016; Feurer et al., 2018a; Perrone & Shen, 2019; Salinas et al., 2020; Wistuba & Grabocka, 2021). In contrast to standard transfer HPO methods that meta-learn parametric surrogates from a pool of source datasets (Wistuba et al., 2016; Feurer et al., 2018a; Wistuba & Grabocka, 2021), in this work we explore the direction of meta-learning

