NEURAL ARCHITECTURE SEARCH WITHOUT TRAINING

Abstract

The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be remedied if we could infer a network's trained accuracy from its initial state. In this work, we examine the correlation of linear maps induced by augmented versions of a single image in untrained networks and motivate how this can be used to give a measure which is highly indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU, and verify its effectiveness on NAS-Bench-101 and NAS-Bench-201. Finally, we show that our approach can be readily combined with more expensive search methods for added value: we modify regularised evolutionary search to produce a novel algorithm that outperforms its predecessor.

1. INTRODUCTION

The success of deep learning in computer vision is in no small part due to the insight and engineering efforts of human experts, allowing for the creation of powerful architectures for widespread adoption (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016; Szegedy et al., 2016; Huang et al., 2017). However, this manual design is costly, and becomes increasingly difficult as networks grow larger and more complicated. Because of these challenges, the neural network community has seen a shift from designing architectures to designing algorithms that search for candidate architectures (Elsken et al., 2019; Wistuba et al., 2019). These Neural Architecture Search (NAS) algorithms are capable of automating the discovery of effective architectures (Zoph & Le, 2017; Zoph et al., 2018; Pham et al., 2018; Tan et al., 2019; Liu et al., 2019; Real et al., 2019). NAS algorithms are broadly based on the seminal work of Zoph & Le (2017): a controller network generates an architecture proposal, which is then trained to provide a signal to the controller through REINFORCE (Williams, 1992), which then produces a new proposal, and so on. Training a network for every controller update is extremely expensive; Zoph & Le (2017) used 800 GPUs for 28 days. Subsequent work has sought to ameliorate this by (i) learning stackable cells instead of whole networks (Zoph et al., 2018) and (ii) incorporating weight sharing, whereby candidate networks share weights and are trained jointly (Pham et al., 2018). These contributions have greatly accelerated NAS algorithms, e.g. to half a day on a single GPU in Pham et al. (2018). For some practitioners, however, NAS is still too slow; being able to perform NAS quickly (i.e. in seconds) would be immensely useful in the hardware-aware setting, where a separate search is typically required for each device and task (Wu et al., 2019; Tan et al., 2019).
Moreover, recent works have scrutinised NAS with weight sharing (Li & Talwalkar, 2019; Yu et al., 2020); there is continued debate as to whether it is clearly better than simple random search. The issues of cost and time, and the risks of weight sharing, could be avoided entirely if a NAS algorithm did not require any network training. In this paper, we show that this can be achieved. We explore two recently released NAS benchmarks, NAS-Bench-101 (Ying et al., 2019) and NAS-Bench-201 (Dong & Yang, 2020), and examine the relationship between the linear maps induced by an untrained network for a minibatch of augmented versions of a single image (Section 3). These maps are easily computed using the Jacobian, and the correlations between them give a measure that is highly indicative of a network's trained performance.
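To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of computing these linear maps and their correlations. It assumes a toy one-hidden-layer ReLU network with random, untrained weights, whose input Jacobian can be written analytically since a ReLU network is locally linear; a real candidate architecture would require autograd, and the minibatch here is random data standing in for augmented versions of a single image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy untrained ReLU network (hypothetical stand-in for a candidate
# architecture): 8-dim input -> 32 hidden units -> scalar output.
W1 = rng.standard_normal((32, 8))
W2 = rng.standard_normal((1, 32))

def input_jacobian(x):
    """Analytic Jacobian of the scalar output w.r.t. the input x.

    A ReLU network is piecewise linear, so around x the output equals
    (W2 * mask) @ W1 @ x, where mask marks the active hidden units;
    the Jacobian is therefore (W2 * mask) @ W1.
    """
    mask = (W1 @ x > 0).astype(float)   # active-unit indicator, shape (32,)
    return (W2 * mask) @ W1             # linear map for this input, shape (1, 8)

# Minibatch standing in for augmented versions of one image.
batch = rng.standard_normal((16, 8))
J = np.vstack([input_jacobian(x) for x in batch])   # (16, 8): one map per input

# Correlation matrix between the per-input linear maps. High off-diagonal
# correlation means the untrained network acts near-identically on the
# different inputs; the paper turns this structure into a training-free score.
C = np.corrcoef(J)
print(C.shape)  # (16, 16)
```

The design choice worth noting is that only forward/backward passes through an *untrained* network are needed, which is why the resulting measure can be evaluated in seconds per candidate.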

