NEURAL ARCHITECTURE SEARCH WITHOUT TRAINING

Abstract

The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be remedied if we could infer a network's trained accuracy from its initial state. In this work, we examine the correlation of linear maps induced by augmented versions of a single image in untrained networks and motivate how this can be used to give a measure which is highly indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU, and verify its effectiveness on NAS-Bench-101 and NAS-Bench-201. Finally, we show that our approach can be readily combined with more expensive search methods for added value: we modify regularised evolutionary search to produce a novel algorithm that outperforms its predecessor.

1. INTRODUCTION

The success of deep learning in computer vision is in no small part due to the insight and engineering efforts of human experts, which have produced powerful architectures that are now in widespread use (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016; Szegedy et al., 2016; Huang et al., 2017). However, this manual design is costly, and becomes increasingly difficult as networks grow larger and more complicated. Because of these challenges, the neural network community has shifted from designing architectures to designing algorithms that search for candidate architectures (Elsken et al., 2019; Wistuba et al., 2019). These Neural Architecture Search (NAS) algorithms are capable of automating the discovery of effective architectures (Zoph & Le, 2017; Zoph et al., 2018; Pham et al., 2018; Tan et al., 2019; Liu et al., 2019; Real et al., 2019). NAS algorithms are broadly based on the seminal work of Zoph & Le (2017): a controller network generates an architecture proposal, which is then trained to provide a reward signal to the controller through REINFORCE (Williams, 1992); the controller then produces a new proposal, and so on. Training a network for every controller update is extremely expensive; Zoph & Le (2017) used 800 GPUs for 28 days. Subsequent work has sought to ameliorate this by (i) learning stackable cells instead of whole networks (Zoph et al., 2018) and (ii) incorporating weight sharing, which allows candidate networks to be trained jointly (Pham et al., 2018). These contributions have substantially reduced the cost of NAS, e.g. to half a day on a single GPU in Pham et al. (2018). For some practitioners, however, NAS is still too slow; being able to perform NAS quickly (i.e. in seconds) would be immensely useful in the hardware-aware setting, where a separate search is typically required for each device and task (Wu et al., 2019; Tan et al., 2019).
Moreover, recent works have scrutinised NAS with weight sharing (Li & Talwalkar, 2019; Yu et al., 2020); there is continued debate as to whether it is clearly better than simple random search. The issues of cost and time, and the risks of weight sharing, could be avoided entirely if a NAS algorithm did not require any network training. In this paper, we show that this can be achieved. We explore two recently released NAS benchmarks, NAS-Bench-101 (Ying et al., 2019) and NAS-Bench-201 (Dong & Yang, 2020), and examine the relationship between the linear maps induced by an untrained network for a minibatch of augmented versions of a single image (Section 3). These maps are easily computed using the Jacobian. The correlations between these maps (which we denote by Σ_J) are distinctive for networks that perform well when trained on both NAS-Benches; this is immediately apparent from visualisation alone (Figure 1). We devise a score based on Σ_J and perform an ablation study to demonstrate its robustness to inputs and network initialisation. We incorporate our score into a simple search algorithm that does not require training (Section 4). This allows us to perform architecture search quickly: for example, on CIFAR-10 (Krizhevsky, 2009) we are able to find a network that achieves 93.36% accuracy in 29 seconds within the NAS-Bench-201 search space; several orders of magnitude faster than traditional NAS methods for a modest change in final accuracy (e.g. REINFORCE finds a 93.85% network in 12,000 seconds).
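The computation of Σ_J can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released code: it uses a hypothetical two-layer ReLU network defined in NumPy (so the per-input Jacobian has a closed form), flattens each Jacobian into a row of a matrix J, and forms the correlation matrix Σ_J across the minibatch. The final scalar score (a sum of log-magnitudes of the eigenvalues of Σ_J) is one plausible way to summarise the correlation histogram and is an assumption here; the paper's exact scoring function is given in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny 2-layer ReLU network: x -> W2 @ relu(W1 @ x).
# For a ReLU network the Jacobian at x has a closed form: W2 @ diag(mask) @ W1.
D_in, D_hid, D_out, N = 32, 64, 10, 8   # N = minibatch of augmented images
W1 = rng.standard_normal((D_hid, D_in)) / np.sqrt(D_in)
W2 = rng.standard_normal((D_out, D_hid)) / np.sqrt(D_hid)

def jacobian(x):
    mask = (W1 @ x > 0).astype(float)    # ReLU activation pattern at x
    return W2 @ (W1 * mask[:, None])     # d(output)/d(input), shape (D_out, D_in)

X = rng.standard_normal((N, D_in))       # stand-in for N augmented views of one image
J = np.stack([jacobian(x).ravel() for x in X])  # one flattened linear map per input

# Correlation matrix Sigma_J (N x N) between the per-input linear maps.
Jc = J - J.mean(axis=1, keepdims=True)
norms = np.linalg.norm(Jc, axis=1)
Sigma_J = (Jc @ Jc.T) / (norms[:, None] * norms[None, :])

# One plausible scalar summary (an assumption, not the paper's exact formula):
# well-performing networks have Sigma_J mass near zero, i.e. weakly correlated maps.
eigvals = np.linalg.eigvalsh(Sigma_J)
score = float(np.sum(np.log(np.abs(eigvals) + 1e-5)))
```

In a full implementation the closed-form Jacobian would be replaced by a single backward pass through the candidate network for each image in the minibatch, so the score costs roughly one minibatch of gradient computation per architecture.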
Finally, we show that we can combine our approach with regularised evolutionary search (REA; Real et al., 2019) to produce a new NAS algorithm, Assisted-REA (AREA), that outperforms its predecessor, attaining 94.16% accuracy on NAS-Bench-101 in 12,000 seconds. Code for reproducing our experiments is available in the supplementary material. We believe this work is an important proof-of-concept for NAS without training, and shows that the large resource costs associated with NAS can be avoided. The benefit is two-fold, as we also show that we can integrate our approach into existing NAS techniques for scenarios where obtaining as high an accuracy as possible is of the essence.
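The combination with evolutionary search can be sketched as follows: instead of seeding the evolutionary population at random, draw a larger pool of candidates, keep the top-scoring ones under the training-free score, and then run standard tournament-based regularised evolution. This sketch is an assumption about the structure of AREA rather than the paper's implementation; the function arguments (sample_arch, score_fn, train_and_eval, mutate) and pool sizes are illustrative placeholders, and the ageing mechanism is approximated by treating the most recent pop_size entries of the history as the living population.

```python
import random

def assisted_rea(sample_arch, score_fn, train_and_eval, mutate,
                 pool_size=100, pop_size=20, cycles=50, sample_size=10):
    """Regularised evolution warm-started with a training-free score.

    sample_arch: () -> random architecture
    score_fn: arch -> training-free score (higher is better; cheap to compute)
    train_and_eval: arch -> validation accuracy (the expensive step)
    mutate: arch -> perturbed copy of an architecture
    """
    # 1. The "assisted" part: draw a large random pool and keep only the
    #    highest-scoring candidates as the initial population.
    pool = [sample_arch() for _ in range(pool_size)]
    population = sorted(pool, key=score_fn, reverse=True)[:pop_size]
    history = [(arch, train_and_eval(arch)) for arch in population]

    # 2. Standard regularised evolution: tournament selection with ageing
    #    (only the most recent pop_size individuals can be parents).
    for _ in range(cycles):
        tournament = random.sample(history[-pop_size:], sample_size)
        parent = max(tournament, key=lambda pair: pair[1])[0]
        child = mutate(parent)
        history.append((child, train_and_eval(child)))
    return max(history, key=lambda pair: pair[1])

# Toy usage: "architectures" are integers and accuracy peaks at 42.
random.seed(0)
best, acc = assisted_rea(
    sample_arch=lambda: random.randint(0, 100),
    score_fn=lambda a: -abs(a - 42),
    train_and_eval=lambda a: 1.0 - abs(a - 42) / 100,
    mutate=lambda a: max(0, min(100, a + random.randint(-3, 3))),
)
```

Because the warm start only reorders the initial population, this drop-in modification leaves the evolutionary loop, and hence its cost per cycle, unchanged.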

2. BACKGROUND

Designing a neural architecture by hand is a challenging and time-consuming task. It is extremely difficult to intuit where to place connections, or which operations to use. This has prompted an abundance of research into neural architecture search (NAS); the automation of the network design process. In the pioneering work of Zoph & Le (2017), the authors use an RNN controller to generate



Figure 1: Histograms of the correlations between linear maps in an augmented minibatch of a single CIFAR-10 image for untrained architectures in (a) NAS-Bench-101 and (b) NAS-Bench-201. The histograms are sorted into columns based on the final CIFAR-10 validation accuracy when trained. The y-axes are individually scaled for visibility. The profiles are distinctive; the histograms for good architectures in both search spaces have their mass around zero with a small positive skew. We can look at this distribution for an untrained network to predict its final performance without any training. More histograms are available in Appendix A.

