NEURAL ARCHITECTURE SEARCH ON IMAGENET IN FOUR GPU HOURS: A THEORETICALLY INSPIRED PERSPECTIVE

Abstract

Neural Architecture Search (NAS) has been studied explosively in order to automate the discovery of top-performing neural networks. Current works require heavy training of a supernet or intensive architecture evaluations, thus suffering from heavy resource consumption and often incurring search bias due to truncated training or approximations. Can we select the best neural architectures without involving any training, and eliminate a drastic portion of the search cost? We provide an affirmative answer by proposing a novel framework called training-free neural architecture search (TE-NAS). TE-NAS ranks architectures by analyzing the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space. Both are motivated by recent theoretical advances in deep networks and can be computed without any training and without any labels. We show that: (1) these two measurements imply the trainability and expressivity of a neural network; and (2) they strongly correlate with the network's test accuracy. Furthermore, we design a pruning-based NAS mechanism to achieve a more flexible and superior trade-off between trainability and expressivity during the search. In the NAS-Bench-201 and DARTS search spaces, TE-NAS completes a high-quality search at a cost of only 0.5 and 4 GPU hours with one 1080Ti on CIFAR-10 and ImageNet, respectively. We hope our work inspires more attempts at bridging theoretical findings on deep networks with practical impact in real NAS applications.

1. INTRODUCTION

The recent development of deep networks has contributed significantly to the success of computer vision. Thanks to many efforts by human designers, the performance of deep networks has been significantly boosted (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Xie et al., 2017). However, the manual creation of new network architectures not only costs enormous time and resources due to trial-and-error, but also depends on design experience that does not always scale up. To reduce human effort and cost, neural architecture search (NAS) has recently attracted explosive interest, leading to the principled and automated discovery of good architectures in a given search space of candidates (Zoph & Le, 2016; Brock et al., 2017; Pham et al., 2018; Liu et al., 2018a; Chen et al., 2018; Bender et al., 2018; Gong et al., 2019; Chen et al., 2020a; Fu et al., 2020). As an optimization problem, NAS faces two core questions: 1) "how to evaluate", i.e., the objective function that defines what good architectures we want; 2) "how to optimize", i.e., by what means we can effectively optimize that objective. These two questions are entangled and highly non-trivial, since the search spaces are of extremely high dimension and the generalization ability of architectures cannot be easily inferred (Dong & Yang, 2020; Dong et al., 2020). Existing NAS methods mainly leverage the validation set and conduct accuracy-driven architecture optimization. They either formulate the search space as a super-network ("supernet") and make the training loss differentiable through the architecture parameters (Liu et al., 2018b), or treat architecture selection as a sequential decision-making process (Zoph & Le, 2016) or an evolution of genetics (Real et al., 2019). However, these NAS algorithms suffer from heavy consumption of both time and GPU resources.
Training a supernet till convergence is extremely slow, even with many effective heuristics for sampling or channel approximations (Dong & Yang, 2019; Xu et al., 2019). Approximated proxy inference, such as truncated training or early stopping, can accelerate the search, but is well known to bias the search toward inaccurate results (Pham et al., 2018; Liang et al., 2019; Tan et al., 2020). The heavy search cost not only slows down the discovery of novel architectures, but also blocks us from understanding NAS behaviors more meaningfully. On the other hand, the analysis of a neural network's trainability (how effectively a network can be optimized via gradient descent) and expressivity (how complex a function a network can represent) has witnessed exciting development recently in the deep learning theory field. By formulating neural networks as a Gaussian process (no training involved), the gradient descent training dynamics can be characterized by the neural tangent kernel (NTK) of infinite-width (Lee et al., 2019) or finite-width (Yang, 2019) networks, from which several useful measures can be derived to depict the network's trainability at initialization. Hanin & Rolnick (2019a;b) and Xiong et al. (2020) describe another training-free measure, of network expressivity, by counting the number of unique linear regions into which a neural network divides its input space. We are therefore inspired to ask:

• How can we optimize NAS at a network's initialization without involving any training, thus eliminating a heavy portion of the search cost?

• Can we define "how to evaluate" in NAS by analyzing the trainability and expressivity of architectures, and further benefit our understanding of the search process?

Our answers are yes to both questions. In this work, we propose TE-NAS, a framework for training-free neural architecture search.
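The NTK-based trainability measure described above can be made concrete with a minimal sketch: for a finite-width network at initialization, the empirical NTK is Θ = J Jᵀ, where J stacks each sample's gradient of the network output with respect to all parameters, and the spread of Θ's eigenvalues (its condition number) indicates how well gradient descent conditions the training dynamics. The toy one-hidden-layer ReLU network, its sizes, and the plain Gaussian initialization below are our own illustrative choices, not the paper's exact setup:

```python
import numpy as np

def init_net(rng, d_in=8, width=32):
    """Toy one-hidden-layer ReLU net with a scalar output."""
    return {
        "W1": rng.standard_normal((width, d_in)) / np.sqrt(d_in),
        "W2": rng.standard_normal((1, width)) / np.sqrt(width),
    }

def jacobian(params, X):
    """Per-sample gradient of the scalar output w.r.t. all weights.

    out(x) = W2 @ relu(W1 @ x), so
      d out / d W2_j    = relu(pre)_j
      d out / d W1_{jk} = W2_j * 1[pre_j > 0] * x_k
    """
    pre = X @ params["W1"].T                 # (n, width) pre-activations
    h = np.maximum(pre, 0.0)                 # hidden ReLU activations
    mask = (pre > 0).astype(X.dtype)         # ReLU derivative per unit
    g2 = h                                   # (n, width) grads w.r.t. W2
    g1 = (mask * params["W2"])[:, :, None] * X[:, None, :]  # (n, width, d_in)
    return np.concatenate([g1.reshape(len(X), -1), g2], axis=1)

def ntk_condition_number(params, X):
    """Condition number of the empirical NTK, Theta = J J^T."""
    J = jacobian(params, X)
    theta = J @ J.T                          # (n, n) empirical NTK
    eig = np.linalg.eigvalsh(theta)          # ascending eigenvalues
    return eig[-1] / eig[0]                  # lambda_max / lambda_min

rng = np.random.default_rng(0)
params = init_net(rng)
X = rng.standard_normal((16, 8))
kappa = ntk_condition_number(params, X)      # smaller => better conditioned
```

A smaller condition number means the NTK eigenvalues are more uniform, which (per the trainability argument above) suggests gradient descent optimizes all directions at comparable speed.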
We leverage two indicators, the condition number of the NTK and the number of linear regions, which decouple and effectively characterize the trainability and expressivity of architectures, respectively, in complex NAS search spaces. Most importantly, these two indicators can be measured in a training-free and label-free manner, which largely accelerates the NAS search process and benefits the understanding of discovered architectures. To the best of our knowledge, TE-NAS makes the first attempt to bridge theoretical findings of deep neural networks and real-world NAS applications. While we do not claim that the two indicators we use are the only or the best options, we hope our work opens a door to theoretically inspired NAS and inspires the discovery of more deep network indicators. Our contributions are summarized below:

• We identify and investigate two training-free and label-free indicators to rank the quality of deep architectures: the spectrum of their NTKs, and the number of linear regions in their input space. Our study finds that they reliably indicate the trainability and expressivity of a deep network, respectively, and are strongly correlated with the network's test accuracy.

• We leverage the above two theoretically inspired indicators to establish a training-free NAS framework, TE-NAS, thereby eliminating a drastic portion of the search cost. We further introduce a pruning-based mechanism to boost search efficiency and to trade off more flexibly between trainability and expressivity.

• In the NAS-Bench-201 and DARTS search spaces, TE-NAS discovers architectures with strong performance at remarkably lower search costs than previous efforts. With just one 1080Ti, it costs only 0.5 GPU hours to search on CIFAR-10 and 4 GPU hours on ImageNet, setting a new record for ultra-efficient yet high-quality NAS.
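The second indicator also admits a simple training-free sketch: a ReLU network is piecewise linear, and two inputs that activate different sets of ReLU units lie in different linear regions, so counting distinct joint activation (sign) patterns over a batch of sampled inputs lower-bounds the number of linear regions. The layer sizes and sample counts below are hypothetical, and this is a simplified illustration of the counting idea rather than the paper's exact estimator:

```python
import numpy as np

def count_activation_patterns(weights, X):
    """Lower-bound the number of linear regions of a ReLU MLP by counting
    distinct joint ReLU activation patterns induced by the inputs X."""
    patterns = []
    h = X
    for W in weights:
        pre = h @ W.T
        patterns.append(pre > 0)            # which units fire at this layer
        h = np.maximum(pre, 0.0)
    joint = np.concatenate(patterns, axis=1)
    # each distinct row of signs witnesses a distinct linear region
    return len({row.tobytes() for row in joint})

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 4)),    # layer 1: 4 -> 16
           rng.standard_normal((16, 16))]   # layer 2: 16 -> 16
X = rng.standard_normal((500, 4))           # 500 probe inputs
n_regions = count_activation_patterns(weights, X)
```

More expressive architectures carve the input space into more regions, so a higher count (relative to the same probe set) ranks an architecture as more expressive, without ever touching labels or gradients.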

2. RELATED WORKS

Neural architecture search (NAS) was recently proposed to accelerate the principled and automated discovery of high-performance networks. However, most works suffer from heavy search cost, for both weight-sharing methods (Liu et al., 2018b; Dong & Yang, 2019; Liu et al., 2019; Yu et al., 2020a; Li et al., 2020a; Yang et al., 2020a) and single-path sampling-based methods (Pham et al., 2018; Guo et al., 2019; Real et al., 2019; Tan et al., 2020; Li et al., 2020c; Yang et al., 2020b). A one-shot super-network can share its parameters with sampled sub-networks and thus accelerate architecture evaluation, but it is very heavy and hard to optimize, and suffers from a poor correlation between its accuracy and those of the sub-networks (Yu et al., 2020c). Sampling-based methods achieve more accurate architecture evaluations, but their truncated training still biases the performance ranking, since it is based on the results of early training stages. Instead of estimating architecture performance by direct training, another line of work predicts a network's accuracy (or ranking), known as predictor-based NAS methods (Liu et al., 2018a; Luo et al., 2018; Dai et al., 2019; Luo et al., 2020). Graph neural networks (GNNs) are a popular choice for the predictor model (Wen et al., 2019; Chen et al., 2020b). Siems et al. (2020) even propose the first large-scale

