ZERO-COST PROXIES FOR LIGHTWEIGHT NAS

Abstract

Neural Architecture Search (NAS) is quickly becoming the standard methodology for designing neural network models. However, NAS is typically compute-intensive because multiple models must be evaluated before choosing the best one. To reduce the computational power and time needed, a proxy task is often used to evaluate each model instead of full training. In this paper, we evaluate conventional reduced-training proxies and quantify how well they preserve the ranking between neural network models during search, compared with the rankings produced by final trained accuracy. We propose a series of zero-cost proxies, based on recent pruning literature, that use just a single minibatch of training data to compute a model's score. Our zero-cost proxies use 3 orders of magnitude less computation but can match and even outperform conventional proxies. For example, Spearman's rank correlation coefficient between final validation accuracy and our best zero-cost proxy on NAS-Bench-201 is 0.82, compared to 0.61 for EcoNAS (a recently proposed reduced-training proxy). Finally, we use these zero-cost proxies to enhance existing NAS search algorithms such as random search, reinforcement learning, evolutionary search and predictor-based search. For all search methodologies and across three different NAS datasets, we significantly improve sample efficiency, and thereby decrease computation, by using our zero-cost proxies. For example, on NAS-Bench-101 we achieve the same accuracy 4× faster than the best previous result.

1. INTRODUCTION

Instead of manually designing neural networks, neural architecture search (NAS) algorithms are used to automatically discover the best ones (Tan & Le, 2019a; Liu et al., 2019; Bender et al., 2018). Early work by Zoph & Le (2017) proposed a reinforcement learning (RL) controller that constructs candidate architectures; each candidate is evaluated, and its performance is fed back to the controller. One major problem with this basic NAS methodology is that each evaluation is very costly, typically on the order of hours or days to fully train a single neural network. We focus on this evaluation phase: we propose proxies that require only a single minibatch of data and a single forward/backward propagation pass to score a neural network. This is inspired by recent pruning-at-initialization work by Lee et al. (2019), Wang et al. (2020) and Tanaka et al. (2020), wherein a per-parameter saliency metric is computed before training to inform parameter pruning. Can we use such saliency metrics to score an entire neural network? Furthermore, can we use these "single minibatch" metrics to rank and compare multiple neural networks for use within NAS? If so, how do we best integrate these metrics within existing NAS algorithms such as RL or evolutionary search? These are the questions we tackle empirically in this work, with the goal of making NAS less compute-hungry. Our contributions are:
• Zero-cost proxies: We adapt pruning-at-initialization metrics for use with NAS. This requires the metrics to operate at the granularity of an entire network rather than individual parameters, so we devise and validate approaches that aggregate parameter-level metrics in a manner suitable for ranking candidates during NAS search.
• Comparison to conventional proxies: We perform a detailed comparison between zero-cost and conventional NAS proxies that use a form of reduced-computation training.
First, we quantify the rank consistency of conventional proxies on large-scale datasets: 15k models vs. the 50 models used by Zhou et al. (2020). Second, we show that zero-cost proxies can match or exceed the rank consistency of conventional proxies.
• Ablations on NAS benchmarks: We perform ablations of our zero-cost proxies on five different NAS benchmarks (NAS-Bench-101/201/NLP/ASR and PyTorchCV) to both test the zero-cost metrics under different settings and expose the properties of successful metrics.
• Integration with NAS: Finally, we propose two ways to use zero-cost metrics effectively within NAS algorithms: random search, reinforcement learning, aging evolution and predictor-based search. For all algorithms and three NAS datasets we show significant speedups, up to 4× for NAS-Bench-101 compared to the current state-of-the-art.
In EcoNAS, the proxy is reduced-computation training, wherein one of the following four variables is reduced: (1) the number of epochs, (2) the number of training samples, (3) the input resolution, or (4) the model size (controlled through the number of channels after the first convolution). Even though such proxies were used in many prior works, EcoNAS is the first systematic study of conventional proxy tasks that we found. One main finding by Zhou et al. (2020) is that using approximately 1/4 of the model size and input resolution, all training samples, and 1/10 of the number of epochs was a reasonable proxy which yielded the best results in their experiments.
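To make the first contribution above concrete, the following is a minimal pure-Python sketch of aggregating parameter-level pruning saliencies into a single network-level score. It assumes a SNIP-style |gradient × weight| saliency (Lee et al., 2019); the function names and the toy numbers are ours, for illustration only.

```python
def layer_saliency(weights, grads):
    """Per-parameter SNIP-style saliency |g * w| for one parameter tensor,
    computed from one forward/backward pass on a single minibatch."""
    return [abs(w * g) for w, g in zip(weights, grads)]

def network_score(layers):
    """Aggregate parameter-level saliencies into one network-level score by
    summing over all parameters of all layers; `layers` is a list of
    (weights, grads) pairs, one per parameter tensor."""
    return sum(sum(layer_saliency(w, g)) for w, g in layers)

# Toy two-layer "network"; the gradient values here are made up and would
# normally come from a single backward pass of the real candidate network.
layers = [
    ([1.0, -2.0, 0.5, 0.0], [0.1, 0.2, 0.0, 0.4]),  # flattened conv weights
    ([3.0, -1.0], [0.5, 0.5]),                       # linear weights
]
score = network_score(layers)  # one scalar per candidate, used for ranking
```

Summation is only one possible aggregation; the point is that a per-parameter criterion collapses to one scalar per candidate so that architectures can be ranked without training.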

3.2. ZERO-COST NAS PROXIES

We present alternative proxies for network accuracy that can be used to speed up NAS. A simple proxy that we use is grad_norm, in which we sum the Euclidean norm of the gradients of all parameters after a single forward/backward pass on one minibatch of training data.
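A minimal sketch of the grad_norm proxy in plain Python: the gradients here are worked out by hand for a tiny linear model, whereas in practice they would come from one backward pass of the candidate network on a single minibatch. All names and values are ours, for illustration.

```python
def grad_norm_score(grad_tensors):
    """grad_norm proxy: sum of the Euclidean (L2) norms of the gradient of
    each parameter tensor, taken after a single forward/backward pass."""
    return sum(sum(g * g for g in t) ** 0.5 for t in grad_tensors)

# Gradients of a 1-layer linear model y = w1*x1 + w2*x2 under squared loss,
# computed by hand on a two-sample minibatch with w = 0 at initialization:
#   dL/dw_j = mean over batch of (y_pred - y) * x_j
X = [(1.0, 2.0), (3.0, 4.0)]
Y = [1.0, 0.0]
residuals = [0.0 - y for y in Y]  # y_pred = 0 everywhere at initialization
grad = [sum(r * x[j] for r, x in zip(residuals, X)) / len(X) for j in range(2)]

score = grad_norm_score([grad])  # higher scores are taken as more promising
```

The proxy costs one minibatch of compute per candidate, which is what makes it "zero-cost" relative to even heavily reduced training schedules.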



2. RELATED WORK

Efficiency. To decrease NAS search time, various techniques have been used in the literature. Pham et al. (2018) and Cai et al. (2018) use weight sharing between candidate models to decrease training time during evaluation. Liu et al. (2019) and others use smaller datasets (CIFAR-10) as a proxy for the full task (ImageNet1k). In EcoNAS, Zhou et al. (2020) extensively investigated reduced-training proxies wherein the input size, model size, number of training samples and number of epochs are reduced in the NAS evaluation phase. We compare against EcoNAS in this work to elucidate how well our zero-cost proxies perform relative to familiar and widely-used conventional proxies.

Pruning. The goal of pruning is to reduce the number of parameters in a neural network. One way to do this is to compute a saliency (importance) metric for each parameter and remove the less-important parameters. For example, Han et al. (2015), Frankle & Carbin (2019) and others use parameter magnitudes as the criterion, while LeCun et al. (1990), Hassibi & Stork (1993) and Molchanov et al. (2017) use gradients. However, the aforementioned works require training before computing the saliency criterion. A new class of pruning-at-initialization algorithms, which require no training, was introduced by Lee et al. (2019) and extended by Wang et al. (2020) and Tanaka et al. (2020): a single forward/backward propagation pass is used to compute a saliency criterion that can heavily prune neural networks before training. We extend these pruning-at-initialization criteria towards scoring entire neural networks and investigate their use with NAS algorithms.

Intersection between pruning and NAS. Concepts from pruning have been used within NAS multiple times. For example, Mei et al. (2020) use channel pruning in their AtomNAS work to arrive at customized multi-kernel-size convolutions (mixconvs, as introduced by Tan & Le (2019b)). In their Blockswap work, Turner et al. (2020) use Fisher information at initialization to score different lightweight primitives that are substituted into a neural network to decrease computation. This is the earliest work we could find that attempts to perform a type of NAS by scoring neural networks without training using a pruning criterion. More recently, Mellor et al. (2020) introduced a new metric for scoring neural networks at initialization based on the correlation of Jacobians for different inputs. They perform "NAS without training" via random search, using their zero-cost metric (jacob_cov) to rank neural networks instead of accuracy. We include jacob_cov in our analysis and introduce five more zero-cost metrics in this work.

3. PROXIES FOR NEURAL NETWORK ACCURACY

3.1. CONVENTIONAL NAS PROXIES (ECONAS)

In conventional sample-based NAS, a proxy training regime is often used to predict a model's accuracy instead of full training. Zhou et al. (2020) investigate conventional proxies in depth by computing the Spearman rank correlation coefficient (Spearman ρ) between a proxy task and final test accuracy.
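Spearman ρ, the rank-consistency measure used above, is just the Pearson correlation of the ranks. A pure-Python sketch follows; it assumes no tied values, so the classic closed form 1 − 6·Σd²/(n(n²−1)) applies, and the proxy scores and accuracies below are illustrative, not taken from any experiment.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two score lists (no ties assumed)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # squared rank differences
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical zero-cost proxy scores vs. final test accuracies for 5 models:
proxy = [0.2, 0.9, 0.4, 0.7, 0.1]
acc = [55.0, 94.0, 71.0, 68.0, 52.0]
rho = spearman_rho(proxy, acc)  # 1.0 would mean the proxy ranks perfectly
```

Because only ranks matter, a proxy can be useful for NAS even if its absolute values bear no resemblance to accuracy.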

Availability: https://github.com/mohsaied

