ZICO: ZERO-SHOT NAS VIA INVERSE COEFFICIENT OF VARIATION ON GRADIENTS

Abstract

Neural Architecture Search (NAS) is widely used to automatically obtain the neural network with the best performance among a large number of candidate architectures. To reduce the search time, zero-shot NAS aims at designing training-free proxies that can predict the test performance of a given architecture. However, as shown recently, none of the zero-shot proxies proposed to date can actually work consistently better than a naive proxy, namely, the number of network parameters (#Params). To improve this state of affairs, as the main theoretical contribution, we first reveal how some specific gradient properties across different samples impact the convergence rate and generalization capacity of neural networks. Based on this theoretical analysis, we propose a new zero-shot proxy, ZiCo, the first proxy that works consistently better than #Params. We demonstrate that ZiCo works better than State-Of-The-Art (SOTA) proxies on several popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS, TransNASBench-101) for multiple applications (e.g., image classification/reconstruction and pixel-level prediction). Finally, we demonstrate that the optimal architectures found via ZiCo are as competitive as the ones found by one-shot and multi-shot NAS methods, but with much less search time. For example, ZiCo-based NAS can find optimal architectures with 78.1%, 79.4%, and 80.4% test accuracy under inference budgets of 450M, 600M, and 1000M FLOPs, respectively, on ImageNet within 0.4 GPU days.

1. INTRODUCTION

During the last decade, deep learning has achieved great success in many areas, such as computer vision and natural language modeling Krizhevsky et al. (2012); Liu & Deng (2015); Huang et al. (2017); He et al. (2016); Dosovitskiy et al. (2021); Brown et al. (2020); Vaswani et al. (2017). In recent years, neural architecture search (NAS) has been proposed to search for optimal architectures, while reducing the trial-and-error (manual) network design effort Baker et al. (2017); Zoph & Le (2017); Elsken et al. (2019). Moreover, the neural architectures found via NAS show better performance than manually-designed networks in many mainstream applications Real et al. (2017); Gong et al. (2019); Xie et al. (2019); Wu et al. (2019); Wan et al. (2020); Li & Talwalkar (2020); Kandasamy et al. (2018); Yu et al. (2020b); Liu et al. (2018b); Cai et al. (2018); Zhang et al. (2019a); Zhou et al. (2019); Howard et al. (2019); Li et al. (2021b).

Despite these advantages, many existing NAS approaches involve a time-consuming and resource-intensive search process. For example, multi-shot NAS uses a controller or an accuracy predictor to conduct the search process and requires the training of multiple networks; thus, multi-shot NAS is extremely time-consuming Real et al. (2019); Chiang et al. (2019). Alternatively, one-shot NAS merges all possible networks from the search space into a supernet and thus only needs to train the supernet once Dong & Yang (2019); Zela et al. (2020); Chen et al. (2019); Cai et al. (2019); Stamoulis et al. (2019); Chu et al. (2021); Guo et al. (2020); Li et al. (2020); this enables one-shot NAS to find a good architecture with much less search time. Though one-shot NAS has significantly improved the time efficiency of NAS, training is still required during the search process. Recently, zero-shot approaches have been proposed to liberate NAS from training entirely Wu et al. (2021); Zhou et al. (2022; 2020); Ingolfsson et al. (2022); Tran & Bae (2021); Do & Luong (2021); Tran et al. (2021); Shu et al. (2022b); Li et al. (2022).

Essentially, zero-shot NAS utilizes some proxy that can predict the test performance of a given network without training. The design of such proxies is usually based on some theoretical analysis of deep networks. For instance, the first zero-shot proxy, called NN-Mass, was proposed by Bhardwaj et al. (2019); NN-Mass theoretically links the network topology to gradient propagation and model performance. Hence, zero-shot approaches can not only significantly improve the time efficiency of NAS, but also deepen the theoretical understanding of why certain networks work well. While NN-Mass consistently outperforms #Params, it is not defined for generic NAS topologies and works mostly for simple repeating blocks like ResNets/MobileNets/DenseNets Bhardwaj et al. (2019). Later, several zero-shot proxies were proposed for general neural architectures. Nonetheless, as revealed in Ning et al. (2021); White et al. (2022), the general zero-shot proxies proposed to date cannot consistently work better than a naive proxy, namely, the number of parameters (#Params). These results may undermine the effectiveness of zero-shot NAS approaches.

To address the limitations of existing zero-shot proxies, we target the following key questions:

1. How do some specific gradient properties, i.e., the mean value and standard deviation across different samples, impact the training convergence of neural networks?
2. Can we use these two gradient properties to design a new theoretically-grounded proxy that works better than #Params consistently across many different NAS topologies/tasks?

To this end, we first analyze how the mean value and standard deviation of gradients across different training batches impact the training convergence of neural networks. Based on our analysis, we propose ZiCo, a new proxy for zero-shot NAS. We demonstrate that, compared to all existing proxies (including #Params), ZiCo has either a higher or at least an on-par correlation with the test accuracy on popular NAS-Benchmarks (NASBench101, NATS-Bench-SSS/TSS) for multiple datasets (CIFAR10/100, ImageNet16-120). Finally, we demonstrate that ZiCo enables a zero-shot NAS framework that can efficiently find the network architectures with the highest test accuracy compared to other zero-shot baselines. In fact, our zero-shot NAS framework achieves competitive FLOPs-accuracy tradeoffs compared to multiple one-shot and multi-shot NAS methods, but with much lower time costs. To summarize, we make the following major contributions:

• We theoretically reveal how the mean value and variance of gradients across multiple samples impact the training convergence and generalization capacity of neural networks.
• We propose a new zero-shot proxy, ZiCo, that works better than existing proxies on popular NAS-Benchmarks (NASBench101, NATS-Bench-SSS/TSS, TransNASBench-101) for multiple applications (image classification/reconstruction and pixel-level prediction).
• We demonstrate that our proposed zero-shot NAS achieves competitive test accuracy with representative one-shot and multi-shot NAS methods with much less search time.

The rest of the paper is organized as follows. We discuss related work in Section 2. In Section 3, we introduce our theoretical analysis. We introduce our proposed zero-shot proxy (ZiCo) and the NAS framework in Section 3.4. Section 4 validates our analysis and presents our results with the proposed zero-shot NAS. We conclude the paper in Section 5 with remarks on our main contribution.
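As a rough illustration of the gradient statistics that motivate our proxy (the exact ZiCo formulation is given in Section 3.4), the following numpy sketch scores a network by summing, over layers, the log of the inverse coefficient of variation of its gradients, i.e., the mean absolute gradient across samples divided by its standard deviation. The function name and the synthetic gradients are hypothetical, for illustration only:

```python
import numpy as np

def inverse_cv_score(layer_grads, eps=1e-8):
    """Toy gradient-statistics score (illustrative, not the paper's exact formula):
    for each layer, compute the per-parameter mean absolute gradient and its
    standard deviation across samples, then sum log(mean/std) over layers.
    A higher score means gradients are both large and consistent across samples.

    layer_grads: list of arrays of shape (num_samples, num_params_in_layer).
    """
    score = 0.0
    for g in layer_grads:
        mean_abs = np.mean(np.abs(g), axis=0)   # per-parameter mean |grad| across samples
        std = np.std(g, axis=0)                 # per-parameter std across samples
        score += np.log(np.sum(mean_abs / (std + eps)) + eps)
    return score

rng = np.random.default_rng(0)
# Network A: gradients are consistent across samples (low variance).
consistent = [rng.normal(1.0, 0.1, size=(8, 16)) for _ in range(3)]
# Network B: same mean magnitude, but noisy gradients (high variance).
noisy = [rng.normal(1.0, 2.0, size=(8, 16)) for _ in range(3)]
assert inverse_cv_score(consistent) > inverse_cv_score(noisy)
```

Under this toy score, the network whose gradients agree across samples ranks higher than the one with equally large but noisy gradients, which is the intuition behind the convergence analysis in Section 3.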

2. RELATED WORK

The key idea of zero-shot NAS is to rank the accuracy of various candidate network architectures without training, such that we can replace the expensive training process in NAS with some computation-efficient proxies Xiang et al. (2021a); Javaheripi et al. (2022); Li et al. (2021a). Hence, the quality of the proxy determines the effectiveness of zero-shot NAS. Several works use the number of linear regions to approximately measure the expressivity of a deep neural network Mellor et al. (2021); Chen et al. (2021b); Bhardwaj et al. (2022). Alternatively, most existing proxies are derived from the gradients of deep networks. For example, Synflow, SNIP, and GraSP rely on the gradient w.r.t. the parameters of neural networks; they have been shown to be different approximations of the Taylor expansion of deep neural networks Abdelfattah et al. (2021); Lee et al. (2019b); Tanaka et al. (2020); Wang et al. (2020). Moreover, the Zen-score approximates the gradient w.r.t. feature maps and measures the complexity of neural networks Lin et al. (2021); Sun et al. (2021). Furthermore, Jacob_cov leverages the Jacobian matrix between the loss and multiple input samples to quantify the capacity of neural networks.
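To make the parameter-gradient family of proxies concrete, here is a minimal numpy sketch of a Synflow-style saliency for a stack of purely linear layers. This is a hypothetical simplification for illustration: Synflow feeds an all-ones input through the network with absolute-valued weights, sums the outputs into a scalar R, and scores each parameter by |θ · ∂R/∂θ|; for a linear stack these gradients have a closed form, used below:

```python
import numpy as np

def synflow_score(weights):
    """Synflow-style saliency for a stack of linear layers (toy sketch).

    With absolute-valued weights A_i = |W_i| and an all-ones input,
    R = 1^T A_{n-1} ... A_0 1 and dR/dA_i = outer(right_i, left_i), where
    left_i / right_i are the forward / backward products around layer i.
    The proxy is the total saliency sum_i sum(|W_i| * dR/dA_i).
    """
    abs_w = [np.abs(w) for w in weights]   # each W_i has shape (out_i, in_i)
    n = len(abs_w)
    score = 0.0
    for i, a in enumerate(abs_w):
        # Forward product: all-ones input pushed up to layer i.
        left = np.ones(abs_w[0].shape[1])
        for j in range(i):
            left = abs_w[j] @ left
        # Backward product: summed output pulled down to layer i.
        right = np.ones(abs_w[-1].shape[0])
        for j in range(n - 1, i, -1):
            right = abs_w[j].T @ right
        grad = np.outer(right, left)       # dR/dA_i, shape (out_i, in_i)
        score += np.sum(a * grad)          # |theta| * dR/dtheta, all non-negative
    return score

rng = np.random.default_rng(1)
toy_net = [rng.normal(size=(4, 8)), rng.normal(size=(2, 4))]
score = synflow_score(toy_net)
assert score > 0
```

Since R is linear in each layer's absolute weights, each layer contributes exactly R to the total saliency (Euler's homogeneous function theorem), so the sketch can be sanity-checked against n · R.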

Availability

Our code is available at https://github.com/SLDGroup

