SWEET GRADIENT MATTERS: DESIGNING CONSISTENT AND EFFICIENT ESTIMATOR FOR ZERO-SHOT NEURAL ARCHITECTURE SEARCH

Abstract

Neural architecture search (NAS) is one of the core technologies of AutoML for designing high-performance networks. Recently, Zero-Shot NAS has attracted growing interest due to its training-free property and extremely fast search speed. However, existing Zero-Shot estimators commonly suffer from low consistency, which limits their reliability and applicability. In this paper, we observe that the Sweet Gradient of parameters, i.e., absolute gradient values lying within a certain interval, yields higher consistency with network performance than the overall number of parameters. We further demonstrate a positive correlation between network depth and the proportion of parameters with sweet gradients in each layer. Based on this analysis, we propose a training-free method to find the Sweet Gradient interval and thereby obtain an estimator, named Sweetimator. Experiments show that Sweetimator achieves superior consistency over existing Zero-Shot estimators on four benchmarks with eight search spaces. Moreover, Sweetimator outperforms state-of-the-art Zero-Shot estimators on NAS-Bench-201 and achieves competitive performance with a 2.5x speedup in the DARTS search space.

1. INTRODUCTION

The computer vision field has witnessed the great success of deep learning. Iconic works such as ResNet (He et al., 2016), MobileNet (Howard et al., 2017; Sandler et al., 2018), and EfficientNet (Tan & Le, 2019) are widely applied to a variety of real-world tasks such as object detection and semantic segmentation. To overcome the trial-and-error nature of handcrafted architecture design, Neural Architecture Search (NAS) (Elsken et al., 2019) has been proposed to automatically search for powerful networks that can even outperform manual designs (Zoph et al., 2018).

A major theme in NAS development is efficiency. From this perspective, NAS methods can be broadly classified into three categories: All-Shot, One-Shot, and Zero-Shot NAS. All-Shot NAS relies on approaches such as reinforcement learning (Zoph & Le, 2017) or evolutionary algorithms (Real et al., 2019) to train the sampled architectures one by one during the search, which costs hundreds or even thousands of GPU days. Based on weight sharing (Pham et al., 2018), One-Shot NAS trains a single supernet and applies sampling-based (Guo et al., 2020; Chu et al., 2021b; Yu et al., 2020) or gradient-based (Liu et al., 2019; Chen et al., 2019; Xu et al., 2020) approaches, reducing the search cost to a few GPU days. Zero-Shot NAS leverages training-free estimators (Mellor et al., 2021; Abdelfattah et al., 2021) to evaluate network performance; since no networks are trained, the search time drops to a few GPU hours or even seconds.

However, Zero-Shot NAS commonly suffers from low consistency. Figure 1 reports the Spearman rank correlation between the test accuracy obtained by training each network from scratch and the performance scores estimated by mainstream Zero-Shot methods on NAS-Bench-101 (Ying et al., 2019), NAS-Bench-201 (Dong & Yang, 2020; Dong et al., 2022), and NAS-Bench-301 (Zela et al., 2022). The results demonstrate that these methods do not consistently outperform the simple metric of the number of parameters, which limits their reliability and applicability. A question therefore naturally arises: can we find a Zero-Shot estimator with consistency superior to the number of parameters? For the networks in the NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301 spaces, we observe that certain parameters, whose absolute gradient values fall within a specific interval, have a stronger correlation with network performance than the overall number of parameters.
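To make the two quantities in this discussion concrete, the following is a minimal sketch (in PyTorch, assuming a single labeled minibatch and a hypothetical interval (a, b); it is not the authors' released implementation) of a sweet-gradient-style score, i.e., the proportion of parameters whose absolute gradient falls inside an interval, and of the Spearman rank correlation used here as the consistency measure.

```python
# Illustrative sketch only: the interval endpoints (a, b) and the single-minibatch
# protocol are assumptions, not the paper's prescribed procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.stats import spearmanr


def sweet_gradient_score(model: nn.Module, inputs: torch.Tensor,
                         targets: torch.Tensor, a: float, b: float) -> float:
    """Fraction of parameters whose |gradient| falls in the interval (a, b)."""
    model.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss.backward()

    in_interval, total = 0, 0
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.abs()
        in_interval += ((g > a) & (g < b)).sum().item()
        total += g.numel()
    return in_interval / max(total, 1)


def consistency(estimator_scores, test_accuracies) -> float:
    """Spearman rank correlation between estimator scores and ground-truth accuracies."""
    rho, _ = spearmanr(estimator_scores, test_accuracies)
    return rho
```

In use, such a score would be computed for every candidate architecture at initialization, and the resulting scores would be correlated with the benchmark's ground-truth test accuracies to assess the estimator's consistency.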

