SWEET GRADIENT MATTERS: DESIGNING CONSISTENT AND EFFICIENT ESTIMATOR FOR ZERO-SHOT NEURAL ARCHITECTURE SEARCH

Abstract

Neural architecture search (NAS) is one of the core technologies of AutoML for designing high-performance networks. Recently, Zero-Shot NAS has gained growing interest due to its training-free property and super-fast search speed. However, existing Zero-Shot estimators commonly suffer from low consistency, which limits their reliability and applicability. In this paper, we observe that Sweet Gradient of parameters, i.e., the absolute gradient values within a certain interval, brings higher consistency in network performance compared to the overall number of parameters. We further demonstrate a positive correlation between the network depth and the proportion of parameters with sweet gradients in each layer. Based on the analysis, we propose a training-free method to find the Sweet Gradient interval and obtain an estimator, named Sweetimator. Experiments show that Sweetimator has superior consistency compared to existing Zero-Shot estimators in four benchmarks with eight search spaces. Moreover, Sweetimator outperforms state-of-the-art Zero-Shot estimators in NAS-Bench-201 and achieves competitive performance with 2.5x speedup in the DARTS search space.

1. INTRODUCTION

The computer vision field has witnessed the great success of deep learning. Iconic works such as ResNet (He et al., 2016), MobileNet (Howard et al., 2017; Sandler et al., 2018), and EfficientNet (Tan & Le, 2019) are widely applied to a variety of real-world tasks such as object detection and semantic segmentation. To overcome the trial-and-error shortcomings of handcrafted architectures, Neural Architecture Search (NAS) (Elsken et al., 2019) has been proposed to automatically search for powerful networks that can even outperform manual designs (Zoph et al., 2018). A major theme in NAS development is efficiency. From this perspective, NAS can be broadly classified into three categories: All-Shot, One-Shot, and Zero-Shot NAS. All-Shot NAS uses approaches such as reinforcement learning (Zoph & Le, 2017) or evolutionary algorithms (Real et al., 2019) to train the sampled architectures one by one during the search process, which costs hundreds or even thousands of GPU days. Based on weight sharing (Pham et al., 2018), One-Shot NAS trains one supernet and applies sampling-based (Guo et al., 2020; Chu et al., 2021b; Yu et al., 2020) or gradient-based (Liu et al., 2019; Chen et al., 2019; Xu et al., 2020) approaches, reducing the search cost to a few GPU days. Zero-Shot NAS leverages training-free estimators (Mellor et al., 2021; Abdelfattah et al., 2021) to evaluate network performance; as no networks are trained, the search time is reduced to a few GPU hours or even seconds. However, Zero-Shot NAS commonly suffers from low consistency. Figure 1 illustrates Spearman's rank correlation between the test accuracy obtained by training networks from scratch and the performance scores estimated by mainstream Zero-Shot methods on NAS-Bench-101 (Ying et al., 2019), NAS-Bench-201 (Dong & Yang, 2020; Dong et al., 2022), and NAS-Bench-301 (Zela et al., 2022).
The results demonstrate that these methods do not consistently outperform the simple metric of the number of parameters, which limits their reliability and applicability. A question is thus naturally raised: can we find a Zero-Shot estimator with consistency superior to parameters? For networks in the NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301 search spaces, we observe that some specific parameters, whose absolute gradient values fall in a certain interval, have a stronger consistency with network performance than the overall number of parameters (Parameters for short). For brevity, we name the gradient in such an interval the Sweet Gradient. We also find an interesting property of the Sweet Gradient: the proportion of parameters with Sweet Gradient in each layer is positively correlated with the depth of the network. Based on this property, we propose Sweetimator, an estimator that computes the Sweet Gradient interval without training. Figure 1 shows that Sweetimator outperforms the Parameters estimator and achieves the best consistency in all three benchmarks. The contributions of this work are:

• We observe the Sweet Gradient phenomenon, i.e., the number of parameters with absolute gradient values in a certain interval has better performance consistency than Parameters.
• We demonstrate a positive correlation between the network depth and the proportion of parameters with Sweet Gradient in each layer.
• We propose a simple and effective Zero-Shot estimator, Sweetimator, that finds Sweet Gradient intervals without training.
• In consistency experiments, Sweetimator outperforms existing Zero-Shot estimators in four benchmarks with eight search spaces. In search experiments, Sweetimator achieves superior performance to state-of-the-art Zero-Shot estimators in NAS-Bench-201 and competitive results with a 2.5x speedup in the DARTS search space.

2. RELATED WORK

Neural Architecture Search. Neural architecture search aims at automatically designing the best-performing network for a specific task. Early on, Zoph & Le (2017) proposed a reinforcement learning framework to search the hyper-parameters of an entire network. Inspired by the modular design paradigm of handcrafted neural networks, NASNet (Zoph et al., 2018) searched cell structures and stacked the best searched normal and reduction cells to form a network. Subsequently, Pham et al. (2018) proposed a weight-sharing strategy that reduces the search overhead to a few GPU days. Afterward, sampling-based approaches (Guo et al., 2020; Chu et al., 2021b; Yu et al., 2020) trained the supernet by path sampling and used sub-network accuracy for evaluation. DARTS (Liu et al., 2019) and its variants (Chen et al., 2019; Xu et al., 2020; Zela et al., 2020; Chu et al., 2021a; Wang et al., 2021; Sun et al., 2022) leveraged differentiable strategies to optimize the supernet and select the final architecture.

Non-Zero-Shot Estimators. To facilitate the performance evaluation process, various estimators have been proposed. It is natural to use the validation loss or accuracy (Zoph & Le, 2017; Real et al., 2019; Liu et al., 2018) as a performance estimator. Subsequently, SPOS (Guo et al., 2020) and similar works (Pham et al., 2018; Yu et al., 2020; Chu et al., 2021b) used the accuracy of sub-networks as a proxy for efficient evaluation. Another line of work uses machine learning models to predict network performance. NAO (Luo et al., 2018) uses an encoder and an estimator to find high-performance networks. Wen et al. (2020) and Chen et al. (2021c) used graph convolutional networks to regress network performance. GP-NAS (Li et al., 2020) proposed a Gaussian-process-based NAS method to capture the correlation between architectures and performance. TNASP (Lu et al., 2021) uses a Transformer and a self-evolutionary framework to predict network performance.
Although the above works can effectively evaluate network performance, training networks still incurs a large search overhead.

Zero-Shot Estimators. Zero-Shot estimators search architectures without any training. NASWOT (Mellor et al., 2021) uses the activations of rectified linear units to evaluate networks at random initialization. Abdelfattah et al. (2021) propose a series of zero-cost proxies based on the pruning literature (Lee et al., 2019b; Wang et al., 2020; Tanaka et al., 2020). TE-NAS (Chen et al., 2021a) combines the neural tangent kernel (NTK) (Lee et al., 2019a) and linear regions (Raghu et al., 2017) to evaluate the trainability and expressivity of networks. Zen-NAS (Lin et al., 2021) proposes the Zen-Score to measure network performance based on network expressivity. KNAS (Xu et al., 2021) finds that the gradient kernel of an initialized network correlates well with training loss and validation performance. Zhang & Jia (2022) analyze the sample-wise optimization landscape and propose GradSign for performance evaluation. These Zero-Shot estimators enable remarkably efficient search. However, compared to the number of network parameters, they are not competitive in terms of ranking consistency, raising concerns about their applicability.

3.1. PRELIMINARIES

Zero-Shot NAS uses network information at initialization for scoring and expects strong rank consistency with final performance. In particular, gradient information is widely adopted in Zero-Shot estimators. For example, SNIP (Abdelfattah et al., 2021) uses gradient values to approximate the change of the loss, and GradSign (Zhang & Jia, 2022) measures the gradient conflict between data samples. Considering that the number of network parameters is a strong performance estimator, as shown in Figure 1, we combine the parameter count with the gradient. Suppose the loss function is $J$, the network parameters are $\theta \in \mathbb{R}^m$, and the corresponding average gradients over a mini-batch are $\nabla_\theta J \in \mathbb{R}^m$. The Zero-Shot estimator to be explored is

$$\mathrm{Score}(thr_1, thr_2) = \sum_{k=1}^{m} \mathbb{I}\left( thr_1 \le \left| \frac{\partial J(\theta^0)}{\partial \theta_k} \right| < thr_2 \right) \tag{1}$$

where $thr_1, thr_2 \ge 0$ are two thresholds, $\theta^0$ is the network's initialization parameter, and $\mathbb{I}$ is the indicator function, equal to 1 when the condition is true and 0 otherwise. Equation (1) counts the parameters whose absolute gradient values lie within a certain interval; $\mathrm{Score}(0, +\infty)$ recovers the overall number of network parameters. For brevity, we write $[thr_1, thr_2)$ for an interval. We provide a theoretical analysis of the proposed Zero-Shot estimator in Appendix A.
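To make Equation (1) concrete, a minimal sketch of the score computation follows; the helper name `interval_score` and the toy gradient values are our own illustration, not part of any released code.

```python
def interval_score(abs_grads, thr1, thr2):
    """Equation (1): count the parameters whose absolute gradient at
    initialization falls inside the half-open interval [thr1, thr2)."""
    return sum(1 for g in abs_grads if thr1 <= g < thr2)

# Toy absolute gradients for a hypothetical 5-parameter "network".
grads = [0.0, 3e-4, 0.02, 0.5, 7.0]

print(interval_score(grads, 1e-3, 1e-1))         # parameters inside the band -> 1
print(interval_score(grads, 0.0, float("inf")))  # Score(0, +inf) = parameter count -> 5
```

In practice `abs_grads` would be the flattened absolute gradients obtained from one mini-batch backward pass at initialization.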

3.2. SWEET GRADIENT

We then analyze the rank consistency of Equation (1) under different $[thr_1, thr_2)$ intervals. Figure 2 illustrates Spearman's rank between the test accuracy of one hundred networks and their scores from Equation (1) on NAS-Bench-101 and NAS-Bench-201. We observe the following patterns:

• The Spearman's rank in the upper-right corner of the heatmap can be regarded as the consistency of Parameters, since nearly all parameters are covered when $thr_1 = 0$ and $thr_2 = 5$.
• Scores with intervals $[thr_1, thr_2)$ in the dark blue area yield better rank consistency than Parameters. We name the gradient in such intervals the Sweet Gradient, and define a Sweet Gradient interval as an absolute-gradient interval in which the number of parameters has better performance consistency than Parameters.

Although Figure 2 only shows the Spearman's rank of one hundred architectures in two search spaces, the Sweet Gradient exists across different search spaces, datasets, numbers of architectures, batch sizes, and initializations (please refer to Appendix B for more details). This observation indicates that a proper threshold setting yields an estimator with better rank consistency than the overall number of parameters. However, thresholds derived from architecture accuracy still require network training. Does there exist a method to obtain the two thresholds without training? This motivates us to investigate the underlying reason for the Sweet Gradient phenomenon. Reviewing the development of neural networks, depth is one of the most critical factors affecting performance (Simonyan & Zisserman, 2015). Inspired by He et al. (2016) and Balduzzi et al. (2017), we dissect the parameters of different intervals from the perspective of depth.
To eliminate the effect of parameter magnitude, we analyze the parameter proportion:

$$\mathrm{Proportion}_l = \frac{\mathrm{Score}_l(thr_1, thr_2)}{\mathrm{Score}_l(0, +\infty)} \tag{2}$$

where $\mathrm{Proportion}_l \in [0, 1]$ is the proportion of parameters in the $l$-th layer whose absolute gradient lies in $[thr_1, thr_2)$. As Figure 3 shows, the parameter proportion of Sweet Gradient intervals ascends with cell depth, whereas that of Non-Sweet intervals does not; that is, there is a positive correlation between network depth and the per-layer parameter proportion of Sweet Gradients. We can thus seek the Sweet Gradient based on this property: the Sweet Gradient is more likely to be located in intervals whose parameter proportion increases with network depth. Besides network depth, we also investigated the gradient and activation distributions, but found no correlation between these two factors and the Sweet Gradient. Please refer to Appendix D for more details.
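The per-layer proportion of Equation (2) can be sketched as follows under the same toy setting; `layer_proportion` is a hypothetical helper name, and the layer data below is fabricated purely for illustration.

```python
def layer_proportion(layer_abs_grads, thr1, thr2):
    """Equation (2): Proportion_l = Score_l(thr1, thr2) / Score_l(0, +inf),
    i.e. the fraction of a layer's parameters whose absolute gradient lies
    in [thr1, thr2). The value always falls in [0, 1]."""
    in_band = sum(1 for g in layer_abs_grads if thr1 <= g < thr2)
    return in_band / len(layer_abs_grads)

# Three toy "layers" ordered by depth; the in-band fraction grows with depth,
# the signature of a Sweet Gradient candidate interval.
layers = [[0.5, 0.9, 0.02], [0.03, 0.7, 0.04], [0.05, 0.01, 0.02]]
print([layer_proportion(layer, 1e-3, 1e-1) for layer in layers])  # ascending
```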

3.3. SWEETIMATOR

Sweetimator is the score of Equation (1) when $[thr_1, thr_2)$ is a Sweet Gradient interval. The property discovered above answers the motivating question of how to obtain a Sweet Gradient interval without training, and thus the Sweetimator. Heuristically, we propose sweetness to estimate how sweet the gradient of an interval is:

$$\mathrm{Sweetness}(thr_1, thr_2) = \frac{1}{n} \sum_{k=1}^{n} \left( \frac{1}{l_k - 1} \sum_{i=1}^{l_k - 1} \mathrm{sign}\left( \mathrm{Proportion}_{i+1} - \mathrm{Proportion}_i \right) \right) \tag{3}$$

where $n$ is the number of architectures, $l_k$ denotes the depth of the $k$-th architecture, and the sign function determines whether the parameter proportion ascends as the depth increases. Equation (3) ranges over $[-1, 1]$ and indicates how incremental the interval is with depth: $\mathrm{Sweetness} = 1$ indicates a monotonically ascending proportion and $\mathrm{Sweetness} = -1$ a monotonically descending one. Although both Sweetness and GradSign use the sign function, their purposes are completely different: GradSign evaluates the gradient conflict between samples, while we evaluate the interval's incrementality for Sweetimator. In practice, one issue in obtaining the Sweet Gradient is the interval setting, which we describe by the triplet (max, min, split): max is the largest order of magnitude, min is the smallest non-zero order of magnitude, and split means each order of magnitude is divided equally into that many parts. For example, (1, 0.1, 2) yields the three intervals [0, 0.1), [0.1, 0.5), [0.5, 1.0). We empirically find that max = 10 and min = 1e-10 are sufficient; an ablation study of split is provided in Section 4.4. Algorithm 1 describes how to obtain the Sweetimator.
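The text only illustrates the (max, min, split) triplet by a single example, so the generator below is one plausible reading, not the authors' exact scheme: each order of magnitude up to max is cut into split equal parts, keeping boundaries above the previous magnitude, with an extra [0, min) bucket. The helper name `build_intervals` is ours.

```python
def build_intervals(max_mag, min_mag, split):
    """One plausible expansion of the (max, min, split) triplet, chosen to
    reproduce the paper's example: (1, 0.1, 2) -> [0, 0.1), [0.1, 0.5), [0.5, 1.0)."""
    bounds = [0.0, float(min_mag)]
    top = min_mag * 10
    while top <= max_mag * (1 + 1e-9):   # walk the decades up to max_mag
        for j in range(1, split + 1):
            b = j * top / split          # equal subdivisions of this decade
            if b > bounds[-1]:
                bounds.append(b)
        top *= 10
    return list(zip(bounds[:-1], bounds[1:]))

print(build_intervals(1, 0.1, 2))  # -> [(0.0, 0.1), (0.1, 0.5), (0.5, 1.0)]
```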

Algorithm 1 Sweetimator

Input: m initialized architectures; p intervals based on (max, min, split)
Output: best interval for Sweetimator
 1: maxSweetness ← −1
 2: bestInterval ← None
 3: for i ← 1 to p do
 4:     Sweetness_i ← Equation (3)
 5:     if Sweetness_i > maxSweetness then
 6:         maxSweetness ← Sweetness_i
 7:         bestInterval ← Interval_i
 8:     end if
 9: end for
10: return bestInterval
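A compact sketch of Sweetness (Equation (3)) and Algorithm 1, assuming the per-layer proportions have already been computed for each candidate architecture; the function names and toy data are our own.

```python
def sign(x):
    return (x > 0) - (x < 0)

def sweetness(per_arch_props):
    """Equation (3): for each architecture, average the sign of successive
    depth-wise differences of Proportion_l, then average over the n
    architectures. Result lies in [-1, 1]; +1 means monotonically ascending."""
    total = 0.0
    for props in per_arch_props:          # one list of proportions per network
        steps = len(props) - 1
        total += sum(sign(props[i + 1] - props[i]) for i in range(steps)) / steps
    return total / len(per_arch_props)

def best_interval(intervals, props_of):
    """Algorithm 1: return the interval with the highest Sweetness.
    props_of(interval) yields the per-architecture proportion lists."""
    best, best_s = None, -1.0
    for iv in intervals:
        s = sweetness(props_of(iv))
        if s > best_s:
            best_s, best = s, iv
    return best

# Toy check: strictly ascending proportions give Sweetness = 1.
print(sweetness([[0.1, 0.2, 0.3], [0.0, 0.5, 0.9]]))  # -> 1.0
```

The paper notes that a small architecture sample (e.g., 100 networks) suffices for this selection, so `props_of` only needs one mini-batch backward pass per architecture.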

4. EXPERIMENT

In this section, we first evaluate the ranking consistency of Sweetimator on the NAS-Bench-101 (Ying et al., 2019), NAS-Bench-201 (Dong & Yang, 2020; Dong et al., 2022), NAS-Bench-301 (Zela et al., 2022), and NDS (Radosavovic et al., 2019) benchmarks. We then use Sweetimator to conduct search experiments in the NAS-Bench-201 and DARTS (Liu et al., 2019) search spaces. Finally, we provide ablation experiments for further analysis. Experimental details are provided in Appendix E.

4.1. CONSISTENCY RESULTS

Benchmarks. We conduct experiments on NAS-Bench-101, NAS-Bench-201, NAS-Bench-301, and the NDS search spaces. NAS-Bench-101 is the first large-scale NAS benchmark, with 423k architectures and their accuracies on CIFAR-10. NAS-Bench-201 is an extension of NAS-Bench-101, containing 15625 architectures evaluated on CIFAR-10, CIFAR-100, and ImageNet16-120. NAS-Bench-301 is the first surrogate NAS benchmark, covering 10^18 architectures on CIFAR-10. NDS statistically analyzes multiple network design spaces, including the NASNet (Zoph et al., 2018), AmoebaNet (Real et al., 2019), PNAS (Liu et al., 2018), ENAS (Pham et al., 2018), and DARTS (Liu et al., 2019) search spaces; each space has thousands of architectures in the NDS benchmark.

Baselines. We compare against common Zero-Shot estimators, including SNIP (Lee et al., 2019b), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), Fisher (Turner et al., 2020), GradNorm (Abdelfattah et al., 2021), NASWOT (Mellor et al., 2021), TE-NAS (Chen et al., 2021a), Zen-Score (Lin et al., 2021), and GradSign (Zhang & Jia, 2022), as well as the FLOPs and Parameters of the networks. A comparison with Non-Zero-Shot estimators is given in Appendix F.

Settings.

The consistency experiments are divided into two groups. The first group contains NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301, with 4500, 15625, and 5000 architectures used to test consistency, respectively. For fairness, the batch size of all Zero-Shot estimators is 64. The evaluation metric is Spearman's rank. The second group includes the NASNet, AmoebaNet, PNAS, ENAS, and DARTS spaces of the NDS benchmark, with 4846, 4983, 4999, 4999, and 5000 architectures for evaluation, respectively. The architectures are trained on CIFAR-10. All Zero-Shot estimators use a batch size of 128. The evaluation metric is Kendall's Tau.

Results. Tables 1 and 2 show that Sweetimator significantly outperforms existing Zero-Shot estimators. For example, in Table 1 Sweetimator is better than the Parameters estimator by 46% (0.423 vs. 0.618) on NAS-Bench-101. The results verify the effectiveness of the proposed method. Moreover, we observe that existing Zero-Shot estimators do not consistently outperform Parameters in terms of rank consistency, which was also found by Ning et al. (2021). In contrast, Sweetimator significantly outperforms Parameters, suggesting that the Sweet Gradient is a direction worth exploring for NAS.

4.2. SEARCH RESULTS IN NAS-BENCH-201

Settings. In this experiment, the NAS algorithms REA (Real et al., 2019), REINFORCE (Williams, 1992), and BOHB (Falkner et al., 2018) are assisted with Sweetimator to verify its applicability. For a fair comparison, we use the assisted algorithm proposed in GradSign (Zhang & Jia, 2022); in particular, the random selection step of each NAS algorithm is replaced by Sweetimator-assisted selection. The NASWOT-, GradSign-, and Sweetimator-assisted NAS algorithms are abbreviated as A-, G-, and S-algorithms, e.g., S-REA. Following Zhang & Jia (2022), all results are searched within a time budget of 12000s.
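The consistency metric itself is standard; for completeness, a dependency-free sketch of Spearman's rank correlation between estimator scores and test accuracies (with average ranks for ties) might look like the following. In practice one would typically call `scipy.stats.spearmanr` instead.

```python
def rankdata(xs):
    """Assign 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                     # extend over a run of ties
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(scores, accuracies):
    """Spearman's rank = Pearson correlation of the two rank vectors."""
    rx, ry = rankdata(scores), rankdata(accuracies)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy scores and accuracies that agree perfectly in rank order.
print(round(spearman([0.1, 0.4, 0.35, 0.8], [10.0, 71.0, 63.0, 94.0]), 3))  # -> 1.0
```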

4.4. ABLATION STUDY

Batch size, the number of architectures, and the split of the interval are three major factors that influence Sweetimator's gradient calculation and threshold acquisition. For further analysis, we conduct ablation studies on NAS-Bench-201, shown in Table 5.

Batch Size. The batch size is an important hyper-parameter that determines the quality of the gradient: the larger the batch size, the more accurate the gradient. Table 5 demonstrates that Spearman's rank generally increases with batch size, so Sweetimator can be made more effective by using a larger batch size.

Architecture Number. The number of architectures affects threshold acquisition. Table 5 shows that a larger architecture number does not change Spearman's rank, because the obtained interval is the same across different architecture numbers. The results suggest that a small number of architectures (e.g., 100) is enough to obtain the best interval.

Split of Interval. The split influences the granularity of the interval.

5. LIMITATIONS AND FUTURE WORK

In this work we only consider computer vision classification tasks and cell-based search spaces. Moreover, finding the optimal gradient interval is still a discrete selection process rather than a continuous, differentiable optimization. In future work, we intend to apply Sweetimator to more diverse search spaces (e.g., Transformer spaces) and to explore an optimization-friendly method for obtaining the best interval. Besides, it remains a mystery why the Sweet Gradient interval exhibits the property of increasing parameter proportion with network depth. We believe a more profound reason lies behind this observation that would help us better understand deep neural networks; a theoretical analysis is therefore necessary in future work.

6. CONCLUSION

This work observes the Sweet Gradient phenomenon: some specific parameters, whose absolute gradient values lie in a certain interval, have a stronger consistency with network performance than Parameters. To obtain the Sweet Gradient interval without training, we investigated the relationship between network depth and the Sweet Gradient, and found that the Sweet Gradient tends to lie in intervals whose parameter proportion increases with network depth. Based on this property, we use Sweetness to obtain the best interval. Experiments demonstrate that Sweetimator achieves superior rank consistency and excellent search results, verifying the effectiveness of the proposed estimator.

A THEORETICAL ANALYSIS

The dataset is denoted as $\{x_i, y_i\}_{i=1}^N$, where the input $x_i \in \mathbb{R}^d$ and the output $y_i \in \mathbb{R}$. The loss function is $J_N(\theta) = \sum_{i=1}^N \ell(y_i, f(\theta; x_i))$, where $f(\theta; x)$ is the network architecture, $\theta \in \mathbb{R}^m$, and $\ell(\cdot,\cdot)$ is the per-sample loss. The optimal network parameter is $\theta^* \triangleq \arg\min_{\theta \in \mathbb{R}^m} J_N(\theta)$. The score index is

$$\sum_{k=1}^{m} \mathbb{I}\left( \tau_1 \le \left| \frac{\partial J_N(\theta^0)}{\partial \theta_k} \right| < \tau_2 \right) \tag{6}$$

where $\tau_2 > \tau_1 > 0$ and $\theta^0$ is the initialization parameter; $\tau_1$ and $\tau_2$ correspond to $thr_1$ and $thr_2$ in Equation (1). Regarding the optimal parameter $\theta^*$ and the initial parameter $\theta^0$, we provide a theoretical analysis from two perspectives.

• Higher performance: a small loss at the optimal parameter, $J_N(\theta^*)$, implies high network performance. We analyze the upper bound of $J_N(\theta^*)$ and expect it to be as small as possible.
• Easier to optimize: a small distance between the initial parameter and the optimal parameter implies an easy optimization process. We analyze the upper bound of $\|\theta^0 - \theta^*\|_2$ and expect it to be as small as possible.

A.1 THEORETICAL RESULTS

The theorems in Allen-Zhu et al. (2019) reveal that the neighborhood around a random initialization has excellent properties: the loss is almost convex and semi-smooth there. Based on this work, we introduce a slightly stronger assumption. Denote the vector $\ell_2$ norm by $\|\cdot\|_2$.

Assumption 1. $J_N(\theta)$ is $h$-strongly convex and $H$-smooth in the neighborhood $\Gamma(\theta^0, R) \triangleq \{\theta : \|\theta - \theta^0\|_2 \le R\}$ of the initialization $\theta^0$, where $R > 0$ is the neighborhood radius.

Lemma 1. $J_N(\theta)$ being $h$-strongly convex and $H$-smooth is equivalent to $h \cdot I \preceq \nabla^2 J_N(\theta) \preceq H \cdot I$, $\forall \theta \in \Gamma(\theta^0, R)$.

Theorem 1 (upper bound of the loss at the optimal point). Suppose Assumption 1 holds in the neighborhood $\Gamma(\theta^0, R)$. Then

$$J_N(\theta^*) \le J_N(\theta^0) - \frac{1}{2H}\varepsilon^2 m - \frac{1}{2H}\left(\tau_1^2 - \varepsilon^2\right) \sum_{k=1}^{m} \mathbb{I}\left( \left| \frac{\partial J_N(\theta^0)}{\partial \theta_k} \right| \ge \tau_1 \right) \tag{7}$$

where $0 \le \varepsilon \triangleq \min\left\{ \left| \frac{\partial J_N(\theta^0)}{\partial \theta_k} \right| : \left| \frac{\partial J_N(\theta^0)}{\partial \theta_k} \right| < \tau_1 \right\} < \tau_1$.
Remark 1. The upper bound of $J_N(\theta^*)$ is influenced by two terms: $J_N(\theta^0)$ and $\sum_{k=1}^{m} \mathbb{I}\left( \left|\partial J_N(\theta^0)/\partial \theta_k\right| \ge \tau_1 \right)$. For performance ranking, the effect of the first term can be ignored, since $J_N(\theta^0)$ is similar across networks. In practice, standard initializations (Glorot & Bengio, 2010; He et al., 2015) tend to maintain the mean and variance of each layer's output, so the output logits of different networks follow similar distributions, resulting in similar initial losses. In NAS-Bench-201, the initial loss $J_N(\theta^0)$ over all candidate networks on CIFAR-10, CIFAR-100, and ImageNet16-120 is 2.343 ± 0.043, 4.651 ± 0.042, and 4.842 ± 0.047, respectively; the small deviations confirm that $J_N(\theta^0)$ is indeed similar across networks. We therefore only need to focus on the second term: the larger $\sum_{k=1}^{m} \mathbb{I}\left( \left|\partial J_N(\theta^0)/\partial \theta_k\right| \ge \tau_1 \right)$, the smaller the upper bound of $J_N(\theta^*)$.

Theorem 2 (upper bound of the distance between the initial point and the optimal point). Suppose Assumption 1 holds in the neighborhood $\Gamma(\theta^0, R)$. Then

$$\|\theta^0 - \theta^*\|_2 \le \frac{1}{h}\left[ M \cdot m - (M - \tau_2) \sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| < \tau_2 \right) \right]$$

where $M \triangleq \max\left\{ \left|\partial J_N(\theta^0)/\partial \theta_k\right| : \left|\partial J_N(\theta^0)/\partial \theta_k\right| \ge \tau_2 \right\} > \tau_2 > \tau_1 > 0$.

Remark 2. To make the upper bound of $\|\theta^0 - \theta^*\|_2$ small, the index $\sum_{k=1}^{m} \mathbb{I}\left( \left|\partial J_N(\theta^0)/\partial \theta_k\right| < \tau_2 \right)$ needs to be as large as possible.

For given $\tau_2 > \tau_1 > 0$, considering the two goals of high performance and easy optimization, we expect the upper bounds of both $J_N(\theta^*)$ and $\|\theta^0 - \theta^*\|_2$ to be as small as possible. Combining Theorems 1 and 2, the index

$$\sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| \ge \tau_1 \right) \cdot \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| < \tau_2 \right) = \sum_{k=1}^{m} \mathbb{I}\left( \tau_1 \le \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| < \tau_2 \right)$$

needs to be as large as possible, which yields the proposed Zero-Shot estimator.

A.2 PROOF OF LEMMA 1

For brevity, we abbreviate $J_N(\theta)$ as $J(\theta)$. Denote the statement "$J(\theta)$ is $h$-strongly convex and $H$-smooth on $\Gamma(\theta^0, R)$" as "Left" and "$h \cdot I \preceq \nabla^2 J(\theta) \preceq H \cdot I$, $\forall \theta \in \Gamma(\theta^0, R)$" as "Right".
Left ⇒ Right. Since $J(\theta)$ is $h$-strongly convex, $F(\theta) = J(\theta) - \frac{h}{2}\|\theta\|_2^2$ is convex. For the convex function $F(\theta)$ we have

$$(\nabla F(\theta_1) - \nabla F(\theta_2))^\top (\theta_1 - \theta_2) \ge 0, \quad \forall \theta_1, \theta_2 \in \Gamma(\theta^0, R). \tag{8}$$

Noting that $\nabla F(\theta) = \nabla J(\theta) - h\theta$ and substituting into Equation (8),

$$\|\nabla J(\theta_1) - \nabla J(\theta_2)\|_2 \, \|\theta_1 - \theta_2\|_2 \ge (\nabla J(\theta_1) - \nabla J(\theta_2))^\top (\theta_1 - \theta_2) \ge h\|\theta_1 - \theta_2\|_2^2 \tag{9}$$

where the first inequality comes from the Cauchy–Schwarz inequality. Since $J(\theta)$ is $H$-smooth, $\nabla J(\theta)$ is Lipschitz continuous with constant $H$, i.e., $\|\nabla J(\theta_1) - \nabla J(\theta_2)\|_2 \le H\|\theta_1 - \theta_2\|_2$, $\forall \theta_1, \theta_2 \in \Gamma(\theta^0, R)$. By the Cauchy–Schwarz inequality, we obtain

$$(\nabla J(\theta_1) - \nabla J(\theta_2))^\top (\theta_1 - \theta_2) \le \|\nabla J(\theta_1) - \nabla J(\theta_2)\|_2 \, \|\theta_1 - \theta_2\|_2 \le H\|\theta_1 - \theta_2\|_2^2. \tag{10}$$

Combining Equations (9) and (10), we have

$$h\|\theta_1 - \theta_2\|_2 \le \|\nabla J(\theta_1) - \nabla J(\theta_2)\|_2 \le H\|\theta_1 - \theta_2\|_2. \tag{11}$$

Inspired by Theorem 5.12 in Beck (2017), let $\theta_1 = \theta_2 + tv$ with $t > 0$. Then

$$\nabla J(\theta_2 + tv) - \nabla J(\theta_2) = \int_0^t \nabla^2 J(\theta_2 + zv)\, v \, dz \tag{12}$$

and thus

$$ht\|v\|_2 \le \left\| \left( \int_0^t \nabla^2 J(\theta_2 + zv)\, dz \right) v \right\|_2 = \|\nabla J(\theta_2 + tv) - \nabla J(\theta_2)\|_2 \le Ht\|v\|_2. \tag{13}$$

Dividing both sides by $t$ and letting $t \to 0^+$,

$$h\|v\|_2 \le \left\| \nabla^2 J(\theta_2)\, v \right\|_2 \le H\|v\|_2. \tag{14}$$

By the arbitrariness of $\theta_2$ and $v$, we have $h \cdot I \preceq \nabla^2 J(\theta) \preceq H \cdot I$.

Right ⇒ Left. For $\forall \theta_1, \theta_2 \in \Gamma(\theta^0, R)$, take the Taylor expansion of $J(\theta)$ at $\theta_2$ and substitute $\theta_1$:

$$J(\theta_1) = J(\theta_2) + (\nabla J(\theta_2))^\top(\theta_1 - \theta_2) + \frac{1}{2}(\theta_1 - \theta_2)^\top \nabla^2 J(\beta\theta_1 + (1-\beta)\theta_2)(\theta_1 - \theta_2) \tag{16}$$

where $0 < \beta < 1$. By $h \cdot I \preceq \nabla^2 J(\theta) \preceq H \cdot I$, we have

$$J(\theta_1) \ge J(\theta_2) + (\nabla J(\theta_2))^\top(\theta_1 - \theta_2) + \frac{h}{2}\|\theta_1 - \theta_2\|_2^2 \tag{17}$$

and

$$J(\theta_1) \le J(\theta_2) + (\nabla J(\theta_2))^\top(\theta_1 - \theta_2) + \frac{H}{2}\|\theta_1 - \theta_2\|_2^2. \tag{18}$$

By the arbitrariness of $\theta_1, \theta_2$, exchanging their positions gives

$$J(\theta_2) \ge J(\theta_1) + (\nabla J(\theta_1))^\top(\theta_2 - \theta_1) + \frac{h}{2}\|\theta_1 - \theta_2\|_2^2 \tag{19}$$

and

$$J(\theta_2) \le J(\theta_1) + (\nabla J(\theta_1))^\top(\theta_2 - \theta_1) + \frac{H}{2}\|\theta_1 - \theta_2\|_2^2. \tag{20}$$
Adding Equation (17) to Equation (19), we get

$$(\nabla J(\theta_2) - \nabla J(\theta_1))^\top(\theta_2 - \theta_1) \ge h\|\theta_1 - \theta_2\|_2^2. \tag{21}$$

Similarly, adding Equations (18) and (20),

$$(\nabla J(\theta_2) - \nabla J(\theta_1))^\top(\theta_2 - \theta_1) \le H\|\theta_1 - \theta_2\|_2^2. \tag{22}$$

Thus $J_N(\theta)$ is $h$-strongly convex and $H$-smooth.

A.3 PROOF OF THEOREM 1: THE UPPER BOUND OF THE LOSS IN THE OPTIMAL PARAMETER

Take the Taylor expansion of the loss function $J_N(\theta)$ at the initialization parameter $\theta^0$. For $\forall \theta \in \Gamma(\theta^0, R)$, we can obtain

$$J_N(\theta) = J_N(\theta^0) + \nabla J_N(\theta^0)^\top(\theta - \theta^0) + \frac{1}{2}(\theta - \theta^0)^\top \nabla^2 J_N(\alpha\theta + (1-\alpha)\theta^0)(\theta - \theta^0) \le J_N(\theta^0) + \nabla J_N(\theta^0)^\top(\theta - \theta^0) + \frac{H}{2}\|\theta - \theta^0\|_2^2 = J_N(\theta^0) + \sum_{k=1}^{m} \frac{\partial J_N(\theta^0)}{\partial \theta_k}\left(\theta_k - \theta_k^0\right) + \frac{H}{2}\sum_{k=1}^{m}\left(\theta_k - \theta_k^0\right)^2 \triangleq L_N(\theta) \tag{23}$$

where $0 < \alpha < 1$. Denote $\bar\theta^* \triangleq \arg\min_{\theta \in \Gamma(\theta^0, R)} L_N(\theta)$. We assert that

$$J_N(\theta^*) \le L_N(\bar\theta^*). \tag{24}$$

Otherwise, if $J_N(\theta^*) > L_N(\bar\theta^*)$, we would have

$$J_N(\bar\theta^*) \ge J_N(\theta^*) > L_N(\bar\theta^*), \tag{25}$$

which contradicts Equation (23). Notice that $L_N(\theta)$ is a quadratic function of $\theta - \theta^0$; its unconstrained minimizer satisfies

$$\bar\theta_k^* - \theta_k^0 = -\frac{1}{H}\frac{\partial J_N(\theta^0)}{\partial \theta_k}, \quad k \in \{1, \dots, m\}.$$

Next we show this minimizer lies in the feasible domain $\Gamma(\theta^0, R)$. Since $\theta^*$ is the optimal point of $J_N(\theta)$ and $J_N(\theta)$ is $H$-smooth, we have

$$\|\nabla J_N(\theta^0)\|_2 = \|\nabla J_N(\theta^0) - \nabla J_N(\theta^*)\|_2 \le H\|\theta^* - \theta^0\|_2 \le HR.$$

Then $\|\bar\theta^* - \theta^0\|_2 = \frac{1}{H}\|\nabla J_N(\theta^0)\|_2 \le R$, i.e., $\bar\theta^* \in \Gamma(\theta^0, R)$. Thus the global minimum of $L_N(\theta)$ is

$$L_N(\bar\theta^*) = J_N(\theta^0) - \frac{1}{2H}\sum_{k=1}^{m}\left( \frac{\partial J_N(\theta^0)}{\partial \theta_k} \right)^2.$$

For the upper bound of $J_N(\theta^*)$, we obtain

$$J_N(\theta^*) \le L_N(\bar\theta^*) = J_N(\theta^0) - \frac{1}{2H}\sum_{k=1}^{m}\left( \frac{\partial J_N(\theta^0)}{\partial \theta_k} \right)^2 \le J_N(\theta^0) - \frac{\tau_1^2}{2H}\sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| \ge \tau_1 \right) - \frac{\varepsilon^2}{2H}\sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| < \tau_1 \right) = J_N(\theta^0) - \frac{1}{2H}\varepsilon^2 m - \frac{1}{2H}\left(\tau_1^2 - \varepsilon^2\right)\sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| \ge \tau_1 \right) \tag{30}$$

where $0 \le \varepsilon \triangleq \min\left\{ \left|\partial J_N(\theta^0)/\partial \theta_k\right| : \left|\partial J_N(\theta^0)/\partial \theta_k\right| < \tau_1 \right\} < \tau_1$.

A.4 PROOF OF THEOREM 2: THE UPPER BOUND OF DISTANCE BETWEEN THE INITIAL PARAMETER AND THE OPTIMAL PARAMETER

Denote the vector $\ell_1$ norm by $\|\cdot\|_1$. Take the Taylor expansion of the gradient $\nabla J_N(\theta)$ at the optimal parameter $\theta^*$ and substitute the initialization parameter $\theta^0$:

$$\nabla J_N(\theta^0) = \nabla J_N(\theta^*) + \nabla^2 J_N(\gamma\theta^0 + (1-\gamma)\theta^*)(\theta^0 - \theta^*) = \nabla^2 J_N(\gamma\theta^0 + (1-\gamma)\theta^*)(\theta^0 - \theta^*) \tag{31}$$

where $0 < \gamma < 1$ and the last equality uses $\nabla J_N(\theta^*) = 0$. By Assumption 1, $h \cdot I \preceq \nabla^2 J_N(\gamma\theta^0 + (1-\gamma)\theta^*) \preceq H \cdot I$, so

$$h\|\theta^0 - \theta^*\|_2 \le \|\nabla J_N(\theta^0)\|_2 \le H\|\theta^0 - \theta^*\|_2.$$

Further, we obtain

$$\|\theta^0 - \theta^*\|_2 \le \frac{1}{h}\|\nabla J_N(\theta^0)\|_2 \le \frac{1}{h}\|\nabla J_N(\theta^0)\|_1 = \frac{1}{h}\sum_{k=1}^{m}\left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| \le \frac{\tau_2}{h}\sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| < \tau_2 \right) + \frac{M}{h}\sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| \ge \tau_2 \right) = \frac{\tau_2}{h}\sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| < \tau_2 \right) + \frac{M}{h}\left( m - \sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| < \tau_2 \right) \right) = \frac{1}{h}\left[ M \cdot m - (M - \tau_2)\sum_{k=1}^{m} \mathbb{I}\left( \left|\frac{\partial J_N(\theta^0)}{\partial \theta_k}\right| < \tau_2 \right) \right]$$

where $M \triangleq \max\left\{ \left|\partial J_N(\theta^0)/\partial \theta_k\right| : \left|\partial J_N(\theta^0)/\partial \theta_k\right| \ge \tau_2 \right\} > \tau_2 > \tau_1 > 0$.

B SWEET GRADIENT PHENOMENON

The Sweet Gradient widely exists across different search spaces, datasets, numbers of architectures, batch sizes, and initializations. Here we list the heatmaps of the Sweet Gradient under various configurations.

D OTHER INVESTIGATIONS

Investigation 1: Is the Sweet Gradient related to the gradient distribution? A natural conjecture is that the Sweet Gradient is more likely to exist in intervals with concentrated gradient values. Figure 11 plots the gradient and activation distributions under different intervals.

Although only 200 cycles are used, the scores converge rapidly and the search results are competitive with other Zero-Shot methods, which also brings a fast search speed; we therefore did not do more hyper-parameter tuning. Please note that the reported search time includes computing the best interval and searching for the optimal architecture. In the retraining phase, we follow the DARTS settings to train architectures. On CIFAR-10, the network consists of 20 layers with 36 initial channels. We use the SGD optimizer to train the network for 600 epochs with a batch size of 96; the learning rate decays from 0.025 to 0 with a cosine scheduler. Other settings, such as cutout, the auxiliary head, and path dropout, are the same as in DARTS. On ImageNet, the network consists of 14 cells with 48 channels and is restricted to fewer than 600M FLOPs. The SGD optimizer is used to train the network for 250 epochs with a learning rate of 0.5, weight decay of 3e-5, and a batch size of 1024. Training takes around three days on eight NVIDIA V100 GPUs.

G VISUALIZATION OF SEARCH RESULTS




Figure 1: Spearman's rank correlation coefficients of Zero-Shot estimators on NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301. The dotted line indicates the Spearman's rank of Parameters.

Figure 2: The Spearman's rank under different intervals on NAS-Bench-101 and NAS-Bench-201 (CIFAR10). Each rank is calculated with 100 architectures and a batch size of 128.

Figure 3: The parameter proportion along cell depth of Sweet and Non-Sweet Gradient intervals on NAS-Bench-101 and NAS-Bench-201 (CIFAR-10). Each layer represents a cell of the network and there are 9 and 15 layers in NAS-Bench-101 and NAS-Bench-201, respectively.


Figure 4: Sweet Gradient across different search spaces.

Figure 11: The gradient and activation distribution under different intervals in NAS-Bench-101 and NAS-Bench-201 (CIFAR-10).

Figure 12: Visualization of model test accuracy versus Sweetimator metric score on CIFAR-10, CIFAR-100, ImageNet16-120.

Rank Consistency of Zero-Shot estimators in NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301 by Spearman's rank.

Rank Consistency of Zero-Shot estimators in five search spaces of NDS by Kendall's Tau.

Table 3: Mean ± std accuracies in NAS-Bench-201. The upper and lower parts show results for Architecture Selection and Assisted NAS, respectively. Each search is run 500 times. For a fair comparison, architectures are searched on CIFAR-10 and evaluated on CIFAR-10, CIFAR-100, and ImageNet16-120. Note that we rerun the GradSign-assisted NAS experiments using their released code.

Table 3 summarizes the search results. In the architecture selection experiments, Sweetimator outperforms the compared baselines and comes closest to Optimal, indicating its ability to find high-performance architectures. In the Sweetimator-assisted NAS experiments, the average accuracy of the Sweetimator-assisted algorithms is higher than that of the original algorithms and of the other estimator-assisted variants, with a lower standard deviation, further validating the efficiency and applicability of Sweetimator.

4.3 SEARCH RESULTS IN DARTS SEARCH SPACE

Settings. DARTS (Liu et al., 2019) is a popular search space for evaluating NAS algorithms. We conduct experiments on the CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Olga et al., 2015) datasets. In the search phase, we utilize 100 architectures and a batch size of 128 to obtain the best interval for Sweetimator. We then integrate Sweetimator with the REA algorithm (Real et al., 2019) for searching. The hyper-parameters of REA follow Dong & Yang (2020), with 200 cycles. Despite the small number of cycles, the score converges rapidly and the search results are competitive with other Zero-Shot methods, which also brings a fast search speed. In the retraining phase, we follow the DARTS settings to build and train the searched networks for a fair comparison. More details are provided in Appendix E.
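Coupling a training-free estimator with REA can be sketched as below (a toy, self-contained version: the bit-string "architectures", `flip_one` mutation, and `sum` score stand in for the real search space and the Sweetimator score; the aging-evolution loop itself follows Real et al. (2019)):

```python
import collections
import random

def rea_search(score, random_arch, mutate, cycles=200, pop_size=10, sample_size=3):
    """Regularized (aging) evolution guided by a training-free score.

    Each cycle: sample a subset of the population, mutate its
    highest-scoring member, and remove the oldest individual.
    """
    population = collections.deque()
    history = []  # (arch, score) of every architecture evaluated
    for _ in range(pop_size):
        arch = random_arch()
        population.append((arch, score(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)
        parent = max(sample, key=lambda pair: pair[1])
        child = mutate(parent[0])
        population.append((child, score(child)))
        history.append(population[-1])
        population.popleft()  # "aging": discard the oldest member
    return max(history, key=lambda pair: pair[1])

# Toy usage: 8-bit architectures, score = number of ones.
random.seed(0)
random_arch = lambda: [random.randint(0, 1) for _ in range(8)]
def flip_one(arch):
    child = list(arch)
    i = random.randrange(len(child))
    child[i] ^= 1
    return child

best, best_score = rea_search(sum, random_arch, flip_one)
```

In the actual search, `score` would be the Sweetimator metric evaluated at the best interval, so no architecture is ever trained during the 200 cycles.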

As shown in Table 5, the Spearman's ranks improve on CIFAR-10 (0.888 → 0.913) and CIFAR-100 (0.859 → 0.887) but remain almost unchanged on ImageNet16-120 (0.835 → 0.836) when the split is larger. Consequently, the appropriate interval granularity differs across datasets, indicating that Sweetimator could be made more efficient by designing a flexible interval selection algorithm in the future.

Table 5: Rank consistency of Zero-Shot estimators in NAS-Bench-201 by Spearman's rank. The baseline uses a batch size of 64, an architecture number of 100, and a split of 2.


F COMPARISON WITH NON-ZERO-SHOT ESTIMATORS

Settings. We compare Sweetimator with Non-Zero-Shot estimators in NAS-Bench-201. The baselines include SPOS (Guo et al., 2020), Neural Predictor (Wen et al., 2020), NAO (Luo et al., 2018), and TNASP (Lu et al., 2021). For SPOS, we trained a supernet for 250 epochs with a batch size of 256 and used the validation accuracy of sub-networks as the estimator. For Neural Predictor, NAO, and TNASP, we directly use the results reported in Lu et al. (2021). For Sweetimator, we set the batch size to 128, the architecture number to 100, and the split to 2. The comparison metric is Kendall's Tau between the estimator scores and the test accuracy on CIFAR-10 from the benchmark.

Results. Table 7 shows the comparison results. Sweetimator achieves superior rank consistency to Non-Zero-Shot estimators on NAS-Bench-201, which further demonstrates the effectiveness of the proposed method. Moreover, Sweetimator takes only a few minutes to obtain the best interval, which is far more efficient than the network training required by Non-Zero-Shot estimators.
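The Kendall's Tau used for this comparison can be computed directly from the score/accuracy pairs (a minimal sketch of the plain tau-a statistic, ignoring ties; on tie-free data it agrees with `scipy.stats.kendalltau`):

```python
from itertools import combinations

def kendall_tau(scores, accuracies):
    """Tau-a: (concordant - discordant) / total pairs, no tie handling."""
    assert len(scores) == len(accuracies)
    concordant = discordant = 0
    for i, j in combinations(range(len(scores)), 2):
        sign = (scores[i] - scores[j]) * (accuracies[i] - accuracies[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(scores)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1.0 means the estimator ranks architectures in exactly the same order as their test accuracy; 0 means no rank agreement.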

H ADDITIONAL TASKS

We conduct experiments on the NAS-Bench-NLP and NAS-Bench-ASR benchmarks to further validate our method. The compared baselines include SynFlow, GradSign, and Parameters. The results in the following table show that Sweetimator has superior performance consistency compared to the other methods. However, the Spearman's rank on NAS-Bench-NLP drops significantly compared to that on CV tasks. We conjecture that text generation tasks and RNN architectures are more complicated than classification tasks and CNN architectures, and, furthermore, that the hyper-parameters of the architecture candidates in the NAS-Bench-NLP search space are not well-tuned. Due to the limited time available during the rebuttal, we leave improving the consistency of Sweetimator on more tasks to future work.

