ZICO: ZERO-SHOT NAS VIA INVERSE COEFFICIENT OF VARIATION ON GRADIENTS

Abstract

Neural Architecture Search (NAS) is widely used to automatically obtain the neural network with the best performance among a large number of candidate architectures. To reduce the search time, zero-shot NAS aims at designing training-free proxies that can predict the test performance of a given architecture. However, as shown recently, none of the zero-shot proxies proposed to date can actually work consistently better than a naive proxy, namely, the number of network parameters (#Params). To improve this state of affairs, as the main theoretical contribution, we first reveal how some specific gradient properties across different samples impact the convergence rate and generalization capacity of neural networks. Based on this theoretical analysis, we propose a new zero-shot proxy, ZiCo, the first proxy that works consistently better than #Params. We demonstrate that ZiCo works better than State-Of-The-Art (SOTA) proxies on several popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS, TransNASBench-101) for multiple applications (e.g., image classification/reconstruction and pixel-level prediction). Finally, we demonstrate that the optimal architectures found via ZiCo are as competitive as the ones found by one-shot and multi-shot NAS methods, but with much less search time. For example, ZiCo-based NAS can find optimal architectures with 78.1%, 79.4%, and 80.4% test accuracy under inference budgets of 450M, 600M, and 1000M FLOPs, respectively, on ImageNet within 0.4 GPU days.

1. INTRODUCTION

During the last decade, deep learning has achieved great success in many areas, such as computer vision and natural language modeling Krizhevsky et al. (2012); Liu & Deng (2015); Huang et al. (2017); He et al. (2016); Dosovitskiy et al. (2021); Brown et al. (2020); Vaswani et al. (2017). In recent years, neural architecture search (NAS) has been proposed to search for optimal architectures while reducing the trial-and-error (manual) network design efforts Baker et al. (2017); Zoph & Le (2017); Elsken et al. (2019). Moreover, the neural architectures found via NAS show better performance than manually-designed networks in many mainstream applications Real et al. (2017); Gong et al. (2019); Xie et al. (2019); Wu et al. (2019); Wan et al. (2020); Li & Talwalkar (2020); Kandasamy et al. (2018); Yu et al. (2020b); Liu et al. (2018b); Cai et al. (2018); Zhang et al. (2019a); Zhou et al. (2019); Howard et al. (2019); Li et al. (2021b). Despite these advantages, many existing NAS approaches involve a time-consuming and resource-intensive search process. For example, multi-shot NAS uses a controller or an accuracy predictor to conduct the search process and requires training multiple networks; thus, multi-shot NAS is extremely time-consuming Real et al. (2019); Chiang et al. (2019). Alternatively, one-shot NAS merges all possible networks from the search space into a supernet and thus only needs to train the supernet once Dong & Yang (2019); Zela et al. (2020); Chen et al. (2019); Cai et al. (2019); Stamoulis et al. (2019); Chu et al. (2021); Guo et al. (2020); Li et al. (2020); this enables one-shot NAS to find a good architecture with much less search time. Though one-shot NAS has significantly improved the time efficiency of NAS, training is still required during the search process. Recently, zero-shot approaches have been proposed to liberate NAS from training entirely Wu et al. (2021); Zhou et al.
(2022; 2020); Ingolfsson et al. (2022); Tran & Bae (2021); Do & Luong (2021); Tran et al. (2021); Shu et al. (2022b); Li et al. (2022). Essentially, zero-shot NAS utilizes some proxy that can predict the test performance of a given network without training. The design of such proxies is usually based on some theoretical analysis of deep networks. For instance, the first zero-shot proxy, called NN-Mass, was proposed by Bhardwaj et al. (2019); NN-Mass theoretically links how the network topology influences gradient propagation and model performance. Hence, zero-shot approaches can not only significantly improve the time efficiency of NAS, but also deepen the theoretical understanding of why certain networks work well. While NN-Mass consistently outperforms #Params, it is not defined for generic NAS topologies and works mostly for simple repeating blocks like ResNets/MobileNets/DenseNets Bhardwaj et al. (2019). Later, several zero-shot proxies were proposed for general neural architectures. Nonetheless, as revealed in Ning et al. (2021); White et al. (2022), the general zero-shot proxies proposed to date cannot consistently work better than a naive proxy, namely, the number of parameters (#Params). These results may undermine the effectiveness of zero-shot NAS approaches. To address the limitations of existing zero-shot proxies, we target the following key questions:
1. How do some specific gradient properties, i.e., the mean value and standard deviation across different samples, impact the training convergence of neural networks?
2. Can we use these two gradient properties to design a new theoretically-grounded proxy that works better than #Params consistently across many different NAS topologies/tasks?
To this end, we first analyze how the mean value and standard deviation of gradients across different training batches impact the training convergence of neural networks. Based on our analysis, we propose ZiCo, a new proxy for zero-shot NAS.
We demonstrate that, compared to all existing proxies (including #Params), ZiCo has either a higher or at least on-par correlation with the test accuracy on popular NAS-Benchmarks (NASBench101, NATS-Bench-SSS/TSS) for multiple datasets (CIFAR10/100, ImageNet16-120). Finally, we demonstrate that ZiCo enables a zero-shot NAS framework that can efficiently find the network architectures with the highest test accuracy compared to other zero-shot baselines. In fact, our zero-shot NAS framework achieves competitive FLOPs-accuracy tradeoffs compared to multiple one-shot and multi-shot NAS methods, but with much lower time costs. To summarize, we make the following major contributions:
• We theoretically reveal how the mean value and variance of gradients across multiple samples impact the training convergence and generalization capacity of neural networks.
• We propose a new zero-shot proxy, ZiCo, that works better than existing proxies on popular NAS-Benchmarks (NASBench101, NATS-Bench-SSS/TSS, TransNASBench-101) for multiple applications (image classification/reconstruction and pixel-level prediction).
• We demonstrate that our proposed zero-shot NAS achieves competitive test accuracy with representative one-shot and multi-shot NAS with much less search time.
The rest of the paper is organized as follows. We discuss related work in Section 2. In Section 3, we introduce our theoretical analysis. We introduce our proposed zero-shot proxy (ZiCo) and the NAS framework in Section 3.4. Section 4 validates our analysis and presents our results with the proposed zero-shot NAS. We conclude the paper in Section 5 with remarks on our main contribution. Wang et al. (2020). Moreover, the Zen-score approximates the gradient w.r.t. feature maps and measures the complexity of neural networks Lin et al. (2021); Sun et al. (2021).
Furthermore, Jacob_cov leverages the Jacobian matrix between the loss and multiple input samples to quantify the capacity of modeling complex functions Lopes et al. (2021). Though zero-shot NAS can significantly accelerate the NAS process, it has been revealed that the naive proxy #Params generally works better than all the proxies proposed to date Ning et al. (2021); White et al. (2022). These limitations of existing proxies motivate us to look for a new proxy that can consistently work better than #Params and address the limitations of existing zero-shot NAS approaches. In our work, we extend such kernel-based analysis to reveal the relationships between the gradient properties and the training convergence and generalization capacity of neural networks.

3. CONVERGENCE AND GENERALIZATION VIA GRADIENT ANALYSIS

We consider the mean value and standard deviation of gradients across different samples and first explore how these two metrics impact the training convergence of linear regression tasks.

3.1. LINEAR REGRESSION

Inspired by Du et al. (2019b), we use the training set S with M samples as follows:

S = \{(x_i, y_i) \mid i = 1, ..., M,\; x_i \in \mathbb{R}^d,\; y_i \in \mathbb{R},\; \|x_i\| = 1,\; |y_i| \le R,\; M > 1\}   (1)

where R is a positive constant and \|\cdot\| denotes the L2-norm of a given vector; x_i \in \mathbb{R}^d is the i-th input sample, normalized by its L2-norm (i.e., \|x_i\| = 1), and y_i is the corresponding label. We define the following linear model f = a^T x optimized with an MSE-based loss function L:

\min_a \sum_i L(y_i, f(x_i; a)) = \min_a \sum_i \frac{1}{2}(a^T x_i - y_i)^2   (2)

where a \in \mathbb{R}^d is the initial weight vector of f. We denote the gradient of L w.r.t. a as g(x_i) when taking (x_i, y_i) as the training sample:

g(x_i) = \frac{\partial L(y_i, f(x_i; a))}{\partial a}   (3)

We denote the j-th element of g(x_i) as g_j(x_i). We compute the mean value (\mu_j) and standard deviation (\sigma_j) of g_j(x_i) across all training samples as follows:

\mu_j = \frac{1}{M}\sum_{i=1}^{M} g_j(x_i), \qquad \sigma_j = \sqrt{\frac{1}{M}\sum_{i=1}^{M} \big(g_j(x_i) - \mu_j\big)^2}   (4)

Theorem 3.1. We denote the updated weight vector as \hat{a} and denote \sum_{i,j} [g_j(x_i)]^2 = G. Assume we use the accumulated gradient of all training samples and learning rate \eta to update the initial weight vector a, i.e., \hat{a} = a - \eta \sum_i g(x_i). If the learning rate satisfies 0 < \eta < 2, then the total training loss is bounded as follows:

\sum_i L(y_i, f(x_i; \hat{a})) \le \frac{G}{2} - \frac{\eta(2 - \eta)M^2}{2}\sum_j \mu_j^2   (5)

In particular, if the learning rate \eta = \frac{1}{M}, then the loss after the update is bounded by:

\sum_i L(y_i, f(x_i; \hat{a})) \le \frac{M}{2}\sum_j \sigma_j^2   (6)

We provide the proof in Appendix A and the experimental results that validate this theorem in Sec. 4.2.

Remark 3.1. Intuitively, Theorem 3.1 tells us that the higher the absolute mean of gradients across different training samples, the lower the training loss the model converges to; i.e., the network converges at a faster rate. Similarly, if \eta M \le 1, the smaller the standard deviation of gradients across different training samples/batches, the lower the training loss the model can achieve.
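The setup above (Eqs. 1-6) can be checked numerically. Below is a minimal NumPy sketch (variable names are ours, the data is synthetic) that draws normalized samples, computes the per-sample gradients and their mean/standard deviation per coordinate, performs one accumulated-gradient step with η = 1/M, and verifies the bound of Eq. 6:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 32, 10                                   # M samples of dimension d (Eq. 1)
X = rng.normal(size=(M, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize so ||x_i|| = 1
y = rng.uniform(-1.0, 1.0, size=M)              # labels with |y_i| <= R, here R = 1
a = rng.normal(size=d)                          # initial weight vector a

# Per-sample gradient of the MSE loss (Eq. 3): g(x_i) = (a^T x_i - y_i) x_i
G = (X @ a - y)[:, None] * X                    # shape (M, d); row i is g(x_i)

mu = G.mean(axis=0)                             # mu_j across samples (Eq. 4)
sigma = G.std(axis=0)                           # sigma_j across samples (Eq. 4)

# One accumulated-gradient update with eta = 1/M, as in Theorem 3.1
eta = 1.0 / M
a_hat = a - eta * G.sum(axis=0)

loss_after = 0.5 * np.sum((X @ a_hat - y) ** 2)
bound = 0.5 * M * np.sum(sigma ** 2)            # right-hand side of Eq. 6
```

Running this, `loss_after` indeed stays below `bound`, matching Eq. 6.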

3.2. MLPS WITH RELU

In this section, we generalize the linear model to a network with ReLU activation functions. We primarily consider the standard deviation of gradients in the Gaussian kernel space. We still focus on the regression task on the training set S defined in Eq. 1. We consider a neural network in the same form as Du et al. (2019b):

h(x; s, W) = \frac{1}{\sqrt{m}}\sum_{r=1}^{m} s_r \mathrm{ReLU}(w_r^T x)   (7)

where m is the number of output neurons of the first layer; s_r is the r-th element of the output weight vector s; W \in \mathbb{R}^{m \times d} is the input weight matrix, and w_r \in \mathbb{R}^d is the r-th row weight vector of W. For training on the dataset S with M samples defined in Eq. 1, we minimize the following loss function:

L(s, W) = \sum_{i=1}^{M} \frac{1}{2}\big(h(x_i; s, W) - y_i\big)^2   (8)

Following the common practice Du et al. (2019b), we fix the second layer (s) and use gradient descent to optimize the first layer (W) with a learning rate \eta:

w_r(t) = w_r(t-1) - \eta \frac{\partial L(s, W(t-1))}{\partial w_r(t-1)}   (9)

where W(t-1) denotes the input weight matrix after t-1 training steps and w_r(t) denotes the r-th row weight vector after t training steps.

Definition 1. (Gram Matrix) A Gram matrix H(t) \in \mathbb{R}^{M \times M} on the training set \{(x_i, y_i), i = 1, ..., M\} after t training steps is defined as follows:

H_{ij}(t) = \frac{1}{m} x_i^T x_j \sum_{r=1}^{m} \mathbb{I}\{x_i^T w_r(t) \ge 0,\; x_j^T w_r(t) \ge 0\}   (10)

where \mathbb{I} is the indicator function and \mathbb{I}\{A\} = 1 if and only if event A happens. We denote by \lambda_{min}(H) the minimal eigenvalue of a given matrix H, and we define \lambda_0 = \lambda_{min}(H(\infty)).
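Definition 1 is easy to compute directly. The following NumPy sketch (our own illustrative names; synthetic data) builds H for random normalized inputs and i.i.d. N(0, I) weight rows; note that H is a Hadamard product of two positive semi-definite matrices, so its smallest eigenvalue is non-negative:

```python
import numpy as np

rng = np.random.default_rng(1)
M, d, m = 8, 5, 64                              # M samples, input dim d, m hidden neurons
X = rng.normal(size=(M, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i|| = 1, as in Eq. 1
W = rng.normal(size=(m, d))                     # rows w_r ~ N(0, I)

def gram_matrix(X, W):
    """H_ij = (1/m) x_i^T x_j sum_r I{x_i^T w_r >= 0, x_j^T w_r >= 0} (Definition 1)."""
    act = ((X @ W.T) >= 0).astype(float)        # act[i, r] = I{x_i^T w_r >= 0}
    return (X @ X.T) * (act @ act.T) / W.shape[0]

H = gram_matrix(X, W)
lam_min = np.linalg.eigvalsh(H).min()           # lambda_min(H), as used in Theorem 3.2
```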

3.2.1. CONVERGENCE RATE

Theorem 3.2. Given a neural network with ReLU activation functions optimized by minimizing Eq. 8, we assume that each initial weight vector \{w_r(0), r = 1, ..., m\} is i.i.d. generated from N(0, I) and that the gradient for each weight follows an i.i.d. N(0, \sigma), where \sigma is measured across different training steps. For some positive constants \delta and \epsilon, if the learning rate \eta satisfies \eta < \frac{\lambda_0 \sqrt{\pi}\delta}{2\sqrt{2}M^2\sqrt{\Phi(1-\epsilon)}\,t\sigma}, then with probability at least (1-\delta)(1-\epsilon) the following holds true: for any r \in [m], \|w_r(0) - w_r(t)\| \le C = \eta t \sigma \sqrt{\Phi(1-\epsilon)}, and at training step t the Gram matrix H(t) satisfies:

\lambda_{min}(H(t)) \ge \lambda_{min}(H(0)) - \frac{2\sqrt{2}M^2 \eta t \sigma \sqrt{\Phi(1-\epsilon)}}{\sqrt{\pi}\delta} > 0   (11)

where \Phi(\cdot) is the inverse cumulative distribution function of the d-degree chi-squared distribution \chi^2(d). We provide the proof in Appendix B. We now introduce the following result from Du et al. (2019b) to further help our analysis.

Lemma 1. (Du et al. (2019b)) Assume we set the number of output neurons of the first layer m = \Omega(\frac{M^6}{\lambda_0^4 \delta^3}) and we i.i.d. initialize w_r \sim N(0, I) and s_r \sim \mathrm{uniform}\{-1, 1\}, for r \in [m]. When minimizing the loss function in Eq. 8 on the training set S in Eq. 1, with probability at least 1-\delta over the initialization, the training loss after t training steps is bounded by:

L(s, W(t)) \le e^{-\lambda_{min}(H(t))} L(s, W(t-1))   (12)

Theorem 3.3. Under the assumptions of Theorem 3.2 and Lemma 1, with probability at least (1-\delta)(1-\epsilon), the following inequality holds true:

L(s, W(t)) \le e^{-\lambda_{min}(H(0))} e^{\frac{2\sqrt{2}M^2 \eta t \sigma \sqrt{\Phi(1-\epsilon)}}{\sqrt{\pi}\delta}} L(s, W(t-1))   (13)

The proof consists of replacing \lambda_{min}(H(t)) in Eq. 12 with its lower bound given by Theorem 3.2.

Theorem 3.5. Under the assumptions of Theorem 3.2, at training step t the Gram matrix H(t) satisfies:

\lambda_{max}(H(t)) \le \lambda_{max}(H(0)) + \frac{2\sqrt{2}M^2 \eta t \sigma \sqrt{\Phi(1-\epsilon)}}{\sqrt{\pi}\delta}   (14)

where \Phi(\cdot) is the inverse cumulative distribution function of the d-degree chi-squared distribution \chi^2(d). We provide the proof in Appendix C.

Remark 3.5. Theorem 3.5 shows that, after some training steps t, a network with a smaller standard deviation (\sigma) of gradients has a smaller largest eigenvalue of the Gram matrix; i.e., the network has a flatter loss landscape at each training step. Therefore, based on Proposition 3.4, the model will generalize better. We further validate this theorem in the following section.

3.4. NEW ZERO-SHOT PROXY

Inspired by the above theoretical insights, we next propose a proxy (ZiCo) that jointly considers both the absolute mean and the standard deviation of gradients. Following the standard practice, we consider convolutional neural networks (CNNs) as candidate networks.

Definition 2. Given a neural network with D layers and loss function L, the Zero-shot inverse Coefficient of Variation (ZiCo) is defined as follows:

\mathrm{ZiCo} = \sum_{l=1}^{D} \log\Bigg(\sum_{\omega \in \theta_l} \frac{\mathbb{E}\big[|\nabla_\omega L(X_i, y_i; \Theta)|\big]}{\sqrt{\mathrm{Var}\big(|\nabla_\omega L(X_i, y_i; \Theta)|\big)}}\Bigg), \quad i \in \{1, ..., N\}   (15)

where \Theta denotes the initial parameters of a given network; \theta_l denotes the parameters of the l-th layer of the network, and \omega represents each element in \theta_l; X_i and y_i are the i-th input batch and the corresponding labels from the training set; N is the number of training batches used to compute ZiCo, and the mean and variance are taken across these N batches. We incorporate the log to stabilize the computation by regularizing extremely large or small values. Of note, our metric is applicable to general CNNs; i.e., there is no restriction w.r.t. the neural architecture when calculating ZiCo. As discussed in Section 3.3, networks with higher ZiCo tend to have better convergence rates and higher generalization capacity. Hence, architectures with higher ZiCo are better architectures. We remark that the loss values in Eq. 15 are all computed with the initial parameters \Theta; that is, we never update the parameters when computing ZiCo for a given network (hence it follows the basic principle of zero-shot NAS: never train, only use the initial parameters). In practice, two batches are enough for ZiCo to achieve the SOTA performance among all previously proposed accuracy proxies (see Sec. 4.5). Hence, we use only two input batches (N = 2) to compute ZiCo; this makes ZiCo very time-efficient for a given network.
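To make Eq. 15 concrete, the sketch below (our own illustrative setup, not the paper's implementation) computes the proxy for a single linear layer (D = 1) with hand-derived gradients; for a real CNN, one would instead collect per-parameter absolute gradients over the N batches via autograd and sum the per-layer log terms:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, B = 16, 2, 8                    # input dim, N = 2 batches of size B
a = rng.normal(size=d)                # a single linear layer stands in for layer l

# One absolute-gradient vector per batch: |gradient of the MSE loss per parameter|
abs_grads = []
for _ in range(N):
    X = rng.normal(size=(B, d))
    y = rng.normal(size=B)
    g = ((X @ a - y)[:, None] * X).mean(axis=0)   # batch gradient, shape (d,)
    abs_grads.append(np.abs(g))
abs_grads = np.stack(abs_grads)                   # shape (N, d)

mean = abs_grads.mean(axis=0)                     # E[|grad|] per parameter omega
std = abs_grads.std(axis=0)                       # sqrt(Var[|grad|]) per parameter

# Eq. 15 with a single layer: log of the summed inverse coefficient of variation
eps = 1e-12                                       # guards against zero variance
zico = np.log(np.sum(mean / (std + eps)))
```

The `eps` term is our own numerical safeguard, not part of Eq. 15.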

4.1. EXPERIMENTAL SETUP

We conduct the following types of experiments: (i) empirical validation of Theorems 3.1, 3.3, and 3.5; (ii) evaluation of the proposed ZiCo on multiple NAS benchmarks; (iii) ZiCo-based zero-shot NAS on ImageNet. For experiments (i), to validate Theorem 3.1, we optimize a linear model as in Eq. 2 on the MNIST dataset and report the mean gradient values and the standard deviation vs. the total training loss. Moreover, we also optimize the model defined by Eq. 7 on MNIST and report the training loss vs. the standard deviation of gradients in order to validate Theorem 3.2 and Theorem 3.5. For experiments (ii), we compare our proposed ZiCo against existing proxies on three mainstream NAS benchmarks. NATSBench is a popular cell-based benchmark with two different search spaces: (1) NATSBench-TSS, consisting of 15625 total architectures with different cell structures trained on the CIFAR10, CIFAR100, and ImageNet16-120 (Img16-120) datasets, which is just the renamed NASBench201. For experiments (iii), we use ZiCo to conduct the zero-shot NAS (see Algorithm 1) on ImageNet. We first use Algorithm 1 to find the networks with the highest ZiCo under various FLOPs budgets. We conduct the search for 100k steps; this takes 10 hours on a single NVIDIA 3090 GPU (i.e., 0.4 GPU days). Then, we train the obtained network with the exact same training setup as Lin et al. (2021). Specifically, we train the neural network for 480 epochs with batch size 512 and input resolution 224. We also use a distillation-based training loss by taking EfficientNet-B3 as the teacher. Finally, we set the initial learning rate to 0.1 with a cosine annealing schedule.

4.2. VALIDATION OF THEOREMS 3.1, 3.3 & 3.5

To empirically validate Theorem 3.1, we first randomly sample 1000 training images from MNIST; we then normalize these images by their L2-norm to create the training set S. We compute the gradient w.r.t. the network parameters for each individual training sample. Next, as discussed in Theorem 3.1, we use the accumulated gradient over these samples to update the network parameters with learning rate η = 1. Then, we calculate the square sum of mean gradients and the total training loss. We repeat the above process 1000 times on the same S. As shown in Fig. 1(a), we plot the total training loss vs. the square sum of mean gradients as defined in Eq. 5. Clearly, networks with a higher square sum of mean gradients tend to have a lower training loss. In comparison, Fig. 1(b) shows that networks with a lower sum of gradient variances tend to have lower training loss values, which coincides with the conclusion drawn from Eq. 6. These results empirically validate Theorem 3.1. Moreover, we optimize a two-layer MLP with ReLU activation functions as defined in Eq. 7 on the entire training set of MNIST, applying gradient descent (Eq. 9) to update the weights. We set the batch size as 256 and measure the standard deviation of gradients (σ) w.r.t. parameters across different training batches.

For NASBench101, as shown in Fig. 3, ZiCo has a significantly higher correlation score with the real test accuracy than all the other proxies, except Zen-score. For example, ZiCo has a 0.46 Kendall's τ score, while #Params only reaches 0.31. In general, ZiCo has the highest correlation coefficients among all existing proxies for the various search spaces and datasets of NATSBench and NASBench101. To the best of our knowledge, ZiCo is the first proxy that shows a consistently higher correlation coefficient than #Params. The above results validate the effectiveness of our proposed ZiCo; thus, ZiCo can be directly used to search for optimal networks under various budgets. Next, we describe the search results in detail.
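The correlation scores above are Kendall's τ. For reference, a naive O(n²) sketch of the tau-a statistic (our own implementation; in practice a library routine such as scipy.stats.kendalltau would be used) is:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Naive Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1   # pair ranked the same way by both lists
        elif s < 0:
            discordant += 1   # pair ranked in opposite ways
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Perfectly concordant rankings give tau = 1
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))  # -> 1.0
```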
We then calculate the correlation between ZiCo and the real test accuracy. Fig. 4(a) shows that using two batches to compute ZiCo yields the highest score. Hence, in our work, we always use two batches (N = 2) to compute ZiCo, since this is both accurate and time-efficient.

4.4. ZICO

Batch size We compute ZiCo with two batches under varying batch sizes {1, 2, 4, 8, 16, 32, 64, 128} for the same 2000 networks as above; we then calculate the correlation between ZiCo and the test accuracy. Fig. 4(b) shows that a batch size of 64 is enough to stabilize the coefficient. Hence, we set the batch size as 128 and use two batches to compute ZiCo. We provide more ablation studies in Appendix F.

A PROOF OF THEOREM 3.1

Theorem 3.1. We denote the updated weight vector as \hat{a} and denote \sum_{i,j}[g_j(x_i)]^2 = G. Assume we use the accumulated gradient of all training samples and learning rate \eta to update the initial weight vector a, i.e., \hat{a} = a - \eta\sum_i g(x_i). If the learning rate satisfies 0 < \eta < 2, then the total training loss is bounded as follows:

\sum_i L(y_i, f(x_i; \hat{a})) \le \frac{G}{2} - \frac{\eta(2-\eta)M^2}{2}\sum_j \mu_j^2   (16)

In particular, if the learning rate \eta = \frac{1}{M}, then the loss after the update is bounded by:

\sum_i L(y_i, f(x_i; \hat{a})) \le \frac{M}{2}\sum_j \sigma_j^2   (17)

Proof. Given each training sample (x_i, y_i), the gradient of L w.r.t. a when taking (x_i, y_i) as the input is as follows:

g(x_i) = \frac{\partial L(y_i, f(x_i; a))}{\partial a} = x_i x_i^T a - y_i x_i   (18)

We note that (recall that \|x_i\| = 1, so x_i^T x_i = 1):

(a - g(x_i))^T x_i - y_i = a^T x_i - a^T x_i x_i^T x_i + y_i x_i^T x_i - y_i = a^T x_i - a^T x_i = 0 \;\Longrightarrow\; y_i = (a - g(x_i))^T x_i   (19)

Then the total training loss over all training samples is given by:

\sum_{i=1}^{M} \frac{1}{2}(\hat{a}^T x_i - y_i)^2   (20)

By using Eq. 19, we can rewrite Eq. 20 as follows:

\sum_{i=1}^{M} \frac{1}{2}(\hat{a}^T x_i - y_i)^2 = \sum_{i=1}^{M} \frac{1}{2}\big(\hat{a}^T x_i - (a - g(x_i))^T x_i\big)^2 = \sum_{i=1}^{M} \frac{1}{2}\big((\hat{a} - a + g(x_i))^T x_i\big)^2   (21)

Recall the assumption that \hat{a} = a - \eta\sum_k g(x_k); we rewrite Eq. 21 as follows:

\sum_{i=1}^{M} \frac{1}{2}(\hat{a}^T x_i - y_i)^2 = \sum_{i=1}^{M} \frac{1}{2}\Big(\big(g(x_i) - \eta\sum_k g(x_k)\big)^T x_i\Big)^2   (22)

B PROOF OF THEOREM 3.2

Theorem 3.2 states that, under its assumptions, at training step t the Gram matrix H(t) satisfies:

\lambda_{min}(H(t)) \ge \lambda_{min}(H(0)) - \frac{2\sqrt{2}M^2\eta t\sigma\sqrt{\Phi(1-\epsilon)}}{\sqrt{\pi}\delta} > 0   (26)

where \Phi(\cdot) is the inverse cumulative distribution function of the d-degree chi-squared distribution \chi^2(d).

Proof. We first compute the probability of \|w_r(0) - w_r(t)\| \le C. Based on the assumption that \{w_r(0), r = 1, ..., m\} follow i.i.d.
N(0, I) and the gradient for each weight follows an i.i.d. N(0, \sigma), considering the weight updating rule defined in Eq. 9, each element of w_r(0) - w_r(t) follows an i.i.d. N(0, \eta t\sigma). Therefore, \frac{\|w_r(0) - w_r(t)\|^2}{\eta^2 t^2 \sigma^2} follows the chi-squared distribution with d degrees of freedom, \chi^2(d). Hence:

P(\|w_r(0) - w_r(t)\| \le C) = P(\|w_r(0) - w_r(t)\|^2 \le C^2) = P\Big(\frac{\|w_r(0) - w_r(t)\|^2}{\eta^2 t^2 \sigma^2} \le \frac{C^2}{\eta^2 t^2 \sigma^2}\Big) = P\Big(\frac{\|w_r(0) - w_r(t)\|^2}{\eta^2 t^2 \sigma^2} \le \Phi(1-\epsilon)\Big) = 1 - \epsilon   (27)

Given an input sample x_i and a weight vector w_r(t) from W(t), we define the following event:

A_{ir} = \{\|w_r(t) - w_r(0)\| \le C\} \cap \{\mathbb{I}\{x_i^T w_r(0) \ge 0\} \ne \mathbb{I}\{x_i^T w_r(t) \ge 0\}\}   (28)

If \|w_r(t) - w_r(0)\| \le C holds true, then:

x_i^T w_r(t) = x_i^T(w_r(t) - w_r(0)) + x_i^T w_r(0) = \mathrm{sign}\big(x_i^T(w_r(t) - w_r(0))\big)\|w_r(t) - w_r(0)\| + \mathrm{sign}(x_i^T w_r(0))\|w_r(0)\|   (29)

Eq. 29 tells us that if \|w_r(0)\| is larger than \|w_r(t) - w_r(0)\|, then x_i^T w_r(0) determines the sign of x_i^T w_r(t); in other words, x_i^T w_r(t) always has the same sign as x_i^T w_r(0), i.e., \mathbb{I}\{x_i^T w_r(0) \ge 0\} = \mathbb{I}\{x_i^T w_r(t) \ge 0\}. That is, if \|w_r(t) - w_r(0)\| \le C and \mathbb{I}\{x_i^T w_r(0) \ge 0\} \ne \mathbb{I}\{x_i^T w_r(t) \ge 0\} hold true, then \|w_r(0)\| \le C. Therefore, by the anti-concentration inequality of a Gaussian distribution Du et al. (2019b):

P(A_{ir}) \le P(\{\|w_r(0)\| \le C\}) \le \frac{\sqrt{2}C}{\sqrt{\pi}}

Therefore, if every weight vector w_1, ..., w_m satisfies \|w_r(0) - w_r(t)\| \le C, we can bound the entry-wise deviation of the Gram matrix H(t) at training step t: for any (i, j) \in [M] \times [M]:

E[|H_{ij}(0) - H_{ij}(t)|] = E\Big[\frac{1}{m}\Big|x_i^T x_j \sum_{r=1}^{m}\big(\mathbb{I}\{x_i^T w_r(0) \ge 0, x_j^T w_r(0) \ge 0\} - \mathbb{I}\{x_i^T w_r(t) \ge 0, x_j^T w_r(t) \ge 0\}\big)\Big|\Big] \le E\Big[\frac{1}{m}\sum_{r=1}^{m}\mathbb{I}\{A_{ir} \cup A_{jr}\}\Big] \le P(A_{ir}) + P(A_{jr}) \le \frac{2\sqrt{2}C}{\sqrt{\pi}}   (32)

where the expectation is over the initial weights w(0).
Hence, considering all the elements in H, we have:

E\Big[\sum_{i=1,j=1}^{M,M} |H_{ij}(0) - H_{ij}(t)|\Big] \le \frac{2\sqrt{2}M^2 C}{\sqrt{\pi}}   (33)

Therefore, by Markov's inequality, with probability at least 1 - \delta, we get:

\sum_{i=1,j=1}^{M,M} |H_{ij}(0) - H_{ij}(t)| \le \frac{2\sqrt{2}M^2 C}{\sqrt{\pi}\delta}   (34)

In Du et al. (2019b), the authors prove that, given a small perturbation K: if \sum_{ij}|H_{ij}(0) - H_{ij}| \le K, then:

\lambda_{min}(H) \ge \lambda_{min}(H(0)) - K   (35)

In our case, K in Eq. 35 is given by \frac{2\sqrt{2}M^2 C}{\sqrt{\pi}\delta}. Therefore:

\lambda_{min}(H(t)) \ge \lambda_{min}(H(0)) - \frac{2\sqrt{2}M^2 C}{\sqrt{\pi}\delta} = \lambda_{min}(H(0)) - \frac{2\sqrt{2}M^2\eta t\sigma\sqrt{\Phi(1-\epsilon)}}{\sqrt{\pi}\delta}   (36)

Replacing \eta in Eq. 36 with its upper bound given in the assumption of Theorem 3.2, i.e., \eta < \frac{\lambda_0\sqrt{\pi}\delta}{2\sqrt{2}M^2\sqrt{\Phi(1-\epsilon)}\,t\sigma}, we get that \lambda_{min}(H(t)) is always larger than 0; that is:

\lambda_{min}(H(t)) \ge \lambda_{min}(H(0)) - \frac{2\sqrt{2}M^2\eta t\sigma\sqrt{\Phi(1-\epsilon)}}{\sqrt{\pi}\delta} > 0   (37)

This completes the proof.

C PROOF OF THEOREM 3.5

Theorem 3.5. Given a neural network with ReLU activation functions optimized by minimizing Eq. 8, we assume that each initial weight vector \{w_r(0), r = 1, ..., m\} is i.i.d. generated from N(0, I) and the gradient for each weight follows an i.i.d. distribution N(0, \sigma). For some positive constants \delta and \epsilon, if the learning rate \eta satisfies \eta < \frac{\lambda_0\sqrt{\pi}\delta}{2\sqrt{2}M^2\sqrt{\Phi(1-\epsilon)}\,t\sigma}, then with probability at least (1-\delta)(1-\epsilon) the following holds true: for any r \in [m], \|w_r(0) - w_r(t)\| \le C = \eta t\sigma\sqrt{\Phi(1-\epsilon)}, and at training step t, the Gram matrix H(t) satisfies:

\lambda_{max}(H(t)) \le \lambda_{max}(H(0)) + \frac{2\sqrt{2}M^2\eta t\sigma\sqrt{\Phi(1-\epsilon)}}{\sqrt{\pi}\delta}   (38)

where \Phi(\cdot) is the inverse cumulative distribution function of the d-degree chi-squared distribution \chi^2(d). The proof is similar to the proof of Theorem 3.2 (see Appendix B); we provide the entire proof below.

Proof. We first compute the probability of \|w_r(0) - w_r(t)\| \le C. Based on the assumption that \{w_r(0), r = 1, ..., m\} follow i.i.d. N(0, I) and the gradient of each weight follows an i.i.d. N(0, \sigma), considering the weight updating rule defined in Eq. 9 with learning rate \eta, each element of w_r(0) - w_r(t) follows an i.i.d. N(0, \eta t\sigma).
Therefore, \frac{\|w_r(0) - w_r(t)\|^2}{\eta^2 t^2\sigma^2} follows the chi-squared distribution with d degrees of freedom, \chi^2(d):

P(\|w_r(0) - w_r(t)\| \le C) = P(\|w_r(0) - w_r(t)\|^2 \le C^2) = P\Big(\frac{\|w_r(0) - w_r(t)\|^2}{\eta^2 t^2\sigma^2} \le \frac{C^2}{\eta^2 t^2\sigma^2}\Big) = P\Big(\frac{\|w_r(0) - w_r(t)\|^2}{\eta^2 t^2\sigma^2} \le \Phi(1-\epsilon)\Big) = 1 - \epsilon   (39)

Given an input sample x_i and a weight vector w_r(t) from W(t), we define the following event:

A_{ir} = \{\|w_r(t) - w_r(0)\| \le C\} \cap \{\mathbb{I}\{x_i^T w_r(0) \ge 0\} \ne \mathbb{I}\{x_i^T w_r(t) \ge 0\}\}   (40)

If \|w_r(t) - w_r(0)\| \le C holds true, then:

x_i^T w_r(t) = x_i^T(w_r(t) - w_r(0)) + x_i^T w_r(0) = \mathrm{sign}\big(x_i^T(w_r(t) - w_r(0))\big)\|w_r(t) - w_r(0)\| + \mathrm{sign}(x_i^T w_r(0))\|w_r(0)\|   (41)

Eq. 41 implies that if \|w_r(0)\| is larger than \|w_r(t) - w_r(0)\|, then x_i^T w_r(0) determines the sign of x_i^T w_r(t). In other words, x_i^T w_r(t) always has the same sign as x_i^T w_r(0); that is, \mathbb{I}\{x_i^T w_r(0) \ge 0\} = \mathbb{I}\{x_i^T w_r(t) \ge 0\}. Hence, if \|w_r(t) - w_r(0)\| \le C and \mathbb{I}\{x_i^T w_r(0) \ge 0\} \ne \mathbb{I}\{x_i^T w_r(t) \ge 0\} hold true, then \|w_r(0)\| \le C. Therefore, the probability of event A_{ir} satisfies:

P(A_{ir}) \le P(\{\|w_r(0)\| \le C\})   (42)

By the anti-concentration inequality of a Gaussian distribution Du et al. (2019b), we have:

P(A_{ir}) \le P(\{\|w_r(0)\| \le C\}) \le \frac{\sqrt{2}C}{\sqrt{\pi}}   (43)

Therefore, if every weight vector w_1, ..., w_m satisfies \|w_r(0) - w_r(t)\| \le C, we can bound the entry-wise deviation of the Gram matrix H(t) at training step t: for any (i, j) \in [M] \times [M]:

E[|H_{ij}(0) - H_{ij}(t)|] = E\Big[\frac{1}{m}\Big|x_i^T x_j \sum_{r=1}^{m}\big(\mathbb{I}\{x_i^T w_r(0) \ge 0, x_j^T w_r(0) \ge 0\} - \mathbb{I}\{x_i^T w_r(t) \ge 0, x_j^T w_r(t) \ge 0\}\big)\Big|\Big]   (44)

We note that all the samples in the training set S (Eq. 1) are normalized by their L2-norm; hence, we have both \|x_i\| = 1 and \|x_j\| = 1.
Therefore, using the Cauchy-Schwarz inequality, the above expression is bounded as follows:

E[|H_{ij}(0) - H_{ij}(t)|] \le E\Big[\frac{1}{m}\sum_{r=1}^{m}\mathbb{I}\{A_{ir} \cup A_{jr}\}\Big] \le P(A_{ir}) + P(A_{jr}) \le \frac{2\sqrt{2}C}{\sqrt{\pi}}   (45)

where the expectation is over the initial weights w_r(0), r \in \{1, ..., m\}. Hence, considering all the elements in H, we have:

E\Big[\sum_{i=1,j=1}^{M,M}|H_{ij}(0) - H_{ij}(t)|\Big] \le \frac{2\sqrt{2}M^2 C}{\sqrt{\pi}}   (46)

Therefore, by Markov's inequality, with probability at least 1 - \delta, we get:

\sum_{i=1,j=1}^{M,M}|H_{ij}(0) - H_{ij}(t)| \le \frac{2\sqrt{2}M^2 C}{\sqrt{\pi}\delta}   (47)

Based on matrix perturbation theory Bauer & Fike (1960); Eisenstat & Ipsen (1998), given a small perturbation K: if \sum_{ij}|H_{ij}(0) - H_{ij}(t)| \le K, then:

\lambda_{max}(H(t)) \le \lambda_{max}(H(0)) + K   (48)

In our case, K in Eq. 48 is given by \frac{2\sqrt{2}M^2 C}{\sqrt{\pi}\delta}; that is:

\lambda_{max}(H(t)) \le \lambda_{max}(H(0)) + \frac{2\sqrt{2}M^2\eta t\sigma\sqrt{\Phi(1-\epsilon)}}{\sqrt{\pi}\delta}   (49)

This completes the proof.

To empirically validate Theorem 3.5, we first create the training set S by normalizing the training samples in MNIST by their L2-norm. Next, we optimize a two-layer MLP with ReLU activation functions as defined in Eq. 7. We use the entire training set of MNIST and apply gradient descent (Eq. 9) to update the weights. We vary the batch size over {64, 128, 256} and measure the standard deviation of gradients (\sigma) w.r.t. parameters across different training batches. A very small learning rate \eta = 10^{-8} is used to satisfy the assumption in Theorem 3.5. Fig. 5 shows the training loss after one epoch vs. the standard deviation of gradients (\sigma). Clearly, the results show that a network with a lower gradient standard deviation tends to have lower test loss values and, thus, a better generalization capacity. These results empirically support our claims in Theorem 3.5.

D EXPERIMENTAL SETUP OF ZICO ON IMAGENET D.1 SEARCH SPACE

We use the commonly used MobileNetV2-based search space, where the candidate networks are built by stacking multiple Inverted Bottleneck Blocks (IBNs) with SE modules Sandler et al. (2018); Pham et al. (2018); Lin et al. (2021); all the SE modules share the same SE ratio of 0.25. For each IBN, we vary the kernel size of the depth-wise convolutional layer over {3, 5, 7} and sample the expansion ratio from {1, 2, 4, 6}. We use ReLU as the activation function. For each point-wise convolutional layer, the number of channels ranges from 8 to 1024 with a step size of 8. We use the standard Kaiming initialization for all linear and convolutional layers of every candidate network He et al. (2015).
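The search space above can be sampled as follows; this is a hypothetical sketch (the constant and function names are ours), covering only the per-block choices described in this section:

```python
import random

# Choices mirror the search space described above (names are hypothetical).
KERNEL_SIZES = (3, 5, 7)
EXPANSION_RATIOS = (1, 2, 4, 6)
CHANNEL_CHOICES = tuple(range(8, 1025, 8))   # 8..1024 with a step size of 8
SE_RATIO = 0.25                              # shared by all SE modules

def sample_ibn_block(rng):
    """Draw one Inverted Bottleneck Block (IBN) configuration."""
    return {
        "kernel_size": rng.choice(KERNEL_SIZES),
        "expansion_ratio": rng.choice(EXPANSION_RATIOS),
        "out_channels": rng.choice(CHANNEL_CHOICES),
        "se_ratio": SE_RATIO,
    }

def sample_candidate(rng, num_blocks):
    """Draw a candidate network as a stack of IBN configurations."""
    return [sample_ibn_block(rng) for _ in range(num_blocks)]

rng = random.Random(0)
net = sample_candidate(rng, num_blocks=5)
```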

D.2 SEARCH ALGORITHM

We use an Evolutionary Algorithm (EA) to conduct the zero-shot NAS because it is concise and easy to implement. As shown in Algorithm 1, we search for the neural architectures with the highest ZiCo within the search space, given a specific budget B (e.g., FLOPs). We repeat the search for T steps; at each search step, we randomly select a structure from the candidate set F and mutate its architecture (e.g., kernel size, block type, number of blocks, and layer width) to generate a new network F_i ∈ S. If the generated network F_i meets the inference budget B, we calculate its ZiCo on Z and add F_i to the candidate set F. If the number of architectures in F exceeds the threshold E, we remove the network with the smallest ZiCo from F. After T steps, we select the network with the largest ZiCo as the final (optimal) architecture F_P.

Table 3: The correlation coefficients between various zero-cost proxies and two naive proxies (#Params and FLOPs) vs. test accuracy on NATSBench-SSS and NATSBench-TSS (KT and SPR represent Kendall's τ and Spearman's ρ, respectively). The results in italics represent the values of #Params' correlation coefficients. Results better than #Params are shown in bold. Clearly, our proposed ZiCo is the only proxy that works consistently better than #Params and is generally the best among all these proxies. Both TE-NAS‡ (Chen et al. (2021b)) and NASI‡ (Shu et al. (2022b)) use NTK (Jacot et al. (2018)) as the accuracy proxy to build their own search algorithms.
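The EA loop of Algorithm 1 can be sketched as below. This is our own minimal illustration, not the paper's implementation: the function names are hypothetical, and a toy integer "architecture" with a trivial proxy stands in for a real network scored by ZiCo.

```python
import random

def evolutionary_search(sample_fn, mutate_fn, proxy_fn, budget_fn,
                        budget, steps, pool_size, rng):
    """Sketch of Algorithm 1: maintain a candidate pool F, mutate a random
    member, keep the child only if it fits the budget B, score it with a
    zero-shot proxy (e.g., ZiCo), and evict the lowest-scoring network
    whenever the pool exceeds the threshold E (= pool_size)."""
    first = sample_fn(rng)
    pool = [(proxy_fn(first), first)]           # list of (score, architecture)
    for _ in range(steps):
        parent = rng.choice(pool)[1]
        child = mutate_fn(parent, rng)
        if budget_fn(child) > budget:           # violates the inference budget B
            continue
        pool.append((proxy_fn(child), child))
        if len(pool) > pool_size:               # drop the worst candidate
            pool.remove(min(pool, key=lambda p: p[0]))
    return max(pool, key=lambda p: p[0])[1]     # architecture with highest proxy

# Toy instantiation: "architectures" are integers, the proxy rewards larger
# values, and the budget caps them at 100.
rng = random.Random(0)
best = evolutionary_search(
    sample_fn=lambda r: r.randint(0, 50),
    mutate_fn=lambda a, r: a + r.randint(-3, 5),
    proxy_fn=lambda a: a,
    budget_fn=lambda a: a,
    budget=100, steps=500, pool_size=16, rng=rng)
```

In the toy run, the returned `best` drifts toward the budget cap, mimicking how the real search climbs in ZiCo while respecting the FLOPs constraint.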

E.2 COMPARISON ON TRANSNAS-BENCH-101-MICRO

In this section, we compare our proposed ZiCo against existing proxies on more diverse tasks from the mainstream NAS benchmark TransNAS-Bench-101 Duan et al. (2021). We pick the largest search space, TransNAS-Bench-101-Micro, which contains 4096 total architectures with different cell structures. We compare ZiCo with various proxies on the following four tasks:
• Scene Classification. Scene classification is a 47-class classification task that predicts the room type in the image.
• Jigsaw. In the Jigsaw task, the input image is divided into nine patches and shuffled based on one of 1,000 predefined permutations. The target is to classify which permutation is used.
• Autoencoding. Autoencoding is a pixel-level prediction task that encodes an input image into a low-dimensional embedding vector and then reconstructs the raw image from the vector.
• Surface Normal. Similar to autoencoding, surface normal is a pixel-level prediction task that predicts surface normal statistics.
As shown in Table 5, ZiCo consistently works well on Scene Classification, Jigsaw, and Surface Normal; on these tasks, ZiCo's correlation scores are only 0.01 or 0.02 lower than the highest ones. Though Fisher works better than ZiCo on Autoencoding (ZiCo is still second best), ZiCo has significantly higher correlation scores than Fisher on the remaining three tasks. One possible reason why Fisher works best on Autoencoding is that Autoencoding is an image-to-image task; Fisher is the only proxy built on the gradient w.r.t. feature maps and can thus better extract the information between the input and output images. As noted in the main paper, we again observe that existing proxies do not achieve a high correlation on all tasks consistently. We provide some illustration figures of real test accuracy vs.
various proxies on NATSBench-SSS search space for CIFAR10 (Fig. 6 ) and ImageNet16-120 datasets(Fig. 7 ). We also show the same illustrative results (real test accuracy vs. various proxies) on NASBench101 search space in Fig. 8 .

F ABLATION STUDY

F.1 IMPACT OF MEAN AND STD

We randomly select 2000 networks from NATSBench-SSS on the CIFAR10, CIFAR100, and ImageNet16-120 datasets and compute the following proxies: (i) the mean value of gradients only; (ii) the standard deviation (STD) of gradients only; (iii) the combination of mean and STD, i.e., our proposed ZiCo. We then calculate the correlation coefficients between these proxies and the real test accuracy. As shown in Table 7, our proposed ZiCo performs better on all three datasets than using either the mean or the STD alone; hence, ZiCo is a better-designed proxy than either statistic individually.
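The three variants compared in Table 7 can be illustrated with a minimal pure-Python sketch. The layer structure, gradient values, and exact aggregation below are simplifying assumptions for illustration, not the paper's implementation; in particular, the combined score only mimics the inverse-coefficient-of-variation form of ZiCo:

```python
import math

def gradient_stats(grads_per_sample):
    # grads_per_sample: list over samples/batches of equal-length gradient vectors
    n = len(grads_per_sample)
    dim = len(grads_per_sample[0])
    means, stds = [], []
    for j in range(dim):
        vals = [g[j] for g in grads_per_sample]
        mu = sum(vals) / n
        var = sum((v - mu) ** 2 for v in vals) / n
        means.append(mu)
        stds.append(math.sqrt(var))
    return means, stds

def proxy_mean_only(layers):
    # (i) aggregate absolute mean gradient values only
    return sum(sum(abs(m) for m in gradient_stats(g)[0]) for g in layers)

def proxy_std_only(layers):
    # (ii) smaller std is better, so score the inverse std only
    eps = 1e-12
    return sum(sum(1.0 / (s + eps) for s in gradient_stats(g)[1]) for g in layers)

def proxy_combined(layers):
    # (iii) per-layer log of summed |mean|/std, an inverse coefficient of variation
    eps = 1e-12
    score = 0.0
    for g in layers:
        means, stds = gradient_stats(g)
        score += math.log(sum(abs(m) / (s + eps) for m, s in zip(means, stds)) + eps)
    return score
```

Here `layers` is a list of layers, each holding per-sample gradient vectors; the combined score is large exactly when gradients have large absolute means and small standard deviations across samples.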

F.2 SEARCH ALGORITHMS: ZERO-COST PT

In this section, we demonstrate that our proposed ZiCo can be combined with other search algorithms. We take Zero-Cost-PT (Zero-PT) as an example Xiang et al. (2021b) because it is specifically designed for zero-shot proxies and is very time-efficient. Essentially, Zero-PT first integrates all candidate networks into a supernet and assigns learnable weights to each candidate operation (same as one-shot NAS). Then, Zero-PT uses the zero-cost proxy, instead of the training accuracy, to update the weights of each candidate operation; the final architecture is generated by selecting the operations with the highest weight values.

F.3 COMPARISON WITH STANDALONE TRAINING ON IMAGENET

We set the initial learning rate as 0.4 with a cosine annealing scheduling scheme. Moreover, we train EfficientNets and the previous SOTA zero-shot NAS approach (Zen-score) under the same setup. As shown in Table 9, ZiCo outperforms all of the previous zero-shot NAS approaches. For example, when the FLOPs budget is around 600M, ZiCo achieves 77.1% Top-1 accuracy, which is 1.0% and 1.6% higher than the previous SOTA zero-shot NAS methods, i.e., Zen-score and TE-NAS, respectively. Moreover, ZiCo finds a model with similar accuracy to EfficientNet-B1, but with 100M fewer FLOPs and much lower search cost. Overall, compared to regular one-shot or multi-shot NAS methods, ZiCo achieves comparable or higher test accuracy with 5-9500× less search time.

F.4 SEARCH SPACE: DARTS

In this section, we use ZiCo to conduct zero-shot NAS on the DARTS search space. We first use Algorithm 1 to find the network with the highest ZiCo, without FLOPs budgets, on the CIFAR10 dataset. We conduct the search for 100k steps; this takes 0.7 hours on a single NVIDIA 3090 GPU (i.e., 0.03 GPU days). Then, we train the obtained network with exactly the same training setup as the original DARTS paper Liu et al. (2019); specifically, we train the neural network for 600 epochs with a batch size of 128. We only use standard data augmentation (normalization, cropping, and random flipping) together with the cutout trick. We do not use knowledge distillation or any other advanced data augmentation tricks. Finally, we set the initial learning rate as 0.025 with a cosine annealing scheduling scheme. We repeat the same experiments for Zen-score. As shown in Table 10, ZiCo outperforms previous zero-shot NAS approaches, e.g., Zen-score and TE-NAS. Moreover, compared to regular one-shot or multi-shot NAS methods, ZiCo achieves comparable or higher test accuracy with at least 10× less search time.



One can also use other methods to perform the search; see Appendix F.2.

NASI uses NTK to build its own search algorithm; here, we directly compute the correlation between NTK and the real test accuracy. We implement the code ourselves since the authors have not released it yet.

The difference between Table 4 and Table 8 comes from the search algorithm: Table 4 uses traversal search among all candidate networks, while Table 8 uses the perturbation-based Zero-Cost-PT Xiang et al. (2021b).

Most of the baseline approaches in Table 10 use the same setup as ours.



KERNEL METHODS IN NEURAL NETWORKS

Kernel methods are widely explored to analyze the convergence properties and generalization capacity of networks trained with gradient descent Neal (1996); Williams (1996); Du et al. (2019a); Lu et al. (2020); Allen-Zhu et al. (2019); Hanin & Nica (2020); Golikov et al. (2022). For example, the training of wide neural networks has been proved equivalent to the optimization of a specific kernel function Arora et al. (2019a); Lee et al. (2019a); Chizat et al. (2019); Arora et al. (2019b); Cho & Saul (2009). Moreover, for networks with specific width constraints, researchers have proved that their training convergence and generalization capacity can be described by corresponding kernels Mei et al. (2019); Zhang et al. (2019b); Garriga-Alonso et al. (2019); Du et al.

SUMMARY OF OUR THEORETICAL ANALYSIS

Theorem 3.1, Theorem 3.3, and Theorem 3.5 tell us that a network with a high training convergence speed and generalization capacity should have high absolute mean values and low standard deviation values for the gradients w.r.t. the parameters across different training samples/batches.
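This requirement can be folded into a single architecture-level score via the inverse coefficient of variation of the gradients. As a sketch (the exact definition of ZiCo appears in Sec. 3 of the main paper; the notation below is ours), write $\mu_\theta$ and $\sigma_\theta$ for the mean and standard deviation of the gradient of parameter $\theta$ across training samples/batches:

```latex
\mathrm{ZiCo}
  \;=\; \sum_{l=1}^{D} \log\!\Bigg( \sum_{\theta \in \Theta_l}
        \frac{\lvert \mu_{\theta} \rvert}{\sigma_{\theta}} \Bigg),
\qquad
\mu_{\theta} \;=\; \mathbb{E}\big[\nabla_{\theta}\mathcal{L}\big],
\qquad
\sigma_{\theta} \;=\; \sqrt{\operatorname{Var}\big[\nabla_{\theta}\mathcal{L}\big]},
```

where $D$ is the network depth, $\Theta_l$ denotes the parameters of layer $l$, and the expectation is taken over training samples/batches. A higher score thus rewards exactly the two properties above: large $|\mu_\theta|$ and small $\sigma_\theta$.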

Figure 1: Training loss vs. the squared sum of mean gradients ($\sum_j \mu_j^2$) and the sum of gradient variances ($\sum_j \sigma_j^2$) for linear networks on MNIST after one epoch. Clearly, larger mean gradient values lead to lower loss values; also, networks with smaller $\sum_j \sigma_j^2$ have lower loss values.

hold true, then $\|w_r(0)\| \le C$. Therefore, for the probability of event $A_{ir}$, we have:

$$P(A_{ir}) \le P(\{\|w_r(0)\| \le C\}) \quad (30)$$

By the anti-concentration inequality of the Gaussian distribution Du et al. (2019b), we have:

Figure 5: Test loss vs. standard deviation of gradients (σ in Eq. 13) for 500 randomly sampled two-layer MLPs with ReLU on MNIST after one training epoch. We train these networks by minimizing the MSE loss between the network outputs and the real labels. As shown, networks with smaller σ tend to have lower test loss values and thus a better generalization capacity.

Figure 6: Real test accuracy vs. various proxies on NATSBench-SSS search space for CIFAR10 dataset. τ and ρ are short for Kendall's τ and Spearman's ρ, respectively.

Figure 7: Real test accuracy vs. various proxies on NATSBench-SSS search space for ImageNet16-120 dataset. τ and ρ are short for Kendall's τ and Spearman's ρ, respectively.

Table 1: The correlation coefficients between various zero-cost proxies and two naive proxies (#Params and FLOPs) vs. test accuracy on NATSBench-TSS (KT and SPR represent Kendall's τ and Spearman's ρ, respectively). The best results are shown with bold fonts. Clearly, ZiCo is the only proxy that works consistently better than #Params and is generally the best proxy. We provide more results in Table 3 and Table 4 in Appendix E.1.

Figure 3: Correlation coefficients of various proxies vs. test accuracy on the NASBench101 search space. ZiCo has significantly higher correlation scores than all other proxies except Zen-score.

Table 2: Comparison of Top-1 accuracy of our ZiCo-based NAS against SOTA NAS methods on ImageNet under various FLOPs budgets (averaged over three runs). For the 'Method' column, 'MS' means multi-shot NAS; 'OS' is short for one-shot NAS; 'Scaling' represents network scaling methods; 'ZS' is short for zero-shot NAS. OFA ‡ is trained from scratch and reported in Moons et al. (2021).

ZICO-BASED NAS ON IMAGENET

Search Space. We use the commonly used MobileNetV2-based search space, where the candidate networks are built by stacking multiple Inverted Bottleneck blocks (IBNs) with SE modules Sandler et al. (2018); Pham et al. (2018); Lin et al. (2021). For each IBN, the kernel size of the depthwise convolutional layer is sampled from {3, 5, 7} and the expansion ratio is randomly selected from {1, 2, 4, 6}. We consider ReLU as the activation function. We use the standard Kaiming initialization for all linear and convolutional layers of every candidate network He et al. (2015). More details of the search space are given in Appendix D.

Ablation study. The correlation coefficients between: (a) ZiCo under a varying number of batches and the real test accuracy; (b) ZiCo under a varying batch size and the real test accuracy.

ZiCo achieves accuracy comparable to DONNA, but with fewer FLOPs and a 648× faster search speed Moons et al. (2021). Moreover, under a 600M FLOPs budget, ZiCo achieves 2.6% higher Top-1 accuracy than the latest one-shot NAS method (MAGIC-AT) with a 3× reduction in search time Xu et al. (2022). To make a further comparison with #Params, we also use #Params as the proxy with Algorithm 1 to conduct the search under a 450M FLOPs budget. As shown in Table 2, the network obtained by #Params has 14.6% lower accuracy than ours (63.5% vs. 78.1%). Hence, even though the correlations for ZiCo and #Params in Table 1 and the optimal networks in Table 4 are similar for small-scale datasets, ZiCo significantly outperforms naive baselines like #Params on large datasets like ImageNet. To conclude, ZiCo achieves SOTA results for zero-shot NAS and outperforms naive methods, existing zero-shot proxies, as well as several one-shot and multi-shot methods.

We remark that these results demonstrate two benefits of our proposed ZiCo: (i) Lightweight computation cost. As discussed in Sec. 3, during the search process, evaluating a given architecture only requires two backward propagation passes (which take 0.3s on an NVIDIA 3090 GPU). This computational efficiency and the exemption from training enable ZiCo to significantly reduce the search time of NAS. (ii) High correlation with the real test accuracy. As demonstrated in Sec. 4.3, ZiCo has a very high correlation with the real accuracy of architectures from various search spaces and datasets. Hence, ZiCo can accurately predict the test accuracy of diverse neural architectures and thus helps find the architectures with the best test performance.

The test accuracy of optimal architectures obtained by various zero-shot proxies (averaged over 5 runs) on NATSBench-TSS search space. The best results are shown with bold fonts.

The correlation coefficients under different proxies vs. test performance on TransNAS-Bench-101-Micro. Clearly, our proposed ZiCo is consistently very close to the best score (only 0.01 or 0.02 lower) on all tasks except Autoencoding (where ZiCo is still second best). Though Fisher works better than ZiCo on Autoencoding, ZiCo has a significantly higher score on the rest of the tasks. We note that existing proxies do not achieve a high correlation on all tasks consistently.

demonstrates the test accuracy of the best architectures found using various proxies on each of the above tasks in TransNAS-Bench-101-Micro. Once again, we see that ZiCo significantly outperforms existing proxies on all tasks except Autoencoding, where we trail Fisher by only 0.01 SSIM; nonetheless, ZiCo is second best on the Autoencoding task. Note that, similar to the correlation results in Table 5, other proxies do not consistently achieve high accuracy. For instance, while methods like Synflow or Zen-score achieve results close to ours on Scene Classification and Surface

The test performance of optimal architectures obtained by various zero-shot proxies (averaged over 5 runs) on TransNAS-Bench-101-Micro search space. The best results are shown with bold fonts.

The correlation coefficients under three different proxies vs. test accuracy on NATSBench-SSS (KT and SPR represent Kendall's τ and Spearman's ρ, respectively). Clearly, our proposed ZiCo works consistently better than using mean only and STD only on all these datasets.

The test accuracy of optimal architectures obtained by various zero-shot proxies (averaged over 5 runs) on NATSBench-TSS search space. The best results are shown with bold fonts.

Comparison of Top-1 accuracy of our ZiCo-based NAS against NAS methods with standalone training on ImageNet under various FLOPs budgets. For the 'Method' column, 'MS' represents multi-shot NAS; 'OS' is short for one-shot NAS; 'Scaling' represents network scaling methods; 'ZS' is short for zero-shot NAS. 'no KD' means we train the network without Knowledge Distillation (KD); '150E' means we train the network for 150 epochs (similarly for 350E). We note that some NAS methods use knowledge distillation to improve the test accuracy; hence, we exclude those methods from this table. The results are averaged over three runs.

Recall that Zero-PT uses the zero-cost proxy, instead of the training accuracy, to update the weights of each candidate operation; the final architecture is generated by selecting the operations with the highest weight values. We combine different accuracy proxies with Zero-PT under NASBench-201 and report the optimal architectures found with various proxies. As shown in Table 8, the architectures found via ZiCo achieve the highest test accuracy.
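The final selection step of Zero-PT (pick, per edge, the operation preferred by the proxy) can be caricatured by the sketch below; `proxy` and `candidate_ops` are hypothetical stand-ins, and the real Zero-Cost-PT operates on a supernet and discretizes edges iteratively via perturbation rather than greedily scoring operations in isolation:

```python
def select_architecture(edges, candidate_ops, proxy):
    """Greedily pick, for each edge, the operation with the highest proxy score.

    edges: list of edge identifiers in the cell
    candidate_ops: list of operation names available on every edge
    proxy: callable(partial_arch, edge, op) -> float, higher is better
    """
    arch = {}
    for edge in edges:
        # Score every candidate operation for this edge given the partial architecture
        scored = [(proxy(arch, edge, op), op) for op in candidate_ops]
        _, best_op = max(scored)
        arch[edge] = best_op
    return arch
```

Any training-free proxy (ZiCo, Zen-score, Fisher, ...) can be plugged in as `proxy`, which is exactly how the comparison in Table 8 is constructed.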

Comparison of Top-1 accuracy of our ZiCo-based NAS against NAS methods with standalone training on CIFAR10 on the DARTS search space. For the 'Method' column, 'MS' represents multi-shot NAS; 'OS' is short for one-shot NAS; 'ZS' is short for zero-shot NAS. '600E' means we train the network for 600 epochs (similarly for 800E). The results are averaged over three runs.

ACKNOWLEDGMENTS

This work was supported in part by the US National Science Foundation (NSF) grant CNS-2007284. 

CODE AVAILABILITY

Our code is available at https://github.com/SLDGroup

5. CONCLUSION

In this work, we have proposed ZiCo, a new SOTA proxy for zero-shot NAS. As the main theoretical contribution, we first reveal how the mean value and standard deviation of gradients impact the training convergence of a given architecture. Based on this theoretical analysis, we have shown that ZiCo works better than all zero-shot NAS proxies proposed so far on multiple popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS) for multiple datasets (CIFAR10/100, ImageNet16-120). In particular, we have demonstrated that ZiCo is consistently better than #Params and existing zero-shot proxies. Moreover, ZiCo enables us to find architectures with test performance competitive with representative one-shot and multi-shot NAS methods, but with much lower search costs. For example, ZiCo-based NAS can find architectures with 78.1%, 79.4%, and 80.4% test accuracy under 450M, 600M, and 1000M FLOPs budgets, respectively, on ImageNet within 0.4 GPU days.

According to the Cauchy-Schwarz inequality and $\|x_i\| = 1$, the total training loss is bounded by:

Since the above expression is always non-negative, the upper bound of the training loss satisfies:

Note that, if $0 < \eta < 2$, then $\eta(2 - \eta) > 0$. Therefore, a larger $\sum_j \mu_j^2$ term makes the upper bound of the training loss in Eq. 23 closer to 0. In other words, the higher the absolute mean values of the gradients across different training samples/batches, the lower the training loss value the model converges to; i.e., the network converges at a faster rate. In particular, if $\eta = \frac{1}{M}$, Eq. 23 can be rewritten as:

This completes our proof.

B PROOF OF THEOREM 3.2

Theorem 3.2 Given a neural network with ReLU activation function optimized by minimizing Eq. 8, we assume that each initial weight vector $\{w_r(0),\, r = 1, \ldots, n\}$ is i.i.d. generated from $N(0, I)$ and that the gradient for each weight follows an i.i.d. $N(0, \sigma)$.
For some positive constants δ and ϵ, if the learning rate η satisfies η < 2Φ(1−ϵ)tσ, then with probability at least (1−δ)(1−ϵ), the following holds true: for any r ∈

Specifically, we repeat the search $10^5$ times (i.e., $T = 10^5$) with population size E = 512. For each candidate architecture, we compute ZiCo with two batches randomly sampled from the ImageNet training set with batch size 128. In total, the $10^5$ search steps take 10 hours on a single NVIDIA 3090 GPU.
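Algorithm 1 itself is given in the main paper; the sketch below only illustrates a generic population-based proxy search with the same ingredients (population size, mutation, proxy scoring). `sample_arch`, `mutate`, and `score` are hypothetical stand-ins, not the paper's exact routines:

```python
import random

def proxy_search(sample_arch, mutate, score, steps, pop_size, seed=0):
    """Evolutionary-style proxy search: keep a population of architectures,
    mutate a random member each step, and keep the child only if its proxy
    score beats the current worst member."""
    rng = random.Random(seed)
    population = [sample_arch(rng) for _ in range(pop_size)]
    scores = [score(a) for a in population]
    for _ in range(steps):
        parent = population[rng.randrange(pop_size)]
        child = mutate(parent, rng)
        child_score = score(child)
        worst = min(range(pop_size), key=scores.__getitem__)
        if child_score > scores[worst]:
            population[worst] = child
            scores[worst] = child_score
    best = max(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]
```

In the actual search, `score` would compute ZiCo from backward passes on two randomly sampled batches, and a FLOPs budget check would gate the replacement step.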

D.3 TRAINING DETAILS

We use the same data augmentation configurations as in Pham et al. (2018): mix-up, label smoothing, random erasing, random crop/resize/flip/lighting, and AutoAugment. We use the SGD optimizer with momentum 0.9 and weight decay 4e-5. We take EfficientNet-B3 as the teacher network and use knowledge distillation to train the network. We set the initial learning rate as 0.1 and use the cosine annealing scheme to adjust the learning rate during training. We train the obtained network for 480 epochs, which takes 83 hours on a server with a 40-core Intel Xeon CPU and 8 NVIDIA 3090 GPUs.
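The cosine annealing schedule mentioned above has a standard closed form; the sketch below assumes the plain variant (no warm-up, minimum learning rate 0) with the initial learning rate 0.1 and 480 epochs from the text:

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_init, lr_min=0.0):
    """Standard cosine annealing: decays lr_init down to lr_min over total_epochs."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cos)

# Per-epoch learning rates for the setup described in the text
schedule = [cosine_annealing_lr(e, 480, 0.1) for e in range(481)]
```

The schedule starts at 0.1, passes through 0.05 at the midpoint, and decays monotonically to 0 at epoch 480.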

E SUPPLEMENTARY RESULTS ON NAS BENCHMARKS

E.1 COMPARISON WITH MORE PROXIES

In this section, we further compare our proposed ZiCo against more recently proposed proxies: KNAS (Xu et al. (2021)), NASWOT (Lopes et al. (2021)), GradSign (Zhang & Jia (2022)), and NTK (TE-NAS Chen et al. (2021b), NASI Shu et al. (2022a)). To compute the correlations, we use the official code released by the authors of the above papers to obtain the values of these proxies. As shown in Table 3, our proposed ZiCo performs better than all these proxies. For example, NASWOT and GradSign achieve correlation scores similar to ZiCo on NATSBench-TSS; however, ZiCo has significantly higher correlation scores than these two proxies on NATSBench-SSS. Besides the correlation coefficients, we also report the optimal architectures found with various proxies. As shown in Table 4, the architectures found via ZiCo have the highest test accuracy on all three datasets.
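For reference, the two rank-correlation coefficients reported throughout (Kendall's τ and Spearman's ρ) can be computed with a simple O(n²) sketch; this illustrative version assumes no ties:

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs (no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = (x[i] - x[j]) * (y[i] - y[j])
            s += (a > 0) - (a < 0)  # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

In the benchmarks, `x` would be the proxy scores of the candidate architectures and `y` their real test accuracies; both coefficients lie in [-1, 1], with 1 meaning a perfect ranking.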

