HOW TO TRAIN YOUR SUPER-NET: AN ANALYSIS OF TRAINING HEURISTICS IN WEIGHT-SHARING NAS

Abstract

Weight sharing promises to make neural architecture search (NAS) tractable even on commodity hardware. Existing methods in this space rely on a diverse set of heuristics to design and train the shared-weight backbone network, a.k.a. the super-net. Since these heuristics vary substantially across methods and have not been carefully studied, it is unclear to what extent they impact super-net training and, in turn, the weight-sharing NAS algorithms. In this paper, we disentangle super-net training from the search algorithm, isolate 14 frequently-used training heuristics, and evaluate them over three benchmark search spaces. Our analysis uncovers that several commonly-used heuristics negatively impact the correlation between super-net and stand-alone performance, whereas simple but often overlooked factors, such as proper hyper-parameter settings, are key to achieving strong performance. Equipped with this knowledge, we show that simple random search achieves performance competitive with complex state-of-the-art NAS algorithms when the super-net is properly trained.

1. INTRODUCTION

Neural architecture search (NAS) has received growing attention in the past few years, yielding state-of-the-art performance on several machine learning tasks (Liu et al., 2019a; Wu et al., 2019; Chen et al., 2019b; Ryoo et al., 2020). One of the milestones that led to the popularity of NAS is weight sharing (Pham et al., 2018; Liu et al., 2019b), which, by allowing all possible network architectures to share the same parameters, has reduced the computational requirements from thousands of GPU hours to just a few.

Figure 1 shows the two phases that are common to weight-sharing NAS (WS-NAS) algorithms: the search phase, including the design of the search space and the search algorithm; and the evaluation phase, which encompasses the final training protocol on the proxy task¹. While most works focus on developing a good sampling algorithm (Cai et al., 2019; Xie et al., 2019) or improving existing ones (Zela et al., 2020a; Nayman et al., 2019; Li et al., 2020), they tend to overlook or gloss over important factors related to the design and training of the shared-weight backbone network, i.e., the super-net. For example, the literature encompasses significant variations in learning hyper-parameter settings, batch normalization and dropout usage, capacities of the initial layers of the network, and depth of the super-net. Furthermore, some of these heuristics are directly transferred from stand-alone network training to super-net training without carefully studying their impact in this drastically different scenario. For example, the fundamental assumption of batch normalization, namely that the input data follows a slowly changing distribution whose statistics can be tracked during training, is violated in WS-NAS, but is nonetheless typically assumed to hold.

In this paper, we revisit and systematically evaluate commonly-used super-net design and training heuristics and uncover the strong influence of certain factors on the success of super-net training. To this end, we leverage three benchmark search spaces, NASBench-101 (Ying et al., 2019), NASBench-201 (Dong & Yang, 2020), and DARTS-NDS (Radosavovic et al., 2019), for which the ground-truth stand-alone performance of a large number of architectures is available.
We report the results of our experiments according to two sets of metrics: i) metrics that directly measure the quality of the super-net, such as the widely-adopted super-net accuracy² and a modified Kendall-Tau correlation between the searched architectures and their ground-truth performance, which we refer to as sparse Kendall-Tau; ii) proxy metrics such as the ability to surpass random search and the stand-alone accuracy of the model found by the WS-NAS algorithm.

Via our extensive experiments (over 700 GPU days), we uncover that (i) the training behavior of a super-net drastically differs from that of a stand-alone network, e.g., in terms of feature statistics and loss landscape, thus allowing us to define training factor settings, e.g., for batch normalization (BN) and learning rate, that are better suited for super-nets; (ii) while some neglected factors, such as the number of training epochs, have a strong impact on the final performance, others believed to be important, such as path sampling, only have a marginal effect, and some commonly-used heuristics, such as the use of low-fidelity estimates, negatively impact it; (iii) the commonly-adopted super-net accuracy is unreliable for evaluating the super-net quality.

Altogether, our work is the first to systematically analyze the impact of the diverse factors of super-net design and training, and we uncover the factors that are crucial to designing a super-net, as well as the unimportant ones. Aggregating these findings allows us to boost the performance of simple weight-sharing random search to the point where it reaches that of complex state-of-the-art NAS algorithms across all tested search spaces. We will release our code and trained models so as to establish a solid baseline to facilitate further research.

2. PRELIMINARIES AND RELATED WORK

We first introduce the necessary concepts that will be used throughout the paper. As shown in Figure 1(a), weight-sharing NAS algorithms consist of three key components: a search algorithm that samples an architecture from the search space in the form of an encoding, a mapping function f_proxy that maps the encoding into its corresponding neural network, and a training protocol for a proxy task P_proxy for which the network is optimized. To train the search algorithm, one needs to additionally define the mapping function f_ws that generates the shared-weight network. Note that the mapping f_proxy frequently differs from f_ws, since in practice the final model contains many more layers and parameters so as to yield competitive results on the proxy task. After fixing f_ws, a training protocol P_ws is required to learn the super-net. In practice, P_ws often hides factors that are critical for the final performance of an approach, such as hyper-parameter settings or the use of data augmentation strategies to achieve state-of-the-art performance (Liu et al., 2019b; Chu et al., 2019; Zela et al., 2020a). Again, P_ws may differ from P_proxy, which is used to train the architecture that has been found by the search. For example, our experiments reveal that the learning rate and the total number of epochs frequently differ due to the different training behavior of the super-net and stand-alone architectures.

Many strategies have been proposed to implement the search algorithm, such as reinforcement learning (Zoph & Le, 2017; Zoph et al., 2018), evolutionary algorithms (Real et al., 2017; Miikkulainen et al., 2019; So et al., 2019; Liu et al., 2018; Lu et al., 2018), gradient-based optimization (Liu et al., 2019b; Xu et al., 2020; Li et al., 2020), Bayesian optimization (Kandasamy et al., 2018; Jin et al., 2019; Zhou et al., 2019; Wang et al., 2020), and separate performance predictors (Liu et al., 2018; Luo et al., 2018).

Until very recently, the common trend to evaluate NAS consisted of reporting the searched architecture's performance on the proxy task (Xie et al., 2019; Real et al., 2019; Ryoo et al., 2020). This, however, hardly provides real insights about the NAS algorithms themselves, because of the many components involved in them; many factors that differ from one algorithm to another can influence the performance. In practice, the literature even commonly compares NAS methods that employ different protocols to train the final model. Li & Talwalkar (2019) and Yu et al. (2020b) were the first to systematically compare different algorithms with the same settings for the proxy task and using several random initializations. Their surprising results revealed that many NAS algorithms produce architectures that do not significantly outperform a randomly-sampled architecture. Yang et al. (2020) highlighted the importance of the training protocol P_proxy. They showed that optimizing the training protocol can improve the final architecture performance on the proxy task by three percent on CIFAR-10. This non-trivial improvement can be achieved regardless of the chosen sampler, which provides clear evidence for the importance of unifying the protocol to build a solid foundation for comparing NAS algorithms.

In parallel to this line of research, the recent series of "NASBench" works (Ying et al., 2019; Zela et al., 2020b; Dong & Yang, 2020) proposed to benchmark NAS approaches by providing a complete, tabular characterization of a search space.
This was achieved by training every realizable stand-alone architecture using a fixed protocol P_proxy. Similarly, other works proposed to provide a partial characterization by sampling and training a sufficient number of architectures in a given search space using a fixed protocol (Radosavovic et al., 2019; Zela et al., 2020a; Wang et al., 2020).

While recent advances towards systematic evaluation are promising, no work has yet thoroughly studied the influence of the super-net training protocol P_ws and the mapping function f_ws. Previous works (Zela et al., 2020a; Li & Talwalkar, 2019) performed hyper-parameter tuning to evaluate their own algorithms, and focused only on a few parameters. We fill this gap by benchmarking different choices of P_ws and f_ws and by proposing novel variations to improve the super-net quality. Recent works have shown that sub-networks of a trained super-net can surpass some human-designed models without retraining (Yu et al., 2020a; Cai et al., 2020) and that reinforcement learning can surpass the performance of random search (Bender et al., 2020). However, these findings have only been demonstrated on MobileNet-like search spaces, where one searches only for the size of the convolution kernels and the channel ratio of each layer. This is an effective approach to discover a compact network, but it does not change the fact that, on cell-based search spaces, super-net quality remains low.

3. EVALUATION METHODOLOGY

We first isolate 14 factors that need to be considered during the design and training of a super-net, and then introduce the metrics to evaluate the quality of the trained super-net. Note that these factors are agnostic to the search policy that is used after training the super-net.

3.1. DISENTANGLING THE SUPER-NET FROM THE SEARCH ALGORITHM

Our goal is to evaluate the influence of the super-net mapping f_ws and the weight-sharing training protocol P_ws. As shown in Figure 2, f_ws translates an architecture encoding, which typically consists of a discrete number of choices or parameters, into a neural network. Based on a well-defined mapping, the super-net is a network in which every sub-path has a one-to-one mapping with an architecture encoding (Pham et al., 2018). Recent works (Xu et al., 2020; Li et al., 2020; Ying et al., 2019) separate the encoding into cell parameters, which define the basic building blocks of a network, and macro parameters, which define how cells are assembled into a complete architecture.

Weight-sharing mapping f_ws. To make the search space manageable, all cell and macro parameters are fixed during the search, except for the topology of the cell and its possible operations. However, the exact choices for each of these fixed factors differ between algorithms and search spaces. We report the common factors in the left part of Table 1. They include various implementation choices, e.g., the use of convolutions with a dynamic number of channels (dynamic channeling), super-convolutional layers that support dynamic kernel sizes (OFA kernel) (Cai et al., 2020), weight-sharing batch normalization (WSBN), which tracks independent running statistics and affine parameters for different incoming edges (Luo et al., 2018), and path and global dropout (Pham et al., 2018; Luo et al., 2018; Liu et al., 2019b). They also include the use of low-fidelity estimates (Elsken et al., 2019) to reduce the complexity of super-net training, e.g., by reducing the number of layers (Liu et al., 2019b) and channels (Yang et al., 2020; Chen et al., 2019a), the portion of the training set used for super-net training (Liu et al., 2019b), or the batch size (Liu et al., 2019b; Pham et al., 2018; Yang et al., 2020).

Weight-sharing protocol P_ws. Given a mapping f_ws, different training protocols P_ws can be employed to train the super-net. Protocols can differ in the training hyper-parameters and the sampling strategies they rely on. We evaluate the different hyper-parameter choices listed in the right part of Table 1, including the initial learning rate, the hyper-parameters of batch normalization, the total number of training epochs, and the amount of weight decay. We randomly sample one path to train the super-net (Guo et al., 2019), which is also known as single-path one-shot (SPOS) or Random-NAS (Li & Talwalkar, 2019); a minimal sketch of this sampling scheme is given at the end of this subsection. The reason for this choice is that Random-NAS is equivalent to the initial state of many search algorithms (Liu et al., 2019b; Pham et al., 2018; Luo et al., 2018), some of which even freeze the sampler training so as to use random sampling to warm up the super-net (Xu et al., 2020; Dong & Yang, 2019b). Note that we also evaluated two variants of Random-NAS, but found their improvement to be only marginal; see Appendix C.2 for more detail. In our experiments, for the sake of reproducibility, we ensure that P_ws and P_proxy, as well as f_ws and f_proxy, are as close to each other as possible. For the hyper-parameters of P_ws, we cross-validate each factor following the order in Table 1, and after each validation, use the value that yields the best performance in P_proxy. For all other factors, we change one factor at a time.

Search spaces. We use three commonly-used search spaces, for which a large number of stand-alone architectures have been trained and evaluated on CIFAR-10 (Krizhevsky et al., 2009) to obtain their ground-truth performance.
In particular, we use NASBench-101 (Ying et al., 2019), which consists of 423,624 architectures and is compatible with weight-sharing NAS (Yu et al., 2020b; Zela et al., 2020b); NASBench-201 (Dong & Yang, 2020), which contains more operations than NASBench-101 but fewer nodes; and DARTS-NDS (Radosavovic et al., 2019), which contains over 10^13 architectures, of which a subset of 5,000 models was sampled and trained in a stand-alone fashion. See Appendix A.2 for a detailed discussion.
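To make the sampling scheme referenced above concrete, the following is a minimal PyTorch sketch of SPOS/Random-NAS super-net training. The supernet interface (a sample_random_encoding method and a forward pass that takes an encoding) and the exact hyper-parameter values are illustrative assumptions, not our released implementation.

```python
import torch

def train_supernet(supernet, loader, epochs, lr=0.025):
    # Standard SGD with cosine decay, as in the protocols studied here.
    opt = torch.optim.SGD(supernet.parameters(), lr=lr,
                          momentum=0.9, weight_decay=3e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            # Uniformly sample one architecture encoding (a single path).
            arch = supernet.sample_random_encoding()  # assumed helper
            opt.zero_grad()
            loss = criterion(supernet(x, arch), y)    # forward on one path
            loss.backward()  # gradients only touch the sampled path
            opt.step()
        sched.step()
```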

3.2. SPARSE KENDALL-TAU - A NOVEL SUPER-NET EVALUATION METRIC

We define a novel super-net metric, which we name sparse Kendall-Tau. It is inspired by the Kendall-Tau metric used by Yu et al. (2020b) to measure the discrepancy between the ordering of stand-alone architectures and the ordering implied by the trained super-net. An ideal super-net should yield the same ordering of architectures as the stand-alone one and would thus lead to a high Kendall-Tau. However, Kendall-Tau is not robust to negligible performance differences between architectures (c.f. Figure 3). To robustify this metric, we share the rank between two architectures if their stand-alone accuracies differ by less than a threshold (0.1% here). Since the resulting ranks are sparse, we call this metric sparse Kendall-Tau (s-KdT). Note that we also compare Kendall-Tau and Spearman correlation in Appendix A.3, and provide implementation details in Appendix A.4.

Although sparse Kendall-Tau captures the super-net quality well, it may fail in extreme cases, such as when the top-performing architectures are ranked perfectly while poor ones are ordered randomly. To account for such rare situations and ensure the soundness of our analysis, we also report additional metrics. We define two groups of metrics to holistically evaluate different aspects of a trained super-net. The first group directly evaluates the quality of the super-net, and includes sparse Kendall-Tau and the widely-adopted super-net accuracy. For the super-net accuracy, we report the average accuracy of 200 architectures on the validation set of the dataset of interest, and refer to this metric simply as accuracy. It is frequently used (Guo et al., 2019; Chu et al., 2019) to assess the quality of the trained super-net, but we will show later that it is in fact a poor predictor of the final stand-alone performance.

The metrics in the second group evaluate the search performance of a trained super-net. The first such metric is the probability to surpass random search: given the ground-truth rank r of the best architecture found after n runs and the maximum rank r_max, equal to the total number of architectures, the probability that the best architecture found is better than a randomly searched one is p = 1 - (1 - r/r_max)^n. Finally, where appropriate, we report the stand-alone accuracy of the model found by the complete WS-NAS algorithm. Concretely, we randomly sample 200 architectures, select the 3 best models based on the super-net accuracy, and query their ground-truth performance. We then report the mean accuracy of these architectures as the stand-alone accuracy. Note that the same architectures are used to compute the sparse Kendall-Tau.
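To make the two metrics concrete, here is a minimal sketch assuming numpy and scipy. Bucketing the stand-alone accuracies into bins of width equal to the threshold is one simple way to realize the shared ranks; the exact tie-handling of our implementation is described in Appendix A.4.

```python
import numpy as np
from scipy.stats import kendalltau

def sparse_kendall_tau(standalone_acc, supernet_acc, threshold=0.001):
    """s-KdT: Kendall-Tau where stand-alone accuracies that differ by
    less than `threshold` (0.1% for accuracies in [0, 1]) share a rank."""
    sparse_ranks = np.floor(np.asarray(standalone_acc) / threshold)
    tau, _ = kendalltau(sparse_ranks, np.asarray(supernet_acc))
    return tau

def prob_surpass_random(r, r_max, n):
    """Probability that the best architecture found after n runs
    (ground-truth rank r, lower is better) beats random search."""
    return 1.0 - (1.0 - r / r_max) ** n
```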

4. ANALYSIS

We provide an analysis of the impact of the factors listed in Table 1 across three different search spaces. In this section, we present the factors that are most important for performance; our analysis of the remaining factors is provided in Appendix C.

The stand-alone performance of the architecture found by a NAS algorithm is clearly the most important metric to judge its merits. However, in practice, one cannot access this metric; we would not need NAS if stand-alone performance were easy to query (the cost of computing stand-alone performance is discussed in Appendix B.2). Furthermore, stand-alone performance inevitably depends on the sampling policy, and does not directly evaluate the quality of the super-net (see Appendix B.3). Consequently, it is important to rely on metrics that are well correlated with the final performance but can be queried efficiently. To this end, we collect all our experiments and plot the pairwise correlations between the final performance, the sparse Kendall-Tau, and the super-net accuracy. As shown in Figure 4, the super-net accuracy has a low correlation with the final performance on NASBench-101 and DARTS-NDS; only on NASBench-201 does it reach a correlation of 0.52. The sparse Kendall-Tau yields a consistently higher correlation with the final performance. This is evidence that one should not focus too strongly on improving the super-net accuracy. While the sparse Kendall-Tau remains computationally heavy, it serves as a middle ground that is feasible to evaluate in real-world applications.

4.1. EVALUATION

In the following experiments, we thus mainly rely on sparse Kendall-Tau, and use final search performance as a reference only. We report the training details in Appendix B.1 and the complete results of all metrics in Appendix C.6.

4.2. BATCH NORMALIZATION IN THE SUPER-NET

Batch normalization (BN) is commonly used in standalone networks to allow for faster and more stable training. It is thus also employed in most CNN search spaces. However, BN behaves differently in the context of WS-NAS, and special care has to be taken when using it.

In a standalone network (c.f. Figure 5 (Top)), a BN layer during training computes the batch statistics µ_B and σ_B, normalizes the activations f_A(x) as (f_A(x) − µ_B)/σ_B, and finally updates the population statistics using a moving average. For instance, the mean statistic is updated as µ̂ ← γµ̂ + (1 − γ)µ_B. At test time, the stored population statistics are used to normalize the feature map. In the standalone setting, both the batch and the population statistics are unbiased estimators of the population distribution N(µ, σ).

By contrast, when training a super-net (Figure 5 (Bottom)), the population statistics computed via the running average are not unbiased estimators of the population distribution, because the effective architecture before the BN layer varies in each epoch. More formally, let f_{A_i} denote the i-th architecture. During training, the batch statistics are computed as µ_B^i = Σ_j f_{A_i}(x_j)/m, and the output features follow the distribution N(µ_B^i, σ_B^i), where the superscript i indicates that the current batch statistics depend on A_i only. The population mean statistic is then updated as µ̂ ← γµ̂ + (1 − γ)µ_B^i. However, during training, different architectures are sampled from the super-net. The population mean statistic therefore essentially becomes a weighted combination of the means of different architectures, i.e., µ̂ ← Σ_i α_i µ_B^i, where α_i is the sampling frequency of the i-th architecture. When evaluating a specific architecture A_i at test time, the estimated population statistics thus depend on the other architectures in the super-net. This leads to a train-test discrepancy.

One solution to mitigate this problem is to re-calibrate the batch statistics by recomputing them over the entire training set before the final evaluation (Yu & Huang, 2019). While the cost of doing so is negligible for a standalone network, NAS algorithms typically sample ~10^5 architectures for evaluation, which makes this approach intractable. In contrast to Dong & Yang (2020) and Bender et al. (2020), who use the training mode also during testing, we formalize a simple yet effective approach to tackle the train-test discrepancy of BN in super-net training: we leave the normalization based on batch statistics during training unchanged, but use batch statistics also during testing. Since super-net evaluation is always conducted over a complete dataset, we are free to perform inference in mini-batches of the same size as the ones used during training. This allows us to compute the batch statistics on the fly in exactly the same way as during training (see the sketch at the end of this subsection).

Figure 6 compares standard BN to our proposed modification. Using the tracked population statistics leads to many architectures with an accuracy around 10%, i.e., performing no better than random guessing. Our proposed modification significantly increases the fraction of high-performing architectures. Our results also show that the choice of fixing vs. learning an affine transformation in batch normalization should match the standalone protocol P_proxy.
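A minimal PyTorch sketch of this evaluation scheme is given below: only the BN layers are switched to training mode, so normalization uses the statistics of the current mini-batch exactly as during training, while gradients stay disabled. The supernet interface is the same illustrative assumption as before.

```python
import torch

@torch.no_grad()
def evaluate_with_batch_stats(supernet, arch, loader):
    supernet.eval()
    for m in supernet.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            # Use current batch statistics, as during training. The tracked
            # running statistics get updated as a side effect, but they are
            # never used here.
            m.train()
    correct = total = 0
    for x, y in loader:  # mini-batches of the same size as in training
        pred = supernet(x, arch).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```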

4.3. SUPER-NET LOSS LANDSCAPES

The training loss of the super-net encompasses the task losses of all possible architectures. We suspect that the training difficulty increases with the number of architectures represented by the super-net. To better study this, we visualize the loss landscape (Li et al., 2018) of the standalone network and of a super-net with n = 300 architectures. Concretely, the landscape is computed over the super-net training loss under the single-path one-shot sampling scheme, L_s(x, θ_s) = Σ_i L_s(x, θ_i), where ∪_i θ_i = θ_s (a sketch of this computation is given at the end of this subsection). Figure 7 shows that the loss landscape of the super-net is less smooth than that of a standalone architecture, which confirms our intuition. A smoother landscape indicates that optimization will converge more easily to a good local optimum; with a smooth landscape one can thus use a relatively large learning rate, whereas a less smooth landscape requires a smaller one. Our experiments further confirm this observation. In the standalone protocol P_proxy, the learning rate is set to 0.2 for NASBench-101, and to 0.1 for both NASBench-201 and DARTS-NDS. All protocols use a cosine learning rate decay. Figure 8 shows that super-net training requires lower learning rates than standalone training. The same trend holds for the other search spaces (Appendix C.1). We set the learning rate to 0.025 to be consistent across the three search spaces.

Reducing the memory footprint and training time by proposing smaller super-nets has been an active research direction, and the resulting super-nets are referred to as lower-fidelity estimates (Elsken et al., 2019). The impact of this approach on the super-net quality, however, has never been studied systematically over multiple search spaces. We compare four popular strategies in Table 2. We deliberately prolong training inversely proportionally to the computational budget saved by the low-fidelity estimate, e.g., if the number of channels is halved, we train the model for twice as many epochs. Note that this provides an upper bound on the performance of low-fidelity estimates.
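The loss-landscape computation referenced above can be sketched as follows, assuming the same illustrative supernet interface as before. Following Li et al. (2018), the weights are perturbed along two random directions (here with a per-tensor rescaling as a simplification of their filter-wise normalization), and at each grid point the SPOS loss is estimated by averaging single-path losses over sampled architectures.

```python
import torch

def rand_direction(params):
    # One random direction, rescaled per tensor to match the weight norms.
    d = [torch.randn_like(p) for p in params]
    for di, p in zip(d, params):
        di.mul_(p.norm() / (di.norm() + 1e-10))
    return d

@torch.no_grad()
def spos_loss_surface(supernet, batch, grid, n_archs=300):
    x, y = batch
    criterion = torch.nn.CrossEntropyLoss()
    base = [p.detach().clone() for p in supernet.parameters()]
    d1, d2 = rand_direction(base), rand_direction(base)
    surface = []
    for a in grid:
        row = []
        for b in grid:
            for p, p0, u, v in zip(supernet.parameters(), base, d1, d2):
                p.copy_(p0 + a * u + b * v)  # theta_s + a*d1 + b*d2
            # Monte-Carlo estimate of the SPOS loss: average the
            # single-path losses over sampled architectures.
            loss = sum(
                criterion(supernet(x, supernet.sample_random_encoding()), y)
                for _ in range(n_archs)) / n_archs
            row.append(float(loss))
        surface.append(row)
    for p, p0 in zip(supernet.parameters(), base):
        p.copy_(p0)  # restore the original weights
    return surface
```

For instance, passing grid = torch.linspace(-1, 1, 21) would produce a 21 x 21 loss surface around the current super-net weights.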

4.4. LOWER FIDELITY ESTIMATES LOWER THE RANKING CORRELATION

A commonly-used approach to reduce memory requirements is to decrease the batch size (Yang et al., 2020). Surprisingly, lowering the batch size from 256 to 64 has limited impact on the super-net accuracy, but decreases the sparse Kendall-Tau and the performance of the final searched model, the most important metric in practice. Another approach is to decrease the number of channels in the first layer (Liu et al., 2019b). This reduces the total number of parameters proportionally, since the number of channels in consecutive layers depends on that of the first one. Table 2 shows that this decreases the sparse Kendall-Tau from 0.7 to 0.5. By contrast, reducing the number of repeated cells (Pham et al., 2018; Chu et al., 2019) by one has little impact. Hence, to train a good super-net, one should avoid changes between f_ws and f_proxy, but one can reduce the batch size by a factor > 0.5 and use only one repeated cell.

The last lower-fidelity factor is the portion of the training data that is used (Liu et al., 2019b; Xu et al., 2020). Surprisingly, reducing the training portion only marginally decreases the sparse Kendall-Tau for all three search spaces. On NASBench-201, keeping only 25% of the CIFAR-10 dataset results in a 0.1 drop in sparse Kendall-Tau. This explains why DARTS-based methods typically use only 50% of the data to train the super-net but can still produce reasonable results.

4.5. DYNAMIC CHANNELING HURTS SUPER-NET QUALITY

Dynamic channeling is an implicit factor in many search spaces (Ying et al., 2019; Cai et al., 2019; Guo et al., 2019; Dong & Yang, 2019b). It refers to the fact that the number of channels of the intermediate layers depends on the number of incoming edges to the output node, as depicted in Figure 9.

Instead of sharing the channels between architectures, we propose to disable dynamic channeling completely. As the channel number only depends on the incoming edges, we separate the search space into a discrete number of sub-spaces, each with a fixed number of incoming edges (a sketch of this splitting is given below). As shown in Table 3, disabling dynamic channeling improves the sparse Kendall-Tau and the final search performance by a large margin, and yields a new state of the art on NASBench-101. We construct another baseline, in which dynamic channeling is enabled during super-net training; during validation, we compute the average sparse Kendall-Tau of each sub-space, sampling 200 architectures that share the same number of channels. We call this baseline-v2. Table 3 shows that it surpasses the original baseline by a significant margin, which further evidences the importance of disabling dynamic channeling. Nonetheless, the best option is to disable dynamic channeling during both the training and the validation phase.
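The sub-space splitting referenced above can be sketched minimally as follows. The helpers iter_encodings and num_output_edges are hypothetical; the point is only that grouping architectures by the number of edges incoming to the output node makes the channel count within each group constant, so dynamic channeling is never needed.

```python
from collections import defaultdict

def split_by_output_degree(search_space):
    """Group architecture encodings by the number of edges incoming to
    the output node; one super-net is then trained per sub-space."""
    subspaces = defaultdict(list)
    for enc in search_space.iter_encodings():  # assumed helper
        # Architectures with n incoming edges all use C_out / n channels
        # in the intermediate nodes, so within one sub-space no channel
        # resizing is ever required.
        subspaces[search_space.num_output_edges(enc)].append(enc)
    return subspaces
```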

5. HOW SHOULD YOU TRAIN YOUR SUPER-NET?

Figure 10 summarizes the influence of all tested factors on the final performance. It stands out that properly tuned hyper-parameters lead to the biggest improvements by far. Surprisingly, most other factors and techniques either have a hardly measurable effect or in some cases even lead to worse performance. Based on these findings, here is how you should train your super-net:

1. Do not use the super-net accuracy to judge the quality of your super-net. The sparse Kendall-Tau has a much higher correlation with the final search performance.
2. When batch normalization is used, do not use the moving-average statistics during evaluation. Instead, compute the statistics on the fly over a batch of the same size as used during training.
3. The loss landscape of super-nets is less smooth than that of standalone networks. Start from a smaller learning rate than in standalone training.
4. Do not use low-fidelity estimates other than moderately reducing the training set size to decrease the search time.
5. Do not use dynamic channeling in search spaces that have a varying number of channels in the intermediate nodes. Break the search space into multiple sub-spaces such that dynamic channeling is not required.

Comparison to the state of the art. Table 4 shows that carefully controlling the relevant factors and adopting the techniques proposed in Section 4 allow us to considerably improve the performance of Random-NAS. Thanks to our evaluation, we were able to show that simple Random-NAS, together with an appropriate training protocol P_ws and mapping function f_ws, yields results that are competitive with and sometimes even surpass state-of-the-art algorithms. Our results provide a strong baseline upon which future work can build.



¹ Proxy task refers to the task that neural architecture search aims to optimize for.
² Super-net accuracy refers to the mean accuracy over a small set of randomly sampled architectures during super-net training.




Figure 1: WS-NAS benchmarking. Green blocks indicate which aspects of NAS are benchmarked in different works. A search algorithm usually consists of a search space that encompasses many architectures, and a policy to select the best one. P indicates a training protocol, and f a mapping function from the search space to a neural network. (a) Early works fixed and compared the metrics on the proxy task, which does not allow for a holistic comparison between algorithms. (b) The NASBench benchmark series partially alleviates the problem by sharing the stand-alone training protocol and search space across algorithms. However, the design of the weight-sharing search space and training protocol is still not controlled. (c) We fill this gap by benchmarking existing techniques to construct and train the shared-weight backbone. We provide a controlled evaluation across three benchmark spaces.

Figure 2: Constructing a super-net.

Figure 3: Kendall-Tau vs. sparse Kendall-Tau. Kendall-Tau is not robust when many architectures have similar performance. Minor performance differences can lead to large perturbations in the ranking. Our sparse Kendall-Tau alleviates this by dismissing minor differences in performance.

Figure 4: Super-net evaluation. We collect all experiments across 3 benchmark spaces. (Top) Pairwise plots of super-net accuracy, final performance, and the sparse Kendall-Tau. Each point corresponds to statistics computed over a trained super-net. (Bottom) Spearman correlation coefficients between the metrics.

Figure 5: Batch normalization in standalone and super-net training.

Figure 7: Loss landscapes.


Figure 9: Dynamic channeling on NASBench-101. (a) A search cell with n intermediate nodes, where X and Y are the input and output nodes with C_in and C_out channels, respectively. (b) When there are n = 2 edges, the associated channel numbers decrease so that their sum equals C_out, i.e., the intermediate nodes have ⌊C_out/2⌋ channels. (c) In the general case, the number of channels in the intermediate nodes is ⌊C_out/n⌋ for n incoming edges.

A weight-sharing approach has to cope with this architecture-dependent fluctuation of the number of channels during training. Let C denote the number of channels of a given architecture, and C_max the maximum number of channels for a node across the entire search space. All existing approaches allocate C_max channels and, during training, extract a subset of these channels. The existing methods differ in how they extract the channels: Guo et al. (2019) use a fixed chunk of channels, e.g., [0 : C]; Zhang et al. (2018) randomly shuffle the channels before extracting a fixed chunk; and Dong & Yang (2019a) linearly interpolate the C_max channels into C channels using a moving average across neighboring channels. The three strategies are sketched below.
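Below is a hedged PyTorch sketch of the three channel-extraction strategies, operating on a convolution weight tensor with C_max output channels. The interpolation variant is a simplified reading of Dong & Yang (2019a), not their exact implementation.

```python
import torch
import torch.nn.functional as F

def fixed_chunk(weight, c):
    # Guo et al. (2019): always take the first c output channels.
    return weight[:c]

def shuffled_chunk(weight, c):
    # Zhang et al. (2018): shuffle the output channels, then take a chunk.
    perm = torch.randperm(weight.shape[0])
    return weight[perm][:c]

def interpolated(weight, c):
    # Dong & Yang (2019a), simplified: resample C_max channels down to c by
    # treating the channel axis as a 1D signal and linearly interpolating.
    w = weight.flatten(1).t().unsqueeze(0)        # (1, rest, C_max)
    w = F.interpolate(w, size=c, mode='linear', align_corners=False)
    return w.squeeze(0).t().reshape(c, *weight.shape[1:])
```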

Figure 10: Influence of factors on the final model. We plot the difference in percent between the searched model's performance with and without applying the corresponding factor. For the hyper-parameters of P_ws, the baseline is Random-NAS, as reported in Table 4. For the other factors, the baseline of each search space uses the best setting of the hyper-parameters. Each experiment was run at least 3 times.

Table 1: Summary of factors.



Table 3: Dynamic channeling on NASBench-101.

Table 4: Final results. Results on NASBench-101 and NASBench-201 are from Yu et al. (2020b) and Dong & Yang (2020), respectively. We report the mean over 3 runs. Note that NASBench-101 (n = 7) in Yu et al. (2020b) is identical to our setting. Our new strategy significantly surpasses the random search baseline. Random-NAS (Li & Talwalkar, 2019) is trained according to Liu et al. (2019b) for 600 epochs. Baselines include DARTS-V2 (Liu et al., 2019b), ENAS (Pham et al., 2018), NAO (Luo et al., 2018), Random-NAS (Li & Talwalkar, 2019), and GDAS (Dong & Yang, 2019b). On NASBench-201, both Random-NAS and our approach sample 100 final architectures, following Dong & Yang (2020).

Table 4 (excerpt): Random NAS (Ours) 93.12 ± 0.06, 92.71 ± 0.15, 94.26 ± 0.05, 97.08.

