HOW TO TRAIN YOUR SUPER-NET: AN ANALYSIS OF TRAINING HEURISTICS IN WEIGHT-SHARING NAS

Abstract

Weight sharing promises to make neural architecture search (NAS) tractable even on commodity hardware. Existing methods in this space rely on a diverse set of heuristics to design and train the shared-weight backbone network, a.k.a. the super-net. Since these heuristics vary substantially across methods and have not been carefully studied, it is unclear to what extent they impact super-net training and, in turn, the weight-sharing NAS algorithms built on top of it. In this paper, we disentangle super-net training from the search algorithm, isolate 14 frequently-used training heuristics, and evaluate them over three benchmark search spaces. Our analysis uncovers that several commonly-used heuristics negatively impact the correlation between super-net and stand-alone performance, whereas simple but often overlooked factors, such as proper hyper-parameter settings, are key to achieving strong performance. Equipped with this knowledge, we show that simple random search achieves performance competitive with complex state-of-the-art NAS algorithms when the super-net is properly trained.

1. INTRODUCTION

Neural architecture search (NAS) has received growing attention in the past few years, yielding state-of-the-art performance on several machine learning tasks (Liu et al., 2019a; Wu et al., 2019; Chen et al., 2019b; Ryoo et al., 2020). One of the milestones that led to the popularity of NAS is weight sharing (Pham et al., 2018; Liu et al., 2019b), which, by allowing all possible network architectures to share the same parameters, has reduced the computational requirements from thousands of GPU hours to just a few. Figure 1 shows the two phases that are common to weight-sharing NAS (WS-NAS) algorithms: the search phase, including the design of the search space and the search algorithm; and the evaluation phase, which encompasses the final training protocol on the proxy task[1].

While most works focus on developing a good sampling algorithm (Cai et al., 2019; Xie et al., 2019) or improving existing ones (Zela et al., 2020a; Nayman et al., 2019; Li et al., 2020), they tend to overlook or gloss over important factors related to the design and training of the shared-weight backbone network, i.e., the super-net. For example, the literature encompasses significant variations in learning hyper-parameter settings, in batch normalization and dropout usage, in the capacity of the initial layers, and in the depth of the super-net. Furthermore, some of these heuristics are directly transferred from stand-alone network training to super-net training without carefully studying their impact in this drastically different scenario. For example, the fundamental assumption of batch normalization, that the input data follows a slowly changing distribution whose statistics can be tracked during training, is violated in WS-NAS, but is nonetheless typically assumed to hold.

In this paper, we revisit and systematically evaluate commonly-used super-net design and training heuristics and uncover the strong influence of certain factors on the success of super-net training. To this end, we leverage three benchmark search spaces, NASBench-101 (Ying et al., 2019), NASBench-201 (Dong & Yang, 2020), and DARTS-NDS (Radosavovic et al., 2019), for which the ground-truth stand-alone performance of a large number of architectures is available. We report the results of our experiments according to two sets of metrics: i) metrics that directly measure the quality of the super-net, such as the widely-adopted super-net accuracy[2] and a modified Kendall-Tau correlation between the searched architectures and their ground-truth performance, which we refer to as sparse Kendall-Tau; ii) proxy metrics such as the ability to surpass random search and the stand-alone accuracy of the model found by the WS-NAS algorithm.
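To make the ranking metric concrete, the sketch below shows one way such a sparse rank correlation could be computed. The tie-handling is our own assumption (ground-truth accuracies closer than a threshold share a rank), not the paper's exact definition, and `scipy.stats.kendalltau` is used for the final correlation.

```python
# Hedged sketch of a sparse Kendall-Tau: ground-truth accuracies that
# differ by less than `threshold` are collapsed into a shared rank, so
# that noise in the stand-alone results is not counted against the
# super-net ranking. The tie-handling here is our own assumption.
import numpy as np
from scipy.stats import kendalltau

def sparse_kendall_tau(supernet_acc, standalone_acc, threshold=0.1):
    standalone_acc = np.asarray(standalone_acc, dtype=float)
    order = np.argsort(standalone_acc)
    sparse_rank = np.empty(len(standalone_acc), dtype=int)
    rank, prev = 0, None
    for idx in order:
        if prev is not None and standalone_acc[idx] - prev >= threshold:
            rank += 1  # gap is large enough to start a new rank group
        sparse_rank[idx] = rank
        prev = standalone_acc[idx]
    tau, _ = kendalltau(supernet_acc, sparse_rank)
    return tau

# A super-net that ranks architectures correctly up to near-ties
# yields a high correlation:
print(sparse_kendall_tau([0.80, 0.82, 0.85], [92.10, 92.15, 93.00]))
```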

Via our extensive experiments (over 700 GPU days), we uncover that (i) the training behavior of a super-net drastically differs from that of a stand-alone network, e.g., in terms of feature statistics and loss landscape, thus allowing us to define training factor settings, e.g., for batch normalization (BN) and learning rate, that are better suited for super-nets; (ii) while some neglected factors, such as the number of training epochs, have a strong impact on the final performance, others believed to be important, such as path sampling, have only a marginal effect, and some commonly-used heuristics, such as the use of low-fidelity estimates, negatively impact it; (iii) the commonly-adopted super-net accuracy is unreliable for evaluating super-net quality. Altogether, our work is the first to systematically analyze the impact of the diverse factors of super-net design and training, and we uncover the factors that are crucial to designing a super-net, as well as the unimportant ones. Aggregating these findings allows us to boost the performance of simple weight-sharing random search to the point where it reaches that of complex state-of-the-art NAS algorithms across all tested search spaces. We will release our code and trained models so as to establish a solid baseline to facilitate further research.
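Finding (i) has a direct practical consequence for batch normalization: running statistics accumulated while many different architectures flow through the super-net do not describe any single sampled sub-net. A minimal PyTorch sketch of one common remedy, re-estimating the BN statistics of a sampled path on a few batches before evaluation, is shown below; the routine and its defaults are our own illustration, not the paper's released code.

```python
# Hedged sketch: re-calibrating BN running statistics for one sampled
# sub-net before evaluation. Only the BN handling is the point here;
# `model` is any network whose BN stats were polluted by weight sharing.
import torch

@torch.no_grad()
def recalibrate_bn(model, loader, num_batches=20):
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None -> cumulative moving average
    model.train()  # BN layers update running stats only in train mode
    for i, (x, _) in enumerate(loader):
        if i >= num_batches:
            break
        model(x)
    model.eval()
```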

2. PRELIMINARIES AND RELATED WORK

We first introduce the necessary concepts that will be used throughout the paper. As shown in Figure 1(a), weight-sharing NAS algorithms consist of three key components: a search algorithm that samples an architecture from the search space in the form of an encoding, a mapping function f_proxy that maps the encoding to its corresponding neural network, and a training protocol for a proxy task P_proxy on which the network is optimized. To train the search algorithm, one additionally needs to define the mapping function f_ws that generates the shared-weight network. Note that f_proxy frequently differs from f_ws, since in practice the final model contains many more layers and parameters so as to yield competitive results on the proxy task. After fixing f_ws, a training protocol P_ws is required to learn the super-net. In practice, P_ws often hides factors that are critical for the final performance of an approach, such
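To make these components concrete, the following sketch shows a bare-bones instantiation of P_ws with uniform single-path sampling, one widely-used choice. The `supernet(x, arch)` interface, the `sample_encoding` sampler, and the hyper-parameter values are hypothetical placeholders for f_ws, the search-space encoding, and the settings studied in this paper.

```python
# Hedged sketch of a super-net training protocol P_ws: for every batch,
# a single architecture encoding is drawn at random and only that path
# of the shared weights receives gradients. All names are placeholders.
import torch

def train_supernet(supernet, sample_encoding, loader, epochs=250, lr=0.025):
    opt = torch.optim.SGD(supernet.parameters(), lr=lr,
                          momentum=0.9, weight_decay=3e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            arch = sample_encoding()        # one random architecture
            opt.zero_grad()
            loss = criterion(supernet(x, arch), y)
            loss.backward()                 # gradients reach only this path
            opt.step()
        sched.step()
```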



[1] The proxy task refers to the task that neural architecture search aims to optimize. [2] The super-net accuracy is the mean accuracy over a small set of randomly sampled architectures during super-net training.
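For concreteness, the super-net accuracy defined in the second footnote could be computed as in the sketch below, under the same hypothetical `supernet(x, arch)` interface as before.

```python
# Hedged sketch: super-net accuracy = mean top-1 accuracy of the shared
# weights over a small set of randomly sampled architectures.
import torch

@torch.no_grad()
def supernet_accuracy(supernet, sample_encoding, loader, num_archs=200):
    supernet.eval()
    accs = []
    for _ in range(num_archs):
        arch = sample_encoding()
        correct = total = 0
        for x, y in loader:
            pred = supernet(x, arch).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
        accs.append(correct / total)
    return sum(accs) / len(accs)
```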




Figure 1: WS-NAS benchmarking. Green blocks indicate which aspects of NAS are benchmarked in different works. A search algorithm usually consists of a search space that encompasses many architectures and a policy to select the best one. P indicates a training protocol, and f a mapping function from the search space to a neural network. (a) Early works fixed and compared the metrics on the proxy task, which does not allow for a holistic comparison between algorithms. (b) The NASBench benchmark series partially alleviates this problem by sharing the stand-alone training protocol and search space across algorithms. However, the design of the weight-sharing search space and training protocol is still not controlled. (c) We fill this gap by benchmarking existing techniques to construct and train the shared-weight backbone. We provide a controlled evaluation across three benchmark spaces.

