HOW TO TRAIN YOUR SUPER-NET: AN ANALYSIS OF TRAINING HEURISTICS IN WEIGHT-SHARING NAS

Abstract

Weight sharing promises to make neural architecture search (NAS) tractable even on commodity hardware. Existing methods in this space rely on a diverse set of heuristics to design and train the shared-weight backbone network, a.k.a. the super-net. Since these heuristics vary substantially across methods and have not been carefully studied, it is unclear to what extent they impact super-net training and hence the weight-sharing NAS algorithms. In this paper, we disentangle super-net training from the search algorithm, isolate 14 frequently-used training heuristics, and evaluate them over three benchmark search spaces. Our analysis uncovers that several commonly-used heuristics negatively impact the correlation between super-net and stand-alone performance, whereas simple, but often overlooked, factors, such as proper hyper-parameter settings, are key to achieving strong performance. Equipped with this knowledge, we show that simple random search achieves performance competitive with complex state-of-the-art NAS algorithms when the super-net is properly trained.

1. INTRODUCTION

Neural architecture search (NAS) has received growing attention in the past few years, yielding state-of-the-art performance on several machine learning tasks (Liu et al., 2019a; Wu et al., 2019; Chen et al., 2019b; Ryoo et al., 2020). One of the milestones that led to the popularity of NAS is weight sharing (Pham et al., 2018; Liu et al., 2019b), which, by allowing all possible network architectures to share the same parameters, has reduced the computational requirements from thousands of GPU hours to just a few. Figure 1 shows the two phases that are common to weight-sharing NAS (WS-NAS) algorithms: the search phase, including the design of the search space and the search algorithm; and the evaluation phase, which encompasses the final training protocol on the proxy task¹.

While most works focus on developing a good sampling algorithm (Cai et al., 2019; Xie et al., 2019) or improving existing ones (Zela et al., 2020a; Nayman et al., 2019; Li et al., 2020), they tend to overlook or gloss over important factors related to the design and training of the shared-weight backbone network, i.e., the super-net. For example, the literature encompasses significant variations of learning hyper-parameter settings, batch normalization and dropout usage, capacities for the initial layers of the network, and depth of the super-net. Furthermore, some of these heuristics are directly transferred from stand-alone network training to super-net training without carefully studying their impact in this drastically different scenario. For example, the fundamental assumption of batch normalization that the input data follows a slowly changing distribution whose statistics can be tracked during training is violated in WS-NAS, but is nonetheless typically assumed to hold.

In this paper, we revisit and systematically evaluate commonly-used super-net design and training heuristics and uncover the strong influence of certain factors on the success of super-net training.
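To make the weight-sharing idea concrete, the following is a minimal, hypothetical sketch (not code from any of the cited methods): each super-net layer holds the weights of all candidate operations, and a sampled sub-architecture is simply a choice of one operation per layer, so every architecture that picks a given operation reuses the same parameters. The class and op names are illustrative assumptions.

```python
import random

class MixedLayer:
    """One super-net layer holding the (shared) weights of all candidate ops."""
    def __init__(self, candidate_ops):
        # candidate_ops maps an op name to its callable; in a real super-net,
        # each callable would own trainable parameters shared across all
        # architectures that select this op.
        self.ops = candidate_ops

    def forward(self, x, choice):
        # Only the sampled op is executed for this architecture.
        return self.ops[choice](x)

# Toy candidate ops: plain functions standing in for conv/pool/skip layers.
layer = MixedLayer({
    "conv3x3": lambda x: 2 * x,
    "conv1x1": lambda x: x + 1,
    "skip":    lambda x: x,
})

# Single-path training step: sample one architecture and run only that path.
arch = random.choice(list(layer.ops))
out = layer.forward(1.0, arch)
```

Under this scheme, training the super-net amortizes one set of weights over the whole search space, which is exactly why the training heuristics studied in this paper can affect every candidate architecture at once.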
To this end, we leverage three benchmark search spaces, NASBench-101 (Ying et al., 2019), NASBench-201 (Dong & Yang, 2020), and DARTS-NDS (Radosavovic et al., 2019), for which the ground-truth stand-alone performance of a large number of architectures is available. We report the results of our experiments according to two sets of metrics: i) metrics that directly measure the quality of the super-net, such as the widely-adopted super-net accuracy² and a modified Kendall-Tau correlation between the searched architectures and their ground-truth performance, which we refer to as sparse
¹ Proxy task refers to the task that neural architecture search aims to optimize on.
² The mean accuracy over a small set of randomly sampled architectures during super-net training.
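As a sketch of the ranking metric underlying the modified correlation mentioned above, the snippet below computes a plain Kendall-Tau correlation (tau-a, without tie correction) between super-net accuracies and stand-alone accuracies; it is not the paper's exact sparse variant, and the accuracy values are made up for illustration. A tau close to 1 means the super-net ranks architectures faithfully.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a over paired scores (no tie correction, for brevity)."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1   # pair ranked the same way by both score lists
        elif s < 0:
            discordant += 1   # pair ranked in opposite ways
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical numbers: super-net accuracies vs. stand-alone accuracies
# for four sampled architectures (same ranking, so tau = 1.0).
supernet_acc   = [0.62, 0.55, 0.70, 0.58]
standalone_acc = [0.93, 0.90, 0.94, 0.91]
print(kendall_tau(supernet_acc, standalone_acc))  # -> 1.0
```

A high correlation under such a metric is precisely what makes cheap super-net evaluation a usable proxy for expensive stand-alone training.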

