EXPLORING SINGLE-PATH ARCHITECTURE SEARCH RANKING CORRELATIONS

Abstract

Recently presented benchmarks for Neural Architecture Search (NAS) provide the results of training thousands of different architectures in a specific search space, thus enabling fair and rapid comparison of different methods. Based on these results, we quantify the ranking correlations of single-path architecture search methods in different search space subsets and under several training variations, and study their impact on the expected search results. The experiments support the few-shot approach and Linear Transformers, and provide evidence against disabling cell topology sharing during the training phase or using regularization and other common training additions in the NAS-Bench-201 search space. Additionally, we find that super-network size and path sampling strategies require further research to be better understood.

1. INTRODUCTION

The development and study of algorithms that automatically design neural networks, Neural Architecture Search (NAS), has become a significant influence in recent years, owing to the promise of creating better models with less human effort and in shorter time. Whereas the first generations of algorithms required training thousands of networks over thousands of GPU hours, using reinforcement learning (Zoph & Le (2016); Zoph et al. (2018)), greedy progressive optimization (Liu et al. (2018a)), regularized evolution (Real et al. (2018)) and more, the invention of weight sharing during search (Pham et al. (2018)) reduced the computation cost to a few GPU hours and thus made NAS accessible to a much wider audience. While this also enables gradient-based NAS (Liu et al. (2018b)), the necessity to compare operations against each other leads to an increased memory requirement. The issue is commonly alleviated by training a small search network consisting of cells with a shared topology, and later scaling the resulting architecture up by adding more cells and increasing the number of channels. Although the standalone network is often trained from scratch, reusing the search network weights can increase both training speed and final accuracy (Yan et al. (2019; 2020)) and can even be applied directly to huge data sets. However, the aforementioned weight-sharing methods only yield a single result, require manually fine-tuning the loss function when there are multiple objectives, and cannot guarantee results within constraints (e.g. latency, FLOPs). The single-path one-shot approach seeks to combine the best of both worlds, requiring only one additional step in the search phase (Guo et al. (2020)): First, a full-sized weight-sharing model (super-network) is fully trained by randomly choosing one of the available operations at each layer in every training step.
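The single-path training scheme described above can be illustrated with a minimal, framework-free sketch. The operation names and toy arithmetic functions below are purely hypothetical stand-ins for real neural operations; an actual implementation would compute a loss and update the shared weights at the marked step.

```python
import random

# Toy super-network: each layer holds several candidate operations whose
# (shared) parameters persist across training steps. The operations here
# are illustrative stand-ins, not the original implementation.
CANDIDATE_OPS = {
    "conv3x3": lambda x: 2 * x,
    "conv1x1": lambda x: x + 1,
    "skip":    lambda x: x,
}

def sample_path(num_layers, rng):
    """Single-path sampling: pick exactly one operation per layer."""
    return [rng.choice(sorted(CANDIDATE_OPS)) for _ in range(num_layers)]

def forward(path, x):
    """Run the input through the sampled single-path sub-network."""
    for op_name in path:
        x = CANDIDATE_OPS[op_name](x)
    return x

def train_steps(num_layers, steps, rng):
    """Each training step activates (and would update) only one path."""
    visited = set()
    for _ in range(steps):
        path = sample_path(num_layers, rng)
        _ = forward(path, x=1.0)  # loss computation and backward pass go here
        visited.add(tuple(path))
    return visited
```

Over sufficiently many steps, uniform sampling visits every path, so all candidate operations receive training signal.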
Then, since specific architectures can be evaluated by choosing the model's operations accordingly, a hyper-parameter optimization method can be used to find combinations of operations that maximize the super-network accuracy. If the rankings of the architectures by their respective super-network accuracy and by their stand-alone retraining results are consistent, the quality of the discovered candidates is high. However, since the single-path method's search spaces are often gigantic and the network training costly (see e.g. Guo et al. (2020); Chu et al. (2019b; a)), studies of the ranking correlation are usually limited to a handful of architectures. In this work we study the single-path one-shot super-network predictions and ranking correlation throughout an entire search space, as all stand-alone model results are known in advance. This enables us to quantify the effects of several super-network training variations and search space subsets, and to gain further insights into the popular single-path one-shot method itself. We briefly list the closest related work in Section 2 and introduce the measurement metric, benchmark dataset, super-network training and experiment design in Section 3. We then systematically evaluate several variations of the single-path one-shot approach with a novel method, computing the ranking correlation of the trained super-networks with the ground-truth top-N best architectures. Experiments on search space subsets in Section 4.1 once again demonstrate that ranking becomes more difficult as the search space increases in size, and that the operations that make ranking especially hard are Zero and Pool. Section 4.2 evaluates Linear Transformers (Chu et al. (2019a)), which we find to perform very well in specific search space subsets, but to be harmful otherwise.
Furthermore, some commonly used training variations such as learning rate warmup, gradient clipping, data augmentation and regularization are evaluated in Section 4.3, where we find that none of them provides a measurable improvement. We further test disabling cell topology sharing during training only, and find that training the network in the same way as it is evaluated is more effective. We finally list some grains of salt in Section 5 and conclude in Section 6.
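The second phase of the approach, treating the trained super-network as an accuracy predictor and searching over operation combinations, can be sketched as a simple random search. The `super_net_accuracy` proxy below is entirely hypothetical; in practice it would evaluate the trained super-network on validation data with the chosen operations activated.

```python
import random

def super_net_accuracy(path):
    """Hypothetical stand-in for evaluating the trained super-network
    with the operations given by `path` activated at each layer."""
    score = {"conv3x3": 0.4, "conv1x1": 0.3, "skip": 0.2}
    return sum(score[op] for op in path) / len(path)

def random_search(num_layers, ops, trials, rng):
    """Phase two of single-path one-shot NAS: hyper-parameter search over
    operation choices, keeping the path with the best predicted accuracy."""
    best_path, best_acc = None, float("-inf")
    for _ in range(trials):
        path = tuple(rng.choice(ops) for _ in range(num_layers))
        acc = super_net_accuracy(path)
        if acc > best_acc:
            best_path, best_acc = path, acc
    return best_path, best_acc
```

Any black-box optimizer (evolutionary search, Bayesian optimization) can replace the random sampler here; the search result is only as good as the super-network's ranking of the candidates, which is exactly the correlation this paper measures.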

2. RELATED WORK

A high quality architecture ranking prediction is the foundation of any NAS algorithm. In this paper we explore the effects of several super-network training variations on the ranking prediction of the aforementioned single-path one-shot approach (Guo et al. (2020); Hu et al. (2020)). More recent gradient-based methods require only one path in memory (Dong & Yang (2019); Cai et al. (2019); Hu et al. (2020)). Recent efforts have shown improvements by strictly fair operation sampling in the super-network training phase (Chu et al. (2019b)) and by adding a linear 1×1 convolution to skip connections, improving training stability (Chu et al. (2019a)). Other works divide the search space, exploring multiple models with different operation subsets (Zhao et al. (2020)), or one model with several smaller blocks that use a trained teacher as a guiding signal (Li et al. (2020b)). Due to the often gigantic search spaces and the inherent randomness of network training and hyper-parameter optimization algorithms, the reproducibility of NAS methods has become a major concern. NAS benchmarks attempt to alleviate this issue by providing statistics (e.g. validation loss, accuracy and latency) of several thousand different networks on multiple data sets (Ying et al. (2019); Dong & Yang (2020)), providing the ground-truth training results that we use for our evaluation.

3.1. METRIC

As we correlate the super-network accuracy prediction with the benchmark results, but are only interested in a correct ranking, we need a ranking correlation metric. We choose Kendall's Tau (τ, KT), a commonly used ranking metric (Sciuto et al. (2019); Chu et al. (2019b)) that counts how often pairs of observations (x_i, y_i) and (x_j, y_j)

1. are concordant, agreeing on a sorting order (x_i < x_j and y_i < y_j; or x_i > x_j and y_i > y_j),
2. are discordant, disagreeing on a sorting order (x_i < x_j and y_i > y_j; or x_i > x_j and y_i < y_j),
3. are neither,

and is then calculated from their difference, normalized by the number of possible pairs:

τ = ((num concordant) - (num discordant)) / (n choose 2)

τ ranges from -1 in perfect disagreement to +1 in perfect agreement, and is around zero for independent X and Y. A small selection of experiments that use additional metrics can be found in Appendix D.
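The Kendall's Tau metric of Section 3.1 can be computed directly from its pairwise definition; a minimal sketch, following the plain formula without any correction for ties:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's Tau: (concordant - discordant) / (n choose 2).
    Tied pairs count as neither concordant nor discordant, matching
    the definition in Section 3.1 (the so-called tau-a variant)."""
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give τ = 1, fully reversed rankings give τ = -1, and a single swapped pair among three items yields τ = 1/3.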

