EXPLORING SINGLE-PATH ARCHITECTURE SEARCH RANKING CORRELATIONS

Abstract

Recently presented benchmarks for Neural Architecture Search (NAS) provide the results of training thousands of different architectures in a specific search space, thus enabling the fair and rapid comparison of different methods. Based on these results, we quantify the ranking correlations of single-path architecture search methods in different search space subsets and under several training variations, and study their impact on the expected search results. The experiments support the few-shot approach and Linear Transformers, and provide evidence against disabling cell topology sharing during the training phase or using regularization and other common training additions in the NAS-Bench-201 search space. Additionally, we find that super-network size and path sampling strategies require further research to be understood better.

1. INTRODUCTION

The development and study of algorithms that automatically design neural networks, Neural Architecture Search (NAS), has gained significant influence in recent years, owed to the promise of creating better models with less human effort and in shorter time. Whereas the first generations of algorithms required training thousands of networks over thousands of GPU hours, using reinforcement learning (Zoph & Le (2016); Zoph et al. (2018)), greedy progressive optimization (Liu et al. (2018a)), regularized evolution (Real et al. (2018)) and more, the invention of weight sharing during search (Pham et al. (2018)) reduced the computation cost to a few GPU hours and thus made NAS accessible to a much wider audience. While this also enables gradient-based NAS (Liu et al. (2018b)), the necessity to compare operations against each other leads to an increased memory requirement. The issue is commonly alleviated by training a small search network consisting of cells with a shared topology, later scaling the resulting architecture up by adding more cells and increasing the number of channels. Although the stand-alone network is often trained from scratch, reusing the search network weights can increase both training speed and final accuracy (Yan et al. (2019); Hu et al. (2020)). More recent gradient-based methods require only one path in memory at a time (Dong & Yang (2019); Cai et al. (2019); Hu et al. (2020)) and can even be applied directly to huge data sets.

However, the aforementioned weight-sharing methods yield only a single result, require manually fine-tuning the loss function when there are multiple objectives, and cannot guarantee results within constraints (e.g. latency, FLOPs). The single-path one-shot approach seeks to combine the best of both worlds, requiring only one additional step in the search phase (Guo et al. (2020)): First, a full-sized weight-sharing model (super-network) is trained by randomly choosing one of the available operations at each layer in every training step. Then, since specific architectures can be evaluated by selecting the model's operations accordingly, a hyper-parameter optimization method can be used to find combinations of operations that maximize the super-network accuracy. If the rankings of the architectures by their respective super-network accuracy and by their stand-alone model retraining results are consistent, the quality of the discovered candidates is high. However, since the single-path methods' search spaces are often gigantic and the network training costly (see e.g. Guo et al. (2020); Chu et al. (2019b; a)), a study of the ranking correlation is usually limited to a handful of architectures. In this work, we study the single-path one-shot super-network predictions and ranking correlations throughout an entire search space, as the stand-alone model retraining results for every architecture are provided by the benchmark.
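As a toy illustration of the single-path sampling scheme described above (not the implementation used in any of the cited works), the following sketch samples one candidate operation per layer in each step, so that only the chosen path is executed. The operations, layer count, and input value are invented for the example:

```python
import random

# Hypothetical candidate operations per layer; simple scalar functions
# stand in for convolution variants in a real super-network.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "square":   lambda x: x * x,
}

def sample_path(num_layers, rng):
    """Uniformly sample one operation name per layer (a single path)."""
    return [rng.choice(sorted(CANDIDATE_OPS)) for _ in range(num_layers)]

def forward(path, x):
    """Run the input through the sampled path only; in every training
    step, a different random path (and its weights) would be updated."""
    for op_name in path:
        x = CANDIDATE_OPS[op_name](x)
    return x

rng = random.Random(0)
# One "training step": sample a path, then run the forward pass.
path = sample_path(num_layers=3, rng=rng)
out = forward(path, 1.5)
```

After training, a hyper-parameter optimizer would query `forward` with specific (no longer random) paths to rank architectures by their super-network accuracy.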


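The consistency between the two rankings, super-network accuracy versus stand-alone retraining accuracy, is typically summarized by a rank correlation coefficient such as Kendall's tau. A minimal, self-contained sketch with invented accuracy values for five architectures:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two equally long score lists:
    +1 for identical rankings, -1 for fully reversed ones."""
    assert len(a) == len(b)
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in both
        elif s < 0:
            discordant += 1   # the pair is ordered oppositely
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical accuracies for five architectures (values are made up).
supernet_acc   = [0.71, 0.69, 0.74, 0.66, 0.72]
standalone_acc = [0.90, 0.88, 0.93, 0.84, 0.89]

tau = kendall_tau(supernet_acc, standalone_acc)  # -> 0.8
```

A tau close to +1 means the super-network reliably identifies the same top architectures as costly stand-alone retraining would, which is exactly the property studied over entire benchmark search spaces in this work.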