RETHINKING ARCHITECTURE SELECTION IN DIFFERENTIABLE NAS

Abstract

Differentiable Neural Architecture Search is one of the most popular Neural Architecture Search (NAS) methods for its search efficiency and simplicity, accomplished by jointly optimizing the model weights and architecture parameters in a weight-sharing supernet via gradient-based algorithms. At the end of the search phase, the operations with the largest architecture parameters are selected to form the final architecture, under the implicit assumption that the values of the architecture parameters reflect the operation strength. While much has been discussed about the supernet's optimization, the architecture selection process has received little attention. We provide empirical and theoretical analysis to show that the magnitude of architecture parameters does not necessarily indicate how much an operation contributes to the supernet's performance. We propose an alternative perturbation-based architecture selection that directly measures each operation's influence on the supernet. We re-evaluate several differentiable NAS methods with the proposed architecture selection and find that it consistently extracts significantly improved architectures from the underlying supernets. Furthermore, we find that several failure modes of DARTS can be greatly alleviated with the proposed selection method, indicating that much of the poor generalization observed in DARTS can be attributed to the failure of magnitude-based architecture selection rather than entirely to the optimization of its supernet.

1. INTRODUCTION

Neural Architecture Search (NAS) has been drawing increasing attention in both academia and industry for its potential to automate the process of discovering high-performance architectures, which have long been handcrafted. Early works on NAS deploy evolutionary algorithms (Stanley & Miikkulainen, 2002; Real et al., 2017; Liu et al., 2017) and reinforcement learning (Zoph & Le, 2017; Pham et al., 2018; Zhong et al., 2018) to guide the architecture discovery process. Recently, several one-shot methods have been proposed that significantly improve the search efficiency (Brock et al., 2018; Guo et al., 2019; Bender et al., 2018). As a particularly popular instance of one-shot methods, DARTS (Liu et al., 2019) enables the search process to be performed with a gradient-based optimizer in an end-to-end manner. It applies continuous relaxation to transform the categorical choice of architectures into continuous architecture parameters α. The resulting supernet can be optimized via gradient-based methods, and the operations associated with the largest architecture parameters are selected to form the final architecture. Despite its simplicity, several works cast doubt on the effectiveness of DARTS. For example, a simple randomized search (Li & Talwalkar, 2019) outperforms the original DARTS; Zela et al. (2020) observe that DARTS degenerates to networks filled with parameter-free operations such as skip connections or even random noise, leading to the poor performance of the selected architectures. While the majority of previous research attributes the failure of DARTS to its supernet optimization (Zela et al., 2020; Chen & Hsieh, 2020; Chen et al., 2021), little has been discussed about the validity of another important assumption: that the value of α reflects the strength of the underlying operation. In this paper, we conduct an in-depth analysis of this problem.
Surprisingly, we find that in many cases, α does not really indicate operation importance in a supernet. Firstly, the operation associated with a larger α does not necessarily result in higher validation accuracy after discretization. Secondly, as an important example, we show mathematically that the domination of the skip connection observed in DARTS (i.e., α_skip becoming larger than the α of other operations) is in fact a reasonable outcome of the supernet's optimization, but becomes problematic when we rely on α to select the best operation. If α is not a good indicator of operation strength, how should we select the final architecture from a pretrained supernet? Our analysis indicates that the strength of each operation should instead be evaluated based on its contribution to the supernet's performance. To this end, we propose an alternative perturbation-based architecture selection method. Given a pretrained supernet, the best operation on an edge is selected and discretized based on how much it perturbs the supernet's accuracy; the final architecture is derived edge by edge, with fine-tuning in between so that the supernet remains converged for every operation decision. We re-evaluate several differentiable NAS methods (DARTS (Liu et al., 2019), SDARTS (Chen & Hsieh, 2020), SGAS (Li et al., 2020)) and show that the proposed selection method consistently extracts significantly improved architectures from the supernets compared with the magnitude-based counterpart. Furthermore, we find that the robustness issues of DARTS can be greatly alleviated by replacing magnitude-based selection with the proposed perturbation-based selection method.
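As a concrete illustration, the edge-wise selection loop described above can be sketched as follows. The `supernet_val_acc` stand-in and its per-operation accuracy contributions are hypothetical placeholders; in practice the score comes from evaluating the converged supernet on held-out validation data, with fine-tuning after each edge is discretized:

```python
# Hypothetical stand-in for evaluating the supernet's validation accuracy
# with a given set of operations kept active on one edge. In the real
# method this is a forward pass over the validation set.
def supernet_val_acc(active_ops):
    # Toy model: each op contributes a fixed amount of accuracy.
    contribution = {"skip_connect": 0.01, "sep_conv_3x3": 0.05, "none": 0.0}
    return 0.80 + sum(contribution[o] for o in active_ops)

def select_op_on_edge(candidate_ops, eval_fn):
    """Perturbation-based selection for a single edge: mask each candidate
    in turn and keep the one whose removal hurts accuracy the most."""
    base = eval_fn(candidate_ops)
    drops = {}
    for op in candidate_ops:
        remaining = [o for o in candidate_ops if o != op]
        drops[op] = base - eval_fn(remaining)  # accuracy lost without `op`
    return max(drops, key=drops.get)           # most influential operation

ops = ["skip_connect", "sep_conv_3x3", "none"]
print(select_op_on_edge(ops, supernet_val_acc))  # -> sep_conv_3x3
```

In the full procedure this loop runs edge by edge, with the supernet fine-tuned between edge decisions so that each measurement is taken on a converged network.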

2. BACKGROUND AND RELATED WORK

Preliminaries of Differentiable Architecture Search (DARTS) We start by reviewing the formulation of DARTS. DARTS' search space consists of repetitions of cell-based microstructures. Every cell can be viewed as a DAG with N nodes and E edges, where each node represents a latent feature map $x_i$, and each edge is associated with an operation o (e.g. skip_connect, sep_conv_3x3) from the search space $\mathcal{O}$. Continuous relaxation is then applied to this search space. Concretely, every operation on an edge is activated during the search phase, with their outputs mixed by the architecture parameters α to form the edge's mixed output $m(x_i) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'})} \, o(x_i)$. This particular formulation allows the architecture search to be performed in a differentiable manner: DARTS jointly optimizes α and the model weights w with the bilevel objective (1) via alternating gradient updates. We refer to the continuously relaxed network used in the search phase as the supernet of DARTS. At the end of the search phase, the operation associated with the largest $\alpha_o$ on each edge is selected from the supernet to form the final architecture.

Progressive search space shrinking There is a line of NAS research that focuses on reducing the search cost and aligning the model sizes of the search and evaluation phases via progressive search space shrinking (Liu et al., 2018; Li et al., 2019; Chen et al., 2021; Li et al., 2020). The general scheme of these methods is to sequentially prune weak operations and edges during the search phase, based on the magnitude of α following DARTS. Our method is orthogonal to them in this respect, since we select operations based on how much they contribute to the supernet's performance rather than their α values.
Although we also discretize edges greedily and fine-tune the network in between, the purpose is to let the supernet recover from the loss of accuracy after discretization to accurately evaluate operation strength on the next edge, rather than to reduce the search cost.
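For reference, the softmax-weighted mixture used in DARTS' continuous relaxation can be sketched in a few lines. The toy operations below are illustrative stand-ins for the real candidate set (e.g. skip_connect, sep_conv_3x3, zero):

```python
import numpy as np

def softmax(alpha):
    """Numerically stable softmax over one edge's architecture parameters."""
    e = np.exp(alpha - alpha.max())
    return e / e.sum()

def mixed_output(x, ops, alpha):
    """Continuous relaxation of an edge: the output is the
    softmax(alpha)-weighted sum of every candidate operation's output."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Toy edge with three candidate operations.
ops = [lambda x: x,                  # identity / skip connection
       lambda x: 2.0 * x,            # stand-in for a parametric op
       lambda x: np.zeros_like(x)]   # the zero op
alpha = np.zeros(3)                  # uniform α -> each op weighted 1/3
x = np.ones(4)
print(mixed_output(x, ops, alpha))   # (1/3)(x + 2x + 0) = x -> [1. 1. 1. 1.]
```

Because the mixture is differentiable in both α and the operations' weights, the whole supernet can be trained end-to-end with standard gradient methods.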



$$\min_{\alpha} \; \mathcal{L}_{val}(w^*, \alpha) \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{train}(w, \alpha) \tag{1}$$

Failure mode analysis of DARTS Several works cast doubt on the robustness of DARTS. Zela et al. (2020) test DARTS on four different search spaces and observe significantly degenerated performance. They empirically find that the selected architectures perform poorly when DARTS' supernet falls into high-curvature areas of the validation loss (captured by large dominant eigenvalues of the Hessian $\nabla^2_{\alpha,\alpha} \mathcal{L}_{val}(w, \alpha)$). While Zela et al. (2020) relate this problem to the failure of supernet training in DARTS, we examine it from the architecture selection aspect of DARTS, and show that much of DARTS' robustness issue can be alleviated by a better architecture selection method.
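The dominant eigenvalue of $\nabla^2_{\alpha,\alpha} \mathcal{L}_{val}$ used as a curvature indicator can be estimated without materializing the full Hessian, e.g. via power iteration over Hessian-vector products (which autodiff frameworks can compute directly). A minimal sketch under that assumption, using an explicit toy Hessian in place of a real `hvp` callable:

```python
import numpy as np

def dominant_eigenvalue(hvp, dim, iters=100, seed=0):
    """Power iteration using only Hessian-vector products `hvp`, as one
    would obtain from automatic differentiation instead of forming the
    full Hessian of L_val with respect to alpha."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)   # repeatedly apply H and renormalize
    return v @ hvp(v)                 # Rayleigh quotient at convergence

# Toy symmetric Hessian with known spectrum, standing in for the real one.
H = np.diag([5.0, 2.0, 0.5])
print(dominant_eigenvalue(lambda v: H @ v, dim=3))  # -> ~5.0
```

Note that plain power iteration returns the eigenvalue of largest magnitude; for the positive-curvature regions discussed above this coincides with the largest eigenvalue.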

