Λ-DARTS: MITIGATING PERFORMANCE COLLAPSE BY HARMONIZING OPERATION SELECTION AMONG CELLS

Abstract

Differentiable neural architecture search (DARTS) is a popular method for neural architecture search (NAS) that performs cell-search and utilizes continuous relaxation to improve search efficiency via gradient-based optimization. The main shortcoming of DARTS is performance collapse, where the discovered architecture suffers from a pattern of declining quality during search. Performance collapse has become an important topic of research, with many methods trying to solve the issue through either regularization or fundamental changes to DARTS. However, the weight-sharing framework used for cell-search in DARTS and the convergence of its architecture parameters have not been analyzed yet. In this paper, we provide a thorough and novel theoretical and empirical analysis of DARTS and its point of convergence. We show that DARTS suffers from a specific structural flaw due to its weight-sharing framework that limits the convergence of DARTS to saturation points of the softmax function. This point of convergence gives an unfair advantage to layers closer to the output in choosing the optimal architecture, causing performance collapse. We then propose two new regularization terms that aim to prevent performance collapse by harmonizing operation selection via aligning the gradients of layers. Experimental results on six different search spaces and three different datasets show that our method (Λ-DARTS) does indeed prevent performance collapse, providing justification for our theoretical analysis and the proposed remedy.

1. INTRODUCTION

With the growth in popularity of deep learning models, neural architecture design has become one of the most important challenges of machine learning. Neural architecture search (NAS), a now prominent branch of AutoML, aims to perform neural architecture design in an automatic way (Elsken et al., 2019; Ren et al., 2021). Initially, the problem of NAS was addressed through reinforcement learning or evolutionary algorithms by the seminal works of Zoph & Le (2017); Baker et al. (2017); Stanley & Miikkulainen (2002); Real et al. (2017). Despite recent advancements in efficiency (Baker et al., 2018; Liu et al., 2018; Cai et al., 2018), these methods remain impractical and not widely accessible. One-shot methods aim to address this impracticality by performing the architecture search in a one-shot, end-to-end manner (Liu et al., 2019; Guo et al., 2020; Pham et al., 2018; Brock et al., 2018; Bender et al., 2018). Another technique for increasing search efficiency is cell-search (Zoph et al., 2018), which performs NAS for a set of cells stacked on top of each other according to a pre-defined macro-architecture. Differentiable neural architecture search (DARTS) (Liu et al., 2019) is a one-shot method that performs cell-search using gradient descent and continuous relaxation within a weight-sharing framework. As a result of these innovations, DARTS is one of the most efficient methods of architecture search (Elsken et al., 2019). One of its most severe issues is the performance collapse problem (Zela et al., 2020), in which the quality of the architecture selected by DARTS gradually degenerates into an architecture consisting solely of skip-connections.
This issue is usually attributed to gradient vanishing (Zhou et al., 2020) or large Hessians of the loss function (Zela et al., 2020), and is mostly addressed by heavily modifying DARTS (Chu et al., 2021; Wang et al., 2021; Gu et al., 2021; Ye et al., 2022). In this paper, we argue that these theories fail to grasp the main reason behind the performance collapse: the conditions imposed on the convergence of DARTS by the weight-sharing framework. Specifically, we first define a measure of the correlation between the gradients of each layer with respect to the architecture parameters, dubbed layer alignment (Λ). Then, following a careful analysis of the convergence conditions of DARTS and the effects of weight-sharing, we find that:

1. Due to the value of Λ, DARTS can only achieve convergence by reaching the saturation point of the softmax function, where the normalized architecture parameters become almost one-hot vectors. This convergence point does not depend directly on the loss function, giving an unfair advantage to the layers that do not suffer from gradient vanishing.

2. A low value of Λ means that the optimal architecture corresponding to each layer varies wildly, with layers that are closer to the output, and therefore do not suffer from gradient vanishing, mainly preferring non-parametric operations.

3. The aforementioned issues are direct contributors to the performance collapse problem. Furthermore, increasing the depth of the search model or the number of iterations used for the search increases the severity of performance collapse.

In Figure 1, we illustrate DARTS and the concept of layer alignment (which we define in Section 3.2), and observe a clear correlation between the layer alignment and the performance of the discovered architecture on the NAS-Bench-201 search space (Dong & Yang, 2020).
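To make the layer-alignment measure concrete, the sketch below computes a simplified version of Λ as the mean pairwise cosine similarity between per-layer gradients of the loss with respect to the shared architecture parameters. The function name and this exact formula are illustrative assumptions on our part; the paper's formal definition appears in Section 3.2.

```python
import numpy as np

def layer_alignment(layer_grads):
    """Mean pairwise cosine similarity between per-layer gradient
    vectors of the loss w.r.t. the shared architecture parameters.

    layer_grads: list of 1-D arrays, one gradient vector per layer.
    Returns a value in [-1, 1]; 1 means all layers agree on the
    direction in which the architecture parameters should move.
    """
    def cos(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    n = len(layer_grads)
    sims = [cos(layer_grads[i], layer_grads[j])
            for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(sims))

# Perfectly aligned layers (gradients are positive multiples of each other)
g = np.array([0.3, -0.1, 0.7])
print(layer_alignment([g, 2.0 * g, 0.5 * g]))  # ~1.0
# Opposed layers: each layer prefers a different architecture
print(layer_alignment([g, -g]))  # ~-1.0
```

A low value of this statistic indicates that the shared architecture parameters receive conflicting update directions from different layers, which is the situation the paper associates with collapse.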
In this paper, we use a combination of analytical and empirical evidence to support the claim that this relationship is in fact causal, resulting from the weight-sharing framework used in DARTS. Following these conclusions, we introduce a regularization term that aims to alleviate performance collapse by increasing the layer alignment. Then, through a comprehensive empirical examination, we show that our method, dubbed Λ-DARTS, significantly improves the performance of DARTS on various search spaces and datasets without any modification to the algorithm of DARTS or the structure of the cells. Specifically, our method achieves an average accuracy of 96.57% and 83.85% on the CIFAR-10 and CIFAR-100 datasets on the DARTS search space, improving upon the current state-of-the-art by 0.06% and 0.39%, respectively. To the best of our knowledge, our work is the first to investigate DARTS from the point of view of its weight-sharing framework and convergence conditions.

2. RELATED WORK

DARTS (Liu et al., 2019) proposed a continuous and differentiable search space by weighting a fixed set of candidate operations, making NAS more scalable. It trains a super-graph with gradient descent and then selects the sub-graph consisting of the highest-weighted operation on each edge. Its simplicity made it very popular, and many variants have emerged to address its theoretical and empirical setbacks:
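As a brief sketch of this continuous relaxation: each edge of the cell holds a vector of architecture parameters α, the edge output during search is the softmax(α)-weighted sum of all candidate operations, and discretization afterwards keeps only the arg-max operation. The toy operation set and names below are illustrative assumptions, not the actual DARTS search space.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # shift for numerical stability
    return e / e.sum()

# Hypothetical candidate operations on a single edge (illustrative only).
ops = {
    "skip":   lambda x: x,
    "scale2": lambda x: 2.0 * x,
    "zero":   lambda x: np.zeros_like(x),
}

def mixed_op(x, alpha):
    """Continuous relaxation: softmax-weighted sum of all candidates."""
    w = softmax(alpha)
    return sum(wk * op(x) for wk, op in zip(w, ops.values()))

def discretize(alpha):
    """After search, keep only the highest-weighted operation."""
    return list(ops)[int(np.argmax(alpha))]

x = np.array([1.0, -1.0])
alpha = np.array([2.0, 0.1, -1.0])  # architecture parameters for this edge
print(mixed_op(x, alpha))
print(discretize(alpha))  # "skip"
```

Because α is shared across all cells via weight-sharing, every layer's gradient accumulates into the same α vectors; this is precisely where the layer-alignment analysis in this paper applies.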



Figure 1: An illustration of the differentiable formulation for NAS by DARTS, and the effect of layer alignment on the performance trajectory of the architectures discovered by DARTS and Λ-DARTS. The experiments are performed on the NAS-Bench-201 search space, averaged over 4 runs with a 95% confidence interval.

