DRNAS: DIRICHLET NEURAL ARCHITECTURE SEARCH

Abstract

This paper proposes a novel differentiable architecture search method by formulating it as a distribution learning problem. We treat the continuously relaxed architecture mixing weights as random variables modeled by a Dirichlet distribution. With recently developed pathwise derivatives, the Dirichlet parameters can be easily optimized with a gradient-based optimizer in an end-to-end manner. This formulation improves generalization and induces stochasticity that naturally encourages exploration in the search space. Furthermore, to alleviate the large memory consumption of differentiable NAS, we propose a simple yet effective progressive learning scheme that enables searching directly on large-scale tasks, eliminating the gap between the search and evaluation phases. Extensive experiments demonstrate the effectiveness of our method. Specifically, we obtain a test error of 2.46% on CIFAR-10 and 23.7% on ImageNet under the mobile setting. On NAS-Bench-201, we also achieve state-of-the-art results on all three datasets and provide insights for the effective design of neural architecture search algorithms.

1. INTRODUCTION

Recently, Neural Architecture Search (NAS) has attracted considerable attention for its potential to democratize deep learning. For a practical end-to-end deep learning platform, NAS plays a crucial role in discovering task-specific architectures depending on users' configurations (e.g., dataset, evaluation metric, etc.). Pioneers in this field developed prototypes based on reinforcement learning (Zoph & Le, 2017), evolutionary algorithms (Real et al., 2019), and Bayesian optimization (Liu et al., 2018). These works usually incur large computational overheads, which make them impractical to use. More recent algorithms significantly reduce the search cost, including one-shot methods (Pham et al., 2018; Bender et al., 2018), continuous relaxation of the search space (Liu et al., 2019), and network morphisms (Cai et al., 2018). In particular, Liu et al. (2019) propose a differentiable NAS framework, DARTS, which converts the categorical operation selection problem into learning a continuous architecture mixing weight. They formulate a bi-level optimization objective, allowing the architecture search to be performed efficiently by a gradient-based optimizer.

While current differentiable NAS methods achieve encouraging results, they still have shortcomings that hinder their real-world application. Firstly, several works have cast doubt on the stability and generalization of these differentiable NAS methods (Chen & Hsieh, 2020; Zela et al., 2020a). They discover that directly optimizing the architecture mixing weight is prone to overfitting the validation set and often leads to distorted structures, e.g., searched architectures dominated by parameter-free operations. Secondly, there exist disparities between the search and evaluation phases: proxy tasks are usually employed during search, with smaller datasets or shallower and narrower networks, due to the large memory consumption of differentiable NAS.
In this paper, we propose an effective approach that addresses the aforementioned shortcomings, named Dirichlet Neural Architecture Search (DrNAS). Inspired by the observation that directly optimizing the architecture mixing weight is equivalent to performing point estimation (MLE/MAP) from a probabilistic perspective, we instead formulate differentiable NAS as a distribution learning problem, which naturally induces stochasticity and encourages exploration. Making use of the probability simplex property of Dirichlet samples, DrNAS models the architecture mixing weight as random variables sampled from a parameterized Dirichlet distribution. Optimizing the Dirichlet objective can thus be done efficiently in an end-to-end fashion, by employing pathwise derivative estimators to compute the gradient through the distribution (Jankowiak & Obermeyer, 2018). A straightforward optimization, however, turns out to be problematic due to the uncontrolled variance of the Dirichlet: too much variance leads to training instability, while too little variance suffers from insufficient exploration. In light of this, we apply an additional distance regularizer directly on the Dirichlet concentration parameter to strike a balance between exploration and exploitation. We further derive a theoretical bound showing that the constrained distributional objective promotes stability and generalization of architecture search by implicitly controlling the Hessian of the validation error. Furthermore, to enable direct search on large-scale tasks, we propose a progressive learning scheme, eliminating the gap between the search and evaluation phases. Based on partial channel connection (Xu et al., 2020), we maintain a task-specific super-network of the same depth and number of channels as the evaluation phase throughout searching.
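The core mechanism above can be illustrated in a few lines. This is a minimal sketch, not the paper's implementation: the edge size, the softplus mapping used to keep concentrations positive, and the placeholder loss are all our assumptions for illustration. PyTorch's `Dirichlet` supports reparameterized sampling (`rsample`), which implements exactly the pathwise derivative estimator referenced above, so gradients flow from the loss back into the concentration parameter β.

```python
import torch

# Hypothetical edge with 4 candidate operations. beta is the learnable
# concentration parameter (the "architecture parameter" in DrNAS terms).
beta = torch.nn.Parameter(torch.zeros(4))

# Sample mixing weights theta ~ Dirichlet(softplus(beta)). The softplus
# (our choice here) keeps the concentrations strictly positive. rsample()
# uses pathwise derivatives, so theta is differentiable w.r.t. beta.
dist = torch.distributions.Dirichlet(torch.nn.functional.softplus(beta))
theta = dist.rsample()

# theta lies on the probability simplex and would weight the mixed
# operation; here a placeholder scalar stands in for the validation loss.
loss = (theta * torch.arange(4.0)).sum()
loss.backward()
```

Because `theta` is a stochastic sample rather than a deterministic softmax, repeated draws naturally explore different operation weightings, which is the source of the exploration behavior described above.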
To prevent loss of information and instability induced by partial connection, we divide the search phase into multiple stages and progressively increase the channel fraction via network transformation (Chen et al., 2016). Meanwhile, we prune the operation space according to the learnt distribution to maintain memory efficiency. We conduct extensive experiments on different datasets and search spaces to demonstrate DrNAS's effectiveness. On the DARTS search space (Liu et al., 2019), we achieve an average error rate of 2.46% on CIFAR-10, which ranks top amongst NAS methods. Furthermore, DrNAS achieves superior performance on large-scale tasks such as ImageNet. It obtains a top-1/5 error of 23.7%/7.1%, surpassing the previous state-of-the-art (24.0%/7.3%) under the mobile setting. On NAS-Bench-201 (Dong & Yang, 2020), we also set new state-of-the-art results on all three datasets with low variance. Our code is available at https://github.com/xiangning-chen/DrNAS.
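The progressive scheme above alternates two moves: widening the partial channel fraction and shrinking the candidate operation set. The following sketch shows only this bookkeeping under assumed numbers (8 operations, 3 stages, doubling fraction, halving the operation set each stage); the paper's actual stage schedule, channel fractions, and the search step itself are omitted.

```python
import numpy as np

num_ops, num_stages = 8, 3          # assumed sizes for illustration
channel_fraction = 1 / 4            # fraction of channels sent through ops
keep = np.arange(num_ops)           # indices of surviving operations
beta = np.random.rand(num_ops)      # stand-in for learned concentrations

for stage in range(num_stages):
    # ... optimize the super-network and beta at this stage (omitted) ...

    # Widen: double the partial channel fraction via a network
    # transformation, capped at using all channels.
    channel_fraction = min(1.0, channel_fraction * 2)

    # Prune: keep the half of the operations with the largest learned
    # Dirichlet concentration, so memory cost stays roughly constant.
    top = np.argsort(beta[keep])[-max(1, len(keep) // 2):]
    keep = keep[np.sort(top)]
```

The point of the trade-off is that widening alone would blow up memory; pruning the weakest operations at each stage compensates, so the super-network can match the evaluation network's depth and width throughout.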

2. THE PROPOSED APPROACH

In this section, we first briefly review differentiable NAS setups and generalize the formulation to motivate distribution learning. We then lay out our proposed DrNAS and describe its optimization in Section 2.2. In Section 2.3, we provide a generalization result by showing that our method implicitly regularizes the Hessian norm over the architecture parameter. The progressive architecture learning method that enables direct search is then described in Section 2.4.

2.1. PRELIMINARIES: DIFFERENTIABLE ARCHITECTURE SEARCH

Cell-Based Search Space The cell-based search space is constructed by replications of normal and reduction cells (Zoph et al., 2018; Liu et al., 2019). A normal cell keeps the spatial resolution while a reduction cell halves it but doubles the number of channels. Every cell is represented by a DAG with $N$ nodes and $E$ edges, where every node represents a latent representation $x_i$ and every edge $(i, j)$ is associated with an operation $o^{(i,j)}$ (e.g., max pooling or convolution) selected from a predefined candidate space $\mathcal{O}$. The output of a node is a summation of all input flows, i.e., $x_j = \sum_{i<j} o^{(i,j)}(x_i)$, and a concatenation of intermediate node outputs, i.e., $\mathrm{concat}(x_2, \ldots, x_{N-1})$, composes the cell output, where the first two input nodes $x_0$ and $x_1$ are fixed to be the outputs of the previous two cells.
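The cell's dataflow can be made concrete with a toy forward pass. This is a sketch under assumed shapes: a 5-node cell whose per-edge operations are trivial stand-ins (adding a constant) for real candidates like convolution or pooling.

```python
import numpy as np

N = 5  # assumed number of nodes in the toy cell

# One hypothetical operation per edge (i, j); real candidates would be
# convolutions, poolings, etc. Here op(i,j) just adds i + j elementwise.
ops = {(i, j): (lambda x, k=i + j: x + k)
       for j in range(N) for i in range(j)}

# The two fixed input nodes x0, x1 come from the previous two cells.
x = [np.zeros(3), np.ones(3)]

# Each intermediate node sums the edge operations over all predecessors:
# x_j = sum_{i<j} o^{(i,j)}(x_i).
for j in range(2, N):
    x.append(sum(ops[(i, j)](x[i]) for i in range(j)))

# The cell output concatenates the intermediate node outputs x2..x_{N-1}.
cell_out = np.concatenate(x[2:])
```

Note that the two input nodes are excluded from the final concatenation, matching the definition above.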

Gradient-Based Search via Continuous Relaxation

To enable gradient-based optimization, Liu et al. (2019) apply a continuous relaxation to the discrete space. Concretely, the information passed from node $i$ to node $j$ is computed by a weighted sum of all operations along the edge, forming a mixed operation $\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \theta_o^{(i,j)} o(x)$. The operation mixing weight $\theta^{(i,j)}$ is defined over the probability simplex and its magnitude represents the strength of each operation. Therefore, the architecture search can be cast as selecting, for each edge, the operation associated with the highest mixing weight. To prevent abuse of terminology, we refer to $\theta$ as the architecture/operation mixing weight, and to the concentration parameter $\beta$ in DrNAS as the architecture parameter throughout the paper.

Bi-level Optimization with Simplex Constraints With continuous relaxation, the network weight $w$ and operation mixing weight $\theta$ can be jointly optimized by solving a constrained bi-level optimization problem.
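The mixed operation above is just a convex combination of candidate outputs. A minimal sketch, with a hypothetical three-operation candidate space (identity, scale-by-2, and zero) standing in for the real operation set:

```python
import numpy as np

def mixed_op(x, theta, ops):
    """Continuous relaxation: weighted sum of all candidate operations,
    i.e. o_bar(x) = sum_o theta_o * o(x)."""
    return sum(t * op(x) for t, op in zip(theta, ops))

# Hypothetical candidate space O = {identity, scale-by-2, zero}.
ops = [lambda x: x, lambda x: 2 * x, lambda x: 0 * x]

# theta lies on the probability simplex; the largest entry marks the
# operation that would be selected when the search is discretized.
theta = np.array([0.5, 0.3, 0.2])
out = mixed_op(np.ones(4), theta, ops)   # 0.5*1 + 0.3*2 + 0.2*0 per entry
selected = int(np.argmax(theta))         # discretization picks op 0 here
```

Discretizing the search then amounts to keeping, per edge, only the operation with the largest mixing weight, as described above.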

