DRNAS: DIRICHLET NEURAL ARCHITECTURE SEARCH

Abstract

This paper proposes a novel differentiable architecture search method by formulating it as a distribution learning problem. We treat the continuously relaxed architecture mixing weights as random variables modeled by a Dirichlet distribution. With recently developed pathwise derivatives, the Dirichlet parameters can be easily optimized with a gradient-based optimizer in an end-to-end manner. This formulation improves the generalization ability and induces stochasticity that naturally encourages exploration in the search space. Furthermore, to alleviate the large memory consumption of differentiable NAS, we propose a simple yet effective progressive learning scheme that enables searching directly on large-scale tasks, eliminating the gap between the search and evaluation phases. Extensive experiments demonstrate the effectiveness of our method. Specifically, we obtain a test error of 2.46% on CIFAR-10 and 23.7% on ImageNet under the mobile setting. On NAS-Bench-201, we also achieve state-of-the-art results on all three datasets and provide insights for the effective design of neural architecture search algorithms.

1. INTRODUCTION

Recently, Neural Architecture Search (NAS) has attracted much attention for its potential to democratize deep learning. For a practical end-to-end deep learning platform, NAS plays a crucial role in discovering task-specific architectures depending on users' configurations (e.g., dataset, evaluation metric, etc.). Pioneers in this field developed prototypes based on reinforcement learning (Zoph & Le, 2017), evolutionary algorithms (Real et al., 2019), and Bayesian optimization (Liu et al., 2018). These works usually incur large computation overheads, which make them impractical to use. More recent algorithms significantly reduce the search cost, including one-shot methods (Pham et al., 2018; Bender et al., 2018), a continuous relaxation of the space (Liu et al., 2019), and network morphisms (Cai et al., 2018). In particular, Liu et al. (2019) propose a differentiable NAS framework, DARTS, which converts the categorical operation selection problem into learning a continuous architecture mixing weight. They formulate a bi-level optimization objective, allowing the architecture search to be efficiently performed by a gradient-based optimizer.
While current differentiable NAS methods achieve encouraging results, they still have shortcomings that hinder their real-world applications. Firstly, several works have cast doubt on the stability and generalization of these differentiable NAS methods (Chen & Hsieh, 2020; Zela et al., 2020a). They discover that directly optimizing the architecture mixing weight is prone to overfitting the validation set and often leads to distorted structures, e.g., searched architectures dominated by parameter-free operations. Secondly, there exist disparities between the search and evaluation phases: because of the large memory consumption of differentiable NAS, proxy tasks with smaller datasets or shallower and narrower networks are usually employed during search.
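To make the continuous relaxation described above concrete, the following is a minimal sketch, not the implementation of DARTS or of this paper. It assumes three toy scalar operations (`identity`, `double`, `zero`), hypothetical names `alpha` and `beta` for the learnable parameters, and uses only the Python standard library. It contrasts a single deterministic mixing weight obtained from a softmax with a mixing weight sampled from a Dirichlet distribution. No gradients are computed here; an actual search would rely on an automatic differentiation framework.

```python
import math
import random

# Hypothetical candidate operations on one edge (assumption for illustration);
# real search spaces use convolutions, pooling, skip connections, etc.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def mixed_op(x, weights):
    """Continuous relaxation: weighted sum of all candidate operations."""
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


# Point-estimate view: one deterministic weight vector from learnable logits.
alpha = [0.0, 0.0, 0.0]  # hypothetical architecture logits
point_weights = softmax(alpha)

# Distribution view: the weight vector is random, governed by
# Dirichlet concentration parameters.
rng = random.Random(0)
beta = [1.0, 1.0, 1.0]  # hypothetical Dirichlet concentrations
sampled_weights = sample_dirichlet(beta, rng)

out_point = mixed_op(3.0, point_weights)
out_sampled = mixed_op(3.0, sampled_weights)
```

In both cases the weights lie on the probability simplex, so the edge output is a convex combination of the candidate operations. The difference is that the sampled weights vary from draw to draw, which is the source of the stochasticity the abstract refers to.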
In this paper, we propose Dirichlet Neural Architecture Search (DrNAS), an effective approach that addresses the aforementioned shortcomings. Inspired by the fact that directly optimizing the architecture mixing weight is equivalent to performing point estimation (MLE/MAP) from a probabilistic perspective, we formulate differentiable NAS as a distribution learning problem

* Equal Contribution.

