EFFICIENT DIFFERENTIABLE NEURAL ARCHITECTURE SEARCH WITH MODEL PARALLELISM

Abstract

Neural architecture search (NAS) automatically designs effective network architectures. Differentiable NAS with supernets that encompass all candidate architectures in a single large graph reduces search overhead to a few GPU days or less. However, these algorithms consume massive GPU memory, which restrains NAS from using large batch sizes and large search spaces (e.g., more candidate operations, diverse cell structures, and deeper supernets). In this paper, we present binary neural architecture search (NASB) with consecutive model parallel (CMP) to tackle the problem of insufficient GPU memory. CMP aggregates memory from multiple GPUs for supernets: it divides the forward/backward phases into several sub-tasks and executes sub-tasks of the same type together to reduce waiting cycles. This approach improves the hardware utilization of model parallelism, but it still requires large GPU memory. NASB is proposed to reduce the memory footprint: it excludes inactive operations from the computation graph and recomputes those operations on the fly when the architectural gradients of inactive operations are needed in the backward phase. Experiments show that NASB-CMP runs 1.2× faster than other model parallel approaches and outperforms state-of-the-art differentiable NAS. NASB can also save twice as much GPU memory as PC-DARTS. Finally, we apply NASB-CMP to complicated supernet architectures. Although deep supernets with diverse cell structures do not improve NAS performance, NASB-CMP shows its potential to explore supernet architecture design in large search spaces 1 .
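To make the memory-saving idea concrete, the following is a minimal PyTorch sketch of a binarized mixed operation in the spirit of NASB: only the active candidate operation participates in the forward graph, so activation memory no longer scales with the number of candidates. The class name, the argmax-based selection rule, and the two-operation candidate list are illustrative assumptions, not the paper's exact implementation (which additionally recomputes inactive operations on the fly to obtain their architectural gradients in the backward phase).

```python
import torch
import torch.nn as nn

class BinaryMixedOp(nn.Module):
    """Illustrative sketch (not the paper's exact implementation):
    only one 'active' candidate op is kept in the forward graph,
    so activation memory stays constant in the number of candidates."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        # One architecture weight per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x):
        # Binarize: run only the op with the largest architecture weight.
        active = int(self.alpha.argmax())
        out = self.ops[active](x)
        # Inactive ops never execute here; NASB would recompute them
        # on the fly in the backward phase for their gradients.
        return out

# Hypothetical two-candidate search space for demonstration.
ops = [nn.Identity(), nn.Conv2d(4, 4, 3, padding=1)]
bmix = BinaryMixedOp(ops)
x = torch.randn(1, 4, 8, 8)
y = bmix(x)
```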

1. INTRODUCTION

Neural architecture search (NAS) has turned architecture design for deep learning from a manual into an automatic process in various applications, such as image classification (Zoph & Le, 2016) and semantic segmentation (Liu et al., 2019a). Reinforcement learning (Zoph & Le, 2016; Zoph et al., 2018; Pham et al., 2018), evolutionary algorithms (Real et al., 2017; 2019), and differentiable algorithms (Liu et al., 2019b; Cai et al., 2019) have been applied to discover the optimal architecture from a large search space of candidate network structures. Supernets (Zoph et al., 2018; Pham et al., 2018) comprising all possible networks reduce the search space from complete network architectures to cell structures. Recent acceleration techniques for differentiable NAS (Xie et al., 2019; Yao et al., 2020; Chen et al., 2019; Xu et al., 2020) further diminish search costs to affordable computation overheads (e.g., half a GPU day); for example, prior work (Xu et al., 2020) randomly samples partial channels of the intermediate feature maps in the mixed operations.

However, the supernets of differentiable NAS consume gigantic GPU memory, which constrains NAS from using large batch sizes and imposes restrictions on the complexity of supernet architectures. For example, architectures for deep compact networks (e.g., 20 layers) are determined by searching shallow supernets (e.g., 8 layers), and the cell structures are required to remain identical for cells of the same type. Data parallelism can increase the search efficiency of NAS by using large batch sizes, as in SNAS (Xie et al., 2019), but it requires the supernet to be small enough to fit in a single GPU. In contrast, model parallelism can parallelize complex supernets by distributing model partitions to multiple devices. Nevertheless, model parallelism suffers from low hardware utilization: only one device executes its model partition at a time, while the other devices stay idle. How to take advantage of multiple GPUs for large supernets efficiently is an open problem.
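The memory pressure described above stems from the standard differentiable mixed operation, where every candidate operation runs in the forward pass and all of their activations must be stored for the backward pass. The following is a minimal DARTS-style sketch under assumed candidates (the real search spaces use separable and dilated convolutions, pooling, skip connections, etc.); activation memory grows linearly with the number of candidate operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_candidates(channels):
    # Hypothetical candidate set for illustration only.
    return nn.ModuleList([
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.Conv2d(channels, channels, 5, padding=2),
        nn.AvgPool2d(3, stride=1, padding=1),
        nn.Identity(),
    ])

class MixedOp(nn.Module):
    """DARTS-style mixed operation: a softmax-weighted sum over ALL
    candidate ops. Every candidate's activations are retained for the
    backward pass, so supernet memory scales with the number of ops."""
    def __init__(self, channels):
        super().__init__()
        self.ops = make_candidates(channels)
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # All |O| candidates execute; none can be freed before backward.
        return sum(w * op(x) for w, op in zip(weights, self.ops))

x = torch.randn(2, 8, 16, 16)
mixed = MixedOp(8)
out = mixed(x)
```

Techniques such as PC-DARTS reduce this cost by feeding only a sampled fraction of the channels through the mixed operation, while NASB (above) avoids executing the inactive candidates altogether.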
1 Search and evaluation code are released at link 1

