FAST MNAS: UNCERTAINTY-AWARE NEURAL ARCHITECTURE SEARCH WITH LIFELONG LEARNING

Abstract

Sampling-based neural architecture search (NAS) typically converges more reliably than gradient-based approaches, yet it demands enormous computational resources due to the rollout bottleneck: exhaustively training each sampled generation on proxy tasks. This work provides a general pipeline to accelerate both the rollout process and the RL learning process in sampling-based NAS. It is motivated by the observation that both architecture knowledge and parameter knowledge can be transferred between different experiments and even different tasks. We first introduce an uncertainty-aware critic (value function) into PPO to exploit architecture knowledge from previous experiments, which stabilizes training and reduces search time by 4x. Further, a lifelong knowledge pool together with a block similarity function is proposed to exploit lifelong parameter knowledge, reducing search time by a further 2x; to our knowledge, this is the first use of block-level weight sharing in RL-based NAS. The block similarity function guarantees a 100% hit ratio while preserving strict fairness. In addition, we show that a simply designed off-policy correction factor enables a "replay buffer" in the RL optimization and halves the search time again. Experiments on the MNAS search space show that the proposed FNAS accelerates the standard RL-based NAS process by ∼10x (e.g., ∼256 2x2-TPUv2 days / 20,000 GPU hours → 2,000 GPU hours for MNAS) while achieving better performance on various vision tasks.

1. INTRODUCTION

Neural architecture search (NAS) has made great progress on different tasks such as image classification (Tan & Le, 2019) and object detection (Tan et al., 2019b). Four NAS algorithms are commonly used: differentiable, one-shot, evolutionary, and reinforcement-learning (RL) based methods. The RL-based method, owing to its fair sampling and training processes, has achieved strong performance across different tasks. However, one of its biggest challenges is its high demand for computing resources, which makes it hard for the research community to follow. RL-based NAS consumes large amounts of compute in two ways: a) a large number of architectures must be sampled to optimize the RL agent, and b) each of these samples undergoes a tedious training and testing process on proxy tasks. For example, the original NAS (Zoph & Le, 2016) requires 12,800 generations of architectures, and the current state-of-the-art MNAS (Tan et al., 2019a) and MobileNet-V3 (Howard et al., 2019) require 8,000 or more generations to find the optimal architecture. Moreover, each generation is usually trained for 5 epochs. In total, a single search costs nearly 64 TPUv2 devices for 96 hours, or about 20,000 V100 GPU hours.

With such a severe drawback, researchers have started looking at other options such as differentiable (Liu et al., 2018b; Chen et al., 2019) or one-shot (Bender, 2019; Guo et al., 2019) methods for NAS. The one-shot family has drawn much attention recently due to its efficiency. It employs a single super-network-based search space in which all architectures, also called sub-networks, share parameters with the super-network during training. In this way, training thousands of sub-networks is condensed into training one super-network. However, this weight-sharing strategy can distort the performance estimation of sub-networks.
For example, two sub-networks may propagate conflicting gradients to their shared components, and those components may converge to favor one sub-network and disadvantage the other, essentially at random. This
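The conflicting-gradient effect can be seen in a minimal toy sketch (purely illustrative, not from the paper): two hypothetical "sub-networks" share a single parameter but their individual losses have opposite optima, so alternating their gradient updates leaves the shared weight at a compromise value that misrepresents both. The quadratic losses and targets (+1 / -1) are invented for illustration.

```python
# Toy illustration of weight-sharing conflict in one-shot NAS:
# two sub-networks share one parameter w, but prefer opposite optima.

def grad_a(w):
    # Sub-network A's loss: (w - 1)^2, optimum at w = +1
    return 2.0 * (w - 1.0)

def grad_b(w):
    # Sub-network B's loss: (w + 1)^2, optimum at w = -1
    return 2.0 * (w + 1.0)

def train_shared(steps=1000, lr=0.1):
    """Alternate SGD steps for A and B on the shared parameter."""
    w = 0.5
    for t in range(steps):
        g = grad_a(w) if t % 2 == 0 else grad_b(w)
        w -= lr * g
    return w

w = train_shared()
loss_a = (w - 1.0) ** 2   # A trained alone would reach loss 0
loss_b = (w + 1.0) ** 2   # likewise for B
print(w, loss_a, loss_b)
```

The shared weight settles near a compromise between the two optima, so evaluating either sub-network with the shared weight understates the accuracy it would reach if trained alone; ranking candidates by such proxy scores can therefore be unreliable.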

