FAST MNAS: UNCERTAINTY-AWARE NEURAL ARCHITECTURE SEARCH WITH LIFELONG LEARNING

Abstract

Sampling-based neural architecture search (NAS) typically converges to better architectures than gradient-based approaches, yet demands far more computation, owing to the rollout bottleneck: every sampled generation must be exhaustively trained on proxy tasks. This work provides a general pipeline to accelerate both the rollout process and the RL optimization in sampling-based NAS. It is motivated by the observation that both architecture knowledge and parameter knowledge can be transferred across different experiments and even different tasks. We first introduce an uncertainty-aware critic (value function) into PPO to exploit architecture knowledge from previous experiments, which stabilizes training and reduces the search time by 4×. We then propose a lifelong knowledge pool, together with a block similarity function, to exploit lifelong parameter knowledge, reducing the search time by a further 2×; to our knowledge, this is the first block-level weight sharing in RL-based NAS, and the block similarity function guarantees a 100% hit ratio under strict fairness. Finally, we show that a simple off-policy correction factor enables a "replay buffer" in the RL optimization, halving the remaining search time. Experiments on the MNAS search space show that the proposed FNAS accelerates the standard RL-based NAS process by ∼10× (e.g., ∼256 2x2 TPUv2·days / 20,000 GPU·hours → 2,000 GPU·hours for MNAS) while achieving better performance on various vision tasks.

1. INTRODUCTION

Neural architecture search (NAS) has made great progress on tasks such as image classification (Tan & Le, 2019) and object detection (Tan et al., 2019b). Four families of NAS algorithms are commonly used: differentiable, one-shot, evolutionary, and reinforcement-learning (RL) based methods. RL-based methods, owing to their fair sampling and training procedures, have achieved strong performance across tasks. Their biggest drawback, however, is the enormous demand for computing resources, which makes them hard for the research community to reproduce. RL-based NAS consumes computation in two places: a) sampling a large number of architectures to optimize the RL agent, and b) the tedious training and evaluation of these samples on proxy tasks. For example, the pioneering NAS work (Zoph & Le, 2016) requires 12,800 generations of architectures, and the current state-of-the-art MNAS (Tan et al., 2019a) and MobileNet-V3 (Howard et al., 2019) require 8,000 or more generations to find the optimal architectures, with each generation typically trained for 5 epochs. Altogether, a single search costs roughly 64 TPUv2 devices for 96 hours, or 20,000 V100 GPU hours. Given this severe drawback, researchers have turned to other options such as differentiable (Liu et al., 2018b; Chen et al., 2019) or one-shot (Bender, 2019; Guo et al., 2019) methods. The one-shot family has drawn much attention recently due to its efficiency: the search space is defined by a single super-network, and all candidate architectures, also called sub-networks, share parameters with the super-network during training. The training cost thus collapses from training thousands of sub-networks to training one super-network. However, this weight-sharing strategy can corrupt the performance estimates of sub-networks.
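The weight-sharing scheme described above can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: `SuperNet`, its scalar "weights", and the gradient step are all simplifying assumptions, but they show how every sub-network is just a per-layer choice of op over one shared parameter pool, and why gradients from different sub-networks can conflict.

```python
import random

class SuperNet:
    """Toy super-network: one shared scalar weight per (layer, op) pair."""

    def __init__(self, num_layers, ops_per_layer):
        # Shared parameter pool: all sub-networks read/write these values.
        self.weights = [[0.0] * ops_per_layer for _ in range(num_layers)]
        self.ops_per_layer = ops_per_layer

    def sample_subnet(self):
        # A sub-network is simply a tuple of op indices, one per layer.
        return tuple(random.randrange(self.ops_per_layer)
                     for _ in self.weights)

    def train_step(self, subnet, grads, lr=0.1):
        # Only the chosen op in each layer is updated, so two sub-networks
        # that pick the same op can push its shared weight in conflicting
        # directions -- the estimation problem noted in the text.
        for layer, (op, g) in enumerate(zip(subnet, grads)):
            self.weights[layer][op] -= lr * g

supernet = SuperNet(num_layers=3, ops_per_layer=4)
subnet = supernet.sample_subnet()
supernet.train_step(subnet, grads=[1.0, 1.0, 1.0])
```

Training thousands of sub-networks this way costs only one super-network's worth of parameters, which is the efficiency the one-shot family trades against estimation fidelity.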
For example, two sub-networks may propagate conflicting gradients to their shared components, which then converge to favor one sub-network and penalize the other essentially at random.

In this work, we seek to combine the strengths of RL-based and one-shot methods by leveraging knowledge from previous NAS experiments. The proposed method rests on two key observations: first, the optimal architectures for different tasks share common architecture knowledge; second, parameter knowledge can also be transferred across experiments and even across tasks. Building on these observations, for transferable architecture knowledge we develop an Uncertainty-Aware Critic (UAC) that learns the joint architecture-performance distribution from other experiments, and even other tasks, in an unbiased manner; exploiting this structural knowledge reduces each sample's training time by 50%, as shown in Figure 1 (with UAC). For transferable parameter knowledge, we propose a Lifelong Knowledge Pool (LKP) that stores block-level parameters and fairly shares them to initialize new samples, speeding up each sample's convergence by 2×, as shown in Figure 1 (with LKP). Finally, we develop an Architecture Experience Buffer (AEB) with a simple off-policy correction factor that stores old models for reuse in the RL optimization, saving half of the remaining time, as shown in Figure 1 (with AEB). Under exactly the same environment as MNAS and MobileNet-V3, FNAS speeds up the search process by 10× while achieving even better performance.
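The replay-buffer idea behind AEB can be sketched as follows. This is a hedged illustration of the general off-policy correction technique, not the paper's actual objective or code: each stored sample's contribution to the policy-gradient estimate is re-weighted by the importance ratio π_new(a) / π_old(a), so that samples drawn under a stale policy can still be reused without biasing the update. All names below are illustrative assumptions.

```python
def corrected_policy_gradient(buffer, pi_new):
    """Importance-weighted reuse of stale rollouts.

    buffer: list of (action, reward, pi_old_prob) tuples, where
            pi_old_prob is the probability the *old* policy assigned
            to the action when it was sampled.
    pi_new: mapping from action -> probability under the current policy.
    """
    total = 0.0
    for action, reward, pi_old_prob in buffer:
        ratio = pi_new[action] / pi_old_prob  # off-policy correction factor
        total += ratio * reward               # re-weighted return term
    return total / len(buffer)

# Two stale samples whose policy probabilities have since shifted:
buffer = [(0, 1.0, 0.5), (1, 0.5, 0.25)]
pi_new = {0: 0.6, 1: 0.2}
g = corrected_policy_gradient(buffer, pi_new)  # 0.5 * (1.2*1.0 + 0.8*0.5)
```

Without the ratio, every replayed sample would be treated as if it came from the current policy, which is exactly the bias a correction factor of this form removes.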



Figure 1: Reward versus sample generation for FNAS and MNAS. Blue dots show the search results of MNAS; red dots show those of FNAS.

