EFFICIENT ONE-SHOT NEURAL ARCHITECTURE SEARCH WITH PROGRESSIVE CHOICE FREEZING EVOLUTIONARY SEARCH

Abstract

Neural Architecture Search (NAS) is a fast-developing research field that promotes automated machine learning. Among recently proposed NAS methods, one-shot NAS has attracted significant attention since it greatly reduces the training cost compared with previous NAS methods. In one-shot NAS, the best candidate network architecture is searched within a supernet, which is trained only once. In practice, the searching process involves numerous inference passes for each use case, which causes high overhead in terms of latency and energy consumption. To tackle this problem, we first observe that the choices of the first few blocks belonging to different candidate networks become similar at an early search stage. Furthermore, these choices are already close to the optimal choices obtained at the end of the search. Leveraging this property, we propose a progressive choice freezing evolutionary search (PCF-ES) method that gradually freezes block choices for all candidate networks during the search. Freezing gives us an opportunity to reuse the intermediate data produced by the frozen blocks instead of re-computing them. Experimental results show that the proposed PCF-ES provides up to a 55% speedup and reduces energy consumption by 51% during the searching stage.

1. INTRODUCTION

Neural Architecture Search (NAS) has been proposed and extensively studied as an efficient tool for designing state-of-the-art neural networks (Elsken et al., 2019; Wistuba et al., 2019; Ren et al., 2020). NAS approaches automate the architecture design process and can achieve higher accuracy than human-designed architectures (Liu et al., 2019; Xie et al., 2019; Cai et al., 2019). However, early NAS methods, such as reinforcement-learning-based NAS (Zoph & Le, 2016), came at the price of expensive computation, since every searched architecture must be trained from scratch, making the total search time unacceptable. To reduce this cost, the weight sharing technique has been proposed (Yu et al., 2020; Chen et al., 2020), among which the one-shot NAS method has attracted considerable attention recently (Bender et al., 2018; Li et al., 2020). One-shot NAS is known to be cost-efficient as it requires training a supernet only once. A supernet is a stack of basic blocks, each of which contains multiple choices. A candidate network architecture (defined as a subnet) can be formed by selecting one choice for each block in the supernet, and its weights can be inherited directly from the supernet. During the architecture searching stage, candidate architectures are evaluated on the validation dataset, and the best architecture, i.e., the one with the highest validation accuracy, is updated in every searching epoch of the Evolutionary Algorithm (EA) (Real et al., 2019). Surprisingly, although training is commonly deemed a lengthy and energy-consuming task, the architecture searching stage in one-shot NAS is much more costly (Cai et al., 2020) than training the supernet. The reason is that a new searching stage must be performed whenever a different search scenario is given, e.g., different hardware constraints, learning tasks, or workloads, while the trained supernet can be reused.
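The supernet/subnet relationship described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: block and choice counts, the weight table, and the function names are all hypothetical stand-ins, with real weight tensors replaced by string stubs.

```python
import random

# Hypothetical supernet: a stack of blocks, each offering several choices
# (e.g. different kernel sizes). Every (block, choice) pair owns its own
# weights, which a sampled subnet inherits instead of retraining.
NUM_BLOCKS = 20
CHOICES_PER_BLOCK = 4

# Shared weight table: (block, choice) -> weights (string stubs here).
weights = {(b, c): f"w_{b}_{c}"
           for b in range(NUM_BLOCKS) for c in range(CHOICES_PER_BLOCK)}

def sample_subnet(rng):
    """Form a candidate architecture: one choice index per block."""
    return [rng.randrange(CHOICES_PER_BLOCK) for _ in range(NUM_BLOCKS)]

def inherit_weights(subnet):
    """A subnet reuses the supernet weights of its chosen ops."""
    return [weights[(b, c)] for b, c in enumerate(subnet)]

rng = random.Random(0)
subnet = sample_subnet(rng)
subnet_weights = inherit_weights(subnet)
```

An EA-based search would repeatedly sample, mutate, and cross over such choice lists, scoring each one on the validation set with its inherited weights.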
Hence, the numerous inferences on the subnets can take far longer than training the supernet once. According to (You et al., 2020), searching can take 10 GPU days longer than supernet training when 10 different constraints/platforms are required. To tackle this problem, our work first makes a key observation: for the first few contiguous blocks of the candidate architectures (defined as continuous shallow blocks), the optimal choices can be determined at an early search epoch. Based on this observation, we propose to freeze the choices of the continuous shallow blocks at an early search epoch, meaning these choices will not change during the remaining search epochs. This strategy deliberately "creates" redundant computations in the continuous shallow blocks, since all candidates share exactly the same architecture, inherited weights, and input validation data for those blocks during later search epochs. We then leverage this redundancy with a simple yet effective data reuse scheme that saves large amounts of computation, further reducing time and energy cost. Specifically, we reuse the last output of the continuous shallow blocks instead of re-computing it repeatedly throughout the remaining searching stage. Interestingly, we further discover that the freezing strategy may in turn help determine the optimal choices of the subsequent blocks earlier. This phenomenon enables us to keep freezing block choices progressively after the initial freezing, which creates more redundant computations in the blocks that share the same choice, so more computation can be saved. With the proposed freezing technique, the intermediate data (the last output of the continuous shallow blocks) of one subnet can be stored and reused when evaluating the other subnets.
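The freeze-and-reuse idea can be sketched as follows. This is a toy model under stated assumptions: `run_block` is an arithmetic stub standing in for a real block's forward pass, the freezing criterion here (freeze once all surviving candidates agree on a leading block) is a simplification of the paper's schedule, and all names are illustrative.

```python
def run_block(block, choice, x):
    # Stub for one block's forward pass; a real system would run a conv op.
    return x + (block + 1) * (choice + 1)

def frozen_prefix_len(population):
    """Number of leading blocks on which every candidate agrees."""
    n = 0
    for choices_at_block in zip(*population):
        if len(set(choices_at_block)) != 1:
            break
        n += 1
    return n

cache = {}  # frozen choice prefix -> stored intermediate activations

def evaluate(subnet, data, frozen):
    """Run a subnet, reusing the cached output of the frozen prefix."""
    prefix = tuple(subnet[:frozen])
    if prefix in cache:
        x = cache[prefix]            # reuse: skip the frozen blocks entirely
    else:
        x = data
        for b, c in enumerate(prefix):
            x = run_block(b, c, x)
        cache[prefix] = x            # store once, reuse for all other subnets
    for b in range(frozen, len(subnet)):
        x = run_block(b, subnet[b], x)
    return x
```

Because every candidate shares the frozen prefix's architecture, weights, and inputs, the cache is hit by all subnets after the first, which is the source of the reported time and energy savings.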
However, as the searching stage requires evaluating each subnet on a large batch (e.g., batch size = 5000) of input samples, storing the intermediate data of even a single subnet may cause serious memory issues (Sec. 3.5). Inspired by the importance sampling technique employed in many training methods (Zeng et al., 2021), we propose to sample the "important" input data that contribute most to distinguishing the evaluation accuracy of the candidate subnets. More importantly, we empirically demonstrate that the important samples are shared across different subnets. Therefore, we only need to store the intermediate data of the important samples for one subnet, and then reuse it for all the others. We evaluate the proposed method on multiple benchmarks trained with state-of-the-art approaches on the ImageNet dataset (Krizhevsky et al., 2012). The experimental results show strong improvements in search efficiency while maintaining search quality, with only 0.1% searching accuracy loss. Our contributions can be summarized as follows:
• We observe that, in the one-shot NAS evolutionary searching stage, the optimal architecture of the shallow blocks is determined at an early searching stage.
• We propose to freeze the choices of the continuous shallow blocks for all candidates at the early stage, and to progressively freeze the choices of the subsequent blocks later. This creates a great amount of redundant computation, giving us a good opportunity to reuse intermediate data and reduce the searching time.
• To alleviate the memory capacity issue of storing intermediate data, we leverage the concept of importance sampling and propose a distinguish-based sampling method to reduce the size of the intermediate data.
• We conduct extensive experiments on different benchmarks with our proposed methods. The evaluation results show that our method can achieve up to 55% time saving and 51% energy saving with 0.1% accuracy loss.
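One plausible reading of the distinguish-based sampling idea can be sketched as follows. This is an assumption-laden illustration, not the paper's algorithm: it scores each validation sample by how much a few probe subnets disagree on it (samples that all subnets get right, or all get wrong, carry no signal for ranking candidates) and keeps only the top-k, so only their intermediate data need be cached.

```python
def select_important(correct, k):
    """Pick the k validation samples that best separate candidate subnets.

    correct[i][j] = 1 if probe subnet i classifies sample j correctly,
    else 0 (a hypothetical precomputed correctness matrix).
    """
    n_subnets = len(correct)
    n_samples = len(correct[0])
    # Disagreement score: 0 when all subnets agree, maximal at an even split.
    scores = [min(sum(row[j] for row in correct),
                  n_subnets - sum(row[j] for row in correct))
              for j in range(n_samples)]
    ranked = sorted(range(n_samples), key=lambda j: -scores[j])
    return sorted(ranked[:k])   # indices of the retained samples
```

Since the important samples are shared across subnets (as the paper observes empirically), this index set can be computed once and applied to every candidate's cached prefix output.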

2. REVIEW OF ONE-SHOT NAS

Different from traditional neural network training, which aims to optimize weights given a fixed network architecture, NAS seeks to optimize both weights and architecture at the same time. Conventional NAS methods (Zoph & Le, 2016; Baker et al., 2016; Zhong et al., 2018; Zela et al., 2018) have tried to solve these two optimization problems in a nested manner. However, such methods are usually prohibitively expensive because each architecture sampled from the search space has to be trained from scratch and evaluated separately. Recent works (Bender et al., 2018; Pham et al., 2018; Cai et al., 2019; 2018) have proposed a weight sharing strategy to reduce the high cost of the architecture and weight searching procedure in conventional NAS. As one of the most popular weight sharing techniques, one-shot NAS achieves unprecedented search efficiency by decoupling the whole searching process into two stages: supernet training (Fig. 1(a)) and subnet searching (Fig. 1(b)). One-shot NAS encodes the search space into a supernet and trains it only once. Then it allows

