FTSO: EFFECTIVE NAS VIA FIRST TOPOLOGY SECOND OPERATOR

Abstract

Existing one-shot neural architecture search (NAS) methods generally rely on a giant super-net, which leads to heavy computational cost. Our method, named FTSO, separates the whole architecture search into two sub-steps. In the first step, we search only for the topology, and in the second step, we search only for the operators. FTSO not only reduces NAS's search time from days to 0.68 seconds, but also significantly improves the accuracy. Specifically, our experiments on ImageNet show that within merely 18 seconds, FTSO can achieve 76.4% testing accuracy, 1.5% higher than the baseline, PC-DARTS. In addition, FTSO can reach 97.77% testing accuracy on CIFAR10, 0.27% higher than the baseline, while saving 99.8% of the search time.

1. INTRODUCTION

Since the great success of AlexNet (Krizhevsky et al., 2012) in image classification, most modern machine learning models have been developed based on deep neural networks. A neural network's performance is largely determined by its architecture. Thus, in the past decade, a tremendous amount of work (Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016) has been done to investigate proper network architecture design. However, as networks have grown larger and larger, it has gradually become unaffordable to manually search for better architectures via trial and error, due to the expensive time and resource overhead. To ease this problem, a technique called neural architecture search (NAS) was introduced, which allows computers to search for better network architectures automatically instead of relying on human experts. Early reinforcement learning-based NAS methods (Zoph & Le, 2017; Baker et al., 2017; Zoph et al., 2018) typically use an RNN-based controller to sample candidate network architectures from the search space. Although these algorithms can provide promising accuracy, their computational cost is usually unaffordable, e.g., 1800 GPU-days are required for NASNet. To improve search efficiency, one-shot approaches (Pham et al., 2018; Cai et al., 2019; Liu et al., 2019) with parameter sharing were proposed. These methods first create a huge directed acyclic graph (DAG) super-net containing the whole search space. Then, the kernel weights are shared among all the sampled architectures via the super-net. This strategy makes it possible to measure a candidate architecture's performance without repeatedly retraining it from scratch. However, these algorithms suffer from the super-net's computational overhead. This problem is particularly severe for differentiable models (Liu et al., 2019; Xu et al., 2020).
Limited by current NAS algorithms' inefficiency, it is rather challenging to find satisfying network architectures on large-scale datasets and high-level tasks. For instance, current speed-oriented NAS approaches generally require days to accomplish one search trial on ImageNet, e.g., 8.3 GPU-days for ProxylessNAS (Cai et al., 2019) and 3.8 GPU-days for PC-DARTS (Xu et al., 2020). Therefore, we argue that it is essential to propose a new well-defined search space, which is not only expressive enough to cover the most powerful architectures, but also compact enough to filter out the poor architectures. Motivated by Shu et al. (2020), who demonstrated that randomly replacing operators in a found architecture does not harm its accuracy much, we believe that omitting the influence of operators and clustering architectures according to their topology not only causes no reduction in testing accuracy, but can also significantly benefit search efficiency. Thus, in this paper, we propose to separately search for the network topology and the operators. We name this new method Effective NAS via First Topology Second Operator (FTSO).

In this paper, we mathematically prove that FTSO reduces the required parameters by 5.3 × 10^7 and decreases the FLOPs per iteration by 1 × 10^5. Besides, FTSO significantly improves the accuracy over the baseline by greatly shrinking the search space, reducing the operator complexity by orders of magnitude and lowering the required search period from 50 epochs to one iteration; the Matthew effect is also eased. Furthermore, we empirically show that FTSO is superior in both efficiency and effectiveness, accomplishing the whole architecture search in 0.68 seconds. On ImageNet, FTSO can achieve 76.4% testing accuracy, 1.5% higher than the baseline, within merely 18 seconds. More importantly, when we search for only one iteration, FTSO can reach 75.64% testing accuracy, 0.74% higher than the baseline, in just 0.68 seconds. If we allow FTSO to search for 19 minutes, 76.42% Top-1 and 93.2% Top-5 testing accuracy can be obtained. In addition, FTSO can reach 97.77% testing accuracy on CIFAR10, 0.27% higher than the baseline, with 99.8% of the search time saved. Although in this paper we implement FTSO only within a continuous search space, we illustrate in Section 5 that FTSO can be seamlessly transferred to other NAS algorithms.

2. RELATED WORK

In general, existing NAS algorithms can be divided into three categories, namely, reinforcement learning-based, evolution-based and differentiable. Early reinforcement learning-based methods (Zoph & Le, 2017; Zoph et al., 2018) generally suffer from high computational cost and low-efficiency sampling. Instead of sampling a discrete architecture and then evaluating it, DARTS (Liu et al., 2019) treats the whole search space as a continuous super-net. It assigns every operator a real-valued weight and treats every node as the linear combination of all its transformed predecessors. To be specific, DARTS's search space is a directed acyclic graph (DAG) containing two input nodes inherited from previous cells, four intermediate nodes and one output node. Each node denotes one latent representation and each edge denotes a mixture of candidate operators. Every intermediate node x_j is calculated from all its predecessors x_i, i.e., x_j = \sum_{i<j} \sum_{o \in O} \frac{\exp(\alpha^o_{i,j})}{\sum_{o' \in O} \exp(\alpha^{o'}_{i,j})} o(x_i), where O denotes the collection of all candidate operators and \alpha^o_{i,j} denotes the weight of operator o from node i to node j. This strategy allows DARTS to directly use gradients to optimize the whole super-net. After the super-net converges, DARTS retains only the operators with the largest weights; in this way, the final discrete architecture is derived. The main defect of DARTS is that it needs to maintain and perform all calculations on a giant super-net, which inevitably leads to heavy computational overhead and over-fitting. Proposed to relieve the computational overhead of DARTS, DARTS-ES (Zela et al., 2020) reduces the number of search epochs via early stopping, according to the Hessian matrix's largest eigenvalue. PC-DARTS (Xu et al., 2020) decreases the FLOPs per iteration by feeding only a proportion of the input channels to the mixed operators while keeping the remainder unchanged, and normalizes the edge weights to stabilize the search.
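As a concrete illustration of DARTS's continuous relaxation (not part of FTSO), the mixed operator and node computation above can be sketched in pure Python. The scalar "operators" here are illustrative stand-ins, not DARTS's actual convolution and pooling candidates.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of weights."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy candidate operator set O, acting on a scalar feature.
OPS = [
    lambda x: x,        # identity / skip-connect
    lambda x: 2.0 * x,  # stand-in for a parametric op (e.g., a conv)
    lambda x: 0.0,      # the 'zero' op
]

def mixed_op(x, alpha):
    """DARTS mixed operator: softmax(alpha)-weighted sum of all candidates."""
    return sum(w * op(x) for w, op in zip(softmax(alpha), OPS))

def node_output(predecessors, alphas):
    """Intermediate node x_j = sum over predecessors i<j of the mixed op."""
    return sum(mixed_op(x_i, a) for x_i, a in zip(predecessors, alphas))

# Two predecessor nodes; the second edge's alpha strongly favours identity,
# so after discretization that edge would keep the identity operator.
x2 = node_output([1.0, 2.0], [[0.0, 0.0, 0.0], [10.0, -10.0, -10.0]])
```

Discretization then amounts to keeping, on each edge, the operator whose alpha (equivalently, softmax weight) is largest.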
To be specific, in PC-DARTS, every intermediate node x_j is computed from all its predecessors x_i, i.e., x_j = \sum_{i<j} \frac{\exp(\beta_{i,j})}{\sum_{i'<j} \exp(\beta_{i',j})} f_{i,j}(x_i), where \beta_{i,j} describes input node i's importance to node j, and f_{i,j} is the weighted sum of all the candidate operators' outputs between nodes i and j. Specifically, f_{i,j}(x_i; S_{i,j}) = \sum_{o \in O} \frac{\exp(\alpha^o_{i,j})}{\sum_{o' \in O} \exp(\alpha^{o'}_{i,j})} o(S_{i,j} * x_i) + (1 - S_{i,j}) * x_i, where S_{i,j} denotes a binary channel mask in which only 1/K of the elements are 1.
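The partial channel connection and edge normalization can likewise be sketched in pure Python; channels are modeled as a list of scalars, and the toy operator set and mask layout (the first 1/K channels) are illustrative assumptions, not PC-DARTS's actual sampling scheme.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of weights."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy candidate operators applied per channel.
OPS = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]

def partial_mixed_op(x, alpha, K=4):
    """f_{i,j}: only 1/K of the channels (mask S) enter the mixed op;
    the remaining channels bypass it unchanged: mixed(S*x) + (1-S)*x."""
    C = len(x)
    S = [1.0] * (C // K) + [0.0] * (C - C // K)  # binary mask, 1/K ones
    w = softmax(alpha)
    mixed = [sum(wk * op(s * c) for wk, op in zip(w, OPS))
             for s, c in zip(S, x)]
    return [m + (1.0 - s) * c for m, s, c in zip(mixed, S, x)]

def node_output(predecessors, alphas, betas, K=4):
    """x_j: edge-normalized (softmax over beta) sum over predecessors."""
    b = softmax(betas)
    out = [0.0] * len(predecessors[0])
    for bi, x_i, a in zip(b, predecessors, alphas):
        f = partial_mixed_op(x_i, a, K)
        out = [o + bi * fc for o, fc in zip(out, f)]
    return out
```

Because only 1/K of the channels pass through the mixed operator, the per-iteration FLOPs of the operator mixture drop roughly by a factor of K, while the bypassed channels keep the feature map intact.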

3. FTSO: EFFECTIVE NAS VIA FIRST TOPOLOGY SECOND OPERATOR

Existing NAS approaches generally suffer from heavy computational overhead and unsatisfying testing accuracy caused by the huge search space. These problems are especially severe in one-shot and differentiable methods because these algorithms need to maintain, and even perform all calculations directly on, the search space. To ease these problems, it is of great importance to investigate the correlations among different architectures and to shrink the search space according to this prior knowledge. We notice that there

