IT-NAS: INTEGRATING LITE-TRANSFORMER INTO NAS FOR ARCHITECTURE SELECTION

Abstract

Neural Architecture Search (NAS) aims to find the best network in a predefined search space. However, most work focuses on the search strategy and little on the architecture selection process. Although weight-sharing NAS has greatly improved search efficiency, we observe that the architecture selection step remains unstable or circuitous. For instance, differentiable NAS may derive a suboptimal architecture due to the performance collapse caused by bilevel optimization, while One-shot NAS requires sampling and evaluating a large number of candidate structures. Recently, the self-attention mechanism has demonstrated strong long-range modeling capabilities. Considering that diverse operations are distributed throughout the search space, we propose leveraging the self-attention mechanism to capture the relationships among them and to determine which operation is superior to the others. To this end, we integrate a Lite-Transformer into NAS for architecture selection. Specifically, we regard the feature map of each candidate operation as a distinct patch and feed these patches into the Lite-Transformer module along with an additional Indicator Token (called IT). The cross-attention among the operations is extracted by the self-attention mechanism, and the importance of each candidate operation is then given by the softmax result between the query of the indicator token (IT) and the keys of the operational tokens. We experimentally demonstrate that our framework selects truly representative architectures in different search spaces, achieving a 2.39% test error on CIFAR-10 in the DARTS search space and a 24.1% test error on ImageNet in the ProxylessNAS (w/o SE module) search space, as well as stable and comparable performance in the NAS-Bench-201, S1-S4, and NAS-Bench-1Shot1 search spaces.

1. INTRODUCTION

Neural Architecture Search (NAS) is emerging as a new paradigm for designing network structures. It has been demonstrated to outperform manually designed networks in many tasks, including image classification (Zoph & Le, 2016; Zoph et al., 2018; Guo et al., 2020), object detection (Chen et al., 2019; Ghiasi et al., 2019), and semantic segmentation (Chen et al., 2018; Liu et al., 2019). The fundamental disadvantage of earlier NAS methods, which primarily relied on heuristic algorithms such as reinforcement learning (Baker et al., 2016; Bello et al., 2017; Zoph et al., 2018) or evolutionary algorithms (Real et al., 2017; Liu et al., 2018a; Real et al., 2019), is the need to train each architecture from scratch for validation, which impedes further progress in NAS. Fortunately, ENAS (Pham et al., 2018) proposed a weight-sharing mechanism that greatly improves search efficiency. More recently, differentiable NAS methods (Liu et al., 2018b; Xie et al., 2018) and One-shot NAS methods (Guo et al., 2020; Chu et al., 2021b; You et al., 2020) have become popular. Both first train a super-network and then derive the final architecture based on different strategies. However, the differentiable approach selects the target network with the largest architecture parameters, which cannot fully reflect the true operation strength (Wang et al., 2021a; Xie et al., 2021b); DARTS-PT (Wang et al., 2021a) mitigates this issue, but it requires fine-tuning after discretizing each edge, resulting in additional selection time. The One-shot method selects the optimal architecture by sampling and evaluating a large number of candidate structures, yielding time-consuming issues (Chen et al., 2021a); BN-NAS alleviates this, but is not suitable for search spaces without a batch normalization layer. We therefore explore whether there is a more appropriate way, for both popular types of search space, to select the architecture robustly and quickly, a question that has rarely been studied before.
The attention mechanism can be used to emphasize the important components of the input while ignoring trivial ones. The Transformer architecture (Vaswani et al., 2017) has reignited a boom in Computer Vision (CV) (Touvron et al., 2021; Liu et al., 2021) since ViT (Dosovitskiy et al., 2021) achieved performance competitive with Convolutional Neural Networks (CNNs). ViT simply slices the image into small patches and then models the cross-attention over the resulting long token sequence to locate key information, yielding better performance. Meanwhile, BatchFormer (Hou et al., 2022) introduces a batch transformer module applied along the batch dimension of each mini-batch to implicitly explore sample relationships. Inspired by this, we can intuitively treat the candidate operations in the NAS search space as patches and leverage the self-attention mechanism to describe the interactions among different operations. The optimal architecture can then be selected according to the self-attention weights. In summary, we propose integrating a Lite-Transformer into NAS for architecture selection. Specifically, in the cell-based search space, we insert the Lite-Transformer module on each edge to determine the optimal operation associated with that edge; in the chain-style search space, where the network is defined by a sequence of layers containing various choice blocks, the Lite-Transformer module is inserted at each layer to select the appropriate block. We adopt a broader perspective that treats candidate operations as patches. The patches are then linearly mapped and packed into three matrices, namely Q, K, and V. The softmax of the scaled dot product between Q and K is a square matrix termed the attention map, which can be regarded as the attention weights between different candidate operations.
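As a minimal illustration of this idea, the sketch below computes the attention map among candidate operations, treating each operation's flattened feature map as one token. The function name, dimensions, and (randomly initialized, untrained) projection weights are hypothetical, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def operation_attention(op_features, d_model=16, seed=0):
    """Treat each candidate operation's flattened feature map as one token,
    project the tokens to Q/K/V, and return the (N, N) attention map among
    the N operations, together with the attended outputs."""
    rng = np.random.default_rng(seed)
    n_ops, d_in = op_features.shape
    # In practice these projections are learned; random weights suffice here.
    w_q = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    w_k = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    w_v = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    q, k, v = op_features @ w_q, op_features @ w_k, op_features @ w_v
    # Row i of the attention map is operation i's attention over all operations.
    attn = softmax(q @ k.T / np.sqrt(d_model))
    return attn, attn @ v

# Example: 5 candidate operations, each with a 32-dim flattened feature map.
feats = np.random.default_rng(1).standard_normal((5, 32))
attn_map, _ = operation_attention(feats)
```

Each row of `attn_map` is a probability distribution over the candidate operations, which is the square attention map described above.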
Furthermore, the attention matrix is asymmetric: the mutual attention values between any two operations cannot determine which one is more favorable. To address this issue, we introduce an additional indicator token (called IT) and compute the cross-attention between IT and the other operational tokens, inspired by the observation in EViT (Liang et al., 2022) that the class token can be used to determine the importance of other tokens. In this way, the importance of each operation is represented by the row of the indicator token (IT) in the attention matrix. After the super-network, together with the Lite-Transformer module, is trained to convergence, we only need a single forward pass on the validation dataset to determine the optimal architecture by computing the self-attention weights based on the indicator token (IT). In general, our main contributions can be summarized as follows:

• We revisit the architecture selection process of neural architecture search (NAS) from a fresh perspective and, to our knowledge, are the first to integrate a Lite-Transformer into NAS for architecture selection, utilizing the self-attention mechanism to explore the interactions among candidate operations by regarding each one as an operational token.

• We introduce an additional indicator token (called IT) and compute the cross-attention between IT and the other operational tokens. In this case, the row of IT in the self-attention weight matrix can be used to establish the priority of each candidate operation.

• Experimental results show that IT-NAS achieves better performance in the DARTS and ProxylessNAS search spaces, as well as stable and comparable performance in NAS benchmarks, including S1-S4, NAS-Bench-201, and NAS-Bench-1Shot1.

• More comprehensive experiments demonstrate the robustness and effectiveness of IT-NAS in selecting architectures.
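To make the indicator-token selection rule concrete, the following hedged sketch (again with random, untrained projections; all names and dimensions are illustrative) prepends an IT to the operational tokens and reads the IT row of the attention matrix to rank the candidate operations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_with_indicator_token(op_tokens, it_token, d_model=16, seed=0):
    """Prepend the indicator token (IT) to the operational tokens, run one
    self-attention step, and use the IT row of the attention matrix
    (excluding IT's attention to itself) as per-operation importance."""
    tokens = np.vstack([it_token[None, :], op_tokens])  # IT sits at index 0
    rng = np.random.default_rng(seed)
    d_in = tokens.shape[1]
    w_q = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    w_k = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    q, k = tokens @ w_q, tokens @ w_k
    attn = softmax(q @ k.T / np.sqrt(d_model))
    importance = attn[0, 1:]  # IT's attention over the operational tokens
    return int(np.argmax(importance)), importance

rng = np.random.default_rng(2)
ops = rng.standard_normal((7, 32))  # 7 candidate operations on one edge
it = rng.standard_normal(32)        # the (normally learnable) indicator token
best, scores = select_with_indicator_token(ops, it)
```

Because the IT row is a single probability distribution over the operational tokens, it yields an unambiguous ranking, sidestepping the asymmetry of pairwise attention values.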
We also analyze, both theoretically and empirically, why the self-attention mechanism can effectively select optimal architectures, demonstrating the superiority of our proposed method.

2. RELATED WORKS

Neural Architecture Search. Neural Architecture Search (NAS) aims to select the optimal architecture in a pre-defined search space. Earlier NAS approaches (Baker et al., 2016; Zoph et al., 2018; Real et al., 2017; 2019) incurred substantial search overhead because each candidate architecture had to be trained from scratch. ENAS (Pham et al., 2018) first proposed the weight-sharing mechanism, such that weights can be shared among different sub-structures in the super-network,

