TIERED PRUNING FOR EFFICIENT DIFFERENTIABLE INFERENCE-AWARE NEURAL ARCHITECTURE SEARCH

Abstract

We propose three novel pruning techniques to improve the cost and results of Inference-Aware Differentiable Neural Architecture Search (DNAS). First, we introduce Prunode, a stochastic bi-path building block for DNAS, which can search over inner hidden dimensions with O(1) memory and compute complexity. Second, we present an algorithm for pruning blocks within a stochastic layer of the SuperNet during the search. Third, we describe a novel technique for pruning unnecessary stochastic layers during the search. The optimized models resulting from the search are called PRUNET and establish a new state-of-the-art Pareto frontier for the NVIDIA V100 in terms of inference latency for ImageNet Top-1 image classification accuracy. As a backbone, PRUNET also outperforms GPUNet and EfficientNet on the COCO object detection task in terms of inference latency relative to mean Average Precision (mAP).



New concepts in NAS succeed with the ever-growing search space, increasing the dimensionality and complexity of the problem. Balancing the search cost and the quality of the search is hence essential for employing NAS in practice. Traditional NAS methods require evaluating many candidate networks to find optimized ones with respect to the desired metric. This approach can be successfully applied to simple problems like CIFAR-10 Krizhevsky et al. (2010), but for more demanding problems these methods may turn out to be computationally prohibitive. In our experiments, we focus on a search space based on a state-of-the-art network to showcase the value of our methodology. We aim to revise the weight-sharing approach to save resources and improve the method's reliability by introducing the novel pruning techniques described below.

Prunode: pruning the internal structure of a block. In the classical SuperNet approach, the search space is defined by the initial SuperNet architecture. That means GPU memory capacity significantly limits the search space size. In many practical use cases, one would limit themselves to just a few candidates per block. For example, FBNet Wu et al. (2019) defined nine candidates: a skip connection and 8 Inverted Residual Blocks (IRB) with (expansion, kernel, group) ∈ {(1, 3, 1), (1, 3, 2), (1, 5, 1), (1, 5, 2), (3, 3, 1), (3, 5, 1), (6, 3, 1), (6, 5, 1)}. In particular, one can see that the expansion parameter was limited to only three options: 1, 3, and 6, while more options could be considered: not only larger values but also denser sampling using non-integer values. Each additional parameter value increases memory and compute costs, while only the promising ones can improve the search. Selecting the right parameters for a search space therefore requires domain knowledge.
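To make the cost trade-off concrete, the following toy enumeration (our own illustration, not code from the paper; block names and grids are invented) shows how the per-layer candidate list grows as the expansion grid is made denser, while a classical SuperNet layer must hold one block per candidate:

```python
# Illustrative only: candidate count per layer as the expansion grid densifies.
# A classical SuperNet layer instantiates every candidate, so its memory and
# compute grow with len(candidates); a Prunode always evaluates just two.
from itertools import product

def irb_candidates(expansions, kernels=(3, 5)):
    """Enumerate (expansion, kernel) Inverted Residual Block candidates."""
    return list(product(expansions, kernels))

coarse = irb_candidates(expansions=(1, 3, 6))           # FBNet-style grid
dense = irb_candidates(expansions=(1, 2, 3, 4, 5, 6))   # denser integer grid

print(len(coarse), len(dense))  # 6 vs. 12 candidate blocks per layer
```

Doubling the expansion options doubles the per-layer block count (and memory) in the classical approach, which is exactly the scaling the Prunode below avoids.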
To solve this problem, we introduce a special multi-block called Prunode, which optimizes the value of parameters such as the expansion parameter in the IRB block. The computation and memory cost of a Prunode is equal to the cost of calculating two candidates. Essentially, in each iteration the Prunode emulates just two candidates, each with a different number of channels in its internal structure. These candidates are modified based on the current architecture weights. The procedure encourages convergence towards an optimal number of channels.

Pruning blocks within a stochastic layer. In the classical SuperNet approach, all candidates are trained together throughout the search procedure, but ultimately only one or a few candidates are sampled as the result of the search. Hence, large amounts of resources are devoted to training blocks that are ultimately unused. Moreover, since they are trained together, results can be biased due to co-adaptation among operations Bender et al. (2018). We introduce progressive SuperNet pruning based on trained architecture weights to address this problem. This methodology removes blocks from the search space when the likelihood of a block being sampled falls below a linearly changing threshold. Reducing the size of the search space saves unnecessary computation and reduces the co-adaptation among operations.

Pruning unnecessary stochastic layers. By default, layer-wise SuperNet approaches force all networks that can be sampled from the search space to have the same number of layers, which is very limiting. That is why it is common to use a skip connection as an alternative to residual blocks in order to mimic shallower networks. Unfortunately, the output of skip connection blocks provides biased information when averaged with the outputs of other blocks. Because of this, the SuperNet may tend to sample networks that are shallower than optimal. To solve this problem, we provide a novel method for skipping whole layers in a SuperNet.
It introduces the skip connection to the SuperNet during the procedure. Because of that, the skip connection is not present in the search space at the beginning of the search. Once the skip connection is added to the search space, the outputs of the remaining blocks are multiplied by coefficients.
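The block-pruning rule above can be sketched as follows. This is a minimal reading of the described procedure, not the authors' implementation: a block is dropped once its sampling probability (softmax of the architecture weights) falls below a threshold that ramps linearly over the search; all numeric values and block names are illustrative.

```python
# Minimal sketch of pruning blocks from a stochastic layer via a
# linearly changing threshold on sampling probability.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def prune_step(arch_logits, block_ids, step, total_steps,
               start_thresh=0.0, end_thresh=0.15):
    """Return the (logits, ids) of blocks that survive this step's threshold."""
    # Threshold ramps linearly from start_thresh to end_thresh over the search.
    thresh = start_thresh + (end_thresh - start_thresh) * step / total_steps
    probs = softmax(arch_logits)
    kept = [(l, b) for l, b, p in zip(arch_logits, block_ids, probs) if p >= thresh]
    if not kept:  # always keep at least the most likely block
        best = max(range(len(probs)), key=probs.__getitem__)
        kept = [(arch_logits[best], block_ids[best])]
    logits, ids = zip(*kept)
    return list(logits), list(ids)

logits = [2.0, 0.1, -1.5, -2.0]
ids = ["irb_e3_k3", "irb_e6_k3", "irb_e3_k5", "skip"]
for step in range(1, 11):
    logits, ids = prune_step(logits, ids, step, total_steps=10)
print(ids)  # low-probability blocks are progressively pruned
```

Because the threshold starts near zero, weak blocks still receive some training early on; as it rises, the layer converges toward a single surviving candidate, saving the compute otherwise spent on blocks that would never be sampled.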

2. RELATED WORKS

In the NAS literature, a widely known SuperNet approach Liu et al. (2018b); Wu et al. (2019) constructs a stochastic network.



Figure 1: PRUNET establishes a new state-of-the-art Pareto frontier in terms of inference latency for ImageNet Top-1 image classification accuracy.

Neural Architecture Search (NAS) is a well-established technique in Deep Learning (DL); conceptually, it comprises a search space of permissible neural architectures, a search strategy to sample architectures from this space, and an evaluation method to assess the performance of the selected architectures. For practical reasons, Inference-Aware Neural Architecture Search is the cornerstone of the modern Deep Learning application deployment process. Wang et al. (2022); Wu et al. (2019); Yang et al. (2018) use NAS to directly optimize inference-specific metrics (e.g., latency) on targeted devices instead of limiting the model's FLOPs or other proxies. Inference-Aware NAS streamlines the development-to-deployment process.
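One common way such inference-aware methods fold a device metric into a differentiable search is an additive latency penalty on the task loss, with expected latency read from a per-block lookup table measured on the target hardware. The sketch below is a hedged illustration of that general pattern (latency values, block names, and the weighting are invented, not taken from the paper):

```python
# Hedged sketch of an inference-aware search objective: task loss plus a
# penalty on expected latency, computed from a measured per-block latency
# table (values below are illustrative, not real measurements).

# Measured per-block latencies in ms on the target device (illustrative).
LATENCY_LUT = {"irb_e1_k3": 0.11, "irb_e3_k3": 0.25, "irb_e6_k5": 0.62, "skip": 0.01}

def expected_latency(layer_probs):
    """Expected latency: sum over layers of probability-weighted block latencies."""
    return sum(p * LATENCY_LUT[b] for layer in layer_probs for b, p in layer.items())

def inference_aware_loss(task_loss, layer_probs, alpha=0.1):
    # alpha trades off accuracy against on-device latency.
    return task_loss + alpha * expected_latency(layer_probs)

# Two stochastic layers with current architecture (sampling) probabilities.
probs = [{"irb_e3_k3": 0.7, "skip": 0.3}, {"irb_e6_k5": 0.9, "irb_e1_k3": 0.1}]
print(inference_aware_loss(1.0, probs))
```

Because the expected latency is a differentiable function of the architecture probabilities, its gradient steers the search toward faster blocks on the actual target device rather than toward a FLOPs proxy.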

To minimize this computational cost, recent research has focused on partial training Falkner et al. (2018); Li et al. (2020a); Luo et al. (2018), performing network morphism Cai et al. (2018a); Jin et al. (2019); Molchanov et al. (2021) instead of training from scratch, or training many candidates at the same time by sharing weights Pham et al. (2018). These approaches can save computational time, but their reliability is questionable Bender et al. (2018); Xiang et al. (2021); Yu et al. (2021); Liang et al. (2019); Chen et al. (2019); Zela et al. (2019), i.e., the final result can still be improved.

At the end of the architecture search, the final non-stochastic network is sampled from the SuperNet using differentiable architecture parameters. The PRUNET algorithm utilizes this scheme: it is based on the weight-sharing design Cai et al. (2019); Wan et al. (2020) and relies on the Gumbel-Softmax distribution Jang et al. (2016). The PRUNET algorithm is agnostic to the choice between one-shot NAS Liu et al. (2018b); Pham et al. (2018); Wu et al. (2019), where only one SuperNet is trained, and few-shot NAS Zhao et al. (2021), where multiple SuperNets are trained to improve accuracy. In this work, we evaluate our method on a search space based on the state-of-the-art GPUNet model Wang et al. (2022). We follow its structure, including the number of channels, the number of layers, and the basic block types. Other methods incorporate NAS Dai et al. (2019); Dong et al. (2018); Tan et al. (2019) but remain computationally expensive. Differentiable NAS Cai et al. (2018b); Vahdat et al. (2020); Wu et al. (2019) significantly reduces the training cost. MobileNets Howard et al. (2017); Sandler et al. (2018) started to discuss the importance of model size and latency on embedded systems while

