TIERED PRUNING FOR EFFICIENT DIFFERENTIABLE INFERENCE-AWARE NEURAL ARCHITECTURE SEARCH

Abstract

We propose three novel pruning techniques to reduce the cost and improve the results of Inference-Aware Differentiable Neural Architecture Search (DNAS). First, we introduce Prunode, a stochastic bi-path building block for DNAS that can search over inner hidden dimensions with O(1) memory and compute complexity. Second, we present an algorithm for pruning blocks within a stochastic layer of the SuperNet during the search. Third, we describe a novel technique for pruning unnecessary stochastic layers during the search. The optimized models resulting from the search, called PRUNET, establish a new state-of-the-art Pareto frontier for the NVIDIA V100 in terms of inference latency versus ImageNet Top-1 image classification accuracy. As a backbone, PRUNET also outperforms GPUNet and EfficientNet on the COCO object detection task in terms of inference latency relative to mean Average Precision (mAP).



Figure 1: PRUNET establishes a new state-of-the-art Pareto frontier in terms of inference latency for ImageNet Top-1 image classification accuracy.

Neural Architecture Search (NAS) is a well-established technique in Deep Learning (DL); conceptually, it comprises a search space of permissible neural architectures, a search strategy to sample architectures from this space, and an evaluation method to assess the performance of the selected architectures. For practical reasons, Inference-Aware Neural Architecture Search is the cornerstone of the modern Deep Learning application deployment process. Wang et al. (2022); Wu et al. (2019); Yang et al. (2018) use NAS to directly optimize inference-specific metrics (e.g., latency) on targeted devices instead of limiting the model's FLOPs or other proxies. Inference-Aware NAS streamlines the development-to-deployment process. New concepts in NAS succeed with the ever-growing search space, increasing the dimensionality and complexity of the problem. Balancing the search cost and the quality of the search is hence essential for employing NAS in practice.

Traditional NAS methods require evaluating many candidate networks to find ones optimized with respect to the desired metric. This approach can be successfully applied to simple problems like CIFAR-10 Krizhevsky et al. (2010), but for more demanding problems these methods may turn out to be computationally prohibitive.

To minimize this computational cost, recent research has focused on partial training Falkner et al. (2018); Li et al. (2020a); Luo et al. (2018), performing network morphism Cai et al. (2018a); Jin et al. (2019); Molchanov et al. (2021) instead of training from scratch, or training many candidates at the same time by sharing the weights Pham et al. (2018). These approaches can save computational time, but their reliability is questionable Bender et al. (2018); Xiang et al. (2021); Yu et al. (2021); Liang et al. (2019); Chen et al. (2019); Zela et al. (2019), i.e., the final result can still be improved. In our experiments, we focus on a search space based on a state-of-the-art network to showcase the value of our methodology. We aim to revise the
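To make the differentiable, inference-aware relaxation underlying DNAS concrete, the sketch below shows one stochastic layer as a softmax-weighted mixture of candidate operations, with an expected-latency term that can be added to the task loss. This is a minimal illustration, not the paper's implementation: the candidate ops, their latencies, and the architecture parameters `alpha` are all made up for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over architecture parameters."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical candidate operations for one stochastic layer.
# Real DNAS would use convolutions etc.; these scalars are stand-ins.
CANDIDATE_OPS = {
    "conv3x3": lambda x: x * 0.9,
    "conv5x5": lambda x: x * 1.1,
    "skip":    lambda x: x,
}

# Made-up per-op latencies measured on the target device (ms).
LATENCY_MS = {"conv3x3": 0.30, "conv5x5": 0.55, "skip": 0.01}

def mixed_op(x, alpha):
    """Differentiable relaxation of an op choice: weighted sum of all
    candidate outputs, plus the expected latency under those weights."""
    w = softmax(alpha)
    out = sum(wi * op(x) for wi, op in zip(w, CANDIDATE_OPS.values()))
    expected_latency = float(np.dot(w, [LATENCY_MS[k] for k in CANDIDATE_OPS]))
    return out, expected_latency

alpha = np.zeros(3)      # architecture params, trained jointly with weights
x = np.ones(4)
out, lat = mixed_op(x, alpha)
# Inference-aware training would minimize: task_loss + lambda * lat
```

Because the mixture weights are a differentiable function of `alpha`, gradient descent can trade accuracy against the measured latency of each candidate; after training, the op with the largest weight is kept and the rest are pruned.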

