TOWARDS PERFORMANCE-MAXIMIZING NETWORK PRUNING VIA GLOBAL CHANNEL ATTENTION

Abstract

Network pruning has attracted increasing attention recently for its capability of significantly reducing the computational complexity of large-scale neural networks while retaining the high performance of the reference deep models. Compared to static pruning, which removes the same network redundancy for all samples, dynamic pruning determines and eliminates model redundancy adaptively, obtaining a different sub-network for each input and achieving state-of-the-art performance with a higher compression ratio. However, since the system has to preserve the complete network information for run-time pruning, dynamic pruning methods are usually not memory-efficient. In this paper, our interest is to explore a static alternative to conventional static pruning methods, dubbed GlobalPru, that takes into account both the compression ratio and model performance maximization. Specifically, we propose a novel channel-attention-based learn-to-rank algorithm that learns the optimal consistent (global) channel attention prior among all sample-specific (local) channel saliencies; based on this prior, a Bayesian-based regularization forces each sample-specific channel saliency to reach an agreement on the global channel ranking simultaneously with model training. Hence, all samples empirically share the same pruning priority of channels, enabling channel pruning with minimal performance loss. Extensive experiments demonstrate that the proposed GlobalPru outperforms state-of-the-art static and dynamic pruning methods by significant margins.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved great success in many visual recognition tasks, including image classification He et al. (2016), object detection Ren et al. (2015), and image segmentation Dai et al. (2016). The success of CNNs is inseparable from an excessive number of parameters that are well organized to perform sophisticated computations, which conflicts with the increasing demand for deploying these resource-consuming applications on resource-limited devices. Network pruning addresses this conflict by removing model redundancy either statically or dynamically. However, both existing static methods and dynamic methods have limitations. In particular, given that channel redundancy is highly sample-dependent, static pruning methods may remove channels that are not redundant for certain images; consequently, static methods refrain from larger pruning rates to avoid a significant accuracy drop. To tackle the issue of image-specific redundant channels, dynamic pruning methods remove different channels for different images, and in this way achieve state-of-the-art pruning ratios without significantly sacrificing performance. Despite this significant advantage, dynamic pruning usually requires preserving the full original model during inference, which restricts its practical deployment on resource-limited devices. In this paper, we propose a new paradigm of static pruning named GlobalPru. Although static by construction, GlobalPru tackles the issue of image-specific redundant channels by making all images share the same ranking of channels with respect to redundancy. In other words, GlobalPru forces all images to agree on the same ranking of channel saliency (referred to as the global channel ranking), reducing image-specific channel redundancy to the greatest extent possible.
By removing the channels with the lowest global rankings, GlobalPru avoids a problem of existing static pruning methods, which, with high probability, remove channels that are important for specific images while retaining less important ones. More specifically, we first propose a novel global channel attention mechanism. Conventional channel attention is local in the sense that the ranking of channels with respect to their importance is image-specific. In contrast, our global channel attention mechanism identifies a global channel ranking across all samples in the training set through a learn-to-rank regularization. In detail, to let the static GlobalPru approach the maximum image-specific compression ratio of dynamic pruning and to stabilize the training process, we first use a majority-voting-based strategy to specify a candidate global ranking. Then, given this ranking, all image-specific channel rankings are forced to agree on it via the learn-to-rank regularization; once all image-specific channel rankings coincide with the given ranking, it becomes the global ranking. As a result of exposing the global ranking to all images during training, GlobalPru also avoids the disadvantage of existing dynamic pruning methods, which must store the entire model to decide image-specific channel rankings during inference, and performs more efficient pruning on globally ordered channels. Our contributions are summarized as follows:

• We propose GlobalPru, a static network pruning method. GlobalPru tackles the issue of image-specific channel redundancy faced by existing static methods by learning a global ranking of channels w.r.t. redundancy. GlobalPru produces a pruned network, making it a more memory-efficient solution than existing dynamic methods.
• To the best of our knowledge, we are the first to propose a global channel attention mechanism in which all images share the same ranking of channels w.r.t. importance. Instead of repeatedly computing image-specific channel rankings as in existing local attention mechanisms, our global attention enriches the representation capacity of models and therefore greatly improves pruning efficiency.

• Extensive experimental results show that GlobalPru achieves state-of-the-art performance on almost all popular convolutional neural network architectures.
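To make the two key ingredients above concrete, the following is a minimal numpy sketch of (i) a Borda-count-style majority vote that aggregates image-specific channel rankings into one candidate global ranking, and (ii) a hinge-style pairwise learn-to-rank penalty that measures how far each image's saliencies are from agreeing with that ranking. The paper's exact voting rule and regularizer may differ; all function names here are illustrative, not from the paper.

```python
import numpy as np

def image_rankings(saliency):
    # saliency: (n_images, n_channels) per-image channel saliency scores,
    # higher = more important. Returns channel indices sorted by
    # descending saliency for each image (the "local" rankings).
    return np.argsort(-saliency, axis=1)

def majority_vote_ranking(saliency):
    # Borda-count-style vote (one plausible majority-voting scheme):
    # a channel ranked r-th by an image receives (n_channels - r) points;
    # the candidate global ranking sorts channels by total points.
    n_images, n_channels = saliency.shape
    points = np.zeros(n_channels)
    for ranks in image_rankings(saliency):
        for r, ch in enumerate(ranks):
            points[ch] += n_channels - r
    return np.argsort(-points)

def pairwise_rank_loss(saliency, global_ranking, margin=0.0):
    # Hinge-style learn-to-rank penalty: for every channel pair where the
    # global ranking places channel `hi` above channel `lo`, penalize
    # images whose saliencies violate s[hi] >= s[lo] + margin.
    loss = 0.0
    n = len(global_ranking)
    for a in range(n):
        for b in range(a + 1, n):
            hi, lo = global_ranking[a], global_ranking[b]
            loss += np.maximum(0.0, margin + saliency[:, lo] - saliency[:, hi]).sum()
    return loss / saliency.shape[0]

# Toy example: 3 images, 3 channels; images 1-2 rank channels 0 > 2 > 1,
# image 3 ranks 0 > 1 > 2, so the vote yields 0 > 2 > 1 and only image 3
# incurs a (small) disagreement penalty.
saliency = np.array([[0.9, 0.1, 0.5],
                     [0.8, 0.2, 0.6],
                     [0.7, 0.4, 0.3]])
g = majority_vote_ranking(saliency)
```

In the actual method this penalty would be added to the training loss so that, as training proceeds, the local rankings converge to the global one and the lowest-ranked channels can be pruned statically.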

2. RELATED WORK
Network pruning has been proposed to effectively reduce a deep model's resource cost without a significant accuracy drop. Unstructured pruning methods Han et al. (2015b;a) usually reach a higher compression rate, but rely on dedicated hardware/libraries to realize actual speedups. In contrast, structured pruning methods Li et al. (2016); He et al. (2017b) preserve the original convolutional structure and are more hardware-friendly. Considering the greater reduction in floating-point operations (FLOPs) and the broader hardware compatibility, this research focuses on channel pruning. Existing methods perform channel pruning either statically or dynamically. Static pruning methods remove the same channels for all images Molchanov et al. (2019); Tang et al. (2020), while dynamic pruning removes different channels for different images Rao et al. (2018); Tang et al. (2021b).
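As a reference point for static channel pruning, the sketch below scores each output channel by the L1 norm of its filter, in the spirit of Li et al. (2016), and keeps the same top-scoring channels for every input. This is a simplified illustration, not the cited implementation.

```python
import numpy as np

def l1_channel_saliency(weights):
    # weights: (out_channels, in_channels, k, k) convolutional filter bank.
    # Score each output channel by the L1 norm of its filter, following
    # the criterion popularized by Li et al. (2016).
    return np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)

def prune_channels(weights, keep_ratio=0.5):
    # Static structured pruning: the same top-scoring channels are kept
    # for every input image, so the pruned filter bank can be stored and
    # deployed directly (no full model needed at inference).
    saliency = l1_channel_saliency(weights)
    n_keep = max(1, int(round(keep_ratio * weights.shape[0])))
    keep = np.sort(np.argsort(-saliency)[:n_keep])
    return weights[keep], keep

# Toy filter bank with per-channel L1 norms [1, 4, 2, 3].
weights = np.zeros((4, 1, 1, 1))
weights[:, 0, 0, 0] = [1.0, 4.0, 2.0, 3.0]
pruned, keep = prune_channels(weights, keep_ratio=0.5)
```

Because the kept channel set is input-independent, the accuracy cost of such a criterion grows with the pruning rate, which is exactly the limitation GlobalPru targets.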

STATIC PRUNING & DYNAMIC PRUNING

As the most traditional and classic model pruning approach, static pruning shares one compact model among all samples Wen et al. (2016); Liu et al. (2017a); Liebenwein et al. (2019); Molchanov et al. (2019); Tang et al. (2020). Specifically, static methods select pruning results through trade-offs across different samples, so the final compact models have limited representation capacity and suffer an obvious accuracy drop at large pruning rates. Recently, some works have turned to pursuing the ultimate pruning rate by excavating sample-wise model redundancy, known as dynamic pruning. Dynamic pruning generates different compact models for different samples Dong et al. (2017); Gao et al. (2018); Hua et al. (2019); Rao et al. (2018); Tang et al. (2021b). In effect, dynamic methods learn a path-decision module that finds the optimal model path for each input during inference. For example, the state-of-the-art work Liu et al. (2019) investigates a feature decay regularization to identify informative features for different samples, thereby imposing sparsity on intermediate feature maps. Tang et al. (2021a) further improves dynamic pruning efficiency by embedding the manifold information of all samples into the space of pruned networks. Although dynamic methods achieve higher compression rates, most of them are not memory-efficient because they require deploying the full model even at the inference stage.
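The per-input path decision described above can be sketched as a tiny gating head that scores channels from a pooled feature map and keeps the top-k for that specific input. This is a generic illustration of dynamic channel gating (the names `gate_w`/`gate_b` are hypothetical, not from any cited method); note that both the gate parameters and all original channels must stay in memory at inference time, which is the memory-efficiency drawback discussed in the text.

```python
import numpy as np

def dynamic_channel_mask(features, gate_w, gate_b, keep_ratio=0.5):
    # features: (channels, h, w) activation map for ONE input image.
    # gate_w: (channels, channels), gate_b: (channels,) -- a linear
    # "path-decision" head that must be kept alongside the full model.
    pooled = features.mean(axis=(1, 2))       # global average pooling
    scores = gate_w @ pooled + gate_b         # per-channel gate logits
    k = max(1, int(round(keep_ratio * len(scores))))
    mask = np.zeros(len(scores))
    mask[np.argsort(-scores)[:k]] = 1.0       # keep top-k channels for this input
    return mask

# Toy input: 4 constant channels with means [1, 2, 3, 4]; an identity
# gate then keeps the two highest-activation channels for this image.
features = np.stack([np.full((2, 2), v) for v in [1.0, 2.0, 3.0, 4.0]])
mask = dynamic_channel_mask(features, np.eye(4), np.zeros(4), keep_ratio=0.5)
```

A different image would generally yield a different mask, which is why dynamic methods cannot simply discard the unselected channels ahead of time, whereas GlobalPru's shared global ranking makes a single statically pruned network sufficient.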

