TOWARDS PERFORMANCE-MAXIMIZING NETWORK PRUNING VIA GLOBAL CHANNEL ATTENTION

Abstract

Network pruning has attracted increasing attention recently for its capability of significantly reducing the computational complexity of large-scale neural networks while retaining the high performance of the reference deep models. Compared to static pruning, which removes the same network redundancy for all samples, dynamic pruning can determine and eliminate model redundancy adaptively, obtaining a different sub-network for each input and thereby achieving state-of-the-art performance at a higher compression ratio. However, since the system has to preserve the complete network information for run-time pruning, dynamic pruning methods are usually not memory-efficient. In this paper, our interest is to explore a static alternative, dubbed GlobalPru, to conventional static pruning methods that takes into account both the compression ratio and model performance maximization. Specifically, we propose a novel channel attention-based learn-to-rank algorithm to learn the optimal consistent (global) channel attention prior among all sample-specific (local) channel saliencies, based on which a Bayesian-based regularization forces each sample-specific channel saliency to reach an agreement on the global channel ranking simultaneously with model training. Hence, all samples can empirically share the same pruning priority of channels, achieving channel pruning with minimal performance loss. Extensive experiments demonstrate that the proposed GlobalPru outperforms state-of-the-art static and dynamic pruning methods by significant margins.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved great success in many visual recognition tasks, including image classification He et al. (2016), object detection Ren et al. (2015), and image segmentation Dai et al. (2016). The success of CNNs is inseparable from an excessive number of parameters that are well organized to perform sophisticated computations, which conflicts with the increasing demand for deploying these resource-consuming applications on resource-limited devices.

Network pruning has been proposed to effectively reduce a deep model's resource cost without a significant accuracy drop. Unstructured pruning methods Han et al. (2015b;a) usually reach a higher compression rate, but rely on dedicated hardware/libraries to realize actual speedups. In contrast, structured pruning methods Li et al. (2016); He et al. (2017b) preserve the original convolutional structure and are more hardware-friendly. Considering the greater reduction in floating-point operations (FLOPs) and broader hardware compatibility, this research focuses on channel pruning. Existing methods perform channel pruning either statically or dynamically. Static pruning methods remove the same channels for all images Molchanov et al. (2019); Tang et al. (2020), while dynamic pruning removes different channels for different images Rao et al. (2018); Tang et al. (2021b).

However, both existing static methods and dynamic methods have limitations. In particular, given that channel redundancy is highly sample-dependent, static pruning methods may remove channels that are not redundant for certain images. Consequently, static methods must refrain from larger pruning rates to avoid a significant accuracy drop. To tackle the issue of image-specific redundant channels, dynamic pruning methods remove image-specific channels; in this way, they achieve state-of-the-art pruning ratios without significantly sacrificing performance. Despite this significant advantage, dynamic pruning usually requires preserving the full original model during inference, which restricts its practical deployment on resource-limited devices.
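To make the static channel-pruning setting concrete, the following is a minimal sketch in NumPy. It ranks output channels of a convolutional layer by a simple L1-norm saliency and keeps the top fraction for all inputs. Note that the L1 criterion, the function names, and the shapes here are illustrative assumptions, not GlobalPru's learned global channel attention.

```python
# Minimal sketch of static channel pruning: the same channels are removed
# for every input image. The L1-norm saliency used here is an illustrative
# stand-in, not the paper's learned global channel attention.
import numpy as np

def channel_saliency(conv_weight: np.ndarray) -> np.ndarray:
    """L1 norm per output channel; weight shape is (out_ch, in_ch, kh, kw)."""
    return np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)

def prune_channels(conv_weight: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the top `keep_ratio` fraction of output channels by saliency."""
    saliency = channel_saliency(conv_weight)
    n_keep = max(1, int(round(keep_ratio * conv_weight.shape[0])))
    # Indices of the most salient channels, kept in their original order.
    keep = np.sort(np.argsort(saliency)[::-1][:n_keep])
    return conv_weight[keep]

# Example: a 3x3 conv layer with 8 output channels, statically pruned to half.
w = np.random.default_rng(0).normal(size=(8, 16, 3, 3))
w_pruned = prune_channels(w, keep_ratio=0.5)
print(w_pruned.shape)  # (4, 16, 3, 3)
```

Because the kept channel set is fixed ahead of time, the pruned layer can be materialized as a physically smaller weight tensor; a dynamic method would instead have to keep the full 8-channel weight in memory and select channels per input at inference time.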

