PRUNING BY ACTIVE ATTENTION MANIPULATION

Abstract

Filter pruning of a CNN is typically achieved by applying discrete masks to the CNN's filter weights or activation maps, post-training. Here, we present a new filter-importance-scoring concept named pruning by active attention manipulation (PAAM), which sparsifies the CNN's set of filters through a particular attention mechanism, during training. PAAM learns analog filter scores from the filter weights by optimizing a cost function regularized by an additive term in the scores. As the filters are not independent, we use attention to dynamically learn their correlations. Moreover, by training the pruning scores of all layers simultaneously, PAAM can account for layer inter-dependencies, which is essential to finding a performant sparse sub-network. PAAM can also train and generate a pruned network from scratch in a straightforward, one-stage training process, without requiring a pre-trained network. Finally, PAAM does not need layer-specific hyperparameters or pre-defined layer budgets, since it can implicitly determine the appropriate number of filters in each layer. Our experimental results on different network architectures suggest that PAAM outperforms state-of-the-art (SOTA) structured-pruning methods. On the CIFAR-10 dataset, without requiring a pre-trained baseline network, we obtain accuracy gains of 1.02% and 1.19% with parameter reductions of 52.3% and 54% on ResNet56 and ResNet110, respectively. Similarly, on the ImageNet dataset, PAAM achieves a 1.06% accuracy gain while pruning 51.1% of the parameters on ResNet50. On CIFAR-10, these parameter reductions exceed the SOTA by margins of 9.5% and 6.6%, respectively, and on ImageNet by a margin of 11%.



Both data-free and data-informed methods generally determine the importance of a filter in a given layer locally. However, filter importance is a global property, as it changes relative to the selection of the filters in the previous and next layers. Moreover, determining the optimal filter budget for each layer (a vital element of any pruning method) is also a challenge that all local, importance-metric methods face. The most straightforward way to overcome these challenges is to evaluate the network loss with and without each combination of $k$ candidate filters out of $N$. However, this approach would require the evaluation of $\binom{N}{k}$ sub-networks, which is in practice impossible to achieve.

Training-aware pruning methods aim to learn binary masks for turning on and off each filter. A regularization metric often accompanies them, with a penalty guiding the masks to the desired budget. Mask learning, simultaneously for all filters, is an effective method for identifying a globally optimal subset of the network. However, due to the discrete nature of the filters and binary masks, the optimization problem is generally non-convex and NP-hard. A simple trick of many recent works Gao et al. (2020; 2021); Li et al. (2022) is to use straight-through estimators Bengio et al. (2013) to calculate the derivatives, by considering binary functions as identities. While ingenious, this precludes learning the relative importance of filters among each other. Even more importantly, the on-off bits within the masks are assumed to be independent, which is a gross oversimplification.

This paper solves the above problems by introducing PAAM, a novel end-to-end pruning method, by active attention manipulation. PAAM also employs an l1 regularization technique, encouraging filter-score decrease. However, PAAM scores are analog, and multiply the activation maps during score training. Moreover, a proper score spread is ensured through a specialized activation function.
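The analog-score idea above can be sketched in a few lines. This is a minimal, hedged illustration of score-modulated activations with an additive l1 penalty; all function names and shapes are assumptions for the sketch, and the actual PAAM activation function and attention module are not shown.

```python
import numpy as np

# Illustrative sketch (not the actual PAAM implementation): analog
# per-filter scores multiply the activation maps channel-wise, and an
# additive l1 term on the scores regularizes the training loss.

def scored_activations(feature_maps, scores):
    """feature_maps: (batch, channels, H, W); scores: (channels,)."""
    return feature_maps * scores.reshape(1, -1, 1, 1)

def regularized_loss(task_loss, scores, lam=1e-3):
    """Task loss plus an additive l1 penalty encouraging score decrease."""
    return task_loss + lam * np.abs(scores).sum()

x = np.ones((2, 4, 3, 3))           # dummy activation maps
s = np.array([0.9, 0.1, 0.5, 0.0])  # analog filter scores
y = scored_activations(x, s)
loss = regularized_loss(task_loss=1.0, scores=s)
```

Because the scores remain continuous, gradients flow through them directly, without the straight-through approximation needed for binary masks.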
This allows PAAM to learn the relative importance of filters globally, through gradient descent. Moreover, the scores are not considered independent; their hidden correlations are learned in a scalable fashion by employing an attention mechanism specifically tuned for filter scores. Given a global pruning budget, PAAM finds the optimal pruning threshold from the cumulative histogram of filter scores and keeps only the filters within the budget. This relieves PAAM from having to determine per-layer allocation budgets in advance. PAAM then retrains the network without considering the scores. This process is repeated until convergence. The PAAM pipeline is shown in Figures 1 and 2. Our experimental results show that PAAM yields higher pruning ratios while preserving higher accuracy. In summary, this work has the following contributions:

1. We introduce PAAM, an end-to-end algorithm that learns the importance scores directly from the network weights and filters. Our method allows extracting hidden correlations in
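The global-threshold step described above can be sketched as follows. This is a simplified illustration under assumed names and shapes: given per-filter scores concatenated across all layers and a global keep-budget, it finds the score threshold that keeps the top fraction of filters, from which per-layer budgets fall out implicitly.

```python
import numpy as np

# Illustrative sketch of thresholding from the global score distribution
# (a sorted-score equivalent of the cumulative histogram in the text).
# Names and the exact tie-breaking rule are assumptions for the sketch.

def global_threshold(layer_scores, keep_ratio):
    """Return the score threshold that keeps `keep_ratio` of all filters."""
    flat = np.sort(np.concatenate([s.ravel() for s in layer_scores]))[::-1]
    n_keep = max(1, int(round(keep_ratio * flat.size)))
    return flat[n_keep - 1]

layer_scores = [np.array([0.9, 0.1, 0.5]),
                np.array([0.7, 0.2, 0.05, 0.8])]
t = global_threshold(layer_scores, keep_ratio=0.5)
kept = [s >= t for s in layer_scores]  # per-layer masks, no per-layer budget
```

Note that each layer ends up with however many filters score above the global threshold, so no per-layer allocation needs to be specified in advance.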



Figure 2: PAAM learns the importance scores of the filters from the filter weights.

et al. (2020). As modern hardware is tuned towards dense computations, structured pruning offers a more favorable balance between accuracy and performance Hoefler et al. (2021). A very prominent family of structured-pruning methods is filter pruning. Choosing which filters to remove according to a carefully chosen importance metric (or filter score) is an essential part of any method in this family. Data-free methods rely solely on the values of the weights, or the network structure, to determine the importance of filters. Magnitude pruning, for example, is one of the simplest and most common of such methods: it prunes the filters that have the smallest weight values in the l1 norm. Data-informed methods focus on the feature maps generated from the training data (or a subset of samples) rather than the filters alone. These methods range from sensitivity-based approaches (which consider the statistical sensitivity of the output feature maps with respect to the input data Malik & Naumann (2020); Liebenwein et al. (2020)) to correlation-based methods (with an inter-channel perspective, keeping the least similar, i.e., least correlated, feature maps Sun et al. (2015); Sui et al. (2021)).
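Magnitude pruning, as described above, can be sketched in a few lines: rank each convolutional filter by the l1 norm of its weights and drop the smallest ones. Shapes, names, and the 50% keep-ratio are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of data-free magnitude pruning: one l1 score per filter,
# computed from the weights alone, with no training data involved.

def l1_filter_scores(weights):
    """weights: (out_channels, in_channels, kH, kW) -> one score per filter."""
    return np.abs(weights).sum(axis=(1, 2, 3))

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))             # 8 filters of a conv layer
scores = l1_filter_scores(w)
keep = np.argsort(scores)[len(scores) // 2:]  # keep the top half by l1 norm
```

This locality is exactly the limitation discussed earlier: the score of each filter is computed within its own layer, ignoring which filters survive in neighboring layers.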

