WASTE NOT, WANT NOT: ALL-ALIVE PRUNING FOR EXTREMELY SPARSE NETWORKS

Abstract

Network pruning has been widely adopted for reducing computational cost and memory consumption on low-resource devices. Recent studies show that saliency-based pruning can achieve high compression ratios (e.g., 80-90% of the parameters in the original network are removed) without significant accuracy loss. Nevertheless, finding well-trainable networks with extremely sparse parameters (e.g., < 10% of the parameters remaining) remains challenging for network pruning, as such networks are commonly believed to lack model capacity. In this work, we revisit the procedure of existing pruning methods and observe that dead connections, which do not contribute to model capacity, appear regardless of the pruning method. Motivated by this observation, we propose a novel pruning method, called all-alive pruning (AAP), which produces pruned networks containing only trainable weights. Notably, AAP is broadly applicable to various saliency-based pruning methods and model architectures. We demonstrate that AAP equipped with existing pruning methods (i.e., iterative pruning, one-shot pruning, and dynamic pruning) consistently improves the accuracy of the original methods at 128×-4096× compression ratios on three benchmark datasets.

1. INTRODUCTION

State-of-the-art neural networks have shown remarkable performance gains on various downstream tasks such as computer vision, natural language processing, and speech recognition. Because these networks are typically overparameterized, they incur high computational cost and memory consumption, which inherently hinders the deployment of models with excessive parameters on low-end devices such as mobile, embedded, and on-device systems. Network pruning (Reed, 1993) is the prevalent technique for compressing high-capacity models: unnecessary units such as weights or filters are removed while performance is maintained with minimal accuracy loss.

Existing pruning methods can be divided into two categories. 1) The first group enforces sparsity through model regularization (Chauvin, 1988; Weigend et al., 1990; Ishikawa, 1996; Molchanov et al., 2017a; Carreira-Perpiñán & Idelbayev, 2018; Louizos et al., 2018). It is theoretically well investigated and does not require network retraining. 2) The second group develops saliency criteria to prune less important units (Mozer & Smolensky, 1988; LeCun et al., 1989; Karnin, 1990; Hassibi et al., 1993; Han et al., 2015; Guo et al., 2016; Lee et al., 2019; Park et al., 2020; Evci et al., 2019). Owing to its simple operation and strong pruning performance, magnitude pruning (MP) (Han et al., 2015; Narang et al., 2017; Zhu & Gupta, 2018) is the most popular saliency-based pruning method. Recently, the effectiveness of MP has been highlighted by the success of the lottery ticket hypothesis (Frankle & Carbin, 2019) and learning rate rewinding (Renda et al., 2020), which achieve less than 1% accuracy loss even after pruning 90% of the parameters. However, for all pruning methods, the trade-off between sparsity and accuracy degrades significantly at extremely high sparsity (Gale et al., 2019; Liu et al., 2019; Blalock et al., 2020). Considering the explosive growth in the size of state-of-the-art models for downstream tasks (e.g., GPT-3 (Brown et al., 2020) for machine translation and FixEfficientNet-L2 (Touvron et al., 2020) for image classification), effective pruning at extreme compression ratios is essential, particularly for deploying high-capacity models on low-resource devices (e.g., One Laptop per Child).
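As a point of reference for the saliency-based family, the following is a minimal sketch of global magnitude pruning in PyTorch: all weights are ranked by absolute value and the smallest fraction is masked out. The function name global_magnitude_masks, the toy layer shapes, and the use of a single global threshold (rather than per-layer thresholds) are illustrative assumptions, not the exact setup of the cited methods.

```python
import torch

def global_magnitude_masks(weights, sparsity):
    """Keep the largest-magnitude weights across all layers and prune the rest.

    weights: list of weight tensors; sparsity: fraction of weights to remove.
    Returns one binary mask per tensor (1 = keep, 0 = prune).
    """
    scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * scores.numel())             # number of weights to drop
    if k == 0:
        return [torch.ones_like(w) for w in weights]
    threshold = torch.kthvalue(scores, k).values   # k-th smallest magnitude
    return [(w.detach().abs() > threshold).float() for w in weights]

# Toy usage on two randomly initialized layers (90% of weights removed globally).
layers = [torch.randn(300, 784), torch.randn(100, 300)]
masks = global_magnitude_masks(layers, sparsity=0.9)
print([round(m.mean().item(), 3) for m in masks])  # surviving fraction per layer
```

Iterative variants such as IMP repeat this masking step over several prune-retrain rounds, rewinding the learning-rate schedule between rounds (Renda et al., 2020).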

To break the performance bottleneck of state-of-the-art pruning methods at high compression ratios, we investigate the procedure of existing pruning methods. Surprisingly, we find that all existing studies overlook the existence of dead neurons after pruning. Dead neurons are nodes with no input connections or no output connections; they render all of their attached weights useless, which we call dead connections. Although Han et al. (2015) already raised the potential issue of dead neurons, they anticipated that dead neurons would become negligible after multiple iterations of pruning. Unfortunately, we find that the problem is not as simple as they expected.

Figure 1: Prediction accuracy and percentage of dead connections over various compression ratios for LeNet-300-100 on MNIST, ResNet-18 on CIFAR-10, and ResNet-50 on Tiny-ImageNet. IMP with learning rate rewinding (Renda et al., 2020) is used as the base pruning method. The percentage of dead connections is the number of dead connections divided by the total number of connections in a given network.

Figure 1 depicts prediction accuracy (green line) and the percentage of dead connections (red line) over various compression ratios. For this empirical study, iterative magnitude pruning (IMP) with learning rate rewinding (Renda et al., 2020) is employed as the baseline pruning method. We then analyze the correlation between pruning accuracy and the occurrence of dead neurons on three benchmark datasets and obtain two findings: 1) Dead neurons cannot be removed successfully by iterative pruning, especially at high compression ratios. 2) The severe performance degradation at high compression ratios is strongly correlated with the number of dead connections, i.e., the two are inversely proportional to each other.

Based on these findings, we propose all-alive pruning (AAP), which improves pruning performance by effectively eliminating dead connections. Specifically, we search for dead neurons at the pruning stage by inspecting their gradient flow: when only zero gradients pass through a node, we regard it as a dead neuron. Once identified, dead neurons and the corresponding dead connections (any weights linked from or to dead neurons) are removed together during the pruning procedure. Note that this gradient-based detection of dead neurons applies to complex model architectures with skip connections. If more weights are removed than the sparsity threshold requires, we revive some weights with the highest saliency scores. As a result, AAP constructs an all-alive subnetwork, i.e., all connections in the subnetwork are kept as trainable weights.

To summarize, the key advantages of AAP are two-fold: 1) AAP is versatile: it is broadly applicable to various saliency-based pruning methods and model architectures while minimizing the loss of prediction accuracy. 2) AAP consistently improves the accuracy of the original pruning methods at high compression ratios and achieves state-of-the-art performance on three benchmark datasets (MNIST, CIFAR-10, and Tiny-ImageNet).
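To make the gradient-based check concrete, below is a minimal sketch for a stack of fully connected layers, assuming pruned weights have already been zeroed by their masks. The function name prune_dead_connections and the single mini-batch backward pass are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prune_dead_connections(model, masks, inputs, targets):
    """Zero out mask entries attached to units through which no gradient flows.

    Assumes `model` is a stack of nn.Linear layers whose pruned weights have
    already been zeroed (weight.data *= mask), and `masks` maps each layer's
    name (as in model.named_modules()) to a 0/1 tensor of the weight's shape.
    """
    model.zero_grad()
    F.cross_entropy(model(inputs), targets).backward()  # one backward pass
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name in masks:
            grad = module.weight.grad * masks[name]
            # If every surviving incoming weight of an output unit receives a
            # zero gradient, no gradient flows through that unit (e.g., all of
            # its outgoing weights were pruned), so its remaining connections
            # are dead and can be removed from the mask.
            dead_units = grad.abs().sum(dim=1) == 0
            masks[name][dead_units, :] = 0.0
    return masks
```

A complete procedure would also propagate the check across layers and, as described above, revive the highest-saliency pruned weights when more weights are removed than the target sparsity requires; that bookkeeping is omitted here for brevity.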

2. RELATED WORK

Network pruning utilizes the highly overparameterized nature of modern neural networks to compress the model. It also provides meaningful insight into what leads to the success of neural networks: small subnetworks within the original network can achieve comparable performance and improve generalization (Arora et al., 2018).

Recent advances. In general, network pruning can be categorized into two groups. The first approach augments the loss function with regularization terms that enforce sparsity (Chauvin, 1988; Weigend et al., 1990; Ishikawa, 1996). Also, Molchanov et al. (2017a) proposed variational dropout to produce

