SUCCINCT NETWORK CHANNEL AND SPATIAL PRUNING VIA DISCRETE VARIABLE QCQP

Abstract

Reducing the heavy computational cost of large convolutional neural networks is crucial when deploying them to resource-constrained environments. In this context, recent works propose channel pruning via greedy channel selection to achieve practical acceleration and memory footprint reduction. We first show that this channel-wise approach ignores the inherent quadratic coupling between channels in neighboring layers and therefore cannot safely remove inactive weights during the pruning procedure. Furthermore, we show that these pruning methods cannot guarantee that the given resource constraints are satisfied and cause a discrepancy with the true objective. To address this, we formulate a principled optimization framework as a discrete variable QCQP, which provably prevents inactive weights and exactly guarantees that the resource constraints on FLOPs and memory are met. We also extend the pruning granularity beyond channels and jointly prune individual 2D convolution filters spatially for greater efficiency. Our experiments show competitive pruning results under the target resource constraints on the CIFAR-10 and ImageNet datasets across various network architectures.

1. INTRODUCTION

Deep neural networks are the bedrock of artificial intelligence tasks such as object detection, speech recognition, and natural language processing (Redmon & Farhadi, 2018; Chorowski et al., 2015; Devlin et al., 2019). While modern networks have hundreds of millions to billions of parameters to train, it has recently been shown that these parameters are highly redundant and can be pruned without significant loss in accuracy (Han et al., 2015; Guo et al., 2016). This discovery has led practitioners to desire training and running the models on resource-constrained mobile devices, provoking a large body of research on network pruning. Unstructured pruning, however, does not directly lead to any practical acceleration or memory footprint reduction due to poor data locality (Wen et al., 2016), which motivated research on structured pruning for practical usage under limited resource budgets. To this end, a line of research on channel pruning considers completely pruning the convolution filters along the input and output channel dimensions, where the resulting pruned model becomes a smaller dense network suited for practical acceleration and memory footprint reduction (Li et al., 2017; Luo et al., 2017; He et al., 2019; Wen et al., 2016; He et al., 2018a). However, existing channel pruning methods perform the pruning operations greedily and do not consider the inherent quadratic coupling between channels in neighboring layers. Although these methods are easy to model and optimize, they cannot safely remove inactive weights during the pruning procedure, suffer from discrepancies with the true objective, and preclude strict satisfaction of the required resource constraints during the pruning process. The ability to specify hard target resource constraints in the pruning optimization process is important, since this allows the user to run the pruning and optional finetuning process only once.
When the pruning process ignores the target specifications, the users may need to apply multiple rounds of pruning and finetuning until the specifications are eventually met, resulting in extra computational overhead (Han et al., 2015; He et al., 2018a; Liu et al., 2017). In this paper, we formulate a principled optimization problem that prunes the network channels while respecting the quadratic coupling and exactly satisfying the user-specified FLOPs and memory constraints.

Figure 1: Illustration of a channel pruning procedure that leads to inactive weights. When the $j$-th output channel of the $l$-th convolution weights $W^{(l)}_{\cdot,j}$ is pruned, i.e., $W^{(l)}_{\cdot,j} = 0_{C_{l-1},K_l,K_l}$, then the $j$-th feature map of the $l$-th layer, $X^{(l)}_j = \sigma(X^{(l-1)} \circledast W^{(l)}_{\cdot,j})$, is pruned to zero, $X^{(l)}_j = 0_{H_l,W_l}$. Consequently, $X^{(l)}_j$ yields inactive weights $W^{(l+1)}_j$ (see Equation (1)). Note that we use $W^{(l)}_{\cdot,j}$ to denote the tensor $W^{(l)}_{\cdot,j,\cdot,\cdot}$, following the indexing rules of NumPy (Van Der Walt et al., 2011).

This new formulation leads to an interesting discrete variable QCQP (Quadratically Constrained Quadratic Program), which directly maximizes the importance of the neurons retained in the pruned network under the specified resource constraints. We also increase the pruning granularity beyond channels and jointly prune individual 2D convolution filters spatially for greater efficiency. Furthermore, we generalize our formulation to cover non-sequential convolution operations, such as skip connections, and propose a principled optimization framework for handling the various architectural implementations of skip connections in ResNet (He et al., 2016). Our experiments on the CIFAR-10 and ImageNet datasets show state-of-the-art results compared to other channel pruning methods that start from pretrained networks.
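To make the quadratic coupling concrete, the following minimal NumPy sketch (our illustration, not the paper's formulation; `conv_flops` and its arguments are hypothetical names) shows that the FLOP count of a sequential CNN is a quadratic function of the binary channel-keep decisions of neighboring layers, so pruning one channel changes the cost of two layers at once:

```python
import numpy as np

# With binary channel-keep vectors keep[l] in {0,1}^{C_l}, layer l costs
# (kept inputs) x (kept outputs) x K_l^2 x H_l x W_l multiply-adds, i.e.
# roughly (1^T keep[l-1]) * (1^T keep[l]) * K_l^2 * H_l * W_l -- a term that is
# quadratic in the decisions of two neighboring layers.
def conv_flops(keep, kernel_sizes, spatial_sizes):
    """keep[l]: 0/1 vector over output channels of layer l (keep[0] = input)."""
    total = 0
    for l in range(1, len(keep)):
        c_in = int(np.sum(keep[l - 1]))   # surviving input channels
        c_out = int(np.sum(keep[l]))      # surviving output channels
        k = kernel_sizes[l - 1]
        h, w = spatial_sizes[l - 1]
        total += c_in * c_out * k * k * h * w
    return total

# Pruning one output channel of the middle layer shrinks the cost of BOTH
# that layer (one fewer output) and the next layer (one fewer input) --
# exactly the coupling a greedy per-layer method ignores.
keep = [np.ones(3), np.ones(8), np.ones(16)]
dense = conv_flops(keep, kernel_sizes=[3, 3], spatial_sizes=[(32, 32), (32, 32)])
keep[1][0] = 0
pruned = conv_flops(keep, kernel_sizes=[3, 3], spatial_sizes=[(32, 32), (32, 32)])
```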

2. MOTIVATION

In this section, we concretely discuss the motivation of our method. Suppose the weights of a sequential CNN form a sequence of 4-D tensors $W^{(l)} \in \mathbb{R}^{C_{l-1} \times C_l \times K_l \times K_l}$ for all $l \in [L]$, where $C_{l-1}$, $C_l$, and $K_l$ represent the number of input channels, the number of output channels, and the filter size of the $l$-th convolution weight tensor, respectively. We denote the feature map after the $l$-th convolution as $X^{(l)} \in \mathbb{R}^{C_l \times H_l \times W_l}$. Concretely, $X^{(l)}_j = \sigma(X^{(l-1)} \circledast W^{(l)}_{\cdot,j}) = \sigma(\sum_{i=1}^{C_{l-1}} X^{(l-1)}_i * W^{(l)}_{i,j})$, where $\sigma$ is the activation function, $*$ denotes the 2-D convolution operation, and $\circledast$ denotes the sum of channel-wise 2-D convolutions. Now consider pruning these weights along the channel directions. We show that naive channel-wise pruning methods prevent exact specification of the target resource constraints due to unpruned inactive weights and deviate from the true objective by ignoring the quadratic coupling between channels in neighboring layers.
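The feature-map recursion above can be sketched in plain NumPy. This is a minimal illustration of the notation, assuming a "valid" 2-D correlation as the convolution and ReLU as $\sigma$; `conv2d_valid` and `conv_layer` are our hypothetical helper names, and the weight layout $(C_{l-1}, C_l, K, K)$ follows the paper:

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid' 2-D correlation of one channel x with one filter w."""
    kh, kw = w.shape
    h, w_ = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w_))
    for r in range(h):
        for c in range(w_):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * w)
    return out

def conv_layer(x_prev, weight):
    """X^(l)_j = sigma( sum_i X^(l-1)_i * W^(l)_{i,j} ) with sigma = ReLU.

    x_prev: (C_{l-1}, H, W); weight: (C_{l-1}, C_l, K, K) -> (C_l, H', W').
    """
    c_in, c_out = weight.shape[0], weight.shape[1]
    maps = [sum(conv2d_valid(x_prev[i], weight[i, j]) for i in range(c_in))
            for j in range(c_out)]
    return np.maximum(np.stack(maps), 0.0)  # ReLU activation
```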

2.1. INACTIVE WEIGHTS

According to Han et al. (2015), network pruning produces dead neurons with zero input or output connections. These dead neurons cause inactive weights¹, which do not affect the final output activations of the pruned network. These inactive weights may not be excluded automatically through the standard pruning procedure and require additional post-processing that relies on ad-hoc heuristics. For example, Figure 1 shows a standard channel pruning procedure that deletes weights across the output channel direction but fails to prune the resulting inactive weights. Concretely, deletion of the weights on the $j$-th output channel of the $l$-th convolution layer leads to $W^{(l)}_{\cdot,j} = 0_{C_{l-1},K_l,K_l}$. Then, $X^{(l)}_j$ becomes a dead neuron since $X^{(l)}_j = \sigma(X^{(l-1)} \circledast W^{(l)}_{\cdot,j}) = \sigma(\sum_{i=1}^{C_{l-1}} X^{(l-1)}_i * 0_{K_l,K_l}) = 0_{H_l,W_l}$ (assuming $\sigma(0) = 0$, as for ReLU), which in turn renders the next-layer weights $W^{(l+1)}_j$ inactive.



¹ A rigorous mathematical definition of inactive weights is provided in Supplementary Material D.
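The inactive-weight phenomenon above can also be checked numerically. The following self-contained sketch (our illustration, simplified to $1 \times 1$ convolutions with ReLU so that each layer is an `einsum`; `forward` is a hypothetical helper) shows that once output channel $j$ of layer $l$ is pruned, the next-layer weights $W^{(l+1)}_j$ can be set to arbitrary values without changing the network output:

```python
import numpy as np

def forward(x, w1, w2):
    """Two 1x1 conv layers with ReLU in between.

    x: (C0, H, W); w1: (C0, C1, 1, 1); w2: (C1, C2, 1, 1).
    """
    h1 = np.maximum(np.einsum('ihw,ij->jhw', x, w1[:, :, 0, 0]), 0.0)
    return np.maximum(np.einsum('ihw,ij->jhw', h1, w2[:, :, 0, 0]), 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
w1 = rng.normal(size=(4, 6, 1, 1))
w2 = rng.normal(size=(6, 5, 1, 1))

j = 2
w1[:, j] = 0.0                      # prune output channel j of layer l
out_before = forward(x, w1, w2)
w2[j] = rng.normal(size=(5, 1, 1))  # perturb W^(l+1)_j arbitrarily...
out_after = forward(x, w1, w2)
# ...and the output is unchanged: W^(l+1)_j is inactive, yet a naive
# channel pruning pass would leave it in the network.
assert np.allclose(out_before, out_after)
```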

