SUCCINCT NETWORK CHANNEL AND SPATIAL PRUNING VIA DISCRETE VARIABLE QCQP

Abstract

Reducing the heavy computational cost of large convolutional neural networks is crucial when deploying the networks to resource-constrained environments. In this context, recent works propose channel pruning via greedy channel selection to achieve practical acceleration and memory footprint reduction. We first show that this channel-wise approach ignores the inherent quadratic coupling between channels in the neighboring layers and cannot safely remove inactive weights during the pruning procedure. Furthermore, we show that these pruning methods cannot guarantee that the given resource constraints are satisfied, and that they deviate from the true objective. To this end, we formulate a principled optimization framework with discrete variable QCQP, which provably leaves no inactive weights and enables an exact guarantee of meeting the resource constraints in terms of FLOPs and memory. Also, we extend the pruning granularity beyond channels and jointly prune individual 2D convolution filters spatially for greater efficiency. Our experiments show competitive pruning results under the target resource constraints on the CIFAR-10 and ImageNet datasets over various network architectures.

1. INTRODUCTION

Deep neural networks are the bedrock of artificial intelligence tasks such as object detection, speech recognition, and natural language processing (Redmon & Farhadi, 2018; Chorowski et al., 2015; Devlin et al., 2019). While modern networks have hundreds of millions to billions of parameters to train, it has recently been shown that these parameters are highly redundant and can be pruned without significant loss in accuracy (Han et al., 2015; Guo et al., 2016). This discovery has spurred interest in training and running such models on resource-constrained mobile devices, provoking a large body of research on network pruning. Unstructured pruning, however, does not directly lead to any practical acceleration or memory footprint reduction due to poor data locality (Wen et al., 2016), which motivated research on structured pruning for practical usage under limited resource budgets. To this end, a line of research on channel pruning considers completely pruning the convolution filters along the input and output channel dimensions, where the resulting pruned model becomes a smaller dense network suited for practical acceleration and memory footprint reduction (Li et al., 2017; Luo et al., 2017; He et al., 2019; Wen et al., 2016; He et al., 2018a). However, existing channel pruning methods perform the pruning operations with a greedy approach and do not consider the inherent quadratic coupling between channels in the neighboring layers. Although these methods are easy to model and optimize, they cannot safely remove inactive weights during the pruning procedure, suffer from discrepancies with the true objective, and prohibit the strict satisfaction of the required resource constraints during the pruning process. The ability to specify hard target resource constraints in the pruning optimization process is important since this allows the user to run the pruning and optional finetuning process only once.
When the pruning process ignores the target specifications, users may need to apply multiple rounds of pruning and finetuning until the specifications are eventually met, incurring extra computational overhead (Han et al., 2015; He et al., 2018a; Liu et al., 2017). In this paper, we formulate a principled optimization problem that prunes the network channels while respecting the quadratic coupling and exactly satisfying the user-specified FLOPs and memory constraints. This new formulation leads to an interesting discrete variable QCQP (Quadratically Constrained Quadratic Program), which directly maximizes the importance of neurons in the pruned network under the specified resource constraints. Also, we increase the pruning granularity beyond channels and jointly prune individual 2D convolution filters spatially for greater efficiency. Furthermore, we generalize our formulation to cover nonsequential convolution operations, such as skip connections, and propose a principled optimization framework for handling various architectural implementations of skip connections in ResNet (He et al., 2016). Our experiments on the CIFAR-10 and ImageNet datasets show state of the art results compared to other channel pruning methods that start from pretrained networks.

Figure 1: Illustration of a channel pruning procedure that leads to inactive weights. When the j-th output channel of the l-th convolution weights W^(l)_{•,j} is pruned, i.e. W^(l)_{•,j} = 0_{C_{l-1},K_l,K_l}, the j-th feature map of the l-th layer X^(l)_j also becomes 0_{H_l,W_l}. Consequently, X^(l)_j yields inactive weights W^(l+1)_j (by Equation (1)). Note that we use W^(l)_{•,j} to denote the tensor W^(l)_{•,j,•,•}, following the indexing rules of NumPy (Van Der Walt et al., 2011).

2. MOTIVATION

In this section, we first discuss the motivation of our method concretely. Suppose the weights in a sequential CNN form a sequence of 4-D tensors W^(l) ∈ R^{C_{l-1}×C_l×K_l×K_l} for all l ∈ [L], where C_{l-1}, C_l, and K_l denote the number of input channels, the number of output channels, and the filter size of the l-th convolution weight tensor, respectively. We denote the feature map after the l-th convolution as X^(l) ∈ R^{C_l×H_l×W_l}. Concretely, X^(l)_j = σ(X^(l-1) ⊛ W^(l)_{•,j}) = σ(Σ_{i=1}^{C_{l-1}} X^(l-1)_i * W^(l)_{i,j}), where σ is the activation function, * denotes the 2-D convolution operation, and ⊛ denotes the sum of channel-wise 2-D convolutions. Now consider pruning these weights in the channel-wise direction. We show that naive channel-wise pruning methods prevent the exact specification of the target resource constraints due to unpruned inactive weights, and deviate from the true objective by ignoring the quadratic coupling between channels in the neighboring layers.
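To make the notation concrete, the channel-summed convolution X^(l)_j = σ(Σ_i X^(l-1)_i * W^(l)_{i,j}) can be sketched in a few lines of NumPy. This is an illustrative toy, assuming ReLU for σ and 'valid' correlation with no padding or stride (choices the text does not fix), not the paper's implementation:

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 2-D 'valid' cross-correlation of one feature map x with one K x K filter w."""
    H, W = x.shape
    K = w.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for a in range(out.shape[0]):
        for b in range(out.shape[1]):
            out[a, b] = np.sum(x[a:a + K, b:b + K] * w)
    return out

def layer_output_channel(x_prev, W_l, j, sigma=lambda z: np.maximum(z, 0)):
    """X^(l)_j = sigma(sum_i X^(l-1)_i * W^(l)_{i,j}); W_l has shape (C_in, C_out, K, K)."""
    C_in = W_l.shape[0]
    acc = sum(conv2d_valid(x_prev[i], W_l[i, j]) for i in range(C_in))
    return sigma(acc)
```

Note that zeroing all filters W_l[:, j] makes the returned map identically zero, which is exactly the dead-neuron situation discussed next.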

2.1. INACTIVE WEIGHTS

According to Han et al. (2015), network pruning produces dead neurons with zero input or output connections. These dead neurons cause inactive weights¹, which do not affect the final output activations of the pruned network. These inactive weights may not be excluded automatically through the standard pruning procedure and require additional post-processing that relies on ad-hoc heuristics. For example, Figure 1 shows a standard channel pruning procedure that deletes weights across the output channel direction but fails to prune the inactive weights. Concretely, deletion of the weights on the j-th output channel of the l-th convolution layer leads to W^(l)_{•,j} = 0_{C_{l-1},K_l,K_l}. Then, X^(l)_j becomes a dead neuron since X^(l)_j = σ(X^(l-1) ⊛ W^(l)_{•,j}) = σ(Σ_{i=1}^{C_{l-1}} X^(l-1)_i * W^(l)_{i,j}) = 0_{H_l,W_l}. The convolution operation on the dead neuron results in a trivially zero contribution:

X^(l+1)_p = σ(Σ_{i=1}^{C_l} X^(l)_i * W^(l+1)_{i,p}) = σ(Σ_{i≠j} X^(l)_i * W^(l+1)_{i,p} + X^(l)_j * W^(l+1)_{j,p}),    (1)

where the dead neuron X^(l)_j makes the term X^(l)_j * W^(l+1)_{j,p} equal to 0_{H_{l+1},W_{l+1}}, rendering W^(l+1)_{j,p} inactive. Equation (1) shows that the dead neuron X^(l)_j causes the weights W^(l+1)_{j,p}, ∀p ∈ [C_{l+1}], to be inactive. Such inactive weights do not account for the actual resource usage even when they remain in the pruned network, which prevents the exact modeling of the user-specified hard resource constraints (FLOPs and network size). Furthermore, inactive weights left unpruned during the pruning procedure are an even bigger problem for nonsequential convolutional networks due to their skip connections. To address this problem, we introduce a quadratic optimization-based algorithm that provably eliminates all the inactive weights during the pruning procedure. Existing channel pruning methods remove channels according to their importance. However, measuring a channel's contribution to the network should also take into account the channels in the neighboring layers, as illustrated in Figure 2.
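Equation (1) can be verified numerically. The sketch below builds a toy two-layer network with 1×1 filters (so each channel-wise convolution reduces to elementwise scaling), prunes output channel j of the first layer, and checks that arbitrarily changing the now-inactive row W^(2)_{j,•} leaves the final activations untouched; the setup and variable names are ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

# Toy sequential CNN with 1x1 filters, so channel-wise convolution is elementwise scaling.
C0, C1, C2, H = 3, 4, 2, 5
X0 = rng.standard_normal((C0, H, H))
W1 = rng.standard_normal((C0, C1))   # W^(1)_{i,j}
W2 = rng.standard_normal((C1, C2))   # W^(2)_{j,p}

def forward(X0, W1, W2):
    X1 = relu(np.einsum('ihw,ij->jhw', X0, W1))
    X2 = relu(np.einsum('jhw,jp->phw', X1, W2))
    return X2

# Prune output channel j=1 of layer 1: W^(1)_{.,1} = 0, so X^(1)_1 is a dead neuron.
j = 1
W1_pruned = W1.copy()
W1_pruned[:, j] = 0.0

# By Equation (1), row W^(2)_{j,.} is now inactive: changing it cannot affect the output.
W2_perturbed = W2.copy()
W2_perturbed[j, :] = rng.standard_normal(C2)  # arbitrary values in the inactive row

out_a = forward(X0, W1_pruned, W2)
out_b = forward(X0, W1_pruned, W2_perturbed)
assert np.allclose(out_a, out_b)  # inactive weights do not affect the final activations
```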
In the example, we define the importance of a channel as the absolute sum of weights in the channel, as in Li et al. (2017), and assume the objective is to maximize the absolute sum of weights in the whole pruned network, excluding the inactive weights. We compare two different channel pruning methods: (a) a standard channel pruning method that greedily prunes each channel independently, and (b) our pruning method that considers the effect of the channels in the neighboring layers when pruning. As a result of running each pruning algorithm, (a) prunes the second output channel of the first convolution and the third output channel of the second convolution, while (b) prunes the first output channel of the first convolution, the third output channel of the second convolution, and the first input channel of the second convolution. The objective values of the pruned networks are 18 for (a) and 21 for (b). This shows that the coupling effect of the channels in the neighboring layers directly affects the objective value, resulting in the performance gap between (a) and (b). We call this coupling relationship the quadratic coupling between the neighboring layers and express the contributions to the objective by quadratic terms of the neighboring channel activations. To address this quadratic coupling, we propose a channel pruning method based on a QCQP framework whose importance evaluation respects both the input and the output channels.

2.2. QUADRATIC COUPLING

[Figure 2: comparison of (a) the greedy channel pruning method, whose pruned network attains objective value (3 + 5 + 4) + (3 + 3) = 18, and (b) our pruning method, whose pruned network attains objective value 21. The per-filter importance values shown in the figure are omitted here; see the Figure 2 caption.]
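The gap between greedy selection and coupling-aware selection is easy to reproduce on a toy problem. The sketch below uses hypothetical importance values (not the exact numbers of Figure 2; only the qualitative effect matters) and compares a greedy per-channel ranking against exhaustive enumeration of the coupled objective; by construction the joint optimum is never worse:

```python
import itertools
import numpy as np

# Hypothetical per-filter importances |W| for a toy two-conv network.
I1 = np.array([[4., 1.],          # layer 1: (C0=3 inputs) x (C1=2 outputs)
               [2., 5.],
               [3., 6.]])
I2 = np.array([[1., 7., 2.],      # layer 2: (C1=2 inputs) x (C2=3 outputs)
               [6., 1., 1.]])

def objective(r0, r1, r2):
    # A weight W^(l)_{i,j} contributes only if BOTH its input and output channels
    # survive -- the quadratic coupling r^(l-1)_i * r^(l)_j.
    return float((np.outer(r0, r1) * I1).sum() + (np.outer(r1, r2) * I2).sum())

def best_joint(budget):
    """Enumerate all channel activations keeping at most `budget` channels in total."""
    best_val, best_cfg = -1.0, None
    for r0 in itertools.product([0, 1], repeat=3):
        for r1 in itertools.product([0, 1], repeat=2):
            for r2 in itertools.product([0, 1], repeat=3):
                if sum(r0) + sum(r1) + sum(r2) > budget:
                    continue
                v = objective(np.array(r0), np.array(r1), np.array(r2))
                if v > best_val:
                    best_val, best_cfg = v, (r0, r1, r2)
    return best_val, best_cfg

def greedy(budget):
    """Rank every channel by its own importance, ignoring the neighboring layers."""
    scores = ([(0, i, I1[i].sum()) for i in range(3)] +
              [(1, j, I1[:, j].sum() + I2[j].sum()) for j in range(2)] +
              [(2, p, I2[:, p].sum()) for p in range(3)])
    keep = sorted(scores, key=lambda t: -t[2])[:budget]
    r = [np.zeros(3, int), np.zeros(2, int), np.zeros(3, int)]
    for layer, idx, _ in keep:
        r[layer][idx] = 1
    return objective(*r)
```

Since the greedy solution is one feasible point of the enumeration, `best_joint(b)[0] >= greedy(b)` holds for every budget, mirroring the 21-versus-18 gap in Figure 2.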

3. METHOD

In this section, we first propose our discrete QCQP formulation of channel pruning for the sequential convolutional neural networks (CNNs). Then, we present an extended version of our formulation for joint channel and shape pruning of 2D convolution filters. The generalization to the nonsequential convolution (skip addition and skip concatenation) is introduced in Supplementary material A.

3.1. FORMULATION OF CHANNEL PRUNING FOR SEQUENTIAL CNNS

To capture the importance of the weights in W^(l), we define the importance tensor I^(l) ∈ R_+^{C_{l-1}×C_l×K_l×K_l}. Following the protocol of Han et al. (2015); Guo et al. (2016), we set I^(l) = γ_l |W^(l)|, where γ_l is the ℓ2 normalizing factor of the l-th layer, ||vec(W^(l))||^{-1}. Then, we define the binary pruning mask A^(l) ∈ {0,1}^{C_{l-1}×C_l×K_l×K_l}. For channel pruning in sequential CNNs, we define the channel activation r^(l) ∈ {0,1}^{C_l} to indicate which channels remain in the l-th layer of the pruned network. Then, the weights in W^(l)_{i,j} are active if and only if r^(l-1)_i r^(l)_j = 1, which leads to A^(l)_{i,j} = r^(l-1)_i r^(l)_j J_{K_l}.² For example, in Figure 2b, r^(l-1) = [1, 1, 1], r^(l) = [0, 1], and r^(l+1) = [1, 1, 0]; therefore,

A^(l) = [0 1; 0 1; 0 1] ⊗ J_{K_l}  and  A^(l+1) = [0 0 0; 1 1 0] ⊗ J_{K_{l+1}}.

We wish to directly maximize the sum of the importance of the active weights after the pruning procedure under the given resource constraints: 1) FLOPs, 2) memory, and 3) network size. Concretely, our optimization problem is

maximize_{r^(0:L)}  Σ_{l=1}^{L} ⟨I^(l), A^(l)⟩    (2)
subject to  Σ_{l=0}^{L} a_l ||r^(l)||_1 + Σ_{l=1}^{L} b_l ||A^(l)||_1 ≤ M,
            A^(l) = r^(l-1) (r^(l))ᵀ ⊗ J_{K_l}  ∀l ∈ [L],
            r^(l) ∈ {0,1}^{C_l}  ∀l ∈ {0, ..., L}.

In our formulation, the actual resource usage of the pruned network is exactly computed by specifying the number of channels in the pruned network (= ||r^(l)||_1) and the pruning mask sparsity (= ||A^(l)||_1) in each layer. Concretely, the left-hand side of the inequality in the first constraint of Equation (2) is the actual resource usage. Table 1 shows the a_l, b_l terms used for computing the usage of each resource. Note that this optimization problem is a discrete nonconvex QCQP in the channel activations [r^(0), ..., r^(L)], where the objective, identical to the objective in Section 2.2, respects the quadratic coupling of the channel activations r^(l). Please refer to Supplementary material E for the details on the standard QCQP form of Equation (2).
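For toy problem sizes, Equation (2) can be solved exactly by enumerating all binary channel activations. The sketch below does this for a sequential CNN, using the identity ||A^(l)||_1 = K_l² ||r^(l-1)||_1 ||r^(l)||_1 implied by the mask constraint; the function signature and the pre-summed importance matrices are our simplifications, not the paper's interface (the paper uses a MIQCQP solver instead):

```python
import itertools
import numpy as np

def solve_channel_qcqp(I_list, a, b, K, M):
    """Brute-force the discrete QCQP of Equation (2) for a tiny sequential CNN.

    I_list[l] has shape (C_l, C_{l+1}) holding per-(input, output)-channel importance
    (the K x K spatial entries already summed); a (length L+1) and b (length L) are
    the per-layer resource coefficients of Table 1, K[l] is the filter size of layer
    l+1, and M is the resource budget.
    """
    C = [I_list[0].shape[0]] + [I.shape[1] for I in I_list]
    L = len(I_list)
    best_val, best_r = -1.0, None
    for bits in itertools.product(*[itertools.product([0, 1], repeat=c) for c in C]):
        r = [np.array(x) for x in bits]
        # A^(l) = r^(l-1) (r^(l))^T (x) J_{K_l}, so ||A^(l)||_1 factorizes as below.
        usage = sum(a[l] * r[l].sum() for l in range(L + 1)) \
              + sum(b[l] * (K[l] ** 2) * r[l].sum() * r[l + 1].sum() for l in range(L))
        if usage > M:
            continue
        val = sum(float((np.outer(r[l], r[l + 1]) * I_list[l]).sum()) for l in range(L))
        if val > best_val:
            best_val, best_r = val, r
    return best_val, best_r
```

Enumeration is exponential in the channel counts, so this is only a correctness reference; real networks need the solver-based approach of Section 3.3.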

3.2. FORMULATION OF JOINT CHANNEL AND SPATIAL PRUNING

For further efficiency, we increase the pruning granularity and additionally perform spatial pruning within the 2-D convolution filters. Concretely, we prune each weight vector across the input channel direction instead of each whole channel, performing the channel and spatial pruning processes simultaneously.

Table 1 gives the a_l, b_l coefficients for each resource constraint:

    Resource constraint (M)    a_l        b_l
    Network size               0          1
    Memory                     H_l W_l    1
    FLOPs                      0          H_l W_l

First, we define the shape column W^(l)_{•,j,a,b} as the vector of weights at spatial position (a, b) of the 2-D convolution filters along the j-th output channel dimension. Then, we define the shape column activation q^(l) ∈ {0,1}^{C_l×K_l×K_l} to indicate which shape columns in the l-th convolution layer remain in the pruned network. Figure 3 illustrates each variable.

Figure 3: Input channel activation r^(l-1) ∈ {0,1}^{C_{l-1}}, shape column activation q^(l) ∈ {0,1}^{C_l×K_l×K_l}, and the corresponding mask A^(l) for the l-th convolution layer, where A^(l) = r^(l-1) ⊗ q^(l).

Note that this definition induces constraints on the channel activation variables. In detail, the j-th output channel activation in the l-th layer is set if and only if at least one shape column activation in the j-th output channel is set. Concretely, the new formulation includes the constraints r^(l)_j ≤ Σ_{a,b} q^(l)_{j,a,b} and q^(l)_{j,a,b} ≤ r^(l)_j ∀a, b. We now reformulate the optimization problem to include the shape column activation variables. Again, we aim to maximize the sum of the importance of the active weights after pruning under the given resource constraints. Our optimization problem for simultaneous channel and spatial pruning is

maximize_{r^(0:L), q^(1:L)}  Σ_{l=1}^{L} ⟨I^(l), A^(l)⟩    (3)
subject to  Σ_{l=0}^{L} a_l ||r^(l)||_1 + Σ_{l=1}^{L} b_l ||A^(l)||_1 ≤ M,
            r^(l)_j ≤ Σ_{a,b} q^(l)_{j,a,b}  and  q^(l)_{j,a,b} ≤ r^(l)_j  ∀l, j, a, b,
            A^(l) = r^(l-1) ⊗ q^(l)  ∀l ∈ [L],
            r^(l) ∈ {0,1}^{C_l}  and  q^(l) ∈ {0,1}^{C_l×K_l×K_l}  ∀l ∈ [L].

We note that this optimization problem is also a discrete nonconvex QCQP.
The details on the standard QCQP form of Equation (3) are provided in Supplementary material E. Furthermore, Proposition 1 below shows that the constraints in Equation (2) and Equation (3) provably eliminate any unpruned inactive weights and accurately model the resource usage as well as the objective of the pruned network. Also, Proposition 1 can be generalized to nonsequential networks with skip addition. The generalization and the proofs are given in Supplementary material D.

Proposition 1. Optimizing over the input and output channel activation variables r^(0:L) and the shape column activation variables q^(1:L) under the constraints in Equation (3) provably removes any inactive weights in the pruned network, guaranteeing exact computation of 1) the resource usage and 2) the sum of the importance of the active weights in the pruned network.
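The linking constraints of Equation (3) are mechanical to check. The following sketch evaluates a single layer: it tests the two inequalities tying r^(l) to q^(l), forms the mask A^(l) = r^(l-1) ⊗ q^(l) by broadcasting, and returns the layer's objective and resource terms. The names and the per-layer scope are our own simplification of the full program:

```python
import numpy as np

def check_and_score(I, r_in, r_out, q, a_l, b_l):
    """Evaluate one layer of the joint channel/spatial program of Equation (3).

    I: importance tensor of shape (C_in, C_out, K, K); r_in, r_out: binary channel
    activations; q: binary shape-column activations of shape (C_out, K, K).
    Returns (feasible, layer objective <I, A>, layer resource usage).
    """
    # Linking constraints: r_j <= sum_{a,b} q_{j,a,b} and q_{j,a,b} <= r_j.
    feasible = bool(np.all(r_out <= q.reshape(len(r_out), -1).sum(axis=1))) \
           and bool(np.all(q <= r_out[:, None, None]))
    # A^(l) = r^(l-1) (x) q^(l), broadcast over the input channel axis.
    A = r_in[:, None, None, None] * q[None, :, :, :]
    obj = float((I * A).sum())
    usage = float(a_l * r_out.sum() + b_l * A.sum())
    return feasible, obj, usage
```

A configuration that sets a shape column inside a pruned output channel violates q ≤ r and is rejected, which is precisely how the constraints prevent inactive shape columns from surviving.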

3.3. OPTIMIZATION

Equation (2) and Equation (3) fall into the category of binary Mixed Integer Quadratically Constrained Quadratic Programming (MIQCQP). We solve these discrete QCQP problems with the CPLEX library (INC, 1993), which provides MIQCQP solvers based on the branch and cut technique. However, the branch and cut algorithm can incur exponential search time (Mitchell, 2002) on large problems. Therefore, we provide a practical alternative utilizing a block coordinate descent style optimization method in Supplementary material B.
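The block coordinate descent alternative is detailed in Supplementary material B, which we cannot reproduce here; the sketch below is only a plausible rendering of the idea for the channel-only objective of Equation (2). Holding all other layers fixed makes the objective linear in r^(l), so each block update simply keeps the top-scoring channels (here under simplified per-layer budgets rather than the global resource constraint, and with the boundary activations kept fixed):

```python
import numpy as np

def block_coordinate_ascent(I_list, keep, n_rounds=10):
    """Alternating per-layer updates for the channel-pruning objective of Equation (2).

    I_list[l]: (C_l, C_{l+1}) importances with spatial entries pre-summed; keep[l]:
    number of channels to keep in layer l (a stand-in for the global constraint).
    With the other layers fixed, the objective is LINEAR in r^(l), so the block
    update just keeps the channels with the largest scores.
    """
    L = len(I_list)
    C = [I_list[0].shape[0]] + [I.shape[1] for I in I_list]
    r = [np.ones(c, dtype=int) for c in C]  # start from the unpruned network
    for _ in range(n_rounds):
        for l in range(1, L):  # interior layers couple with both neighbors
            # Score of channel j: importance of weights it activates on both sides.
            score = I_list[l - 1].T @ r[l - 1] + I_list[l] @ r[l + 1]
            top = np.argsort(score)[::-1][:keep[l]]
            r[l] = np.zeros(C[l], dtype=int)
            r[l][top] = 1
    return r
```

Because every block update maximizes the layer's linear subproblem given the fixed neighbors, the overall objective is non-decreasing across rounds, though only a local optimum is guaranteed.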

4. RELATED WORKS

Importance of channels Most channel pruning methods prune away the least important channels with a simple greedy approach, and the method for evaluating the importance of channels has been the main research problem (Molchanov et al., 2017; 2019; Liu et al., 2019). Channel pruning is divided into two major branches according to how the importance of channels is evaluated: trainable-importance methods, which evaluate the importance of channels while training the whole network from scratch, and fixed-importance methods, which directly evaluate the importance of channels on the pretrained network. Trainable-importance channel pruning methods include regularizer-based methods with group sparsity regularizers (Wen et al., 2016; Alvarez & Salzmann, 2016; Yang et al., 2019; Liu et al., 2017; Louizos et al., 2018; Gordon et al., 2018) and data-driven channel pruning methods (Kang & Han, 2020; You et al., 2019). Fixed-importance channel pruning methods first prune away most of the weights and then finetune the significantly smaller pruned network (Molchanov et al., 2017; 2019; Hu et al., 2016; He et al., 2018a; Li et al., 2017; He et al., 2019; Luo et al., 2017). As a result, fixed-importance methods are much more efficient than trainable-importance methods in terms of computational cost and memory, as trainable-importance methods have to train the whole unpruned network. Our framework is in the line of fixed-importance channel pruning works. Predefined target structure and automatic target structure Layer-wise channel pruning methods (Li et al., 2017; He et al., 2019), which perform pruning operations on each layer independently, require users to predefine the target pruned structure. Also, LCCL (Dong et al., 2017) exploits a predefined low-cost network to improve inference time.
Another line of research finds the appropriate target structure automatically (He et al., 2018b; Yang et al., 2018; Liu et al., 2019; Molchanov et al., 2017). Our method also finds the target structure automatically, under explicit target resource constraints. Dynamic pruning and static pruning Dynamic pruning changes the network structure depending on the input at inference time, while static pruning keeps a fixed network structure at inference time. CGNet (Hua et al., 2019) dynamically identifies unnecessary features to reduce computation, and FBS (Gao et al., 2019) dynamically skips computations on unimportant channels. Our framework is a static pruning method and has a fixed network structure at inference time. Quadratic coupling CCP (Peng et al., 2019) formulates a QP (quadratic program) to consider the quadratic coupling between channels in the same layer under layer-wise constraints on the maximum number of channels. In contrast, our formulation considers the quadratic coupling between channels in the neighboring layers under the target resource constraints. Channel pruning in nonsequential blocks Many network architectures contain nonsequential convolution operations, such as skip addition (He et al., 2016; Sandler et al., 2018; Tan & Le, 2019) and skip concatenation (Huang et al., 2017). Since these architectures outperform sequential ones, pruning networks with nonsequential convolution operations is crucial. However, most channel pruning methods (Liu et al., 2017; He et al., 2018a; 2019; Molchanov et al., 2017; 2019) do not consider nonsequential convolution operations and reuse the method designed for sequential architectures, which may result in the misalignment of feature maps connected by skip connections (You et al., 2019).
GBN (You et al., 2019) solves this misalignment problem by forcing parameters connected by a nonsequential convolution operation to share the same pruning pattern. In contrast, our formulation does not require strict pattern sharing; this flexibility allows our method to delete more channels under the given constraints. Spatial pruning of convolution filters Spatial pruning methods sparsify the spatial patterns of convolution filters for inference efficiency. They either manually define the spatial patterns of filters (Lebedev & Lempitsky, 2016; Anwar et al., 2017) or optimize the spatial patterns with group sparse regularizers (Wen et al., 2016; Lebedev & Lempitsky, 2016). Among these works, Lebedev & Lempitsky (2016) empirically demonstrate that enforcing sparse spatial patterns in the 2-D filters along the input channel direction leads to substantial speed-up at inference time using group sparse convolution operations (Chellapilla et al., 2006). Our proposed method enforces spatial patterns in 2-D filters as in Lebedev & Lempitsky (2016) for speed-up at inference time.

5. EXPERIMENTS

We compare the classification accuracy of the pruned network against several pruning baselines on CIFAR-10 ( Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015) datasets using various ResNet architectures (He et al., 2016) , DenseNet-40 (Huang et al., 2017) , and VGG-16 (Simonyan & Zisserman, 2015) . Note that most pruning baselines apply an iterative pruning procedure, which repeatedly alternates between network pruning and finetuning until the target resource constraints are satisfied (Han et al., 2015; He et al., 2018a; Liu et al., 2017; Yang et al., 2018) . In contrast, since our methods explicitly include the target resource constraint to the optimization framework, we only need one round of pruning and finetuning.

5.1. EXPERIMENTAL SETTINGS

We follow the 'smaller-norm-less-important' criterion (Ye et al., 2018; Liu et al., 2017), which evaluates the importance of a weight as its absolute value (Han et al., 2015; Guo et al., 2016). We assume the network size and FLOPs reduction are linearly proportional to the sparsity of the shape column activations, as empirically shown in Lebedev & Lempitsky (2016). Also, we ignore the extra memory overhead for storing the shape column activations due to its negligible size compared to the total network size. In the experiment tables, the FLOPs and the network size of the pruned network are computed according to the resource specifications in Equations (2) and (3). Also, 'Pruning ratio' in the tables denotes the ratio of pruned weights among the total weights in the baseline networks. 'ours-c' and 'ours-cs' in the tables denote our method with channel pruning only and our method with both channel and spatial pruning, respectively.

5.2. CIFAR-10

The CIFAR-10 dataset has 10 different classes with 5k training images and 1k test images per class (Krizhevsky et al., 2009). In the CIFAR-10 experiments, we evaluate our methods on various network architectures: ResNet-20, 32, and 56, and DenseNet-40. We provide the implementation details for the experiments in Supplementary material C. We show the experiment results of pruning under FLOPs constraints in the left column of Table 2 and under final network size constraints in the right column of Table 2. On the FLOPs experiments in the left column of Table 2, 'ours-c' shows comparable results against FPGM, the previous state of the art method, on ResNet-20, 32, and 56. Moreover, 'ours-cs' significantly outperforms both 'ours-c' and FPGM on the same architectures, achieving the state of the art performance. Also, 'ours-c' shows comparable results against slimming (Liu et al., 2017) and SCP (Kang & Han, 2020), while 'ours-cs' outperforms the existing baselines by a large margin on DenseNet-40. On the network size experiments in the right column of Table 2, 'ours-c' shows results competitive to FPGM and SCP, while 'ours-cs' again achieves the state of the art performance on ResNet-20, 32, and 56. Notably, on ResNet-56, 'ours-cs' achieves a minimal accuracy drop of -0.10 with a pruning ratio of 52.6%. These results show that simultaneous channel and spatial pruning produces more efficient networks with better performance compared to other channel pruning methods on CIFAR-10.

5.3. IMAGENET

ILSVRC-2012 (Russakovsky et al., 2015) is a large-scale dataset with 1000 classes that comes with 1.28M training images and 50k validation images. We conduct our methods under a fixed FLOPs constraint on ResNet-50 and VGG-16. For more implementation details of the ImageNet experiments, refer to Supplementary material C. Table 3 shows the experiment results on ImageNet. On ResNet-50, 'ours-c' and 'ours-cs' achieve results comparable to GBN, a trainable-importance channel pruning method which is the previous state of the art, even though our method is a fixed-importance channel pruning method. In particular, the top-1 pruned accuracy of 'ours-cs' exceeds SFP by 1.32% using a similar number of FLOPs. Both 'ours-cs' and 'ours-c' achieve significantly better performance than Molchanov et al. (2017) on VGG-16.

6. CONCLUSION

We present a discrete QCQP based optimization framework for jointly pruning channels and spatial filters under various architecture realizations. Since our methods model the inherent quadratic coupling between channels in the neighboring layers to eliminate any inactive weights during the pruning procedure, they allow exact modeling of the user-specified resource constraints and enable the direct optimization of the true objective on the pruned network. The experiments show our proposed method significantly outperforms other fixed-importance channel pruning methods, finding smaller and faster networks with the least drop in accuracy.

¹ Rigorous mathematical definition of inactive weights is provided in Supplementary material D.
² ⊗ denotes the outer product of tensors and J_n is an n-by-n matrix of ones.

Figure 2: A comparison of the greedy channel pruning method and our pruning method. Parallelograms represent feature maps and squares represent 2-D filters of convolution weights. Gray squares are filters which account for the objective. The numbers on the squares represent the absolute sum of weights in the filter.

Table 1: Resource constraints and the corresponding a_l, b_l values.

Table 2: Pruned accuracy and accuracy drop from the baseline network at given FLOPs (left) and pruning ratios (right) on various network architectures at CIFAR-10.

Table 3: Top-1/Top-5 pruned accuracy and accuracy drop from the baseline network at given FLOPs on various network architectures at ImageNet.

