TRAINABILITY PRESERVING NEURAL PRUNING

Abstract

Many recent works have shown that trainability plays a central role in neural network pruning: unattended, broken trainability can lead to severe under-performance and unintentionally amplify the effect of the retraining learning rate, resulting in biased (or even misinterpreted) benchmark results. This paper introduces trainability preserving pruning (TPP), a scalable method that preserves network trainability against pruning, aiming for improved pruning performance and greater robustness to retraining hyper-parameters (e.g., learning rate). Specifically, we propose to penalize the gram matrix of convolutional filters to decorrelate the pruned filters from the retained filters. In addition to the convolutional layers, in the spirit of preserving the trainability of the whole network, we also propose to regularize the batch normalization parameters (scale and bias). Empirical studies on linear MLP networks show that TPP performs on par with the oracle trainability recovery scheme. On nonlinear ConvNets (ResNet56/VGG19) on CIFAR10/100, TPP outperforms its counterpart approaches by a clear margin. Moreover, results with ResNets on ImageNet-1K suggest that TPP consistently compares favorably with other top-performing structured pruning approaches.

1. INTRODUCTION

Neural pruning aims to remove redundant parameters without seriously compromising performance. It normally consists of three steps (Reed, 1993; Wang et al., 2023): pretrain a dense model; prune the unnecessary connections to obtain a sparse model; retrain the sparse model to regain performance. Pruning is usually categorized into two classes: unstructured pruning (a.k.a. element-wise pruning or fine-grained pruning) and structured pruning (a.k.a. filter pruning or coarse-grained pruning). Unstructured pruning takes a single weight as the basic pruning element, while structured pruning takes a group of weights (e.g., a 3d filter or a 2d channel) as the basic pruning element. Structured pruning is better suited for acceleration because of its regular sparsity. Unstructured pruning, in contrast, results in irregular sparsity, which is hard to exploit for acceleration unless customized hardware and libraries are available (Han et al., 2016a; 2017; Wen et al., 2016). Recent papers (Renda et al., 2020; Le & Hua, 2021) report an interesting phenomenon: during retraining, a larger learning rate (LR) helps achieve a significantly better final performance, empowering the two baseline methods, random pruning and magnitude pruning, to match or beat many more complex pruning algorithms. The reason is argued (Wang et al., 2021a; 2023) to be related to the trainability of neural networks (Saxe et al., 2014; Lee et al., 2020; Lubana & Dick, 2021). They make two major observations to explain this LR-effect mystery (Wang et al., 2023). (1) The weight removal operation immediately breaks the network trainability or dynamical isometry (Saxe et al., 2014) (the ideal case of trainability) of the trained network. (2) The broken trainability slows down the optimization in retraining, where a greater LR helps the model converge faster, so a better performance is observed earlier; a smaller LR can actually do just as well, but needs more epochs.
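To make the granularity difference concrete, below is a minimal sketch (with illustrative tensor shapes, not the paper's code) contrasting the two pruning granularities on a convolutional weight tensor:

```python
import torch

w = torch.randn(16, 8, 3, 3)  # conv weights: (out channels N, in channels C, H, W)

# Unstructured: mask individual weights by magnitude (irregular sparsity).
flat = w.abs().flatten()
thresh = flat.kthvalue(int(0.5 * flat.numel())).values  # 50% sparsity threshold
unstructured_mask = (w.abs() > thresh).float()          # same shape as w

# Structured: score whole filters by L1-norm and keep the largest (regular sparsity).
l1 = w.abs().sum(dim=(1, 2, 3))                # one score per output filter
keep = torch.argsort(l1, descending=True)[:8]  # keep the 8 largest-norm filters
pruned_layer = w[keep]                         # a genuinely smaller dense tensor
```

Note that the structured variant yields a smaller dense tensor that standard hardware can run directly, whereas the unstructured mask leaves the tensor shape unchanged.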
Although these works (Lee et al., 2020; Lubana & Dick, 2021; Wang et al., 2021a; 2023) provide a plausibly sound explanation, a more practical issue is how to recover the broken trainability or maintain it during pruning. In this regard, Wang et al. (2021a) propose to apply weight orthogonalization based on QR decomposition (Trefethen & Bau III, 1997; Mezzadri, 2006) to the pruned model. However, their method is shown to only work for linear MLP networks. On modern deep convolutional neural networks (CNNs), how to maintain trainability during pruning remains elusive. We introduce trainability preserving pruning (TPP), a novel filter pruning algorithm (see Fig. 1) that maintains trainability via a regularized training process. By our observation, the primary cause of pruning breaking trainability lies in the dependency among parameters. The main idea of our approach is thus to decorrelate the pruned weights from the kept weights so as to "cut off" this dependency, so that the subsequent sparsifying operation barely hurts the network trainability. Specifically, we propose to regularize the gram matrix of weights: all the entries representing the correlation between the pruned filters (i.e., unimportant filters) and the kept filters (i.e., important filters) are encouraged to diminish to zero. This is the first technical contribution of our method. The second lies in how to treat the other entries.
Conventional dynamical isometry wisdom suggests orthogonality, namely 1 self-correlation and 0 cross-correlation, even among the kept filters. However, we find that directly translating the orthogonality idea here is unnecessary or even harmful, because the overly strong penalty constrains the optimization, leading to a worse local minimum. Rather, we propose not to impose any regularization on the correlation entries of the kept filters. Finally, modern deep models are typically equipped with batch normalization (BN) (Ioffe & Szegedy, 2015). However, previous filter pruning papers rarely explicitly take BN into account (except two (Liu et al., 2017; Ye et al., 2018); the differences of our work from theirs are discussed in Sec. 3.2) to mitigate the side effect when a BN unit is removed along with its associated filter. Since BN parameters are also part of the trainable parameters of the network, their unattended removal also leads to severely crippled trainability (especially at large sparsity). Therefore, BN parameters (both scale and bias) ought to be explicitly taken into account when developing the pruning algorithm. Based on this idea, we propose to regularize the two learnable parameters of BN to minimize the influence of their later absence. Practically, TPP is easy to implement and robust to hyper-parameter variations. With ResNet50 on ImageNet, TPP delivers encouraging results compared to many recent SOTA filter pruning methods.

Contributions. (1)

We present the first filter pruning method (trainability preserving pruning) that effectively maintains trainability during pruning for modern deep networks, via a customized weight gram matrix as the regularization target. (2) Apart from weight regularization, a BN regularizer is introduced to allow for their subsequent absence in pruning. This issue has been overlooked by most previous pruning papers, although it turns out to be quite important for preserving trainability, especially in the large sparsity regime.

It is noted that random pruning of a normally-sized (i.e., not severely over-parameterized) network usually leads to a significant performance drop. We need to cleverly choose some unimportant parameters to remove; such a criterion for choosing is called the pruning criterion. There have been two major paradigms addressing the pruning criterion problem dating back to the 1990s: regularization-based methods and importance-based (a.k.a. saliency-based) methods (Reed, 1993). Specifically, the regularization-based approaches choose unimportant parameters via a sparsity-inducing penalty term (e.g., Wen et al., 2016; Han et al., 2015; 2016b; Li et al., 2017; Liu et al., 2019b; Wang et al., 2021b; Gale et al., 2019; Hoefler et al., 2021).

An orthogonal weight matrix preserves norms: for y = Wx, ||y|| = √(y⊤y) = √(x⊤W⊤Wx) = ||x||, iff W⊤W = I, where I represents the identity matrix. Orthogonality of a weight matrix can be easily realized by matrix orthogonalization techniques such as QR decomposition (Trefethen & Bau III, 1997; Mezzadri, 2006). Exact dynamical isometry (namely, all the Jacobian singular values are exactly 1) can thus be approached by enforcing weight orthogonality:

KK⊤ = I ⇒ L_orth = ||KK⊤ − I||²_F,  ◁ kernel orthogonality
𝒦𝒦⊤ = I ⇒ L_orth = ||𝒦𝒦⊤ − I||²_F.  ◁ orthogonal convolution

Clearly the difference lies in the weight matrix K vs. 𝒦. (1) K denotes the original weight matrix of a convolutional layer. Weights of a CONV layer make up a 4d tensor in R^{N×C×H×W} (N stands for the output channel number, C for the input channel number, H and W for the height and width of the CONV kernel); K is a reshaped version of this 4d tensor: K ∈ R^{N×CHW} (if N < CHW; otherwise, K ∈ R^{CHW×N}). (2) In contrast, 𝒦 ∈ R^{NH_fo W_fo × CH_fi W_fi} stands for the doubly block-Toeplitz representation of K (H_fo stands for the output feature map height, H_fi for the input feature map height; W_fo and W_fi are defined the same way for width).
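As a quick sanity check of the norm-preservation property and the kernel-orthogonality penalty above, consider the following sketch (all shapes are illustrative):

```python
import torch

# Norm preservation: for y = Wx, ||y|| = ||x|| iff W^T W = I.
Q, _ = torch.linalg.qr(torch.randn(64, 64))       # QR gives an orthogonal factor Q
x = torch.randn(64)
assert torch.allclose(Q.t() @ Q, torch.eye(64), atol=1e-5)
assert torch.allclose((Q @ x).norm(), x.norm(), atol=1e-4)

# Kernel-orthogonality penalty: reshape the 4d CONV weight to K in R^{N x CHW}
# and penalize the distance of K K^T from the identity matrix.
w = torch.randn(16, 8, 3, 3)                      # N = 16 < CHW = 72
K = w.reshape(16, -1)
L_orth = (K @ K.t() - torch.eye(16)).pow(2).sum() # squared Frobenius norm
```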
In our case, we aim to remove some filters, so a natural idea is to regularize the weight gram matrix toward a partial identity matrix, with the diagonal entries of the pruned filters zeroed (see Fig. 2(b); note the diagonal green zeros). This scheme is simple and straightforward; however, by our empirical observation, it is not the best choice: it imposes an unnecessarily strong constraint on the remaining weights, which in turn hurts the optimization. Therefore, we seek a weaker constraint, demanding not perfect trainability (i.e., exact isometry realized by orthogonality) but only a benign status, a state of the neural network where gradients can flow effectively through the model without being interrupted. Orthogonality requires the Jacobian singular values to be exactly 1; in contrast, benign trainability only requires them not to be extremely large or small, so that the network can be trained normally. To this end, we propose to decorrelate the kept filters from the pruned ones: in the target gram matrix, all the entries associated with the pruned filters are zero; all the other entries stay as they are (see Fig. 2(c)). This scheme will be empirically justified (Tab. 3).

(1) Weight decorrelation. Specifically, all the filters in a layer are sorted by their L1-norms, and those with the smallest L1-norms are considered unimportant filters (the set S_l below); hence the proposed method also falls into the magnitude-based pruning group. The proposed regularization term is

L_1 = Σ_{l=1}^{L} ||W_l W_l⊤ ⊙ (1 − mm⊤)||²_F,  m_j = 0 if j ∈ S_l, else 1,

where W_l refers to the (reshaped) weight matrix of the l-th layer; 1 represents the all-ones matrix; m is a 0/1-valued column mask vector; ⊙ is the Hadamard (element-wise) product; and ||·||_F denotes the Frobenius norm.

(2) BN regularization.
Per the idea of preserving trainability, BN is not ignorable, since BN layers are trainable too. Removing filters changes the internal feature distributions; if the learned BN statistics do not change accordingly, the error accumulates and results in deteriorated performance (especially for deep networks). Consider the BN formulation (Ioffe & Szegedy, 2015),

f = γ (W ∗ X − µ) / √(σ² + ϵ) + β,

where ∗ stands for convolution; µ/σ² refer to the running mean/variance; and ϵ, a small number, is used for numerical stability. The two learnable parameters are γ and β. Although the unimportant weights are regularized toward sparsity, their magnitudes can barely reach exactly zero, making the subsequent removal of filters biased. This skews the feature distribution and renders the BN statistics inaccurate; using these biased BN statistics damages trainability. To mitigate this influence, we propose to regularize both the γ and β of the pruned feature map channels toward zero, which gives the following BN penalty term,

L_2 = Σ_{l=1}^{L} Σ_{j∈S_l} (γ_j² + β_j²).

The merits of BN regularization will be justified in our experiments (Tab. 4). To sum up, with the proposed regularization terms, the total error function is

E = L_cls + (λ/2)(L_1 + L_2),

where λ is the penalty coefficient, gradually grown during training (see Algorithm 1).

(2) We also compare the test accuracy before retraining; from this metric, we can see how robust different methods are in the face of weight removal.

TPP can perform as well as OrthP on linear MLPs. In Fig. 3, (b) is the scheme equipped with OrthP, which can exactly recover dynamical isometry (note that its mean JSV right after pruning is 1.0000), so it works as the oracle here. (1) OrthP improves the best accuracy from 91.36/90.54 to 92.79/92.77. Using TPP, we obtain 92.81/92.77; namely, in terms of accuracy, our method is as good as the oracle scheme. (2) Note the mean JSV right after pruning: L_1 pruning degrades the mean JSV from 2.4987 to 0.0040, and OrthP brings it back to 1.0000. In comparison, TPP achieves 3.4875, at the same order of magnitude as 1.0000, also as good as OrthP. These results demonstrate that, in terms of either the final evaluation metric (test accuracy) or the trainability measure (mean JSV), TPP performs as well as the oracle method OrthP on the linear MLP. Loss surface analysis with ResNet56 on CIFAR10. We further analyze the loss surfaces (Li et al., 2018) of networks pruned (before retraining) by different methods. Our result (deferred to the Appendix due to limited space; see Fig. 4) suggests that the loss surface of our method is flatter than those of other methods, implying a loss landscape that is easier to optimize.
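For concreteness, the two TPP regularization terms defined above can be sketched as follows; `tpp_penalties` and all shapes are illustrative stand-ins, not the official implementation:

```python
import torch

def tpp_penalties(conv_w, bn_gamma, bn_beta, pruned_idx):
    """Sketch of the two TPP terms for one layer (names are illustrative).

    L1 penalizes the gram-matrix entries that couple pruned and kept filters;
    L2 pushes the BN scale/bias of pruned channels toward zero.
    """
    N = conv_w.shape[0]
    K = conv_w.reshape(N, -1)            # reshape 4d weight to N x CHW
    gram = K @ K.t()                     # N x N gram matrix
    m = torch.ones(N)
    m[pruned_idx] = 0.0
    mask = 1.0 - torch.outer(m, m)       # 1 wherever a pruned filter is involved
    L1 = (gram * mask).pow(2).sum()      # ||W W^T ⊙ (1 - m m^T)||_F^2
    L2 = (bn_gamma[pruned_idx] ** 2 + bn_beta[pruned_idx] ** 2).sum()
    return L1, L2

# Filters with the smallest L1-norms form the unimportant set S_l.
w = torch.randn(8, 4, 3, 3)
scores = w.abs().sum(dim=(1, 2, 3))
pruned = torch.argsort(scores)[:4]
L1, L2 = tpp_penalties(w, torch.ones(8), torch.zeros(8), pruned)
```

Note that both terms vanish exactly once the pruned filters' weights and BN parameters reach zero, so the eventual removal no longer perturbs the network.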

4.2. RESNET56 ON CIFAR10 / VGG19 ON CIFAR100

Here we compare our method to other plausible solutions on the CIFAR datasets (Krizhevsky, 2009) with non-linear convolutional architectures. The results in Tab. 1 (CIFAR10) and Tab. 10 (CIFAR100, deferred to the Appendix due to limited space) show that: (1) OrthP does not work well; L_1 + OrthP underperforms the original L_1 under all five pruning ratios for both ResNet56 and VGG19. This further confirms that the weight orthogonalization method proposed for linear networks indeed does not generalize to non-linear CNNs. (2) For KernOrth vs. OrthConv, the results look mixed; OrthConv is generally better when applied before L_1 pruning. This is reasonable, since OrthConv is shown to be more effective than KernOrth in enforcing isometry (Wang et al., 2020), which in turn can withstand more damage from pruning. (3) Of particular note, none of the above five methods actually outperforms L_1 pruning or simple scratch training. This means that neither enforcing more isometry before pruning nor compensating isometry after pruning helps recover trainability. In stark contrast, our TPP method outperforms L_1 pruning and Scratch consistently across pruning ratios (the only exception is pruning ratio 0.7 on ResNet56, where our method is still the second best and the gap to the best is only marginal: 93.46 vs. 93.51). Besides, note the accuracy trend: in general, with a larger sparsity ratio, TPP beats L_1 or Scratch by a more pronounced margin. This is because, at a larger pruning ratio, trainability is impaired more, so our method can help more and harvest larger performance gains. We will see similar trends repeatedly. (4) In Tabs. 1 and 10, we also present the results when the initial retraining LR is 1e-3. Wang et al. (2021a) argue that if the broken dynamical isometry can be well maintained/recovered, the final performance gap between LR 1e-2 and 1e-3 should diminish.
Now that TPP is claimed to maintain trainability, the performance gap should become smaller. This is empirically verified in the tables: in general, the accuracy gap between LR 1e-2 and LR 1e-3 with TPP is smaller than that with L_1 pruning. Two exceptions are PR 0.9/0.95 on ResNet56, where LR 1e-3 is unusually better than LR 1e-2 for L_1 pruning. Despite these, the general picture is that the accuracy gap between LR 1e-3 and 1e-2 becomes smaller with TPP, a sign that trainability is effectively maintained.
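The trainability measure referenced throughout (mean JSV) can be estimated numerically; the sketch below uses a small hypothetical MLP, not the paper's architecture:

```python
import torch

torch.manual_seed(0)
# Mean Jacobian singular value (JSV): the singular values of the network's
# input-output Jacobian; values clustered near 1 indicate good trainability,
# while values near 0 (or exploding) indicate broken trainability.
net = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, 10))
x = torch.randn(32)
J = torch.autograd.functional.jacobian(net, x)   # Jacobian at x, shape (10, 32)
mean_jsv = torch.linalg.svdvals(J).mean().item()
```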

4.3. IMAGENET BENCHMARK

We further evaluate TPP on ImageNet-1K (Deng et al., 2009); the comparison with other top-performing methods is presented in Tab. 2.

4.4. ABLATION STUDY

This section presents ablation studies to demonstrate the merits of TPP's two major innovations: (1) not over-penalizing the kept weights in orthogonalization (i.e., (c) vs. (b) in Fig. 2); (2) regularizing the two learnable parameters in BN. The results are presented in Tabs. 3 and 4, where we compare the accuracy right after pruning (i.e., without retraining). We make the following major observations: (1) Tab. 3 shows that the decorrelate scheme (Fig. 2(c)) is generally better than the diagonal scheme (Fig. 2(b)). Akin to Tabs. 1 and 10, at a greater sparsity ratio the advantage of decorrelate is more pronounced, except at very large sparsity (0.95 for ResNet56, 0.9 for VGG19), where trainability is broken beyond repair. (2) For BN regularization, Tab. 4 shows that the performance degrades when it is switched off. It shows a similar trend: BN regularization is more helpful under larger sparsity.

5. CONCLUSION

Trainability preservation is shown to be critical in neural network pruning, while few works have realized it on modern large-scale non-linear deep networks. Towards this end, we present a novel filter pruning method named trainability preserving pruning (TPP), based on regularization. Specifically, we propose an improved weight gram matrix as the regularization target, which does not unnecessarily over-penalize the retained important weights. Besides, we propose to regularize the BN parameters to mitigate their damage to trainability. Empirically, TPP performs as effectively as the ground-truth trainability recovery method and more effectively than other counterpart approaches based on weight orthogonality. Furthermore, on the standard ImageNet-1K benchmark, TPP also matches or even beats many recent SOTA filter pruning approaches. To our knowledge, TPP is the first approach that explicitly tackles the trainability preservation problem in structured pruning while easily scaling to large-scale datasets and networks.

A EXPERIMENTAL SETTINGS

Training setups and hyper-parameters. Tab. 5 summarizes the detailed training setups. The hyper-parameters introduced by our TPP method, the regularization granularity ∆, regularization ceiling τ, and regularization update interval K_u, are summarized in Tab. 6. We mainly refer to the official code of GReg-1 (Wang et al., 2021b) when setting these hyper-parameters, since we tap into a similar growing regularization scheme. For the small datasets (CIFAR and MNIST), each reported result is averaged over at least 3 random runs, with mean and std reported. For ImageNet-1K, we cannot run multiple times due to our limited resource budget; that said, the results on ImageNet have generally been shown to be quite stable.

Published as a conference paper at ICLR 2023

Hardware and running time. We conduct all our experiments on 4 NVIDIA V100 GPUs (16GB memory per GPU).
It takes roughly 41 hrs to prune ResNet50 on ImageNet with our TPP method (pruning and 90-epoch retraining both included). Among them, 12 hrs (close to 30%) are spent on pruning and 29 hrs on retraining (about 20 mins per epoch).

Layerwise pruning ratios. The layerwise pruning ratios are pre-specified in this paper. For the ImageNet benchmark, we exactly follow GReg (Wang et al., 2021b) for the layerwise pruning ratios to keep the comparison to it fair. The specific numbers are summarized in Tab. 7. Each number is the pruning ratio shared by all the layers of the same stage in ResNet34/50. On top of these ratios, some layers are skipped, such as the last CONV layer in a residual block. The best way to examine the detailed layerwise pruning ratios is to check the code at: https://github.com/MingSun-Tse/TPP.

B SENSITIVITY ANALYSIS OF HYPER-PARAMETERS

Among the three hyper-parameters in Tab. 6, the regularization ceiling τ works as a termination condition. We only require it to be large enough to ensure the unimportant weights are compressed to a very small magnitude; it does not have to be 1, and the final performance is less sensitive to it. The pruned performance seems more sensitive to the other two hyper-parameters, so we conduct a sensitivity analysis here to check their robustness. Results are presented in Tab. 8 and Tab. 9. Pruning ratio 0.7 (for ResNet56) and 0.5 (for VGG19) are chosen because the resulting sparsity is the most representative (i.e., neither too large nor too small). (1) For K_u, in general a larger K_u tends to deliver a better result. This is no surprise, since a larger K_u allows more iterations for the network to adapt and recover while undergoing the penalty. (2) For ∆, we do not see a clear pattern: either a small or a large ∆ can achieve the best result (for different networks). On the whole, when varying the hyper-parameters within a reasonable range, the performance is quite robust, with no catastrophic failures.
Moreover, note that the default setting is actually not the best for either K_u or ∆. This is because we did not heavily search for the best hyper-parameters; nevertheless, they still achieve encouraging performance compared to the counterpart methods, as shown in the main text.

C ALGORITHM DETAILS

The details of our TPP method are summarized in Algorithm 1.

Algorithm 1 Trainability Preserving Pruning (TPP)

1: Input: Pretrained model Θ; layerwise pruning ratio r_l of the l-th layer, for l ∈ {1, 2, ..., L}.
2: Input: Regularization ceiling τ, penalty coefficient update interval K_u, penalty granularity ∆.
3: Init: Iteration i = 0; λ_j = 0 for every filter j; set the pruned filter indices S_l by L_1-norm sorting.
4: while λ_j ≤ τ, for j ∈ S_l do
5:     if i % K_u == 0 then
6:         λ_j = λ_j + ∆ for j ∈ S_l.  ▷ Update the regularization coefficient in Eq. (6)
7:     end if
8:     Network forward; loss (Eq. (6)) backward; parameter update by stochastic gradient descent.
9:     Update iteration: i = i + 1.
10: end while
11: Remove the filters in S_l to obtain a smaller model Θ′.
12: Retrain Θ′ to regain accuracy.
13: Output: Retrained model Θ′.
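Below is a runnable toy sketch of the schedule in Algorithm 1. The penalty here is a simplified magnitude penalty standing in for the full TPP loss of Eq. (6), and all hyper-parameter values and shapes are illustrative, not the paper's:

```python
import torch

torch.manual_seed(0)
tau, K_u, delta = 1.0, 5, 0.1            # ceiling, update interval, granularity
conv = torch.nn.Conv2d(4, 8, 3)
opt = torch.optim.SGD(conv.parameters(), lr=0.1)

# Step 3: pick the pruned filter indices S_l by L1-norm sorting.
scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
S_l = torch.argsort(scores)[:4]
before = conv.weight.detach()[S_l].norm().item()

lam, i = 0.0, 0
while lam <= tau:                        # Step 4: grow lambda until the ceiling
    if i % K_u == 0:
        lam += delta                     # Step 6: update the penalty coefficient
    opt.zero_grad()
    out = conv(torch.randn(2, 4, 8, 8))
    task_loss = out.pow(2).mean()        # stand-in for the task loss
    penalty = conv.weight[S_l].pow(2).sum()  # simplified penalty on S_l only
    (task_loss + lam * penalty).backward()
    opt.step()
    i += 1

# Steps 11-12: remove the S_l filters, then retrain the smaller model.
kept = torch.argsort(scores)[4:]
small_weight = conv.weight.detach()[kept]
```

The gradually growing coefficient gives the kept weights time to adapt while the unimportant filters are slowly driven toward zero, mirroring the growing-regularization scheme of GReg that the paper builds on.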

D RESULTS OMITTED FROM THE MAIN TEXT

Loss surface visualization. The loss surface visualization of ResNet56 on CIFAR10 is presented in Fig. 4.

VGG19 on CIFAR100. The results of VGG19 on CIFAR100 are shown in Tab. 10.

Examination of the early retraining phase. To further understand how pruning hurts trainability and how our TPP method maintains it, in Tab. 11 we list the mean JSVs over the first 10 retraining epochs (at pruning ratio 0.9). Note the obvious mean JSV gap between LR 0.01 and LR 0.001 without OrthP: LR 0.01 reaches mean JSV 0.65 after just 1 epoch of retraining, while LR 0.001 takes over 8 epochs. When OrthP is used, this gap greatly shrinks. We also list the test accuracy over the first 10 retraining epochs in Tab. 12. In particular, the test accuracy correlates well with the mean JSV trend under each setting, implying that the damaged trainability primarily accounts for the under-performance of LR 0.001 after pruning. When TPP is used in place of OrthP, after 1 epoch of retraining the model achieves a mean JSV above 1 and test accuracy over 90%, closely mirroring the effect of OrthP. These results reiterate that the proposed TPP method works just as effectively as the ground-truth trainability recovery method OrthP in this toy setup.

E.1 GIVING ONE-SHOT METHODS THE SAME TOTAL TRAINING COST

First, note that the results of L_1+KernOrth, L_1+OrthConv, KernOrth+L_1, and OrthConv+L_1 in Tab. 1 also involve training (KernOrth/OrthConv is essentially a regularized training), which takes 50k iterations. We make the following changes: • Our TPP takes 100k iterations with the default hyper-parameter setup, so for a fair comparison we decrease the regularization update interval K_u from 10 to 5, making the regularization stage of TPP also take 50k iterations.
• Meanwhile, we add 128 retraining epochs (50k iterations / 391 iterations per epoch ≈ 128 epochs) to the L_1 and L_1+OrthP methods (when their retraining epochs are increased, the LR decay epochs are proportionally scaled); plus the original 120 epochs, the total is now 248 epochs. We make the following observations: • For L_1 pruning, more retraining epochs do not always help. Comparing these results to Tab. 1, at small PRs (0.3, 0.5) the accuracy drops a little (probably due to overfitting: when the PR is small, the pruned model does not need so many epochs to recover, and overly long training triggers overfitting). For larger PRs (like 0.95), more epochs help quite significantly (improving the accuracy by 0.91%). • L_1+OrthP still underperforms L_1, as in Tab. 1. • Despite using fewer epochs, TPP is still quite robust: compared to Tab. 1, the performance varies by a very marginal gap (about 0.1%, within the std range, so not statistically significant). Overall, TPP remains the best among all the compared methods, and its advantage is more obvious at larger PRs, implying TPP is more valuable in more aggressive pruning cases.

E.2 TPP + Random BASE MODELS

Pruning is typically conducted on a pretrained model, and TPP is also brought up in this context. That said, the fundamental idea of TPP, i.e., the proposed regularized training that preserves trainability through the sparsifying action, is actually independent of the base model. Therefore, we may expect TPP to also surpass the L_1 pruning method (Li et al., 2017) when applied to random (untrained) base models. Note that when trainability is preserved, retraining LRs 0.01 and 0.001 do not pose an obvious test accuracy gap; when trainability is not preserved, the gap is obvious — see Fig. 5(a).

E.3 TPP WITH PRUNED INDICES FROM OTHER PRUNING CRITERIA

Specifically, we first use these methods (e.g., SSL (Wen et al., 2016)) to decide the layerwise pruned indices, given a total pruning ratio. Then we inherit these layerwise pruned indices when using our TPP method. Results are shown in Tab. 15. We observe that at small PRs (0.3-0.7), TPP performs similarly to L_1, which agrees with Tab. 1. At large PRs (0.9, 0.95), the advantage of TPP becomes more exposed: at PR 0.9/0.95, TPP beats L_1 by 0.90/1.44 with the SSL-learned pruned indices, a statistically significant advantage as indicated by the std (and again, when the PR is larger, the advantage of TPP is generally more pronounced). This table shows that the advantage of TPP indeed carries over to layerwise pruned indices derived from more advanced pruning criteria.

E.4 TRAINING CURVE PLOTS: TPP vs. L_1 PRUNING

In Fig. 8, we show the training curves of our TPP compared to L_1 pruning with ResNet56 on CIFAR10.

E.5 PRUNING RESNET50 ON IMAGENET WITH LARGER SPARSITY RATIOS

It is noted that our method beats some of the compared methods only marginally (<0.5% top-1 accuracy) in the low sparsity regime (around 2×~3× speedup; see Tab. 2). This is mainly because when the sparsity is low, the network trainability is not seriously damaged, so our trainability-preserving method cannot fully expose its advantage. Here we showcase a scenario where trainability is intentionally damaged more dramatically. In Tab. 2, when pruning ResNet50, researchers typically do not prune all the layers: the last CONV layer in a residual block is usually spared (Li et al., 2017; Wang et al., 2021b) for the sake of performance. Here we intentionally prune all the layers (only excluding the first CONV layer and the final classifier FC) of ResNet50. For the L_1 pruning method (Li et al., 2017), we report a stronger version re-implemented by Wang et al. (2023); different layers are pruned independently since the layerwise pruning ratios are pre-specified. All the hyper-parameters of the retraining process are kept the same for fair comparison, per the spirit of Wang et al. (2023). The results in Tab. 16 show that, when the trainability is impaired more, our TPP beats L_1 by 0.77 to 2.17 top-1 accuracy points on ImageNet, a much more significant margin than in Tab. 2.

F MORE DISCUSSIONS

Can TPP be useful for finding lottery-ticket subnetworks at the filter level? To the best of our knowledge, filter-level winning tickets (WTs) are still hard to find even using the original LTH pipeline. Few attempts in this direction have succeeded. E.g., Chen et al. (2022) tried, but they could only achieve a rather marginal sparsity (around 30%) with filter-level WTs (see their Fig. 3, ResNet50 on ImageNet), while weight-level WTs can typically be found at over 90% sparsity. That said, we do think this paper can contribute in that direction, since preserving trainability is also a central issue in LTH.



https://github.com/pytorch/examples/tree/master/imagenet
https://github.com/MingSun-Tse/Regularization-Pruning
https://github.com/samaonline/Orthogonal-Convolutional-Neural-Networks
https://github.com/Eric-mingjie/rethinking-network-pruning/tree/master/imagenet/l1-norm-pruning



Figure 1: Illustration of the proposed TPP algorithm on a typical residual block. Weight parameters are classified into two groups, as in a typical pruning algorithm: important (white) and unimportant (orange or blue), right from the beginning (before any training starts), based on the filter L_1-norms. Then only the unimportant parameters are enforced with the proposed TPP regularization terms, which is the key to maintaining trainability when the unimportant weights are eventually eliminated from the network. Notably, the critical part of a regularization-based pruning algorithm lies in its specific regularization term, i.e., Eqs. (3) and (5), which we will show perform more favorably than other alternatives (see Tabs. 1 and 10).

Figure 2: Regularization target comparison between the proposed scheme (c) and similar counterparts (a) and (b). Green part stands for zero entries. Index 1 to N denotes the filter indices. In (b, c), filter 2 and N are the unimportant filters to be removed. (a) Regularization target of pure kernel orthogonality (an identity matrix), no pruning considered. (b) Regularization target of directly applying the weight orthogonality to filter pruning. (c) Regularization target of the proposed weight de-correlation solution in TPP: only regularize the filters to be removed, leave the others unconstrained. This scheme maintains trainability while imposing the least constraint on the weights.

Figure 4: Loss surface visualization of pruned models by different methods (w/o retraining).ResNet56 on CIFAR10. Pruning ratio: 0.9 (zoom in to examine the details).

Figure 7: Layerwise pruning ratios learned by SSL (Wen et al., 2016) with ResNet56 on CIFAR10, given different total pruning ratios (indicated in the title of each sub-figure).

Figure 8: Training curves during retraining with ResNet56 on CIFAR10 at different pruning ratios (PRs). We can observe that at large PRs (0.9, 0.95), TPP significantly accelerates the optimization in the comparison to L 1(Li et al., 2017), because of better trainability preserved before retraining.


(3) Practically, the proposed method can easily scale to modern deep networks (such as ResNets) and datasets (such as ImageNet-1K (Deng et al., 2009)). It achieves promising pruning performance in comparison to many SOTA filter pruning methods. This paper targets structured pruning (filter pruning, to be specific) because, compared with the early single-branch convolutional networks, it is more imperative to make modern networks (e.g., ResNets (He et al., 2016)) faster rather than smaller.



Wang et al. (2021a) propose OrthP to recover the broken trainability of pruned pretrained models. Furthermore, since weight orthogonality is closely related to network trainability and there have been plenty of orthogonality regularization approaches (Xie et al., 2017; Wang et al., 2020; Huang et al., 2018; 2020), a straightforward solution is to combine them with L_1 pruning (Li et al., 2017) and see whether they can help maintain or recover the broken trainability. Two plausible combination schemes are easy to see: 1) apply orthogonality regularization methods before L_1 pruning; 2) apply orthogonality regularization methods after L_1 pruning, i.e., during retraining. Two representative orthogonality regularization methods are selected for their proven effectiveness: kernel orthogonality (KernOrth) (Xie et al., 2017) and convolutional orthogonality (OrthConv) (Wang et al., 2020). In total, there are four combinations: L_1 + KernOrth, L_1 + OrthConv, KernOrth + L_1, OrthConv + L_1.

Test accuracy (%) comparison among different isometry maintenance or recovery methods with ResNet56 on CIFAR10. Scratch stands for training from scratch; KernOrth for kernel orthogonalization (Xie et al., 2017); OrthConv for convolutional orthogonalization (Wang et al., 2020). Two retraining LR schedules are evaluated: initial LR 1e-2 vs. 1e-3. Acc. diff. refers to the accuracy gap of LR 1e-3 against LR 1e-2.

Comparison on the ImageNet-1K validation set. * An advanced training recipe (such as a cosine LR schedule) is used; we single these methods out for fair comparison.

Test accuracy (without retraining) comparison between the two plausible schemes, "diagonal" vs. "decorrelate", in our TPP method.

Test accuracy (without retraining) comparison w.r.t. the proposed weight gram matrix regularization and BN regularization. PR stands for layerwise pruning ratio.
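For intuition on the gram-matrix term, the following is a hedged NumPy sketch of our reading of the "decorrelate" scheme described in the abstract: penalize the gram entries that couple pruned filters with retained ones, so that removing the pruned rows perturbs the retained ones as little as possible. The exact masking in the paper's implementation may differ, and the companion BN regularization on the scale/bias of pruned channels is omitted here.

```python
import numpy as np

def tpp_decorrelate_penalty(W, kept):
    """Sketch of a 'decorrelate' gram-matrix penalty: suppress gram entries
    between pruned and retained filters.
    W: conv weight (n_filters, in_channels, k, k); kept: boolean mask of
    filters to retain (False = to be pruned)."""
    Wf = W.reshape(W.shape[0], -1)
    G = Wf @ Wf.T                                   # filter gram matrix
    cross = np.logical_xor.outer(kept, kept)        # (pruned, kept) pairs, both orders
    return np.sum(G[cross] ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 2, 3, 3))
kept = np.array([True, True, True, False, False, False])
assert tpp_decorrelate_penalty(W, kept) > 0
# With nothing pruned there are no cross terms, so the penalty vanishes.
assert tpp_decorrelate_penalty(W, np.ones(6, dtype=bool)) == 0.0
```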

Data split. All the datasets in this paper are public datasets with standard APIs in PyTorch (Paszke et al., 2019). We employ these standard APIs for the train/test data split to ensure a fair comparison with other methods.

Summary of training setups. In the parentheses after SGD are the momentum and weight decay. For the LR schedule, the first number is the initial LR; the numbers in brackets are the epochs at which the LR is decayed by a factor of 10; #epochs stands for the total number of epochs.
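The step-decay notation above can be read off with a tiny helper (a sketch; the decay epochs below are an illustrative example, not taken from the table):

```python
def step_lr(epoch, init_lr, decay_epochs, factor=0.1):
    """LR under step decay: multiplied by `factor` once for each entry of
    `decay_epochs` that the current epoch has reached."""
    return init_lr * factor ** sum(epoch >= e for e in decay_epochs)

# E.g., a schedule written as "1e-2 [60, 90]" in the table's notation:
assert step_lr(0, 1e-2, [60, 90]) == 1e-2
assert abs(step_lr(75, 1e-2, [60, 90]) - 1e-3) < 1e-12
assert abs(step_lr(95, 1e-2, [60, 90]) - 1e-4) < 1e-12
```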

Hyper-parameters of our methods.

A brief summary of the layerwise pruning ratios (PRs) of ImageNet experiments.

Robustness analysis of K_u on the CIFAR10 and CIFAR100 datasets with our TPP algorithm. By default, K_u = 10. Layerwise PR = 0.7 for ResNet56 and 0.5 for VGG19. The best result is highlighted in red and the worst in blue.

Robustness analysis of ∆ on the CIFAR10 and CIFAR100 datasets with our TPP algorithm. By default, ∆ = 1e-4. Layerwise PR = 0.7 for ResNet56 and 0.5 for VGG19. The best result is highlighted in red and the worst in blue.

Test accuracy (%) comparison among different dynamical isometry maintenance or recovery methods with VGG19 on CIFAR100. Scratch stands for training from scratch; KernOrth for kernel orthogonalization (Xie et al., 2017); OrthConv for convolutional orthogonalization (Wang et al., 2020). Two retraining LR schedules are evaluated: initial LR 1e-2 vs. 1e-3. Acc. diff. refers to the accuracy gap of LR 1e-3 against LR 1e-2. Note that TPP takes a few epochs of regularized training before the sparsifying action, while L1 pruning is one-shot, taking no extra epochs; i.e., the total training cost of TPP is larger than that of one-shot methods. It is of interest how the comparison would change if the one-shot methods were given more training epochs.

Mean JSV over the first 10 epochs under different retraining settings. Epoch 0 refers to the model just pruned, before any retraining. The pruning ratio is 0.9. Note that with OrthP, the mean JSV is 1 because OrthP achieves exact isometry.
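For a deep linear network the mean JSV is easy to compute directly: the input-output Jacobian is just the product of the weight matrices, dynamical isometry means all its singular values are near 1, and zeroing out rows (structured pruning) visibly collapses them. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def mean_jsv_linear(weights):
    """Mean Jacobian singular value of a deep *linear* network:
    the input-output Jacobian is the product of the weight matrices."""
    J = weights[0]
    for W in weights[1:]:
        J = W @ J
    return np.linalg.svd(J, compute_uv=False).mean()

rng = np.random.default_rng(0)
# Orthogonal layers compose to an orthogonal Jacobian: exact isometry, mean JSV = 1.
orth = [np.linalg.qr(rng.standard_normal((16, 16)))[0] for _ in range(4)]
assert np.isclose(mean_jsv_linear(orth), 1.0)
# Zeroing half the rows of one layer kills half the singular values: mean JSV drops to 0.5.
pruned = [W.copy() for W in orth]
pruned[1][8:] = 0.0
assert np.isclose(mean_jsv_linear(pruned), 0.5)
```

For nonlinear ConvNets the Jacobian additionally depends on the input and the activation pattern, which is why the JSV in the table is measured empirically rather than computed in closed form.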

Test accuracy (%) over the first 10 epochs, corresponding to Tab. 11, under different retraining settings. Epoch 0 refers to the model just pruned, before any retraining. The pruning ratio is 0.9.

w/ TPP (ours): 89.21, 91.54, 91.01, 91.45, 91.83, 91.56, 90.89, 91.33, 90.68, 91.54, 91.21
LR=10^-3, w/ TPP (ours): 89.21, 92.12, 91.82, 92.09, 92.15, 91.95, 92.00, 92.02, 92.09, 92.08, 92.08

Now all the comparison methods in Tab. 1 have the same training cost (i.e., the same 248 total epochs). The new results of L1, L1+OrthP, and TPP under this strict comparison setup are presented below:

Test accuracy comparison under the same total epochs (ResNet56 on CIFAR10).

Comparison between L1 pruning (Li et al., 2017) and our TPP with pruned indices derived from a more advanced pruning criterion (Taylor-FO (Molchanov et al., 2019)) or regularization schemes (SSL (Wen et al., 2016), DeepHoyer (Yang et al., 2020)). Network/Dataset: ResNet56/CIFAR10; unpruned acc. 93.78%, Params: 0.85M, FLOPs: 0.25G. Total PR denotes the pruning ratio (PR) of the whole network. Note that, due to the non-uniform layerwise PRs, the speedups below, which depend on the feature map spatial size, can differ considerably from each other even under the same total PR.
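The two pruning criteria compared above can be sketched side by side in NumPy (an illustration under our assumptions: Taylor-FO scores each filter by the squared sum of weight-times-gradient, with the gradients G coming from a data batch that is outside this sketch; function names are ours):

```python
import numpy as np

def l1_filter_scores(W):
    """Magnitude criterion: L1 norm of each filter."""
    return np.abs(W.reshape(W.shape[0], -1)).sum(axis=1)

def taylor_fo_filter_scores(W, G):
    """First-order Taylor criterion: squared sum of weight * gradient per
    filter -- an estimate of the loss change if the filter is removed."""
    n = W.shape[0]
    return (W.reshape(n, -1) * G.reshape(n, -1)).sum(axis=1) ** 2

rng = np.random.default_rng(0)
W, G = rng.standard_normal((8, 3, 3, 3)), rng.standard_normal((8, 3, 3, 3))
k = int(0.5 * W.shape[0])                              # layerwise PR = 0.5
pruned_l1 = np.argsort(l1_filter_scores(W))[:k]        # lowest-magnitude filters
pruned_tfo = np.argsort(taylor_fo_filter_scores(W, G))[:k]
assert len(pruned_l1) == len(pruned_tfo) == 4
```

The point of the table is that TPP is criterion-agnostic: whichever scheme supplies the pruned indices, TPP's regularization is applied on top of them.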

Top-1 accuracy comparison between TPP and L1 pruning with larger pruning ratios (PRs). All layers but the first CONV and the last FC layer (including the downsample layers and the 3rd CONV in a residual block) are pruned. A uniform layerwise pruning ratio is used.

Availability. https://github.com/MingSun-Tse/TPP.

A IMPLEMENTATION DETAILS

Code reference. We mainly refer to the following code implementations in this work. They are all openly licensed.


not only exists for pruning pretrained models, but also exists for pruning random models; so TPP performs better than L1 pruning, especially in the large-sparsity regime. Moreover, in Fig. 5, we show the mean JSV and test accuracy when pruning a random model with different schemes (L1, L1+OrthP, and our TPP). We observe that, in general, the JSV and test accuracy of pruning a random model show similar patterns to pruning a pretrained model (Fig. 3):

• Using L1, LR=0.01 achieves "better" JSV and test accuracy (not really better: due to the damaged trainability, the performance of LR=0.001 has been underestimated; see Wang et al. (2023) for a detailed discussion); note that the test-accuracy solid line is above the dashed line by an obvious margin.

• When using L1+OrthP or our TPP, LR=0.001 can actually match LR=0.01. As in the case of pruning a pretrained model, TPP behaves similarly to the oracle trainability recovery method OrthP.

• To summarize, the trainability-preserving effect of TPP also generalizes to pruning random networks.

SSL and DeepHoyer are regularization-based pruning methods like ours; differently, their layerwise pruned indices (as well as the pruning ratios) are not pre-specified but "learned" by the regularized training. As such, the layerwise pruning ratios are not uniform (see Fig. 7 for an example). Taylor-FO (Molchanov et al., 2019) is a more complex pruning criterion than magnitude, taking first-order gradient information into account.
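The "learned" pruning ratios of SSL-style methods can be illustrated with a short NumPy sketch (our illustration, assuming the standard group-lasso formulation where each filter forms one group; after regularized training, the kept set is simply read off from which filter norms survived a small threshold):

```python
import numpy as np

def group_lasso_penalty(W):
    """SSL-style group-lasso term: sum of the L2 norms of whole filters,
    which drives entire filters (groups) toward exactly zero in training."""
    n = W.shape[0]
    return np.linalg.norm(W.reshape(n, -1), axis=1).sum()

def learned_kept_filters(W, tol=1e-3):
    """After regularized training, the layerwise pruning ratio is not
    pre-specified: it is whatever fraction of filter norms fell below tol."""
    n = W.shape[0]
    return np.linalg.norm(W.reshape(n, -1), axis=1) > tol

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 2, 3, 3))
W[4:] *= 1e-6                  # pretend the regularization drove these two filters to ~0
kept = learned_kept_filters(W)
assert kept.tolist() == [True, True, True, True, False, False]
assert group_lasso_penalty(W) > 0
```

This is the key contrast with L1 pruning and TPP in the table above, where the layerwise pruned indices are fixed before the sparsifying action.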

