TRAINABILITY PRESERVING NEURAL PRUNING

Abstract

Many recent works have shown that trainability plays a central role in neural network pruning: unattended, broken trainability can lead to severe under-performance and unintentionally amplify the effect of the retraining learning rate, resulting in biased (or even misinterpreted) benchmark results. This paper introduces trainability preserving pruning (TPP), a scalable method for preserving network trainability against pruning, aiming at improved pruning performance and greater robustness to retraining hyper-parameters (e.g., learning rate). Specifically, we propose to penalize the gram matrix of the convolutional filters so as to decorrelate the pruned filters from the retained ones. Beyond the convolutional layers, in the spirit of preserving the trainability of the whole network, we also regularize the batch normalization parameters (scale and bias). Empirical studies on linear MLP networks show that TPP performs on par with the oracle trainability recovery scheme. On nonlinear ConvNets (ResNet56/VGG19) on CIFAR10/100, TPP outperforms its counterparts by an obvious margin. Moreover, results with ResNets on ImageNet-1K suggest that TPP consistently performs favorably against other top-performing structured pruning approaches.
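As a minimal illustration of the gram-matrix decorrelation idea (a NumPy sketch under our own assumptions, not the paper's actual implementation — the function name and the squared-Frobenius form of the penalty are chosen for illustration):

```python
import numpy as np

def gram_decorrelation_penalty(weights, pruned_idx):
    """Sketch of a gram-matrix decorrelation penalty (illustrative).

    weights:    (num_filters, fan_in) array, each conv filter flattened.
    pruned_idx: indices of filters scheduled for removal.

    Penalizes the gram-matrix entries coupling pruned filters with
    retained ones, encouraging them to decorrelate before removal.
    """
    gram = weights @ weights.T                                   # (F, F) filter correlations
    kept_idx = np.setdiff1d(np.arange(weights.shape[0]), pruned_idx)
    cross = gram[np.ix_(pruned_idx, kept_idx)]                   # pruned-vs-kept block
    return float(np.sum(cross ** 2))                             # squared-Frobenius penalty
```

For mutually orthogonal filters the pruned-vs-kept block of the gram matrix is zero, so the penalty vanishes; correlated filters incur a positive cost that gradient descent can drive down during pruning.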

1. INTRODUCTION

Neural pruning aims to remove redundant parameters without seriously compromising performance. It normally consists of three steps (Reed, 1993; Han et al., 2015; 2016b; Li et al., 2017; Liu et al., 2019b; Wang et al., 2021b; Gale et al., 2019; Hoefler et al., 2021; Wang et al., 2023): pretrain a dense model; prune the unnecessary connections to obtain a sparse model; retrain the sparse model to regain performance. Pruning is usually categorized into two classes, unstructured pruning (a.k.a. element-wise or fine-grained pruning) and structured pruning (a.k.a. filter or coarse-grained pruning). Unstructured pruning takes a single weight as the basic pruning element, while structured pruning takes a group of weights (e.g., a 3d filter or a 2d channel). Structured pruning is better suited to acceleration because of its regular sparsity. Unstructured pruning, in contrast, results in irregular sparsity that is hard to exploit for acceleration unless customized hardware and libraries are available (Han et al., 2016a; 2017; Wen et al., 2016).

Recent papers (Renda et al., 2020; Le & Hua, 2021) report an interesting phenomenon: during retraining, a larger learning rate (LR) helps achieve a significantly better final performance, empowering two baseline methods, random pruning and magnitude pruning, to match or beat many more complex pruning algorithms. The reason is argued (Wang et al., 2021a; 2023) to be related to the trainability of neural networks (Saxe et al., 2014; Lee et al., 2020; Lubana & Dick, 2021). They make two major observations to explain this LR effect mystery (Wang et al., 2023). (1) The weight-removal operation immediately breaks the trainability, or dynamical isometry (Saxe et al., 2014) (the ideal case of trainability), of the trained network.
(2) The broken trainability slows down optimization during retraining, where a larger LR helps the model converge faster, so a better performance is observed earlier; a smaller LR can actually do just as well, but needs more epochs. Although these works (Lee et al., 2020; Lubana & Dick, 2021; Wang et al., 2021a; 2023) provide a plausible explanation, a more practical issue is how to recover the broken trainability, or maintain it during pruning. In this regard, Wang et al. (2021a) propose to apply weight orthogonalization based on QR decomposition (Trefethen & Bau III, 1997; Mezzadri, 2006) to the pruned

Code availability: https://github.com/MingSun-Tse/TPP.
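The QR-based weight orthogonalization mentioned above can be sketched roughly as follows. This is a simplified NumPy illustration under our own assumptions (flattened filters, more inputs than filters, per-filter norms preserved), not the implementation of Wang et al. (2021a):

```python
import numpy as np

def orthogonalize_filters(weights):
    """Sketch: re-orthogonalize flattened conv filters via QR (illustrative).

    weights: (num_filters, fan_in) array with num_filters <= fan_in.
    Returns filters with mutually orthogonal directions, each keeping
    its original L2 norm.
    """
    q, _ = np.linalg.qr(weights.T)                       # thin QR of the transpose
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return norms * q.T                                   # orthogonal rows, original scales
```

After this repair the gram matrix of the filters is diagonal, which is one way to approach the dynamical-isometry condition discussed above; the QR factorization only fixes directions, so rescaling by the original norms keeps the layer's magnitude statistics intact.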

