THE CROSSWORD PUZZLE: SIMPLIFYING DEEP NEURAL NETWORK PRUNING WITH FABULOUS COORDINATES

Abstract

Pruning is a promising technique to shrink the size of Deep Neural Network models with only negligible accuracy loss. Recent efforts rely on experience-derived metrics to guide the pruning procedure, which heavily limits how well pruning methods generalize. We propose the Crossword Puzzle, a new method that simplifies this procedure by automatically deriving pruning metrics. The key insight behind our method is that, for Deep Neural Network models, a pruning-friendly distribution of the model's weights can be obtained given a proper coordinate. We experimentally confirm this insight, and denote such a coordinate as a Fabulous Coordinate. Our quantitative evaluation shows that the Crossword Puzzle finds a simple yet effective metric which outperforms state-of-the-art pruning methods: it delivers no accuracy degradation on ResNet-56 (CIFAR-10) and ResNet-101 (ImageNet), while raising the pruning rate to 70% and 50% for the respective models.

1. INTRODUCTION

Pruning Deep Neural Network models is a promising way to reduce model size while keeping the same level of accuracy. Prior art focuses on the design of the pruning procedure, such as iterative pruning (Han et al. (2015a)), one-shot pruning (Lee et al. (2018)), and pruning without training (Ramanujan et al. (2020)). However, prior works craft their pruning metrics manually, based on testing experience. Our goal in this work is to design a method that automatically searches for a proper pruning metric. Following classic pipelines (e.g. the Genetic Algorithm (Mitchell (1998)) and Ant Colony Optimization (Dorigo & Di Caro (1999))), we first systematically summarize the three components such a method requires: ➊ basic building blocks of pruning criteria; ➋ an objective function to evaluate auto-generated pruning metrics; ➌ a heuristic search process to guide the search. Under this summary, prior works mainly provide the first and third components (for instance, we can use the L1-norm (Li et al. (2016)) and the geometric median (He et al. (2018b)) as building blocks, and simulated annealing (Kirkpatrick et al. (1983)) as our search guide). However, it remains unclear how an objective function should measure the quality of a given pruning metric (namely, the unfilled letters in our "crossword puzzle" analogy). This motivates us to examine the essential condition(s) of a good-quality pruning criterion. Based on a simple magnitude-based pruning method (Han et al. (2015b)) and the follow-up weight-distribution analysis (Liu et al.
(2018)), we formalize one such essential condition and describe it as follows. Given a coordinate Ψ (the formal expression of a pruning criterion) and a neural network model M, Ψ is highly likely to be highly-qualified¹ if the distribution D(M) obtained from Ψ(M) satisfies the following requirements:

• Centralized distribution: the statistics concentrate on a single center of the distribution, which is an important symbol of over-parameterized neural networks.

• Retraining recovers the centralized distribution: after the peak at the center is cut off by pruning, retraining regathers the statistics at the original center, which means that the objects located at the center can replace each other.

• Central collapse: at the end of pruning, retraining can no longer drive the statistics to fill the central void (namely, central collapse). This signals that there is almost no redundancy left in the model, and it also alludes to the success of our criterion selection.

We denote such a coordinate Ψ as a Fabulous Coordinate, and the distribution D it generates as the Fabulous Distribution. Based on this formalization, we can convert the pruning-criterion search problem into finding a Fabulous Coordinate. By quantitatively characterizing the Fabulous Distribution with the Loose-KL-Centralization-Degree (LKL-CD), we formulate the objective function and build the Crossword Puzzle, a pruning-criteria search framework. We experimentally evaluate the effectiveness of the Crossword Puzzle: we first use it to find a Fabulous Coordinate, and then leverage that coordinate to guide our pruning. The results confirm the effectiveness of our method on CIFAR-10 and ImageNet.
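The exact LKL-CD formula is defined later in the paper; as a rough, hypothetical sketch of the underlying idea of measuring "centralization", one can histogram the statistics produced by a coordinate Ψ and compute their KL divergence from a single-peak (Gaussian) reference fitted to the same statistics. The function name and the Gaussian reference below are our illustrative choices, not the paper's definition:

```python
import numpy as np

def centralization_degree(stats, bins=64, eps=1e-12):
    """KL(empirical histogram || fitted single-peak Gaussian).

    Lower values mean the statistics concentrate around one center,
    i.e. the kind of distribution the paper calls "Fabulous". This is
    an illustrative stand-in, not the paper's LKL-CD formula.
    """
    hist, edges = np.histogram(stats, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    p = hist * width + eps                      # empirical bin probabilities
    mu, sigma = stats.mean(), stats.std() + eps
    q = np.exp(-0.5 * ((centers - mu) / sigma) ** 2)
    q = q / q.sum() + eps                       # fitted Gaussian reference
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
centralized = rng.normal(0.0, 1.0, 50_000)                  # one sharp center
bimodal = np.concatenate([rng.normal(-3, 0.5, 25_000),
                          rng.normal(+3, 0.5, 25_000)])     # two centers
c_central = centralization_degree(centralized)
c_bimodal = centralization_degree(bimodal)
```

A single-peaked sample scores much lower than a bimodal one, so minimizing such a degree (or maximizing centralization) can serve as the objective-function role described above.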
Our results show that we can prune VGG-16-bn (CIFAR-10), ResNet-56 (CIFAR-10), and ResNet-101 (ImageNet) down to about 50%/30%/50% of their weights² without accuracy degradation, which beats human-tuned pruning pipelines such as FPGM (He et al. (2018b)) and RL-MCTS (Wang & Li (2022)). This reduction in weights also brings faster inference: on CIFAR-10, we achieve up to 3× acceleration of ResNet-50 at the same accuracy as the original model. The boost in inference speed and the reduction in memory footprint make high-accuracy models feasible on edge devices.
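The three components summarized above (➊ building blocks, ➋ objective function, ➌ heuristic search) can be wired together as in the following sketch. This is our illustration, not the paper's implementation: the function names are hypothetical, and the variance-based objective is a runnable placeholder for the real prune-and-retrain evaluation:

```python
import math
import random

import numpy as np

# ➊ Basic building blocks of pruning criteria (hypothetical examples).
def l1_norm(filters):
    # Score each filter by the L1-norm of its weights (cf. Li et al., 2016).
    return np.abs(filters).sum(axis=tuple(range(1, filters.ndim)))

def geometric_median_distance(filters):
    # Score each filter by its total distance to all other filters
    # (the idea behind FPGM, He et al., 2018b).
    flat = filters.reshape(len(filters), -1)
    return np.linalg.norm(flat[:, None] - flat[None, :], axis=-1).sum(axis=1)

BLOCKS = [l1_norm, geometric_median_distance]

# ➋ Objective: score a candidate metric (a weighted mix of blocks).
# A real objective would prune + retrain the model; we use a stand-in
# quality measure so the sketch stays self-contained.
def objective(mix_weights, filters):
    scores = sum(w * b(filters) for w, b in zip(mix_weights, BLOCKS))
    return float(np.var(scores))  # placeholder, NOT the paper's LKL-CD

# ➌ Heuristic search: simulated annealing over the mixing weights.
def search(filters, steps=200, temp=1.0, cooling=0.98, seed=0):
    rng = random.Random(seed)
    cur = [rng.random() for _ in BLOCKS]
    cur_score = objective(cur, filters)
    for _ in range(steps):
        cand = [max(0.0, w + rng.gauss(0, 0.1)) for w in cur]
        cand_score = objective(cand, filters)
        # Accept improvements; accept worse moves with Boltzmann probability.
        if cand_score > cur_score or \
           rng.random() < math.exp((cand_score - cur_score) / temp):
            cur, cur_score = cand, cand_score
        temp *= cooling
    return cur, cur_score

filters = np.random.default_rng(0).normal(size=(16, 3, 3, 3))
best_mix, best_score = search(filters)
```

The Crossword Puzzle's contribution sits in component ➋: replacing the placeholder objective with the LKL-CD measure of how "Fabulous" the resulting weight distribution is.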

2. RELATED WORKS

In this section, we briefly review related works to justify the novelty of ours. We first classify prior works by pruning criterion into magnitude-based and impact-based pruning. We then give an overview of works that utilize distribution analysis. Finally, we justify the novelty of our method.

2.1. MAGNITUDE-BASED PRUNING

We refer to magnitude-based pruning as the network-slimming approaches that rank the importance of a neural network's weights by the L1/L2-norm or absolute value of the network's parameters, feature maps, filters, or layers (either locally or globally). Though the rationale behind them is simply intuitive, these methods usually achieve outstanding pruning results with an easy-to-operate pipeline, and they extend to different types of neural networks, like the Multi-layer Perceptron (Hertz et al. (1991)), the Convolutional Neural Network (CNN) (Han et al. (2015b)), and the Transformer (Mao et al. (2021)).
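As a minimal sketch of the L1-norm flavor of this family (in the spirit of Li et al. (2016), not their exact procedure; the function name is ours), one can rank a convolution layer's filters by L1-norm and zero out the smallest fraction:

```python
import numpy as np

def prune_filters_l1(conv_weight, prune_ratio):
    """Zero out the filters with the smallest L1-norms.

    conv_weight: array of shape (out_channels, in_channels, kh, kw).
    Returns the pruned weights and the indices of the removed filters.
    """
    norms = np.abs(conv_weight).sum(axis=(1, 2, 3))  # one L1-norm per filter
    n_prune = int(len(norms) * prune_ratio)
    drop = np.argsort(norms)[:n_prune]               # smallest-norm filters
    pruned = conv_weight.copy()
    pruned[drop] = 0.0
    return pruned, drop

w = np.random.default_rng(0).normal(size=(64, 3, 3, 3))
pruned, dropped = prune_filters_l1(w, prune_ratio=0.5)
```

In a real pipeline, the zeroed filters (and the corresponding input channels of the next layer) would be physically removed, and the network retrained to recover accuracy.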

2.2. IMPACT-BASED PRUNING

We refer to impact-based pruning as methods that eliminate weights while minimizing the impact on the model. The ancestors of this line of work (LeCun et al. (1989); Hassibi et al. (1993)) aim to find a criterion different from the magnitude-based ones, with a possibly more principled theoretical explanation. OBD (LeCun et al. (1989)) and OBS (Hassibi et al. (1993)) use second-order (Hessian-based) information to estimate the impact of removing each weight on the loss.
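OBD's saliency can be sketched as follows (our illustrative reimplementation under OBD's diagonal-Hessian assumption, not the original code): zeroing weight w_i is estimated to increase the loss by s_i = ½ H_ii w_i², so the weights with the smallest saliency are pruned first.

```python
import numpy as np

def obd_saliency(weights, hessian_diag):
    # OBD (LeCun et al., 1989): second-order estimate of the loss increase
    # caused by zeroing each weight, s_i = 0.5 * H_ii * w_i^2.
    return 0.5 * hessian_diag * weights ** 2

# Toy example: for the quadratic loss L(w) = 0.5 * sum(h_i * w_i^2),
# the Hessian diagonal is exactly h, so the saliency estimate is exact.
h = np.array([10.0, 1.0, 0.1])
w = np.array([0.3, 0.3, 3.0])
s = obd_saliency(w, h)          # [0.45, 0.045, 0.45]
order = np.argsort(s)           # prune index 1 first (smallest saliency)
```

Note how this differs from magnitude-based pruning: the third weight is large (3.0) but sits in a flat direction of the loss, so its estimated impact equals that of the much smaller first weight.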



¹ We refer to a coordinate as highly-qualified if we can use it to prune a neural network model with (almost) no accuracy drop under a relatively high pruning rate.
² Since the discrepancy between weight reduction and FLOPs decrease is usually small, we only showcase the FLOPs decrease in our experimental results (Section 5).



For CNNs, Han et al. (2015a)'s Deep Compression has intrigued many follow-up works in this direction (e.g. Li et al. (2016); Gordon et al. (2020); Elesedy et al. (2020); Tanaka et al. (2020); Liu et al. (2021)). More recently, the Lottery Ticket Hypothesis (Frankle & Carbin (2019)) shares some similarities with this line of work: it conjectures that a dense, randomly-initialized neural network contains a subnetwork that can be trained in isolation to a similar level of testing accuracy (under the same number of training iterations). A substantial amount of effort has gone into extending this hypothesis (e.g. Zhou et al. (2019); Ramanujan et al. (2020); Malach et al. (2020); Pensia et al. (2020); Orseau et al. (2020); Qian & Klabjan (2021); Chijiwa et al. (2021a); da Cunha et al. (2022)).

