AN EMPIRICAL STUDY OF A PRUNING MECHANISM

Anonymous authors
Paper under double-blind review

Abstract

Many methods aim to prune neural networks to the maximum extent, yet few studies investigate the pruning mechanism itself. In this work, we empirically investigate a standard framework for network pruning: pretraining a large network and then pruning and retraining it. The framework has been commonly used based on a heuristic, i.e., finding a good minimum with a large network (pretraining phase) and retaining it with careful pruning and retraining (pruning-and-retraining phase). For the pretraining phase, we examine why a large network is required to achieve good performance. We hypothesize that this stems from the network relying on only a portion of its weights when trained from scratch; we refer to this pattern of weight utilization as utility imbalance. We propose measures for weight utility and utility imbalance, and investigate the cause of the utility imbalance and the characteristics of the weight utility. For the pruning-and-retraining phase, we examine whether the pruned-and-retrained network benefits from the pretrained network. We visualize the accuracy surface of the pretrained, pruned, and retrained networks and investigate the relation among them. The validation accuracy is also interpreted in association with this surface.

1. INTRODUCTION

Deep learning is currently one of the most powerful machine learning methods. It relies on neural networks that typically contain a few to hundreds of times more weights than training examples (He et al., 2016; Zagoruyko & Komodakis, 2016; Huang et al., 2017; Simonyan & Zisserman, 2015). In common regimes, a greater number of weights leads to better performance (Zagoruyko & Komodakis, 2016). Paradoxically, however, neural networks are also compressible. Many recent pruning methods aim to compress networks maximally (Han et al., 2015; Liu et al., 2017; He et al., 2019; You et al., 2019), yet few works investigate why and how the pruning mechanism works (Frankle et al., 2019; Elesedy et al., 2020).

In this work, we empirically investigate a standard framework for network pruning: pretraining a large network and then pruning and retraining it. The framework has been commonly used based on a heuristic, i.e., finding a good minimum with a larger network and retaining it with careful pruning and retraining (Han et al., 2015; Liu et al., 2017). We investigate the heuristic in two parts: one for the pretraining phase and the other for the pruning-and-retraining phase.

For the pretraining phase, we investigate why the large network must be trained to obtain a good minimum. Since neural networks are generally compressible, the pretrained large network can be pruned to a smaller one. However, a network with the same number of weights as the pruned network cannot achieve similar performance when trained from scratch (Frankle & Carbin, 2018). We conjecture that this is because networks do not utilize all of their weights. Thus we hypothesize: if trained from scratch, there is a utility imbalance among the weights of a neural network. We propose measures for the weight utility and the utility imbalance, and then examine the cause of the utility imbalance and the characteristics of the weight utility under various conditions.

For the pruning-and-retraining phase, we verify the heuristic that once a good minimum is obtained with the large network, it can be retained by careful pruning and retraining (Han et al., 2015; Renda et al., 2020). Our investigation is based on visualizing the loss surface on a two-dimensional plane spanned by three points in the weight space, where the points represent the pretrained network, the pruned network, and the pruned-and-retrained network (a construction sketched at the end of this introduction). We examine (1) the dynamics of the network on the loss surface throughout pruning and (2) the validation accuracy of the networks over varying pruning and retraining methods.

Contributions.
• The utility imbalance among the weights increases during optimization.
• Neural networks utilize their weights in proportion to their size.
• If a pretrained network is carefully pruned and retrained, the pruned-and-retrained network shares the same loss basin with the pretrained network.
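The two-dimensional plane mentioned above can be built from the three weight vectors in the standard way. Below is a minimal NumPy sketch of this construction, assuming each network's weights have been flattened into a single vector; the function names (`plane_basis`, `point_on_plane`) are ours, not the paper's.

```python
import numpy as np

def plane_basis(theta_pre, theta_pruned, theta_retrained):
    """Orthonormal basis (e1, e2) of the plane through the three weight
    vectors: pretrained, pruned, and pruned-and-retrained."""
    u = theta_pruned - theta_pre
    v = theta_retrained - theta_pre
    v = v - (v @ u) / (u @ u) * u        # Gram-Schmidt: make v orthogonal to u
    return u / np.linalg.norm(u), v / np.linalg.norm(v)

def point_on_plane(theta_pre, e1, e2, alpha, beta):
    """Weight vector at plane coordinates (alpha, beta); evaluating the
    network's loss or accuracy over a grid of (alpha, beta) values yields
    the surface to be visualized."""
    return theta_pre + alpha * e1 + beta * e2
```

Under this parameterization the pretrained network sits at the origin, the pruned network on the first axis, and the pruned-and-retrained network at its projection onto the plane, so all three points lie exactly on the visualized surface.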

2. WEIGHT UTILITY ANALYSIS FOR THE PRETRAINING MECHANISM

Why, then, do we have to train a large network and prune it to a smaller one? Why not just train the smaller one to reach the performance we need? Why is that difficult? Our investigation of these questions starts with a hypothesis. Let $N_{\text{large}}$ be a large network that does not utilize all of its weights, and thus can be easily compressed into a smaller network $N_{\text{pruned}}$ with minimal change in loss. Let $N_{\text{small}}$ be a network trained from scratch whose number of weights is comparable to that of $N_{\text{pruned}}$, which should be sufficient to achieve a level of loss similar to those of $N_{\text{large}}$ or $N_{\text{pruned}}$. However, $N_{\text{small}}$ generally performs worse, and we conjecture that this is because $N_{\text{small}}$ does not utilize all of its weights either. We therefore hypothesize that, in general, a neural network does not utilize all of its weights when trained from scratch, and we refer to the phenomenon in which the network utilizes its weights unevenly as utility imbalance. Thus,

Main Hypothesis. If trained from scratch, there is utility imbalance among the weights in a neural network.

We empirically measure the utility of weights as:

Definition 1 (Utility measure). Let $W$ be the set of all weights in a network $N$, $W_s$ be a subset of $W$, and $X$ be a dataset. Suppose $f_W(x)$ and $f_{W \setminus W_s}(x)$ are the probability mass functions produced by the softmax layer, where $x \sim X$ is an input and $f_{W \setminus W_s}(x)$ is obtained by zeroing out the weights in $W_s$. Then, the utility of $W_s$ is measured as
$$U(W_s) = \mathbb{E}_{x \sim X}\!\left[\, d_{\mathrm{KL}}\!\left(f_W(x),\, f_{W \setminus W_s}(x)\right) \right],$$
where $d_{\mathrm{KL}}$ denotes the KL-divergence. For reference, similar ablation-based measurements of networks have been made in (Casper et al., 2019; 2020; Meyes et al., 2019; Cheney et al., 2017). We also define the utility imbalance as:
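As a concrete illustration of the utility measure in Definition 1, below is a minimal PyTorch sketch; the helper name `utility` and the `weight_mask` interface are our assumptions for illustration, not the paper's implementation.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def utility(model, weight_mask, loader, device="cpu"):
    r"""Estimate U(W_s) = E_{x~X}[ d_KL( f_W(x), f_{W \ W_s}(x) ) ].

    `weight_mask` maps parameter names to boolean tensors marking the
    subset W_s to be zeroed out."""
    ablated = copy.deepcopy(model)
    for name, param in ablated.named_parameters():
        if name in weight_mask:
            param[weight_mask[name]] = 0.0        # zero out the weights in W_s
    model.eval(); ablated.eval()
    total, n = 0.0, 0
    for x, _ in loader:
        x = x.to(device)
        log_p = F.log_softmax(model(x), dim=1)    # log f_W(x)
        log_q = F.log_softmax(ablated(x), dim=1)  # log f_{W \ W_s}(x)
        # KL(f_W || f_{W \ W_s}), averaged over the batch
        kl = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
        total += kl.item() * x.size(0)
        n += x.size(0)
    return total / n
```

Note that `F.kl_div(input, target)` computes KL(target || input), so passing the ablated network's log-probabilities as `input` yields the divergence in the order Definition 1 specifies.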



Figure 1: Examples of the imbalance in weight utilization. (left) We trained a network following the procedure for the 1× network in Section 2.2.1. Weights were randomly ablated from the network and the training accuracy was measured 500 times. The average accuracy is plotted, with the minimum and maximum values indicated by the bars. Note that the accuracy difference is about 60% when the ablation ratio is 0.1, which implies a utility imbalance among the weights. (right) Training accuracy with respect to ablation by weight magnitude. Each legend entry indicates the ablation ratio. Note that the training accuracy drops only when the weights with large magnitudes are ablated; the utility of a weight is biased with regard to its magnitude.
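For concreteness, here is a minimal PyTorch sketch of the two ablation procedures the caption describes, under our own assumptions about the setup; the function names are ours. The random variant would be repeated (e.g., 500 times per ratio, as in the figure) to obtain the min/mean/max curves.

```python
import copy
import torch

@torch.no_grad()
def random_ablation(model, ratio):
    """Copy `model` and zero out each weight independently with prob. `ratio`."""
    ablated = copy.deepcopy(model)
    for p in ablated.parameters():
        p[torch.rand_like(p) < ratio] = 0.0
    return ablated

@torch.no_grad()
def magnitude_ablation(model, ratio, largest=True):
    """Copy `model` and zero out the `ratio` fraction of weights with the
    largest (or smallest) magnitudes, using one global threshold."""
    ablated = copy.deepcopy(model)
    flat = torch.cat([p.abs().flatten() for p in ablated.parameters()])
    k = max(1, int(ratio * flat.numel()))
    if largest:
        # k-th largest magnitude == (n - k + 1)-th smallest
        thresh = torch.kthvalue(flat, flat.numel() - k + 1).values
        for p in ablated.parameters():
            p[p.abs() >= thresh] = 0.0
    else:
        thresh = torch.kthvalue(flat, k).values
        for p in ablated.parameters():
            p[p.abs() <= thresh] = 0.0
    return ablated
```

Measuring the training accuracy of `random_ablation(model, r)` across repeated draws, and of `magnitude_ablation(model, r)` for each ratio `r`, reproduces the two kinds of curves shown in Figure 1.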

