TVSPRUNE - PRUNING NON-DISCRIMINATIVE FILTERS VIA TOTAL VARIATION SEPARABILITY OF INTERMEDIATE REPRESENTATIONS WITHOUT FINE-TUNING

Abstract

Achieving structured, data-free sparsity of deep neural networks (DNNs) remains an open area of research. In this work, we address the challenge of pruning filters without access to the original training set or loss function. We propose the discriminative filters hypothesis: well-trained models possess discriminative filters, and any non-discriminative filters can be pruned without impacting the predictive performance of the classifier. Based on this hypothesis, we propose a new paradigm for pruning neural networks, distributional pruning, wherein we only require access to the distributions that generated the original datasets. We formalise and quantify the discriminating ability of filters through the total variation (TV) distance between the class-conditional distributions of the filter outputs. We present empirical results that, using this definition of discriminability, support our hypothesis on a variety of datasets and architectures. Next, we define the LDIFF score, a heuristic that quantifies the extent to which a layer possesses a mixture of discriminative and non-discriminative filters. We empirically demonstrate that the LDIFF score is indicative of the performance of random pruning for a given layer, and thereby indicates the extent to which a layer may be pruned. Our main contribution is a novel one-shot pruning algorithm, TVSPrune, that identifies non-discriminative filters for pruning. We extend this algorithm to IterTVSPrune, wherein we apply TVSPrune iteratively, enabling us to achieve greater sparsity. Last, we demonstrate the efficacy of TVSPrune on a variety of datasets, and show that in some cases we can prune up to 60% of parameters with only a 2% loss of accuracy and no fine-tuning, beating the nearest baseline by almost 10%. Our code is publicly available.

1. INTRODUCTION

Deep neural networks are, in general, highly overparameterized, leading to significant implementation challenges in terms of inference time, power consumption, and memory footprint. This is especially crucial for deployment on real-world, resource-constrained devices (Molchanov et al., 2019b; Prakash et al., 2019). A variety of solutions have been proposed to address this problem, which can broadly be grouped into quantization, sparsification or pruning, knowledge distillation, and neural architecture search (NAS) (Hoefler et al., 2021). Pruning can be further divided into unstructured pruning, wherein individual parameters are set to zero, and structured pruning, wherein entire filters or channels are removed from the architecture (Hoefler et al., 2021). Structured pruning yields immediate improvements in inference time, memory footprint, and power consumption, without requiring any specialized software frameworks. Unstructured pruning, on the other hand, typically yields models that are significantly sparser than those obtained with structured pruning, but does not provide the same improvements in inference time without specialized sparse linear algebra implementations (Hoefler et al., 2021; Blalock et al., 2020). In this work, we consider the problem of pruning CNNs without access to the training set or loss function. Data-free pruning is an important problem due to concerns such as privacy and security (Yin et al., 2020), as well as the cost of retraining models (Tanaka et al., 2020; Hoefler et al., 2021). We offer a new perspective on this problem, which we call distributional pruning, wherein we have no access to the training data or the loss function, but do have access to the data distribution, either through its moments or through additional samples separate from the training set. To facilitate distributional pruning, two crucial questions need to be answered.
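To make the contrast with unstructured pruning concrete, the following sketch (our own illustration, not code from this work) shows why structured pruning benefits dense libraries immediately: removing a filter deletes an entire output channel of one layer, together with the matching input channels of the next layer, so the weight tensors themselves shrink.

```python
import numpy as np

# Toy weights for two consecutive conv layers, laid out (out_ch, in_ch, kH, kW).
rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 3, 3, 3))   # layer 1: 8 filters over RGB input
w2 = rng.standard_normal((16, 8, 3, 3))  # layer 2 consumes layer 1's 8 channels

# Structured pruning: drop whole filters (here, filters 2 and 5 for illustration),
# then delete the corresponding input channels of the next layer.
keep = np.array([i for i in range(w1.shape[0]) if i not in {2, 5}])
w1_pruned = w1[keep]      # shape (6, 3, 3, 3): two filters gone
w2_pruned = w2[:, keep]   # shape (16, 6, 3, 3): matching input channels gone

print(w1_pruned.shape, w2_pruned.shape)
```

Unstructured pruning would instead zero individual entries of `w1` and `w2`, leaving the shapes (and hence the dense FLOP count) unchanged.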
First, what makes a filter valuable to the classification performance of the model? Second, how do we characterize which layers can be effectively sparsified? To answer these questions, we first identify discriminative filters: filters whose class-conditional outputs are well separated in terms of the total variation distance. We propose the discriminative filters hypothesis, which states that well-trained models possess a mix of discriminative and non-discriminative filters, and that discriminative filters are useful for the generalization of the classifier. Under this hypothesis, discriminative filters are useful for classification whereas non-discriminative filters are not, thus allowing us to prune the latter. Furthermore, layers that possess a mix of discriminative and non-discriminative filters can be effectively pruned, thereby providing a method to identify difficult-to-prune layers. We formally state our contributions below.

1. We begin by proposing a quantitative measure of the discriminative ability of filters in terms of the TV distance between their class-conditional outputs. Specifically, we say a filter is TV-separable if the minimum pairwise TV distance between the class-conditional distributions of its outputs is larger than a given threshold. If the class-conditional distributions are Gaussian, a common assumption as noted in Wong et al. (2021); Wen et al. (2016), we can compute a Hellinger-distance-based lower bound to estimate whether filters are TV-separable using easily computed class-conditional moments. We describe this in section 4.

2. We present the empirical observation that the class-wise outputs of at least some filters in CNNs that generalize well are TV-separable, and that untrained models, or models that generalize poorly, do not possess discriminative filters; these results are presented in section 7.
Based on these observations, in section 2 we propose the discriminative filters hypothesis, which states that well-trained convolutional neural networks possess a mix of discriminative and non-discriminative filters, and that discriminative filters are useful for classification whereas non-discriminative filters are not. We use this hypothesis to motivate a distributional approach to pruning.

3. Based on the discriminative filters hypothesis, we use TV separation to identify which filters to prune in a model. We assume the class-conditional distributions are Gaussian and, using the Hellinger lower bounds discussed in section 4, we compute lower bounds on the TV separation of each filter. We identify important filters (those that cannot be pruned) as the filters whose Hellinger lower bound on the TV separation exceeds a separation threshold; filters that are not discriminative with respect to this threshold can be pruned.

4. As noted in Hoefler et al. (2021); Liebenwein et al. (2019), some layers are more difficult to prune than others. We address the problem of identifying which layers can be effectively pruned using TV-separability. Under the discriminative filters hypothesis, a layer can be effectively pruned if it possesses a mixture of discriminative and non-discriminative filters. Thus, in section 5, we propose an informative heuristic, which we call the LDIFF score, that quantifies the extent to which a layer possesses a mixture of discriminative and non-discriminative filters. We empirically validate this heuristic in section 7.

5. We use TV-separability and LDIFF scores to develop TVSPRUNE, a layer-wise, threshold-based method for structured pruning that requires no fine-tuning and only the class-conditional moments of the outputs of each filter. We also extend this algorithm to an iterative variant, ITERTVSPRUNE, which enables superior sparsification of the model. We formally state these algorithms in section 6, in Algorithms 1 and 2.
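The Hellinger lower bound in contribution 1 can be sketched concretely. For two distributions, the squared Hellinger distance lower-bounds the TV distance, H²(P, Q) ≤ TV(P, Q), and for Gaussians H² has a closed form in the means and variances. The following is a minimal sketch assuming one-dimensional Gaussian class-conditional outputs; the function names are our own, not the paper's.

```python
import numpy as np

def hellinger_sq_gauss(mu1, var1, mu2, var2):
    """Squared Hellinger distance between two 1-D Gaussians.

    H^2 = 1 - BC, where BC is the Bhattacharyya coefficient
    BC = sqrt(2*s1*s2 / (s1^2 + s2^2)) * exp(-(mu1 - mu2)^2 / (4*(s1^2 + s2^2))).
    """
    bc = np.sqrt(2.0 * np.sqrt(var1 * var2) / (var1 + var2)) \
        * np.exp(-((mu1 - mu2) ** 2) / (4.0 * (var1 + var2)))
    return 1.0 - bc

def tv_lower_bound(mu1, var1, mu2, var2):
    # H^2(P, Q) <= TV(P, Q): a cheap lower bound on the TV distance
    # computable from class-conditional moments alone.
    return hellinger_sq_gauss(mu1, var1, mu2, var2)

# Identical class-conditionals: the bound is 0 (no separation).
print(tv_lower_bound(0.0, 1.0, 0.0, 1.0))   # 0.0
# Well-separated means: the bound approaches 1.
print(tv_lower_bound(0.0, 1.0, 10.0, 1.0))
```

A filter would then be declared TV-separable when the minimum of this bound over all class pairs exceeds the chosen threshold.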
We show that on the CIFAR-10 dataset, our method achieves over 40% sparsification with minimal reduction in accuracy on VGG models without any fine-tuning; furthermore, our method outperforms contemporary methods such as those of Sui et al. (2021) and Molchanov et al. (2019b) in this regime.
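The one-shot selection step behind TVSPRUNE can be sketched as follows. This is our own illustrative reconstruction, not the paper's Algorithm 1: we assume per-filter class-conditional moments have already been estimated, score each filter by its minimum pairwise Hellinger lower bound on the TV distance, and keep only filters whose score clears the separation threshold.

```python
import numpy as np
from itertools import combinations

def hellinger_sq_gauss(mu1, var1, mu2, var2):
    """Squared Hellinger distance between 1-D Gaussians (lower-bounds TV)."""
    bc = np.sqrt(2.0 * np.sqrt(var1 * var2) / (var1 + var2)) \
        * np.exp(-((mu1 - mu2) ** 2) / (4.0 * (var1 + var2)))
    return 1.0 - bc

def filter_separability(class_means, class_vars):
    """Per-filter minimum pairwise TV lower bound over all class pairs.

    class_means, class_vars: arrays of shape (n_classes, n_filters) holding
    the class-conditional moments of each filter's (pooled) output.
    """
    n_classes, n_filters = class_means.shape
    score = np.full(n_filters, np.inf)
    for i, j in combinations(range(n_classes), 2):
        h2 = hellinger_sq_gauss(class_means[i], class_vars[i],
                                class_means[j], class_vars[j])
        score = np.minimum(score, h2)
    return score

def tvs_prune_mask(class_means, class_vars, tau):
    """One-shot keep-mask: True for filters whose TV lower bound exceeds tau."""
    return filter_separability(class_means, class_vars) > tau

# Two classes, three filters: filter 0 is discriminative, filters 1-2 are not.
means = np.array([[0.0, 0.5, 1.0],
                  [8.0, 0.5, 1.1]])
vars_ = np.ones_like(means)
print(tvs_prune_mask(means, vars_, tau=0.3))  # [ True False False]
```

The iterative variant would repeat this loop: prune, re-estimate the moments on the pruned model, and prune again until no filter falls below the threshold or a sparsity budget is met.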




