WHAT TO PRUNE AND WHAT NOT TO PRUNE AT INITIALIZATION

Abstract

Post-training dropout based approaches achieve high sparsity and are well-established means of addressing computational cost and overfitting in neural network architectures Srivastava et al. (2014), Pan et al. (2016), Zhu & Gupta (2017), LeCun et al. (1990). In contrast, pruning at initialization still lags far behind Frankle et al. (2020). Initialization pruning is more efficacious when it comes to scaling the computational cost of the network. Furthermore, it handles overfitting just as well as post-training dropout, and it avoids retraining losses. For these reasons, this paper presents two approaches to prune at initialization. The goal is to achieve higher sparsity while preserving performance. 1) K-starts begins with k random p-sparse matrices at initialization. Over the first few epochs, the network determines the "fittest" of these p-sparse matrices in an attempt to find the "lottery ticket" Frankle & Carbin (2018) p-sparse network. The approach is adopted from how evolutionary algorithms find the best individual. Depending on the neural network architecture, the fitness criterion can be based on the magnitude of the network weights, the magnitude of gradient accumulation over an epoch, or a combination of both. 2) The dissipating gradients approach aims to eliminate weights that remain within a fraction of their initial value during the first few epochs. Removing weights in this manner, regardless of their magnitude, best preserves the performance of the network; however, this approach also takes the most epochs to achieve higher sparsity. 3) A combination of dissipating gradients and k-starts consistently outperforms either method alone, as well as random dropout.
The benefits of the proposed pre-training pruning approaches are: 1) they require no task-specific knowledge, no fixed dropout threshold, and no regularization parameters; 2) retraining of the model is neither necessary nor does it affect the performance of the p-sparse network. We evaluate the efficacy of these methods on autoencoders and fully connected multilayer perceptrons, using the MNIST and Fashion-MNIST datasets.
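The two pruning criteria above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's exact implementation: the function names, the equal weighting `alpha` between weight and gradient magnitudes, and the threshold `frac` are illustrative choices.

```python
import numpy as np

def random_masks(shape, k, p, rng):
    """Generate k random binary masks, each keeping a (1 - p) fraction of entries."""
    n = int(np.prod(shape))
    keep = int(round((1.0 - p) * n))
    masks = []
    for _ in range(k):
        flat = np.zeros(n)
        flat[rng.choice(n, size=keep, replace=False)] = 1.0
        masks.append(flat.reshape(shape))
    return masks

def fittest_mask(masks, weights, grad_accum, alpha=0.5):
    """Score each mask by masked weight magnitude blended with accumulated
    gradient magnitude, and return the highest-scoring ('fittest') mask."""
    scores = [np.sum(m * (alpha * np.abs(weights) + (1 - alpha) * np.abs(grad_accum)))
              for m in masks]
    return masks[int(np.argmax(scores))]

def dissipating_gradients_mask(w_init, w_now, frac=0.01):
    """Keep only weights that moved at least `frac` of their initial magnitude
    during the first epochs; prune the rest regardless of their magnitude."""
    return (np.abs(w_now - w_init) >= frac * np.abs(w_init)).astype(float)
```

In k-starts, `fittest_mask` would be evaluated after the first couple of epochs over the k candidate masks from `random_masks`; the dissipating-gradients mask is instead recomputed from the weights' drift away from their initial values.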

1. INTRODUCTION

Computational complexity and overfitting in neural networks are well-established problems Frankle & Carbin (2018), Han et al. (2015), LeCun et al. (1990), Denil et al. (2013). We utilize pruning approaches for the following two reasons: 1) to reduce the computational cost of a fully connected neural network, and 2) to reduce overfitting in the network.

Given the large number of post-training pruning approaches Srivastava et al. (2014), Geman et al. (1992), Pan et al. (2016), this paper proposes two pre-training pruning approaches: kstarts and dissipating gradients. Moreover, when isolated from other factors, sparse networks appear to outperform fully connected networks; when not isolated, they perform at least as well up to a degree of sparsity that depends on the number of parameters in the network. Kstarts and dissipating gradients are simple yet effective methods for quickly finding the best sparse networks. The approaches exploit the knowledge that a network has multiple underlying p-sparse networks that perform just as well, and in some cases even better, when contrasted with their fully connected counterparts.

