WHAT TO PRUNE AND WHAT NOT TO PRUNE AT INITIALIZATION

Abstract

Post-training dropout-based approaches achieve high sparsity and are well-established means of addressing computational cost and overfitting in neural network architectures Srivastava et al. (2014), Pan et al. (2016), Zhu & Gupta (2017), LeCun et al. (1990). By contrast, pruning at initialization is still far behind Frankle et al. (2020). Initialization pruning is more efficacious when it comes to scaling the computational cost of the network; furthermore, it handles overfitting just as well as post-training dropout and avoids retraining losses. For these reasons, the paper presents two approaches to prune at initialization, with the goal of achieving higher sparsity while preserving performance. 1) K-starts begins with k random p-sparse matrices at initialization. Over the first few epochs the network determines the "fittest" of these p-sparse matrices, in an attempt to find the "lottery ticket" Frankle & Carbin (2018) p-sparse network. The approach is adapted from how evolutionary algorithms find the best individual. Depending on the neural network architecture, the fitness criterion can be based on the magnitude of the network weights, the magnitude of gradient accumulation over an epoch, or a combination of both. 2) The dissipating-gradients approach eliminates weights that remain within a fraction of their initial value during the first few epochs. Removing weights in this manner, regardless of their magnitude, best preserves the performance of the network; on the other hand, this approach also takes the most epochs to achieve higher sparsity. 3) A combination of dissipating gradients and kstarts consistently outperforms either method alone as well as random dropout.
The benefits of the provided pruning approaches are: 1) they do not require specific knowledge of the classification task, fixing of a dropout threshold, or regularization parameters; 2) retraining of the model is neither necessary nor does it affect the performance of the p-sparse network. We evaluate the efficacy of these methods on autoencoders and fully connected multilayer perceptrons, using the MNIST and Fashion-MNIST datasets.

1. INTRODUCTION

Computational complexity and overfitting in neural networks are well-established problems Frankle & Carbin (2018), Han et al. (2015), LeCun et al. (1990), Denil et al. (2013). We utilize pruning approaches for the following two reasons: 1) to reduce the computational cost of a fully connected neural network; 2) to reduce overfitting in the network. Given the large number of post-training pruning approaches Srivastava et al. (2014), Geman et al. (1992), Pan et al. (2016), this paper proposes two pre-training pruning approaches: kstarts and dissipating gradients. Moreover, it appears that, when isolated from other factors, sparse networks outperform fully connected networks; when not isolated, they perform at least as well up to a percentage of sparsity that depends on the number of parameters in the network. Kstarts and dissipating gradients are simple yet effective methods for quickly finding the best sparse networks. The approaches exploit the knowledge that a network has multiple underlying p-sparse networks that perform just as well, and in some cases even better, than their fully connected counterparts Frankle & Carbin (2018). What percentage of sparsity is realized depends largely on the number of parameters originally present in the network. Such sparse networks are potent in preventing overfitting and reducing computational cost. We use simple, intuitive models that achieve good results and exploit the fact that a number of sub-networks in a neural network have the potential to individually learn the input Srivastava et al. (2014). We decide on a sparse network early on, based on the dropout method, and use only that network for training. This provides an edge in faster computation, quicker elimination of excess weights, and reduced generalization error. The sparsity achieved is superior to random dropout.
Section II gives a general introduction to all the methods, section III defines p-sparsity, section IV provides the algorithm for both approaches, section V describes experimental setup and results, section VI discusses various design choices, section VII gives a general discussion of results, section VIII discusses limitations of the approach and section IX provides conclusions and final remarks.

2.1.1. KSTARTS AND EVOLUTIONARY ALGORITHMS

We take the concept of k random starts from evolutionary algorithms (Vikhar, 2016), which use a fitness function or heuristic to perform "natural selection" in optimization and search problems (Goldberg & Holland, 1988). It is relatively simple to fit genetic algorithms to the problem at hand. Other methods that would be equally effective with slight modification are Hunting Search (Oftadeh et al., 2010), Natural Evolution Strategies (Wierstra et al., 2008), the firefly algorithm (Yang, 2010), etc. The basic components of the algorithm are: (1) Population: the product of the network weights and the sparse matrices. (2) Individual: an instance of the population. (3) Fitness Function: the heuristic chosen for evaluation of the population.
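The selection step these components imply can be sketched as follows. This is a minimal illustration in NumPy, using the magnitude-of-weights fitness criterion that the paper names as one option (accumulated-gradient magnitude, or a mix, would slot in the same way); the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fitness(weights, mask):
    """Magnitude-based fitness: total absolute weight surviving the mask.
    (Accumulated-gradient magnitude, or a combination, are alternatives.)"""
    return np.abs(weights * mask).sum()

def select_fittest(weights, masks):
    """'Natural selection' step: keep the individual (mask) whose
    surviving weights carry the largest total magnitude."""
    scores = [fitness(weights, m) for m in masks]
    return masks[int(np.argmax(scores))]

# Toy example: one layer's weights and K = 5 random 0.5-sparse masks.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
masks = [(rng.random(W.shape) < 0.5).astype(float) for _ in range(5)]
best = select_fittest(W, masks)
```

In practice the fitness scores would be recomputed over the first few epochs as the weights (and gradients) evolve, and only the winning mask would be kept for the remainder of training.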

2.1.2. POPULATION

We first initialize K sparse matrices; a single instance of these K sparse matrices can be seen in equation ??. In every iteration we multiply the model weights W of the network layer in question with every instance of the K sparse matrices. The resulting set of matrices is our population for that iteration. Each iteration is referred to as a new generation.

population = W * K-SparseMatrices (1)

2.1.3. INDIVIDUAL

Each individual, in a population of K instances, is a sparse matrix of size equal to the size of the network weights, W. The number of 0's and 1's in the sparse matrix is determined by the connectivity factor p, which is further described in section 3. A sparse matrix with p ≈ 0.5 will have ≈ 50% 0's and ≈ 50% 1's.
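As a quick check of this property, the sketch below (NumPy, illustrative only) draws a single individual with p = 0.5 and confirms that roughly half its entries are 1's:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5                # connectivity factor
shape = (100, 100)     # illustrative weight-matrix size

# An individual: a binary mask the same shape as W, where each
# connection is kept (1) with probability p and pruned (0) otherwise.
individual = (rng.random(shape) < p).astype(int)

fraction_ones = individual.mean()  # close to p for a large matrix
```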



Post-training pruning has several established approaches, such as adding various regularization schemes to prune the network Louizos et al. (2017), Pan et al. (2016), or using the second derivative or Hessian of the weights for dropout LeCun et al. (1990), Hassibi & Stork (1993). Han et al. (2015), Alford et al. (2019), and Zhu & Gupta (2017) use an efficient iterative pruning method to iteratively increase sparsity. Srivastava et al. (2014) drop random hidden units with probability p, instead of weights, to avoid overfitting in general. Each of these approaches is effective and achieves good sparsity post-training.

