HYPERPRUNING: EFFICIENT PRUNING THROUGH LYAPUNOV METRIC HYPERSEARCH

Abstract

Various pruning methods have been introduced to improve the power and storage efficiency of over-parameterized recurrent neural networks. With the advance and growing variety of pruning methods, a new problem of 'hyperpruning' is becoming apparent: finding a pruning method and hyperparameter configuration suited to a particular task and network. Such a search differs from standard hyperparameter search, in which the accuracy of the optimal configuration is unknown. In the context of network pruning, the accuracy of the non-pruned (dense) model sets a target for the accuracy of the pruned model, and the goal of hyperpruning is to reach or even surpass this target. Efficient strategies for hyperpruning are critical, since a direct search through pruned variants would require time-consuming training with no guarantee of improved performance. To address this problem, we introduce a novel distance based on the Lyapunov Spectrum (LS), which makes it possible to compare pruned variants with the dense model and to estimate, early in training, the accuracy that pruned variants will achieve after extensive training. This ability to predict performance allows us to combine the LS-based distance with Bayesian hyperparameter optimization methods and to propose an efficient, first-of-its-kind hyperpruning approach called LS-based Hyperpruning (LSH), which reduces search time by an order of magnitude compared to a standard full-training search that uses loss (or perplexity) as the accuracy metric. Our experiments on stacked LSTM and RHN language models trained on the Penn Treebank dataset show that, given a budget of training epochs and a desired pruning ratio, LSH obtains better variants than standard loss-based hyperparameter optimization methods. Furthermore, as a result of the search, LSH identifies pruned variants that outperform state-of-the-art pruning methods and surpass the accuracy of the dense model.

1. INTRODUCTION

Over the last decade, the performance of sequence models, i.e., Recurrent Neural Networks (RNN), has been significantly enhanced in various applications such as action recognition (Su et al., 2020), video summarization (Zhao et al., 2018), and voice conversion (Huang et al., 2021). In particular, RNN variants such as LSTM (Hochreiter & Schmidhuber, 1997; Zaremba et al., 2014; Malhotra et al., 2015) and RHN (Zilly et al., 2017) excel in NLP applications ranging from machine translation (Wu et al., 2016) to language modeling (Irie et al., 2019). However, the computational demands inherent to RNN architectures, which scale linearly with input sequence length and quadratically with model size, slow down training and inference. This hinders these models from being deployed on resource-limited devices such as mobile devices. Multiple methods have been proposed to alleviate this problem, including network quantization (Hernández et al., 2020; Han et al., 2015a) and weight sharing (Ullrich et al., 2017). Among these approaches, network pruning is advantageous since it aims to produce a sparse model that requires fewer computational resources. A particularly notable pruning approach is dense-to-sparse, where the network is gradually pruned starting from a non-pruned (dense) model (Han et al., 2015b; Guo et al., 2016). While the inference time of the pruned network eventually decreases, the training time remains similar to, or even longer than, that of training a dense model. Such training typically extends over multiple days or weeks and makes it challenging to obtain pruned models efficiently.

The Dynamic Sparse Training (DST) approach was introduced recently to meet the rising demand for reducing the computational cost of obtaining pruned variants (Bellec et al., 2017). In contrast to dense-to-sparse approaches, DST is a sparse-to-sparse pruning method that starts with a sparse model and maintains a fixed number of non-zero parameters throughout training, improving training speed in addition to inference speed. DST involves three main procedural steps: weight removal, weight growth, and weight redistribution. For each step, a salience criterion (control), such as a magnitude- or gradient-based one, determines how the step is carried out (see the sketch after this paragraph). There are no universal controls applicable to all tasks and networks, as each control has its intended scenario. This unique mapping between scenarios and controls in DST prevents generalizing a standard rule across all scenarios. Therefore, for a given scenario, the controls that characterize the pruning method become additional key hyperparameters that need to be set so that pruning is executed optimally. We term this type of hyperparameter search 'hyperpruning'. Hyperpruning concerns selecting both the pruning methodology with its controls and the other (training-related) hyperparameters for a particular scenario. Specifically, hyperpruning requires searching over methodological and non-methodological hyperparameters: methodological hyperparameters define the pruning method, while non-methodological hyperparameters are independent of the pruning method and apply across methods.
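To make the three DST steps concrete, below is a minimal sketch of one DST update on a single weight matrix, assuming magnitude-based removal and gradient-based growth; the function `dst_step` and its `prune_fraction` parameter are illustrative names rather than part of any specific published method, and the layer-wise weight redistribution step is omitted for brevity.

```python
# Illustrative sketch of one Dynamic Sparse Training update on a single
# weight matrix: magnitude-based removal followed by gradient-based growth,
# keeping the number of non-zero parameters constant. Names are hypothetical.
import numpy as np

def dst_step(weights, grads, mask, prune_fraction=0.1):
    """Remove the smallest-magnitude active weights, then regrow the same
    number of connections at inactive positions with the largest gradients."""
    active = np.flatnonzero(mask)
    n_prune = int(prune_fraction * active.size)

    # Weight removal: drop the active weights with the smallest magnitude.
    flat_w = weights.ravel()
    drop = active[np.argsort(np.abs(flat_w[active]))[:n_prune]]
    mask.ravel()[drop] = 0
    flat_w[drop] = 0.0

    # Weight growth: activate the inactive positions whose gradient magnitude
    # is largest (a gradient-based salience); regrown weights start at zero.
    inactive = np.flatnonzero(mask.ravel() == 0)
    grow = inactive[np.argsort(np.abs(grads.ravel()[inactive]))[-n_prune:]]
    mask.ravel()[grow] = 1
    return weights, mask

rng = np.random.default_rng(0)
mask = (rng.random((8, 8)) < 0.5).astype(np.int64)   # start sparse
w = rng.normal(size=(8, 8)) * mask
g = rng.normal(size=(8, 8))                          # gradients w.r.t. w
w, mask = dst_step(w, g, mask)                       # mask.sum() is unchanged
```

In actual DST methods the choice of salience at each step (magnitude, gradient, momentum, random, etc.) is exactly the kind of methodological hyperparameter that hyperpruning must search over.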
A unique feature of hyperpruning, which typically does not hold for other hyperparameter optimization problems, is the existence of an estimated target accuracy: the optimal accuracy of the non-pruned (dense) counterpart is available and can serve as a loose upper bound for the pruned network, guiding the hyperpruning process. Even with this knowledge, however, searching through all pruning methods and their hyperparameters is time-consuming, since each variant requires extensive training, and there is no guarantee of reaching a more accurate configuration after investigating multiple unsuccessful ones. Hyperparameter optimization algorithms accelerate the search by implementing a distance that can be used either to efficiently evaluate configuration variants or to effectively generate reliable variants (Snoek et al., 2012; Hutter et al., 2011; Bergstra et al., 2011). In particular, such a distance aims to provide an early estimate of the accuracy of a considered configuration without proceeding with full training. This distance also targets to improve
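As a hedged illustration of how such a distance could rank configurations early in training, the sketch below compares the Lyapunov spectrum of each pruned variant to that of the dense reference; `ls_distance` is a hypothetical name, and the spectra are assumed to be already available as 1-D arrays of exponents (see the estimation sketch after Figure 1).

```python
# Minimal sketch of an LS-based distance for early ranking of pruned
# variants, assuming each model's Lyapunov spectrum is given as a 1-D
# array of exponents. Names and values are illustrative, not from the paper.
import numpy as np

def ls_distance(ls_variant, ls_dense):
    """L2 distance between a variant's spectrum and the dense reference,
    with both spectra sorted in descending order for alignment."""
    return np.linalg.norm(np.sort(ls_variant)[::-1] - np.sort(ls_dense)[::-1])

# Early ranking: after a few epochs, prefer the configuration whose spectrum
# is closest to the dense reference, instead of fully training every variant.
dense_ls = np.array([0.1, -0.2, -0.5, -1.0])
candidates = {"variant_a": np.array([0.4, 0.1, -0.1, -0.6]),
              "variant_b": np.array([0.12, -0.18, -0.45, -0.9])}
best = min(candidates, key=lambda k: ls_distance(candidates[k], dense_ls))
print(best)  # variant_b: closest to the dense spectrum
```

Such a proximity score is the kind of early signal that can be plugged into a Bayesian hyperparameter optimizer in place of the fully trained loss.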



Figure 1: (A) Lyapunov Spectrum curves of the dense model (black) and two pruned variants (green and red) at pre-training and Epochs 1, 3, and 5 (left to right). (B) L2 distance of the two variants to the dense reference in LS space over training. (C) Perplexity curves for the two pruned variants over training.
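For context on how spectra like those in panel (A) can be obtained, below is a hedged sketch of the standard QR-based (Benettin-style) estimator of the Lyapunov Spectrum of a recurrent map, demonstrated on a toy vanilla-RNN cell; the paper's models (stacked LSTM, RHN) would instead use the Jacobian of their hidden-state update, and all names here are illustrative.

```python
# Hedged sketch: estimating the k largest Lyapunov exponents of a recurrent
# map h_{t+1} = f(h_t, x_t) by propagating tangent vectors and repeatedly
# re-orthonormalizing them with QR (Benettin-style algorithm). Toy example.
import numpy as np

def lyapunov_spectrum(jacobian_fn, h0, inputs, k):
    """Propagate k orthonormal tangent vectors along the trajectory and
    average log|diag(R)| across QR steps to estimate the k largest
    Lyapunov exponents (per time step)."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(h0.size, k)))
    exps = np.zeros(k)
    h = h0
    for x in inputs:
        J, h = jacobian_fn(h, x)            # state-to-state Jacobian, next state
        Q, R = np.linalg.qr(J @ Q)          # re-orthonormalize tangent basis
        exps += np.log(np.abs(np.diag(R)))  # accumulate per-step log growth
    return np.sort(exps / len(inputs))[::-1]

# Toy vanilla-RNN cell h' = tanh(W h + U x); its Jacobian is diag(1 - h'^2) W.
rng = np.random.default_rng(1)
W = 0.9 * rng.normal(size=(16, 16)) / np.sqrt(16)   # spectral radius ~0.9
U = rng.normal(size=(16, 4))

def rnn_jacobian(h, x):
    h_new = np.tanh(W @ h + U @ x)
    return np.diag(1.0 - h_new**2) @ W, h_new

ls = lyapunov_spectrum(rnn_jacobian, np.zeros(16), rng.normal(size=(200, 4)), k=8)
print(ls)  # descending exponents, i.e., one curve as in Figure 1(A)
```

Applying the same estimator, with the same input sequence, to a dense model and its pruned variants yields comparable spectra; their separation over training is what the L2 distance in panel (B) tracks.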

