HYPERPRUNING: EFFICIENT PRUNING THROUGH LYAPUNOV METRIC HYPERSEARCH

Abstract

Various pruning methods have been introduced for over-parameterized recurrent neural networks to improve efficiency in terms of power consumption and storage. With the growing number and variety of pruning methods, a new problem of 'hyperpruning' is becoming apparent: finding a suitable pruning method, with an optimal hyperparameter configuration, for a particular task and network. This search differs from standard hyperparameter search, in which the accuracy of the optimal configuration is unknown. In the context of network pruning, the accuracy of the non-pruned (dense) model sets the target for the accuracy of the pruned model, so the goal of hyperpruning is to reach or even surpass this target. Efficient hyperpruning strategies are critical, since a direct search through pruned variants would require time-consuming training without any guarantee of improved performance. To address this problem, we introduce a novel distance based on the Lyapunov Spectrum (LS), which makes it possible to compare pruned variants with the dense model and, early in training, to estimate the accuracy that pruned variants will achieve after extensive training. This ability to predict performance allows us to combine the LS-based distance with Bayesian hyperparameter optimization methods and to propose an efficient, first-of-its-kind hyperpruning approach called LS-based Hyperpruning (LSH), which reduces search time by an order of magnitude compared to standard full-training search with the loss (or perplexity) as the accuracy metric. Our experiments on stacked LSTM and RHN language models trained on the Penn Treebank dataset show that, for a given budget of training epochs and a desired pruning ratio, LSH obtains better variants than standard loss-based hyperparameter optimization methods. Furthermore, as a result of the search, LSH identifies pruned variants that outperform state-of-the-art pruning methods and surpass the accuracy of the dense model.

1. INTRODUCTION

Over the last decade, the performance of sequence models, i.e., Recurrent Neural Networks (RNNs), has been significantly enhanced in various applications such as action recognition (Su et al., 2020), video summarization (Zhao et al., 2018), and voice conversion (Huang et al., 2021). In particular, RNN variants such as LSTM (Hochreiter & Schmidhuber, 1997; Zaremba et al., 2014; Malhotra et al., 2015) and RHN (Zilly et al., 2017) excel in various NLP applications ranging from machine translation (Wu et al., 2016) to language modeling (Irie et al., 2019). However, the inherent computational demands of RNNs, which scale linearly with input sequence length and quadratically with model size, slow down both training and inference. This hinders the deployment of these models on resource-limited devices such as mobile phones. Multiple methods have been proposed to alleviate this problem, including network quantization (Hernández et al., 2020; Han et al., 2015a) and weight sharing (Ullrich et al., 2017). Among these approaches, network pruning is advantageous since it aims to obtain a sparse model that requires fewer computational resources. A particularly notable pruning approach is dense-to-sparse, where the network is gradually pruned starting from a non-pruned (dense) model (Han et al., 2015b; Guo et al., 2016). While the inference time of the pruned network eventually decreases, the training time remains similar to, or even longer than, that of training a dense model. Such training typically extends over multiple days or weeks and makes obtaining pruned models efficiently a challenge.
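The LS-based distance underlying the hyperpruning search described in the abstract can be illustrated with a minimal sketch: estimate the Lyapunov spectrum of each network's hidden-state dynamics with the standard QR (Benettin-style) algorithm, then compare the spectra of a pruned variant and the dense model. This is an assumption-laden illustration on a vanilla tanh RNN, not the paper's implementation; the function names and the choice of Euclidean distance between spectra are ours.

```python
import numpy as np

def lyapunov_spectrum(W, U, inputs, h0, k):
    """Estimate the k largest Lyapunov exponents of the hidden-state map
    h_{t+1} = tanh(W h_t + U x_t) via QR re-orthonormalization
    (Benettin-style algorithm). Toy sketch, not the paper's code."""
    n = W.shape[0]
    h = h0
    # Orthonormal basis tracking k expanding/contracting directions.
    Q = np.linalg.qr(np.random.randn(n, k))[0]
    log_r = np.zeros(k)
    for x in inputs:
        h = np.tanh(W @ h + U @ x)
        J = (1.0 - h**2)[:, None] * W          # Jacobian of the hidden-state map
        Q, R = np.linalg.qr(J @ Q)             # re-orthonormalize the basis
        log_r += np.log(np.abs(np.diag(R)) + 1e-12)
    # Average log growth rates, sorted in descending order.
    return np.sort(log_r / len(inputs))[::-1]

def ls_distance(spec_a, spec_b):
    """Assumed distance between two Lyapunov spectra: Euclidean norm."""
    return np.linalg.norm(spec_a - spec_b)
```

In a hyperpruning loop, `ls_distance` would be evaluated early in training between each pruned variant and the dense model, and used in place of full-training loss as the score fed to a Bayesian hyperparameter optimizer.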

