SELFISH SPARSE RNN TRAINING

Abstract

Sparse neural networks have been widely applied to reduce the resources required to train and deploy over-parameterized deep neural networks. For inference acceleration, methods that induce sparsity from a pre-trained dense network (dense-to-sparse) work effectively. Recently, dynamic sparse training (DST) has been proposed to train sparse neural networks without pre-training a large, dense network (sparse-to-sparse), so that the training process can also be accelerated. However, previous sparse-to-sparse methods mainly focus on Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), failing to match the performance of dense-to-sparse methods in the Recurrent Neural Network (RNN) setting. In this paper, we propose an approach to train sparse RNNs with a fixed parameter count in a single run, without compromising performance. During training, we allow RNN layers to have a non-uniform redistribution across cell weights for better regularization. Further, we introduce SNT-ASGD, a variant of the averaged stochastic gradient optimizer, which significantly improves the performance of all sparse training methods for RNNs. Using these strategies, we achieve state-of-the-art sparse training results, even better than the dense model results, with various types of RNNs on the Penn TreeBank and WikiText-2 datasets.

1. INTRODUCTION

Recurrent neural networks (RNNs) (Elman, 1990), with a variant of long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), have been highly successful in various fields, including language modeling (Mikolov et al., 2010), machine translation (Kalchbrenner & Blunsom, 2013), and question answering (Hirschman et al., 1999; Wang & Jiang, 2017). As a standard task to evaluate a model's ability to capture long-range context, language modeling has witnessed great progress with RNNs. Mikolov et al. (2010) demonstrated that RNNs perform much better than backoff models for language modeling. Since then, various novel RNN architectures such as Recurrent Highway Networks (RHNs) (Zilly et al., 2017), Pointer Sentinel Mixture Models (Merity et al., 2017), the Neural Cache Model (Grave et al., 2017), Mixture of Softmaxes (AWD-LSTM-MoS) (Yang et al., 2018), and ordered neurons LSTM (ON-LSTM) (Shen et al., 2019), together with effective regularization techniques like variational dropout (Gal & Ghahramani, 2016), weight tying (Inan et al., 2017), and DropConnect (Merity et al., 2018), have been proposed to significantly improve the performance of RNNs.

At the same time, as the performance of deep neural networks (DNNs) improves, the resources required to train and deploy deep models are becoming prohibitively large. To tackle this problem, various dense-to-sparse methods have been developed, including but not limited to pruning (LeCun et al., 1990; Han et al., 2015), Bayesian methods (Louizos et al., 2017a; Molchanov et al., 2017), distillation (Hinton et al., 2015), L1 regularization (Wen et al., 2018), and low-rank decomposition (Jaderberg et al., 2014). Given a pre-trained model, these methods work effectively to accelerate inference. Recently, dynamic sparse training (DST) approaches (Mocanu et al., 2018; Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2020) have been proposed to bring efficiency to both the training phase and the inference phase by dynamically changing the sparse connectivity during training. However, previous approaches are designed mainly for CNNs. For RNNs, the long-term dependencies and the repetitive use of recurrent cells make them more difficult to sparsify (Kalchbrenner et al., 2018; Evci et al., 2020).

More importantly, the state-of-the-art performance achieved by RNNs on language modeling is mainly associated with the optimizer, averaged stochastic gradient descent (ASGD) (Polyak & Juditsky, 1992), which is not compatible with the existing DST approaches. These problems heavily limit the performance of sparse training for RNNs.
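To make the DST idea concrete, the following sketch shows one prune-and-grow update of the kind these methods perform during training. The magnitude-based pruning and random regrowth below follow the general scheme of Mocanu et al. (2018); the function name, the prune fraction, and the toy matrix are illustrative placeholders, not the method proposed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_and_grow(weights, mask, prune_fraction=0.3):
    """One hypothetical DST update: drop the smallest-magnitude active
    weights, then activate the same number of inactive positions, so the
    total parameter count stays fixed throughout training."""
    n_prune = int(prune_fraction * mask.sum())

    # Prune: zero out the n_prune active weights with the smallest magnitude.
    active = np.flatnonzero(mask)
    drop = active[np.argsort(np.abs(weights.flat[active]))[:n_prune]]
    mask.flat[drop] = 0
    weights.flat[drop] = 0.0

    # Grow: re-activate randomly chosen inactive positions
    # (newly grown weights start at zero).
    grow = rng.choice(np.flatnonzero(mask == 0), size=n_prune, replace=False)
    mask.flat[grow] = 1
    return weights, mask

# Toy example: an 8x8 weight matrix at roughly 50% sparsity.
w = rng.normal(size=(8, 8))
m = (rng.random((8, 8)) < 0.5).astype(np.int64)
w *= m
n_before = int(m.sum())
w, m = prune_and_grow(w, m)  # parameter count is unchanged
```

Applied periodically during training, such updates let the sparse connectivity pattern evolve while the memory footprint never exceeds that of the final sparse model.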


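For readers unfamiliar with ASGD, the sketch below illustrates the core averaging mechanism (Polyak-Ruppert averaging): run plain SGD, and after a trigger point return the mean of the subsequent iterates instead of the last one. The fixed trigger step, learning rate, and quadratic objective are illustrative placeholders; the NT-ASGD variant used in practice for language models triggers averaging based on validation performance.

```python
import numpy as np

def asgd(grad, w0, lr=0.1, steps=100, trigger=50):
    """Minimal averaged-SGD sketch: ordinary SGD iterates, with the
    returned solution being the average of the iterates produced after
    the trigger step rather than the final iterate."""
    w = np.asarray(w0, dtype=float)
    avg, n_avg = np.zeros_like(w), 0
    for t in range(steps):
        w = w - lr * grad(w)          # ordinary SGD step
        if t >= trigger:              # averaging has been triggered
            avg, n_avg = avg + w, n_avg + 1
    return avg / n_avg if n_avg else w

# Toy quadratic objective 0.5 * ||w||^2, whose gradient is w itself;
# the averaged iterate lands very close to the minimizer at the origin.
w_avg = asgd(lambda w: w, w0=[3.0, -2.0])
```

The averaged iterate smooths out the noise of individual SGD steps; the incompatibility noted above arises because a weight that is pruned and later regrown would naively inherit a stale running average.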