SELFISH SPARSE RNN TRAINING

Abstract

Sparse neural networks have been widely applied to reduce the resources required to train and deploy over-parameterized deep neural networks. For inference acceleration, methods that induce sparsity from a pre-trained dense network (dense-to-sparse) work effectively. Recently, dynamic sparse training (DST) has been proposed to train sparse neural networks without pre-training a large, dense network (sparse-to-sparse), so that the training process can be accelerated as well. However, previous sparse-to-sparse methods mainly focus on Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), failing to match the performance of dense-to-sparse methods in the Recurrent Neural Network (RNN) setting. In this paper, we propose an approach to train sparse RNNs with a fixed parameter count in a single run, without compromising performance. During training, we allow RNN layers to have a non-uniform redistribution across cell weights for better regularization. Further, we introduce SNT-ASGD, a variant of the averaged stochastic gradient descent optimizer, which significantly improves the performance of all sparse training methods for RNNs. Using these strategies, we achieve state-of-the-art sparse training results, even better than the dense model results, with various types of RNNs on the Penn TreeBank and WikiText-2 datasets.

1. INTRODUCTION

Recurrent neural networks (RNNs) (Elman, 1990), with the long short-term memory (LSTM) variant (Hochreiter & Schmidhuber, 1997), have been highly successful in various fields, including language modeling (Mikolov et al., 2010), machine translation (Kalchbrenner & Blunsom, 2013), question answering (Hirschman et al., 1999; Wang & Jiang, 2017), etc. As a standard task to evaluate a model's ability to capture long-range context, language modeling has witnessed great progress in RNNs. Mikolov et al. (2010) demonstrated that RNNs perform much better than backoff models for language modeling. After that, various novel RNN architectures such as Recurrent Highway Networks (RHNs) (Zilly et al., 2017), Pointer Sentinel Mixture Models (Merity et al., 2017), the Neural Cache Model (Grave et al., 2017), Mixture of Softmaxes (AWD-LSTM-MoS) (Yang et al., 2018), and ordered-neurons LSTM (ON-LSTM) (Shen et al., 2019), along with effective regularization such as variational dropout (Gal & Ghahramani, 2016), weight tying (Inan et al., 2017), and DropConnect (Merity et al., 2018), have been proposed to significantly improve the performance of RNNs. At the same time, as the performance of deep neural networks (DNNs) improves, the resources required to train and deploy deep models are becoming prohibitively large. To tackle this problem, various dense-to-sparse methods have been developed, including but not limited to pruning (LeCun et al., 1990; Han et al., 2015), Bayesian methods (Louizos et al., 2017a; Molchanov et al., 2017), distillation (Hinton et al., 2015), L1 regularization (Wen et al., 2018), and low-rank decomposition (Jaderberg et al., 2014). Given a pre-trained model, these methods work effectively to accelerate inference.
Recently, some dynamic sparse training (DST) approaches (Mocanu et al., 2018; Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2020) have been proposed to bring efficiency to both the training phase and the inference phase by dynamically changing the sparse connectivity during training. However, previous approaches mainly target CNNs. For RNNs, the long-term dependencies and the repetitive use of recurrent cells make them more difficult to sparsify (Kalchbrenner et al., 2018; Evci et al., 2020). More importantly, the state-of-the-art performance achieved by RNNs on language modeling is mainly associated with the averaged stochastic gradient descent (ASGD) optimizer (Polyak & Juditsky, 1992), which is not compatible with the existing DST approaches. The above-mentioned problems heavily limit the performance of sparse training in RNNs. Our main contributions are as follows:

• We present an approach to analyze the evolutionary trajectory of the sparse connectivity optimized by dynamic sparse training from a graph perspective. With this approach, we show that there exist many good structural local optima (sparse sub-networks with equally good performance) in RNNs, which can be found in an efficient and robust manner.

• Our analysis reveals two surprising phenomena in the RNN setting, contrary to CNNs: (1) random-based weight growth performs better than gradient-based weight growth, and (2) a uniform sparse distribution performs better than Erdős-Rényi (ER) sparse initialization. These results highlight the need to choose different sparse training methods for different architectures.
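To make the mechanics concrete, the following sketch (function names and hyperparameters are illustrative, not the paper's actual code) shows one prune-and-grow step of dynamic sparse training with the random-based growth discussed above, together with the Erdős-Rényi layer-density formula of Mocanu et al. (2018) that the uniform distribution is compared against:

```python
import numpy as np

def prune_and_grow(weights, mask, prune_frac=0.3, rng=None):
    """One dynamic sparse training step (illustrative sketch):
    drop the smallest-magnitude active weights, then grow the same
    number of connections at random zero positions, so the total
    parameter count stays fixed throughout training."""
    rng = rng if rng is not None else np.random.default_rng(0)
    active = np.flatnonzero(mask)
    # Candidate positions are taken before pruning, so freshly
    # pruned weights cannot be regrown in the same step.
    inactive = np.flatnonzero(mask == 0)
    k = int(prune_frac * active.size)
    # Prune: remove the k active weights with the smallest magnitude.
    drop = active[np.argsort(np.abs(weights.ravel()[active]))[:k]]
    mask.ravel()[drop] = 0
    weights.ravel()[drop] = 0.0
    # Grow: activate k zero positions chosen uniformly at random
    # (random-based growth; new weights start from zero).
    grow = rng.choice(inactive, size=k, replace=False)
    mask.ravel()[grow] = 1
    return mask

def erdos_renyi_density(n_in, n_out, eps=10.0):
    """Erdos-Renyi layer density (Mocanu et al., 2018): proportional
    to (n_in + n_out) / (n_in * n_out), so larger layers end up
    sparser; eps controls the overall sparsity level."""
    return min(1.0, eps * (n_in + n_out) / (n_in * n_out))

# Toy example: a 4x4 weight matrix at 50% sparsity keeps exactly
# 8 nonzero weights before and after a prune-and-grow step.
rng = np.random.default_rng(42)
w = rng.normal(size=(4, 4))
m = np.zeros(16, dtype=int)
m[rng.choice(16, size=8, replace=False)] = 1
m = m.reshape(4, 4)
m = prune_and_grow(w, m, prune_frac=0.25, rng=rng)
```

The invariant worth noting is that pruning and growth move the same number of connections, which is what makes DST a fixed-budget (sparse-to-sparse) method rather than a pruning schedule.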

2. RELATED WORK

Figure 1: Schematic diagram of Selfish-RNN. W_i, W_f, W_c, W_o refer to the LSTM cell weights. Colored squares and white squares denote nonzero and zero weights, respectively. Light blue squares are weights to be removed and orange squares are weights to be grown.

Dense-to-Sparse. There is a large body of work that operates on a dense network to yield a sparse network. We divide these methods into three categories based on their training cost in terms of memory and computation. (1) Iterative Pruning and Retraining. To the best of our knowledge, pruning was first proposed by Janowsky (1989) and Mozer & Smolensky (1989) to yield a sparse network from a pre-trained network. Recently, Han et al. (2015) brought it back to people's attention based on the idea of iterative pruning and retraining with modern architectures. Some recent works further reduce the number of retraining iterations, e.g., Narang et al. (2017); Zhu & Gupta (2017). Frankle & Carbin (2019) proposed the Lottery Ticket Hypothesis, showing that the sub-networks ("winning tickets") obtained via iterative pruning, combined with their "lucky" initialization, can outperform the dense networks. Zhou et al. (2019) discovered that the sign of their initialization is the crucial factor that

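To make the iterative prune-and-rewind procedure concrete, here is a minimal sketch (function name and toy setup are hypothetical, not taken from the cited works) of one lottery-ticket-style pruning round: keep the largest-magnitude trained weights, then reset the survivors to their original initialization:

```python
import numpy as np

def lottery_ticket_round(w_init, w_trained, mask, keep_frac=0.5):
    """One iterative-pruning round in the lottery-ticket style:
    keep the top keep_frac fraction of currently active weights by
    trained magnitude, then rewind survivors to their init values."""
    active = np.flatnonzero(mask)
    k = int(keep_frac * active.size)
    # Rank active weights by trained magnitude, largest first.
    order = np.argsort(-np.abs(w_trained.ravel()[active]))
    keep = active[order[:k]]
    new_mask = np.zeros_like(mask)
    new_mask.ravel()[keep] = 1
    # Rewind: surviving weights restart from their initialization.
    return w_init * new_mask, new_mask

# Toy round: keep the 3 largest-magnitude trained weights out of 6.
w_init = np.arange(1.0, 7.0).reshape(2, 3)
w_trained = np.array([[0.1, -2.0, 0.3],
                      [4.0, -0.2, 1.5]])
mask = np.ones((2, 3), dtype=int)
w_next, mask = lottery_ticket_round(w_init, w_trained, mask)
```

Repeating this round while retraining between rounds gives the iterative schedule of Frankle & Carbin (2019); the rewind step is what distinguishes it from plain iterative pruning and retraining.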
