EFFICIENT RECURRENT ARCHITECTURES THROUGH ACTIVITY SPARSITY AND SPARSE BACK-PROPAGATION THROUGH TIME

Abstract

Recurrent neural networks (RNNs) are well suited for solving sequence tasks in resource-constrained systems due to their expressivity and low computational requirements. However, there is still a need to bridge the gap between what RNNs are capable of in terms of efficiency and performance and real-world application requirements. The memory and computational requirements arising from propagating the activations of all the neurons at every time step to every connected neuron, together with the sequential dependence of activations, contribute to the inefficiency of training and using RNNs. We propose a solution inspired by biological neuron dynamics that makes the communication between RNN units sparse and discrete. This makes the backward pass with backpropagation through time (BPTT) computationally sparse and efficient as well. We base our model on the gated recurrent unit (GRU), extending it with units that emit discrete events for communication triggered by a threshold so that no information is communicated to other units in the absence of events. We show theoretically that the communication between units, and hence the computation required for both the forward and backward passes, scales with the number of events in the network. Our model achieves efficiency without compromising task performance, demonstrating competitive performance compared to state-of-the-art recurrent network models in real-world tasks, including language modeling. The dynamic activity sparsity mechanism also makes our model well suited for novel energy-efficient neuromorphic hardware.

1. INTRODUCTION

Large-scale models such as GPT-3 (Brown et al., 2020) and DALL-E (Ramesh et al., 2021) have demonstrated that scaling up deep learning models to billions of parameters improves not only their performance but also leads to entirely new forms of generalization. In resource-constrained environments, however, transformers are impractical due to their computational and memory requirements during both training and inference. Recurrent neural networks (RNNs) may provide a viable alternative in such low-resource environments, but require further algorithmic and computational optimizations. While it is unknown whether scaling up recurrent neural networks can lead to similar forms of generalization, the limitations on scaling them up preclude studying this possibility. The dependence of each time step's computation on the previous time step's output prevents easy parallelization of the model computation. Moreover, propagating the activations of all the units in each time step is computationally inefficient and leads to high memory requirements when training with backpropagation through time (BPTT). While enabling extraordinary task performance, the biological brain's recurrent architecture is extremely energy efficient (Mead, 2020). One of the brain's strategies to reach these high levels of efficiency is activity sparsity. In the brain, (asynchronous) event-based and activity-sparse communication results from the properties of the specific physical and biological substrate on which the brain is built. Biologically realistic spiking neural networks and neuromorphic hardware aim to use these principles to build energy-efficient software and hardware models (Roy et al., 2019; Schuman et al., 2017). However, despite progress in recent years, their task performance on real-world tasks has remained limited compared to recurrent architectures based on the LSTM and GRU.
In this work, we propose an activity-sparsity mechanism inspired by biological neuron models to reduce the computation required by RNNs at each time step. Our method adds a mechanism to the recurrent units to emit discrete events for communication, triggered by a threshold, so that no information is communicated to other units in the absence of events. With event-based communication, units in the model can decide when to send updates to other units, which then trigger the update of the receiving units. When events are sent sparingly, this leads to activity sparsity, where most units do not send updates to other units most of the time, yielding substantial computational savings during training and inference. Using a novel method, we formulate the gradient updates of the network to be sparse, extending the computational savings to training time. We show theoretically that, in the continuous-time limit, the time complexity of calculating weight updates is proportional to the number of events in the network. We demonstrate these properties using the Gated Recurrent Unit (GRU) (Cho et al., 2014) as a case study, and call our model the Event-based Gated Recurrent Unit (EGRU). We note, however, that our dynamic activity-sparsity mechanism can be applied to any RNN architecture. In summary, the main contributions of this paper are the following:

1. We introduce a variant of the GRU with an event-generating mechanism, called the EGRU.
2. We show theoretically that, in the continuous-time limit, both the forward-pass computation and the computation of parameter updates in the EGRU scale with the number of events (active units).
3. We demonstrate that the EGRU exhibits task performance competitive with state-of-the-art recurrent network architectures on real-world machine learning benchmarks.
4. We show empirically that the EGRU exhibits high levels of activity sparsity during both inference (forward pass) and learning (backward pass).
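The threshold-triggered event mechanism described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the exact EGRU formulation from the paper: the class name `EventGRUCell`, the threshold parameter `theta`, and the choice to gate the outgoing signal as `y = c * H(c - theta)` (where `H` is the Heaviside step) are illustrative assumptions based on the description. The key point it demonstrates is that a unit whose internal state stays below the threshold contributes exactly zero to the recurrent input of other units.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class EventGRUCell:
    """Sketch of a GRU-like cell with threshold-gated event output.

    Hypothetical simplification: each unit keeps an internal state c,
    but communicates only y = c * H(c - theta), i.e. it emits an event
    (a non-zero output) only when its state exceeds the threshold theta.
    """

    def __init__(self, n_in, n_hidden, theta=0.5, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(n_hidden)
        # stacked weights for update gate z, reset gate r, and candidate u
        self.W = rng.uniform(-s, s, (3 * n_hidden, n_in))
        self.U = rng.uniform(-s, s, (3 * n_hidden, n_hidden))
        self.b = np.zeros(3 * n_hidden)
        self.n_hidden = n_hidden
        self.theta = theta

    def step(self, x, c):
        # event output: zero for every unit whose state is below threshold,
        # so those units send nothing to downstream units this step
        y = c * (c > self.theta)
        h = self.n_hidden
        Wx, Uy = self.W @ x + self.b, self.U @ y
        z = sigmoid(Wx[:h] + Uy[:h])            # update gate
        r = sigmoid(Wx[h:2 * h] + Uy[h:2 * h])  # reset gate
        u = np.tanh(Wx[2 * h:] + r * Uy[2 * h:])  # candidate state
        c_new = z * u + (1.0 - z) * c
        return c_new, y
```

In an efficient implementation, the product `self.U @ y` would touch only the columns of `U` corresponding to units that actually emitted an event, which is where the compute savings of activity sparsity come from.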
We note here that methods for training with parameter sparsity or for improving the handling of long-term dependencies are orthogonal to, and can be combined with, our approach (which we plan to do in future work). Our focus in this paper is exclusively on using activity sparsity to increase the efficiency of RNNs, specifically the GRU. We expect our method to be more efficient than the GRU, but not better at handling long-range dependencies. The sparsity of the backward pass overcomes one of the major roadblocks to using large recurrent models: having enough computational resources to train them. We demonstrate the task performance and activity sparsity of the model implemented in PyTorch, but this formulation will also allow the model to run efficiently on CPU-based nodes when implemented using appropriate software paradigms. Moreover, an implementation on novel neuromorphic hardware geared towards event-based computation, such as Davies et al. (2018) or Höppner et al. (2017), can make the model orders of magnitude more energy efficient (Ostrau et al., 2022).
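One way to see why the backward pass can be made sparse is to look at the gradient of the event nonlinearity itself. The sketch below is an assumption-laden illustration, not the paper's exact pseudo-derivative: it replaces the non-differentiable step function with a piecewise-linear surrogate (the function names `event_output` and `event_output_grad` and the support width `eps` are hypothetical). Because the surrogate is non-zero only in a window around the threshold, units far below threshold receive exactly zero gradient, and their contributions can be skipped entirely during BPTT.

```python
import numpy as np

def event_output(c, theta):
    """Forward pass: y = c * H(c - theta); below-threshold units send nothing."""
    return c * (c > theta)

def event_output_grad(c, theta, eps=0.4):
    """Sketch of a sparse pseudo-derivative of the event nonlinearity.

    Assumption: the step function H gets a piecewise-linear surrogate
    derivative that is non-zero only within eps of the threshold, so
    dy/dc vanishes for units far below threshold -- exactly the units
    that can be skipped in the sparse backward pass.
    """
    surrogate = np.maximum(0.0, 1.0 - np.abs(c - theta) / eps) / eps
    # product rule on y = c * H(c - theta), with H' replaced by the surrogate
    return (c > theta).astype(float) + c * surrogate
```

Gradients therefore flow only through units that are active or close to becoming active, which is what keeps the cost of computing parameter updates proportional to the number of events.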

2. RELATED WORK

Activity sparsity in RNNs has been proposed previously (Neil et al., 2017; 2016; Hunter et al., 2022). Such models can be implemented to compute intermediate updates between events in parallel while remaining activity sparse in their output. Models based on sparse communication (Yan et al., 2022) have recently been proposed for scalability in feedforward networks, using locality-sensitive hashing to dynamically choose downstream units. In contrast to architectures that rely on a centralized routing or gating mechanism (Shazeer et al., 2017), our architecture uses a unit-local decision-making process for dynamic activity sparsity. The cost of computation is also lower in our model compared to Neil et al. (2017).

