

Abstract

Recurrent neural networks are usually trained with backpropagation through time, which requires storing a complete history of network states, and prohibits updating the weights 'online' (after every timestep). Real Time Recurrent Learning (RTRL) eliminates the need for history storage and allows for online weight updates, but does so at the expense of computational costs that are quartic in the state size. This renders RTRL training intractable for all but the smallest networks, even ones that are made highly sparse. We introduce the Sparse n-step Approximation (SnAp) to the RTRL influence matrix. SnAp only tracks the influence of a parameter on hidden units that are reached by the computation graph within n timesteps of the recurrent core. SnAp with n = 1 is no more expensive than backpropagation but allows training on arbitrarily long sequences. We find that it substantially outperforms other RTRL approximations with comparable costs such as Unbiased Online Recurrent Optimization. For highly sparse networks, SnAp with n = 2 remains tractable and can outperform backpropagation through time in terms of learning speed when updates are done online.

1. Introduction

Recurrent neural networks (RNNs) have been successfully applied to a wide range of sequence learning tasks, including text-to-speech (Kalchbrenner et al., 2018), language modeling (Dai et al., 2019), automatic speech recognition (Amodei et al., 2016), translation (Chen et al., 2018), and reinforcement learning (Espeholt et al., 2018). RNNs have greatly benefited from advances in computational hardware, dataset sizes, and model architectures. However, the algorithm used to compute their gradients in almost all practical applications has not changed since the introduction of Back-Propagation Through Time (BPTT). The key limitation of BPTT is that the entire state history must be stored, so its memory cost grows linearly with the sequence length. For sequences too long to fit in memory, as often occur in domains such as language modelling or long reinforcement learning episodes, truncated BPTT (TBPTT) (Williams & Peng, 1990) can be used. Unfortunately, the truncation length used by TBPTT also limits the duration over which temporal structure can be reliably learned.

Forward-mode differentiation, or Real-Time Recurrent Learning (RTRL) as it is called when applied to RNNs (Williams & Zipser, 1989), solves some of these problems. It requires no storage of past network states, can in principle learn dependencies of any length, and can be used to update parameters at any desired frequency, including every step (i.e. fully online). However, its fixed storage requirement is O(k · |θ|), where k is the state size and |θ| is the number of parameters θ in the core. Perhaps even more daunting, the computation it requires is O(k^2 · |θ|). This makes it impractical for even modestly sized networks.

The advantages of RTRL have led to a search for more efficient approximations that retain its desirable properties while reducing its computational and memory costs. One recent line of work introduces unbiased, but noisy, approximations to the influence update.
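To make the costs above concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of the RTRL forward-mode recursion for a vanilla tanh RNN, tracking only the recurrent weights W. The influence matrix J = ∂h_t/∂vec(W) has k · |θ| entries, and the matrix product D @ J in its update costs O(k^2 · |θ|) per step:

```python
import numpy as np

rng = np.random.default_rng(0)
k, a, T = 4, 3, 5                      # state size, input size, sequence length
W = rng.normal(scale=0.5, size=(k, k)) # recurrent weights (the parameters tracked)
U = rng.normal(scale=0.5, size=(k, a)) # input weights (held fixed here)
xs = rng.normal(size=(T, a))           # input sequence

h = np.zeros(k)
J = np.zeros((k, k * k))               # influence matrix dh_t/dvec(W): O(k·|θ|) memory
for x in xs:
    h_new = np.tanh(W @ h + U @ x)
    d = 1.0 - h_new ** 2                    # tanh'(pre-activation)
    D = d[:, None] * W                      # ∂h_t/∂h_{t-1}, shape (k, k)
    I = d[:, None] * np.kron(np.eye(k), h)  # immediate Jacobian ∂h_t/∂vec(W)
    J = D @ J + I                           # the O(k^2·|θ|) step: (k,k) @ (k, k^2)
    h = h_new

# Fully online gradient of an instantaneous loss L_T = 0.5·||h_T||^2,
# available with no stored history of past states.
dL_dh = h
grad_W = (dL_dh @ J).reshape(k, k)
```

Because the full influence matrix is maintained, this online gradient is exact; the quartic scaling in k arises because |θ| is itself O(k^2) for a dense recurrent matrix.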
Unbiased Online Recurrent Optimization (UORO) (Tallec & Ollivier, 2018) is an approximation with the same cost as TBPTT, O(|θ|); however, its gradient estimate is severely noisy (Cooijmans & Martens, 2019) and its performance has in practice proved worse than TBPTT's (Mujika et al., 2018). Less noisy approximations with better accuracy on a variety of problems include Kronecker Factored RTRL (KF-RTRL) (Mujika et al., 2018) and the Optimal Kronecker-Sum Approximation (OK) (Benzing et al., 2019). However, both increase the computational cost to O(k^3).

The last few years have also seen a resurgence of interest in sparse neural networks, both in their properties (Frankle & Carbin, 2019) and in new methods for training them (Evci et al., 2019). A number of works have noted their theoretical and practical efficiency gains over dense networks (Zhu & Gupta, 2018; Narang et al., 2017; Elsen et al., 2019). Of particular interest is the finding that scaling up the state size of an RNN while keeping the number of parameters constant leads to increased performance (Kalchbrenner et al., 2018).

In this work we introduce a new sparse approximation to the RTRL influence matrix. The approximation is biased but not stochastic. Rather than tracking the full influence matrix, we propose to track only the influence of a parameter on neurons that are affected by it within n steps of the RNN. The algorithm becomes strictly less biased but more expensive as n increases; its cost is controlled by n and the amount of sparsity in the Jacobian of the recurrent cell. We study the nature of this bias in Appendix C. Larger n can be coupled with concomitantly higher sparsity to keep the cost fixed. This enables us to obtain the benefits of RTRL at a computational cost per step comparable, in theory, to that of BPTT. The approximation approaches full RTRL as n increases.
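The n-step influence pattern described above amounts to a reachability computation on the network's computation graph. The sketch below is our own construction of how such a mask could be built (the names `snap_mask`, `conn`, and `imm` are ours, not from the paper): it derives the boolean SnAp-n mask from the sparsity pattern of the one-step Jacobian ∂h_t/∂h_{t-1} and of the immediate Jacobian ∂h_t/∂θ. For a vanilla RNN, parameter W_ij reaches only unit i in a single step, so SnAp-1 keeps one influence entry per parameter, while each additional step lets the influence spread through the recurrent connectivity:

```python
import numpy as np

def snap_mask(conn, imm, n):
    """Boolean SnAp-n pattern: keep influence entry J[i, p] iff unit i is
    reached by parameter p within n steps of the recurrent computation graph.
    conn: (k, k) bool, pattern of ∂h_t/∂h_{t-1} (conn[i, j]: unit j feeds unit i)
    imm:  (k, |θ|) bool, pattern of the immediate Jacobian ∂h_t/∂θ."""
    reach = imm.copy()
    for _ in range(n - 1):
        # one more hop through the unit-to-unit connectivity
        reach = reach | ((conn.astype(int) @ reach.astype(int)) > 0)
    return reach

# A sparse 6-unit RNN: parameter p is the p-th nonzero entry of W.
k = 6
rng = np.random.default_rng(0)
conn = rng.random((k, k)) < 0.3            # ~30% dense recurrent connectivity
rows, cols = np.nonzero(conn)
p = rows.size
imm = np.zeros((k, p), dtype=bool)
imm[rows, np.arange(p)] = True             # W_ij touches only unit i in one step

m1 = snap_mask(conn, imm, 1)               # SnAp-1: just the pattern of ∂h_t/∂θ
m2 = snap_mask(conn, imm, 2)               # SnAp-2: one extra hop through conn
print(m1.mean(), m2.mean())                # mask density grows with n
```

The sparse approximation then replaces the exact influence update J ← DJ + I with the masked version J ← (DJ + I) ⊙ M. Note that for a fully dense W a single step already reaches every unit, so SnAp with n = 2 only stays cheap when the network is sparse.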
Our contributions are as follows:
• We propose SnAp, a practical approximation to RTRL that is applicable to both dense and sparse RNNs and is based on sparsification of the influence matrix.
• We show that parameter sparsity in RNNs reduces the costs of RTRL in general and of SnAp in particular.
• We carry out experiments on both real-world and synthetic tasks and demonstrate that the SnAp approximation: (1) works well for language modeling compared to the exact, unapproximated gradient; (2) admits learning temporal dependencies on a synthetic copy task; and (3) can learn faster than BPTT when run fully online.

2. Background

We consider recurrent networks whose dynamics are governed by $h_t = f_\theta(h_{t-1}, x_t)$, where $h_t \in \mathbb{R}^k$ is the state, $x_t \in \mathbb{R}^a$ is an input, and $\theta \in \mathbb{R}^p$ are the network parameters. It is assumed that at each step $t \in \{1, \ldots, T\}$ the state is mapped to an output $y_t = g_\phi(h_t)$, and the network receives a loss $L_t(y_t, y_t^*)$. The system optimizes the total loss $L = \sum_t L_t$ with respect to the parameters $\theta$ by following the gradient $\nabla_\theta L$. The standard way to compute this gradient is BPTT: running backpropagation on the computation graph "unrolled in time" over a number of steps $T$:

$$\nabla_\theta L = \sum_{t=1}^{T} \frac{dL}{dh_t} \frac{\partial h_t}{\partial \theta_t},$$

where $\frac{dL}{dh_t} = \frac{dL}{dh_{t+1}} \frac{\partial h_{t+1}}{\partial h_t} + \frac{\partial L_t}{\partial h_t}$ is the backpropagation rule. The slightly nonstandard notation $\theta_t$ refers to the copy of the parameters used at time $t$; the weights are shared across all timesteps, and the gradient sums over all copies.
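As a concrete instance of this recursion, the NumPy sketch below (our own illustration; the per-step loss $L_t = 0.5\|h_t\|^2$ is an assumption chosen only for concreteness) runs BPTT for a small tanh RNN. The forward pass stores every state $h_t$, which is exactly the linear-in-$T$ memory cost discussed in the introduction; the backward pass applies the backpropagation rule and accumulates the gradient over the per-timestep parameter copies $\theta_t$:

```python
import numpy as np

rng = np.random.default_rng(1)
k, a, T = 4, 3, 6
W = rng.normal(scale=0.5, size=(k, k))   # recurrent weights (differentiated)
U = rng.normal(scale=0.5, size=(k, a))   # input weights (held fixed here)
xs = rng.normal(size=(T, a))

# Forward pass: store the full state history -- the memory cost BPTT pays.
hs = [np.zeros(k)]
for x in xs:
    hs.append(np.tanh(W @ hs[-1] + U @ x))

# Backward pass: dL/dh_t = dL/dh_{t+1} · ∂h_{t+1}/∂h_t + ∂L_t/∂h_t,
# with per-step loss L_t = 0.5·||h_t||^2 so that ∂L_t/∂h_t = h_t.
grad_W = np.zeros_like(W)
carry = np.zeros(k)                        # dL/dh_{t+1} · ∂h_{t+1}/∂h_t
for t in range(T, 0, -1):
    delta = hs[t] + carry                  # dL/dh_t
    d = 1.0 - hs[t] ** 2                   # tanh'
    grad_W += np.outer(delta * d, hs[t - 1])   # contribution of the copy θ_t
    carry = (delta * d) @ W                # pass back through ∂h_t/∂h_{t-1}
```

Summing the per-copy contributions into `grad_W` implements the outer sum over $t$ in the gradient formula above.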

