

Abstract

Recurrent neural networks are usually trained with backpropagation through time, which requires storing a complete history of network states and prohibits updating the weights 'online' (after every timestep). Real-Time Recurrent Learning (RTRL) eliminates the need for history storage and allows for online weight updates, but at the cost of computation that is quartic in the state size. This renders RTRL training intractable for all but the smallest networks, even highly sparse ones. We introduce the Sparse n-step Approximation (SnAp) to the RTRL influence matrix. SnAp tracks the influence of a parameter only on hidden units that are reached by the computation graph within n timesteps of the recurrent core. SnAp with n = 1 is no more expensive than backpropagation but allows training on arbitrarily long sequences. We find that it substantially outperforms other RTRL approximations with comparable costs, such as Unbiased Online Recurrent Optimization. For highly sparse networks, SnAp with n = 2 remains tractable and can outperform backpropagation through time in terms of learning speed when updates are done online.
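The sparsity pattern underlying SnAp-n can be sketched in a few lines. The code below is an illustrative construction in our own notation (the function `snap_mask` and the connectivity matrix `C` are ours, not the paper's): parameter W[a, b] immediately affects unit a, and SnAp-n keeps the influence entries for all units reachable from a within n steps of the recurrent connectivity.

```python
import numpy as np

def snap_mask(C, n):
    """Sparsity pattern kept by SnAp-n (illustrative sketch, not the paper's code).

    C: (k, k) 0/1 connectivity matrix with C[i, j] = 1 iff unit j feeds unit i.
    Returns a boolean (k, k*k) mask over influence entries (unit i, weight W[a, b]).
    Parameter W[a, b] immediately affects unit a; the entry is kept iff unit i
    is reachable from unit a within n steps of the recurrent computation graph.
    """
    k = C.shape[0]
    reach = np.eye(k, dtype=int)                 # n = 1: only the directly affected unit
    for _ in range(n - 1):                       # each extra step follows C once more
        reach = ((reach + C @ reach) > 0).astype(int)
    mask = np.zeros((k, k * k), dtype=bool)
    for a in range(k):
        for b in range(k):
            mask[:, a * k + b] = reach[:, a] > 0
    return mask
```

Restricting the RTRL update to the entries of this fixed mask makes both the memory and the per-step computation proportional to the number of nonzeros kept, rather than to the full influence-matrix size.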

1. Introduction

Recurrent neural networks (RNNs) have been successfully applied to a wide range of sequence learning tasks, including text-to-speech (Kalchbrenner et al., 2018), language modelling (Dai et al., 2019), automatic speech recognition (Amodei et al., 2016), translation (Chen et al., 2018), and reinforcement learning (Espeholt et al., 2018). RNNs have greatly benefited from advances in computational hardware, dataset sizes, and model architectures. However, the algorithm used to compute their gradients in almost all practical applications has not changed since the introduction of Back-Propagation Through Time (BPTT). The key limitation of BPTT is that the entire state history must be stored, meaning that the memory cost grows linearly with the sequence length. For sequences too long to fit in memory, as often occur in domains such as language modelling or long reinforcement learning episodes, truncated BPTT (TBPTT) (Williams & Peng, 1990) can be used. Unfortunately, the truncation length used by TBPTT also limits the duration over which temporal structure can be reliably learned.

Forward-mode differentiation, or Real-Time Recurrent Learning (RTRL) as it is called when applied to RNNs (Williams & Zipser, 1989), solves some of these problems. It does not require storage of any past network states, can in principle learn dependencies of any length, and can be used to update parameters at any desired frequency, including every step (i.e. fully online). However, its fixed storage requirements are O(k·|θ|), where k is the state size and |θ| is the number of parameters θ in the core. Perhaps even more daunting, the computation it requires is O(k²·|θ|). This makes it impractical for even modestly sized networks.

The advantages of RTRL have led to a search for more efficient approximations that retain its desirable properties whilst reducing its computational and memory costs. One recent line of work introduces unbiased but noisy approximations to the influence update.
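As a concrete illustration of these costs, the sketch below runs exact RTRL on a toy vanilla RNN, forward-accumulating the influence matrix J = dh/dvec(W). This is a minimal sketch in our own notation (the names k, d, W, U, J are illustrative), tracking only the recurrent weights W for brevity, so |θ| = k².

```python
import numpy as np

# Toy vanilla RNN: h_t = tanh(W h_{t-1} + U x_t). Illustrative sketch only.
k, d, T = 4, 3, 5                      # state size, input size, sequence length
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(k, k))
U = rng.normal(scale=0.5, size=(k, d))
xs = rng.normal(size=(T, d))           # input sequence

h = np.zeros(k)
J = np.zeros((k, k * k))               # influence matrix dh/dvec(W): O(k·|θ|) memory
for x in xs:
    # Immediate Jacobian of the pre-activation w.r.t. vec(W): entry
    # (i, i*k + j) equals h_{t-1}[j]; all other entries are zero.
    direct = np.kron(np.eye(k), h)
    h = np.tanh(W @ h + U @ x)
    D = 1.0 - h ** 2                   # tanh derivative at the new state
    # Forward accumulation of the influence. The (k, k) @ (k, k²) product
    # is the O(k²·|θ|) per-step cost that makes exact RTRL intractable.
    J = D[:, None] * (W @ J + direct)
# An online gradient is then dL/dvec(W) = (dL/dh_t) @ J at any step.
```

Note that, unlike BPTT, no state history is kept: each step consumes the previous (h, J) pair and produces the next one, which is what permits a fully online weight update.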
Unbiased Online Recurrent Optimization (UORO) (Tallec

