ENCODING RECURRENCE INTO TRANSFORMERS

Abstract

This paper breaks down an RNN layer, with negligible loss, into a sequence of simple RNNs, each of which can in turn be rewritten as a lightweight positional encoding matrix of a self-attention, named the Recurrence Encoding Matrix (REM). Thus, the recurrent dynamics introduced by the RNN layer can be encapsulated into the positional encodings of a multihead self-attention, which makes it possible to seamlessly incorporate these recurrent dynamics into a Transformer, leading to a new module, Self-Attention with Recurrence (RSA). The proposed module can leverage the recurrent inductive bias of REMs to achieve better sample efficiency than its corresponding baseline Transformer, while the self-attention is used to model the remaining non-recurrent signals. The relative proportions of these two components are controlled by a data-driven gated mechanism, and the effectiveness of the RSA modules is demonstrated by four sequential learning tasks.

1. INTRODUCTION

Sequential data modeling is an important topic in machine learning, and recurrent networks such as the LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Chung et al., 2014) have long served as the benchmarks in this area. Their success is mainly attributable to the variety of recurrent dynamics introduced by these models, referred to as the recurrent inductive bias. More specifically, the dependence between any two inputs can be described by a parametric form, which heavily depends on their relative temporal locations. However, recurrent models are well known to suffer from two drawbacks. The first is the gradient vanishing problem (Hochreiter et al., 2001), i.e. recurrent models have difficulty capturing the possibly high correlation between distant inputs. This problem cannot be fundamentally solved by the recurrent models themselves, although it can be alleviated to some extent, say by introducing long memory patterns (Zhao et al., 2020). Secondly, their sequential nature makes these models difficult to train in parallel (Vaswani et al., 2017). In practice, many techniques have been proposed to improve the computational efficiency of recurrent models, but they all come with compromises (Luo et al., 2020; Lei et al., 2017). In recent years, Transformers (Vaswani et al., 2017) have been revolutionizing the field of natural language processing by achieving state-of-the-art performance on a wide range of tasks, such as language modeling (Kenton & Toutanova, 2019), machine translation (Dai et al., 2019) and text summarization (Liu & Lapata, 2019). They have also demonstrated great potential in other types of sequence learning problems, for instance, time series forecasting (Zhou et al., 2021; Li et al., 2019). The success of Transformers is due to the fact that the similarity between any two tokens is well taken into account (Vaswani et al., 2017), and hence they can model long-range dependence effortlessly.
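As a toy numerical illustration of the gradient vanishing problem mentioned above (ours, not the paper's), consider a linear RNN h_t = W h_{t-1} + U x_t: the influence of an input k steps back on the current hidden state is mediated by W^k, whose norm shrinks geometrically once the spectral radius of W is below one.

```python
import numpy as np

# Toy illustration (not from the paper): in a linear RNN
# h_t = W h_{t-1} + U x_t, an input k steps back enters h_t
# through W^k, so its influence scales roughly like rho(W)^k,
# where rho(W) is the spectral radius.  With rho(W) < 1 the
# dependence on distant inputs decays geometrically.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))  # rescale so rho(W) = 0.9

for k in (1, 5, 20, 50):
    influence = np.linalg.norm(np.linalg.matrix_power(W, k), 2)
    print(f"lag {k:3d}: ||W^k|| = {influence:.2e}")
```

The printed norms shrink rapidly with the lag k, which is exactly why a recurrent model struggles to represent strong dependence between distant inputs.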
Moreover, contrary to recurrent models, the self-attention mechanism in Transformers is feed-forward in nature, and thus can be computed in parallel on GPU infrastructure (Vaswani et al., 2017). However, this flexibility also leads to sample inefficiency in training a Transformer, i.e. many more samples are needed to guarantee good generalization ability (d'Ascoli et al., 2021). In addition, chronological order is usually ignored by Transformers since they are time-invariant, and some additional effort, in the form of positional encoding, is required to aggregate the temporal information (Shaw et al., 2018; Vaswani et al., 2017; Dai et al., 2019). In short, both recurrent and Transformer models have their pros and cons in modeling sequential data. On one hand, due to their inductive bias, recurrent models excel at capturing recurrent patterns even with relatively small sample sizes; see Figure 1(a). Meanwhile, sample size is the performance bottleneck for Transformer models and, when there are sufficient samples, they are supposed to be able to depict any recurrent or non-recurrent patterns in the data; see Figure 1(b). On the other hand, sequential data more or less exhibit recurrent patterns, and Transformers may achieve improved performance if a recurrent model can be involved to handle these patterns, especially when the sample size is relatively small. Specifically, if the recurrent and non-recurrent components are separable, then one can apply a parsimonious recurrent model to the recurrent component and a Transformer to the non-recurrent one. As a result, sample efficiency can be improved compared with the Transformer-only baseline; see the illustration in Figure 1(c).
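For concreteness, a single self-attention head can be sketched in a few lines of NumPy (a generic textbook formulation, not the paper's code): all pairwise token similarities are obtained in one matrix product, with no sequential loop over time, which is what makes the computation parallelizable.

```python
import numpy as np

# Sketch of one self-attention head (generic formulation, for
# illustration only).  Every pairwise token similarity is computed
# in a single batched matrix product, so unlike an RNN there is no
# sequential dependence across time steps.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # (T, T) similarities
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                 # row-wise softmax
    return A @ V                                  # weighted mix of values

rng = np.random.default_rng(0)
T, d = 6, 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8)
```

Note that the attention weights depend only on content similarity; any notion of temporal order has to be injected separately, e.g. via positional encodings.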

There have been various attempts in the literature to combine the two models. Some earlier works simply stacked them together in a straightforward manner: Chen et al. (2018) mixed and matched a Transformer's encoder with a recurrent-based decoder, Hao et al. (2019) introduced an additional recurrent encoder to a Transformer, and Wang et al. (2019) stacked a recurrent layer prior to the multihead self-attention. These proposals inherit both of the aforementioned shortcomings of Transformer and recurrent models. In particular, for a very long input sequence, the sequential operation in the recurrent layers becomes extremely expensive. Recent efforts have been devoted to integrating recurrence and self-attention systematically. Feedback Transformer (Fan et al., 2021) introduces memory vectors to aggregate information across layers, and uses them to update the next token in a recursive manner. However, the computationally expensive sequential operation limits its attractiveness. Another line of research applies the recurrent operation only to aggregate the temporal information at a coarser scale, while the token-by-token dependence is learned by self-attention instead. Transformer-XL (Dai et al., 2019) partitions the long inputs into segments and introduces a segment-level recurrence. Meanwhile, Temporal Latent Bottleneck (TLB) (Didolkar et al., 2022) and Block-Recurrent Transformer (BRT) (Hutchins et al., 2022) further divide the segments into smaller chunks, and each chunk is summarized into a few state vectors. A recurrent relation is then formed on the sequence of state vectors. These hierarchical designs are useful for reducing the computational burden, but they overlook recurrent dynamics at a finer scale. In an attempt to simplify the numerical calculation of RNNs, we surprisingly found that an RNN layer with linear activation can be broken down into a series of simple RNNs with scalar hidden coefficients. Each simple RNN induces a distinct recurrent pattern, and their combination forms the recurrent dynamics of the RNN layer. Hence the calculation time can be greatly reduced by training these simple RNNs in parallel.
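The decomposition idea can be sketched numerically as follows (a simplified illustration of ours, assuming the recurrent weight matrix is diagonalizable with real eigenvalues; the paper's REMs also cover complex-conjugate eigenvalue pairs via the parameters (γ, θ)): diagonalizing W turns one d-dimensional linear recurrence into d independent scalar recurrences.

```python
import numpy as np

# Sketch: diagonalizing W = P diag(lam) P^{-1} turns the linear RNN
# h_t = W h_{t-1} + U x_t into d independent scalar ("simple") RNNs
# g_t[i] = lam[i] * g_{t-1}[i] + (P^{-1} U x_t)[i], with h_t = P g_t.
# Real eigenvalues only, for illustration.
rng = np.random.default_rng(1)
d, T = 4, 30
P = rng.standard_normal((d, d))
lam = rng.uniform(-0.9, 0.9, d)             # scalar hidden coefficients
W = P @ np.diag(lam) @ np.linalg.inv(P)     # diagonalizable W
U = rng.standard_normal((d, d))
x = rng.standard_normal((T, d))

# full linear RNN
h = np.zeros(d)
for t in range(T):
    h = W @ h + U @ x[t]

# equivalent bank of scalar recurrences, which can run independently
g = np.zeros(d)
Uc = np.linalg.inv(P) @ U
for t in range(T):
    g = lam * g + Uc @ x[t]                 # elementwise: d scalar RNNs

print(np.allclose(h, P @ g))                # True: same hidden state
```

Each scalar coefficient lam[i] induces its own geometrically decaying recurrent pattern, and their combination recovers the dynamics of the original layer.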
On top of that, each of these simple RNNs can be equivalently rewritten into the positional encodings of a multihead self-attention (MHSA). This naturally inspires a solution, the multihead Self-Attention with Recurrence (RSA), which combines self-attention and the RNN into a single operation while maintaining parallel computation. This design preserves the merits of both Transformer and recurrent models, while their respective shortcomings are avoided. More importantly, it can be used to replace the self-attention of existing networks, such as Transformer-XL, TLB and BRT, to further explore recurrent dynamics at the finer scale. Our paper makes three main contributions. 1. With negligible approximation loss, we demonstrate that an RNN layer with linear activation is equivalent to a multihead self-attention (MHSA); see Figure 2. Specifically, each attention



Figure 1: (a)-(c) plot two data features, namely the sample signal and the sample size, as the x and y axes, respectively. The model performance in each data region is given for the RNN, Transformer and RSA, where deeper color implies better performance. (d) In each attention head, the proposed RSA attaches an REM to a normalized self-attention score via a gated mechanism, with gate value σ(µ). The REM depicts a type of recurrent dependence structure between the tokens in X, and is parameterized by one or two parameters, i.e. λ or (γ, θ), where λ = tanh(η) and γ = σ(ν).
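A minimal sketch of the gated combination described in the caption, for a single head, under our reading and assuming a real-eigenvalue REM with entries λ^(i−j) below the diagonal (the function name `rsa_head` and the exact mixing formula are our illustrative choices, not the paper's definitive implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hedged sketch of an RSA head: the REM P has entries lam**(i-j) for
# i > j (geometrically decaying dependence on past tokens), and the
# gate sigma(mu) trades it off against the softmax attention scores.
def rsa_head(X, Wq, Wk, Wv, eta, mu):
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(d)
    A = np.exp(S - S.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                 # normalized attention
    lam = np.tanh(eta)                            # lam = tanh(eta)
    i, j = np.indices((T, T))
    P = np.where(i > j, lam ** np.clip(i - j, 1, None), 0.0)  # REM
    g = sigmoid(mu)                               # gate value sigma(mu)
    return (g * P + (1 - g) * A) @ V

rng = np.random.default_rng(2)
T, d = 5, 4
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = rsa_head(X, Wq, Wk, Wv, eta=1.0, mu=0.0)
print(out.shape)  # (5, 4)
```

Since η and µ are learned, the data decide how much of each head's output is driven by the recurrent prior versus by content-based attention.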

