RELATIVE POSITIONAL ENCODING FAMILY VIA UNITARY TRANSFORMATION

Abstract

Relative positional encoding is widely used in vanilla and linear transformers to represent positional information. However, existing encoding methods of a vanilla transformer are not always directly applicable to a linear transformer, because the latter requires a decomposition of the query and key representations into separate kernel functions. Meanwhile, principles for designing encoding methods suitable for linear transformers remain under-studied. In this work, we put a variety of existing encoding approaches under a canonical form and further propose a family of relative positional encoding algorithms via unitary transformation. Our formulation leads to a principled framework that can be used to develop new relative positional encoding methods that preserve linear space-time complexity. Equipped with different parameters, the proposed linearized relative positional encoding (LRPE) family derives effective encodings for various applications. Experiments show that, compared with existing methods, LRPE achieves competitive performance on language modeling and various challenging downstream tasks, e.g., machine translation and text classification. Moreover, it highlights a general paradigm for designing broader families of relative positional encoding methods, applicable to both linear and vanilla transformers.

1. INTRODUCTION

Transformers have achieved remarkable progress in natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Liu et al., 2021; Arnab et al., 2021) and audio processing (Gulati et al., 2020). As an important ingredient of transformers, positional encoding assigns a unique representation to each position of a token in a sequence so that the transformer can sense the positions of input tokens. Among these encoding methods, absolute positional encoding (Vaswani et al., 2017; Sukhbaatar et al., 2015; Devlin et al., 2019; Liu et al., 2020) maps each individual position index into a continuous encoding, whereas relative positional encoding (Shaw et al., 2018; Su et al., 2021; Horn et al., 2021; Liutkus et al., 2021; Huang et al., 2020; Raffel et al., 2019) generates an encoding for each query-key pair, representing their relative positional offset. We focus on relative positional encoding, as it is not constrained by input lengths (Chen, 2021) while showing superior performance (Shaw et al., 2018). Linear transformers (Chen, 2021; Qin et al., 2022; Su et al., 2021) have attracted increasing attention recently, as they achieve linear space-time complexity with respect to the input sequence length while maintaining performance comparable to vanilla transformers. Most existing linear transformers use absolute positional encoding methods to encode positional information, since most existing relative positional encoding methods are designed for vanilla transformers and are not directly applicable to linear transformers. The main cause of this limitation is that linear transformers decompose the query and key representations in the self-attention modules into separate kernel functions to achieve linear space-time complexity. This additional requirement on decomposability is not always satisfied by existing relative positional encoding methods.
On the other hand, despite some individual works (Qin et al., 2022; Chen, 2021), general principles for designing relative positional encoding for linear transformers remain largely under-studied. A recent work, RoPE (Su et al., 2021), proposes a new set of multiplicative encoding solutions based on rotary positional encoding that can be applied to linear transformers. In Section C.7, we show that RoPE can be seen as a special form of LRPE.

[Figure 1: Comparison of the vanilla relative positional encoding (complexity O(n^2 d) ≈ O(n^2)) and LRPE (complexity O(nd^2) ≈ O(n)). Q, K and V are all of shape n × d, where n is the input length and d is the feature dimension. Tensors in the same dashed box are associated for computation. In vanilla relative positional encoding, the query-key attention has to be calculated first, leading to quadratic complexity; W_{t-s} refers to the relative positional encoding, where t, s are the positional indices of the query and key, respectively. LRPE achieves a decomposable encoding, i.e., W_t and W_s depend only on the positions of the query and key, making it fully compatible with linear transformers. When dealing with long sequences, d ≪ n, the computational complexity is dominated by n, rendering d negligible.]

In this work, we aim to bridge this gap and study a principled framework for developing relative positional encoding applicable to both linear and vanilla transformers. To this end, we start by presenting a canonical form of relative positional encoding, which reveals that the differences among existing encoding methods boil down to choices of a set of query, key and relative positional matrix primitives. By properly selecting and composing these primitives, we can derive various existing encoding methods for vanilla (Vaswani et al., 2017; Huang et al., 2020; Shaw et al., 2018) and linear (Qin et al., 2022) transformers. Taking advantage of the canonical form, we introduce the main contribution of our work, i.e., a special family of relative positional encoding methods called linearized relative positional encoding (LRPE). Specifically, we supply a sufficient condition for designing encoding methods compatible with linear transformers and prove that a linearized relative positional encoding is a unitary transformation. The benefits of using a unitary transformation are two-fold.
First, since it is derived from a decomposable positional matrix, it maintains the linear space-time complexity, as shown in Fig. 1. Second, the properties of unitary transformations allow us to effectively derive a family of closed-form solutions. In particular, we show that a number of encoding methods belong to the LRPE family, including those used in RoPE (Su et al., 2021) and PermuteFormer (Chen, 2021). Furthermore, LRPE sheds light on a simple yet flexible theoretical paradigm for developing new effective relative positional encodings. To demonstrate this, we derive, non-exhaustively, three additional LRPE encoding methods by parameterizing the generic solution differently, including solutions living in either the real or the complex domain. Since unitary transformations are special cases of relative positional matrices, LRPE is applicable to both linear and vanilla transformers, and is suitable within encoder and/or decoder layers. We experimentally demonstrate the effectiveness of the LRPE family on autoregressive and bidirectional language modeling, and on challenging downstream tasks, including machine translation and text classification. Results show that LRPE achieves competitive capability in representing relative positional information, commonly resulting in superior performance over previous encoding methods. In summary, our main contributions are three-fold:
• We present a canonical form of relative positional encoding, which derives most existing relative positional encoding methods as special cases, including those used in linear and vanilla transformers.
• Based on the canonical form, we propose linearized relative positional encoding (LRPE), a simple yet principled formulation for deriving an encoding family that respects the linear space-time complexity of linear transformers, while also being applicable to vanilla transformers. We show that several existing relative positional encoding methods in linear transformers belong to the LRPE family.
We also provide additional particular solutions from this generic form.
• Experiments on various downstream tasks, including language modeling, machine translation and text classification, show that the LRPE family achieves more robust and commonly superior results across tasks than previous relative encoding methods, and is flexible in being plugged into linear/vanilla models, in encoder and/or decoder layers. In addition, it is generic enough to derive existing and potentially new encoding methods.

2. BACKGROUND AND PRELIMINARY

In this section, we provide preliminary knowledge and describe related work to facilitate the remaining discussion. In the following, we denote the k-th row of a matrix M as m_k^T and the d-dimensional identity matrix as I_d. We omit the subscript d when it is unambiguous from the context. The complete list of notations can be found in Appendix A.

2.1. TRANSFORMER AND ITS LINEARIZATION

We first briefly review the vanilla transformer (Vaswani et al., 2017) and its linearization (Katharopoulos et al., 2020). The key component of transformer models is the self-attention block, which involves three matrices Q (Query), K (Key) and V (Value); each of them is a linear projection taking X ∈ R^{n×d} as input: Q = XW_Q, K = XW_K, V = XW_V ∈ R^{n×d}. The output O ∈ R^{n×d} is computed as a Softmax-weighted sum: O = Softmax(QK^T/√d)V. The computational overhead of the vanilla transformer grows quadratically with respect to the sequence length n, which becomes the bottleneck for transformers handling long input sequences. Linearization of self-attention aims to reduce the computational complexity to linear (Katharopoulos et al., 2020; Ke et al., 2021; Qin et al., 2022; Vyas et al., 2020; Peng et al., 2021; Xiong et al., 2021), typically achieved via a decomposable kernel function φ: R^d → R^d. Specifically, the output of linear attention is computed as: O = Δ^{-1} φ(Q)[φ(K)^T V], Δ = diag(φ(Q)[φ(K)^T 1_n]). The key property of linear attention is the decomposability of the kernel function. This makes it possible to compute φ(K)^T V ∈ R^{d×d} first, which leads to O(nd^2) complexity, further reducing to O(n) for long inputs (d ≪ n). See Appendix B for a detailed discussion.

2.2. POSITIONAL ENCODING

Self-attention is capable of parallel sequence processing but cannot capture positional information of each token. To address this issue, positional encoding methods have been proposed; they can be broadly categorized into two groups: absolute positional encoding and relative positional encoding. Absolute positional encoding employs handcrafted functions (Vaswani et al., 2017; Sukhbaatar et al., 2015) or learnable encoding lookup tables P ∈ R^{n×d} (Devlin et al., 2019; Liu et al., 2020) to represent position indices as encodings.
These encodings are then combined with the context vector additively: q_s = W_Q(x_s + p_s), k_s = W_K(x_s + p_s), v_s = W_V(x_s + p_s), where the encoding formulation depends only on the absolute position index s, and the positional encoding size is restricted by the input sequence length. Relative positional encoding considers relative position offsets between two input tokens (Shaw et al., 2018), i.e., e_{st} = x_s^T W_Q^T W_K x_t + f(x_s, x_t, t − s), where s, t are the two position indices and e_{st} denotes the attention score before the softmax. Compared to absolute positional encoding, relative positional encoding generally achieves better performance, as it can handle variable input lengths (Chen, 2021). However, the extra computation and memory costs make it less efficient than absolute positional encoding (Likhomanenko et al., 2021). Most existing relative positional encoding methods (Raffel et al., 2019; Shaw et al., 2018; Huang et al., 2020) require computing the query-key attention QK^T and combining it with relative positional information, which incurs quadratic complexity. In contrast, linear attention avoids such a query-key product to achieve linear complexity. Therefore, common relative positional encoding methods are usually not applicable to linear transformers.
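To make the kernel trick concrete, the following NumPy sketch (our own illustration, not the paper's code) contrasts Softmax attention with linear attention using the 1 + elu(·) feature map mentioned in Section 4.1; the function names and shapes are ours.

```python
import numpy as np

def vanilla_attention(Q, K, V):
    """Softmax attention: materializes the n x n score matrix, so O(n^2 d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """Kernelized attention: phi(K)^T V is a d x d matrix computed once,
    so the total cost is O(n d^2).  Default kernel: 1 + elu(x)."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                      # (d, d): independent of n once formed
    z = Kp.sum(axis=0)                 # (d,): normalizer statistics
    return (Qp @ kv) / (Qp @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = rng.normal(size=(3, n, d))
O = linear_attention(Q, K, V)
assert O.shape == (n, d)
```

Reordering the matrix products is the entire trick: the same weighted sum is computed, but the n × n score matrix is never formed.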

3. OUR METHOD

In this section, we present our main technical contribution, linearized relative positional encoding, an encoding family that preserves the linear space-time complexity. Specifically, we start by presenting a canonical form of relative positional encoding, and show that existing encoding methods can be derived by instantiating the canonical form with different choices of so-called primitive queries, keys and positional matrices in Section 3.1. Imposing the decomposability constraint on this canonical form, we obtain a sufficient condition for linearized relative positional encoding (LRPE) and derive a family of concrete solutions in the real and complex domains in Section 3.2. We provide an implementation sketch in Section 3.3.

3.1. CANONICAL FORM OF RELATIVE POSITIONAL ENCODING

In order to better establish connections between existing relative positional encoding methods and understand their design principles, in this section we first present a canonical form of relative positional encoding. In particular, given a query q_s and key k_t pair, their relative positional encoding f_rel: C^d × C^d → C can be represented as:

f_rel(q_s, k_t) = Σ_{l=1}^m (q_s^(l))^H W_{t-s}^(l) k_t^(l), (6)

where H represents the conjugate transpose and m represents the number of primitives. We refer to q_s^(l) ∈ C^{d_1^(l)}, k_t^(l) ∈ C^{d_2^(l)}, W_{t-s}^(l) ∈ C^{d_1^(l) × d_2^(l)} as the query, key and relative positional matrix primitives, respectively, used as constituent components to construct the relative positional encoding. Note that query primitives do not always indicate a reliance on query embeddings, and similarly for the other primitives; for example, an identity matrix can also serve as a primitive, as we will show shortly in Section 3.1.1. To demonstrate that Eq. 6 is a generic formulation, we show that it flexibly induces a wide range of existing relative encoding methods (Shaw et al., 2018; Su et al., 2021; Horn et al., 2021; Liutkus et al., 2021; Huang et al., 2020; Raffel et al., 2019) by selecting and composing different choices of primitives. Among them, we highlight two examples in the following section, and leave the complete discussion to Appendix C.1.

3.1.1. TYPICAL ENCODING EXAMPLES FROM THE CANONICAL FORM

Additive. In (Huang et al., 2020), the relative positional encoding is formulated as an extra additive term to the query-key inner product:

f_rel(q_s, k_t) = q_s^H k_t + w_{t-s}, (7)

which can be derived by including an extra identity term as a primitive, formally denoted as:

m = 2, q_s^(1) = q_s, k_t^(1) = k_t, W_{t-s}^(1) = I_d, q_s^(2) = I_d, k_t^(2) = I_d, W_{t-s}^(2) = w_{t-s} I_d. (8)

Multiplicative. In RoPE (Su et al., 2021), the relative positional encoding works in the form of a weighted inner product:

f_rel(q_s, k_t) = q_s^H W_{t-s} k_t, (9)

which can be denoted as:

m = 1, q_s^(1) = q_s, k_t^(1) = k_t, W_{t-s}^(1) = W_{t-s}. (10)
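As a sanity check of the canonical form (our own illustration), the additive encoding of Eq. 7 can be reassembled from primitives. For simplicity we realize the second primitive one-dimensionally (q^(2) = k^(2) = 1, W^(2) = w_{t-s}) rather than with the identity-matrix primitives of Eq. 8; this choice of ours contributes the same scalar bias and highlights that primitive dimensions may differ across l.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
q, k = rng.normal(size=(2, d))
w = 0.7  # the relative bias w_{t-s}

# Target: the additive encoding of Eq. 7, f_rel = q^H k + w_{t-s}.
target = q @ k + w

# Canonical form (Eq. 6) with m = 2 primitives.  The second primitive is
# taken one-dimensional here (our simplification), so its term is exactly
# the scalar bias.
primitives = [
    (q, np.eye(d), k),                          # content term, W^(1) = I_d
    (np.ones(1), np.array([[w]]), np.ones(1)),  # bias term
]
f_rel = sum(ql @ Wl @ kl for ql, Wl, kl in primitives)
assert np.isclose(f_rel, target)
```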

3.1.2. SIMPLIFICATION

For ease of the remaining discussion, we introduce necessary notations and simplify Eq. 6:

d̂_1 = Σ_{l=1}^m d_1^(l), d̂_2 = Σ_{l=1}^m d_2^(l),
q̂_s = [(q_s^(1))^T, ..., (q_s^(m))^T]^T ∈ C^{d̂_1}, k̂_t = [(k_t^(1))^T, ..., (k_t^(m))^T]^T ∈ C^{d̂_2},
Ŵ_{t-s} = block-diag{W_{t-s}^(1), ..., W_{t-s}^(m)} ∈ C^{d̂_1 × d̂_2}. (11)

With these notations, we can rewrite Eq. 6 in matrix form: f_rel(q_s, k_t) = q̂_s^H Ŵ_{t-s} k̂_t. Since every component of q̂_s and k̂_t is handled identically, without loss of generality we only discuss the case m = 1:

f_rel(q_s, k_t) = q_s^H W_{t-s} k_t. (12)

3.2. LINEARIZED RELATIVE POSITION ENCODING

Eq. 6 is a canonical form of relative positional encoding, meaning that its variants are applicable to vanilla transformers but not necessarily to linear ones. To design relative encodings compatible with linear transformers, the attention computation has to respect the decomposability condition. This additional condition leads to the linearized relative positional encoding (LRPE) family, defined as follows.

Definition 3.1. A relative positional encoding is called a linearized relative positional encoding (LRPE) when the following holds for all q_s, k_t ∈ C^d:

f_rel(q_s, k_t) = q_s^H W_{t-s} k_t = (M_s q_s)^H (M_t k_t) = q_s^H M_s^H M_t k_t, (13)

where W_s, M_s ∈ C^{d×d} and W_0 = I_d. The assumption W_0 = I_d implies that the interaction between tokens at the same position depends only on the content, which is reasonable enough that most encoding methods respect it. In essence, Eq. 13 ensures that the positional matrix is decomposable, so the query-key inner product can be avoided in the attention computation. Consequently, the complexity of computing LRPE is O(nd^2), where n is the sequence length and d is the embedding dimension, as Appendix C.2 shows in detail. Eq. 13 can be simplified based on the following proposition:

Proposition 3.2. Eq. 13 is equivalent to Eq. 14, where W_t is a unitary matrix:

W_{t-s} = W_s^H W_t. (14)

Proof of Proposition 3.2. By the arbitrariness of q_s and k_t, Eq. 13 is equivalent to

W_{t-s} = M_s^H M_t. (15)

Taking s = t in Eq. 15, we get (since we assume W_0 = I_d)

M_s^H M_s = W_0 = I_d, (16)

so M_s is a unitary matrix. On the other hand, note that for any unitary matrix P, we always have

W_{t-s} = M_s^H M_t = M_s^H I_d M_t = M_s^H P^H P M_t = (P M_s)^H (P M_t). (17)

This means that left-multiplying M_t by a unitary matrix P does not change Eq. 13. Since M_0^H is also a unitary matrix, we can perform the following transformation:

M̄_s = M_0^H M_s. (18)

With M̄_s, Eq. 15 becomes

W_{t-s} = M̄_s^H M̄_t. (19)

Taking s = 0 and noting that M̄_0 = M_0^H M_0 = I_d, we have W_t = M̄_0^H M̄_t = M̄_t. Thus Eq. 19 becomes W_{t-s} = W_s^H W_t. Since M̄_s is a unitary matrix, W_s is also a unitary matrix, i.e., W_s^H W_s = I_d. The detailed proof can be found in the appendix. In the following section, we derive particular solutions of Eq. 14.
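Proposition 3.2 can be checked numerically. Below is a minimal NumPy sketch (ours, not from the paper), assuming the one-parameter family W_s = W^s generated by an arbitrary fixed unitary matrix W; such a family satisfies Eq. 14 and every member is unitary.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6

# A random unitary matrix via QR decomposition of a complex Gaussian matrix.
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
W, _ = np.linalg.qr(A)

def W_pow(s):
    # W_s = W^s (negative powers use the inverse, valid since W is unitary).
    return np.linalg.matrix_power(W, s)

# Eq. 14: W_s^H W_t = W_{t-s} for arbitrary positions s, t.
for s, t in [(0, 0), (1, 3), (2, 5), (4, 1)]:
    assert np.allclose(W_pow(s).conj().T @ W_pow(t), W_pow(t - s))

# Each family member is itself unitary: W_s^H W_s = I_d.
assert np.allclose(W_pow(3).conj().T @ W_pow(3), np.eye(d))
```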

3.2.1. PARTICULAR SOLUTIONS

In this section, we discuss Eq. 14 and give a family of solutions. It is worth noting that the solutions we provide are all of the form W_s = P^H Λ^(s) P, where P and Λ^(s) are unitary matrices. The complete derivations can be found in Appendices C.4, C.5 and C.6.

[Table 1: The eight practical LRPE types, obtained by pairing the matrix P with the solutions below (details in Appendix C.7). Householder P yields Types 1, 2 and 4; learnable Householder P yields Type 3; Permutation P yields Types 5, 6 and 7; FFT P yields Type 8.]

Unitary (Solution 1). The first case is discussed in the complex domain, which is not common in transformer models yet exhibits an elegant solution.

Proposition 3.1. The following form of W_s ∈ C^{d×d} satisfies Eq. 14:

W_s = P^H Λ^(s) P, Λ^(s) = diag{exp(isα_1), ..., exp(isα_d)},

where P ∈ C^{d×d} is a unitary matrix and α_k, k = 1, ..., d, are parameters.

Orthogonal (Solution 2). Now we consider the real domain, the more common case in transformers.

Proposition 3.2. The following form of W_s ∈ R^{d×d} satisfies Eq. 14:

W_s = P^T Λ^(s) P, Λ^(s) = block-diag{A^(s), B^(s)},
A^(s) = block-diag{A_1^(s), ..., A_p^(s)} ∈ R^{2p×2p}, B^(s) = I_q ∈ R^{q×q},
A_k^(s) = [cos(sα_k), −sin(sα_k); sin(sα_k), cos(sα_k)],

where P ∈ R^{d×d} is an orthogonal matrix, d = 2p + q, and α_k, k = 1, ..., p, are parameters.

Permutation (Solution 3). The last case is inspired by PermuteFormer (Chen, 2021) and is associated with the permutation matrix:

Proposition 3.3. The following form of W_k ∈ R^{d×d} satisfies Eq. 14:

W_k = P^T Λ^(k) P, Λ^(k) = (I_d)_{π^k},

where π: {1, 2, ..., d} → {1, 2, ..., d} is a permutation and P ∈ R^{d×d} is an orthogonal matrix.
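Solution 2 can be verified directly. The following is a minimal NumPy sketch (ours), assuming q = 0 so that d = 2p and Λ^(s) consists purely of 2×2 rotation blocks with angles s·α_k.

```python
import numpy as np

def rotation_W(s, alphas, P):
    """Solution 2 sketch: W_s = P^T Lambda^(s) P, with Lambda^(s)
    block-diagonal, each 2x2 block a rotation by s * alpha_k (q = 0, d = 2p)."""
    p = len(alphas)
    Lam = np.zeros((2 * p, 2 * p))
    for k, a in enumerate(alphas):
        c, si = np.cos(s * a), np.sin(s * a)
        Lam[2*k:2*k+2, 2*k:2*k+2] = [[c, -si], [si, c]]
    return P.T @ Lam @ P

rng = np.random.default_rng(3)
p = 3
d = 2 * p
alphas = rng.uniform(0, np.pi, size=p)
P, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal P

# Eq. 14 in the real domain: W_s^T W_t = W_{t-s}.
for s, t in [(0, 4), (2, 7), (5, 1)]:
    Ws, Wt = rotation_W(s, alphas, P), rotation_W(t, alphas, P)
    assert np.allclose(Ws.T @ Wt, rotation_W(t - s, alphas, P))
```

The check reduces to the angle-addition identity R(sα)^T R(tα) = R((t−s)α) applied block-wise, since P^T P = I cancels in the middle.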

3.3. THE LRPE FAMILY

LRPE (W_s = P^H Λ^(s) P) contains two components, i.e., a fixed unitary matrix P and a unitary matrix family Λ^(s), as mentioned in Propositions 3.1, 3.2 and 3.3. P can be seen as a rotation matrix that rotates the token feature into a particular coordinate system, while Λ^(s) injects the positional information into the rotated feature. In this paper, we select three types of commonly used orthogonal matrices, i.e., (1) the Householder matrix (Golub & Van Loan, 2013), (2) the permutation matrix, and (3) the FFT matrix (Bracewell & Bracewell, 1986). We combine the selected matrix P with the three LRPE solutions above to obtain 8 practical LRPE types, as shown in Table 1. Detailed information can be found in Appendix C.7.
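As an illustration of the two components (our own sketch, not the released implementation), the following pairs Solution 1 with a Householder P and checks that the encoded inner product depends only on the positional offset t − s.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8

# P as a Householder matrix: I - 2 v v^T, orthogonal and symmetric.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
P = np.eye(d) - 2.0 * np.outer(v, v)
assert np.allclose(P.T @ P, np.eye(d))

alphas = rng.uniform(0, np.pi, size=d)

def lrpe(x, s):
    """Apply W_s x with W_s = P^H Lambda^(s) P and
    Lambda^(s) = diag(exp(i s alpha_k)): Solution 1 paired with a
    Householder P (an illustrative pairing of ours)."""
    return P.T @ (np.exp(1j * s * alphas) * (P @ x))

q, k = rng.normal(size=(2, d))
s, t = 3, 7
# Decomposability: (W_s q)^H (W_t k) equals q^H W_{t-s} k.
lhs = np.vdot(lrpe(q, s), lrpe(k, t))
rhs = np.vdot(q, lrpe(k, t - s))
assert np.allclose(lhs, rhs)
```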

4. EXPERIMENTS

4.1. EXPERIMENTAL SETTINGS

Primary tasks. We validate the effectiveness of the proposed LRPE on various NLP tasks that resort to different transformer architectures. Specifically, we first study the autoregressive language model (Radford et al., 2018) with a GPT-like decoder-only structure. This is followed by the bidirectional language model (encoder-only), which adopts the Roberta architecture (Liu et al., 2020) and is pretrained and then fine-tuned on several downstream tasks from the GLUE benchmark (Wang et al., 2018). Then, we evaluate LRPE on machine translation (encoder-decoder).

Competing methods. Our baseline is the transformer model (Vaswani et al., 2017) without relative positional encoding. For comparison, we also choose five state-of-the-art methods, i.e., RoPE (Su et al., 2021), SPE (Liutkus et al., 2021), PermuteFormer (abbreviated as "PER") (Chen, 2021), T5 (Raffel et al., 2019) and Transformer-XL (Dai et al., 2019) (abbreviated as "XL"), and test them with both linear attention and vanilla attention. For linear attention, we employ 1 + elu(·) (Katharopoulos et al., 2020) as the kernel function.

Training configuration. Our experiments are implemented in the Fairseq framework (Ott et al., 2019) with V100 GPUs. All methods share the same configurations, which are listed in Appendix D.1.

Autoregressive language model. The autoregressive language model has 6 decoder layers and is trained on the WikiText-103 dataset (Merity et al., 2017). We use perplexity (PPL) as the evaluation metric and report the results in Table 2. We observe that under the linear setting, most variants of LRPE present performance gains over the baseline. Our best model, i.e., Type2, outperforms RoPE and SPE on both validation and test sets by a large margin, and achieves results comparable to PER with only a minor difference. Clearly, the proposed method is effective in encoding causal data.

Bidirectional language model.
The bidirectional model follows an encoder-only structure, i.e., Roberta (Liu et al., 2020), with 12 layers. We first pretrain it on the WikiText-103 dataset and present the results in Fig. 2 and Appendix D.2. Generally, LRPE (Type2 in this case) has better performance, i.e., smaller validation PPL at all evaluation steps, than competing methods. Notably, it surpasses RoPE, SPE and PER by nearly 10%, 27% and 37%, respectively, in terms of the final PPL, indicating its superiority in bidirectional language modeling. We then fine-tune the pretrained model on the GLUE benchmark (Wang et al., 2018). We sweep the learning rate over 1e-5, 3e-5, 6e-5 and 1e-4, choosing the best result after fine-tuning for 3 epochs. From Table 3, we find that the representative LRPE variant Type5 performs consistently better than the other methods on all metrics. The average score of Type5 beats RoPE, SPE and PER by more than 4.4%.

Machine translation. For this task, we adopt the base transformer model, which consists of 6 encoder layers and 6 decoder layers, and train it on the WMT'14 En-De dataset (Bojar et al., 2014). We ran each experiment 5 times and report the averaged results. Note that, in practice, we only embed linear attention and its corresponding relative positional encoding in the encoder, since we empirically find that the model cannot converge properly when linear attention appears in the decoder. We measure accuracy with BLEU, and the quantitative results on both validation and test sets are displayed in Table 4. Most variants in the LRPE family perform comparably to the competing methods on the validation set, and Type4 ranks first on the test data, which again demonstrates the validity of our LRPE. However, a few variants present less competitive results than the state of the art on the test data. Empirically, this is caused by the relatively high sensitivity of machine translation performance to parameter tuning, while all listed methods share an identical parameter setting. We will concentrate on further improving their accuracy by specializing the parameters for each variant in future work.

4.3. MODEL ANALYSIS

An explanation of LRPE. According to the discussion in Section 3.3, LRPE rotates the token feature through P and encodes the positional information through Λ^(s). In Table 5, we ablate the effect of the matrix P on the autoregressive language modeling task. Our approach with the Householder matrix and with the permutation matrix achieves marginally better results than the one without rotation (identity matrix). This indicates that better performance can be obtained by carefully selecting the projection of the positional encoding.

Complexity and efficiency. The proposed LRPE does not affect the computational complexity of the linear transformer, i.e., it preserves the linear complexity O(n). We also measure the training speed of the bidirectional language model on the same local machine (a GeForce GTX 1060 card), and observe that the speed with LRPE is only 9% slower than the baseline on average. A detailed comparison of efficiency can be found in Appendix D.3. In general, LRPE does not incur a significant computational burden on the transformer, and can fulfill practical needs while maintaining comparable efficiency.

Generalization to vanilla attention. Finally, we investigate the generalization of LRPE to vanilla attention. The results are reported in Fig. 2 and Tables 2, 3 and 4. The conclusion is consistent with that of the linear setting, i.e., LRPE improves the vanilla transformer baseline and achieves competitive performance against the competing methods. This indicates the good flexibility of LRPE, as it can be seamlessly applied to any attention type.

CONCLUSION

In this paper, we standardize the form of relative positional encoding in both linear and vanilla transformers, and focus on the case of linear attention. The unitary transformation is employed as a special solution of the linearized relative positional encoding, and the solutions under various constraints constitute the linearized relative positional encoding (LRPE) family. We validate the effectiveness of LRPE through extensive experiments on several NLP tasks with different transformer architectures. It outperforms state-of-the-art methods under both linear and vanilla settings.

Appendix

A MATHEMATICAL NOTATIONS

block-diag{W_1, W_2, ..., W_n} denotes the block-diagonal matrix with W_1, W_2, ..., W_n placed along its diagonal.

B COMPUTATION OF VANILLA/LINEAR ATTENTION

B.1 BASIC NOTATIONS

Both vanilla and linear attention blocks involve three matrices, i.e., Q (Query), K (Key) and V (Value). All of them are linear projections of the input X ∈ R^{n×d}, whose rows are x_1^T, ..., x_n^T:

Q = XW_Q ∈ R^{n×d}, K = XW_K ∈ R^{n×d}, V = XW_V ∈ R^{n×d},

where W_Q, W_K, W_V ∈ R^{d×d}. In vector form, q_s = W_Q^T x_s, k_s = W_K^T x_s, v_s = W_V^T x_s. The attention output is O ∈ R^{n×d} with rows o_1^T, ..., o_n^T.

B.2 VANILLA ATTENTION

In vanilla attention, the output is computed as a Softmax-weighted sum, i.e.,

o_s = Attention(q_s, K, V) = Σ_{t=1}^n a_{st} v_t = Σ_{t=1}^n exp(q_s^T k_t / √d) v_t / Σ_{r=1}^n exp(q_s^T k_r / √d),

O = Softmax(QK^T / √d)V.

B.3 LINEAR ATTENTION

Linear attention is formulated as follows:

o_s = LinearAttention(q_s, K, V) = Σ_{t=1}^n a_{st} v_t = Σ_{t=1}^n φ(q_s)^T φ(k_t) v_t / Σ_{t=1}^n φ(q_s)^T φ(k_t) = (Σ_{t=1}^n φ(k_t) v_t^T)^T φ(q_s) / (φ(q_s)^T Σ_{t=1}^n φ(k_t)),

O = Δ^{-1} φ(Q) φ(K)^T V = Δ^{-1} φ(Q)[φ(K)^T V], Δ = diag(φ(Q)[φ(K)^T 1_n]).

C PROOFS OF THEOREMS

C.1 MORE EXAMPLES

In the following, we provide additional examples of relative positional encoding under the canonical form.

RPR (Shaw et al., 2018):

f_rel(q_s, k_t) = q_s^H k_t + q_s^H c_{t-s}, c_{t-s} = w_{clip(t-s,k)}, clip(x, k) = max(−k, min(k, x)), w_s ∈ C^d, −k ≤ s ≤ k. (32)

The canonical form is

m = 2, q_s^(1) = q_s, k_t^(1) = k_t, W_{t-s}^(1) = I_d,
q_s^(2) = q_s, k_t^(2) = I_d, W_{t-s}^(2) = (1/d)[c_{t-s}, ..., c_{t-s}] (d columns).

Under review as a conference paper at ICLR 2023

DeBERTa (Huang et al., 2020):

f_rel(q_s, k_t) = q_s^H k_t + q_s^H k̃_{g(s-t)} + q̃_{g(t-s)}^H k_t, g(x) = 0 if x ≤ −c; 2c − 1 if x ≥ c; x + c otherwise. (34)

The canonical form is

m = 3, q_s^(1) = q_s, k_t^(1) = k_t, W_{t-s}^(1) = I_d,
q_s^(2) = q_s, k_t^(2) = I_d, W_{t-s}^(2) = (1/d)[k̃_{g(s-t)}, ..., k̃_{g(s-t)}] (d columns),
q_s^(3) = I_d, k_t^(3) = k_t, W_{t-s}^(3) = (1/d)[q̃_{g(t-s)}, ..., q̃_{g(t-s)}] (d columns).

cosFormer (Qin et al., 2022):

f_rel(q_s, k_t) = q_s^H k_t cos(α(t − s)),

which indicates that the relative positional encoding is effectively a coefficient term on the attention matrix; as such, it can be derived via a positional matrix primitive carrying the coefficient:

m = 1, q_s^(1) = q_s, k_t^(1) = k_t, W_{t-s}^(1) = cos(α(t − s)) I_d.

C.2 LINEARIZED RELATIVE POSITIONAL ENCODING

Proof of the complexity claim in Section 3.2. We only need to prove that the time complexity is linear with respect to n. To this end, we first set up notations: let Q, K, V ∈ C^{n×d} have rows q_s^H, k_s^H, v_s^H, and let Q̃, K̃ ∈ C^{n×d} have rows (M_s q_s)^H and (M_s k_s)^H, respectively. The time complexity of transforming Q, K into Q̃, K̃ is O(nd^2). The next step is to calculate the output:

O = Δ^{-1} Q̃ K̃^H V = Δ^{-1} Q̃[K̃^H V], Δ = diag(Q̃[K̃^H 1_n]). (39)

Clearly, Eq. 39 is a standard formulation of linear attention with time complexity O(nd^2). Combining it with the first step, the total time complexity is O(nd^2), which is unchanged.
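The order of operations in Eq. 39 can be illustrated as follows (our own sketch; P = I_d and the normalizer Δ are omitted for brevity): rotating each row by M_s = diag(exp(isα)) leaves the standard O(nd^2) kernelized computation unchanged, and the resulting scores depend only on the offset t − s.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 10, 4
Q, K, V = rng.normal(size=(3, n, d))
alphas = rng.uniform(0, np.pi, size=d)

# Rotate each row by the unitary M_s = diag(exp(i s alpha)); P = I_d here.
phases = np.exp(1j * np.outer(np.arange(n), alphas))   # row s holds exp(i s alpha)
Qt, Kt = phases * Q, phases * K                        # rows (M_s q_s), (M_s k_s)

# Linear-complexity order: Kt^T V is d x d, so the full product costs O(n d^2).
O_linear = Qt.conj() @ (Kt.T @ V)

# Reference: the quadratic order, which materializes the n x n score matrix.
O_quad = (Qt.conj() @ Kt.T) @ V
assert np.allclose(O_linear, O_quad)

# Each score equals q_s^H W_{t-s} k_t with W_{t-s} = diag(exp(i (t-s) alpha)).
s, t = 2, 6
assert np.allclose(np.vdot(Qt[s], Kt[t]),
                   Q[s] @ (np.exp(1j * (t - s) * alphas) * K[t]))
```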

C.3 LINEARIZED RELATIVE POSITIONAL ENCODING

Before the proofs, we first state the following theorems (Yao & Algebra, 2015):

Theorem C.1. If a matrix W ∈ C^{d×d} is unitary, there exists another unitary matrix P ∈ C^{d×d} such that

W = P^H Λ P, Λ = diag{exp(iθ_1), ..., exp(iθ_d)}, i^2 = −1.

Theorem C.2. If a matrix W ∈ R^{d×d} is orthogonal, there exists another orthogonal matrix P ∈ R^{d×d} such that

W = P^T Λ P, Λ = diag{Λ_1, ..., Λ_r; 1, ..., 1; −1, ..., −1}, Λ_k = [cos θ_k, −sin θ_k; sin θ_k, cos θ_k], k = 1, ..., r.

C.4 UNITARY (SOLUTION 1)

Proof of Proposition 3.1. According to Theorem C.1, we can assume that W_s has the following form, where P ∈ C^{d×d} is a unitary matrix:

W_s = P^H Λ^(s) P, Λ^(s) = diag{exp(iθ_1^(s)), ..., exp(iθ_d^(s))}.

Hence, Eq. 14 is equivalent to

W_s^H W_t = W_{t-s}
⟺ P^H (Λ^(s))^H P P^H Λ^(t) P = P^H Λ^(t-s) P
⟺ (Λ^(s))^H Λ^(t) = Λ^(t-s)
⟺ exp(i(θ_k^(t) − θ_k^(s))) = exp(iθ_k^(t-s)), k = 1, ..., d. (43)

In this case, for all k = 1, ..., d, we have

θ_k^(t) − θ_k^(s) = θ_k^(t-s) + 2mπ, m ∈ Z.

Since the 2mπ term does not affect the result, we can take m = 0, i.e.,

θ_k^(t) − θ_k^(s) = θ_k^(t-s).

Taking t = s + 1, we get θ_k^(s+1) − θ_k^(s) = θ_k^(1), hence θ_k^(s) = sθ_k^(1) ≜ sα_k. (46)

C.5 ORTHOGONAL (SOLUTION 2)

Proof of Proposition 3.2. According to Theorem C.2, we can assume that W_s has the following form, where P ∈ R^{d×d} is an orthogonal matrix:

W_s = P^T Λ^(s) P, Λ^(s) = diag{A^(s), B^(s), C^(s)},
A^(s) = block-diag{A_1^(s), ..., A_p^(s)} ∈ R^{2p×2p}, B^(s) = I_q ∈ R^{q×q}, C^(s) = −I_r ∈ R^{r×r},
A_k^(s) = [cos θ_k^(s), −sin θ_k^(s); sin θ_k^(s), cos θ_k^(s)].

Hence, Eq. 14 is equivalent to

W_s^T W_t = W_{t-s} ⟺ (Λ^(s))^T Λ^(t) = Λ^(t-s),

which decomposes block-wise into

(A^(s))^T A^(t) = A^(t-s), (B^(s))^T B^(t) = B^(t-s), (C^(s))^T C^(t) = C^(t-s).

For A^(s), considering the k-th block and using the angle-addition formulas, we get

(A_k^(s))^T A_k^(t) = [cos(θ_k^(t) − θ_k^(s)), −sin(θ_k^(t) − θ_k^(s)); sin(θ_k^(t) − θ_k^(s)), cos(θ_k^(t) − θ_k^(s))] = A_k^(t-s) = [cos θ_k^(t-s), −sin θ_k^(t-s); sin θ_k^(t-s), cos θ_k^(t-s)].

Hence, for all k, we have θ_k^(t) − θ_k^(s) = θ_k^(t-s) + 2mπ, m ∈ Z. Since the 2mπ term does not affect the result, we can take m = 0, i.e., θ_k^(t) − θ_k^(s) = θ_k^(t-s). Taking t = s + 1, we have θ_k^(s+1) − θ_k^(s) = θ_k^(1), so θ_k^(s) = sθ_k^(1) ≜ sα_k. (53)

Next, for B^(s), the conclusion is immediate: (B^(s))^T B^(t) = I_q^T I_q = I_q = B^(t-s). (54)

Finally, for C^(s), we have (C^(s))^T C^(t) = (−I_r)^T(−I_r) = I_r ≠ −I_r = C^(t-s). (55)

Therefore, we must have r = 0.

C.6 PERMUTATION (SOLUTION 3)

Prior to the proof, we first provide some relevant definitions and propositions.

Definition C.3. A permutation π is a bijection on the integer set {1, 2, ..., d}: π : {1, 2, ..., d} → {1, 2, ..., d}, d ∈ Z^+.

Definition C.4. For a matrix M ∈ R^{d×d} with rows m_1^T, ..., m_d^T (m_k ∈ R^d, k = 1, ..., d), the matrix M_π is defined as the matrix whose rows are m_{π(1)}^T, ..., m_{π(d)}^T.

Definition C.5. For the identity matrix I_d ∈ R^{d×d} and a permutation π, we define Λ_k = (I_d)_{π^k}.

For Λ_k, we have the following important properties.

Lemma C.6. For a permutation π, a matrix M ∈ R^{d×d}, and Λ_k ∈ R^{d×d} defined in C.5, we have M_π = Λ_1 M.

Proof. Write I_d with rows e_1^T, ..., e_d^T, where e_k ∈ R^d, k = 1, ..., d, is the one-hot vector whose k-th element is one. Since e_k^T M = m_k^T, we get
Λ_1 M = [e_{π(1)}^T; ...; e_{π(d)}^T] M = [e_{π(1)}^T M; ...; e_{π(d)}^T M] = [m_{π(1)}^T; ...; m_{π(d)}^T] = M_π.

Theorem C.7. For Λ_k defined in C.5, we have Λ_k = Λ_1^k.

Proof. We proceed by induction. For k = 1, the conclusion is obvious. Assuming the conclusion holds for k = s − 1, for k = s we have
Λ_s = (I_d)_{π^s} = ((I_d)_{π^{s−1}})_π = (Λ_{s−1})_π = (Λ_1^{s−1})_π.
It remains to show that (Λ_1^{s−1})_π = Λ_1 Λ_1^{s−1} = Λ_1^s, which follows from Lemma C.6.

Theorem C.8. The matrices Λ_k ∈ R^{d×d} defined in C.5 are orthogonal, i.e., Λ_k Λ_k^T = Λ_k^T Λ_k = I_d.

Proof. We first prove the conclusion for k = 1:
(Λ_1 Λ_1^T)_{st} = e_{π(s)}^T e_{π(t)} = δ_{st}, so Λ_1 Λ_1^T = I_d.
Since Λ_1 is a square matrix, we also have Λ_1^T Λ_1 = I_d. For general k, using Theorem C.7,
Λ_k Λ_k^T = Λ_1^k (Λ_1^k)^T = Λ_1^k (Λ_1^T)^k = Λ_1^{k−1} Λ_1 Λ_1^T (Λ_1^T)^{k−1} = Λ_1^{k−1} (Λ_1^T)^{k−1} = ... = I_d.
By the same argument, Λ_k^T Λ_k = I_d.

Based on the above conclusions, we can now prove Proposition 3.3.

Proof of Proposition 3.3. According to Theorem C.8 and the fact that a product of orthogonal matrices is orthogonal, we can assume that W_k has the following form (P ∈ R^{d×d} is an orthogonal matrix):
W_k = P^T Λ^{(k)} P. (72)
The next step is to verify that it satisfies Eq. 14, which follows from Theorems C.7 and C.8:
W_s^T W_t = P^T (Λ^{(s)})^T P P^T Λ^{(t)} P = P^T (Λ^{(s)})^T Λ^{(t)} P = P^T (Λ^{(s)})^T (Λ^{(1)})^t P
= P^T (Λ^{(s)})^T (Λ^{(1)})^s (Λ^{(1)})^{t−s} P = P^T (Λ^{(s)})^T Λ^{(s)} (Λ^{(1)})^{t−s} P = P^T Λ^{(t−s)} P = W_{t−s}.

C.7 IMPLEMENTATION

LRPE (W_s = P^H Λ^{(s)} P) contains two components: the fixed unitary matrix P and the unitary matrix family Λ^{(s)} described in Propositions 3.1, 3.2, and 3.3. We first introduce the choice of the matrices P and Λ^{(s)}, and then illustrate some implementation tricks.

Choice of matrices

For the matrix P, we employ three types:
• Householder matrix: parameterized by a vector v ∈ R^d, i.e., P = I_d − 2vv^T/(v^T v). In our implementation, we sample v from a standard normal distribution and make it either deterministic or learnable.
• Permutation matrix: formulated by the following odd-even permutation (inspired by FLASH (Hua et al., 2022)): π(2k) = k, π(2k + 1) = ⌊d/2⌋ + k, 1 ≤ 2k, 2k + 1 ≤ d.
• FFT matrix: the matrix form of the FFT (Fast Fourier Transform).
For the matrix family Λ^{(s)}, we use the following settings:
• For unitary (Solution 1) (3.1), we use the same method as in Su et al. (2021) with α_t initialized to 10000^{−2t/d} and kept deterministic. Since this method involves complex numbers, we only use the FFT matrix for the choice of P.
• For orthogonal (Solution 2) (3.2), we test two versions. In the first version, we set the dimension of the identity submatrix to q = ⌊d/2⌋, initialize α_t = 10000^{−2t/d} as in Su et al. (2021), and keep it deterministic. In the second version, we choose q = 0 with the same initialization and make it learnable.
  – Another notable version chooses q = 0 with the same initialization α_t = 10000^{−2t/d} and keeps it deterministic. Using this version together with the identity matrix for P recovers RoPE (Su et al., 2021).
• For permutation (Solution 3) (3.3), we randomly choose the permutation and keep it deterministic.
  – Notice that combining this method with the identity matrix for P yields a version of PermuteFormer (Chen, 2021).
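The three choices of P can be sketched as follows. This is a minimal illustration, not the paper's implementation: the normalization of the DFT matrix and the gather convention of the odd-even permutation are assumptions, and each candidate is simply checked for orthogonality or unitarity.

```python
import numpy as np

d = 8
rng = np.random.default_rng(3)

# Householder matrix: P = I_d - 2 v v^T / (v^T v), with v drawn from N(0, 1).
v = rng.normal(size=(d, 1))
P_house = np.eye(d) - 2 * (v @ v.T) / (v.T @ v)

# Odd-even permutation matrix: even indices gathered into the front half,
# odd indices into the back half (one possible reading of the rule above).
pi = np.concatenate([np.arange(0, d, 2), np.arange(1, d, 2)])
P_perm = np.eye(d)[pi]

# FFT (DFT) matrix, normalized by 1/sqrt(d) so that it is unitary.
k = np.arange(d)
P_fft = np.exp(-2j * np.pi * np.outer(k, k) / d) / np.sqrt(d)

for P in (P_house, P_perm):
    assert np.allclose(P.T @ P, np.eye(d))             # orthogonal
assert np.allclose(P_fft.conj().T @ P_fft, np.eye(d))  # unitary
```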

Implementation tricks

The computation can be simplified according to the following fact:
q_s^H W_s^H W_t k_t = q_s^H P^H (Λ^{(s)})^H P P^H Λ^{(t)} P k_t = q_s^H P^H (Λ^{(s)})^H Λ^{(t)} P k_t = (Λ^{(s)} P q_s)^H (Λ^{(t)} P k_t). (76)
Hence, in practice, we can use W_s = Λ^{(s)} P instead of W_s = P^H Λ^{(s)} P to reduce the computational cost.
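Eq. 76 can be checked numerically for the unitary family: encoding the query as Λ^{(s)}Pq_s and the key as Λ^{(t)}Pk_t reproduces the full score q_s^H W_s^H W_t k_t exactly, since the trailing P cancels between the two factors. A minimal sketch (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
P, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))  # unitary P
alpha = rng.normal(size=d)

def Lam(s):
    return np.diag(np.exp(1j * s * alpha))

def W(s):
    return P.conj().T @ Lam(s) @ P  # full W_s = P^H Lambda^{(s)} P

q = rng.normal(size=d) + 1j * rng.normal(size=d)
k = rng.normal(size=d) + 1j * rng.normal(size=d)
s, t = 2, 6

full = q.conj() @ W(s).conj().T @ W(t) @ k
# Cheaper form of Eq. 76: the inner P P^H cancels, so only Lambda and one
# multiplication by P remain on each side.
cheap = (Lam(s) @ P @ q).conj() @ (Lam(t) @ P @ k)
assert np.isclose(full, cheap)
```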

C.8 PSEUDOCODE

In this section, we provide pseudocode for LRPE in Python:



Figure 1: Illustration of existing relative positional encoding (left) and the proposed LRPE (right). Q, K, and V are all of shape n × d, where n is the input length and d is the feature dimension. Tensors in the same dashed box are associated for computation. In vanilla relative positional encoding, the query-key attention has to be computed first, leading to quadratic complexity. W_{t−s} refers to the relative positional encoding, where t and s are the positional indices of the query and key, respectively. Our LRPE achieves a decomposable encoding, i.e., W_t and W_s depend only on the positions of the query and key, making it fully compatible with linear transformers. When dealing with long sequences, d ≪ n, so the computational complexity is dominated by n, rendering d negligible.

Figure 2: Validation PPL of linear (left) and vanilla attention (right) for the bidirectional language model pretrained on the WikiText-103 dataset. In both cases, the best variant of the proposed LRPE achieves lower PPL and faster convergence than competing methods.

# Pseudocode of the LRPE module (fragments grouped by function).

# Selecting the core matrix family Lambda^(s):
if self.core_matrix == 1:
    if self.theta_learned:
        print("Learn theta!")
        self.theta = nn.Parameter(
            (10000 ** (-2 / embedding_dim * torch.arange(embedding_dim // 2)))
            .reshape(1, 1, -1)
        )
    else:
        print(f"Theta_type {self.theta_type}")
    ...
    return self.do_permutation
elif self.core_matrix == 4:
    return self.complex_exp

def get_permutation(self, max_positions, embedding_dim):
    # Build the table of permuted index vectors pi^0, ..., pi^{n-1}
    # by repeatedly applying a random permutation.
    permutation = torch.randperm(embedding_dim).reshape(1, -1)
    expanded = [torch.arange(embedding_dim).unsqueeze(0)]
    for _ in range(max_positions - 1):
        previous = expanded[-1]
        current = previous.gather(-1, permutation)
        expanded.append(current)
    expanded = torch.stack(expanded, dim=1)
    return expanded

def odd_even_permutation(self, x):
    # 2k -> k, 2k + 1 -> d + k
    e = x.shape[-1]
    d = e - e // 2
    permutation = torch.arange(e)
    index = torch.arange(e)
    permutation[::2] = index[::2] // 2
    permutation[1::2] = (index[1::2] - 1) // 2 + d
    permutation = permutation.to(x.device)
    return x.gather(-1, permutation.expand_as(x))

# Initializing theta (three variants):
if self.theta_type == "a":
    theta = 10000 ** (-2 / e * torch.arange(e // 2))
elif self.theta_type == "b":
    theta = np.pi / 2 / l / (e // 2) * torch.arange(1, e // 2 + 1)
elif self.theta_type == "c":
    theta = np.pi / 2 / l / torch.arange(1, e // 2 + 1)

LRPE variants. The rows of the table represent the type of the P matrix, and the columns represent the type of the Λ (s) matrix.

Quantitative results of the autoregressive language model on the WikiText-103 dataset. The best result is highlighted in bold and the second best is underlined. ↓ means smaller is better.

Quantitative results of the Roberta model fine-tuned on the GLUE dataset. MNLI is reported on the matched/mismatched splits. All downstream tasks are measured by accuracy. The best result is highlighted in bold and the second best is underlined. ↑ means larger is better.

Quantitative results of machine translation on the WMT'14 En-De dataset. Evaluation metrics include the validation loss, validation BLEU (Papineni et al., 2002), and test SacreBLEU (Post, 2018). The best result is highlighted in bold and the second best is underlined. ↑ means larger is better. ±∆ denotes the standard deviation.

Ablation results with different rotation matrix P for language modeling on the WikiText-103 dataset.

Mathematical notations used in the paper.


# Continuation of the core-matrix selection:
    return self.mix_reflect
elif self.core_matrix == 3:
    ...

# Applying the rotation Lambda^(s) (RoPE-style):
theta = theta.reshape(1, 1, -1).to(x)
theta = torch.stack([theta, theta], dim=-1).reshape(1, 1, e)
theta = theta * torch.arange(l).reshape(1, -1, 1).to(x)
# (-q1, -q3), (q0, q2) -> (-q1, q0, -q3, q2)
x_half = torch.stack([-x[..., 1::2], x[..., ::2]], dim=-1).reshape_as(x)

For published datasets, WikiText-103 is obtained from https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/, under the Creative Commons Attribution-ShareAlike License. The GLUE dataset is obtained from https://gluebenchmark.com/. The WMT'14 En-De dataset is downloaded from https://www.statmt.org/wmt14/.

D.2 RESULTS OF BIDIRECTIONAL LANGUAGE MODEL

We report the pretraining results of the bidirectional language model in Table 8. Our LRPE achieves competitive performance with both linear attention and vanilla attention. Notably, the LRPE variant LRPE_h^{ol} has the best quantitative results on all evaluation metrics. We also report the training speed of different methods in Table 9, which indicates that our method maintains good efficiency without incurring too much computational burden.

Table 9: Training speed of different methods on the bidirectional language model. The values denote the speed relative to the base method. ↑ means larger is faster.

Method                        Relative speed↑ (linear)   Relative speed↑ (vanilla)
Base (Vaswani et al., 2017)   1.00                       1.00
RoPE (Su et al., 2021)        0.95                       0.97
SPE (Liutkus et al., 2021)    0.42                       0.41
PER (Chen, 2021)              0.88                       -
T5 (Raffel et al., 2019)      -                          0.70
LRPE                          0.91                       0.95

