RELATIVE POSITIONAL ENCODING FAMILY VIA UNITARY TRANSFORMATION

Abstract

Relative positional encoding is widely used in vanilla and linear transformers to represent positional information. However, existing encoding methods of a vanilla transformer are not always directly applicable to a linear transformer, because the latter requires a decomposition of the query and key representations into separate kernel functions. Moreover, principles for designing encoding methods suitable for linear transformers remain under-studied. In this work, we unify a variety of existing encoding approaches under a canonical form and further propose a family of relative positional encoding algorithms via unitary transformation. Our formulation leads to a principled framework that can be used to develop new relative positional encoding methods that preserve linear space-time complexity. Equipped with different parameters, the proposed linearized relative positional encoding (LRPE) family derives effective encodings for various applications. Experiments show that, compared with existing methods, LRPE achieves competitive performance on language modeling and various challenging downstream tasks, e.g., machine translation and text classification. It also highlights a general paradigm for designing broadly more relative positional encoding methods, applicable to both linear and vanilla transformers.

1. INTRODUCTION

Transformers have achieved remarkable progress in natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Liu et al., 2021; Arnab et al., 2021) and audio processing (Gulati et al., 2020). As an important ingredient of transformers, positional encoding assigns a unique representation to each position of a token in a sequence so that transformers can sense the positions of input tokens. Among these encoding methods, absolute positional encoding (Vaswani et al., 2017; Sukhbaatar et al., 2015; Devlin et al., 2019; Liu et al., 2020) maps each individual position index into a continuous encoding, whereas relative positional encoding (Shaw et al., 2018; Su et al., 2021; Horn et al., 2021; Liutkus et al., 2021; Huang et al., 2020; Raffel et al., 2019) generates an encoding for each query-key pair, representing their relative positional offset. We focus on relative positional encoding, as it is not constrained by input lengths (Chen, 2021) while showing superior performance (Shaw et al., 2018). Linear transformers (Chen, 2021; Qin et al., 2022; Su et al., 2021) have attracted more attention recently, as they achieve linear space-time complexity with respect to input sequence length while maintaining comparable performance with vanilla transformers. Most existing linear transformers use absolute positional encoding methods to encode positional information, since most existing relative positional encoding methods are designed for vanilla transformers and are not directly applicable to linear transformers. The main cause behind this limitation is that linear transformers decompose the query and key representations in the self-attention modules into separate kernel functions to achieve linear space-time complexity. Such an additional requirement on decomposability is not always satisfied by existing relative positional encoding methods.
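To make the decomposability requirement concrete, the following sketch (our own minimal NumPy illustration, not the paper's code; the kernel function `phi` is an assumed example) contrasts vanilla softmax attention with kernelized linear attention, where reordering the matrix products avoids materializing the n-by-n score matrix:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Vanilla attention: the n x n score matrix is materialized, O(n^2 d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linear attention: similarity is decomposed as phi(q)^T phi(k), so
    # K^T V can be associated first into a d x d summary, O(n d^2).
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                    # d x d, independent of n
    Z = Qp @ Kp.sum(axis=0)          # per-row normalizer, shape (n,)
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
assert linear_attention(Q, K, V).shape == (n, d)
```

A relative positional encoding that multiplies scores by a matrix indexed jointly by (t, s) would break this reassociation, which is the obstacle the paper addresses.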
On the other hand, despite some individual works (Qin et al., 2022; Chen, 2021), general principles for designing relative positional encoding for linear transformers remain largely under-studied. A recent work, RoPE (Su et al., 2021), proposes a new set of multiplicative encoding solutions based on rotary positional encoding that can be applied to linear transformers. In Section C.7, we show that RoPE can be seen as a special form of LRPE. In this work, we aim to bridge this gap and study a principled framework to develop relative positional encodings applicable to both linear and vanilla transformers. To this end, we start by presenting a canonical form of relative positional encoding, which reveals that differences among existing encoding methods boil down to choices of a set of query, key and relative positional matrix primitives. By properly selecting and composing these primitives, we can derive various existing encoding methods for vanilla (Vaswani et al., 2017; Huang et al., 2020; Shaw et al., 2018) and linear (Qin et al., 2022) transformers.
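As a rough illustration of the canonical form described above, the sketch below (our own minimal NumPy code, not the paper's; the explicit offset-indexed matrix bank is an illustrative assumption) scores each query-key pair through a relative positional matrix indexed by the offset t - s. Choosing the identity matrix for every offset recovers plain dot-product attention, one of the special cases the canonical form subsumes:

```python
import numpy as np

def canonical_scores(Q, K, W):
    # Canonical relative positional score: S[t, s] = q_t^T W_{t-s} k_s,
    # where W is a bank of (2n - 1) d x d matrices, one per offset.
    n, d = Q.shape
    S = np.zeros((n, n))
    for t in range(n):
        for s in range(n):
            S[t, s] = Q[t] @ W[t - s + n - 1] @ K[s]
    return S

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K = rng.standard_normal((2, n, d))

# W_{t-s} = I for every offset recovers plain dot-product attention.
W_id = np.stack([np.eye(d)] * (2 * n - 1))
assert np.allclose(canonical_scores(Q, K, W_id), Q @ K.T)
```

Note the double loop: with a general W, the score for every (t, s) pair must be computed explicitly, which is exactly the quadratic cost that a decomposable choice of W avoids.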

Taking advantage of the canonical form, we introduce the main contribution of our work, i.e., a special family of relative positional encoding methods called linearized relative positional encoding (LRPE). Specifically, we supply a sufficient condition for designing encoding methods compatible with linear transformers and prove that a linearized relative positional encoding must be a unitary transformation. The benefits of using a unitary transformation are two-fold. First, since it is derived from a decomposable positional matrix, it maintains linear space-time complexity, as shown in Fig. 1. Second, the properties of unitary transformations allow us to effectively derive a family of closed-form solutions. In particular, we show that a number of encoding methods belong to the LRPE family, including those used in RoPE (Su et al., 2021) and PermuteFormer (Chen, 2021). Furthermore, LRPE sheds light on a simple yet flexible theoretical paradigm for developing new effective relative positional encodings. To demonstrate this, we non-exhaustively derive three additional LRPE encoding methods by parameterizing the generic solution differently, including solutions in either the real or the complex domain. Since unitary transformations are special cases of the relative positional matrix, LRPE is applicable to both linear and vanilla transformers, and can be used within encoder and/or decoder layers. We experimentally demonstrate the effectiveness of the LRPE family on autoregressive and bidirectional language modeling, and on challenging downstream tasks, including machine translation and text classification. Results show that LRPE achieves competitive capability in representing relative positional information, commonly resulting in superior performance over previous encoding methods.
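To illustrate the unitary-transformation idea on its rotation-based special case (the RoPE-style instance mentioned above; the block-diagonal construction and frequency schedule here are illustrative assumptions, not the paper's exact parameterization), the sketch below checks that rotating q and k independently by their own positions yields a score that depends only on the relative offset:

```python
import numpy as np

def rotation(theta, t):
    # Block-diagonal 2x2 rotations by angles t * theta_i: a real unitary
    # (orthogonal) matrix W_t with W_t^T W_s = W_{s-t}.
    c, s = np.cos(t * theta), np.sin(t * theta)
    half = len(theta)
    R = np.zeros((2 * half, 2 * half))
    for i in range(half):
        R[2*i:2*i+2, 2*i:2*i+2] = [[c[i], -s[i]], [s[i], c[i]]]
    return R

d = 8
theta = 1.0 / (10000.0 ** (np.arange(d // 2) / (d // 2)))  # assumed schedule
rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, d))

t, s = 5, 2
# Decomposable form: rotate q and k independently by their own positions...
lhs = (rotation(theta, t) @ q) @ (rotation(theta, s) @ k)
# ...which equals applying the relative offset on one side only, since
# angles of commuting rotations add: W_t^T W_s = W_{s-t}.
rhs = q @ (rotation(theta, s - t) @ k)
assert np.isclose(lhs, rhs)
```

Because W_t touches only the query at position t (and W_s only the key at position s), the transformed queries and keys can be fed straight into the linear-attention reassociation, preserving the O(n d^2) cost.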
In summary, our main contributions are three-fold:

• We present a canonical form of relative positional encoding, which derives most existing relative positional encoding methods as its special cases, including those used in linear and vanilla transformers.

• Based on the canonical form, we propose linearized relative positional encoding (LRPE), a simple yet principled formulation that derives an encoding family respecting the linear space-time complexity of linear transformers, while also being applicable to vanilla transformers. We show that several existing relative positional encoding methods in linear transformers belong to the LRPE family, and we provide additional particular solutions from this generic form.

• Experiments on various downstream tasks, including language modeling, machine translation and text classification, show that the LRPE family yields more robust and commonly superior results across tasks than previous relative encoding methods, and is flexible in being applicable to both linear and vanilla transformers.



Figure 1: Illustration of existing relative positional encoding (left) and the proposed LRPE (right). Q, K, and V are all of shape n by d, where n is the input length and d is the feature dimension. Tensors in the same dashed-line box are associated for computation. In vanilla relative positional encoding, the query-key attention has to be calculated first, leading to quadratic complexity. W_{t-s} refers to the relative positional encoding, where t, s are the positional indices of the query and key, respectively. Our LRPE achieves a decomposable encoding, i.e., W_t and W_s depend only on the positions of the query and key, making it fully compatible with linear transformers. When dealing with long sequences, d ≪ n, the computational complexity is dominated by n, rendering d negligible.
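The two association orders in Figure 1 can be compared with a back-of-the-envelope FLOP count (a rough sketch under the shapes stated in the caption, ignoring constants, the softmax, and the normalizer):

```python
def flops_quadratic(n, d):
    # (Q K^T) V: n x n scores first, then the n x n matrix times V.
    return n * n * d + n * n * d      # ~ O(n^2 d)

def flops_linear(n, d):
    # Q (K^T V): d x d summary first, then each of n rows times it.
    return n * d * d + n * d * d      # ~ O(n d^2)

n, d = 16384, 64
assert flops_linear(n, d) < flops_quadratic(n, d)
print(flops_quadratic(n, d) // flops_linear(n, d))  # prints 256, i.e. n / d
```

The speedup factor is n / d, which is why the caption notes that for long sequences with d ≪ n the linear ordering dominates.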

