RETHINKING POSITIONAL ENCODING IN LANGUAGE PRE-TRAINING

Abstract

In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in absolute positional encoding, the addition operation applied to positional embeddings and word embeddings introduces mixed correlations between the two heterogeneous information sources. This may bring unnecessary randomness into the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol [CLS] the same as that of other words is a reasonable design, considering its special role (the representation of the entire sentence) in downstream tasks. Motivated by the above analysis, we propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE). In the self-attention module, TUPE computes the word contextual correlation and the positional correlation separately, with different parameterizations, and then adds them together. This design removes the mixed and noisy correlations over heterogeneous embeddings and offers more expressiveness by using different projection matrices. Furthermore, TUPE unties the [CLS] symbol from other positions, making it easier for [CLS] to capture information from all positions. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of the proposed method. Code and models are released at https://github.com/guolinke/TUPE.

1. INTRODUCTION

The Transformer model (Vaswani et al., 2017) is the most widely used architecture in language representation learning (Liu et al., 2019; Devlin et al., 2018; Radford et al., 2019; Bao et al., 2020). In the Transformer, positional encoding is an essential component, since the other main components of the model are entirely invariant to sequence order. The original Transformer uses absolute positional encoding, which assigns each position an embedding vector. The positional embedding is added to the word embedding, which has been found to help significantly in learning contextual representations of words at different positions. Besides absolute positional encoding, Shaw et al. (2018) and Raffel et al. (2019) propose relative positional encoding, which incorporates carefully designed bias terms inside the self-attention module to encode the distance between any two positions.

In this work, we revisit and study the formulation of the widely used absolute and relative positional encodings. First, we question the rationale for adding the word embedding to the absolute positional embedding in the input layer. Since the two kinds of embeddings are heterogeneous, this addition introduces mixed correlations between positional information and word semantics. For example, by expanding the dot product of queries and keys in the self-attention module of the first layer, we find explicit terms that use a "word" to query "positions" and vice versa. However, words may have only weak correlations with where they appear in the sentence. Our empirical analysis supports this by showing that, in a well-trained model, such correlations are noisy.

Second, we notice that the BERT model does not handle only natural language words. A special symbol [CLS] is usually attached to the sentence.
It is widely acknowledged that this symbol receives and summarizes useful information from all positions, and the contextual representation of [CLS] is used as the representation of the sentence in downstream tasks. Because the role of the [CLS] symbol differs from that of regular words, which naturally carry semantics, we argue that treating its position the same as word positions in the sentence is ineffective. For example, if we apply relative positional encoding to this symbol, the attention distribution of some heads will likely be biased toward the first several words, which hurts the understanding of the whole sentence.

Based on the investigation above, we propose several simple yet effective modifications to the current methods, which lead to a new positional encoding called Transformer with Untied Positional Encoding (TUPE) for language pre-training (see Figure 1). In TUPE, the Transformer uses only the word embedding as input. In the self-attention module, different types of correlations are computed separately to reflect different aspects of information, including word contextual correlation and absolute (and relative) positional correlation. Each kind of correlation has its own parameters, and the correlations are added together to generate the attention distribution. A specialized positional correlation is further assigned to the [CLS] symbol, aiming to capture the global representation of the sentence correctly.

First, in TUPE the positional correlation and the word contextual correlation are decoupled and computed using different parameters. This design removes the randomness in word-to-position (or position-to-word) correlations and gives more expressiveness to characterize the relationship between a pair of words or positions.
Second, TUPE uses a different function to compute the correlations between the [CLS] symbol and other positions. This flexibility can help the model learn an accurate representation of the whole sentence.

We provide an efficient implementation of TUPE. To validate the method, we conduct extensive experiments and ablation studies on the GLUE benchmark. Empirical results confirm that TUPE improves model performance on almost all tasks. In particular, we observe that by imposing this inductive bias to encode the positional information, the model can be trained more effectively, and the training time of the pre-training stage can be substantially reduced.
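
The mixed correlations discussed in this section can be made explicit with a short expansion. The following is a sketch under assumed notation that does not appear in this section: w_i and p_i denote the word and positional embeddings at position i, W^Q and W^K the query and key projections of the first layer, and U^Q and U^K hypothetical separate projections for the positional embeddings. Scaling factors and the softmax are omitted.

```latex
% Absolute positional encoding: embeddings are summed before projection,
% so the unnormalized attention score expands into four terms.
\begin{align*}
\big((w_i + p_i) W^Q\big)\big((w_j + p_j) W^K\big)^\top
  &= (w_i W^Q)(w_j W^K)^\top            % word-to-word
  \\ &\quad + (w_i W^Q)(p_j W^K)^\top   % word-to-position
  \\ &\quad + (p_i W^Q)(w_j W^K)^\top   % position-to-word
  \\ &\quad + (p_i W^Q)(p_j W^K)^\top   % position-to-position
\end{align*}

% Untied form (assumed notation): the two cross terms are dropped and the
% positional term gets its own projections U^Q, U^K.
\begin{equation*}
\alpha_{ij} \;\propto\; (w_i W^Q)(w_j W^K)^\top + (p_i U^Q)(p_j U^K)^\top
\end{equation*}
```

The second and third terms of the expansion are the word-to-position and position-to-word correlations that the text argues are weak and noisy. In the untied form, only the word-to-word and position-to-position terms remain, each with its own parameters.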

2.1. ATTENTION MODULE

The attention module (Vaswani et al., 2017) is formulated as querying a dictionary with key-value pairs, e.g., $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$, where d is the dimensionality of the hidden



Figure 1: The architecture of TUPE. The positional correlation and the word correlation are computed separately and added together in the self-attention module. The positional attention related to the [CLS] token is made less position-dependent, to encourage it to capture global information.
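
The computation summarized in Figure 1 can be illustrated with a minimal, single-head sketch in plain Python. It is not the released implementation. The function and parameter names (`untied_attention_weights`, `Wq`, `Wk`, `Uq`, `Uk`, `cls_bias`) are hypothetical, the 1/sqrt(2d) scaling is an assumption chosen to keep the sum of two terms on a comparable scale, and the untying of [CLS] is modeled, as an assumption, by two scalars that replace the positional score for pairs involving index 0.

```python
import math


def project(x, W):
    """Multiply a row vector x (length n) by a matrix W (n x m)."""
    return [sum(x[k] * W[k][c] for k in range(len(x))) for c in range(len(W[0]))]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def untied_attention_weights(words, positions, Wq, Wk, Uq, Uk, cls_bias=None):
    """Attention weights from separately parameterized word and position scores.

    words, positions: lists of n vectors of length d.
    Wq, Wk: projections applied to word embeddings only (hypothetical names).
    Uq, Uk: projections applied to positional embeddings only (hypothetical names).
    cls_bias: optional (from_cls, to_cls) scalars; assumes index 0 holds [CLS].
    """
    n, d = len(words), len(words[0])
    scale = 1.0 / math.sqrt(2 * d)  # assumed scaling for the sum of two terms

    wq = [project(w, Wq) for w in words]
    wk = [project(w, Wk) for w in words]
    pq = [project(p, Uq) for p in positions]
    pk = [project(p, Uk) for p in positions]

    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            word_score = dot(wq[i], wk[j]) * scale  # word-to-word term
            pos_score = dot(pq[i], pk[j]) * scale   # position-to-position term
            if cls_bias is not None:
                if i == 0:
                    pos_score = cls_bias[0]  # [CLS] attending to any position
                elif j == 0:
                    pos_score = cls_bias[1]  # any other position attending to [CLS]
            row.append(word_score + pos_score)
        weights.append(softmax(row))
    return weights


# Usage: three tokens, d = 2, identity projections for readability.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_words = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
toy_positions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = untied_attention_weights(
    toy_words, toy_positions, identity, identity, identity, identity,
    cls_bias=(0.0, 0.0),
)
```

Because no word vector is ever multiplied with a positional vector, the word-to-position and position-to-word terms never appear in the scores. With `cls_bias` set, the positional contribution in the [CLS] row no longer depends on distance, so that row is shaped by word content alone.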

