RETHINKING POSITIONAL ENCODING IN LANGUAGE PRE-TRAINING

Abstract

In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied to positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information sources. This may introduce unnecessary randomness into the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol [CLS] the same as that of other words is a reasonable design, considering its special role (the representation of the entire sentence) in downstream tasks. Motivated by the above analysis, we propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE). In the self-attention module, TUPE computes the word contextual correlation and the positional correlation separately with different parameterizations and then adds them together. This design removes the mixed and noisy correlations over heterogeneous embeddings and offers more expressiveness by using different projection matrices. Furthermore, TUPE unties the [CLS] symbol from other positions, making it easier to capture information from all positions. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of the proposed method. Code and models are released at https://github.com/guolinke/TUPE.
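The following is a minimal NumPy sketch of the untied attention score described above, for a single head. It is not the released implementation; the parameter names (`W_q`, `U_q`, etc.) and the shared 1/sqrt(2d) scaling are illustrative assumptions.

```python
import numpy as np

def untied_attention_scores(x, p, W_q, W_k, U_q, U_k):
    """Sketch of the untied attention scores described above (single head).

    x: (n, d) word (content) representations of a length-n sequence
    p: (n, d) absolute positional embeddings
    W_q, W_k: (d, d) projections used only for the word-word correlation
    U_q, U_k: (d, d) separate projections used only for the
        position-position correlation
    """
    d = x.shape[1]
    # The word contextual correlation and the positional correlation are
    # computed separately, with different parameterizations ...
    word_term = (x @ W_q) @ (x @ W_k).T   # (n, n)
    pos_term = (p @ U_q) @ (p @ U_k).T    # (n, n)
    # ... and then added together; no word-to-position cross terms appear.
    return (word_term + pos_term) / np.sqrt(2.0 * d)

# Toy usage with random inputs and parameters.
rng = np.random.default_rng(0)
n, d = 8, 16
x, p = rng.normal(size=(n, d)), rng.normal(size=(n, d))
W_q, W_k, U_q, U_k = (rng.normal(size=(d, d)) for _ in range(4))
scores = untied_attention_scores(x, p, W_q, W_k, U_q, U_k)
assert scores.shape == (n, n)
```

Because the positional term uses its own projection matrices, no correlation between word identity and position is ever computed, which is the point of the untying.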

1. INTRODUCTION

The Transformer model (Vaswani et al., 2017) is the most widely used architecture in language representation learning (Liu et al., 2019; Devlin et al., 2018; Radford et al., 2019; Bao et al., 2020). In Transformer, positional encoding is an essential component since the other main components of the model are entirely invariant to sequence order. The original Transformer uses absolute positional encoding, which provides each position with an embedding vector. The positional embedding is added to the word embedding and is found to be significantly helpful for learning the contextual representations of words at different positions. Besides the absolute positional encoding, Shaw et al. (2018); Raffel et al. (2019) further propose the relative positional encoding, which incorporates carefully designed bias terms inside the self-attention module to encode the distance between any two positions.

In this work, we revisit and study the formulation of the widely used absolute/relative positional encoding. First, we question the rationality of adding the word embedding to the absolute positional embedding in the input layer. Since the two kinds of embeddings are apparently heterogeneous, this addition operation brings mixed correlations between the positional information and the word semantics. For example, by expanding the dot product between queries and keys in the self-attention module of the first layer (see the expansion below), we find that there are explicit terms that use "words" to query "positions" and vice versa. However, words may only have weak correlations to where they appear in the sentence. Our empirical analysis also supports this by showing that in a well-trained model, such correlations are noisy.
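To make these mixed correlations explicit, let $w_i$ denote the word embedding and $p_i$ the absolute positional embedding at position $i$, and let $W^Q, W^K$ be the query and key projection matrices (notation introduced here for illustration). The query-key product in the first layer then expands as

$$\big((w_i + p_i) W^Q\big)\big((w_j + p_j) W^K\big)^\top = (w_i W^Q)(w_j W^K)^\top + (w_i W^Q)(p_j W^K)^\top + (p_i W^Q)(w_j W^K)^\top + (p_i W^Q)(p_j W^K)^\top.$$

The first and last terms are the word-to-word and position-to-position correlations; the two middle cross terms are exactly the ones that use words to query positions and vice versa.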



Second, we notice that the BERT model does not only handle natural language words. A special symbol [CLS] is usually attached to the sentence. It is widely acknowledged that this symbol plays a role different from that of regular words: its contextual representation is expected to summarize information from all positions and is used as the representation of the entire sentence in downstream tasks. Given this special role, we question whether it is reasonable to treat the position of [CLS] in the same way as that of any other word.
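One plausible way to realize the untying of [CLS] described in the abstract (the exact formulation here is an illustrative assumption) is to replace the position-to-position correlation $v_{ij}$ with learnable scalars $\theta_1, \theta_2$ whenever the [CLS] position (position $1$) is involved:

$$v_{ij} = \begin{cases} \theta_1 & \text{if } i = 1, \\ \theta_2 & \text{if } i \neq 1,\ j = 1, \\ \frac{1}{\sqrt{2d}}\,(p_i U^Q)(p_j U^K)^\top & \text{otherwise,} \end{cases}$$

where $U^Q, U^K$ are the position-only projection matrices mentioned in the abstract. Under such a reset, the attention pattern of [CLS] no longer depends on its fixed and arbitrary position, so it can capture information from all positions on an equal footing.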

