ON POSITION EMBEDDINGS IN BERT

Abstract

Various position embeddings (PEs) have been proposed in Transformer-based architectures (e.g., BERT) to model word order. These are empirically driven and perform well, but no formal framework exists to systematically study them. To address this, we present three properties of PEs that capture word distance in vector space: translation invariance, monotonicity, and symmetry. These properties formally capture the behaviour of PEs and allow us to reinterpret sinusoidal PEs in a principled way. Moreover, we propose a new probing test (called 'identical word probing') and mathematical indicators to quantitatively detect general attention patterns with respect to the above properties. An empirical evaluation of seven PEs (and their combinations) for classification (GLUE) and span prediction (SQuAD) shows that: (1) both classification and span prediction benefit from translation invariance and local monotonicity, while symmetry slightly decreases performance; (2) the fully-learnable absolute PE performs better in classification, while relative PEs perform better in span prediction. We contribute the first formal and quantitative analysis of desiderata for PEs, and a principled discussion of their correlation with performance on typical downstream tasks.

1. INTRODUCTION

Position embeddings (PEs) are crucial in Transformer-based architectures for capturing word order; without them, the representation is bag-of-words. Fully learnable absolute position embeddings (APEs) were first proposed by Gehring et al. (2017) to capture word position in convolutional seq2seq architectures. Sinusoidal functions were also used with Transformers to parameterize PEs in a fixed, ad hoc way (Vaswani et al., 2017). Later, Shaw et al. (2018) used relative position embeddings (RPEs) with Transformers for machine translation. More recently, among Transformer pre-trained language models, BERT (Devlin et al., 2018; Liu et al., 2019) and GPT (Radford et al., 2018) used fully-learnable PEs, and Yang et al. (2019) modified RPEs for the XLNet pre-trained language model. To our knowledge, the fundamental differences between the various PEs have not been studied in a principled way. We posit that the aim of PEs is to capture the sequential nature of positions in vector space, or, technically, to bridge the distance structures of N (for positions) and R^D (for position vectors). We therefore propose three expected properties for PEs: monotonicity, translation invariance, and symmetry. Using these properties, we formally reinterpret existing PEs and show a limitation of sinusoidal PEs (Vaswani et al., 2017): they cannot adaptively meet the monotonicity property; we therefore propose learnable sinusoidal PEs. We benchmark 13 PEs (including APEs, RPEs, and their combinations) on GLUE and SQuAD, a total of 11 individual tasks. Several indicators are devised to quantitatively measure translation invariance, monotonicity, and symmetry, which can further be used to calculate their statistical correlations with empirical performance on downstream tasks.
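To make the translation invariance of the fixed sinusoidal PEs concrete: because each embedding dimension is a sin/cos pair at a shared frequency, the dot product of two position vectors reduces to a sum of cosines of the position offset, so it depends only on that offset. A minimal NumPy sketch (the function name and layout are our own illustration, not taken from any particular implementation):

```python
import numpy as np

def sinusoidal_pe(position, dim, base=10000.0):
    """Fixed sinusoidal position embedding in the style of Vaswani et al. (2017)."""
    i = np.arange(dim // 2)
    freqs = base ** (-2 * i / dim)   # one frequency per sin/cos pair
    angles = position * freqs
    pe = np.empty(dim)
    pe[0::2] = np.sin(angles)        # even dimensions: sine
    pe[1::2] = np.cos(angles)        # odd dimensions: cosine
    return pe

D = 64
# sin(p*w)sin(q*w) + cos(p*w)cos(q*w) = cos((p - q)*w), so the dot product
# depends only on the offset p - q (translation invariance):
d1 = sinusoidal_pe(5, D) @ sinusoidal_pe(8, D)
d2 = sinusoidal_pe(105, D) @ sinusoidal_pe(108, D)
assert np.isclose(d1, d2)            # same offset (-3), same similarity
```

Note that a fully-learnable APE provides no such guarantee: a lookup table of independent position vectors need not satisfy translation invariance at all.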
We empirically find that both text classification tasks (in GLUE) and span prediction tasks (SQuAD V1.1 and V2.0) benefit from monotonicity (at nearby offsets) and translation invariance (in particular when special tokens such as [CLS] are excluded), whereas symmetry decreases performance, since a symmetric metric cannot account for the direction between query and key vectors when computing attention. Moreover, models with direction-unbalanced attention (generally attending more to preceding than to succeeding tokens) correlate slightly with better performance, especially on span prediction tasks. Experiments also show that the fully-learnable APE performs better in classification, while RPEs perform better in span prediction. Our proposed properties explain this as follows: RPEs perform better in span prediction because they better satisfy translation invariance, monotonicity, and asymmetry; the fully-learnable APE, whose parameterization does not strictly enforce translation invariance or monotonicity (it also scored worse on our translation-invariance and local-monotonicity indicators than other APEs and all RPEs), still performs well because it can flexibly handle special tokens (especially the unshiftable [CLS]). Regarding the newly proposed learnable sinusoidal PEs, the learnable sinusoidal APE satisfies the three properties to a greater extent than other APE variants, and the learnable sinusoidal RPE exhibits better direction awareness than other PE variants. Experiments show that BERT with sinusoidal APEs slightly outperforms the fully-learnable APE in span prediction, but underperforms it in classification. For both APEs and RPEs, learning the frequencies in sinusoidal PEs appears to be beneficial. Lastly, sinusoidal PEs generalize to longer documents because they fully satisfy the translation invariance property, while the fully-learnable APE does not.
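The learnable sinusoidal PEs discussed above keep the sin/cos functional form but treat the frequencies as trainable parameters. The sketch below shows one plausible parameterization (our own illustration; the paper's exact layout and initialization may differ). The key point it demonstrates is that translation invariance is preserved for any frequency values, learned or fixed, because the sin/cos pair structure is retained:

```python
import numpy as np

def learnable_sinusoidal_pe(position, freqs):
    # `freqs` plays the role of a trainable parameter vector; in a full model
    # it would be updated by gradient descent. Sine and cosine halves are
    # concatenated, which leaves dot products unchanged versus interleaving.
    angles = position * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

rng = np.random.default_rng(0)
D = 64
freqs = rng.uniform(0.01, 1.0, size=D // 2)   # stand-in for learned values

# Translation invariance holds for arbitrary frequencies:
same_offset_a = learnable_sinusoidal_pe(3, freqs) @ learnable_sinusoidal_pe(7, freqs)
same_offset_b = learnable_sinusoidal_pe(53, freqs) @ learnable_sinusoidal_pe(57, freqs)
assert np.isclose(same_offset_a, same_offset_b)
```

Learning the frequencies thus adds flexibility (e.g., adapting how quickly similarity decays with offset) without giving up the structural guarantee that makes sinusoidal PEs extrapolate to longer documents.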
The contributions of this paper are summarised below: 1) we propose three principled properties for PEs that are either formally examined or empirically evaluated by quantitative indicators in a novel identical word probing test; 2) we benchmark 13 PEs (including APEs, RPEs, and their combinations) on GLUE, SQuAD V1.1, and SQuAD V2.0, a total of 11 individual tasks; 3) we experimentally evaluate how performance on individual tasks benefits from the above properties; 4) we propose two new PEs that extend sinusoidal PEs to learnable versions of APEs/RPEs.

2. PROPERTIES OF POSITION EMBEDDINGS

Gehring et al. (2017) and Vaswani et al. (2017) use absolute word positions as additional features in neural networks. Positions x ∈ N are distributively represented by embedding x as an element x ∈ R^D in some Euclidean space. By standard methods in representation learning, similarity between embedded objects x and y is typically expressed by an inner product <x, y>; for instance, the dot product gives rise to the usual cosine similarity between x and y. Generally, if words appear close to each other in a text (i.e., their positions are nearby), they are more likely to determine the (local) semantics together than if they occurred far apart. Hence, positional proximity of words x and y should result in proximity of their embedded representations x and y. One common way of formalizing this is to require that an embedding preserve the order of distances among positions. We denote by φ(·, ·) a function that measures the closeness/proximity of embedded positions; any inner product is a special case of φ(·, ·) with good properties. Preservation of the order of distances can then be expressed as:

For every x, y, z ∈ N: |x − y| > |x − z| =⇒ φ(x, y) < φ(x, z)    (1)

Note that the property in Eq. (1) has been studied on the underlying space for almost 60 years (Shepard, 1962), both in algorithmics (Bilu & Linial, 2005; Badoiu et al., 2008; Maehara, 2013) and in machine learning (Terada & Luxburg, 2014; Jain et al., 2016), under the name ordinal embedding. As we are interested in the simple case of positions from N, Eq. (1) reduces to the following properties.

Informally, as positions are originally positive integers, one may expect position vectors in vector space to have the following properties: 1) neighboring positions are embedded closer than faraway ones; 2) the distances between any two m-offset position vectors are identical; 3) the metric (distance) itself is symmetric. Theoretical evidence for this is nontrivial unless more is assumed about the particular non-linear functions. We empirically find that all learned PEs preserve the order of distances.
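Eq. (1) can be checked numerically for a concrete choice of φ. Taking φ to be the dot product of fixed sinusoidal position vectors, the similarity equals a sum of cosines of the offset, which strictly decreases as |x − y| grows for small offsets; this is the local (rather than global) monotonicity referred to above. A small sketch of such a check (our own illustration, with the standard sinusoidal construction):

```python
import numpy as np

def sinusoidal_pe(position, dim, base=10000.0):
    i = np.arange(dim // 2)
    freqs = base ** (-2 * i / dim)
    angles = position * freqs
    pe = np.empty(dim)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

D = 64
phi = lambda x, y: sinusoidal_pe(x, D) @ sinusoidal_pe(y, D)

# Local monotonicity: for small offsets, a larger |x - y| yields a smaller
# phi, matching Eq. (1). Globally, the sum of cosines oscillates, which is
# why the sinusoidal PE satisfies monotonicity only locally.
sims = [phi(0, k) for k in range(5)]
assert all(a > b for a, b in zip(sims, sims[1:]))
```

A fully-learnable APE offers no such analytical statement; whether it satisfies Eq. (1) can only be measured empirically after training, which motivates the quantitative indicators proposed in this paper.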