CONDITIONAL POSITIONAL ENCODINGS FOR VISION TRANSFORMERS

Abstract

We propose a conditional positional encoding (CPE) scheme for vision Transformers (Dosovitskiy et al., 2021; Touvron et al., 2020). Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to input sequences longer than any seen during training. Moreover, CPE preserves the translation equivariance desired in vision tasks, resulting in improved performance. We implement CPE with a simple Position Encoding Generator (PEG) that is seamlessly incorporated into the current Transformer framework. Built on PEG, we present the Conditional Position encoding Vision Transformer (CPVT). We demonstrate that CPVT produces attention maps visually similar to those obtained with learned positional encodings while delivering better results. Our code is available at: https://git.io/CPVT.

1. INTRODUCTION

Recently, Transformers (Vaswani et al., 2017) have been viewed as a strong alternative to Convolutional Neural Networks (CNNs) in visual recognition tasks such as classification (Dosovitskiy et al., 2021) and detection (Carion et al., 2020; Zhu et al., 2021). Unlike the convolution operation in CNNs, which has a limited receptive field, the self-attention mechanism in Transformers can capture long-distance information and dynamically adapt its receptive field according to the image content. Consequently, Transformers are considered more flexible and powerful than CNNs, promising further progress in visual recognition. However, the self-attention operation in Transformers is permutation-invariant: it discards the order of the tokens in an input sequence. To mitigate this issue, previous works (Vaswani et al., 2017; Dosovitskiy et al., 2021) add absolute positional encodings to each input token (see Figure 1a), which enables order-awareness. The positional encoding can either be learnable or fixed with sinusoidal functions of different frequencies. Despite being effective, these positional encodings seriously harm the flexibility of Transformers, hampering their broader application. Taking the learnable version as an example, the encodings are a set of vectors of the same length as the input sequence, jointly updated with the network weights during training. As a result, both the length and the values of the positional encodings are fixed once trained. During testing, this causes difficulties in handling sequences longer than those in the training data; the inability to adapt to longer input sequences greatly limits generalization. For instance, in vision tasks like object detection, we expect the model to be applicable at inference time to images of any size, which might be much larger than the training images.
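The length mismatch, and the common workaround of resizing the learned encodings over the 2D patch grid, can be sketched as follows. This is a minimal, hypothetical PyTorch snippet: the grid sizes and embedding dimension are illustrative (14×14 patches at 224 px training resolution, 24×24 at a larger test resolution), not taken from a specific model.

```python
# Sketch: a learned positional-encoding table is tied to the training sequence
# length; a longer test sequence requires resizing it, e.g. by bicubic
# interpolation over the 2D patch grid (the usual ViT workaround).
import torch
import torch.nn.functional as F

train_grid, test_grid, dim = 14, 24, 192           # illustrative sizes
pos_embed = torch.randn(1, train_grid * train_grid, dim)  # fixed once trained

# (1, N, C) -> (1, C, H, W), interpolate, flatten back to (1, N', C)
pe = pos_embed.transpose(1, 2).reshape(1, dim, train_grid, train_grid)
pe = F.interpolate(pe, size=(test_grid, test_grid),
                   mode="bicubic", align_corners=False)
pos_embed_resized = pe.flatten(2).transpose(1, 2)  # (1, 576, 192)
```

As discussed next, this interpolation keeps the shapes consistent but degrades accuracy unless the model is fine-tuned at the new resolution.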
A possible remedy is to use bicubic interpolation to upsample the positional encodings to the target length, but this degrades performance without fine-tuning, as shown later in our experiments. For vision in general, we also expect models to be translation-equivariant: for example, the output feature maps of CNNs shift accordingly as the target objects move in the input images. However, the absolute positional encoding scheme may break translation equivariance because it adds a unique positional encoding to each token (i.e., each image patch). One may overcome this issue with relative positional encodings as in (Shaw et al., 2018). However, relative positional encodings not only come with extra computational costs but also require modifying the implementation of the standard Transformer. Last but not least, relative positional encodings cannot work as well as absolute ones, because the image recognition task still requires absolute position information (Islam et al., 2020), which relative positional encodings fail to provide.

In this work, we advocate a novel positional encoding (PE) scheme that incorporates position information into Transformers. Unlike the predefined and input-agnostic positional encodings used in previous works (Dosovitskiy et al., 2021; Vaswani et al., 2017; Shaw et al., 2018), the proposed PE is dynamically generated and conditioned on the local neighborhood of the input tokens. Thus, our positional encodings change along with the input size and help preserve translation equivariance. We demonstrate that vision Transformers (Dosovitskiy et al., 2021; Touvron et al., 2020) with our new PE (i.e., CPVT, see Figure 1b) achieve even better performance. We summarize our contributions as follows:

• We propose a novel positional encoding (PE) scheme, termed conditional positional encodings (CPE). CPE is dynamically generated with a Position Encoding Generator (PEG) and can be effortlessly implemented in modern deep learning frameworks (Paszke et al., 2019; Abadi et al., 2016; Chen et al., 2015), requiring no changes to current Transformer APIs. Through in-depth analysis and thorough experiments, we show that this design affords the benefits of both absolute and relative encodings, and goes beyond them.

• As opposed to the widely used absolute positional encodings, CPE provides a stronger explicit bias towards translation equivariance, which is important for improving the performance of Transformers.

• Built on CPE, we propose the Conditional Position encoding Vision Transformer (CPVT). It achieves better performance than previous vision Transformers (Dosovitskiy et al., 2021; Touvron et al., 2020).

• CPE generalizes well to arbitrary input resolutions, as required by many important downstream tasks such as segmentation and detection. Through experiments, we show that CPE boosts segmentation and detection performance for pyramid Transformers like (Wang et al., 2021) by a clear margin.
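A PEG of the kind described above can be sketched in a few lines of PyTorch. This is an illustrative implementation under our reading of the scheme, not the reference code: tokens are reshaped back into a 2D feature map, passed through a zero-padded depthwise convolution whose output serves as the conditional positional encoding, and the class token (if present) is carried through untouched. The class name `PEG` and the `forward(x, H, W)` signature are our own conventions.

```python
# Minimal sketch of a Position Encoding Generator (PEG): a depthwise 3x3
# convolution over the reshaped token grid produces a positional encoding
# conditioned on each token's local neighborhood. Zero padding at the borders
# supplies the absolute-position cue.
import torch
import torch.nn as nn

class PEG(nn.Module):
    def __init__(self, dim: int = 256, k: int = 3):
        super().__init__()
        # depthwise conv (groups=dim): cheap, translation-equivariant in the interior
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=k // 2, groups=dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, 1 + H*W, C) with a leading class token
        cls_tok, feat = x[:, :1], x[:, 1:]
        B, N, C = feat.shape
        feat_2d = feat.transpose(1, 2).reshape(B, C, H, W)
        pe = self.proj(feat_2d)                       # conditional positional encoding
        feat = (feat_2d + pe).flatten(2).transpose(1, 2)
        return torch.cat([cls_tok, feat], dim=1)      # shape preserved

x = torch.randn(2, 1 + 14 * 14, 256)
out = PEG(256)(x, 14, 14)                             # (2, 197, 256)
```

Because the module maps a token sequence to a token sequence of the same shape, it can be dropped between any two encoder blocks without touching the Transformer API, and it accepts any grid size H×W at inference time.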



Figure 1. Vision Transformers: (a) ViT (Dosovitskiy et al., 2021) with explicit 1D learnable positional encodings (PE). (b) CPVT with conditional positional encodings from the proposed Position Encoding Generator (PEG) plugin, which is the default choice. (c) CPVT-GAP, without a class token (cls) but with global average pooling (GAP) over all items in the sequence. Note that GAP is a bonus version with boosted performance.

