CONDITIONAL POSITIONAL ENCODINGS FOR VISION TRANSFORMERS

Abstract

We propose a conditional positional encoding (CPE) scheme for vision Transformers (Dosovitskiy et al., 2021; Touvron et al., 2020). Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE easily generalizes to input sequences longer than those the model has seen during training. Moreover, CPE preserves the translation equivariance desired in vision tasks, resulting in improved performance. We implement CPE with a simple Position Encoding Generator (PEG), which can be seamlessly incorporated into the current Transformer framework. Built on PEG, we present the Conditional Position encoding Vision Transformer (CPVT). We demonstrate that CPVT produces attention maps visually similar to those obtained with learned positional encodings while achieving better results.

1. INTRODUCTION

Recently, Transformers (Vaswani et al., 2017) have emerged as a strong alternative to Convolutional Neural Networks (CNNs) in visual recognition tasks such as classification (Dosovitskiy et al., 2021) and detection (Carion et al., 2020; Zhu et al., 2021). Unlike the convolution operation in CNNs, which has a limited receptive field, the self-attention mechanism in Transformers can capture long-distance dependencies and dynamically adapt its receptive field to the image content. Transformers are therefore considered more flexible and powerful than CNNs, and promise further progress in visual recognition. However, the self-attention operation in Transformers is permutation-invariant: it discards the order of the tokens in the input sequence. To mitigate this issue, previous works (Vaswani et al., 2017; Dosovitskiy et al., 2021) add absolute positional encodings to each input token (see Figure 1a), which restores order-awareness. The positional encodings can be either learnable or fixed sinusoidal functions of different frequencies. Despite being effective, these positional encodings seriously harm the flexibility of Transformers and hamper their broader application. Taking the learnable version as an example, the encodings form a sequence of vectors equal in length to the input sequence, which are jointly updated with the network weights during training. Once trained, both the length and the values of the positional encodings are fixed, which makes it difficult to handle test sequences longer than those in the training data. This inability to adapt to longer input sequences at test time greatly limits generalization. For instance, in vision tasks such as object detection, we expect the model to be applicable to images of any size during inference, which may be much larger than the training images.
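To make the permutation issue concrete, the following self-contained sketch (illustrative names, not from any paper's code) shows that a single self-attention layer without positional encodings is permutation-equivariant: permuting the input tokens merely permutes the outputs in the same way, so the layer carries no information about token order.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head self-attention with no positional encodings.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

rng = np.random.default_rng(0)
n, d = 6, 8                      # 6 tokens, 8-dim embeddings (toy sizes)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the input tokens only shuffles the outputs identically:
# the layer cannot distinguish different token orderings.
assert np.allclose(out[perm], out_perm)
```

This is exactly why absolute positional encodings are added to the tokens: without them, the model sees a bag of patches rather than an ordered grid.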
A possible remedy is to upsample the positional encodings to the target length with bicubic interpolation, but, as our experiments later show, this degrades performance unless the model is fine-tuned. More generally, for vision we expect models to be translation-equivariant: for example, the output feature maps of CNNs shift accordingly as the target objects move in the input images. However, the absolute positional encoding scheme might break this translation equivariance because it adds unique
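The interpolation remedy above can be sketched as follows. This is a minimal, dependency-free illustration (our own; function and variable names are hypothetical): it reshapes a learned (H*W, D) positional embedding back into its 2D grid and upsamples each channel to a larger grid. For brevity it uses bilinear rather than bicubic interpolation, which suffices to show the mechanics.

```python
import numpy as np

def resize_pos_embed(pos_embed, new_hw):
    """Upsample a (H*W, D) learned positional embedding to a larger grid.

    Bilinear interpolation per channel; a square input grid is assumed.
    """
    n, d = pos_embed.shape
    h = w = int(np.sqrt(n))
    grid = pos_embed.reshape(h, w, d)
    new_h, new_w = new_hw
    # Fractional source coordinates for each target position.
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    # Blend the four neighboring grid embeddings per target position.
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bot = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    out = top * (1 - wy) + bot * wy
    return out.reshape(new_h * new_w, d)

# e.g. a 14x14 patch grid (224px image, patch size 16) resized for a
# 24x24 grid (384px test image); sizes are illustrative.
pe = np.random.default_rng(1).standard_normal((14 * 14, 64))
pe_big = resize_pos_embed(pe, (24, 24))
assert pe_big.shape == (24 * 24, 64)
```

Note that interpolation changes the encoding every token receives, which is why models trained with fixed-length encodings typically need fine-tuning after resizing, as the experiments discussed above indicate.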

