GPVIT: A HIGH RESOLUTION NON-HIERARCHICAL VISION TRANSFORMER WITH GROUP PROPAGATION

Abstract

We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation, where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned back to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs; for example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters. Code and pre-trained models are available at https://github.com/ChenhongyiYang/GPViT.

Introduction

Vision Transformer (ViT) architectures have achieved excellent results in general visual recognition tasks, outperforming ConvNets in many instances. In the original ViT architecture, image patches are passed through transformer encoder layers, each containing self-attention and MLP blocks. The spatial resolution of the image patches is constant throughout the network. Self-attention allows information to be exchanged between patches across the whole image, i.e. globally; however, it is computationally expensive and does not place an emphasis on local information exchange between nearby patches, as a convolution would. Recent work has sought to build convolutional properties back into vision transformers (Liu et al., 2021; Wu et al., 2021; Wang et al., 2021) through a hierarchical (pyramidal) architecture. This design reduces computational cost and improves ViT performance on tasks such as detection and segmentation.

Is this design necessary for structured prediction? It incorporates additional inductive biases, e.g. the assumption that nearby image tokens contain similar information, which contrasts with the motivation for ViTs in the first place. A recent study (Li et al., 2022a) demonstrates that a plain non-hierarchical ViT, a model that maintains the same feature resolution in all layers (non-pyramidal), can achieve performance on object detection and segmentation tasks comparable to a hierarchical counterpart. How do we go one step further and surpass this? One path would be to increase feature resolution (i.e. the number of image tokens). A plain ViT with more tokens would maintain high-resolution features throughout the network, as there is no downsampling. This would facilitate fine-grained, detailed outputs ideal for tasks such as object detection and segmentation. It also simplifies the design for downstream applications, removing the need to find a way to combine the different feature scales of a hierarchical ViT. However, this brings new challenges in terms of computation: self-attention has quadratic complexity in the number of image tokens, so doubling feature resolution (i.e. quadrupling the number of tokens) leads to a 16× increase in compute. How do we maintain global information exchange between image tokens without this huge increase in computational cost?

In this paper, we propose the Group Propagation Vision Transformer (GPViT): a non-hierarchical ViT which uses high-resolution features throughout and allows for efficient global information exchange between image tokens. We design a novel Group Propagation Block (GP Block) for use in plain ViTs. Figure 1 provides a high-level illustration of how this block works. In detail, we use learnable group tokens and the cross-attention operation to group a large number of high-resolution image features into a fixed number of grouped features. Intuitively, we can view each group as a cluster of patches representing the same semantic concept. We then use an MLP-Mixer (Tolstikhin et al., 2021) module to update the grouped features and propagate global information among them. This process allows information exchange at a low computational cost, as the number of groups is much smaller than the number of image tokens. Finally, we ungroup the grouped features using another cross-attention operation in which the updated grouped features act as key and value pairs and are queried by the image token features. This updates the high-resolution image token features with the group-propagated information. The GP Block has only linear complexity in the number of image tokens, which allows it to scale better than ordinary self-attention. This block is the foundation of our simple non-hierarchical vision transformer architecture for general visual recognition.

We conduct experiments on multiple visual recognition tasks including image classification, object detection, instance segmentation, and semantic segmentation. We show significant improvements over previous approaches, including hierarchical vision transformers, under the same model size in all tasks. The performance gain is especially large for object detection and segmentation: for example, as shown in Figure 2, GPViT outperforms both the non-hierarchical DeiT and the hierarchical Swin Transformer across all four tasks.
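To make the three steps concrete, the following is a minimal PyTorch sketch of a GP Block. It is an illustration under our own simplifying assumptions (the module and parameter names, the single mixer layer, and the default dimensions are ours), not the released implementation:

import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    # MLP-Mixer-style layer: mixes across the group axis, then the channel axis.
    def __init__(self, num_groups, dim, hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.group_mlp = nn.Sequential(
            nn.Linear(num_groups, hidden), nn.GELU(), nn.Linear(hidden, num_groups))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, g):                            # g: (B, G, D)
        y = self.norm1(g).transpose(1, 2)            # (B, D, G)
        g = g + self.group_mlp(y).transpose(1, 2)    # mix information across groups
        g = g + self.channel_mlp(self.norm2(g))      # mix information across channels
        return g

class GPBlock(nn.Module):
    def __init__(self, dim=256, num_groups=64, num_heads=8):
        super().__init__()
        # Learnable group tokens: a fixed count G, independent of image resolution.
        self.group_tokens = nn.Parameter(torch.randn(1, num_groups, dim) * 0.02)
        self.group_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mixer = MixerLayer(num_groups, dim, hidden=2 * dim)
        self.ungroup_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                            # x: (B, N, D) image tokens
        g = self.group_tokens.expand(x.size(0), -1, -1)
        g, _ = self.group_attn(g, x, x)              # 1) grouping: groups query tokens, O(N*G)
        g = self.mixer(g)                            # 2) propagation among G << N groups, O(G^2)
        out, _ = self.ungroup_attn(x, g, g)          # 3) ungrouping: tokens query groups, O(N*G)
        return x + out                               # residual keeps the high-resolution stream

block = GPBlock()
tokens = torch.randn(2, 56 * 56, 256)               # a 56x56 high-resolution token map
print(block(tokens).shape)                          # torch.Size([2, 3136, 256])

Because both cross-attentions cost O(N·G) and the mixer costs O(G²) with G fixed, the block is linear in the number of image tokens N, whereas plain self-attention's O(N²) cost grows 16× when the resolution is doubled (quadrupling N).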



Figure 1: An illustration of our GP Block. It groups image features into a fixed-size feature set. Then, global information is efficiently propagated between the grouped features. Finally, the grouped features are queried by the image features to transfer this global information into them.

Figure 2: A comparison on four visual recognition tasks between GPViT and the non-hierarchical DeiT (Touvron et al., 2021a) and the hierarchical Swin Transformer (Liu et al., 2021).

Related Work

Vision Transformers have shown great success in visual recognition. They have fewer inductive biases than ConvNets, e.g. translation invariance, scale invariance, and feature locality (Xu et al., 2021b), and can better capture long-range relationships between image pixels. In the original ViT architecture (Dosovitskiy et al., 2021; Touvron et al., 2021a), images are split into patches and are transformed into tokens that are passed through the encoder of a transformer (Vaswani et al., 2017).
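As a concrete reminder of this tokenization step, a minimal patch-embedding sketch in PyTorch is shown below; the 16×16 patch size and 768-dimensional tokens are illustrative ViT-Base defaults, not specific to any model discussed here:

import torch
import torch.nn as nn

# Split an image into 16x16 patches and project each one to a 768-d token,
# implemented as a strided convolution (equivalent to flatten-then-linear).
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 196, 768]): 14 x 14 = 196 patch tokens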

