GLOBAL CONTEXT VISION TRANSFORMERS

Abstract

We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision tasks. At the core of the model are global context self-attention modules, combined with standard local self-attention, to effectively yet efficiently model both long- and short-range spatial interactions, as an alternative to complex operations such as attention masks or local window shifting. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules to interact with local keys and values. In addition, we address the lack of inductive bias in ViTs and improve the modeling of inter-channel dependencies by proposing a novel downsampler which leverages a parameter-efficient fused inverted residual block. The proposed GC ViT achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the tiny, small and base variants of GC ViT with 28M, 51M and 90M parameters achieve 83.4%, 83.9% and 84.4% Top-1 accuracy, respectively, surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer. Pre-trained GC ViT backbones consistently outperform prior work, sometimes by large margins, on the downstream tasks of object detection, instance segmentation and semantic segmentation on the MS COCO and ADE20K datasets.

1. INTRODUCTION

In recent years, Transformers (Vaswani et al., 2017) have achieved state-of-the-art (SOTA) performance on Natural Language Processing (NLP) benchmarks and have become the de facto model for various tasks. A key element in the success of Transformers is the self-attention mechanism, which captures contextual representations by attending to both distant and nearby tokens (Yin et al., 2021). Following this trend, the Vision Transformer (ViT) (Dosovitskiy et al., 2020) proposed to use image patches as tokens in a monolithic architecture with minor differences compared to the encoder of the original Transformer. Despite the historic dominance of Convolutional Neural Networks (CNNs) in computer vision, ViT-based models have achieved SOTA or competitive performance on various computer vision tasks. In essence, the self-attention mechanism in ViT allows for learning more uniform short- and long-range information (Raghu et al., 2021) in comparison to CNNs. However, the monolithic architecture of ViT and the quadratic computational complexity of self-attention hinder their swift application to high-resolution images (Yang et al., 2021a), for which capturing multi-scale long-range information is crucial for accurate representation modeling. Several efforts (Liu et al., 2021; Dong et al., 2022; Chu et al., 2021a; Tu et al., 2022), most notably the Swin Transformer (Liu et al., 2021), have attempted to balance short- and long-range spatial dependencies by proposing multi-resolution architectures in which self-attention is computed within local windows. In this paradigm, cross-window connections such as window shifting are used to model interactions across different regions. Despite this progress, the limited receptive field of local windows challenges the capability of self-attention to capture long-range information, and window-connection schemes such as shifting only cover a small neighborhood in the vicinity of each window.
Subsequent efforts such as the Focal Transformer (Yang et al., 2021b) attempted to address this issue by designing highly sophisticated self-attention modules at the cost of increased model complexity. In this work, we introduce the Global Context (GC) ViT network to address these limitations. Specifically, we propose a hierarchical ViT architecture consisting of local and global self-attention modules. At each stage, we compute global query tokens using novel fused inverted residual blocks, which we refer to as modified Fused-MBConv blocks, that encompass global contextual information from different image regions. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules to interact with local key and value representations. The design of the proposed global query generator and self-attention is intuitive and simple and can be efficiently implemented in major deep learning frameworks. Hence, it eliminates sophisticated and computationally expensive operations and preserves the effectiveness of self-attention when applied to high-resolution images. In addition, we propose a novel downsampling block with a parameter-efficient fused-MBConv layer to address the lack of inductive bias in ViTs and enhance the modeling of inter-channel dependencies. We have extensively validated the effectiveness of the proposed GC ViT on three publicly available datasets across various computer vision tasks. For image classification on the ImageNet-1K dataset, GC ViT with 28M, 51M, 90M and 201M parameters, referred to as the tiny, small, base and large variants, achieves new SOTA benchmarks of 83.4%, 83.9%, 84.4% and 84.6% Top-1 accuracy, respectively. Hence, GC ViT consistently outperforms both ConvNeXt (Liu et al., 2022) and Swin Transformer (Liu et al., 2021) models by a significant margin (see Fig. 1).
Using a pre-trained GC ViT base backbone with a Cascade Mask R-CNN (He et al., 2017) head, our model achieves a box mAP of 52.9 for object detection and a mask mAP of 45.8 for instance segmentation on the MS COCO dataset. In addition, using a UPerNet (Xiao et al., 2018) head, our model achieves an mIoU of 49.0 on ADE20K for semantic segmentation. Other variants of GC ViT with different learning capacities also demonstrate SOTA results when compared to similarly-sized models on both the MS COCO and ADE20K datasets. Hence, GC ViT demonstrates great scalability for high-resolution images on various downstream tasks, validating the effectiveness of the proposed framework in capturing both short- and long-range information. The main contributions of our work are summarized as follows:

• We introduce a compute- and parameter-optimized hierarchical ViT with a reparametrized design space (e.g., embedding dimension, number of heads, MLP ratio).

Downsampler. We leverage the idea of spatial feature contraction from CNN models, which imposes locality bias and cross-channel interaction while reducing dimensions. We use a modified Fused-MBConv block, followed by a max pooling layer with a kernel size of 3 and stride of 2, as the downsampling operator (see Fig. 3). The Fused-MBConv block in our work is similar to the one in EfficientNetV2 (Tan & Le, 2021), with modifications as in

x = DW-Conv_{3×3}(x),
x = GELU(x),
x = SE(x),
x = Conv_{1×1}(x) + x,   (1)

where SE, GELU and DW-Conv_{3×3} denote a Squeeze-and-Excitation block (Hu et al., 2018), the Gaussian Error Linear Unit (Hendrycks & Gimpel, 2016) and a 3×3 depth-wise convolution, respectively. In our proposed architecture, the Fused-MBConv blocks provide desirable properties such as inductive bias and modeling of inter-channel dependencies; this choice is ablated in Table 5.

Attention. Multi-head self-attention is the core computational operator in the proposed architecture.
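As a concrete illustration, the downsampler described above can be sketched in PyTorch. The SE reduction ratio and the placement of the 1×1 channel-expansion convolution are assumptions not pinned down by the text, so treat this as a sketch rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    # Squeeze-and-Excitation: global pooling -> bottleneck MLP -> channel gating
    def __init__(self, dim, ratio=0.25):  # ratio is an assumed value
        super().__init__()
        hidden = int(dim * ratio)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class FusedMBConv(nn.Module):
    # Modified block from Eq. 1: x = Conv1x1(SE(GELU(DWConv3x3(x)))) + x
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.act = nn.GELU()
        self.se = SE(dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        return self.pw(self.se(self.act(self.dw(x)))) + x

class Downsample(nn.Module):
    # Fused-MBConv followed by 3x3 max pooling with stride 2; the 1x1
    # channel expansion and its placement are assumptions for illustration
    def __init__(self, dim, dim_out):
        super().__init__()
        self.block = FusedMBConv(dim)
        self.expand = nn.Conv2d(dim, dim_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.pool(self.expand(self.block(x)))
```

Applied to a 56×56 stage input, this halves the spatial resolution while the Fused-MBConv residual path injects the convolutional inductive bias and cross-channel mixing discussed above.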


2.1. GLOBAL QUERY GENERATOR

We propose to generate global query tokens that encompass information from the entire input feature map for interaction with local key and value feature pairs. Specifically, as shown in Fig. 5, a layer f in the generator consists of a Fused-MBConv block followed by a max pooling layer, similar to the one described in Sec. 2, and the final global query q_{g,i} at stage i (i ∈ {1, 2, 3, 4}) of GC ViT is computed according to

x_i = F-MBConv(x_{i-1}),
x_i = MaxPool(x_i).   (2)

These query tokens are computed once at every stage of the model and shared across all global attention blocks, decreasing the number of parameters and FLOPs and improving generalizability. In addition, the global attention layers only learn local key and value features, which are used for interaction with the global query tokens. Fig. 6 illustrates the difference between local and global self-attention. The global attention query q_g has a size of B × C × h × w, where B, C, h and w denote the batch size, embedding dimension, local window height and local window width, respectively. Moreover, q_g is repeated along the batch dimension to compensate for the overall number of windows, B* = B × N, where N is the number of local windows. q_g is further reshaped into multiple heads. The keys and values are computed within each local window using a linear layer. The global self-attention query, key and value features are computed as follows
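A minimal sketch of the stage-wise generator implied by Eq. 2, assuming the same stride-2, kernel-3 max pooling as the downsampler; the `block` argument is a hypothetical stand-in for the modified Fused-MBConv:

```python
import math
import torch
import torch.nn as nn

def global_query(x, window, block):
    # x: (B, C, H, W) stage input; window: local window size h
    # Repeat (Fused-MBConv -> stride-2 max pool) log2(H / h) times so the
    # global query is spatially matched with a single local window (Fig. 5).
    B, C, H, W = x.shape
    pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
    for _ in range(int(math.log2(H // window))):
        x = pool(block(x))
    return x  # (B, C, h, h)
```

For a 56×56 stage input and a 7×7 window, the loop runs log2(8) = 3 times; the result is computed once per stage and then shared by every global attention block, as described above.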

2.2. GLOBAL SELF-ATTENTION

Q_g ∈ R^(B*×C×h×w) := [q_g, ..., q_g],   q_g ∈ R^(B×C×h×w),   (3)

q_g ∈ R^(B*×N×C)  ←reshape─  Q_g ∈ R^(B*×C×h×w),   k, v = g(x) ∈ R^(B*×N×C),

where g is the linear layer that computes the key and value features. Since the partitioned windows only contain local information, interaction with the rich contextual information embedded in the global query tokens provides an effective way of enlarging the receptive field and attending to various regions of the input feature maps. The self-attention module is computed as in

Attention(q_g, k, v) = Softmax(q_g k^T / √d + b) v,   (4)

where d is a scaling factor and b is a learnable relative position bias term. Assuming a position change in the range [-p + 1, p - 1] along the horizontal and vertical axes, b is sampled from the grid b ∈ R^((2p-1)×(2p-1)). As shown in Sec. 5, the relative position bias improves the performance, especially for dense prediction downstream tasks. In Algorithm 1, we present PyTorch-like pseudocode for computing global self-attention in GC ViT. A complexity analysis of the global self-attention is presented in the supplementary materials.
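Putting Eqs. 3 and 4 together, a runnable sketch of global self-attention might look as follows. The relative position bias b is omitted for brevity, and the window-major batch layout assumed by `repeat_interleave` is our assumption, not something the text specifies:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def global_self_attention(x, q_g, kv_proj, num_heads):
    # x:   (B*, N, C) windowed local tokens, B* = B x num_windows, N = h*w
    # q_g: (B, C, h, w) global query, precomputed once per stage
    Bs, N, C = x.shape
    B = q_g.shape[0]
    hd = C // num_heads
    # local keys and values from a single linear layer g (Eq. 3)
    kv = kv_proj(x).reshape(Bs, N, 2, num_heads, hd).permute(2, 0, 3, 1, 4)
    k, v = kv[0], kv[1]                                    # (B*, heads, N, hd)
    # tile the global query over each image's windows (window-major layout assumed)
    q = q_g.repeat_interleave(Bs // B, dim=0)              # (B*, C, h, w)
    q = q.reshape(Bs, num_heads, hd, N).transpose(-2, -1)  # (B*, heads, N, hd)
    # Eq. 4, with the learnable relative position bias omitted
    attn = F.softmax((q @ k.transpose(-2, -1)) * hd ** -0.5, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(Bs, N, C)
```

Note how the query path has no per-window projection: only keys and values are learned locally, which is exactly why sharing q_g across all windows saves parameters and FLOPs.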

3. RELATED WORK

ViT. The ViT (Dosovitskiy et al., 2020) was first proposed as an alternative to CNNs with the advantage of enlarged receptive field, due to its self-attention layers. However, it lacked desirable properties of CNNs such as inductive biases and translation invariance and required large-scale training datasets to achieve competitive performance. Data-efficient Image Transformers (DeiT) 

4. EXPERIMENTS

Image classification. For image classification, we trained and tested our model on the ImageNet-1K dataset (Deng et al., 2009). To allow for a fair comparison, all GC ViT variants are trained following the training configurations of previous efforts (Liu et al., 2021; Yang et al., 2021b; Chu et al., 2021a). Specifically, all models are trained on 4 nodes (32 A100 GPUs) with the AdamW (Loshchilov & Hutter, 2017) optimizer for 300 epochs, with an initial learning rate of 0.001, a weight decay of 0.05, a cosine decay schedule, and 20 warm-up and cool-down epochs, respectively. We use total batch sizes of 4096 for the GC ViT-XXT, GC ViT-XT and GC ViT-T models and 1024 for all other variants. See the supplementary materials for more training details.

Object detection and semantic segmentation. For object detection and instance segmentation, we trained our model on MS COCO (Lin et al., 2014) with a Mask R-CNN (He et al., 2017)
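The classification recipe above (AdamW with weight decay 0.05, 20 warm-up epochs, then cosine decay over the remaining epochs) can be sketched with stock PyTorch schedulers. The model here is a placeholder and the warm-up start factor is an assumed value:

```python
import torch
import torch.nn as nn

# Placeholder model; a GC ViT variant would go here.
model = nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# 20 warm-up epochs (start factor assumed), then cosine decay over the
# remaining 280 of the 300 total epochs.
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.01, total_iters=20)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=280)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, cosine], milestones=[20])
```

Stepping `sched` once per epoch ramps the learning rate up to 0.001 by epoch 20 and then anneals it toward zero, matching the stated schedule in shape if not in every constant.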

4.3. ADE20K SEMANTIC SEGMENTATION RESULTS

We present semantic segmentation benchmarks on the ADE20K dataset in Table 3. The models using pre-trained GC ViT-T (47.0), GC ViT-S (48.3) and GC ViT-B (49.0) backbones outperform counterpart models with pre-trained Twins-SVT-S (Chu et al., 2021a) backbones.

As shown in Table 4, we study the role of each component of the GC ViT model for classification, detection, instance segmentation and semantic segmentation. For simplicity, we start with the Swin Transformer as the base model and progressively re-design its components to demonstrate their importance in improving the performance. First, we remove window shifting and, predictably, observe significant performance degradation across all tasks. Changing the distribution of parameters to our design improves the performance by +1.7, +2.8, +2.2 and +1.7 in terms of accuracy, box AP, mask AP and mIoU, respectively. This reparametrization includes changing the window size, MLP ratio and number of layers, among others. Adding the CNN-based stem of GC ViT to the model provides additional improvements of +0.3, +0.2, +0.2 and +0.2 in terms of accuracy, box AP, mask AP and mIoU. In addition, using our proposed downsampler further improves the accuracy, box AP, mask AP and mIoU by +0.4, +0.1, +0.1 and +0.3, respectively. The last two changes demonstrate the importance of convolutional inductive bias and of capturing inter-channel dependencies in our model. Finally, leveraging the proposed global self-attention improves the performance by +0.8, +0.8, +0.6 and +1.2 in terms of accuracy, box AP, mask AP and mIoU. This validates the effectiveness of the proposed global self-attention, in particular for downstream tasks with high-resolution images, such as semantic segmentation, in which modeling long-range spatial dependencies is critical.

5.2. DOWNSAMPLER DESIGN

We studied the effectiveness of various downsampler blocks in Table 5. The simplest alternative to our design is a pair of convolution and max pooling layers. However, it reduces ImageNet Top-1 accuracy by 0.7. Patch merging is another variant, introduced in the Swin Transformer (Liu et al., 2021).

However, it reduces the accuracy by 0.5. Finally, our down-sampler, which consists of a modified Fused-MBConv block followed by strided downsampling, shows the best result. The importance of the Fused-MBConv block is explained by its SE operation, which boosts cross-channel interaction while keeping the number of parameters and FLOPs low. We conclude that our proposed down-sampler is essential to achieving high accuracy, as it introduces convolutional inductive bias.

Down-sampler | Architecture                | Top-1
Conv         | Conv (s=1), Maxpool         | 82.7
Swin         | Linear (patch merging)      | 82.9
GC ViT       | Modified Fused-MBConv (s=2) | 83.4

Table 5 - Ablation study on the effectiveness of the downsampler in the GC ViT architecture, measured by ImageNet Top-1 accuracy.

6. INTERPRETABILITY

To provide further insights into the interpretability of the proposed global self-attention and query tokens, we visualize the learned attention and Grad-CAM (Selvaraju et al., 2017) maps.

EdgeViT: EdgeViT and GC ViT use completely different self-attention blocks. EdgeViT uses a series of local aggregation (convolution), sparse attention and local propagation (depth-wise convolution), whereas GC ViT only uses an interleaved pattern of local and global self-attention layers, without convolution, to compute self-attention. The proposed global sparse attention in EdgeViT and the global self-attention in GC ViT are completely different. EdgeViT samples representative tokens and only computes sparse self-attention between these representative tokens at reduced feature size. On the contrary, GC ViT computes self-attention between the global queries (not just representative tokens) and local keys and values, without any subsampling in their respective local regions. Furthermore, in EdgeViT, only the subsampled representative tokens of each region interact in the self-attention module; in GC ViT, however, the global queries interact with the entire local regions, instead of interacting with each other, and hence provide an effective mechanism for capturing both short- and long-range spatial dependencies. In addition, GC ViT generates global query tokens by applying a series of modified Fused-MBConv blocks to the entire image, without subsampling. Note that the resolution of the global query tokens is the same as that of the local keys and values. In EdgeViT, however, the representative tokens are obtained per local window, not from the entire image, by subsampling and reducing the feature resolution. Since the generated tokens have a lower resolution than their respective local windows, this could result in a loss of spatial information and impact the effectiveness of self-attention.
Unlike EdgeViT, the downsampler in GC ViT also benefits from modified Fused-MBConv blocks, which allow for modeling cross-channel interactions and impose more locality and convolutional inductive bias.

BigBird: BigBird, which was primarily introduced for NLP applications with 1D inputs, has significant differences from GC ViT, which is proposed for computer vision with mainly 2D inputs. First, BigBird uses a combination of random, window and global attention mechanisms, which differs from the proposed local and global self-attention scheme in GC ViT. In addition, BigBird does not have any specific mechanism for extracting global tokens, as existing tokens or additional special tokens can be designated as global tokens. On the contrary, the global tokens in GC ViT are extracted by the proposed global query generator module, which consists of a series of modified Fused-MBConv blocks that extract contextual information from the entire input features. Lastly, BigBird employs a set of global tokens which attend to the entire input sequence; in that case, selected global query, key and value tokens attend to the local query, key and value tensors. As opposed to this formulation, in GC ViT the global query tokens attend to local key and value tokens in partitioned windows. This is because attending to the entire input sequence, as done in BigBird, is not feasible given the larger size of input features in computer vision.

H IMAGENETV2 BENCHMARKS

In Table S.4, we evaluate the performance of GC ViT on the ImageNetV2 dataset (?) to further measure its robustness. Specifically, we use the Matched Frequency and Threshold-0.7 sampling strategies. These benchmarks demonstrate the competitive performance of GC ViT on the ImageNetV2 dataset and validate its robustness and generalizability.

I EFFECT OF GLOBAL CONTEXT MODULE

To demonstrate the effectiveness of the Global Context (GC) module, we use the Swin Transformer as the base model and add our proposed GC module. In this analysis, we remove the window shifting operation from the Swin Transformer, since the GC module is capable of modeling cross-region interactions. As shown in Table S.5, the addition of the GC module improves ImageNet Top-1 accuracy by +0.9% and +0.7% for the Swin Transformer Tiny and Small variants, respectively.

J IMAGENET CLASSIFICATION BENCHMARKS

In Table S.6, we provide a comprehensive benchmark in terms of Top-1 accuracy for models trained only on the ImageNet-1K (Deng et al., 2009) dataset, without additional data.



Figure 1 -Top-1 accuracy vs. model FLOPs/parameter size on ImageNet-1K dataset. GC ViT achieves new SOTA benchmarks for different model sizes as well as FLOPs, outperforming competing approaches by a significant margin. Best viewed in color.

Figure 3 -Downsampling block for dimension reduction.

Figure 4 -Attention formulation. Local attention is computed on feature patches within local window only (left). On the other hand, the global features are extracted from the entire input features and then repeated to form global query tokens. The global query is interacted with local key and value tokens, hence allowing to capture long-range information via cross-region interaction. Best viewed in color.

Figure 5 - Global query generator schematic diagram. The generator is designed to (i) transform an input feature map of the current stage with dimensions H, W, C, denoting height, width and channels, respectively; (ii) extract features by repeating the modified Fused-MBConv block, together with down-sampling, log2(H/h) times for dimension matching to the local window size h; (iii) reshape and repeat the output over the (H/h)^2 local windows so that the local tokens can attend to global contextual information. ⋆ denotes merged dimensions during reshaping.

Fig. 4 demonstrates the main idea behind our contribution. Local self-attention can only query patches within a local window, whereas global attention can query different image regions while still operating within the window. At each stage, the global query component is pre-computed as described in Sec. 2.1. The global self-attention uses the extracted global query tokens, obtained according to Eq. 2 and shared across all blocks, to interact with the local key and value representations. In addition, GC ViT employs alternating local and global self-attention blocks to effectively capture both local and global spatial information.
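Both attention branches operate on windowed tokens, which presupposes a standard window partition/reverse pair. A self-contained sketch (not taken verbatim from the paper) is:

```python
import torch

def window_partition(x, ws):
    # (B, H, W, C) -> (B * num_windows, ws*ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # inverse of window_partition: (B * num_windows, ws*ws, C) -> (B, H, W, C)
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)
```

The leading dimension of the partitioned tensor is exactly the B* = B × N batch used by the attention equations, which is why the global query must be repeated along that axis before interacting with local keys and values.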

Algorithm 1 - Global attention pseudocode in PyTorch style.

    # Input/output shape: (B*, N, C)
    # B*: batch size x num windows; H: height; W: width; C: dim
    # q_g: global token; F: num attention heads; N: num tokens per window

    def init():
        f = nn.Linear(C, 2 * C)
        softmax = nn.Softmax(dim=-1)

    def forward(x, q_g):
        B*, N, C = x.shape
        B, C, h, w = q_g.shape
        kv = f(x).reshape(B*, N, 2, F, C // F)
        kv = kv.permute(2, 0, 3, 1, 4)
        k, v = split(kv, (1, 1), 0)
        q_g = q_g.repeat(1, B* // B, 1, 1)
        q_g = q_g.reshape(B*, F, N, C // F)
        qk = matmul(q_g, k.transpose(-2, -1))
        attn = softmax(qk)
        return matmul(attn, v).reshape(B*, N, C)

(a) Original images from the ImageNet-1K validation set. (b) Global attention maps from the GC ViT model (ours). (c) Corresponding Grad-CAM maps.

Figure 7 -Visualization of : (a) input images (b) global self-attention maps from GC ViT-T model (c) corresponding Grad-CAM attention maps. Both short and long-range spatial dependencies are captured effectively. Please see the supplementary materials for illustration of learned global query token feature maps.

• We design an efficient CNN-like token generator that encodes spatial features at different resolutions for global query representations. • We propose global query tokens that can effectively capture contextual information in an efficient manner and model both local and global interactions. • We introduce a parameter-efficient downsampling module with modified Fused MB-Conv blocks that not only integrates inductive bias but also enables the modeling of inter-channel dependencies.

interactions of feature channels. The Convolutional vision Transformer (CvT) (Wu et al., 2021) introduced a convolutional token embedding layer and Transformer block in a hierarchical architecture to improve the efficiency and accuracy of ViTs. The Conditional Position encoding Vision Transformer (CPVT) (Chu et al., 2021b) demonstrated improved performance on tasks such as image classification and object detection by conditioning the position encoding on localized patch tokens. The Tokens-To-Token Vision Transformer (T2T-ViT) (Yuan et al., 2021) proposed a transformation layer for aggregating adjacent tokens and establishing an image prior by exploiting spatial correlations. The Pyramid Vision Transformer (PVT) (Wang et al., 2021) proposed a hierarchical architecture with patch embedding at the beginning of each stage and spatial dimension reduction to improve computational efficiency. Independently, the Swin Transformer (Liu et al., 2021) also proposed a hierarchical architecture in which self-attention is computed within local windows that are shifted for cross-region interaction. The Twins Transformer (Chu et al., 2021a) proposed a spatially separable self-attention with locally-grouped and global sub-sampling modules to improve efficiency. The Focal Transformer (Yang et al., 2021b) introduced focal self-attention to capture long-range spatial interactions. PVT-v2 (Wang et al., 2022) improved performance and efficiency over PVT (Wang et al., 2021) by introducing overlapping patch embedding, a convolutional feed-forward network and linear attention. EdgeViT (Pan et al., 2022) introduced a lightweight ViT model with global sparse attention and local aggregation and propagation modules for capturing short- and long-range information.

ConvNet. Since the advent of deep learning, CNNs (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Howard et al., 2017; He et al., 2016; Szegedy et al., 2016; Huang et al., 2017; Hu et al., 2018) have dominated computer vision benchmarks with SOTA performance. Recently, inspired by ViTs, ConvMixer (Trockman & Kolter, 2022) introduced a simple architecture with large-kernel depth-wise and point-wise convolutional layers and global pooling, achieving competitive performance for classification. Furthermore, ConvNeXt (Liu et al., 2022) proposed modifications to the architecture of ResNet (He et al., 2016) and achieved competitive benchmarks for classification, detection and segmentation tasks.

head, using a 3× learning-rate schedule with an initial learning rate of 0.0001, a batch size of 16 and a weight decay of 0.05. Following Liu et al. (2022), we compared against the Tiny, Small and Base model variants using Cascade Mask R-CNN, but only compared against the Tiny variant using Mask R-CNN. For semantic segmentation, we used the ADE20K dataset (Zhou et al., 2017) with a UPerNet (Xiao et al., 2018) segmentation head. Following previous efforts, we used a random crop size of 512 × 512 for the input images. For a fair assessment, we only compare against models with a backbone pre-trained on ImageNet-1K.

Image classification benchmarks on ImageNet-1K dataset(Deng et al., 2009).

Object detection and instance segmentation benchmarks using Mask R-CNN and Cascade Mask R-CNN on MS COCO dataset(Lin et al., 2014). All models employ 3× schedule.

Ablation study on the effectiveness of various components in GC ViT on classification, detection and segmentation performance.

Fig. 7 presents these visualizations. The attention distributions uncovered by the global self-attention modules align with image semantics, and hence act as an informative source for the local attention modules. In addition, the corresponding Grad-CAM maps demonstrate accurate object localization with the most intricate details.

7. CONCLUSION

We introduced a new vision transformer architecture, named GC ViT, that can efficiently capture global context by utilizing global query tokens that interact with local regions. Through extensive experiments, we showed SOTA benchmarks for image classification on the ImageNet-1K dataset, surpassing CNN- and ViT-based counterparts by a large margin. We also consistently achieved SOTA results on the downstream tasks of detection, instance segmentation and semantic segmentation on the MS COCO and ADE20K datasets.

Table 1 - Architecture configurations for GC ViT. LG-SA and Conv denote local/global self-attention and a 3 × 3 convolutional layer, respectively. GC ViT-XT, GC ViT-T, GC ViT-S and GC ViT-B denote the XTiny, Tiny, Small and Base variants, respectively.

We performed ablation studies to validate the effectiveness of the proposed global query. Using the same architecture, instead of a global query we compute: (1) global key and value features that interact with the local query; (2) global value features that interact with the local query and key. As shown in Table S.2, replacing the global query may significantly impact the performance for image classification and downstream tasks such as object detection, instance segmentation and semantic segmentation.

Table S.5 - Ablation study on the effectiveness of the Global Context (GC) module in the Swin Transformer architecture, measured by ImageNet Top-1 accuracy.

Table S.6 - Image classification benchmarks on the ImageNet-1K dataset (Deng et al., 2009).

Table S.3 - Ablation study on the effect of EMA and batch size on GC ViT-T ImageNet Top-1 accuracy (columns: Model, Local Batch Size, Global Batch Size, EMA, Top-1).

E TRAINING DETAILS

For image classification, GC ViT models were trained using four computational nodes with 32 NVIDIA A100 GPUs. The total training batch size is 1024 (32 per GPU) for GC ViT-S, GC ViT-B, GC ViT-L and 4096 (128 per GPU) for GC ViT-XXT, GC ViT-XT and GC ViT-T. On average, each model required 32 hours of training with the specified hyper-parameters as indicated in the paper. All classification models were trained using the timm package (Wightman, 2019) . Object detection and instance segmentation models as well as semantic segmentation models were trained using one computational node with 8 NVIDIA A40 GPUs using a total batch size of 16, hence a batch size of 2 per GPU. Detection and instance segmentation models were trained using mmdetection (Chen et al., 2019) package and on average required 56 hours of training. Semantic segmentation models were trained using mmsegmentation (Contributors, 2020) package, and on average required 34 hours of training.

F COMPLEXITY ANALYSIS

Given an input feature map x ∈ R^(H×W×C) at each stage, with a window size of h × w, the computational complexity of GC ViT is as follows. The efficient design of the global query token generator and other components maintains a computational complexity similar to that of the Swin Transformer (Liu et al., 2021), while capturing long-range information and achieving higher accuracy for classification and downstream tasks such as detection and segmentation.
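For reference, the standard per-layer accounting for window-based multi-head self-attention (following the Swin Transformer analysis) shows why the window size, not the image size, drives the quadratic term; the global-branch line below is a hedged sketch consistent with the text's claim of Swin-like complexity, not the paper's exact expression:

```latex
% Window-based MSA over an H x W x C map with h x w windows:
%   QKV projections: 3HWC^2; output projection: HWC^2;
%   attention within each (hw)-token window:
%     (HW/hw) \cdot 2(hw)^2 C = 2(hw)HWC
\Omega(\text{W-MSA}) = 4\,HWC^{2} + 2\,(hw)\,HWC
% The global branch reuses a query precomputed once per stage, so per block
% it needs only key/value projections plus the same windowed attention term:
\Omega(\text{global-SA}) \approx 3\,HWC^{2} + 2\,(hw)\,HWC
```

Both expressions are linear in HW for a fixed window size, which is the property that lets GC ViT scale to high-resolution inputs at a cost comparable to Swin.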

G COMPARISON TO OTHER GLOBAL SELF-ATTENTION MODULES

Other efforts such as EdgeViT (Pan et al., 2022) in computer vision and BigBird (Zaheer et al., 2020) in NLP have proposed global self-attention in their respective applications. In this section, we discuss the differences between the proposed global self-attention in GC ViT and these efforts.

