GLOBAL SELF-ATTENTION NETWORKS FOR IMAGE RECOGNITION

Anonymous

Abstract

Recently, a series of works in computer vision have shown promising results on various image and video understanding tasks using self-attention. However, due to the quadratic computational and memory complexities of self-attention, these works either apply attention only to low-resolution feature maps in later stages of a deep network or restrict the receptive field of attention in each layer to a small local region. To overcome these limitations, this work introduces a new global self-attention module, referred to as the GSA module, which is efficient enough to serve as the backbone component of a deep network. This module consists of two parallel layers: a content attention layer that attends to pixels based only on their content, and a positional attention layer that attends to pixels based on their spatial locations. The output of this module is the sum of the outputs of the two layers. Based on the proposed GSA module, we introduce new standalone global attention-based deep networks that use GSA modules instead of convolutions to model pixel interactions. Due to the global extent of the proposed GSA module, a GSA network has the ability to model long-range pixel interactions throughout the network. Our experimental results show that GSA networks outperform the corresponding convolution-based networks significantly on the CIFAR-100 and ImageNet datasets while using fewer parameters and computations. The proposed GSA networks also outperform various existing attention-based networks on the ImageNet dataset.

1. INTRODUCTION

Self-attention is a mechanism in neural networks that focuses on modeling long-range dependencies. Its advantage in establishing global dependencies over other mechanisms, e.g., convolution and recurrence, has made it prevalent in modern deep learning. In computer vision, several recent works have augmented Convolutional Neural Networks (CNNs) with global self-attention modules and shown promising results on various image and video understanding tasks (Bello et al., 2019; Chen et al., 2018; Huang et al., 2019; Shen et al., 2018; Wang et al., 2018; Yue et al., 2018). For brevity, in the rest of the paper, we refer to self-attention simply as attention. The main challenge in using the global attention mechanism for computer vision tasks is the large spatial dimensions of the input. An input image in a computer vision task typically contains tens of thousands of pixels, and the quadratic computational and memory complexities of the attention mechanism make global attention prohibitively expensive for such large inputs. Because of this, earlier works such as Bello et al. (2019); Wang et al. (2018) restricted the use of the global attention mechanism to low-resolution feature maps in later stages of a deep network. Alternatively, other recent works such as Hu et al. (2019); Ramachandran et al. (2019); Zhao et al. (2020) restricted the receptive field of the attention operation to small local regions. While both these strategies are effective at capping the resource consumption of attention modules, they deprive the network of the ability to model long-range pixel interactions in its early and middle stages, preventing the attention mechanism from reaching its full potential. Different from the above works, Chen et al. (2018); Huang et al. (2019); Shen et al. (2018); Yue et al. (2018) made the global attention mechanism efficient by either removing the softmax normalization on the product of queries and keys and changing the order of the matrix multiplications involved in the attention computation (Chen et al., 2018; Shen et al., 2018; Yue et al., 2018), or decomposing one global attention layer into a sequence of multiple axial attention layers (Huang et al., 2019). However, all these works use content-only attention, which does not take the spatial arrangement of pixels into account. Since images are spatially-structured inputs, an attention mechanism that ignores spatial information is not best-suited for image understanding tasks on its own. Hence, these works incorporate attention modules into standard CNNs as auxiliary modules.

To address the above issues, we introduce a new global self-attention module, referred to as the GSA module, that performs attention taking both the content and the spatial positions of the pixels into account. This module consists of two parallel layers: a content attention layer and a positional attention layer, whose outputs are summed at the end. The content attention layer attends to all the pixels at once based only on their content. It uses an efficient global attention mechanism similar to Chen et al. (2018); Shen et al. (2018) whose computational and memory complexities are linear in the number of pixels. The positional attention layer computes the attention map for each pixel based on its own content and its relative spatial positions with respect to the other pixels. Following the axial formulation (Ho et al., 2019; Huang et al., 2019), the positional attention layer is implemented as a column-only attention layer followed by a row-only attention layer. The computational and memory complexities of this axial positional attention layer are O(N√N), where N is the number of pixels.

The proposed GSA module is efficient enough to act as the backbone component of a deep network. Based on this module, we introduce new standalone global attention-based deep networks, referred to as global self-attention networks. A GSA network uses GSA modules instead of convolutions to model pixel interactions. By virtue of the global extent of the GSA module, a GSA network has the ability to model long-range pixel interactions throughout the network. Recently, Wang et al. (2020) also introduced standalone global attention-based deep networks that use an axial attention mechanism for both content and positional attention. Different from Wang et al. (2020), the proposed GSA module uses a non-axial global content attention mechanism that attends to the entire image at once rather than just a row or a column. Our experimental results show that GSA-ResNet, a GSA network that adopts the ResNet (He et al., 2016) structure, outperforms the original convolution-based ResNet and various recent global or local attention-based ResNets on the widely-used ImageNet dataset.

MAJOR CONTRIBUTIONS

• GSA module: We introduce a new global attention module that is efficient enough to act as the backbone component of a deep network. Different from Wang et al. (2018); Yue et al. (2018); Chen et al. (2018); Shen et al. (2018); Huang et al. (2019), the proposed module attends to pixels based on both content and spatial positions. Different from Zhao et al. (2020); Hu et al. (2019); Ramachandran et al. (2019), the proposed module attends to the entire input rather than a small local neighborhood. Different from Wang et al. (2020), the proposed GSA module uses a non-axial global content attention mechanism that attends to the entire image at once rather than just a row or a column.

• GSA network: We introduce new standalone global attention-based networks that use GSA modules instead of spatial convolutions to model pixel interactions. This is one of the first works (Wang et al. (2020) being the only other work) to explore standalone global attention-based networks for image understanding tasks. Existing global attention-based works insert their attention modules into CNNs as auxiliary blocks at later stages of the network, and existing standalone attention-based networks use local attention modules.

• Experiments: We show that the proposed GSA networks outperform the corresponding CNNs significantly on the CIFAR-100 and ImageNet datasets while using fewer parameters and computations. We also show that the GSA networks outperform various existing attention-based networks, including the latest standalone global attention-based network of Wang et al. (2020), on the ImageNet dataset.

2. RELATED WORKS

2.1 AUXILIARY VISUAL ATTENTION

Wang et al. (2018) proposed the non-local block, which is the first adaptation of the dot-product attention mechanism for long-range dependency modeling in computer vision. They empirically verified its effectiveness on video classification and object detection. Follow-up works extended it to
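As a concrete illustration of the efficient content attention discussed in the introduction (removing the softmax on the query-key product and reordering the matrix multiplications, as in Chen et al. (2018); Shen et al. (2018)), the following NumPy sketch shows how the reordering reduces the cost from quadratic to linear in the number of pixels. All function names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def efficient_content_attention(q, k, v):
    """Global content attention in O(n * d * d_v) time and memory.

    q, k: (n, d) queries and keys; v: (n, d_v) values, for n pixels.
    Normalizing the keys with a softmax over the n pixels (instead of
    computing softmax(q k^T)) lets us multiply k^T v first, so the
    n x n attention map of standard attention is never formed.
    """
    k = np.exp(k - k.max(axis=0, keepdims=True))
    k = k / k.sum(axis=0, keepdims=True)   # softmax over pixels, per channel
    context = k.T @ v                      # (d, d_v): one global summary
    return q @ context                     # (n, d_v)

n, d = 32 * 32, 8                          # a 32x32 feature map, 8 channels
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, n, d))
print(efficient_content_attention(q, k, v).shape)   # (1024, 8)
```

Note that this is not mathematically equivalent to standard softmax attention; it trades the exact attention map for linear complexity, which is what makes a global receptive field affordable at every stage of the network.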


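The axial positional attention layer described in the introduction (a column-only pass followed by a row-only pass, with attention maps driven by relative spatial positions) can be sketched in the same spirit. This is a simplified illustration under stated assumptions: learned projections are omitted, the input serves as both query and value, the feature map is square, and `rel_emb` stands in for the learned relative position embeddings; none of these names come from the paper.

```python
import numpy as np

def axial_positional_attention(x, rel_emb):
    # x: (H, W, d) feature map with H == W == L; rel_emb: (2L-1, d)
    # relative position embeddings shared by both passes. Attention logits
    # come from each pixel's content (its query) dotted with the embedding
    # of its relative offset to every other pixel on the same axis. Two
    # axial passes cost O(N * sqrt(N)) for N = H * W pixels, versus
    # O(N^2) for full 2D global attention.
    def axial_pass(t):                       # t: (rows, L, d), attend along axis 1
        rows, L, _ = t.shape
        idx = np.arange(L)[None, :] - np.arange(L)[:, None] + (L - 1)
        r = rel_emb[idx]                     # (L, L, d): embedding for offset j - i
        logits = np.einsum('nid,ijd->nij', t, r)
        a = np.exp(logits - logits.max(axis=-1, keepdims=True))
        a = a / a.sum(axis=-1, keepdims=True)        # softmax along the axis
        return np.einsum('nij,njd->nid', a, t)

    x = axial_pass(x.transpose(1, 0, 2)).transpose(1, 0, 2)  # column-only pass
    return axial_pass(x)                                     # row-only pass

H = W = 8; d = 4
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, d))
rel_emb_demo = rng.standard_normal((2 * H - 1, d))
print(axial_positional_attention(x, rel_emb_demo).shape)   # (8, 8, 4)
```

Each pass builds attention maps of size rows × L × L = N√N entries, which matches the O(N√N) complexity stated above; summing this layer's output with the content attention output would give the overall GSA module structure.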