

ABSTRACT

Vision Transformers (ViTs) have recently become the state of the art across many computer vision tasks. In contrast to convolutional neural networks (CNNs), ViTs enable global information sharing even within the shallow layers of a network, i.e., among high-resolution features. However, this perk was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in high- and low-resolution features alike. GrafT can be easily adopted in both homogeneous and pyramid Transformers and shows consistent gains. It has the flexibility of branching out at arbitrary depths, widening a network with multiple scales. This grafting operation allows the branches to share most of the parameters and computations of the backbone, adding only minimal complexity but with a higher yield. Moreover, by progressively compounding multi-scale receptive fields, GrafT enables communication between local regions. We show the benefits of the proposed method on multiple benchmarks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection and instance segmentation (COCO 2017). Our code and models will be made available.

1. INTRODUCTION

The self-attention mechanism in Transformers (Bello et al., 2019) has been widely adopted in the language domain for some time now. It examines pairwise correlations among elements of an input sequence, learning long-range dependencies. More recently, following the seminal work on Vision Transformers (ViT) (Dosovitskiy et al., 2020), the vision community has also started exploiting this property,



Figure 1: We introduce GrafT, an add-on component that makes use of global and multi-scale dependencies at arbitrary depths of a network. (a) An overview of how GrafT modules are branched out (or grafted) from a backbone Transformer. Each GrafT module may consider multiple scales of features (i.e., token representations), widening a network efficiently while relying on the backbone to perform most of the computation. It can be adopted in both homogeneous (e.g., ViT (Dosovitskiy et al., 2020)) and pyramid (e.g., Swin (Liu et al., 2021)) architectures. (b) Performance-complexity trade-off of our Swin+GrafT in comparison with previous related methods. GrafT shows considerable performance gains with a minimal increase in complexity.
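To make the grafting idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation; all function names, pooling factors, and the parameter-free attention are our own illustrative assumptions): backbone tokens at some depth are average-pooled to coarser scales, the multi-scale token set is concatenated, and global attention is applied over it, so the branch reuses the backbone's features rather than recomputing them.

```python
import numpy as np

def pool_tokens(x, h, w, factor):
    """Average-pool a flattened (h*w, d) token grid by `factor` per side."""
    d = x.shape[1]
    g = x.reshape(h, w, d)
    g = g.reshape(h // factor, factor, w // factor, factor, d).mean(axis=(1, 3))
    return g.reshape(-1, d)

def global_attention(x):
    """Single-head, parameter-free self-attention over the token set."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
h, w, d = 8, 8, 16
tokens = rng.standard_normal((h * w, d))  # backbone features at one depth

# Graft point: build a multi-scale token set from the shared backbone features.
scales = [tokens] + [pool_tokens(tokens, h, w, f) for f in (2, 4)]
multi = np.concatenate(scales, axis=0)    # 64 + 16 + 4 = 84 tokens
out = global_attention(multi)
print(out.shape)                          # (84, 16)
```

Because the branch operates on pooled copies of existing features, the extra cost is dominated by one attention over a modestly enlarged token set, which is consistent with the minimal-complexity claim in the caption above.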

