


[Figure 1(b): top-1 accuracy (%) vs. complexity, comparing Swin+GrafT against Swin, RegionViT, PoolFormer, PvT, TNT, T2T, and ViL]

Figure 1: We introduce GrafT, an add-on component which makes use of global and multi-scale dependencies at arbitrary depths of a network. (a) An overview of how GrafT modules are branched out (or grafted) from a backbone Transformer. Each GrafT may consider multiple scales of features (i.e., token representations), widening a network efficiently while relying on the backbone to perform most of the computations. It can be adopted in both homogeneous (e.g., ViT (Dosovitskiy et al., 2020)) and pyramid (e.g., Swin (Liu et al., 2021)) architectures. (b) Performance-complexity trade-off of our Swin+GrafT, in comparison with previous related methods. GrafT shows considerable performance gains with a minimal increase in complexity.

ABSTRACT

Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within shallow layers of a network, i.e., among high-resolution features. However, this perk was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in high- and low-resolution features alike. GrafT can be easily adopted in both homogeneous and pyramid Transformers while showing consistent gains. It has the flexibility of branching out at arbitrary depths, widening a network with multiple scales. This grafting operation enables us to share most of the parameters and computations of the backbone, adding only minimal complexity, but with a higher yield. In fact, the process of progressively compounding multi-scale receptive fields in GrafT enables communication between local regions. We show the benefits of the proposed method on multiple benchmarks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), object detection and instance segmentation (COCO2017). Our code and models will be made available.

1. INTRODUCTION

The self-attention mechanism in Transformers (Bello et al., 2019) has been widely adopted in the language domain for some time now. It models pairwise correlations between elements of an input sequence, learning long-range dependencies. More recently, following the seminal work on Vision Transformers (ViT) (Dosovitskiy et al., 2020), the vision community has also started exploiting this property, showing state-of-the-art results on various tasks including classification, segmentation, and detection, outperforming convolutional networks (CNNs) (He et al., 2016; Tan & Le, 2019; Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019; Brock et al., 2021). Inspired by this success, many variants of vision Transformers (e.g., DeiT (Touvron et al., 2021), CrossViT (Chen et al., 2021), TNT (Han et al., 2021)) emerged, inheriting the same homogeneous structure of ViT (i.e., a structure w/o downsampling). However, due to the quadratic complexity of attention, such a structure becomes expensive, especially for high-resolution inputs, and does not benefit from the semantically-rich information present in multi-scale representations. To address these shortcomings, Transformers with pyramid structures (i.e., structures w/ downsampling) such as Swin (Liu et al., 2021) were introduced, combining hierarchical downsampling with window-based attention to learn multi-scale representations at a computational complexity linear in the input resolution. As a result, pyramid structures are better suited for tasks such as segmentation and detection. Still, multiple scales arise only deep into the network due to stage-wise downsampling, meaning that only the latter stages of the model may benefit from them. Thus, we pose the question: what if we could introduce multi-scale information even at the early stages of a Transformer, without incurring a heavy computational burden?
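To make the complexity gap concrete, the standard FLOP approximations for global vs. window attention can be compared with a few lines of arithmetic. This is a rough sketch, not a measurement: the token count, channel dimension, and window size below are illustrative values, matching a 56 × 56 feature map and 7 × 7 attention windows.

```python
# Rough multiply-add counts for the two attention steps (QK^T and attn*V),
# ignoring the linear projections; illustrative values only.
def global_attn_cost(n_tokens, dim):
    # every token attends to every token: O(n^2 * d), quadratic in tokens
    return 2 * n_tokens ** 2 * dim

def window_attn_cost(n_tokens, dim, window=7):
    # every token attends only within its (window x window) neighborhood:
    # O(n * M^2 * d), linear in the number of tokens
    return 2 * n_tokens * window ** 2 * dim

n, d = 56 * 56, 96  # e.g. a high-resolution early stage
ratio = global_attn_cost(n, d) / window_attn_cost(n, d)
print(ratio)  # 64.0: global attention is 64x costlier at this resolution
```

The ratio simplifies to n / M^2, so the advantage of window attention grows directly with input resolution, which is why the gap matters most in early, high-resolution stages.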
Previous work has also explored this direction, both in CNNs (Szegedy et al., 2015a) and in Transformers (Chen et al., 2021; 2022). However, models such as CrossViT require careful tuning of the spatial ratio between the feature maps of their two branches, and RegionViT needs considerable modifications to handle multi-scale training. To mitigate these issues, in this paper, we propose a simple and efficient add-on component called GrafT (see Figure 1-(a)). It can be easily adopted in existing homogeneous or pyramid architectures, enabling multi-scale features throughout a network (even in shallow layers) and showing consistent performance gains, while being computationally lightweight. GrafT is applicable at any arbitrary layer of a network. It consists of three main components: (1) a left-right pathway for downsampling, (2) a right-left pathway for upsampling, and (3) a bottom-up connection for information sharing at each scale. The left-right pathway uses a series of average pooling operations to create a set of multi-scale representations. For instance, if GrafT is attached to a layer with (56 × 56) resolution, it can create scales of (28 × 28), (14 × 14), and (7 × 7). We then process information at each scale with an L-MSA block, a local self-attention mechanism (e.g., window attention), which becomes global attention at the coarsest scale, as the window size matches the resolution. Next, the right-left pathway uses a series of learnable, window-based bilinear interpolation (W-Bilinear) operations to generate high-resolution features by upsampling the low-resolution outputs of L-MSA, which contain global (or high-level) semantics extracted efficiently at a lower resolution. These upsampled features are merged with the high-resolution features of the branch to the left, which contain lower-level semantics, as also done in Feature Pyramid Networks (Lin et al., 2017). Refer to Figure 2-(b) for a detailed view.
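The two pathways above can be sketched in a few lines of NumPy. This is a minimal, non-learnable illustration of the data flow, not the paper's implementation: 2×2 average pooling stands in for the left-right downsampling, nearest-neighbor upsampling stands in for the learnable W-Bilinear operation, the per-scale L-MSA blocks are omitted, and merging is a simple addition.

```python
import numpy as np

def avg_pool2x2(x):
    # x: (H, W, C) feature map; 2x2 average pooling halves the resolution
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2x(x):
    # nearest-neighbor upsampling; the paper's W-Bilinear is a learnable,
    # window-based bilinear interpolation -- this is a non-learnable stand-in
    return x.repeat(2, axis=0).repeat(2, axis=1)

def graft_multiscale(x, num_scales=4):
    # left-right pathway: progressively average-pool to build a scale pyramid
    pyramid = [x]
    for _ in range(num_scales - 1):
        pyramid.append(avg_pool2x2(pyramid[-1]))
    # (in GrafT, each scale would be processed by an L-MSA block here)
    # right-left pathway: upsample coarse features and merge into finer ones
    merged = pyramid[-1]
    for finer in reversed(pyramid[:-1]):
        merged = finer + upsample2x(merged)
    return pyramid, merged

x = np.random.rand(56, 56, 96)  # e.g. a high-resolution early-stage feature map
pyramid, out = graft_multiscale(x)
print([p.shape[:2] for p in pyramid])  # [(56, 56), (28, 28), (14, 14), (7, 7)]
print(out.shape)                       # (56, 56, 96)
```

The right-left merge mirrors the top-down pathway of Feature Pyramid Networks: coarse features carrying high-level semantics are repeatedly upsampled and fused into the finer scales until the original resolution is restored.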
GrafT is unique in the sense that it can extract multi-scale information at any given layer of a Transformer, while also being efficient. It relies on the backbone to do the heavy lifting, incurring minimal computational overhead within the grafted branches, in contrast to maintaining completely separate branches as in CrossViT (Chen et al., 2021). In our evaluations, we show the benefits of GrafT in both homogeneous (ViT) and pyramid (Swin) architectures, across multiple benchmarks: on ImageNet-1K (Deng et al., 2009), +3.9% with ViT-T+GrafT, +1.4% with Swin-T+GrafT, and +0.5% with ; on COCO2017 (Lin et al., 2014), +1.1 AP^b for object detection and +0.8 AP^m for instance segmentation with Swin-T+GrafT; and on ADE20K (Zhou et al., 2017) semantic segmentation, +1.0 mIoU (single-scale) and +1.3 mIoU (multi-scale) with Swin-T+GrafT. Figure 1-(b) shows the performance-complexity trade-off of Swin-T+GrafT on ImageNet-1K.

2. GRAFTING VISION TRANSFORMERS

Our goal is to provide multi-scale global information to the backbone Transformer from the bottom layers, so that high-level semantics from GrafT can help the Transformer construct more effective features. Since GrafT is modular, it can be applied to various Transformer architectures. We select two representative Transformers, ViT (Dosovitskiy et al., 2020), based on a homogeneous structure, and Swin (Liu et al., 2021), based on a pyramid structure, to show that GrafT is a general-purpose module.

