MOAT: ALTERNATING MOBILE CONVOLUTION AND ATTENTION BRINGS STRONG VISION MODELS

Abstract

This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike the current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% / 81.5% top-1 accuracy on ImageNet-1K / ImageNet-1K-V2 with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus cross-windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% AP box with 227M model parameters (single-scale inference, and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention. Code is publicly available.

1. INTRODUCTION

The vision community has witnessed the prevalence of self-attention (Bahdanau et al., 2015) and Transformers (Vaswani et al., 2017) . The success of Transformers in natural language processing motivates the creation of their variants for vision recognition. The Vision Transformer (ViT) (Dosovitskiy et al., 2021) has great representation capacity with global receptive field. However, it requires pretraining on a large-scale proprietary dataset (Sun et al., 2017) . Its unsatisfying performance, when trained with a small number of images, calls for the need of better training recipes (Touvron et al., 2021a; Steiner et al., 2021) or architectural designs (Liu et al., 2021; Graham et al., 2021) . On the other hand, ConvNet has been the dominant network choice since the advent of AlexNet (Krizhevsky et al., 2012) in 2012. Vision researchers have condensed the years of network design experience into multiple principles, and have started to incorporate them to vision transformers. For example, there are some works adopting the ConvNet's hierarchical structure to extract multi-scale features for vision transformers (Liu et al., 2021; Fan et al., 2021; Wang et al., 2022) , and others proposing to integrate the translation equivariance of convolution into transformers (Graham et al., 2021; d'Ascoli et al., 2021; Xiao et al., 2021) . Along the same direction of combining the best from Transformers and ConvNets, CoAtNet (Dai et al., 2021) and MobileViT (Mehta & Rastegari, 2022a) demonstrate outstanding performance by stacking Mobile Convolution (MBConv) blocks (i.e., inverted residual blocks (Sandler et al., 2018) ) and Transformer blocks (i.e., a self-attention layer and a Multi-Layer Perceptron (MLP)). However, both works focus on the macro-level network design. 
They consider MBConv and Transformer blocks as individual separate ones, and systematically study the effect of stacking them to strike a better balance between the remarkable efficiency of MBConv and strong capacity of Transformer. In this work, on the contrary, we study the micro-level building block design by taking a deeper look at the combination of MBConv and Transformer blocks. We make two key observations after a careful examination of those blocks. First, the MLP module in Transformer block is similar to MBConv, as both adopt the inverted bottleneck design. However, MBConv is a more powerful operation by employing one extra 3 × 3 depthwise convolution (to encode local interaction between pixels), and more activation (Hendrycks & Gimpel, 2016) and normalization (Ioffe & Szegedy, 2015) are employed between convolutions. Second, to extract multi-scale features using Transformer blocks, one may apply the average-pooling (with stride 2) to input features before the self-attention layer. However, the pooling operation reduces the representation capacity of self-attention. Our observations motivate us to propose a novel MObile convolution with ATtention (MOAT) block, which efficiently combines MBConv and Transformer blocks. The proposed MOAT block modifies the Transformer block by first replacing its MLP with a MBConv block, and then reversing the order of attention and MBConv. The replacement of MLP with MBConv brings more representation capacity to the network, and reversing the order (MBConv comes before self-attention) delegates the downsampling duty to the strided depthwise convolution within the MBConv, learning a better downsampling kernel. We further develop a family of MOAT models by stacking and increasing the channels of network blocks. Surprisingly, our extremely simple design results in a remarkable impact. 
On the challenging ImageNet-1K classification benchmark (Russakovsky et al., 2015) , our model (190M parameters) achieves 86.7% top-1 accuracy without extra data. When further pretraining on ImageNet-22K, our best model (483M parameters) attains 89.1% / 81.5% top-1 accuracy on ImageNet-1K (Tab. 2) / ImageNet-1K-V2 (Tab. 9), setting a new state-of-the-art. Additionally, MOAT can be seamlessly deployed to downstream tasks that require large resolution inputs by simply converting the global attention to non-overlapping local window attention. Thanks to the MBConv that effectively exchanges local information between pixels (enabling cross-window propagation), MOAT does not need the extra window-shifting mechanism (Liu et al., 2021) . As a result, on COCO object detection (Lin et al., 2014) and ADE20K semantic segmentation (Zhou et al., 2019) , MOAT shows superior performances. Specifically, on COCO object detection (Tab. 3), our best model (227M parameters), achieves 59.2% AP box with single-scale inference and hard NMS, setting a new state-of-the-art in the regime of model size 200M with Cascade Mask R-CNN (Cai & Vasconcelos, 2018; He et al., 2017) . On ADE20K semantic segmentation (Tab. 4), our best model (496M parameters), adopting DeepLabv3+ (Chen et al., 2018) , attains 57.6% mIoU with single-scale inference, also setting a new state-of-the-art in the regime of models using input size 641 × 641. Finally, to explore the scalability of MOAT models, we simply scale down the models by reducing the channel sizes (without any other change), resulting in the tiny-MOAT family, which also surprisingly outperforms mobile-specific transformer-based models, such as Mobile-Former (Chen et al., 2022c) and MobileViTs (Mehta & Rastegari, 2022a; b) . Specifically, in the regime of model parameters 5M, 10M, and 20M, our tiny MOAT outperforms the concurrent MobileViTv2 (Mehta & Rastegari, 2022b ) by 1.1%, 1.3%, and 2.0% top-1 accuracy on ImageNet-1K classification benchmark (Tab. 
5). Furthermore, we benchmark tiny-MOAT on COCO object detection and ADE20K semantic segmentation. In summary, our method advocates the design principle of simplicity. Without inventing extra complicated operations, the proposed MOAT block effectively merges the strengths of both mobile convolution and self-attention into one block by a careful redesign. Despite its conceptual simplicity, impressive results have been obtained on multiple core vision recognition tasks. We hope our study will inspire future research on seamless integration of convolution and self-attention.

2. METHOD

Herein, we review the Mobile Convolution (MBConv) (Sandler et al., 2018) and Transformer (Vaswani et al., 2017) blocks before introducing the proposed MOAT block. We then present MOAT, a family of neural networks, targeting at different trade-offs between accuracy and model complexity.

2.1. MOBILE CONVOLUTION AND TRANSFORMER BLOCKS

MBConv block. Also known as the inverted residual block, the Mobile Convolution (MBConv) (Sandler et al., 2018) block (Fig. 1 (a)) is an effective building block that has been widely used in mobile models (Howard et al., 2019; Mehta & Rastegari, 2022a) or efficient models (Tan & Le, 2019; Dai et al., 2021). Unlike the bottleneck block in ResNet (He et al., 2016a), the MBConv block employs the design of an "inverted bottleneck", together with the efficient depthwise convolution (Howard et al., 2017). Specifically, a 1 × 1 convolution is first applied to expand the input channels by a factor of 4. Then, a 3 × 3 depthwise convolution is used to effectively capture the local spatial interactions between pixels. Finally, the features are projected back to the original channel size via a 1 × 1 convolution, enabling a residual connection (He et al., 2016a). An optional Squeeze-and-Excitation (SE) (Hu et al., 2018) module (which uses the global information to re-weight the channel activation) may also be used after the depthwise convolution, following MobileNetV3 (Howard et al., 2019). Note that one could tune the channel expansion ratio and depthwise convolution kernel size for better performance. We fix them throughout the experiments for simplicity. Formally, given an input tensor $x \in \mathbb{R}^{H \times W \times C}$ ($H$, $W$, $C$ are its height, width, and channels), the MBConv block is represented as follows:

$$\mathrm{MBConv}(x) = x + (N_2 \circ S \circ D \circ N_1)(\mathrm{BN}(x)), \tag{1}$$
$$N_1(x) = \mathrm{GeLU}(\mathrm{BN}(\mathrm{Conv}(x))), \tag{2}$$
$$D(x) = \mathrm{GeLU}(\mathrm{BN}(\mathrm{DepthConv}(x))), \tag{3}$$
$$S(x) = \sigma(\mathrm{MLP}(\mathrm{GAP}(x))) \cdot x, \tag{4}$$
$$N_2(x) = \mathrm{Conv}(x), \tag{5}$$

where BN, GeLU, GAP, and MLP stand for Batch Normalization (Ioffe & Szegedy, 2015), Gaussian error Linear Unit (Hendrycks & Gimpel, 2016), Global Average Pooling, and Multi-Layer Perceptron (with reduction ratio 4 and hard-swish (Ramachandran et al., 2017)), respectively.
The MBConv block consists of four main functions: $N_1$, $D$, $S$, and $N_2$, which correspond to the 1 × 1 convolution for channel expansion (by 4×), 3 × 3 depthwise convolution, squeeze-and-excitation (Hu et al., 2018) ($\sigma$ is the sigmoid function), and 1 × 1 convolution for channel projection (by 4×), respectively.

Transformer block. The Transformer (Vaswani et al., 2017) block (Fig. 1 (b)) is a powerful building block that effectively captures the global information via the data-dependent self-attention operation. It consists of two main operations: self-attention and MLP. The self-attention operation computes the attention map based on the pairwise similarity between every pair of pixels in the input tensor, thus enabling the model's receptive field to encompass the entire spatial domain. Additionally, the attention map dynamically depends on the input, enlarging the model's representation capacity (unlike the convolution kernels, which are data-independent). The MLP operation contains two 1 × 1 convolutions, where the first one expands the channels (by 4×), the second one shrinks back the channels, and GeLU non-linearity is used in-between. Formally, given an input tensor $x \in \mathbb{R}^{H \times W \times C}$, the Transformer block is represented as follows:

$$\mathrm{Transformer}(x) = x + (M_2 \circ M_1 \circ \mathrm{Attn})(\mathrm{LN}(x)), \tag{6}$$
$$M_1(x) = \mathrm{GeLU}(\mathrm{Conv}(\mathrm{LN}(x))), \tag{7}$$
$$M_2(x) = \mathrm{Conv}(x), \tag{8}$$

where LN and Attn denote the Layer Normalization (Ba et al., 2016) and self-attention (Vaswani et al., 2017). The self-attention operation also includes a residual connection (He et al., 2016a), which is not shown in the equations for simplicity, while the MLP operation is represented by two functions $M_1$ and $M_2$, which correspond to the 1 × 1 convolution for channel expansion (by 4×) and 1 × 1 convolution for channel projection, respectively.
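The structural difference between the Transformer MLP and MBConv (without SE) can be made concrete by counting convolution weights. The following is a minimal sketch, not the authors' implementation: the function names are illustrative, the channel size `c = 256` is a hypothetical value chosen only for the example, and biases, normalization parameters, and the optional SE module are ignored.

```python
def mlp_weights(c: int, expansion: int = 4) -> int:
    """Weights of the Transformer MLP: two 1x1 convolutions (expand, project)."""
    hidden = c * expansion
    return c * hidden + hidden * c


def mbconv_weights(c: int, expansion: int = 4, kernel: int = 3) -> int:
    """Weights of MBConv without SE: 1x1 expand, k x k depthwise, 1x1 project."""
    hidden = c * expansion
    depthwise = kernel * kernel * hidden  # one k x k filter per expanded channel
    return c * hidden + depthwise + hidden * c


if __name__ == "__main__":
    c = 256  # hypothetical channel size, used only for illustration
    extra = mbconv_weights(c) - mlp_weights(c)
    share = 100 * extra / mlp_weights(c)
    print(f"MLP: {mlp_weights(c)}, MBConv (w/o SE): {mbconv_weights(c)}")
    print(f"extra depthwise weights: {extra} ({share:.2f}% of the MLP)")
```

Under these assumptions the depthwise convolution adds 36c weights on top of the 8c² weights of the two 1 × 1 convolutions, so its relative overhead shrinks as the channel size grows.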

2.2. MOBILE CONVOLUTION WITH ATTENTION (MOAT) BLOCK

Comparing MBConv and Transformer blocks. Before getting into the architecture of our MOAT block, it is worthwhile to compare the MBConv (Sandler et al., 2018) and Transformer (Vaswani et al., 2017) blocks, which helps to understand our design motivations. Specifically, we make the following key observations. First, both MBConv and Transformer blocks advocate the "inverted bottleneck" design, where the channels of input tensors are expanded and then projected by 1 × 1 convolutions. However, MBConv additionally employs a 3 × 3 depthwise convolution between those two 1 × 1 convolutions, and there are both batch normalization and GeLU activation between the convolutions. Second, to capture the global information, the MBConv block may employ a Squeeze-and-Excitation (SE) module, while the Transformer block adopts the self-attention operation. Note that the SE module squeezes the spatial information via a global average pooling, while the self-attention module maintains the tensor's spatial resolution. Third, the downsampling operation is performed at different places within the block. To downsample the features, the standard MBConv block uses the strided depthwise convolution, while the Transformer block, deployed in the modern hybrid model CoAtNet (Dai et al., 2021) , adopts an average-pooling operation before the self-attention. MOAT block. Given the above observations, we now attempt to design a new block that effectively merges the best from both MBConv and Transformer blocks. We begin with the powerful Transformer block, and gradually refine over it. Based on the first observation, both MBConv and Transformer blocks employ the "inverted bottleneck" design. Since depthwise convolution could effectively encode local interaction between pixels, which is crucial for modeling the translation equivariance in ConvNets, we thus start to add the depthwise convolution to Transformer's MLP module. 
However, we did not observe any performance improvement until we also added the extra normalization and activations between convolutions. For the second observation, we simply do not add the SE module to the MBConv block. The self-attention operation is kept to capture the global information. We found the third observation critical. The downsampling operation (average-pooling) right before the self-attention operation in Transformer block slightly reduces its representation capacity. On the other hand, the MBConv block is well-designed for the downsampling operation with the strided depthwise convolution, which effectively learns the downsampling convolution kernel for each input channel. Therefore, we further reorder the "inverted bottleneck" (containing depthwise convolution) before the self-attention operation, delegating the downsampling operation to depthwise convolution. In this way, we need no extra downsampling layer like average-pooling in CoAtNet (Dai et al., 2021), or patch-embedding layers in Swin (Liu et al., 2021) and ConvNeXt (Liu et al., 2022b). Finally, it results in our MObile convolution with ATtention (MOAT) block, as illustrated in Fig. 1 (c). Formally, given an input tensor $x \in \mathbb{R}^{H \times W \times C}$, the MOAT block is represented as follows:

$$\mathrm{MOAT}(x) = x + (\mathrm{Attn} \circ N_2 \circ D \circ N_1)(\mathrm{BN}(x)), \tag{9}$$

where MBConv (w/o SE) contains functions $N_1$ (Eq. 2), $D$ (Eq. 3), and $N_2$ (Eq. 5), and Attn denotes the self-attention operation. The MOAT block then simply consists of MBConv (w/o SE) and the self-attention operation, successfully combining the best from the MBConv block and Transformer block into one (which we will show empirically).
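The order of operations in the MOAT block can be illustrated by tracing tensor shapes through the residual branch N1, D, N2, Attn. This is a hedged sketch rather than the reference implementation: the helper names are hypothetical, the output channels are kept equal to the input channels for simplicity, "same" padding is assumed for the depthwise convolution, and the normalization layers and the shortcut path are omitted because they do not change the branch shapes.

```python
from typing import Callable, List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)
Op = Tuple[str, Callable[[Shape], Shape]]


def conv1x1(name: str, out_channels: int) -> Op:
    # A 1x1 convolution only changes the channel dimension.
    return name, lambda s: (s[0], s[1], out_channels)


def depthwise3x3(name: str, stride: int) -> Op:
    # Assumed "same" padding: spatial size becomes ceil(size / stride).
    return name, lambda s: (-(-s[0] // stride), -(-s[1] // stride), s[2])


def attention(name: str) -> Op:
    # Self-attention keeps the shape; it mixes the H * W tokens.
    return name, lambda s: s


def trace_moat_branch(x: Shape, stride: int = 1,
                      expansion: int = 4) -> List[Tuple[str, Shape]]:
    """Apply N1, D, N2, Attn in order and record the shape after each step."""
    channels = x[2]
    ops = [
        conv1x1("N1 (1x1 expand)", channels * expansion),
        depthwise3x3("D (3x3 depthwise)", stride),
        conv1x1("N2 (1x1 project)", channels),
        attention("Attn"),
    ]
    trace = []
    for name, fn in ops:
        x = fn(x)
        trace.append((name, x))
    return trace


if __name__ == "__main__":
    for name, shape in trace_moat_branch((28, 28, 64), stride=2):
        print(f"{name:20s} -> {shape}")
```

With `stride=2`, the strided depthwise convolution has already halved the spatial size by the time attention runs, so attention operates on the downsampled tokens without a separate pooling layer.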

2.3. META ARCHITECTURE

Macro-level network design. After developing the MOAT block, we then study how to effectively stack them to form our base model. We adopt the same strategy as the existing works (Liu et al., 2021; Wang et al., 2021b; Graham et al., 2021; Xiao et al., 2021; Dai et al., 2021; Mehta & Rastegari, 2022a). Specifically, we summarize several key findings from those works, and use them as design principles of our meta architecture.
• Employing convolutions in the early stages improves the performance and training convergence of Transformer models (Wu et al., 2021; Graham et al., 2021; Xiao et al., 2021).
• The Mobile Convolution (MBConv) (Sandler et al., 2018) blocks are also effective building blocks in the hybrid Conv-Transformer models (Dai et al., 2021; Mehta & Rastegari, 2022a).
• Extracting multi-scale backbone features benefits the downstream tasks, such as detection and segmentation (Liu et al., 2021; Wang et al., 2021b; Fan et al., 2021; Heo et al., 2021).
As a result, our meta architecture consists of the convolutional stem, MBConv blocks, and MOAT blocks. Additionally, through the ablation study in the appendix, we found the layer layout proposed by CoAtNet-1 (Dai et al., 2021) effective. We thus follow their layer layout, resulting in our base model MOAT-1. To form the MOAT model family, we then scale MOAT-1 down or up in the number of blocks and the number of channels, as shown in Tab. 1. We only scale the number of blocks in the third and fourth stages (out of five stages). The downsampling operation is performed in the first block of each stage. Note that our base model MOAT-1 and CoAtNet-1 share the same layer layout and channel sizes. However, we take a different scaling strategy: our MOAT is scaled up (or down) by alternately increasing the depth and expanding the width between variants.
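The five-stage meta architecture can be summarized as a small configuration. The sketch below uses the MOAT-1 layout {2, 2, 6, 14, 2} reported in the appendix ablation and places MOAT blocks in the last two stages, as stated in the ablation text. The names are illustrative, channel sizes are omitted because Tab. 1 is not reproduced here, and a downsampling factor of 2 per stage is an assumption.

```python
from typing import List, Sequence, Tuple

# Blocks per stage for MOAT-1, as reported in the meta-architecture ablation.
MOAT1_LAYOUT: Tuple[int, ...] = (2, 2, 6, 14, 2)
# Stage 1 is the convolutional stem; MOAT blocks fill the last two stages.
BLOCK_TYPES: Tuple[str, ...] = ("conv-stem", "MBConv", "MBConv", "MOAT", "MOAT")


def describe_stages(input_size: int,
                    layout: Sequence[int] = MOAT1_LAYOUT,
                    block_types: Sequence[str] = BLOCK_TYPES,
                    ) -> List[Tuple[int, str, int, int]]:
    """Return (stage index, block type, number of blocks, output feature size)."""
    if len(layout) != len(block_types):
        raise ValueError("layout and block_types must have the same length")
    stages = []
    size = input_size
    for index, (blocks, block_type) in enumerate(zip(layout, block_types), start=1):
        # Assumption: the first block of every stage downsamples by a factor of 2.
        size = -(-size // 2)
        stages.append((index, block_type, blocks, size))
    return stages


if __name__ == "__main__":
    for index, block_type, blocks, size in describe_stages(224):
        print(f"stage {index}: {blocks:2d} x {block_type:9s} -> {size} x {size}")
```

Scaling the family then amounts to changing the third and fourth entries of the layout and the per-stage channel sizes, while the block types per stage stay fixed.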

3. EXPERIMENTAL RESULTS

In this section, we show that MOAT variants are effective on the ImageNet-1K (Russakovsky et al., 2015) image classification. We then deploy them to other recognition tasks, including COCO object detection (Lin et al., 2014), instance segmentation (Hariharan et al., 2014), and ADE20K (Zhou et al., 2019) semantic segmentation. MOAT can be seamlessly applied to downstream tasks. For small resolution inputs, we directly fine-tune the global attention, while for large resolution inputs, we simply convert the global attention to non-overlapping local window attention without using an extra window-shifting mechanism. The detailed experiment setup can be found in the appendix.

ImageNet Image Classification. In Tab. 2, we include the current state-of-the-art methods in the categories of ConvNets, ViTs and Hybrid models. At similar model costs (parameters or FLOPs), our MOAT models consistently outperform all of them. Specifically, with the ImageNet-1K data only and input size 224, for light-weight models, our MOAT-0 significantly outperforms ConvNeXt-T (Liu et al., 2022b), Swin-T (Liu et al., 2021), and CoAtNet-0 (Dai et al., 2021) by 1.2%, 2.0%, and 1.7%, respectively. For large-scale models using input size 384, MOAT-3 surpasses ConvNeXt-L and CoAtNet-3 by 1.0% and 0.7%, respectively. With the ImageNet-22K pretraining and input size 384, the prior arts ConvNeXt-L, Swin-L, and CoAtNet-3 already show strong performances (87.5%, 87.3% and 87.6%), while our MOAT-3 achieves the score of 88.2%, outperforming them by 0.7%, 0.9%, and 0.6%, respectively. For ImageNet-1K and input size 224, we plot the performances vs. parameters and FLOPs in Fig. 2 and Fig. 3, respectively. For ImageNet-22K pretraining and input size 384, we plot the performances vs. parameters and FLOPs in Fig. 4 and Fig. 5, respectively. In the figures, MOAT clearly demonstrates the best performance in all computation regimes.
Finally, our largest model MOAT-4, with ImageNet-22K and input size 512, further attains 89.1% accuracy.

COCO Detection. Tab. 3 summarizes the COCO object detection (box) and instance segmentation (mask) results. Our MOAT backbones significantly outperform the baseline methods, including Swin (Liu et al., 2021) and ConvNeXt (Liu et al., 2022b) across different model sizes. Specifically, our MOAT-0 outperforms Swin-T and ConvNeXt-T by 5.4% and 5.5% AP box (3.7% and 3.7% AP mask). Our MOAT-1 surpasses Swin-S and ConvNeXt-S by 5.9% and 5.8% AP box (4.3% and 4.0% AP mask). Our MOAT-2, with 110M parameters, is still 5.5% and 4.5% AP box (3.5% and 2.4% AP mask) better than Swin-B and ConvNeXt-B. Finally, our MOAT-3, using 227M parameters, achieves 59.2% AP box (50.3% AP mask), setting a new state-of-the-art in the regime of model size 200M that is built on top of Cascade Mask R-CNN (Cai & Vasconcelos, 2018; He et al., 2017). More comparisons with smaller input size can be found in Tab. 12. For tiny-MOAT, tiny-MOAT-0/1 achieve the same performance as Swin-T/S and ConvNeXt-T/S while using less than half of the parameters. Furthermore, tiny-MOAT-3 is pretrained with ImageNet-1K and attains 55.2 AP box with 57M parameters, surpassing the ImageNet-22K pretrained Swin-L (53.9 AP box with 254M parameters) and ConvNeXt-L (54.8 AP box with 255M parameters).

Finally, when using input size 641 × 641, our MOAT-4 achieves the performance of 57.6% mIoU, setting a new state-of-the-art in the regime of models using input size 641 × 641. For tiny-MOAT, tiny-MOAT-3 achieves comparable performance with ConvNeXt-S with less than half of the parameters.

tiny-MOAT on ImageNet. We simply scale down the channels of MOAT-0 to obtain the tiny-MOAT family without any specific adaptations. In the left of Tab. 5, with similar model parameters, tiny-MOAT-0/1/2 surpass the Mobile-Former counterparts by 6.8%, 5.5%, and 4.3%, respectively. In the right of Tab.
5, our tiny-MOAT also shows stronger performances than MobileViT (Mehta & Rastegari, 2022a). Even compared with the concurrent work MobileViTv2 (Mehta & Rastegari, 2022b), tiny-MOAT-1/2/3 surpass their counterparts by 1.1%, 1.3%, and 2.1%, respectively.

Downsampling layer. We compare our design with alternative downsampling layers: (1) average-pooling in CoAtNet (Dai et al., 2021), (2) patch-embedding layer (i.e., 2 × 2 convolution with stride 2) in Swin (Liu et al., 2021) and ConvNeXt (Liu et al., 2022b), or (3) strided depthwise convolution in PiT (Heo et al., 2021) and RegionViT (Chen et al., 2022a). As shown in Tab. 7, using the patch-embedding layer indeed improves over the average-pooling scheme by 0.2% accuracy, but at the cost of more model parameters. Additionally, using the strided depthwise convolution for downsampling leads to 0.2% worse performance than the patch-embedding layer. By contrast, our MOAT design (i.e., delegating the downsampling to the MBConv block) shows the best performance with the least cost of parameters and comparable FLOPs.

Table 7: Ablation studies of the downsampling layer design on ImageNet-1K, using MOAT-0 and input size 224. We compare our MOAT design (in grey) with (1) CoAtNet (using average-pooling for downsampling), (2) Swin/ConvNeXt designs (using strided 2 × 2 convolution for downsampling), and (3) PiT/RegionViT designs (using strided 3 × 3 depthwise convolution for downsampling).

5. RELATED WORK

Transformers (Vaswani et al., 2017) were recently introduced to the vision community (Wang et al., 2018; Ramachandran et al., 2019; Hu et al., 2019) and demonstrated remarkable performance on vision recognition tasks (Carion et al., 2020; Zhu et al., 2021; Wang et al., 2021a; Arnab et al., 2021; Liu et al., 2021; Cheng et al., 2021; Yu et al., 2022a; Kim et al., 2022; Cheng et al., 2022; Yu et al., 2022b) , thanks to their ability to efficiently encode long-range interaction via the attention mechanism (Bahdanau et al., 2015) . Particularly, ViT (Dosovitskiy et al., 2021) obtains impressive results on ImageNet (Russakovsky et al., 2015) by applying the vanilla Transformer with the novel large stride patch embedding, after pretraining on the proprietary large-scale JFT dataset (Sun et al., 2017) . There have been several works aiming to improve the vision transformers, either with better training strategies (Touvron et al., 2021a; b; Steiner et al., 2021; Zhai et al., 2022; Touvron et al., 2022) or with efficient local-attention modules (Huang et al., 2019; Ho et al., 2019; Wang et al., 2020; Liu et al., 2021; Chu et al., 2021; Yang et al., 2021; Yu et al., 2021; Dong et al., 2022; Tu et al., 2022) . Since the debut of AlexNet (Krizhevsky et al., 2012) , the vision community has witnessed a rapid improvement on the ImageNet benchmark using different types of ConvNets, including (but not limited to) VGGNet (Simonyan & Zisserman, 2015) , Inceptions (Szegedy et al., 2015; Ioffe & Szegedy, 2015; Szegedy et al., 2016; 2017) , ResNets (He et al., 2016a; b) , ResNeXt (Xie et al., 2017) , DenseNet (Huang et al., 2017) , SENet (Hu et al., 2018) , MobileNets (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019) , EfficientNets (Tan & Le, 2019; 2021) , and ConvNeXt (Liu et al., 2022b) each focusing on different aspects of accuracy and efficiency. The ubiquity of ConvNets in computer vision could be attributed to their built-in inductive biases. 
Given the success of Transformers and ConvNets, another line of research is to explore how to effectively combine them. Swin (Liu et al., 2021; 2022a), PVT (Wang et al., 2021b; 2022), MViT (Fan et al., 2021; Li et al., 2022), and PiT (Heo et al., 2021) adopt the ConvNet hierarchical structure to extract multi-scale features for Transformers. SASA (Ramachandran et al., 2019), AA-ResNet (Bello et al., 2019), Axial-ResNet (Wang et al., 2020) and BoTNet (Srinivas et al., 2021) incorporate the attention modules to ResNets. CvT (Wu et al., 2021), LeViT (Graham et al., 2021), Visformer (Chen et al., 2021b), and ViT C (Xiao et al., 2021) replace ViT's patch embedding with strided convolutions. CeiT (Yuan et al., 2021a) and CMT (Guo et al., 2022) incorporate depthwise convolution to the transformer block's MLP. ViTAE (Xu et al., 2021) adopts parallel attention modules and convolutional layers. LVT (Yang et al., 2022) introduces local self-attention into the convolution. Recently, CoAtNet (Dai et al., 2021) and MobileViT (Mehta & Rastegari, 2022a) propose hybrid models that build on top of the efficient Mobile Convolution (Sandler et al., 2018) and Transformer block.

We train Cascade Mask R-CNN (Cai & Vasconcelos, 2018; He et al., 2017) on the COCO 2017 dataset (Lin et al., 2014) with our MOAT architectures. The dataset contains 118K training and 5K validation samples. We use the official TensorFlow (Abadi et al., 2016) implementation of Cascade Mask R-CNN by TF-Vision Model Garden (Yu et al., 2020). Our training setting closely follows the prior works (Chen et al., 2022b; Tu et al., 2022), except that we use batch size 64 and initial learning rate 0.0001. To adapt the MOAT models to high-resolution inputs, we partition the features into non-overlapping windows for the self-attention computations, with the window size set to 14 for the second-to-last stage, and use global attention for the last stage. As a result of this window partition, the input size must be divisible by 14.
The TF-Vision Model Garden codebase further requires the input size to be square (with padding) and divisible by 64. Hence, we choose 1344 as the input size, similar to the size used in the baseline methods (i.e., longest side is no more than 1333). We use Feature Pyramid Network (Lin et al., 2017) to integrate features from different levels.
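The choice of 1344 follows from the two divisibility constraints stated above, and can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the overall stride of 16 at the second-to-last stage (used to estimate the attention cost) is an assumption, not a value given in this section.

```python
import math
from typing import List


def valid_input_sizes(window: int, multiple: int, limit: int) -> List[int]:
    """Square input sizes up to `limit` divisible by both `window` and `multiple`."""
    step = window * multiple // math.gcd(window, multiple)  # least common multiple
    return list(range(step, limit + 1, step))


def nearest_valid_size(target: int, window: int = 14, multiple: int = 64) -> int:
    """Valid size closest to `target` (e.g. the 1333 used by the baselines)."""
    candidates = valid_input_sizes(window, multiple, limit=2 * target)
    return min(candidates, key=lambda size: abs(size - target))


def attention_pairs(feature_size: int, window: int = 0) -> int:
    """Query-key pairs for global attention (window=0) or window attention."""
    tokens = feature_size * feature_size
    if window == 0:
        return tokens * tokens
    if feature_size % window:
        raise ValueError("feature size must be divisible by the window size")
    num_windows = (feature_size // window) ** 2
    return num_windows * (window * window) ** 2


if __name__ == "__main__":
    size = nearest_valid_size(1333)
    feature = size // 16  # assumed stride 16 at the second-to-last stage
    ratio = attention_pairs(feature) // attention_pairs(feature, window=14)
    print(f"input size: {size}, feature map: {feature} x {feature}")
    print(f"global vs. window attention pairs: {ratio}x")
```

The same constraints also admit 896, the input size used for the additional COCO experiments in the appendix.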

A.3.2 MORE COCO OBJECT DETECTION EXPERIMENTAL RESULTS

In this section, we perform more COCO object detection experiments with input size 896. All backbones are pretrained on the ImageNet-1K dataset. MOAT-0/1/2 surpass UViT (Chen et al., 2021a) and MaxViT (Tu et al., 2022) by 3.9/5.2/4.9% AP box (3.1/4.1/3.9% AP mask), and 3.0/4.0/4.0% AP box (2.4/3.2/3.0% AP mask), respectively.

Ablation studies on MOAT meta architecture. We perform ablation studies on the meta-architecture by varying the number of blocks per stage. For simplicity, we only vary the block numbers in the third and fourth stages, while keeping the block numbers in the other stages unchanged. Note that the first stage corresponds to the convolutional stem. The studies with MOAT-1 meta architecture are shown in Tab. 16. In the end, we choose the layout {2, 2, 6, 14, 2} because it has the best performance and lower parameter cost. Interestingly, our discovery echoes the layer layout proposed by CoAtNet (Dai et al., 2021). We visualize the architecture of MOAT-1 in Fig. 6.

Table 19: ImageNet throughput measurement of MOAT models. We re-implement MOAT with the popular "timm" (Wightman, 2019) library in PyTorch, and measure the throughput on an Nvidia V100 GPU, following the same settings as DeiT (Touvron et al., 2021a), Swin (Liu et al., 2021), and ConvNeXt (Liu et al., 2022b).



Figure 1: Block comparison. (a) The MBConv block (Sandler et al., 2018) employs the inverted bottleneck design with depthwise convolution and squeeze-and-excitation (Hu et al., 2018) applied to the expanded features. (b) The Transformer block (Vaswani et al., 2017) consists of a self-attention module and a MLP module. (c) The proposed MOAT block effectively combines them. The illustration assumes the input tensor has channels c.

Figure 2: Parameters vs. accuracy using ImageNet-1K only with input size 224.

MOAT variants differ in the number of blocks B and number of channels C in each stage.

Performance on ImageNet-1K. 1K only: Using ImageNet-1K only. 22K + 1K: ImageNet-22K pretraining and ImageNet-1K fine-tuning. Tab. 8 shows comparisons with more SOTA methods and Tab. 9 reports the performances on ImageNet-1K-V2.

Object detection and instance segmentation on the COCO 2017 val set. We employ Cascade Mask R-CNN, and single-scale inference (hard NMS). †: use ImageNet-22K pretrained weights. When using the tiny-MOAT series as backbones, most of the model parameters come from the decoder. More comparisons at input size 896 are reported in Tab. 12.

Semantic segmentation on ADE20K val set. We employ DeepLabv3+ (single-scale inference). Results for ConvNeXt and MOAT are obtained using the official code-base (Weber et al., 2021) with the same training recipe. †: use ImageNet-22K pretrained weights.

Performances of tiny-MOAT family on ImageNet-1K.

MOAT block design. In Tab. 6, we ablate the MOAT block design, which only affects the last two stages of MOAT, and we keep everything else the same (e.g., training recipes). We start from the Transformer block, consisting of Attn (self-attention) and MLP, which already attains a strong top-1 accuracy (82.6%). Directly inserting a 3 × 3 depthwise convolution in the MLP degrades the performance by 0.9%. If we additionally insert batch normalization and GeLU between convolutions (i.e., replace MLP with MBConv, but no Squeeze-and-Excitation), the performance is improved to 82.9%. Finally, placing MBConv before Attn reaches the performance of 83.3%. Additionally, our MOAT block brings more improvements (from 1.2% up to 2.6% gains) in the tiny model regime.

Ablation studies of MOAT block design on ImageNet-1K with input size 224.

MOAT ImageNet hyper-parameter settings.

tiny-MOAT ImageNet hyper-parameter settings. ⋆ : use EMA decay rate 0.9999 for tiny-MOAT-3.

Object detection and instance segmentation on the COCO 2017 val set. We employ Cascade Mask-RCNN, and single-scale inference (hard NMS). All backbones are pretrained on ImageNet-1K.

Ablation studies of the order of MBConv and Attention (Attn) on ImageNet-1K with input size 224. We also ablate where the spatial downsampling and channel expansion are applied.

However, this design is equivalent to shifting the first Attn layer to its previous stage, reducing the representation capacity of the current stage. More concretely, only the last stage will be affected, since one layer is shifted. Third, to enhance the representation capacity, reversing the order of Attn and MBConv allows us to keep the first Attn layer in the same stage. This design further improves the performance by 0.7% and 1.2% for MOAT-0 and tiny-MOAT-2. Fourth, to compensate for the shifting effect, we could also employ another 1 × 1 convolution to expand the channels at the first Attn layer (then, MBConv only performs the spatial downsampling). However, this design performs similarly to our MOAT block design, but uses more parameters and FLOPs.

A.6.2 ABLATION STUDIES ON THE MOAT MACRO-LEVEL DESIGN

Ablation studies on MOAT-based model. In Tab. 15, we ablate the stage-wise design by using either MBConv or MOAT block in stage 2 to stage 5. The first stage is the convolutional stem, containing two 3 × 3 convolutions. We use the layer layout of MOAT-0. As shown in the table, the pure MOAT-based model (i.e., using MOAT blocks for all four stages) achieves the best performance of 83.6%, which however uses the most FLOPs. Our MOAT model design (i.e., using MOAT blocks in the last two stages) attains a better trade-off between accuracy and model complexity.

Ablation studies of MOAT-based model on ImageNet-1K, using MOAT-0 layer layout and input size 224. We change the block type (MBConv vs. MOAT block) from stage 2 to stage 5. The first stage is fixed to use the convolutional stem.

Ablation studies of MOAT meta-architecture design on ImageNet-1K, using MOAT-1 and input size 224. We control the first, second and last stages to have two blocks, and vary the block numbers of the third and fourth stages.

Architecture of MOAT-1, including the convolutional stem, MBConv, and MOAT blocks.

A.7 IMAGENET TRAINING TIME, PEAK TRAINING MEMORY AND THROUGHPUT MEASUREMENTS

ImageNet training time measured in

We use 16 TPUv4 cores for training MOAT-{0,1,2} and 32 TPUv4 cores for MOAT-3. MOAT is training-efficient: for ImageNet-22K pretraining, MOAT takes no more than 2.05 days, while for ImageNet-1K pretraining, MOAT takes less than 1 day.

ImageNet peak training memory of MOAT models. The input size is 224 × 224.

Acknowledgements. We thank Wen-Sheng Chu for the support and discussion. We gratefully acknowledge support from the Office of Naval Research (N00014-21-1-2812).

A APPENDIX

In the appendix, we provide more details for both our model and experiments.
• In section A.1, we provide MOAT implementation details.
• In section A.2.1, we provide ImageNet experimental details.
• In section A.2.2, we provide ImageNet-V2 experimental results.
• In section A.3.1, we provide COCO detection experimental details.
• In section A.3.2, we provide more COCO object detection experimental results.
• In section A.4, we provide ADE20K semantic segmentation experimental details.
• In section A.5, we provide COCO panoptic segmentation experiments.
• In section A.6.1, we provide ablation studies on the MOAT micro-level design.
• In section A.6.2, we provide ablation studies on the MOAT macro-level design.
• In section A.7, we provide the ImageNet training time, peak training memory and throughput measurements of MOAT models.
• In section A.8, we discuss limitations of our model.

A.1 MOAT IMPLEMENTATION DETAILS

In the MOAT networks, we employ kernel size 3 for both convolutions and depthwise convolutions. We use multi-head self-attention (Vaswani et al., 2017), where each attention head has 32 channels. For the MBConv and MOAT blocks, we use expansion ratio 4. The SE module (Hu et al., 2018) in the MBConv blocks (i.e., 2nd and 3rd stages) adopts reduction ratio 4 (relative to the input channels).

Our MOAT block includes the relative positional embedding (Shaw et al., 2018; Dai et al., 2021) for ImageNet. However, the downstream tasks usually take a larger input resolution than ImageNet, which demands a special adaptation (e.g., bilinear interpolation of the pretrained positional embedding). For simplicity, we remove the positional embedding when running MOAT on downstream tasks.
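To make the settings above concrete, the sketch below derives the per-block widths they imply for a given channel count. It is an illustration under stated assumptions, not the released code: `moat_block_dims` is a hypothetical helper, the number of heads is assumed to be the channel count divided by the per-head width, and the expansion ratio is assumed to apply to the block's input channels. The channel count 256 is an arbitrary example, not a specific MOAT configuration.

```python
def moat_block_dims(channels, head_dim=32, expansion_ratio=4, se_reduction=4):
    """Derive block widths from the A.1 settings (hypothetical helper).

    num_heads applies to the attention in MOAT blocks; se_hidden_channels
    applies to the SE module, which is used only in the MBConv stages.
    """
    if channels % head_dim != 0:
        raise ValueError("channels must be divisible by the per-head width")
    return {
        "num_heads": channels // head_dim,               # each head has 32 channels
        "expanded_channels": channels * expansion_ratio,  # assumed relative to input channels
        "se_hidden_channels": channels // se_reduction,   # relative to input channels
    }


dims = moat_block_dims(256)
print(dims)
```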

A.2.1 IMAGENET EXPERIMENTS

The ImageNet-1K dataset (Russakovsky et al., 2015) contains 1.2M training images with 1000 classes. We report top-1 accuracy on the ImageNet-1K validation set, using the last checkpoint. We also experiment with pretraining on the larger ImageNet-22K dataset, and then fine-tuning on ImageNet-1K. We closely follow the prior works (Dai et al., 2021; Liu et al., 2022b) and provide more details below. In Tab. 8, we compare our MOAT with more state-of-the-art models.

Experimental setup. We train MOAT models on ImageNet-1K with resolution 224 for 300 epochs. If pretraining on the larger ImageNet-22K, we use resolution 224 and 90 epochs. Afterwards, the models are fine-tuned on ImageNet-1K for 30 epochs. During fine-tuning, we also experiment with larger resolutions (e.g., 384 and 512). We employ the typical regularization methods during training, such as label smoothing (Szegedy et al., 2016), RandAugment (Cubuk et al., 2020), MixUp (Zhang et al., 2017), and stochastic depth (Huang et al., 2016), and we optimize with Adam (Kingma & Ba, 2015) with decoupled weight decay (i.e., AdamW (Loshchilov & Hutter, 2019)). See Tab. 10 and Tab. 11 for detailed hyper-parameters.
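As a reminder of what decoupled weight decay means in the optimizer named above, the following is a minimal scalar sketch of one AdamW update. It follows the standard AdamW rule; `adamw_step` is a hypothetical helper, and the default hyper-parameters are placeholders rather than the values in Tab. 10 or Tab. 11.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.05):
    """One AdamW update for a scalar parameter; t is the 1-based step index."""
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled decay: applied to the parameter directly, not folded into the gradient.
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v


# With a zero gradient, only the weight decay term changes the parameter.
p, m, v = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.5)
print(p)
```

Because the decay term bypasses the moment estimates, its strength does not depend on the gradient scale, which is the difference from adding an L2 penalty to the loss under Adam.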

A.2.2 IMAGENET-1K-V2 EVALUATION

To further demonstrate the transferability and generalizability of our MOAT models, we perform additional evaluations on ImageNet-1K-V2 (Recht et al., 2019), using our ImageNet (Russakovsky et al., 2015) pretrained checkpoints. We report an extensive evaluation on ImageNet-1K-V2, using MOAT and several input resolutions, aiming to establish another solid baseline for the community, as we notice that most of the existing models do not report results on ImageNet-1K-V2. As shown in Tab. 9, MOAT does not overfit to the ImageNet-1K-V1 dataset and generalizes well to the ImageNet-1K-V2 dataset, as we observe a continuous performance improvement from small to large models.

A.5 COCO PANOPTIC SEGMENTATION

Experimental setup. We also evaluate the proposed MOAT architectures on the challenging COCO panoptic segmentation dataset (Lin et al., 2014) using Panoptic-DeepLab (Cheng et al., 2020) with the official codebase (Weber et al., 2021). We fine-tune the global attention on downstream segmentation tasks for MOAT. We adopt the same training strategies for MOAT and its counterparts. Specifically, for training hyper-parameters, we train the model with 32 TPU cores for 200k iterations, with the first 2k as the warm-up stage. We use batch size 64, the Adam (Kingma & Ba, 2015) optimizer, and a poly learning rate schedule starting at 0.0005. For data augmentations, the input images are resized and padded to 641 × 641, with random cropping, flipping, and color jittering (Cubuk et al., 2019). No test-time augmentation is used during inference.

Main results. The results are summarized in Tab. 13, where MOAT consistently outperforms other backbones. Specifically, our MOAT-0 surpasses ConvNeXt-T significantly by 4.3% PQ. In the large model regime, MOAT-3 surpasses ConvNeXt-L by 3.5%. Our MOAT-4 achieves the performance of 46.7% PQ, outperforming the heavy backbone SWideRNet (Chen et al., 2020) by 2.3%.

MBConv), delegating the downsampling duty to the strided depthwise convolution within the MBConv. However, the downsampling can still be performed in the MBConv with the original order (i.e., Attn + MBConv). Since the operations, Attn and MBConv, are interlaced, the key difference then comes from the first block in each stage, where the Attn is operated on the (1) spatially downsampled and/or (2) channel expanded features. To conduct the study, we employ different blocks in the MOAT variants, using "Attn + MLP", "Attn + MBConv", or "MBConv + Attn". For the "Attn + MBConv" block, we further ablate the place (Attn vs. MBConv) where we apply the spatial downsampling and channel expansion operations.

In Tab. 14, we observe the following results.
First, replacing the MLP with MBConv improves the performance by 0.3% and 0.7% for MOAT-0 and tiny-MOAT-2. Second, if we perform both spatial downsampling and channel expansion at the MBConv block, the performance is further improved by 0.5% and 0.9% for MOAT-0 and tiny-MOAT-2, showing that MBConv learns better downsampled

