MOAT: ALTERNATING MOBILE CONVOLUTION AND ATTENTION BRINGS STRONG VISION MODELS

Abstract

This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike recent works that stack separate mobile convolution and transformer blocks, we effectively merge them into a single MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further move it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% / 81.5% top-1 accuracy on ImageNet-1K / ImageNet-1K-V2 with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large-resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus across windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model parameters (single-scale inference and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention. Code is publicly available.

1. INTRODUCTION

The vision community has witnessed the prevalence of self-attention (Bahdanau et al., 2015) and Transformers (Vaswani et al., 2017). The success of Transformers in natural language processing motivates the creation of their variants for vision recognition. The Vision Transformer (ViT) (Dosovitskiy et al., 2021) has great representation capacity with a global receptive field. However, it requires pretraining on a large-scale proprietary dataset (Sun et al., 2017). Its unsatisfactory performance when trained on a small number of images calls for better training recipes (Touvron et al., 2021a; Steiner et al., 2021) or architectural designs (Liu et al., 2021; Graham et al., 2021).

On the other hand, ConvNets have been the dominant network choice since the advent of AlexNet (Krizhevsky et al., 2012) in 2012. Vision researchers have condensed years of network design experience into multiple principles, and have started to incorporate them into vision transformers. For example, some works adopt the ConvNet's hierarchical structure to extract multi-scale features for vision transformers (Liu et al., 2021; Fan et al., 2021; Wang et al., 2022), while others propose to integrate the translation equivariance of convolution into transformers (Graham et al., 2021; d'Ascoli et al., 2021; Xiao et al., 2021).

Along the same direction of combining the best from Transformers and ConvNets, CoAtNet (Dai et al., 2021) and MobileViT (Mehta & Rastegari, 2022a) demonstrate outstanding performance by stacking Mobile Convolution (MBConv) blocks (i.e., inverted residual blocks (Sandler et al., 2018)) and Transformer blocks (i.e., a self-attention layer and a Multi-Layer Perceptron (MLP)). However, both works focus on the macro-level network design. They treat MBConv and Transformer blocks as separate, individual blocks, and systematically study the effect of stacking them to strike a better balance between the remarkable efficiency of MBConv and the strong capacity of Transformer.

In this work, on the contrary, we study the micro-level building block design by taking a deeper look at the combination of MBConv and Transformer blocks. We make two key observations after a careful examination of those blocks. First, the MLP module in the Transformer block is similar to MBConv, as both adopt the inverted bottleneck design. However, MBConv is a more powerful operation: it employs one extra 3 × 3 depthwise convolution (to encode local interaction between pixels), as well as more activation (Hendrycks & Gimpel, 2016) and normalization (Ioffe & Szegedy, 2015) layers between convolutions. Second, to extract multi-scale features using Transformer blocks, one may apply average-pooling (with stride 2) to the input features before the self-attention layer. However, the pooling operation reduces the representation capacity of self-attention.

Our observations motivate us to propose a novel MObile convolution with ATtention (MOAT) block, which efficiently combines MBConv and Transformer blocks. The proposed MOAT block modifies the Transformer block by first replacing its MLP with an MBConv block, and then reversing the order of attention and MBConv. Replacing the MLP with an MBConv brings more representation capacity to the network, and reversing the order (MBConv comes before self-attention) delegates the downsampling duty to the strided depthwise convolution within the MBConv, learning a better downsampling kernel.
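To make the block structure concrete, the following PyTorch sketch illustrates the design described above: an MBConv (1 × 1 expansion, 3 × 3 depthwise convolution that optionally strides for downsampling, 1 × 1 projection) followed by self-attention over the resulting tokens. This is a minimal illustration under our own assumptions (normalization placement, expansion ratio 4, no squeeze-and-excitation; the class names MBConv and MOATBlock are hypothetical), not the authors' reference implementation.

```python
# A minimal sketch of the MOAT block described above, under assumptions
# stated in the text; not the official implementation.
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""

    def __init__(self, in_ch, out_ch, expansion=4, stride=1):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch),                            # pre-normalization
            nn.Conv2d(in_ch, hidden, 1, bias=False),          # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, stride=stride,       # 3x3 depthwise;
                      padding=1, groups=hidden, bias=False),  # stride 2 downsamples
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, out_ch, 1, bias=False),         # 1x1 projection
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out


class MOATBlock(nn.Module):
    """MBConv first (it handles any downsampling), then self-attention."""

    def __init__(self, in_ch, out_ch, num_heads=8, stride=1):
        super().__init__()
        self.mbconv = MBConv(in_ch, out_ch, stride=stride)
        self.norm = nn.LayerNorm(out_ch)
        self.attn = nn.MultiheadAttention(out_ch, num_heads, batch_first=True)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.mbconv(x)                   # downsampled here if stride == 2
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        q = self.norm(seq)
        attn_out, _ = self.attn(q, q, q, need_weights=False)
        seq = seq + attn_out                 # residual around attention
        return seq.transpose(1, 2).reshape(b, c, h, w)


# Example: a downsampling MOAT block mapping 64 -> 128 channels.
block = MOATBlock(64, 128, num_heads=8, stride=2)
x = torch.randn(2, 64, 32, 32)
print(block(x).shape)  # torch.Size([2, 128, 16, 16])
```

Note how the sketch reflects the two observations: the MBConv plays the role the MLP would otherwise play (both are inverted bottlenecks), and because it comes first, the stride-2 depthwise convolution performs the downsampling with a learned kernel rather than average-pooling.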
We further develop a family of MOAT models by stacking the network blocks and increasing their channels. Surprisingly, our extremely simple design has a remarkable impact. On the challenging ImageNet-1K classification benchmark (Russakovsky et al., 2015), our model (190M parameters) achieves 86.7% top-1 accuracy without extra data. When further pretraining on ImageNet-22K, our best model (483M parameters) attains 89.1% / 81.5% top-1 accuracy on ImageNet-1K (Tab. 2) / ImageNet-1K-V2 (Tab. 9), setting a new state-of-the-art.

Additionally, MOAT can be seamlessly deployed to downstream tasks that require large-resolution inputs by simply converting the global attention to non-overlapping local window attention (a sketch of this conversion is given at the end of this introduction). Thanks to the MBConv that effectively exchanges local information between pixels (enabling cross-window propagation), MOAT does not need the extra window-shifting mechanism (Liu et al., 2021). As a result, MOAT shows superior performance on COCO object detection (Lin et al., 2014) and ADE20K semantic segmentation (Zhou et al., 2019). Specifically, on COCO object detection (Tab. 3), our best model (227M parameters) achieves 59.2% box AP with single-scale inference and hard NMS, setting a new state-of-the-art among models of around 200M parameters with Cascade Mask R-CNN (Cai & Vasconcelos, 2018; He et al., 2017). On ADE20K semantic segmentation (Tab. 4), our best model (496M parameters), adopting DeepLabv3+ (Chen et al., 2018), attains 57.6% mIoU with single-scale inference, also setting a new state-of-the-art among models using input size 641 × 641.

Finally, to explore the scalability of MOAT models, we simply scale down the models by reducing the channel sizes (without any other change), resulting in the tiny-MOAT family, which also surprisingly outperforms mobile-specific transformer-based models, such as Mobile-Former (Chen et al., 2022c) and MobileViTs (Mehta & Rastegari, 2022a;b). Specifically, in the 5M, 10M, and 20M parameter regimes, tiny-MOAT outperforms the concurrent MobileViTv2 (Mehta & Rastegari, 2022b) by 1.1%, 1.3%, and 2.0% top-1 accuracy on the ImageNet-1K classification benchmark (Tab. 5). Furthermore, we benchmark tiny-MOAT on COCO object detection and ADE20K semantic segmentation.

In summary, our method advocates the design principle of simplicity. Without inventing extra complicated operations, the proposed MOAT block effectively merges the strengths of both mobile convolution and self-attention into one block through a careful redesign. Despite its conceptual simplicity, impressive results have been obtained on multiple core vision recognition tasks. We hope our study will inspire future research on seamless integration of convolution and self-attention.
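As referenced above, the conversion from global to window attention amounts to partitioning the token sequence into per-window sequences before attention and stitching them back afterwards. The sketch below illustrates this under the assumption that the spatial size is divisible by the window size; the helper names (window_partition, window_unpartition) are our own, not from the paper's code.

```python
# A minimal sketch of the global-to-window attention conversion; names and
# shapes are illustrative assumptions, not the paper's reference code.
import torch


def window_partition(x, ws):
    """(B, C, H, W) -> (B * H/ws * W/ws, ws*ws, C): one token sequence per window."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // ws, ws, w // ws, ws)
    x = x.permute(0, 2, 4, 3, 5, 1)            # (B, nH, nW, ws, ws, C)
    return x.reshape(-1, ws * ws, c)


def window_unpartition(windows, ws, h, w):
    """Inverse of window_partition: stitch the windows back into a feature map."""
    b = windows.shape[0] // ((h // ws) * (w // ws))
    c = windows.shape[-1]
    x = windows.reshape(b, h // ws, w // ws, ws, ws, c)
    x = x.permute(0, 5, 1, 3, 2, 4)            # (B, C, nH, ws, nW, ws)
    return x.reshape(b, c, h, w)


# Round trip: attention is applied independently within each 8x8 window.
x = torch.randn(2, 128, 16, 16)
wins = window_partition(x, ws=8)               # (8, 64, 128)
assert torch.equal(window_unpartition(wins, 8, 16, 16), x)
```

Each window's tokens are then fed to the same self-attention module as in the global case. Because the depthwise convolution in the preceding MBConv has a receptive field that straddles window borders, information still flows between windows, which is why no shifted-window scheme is needed.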

