MOAT: ALTERNATING MOBILE CONVOLUTION AND ATTENTION BRINGS STRONG VISION MODELS

Abstract

This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike existing works that stack separate mobile convolution and transformer blocks, we effectively merge them into a single MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and then move that block before the self-attention operation. The mobile convolution block not only enhances the network's representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% / 81.5% top-1 accuracy on ImageNet-1K / ImageNet-1K-V2 with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large-resolution inputs by simply converting the global attention to window attention. Because the mobile convolution effectively exchanges local information between pixels (and thus across windows), MOAT does not need an extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model parameters (single-scale inference and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention.
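To make the block construction described in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It applies a mobile convolution (inverted residual) block first and single-head global self-attention second, each with a residual connection. The function names, the use of layer normalization before both sub-blocks, the GELU activations, the expansion ratio of 4, and the single attention head are all assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel (last) axis; learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv3x3(x, kernel):
    # x: (H, W, C), kernel: (3, 3, C); zero padding, stride 1.
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out

def mobile_conv(x, p):
    # Inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 project, plus skip.
    y = layer_norm(x)  # normalization choice is an assumption
    y = gelu(y @ p["expand"])
    y = gelu(depthwise_conv3x3(y, p["depthwise"]))
    y = y @ p["project"]
    return x + y

def self_attention(x, p):
    # Single-head global attention over all H*W positions, plus skip.
    h, w, c = x.shape
    tokens = layer_norm(x).reshape(h * w, c)
    q, k, v = tokens @ p["wq"], tokens @ p["wk"], tokens @ p["wv"]
    scores = q @ k.T / np.sqrt(c)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = (weights @ v) @ p["wo"]
    return x + out.reshape(h, w, c)

def moat_block(x, p):
    # Mobile convolution takes the place of the MLP and runs BEFORE attention.
    return self_attention(mobile_conv(x, p), p)

def init_params(channels, expansion=4, seed=0):
    # Hypothetical helper: small random weights for the sketch.
    rng = np.random.default_rng(seed)
    hidden = channels * expansion
    w = lambda *shape: rng.normal(0.0, 0.02, size=shape)
    return {
        "expand": w(channels, hidden),
        "depthwise": w(3, 3, hidden),
        "project": w(hidden, channels),
        "wq": w(channels, channels),
        "wk": w(channels, channels),
        "wv": w(channels, channels),
        "wo": w(channels, channels),
    }

if __name__ == "__main__":
    features = np.random.default_rng(1).normal(size=(8, 8, 16))
    out = moat_block(features, init_params(16))
    print(out.shape)
```

The sketch keeps the spatial resolution fixed. Swapping the global attention for attention restricted to local windows, as the abstract describes for large-resolution inputs, would only change `self_attention`; the depthwise convolution in `mobile_conv` already mixes information across neighboring pixels.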

1. INTRODUCTION

The vision community has witnessed the prevalence of self-attention (Bahdanau et al., 2015) and Transformers (Vaswani et al., 2017). The success of Transformers in natural language processing has motivated the creation of variants for visual recognition. The Vision Transformer (ViT) (Dosovitskiy et al., 2021) has great representation capacity with a global receptive field. However, it requires pretraining on a large-scale proprietary dataset (Sun et al., 2017). Its unsatisfying performance when trained with a small number of images calls for better training recipes (Touvron et al., 2021a; Steiner et al., 2021) or architectural designs (Liu et al., 2021; Graham et al., 2021). On the other hand, ConvNets have been the dominant network choice since the advent of AlexNet (Krizhevsky et al., 2012) in 2012. Vision researchers have condensed years of network design experience into multiple principles, and have started to incorporate them into vision transformers. For example, some works adopt the ConvNet's hierarchical structure to extract multi-scale features for vision transformers (Liu et al., 2021; Fan et al., 2021; Wang et al., 2022), and others propose to integrate the translation equivariance of convolution into transformers (Graham et al., 2021; d'Ascoli et al., 2021; Xiao et al., 2021). Along the same direction of combining the best from Transformers and ConvNets, CoAtNet (Dai et al., 2021) and MobileViT (Mehta & Rastegari, 2022a) demonstrate outstanding performance by

