SEAFORMER: SQUEEZE-ENHANCED AXIAL TRANSFORMER FOR MOBILE SEMANTIC SEGMENTATION

Abstract

Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which had been overwhelmingly dominated by CNNs, has recently been significantly revolutionized. However, the computational cost and memory requirements render these methods unsuitable for mobile devices, especially for the high-resolution per-pixel semantic segmentation task. In this paper, we introduce a new method, squeeze-enhanced Axial Transformer (SeaFormer), for mobile semantic segmentation. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial attention and detail enhancement. It can further be used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K and Cityscapes datasets. Critically, we beat both mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency, without bells and whistles. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to the image classification problem, demonstrating its potential to serve as a versatile mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg

1. INTRODUCTION

As a fundamental problem in computer vision, semantic segmentation aims to assign a semantic class label to each pixel in an image. Conventional methods rely on stacking local convolution kernels Long et al. (2015) to perceive the long-range structural information of the image. However, these advances remain insufficient to satisfy the design requirements and constraints of mobile devices due to high latency on high-resolution inputs (see Figure 1). Recently there has been a surge of interest in building Transformer-based semantic segmentation models. In order to reduce the computation cost at high resolution, TopFormer Zhang et al. (2022c) applies global attention only at a 1/64 scale of the original input, which inevitably harms the segmentation performance.
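To make the cost argument concrete, the following back-of-the-envelope sketch (our own illustration, not from the paper) counts pairwise attention interactions for global self-attention, axial attention, and a squeezed-axial scheme on an H×W feature map; the 64×64 resolution (e.g., a 512×512 input at 1/8 scale) is an illustrative assumption, and constant factors and memory traffic are ignored.

```python
# Rough cost comparison of attention variants on an H x W feature map,
# measured as the number of query-key pairs (single head, no constants).

def global_attention_pairs(h, w):
    # every token attends to every token: (H*W)^2 pairs
    n = h * w
    return n * n

def axial_attention_pairs(h, w):
    # each token attends along its own row and column:
    # H*W tokens, each with (H + W) partners
    return h * w * (h + w)

def squeeze_axial_pairs(h, w):
    # the map is pooled to one column (length H) and one row (length W)
    # before attention: H^2 + W^2 pairs
    return h * h + w * w

h, w = 64, 64
print(global_attention_pairs(h, w))   # 16777216
print(axial_attention_pairs(h, w))    # 524288
print(squeeze_axial_pairs(h, w))      # 8192
```

The ratio between the first and last figure illustrates why pooling each axis to a single compact sequence before attention is attractive at mobile latency budgets.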

Since the introduction of Vision Transformers Dosovitskiy et al. (2021), the landscape of semantic segmentation has been significantly revolutionized. Transformer-based approaches Zheng et al. (2021); Xie et al. (2021) have remarkably demonstrated the capability of global context modeling. However, the computational cost and memory requirements of Transformers render these methods unsuitable on mobile devices, especially for high-resolution imagery inputs. Following the conventional wisdom of efficient operator design, local/window-based attention Luong et al. (2015); Liu et al. (2021); Huang et al. (2021a); Yuan et al. (2021), Axial attention Huang et al. (2019b); Ho et al. (2019); Wang et al. (2020a), dynamic graph message passing Zhang et al. (2020; 2022b) and several lightweight attention mechanisms Hou et al. (2020); Li et al. (2021b;c; 2020); Liu et al. (2018); Shen et al. (2021); Xu et al. (2021); Cao et al. (2019); Woo et al. (2018); Wang et al. (2020b); Choromanski et al. (2021); Chen et al. (2017); Mehta & Rastegari (2022a) have been introduced.

To resolve the dilemma between the high-resolution computation required by pixel-wise segmentation and the low-latency requirement of mobile devices without hurting performance, we propose a family of squeeze-enhanced Axial Transformer (SeaFormer) architectures. The core building block, squeeze-enhanced Axial attention (SEA attention), squeezes (pools) the input feature maps along the horizontal and vertical axes into a compact column and row and computes self-attention on them. To compensate for the detail information sacrificed during squeezing, we concatenate the query, keys and values and feed them into a depth-wise convolution layer to enhance local details. Coupled with a light segmentation head, our design (see Figure 2) with the proposed SeaFormer layers on small-scale features is capable of conducting high-resolution semantic segmentation with low latency on mobile devices. As shown in Figure 1, the proposed SeaFormer outperforms other efficient neural networks on the ADE20K dataset with lower latency. In particular, SeaFormer-Base is superior to the lightweight CNN counterpart MobileNetV3 (41.0 vs. 33.1 mIoU) with lower latency (106ms vs. 126ms) on an ARM-based mobile device.
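To make the squeeze-attend-enhance scheme described above concrete, here is a minimal single-head NumPy sketch. It is our own simplification with made-up shapes: projections are identities, positional embeddings are omitted, and a per-channel box filter stands in for the learned depth-wise convolution of the detail-enhancement branch; the actual SEA attention block is multi-head and trained end-to-end in a backbone.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # q, k, v: (seq, dim); standard scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def squeeze_axial_attention(x):
    # x: (H, W, C) feature map.
    # Squeeze: average-pool to a compact column (H, C) and row (W, C).
    col = x.mean(axis=1)             # (H, C)
    row = x.mean(axis=0)             # (W, C)
    # Self-attention on each squeezed axis (identity q/k/v for brevity).
    col_out = attend(col, col, col)  # (H, C)
    row_out = attend(row, row, row)  # (W, C)
    # Broadcast the axial results back over the full map and merge.
    return col_out[:, None, :] + row_out[None, :, :]  # (H, W, C)

def detail_enhancement(x, kernel=3):
    # Stand-in for the depth-wise 3x3 convolution branch: per-channel
    # local averaging that restores spatial detail lost by squeezing.
    h, w, c = x.shape
    p = kernel // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(kernel):
        for j in range(kernel):
            out += xp[i:i + h, j:j + w, :]
    return out / (kernel * kernel)

x = np.random.default_rng(0).standard_normal((8, 8, 16))
y = squeeze_axial_attention(x) + detail_enhancement(x)
print(y.shape)  # (8, 8, 16)
```

Note how the attention cost depends only on the axis lengths H and W, not on H×W, while the convolutional branch keeps full-resolution local detail.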

Figure 1: Left: Latency comparison with Transformer Vaswani et al. (2017), MixFormer Chen et al. (2022a), ACmix Pan et al. (2022b), Axial attention Ho et al. (2019) and local attention Luong et al. (2015). It is measured with a single module of channel dimension 64 on a Qualcomm Snapdragon 865 processor. Right: The mIoU versus latency on the ADE20K val set. MV2 means MobileNetV2 Sandler et al. (2018). MV3-L means MobileNetV3-Large Howard et al. (2019). MV3-Lr denotes MobileNetV3-Large-reduce Howard et al. (2019). The latency is measured on a single Qualcomm Snapdragon 865, and only an ARM CPU core is used for speed testing. No other means of acceleration, e.g., GPU or quantization, is used. For the right figure, the input size is 512×512. SeaFormer achieves a superior trade-off between mIoU and latency.

We make the following contributions: (i) We introduce a novel squeeze-enhanced Axial Transformer (SeaFormer) framework for mobile semantic segmentation; (ii) Critically, we design a generic attention block characterized by the formulation of squeeze Axial attention and detail enhancement, which can be used to create a family of backbone architectures with superior cost-effectiveness; (iii) We show top performance on the ADE20K and Cityscapes datasets, beating both mobile-friendly rivals and Transformer-based segmentation models by clear margins; (iv) Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to the image classification problem, demonstrating its potential to serve as a versatile mobile-friendly backbone.

2. RELATED WORK

Combination of Transformers and convolution. Convolution is relatively efficient but ill-suited to capturing long-range dependencies, while vision Transformers offer a powerful global receptive field but lack efficiency due to the cost of self-attention. To exploit the advantages of both, MobileViT Mehta & Rastegari (2022a), TopFormer Zhang et al. (2022c), LVT Yang et al. (2022), Mobile-Former Chen et al. (2022b), EdgeViTs Pan et al. (2022a), MobileViTv2 Mehta & Rastegari (2022b), EdgeFormer Zhang et al. (2022a) and EfficientFormer Li et al. (2022) are constructed as efficient ViTs by combining convolution with Transformers. Mobile-


https://github.com/fudan-zvg

