MOBILEVITV3: MOBILE-FRIENDLY VISION TRANSFORMER WITH SIMPLE AND EFFECTIVE FUSION OF LOCAL, GLOBAL AND INPUT FEATURES

Abstract

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1 block helps achieve competitive state-of-the-art results, the fusion block inside it creates scaling challenges and has a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3 block, which addresses the scaling challenge and simplifies the learning task. Our proposed MobileViTv3 block is used to create the MobileViTv3-XXS, XS and S models, which outperform MobileViTv1 on the ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75 and 1.0 models. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on the ImageNet-1K dataset. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets respectively. Our code and trained models will be made available on GitHub.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) [ResNet (He et al., 2016), DenseNet (Huang et al., 2017) and EfficientNet (Tan & Le, 2019)] are widely used for vision tasks such as classification, detection and segmentation due to their strong performance on established benchmark datasets such as ImageNet (Russakovsky et al., 2015), COCO (Lin et al., 2014), PascalVOC (Everingham et al., 2015), ADE20K (Zhou et al., 2017) and other similar datasets. When deploying CNNs on edge devices like mobile phones, which are generally resource-constrained, suitable light-weight CNNs come from the MobileNet family (MobileNetv1, MobileNetv2, MobileNetv3) (Howard et al., 2019), the ShuffleNets (ShuffleNetv1 and ShuffleNetv2) (Ma et al., 2018) and the light-weight versions of EfficientNet (Tan & Le, 2019) (EfficientNet-B0 and EfficientNet-B1). These relatively small models lag in accuracy behind models with large parameter and FLOP counts. Recently, Vision Transformers (ViTs) have emerged as a strong alternative to CNNs on these vision tasks. The self-attention mechanism in ViTs interacts with all parts of the image to produce features with global information embedded in them. This has been shown to produce results comparable to CNNs, but only with large pre-training data and advanced data augmentation (Dosovitskiy et al., 2020). This global processing also comes at the cost of the large parameter and FLOP counts needed to match the performance of CNNs, as seen in ViT (Dosovitskiy et al., 2020) and its variants such as DeiT (Touvron et al., 2021), SwinT (Liu et al., 2021), MViT (Fan et al., 2021), Focal-ViT (Yang et al., 2021), PVT (Wang et al., 2021), T2T-ViT (Yuan et al., 2021b) and XCiT (Ali et al., 2021). The performance of many of these models on ImageNet-1K, together with their parameters and FLOPs, is shown in Figure 1. Among these models, only MobileViTs and MobileFormer are specifically designed for resource-constrained environments such as mobile devices. These two models achieve performance competitive with other hybrid networks while using fewer parameters and FLOPs. Even though such small hybrid models are critical for vision tasks on mobile devices, little work has been done in this area.

Many recent works have introduced convolutional layers into ViT architectures to create such hybrid models.

Our work focuses on improving one such light-weight family of models, the MobileViTs (MobileViTv1 (Mehta & Rastegari, 2021) and MobileViTv2 (Mehta & Rastegari, 2022)). When compared to models with a parameter budget of 6 million (M) or less, MobileViTs achieve competitive state-of-the-art results on the classification task with a simple training recipe (basic data augmentation), and they can be used as an efficient backbone across different vision tasks such as detection and segmentation. Focusing only on models with 6M parameters or fewer, we pose the question: is it possible to change the model architecture to improve performance while maintaining similar parameters and FLOPs? To do so, our work looks into the challenges of the MobileViT-block architecture and proposes a simple and effective way to fuse input, local (CNN) and global (ViT) features, which leads to significant performance improvements on the ImageNet-1K, ADE20K, PascalVOC and COCO datasets. We propose four main changes to the MobileViTv1 block (three changes w.r.t. the MobileViTv2 block), which improve performance on classification, segmentation and detection tasks. For example, MobileViTv3-XXS and MobileViTv3-XS perform 2% and 1.9% better with similar parameters and FLOPs on the ImageNet-1K dataset compared to MobileViTv1-XXS and MobileViTv1-XS respectively. In MobileViTv2, the fusion block is absent; our proposed fusion block is introduced into the MobileViTv2 architecture to create the MobileViTv3-1.0, 0.75 and 0.5 architectures. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively with similar parameters and FLOPs on the ImageNet-1K dataset.



Figure 1: Comparing Top-1 accuracy of MobileViTv3, ViT variants and hybrid models on the ImageNet-1K dataset. The area of each bubble corresponds to the number of FLOPs in the model; reference FLOP sizes are shown at the bottom right (for example, 250M is 250 mega-FLOPs, i.e., million FLOPs). Models of our MobileViTv3 architecture outperform other models within similar parameter budgets of under 2M, 2-4M and 4-8M, and achieve competitive results compared to models with more than 8M parameters.

The four changes to the MobileViTv1 block (three w.r.t. the MobileViTv2 block) are shown in Figure 2. Three changes are in the fusion block: first, the 3x3 convolutional layer is replaced with a 1x1 convolutional layer; second, the features of the local and global representation blocks are fused together, instead of the input and global representation features; third, the input features are added in the fusion block as a final step before generating the output of the MobileViT block. The fourth change is in the local representation block, where the normal 3x3 convolutional layer is replaced by a depthwise 3x3 convolutional layer. These changes reduce the parameters and FLOPs of the MobileViTv1 block and allow scaling (increasing the width of the model) to create the new MobileViTv3-S, XS and XXS architectures, which outperform MobileViTv1 on classification (Figure 1), segmentation and detection tasks.
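As an illustration, the dataflow of these four changes can be sketched at the shape level in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the global representation block (an unfolded transformer in the real model) is stubbed with a 1x1 projection, all weights are random illustrative tensors, and normalization, activations and the surrounding expansion/projection convolutions are omitted.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def depthwise_conv3x3(x, w):
    # Depthwise 3x3 convolution with zero padding 1 (spatial size kept).
    # x: (C, H, W), w: (C, 3, 3) -> (C, H, W)
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out

def mobilevitv3_block_sketch(x, rng):
    # Shape-level sketch of the MobileViTv3 block; the weight tensors
    # and the transformer stub are illustrative, not from the paper.
    C, H, W = x.shape
    # Local representation: depthwise 3x3 (change 4) followed by 1x1.
    local = conv1x1(depthwise_conv3x3(x, rng.standard_normal((C, 3, 3))),
                    rng.standard_normal((C, C)))
    # Global representation: a transformer in the real block, stubbed
    # here with a 1x1 projection.
    glob = conv1x1(local, rng.standard_normal((C, C)))
    # Fusion: 1x1 conv (change 1) over concatenated LOCAL and global
    # features (change 2) instead of input and global features.
    fused = conv1x1(np.concatenate([local, glob], axis=0),
                    rng.standard_normal((C, 2 * C)))
    # Input features added as the final step (change 3).
    return fused + x
```

Note on the parameter savings: the fusion convolution maps the concatenated 2C channels back to C, so replacing its 3x3 kernel with a 1x1 kernel alone cuts that layer's weights from roughly 9·2C·C to 2C·C, a 9x reduction, which is what makes widening (scaling) the model affordable.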

