MOBILEVITV3: MOBILE-FRIENDLY VISION TRANSFORMER WITH SIMPLE AND EFFECTIVE FUSION OF LOCAL, GLOBAL AND INPUT FEATURES

Abstract

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and has a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. Our proposed MobileViTv3-block, used to create the MobileViTv3-XXS, XS and S models, outperforms MobileViTv1 on the ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75 and 1.0 models. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on the ImageNet-1K dataset. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets respectively. Our code and the trained models will be made available on GitHub.
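The fusion described above combines three feature maps: the local features from the convolutional branch, the global features from the transformer branch, and the block's input. The sketch below illustrates one such combination (concatenate local and global features, project with a per-pixel 1x1-conv-style linear map, then add the input residually); the channel layout, projection shape and function names are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def fuse(inp, local, glob, w_proj):
    """Illustrative fusion of local, global and input features.

    inp, local, glob: feature maps of shape (c, h, w).
    w_proj: (c, 2c) weight matrix acting as a 1x1 convolution
            (a per-pixel linear map over channels) -- an assumed shape.
    """
    cat = np.concatenate([local, glob], axis=0)       # (2c, h, w): stack along channels
    fused = np.einsum('oc,chw->ohw', w_proj, cat)     # 1x1 conv projection back to (c, h, w)
    return fused + inp                                # residual add of the input features

rng = np.random.default_rng(1)
c, h, w = 4, 8, 8
inp, local, glob = (rng.standard_normal((c, h, w)) for _ in range(3))
w_proj = rng.standard_normal((c, 2 * c))
out = fuse(inp, local, glob, w_proj)
print(out.shape)  # (4, 8, 8)
```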

1. INTRODUCTION

Convolutional Neural Networks (CNNs) [ResNet (He et al., 2016), DenseNet (Huang et al., 2017) and EfficientNet (Tan & Le, 2019)] are widely used for vision tasks such as classification, detection and segmentation, due to their strong performance on established benchmark datasets such as ImageNet (Russakovsky et al., 2015), COCO (Lin et al., 2014), PascalVOC (Everingham et al., 2015), ADE20K (Zhou et al., 2017) and other similar datasets. When deploying CNNs on edge devices like mobile phones, which are generally resource constrained, light-weight CNNs suitable for such environments come from the family of MobileNet models (MobileNetv1, MobileNetv2, MobileNetv3) (Howard et al., 2019), ShuffleNets (ShuffleNetv1 and ShuffleNetv2) (Ma et al., 2018) and light-weight versions of EfficientNet (Tan & Le, 2019) (EfficientNet-B0 and EfficientNet-B1). These relatively small models lack accuracy when compared to models with large parameter counts and FLOPs. Recently, Vision Transformers (ViTs) have emerged as strong alternatives to CNNs on these vision tasks. The self-attention mechanism in ViTs interacts with all parts of the image to produce features with global information embedded in them. This has been demonstrated to produce results comparable to CNNs, but only with large pre-training data and advanced data augmentation (Dosovitskiy et al., 2020). Moreover, this global processing comes at the cost of large parameter counts and FLOPs to match the performance of CNNs, as seen in ViT (Dosovitskiy et al., 2020) and its variants such as DeiT (Touvron et al., 2021), SwinT (Liu et al., 2021), MViT (Fan et al., 2021), Focal-ViT (Yang et al., 2021), PVT (Wang et al., 2021), T2T-ViT (Yuan et al., 2021b) and XCiT (Ali et al., 2021). Many recent works have introduced convolutional layers into ViT architectures to form hybrid networks that improve performance, achieve sample efficiency and make the models more efficient in terms of parameters and FLOPs, such as MobileViTs (MobileViTv1 (Mehta & Rastegari, 2021), MobileViTv2
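The global interaction of self-attention mentioned above can be sketched as scaled dot-product attention over a sequence of patch tokens: every output token is a weighted mix of all input tokens, which is why each feature carries global information. The shapes and names below are a minimal single-head illustration, not any specific ViT's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x: (n, d) patch embeddings; wq, wk, wv: (d, d) projections.
    The (n, n) score matrix lets every token attend to every other
    token, embedding global context into each output feature.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])   # (n, n) pairwise attention scores
    return softmax(scores, axis=-1) @ v      # (n, d) globally mixed features

rng = np.random.default_rng(0)
n, d = 16, 8                                  # 16 patch tokens, 8-dim embeddings
x = rng.standard_normal((n, d))
out = self_attention(x, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)  # (16, 8)
```

Note that the score matrix is quadratic in the number of tokens, which is the source of the parameter/FLOP cost discussed above and what MobileViTv2's linear-complexity transformers are designed to avoid.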

