MIXPRO: DATA AUGMENTATION WITH MASKMIX AND PROGRESSIVE ATTENTION LABELING FOR VISION TRANSFORMER

Abstract

The recently proposed data augmentation TransMix employs attention labels to help vision transformers (ViTs) achieve better robustness and performance. However, TransMix is deficient in two aspects: 1) the image cropping method of TransMix may not be suitable for vision transformers; 2) at the early stage of training, the model produces unreliable attention maps, and TransMix uses these unreliable maps to compute mixed attention labels, which can mislead the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL) in the image and label space, respectively. In image space, we design MaskMix, which mixes two images based on a patch-like grid mask. In particular, the size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content. In label space, we design PAL, which utilizes a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and Progressive Attention Labeling into our new data augmentation method, named MixPro. Experimental results show that our method can improve ViT-based models at various scales on ImageNet classification (73.8% top-1 accuracy with DeiT-T trained for 300 epochs). After being pre-trained with MixPro on ImageNet, ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro shows stronger robustness on several benchmarks.
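The patch-aligned grid mask described above can be sketched as follows. This is a minimal illustration based only on the properties stated in the abstract (grid cells whose side is a multiple of the ViT patch size, so each image patch comes from exactly one source image); the function name and the uniform per-cell sampling are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def maskmix(img_a, img_b, patch=16, mask_patch=32, ratio=0.5, rng=None):
    """Illustrative patch-aligned grid mixing: build a random grid mask whose
    cells are mask_patch pixels on a side (a multiple of the ViT patch size),
    so every image patch is taken wholly from one source image. Returns the
    mixed image and lam, the fraction of pixels kept from img_a, which would
    weight the mixed label. Hypothetical sketch, not the authors' code."""
    assert mask_patch % patch == 0, "mask cells must align with image patches"
    rng = rng if rng is not None else np.random.default_rng()
    H, W = img_a.shape[:2]
    gh, gw = H // mask_patch, W // mask_patch
    # 1 = take this cell from img_a, 0 = take it from img_b.
    grid = (rng.random((gh, gw)) < ratio).astype(img_a.dtype)
    # Expand each grid cell to a mask_patch x mask_patch pixel block.
    mask = np.kron(grid, np.ones((mask_patch, mask_patch), img_a.dtype))
    mask = mask[..., None]  # broadcast over channels
    mixed = mask * img_a + (1 - mask) * img_b
    lam = mask.mean()
    return mixed, lam
```

Because the mask is built at grid-cell granularity rather than per pixel, no ViT patch ever mixes content from both images.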

1. INTRODUCTION

Transformers (Vaswani et al., 2017) have revolutionized the natural language processing (NLP) field and have recently inspired the emergence of transformer-style architectures in the computer vision (CV) field, such as the Vision Transformer (ViT) (Dosovitskiy et al., 2020). These architectures achieve competitive results in numerous CV tasks such as image classification (Touvron et al., 2021a; Yuan et al., 2021; Wang et al., 2021; Liu et al., 2021; Touvron et al., 2021b; Ali et al., 2021), object detection (Fang et al., 2021; Dai et al., 2021; Carion et al., 2020; Zhu et al., 2020), and image segmentation (Strudel et al., 2021; Wang et al., 2021; Liu et al., 2021). Previous research has found that ViT-based networks are difficult to optimize and can easily overfit even when trained on datasets with many images (Russakovsky et al., 2015), resulting in a significant generalization gap on test data. To improve model generalization and robustness, recent works (Dosovitskiy et al., 2020; Touvron et al., 2021a; Yuan et al., 2021; Wang et al., 2021; Liu et al., 2021; Touvron et al., 2021b; Ali et al., 2021) employ data augmentation (Zhang et al., 2017) and regularization techniques (Szegedy et al., 2016) during training. Among them, mixup-based methods such as Mixup (Zhang et al., 2017), CutMix (Yun et al., 2019), and TransMix (Chen et al., 2021) improve the generalization and robustness of ViT-based networks. In CutMix, patches are cut and pasted among training images, and the ground-truth labels are mixed proportionally to the area of the patches. TransMix, which builds on CutMix, further considers that not all pixels are created equal.
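The CutMix mechanism just described, in which a rectangular region is pasted from one image into another and the labels are mixed in proportion to the pasted area, can be sketched as below. This follows the published CutMix recipe (Yun et al., 2019); the function and variable names are ours.

```python
import numpy as np

def cutmix(img_a, img_b, label_a, label_b, lam, rng=None):
    """CutMix-style mixing: paste a rectangular patch of img_b into img_a and
    mix the one-hot labels in proportion to the pasted area. lam is the target
    fraction of img_a to keep; it is re-adjusted to the exact pasted area."""
    rng = rng if rng is not None else np.random.default_rng()
    H, W = img_a.shape[:2]
    # Choose patch dimensions so the patch area is roughly (1 - lam) of the image.
    cut_h = int(H * np.sqrt(1.0 - lam))
    cut_w = int(W * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, H), rng.integers(0, W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    # Re-compute lambda from the exact pasted area, then mix labels proportionally.
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)
    mixed_label = lam_adj * label_a + (1.0 - lam_adj) * label_b
    return mixed, mixed_label
```

Note that the rectangle's placement ignores patch boundaries, which is exactly the aspect the abstract argues may be ill-suited to ViTs, since border patches end up containing pixels from both images.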

