MIXPRO: DATA AUGMENTATION WITH MASKMIX AND PROGRESSIVE ATTENTION LABELING FOR VISION TRANSFORMER

Abstract

The recently proposed data augmentation method TransMix employs attention labels to help vision transformers (ViTs) achieve better robustness and performance. However, TransMix is deficient in two respects: 1) the image cropping of TransMix may not be suitable for vision transformers; 2) at the early stage of training, the model produces unreliable attention maps, and TransMix uses these unreliable maps to compute mixed attention labels, which can mislead the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL) in image and label space, respectively. In detail, from the perspective of image space, we design MaskMix, which mixes two images based on a patch-like grid mask. In particular, the size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content. From the perspective of label space, we design PAL, which uses a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and Progressive Attention Labeling into our new data augmentation method, named MixPro. Experimental results show that our method improves various ViT-based models at different scales on ImageNet classification (73.8% top-1 accuracy with DeiT-T trained for 300 epochs). After being pre-trained with MixPro on ImageNet, ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro shows stronger robustness on several benchmarks.

1. INTRODUCTION

Transformers (Vaswani et al., 2017) have revolutionized the natural language processing (NLP) field and have recently inspired transformer-style architectures in computer vision (CV), such as the Vision Transformer (ViT) (Dosovitskiy et al., 2020). These architectures achieve competitive results in numerous CV tasks such as image classification (Touvron et al., 2021a; Yuan et al., 2021; Wang et al., 2021; Liu et al., 2021; Touvron et al., 2021b; Ali et al., 2021), object detection (Fang et al., 2021; Dai et al., 2021; Carion et al., 2020; Zhu et al., 2020), and image segmentation (Strudel et al., 2021; Wang et al., 2021; Liu et al., 2021). Previous research has found that ViT-based networks are difficult to optimize and can easily overfit the training data, even on datasets with many images (Russakovsky et al., 2015), resulting in a significant generalization gap on test data. To improve the generalization and robustness of the model, recent works (Dosovitskiy et al., 2020; Touvron et al., 2021a; Yuan et al., 2021; Wang et al., 2021; Liu et al., 2021; Touvron et al., 2021b; Ali et al., 2021) employ data augmentation (Zhang et al., 2017) and regularization techniques (Szegedy et al., 2016) during training. Among them, mixup-based methods such as Mixup (Zhang et al., 2017), CutMix (Yun et al., 2019), and TransMix (Chen et al., 2021) are used to improve the generalization and robustness of ViT-based networks. In CutMix, patches are cut and pasted among training images, and the ground-truth labels are mixed proportionally to the area of the patches. TransMix, building on CutMix, further considers that not all pixels are created equal and re-weights the mixed label with the model's attention map. However, attention maps may not always be reliable during training. For example, at the beginning of training, the model has little representation capability, and the resulting attention maps are unreliable.
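As a reference point, the area-proportional label mixing of CutMix just described can be sketched in a few lines of NumPy. This is an illustrative sketch under our own naming, not the original implementation; the function signature and the `lam` parameterization are assumptions.

```python
import numpy as np

def cutmix(image_a, image_b, label_a, label_b, lam):
    """Minimal CutMix sketch: paste a random rectangle from image_b
    into image_a, then mix the labels by the pasted area."""
    h, w = image_a.shape[:2]
    # Rectangle whose area is roughly (1 - lam) of the image.
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    mixed = image_a.copy()
    mixed[y1:y2, x1:x2] = image_b[y1:y2, x1:x2]
    # Area-proportional label weight (lambda_area in the paper's notation),
    # recomputed from the actual pasted rectangle after clipping.
    lam_area = 1 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_label = lam_area * label_a + (1 - lam_area) * label_b
    return mixed, mixed_label
```

Note that the rectangle ignores the ViT patch grid, so individual image patches can straddle both source images; this is the image-space issue that MaskMix is designed to avoid.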
In addition, massive data augmentation strategies can produce difficult samples for which the attention map is also unreliable. In such cases, reassigning the mixed labels with a low-confidence attention map generates noisy mixed labels. To this end, we propose a novel data augmentation method, MixPro, which tackles the above issues from the perspectives of image space and label space, respectively. Our approach is presented in Fig. 1(b). In detail, from the perspective of image space, we design MaskMix, inspired by the mask strategy of MAE (He et al., 2022). MaskMix replaces the masked patches of one image with the visible patches of another image to create a mixed image. In particular, the size of each mask patch is adjustable and is a multiple of the image patch size, so that each image patch comes from only one image (shown as yellow and blue patches in the figure). Moreover, this grid-mask decomposition of the images captures both regional and global content. From the perspective of label space, we design Progressive Attention Labeling (PAL), which uses a progressive factor (α) to dynamically re-weight the attention weight of the mixed attention label. The progressive factor (α) provides an indirect measure of the confidence of the attention map for the mixed sample, trading off the attention-proportional weight (λ_attn) against the area-proportional weight (λ_area). Finally, we combine MaskMix and Progressive Attention Labeling into MixPro, our data augmentation strategy for improving the generalization and robustness of ViT-based models. In experiments, we present extensive evaluations of MixPro on various ViT-based models and tasks. MixPro yields greater performance gains than TransMix for all listed ViT-based models; notably, it brings a further improvement of 0.9% for DeiT-S.
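The two components described above can be illustrated with a short NumPy sketch. This is an illustrative approximation, not the released code: all names are ours, the images are assumed to have dimensions divisible by the mask-patch size, and the linear schedule for the progressive factor α is an assumption (the paper only states that α re-weights λ_area and λ_attn dynamically).

```python
import numpy as np

def maskmix(image_a, image_b, patch=16, mask_patch_mult=2, ratio=0.5, rng=None):
    """MaskMix sketch: build a grid mask whose cells are a multiple of the
    ViT patch size, so every image patch of the mix comes from one image.
    Assumes image height/width are divisible by patch * mask_patch_mult."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image_a.shape[:2]
    cell = patch * mask_patch_mult                 # mask-patch size
    gh, gw = h // cell, w // cell                  # grid resolution
    # Randomly choose which grid cells are taken from image_b.
    grid = rng.random((gh, gw)) < ratio
    # Upsample the grid to pixel resolution (each cell is uniform).
    mask = np.kron(grid.astype(np.uint8),
                   np.ones((cell, cell), dtype=np.uint8)).astype(bool)
    mixed = np.where(mask[..., None], image_b, image_a)
    lam_area = 1.0 - mask.mean()                   # area-proportional weight
    return mixed, lam_area

def pal_lambda(lam_area, lam_attn, epoch, total_epochs):
    """Progressive Attention Labeling sketch: as training matures and the
    attention maps become more reliable, shift weight from the area term
    to the attention term. A linear schedule is assumed here."""
    alpha = epoch / total_epochs
    return (1 - alpha) * lam_area + alpha * lam_attn
```

At epoch 0 the mixed label relies entirely on λ_area, sidestepping the unreliable early attention maps; by the end of training it relies on λ_attn, recovering TransMix-style attention labeling.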
Moreover, we demonstrate that if the model is first pre-trained with MixPro on ImageNet, this superiority further transfers to downstream tasks including object detection, instance segmentation, and semantic segmentation.



Figure 1: Comparison between TransMix (Chen et al., 2021) (a) and our proposed MixPro (b). 1) In image space, TransMix shares the same cropped region with CutMix (Yun et al., 2019), which results in patches containing regions from both images (patches colored red). In contrast, as shown on the right of the figure, MixPro mixes patches using a patch-like mask whose mask-patch size is a multiple of the image patch size. This ensures that each patch of the mixed image comes from only one image (patches colored yellow and blue). 2) In label space, TransMix computes λ from λ_attn. In contrast, our progressive attention labeling dynamically re-weights λ_area and λ_attn using a progressive factor (α).

