PROGRESSIVELY COMPRESSED AUTOENCODER FOR SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

As a typical self-supervised learning strategy, Masked Image Modeling (MIM) is driven by recovering all masked patches from the visible ones. However, patches from the same image are highly correlated, so reconstructing every masked patch is redundant. We find that existing MIM-based methods neglect this redundancy, incurring non-negligible computational overhead that does not necessarily benefit the learned representation. In this paper, we present a novel approach named PCAE, short for Progressively Compressed AutoEncoder, which addresses the redundant-reconstruction issue by progressively compacting tokens and retaining only the information necessary for forward propagation and reconstruction. In particular, we identify redundant tokens via a simple yet effective similarity metric between each token and the mean of the token sequence. Redundant tokens, whose information can likely be represented by the others, are progressively dropped during forward propagation, and, importantly, we reconstruct only the retained tokens. As a result, we achieve a better trade-off between performance and efficiency in pre-training. Moreover, benefiting from this flexible strategy, PCAE can also be directly employed for downstream fine-tuning tasks and enables scalable deployment. Experiments show that PCAE achieves performance comparable to MAE with only 1/8 of the GPU days.

1. INTRODUCTION

Contrastive learning has witnessed great progress and even outperforms its supervised counterpart on downstream tasks (Chen et al., 2020b; Caron et al., 2021; Grill et al., 2020). However, contrastive methods usually require more epochs and more computation to reach competitive performance. Recently, Masked Image Modeling (MIM) has become a popular topic for its scalability as well as its promising performance (Bao et al., 2022; He et al., 2022; Dong et al., 2021; Chen et al., 2022; Xie et al., 2022), especially MAE (He et al., 2022), which significantly accelerates training by operating on only the 25% visible patches in the encoder. MIM methods learn representations by recovering the masked regions of input images from the visible ones. These masked regions are represented with learnable mask tokens during pre-training. Since MIM methods usually apply high mask ratios to the input image, the number of mask tokens is considerable. As a result, these mask tokens incur large computational overhead during training while contributing little information. For example, mask tokens outnumber visible tokens three to one in the decoder of MAE, so for a typical ViT-Base the overhead of the decoder (5.28 GFLOPs) exceeds that of the encoder (4.3 GFLOPs), even though the decoder is relatively lightweight. Another family of MIM methods, i.e., SimMIM-like methods (Xie et al., 2022; Bao et al., 2022), retains mask tokens throughout the much heavier encoder, causing an even more serious computational burden. However, due to the spatial redundancy of images, the masked regions are highly correlated, and it is therefore unnecessary to recover all of them. Unfortunately, previous MIM methods neglect this and recover all masked regions through plenty of mask tokens, which causes non-negligible computational overhead that does not necessarily benefit the learned representation.
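The 3:1 imbalance between mask tokens and visible tokens in the decoder follows directly from the mask ratio; a quick back-of-the-envelope check, assuming the standard ViT-Base setting of 224×224 inputs with 16×16 patches:

```python
# Standard ViT-Base patching (assumed): 224x224 input, 16x16 patches.
num_patches = (224 // 16) ** 2          # 196 patch tokens per image
mask_ratio = 0.75                       # MAE's default high mask ratio

visible = int(num_patches * (1 - mask_ratio))  # 49 tokens seen by the encoder
masked = num_patches - visible                 # 147 mask tokens in the decoder

print(visible, masked, masked // visible)  # 49 147 3
```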
This observation motivates us to relax MIM pre-training by reducing the redundancy in the reconstruction targets. In this paper, we present a novel approach named Progressively Compressed AutoEncoder (PCAE) to this end. The core idea is, instead of recovering all masked patches, to identify and discard the redundant ones and reconstruct only the representative targets. However, naively removing part of the reconstruction targets causes serious performance degradation (please refer to Table 9 for details), since some important information is discarded along with them. We therefore mitigate the information loss by exploiting the self-attention mechanism of the vision transformer to spread information from the discarded patches to the retained ones. Specifically, we employ a momentum encoder during pre-training and discard tokens produced by the momentum encoder, rather than the original reconstruction targets. To further alleviate information loss, we propose a progressive discarding strategy in which redundant tokens are discarded gradually at different layers of the momentum encoder (please refer to Table 3 for details). The remaining issue is how to identify the redundant tokens. Matrix decomposition precisely identifies the redundant component of the token sequence but is prohibitively expensive in practice. As an alternative, we propose a simple yet effective criterion based on each token's similarity to the mean of the token sequence. Experimental results and visualizations demonstrate that this criterion is highly effective and introduces negligible overhead. Benefiting from the proposed progressive token reduction (PTR) strategy, PCAE reduces the reconstruction targets from 147 tokens to 18 tokens (taking ViT-Base as an example), which significantly improves efficiency. Experiments on multiple benchmarks demonstrate the effectiveness of the proposed method.
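The mean-similarity criterion and the progressive schedule can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the 50% keep ratio and three-stage schedule are assumptions chosen so that 147 tokens reduce to 18.

```python
import numpy as np

def reduce_tokens(tokens: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the tokens least similar to the sequence mean (illustrative sketch).

    tokens: (N, D) token embeddings. Tokens with the highest cosine similarity
    to the mean token are treated as redundant and dropped.
    """
    mean = tokens.mean(axis=0)
    sim = tokens @ mean / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(mean) + 1e-8
    )
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(sim)[:n_keep])  # keep least redundant, in order
    return tokens[keep_idx]

# Three progressive stages with an assumed 50% keep ratio take the 147
# masked tokens of ViT-Base down to 18 reconstruction targets.
tokens = np.random.randn(147, 768)
for _ in range(3):
    tokens = reduce_tokens(tokens, keep_ratio=0.5)
print(tokens.shape[0])  # 18
```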
Specifically, PCAE accelerates training 2.25× compared with MAE (He et al., 2022) (739.7 img/s vs. 328.4 img/s, ViT-Base, 32 GB V100), while enjoying much faster convergence (PCAE reaches 83.6 at 300 epochs vs. MAE's 83.6 at 1600 epochs) or higher performance (PCAE 83.9 at 800 epochs vs. MAE 83.6 at 1600 epochs). Besides, we extend the proposed strategy to the inference phase as well as downstream fine-tuning for classification tasks, and the results show that the strategy still works well. Benefiting from this flexibility, we could provide models with different in-



Figure 1: The overall framework of PCAE. (a) The pre-training framework consists of the student, the teacher (EMA model), and the decoder module. The student operates on visible patches, while the teacher operates only on masked ones. The teacher progressively discards redundant tokens and ultimately outputs far fewer tokens for reconstruction. The decoder reconstructs the output of the teacher from the output of the student. (b) PCAE for downstream fine-tuning. (c) The Progressive Token Reduction (PTR) module. PTR can be applied in the teacher for pre-training or in the backbone for downstream fine-tuning.
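As a sketch of how PTR-style reduction could be interleaved with transformer blocks, per Figure 1(c): the block functions, layer indices, and keep ratio below are all illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def forward_with_ptr(tokens, blocks, ptr_layers, keep_ratio=0.5):
    """Apply transformer blocks, pruning redundant tokens after chosen layers.

    blocks: list of callables mapping (N, D) -> (N, D).
    ptr_layers: set of layer indices after which token reduction is applied.
    """
    for i, block in enumerate(blocks):
        tokens = block(tokens)
        if i in ptr_layers:
            # Drop tokens most similar to the sequence mean (see Section 1).
            mean = tokens.mean(axis=0)
            sim = tokens @ mean / (
                np.linalg.norm(tokens, axis=1) * np.linalg.norm(mean) + 1e-8
            )
            n_keep = max(1, int(len(tokens) * keep_ratio))
            tokens = tokens[np.sort(np.argsort(sim)[:n_keep])]
    return tokens

# Twelve identity "blocks" stand in for ViT-Base layers; reduce after
# layers 3, 7, and 11 (placement assumed for illustration).
blocks = [lambda x: x] * 12
out = forward_with_ptr(np.random.randn(147, 768), blocks, ptr_layers={3, 7, 11})
print(out.shape[0])  # 18
```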

Availability

The code is available at https://github.com/caddyless/PCAE/ 

