PROGRESSIVELY COMPRESSED AUTOENCODER FOR SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

As a typical self-supervised learning strategy, Masked Image Modeling (MIM) learns by recovering all masked patches from visible ones. However, patches from the same image are highly correlated, making it redundant to reconstruct all of the masked patches. We find that existing MIM-based methods neglect this redundancy, incurring non-negligible computational overheads that do not necessarily benefit the self-supervised representation. In this paper, we present a novel approach named PCAE, short for Progressively Compressed AutoEncoder, which addresses the redundant-reconstruction issue by progressively compacting tokens and retaining only the information necessary for forward propagation and reconstruction. In particular, we identify redundant tokens in an image via a simple yet effective similarity metric between each token and the mean of the token sequence. Redundant tokens that can likely be represented by the others are progressively dropped during forward propagation, and, importantly, we reconstruct only the retained tokens. As a result, we achieve a better trade-off between performance and efficiency in pre-training. Moreover, owing to this flexible strategy, PCAE can also be directly employed for downstream fine-tuning tasks, enabling scalable deployment. Experiments show that PCAE achieves performance comparable to MAE with only 1/8 of the GPU days.

1. INTRODUCTION

Contrastive learning has witnessed great progress and even outperforms its supervised counterpart on downstream tasks (Chen et al., 2020b; Caron et al., 2021; Grill et al., 2020). However, contrastive methods usually require more epochs and computational overhead to reach strong performance. Recently, Masked Image Modeling (MIM) has become a popular topic owing to its scalability as well as its promising performance (Bao et al., 2022; He et al., 2022; Dong et al., 2021; Chen et al., 2022; Xie et al., 2022), especially MAE (He et al., 2022), which significantly accelerates training by operating in the encoder on only the 25% of patches that remain visible. MIM methods learn representations by recovering the masked regions of input images from the visible ones. These masked regions are represented with learnable mask tokens during pre-training. Since MIM methods usually apply high mask ratios to the input image, the number of mask tokens is considerable. As a result, these mask tokens incur large computational overheads during training while contributing little information. For example, the number of mask tokens is three times that of the visible tokens in the decoder of MAE, so the overhead of the decoder (5.28G FLOPs) exceeds that of the encoder (4.3G FLOPs) for a typical ViT-Base, even though the decoder is relatively lightweight. Another family of MIM methods, i.e., SimMIM-like methods (Xie et al., 2022; Bao et al., 2022), retains mask tokens throughout the more complex encoder, causing an even heavier computational burden. However, due to the spatial redundancy of images, the masked regions are highly correlated and redundant.
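The redundancy-driven selection outlined in the abstract, scoring each token by its similarity to the mean of the token sequence and dropping the most redundant ones, can be sketched roughly as follows. This is a minimal single-step illustration with hypothetical names (`select_tokens`, `keep_ratio`); the actual PCAE drops tokens progressively across the forward pass rather than all at once.

```python
import numpy as np

def select_tokens(tokens, keep_ratio=0.5):
    """Illustrative redundancy-based token selection (not the exact PCAE rule).

    tokens: (N, D) array of token embeddings.
    Tokens most similar to the mean token are treated as redundant and
    dropped; the least similar (most distinctive) ones are retained.
    """
    mean = tokens.mean(axis=0)  # (D,) mean of the token sequence
    # cosine similarity between each token and the mean token
    sim = tokens @ mean / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(mean) + 1e-8
    )
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    # argsort ascending: the first n_keep indices are the least redundant tokens
    keep_idx = np.sort(np.argsort(sim)[:n_keep])
    return keep_idx, tokens[keep_idx]

rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 8))       # 16 tokens, 8-dim embeddings
idx, kept = select_tokens(toks, keep_ratio=0.25)
print(kept.shape)  # (4, 8): only a quarter of the tokens propagate onward
```

Applying such a rule at several depths of the network shrinks the token sequence progressively, which is what yields the compute savings over reconstructing every masked patch.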

Availability

The code is available at https://github.com/caddyless/PCAE/ 

