ENERGY-INSPIRED SELF-SUPERVISED PRETRAINING FOR VISION MODELS

Abstract

Motivated by the fact that forward and backward passes of a deep network naturally form symmetric mappings between input and output representations, we introduce a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single network, without any auxiliary components such as an extra decoder. For the forward pass, we fit the network to an energy function that assigns low energy scores to samples that belong to an unlabeled dataset and high energy scores otherwise. For the backward pass, we restore data from corrupted versions iteratively, using gradient-based optimization along the direction of energy minimization. In this way, we naturally fold the encoder-decoder architecture widely used in masked image modeling into the forward and backward passes of a single vision model. The framework thus accepts a wide range of pretext tasks with different data corruption methods, and permits models to be pretrained from masked image modeling, patch sorting, and image restoration, including super-resolution, denoising, and colorization. We support our findings with extensive experiments, and show that the proposed method delivers comparable or even better performance with remarkably fewer training epochs than state-of-the-art self-supervised vision model pretraining methods. Our findings shed light on further exploring self-supervised vision model pretraining and pretext tasks beyond masked image modeling.

1. INTRODUCTION

The recent rapid development of computation hardware and deep network architectures has paved the way for learning very large deep networks that match or even exceed human performance on complex tasks (Brown et al., 2020; He et al., 2017; Silver et al., 2016). However, as annotating data remains costly, leveraging unlabeled data to facilitate the learning of very large models is attracting increasing attention. The success of exploiting context information in massive unlabeled data in natural language processing (NLP) stimulated Chen et al. (2020a) to use the direct modeling of pixel sequences as the pretext task for vision model pretraining. Recent self-supervised vision model pretraining through masked image modeling (MIM) (He et al., 2022; Wei et al., 2021; Xie et al., 2022) typically adopts an auto-encoder (AE) architecture, where the target vision model to be pretrained serves as an encoder that maps an image with incomplete pixel information to a latent representation, and an auxiliary decoder is jointly trained to restore the missing information from that representation. Contrastive self-supervised learning methods (Chen et al., 2020b) usually require large training batch sizes to provide sufficient negative samples. Recent Siamese-network-based self-supervised learning methods (Grill et al., 2020; Chen & He, 2021; Tian et al., 2021; He et al., 2020; Chen et al., 2021) alleviate the large-batch challenge by deploying a momentum copy of the target model to facilitate training and prevent trivial solutions. VICReg (Bardes et al., 2022) prevents feature collapse with two explicit regularization terms. Barlow Twins (Zbontar et al., 2021) reduces the need for large batch sizes or Siamese networks by proposing a new objective based on the cross-correlation matrix between features of different image augmentations.
In this paper, we take a further step towards answering the following question: can we train a standard deep network to perform both representation encoding and masked prediction simultaneously, so that no auxiliary components, heavy data augmentations, or modifications to the network structure are required? Observing that the forward and backward passes of a deep network naturally form symmetric mappings between input and output representations, we extend recent progress on energy-based models (EBMs) (Xie et al., 2016; Du & Mordatch, 2019; Du et al., 2020b; Zhao et al., 2017) and introduce a model-agnostic self-supervised framework that pretrains any deep vision model. Given an unlabeled dataset, we train the forward pass of the target vision model to perform discriminative recognition. Instead of instance-wise classification as in contrastive self-supervised learning, we train the target vision model to perform binary classification by fitting it to an energy function that assigns low energy values to positive samples from the dataset and high energy values otherwise. We train the backward pass of the target vision model to perform conditional image restoration as in masked image modeling, restoring positive image samples from their corrupted versions through iterative gradient-based updates along the direction of energy minimization. Such a conditional sampling scheme can produce samples of satisfactory quality with as few as one gradient step, thus avoiding the prohibitive cost of applying the standard implicit sampling of EBMs to high-dimensional data. In this way, we naturally fold the encoder-decoder architecture widely used in masked image modeling into the forward and backward passes of a single vision model, so that the structure tailored for discriminative tasks is fully preserved and no auxiliary components or heavy data augmentations are needed.
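The forward/backward symmetry described above can be illustrated with a minimal NumPy sketch. This is not the paper's actual architecture: the linear "network", the quadratic energy, the step size, and all function names here are illustrative assumptions standing in for a deep vision model, chosen only so the gradient of the energy with respect to the input can be written in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the vision model f_theta: a single linear map.
# E(x) = 0.5 * ||W x||^2 plays the role of the learned energy function.
W = rng.standard_normal((16, 64)) * 0.1

def energy(x):
    """Forward pass: scalar energy score for a flattened input x."""
    z = W @ x
    return 0.5 * float(z @ z)

def energy_grad(x):
    """Backward pass: gradient of the energy w.r.t. the input."""
    return W.T @ (W @ x)

def restore(x_corrupted, steps=1, step_size=0.1):
    """Conditional restoration: iterative gradient descent on the energy,
    starting from the corrupted sample (as few as one step)."""
    x = x_corrupted.copy()
    for _ in range(steps):
        x = x - step_size * energy_grad(x)
    return x

x_corrupted = rng.standard_normal(64)
x_restored = restore(x_corrupted, steps=1)
# A single gradient step already moves the sample toward lower energy.
assert energy(x_restored) < energy(x_corrupted)
```

In a real implementation the input gradient would come from automatic differentiation through the full network rather than a closed-form expression; the point of the sketch is only that encoding (energy scoring) and restoration share one set of parameters.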
Therefore, the obtained vision model better preserves representation discriminability and avoids knowledge loss or redundancy. Moreover, after folding corrupted-data modeling (the encoder) and original-data restoration (the decoder) into a single network, the proposed framework accepts a broader range of pretext tasks. Specifically, we demonstrate that beyond typical masked image modeling, the framework can be easily extended to learning from patch sorting and learning from image restoration, e.g., super-resolution and image colorization. We demonstrate the effectiveness of the proposed method with extensive experiments on ImageNet-1K. Notably, almost every parameter trained in the self-supervised stage is effectively reused in downstream fine-tuning. We further show that competitive performance can be achieved with only 100 epochs of pretraining on a single 8-GPU machine.
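Since restoration always starts from a corrupted input, each pretext task in the list above reduces to choosing a corruption operator. The sketch below shows hypothetical NumPy versions of four such operators for a (C, H, W) image; the patch size, masking ratio, and function names are assumptions for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(img, patch=4, ratio=0.5):
    """Masked image modeling: zero out a random subset of patches."""
    c, h, w = img.shape
    out = img.copy()
    coords = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    coords = [coords[k] for k in rng.permutation(len(coords))]
    for i, j in coords[: int(len(coords) * ratio)]:
        out[:, i:i + patch, j:j + patch] = 0.0
    return out

def shuffle_patches(img, patch=4):
    """Patch sorting: permute patch positions; restoration must reorder them."""
    c, h, w = img.shape
    blocks = [img[:, i:i + patch, j:j + patch]
              for i in range(0, h, patch) for j in range(0, w, patch)]
    blocks = [blocks[k] for k in rng.permutation(len(blocks))]
    cols = w // patch
    out = np.zeros_like(img)
    for k, blk in enumerate(blocks):
        i, j = divmod(k, cols)
        out[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = blk
    return out

def downsample(img, factor=2):
    """Super-resolution: average-pool, then nearest-neighbor upsample back."""
    c, h, w = img.shape
    small = img.reshape(c, h // factor, factor, w // factor, factor).mean((2, 4))
    return np.repeat(np.repeat(small, factor, axis=1), factor, axis=2)

def grayscale(img):
    """Colorization: replace every channel with the channel mean."""
    return np.broadcast_to(img.mean(axis=0, keepdims=True), img.shape).copy()

img = rng.standard_normal((3, 16, 16))
for corrupt in (mask_patches, shuffle_patches, downsample, grayscale):
    assert corrupt(img).shape == img.shape  # same shape, degraded content
```

Every operator maps an image to a degraded image of the same shape, so the same energy-minimization restoration loop applies unchanged regardless of which pretext task is chosen.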

2. RELATED WORK

Vision model pretraining. Pretraining language Transformers with masked language modeling (Kenton & Toutanova, 2019) has stimulated research on using masked image modeling to pretrain vision models. BEiT (Bao et al., 2021) trains a ViT to predict discrete visual tokens from masked image patches, where the visual tokens are obtained from the latent codes of a discrete VAE (Ramesh et al., 2021). iBOT (Zhou et al., 2022) improves the tokenizer with an online version produced by a teacher network, and learns models through self-distillation. Masked autoencoders (He et al., 2022) adopt an asymmetric encoder-decoder architecture and show that scalable vision learners can be obtained simply by reconstructing the missing pixels. (Wei et al., 2021) empirically studies self-supervised training through predicting the features, instead of the raw pixels, of masked images. Other forms of context information have also been explored for model pretraining, including predicting the relative position of image patches (Doersch et al., 2015), sorting sequential data (Noroozi & Favaro, 2016), training denoising auto-encoders (Vincent et al., 2008), image colorization (Zhang et al., 2016), and image inpainting (Pathak et al., 2016). Similar to metric learning (Hinton, 2002), contrastive self-supervised learning methods learn visual representations by contrasting positive pairs of images against negative pairs. (Wu et al., 2018) adopts noise-contrastive estimation to train networks to perform instance-level classification for feature learning. Recent methods construct positive pairs with data augmentation (Chen et al., 2020b) and obtain pretrained models with high discriminability (Caron et al., 2021). To relax the demand for large batch sizes providing sufficient negative samples, (He et al., 2020; Chen et al., 2020c) exploit supervision from negative pairs stored in memory queues.
It has further been shown that self-supervised learning can even be performed without contrastive pairs (Grill et al., 2020; Chen & He, 2021; Tian et al., 2021).

Energy-based models. The proposed framework for vision model pretraining is inspired by the progress of energy-based models (LeCun et al., 2006). As a family of generative models, EBMs are mainly studied for probabilistic modeling of data (Ngiam et al., 2011; Qiu et al., 2019; Nijkamp et al., 2020; Du & Mordatch, 2019; Du et al., 2020b; Zhao et al., 2020; Xie et al., 2016; 2017; 2020; 2021; Xiao et al., 2020; Arbel et al., 2021) and conditional sampling (Du et al., 2020a; 2021).



(Donahue & Simonyan, 2019) extends unsupervised learning with generative adversarial networks to the learning of discriminative features.

